1) Context

I was recently performing I-V measurements of a MOS
(Metal-Oxide-Semiconductor) structure. A full set of measurements
contained a DC biasing voltage, an AC frequency, and a small-signal
capacitance and conductance. I had to change the configuration of the
measurement device a few times, so the sweep sometimes ran first on
frequency and then on voltage, and sometimes in the reverse order. In
short, I had to deal with many input files whose column order was
inconsistent. The code to identify this order quickly became clumsy.

The idea of a dataframe is to implement a mix between a matrix and a
cell array. It is like a matrix in that each column contains elements
of the same type. Unlike a matrix, the types of different columns may
differ. Also, each column MUST have a name, and rows MAY have a name.
Moreover, to make it easy to interface with databases, each row must
have a unique identifier.

The goal is to make it possible to use constructs like
    y(:, ["Fr*"; "VB*"; "C";"G"])
where y is the dataframe, and column selection is based on regexps.
This way, the translation between names and indexes uses all the power
of regular expressions.

2) Implementation

A dataframe is a class containing the following members:
    _cnt  = [0 0]      : row count, column count, ... nth dimension count
    _name = cell(1, 2) : row names, column names, ...
    _ridx = []         : a unique Id for each row
    _data = cell(0, 0) : a container for each column
    _type = cell(0, 0) : the type of each column

The constructor can be used with:
- no argument: convert the whole workspace to a dataframe (TBD)
- one null argument: return an empty dataframe
- one numeric or cell argument: transform it into a dataframe; the
  constructor tries to infer column names from the name of the input
  argument.
- one char array with more than one line: use it as row names
- one single-line char array: take it as the name of a file to read
  data from. The expected format is CSV. The reader tries to be careful
  with quoted/unquoted strings, and also tries to remove leading and
  trailing spaces from string entries.
  No attempt is made to cope with things such as a separator INSIDE
  quoted strings.
- supplemental arguments may occur either as (string, value) pairs or
  as vectors. In the first case, the string names an optional parameter
  whose value is given by the next argument. In the second case, the
  argument is right-appended to the dataframe.
  Valid optional parameters are:
  - rownames: a character array with the row names
  - unquot: a logical indicating whether strings must be unquoted,
    default=true
  - seeked: a string which must occur in the first row to start
    considering values. Lines before it are skipped.

3) Access (reading)

- like a simple matrix: df(:, 3); df(3, :). If all the results are of
  the same type, a matrix is returned, otherwise a dataframe. This
  behavior can be inhibited by setting the last argument to
  'dataframe': df(3, 3, 'dataframe') will return a one-by-one dataframe
- by column names: df(:, ["Fr*"; "VB*"; "C";]) will try to match a
  column name beginning with "F" followed by an optional 'r', thus 'F',
  'Fréquence' and 'Freqs'; then a column name starting with "V"
  followed by an optional "B", for instance "VBias"; then a column name
  which is exactly the string 'C'.
- by row names: same principle
- either selector may also be logical: df(df.OK=='A', ['C';'G'])
- as a struct: either use one of the column names (df.C), or use one of
  the allowed accessors for internal fields: "rownames", "colnames",
  "rowcnt", "colcnt", "rowidx", "types". Direct access to the members,
  like y._type, is allowed, but should be restricted to class members
  and friends. "types" accepts both numeric and string arguments, the
  latter being converted to column indexes based upon the column names.
- as a cell: TODO: define how to fill the cell array with all the
  fields.

4) Modifying

- as a matrix, using '()': use the same syntax as for reading:
      df(3, 'Fr*') = 200
      df(df.OK=='?', ['C'; 'G']) = NaN;
  Note that removing elements may only occur on a full row or column
  basis. Removing a single element is not allowed.
- as a struct: either access a column name, as in df.C = [], or access
  the internal fields through the entry points 'rownames' and
  'colnames', where care is taken to adapt the string widths in order
  to make them compatible. The entry point "types", with numeric or
  string arguments, casts whole column(s) to a new type:
      df.types{[3 5]} = 'uint16'
      df.types{"Freq"} = "uint32"
- as a cell: TBD

5) Other overloaded functions

display, size, numel, cat. The latter has to be thoroughly tested. In
particular, I've put the restriction that horizontal cat requires the
row indexes to be the same for both elements. For vertical cat, how
should we proceed? Require uniqueness of row indexes, and sorting?
Something else?

6) To be done

- the 'load' function is in fact contained inside the constructor;
  maybe we should have a specific load function?
- be able to load a dataframe from a URI specification
- write a simple 'save' function
- adding data to a dataframe: R doesn't seem to allow adding rows to a
  data.frame; should we follow it?
- add test cases
- implement a 'factor' class for categorised data
- make all functions below statistics/ dataframe compatible

Pascal Dupuis
Louvain-la-Neuve, July First, 2010.
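
As an illustration of the name-based column selection described in
section 3, here is a minimal sketch in plain Octave. It does not use the
dataframe class and is not the package's implementation. The column
names are made up, and the matching rule is an assumption chosen to
reproduce the behaviour described there: a pattern without regexp
metacharacters must match the whole name, while any other pattern only
has to match at the start of the name.

```octave
% Column names as they might appear in one input file (made-up example).
colnames = {'VBias', 'Freq', 'C', 'G', 'Cox'};
% Selection patterns, listed in the order the columns are wanted.
patterns = {'Fr*', 'VB*', 'C', 'G'};

idx = [];
for k = 1:numel(patterns)
  pat = patterns{k};
  if isempty(regexp(pat, '[*+?.\[\]()|^$\\]', 'once'))
    % Assumption: a plain string must match the complete column name.
    expr = ['^' pat '$'];
  else
    % Assumption: a real regexp only has to match at the start of the name.
    expr = ['^' pat];
  end
  % regexp on a cell array returns one (possibly empty) start index per name.
  hits = find(~cellfun(@isempty, regexp(colnames, expr, 'once')));
  idx = [idx, hits];
end

% idx is now [2 1 3 4]: Freq, VBias, C, G. 'Cox' is not selected.
disp(colnames(idx))
```

Whatever the order of the columns in the file, the resulting index
vector always lists frequency, bias, capacitance and conductance in the
requested order, which is the point of selecting by name.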