X-Git-Url: https://git.creatis.insa-lyon.fr/pubgit/?a=blobdiff_plain;ds=sidebyside;f=octave_packages%2Fdataframe-0.9.1%2F%40dataframe%2Frationale.txt;fp=octave_packages%2Fdataframe-0.9.1%2F%40dataframe%2Frationale.txt;h=9274e6bf0e6eb32d47ff90d03a4f3f17884c0e60;hb=f5f7a74bd8a4900f0b797da6783be80e11a68d86;hp=0000000000000000000000000000000000000000;hpb=1705066eceaaea976f010f669ce8e972f3734b05;p=CreaPhase.git

diff --git a/octave_packages/dataframe-0.9.1/@dataframe/rationale.txt b/octave_packages/dataframe-0.9.1/@dataframe/rationale.txt
new file mode 100644
index 0000000..9274e6b
--- /dev/null
+++ b/octave_packages/dataframe-0.9.1/@dataframe/rationale.txt
@@ -0,0 +1,113 @@
+1) Context
+
+I was recently performing I-V measurements of a MOS
+(Metal-Oxide-Semiconductor) structure. A full set of measurements
+contained a DC biasing voltage, an AC frequency, a small-signal
+capacitance and conductance. I had to change the measurement device
+configuration a few times, so sometimes the sweeping occurred
+first on frequency, then on voltage, sometimes in the reverse
+order. To make it short, I had to deal with many input files with
+inconsistent column order. The code to identify this order quickly
+became clumsy.
+
+The idea of a dataframe is to implement a mix between matrix and
+cells. It's like a matrix, where each column contains elements of the
+same type. Unlike a matrix, column types may differ. Also,
+each column MUST have a name, and rows MAY have a name. Moreover, to
+make it easy to interface with databases, each row must have a unique
+identifier. The goal is to make it possible to use constructs like
+y(:, ["Fr*"; "VB*"; "C";"G"])
+where y is the dataframe, and column selection is based on
+regexp. This way, the translation between names and indexes uses all
+the power of regular expressions.
+
+2) Implementation
+A dataframe is a class containing the following members:
+_cnt = [0 0] : row count, column count, ... nth dimension count
+_name = cell(1, 2) : row names, column names, ...
+_ridx = [] : a unique Id for each row
+_data = cell(0, 0) : a container for each column
+_type = cell(0, 0) : the type of each column
+
+The constructor can be used with
+- no argument: converts the whole workspace to a dataframe (TBD)
+- one null argument: returns an empty dataframe
+- one numeric or cell argument: converts it to a dataframe and tries to
+infer column names from the name of the input argument.
+- one char array with more than one line: uses it as rownames
+- one single-line char array: taken as the name of a file to read
+data from. The expected format is CSV; the reader tries to be careful with
+quoted/unquoted strings and to remove leading and trailing
+spaces from string entries. It does not try to cope with things such as
+separators INSIDE quoted strings.
+
+- supplemental arguments may occur either as (string, value) pairs
+  or as vectors. In the first case, the string names an optional
+  parameter whose value is contained in the next argument. In the
+  second case, the argument is right-appended to the dataframe. Valid
+  optional parameters are
+  - rownames: a character array with the row names
+  - unquot: a logical to indicate if strings must be unquoted, default=true
+  - seeked: a string which must occur in the first row to start
+    considering values. Previous lines are skipped.
+
+3) Access (reading)
+- like a single matrix: df(:, 3); df(3, :). If all the results are of
+the same type, a matrix is returned, otherwise a dataframe. This behavior
+can be inhibited by setting the last argument to 'dataframe':
+    df(3, 3, 'dataframe') will return a one-by-one dataframe
+- by column names:
+    df(:, ["Fr*"; "VB*"; "C"])
+  will try to match a column name beginning with "F" followed by an
+  optional 'r', thus 'F', 'Fréquence' and 'Freqs'; then a column name
+  starting with "V" with an optional "B", e.g. "VBias"; then a
+  column name which is exactly the string 'C'.
+- by rownames: same principle
+- either member selector may also be logical:
+    df(df.OK=='A', ['C';'G'])
+- as a struct: either use one of the column names (df.C), or use
+  one of the allowed accessors for internal fields: "rownames",
+  "colnames", "rowcnt", "colcnt", "rowidx", "types". Direct access to
+  the members like y._type is allowed, but should be restricted to
+  class members and friends. "types" accepts both numeric and string
+  arguments, the latter being converted to column order based upon
+  column names.
+- as a cell: TODO: define how to fill the cell array with all the
+  fields.
+
+4) Modifying
+- as a matrix, using '()': use the same syntax as reading:
+    df(3, 'Fr*') = 200
+    df(df.OK=='?', ['C'; 'G']) = NaN;
+  Note that removing elements may only occur on a full row or column
+  basis. Removing a single element is not allowed.
+- as a struct: either access a column name, as
+    df.C = [];
+  or access the internal fields through the entry points 'rownames'
+  and 'colnames', where care is taken to adapt the string widths in
+  order to make them compatible. The entry point "types", with
+  numeric or string arguments, casts whole column(s)
+  to a new type:
+    df.types{[3 5]} = 'uint16'
+    df.types{"Freq"} = "uint32"
+- as a cell: TBD
+
+5) Other overloaded functions: display, size, numel, cat. The latter
+has to be thoroughly tested. In particular, I've put the
+restriction that horizontal cat requires that the row indexes are the
+same for both operands. For vertical cat, how should we proceed? Require
+uniqueness of row indexes, and sorting? Other?
+
+6) To be done:
+- the 'load' function is in fact contained inside the constructor;
+maybe we should have a specific load function?
+- be able to load a dataframe from a URI specification
+- write a simple 'save' function
+- adding data to a dataframe: R doesn't seem to allow adding rows
+to a data.frame, should we follow it?
+- add test cases
+- implement a 'factor' class for categorised data
+- make all functions below statistics/ dataframe compatible
+
+Pascal Dupuis
+Louvain-la-Neuve, July First, 2010.
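
The name-to-index translation motivated in section 1 can be illustrated without the dataframe class at all. The following is only a minimal sketch in plain Octave, not the package's implementation: the helper first_match, the sample column names and the sample values are hypothetical, and it assumes each pattern is anchored at the start of the name only (under that assumption an exact match would need a pattern such as "C$").

```octave
% Column names as they might appear in two measurement files whose
% sweep order differs (hypothetical sample names).
names_a = {"Freq", "VBias", "C", "G"};
names_b = {"VBias", "Freq", "G", "C"};

% Patterns in the order the analysis code wants the columns.
patterns = {"Fr*", "VB*", "C", "G"};

% Hypothetical helper: index of the first name that starts with a match
% of the pattern. Assumption: anchoring at the start only.
first_match = @(names, pat) ...
  find (~cellfun (@isempty, regexp (names, ["^" pat], "once")), 1);

% Translate patterns to column indexes for each file.
idx_a = cellfun (@(p) first_match (names_a, p), patterns);
idx_b = cellfun (@(p) first_match (names_b, p), patterns);

% One row of made-up values laid out in the column order of file b.
data_b = [0.5, 1e3, 2e-6, 1e-12];

% Reorder into the canonical order Freq, VBias, C, G.
reordered = data_b(:, idx_b);
```

With this, the same downstream code can consume both files, because the column order is resolved from the names instead of being hard-coded per file.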