octave_packages/dataframe-0.9.1/@dataframe/rationale.txt

1) Context

I was recently performing I-V measurements of a MOS
(Metal-Oxide-Semiconductor) structure. A full set of measurements
contained a DC biasing voltage, an AC frequency, a small-signal
capacitance and a conductance. I had to change the configuration of
the measurement device a few times, so the sweep sometimes occurred
first on frequency, then on voltage, and sometimes in the reverse
order. In short, I had to deal with many input files with
inconsistent column order. The code to identify this order quickly
became clumsy.

The idea of a dataframe is to implement a mix between a matrix and a
cell array. It is like a matrix in that each column contains elements
of the same type. Unlike a matrix, the types of different columns may
differ. Also, each column MUST have a name, and rows MAY have a name.
Moreover, to make it easy to interface with databases, each row must
have a unique identifier. The goal is to make it possible to use
constructs like
  y(:, ["Fr*"; "VB*"; "C"; "G"])
where y is the dataframe, and column selection is based on regular
expressions. This way, the translation between names and indexes uses
all the power of regular expressions.
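The translation from name patterns to column indexes can be sketched
in plain Octave, without the dataframe class. This is only an
illustration: the column names are made up, and anchoring each
pattern at the start of the name is an assumption about how the
matching is meant to work.

```octave
% Illustrative column names, in the order found in one input file.
colnames = {"VBias", "G", "Freq", "C"};
% Patterns, in the order the caller wants the columns.
patterns = {"Fr*", "VB*", "C", "G"};

idx = [];
for i = 1:numel (patterns)
  % Assumption: each pattern is anchored at the start of the name.
  pat = ["^" patterns{i}];
  hit = cellfun (@(s) !isempty (regexp (s, pat, "once")), colnames);
  idx = [idx, find(hit)];
endfor
% idx is now [3 1 4 2]: the file's columns reordered as requested.
```

Whatever the column order in the input file, the same list of
patterns yields the indexes needed to put the columns in a known
order.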

2) Implementation
A dataframe is a class containing the following members:
_cnt = [0 0]       : row count, column count, ... nth dimension count
_name = cell(1, 2) : row names, column names, ...
_ridx = []         : a unique Id for each row
_data = cell(0, 0) : a container for each column
_type = cell(0, 0) : the type of each column
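As a rough picture of how these members fit together, the sketch
below mimics them with an ordinary struct. It is not the class
itself: the field names drop the leading underscore, and the two
example columns ("Freq" and "OK") are invented for illustration.

```octave
% Sketch only: the real class stores these as members named
% _cnt, _name, _ridx, _data and _type.
df = struct ();
df.cnt  = [0 0];        % row count, column count
df.name = cell (1, 2);  % {row names, column names}
df.ridx = [];           % unique id for each row
df.data = cell (0, 0);  % one container per column
df.type = cell (0, 0);  % type of each column

% Fill in two columns of different types.
df.data = {[1e3; 1e4; 1e5], uint8([1; 0; 1])};
df.type = cellfun (@class, df.data, "UniformOutput", false);
df.name{2} = {"Freq", "OK"};
df.ridx = (1:3)';
df.cnt  = [numel(df.ridx), numel(df.data)];
```

Keeping each column in its own container is what lets one column be
double while its neighbour is uint8.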

The constructor can be used with
- no argument: convert the whole workspace to a dataframe (TBD)
- one null argument: return an empty dataframe
- one numeric or cell argument: transform it into a dataframe; the
  constructor tries to infer column names from the name of the input
  argument.
- one char array with more than one line: use it as row names
- one single-line char array: take it as the name of a file to read
  data from. The expected format is CSV. The reader tries to be
  careful with quoted/unquoted strings, and also tries to remove
  leading and trailing spaces from string entries. It does not try to
  cope with things such as a separator INSIDE a quoted string.

- supplemental arguments may occur either as (string, value) pairs
  or as vectors. In the first case, the string names an optional
  parameter whose value is contained in the next argument. In the
  second case, the argument is right-appended to the dataframe. Valid
  optional parameters are
  - rownames: a character array with the row names
  - unquot: a logical indicating whether strings must be unquoted,
    default=true
  - seeked: a string which must occur in the first row to start
    considering values. Previous lines are skipped.
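The CSV handling described above might look roughly like the
following plain-Octave sketch. It is not the constructor's actual
code. The sample lines are invented, and the reading of 'seeked'
used here (start at the first line that has the string as one of its
fields) is an assumption.

```octave
% Naive CSV handling: split on commas, trim spaces, optionally strip
% surrounding double quotes. Separators inside quoted strings are
% NOT handled.
lines  = {"# instrument banner", "Freq, VBias ,\"C\"", "1000, 0.5, 1e-12"};
seeked = "Freq";   % assumed meaning: skip lines until one holds this field
unquot = true;

rows = {};
started = false;
for i = 1:numel (lines)
  fields = strtrim (strsplit (lines{i}, ","));
  if (unquot)
    fields = regexprep (fields, '^"(.*)"$', '$1');
  endif
  if (!started)
    started = any (strcmp (fields, seeked));
  endif
  if (started)
    rows = [rows; fields];   % one row of string fields per line
  endif
endfor
```

Here the banner line is skipped, and rows ends up as a 2-by-3 cell
array of strings whose first row holds the column names.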

3) Access (reading)
- like a single matrix: df(:, 3); df(3, :). If all the results are of
  the same type, a matrix is returned, otherwise a dataframe. This
  behavior can be inhibited by setting the last argument to
  'dataframe':
    df(3, 3, 'dataframe') will return a one-by-one dataframe
- by column names:
    df(:, ["Fr*"; "VB*"; "C"])
  will try to match a column name beginning with "F" followed by zero
  or more "r", thus 'F', 'Fréquence' and 'Freqs'; then a column name
  starting with "V" followed by zero or more "B", for instance
  "VBias"; then a column name which is the exact string 'C'.
- by row names: same principle
- either member selector may also be logical:
    df(df.OK=='A', ['C';'G'])
- as a struct: either use one of the column names (df.C), or use one
  of the allowed accessors for internal fields: "rownames",
  "colnames", "rowcnt", "colcnt", "rowidx", "types". Direct access to
  the members, like y._type, is allowed, but should be restricted to
  class members and friends. "types" accepts both numeric and string
  arguments, the latter being converted to column order based upon
  column names.
- as a cell: TODO: define how to fill the cell array with all the
  fields.
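The rule "a matrix if all results share a type, otherwise a
dataframe", combined with a logical row selector, can be illustrated
in plain Octave. The column contents and names below are invented,
and a cell array stands in for the dataframe that the class would
return in the mixed-type case.

```octave
% Columns held separately, as in a dataframe; values are illustrative.
data  = {[1e3; 1e4; 1e5], [1e-12; 2e-12; 3e-12], ["A"; "?"; "A"]};
names = {"Freq", "C", "OK"};

% Logical row selector, in the spirit of df(df.OK=='A', ...).
rows = (data{3} == "A");

% Select two columns and collapse to a matrix if their types agree.
cols   = [1 2];
picked = cellfun (@(c) c(rows), data(cols), "UniformOutput", false);
types  = cellfun (@class, picked, "UniformOutput", false);
if (numel (unique (types)) == 1)
  result = cell2mat (picked);   % same type: plain 2-by-2 matrix
else
  result = picked;              % mixed types: keep columns separate
endif
```

Selecting columns 1 and 3 instead would mix double and char, so the
columns would have to stay separate.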

4) Modifying
- as a matrix, using '()': use the same syntax as for reading:
    df(3, 'Fr*') = 200
    df(df.OK=='?', ['C'; 'G']) = NaN;
  Note that elements may only be removed a full row or a full column
  at a time. Removing a single element is not allowed.
- as a struct: either access a column name, as in
    df.C = [];
  or access the internal fields through the entry points 'rownames'
  and 'colnames', where care is taken to adapt the string widths in
  order to make them compatible. The entry point "types", with
  numeric or string arguments, casts whole column(s) to a new type:
    df.types{[3 5]} = 'uint16'
    df.types{"Freq"} = "uint32"
- as a cell: TBD
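The effect of casting a column and of removing a full row can be
shown on bare columns in plain Octave. This only mimics what the
"types" entry point and row removal are described as doing; the
values are invented.

```octave
% Two columns of three rows each.
data = {[1000; 20000; 70000], [0.5; 1.5; 2.5]};

% Cast a whole column to a new type. Octave integer conversion
% saturates, so 70000 becomes 65535 in uint16.
data{1} = cast (data{1}, "uint16");

% Remove row 2 from every column. Removing a single element would
% leave columns of unequal length, hence the full-row restriction.
keep = [true; false; true];
data = cellfun (@(c) c(keep), data, "UniformOutput", false);
```

The saturation is worth keeping in mind: a cast to a narrower
integer type silently clips out-of-range values.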

5) Other overloaded functions: display, size, numel, cat. The latter
has to be thoroughly tested. In particular, I have imposed the
restriction that horizontal cat requires the row indexes to be the
same for both operands. For vertical cat, how should we proceed?
Require uniqueness of row indexes, and sorting? Something else?
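The horizontal cat restriction amounts to a check like the one
below, written with plain structs whose field names (ridx, data) are
stand-ins for the class members. The last two lines sketch one of
the options raised for vertical cat, namely requiring unique row
indexes; that is a possibility, not a decision.

```octave
a.ridx = [1; 2; 3];  a.data = {[10; 20; 30]};
b.ridx = [1; 2; 3];  b.data = {[0.1; 0.2; 0.3]};

% Horizontal cat: both operands must carry the same row indexes.
if (!isequal (a.ridx, b.ridx))
  error ("horizontal cat: row indexes differ");
endif
c.ridx = a.ridx;
c.data = [a.data, b.data];   % columns of b appended to those of a

% Vertical cat, one possible rule: stacked row indexes stay unique.
v_ridx = [a.ridx; [4; 5]];
v_ok   = (numel (unique (v_ridx)) == numel (v_ridx));
```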

6) To be done:
- the 'load' function is in fact contained inside the constructor;
  maybe we should have a specific load function?
- be able to load a dataframe from a URI specification
- write a simple 'save' function
- adding data to a dataframe: R doesn't seem to allow adding rows
  to a data.frame; should we follow it?
- add test cases
- implement a 'factor' class for categorised data
- make all functions below statistics/ dataframe compatible

Pascal Dupuis
Louvain-la-Neuve, July 1st, 2010.