1) Context

I was recently performing I-V measurements of a MOS
(Metal-Oxide-Semiconductor) structure. A full set of measurements
contained a DC biaising voltage, a AC frequency, a small signal
capacitance and conductance. I had to change a few times the
measurement device configuration, so sometimes the sweeping occured
first on frequency, then on voltage, sometimes in the reverse
order. To make it short, I had to deal with many input files with
inconsistent columns order. The code to identify this order quickly
became clumsy.

The idea of a dataframe is to implement a mix between matrix and
cells. Its' like a matrix, where each column contains elements of the
same type. Unlike a matrix, columns type may be dissimilar. Also,
each colum MUST have a name, and rows MAY have a name. Moreover, to
make it easy to interface with databases, each row must have an unique
identifier. The goal is to make possible to use constructs like
y(:, ["Fr*"; "VB*"; "C";"G"])
where y is the dataframe, and column selection is based on
regexp. This way, the translation between names and indexes uses all
the power of regexpes.

2) Implementation
a dataframe is a class containing the following members:
_cnt = [0 0] : row count, column count, ... nth dimension count
_name = cell(1, 2) : row names, column names, ...
_ridx = []  : a unique Id for each row
_data = cell(0, 0) : a container for each column
_type = cell(0, 0) : the type of each column

The constructor can be used as
- no argument: convert the whole workspace to a dataframe (TBD)
- one null argument: return an empty dataframe
- one numeric or cell argument: transform it to a dataframe; tries to
infer column names from the name of the input argument.
- one char array with more than one line: uses it as rownames
- one single line char array: take it as the name of a file to read
data from. Expected format is csv, try to be carefull with
quoted/unquoted strings, also tries to remove trailing and leading
spaces from string entries. Do not try to cope with things such as
separator INSIDE quoted strings.

-supplemental arguments may occur either as pairs (string, value),
 either as vectors. In the first case, the string contains an optional
 parameter whose value is contained in the next argument. In the
 second case,  the argument is right-appended to the dataframe. Valid
 optional parameters are
 - rownames: a character array with the row names
 - unquot: a logical to indicate if strings must be unquoted, default=true
 - seeked: a string which must occur in the first row to start
 considering values. Previous lines are skipped.

3) Access (reading)
- like a single matrix: df(:, 3); df(3, :). If all the results are of
the same type, returns a matrix, otherwise a dataframe. This behavior
can be inhibited by having the last argument set to 'dataframe':
  df(3, 3, 'dataframe') will return a one-by-one dataframe
- by columnames:  
  df(:, ["Fr*"; "VB*"; "C";])
  will try to match a columname beginning by "F" followed by an
  optional 'r', thus 'F', 'Fréquence' and 'Freqs'; then a columname
  starting by "V" with an optional "B", like f.i. "VBias", then a
  columname with is the exact string 'C'.
- by rownames: same principle
- either member selector may also be logical: 
    df(df.OK=='A', ['C';'G']) 
- as a struct: either use one of the column name (df.C), either use
  one of the allowed accessor for internal fields: "rownames",
  "colnames", "rowcnt", "colcnt", "rowidx", "types". Direct access to
  the members like y._type is allowed, but should be restricted to
  class members and friends. "types" accept both numeric and strings
  arguments, the latter being converter to column order based upon
  columns name.
- as a cell: TODO: define how to fill the cell array with all the
  fields.

4) Modifying
- as a matrix, using '()': use the same syntax as reading:
  df(3, 'Fr*') = 200
  df(df.OK=='?', ['C'; 'G']) = NaN;
  Note that removing elements may only occur on a full row of colum
  basis. Removing a single element is not allowed.
- as a struct: either access a columname, as 
  df.C = [];   
  either accessing the internal fields through entry points 'rownames'
  and 'colnames', where care is taken to adapt the strings width in
  order to make them compatibles. The entry point "types", with
  arguments numeric or strings, has the effect to cast whole column(s)
  to a new type:
  df.types{[3 5]} = 'uint16'
  df.type{"Freq"} = "uint32"
- as a cell: TBD

5) other overloaded functions: display, size, numel, cat. The latter
has to be thoroughfully tested. In particular, I've put the
restriction that horizontal cat requires that the row indexes are the
same for both elems. For vertical cat, how should we proceed ? Require
uniqueness of row indexes, and sorting ? Other ?

6) to be done:
- the 'load' function is in fact contained inside the constructor;
maybe we should have a specific load function ?
- be able to load a dataframe from a URI specification
- write a simple 'save' function
- adding data to a dataframe: R doesn't seems to allow adding rows
to a data.frame, should we follow it ?
- add test cases
- implement a 'factor' class for categorised data
- make all functions below statistics/ dataframe compatible

Pascal Dupuis
Louvain-la-Neuve, July First, 2010.