
NAME

UCS::DS::Memory - In-memory representation of data sets

SYNOPSIS

  use UCS::DS::Memory;

  $ds = new UCS::DS::Memory;            # empty data set
  $ds = new UCS::DS::Memory $filename;  # read from file (using UCS::DS::Stream)

  # access & edit variables, comments, and globals with UCS::DS methods

  $pairs = $ds->size;                   # number of pair types
  $ds->set_size($pairs);                # truncate or extend data set

  $value = $ds->cell($var, $n);         # read entry from data set table
  $ds->set_cell($var, $n, $value);      # set entry in data set table

  $rowdata = $ds->row($n);              # returns hashref (varname => value)
  $ds->set_row($n, $rowdata);           # set row data (ignores missing vars)
  $ds->set_row($n, "f1"=>$f1, "f2"=>$f2, ...);
  $ds->append_row($rowdata);            # append row to data set
  $ds->delete_rows($from, $to);         # delete a range of rows from the data set

  $vector = $ds->column($var);          # reference to data vector of $var
  $vector->[$n] = $value;               # fast direct access to cells

  $ds->eval($var, $exp)                 # evaluate expression on data set & store in $var
    unless $ds->missing($exp);          # check first whether all reqd. variables are available
  $ds->add($var);                       # auto-compute variable (derived variable or registered AM)

  $stats = $ds->summary($var);          # statistical summary of numerical variable

  $ds->where($idx, $exp);               # define index: rows matching UCS expression
  $n = $ds->count($exp);                # number of rows matching expression
  $vector = $ds->index($idx);           # returns reference to array of row numbers
  $ds->make_index($idx, $row1, $row2, ...);  # define index: explicit list of row numbers
  $ds->make_index($idx, $vector);            #            or array reference (will be duplicated)
  $ds->activate_index($idx);            # activate index (will be used by most access methods)
  $ds->activate_index();                # de-activate index
  $ds->delete_index($idx);              # delete index

  $ds2 = $ds->copy;                     # make physical copy of data set (using index if activated)
  $ds2 = $ds->copy("*", "am.%");        # copy selected variables only (in specified order)

  $ds->renumber;                        # renumber/add ID values as increasing sequence 1 .. size

  $ds->sort($idx, $var1, $var2, ...);   # sort data set on $var1, breaking ties by $var2 etc.
  $ds->sort($idx, "-$var1", "+$var2");  # - = descending, + = ascending (default depends on variable type)
  $ds->rank($ranking, $key1, ...);      # compute ranking (with ties) and store in data set variable $ranking

  $ds->save($filename);                 # save data set to file (using index if activated)

  $dict = $ds->dict($var1, $var2, ...); # lookup hash for variable(s) (UCS::DS::Memory::Dict object)
  ($max, $average) = $dict->multiplicity; # maximum / average number of rows for each key
  if ($dict->unique) { ... }            # whether every key identifies a unique row
  @rows = $dict->lookup($x1, $x2, ...); # look up key in dictionary, returns all matching rows
  $row  = $dict->lookup($x1, $x2, ...); # in scalar context, returns first matching row
  @rows = $dict->lookup($other_ds, $n); # look up row $n from other data set
  $n_rows = $dict->multiplicity($x1, $x2, ...);  # takes same arguments as lookup()
  @keys = $dict->keys;                  # return unsorted list of keys entered in dictionary

DESCRIPTION

This module implements an in-memory representation of UCS data sets. When a data set file has been loaded into a UCS::DS::Memory object (or a new empty data set has been created), then variable names, comments, and globals can be accessed and modified with the respective UCS::DS methods (see the UCS::DS manpage).

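Additional methods in the UCS::DS::Memory class allow the user to read and modify the data set table (individual cells, rows, and columns), compute new variables from UCS expressions, select and reorder rows with named row indices, sort and rank pair types, copy and save data sets, and build lookup dictionaries.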

The individual methods are detailed in the following sections. In all methods, columns are identified by the respective variable names, whereas rows (corresponding to pair types) are identified by row numbers. NB: Row numbers start with 1 (like R vectors, but unlike Perl arrays)!

GENERAL METHODS

$ds = new UCS::DS::Memory;

Create an empty data set. The new data set has zero rows and no variables. Returns an object of class UCS::DS::Memory.

$ds = new UCS::DS::Memory $file [, '-na'];

Reads a data set file into memory and returns a UCS::DS::Memory object. The argument $file is either a string giving the name of the data set file or a UCS::DS::Stream::Read object (see the UCS::DS::Stream manpage), which has been opened but not read from. If the specified file does not exist, or if a read error occurs, the constructor dies with an appropriate error message.

The option '-na' disables missing value support (which is enabled by default), so that NA values in the data set file will be replaced by 0 or the empty string, depending on the data type. Use '+na' to enable missing value support explicitly.

$V = $ds->size;

Returns the size of the data set, i.e. the number of rows (or pair types).

$ds->set_size($V);

Change the size of the data set to $V rows. This method can both truncate and extend a data set. NB: Unlike the size method, set_size always applies to the real size of the data set and ignores the active row index. However, all row indices are preserved and adjusted in case of a truncation. If there is an active row index, it remains active. (See the section "ROW INDEX METHODS" below for more information on row indices.)

$value = $ds->cell($var, $n);

Retrieve the value of variable $var for row $n (i.e. the $n-th pair type). This method is convenient and performs various error checks, but it involves a considerable amount of overhead. Consider the column method when performance is an issue.

$ds->set_cell($var, $n, $value);

Set the value of variable $var for row $n to $value. Like cell, this method is convenient, but comparatively slow. Consider the column method when performance is an issue.

$rowdata = $ds->row($n);

Returns hash reference containing the entire data from row $n indexed by variable names. This method is inefficient and mainly for convenience, e.g. when applying a UCS expression to individual rows (cf. the description of the eval method in the UCS::Expression manpage).

$ds->set_row($n, $rowdata);
$ds->set_row($n, $var1 => $val1, $var2 => $val2, ...);

Set the values of some or all variables for row $n. The values can either be passed in a single hash reference indexed by variable names, or as $var => $value pairs. Any variables that do not exist in the data set $ds are silently ignored. This method is faster than calling set_cell repeatedly, especially when a new row is added to the data set.

$ds->append_row($rowdata);
$ds->append_row($var1 => $val1, $var2 => $val2, ...);

Append new row to the data set and fill it with the specified values. This method is a combination of set_size and set_row. Variable values that are not specified in the argument list are set to undef. When there is an active row index, the new row is appended to this index, while all other indices remain unchanged (see the section "ROW INDEX METHODS" below for more information on row indices).

$ds->delete_rows($from, $to);

Delete rows $from through $to from the data set. NB: This method always applies to the real row numbers and ignores the active row index. All existing indices are adjusted (which is an expensive operation) and an active row index remains activated. (See the section "ROW INDEX METHODS" below for more information on row indices.)

$vector = $ds->column($var);

Returns an array reference to the data vector of variable $var. $vector can be used both for read and write access, so care has to be taken that the data set isn't accidentally modified (e.g. through side effects of a map or grep operation on @$vector). Of course, activating a row index has no effect, since the column method gives direct access to the internal data structures. (See the section "ROW INDEX METHODS" below for more information on row indices.)
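The warning about side effects can be illustrated without the UCS toolkit. In the minimal sketch below, a plain array reference stands in for the vector returned by column (the data are made up for illustration). Because map and grep alias $_ to the elements of the array, assigning to $_ overwrites the stored values.

```perl
use strict;
use warnings;

# Stand-in for a data vector as returned by $ds->column($var) (hypothetical data).
my $vector = [3, 1, 4];

# Safe: the expression only reads $_, so the vector is left untouched.
my @doubled    = map { $_ * 2 } @$vector;
my @after_safe = @$vector;

# Unsafe: $_ is an alias for each element, so assigning to it
# silently overwrites the values stored in the vector.
my @clobbered = map { $_ *= 2 } @$vector;

print "doubled:      @doubled\n";
print "after safe:   @after_safe\n";
print "after unsafe: @$vector\n";
```

Working on a copy (my @copy = @$vector;) avoids the problem when the original values must be preserved.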

@missing_vars = $ds->missing($exp);

Determines whether all variables required to evaluate the UCS expression $exp (an object of class UCS::Expression) are defined in the data set $ds. Returns an empty list if $exp can be evaluated, and the names of missing variables otherwise.

$ds->eval($var, $exp);

Evaluate the UCS expression $exp (an object of class UCS::Expression) on the data set $ds, and store its values in the variable $var. When $var is a new variable, it is automatically added to the data set; otherwise, the previous values are overwritten. This operation is much faster than repeatedly evaluating $exp for each row. For convenience, $exp can also be specified as a source string, which will be compiled on the fly. NB: The eval method always operates on the entire data set, even when a row index is activated. (See the section "ROW INDEX METHODS" below for more information on row indices.)

$ds->add($var);

Add a new variable to the data set and auto-compute its values, or overwrite an existing variable. $var must be the name of a derived variable such as E11 or an association score such as am.t.score (see the ucsfile manpage for details).

$stats = $ds->summary($var);

Computes a statistical summary of the numerical variable $var (a numerical variable is a variable of data type INT or DOUBLE). $stats is a hash reference representing a data structure with the following fields:

  MIN     ...  minimum value
  MAX     ...  maximum value
  ABSMIN  ...  smallest non-zero absolute value
  ABSMAX  ...  largest absolute value
  SUM     ...  sum of all values
  MEAN    ...  mean (= average)
  MEDIAN  ...  median (= 50% quantile)
  VAR     ...  empirical variance
  SD      ...  empirical standard deviation (sq. root of variance)
  STEP    ...  smallest non-zero difference between any two values
  NA      ...  number of missing values (undef's)

Note that some of these fields may be undef if they have no meaningful value for the given data set.
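To make some of these fields concrete, here is a minimal pure-Perl sketch that computes a subset of them (NA, MIN, MAX, SUM, MEAN, MEDIAN, VAR, SD) for a plain list of numbers. The helper name summarise is hypothetical, and the variance uses the denominator n - 1, which is an assumption: this manpage does not say which definition the summary method uses. The sketch also shows why fields can be undef, e.g. the variance of a single value.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper: summarise a list of numbers; undef marks a missing value.
sub summarise {
    my @all    = @_;
    my @values = sort { $a <=> $b } grep { defined $_ } @all;
    my $n      = scalar @values;
    my %stats  = (NA => scalar(@all) - $n);
    return \%stats unless $n;    # no data: all other fields stay undefined

    $stats{MIN}  = $values[0];
    $stats{MAX}  = $values[-1];
    $stats{SUM}  = sum(@values);
    $stats{MEAN} = $stats{SUM} / $n;

    # median: middle value, or mean of the two middle values for even $n
    $stats{MEDIAN} = $n % 2
        ? $values[($n - 1) / 2]
        : ($values[$n / 2 - 1] + $values[$n / 2]) / 2;

    # assumption: denominator n - 1, so VAR and SD are undefined for one value
    if ($n > 1) {
        $stats{VAR} = sum(map { ($_ - $stats{MEAN}) ** 2 } @values) / ($n - 1);
        $stats{SD}  = sqrt $stats{VAR};
    }
    return \%stats;
}

my $stats = summarise(2, 4, undef, 4, 6);
printf "mean=%g median=%g var=%g NA=%d\n", @$stats{qw(MEAN MEDIAN VAR NA)};
```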

$ds2 = $ds->copy;
$ds2 = $ds->copy(@variables);

Duplicates a data set, so that $ds2 is completely independent from $ds (whereas $ds2 = $ds; would just give another handle on the same data set). Comments and globals are copied to $ds2 as well. Optionally, a list of variable names and/or wildcard patterns (see the ucsexp manpage) can be specified. In this case, only the selected columns will be copied. NB: If there is an active row index, the copy will only include the rows selected by the index, and they will be arranged in the corresponding order. However, no row indices are copied to $ds2. (See the section "ROW INDEX METHODS" below for more information on row indices.)

$ds->renumber;

When rows have been deleted from a data set, or a copy has been made with an active row index, the values of the id variable are preserved (and can be used to match rows against the corresponding entries in the original data set). When an independent numbering is desired, the renumber method can be used to re-compute the id values so that they form an uninterrupted sequence starting from 1. NB: The renumbering ignores an activated row index.

$ds->save($filename);
$ds->save($filename, @variables);

This method saves the contents of $ds to a UCS data set file $filename. When an optional list of variable names and/or wildcard patterns (see the ucsexp manpage) is specified, only the selected columns will be saved. NB: If there is an active row index, only the rows selected by the index will be written to $filename, and they will be arranged in the corresponding order. The row indices themselves cannot be stored in a data set file. (See the section "ROW INDEX METHODS" below for more information on row indices.) Also note that temporary variables will not be saved (see the UCS::DS manpage).

ROW INDEX METHODS

A row index is an array reference containing a list of row numbers (starting from 1, unlike Perl arrays). Row indices are used to select rows from an in-memory data set, or to represent a re-ordering of the rows (or both). They are usually created by the where and sort methods, but can also be constructed explicitly. An arbitrary number of named row indices can be stored in a UCS::DS::Memory object.

A row index can be activated, creating a "virtual" data set containing only the rows selected by the index, arranged in the corresponding order. Most UCS::DS::Memory methods will then operate on this virtual data set. All exceptions are marked clearly in this manpage. In particular, the where method selects a subset of the activated index, and sort can be used to reorder it. There can only be one active row index at a time. There is no way of localising the activation (so that a previously active index is restored at the end of a block), so it is highly recommended to use active indices only locally and de-activate them afterwards.
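The idea of a "virtual" data set can be sketched in plain Perl. The example below is a conceptual illustration only, not the module's implementation: the hash %data, the accessor virtual_cell, and the storage of row $n at array offset $n - 1 are all assumptions made for the sketch.

```perl
use strict;
use warnings;

# Hypothetical column vectors of a tiny data set (one array per variable).
my %data = (
    l1 => [qw(black strong hot heavy)],
    f  => [12, 7, 30, 7],
);

# A row index is just a list of row numbers, counted from 1.
my @index = (3, 1);    # select rows 3 and 1, in that order

# Hypothetical accessor: read a cell of the virtual data set defined by
# an index, translating the virtual row number into a real one.
sub virtual_cell {
    my ($var, $n, $index) = @_;
    my $real = $index->[$n - 1];      # virtual row $n -> real row number
    return $data{$var}[$real - 1];    # real row number -> Perl array offset
}

my $virtual_size = scalar @index;     # effective size while the index is active
print "size: $virtual_size\n";
print join(" ", map { virtual_cell("l1", $_, \@index) } 1 .. $virtual_size), "\n";
```

With the index (3, 1), the virtual data set has two rows, and its first row is the third row of the underlying table.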

Index names must be valid UCS identifiers, i.e. they may only contain alphanumeric characters (A-Z a-z 0-9) and periods (.) (cf. "VARIABLES" in ucsfile). Note that index names beginning with a period are reserved for internal use.

$ds->make_index($idx, $row1, $row2, ...);
$ds->make_index($idx, $vector);

Construct row index from a list of row numbers or an array reference $vector, and store it under the name $idx in the data set $ds. In the second form, the anonymous array is duplicated, so the contents of $vector can be modified or destroyed without affecting the stored row index.

$vector = $ds->index($idx);

Retrieve row index by name. Returns an array reference to the internal data, so be careful not to modify the contents of $vector accidentally. In most cases, it is easier to activate $idx and use the normal access methods.

$ds->delete_index($idx);

Delete the row index named $idx. If it happens to be activated, it will automatically be de-activated.

$ds->activate_index($idx);

Activate row index $idx. This will clear any previous activations. Note that this operation may change the effective size of the data set as returned by the size method (unless $idx is just a sort index).

$ds->activate_index();

Deactivate the currently active index, re-enabling direct access to the full data set in its original order.

$ds->where($idx, $exp);

Construct the row index $idx selecting all rows for which the UCS expression $exp (given as a UCS::Expression object) evaluates to true (see the ucsexp manpage for an introduction to UCS expressions, and the UCS::Expression manpage for compilation instructions). It is often convenient to compile $exp on the fly, especially when it is a simple condition, e.g.

  $ds->where("high.freq", new UCS::Expression '%f% >= 10');

which can be shortened to

  $ds->where("high.freq", '%f% >= 10');

The where method will automatically compile the source string passed as $exp into a UCS::Expression object. On-the-fly compilation involves only moderate overhead. When there is an active row index, where will select a subset of this index, preserving its ordering.
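Selecting a subset of an active index while preserving its ordering amounts to filtering the list of row numbers. The following pure-Perl sketch shows the idea with an ordinary grep; the frequency column @f and the index @active are made-up data, and the condition corresponds to the UCS expression '%f% >= 10' used above.

```perl
use strict;
use warnings;

my @f      = (12, 7, 30, 7, 15);    # hypothetical frequency column (rows 1 .. 5)
my @active = (3, 5, 1, 2, 4);       # active index: rows ordered by descending f

# Keep only the rows of the active index whose frequency is at least 10.
# grep preserves the order of its input, so the result is still sorted.
my @high_freq = grep { $f[$_ - 1] >= 10 } @active;

print "@high_freq\n";
```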

$n = $ds->count($exp);

Similar to where, but this method only counts the number of rows matching the UCS expression $exp and does not create a named index. The condition $exp may be given either as a UCS::Expression object or as a source string, which is compiled on the fly. (Internally, the rows are collected in a temporary index, which is automatically deleted when the method call returns.)

$ds->sort($idx, $key1, $key2, ...);

Sort data set $ds by the specified sort keys. The data set is first sorted by $key1. Ties are then broken by $key2, any remaining ties by $key3, etc. If there are any ties left when all sort keys have been used, their ordering is undefined (and depends on the implementation of the sort function in Perl). The resulting ordering is stored in a row index with the name $idx. When there is an active row index, sort will re-order the rows selected by this index.

Each sort key consists of a variable name, optionally preceded or followed by a + or - character to select ascending or descending sort order, respectively. The default order is descending for Boolean variables and association scores, and ascending for all other variables. The sort keys 'l1' and 'l2' sort in alphabetical order, while 'f-' puts the most frequent pair types first.
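A sort index with tie-breaking can be sketched in plain Perl as a sorted list of row numbers. This is an illustration of the concept under assumed data, not the module's code: it mimics the keys 'f-' and 'l1', i.e. descending frequency with ties broken alphabetically by l1.

```perl
use strict;
use warnings;

# Hypothetical column vectors (row $n is stored at offset $n - 1).
my %data = (
    l1 => [qw(strong black hot heavy)],
    f  => [7, 12, 30, 7],
);
my $size = scalar @{ $data{f} };

# Sort index: row numbers 1 .. size, ordered by f (descending),
# breaking ties by l1 (ascending alphabetical order).
my @by_freq = sort {
    $data{f}[$b - 1] <=> $data{f}[$a - 1]
        or
    $data{l1}[$a - 1] cmp $data{l1}[$b - 1]
} 1 .. $size;

print "@by_freq\n";
```

Rows 1 and 4 share the frequency 7, so the second comparison decides their order ("heavy" sorts before "strong").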

In order to break remaining ties randomly, an appropriate additional sort key has to be specified. The usual choice is the association scores of the random measure (see the UCS::AM manpage). It may be necessary to compute this measure first, which can be conveniently done with the add method, as shown in the example below.

  # order pair types by frequency (descending), breaking ties randomly
  if (not $ds->var("am.random")) {
    $ds->add("am.random");
    $ds->temporary("am.random", 1);  # temporary, don't save to disk
  }
  $ds->sort("by.freq", "f-", "am.random");

$ds->rank($ranking, $key1, $key2, ...);

The rank method is similar to sort, but creates a ranking instead of a sort index. The ranking is stored in the integer variable $ranking. Note that tied rows are assigned the same rank, which is the lowest available rank (as in the Olympic Games) rather than the average of all ranks in the group (as is often done in statistics). All other remarks about the sort method apply equally well to the rank method, especially those concerning randomisation.
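The ranking scheme described here (tied rows share the lowest available rank) can be sketched in a few lines of plain Perl. The scores are made-up data and the code is a conceptual illustration, not taken from the module.

```perl
use strict;
use warnings;

# Hypothetical scores, one per row; higher scores rank first.
my @score = (5.0, 9.1, 5.0, 7.3, 5.0);

# Row numbers (from 1) ordered by descending score.
my @order = sort { $score[$b - 1] <=> $score[$a - 1] } 1 .. scalar(@score);

# Assign ranks: a row tied with its predecessor shares that rank;
# otherwise its rank is its position in the ordering.
my @rank;    # $rank[$row - 1] holds the rank of row $row
for my $pos (1 .. scalar(@order)) {
    my $row  = $order[$pos - 1];
    my $prev = $pos > 1 ? $order[$pos - 2] : undef;
    $rank[$row - 1] =
        (defined $prev and $score[$prev - 1] == $score[$row - 1])
        ? $rank[$prev - 1]
        : $pos;
}

print "@rank\n";
```

The three rows with score 5.0 all receive rank 3. Averaging the tied ranks, as is common in statistics, would give each of them rank 4 instead.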

DICTIONARIES (LOOKUP HASHES)

A data set dictionary is a hash structure listing all the different values that a given variable assumes in the data set (or all the different value combinations of several variables). For each value (or value combination), which is called a key of the dictionary, the corresponding row numbers in the data set can be retrieved (called a lookup of the key). In the terminology of relational databases, such a dictionary is referred to as an index. Be careful not to confuse this notion with the row index described above, which is used for subsetting and/or reordering the rows of a data set.

A dictionary can be created for any variable (or combination of variables) with the dict method, and is returned in the form of a UCS::DS::Memory::Dict object. NB: This dictionary is only valid as long as the data set itself is not modified (which includes activation or deactivation of a row index). Unlike a database index, the dictionary is not updated automatically. It is therefore important to keep operations on the data set under strict control while a dictionary is in use. It is always possible to add, modify, and delete variables that are not included in the dictionary, though. For the same reason (as well as to save working memory), dictionaries should be deleted when they are no longer needed.

The main purpose of a dictionary is to look up keys and find the matching rows in the data set efficiently (the ucs-join program is an example of a typical application). It is often desirable to choose variables in such a way that every key identifies a unique row in the data set (for instance, the values of l1 and l2 identify a pair type, which should have only one entry in a data set). A dictionary with this property is called unique. Both unique and non-unique dictionaries are supported (unique dictionaries are represented in a memory-efficient fashion). Lookup and similar operations are implemented as methods of the UCS::DS::Memory::Dict object.

Although mainly intended for string values, dictionaries support all data types. Boolean variables will usually be of interest only in combination with other variables (possibly also Boolean ones), and dictionaries are rarely useful for floating-point values.

$dict = $ds->dict($var1, ..., $varN);

Create a dictionary for the variables $var1, ..., $varN in the data set $ds. Each key of this dictionary is a combination of N values, which must be specified in the same order as the variable names. When a row index is in effect, keys and row numbers in the dictionary are taken from the virtual data set defined by the activated index. The returned object of class UCS::DS::Memory::Dict is a read-only dictionary: in order to take changes in the data set $ds into account (including the activation or deactivation of a row index), a new object has to be created with the dict method.

if ($dict->unique) { ... }

This method returns a true value iff $dict is a unique dictionary.

($max, $avg) = $dict->multiplicity;
$max = $dict->multiplicity;

Returns the maximum ($max) and average ($avg) number of rows matching a key in $dict. The dictionary is unique iff $max equals 1.

@rows = $dict->lookup($x1, ..., $xN);
$row = $dict->lookup($x1, ..., $xN);

Look up a key, specified as an N-tuple of variable values ($x1, ..., $xN), in the dictionary $dict and return the matching row numbers. The values $x1, ..., $xN must be given in the same order as the variables $var1, ..., $varN in the dict method call when the dictionary was created. When the key is not found in $dict, an empty list is returned.

In scalar context, the (number of the) first matching row is returned, or undef if the key is not found in the dictionary.

@rows = $dict->lookup($ds2, $n);
$row = $dict->lookup($ds2, $n);

The lookup method can also be used to look up rows from a second data set $ds2, i.e. to find rows in the dictionary's data set $ds where the values of $var1, ..., $varN match the $n-th row of $ds2. For this form of invocation, the dictionary variables must be defined in $ds2 (otherwise, a fatal error is raised).

$n_rows = $dict->multiplicity($x1, ..., $xN);
$n_rows = $dict->multiplicity($ds2, $n);

When called with arguments, the multiplicity method returns the number of rows matching a specific key in $dict. The key can be given in the same two ways as for the lookup method. (Note that calling lookup in scalar context returns the first matching row, not the total number of rows.)
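The difference between lookup in list context, lookup in scalar context, and multiplicity can be illustrated with an ordinary Perl hash. The sketch below is an assumption-laden model, not the UCS::DS::Memory::Dict implementation: the helpers build_dict and lookup are hypothetical, and joining the values of a key with a tab character is just one possible internal representation.

```perl
use strict;
use warnings;

# Hypothetical column vectors (row $n is stored at offset $n - 1).
my %data = (
    l1 => [qw(black black strong)],
    l2 => [qw(coffee tea tea)],
);
my $size = scalar @{ $data{l1} };

# Build a lookup hash for the given variables: each key is the tuple of
# values joined by a tab, each entry lists the matching row numbers.
sub build_dict {
    my @vars = @_;
    my %dict;
    for my $n (1 .. $size) {
        my $key = join("\t", map { $data{$_}[$n - 1] } @vars);
        push @{ $dict{$key} }, $n;
    }
    return \%dict;
}

# All matching rows in list context, first row (or undef) in scalar context.
sub lookup {
    my ($dict, @values) = @_;
    my $rows = $dict->{ join("\t", @values) } || [];
    return wantarray ? @$rows : $rows->[0];
}

my $by_l2 = build_dict("l2");
my @tea   = lookup($by_l2, "tea");         # all rows with l2 = tea
my $first = lookup($by_l2, "tea");         # first matching row only
my $count = () = lookup($by_l2, "tea");    # number of matching rows
my $none  = lookup($by_l2, "milk");        # undef: key not in dictionary

my $pairs = build_dict("l1", "l2");
my $row   = lookup($pairs, "black", "tea");
```

Here $first holds a row number while $count holds the multiplicity of the key, which is why a scalar-context lookup cannot stand in for multiplicity.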

@keys = $dict->keys;
$n_keys = $dict->keys;

Returns an unsorted list of all dictionary keys in the internal representation (where each key is a single string value). Such internal representations can be passed to the lookup and multiplicity methods instead of an N-tuple ($x1, ..., $xN). In scalar context, the keys method efficiently computes the number of keys in $dict.

Examples

The keys method and the ability to use the returned internal representations in the lookup method provide an easy way to compute the (empirical) distribution of a data set variable, i.e. a list of different values and their multiplicities. (Note that calling lookup in scalar context cannot be used to determine the multiplicity of a key because it returns the first matching row in this case.)

  # frequency table for variable $v on data set $ds
  $dict = $ds->dict($v);
  @distribution = 
    # sort values by multiplicity
    sort { $b->[1] <=> $a->[1] or $a->[0] cmp $b->[0] }
    # compute multiplicity for each value
    map { [$_, $dict->multiplicity($_)] }
    # for a single variable $v, internal keys are simply the values
    $dict->keys;
  undef $dict;                  # always erase dictionary after use

The following example is a bare-bones version of the ucs-join command, annotating the pair types of a data set $ds1 with a variable $var from another data set $ds2 (matching rows according to the pair types they represent, i.e. using the variables l1 and l2). Typically, $ds2 will be an annotation database.

  $ds1->add_variables($var);    # assuming $var doesn't already exist in $ds1
  $dict = $ds2->dict("l1", "l2");  # pair types are identified by l1 and l2
  $dict->unique 
    or die "Not unique -- can't look up pair types.";
  foreach $n (1 .. $ds1->size) {
    $row = $dict->lookup($ds1, $n);
    $ds1->set_cell($var, $n, $ds2->cell($var, $row))
      if defined $row;
  }
  undef $dict;

SEE ALSO

The ucsfile manpage for general information about UCS data sets and the data set file format, the ucsexp manpage for an introduction to UCS expressions (which are used extensively in the UCS::DS::Memory module) and wildcard patterns, the UCS::Expression manpage for information on how to compile UCS expressions, and the UCS::DS manpage for methods that manipulate the layout of a data set and its header information.

COPYRIGHT

Copyright 2004 Stefan Evert.

This software is provided AS IS and the author makes no warranty as to its use and performance. You may use the software, redistribute and modify it under the same terms as Perl itself.
