
NAME

UCS::DS::Stream - I/O streams for data set files

SYNOPSIS

  use UCS::DS::Stream;

  $ds = new UCS::DS::Stream::Read $filename;
  die "format error" unless defined $ds;
  # access variables, comments, and globals with UCS::DS methods
  while ($ds->read) {
    die "read/format error"
      unless $ds->valid;                # valid row data available?
    $n = $ds->row;                      # row number
    $idx = $ds->var_index("am.logl");   # see 'ucsdoc UCS::DS'
    $logl = $ds->columns->[$idx];       # $ds->columns returns arrayref
    $logl = $ds->value("am.logl");      # short and safe, but slower
    $rowdata = $ds->data;               # returns hashref (varname => value)
    $logl = $rowdata->{"am.logl"};      # == $ds->value("am.logl") 
  }
  $ds->close;

  $ds = new UCS::DS::Stream::Write $filename;
  # set up variables, comments, and globals with UCS::DS methods
  $ds->open;                            # write data set header
  foreach $i (1 .. $N) {
    $ds->data("id"=>$i, "l1"=>$l1, ...);# takes hashref or list of pairs
    $ds->data("am.logl"=>$logl, ...);   # may be called repeatedly to add data
    $ds->columns($i, $l1, $l2, ...);    # complete list of column data
    $ds->write;                         # write row and clear data cache
  }
  $ds->close;

DESCRIPTION

UCS data set streams are used to read and write data set files one row at a time. When an input stream is created, the corresponding data set file is opened immediately and its header is read in. The header information can then be accessed through UCS::DS methods. Each read method call loads a single row from the data set file into an internal representation, from which it is available to the main program.

An output stream creates or overwrites its associated data set file only when the open method is called. This allows the main program to set up variables and header data with UCS::DS method calls first. After the file has been opened, the data for each row is stored in an internal representation and then written to disk with the write method.

Note that there are no objects of class UCS::DS::Stream. Both input and output streams inherit directly from the UCS::DS class.

INPUT STREAMS

Input streams are implemented as UCS::DS::Stream::Read objects. When an input stream is created, the header of the associated data set file is read in. Header data and information about the variables in the data set can then be accessed using UCS::DS methods.

The actual data set table is then loaded one row (= pair type) at a time by calling the read method. The row data are extracted into an internal representation where they can be accessed with various methods (some of them being safe, others more efficient).

The na method controls whether missing values (represented by the string NA in the data set file) are recognised and stored internally as undefs, or whether they are silently translated into 0 (BOOL, INT, and DOUBLE variables) and the empty string (STRING variables), respectively.

$ds = new UCS::DS::Stream::Read $filename;

Open data set file $filename and read header information. Header variables and comments, as well as information about the variables in the data set, can then be accessed with UCS::DS methods. If $filename is a plain filename or a partial path (i.e., neither a path starting with / or ./ nor a command pipe) and the file is not found in the current working directory, the standard UCS library is automatically searched for a data set with this name.

If there is a syntax error in the data set header, undef is returned. Note that the object constructor will die if the file $filename does not exist or cannot be opened for reading.
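
The two failure modes can be told apart by wrapping the constructor call in eval. The following is a minimal sketch that relies only on the behaviour described above (die on an unreadable file, undef on a malformed header). The helper name open_input_stream is hypothetical and not part of UCS; the class name is passed as an argument so that the helper does not depend on a particular stream class.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of UCS): returns ($stream, $error),
# where exactly one of the two values is defined.
sub open_input_stream {
    my ($class, $filename) = @_;
    my $ds;
    # The constructor dies if the file does not exist or cannot be read ...
    my $ok = eval { $ds = $class->new($filename); 1 };
    return (undef, "cannot open $filename: $@") unless $ok;
    # ... and returns undef if the data set header contains a syntax error.
    return (undef, "$filename: format error in header") unless defined $ds;
    return ($ds, undef);
}
```

With the module loaded, a call such as open_input_stream("UCS::DS::Stream::Read", $filename) would return either the stream object or an error message.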

$ds->na(1);

Enables recognition of missing values represented by the string NA (as used by R). When enabled, missing values are represented by undefs. Otherwise, they will be silently translated into 0 (BOOL, INT, and DOUBLE variables) and the empty string (STRING variables), respectively. Missing value support is enabled by default; use $ds->na(0); to disable it.
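
When missing value support is enabled, a missing value can be distinguished from a genuine 0 or empty string with defined. As a rough sketch, the hypothetical helper count_missing below (not part of UCS) counts the rows in which a given variable is missing; it assumes that $ds is a freshly created input stream and uses only the methods documented in this section.

```perl
use strict;
use warnings;

# Hypothetical helper: count the rows in which variable $name is missing.
sub count_missing {
    my ($ds, $name) = @_;
    $ds->na(1);                         # represent NA as undef
    my $missing = 0;
    while ($ds->read) {
        die "read/format error in row ", $ds->row
            unless $ds->valid;
        # A value of 0 is defined and therefore not counted as missing.
        $missing++ unless defined $ds->value($name);
    }
    return $missing;
}
```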

$ok = $ds->read;

Read one line of data from the data set file and extract the field values into an internal representation. Returns false when the entire data set has already been processed. Typically used in a while loop similar to the diamond operator: while ($ds->read) {...}.

$at_end = $ds->eof;

Returns true when the entire data set has been read, i.e. the logical complement of the value returned by the last read call.

$ok = $ds->valid;

Returns true if the internal representation contains valid row data. Currently, this only compares the number of columns in the file against the number of variables in the data set. Later on, values may also be syntax-checked and coerced into the correct data type.

$n = $ds->row;

Returns the current row number (of the row read in by the last read call, which is now stored in the internal representation).

$value = $ds->value($name);

Get value by variable name. Returns the value of variable $name currently stored in the internal representation. This method is convenient and safe (because it checks that the variable $name exists in the given data set), but incurs considerable overhead.

$cols = $ds->columns;

Returns the entire row data as an array reference. Individual variables have to be identified by their index, which can be obtained with the var_index method ($cols->[$idx]). Since index lookup can be moved out of the row processing loop, this access method is much more efficient than its alternatives. NB: the array @$cols is not reused for the next line of input and can safely be integrated into user-defined data structures.
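
The index lookup can be hoisted out of the loop along the following lines. This is only a sketch: the helper sum_column is hypothetical (not part of UCS), and it assumes that the variable $name exists in the data set, since the behaviour of var_index for unknown variables is not described here.

```perl
use strict;
use warnings;

# Hypothetical helper: sum the values of variable $name over all rows.
sub sum_column {
    my ($ds, $name) = @_;
    my $idx = $ds->var_index($name);    # look up the column index once
    my $sum = 0;
    while ($ds->read) {
        die "read/format error in row ", $ds->row
            unless $ds->valid;
        my $value = $ds->columns->[$idx];   # direct access by position
        $sum += $value if defined $value;   # skip missing values (undef)
    }
    return $sum;
}
```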

$rowdata = $ds->data;

Returns a hash reference containing the entire row data indexed by variable names. Thus, the values of individual variables can be accessed with the expression $rowdata->{$varname}, similar to using the value method. Access with the data method is convenient for copying row data to an output stream. It is relatively slow, though, and should not be used in tight loops.

$ds->close;

Close the data set file. This method is automatically invoked when the object $ds is destroyed.

OUTPUT STREAMS

Output streams are implemented as UCS::DS::Stream::Write objects. After creating an output stream object, variables and header data are set up with the UCS::DS methods. The data set header is written to disk when the open method is called.

After that, the actual data set table is generated one row at a time. Row data is first stored in the internal representation (using the data or the columns method), and then written to disk when the write method is called.

$ds = new UCS::DS::Stream::Write $filename;

Create output stream for data set file $filename. Note that this file will only be created or overwritten when the open method is called (in contrast to input streams, which open the data set file immediately).

$ds->open;

After setting up variables and header data (comment lines and global variables) with the respective UCS::DS methods, the open method opens the data set file and writes the data set header. If the file cannot be opened for writing, the open method will die with an error message.

$ds->data($v1 => $val1, $v2 => $val2, ...);
$ds->data($hashref);

Store data for the next row to be written in an internal representation. When using the data method, variables are identified by name ($v1, $v2, ...) and can be specified in any order. The variable-value pairs can also be passed with a single hash reference. Variables that do not exist in the data set will be silently ignored. The data method can be called repeatedly for a single row.

$ds->columns($val1, $val2, ...);

The columns method provides a more efficient way to specify row data. Here, all column values are passed in a single method call, and care has to be taken to list them in the correct order (namely, the order in which the variables were set up with the add_vars method). NB: the data and columns methods cannot be mixed. It is also not possible to set up the row data incrementally with repeated columns calls.
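
A complete write cycle based on the columns method might look like the sketch below. The helper write_table is hypothetical (not part of UCS); it assumes that $out is a newly created output stream without any variables, and it checks the length of each row because columns expects a complete list of values.

```perl
use strict;
use warnings;

# Hypothetical helper: write a table given as a list of array references.
# $vars holds the variable names, $rows the row data in the same order.
sub write_table {
    my ($out, $vars, $rows) = @_;
    $out->add_vars(@$vars);             # fixes the column order
    $out->open;                         # writes the data set header
    foreach my $row (@$rows) {
        die "wrong number of columns"
            unless @$row == @$vars;
        $out->columns(@$row);           # complete row, in add_vars order
        $out->write;
    }
    $out->close;
    return scalar @$rows;               # number of rows written
}
```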

$ds->write;

Writes the row data currently stored in the internal buffer to the data set file, and resets the buffer (to undef values). Any undef values in the buffer (including the case where some variables were not specified with the data method) are interpreted as missing values and substituted by the string NA.
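
Because unset variables are written as NA, incomplete records can be passed to the data method as they are. The following sketch assumes that $out has already been opened and that an id variable has been set up; the helper write_records is hypothetical and not part of UCS.

```perl
use strict;
use warnings;

# Hypothetical helper: write a list of hash references (varname => value).
sub write_records {
    my ($out, $records) = @_;
    my $n = 0;
    foreach my $rec (@$records) {
        $out->data("id" => ++$n);       # first call: list of pairs
        $out->data($rec);               # second call for the same row: hashref
        $out->write;                    # variables not set so far become NA
    }
    return $n;                          # number of rows written
}
```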

$ds->close;

Completes and closes the data set file.

EXAMPLES

The recommended way of copying rows from one data set file to another is to use the data methods of both streams, so that variables are copied by name rather than column position. It would be more efficient to pass row data directly (using the columns methods), but this approach is error-prone when the order of the columns differs between the input and output data sets.

The following example makes a copy of a data set file, adding an (enumerative) id variable if it is not present in the source file.

  $in = new UCS::DS::Stream::Read $input_file;
  die "$input_file: format error"
    unless defined $in;
  @vars = $in->vars;
  $add_id = not $in->var("id");

  $out = new UCS::DS::Stream::Write $output_file;
  $out->copy_comments($in);             # copy comments and
  $out->copy_globals($in);              # global variables from input file
  $out->add_vars("id")                  # conventionally, the "id" variable
    if $add_id;                         # is in the first column
  $out->add_vars(@vars);
  $out->open;                           # writes header to $output_file

  while ($in->read) {
    die "read/format error"
      unless $in->valid;
    $out->data($in->data);              # copy row data by field name
    $out->data("id" => $in->row)        # use row number as ID value
      if $add_id;
    $out->write;
  }

  $in->close;
  $out->close;

COPYRIGHT

Copyright 2004 Stefan Evert.

This software is provided AS IS and the author makes no warranty as to its use and performance. You may use the software, redistribute and modify it under the same terms as Perl itself.
