NAME

ucsintro - A first introduction to UCS/Perl

INTRODUCTION

UCS is a set of libraries and tools intended for the empirical study of cooccurrence statistics. Its major uses are to apply such statistics, called association measures, to cooccurrence data obtained from a corpus, and to evaluate the resulting association scores and rankings against (manually annotated) reference data.

The frequency data extracted from a given corpus for a given type of cooccurrence consists of a list of pair types with their frequency signatures (i.e. joint and marginal frequencies), and is referred to as a data set. See Evert (2004) for a detailed explanation of these concepts, different types of cooccurrences, and correct methods for obtaining frequency data. Data sets, stored in a special .ds file format, are the fundamental objects of the UCS toolkit. Most UCS programs manipulate or display such data set files.
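To make the notion of a frequency signature concrete, here is a minimal sketch in plain Perl that counts joint and marginal frequencies for a handful of pair tokens. It does not use the UCS modules. The input format, the function name frequency_signatures, and the field names (l1, l2, f, f1, f2, N) are assumptions made for this illustration; see ucsfile for the variables actually stored in a .ds file.

```perl
use strict;
use warnings;

# Hypothetical input: one pair token per element, as [first, second].
my @tokens = (
    [qw(black box)], [qw(black box)], [qw(black cat)],
    [qw(red box)],   [qw(red wine)],
);

# Count joint and marginal frequencies and return one hash ref per pair type:
#   f  = joint frequency of the pair
#   f1 = marginal frequency of the first component
#   f2 = marginal frequency of the second component
#   N  = sample size (total number of pair tokens)
sub frequency_signatures {
    my @pairs = @_;
    my (%joint, %first, %second);
    for my $pair (@pairs) {
        my ($l1, $l2) = @$pair;
        $joint{$l1}{$l2}++;
        $first{$l1}++;
        $second{$l2}++;
    }
    my $N = scalar @pairs;
    my @rows;
    for my $l1 (sort keys %joint) {
        for my $l2 (sort keys %{ $joint{$l1} }) {
            push @rows, {
                l1 => $l1,
                l2 => $l2,
                f  => $joint{$l1}{$l2},
                f1 => $first{$l1},
                f2 => $second{$l2},
                N  => $N,
            };
        }
    }
    return @rows;
}

# Print one line per pair type: components followed by f, f1, f2, N.
my @rows = frequency_signatures(@tokens);
printf "%-6s %-6s %2d %2d %2d %2d\n", @{$_}{qw(l1 l2 f f1 f2 N)} for @rows;
```

Each resulting row corresponds to one pair type; a list of such rows is what this introduction calls a data set.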

The UCS implementation relies heavily on the programming language Perl (http://www.perl.com/) and on the free statistical environment R (http://www.r-project.org/), which serves as a library of mathematical and statistical functions. The core of UCS is written in Perl (the UCS/Perl part), but there is also a small library of R functions for interactive work within R (the UCS/R part). UCS/Perl uses R as a back-end, making the most important statistical functions available through a Perl module.

UCS/Perl is mainly a collection of Perl modules that perform the following tasks:

  * read and write data set files (.ds, .ds.gz)
  * manage in-memory representations of data sets
  * compile UCS expressions for easy access to data set variables
  * filter, annotate, sort, and analyse data sets
  * provide a repository of built-in association measures
  * display data sets and evaluation graphs (Perl/Tk and R) [not implemented yet]

Most UCS programs will be custom-built scripts that use the library of support functions provided by the UCS/Perl modules. Loading a data set, annotating it with association scores from one or more measures, and sorting it in various ways can be done with a few lines of Perl code. There are also some ready-made programs in UCS/Perl that perform such standard tasks, operating on data set files. A substantial part of the UCS/Perl functionality is thus accessible from the command line, at the cost of some additional overhead compared to a custom script (which operates on in-memory representations).
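The annotate-and-sort workflow described above can be sketched in plain Perl as follows. This is an illustration of the idea only and deliberately avoids the UCS modules, whose actual interfaces are documented in their manpages. The data rows, the helper names (mi_score, t_score, annotate, rank_by), and the choice of a base-2 logarithm for the MI score are assumptions of this sketch.

```perl
use strict;
use warnings;

# Hypothetical in-memory data set: one hash per pair type with its
# frequency signature (f, f1, f2, N).
my @data = (
    { l1 => 'black', l2 => 'box',  f => 2, f1 => 3, f2 => 3, N => 5 },
    { l1 => 'black', l2 => 'cat',  f => 1, f1 => 3, f2 => 1, N => 5 },
    { l1 => 'red',   l2 => 'box',  f => 1, f1 => 2, f2 => 3, N => 5 },
    { l1 => 'red',   l2 => 'wine', f => 1, f1 => 2, f2 => 1, N => 5 },
);

# Expected joint frequency under independence: f1 * f2 / N.
sub expected {
    my ($row) = @_;
    return $row->{f1} * $row->{f2} / $row->{N};
}

# Pointwise mutual information (base 2 is an assumption of this sketch).
sub mi_score {
    my ($row) = @_;
    return log($row->{f} / expected($row)) / log(2);
}

# t-score: difference between observed and expected, scaled by sqrt(f).
sub t_score {
    my ($row) = @_;
    return ($row->{f} - expected($row)) / sqrt($row->{f});
}

# Store the score computed by $measure under key $name in every row.
sub annotate {
    my ($name, $measure, @rows) = @_;
    $_->{$name} = $measure->($_) for @rows;
    return @rows;
}

# Return rows sorted by descending score, ties broken alphabetically.
sub rank_by {
    my ($name, @rows) = @_;
    my @sorted = sort {
        $b->{$name} <=> $a->{$name}
            or $a->{l1} cmp $b->{l1}
            or $a->{l2} cmp $b->{l2}
    } @rows;
    return @sorted;
}

annotate('MI', \&mi_score, @data);
annotate('t',  \&t_score,  @data);
my @ranked = rank_by('MI', @data);
printf "%-6s %-6s %8.3f %8.3f\n", @{$_}{qw(l1 l2 MI t)} for @ranked;
```

Keeping each measure as a separate code reference makes it straightforward to annotate the same rows with several scores and to compare the resulting rankings.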

Below, you will find a list of the general documentation files, Perl modules, and programs that are included in the UCS/Perl distribution. Manpages for all modules and programs (as well as the general documentation) are easily accessible with the ucsdoc program, and can also be formatted for printing.

GENERAL DOCUMENTS

  ucsdoc ucsintro             # this introduction
  ucsdoc ucsfile              # description of the UCS data set file format (.ds)
  ucsdoc ucsexp               # UCS expressions and wildcards
  ucsdoc ucsam                # overview of built-in association measures

UCS/Perl MODULES

  use UCS;                    # core library
  use UCS::File;              # file access utilities
  use UCS::R;                 # interface to UCS/R
  use UCS::SFunc;             # special functions and statistical distributions

  use UCS::Expression;        # Perl code interspersed with UCS variables
  use UCS::Expression::Func;  # utility functions available in UCS expressions

  use UCS::AM;                # implementations of various association measures
  use UCS::AM::HTest;         # add-on package: variants of hypothesis tests
  use UCS::AM::Parametric;    # add-on package: parametric association measures

  use UCS::DS;                # data sets ...
  use UCS::DS::Stream;        #   i/o streams for data set files
  use UCS::DS::Memory;        #   in-memory representation of data sets
  use UCS::DS::Format;        #   ASCII formatter (+ other formats)

See the respective manpages (ucsdoc ModuleName) for more information.

UCS/Perl PROGRAMS

  ucsdoc          # front-end to perldoc
  ucs-config      # automatic configuration of UCS/Perl scripts
  ucs-tool        # find and run user-contributed UCS/Perl scripts
  ucs-list-am     # list built-in association measures & add-on packages
                  
  ucs-make-tables # compute frequency signatures from list of pair tokens
  ucs-merge       # merge parts of a very large data set
  ucs-summarize   # print (statistical) summaries for selected variables
                  
  ucs-select      # select rows and/or columns from a data set file
  ucs-add         # add variables to a data set file
  ucs-join        # combine rows and/or columns from two data sets
  ucs-sort        # sort data set file by specified attribute(s)
                  
  ucs-info        # display information from header of data set file
  ucs-print       # format data set as ASCII table (for viewing and printing)

See the respective manpages (ucsdoc ProgramName) for more information.

TRIVIA

UCS stands for Utilities for Cooccurrence Statistics.

REFERENCES

Evert, Stefan (2004). The Statistics of Word Cooccurrences: Word Pairs and Collocations. PhD Thesis, University of Stuttgart, Germany.

On-line repository of association measures: http://www.collocations.de/

COPYRIGHT

This software is provided AS IS and the author makes no warranty as to its use and performance. You may use the software, redistribute and modify it under the same terms as Perl itself.