www.collocations.de: Software

The UCS toolkit.

The UCS toolkit is a collection of libraries and scripts for the statistical analysis of cooccurrence data. Data sets, each containing a list of word pairs together with their joint and marginal frequencies, are stored in a tabular format in plain text files (optionally compressed). With the UCS/Perl subsystem, they can be viewed, printed, manipulated in various ways, annotated with association scores from a wide range of built-in measures, ranked, and sorted. The UCS/R subsystem provides additional functionality for the graphical evaluation of association measures in a collocation extraction task (cf. Evert & Krenn, 2001).
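To make the idea of annotating word pairs with association scores concrete, here is a minimal sketch in Python. It is not part of UCS and does not use the UCS/Perl or UCS/R interfaces. The parameter names f (joint frequency), f1 and f2 (marginal frequencies) and n (sample size), the function names, and the two example pairs with their frequencies are all assumptions made for illustration. The two measures shown, pointwise mutual information and the log-likelihood ratio, are standard textbook formulas and only stand in for the kind of measure such a toolkit offers.

```python
import math

def observed(f, f1, f2, n):
    """Observed 2x2 contingency table (O11, O12, O21, O22) for one word pair."""
    return (f, f1 - f, f2 - f, n - f1 - f2 + f)

def expected(f1, f2, n):
    """Expected cell frequencies under independence, from the marginals alone."""
    return (
        f1 * f2 / n,
        f1 * (n - f2) / n,
        (n - f1) * f2 / n,
        (n - f1) * (n - f2) / n,
    )

def pointwise_mi(f, f1, f2, n):
    """Pointwise mutual information: log2 of observed over expected joint frequency."""
    return math.log2(f / (f1 * f2 / n))

def log_likelihood(f, f1, f2, n):
    """Log-likelihood ratio G2 = 2 * sum(O * ln(O / E)); empty cells contribute 0."""
    obs = observed(f, f1, f2, n)
    exp = expected(f1, f2, n)
    return 2.0 * sum(o * math.log(o / e) for o, e in zip(obs, exp) if o > 0)

# Hypothetical data set: (word 1, word 2, f, f1, f2, n). Frequencies are invented.
pairs = [
    ("the", "tea", 60, 60_000, 200, 1_000_000),
    ("strong", "tea", 30, 500, 200, 1_000_000),
]

# Annotate each pair with a score, then rank by decreasing association.
ranked = sorted(pairs, key=lambda p: log_likelihood(*p[2:]), reverse=True)
for w1, w2, *freqs in ranked:
    print(w1, w2, round(log_likelihood(*freqs), 2), round(pointwise_mi(*freqs), 2))
```

Although the first pair has the higher joint frequency, its expected frequency under independence is also much higher, so the second pair ends up ranked first. This is the basic reason for ranking by an association measure rather than by raw cooccurrence counts.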

Download UCS version 0.6 (UCS-0.6.tar.gz, 2.2 MB) – What's new?
Or check out the cutting-edge code from its new SourceForge home:
svn co svn://svn.code.sf.net/p/multiword/code/software/UCS/trunk UCS

The full release of sample code and data sets accompanying my PhD thesis (Evert 2004), which is announced in the thesis as UCS version 0.5, has been delayed and will probably never happen. Please use the latest version of the UCS toolkit (0.6 or newer), preferably installed directly from the SVN repository.

If you would like to replicate a particular analysis, please contact me by e-mail to obtain the scripts and data in their current state.

On-line documentation: UCS/Perl documentation - UCS/R documentation
On-line tutorials: UCS/Perl tutorial - UCS/R tutorial - Viktor Trón's UCS quickstart (a one-minute guide for programmers)

Requirements

Perl version 5.6.1 or newer (5.8.1+ recommended) [http://www.perl.com/]
Additional Perl modules [http://www.cpan.org/]
- Expect
- Pod::Perldoc (only needed for Perl versions prior to 5.8.1)
- Term::ReadKey (recommended)
- Perl/Tk and Tk::Pod (optional, only for documentation viewer)
The R statistical environment version 2.0 or newer (version 2.8 or newer highly recommended) [http://www.r-project.org/]
- RSPerl interface (optional) [http://www.omegahat.org/RSPerl/] for faster communication between Perl and R
  - RSPerl may be difficult to install, and is not required for running UCS
  - On Mac OS X, you can try my hacked version RSPerl_0.92-2_fixed.tar.gz. See UCS installation notes for more information.
a2ps (recommended)

NB: Future releases of the UCS toolkit are expected to require Perl version 5.8.1 or newer (for Unicode support) and R version 2.10.1 or newer.

Supported and tested platforms

Linux 2.4 / i386 (SuSE 9.0 & RedHat 9)
Linux 2.6 / x86_64 (Debian)
SUN Solaris 2.8 / SPARC
Mac OS X 10.4.8–10.6.4 / PowerPC, i386, x86_64
Other Unix-like platforms should work as well [not tested]
Win32 / i386 (Cygwin emulation) [experimental]

Footnote: The UCS toolkit has been designed for scientific research on the properties of statistical association measures and the relation between cooccurrences and collocations. As I understand the term, such research involves a close look at the data and a thorough understanding of the theoretical and methodological background. Flexibility is more important than either frills or speed. The UCS system is therefore not intended as a number cruncher that extracts and processes cooccurrences from several hundred million words of text in a few minutes. Nor is it a black box that accepts text files from a word processor and produces a list of collocation candidates at the push of a button.

Archive: UCS-0.6.tar.gz (2.2M) - UCS-0.5-prerelease.tar.gz (1.9M) - UCS-0.4.tar.gz (1.7M) - UCS-0.3.2.tar.gz (1.6M) - UCS-0.3.1.tar.gz (465k) - UCS-0.3.tar.gz (463k) - UCS-0.2.tar.gz (440k)