The UCS toolkit.
The UCS toolkit is a collection of libraries and scripts for the statistical analysis of cooccurrence data. Data sets – each one containing a list of word pairs together with their joint and marginal frequencies – are stored in a tabular format in plain (compressed) text files. They can be viewed, printed, manipulated in various ways, annotated with association scores from a wide range of built-in measures, ranked, and sorted with the UCS/Perl subsystem. Additional functionality for the graphical evaluation of association measures in a collocation extraction task (cf. Evert & Krenn, 2001) is provided by the UCS/R subsystem.
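To make the data format more concrete, the following is a minimal sketch in Python of how association scores can be derived from the joint and marginal frequencies of a word pair. It is not part of UCS (whose own code is written in Perl and R) and does not reproduce its file format or API: the function names, the tuple layout and the sample frequencies are all hypothetical. Mutual information and the log-likelihood ratio are used here only as two well-known examples of association measures.

```python
import math

def contingency(f, f1, f2, n):
    """Observed 2x2 table from joint frequency f, marginal
    frequencies f1 and f2, and sample size n."""
    return (f, f1 - f, f2 - f, n - f1 - f2 + f)

def expected(f1, f2, n):
    """Expected 2x2 table under the hypothesis of independence."""
    r1, r2 = f1, n - f1   # row totals
    c1, c2 = f2, n - f2   # column totals
    return (r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n)

def mutual_information(f, f1, f2, n):
    """Pointwise mutual information: log2 of observed over expected joint frequency."""
    return math.log2(f / (f1 * f2 / n))

def log_likelihood(f, f1, f2, n):
    """Log-likelihood ratio statistic G2; empty cells contribute zero."""
    obs = contingency(f, f1, f2, n)
    exp = expected(f1, f2, n)
    return 2 * sum(o * math.log(o / e) for o, e in zip(obs, exp) if o > 0)

# Hypothetical data set: (word1, word2, joint f, marginal f1, marginal f2)
N = 10000
pairs = [
    ("the", "tea", 1, 100, 100),
    ("strong", "tea", 8, 100, 100),
    ("green", "tea", 4, 100, 100),
]

# Annotate each pair with a score and rank by decreasing association
ranked = sorted(pairs, key=lambda p: log_likelihood(p[2], p[3], p[4], N),
                reverse=True)
for w1, w2, f, f1, f2 in ranked:
    print(w1, w2, round(log_likelihood(f, f1, f2, N), 2))
```

Everything the scores need is contained in the four numbers per pair (joint frequency, two marginals, sample size), which is why a simple tabular text format is sufficient for storing such data sets.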
The development version (trunk) can be checked out from the Subversion repository:
svn co svn://svn.code.sf.net/p/multiword/code/software/UCS/trunk UCS
The full release of the sample code and data sets accompanying my PhD thesis (Evert 2004), announced in the text as UCS version 0.5, has been delayed. Since various extensions, bug fixes and compatibility updates have accumulated in the meantime, I have decided to go ahead with new releases beyond 0.5 even though this material is still missing. The sample code will be published as part of UCS version 1.0, or in a different form together with a reimplementation of the UCS software.
UCS/Perl documentation
On-line tutorials: UCS/Perl tutorial - UCS/R tutorial - Viktor Trón's UCS quickstart (a one-minute guide for programmers)
Pod::Perldoc (Perl versions prior to 5.8.1)
Tk::Pod (optional, only for documentation viewer)
NB: Future releases of the UCS toolkit are expected to require Perl version 5.8.1 or newer (for Unicode support) and R version 2.10.1 or newer.
Supported and tested platforms
Copyright © 2004–2010 by Stefan Evert.
Footnote: The UCS toolkit has been designed for scientific research on the properties of statistical association measures and the relation between cooccurrences and collocations. In my terminology, this involves a close look at the data and a thorough understanding of the theoretical and methodological background. Flexibility is more important than either frills or speed. Therefore, the UCS system is not intended as a number cruncher that extracts and processes cooccurrences from several hundred million words of text in a few minutes. Nor is it a black box that accepts text files from a word processor and produces a list of collocation candidates at the push of a button.
Archive: UCS-0.6.tar.gz (2.2M) - UCS-0.5-prerelease.tar.gz (1.9M) - UCS-0.4.tar.gz (1.7M) - UCS-0.3.2.tar.gz (1.6M) - UCS-0.3.1.tar.gz (465k) - UCS-0.3.tar.gz (463k) - UCS-0.2.tar.gz (440k)
Last modified: Sun Sep 12 12:44:59 2010 (severt)