UCS Quickstart: A Very Quick Guide
by Viktor Trón
Download the UCS toolkit by Stefan Evert from the UCS download page.
UCS crucially relies on Perl and
R (a language for statistical computing).
UCS/Perl uses R as a backend: important statistical functions provided by R are available
through a Perl module.
UCS will carp about any further missing dependencies.
Install UCS with
tar xzvf UCS-0.3.2.tar.gz
answer the questions and rejoice.
Configure UCS (assuming bash and you are still in the UCS toplevel directory) with
export UCS=`System/bin/ucs-config --base-dir`
Documentation for the modules and programs is in Perl POD format.
Display them with
For GUI viewing (if the Perl module
Tk::Pod was available at installation) use:
ucsdoc -tk ProgramName|ModuleName
Tutorials on line:
Publications related to UCS by Stefan Evert et al. are here.
Data set format (with extension
Fundamental objects of the UCS toolkit are frequency data extracted
from a given corpus for a given type of cooccurrences, for example:
- words or other morpho-syntactic units occurring in the same sentence or within a certain distance from each other
- adjectives modifying nouns (as in the Dickens and GLAW data sets)
- PPs that are P-objects or adjuncts of a verb (as in the FR-PNV data set)
A data set file consists of a list of pair types (as opposed to tokens in a text)
with their frequency signatures (i.e. joint and marginal frequencies), see Evert 2004.
For more on UCS data set file format (.ds), see [ucsfile].
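To make the notion of a frequency signature concrete, here is a minimal sketch that derives one from a handful of pair tokens using plain awk, without UCS. The toy data are invented, and the column order (l1, l2, f, f1, f2, N) is only an illustration of joint frequency, the two marginal frequencies and sample size; it is not meant to reproduce the exact .ds layout.

```shell
# Toy pair tokens, one "ITEM1<TAB>ITEM2" per line (invented data).
pairs=$(printf 'black\tcat\nblack\tcat\nblack\tdog\nold\tcat\n')

# Count joint frequency f per pair type, marginals f1 (first item) and
# f2 (second item), and sample size N; print one line per pair type.
signatures=$(printf '%s\n' "$pairs" | awk -F'\t' '
  { f[$1 FS $2]++; f1[$1]++; f2[$2]++; N++ }
  END {
    for (p in f) {
      split(p, w, FS)
      printf "%s\t%s\t%d\t%d\t%d\t%d\n", w[1], w[2], f[p], f1[w[1]], f2[w[2]], N
    }
  }' | LC_ALL=C sort)

# Columns: l1, l2, f, f1, f2, N
printf '%s\n' "$signatures"
```

Four pair tokens collapse into three pair types here; "black cat" has joint frequency 2, while "black" and "cat" each occur in 3 of the 4 tokens.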
Data set files are processed in gzipped form (
Examples are in
Get info about the data file with
ucs-info -v glaw.ds.gz
View the data file through a pager with
or much more conveniently (with persistent column headers) with
ucs-print -i dickens.ds.gz
Format the data file as an ASCII table with
Select parts of the data (and display/save them) with
ucs-select f FROM glaw.ds.gz TO ranks.ds.gz
'f' selects the variable named f; FROM and TO are keywords (not case-sensitive)
Create your own data set from a set of pairs of tokens standing in any structural relation (examples above).
Assuming that you have an extraction tool (
printing the instances (in the format
ITEM1 TAB ITEM2 NEWLINE representing a pair token) to
standard out, you can construct your data set with
YourExtractionTool | ucs-make-tables -v
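As an illustration of the expected input, the following sketch defines a hypothetical stand-in for YourExtractionTool: a shell function (extract_bigrams is an invented name) that prints every pair of adjacent words in a line as ITEM1 TAB ITEM2. A real extraction tool would of course apply a linguistically motivated relation instead of plain adjacency.

```shell
# Hypothetical stand-in for YourExtractionTool: reads text on standard
# input and prints each pair of adjacent words as ITEM1<TAB>ITEM2.
extract_bigrams() {
  awk '{ for (i = 1; i < NF; i++) printf "%s\t%s\n", $i, $(i + 1) }'
}

# Two toy sentences yield four pair tokens.
pair_tokens=$(printf 'the old cat\nthe old dog\n' | extract_bigrams)
printf '%s\n' "$pair_tokens"
```

The output of such a function is what gets piped into ucs-make-tables in the command above.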
Example script extracting A+N cooccurrences from
IMS Corpus Workbench (CWB).
With the CWB/Perl modules and the demo corpus installed, one can re-create the Dickens data set with
$UCS/Perl/tools/ucs-adj-n-from-cwb.perl penn DICKENS \
| ucs-make-tables -v -f 3 my-dickens.ds.gz
Import data sets, e.g., from the Ngram Statistics Package (NSP). Assuming
bigrams.cnt was created with NSP's
count.pl tool, create the UCS data set from it with
$UCS/Perl/tools/nsp2ucs.perl -v bigrams.cnt bigrams.ds.gz
Get a statistical summary (min, max, mean, var, sd of vars) with
Sort according to any var along the lines of:
ucs-sort -v dickens.ds.gz BY f+ -r INTO sorted.ds.gz
This sorts a gzipped ds file on the variable named 'f' in ascending order (+; descending is the default) and breaks ties randomly (-r). The output is also a gzipped ds file. See [ucs-sort].
Add association scores, i.e., annotate a data set with your favourite association measure with
ucs-add -v am.t.score am.log.likelihood TO dickens.ds.gz INTO scores.ds.gz
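For orientation, an association score is a function of the frequency signature. The sketch below computes a t-score by hand with awk, assuming the usual textbook definition t = (f - E) / sqrt(f) with expected frequency E = f1 * f2 / N; the signature values are invented, and the UCS implementation may differ in detail.

```shell
# Hypothetical frequency signature: f, f1, f2, N (tab-separated).
t_score=$(printf '30\t100\t200\t10000\n' | LC_ALL=C awk -F'\t' '{
  expected = $2 * $3 / $4                      # E = f1 * f2 / N
  printf "%.2f\n", ($1 - expected) / sqrt($1)  # t = (f - E) / sqrt(f)
}')
printf 't-score = %s\n' "$t_score"
```

With these numbers the expected frequency is 2, so the observed joint frequency of 30 gives a clearly positive score.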
Add ranks (based on association measures) to the dataset with
ucs-add -v 'r.%' TO scores.ds.gz INTO ranks.ds.gz
r.% is a wildcard, see [ucsexp]
Count the number of pair types with cooccurrence frequency >= X with:
ucs-select -v --count FROM ranks.ds.gz WHERE '%f% >= 10'
%f% is a UCS expression, see [ucsexp]
UCS expressions are snippets
of Perl code with special syntax to access data set variables. They
have the full power of Perl. E.g., to retrieve all collocates of nouns
ending in -ness:
ucs-select -v '*' 'r.%' FROM ranks.ds.gz WHERE '%l2% =~ /ness$/' | ucs-sort by l2 l1 | ucs-print -i
Check overlap of two ds or adb (annotated database) files with
ucs-join -v fr-pnv.ds.gz pnv.adb.gz
Transfer annotation attributes across files with
ucs-join -v fr-pnv.ds.gz WITH b.figur b.fvg FROM pnv.adb.gz INTO fr-annotated.ds.gz
Create new variables (and add them) with
ucs-add -v -m 'b.TP := %b.figur% or %b.fvg%' \
TO fr-annotated.ds.gz INTO fr-annotated.ds.gz
Evaluate recall of an association measure, for example by counting the true positives with log-likelihood rank <= 500:
ucs-select -v --count FROM fr-annotated.ds.gz \
WHERE '%b.TP% and %r.log.likelihood% <= 500'
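The count returned by such a query can be turned into precision and recall with a little arithmetic. The sketch below uses invented numbers: 120 stands for the true positives among the 500 best-ranked pairs, and 300 for the total number of true positives in the data set, which would have to be obtained separately (presumably by a count without the rank condition).

```shell
tp_in_top=120   # hypothetical: true positives among the n best-ranked pairs
n_best=500      # size of the n-best list
tp_total=300    # hypothetical: true positives in the whole data set

# precision = TP in n-best / n; recall = TP in n-best / all TP
evaluation=$(LC_ALL=C awk -v tp="$tp_in_top" -v n="$n_best" -v all="$tp_total" \
  'BEGIN { printf "precision=%.2f recall=%.2f\n", tp / n, tp / all }')
printf '%s\n' "$evaluation"
```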