
NAME

ucs-make-tables - Compute contingency tables from a sequence of pair tokens

SYNOPSIS

  ... | ucs-make-tables [-v] [--sort | -s] [--sample-size=<n> | -N <n>]
                        [--threshold=<t> | -f <t>] [--types | -T]  data.ds.gz

  ... | ucs-make-tables [-v] [-s] [-N <n>] [-f <t>]
                        [--dispersion [--chunk-size=<s>]]  data.ds.gz

  ... | ucs-make-tables [-v] [-s] [-N <n>] [-f <t>] --segments  data.ds.gz

DESCRIPTION

This utility computes frequency signatures and constructs a UCS data set for a stream of pair tokens (or segment-based cooccurrence data) read from STDIN. It is usually applied to the output of a cooccurrence extraction tool in a command-line pipe. The input can also be read from a file (with a < redirection), or decompressed on the fly (with gzip -cd or bzip2 -cd). The resulting data set is written to the file specified as the single mandatory argument on the command-line.
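For instance, pair tokens stored in a compressed file (the name pairs.tbl.gz is invented for illustration) can be decompressed on the fly and piped into the program:

  gzip -cd pairs.tbl.gz | ucs-make-tables data.ds.gz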

ucs-make-tables operates in two different modes for relational and positional (segment-based) cooccurrences. These two modes are described separately in the following subsections. They take the same command-line options and arguments, as described in the section COMMAND LINE below. Distance-based positional cooccurrences are not supported, as they usually require direct access to the source corpus in order to determine the precise window size.

Relational Cooccurrences

By default, ucs-make-tables operates in a mode for relational cooccurrences. In this mode, the input line format is

  <l1> TAB <l2>

Each such line represents a pair token with labels <l1> and <l2> (i.e. a pair token that belongs to the pair type (l1,l2)). Optionally, pre-compiled frequency counts for pair types can be used if the --types (or -T) option is specified. In this case, the input line format is

  <f> TAB <l1> TAB <l2>

The input may still contain multiple entries for the same pair type; their frequency counts are automatically added up.
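As an illustration, an input stream in the default mode might contain lines such as these (TAB characters are shown here as whitespace; the adjective+noun labels are invented):

  young  woman
  hard   work
  young  woman

With --types, the same data could be supplied as pre-compiled frequency counts:

  2  young  woman
  1  hard   work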

For dispersion counts (see below), the input lines should preserve the order in which the corresponding pair tokens appear in the corpus. When dispersion is measured with respect to pre-annotated parts (e.g. paragraphs or documents) rather than equally-sized parts, the input must contain an extra column with unique part identifiers:

  <l1> TAB <l2> TAB <part_id>

or in combination with --types:

  <f> TAB <l1> TAB <l2> TAB <part_id>

(where <f> is the frequency count of this pair type in the current part of the corpus). Note that all pair tokens from a given part must form an uninterrupted sequence in the input; otherwise, the dispersion counts will not be correct.
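For example, pair tokens annotated with document identifiers for dispersion counts might be supplied as follows (TAB-separated; all labels and identifiers are invented):

  young  woman  doc_001
  hard   work   doc_001
  young  woman  doc_002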

Segment-based Cooccurrences

The mode for segment-based cooccurrences is activated with the --segments (or -S) option. In this mode, each segment is represented by a sequence of four lines in the input stream, called a record:

  1. <segment_id> [ TAB <part_id> ]
  2. The labels of all tokens in the segment that can become first components of pairs, separated by TABs.
  3. The labels of all tokens in the segment that can become second components of pairs, separated by TABs.
  4. A blank separator line.

Duplicate strings on the second or third line will automatically be ignored. The <segment_id> on the first line is currently ignored. The optional <part_id> can be used to compute dispersion counts for pre-annotated parts. All segments that belong to a given part must appear in consecutive records; otherwise, the dispersion counts will not be correct.

A prototypical example of the segment-based approach is lemmatised noun-verb cooccurrences within sentences. In this case, each record in the input stream corresponds to a sentence. The first line contains a sentence identifier (which is ignored). The second line contains the lemma forms of all nouns in the sentence (duplicates are automatically removed), and the third line contains the lemma forms of all verbs. In order to compute the dispersion of cooccurrences across documents (i.e. document frequencies in the terminology of information retrieval), unique document identifiers have to be added to the first line.
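A single record might then look as follows (TAB characters are shown here as whitespace; the identifiers and lemma forms are invented, and the blank fourth line terminates the record):

  s_1042  doc_007
  gentleman  door  fire
  open  say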

COMMAND LINE

The general form of the ucs-make-tables command is

  ... | ucs-make-tables [--verbose | -v] [--sort | -s]
                        [--threshold=<t> | -f <t>] 
                        [--sample-size=<n> | -N <n>] 
                        [--dispersion [--chunk-size=<s>]]
                        [--types | --segments]
                        data.ds.gz

With the --verbose (or -v) option, some progress information (including the number of pair tokens or segments, as well as the number of pair types encountered so far) is displayed while the program is running.

When --sort (or -s) is specified, the resulting data set is sorted in ascending alphabetical order (on l1 first, then l2). Of course, the data set file can always be re-sorted with the ucs-sort utility.

When a frequency threshold <t> is specified with the --threshold (or -f) option, only pair types with cooccurrence frequency f >= <t> will be saved to the data set file (but they are still included in the marginal frequency counts of relational cooccurrences, of course). This option helps keep the size of data sets extracted from large corpora manageable.

When --sample-size (or -N) is specified, only the first <n> pair tokens (or segment records) read from STDIN will be used, so that the sample size N of the resulting data set is equal to <n>. This option is mainly useful when computing dispersion counts on equally-sized parts (see below), but it has some other applications as well.

With the --dispersion (or -d) option, dispersion counts are added to the data set and can then be used to test the random sample assumption with a dispersion test (see Baayen 2001, Sec. 5.1.1). To this end, the token stream is divided into equally-sized parts, each containing the number of pair tokens <s> specified with the --chunk-size (or -c) option. For segment-based cooccurrences, each part contains the cooccurrences from <s> segments. When the total number of pair tokens (or segments) is not an integer multiple of <s>, a warning message is issued. In this case, it is recommended to adjust the number of tokens with the --sample-size option described above.

The dispersion count for each pair type, i.e. the number of parts in which it occurs, is stored in a variable named n.disp in the resulting data set file. In addition, the number of parts and the part size are recorded in the global variables chunks and chunk.size. When no chunk size is specified, dispersion counts are computed for pre-annotated parts, which must be identified in the input stream (see above). In this case, chunk.size is left undefined, as the individual parts may have different sizes. NB: The use of pre-annotated parts is discouraged, since the mathematics of the dispersion test assume equally-sized parts.
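If pre-annotated parts are used nevertheless, a minimal invocation might look as follows, assuming an input file (here the invented pairs-by-doc.tbl) whose lines carry a <part_id> column as described above; --dispersion is given without a chunk size:

  ucs-make-tables --verbose --dispersion data.ds.gz < pairs-by-doc.tbl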

EXAMPLES

If you have installed the IMS Open Corpus Workbench (CWB) as well as the CWB/Perl interface, you can easily extract relational adjective+noun cooccurrences from part-of-speech tagged CWB corpora. The adj-n-from-cwb utility supplied with the UCS system supports several tagsets for German and English corpora. It can easily be extended to other tagsets, languages, and types of cooccurrences (as long as they can be identified with the help of part-of-speech patterns).

The following example extracts adjective+noun pairs with cooccurrence frequency f >= 3 from the CWB demonstration corpus DICKENS (ca. 3.4 million words), and saves them into the data set file dickens.adj-n.ds.gz. The shell variable $UCS refers to the System/ directory of the UCS installation (as in the UCS/Perl tutorial).

  ucs-tool adj-n-from-cwb  penn  DICKENS
       |  ucs-make-tables  --verbose --sort --threshold=3  dickens.adj-n.ds.gz

(Note that the command must be entered as a single line in the shell.)

Extraction from the DICKENS corpus produces approximately 122990 pair tokens. In order to apply a dispersion test with chunks of 1000 tokens each, the sample size has to be limited to an integer multiple of 1000:

  ucs-tool adj-n-from-cwb  penn  DICKENS
       |  ucs-make-tables  --verbose --sort --threshold=3 --sample-size=122000
                           --dispersion --chunk-size=1000  dickens.disp.ds.gz

A dispersion test for pair types with f <= 5 can then be performed with the following command, which reveals a significant amount of underdispersion at all levels.

  ucs-tool dispersion-test -v -m 5 dickens.disp.ds.gz

Segment-based data can be obtained from a CWB corpus with the segment-from-cwb utility. The following example extracts nouns and verbs cooccurring within sentences. A frequency threshold of 5 is applied in order to keep the amount of data (and hence the memory consumption of the ucs-make-tables program) manageable.

  ucs-tool segment-from-cwb -f 5 -t1 "VB.*" -t2 "NN.*" DICKENS s
       | ucs-make-tables --verbose --segments --threshold=5 dickens.n-v.ds.gz

Adjacent bigrams and other simple fixed-distance cooccurrences can efficiently be computed with the cwb-scan-corpus tool, which produces output suitable for use with the --types option. For instance, to determine bigram cooccurrences in the DICKENS corpus with information about their dispersion across different novels, you might execute the following pipeline:

  cwb-scan-corpus DICKENS lemma+0 lemma+1 novel_title+0
      | LC_ALL=C sort -k 4
      | ucs-make-tables --verbose --dispersion --types bigrams.disp.ds.gz

Note the sort command in the pipeline, which orders pair type counts by novel as expected by the --dispersion option. LC_ALL=C ensures that sort will not stumble over character strings in an encoding that does not match the current locale.

REFERENCES

Baayen, R. Harald (2001). Word Frequency Distributions. Kluwer, Dordrecht.

IMS Open Corpus Workbench (CWB): http://cwb.sourceforge.net/

COPYRIGHT

Copyright 2004-2010 Stefan Evert.

This software is provided AS IS and the author makes no warranty as to its use and performance. You may use the software, redistribute and modify it under the same terms as Perl itself.
