This tutorial gives an introduction to the core features of the UCS/R toolbox (loading UCS data sets and evaluation graphs for association measures) from an end-user perspective. Several functions that are used internally and that are mainly of interest to the programmer are not covered.

To follow the tutorial, first start an R process in the UCS/R base directory. Read the comments and execute the command lines following each comment block in your R process (either using cut & paste, or by typing C-c C-n in Emacs/ESS mode). Each command stands on a single, separate text line. The command lines must be executed in the order in which they are listed, and none of them should be left out.

Bibliographical details for all citations can be found at the URL

` http://www.collocations.de/REF/`

Have fun!

` Stefan Evert, June 2004.`

Load UCS/R configuration data, help pages, and module loading mechanism.

source("lib/ucs.R")

The relative path above only works when you are running R from the UCS/R
base directory. You can use the ucs-config tool from UCS/Perl to insert
the correct full path automatically:

` ucs-config script/tutorial.R`

The following command prints a list of the available UCS/R modules:

ucs.library()

For evaluation graphs, the "base" and "plots" modules are needed (it would be sufficient to specify the latter, which will automatically load its prerequisites).

ucs.library("base")
ucs.library("plots")

Note that the two modules cannot be loaded with a single function call.
It is possible to load all UCS/R modules at once, though:

` ucs.library(all=TRUE)`

Once the UCS/R configuration file has been loaded, the full on-line documentation is available within the R help system. The "UCS" manpage gives an overview of the UCS/R modules with links to the most important manual pages.

?UCS

You can print a list of all UCS/R manpages with either of the following:

library(help="UCS")
help(package="UCS")

See the "ucs.library" manpage for details on the module loading mechanism.

?ucs.library

The annotated data set required for the evaluation tutorial is not shipped
with the UCS system because of its size. The association score annotations
can easily be re-created with the UCS/Perl tools. If you have configured
UCS correctly and installed the UCS/Perl programs in your search path,
simply type

` ucs-tool prepare-tutorial`

on the command line. If the shell cannot find the "ucs-tool" program,
try writing "../bin/ucs-tool" instead. Note that this script may take some
time to compute all association measures. When the task has been successfully
completed, it will print "Done." on the screen.

You can also run the annotation script from R with the commands below. This operation may not be reliable on some systems (for instance, a special option --no-pv is necessary under Cygwin). If you experience any problems, fall back on the command line above.

ucs.tool <- paste(ucs.basedir, "perl", "bin", "ucs-tool", sep="/")
command <- paste(ucs.tool, "prepare-tutorial")
if (ucs.cygwin) command <- paste(command, "--no-pv")
ucs.system(command)

Please wait until command execution has completed and the R prompt reappears before proceeding. Note that you will not be able to see the screen output of the command in the Windows version of R.

This solution makes use of undocumented internal functions and configuration variables of UCS/R. In a similar way, experienced R programmers will be able to employ the UCS/Perl tools for data manipulation tasks in a UCS/R script.

Read a UCS data set from an uncompressed (.ds) or compressed (.ds.gz) file. Here we read in the German adjective+noun data from Evert, Heid, and Lezius (2000). The data set has been annotated with association scores for all standard measures supported by the UCS/Perl system (see "ucsdoc UCS::AM").

AN <- read.ds.gz("glaw.scores.ds.gz")

When a plain filename (rather than a full path) is passed to the
read.ds.gz() function and the corresponding data set file does not exist in
the current directory (as is the case here), UCS/R will automatically
search the global data set directory tree for a matching file. This
automatic search may not be available on some systems. If the command
above fails to load the data set, you need to specify the full absolute or
relative path to the data set:

` AN <- read.ds.gz("../../DataSet/glaw.scores.ds.gz")`

Note that this command will only work when the R interpreter has been run
from the UCS/R base directory.

This data set contains 4652 rows (= pair types) and more than 30 columns (= variables).

dim(AN)

The 30+ variables comprise the core variables id (a numerical unique ID); l1 and l2 (the pair type); f, f1, f2, and N (the frequency signature); the user-defined variable n.accept (manually annotated true positives); as well as more than 20 association scores.

colnames(AN)
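Individual variables can be inspected with standard R subsetting. For instance, the following command (purely illustrative, using base R functions) displays the first rows of the core variables:

` head(AN[, c("id", "l1", "l2", "f", "f1", "f2", "N", "n.accept")])`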

The ds.find.am() function returns the names of all association measures whose scores are annotated in the data set. These names will be needed later for the evaluation graphs.

ds.find.am(AN)

For the built-in association measures of the UCS toolkit, short descriptions are available and can be accessed with the am.key2desc() function:

am.key2desc(ds.find.am(AN))

A UCS data set file will usually contain only AM scores but not the corresponding rankings. The following command returns an empty list:

ds.find.am(AN, rank=TRUE)

Rankings can easily be added with the add.ranks() function.

AN <- add.ranks(AN)

By default, rankings are computed for all annotated AMs, but it is also
possible to specify an explicit list with the optional keys parameter, e.g.

` AN <- add.ranks(AN, keys=c("MI", "MI2"))`

Ties in the rankings are broken randomly (using the "random" AM, which
must be annotated). With the optional "randomise=FALSE" parameter,
each group of tied candidates is assigned the lowest free rank (see
"?add.ranks" for details).
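As an illustration (a sketch only; see "?add.ranks" for the full interface), the existing MI ranking could be re-computed without random tie-breaking as follows:

` AN <- add.ranks(AN, keys="MI", randomise=FALSE)`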

It is also possible to combine loading and annotation of the ranks into a
single command:

` AN <- add.ranks( read.ds.gz("Distrib/glaw.scores.ds.gz") )`

Rankings for all association measures are now available in the data set:

ds.find.am(AN, rank=TRUE)

See the following manpages for information about functions that list and
manipulate variable names relating to association measures.

` ?ds.find.am`

` ?ds.key2var`

` ?builtin.ams`

The order.by.am() function is used internally to compute rankings. It can
also be accessed directly in order to sort a data set.

` ?order.by.am`

The following example sorts the data set by log-likelihood scores and
displays some relevant columns for the 30 highest-ranking candidates:

index <- order.by.am(AN, "log.likelihood")
AN[index[1:30], c("l1", "l2", "f", "f1", "f2", "am.log.likelihood")]

In order to create evaluation graphs, it is necessary to mark true positives among the candidates. The GLAW data set contains manual annotations by two coders (see Evert, Heid, and Lezius 2000 for details): the variable "n.accept" specifies the number of coders who accepted a given candidate as a collocation.

However, the annotations have to be made explicit in the form of a Boolean variable (a "logical" in R terminology) named "b.TP", which assumes the value TRUE for true positives. The following command defines true positives as candidates that were accepted by at least one of the annotators. Other criteria can be defined in a similar way.

AN$b.TP <- (AN$n.accept > 0)
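A stricter criterion, e.g. requiring that both coders accepted a candidate, could be defined in the same way (shown for illustration only; it is not used in the remainder of this tutorial):

` AN$b.TP <- (AN$n.accept == 2)`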

Now it is easy to compute the number of true positives in the data set and the baseline precision (number of TPs / size of the data set).

TPs <- sum(AN$b.TP)
baseline <- round(100 * TPs / nrow(AN), digits=2)
cat(TPs, "true positives, baseline precision =", baseline, "%\n")

The standard evaluation procedure for association measures (used in a collocation extraction task) computes precision and recall for n-best lists of different sizes (Evert & Krenn 2001; Evert & Krenn 2003; Evert 2004).

The evaluation results for all n-best lists can be combined into precision and recall graphs. Both graph types (as well as other variants that will be introduced later) are generated with the evaluation.plot() function. One advantage of this bundling of functionality is that all graph types offer the same range of customisation through the same interface.

Let us begin with a precision graph for the log-likelihood measure. The first argument of the evaluation.plot() function is the data set to be evaluated, which must be annotated with association score rankings. Recall that the variable "b.TP" must be defined to identify true positives (evaluation.plot() will complain if the variable is undefined). The second argument is the name of the AM whose precision values are to be displayed.

evaluation.plot(AN, "log.likelihood")

The x-axis of this graph shows the number of candidates in the n-best list (i.e. n), and the y-axis shows the corresponding precision value as a percentage. Note that the y-axis is automatically scaled to an appropriate range. We can illustrate the interpretation of the precision graph by marking 1000-best and 2000-best lists with the "show.nbest" option. The intersection of each vertical line with the log-likelihood precision curve gives the respective n-best precision value.

evaluation.plot(AN, "log.likelihood", show.nbest=c(1000,2000))

Precision graphs for different association measures can be combined into a single plot, which allows for a direct visual comparison of their performance. The evaluation.plot() function accepts a vector of character strings as its second argument, giving the names of up to 10 measures. Plots combining more than 5 precision graphs tend to become quite confusing, though. The following command compares evaluation results for the log-likelihood, t-score, chi-squared, and mutual information (MI) measures. Following Krenn (2000) and Evert & Krenn (2001), ranking by cooccurrence frequency is used as a non-statistical baseline.

evaluation.plot(AN, c("log.likelihood", "frequency", "t.score", "chi.squared", "MI"))

The dotted horizontal line shows the baseline precision for the full data set. This is the expected performance for random selection of candidates, which can be simulated with the "random" association measure. Obviously, an AM is only useful for an application when it achieves substantially better precision than the baseline.

evaluation.plot(AN, c("log.likelihood", "random"))

When producing repeated plots for the same combination of AMs, it is convenient to assign the vector of names to a variable. The following graph returns to the five measures from above and adds vertical lines for 500-best, 1000-best and 2000-best lists as a visual aid for the comparison.

measures <- c("log.likelihood", "frequency", "t.score", "chi.squared", "MI")
evaluation.plot(AN, measures, show.nbest=c(500,1000,2000-10,2000+10))

Note that substantial differences between the measures are only found for n <= 2000 (a little trick has allowed us to show n=2000 as a double line). When a large proportion (say, 80%) of the candidates is considered, the precision graphs become very similar and converge towards the baseline. The reason is, of course, that at this point (almost) all the true positives have been identified by the measures. Increasing the value of n further will only add false positives to the n-best lists, gradually lowering precision.

For a meaningful interpretation of the precision graphs, we should compare them not only to the baseline, but also to the upper bound, i.e. to an "ideal" association measure whose scores allow a perfect distinction between true and false positives. We can easily simulate such a perfect AM, which we will call the "optimal" measure. This measure assigns a score of 1 to every true positive, and 0 to every false positive. Note that we have to compute a ranking for this measure with the add.ranks() function, adding the option "overwrite=FALSE" to avoid unnecessary re-computation of existing rankings. (Don't worry about the assignment: R has a lazy evaluation strategy and will not create a complete physical copy of the AN data set in the add.ranks() function.)

AN$am.optimal <- as.integer(AN$b.TP)
AN <- add.ranks(AN, overwrite=FALSE)
evaluation.plot(AN, c(measures, "optimal"), show.nbest=c(500,1000,2000-10,2000+10))

The comparison shows that there is considerable room for improvement. However, when you plot precision graphs for the other AMs supported by the UCS toolkit you will find that the log-likelihood curve constitutes a "practical" upper bound. Perhaps the most interesting question in current research on association measures is whether it is possible to narrow the gap between the best-performing "real" AM and the "ideal" measure.

For the comparison of the five association measures we have chosen, only the range up to n=2000 is of interest. We can "zoom in" on this part of the plot with the "x.max" option. The additional options "x.min", "y.max" and "y.min" allow us to display an arbitrary rectangular region, but they are needed less often.

evaluation.plot(AN, measures, show.nbest=c(500,1000), x.max=2000)

While collocation extraction with association measures often focuses on high precision (even if recall is low), it can be important for some applications that a substantial proportion of the true positives among the candidates is identified. The evaluation.plot() function can also draw recall graphs, which are activated with the option y.axis="recall". As usual with R, the keyword "recall" can be abbreviated to an unambiguous prefix, e.g. "rec". In the same way, the names of association measures can be abbreviated as long as the specified prefix is unique within the data set.
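For instance, the recall graph below could equally be requested with the abbreviated keyword (illustration only):

` evaluation.plot(AN, measures, y.axis="rec")`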

evaluation.plot(AN, measures, y.axis="recall")

You can change the appearance of the evaluation graphs in many ways with optional parameters to the evaluation.plot() function. The following example adds a heading to the plot and suppresses the baseline precision display.

evaluation.plot(AN, measures, x.max=2000, title="Evaluation on German adjective-noun pairs", show.baseline=FALSE)

The labels shown in the legend box, which default to the UCS names of the evaluated association measures, can be overridden with the "legend" parameter. This allows us to display shorter, nicely formatted or more meaningful names, as well as mathematical expressions.

evaluation.plot(AN, measures, x.max=2000, legend=expression(G^2, f, t, chi^2, MI))

The default fonts used by R are fairly small and can be difficult to read when the graphics are presented to a larger audience. The "cex" parameter specifies an expansion factor for labels and headings in the plot. The line width of precision graphs and most other lines can be increased with the "lex" parameter. Note that the latter is not an expansion factor but adds a fixed amount to the default line widths.

evaluation.plot(AN, measures, x.max=2000, cex=1.5, lex=1)

The defaults for most style-related settings can be controlled with the ucs.par() function, which is similar to R's built-in par() function for graphics parameters. The defaults can be queried by passing their names to ucs.par() as character strings.

ucs.par("cex", "lex")

They can be modified with the usual "name=value" syntax. We will now set the character expansion factor to 1.3 for the remainder of this tutorial.

ucs.par(cex=1.3)

The ucs.par() function also gives control over the colours and line styles of the precision graphs (which cannot be overridden directly). The style parameters "col", "lty" and "lwd" are vectors specifying the colours, line types and line widths for up to ten association measures, respectively. Shorter vectors will be recycled as necessary. The following example changes the colours to shades of blue and makes the lines "crumble" into dots. When used to set new defaults, ucs.par() returns a list containing the old values. This list can later be passed to the function in order to return to the previous settings.

old.pars <- ucs.par(col=c("#0000FF", "#0000CC", "#000099", "#000055", "#000000"), lty=c("solid", "21", "11", "12", "14"))
print(old.pars)
evaluation.plot(AN, measures, x.max=2000)
ucs.par(old.pars)

You may have noticed that the initial parts of the precision (and recall) graphs are missing. By default, precision values are not shown for n-best lists with n < 100 because the graphs become very unstable. As an extreme example, precision is either 0% or 100% for n=1. The cutoff point is controlled with the "n.first" parameter. Setting "n.step" to some value k will show precision values for every k-th value of n only, leading to smoother graphs (but keep in mind that this may mask random variation and lead to spurious effects if "n.step" is set too high). We set "n.first" to 50 for the rest of this tutorial.

evaluation.plot(AN, measures, x.max=2000, y.max=70, n.first=1)
evaluation.plot(AN, measures, x.max=2000, n.step=10)
ucs.par(n.first=50)

When a "file" argument is given, the evaluation plot will be saved to an Encapsulated PostScript file with the specified name rather than displayed on screen.

evaluation.plot(AN, measures, file="temp/tut1.eps")

Try viewing the file "temp/tut1.eps" with "gv" or "ghostview" (or GSView on Windows). You will notice that it takes a very long time to open and display the EPS file. The reason is that each precision curve consists of more than 4,500 vertices, one for each possible n-best list. In order to keep the file size reasonable and speed up display, we will compute only every 10-th n-best list from now on, which is done by setting n.step=10. Note that the size of "tut1.eps" is reduced from some 300 kBytes to less than 63 kBytes.

ucs.par(n.step=10)
evaluation.plot(AN, measures, file="temp/tut1.eps")

By default, the total size of the graphic will be 6 by 6 inches, which can be changed with the "plot.width" and "plot.height" parameters. In most cases, there will be no need to change the defaults. However, it is often desirable to ensure that the plot region is exactly square (by setting "aspect=1") or has an oblong shape (when multiple plots will be combined side by side or on top of each other later on). Note that aspect settings apply only to EPS files and will not affect the on-screen display.

evaluation.plot(AN, measures, file="temp/tut2.eps", aspect=2)
evaluation.plot(AN, measures, file="temp/tut2.eps", aspect=1)

When evaluation graphs are to be included in printed articles, they will usually have to be in black and white, which can be achieved by setting "bw=TRUE". It is often desirable to choose a large font ("cex") and draw heavier lines ("lex") so that the labels and precision graphs are easily discernible when the plot is scaled down for a two-column layout. Note that the line style for black and white graphs can be controlled by setting the "bw.col", "bw.lty" and "bw.lwd" defaults with the ucs.par() function.

evaluation.plot(AN, measures, file="temp/tut3.eps", aspect=1, bw=TRUE, cex=1.6, lex=1)

Especially for small data sets and n-best lists, the observed evaluation results may at least in part be due to random variation. It is therefore necessary to perform significance tests in order to make sure that there is sufficient evidence for a true effect. Evert (2004, Sec. 5.3) makes a detailed argument for the necessity of significance tests and describes appropriate methods for two tasks: (i) the evaluation of a single measure; (ii) the comparison of two different measures.

Task (i) is addressed by confidence intervals for the precision values of a single association measure, which are computed when the option "conf=TRUE" is passed to the evaluation.plot() function. In addition, the name of the measure has to be specified with the "conf.am" parameter. As usual, the name can be abbreviated to a prefix that is unique among the evaluated measures. At the default 95% confidence level, we can be 95% sure that the "true precision", averaged over many evaluation experiments under similar conditions, is contained in the confidence interval. The following example combines confidence intervals for log-likelihood with the precision graphs of the four other measures.

evaluation.plot(AN, measures, x.max=2000, conf=TRUE, conf.am="log.l")

A different confidence level can be selected by setting "conf" to the corresponding significance level rather than TRUE, e.g. 0.01 for 99% confidence or 0.001 for 99.9% confidence. Confidence intervals for up to two association measures can be combined in a single plot; in the black and white version they are drawn as shaded regions.
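For instance, 99% confidence intervals for the log-likelihood measure could be requested as follows (an illustrative variant of the commands below):

` evaluation.plot(AN, measures, x.max=2000, conf=0.01, conf.am="log.l")`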

m2 <- c("log.likelihood", "frequency")
evaluation.plot(AN, m2, x.max=2000, conf=TRUE, conf.am="log.l", conf.am2="f")
evaluation.plot(AN, m2, x.max=2000, conf=TRUE, conf.am="log.l", conf.am2="f", file="temp/tut4.eps", aspect=1, bw=TRUE, lex=1)

Task (ii): At first glance it would seem that there is no significant difference between the log-likelihood measure and a ranking by cooccurrence frequency. Their confidence intervals overlap, so that the true precision could be the same for both measures. The fallacy of this argument lies in the fact that the evaluation results of the two measures are not independent. Since the rankings are based on the same set of candidates, the precision values achieved by the two measures are highly correlated, i.e. they tend to deviate from the true precision in the same direction. Therefore, a much more sensitive test can be used to detect significant differences. Such a test is activated with the option "test=TRUE", specifying the two measures to be tested as "test.am1" and "test.am2". In the evaluation plot, yellow triangles mark significant differences.

evaluation.plot(AN, m2, x.max=2000, test=TRUE, test.am1="log.l", test.am2="f")

The default confidence level is 95% and can be changed in the same way as for the confidence intervals. The spacing of the n-best lists for which such significance tests are carried out is controlled by the "test.step" parameter, which is given in multiples of n.step and defaults to 10 (this relatively large value is chosen both for presentation and performance reasons).

evaluation.plot(AN, m2, x.max=2000, test=TRUE, test.am1="log.l", test.am2="f", test.step=5)

For large n-best lists, even minuscule differences between precision values can become significant (in statistics, this is a well-known problem of the balance between effect size and amount of evidence). An experimental option tests whether the observed results provide significant evidence for a "substantial" difference between the true precision values of the two measures. Setting "test.relevant=1" defines "substantial" as a difference of at least one percentage point and marks the corresponding n-best lists with red triangles. Note that the algorithm involves a number of guesses and approximations, and will print some debugging information for every relevant difference that is detected.

evaluation.plot(AN, m2, x.max=2000, test=TRUE, test.am1="log.l", test.am2="f", test.step=5, test.relevant=1)

In black and white mode (especially when saving to an EPS file), the triangles are filled in light grey (indicating significance) or dark grey (indicating relevance).

evaluation.plot(AN, m2, x.max=2000, test=TRUE, test.am1="log.l", test.am2="f", test.step=5, test.relevant=1, file="temp/tut5.eps", aspect=1, bw=TRUE, lex=1)

The most intuitive way to assess the practical relevance of differences between association measures is a plot of precision against recall, with recall shown on the x-axis and precision on the y-axis. This plot type is activated by passing the option x.axis="recall" to the evaluation.plot() function. Precision-by-recall graphs can be computed as transformations of the precision graphs, with n-best lists corresponding to diagonal lines. One advantage of plotting precision against recall is that differences between the measures often become much more conspicuous and it is usually not necessary to zoom in to a part of the graph.

evaluation.plot(AN, measures, x.axis="recall", show.nbest=c(500,1000,2000-10,2000+10))

All the features described above also work for precision-by-recall graphs, including confidence intervals and significance tests. Significant differences between the results of two association measures are indicated by coloured arrows rather than triangles, though.

evaluation.plot(AN, m2, x.axis="recall", test=TRUE, test.am1="log.l", test.am2="f", test.relevant=1, lex=1)

The precision graphs considered so far all measure the "cumulative precision" of n-best lists. This evaluation procedure is based on the assumption that an association measure will rank most of the true positives at the top, and that the density of TPs, the "local precision", will continue to fall as one moves down the list. Plotting local precision directly can help to throw light on the properties of association measures, and is achieved by setting y.axis="local". The evaluation.plot() function uses a Gaussian kernel to estimate local precision. The approximate number of candidates taken into account is controlled by the "window" parameter and defaults to 400. The example below shows that most association measures behave in the expected way. For MI, however, the density of true positives remains almost constant across the first half of the ranking.

evaluation.plot(AN, measures, y.axis="local", window=500)

When saving plots to EPS files, one usually wants to display the graphs
also on the screen, requiring two invocations of the evaluation.plot()
function with nearly identical parameter lists. The evaluation.file()
function automates this procedure, producing both a screen and a file
version of the evaluation plot. The "file" parameter is ignored for the
screen version, and "bw=TRUE" is automatically added for the file version.
While testing and fine-tuning the plots, saving to an EPS file can
conveniently be de-activated by setting

` ucs.par(do.file=FALSE)`

evaluation.file(AN, m2, x.axis="recall", test=TRUE, test.am1="log.l", test.am2="f", test.relevant=1, lex=1, file="temp/tut6.eps", aspect=1.5)