NAME

ucsexp - Introduction to UCS expressions and wildcard patterns

INTRODUCTION

UCS expressions and wildcard patterns are two central features of the UCS/Perl system, and they are largely responsible for its convenience and flexibility.

UCS wildcard patterns are used by most command-line tools to select data set variables with the help of shell-like wildcard characters (?, *, and %). A programmer interface is provided by the UCS::Match function from the UCS module (see the UCS manpage).

UCS expressions give easy access to data set variables from Perl code. With only a basic knowledge of Perl syntax, users can compute association scores and select rows from a data set (using the ucs-add and ucs-select utilities). The programmer interface is provided by the UCS::Expression module (see the UCS::Expression manpage for details). Before reading "UCS EXPRESSIONS", you should become familiar with the UCS data set format and variable naming conventions as described in the ucsfile manpage.

When used on the command line, wildcard patterns usually have to be quoted to keep the shell from expanding the wildcards. (By default, GNU Bash passes an unquoted pattern through unchanged when no file names match it, so the problem only shows up when there happen to be matching files in the current directory.) Note that when a list of variable names and patterns is passed to one of the UCS/Perl utilities, each name or wildcard pattern has to be quoted individually.

UCS expressions (almost) always have to be quoted on the command line. Single quotes ('...') are highly recommended to avoid interpolation of variables and other meta-characters. The UCS/Perl utilities expect a UCS expression to be passed as a single argument, so the expression must be written as one string. In particular, any expression containing whitespace must be quoted.

UCS WILDCARD PATTERNS

As described in the ucsfile manpage, UCS variable names may only contain the alphanumeric characters (A-Z a-z 0-9) and the period (.), which serves as a general-purpose word delimiter. There is a fixed set of core variables, whose names do not contain a period. All other variable names must begin with a prefix (one of am. r. b. n. x. f.) that determines the data type of the variable. The three wildcard characters take the special role of the period into account. Their meanings are

  ? ... a single character, except "."
  * ... a string that does NOT contain a "."
  % ... an arbitrary string of characters

The % wildcard is typically used to select variable names with a specific prefix or suffix, while * matches the individual words (or parts of words) in a complex variable name.

Examples

a pattern without wildcard characters corresponds to a literal variable name: id, O11, am.log.likelihood
the pattern * matches all core variables (and nothing else); % matches all variable names
O* matches the derived variables O11, O12, O21, and O22; *11 matches O11 and E11, but no complex variable names
prefix patterns allow us to select variables by their type, e.g. am.% for all association scores, or f.% for all user-defined string variables (factors); the * wildcard is inappropriate here because the variable names may contain additional periods after the prefix
when variable names are chosen systematically, prefix patterns can also be used to select meaningful groups of variables: am.chi.squared% matches all association scores that are derived from a chi-squared test, and am.%.pv matches all association scores that can be interpreted as probability values (see the UCS::AM and UCS::AM::HTest manpages for more information)
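
The meanings of the three wildcard characters can be made concrete with a small Perl sketch that translates a pattern into a regular expression. This is only an illustration of the rules listed above, not the implementation of UCS::Match: the helper names wildcard_to_regex and matching_names are hypothetical, and the list of variable names is a made-up sample (am.chi.squared.pv in particular is invented for the example).

```perl
use strict;
use warnings;

# Hypothetical helper, not part of UCS/Perl: translate a UCS wildcard
# pattern into an anchored Perl regular expression.
sub wildcard_to_regex {
    my ($pattern) = @_;
    my %wildcard = (
        '?' => '[^.]',     # a single character, except "."
        '*' => '[^.]*',    # a string that does not contain a "."
        '%' => '.*',       # an arbitrary string of characters
    );
    # Replace wildcard characters, quote everything else literally.
    my $regex = join '',
        map { exists $wildcard{$_} ? $wildcard{$_} : quotemeta($_) }
        split //, $pattern;
    return qr/\A(?:$regex)\z/;
}

# Hypothetical helper: return the names that match a wildcard pattern.
sub matching_names {
    my ($pattern, @names) = @_;
    my $regex = wildcard_to_regex($pattern);
    return grep { $_ =~ $regex } @names;
}

# Made-up sample of variable names (an assumption for this example).
my @names = qw(id l1 l2 f f1 f2 N
               am.log.likelihood am.chi.squared am.chi.squared.pv
               r.t.score f.pos);

for my $pattern ('*', 'f?', 'am.%', 'am.*', 'am.%.pv') {
    my @hits = matching_names($pattern, @names);
    printf "%-8s => %s\n", $pattern, join(' ', @hits) || '(no match)';
}
```

In this sample, am.* selects nothing because every name with the am. prefix contains a further period, which is why am.% is the appropriate prefix pattern.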

UCS EXPRESSIONS

A UCS expression consists of ordinary Perl code extended with a special syntax to access data set variables. This code is compiled on the fly and applied to the rows of a data set one at a time. The return value of a UCS expression is the value of the last statement executed, unless there is an explicit return statement. When the expression is used as a condition to select rows from a data set, it evaluates to true or false according to the usual Perl rules (undef, the empty string '', the number 0 and the string '0' are false; everything else is true).

Data set variables are accessed by their variable name enclosed in % characters. They evaluate to the respective value for the current row in the data set and can be used like ordinary scalar variables in Perl. Thus, %f% corresponds to the cooccurrence frequency f of a pair type, %l1% and %l2% to its component lexemes, and %am.log.likelihood% to an association score from the log-likelihood measure. Derived variables (see the ucsfile manpage) do not have to be annotated explicitly in a data set. When necessary, they are computed on the fly from a pair type's frequency signature. Variable references should be treated as read-only (they are automatically localised so that assigning a new value to a UCS variable reference does not modify the original data set).
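
To make the idea of compiling an expression and applying it row by row more tangible, here is a toy sketch in plain Perl. It is not how UCS::Expression is implemented: the functions compile_toy_expression and evaluate_toy are hypothetical, rows are represented as simple hashes, derived variables are not supported, and the sample data are made up. The sketch merely rewrites each %name% reference into a hash lookup and lets Perl compile the result.

```perl
use strict;
use warnings;

# Toy stand-in for the real compiler (hypothetical, for illustration only):
# rewrite each %name% reference into a lookup in the current row, then
# compile the result into an anonymous sub so that "return" works.
sub compile_toy_expression {
    my ($code) = @_;
    (my $perl = $code) =~ s/%([A-Za-z0-9.]+)%/'$row->{"' . $1 . '"}'/ge;
    my $sub = eval "sub { my (\$row) = \@_;\n$perl\n}";
    die "Cannot compile expression: $@" unless defined $sub;
    return $sub;
}

# Evaluate a compiled expression on a copy of the row, so that assignments
# inside the expression cannot modify the original data.
sub evaluate_toy {
    my ($sub, $row) = @_;
    my %copy = %$row;
    return $sub->(\%copy);
}

# Made-up rows (an assumption for this example).
my @rows = (
    { l1 => 'black', l2 => 'box',      f => 7, f1 => 100, f2 => 35 },
    { l1 => 'black', l2 => 'darkness', f => 1, f1 => 100, f2 => 4  },
);

my $condition = compile_toy_expression('%f% >= 5 and %l1% eq "black"');
my $min_ratio = compile_toy_expression(
    'my $r1 = %f% / %f1%; my $r2 = %f% / %f2%; $r1 < $r2 ? $r1 : $r2'
);

for my $row (@rows) {
    my $selected = evaluate_toy($condition, $row) ? 'yes' : 'no';
    my $ratio    = evaluate_toy($min_ratio, $row);
    print "$row->{l1} $row->{l2}: selected=$selected, min ratio=$ratio\n";
}
```

Wrapping the code in an anonymous sub is one simple way to obtain the "value of the last statement, unless there is an explicit return" behaviour described above.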

Any temporary variables needed by the Perl code should be made lexical by declaring them with the my keyword. Variable names beginning with an underscore (such as $_f or $_n_total) are reserved for internal use. Please don't use global variables, which pollute the namespaces and might interfere with other parts of the program. If you feel that you absolutely need a variable to carry information from one row to the next, use a fully qualified variable name in your own namespace.

Since a UCS expression is compiled by the Perl interpreter, it offers the full power and flexibility of Perl, but it also shares its idiosyncrasies and traps for the unwary. You should have a good working knowledge of Perl in order to write UCS expressions. If you don't know the difference between == and eq, now is the time to type perldoc perl and start reading the Perl documentation.

Just as in Perl, data types are automatically converted as necessary. Missing values (which appear as NA in data set files) are represented by undef in Perl. When there may be missing values in a data set, test for definedness (e.g. with defined(%b.colloc%)) to avoid warning messages. UCS expressions can use all standard Perl functions (described on the perlfunc manpage). In addition, the utility functions from UCS::Expression::Func (see the UCS::Expression::Func manpage for a detailed description) and a range of special mathematical and statistical functions defined in the UCS::SFunc module (see the UCS::SFunc manpage for a complete listing and details) are imported automatically and can be used without qualification.

UCS Expressions for Programmers

The programmer interface to UCS expressions is provided by the UCS::Expression module (see the UCS::Expression manpage), with functions for compiling and evaluating UCS expressions. The UCS::DS::Memory module includes several methods that apply a UCS expression to the in-memory representation of a UCS data set. Note that all built-in association measures are implemented as UCS expressions (see the UCS and UCS::AM manpages for more information, or have a look at the source files).

When you want to use external functions (either defined by your own module or imported from a separate module), they must be fully qualified. For instance, you must write Math::Trig::atan(1) instead of just atan(1). Make sure that the module is loaded (with use Math::Trig;) before the expression is evaluated for the first time. You can just put the use statement in the Perl script or module where the UCS expression is defined, and it is probably also safe to include it in the expression itself (which allows you to use external libraries even in UCS expressions typed on the command line).

Parameters are an advanced feature of UCS expressions that is only available through the programmer interface. They play the role of constants in UCS expressions: they can be accessed like data set variables, but their values are fixed and stored within the UCS::Expression object. Parameter names must be valid UCS identifiers and should be all uppercase in order to avoid conflicts with variable names. Parameters must be declared and initialised when the UCS expression is compiled. Their values can be changed with the set_param method. See the UCS::Expression manpage for more information.

Examples

The simplest UCS expressions compare the value of a data set variable with a constant. Recall that == is used for numerical comparison and eq for string comparison in Perl. Both operands will automatically be converted into an appropriate data type.
```
  %f% == 1             # hapax legomena (single occurrences)

  %f% >= 5             # pair types with cooccurrence freq. >= 5

  %l1% eq "black"      # first component type is "black"
```
Since UCS expressions are essentially short Perl scripts, the # character can be used to introduce line comments. String variables can also be matched against Perl regular expressions:
```
  %l2% =~ /ness$/      # second component ends in ...ness
```
Such simple comparisons can be combined into complex Boolean expressions. Use of the lexical operators and, or, and not is recommended for readability (and to avoid confusion with the bitwise operators). Parentheses can also improve readability and help to avoid ambiguities.
```
  %f% >= 5 and %f% < 10        # pair types in frequency range 5 .. 9
  
  # pair types that are ranked high by t-score, but not by log-likelihood
  (%r.t.score% <= 100) and not (%r.log.likelihood% <= 100)
```
Missing values (NA) in a data set can be detected with Perl's defined operator. It may be useful to test data set variables before using them in order to avoid warning messages. The following examples assume a user-defined integer variable n.accept, which lists the number of annotators who have accepted a particular pair type as a collocation.
```
  not defined(%n.accept%)     # selects rows where n.accept has the value NA
  
  %n.accept% >= 1             # will print warnings for all NA values

  defined(%n.accept%) and (%n.accept% >= 1)  # this is safe
```
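
The difference between the unsafe and the safe test can be reproduced in plain Perl, where undef stands in for NA. The following sketch is only an approximation of what happens inside a UCS expression: the functions accepted_unsafe and accepted_safe are hypothetical stand-ins for the two expressions above, the counts are made up, and warnings are collected through a __WARN__ handler so that they can be counted.

```perl
use strict;
use warnings;

# Collect warnings instead of printing them, so that they can be counted.
my @warnings;
$SIG{__WARN__} = sub { push @warnings, $_[0] };

# Plain-Perl stand-ins for the two UCS expressions; undef plays the role of NA.
sub accepted_unsafe { my ($n) = @_; return $n >= 1 ? 1 : 0 }
sub accepted_safe   { my ($n) = @_; return defined($n) && $n >= 1 ? 1 : 0 }

my @n_accept = (2, undef, 0);    # made-up annotation counts

my @unsafe = map { accepted_unsafe($_) } @n_accept;
my $warned_unsafe = scalar @warnings;    # one warning, caused by the undef

@warnings = ();
my @safe = map { accepted_safe($_) } @n_accept;
my $warned_safe = scalar @warnings;      # no warnings

print "unsafe: @unsafe (warnings: $warned_unsafe)\n";
print "safe:   @safe (warnings: $warned_safe)\n";
```

Both versions select the same rows; the only difference is that the unguarded comparison triggers a "Use of uninitialized value" warning for every missing value.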
UCS expressions may contain multiple Perl statements, which must be separated by semicolon (;) characters. In this way, a complex formula can be broken down into smaller parts. The value of the expression is determined by the last statement (or by an explicit return command). Temporary variables that hold intermediate values should always be declared with lexical scope (using my). The first example computes the minimum of two frequency ratios, using the pre-declared min() function from UCS::Expression::Func.
```
  # UCS expressions may also extend over multiple lines
  my $ratio1 = %f% / %f1%;
  my $ratio2 = %f% / %f2%;
  min($ratio1, $ratio2);      # min() is pre-declared
```
The second example shows how temporary variables can be used to replace missing values with defaults. Here the integer variable n.accept (for the number of annotators that accepted the given pair type as a collocation) defaults to 0.
```
  my $n = (defined %n.accept%) ? %n.accept% : 0;
  $n >= 1;
```
The third example identifies prime numbers used as ID values.
```
  return 0 if %id% < 2;                 # 0 and 1 are not prime
  foreach my $x (2 .. int(sqrt(%id%))) {
    return 0 if (%id% % $x) == 0;       # found a divisor
  }
  return 1;
```

Dirty Tricks

Things not to do ...

Global variables can be used to carry information from one row to the next (while lexicals will be re-instantiated and possibly initialised for each row they are applied to). In order to avoid namespace pollution, put the global variable in a namespace of your own. The example below uses a global variable in a made-up namespace (scrap) to compute partial sums for the numerical variable x.weight.
```
  $scrap::partial_sum += %x.weight%;
```
Of course, this expression will only work once. After that, the variable $scrap::partial_sum must be reset to zero. As long as the first row in the data set has an id value of 1, we can use the following trick (be careful when using the UCS::DS::Memory module, where index activation might change the order of the rows).
```
  $scrap::partial_sum = 0 if %id% == 1;
  $scrap::partial_sum += %x.weight%;
```
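
The effect of the reset trick can be checked with a plain-Perl sketch that applies the same two statements to a list of rows twice. The function partial_sum_for_row is a hypothetical stand-in for the UCS expression, rows are represented as simple hashes, and the weights are made up; the sketch assumes that rows are processed in order, starting with id 1.

```perl
use strict;
use warnings;

# Stand-in for a UCS expression that is applied to one row at a time.
# The package variable $scrap::partial_sum survives between calls.
sub partial_sum_for_row {
    my ($row) = @_;
    $scrap::partial_sum = 0 if $row->{'id'} == 1;    # reset on the first row
    $scrap::partial_sum += $row->{'x.weight'};
    return $scrap::partial_sum;
}

# Made-up rows (an assumption for this example).
my @rows = (
    { 'id' => 1, 'x.weight' => 0.5 },
    { 'id' => 2, 'x.weight' => 1.5 },
    { 'id' => 3, 'x.weight' => 2.0 },
);

# Apply the "expression" to the whole data set twice.
for my $pass (1, 2) {
    my @sums = map { partial_sum_for_row($_) } @rows;
    print "pass $pass: @sums\n";
}
```

Without the reset on the first row, the second pass would continue from the final value of the first pass instead of starting again at zero.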

COPYRIGHT

This software is provided AS IS and the author makes no warranty as to its use and performance. You may use the software, redistribute and modify it under the same terms as Perl itself.