
  	PP+Verb Pairs from the FR Corpus (PNV-FR)


DESCRIPTION

This work aims to create a standard frequency database of German PP+Verb combinations from the Frankfurter Rundschau (FR) corpus [1].  Since the PP in these combinations is represented as Prep:Noun, we use the acronym PNV-FR for the resulting data set.

The corpus has been part-of-speech tagged with TreeTagger [2], lemmatised and annotated with morphological information from the IMSLex morphology [3], and chunk-parsed with the shallow parser YAC [4].

PP-verb combinations were extracted as syntactic cooccurrences in the terminology of [5] (called relational cooccurrences in [6]), i.e. instances of a syntactic PP-verb relation (including both PP-complements and PP-adjuncts). 

Eligible PPs are maximal, non-embedded PPs (which are not attached to an NP or adjective) identified by the YAC chunker. They are represented by a combination of preposition and lemmatised nominal head. Fused preposition-article combinations are normalised to the preposition and a "+" indicator, e.g. "zum", "zur" => "zu+". Eligible verbs are all content verbs heading a "verbal complex" annotated by YAC, and are represented by the verb lemma. Cooccurrence tokens are formed by eligible verbs and PPs in the same sentence or subclause, making use of the experimental and incomplete subclause annotation of YAC.
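
The Prep:Noun representation described above can be sketched in Python as follows. This is only an illustration, not the code used to build the data set. The function name pp_key and the lookup table are hypothetical. Only the mapping of "zum" and "zur" to "zu+" is documented here; the other table entries, and the lowercasing of the preposition, are assumptions that follow the same pattern.

```python
# Hypothetical lookup table for fused preposition-article forms.
# Only "zum" and "zur" are documented in this README; every other
# entry is an assumption made by analogy.
FUSED_PREPOSITIONS = {
    "zum": "zu+",
    "zur": "zu+",
    "im": "in+",     # assumption
    "ins": "in+",    # assumption
    "am": "an+",     # assumption
    "vom": "von+",   # assumption
    "beim": "bei+",  # assumption
}


def pp_key(preposition: str, noun_lemma: str) -> str:
    """Build a Prep:Noun key from a surface preposition and a noun lemma."""
    prep = preposition.lower()  # assumption: prepositions are lowercased
    prep = FUSED_PREPOSITIONS.get(prep, prep)
    return f"{prep}:{noun_lemma}"
```

With this sketch, pp_key("in", "Frage") gives "in:Frage", the form used in the example further down.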

The nearest-neighbour extraction strategy (N) used here balances recall against precision, pairing each verb with the nearest eligible PP within the same sentence or subclause. This results in a considerably smaller data set with fewer false positives than the recall-oriented extraction strategy (R) used in previous work [6,7], which considers all pairings of verbs and eligible PPs as instances of a syntactic PP-verb relation. However, neither recall nor precision will be very high. One reason for choosing strategy (N) was that it does not inflate the marginal frequencies of PPs and verbs as much as strategy (R).

By way of illustration, consider the following German sentence, where content verbs and eligible PPs have been marked with brackets:

	Er [V: stellte] [PP: in Frage],  ob diese Zeilen [PP: aus  ihrer eigenen Feder] [V: stammen].
	He     called        in question if these lines       from her   own     pen        come.
	"He called in question whether she had written these lines herself."

Strategy (R) would pair every verb with every PP in the sentence, yielding four cooccurrence tokens:

	(stellen, in:Frage)
	(stellen, aus:Feder)
	(stammen, in:Frage)
	(stammen, aus:Feder)
	
It thus inflates the marginal frequencies of the verbs and PPs by a factor of two. Strategy (N), on the other hand, pairs each verb with the nearest PP only, yielding two cooccurrence tokens:

	(stellen, in:Frage)
	(stammen, aus:Feder)
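
The two strategies can be sketched in Python as follows. This is a simplified illustration, not the actual extraction code. The Chunk class, the function names and the token positions are hypothetical. The sketch treats its input as a single sentence or subclause, counts positions from 0 with punctuation included, and resolves ties in favour of the first PP in the list; all three choices are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    label: str  # verb lemma or Prep:Noun key
    start: int  # position of the first token of the chunk
    end: int    # position of the last token of the chunk


def distance(pp: Chunk, verb: Chunk) -> int:
    """Number of intervening tokens + 1, negative if the PP is left of the verb."""
    if pp.end < verb.start:
        return -(verb.start - pp.end)
    return pp.start - verb.end


def strategy_r(verbs, pps):
    """Recall-oriented: pair every verb with every eligible PP."""
    return [(v.label, p.label) for v in verbs for p in pps]


def strategy_n(verbs, pps):
    """Nearest-neighbour: pair each verb with its closest eligible PP."""
    pairs = []
    for v in verbs:
        if pps:
            nearest = min(pps, key=lambda p: abs(distance(p, v)))
            pairs.append((v.label, nearest.label))
    return pairs


# The example sentence: "Er stellte in Frage, ob diese Zeilen aus ihrer
# eigenen Feder stammen." (positions are assumptions, see above)
verbs = [Chunk("stellen", 1, 1), Chunk("stammen", 12, 12)]
pps = [Chunk("in:Frage", 2, 3), Chunk("aus:Feder", 8, 11)]
```

For this input, strategy_r(verbs, pps) returns the four pairs listed for strategy (R), and strategy_n(verbs, pps) returns the two pairs listed for strategy (N).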


FILE FORMAT

The frequency database is provided in two different forms: (1) a list of cooccurrence tokens with additional information from the annotated corpus (such as surface forms and morphosyntactic features), and (2) a list of pair types with basic frequency information ("frequency signatures" [6]) in UCS data set format [8].

The token list (1) is a gzip-compressed text file named

	de_pnv_fr_N.tokens.gz

where each cooccurrence token is given on a separate line with the following TAB-delimited fields:

    p:n     PP (normalised preposition : lemma of nominal head)
    v       content verb (lemmatised)
    dist    distance between PP and verbal complex of the content verb
            (number of intervening tokens + 1, negative if PP to the left of verb)
    case    partially disambiguated case of PP (Nom,Gen,Dat,Akk separated by blanks)
    num     partially disambiguated number of PP (Sg,Pl separated by blanks)
    det     determination of PP (Def = definite article, Ind = indefinite, Nil = no article)
    pp_form surface form of complete PP
    v_form  surface form of complete verbal complex around content verb
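
A minimal Python sketch for reading the token list is given below. The names CoocToken and read_tokens are hypothetical. The sketch assumes that the file has no header line, that every line contains exactly the eight fields listed above, and that the text is UTF-8 encoded; pass a different encoding if that does not hold for your copy of the file.

```python
import gzip
from typing import Iterator, List, NamedTuple


class CoocToken(NamedTuple):
    pn: str          # PP as Prep:Noun
    verb: str        # lemmatised content verb
    dist: int        # signed distance between PP and verbal complex
    case: List[str]  # remaining case readings, e.g. ["Dat", "Akk"]
    num: List[str]   # remaining number readings, e.g. ["Sg"]
    det: str         # Def, Ind or Nil
    pp_form: str     # surface form of the complete PP
    v_form: str      # surface form of the verbal complex


def read_tokens(path: str, encoding: str = "utf-8") -> Iterator[CoocToken]:
    """Yield one CoocToken per line of a gzip-compressed token list."""
    with gzip.open(path, "rt", encoding=encoding) as fh:
        for line_no, line in enumerate(fh, start=1):
            fields = line.rstrip("\r\n").split("\t")
            if len(fields) != 8:
                raise ValueError(
                    f"line {line_no}: expected 8 fields, got {len(fields)}"
                )
            pn, verb, dist, case, num, det, pp_form, v_form = fields
            yield CoocToken(
                pn, verb, int(dist), case.split(), num.split(),
                det, pp_form, v_form,
            )
```

Because the function is a generator, the file can be processed line by line, e.g. with "for tok in read_tokens('de_pnv_fr_N.tokens.gz'): ...", without loading it into memory.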

The UCS data set of pair types (2) in file

	de_pnv_fr_N.ds.gz

is a gzip-compressed, TAB-delimited table with a single header row preceded by optional comment lines. It contains the following variables (columns) for each pair type (row):

    id      row number (used as unique ID for pair type in file)
    l1      PP (normalised preposition : lemma of nominal head)
    l2      content verb (lemmatised)
    f       cooccurrence frequency of pair type (l1, l2)
    f1      marginal frequency of PP (l1)
    f2      marginal frequency of verb (l2)
    N       total number of cooccurrence tokens extracted from the corpus

See [8] for further details and a suite of processing tools for the data set format.
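
If the UCS tools are not at hand, the data set can also be read with a few lines of Python. The sketch below is not part of the UCS toolkit; the function names are hypothetical. It assumes that comment lines start with "#" and that the file is UTF-8 encoded. The second function derives the usual 2x2 contingency table of observed frequencies from a frequency signature (f, f1, f2, N), and the third gives the expected cooccurrence frequency under independence.

```python
import gzip


def read_pair_types(path, encoding="utf-8"):
    """Yield one dict per pair type, keyed by the column names in the header."""
    int_columns = {"id", "f", "f1", "f2", "N"}
    header = None
    with gzip.open(path, "rt", encoding=encoding) as fh:
        for line in fh:
            line = line.rstrip("\r\n")
            if not line:
                continue
            if header is None:
                if line.startswith("#"):  # assumed comment marker
                    continue
                header = line.split("\t")
                continue
            row = dict(zip(header, line.split("\t")))
            for name in row.keys() & int_columns:
                row[name] = int(row[name])
            yield row


def contingency_table(f, f1, f2, n):
    """Observed frequencies (O11, O12, O21, O22) for a pair type."""
    return (f, f1 - f, f2 - f, n - f1 - f2 + f)


def expected_frequency(f1, f2, n):
    """Expected cooccurrence frequency if PP and verb were independent."""
    return f1 * f2 / n
```

For each row returned by read_pair_types, contingency_table(row["f"], row["f1"], row["f2"], row["N"]) gives the counts from which most association measures are computed.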


IMPORTANT NOTE

If you use this frequency database for the MWE 2008 Shared Task, you will notice that some MWE candidates (even those marked as frequent in the PNV gold standard) are not found in the corpus frequency data (derived from the same corpus as the gold standard), or occur with very low frequency only.

This is in part due to the different extraction strategy (the original data sets for the gold standard were extracted with strategy (R)), and in part to improvements in YAC and other annotation tools (the gold standard is based on an earlier annotated version of the FR corpus).


REFERENCES

[1] The FR corpus is part of the ECI Multilingual Corpus I distributed by ELSNET.  See http://www.elsnet.org/eci.html for more information and licensing conditions.

[2] Schmid, Helmut (1994). Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing (NeMLaP), pages 44-49.

[3] Lezius, Wolfgang; Dipper, Stefanie; Fitschen, Arne (2000). IMSLex - representing morphological and syntactical information in a relational database. In U. Heid, S. Evert, E. Lehmann, and C. Rohrer (eds.), Proceedings of the 9th EURALEX International Congress, pages 133-139, Stuttgart, Germany.

[4] Kermes, Hannah (2003). Off-line (and On-line) Text Analysis for Computational Lexicography. Ph.D. thesis, IMS, University of Stuttgart. Arbeitspapiere des Instituts fuer Maschinelle Sprachverarbeitung (AIMS), volume 9, number 3.

[5] Evert, Stefan (to appear 2008). Corpora and collocations. In A. Luedeling and M. Kytoe (eds.), Corpus Linguistics. An International Handbook, chapter 58. Mouton de Gruyter, Berlin.

[6] Evert, Stefan (2004). The Statistics of Word Cooccurrences: Word Pairs and Collocations. Dissertation, Institut fuer maschinelle Sprachverarbeitung, University of Stuttgart. Published in 2005, URN urn:nbn:de:bsz:93-opus-23714. Available from http://www.collocations.de/phd.html.

[7] Krenn, Brigitte (2000). The Usual Suspects: Data-Oriented Models for the Identification and Representation of Lexical Collocations, volume 7 of Saarbruecken Dissertations in Computational Linguistics and Language Technology. DFKI & Universitaet des Saarlandes, Saarbruecken, Germany.

[8] See http://www.collocations.de/software.html for more information and downloads.
