
  	Adj+Noun Pairs from the FR Corpus (AN-FR)


DESCRIPTION

This work aims to create a standard frequency database of German Adj+Noun combinations from the Frankfurter Rundschau (FR) corpus [1].  Only prenominal adjectives modifying the head of a non-embedded NP are considered.

The corpus has been part-of-speech tagged with TreeTagger [2], lemmatised and annotated with morphological information from the IMSLex morphology [3], and chunk-parsed with the shallow parser YAC [4].

Adjective-noun combinations are natural examples of syntactic cooccurrences in the terminology of [5] (called relational cooccurrences in [6]), i.e. instances of a direct syntactic relation (adjectival modification within NPs). They can easily be identified in POS-tagged corpora with high accuracy [7]. For the present database, the <ap> and <np> annotations made by YAC were used, and all APs directly embedded in a maximal NP were extracted. Both the adjective (head of the AP) and the noun (head of the NP) were lemmatised. 

Various filters were applied to improve the quality of the data, including the following:
    - invariant adjectives (mostly derived from place names, such as "Berliner") were excluded (based on YAC annotation)
    - adverbially used adjectives that had been incorrectly annotated as <ap> were removed
    - only common nouns were accepted as head of the NP, excluding proper nouns
    - adjectives and nouns containing special characters were filtered out
    - short strings (up to 2 characters), which are usually acronyms or annotation errors, were filtered out
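
The last two filters operate on the lemma strings alone and can be approximated with a few lines of Python. This is a minimal sketch, not the code that was used to build the database: the function names are hypothetical, and since the exact set of "special characters" is not specified here, the sketch assumes that only letters and word-internal hyphens are acceptable.

```python
def is_clean_lemma(lemma: str, min_length: int = 3) -> bool:
    """Return True if a lemma passes the two string-level filters."""
    # Strings of up to 2 characters are rejected (acronyms, annotation errors).
    if len(lemma) < min_length:
        return False
    # Assumption: a lemma must start and end with a letter ...
    if not (lemma[0].isalpha() and lemma[-1].isalpha()):
        return False
    # ... and may otherwise contain only letters and hyphens.
    return all(ch.isalpha() or ch == "-" for ch in lemma)


def keep_pair(adj: str, noun: str) -> bool:
    """Keep a pair only if both the adjective and the noun are clean."""
    return is_clean_lemma(adj) and is_clean_lemma(noun)
```

For example, keep_pair("groß", "Haus") accepts the pair, while keep_pair("groß", "EU") rejects it because the noun is too short.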

FILE FORMAT

The frequency database is provided in two different forms: (1) a list of cooccurrence tokens with additional information from the annotated corpus (such as surface form and morphosyntactic features), and (2) a list of pair types with basic frequency information (frequency signatures [6]) in UCS data set format [8].
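
The relation between the two forms can be illustrated with a short Python sketch. It assumes that the pair types in (2) result from plain counting over the (adjective, noun) lemma pairs in (1), without any further frequency threshold; whether this holds exactly for the distributed files is not stated here. The function name frequency_signatures is hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def frequency_signatures(
    pairs: Iterable[Tuple[str, str]]
) -> List[Tuple[str, str, int, int, int, int]]:
    """Aggregate (adjective, noun) tokens into (l1, l2, f, f1, f2, N) rows."""
    pairs = list(pairs)
    f = Counter(pairs)                      # cooccurrence frequency per pair type
    f1 = Counter(adj for adj, _ in pairs)   # marginal frequency of each adjective
    f2 = Counter(noun for _, noun in pairs) # marginal frequency of each noun
    n_tokens = len(pairs)                   # total number of cooccurrence tokens
    return [
        (adj, noun, f[(adj, noun)], f1[adj], f2[noun], n_tokens)
        for (adj, noun) in sorted(f)
    ]


# Toy input with four tokens (made-up data, not taken from the corpus).
toy_tokens = [("groß", "Haus"), ("groß", "Haus"), ("groß", "Stadt"), ("alt", "Haus")]
toy_types = frequency_signatures(toy_tokens)
```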

The token list (1) is a text file where each cooccurrence token is given on a separate line with the following TAB-delimited fields:

    adj     adjective (lemmatised)
    n       noun (lemmatised)
    length  length of AP in words (may include adverbs, PP-complements, etc.)
    dist    distance between AP and head of the NP (assumed to be rightmost noun)
            (defined as the number of tokens between AP and head noun + 1)
    case    partially disambiguated case of NP (Nom,Gen,Dat,Akk separated by blanks)
    num     partially disambiguated number of NP (Sg,Pl separated by blanks)
    det     determination of NP (Def = definite article, Ind = indefinite, Nil = no article)
    form    surface form of the relevant part of the NP (from AP to head noun)
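
A line of the token list can be split into these fields as sketched below. The sketch assumes that the file has no header line, that length and dist are written as plain integers, and that every line has exactly eight fields; the class and function names are hypothetical, and the example line is made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CooccurrenceToken:
    adj: str          # adjective (lemmatised)
    n: str            # noun (lemmatised)
    length: int       # length of AP in words
    dist: int         # tokens between AP and head noun, plus 1
    case: List[str]   # remaining case readings, e.g. ["Nom", "Akk"]
    num: List[str]    # remaining number readings, e.g. ["Sg"]
    det: str          # Def, Ind or Nil
    form: str         # surface form from AP to head noun


def parse_token_line(line: str) -> CooccurrenceToken:
    """Parse one TAB-delimited line of the token list."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, found {len(fields)}")
    adj, n, length, dist, case, num, det, form = fields
    # case and num hold blank-separated alternatives (partial disambiguation).
    return CooccurrenceToken(
        adj, n, int(length), int(dist), case.split(), num.split(), det, form
    )


# Made-up example: the AP is directly adjacent to the head noun, so dist = 0 + 1.
example_line = "groß\tHaus\t1\t1\tNom Akk\tSg\tDef\tgroße Haus\n"
example_token = parse_token_line(example_line)
```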

The UCS data set of pair types (2) is a TAB-delimited table with a single header row preceded by optional comment lines. It contains the following variables (columns) for each pair type (row):

    id      row number (used as unique ID for pair type in file)
    l1      adjective (lemmatised)
    l2      noun (lemmatised)
    f       cooccurrence frequency of pair type (l1, l2)
    f1      marginal frequency of adjective (l1)
    f2      marginal frequency of noun (l2)
    N       total number of cooccurrence tokens extracted from the corpus
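
Such a table can be read with the Python standard library alone, as in the following sketch. It assumes that comment lines start with "#" and that the header row uses exactly the variable names listed above; the function name read_ucs and the sample data are hypothetical. The character encoding of the files is not stated here and has to be chosen when opening them.

```python
import csv
import io
from typing import Dict, Iterable, List

INTEGER_COLUMNS = ("id", "f", "f1", "f2", "N")


def read_ucs(stream: Iterable[str]) -> List[Dict[str, object]]:
    """Read a UCS-style table, skipping comment lines before the header."""
    # Assumption: comment lines are marked by a leading "#".
    data_lines = (line for line in stream if not line.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        # Frequencies and the row ID are converted from text to integers.
        for key in INTEGER_COLUMNS:
            row[key] = int(row[key])
        rows.append(row)
    return rows


# Made-up sample with one comment line, the header row and one pair type.
sample = (
    "# hypothetical comment line\n"
    "id\tl1\tl2\tf\tf1\tf2\tN\n"
    "1\tgroß\tHaus\t2\t3\t3\t4\n"
)
sample_rows = read_ucs(io.StringIO(sample))
```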

See [8] for further details and a suite of processing tools for the data set format.
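
For readers who do not use these tools, the sketch below shows how a frequency signature (f, f1, f2, N) expands into the observed 2x2 contingency table and the expected cooccurrence frequency under independence. Pointwise mutual information is included only as one illustrative association measure; it is not prescribed by this database, and the function names are hypothetical.

```python
import math
from typing import Tuple


def contingency_table(f: int, f1: int, f2: int, N: int) -> Tuple[int, int, int, int]:
    """Return the observed frequencies (O11, O12, O21, O22)."""
    o11 = f                  # adjective l1 with noun l2
    o12 = f1 - f             # adjective l1 with another noun
    o21 = f2 - f             # another adjective with noun l2
    o22 = N - f1 - f2 + f    # neither l1 nor l2
    return o11, o12, o21, o22


def expected_cooccurrence(f1: int, f2: int, N: int) -> float:
    """Expected frequency of the pair if adjective and noun were independent."""
    return f1 * f2 / N


def pointwise_mi(f: int, f1: int, f2: int, N: int) -> float:
    """Pointwise mutual information: log2 of observed over expected frequency."""
    return math.log2(f / expected_cooccurrence(f1, f2, N))
```

With the made-up signature f = 10, f1 = 100, f2 = 50, N = 10000, the contingency table is (10, 90, 40, 9860) and the expected frequency is 0.5, so the pair occurs 20 times more often than expected by chance.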


IMPORTANT NOTE

If you use this frequency database for the MWE 2008 Shared Task, you will notice that some MWE candidates are not found in the corpus frequency data (derived from the same corpus as the gold standard), or occur with very low frequency only.

Such gaps arise mostly because the gold standard is based on an earlier annotated version of the FR corpus, and YAC and other annotation tools have been improved since then.



REFERENCES

[1] The FR corpus is part of the ECI Multilingual Corpus I distributed by ELSNET.  See http://www.elsnet.org/eci.html for more information and licensing conditions.

[2] Schmid, Helmut (1994). Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing (NeMLaP), pages 44-49.

[3] Lezius, Wolfgang; Dipper, Stefanie; Fitschen, Arne (2000). IMSLex - representing morphological and syntactical information in a relational database. In U. Heid, S. Evert, E. Lehmann, and C. Rohrer (eds.), Proceedings of the 9th EURALEX International Congress, pages 133-139, Stuttgart, Germany.

[4] Kermes, Hannah (2003). Off-line (and On-line) Text Analysis for Computational Lexicography. Ph.D. thesis, IMS, University of Stuttgart. Arbeitspapiere des Instituts fuer Maschinelle Sprachverarbeitung (AIMS), volume 9, number 3.

[5] Evert, Stefan (to appear 2008). Corpora and collocations. In A. Luedeling and M. Kytoe (eds.), Corpus Linguistics. An International Handbook, chapter 58. Mouton de Gruyter, Berlin.

[6] Evert, Stefan (2004). The Statistics of Word Cooccurrences: Word Pairs and Collocations. Dissertation, Institut fuer maschinelle Sprachverarbeitung, University of Stuttgart. Published in 2005, URN urn:nbn:de:bsz:93-opus-23714. Available from http://www.collocations.de/phd.html.

[7] Evert, Stefan and Kermes, Hannah (2003). Experiments on candidate data for collocation extraction. In Companion Volume to the Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics, pages 83-86.

[8] See http://www.collocations.de/software.html for more information and downloads.
