
	Czech Dependency Bigrams from the Prague Dependency Treebank (PDT-MWE)

Pavel Pecina <pecina@ufal.mff.cuni.cz>, Mon Feb 18, 2007

1. Preamble

    1.1 Source

        The PDT MWE candidate data was extracted from the Prague Dependency
        Treebank 2.0, see http://ufal.mff.cuni.cz/pdt2.0/

        The PDT 2.0 has been developed by the Institute of Formal and Applied 
        Linguistics and the Center for Computational Linguistics, Charles 
        University, Prague (see http://ufal.mff.cuni.cz/).

        The full PDT 2.0 data is available from LDC, catalog number LDC2006T01.

    1.2 License

        The PDT MWE candidate data is made available under the terms of the
        Creative Commons Attribution-Noncommercial (CC-BY-NC) license, version
        3.0 unported. You may use it for academic research and all
        non-commercial purposes as long as the author (Pavel Pecina) is
        properly credited. See http://creativecommons.org/licenses/by-nc/3.0/
        for a full description and explanation of the licensing terms.


2. MWE Gold Standard Data [available as separate download]

3. MWE Frequency Data

    3.1 Description

        A list of dependency bigrams occurring in the PDT more than five times
        and having part-of-speech patterns that can possibly form a collocation
        (the same as in Section 3 of this document). Each bigram is provided with
        information about its occurrence frequency in the form of a contingency
        table. For details, see e.g. [3].

    3.2 Data format

        Data format follows these rules:

        * Each data file contains a list of bigrams, each starting on a new line.

        * Each bigram record consists of ten fields, described in the list below.
          Fields are separated by a single tab.

        * Data files are available in ISO-8859-2 (Latin 2) and UTF-8 encoding.

        Field 1: LEMMA1
          Lemma of the first word. This is the "lemma proper" (without technical
          suffixes) of PDT 2.0, see section 2.1. "Lemma structure" of the
          "Manual" or
          http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/m-layer/html/ch02s01.html

        Field 2: POSTAG1
          Reduced part-of-speech tag of the first word. This is the
          concatenation of the 1st, 3rd, 10th, and 11th characters of the
          PDT 2.0 morphological tag (positional tag), see section 2.2.1.1.
          "Part of speech" of the "Manual" or
          http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/m-layer/html/ch02s02s01.html#POS

        Field 3: DEPREL1
          The value "Head" if the first word is the head of the bigram,
          the simplified dependency type otherwise.

        Field 4: LEMMA2
          Lemma of the second word.

        Field 5: POSTAG2
          Reduced part-of-speech tag of the second word.

        Field 6: DEPREL2
          The value "Head" if the second word is the head of the bigram,
          the simplified dependency type otherwise.

        Field 7: AVALUE
          Frequency of the dependency bigram.

        Field 8: BVALUE
          Frequency of the second word not being in a dependency relation with
          the first word.

        Field 9: CVALUE
          Frequency of the first word not being in a dependency relation with
          the second word.

        Field 10: DVALUE
          Frequency of all other dependency bigrams in the PDT.
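
        The record layout above can be read with a few lines of Python. The
        following is a minimal sketch, not part of the distributed tools: the
        names Bigram, parse_bigram, and reduce_tag are hypothetical, and the
        sample values in the usage note are invented for illustration only.

```python
from typing import NamedTuple


class Bigram(NamedTuple):
    """One record of the frequency data (field names are illustrative)."""
    lemma1: str
    postag1: str
    deprel1: str
    lemma2: str
    postag2: str
    deprel2: str
    a: int  # Field 7: AVALUE
    b: int  # Field 8: BVALUE
    c: int  # Field 9: CVALUE
    d: int  # Field 10: DVALUE


def parse_bigram(line: str) -> Bigram:
    """Split one tab-separated line into its ten fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 10:
        raise ValueError(f"expected 10 fields, got {len(fields)}")
    # The first six fields stay strings; the last four are frequencies.
    return Bigram(*fields[:6], *(int(value) for value in fields[6:]))


def reduce_tag(positional_tag: str) -> str:
    """Concatenate the 1st, 3rd, 10th, and 11th characters of a positional tag."""
    return "".join(positional_tag[i] for i in (0, 2, 9, 10))
```

        For example, parse_bigram("x\tAI-A\tAtr\ty\tNI-A\tHead\t12\t340\t25\t150000")
        yields a record whose a, b, c, and d attributes hold the four
        contingency table cells as integers.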

    3.3 Files

        pdt20-mwe-frequency-all.latin2.dat
        pdt20-mwe-frequency-all.utf8.dat
         - All MWE candidates in one file.

        folds/pdt20-mwe-frequency-fold[1-7].latin2.dat
        folds/pdt20-mwe-frequency-fold[1-7].utf8.dat
         - The list of MWE candidates split into seven stratified folds.
           See [3] for details.
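
        The fold files can be located and read as sketched below. This is an
        illustration only: the helper names (fold_paths, split_folds,
        read_records) are hypothetical, and holding out one fold while using
        the remaining six is an assumed usage pattern, not something this
        document prescribes.

```python
from pathlib import Path

# Maps the encoding tag used in the file names to a Python codec name.
CODECS = {"utf8": "utf-8", "latin2": "iso-8859-2"}


def fold_paths(data_dir, encoding_tag="utf8"):
    """Return the paths of the seven fold files for one encoding."""
    folds_dir = Path(data_dir) / "folds"
    return [
        folds_dir / f"pdt20-mwe-frequency-fold{i}.{encoding_tag}.dat"
        for i in range(1, 8)
    ]


def split_folds(paths, held_out):
    """Separate one fold (numbered 1 to 7) from the remaining six."""
    rest = [p for i, p in enumerate(paths, start=1) if i != held_out]
    return rest, paths[held_out - 1]


def read_records(path, encoding_tag="utf8"):
    """Read the non-empty lines of a data file in the matching encoding."""
    with open(path, encoding=CODECS[encoding_tag]) as handle:
        return [line.rstrip("\r\n") for line in handle if line.strip()]
```

        Choosing the codec from the file name tag avoids decoding errors when
        the Latin 2 and UTF-8 variants are mixed up.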


4. Tools

    4.1 Ranking script: rank.pl

        A "QuickStart" script for ranking MWE candidates extracted from the
        Prague Dependency Treebank 2.0 and provided with frequency information
        in the form of contingency tables. It produces a list of MWE
        candidates ranked by decreasing Pointwise Mutual Information.

        Example usage: ./rank.pl pdt20-mwe-frequency-all.latin2.dat > rank.list
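
        For readers who want to reproduce the ranking outside Perl, Pointwise
        Mutual Information can be computed from the four contingency table
        cells as sketched below. This uses the standard textbook formula and
        is not a transcription of rank.pl, whose internals are not shown
        here; the function names pmi and rank_by_pmi are hypothetical.

```python
import math


def pmi(a, b, c, d):
    """Pointwise Mutual Information from contingency table cells.

    a: frequency of the bigram itself
    b, c: frequencies of each word occurring without the other
    d: frequency of all other bigrams
    """
    n = a + b + c + d
    # log2 of the observed joint probability over the product of the two
    # marginal probabilities; swapping b and c does not change the result.
    # Assumes a > 0, which holds for bigrams that occur in the data.
    return math.log2(a * n / ((a + b) * (a + c)))


def rank_by_pmi(candidates):
    """Sort (label, (a, b, c, d)) pairs by decreasing PMI."""
    return sorted(candidates, key=lambda item: pmi(*item[1]), reverse=True)
```

        A table with independent words, such as pmi(25, 25, 25, 25), scores
        zero, and stronger associations score higher.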

    4.2 Evaluation script: eval.pl

        A "QuickStart" script for evaluating a ranked list of MWE candidates
        extracted from the Prague Dependency Treebank 2.0. It computes Average
        Precision for the full n-best list based on the gold standard data.

        Example usage: ./eval.pl rank.list pdt-mwe-gold-standard.latin2.dat
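
        Average Precision over a ranked list can be sketched as follows. This
        is the common definition (the mean of the precision values at the
        ranks where a true MWE is found) and may differ in detail from what
        eval.pl does; the function name and the in-memory inputs are
        assumptions, since the gold standard file format is not described in
        this document.

```python
def average_precision(ranked, true_mwes):
    """Mean of precision values at each rank where a true MWE occurs.

    ranked: candidate identifiers, best first
    true_mwes: set of identifiers annotated as true MWEs
    """
    hits = 0
    precision_sum = 0.0
    for rank, candidate in enumerate(ranked, start=1):
        if candidate in true_mwes:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Normalizes by the true MWEs found in the list; for a full ranked
    # list this equals the total number of true MWEs.
    return precision_sum / hits if hits else 0.0
```

        For the list ["a", "b", "c", "d"] with true MWEs {"a", "c"}, the
        precision values at the two hits are 1/1 and 2/3, so the result is
        their mean, 5/6.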

5. Baseline

        As a baseline, we consider Pointwise Mutual Information (implemented
        in rank.pl), with AP=64.87% measured on all PDT MWE candidate data.


6. Acknowledgements

        Jan Hajic for granting the license.

        Three anonymous linguists for annotating the data.


7. References

        [1] Jiri Hana and Daniel Zeman: Manual for Morphological Annotation,
        Revision for the Prague Dependency Treebank 2.0. UFAL Technical
        Report No. 2005-27, Charles University, Czech Republic, 2002.

        [2] J. Hajic et al.: A Manual for Analytic Layer Tagging of the
        Prague Dependency Treebank (in Czech). UFAL Technical Report
        TR-1997-03, Charles University, Czech Republic, 1997.

        [3] Pavel Pecina and Pavel Schlesinger: Combining Association
        Measures for Collocation Extraction. Proceedings of the 21st
        International Conference on Computational Linguistics and 44th
        Annual Meeting of the Association for Computational Linguistics
        (COLING/ACL 2006), Sydney, Australia, July 2006.

        [4] Pavel Pecina: An Extensive Empirical Study of Collocation
        Extraction Methods. Proceedings of the 43rd Annual Meeting of the
        Association for Computational Linguistics (ACL 2005), Student
        Research Workshop, Ann Arbor, Michigan, June 2005.