Distributional semantics

Tutorial: Distributional Semantics

Tutorial & ESSLLI 2016 Course on Distributional Semantics

Slides & handouts

Code examples

R packages

Source packages can be installed on all platforms, but require a full development environment. Binary packages are available for Mac OS X and Windows, but can only be installed in the current R version 3.3.x. To download a package, right-click its link and select "Download" or "Save as" so that the file isn't unpacked automatically.
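As a rough sketch of installing a downloaded package file from within R: the helper names binary_ok and install_local are hypothetical and not part of any package, and the file name in the comment is a placeholder. With repos = NULL, install.packages() should infer the package type from the file extension.

```r
# Hypothetical helper: check whether an R version falls in the 3.3.x
# series that the binary packages on this page were built for.
binary_ok <- function(v = getRversion()) {
  v <- as.package_version(v)
  v >= "3.3.0" && v < "3.4.0"
}

# Hypothetical helper: install a package from a local file.
# repos = NULL tells R to treat `file` as a local path, not a package name.
install_local <- function(file) {
  install.packages(file, repos = NULL)
}

# Report whether the running R session could use the binary packages.
print(binary_ok())

# Example call (placeholder file name, not run here):
# install_local("some_package.tar.gz")
```

If binary_ok() returns FALSE, the source package is the option that remains, which needs the development environment mentioned above.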

Distributional semantic models (DSMs)

Pre-compiled DSMs for use with the wordspace package for R. Each model is contained in an .rda file, and can be loaded into R with the command load("model.rda").
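Because load() restores objects under the names they were saved with, it can be handy to load an .rda file into a separate environment and inspect what it contains. The following is a minimal sketch using base R only; load_rda is a hypothetical helper, and the toy matrix merely stands in for a real model file so that the example runs without any downloads.

```r
# Hypothetical helper: load an .rda file into a fresh environment and
# return its objects as a named list, leaving the workspace untouched.
load_rda <- function(path) {
  env <- new.env()
  obj_names <- load(path, envir = env)  # load() returns the object names
  mget(obj_names, envir = env)
}

# Stand-in for a downloaded model file: save a toy matrix to a temp file.
tmp <- tempfile(fileext = ".rda")
toy <- matrix(1:4, nrow = 2)
save(toy, file = tmp)

objs <- load_rda(tmp)
print(names(objs))

# With a real model file and the wordspace package installed, one might
# continue along these lines (assumption, not run here):
# library(wordspace)
# model <- load_rda("model.rda")[[1]]
# nearest.neighbours(model, "book", n = 10)
```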

DSMs based on the English Wikipedia

These models were compiled from WP500, a 200-million-word subset of the Wackypedia corpus comprising the first 500 words of each article. Each model covers a vocabulary of the 50,000 most frequent content words (lemmatized) in the corpus and has at least 50,000 feature dimensions.

Neural word embeddings

Some publicly available pre-trained neural embeddings, converted into .rda format for use with the wordspace package.

CogALex-V shared task: Mach5

Mach5 is a DSM-based system that participated in the CogALex-V Shared Task on the corpus-based identification of semantic relations. To ensure reproducibility and help researchers carry out follow-up experiments, the full implementation of Mach5, including the underlying co-occurrence data, can be downloaded here. For a minimal reproduction, download all three .rda data sets and put them in your R working directory together with the main script mach5.R. You will also need to install the packages e1071 and wordspace from CRAN.
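Before running the script, it may help to check that the pieces described above are in place. The sketch below is only an illustration: mach5_ready is a hypothetical helper, and because the names of the three .rda data sets are not listed here, it merely counts the .rda files in the directory.

```r
# Hypothetical helper: report whether a directory looks ready for a
# minimal Mach5 reproduction (packages, main script, .rda data sets).
mach5_ready <- function(dir = ".") {
  pkgs <- c("e1071", "wordspace")
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  rda_files <- list.files(dir, pattern = "\\.rda$")
  list(
    missing_packages = pkgs[!installed],
    script_found = file.exists(file.path(dir, "mach5.R")),
    rda_count = length(rda_files)  # three data sets are expected
  )
}

status <- mach5_ready()
print(status)

# Possible next steps once everything is in place (not run here):
# install.packages(status$missing_packages)  # installs from CRAN
# source("mach5.R")
```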

Word spaces (old data sets for DSMs)

Verb + object noun co-occurrences (pair tokens) from the British National Corpus:

A 5-million-word corpus of English Harry Potter fan-fiction:

A distributional semantic model (DSM) for 34,150 English nouns, derived from the 2-billion-word ukWaC Web corpus:

Co-occurrence data & multiword expressions

Corpus Frequency Data for MWE Candidates

German PP-verb combinations from the Frankfurter Rundschau corpus:

German adjective-noun combinations from the Frankfurter Rundschau corpus:

Czech dependency bigrams from the Prague Dependency Treebank (PDT), version 2.0:

Evaluation Package for MWE 2008 Shared Task

Precision-recall evaluation script for the R statistical environment: