Tutorial & ESSLLI 2016 Course on Distributional Semantics
Source packages can be installed on all platforms, but require a full development environment. For Mac OS X and Windows, binary packages are available, but they can only be installed in the current R version (3.3.x). Right-click and select "Download" / "Save as" so that the packages aren't unpacked automatically.
wordspace from CRAN (including its dependencies)
wordspaceEval v0.1: source/Linux – Mac OS X – Windows (login required)
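The installation steps above can be sketched in R roughly as follows. This is a minimal sketch, not an official procedure: the CRAN mirror URL is just one common choice, and the local file name is a placeholder assumed for illustration (use the name of the file you actually downloaded).

```r
# Install wordspace (with its dependencies) from CRAN if it is not yet present.
# The mirror URL is an assumption; any CRAN mirror should work.
if (!requireNamespace("wordspace", quietly = TRUE)) {
  install.packages("wordspace", repos = "https://cloud.r-project.org")
}

# Install a manually downloaded source package.
# NOTE: placeholder file name, assumed for illustration only.
pkg_file <- "wordspaceEval_0.1.tar.gz"
if (file.exists(pkg_file)) {
  # repos = NULL makes R install from the local file instead of a repository;
  # type = "source" requires a full development environment.
  install.packages(pkg_file, repos = NULL, type = "source")
}
```

For a downloaded binary package, the same call with the binary file's path and without `type = "source"` is the usual approach.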
Pre-compiled DSMs for use with the wordspace package for R. Each model is contained in an .rda file and can be loaded into R with the load() command.
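As a hedged illustration of loading an .rda file, the sketch below uses base R's load() inside a small helper. The helper name load_rda and the toy matrix are assumptions made for this example; they are not part of wordspace or of the downloadable models. Loading into a separate environment makes it easy to see which objects a file contains without overwriting anything in the workspace.

```r
# Load an .rda file into its own environment and return its objects as a
# named list, so nothing in the global workspace is overwritten.
# (load_rda is a hypothetical helper, not a wordspace function.)
load_rda <- function(path) {
  env <- new.env()
  obj_names <- load(path, envir = env)  # load() returns the object names
  mget(obj_names, envir = env)
}

# Self-contained demo: save a toy matrix to a temporary .rda file, then load it.
toy <- matrix(1:4, nrow = 2, dimnames = list(c("cat", "dog"), c("f1", "f2")))
demo_file <- tempfile(fileext = ".rda")
save(toy, file = demo_file)

objs <- load_rda(demo_file)
names(objs)    # names of the objects stored in the file
dim(objs$toy)  # dimensions of the restored matrix
```

With a downloaded model, replace demo_file with the path to the .rda file and inspect names(objs) to find the model object.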
These models were compiled from WP500, a 200-million-word subset of the Wackypedia corpus, comprising the first 500 words of each article. Each model covers a vocabulary of the 50,000 most frequent content words (lemmatized) in the corpus and has at least 50,000 feature dimensions.
Some publicly available pre-trained neural embeddings, converted into .rda format for use with the wordspace package.
Mach5 is a DSM-based system that participated in the CogALex-V Shared Task on the corpus-based identification of semantic relations. To ensure reproducibility and help researchers carry out follow-up experiments, the full implementation of Mach5, including the underlying co-occurrence data, can be downloaded here. For a minimal reproduction, download all three .rda data sets and put them in your R working directory together with the main script mach5.R. You will also need to install the required packages, including wordspace, from CRAN.
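Before running the reproduction, it may help to check that everything is in place. The following is a minimal sketch under two assumptions: that the three data sets are the only way to satisfy the ".rda" check in the working directory, and that the main script is meant to be run with source(). Neither is confirmed by the Mach5 distribution itself.

```r
# Check the prerequisites for the minimal reproduction in the working directory.
rda_files <- list.files(pattern = "\\.rda$")  # .rda data sets found here
script <- "mach5.R"                           # main script named in the text

ready <- file.exists(script) &&
  length(rda_files) >= 3 &&
  requireNamespace("wordspace", quietly = TRUE)

if (ready) {
  # Assumption: the script is designed to be run via source().
  source(script)
} else {
  message("Not ready: need ", script, ", three .rda data sets and the ",
          "wordspace package in ", getwd())
}
```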
Verb + object noun co-occurrences (pair tokens) from the British National Corpus:
A 5-million-word corpus of English Harry Potter fan-fiction:
_pos, pre-cleaned with stopword filter
A distributional semantic model (DSM) for 34,150 English nouns, derived from the 2-billion-word ukWaC Web corpus:
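To illustrate what co-occurrence data in pair-token form (such as the verb + object noun data set listed above) looks like, the sketch below counts a handful of made-up pair tokens into a co-occurrence matrix using only base R. The toy tokens and column names are assumptions for illustration; they do not reflect the actual format or contents of the downloadable files.

```r
# Toy verb-object pair tokens: one row per observed co-occurrence.
# (Made-up data for illustration only.)
tokens <- data.frame(
  verb = c("eat", "eat", "drink", "eat", "drink"),
  noun = c("apple", "apple", "water", "bread", "tea"),
  stringsAsFactors = FALSE
)

# Cross-tabulate the tokens: rows are verbs, columns are nouns,
# and each cell holds the co-occurrence frequency of the pair.
counts <- table(tokens$verb, tokens$noun)
M <- unclass(counts)  # plain integer matrix with row and column names

M["eat", "apple"]  # frequency of the pair (eat, apple)
```

A count matrix of this kind is the usual starting point for a DSM, before any weighting or dimensionality reduction is applied.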