Distributional semantics

Tutorial: Distributional Semantics

ESSLLI 2021 Course on Hands-on Distributional Semantics

Slides & handouts

Part 1: Introduction (slides, handout)
Part 2: The parameters of a DSM (slides, handout)
Parts 3 + 4: Evaluation + DS beyond NLP: Linguistic issues (slides, handout)
Part 5: DS beyond NLP: Cognitive modelling (slides, handout)

Code examples

Part 1: hands_on_day1.R
Part 2: hands_on_day2.R – hands_on_day2_input_formats.R – hands_on_day2_matrix_factorization.R
data files: verb_dep.txt.gz (21.6 MB), adj_noun_tokens.txt.gz (8.3 MB), delta_de_termdoc.txt.gz (18.4 MB), potter_l2r2.txt.gz (51.3 MB), potter_lemmas.txt.gz (1.1 MB), VSS.txt (37 kB)
Part 3: hands_on_day3.R – hands_on_day3_exercise_1.R – hands_on_day3_exercise_2.R
Part 4: hands_on_day4.R – bonus material: schuetze1998.R
Part 5: hands_on_day5.R

R packages

Source packages can be installed on all platforms, but require a full development environment. For macOS and Windows, binary packages are available, but they can only be installed in compatible R versions. Right-click a link and select download / save as so that packages aren't unpacked automatically.

wordspaceEval v0.2 for R 4.x: source/Linux – Mac OS X – Windows (login required)
wordspaceEval v0.1 for R 3.x: source/Linux – Mac OS X – Windows (login required)

Tutorial & ESSLLI 2016 Course on Distributional Semantics

Slides & handouts

Part 1: Introduction (slides, handout)
Part 2: DSM Parameters (slides, handout)
Part 3: Evaluation (slides, handout)
Part 4: Matrix algebra & SVD (slides, handout)

Code examples

Part 2: part2_examples.R, part2_input_formats.R
data files: verb_dep.txt.gz (21.6 MB), adj_noun_tokens.txt.gz (8.3 MB), delta_de_termdoc.txt.gz (18.4 MB), potter_l2r2.txt.gz (51.3 MB), potter_lemmas.txt.gz (1.1 MB)
Part 3: part3_examples.R, part3_exercise.R
Part 4: part4_exercise.R
Demo session from Evert (2014): example_session.R

R packages

Source packages can be installed on all platforms, but require a full development environment. For Mac OS X and Windows, binary packages are available, but they can only be installed in the current R version 3.3.x. Right-click a link and select download / save as so that packages aren't unpacked automatically.

please install package wordspace from CRAN (including its dependencies)
optionally, install packages tm and quanteda from CRAN
wordspaceEval v0.1: source/Linux – Mac OS X – Windows (login required)
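
The installation steps listed above can be sketched in R roughly as follows. This is only a minimal sketch: the helper names (missing_packages, install_missing, install_local_source) are made up for illustration, and the path passed to install_local_source is whatever file name the downloaded wordspaceEval source package has on your machine.

```r
# CRAN packages mentioned in the list above
course_packages <- c("wordspace", "tm", "quanteda")

# Return the subset of `pkgs` that cannot be loaded yet (hypothetical helper).
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Install whatever is still missing from CRAN; by default install.packages()
# also pulls in the required dependencies.
install_missing <- function(pkgs = course_packages) {
  todo <- missing_packages(pkgs)
  if (length(todo) > 0) install.packages(todo)
  invisible(todo)
}

# Install a downloaded source package from a local file (hypothetical helper).
# `path` is the saved file; repos = NULL tells R not to look on CRAN.
install_local_source <- function(path) {
  install.packages(path, repos = NULL, type = "source")
}
```

Calling install_missing() in an interactive session installs only the packages that are not yet available. For the binary packages, the same install.packages(path, repos = NULL) call without type = "source" should normally work, provided the R version matches.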

Distributional semantic models (DSMs)

Pre-compiled DSMs for use with the wordspace package for R. Each model is contained in an .rda file, and can be loaded into R with the command load("model.rda").
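
Since load() restores objects under whatever name they were saved with, it can be convenient to load a model into a variable of your own choosing. The following is a minimal sketch using base R only; the helper name load_dsm is hypothetical, and it assumes that each .rda file contains exactly one object.

```r
# Load the single object stored in an .rda file and return it, so that the
# caller decides on the variable name (hypothetical helper, not part of wordspace).
load_dsm <- function(path) {
  env <- new.env()
  obj_names <- load(path, envir = env)  # load() returns the names of the restored objects
  # Assumption: one model per file
  stopifnot(length(obj_names) == 1)
  env[[obj_names]]
}
```

For instance, after WP <- load_dsm("WP500_DepFilter_Lemma_svd500.rda"), the model can be queried with the wordspace package, e.g. nearest.neighbours(WP, "book", n = 10).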

DSMs based on the English Wikipedia

These models were compiled from WP500, a 200-million-word subset of the Wackypedia corpus comprising the first 500 words of each article. Each model covers a vocabulary of the 50,000 most frequent content words (lemmatized) in the corpus and has at least 50,000 feature dimensions.

dependency-filtered: WP500_DepFilter_Lemma.rda (30.4 MB) – 500 latent SVD dimensions: WP500_DepFilter_Lemma_svd500.rda (175.9 MB)
dependency-structured: WP500_DepStruct_Lemma.rda (30.9 MB) – 500 latent SVD dimensions: WP500_DepStruct_Lemma_svd500.rda (176.8 MB)
L2/R2 surface span: WP500_Win2_Lemma.rda (50.1 MB) – 500 latent SVD dimensions: WP500_Win2_Lemma_svd500.rda (173.7 MB)
L5/R5 surface span: WP500_Win5_Lemma.rda (99.3 MB) – 500 latent SVD dimensions: WP500_Win5_Lemma_svd500.rda (176.5 MB)
L30/R30 surface span: WP500_Win30_Lemma.rda (295.8 MB) – 500 latent SVD dimensions: WP500_Win30_Lemma_svd500.rda (179.5 MB)
term-document model: WP500_TermDoc_Lemma.rda (101.3 MB) – 500 latent SVD dimensions: WP500_TermDoc_Lemma_svd500.rda (158.7 MB)
type contexts (L1+R1): WP500_Ctype_L1R1_Lemma.rda (55.1 MB) – 500 latent SVD dimensions: WP500_Ctype_L1R1_Lemma_svd500.rda (153.9 MB)
type contexts (L2+R2 POS tags): WP500_Ctype_L2R2pos_Lemma.rda (55.1 MB) – 500 latent SVD dimensions: WP500_Ctype_L2R2pos_Lemma_svd500.rda (172.2 MB)
word forms L2/R2: WP500_Win2_Word.rda (61.6 MB) – 500 latent SVD dimensions: WP500_Win2_Word_svd500.rda (182.0 MB)
word forms L2/R2 with non-lemmatized features: WP500_Win2_Word_WF.rda (65.9 MB) – 500 latent SVD dimensions: WP500_Win2_Word_WF_svd500.rda (182.5 MB)
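
To illustrate what "latent SVD dimensions" means, here is a minimal base-R sketch on a toy matrix. It is not the procedure used to build the models above (which may involve weighting and scaling steps not described here); the function name reduce_svd and the toy data are assumptions for illustration only. In practice, the wordspace function dsm.projection is the tool for this job.

```r
# Toy co-occurrence matrix: 6 target words x 8 features (made-up counts)
set.seed(42)
M <- matrix(rpois(6 * 8, lambda = 3), nrow = 6,
            dimnames = list(paste0("w", 1:6), paste0("f", 1:8)))

# Project the rows of M onto the first k latent dimensions via truncated SVD.
# With M = U D V', the reduced row vectors are the rows of U_k D_k.
reduce_svd <- function(M, k) {
  dec <- svd(M, nu = k, nv = 0)
  S <- dec$u %*% diag(dec$d[1:k], nrow = k)
  rownames(S) <- rownames(M)
  S
}

S <- reduce_svd(M, 3)  # each word is now a 3-dimensional vector
```

The pre-compiled _svd500 files apply the same idea at scale, reducing at least 50,000 feature dimensions to 500, which is why they can be loaded and queried directly without recomputing the decomposition.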

Neural word embeddings

Some publicly available pre-trained neural embeddings, converted into .rda format for use with the wordspace package.

word2vec: GoogleNews300_wf200k.rda (129.2 MiB)

CogALex-V shared task: Mach5

Mach5 is a DSM-based system that participated in the CogALex-V Shared Task on the corpus-based identification of semantic relations. In order to ensure reproducibility and help researchers carry out follow-up experiments, the full implementation of Mach5 including the underlying co-occurrence data can be downloaded here. For a minimal reproduction, download all three .rda data sets and put them in your R working directory together with the main script mach5.R. You will also need to install packages e1071 and wordspace from CRAN.

a minimal implementation of Mach5: mach5.R
complete experiments and plots: mach5_experiments.R
CogALex-V gold standard (train and test): cogalex_gold_standard.rda
dependency-filtered co-occurrence data from ENCOW14: encow_depfilt.rda (430 MiB)
dependency-structured co-occurrence data from ENCOW14: encow_depstruct.rda (603 MiB)
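
The minimal reproduction described above can be sketched as follows. The file names are those listed above; the helper functions missing_mach5_files and run_mach5 are hypothetical conveniences that check for the downloads before running the script.

```r
# Files needed for the minimal reproduction (names as listed above)
mach5_files <- c("mach5.R", "cogalex_gold_standard.rda",
                 "encow_depfilt.rda", "encow_depstruct.rda")

# Report which of the required files are not yet present in `dir`.
missing_mach5_files <- function(dir = getwd()) {
  mach5_files[!file.exists(file.path(dir, mach5_files))]
}

# Run the main script from `dir`, stopping early if a download is missing.
# Assumes packages e1071 and wordspace have already been installed from CRAN.
run_mach5 <- function(dir = getwd()) {
  missing <- missing_mach5_files(dir)
  if (length(missing) > 0) {
    stop("Missing files: ", paste(missing, collapse = ", "))
  }
  old_wd <- setwd(dir)
  on.exit(setwd(old_wd))  # restore the previous working directory afterwards
  source("mach5.R")
}
```

Calling run_mach5() from the directory that holds the downloads is then equivalent to source("mach5.R"), with a clearer error message if one of the large data sets has not been downloaded yet.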