hands_on_day1.R
hands_on_day2.R
– hands_on_day2_input_formats.R
– hands_on_day2_matrix_factorization.R
verb_dep.txt.gz (21.6 MB),
adj_noun_tokens.txt.gz (8.3 MB),
delta_de_termdoc.txt.gz (18.4 MB),
potter_l2r2.txt.gz (51.3 MB),
potter_lemmas.txt.gz (1.1 MB),
VSS.txt (37 kB)
hands_on_day3.R
– hands_on_day3_exercise_1.R
– hands_on_day3_exercise_2.R
hands_on_day4.R
– bonus material: schuetze1998.R
hands_on_day5.R
Source packages can be installed on all platforms, but require a full development environment. Binary packages are available for macOS and Windows, but can only be installed in compatible R versions. Right-click and select "download" / "save as" so that packages aren't unpacked automatically.
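Installing a downloaded source package from a local file might look like the following minimal sketch. The helper name `install_local` and the file name in the usage comment are hypothetical; substitute the file you actually downloaded. The `repos = NULL` and `type = "source"` arguments are standard options of `install.packages()`.

```r
# Hypothetical helper: install a source package (.tar.gz) from a local file.
# Building from source requires a full development environment.
install_local <- function(path) {
  if (!file.exists(path)) {
    stop("Package file not found: ", path)
  }
  # repos = NULL tells R to install from the local file, not from a repository
  install.packages(path, repos = NULL, type = "source")
}

# Usage (the file name is an assumption, not taken from the download page):
# install_local("wordspaceEval_0.2.tar.gz")
```

The existence check is there because an automatically unpacked download leaves no archive file to install from.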
wordspaceEval v0.2 for R 4.x: source/Linux – Mac OS X – Windows (login required)
wordspaceEval v0.1 for R 3.x: source/Linux – Mac OS X – Windows (login required)
part2_examples.R, part2_input_formats.R
verb_dep.txt.gz (21.6 MB),
adj_noun_tokens.txt.gz (8.3 MB),
delta_de_termdoc.txt.gz (18.4 MB),
potter_l2r2.txt.gz (51.3 MB),
potter_lemmas.txt.gz (1.1 MB)
part3_examples.R, part3_exercise.R
part4_exercise.R
example_session.R
Source packages can be installed on all platforms, but require a full development environment. For Mac OS X and Windows, binary packages are available, but can only be installed in the current R version 3.3.x. Right-click and select "download" / "save as" so that packages aren't unpacked automatically.
wordspace from CRAN (including its dependencies)
tm and quanteda from CRAN
wordspaceEval v0.1: source/Linux – Mac OS X – Windows (login required)
Pre-compiled DSMs for use with the wordspace package for R. Each model is contained in an .rda file and can be loaded into R with the command load("model.rda").
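The object name stored inside an .rda file is not always known in advance, so it can help to capture the return value of `load()`. The sketch below uses a tiny stand-in matrix saved to a temporary file, purely so that it runs without any download. The name `toy_model` and its contents are made up for illustration; a real model file would take its place.

```r
# Stand-in for a downloaded model: a tiny matrix saved to a temporary .rda file.
# (Hypothetical data, used only to keep the example self-contained.)
toy_model <- matrix(c(1, 0, 2, 3), nrow = 2,
                    dimnames = list(c("cat", "dog"), c("f1", "f2")))
model_file <- file.path(tempdir(), "model.rda")
save(toy_model, file = model_file)
rm(toy_model)

# load() restores the saved objects under their original names and
# returns those names as a character vector.
loaded_names <- load(model_file)
print(loaded_names)

# Fetch the restored object by name and inspect its dimensions.
model <- get(loaded_names[1])
print(dim(model))
```

With a real model file, only the `load()` and `get()` steps are needed.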
These models were compiled from WP500, a 200-million word subset of the Wackypedia corpus, comprising the first 500 words of each article. Each model covers a vocabulary of the 50,000 most frequent content words (lemmatized) in the corpus and has at least 50,000 feature dimensions.
Some publicly available pre-trained neural embeddings, converted into .rda format for use with the wordspace package.
Mach5 is a DSM-based system that participated in the CogALex-V Shared Task on the corpus-based identification of semantic relations. In order to ensure reproducibility and help researchers carry out follow-up experiments, the full implementation of Mach5 including the underlying co-occurrence data can be downloaded here. For a minimal reproduction, download all three .rda data sets and put them in your R working directory together with the main script mach5.R. You will also need to install packages e1071 and wordspace from CRAN.
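One way to check for the two required CRAN packages before running the script is sketched below. The helper name `missing_packages` is hypothetical, and the installation step is guarded so that it only runs in an interactive session.

```r
# Hypothetical helper: return those packages from 'pkgs' that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("e1071", "wordspace"))

if (length(to_install) > 0) {
  message("Not yet installed: ", paste(to_install, collapse = ", "))
  # Install from CRAN only when running interactively
  if (interactive()) install.packages(to_install)
}
```

Once both packages are available and the data sets sit in the working directory, the main script can presumably be run with `source("mach5.R")`. This is an assumption about how the script is meant to be started.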
Verb + object noun co-occurrences (pair tokens) from the British National Corpus:
A 5-million word corpus of English Harry Potter fan-fiction:
_
pos, pre-cleaned with stopword filter
A distributional semantic model (DSM) for 34,150 English nouns, derived from the 2-billion-word ukWaC Web corpus: