## Extended exercise for part 3:
## Explore and evaluate DSM parameters
##
library(wordspace)
## evaluation tasks:
## ESSLLI08_Nouns ... clustering
## RG65 ... similarity ratings
## WordSim353 ... similarity ratings
## SemCorWSD ... word sense disambiguation (Schuetze-style)
library(wordspaceEval)
## additional non-public evaluation tasks:
## TOEFL80 ... multiple choice (synonyms)
## SPP_Items ... multiple choice (various relations)
## GEK_Items ... multiple choice (various relations)
## AP402 ... clustering
## Battig82 ... clustering
## Co-occurrence data for several DSMs based on the English Wikipedia (WP500 corpus)
## using different co-occurrence contexts are available for download from
##
## http://wordspace.collocations.de/doku.php/course:material#pre-compiled_dsms
##
## The following contexts are available:
## TermDoc ... term-document matrix
## Win30 ... 30-word span (L30/R30)
## Win5 ... 5-word span (L5/R5)
## Win2 ... 2-word span (L2/R2)
## DepFilter ... dependency-filtered
## DepStruct ... dependency-structured
## Ctype_L1R1 ... type context: left + right word (lemma)
## Ctype_L2R2pos ... type context: 2 left + 2 right POS pattern around target
##
## You can also try two non-lemmatized models, but you will have to specify the
## option format="HWLC" for all evaluation functions (see ?convert.lemma).
## Download one of the raw co-occurrence data sets (not a pre-compiled DSM) and
## read it into R. We will use the Win30 context below, but you may want to choose
## a smaller model depending on how powerful your computer is.
load("models/WP500_Win30_Lemma.rda", verbose=TRUE)
## Most models have a co-occurrence matrix of approx. 50,000 rows x 50,000 columns
WP500_Win30_Lemma
## If you have sufficient amounts of RAM (at least 8 GB) and patience (you're willing to wait
## 30 minutes or more for SVD dimensionality reduction), you can work on the full matrix.
## Otherwise it's probably a good idea to reduce the number of rows and columns with the
## subset() function, which has a special method for DSM objects. You can specify two conditions
## for the rows and columns, respectively (see ?subset.dsm for more options).
## Use grepl() to filter terms with regular expressions, e.g. for adjectives as targets and
## nouns as features:
DSM <- subset(WP500_Win30_Lemma, grepl("_J$", term), grepl("_N$", term))
DSM # now ca. 6k x 38k
## Or you can filter out low-frequency target and feature terms. Let us look at the distribution
## of marginal frequencies first.
hist(log10(WP500_Win30_Lemma$rows$f)) # row marginals are scaled by span size!
hist(log10(WP500_Win30_Lemma$cols$f)) # 2 = 100, 3 = 1000, 4 = 10,000, 5 = 100,000, 6 = 1,000,000
## Let us keep targets with f >= 10,000 (choose a different suitable threshold for other DSMs!)
## and features in a mid-frequency range 1000 <= f <= 20,000
DSM <- subset(WP500_Win30_Lemma, f >= 10000, f >= 1000 & f <= 20000)
DSM # approx. 30k x 10k now
## If you're short on RAM, delete the original model now and clean up
rm(WP500_Win30_Lemma)
gc() # run garbage collector to free up RAM
## Experiment with an unreduced model first, which is less time-consuming than SVD
DSM <- dsm.score(DSM, score="Dice") # Dice scores without normalization
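## Other association scores and transformations are worth trying at this point; the
## settings below are just one plausible combination (argument names taken from
## ?dsm.score), e.g. sparse log-likelihood scores with a logarithmic transformation:
DSM.ll <- dsm.score(DSM, score="simple-ll", transform="log")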
## Look at some nearest neighbours and evaluate the model in various tasks. Try different
## distance metrics. Go back and change parameters, then re-run all evaluation steps (this
## is particularly convenient with an R script in RStudio).
nearest.neighbours(DSM, "mouse_N")
nearest.neighbours(DSM, "mouse_N", method="euclidean")
eval.multiple.choice(TOEFL80, DSM) # note that the missing items count as errors!
## Try many other tasks here!
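## For example, similarity ratings and clustering tasks can be evaluated with the
## corresponding wordspace functions (RG65 and WordSim353 contain similarity ratings,
## ESSLLI08_Nouns is a clustering task; see the respective help pages for options):
eval.similarity.correlation(RG65, DSM)
eval.similarity.correlation(WordSim353, DSM)
eval.clustering(ESSLLI08_Nouns, DSM)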
## You can also check directly whether normalization seems to be necessary by plotting
## the distribution of row vector norms (specify the score matrix DSM$S rather than the DSM object):
hist(rowNorms(DSM$S, method="euclidean"))
hist(rowNorms(DSM$S, method="manhattan")) # note the huge differences here
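## If the norms vary widely, you can re-score with row normalization (normalize=TRUE
## uses Euclidean norms by default; see ?dsm.score). A separate object is used here
## so the unnormalized model above is not overwritten:
DSM.norm <- dsm.score(DSM, score="Dice", normalize=TRUE)
hist(rowNorms(DSM.norm$S, method="euclidean")) # should now be concentrated at 1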
## Once you're satisfied with your DSM, try to see if dimensionality reduction improves
## the representation. The more latent dimensions you ask for, the longer this will take,
## but you can then select the first r dimensions from the reduced matrix or skip a few.
DSM300 <- dsm.projection(DSM, n=300, method="svd")
nearest.neighbours(DSM300, "mouse_N") # most people use cosine similarity with SVD-reduced models
eval.multiple.choice(TOEFL80, DSM300)
## Use matrix subsetting to pick fewer dimensions or skip the first dimensions
M <- DSM300[, 1:100] # first 100 dim's only
M <- DSM300[, 51:150] # skip 50, then take next 100 dim's
nearest.neighbours(M, "mouse_N")
eval.multiple.choice(TOEFL80, M)
## Your task, as in the first exercise, is to:
## - explore distances, neighbours and semantic maps for different DSMs and distance measures
## - remember that you can apply post-hoc power scaling and skip dimensions to the SVD-reduced DSM
## - evaluate each model in the three standard tasks listed above
## - additionally, here you can observe how SVD is affected by manipulation of the input matrix
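## Power scaling of the singular values can be requested directly in dsm.projection
## (argument name from ?dsm.projection; check your installed version). As an
## illustration, power=0 discards the singular values entirely:
DSM300.p0 <- dsm.projection(DSM, n=300, method="svd", power=0)
nearest.neighbours(DSM300.p0, "mouse_N")
eval.multiple.choice(TOEFL80, DSM300.p0)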
## Summarize your findings:
## - How different are the various co-occurrence contexts?
## - Can you put your finger on the kind of semantic relations captured by each model?
## - Are some parts of speech, semantic classes, etc. represented better than others?
## - How much influence does the distance measure have? Is one measure better than all others?
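## One way to compare distance measures systematically is a small sapply() loop;
## this sketch assumes that the method= argument is passed through to the distance
## computation and that the result has an "accuracy" column (see ?eval.multiple.choice):
sapply(c("cosine", "euclidean", "manhattan"),
       function (m) eval.multiple.choice(TOEFL80, DSM300, method=m)$accuracy)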