^{1}
Formally, the variables X_{ij} can be defined as sums over indicator variables:

^{2}
Intuitively, the particular arrangement of the pair tokens in the sample cannot provide any
meaningful information, since it is presupposed to be random. In particular, all reorderings
of the sample must be equally likely. It is therefore sufficient to consider the total
(co)occurrence frequencies. In fact, a sufficient statistic is already given by three of the
variables in a contingency table (e.g. the joint frequency X_{11} and the first
row and column totals X_{R1} and X_{C1})
because the four cells must add up to the sample size:
X_{11} + X_{12} + X_{21} + X_{22} = N.
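
The redundancy can be illustrated with a minimal Python sketch that rebuilds all four cells from the joint frequency, the first row total, the first column total, and the sample size. The function name `complete_table` and the example frequencies are illustrative assumptions, not taken from the text.

```python
def complete_table(x11, r1, c1, n):
    """Rebuild the 2x2 table ((x11, x12), (x21, x22)) from three frequencies and N."""
    x12 = r1 - x11            # remainder of the first row
    x21 = c1 - x11            # remainder of the first column
    x22 = n - r1 - c1 + x11   # whatever is left of the sample
    if min(x11, x12, x21, x22) < 0:
        raise ValueError("frequencies are inconsistent with a contingency table")
    return (x11, x12), (x21, x22)

# Hypothetical example values
table = complete_table(x11=30, r1=100, c1=80, n=1000)
print(table)
```

Any other choice of three independent frequencies (for instance three of the four cells) would work equally well, since the fourth value always follows from N.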

^{3}
Formally, the sampling distribution is the joint probability distribution of the random variables
X_{11}, ..., X_{22}.

^{4}
The name **maximum-likelihood estimates** derives from the fact that the
estimated values maximise the probability (or *likelihood*) of the
observed contingency table among the set of all possible parameter values.

^{5}
Precisely speaking, the point null hypothesis consists of three conditions:
π_{1} = p_{1}, π_{2} = p_{2},
and π = p_{1} p_{2}. Although p_{1}
and p_{2} are random variables, they are treated as constants
in the statistical model, which are set to the values computed from the observed data.
Hypothesis tests based on the point null hypothesis thus effectively ignore
the sampling error of p_{1} and p_{2}.

^{1}
This likelihood is the probability of an outcome where X_{11}
equals the observed value O_{11}, while the values of
X_{12}, X_{21}, and X_{22}
are unspecified. Note that the observed marginal frequencies still have
some effect through their influence on the point null hypothesis.

^{1}
Try the command `phyper(99, 1000, 999000, 1000, lower=F)`, which
computes the Fisher score for a contingency table with
O_{11} = 100, R_{1} = C_{1} = 1,000,
and N = 1,000,000. At least on versions up to R-1.9.0 running under Linux/i386,
the result is a *negative* p-value (P < 0)!
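
For comparison, the same tail probability can be computed with exact integer arithmetic, which cannot produce a negative value. The following is a minimal Python sketch (Python rather than R, and not the algorithm used by `phyper`); the function name `fisher_pvalue` is a hypothetical helper, and the sketch assumes that the Fisher score is the upper-tail hypergeometric probability P(X_{11} ≥ O_{11}) for fixed marginals.

```python
from fractions import Fraction
from math import comb

def fisher_pvalue(o11, r1, c1, n):
    """Exact upper-tail hypergeometric probability P(X11 >= o11)."""
    k_max = min(r1, c1)
    # All terms share the denominator comb(n, c1), so sum the integer numerators first.
    numerator = sum(comb(r1, k) * comb(n - r1, c1 - k)
                    for k in range(o11, k_max + 1))
    return Fraction(numerator, comb(n, c1))

p = fisher_pvalue(100, 1000, 1000, 1_000_000)
print(float(p))  # extremely small, but strictly positive
```

Exact summation is slow for large tables, but it is useful as a reference value when a floating-point implementation is suspected of cancellation errors.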

^{1}
In earlier days, this task involved enormous tomes of statistical tables
where p-values for many known distributions were tabulated. Back then,
without the help of desktop computers, it was impossible to carry out exact
hypothesis tests except for the case of very small samples. Such practical
considerations were an important reason for the concentration on asymptotic (rather
than exact) hypothesis tests during the first half of the 20^{th} century.

^{2}
For instance, common sense dictates that in those cases where a contingency table A
is clearly less consistent with the null hypothesis than a table B, the test
statistic should assume a greater value for A than for B. In many
other cases, where the desired result of the comparison is not obvious, the definition of
the test statistic is essentially an intuitive choice.

^{3}
An equivalence proof for the three different versions of the chi-squared measure
can be based on the fact that the identity
(O_{11} - E_{11})^{2} = (O_{12} - E_{12})^{2}
= (O_{21} - E_{21})^{2} = (O_{22} - E_{22})^{2}
holds for any contingency table.
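
The identity is easy to check numerically. The sketch below assumes the usual expected frequencies E_{ij} = R_{i} C_{j} / N, which are not restated in this footnote, and uses exact fractions to avoid rounding; the function name and the example table are illustrative.

```python
from fractions import Fraction

def squared_deviations(o11, o12, o21, o22):
    """Return (O_ij - E_ij)^2 for all four cells, with E_ij = R_i * C_j / N."""
    n = o11 + o12 + o21 + o22
    rows = (o11 + o12, o21 + o22)   # R_1, R_2
    cols = (o11 + o21, o12 + o22)   # C_1, C_2
    observed = ((o11, o12), (o21, o22))
    return [(observed[i][j] - Fraction(rows[i] * cols[j], n)) ** 2
            for i in range(2) for j in range(2)]

# Hypothetical example table
devs = squared_deviations(30, 70, 50, 850)
print(devs)
```

The four values coincide because the row and column totals of the observed and expected tables agree, so the deviations O_{ij} - E_{ij} can differ only in sign.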

^{4}
The number of degrees of freedom is given by the dimension of the parameter space minus
the dimension of the null hypothesis (which is formally a subset of the parameter space).
In the case of cooccurrence data, the former has dimension 3 (with free parameters π,
π_{1}, π_{2}), while the latter has dimension 2
(with π_{1}, π_{2} as free parameters, and
π determined by H_{0}). Therefore, the limiting
χ^{2} distribution of the likelihood ratio statistic has one degree of freedom.

^{1}
In particular, odds-ratio does not make any distinction between contingency tables
where either O_{12} = 0 or O_{21} = 0 (because they are
assigned the same infinite score).
After discounting, odds-ratio_{disc} assigns higher scores to tables where
both non-diagonal cells are empty (O_{12} = O_{21} = 0) rather than
just one, and it takes the cooccurrence frequency O_{11} into account.
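
The effect of discounting can be sketched in Python. The code assumes that odds-ratio_{disc} adds 1/2 to every cell before the log odds ratio is computed; this definition is not given in the footnote and should be checked against the main text. The example tables are hypothetical.

```python
from math import log

def odds_ratio_disc(o11, o12, o21, o22):
    """Log odds ratio with 0.5 added to each cell (assumed form of the discounting)."""
    return log((o11 + 0.5) * (o22 + 0.5) / ((o12 + 0.5) * (o21 + 0.5)))

one_empty = odds_ratio_disc(10, 0, 5, 985)    # only O12 is empty
both_empty = odds_ratio_disc(10, 0, 0, 990)   # O12 and O21 are empty
more_cooc = odds_ratio_disc(20, 0, 0, 980)    # both empty, larger O11

print(one_empty, both_empty, more_cooc)
```

Without discounting, all three tables would receive the same infinite score; with it, the scores are finite and increase from the first table to the third.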

^{2}
In the same paper, the authors also argue in favour of point estimates, which they
interpret as *descriptive* rather than *inferential* measures. They state
that descriptive statistics are more appropriate when it is feasible to analyse a
population exhaustively (which they imply to be the case for the very large corpora that
are available today). Interestingly, this argument is followed by an empirical evaluation
of the MS measure (as an example of a descriptive measure) on a *subset*
of the *Wall Street Journal*.

^{3}
Let g be the Dice score for a given contingency table. Then the
Jaccard score h for the same table is given by the equation
h = g / (2 - g).
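
The relation can be verified with exact fractions. This sketch assumes the common definitions Dice = 2 O_{11} / (R_{1} + C_{1}) and Jaccard = O_{11} / (O_{11} + O_{12} + O_{21}), which are not restated in this footnote; the example frequencies are hypothetical.

```python
from fractions import Fraction

def dice(o11, o12, o21):
    # R1 + C1 = 2*O11 + O12 + O21
    return Fraction(2 * o11, 2 * o11 + o12 + o21)

def jaccard(o11, o12, o21):
    return Fraction(o11, o11 + o12 + o21)

g = dice(30, 70, 50)
h = jaccard(30, 70, 50)
print(g, h, g / (2 - g))
```

Since g / (2 - g) is strictly increasing for 0 ≤ g ≤ 1, the two measures always produce the same ranking of pair types.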

^{1}
Let g be the local-MI score for a given pair type (u,v),
and let h be the score of the Poisson-Stirling measure.
Then the following equality holds: h = g - O_{11}.
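
A quick numerical check, assuming the definitions local-MI = O_{11} log(O_{11} / E_{11}) and Poisson-Stirling = O_{11} (log O_{11} - log E_{11} - 1) with natural logarithms (neither formula is restated in this footnote, and the example values are hypothetical):

```python
from math import isclose, log

def local_mi(o11, e11):
    return o11 * log(o11 / e11)

def poisson_stirling(o11, e11):
    return o11 * (log(o11) - log(e11) - 1)

o11, e11 = 30, 8.0   # hypothetical observed and expected frequencies
g = local_mi(o11, e11)
h = poisson_stirling(o11, e11)
print(h, g - o11)
```

The equality depends on both measures using the same logarithm base; with a base other than e, the term subtracted in Poisson-Stirling would have to be rescaled accordingly.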

^{1}
Let g be the gmean association score for a pair type (u,v).
Then the score h of the MI^{2} measure is given by
h = log(g^{2}) + log N, which is a monotonic transformation
(for a fixed sample of size N).
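
The transformation can be checked numerically. The sketch assumes gmean = O_{11} / sqrt(R_{1} C_{1}), MI^{2} = log(O_{11}^{2} / E_{11}), and E_{11} = R_{1} C_{1} / N; these definitions are not restated in this footnote, and the example values are hypothetical.

```python
from math import isclose, log, sqrt

def gmean(o11, r1, c1):
    return o11 / sqrt(r1 * c1)

def mi2(o11, r1, c1, n):
    e11 = r1 * c1 / n          # expected cooccurrence frequency
    return log(o11 ** 2 / e11)

o11, r1, c1, n = 30, 100, 80, 1000   # hypothetical frequencies
g = gmean(o11, r1, c1)
h = mi2(o11, r1, c1, n)
print(h, log(g ** 2) + log(n))
```

Because log N is a constant for a fixed sample and the logarithm is strictly increasing, both measures rank pair types identically, whichever logarithm base is used.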