precision.recall {UCS}    R Documentation

Compute Precision and Recall for N-Best Lists (base)

Description

Computes precision and recall of n-best lists for a UCS data set annotated with true positives and rankings (based on association scores). This function forms the basis for the evaluation graphs in the plots packages.

Usage

precision.recall(ds, am, tp=ds$b.TP, step=1, first=1, cut=0, window=0)

Arguments

ds: a UCS data set object
am: a character string giving the name of an association measure. The corresponding ranking must be annotated in the data set (usually with the add.ranks function).
tp: a logical vector, which must be parallel to the rows of the data set. TRUE values indicate true positives (see Details below for the use of missing values). If tp is omitted, the data set must contain a Boolean variable b.TP, which is used instead.
step: step width for the n-best lists considered, i.e. precision and recall are computed for every step-th value of n only (default: 1)
first: smallest n-best list for which precision and recall are computed (default: 1)
cut: pretend that the data set consists only of the first cut rows in the ranking, i.e. treat the cut-best list as the full data set when computing percentages and recall (default: 0, which uses the full data set)
window: if specified, local precision is estimated, considering a window of approximately the given size around each value of n (uses the density function for smoothing). Useful window sizes range from 400 to 1000.
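
The interaction of step, first and cut can be pictured with a small base R sketch. The helper nbest.sizes is hypothetical and not part of UCS; it assumes that the evaluated list sizes form a regular sequence starting at first, and that cut = 0 means "no cut". The package's actual handling may differ.

```r
# Hypothetical helper (not part of UCS): which n-best list sizes get evaluated.
# Assumes a regular sequence first, first + step, ... up to the (cut) data set size.
nbest.sizes <- function(n.rows, step = 1, first = 1, cut = 0) {
  # Assumption: cut = 0 means "use the full data set"
  n.max <- if (cut > 0) min(cut, n.rows) else n.rows
  if (first > n.max) return(integer(0))
  seq(from = first, to = n.max, by = step)
}

# Every 100th n-best list of a data set with 1000 candidates
nbest.sizes(1000, step = 100, first = 100)

# Same idea, but treating the 200-best list as the full data set
nbest.sizes(1000, step = 50, first = 50, cut = 200)
```

A larger step mainly reduces the size of the returned data frame, which can matter for data sets with many thousands of candidates.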

Details

The precision.recall function supports evaluation based on random samples (cf. Evert, 2004, Sec. 5.4). Any NA values in the tp parameter (or the b.TP variable) are interpreted as unannotated candidates. Precision and recall values are computed from the annotated candidates only (as are the tp, fp, and lp variables in the returned data frame). For a random sample evaluation, confidence intervals should always be supplied with the raw precision values, and result differences should be tested for significance. Such tests are implemented by the evaluation.plot function, for instance.
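
The treatment of NA values can be illustrated in base R without the UCS package. This is a minimal sketch on made-up toy data, assuming that precision for an n-best list is the number of TPs divided by the number of annotated candidates in that list. The binomial confidence interval from binom.test is a generic stand-in for illustration and not necessarily the procedure used by evaluation.plot.

```r
# Toy annotation vector, already sorted by rank (best candidate first).
# NA marks candidates that were not part of the annotated random sample.
tp <- c(TRUE, NA, FALSE, TRUE, NA, NA, TRUE, FALSE)

annotated <- !is.na(tp)
is.tp <- annotated & tp            # FALSE & NA is FALSE, so NAs never count as TPs

n.annotated <- cumsum(annotated)   # annotated candidates in each n-best list
n.tp <- cumsum(is.tp)              # true positives in each n-best list

# Precision and recall as proportions, based on annotated candidates only
precision <- n.tp / n.annotated
recall <- n.tp / sum(is.tp)

# Generic binomial confidence interval for the precision of the 8-best list
# (assumption: a plain Clopper-Pearson interval is good enough for illustration)
ci <- binom.test(n.tp[8], n.annotated[8])$conf.int

print(data.frame(n = seq_along(tp), n.annotated, n.tp, precision, recall))
print(ci)
```

With only five annotated candidates the interval is very wide, which is why raw precision values from a small sample should not be reported on their own.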

Value

An invisible data frame with rows corresponding to n-best lists and the following variables:
n: the number of candidates in the n-best list
perc: the same as a percentage of the full data set (or of the cut highest-ranking candidates if specified)
tp: the number of true positives in the n-best list
fp: the number of false positives in the n-best list
precision: the precision of the n-best list, i.e. the number of TPs divided by n
recall: the recall of the n-best list, i.e. the number of TPs divided by the total number of TPs in the data set
lp: if window is specified, an estimate for the local precision, i.e. the density of TPs in the vicinity of the n-th rank. The estimate averages over a symmetric window of approximately the specified total size by convolution with a Gaussian kernel (using the density function).
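
The variables above can be reproduced on simulated data with a few lines of base R. The following is a sketch only: it assumes a fully annotated data set, reports precision and recall as proportions, and maps the window size to a kernel bandwidth of window / 4, which is a guess rather than the package's documented choice.

```r
set.seed(42)

# Simulated ranking of N candidates: TPs become rarer towards the bottom
N <- 2000
tp <- runif(N) < seq(0.8, 0.1, length.out = N)

n <- seq_len(N)
tp.count <- cumsum(tp)

res <- data.frame(
  n = n,
  perc = 100 * n / N,            # size of the n-best list as a percentage
  tp = tp.count,
  fp = n - tp.count,
  precision = tp.count / n,
  recall = tp.count / sum(tp)
)

# Local precision: Gaussian kernel density of the TP ranks, rescaled from a
# probability density to "TPs per rank".
# Assumption: bandwidth = window / 4, i.e. the window spans about +/- 2 sd.
window <- 400
d <- density(which(tp), bw = window / 4, kernel = "gaussian",
             from = 1, to = N, n = N)
res$lp <- d$y * sum(tp)

print(res[c(100, 500, 1000, 2000), ])
```

Near the top and bottom of the ranking the kernel extends beyond the data, so this naive estimate is biased downwards there; the middle of the ranking is more trustworthy.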

References

Evert, Stefan (2004). The Statistics of Word Cooccurrences: Word Pairs and Collocations. PhD Thesis, IMS, University of Stuttgart.

See Also

add.ranks, read.ds.gz, evaluation.plot


[Package UCS version 0.5 Index]