precision.recall {UCS} | R Documentation |

Computes precision and recall of n-best lists for a UCS data set
annotated with true positives and rankings (based on association
scores). This function forms the basis for the evaluation graphs in
the `plots` packages.

precision.recall(ds, am, tp=ds$b.TP, step=1, first=1, cut=0, window=0)

`ds` |
a UCS data set object |

`am` |
a character string giving the name of an association measure.
The corresponding ranking must be annotated in the data set (usually
with the `add.ranks` function). |

`tp` |
a logical vector, which must be parallel to the rows of the
data set. `TRUE` values indicate true positives (see details
below for the use of missing values). If `tp` is omitted, the
data set must contain a Boolean variable `b.TP` which is used
instead. |

`step` |
step width for the n-best lists considered, i.e. precision and
recall are computed only for every `step`-th value of n
(default: 1) |

`first` |
smallest n-best list for which precision and recall are computed (default: 1) |

`cut` |
pretend that the data set consists only of the first `cut`
rows in the ranking, i.e. treat the `cut`-best list as the full data
set (when computing percentages and recall). |

`window` |
if specified, local precision is estimated, considering
a window of approximately the given size around each value of n
(uses the `density` function for smoothing). Useful window
sizes range from 400 to 1000. |
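The idea behind the `window` argument can be illustrated with a minimal base-R sketch. It is not the UCS implementation (which, as noted above, relies on the `density` function); the helper name `local.precision` is hypothetical, and the mapping from window size to kernel width (the window spanning roughly four standard deviations) is an assumption made for this example only.

```r
# Hypothetical helper, not part of UCS: kernel-weighted local precision.
# tp.ranked: logical vector of TP annotations, already sorted by rank
# window:    approximate total window size, measured in ranks
local.precision <- function(tp.ranked, window) {
  ranks <- seq_along(tp.ranked)
  # Assumption: the window covers roughly +/- 2 standard deviations.
  sigma <- window / 4
  sapply(ranks, function(i) {
    w <- dnorm(ranks, mean = i, sd = sigma)  # Gaussian weights centred on rank i
    sum(w * tp.ranked) / sum(w)              # weighted share of TPs near rank i
  })
}

# Toy ranking: only TPs at the top, a mixed middle section, only FPs at the end.
# (The window is tiny here; real data would use sizes such as 400 to 1000.)
tp.ranked <- c(rep(TRUE, 30), rep(c(TRUE, FALSE), 20), rep(FALSE, 30))
lp <- local.precision(tp.ranked, window = 20)
```

In this toy ranking the estimate stays close to 1 at the top of the list, drops to about 0.5 in the mixed section, and approaches 0 at the end.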

The `precision.recall` function supports evaluation based on
random samples (cf. Evert, 2004, Sec. 5.4). Any `NA` values in
the `tp` parameter (or the `b.TP` variable) are interpreted
as unannotated candidates. Precision and recall values are computed
from the annotated candidates only (as are the `tp`, `fp`,
and `lp` variables in the returned data frame). For a random
sample evaluation, confidence intervals should always be supplied with
the raw precision values, and result differences should be tested for
significance. Such tests are implemented by the
`evaluation.plot` function, for instance.
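The treatment of `NA` values described above can be sketched in base R for a single n-best list. This is a toy illustration, not UCS code: the vector `tp.ranked` is made-up data, and the exact binomial (Clopper-Pearson) interval from `binom.test` is just one reasonable choice, not necessarily the method used by `evaluation.plot`.

```r
# Toy annotation in rank order: TRUE = true positive, FALSE = false positive,
# NA = candidate that was not manually annotated (outside the random sample).
tp.ranked <- c(TRUE, NA, TRUE, FALSE, NA, TRUE, FALSE, NA, FALSE, FALSE)

n <- 6                          # evaluate the 6-best list
top <- tp.ranked[1:n]
annotated <- !is.na(top)        # only annotated candidates are counted

k <- sum(top[annotated])        # annotated TPs in the n-best list
m <- sum(annotated)             # annotated candidates in the n-best list
precision <- k / m              # precision estimated from the sample

# Exact binomial 95% confidence interval for the estimated precision
ci <- binom.test(k, m)$conf.int
```

With so few annotated candidates the interval is very wide, which is why raw precision values from a sample should not be reported on their own.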

An invisible data frame with rows corresponding to n-best lists and the following variables:

`n` |
the number of candidates in the n-best list |

`perc` |
the same as a percentage of the full data set (or the
`cut` highest-ranking candidates if specified) |

`tp` |
the number of true positives in the n-best list |

`fp` |
the number of false positives in the n-best list |

`precision` |
the precision of the n-best list, i.e. the number of TPs divided by n |

`recall` |
the recall of the n-best list, i.e. the number of TPs divided by the total number of TPs in the data set |

`lp` |
if `window` is specified, an estimate for the local
precision, i.e. the density of TPs in the vicinity of the n-th
rank. Averages over a symmetric window of approximately the
specified total size by convolution with a Gaussian kernel (using
the `density` function). |
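To make the relationship between these variables concrete, the following base-R sketch builds a comparable table for a fully annotated toy ranking. The function `nbest.table` is hypothetical and is not the UCS code; it ignores `NA` values, `cut`, and `window`, and its reading of `first` and `step` (start at `first`, then advance in increments of `step`) is an assumption.

```r
# Hypothetical re-implementation for illustration only; not part of UCS.
# tp.ranked: logical vector of TP annotations (no NAs), sorted by rank
nbest.table <- function(tp.ranked, step = 1, first = 1) {
  N <- length(tp.ranked)
  n <- seq(first, N, by = step)      # assumed reading of `first` and `step`
  tp <- cumsum(tp.ranked)[n]         # TPs among the n highest-ranked candidates
  data.frame(
    n = n,
    perc = 100 * n / N,              # n as a percentage of the data set
    tp = tp,
    fp = n - tp,                     # every non-TP counts as a false positive
    precision = tp / n,
    recall = tp / sum(tp.ranked)     # relative to all TPs in the data set
  )
}

tp.ranked <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE)
res <- nbest.table(tp.ranked, step = 2, first = 2)
```

Here `res` has one row for each of the 2-, 4-, 6- and 8-best lists; precision falls from 1 to 0.5 while recall rises from 0.5 to 1.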

Evert, Stefan (2004). *The Statistics of Word Cooccurrences: Word
Pairs and Collocations.* PhD Thesis, IMS, University of Stuttgart.

`add.ranks`, `read.ds.gz`, `evaluation.plot`

[Package *UCS* version 0.5 Index]