fzm {UCS} R Documentation

The Finite Zipf-Mandelbrot LNRE Model (fzm)

Description

Object constructor for a finite Zipf-Mandelbrot (fZM) LNRE model with parameters α, A and B (Evert, 2004a). The parameters can either be specified explicitly, or one or more of them can be estimated from an observed frequency spectrum.

Usage

fzm(alpha=NULL, A=NULL, B=NULL, N=NULL, V=NULL, spc=NULL, m.max=15, stepmax=10, debug=FALSE)

fzm(alpha, A, B)

fzm(alpha, A, N, V)

fzm(alpha, N, V, spc, m.max=15, stepmax=10, debug=FALSE)

fzm(N, V, spc, m.max=15, stepmax=10, debug=FALSE)

Arguments

alpha a number in the range (0,1), the shape parameter α of the fZM model. alpha can automatically be estimated from N, V, and spc.
A a small positive number A << 1, the parameter A of the fZM model. A can automatically be estimated from N, V, and spc.
B a large positive number B >> 1, the parameter B of the fZM model. B can automatically be estimated from N and V.
N the sample size, i.e. the number of observed tokens
V the vocabulary size, i.e. the number of observed types
spc a vector of non-negative integers representing the class sizes V_m of the observed frequency spectrum. The vector is usually read from a file in lexstats format with the read.spectrum function.
m.max the number of ranks from spc that will be used to estimate the α parameter
stepmax maximal step size of the nlm function used for parameter estimation. It should not be necessary to change the default value.
debug if TRUE, print debugging information during the parameter estimation process. This feature can be useful to find out why parameter estimation fails.

Details

The fZM model with parameters α ∈ (0,1), A and B (with 0 < A < B) is defined by the type density function

g(p) := C * p^(-alpha - 1)

for A <= p <= B. The normalisation constant C is determined from the other parameters by the condition

integral_A^B p * g(p) dp = 1
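
Carrying out the integral in the normalisation condition gives a closed form for C. The derivation below is a worked step that uses only the definition of g(p) given above and the fact that α < 1; it is not quoted from the package sources.

```latex
\int_A^B p \, g(p) \, dp
  = C \int_A^B p^{-\alpha} \, dp
  = C \, \frac{B^{1-\alpha} - A^{1-\alpha}}{1 - \alpha}
  = 1
\quad\Longrightarrow\quad
C = \frac{1 - \alpha}{B^{1-\alpha} - A^{1-\alpha}}
```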

The parameters α and A are estimated simultaneously by nonlinear minimisation (nlm) of a multinomial chi-squared statistic for the observed against the expected frequency spectrum. Note that this is different from the multivariate chi-squared test used to measure the goodness-of-fit of the final model (Baayen, 2001, Sec. 3.3).

See Evert (2004, Ch. 4) for further mathematical details, especially concerning the expected vocabulary size, frequency spectrum and conditional parameter distribution, as well as their variances.
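
As a numerical illustration of the type density and its normalisation condition, the following base-R sketch computes C in closed form and checks the result with integrate(). The helper names fzm_C and fzm_S and the parameter values are hypothetical and not part of the UCS package. The sketch also assumes that the population size S is the integral of g(p) over [A, B], which is not stated on this page.

```r
# Hypothetical helpers (not part of UCS) for the type density
# g(p) = C * p^(-alpha - 1) on the interval [A, B].

fzm_C <- function(alpha, A, B) {
  # closed form implied by: integral_A^B p * g(p) dp = 1
  (1 - alpha) / (B^(1 - alpha) - A^(1 - alpha))
}

fzm_S <- function(alpha, A, B) {
  # ASSUMPTION: population size S = integral_A^B g(p) dp
  fzm_C(alpha, A, B) * (A^(-alpha) - B^(-alpha)) / alpha
}

# Arbitrary illustrative values with 0 < alpha < 1 and 0 < A < B;
# these are not estimates from any real frequency spectrum.
alpha <- 0.5
A <- 1e-4
B <- 100

C <- fzm_C(alpha, A, B)
S <- fzm_S(alpha, A, B)

# type density function with the constants computed above
g <- function(p) C * p^(-alpha - 1)

# numerical check of the normalisation condition (should be close to 1)
mass <- integrate(function(p) p * g(p), lower = A, upper = B,
                  subdivisions = 1000)$value

# numerical check of the closed form for S (should be close to S)
S_num <- integrate(g, lower = A, upper = B, subdivisions = 1000)$value

cat("C =", C, " S =", S, " mass =", mass, " S_num =", S_num, "\n")
```

The closed forms avoid numerical integration altogether; integrate() is used here only as an independent check that the constants are consistent with the definition.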

Value

An object of class "fzm" with the following components:
alpha value of the α parameter
A value of the A parameter
B value of the B parameter
C value of the normalisation constant C
S population size S predicted by the model
N number of observed tokens (if specified)
V number of observed types (if specified)
spc observed frequency spectrum (if specified)
This object prints a short summary, including the population size S and a comparison of the first ranks of the observed and expected frequency spectrum (if available).

References

Baayen, R. Harald (2001). Word Frequency Distributions. Kluwer, Dordrecht.

Evert, Stefan (2004). The Statistics of Word Cooccurrences: Word Pairs and Collocations. PhD Thesis, IMS, University of Stuttgart.

Evert, Stefan (2004a). A simple LNRE model for random character sequences. In Proceedings of JADT 2004, Louvain-la-Neuve, Belgium, pages 411–422.

See Also

zm, EV, EVm, VV, VVm, write.lexstats, lnre.goodness.of.fit, read.spectrum, and spectrum.plot


[Package UCS version 0.5 Index]