Please use this identifier to cite or link to this item: http://hdl.handle.net/1942/788
Title: The Distribution of N-Grams
Authors: EGGHE, Leo 
Issue Date: 2000
Publisher: KLUWER ACADEMIC PUBL
Source: Scientometrics, 47(2). p. 237-252
Abstract: N-grams are generalized words consisting of N consecutive symbols, as they are used in a text. This paper determines the rank-frequency distribution for redundant N-grams. For entire texts this is known to be Zipf's law (i.e., an inverse power law). For N-grams, however, we show that the rank (r)-frequency distribution is $P_N(r) = C / (\psi_N(r))^{\beta}$, where $\psi_N$ is the inverse function of $f_N(x) = x \ln^{N-1} x$. Here we assume that the rank-frequency distribution of the symbols follows Zipf's law with exponent $\beta$.
Keywords: N-gram; law of Zipf; rank-frequency distribution; CENTRAL-LIMIT-THEOREM; INFORMATION-RETRIEVAL; ZIPFS LAW; SIMILARITY
Document URI: http://hdl.handle.net/1942/788
ISSN: 0138-9130
e-ISSN: 1588-2861
DOI: 10.1023/A:1005634925734
ISI #: 000089449100005
Category: A1
Type: Journal Contribution
Validations: ecoom 2001
Appears in Collections: Research publications

Files in This Item:
File | Description | Size | Format
distribution.pdf | Peer-reviewed author version | 279.57 kB | Adobe PDF
distribution 1.pdf (Restricted Access) | Published version | 393.51 kB | Adobe PDF

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.