Please use this identifier to cite or link to this item: http://hdl.handle.net/1942/788
Title: The Distribution of N-Grams
Authors: EGGHE, Leo 
Issue Date: 2000
Publisher: KLUWER ACADEMIC PUBL
Source: Scientometrics, 47(2). p. 237-252
Abstract: N-grams are generalized words consisting of N consecutive symbols, as they are used in a text. This paper determines the rank-frequency distribution for redundant N-grams. For entire texts this is known to be Zipf's law (i.e., an inverse power law). For N-grams, however, we show that the rank ($r$)-frequency distribution is $P_N(r) = C / (\psi_N(r))^{\beta}$, where $\psi_N$ is the inverse function of $f_N(x) = x \ln^{N-1} x$. Here we assume that the rank-frequency distribution of the symbols follows Zipf's law with exponent $\beta$.
Keywords: N-gram; law of Zipf; rank-frequency distribution; CENTRAL-LIMIT-THEOREM; INFORMATION-RETRIEVAL; ZIPFS LAW; SIMILARITY
Document URI: http://hdl.handle.net/1942/788
ISSN: 0138-9130
e-ISSN: 1588-2861
DOI: 10.1023/A:1005634925734
ISI #: 000089449100005
Category: A1
Type: Journal Contribution
Validations: ecoom 2001
Appears in Collections: Research publications
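
The formula quoted in the abstract can be evaluated numerically. The following is a minimal sketch, not the paper's own method: it assumes the inverse is taken on the branch x >= 1, where f_N(x) = x ln^(N-1) x is increasing, and it inverts f_N by bisection. The function names `f_n`, `psi_n`, and `p_n`, the tolerance, and the default constant C = 1 are illustrative choices that do not appear in the record.

```python
import math


def f_n(x: float, n: int) -> float:
    """f_N(x) = x * (ln x)^(N-1), as stated in the abstract."""
    return x * math.log(x) ** (n - 1)


def psi_n(r: float, n: int, tol: float = 1e-12) -> float:
    """Inverse of f_N on x >= 1, found by bisection (assumed branch)."""
    if n == 1:
        return r  # f_1(x) = x, so its inverse is the identity
    lo, hi = 1.0, 2.0
    # Grow the upper bracket until f_N(hi) >= r.
    while f_n(hi, n) < r:
        hi *= 2.0
    # Bisect until the bracket is narrow relative to its size.
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if f_n(mid, n) < r:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def p_n(r: float, n: int, beta: float, c: float = 1.0) -> float:
    """P_N(r) = C / psi_N(r)^beta; C = 1 is an arbitrary placeholder."""
    return c / psi_n(r, n) ** beta


if __name__ == "__main__":
    # beta = 1.0 is an illustrative exponent, not a value from the paper.
    for rank in (10, 100, 1000):
        print(rank, p_n(rank, n=2, beta=1.0))
```

For N = 1 the sketch reduces to the plain inverse power law C / r^beta, which matches the Zipf assumption on single symbols stated in the abstract.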

Files in This Item:
File | Description | Size | Format
distribution.pdf | Peer-reviewed author version | 279.57 kB | Adobe PDF
distribution 1.pdf (Restricted Access) | Published version | 393.51 kB | Adobe PDF

Web of Science citations: 22 (checked on May 7, 2024)
Page views: 12 (checked on Sep 5, 2022)
Downloads: 14 (checked on Sep 5, 2022)

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.