Please use this identifier to cite or link to this item:
http://hdl.handle.net/1942/788
Title: | The Distribution of N-Grams |
Authors: | EGGHE, Leo |
Issue Date: | 2000 |
Publisher: | KLUWER ACADEMIC PUBL |
Source: | Scientometrics, 47(2). p. 237-252 |
Abstract: | N-grams are generalized words consisting of N consecutive symbols, as they are used in a text. This paper determines the rank-frequency distribution for redundant N-grams. For entire texts this is known to be Zipf's law (i.e., an inverse power law). For N-grams, however, we show that the rank (r)-frequency distribution is $P_N(r) = C / (\psi_N(r))^{\beta}$, where $\psi_N$ is the inverse function of $f_N(x) = x \ln^{N-1} x$. Here we assume that the rank-frequency distribution of the symbols follows Zipf's law with exponent $\beta$. |
Keywords: | N-gram; law of Zipf; rank-frequency distribution; CENTRAL-LIMIT-THEOREM; INFORMATION-RETRIEVAL; ZIPFS LAW; SIMILARITY |
Document URI: | http://hdl.handle.net/1942/788 |
ISSN: | 0138-9130 |
e-ISSN: | 1588-2861 |
DOI: | 10.1023/A:1005634925734 |
ISI #: | 000089449100005 |
Category: | A1 |
Type: | Journal Contribution |
Validations: | ecoom 2001 |
Appears in Collections: | Research publications |
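The distribution stated in the abstract can be evaluated numerically. The following is a minimal sketch, not code from the paper: the function names `f_n`, `psi_n`, and `p_n` are hypothetical, the constant C is set to 1 for illustration, and the inverse of x ln^(N-1) x is assumed to be taken on the branch x >= 1, where that function is non-decreasing (the abstract does not state the domain). The inverse is found by bisection.

```python
import math


def f_n(x: float, n: int) -> float:
    """f_N(x) = x * ln(x)^(N-1), as given in the abstract."""
    return x * math.log(x) ** (n - 1)


def psi_n(r: float, n: int, tol: float = 1e-12) -> float:
    """Numerical inverse of f_N at r > 0.

    Assumption (not stated in the abstract): the inverse is taken on
    x >= 1, where f_N is non-decreasing, so bisection applies.
    """
    if r <= 0:
        raise ValueError("rank r must be positive")
    if n == 1:
        return r  # f_1(x) = x, so psi_1 is the identity (plain Zipf case)
    lo, hi = 1.0, 2.0
    while f_n(hi, n) < r:  # grow the bracket until it contains the root
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f_n(mid, n) < r:
            lo = mid
        else:
            hi = mid
        if hi - lo <= tol * hi:
            break
    return 0.5 * (lo + hi)


def p_n(r: float, n: int, beta: float, c: float = 1.0) -> float:
    """P_N(r) = C / psi_N(r)^beta; c = 1.0 is an illustrative choice."""
    return c / psi_n(r, n) ** beta
```

For N = 1 the sketch reduces to the inverse power law C / r^beta; for larger N, `p_n(r, n, beta)` decreases with the rank r, since `psi_n` increases with r.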
Files in This Item:
File | Description | Size | Format |
---|---|---|---|
distribution.pdf | Peer-reviewed author version | 279.57 kB | Adobe PDF |
distribution 1.pdf (Restricted Access) | Published version | 393.51 kB | Adobe PDF |