Please use this identifier to cite or link to this item:
http://hdl.handle.net/1942/788
Title: | The Distribution of N-Grams |
Authors: | EGGHE, Leo |
Issue Date: | 2000 |
Publisher: | KLUWER ACADEMIC PUBL |
Source: | Scientometrics, 47(2). p. 237-252 |
Abstract: | N-grams are generalized words consisting of N consecutive symbols, as they are used in a text. This paper determines the rank-frequency distribution for redundant N-grams. For entire texts this is known to be Zipf's law (i.e., an inverse power law). For N-grams, however, we show that the rank (r)-frequency distribution is $P_N(r) = C / (\psi_N(r))^{\beta}$, where $\psi_N$ is the inverse function of $f_N(x) = x \ln^{N-1} x$. Here we assume that the rank-frequency distribution of the symbols follows Zipf's law with exponent $\beta$. |
Keywords: | N-gram; law of Zipf; rank-frequency distribution; CENTRAL-LIMIT-THEOREM; INFORMATION-RETRIEVAL; ZIPFS LAW; SIMILARITY |
Document URI: | http://hdl.handle.net/1942/788 |
ISSN: | 0138-9130 |
e-ISSN: | 1588-2861 |
DOI: | 10.1023/A:1005634925734 |
ISI #: | 000089449100005 |
Category: | A1 |
Type: | Journal Contribution |
Validations: | ecoom 2001 |
Appears in Collections: | Research publications |
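The formula in the abstract can be evaluated numerically. The following is a minimal sketch, not the paper's own method: it assumes that $\psi_N$ is taken as the inverse of $f_N$ on the branch $x \ge 1$ (where $f_N$ is increasing), inverts it by bisection, and sets the constant $C$ to 1.0 by default. The function names `f_n`, `psi_n` and `p_n` are hypothetical and chosen only for illustration.

```python
import math


def f_n(x: float, n: int) -> float:
    """f_N(x) = x * (ln x)^(N-1), as stated in the abstract."""
    return x * math.log(x) ** (n - 1)


def psi_n(r: float, n: int, tol: float = 1e-12) -> float:
    """Invert f_N numerically on [1, inf) by bisection.

    Assumption: only the branch x >= 1 is used, where f_N is increasing.
    """
    if n == 1:
        return r  # f_1(x) = x, so its inverse is the identity
    lo, hi = 1.0, 2.0
    # Grow the upper bound until it brackets the target value r.
    while f_n(hi, n) < r:
        hi *= 2.0
    # Bisect until the bracket is small relative to its upper end.
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if f_n(mid, n) < r:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def p_n(r: float, n: int, beta: float, c: float = 1.0) -> float:
    """P_N(r) = C / psi_N(r)^beta; C = 1.0 is an illustrative default."""
    return c / psi_n(r, n) ** beta


if __name__ == "__main__":
    # Compare single symbols (N = 1, plain Zipf) with 2-grams for beta = 1.
    for rank in range(1, 6):
        print(rank, p_n(rank, 1, 1.0), p_n(rank, 2, 1.0))
```

For N = 1 the sketch reduces to the plain inverse power law $C / r^{\beta}$, which matches the abstract's assumption about the symbol distribution.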
Files in This Item:
File | Description | Size | Format |
---|---|---|---|
distribution.pdf | Peer-reviewed author version | 279.57 kB | Adobe PDF |
distribution 1.pdf (Restricted Access) | Published version | 393.51 kB | Adobe PDF |
Web of Science citations: 22 (checked on May 7, 2024)
Page views: 12 (checked on Sep 5, 2022)
Downloads: 14 (checked on Sep 5, 2022)