Please use this identifier to cite or link to this item: http://hdl.handle.net/1942/747
Title: Properties of the n-overlap vector and n-overlap similarity theory
Authors: EGGHE, Leo 
Issue Date: 2006
Publisher: Wiley
Source: JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, 57(9). p. 1165-1177
Abstract: In the first part of this paper we define the n-overlap vector, whose coordinates consist of the fractions of the objects (e.g. books, N-grams,…) that belong to 1, 2,…, n sets (more generally: families) (e.g. libraries, databases,…). With the aid of Lorenz concentration theory we build a theory of n-overlap similarity and corresponding measures, such as the generalized Jaccard index (generalizing the well-known Jaccard index in the case n=2). Next we determine the distributional form of the n-overlap vector, assuming certain distributions of the object sizes and of the set (family) sizes. In this section the decreasing power law and the decreasing exponential distribution are explained for the n-overlap vector. Both item (token) n-overlap and source (type) n-overlap are studied. The final section is devoted to the n-overlap properties of objects indexed by a hierarchical system (e.g. books indexed by numbers from a UDC or Dewey system, or by N-grams). We show that the results of Section II can be applied here. We also show that the Lorenz order of the n-overlap vector is respected by an increase or a decrease of the level of refinement in the hierarchical system (e.g. the value N in N-grams).
Keywords: n-overlap vector; Lorenz; Jaccard index; power law; N-gram
Document URI: http://hdl.handle.net/1942/747
ISSN: 1532-2882
DOI: 10.1002/asi.v57:9
ISI #: 000238519600003
Category: A1
Type: Journal Contribution
Validations: ecoom 2007
Appears in Collections: Research publications
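
The n-overlap vector defined in the abstract can be illustrated with a minimal Python sketch. It assumes that the i-th coordinate is the fraction of objects, taken over the union of all sets, that belong to exactly i of the n sets; that reading, the function names and the sample data are illustrative assumptions and are not taken from the paper. The generalized Jaccard index itself is not reproduced here, only the classical two-set Jaccard index, which under this reading coincides with the second coordinate of the 2-overlap vector.

```python
from collections import Counter


def n_overlap_vector(sets):
    """Return [x_1, ..., x_n]; x_i is the fraction of objects in the union
    that belong to exactly i of the n sets (assumed reading of the abstract)."""
    n = len(sets)
    membership = Counter()
    for s in sets:
        membership.update(set(s))  # each set counts an object at most once
    total = len(membership)
    if total == 0:
        return [0.0] * n
    by_count = Counter(membership.values())
    return [by_count.get(i, 0) / total for i in range(1, n + 1)]


def jaccard(a, b):
    """Classical Jaccard index |A ∩ B| / |A ∪ B| of two sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical holdings of three libraries (book identifiers are made up).
libraries = [{"b1", "b2", "b3"}, {"b2", "b3", "b4"}, {"b3", "b5"}]
vec3 = n_overlap_vector(libraries)

# For n = 2 the second coordinate equals the classical Jaccard index.
vec2 = n_overlap_vector(libraries[:2])
jac = jaccard(libraries[0], libraries[1])
```

In this toy example three of the five distinct books sit in one library only, one in two libraries and one in all three, so the coordinates sum to 1.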

Files in This Item:
properties 1.pdf (Restricted Access): Published version, 149.98 kB, Adobe PDF
properties 2.pdf: Peer-reviewed author version, 622.91 kB, Adobe PDF

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.