Please use this identifier to cite or link to this item: http://hdl.handle.net/1942/33431
Title: Document Spanners: A Formal Approach to Information Extraction
Authors: Fagin, Ron
Kimelfeld, Benny
Reiss, Frederick
VANSUMMEREN, Stijn 
Issue Date: 2015
Publisher: Association for Computing Machinery
Source: JOURNAL OF THE ACM, 62 (2) , p. 1 -51 (Art N° 12)
Abstract: An intrinsic part of information extraction is the creation and manipulation of relations extracted from text. In this article, we develop a foundational framework where the central construct is what we call a document spanner (or just spanner for short). A spanner maps an input string into a relation over the spans (intervals specified by bounding indices) of the string. The focus of this article is on the representation of spanners. Conceptually, there are two kinds of such representations. Spanners defined in a primitive representation extract relations directly from the input string; those defined in an algebra apply algebraic operations to the primitively represented spanners. This framework is driven by SystemT, an IBM commercial product for text analysis, where the primitive representation is that of regular expressions with capture variables.We define additional types of primitive spanner representations by means of two kinds of automata that assign spans to variables. We prove that the first kind has the same expressive power as regular expressions with capture variables; the second kind expresses precisely the algebra of the regular spanners-the closure of the first kind under standard relational operators. The core spanners extend the regular ones by stringe-quality selection (an extension used in SystemT). We give some fundamental results on the expressiveness of regular and core spanners. As an example, we prove that regular spanners are closed under difference (and complement), but core spanners are not. Finally, we establish connections with related notions in the literature.
Keywords: Theory;Information extraction;document spanners;regular expressions;finite-state automata
Document URI: http://hdl.handle.net/1942/33431
ISSN: 0004-5411
e-ISSN: 1557-735X
DOI: 10.1145/2699442
ISI #: WOS:000354799200003
Category: A1
Type: Journal Contribution
Appears in Collections:Research publications

Show full item record

WEB OF SCIENCETM
Citations

53
checked on Oct 19, 2024

Page view(s)

30
checked on Jun 28, 2023

Google ScholarTM

Check

Altmetric


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.