Efficient Enumeration Algorithms for Regular Document Spanners

Florenzano, Fernando; Riveros, Cristian; Ugarte, Martín; VANSUMMEREN, Stijn; Vrgoc, Domagoj

Please use this identifier to cite or link to this item: http://hdl.handle.net/1942/33439

Full metadata record

DC Field	Value	Language
dc.contributor.author	Florenzano, Fernando	-
dc.contributor.author	Riveros, Cristian	-
dc.contributor.author	Ugarte, Martín	-
dc.contributor.author	VANSUMMEREN, Stijn	-
dc.contributor.author	Vrgoc, Domagoj	-
dc.date.accessioned	2021-02-12T12:27:12Z	-
dc.date.available	2021-02-12T12:27:12Z	-
dc.date.issued	2020	-
dc.date.submitted	2021-02-11T19:06:58Z	-
dc.identifier.citation	ACM TRANSACTIONS ON DATABASE SYSTEMS, 45 (1), p. 1-42 (Art N° 3)	-
dc.identifier.uri	http://hdl.handle.net/1942/33439	-
dc.description.abstract	Regular expressions and automata models with capture variables are core tools in rule-based information extraction. These formalisms, also called regular document spanners, use regular languages to locate the data that a user wants to extract from a text document and then store this data into variables. Since document spanners can easily generate large outputs, it is important to have efficient evaluation algorithms that can generate the extracted data in quick succession, and with relatively little precomputation time. Toward this goal, we present a practical evaluation algorithm that allows output-linear delay enumeration of a spanner's result after a precomputation phase that is linear in the document. Although the algorithm assumes that the spanner is specified in a syntactic variant of variable-set automata, we also study how it can be applied when the spanner is specified by general variable-set automata, regex formulas, or spanner algebras. Finally, we study the related problem of counting the number of outputs of a document spanner and provide a fine-grained analysis of the classes of document spanners that support efficient enumeration of their results.	-
dc.language.iso	en	-
dc.publisher	Association for Computing Machinery	-
dc.subject.other	Information extraction	-
dc.subject.other	spanners	-
dc.subject.other	enumeration delay	-
dc.subject.other	automata	-
dc.subject.other	capture variables	-
dc.title	Efficient Enumeration Algorithms for Regular Document Spanners	-
dc.type	Journal Contribution	-
dc.identifier.epage	42	-
dc.identifier.issue	1	-
dc.identifier.spage	1	-
dc.identifier.volume	45	-
local.bibliographicCitation.jcat	A1	-
local.publisher.place	2 PENN PLAZA, STE 701, NEW YORK, NY 10121-0701 USA	-
local.type.refereed	Refereed	-
local.type.specified	Article	-
local.bibliographicCitation.artnr	3	-
dc.identifier.doi	10.1145/3351451	-
dc.identifier.isi	WOS:000583687500004	-
local.provider.type	Web of Science	-
local.uhasselt.uhpub	no	-
local.uhasselt.international	yes	-
item.fulltext	No Fulltext	-
item.fullcitation	Florenzano, Fernando; Riveros, Cristian; Ugarte, Martín; VANSUMMEREN, Stijn & Vrgoc, Domagoj (2020) Efficient Enumeration Algorithms for Regular Document Spanners. In: ACM TRANSACTIONS ON DATABASE SYSTEMS, 45 (1), p. 1-42 (Art N° 3).	-
item.contributor	Florenzano, Fernando	-
item.contributor	Riveros, Cristian	-
item.contributor	Ugarte, Martín	-
item.contributor	VANSUMMEREN, Stijn	-
item.contributor	Vrgoc, Domagoj	-
item.accessRights	Closed Access	-
crisitem.journal.issn	0362-5915	-
crisitem.journal.eissn	1557-4644	-
Appears in Collections:	Research publications

Scopus™ citations: 25 (checked on Jun 3, 2026)

Web of Science™ citations: 23 (checked on Jun 14, 2026)