Inference of Concise Regular Expressions and DTDs

BEX, Geert Jan; NEVEN, Frank; Schwentick, Thomas; VANSUMMEREN, Stijn

Please use this identifier to cite or link to this item: http://hdl.handle.net/1942/11342

Full metadata record

DC Field	Value	Language
dc.contributor.author	BEX, Geert Jan	-
dc.contributor.author	NEVEN, Frank	-
dc.contributor.author	Schwentick, Thomas	-
dc.contributor.author	VANSUMMEREN, Stijn	-
dc.date.accessioned	2010-12-12T17:14:28Z	-
dc.date.available	NO_RESTRICTION	-
dc.date.available	2010-12-12T17:14:28Z	-
dc.date.issued	2010	-
dc.identifier.citation	ACM TRANSACTIONS ON DATABASE SYSTEMS, 35 (2)	-
dc.identifier.issn	0362-5915	-
dc.identifier.uri	http://hdl.handle.net/1942/11342	-
dc.description.abstract	We consider the problem of inferring a concise Document Type Definition (DTD) for a given set of XML documents, a problem that basically reduces to learning concise regular expressions from positive example strings. We identify two classes of concise regular expressions, the single occurrence regular expressions (SOREs) and the chain regular expressions (CHAREs), that capture the vast majority of expressions used in practical DTDs. For the inference of SOREs we present several algorithms that first infer an automaton for a given set of example strings and then translate that automaton to a corresponding SORE, possibly repairing the automaton when no equivalent SORE can be found. In the process, we introduce a novel automaton-to-regular-expression rewrite technique which is of independent interest. When only a very small amount of XML data is available, however (for instance when the data is generated by Web service requests or by answers to queries), these algorithms produce regular expressions that are too specific. Therefore, we introduce a novel learning algorithm CRX that directly infers CHAREs (which form a subclass of SOREs) without going through an automaton representation. We show that CRX performs very well within its target class on very small datasets.	-
dc.description.sponsorship	This research was done while S. Vansummeren was a Postdoctoral Fellow of the Research Foundation-Flanders (FWO) at Hasselt University. This work was funded by FWO-G.0821.09N and the Future and Emerging Technologies (FET) programme within the Seventh Framework Programme for Research of the European Commission, under the FET-Open grant agreement FOX, number FP7-ICT-233599.	-
dc.language.iso	en	-
dc.publisher	ASSOC COMPUTING MACHINERY	-
dc.subject.other	Algorithms; Languages; Theory; Regular expressions; schema inference; XML	-
dc.title	Inference of Concise Regular Expressions and DTDs	-
dc.type	Journal Contribution	-
dc.identifier.issue	2	-
dc.identifier.volume	35	-
local.format.pages	47	-
local.bibliographicCitation.jcat	A1	-
dc.description.notes	[Bex, Geert Jan; Neven, Frank] Hasselt Univ, Database & Theoret Comp Sci Res Grp, B-3590 Diepenbeek, Belgium. [Bex, Geert Jan; Neven, Frank] Transnatl Univ Limburg, B-3590 Diepenbeek, Belgium. [Schwentick, Thomas] TU Dortmund, Fak Informat, D-44227 Dortmund, Germany. [Vansummeren, Stijn] Univ Libre Bruxelles, Res Lab Web & Informat Technol WIT, B-1050 Brussels, Belgium. geertjan.bex@uhasselt.be; frank.neven@uhasselt.be; thomas.schwentick@udo.edu; stijn.vansummeren@ulb.ac.be	-
local.type.refereed	Refereed	-
local.type.specified	Article	-
dc.bibliographicCitation.oldjcat	A1	-
dc.identifier.doi	10.1145/1735886.1735890	-
dc.identifier.isi	000277925600004	-
item.fulltext	No Fulltext	-
item.accessRights	Closed Access	-
item.validation	ecoom 2011	-
item.fullcitation	BEX, Geert Jan; NEVEN, Frank; Schwentick, Thomas & VANSUMMEREN, Stijn (2010) Inference of Concise Regular Expressions and DTDs. In: ACM TRANSACTIONS ON DATABASE SYSTEMS, 35 (2).	-
item.contributor	BEX, Geert Jan	-
item.contributor	NEVEN, Frank	-
item.contributor	Schwentick, Thomas	-
item.contributor	VANSUMMEREN, Stijn	-
crisitem.journal.issn	0362-5915	-
crisitem.journal.eissn	1557-4644	-
Appears in Collections:	Research publications
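
As an illustration of the "single occurrence" property named in the abstract, the following is a minimal sketch, not taken from the article itself. It assumes that a content model is given as a plain string and that element names are identifier-like tokens; the helper names `repeated_names` and `is_sore` are hypothetical. It only checks that no element name is used more than once, which is the defining restriction of a SORE; it does not parse or validate the expression.

```python
import re
from collections import Counter

# Assumption: element names look like identifiers (letters, digits, '_', '.', '-').
NAME = re.compile(r"[A-Za-z_][\w.\-]*")


def repeated_names(content_model):
    """Return the element names that occur more than once, sorted."""
    counts = Counter(NAME.findall(content_model))
    return sorted(name for name, n in counts.items() if n > 1)


def is_sore(content_model):
    """True if every element name occurs at most once in the expression."""
    return not repeated_names(content_model)
```

For example, under these assumptions `is_sore("(title, author+, (year | date)?)")` holds, while `"(a, b)*, a"` is rejected because `a` occurs twice.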

SCOPUS™ Citations: 88 (checked on Jul 3, 2026)

WEB OF SCIENCE™ Citations: 60 (checked on Jul 3, 2026)
