Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data

BEX, Geert Jan; GELADE, Wouter; NEVEN, Frank; VANSUMMEREN, Stijn

Please use this identifier to cite or link to this item: http://hdl.handle.net/1942/11295

Full metadata record

DC Field	Value	Language
dc.contributor.author	BEX, Geert Jan	-
dc.contributor.author	GELADE, Wouter	-
dc.contributor.author	NEVEN, Frank	-
dc.contributor.author	VANSUMMEREN, Stijn	-
dc.date.accessioned	2010-11-10T07:54:49Z	-
dc.date.available	NO_RESTRICTION	-
dc.date.available	2010-11-10T07:54:49Z	-
dc.date.issued	2010	-
dc.identifier.citation	ACM TRANSACTIONS ON THE WEB, 4(4)	-
dc.identifier.issn	1559-1131	-
dc.identifier.uri	http://hdl.handle.net/1942/11295	-
dc.description.abstract	Inferring an appropriate DTD or XML Schema Definition (XSD) for a given collection of XML documents essentially reduces to learning deterministic regular expressions from sets of positive example words. Unfortunately, there is no algorithm capable of learning the complete class of deterministic regular expressions from positive examples only, as we will show. The regular expressions occurring in practical DTDs and XSDs, however, are such that every alphabet symbol occurs only a small number of times. As such, in practice it suffices to learn the subclass of deterministic regular expressions in which each alphabet symbol occurs at most k times, for some small k. We refer to such expressions as k-occurrence regular expressions (k-OREs for short). Motivated by this observation, we provide a probabilistic algorithm that learns k-OREs for increasing values of k, and selects the deterministic one that best describes the sample based on a Minimum Description Length argument. The effectiveness of the method is empirically validated both on real world and synthetic data. Furthermore, the method is shown to be conservative over the simpler classes of expressions considered in previous work.	-
dc.description.sponsorship	This work was funded by FWO-G.0821.09N and the Future and Emerging Technologies (FET) program within the Seventh Framework Program for Research of the European Commission, under the FET-Open grant agreement FOX, number FP7-ICT-233599.	-
dc.language.iso	en	-
dc.publisher	ASSOC COMPUTING MACHINERY	-
dc.subject.other	Algorithms; Languages; Theory; Regular expressions; schema inference; XML	-
dc.title	Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data	-
dc.type	Journal Contribution	-
dc.identifier.issue	4	-
dc.identifier.volume	4	-
local.format.pages	32	-
local.bibliographicCitation.jcat	A1	-
dc.description.notes	[Bex, Geert Jan; Gelade, Wouter; Neven, Frank] Hasselt Univ, Database & Theoret Comp Sci Res Grp, B-3590 Diepenbeek, Belgium. [Bex, Geert Jan; Gelade, Wouter; Neven, Frank] Transnat Univ Limburg, B-3590 Diepenbeek, Belgium. [Vansummeren, Stijn] Univ Libre Bruxelles, Res Lab Web & Informat Technol WIT, B-1050 Brussels, Belgium. geertjan.bex@uhasselt.be; wouter.gelade@uhasselt.be; frank.neven@uhasselt.be; stijn.vansummeren@ulb.ac.be	-
local.type.refereed	Refereed	-
local.type.specified	Article	-
dc.bibliographicCitation.oldjcat	A1	-
dc.identifier.doi	10.1145/1841909.1841911	-
dc.identifier.isi	000282756100002	-
item.fulltext	No Fulltext	-
item.accessRights	Closed Access	-
item.validation	ecoom 2011	-
item.fullcitation	BEX, Geert Jan; GELADE, Wouter; NEVEN, Frank & VANSUMMEREN, Stijn (2010) Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data. In: ACM TRANSACTIONS ON THE WEB, 4(4).	-
item.contributor	BEX, Geert Jan	-
item.contributor	GELADE, Wouter	-
item.contributor	NEVEN, Frank	-
item.contributor	VANSUMMEREN, Stijn	-
crisitem.journal.issn	1559-1131	-
crisitem.journal.eissn	1559-114X	-
Appears in Collections:	Research publications
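
The k-occurrence condition described in the abstract (each alphabet symbol occurs at most k times in the expression) can be illustrated with a small sketch. This is not the authors' learning algorithm: it only counts symbol occurrences in an expression written in a DTD-like syntax, and it does not test determinism. The function names `occurrence_bound` and `is_k_ore` and the assumption that symbols are plain identifiers are hypothetical choices made for illustration.

```python
import re
from collections import Counter


def occurrence_bound(expr: str) -> int:
    """Return the smallest k such that every symbol occurs at most k times.

    Assumption: symbols are identifiers (element names); operators such as
    '|', '*', '+', '?', ',' and parentheses are ignored.
    """
    symbols = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", expr)
    counts = Counter(symbols)
    # An expression without symbols has an occurrence bound of 0.
    return max(counts.values(), default=0)


def is_k_ore(expr: str, k: int) -> bool:
    """Check only the occurrence bound of a k-ORE, not determinism."""
    return occurrence_bound(expr) <= k


if __name__ == "__main__":
    # Every element name occurs once, so the bound is 1.
    print(occurrence_bound("title, author+, (year | date)?"))
    # The symbol 'a' occurs twice, so the bound is 2.
    print(occurrence_bound("a (b | c)* a?"))
```

Under these assumptions, an expression such as `a (b | c)* a?` satisfies the occurrence bound for k = 2 but not for k = 1, which is why the abstract describes learning expressions for increasing values of k.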


SCOPUS™ citations: 76 (checked on Jul 3, 2026)
WEB OF SCIENCE™ citations: 51 (checked on Jul 3, 2026)
