Learning deterministic regular expressions for the inference of schemas from XML data

BEX, Geert Jan; GELADE, Wouter; NEVEN, Frank; VANSUMMEREN, Stijn

Please use this identifier to cite or link to this item: http://hdl.handle.net/1942/8500

Full metadata record

DC Field	Value	Language
dc.contributor.author	BEX, Geert Jan	-
dc.contributor.author	GELADE, Wouter	-
dc.contributor.author	NEVEN, Frank	-
dc.contributor.author	VANSUMMEREN, Stijn	-
dc.date.accessioned	2008-09-29T07:25:39Z	-
dc.date.available	2008-09-29T07:25:39Z	-
dc.date.issued	2008	-
dc.identifier.citation	Huai, Jinpeng & Chen, Robin & Liu, Hsiao-Wuen & Ma, Wei-Ying & Tomkins, Andrew & Zhang, Xiadong (Ed.) Proceedings of the 17th International Conference on World Wide Web. p. 825-834.	-
dc.identifier.isbn	978-1-60558-085-2	-
dc.identifier.uri	http://hdl.handle.net/1942/8500	-
dc.description.abstract	Inferring an appropriate DTD or XML Schema Definition (XSD) for a given collection of XML documents essentially reduces to learning deterministic regular expressions from sets of positive example words. Unfortunately, there is no algorithm capable of learning the complete class of deterministic regular expressions from positive examples only, as we will show. The regular expressions occurring in practical DTDs and XSDs, however, are such that every alphabet symbol occurs only a small number of times. As such, in practice it suffices to learn the subclass of regular expressions in which each alphabet symbol occurs at most k times, for some small k. We refer to such expressions as k-occurrence regular expressions (k-OREs for short). Motivated by this observation, we provide a probabilistic algorithm that learns k-OREs for increasing values of k, and selects the one that best describes the sample based on a Minimum Description Length argument. The effectiveness of the method is empirically validated both on real-world and synthetic data. Furthermore, the method is shown to be conservative over the simpler classes of expressions considered in previous work.	-
dc.language.iso	en	-
dc.publisher	ACM	-
dc.subject.other	XML, regular expressions, schema inference	-
dc.title	Learning deterministic regular expressions for the inference of schemas from XML data	-
dc.type	Proceedings Paper	-
local.bibliographicCitation.authors	Huai, Jinpeng	-
local.bibliographicCitation.authors	Chen, Robin	-
local.bibliographicCitation.authors	Liu, Hsiao-Wuen	-
local.bibliographicCitation.authors	Ma, Wei-Ying	-
local.bibliographicCitation.authors	Tomkins, Andrew	-
local.bibliographicCitation.authors	Zhang, Xiadong	-
local.bibliographicCitation.conferencedate	April 21-25	-
local.bibliographicCitation.conferencename	International Conference on World Wide Web	-
dc.bibliographicCitation.conferencenr	17	-
local.bibliographicCitation.conferenceplace	Beijing, China	-
dc.identifier.epage	834	-
dc.identifier.spage	825	-
local.bibliographicCitation.jcat	C1	-
local.type.refereed	Refereed	-
local.type.specified	Proceedings Paper	-
dc.bibliographicCitation.oldjcat	C2	-
dc.identifier.url	http://doi.acm.org/10.1145/1367497.1367609	-
local.bibliographicCitation.btitle	Proceedings of the 17th International Conference on World Wide Web	-
item.contributor	BEX, Geert Jan	-
item.contributor	GELADE, Wouter	-
item.contributor	NEVEN, Frank	-
item.contributor	VANSUMMEREN, Stijn	-
item.accessRights	Closed Access	-
item.fulltext	With Fulltext	-
item.fullcitation	BEX, Geert Jan; GELADE, Wouter; NEVEN, Frank & VANSUMMEREN, Stijn (2008) Learning deterministic regular expressions for the inference of schemas from XML data. In: Huai, Jinpeng & Chen, Robin & Liu, Hsiao-Wuen & Ma, Wei-Ying & Tomkins, Andrew & Zhang, Xiadong (Ed.) Proceedings of the 17th International Conference on World Wide Web. p. 825-834.	-
Appears in Collections:	Research publications

Files in This Item:

File	Description	Size	Format
www08.pdf	Non Peer-reviewed author version	286.4 kB	Adobe PDF
