Please use this identifier to cite or link to this item: http://hdl.handle.net/1942/8500
Full metadata record
DC Field | Value | Language
dc.contributor.author | BEX, Geert Jan | -
dc.contributor.author | GELADE, Wouter | -
dc.contributor.author | NEVEN, Frank | -
dc.contributor.author | VANSUMMEREN, Stijn | -
dc.date.accessioned | 2008-09-29T07:25:39Z | -
dc.date.available | 2008-09-29T07:25:39Z | -
dc.date.issued | 2008 | -
dc.identifier.citation | Huai, Jinpeng & Chen, Robin & Liu, Hsiao-Wuen & Ma, Wei-Ying & Tomkins, Andrew & Zhang, Xiadong (Ed.) Proceedings of the 17th International Conference on World Wide Web. p. 825-834. | -
dc.identifier.isbn | 978-1-60558-085-2 | -
dc.identifier.uri | http://hdl.handle.net/1942/8500 | -
dc.description.abstract | Inferring an appropriate DTD or XML Schema Definition (XSD) for a given collection of XML documents essentially reduces to learning deterministic regular expressions from sets of positive example words. Unfortunately, there is no algorithm capable of learning the complete class of deterministic regular expressions from positive examples only, as we will show. The regular expressions occurring in practical DTD's and XSD's, however, are such that every alphabet symbol occurs only a small number of times. As such, in practice it suffices to learn the subclass of regular expressions in which each alphabet symbol occurs at most k times, for some small k. We refer to such expressions as k-occurrence regular expressions (k-OREs for short). Motivated by this observation, we provide a probabilistic algorithm that learns k-OREs for increasing values of k, and selects the one that best describes the sample based on a Minimum Description Length argument. The effectiveness of the method is empirically validated both on real world and synthetic data. Furthermore, the method is shown to be conservative over the simpler classes of expressions considered in previous work. | -
dc.language.iso | en | -
dc.publisher | ACM | -
dc.subject.other | XML, regular expressions, schema inference | -
dc.title | Learning deterministic regular expressions for the inference of schemas from XML data | -
dc.type | Proceedings Paper | -
local.bibliographicCitation.authors | Huai, Jinpeng | -
local.bibliographicCitation.authors | Chen, Robin | -
local.bibliographicCitation.authors | Liu, Hsiao-Wuen | -
local.bibliographicCitation.authors | Ma, Wei-Ying | -
local.bibliographicCitation.authors | Tomkins, Andrew | -
local.bibliographicCitation.authors | Zhang, Xiadong | -
local.bibliographicCitation.conferencedate | April 21-25 | -
local.bibliographicCitation.conferencename | International Conference on World Wide Web | -
dc.bibliographicCitation.conferencenr | 17 | -
local.bibliographicCitation.conferenceplace | Beijing, China | -
dc.identifier.epage | 834 | -
dc.identifier.spage | 825 | -
local.bibliographicCitation.jcat | C1 | -
local.type.refereed | Refereed | -
local.type.specified | Proceedings Paper | -
dc.bibliographicCitation.oldjcat | C2 | -
dc.identifier.url | http://doi.acm.org/10.1145/1367497.1367609 | -
local.bibliographicCitation.btitle | Proceedings of the 17th International Conference on World Wide Web | -
item.fulltext | With Fulltext | -
item.accessRights | Open Access | -
item.fullcitation | BEX, Geert Jan; GELADE, Wouter; NEVEN, Frank & VANSUMMEREN, Stijn (2008) Learning deterministic regular expressions for the inference of schemas from XML data. In: Huai, Jinpeng & Chen, Robin & Liu, Hsiao-Wuen & Ma, Wei-Ying & Tomkins, Andrew & Zhang, Xiadong (Ed.) Proceedings of the 17th International Conference on World Wide Web. p. 825-834. | -
item.contributor | VANSUMMEREN, Stijn | -
item.contributor | BEX, Geert Jan | -
item.contributor | NEVEN, Frank | -
item.contributor | GELADE, Wouter | -
Appears in Collections: Research publications
Files in This Item:
File | Description | Size | Format
www08.pdf | Preprint | 286.4 kB | Adobe PDF
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.