Please use this identifier to cite or link to this item: http://hdl.handle.net/1942/11295
Title: Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data
Authors: BEX, Geert Jan 
GELADE, Wouter 
NEVEN, Frank 
VANSUMMEREN, Stijn 
Issue Date: 2010
Publisher: ASSOC COMPUTING MACHINERY
Source: ACM TRANSACTIONS ON THE WEB, 4(4)
Abstract: Inferring an appropriate DTD or XML Schema Definition (XSD) for a given collection of XML documents essentially reduces to learning deterministic regular expressions from sets of positive example words. Unfortunately, there is no algorithm capable of learning the complete class of deterministic regular expressions from positive examples only, as we will show. The regular expressions occurring in practical DTDs and XSDs, however, are such that every alphabet symbol occurs only a small number of times. As such, in practice it suffices to learn the subclass of deterministic regular expressions in which each alphabet symbol occurs at most k times, for some small k. We refer to such expressions as k-occurrence regular expressions (k-OREs for short). Motivated by this observation, we provide a probabilistic algorithm that learns k-OREs for increasing values of k, and selects the deterministic one that best describes the sample based on a Minimum Description Length argument. The effectiveness of the method is empirically validated both on real world and synthetic data. Furthermore, the method is shown to be conservative over the simpler classes of expressions considered in previous work.
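Illustration (not part of the published abstract): the k-occurrence condition can be checked by counting how often each symbol appears in an expression. The sketch below is a minimal example under stated assumptions: the helper names `occurrence_bound` and `is_k_ore` are hypothetical, and symbols are assumed to be identifier-like tokens such as element names. It checks only the occurrence bound, not determinism, and it is not the learning algorithm described in the paper.

```python
import re
from collections import Counter

# Assumption: alphabet symbols are identifier-like tokens (e.g. element names);
# everything else (parentheses, |, ?, *, +, commas, spaces) is treated as an operator.
SYMBOL = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def occurrence_bound(expr: str) -> int:
    """Return the smallest k such that no symbol occurs more than k times in expr."""
    counts = Counter(SYMBOL.findall(expr))
    return max(counts.values(), default=0)


def is_k_ore(expr: str, k: int) -> bool:
    """Check the k-occurrence condition only; determinism is NOT checked here."""
    return occurrence_bound(expr) <= k


if __name__ == "__main__":
    print(is_k_ore("title (author | editor)+ year?", 1))  # every symbol occurs once
    print(is_k_ore("a b (a | c)*", 1))                    # 'a' occurs twice
```

Under these assumptions, an expression in which every element name appears once satisfies the condition for k = 1, while an expression that repeats a name needs a larger k.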
Notes: [Bex, Geert Jan; Gelade, Wouter; Neven, Frank] Hasselt Univ, Database & Theoret Comp Sci Res Grp, B-3590 Diepenbeek, Belgium. [Bex, Geert Jan; Gelade, Wouter; Neven, Frank] Transnat Univ Limburg, B-3590 Diepenbeek, Belgium. [Vansummeren, Stijn] Univ Libre Bruxelles, Res Lab Web & Informat Technol WIT, B-1050 Brussels, Belgium. geertjan.bex@uhasselt.be; wouter.gelade@uhasselt.be; frank.neven@uhasselt.be; stijn.vansummeren@ulb.ac.be
Keywords: Algorithms; Languages; Theory; Regular expressions; schema inference; XML
Document URI: http://hdl.handle.net/1942/11295
ISSN: 1559-1131
e-ISSN: 1559-114X
DOI: 10.1145/1841909.1841911
ISI #: 000282756100002
Category: A1
Type: Journal Contribution
Validations: ecoom 2011
Appears in Collections: Research publications

Scopus citations: 57 (checked on Sep 2, 2020)
Web of Science citations: 37 (checked on May 21, 2022)
Page views: 54 (checked on May 23, 2022)

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.