Please use this identifier to cite or link to this item: http://hdl.handle.net/1942/11342
Title: Inference of Concise Regular Expressions and DTDs
Authors: BEX, Geert Jan 
NEVEN, Frank 
Schwentick, Thomas
VANSUMMEREN, Stijn 
Issue Date: 2010
Publisher: ASSOC COMPUTING MACHINERY
Source: ACM TRANSACTIONS ON DATABASE SYSTEMS, 35 (2)
Abstract: We consider the problem of inferring a concise Document Type Definition (DTD) for a given set of XML-documents, a problem that basically reduces to learning concise regular expressions from positive examples strings. We identify two classes of concise regular expressions-the single occurrence regular expressions (SOREs) and the chain regular expressions (CHAREs)-that capture the far majority of expressions used in practical DTDs. For the inference of SOREs we present several algorithms that first infer an automaton for a given set of example strings and then translate that automaton to a corresponding SORE, possibly repairing the automaton when no equivalent SORE can be found. In the process, we introduce a novel automaton to regular expression rewrite technique which is of independent interest. When only a very small amount of XML data is available, however ( for instance when the data is generated by Web service requests or by answers to queries), these algorithms produce regular expressions that are too specific. Therefore, we introduce a novel learning algorithm CRX that directly infers CHAREs ( which form a subclass of SOREs) without going through an automaton representation. We show that CRX performs very well within its target class on very small datasets.
Notes: [Bex, Geert Jan; Neven, Frank] Hasselt Univ, Database & Theoret Comp Sci Res Grp, B-3590 Diepenbeek, Belgium. [Bex, Geert Jan; Neven, Frank] Transnatl Univ Limburg, B-3590 Diepenbeek, Belgium. [Schwentick, Thomas] TU Dortmund, Fak Informat, D-44227 Dortmund, Germany. [Vansummeren, Stijn] Univ Libre Bruxelles, Res Lab Web & Informat Technol WIT, B-1050 Brussels, Belgium. geertjan.bex@uhasselt.be; frank.neven@uhasselt.be; thomas.schwentick@udo.edu; stijn.vansummeren@ulb.ac.be
Keywords: Algorithms; Languages; Theory; Regular expressions; schema inference; XML
Document URI: http://hdl.handle.net/1942/11342
ISSN: 0362-5915
e-ISSN: 1557-4644
DOI: 10.1145/1735886.1735890
ISI #: 000277925600004
Category: A1
Type: Journal Contribution
Validations: ecoom 2011
Appears in Collections:Research publications

Show full item record

SCOPUSTM   
Citations

67
checked on Sep 5, 2020

WEB OF SCIENCETM
Citations

57
checked on Apr 22, 2024

Page view(s)

88
checked on Jun 16, 2023

Google ScholarTM

Check

Altmetric


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.