Please use this identifier to cite or link to this item:
http://hdl.handle.net/1942/11342
Title: | Inference of Concise Regular Expressions and DTDs | Authors: | BEX, Geert Jan NEVEN, Frank Schwentick, Thomas VANSUMMEREN, Stijn |
Issue Date: | 2010 | Publisher: | ASSOC COMPUTING MACHINERY | Source: | ACM TRANSACTIONS ON DATABASE SYSTEMS, 35 (2) | Abstract: | We consider the problem of inferring a concise Document Type Definition (DTD) for a given set of XML-documents, a problem that basically reduces to learning concise regular expressions from positive examples strings. We identify two classes of concise regular expressions-the single occurrence regular expressions (SOREs) and the chain regular expressions (CHAREs)-that capture the far majority of expressions used in practical DTDs. For the inference of SOREs we present several algorithms that first infer an automaton for a given set of example strings and then translate that automaton to a corresponding SORE, possibly repairing the automaton when no equivalent SORE can be found. In the process, we introduce a novel automaton to regular expression rewrite technique which is of independent interest. When only a very small amount of XML data is available, however ( for instance when the data is generated by Web service requests or by answers to queries), these algorithms produce regular expressions that are too specific. Therefore, we introduce a novel learning algorithm CRX that directly infers CHAREs ( which form a subclass of SOREs) without going through an automaton representation. We show that CRX performs very well within its target class on very small datasets. | Notes: | [Bex, Geert Jan; Neven, Frank] Hasselt Univ, Database & Theoret Comp Sci Res Grp, B-3590 Diepenbeek, Belgium. [Bex, Geert Jan; Neven, Frank] Transnatl Univ Limburg, B-3590 Diepenbeek, Belgium. [Schwentick, Thomas] TU Dortmund, Fak Informat, D-44227 Dortmund, Germany. [Vansummeren, Stijn] Univ Libre Bruxelles, Res Lab Web & Informat Technol WIT, B-1050 Brussels, Belgium. geertjan.bex@uhasselt.be; frank.neven@uhasselt.be; thomas.schwentick@udo.edu; stijn.vansummeren@ulb.ac.be | Keywords: | Algorithms; Languages; Theory; Regular expressions; schema inference; XML | Document URI: | http://hdl.handle.net/1942/11342 | ISSN: | 0362-5915 | e-ISSN: | 1557-4644 | DOI: | 10.1145/1735886.1735890 | ISI #: | 000277925600004 | Category: | A1 | Type: | Journal Contribution | Validations: | ecoom 2011 |
Appears in Collections: | Research publications |
Show full item record
SCOPUSTM
Citations
67
checked on Sep 5, 2020
WEB OF SCIENCETM
Citations
58
checked on Oct 14, 2024
Page view(s)
88
checked on Jun 16, 2023
Google ScholarTM
Check
Altmetric
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.