Please use this identifier to cite or link to this item: http://hdl.handle.net/1942/11342
Full metadata record
DC FieldValueLanguage
dc.contributor.authorBEX, Geert Jan-
dc.contributor.authorNEVEN, Frank-
dc.contributor.authorSchwentick, Thomas-
dc.contributor.authorVANSUMMEREN, Stijn-
dc.date.accessioned2010-12-12T17:14:28Z-
dc.date.availableNO_RESTRICTION-
dc.date.available2010-12-12T17:14:28Z-
dc.date.issued2010-
dc.identifier.citationACM TRANSACTIONS ON DATABASE SYSTEMS, 35 (2)-
dc.identifier.issn0362-5915-
dc.identifier.urihttp://hdl.handle.net/1942/11342-
dc.description.abstractWe consider the problem of inferring a concise Document Type Definition (DTD) for a given set of XML-documents, a problem that basically reduces to learning concise regular expressions from positive examples strings. We identify two classes of concise regular expressions-the single occurrence regular expressions (SOREs) and the chain regular expressions (CHAREs)-that capture the far majority of expressions used in practical DTDs. For the inference of SOREs we present several algorithms that first infer an automaton for a given set of example strings and then translate that automaton to a corresponding SORE, possibly repairing the automaton when no equivalent SORE can be found. In the process, we introduce a novel automaton to regular expression rewrite technique which is of independent interest. When only a very small amount of XML data is available, however ( for instance when the data is generated by Web service requests or by answers to queries), these algorithms produce regular expressions that are too specific. Therefore, we introduce a novel learning algorithm CRX that directly infers CHAREs ( which form a subclass of SOREs) without going through an automaton representation. We show that CRX performs very well within its target class on very small datasets.-
dc.description.sponsorshipThis research was done while S. Vansummeren was a Postdoctoral Fellow of the Research Foundation-Flanders (FWO) at Hasselt University. This work was funded by FWO-G.0821.09N and the Future and Emerging Technologies (FET) programme within the Seventh Framework Programme for Research of the European Commision, under the FET-Open grant agreement FOX, number FP7-ICT-233599.-
dc.language.isoen-
dc.publisherASSOC COMPUTING MACHINERY-
dc.subject.otherAlgorithms; Languages; Theory; Regular expressions; schema inference; XML-
dc.titleInference of Concise Regular Expressions and DTDs-
dc.typeJournal Contribution-
dc.identifier.issue2-
dc.identifier.volume35-
local.format.pages47-
local.bibliographicCitation.jcatA1-
dc.description.notes[Bex, Geert Jan; Neven, Frank] Hasselt Univ, Database & Theoret Comp Sci Res Grp, B-3590 Diepenbeek, Belgium. [Bex, Geert Jan; Neven, Frank] Transnatl Univ Limburg, B-3590 Diepenbeek, Belgium. [Schwentick, Thomas] TU Dortmund, Fak Informat, D-44227 Dortmund, Germany. [Vansummeren, Stijn] Univ Libre Bruxelles, Res Lab Web & Informat Technol WIT, B-1050 Brussels, Belgium. geertjan.bex@uhasselt.be; frank.neven@uhasselt.be; thomas.schwentick@udo.edu; stijn.vansummeren@ulb.ac.be-
local.type.refereedRefereed-
local.type.specifiedArticle-
dc.bibliographicCitation.oldjcatA1-
dc.identifier.doi10.1145/1735886.1735890-
dc.identifier.isi000277925600004-
item.accessRightsClosed Access-
item.contributorBEX, Geert Jan-
item.contributorNEVEN, Frank-
item.contributorSchwentick, Thomas-
item.contributorVANSUMMEREN, Stijn-
item.fullcitationBEX, Geert Jan; NEVEN, Frank; Schwentick, Thomas & VANSUMMEREN, Stijn (2010) Inference of Concise Regular Expressions and DTDs. In: ACM TRANSACTIONS ON DATABASE SYSTEMS, 35 (2).-
item.validationecoom 2011-
item.fulltextNo Fulltext-
crisitem.journal.issn0362-5915-
crisitem.journal.eissn1557-4644-
Appears in Collections:Research publications
Show simple item record

SCOPUSTM   
Citations

67
checked on Sep 5, 2020

WEB OF SCIENCETM
Citations

57
checked on Apr 14, 2024

Page view(s)

88
checked on Jun 16, 2023

Google ScholarTM

Check

Altmetric


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.