Title: Inference of Concise DTDs from XML Data
Authors: BEX, Geert Jan; NEVEN, Frank; Schwentick, T; TUYLS, Karl
Issue Date: 2006
Publisher: ACM Press
Source: Dayal, Umeshwar & Whang, Kyu-Young & Lomet, David B. (Eds.), Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB '06), p. 115-126.
Abstract: We consider the problem of inferring a concise Document Type Definition (DTD) for a given set of XML documents, a problem which essentially reduces to learning concise regular expressions from positive example strings. We identify two such classes: single occurrence regular expressions (SOREs) and chain regular expressions (CHAREs). Both classes capture the vast majority of the regular expressions occurring in practical DTDs and are succinct by definition. We present the algorithm iDTD (infer DTD) that learns SOREs from strings by first inferring an automaton using known techniques and then translating that automaton into a corresponding SORE, possibly repairing the automaton when no equivalent SORE can be found. In the process, we introduce a novel automaton-to-regular-expression rewrite technique which is of independent interest. We show that iDTD outperforms existing systems in accuracy, conciseness and speed. In a scenario where only a very small amount of XML data is available, for instance when generated by Web service requests or by answers to queries, iDTD produces regular expressions which are too specific. Therefore, we introduce a novel learning algorithm crx that directly infers CHAREs (which form a subclass of SOREs) without going through an automaton representation. We show that crx performs very well within its target class on very small data sets. Finally, we discuss incremental computation, noise, numerical predicates, and the generation of XML Schemas.
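Illustrative note (not part of the abstract): the two ideas the abstract leans on can be sketched in a few lines of Python. The first function checks the "single occurrence" property, reading it literally as "every element name appears at most once in the expression". The second builds a simple automaton from example strings by recording which symbols start a string, end a string, or follow one another. This is one well-known style of automaton inference and is only assumed here to resemble the "known techniques" the abstract mentions. All function names are hypothetical, and neither function is the paper's iDTD or crx algorithm.

```python
import re
from collections import Counter


def element_names(content_model):
    """Return the element names of a DTD-style content model, in order.

    Simplification (assumption): keywords such as #PCDATA, EMPTY and ANY
    are not treated specially, so use plain element names only.
    """
    return re.findall(r"[A-Za-z_][\w.\-:]*", content_model)


def is_single_occurrence(content_model):
    """True if every element name occurs at most once in the expression."""
    counts = Counter(element_names(content_model))
    return all(n == 1 for n in counts.values())


def two_gram_automaton(samples):
    """Summarise example strings as (initial symbols, final symbols, edges).

    Each sample is a sequence of element names. An edge (a, b) means that
    b was observed directly after a in at least one sample. Empty samples
    are skipped, so acceptance of the empty string is not recorded.
    """
    initial, final, edges = set(), set(), set()
    for sample in samples:
        if not sample:
            continue
        initial.add(sample[0])
        final.add(sample[-1])
        edges.update(zip(sample, sample[1:]))
    return initial, final, edges


if __name__ == "__main__":
    print(is_single_occurrence("(title, author+, (year | date)?)"))
    print(is_single_occurrence("(a, b, a)"))
    print(two_gram_automaton([["a", "b", "c"], ["a", "c"]]))
```

Because each element name labels at most one state in such an automaton, turning it into an expression in which every name occurs once is a natural next step. How that translation and the repair step work is described in the paper itself and is not reproduced here.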
Keywords: XML document, DTD
Document URI:
Category: C2
Type: Proceedings Paper
Appears in Collections: Research publications

Files in This Item:
File         Description  Size       Format
bexVLDB.pdf  Postprint    635.38 kB  Adobe PDF

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.