Please use this identifier to cite or link to this item:
http://hdl.handle.net/1942/1416
Title: | Inference of Concise DTDs from XML Data |
Authors: | BEX, Geert Jan; NEVEN, Frank; Schwentick, T; TUYLS, Karl |
Issue Date: | 2006 |
Publisher: | ACM Press |
Source: | Dayal, Umeshwar & Whang, Kyu-Young & Lomet, David B. (Ed.) Proceedings of the 32nd International Conference on Very Large Databases (VLDB '06). p. 115-126. |
Abstract: | We consider the problem to infer a concise Document Type Definition (DTD) for a given set of XML-documents, a problem which basically reduces to learning of concise regular expressions from positive example strings. We identify two such classes: single occurrence regular expressions (SOREs) and chain regular expressions (CHAREs). Both classes capture the far majority of the regular expressions occurring in practical DTDs and are succinct by definition. We present the algorithm iDTD (infer DTD) that learns SOREs from strings by first inferring an automaton by known techniques and then translating that automaton to a corresponding SORE, possibly by repairing the automaton when no equivalent SORE can be found. In the process, we introduce a novel automaton to regular expression rewrite technique which is of independent interest. We show that iDTD outperforms existing systems in accuracy, conciseness and speed. In a scenario where only a very small amount of XML data is available, for instance when generated by Web service requests or by answers to queries, iDTD produces regular expressions which are too specific. Therefore, we introduce a novel learning algorithm crx that directly infers CHAREs (which form a subclass of SOREs) without going through an automaton representation. We show that crx performs very well within its target class on very small data sets. Finally, we discuss incremental computation, noise, numerical predicates, and the generation of XML Schemas. |
Keywords: | XML document, DTD |
Document URI: | http://hdl.handle.net/1942/1416 |
Category: | C2 |
Type: | Proceedings Paper |
Appears in Collections: | Research publications |
Files in This Item:
File | Description | Size | Format |
---|---|---|---|
bexVLDB.pdf | Peer-reviewed author version | 635.38 kB | Adobe PDF |