Please use this identifier to cite or link to this item:
http://hdl.handle.net/1942/1416
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.author | BEX, Geert Jan | - |
dc.contributor.author | NEVEN, Frank | - |
dc.contributor.author | Schwentick, T | - |
dc.contributor.author | TUYLS, Karl | - |
dc.date.accessioned | 2007-05-03T09:33:16Z | - |
dc.date.available | 2007-05-03T09:33:16Z | - |
dc.date.issued | 2006 | - |
dc.identifier.citation | Dayal, Umeshwar & Whang, Kyu-Young & Lomet, David B. (Ed.) Proceedings of the 32nd International Conference on Very Large Databases (VLDB' 06). p. 115-126. | - |
dc.identifier.uri | http://hdl.handle.net/1942/1416 | - |
dc.description.abstract | We consider the problem of inferring a concise Document Type Definition (DTD) for a given set of XML documents, a problem which basically reduces to learning concise regular expressions from positive example strings. We identify two such classes: single occurrence regular expressions (SOREs) and chain regular expressions (CHAREs). Both classes capture the vast majority of the regular expressions occurring in practical DTDs and are succinct by definition. We present the algorithm iDTD (infer DTD) that learns SOREs from strings by first inferring an automaton by known techniques and then translating that automaton to a corresponding SORE, possibly by repairing the automaton when no equivalent SORE can be found. In the process, we introduce a novel automaton-to-regular-expression rewrite technique which is of independent interest. We show that iDTD outperforms existing systems in accuracy, conciseness and speed. In a scenario where only a very small amount of XML data is available, for instance when generated by Web service requests or by answers to queries, iDTD produces regular expressions which are too specific. Therefore, we introduce a novel learning algorithm crx that directly infers CHAREs (which form a subclass of SOREs) without going through an automaton representation. We show that crx performs very well within its target class on very small data sets. Finally, we discuss incremental computation, noise, numerical predicates, and the generation of XML Schemas. | - |
dc.format.extent | 650624 bytes | - |
dc.format.mimetype | application/pdf | - |
dc.language.iso | en | - |
dc.publisher | ACM Press | - |
dc.subject.other | XML document, DTD | - |
dc.title | Inference of Concise DTDs from XML Data | - |
dc.type | Proceedings Paper | - |
local.bibliographicCitation.authors | Dayal, Umeshwar | - |
local.bibliographicCitation.authors | Whang, Kyu-Young | - |
local.bibliographicCitation.authors | Lomet, David B. | - |
local.bibliographicCitation.conferencedate | SEP 12-15, 2006 | - |
local.bibliographicCitation.conferencename | Very Large Databases (VLDB' 06) | - |
dc.bibliographicCitation.conferencenr | 32 | - |
local.bibliographicCitation.conferenceplace | Seoul, Korea | - |
dc.identifier.epage | 126 | - |
dc.identifier.spage | 115 | - |
local.bibliographicCitation.jcat | C2 | - |
local.type.specified | Proceedings Paper | - |
dc.bibliographicCitation.oldjcat | C2 | - |
local.bibliographicCitation.btitle | Proceedings of the 32nd International Conference on Very Large Databases (VLDB' 06) | - |
item.fulltext | With Fulltext | - |
item.accessRights | Open Access | - |
item.fullcitation | BEX, Geert Jan; NEVEN, Frank; Schwentick, T & TUYLS, Karl (2006) Inference of Concise DTDs from XML Data. In: Dayal, Umeshwar & Whang, Kyu-Young & Lomet, David B. (Ed.) Proceedings of the 32nd International Conference on Very Large Databases (VLDB' 06). p. 115-126. | - |
item.contributor | BEX, Geert Jan | - |
item.contributor | NEVEN, Frank | - |
item.contributor | Schwentick, T | - |
item.contributor | TUYLS, Karl | - |
Appears in Collections: | Research publications |
Files in This Item:
File | Description | Size | Format |
---|---|---|---|
bexVLDB.pdf | Peer-reviewed author version | 635.38 kB | Adobe PDF |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.