Please use this identifier to cite or link to this item: http://hdl.handle.net/1942/1416
Full metadata record
DC Field | Value | Language
dc.contributor.author | BEX, Geert Jan | -
dc.contributor.author | NEVEN, Frank | -
dc.contributor.author | Schwentick, T | -
dc.contributor.author | TUYLS, Karl | -
dc.date.accessioned | 2007-05-03T09:33:16Z | -
dc.date.available | 2007-05-03T09:33:16Z | -
dc.date.issued | 2006 | -
dc.identifier.citation | Dayal, Umeshwar & Whang, Kyu-Young & Lomet, David B. (Ed.) Proceedings of the 32nd International Conference on Very Large Databases (VLDB' 06). p. 115-126. | -
dc.identifier.uri | http://hdl.handle.net/1942/1416 | -
dc.description.abstract | We consider the problem to infer a concise Document Type Definition (DTD) for a given set of XML-documents, a problem which basically reduces to learning of concise regular expressions from positive example strings. We identify two such classes: single occurrence regular expressions (SOREs) and chain regular expressions (CHAREs). Both classes capture the far majority of the regular expressions occurring in practical DTDs and are succinct by definition. We present the algorithm iDTD (infer DTD) that learns SOREs from strings by first inferring an automaton by known techniques and then translating that automaton to a corresponding SORE, possibly by repairing the automaton when no equivalent SORE can be found. In the process, we introduce a novel automaton to regular expression rewrite technique which is of independent interest. We show that iDTD outperforms existing systems in accuracy, conciseness and speed. In a scenario where only a very small amount of XML data is available, for instance when generated by Web service requests or by answers to queries, iDTD produces regular expressions which are too specific. Therefore, we introduce a novel learning algorithm crx that directly infers CHAREs (which form a subclass of SOREs) without going through an automaton representation. We show that crx performs very well within its target class on very small data sets. Finally, we discuss incremental computation, noise, numerical predicates, and the generation of XML Schemas. | -
dc.format.extent | 650624 bytes | -
dc.format.mimetype | application/pdf | -
dc.language.iso | en | -
dc.publisher | ACM Press | -
dc.subject.other | XML document, DTD | -
dc.title | Inference of Concise DTDs from XML Data | -
dc.type | Proceedings Paper | -
local.bibliographicCitation.authors | Dayal, Umeshwar | -
local.bibliographicCitation.authors | Whang, Kyu-Young | -
local.bibliographicCitation.authors | Lomet, David B. | -
local.bibliographicCitation.conferencedate | SEP 12-15, 2006 | -
local.bibliographicCitation.conferencename | Very Large Databases (VLDB' 06) | -
dc.bibliographicCitation.conferencenr | 32 | -
local.bibliographicCitation.conferenceplace | Seoul, Korea | -
dc.identifier.epage | 126 | -
dc.identifier.spage | 115 | -
local.bibliographicCitation.jcat | C2 | -
local.type.specified | Proceedings Paper | -
dc.bibliographicCitation.oldjcat | C2 | -
local.bibliographicCitation.btitle | Proceedings of the 32nd International Conference on Very Large Databases (VLDB' 06) | -
item.accessRights | Open Access |
item.fulltext | With Fulltext |
item.fullcitation | BEX, Geert Jan; NEVEN, Frank; Schwentick, T & TUYLS, Karl (2006) Inference of Concise DTDs from XML Data. In: Dayal, Umeshwar & Whang, Kyu-Young & Lomet, David B. (Ed.) Proceedings of the 32nd International Conference on Very Large Databases (VLDB' 06). p. 115-126. | -
item.contributor | BEX, Geert Jan |
item.contributor | NEVEN, Frank |
item.contributor | Schwentick, T |
item.contributor | TUYLS, Karl |
Appears in Collections: Research publications
Files in This Item:
File | Description | Size | Format
bexVLDB.pdf | Peer-reviewed author version | 635.38 kB | Adobe PDF

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.