Inference of Concise DTDs from XML Data

BEX, Geert Jan; NEVEN, Frank; Schwentick, T; TUYLS, Karl

Please use this identifier to cite or link to this item: http://hdl.handle.net/1942/1416

Full metadata record

DC Field	Value	Language
dc.contributor.author	BEX, Geert Jan	-
dc.contributor.author	NEVEN, Frank	-
dc.contributor.author	Schwentick, T	-
dc.contributor.author	TUYLS, Karl	-
dc.date.accessioned	2007-05-03T09:33:16Z	-
dc.date.available	2007-05-03T09:33:16Z	-
dc.date.issued	2006	-
dc.identifier.citation	Dayal, Umeshwar & Whang, Kyu-Young & Lomet, David B. (Ed.) Proceedings of the 32nd International Conference on Very Large Databases (VLDB' 06). p. 115-126.	-
dc.identifier.uri	http://hdl.handle.net/1942/1416	-
dc.description.abstract	We consider the problem of inferring a concise Document Type Definition (DTD) for a given set of XML documents, a problem that essentially reduces to learning concise regular expressions from positive example strings. We identify two such classes: single occurrence regular expressions (SOREs) and chain regular expressions (CHAREs). Both classes capture the vast majority of the regular expressions occurring in practical DTDs and are succinct by definition. We present the algorithm iDTD (infer DTD), which learns SOREs from strings by first inferring an automaton using known techniques and then translating that automaton into a corresponding SORE, possibly repairing the automaton when no equivalent SORE can be found. In the process, we introduce a novel technique for rewriting automata into regular expressions, which is of independent interest. We show that iDTD outperforms existing systems in accuracy, conciseness, and speed. In a scenario where only a very small amount of XML data is available, for instance when it is generated by Web service requests or by answers to queries, iDTD produces regular expressions that are too specific. Therefore, we introduce a novel learning algorithm, crx, that directly infers CHAREs (which form a subclass of SOREs) without going through an automaton representation. We show that crx performs very well within its target class on very small data sets. Finally, we discuss incremental computation, noise, numerical predicates, and the generation of XML Schemas.	-
dc.format.extent	650624 bytes	-
dc.format.mimetype	application/pdf	-
dc.language.iso	en	-
dc.publisher	ACM Press	-
dc.subject.other	XML document, DTD	-
dc.title	Inference of Concise DTDs from XML Data	-
dc.type	Proceedings Paper	-
local.bibliographicCitation.authors	Dayal, Umeshwar	-
local.bibliographicCitation.authors	Whang, Kyu-Young	-
local.bibliographicCitation.authors	Lomet, David B.	-
local.bibliographicCitation.conferencedate	SEP 12-15, 2006	-
local.bibliographicCitation.conferencename	Very Large Databases (VLDB' 06)	-
dc.bibliographicCitation.conferencenr	32	-
local.bibliographicCitation.conferenceplace	Seoul, Korea	-
dc.identifier.epage	126	-
dc.identifier.spage	115	-
local.bibliographicCitation.jcat	C2	-
local.type.specified	Proceedings Paper	-
dc.bibliographicCitation.oldjcat	C2	-
local.bibliographicCitation.btitle	Proceedings of the 32nd International Conference on Very Large Databases (VLDB' 06)	-
item.accessRights	Open Access	-
item.contributor	BEX, Geert Jan	-
item.contributor	NEVEN, Frank	-
item.contributor	Schwentick, T	-
item.contributor	TUYLS, Karl	-
item.fulltext	With Fulltext	-
item.fullcitation	BEX, Geert Jan; NEVEN, Frank; Schwentick, T & TUYLS, Karl (2006) Inference of Concise DTDs from XML Data. In: Dayal, Umeshwar & Whang, Kyu-Young & Lomet, David B. (Ed.) Proceedings of the 32nd International Conference on Very Large Databases (VLDB' 06). p. 115-126.	-
Appears in Collections:	Research publications
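
As an illustration of the "single occurrence" property named in the abstract, the following minimal sketch checks whether every element name in a DTD content model occurs at most once. It is not the paper's iDTD or crx algorithm, only a membership test on the syntax of an expression. The function names `element_names` and `is_single_occurrence` and the tokenizing pattern are assumptions made for this example.

```python
import re
from collections import Counter

# Assumed tokenizer: XML-style names, plus DTD keywords such as #PCDATA.
_NAME = re.compile(r"#?[A-Za-z_:][\w.\-:]*")


def element_names(content_model: str) -> list[str]:
    """Return the element names in a content model, in order of appearance.

    Keywords starting with '#' (for example #PCDATA) are not element names
    and are skipped.
    """
    return [t for t in _NAME.findall(content_model) if not t.startswith("#")]


def is_single_occurrence(content_model: str) -> bool:
    """True if no element name appears more than once in the expression."""
    counts = Counter(element_names(content_model))
    return all(n == 1 for n in counts.values())


if __name__ == "__main__":
    # Each name occurs once, so this expression has the single occurrence property.
    print(is_single_occurrence("(title, author+, (year | date)?)"))
    # 'a' occurs twice, so this one does not.
    print(is_single_occurrence("(a, (b | a)*)"))
```

The check is purely syntactic: it says nothing about whether a language that fails the test could be rewritten as an equivalent expression that passes it.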

Files in This Item:

File	Description	Size	Format
bexVLDB.pdf	Peer-reviewed author version	635.38 kB	Adobe PDF
