Please use this identifier to cite or link to this item: http://hdl.handle.net/1942/8912
Title: Discovering structure in semi-structured data
Authors: BEX, Geert Jan 
Advisors: NEVEN, Frank
Issue Date: 2008
Publisher: UHasselt Diepenbeek
Abstract: Excerpt of introduction: Unfortunately, in spite of the above-mentioned advantages, the presence of a schema is not mandatory and many XML documents are not accompanied by one. For instance, in a recent study, Barbosa et al. have shown that approximately half of the XML documents available on the web do not refer to a schema. In another study, we have noted that about two-thirds of XSDs gathered from schema repositories and from the web are not valid with respect to the W3C XML Schema specification, rendering them essentially useless for immediate application (see Chapter 6). A similar observation was made by Sahuguet concerning DTDs. Given the lack of schemas in practice, it is essential to devise algorithms that can infer a schema for a given collection of XML documents when none, or no syntactically correct one, is present. This is also acknowledged by Florescu, who emphasizes in the context of data integration: “We need to extract good-quality schemas automatically from existing data and perform incremental maintenance of the generated schemas.” It should be noted that even when a schema is already available, there are situations where inference can be useful. One such situation is schema cleaning: sometimes a schema is too general with respect to the XML data that it is supposed to describe. In that case, it can be advantageous to infer a new schema based solely on the data at hand.... In general, schema inference can be used to restrict schemas to a relevant subset of data needed by the application at hand, thereby facilitating difficult tasks like schema matching and data integration. Indeed, as argued by Hinkelman [Hin05], industry-level standards are too loosely defined in general, which can result in XML schemas where many business structures are formally specified as being optional....
Based on the above observations, it is hence essential to devise algorithms that can automatically infer a DTD or XSD from a given corpus of XML documents....
Notes: Doctorate in Sciences: Computer Science
Document URI: http://hdl.handle.net/1942/8912
Category: T1
Type: Theses and Dissertations
Appears in Collections: PhD theses; Research publications

Files in This Item:
File: GeertJanBex.pdf
Size: 1.43 MB
Format: Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.