Please use this identifier to cite or link to this item:
Title: Foundations of XML: Regular Expressions Revisited
Authors: GELADE, Wouter 
Advisors: NEVEN, Frank
Issue Date: 2009
Abstract: The regular languages constitute a highly robust class of languages. They can be represented by many different formalisms such as finite automata [92], regular expressions [67], finite semigroups [97], monadic second-order logic [18], right-linear grammars [21], quantifier-free first-order updates [39], ... Regular languages are closed under a wide range of operations and, even more importantly, almost any interesting problem concerning regular languages is decidable. All kinds of aspects of the regular languages have been studied over the past 50 years. From a practical perspective, the most widespread way to specify regular languages is by using regular expressions (REs). They are used in applications in many different areas of computer science, including bioinformatics [84], programming languages [108], model checking [105], and XML schema language [100]. XML is the lingua franca and the de facto standard for data exchange on the Web. When two parties exchange data in XML, the XML documents usually adhere to a certain format. Thereto, XML schema languages are used to describe the structure an XML document can have. The most common XML schema languages are DTD, XML Schema [100], both W3C standards, and Relax NG [22]. From a formal language theory point of view each of these is a grammar-based formalism with regular expressions at their right-hand sides. These expression, however, differ from the standard regular expressions in that they are extended by additional operators but are also restricted by the requirement to be deterministic. Although these requirements are recorded in W3C and ISO standards, it is not clear what their impact on the different schema languages is. The goal of this thesis therefore is a study of these consequences; in particular, we study the complexity of optimization in the presence of additional operators, illustrate the difficulties of migrating from one schema language to another, and study the implications of the determinism constraint (also in the presence of additional operators) and try to make it accessible in practice. Although the questions we ask are mainly inspired by questions about XML, we believe that the answers are also interesting for the general theoretical computer science community as they answer fundamental questions concerning regular expressions. Therefore, this thesis consists of two parts. In the first part, we study fundamental aspects of regular expression. In the second, we apply the former results to applications concerning XML schema languages. Although most of the work is of a foundational nature, we also developed software to bring these theoretical insights to practice.
Document URI:
Category: T1
Type: Theses and Dissertations
Appears in Collections:PhD theses
Research publications

Files in This Item:
File Description SizeFormat 
thesis.pdfOriginal version1.42 MBAdobe PDFView/Open
Show full item record

Page view(s)

checked on May 23, 2022


checked on May 23, 2022

Google ScholarTM


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.