Foundations of XML: Regular Expressions Revisited

GELADE, Wouter

Please use this identifier to cite or link to this item: http://hdl.handle.net/1942/10859

Title:	Foundations of XML: Regular Expressions Revisited
Authors:	GELADE, Wouter
Advisors:	NEVEN, Frank
Issue Date:	2009
Abstract:	The regular languages constitute a highly robust class of languages. They can be represented by many different formalisms such as finite automata [92], regular expressions [67], finite semigroups [97], monadic second-order logic [18], right-linear grammars [21], quantifier-free first-order updates [39], ... Regular languages are closed under a wide range of operations and, even more importantly, almost any interesting problem concerning regular languages is decidable. All kinds of aspects of the regular languages have been studied over the past 50 years. From a practical perspective, the most widespread way to specify regular languages is by using regular expressions (REs). They are used in applications in many different areas of computer science, including bioinformatics [84], programming languages [108], model checking [105], and XML schema language [100]. XML is the lingua franca and the de facto standard for data exchange on the Web. When two parties exchange data in XML, the XML documents usually adhere to a certain format. Thereto, XML schema languages are used to describe the structure an XML document can have. The most common XML schema languages are DTD, XML Schema [100], both W3C standards, and Relax NG [22]. From a formal language theory point of view each of these is a grammar-based formalism with regular expressions at their right-hand sides. These expression, however, differ from the standard regular expressions in that they are extended by additional operators but are also restricted by the requirement to be deterministic. Although these requirements are recorded in W3C and ISO standards, it is not clear what their impact on the different schema languages is. The goal of this thesis therefore is a study of these consequences; in particular, we study the complexity of optimization in the presence of additional operators, illustrate the difficulties of migrating from one schema language to another, and study the implications of the determinism constraint (also in the presence of additional operators) and try to make it accessible in practice. Although the questions we ask are mainly inspired by questions about XML, we believe that the answers are also interesting for the general theoretical computer science community as they answer fundamental questions concerning regular expressions. Therefore, this thesis consists of two parts. In the first part, we study fundamental aspects of regular expression. In the second, we apply the former results to applications concerning XML schema languages. Although most of the work is of a foundational nature, we also developed software to bring these theoretical insights to practice.
Document URI:	http://hdl.handle.net/1942/10859
Category:	T1
Type:	Theses and Dissertations
Appears in Collections:	PhD theses Research publications

Files in This Item:

File	Description	Size	Format
thesis.pdf		1.42 MB	Adobe PDF	View/Open

Show full item record

Google Scholar^TM

Check

Files in This Item:

Google ScholarTM

Google Scholar^TM