Efficient Processing of Very Large XML Documents in Small Space

Item

Title
Efficient Processing of Very Large XML Documents in Small Space
Identifier
d_2009_2013:97f256e77e72:11579
identifier
12092
Creator
Meyer, Matthew K.,
Contributor
Ira Rudowsky
Date
2012
Language
English
Publisher
City University of New York.
Subject
Computer science | DOM | XML
Abstract
The focus of this research was to develop a highly efficient serialized data structure for use in Document Object Model (DOM) query systems of Extensible Markup Language (XML) documents, which are increasingly the dominant model for data representation across the World Web. The space efficiency of the serialized data structure developed is shown to improve query response time and eliminates the need to rebuild the DOM representation each time it is desired to query an existing XML document. Moreover, this serialized data structure enables DOM modeling of extremely large XML documents at a low and fixed memory cost. New algorithms and software tools for working with these data structures have been developed and this allows for compatibility with standard DOM modeling. The structures, tool and techniques presented can be readily adapted for use on the Semantic Web as it becomes a more prominent feature of the internet.;The improved performance of the XML database storage and retrieval application developed in this thesis helps close the gap between relational database application performance and XML Document (NXD) database application performance. In addition, the novel DOM querying technique that we have developed in this thesis has potential application on hand held devices, autonomous robots, and other devices which have limited memory.;The research evolved from an examination of three principal limitations that restrict the widespread implementation of Semantic Web technologies. These limitations include: (1) Querying information stored in XML format. Representation in XML format is a hierarchical and queries are less efficient and more time consuming than comparable queries of data stored in relational database format [1] . (2) Complexity of tools and technologies. The suite of technologies, useful ontology's, and tools necessary for creating semantic web technologies are complex and development time is extensive [2]. (3) Insufficient user friendly interfaces. Users with no prior exposure to semantic technologies, such as "Description Logic (DL)," cannot easily utilize the power of semantic search capabilities. In this thesis, the first of the inhibiting factors restricting the growth of the semantic web will be addressed. It will be shown that serializing the data representation of an XML document representation enables significant improvement of XML-DOM query response times. This serialized representation also eliminates the restrictions on documents size imposed by most XML- DOM query systems.;Chapters 2 and 3 introduce and motivate the research described in this thesis. Chapters 4, 5 and 6 explain the basic costs inherent in DOM modeling, expand on the influence of DOM tree size and shape on the total cost of DOM modeling, and enumerate why DOM modeling of a XML documents is traditionally considered to take 2 to 5 times the size of the source XML document to store the DOM tree representing the document in memory. Chapters 7 and 8 introduce methods for reducing the size of DOM trees and establish upper and lower bounds for DOM tree size in relation to source XML documents; we establish a practical upper-bound of memory required for supporting XML DOM querying and show that it is smaller than the 2-5 size limitation described above. Chapter 9 introduces the Minimum DOM Node (MDN), a minimized data structure capable of storing DOM node information. It is also shown how an MDN Array (MDNA) can be used as an alternative to a traditional DOM tree. Chapter 10 examines the memory cost of using an MDNA for DOM modeling. It is also shown that using the MDNA model of an XML document allows DOM querying with a low and fixed allocation of memory. Chapter 11 details the process and costs involved in creating an MDNA from a given source XML document and is a major contribution of this work. Chapter 12 examines the W3C specification for the DOM interface and explains how an MDNA can be used to satisfy the requirements of core DOM node operations. In Chapter 13, a C/C++ implementation of a MDNA parser was developed to prove the concepts presented in this thesis. Chapter 14 examines the future research topics that follow logically from these developments.
Type
dissertation
Source
2009_2013.csv
degree
Ph.D.
Program
Computer Science