Compressed representation of XML documents with rapid navigation
thesisposted on 05.09.2014, 15:30 by Mohammad Kamel Ahmad Kharabsheh
XML(Extensible Markup Language) is a language used in data representation and storage, and transmission and manipulation of data. Excessive memory consumption is an important challenge when representing XML documents in main memory. Document Object Model (DOM) APIs are used in a processing level that provides access to all parts of XML documents through the navigation operations. Although DOM serves as a a general purpose tool that can be used in different applications, it has high memory cost particularly if using na¨ıve. The space usage of DOM has been reduced significantly while keeping fast processing speeds, by use of succinct data structures in SiXDOM . However, SiXDOM does not explore in depth XML data compression principles to improve in-memory space usage. Such XML data compression techniques have been proven to be very effective in on-disk compression of XML document. In this thesis we propose a new approach to represent XML documents in-memory using XML data compression ideas to further reduce space usage while rapidly supporting operations of the kind supported by DOM. Our approach is based upon a compression method  which represents an XML document as a directed acyclic graph (DAG) by sharing common subtrees. However, this approach does not permit the representation of attributes and textual data, and furthermore, a naive implementation of this idea gives very poor space usage relative to other space-efficient DOM implementations . In order to realise the potential of this compression method as an in-memory representation, a number of optimisations are made by application of succinct data structures and variablelength encoding. Furthermore, a framework for supporting attribute and textual data nodes is introduced. Finally, we propose a novel approach to representing the textual data using Minimal Perfect Hashing(MPH). We have implemented our ideas in a software library called DAGDOMand performed extensive experimental evaluation on a number of standard XML files. DAGDOM yields a good result and we are able to obtain significant space reductions over existing space-efficient DOM implementations (typically 2 to 5 times space reduction), with very modest degradations in CPU time for navigational operations.