11 XML DOM

Dr M. Vijayalakshmi

Parsers

The word “parser” comes from compiler design. A parser is a program that analyses the grammatical structure of an input with respect to a given formal grammar. It determines how a sentence is constructed from the grammar of the language by identifying the atomic elements of the input and the relationships among them.

XML Parsing Standards

We will consider two widely used parsing methods for accessing XML: SAX and DOM.

SAX (Simple API for XML) is an event-driven interface. The programmer registers handlers for events that may occur during parsing (for example, the start of an element); when such an event occurs, the parser calls the corresponding handler. SAX is a “serial access” protocol, i.e., it accesses the XML document sequentially. It is a read-only API: it provides a mechanism for reading data from an XML document but cannot write one. SAX is not a W3C standard; it is an ad hoc, de facto standard that is very popular for XML parsing.
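The event-driven style described above can be sketched in Java with the JAXP SAX classes. This is a minimal illustration, not a complete application: the class name SaxEventsSketch, the helper collectEvents and the inline XML string are assumptions made for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class name, used only for this sketch.
class SaxEventsSketch {

    // Reads the XML text sequentially and records one entry per event received.
    static List<String> collectEvents(String xml) throws Exception {
        final List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                events.add("start:" + qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // A parser may deliver long text in several chunks; short text usually arrives in one.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) {
                    events.add("text:" + text);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                events.add("end:" + qName);
            }
        };
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return events;
    }

    public static void main(String[] args) throws Exception {
        // Inline sample document (an assumption for this sketch).
        String xml = "<question-paper><question id=\"q2\">What are Leaves?</question></question-paper>";
        for (String event : collectEvents(xml)) {
            System.out.println(event);
        }
    }
}
```

The handler never sees the document as a whole; it only reacts to each event as the parser reaches it, which is why SAX needs so little memory.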

DOM (Document Object Model) converts an XML document into a tree of objects. It is a random access protocol: once the tree is built, any part of the document can be accessed in any order. DOM can also update the XML document, i.e., it can add to or otherwise manipulate the document's content. DOM is a W3C standard for accessing XML documents.

Difference between SAX and DOM

DOM reads the entire XML document into memory and stores it as a tree data structure, whereas SAX reads the XML document and sends an event for each element that it encounters. DOM provides “random access” into the XML document, whereas SAX provides only sequential access. DOM is comparatively slow and requires a large amount of memory, which makes it a poor fit for very large XML documents, whereas SAX is fast and requires very little memory, so it can be used for huge documents (or large numbers of documents). This makes SAX popular for websites. On the other hand, DOM provides methods for changing the XML document in memory, whereas SAX does not.

DOM

DOM is a language-neutral and platform-independent object model for XML and HTML documents. It allows programs and scripts to build documents and navigate their structure. It also allows programs and scripts to dynamically access and manipulate the content, structure, and style of a document. DOM provides a foundation for developing applications for querying, filtering, transformation, rendering, etc., on top of its implementations.

The DOM is a W3C standard for accessing the XML and HTML documents.

It is separated into three parts:

Core DOM

This is a standard model for any structured document.

XML DOM

This is a standard model for XML documents. It is a platform- and language-independent programming interface for XML. It defines the objects and properties of all XML elements, and the methods (interface) to access and manipulate them.

HTML DOM

This is a standard model for HTML documents. It defines the objects and properties of all HTML elements, and the methods (interface) to access them. The HTML DOM model is constructed as a tree of objects. In the HTML DOM, everything is considered a node. The document itself is a document node, all HTML elements are element nodes, all HTML attributes are attribute nodes, the text inside HTML elements forms text nodes, and comments are comment nodes.

Document Object Model (DOM) tree

The DOM presents an XML document as a tree structure. The tree consists of nodes related to one another as parent, child, sibling, etc. Every parent in the tree can have zero or more children, each represented by a child node. The parent of a parent node, and so on up the tree, is called an ancestor node. The children of the same parent are called siblings. Every tree has a single root node, which contains all other nodes in the document.

DOM structure model

DOM is based on object-oriented concepts. The DOM defines the objects and properties of all document elements, and the methods (interface) to access them. It provides methods to access or change an object's state, interfaces that declare sets of methods, and objects that encapsulate data and methods. The DOM structure is a parse tree, i.e., a tree-like structure implied by the abstract relationships defined by the programming interfaces.

XML DOM Nodes

In the XML DOM, everything is a node. The entire document is represented as a document node. Every XML element is an element node. The text inside an XML element is a text node. Every attribute is an attribute node. Comments are comment nodes.
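As a rough illustration of the node kinds listed above, the following Java sketch parses a small inline document and reports the type of several of its nodes. The class name NodeKindsSketch, the helpers parse and kind, and the sample XML string are assumptions introduced for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical class name, used only for this sketch.
class NodeKindsSketch {

    // Builds a DOM tree from XML text held in a string.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Maps the numeric node type code to a readable label.
    static String kind(Node node) {
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE:  return "document";
            case Node.ELEMENT_NODE:   return "element";
            case Node.ATTRIBUTE_NODE: return "attribute";
            case Node.TEXT_NODE:      return "text";
            case Node.COMMENT_NODE:   return "comment";
            default:                  return "other";
        }
    }

    public static void main(String[] args) throws Exception {
        // Sample document (an assumption for this sketch): a comment, then one element.
        Document doc = parse("<!-- sample --><note lang=\"en\">Hello</note>");
        Element root = doc.getDocumentElement();
        System.out.println(kind(doc));                           // document
        System.out.println(kind(doc.getFirstChild()));           // comment
        System.out.println(kind(root));                          // element
        System.out.println(kind(root.getAttributeNode("lang"))); // attribute
        System.out.println(kind(root.getFirstChild()));          // text
    }
}
```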

The XML DOM Node Tree

The XML DOM views an XML document as a tree structure, or hierarchical structure. This structure is called a node tree. All nodes can be accessed through the tree; their contents can be modified or deleted, and new elements can be created. The node tree shows the set of nodes and the connections between them. The tree starts at the root node and branches out to the text nodes at the lowest level of the tree. Although the DOM specification describes this structure loosely as a “forest” that may contain more than one tree, a well-formed XML document has exactly one root element, alongside which the document node may also hold comments and processing instructions.
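The idea of a tree that starts at the root and branches out to text nodes can be sketched as a recursive walk. This is a minimal example under stated assumptions: the class name DomTreeWalkSketch, the helpers parse and walk, and the inline sample document are all introduced for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class name, used only for this sketch.
class DomTreeWalkSketch {

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Appends one indented line per node, depth-first, starting at the given node.
    static void walk(Node node, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(node.getNodeName());
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.append(" = ").append(node.getNodeValue());
        }
        out.append('\n');
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), depth + 1, out);
        }
    }

    public static void main(String[] args) throws Exception {
        // Sample document written without whitespace between tags (an assumption for this sketch).
        Document doc = parse("<question-paper><question>What is DOM?</question>"
                + "<question>What are Leaves?</question></question-paper>");
        StringBuilder out = new StringBuilder();
        walk(doc, 0, out);
        System.out.print(out);
    }
}
```

The first line printed is #document, the name of the document node, followed by the root element and its descendants, each indented one level deeper than its parent.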

Node Parents, Children, and Siblings

The nodes in the node tree have a hierarchical relationship to each other. The terms parent, child, and sibling are used to describe the relationships. Parent nodes have children. Children on the same level are called siblings (brothers or sisters).

Given below are the properties of the DOM Tree Structure.

In a node tree, the top node is called the root
Every node, except the root, has exactly one parent node
A node can have any number of children
A leaf is a node with no children
Siblings are nodes with the same parent

XML Example

Consider the given sample XML document.

The root node in the document is ‘bookstore’. It has an element called ‘book’, which has an attribute ‘category’. The ‘title’, ‘author’, ‘year’ and ‘price’ are the sub-elements of the element ‘book’.

<bookstore>
  <book category="…">
    <title lang="en">Web Technology</title>
    <author>Dr.Muthuswamy Vijayalakshmi</author>
    <year>2015</year>
    <price>…</price>
  </book>
</bookstore>

XML DOM Node Tree

The sample node tree for the above given XML document is given below in Figure 11.1:

XML Parent/Child Relationship

Figure 11.2 shows the relationships among the elements in the XML document. From the figure, we can see that the child nodes of the <bookstore> element are the <book> elements, which are siblings of each other. The <title> element is the first child of the <book> element, and the <price> element is the last child of the <book> element. The <book> element is the parent node of the <title>, <author>, <year>, and <price> elements. The elements <title>, <author>, <year>, and <price> are siblings of each other.
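These relationships can be checked in code. The sketch below is only an illustration: the class name BookstoreRelationsSketch is hypothetical, and the values used for the category attribute and the price element are placeholders, since they are not given in the text. The sample is written without whitespace between tags so that whitespace-only text nodes do not show up as extra children.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical class name, used only for this sketch.
class BookstoreRelationsSketch {

    // The category and price values are placeholders, not taken from the text.
    static final String SAMPLE =
            "<bookstore><book category=\"web\">"
            + "<title lang=\"en\">Web Technology</title>"
            + "<author>Dr.Muthuswamy Vijayalakshmi</author>"
            + "<year>2015</year><price>100</price>"
            + "</book></bookstore>";

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Node book = parse(SAMPLE).getDocumentElement().getFirstChild();
        Node title = book.getFirstChild();
        Node price = book.getLastChild();
        System.out.println(title.getNodeName());                      // title
        System.out.println(price.getNodeName());                      // price
        System.out.println(title.getNextSibling().getNodeName());     // author
        System.out.println(price.getPreviousSibling().getNodeName()); // year
        System.out.println(title.getParentNode().getNodeName());      // book
        System.out.println(book.getParentNode().getNodeName());       // bookstore
    }
}
```

If the same document were indented across several lines, getFirstChild() on the book element would return a whitespace text node rather than the title element, so real code usually filters on the node type.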

Core Interfaces: Node & its variants

The core interfaces defined in the XML DOM are given in Figure 11.3.

The fundamental interfaces defined in DOM are Node, Document, DocumentFragment, Element, Attr, CharacterData, Comment and Text. The extended interfaces also form part of the core DOM. These interfaces include CDATASection, DocumentType, EntityReference, ProcessingInstruction and others.

DOM Nodes

Document represents the entire XML document. Element represents an element in the XML document. Attr represents an attribute of an element. Text represents the textual content of an Element node. DocumentFragment is a lightweight, minimal document object that represents a portion of the document. DocumentType gives information about the document type declaration. ProcessingInstruction carries processor-specific information embedded in the document. Comment represents a comment in an XML document. A CDATASection is a block of text in an escaped format that allows special characters to be included literally; it is useful for arbitrary text containing special characters. Entity represents either a parsed or an unparsed entity of the XML document.

Interface Names and Constants

Below is a list of interfaces and constants provided by DOM.

Properties of Node Interface

Given below is a list of properties available with the Node interface of DOM, together with the related NodeList type.

nodeType indicates an unsigned short value representing the type of the context node. Possible values include ELEMENT_NODE, ATTRIBUTE_NODE, TEXT_NODE and CDATA_SECTION_NODE.
nodeName indicates a string containing the name of the context node.
nodeValue indicates a string representing the value of the context node, for example the content of a text node; for node types such as Element it is null.
NodeList is the type of the ordered list of nodes returned by childNodes. The list can be accessed by index, like an array.
length (a property of NodeList) indicates the number of nodes in the list, for example the number of children of the root element.
nextSibling indicates the next sibling of the given node.
parentNode indicates a node that is the parent of the context node. If no such node exists, this property returns null.
childNodes indicates a list containing all child nodes of the context node. For nodes that cannot have children, the list is empty.
firstChild refers to the first child of the context node.
lastChild refers to the last child of the context node.
previousSibling refers to the node that is immediately preceding the context node.
attributes refers to the unordered collection (a NamedNodeMap) containing all attributes specified for the node; it is null unless the node is an element.

DOM interfaces: Node

Node

The Node interface defines functions and below is a list of functions or methods available with the Node interface.

Node.getNodeType returns a short integer code indicating the type of the node.
Node.getNodeValue returns a string containing the value of the node.
Node.getOwnerDocument returns a Document object associated with the given node.
Node.getParentNode returns a Node object representing the parent of the context node.
Node.hasChildNodes returns a Boolean value indicating whether the given node has any children.
Node.getChildNodes returns a NodeList object that contains all the children of the given context node.
Node.getFirstChild returns a Node object that contains the first child of the context node.
Node.getLastChild returns a Node object that contains the last child of the context node.
Node.getPreviousSibling returns a Node object that contains the node that immediately precedes the context node.
Node.getNextSibling returns a Node object that contains the node that immediately follows the context node.
Node.hasAttributes returns a Boolean value indicating whether the context node contains attributes. If the context node is not an element node, it returns false.
Node.getAttributes returns a NamedNodeMap containing the attributes of the context node, or null if the context node is not an element node.
appendChild(newChild) adds the node newChild to the end of the list of children of the context node.
insertBefore(newChild,refChild) inserts the node newChild before the existing child node refChild.
replaceChild(newChild,oldChild) replaces the child node oldChild with newChild in the list of children, and returns the oldChild node.
removeChild(oldChild) removes the child node indicated by oldChild from the list of children, and returns it.
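The four modifying methods above can be seen working together in the following sketch. It is an illustration under stated assumptions: the class name DomEditSketch, the helpers parse, question, edit and questionTexts, and the extra question texts are all hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class name, used only for this sketch.
class DomEditSketch {

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Creates a <question id="...">text</question> element owned by doc.
    static Element question(Document doc, String id, String text) {
        Element q = doc.createElement("question");
        q.setAttribute("id", id);
        q.appendChild(doc.createTextNode(text));
        return q;
    }

    // Applies insertBefore, appendChild, replaceChild and removeChild in turn.
    static Document edit() throws Exception {
        Document doc = parse("<question-paper><question id=\"q2\">What are Leaves?</question></question-paper>");
        Element root = doc.getDocumentElement();
        Node q2 = root.getFirstChild();
        root.insertBefore(question(doc, "q1", "What is DOM?"), q2);       // children: q1, q2
        Node q3 = root.appendChild(question(doc, "q3", "What is SAX?"));  // children: q1, q2, q3
        root.replaceChild(question(doc, "q3", "What is a node?"), q3);    // the old q3 node is returned
        root.removeChild(q2);                                             // children: q1, new q3
        return doc;
    }

    // Collects the text of every element child of root.
    static List<String> questionTexts(Element root) {
        List<String> texts = new ArrayList<>();
        NodeList children = root.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i).getNodeType() == Node.ELEMENT_NODE) {
                texts.add(children.item(i).getTextContent());
            }
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(questionTexts(edit().getDocumentElement()));
        // prints [What is DOM?, What is a node?]
    }
}
```

All changes are made to the in-memory tree only; the XML text that was parsed is not altered.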

Table 11.1 Node Interfaces, Names and Values

Table 11.1 provides us the names and values that could be retrieved for the different interfaces defined in the DOM. The values of nodeName and nodeValue vary according to the node type as shown in the table.

DOM interfaces: Document

List of methods available with Document interface of DOM is given below:

getDocumentElement returns an object of Element type which represents the root element of the XML document.
createAttribute(name) creates an attribute of given name.
createElement(tagName) creates an element of given tagName.
createTextNode(data) creates a text node with the given data.
getDoctype() returns the Document Type Declaration associated with this document.
getElementById(IdVal) returns the Element node whose attribute of type ID has the specified value.
getElementsByTagName(tagName) returns a list of all Element nodes with the specified element name.
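A short sketch of the Document methods listed above, building a small document from scratch instead of parsing a file. The class name BuildDocumentSketch and the helper build are assumptions for the example. Note that getElementById only finds attributes that are known to be of type ID; the sketch marks the attribute explicitly with the DOM Level 3 method setIdAttribute.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical class name, used only for this sketch.
class BuildDocumentSketch {

    // Creates <question-paper><question id="q1">What is DOM?</question></question-paper> in memory.
    static Document build() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("question-paper");
        doc.appendChild(root);

        Element question = doc.createElement("question");
        Attr id = doc.createAttribute("id");
        id.setValue("q1");
        question.setAttributeNode(id);
        question.appendChild(doc.createTextNode("What is DOM?"));
        root.appendChild(question);
        return doc;
    }

    public static void main(String[] args) throws Exception {
        Document doc = build();
        System.out.println(doc.getDocumentElement().getNodeName());           // question-paper
        System.out.println(doc.getElementsByTagName("question").getLength()); // 1
        System.out.println(doc.getDoctype());                                 // null: no DTD was declared

        // An attribute merely named "id" is not of type ID, so the lookup finds nothing yet.
        System.out.println(doc.getElementById("q1"));                         // null
        Element question = (Element) doc.getElementsByTagName("question").item(0);
        question.setIdAttribute("id", true);
        System.out.println(doc.getElementById("q1") == question);             // true
    }
}
```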

DOM interfaces: Element

Here is a list of some methods available in Element object.

String getAttribute(String name) returns the value of the given attribute.
Attr getAttributeNode(String name) returns an attribute node as an Attr object.
NodeList getElementsByTagName(String name) returns a NodeList of all descendant elements with the given tag name, in document order.
boolean hasAttribute(String name) returns whether the element has an attribute with the specified name.

DOM interfaces: Attr

Here is a list of some methods available in the Attr object.

String getName() returns a string containing the name of the attribute.
String getValue() returns a string containing the value of the specified attribute.
Element getOwnerElement() returns the element to which the specified attribute is attached.
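The Element and Attr methods described above can be tried together in a small sketch. The class name ElementAttrSketch, the helper parse and the attribute name marks (used to show a missing attribute) are assumptions for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical class name, used only for this sketch.
class ElementAttrSketch {

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<question-paper><question id=\"q2\">What are Leaves?</question></question-paper>");
        Element question = (Element) doc.getElementsByTagName("question").item(0);

        System.out.println(question.getAttribute("id"));               // q2
        System.out.println(question.hasAttribute("marks"));            // false
        System.out.println(question.getAttribute("marks").isEmpty());  // true: a missing attribute yields ""

        Attr id = question.getAttributeNode("id");
        System.out.println(id.getName());                              // id
        System.out.println(id.getValue());                             // q2
        System.out.println(id.getOwnerElement() == question);          // true
    }
}
```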

DOM interfaces: CharacterData

Here is a list of some methods available in the CharacterData object.

This interface is inherited by Text and Comment, and by CDATASection through Text.

int getLength() returns the length of the specified CharacterData object (text, comment etc.)
String getData() returns a string containing the text of the given CharacterData node.
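Because Text and Comment both inherit from CharacterData, the same two methods work on either kind of node. The following is a minimal sketch; the class name CharacterDataSketch, the helpers parse and describe, and the sample document are assumptions for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.CharacterData;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical class name, used only for this sketch.
class CharacterDataSketch {

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Formats a character-data node as name:length:data.
    static String describe(CharacterData node) {
        return node.getNodeName() + ":" + node.getLength() + ":" + node.getData();
    }

    public static void main(String[] args) throws Exception {
        // Sample element holding a comment followed by text (an assumption for this sketch).
        Element note = parse("<note><!--todo-->Hello</note>").getDocumentElement();
        System.out.println(describe((CharacterData) note.getFirstChild())); // #comment:4:todo
        System.out.println(describe((CharacterData) note.getLastChild()));  // #text:5:Hello
    }
}
```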

DOM interfaces: ProcessingInstruction

Here is a list of some methods available in ProcessingInstruction Object.

String getTarget() returns a string that identifies the application to which the instruction or data is directed.
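String getData() returns a string containing the content of the processing instruction, i.e., the information for the application to process, from the first non-whitespace character after the target up to the character immediately preceding the ?>.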

DOM interfaces: NodeList

Here is a list of some methods available in NodeList Object.

int getLength() returns the length of the specified node list.
Node item(int index) returns the node object whose index matches the specified index in the node list.

DOM interfaces: DocumentType

Here is a list of some methods available in DocumentType Object.

String getName() returns a string containing the name of the DTD which is written immediately next to the keyword !DOCTYPE.
NamedNodeMap getEntities() returns a NamedNodeMap object containing the general entities declared in the DTD.
NamedNodeMap getNotations() returns a NamedNodeMap object containing the notations declared in the DTD.
String getSystemId() returns a string containing the system identifier of the external subset.
String getPublicId() returns a string containing the public identifier of the external subset.

This node has no children, but methods allow us to fetch entities and notations from the DTD it represents.
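A brief sketch of reading the DocumentType node follows. It assumes a document whose DTD is given entirely as an internal subset, so that no external file has to be loaded; the class name DoctypeSketch, the helper parse, the element name note and the entity name author are all hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;
import org.xml.sax.InputSource;

// Hypothetical class name, used only for this sketch.
class DoctypeSketch {

    // Internal-subset DTD only, so there is no system or public identifier.
    static final String SAMPLE =
            "<!DOCTYPE note [<!ELEMENT note (#PCDATA)><!ENTITY author \"Anon\">]>"
            + "<note>Hi</note>";

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        DocumentType doctype = parse(SAMPLE).getDoctype();
        System.out.println(doctype.getName());     // note
        System.out.println(doctype.getSystemId()); // null
        System.out.println(doctype.getPublicId()); // null
        // Whether a declared entity is listed here can depend on the parser implementation.
        System.out.println(doctype.getEntities().getNamedItem("author") != null);
        System.out.println(doctype.getNotations().getLength());
    }
}
```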

DOM interfaces: NamedNodeMap

Here is a list of some methods available in NamedNodeMap Object.

int getLength() gives the number of nodes in this map. The range of valid child node indices is 0 to length-1 inclusive.
Node item(int index) returns the node object whose index matches the specified index in the map. If index is greater than or equal to the number of nodes in this map, this returns null.
Node getNamedItem(String name) Retrieves the node specified by name.
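The attributes of an element are the most common NamedNodeMap in practice, so they make a convenient illustration of the methods above. In this sketch the class name NamedNodeMapSketch, the helper parse and the attribute marks are assumptions. Since the order of items in the map is not specified, the example looks nodes up by name rather than relying on index positions.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.xml.sax.InputSource;

// Hypothetical class name, used only for this sketch.
class NamedNodeMapSketch {

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Sample element with two attributes; "marks" is a made-up attribute.
        Document doc = parse("<question id=\"q2\" marks=\"5\">What are Leaves?</question>");
        NamedNodeMap attrs = doc.getDocumentElement().getAttributes();

        System.out.println(attrs.getLength());                          // 2
        System.out.println(attrs.getNamedItem("id").getNodeValue());    // q2
        System.out.println(attrs.getNamedItem("marks").getNodeValue()); // 5
        System.out.println(attrs.getNamedItem("lang"));                 // null: no such attribute
        System.out.println(attrs.item(attrs.getLength()));              // null: index out of range

        // Valid indices run from 0 to length-1; the order of items is not guaranteed.
        for (int i = 0; i < attrs.getLength(); i++) {
            System.out.println(attrs.item(i).getNodeName());
        }
    }
}
```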

JAVA and XML DOM

XML maps well to Java for the following reasons.

Late-binding nature of Java.
Object-oriented data model of Java.
Unicode support in Java.
XML structures map well to Java objects.
Portability and network friendliness of Java.

The following example explains how to parse an XML document using DOM parser.

Simple DOM program – I

import javax.xml.parsers.*;
import org.w3c.dom.*;

public class FirstDom {
    public static void main(String args[]) {
        try {
            // …Main part of program goes here…
        } catch (Exception e) {
            // Report any parsing or I/O failure
            e.printStackTrace(System.out);
        }
    }
}

The above program serves as a template for parsing any XML document using DOM. A more specific example is provided below.

Simple XML

Consider the following sample XML document.

<?xml version="1.0" ?>
<question-paper>
  <question id="q1">What is DOM?</question>
  <question id="q2">What are Leaves?</question>
</question-paper>

Simple DOM program – II

The program for parsing the given sample XML document is given below.

First, the XML-related packages have to be imported. A parser is created using DocumentBuilder. A Document object is created for the given sample XML file, and then the root element is extracted. The required operations can then be performed on the sample XML document by calling the corresponding functions.

A detailed explanation of the program is given in the next module.

import javax.xml.parsers.*;
import org.w3c.dom.*;

class GetQuestion
{
    public static void main(String args[])
    {
        try
        {
            // Create a DOM parser through the factory
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder parser = factory.newDocumentBuilder();

            // Build the in-memory tree and fetch the root element
            Document doc = parser.parse("questions.xml");
            Element root = doc.getDocumentElement();

            // The child list holds whitespace-only text nodes as well as elements
            NodeList children = root.getChildNodes();
            System.out.println(children.getLength());

            for (int i = 0; i < children.getLength(); i++)
            {
                Node node = children.item(i);
                // Print the text of each child element only
                if (node.getNodeType() == Node.ELEMENT_NODE)
                    System.out.println(node.getFirstChild().getNodeValue());
            }
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }
}

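The program first prints the number of child nodes of the root element (a count that includes any whitespace-only text nodes). After that line, the output for the given sample XML document would be: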

What is DOM?

What are Leaves?

Summary

In this module the basics of DOM were explained and the differences between the DOM parser and the SAX parser were discussed. The module explored the DOM Node interface, the node types, and their properties and methods. A simple example of an XML parser was also provided; a detailed explanation of it follows in the next module.
