XML Parsing

An XML file is just plain text. To actually use the data inside it — read values, search for elements, or modify the content — the XML must first be parsed. Parsing is the process of reading the XML text and converting it into a structured form that a program can work with.

There are two fundamental approaches to parsing XML: DOM (Document Object Model) and SAX (Simple API for XML). Each has its own strengths and is suited for different situations.

What is a Parser?

An XML parser is a software component that reads an XML document, checks that it is well-formed (and optionally validates it against a DTD or XSD), and then makes the content available to an application. Almost all programming languages — Java, Python, JavaScript, C#, PHP — provide built-in XML parsers or libraries.
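
As a small illustration of the well-formedness check, the sketch below uses Python's standard xml.etree.ElementTree module. The helper name is_well_formed and the two sample strings are made up for this example. Note that this module only checks well-formedness; it does not validate against a DTD or XSD.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if xml_text parses without a well-formedness error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        # Raised for problems such as unclosed or mismatched tags
        return False

good = "<library><book id='B01'/></library>"
bad = "<library><book id='B01'></library>"  # <book> is never closed

print(is_well_formed(good))  # True
print(is_well_formed(bad))   # False
```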

DOM Parsing

DOM stands for Document Object Model. When a DOM parser processes an XML file, it reads the entire document at once and builds a complete tree structure in memory. This tree represents every element, attribute, text node, and comment in the XML file.

How DOM Works

The parser reads the entire XML file from start to finish.
It builds a tree (called the DOM tree) where every element is a node.
The application can then navigate, search, and modify any part of the tree freely.

Visual Representation of a DOM Tree

Given this XML:

<?xml version="1.0"?>
<library>
  <book id="B01">
    <title>XML Fundamentals</title>
    <author>Grace Lim</author>
  </book>
</library>

The DOM tree would look like this (conceptually):

Document
└── library
    └── book [id="B01"]
        ├── title
        │   └── "XML Fundamentals"
        └── author
            └── "Grace Lim"

Every item in this tree — including text nodes — is accessible through the DOM API.
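
To make the tree concrete, here is a minimal sketch that walks every node with Python's xml.dom.minidom and records its type. The describe function is a hypothetical helper written for this example. The XML is written on one line on purpose: with indented XML, the whitespace between tags would also appear as text nodes.

```python
import xml.dom.minidom
from xml.dom import Node

xml_data = (
    '<library><book id="B01">'
    "<title>XML Fundamentals</title><author>Grace Lim</author>"
    "</book></library>"
)

def describe(node, depth=0, lines=None):
    """Collect one indented line per document, element, and text node."""
    if lines is None:
        lines = []
    indent = "  " * depth
    if node.nodeType == Node.DOCUMENT_NODE:
        lines.append(f"{indent}document")
    elif node.nodeType == Node.ELEMENT_NODE:
        attrs = dict(node.attributes.items())
        lines.append(f"{indent}element {node.tagName} {attrs}")
    elif node.nodeType == Node.TEXT_NODE:
        lines.append(f"{indent}text {node.data!r}")
    # Recurse into children (text nodes simply have none)
    for child in node.childNodes:
        describe(child, depth + 1, lines)
    return lines

doc = xml.dom.minidom.parseString(xml_data)
lines = describe(doc)
print("\n".join(lines))
```

The printed outline mirrors the diagram above: a document node, then library, book (with its id attribute), and the title and author elements, each holding one text node.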

DOM Parsing in JavaScript

In a web browser, JavaScript provides a built-in DOM parser for XML.

// XML string to parse
const xmlString = `
  <library>
    <book id="B01">
      <title>XML Fundamentals</title>
      <author>Grace Lim</author>
    </book>
  </library>
`;

// Parse the XML string into a DOM tree
const parser = new DOMParser();
const xmlDoc = parser.parseFromString(xmlString, "text/xml");

// Access the title element's text content
const title = xmlDoc.getElementsByTagName("title")[0].textContent;
console.log(title); // Output: XML Fundamentals

DOM Parsing in Python

import xml.dom.minidom

xml_data = """
<library>
  <book id="B01">
    <title>XML Fundamentals</title>
    <author>Grace Lim</author>
  </book>
</library>
"""

doc = xml.dom.minidom.parseString(xml_data)
titles = doc.getElementsByTagName("title")
for t in titles:
    print(t.firstChild.data)   # Output: XML Fundamentals

Advantages of DOM Parsing

Easy to navigate and manipulate — can go backward and forward through the tree.
Can modify the document: add, delete, or update elements and attributes.
Intuitive and widely supported in all major languages.
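
The editing and two-way navigation mentioned above can be sketched with xml.dom.minidom. The new id value "B99" and the year element are invented purely for illustration.

```python
import xml.dom.minidom

doc = xml.dom.minidom.parseString(
    '<library><book id="B01"><title>XML Fundamentals</title></book></library>'
)
book = doc.getElementsByTagName("book")[0]

# Update an attribute (hypothetical new value)
book.setAttribute("id", "B99")

# Add a new child element with text (hypothetical element)
year = doc.createElement("year")
year.appendChild(doc.createTextNode("2024"))
book.appendChild(year)

# Delete an existing element
title = book.getElementsByTagName("title")[0]
book.removeChild(title)

# Navigate backward: from the new child up to its parent
print(year.parentNode.tagName)       # book

# Serialize the modified tree back to text
print(doc.documentElement.toxml())
# <library><book id="B99"><year>2024</year></book></library>
```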

Disadvantages of DOM Parsing

Loads the entire XML into memory at once — not suitable for very large files.
Can be slow and memory-intensive for files with millions of elements.

SAX Parsing

SAX stands for Simple API for XML. Unlike DOM, a SAX parser does not build the entire document in memory. Instead, it reads the XML as a sequential stream and fires events as it encounters different parts of the document, such as start tags, text, and end tags. The application responds to these events as they happen.

How SAX Works

The parser reads the XML file sequentially from top to bottom.
When it finds a start tag, it fires a "start element" event.
When it finds text content, it fires one or more "characters" events (a parser may deliver a single run of text in several chunks).
When it finds an end tag, it fires an "end element" event.
The application handles each event with custom code (called a handler).
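
The event sequence described above can be observed with a handler that only records what it receives. This is a minimal sketch using Python's xml.sax; the class name EventLogger is made up for this example, and whitespace-only text is skipped to keep the log short.

```python
import xml.sax

class EventLogger(xml.sax.ContentHandler):
    """Record each SAX event in the order it is fired."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def characters(self, content):
        if content.strip():  # ignore whitespace-only text
            self.events.append(("characters", content))

    def endElement(self, name):
        self.events.append(("end", name))

logger = EventLogger()
xml.sax.parseString(b"<book><title>XML Fundamentals</title></book>", logger)

for event in logger.events:
    print(event)
```

The start and end events arrive in document order: start book, start title, then the text, then end title, end book. The text normally arrives as one "characters" event here, but a parser is allowed to split it.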

SAX Parsing in Python

import xml.sax

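class BookHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.current = ""   # name of the element being read
        self.text = []      # text chunks collected for that element

    def startElement(self, name, attrs):
        self.current = name
        self.text = []
        if name == "book":
            # attrs.get avoids a KeyError when the attribute is absent
            print("Found a book with id:", attrs.get("id"))

    def characters(self, content):
        # The parser may deliver one text run in several chunks
        if self.current == "title":
            self.text.append(content)

    def endElement(self, name):
        if name == "title":
            print("Title:", "".join(self.text).strip())
        self.current = ""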

xml_data = """
<library>
  <book id="B01">
    <title>XML Fundamentals</title>
    <author>Grace Lim</author>
  </book>
</library>
"""

# Encode the string to bytes before handing it to the parser
xml.sax.parseString(xml_data.encode("utf-8"), BookHandler())

Output:

Found a book with id: B01
Title: XML Fundamentals

Advantages of SAX Parsing

Extremely memory-efficient — reads the document one piece at a time.
Ideal for very large XML files (gigabytes of data).
Faster than DOM when only a portion of the document needs processing.

Disadvantages of SAX Parsing

Read-only — cannot modify the XML document.
Only moves forward — cannot go back to a previously read element.
More complex to write because the application must track its own state.
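
To show what "tracking its own state" can look like, the sketch below keeps a stack of open element names so the handler knows where it is in the document. The class name TitleCollector and the extra top-level title element are assumptions made for this example; other designs (flags, counters) would work too.

```python
import xml.sax

class TitleCollector(xml.sax.ContentHandler):
    """Collect only the titles that sit at library/book/title."""

    def __init__(self):
        super().__init__()
        self.path = []     # stack of currently open element names
        self.buffer = []   # text chunks for the innermost element
        self.titles = []

    def startElement(self, name, attrs):
        self.path.append(name)
        self.buffer = []

    def characters(self, content):
        self.buffer.append(content)

    def endElement(self, name):
        if self.path == ["library", "book", "title"]:
            self.titles.append("".join(self.buffer).strip())
        self.path.pop()
        self.buffer = []

xml_data = b"""<library>
  <title>Catalog title, not a book</title>
  <book id="B01"><title>XML Fundamentals</title></book>
</library>"""

collector = TitleCollector()
xml.sax.parseString(xml_data, collector)
print(collector.titles)  # ['XML Fundamentals']
```

Because SAX never hands the application a tree, the stack is the only record of the parent elements. Without it, the handler could not tell the book title apart from the top-level title.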

DOM vs SAX — Choosing the Right Approach

Consideration	Use DOM	Use SAX
File size	Small to medium files	Large or very large files
Need to modify XML?	Yes	No
Need random access?	Yes (navigate freely)	No (forward only)
Memory usage	Higher (full tree in memory)	Minimal
Complexity	Simpler to code	More complex to manage state

ElementTree: A Practical Middle Ground (Python)

Python's xml.etree.ElementTree is a popular, easy-to-use XML parsing module in the standard library. Like DOM, it builds a tree in memory, but it exposes that tree through a simpler interface than the raw DOM API, and it is more convenient than SAX for most tasks.

import xml.etree.ElementTree as ET

xml_data = """
<library>
  <book id="B01">
    <title>XML Fundamentals</title>
    <author>Grace Lim</author>
  </book>
  <book id="B02">
    <title>Advanced XML</title>
    <author>Peter Chan</author>
  </book>
</library>
"""

root = ET.fromstring(xml_data)

for book in root.findall("book"):
    book_id = book.get("id")
    title = book.find("title").text
    author = book.find("author").text
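    print(f"ID: {book_id}, Title: {title}, Author: {author}")

Output:

ID: B01, Title: XML Fundamentals, Author: Grace Lim
ID: B02, Title: Advanced XML, Author: Peter Chan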
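
For larger inputs, ElementTree also offers iterparse, which yields elements as parsing proceeds instead of returning the finished tree in one step. The sketch below is one possible way to use it, not the only one: it reads from an in-memory io.BytesIO object as a stand-in for a real file and clears each book element once it has been handled.

```python
import io
import xml.etree.ElementTree as ET

xml_bytes = b"""<library>
  <book id="B01"><title>XML Fundamentals</title></book>
  <book id="B02"><title>Advanced XML</title></book>
</library>"""

titles = []
# "end" events fire once an element and all its children have been read
for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
    if elem.tag == "book":
        titles.append((elem.get("id"), elem.findtext("title")))
        elem.clear()  # drop the children that are no longer needed

print(titles)  # [('B01', 'XML Fundamentals'), ('B02', 'Advanced XML')]
```

This keeps the convenient element interface (get, findtext) while processing the document incrementally, similar in spirit to SAX.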