XML Parsing
An XML file is just plain text. To actually use the data inside it — read values, search for elements, or modify the content — the XML must first be parsed. Parsing is the process of reading the XML text and converting it into a structured form that a program can work with.
There are two fundamental approaches to parsing XML: DOM (Document Object Model) and SAX (Simple API for XML). Each has its own strengths and is suited for different situations.
What is a Parser?
An XML parser is a software component that reads an XML document, checks that it is well-formed (and optionally validates it against a DTD or XSD), and then makes the content available to an application. Most mainstream programming languages, including Java, Python, JavaScript, C#, and PHP, provide XML parsers in their standard libraries or as add-on packages.
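The well-formedness check can be seen by handing a parser a broken document. The following is a minimal sketch using Python's standard xml.etree.ElementTree; the helper name is_well_formed and both sample strings are made up for illustration. It only checks well-formedness: the standard-library ElementTree module does not validate against a DTD or XSD.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Hypothetical helper: return (True, None) if xml_text parses,
    otherwise (False, error message)."""
    try:
        ET.fromstring(xml_text)
        return True, None
    except ET.ParseError as err:
        # The parser stops at the first violation and reports where it is.
        return False, str(err)

good = "<library><book id='B01'/></library>"
bad = "<library><book id='B01'></library>"  # <book> is never closed

print(is_well_formed(good))
print(is_well_formed(bad))
```

The second call reports a failure because the closing tag does not match the element that is still open.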
DOM Parsing
DOM stands for Document Object Model. When a DOM parser processes an XML file, it reads the entire document at once and builds a complete tree structure in memory. This tree represents every element, attribute, text node, and comment in the XML file.
How DOM Works
- The parser reads the entire XML file from start to finish.
- It builds a tree (called the DOM tree) in which every element, attribute, piece of text, and comment is a node.
- The application can then navigate, search, and modify any part of the tree freely.
Visual Representation of a DOM Tree
Given this XML:
<?xml version="1.0"?>
<library>
  <book id="B01">
    <title>XML Fundamentals</title>
    <author>Grace Lim</author>
  </book>
</library>
The DOM tree would look like this (conceptually):
Document
└── library
    └── book [id="B01"]
        ├── title
        │   └── "XML Fundamentals"
        └── author
            └── "Grace Lim"
Every item in this tree — including text nodes — is accessible through the DOM API.
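To make the tree concrete, here is a small sketch that walks a parsed document with Python's xml.dom.minidom and lists each element and text node. The function name describe is an illustrative choice, and whitespace-only text nodes (the line breaks between tags) are skipped so that the result matches the diagram.

```python
import xml.dom.minidom

xml_data = """<library>
  <book id="B01">
    <title>XML Fundamentals</title>
    <author>Grace Lim</author>
  </book>
</library>"""

def describe(node, depth=0, lines=None):
    """Collect one indented line per node, skipping whitespace-only text."""
    if lines is None:
        lines = []
    indent = "  " * depth
    if node.nodeType == node.ELEMENT_NODE:
        attrs = " ".join(f'{k}="{v}"' for k, v in node.attributes.items())
        suffix = f" [{attrs}]" if attrs else ""
        lines.append(f"{indent}element: {node.tagName}{suffix}")
    elif node.nodeType == node.TEXT_NODE:
        if node.data.strip():
            lines.append(f'{indent}text: "{node.data.strip()}"')
        return lines
    # Recurse into the children of element nodes.
    for child in node.childNodes:
        describe(child, depth + 1, lines)
    return lines

doc = xml.dom.minidom.parseString(xml_data)
tree_lines = describe(doc.documentElement)
print("\n".join(tree_lines))
```

Text is stored in its own child node rather than on the element, which is why the title text appears one level below the title element.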
DOM Parsing in JavaScript
In a web browser, JavaScript provides a built-in DOM parser for XML.
// XML string to parse
const xmlString = `
<library>
  <book id="B01">
    <title>XML Fundamentals</title>
    <author>Grace Lim</author>
  </book>
</library>
`;

// Parse the XML string into a DOM tree
const parser = new DOMParser();
const xmlDoc = parser.parseFromString(xmlString, "text/xml");

// Access the title element's text content
const title = xmlDoc.getElementsByTagName("title")[0].textContent;
console.log(title); // Output: XML Fundamentals
DOM Parsing in Python
import xml.dom.minidom

xml_data = """
<library>
  <book id="B01">
    <title>XML Fundamentals</title>
    <author>Grace Lim</author>
  </book>
</library>
"""

# Build the full DOM tree in memory
doc = xml.dom.minidom.parseString(xml_data)

# Each <title> element holds its text in a child text node
titles = doc.getElementsByTagName("title")
for t in titles:
    print(t.firstChild.data)  # Output: XML Fundamentals
Advantages of DOM Parsing
- Easy to navigate and manipulate — can go backward and forward through the tree.
- Can modify the document: add, delete, or update elements and attributes.
- Intuitive and widely supported in all major languages.
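The ability to modify a document can be illustrated with a short sketch, again using Python's xml.dom.minidom. The year element and the new id value are invented for the example and are not part of the sample data used elsewhere in this section.

```python
import xml.dom.minidom

doc = xml.dom.minidom.parseString(
    '<library><book id="B01"><title>XML Fundamentals</title></book></library>'
)
book = doc.getElementsByTagName("book")[0]

# Update: change the value of an existing attribute.
book.setAttribute("id", "B99")

# Add: create a new element (hypothetical <year>) with a text node inside it.
year = doc.createElement("year")
year.appendChild(doc.createTextNode("2024"))
book.appendChild(year)

# Delete: remove the <title> element from its parent.
title = book.getElementsByTagName("title")[0]
book.removeChild(title)

# Serialize the modified tree back to XML text.
result = doc.documentElement.toxml()
print(result)  # <library><book id="B99"><year>2024</year></book></library>
```

All three changes happen on the in-memory tree; nothing is written back to a file until the application serializes the tree and saves it.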
Disadvantages of DOM Parsing
- Loads the entire XML into memory at once — not suitable for very large files.
- Can be slow and memory-intensive for files with millions of elements.
SAX Parsing
SAX stands for Simple API for XML. Unlike DOM, a SAX parser does not build the document in memory. Instead, it reads the XML as a stream, from start to end, and fires events as it encounters each part of the document. The application responds to these events as they happen.
How SAX Works
- The parser reads the XML file sequentially from top to bottom.
- When it finds a start tag, it fires a "start element" event.
- When it finds text content, it fires a "characters" event. A single run of text may be delivered in more than one such event.
- When it finds an end tag, it fires an "end element" event.
- The application handles each event with custom code (called a handler).
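The event sequence described above can be observed directly. This sketch uses Python's xml.sax with a handler that does nothing except record each callback; the class name EventRecorder is an illustrative choice, and whitespace-only text is ignored to keep the trace short.

```python
import xml.sax

class EventRecorder(xml.sax.ContentHandler):
    """Record every callback so the order of events can be inspected."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def characters(self, content):
        # Skip whitespace-only text between elements.
        if content.strip():
            self.events.append(("characters", content.strip()))

    def endElement(self, name):
        self.events.append(("end", name))

recorder = EventRecorder()
xml.sax.parseString(b"<book><title>XML Fundamentals</title></book>", recorder)
for event in recorder.events:
    print(event)
```

The trace begins with the start events for book and title, continues with the text, and finishes with the end events in reverse order (title, then book), mirroring the nesting of the document.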
SAX Parsing in Python
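import xml.sax

class BookHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.current = ""  # name of the element currently being read
        self.title = []    # text chunks collected for the current <title>

    def startElement(self, name, attrs):
        self.current = name
        if name == "book":
            print("Found a book with id:", attrs["id"])

    def characters(self, content):
        # Text may arrive in several chunks, so collect it instead of printing
        if self.current == "title":
            self.title.append(content)

    def endElement(self, name):
        # The title is complete only when its end tag is reached
        if name == "title":
            print("Title:", "".join(self.title).strip())
            self.title = []
        self.current = ""

xml_data = """
<library>
  <book id="B01">
    <title>XML Fundamentals</title>
    <author>Grace Lim</author>
  </book>
</library>
"""

xml.sax.parseString(xml_data.encode("utf-8"), BookHandler())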
Output:
Found a book with id: B01
Title: XML Fundamentals
Advantages of SAX Parsing
- Extremely memory-efficient — reads the document one piece at a time.
- Ideal for very large XML files (gigabytes of data).
- Faster than DOM when only a portion of the document needs processing.
Disadvantages of SAX Parsing
- Read-only — cannot modify the XML document.
- Only moves forward — cannot go back to a previously read element.
- More complex to write because the application must track its own state.
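The state-tracking burden is easiest to see when the same tag name appears in more than one place. In this sketch, which assumes a made-up document where library has its own title, the handler keeps a stack of open element names so that it can tell a book title from the library title. The class name PathHandler and the "Catalogue" title are illustrative only.

```python
import xml.sax

class PathHandler(xml.sax.ContentHandler):
    """Collect the text of every library/book/title element."""

    def __init__(self):
        super().__init__()
        self.stack = []   # names of the elements that are currently open
        self.buffer = []  # text chunks for the element being read
        self.titles = []

    def startElement(self, name, attrs):
        self.stack.append(name)
        self.buffer = []

    def characters(self, content):
        self.buffer.append(content)

    def endElement(self, name):
        # Only a <title> directly inside <book> inside <library> counts.
        if self.stack == ["library", "book", "title"]:
            self.titles.append("".join(self.buffer).strip())
        self.stack.pop()
        self.buffer = []

xml_data = b"""<library>
  <title>Catalogue</title>
  <book id="B01"><title>XML Fundamentals</title></book>
  <book id="B02"><title>Advanced XML</title></book>
</library>"""

handler = PathHandler()
xml.sax.parseString(xml_data, handler)
print(handler.titles)  # ['XML Fundamentals', 'Advanced XML']
```

With a DOM tree the same question is a single lookup on each book node; with SAX the application has to rebuild just enough context to answer it.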
DOM vs SAX — Choosing the Right Approach
| Consideration | Use DOM | Use SAX |
|---|---|---|
| File size | Small to medium files | Large or very large files |
| Need to modify XML? | Yes | No |
| Need random access? | Yes (navigate freely) | No (forward only) |
| Memory usage | Higher (full tree in memory) | Minimal |
| Complexity | Simpler to code | More complex to manage state |
ElementTree: A Practical Middle Ground (Python)
Python's xml.etree.ElementTree is a popular, easy-to-use XML parsing library. Like DOM, it builds a tree in memory, but it exposes that tree through a smaller, more Pythonic interface than the DOM API, and it needs far less handler code than SAX for most tasks.
import xml.etree.ElementTree as ET

xml_data = """
<library>
  <book id="B01">
    <title>XML Fundamentals</title>
    <author>Grace Lim</author>
  </book>
  <book id="B02">
    <title>Advanced XML</title>
    <author>Peter Chan</author>
  </book>
</library>
"""

# Parse the string and get the root <library> element
root = ET.fromstring(xml_data)

# Loop over each <book> child and read its attribute and child elements
for book in root.findall("book"):
    book_id = book.get("id")
    title = book.find("title").text
    author = book.find("author").text
    print(f"ID: {book_id}, Title: {title}, Author: {author}")
Output:
ID: B01, Title: XML Fundamentals, Author: Grace Lim
ID: B02, Title: Advanced XML, Author: Peter Chan
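ElementTree can also process a document incrementally, which is one way to read it as sitting between DOM and SAX. The sketch below uses ET.iterparse, a function not covered above, to handle each book as soon as its end tag has been read and then discard its contents. The in-memory io.BytesIO source stands in for what would normally be a large file.

```python
import io
import xml.etree.ElementTree as ET

xml_bytes = b"""<library>
  <book id="B01"><title>XML Fundamentals</title></book>
  <book id="B02"><title>Advanced XML</title></book>
</library>"""

titles = []
# iterparse yields (event, element) pairs while it reads the source.
for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
    if elem.tag == "book":
        # The book is complete here, so its children can be queried as a tree.
        titles.append((elem.get("id"), elem.findtext("title")))
        elem.clear()  # drop the book's children once they have been used

print(titles)  # [('B01', 'XML Fundamentals'), ('B02', 'Advanced XML')]
```

Each book is still a small tree that can be searched with find and get, but clearing it after use keeps memory from growing with the number of books.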
Key Points
- Parsing converts an XML file from text into a structured object that programs can use.
- DOM parsing loads the entire XML into memory as a navigable tree. Best for small-to-medium files that need random access or modification.
- SAX parsing reads the XML sequentially and fires events. Best for large files where memory efficiency is critical.
- DOM allows modifying the document; SAX is read-only and forward-only.
- Python's ElementTree is a practical and beginner-friendly middle option for most XML parsing tasks.
- All major programming languages provide XML parsing tools through standard libraries or packages.
