XML Text, CDATA, and Character Encoding
Text is the most common type of content inside XML elements. However, not all text can be placed directly into XML without some consideration. Certain characters have special meaning in XML and can cause parsing errors if used carelessly. This topic covers how XML handles text content, how to safely include special characters using CDATA sections, and how character encoding works.
Text Content in XML
Any text placed between the start and end tags of an XML element is called text content or character data. The XML parser reads this text and makes it available to the application.
<message>Welcome to the XML tutorial.</message>
The text Welcome to the XML tutorial. is the character data of the <message> element.
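To see how an application receives this character data, here is a minimal sketch in Python. It assumes the standard library module xml.etree.ElementTree as the parser; the variable name message is only illustrative.

```python
import xml.etree.ElementTree as ET

# Parse the one-element document from the example above.
message = ET.fromstring("<message>Welcome to the XML tutorial.</message>")

# .text holds the character data between the start and end tags.
print(message.tag)   # message
print(message.text)  # Welcome to the XML tutorial.
```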
The Problem with Special Characters
XML uses certain characters as part of its syntax. If these characters appear in text content, the parser gets confused because it cannot tell whether the character is part of the XML structure or part of the data.
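XML predefines entity references for five such characters. Only < and & must always be escaped in text content; >, ', and " are commonly escaped as well, and the two quote characters matter mainly inside attribute values: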
| Character | Why It's Problematic | Entity Reference |
|---|---|---|
| < | Looks like the start of a tag | &lt; |
| > | Looks like the end of a tag | &gt; |
| & | Looks like the start of an entity reference | &amp; |
| ' | Can conflict with single-quoted attribute values | &apos; |
| " | Can conflict with double-quoted attribute values | &quot; |
Example Using Entity References
<formula>
If x &lt; 10 &amp; y &gt; 5, then the condition is true.
</formula>
When this XML is parsed, the entity references are automatically converted back to their actual characters: <, &, and >.
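The round trip from raw text to entity references and back can be sketched in Python. This assumes the standard library helpers xml.sax.saxutils.escape and xml.etree.ElementTree; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "If x < 10 & y > 5, then the condition is true."

# escape() replaces &, <, and > with &amp;, &lt;, and &gt;.
escaped = escape(raw)
document = "<formula>" + escaped + "</formula>"

# Parsing turns the entity references back into the original characters.
formula = ET.fromstring(document)

print(escaped)
print(formula.text == raw)  # True
```

Escaping at write time and letting the parser unescape at read time means the application never handles entity references by hand.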
CDATA Sections
When text content contains many special characters — such as HTML code or programming code — using entity references for every character becomes tedious and hard to read. The solution is a CDATA section.
CDATA stands for Character Data. A CDATA section tells the XML parser: "treat everything inside this block as plain text, not as XML markup." The parser will not try to interpret anything within a CDATA section as tags or entity references.
CDATA Syntax
<![CDATA[ your content here ]]>
- Starts with <![CDATA[
- Ends with ]]>
- Everything in between is treated as literal text.
Example: Storing HTML Inside XML Without CDATA
This would cause a parsing error:
<webpage>
<h1>Welcome</h1>
<p>Hello & welcome to our <b>site</b>.</p>
</webpage>
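The bare & is not a valid entity reference, so the parser rejects the document. Even without it, the HTML tags would be parsed as child elements of <webpage> rather than kept as text.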
Example: Storing HTML Inside XML With CDATA
<webpage>
<![CDATA[
<h1>Welcome</h1>
<p>Hello & welcome to our <b>site</b>.</p>
]]>
</webpage>
Now the HTML is safely wrapped in CDATA. The parser sees the entire HTML block as plain text and does not try to parse it as XML.
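The difference between the two versions can be checked with a short Python sketch. It assumes the standard library parser xml.etree.ElementTree and uses a shortened, single-line form of the examples above; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

without_cdata = "<webpage><p>Hello & welcome to our <b>site</b>.</p></webpage>"
with_cdata = (
    "<webpage><![CDATA[<p>Hello & welcome to our <b>site</b>.</p>]]></webpage>"
)

# The bare & makes the first document ill-formed, so parsing fails.
try:
    ET.fromstring(without_cdata)
    parse_failed = False
except ET.ParseError:
    parse_failed = True

# Inside CDATA the same markup is ordinary text with no child elements.
webpage = ET.fromstring(with_cdata)

print(parse_failed)   # True
print(webpage.text)   # <p>Hello & welcome to our <b>site</b>.</p>
print(len(webpage))   # 0 (no child elements)
```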
Example: Storing a Code Snippet in XML
<code_example>
<![CDATA[
if (score > 90 && grade == "A") {
System.out.println("Excellent!");
}
]]>
</code_example>
Without CDATA, each & in && would have to be written as &amp;, and many authors would also escape > and ", making the code hard to read.
What CDATA Cannot Contain
The only string that a CDATA section cannot contain is the closing sequence ]]>, because the parser uses it to detect the end of the CDATA block. If the text itself happens to contain ]]>, it must be split into two CDATA sections.
<data>
<![CDATA[First part ]]]]><![CDATA[> second part]]>
</data>
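The splitting rule can be automated. The helper below, to_cdata, is a hypothetical name for a minimal Python sketch; it assumes the standard library parser xml.etree.ElementTree to confirm that the split sections are read back as one string.

```python
import xml.etree.ElementTree as ET


def to_cdata(text: str) -> str:
    """Wrap text in CDATA, splitting every ]]> across two sections."""
    # Each "]]>" becomes "]]", then a section end and a new section
    # start, then ">", so no single section contains the full sequence.
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


payload = "First part ]]> second part"
document = "<data>" + to_cdata(payload) + "</data>"

# The parser joins the adjacent sections back into the original string.
data = ET.fromstring(document)

print(to_cdata(payload))
print(data.text == payload)  # True
```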
Character Encoding
XML files are plain text files, but text is stored differently depending on the character encoding used. Character encoding is the system that maps letters, numbers, and symbols to binary values that computers can store.
The most important encodings to know for XML are:
- UTF-8: The most common encoding. Can represent every Unicode character, which covers virtually all written languages. Uses between 1 and 4 bytes per character. Recommended for all XML files.
- UTF-16: Uses 2 or 4 bytes per character. Covers the same Unicode range as UTF-8 but is less commonly used for XML.
- ISO-8859-1 (Latin-1): An older encoding that only supports Western European characters. Not recommended for new projects.
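The byte counts mentioned above can be verified with a small Python sketch. The sample characters are arbitrary choices for illustration, not taken from any particular XML file.

```python
# Characters chosen to span the 1- to 4-byte range of UTF-8.
samples = ["A", "é", "こ", "😀"]

# Map each character to (UTF-8 byte count, UTF-16 byte count).
# "utf-16-be" is used so that no byte order mark is added to the count.
sizes = {
    ch: (len(ch.encode("utf-8")), len(ch.encode("utf-16-be")))
    for ch in samples
}

for ch, (utf8_len, utf16_len) in sizes.items():
    print(f"{ch!r}: UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")
```

For mostly ASCII content, which includes XML markup itself, UTF-8 is the more compact of the two.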
Declaring Encoding in the XML Declaration
<?xml version="1.0" encoding="UTF-8"?>
If no encoding is declared and nothing outside the file (such as an HTTP header) specifies one, the XML parser assumes UTF-8, or UTF-16 if the file begins with a byte order mark (BOM). Declaring the encoding explicitly is best practice because it removes any guesswork.
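The effect of the declaration can be observed with a short Python sketch. It assumes the standard library parser xml.etree.ElementTree and a made-up <word> element; the same Latin-1 bytes are parsed once with a declaration and once without.

```python
import xml.etree.ElementTree as ET

# The same Latin-1 bytes, with and without an encoding declaration.
body = "<word>café</word>".encode("iso-8859-1")
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + body

# With the declaration, the parser decodes the byte 0xE9 as é.
word = ET.fromstring(declared)

# Without it, the parser assumes UTF-8, where a lone 0xE9 is invalid.
try:
    ET.fromstring(body)
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False

print(word.text)
print(undeclared_ok)
```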
Why Encoding Matters
<?xml version="1.0" encoding="UTF-8"?>
<greetings>
<greeting lang="Arabic">مرحبا</greeting>
<greeting lang="Japanese">こんにちは</greeting>
<greeting lang="English">Hello</greeting>
</greetings>
Because UTF-8 is declared, this XML file can safely store text from multiple languages in the same document. If a narrower encoding like ISO-8859-1 were used, the Arabic and Japanese characters would not be stored correctly.
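A quick way to check which encodings can hold each greeting is to try encoding the strings directly. The sketch below is plain Python; the helper name fits is hypothetical.

```python
greetings = {
    "Arabic": "مرحبا",
    "Japanese": "こんにちは",
    "English": "Hello",
}


def fits(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# For each language: (fits in UTF-8, fits in ISO-8859-1).
results = {
    lang: (fits(text, "utf-8"), fits(text, "iso-8859-1"))
    for lang, text in greetings.items()
}

for lang, (utf8_ok, latin1_ok) in results.items():
    print(f"{lang}: UTF-8={utf8_ok}, ISO-8859-1={latin1_ok}")
```

Only the English greeting fits in ISO-8859-1, while all three fit in UTF-8.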
Key Points
- Text content is the data placed between an element's opening and closing tags.
- Special characters like <, >, and & should be written as entity references (&lt;, &gt;, &amp;) in text content to avoid parsing errors.
- CDATA sections (<![CDATA[ ... ]]>) allow raw text, including special characters and markup, to be embedded without escaping.
- CDATA is especially useful for embedding HTML, code snippets, or any content with many special characters.
- The closing sequence ]]> cannot appear inside a CDATA section.
- UTF-8 is the recommended character encoding for XML documents and supports characters from virtually all languages.
- Always declare the encoding in the XML declaration to ensure consistent processing.
