CDATA Section Delimitosis

Delimitosis = disease pertaining to delimiter

I don't know if it is just because I am a parser-minded person, but the first time I learned about CDATA Sections a warning buzzer went off in my head and has been ringing ever since. It is saying: What if ]]> happens to be in the data you put into a CDATA Section?

Well, obviously it is not allowed. Hmmm. But that is not very helpful, is it? Does that mean I am supposed to check whether my text contains ]]> every time I want to use a CDATA Section? And what should I do if it does?

I want to settle some of the unsettling issues about CDATA Sections here.

Why use a CDATA Section anyway?

If you have seen XML, you've seen that it uses less than and greater than signs to create "tags" which have a special meaning and are not part of the normal text content. If you know more about XML, you know that if you want a less than or greater than sign in your normal text content you "escape" that character as &lt; or &gt; so that it is not mistaken for part of a tag.

XML is often appreciated for its human readability, but if you have played around with XML (especially with a text editor) you may have run into a situation where some markup like HTML or other XML is placed as text content into an XML document and when all of the special characters are escaped it becomes unreadable. That's probably when you learned about CDATA Sections.
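
To make the readability problem concrete, here is a small Python sketch (Python and the sample HTML string are my choices for illustration only). It uses the standard library's xml.sax.saxutils.escape to show what a short piece of markup looks like once it is escaped for use as text content:

```python
from xml.sax.saxutils import escape

# A hypothetical HTML fragment we want to store as plain text content.
html = '<p class="x">Tom & Jerry</p>'

# escape() replaces &, < and > with their entity references.
escaped = escape(html)
print(escaped)
```

Even for one short line the escaped form is noticeably harder to read than the original, and it only gets worse with longer markup.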

A CDATA Section is a kind of text node in XML in which the special characters do not need to be escaped. A CDATA Section starts with

<![CDATA[

and ends with

]]>

CDATA Sections are useful for all these types of situations:

    I want to store XML or HTML as data in my XML document
    I want to preserve all of the spaces, tabs, carriage returns and line feeds in my text value
    I want my source code snippets to be readable in a text editor without escaped less than and greater than comparison operators

A CDATA Section is never necessary programmatically because tools will escape and unescape special characters in the parsed data of the XML, and you can usually preserve whitespace using the xml:space="preserve" attribute on the element or declared in the DTD. CDATA Sections can be desirable for the sake of usability in a text editor. The performance hit of escaping and unescaping large data values, or the slight increase in space requirements of large escaped data values, might also be a consideration in rare circumstances.
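
As a quick check of the "never necessary programmatically" point, this sketch parses the same text value once in escaped form and once in a CDATA Section, using Python's standard xml.etree.ElementTree. The element name and the code snippet inside it are made up for the example:

```python
import xml.etree.ElementTree as ET

# The same text value written two ways: escaped, and in a CDATA Section.
escaped_doc = "<code>if (a &lt; b &amp;&amp; b &gt; c) run();</code>"
cdata_doc = "<code><![CDATA[if (a < b && b > c) run();]]></code>"

text_from_escaped = ET.fromstring(escaped_doc).text
text_from_cdata = ET.fromstring(cdata_doc).text

print(text_from_escaped)
print(text_from_escaped == text_from_cdata)
```

The program that reads the document gets the same string either way; the difference is only in how the file looks in a text editor.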

Researching the delimiter issue

Section 2.7, CDATA Sections, of the XML specification makes it clear only that the ]]> delimiter marks the end of the CDATA Section. It is good for the XML Spec to be brief and to the point, but other information sources should delve into the obvious usability issues. Regular XML tutorials like ICommerce and other sources such as Wikipedia may go the extra logical step of saying that the CDATA Section cannot contain the delimiter and that CDATA Sections cannot be nested, but that is it.

In "What is it about CDATA Sections?" someone points out that some XML books miss the boat on CDATA Sections. One book supposedly said to use "<![cdata[" which is incorrect because CDATA must be uppercase, and another said CDATA Sections cannot contain "]]" (but there is no problem with ]] inside a CDATA Section). Where I disagree is when he blames a book for failing to mention that the string "]]>" (unescaped) can't ever occur in a document except when it ends a CDATA section. To me this is unnecessary trivia.

MSDN has no mention of this issue in any of the XML component documentation, including the XmlCDataSection class and XmlWriter. MSDN does address it under the question "How do you encode ]]> inside of a CDATA section?" in The XML Files, but it only examines a limited example without reference to programming with Microsoft XML tools.

The split-cdata-sections parameter was added to the DOM Level 3 Core Working Draft in January 2002, almost 4 years ago. It says: "Split CDATA sections containing the CDATA section termination marker ']]>'." So, taking the lead from the XML Document Object Model, the correct way to deal with the delimiter in a CDATA Section is to split the CDATA Section at the delimiter and start a new CDATA Section.

The Sun Java CDATASection Interface documentation addresses this more directly: "No lexical check is done on the content of a CDATA section and it is therefore possible to have the character sequence "]]>" in the content, which is illegal in a CDATA section per section 2.7 of [XML 1.0]. The presence of this character sequence must generate a fatal error during serialization or the cdata section must be splitted before the serialization (see also the parameter "split-cdata-sections" in the DOMConfiguration interface)."
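
For comparison with a DOM implementation outside the Java world, here is a sketch using Python's standard xml.dom.minidom. To my knowledge it takes the "error during serialization" route rather than splitting: creating the node succeeds without any check, and the problem only surfaces when the document is written out. The element name "note" and the sample data are made up for the example:

```python
from xml.dom.minidom import Document

doc = Document()
root = doc.createElement("note")
doc.appendChild(root)

# No lexical check happens here, so the bad sequence is accepted.
root.appendChild(doc.createCDATASection("a ]]> b"))

# The check happens when the document is serialized.
try:
    doc.toxml()
    serialization_error = None
except ValueError as exc:
    serialization_error = str(exc)

print(serialization_error)
```

So with this tool the programmer has to split the data before creating the CDATA Section nodes, or catch the error.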

The Apache Xerces documentation also says that "When a CDATA section is split a warning is issued."

So, can you nest CDATA Sections or not?

Well, no, a CDATA Section cannot contain another CDATA Section because of the ]]> delimiter. However, the CDATA section splitting technique supported in some tools works and is generalizable. Most XML tools concatenate adjacent text nodes inside an element when you obtain the text value of the element. Therefore, if the delimiter is split between two adjacent CDATA Sections, the writer splits and the reader concatenates, and the programmer can ignore the whole process. Visually, with this splitting technique, the ]]> delimiter is replaced by ]]]]><![CDATA[> which may seem a bit messy, but just think of that long string as the delimiter for the nested CDATA Section.
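
The splitting technique is short enough to sketch in a few lines of Python. The function name to_cdata and the sample strings are my own, hypothetical names for illustration; the parser used for the round trip is the standard xml.etree.ElementTree, which concatenates the adjacent sections when you ask for the element text:

```python
import xml.etree.ElementTree as ET

CDATA_OPEN = "<![CDATA["
CDATA_CLOSE = "]]>"

def to_cdata(text: str) -> str:
    """Wrap text in CDATA Sections, splitting at every ']]>' in the data."""
    # Each "]]>" becomes "]]" + close + open + ">", so the delimiter
    # never appears intact inside any single section.
    safe = text.replace(CDATA_CLOSE, "]]" + CDATA_CLOSE + CDATA_OPEN + ">")
    return CDATA_OPEN + safe + CDATA_CLOSE

# Data that itself contains a complete CDATA Section.
inner = "<x><![CDATA[1 < 2]]></x>"
outer = "<doc>" + to_cdata(inner) + "</doc>"
print(outer)

# Reading the element text gives back the original data unchanged.
recovered = ET.fromstring(outer).text
print(recovered == inner)
```

Because the replacement only ever inserts a close and an open between "]]" and ">", it works no matter how many times the delimiter occurs in the data.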

Do not use homegrown solutions like replacing occurrences of ]]> with something else like ]_]_> or </xml-cdata> as offered in the responses to this JOS post: "Nesting CDATA sections". First of all, there is no standard way of communicating to consumers of your XML document that you have done this, and it goes against the unparsed nature of CDATA Sections. Secondly, by doing this you just postpone the problem by creating a different string that cannot exist in the CDATA Section, so it is not generalizable.

My C++ XML product CMarkup supports splitting and concatenating CDATA Sections seamlessly in element data.

Numeric Character References

I should also touch on numeric character references. These are often used when the XML file encoding will not support the character, or just to avoid text encoding limitations across various tools and viewers. A reference such as &#x6D6C; to represent 浬 (U+6D6C) will be treated as text rather than as a reference in a CDATA Section. So your XML tool will not translate it to the actual character for you when retrieving the text value from CDATA Sections in the document.
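
A small Python sketch of that difference, again with the standard xml.etree.ElementTree and a made-up element name. The same reference is placed once in ordinary text content and once in a CDATA Section:

```python
import xml.etree.ElementTree as ET

# In ordinary text content the parser resolves the reference.
in_text = ET.fromstring("<t>&#x6D6C;</t>").text

# In a CDATA Section the same characters are kept literally.
in_cdata = ET.fromstring("<t><![CDATA[&#x6D6C;]]></t>").text

print(in_text == "\u6d6c")
print(in_cdata == "&#x6D6C;")
```

If you rely on numeric character references, that data has to stay outside CDATA Sections or be decoded by your own code afterwards.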

I've run out of time, but this should be a good starter and hopefully a cure for CDATA Section "delimitosis".