More Than Just Leftovers
In this session, we discuss some miscellaneous topics, mainly concerning what is and isn't legal content in XML documents, and ways to work around these rules using special constructs. The topics we cover are:
- Comments and how they behave
- White-space handling
- Special characters in content
- Creating parser-ignorable content
- Embedding version and character-set information
We will illustrate these concepts with an example, "The King's Speech", based on the British film that won multiple Oscars in 2011. The movie dealt with King George VI's stammer and its correction by a speech therapist. Let us now imagine the King in modern times, with the BBC recording his World War III speech to motivate British Army troops stationed across the world. Here is how the BBC's state-of-the-art speech recording and filtering software would go about its task, with XML chosen as the output format:
Stammering speech is stored as comments in the XML document. (This content is useful for compiling and studying statistics.) Long periods of silence (no noise) are recorded as one white-space character per second of silence. The statistics use special mathematical characters such as '<'. Recording noise is also identified and isolated, but kept in the document as special unparsed data, for the tool's own logging and improvement. Here is the example document:
<?xml version="1.0" encoding="UTF-8" ?>
<king-speech>
<!--line> duh-duh-duh (This is Stammering) duh-duh </line-->
<line>The time has come for the British to stand up for the rest of Humanity</line>
<line> </line>
<![CDATA[ALL RECORDING NOISE*****]]>
<!--line>The rest of the speech</line-->
<statistics>Stammer Percentage &lt; 30 per cent</statistics>
<statistics>Noise Correction &lt; 45 per cent</statistics>
</king-speech>
Here is how the syntax used in this document maps to the requirements stated just before it:
- Comment syntax: <!-- starts a comment and --> ends it. Nothing in between is treated as document content, so this is ideal for storing the King's stammering to be analyzed later on. Note that the text of a comment must not contain two consecutive hyphens (--).
- White-space: considered valid content, here used to model silence in the speech.
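- Recording noise: content that the parser should not interpret as markup. We could have discarded it, but decided to keep it for statistical purposes. It is wrapped in a CDATA section, which starts with <![CDATA[ and ends with ]]>; everything in between is passed through as literal character data. A CDATA section still holds only characters, so a true binary bit pattern would first have to be encoded as text (for example, in Base64).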
- Special characters: a mathematical symbol such as '<' can be mistaken for the beginning of a tag, so it is written as the entity reference '&lt;' (familiar from HTML) to preserve it in content.
- XML declaration: the first line of the document identifies it as XML and states the version of the XML standard (1.0) and the character encoding (UTF-8).
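To see how a parser treats each of these constructs, here is a minimal sketch using Python's standard xml.etree.ElementTree module. It assumes the example document is written with <!-- and --> as the comment delimiters and with each literal '<' in the statistics escaped as &lt;. The variable names, and the short "café" documents used for the encoding check, are illustrative only and are not part of the original example.

```python
import xml.etree.ElementTree as ET

# The example document as bytes, since the XML declaration names an encoding.
# Assumption: comments use <!-- ... --> and '<' in content is written as &lt;.
DOCUMENT = b"""<?xml version="1.0" encoding="UTF-8" ?>
<king-speech>
<!--line> duh-duh-duh (This is Stammering) duh-duh </line-->
<line>The time has come for the British to stand up for the rest of Humanity</line>
<line> </line>
<![CDATA[ALL RECORDING NOISE*****]]>
<!--line>The rest of the speech</line-->
<statistics>Stammer Percentage &lt; 30 per cent</statistics>
<statistics>Noise Correction &lt; 45 per cent</statistics>
</king-speech>"""

root = ET.fromstring(DOCUMENT)

# Comments: the default ElementTree parser drops them, so only the two
# uncommented <line> elements reach the tree.
lines = root.findall("line")
all_text = "".join(root.itertext())

# White-space: the single space in the second <line> is kept as content.
silence = lines[1].text

# Special characters: &lt; is decoded back to '<' in the parsed text.
stats = [s.text for s in root.findall("statistics")]

# Encoding: the declaration tells the parser how to decode the bytes, so the
# same text stored in two encodings parses to the same string.
latin1_doc = '<?xml version="1.0" encoding="ISO-8859-1"?><line>caf\u00e9</line>'.encode("iso-8859-1")
utf8_doc = '<?xml version="1.0" encoding="UTF-8"?><line>caf\u00e9</line>'.encode("utf-8")
same_text = ET.fromstring(latin1_doc).text == ET.fromstring(utf8_doc).text

print(len(lines), repr(silence))
print("duh" in all_text)                       # stammering stayed in comments
print("ALL RECORDING NOISE*****" in all_text)  # CDATA arrives as plain text
print(stats)
print(same_text)
```

ElementTree is only one possible choice here: it discards comments and merges CDATA into ordinary text, which matches the "ignored by the application" behavior described in the list. A tool that needs to analyze the stammering later would want a parser that keeps comment nodes.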
The main point is that XML stores only character data and reserves a few characters and character sequences for its own markup. But it also provides constructs for working within these syntactic rules, as we saw in the document example above.
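The same point can be seen from the writing side. The sketch below, again in Python, shows a raw '<' making a document ill-formed and two ways of preserving it: escaping with the standard library's xml.sax.saxutils.escape, or wrapping the text in a CDATA section. The helpers is_well_formed and cdata_section are hypothetical names introduced for this illustration, not part of any library.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as a well-formed XML document."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


def cdata_section(text: str) -> str:
    """Wrap text in a CDATA section (hypothetical helper).

    A CDATA section cannot contain ']]>', so any occurrence is split
    across two adjacent sections.
    """
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


raw = "Stammer Percentage < 30 per cent"

unescaped = f"<statistics>{raw}</statistics>"                 # raw '<' in content
escaped = f"<statistics>{escape(raw)}</statistics>"           # '<' becomes &lt;
wrapped = f"<statistics>{cdata_section(raw)}</statistics>"    # CDATA alternative

print(is_well_formed(unescaped))            # the raw '<' is rejected
print(escaped)
print(ET.fromstring(escaped).text == raw)   # escaping round-trips
print(ET.fromstring(wrapped).text == raw)   # so does the CDATA section
```

Escaping suits short text with an occasional special character, while a CDATA section is more convenient for longer runs of text that would otherwise need many entity references.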