Steven Pemberton, CWI, Amsterdam
Invisible XML (ixml) is a method for treating non-XML documents as if they were XML.
After a number of design iterations, the language is ready for specification. This paper describes the production of the specification of ixml. An interesting aspect of this specification is that ixml is itself an application of ixml: the grammar describes itself, and therefore can be used to parse itself, and thus produce an XML representation of the grammar. We discuss the decisions taken to produce the most useful XML version possible.
Cite as: Steven Pemberton, On the Specification of Invisible XML, Proc. XML Prague 2019, Prague, Czech Republic, pp 413-430, ISBN 978-80-906259-6-9 (pdf), https://archive.xmlprague.cz/2019/files/xmlprague-2019-proceedings.pdf#page=425
Invisible XML (ixml) is a method for treating non-XML documents as if they were XML.
This gives a number of advantages:
IXML works by providing a description of the data or document of interest in the form of a (context-free) grammar. This grammar is then used to parse the document, and the resulting parse tree is then serialised as an XML document. The grammar includes extra information about how the tree should be serialised, allowing the eliding of unnecessary nodes, and the choice of serialising nodes as XML elements or attributes.
As a (necessarily small) example, take the definition of the email type from XForms 2.0 [xf], which is defined by a regular expression:
<xs:simpleType name="email">
   <xs:restriction base="xs:string">
      <xs:pattern value="([A-Za-z0-9!#-'\*\+\-/=\?\^_`\{-~]+) (\.[A-Za-z0-9!#-'\*\+\-/=\?\^_`\{-~]+)* @([A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?) (\.[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?)+"/>
   </xs:restriction>
</xs:simpleType>
If we turn this into ixml, we get the following:
email: user, "@", host. {An email address has two parts separated by an @ sign}
user: atom+".". {The user part is one or more atoms, separated by dots}
atom: char+. {An atom is a string of one or more 'char'}
host: domain+".". {A host is a series of domains, separated by dots}
domain: word+"-". {A domain may contain a hyphen, but not start or end with one}
word: letgit+. {A domain otherwise consists of letters and digits}
-letgit: ["A"-"Z"; "a"-"z"; "0"-"9"].
-char: letgit; ["!#$%&'*+-/=?^_`{|}~"]. {A char is a letter, digit, or punctuation.}
If we now feed into this the string
~my_mail+{nospam}$?@sub-domain.example.info
we will get the following XML serialisation:
<email>
   <user>
      <atom>~my_mail+{nospam}$?</atom>
   </user>@
   <host>
      <domain>
         <word>sub</word>-
         <word>domain</word>
      </domain>.
      <domain>
         <word>example</word>
      </domain>.
      <domain>
         <word>info</word>
      </domain>
   </host>
</email>
If the rule for letgit hadn't had a dash before it, then, for instance, the element <word>sub</word> would have looked like this:
<word><letgit>s</letgit><letgit>u</letgit><letgit>b</letgit></word>
Since the word part of a domain has no semantic meaning, we can exclude it from the serialisation by changing the rule for word into:
-word: letgit+.
to give:
<email>
   <user>
      <atom>~my_mail+{nospam}$?</atom>
   </user>@
   <host>
      <domain>sub-domain</domain>.
      <domain>example</domain>.
      <domain>info</domain>
   </host>
</email>
If we change the rules for atom and domain into
-atom: char+.
-domain: word+"-".
we get:
<email>
   <user>~my_mail+{nospam}$?</user>@
   <host>sub-domain.example.info</host>
</email>
Finally, changing the rules for user and host to:
@user: atom+".".
@host: domain+".".
gives:
<email user='~my_mail+{nospam}$?' host='sub-domain.example.info'>@</email>
To get rid of the left-over "@", we can change the rule for email to:
email: user, -"@", host.
to give:
<email user='~my_mail+{nospam}$?' host='sub-domain.example.info'/>
What we can see here is that the definition is much more structured than a regular expression, and that we have a lot of control over what emerges from the XML serialisation.
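To make the end result concrete, here is a minimal Python sketch that hand-codes the final version of the email grammar. It is not an ixml processor: the rules are transcribed by hand into regular expressions, and the names `EMAIL` and `parse_email` are assumptions made for this illustration. It only shows what the last serialisation, with user and host as attributes and the "@" deleted, amounts to.

```python
import re
import xml.etree.ElementTree as ET

# Character sets copied from the grammar: letgit, and char = letgit or punctuation.
LETGIT = "A-Za-z0-9"
CHAR = LETGIT + re.escape("!#$%&'*+-/=?^_`{|}~")

ATOM = f"[{CHAR}]+"                        # atom: char+.
DOMAIN = f"[{LETGIT}]+(?:-[{LETGIT}]+)*"   # domain: word+"-".
EMAIL = re.compile(
    f"(?P<user>{ATOM}(?:[.]{ATOM})*)"      # @user: atom+".".
    "@"                                    # -"@" (matched, not serialised)
    f"(?P<host>{DOMAIN}(?:[.]{DOMAIN})*)"  # @host: domain+".".
)


def parse_email(text):
    """Return an <email> element with user and host attributes."""
    m = EMAIL.fullmatch(text)
    if m is None:
        raise ValueError(f"not an email address: {text!r}")
    return ET.Element("email", user=m.group("user"), host=m.group("host"))


elem = parse_email("~my_mail+{nospam}$?@sub-domain.example.info")
print(ET.tostring(elem, encoding="unicode"))
```

A real ixml implementation would derive all of this from the grammar itself; the sketch merely mirrors the structure of the rules.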
After a number of design iterations [ixml1, ixml2, ixml3, ixml4], the language is now ready for specification. A draft specification [ixml] has been produced. This paper describes the specification, and some of the decisions made.
The grammar is based on a format known as VWG [vwg].
A VWG grammar consists of a series of rules, where a rule consists of a name, a colon, and one or more 'alternatives' separated by semicolons, all terminated by a dot. An alternative consists of zero or more 'terms', separated by commas.
Expressing this in ixml would look like this:
ixml: rule+.
rule: name, ":", alternatives, ".".
alternatives: alternative+";".
alternative: term*",".
This introduces some of the extensions ixml has added to VWGs. Appending a star or plus to a term is used in many grammar systems and related notations, such as regular expressions, to express repetition of the term: zero or more times for a star, one or more times for a plus. An ixml innovation is to allow both postfix operators to be used as infix operators as well; the right-hand operand then defines a separator that comes between the repeated terms.
Note that these additions do not add any power to the language (they can all be represented by other means). However they do add ease of authoring, since the alternative way of writing them is verbose, and obscures the purpose of the extra rules that have to be added.
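As an illustration of the "other means", the following Python sketch rewrites a rule that uses `?`, `*` or `+` (with or without a separator) into rules that use only alternatives and recursion. The function name `desugar` and the `-plus` suffix for the helper rule are assumptions made for this example; the rewritten rules accept the same strings, although their default serialisation would differ because of the extra nesting.

```python
def desugar(name, factor, op, sep=None):
    """Return plain rules accepting the same strings as `name: factor op sep.`"""
    if op == "?":
        return [f"{name}: ; {factor}."]
    # One repetition step: the factor, then the separator if there is one.
    step = f"{factor}, {sep}, " if sep else f"{factor}, "
    if op == "+":
        return [f"{name}: {factor}; {step}{name}."]
    if op == "*" and sep is None:
        return [f"{name}: ; {step}{name}."]
    if op == "*":
        helper = f"{name}-plus"  # hypothetical helper rule name
        return [
            f"{name}: ; {helper}.",
            f"{helper}: {factor}; {step}{helper}.",
        ]
    raise ValueError(f"unknown operator: {op!r}")


for rule in desugar("alternative", "term", "*", '","'):
    print(rule)
```

Even for this one-line rule the expansion needs an extra rule and an empty alternative, which is the verbosity the infix operators are meant to avoid.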
A term is a factor, an optional factor, or a repeated factor, repeated zero or more, or one or more times:
term: factor; option; repeat0; repeat1.
option: factor, "?".
repeat0: factor, "*", separator?.
repeat1: factor, "+", separator?.
separator: factor.
These rules also demonstrate the use of a question mark to indicate an optional factor.
A factor is a terminal, a nonterminal, or a bracketed series of alternatives:
factor: terminal; nonterminal; "(", alternatives, ")".
A nonterminal is just a name, which refers to the rule defining that name:
nonterminal: name.
Terminals are the elements that actually match characters in the input; Unicode characters are used. There are two forms for terminals: literal strings, and character sets.
The simplest form is a literal string, as we have seen above, such as ":" and ",". Strings may be delimited by either double quotes or single: ":" and ':' are equivalent.
If you want to include the delimiting quote in a string, it should be doubled: "don't" and 'don''t' are equivalent.
terminal: literal; charset.
literal: quoted; encoded.
quoted: '"', dchar+, '"'; "'", schar+, "'".
dchar: ~['"']; '""'.
schar: ~["'"]; "''".
This introduces the exclusion: the construct ~['"'] matches any character except what is enclosed in the brackets, and is defined below.
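The quoting rule for strings can be sketched in a few lines of Python. The function name `unquote` is an assumption for this illustration; it takes a quoted literal as written in a grammar and returns the characters it stands for, rejecting a delimiter that is not doubled.

```python
def unquote(literal):
    """Return the characters denoted by an ixml-style quoted string."""
    if len(literal) < 3 or literal[0] not in "\"'" or literal[-1] != literal[0]:
        raise ValueError(f"not a quoted string: {literal!r}")
    quote = literal[0]
    body = literal[1:-1]
    # After removing doubled delimiters, no lone delimiter may remain.
    if quote in body.replace(quote * 2, ""):
        raise ValueError(f"undoubled {quote} in {literal!r}")
    return body.replace(quote * 2, quote)


print(unquote('"don\'t"'))
print(unquote("'don''t'"))
```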
In order to express characters that have no visible representation, or an ambiguous one, there is a notation for encoded characters. For instance #a0 represents a non-breaking space.
encoded: "#", hex+.
hex: ["0"-"9"; "a"-"f"; "A"-"F"].
Encoded characters do not appear within strings, but are free-standing; however, this doesn't limit expressiveness, since a rule like
end: "the", #a0, "end".
represents the seven characters with a non-breaking space in the middle.
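The count of seven can be checked with a small Python sketch; `decode` is a hypothetical helper that turns an encoded character into the character it denotes.

```python
import string


def decode(encoded):
    """Turn an encoded character such as '#a0' into the character itself."""
    digits = encoded[1:]
    if not encoded.startswith("#") or not digits:
        raise ValueError(f"not an encoded character: {encoded!r}")
    if any(d not in string.hexdigits for d in digits):
        raise ValueError(f"not hexadecimal: {digits!r}")
    return chr(int(digits, 16))


# end: "the", #a0, "end".
end = "the" + decode("#a0") + "end"
print(len(end), repr(end))
```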
Character sets, which we have also seen in the earlier simple example, allow you to match a character from a set of characters, such as ["A"-"Z"; "a"-"z"; "0"-"9"]. Elements of a character set can be a literal string, representing all the characters in the string, a range like above, or a character class. Unicode defines a number of classes, such as lower case letter and upper case letter, encoded with two-letter abbreviations, such as Ll and Lu; these abbreviations may be used, thus saving the work of having to define those classes yourself:
charset: inclusion; exclusion.
inclusion: "[", element+, "]".
exclusion: "~", inclusion.
element: quoted; range; class.
range: char, "-", char.
class: letter, letter.
char: '"', dchar, '"'; "'", schar, "'"; encoded.
letter: ["a"-"z"; "A"-"Z"].
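A minimal Python sketch of how a single character could be tested against such a character set follows. The tagged-tuple representation of elements and the name `in_charset` are assumptions of this example, not part of ixml; Unicode classes are looked up with the standard `unicodedata.category` function.

```python
import unicodedata


def in_charset(ch, elements, exclude=False):
    """Test one character against a list of charset elements.

    Elements are ("lit", string), ("range", low, high) or ("class", "Ll").
    With exclude=True the set is an exclusion (~[...]).
    """
    def hit(element):
        kind = element[0]
        if kind == "lit":
            return ch in element[1]
        if kind == "range":
            return element[1] <= ch <= element[2]
        if kind == "class":
            return unicodedata.category(ch) == element[1]
        raise ValueError(f"unknown element kind: {kind!r}")

    return any(hit(e) for e in elements) != exclude


# ["A"-"Z"; "a"-"z"; "0"-"9"]
LETGIT = [("range", "A", "Z"), ("range", "a", "z"), ("range", "0", "9")]
print(in_charset("q", LETGIT), in_charset("-", LETGIT))
```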
The question arises: what should be allowed as a name in ixml?
One of the more opaque sections of the XML specification [xml] is which characters are permitted in a name, since it is specified almost entirely in terms of ranges over hexadecimal numbers, without further explanation. In an informal test of a handful of XML users, we found they were unable to confidently determine if certain element names were valid XML or not. Just to give you a taste of the problem, this character may not appear in an XML identifier: µ, while this one may: μ. They are both versions of the Greek letter mu, but one is at #B5 and the other at #3BC.
So should ixml continue this opacity, or is it possible to create something more transparent? Should ixml exactly duplicate the XML rules? Not all ixml names are serialised, so there is no a priori need for them to adhere to XML rules; furthermore, there may in the future be other serialisations; it could be left to the author to make sure that names adhered to the rules needed for the chosen serialisation. Would there be holes, and would it matter?
So what does XML allow? This is the XML rule:
NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
NameChar ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
So an XML name can start with a letter, a colon, an underscore, and "other stuff"; it can continue with the same characters, a hyphen, a dot, the digits, and some more other stuff.
In brief, the "other stuff" for a start character is anything between #C0 (which is À) and #EFFFF (which is not assigned, so really shouldn't be allowed) -- mostly representing a huge collection of letters from many languages. What is excluded is:
A name continuation character consists of the same characters plus, as already said, a hyphen, a dot, and the digits, then #B7 (the middot "·"), the combining characters left out of the start characters, and finally the two characters #203F and #2040, the undertie and overtie characters ‿⁀.
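The XML rule quoted above can be transcribed directly into range tables, which makes it easy to test individual characters such as the two forms of mu. This is a sketch: the table and function names are assumptions of the example, and the ranges are copied from the NameStartChar and NameChar productions.

```python
NAME_START_RANGES = [
    (0x3A, 0x3A), (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A),
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF), (0x370, 0x37D),
    (0x37F, 0x1FFF), (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]
# Additional characters allowed after the first: "-", ".", digits, etc.
NAME_EXTRA_RANGES = [
    (0x2D, 0x2E), (0x30, 0x39), (0xB7, 0xB7), (0x300, 0x36F), (0x203F, 0x2040),
]


def in_ranges(ch, ranges):
    return any(low <= ord(ch) <= high for low, high in ranges)


def is_xml_name(name):
    """True if name matches the XML Name production."""
    return (
        name != ""
        and in_ranges(name[0], NAME_START_RANGES)
        and all(
            in_ranges(c, NAME_START_RANGES) or in_ranges(c, NAME_EXTRA_RANGES)
            for c in name[1:]
        )
    )


print(is_xml_name("\u00b5"), is_xml_name("\u03bc"))  # micro sign, Greek mu
```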
On the other hand, Unicode [ucc] has 30 character classes:
Name | Description | Number | Examples |
---|---|---|---|
Cc | Control | 65 | Ack, Bell, Backspace, Tab, LF, CR, etc |
Cf | Format | 151 | Soft hyphen, Arabic Number Sign, Zero-width space, left-to-right mark, invisible times, etc. |
Co | Private Use | #E000-#F8FF | |
Cs | Surrogate | #D800-#DFFF | |
Ll | Lowercase Letter | 2,063 | a, µ, ß, à, æ, ð, ñ, π, Latin, Greek, Coptic, Cyrillic, Armenian, Georgian, Cherokee, Glagolitic, many more |
Lm | Modifier Letter | 250 | letter or symbol typically written next to another letter that it modifies in some way. |
Lo | Other Letter | 121,047 | ª, º, ƻ, dental click, glottal stop, etc., and letters from languages that don't have cased letters, such as Hebrew, Arabic, Syriac, ... |
Lt | Titlecase Letter | 31 | Mostly ligatures that have to be treated specially when starting a word. |
Lu | Uppercase Letter | 1,702 | A, Á, etc |
Mc | Spacing Mark | 401 | Spacing combining marks, Devanagari, Bengali, etc. |
Me | Enclosing Mark | 13 | Combining enclosing characters such as "Enclosing circle" |
Mn | Nonspacing Mark | 1,763 | Combining marks, such as combining grave accent. |
Nd | Decimal Number | 590 | 0-9, in many languages, mathematical variants |
Nl | Letter Number | 236 | Ⅰ, Ⅱ, Ⅲ, Ⅳ,... |
No | Other Number | 676 | subscripts, superscripts, fractions, circled and bracketed numbers, many languages |
Pc | Connector Punctuation | 10 | |
Pd | Dash Punctuation | 24 | |
Pe | Close Punctuation | 73 | |
Pf | Final Punctuation | 10 | |
Pi | Initial Punctuation | 12 | |
Po | Other Punctuation | 566 | |
Ps | Open Punctuation | 75 | |
Sc | Currency Symbol | 54 | |
Sk | Modifier Symbol | 121 | |
Sm | Math Symbol | 948 | |
So | Other Symbol | 5,855 | |
Zl | Line Separator | 1 | Not cr, lf. |
Zp | Paragraph Separator | 1 | |
Zs | Space Separator | 17 | space, nbsp, en quad, em quad, thin space, etc. Not tab, cr, lf etc. |
The final decision was made to define ixml names using Unicode character classes, while keeping as close as possible to the spirit of what is allowed in XML:
name: namestart, namefollower*.
namestart: ["_"; Ll; Lu; Lm; Lt; Lo].
namefollower: namestart; ["-.·‿⁀"; Nd; Mn].
Consequently there are small differences in what a name is. For instance, Unicode classes the characters ª and º as letters, and classes both mu characters as letters, whereas XML does not accept all of these in names; as was said above, not all ixml names are serialised, so it is the responsibility of the ixml author to ensure that those that are serialised adhere to the XML rules.
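The ixml name rule translates almost word for word into Python using `unicodedata.category`. The function names below are assumptions made for this sketch, and the result for a given character depends on the Unicode version shipped with the Python interpreter.

```python
import unicodedata

LETTER_CLASSES = {"Ll", "Lu", "Lm", "Lt", "Lo"}


def is_namestart(ch):
    # namestart: ["_"; Ll; Lu; Lm; Lt; Lo].
    return ch == "_" or unicodedata.category(ch) in LETTER_CLASSES


def is_namefollower(ch):
    # namefollower: namestart; ["-.·‿⁀"; Nd; Mn].
    return (
        is_namestart(ch)
        or ch in "-.\u00b7\u203f\u2040"
        or unicodedata.category(ch) in {"Nd", "Mn"}
    )


def is_ixml_name(name):
    # name: namestart, namefollower*.
    return (
        name != ""
        and is_namestart(name[0])
        and all(is_namefollower(c) for c in name[1:])
    )


print(is_ixml_name("\u00b5"), is_ixml_name("\u03bc"), is_ixml_name(":a"))
```

Both mu characters are accepted here, while a name starting with a colon, which XML would allow, is not.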
One thing that the above grammar hasn't defined is where spaces may go.
These are allowed after any token, and before the very first token, so we have to update all the rules to indicate this. For instance:
ixml: S, rule+.
rule: name, S, ":", S, alternatives, ".", S.
alternatives: alternative+(";", S).
where S defines what a space is:
S: (whitespace; comment)*.
whitespace: [Zs; #9 {tab}; #a {lf}; #d {cr}].
comment: "{", (cchar; comment)*, "}".
cchar: ~["{}"].
Here we see the use of a character class, Zs, which contains all characters classified in Unicode as space characters; to this we have added tab, line feed, and carriage return (which are classified as control characters in Unicode).
Actually, Unicode is somewhat equivocal about whitespace. Along with the character class Zs it also has the character property WS ('WhiteSpace').
The Zs class contains the following 17 characters:
space, no-break space, ogham space mark, en quad, em quad, en space, em space, three-per-em space, four-per-em space, six-per-em space, figure space, punctuation space, thin space, hair space, narrow no-break space, medium mathematical space, ideographic space.
These all have the WS property, with the exception of no-break space, and narrow no-break space, which have the property CS (common separator, along with commas, colons, slashes and the like). There are also two characters that have the WS property, but are not in the Zs class, form-feed, which has class Cc (control character), and line separator, which is the sole member of class Zl.
For the record, line feed and carriage return are both in character class Cc (control character), with property 'Paragraph Separator'; the other characters that share this property are: #1C, #1D, #1E ('information separator' control characters, all in class Cc), #85 (next line, also in Cc), and #2029 (paragraph separator, the sole member of Zp). The tab character, also in Cc, has property 'Segment Separator'; other characters that have this property are #B (line tabulation) and #1F ('information separator'), both in class Cc.
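These classifications can be inspected from Python. The sketch below assumes that the property values named above (WS, CS, paragraph separator, segment separator) correspond to what `unicodedata.bidirectional` reports as WS, CS, B and S; `describe` is a hypothetical helper.

```python
import unicodedata


def describe(ch):
    """Return (general category, bidirectional class) for one character."""
    return unicodedata.category(ch), unicodedata.bidirectional(ch)


samples = {
    "space": " ",
    "no-break space": "\u00a0",
    "narrow no-break space": "\u202f",
    "tab": "\t",
    "line feed": "\n",
    "carriage return": "\r",
    "form feed": "\x0c",
    "line separator": "\u2028",
}
for label, ch in samples.items():
    print(f"#{ord(ch):X}", label, *describe(ch))
```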
Wherever whitespace is permitted in ixml, so is a comment. A comment may itself contain a comment, which enables commenting out sections of a grammar.
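Because comments nest, skipping them needs a depth counter rather than a search for the next closing brace. A minimal Python sketch of the S rule, with the assumed name `skip_space`:

```python
import unicodedata


def is_whitespace(ch):
    # whitespace: [Zs; #9 {tab}; #a {lf}; #d {cr}].
    return ch in "\t\n\r" or unicodedata.category(ch) == "Zs"


def skip_space(text, pos=0):
    """Return the index of the first character after whitespace and comments."""
    depth = 0                      # current comment nesting level
    while pos < len(text):
        ch = text[pos]
        if ch == "{":
            depth += 1             # a comment opens, possibly inside another
        elif depth > 0:
            if ch == "}":
                depth -= 1         # anything else inside a comment is skipped
        elif not is_whitespace(ch):
            break                  # first character that is not part of S
        pos += 1
    if depth:
        raise ValueError("unterminated comment")
    return pos


text = " {a {b} c} x"
print(text[skip_space(text):])
```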
An addition that has also been made is to allow = as well as : to delimit a rule name, and | as well as ; to delimit alternatives:
rule: name, S, ["=:"], S, alternatives, ".", S.
alternatives: alternative+([";|"], S).
Once a grammar has been defined, it is then used to parse input documents; a resulting parse is serialised as XML. It is not specified which parse algorithm should be used, as long as it accepts all context-free grammars, and produces at least one parse of any document that matches the grammar.
By default, a parse tree is serialised as XML elements: the parse tree is traversed in document order (depth first, left to right), and nonterminal nodes are output as XML elements, and terminals are just output.
For instance, for this small grammar for simple expressions:
expr: operand+operator.
operand: id; number.
id: letter+.
number: digit+.
letter: ["a"-"z"].
digit: ["0"-"9"].
operator: ["+-×÷"].
parsing the following string:
pi×10
would produce
<expr>
   <operand>
      <id>
         <letter>p</letter>
         <letter>i</letter>
      </id>
   </operand>
   <operator>×</operator>
   <operand>
      <number>
         <digit>1</digit>
         <digit>0</digit>
      </number>
   </operand>
</expr>
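The default serialisation is a plain depth-first walk of the parse tree. In the following Python sketch the tree is written out by hand as nested `(name, children)` tuples, a representation assumed for this example only; no parsing is done, and no indentation is added to the output.

```python
from xml.sax.saxutils import escape


def serialise(node):
    """Nonterminals become elements; terminals are just output."""
    if isinstance(node, str):
        return escape(node)
    name, children = node
    return f"<{name}>" + "".join(serialise(c) for c in children) + f"</{name}>"


# Parse tree for pi×10 under the expression grammar above.
tree = ("expr", [
    ("operand", [("id", [("letter", ["p"]), ("letter", ["i"])])]),
    ("operator", ["×"]),
    ("operand", [("number", [("digit", ["1"]), ("digit", ["0"])])]),
])
print(serialise(tree))
```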
To control serialisation, marks are added to grammars.
There are three options for serialising a nonterminal node, such as expr and operand above:
For serialising a terminal the only option is between serialising it and not.
There are two places where a nonterminal can be marked for serialising: at the definition of the rule for that nonterminal, which specifies the default way it is serialised, or at the use of the nonterminal, which overrides the marking used on the rule if any. A terminal can only be marked at the place of use.
There are three types of mark: "^" for full, "@" for attribute (which doesn't apply to terminals), and "-" for partial (which causes terminals not to be serialised).
To support this, we add to the grammar for rule and nonterminal:
rule: (mark, S)?, name, S, ":", S, alternatives, ".", S.
nonterminal: (mark, S)?, name, S.
mark: ["@^-"].
and similar for terminals.
The only unusual case for serialisation is for attribute children of a partially serialised node. In that case there is no element for the attributes to be put on, and so they are lifted to a higher node in the parse tree. For instance, with:
expr: operand+operator.
operand: id; number.
id: @name.
name: letter+.
number: @value.
value: digit+.
letter: ["a"-"z"].
digit: ["0"-"9"].
operator: ["+-×÷"].
the default serialisation for pi×10 would look like this:
<expr>
   <operand>
      <id name='pi'/>
   </operand>
   <operator>×</operator>
   <operand>
      <number value='10'/>
   </operand>
</expr>
However, if we changed the rules for id and number to partial serialisation:
-id: @name.
-number: @value.
so that neither produces an element, then the attributes are moved up, giving:
<expr>
   <operand name='pi'/>
   <operator>×</operator>
   <operand value='10'/>
</expr>
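The lifting of attributes can be sketched in Python as follows. The parse tree is again written by hand, now as `(mark, name, children)` tuples with the mark already resolved to "^", "@" or "-"; this representation and the function names are assumptions of the example, marked terminals are left out for brevity, and attribute values come out in double rather than single quotes.

```python
from xml.sax.saxutils import escape, quoteattr


def text_of(node):
    """All terminal characters below a node, used as an attribute value."""
    if isinstance(node, str):
        return node
    return "".join(text_of(c) for c in node[2])


def content(children):
    """Return (attributes, inner XML) contributed by a list of children."""
    attrs, inner = [], []
    for child in children:
        if isinstance(child, str):
            inner.append(escape(child))
            continue
        mark, name, kids = child
        if mark == "@":            # becomes an attribute of the nearest element
            attrs.append((name, text_of(child)))
        elif mark == "-":          # no element: pass attributes and content up
            lifted, xml = content(kids)
            attrs += lifted
            inner.append(xml)
        else:                      # "^": an ordinary element
            inner.append(element(name, kids))
    return attrs, "".join(inner)


def element(name, children):
    attrs, inner = content(children)
    head = name + "".join(f" {k}={quoteattr(v)}" for k, v in attrs)
    return f"<{head}>{inner}</{name}>" if inner else f"<{head}/>"


def operand(mark, kind, attr, text):
    letters = [("^", "char", [c]) for c in text]
    return ("^", "operand", [(mark, kind, [("@", attr, letters)])])


kids = [operand("-", "id", "name", "pi"), ("^", "operator", ["×"]),
        operand("-", "number", "value", "10")]
print(element("expr", kids))
```

Changing the "-" marks on id and number back to "^" yields the earlier form, with the attributes on id and number elements inside each operand.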
A grammar may be ambiguous, so that a given input may have more than one possible parse.
For instance, here is an ambiguous grammar for expressions:
expr: id; number; expr, operator, expr.
id: ["a"-"z"]+.
number: ["0"-"9"]+.
operator: "+"; "-"; "×"; "÷".
Given the string a÷b÷c, this could produce either of the following serialisations:
<expr>
   <expr>
      <id>a</id>
   </expr>
   <operator>÷</operator>
   <expr>
      <expr>
         <id>b</id>
      </expr>
      <operator>÷</operator>
      <expr>
         <id>c</id>
      </expr>
   </expr>
</expr>
or
<expr>
   <expr>
      <expr>
         <id>a</id>
      </expr>
      <operator>÷</operator>
      <expr>
         <id>b</id>
      </expr>
   </expr>
   <operator>÷</operator>
   <expr>
      <id>c</id>
   </expr>
</expr>
i.e. it could be interpreted as a÷(b÷c) or as (a÷b)÷c.
In the case of ambiguous parses, one of the parse trees is serialised (it is not specified which), but the root element is marked to indicate that it is ambiguous.
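How quickly ambiguity grows under the expression grammar above can be seen by counting parse trees. The Python sketch below assumes the input has already been split into tokens (identifiers, numbers and operators) and counts the ways the rule `expr: id; number; expr, operator, expr.` can derive them; the name `count_parses` is an assumption of the example.

```python
from functools import lru_cache

OPERATORS = set("+-×÷")


def count_parses(tokens):
    """Number of distinct parse trees for a token sequence."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def expr(i, j):
        """Ways in which tokens[i:j] can be derived from expr."""
        n = 0
        if j - i == 1 and tokens[i] not in OPERATORS:
            n += 1                                  # expr: id; number.
        for k in range(i + 1, j - 1):
            if tokens[k] in OPERATORS:              # expr: expr, operator, expr.
                n += expr(i, k) * expr(k + 1, j)
        return n

    return expr(0, len(tokens))


print(count_parses(["a", "÷", "b", "÷", "c"]))
```

For a÷b÷c the count is two, matching the two serialisations shown; each additional operand increases the number of possible trees further.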
There are other possible sources of ambiguity to look out for. For example, if we had defined rule in the grammar as:
rule: name, ":", alternatives, ".".
alternatives: alternative*";".
alternative: term*",".
then an empty rule such as:
empty: .
could be interpreted equally well as a rule with zero alternatives, or with one alternative with zero terms.
Similarly, if a grammar says that spaces may appear both before and after tokens:
id: S, letter+, S.
operator: S, ["+-×÷"], S.
then with an input such as a + b the first space could be interpreted as either following a, or preceding +.
By the way, this is why commas are needed between terms in an ixml alternative. Otherwise you wouldn't be able to see the difference between:
a: b+c.
and
a: b+, c.
IXML is itself an application of IXML, since the grammar is defined in its own format. That means that the grammar can be parsed with itself, and then serialised to XML. This has consequences for the design of the grammar.
The main choice was whether to use attributes at all, and if so where. The decision taken was to put all semantic terminals (such as names) in attributes, and otherwise to use elements.
As pointed out above, spaces were carefully placed to prevent ambiguous parses, but also placed in the grammar so that they didn't occur in attribute values.
So an example serialisation for the rule for rule is:
<rule name='rule'>:
   <alt>
      <option>
         <nonterminal name='mark'/>?</option>,
      <nonterminal name='name'/>,
      <nonterminal name='S'/>,
      <inclusion>[
         <literal dstring='=:'/>]</inclusion>,
      <nonterminal name='S'/>,
      <nonterminal mark='-' name='alts'/>,
      <literal dstring='.'/>,
      <nonterminal name='S'/>
   </alt>.</rule>
Although all terminal symbols are preserved in the serialisation, the only ones of import are in attribute values.
Since formally it is the XML serialisation that is used as input to the parser, and text nodes in the serialisation are ignored, it is only the serialisation that matters for the parser. This means that a different ixml grammar may be used, as long as the serialisation is the same. So if the grammar looks like this:
<expr> ::= <id> | <number> | <expr>, <operator>, <expr>.
that is fine, as long as it produces the same serialisation structure.
If you look at an ixml grammar in the right way, you can also see it as a type of schema for an XML format. Future work will look at the possibilities of using ixml to define XML formats. For instance, if we take the following XML Schema definition of the bind element in XForms 1.1:
<element name="bind">
   <complexType>
      <sequence minOccurs="0" maxOccurs="unbounded">
         <element ref="xforms:bind"/>
      </sequence>
      <attributeGroup ref="xforms:Common.Attributes"/>
      <attribute name="nodeset" type="xforms:XPathExpression" use="optional"/>
      <attribute name="calculate" type="xforms:XPathExpression" use="optional"/>
      <attribute name="type" type="QName" use="optional"/>
      <attribute name="required" type="xforms:XPathExpression" use="optional"/>
      <attribute name="constraint" type="xforms:XPathExpression" use="optional"/>
      <attribute name="relevant" type="xforms:XPathExpression" use="optional"/>
      <attribute name="readonly" type="xforms:XPathExpression" use="optional"/>
      <attribute name="p3ptype" type="xsd:string" use="optional"/>
   </complexType>
</element>
you could express this in ixml as follows:
bind: -Common, @nodeset?, -MIP*, bind*.
MIP: @calculate; @type; @required; @constraint; @relevant; @readonly; @p3ptype.
nodeset: xpath.
calculate: xpath.
type: QName.
constraint: xpath.
relevant: xpath.
readonly: xpath.
p3ptype: string.
The main hurdle is that a rule name must be unique in a grammar, whereas in XML attributes and elements with the same name may have different content models. For instance, there is also a bind attribute on other elements in XForms.
Another example is the input element in XForms:
<element name="input">
   <complexType>
      <sequence>
         <element ref="xforms:label"/>
         <group ref="xforms:UI.Common" minOccurs="0" maxOccurs="unbounded"/>
      </sequence>
      <attributeGroup ref="xforms:Common.Attributes"/>
      <attributeGroup ref="xforms:Single.Node.Binding.Attributes"/>
      <attribute name="inputmode" type="xsd:string" use="optional"/>
      <attributeGroup ref="xforms:UI.Common.Attrs"/>
      <attribute name="incremental" type="xsd:boolean" use="optional" default="false"/>
   </complexType>
</element>
<attributeGroup name="Single.Node.Binding.Attributes">
   <attribute name="model" type="xsd:IDREF" use="optional"/>
   <attribute name="ref" type="xforms:XPathExpression" use="optional"/>
   <attribute name="bind" type="xsd:IDREF" use="optional"/>
</attributeGroup>
which becomes:
input: -Common, -UICommonAtts, -Binding?, @inputmode?, @incremental?, label, UICommon*.
Binding: (@model?, @ref; @ref, @model); @bind.
model: IDREF.
bind: IDREF.
ref: xpath.
If you had such definitions, you could then even design 'compact' versions of XML formats, e.g. for XForms:
bind //@open type boolean
input age "How old are you?"
by altering the rules above to
bind: -"bind", -Common, @ref?, -MIP*, bind*.
MIP: @nodeset; @calculate; @type; @required; @constraint; @relevant; @readonly; @p3ptype.
nodeset: xpath.
calculate: -"calculate", xpath.
type: -"type", QName.
etc., and
input: -"input", -Common, -UICommonAtts, -Binding?, @inputmode?, @incremental?, label, UICommon*.
etc.
IXML opens a host of new non-XML documents to the XML process pipeline. By defining ixml in ixml, it becomes the first large application of ixml.