On the Specification of Invisible XML

Steven Pemberton, CWI, Amsterdam

Abstract

Invisible XML (ixml) is a method for treating non-XML documents as if they were XML.

After a number of design iterations, the language is ready for specification. This paper describes the production of the specification of ixml. An interesting aspect of this specification is that ixml is itself an application of ixml: the grammar describes itself, and therefore can be used to parse itself, and thus produce an XML representation of the grammar. We discuss the decisions taken to produce the most useful XML version possible.

Cite as: Steven Pemberton, On the Specification of Invisible XML, Proc. XML Prague 2019, Prague, Czech Republic, pp 413-430, ISBN 978-80-906259-6-9 (pdf), https://archive.xmlprague.cz/2019/files/xmlprague-2019-proceedings.pdf#page=425

Introduction
The Grammar
Serialisation and Marks
- Attribute lifting
Ambiguity
The IXML Serialisation
Future work
Conclusion
References

Introduction

Invisible XML (ixml) is a method for treating non-XML documents as if they were XML.

This gives a number of advantages:

it enables authors to write documents and data in a format they prefer,
provides XML for processes that are more effective with XML content,
opens up documents and data that otherwise are hard to import into XML environments.

IXML works by providing a description of the data or document of interest in the form of a (context-free) grammar. This grammar is then used to parse the document, and the resulting parse tree is then serialised as an XML document. The grammar includes extra information about how the tree should be serialised, allowing the eliding of unnecessary nodes, and the choice of serialising nodes as XML elements or attributes.

As a (necessarily small) example, take the definition of the email type from XForms 2.0 [xf], which is defined by a regular expression:

<xs:simpleType name="email">
  <xs:restriction base="xs:string">
    <xs:pattern value="([A-Za-z0-9!#-'\*\+\-/=\?\^_`\{-~]+)
                     (\.[A-Za-z0-9!#-'\*\+\-/=\?\^_`\{-~]+)*
                      @([A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?)
                     (\.[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?)+"/>
  </xs:restriction>
</xs:simpleType>

If we turn this into ixml, we get the following:

email:  user, "@", host. {An email address has two parts separated by an @ sign}
user:   atom+".".        {The user part is one or more atoms, separated by dots}
atom:   char+.           {An atom is a string of one or more 'char'}
host:   domain+".".      {A host is a series of domains, separated by dots}
domain: word+"-".        {A domain may contain a hyphen, but not start or end with one}
word:   letgit+.         {A domain otherwise consists of letters and digits}
-letgit: ["A"-"Z"; "a"-"z"; "0"-"9"].
-char:   letgit; ["!#$%&'*+-/=?^_`{|}~"]. {A char is a letter, digit, or punctuation.}

If we now feed into this the string

~my_mail+{nospam}$?@sub-domain.example.info

we will get the following XML serialisation:

<email>
   <user>
      <atom>~my_mail+{nospam}$?</atom>
   </user>@
   <host>
      <domain>
         <word>sub</word>-
         <word>domain</word>
      </domain>.
      <domain>
         <word>example</word>
      </domain>.
      <domain>
         <word>info</word>
      </domain>
   </host>
</email>

If the rule for letgit hadn't had a dash before it, then, for instance, the element <word>sub</word> would have looked like this:

<word><letgit>s</letgit><letgit>u</letgit><letgit>b</letgit></word>

Since the word part of a domain has no semantic meaning, we can exclude it from the serialisation by changing the rule for word into:

-word: letgit+.

to give:

<email>
   <user>
      <atom>~my_mail+{nospam}$?</atom>
   </user>@
   <host>
      <domain>sub-domain</domain>.
      <domain>example</domain>.
      <domain>info</domain>
   </host>
</email>

If we change the rules for atom and domain to

-atom: char+.
-domain: word+"-".

we get:

<email>
   <user>~my_mail+{nospam}$?</user>@
   <host>sub-domain.example.info</host>
</email>

Finally, changing the rules for user and host to:

@user: atom+".".
@host: domain+".".

gives:

<email user='~my_mail+{nospam}$?' host='sub-domain.example.info'>@</email>

To get rid of the left-over "@", we can change the rule for email to:

email: user, -"@", host.

to give:

<email user='~my_mail+{nospam}$?' host='sub-domain.example.info'/>

What we can see here is that the definition is much more structured than a regular expression, and that we have a lot of control over what emerges from the XML serialisation.
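
To make the contrast concrete, here is a minimal Python sketch (not part of the ixml work; the names CHARS, LABEL, EMAIL and to_xml are illustrative assumptions) that applies the XForms pattern with the standard re module. The regular expression only answers yes or no, so the split into user and host has to be coded by hand, whereas the ixml grammar delivers that structure as part of the parse.

```python
import re
import xml.etree.ElementTree as ET

# The XForms pattern with its layout whitespace removed and its groups made
# non-capturing. XML Schema patterns are implicitly anchored, hence the use
# of fullmatch() below.
CHARS = r"[A-Za-z0-9!#-'\*\+\-/=\?\^_`\{-~]+"
LABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?"
EMAIL = re.compile(CHARS + r"(?:\." + CHARS + r")*@" + LABEL + r"(?:\." + LABEL + r")+")


def to_xml(address):
    """Return an <email user=... host=.../> element, or None if invalid."""
    if EMAIL.fullmatch(address) is None:
        return None
    # "@" cannot occur in either part, so there is exactly one to split on.
    user, host = address.split("@")
    return ET.Element("email", {"user": user, "host": host})


element = to_xml("~my_mail+{nospam}$?@sub-domain.example.info")
print(ET.tostring(element, encoding="unicode"))
```

Every further refinement of the output (keeping the atoms, say, or the individual domains) would need more hand-written code here, while in ixml it is a matter of changing a mark.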

After a number of design iterations [ixml1, ixml2, ixml3, ixml4], the language is now ready for specification. A draft specification [ixml] has been produced. This paper describes the specification, and some of the decisions made.

The Grammar

The grammar is based on a format known as VWG [vwg].

A VWG grammar consists of a series of rules, where a rule consists of a name, a colon, and one or more 'alternatives' separated by semicolons, all terminated by a dot. An alternative consists of zero or more 'terms', separated by commas.

Expressing this in ixml would look like this:

ixml: rule+.
rule: name, ":", alternatives, ".".
alternatives: alternative+";".
alternative: term*",".

This introduces some of the extensions ixml has added to VWGs. Appending a star or plus to a term is used in many grammar systems and related notations, such as regular expressions, to express repetition of a term: zero or more for a star, and one or more for a plus. An ixml innovation is to make both postfix operators infix as well: the right-hand operand then defines a separator that comes between the repeated terms.

Note that these additions do not add any power to the language (they can all be represented by other means). However they do add ease of authoring, since the alternative way of writing them is verbose, and obscures the purpose of the extra rules that have to be added.
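
As an illustration of "other means", the following Python sketch rewrites each of the four repetition forms into plain rules that use only sequence and choice. The function desugar and the helper rule names it generates are hypothetical, chosen for this example; the draft specification may describe the equivalence differently.

```python
def desugar(op, factor, separator=None, name="list"):
    """Rewrite factor*, factor+, factor*sep or factor+sep as plain rules.

    Returns ixml rules, as strings, that use only sequence and choice.
    The rule names are hypothetical helpers, not taken from any specification.
    """
    if op not in ("*", "+"):
        raise ValueError("op must be '*' or '+'")
    if separator is None:
        more = f"{factor}, {name}"
    else:
        more = f"{factor}, {separator}, {name}"
    if op == "+":
        # One factor, or one factor followed (after any separator) by more.
        return [f"{name}: {factor}; {more}."]
    if separator is None:
        # Nothing at all, or one factor followed by more.
        return [f"{name}: ; {more}."]
    # Zero or more with a separator: nothing, or a non-empty separated list.
    inner = name + "1"
    return [f"{name}: ; {inner}.",
            f"{inner}: {factor}; {factor}, {separator}, {inner}."]


for rule in desugar("*", "term", '","', name="terms"):
    print(rule)
```

The single term*"," of the alternative rule thus expands to two recursive rules, which is the verbosity the infix operators avoid.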

A term is a factor, an optional factor, or a repeated factor, repeated zero or more, or one or more times:

term: factor; option; repeat0; repeat1.
option: factor, "?".
repeat0: factor, "*", separator?.
repeat1: factor, "+", separator?.
separator: factor.

These rules also demonstrate the use of a question mark to indicate an optional factor.

A factor is a terminal, a nonterminal, or a bracketed series of alternatives:

factor: terminal; nonterminal; "(", alternatives, ")".

Nonterminals and Terminals

A nonterminal is just a name, which refers to the rule defining that name:

nonterminal: name.

Terminals are the elements that actually match characters in the input; Unicode characters are used. There are two forms for terminals: literal strings, and character sets.

The simplest form is a literal string, as we have seen above, such as ":" and ",". Strings may be delimited by either double quotes or single: ":" and ':' are equivalent. If you want to include the delimiting quote in a string, it should be doubled: "don't" and 'don''t' are equivalent.

terminal: literal; charset.
literal: quoted; encoded.
quoted: '"', dchar+, '"';
        "'", schar+, "'".
dchar: ~['"']; '""'.
schar: ~["'"]; "''".

This introduces the exclusion: the construct ~['"'] matches any character except what is enclosed in the brackets, and is defined below.

In order to express characters with no explicit visible representation or with an ambiguous representation there is a representation for encoded characters. For instance #a0 represents a non-breaking space.

encoded: "#", hex+.
hex: ["0"-"9"; "a"-"f"; "A"-"F"].

Encoded characters do not appear within strings, but are free-standing; however, this doesn't limit expressiveness, since a rule like

end: "the", #a0, "end".

represents the seven characters with a non-breaking space in the middle.
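
The two literal forms can be sketched in a few lines of Python. The function name terminal_text is an assumption of this example, and the sketch does not check that every delimiter inside a string really is doubled.

```python
def terminal_text(token):
    """Return the characters matched by one ixml literal.

    Handles quoted strings (with the delimiter doubled to escape it)
    and encoded characters such as #a0.
    """
    if token.startswith("#"):
        return chr(int(token[1:], 16))
    quote = token[0]
    if quote not in "\"'" or len(token) < 3 or token[-1] != quote:
        raise ValueError(f"not a literal: {token!r}")
    return token[1:-1].replace(quote * 2, quote)


# The right-hand side of:  end: "the", #a0, "end".
end = "".join(terminal_text(t) for t in ['"the"', "#a0", '"end"'])
print(len(end))  # prints 7
```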

Character sets, which we have also seen in the earlier simple example, allow you to match a character from a set of characters, such as ["A"-"Z"; "a"-"z"; "0"-"9"]. An element of a character set can be a literal string, representing all the characters in the string, a range as above, or a character class. Unicode defines a number of classes, such as lower case letter and upper case letter, encoded with two-letter abbreviations, such as Ll and Lu; these abbreviations may be used, saving the work of having to define those classes yourself:

charset:   inclusion; exclusion.
inclusion: "[", element+";", "]".
exclusion: "~", inclusion.
element:   quoted; range; class.

range:  char, "-", char.
class:  letter, letter.
char:   '"', dchar, '"';
        "'", schar, "'";
        encoded.
letter: ["a"-"z"; "A"-"Z"].
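
A possible way to test a character against such a set is sketched below in Python, using the standard unicodedata module for the classes. The tagged-tuple representation of elements and the name in_charset are assumptions made for this example.

```python
import unicodedata


def in_charset(ch, elements, exclude=False):
    """Test one character against the elements of an ixml character set.

    Each element is a tagged tuple (a representation chosen for this sketch):
      ("string", "abc")    any character of the string
      ("range", "a", "z")  any character from a to z inclusive
      ("class", "Ll")      any character in that Unicode general category
    With exclude=True the set behaves like ~[...].
    """
    def hit(element):
        kind = element[0]
        if kind == "string":
            return ch in element[1]
        if kind == "range":
            return element[1] <= ch <= element[2]
        if kind == "class":
            return unicodedata.category(ch) == element[1]
        raise ValueError(f"unknown element kind: {kind}")

    found = any(hit(element) for element in elements)
    return not found if exclude else found


# ["A"-"Z"; "a"-"z"; "0"-"9"]
LETGIT = [("range", "A", "Z"), ("range", "a", "z"), ("range", "0", "9")]
print(in_charset("q", LETGIT), in_charset("-", LETGIT))
```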

What's in a Name?

The question arises: what should be allowed as a name in ixml?

One of the more opaque sections of the XML specification [xml] is which characters are permitted in a name, since it is specified almost entirely in terms of ranges over hexadecimal numbers, without further explanation. In an informal test of a handful of XML users, we found they were unable to confidently determine if certain element names were valid XML or not. Just to give you a taste of the problem, this character may not appear in an XML identifier: µ, while this one may: μ. They are both versions of the Greek letter mu, but one is at #B5 and the other at #3BC.

So should ixml continue this opacity, or is it possible to create something more transparent? Should ixml exactly duplicate the XML rules? Not all ixml names are serialised, so there is no a priori need for them to adhere to XML rules; furthermore, there may in the future be other serialisations; it could be left to the author to make sure that names adhered to the rules needed for the chosen serialisation. Would there be holes, and would it matter?

So what does XML allow? This is the XML rule:

NameStartChar   ::=
                ":" |
                [A-Z] | 
                "_" | 
                [a-z] | 
                [#xC0-#xD6] | 
                [#xD8-#xF6] | 
                [#xF8-#x2FF] | 
                [#x370-#x37D] | 
                [#x37F-#x1FFF] | 
                [#x200C-#x200D] | 
                [#x2070-#x218F] | 
                [#x2C00-#x2FEF] | 
                [#x3001-#xD7FF] | 
                [#xF900-#xFDCF] | 
                [#xFDF0-#xFFFD] | 
                [#x10000-#xEFFFF]
NameChar   ::=   
                NameStartChar | 
                "-" | 
                "." | 
               [0-9] | 
               #xB7 | 
               [#x0300-#x036F] | 
               [#x203F-#x2040]

So an XML name can start with a letter, a colon, an underscore, and "other stuff"; it can continue with the same characters, a hyphen, a dot, the digits, and some more other stuff.

In brief, the "other stuff" for a start character is anything between #C0 (which is À) and #EFFFF (which is not assigned, so really shouldn't be allowed) -- mostly representing a huge collection of letters from many languages. What is excluded is:

#D7 (×, the multiplication sign),
#F7 (÷, the division sign),
#300-#36F (combining characters, such as the combining grave accent),
#37E (the Greek question mark ";"),
#2000-#200B (various spaces, such as the en space),
#200E-#206F (various punctuation, including several hyphen characters),
#2190-#2BFF (a large number of symbol-like characters, such as arrows),
#2FF0-#3000 (ideographic description characters),
#D800-#F8FF (surrogates, and private use),
#FDD0-#FDEF (unassigned),
#FFFE and #FFFF (unassigned).

A name continuation character consists of the same characters plus, as already said, a hyphen, a dot, and the digits, and then #B7 (the middot "·"), the combining characters left out of the start characters, and then the two characters #203F and #2040, the undertie and character tie ‿⁀.

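On the other hand, Unicode [ucc] has 30 character classes. The 29 that contain assigned characters are listed here; the thirtieth, Cn, covers unassigned code points: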


Name	Description	Number	Examples
Cc	Control	65	Ack, Bell, Backspace, Tab, LF, CR, etc
Cf	Format	151	Soft hyphen, Arabic Number Sign, Zero-width space, left-to-right mark, invisible times, etc.
Co	Private Use		#E000-#F8FF
Cs	Surrogate		#D800-#DFFF
Ll	Lowercase Letter	2,063	a, µ, ß, à, æ, ð, ñ, π, Latin, Greek, Coptic, Cyrillic, Armenian, Georgian, Cherokee, Glagolitic, many more
Lm	Modifier Letter	250	letter or symbol typically written next to another letter that it modifies in some way.
Lo	Other Letter	121,047	ª, º, ƻ, dental click, glottal stop, etc., and letters from languages that don't have cased letters, such as Hebrew, Arabic, Syriac, ...
Lt	Titlecase Letter	31	Mostly ligatures that have to be treated specially when starting a word.
Lu	Uppercase Letter	1,702	A, Á, etc
Mc	Spacing Mark	401	Spacing combining marks, Devanagari, Bengali, etc.
Me	Enclosing Mark	13	Combining enclosing characters such as "Enclosing circle"
Mn	Nonspacing Mark	1,763	Combining marks, such as combining grave accent.
Nd	Decimal Number	590	0-9, in many languages, mathematical variants
Nl	Letter Number	236	Ⅰ, Ⅱ, Ⅲ, Ⅳ,...
No	Other Number	676	subscripts, superscripts, fractions, circled and bracketed numbers, many languages
Pc	Connector Punctuation	10
Pd	Dash Punctuation	24
Pe	Close Punctuation	73
Pf	Final Punctuation	10
Pi	Initial Punctuation	12
Po	Other Punctuation	566
Ps	Open Punctuation	75
Sc	Currency Symbol	54
Sk	Modifier Symbol	121
Sm	Math Symbol	948
So	Other Symbol	5,855
Zl	Line Separator	1	Not cr, lf.
Zp	Paragraph Separator	1
Zs	Space Separator	17	space, nbsp, en quad, em quad, thin space, etc. Not tab, cr, lf etc.

The final decision was made to define ixml names using Unicode character classes, while keeping as close as possible to the spirit of what is allowed in XML:

name: namestart, namefollower*.
namestart: ["_"; Ll; Lu; Lm; Lt; Lo].
namefollower: namestart; ["-.·‿⁀"; Nd; Mn].

Consequently there are small differences in what counts as a name. For instance, Unicode classes the characters ª and º as letters, and classes both mu characters as letters, whereas XML allows neither ª nor º in a name, and only one of the two mus. As was said above, not all ixml names are serialised, so it is the responsibility of the ixml author to ensure that those that are serialised adhere to the XML rules.
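
The differences can be explored with a short Python sketch that implements the three ixml rules with the standard unicodedata module, next to the XML NameStartChar ranges quoted earlier. All function names are assumptions of this example, and the results depend on the version of Unicode shipped with the Python in use.

```python
import unicodedata

LETTER_CLASSES = {"Ll", "Lu", "Lm", "Lt", "Lo"}


def is_namestart(ch):
    """namestart: ["_"; Ll; Lu; Lm; Lt; Lo]."""
    return ch == "_" or unicodedata.category(ch) in LETTER_CLASSES


def is_namefollower(ch):
    """namefollower: namestart; ["-.·‿⁀"; Nd; Mn]."""
    return (is_namestart(ch) or ch in "-.\u00b7\u203f\u2040"
            or unicodedata.category(ch) in {"Nd", "Mn"})


def is_ixml_name(text):
    """name: namestart, namefollower*."""
    return (text != "" and is_namestart(text[0])
            and all(is_namefollower(ch) for ch in text[1:]))


# NameStartChar ranges copied from the XML rule quoted earlier.
XML_START_RANGES = [
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF), (0x370, 0x37D),
    (0x37F, 0x1FFF), (0x200C, 0x200D), (0x2070, 0x218F), (0x2C00, 0x2FEF),
    (0x3001, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]


def is_xml_namestart(ch):
    return (ch in ":_" or "A" <= ch <= "Z" or "a" <= ch <= "z"
            or any(lo <= ord(ch) <= hi for lo, hi in XML_START_RANGES))


# The two mus (#B5 and #3BC) and the two ordinal indicators (#AA and #BA).
for ch in "\u00b5\u03bc\u00aa\u00ba":
    print(f"#{ord(ch):X}", is_namestart(ch), is_xml_namestart(ch))
```

An author who needs every name to survive serialisation could run both checks over the names in a grammar and flag those that pass only the first.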

Spaces and comments

One thing that the above grammar hasn't defined is where spaces may go.

These are allowed after any token, and before the very first token, so we have to update all the rules to indicate this. For instance:

ixml: S, rule+.
rule: name, S, ":", S, alternatives, ".", S.
alternatives: alternative+(";", S).

where S defines what a space is:

S: (whitespace; comment)*.
whitespace: [Zs; #9 {tab}; #a {lf}; #d {cr}].
comment: "{", (cchar; comment)*, "}".
cchar: ~["{}"].

Here we see the use of a character class, Zs, which contains all characters classified in Unicode as space characters; to this we have added tab, line feed, and carriage return (which are classified as control characters in Unicode).

Actually, Unicode is somewhat equivocal about whitespace. Along with the character class Zs it also has the character property WS ('WhiteSpace').

The Zs class contains the following 17 characters:

space, no-break space, ogham space mark, en quad, em quad, en space, em space, three-per-em space, four-per-em space, six-per-em space, figure space, punctuation space, thin space, hair space, narrow no-break space, medium mathematical space, ideographic space.

These all have the WS property, with the exception of no-break space and narrow no-break space, which have the property CS (common separator, along with commas, colons, slashes and the like). There are also two characters that have the WS property but are not in the Zs class: form feed, which has class Cc (control character), and line separator, which is the sole member of class Zl.

For the record, line feed and carriage return are both in character class Cc (control character), with property 'Paragraph Separator'; the other characters that share this property are: #1C, #1D, #1E ('information separator' control characters, all in class Cc), #85 (Next line, also in Cc), and #2029 (paragraph separator, the sole member of Zp, paragraph separator). The tab character, also in Cc, has property 'Segment Separator'; the other characters that have this property are #B (line tabulation) and #1F ('information separator'), both in class Cc.
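
These classifications can be inspected with Python's standard unicodedata module. In the sketch below, the properties WS, CS, 'Paragraph Separator' and 'Segment Separator' are assumed to correspond to the values WS, CS, B and S that Python reports as a character's bidirectional class; the name describe is illustrative, and the figures depend on the Unicode version bundled with the Python in use.

```python
import sys
import unicodedata


def describe(code_point):
    """Return (general category, bidirectional class) for a code point."""
    ch = chr(code_point)
    return unicodedata.category(ch), unicodedata.bidirectional(ch)


# All characters in class Zs, found by scanning every code point.
ZS = [cp for cp in range(sys.maxunicode + 1)
      if unicodedata.category(chr(cp)) == "Zs"]

# The Zs characters, followed by tab, lf, form feed, cr, and the line and
# paragraph separators.
for cp in ZS + [0x09, 0x0A, 0x0C, 0x0D, 0x2028, 0x2029]:
    print(f"#{cp:X}", *describe(cp))
```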

Wherever whitespace is permitted in ixml, so is a comment. A comment may itself contain a comment, which enables commenting out sections of a grammar.
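
Nesting means a comment cannot be skipped with a simple search for the next closing brace; a depth counter is needed. A minimal Python sketch of skipping S follows, with the name skip_space and the error handling being assumptions of this example.

```python
import unicodedata


def skip_space(text, pos=0):
    """Return the index of the first character at or after pos that is not
    part of S, i.e. neither whitespace nor inside a (possibly nested) comment.
    """
    depth = 0  # current nesting level of { } comments
    while pos < len(text):
        ch = text[pos]
        if ch == "{":
            depth += 1
        elif ch == "}":
            if depth == 0:
                break  # a stray "}" is not part of S
            depth -= 1
        elif depth == 0 and not (ch in "\t\n\r"
                                 or unicodedata.category(ch) == "Zs"):
            break
        pos += 1
    if depth > 0:
        raise ValueError("unterminated comment")
    return pos


text = " {a {nested} b} x"
print(text[skip_space(text):])  # prints x
```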

An addition that has also been made is to allow = as well as : to delimit a rule name, and | as well as ; to delimit alternatives:

rule: name, S, ["=:"], S, alternatives, ".", S.
alternatives: alternative+([";|"], S).

Serialisation and Marks

Once a grammar has been defined, it is then used to parse input documents; a resulting parse is serialised as XML. It is not specified which parse algorithm should be used, as long as it accepts all context-free grammars, and produces at least one parse of any document that matches the grammar.

By default, a parse tree is serialised as XML elements: the parse tree is traversed in document order (depth first, left to right), and nonterminal nodes are output as XML elements, and terminals are just output.

For instance, for this small grammar for simple expressions:

expr: operand+operator.
operand: id; number.
id: letter+.
number: digit+.
letter: ["a"-"z"].
digit: ["0"-"9"].
operator: ["+-×÷"].

parsing the following string:

pi×10

would produce

<expr>
   <operand>
      <id>
         <letter>p</letter>
         <letter>i</letter>
      </id>
   </operand>
   <operator>×</operator>
   <operand>
      <number>
         <digit>1</digit>
         <digit>0</digit>
      </number>
   </operand>
</expr>
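
The default traversal is short enough to sketch in Python. The representation of the parse tree as nested (name, children) tuples with plain strings for terminals, and the names serialise and chars, are assumptions of this example; the output is not indented.

```python
from xml.sax.saxutils import escape


def serialise(node):
    """Serialise a parse tree depth first, left to right.

    A nonterminal is a (name, children) tuple and a terminal is a plain
    string; this representation is an assumption made for the sketch.
    """
    if isinstance(node, str):
        return escape(node)  # terminals are just output
    name, children = node
    return f"<{name}>" + "".join(serialise(c) for c in children) + f"</{name}>"


def chars(name, text):
    """Helper: one nonterminal per character, as letter and digit produce."""
    return [(name, [ch]) for ch in text]


# The parse tree of pi×10 under the expression grammar.
tree = ("expr", [
    ("operand", [("id", chars("letter", "pi"))]),
    ("operator", ["×"]),
    ("operand", [("number", chars("digit", "10"))]),
])
print(serialise(tree))
```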

To control serialisation, marks are added to grammars.

There are three options for serialising a nonterminal node, such as expr and operand above:

1. Full serialisation as an element, as above, which is the default;
2. As an attribute, in which case all (serialised) terminal descendants of the node become the value of the attribute;
3. Partial serialisation, only serialising the children: essentially the same as option 1, but without the surrounding tag.

For serialising a terminal the only option is between serialising it and not.

There are two places where a nonterminal can be marked for serialising: at the definition of the rule for that nonterminal, which specifies the default way it is serialised, or at the use of the nonterminal, which overrides the marking used on the rule if any. A terminal can only be marked at the place of use.

There are three types of mark: "^" for full, "@" for attribute (which doesn't apply to terminals), and "-" for partial (which causes terminals not to be serialised).

To support this, we add to the grammar for rule, and nonterminal:

rule: (mark, S)?, name, S, ":", S, alternatives, ".", S.
nonterminal: (mark, S)?, name, S.
mark: ["@^-"].

and similarly for terminals.

Attribute lifting

The only unusual case for serialisation is for attribute children of a partially serialised node. In that case there is no element for the attributes to be put on, and so they are lifted to a higher node in the parse tree. For instance, with:

expr: operand+operator.
operand: id; number.
id: @name.
name: letter+.
number: @value.
value: digit+.
letter: ["a"-"z"].
digit: ["0"-"9"].
operator: ["+-×÷"].

the default serialisation for pi×10 would look like this:

<expr>
   <operand>
      <id name='pi'/>
   </operand>
   <operator>×</operator>
   <operand>
      <number value='10'/>
   </operand>
</expr>

However, if we changed the rules for id and number to partial serialisation:

-id: @name.
-number: @value.

so that neither produces an element, then the attributes are moved up, giving:

<expr>
   <operand name='pi'/>
   <operator>×</operator>
   <operand value='10'/>
</expr>
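
One way to picture the three marks and attribute lifting is a serialiser in which each node hands its parent a pair of attributes and content. The Python sketch below works under stated assumptions: nodes are (mark, name, children) tuples, terminals are plain strings, deleted terminals are simply absent from the tree, and the root is serialised as an element. It is an illustration, not the algorithm of the specification.

```python
from xml.sax.saxutils import escape


def text_of(node):
    """Concatenate the terminal descendants of a node."""
    if isinstance(node, str):
        return node
    return "".join(text_of(child) for child in node[2])


def attr(value):
    """Escape a value for use inside a single-quoted attribute."""
    return escape(value, {"'": "&apos;"})


def serialise(node):
    """Return the (attributes, content) a node contributes to its parent."""
    if isinstance(node, str):
        return {}, escape(node)
    mark, name, children = node
    if mark == "@":
        return {name: text_of(node)}, ""
    attributes, content = {}, ""
    for child in children:
        child_attributes, child_content = serialise(child)
        attributes.update(child_attributes)
        content += child_content
    if mark == "-":
        # No element of its own: the attributes are lifted to the nearest
        # ancestor that is serialised as an element.
        return attributes, content
    attrs = "".join(f" {key}='{attr(value)}'" for key, value in attributes.items())
    if content == "":
        return {}, f"<{name}{attrs}/>"
    return {}, f"<{name}{attrs}>{content}</{name}>"


def tree(mark):
    """Parse tree of pi×10, with the given mark on id and number."""
    letters = [("^", "letter", [ch]) for ch in "pi"]
    digits = [("^", "digit", [ch]) for ch in "10"]
    return ("^", "expr", [
        ("^", "operand", [(mark, "id", [("@", "name", letters)])]),
        ("^", "operator", ["×"]),
        ("^", "operand", [(mark, "number", [("@", "value", digits)])]),
    ])


print(serialise(tree("^"))[1])
print(serialise(tree("-"))[1])
```

The two printed lines correspond, without indentation, to the two serialisations shown above.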

Ambiguity

A grammar may be ambiguous, so that a given input may have more than one possible parse.

For instance, here is an ambiguous grammar for expressions:

expr: id; number; expr, operator, expr.
id: ["a"-"z"]+.
number: ["0"-"9"]+.
operator: "+"; "-"; "×"; "÷".

Given the string a÷b÷c, this could produce either of the following serialisations:

<expr>
   <expr>
      <id>a</id>
   </expr>
   <operator>÷</operator>
   <expr>
      <expr>
         <id>b</id>
      </expr>
      <operator>÷</operator>
      <expr>
         <id>c</id>
      </expr>
   </expr>
</expr>

<expr>
   <expr>
      <expr>
         <id>a</id>
      </expr>
      <operator>÷</operator>
      <expr>
         <id>b</id>
      </expr>
   </expr>
   <operator>÷</operator>
   <expr>
      <id>c</id>
   </expr>
</expr>

i.e. it could be interpreted as a÷(b÷c) or as (a÷b)÷c.

In the case of ambiguous parses, one of the parse trees is serialised (it is not specified which), but the root element is marked to indicate that it is ambiguous.
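
How quickly such ambiguity grows can be seen by counting parse trees. The Python sketch below counts the parses of a token sequence under the ambiguous expr rule; it assumes the input has already been split into single-character operands and operators, and the name count_parses is illustrative.

```python
from functools import lru_cache

OPERATORS = set("+-×÷")


def count_parses(tokens):
    """Count the parse trees of tokens under
         expr: id; number; expr, operator, expr.
    Tokens are assumed to be already split into operands and operators.
    """
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(start, end):
        # Number of ways tokens[start:end] can be read as one expr.
        if end - start == 1:
            return 0 if tokens[start] in OPERATORS else 1
        total = 0
        for mid in range(start + 1, end - 1):
            if tokens[mid] in OPERATORS:
                total += count(start, mid) * count(mid + 1, end)
        return total

    return count(0, len(tokens)) if tokens else 0


print(count_parses("a÷b÷c"))  # prints 2
```

With one more operator, as in a÷b÷c÷d, the count is already 5.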

There are other sources of ambiguity to look out for. For example, if we had defined rules in the grammar as:

rule: name, ":", alternatives, ".".
alternatives: alternative*";".
alternative: term*",".

then an empty rule such as:

empty: .

could be interpreted equally well as a rule with zero alternatives, or with one alternative with zero terms.

Similarly, if a grammar says that spaces could appear before or after tokens

id: S, letter+, S.
operator: S, ["+-×÷"], S.

then with an input such as a + b the first space could be interpreted as either following a, or preceding +.

By the way, this is why commas are needed between terms in an ixml alternative. Otherwise you wouldn't be able to see the difference between:

a: b+c.

and

a: b+, c.

The IXML Serialisation

IXML is itself an application of ixml, since the grammar is defined in its own format. That means that the grammar can be parsed with itself, and then serialised to XML. This has consequences for the design of the grammar.

The main choice was whether to use attributes at all, and if so where. The decision taken was to put all semantic terminals (such as names) in attributes, and otherwise to use elements.

As pointed out above, spaces were carefully placed to prevent ambiguous parses, but also placed in the grammar so that they didn't occur in attribute values.

So an example serialisation for the rule for rule is:

<rule name='rule'>:
   <alt>
      <option>
         <nonterminal name='mark'/>?</option>,
      <nonterminal name='name'/>,
      <nonterminal name='S'/>,
      <inclusion>[
         <literal dstring='=:'/>]</inclusion>,
      <nonterminal name='S'/>,
      <nonterminal mark='-' name='alts'/>,
      <literal dstring='.'/>,
      <nonterminal name='S'/>
   </alt>.</rule>

Although all terminal symbols are preserved in the serialisation, the only ones of import are in attribute values.

Since formally it is the XML serialisation that is used as input to the parser, and text nodes in the serialisation are ignored, it is only the serialisation that matters for the parser. This means that a different ixml grammar may be used, as long as the serialisation is the same. So if the grammar looks like this:

<expr> ::= <id> | <number> | <expr>, <operator>, <expr>.

that is fine, as long as it produces the same serialisation structure.
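
The point that only elements and attributes matter can be sketched with Python's standard ElementTree module: reading the serialisation of the rule for rule shown above and keeping tags, attributes and child order while dropping every text node. The function name structure and the tuple layout are assumptions of this example.

```python
import xml.etree.ElementTree as ET

SERIALISATION = """<rule name='rule'>:
   <alt>
      <option>
         <nonterminal name='mark'/>?</option>,
      <nonterminal name='name'/>,
      <nonterminal name='S'/>,
      <inclusion>[
         <literal dstring='=:'/>]</inclusion>,
      <nonterminal name='S'/>,
      <nonterminal mark='-' name='alts'/>,
      <literal dstring='.'/>,
      <nonterminal name='S'/>
   </alt>.</rule>"""


def structure(element):
    """Keep tags, attributes and child order; drop every text node."""
    return (element.tag, dict(element.attrib),
            [structure(child) for child in element])


rule = structure(ET.fromstring(SERIALISATION))
print(rule[0], rule[1])
```

Two serialisations that differ only in their text nodes, such as the punctuation left over from different surface syntaxes, yield the same structure.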

Future work

If you look at an ixml grammar in the right way, you can also see it as a type of schema for an XML format. Future work will look at the possibilities of using ixml to define XML formats. For instance, if we take the following XML Schema definition of the bind element in XForms 1.1:

<element name="bind">
   <complexType>
      <sequence minOccurs="0" maxOccurs="unbounded">
         <element ref="xforms:bind"/>
      </sequence>
      <attributeGroup ref="xforms:Common.Attributes"/>
      <attribute name="nodeset" type="xforms:XPathExpression" use="optional"/>
      <attribute name="calculate" type="xforms:XPathExpression" use="optional"/>
      <attribute name="type" type="QName" use="optional"/>
      <attribute name="required" type="xforms:XPathExpression" use="optional"/>
      <attribute name="constraint" type="xforms:XPathExpression" use="optional"/>
      <attribute name="relevant" type="xforms:XPathExpression" use="optional"/>
      <attribute name="readonly" type="xforms:XPathExpression" use="optional"/>
      <attribute name="p3ptype" type="xsd:string" use="optional"/>
   </complexType>
</element>

you could express this in ixml as follows:

bind: -Common, @nodeset?, -MIP*, bind*.
MIP: @calculate; @type; @required; @constraint; 
     @relevant; @readonly; @p3ptype.
nodeset: xpath.
calculate: xpath.
type: QName.
required: xpath.
constraint: xpath.
relevant: xpath.
readonly: xpath.
p3ptype: string.

The main hurdle is that a rule name must be unique in a grammar, whereas in XML attributes and elements with the same name may have different content models. For instance, there is also a bind attribute on other elements in XForms.

Another example is the input element in XForms:

<element name="input">
   <complexType>
      <sequence>
         <element ref="xforms:label"/>
         <group ref="xforms:UI.Common" minOccurs="0" maxOccurs="unbounded"/>
      </sequence>
      <attributeGroup ref="xforms:Common.Attributes"/>
      <attributeGroup ref="xforms:Single.Node.Binding.Attributes"/>
      <attribute name="inputmode" type="xsd:string" use="optional"/>
      <attributeGroup ref="xforms:UI.Common.Attrs"/>
      <attribute name="incremental" type="xsd:boolean" use="optional" default="false"/>
   </complexType>
</element>

<attributeGroup name="Single.Node.Binding.Attributes">
   <attribute name="model" type="xsd:IDREF" use="optional"/>
   <attribute name="ref" type="xforms:XPathExpression" use="optional"/>
   <attribute name="bind" type="xsd:IDREF" use="optional"/>
</attributeGroup>

which becomes:

input: -Common, -UICommonAtts, -Binding?, @inputmode?, @incremental?, label, UICommon*.
Binding: (@model?, @ref; @ref, @model); @bind.
model: IDREF.
bind: IDREF.
ref: xpath.

If you had such definitions, you could then even design 'compact' versions of XML formats, e.g. for XForms:

bind //@open type boolean
input age "How old are you?"

by altering the rules above to

bind: -"bind", -Common, @ref?, -MIP*, bind*.
MIP: @nodeset; @calculate; @type; @required; @constraint; @relevant; @readonly; @p3ptype.
nodeset: xpath.
calculate: -"calculate", xpath.
type: -"type", QName.

etc., and

input: -"input", -Common, -UICommonAtts, -Binding?, @inputmode?, @incremental?, label, UICommon*.

etc.

Conclusion

IXML opens a host of new non-XML documents to the XML process pipeline. By defining ixml in ixml, it becomes the first large application of ixml.

References

[ixml1] Steven Pemberton, Invisible XML, Proceedings of Balisage: The Markup Conference 2013. Balisage Series on Markup Technologies, vol. 10 (2013). doi:10.4242/BalisageVol10.Pemberton01. http://www.cwi.nl/~steven/Talks/2013/08-07-invisible-xml/invisible-xml-3.html
[ixml2] Steven Pemberton, Data Just Wants to Be Format-Neutral, Proc. XML Prague 2016, Prague, Czech Republic, pp 109-120. http://archive.xmlprague.cz/2016/files/xmlprague-2016-proceedings.pdf
[ixml3] Steven Pemberton, Parse Earley, Parse Often: How to Parse Anything to XML, Proc. XML London 2016, London, UK, pp 120-126. http://xmllondon.com/2016/xmllondon-2016-proceedings.pdf#page=120
[ixml4] Steven Pemberton, On the Descriptions of Data: The Usability of Notations, Proc. XML Prague 2017, Prague, Czech Republic, pp 143-159. https://archive.xmlprague.cz/2017/files/xmlprague-2017-proceedings.pdf#page=155
[ixml] Steven Pemberton, Invisible XML Specification (Draft), CWI, 2018, https://www.cwi.nl/~steven/ixml/ixml-specification.html
[xf] E. Bruchez et al. (eds.), XForms 2.0, W3C, https://www.w3.org/community/xformsusers/wiki/XForms_2.0.
[xml] Tim Bray, et al. (eds), XML 1.0 5th edition, W3C, https://www.w3.org/TR/2008/REC-xml-20081126/
[vwg] Steven Pemberton, Executable Semantic Definition of Programming Languages Using Two-level Grammars (Van Wijngaarden Grammars),
[ucc] Unicode Consortium, The Unicode Standard, Chapter 4: Character Properties, Unicode, Inc., June 2018, https://www.unicode.org/versions/Unicode11.0.0/ch03.pdf#G2212.