Invisible XML Specification (Draft)

Steven Pemberton, CWI, Amsterdam

Version: 2021-07-06

Status

This is the current state of the ixml base grammar; it is close to final.

Introduction

Data is an abstraction: there is no essential difference between the JSON

{"temperature": {"scale": "C", "value": 21}}

and an equivalent XML

<temperature scale="C" value="21"/>

or

<temperature>
   <scale>C</scale>
   <value>21</value>
</temperature>

since the underlying abstractions being represented are the same.
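As an illustration only, the following Python sketch serialises the same abstract data in both XML shapes shown above. The function names as_attributes and as_elements are hypothetical and are not part of ixml; the sketch assumes a JSON object with a single top-level key whose value is a flat object.

```python
import json
import xml.etree.ElementTree as ET

def as_attributes(data: dict) -> str:
    """Serialise {"name": {key: value, ...}} with each pair as an attribute."""
    name, fields = next(iter(data.items()))
    element = ET.Element(name, {key: str(value) for key, value in fields.items()})
    return ET.tostring(element, encoding="unicode")

def as_elements(data: dict) -> str:
    """Serialise the same data with each pair as a child element."""
    name, fields = next(iter(data.items()))
    element = ET.Element(name)
    for key, value in fields.items():
        ET.SubElement(element, key).text = str(value)
    return ET.tostring(element, encoding="unicode")

reading = json.loads('{"temperature": {"scale": "C", "value": 21}}')
print(as_attributes(reading))   # attribute form
print(as_elements(reading))     # element form
```

Both outputs carry exactly the same information; only the surface form differs.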

We choose which representations of our data to use, CSV, JSON, XML, or whatever, depending on habit, convenience, and the context we want to use that data in. On the other hand, having an interoperable generic toolchain such as that provided by XML to process data is of immense value. How do we resolve the conflicting requirements of convenience, habit, and context, and still enable a generic toolchain?

Invisible XML (ixml) is a method for treating non-XML documents as if they were XML, enabling authors to write documents and data in a format they prefer while providing XML for processes that are more effective with XML content. For example, it can turn CSS code like

body {color: blue; font-weight: bold}

into XML like

<css>
   <rule>
      <selector>body</selector>
      <block>
         <property>
            <name>color</name>
            <value>blue</value>
         </property>
         <property>
            <name>font-weight</name>
            <value>bold</value>
         </property>
      </block>
   </rule>
</css>

or, if preferred, as:

<css>
   <rule>
      <simple-selector name="body"/>
      <property name="color" value="blue"/>
      <property name="font-weight" value="bold"/>
   </rule>
</css>

As another example, the expression

pi×(10+b)

can result in the XML

<prod>
   <id>pi</id>
   <sum>
      <number>10</number>
      <id>b</id>
   </sum>
</prod>

or

<prod>
   <id name='pi'/>
   <sum>
      <number value='10'/>
      <id name='b'/>
   </sum>
</prod>

and the URL

http://www.w3.org/TR/1999/xhtml.html

can give

<url>
   <scheme name='http'/>
   <authority>
      <host>
         <sub name='www'/>
         <sub name='w3'/>
         <sub name='org'/>
      </host>
   </authority>
   <path>
      <seg sname='TR'/>
      <seg sname='1999'/>
      <seg sname='xhtml.html'/>
   </path>
</url>

or

<url scheme='http'>://
   <host>www.w3.org</host>
   <path>/TR/1999/xhtml.html</path>
</url>

The JSON value:

{"name": "pi", "value": 3.145926}

can give

<json>
   <object>
      <pair string='name'>
         <string>pi</string>
      </pair>
      <pair string='value'>
         <number>3.145926</number>
      </pair>
   </object>
</json>

How it works

A grammar is used to describe the input format. An input is parsed using this grammar, and the resulting parse tree is serialised as XML. Special marks in the grammar affect details of this serialisation, excluding parts of the tree, or serialising parts as attributes instead of elements.

As an example, consider this simplified grammar for URLs:

url: scheme, ":", authority, path.

scheme: letter+.

authority: "//", host.
host: sub+".".
sub: letter+.

path: ("/", seg)+.
seg: fletter*.
-letter: ["a"-"z"]; ["A"-"Z"]; ["0"-"9"].
-fletter: letter; ".".

This means that a URL consists of a scheme (whatever that is), followed by a colon, followed by an authority, and then a path. A scheme is one or more letters (whatever a letter is). An authority starts with two slashes, followed by a host. A host is one or more subs, separated by points. A sub is one or more letters. A path is a slash followed by a seg, repeated one or more times. A seg is zero or more fletters. A letter is a lowercase letter, an uppercase letter, or a digit. A fletter is a letter or a point.

So, given the input string http://www.w3.org/TR/1999/xhtml.html, this would produce the serialisation

<url>
   <scheme>http</scheme>:
   <authority>//
      <host>
         <sub>www</sub>.
         <sub>w3</sub>.
         <sub>org</sub>
      </host>
   </authority>
   <path>
      /<seg>TR</seg>
      /<seg>1999</seg>
      /<seg>xhtml.html</seg>
   </path>
</url>
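To make the mapping from grammar to serialisation concrete, here is a minimal Python sketch that hard-codes this one simplified URL grammar as a regular expression and emits the same serialisation, without the indentation. It is not a general ixml processor; the names URL and url_to_xml are assumptions made for this example.

```python
import re

LETTER = "[A-Za-z0-9]"   # the grammar's "letter" also covers digits
URL = re.compile(
    rf"({LETTER}+):"                      # scheme, ":"
    rf"(//)({LETTER}+(?:\.{LETTER}+)*)"   # authority: "//", host (subs joined by ".")
    rf"((?:/(?:{LETTER}|\.)*)+)"          # path: ("/", seg)+
)

def url_to_xml(text: str) -> str:
    """Serialise a URL as the unmarked simplified grammar would."""
    match = URL.fullmatch(text)
    if match is None:
        raise ValueError(f"input does not match the simplified grammar: {text!r}")
    scheme, slashes, host, path = match.groups()
    # Unmarked terminals (":", "//", ".", "/") stay in the output as text.
    subs = ".".join(f"<sub>{sub}</sub>" for sub in host.split("."))
    segs = "".join(f"/<seg>{seg}</seg>" for seg in path.split("/")[1:])
    return (f"<url><scheme>{scheme}</scheme>:"
            f"<authority>{slashes}<host>{subs}</host></authority>"
            f"<path>{segs}</path></url>")

print(url_to_xml("http://www.w3.org/TR/1999/xhtml.html"))
```

A regular expression suffices here only because this particular grammar is not recursive; grammars in general need a real parsing algorithm.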

If the rule for letter had not had a "-" before it, the serialisation for scheme, for instance, would have been:

<scheme><letter>h</letter><letter>t</letter><letter>t</letter><letter>p</letter></scheme>

Changing the rule for scheme to

scheme: name.
@name: letter+.

would change the serialisation for scheme to:

<scheme name="http"/>:

Changing the rule for scheme instead to:

@scheme: letter+.

would change the serialisation for url to:

<url scheme="http">

Changing the definitions of sub and seg from

sub: letter+.
seg: fletter*.

to

-sub: letter+.
-seg: fletter*.

would prevent the sub and seg elements appearing in the serialised result, giving:

<url scheme='http'>://
   <host>www.w3.org</host>
   <path>/TR/1999/xhtml.html</path>
</url>

The Grammar

Here we describe the format of the grammar used to describe documents. Note that it is in its own format, and therefore describes itself.

A grammar is a sequence of one or more rules, surrounded and separated by spacing and comments:

ixml: S, rule+S, S.

S stands for an optional sequence of spacing and comments. A comment is enclosed in braces, and may include nested comments, to enable commenting out parts of a grammar:

          -S: (whitespace; comment)*.
  -whitespace: -[Zs]; tab; lf; cr.
         -tab: -#9.
          -lf: -#a.
          -cr: -#d.
      comment: -"{", (cchar; comment)*, -"}".
       -cchar: ~["{}"].
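Because comments nest, they cannot be skipped by searching for the next closing brace; the nesting depth has to be tracked. A minimal Python sketch, in which the function name skip_comment is hypothetical:

```python
def skip_comment(text: str, start: int) -> int:
    """Return the index just past the comment that opens at text[start]."""
    if text[start] != "{":
        raise ValueError("no comment starts at this position")
    depth = 0
    for index in range(start, len(text)):
        if text[index] == "{":
            depth += 1            # entering a (possibly nested) comment
        elif text[index] == "}":
            depth -= 1            # leaving one level of comment
            if depth == 0:
                return index + 1
    raise ValueError("unterminated comment")

source = "{a {nested} b} rule: x."
print(source[skip_comment(source, 0):])
```

This is what makes it possible to comment out a stretch of grammar that already contains comments.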

Rules

A rule consists of an optional mark, a name, and one or more alternatives. The grammar here uses colons to define rules; an equals sign is also allowed.

rule: (mark, S)?, name, S, ["=:"], S, -alts, ".".

A mark is one of @, ^ or -, and indicates whether the item so marked will be serialised as an attribute (@), an element with its children (^), which is the default, or only its children (-).

@mark: ["@^-"].

A name starts with a letter or underscore, and continues with a letter, digit, underscore, a small number of punctuation characters, and the Unicode combiner characters; Unicode classes are used to define the sets of characters used, for instance, for letters and digits. This is close to, but not identical with, the XML definition of a name; it is the grammar author's responsibility to ensure that all serialised names match the requirements for an XML name [XML].

        @name: namestart, namefollower*.
   -namestart: ["_"; Ll; Lu; Lm; Lt; Lo].
-namefollower: namestart; ["-.·‿⁀"; Nd; Mn].
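The name rules translate directly into a check on Unicode general categories. The following Python sketch uses the standard unicodedata module; the helper names are assumptions made for this example, and the check covers ixml names only, not the stricter XML name requirements.

```python
import unicodedata

START_CLASSES = {"Ll", "Lu", "Lm", "Lt", "Lo"}   # letter classes from namestart
FOLLOWER_CLASSES = {"Nd", "Mn"}                  # digits and combining marks
FOLLOWER_PUNCTUATION = "-.·‿⁀"

def is_namestart(char: str) -> bool:
    return char == "_" or unicodedata.category(char) in START_CLASSES

def is_namefollower(char: str) -> bool:
    return (is_namestart(char)
            or char in FOLLOWER_PUNCTUATION
            or unicodedata.category(char) in FOLLOWER_CLASSES)

def is_name(text: str) -> bool:
    """True if text matches: namestart, namefollower*."""
    return (text != ""
            and is_namestart(text[0])
            and all(is_namefollower(char) for char in text[1:]))

print(is_name("font-weight"), is_name("1999"))
```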

Alternatives are separated by a semicolon or a vertical bar. The grammar here uses semicolons.

alts: alt+([";|"], S).

An alternative is zero or more terms, separated by commas:

alt: term*(",", S).

A term is a single factor, an optional factor, or a factor repeated either zero or more times or one or more times.

-term: factor;
       option;
       repeat0;
       repeat1.

A factor is a terminal, a nonterminal, or a bracketed series of alternatives:

-factor: terminal;
         nonterminal;
         "(", S, alts, ")", S.

A factor repeated zero or more times is followed by an asterisk, optionally followed by a separator, e.g. abc* and abc*",". For instance "a"*"#" would match the empty string, a a#a a#a#a etc.

repeat0: factor, "*", S, sep?.

Similarly, a factor repeated one or more times is followed by a plus, optionally followed by a separator, e.g. abc+ and abc+",". For instance "a"+"#" would match a a#a a#a#a etc., but not the empty string.

repeat1: factor, "+", S, sep?.

An optional factor is followed by a question mark, e.g. abc?. For instance "a"? would match a or the empty string.

option: factor, "?", S.

A separator may be any factor. E.g. abc*def or abc*(","; "."). For instance "a"+("#"; "!") would match a#a a!a a#a!a a!a#a a#a#a etc.

sep: factor.
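For readers more used to regular expressions, the repetition constructs with separators can be approximated as follows. This Python sketch assumes that the factor and the separator are themselves expressible as regular expression fragments, which is not true of ixml factors in general; the function names repeat0 and repeat1 simply echo the rule names above.

```python
import re

def repeat1(factor: str, sep: str = "") -> str:
    """Regex for factor+sep: one factor, then any number of (sep, factor)."""
    return f"(?:{factor}(?:{sep}{factor})*)"

def repeat0(factor: str, sep: str = "") -> str:
    """Regex for factor*sep: either nothing at all, or factor+sep."""
    return f"(?:{repeat1(factor, sep)})?"

plus = repeat1("a", "[#!]")        # corresponds to "a"+("#"; "!")
star = repeat0("a", "#")           # corresponds to "a"*"#"
for candidate in ["a", "a#a", "a!a#a", "", "a#"]:
    print(repr(candidate),
          re.fullmatch(plus, candidate) is not None,
          re.fullmatch(star, candidate) is not None)
```

Note that a separator appears only between two occurrences of the factor, never before the first or after the last.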

Nonterminals

A nonterminal is an optionally marked name:

nonterminal: (mark, S)?, name, S.

This name refers to the rule that defines this name, which must exist, and there must be only one such rule.

Terminals

A terminal is a literal or a set of characters. It matches one or more characters in the input. A terminal must not be marked as an attribute. Since a terminal has no children, if it is marked with "-", it will serialise to the empty string.

-terminal: literal; 
           charset.

A literal is either a quoted string, or a hexadecimally encoded character:

  literal: quoted;
           encoded.

A quoted string is an optionally marked string of one or more characters, enclosed with single or double quotes. The enclosing quote is represented in a string by doubling it. A quoted string matches only the exact same string in the input. Examples: "yes" 'yes'.

These two strings are identical: 'Isn''t it?' "Isn't it?"
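The doubling convention can be sketched in Python as follows. The function name string_value is hypothetical; it takes a quoted string as written in a grammar, including the enclosing quotes, and returns the string it stands for.

```python
def string_value(literal: str) -> str:
    """Return the characters denoted by a quoted ixml string."""
    if len(literal) < 3 or literal[0] not in ('"', "'") or literal[-1] != literal[0]:
        raise ValueError(f"not a quoted string of one or more characters: {literal!r}")
    quote = literal[0]
    body = literal[1:-1]
    # After removing every doubled quote, no enclosing-quote character may remain.
    if quote in body.replace(quote * 2, ""):
        raise ValueError(f"undoubled quote inside string: {literal!r}")
    return body.replace(quote * 2, quote)

print(string_value("'Isn''t it?'"))
print(string_value('"Isn\'t it?"'))
```

Only the enclosing quote character needs doubling; the other quote character may appear freely.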

 -quoted: (tmark, S)?, -string.
  @tmark: ["^-"].
  string: -'"', dstring, -'"', S;
          -"'", sstring, -"'", S.
@dstring: dchar+.
@sstring: schar+.
   dchar: ~['"'];
          '"', -'"'. {all characters, quotes are doubled}
   schar: ~["'"];
          "'", -"'". {all characters, quotes are doubled}

An encoded character is an optionally marked hexadecimal number. It represents a single character and matches that character in the input. It starts with a hash symbol, followed by any number of hexadecimal digits, for example #a0. The digits are interpreted as a number in hexadecimal, and the character at that Unicode code-point is used. The number must be within the Unicode code-point range.

-encoded: (tmark, S)?, -"#", @hex, S.
     hex: ["0"-"9"; "a"-"f"; "A"-"F"]+.
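Decoding an encoded character amounts to reading the digits as a hexadecimal number and checking the range. A minimal Python sketch, with the hypothetical function name encoded_character:

```python
import string

MAX_CODE_POINT = 0x10FFFF   # upper end of the Unicode code-point range

def encoded_character(text: str) -> str:
    """Return the character denoted by an encoded character such as #a0."""
    digits = text[1:]
    if not text.startswith("#") or digits == "":
        raise ValueError(f"not an encoded character: {text!r}")
    if any(digit not in string.hexdigits for digit in digits):
        raise ValueError(f"not a hexadecimal number: {digits!r}")
    code_point = int(digits, 16)
    if code_point > MAX_CODE_POINT:
        raise ValueError(f"outside the Unicode code-point range: {text!r}")
    return chr(code_point)

print(repr(encoded_character("#9")), repr(encoded_character("#a0")))
```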

Character sets

A character set is an inclusion or an exclusion: an inclusion matches one character in the input that is in the set, an exclusion matches one character not in the set.

An inclusion is enclosed in square brackets, and represents the set of characters defined by any combination of literal characters, a range of characters, hex encoded characters, or Unicode classes. Examples ["a"-"z"] ["xyz"] [Lc] ["0"-"9"; "!@#"; Lc]. Note that ["abc"] ["a"; "b"; "c"] ["a"-"c"] [#61-#63] all represent the same set of characters.

An exclusion is an inclusion preceded by a tilde ~. For example, ~["{}"] matches any character that is not an opening or closing brace.

Note that the empty inclusion [] would fail to match any character in the input; on the other hand ~[] would match any one character, whatever it is.

 -charset: inclusion; 
           exclusion.
inclusion: (tmark, S)?,         set.
exclusion: (tmark, S)?, "~", S, set.
     -set: "[", S,  member*([";|"], S), "]", S.
  -member: literal;
           range;
           class.

A range matches any character in the range from the start character to the end, inclusive, using the Unicode ordering:

range: from, "-", S, to.
@from: character.
  @to: character.

A character is a string of length one, or a hex encoded character:

-character: -'"', dchar, -'"', S;
            -"'", schar, -"'", S;
            "#", hex, S.

A class is two letters, representing any character from the Unicode character category of that name. E.g. [Ll] matches any lower-case letter, [Ll; Lu] matches any upper- or lower-case character; it is an error if there is no such class.

  class: @code, S.
   code: letter, letter.
-letter: ["a"-"z"; "A"-"Z"].
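Membership in a character set can be sketched in Python as below. The representation of members as tuples and the function names in_set and matches are assumptions made for this example; Unicode classes are looked up with the standard unicodedata module, and checking that a class name actually exists is left out of the sketch.

```python
import unicodedata

def in_set(char: str, members: list) -> bool:
    """True if char belongs to any member of the set."""
    for member in members:
        kind = member[0]
        if kind == "literal" and char in member[1]:
            return True
        if kind == "range" and member[1] <= char <= member[2]:   # code-point order
            return True
        if kind == "class" and unicodedata.category(char) == member[1]:
            return True
    return False

def matches(char: str, members: list, exclusion: bool = False) -> bool:
    """An inclusion matches members of the set; an exclusion matches the rest."""
    return in_set(char, members) != exclusion

# ["abc"], ["a"; "b"; "c"], ["a"-"c"] and [#61-#63] describe the same set.
same = [
    [("literal", "abc")],
    [("literal", "a"), ("literal", "b"), ("literal", "c")],
    [("range", "a", "c")],
    [("range", chr(0x61), chr(0x63))],
]
print([matches("b", members) for members in same])
print(matches("x", []), matches("x", [], exclusion=True))
```

The last line reflects the note above: [] matches nothing, while ~[] matches any one character.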

Complete

         ixml: S, rule+S, S.

           -S: (whitespace; comment)*.
  -whitespace: -[Zs]; tab; lf; cr.
         -tab: -#9.
          -lf: -#a.
          -cr: -#d.
      comment: -"{", (cchar; comment)*, -"}".
       -cchar: ~["{}"].

         rule: (mark, S)?, name, S, ["=:"], S, -alts, ".".
        @mark: ["@^-"].
         alts: alt+([";|"], S).
          alt: term*(",", S).
        -term: factor;
               option;
               repeat0;
               repeat1.
      -factor: terminal;
               nonterminal;
               "(", S, alts, ")", S.
      repeat0: factor, "*", S, sep?.
      repeat1: factor, "+", S, sep?.
       option: factor, "?", S.
          sep: factor.
  nonterminal: (mark, S)?, name, S.

    -terminal: literal; 
               charset.
      literal: quoted;
               encoded.
      -quoted: (tmark, S)?, -string.

        @name: namestart, namefollower*.
   -namestart: ["_"; Ll; Lu; Lm; Lt; Lo].
-namefollower: namestart; ["-.·‿⁀"; Nd; Mn].

       @tmark: ["^-"].
       string: -'"', dstring, -'"', S;
               -"'", sstring, -"'", S.
     @dstring: dchar+.
     @sstring: schar+.
        dchar: ~['"'];
               '"', -'"'. {all characters, quotes must be doubled}
        schar: ~["'"];
               "'", -"'". {all characters, quotes must be doubled}
     -encoded: (tmark, S)?, -"#", @hex, S.
          hex: ["0"-"9"; "a"-"f"; "A"-"F"]+.

     -charset: inclusion; 
               exclusion.
    inclusion: (tmark, S)?,         set.
    exclusion: (tmark, S)?, "~", S, set.
         -set: "[", S,  member*([";|"], S), "]", S.
      -member: literal;
               range;
               class.
        range: from, "-", S, to.
        @from: character.
          @to: character.
   -character: -'"', dchar, -'"', S;
               -"'", schar, -"'", S;
               "#", hex, S.
        class: @code, S.
         code: letter, letter.
      -letter: ["a"-"z"; "A"-"Z"].

Parsing

The root symbol of the grammar is the name of the first rule in the grammar. If it is marked as hidden, all of its productions must produce exactly one non-hidden nonterminal and no non-hidden terminals before or after that nonterminal (in order to match the XML requirement of a single-rooted document).

Processors must accept and parse any conforming grammar, and produce at least one parse of any input that conforms to the grammar starting at the root symbol. If more than one parse results, one is chosen; it is not defined how this choice is made, but the resulting parse must be marked as ambiguous by including the attribute ixml:state="ambiguous" on the document element of the serialisation. The ixml namespace URI is "http://invisiblexml.org/NS".
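As a sketch of how a processor written in Python might add this marking, assuming it builds its output with the standard xml.etree.ElementTree module (the function name mark_state is hypothetical):

```python
import xml.etree.ElementTree as ET

IXML_NS = "http://invisiblexml.org/NS"
ET.register_namespace("ixml", IXML_NS)   # serialise the namespace with the ixml prefix

def mark_state(root: ET.Element, state: str) -> ET.Element:
    """Set ixml:state on the document element and return that element."""
    root.set(f"{{{IXML_NS}}}state", state)
    return root

document = mark_state(ET.Element("expr"), "ambiguous")
print(ET.tostring(document, encoding="unicode"))
```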

Serialisation

If the parse fails, some XML document must be produced with ixml:state="failed" on the document element. The document should provide helpful information about where and why it failed; it may be a partial parse tree that includes parts of the parse that succeeded.

If the parse succeeds, the resulting parse-tree is serialised as XML by serialising the root node. If the root node is marked as an attribute, that marking is ignored.

A parse node is either a nonterminal or a terminal.

Any parse node may be marked as an element, as an attribute, or as hidden (children only). The mark comes from the use of the node in a rule if present, otherwise, for a nonterminal, from the definition of the rule for that nonterminal. If a node is not marked, it is treated as if marked as an element. It is an error if the name of a node to be output does not match the requirements of an XML name.

An example grammar that illustrates these rules is

expr: open, -arith, @close.
@open: "(".
close: ")".
arith: left, op, right.
left: @name.
right: -name.
@name: "a"; "b".
-op: sign.
@sign: "+".

Applied to the string (a+b), it yields the serialisation

<expr open="(" sign="+" close=")">
   <left name="a"/>
   <right>b</right>
</expr>
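The effect of the marks in this example can be reproduced with a small Python sketch. It assumes that the parse tree has already been built and that each node carries its effective mark (from the use of the node if present, otherwise from the rule definition). The Node class and function names are hypothetical, hidden terminals are assumed to have been dropped from the tree, and the output is not indented.

```python
from dataclasses import dataclass, field
from xml.sax.saxutils import escape, quoteattr

@dataclass
class Node:
    name: str
    mark: str                                      # "^" element, "@" attribute, "-" hidden
    children: list = field(default_factory=list)   # Node, or str for terminal text

def text_of(node) -> str:
    """Concatenate all terminal text below a node (used for attribute values)."""
    if isinstance(node, str):
        return node
    return "".join(text_of(child) for child in node.children)

def collect(node: Node, attrs: dict, content: list) -> None:
    """Distribute the children of node over the nearest enclosing element."""
    for child in node.children:
        if isinstance(child, str):
            content.append(escape(child))
        elif child.mark == "@":
            attrs[child.name] = text_of(child)
        elif child.mark == "-":
            collect(child, attrs, content)         # hidden: only its children appear
        else:
            content.append(serialise(child))

def serialise(node: Node) -> str:
    attrs, content = {}, []
    collect(node, attrs, content)
    attr_text = "".join(f" {name}={quoteattr(value)}" for name, value in attrs.items())
    if not content:
        return f"<{node.name}{attr_text}/>"
    return f"<{node.name}{attr_text}>{''.join(content)}</{node.name}>"

tree = Node("expr", "^", [
    Node("open", "@", ["("]),
    Node("arith", "-", [
        Node("left", "^", [Node("name", "@", ["a"])]),
        Node("op", "-", [Node("sign", "@", ["+"])]),
        Node("right", "^", [Node("name", "-", ["b"])]),
    ]),
    Node("close", "@", [")"]),
])
print(serialise(tree))
```

Note how the sign attribute ends up on expr: both op and arith are hidden, so expr is the nearest enclosing element.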

Hints for Implementors

Many parsing algorithms only mention terminals and nonterminals, and don't explain how to deal with the repetition constructs used in ixml. However, these can be handled simply by converting them to equivalent simple constructs. In the examples below, f and sep are factors from the grammar above. The other nonterminals are generated nonterminals.

Optional factor:

f? ⇒ f-option
-f-option: f; .

Zero or more repetitions:

f* ⇒ f-star
-f-star: f, f-star; .

One or more repetitions:

f+ ⇒ f-plus
-f-plus: f, f-star.
-f-star: f, f-star; .

One or more repetitions with separator:

f+sep ⇒ f-plus-sep
-f-plus-sep: f, sep-part-option. 
-sep-part-option: sep, f-plus-sep; .

Zero or more repetitions with separator:

f*sep ⇒ f-star-sep
-f-star-sep: f-plus-sep; .
-f-plus-sep: f, sep-part-option.
-sep-part-option: sep, f-plus-sep; .
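These rewritings are mechanical, as the following Python sketch suggests. It assumes, for simplicity, that f and sep are plain names that can be pasted into the generated rule names; the function name desugar is hypothetical. A real implementation would also have to guarantee that generated names are unique and do not clash with names in the grammar.

```python
from typing import List, Optional, Tuple

def desugar(f: str, operator: str, sep: Optional[str] = None) -> Tuple[str, List[str]]:
    """Return the generated nonterminal and the rules that define it."""
    star = f"-{f}-star: {f}, {f}-star; ."
    if sep is not None:
        plus_sep = [
            f"-{f}-plus-{sep}: {f}, {sep}-part-option.",
            f"-{sep}-part-option: {sep}, {f}-plus-{sep}; .",
        ]
    if operator == "?" and sep is None:
        return f"{f}-option", [f"-{f}-option: {f}; ."]
    if operator == "*" and sep is None:
        return f"{f}-star", [star]
    if operator == "+" and sep is None:
        return f"{f}-plus", [f"-{f}-plus: {f}, {f}-star.", star]
    if operator == "+":
        return f"{f}-plus-{sep}", plus_sep
    if operator == "*":
        return f"{f}-star-{sep}", [f"-{f}-star-{sep}: {f}-plus-{sep}; ."] + plus_sep
    raise ValueError(f"unsupported construct: {f}{operator}{sep or ''}")

name, rules = desugar("f", "*", "sep")
print(name)
print("\n".join(rules))
```

All generated rules are marked with "-", so the rewriting leaves the serialisation unchanged.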

IXML in IXML

Since the ixml grammar is expressed in its own notation, the above grammar can be processed into an XML document by parsing it using itself, and then serialising. Note that all semantically significant terminals are recorded in attributes. The serialisation begins as shown below (the entire serialisation is available separately):

<ixml>
   <rule name='ixml'>:
      <alt>
         <nonterminal name='S'/>,
         <repeat1>
            <nonterminal name='rule'/>+</repeat1>
      </alt>.</rule>
   <rule mark='-' name='S'>:
      <alt>
         <repeat0>(
            <alts>
               <alt>
                  <nonterminal name='whitespace'/>
               </alt>;
               <alt>
                  <nonterminal name='comment'/>
               </alt>
            </alts>)*</repeat0>
      </alt>.</rule>
   <rule mark='-' name='whitespace'>:
      <alt>
         <inclusion tmark='-'>[
            <class>Zs</class>;
            <literal hex='9'>
               <comment>tab</comment>
            </literal>;
            <literal hex='a'>
               <comment>lf</comment>
            </literal>;
            <literal hex='d'>
               <comment>cr</comment>
            </literal>]</inclusion>
      </alt>.</rule>
   <rule name='comment'>:
      <alt>
         <literal tmark='-' dstring='{'/>,
         <repeat0>(
            <alts>
               <alt>
                  <nonterminal name='cchar'/>
               </alt>;
               <alt>
                  <nonterminal name='comment'/>
               </alt>
            </alts>)*</repeat0>,
         <literal tmark='-' dstring='}'/>
      </alt>.</rule>
   <rule mark='-' name='cchar'>:
      <alt>
         <exclusion>~[
            <literal dstring='{}'/>]</exclusion>
      </alt>.</rule>
   <rule name='rule'>:
      <alt>
         <option>
            <nonterminal name='mark'/>?</option>,
         <nonterminal name='name'/>,
         <nonterminal name='S'/>,
         <inclusion>[
            <literal dstring='=:'/>]</inclusion>,
         <nonterminal name='S'/>,
         <nonterminal mark='-' name='alts'/>,
         <literal dstring='.'/>,
         <nonterminal name='S'/>
      </alt>.</rule>

Conformance

In this specification, the verb "must" expresses unconditional requirements for conformance to the specification; the verb "should" expresses requirements that are encouraged but which are not conditions of conformance; the verb "may" expresses optional features which are neither required nor prohibited.

Conformance to this specification can meaningfully be claimed for grammars and for processors.

Note: although input described by a grammar is sometimes described as "obeying" or "conforming to" the grammar, conformance to this specification cannot be claimed of input streams or of input + grammar pairs.

Conformance of grammars

An ixml grammar in ixml form conforms to this specification if

An ixml grammar in XML form conforms to this specification if

Note: The normative formulations of conformance requirements are those given elsewhere in this specification. But for convenience the requirements that go beyond what is expressed in the grammar itself can be summarized as follows. Reasonable effort has been used to make this list complete, but omission of any conformance requirement from this list does not affect its status as a conformance requirement.

Conformance of processors

A processor conforms to this specification if it accepts grammars in ixml {or XML?} form and uses those grammars to parse input and produce XML documents representing parse trees as specified elsewhere in this specification. A conforming processor must not accept non-conforming grammars.

In addition to requirements mentioned elsewhere in this specification, the following also apply to conforming processors:

References

The Unicode Consortium (ed.), The Unicode Standard — Version 13.0. Unicode Consortium, 2020, ISBN 978-1-936213-26-9, http://www.unicode.org/versions/Unicode13.0.0/

ibid. Chapter 4, Unicode Character Properties https://www.unicode.org/versions/Unicode13.0.0/ch04.pdf

General Category Values https://unicode.org/reports/tr44/#General_Category_Values (See also http://www.fileformat.info/info/unicode/category/index.htm)

Tim Bray et al. (eds.), Extensible Markup Language (XML) 1.0 (Fifth Edition), W3C, 2008, https://www.w3.org/TR/xml/

Acknowledgments

Thanks are due to Michael Sperberg-McQueen and Hans-Dieter Hiep for their close reading of the specification and for their many helpful comments.