Invisible XML Specification (Draft)

Steven Pemberton, CWI, Amsterdam

Version: 2018-10-11

Status

This is the current state of the ixml base grammar; it is not final. There are comments about potential changes.

Introduction

Data is an abstraction: there is no essential difference between the JSON

{"temperature": {"scale": "C", "value": 21}}

and an equivalent XML

<temperature scale="C" value="21"/>

or

<temperature>
   <scale>C</scale>
   <value>21</value>
</temperature>

since the underlying abstractions being represented are the same.

We choose which representations of our data to use, JSON, CSV, XML, or whatever, depending on habit, convenience, and the context we want to use that data in. On the other hand, having an interoperable generic toolchain such as that provided by XML to process data is of immense value. How do we resolve the conflicting requirements of convenience, habit, and context, and still enable a generic toolchain?

Invisible XML (ixml) is a method for treating non-XML documents as if they were XML, enabling authors to write documents and data in a format they prefer while providing XML for processes that are more effective with XML content. For example, it can turn CSS code like

body {color: blue; font-weight: bold}

into XML like

<css>
   <rule>
      <simple-selector name="body"/>
      <block>
         <property>
            <name>color</name>
            <value>blue</value>
         </property>
         <property>
            <name>font-weight</name>
            <value>bold</value>
         </property>
      </block>
   </rule>
</css>

or, if preferred, as:

<css>
   <rule>
      <selector>body</selector>
      <block>
         <property name="color" value="blue"/>
         <property name="font-weight" value="bold"/>
      </block>
   </rule>
</css>

As another example, the expression

pi×(10+b)

can result in the XML

<prod>
   <id>pi</id>
   <sum>
      <number>10</number>
      <id>b</id>
   </sum>
</prod>

or

<prod>
   <id name='pi'/>
   <sum>
      <number value='10'/>
      <id name='b'/>
   </sum>
</prod>

and the URL

http://www.w3.org/TR/1999/xhtml.html

can give

<url>
   <scheme name='http'/>
   <authority>
      <host>
         <sub name='www'/>
         <sub name='w3'/>
         <sub name='org'/>
      </host>
   </authority>
   <path>
      <seg sname='TR'/>
      <seg sname='1999'/>
      <seg sname='xhtml.html'/>
   </path>
</url>

or

<url scheme='http'>://
   <host>www.w3.org</host>
   <path>/TR/1999/xhtml.html</path>
</url>

The JSON value:

{"name": "pi", "value": 3.1415926}

can give

<json>
   <object>
      <pair string='name'>
         <string>pi</string>
      </pair>
      <pair string='value'>
         <number>3.1415926</number>
      </pair>
   </object>
</json>

How it works

A grammar is used to describe the input format. An input is parsed with this grammar, and the resulting parse tree is serialised as XML. Special marks in the grammar affect details of this serialisation, excluding parts of the tree, or serialising parts as attributes instead of elements.

As an example, consider this simplified grammar for URLs:

url: scheme, ":", authority, path.

scheme: letter+.

authority: "//", host.
host: sub+".".
sub: letter+.

path: ("/", seg)+.
seg: fletter*.
-letter: ["a"-"z"]; ["A"-"Z"]; ["0"-"9"].
-fletter: letter; ".".

This means that a URL consists of a scheme (whatever that is), followed by a colon, followed by an authority, and then a path. A scheme is one or more letters (whatever a letter is). An authority starts with two slashes, followed by a host. A host is one or more subs, separated by points. A path is a slash followed by a seg, repeated one or more times. A seg is zero or more fletters. A letter is a lowercase letter, an uppercase letter, or a digit. An fletter is a letter or a point.

So, given the input string http://www.w3.org/TR/1999/xhtml.html, this would produce the serialisation

<url>
   <scheme>http</scheme>:
   <authority>//
      <host>
         <sub>www</sub>.
         <sub>w3</sub>.
         <sub>org</sub>
      </host>
   </authority>
   <path>
      /<seg>TR</seg>
      /<seg>1999</seg>
      /<seg>xhtml.html</seg>
   </path>
</url>

If the rule for letter had not had a "-" before it, the serialisation for scheme, for instance, would have been:

<scheme><letter>h</letter><letter>t</letter><letter>t</letter><letter>p</letter></scheme>

Changing the rule for scheme to

scheme: name.
@name: letter+.

would change the serialisation for scheme to:

<scheme name="http"/>:

Changing the rule for scheme instead to:

@scheme: letter+.

would change the serialisation for url to:

<url scheme="http">

Changing the definitions of sub and seg from

sub: letter+.
seg: fletter*.

to

-sub: letter+.
-seg: fletter*.

would prevent the sub and seg elements appearing in the serialised result, giving:

<url scheme='http'>://
   <host>www.w3.org</host>
   <path>/TR/1999/xhtml.html</path>
</url>
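
As a rough illustration of the first serialisation in this section (the one produced before any marks are changed), the following Python sketch parses a URL and emits the same elements without the indentation. It is not a conforming ixml processor: it hard-codes the simplified URL grammar as a regular expression, which works only because this particular grammar happens to be regular. The function name parse_url is an assumption made for the example.

```python
import re

# Character classes standing in for the hidden rules -letter and -fletter.
LETTER = r"[a-zA-Z0-9]"
FLETTER = r"[a-zA-Z0-9.]"

URL = re.compile(
    rf"(?P<scheme>{LETTER}+):"                   # url: scheme, ":"
    rf"//(?P<host>{LETTER}+(?:\.{LETTER}+)*)"    # authority: "//", host; host: sub+"."
    rf"(?P<path>(?:/{FLETTER}*)+)"               # path: ("/", seg)+; seg: fletter*
)

def parse_url(text):
    """Return the unindented serialisation of text, or None if it does not match."""
    m = URL.fullmatch(text)
    if m is None:
        return None
    # Unmarked literals (":", "//", ".", "/") stay in the output as text.
    subs = ".".join(f"<sub>{s}</sub>" for s in m["host"].split("."))
    segs = "".join(f"/<seg>{s}</seg>" for s in m["path"].split("/")[1:])
    return (f"<url><scheme>{m['scheme']}</scheme>:"
            f"<authority>//<host>{subs}</host></authority>"
            f"<path>{segs}</path></url>")

print(parse_url("http://www.w3.org/TR/1999/xhtml.html"))
```

No escaping is needed here because the grammar admits only letters, digits and points as element content.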

The Grammar

Here we describe the format of the grammar used to describe documents. Note that it is in its own format, and therefore describes itself.

A grammar is a sequence of one or more rules.

ixml: S, rule+.

S stands for an optional sequence of spacing and comments. A comment is enclosed in braces, and may include nested comments, to enable commenting out parts of a grammar:

         -S: (whitespace; comment)*.
-whitespace: -[Zs]; tab; lf; cr.
       -tab: -#9.
        -lf: -#a.
        -cr: -#d.
    comment: -"{", (cchar; comment)*, -"}".
     -cchar: ~["{}"].
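
One way to picture S is as a scanner that skips spacing and balanced comments. The Python sketch below is such a scanner, written by hand rather than derived from the grammar. The function name skip_s is hypothetical, and raising an error on an unterminated comment is a choice made for this example, not something the grammar prescribes.

```python
import unicodedata

def skip_s(text, pos=0):
    """Return the index just past any whitespace and {comments} starting at pos."""
    while pos < len(text):
        ch = text[pos]
        if ch in "\t\n\r" or unicodedata.category(ch) == "Zs":
            pos += 1                      # whitespace: [Zs]; tab; lf; cr
        elif ch == "{":
            start, depth = pos, 0
            while pos < len(text):        # comments nest, so count the braces
                if text[pos] == "{":
                    depth += 1
                elif text[pos] == "}":
                    depth -= 1
                pos += 1
                if depth == 0:
                    break
            if depth != 0:
                raise ValueError(f"unterminated comment starting at offset {start}")
        else:
            break
    return pos

source = "  {a {nested} comment} rule"
print(source[skip_s(source):])
```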

Rules

A rule consists of an optional mark, a name, and a number of alternatives. The grammar here uses colons to define rules; an equals sign is also allowed.

rule: mark?, name, ["=:"], S, -alts, ".", S.

A mark is one of @, ^ or -, and indicates whether the item so marked will be serialised as an attribute (@), as an element with its children (^, the default), or as only its children (-).

@mark: ["@^-"], S.

Alternatives are separated by a semicolon or a vertical bar. Semicolons have been used everywhere here:

alts: alt+([";|"], S).

An alternative is zero or more terms, separated by commas:

alt: term*(",", S).

A term is a single factor, an optional factor, or a factor repeated either zero or more times or one or more times.

-term: factor;
       option;
       repeat0;
       repeat1.

A factor is a terminal, a nonterminal, or a bracketed series of alternatives:

-factor: terminal;
         nonterminal;
         "(", S, alts, ")", S.

A factor repeated zero or more times is followed by an asterisk, optionally followed by a separator, e.g. abc* and abc*",". For instance "a"*"#" would match the empty string, a, a#a, a#a#a, etc.

repeat0: factor, "*", S, sep?.

Similarly, a factor repeated one or more times is followed by a plus, optionally followed by a separator, e.g. abc+ and abc+",". For instance "a"+"#" would match a, a#a, a#a#a, etc., but not the empty string.

repeat1: factor, "+", S, sep?.

An optional factor is followed by a question mark, e.g. abc?. For instance "a"? would match a or the empty string.

option: factor, "?", S.

A separator may be any factor, e.g. abc*def or abc*(","; "."). For instance "a"+("#"; "!") would match a#a, a!a, a#a!a, a!a#a, a#a#a, etc.

sep: factor.
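
The matching behaviour of the repetition constructs can be checked with a small Python sketch. It handles only the case where the factor and the separators are plain literal strings, which is enough for the "a"*"#" and "a"+("#"; "!") examples above; the function name matches_repeat and its parameters are assumptions made for this illustration.

```python
def matches_repeat(text, item, seps=(), allow_empty=True):
    """Test text against item*sep (allow_empty=True) or item+sep (allow_empty=False).

    item and the entries of seps are literal strings; seps=() means no separator.
    """
    if not item:
        raise ValueError("item must be a non-empty literal")
    if text == "":
        return allow_empty
    pos = 0
    while True:
        if not text.startswith(item, pos):
            return False
        pos += len(item)
        if pos == len(text):
            return True
        if seps:
            # A separator must be followed by another item, so keep looping.
            for sep in seps:
                if text.startswith(sep, pos):
                    pos += len(sep)
                    break
            else:
                return False

print(matches_repeat("a#a!a", "a", ("#", "!"), allow_empty=False))
```

A trailing separator, as in a#, is rejected, because a separator only ever appears between two occurrences of the factor.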

Nonterminals

A nonterminal is an optionally marked name:

nonterminal: mark?, name.

A name starts with a letter or digit, and continues with a letter, digit or hyphen (this will be extended in the near future, for instance by using Unicode character classes Ll, Lc, and Nd). It is the grammar author's responsibility to ensure that all serialised names match the requirements for an XML name.

   @name: letgit, xletter*, S.
 -letgit: letter; digit.
 -letter: ["a"-"z"; "A"-"Z"].
  -digit: ["0"-"9"].
-xletter: letgit; "-".
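
Because a name may start with a digit here but an XML name may not, a grammar author may want to check names before relying on them in a serialisation. The Python sketch below shows one such check. Both function names are hypothetical, and the XML test covers only the ASCII names that the name rule above currently allows.

```python
import re

# name: letgit, xletter*  (ASCII letters and digits, then letters, digits or hyphens)
IXML_NAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9-]*")

def is_ixml_name(name):
    """True if name matches the name rule of this draft."""
    return IXML_NAME.fullmatch(name) is not None

def is_safe_xml_name(name):
    """True if name is an ixml name that can also serve as an XML name."""
    # Within this ASCII subset, a leading digit is the only thing XML forbids.
    return is_ixml_name(name) and not name[0].isdigit()

for candidate in ("f-plus", "1999", "-x"):
    print(candidate, is_ixml_name(candidate), is_safe_xml_name(candidate))
```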

Terminals

A terminal is a literal or a set of characters:

-terminal: literal; 
           charset.

A literal is either a quoted string, or a hexadecimally encoded character:

-literal: quoted;
          encoded.

A quoted string is an optionally marked string of one or more characters, enclosed with single or double quotes. The enclosing quote is represented in a string by doubling it. A quoted string must be exactly matched in the input. Examples: "yes" 'yes'.

These two strings are identical: 'Isn''t it?' "Isn't it?"
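
The quote-doubling convention can be illustrated with a short Python sketch that recovers the text a quoted string denotes. The function name decode_quoted is an assumption made for the example, and any mark or trailing spacing is assumed to have been removed already.

```python
def decode_quoted(literal):
    """Return the text denoted by a quoted string such as 'Isn''t it?'."""
    quote = literal[:1]
    if quote not in ('"', "'") or len(literal) < 3 or literal[-1] != quote:
        raise ValueError("not a quoted string of one or more characters")
    body = literal[1:-1]
    # Every occurrence of the enclosing quote inside the body must be doubled.
    if quote in body.replace(quote * 2, ""):
        raise ValueError("undoubled quote inside string")
    return body.replace(quote * 2, quote)

print(decode_quoted("'Isn''t it?'") == decode_quoted('"Isn\'t it?"'))
```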

  quoted: tmark?, -string.

A terminal may not be marked as an attribute. Since a terminal has no children, if it is marked with "-", it will serialise to the empty string.

@tmark: ["^-"], S.
  string: -'"', dstring, -'"', S;
          -"'", sstring, -"'", S.
@dstring: dchar+.
@sstring: schar+.
   dchar: ~['"'];
          '""'. {all characters, quotes must be doubled}
   schar: ~["'"];
          "''". {all characters, quotes must be doubled}

An encoded character is an optionally marked hexadecimal number. It represents a single character and must be matched exactly in the input. It starts with a hash symbol, followed by any number of hexadecimal digits, for example #a0. The digits are interpreted as a number in hexadecimal, and the character at that Unicode code point is used.

encoded: tmark?, "#", @hex.
    hex: ["0"-"9"; "a"-"f"; "A"-"F"]+, S.

Character sets

A character set is an inclusion or an exclusion.

An inclusion is enclosed in square brackets, and also matches a single character in the input. It represents a set of characters, defined by any combination of literal characters, ranges of characters, hex encoded characters, or Unicode classes. Examples: ["a"-"z"], ["xyz"], [Lc], ["0"-"9"; "!@#"; Lc]. Note that ["abc"], ["a"; "b"; "c"], ["a"-"c"] and [#61-#63] all represent the same set of characters.

An exclusion matches one character not in the set. E.g. ~["{}"] matches any character that is not an opening or closing brace.

 -charset: inclusion; 
           exclusion.
inclusion: tmark?, "[", S,  element+([";|"], S), "]", S.
exclusion: tmark?, "~", S, -inclusion.
 -element: literal;
           range;
           class.

A range matches any character in the range from the start character to the end, inclusive, using the Unicode ordering:

range: from, "-", S, to.
@from: character.
  @to: character.

A character is a string of length one, or a hex encoded character:

-character: -'"', dchar, -'"', S;
            -"'", schar, -"'", S;
            hex.

A class is two letters, representing any character from the Unicode character category of that name. E.g. [Ll] matches any lower-case letter, and [Ll; Lu] matches any upper- or lower-case letter. It is an error if there is no such class.

class: letter, letter, S.
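
Membership in a character set can be sketched in Python as follows. The representation of a set as separate lists of literal strings, ranges and class names is an assumption made for this example, as are the function names. Class lookup relies on unicodedata.category, which reports only the two-letter general categories such as Ll, Lu and Nd.

```python
import unicodedata

def encoded(hex_digits):
    """Character denoted by an encoded literal such as #a0 (digits only, no hash)."""
    return chr(int(hex_digits, 16))

def in_charset(ch, members=(), ranges=(), classes=(), exclude=False):
    """Test one character against an inclusion, or an exclusion if exclude=True."""
    hit = (
        any(ch in m for m in members)                  # literals such as "xyz"
        or any(lo <= ch <= hi for lo, hi in ranges)    # ranges, in code point order
        or unicodedata.category(ch) in classes         # classes such as Ll
    )
    return hit != exclude

# ["abc"], ["a"; "b"; "c"], ["a"-"c"] and [#61-#63] accept the same characters.
print(all(
    in_charset(c, members=["abc"])
    == in_charset(c, members=["a", "b", "c"])
    == in_charset(c, ranges=[("a", "c")])
    == in_charset(c, ranges=[(encoded("61"), encoded("63"))])
    for c in "abcdZ "
))
```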

Complete

ixml: S, rule+.
         -S: (whitespace; comment)*.
-whitespace: -[Zs]; tab; lf; cr.
       -tab: -#9.
        -lf: -#a.
        -cr: -#d.
    comment: -"{", (cchar; comment)*, -"}".
     -cchar: ~["{}"].
rule: mark?, name, ["=:"], S, -alts, ".", S.
@mark: ["@^-"], S.
alts: alt+([";|"], S).
alt: term*(",", S).
-term: factor;
       option;
       repeat0;
       repeat1.
-factor: terminal;
         nonterminal;
         "(", S, alts, ")", S.
repeat0: factor, "*", S, sep?.
repeat1: factor, "+", S, sep?.
option: factor, "?", S.
sep: factor.
nonterminal: mark?, name.
   @name: letgit, xletter*, S.
 -letgit: letter; digit.
 -letter: ["a"-"z"; "A"-"Z"].
  -digit: ["0"-"9"].
-xletter: letgit; "-".
-terminal: literal; 
           charset.
-literal: quoted;
          encoded.
  quoted: tmark?, -string.
  @tmark: ["^-"], S.
  string: -'"', dstring, -'"', S;
          -"'", sstring, -"'", S.
@dstring: dchar+.
@sstring: schar+.
   dchar: ~['"'];
          '""'. {all characters, quotes must be doubled}
   schar: ~["'"];
          "''". {all characters, quotes must be doubled}
encoded: tmark?, "#", @hex.
    hex: ["0"-"9"; "a"-"f"; "A"-"F"]+, S.
 -charset: inclusion; 
           exclusion.
inclusion: tmark?, "[", S,  element+([";|"], S), "]", S.
exclusion: tmark?, "~", S, -inclusion.
 -element: literal;
           range;
           class.
range: from, "-", S, to.
@from: character.
  @to: character.
-character: -'"', dchar, -'"', S;
            -"'", schar, -"'", S;
            hex.
class: letter, letter, S.

Parsing

The root symbol of the grammar is the name of the first rule in the grammar. If it is marked as hidden, its production must produce exactly one non-hidden nonterminal and no non-hidden terminals before or after that nonterminal (in order to match the XML requirement of a single-rooted document).

The input must be parsed by an algorithm that accepts and parses any context-free grammar, and produces at least one parse of any input that conforms to the grammar starting at the root symbol. If more than one parse results, one is chosen; it is not defined how this choice is made, but the resulting parse must be marked as ambiguous by including the attribute ixml:state="ambiguous" on the document element of the serialisation. The ixml namespace URI is "{tbd}".

Serialisation

If the parse succeeds, the resulting parse-tree is serialised as XML by serialising the root node. If the root node is marked as an attribute, that marking is ignored.

A parse node is either a nonterminal or a terminal.

Any parse node may be marked as an element, as an attribute, or as hidden (children only). The mark comes from the use of the node in a rule if present, otherwise, for a nonterminal, from the definition of the rule for that nonterminal. If a node is not marked, it is treated as if marked as an element.

An example grammar that illustrates these rules is

expr: open, -arith, @close.
@open: "(".
close: ")".
arith: left, op, right.
left: @name.
right: -name.
@name: "a"; "b".
-op: sign.
@sign: "+".

Applied to the string (a+b), it yields the serialisation

<expr open="(" sign="+" close=")">
   <left name="a"/>
   <right>b</right>
</expr>
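
The effect of the marks in this example can be reproduced with a short Python sketch. The parse-tree representation is an assumption made for the illustration: a node is a tuple (name, mark, children) whose mark has already been resolved from the rule and its use, and a terminal is a plain string. The sketch also assumes that an attribute found under hidden nodes is attached to the nearest enclosing element, as the sign attribute above suggests.

```python
from xml.sax.saxutils import escape, quoteattr

def text_of(node):
    """Concatenate the terminal text below a node."""
    if isinstance(node, str):
        return node
    return "".join(text_of(child) for child in node[2])

def collect(children, attrs, content):
    """Sort children into attributes and content of the enclosing element."""
    for child in children:
        if isinstance(child, str):
            content.append(escape(child))
            continue
        name, mark, kids = child
        if mark == "@":
            attrs.append((name, text_of(child)))   # attribute on the enclosing element
        elif mark == "-":
            collect(kids, attrs, content)          # hidden: only its children appear
        else:
            content.append(serialise(child))       # "^": element with its children

def serialise(node):
    """Serialise a node as an element; a mark on the root itself is ignored."""
    name, _mark, kids = node
    attrs, content = [], []
    collect(kids, attrs, content)
    attr_text = "".join(f" {k}={quoteattr(v)}" for k, v in attrs)
    if not content:
        return f"<{name}{attr_text}/>"
    return f"<{name}{attr_text}>{''.join(content)}</{name}>"

# Parse tree of "(a+b)" under the example grammar.
tree = ("expr", "^", [
    ("open", "@", ["("]),
    ("arith", "-", [
        ("left", "^", [("name", "@", ["a"])]),
        ("op", "-", [("sign", "@", ["+"])]),
        ("right", "^", [("name", "-", ["b"])]),
    ]),
    ("close", "@", [")"]),
])
print(serialise(tree))
```

The output carries the same attributes and elements as the serialisation shown above, without the indentation.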

If the parse fails, some XML document is produced with ixml:state="failed" on the document element. The document should provide helpful information about where and why it failed; it may be a partial parse tree that includes parts of the parse that succeeded.

Hints for Implementors

Many parsing algorithms only mention terminals and nonterminals, and don't explain how to deal with the repetition constructs used in ixml. However, these can be handled simply by converting them to equivalent simple constructs. In the examples below, f and sep are factors from the grammar above. The other nonterminals are generated nonterminals.

Optional factor:

f? ⇒ f-option
-f-option: f; .

Zero or more repetitions:

f* ⇒ f-star
-f-star: f, f-star; .

One or more repetitions:

f+ ⇒ f-plus
-f-plus: f, f-plus-option.
-f-plus-option: f-plus; .

One or more repetitions with separator:

f+sep ⇒ f-plus-sep
-f-plus-sep: f, sep-part-option. 
-sep-part-option: sep, f-plus-sep; .

Zero or more repetitions with separator:

f*sep ⇒ f-star-sep
-f-star-sep: f-plus-sep; .
-f-plus-sep: f, sep-part-option.
-sep-part-option: sep, f-plus-sep; .
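
The conversions above are mechanical, as the following Python sketch suggests. It assumes that f and sep are already nonterminal names; a bracketed or terminal factor would first need a generated name of its own. The function name desugar is hypothetical, and the generated names for the separator cases include both f and sep so that different combinations do not clash, a naming choice made for this example.

```python
def desugar(f, op, sep=None):
    """Return (replacement name, generated rules) for f?, f*, f+, f*sep or f+sep."""
    if op == "?":
        name = f"{f}-option"
        return name, [f"-{name}: {f}; ."]
    if op not in ("*", "+"):
        raise ValueError(f"unknown repetition operator: {op!r}")
    if sep is None:
        if op == "*":
            name = f"{f}-star"
            return name, [f"-{name}: {f}, {name}; ."]
        name = f"{f}-plus"
        return name, [f"-{name}: {f}, {name}-option.",
                      f"-{name}-option: {name}; ."]
    # Both separator forms share the "one or more" rules.
    plus = f"{f}-plus-{sep}"
    rules = [f"-{plus}: {f}, {plus}-option.",
             f"-{plus}-option: {sep}, {plus}; ."]
    if op == "+":
        return plus, rules
    star = f"{f}-star-{sep}"
    return star, [f"-{star}: {plus}; ."] + rules

name, rules = desugar("sub", "+", "dot")
print(name)
print("\n".join(rules))
```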

IXML in IXML

Since the ixml grammar is expressed in its own notation, the above grammar can be processed into an XML document by parsing it using itself, and then serialising. Note that all semantically significant terminals are recorded in attributes. The serialisation begins like this:

<ixml>
   <rule name='ixml'>: 
      <alt>
         <nonterminal name='S'/>, 
         <repeat1>
            <nonterminal name='rule'/>+</repeat1>
      </alt>.</rule>
   <rule mark='-' name='S'>: 
      <alt>
         <repeat0>(
            <alts>
               <alt>
                  <nonterminal name='whitespace'/>
               </alt>; 
               <alt>
                  <nonterminal name='comment'/>
               </alt>
            </alts>)*</repeat0>
      </alt>.</rule>
   <rule mark='-' name='whitespace'>: 
      <alt>
         <inclusion tmark='-'>[
            <class>Zs</class>]</inclusion>
      </alt>; 
      <alt>
         <nonterminal name='tab'/>
      </alt>; 
      <alt>
         <nonterminal name='lf'/>
      </alt>; 
      <alt>
         <nonterminal name='cr'/>
      </alt>.</rule>
   <rule mark='-' name='tab'>: 
      <alt>
         <encoded tmark='-' hex='#9'/>
      </alt>.</rule>
   <rule mark='-' name='lf'>: 
      <alt>
         <encoded tmark='-' hex='#a'/>
      </alt>.</rule>
   <rule mark='-' name='cr'>: 
      <alt>
         <encoded tmark='-' hex='#d'/>
      </alt>.</rule>
   <rule name='comment'>: 
      <alt>
         <quoted tmark='-' dstring='{'/>, 
         <repeat0>(
            <alts>
               <alt>
                  <nonterminal name='cchar'/>
               </alt>; 
               <alt>
                  <nonterminal name='comment'/>
               </alt>
            </alts>)*</repeat0>, 
         <quoted tmark='-' dstring='}'/>
      </alt>.</rule>
   <rule mark='-' name='cchar'>: 
      <alt>
         <exclusion>~[
            <quoted dstring='{}'/>]</exclusion>
      </alt>.</rule>
   <rule name='rule'>: 
      <alt>
         <option>
            <nonterminal name='mark'/>?</option>, 
         <nonterminal name='name'/>, 
         <inclusion>[
            <quoted dstring='=:'/>]</inclusion>, 
         <nonterminal name='S'/>, 
         <nonterminal mark='-' name='alts'/>, 
         <quoted dstring='.'/>, 
         <nonterminal name='S'/>
      </alt>.</rule>

Conformance

{TBD}

References

{TBD}

Acknowledgments

Thanks are due to Michael Sperberg-McQueen for his many helpful comments.