Invisible Markup

Data is an abstraction

There is no essential difference between the JSON

{"temperature": {"scale": "C", "value": 21}}

and either of the equivalent XML forms

<temperature scale="C" value="21"/>

<temperature>
   <scale>C</scale>
   <value>21</value>
</temperature>

since the underlying abstractions being represented are the same.
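As a minimal sketch of this point, the following Python (standard library only) reduces each of the three representations to the same plain dictionary. The helper names from_json, from_attributes and from_children are illustrative assumptions, not part of any ixml tooling.

```python
import json
import xml.etree.ElementTree as ET

json_text = '{"temperature": {"scale": "C", "value": 21}}'
xml_attrs = '<temperature scale="C" value="21"/>'
xml_elems = '<temperature><scale>C</scale><value>21</value></temperature>'

def from_json(text):
    # JSON types the value as a number; normalise every field to a string
    # so the result can be compared with the untyped XML forms.
    data = json.loads(text)["temperature"]
    return {key: str(value) for key, value in data.items()}

def from_attributes(text):
    # Attribute form: the fields are the attributes of the root element.
    return dict(ET.fromstring(text).attrib)

def from_children(text):
    # Element form: the fields are the child elements of the root element.
    return {child.tag: child.text for child in ET.fromstring(text)}

abstractions = [from_json(json_text), from_attributes(xml_attrs), from_children(xml_elems)]
print(abstractions[0])
print(all(a == abstractions[0] for a in abstractions))
```

All three functions return the same dictionary, which is the sense in which the representations carry the same abstraction.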

Representations

We choose which representation of our data to use (JSON, CSV, XML, or something else) depending on habit, convenience, or the context in which we want to use the data.

On the other hand, having an interoperable generic toolchain such as that provided by XML to process data is of immense value.

How do we resolve the conflicting requirements of convenience, habit, and context, and still enable a generic toolchain?

Invisible XML

Invisible XML (ixml) is a method for treating non-XML documents as if they were XML, enabling authors to write documents and data in a format they prefer while providing XML for processes that are more effective with XML content.

For example, depending on the choices made by the grammar author, it can turn CSS code like:

body {color: blue; font-weight: bold}

into XML like either of the following:

<css>
   <rule>
      <simple-selector name="body"/>
      <block>
         <property>
            <name>color</name>
            <value>blue</value>
         </property>
         <property>
            <name>font-weight</name>
            <value>bold</value>
         </property>
      </block>
   </rule>
</css>

<css>
   <rule>
      <selector>body</selector>
      <block>
         <property name="color" value="blue"/>
         <property name="font-weight" value="bold"/>
      </block>
   </rule>
</css>

Example: Expression

The input

pi×(10+b)

can result in either of these XML forms:

<prod>
   <id>pi</id>
   <sum>
      <number>10</number>
      <id>b</id>
   </sum>
</prod>

<prod>
   <id name='pi'/>
   <sum>
      <number value='10'/>
      <id name='b'/>
   </sum>
</prod>

Example: URL

The input

http://www.w3.org/TR/1999/xhtml.html

can give either of:

<url>
   <scheme name='http'/>
   <authority>
      <host>
         <sub name='www'/>
         <sub name='w3'/>
         <sub name='org'/>
      </host>
   </authority>
   <path>
      <seg sname='TR'/>
      <seg sname='1999'/>
      <seg sname='xhtml.html'/>
   </path>
</url>

<url scheme='http'>://
   <host>www.w3.org</host>
   <path>/TR/1999/xhtml.html</path>
</url>

Example: JSON

{"name": "pi", "value": 3.1415926}

can give

<json>
   <object>
      <pair string='name'>
         <string>pi</string>
      </pair>
      <pair string='value'>
         <number>3.1415926</number>
      </pair>
   </object>
</json>
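To make the target shape concrete, here is a rough Python sketch that builds the same json/object/pair/string/number structure from a parsed JSON value. It is not how an ixml processor works (that would parse the text against a grammar); the function names to_element and json_to_xml are assumptions made for illustration, and only objects, strings and numbers are handled.

```python
import json
import xml.etree.ElementTree as ET

def to_element(value):
    """Convert a parsed JSON value into an element in the shape shown above."""
    if isinstance(value, dict):
        elem = ET.Element("object")
        for key, member in value.items():
            # Each member becomes a <pair> whose key is held in an attribute.
            pair = ET.SubElement(elem, "pair", {"string": key})
            pair.append(to_element(member))
        return elem
    if isinstance(value, str):
        elem = ET.Element("string")
        elem.text = value
        return elem
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        elem = ET.Element("number")
        elem.text = str(value)
        return elem
    raise TypeError(f"unsupported JSON value in this sketch: {value!r}")

def json_to_xml(text):
    root = ET.Element("json")
    root.append(to_element(json.loads(text)))
    return ET.tostring(root, encoding="unicode")

print(json_to_xml('{"temperature": {"scale": "C", "value": 21}}'))
```

Nested objects simply recurse, so the temperature example from the start of the document produces an object inside a pair inside an object.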

Grammars

ixml works by describing the document to be treated in a grammar:

expr: term; sum.

sum: expr, "+", term.
term: factor; prod.
prod: term, "×", factor.
factor: id; number; "(", expr, ")".

id: letter+.
number: digit+.

letter: ["a"-"z"].
digit: ["0"-"9"].

'Parsing' recognises the structure of an input according to such a grammar.

Example: pi×(10+b)

We parse the input with the grammar, and serialise the parse-tree as XML.

This can yield a huge XML document (for such a short string):

<expr>
   <term>
      <prod>
         <term>
            <factor>
               <id>
                  <letter>p</letter>
                  <letter>i</letter>
               </id>
            </factor>
         </term>×
         <factor>(
            <expr>
               <sum>
                  <expr>
                     <term>
                        <factor>
                           <number>
                              <digit>1</digit>
                              <digit>0</digit>
                           </number>
                        </factor>
                     </term>
                  </expr>+
                  <term>
                     <factor>
                        <id>
                           <letter>b</letter>
                        </id>
                     </factor>
                  </term>
               </sum>
            </expr>)
         </factor>
      </prod>
   </term>
</expr>
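The parse tree above can be reproduced with a small hand-written parser. This is only a sketch under stated assumptions: it is a recursive-descent parser in Python, with the left-recursive rules sum and prod turned into loops, whereas a real ixml processor would use a general parsing algorithm driven by the grammar itself. The class name Parser and the helpers at, take and repeat are invented for this example.

```python
LETTERS = "abcdefghijklmnopqrstuvwxyz"
DIGITS = "0123456789"

class Parser:
    """Recursive-descent parser; nodes are (name, children) pairs."""

    def __init__(self, text):
        self.text, self.pos = text, 0

    def at(self, allowed):
        # True if the next input character is one of `allowed`.
        return self.pos < len(self.text) and self.text[self.pos] in allowed

    def take(self, allowed):
        if not self.at(allowed):
            raise SyntaxError(f"expected one of {allowed!r} at position {self.pos}")
        self.pos += 1
        return self.text[self.pos - 1]

    def expr(self):
        # expr: term; sum.   sum: expr, "+", term.
        node = ("expr", [self.term()])
        while self.at("+"):
            plus = self.take("+")
            node = ("expr", [("sum", [node, plus, self.term()])])
        return node

    def term(self):
        # term: factor; prod.   prod: term, "×", factor.
        node = ("term", [self.factor()])
        while self.at("×"):
            times = self.take("×")
            node = ("term", [("prod", [node, times, self.factor()])])
        return node

    def factor(self):
        # factor: id; number; "(", expr, ")".
        if self.at("("):
            return ("factor", [self.take("("), self.expr(), self.take(")")])
        if self.at(DIGITS):
            return ("factor", [self.repeat("number", "digit", DIGITS)])
        return ("factor", [self.repeat("id", "letter", LETTERS)])

    def repeat(self, outer, inner, allowed):
        # outer: inner+.  where inner matches one character from `allowed`.
        children = [(inner, [self.take(allowed)])]
        while self.at(allowed):
            children.append((inner, [self.take(allowed)]))
        return (outer, children)

def parse(text):
    parser = Parser(text)
    tree = parser.expr()
    if parser.pos != len(text):
        raise SyntaxError(f"unexpected input at position {parser.pos}")
    return tree

def to_xml(node):
    # Terminals are plain strings; none of them needs XML escaping here.
    if isinstance(node, str):
        return node
    name, children = node
    return f"<{name}>" + "".join(to_xml(child) for child in children) + f"</{name}>"

print(to_xml(parse("pi×(10+b)")))
```

Apart from the pretty-printing whitespace, the printed string has the same nesting as the XML shown above, with one element per rule used.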

Remove unnecessary elements

However, many of the elements in the parse are there for structural reasons, and can be removed in the serialisation.

Here we remove term and factor by prefixing their rules with "-". This removes the element itself from the serialisation, but keeps its children.

expr: term; sum.

sum: expr, "+", term.
-term: factor; prod.
prod: term, "×", factor.
-factor: id; number; "(", expr, ")".

id: letter+.
number: digit+.

letter: ["a"-"z"].
digit: ["0"-"9"].
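The effect of the "-" mark can be sketched in a few lines of Python. The parse tree for the shorter input "10+b" is written out by hand as (name, children) pairs, and the function serialise (a hypothetical name, not an ixml API) drops the tags of any rule listed in hidden while still emitting that rule's children.

```python
# Hand-written parse tree for the input "10+b" under the grammar above.
TREE = ("expr", [("sum", [
    ("expr", [("term", [("factor", [
        ("number", [("digit", ["1"]), ("digit", ["0"])]),
    ])])]),
    "+",
    ("term", [("factor", [
        ("id", [("letter", ["b"])]),
    ])]),
])])

def serialise(node, hidden):
    """Serialise a parse tree, omitting the tags of rules named in `hidden`."""
    if isinstance(node, str):
        return node  # terminal character
    name, children = node
    inner = "".join(serialise(child, hidden) for child in children)
    # A hidden rule contributes its children but no element of its own.
    return inner if name in hidden else f"<{name}>{inner}</{name}>"

print(serialise(TREE, {"term", "factor"}))
print(serialise(TREE, {"term", "factor", "expr", "letter", "digit"}))
```

The first call corresponds to marking only term and factor; the second also hides expr, letter and digit, leaving just sum, number and id around the original characters.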

Result

If expr, letter, and digit are also marked with "-", the serialisation becomes:

<prod>
   <id>pi</id>×(
   <sum>
      <number>10</number>+
      <id>b</id>
   </sum>)
</prod>

You can delete the extraneous characters if you wish:

sum: expr, -"+", term.
-factor: id; number; -"(", expr, -")".

but leaving them in means that just outputting the terminal characters gives you the original input again.
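This round trip is easy to check. The sketch below, using Python's standard xml.etree.ElementTree, collects all text in a serialisation that keeps the terminals. Whitespace is stripped because the XML here is pretty-printed, so the sketch assumes the original input contains no significant whitespace; the function name terminals is illustrative.

```python
import xml.etree.ElementTree as ET

serialised = """
<prod>
   <id>pi</id>×(
   <sum>
      <number>10</number>+
      <id>b</id>
   </sum>)
</prod>
"""

def terminals(xml_text):
    # Concatenate every text node in document order, then drop the
    # whitespace that was only added for indentation.
    root = ET.fromstring(xml_text.strip())
    return "".join("".join(root.itertext()).split())

print(terminals(serialised))
```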

Adding attributes

Changing

id: letter+.
number: digit+.

to either

id: @name.
name: letter+.
number: @value.
value: digit+.

or

id: name.
@name: letter+.
number: value.
@value: digit+.

gives

<prod>
   <id name='pi'/>
   <sum>
      <number value='10'/>
      <id name='b'/>
   </sum>
</prod>
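A rough Python sketch of how an "@" mark changes serialisation: a node whose rule is marked "@" contributes its text as an attribute of the enclosing element instead of becoming a child element. The tree is hand-written for the input "10+b" with the "+" terminal already deleted; the names serialise, text_of and to_xml are assumptions for this example, and marks are applied per rule name only.

```python
from xml.sax.saxutils import quoteattr

# Hand-written tree for "10+b" after the structural rules have been hidden.
TREE = ("sum", [
    ("number", [("value", ["1", "0"])]),
    ("id", [("name", ["b"])]),
])

def text_of(node):
    # The attribute value is the concatenation of the node's terminals.
    if isinstance(node, str):
        return node
    return "".join(text_of(child) for child in node[1])

def serialise(node, marks):
    """Return (attributes, content) that `node` contributes to its parent."""
    if isinstance(node, str):
        return [], node
    name, children = node
    if marks.get(name) == "@":
        return [(name, text_of(node))], ""
    attrs, content = [], ""
    for child in children:
        child_attrs, child_content = serialise(child, marks)
        attrs += child_attrs
        content += child_content
    attr_text = "".join(f" {key}={quoteattr(value)}" for key, value in attrs)
    if content:
        return [], f"<{name}{attr_text}>{content}</{name}>"
    return [], f"<{name}{attr_text}/>"

def to_xml(tree, marks):
    return serialise(tree, marks)[1]

print(to_xml(TREE, {}))
print(to_xml(TREE, {"name": "@", "value": "@"}))
```

With no marks, name and value are serialised as child elements; with both marked "@", number and id become empty elements carrying attributes. The sketch writes attribute values in double quotes, which XML treats as equivalent to the single quotes used above.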

Usage

The usage

id: @name.

means that this one use of name will appear as an attribute on id.

The usage

@name: letter+.

means that all usages of name will appear as attributes.

Similarly

prod: -term, "×", factor.

would mean that this one use of term would not be serialised, while

-term: factor; prod.

means that no usage of term will be serialised.

Adding Nodes

If we change the input string to "pi+(10×b)" and process it, we get:

<sum>
   <id name='pi'/>
   <prod>
      <number value='10'/>
      <id name='b'/>
   </prod>
</sum>

A possible problem here is that this is identical to the parse you would get for the string "pi+10×b": the brackets in the input do not affect the parse tree.

This is understandable, since the brackets add no extra information: the two strings are semantically identical.
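The same effect can be seen in other parsers. As an analogy only (this uses Python's own expression grammar, with * standing in for ×, not ixml), the standard ast module produces identical trees with and without the redundant brackets, and a different tree once the brackets change the grouping:

```python
import ast

def tree(source):
    # ast.dump omits line and column positions by default, so two dumps
    # are equal exactly when the tree structures are equal.
    return ast.dump(ast.parse(source, mode="eval"))

print(tree("pi+(10*b)") == tree("pi+10*b"))
print(tree("(pi+10)*b") == tree("pi+10*b"))
```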

Fix

We can fix this, if required, by adding back the node for expr when it is a bracketed expression:

-factor: id; number; -"(", ^expr, -")".

giving for the bracketed case:

<sum>
   <id name='pi'/>
   <expr>
      <prod>
         <number value='10'/>
         <id name='b'/>
      </prod>
   </expr>
</sum>

Alternative fix

Another solution would be to add a rule:

-factor: id; number; bracketed.
bracketed: -"(", expr, -")".

to give:

<sum>
   <id name='pi'/>
   <bracketed>
      <prod>
         <number value='10'/>
         <id name='b'/>
      </prod>
   </bracketed>
</sum>

URLs

As another small example, consider this restricted grammar for URLs:

url: scheme, ":", authority, path.

scheme: name.
@name: letter+.

authority: "//", host.
host: sub+".".
sub: name.

path: ("/", seg)+.
seg: sname.
@sname: fletter*.
-letter: ["a"-"z"]; ["A"-"Z"]; ["0"-"9"].
-fletter: letter; ".".

Here you can see the use of repetition with separators (sub+"." means one or more subs separated by points), as well as a grouped repetition: ("/", seg)+ means one or more occurrences of a "/" followed by a seg.
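Because this restricted grammar happens to be regular, its repetitions can be mirrored with a regular expression. The following Python sketch is an approximation under that assumption, not an ixml implementation: the pattern names and the function parse_url are invented here, and the punctuation terminals (":", "//", "." and "/") are dropped from the output as if they had been marked "-".

```python
import re
import xml.etree.ElementTree as ET

LETTER = r"[a-zA-Z0-9]"          # letter: ["a"-"z"]; ["A"-"Z"]; ["0"-"9"].
NAME = rf"{LETTER}+"             # name: letter+.
SNAME = rf"(?:{LETTER}|\.)*"     # sname: fletter*.  fletter: letter; ".".
HOST = rf"{NAME}(?:\.{NAME})*"   # host: sub+".".  (one or more, "." between)
PATH = rf"(?:/{SNAME})+"         # path: ("/", seg)+.
URL = re.compile(rf"(?P<scheme>{NAME})://(?P<host>{HOST})(?P<path>{PATH})")

def parse_url(text):
    match = URL.fullmatch(text)
    if match is None:
        raise ValueError(f"not a URL in the restricted grammar: {text!r}")
    url = ET.Element("url")
    ET.SubElement(url, "scheme", {"name": match["scheme"]})
    host = ET.SubElement(ET.SubElement(url, "authority"), "host")
    for sub in match["host"].split("."):
        ET.SubElement(host, "sub", {"name": sub})
    path = ET.SubElement(url, "path")
    # The path starts with "/", so the first split item is always empty.
    for seg in match["path"].split("/")[1:]:
        ET.SubElement(path, "seg", {"sname": seg})
    return url

root = parse_url("http://www.w3.org/TR/1999/xhtml.html")
print(ET.tostring(root, encoding="unicode"))
```

The separator form sub+"." becomes NAME followed by zero or more "." NAME groups, and the grouped repetition becomes a repeated "/" SNAME group.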

URL example

Given the input string "http://www.w3.org/TR/1999/xhtml.html" you get the parse:

<url>
   <scheme name='http'/>:
   <authority>//
      <host>
         <sub name='www'/>.
         <sub name='w3'/>.
         <sub name='org'/>
      </host>
   </authority>
   <path>/
      <seg sname='TR'/>/
      <seg sname='1999'/>/
      <seg sname='xhtml.html'/>
   </path>
</url>

This illustrates a point about attributes and elements: if they have a different syntax in the input, they have to have a different name in the output.

Here sub has an attribute called name and seg has an attribute called sname. The attribute sname cannot be called name because it has a different syntax to a name (an sname can contain points ".", whereas a name may not).

Refactored grammar

The same grammar, with only the marks changed:

url: scheme, ":", authority, path.
@scheme: name.
-name: letter+.
-authority: "//", host.
host: sub+".".
-sub: name.
path: ("/", seg)+.
-seg: sname.
-sname: fletter*.
-letter: ["a"-"z"]; ["A"-"Z"]; ["0"-"9"].
-fletter: letter; ".".

which gives for "http://www.w3.org/TR/1999/xhtml.html":

<url scheme='http'>://
   <host>www.w3.org</host>
   <path>/TR/1999/xhtml.html</path>
</url>

Conclusion

This is an ongoing project to provide software that lets you treat any parsable format as if it were XML, without the need for markup.

There are currently four papers:

Invisible XML
Introduces the concepts, and develops a notation to support them.
Data just wants to be (format) neutral
Discusses issues with automatic serialisation, and the relationship between Invisible XML grammars and data schemas.
Parse Earley, Parse Often: How to parse anything to XML
Discusses issues around grammar design, in particular the parsing algorithms used to recognise any document and the conversion of the resulting parse-tree into XML, and gives a new perspective on a classic algorithm.
On the Descriptions of Data: The Usability of Notations
Discusses changes to the design following experience with using it, giving examples of its use to develop data descriptions, and in passing, suggests other output formats.

Software to support ixml will be made available at a later date.
