Invisible Markup

Steven Pemberton, CWI, Amsterdam

Contents

  1. Contents
  2. Data is an abstraction
  3. Representations
  4. Invisible XML
  5. Example: Expression
  6. Example: URL
  7. Example: JSON
  8. Grammars
  9. Example: pi×(10+b)
  10. Remove unnecessary elements
  11. Result is already much smaller
  12. Remove letter and digit
  13. Result
  14. Remove expr
  15. Result
  16. Adding attributes
  17. Usage
  18. Resulting grammar
  19. Serialising to JSON
  20. Adding Nodes
  21. Fix
  22. Alternative fix
  23. URLs
  24. URL example
  25. Alternative grammar
  26. Conclusion

Data is an abstraction

There is no essential difference between the JSON

{"temperature": {"scale": "C", "value": 21}}

and an equivalent XML

<temperature scale="C" value="21"/>

or

<temperature>
   <scale>C</scale>
   <value>21</value>
</temperature>

since the underlying abstractions being represented are the same.
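As a rough illustration of this point, the sketch below reads all three forms into the same Python dictionary. The helper names from_json and from_xml are hypothetical, chosen only for this example, and the JSON string is written with a comma between its members so that a standard JSON parser accepts it.

```python
import json
import xml.etree.ElementTree as ET

def from_json(text):
    """Read the temperature abstraction from its JSON form (hypothetical helper)."""
    obj = json.loads(text)["temperature"]
    return {"scale": obj["scale"], "value": int(obj["value"])}

def from_xml(text):
    """Read the same abstraction from either XML form (hypothetical helper)."""
    root = ET.fromstring(text)
    fields = dict(root.attrib)                                  # attribute form
    fields.update({child.tag: child.text for child in root})    # child-element form
    return {"scale": fields["scale"], "value": int(fields["value"])}

as_json = '{"temperature": {"scale": "C", "value": 21}}'
as_attrs = '<temperature scale="C" value="21"/>'
as_elems = '<temperature><scale>C</scale><value>21</value></temperature>'

# All three representations yield the same underlying data.
print(from_json(as_json) == from_xml(as_attrs) == from_xml(as_elems))
```

Only the surface syntax differs; once read, the three inputs are indistinguishable.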

Representations

We choose which representation of our data to use (JSON, CSV, XML, or whatever) depending on habit, convenience, or the context in which we want to use that data.

On the other hand, having an interoperable generic toolchain such as that provided by XML to process data is of immense value.

How do we resolve the conflicting requirements of convenience, habit, and context, and still enable a generic toolchain?

Invisible XML

Invisible XML (ixml) is a method for treating non-XML documents as if they were XML, enabling authors to write documents and data in a format they prefer while providing XML for processes that are more effective with XML content.

For example, depending on choice, it can turn CSS code like:

body {color: blue; font-weight: bold}

into XML like

<css>
   <rule>
      <simple-selector name="body"/>
      <block>
         <property>
            <name>color</name>
            <value>blue</value>
         </property>
         <property>
            <name>font-weight</name>
            <value>bold</value>
         </property>
      </block>
   </rule>
</css>

or

<css>
   <rule>
      <selector>body</selector>
      <block>
         <property name="color" value="blue"/>
         <property name="font-weight" value="bold"/>
      </block>
   </rule>
</css>

Example: Expression

The input

pi×(10+b)

can result in the XML

<prod>
   <id>pi</id>
   <sum>
      <number>10</number>
      <id>b</id>
   </sum>
</prod>

or

<prod>
   <id name='pi'/>
   <sum>
      <number value='10'/>
      <id name='b'/>
   </sum>
</prod>

Example: URL

The input

http://www.w3.org/TR/1999/xhtml.html

can give

<url>
   <scheme name='http'/>
   <authority>
      <host>
         <sub name='www'/>
         <sub name='w3'/>
         <sub name='org'/>
      </host>
   </authority>
   <path>
      <seg sname='TR'/>
      <seg sname='1999'/>
      <seg sname='xhtml.html'/>
   </path>
</url>

or

<url scheme='http'>://
   <host>www.w3.org</host>
   <path>/TR/1999/xhtml.html</path>
</url>

Example: JSON

{"name": "pi", "value": 3.1415926}

can give

<json>
   <object>
      <pair string='name'>
         <string>pi</string>
      </pair>
      <pair string='value'>
         <number>3.1415926</number>
      </pair>
   </object>
</json>

Grammars

ixml works by describing the document to be treated in a grammar:

expr: term; sum.

sum: expr, "+", term.
term: factor; prod.
prod: term, "×", factor.
factor: id; number; "(", expr, ")".

id: letter+.
number: digit+.

letter: ["a"-"z"].
digit: ["0"-"9"].

'Parsing' recognises the structure of an input according to such a grammar.
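To make 'parsing' concrete, here is a minimal hand-written sketch of a parser for exactly this grammar. It is not how an ixml processor works (a real processor would interpret the grammar itself rather than hard-code it); the function names parse and to_xml and the (name, children) tuple representation are assumptions made for this example. The left-recursive rules for sum and prod are handled with loops that build the same left-nested tree.

```python
from xml.sax.saxutils import escape

def parse(text):
    """Return a parse tree for `text` as nested (name, children) tuples."""
    pos = 0

    def peek():
        return text[pos] if pos < len(text) else ""

    def run(name, leaf, accept):
        # One or more accepted characters, each wrapped in a `leaf` node.
        nonlocal pos
        children = []
        while accept(peek()):
            children.append((leaf, [text[pos]]))
            pos += 1
        if not children:
            raise SyntaxError(f"expected {name} at position {pos}")
        return (name, children)

    def factor():
        # factor: id; number; "(", expr, ")".
        nonlocal pos
        ch = peek()
        if ch == "(":
            pos += 1
            inner = expr()
            if peek() != ")":
                raise SyntaxError(f"expected ')' at position {pos}")
            pos += 1
            return ("factor", ["(", inner, ")"])
        if "0" <= ch <= "9":
            return ("factor", [run("number", "digit", lambda c: "0" <= c <= "9")])
        return ("factor", [run("id", "letter", lambda c: "a" <= c <= "z")])

    def term():
        # term: factor; prod.   prod: term, "×", factor.
        nonlocal pos
        node = ("term", [factor()])
        while peek() == "×":
            pos += 1
            node = ("term", [("prod", [node, "×", factor()])])
        return node

    def expr():
        # expr: term; sum.   sum: expr, "+", term.
        nonlocal pos
        node = ("expr", [term()])
        while peek() == "+":
            pos += 1
            node = ("expr", [("sum", [node, "+", term()])])
        return node

    tree = expr()
    if pos != len(text):
        raise SyntaxError(f"unexpected input at position {pos}")
    return tree

def to_xml(node):
    """Serialise a parse tree: strings are text, tuples become elements."""
    if isinstance(node, str):
        return escape(node)
    name, children = node
    return f"<{name}>" + "".join(to_xml(c) for c in children) + f"</{name}>"

print(to_xml(parse("pi×(10+b)")))
```

The output is unindented, but it contains the same nesting of expr, term, prod, factor and so on that a direct serialisation of the parse tree shows.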

Example: pi×(10+b)

We parse the input with the grammar, and serialise the parse-tree as XML.

This can yield a huge XML document (for such a short string):

<expr>
   <term>
      <prod>
         <term>
            <factor>
               <id>
                  <letter>p</letter>
                  <letter>i</letter>
               </id>
            </factor>
         </term>×
         <factor>(
            <expr>
               <sum>
                  <expr>
                     <term>
                        <factor>
                           <number>
                              <digit>1</digit>
                              <digit>0</digit>
                           </number>
                        </factor>
                     </term>
                  </expr>+
                  <term>
                     <factor>
                        <id>
                           <letter>b</letter>
                        </id>
                     </factor>
                  </term>
               </sum>
            </expr>)
         </factor>
      </prod>
   </term>
</expr>

Remove unnecessary elements

However, many of the elements in the parse are there for structural reasons, and can be removed in the serialisation.

Here we remove term and factor by prefixing their rules with "-". This removes the element from the serialisation, but not its children.

expr: term; sum.

sum: expr, "+", term.
-term: factor; prod.
prod: term, "×", factor.
-factor: id; number; "(", expr, ")".

id: letter+.
number: digit+.

letter: ["a"-"z"].
digit: ["0"-"9"].

Result is already much smaller

<expr>
   <prod>
      <id>
         <letter>p</letter>
         <letter>i</letter>
      </id>×(
      <expr>
         <sum>
            <expr>
               <number>
                  <digit>1</digit>
                  <digit>0</digit>
               </number>
            </expr>+
            <id>
               <letter>b</letter>
            </id>
         </sum>
      </expr>)
   </prod>
</expr>
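The effect of these marks can be sketched as a serialiser that skips the tags of hidden rules but still emits their children. This is only an illustration under assumed names: the function serialise, the hidden set, and the (name, children) tuples are not part of ixml, and the tree is built by hand for the one-letter input "b".

```python
def serialise(node, hidden=frozenset()):
    """Serialise a (name, children) tree, omitting the tags of hidden rules."""
    if isinstance(node, str):
        return node
    name, children = node
    inner = "".join(serialise(child, hidden) for child in children)
    # A hidden rule contributes its children but no element of its own.
    return inner if name in hidden else f"<{name}>{inner}</{name}>"

# Hand-built parse tree for the input "b".
tree = ("expr", [("term", [("factor", [("id", [("letter", ["b"])])])])])

print(serialise(tree))
print(serialise(tree, {"term", "factor"}))
print(serialise(tree, {"term", "factor", "letter"}))
```

Each additional hidden name strips one more layer of elements while the text content stays the same.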

Remove letter and digit

expr: term; sum.

sum: expr, "+", term.
-term: factor; prod.
prod: term, "×", factor.
-factor: id; number; "(", expr, ")".

id: letter+.
number: digit+.

-letter: ["a"-"z"].
-digit: ["0"-"9"].

Result

<expr>
   <prod>
      <id>pi</id>×(
      <expr>
         <sum>
            <expr>
               <number>10</number>
            </expr>+
            <id>b</id>
         </sum>
      </expr>)
   </prod>
</expr>

Remove expr

-expr: term; sum.

sum: expr, "+", term.
-term: factor; prod.
prod: term, "×", factor.
-factor: id; number; "(", expr, ")".

id: letter+.
number: digit+.

-letter: ["a"-"z"].
-digit: ["0"-"9"].

Result

<prod>
   <id>pi</id>×(
   <sum>
      <number>10</number>+
      <id>b</id>
   </sum>)
</prod>

You can delete the extraneous characters if you wish:

sum: expr, -"+", term.
-factor: id; number; -"(", expr, -")".

but leaving them in means that just outputting the terminal characters gives you the original input again.

Another way to look at it

This process just marks up a string by inserting elements identifying the different sub-parts:

<prod><id>pi</id>×(<sum><number>10</number>+<id>b</id></sum>)</prod>
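A quick way to check this view, sketched here with Python's standard ElementTree module, is to concatenate the text content of the marked-up string and compare it with the input. The variable names are chosen for this example only.

```python
import xml.etree.ElementTree as ET

marked_up = "<prod><id>pi</id>×(<sum><number>10</number>+<id>b</id></sum>)</prod>"

# itertext() yields every piece of character data in document order,
# so joining the pieces drops the tags and keeps the terminals.
original = "".join(ET.fromstring(marked_up).itertext())
print(original)  # pi×(10+b)
```

This only holds while the terminals are kept in the serialisation; once they are deleted with "-", the round trip is lost.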

Adding attributes

Changing

id: letter+.
number: digit+.

to

id: @name.
name: letter+.
number: @value.
value: digit+.

or

id: name.
@name: letter+.
number: value.
@value: digit+.

gives

<prod>
   <id name='pi'/>
   <sum>
      <number value='10'/>
      <id name='b'/>
   </sum>
</prod>
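One way to picture the attribute mark is as a serialiser that lifts marked children into attributes of their parent. The sketch below covers only the rule-level form (@name: letter+.), where every usage becomes an attribute; the names serialise, text_of and the attributes set are assumptions for this example, and the tree is built by hand.

```python
from xml.sax.saxutils import escape, quoteattr

def text_of(node):
    """Concatenate all terminal characters below a node."""
    if isinstance(node, str):
        return node
    return "".join(text_of(child) for child in node[1])

def serialise(node, attributes=frozenset()):
    """Serialise a (name, children) tree, turning marked children into attributes."""
    if isinstance(node, str):
        return escape(node)
    name, children = node
    attrs, body = "", ""
    for child in children:
        if not isinstance(child, str) and child[0] in attributes:
            # The child's text becomes an attribute value on this element.
            attrs += f" {child[0]}={quoteattr(text_of(child))}"
        else:
            body += serialise(child, attributes)
    return f"<{name}{attrs}>{body}</{name}>" if body else f"<{name}{attrs}/>"

# Hand-built tree for pi×(10+b), with the other elements and terminals already removed.
tree = ("prod", [
    ("id", [("name", ["p", "i"])]),
    ("sum", [
        ("number", [("value", ["1", "0"])]),
        ("id", [("name", ["b"])]),
    ]),
])

print(serialise(tree, {"name", "value"}))
```

The result uses double quotes around attribute values, which is equivalent XML to the single-quoted form.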

Usage

The usage

id: @name.

means that only that one usage of name will appear as an attribute on id.

The usage

@name: letter+.

means that all usages of name will appear as attributes.

Similarly

prod: -term, "×", factor.

would mean that only that one usage of term would not be serialised, while

-term: factor; prod.

means that no usage of term will be serialised.

Resulting grammar

-expr: term; sum.

sum: expr, -"+", term.
-term: factor; prod.
prod: term, -"×", factor.
-factor: id; number; -"(", expr, -")".

id: @name.
name: ["a"-"z"]+.

number: @value.
value: ["0"-"9"]+.

Adding Nodes

If we change the input string to "pi+(10×b)" and process it, we get:

<sum>
   <id name='pi'/>
   <prod>
      <number value='10'/>
      <id name='b'/>
   </prod>
</sum>

A possible problem here is that this is identical to the parse you would get for the string "pi+10×b":

The brackets in the input do not affect the parse tree.

This is understandable, since the brackets add no extra information: the two strings are semantically identical.
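The same effect can be observed in other parsers. As an analogy only (this uses Python's own expression parser, not ixml, and writes the product with * because Python does not accept ×), the two strings produce identical syntax trees:

```python
import ast

# ast.dump omits line and column information by default,
# so two dumps are equal exactly when the tree shapes are equal.
with_brackets = ast.dump(ast.parse("pi+(10*b)", mode="eval"))
without_brackets = ast.dump(ast.parse("pi+10*b", mode="eval"))
regrouped = ast.dump(ast.parse("(pi+10)*b", mode="eval"))

print(with_brackets == without_brackets)  # True: the brackets are redundant here
print(with_brackets == regrouped)         # False: these brackets change the structure
```

Brackets that merely restate the default precedence leave no trace in the tree; brackets that override it show up as a different nesting.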

Fix

We can fix this, if required, by adding back the node for expr in the case that it is a bracketed expression:

-factor: id; number; -"(", ^expr, -")".

giving for the bracketed case:

<sum>
   <id name='pi'/>
   <expr>
      <prod>
         <number value='10'/>
         <id name='b'/>
      </prod>
   </expr>
</sum>

Alternative fix

Another solution would be to add a rule:

-factor: id; number; bracketed.
bracketed: -"(", expr, -")".

to give:

<sum>
   <id name='pi'/>
   <bracketed>
      <prod>
         <number value='10'/>
         <id name='b'/>
      </prod>
   </bracketed>
</sum>

URLs

As another small example, consider this restricted grammar for URLs:

url: scheme, ":", authority, path.

scheme: name.
@name: letter+.

authority: "//", host.
host: sub+".".
sub: name.

path: ("/", seg)+.
seg: sname.
@sname: fletter*.
-letter: ["a"-"z"]; ["A"-"Z"]; ["0"-"9"].
-fletter: letter; ".".

Here you can see the use of repetitions with separators (sub+"." means one or more subs separated by points), as well as a grouped repetition, ("/", seg)+, meaning one or more segs each preceded by a slash.
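For readers more used to regular expressions, the two repetition forms can be approximated as follows. This is a sketch of the equivalence only, with pattern names chosen for this example: a separated repetition item+sep behaves like item followed by zero or more (sep, item) pairs.

```python
import re

# name: letter+.  The grammar's letter rule also admits digits.
NAME = r"[A-Za-z0-9]+"

# host: sub+".".  One or more names separated by points,
# i.e. a name followed by zero or more (".", name) pairs.
HOST = re.compile(rf"{NAME}(?:\.{NAME})*")

# path: ("/", seg)+.  One or more groups of a slash and a (possibly empty) seg.
PATH = re.compile(r"(?:/[A-Za-z0-9.]*)+")

print(HOST.fullmatch("www.w3.org") is not None)           # True
print(HOST.fullmatch("www..org") is not None)             # False: empty sub between points
print(PATH.fullmatch("/TR/1999/xhtml.html") is not None)  # True
```

Unlike these patterns, the grammar also records the structure of the match, which is what becomes the XML.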

URL example

Given the input string "http://www.w3.org/TR/1999/xhtml.html" you get the parse:

<url>
   <scheme name='http'/>:
   <authority>//
      <host>
         <sub name='www'/>.
         <sub name='w3'/>.
         <sub name='org'/>
      </host>
   </authority>
   <path>/
      <seg sname='TR'/>/
      <seg sname='1999'/>/
      <seg sname='xhtml.html'/>
   </path>
</url>

This illustrates a point about attributes and elements: if they have a different syntax in the input, they have to have a different name in the output.

Here sub has an attribute called name and seg has an attribute called sname. The attribute sname cannot be called name because it has a different syntax to a name (an sname can contain points ".", whereas a name may not).

Refactored grammar

This is the same grammar; only the marks have changed:

url: scheme, ":", authority, path.
@scheme: name.
-name: letter+.
-authority: "//", host.
host: sub+".".
-sub: name.
path: ("/", seg)+.
-seg: sname.
-sname: fletter*.
-letter: ["a"-"z"]; ["A"-"Z"]; ["0"-"9"].
-fletter: letter; ".".

which gives for "http://www.w3.org/TR/1999/xhtml.html":

<url scheme='http'>://
   <host>www.w3.org</host>
   <path>/TR/1999/xhtml.html</path>
</url>
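To show what this serialisation amounts to, here is a small sketch that produces the same XML for the restricted URL syntax using a regular expression and Python's ElementTree. It hard-codes the grammar rather than interpreting it, so it is an illustration of the result, not of how ixml works; the names URL and url_to_xml are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Restricted URL syntax from the grammar: scheme, ":", "//", host, path.
URL = re.compile(
    r"(?P<scheme>[A-Za-z0-9]+):"
    r"//(?P<host>[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*)"
    r"(?P<path>(?:/[A-Za-z0-9.]*)+)"
)

def url_to_xml(text):
    """Serialise a URL the way the marks in the grammar above would."""
    match = URL.fullmatch(text)
    if match is None:
        raise ValueError(f"not a URL in the restricted syntax: {text!r}")
    url = ET.Element("url", scheme=match["scheme"])   # @scheme becomes an attribute
    url.text = "://"                                  # unmarked terminals stay as text
    ET.SubElement(url, "host").text = match["host"]   # -sub and -name leave only text
    ET.SubElement(url, "path").text = match["path"]   # -seg and -sname leave only text
    return ET.tostring(url, encoding="unicode")

print(url_to_xml("http://www.w3.org/TR/1999/xhtml.html"))
```

The output is the same document on a single line, with double rather than single quotes around the attribute value.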

Conclusion

This is an ongoing project to provide software that lets you treat any parsable format as if it were XML, without the need for markup.

There are currently four papers on ixml.

Software to support ixml will be made available at a later date.