There is no essential difference between the JSON
{"temperature": {"scale": "C", "value": 21}}
and an equivalent XML
<temperature scale="C" value="21"/>
or
<temperature> <scale>C</scale> <value>21</value> </temperature>
since the underlying abstractions being represented are the same.
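As a rough illustration of that claim, the following Python sketch reads all three forms into the same in-memory record. The helper names from_json and from_xml are made up for this example, and treating attributes and child elements as interchangeable fields is an assumption of the sketch, not something XML itself prescribes.

```python
import json
import xml.etree.ElementTree as ET

json_text = '{"temperature": {"scale": "C", "value": 21}}'
xml_attrs = '<temperature scale="C" value="21"/>'
xml_elems = '<temperature> <scale>C</scale> <value>21</value> </temperature>'

def from_json(text):
    # Expect one top-level key holding the fields; normalise values to strings.
    [(name, fields)] = json.loads(text).items()
    return name, {key: str(value) for key, value in fields.items()}

def from_xml(text):
    # Attributes and child elements are both treated as plain fields.
    root = ET.fromstring(text)
    fields = dict(root.attrib)
    fields.update({child.tag: child.text for child in root})
    return root.tag, fields

records = [from_json(json_text), from_xml(xml_attrs), from_xml(xml_elems)]
print(records[0])
print(records[0] == records[1] == records[2])
```

All three inputs end up as the same (name, fields) pair, which is the sense in which the surface syntax is incidental.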
We choose a representation for our data (JSON, CSV, XML, or something else) depending on habit, convenience, or the context in which we want to use that data.
On the other hand, having an interoperable generic toolchain such as that provided by XML to process data is of immense value.
How do we resolve the conflicting requirements of convenience, habit, and context, and still enable a generic toolchain?
Invisible XML (ixml) is a method for treating non-XML documents as if they were XML, enabling authors to write documents and data in a format they prefer while providing XML for processes that are more effective with XML content.
For example, depending on choice, it can turn CSS code like:
body {color: blue; font-weight: bold}
into XML like
<css> <rule> <simple-selector name="body"/> <block> <property> <name>color</name> <value>blue</value> </property> <property> <name>font-weight</name> <value>bold</value> </property> </block> </rule> </css>
or
<css> <rule> <selector>body</selector> <block> <property name="color" value="blue"/> <property name="font-weight" value="bold"/> </block> </rule> </css>
The input
pi×(10+b)
can result in the XML
<prod> <id>pi</id> <sum> <number>10</number> <id>b</id> </sum> </prod>
or
<prod> <id name='pi'/> <sum> <number value='10'/> <id name='b'/> </sum> </prod>
The input
http://www.w3.org/TR/1999/xhtml.html
can give
<url> <scheme name='http'/> <authority> <host> <sub name='www'/> <sub name='w3'/> <sub name='org'/> </host> </authority> <path> <seg sname='TR'/> <seg sname='1999'/> <seg sname='xhtml.html'/> </path> </url>
or
<url scheme='http'>:// <host>www.w3.org</host> <path>/TR/1999/xhtml.html</path> </url>
The input
{"name": "pi", "value": 3.145926}
can give
<json> <object> <pair string='name'> <string>pi</string> </pair> <pair string='value'> <number>3.145926</number> </pair> </object> </json>
ixml works by describing the document to be treated in a grammar:
expr: term; sum. sum: expr, "+", term. term: factor; prod. prod: term, "×", factor. factor: id; number; "(", expr, ")". id: letter+. number: digit+. letter: ["a"-"z"]. digit: ["0"-"9"].
'Parsing' recognises the structure of an input according to such a grammar.
We parse the input with the grammar, and serialise the parse-tree as XML.
This can yield a huge XML document (for such a short string):
<expr> <term> <prod> <term> <factor> <id> <letter>p</letter> <letter>i</letter> </id> </factor> </term>× <factor>( <expr> <sum> <expr> <term> <factor> <number> <digit>1</digit> <digit>0</digit> </number> </factor> </term> </expr>+ <term> <factor> <id> <letter>b</letter> </id> </factor> </term> </sum> </expr>) </factor> </prod> </term> </expr>
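To make the parse-then-serialise step concrete, here is a minimal Python sketch for this one grammar. It is a hand-written parser, not an ixml processor: a real implementation would read the grammar itself and use a general parsing algorithm. The function names parse and to_xml, and the (name, children) tuple representation of the tree, are assumptions of this sketch. The left-recursive rules for sum and prod are unrolled into loops that nest to the left.

```python
from xml.sax.saxutils import escape

def parse(text):
    """Parse text with the expression grammar; return a (name, children) tree."""
    pos = 0

    def peek():
        return text[pos] if pos < len(text) else ""

    def take(ch):
        nonlocal pos
        if peek() != ch:
            raise SyntaxError(f"expected {ch!r} at position {pos}")
        pos += 1
        return ch

    def run(name, item, ok):
        # id: letter+.   number: digit+.
        children = []
        while ok(peek()):
            children.append((item, [take(peek())]))
        return (name, children)

    def factor():
        # factor: id; number; "(", expr, ")".
        c = peek()
        if "a" <= c <= "z":
            inner = [run("id", "letter", lambda ch: "a" <= ch <= "z")]
        elif "0" <= c <= "9":
            inner = [run("number", "digit", lambda ch: "0" <= ch <= "9")]
        else:
            inner = [take("("), expr(), take(")")]
        return ("factor", inner)

    def term():
        # term: factor; prod.   prod: term, "×", factor.
        node = ("term", [factor()])
        while peek() == "×":
            node = ("term", [("prod", [node, take("×"), factor()])])
        return node

    def expr():
        # expr: term; sum.   sum: expr, "+", term.
        node = ("expr", [term()])
        while peek() == "+":
            node = ("expr", [("sum", [node, take("+"), term()])])
        return node

    tree = expr()
    if pos != len(text):
        raise SyntaxError(f"unexpected {peek()!r} at position {pos}")
    return tree

def to_xml(node):
    # Every nonterminal becomes an element; terminals become text.
    if isinstance(node, str):
        return escape(node)
    name, children = node
    return f"<{name}>" + "".join(to_xml(c) for c in children) + f"</{name}>"

full = to_xml(parse("pi×(10+b)"))
print(full)
```

The printed result has the same structure as the XML shown above, without the whitespace between tags.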
However, many of the elements in the parse are there for structural reasons, and can be removed in the serialisation.
Here we remove term and factor by marking their rules with "-". This removes the element itself from the serialisation, but not its children.
expr: term; sum. sum: expr, "+", term. -term: factor; prod. prod: term, "×", factor. -factor: id; number; "(", expr, ")". id: letter+. number: digit+. letter: ["a"-"z"]. digit: ["0"-"9"].
<expr> <prod> <id> <letter>p</letter> <letter>i</letter> </id>×( <expr> <sum> <expr> <number> <digit>1</digit> <digit>0</digit> </number> </expr>+ <id> <letter>b</letter> </id> </sum> </expr>) </prod> </expr>
Removing letter and digit as well:
expr: term; sum. sum: expr, "+", term. -term: factor; prod. prod: term, "×", factor. -factor: id; number; "(", expr, ")". id: letter+. number: digit+. -letter: ["a"-"z"]. -digit: ["0"-"9"].
<expr> <prod> <id>pi</id>×( <expr> <sum> <expr> <number>10</number> </expr>+ <id>b</id> </sum> </expr>) </prod> </expr>
Finally, removing expr:
-expr: term; sum. sum: expr, "+", term. -term: factor; prod. prod: term, "×", factor. -factor: id; number; "(", expr, ")". id: letter+. number: digit+. -letter: ["a"-"z"]. -digit: ["0"-"9"].
<prod> <id>pi</id>×( <sum> <number>10</number>+ <id>b</id> </sum>) </prod>
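The effect of the "-" mark on rules can be sketched in Python as a serialiser that skips the tags of hidden nonterminals while still emitting their children. The full parse tree of "pi×(10+b)" is written out by hand here as (name, children) tuples; the helpers n and chars and the hidden parameter are names invented for this sketch.

```python
from xml.sax.saxutils import escape

def n(name, *children):
    return (name, list(children))

def chars(name, text):
    # One node per character, e.g. letter nodes for "pi".
    return [n(name, c) for c in text]

# Full parse tree of "pi×(10+b)", written out by hand.
tree = n("expr", n("term", n("prod",
    n("term", n("factor", n("id", *chars("letter", "pi")))),
    "×",
    n("factor",
        "(",
        n("expr", n("sum",
            n("expr", n("term", n("factor", n("number", *chars("digit", "10"))))),
            "+",
            n("term", n("factor", n("id", *chars("letter", "b")))))),
        ")"))))

def to_xml(node, hidden=frozenset()):
    if isinstance(node, str):
        return escape(node)
    name, children = node
    inner = "".join(to_xml(child, hidden) for child in children)
    # A hidden nonterminal contributes its children but no tags of its own.
    return inner if name in hidden else f"<{name}>{inner}</{name}>"

step1 = to_xml(tree, {"term", "factor"})
step2 = to_xml(tree, {"term", "factor", "letter", "digit"})
step3 = to_xml(tree, {"term", "factor", "letter", "digit", "expr"})
for step in (step1, step2, step3):
    print(step)
```

Each call hides a larger set of names, mirroring the three grammars above; the tree itself never changes, only what is written out.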
You can delete the extraneous characters if you wish:
sum: expr, -"+", term. -factor: id; number; -"(", expr, -")".
but leaving them in means that just outputting the terminal characters gives you the original input again.
This process just marks up a string by inserting elements identifying the different sub-parts:
<prod><id>pi</id>×(<sum><number>10</number>+<id>b</id></sum>)</prod>
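The round trip can be checked with a few lines of Python using the standard xml.etree.ElementTree module. The variable names are illustrative, and the XML is written without whitespace between tags so that no extra characters creep into the text.

```python
import xml.etree.ElementTree as ET

marked_up = "<prod><id>pi</id>×(<sum><number>10</number>+<id>b</id></sum>)</prod>"

# itertext() walks the document in order and yields only the character data.
recovered = "".join(ET.fromstring(marked_up).itertext())
print(recovered)
```

Dropping the tags leaves exactly the original input string.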
Changing
id: letter+. number: digit+.
to
id: @name. name: letter+. number: @value. value: digit+.
or
id: name. @name: letter+. number: value. @value: digit+.
gives
<prod> <id name='pi'/> <sum> <number value='10'/> <id name='b'/> </sum> </prod>
The usage
id: @name.
means that this one usage of name will appear as an attribute on id.
The usage
@name: letter+.
means that all usages of name will appear as attributes.
Similarly,
prod: -term, "×", factor.
would mean that this one usage of term would not be serialised, while
-term: factor; prod.
means that no usage of term will be serialised.
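The difference between marking a rule and marking a single usage can be sketched in Python as follows. This is an illustration only: the marks table, keyed either by a name (a mark on the rule) or by a (rule, name) pair (a mark on one usage inside that rule), is a representation invented for this sketch, and the two small trees are written by hand with the literal terminals already dropped from the first one.

```python
from xml.sax.saxutils import escape, quoteattr

def text_of(node):
    """Concatenate the terminal characters below a node."""
    if isinstance(node, str):
        return node
    return "".join(text_of(child) for child in node[1])

def render(node, marks, parent=None):
    """Return (attributes, content) that node contributes to its parent."""
    if isinstance(node, str):
        return [], escape(node)
    name, children = node
    # A mark on one usage, keyed (rule, name), wins over a mark on the rule.
    mark = marks.get((parent, name)) or marks.get(name, "^")
    if mark == "@":
        return [(name, text_of(node))], ""
    attrs, content = [], ""
    for child in children:
        child_attrs, child_content = render(child, marks, name)
        attrs += child_attrs
        content += child_content
    if mark == "-":
        return attrs, content  # no tags; children pass straight through
    attr_text = "".join(f" {key}={quoteattr(value)}" for key, value in attrs)
    if content:
        return [], f"<{name}{attr_text}>{content}</{name}>"
    return [], f"<{name}{attr_text}/>"

def to_xml(tree, marks):
    return render(tree, marks)[1]

expr_tree = ("prod", [("id", [("name", ["pi"])]),
                      ("sum", [("number", [("value", ["10"])]),
                               ("id", [("name", ["b"])])])])
by_usage = to_xml(expr_tree, {("id", "name"): "@", ("number", "value"): "@"})
by_rule = to_xml(expr_tree, {"name": "@", "value": "@"})

# A hand-made tree in which term is used in two different rules.
sum_tree = ("sum", [("prod", [("term", [("id", ["a"])]), "×", ("id", ["b"])]),
                    "+", ("term", [("id", ["c"])])])
one_usage = to_xml(sum_tree, {("prod", "term"): "-"})
every_usage = to_xml(sum_tree, {"term": "-"})
print(by_usage, one_usage, every_usage, sep="\n")
```

With the mark on the usage inside prod, the term under sum is still serialised; with the mark on the rule, it disappears everywhere.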
Putting these marks together gives the grammar:
-expr: term; sum. sum: expr, -"+", term. -term: factor; prod. prod: term, -"×", factor. -factor: id; number; -"(", expr, -")". id: @name. name: ["a"-"z"]+. number: @value. value: ["0"-"9"]+.
If we change the input string to "pi+(10×b)" and process it, we get:
<sum> <id name='pi'/> <prod> <number value='10'/> <id name='b'/> </prod> </sum>
A possible problem here is that this is identical to the parse you would get for the string "pi+10×b": the brackets in the input do not affect the parse tree.
This is understandable, since the brackets add no extra information: the two strings are semantically identical.
We can fix this, if required, by adding back the node for expr in the case that it is a bracketed expression:
-factor: id; number; -"(", ^expr, -")".
giving for the bracketed case:
<sum> <id name='pi'/> <expr> <prod> <number value='10'/> <id name='b'/> </prod> </expr> </sum>
Another solution would be to add a rule:
-factor: id; number; bracketed. bracketed: -"(", expr, -")".
to give:
<sum> <id name='pi'/> <bracketed> <prod> <number value='10'/> <id name='b'/> </prod> </bracketed> </sum>
As another small example, consider this restricted grammar for URLs:
url: scheme, ":", authority, path. scheme: name. @name: letter+. authority: "//", host. host: sub+".". sub: name. path: ("/", seg)+. seg: sname. @sname: fletter*. -letter: ["a"-"z"]; ["A"-"Z"]; ["0"-"9"]. -fletter: letter; ".".
Here you can see the use of repetitions with separators (sub+"." means one or more subs separated by points), as well as a grouped repetition: ("/", seg)+
Given the input string "http://www.w3.org/TR/1999/xhtml.html" you get the parse:
<url> <scheme name='http'/>: <authority>// <host> <sub name='www'/>. <sub name='w3'/>. <sub name='org'/> </host> </authority> <path>/ <seg sname='TR'/>/ <seg sname='1999'/>/ <seg sname='xhtml.html'/> </path> </url>
This illustrates a point about attributes and elements: if they have a different syntax in the input, they have to have a different name in the output.
Here sub has an attribute called name and seg has an attribute called sname. The attribute sname cannot be called name because it has a different syntax to a name (an sname can contain points ".", whereas a name may not).
The same grammar, with only the marks changed:
url: scheme, ":", authority, path. @scheme: name. -name: letter+. -authority: "//", host. host: sub+".". -sub: name. path: ("/", seg)+. -seg: sname. -sname: fletter*. -letter: ["a"-"z"]; ["A"-"Z"]; ["0"-"9"]. -fletter: letter; ".".
which gives for "http://www.w3.org/TR/1999/xhtml.html":
<url scheme='http'>:// <host>www.w3.org</host> <path>/TR/1999/xhtml.html</path> </url>
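The two serialisations of the URL can be imitated in Python to show that the same recognised structure supports both outputs. In this sketch a regular expression stands in for the restricted URL grammar (it is not an ixml processor), and the names URL, detailed and compact are invented for the example. No escaping is needed because the grammar only admits letters, digits, "." and "/".

```python
import re

# letter: ["a"-"z"]; ["A"-"Z"]; ["0"-"9"].   fletter: letter; ".".
LETTER = "[a-zA-Z0-9]"
URL = re.compile(
    rf"(?P<scheme>{LETTER}+)"                 # scheme: name.
    rf"://(?P<host>{LETTER}+(?:\.{LETTER}+)*)"  # host: sub+".".
    rf"(?P<path>(?:/[a-zA-Z0-9.]*)+)"         # path: ("/", seg)+.
)

def detailed(url):
    """Every part kept as its own element or attribute."""
    m = URL.fullmatch(url)
    if m is None:
        raise ValueError(f"not matched by the restricted grammar: {url!r}")
    subs = m["host"].split(".")
    segs = m["path"].split("/")[1:]  # drop the empty string before the first "/"
    return (
        f"<url><scheme name='{m['scheme']}'/>:<authority>//<host>"
        + ".".join(f"<sub name='{s}'/>" for s in subs)
        + "</host></authority><path>"
        + "".join(f"/<seg sname='{s}'/>" for s in segs)
        + "</path></url>"
    )

def compact(url):
    """Same structure, with most nonterminals hidden and scheme as an attribute."""
    m = URL.fullmatch(url)
    if m is None:
        raise ValueError(f"not matched by the restricted grammar: {url!r}")
    return (f"<url scheme='{m['scheme']}'>://<host>{m['host']}</host>"
            f"<path>{m['path']}</path></url>")

example = "http://www.w3.org/TR/1999/xhtml.html"
print(detailed(example))
print(compact(example))
```

Both functions recognise exactly the same strings; only the decision about what to emit differs, which is the role the marks play in the grammar.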
This is an ongoing project to provide software that lets you treat any parsable format as if it were XML, without the need for markup.
There are currently four papers:
Introduces the concepts, and develops a notation to support them.
Discusses issues with automatic serialisation, and the relationship between Invisible XML grammars and data schemas.
Discusses issues around grammar design, in particular the parsing algorithms used to recognise any document and the conversion of the resulting parse-tree into XML, and gives a new perspective on a classic algorithm.
Discusses changes to the design following experience with using it, giving examples of its use to develop data descriptions, and in passing, suggests other output formats.
Software to support ixml will be made available at a later date.