ixml: declarative abstract data representation

Steven Pemberton, CWI, Amsterdam

ixml

Invisible XML (ixml) is a new notation for expressing data abstractions

Version 1.0 was officially released in June 2022 on invisiblexml.org

Currently 5 implementations known

History

Long incubation period.

[Figure: Bus trip]

First vocalised by me on this trip in 2002.

At my first ever keynote, at XML Europe 2004:

"Parsing is quite easy

It would be fairly easy to add a generalised part to the XML pipeline that parsed unmarked-up text, and produced XML as a parse tree: it's just a different sort of transform."

First paper 2013:

R1: "This is clearly a submission that needs to be shredded, burned, and the ashes buried in multiple locations"

R2: "I think the audience will eat him alive. But I want to be there to hear it."

Went through several iterations.

Working group formed last year.

Abstractions

[Figure: Three examples]

Numbers are abstractions: you can't point to the number three, just three bicycles, or three sheep, or three self-referential examples.

Three is what those bicycles and sheep and examples have in common.

Representations

You can represent a number in different ways:

3, III, 0011, ㆔, ३, ፫, ૩, ੩, 〣, ೩, ៣, ໓, Ⅲ, ൩, ၃, ႓, trois, drie.

You can concretise numbers as a length, a weight, a speed, a temperature.

But in the end, they all represent the same three.

Data is an abstraction too!

We are often obliged for different reasons to represent data in some way or another.

But in the end those representations are all of the same abstraction; there is no essential difference between the JSON

{"temperature": {"scale": "C", "value": 21}}

and an equivalent XML

<temperature scale="C" value="21"/>

or

<temperature>
   <scale>C</scale>
   <value>21</value>
</temperature>

or indeed

temperature: 21°C

since the underlying abstractions being represented are the same.
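To make the point concrete, here is a minimal Python sketch that reads each of the four representations above and arrives at the same value. The function names and the (scale, value) pair used as the "abstraction" are assumptions made for illustration; they are not part of ixml.

```python
import json
import re
import xml.etree.ElementTree as ET


def from_json(text):
    """Read the JSON form into a (scale, value) pair."""
    inner = json.loads(text)["temperature"]
    return (inner["scale"], int(inner["value"]))


def from_xml_attributes(text):
    """Read the XML form that uses attributes."""
    root = ET.fromstring(text)
    return (root.get("scale"), int(root.get("value")))


def from_xml_elements(text):
    """Read the XML form that uses child elements."""
    root = ET.fromstring(text)
    return (root.findtext("scale"), int(root.findtext("value")))


def from_text(text):
    """Read the plain-text form, e.g. 'temperature: 21°C'."""
    match = re.fullmatch(r"temperature: (\d+)°([A-Z])", text)
    if match is None:
        raise ValueError("not a temperature: " + text)
    return (match.group(2), int(match.group(1)))


if __name__ == "__main__":
    print(from_json('{"temperature": {"scale": "C", "value": 21}}'))
    print(from_xml_attributes('<temperature scale="C" value="21"/>'))
    print(from_xml_elements(
        "<temperature><scale>C</scale><value>21</value></temperature>"))
    print(from_text("temperature: 21°C"))
```

Each reader is hand-written for one notation. The idea behind ixml is to replace such one-off readers with a declarative description of the format.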

What ixml does

Takes a representation of data (typically with implicit structure).

Uses a description of the format of that data to recognise the data's structure.

Creates an internal representation of the data, now with the structure made explicit.

Which can be used for multiple purposes, including creating an external representation with explicit structure.

Representations

Some representations are weaker than others: they may not be able to faithfully represent all of the abstraction, and are therefore not reversible.

XML is probably the best available general notation for generating the representation of any abstraction.

The intention behind ixml is to allow abstractions to be extracted from representations: to convert weaker representations of abstractions into stronger ones. XML is therefore an excellent target.

(Simple) Example: Dates

19 October 2022

Describe the format:

date: day, " ", month, " ", year.
day: digit, digit?.
month: "January"; "February"; ...; "December".
year: digit, digit, digit, digit.
digit: ["0"-"9"].

Process the input with this description, and get:

<date>
   <day>
      <digit>1</digit>
      <digit>9</digit>
   </day> 
   <month>October</month> 
   <year>
      <digit>2</digit>
      <digit>0</digit>
      <digit>2</digit>
      <digit>2</digit>
   </year>
</date>
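What happens in that processing step can be sketched in a few lines of Python. This is not an ixml processor: the grammar is written out by hand as a Python dictionary (an assumed encoding), and a naive backtracking matcher stands in for the general parsing algorithm a real implementation would use. All names here are hypothetical.

```python
import xml.etree.ElementTree as ET

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# The date grammar above, hand-encoded: each rule maps to a list of
# alternatives, and each alternative is a sequence of terms.
GRAMMAR = {
    "date": [[("nt", "day"), ("lit", " "), ("nt", "month"),
              ("lit", " "), ("nt", "year")]],
    "day": [[("nt", "digit"), ("opt", ("nt", "digit"))]],
    "month": [[("lit", m)] for m in MONTHS],
    "year": [[("nt", "digit")] * 4],
    "digit": [[("set", "0123456789")]],
}


def parse_term(term, text, pos):
    """Yield (new_pos, nodes) for every way the term matches at pos."""
    kind, arg = term
    if kind == "lit":
        if text.startswith(arg, pos):
            yield pos + len(arg), [arg]
    elif kind == "set":
        if pos < len(text) and text[pos] in arg:
            yield pos + 1, [text[pos]]
    elif kind == "opt":
        yield from parse_term(arg, text, pos)
        yield pos, []
    elif kind == "nt":
        for alt in GRAMMAR[arg]:
            for end, children in parse_seq(alt, text, pos):
                yield end, [(arg, children)]


def parse_seq(terms, text, pos):
    """Yield (new_pos, nodes) for every way the sequence matches at pos."""
    if not terms:
        yield pos, []
        return
    for mid, first in parse_term(terms[0], text, pos):
        for end, rest in parse_seq(terms[1:], text, mid):
            yield end, first + rest


def parse(text, start="date"):
    """Return the first parse tree that covers the whole input."""
    for end, nodes in parse_term(("nt", start), text, 0):
        if end == len(text):
            return nodes[0]
    raise ValueError("input does not match the grammar")


def to_xml(node):
    """Turn a (name, children) tree into an ElementTree element."""
    name, children = node
    elem = ET.Element(name)
    last = None
    for child in children:
        if isinstance(child, str):
            if last is None:
                elem.text = (elem.text or "") + child
            else:
                last.tail = (last.tail or "") + child
        else:
            last = to_xml(child)
            elem.append(last)
    return elem


if __name__ == "__main__":
    tree = parse("19 October 2022")
    print(ET.tostring(to_xml(tree), encoding="unicode"))
```

The serialization is the same as the XML above, without the indentation. Note that the two literal spaces end up as text between the child elements.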

(Simple) Example: Dates

19 October 2022

Describe the format:

date: day, " ", month, " ", year.
day: digit, digit?.
month: "January"; "February"; ...; "December".
year: digit, digit, digit, digit.
-digit: ["0"-"9"].

Process the input with this description, and get:

<date>
   <day>19</day> 
   <month>October</month> 
   <year>2022</year>
</date>
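The effect of marking digit with "-" can be pictured as a small tree transformation: the digit nodes disappear and their content moves up into the parent. The Python sketch below assumes a parse tree of (name, children) pairs and uses hypothetical helper names. It illustrates the idea only, not how any particular implementation works.

```python
def splice_hidden(node, hidden):
    """Copy the tree, replacing each hidden child by that child's children."""
    name, children = node
    result = []
    for child in children:
        if isinstance(child, str):
            result.append(child)
            continue
        child = splice_hidden(child, hidden)
        if child[0] in hidden:
            result.extend(child[1])   # keep the content, drop the wrapper
        else:
            result.append(child)
    return (name, result)


def serialize(node):
    """Write the tree as XML text (no escaping: fine for this sample only)."""
    name, children = node
    body = "".join(c if isinstance(c, str) else serialize(c) for c in children)
    return "<" + name + ">" + body + "</" + name + ">"


def digit(d):
    return ("digit", [d])


# The parse tree for "19 October 2022" before any marks are applied.
TREE = ("date", [
    ("day", [digit("1"), digit("9")]),
    " ",
    ("month", ["October"]),
    " ",
    ("year", [digit(c) for c in "2022"]),
])

if __name__ == "__main__":
    print(serialize(TREE))
    print(serialize(splice_hidden(TREE, {"digit"})))
```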

(Simple) Example: Dates

Add another format option:

19/10/2022

Add to the description:

date: day, " ", month, " ", year;
      day, "/", nmonth, "/", year.
day: digit, digit?.
month: "January"; "February"; ...; "December".
nmonth: digit, digit?.
year: digit, digit, digit, digit.
-digit: ["0"-"9"].

Process the input with this description, and get:

<date>
   <day>19</day>/
   <nmonth>10</nmonth>/
   <year>2022</year>
</date>

Several dates

dates: date+.

Better:

dates: (date, " "*)+.

or:

dates: date++", ".

for

19/10/2022, 31 December 2022, 1/1/2023
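The separated repetition date++", " means one date, followed by zero or more occurrences of the separator and another date. As a rough Python analogy, the same shape can be written as a regular expression. The regular expression is an assumed stand-in for the date rules, and split_dates is a hypothetical helper.

```python
import re

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# One date in either format: 19/10/2022 or 19 October 2022.
DATE = (r"(?:\d\d?/\d\d?/\d{4}|\d\d? (?:" + "|".join(MONTHS) + r") \d{4})")

# date++", " : one date, then any number of (separator, date) pairs.
DATES = re.compile(DATE + r"(?:, " + DATE + r")*")


def split_dates(text):
    """Return the individual dates, or raise if the list is malformed."""
    if DATES.fullmatch(text) is None:
        raise ValueError("not a comma-separated list of dates")
    return re.findall(DATE, text)


if __name__ == "__main__":
    print(split_dates("19/10/2022, 31 December 2022, 1/1/2023"))
```

As with the ixml rule, the separator is only accepted between items: a trailing ", " or a missing comma makes the whole input fail.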

Attributes

date: day, " ", month, " ", year;
      day, "/", nmonth, "/", year.
@day: digit, digit?.
@month: "January"; "February"; ...; "December".
@nmonth: digit, digit?.
@year: digit, digit, digit, digit.
-digit: ["0"-"9"].

with input

19/10/2022

gives

<date day="19" nmonth="10" year="2022">//</date>

Deleting terminals

date: day, -" ", month, -" ", year;
      day, -"/", nmonth, -"/", year.
@day: digit, digit?.
@month: "January"; "February"; ...; "December".
@nmonth: digit, digit?.
@year: digit, digit, digit, digit.
-digit: ["0"-"9"].

with input

19/10/2022

gives

<date day="19" nmonth="10" year="2022"/>
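The marks seen so far ("@" for attributes, "-" on a nonterminal to hide it, "-" on a terminal to delete it) only affect serialization, not parsing. A minimal Python sketch of such a serializer follows. The tuple encoding of the parse tree and all function names are assumptions made for illustration.

```python
from xml.sax.saxutils import escape, quoteattr

# Tree encoding (hypothetical):
#   nonterminal: ("nt", mark, name, children)   mark is "^", "@" or "-"
#   terminal:    ("t", mark, string)            mark is "^" or "-"


def text_of(children):
    """Concatenate all non-deleted terminal text below these children."""
    parts = []
    for child in children:
        if child[0] == "t":
            if child[1] != "-":
                parts.append(child[2])
        else:
            parts.append(text_of(child[3]))
    return "".join(parts)


def collect(children):
    """Split children into attributes and serialized element content."""
    attrs, body = [], []
    for child in children:
        if child[0] == "t":
            if child[1] != "-":              # deleted terminals vanish
                body.append(escape(child[2]))
        elif child[1] == "@":                # becomes an attribute
            attrs.append((child[2], text_of(child[3])))
        elif child[1] == "-":                # hidden: promote its content
            inner_attrs, inner_body = collect(child[3])
            attrs.extend(inner_attrs)
            body.append(inner_body)
        else:
            body.append(serialize(child))
    return attrs, "".join(body)


def serialize(node):
    """Serialize a nonterminal node as an XML element."""
    _, _, name, children = node
    attrs, body = collect(children)
    attr_text = "".join(" " + k + "=" + quoteattr(v) for k, v in attrs)
    if body:
        return "<" + name + attr_text + ">" + body + "</" + name + ">"
    return "<" + name + attr_text + "/>"


def digits(name, value, mark):
    """Build a nonterminal whose children are hidden digit nodes."""
    return ("nt", mark, name,
            [("nt", "-", "digit", [("t", "^", d)]) for d in value])


def date_tree(slash_mark):
    """Parse tree for 19/10/2022; slash_mark is the mark on the "/" literals."""
    return ("nt", "^", "date", [
        digits("day", "19", "@"), ("t", slash_mark, "/"),
        digits("nmonth", "10", "@"), ("t", slash_mark, "/"),
        digits("year", "2022", "@"),
    ])


if __name__ == "__main__":
    print(serialize(date_tree("^")))   # slashes kept
    print(serialize(date_tree("-")))   # slashes deleted
```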

Ambiguity

ixml reports ambiguous input.

A grammar accepting both World and USA style dates, with month only 1-12, and day 1-31:

date: us; world.
us: month, -"/", day, -"/", year.
world: day, -"/", month, -"/", year.
month: "0"?, ["1"-"9"]; 
       "10"; "11"; "12".
etc

the input 04/10/2021 would produce:

<!-- AMBIGUOUS
     The input from line.pos 1.1 to 1.11 can be interpreted as 'date' in 2 different ways:
     1: us[:1.11] 
     2: world[:1.11] 
-->
<date ixml:state="ambiguous" xmlns:ixml="http://invisiblexml.org/NS">
   <us>
      <month>04</month>
      <day>10</day>
      <year>2021</year>
   </us>
</date>
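Why this input is ambiguous can be checked by hand: both readings satisfy the constraints on day and month. The Python sketch below simply tries both readings and reports which succeed. The range checks are an assumed approximation of the rules elided by "etc" above, and the function names are hypothetical.

```python
import re

PATTERN = re.compile(r"(\d\d?)/(\d\d?)/(\d{4})")


def valid_month(s):
    return 1 <= int(s) <= 12


def valid_day(s):
    return 1 <= int(s) <= 31


def interpretations(text):
    """Return every reading of the input as (style, fields)."""
    match = PATTERN.fullmatch(text)
    if match is None:
        return []
    first, second, year = match.groups()
    found = []
    if valid_month(first) and valid_day(second):
        found.append(("us", {"month": first, "day": second, "year": year}))
    if valid_day(first) and valid_month(second):
        found.append(("world", {"day": first, "month": second, "year": year}))
    return found


if __name__ == "__main__":
    for sample in ["04/10/2021", "19/10/2021", "10/19/2021"]:
        readings = interpretations(sample)
        state = "ambiguous" if len(readings) > 1 else "unambiguous"
        print(sample, state, [style for style, _ in readings])
```

An ixml processor does the equivalent for any grammar: it returns one of the possible parses and flags the result as ambiguous.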

Real world example

The hardest part of getting an article into DocBook format (the format used by several conferences I go to) is getting the bibliography right.

The bibliography for a recent paper was produced with the help of ixml. For instance, the text

[spec] Steven Pemberton (ed.), Invisible XML Specification, invisiblexml.org,
2022, https://invisiblexml.org/ixml-specification.html

was processed by an ixml grammar whose top-level rules were

bibliography: biblioentry+.
biblioentry: 
    abbrev, (author; editor), -", ",
    title, -", ",
    publisher, -", ", pubdate, -",",
    (artpagenums, -", ")?, 
    (bibliomisc; biblioid)**-", ", -#a.

Yielding

<biblioentry>
   <abbrev>spec</abbrev>
   <editor>
      <personname>
         <firstname>Steven</firstname>
         <surname>Pemberton</surname>
      </personname>
   </editor>
   <title>Invisible XML Specification</title>
   <publisher>invisiblexml.org</publisher>
   <pubdate>2022</pubdate>
   <bibliomisc>
      <link xl-href='https://invisiblexml.org/ixml-specification.html'/>
   </bibliomisc>
</biblioentry>

Processing step

[Figure: ixml processing step]

An ixml processor takes a document in a particular (textual) format, along with a description of that format, in the form of a grammar, and uses it to parse the document.

This produces a structured parse tree, which can then be processed in a number of ways, such as serialization as XML.

Structured ixml

[Figure: ixml processing step]

The format description is drawn as a structured document.

However, it is normally supplied in textual form, and the ixml processor handles it in exactly the same way, this time using a description of the ixml format itself.

This results in the structured version of the description.

Processing

[Figure: Processing ixml]

ixml in ixml

ixml is of course expressed in ixml:

rule: (mark, s)?, name, s, -["=:"], s, -alts, -".".

which comes out as XML

<rule name='rule'>
   <alt>
      <option>
         <alts>
            <alt>
               <nonterminal name='mark'/>
               <nonterminal name='s'/>
            </alt>
         </alts>
      </option>
      <nonterminal name='name'/>
      <nonterminal name='s'/>
      <inclusion tmark='-'>
         <member string='=:'/>
      </inclusion>
      <nonterminal name='s'/>
      <nonterminal mark='-' name='alts'/>
      <literal tmark='-' string='.'/>
  </alt>
</rule>
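Because the structured form is ordinary XML, it can be handled with ordinary XML tools. As an illustration, the Python sketch below reads the XML above and writes it back out in the textual notation. It covers only the element types that occur in this one rule, and the render functions are hypothetical names.

```python
import xml.etree.ElementTree as ET

RULE_XML = """
<rule name='rule'>
   <alt>
      <option>
         <alts>
            <alt>
               <nonterminal name='mark'/>
               <nonterminal name='s'/>
            </alt>
         </alts>
      </option>
      <nonterminal name='name'/>
      <nonterminal name='s'/>
      <inclusion tmark='-'>
         <member string='=:'/>
      </inclusion>
      <nonterminal name='s'/>
      <nonterminal mark='-' name='alts'/>
      <literal tmark='-' string='.'/>
   </alt>
</rule>
"""


def render_term(elem):
    """Render one term; handles only the element types used above."""
    if elem.tag == "nonterminal":
        return elem.get("mark", "") + elem.get("name")
    if elem.tag == "literal":
        return elem.get("tmark", "") + '"' + elem.get("string") + '"'
    if elem.tag == "inclusion":
        members = "; ".join('"' + m.get("string") + '"'
                            for m in elem.findall("member"))
        return elem.get("tmark", "") + "[" + members + "]"
    if elem.tag == "option":
        return "(" + render_alts(elem.find("alts")) + ")?"
    raise ValueError("unsupported element: " + elem.tag)


def render_alts(parent):
    """Render the alt children of a rule or alts element."""
    return "; ".join(", ".join(render_term(term) for term in alt)
                     for alt in parent.findall("alt"))


def render_rule(rule):
    """Render a rule element in the textual ixml notation."""
    return (rule.get("mark", "") + rule.get("name") + ": "
            + render_alts(rule) + ".")


if __name__ == "__main__":
    print(render_rule(ET.fromstring(RULE_XML)))
```

The textual rule and the XML carry the same information, so either can be produced from the other.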

Full Processing Cycle

[Figure: ixml processing chain]

The Abstract Document

What many miss on first introduction to ixml is the essential role of the abstract document: everything else is just representation.

The ixml description language plays three roles:

What this means, for instance, is that even the ixml description language is mutable: as long as the structure of the abstract description stays the same, you can use any syntax you like.

Using BNF

Instead of

rule: (mark, s)?, name, s, -["=:"], s, -alts, -".".
name: namestart, namefollower*.

you could define an equivalent language using:

rule: (mark, s)?, name, s, -"::=", s, -alts.
name: -"<", namestart, namefollower*, -">".

and then define languages using BNF instead:

<rule> ::= (<mark>, <s>)? <name> <s> -"::=" <s> -<alts>
<name> ::= -"<" <namestart> <namefollower>* -">"

Fuller processing diagram

[Figure: Fuller processing, using BNF]

There is more

You have now seen the essence.

For full details read the specification, or see the tutorial.