Are you ready for Invisible XML?

Abstractions

Numbers are abstractions: you can't point to the number three, just three bicycles, or three sheep, or three self-referential examples.

Three is what those bicycles and sheep and examples have in common.

Representations

You can represent a number in different ways:

3, III, 0011, ㆔, ३, ፫, ૩, ੩, 〣, ೩, ៣, ໓, Ⅲ, ൩, ၃, ႓, trois, drie.

You can concretise numbers as a length, a weight, a speed, a temperature.

But in the end, they all represent the same three.

Data is an abstraction too!

For various reasons we are obliged to represent data in one way or another, but in the end those representations are all of the same abstraction. There is no essential difference between the JSON

{"temperature": {"scale": "C", "value": 21}}

and an equivalent XML, whether written with attributes or with child elements,

<temperature scale="C" value="21"/>

<temperature>
   <scale>C</scale>
   <value>21</value>
</temperature>

or indeed

temperature: 21°C

since the underlying abstractions being represented are the same.
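A small Python sketch of that claim: four reader functions each recover the same (scale, value) pair from one of the representations above. The function names are illustrative only, and the regular expression for the plain-text form is an assumption about that format.

```python
import json
import re
import xml.etree.ElementTree as ET


def from_json(text):
    # {"temperature": {"scale": ..., "value": ...}}
    data = json.loads(text)["temperature"]
    return (data["scale"], float(data["value"]))


def from_xml_attributes(text):
    # <temperature scale="..." value="..."/>
    root = ET.fromstring(text)
    return (root.get("scale"), float(root.get("value")))


def from_xml_elements(text):
    # <temperature><scale>...</scale><value>...</value></temperature>
    root = ET.fromstring(text)
    return (root.findtext("scale"), float(root.findtext("value")))


def from_plain_text(text):
    # temperature: 21°C  (assumed shape: digits, degree sign, one capital letter)
    match = re.fullmatch(r"temperature: (\d+)°([A-Z])", text)
    if match is None:
        raise ValueError(f"unrecognised temperature: {text!r}")
    return (match.group(2), float(match.group(1)))


readings = [
    from_json('{"temperature": {"scale": "C", "value": 21}}'),
    from_xml_attributes('<temperature scale="C" value="21"/>'),
    from_xml_elements(
        "<temperature><scale>C</scale><value>21</value></temperature>"
    ),
    from_plain_text("temperature: 21°C"),
]

# All four representations collapse to one abstract value.
print(len(set(readings)) == 1)
```

Whatever the surface syntax, the value that comes out is the same; only the reader differs.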

What Invisible Markup Does

Takes a representation of data (typically with implicit structure).

Uses a description of the format of that data to recognise the data's structure.

Creates an internal representation of the data, now with the structure made explicit.

Which can be used for multiple purposes, including creating an external representation with explicit structure.

Representations

Some representations are weaker than others: they may not be able to faithfully represent all of the abstraction, and are therefore not reversible.

XML is probably the best available general notation for generating the representation of any abstraction.

The intention behind ixml is to allow abstractions to be extracted from representations, converting weaker representations into stronger ones; XML is an excellent target for that.

(Simple) Example: Dates

Input: 5 December 2023

Describe the format:

date: day, " ", month, " ", year.
day: digit, digit?.
month: "January"; "February"; ...; "December".
year: digit, digit, digit, digit.
digit: ["0"-"9"].

Process the input with this description, and get:

<date>
   <day>
      <digit>5</digit>
   </day> 
   <month>December</month> 
   <year>
      <digit>2</digit>
      <digit>0</digit>
      <digit>2</digit>
      <digit>3</digit>
   </year>
</date>

(Simple) Example: Dates

Input: 5 December 2023

Describe the format, now marking digit with - so that it produces no element of its own:

date: day, " ", month, " ", year.
day: digit, digit?.
month: "January"; "February"; ...; "December".
year: digit, digit, digit, digit.
-digit: ["0"-"9"].

Process the input with this description, and get:

<date>
   <day>5</day> 
   <month>December</month> 
   <year>2023</year>
</date>
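To make the idea concrete outside ixml, here is a hand-written Python stand-in for the grammar above. It is only a sketch, not an ixml processor: the regular expression mirrors the date rule, the month list is written out in full here although the slide elides it, and parse_date is a hypothetical helper name.

```python
import re
import xml.etree.ElementTree as ET

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# Mirrors: date: day, " ", month, " ", year.
DATE = re.compile(
    r"(?P<day>[0-9][0-9]?) "
    r"(?P<month>" + "|".join(MONTHS) + r") "
    r"(?P<year>[0-9]{4})"
)


def parse_date(text):
    """Recognise the implicit structure and make it explicit as XML."""
    match = DATE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a date: {text!r}")
    date = ET.Element("date")
    # The " " literals are not deleted, so they survive as text between elements.
    for name, tail in (("day", " "), ("month", " "), ("year", None)):
        child = ET.SubElement(date, name)
        child.text = match.group(name)
        child.tail = tail
    return date


print(ET.tostring(parse_date("5 December 2023"), encoding="unicode"))
```

The point of ixml is that the five-line grammar replaces all of this: the description of the format is the whole program.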

(Simple) Example: Dates

Add another format option: 5/12/2023

Add to the description:

date: day, " ", month, " ", year;
         day, "/", nmonth, "/", year.
day: digit, digit?.
month: "January"; "February"; ...; "December".
nmonth: digit, digit?.
year: digit, digit, digit, digit.
-digit: ["0"-"9"].

Process the input with this description, and get:

<date>
   <day>5</day>/
   <nmonth>12</nmonth>/
   <year>2023</year>
</date>

Several dates

dates: date+.

Better:

dates: (date, " "*)+.

or:

dates: date++", ".

for input like:

19/10/2022, 31 December 2023, 1/1/2024
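What the separated repetition date++", " accepts can be sketched in Python. This is an illustration under assumptions: DATE is a loose regular-expression stand-in for the date rule (it does not check month names), and split_dates is a hypothetical helper.

```python
import re

# Loose stand-in for the `date` rule: numeric form, or "day Month year".
DATE = r"(?:[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}|[0-9]{1,2} [A-Z][a-z]+ [0-9]{4})"

# dates: date++", ".   One or more dates, with ", " between each pair.
SEPARATED = re.compile(rf"{DATE}(?:, {DATE})*")


def split_dates(text):
    """Return the individual dates, or raise if the list is malformed."""
    if SEPARATED.fullmatch(text) is None:
        raise ValueError(f"not a comma-separated list of dates: {text!r}")
    # Safe because no date in this format contains ", ".
    return text.split(", ")


print(split_dates("19/10/2022, 31 December 2023, 1/1/2024"))
```

Unlike (date, " "*)+, the separated form requires the separator between items and does not allow one after the last item.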

Attributes

Input: 5/12/2023

date: day, " ", month, " ", year;
         day, "/", nmonth, "/", year.
@day: digit, digit?.
@month: "January"; "February"; ...; "December".
@nmonth: digit, digit?.
@year: digit, digit, digit, digit.
-digit: ["0"-"9"].

gives

<date day="5" nmonth="12" year="2023">//</date>

Deleting terminals

Input: 5/12/2023

date: day, -" ", month, -" ", year;
         day, -"/", nmonth, -"/", year.
@day: digit, digit?.
@month: "January"; "February"; ...; "December".
@nmonth: digit, digit?.
@year: digit, digit, digit, digit.
-digit: ["0"-"9"].

gives

<date day="5" nmonth="12" year="2023"/>
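The effect of the marks can be sketched as a serialization step over a parse tree. The Python below is an illustration under assumptions, not a description of how any ixml processor is implemented: the Terminal and Nonterminal classes and the hand-built tree for 5/12/2023 are hypothetical.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass
class Terminal:
    text: str
    deleted: bool = False  # True models a "-" mark on the literal


@dataclass
class Nonterminal:
    name: str
    mark: str = "^"  # "^" element, "@" attribute, "-" children only
    children: list = field(default_factory=list)


def text_of(node):
    """Concatenate the undeleted terminal text below a node."""
    if isinstance(node, Terminal):
        return "" if node.deleted else node.text
    return "".join(text_of(child) for child in node.children)


def append_text(element, text):
    """Add text after the last child element, or as leading text."""
    if len(element):
        element[-1].tail = (element[-1].tail or "") + text
    else:
        element.text = (element.text or "") + text


def fill(element, children):
    """Serialize children into element, honouring each mark."""
    for child in children:
        if isinstance(child, Terminal):
            if not child.deleted:
                append_text(element, child.text)
        elif child.mark == "@":
            element.set(child.name, text_of(child))
        elif child.mark == "-":
            fill(element, child.children)  # splice the children in place
        else:
            fill(ET.SubElement(element, child.name), child.children)


def serialize(tree):
    root = ET.Element(tree.name)
    fill(root, tree.children)
    return ET.tostring(root, encoding="unicode")


def digits(text):
    # -digit: each digit is a hidden nonterminal wrapping one character.
    return [Nonterminal("digit", "-", [Terminal(ch)]) for ch in text]


def date_tree(mark, delete_slashes):
    """Parse tree for 5/12/2023 with the given mark on day, nmonth and year."""
    return Nonterminal("date", "^", [
        Nonterminal("day", mark, digits("5")),
        Terminal("/", delete_slashes),
        Nonterminal("nmonth", mark, digits("12")),
        Terminal("/", delete_slashes),
        Nonterminal("year", mark, digits("2023")),
    ])


print(serialize(date_tree("^", False)))  # elements, slashes kept
print(serialize(date_tree("@", False)))  # attributes, slashes kept
print(serialize(date_tree("@", True)))   # attributes, slashes deleted
```

The parse tree is the same in all three cases; the marks only change how it is written out.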

ixml reports ambiguity

Given a grammar that accepts both world-style and US-style dates, with months restricted to 1-12 and days to 1-31:

date: us; world.
us: month, -"/", day, -"/", year.
world: day, -"/", month, -"/", year.
month: "0"?, ["1"-"9"]; 
             "10"; "11"; "12".
etc

the input 31/12/2023 would produce:

<date xmlns:ixml="http://invisiblexml.org/NS">
   <world>
      <day>31</day>
      <month>12</month>
      <year>2023</year>
   </world>
</date>

the input 12/31/2023 would produce:

<date xmlns:ixml="http://invisiblexml.org/NS">
   <us>
      <month>12</month>
      <day>31</day>
      <year>2023</year>
   </us>
</date>

the input 5/12/2023 would produce:

<!-- AMBIGUOUS
     The input from line.pos 1.1 to 1.11 can be interpreted as 'date' in 2 different ways:
     1: us[:1.11] 
     2: world[:1.11] 
-->
<date ixml:state="ambiguous" xmlns:ixml="http://invisiblexml.org/NS">
   <us>
      <month>5</month>
      <day>12</day>
      <year>2023</year>
   </us>
</date>
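Why the third input is ambiguous and the first two are not can be checked by hand. The Python sketch below tries each reading separately; it is an approximation with regular expressions, not an ixml processor, and the DAY and YEAR patterns are assumptions, since the slide elides those rules.

```python
import re

# month: "0"?, ["1"-"9"]; "10"; "11"; "12".
MONTH = r"(?:0?[1-9]|1[0-2])"
# Assumed: days 1-31 with an optional leading zero (rule elided on the slide).
DAY = r"(?:0?[1-9]|[12][0-9]|3[01])"
# Assumed: four digits, as in the earlier date grammar.
YEAR = r"[0-9]{4}"

RULES = {
    "us": re.compile(f"{MONTH}/{DAY}/{YEAR}"),
    "world": re.compile(f"{DAY}/{MONTH}/{YEAR}"),
}


def interpretations(text):
    """Names of the alternatives of `date` that accept the whole input."""
    return [name for name, rule in RULES.items() if rule.fullmatch(text)]


for text in ("31/12/2023", "12/31/2023", "5/12/2023"):
    found = interpretations(text)
    status = "ambiguous" if len(found) > 1 else "unambiguous"
    print(text, found, status)
```

An input is ambiguous exactly when more than one alternative accepts it; the processor still returns one of the parses, but flags the result.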

Another small example

url: scheme, ":", authority, path.
scheme: letter+.
authority: "//", host.
host: sub++".".
sub: letter+.
path: ("/", seg)+.
seg: fletter*.
-letter: ["a"-"z"]; ["A"-"Z"]; ["0"-"9"].
-fletter: letter; ".".

Result for http://w3.org/1999/xhtml.html:

<url>
   <scheme>http</scheme>:
   <authority>//
      <host>
         <sub>w3</sub>.
         <sub>org</sub>
      </host>
   </authority>
   <path>
      /<seg>1999</seg>
      /<seg>xhtml.html</seg>
   </path>
</url>

Another small example

url: scheme, ":", authority, path.
@scheme: letter+.
authority: "//", host.
host: sub++".".
sub: letter+.
path: ("/", seg)+.
seg: fletter*.
-letter: ["a"-"z"]; ["A"-"Z"]; ["0"-"9"].
-fletter: letter; ".".

Result for http://w3.org/1999/xhtml.html:

<url scheme="http">:
   <authority>//
      <host>
         <sub>w3</sub>.
         <sub>org</sub>
      </host>
   </authority>
   <path>
      /<seg>1999</seg>
      /<seg>xhtml.html</seg>
   </path>
</url>

Another small example

url: scheme, ":", authority, path.
@scheme: letter+.
authority: "//", host.
host: sub++".".
-sub: letter+.
path: ("/", seg)+.
-seg: fletter*.
-letter: ["a"-"z"]; ["A"-"Z"]; ["0"-"9"].
-fletter: letter; ".".

Result for http://w3.org/1999/xhtml.html:

<url scheme="http">:
   <authority>//
      <host>w3.org</host>
   </authority>
   <path>/1999/xhtml.html</path>
</url>

Another small example

url: scheme, -":", authority, path.
@scheme: letter+.
authority: -"//", host.
host: sub++".".
-sub: letter+.
path: ("/", seg)+.
-seg: fletter*.
-letter: ["a"-"z"]; ["A"-"Z"]; ["0"-"9"].
-fletter: letter; ".".

Result for http://w3.org/1999/xhtml.html:

<url scheme="http">
   <authority>
      <host>w3.org</host>
   </authority>
   <path>/1999/xhtml.html</path>
</url>

Some Use Cases

The hardest part of getting an article into Docbook format (the format used by several conferences I go to) is getting the bibliography right.

The bibliography for a recent paper was produced with the help of ixml. For instance, the text

[spec] Steven Pemberton (ed.), Invisible XML Specification, invisiblexml.org,
 2022, https://invisiblexml.org/ixml-specification.html

was processed by an ixml grammar whose top-level rules were

bibliography: biblioentry+.
biblioentry: 
    abbrev, (author; editor), -", ",
    title, -", ",
    publisher, -", ", pubdate, -",",
    (artpagenums, -", ")?, 
    (bibliomisc; biblioid)**-", ", -#a.

Yielding

<biblioentry>
   <abbrev>spec</abbrev>
   <editor>
      <personname>
         <firstname>Steven</firstname>
         <surname>Pemberton</surname>
      </personname>
   </editor>
   <title>Invisible XML Specification</title>
   <publisher>invisiblexml.org</publisher>
   <pubdate>2022</pubdate>
   <bibliomisc>
      <link xl-href='https://invisiblexml.org/ixml-specification.html'/>
   </bibliomisc>
</biblioentry>

Extracting Information

Invisible XML has greater expressive power than regular expressions.

This makes it great for extracting information from existing textual documents.

The input data doesn't have to be line-based to extract information.

There is a project in the Dutch Government using ixml for just this, to locate particular information within large document collections.
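One way to see the gap in expressive power is nested structure. An ixml grammar is a context-free grammar, so a single recursive rule can describe arbitrarily deep balanced brackets, which classical regular expressions cannot. The Python sketch below hand-codes such a recursive rule; the bracket language and the function names are chosen for illustration and are not taken from the talk.

```python
def parse_nested(text, pos=0):
    """Parse one bracketed group at pos; return (children, next position).

    Hand-coded equivalent of the recursive rule
        nested: "(", nested*, ")".
    """
    if pos >= len(text) or text[pos] != "(":
        raise ValueError(f"expected '(' at position {pos}")
    pos += 1
    children = []
    while pos < len(text) and text[pos] == "(":
        child, pos = parse_nested(text, pos)
        children.append(child)
    if pos >= len(text) or text[pos] != ")":
        raise ValueError(f"expected ')' at position {pos}")
    return children, pos + 1


def parse(text):
    """Parse the whole input as a single bracketed group."""
    tree, end = parse_nested(text)
    if end != len(text):
        raise ValueError(f"unexpected input at position {end}")
    return tree


def depth(tree):
    """Nesting depth of a parsed group."""
    return 1 + max((depth(child) for child in tree), default=0)


print(depth(parse("(()(()))")))  # prints 3
```

The recursion in parse_nested is what the grammar rule expresses directly, and the resulting tree carries the nesting that a flat pattern match would lose.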

ixml Libraries

The ixml community is producing a library of grammars for a number of standard formats.

For instance, there is a grammar for (a subset of) Markdown: I recently used it to produce a chapter of a book.

Processing step

An ixml processor takes a document in a particular (textual) format, along with a description of that format, in the form of a grammar, and uses it to parse the document.

This produces a structured parse tree, which can then be processed in a number of ways, such as serialization as XML.

Structured ixml

The format description is drawn as a structured document.

However, it is normally supplied in textual form, and is then processed in exactly the same way by the ixml processor, this time using a description of the ixml format itself.

This results in the structured version of the description.

ixml in ixml

ixml is of course expressed in ixml:

rule: (mark, s)?, name, s, -["=:"], s, -alts, -".".

which comes out as XML

<rule name='rule'>
   <alt>
      <option>
         <alts>
            <alt>
               <nonterminal name='mark'/>
               <nonterminal name='s'/>
            </alt>
         </alts>
      </option>
      <nonterminal name='name'/>
      <nonterminal name='s'/>
      <inclusion tmark='-'>
         <member string='=:'/>
      </inclusion>
      <nonterminal name='s'/>
      <nonterminal mark='-' name='alts'/>
      <literal tmark='-' string='.'/>
   </alt>
</rule>

The Abstract Document

What many miss on first introduction to ixml is the essential role of the abstract document in the middle: everything else is just representation.

The ixml description language plays three roles:

a description of the syntax of the input
a specification of the representation of the output
a schema of the abstract document.

Learning materials

There are two online interactive hands-on tutorials available, with exercises:

Introduction: takes you through the language, and describes all features
Advanced: shows techniques, gives tips, helps you design input and output formats, and covers a half-dozen case studies.

The specification: as specifications go, rather readable, and should answer any questions you have about how it works.

ixml

Version 1 was officially released in 2022 on invisiblexml.org.

Currently 4 implementations running, 4 more in preparation.

For full details pop along to the ixml website, and join the working group if you want.
