Generalised Invisible Markup

Steven Pemberton, CWI, Amsterdam

Abstract

Invisible XML makes the implicit structure of textual documents explicit by parsing, and then transforming the resultant parse tree into an abstract document. This abstract document is the essence of Invisible XML: it can be processed in many ways, though principally by serialising to XML.

However, as shown in an earlier paper on roundtripping ixml, it is possible to simultaneously simplify and generalise the ixml serialisation process, thereby opening it to other serialisations that are not hard-wired in the processor. By the same token, this further simplifies ixml proper, by reducing it to a simple transformation of grammars from ixml to an equivalent Invisible Markup grammar.

This talk discusses the changes needed to create a generalised Invisible Markup Language, explores the alternatives, and proposes future steps.

ixml

Textual documents have an implicit structure that is mostly recognisable for human readers, but for computers the structure has to be made explicit to enable processing of documents.

Thus SGML and XML (etc).

Invisible XML (ixml) works in a different way:

Extra information in the grammar controls the serialisation.

Abstract document

The abstract document is the central element of the Invisible XML process, since it can be used in several ways, though the principal one is serialising to XML.

However, there are other possible uses and serialisations.

Roundtripping

A 2024 paper on roundtripping pointed out that you can reverse the serialisation process by using ixml to recognise the XML serialisation produced by ixml.

This is done by transforming grammars so that they include the necessary extra characters such as "<" and ">" added by serialisation, thus recreating an abstract document that would create the same XML serialisation.

Extra facilities are needed to do this in a general way, since some items (namely attributes) get implicitly moved to earlier in the serialisation than their position in the abstract document.

Example

 date: day, -"/", month, -"/", year.
  day: d, d.
month: d, d.
 year: +"20", d, d.
   -d: ["0"-"9"].

with input

07/11/25

would generate

<date><day>07</day><month>11</month><year>2025</year></date>
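
As a rough illustration of what this grammar does, here is a minimal Python sketch. It is not how an ixml processor works (a real processor parses with the grammar itself); the regular expression and the helper name `date_to_xml` are assumptions made only for this example.

```python
import re

# Hypothetical stand-in for the date grammar: three two-digit fields
# separated by slashes.
DATE = re.compile(r"(?P<day>[0-9][0-9])/(?P<month>[0-9][0-9])/(?P<year>[0-9][0-9])")


def date_to_xml(text: str) -> str:
    m = DATE.fullmatch(text)
    if m is None:
        raise ValueError(f"not a date: {text!r}")
    # -"/"  : the slashes are matched but do not appear in the output.
    # +"20" : the century appears in the output but not in the input.
    return (
        "<date>"
        f"<day>{m['day']}</day>"
        f"<month>{m['month']}</month>"
        f"<year>20{m['year']}</year>"
        "</date>"
    )


print(date_to_xml("07/11/25"))
```

The two comments mark the only places where input and output differ, which is exactly the information the - and + marks carry in the grammar.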

transforming the grammar to recognise the serialisation gives:

  date: -"<date>", day, +"/", month, +"/", year, -"</date>".
  -day: -"<day>", d, d, -"</day>".
-month: -"<month>", d, d, -"</month>".
 -year: -"<year>", -"20", d, d, -"</year>".
    -d: ["0"-"9"].

Using this grammar and the regular ixml parser to parse

<date><day>07</day><month>11</month><year>2025</year></date>

gives

<date>07/11/25</date>

Surprise

If you transform a grammar twice in this way, the resulting grammar both recognises the original input format, and includes the necessary characters to do the serialisation, without needing to call on the XML serialisation process of ixml.

This would give:

 -date: +"<date>", day, -"/", month, -"/", year, +"</date>".
  -day: +"<day>", d, d, +"</day>".
-month: +"<month>", d, d, +"</month>".
 -year: +"<year>", +"20", d, d, +"</year>".
    -d: ["0"-"9"].

which for the same original input, produces the same original serialisation (modulo details explained in the original paper).
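
The effect of the doubly transformed grammar can be imitated with a very small interpreter. The sketch below is an assumption-laden simplification: the grammar is flattened by hand into one list of items, and the names `DATE_ITEMS` and `run` are invented for the example. It only shows that insertions (+) and deletions (-) alone are enough to produce the serialisation.

```python
DIGITS = "0123456789"

# Hand-flattened form of the grammar above:
# ("+", s) inserts s, ("-", s) matches and deletes s, ("d", None) copies a digit.
DATE_ITEMS = [
    ("+", "<date>"),
    ("+", "<day>"), ("d", None), ("d", None), ("+", "</day>"),
    ("-", "/"),
    ("+", "<month>"), ("d", None), ("d", None), ("+", "</month>"),
    ("-", "/"),
    ("+", "<year>"), ("+", "20"), ("d", None), ("d", None), ("+", "</year>"),
    ("+", "</date>"),
]


def run(items, text):
    pos, out = 0, []
    for kind, value in items:
        if kind == "+":                      # insertion: consumes nothing
            out.append(value)
        elif kind == "-":                    # deletion: consumes, emits nothing
            if not text.startswith(value, pos):
                raise ValueError(f"expected {value!r} at offset {pos}")
            pos += len(value)
        else:                                # digit: consumed and copied
            ch = text[pos:pos + 1]
            if ch == "" or ch not in DIGITS:
                raise ValueError(f"expected a digit at offset {pos}")
            out.append(ch)
            pos += 1
    if pos != len(text):
        raise ValueError("unexpected trailing input")
    return "".join(out)


print(run(DATE_ITEMS, "07/11/25"))
```

No XML-specific code appears anywhere in `run`: the angle brackets are just inserted text.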

Motivation

As the roundtripping paper remarks:

the ixml processor is now unbound from XML, and could be used to produce other serialisations in a fairly straightforward way.

This paper takes that observation, and explores the design of a generalised version of ixml that allows serialisation not only to XML, but also to other structured data formats.

Design

The purpose of the generalised invisible markup language is thus to

Example

The ixml for a simplified URL, which we will treat in different ways.

      url: scheme, -":", authority, path.
   scheme: letter+.
authority: -"//", host.
     host: sub++-".".
      sub: letter+.
     path: (-"/", seg)+.
      seg: (letter; ".")*.
  -letter: ["a"-"z"; "A"-"Z"; "0"-"9"].

With input https://invisiblexml.org/1.0/, we get

<url>
   <scheme>https</scheme>
   <authority>
      <host>
         <sub>invisiblexml</sub>
         <sub>org</sub>
      </host>
   </authority>
   <path>
      <seg>1.0</seg>
      <seg/>
   </path>
</url>

iml

Assume a version where all serialisation characters are in the grammar.

To duplicate the above just requires adding start and end tags:

   url: +"<url>", scheme, -":", authority, path, +"</url>".
scheme: +"<scheme>", letter+, +"</scheme>".

etc. There is one difference, namely this would generate

<seg></seg>

for the final empty segment. If the short version were required, then you could write:

-seg: +"<seg>", (letter; ".")+, +"</seg>";
      +"<seg/>".

Attributes

If we wanted to make scheme an attribute, we could write:

   -url: +"<url", scheme, +">", -":", authority, path, +"</url>".
-scheme: +" scheme='",  letter+, +"'".

This would then generate

<url scheme='https'>

as the first line of the serialisation.

New freedom

But now we have some more freedom than we did in ixml. For instance we could use the scheme as the name of the element:

   -url: +"<", scheme, +">", -":", authority, path, 
         +"</", +scheme, +">".
-scheme: letter+.

This uses + with a nonterminal, meaning "parse nothing, but serialise the nonterminal of this name at this position", and gives:

<https>
   <authority>
      <host>
         <sub>invisiblexml</sub>
         <sub>org</sub>
      </host>
   </authority>
   <path>
      <seg>1.0</seg>
      <seg/>
   </path>
</https>
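
To make the idea of reusing a parsed value in the output concrete, here is a hedged Python sketch of the same transformation. The regular expression and the function name `url_to_xml` are assumptions for illustration only, and the sketch writes empty segments as `<seg></seg>` rather than `<seg/>`.

```python
import re

# Hypothetical regular-expression rendering of the simplified URL grammar.
URL = re.compile(
    r"(?P<scheme>[A-Za-z0-9]+):"
    r"//(?P<host>[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*)"
    r"(?P<path>(?:/[A-Za-z0-9.]*)+)"
)


def url_to_xml(text: str) -> str:
    m = URL.fullmatch(text)
    if m is None:
        raise ValueError(f"not a URL this sketch understands: {text!r}")
    scheme = m["scheme"]
    subs = "".join(f"<sub>{s}</sub>" for s in m["host"].split("."))
    # The path always starts with "/", so the first split result is empty.
    segs = "".join(f"<seg>{s}</seg>" for s in m["path"].split("/")[1:])
    # The scheme is parsed once but written twice, like scheme and +scheme.
    return (
        f"<{scheme}>"
        f"<authority><host>{subs}</host></authority>"
        f"<path>{segs}</path>"
        f"</{scheme}>"
    )


print(url_to_xml("https://invisiblexml.org/1.0/"))
```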

Readability

In ixml, rules combine both the form of the input and of the serialisation.

Since the two are very closely related, and the reordering of attributes only happens implicitly, it is easy to see what is input, and what is output.

However, since rules in the new version have to include more, they quickly become hard to read, and it becomes hard to distinguish the input from the serialisation.

So a different form for rules is proposed here, separating input and output.

Variant

   url: scheme, ":", authority, path
        => "<", scheme, ">", authority, path, "</", scheme, ">".
scheme: letter+.

All serialisation 'marks' are gone.

Input and output formats are more obvious.

If input and output are identical, such as in scheme above, there is no need to specify a separate output part of the rule.

The Rest

authority: "//", host     => "<authority>", host, "</authority>".
     host: sub++"."       => "<host>", sub+, "</host>".
      sub: letter+        => "<sub>", letter+, "</sub>".
     path: ("/", seg)+    => "<path>", seg+, "</path>".
      seg: (letter; ".")* => "<seg>", (letter; ".")*, "</seg>".
   letter: ["a"-"z"; "A"-"Z"; "0"-"9"].

If a rule should produce no output then an empty output part is used:

 input: spaces, url, spaces.
spaces: " "* => .

Alternatives

Note that each alternative needs an output part, not just the whole rule:

 input: entry++" " => "<entries>", entry+, "</entries>".
 entry: number     => "<number>", number, "</number>";
        word       => "<word>", word, "</word>".
number: ["0"-"9"]+.
  word: [L]+.

In some cases, they can be combined

date: day, "/", month, "/", year => day, month, year;
      year, "-", month, "-", day => day, month, year.

can be combined to

date: (day, "/", month, "/", year;
       year, "-", month, "-", day) => day, month, year.

Multiple use of a nonterminal

Sometimes, the same nonterminal occurs more than once, such as s here:

mapping: name, ":", s, values, ".", s
      => "<", name, ">", s, values, s, "</", name, ">".

Each is taken in turn in the output; if there are more in the output than the input (such as name above), then it restarts from the first.
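
One possible reading of this rule, sketched in Python: each nonterminal keeps a cyclic queue of the strings it matched, and every use in the output takes the next one. The template representation, the function name `fill`, and the sample captures (including the value matched by `values`) are assumptions for the example, not part of the proposal.

```python
from itertools import cycle


def fill(template, captures):
    """Build the output for one rule.

    template: list of ("lit", text) or ("nt", name) items, in output order.
    captures: name -> list of strings matched in the input, in input order.
    """
    # cycle() restarts from the first match once the list is exhausted.
    sources = {name: cycle(values) for name, values in captures.items()}
    out = []
    for kind, value in template:
        out.append(value if kind == "lit" else next(sources[value]))
    return "".join(out)


# Output part of the mapping rule above.
MAPPING_OUT = [
    ("lit", "<"), ("nt", "name"), ("lit", ">"),
    ("nt", "s"), ("nt", "values"), ("nt", "s"),
    ("lit", "</"), ("nt", "name"), ("lit", ">"),
]

# Hypothetical matches for the input "colours: red." (second s is empty).
captures = {"name": ["colours"], "s": [" ", ""], "values": ["red"]}
print(fill(MAPPING_OUT, captures))
```

Here `name` is matched once but used twice, so the second use wraps around to the first match, while the two uses of `s` take the two matches in order.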

If it is necessary to use a specific nonterminal, extra rules can be added:

person: name1, " ", name2 => name2, ", ", name1.
 name1: name.
 name2: name.
  name: [L]+.

This would turn Steven Pemberton into Pemberton, Steven

Roundtripping

Roundtripping is now trivial.

Since input and output are separately described, you just swap the input and output sides: parse with the output part, and serialise with the input part.

E.g.:

 date: day, "/", month, "/", year
   => "<date>", year, month, day, "</date>".
  day: d, d => "<day>", d, d, "</day>".
month: d, d => "<month>", d, d, "</month>".
 year: d, d => "<year>", "20", d, d, "</year>".
    d: ["0"-"9"].
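
The swap can be sketched in a few lines of Python. This is a deliberately flattened model, assuming each field is exactly two digits; the names `FIELDS`, `SLASHED`, `TAGGED`, `to_regex` and `transform` are invented for the illustration.

```python
import re

FIELDS = {"day", "month", "year"}

# The two sides of the date rule, flattened: field names and literal text.
SLASHED = ["day", "/", "month", "/", "year"]
TAGGED = ["<date>", "<year>20", "year", "</year>",
          "<month>", "month", "</month>",
          "<day>", "day", "</day>", "</date>"]


def to_regex(side):
    # Fields become named groups of two digits; everything else is literal.
    return re.compile("".join(
        f"(?P<{item}>[0-9][0-9])" if item in FIELDS else re.escape(item)
        for item in side
    ))


def transform(text, parse_side, emit_side):
    m = to_regex(parse_side).fullmatch(text)
    if m is None:
        raise ValueError(f"input does not match: {text!r}")
    return "".join(m[item] if item in FIELDS else item for item in emit_side)


xml = transform("07/11/25", SLASHED, TAGGED)   # forwards
print(xml)
print(transform(xml, TAGGED, SLASHED))          # backwards: sides swapped
```

The same `transform` function serves both directions; only the order of its last two arguments changes.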

Ambiguity

As with ixml, a parse may be ambiguous.

While ixml does not define which parse to use, it may be useful to specify that the textually earliest serialisation be used.

For instance in the case of

date: (day, "/", month, "/", year;
       year, "-", month, "-", day) => day, month, year.

where roundtripping has a choice of two serialisations, it might be useful to be able to specify that it is the first of these that is used.

Further implementation work needs to be done to investigate this.

Ambiguity

It is also worth noting that there is now no longer a place to specify that a parse was ambiguous. Implementations may have to resort to using a Unix-style error output channel to report errors and warnings of this sort.

Other Uses

Although the inspiration for iml is invisible markup, there are other uses.

For instance, general editing, such as replacing all separating semicolons with commas, but not those in strings:

input: item++";" => item++",".
item: string; word; number.
string: '"', ~['"']*, '"'.
word: [L]+.
number: [Nd]+.
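
For comparison, a hand-written Python equivalent needs an explicit scanner to avoid touching semicolons inside strings. The sketch below assumes that `str.isalpha` and `str.isdecimal` are acceptable stand-ins for [L] and [Nd]; the function names are invented for the example.

```python
def split_items(text):
    """Split text into items separated by ';', respecting "..." strings."""
    items, pos = [], 0
    while True:
        start = pos
        if text.startswith('"', pos):
            # string: everything up to the next double quote
            end = text.find('"', pos + 1)
            if end < 0:
                raise ValueError("unterminated string")
            pos = end + 1
        else:
            # word (letters) or number (decimal digits)
            first = text[pos:pos + 1]
            same_kind = str.isalpha if first.isalpha() else str.isdecimal
            while pos < len(text) and same_kind(text[pos]):
                pos += 1
            if pos == start:
                raise ValueError(f"expected an item at offset {pos}")
        items.append(text[start:pos])
        if pos == len(text):
            return items
        if text[pos] != ";":
            raise ValueError(f"expected ';' at offset {pos}")
        pos += 1


def semicolons_to_commas(text):
    return ",".join(split_items(text))


print(semicolons_to_commas('a;"b;c";12'))
```

The five-line grammar says the same thing declaratively, which is the point of the comparison.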

Other Uses

Reversing a list of words:

words: word, s, words => words, " ", word;
       word.
    s: " "+ => " ".
 word: [L]+.
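
The recursion in this rule can be mirrored directly in Python. The sketch assumes that the recursion bottoms out in a single word, and it does not validate that words consist of letters; `reverse_words` is a name invented for the example.

```python
def reverse_words(text: str) -> str:
    # words: word, s, words  =>  words, " ", word
    word, sep, rest = text.partition(" ")
    if not sep:
        return word                      # a single word: nothing to reverse
    # s: " "+ => " "  (a run of spaces is replaced by one space)
    return reverse_words(rest.lstrip(" ")) + " " + word


print(reverse_words("one two  three"))
```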

In fact iml could be used in many of the cases where sed and grep are used now, with the additional advantage of being able to select on structure as well as text.

Structural Opportunities

In the original 2013 ixml paper, it was noted that

{"a": 1, "b": 2}

cannot be transformed [with ixml] into

<j><a>1</a><b>2</b></j>.

However, iml enables it to be done:

j: "{", member**", ", "}"  => "<j>", member*, "</j>".
member: name, ": ", number => "<", name, ">", number, "</", name, ">".
name: '"', letters, '"'    => letters.
letters: ["a"-"z"]+.
number: ["0"-"9"]+.
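
What makes this possible is that the member name, once parsed, can be written out as an element name. A minimal Python sketch of the same transformation, restricted like the grammar to lower-case names and unsigned integers (the patterns and the name `j_to_xml` are assumptions for the example):

```python
import re

# Hypothetical patterns mirroring the member and j rules.
MEMBER = re.compile(r'"([a-z]+)": ([0-9]+)')
OBJECT = re.compile(r'\{((?:"[a-z]+": [0-9]+(?:, "[a-z]+": [0-9]+)*)?)\}')


def j_to_xml(text: str) -> str:
    m = OBJECT.fullmatch(text)
    if m is None:
        raise ValueError(f"not an object this sketch understands: {text!r}")
    # Each name is used twice: as the start tag and as the end tag.
    members = "".join(
        f"<{name}>{number}</{name}>" for name, number in MEMBER.findall(m[1])
    )
    return f"<j>{members}</j>"


print(j_to_xml('{"a": 1, "b": 2}'))
```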

ixml as an application of iml

iml makes the implementation of ixml much easier: all that is needed is grammar transformation.

An ixml grammar is transformed into an iml grammar, in the same way that ixml round-tripping was done earlier.

The only special case is the need to deal with XML special characters, such as "<", in text. For instance, the ixml

line: c*, #a.
c: ~[#a].

needs to be transformed to

line: c*, #a => "<line>", c*, "</line>".
c: ~[#a; "<"];
   "<" => "&lt;".

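The effect of the transformed line grammar can be sketched in Python as follows. Like the grammar shown, the sketch escapes only "<"; other XML special characters would need the same treatment. The function name `line_to_xml` is an assumption for the example.

```python
def line_to_xml(text: str) -> str:
    # line: c*, #a  (any characters except newline, then one newline)
    if not text.endswith("\n") or "\n" in text[:-1]:
        raise ValueError("expected exactly one line ending in a newline")
    # c: "<" => "&lt;"  (every other character is copied unchanged)
    body = text[:-1].replace("<", "&lt;")
    # The terminating newline is matched but not written out.
    return f"<line>{body}</line>"


print(line_to_xml("if a < b then\n"))
```
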
Future work

This proposal is not yet complete.

Some details are still to be worked out, such as moving nodes from deep in the parse.

Future work

For instance:

entry: numeric => "<entry n='", numeric/n, "'/>";
       text    => "<entry t='", text/t, "'/>".
numeric: "(", n, ")".
text: "[", t, "]".
n: ["0"-"9"]+.
t: ~["[]"]+.

where

(123)

produces

<entry n='123'/>

and

[123]

produces

<entry t='123'/>
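
The intended behaviour, pulling a nested value up into an attribute of the enclosing element, can be sketched in Python. The patterns and the name `entry_to_xml` are assumptions for illustration; the attribute quoting follows the single quotes used in the grammar.

```python
import re

# Hypothetical patterns for the two alternatives of entry.
NUMERIC = re.compile(r"\(([0-9]+)\)")
TEXT = re.compile(r"\[([^\[\]]+)\]")


def entry_to_xml(text: str) -> str:
    m = NUMERIC.fullmatch(text)
    if m:
        # numeric/n: the n found inside numeric becomes the attribute value
        return f"<entry n='{m[1]}'/>"
    m = TEXT.fullmatch(text)
    if m:
        # text/t: the t found inside text becomes the attribute value
        return f"<entry t='{m[1]}'/>"
    raise ValueError(f"not an entry: {text!r}")


print(entry_to_xml("(123)"))
print(entry_to_xml("[123]"))
```

The brackets that distinguish the two inputs do not appear in the output; only the attribute name records which alternative matched.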

This has complications for roundtripping.

Conclusion