Invisible XML makes the implicit structure of textual documents explicit by parsing them, and then transforming the resultant parse tree into an abstract document. This abstract document is the essence of Invisible XML: it can be processed in many ways, though principally by serialising to XML.
However, as shown in an earlier paper on roundtripping ixml, it is possible to simultaneously simplify and generalise the ixml serialisation process, thereby opening it to other serialisations that are not hard-wired in the processor. By the same token, this further simplifies ixml proper, by reducing it to a simple transformation of grammars from ixml to an equivalent Invisible Markup grammar.
This talk discusses the changes needed to create a generalised Invisible Markup Language, explores the alternatives, and proposes future steps.
Textual documents have an implicit structure that is mostly recognisable for human readers, but for computers the structure has to be made explicit to enable processing of documents.
Thus SGML and XML (etc).
Invisible XML (ixml) works in a different way:
Extra information in the grammar controls the serialisation.
The abstract document is the central element of the Invisible XML process, since it can be used in several ways, though the principal one is serialising to XML.
However, there are other possible uses and serialisations.
A 2024 paper on roundtripping pointed out that you can reverse the serialisation process by using ixml to recognise the XML serialisation produced by ixml.
This is done by transforming grammars so that they include the necessary extra characters such as "<" and ">" added by serialisation, thus recreating an abstract document that would create the same XML serialisation.
Extra facilities are needed to do this in a general way, since some items (namely attributes) get implicitly moved to earlier in the serialisation than their position in the abstract document.
date: day, -"/", month, -"/", year.
day: d, d.
month: d, d.
year: +"20", d, d.
-d: ["0"-"9"].
with input
07/11/25
would generate
<date><day>07</day><month>11</month><year>2025</year></date>
Transforming this grammar to recognise the serialisation gives:
date: -"<date>", day, +"/", month, +"/", year, -"</date>".
-day: -"<day>", d, d, -"</day>".
-month: -"<month>", d, d, -"</month>".
-year: -"<year>", -"20", d, d, -"</year>".
-d: ["0"-"9"].
Using this grammar and the regular ixml parser to parse
<date><day>07</day><month>11</month><year>2025</year></date>
gives
<date>07/11/25</date>
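The round trip can be imitated by hand in Python. This is only a sketch: it hard-codes this one date grammar with regular expressions rather than using an ixml processor, and the names `serialise` and `recognise` are invented for illustration. `recognise` returns the text content that the transformed grammar would place inside the date element.

```python
import re

# Hand-coded stand-ins for the two date grammars; not an ixml processor.
DATE_INPUT = re.compile(r"([0-9]{2})/([0-9]{2})/([0-9]{2})")
DATE_XML = re.compile(
    r"<date><day>([0-9]{2})</day><month>([0-9]{2})</month>"
    r"<year>20([0-9]{2})</year></date>"
)


def serialise(text: str) -> str:
    """Mimic the original grammar: drop the slashes, insert "20" before the year."""
    match = DATE_INPUT.fullmatch(text)
    if match is None:
        raise ValueError(f"not a date: {text!r}")
    day, month, year = match.groups()
    return f"<date><day>{day}</day><month>{month}</month><year>20{year}</year></date>"


def recognise(xml: str) -> str:
    """Mimic the transformed grammar: drop the tags and "20", re-insert the slashes."""
    match = DATE_XML.fullmatch(xml)
    if match is None:
        raise ValueError(f"not a serialised date: {xml!r}")
    day, month, year = match.groups()
    return f"{day}/{month}/{year}"


if __name__ == "__main__":
    xml = serialise("07/11/25")
    print(xml)
    print(recognise(xml))
```

Feeding the output of `serialise` back into `recognise` recovers the original input text, which is the property the grammar transformation is designed to provide.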
If you transform a grammar twice in this way, the resulting grammar both recognises the original input format, and includes the necessary characters to do the serialisation, without needing to call on the XML serialisation process of ixml.
This would give:
-date: +"<date>", day, -"/", month, -"/", year, +"</date>".
-day: +"<day>", d, d, +"</day>".
-month: +"<month>", d, d, +"</month>".
-year: +"<year>", +"20", d, d, +"</year>".
-d: ["0"-"9"].
which for the same original input, produces the same original serialisation (modulo details explained in the original paper).
As the roundtripping paper remarks:
the ixml processor is now unbound from XML, and could be used to produce other serialisations in a fairly straightforward way.
This paper takes that observation, and explores the design of a generalised version of ixml that allows serialisation not only to XML, but also to other structured data formats.
The purpose of the generalised invisible markup language is thus to
As a running example, take the ixml for a simplified URL, which we will treat in different ways.
url: scheme, -":", authority, path.
scheme: letter+.
authority: -"//", host.
host: sub++-".".
sub: letter+.
path: (-"/", seg)+.
seg: (letter; ".")*.
-letter: ["a"-"z"; "A"-"Z"; "0"-"9"].
With input https://invisiblexml.org/1.0/, we get
<url>
<scheme>https</scheme>
<authority>
<host>
<sub>invisiblexml</sub>
<sub>org</sub>
</host>
</authority>
<path>
<seg>1.0</seg>
<seg/>
</path>
</url>
Assume a version where all serialisation characters are in the grammar.
To duplicate the above just requires adding start and end tags:
url: +"<url>", scheme, -":", authority, path, +"</url>".
scheme: +"<scheme>", letter+, +"</scheme>".
etc. There is one difference, namely this would generate
<seg></seg>
for the final empty segment. If the short version were required, then you could write:
-seg: +"<seg>", (letter; ".")+, +"</seg>";
+"<seg/>".
If we wanted to make scheme an attribute, we could write:
-url: +"<url", scheme, +">", -":", authority, path, +"</url>".
-scheme: +" scheme='", letter+, +"'".
This would then generate
<url scheme='https'>
as the first line of the serialisation.
But now we have some more freedom than we did in ixml. For instance we could use the scheme as the name of the element:
-url: +"<", scheme, +">", -":", authority, path,
+"</", +scheme, +">".
-scheme: letter+.
This uses + with a nonterminal, meaning "parse nothing, but serialise the nonterminal of this name at this position", and gives:
<https>
<authority>
<host>
<sub>invisiblexml</sub>
<sub>org</sub>
</host>
</authority>
<path>
<seg>1.0</seg>
<seg/>
</path>
</https>
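The effect of reusing the scheme as the element name can be sketched in Python. This is a hand-written recogniser for the simplified URL grammar only, not an implementation of the proposed language; the helper names `element` and `serialise` are hypothetical, and the output is produced without the indentation shown above.

```python
import re

# Hypothetical hand-written recogniser for the simplified URL grammar.
URL = re.compile(
    r"(?P<scheme>[A-Za-z0-9]+):"
    r"//(?P<host>[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*)"
    r"(?P<path>(?:/[A-Za-z0-9.]*)+)"
)


def element(name: str, content: str) -> str:
    """Wrap content in an element, using the short form when it is empty."""
    return f"<{name}>{content}</{name}>" if content else f"<{name}/>"


def serialise(url: str) -> str:
    match = URL.fullmatch(url)
    if match is None:
        raise ValueError(f"not a simplified URL: {url!r}")
    subs = "".join(element("sub", s) for s in match["host"].split("."))
    # The path always starts with "/", so drop the empty piece before it.
    segs = "".join(element("seg", s) for s in match["path"].split("/")[1:])
    authority = element("authority", element("host", subs))
    # The scheme is reused as the element name, in both start and end tag.
    return element(match["scheme"], authority + element("path", segs))


if __name__ == "__main__":
    print(serialise("https://invisiblexml.org/1.0/"))
```

The point of the sketch is the last line of `serialise`: the parsed scheme is used twice in the output although it occurs only once in the input.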
In ixml, rules combine both the form of the input and of the serialisation.
Since the two are very closely related, and the reordering of attributes only happens implicitly, it is easy to see what is input, and what is output.
However, since more is required to be included in rules in the new version, it quickly becomes hard to read, and distinguish the input from the serialisation.
So a different form for rules is proposed here, separating input and output.
url: scheme, ":", authority, path
=> "<", scheme, ">", authority, path, "</", scheme, ">".
scheme: letter+.
All serialisation 'marks' are gone.
Input and output formats are more obvious.
If input and output are identical, such as in scheme above,
there is no need to specify a separate output part of the rule.
authority: "//", host => "<authority>", host, "</authority>".
host: sub++"." => "<host>", sub+, "</host>".
sub: letter+ => "<sub>", letter+, "</sub>".
path: ("/", seg)+ => "<path>", seg+, "</path>".
seg: (letter; ".")* => "<seg>", (letter; ".")*, "</seg>".
letter: ["a"-"z"; "A"-"Z"; "0"-"9"].
If a rule should produce no output then an empty output part is used:
input: spaces, url, spaces.
spaces: " "* => .
Note that each alternative needs an output part, not just the whole rule:
input: entry++" " => "<entries>", entry+, "</entries>".
entry: number => "<number>", number, "</number>";
word => "<word>", word, "</word>".
number: ["0"-"9"]+.
word: [L]+.
In some cases, alternatives that share the same output part can be combined. For instance,
date: day, "/", month, "/", year => day, month, year;
year, "-", month, "-", day => day, month, year.
can be combined to
date: (day, "/", month, "/", year;
year, "-", month, "-", day) => day, month, year.
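As a rough illustration of two input forms sharing one output, the following Python sketch accepts either notation and emits day, month and year in that order. It assumes, as in the earlier date grammar, that day, month and year are each two digits; the function name `normalise` is invented here.

```python
import re

# Assumption: day, month and year are two digits each, as in the earlier grammar.
SLASHED = re.compile(r"([0-9]{2})/([0-9]{2})/([0-9]{2})")  # day/month/year
DASHED = re.compile(r"([0-9]{2})-([0-9]{2})-([0-9]{2})")   # year-month-day


def normalise(text: str) -> str:
    """Accept either input alternative and produce the single shared output."""
    match = SLASHED.fullmatch(text)
    if match is not None:
        day, month, year = match.groups()
    else:
        match = DASHED.fullmatch(text)
        if match is None:
            raise ValueError(f"not a date: {text!r}")
        year, month, day = match.groups()
    # Output part: day, month, year (the separators are not serialised).
    return day + month + year


if __name__ == "__main__":
    print(normalise("07/11/25"))
    print(normalise("25-11-07"))
```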
Sometimes the same nonterminal occurs more than once, such as s here:
mapping: name, ":", s, values, ".", s
=> "<", name, ">", s, values, s, "</", name, ">".
Each occurrence is taken in turn in the output; if there are more occurrences in the output than in the input (such as name above), then selection restarts from the first.
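The "take each in turn, then restart from the first" rule can be sketched in Python as follows. The data representation is an assumption made for this illustration: the parse is a flat list of (nonterminal, text) pairs, the output part is a list of literal and reference items, and the sample input values are invented.

```python
from collections import defaultdict
from itertools import cycle


def fill_output(template, parsed):
    """Fill an output template from parsed (nonterminal, text) pairs.

    `template` is a list of ("lit", text) or ("ref", name) items.
    Each reference takes the next parsed occurrence of that nonterminal,
    restarting from the first when the occurrences run out.
    """
    occurrences = defaultdict(list)
    for name, text in parsed:
        occurrences[name].append(text)
    suppliers = {name: cycle(texts) for name, texts in occurrences.items()}
    return "".join(
        value if kind == "lit" else next(suppliers[value])
        for kind, value in template
    )


# Output part of the mapping rule: "<", name, ">", s, values, s, "</", name, ">"
MAPPING_TEMPLATE = [
    ("lit", "<"), ("ref", "name"), ("lit", ">"),
    ("ref", "s"), ("ref", "values"), ("ref", "s"),
    ("lit", "</"), ("ref", "name"), ("lit", ">"),
]

# Invented parse result: one name, two different s occurrences, one values.
PARSED = [("name", "size"), ("s", " "), ("values", "1 2"), ("s", "\n")]

if __name__ == "__main__":
    print(repr(fill_output(MAPPING_TEMPLATE, PARSED)))
```

Here the two s occurrences are used in order, while the single name is used for the start tag and then, after restarting, for the end tag. A reference to a nonterminal that never occurred in the input raises `KeyError` in this sketch.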
If it is necessary to use a specific nonterminal, extra rules can be added:
person: name1, " ", name2 => name2, ", ", name1.
name1: name.
name2: name.
name: [L]+.
This would turn Steven Pemberton into Pemberton, Steven.
Roundtripping is now trivial.
Since input and output are separately described, you just swap the input and output sides: parse with the output part, and serialise with the input part.
E.g.:
date: day, "/", month, "/", year
=> "<date>", year, month, day, "</date>".
day: d, d => "<day>", d, d, "</day>".
month: d, d => "<month>", d, d, "</month>".
year: d, d => "<year>", "20", d, d, "</year>".
d: ["0"-"9"].
As with ixml, a parse may be ambiguous.
While ixml does not define which parse to use, it may be useful to specify that the textually earliest serialisation be used.
For instance in the case of
date: (day, "/", month, "/", year;
year, "-", month, "-", day) => day, month, year.
where roundtripping has a choice of two serialisations, it might be useful to be able to specify that it is the first of these that is used.
Further implementation work needs to be done to investigate this.
It is also worth noting that there is now no longer a place to specify that a parse was ambiguous. Implementations may have to resort to using a Unix-style error output channel to report errors and warnings of this sort.
Although the inspiration for iml is invisible markup, there are other uses.
For instance, general editing, such as replacing all separating semicolons with commas, but not those in strings:
input: item++";" => item++",".
item: string; word; number.
string: '"', ~['"']*, '"'.
word: [L]+.
number: [Nd]+.
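For comparison, the same edit can be sketched with Python regular expressions. This is an approximation rather than an implementation of the proposed language: `[^\W\d_]` only approximates the Unicode class [L], `\d` approximates [Nd], and the function name is invented.

```python
import re

# One item: a double-quoted string, a run of letters, or a run of decimal digits.
# [^\W\d_] approximates the Unicode class [L]; \d approximates [Nd].
ITEM = r'"[^"]*"|[^\W\d_]+|\d+'
INPUT = re.compile(rf'(?:{ITEM})(?:;(?:{ITEM}))*')
TOKEN = re.compile(rf'({ITEM})|;')


def semicolons_to_commas(text: str) -> str:
    """Replace the separating semicolons with commas, leaving strings intact."""
    if INPUT.fullmatch(text) is None:
        raise ValueError(f"not a semicolon-separated list: {text!r}")
    # Items are copied unchanged; only the separators between them are replaced.
    return TOKEN.sub(lambda m: m.group(1) or ",", text)


if __name__ == "__main__":
    print(semicolons_to_commas('a;"b;c";42'))
```

A semicolon inside a string is consumed as part of the string item, so it never reaches the separator branch of the substitution.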
Reversing a list of words:
words: word, s, words => words, " ", word;
       word.
s: " "+ => " ".
word: [L]+.
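The recursive rule can be mirrored directly in Python. The sketch below assumes that a single word on its own is left unchanged, which serves as the base case of the recursion; `[^\W\d_]` again only approximates [L], and the function name is hypothetical.

```python
import re

WORD = re.compile(r"[^\W\d_]+")  # approximates [L]+
SPACES = re.compile(r" +")       # s: " "+


def reverse_words(text: str) -> str:
    """words: word, s, words => words, " ", word (a single word is unchanged)."""
    word = WORD.match(text)
    if word is None:
        raise ValueError(f"expected a word: {text!r}")
    rest = text[word.end():]
    if not rest:
        return word.group()
    spaces = SPACES.match(rest)
    if spaces is None:
        raise ValueError(f"expected spaces: {rest!r}")
    # Output part: the reversed remainder first, then a single space, then this word.
    return reverse_words(rest[spaces.end():]) + " " + word.group()


if __name__ == "__main__":
    print(reverse_words("one  two three"))
```

As in the grammar, runs of spaces in the input are normalised to a single space in the output.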
In fact iml could be used for many similar cases to how sed and grep are used, but with an additional advantage of being able to select on structure as well as text.
In the original 2013 ixml paper, it was noted that
{"a": 1, "b": 2}
cannot be transformed [with ixml] into
<j><a>1</a><b>2</b></j>.
However, iml enables it to be done:
j: "{", member**", ", "}" => "<j>", member*, "</j>".
member: name, ": ", number => "<", name, ">", number, "</", name, ">".
name: '"', letters, '"' => letters.
letters: ["a"-"z"]+.
number: ["0"-"9"]+.
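The behaviour this grammar describes can be imitated in Python for this restricted form of JSON only (lower-case names, unsigned integers, and ", " as the exact separator). It is a hand-written sketch, not an iml processor, and the function name is invented.

```python
import re

# member: name, ": ", number, with name a quoted run of lower-case letters.
MEMBER = r'"[a-z]+": [0-9]+'
OBJECT = re.compile(r"\{(?:" + MEMBER + r"(?:, " + MEMBER + r")*)?\}")
PAIR = re.compile(r'"([a-z]+)": ([0-9]+)')


def json_to_xml(text: str) -> str:
    """Use each member name as the element name for its value."""
    if OBJECT.fullmatch(text) is None:
        raise ValueError(f"not in the restricted JSON form: {text!r}")
    members = "".join(
        f"<{name}>{number}</{name}>" for name, number in PAIR.findall(text)
    )
    return f"<j>{members}</j>"


if __name__ == "__main__":
    print(json_to_xml('{"a": 1, "b": 2}'))
```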
iml makes the implementation of ixml much easier: all that is needed is grammar transformation.
An ixml grammar is transformed into an iml grammar, in the same way that ixml round-tripping was done earlier.
The only special case is the need to deal with XML special characters, such as "<", in text. For instance, the ixml
line: c*, #a. c: ~[#a].
needs to be transformed to
line: c*, #a => "<line>", c*, "</line>".
c: ~[#a; "<"];
   "<" => "&lt;".
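The special case amounts to escaping while copying. A minimal Python sketch of the same idea, with an invented function name, is shown below. It escapes only "<", as in the example; a fuller treatment would presumably also need to handle "&".

```python
def serialise_line(line: str) -> str:
    """line: c*, #a, with each "<" in the content serialised as "&lt;"."""
    if not line.endswith("\n"):
        raise ValueError("a line must end with a newline (#a)")
    body = line[:-1]
    if "\n" in body:
        raise ValueError("a line may contain only one newline, at the end")
    # Every character is copied as is, except "<", which is escaped.
    return "<line>" + body.replace("<", "&lt;") + "</line>"


if __name__ == "__main__":
    print(serialise_line("a < b\n"))
```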
This proposal is not yet complete.
Some details are still to be worked out, such as moving nodes from deep in the parse.
For instance:
entry: numeric => "<entry n='", numeric/n, "'/>";
       text => "<entry t='", text/t, "'/>".
numeric: "(", n, ")".
text: "[", t, "]".
n: ["0"-"9"]+.
t: ~["[]"]+.
where
(123)
produces
<entry n='123'/>
and
[123]
produces
<entry t='123'/>
This has complications for roundtripping.
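The intended effect of reaching into a child for n or t can be sketched in Python. The sketch hard-codes the two entry forms and says nothing about how a general mechanism would work or how it would be reversed; the function name is invented, and single quotes are used for the attribute values, as in the output parts of the rules.

```python
import re

NUMERIC = re.compile(r"\(([0-9]+)\)")   # numeric: "(", n, ")".
TEXT = re.compile(r"\[([^\[\]]+)\]")    # text: "[", t, "]".


def serialise_entry(text: str) -> str:
    """Lift the inner n or t out of its wrapper and into an attribute."""
    match = NUMERIC.fullmatch(text)
    if match is not None:
        return f"<entry n='{match.group(1)}'/>"
    match = TEXT.fullmatch(text)
    if match is not None:
        return f"<entry t='{match.group(1)}'/>"
    raise ValueError(f"not an entry: {text!r}")


if __name__ == "__main__":
    print(serialise_entry("(123)"))
    print(serialise_entry("[123]"))
```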