My tutor at University was Dick Grimsdale, who built the world's first transistorised computer.
His tutor was Turing.
I later, and coincidentally, worked in Turing's old department on the 5th computer in the series that Turing worked on.
ixml is a language for representation transformations.
It takes a (textual) document with implicit structure
And produces an abstract document with explicit structure
That can then be used in different ways, typically serialised to XML
Text:
30/06/2024
Description:
date: day, "/", month, "/", year. day: d, d. month: d, d. year: d, d, d, d. -d: ["0"-"9"].
Output:
<date> <day>30</day>/ <month>06</month>/ <year>2024</year> </date>
The question is: in the general case, can we get back to the text version from the structured version?
As you will shortly see, there are obstacles:
So it is impossible in general to create a character-perfect version of the input. For this reason, we define round-tripping as
Creating an input that would produce the identical output.
Example: programming language where the ixml deletes comments and spurious whitespace. We get back the same program, but without the comments and extra spaces, but otherwise the same program.
Simple case, we have already seen.
30/06/2024
Output:
<date> <day>30</day>/ <month>06</month>/ <year>2024</year> </date>
If nothing is inserted or deleted, and only elements are used, round-tripping is easy: concatenate the text nodes. Done.
In these cases it is possible to perfectly round-trip documents.
That is a choice that can be made when designing a grammar.
ixml has facilities to control the serialisation format.
For instance, the slash characters "/" in the input are there only as punctuation to separate the different parts, but play no role in the output, and so can be deleted from the serialisation:
date: day, -"/", month, -"/", year.
giving
<date> <day>30</day> <month>06</month> <year>2024</year> </date>
Now we can no longer just concatentate the text nodes.
Similarly, characters can be inserted into the serialisation. Suppose the input date format only had two digits for the year, but the output serialisation required four. You can write this:
date: day, -"/", month, -"/", year. day: d, d. month: d, d. year: +"20", d, d. -d: ["0"-"9"].
which for the input 30/06/24
would give:
<date> <day>30</day> <month>06</month> <year>2024</year> </date>
Finally, elements can be serialised as attributes:
date: day, -"/", month, -"/", year. day: d, d. month: d, d. @year: +"20", d, d. -d: ["0"-"9"].
which gives
<date year='2024'> <day>30</day> <month>06</month> </date>
which reorders the input.
The approach: transform the input grammar to a different ixml grammar that recognises all possible serialisations of the input.
Use the new grammar to parse the serialisation, using the standard ixml parser, to recreate the parse tree that produced that serialisation.
Serialise the resulting parse tree (as text).
As an example, for the first date grammar above
date: day, "/", month, "/", year. day: d, d. month: d, d. year: d, d, d, d. -d: ["0"-"9"].
a grammar can be generated that recognises the serialisation:
date: -"<date>", day, "/", month, "/", year, -"</date>". -day: -"<day>", d, d, -"</day>". -month: -"<month>", d, d, -"</month>". -year: -"<year>", d, d, d, d, -"</year>". -d: ["0"-"9"].
Use this and the regular ixml parser to parse the output XML:
<date>30/06/2024</date>
We'll talk about those date
tags shortly.
In the second example everything is the same except the rule:
date: day, -"/", month, -"/", year.
with a transformed equivalent rule:
date: -"<date>", day, +"/", month, +"/", year, -"</date>".
And finally (for now), the version with added output characters
year: +"20", d, d.
has transformation:
-year: -"<year>", -"20", d, d, -"</year>".
giving
<date>30/06/24</date>
So from these three examples we can see some patterns for the transformations emerging:
The main problem left is with attributes, which turn up at a different position in the serialisation. For instance the grammar
date: day, -"/", month, -"/", year. day: d, d. month: d, d. @year: +"20", d, d. -d: ["0"-"9"].
generates the output
<date year='2024'> <day>30</day> <month>06</month> </date>
<date year='2024'> <day>30</day> <month>06</month> </date>
which can be recognised with
date: -"<date", year, -">", day, +"/", month, +"/", -"</date>". -day: -"<day>", d, d, -"</day>". -month: -"<month>", d, d, -"</month>". -year: -" year='", -"20", d, d, -"'". -d: ["0"-"9"].
but would produce on output
<date>2430/06/</date>
In other words it turns up in the wrong place in the round-tripping. A mechanism is needed for putting it back in the right place.
In normal ixml processing the grammar
This is done in two-passes: when serialising an element, children of the element, and of hidden children elements are traversed first to find rules marked as attributes, which are then serialised before the element children.
To reverse this process, we must recognise the attribute at its implicit position in the input, and explicitly identify in the grammar the place where it needs to be serialised; this will always be later in the output than where the attribute was found.
We can do this by defining two extra marks: one to indicate that the attribute as parsed should not be serialised at that position, and another to indicate the position where it should be serialised.
This is comparable to how terminals are inserted and deleted on serialisation:
-"abc"
means "parse the string but serialise
nothing"+"abc"
means "parse nothing, but serialise the
string" .Unfortunately, "-
" has already been assigned a different role
for nonterminals on serialisation, but "+
" is free to use with the
same meaning.
So using the mark "*
" to mean "parse the input, but
serialise nothing", and "+
" to mean "parse nothing but
serialise the node of this name from earlier in the tree marked with a
"*
"", we can specify:
date: -"<date", *year, -">", day, +"/", month, +"/", +year, -"</date>". -day: -"<day>", d, d, -"</day>". -month: -"<month>", d, d, -"</month>". -year: -" year='", -"20", d, d -"'". -d: ["0"-"9"].
This now correctly produces
<date>30/06/24</date>
Note that these new marks are not changes to the ixml language itself, but only internal additions: the transformations are done on internal representations of the grammar, and not on external representations.
If the input being parsed by ixml might not be correct, then the grammar needs to be strict. With dates for example, for months you must then only accept single digits in the range 1 to 9, and double digits up to 12.
month: "0"?, ["1"-"9"]; "10"; "11"; "12".
Another example is the grammar for ixml itself, which clearly has to be strict.
On the other hand, if it is certain that the input being processed with ixml is correct, the grammar can be laxer in what it accepts.
If dates are only being recognised, and not checked for correctness, then
"d, d
" is a perfectly good pattern for recognising the month
number.
Even though this would also recognise 13
or 99
,
since such dates never occur, you don't have to worry about them.
Such grammars are referred to as permissive.
Since we are sure that the XML we are parsing is correct, this makes transformations much easier to write, and we can create permissive grammars.
For instance, we don't have to worry about whether an element can be empty
or not. We can supply an alternative to recognise "<a/>
", if
it occurs it matches, and if it doesn't, it doesn't.
ixml includes extensions such as ?
for options, *
,
**
, +
, and ++
for repetition, and
(
and )
for grouping.
These extensions add expressiveness but don't add recognition power: they can all be straightforwardly transformed into grammars without the extensions, while recognising the same language.
As a consequence, the round-tripping process doesn't have to take the extensions into account: either the initial transformations have already been done, or it can do them itself.
Either way only simple grammar rules need to be transformed.
In regular ixml processing it can occur that an input document matches the grammar in more than one way.
The ixml processor is required to only serialise one of the possible matches, but must also report that the result is ambiguous: that the serialisation is only one of the possible interpretations of the input.
In a similar way, round-tripping can also be ambiguous, in this case meaning "this is only one of the possible strings that can produce the same serialisation".
In fact, many inputs that are not ambiguous can produce ambiguous round-trips.
As a simple example, if symbols on the input may be separated by one or more spaces, that are then deleted on serialisation:
input: sym**spaces. spaces: -" "+.
then on round-tripping any number of spaces would be acceptable.
This can best be solved by producing the minimum acceptable on
round-tripping: one for "+
", and zero for "*
".
In a similar way:
-optional-a: -"a"?.
is ambiguous, because nothing is produced in the output, so we don't know if
an "a
" was present or not in the input. In this case, either
option is acceptable, though the shorter case is probably preferable.
Ambiguity is marked in ixml by adding an attribute to the root element.
This is why it is necessary to retain the root element on output of the round-trip: the output is (flat) XML anyway, and we need somewhere to report ambiguity.
Some implementations may produce a "pretty printed" serialisation of the XML, with added newlines and indentation.
While recognising this could be added to the transformed grammar, it adds a danger of extra ambiguity, since elements may already have whitespace in them as part of the serialisation.
For that reason it is better to round-trip XML without added whitespace.
This is all work in progress
The code to produce the transformed grammars and the code used for serialisation are very similar.
This is not surprising, since both walk a tree, and one outputs the serialisation, while the other outputs a grammar that recognises that serialisation.
Consequently the two processes are almost isomorphic.
Which raises a question...
What happens if you take the transformed grammar, and transform it again? What does it produce?
What happens if you take the transformed grammar, and transform it again? What does it produce?
Well, it round-trips the round-trip: it produces a serialisation that would have produced the round-tripped text. In other words, it serialises the input as XML!
This already works in basic cases, for instance, round-tripping the original example twice, you get
<date><day>30</day><month>06</month><year>2024</year></date>
(The <
s are because it is outputting text that
represents XML, but not XML itself. For readability, here is that output with
the <
s expanded:)
<date><day>30</day><month>06</month><year>2024</year></date>
Serialisation of ixml could be greatly simplified, only having to output characters, and not having to worry about elements and attributes.
Serialisation could be then done by twice transforming the input grammar, and using that instead.
Transformations are currently only partially isomorphic: for instance there
is no reverse operation for the *
mark (which would say
"recognise nothing, but serialise the rule with this name from later in the
parse").
A second corollary however is that the ixml processor is now unbound from XML, and could be used to produce other serialisations in a fairly straightforward way.
Although the additions to ixml to enable round-tripping were added to the internal form of ixml, and not the external form, if these facilities were to be made available to users as well, some marks would be needed to be added to the language.
These are:
Contrary to expectations, it is possible to round-trip ixml, for a suitable definition of round-tripping, and with a slightly adapted ixml serialisation.
This technique is very simple, and not only that, actually allows the ixml processor to be simplified, and to some extent generalised.
An added advantage is that, with some work that still has to be done, the process is entirely reversible.
Declarative Amsterdam, 7-8 November
Call for presentations now open