Round-tripping ixml

Steven Pemberton, CWI, Amsterdam

The Author

Contents

About me

My tutor at University was Dick Grimsdale, who built the world's first transistorised computer.

His tutor was Turing.

I later, and coincidentally, worked in Turing's old department on the 5th computer in the series that Turing worked on.

MU5

70 years

Alan Turing

ixml: a reminder

ixml is a language for representation transformations.

It takes a (textual) document with implicit structure

And produces an abstract document with explicit structure

That can then be used in different ways, typically serialised to XML

The process

Very simple example: Dates

Text:

30/06/2024

Description:

 date: day, "/", month, "/", year.
  day: d, d.
month: d, d.
 year: d, d, d, d.
   -d: ["0"-"9"].

Output:

<date>
   <day>30</day>/
   <month>06</month>/
   <year>2024</year>
</date>

Round-tripping

The question is: in the general case, can we get back to the text version from the structured version?

As you will shortly see, there are obstacles:

Definition

So it is impossible in general to create a character-perfect version of the input. For this reason, we define round-tripping as

Creating an input that would produce the identical output.

Example: programming language where the ixml deletes comments and spurious whitespace. We get back the same program, but without the comments and extra spaces, but otherwise the same program.

Roundtripping

The problem space

Simple case, we have already seen.

30/06/2024

Output:

<date>
   <day>30</day>/
   <month>06</month>/
   <year>2024</year>
</date>

If nothing is inserted or deleted, and only elements are used, round-tripping is easy: concatenate the text nodes. Done.

In these cases it is possible to perfectly round-trip documents.

That is a choice that can be made when designing a grammar.

Problem: Deletions

ixml has facilities to control the serialisation format.

For instance, the slash characters "/" in the input are there only as punctuation to separate the different parts, but play no role in the output, and so can be deleted from the serialisation:

date: day, -"/", month, -"/", year.

giving

<date>
   <day>30</day>
   <month>06</month>
   <year>2024</year>
</date>

Now we can no longer just concatentate the text nodes.

Problem: Insertions

Similarly, characters can be inserted into the serialisation. Suppose the input date format only had two digits for the year, but the output serialisation required four. You can write this:

 date: day, -"/", month, -"/", year.
  day: d, d.
month: d, d.
 year: +"20", d, d.
   -d: ["0"-"9"].

which for the input 30/06/24 would give:

<date>
   <day>30</day>
   <month>06</month>
   <year>2024</year>
</date>

Problem: Attributes

Finally, elements can be serialised as attributes:

 date: day, -"/", month, -"/", year.
  day: d, d.
month: d, d.
@year: +"20", d, d.
   -d: ["0"-"9"].

which gives

<date year='2024'>
   <day>30</day>
   <month>06</month>
</date>

which reorders the input.

Round-tripping through grammar transformation

The approach: transform the input grammar to a different ixml grammar that recognises all possible serialisations of the input.

Use the new grammar to parse the serialisation, using the standard ixml parser, to recreate the parse tree that produced that serialisation.

Serialise the resulting parse tree (as text).

Simple case

As an example, for the first date grammar above

  date: day, "/", month, "/", year.
   day: d, d.
 month: d, d.
  year: d, d, d, d.
    -d: ["0"-"9"].

a grammar can be generated that recognises the serialisation:

  date: -"<date>", day, "/", month, "/", year, -"</date>".
  -day: -"<day>", d, d, -"</day>".
-month: -"<month>", d, d, -"</month>".
 -year: -"<year>", d, d, d, d, -"</year>".
    -d: ["0"-"9"].

Use this and the regular ixml parser to parse the output XML:

<date>30/06/2024</date>

We'll talk about those date tags shortly.

Deletions

In the second example everything is the same except the rule:

date: day, -"/", month, -"/", year.

with a transformed equivalent rule:

date: -"<date>", day, +"/", month, +"/", year, -"</date>".

Insertions

And finally (for now), the version with added output characters

year: +"20", d, d.

has transformation:

-year: -"<year>", -"20", d, d, -"</year>".

giving

<date>30/06/24</date>

Patterns

So from these three examples we can see some patterns for the transformations emerging:

Attributes

The main problem left is with attributes, which turn up at a different position in the serialisation. For instance the grammar

 date: day, -"/", month, -"/", year.
  day: d, d.
month: d, d.
@year: +"20", d, d.
   -d: ["0"-"9"].

generates the output

<date year='2024'>
   <day>30</day>
   <month>06</month>
</date>

Attributes

<date year='2024'>
   <day>30</day>
   <month>06</month>
</date>

which can be recognised with

  date: -"<date", year, -">", day, +"/", month, +"/", -"</date>".
  -day: -"<day>", d, d, -"</day>".
-month: -"<month>", d, d, -"</month>".
 -year: -" year='", -"20", d, d, -"'".
    -d: ["0"-"9"].

but would produce on output

<date>2430/06/</date>

In other words it turns up in the wrong place in the round-tripping. A mechanism is needed for putting it back in the right place.

How ixml serialises attributes

In normal ixml processing the grammar

This is done in two-passes: when serialising an element, children of the element, and of hidden children elements are traversed first to find rules marked as attributes, which are then serialised before the element children.

Reverse attributes

To reverse this process, we must recognise the attribute at its implicit position in the input, and explicitly identify in the grammar the place where it needs to be serialised; this will always be later in the output than where the attribute was found.

We can do this by defining two extra marks: one to indicate that the attribute as parsed should not be serialised at that position, and another to indicate the position where it should be serialised.

Comparison

This is comparable to how terminals are inserted and deleted on serialisation:

Unfortunately, "-" has already been assigned a different role for nonterminals on serialisation, but "+" is free to use with the same meaning.

Marks

So using the mark "*" to mean "parse the input, but serialise nothing", and "+" to mean "parse nothing but serialise the node of this name from earlier in the tree marked with a "*"", we can specify:

  date: -"<date", *year, -">", 
             day, +"/", month, +"/", +year, -"</date>".
  -day: -"<day>", d, d, -"</day>".
-month: -"<month>", d, d, -"</month>".
 -year: -" year='", -"20", d, d -"'".
    -d: ["0"-"9"].

This now correctly produces

<date>30/06/24</date>

Additions

Note that these new marks are not changes to the ixml language itself, but only internal additions: the transformations are done on internal representations of the grammar, and not on external representations.

Strict and Permissive Grammars

If the input being parsed by ixml might not be correct, then the grammar needs to be strict. With dates for example, for months you must then only accept single digits in the range 1 to 9, and double digits up to 12.

month: "0"?, ["1"-"9"]; "10"; "11"; "12".

Another example is the grammar for ixml itself, which clearly has to be strict.

Permissive grammars

On the other hand, if it is certain that the input being processed with ixml is correct, the grammar can be laxer in what it accepts.

If dates are only being recognised, and not checked for correctness, then "d, d" is a perfectly good pattern for recognising the month number.

Even though this would also recognise 13 or 99, since such dates never occur, you don't have to worry about them.

Such grammars are referred to as permissive.

Since we are sure that the XML we are parsing is correct, this makes transformations much easier to write, and we can create permissive grammars.

For instance, we don't have to worry about whether an element can be empty or not. We can supply an alternative to recognise "<a/>", if it occurs it matches, and if it doesn't, it doesn't.

Syntactic Equivalences

ixml includes extensions such as ? for options, *, **, +, and ++ for repetition, and ( and ) for grouping.

These extensions add expressiveness but don't add recognition power: they can all be straightforwardly transformed into grammars without the extensions, while recognising the same language.

As a consequence, the round-tripping process doesn't have to take the extensions into account: either the initial transformations have already been done, or it can do them itself.

Either way only simple grammar rules need to be transformed.

Ambiguity

In regular ixml processing it can occur that an input document matches the grammar in more than one way.

The ixml processor is required to only serialise one of the possible matches, but must also report that the result is ambiguous: that the serialisation is only one of the possible interpretations of the input.

Ambiguity on round-tripping

In a similar way, round-tripping can also be ambiguous, in this case meaning "this is only one of the possible strings that can produce the same serialisation".

In fact, many inputs that are not ambiguous can produce ambiguous round-trips.

Example

As a simple example, if symbols on the input may be separated by one or more spaces, that are then deleted on serialisation:

input: sym**spaces.
spaces: -" "+.

then on round-tripping any number of spaces would be acceptable.

This can best be solved by producing the minimum acceptable on round-tripping: one for "+", and zero for "*".

In a similar way:

-optional-a: -"a"?.

is ambiguous, because nothing is produced in the output, so we don't know if an "a" was present or not in the input. In this case, either option is acceptable, though the shorter case is probably preferable.

Reporting ambiguity

Ambiguity is marked in ixml by adding an attribute to the root element.

This is why it is necessary to retain the root element on output of the round-trip: the output is (flat) XML anyway, and we need somewhere to report ambiguity.

Inserted Layout

Some implementations may produce a "pretty printed" serialisation of the XML, with added newlines and indentation.

While recognising this could be added to the transformed grammar, it adds a danger of extra ambiguity, since elements may already have whitespace in them as part of the serialisation.

For that reason it is better to round-trip XML without added whitespace.

Future Work

This is all work in progress

Similarities between Serialisation and Transformation

The code to produce the transformed grammars and the code used for serialisation are very similar.

This is not surprising, since both walk a tree, and one outputs the serialisation, while the other outputs a grammar that recognises that serialisation.

Consequently the two processes are almost isomorphic.

Which raises a question...

Double transformation

What happens if you take the transformed grammar, and transform it again? What does it produce?

Double transformation

What happens if you take the transformed grammar, and transform it again? What does it produce?

Well, it round-trips the round-trip: it produces a serialisation that would have produced the round-tripped text. In other words, it serialises the input as XML!

This already works in basic cases, for instance, round-tripping the original example twice, you get

&lt;date>&lt;day>30&lt;/day>&lt;month>06&lt;/month>&lt;year>2024&lt;/year>&lt;/date>

(The &lt;s are because it is outputting text that represents XML, but not XML itself. For readability, here is that output with the &lt;s expanded:)

<date><day>30</day><month>06</month><year>2024</year></date>

Process

The full round-tripping cycle

Corollary

Serialisation of ixml could be greatly simplified, only having to output characters, and not having to worry about elements and attributes.

Serialisation could be then done by twice transforming the input grammar, and using that instead.

Transformations are currently only partially isomorphic: for instance there is no reverse operation for the * mark (which would say "recognise nothing, but serialise the rule with this name from later in the parse").

A second corollary however is that the ixml processor is now unbound from XML, and could be used to produce other serialisations in a fairly straightforward way.

Possible additions to ixml

Although the additions to ixml to enable round-tripping were added to the internal form of ixml, and not the external form, if these facilities were to be made available to users as well, some marks would be needed to be added to the language.

These are:

Conclusions

Contrary to expectations, it is possible to round-trip ixml, for a suitable definition of round-tripping, and with a slightly adapted ixml serialisation.

This technique is very simple, and not only that, actually allows the ixml processor to be simplified, and to some extent generalised.

An added advantage is that, with some work that still has to be done, the process is entirely reversible.

Reminder

Declarative Amsterdam

Declarative Amsterdam, 7-8 November

Call for presentations now open