Invisible XML: State of Play and Future Directions

Steven Pemberton, CWI, Amsterdam

The author

Abstract

Invisible XML [ixml] has had a stable specification since 2022, there are currently a half dozen implementations, and typically a dozen presentations per year have recently been given at conferences. At the beginning of 2026 the first International Symposium on Invisible XML was held [sym], with 14 presentations and 40 or so attendees. Meanwhile there is a working group [wg] developing the language further.

This paper gives an overview and discussion of the topics and issues currently being considered within the community on the route to the next version.

Keywords: markup, invisible xml, ixml, notation design, parsing, standards

Contents

Introduction

Usage of ixml has been increasing over a surprisingly broad range of applications

With this experience emerges requests for features or support for use cases.

These requests get noticed by, or passed on to, the working group for consideration, and often arrive as requests for particular features, for instance similar features available in other languages.

Notation Design

There are two observable streams in notation design.

Einstein's maxim "Everything should be kept as simple as possible, but no simpler" is a good aim when designing: it implies not adding multiple mechanisms to achieve the same result, but designing from use cases, not features.

Now on to individual issues under consideration within the community.

Renaming

Rules in ixml define two things:

  1. the input syntax
  2. how the parsetree should be serialised. In ixml 1.0 the serialisation of an element or attribute always has the same name as the rule it comes from.

However, two different input syntaxes might represent the same output serialisation (for instance two different date formats). This originally meant that one abstraction, with two different input formats, had to have two separate output formats as well.

Renaming

Renaming allows a different name from the rule name on serialisation. Without renaming, you get

date: y, m, d.

<date><y>2025</y><m>08</m><d>06</d></date>

While using renaming gives a different serialisation:

date > d: y, m, d.

<d><y>2025</y><m>08</m><d>06</d></d>

This is completed work; not yet officially published but already widely implemented.

See the modularisation paper [mod] for examples of its use, and the work-in-progress ixml draft [ixml2] for its definition.

Modularisation

Two ixml developments have been observed:

  1. Some very large grammars are emerging, such as the one for XPath 4 [xp4] at 350 lines, and 200 rules;
  2. Several grammars being written with common subparts, for instance for URIs, and indeed XPath expressions.

Modularisation allows you to split grammars into smaller more manageable parts:

Example

Here is the proposed syntax of a module, using its own definition:

+uses ixml, name, s, RS from ixml.ixml; 
      iri from iri.ixml
+shares module

   module: s, (multiuse; shares)*, ixml.
-multiuse: -"+uses", RS, uses++(-";", s).
   shares: -"+shares", RS, entries.
     uses: entries, RS, -"from", RS, from.
 -entries: share++(-",", s).
    share: @name, s.
    @from: iri, s.

Can be done by preprocessor: a processed modularised grammar can still be used by implementations that don't support modularisation.

See the paper from MarkupUK 2025 [mod] and a slightly different proposal from Norm Tovey-Walsh [mod2].

Round tripping

Round tripping is about recreating an equivalent input document from the output of ixml.

You would think that because of deletions and insertions, round tripping is not in general possible.

But by defining round tripping as "producing a document that would produce the same output", it is possible in the general case.

How it works

Transform the grammar that produced the output into one that recognises that output, and then serialising only the terminal characters of a resultant parse.

This produced a textual document that when processed with the original grammar produces the same output.

One implementation to date [rti] calls this "creating the canonical value": although the new document is not identical with the original, running it through ixml, and back again will continually produce the same canonical document.

An interesting corollary is that transforming the transformed grammar in the same way a second time produces the same output as the original grammar, but requiring less work from the ixml serialiser, pointing to a possible way of simplifying the ixml serialiser.

Corollary

Round tripping

Original input t is processed using the original grammar g to produce output t.

Transforming grammar g to produce grammar g' allows the output t to be transformed back to input document t', which though not identical to t, will produce the same output as t when processed using grammar g.

Transforming g' using the same transformation process to create grammar g'' will also produce the identical output, but with far simpler use of the ixml serialiser.

Ambiguity

Ambiguity in ixml is a property of the input; it is deliberately accepted in ixml:

Since ixml does not dictate which parser should be used, it is unspecified which parse is serialised; this means that different implementations may produce a different result.

There are disadvantages to ambiguous grammars:

Types of ambiguity: input

Ambiguities can be divided into several classes.

Some ambiguities are only on the input, such as deleted spaces in this example:

  input: number*.
 number: spaces, digit, spaces.
  digit: ["0"-"9"].
-spaces: -" "+.

Even though this is ambiguous, all parses will produce an identical serialisation.

Types of ambiguity: bad grammars

Some ambiguities are due to badly written grammars. For instance,

expression: number; identifier; expression, op, expression.
op: ["+-×÷"].
identifier: [L]+.
number: ["0"-"9"]+.

For an input like

a-b-c

this will produce two different parses with different meanings: (a-b)-c and a-(b-c).

Trying to solve this by ad hoc means wouldn't solve the underlying problem that the input is improperly described, nor would it guarantee that you get the serialisation that you want in all cases. It is a potential source of technical debt.

Types of ambiguity: inherent

Some ambiguities are inherent in the input, such as US/World dates, like 5/6/2026,

 date: us; world.
   us: month, "/", day, "/", year.
world: day, "/", month, "/", year.
  day: d, d?.
month: d, d?.
 year: d, d, d, d.
    d: ["0"-"9"].

Fixing inherent ambiguity

Even in these cases it is possible to rewrite it to an unambiguous grammar that explicitely identifies ambiguous and non-ambiguous cases:

 date: us; world; ambig.
   us: month, "/", day, "/", year.
world: day, "/", month, "/", year.
ambig: md, "/", md, "/", year. 
  day: "1", ["3"-"9"]; "2", "0"-"9"; "3", ["0"-"1"]. {13-31}
month: "0"? ["1"-"9"]; "1", ["0"-"2"].               {1-12}
   md: -month.

This still identifies dates like 1/1/2026 as ambiguous, since they are syntactically ambiguous, even though they aren't semantically ambiguous.

Even these cases are in principle handleable with a yet more complex grammar.

Approaches

There are two possible approaches to dealing with ambiguity:

Priorities

For the first, some approaches add priorities to rules that allow the selection of one rule above another when it comes to ambiguity.

Experience has shown that there are some dangers to this approach; for instance CSS has a similar feature where the !important keyword gives a rule priority over others.

This has proven to be an enormous source of technical debt, since people often use it for a quick fix, making stylesheets that use it fragile, and hard to update.

Example

A problem for grammars is splitting input into distinct cases. For example:

catalogue: entry*.
    entry: header, item+.
   header: text, code, #a.
     item: text, #a.
     text: word++" ".
    -word: (l; d)+.
    @code: l, l, l, d, d, d.
       -l: [L].
       -d: ["0"-"9"].

Example input:

Fiction fic001
Brave New World
1984
NonFiction non123
Translating Beaudelaire
The Sixth Extinction

The problem is a badly designed formats, but we still need to handle them.

Adding a priority

In this case, word and code are not distinct, so header also matches item. You could add a priority to a rule:

 header: text, code, #a. !important

which would say that in the case of ambiguity, choose this rule.

Anti ambiguity

A better solution would be the ability to say explicitly

"An item is any string of characters where the last word doesn't match a code"

For instance (proposal),

item: word**" ", " ", lastword, #a.
-lastword: word!code.

Here word!code means a word, as long as it doesn't match a code.

This approach will be mentioned more shortly.

Spaces

In passing it is worth mentioning the problem of spaces, since they are a typical source of ambiguity, and one of the harder aspects of free-format inputs for inexperienced grammar writers.

This may be because classically there is a lexical analyser feeding tokens to the parser, so that the parser never sees spaces, and so the grammar doesn't need to mention them.

Dealing with spaces

One way is to have a 'lexical' part to the grammar, that simulates the lexical analyser, returns 'symbols', and deals with spaces between symbols:

OPEN: "(", s.
CLOSE: ")", s.
PLUS: "+", s.

and so on, so that spaces don't have to appear in the main body of the grammar.

Adding a lexical analyser to ixml wouldn't solve this problem, because the syntax of tokens still have to be described, and ixml already has a method of describing syntax.

Ignoring spaces?

Since spaces are such a recurring problem, it would be tempting to define a mode of processing where if an input character fails to match, and it turns out to be a space, then it is just ignored.

But in truth, there are few input languages where spaces are never relevant.

In the 60's programming languages like FORTRAN and the Algols were explicitly designed so that spaces were never relevant (even in strings, in the case of the Algols), but nowadays that is seldom the case.

For instance in CSS p.note and p .note have very different meanings. Even ixml has places where spaces are required.

Lexerless parsing

As mentioned, traditionally parsing has been done with two parsers running in parallel

This was needed traditionally to enable the use of non-general parsing algorithms such as LL(1), which would otherwise not be possible in most cases.

However, for general parsers, there are new approaches that add constructs to the syntax description method that remove the need for a lexical stage. (Search for "scannerless parsing")

Proposal

The current ixml proposal is to add one construct, with the syntax not yet finalised. Alongside A*, A+, A?, a construct A! is added to mean "An A may not appear here".

For example,

identifier: letter+, letter!.

means "The longest stretch of letters that can be matched". Thus, with

keyword: "if", letter!; 
         "then", letter!; 
         "else", letter!.
identifier: keyword!, letter+, letter!.

this last rule would then mean "an identifier is the longest string of letters that is not a keyword".

Namespaces

In designing XML, the group responsible did a clever thing when adding namespaces:

The same approach could be used for ixml:

For implementations that produce textual output, this adds no extra processing.

For implementations that go directly to an XML internal form, the namespace declarations have to be recognised and handled appropriately, as they are in XML processors.

Proposal

Accepting this, you could define a rule whose output is the shell of an HTML document to include a namespace in this way:

html: xhtml-ns, head, body.
@xhtml-ns>xmlns: +"http://www.w3.org/1999/xhtml".

which would give

<html xmlns='http://www.w3.org/1999/xhtml'>

Greedy Matching

A problem for beginners coming to ixml is that they may have internalised idioms from other similar systems that work differently from ixml.

For example, greedy matching as used in typical regular expression recognisers, where in many regular expression implementations the pattern

["a"-"z"]*

matches the longest-possible stretch of lower case letters.

However, in traditional grammar usage it represents any length that fits in the context of where it is used, and not necessarily the longest.

Option

An option would be to introduce a separate notation to specify the longest possible stretch, for example

["a"-"z"]>>

which then for consistency would require a similar construct for separated repeats

["a"-"z"]>>(",", s?)

However, as already seen, the negation construct would already allow the specification of longest stretches, so it is not clear that adding another construct for this explicit case would be necessary.

Numbered Repeats

Grammars in ixml have extra constructs available, not available in traditional grammars, and not mentioned in traditional parsing algorithms, in particular for repeated constructs with separators.

As the ixml specification points out, it is easy to handle these constructs, partly thanks to serialisation control, by transforming the grammar into an equivalent one that doesn't use the constructs.

Possibility

Some grammar systems allow the specification of numbered repeats, for instance "zero or more up to 6 letters". As an example ABNF [abnf] allows

3 digit

to specify exactly 3 digits,

1*4 digit

to specify 1-4 digits, and so on.

There are very few grammars that requires such a notation, though it would be easy to transform a grammar using such a construct into one not using it.

Pragmas

This is surprisingly a contentious issue: how to address individual pieces of software from a grammar.

Software occasionally provides a mechanism for instructing a processor to act in a certain way.

XML itself has Processing Instructions, that consist of a target that specifies what the pragma is about, and content, which is used by software that processes such instructions. For instance

<?xml-stylesheet type="text/xsl" href="style.xsl"?>

These are typically a type of comment: they don't alter the semantics of the language, but instruct a mode of operation to the processor.

Proposal

A paper [pragmas] proposed using pragmas not only to address individual software, but also as an extension mechanism. The danger of this is that conflating the two aspects risks undermining the interoperability of ixml.

Since pragmas are there to talk to individual pieces of software, it would seem advisable to let the software specify what they expect, within a broad but simple structure in the style of XML processing instructions: identify the target, and let the software do the rest.

A pragma should be identifiable as such, and the content should be up to the addressed software.

Versioning

Most software (for instance programming languages) doesn't require its input to specify which version of the processor is required.

XML is a bit of an exception on this point, and it is not obvious what the advantages are to the user of requiring it; it certainly seems to have obstructed adoption of new versions in the case of XML.

It may be used as a pragma to the processor to require a certain type of processing or checking, but this is only really necessary when the semantic meaning of a particular syntactic structure has changed between versions.

The current method of specifying the ixml version was added in haste shortly before publication of the specification, which was a mistake, because it left no time to implement and try it out beforehand.

A user shouldn't need to know which version they are using: the absence of a version should always be taken to mean "use the most recent version".

Conclusion

The adage "Good food takes time" applies equally well to design: there are many interlocking decisions where a change in one part can affect the design of another part, and they need to be designing to achieve a good symbiosis with each other, and then user-tested to see the real-life effects of the changes.

The design of the first version of ixml itself went through several iterations, and had small-scale user testing before it ended up as version 1.0.

The next version, 1.1, or 2.0, whatever it will be called, has many, sometimes apparently conflicting, requirements that need to be resolved and meshed well together.

References

[abnf] D. Crocker, Ed., RFC 5234 Augmented BNF for Syntax Specifications: ABNF, ietf.org, 2008, https://datatracker.ietf.org/doc/html/rfc5234

[art] Mary Holstege, “Invisible Fish: API Experimentation with InvisibleXML.” In Proceedings of Balisage: The Markup Conference 2024, vol. 29 (2024). https://doi.org/10.4242/BalisageVol29.Holstege01

[bank] Steven Pemberton, Banking with ixml and XForms, Proc. Declarative Amsterdam 2024, Amsterdam, The Netherlands. https://declarative.amsterdam/article?doi=da.2024.pemberton.banking

[css] Håkon Wium Lie et al. (eds.), Cascading Style Sheets level 1, W3C, 1996, https://www.w3.org/TR/CSS1/

[ixml2] Steven Pemberton (ed.), Invisible XML Specification Community Group Editorial Draft, Invisible XML Organisation, 2026, https://invisiblexml.org/current/

[ixml] Steven Pemberton (ed.), Invisible XML Specification, Invisible XML Organisation, 2022, https://invisiblexml.org/1.0/

[knit] Bethan Tovey-Walsh, “When women do algorithms: a semi-generative approach to overlay crochet with iXML and XSLT.” In Proceedings of Balisage: The Markup Conference 2024. Balisage Series on Markup Technologies, vol. 29 (2024). https://doi.org/10.4242/BalisageVol29.Tovey-Walsh01

[lit] Steven Pemberton, “The Book of Doublends Jined: Parsing Finnegans Wake with ixml.” In Proceedings of Balisage: The Markup Conference 2025. Balisage Series on Markup Technologies, vol. 30 (2025). https://doi.org/10.4242/BalisageVol30.Pemberton01

[mod2] Norm Tovey-Walsh, An Invisible XML modularity proposal, 2026, https://nineml.org/proposals/2026/modularity/

[mod] Steven Pemberton, Modular ixml, Proc. MarkupUK 2025, pp 6-20, https://markupuk.org/pdf/proceedings-2025-2.pdf

[msg] Ari Nordström, “Adventures in Mainframes, Text-based Messaging, and iXML.” In Proceedings of Balisage: The Markup Conference 2024. Balisage Series on Markup Technologies, vol. 29 (2024). https://doi.org/10.4242/BalisageVol29.Nordstrom01

[peg] Bryan Ford, "Parsing Expression Grammars: A Recognition Based Syntactic Foundation" (PDF). Proceedings of the 31st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. 2004, ACM. pp. 111–122. doi:10.1145/964001.964011. ISBN 1-58113-729-X.

[pragma] Tomos Hillman, C. M. Sperberg-McQueen, Bethan Tovey-Walsh and Norm Tovey-Walsh. “Designing for change: Pragmas in Invisible XML as an extensibility mechanism.” In Proceedings of Balisage: The Markup Conference 2022. Balisage Series on Markup Technologies, vol. 27 (2022). https://doi.org/10.4242/BalisageVol27.Sperberg-McQueen01

[prio] E. Shinan, Lark: A parsing toolkit for python (2025), github, https://github.com/lark-parser/lark

[rti] Alain Couthures, Text normalization with Invisible XML round-tripping, Proc Declarative Amsterdam 2025, https://declarative.amsterdam/article?doi=da.2025.couthures.grammix

[rt] Steven Pemberton, Round-tripping Invisible XML, in Proc. XML Prague 2024, Prague, Czechia, 2024, pp 153-164, ISBN 978-80-907787-2-6, https://archive.xmlprague.cz/2024/files/xmlprague-2024-proceedings.pdf#page=163

[sym] Various, The First International Symposium on Invisible XML, invisiblexml.org, 2026, https://invisiblexml.org/events/symposium2026/

[trials] C. M. Sperberg-McQueen, “From Word to XML via iXML: a Word-first XML workflow in the TLRR 2e project.” In Proceedings of Balisage: The Markup Conference 2024. Balisage Series on Markup Technologies, vol. 29 (2024). https://doi.org/10.4242/BalisageVol29.Sperberg-McQueen01

[vdb] M.G.J. van den Brand, et al., Disambiguation Filters for Scannerless Generalized LR Parsers. In: Horspool, R.N. (eds) Compiler Construction CC 2002. Lecture Notes in Computer Science, vol 2304. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45937-5_12, https://cwi.nl/~jurgenv/papers/CC-2002.pdf

[vin] Ari Nordström, It's Useful After All — VIN Numbers, DITA, and iXML, Proc XML Prague 2024, pp 295-306 https://archive.xmlprague.cz/2024/files/xmlprague-2024-proceedings.pdf#page=305

[wg] Invisible Markup Community Group, https://www.w3.org/community/ixml/[xmlns] Tim Bray et al., Namespaces in XML 1.0, W3C, 2009, https://www.w3.org/TR/xml-names/

[xp4] John Lumley, Invisible XML workbench, Github, 2024, https://johnlumley.github.io/jwiXML.xhtml

[zen] Tim Peters, PEP 20 - The Zen of Python, python.org, 2004, https://peps.python.org/pep-0020/