Invisible XML Case Studies

Steven Pemberton, CWI, Amsterdam

Introduction

The tutorial is online at cwi.nl/~steven/ixml/case-studies/

It is also designed for self-teaching, and these slides are online.

For self study, there are exercises, with answers at the back. We will not be doing the exercises today.

Please ask questions, and give feedback!

Implementations

You have a choice of zero-install implementations, all accessed via a browser.

Recap

rule: a, "b", c;
     "d", e, "f". {comment}

Terminals match a literal string:

Literals: "terminal" 'terminal' "don't" 'don''t'
Encoded: #fff

Character sets match a single character:

Inclusion: ["a"; "()+"; #fff; "A"-"Z"]
Exclusion: ~["a"; "()+"; #fff; "A"-"Z"]

Insertions match nothing, only appear in the output:

+"string" +#fff

Group : (a, "b", c; "d", e, "f")
Option: a?

Recap

Repetitions:

Zero or more: a*
One or more: a+
Zero or more with separator: a**b
One or more with separator: a++b

Serialisation markers:

-  do not include in serialisation
^  do include in serialisation (default)
@  serialise nonterminal as attribute

New: Renaming

nmonth > month: digit, digit?.

Uses a different name on serialisation:

<date>
   <day>05</day>/
   <month>11</month>/
   <year>2025</year>
</date>

Case study: IPv6

Traditionally, syntax descriptions are used to describe what is acceptable or correct, but are not particularly interested in presenting any particular structure.

On the other hand, with ixml it is principally the structure we are interested in, and to a lesser extent the correctness of the input (since in many cases we can assume that the input is already correct).

To this end we can talk about permissive and strict grammars.

Permissive

For instance, an IPv4 address is a 32 bit number represented by splitting it into 4 groups of 8 bits, each group represented by an (up to) 3 digit decimal number, such as

192.168.1.2

The most permissive description of this would be:

IPv4: n++".".
   n: d+.
   d: ["0"-"9"].

If we wanted to be somewhat stricter:

IPv4: d3, ".", d3, ".", d3, ".", d3.
  d3: d, (d, d?)?.
   d: ["0"-"9"].

If we know that the input is going to be correct, then this is fine; or we could check that the values are in range after serialisation.
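
The second option, checking the range after serialisation, could look something like the following Python sketch. It assumes the serialised result has one d3 element per group (with or without nested d elements, depending on how the grammar marks d); the function name ipv4_in_range is made up for illustration.

```python
import xml.etree.ElementTree as ET

def ipv4_in_range(xml_text: str) -> bool:
    """Return True if the serialised address has four groups, each 0-255."""
    root = ET.fromstring(xml_text)
    # itertext() joins the digits whether or not each one is wrapped in <d>.
    groups = ["".join(g.itertext()) for g in root.iter("d3")]
    return len(groups) == 4 and all(0 <= int(g) <= 255 for g in groups)

ok = "<IPv4><d3>192</d3>.<d3>168</d3>.<d3>1</d3>.<d3>2</d3></IPv4>"
bad = "<IPv4><d3>192</d3>.<d3>168</d3>.<d3>1</d3>.<d3>256</d3></IPv4>"
print(ipv4_in_range(ok), ipv4_in_range(bad))
```

This keeps the grammar permissive and moves the numeric check to a later processing step.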

Stricter

If we want to be yet stricter, then we can restrict the values of d3 to the range 0-255:

d3: d;                 {  0-  9}
    ["1"-"9"], d;      { 10- 99}
    "1", d, d;         {100-199}
    "2", ["0"-"4"], d; {200-249}
   "25", ["0"-"5"].    {250-255}

We can adopt a similar permissive/strict approach to dates like 31/12/2026, either accepting any range of digits

date: d, d, "/", d, d, "/", d, d, d, d.

or restricting the values suitably.
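
For the strict d3 rule above, one way to convince yourself that its five alternatives cover exactly 0-255 is to transcribe them into a regular expression and try every candidate. The Python below is such a cross-check; the regex is a hand translation of the rule, not something ixml produces, and the name is_d3 is invented here.

```python
import re

# One regex branch per alternative of the strict d3 rule, in the same order.
D3 = re.compile(r"[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]")

def is_d3(text: str) -> bool:
    return D3.fullmatch(text) is not None

accepted = {n for n in range(1000) if is_d3(str(n))}
assert accepted == set(range(256))
assert not is_d3("007")   # unlike the permissive d3, leading zeros are rejected
```

The same approach would work for checking a restricted date grammar against a range of day and month values.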

IPv6 address

This is a 128 bit number, represented in 8 groups of 16 bits, separated by colons.

Each group is represented by up to 4 hexadecimal digits. For instance:

2001:0db8:85a3:0000:0000:8a2e:0370:7334

This is easy in ixml. If we decide not to restrict it to 8 groups, we can say:

IPv6: h4++-":".
h4: h, h, h, h.
-h: ["0"-"9"; "a"-"f"; "A"-"F"].

Shortcuts

Leading zeros can be omitted, but at least one digit must remain:

2001:db8:85a3:0:0:8a2e:370:7334

So we define:

h4: h, (h, (h, h?)?)?.

The left-most longest string of zero values may be replaced by :: :

2001:db8:85a3::8a2e:370:7334

which we could represent with

IPv6: h4**-":", zeros, h4**-":".
zeros: -"::".

Advantage

The nice thing about doing it this way is the semantic value of having an element representing the missing values. For instance, 2001:DB8::1 gives:

<IPv6>
   <h4>2001</h4>
   <h4>DB8</h4>
   <zeros/>
   <h4>1</h4>
</IPv6>
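
To illustrate why the zeros element is useful downstream, here is a small Python sketch that rebuilds the full eight-group address from the serialised form. It is an assumption-laden example: the function name expand_ipv6 is invented, the root element name is not inspected, and lower-casing the digits is just a choice made here.

```python
import xml.etree.ElementTree as ET

def expand_ipv6(xml_text: str) -> str:
    """Rebuild the full 8-group address from the serialised form."""
    root = ET.fromstring(xml_text)
    present = sum(1 for child in root if child.tag == "h4")
    groups = []
    for child in root:
        if child.tag == "h4":
            groups.append(child.text.lower().zfill(4))  # restore leading zeros
        elif child.tag == "zeros":
            groups.extend(["0000"] * (8 - present))     # the elided groups
    return ":".join(groups)

print(expand_ipv6("<IPv6><h4>2001</h4><h4>DB8</h4><zeros/><h4>1</h4></IPv6>"))
# prints 2001:0db8:0000:0000:0000:0000:0000:0001
```

Because the position of the elided groups is explicit in the markup, no string searching for "::" is needed.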

Case study: CSV

CSV, "Comma separated values", is a spreadsheet-derived data format with no formal definition, but it is so widely used that it is worth having an ixml grammar for.

Here is an example of CSV from the Wikipedia page on CSV:

Year,Make,Model,Description,Price
1997,Ford,E350,"ac, abs, moon",3000.00
1999,Chevy,"Venture ""Extended Edition""","",4900.00
1999,Chevy,"Venture ""Extended Edition, Very Large""",,5000.00
1996,Jeep,Grand Cherokee,"MUST SELL!
air, moon roof, loaded",4799.00

Structure

What we can see is that a CSV document consists of a number of rows:

csv: row*.

Each row consists of a number of values, comma separated, followed by a newline:

row: v**",", nl.

But we don't want the commas in the output, since they aren't part of the data:

row: v**-",", nl.

A newline is a newline character, possibly preceded by a carriage return, neither of which should be retained:

-nl: -#d?, -#a.

Values

A basic value consists of any number of characters as long as they aren't a comma, or a newline:

v: ~[","; #a; #d]*.

However, it is possible to include commas and newlines, as long as the value is quoted. So we change the definition of v to:

v: quoted; unquoted.

An unquoted value is either empty, or is the same as the old definition of v, except that the first character can't be a quote (or comma, or newline):

-unquoted: ; ~['",'; #a; #d], ~[","; #a; #d]*.

(Quoted values are the exercise)

Output

Using the above example CSV from the Wikipedia page (note the values with commas, quotes, and newlines) as input, we get this result:

<csv>
   <row>
      <v>Year</v>
      <v>Make</v>
      <v>Model</v>
      <v>Description</v>
      <v>Price</v>
   </row>
   <row>
      <v>1997</v>
      <v>Ford</v>
      <v>E350</v>
      <v>ac, abs, moon</v>
      <v>3000.00</v>
   </row>
   <row>
      <v>1999</v>
      <v>Chevy</v>
      <v>Venture "Extended Edition"</v>
      <v/>
      <v>4900.00</v>
   </row>
   <row>
      <v>1999</v>
      <v>Chevy</v>
      <v>Venture "Extended Edition, Very Large"</v>
      <v/>
      <v>5000.00</v>
   </row>
   <row>
      <v>1996</v>
      <v>Jeep</v>
      <v>Grand Cherokee</v>
      <v>MUST SELL!
air, moon roof, loaded</v>
      <v>4799.00</v>
   </row>
</csv>
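
As an independent cross-check of that result (it is not part of the tutorial's grammar, and not an answer to the quoted-values exercise), the same element structure can be built with Python's standard csv module. The function name csv_to_xml is made up for this sketch.

```python
import csv
import io
import xml.etree.ElementTree as ET

def csv_to_xml(text: str) -> ET.Element:
    """Build <csv><row><v>...</v></row></csv> using Python's csv reader."""
    root = ET.Element("csv")
    # newline="" leaves line endings alone so quoted newlines survive.
    for fields in csv.reader(io.StringIO(text, newline="")):
        row = ET.SubElement(root, "row")
        for field in fields:
            ET.SubElement(row, "v").text = field
    return root

sample = '1999,Chevy,"Venture ""Extended Edition""","",4900.00\n'
row = csv_to_xml(sample)[0]
print([v.text for v in row])
# prints ['1999', 'Chevy', 'Venture "Extended Edition"', '', '4900.00']
```

Comparing the two results on the same input is a cheap way to spot places where a grammar and an existing CSV reader disagree.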

Output

As an extra simple example, if we feed an empty file to the grammar, we get:

<csv/>

Case study: ls -l

In Unix-like operating systems, the command ls produces a directory listing, and ls -l produces a long version, with more details. For instance:

total 16
drwxrwxr-x 2 steven ixml 4096 May 28 21:50 answers
drwxrwxr-x 2 steven ixml 4096 May 28 21:50 examples
-rw-rw-r-- 1 steven ixml  840 May 28 21:51 next.html
-rw-rw-r-- 2 steven ixml  192 May 28 21:51 tutorial.css

An empty directory looks like

total 0

ls

On such systems, you can typically do something like

ls -l | ixml -g ls.ixml

to produce a marked-up version.

The top level rule is going to look like this:

directory: total, entry*.
   @total: -"total ", n, -#a.
       -n: d+.
       -d: ["0"-"9"].

And each entry will be of the structure:

entry: props, links, owner, group, size, date, name, -#a.

for

-rw-rw-r-- 2 steven ixml  192 May 28 21:51 tutorial.css

Basic fields

Some of these are easy:

links: n, s.
    -s: " "+.

You'll note that some fields are separated by more than one space, hence the +.

owner: id.
group: id.
  -id: [L]+, s.
 size: n, s.

Dates

Dates have two different formats:

     date: month, day, (year; time).
    month: [L]+, s.

      day: (d, d; +"0", d), s.
     year: n, s.
     time: h, -":", m, s.
        h: n.
        m: n.

Properties

That leaves two things to take care of. It's a permissive grammar, which gives us more freedom.

Firstly the properties of the file, that odd string at the front of each row. We could just say

props: ~[" "]+, s.

to give us a structure like

<props>-rw-rw-r--</props>

but knowing more about what those characters represent, and without committing to particular characters, we can do:

       props: type, user, grp, other, s.
       @type: x.
       @user: x, x, x.
@grp > group: x, x, x.
      @other: x, x, x.
          -x: ~[" "].
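
The positional split these rules perform (one character, then three groups of three) can be pictured with plain string slicing. This Python sketch is only an illustration of the idea; split_props is an invented name.

```python
def split_props(props: str) -> dict:
    """Split a 10-character mode string the way the props rule does."""
    if len(props) != 10:
        raise ValueError(f"expected 10 characters, got {len(props)}")
    return {
        "type": props[0],
        "user": props[1:4],
        "group": props[4:7],
        "other": props[7:10],
    }

print(split_props("-rw-rw-r--"))
# prints {'type': '-', 'user': 'rw-', 'group': 'rw-', 'other': 'r--'}
```

Like the grammar, it counts characters without caring which characters they are.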

Filename

The one remaining thing is the file name. Again we don't need to commit to particular characters: we just accept everything up to the end of the line:

       @name: ~[#a]+.

This will give us output like

<entry name='tutorial.css'>
   <props type='-' user='rw-' group='rw-' other='r--'/>
   <links>2</links>
   <owner>steven</owner>
   <group>ixml</group>
   <size>192</size>
   <date>
      <month>Sep</month>
      <day>12</day>
      <year>2025</year>
   </date>
</entry>
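
For comparison, here is roughly what the entry rule looks like as a hand-written Python regular expression. It is a sketch under assumptions: the pattern and the name parse_entry are invented here, and props is kept as a single field rather than split further.

```python
import re

# Hand-written regex with one named group per field of the entry rule.
ENTRY = re.compile(
    r"(?P<props>[^ ]+) +(?P<links>[0-9]+) +"
    r"(?P<owner>[^\W\d_]+) +(?P<group>[^\W\d_]+) +(?P<size>[0-9]+) +"
    r"(?P<month>[^\W\d_]+) +(?P<day>[0-9]{1,2}) +"
    r"(?:(?P<time>[0-9]+:[0-9]+)|(?P<year>[0-9]+)) +(?P<name>[^\n]+)"
)

def parse_entry(line: str) -> dict:
    m = ENTRY.fullmatch(line)
    if m is None:
        raise ValueError(f"not an ls -l entry: {line!r}")
    fields = {k: v for k, v in m.groupdict().items() if v is not None}
    fields["day"] = fields["day"].zfill(2)   # same effect as +"0", d
    return fields

entry = parse_entry("-rw-rw-r-- 2 steven ixml  192 May 28 21:51 tutorial.css")
print(entry["name"], entry["size"], entry["month"], entry["day"], entry["time"])
# prints tutorial.css 192 May 28 21:51
```

The regex is noticeably harder to read than the grammar, and it still needs extra code to pad the day and to drop the unused alternative.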

Case study: DocBook

The hardest part of getting an article into DocBook format (the XML format used by several conferences for their papers) is getting the bibliography right.

Although ixml was not designed to produce particular versions of XML, it is possible to produce a DocBook bibliography with the help of ixml.

Example

For instance, the text for the ixml specification:

[spec] Steven Pemberton (ed.), Invisible XML Specification,
invisiblexml.org, 2022, https://invisiblexml.org/ixml-specification.html

can be processed by an ixml grammar whose top-level rules are something like

bibliography: biblioentry+.
 biblioentry: abbrev, 
              (author; editor), -", ", 
              title, -", ", 
              publisher, -", ", 
              pubdate, -", ", 
              (artpagenums, -", ")?,
              (bibliomisc; biblioid)**-", ",
              -#a.

Structure

It is a fairly fixed format: fields are separated by a comma and a space.

It is largely a permissive grammar: many fields are defined as any string of characters not containing a comma, close square bracket, or newline:

    title: entry.
publisher: entry.
   -entry: ~[",]"; #a]+.

Optional fields are identified by particular substrings. For example artpagenums start with pp:

artpagenums: -"pp ", [Nd; "-–"]+.

A bibliomisc is a web address, beginning with http.

A biblioid is either an ISBN, beginning with ISBN, or a DOI, beginning with doi:.
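
The "identified by particular substrings" idea amounts to dispatching on a prefix. A minimal Python sketch of that dispatch follows; the function name classify and the sample field values are made up for illustration and do not come from a real bibliography.

```python
def classify(field: str) -> str:
    """Name the element an optional trailing field would become."""
    if field.startswith("pp "):
        return "artpagenums"
    if field.startswith("http"):
        return "bibliomisc"
    if field.startswith(("ISBN", "doi:")):
        return "biblioid"
    return "unrecognised"

# Made-up sample fields, one per kind.
for field in ["pp 1-10", "https://invisiblexml.org/", "doi:10.0000/example"]:
    print(field, "->", classify(field))
```

In the grammar the same decision is made by the parser, simply because each alternative starts with a different literal.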

Output

[spec] Steven Pemberton (ed.), Invisible XML Specification, invisiblexml.org, 2022, https://invisiblexml.org/ixml-specification.html

gives

<biblioentry>
   <abbrev>spec</abbrev>
   <editor>
      <personname>
         <firstname>Steven</firstname>
         <surname>Pemberton</surname>
      </personname>
   </editor>
   <title>Invisible XML Specification</title>
   <publisher>invisiblexml.org</publisher>
   <pubdate>2022</pubdate>
   <bibliomisc>
       <link xlink-href='https://invisiblexml.org/ixml-specification.html'/>
   </bibliomisc>
</biblioentry>

which can then further be tweaked by hand.

Case study: gedcom

Gedcom is a (non-context-free) format for recording genealogical data (family trees).

To be honest, it is a fairly badly-designed format, that looks like it was designed by a programmer used to assembly language.

Example

The leading numbers give the nesting of the field, followed by the 3 or 4-letter name of the field, and then the value of that field, if any. Since it is not context-free, we are constrained in the range of solutions for this, but we can still make something that is more structured.

0 @I1@ INDI
1 NAME Robert Eugene /Williams/
2 SURN Williams
2 GIVN Robert Eugene
1 SEX M
1 BIRT
2 DATE 2 Oct 1822
2 PLAC Weston, Madison, Connecticut, United States of America
2 SOUR @S1@
3 PAGE Sec. 2, p. 45
1 DEAT
2 DATE 14 Apr 1905
2 PLAC Stamford, Fairfield, Connecticut, United States of America
1 BURI
2 PLAC Spring Hill Cemetery, Stamford, Fairfield, Connecticut, United States of America
1 FAMS @F1@
1 FAMS @F2@
1 RESI 
2 DATE from 1900 to 1905

Structure

At the top level, we have a record numbered zero, followed by some number of nested fields numbered 1:

gedcom: record*.
record: -"0 ", field, r1*.

Similarly, a record numbered 1 is going to have its value, followed by any number of records numbered 2:

r1: -"1 ", field, r2*.

and so on, as far as we need to go:

r2: -"2 ", field, r3*.
r3: -"3 ", field, r4*.
r4: -"4 ", field, r5*.
r5: -"5 ", field, r6*.
r6: -"6 ", field, r7*.
r7: -"7 ", field, r8*.
r8: -"8 ", field, r9*.
r9: -"9 ", field.
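
The need for one rule per level comes from the depth being carried by a number rather than by brackets. For contrast, a general-purpose program would typically track the depth with a stack. The Python below is such a sketch, offered as an illustration rather than as part of the ixml solution; parse_gedcom is an invented name, and it is lenient about levels that skip a number.

```python
def parse_gedcom(text: str) -> dict:
    """Nest GEDCOM lines by their leading level number."""
    root = {"name": "gedcom", "children": []}
    stack = [root]                      # stack[n + 1] is the open node at level n
    for line in text.splitlines():
        if not line:
            continue
        level, _, rest = line.partition(" ")
        name, sep, value = rest.partition(" ")
        node = {"name": name, "children": []}
        if sep:                         # a space after the name means a value,
            node["value"] = value       # even if that value is empty
        del stack[int(level) + 1:]      # close everything at this level or deeper
        stack[-1]["children"].append(node)
        stack.append(node)
    return root

tree = parse_gedcom("0 @I1@ INDI\n1 BIRT\n2 DATE 2 Oct 1822\n1 SEX M\n")
record = tree["children"][0]
print([child["name"] for child in record["children"]])
# prints ['BIRT', 'SEX']
```

The stack handles any depth, whereas the grammar has to spell out each level it wants to support.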

Fields

Keeping it simple now, a field is a name and an optional value:

-field: name, (-" ", value)?, -#a.
@name: ["A"-"Z"; "0"-"9"; "@"]+.
@value: ~[#a]*.

Output

This already gives a reasonably structured result:

<gedcom>
   <record name='@I1@' value='INDI'>
      <r1 name='NAME' value='Robert Eugene /Williams/'>
         <r2 name='SURN' value='Williams'/>
         <r2 name='GIVN' value='Robert Eugene'/>
      </r1>
      <r1 name='SEX' value='M'/>
      <r1 name='BIRT'>
         <r2 name='DATE' value='2 Oct 1822'/>
         <r2 name='PLAC' value='Weston, Madison, Connecticut, United States of America'/>
         <r2 name='SOUR' value='@S1@'>
            <r3 name='PAGE' value='Sec. 2, p. 45'/>
         </r2>
      </r1>
      <r1 name='DEAT'>
         <r2 name='DATE' value='14 Apr 1905'/>
         <r2 name='PLAC' value='Stamford, Fairfield, Connecticut, United States of America'/>
      </r1>
      <r1 name='BURI'>
         <r2 name='PLAC' value='Spring Hill Cemetery, Stamford, Fairfield, Connecticut, United States of America'/>
      </r1>
      <r1 name='FAMS' value='@F1@'/>
      <r1 name='FAMS' value='@F2@'/>
      <r1 name='RESI' value=''>
         <r2 name='DATE' value='from 1900 to 1905'/>
      </r1>
   </record>
</gedcom>

One thing that this exposes is that the structure isn't always completely consistent. For instance, the RESI field at the end of the example has an empty value attribute because the name is followed by a space, but no value.

Renaming subrecords

The r1, r2, etc. are due to the fact that each level has a different syntax, with a number specifying its depth.

Renaming allows us to get rid of those:

gedcom: record*.
record: -"0 ", field, r1*.
r1 > field: -"1 ", field, r2*.
r2 > field: -"2 ", field, r3*.
r3 > field: -"3 ", field, r4*.

etc.

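Output

With renaming, and with some further refinement (not shown here) that turns the @...@ identifiers into id and link attributes, the result becomes: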

<gedcom>
   <record id='I1' name='INDI'>
      <field name='NAME' value='Robert Eugene /Williams/'>
         <field name='SURN' value='Williams'/>
         <field name='GIVN' value='Robert Eugene'/>
      </field>
      <field name='SEX' value='M'/>
      <field name='BIRT'>
         <field name='DATE' value='2 Oct 1822'/>
         <field name='PLAC' value='Weston, Madison, Connecticut, United States of America'/>
         <field name='SOUR' link='S1'>
            <field name='PAGE' value='Sec. 2, p. 45'/>
         </field>
      </field>
      <field name='DEAT'>
         <field name='DATE' value='14 Apr 1905'/>
         <field name='PLAC' value='Stamford, Fairfield, Connecticut, United States of America'/>
      </field>
      <field name='BURI'>
         <field name='PLAC' value='Spring Hill Cemetery, Stamford, Fairfield, Connecticut, United States of America'/>
      </field>
      <field name='FAMS' link='F1'/>
      <field name='FAMS' link='F2'/>
      <field name='RESI'>
         <field name='DATE' value='from 1900 to 1905'/>
      </field>
   </record>
</gedcom>

Case study: Markdown

Markdown is a class of languages meant to make writing HTML text documents easier.

Unfortunately, there are several different versions, and not all of them have specifications.

Since we'll only touch on some of the features here, we'll take the liberty of not using any particular version of Markdown.

To show it working, this chapter in the tutorial was actually produced using the ixml below on a Marked-down version of the chapter.

Structure

Since Markdown is just a representation of an HTML document, the top level is:

html: head, body.

We can add anything we like to the head using insertions:

head: meta, title.
meta: name, content.
@name: +"generator".
@content: +"ixml".
title: +"Markdown".

This will cause every serialised result to start

<html>
   <head>
      <meta name='generator' content='ixml'/>
      <title>Markdown</title>
   </head>
   <body>

Body

For the time being, let's keep the body simple: headings and paragraphs separated by blank lines:

body: part++(-#a+).
-part: heading; para.
-heading: h1; h2; h3; h4; h5; h6.
-para: p.

Headings

A heading is a line starting with one to six # characters and a space, optionally ending with # characters as well.

h1: -"# " , htext, -"#"*, -#a.
h2: -"## ", htext, -"#"*, -#a.
h3: -"### ", htext, -"#"*, -#a.
h4: -"#### ", htext, -"#"*, -#a.
h5: -"##### ", htext, -"#"*, -#a.
h6: -"###### ", htext, -"#"*, -#a.

Heading text is a series of heading characters. Hash characters are only allowed in the text if they are followed by a non-hash character:

-htext: hc+.
-hc: ~["#"; #a]; "#", ~["#"; #a].
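
To see how the htext and hc rules treat hash characters, it can help to try the equivalent regular expression on a few lines. This Python sketch is a hand translation of the six heading rules into one pattern; the names HEADING and parse_heading are invented, and the input is a single line without its newline.

```python
import re

# (#{1,6}) is the heading level; the second group mirrors htext / hc.
HEADING = re.compile(r"(#{1,6}) ((?:[^#\n]|#[^#\n])+)#*")

def parse_heading(line: str):
    """Return (level, text) for a heading line without its newline, else None."""
    m = HEADING.fullmatch(line)
    return (len(m.group(1)), m.group(2)) if m else None

print(parse_heading("## Body ##"))   # prints (2, 'Body ')
print(parse_heading("# C# notes"))   # prints (1, 'C# notes')
print(parse_heading("Body"))         # prints None
```

Note that the space before the closing hashes stays in the heading text, just as it would with the grammar as written.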

Paragraphs

Paragraphs must not start with a hash, clearly, and consist of a number of lines of text:

p: ~["#"], line++nl, -#a.
-line: c+.
-c: ~[#a].

We have separated out the nl, since it should be retained in the output, and not deleted.

-nl: #a.

Text Style

Text within a paragraph can be marked as strong with *asterisks*, as emphasised with _underscores_, or as code with `backquotes`.

So we change the definition of the contents of line accordingly:

-line: c+.
-c: ~[#a; "*_`"]; em; strong; code.

strong: -"*", cstar+, -"*".
em: -"_", cunder+ , -"_".
code: -"`", ccode+, -"`".

-cstar: ~["*"; #a].
-cunder: ~["_"; #a].
-ccode: ~["`"; #a].

Code

Finally, we will add one more type of paragraph, code blocks, which will be produced using the pre element in HTML, and start with a space in the input. So add the new type of paragraph:

-para: p; pre.

Make sure that p paragraphs don't start with a space:

p: ~["# "], line++nl, -#a.

and now define a pre paragraph similar to p paragraphs:

pre: (" ", preline)++nl, -#a.
-preline: ~[#a]*.

Case study: ixml

The final case study is the grammar of ixml itself. This grammar has the following properties:

Structure

At the top level is the rule for rule:

rule: (mark, s)?, name, s, -["=:"], s, -alts, -".".

This rule defines itself: a rule is an optional mark, followed by a name, a colon or equals, some alternatives, and then a full-stop.

Alts are just one or more alt, separated by semicolons or vertical bars; an alt is zero or more terms, separated by commas:

alts: alt++(-[";|"], s).
 alt: term**(-",", s).

Whitespace

The rule s is for optional whitespace.

-s: (whitespace; comment)*.

Its use always directly follows a terminal (such as "-["=:"], s" above in rule), except if that terminal is in an attribute.

In that case the whitespace is moved to directly after the attribute (such as for mark, and name above).

This prevents comments ending up in attribute content.

Whitespace is any character so classified in Unicode, plus tab, carriage-return, and linefeed:

-whitespace: -[Zs]; tab; lf; cr.

Comments

A comment is any number of comment characters or comments, surrounded by { } braces. A comment character is any character that isn't one of the braces. This definition allows nested comments, so that you can comment out a piece of ixml:

comment: -"{", (cchar; comment)*, -"}".
 -cchar: ~["{}"].
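
Nesting is what lets a commented-out region contain comments of its own. A hand-written scanner gets the same effect with a depth counter, as in this Python sketch. It is a simplification: strip_comments is an invented name, and braces inside string literals are not treated specially, which a real ixml parser would have to do.

```python
def strip_comments(source: str) -> str:
    """Remove { } comments, including nested ones, by tracking depth."""
    out = []
    depth = 0
    for ch in source:
        if ch == "{":
            depth += 1
        elif ch == "}":
            if depth == 0:
                raise ValueError("unmatched }")
            depth -= 1
        elif depth == 0:
            out.append(ch)
    if depth:
        raise ValueError("unclosed comment")
    return "".join(out)

print(strip_comments('a: "b". {was: a: {old} "c".} d: "e".'))
# prints a: "b".  d: "e".
```

The recursive comment rule expresses the same thing declaratively, with no counter to maintain.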

Strings

Most other rules are self-explanatory. But there are a couple worth looking at.

A string is one or more dchars enclosed by double quotes (and similar for single quotes).

@string: -'"', dchar+, -'"';
         -"'", schar+, -"'".

A dchar is any character except a double quote or a newline (since strings may not extend over lines), or two double quotes:

dchar: ~['"'; #a; #d];
       '"', -'"'.

Note though that since two double quotes represent a single quote in the string, one is deleted, and the other is not.
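
The effect of deleting one quote of each doubled pair is that the serialised attribute holds the string's actual value. The Python sketch below mimics that for a complete literal; string_value is an invented name and the checks are only an approximation of the grammar.

```python
def string_value(literal: str) -> str:
    """Return the characters an ixml string literal stands for."""
    quote = literal[:1]
    if quote not in ('"', "'") or len(literal) < 3 or literal[-1] != quote:
        raise ValueError(f"not a string literal: {literal!r}")
    body = literal[1:-1]
    if quote in body.replace(quote * 2, ""):
        raise ValueError("an enclosing quote inside the string must be doubled")
    if "\n" in body or "\r" in body:
        raise ValueError("strings may not extend over lines")
    return body.replace(quote * 2, quote)

print(string_value("'don''t'"))   # prints don't
print(string_value('"don\'t"'))   # prints don't
```

Both spellings from the recap give the same value, which is the point of the doubling convention.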

Literals

A literal is either a string or a hex encoding:

 literal: quoted;
          encoded.
 -quoted: (tmark, s)?, string, s.
-encoded: (tmark, s)?, -"#", hex, s.

Since string and hex are both attributes, and quoted and encoded are hidden, either one appears as a literal element in the serialisation, with the attribute raised up to it. For "a":

<literal string='a'/>

For #a:

<literal hex='a'/>

If they have a mark, it will similarly appear here. For instance for -#a:

<literal tmark='-' hex='a'/>

Encodings

Hex encodings are treated slightly differently in character sets, which appear in the serialisation as either an inclusion or an exclusion, containing a series of members:

["a"; #a; "A"-"Z"; #a-"a"]

appears as:

<inclusion>
   <member string='a'/>
   <member hex='a'/>
   <member from='A' to='Z'/>
   <member from='#a' to='a'/>
</inclusion>

That is to say, for a range, there aren't separate attributes for a string from and a hex from: they are both called from, and they are distinguished by whether they contain one character, or two or more where the first is a '#', so the '#' in that one case is not deleted.
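
A consumer of this serialisation therefore has to apply that rule when reading a from or to attribute. A minimal Python sketch of the decoding, with the invented name member_char:

```python
def member_char(value: str) -> str:
    """Decode a from/to attribute: '#' plus hex digits, or a literal character."""
    if len(value) >= 2 and value[0] == "#":
        return chr(int(value[1:], 16))
    if len(value) == 1:
        return value
    raise ValueError(f"unexpected range endpoint: {value!r}")

print(repr(member_char("#a")))   # prints '\n'
print(repr(member_char("A")))    # prints 'A'
print(repr(member_char("#")))    # prints '#'
```

The last case shows why the length test matters: a lone '#' is the character itself, not the start of an encoding.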

Conclusion