The tutorial is online at cwi.nl/~steven/ixml/case-studies/
It is also designed for self-teaching, and these slides are online.
For self study, there are exercises, with answers at the back. We will not be doing the exercises today.
Please ask questions, and give feedback!
You have a choice of zero-install implementations, all accessed via a browser.
rule: a, "b", c;
"d", e, "f". {comment}
Terminals match a literal string:
Literals:
"terminal" 'terminal' "don't" 'don''t'
Encoded:
#fff
Character sets match a single character:
Inclusion:
["a"; "()+"; #fff; "A"-"Z"]
Exclusion:
~["a"; "()+"; #fff; "A"-"Z"]
Insertions match nothing, only appear in the output:
+"string" +#fff
Group: (a, "b", c; "d", e, "f")
Option: a?
Repetitions:
Zero or more:
a*
One or more:
a+
Zero or more with separator:
a**b
One or more with separator:
a++b
Serialisation markers:
- do not include in serialisation
^ do include in serialisation (default)
@ serialise nonterminal as attribute
nmonth > month: digit, digit?.
Uses a different name on serialisation:
<date> <day>05</day>/ <month>11</month>/ <year>2025</year> </date>
Traditionally, syntax descriptions are used to describe what is acceptable or correct, but are not particularly interested in presenting any particular structure.
On the other hand, with ixml it is principally the structure we are interested in, and to a lesser extent the correctness of the input (since in many cases we can assume that the input is already correct).
To this end we can talk about permissive and strict grammars.
For instance, an IPv4 address is a 32 bit number represented by splitting it into 4 groups of 8 bits, each group represented by an (up to) 3 digit decimal number, such as
192.168.1.2
The most permissive description of this would be:
IPv4: n++".".
n: d+.
d: ["0"-"9"].
If we wanted to be somewhat stricter:
IPv4: d3, ".", d3, ".", d3, ".", d3.
d3: d, (d, d?)?.
d: ["0"-"9"].
If we know that the input is going to be correct, then this is fine; or we could check that the values are in range after serialisation.
If we want to be yet stricter, then we can restrict the values of
d3 to the range 0-255:
d3: d; { 0- 9}
["1"-"9"], d; { 10- 99}
"1", d, d; {100-199}
"2", ["0"-"4"], d; {200-249}
"25", ["0"-"5"]. {250-255}
We can adopt a similar permissive/strict approach to dates like
31/12/2026, either accepting any range of digits
date: d, d, "/", d, d, "/", d, d, d, d.
or restricting the values suitably.
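The strict alternatives can be cross-checked outside ixml. The following Python sketch mirrors the five d3 alternatives as a regular expression and applies the same idea to dates. The names D3, DATE, is_octet and is_date are illustrative only, and the date pattern is just one assumption about what "suitably" could mean (days 01-31, months 01-12, no check of month lengths).

```python
import re

# One regex alternative per ixml alternative of d3, in the same order.
D3 = re.compile(r"[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]")

# Assumed strict date: day 01-31, month 01-12, four-digit year.
DATE = re.compile(r"(0[1-9]|[12][0-9]|3[01])/(0[1-9]|1[0-2])/[0-9]{4}")

def is_octet(text: str) -> bool:
    """True if text is a decimal number 0-255 without leading zeros."""
    return D3.fullmatch(text) is not None

def is_date(text: str) -> bool:
    """True if text looks like dd/mm/yyyy with day and month in range."""
    return DATE.fullmatch(text) is not None

print(all(is_octet(str(n)) for n in range(256)))   # every value 0-255
print(is_octet("256"), is_octet("01"))             # out of range, leading zero
print(is_date("31/12/2026"), is_date("31/13/2026"))
```

Like the d3 rule, the octet pattern rejects spellings with leading zeros such as 01, which the "somewhat stricter" grammar would accept.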
An IPv6 address is a 128-bit number, represented in 8 groups of 16 bits, separated by colons.
Each group is represented by up to 4 hexadecimal digits. For instance:
2001:0db8:85a3:0000:0000:8a2e:0370:7334
This is easy in ixml. If we decide not to restrict it to 8 groups, we can say:
IPv6: h4++-":".
h4: h, h, h, h.
-h: ["0"-"9"; "a"-"f"; "A"-"F"].
Leading zeros can be omitted, but at least one digit must remain:
2001:db8:85a3:0:0:8a2e:370:7334
So we define:
h4: h, (h, (h, h?)?)?.
The left-most longest string of zero values may be replaced by ::, giving:
2001:db8:85a3::8a2e:370:7334
which we could represent with
IPv6: h4**-":", zeros, h4**-":".
zeros: -"::".
The nice thing about doing it this way is the semantic value of having an
element representing the missing values, 2001:DB8::1 giving:
<ipv6>
<h4>2001</h4>
<h4>DB8</h4>
<zeros/>
<h4>1</h4>
</ipv6>
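To illustrate why the zeros element is useful, here is a minimal Python sketch that reads such a serialisation and expands it back to the full eight groups. The function name expand_groups is hypothetical, and the sketch assumes the input contains at most one zeros element.

```python
import xml.etree.ElementTree as ET

def expand_groups(xml_text: str, total: int = 8) -> list[str]:
    """Replace the <zeros/> element with as many "0" groups as are missing."""
    root = ET.fromstring(xml_text)
    present = sum(1 for child in root if child.tag == "h4")
    groups = []
    for child in root:
        if child.tag == "zeros":
            groups.extend(["0"] * (total - present))
        else:
            groups.append(child.text)
    return groups

example = "<ipv6><h4>2001</h4><h4>DB8</h4><zeros/><h4>1</h4></ipv6>"
print(":".join(expand_groups(example)))
```

Because the missing groups are marked explicitly, the consumer does not have to rediscover where the :: was.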
CSV, "Comma separated values", is a spreadsheet-derived data format with no formal definition, but it is so widely used that it is worth having an ixml grammar for.
Here is an example of CSV from the Wikipedia page on CSV:
Year,Make,Model,Description,Price
1997,Ford,E350,"ac, abs, moon",3000.00
1999,Chevy,"Venture ""Extended Edition""","",4900.00
1999,Chevy,"Venture ""Extended Edition, Very Large""",,5000.00
1996,Jeep,Grand Cherokee,"MUST SELL!
air, moon roof, loaded",4799.00
What we can see is that a CSV document consists of a number of rows:
csv: row*.
Each row consists of a number of values, comma separated, followed by a newline:
row: v**",", nl.
But we don't want the commas in the output, since they aren't part of the data:
row: v**-",", nl.
A newline is a newline character, possibly preceded by a carriage return, neither of which should be retained:
-nl: -#d?, -#a.
A basic value consists of any number of characters as long as they aren't a comma, or a newline:
v: ~[","; #a; #d]*.
However, it is possible to include commas and newlines, as long as the value is quoted. So we change the definition of v to:
v: quoted; unquoted.
An unquoted value is either empty, or is the same as the old definition of v, except that the first character can't be a quote (or comma, or newline):
-unquoted: ; ~['",'; #a; #d], ~[","; #a; #d]*.
(Quoted values are left as an exercise.)
Using the above example CSV from the Wikipedia page (note values with commas, quotes, and newlines) as input, we get this result:
<csv>
<row>
<v>Year</v>
<v>Make</v>
<v>Model</v>
<v>Description</v>
<v>Price</v>
</row>
<row>
<v>1997</v>
<v>Ford</v>
<v>E350</v>
<v>ac, abs, moon</v>
<v>3000.00</v>
</row>
<row>
<v>1999</v>
<v>Chevy</v>
<v>Venture "Extended Edition"</v>
<v/>
<v>4900.00</v>
</row>
<row>
<v>1999</v>
<v>Chevy</v>
<v>Venture "Extended Edition, Very Large"</v>
<v/>
<v>5000.00</v>
</row>
<row>
<v>1996</v>
<v>Jeep</v>
<v>Grand Cherokee</v>
<v>MUST SELL!
air, moon roof, loaded</v>
<v>4799.00</v>
</row>
</csv>
As an extra simple example, if we feed an empty file to the ixml, we get:
<csv/>
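For comparison, the same csv/row/v shape can be produced with Python's standard csv module. This is only a sketch for checking results: the csv module applies its own default dialect rather than the grammar above, and the function name csv_to_xml and the shortened sample are illustrative.

```python
import csv
import io
import xml.etree.ElementTree as ET

def csv_to_xml(text: str) -> ET.Element:
    """Build <csv><row><v>...</v></row></csv> from CSV text."""
    root = ET.Element("csv")
    for fields in csv.reader(io.StringIO(text, newline="")):
        row = ET.SubElement(root, "row")
        for field in fields:
            ET.SubElement(row, "v").text = field
    return root

# A shortened version of the Wikipedia sample, keeping the awkward cases:
# doubled quotes, and a comma and a newline inside a quoted value.
sample = (
    'Year,Make,Model\n'
    '1999,Chevy,"Venture ""Extended Edition"""\n'
    '1996,Jeep,"MUST SELL!\nair, moon roof"\n'
)
doc = csv_to_xml(sample)
print(ET.tostring(doc, encoding="unicode"))
```

As with the grammar, an empty input gives a csv element with no rows.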
In Unix-like operating systems, the command ls produces a
directory listing, and ls -l produces a long version, with more
details. For instance:
total 16
drwxrwxr-x 2 steven ixml 4096 May 28 21:50 answers
drwxrwxr-x 2 steven ixml 4096 May 28 21:50 examples
-rw-rw-r-- 1 steven ixml  840 May 28 21:51 next.html
-rw-rw-r-- 2 steven ixml  192 May 28 21:51 tutorial.css
An empty directory looks like
total 0
On such systems, you can typically do something like
ls -l | ixml -g ls.ixml
to produce a marked-up version.
The top level rule is going to look like this:
directory: total, entry*.
@total: -"total ", n, -#a.
-n: d+.
-d: ["0"-"9"].
And each entry will be of the structure:
entry: props, links, owner, group, size, date, name, #a.
for
-rw-rw-r-- 2 steven ixml 192 May 28 21:51 tutorial.css
Some of these are easy:
links: n, s.
-s: " "+.
You'll note that some fields are separated by more than one space, hence the
+.
owner: id.
group: id.
-id: [L]+, s.
size: n, s.
Dates have two different formats:
date: month, day, (year; time).
month: [L]+, s.
day: (d, d; +"0", d), s.
year: n, s.
time: h, -":", m, s.
h: n.
m: n.
That leaves two things to take care of. It's a permissive grammar, which gives us more freedom.
Firstly the properties of the file, that odd string at the front of each row. We could just say
props: ~[" "]+, s.
to give us a structure like
<props>-rw-rw-r--</props>
but, knowing more about what those characters represent, and still without committing to particular characters, we can do:
props: type, user, grp, other, s.
@type: x.
@user: x, x, x.
@grp > group: x, x, x.
@other: x, x, x.
-x: ~[" "].
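The props rule splits the string purely by position: one character, then three groups of three. A minimal Python sketch of the same positional split follows; the function name split_props is hypothetical, and the fixed length of ten characters is an assumption that follows from the rule.

```python
def split_props(props: str) -> dict[str, str]:
    """Split a 10-character mode string by position, like the props rule."""
    if len(props) != 10 or " " in props:
        raise ValueError(f"expected 10 non-space characters, got {props!r}")
    return {
        "type": props[0],
        "user": props[1:4],
        "group": props[4:7],
        "other": props[7:10],
    }

print(split_props("-rw-rw-r--"))
```

Like the grammar, this does not care which characters appear, only where they appear.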
The one remaining thing is the file name. Again we don't need to commit to particular characters: we just accept everything up to the end of the line:
@name: ~[#a]+.
This will give us output like
<entry name='tutorial.css'>
<props type='-' user='rw-' group='rw-' other='r--'/>
<links>2</links>
<owner>steven</owner>
<group>ixml</group>
<size>192</size>
<date>
<month>Sep</month>
<day>12</day>
<year>2025</year>
</date>
</entry>
The hardest part of getting an article into DocBook format (the XML format used by several conferences for their papers) is getting the bibliography right.
Although ixml was not designed to produce particular versions of XML, it is possible to produce a DocBook bibliography with the help of ixml.
For instance, the text for the ixml specification:
[spec] Steven Pemberton (ed.), Invisible XML Specification, invisiblexml.org, 2022, https://invisiblexml.org/ixml-specification.html
can be processed by an ixml grammar whose top-level rules are something like
bibliography: biblioentry+.
biblioentry: abbrev,
(author; editor), -", ",
title, -", ",
publisher, -", ",
pubdate, -", ",
(artpagenums, -", ")?,
(bibliomisc; biblioid)**-", ",
-#a.
It is a fairly fixed format: field separators are a comma and a space.
It is largely a permissive grammar: many fields are defined as any string of characters not containing a comma, close square bracket, or newline:
title: entry.
publisher: entry.
-entry: ~[",]"; #a]+.
Optional fields are identified by particular substrings. For example
artpagenums start with pp:
artpagenums: -"pp ", [Nd; "-–"]+.
A bibliomisc is a web address, beginning http.
A biblioid is either an ISBN, beginning with ISBN,
or a DOI, beginning doi:.
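The way the optional fields are told apart by their leading substring can be sketched in Python as follows. The function name classify is hypothetical, and the prefixes are taken directly from the descriptions above; the real grammar may be stricter about what follows each prefix.

```python
def classify(field: str) -> str:
    """Name the optional bibliography field that a piece of text belongs to."""
    if field.startswith("pp "):
        return "artpagenums"
    if field.startswith("http"):
        return "bibliomisc"
    if field.startswith(("ISBN", "doi:")):
        return "biblioid"
    raise ValueError(f"unrecognised optional field: {field!r}")

print(classify("https://invisiblexml.org/ixml-specification.html"))
```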
[spec] Steven Pemberton (ed.), Invisible XML Specification, invisiblexml.org, 2022, https://invisiblexml.org/ixml-specification.html
gives
<biblioentry>
<abbrev>spec</abbrev>
<editor>
<personname>
<firstname>Steven</firstname>
<surname>Pemberton</surname>
</personname>
</editor>
<title>Invisible XML Specification</title>
<publisher>invisiblexml.org</publisher>
<pubdate>2022</pubdate>
<bibliomisc>
<link xlink-href='https://invisiblexml.org/ixml-specification.html'/>
</bibliomisc>
</biblioentry>
which can then further be tweaked by hand.
Gedcom is a (non-context-free) format for recording genealogical data (family trees).
To be honest, it is a fairly badly-designed format, that looks like it was designed by a programmer used to assembly language.
The leading numbers give the nesting of the field, followed by the 3 or 4-letter name of the field, and then the value of that field, if any. Since it is not context-free, we are constrained in the range of solutions for this, but we can still make something that is more structured.
0 @I1@ INDI
1 NAME Robert Eugene /Williams/
2 SURN Williams
2 GIVN Robert Eugene
1 SEX M
1 BIRT
2 DATE 2 Oct 1822
2 PLAC Weston, Madison, Connecticut, United States of America
2 SOUR @S1@
3 PAGE Sec. 2, p. 45
1 DEAT
2 DATE 14 Apr 1905
2 PLAC Stamford, Fairfield, Connecticut, United States of America
1 BURI
2 PLAC Spring Hill Cemetery, Stamford, Fairfield, Connecticut, United States of America
1 FAMS @F1@
1 FAMS @F2@
1 RESI 
2 DATE from 1900 to 1905
At the top level, we have a record numbered zero, followed by some number of nested fields numbered 1:
gedcom: record*.
record: -"0 ", field, r1*.
Similarly, a record numbered 1 is going to have its value, followed by any number of records numbered 2:
r1: -"1 ", field, r2*.
and so on, as far as we need to go:
r2: -"2 ", field, r3*.
r3: -"3 ", field, r4*.
r4: -"4 ", field, r5*.
r5: -"5 ", field, r6*.
r6: -"6 ", field, r7*.
r7: -"7 ", field, r8*.
r8: -"8 ", field, r9*.
r9: -"9 ", field.
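The grammar needs one rule per level because a context-free grammar cannot count. For comparison only, a general-purpose language can follow the level numbers with a stack, to any depth. The Python sketch below does that; the function name nest and the dictionary layout are illustrative and not part of the tutorial.

```python
def nest(lines: list[str]) -> list[dict]:
    """Turn 'LEVEL TEXT' lines into nested dicts using a stack of open nodes."""
    roots: list[dict] = []
    stack: list[dict] = []          # stack[i] is the open node at level i
    for line in lines:
        level_text, _, text = line.partition(" ")
        level = int(level_text)
        if level > len(stack):
            raise ValueError(f"level {level} skips a level: {line!r}")
        node = {"text": text, "children": []}
        del stack[level:]           # close every node at this level or deeper
        (stack[-1]["children"] if stack else roots).append(node)
        stack.append(node)
    return roots

tree = nest([
    "0 @I1@ INDI",
    "1 BIRT",
    "2 DATE 2 Oct 1822",
    "2 SOUR @S1@",
    "3 PAGE Sec. 2, p. 45",
    "1 SEX M",
])
print(len(tree[0]["children"]))
```

The ixml version trades this generality for a fixed maximum depth, which is rarely a problem in practice.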
Keeping it simple now, a field is a name and an optional value:
-field: name, (-" ", value)?, -#a.
@name: ["A"-"Z"; "0"-"9"; "@"]+.
@value: ~[#a]*.
This already gives a reasonably structured result:
<gedcom>
<record name='@I1@' value='INDI'>
<r1 name='NAME' value='Robert Eugene /Williams/'>
<r2 name='SURN' value='Williams'/>
<r2 name='GIVN' value='Robert Eugene'/>
</r1>
<r1 name='SEX' value='M'/>
<r1 name='BIRT'>
<r2 name='DATE' value='2 Oct 1822'/>
<r2 name='PLAC' value='Weston, Madison, Connecticut, United States of America'/>
<r2 name='SOUR' value='@S1@'>
<r3 name='PAGE' value='Sec. 2, p. 45'/>
</r2>
</r1>
<r1 name='DEAT'>
<r2 name='DATE' value='14 Apr 1905'/>
<r2 name='PLAC' value='Stamford, Fairfield, Connecticut, United States of America'/>
</r1>
<r1 name='BURI'>
<r2 name='PLAC' value='Spring Hill Cemetery, Stamford, Fairfield, Connecticut, United States of America'/>
</r1>
<r1 name='FAMS' value='@F1@'/>
<r1 name='FAMS' value='@F2@'/>
<r1 name='RESI' value=''>
<r2 name='DATE' value='from 1900 to 1905'/>
</r1>
</record>
</gedcom>
One thing that this exposes is that the structure isn't always completely
consistent. For instance, the RESI field at the end of the example
has an empty value attribute because the name is followed by a
space, but no value.
The r1, r2, etc. are due to the fact that each
level has a different syntax, with a number specifying its depth.
Renaming allows us to get rid of those:
gedcom: record*.
record: -"0 ", field, r1*.
r1 > field: -"1 ", field, r2*.
r2 > field: -"2 ", field, r3*.
r3 > field: -"3 ", field, r4*.
etc.
<gedcom>
<record id='I1' name='INDI'>
<field name='NAME' value='Robert Eugene /Williams/'>
<field name='SURN' value='Williams'/>
<field name='GIVN' value='Robert Eugene'/>
</field>
<field name='SEX' value='M'/>
<field name='BIRT'>
<field name='DATE' value='2 Oct 1822'/>
<field name='PLAC' value='Weston, Madison, Connecticut, United States of America'/>
<field name='SOUR' link='S1'>
<field name='PAGE' value='Sec. 2, p. 45'/>
</field>
</field>
<field name='DEAT'>
<field name='DATE' value='14 Apr 1905'/>
<field name='PLAC' value='Stamford, Fairfield, Connecticut, United States of America'/>
</field>
<field name='BURI'>
<field name='PLAC' value='Spring Hill Cemetery, Stamford, Fairfield, Connecticut, United States of America'/>
</field>
<field name='FAMS' link='F1'/>
<field name='FAMS' link='F2'/>
<field name='RESI'>
<field name='DATE' value='from 1900 to 1905'/>
</field>
</record>
</gedcom>
Markdown is a class of languages meant to make writing HTML text documents easier.
Unfortunately, there are several different versions, and not all of them have specifications.
Since we'll only touch on some of the features here, we'll take the liberty of not using any particular version of Markdown.
To show it working, this chapter in the tutorial was actually produced using the ixml below on a Marked-down version of the chapter.
Since Markdown is just a representation of an HTML document, the top level is:
html: head, body.
We can add anything we like to the head using insertions:
head: meta, title.
meta: name, content.
@name: +"generator".
@content: +"ixml".
title: +"Markdown".
This will cause every serialised result to start
<html>
<head>
<meta name='generator' content='ixml'/>
<title>Markdown</title>
</head>
<body>
For the time being, let's keep the body simple: headings and paragraphs separated by blank lines:
body: part++(-#a+).
-part: heading; para.
-heading: h1; h2; h3; h4; h5; h6.
-para: p.
Headings are a line starting with # characters, and a space,
and optionally ending with them as well.
h1: -"# ", htext, -"#"*, -#a.
h2: -"## ", htext, -"#"*, -#a.
h3: -"### ", htext, -"#"*, -#a.
h4: -"#### ", htext, -"#"*, -#a.
h5: -"##### ", htext, -"#"*, -#a.
h6: -"###### ", htext, -"#"*, -#a.
Heading text is a series of heading characters. Hash characters are only allowed in the text if they are followed by a non-hash character:
-htext: hc+.
-hc: ~["#"; #a]; "#", ~["#"; #a].
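The heading rules can be approximated with a regular expression, which is a handy way to try out what counts as heading text. This Python sketch is an approximation under stated assumptions: it takes a line without its trailing newline, and the names HEADING and parse_heading are illustrative.

```python
import re

# "#{1,6} " is the marker; the text is a run of heading characters (hc):
# either a non-hash character, or a hash followed by a non-hash character.
# Trailing hashes are dropped, as with -"#"* in the grammar.
HEADING = re.compile(r"(#{1,6}) ((?:[^#\n]|#[^#\n])+)#*")

def parse_heading(line: str):
    """Return (level, text) for a heading line, or None if it is not one."""
    match = HEADING.fullmatch(line)
    if match is None:
        return None
    return len(match.group(1)), match.group(2)

print(parse_heading("## Sub-heading ##"))
print(parse_heading("# C# rocks"))
```

Note that the space before the closing hashes stays in the heading text, since only the hashes themselves are deleted.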
Paragraphs must not start with a hash, clearly, and consist of a number of lines of text:
p: ~["#"], line++nl, -#a.
-line: c+.
-c: ~[#a].
We have separated out the nl, since it should be retained in
the output, and not deleted.
-nl: #a.
Text within a paragraph can be marked with
* for strong text
_ for emphasised text
` for code.
So we change the definition of the contents of line accordingly:
-line: c+.
-c: ~[#a; "*_`"]; em; strong; code.
strong: -"*", cstar+, -"*".
em: -"_", cunder+, -"_".
code: -"`", ccode+, -"`".
-cstar: ~["*"; #a].
-cunder: ~["_"; #a].
-ccode: ~["`"; #a].
Finally, we will add one more type of paragraph, code blocks, which will be
produced using the pre element in HTML, and start with a space in
the input. So add the new type of paragraph:
-para: p; pre.
Make sure that p paragraphs don't start with a space:
p: ~["# "], line++nl, -#a.
and now define a pre paragraph similar to
p paragraphs:
pre: (" ", preline)++nl, -#a.
-preline: ~[#a]*.
The final case study is the grammar of ixml itself. This grammar has the following properties:
alts in rule), the marks
for a rule are on the definition and are not overridden.
At the top level is the rule for rule:
rule: (mark, s)?, name, s, -["=:"], s, -alts, -".".
This rule defines itself: a rule is an optional mark, followed by a name, a colon or equals, some alternatives, and then a full-stop.
Alts are just one or more alt rules separated by
semicolons or vertical bars; an alt is zero or more
terms, separated by commas:
alts: alt++(-[";|"], s).
alt: term**(-",", s).
The rule s is for optional whitespace.
-s: (whitespace; comment)*.
Its use always directly follows a terminal (such as -["=:"], s
above in rule), except if that terminal is in
an attribute.
In that case the whitespace is moved to directly after the attribute (such
as for mark, and name above).
This prevents comments ending up in attribute content.
Whitespace is any character so classified in Unicode, plus tab, carriage-return, and linefeed:
-whitespace: -[Zs]; tab; lf; cr.
A comment is any number of comment characters or comments, surrounded by { } braces. A comment character is any character that isn't one of the braces. This definition allows nested comments, so that you can comment out a piece of ixml:
comment: -"{", (cchar; comment)*, -"}".
-cchar: ~["{}"].
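The effect of allowing nested comments can be sketched with a depth counter. This Python sketch is deliberately simplified: the function name strip_comments is hypothetical, and unlike the real grammar it does not know about string literals, so a brace inside a quoted string would be miscounted.

```python
def strip_comments(text: str) -> str:
    """Remove {...} comments, allowing comments to nest."""
    depth = 0
    kept = []
    for ch in text:
        if ch == "{":
            depth += 1
        elif ch == "}":
            if depth == 0:
                raise ValueError("unmatched '}'")
            depth -= 1
        elif depth == 0:
            kept.append(ch)
    if depth:
        raise ValueError("unclosed comment")
    return "".join(kept)

# Commenting out a rule that itself contains a comment.
print(strip_comments("a: b. {old: {inner} c.} d: e."))
```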
Most other rules are self-explanatory. But there are a couple worth looking at.
A string is one or more dchars enclosed by double
quotes (and similar for single quotes).
@string: -'"', dchar+, -'"';
-"'", schar+, -"'".
A dchar is any character except a double quote or a newline
(since strings may not extend over lines), or two double quotes:
dchar: ~['"'; #a; #d];
'"', -'"'.
Note though that since two double quotes represent a single quote in the string, one is deleted, and the other is not.
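The same doubling convention, seen from the consumer's side, can be sketched in Python as below. The function name decode_string is hypothetical; it handles both quote styles and, as an assumption matching dchar+ and schar+, rejects empty strings.

```python
def decode_string(literal: str) -> str:
    """Return the characters denoted by a quoted ixml string literal."""
    quote = literal[:1]
    if quote not in ('"', "'") or len(literal) < 3 or not literal.endswith(quote):
        raise ValueError(f"not a quoted, non-empty string: {literal!r}")
    body = literal[1:-1]
    # Any quote left after removing doubled pairs was not doubled.
    if quote in body.replace(quote * 2, ""):
        raise ValueError(f"undoubled quote inside {literal!r}")
    return body.replace(quote * 2, quote)

print(decode_string("'don''t'"))
print(decode_string('"don\'t"'))
```

Both calls denote the same five characters, which is why the serialised string attribute has no need to record which quote style was used.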
A literal is either a string or a hex encoding:
literal: quoted;
encoded.
-quoted: (tmark, s)?, string, s.
-encoded: (tmark, s)?, -"#", hex, s.
Since string and hex are both attributes, they
appear as a literal element in the serialisation, and the
attributes are raised up to it. For "a":
<literal string='a'/>
for #a
<literal hex='a'/>
If they have a mark, it will similarly appear here. For instance for
-#a:
<literal tmark='-' hex='a'/>
Hex encodings are treated slightly differently in character sets, which appear in the serialisation as either an inclusion or an exclusion, containing a series of members:
["a"; #a; "A"-"Z"; #a-"a"]
appears as:
<inclusion>
<member string='a'/>
<member hex='a'/>
<member from='A' to='Z'/>
<member from='#a' to='a'/>
</inclusion>
That is to say, for a range, there aren't separate attributes for a
string from and a hex from: they are both called
from, and they are distinguished by whether they contain one
character, or two or more where the first is a '#', so the '#' in that one case
is not deleted.
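A consumer of this serialisation can apply that rule directly. The following Python sketch decodes a from or to value into the character it stands for; the function name decode_bound is illustrative.

```python
def decode_bound(value: str) -> str:
    """Return the character denoted by a from/to attribute value."""
    if len(value) >= 2 and value.startswith("#"):
        return chr(int(value[1:], 16))   # hex encoding, e.g. "#a" is U+000A
    if len(value) == 1:
        return value                     # a literal character, including "#"
    raise ValueError(f"not a character or hex encoding: {value!r}")

print(repr(decode_bound("#a")), repr(decode_bound("a")), repr(decode_bound("#")))
```

A lone # is therefore unambiguous: it has length one, so it is the hash character itself.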