Invisible Markup is a method for discovering structure in (textual) documents.
Follow along at http://cwi.nl/~steven/ixml/tutorial/
If we don't finish everything today, fear not! The tutorial is also designed for self-teaching.
And if that isn't enough, there is also a video. (These slides are online as well.)
Example answers to all the exercises are at the back.
You have one or more documents in some textual format.
You supply a description of that format, that includes how it should be converted to XML.
The ixml processor then uses the description to read and convert the documents to XML.
7 November 2024
Describe this:
date: day, month, year.
And then the parts:
day: digit; digit, digit.
A digit:
digit: "0"; "1"; "2"; "3"; "4"; "5"; "6"; "7"; "8"; "9".
Literals can be single or double quoted: "0" or '0'.
month: "January"; "February"; "March"; "April"; "May"; "June"; "July"; "August"; "September"; "October"; "November"; "December".
And finally
year: digit, digit, digit, digit.
That description ignores the spaces in the input, so
date: day, month, year.
should be:
date: day, " ", month, " ", year.
<date> <day> <digit>7</digit> </day> <month>November</month> <year> <digit>2</digit> <digit>0</digit> <digit>2</digit> <digit>4</digit> </year> </date>
Run it yourself.
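To make the result concrete, here is a hand-written Python approximation of what the description above says. It is only a sketch: it does not use an ixml processor, the name `date_to_xml` is made up for illustration, a regular expression stands in for the rules, and the separating spaces are dropped from the output for simplicity.

```python
import re
import xml.etree.ElementTree as ET

MONTHS = ("January February March April May June July "
          "August September October November December").split()

# day: one or two digits; month: one of the names; year: four digits,
# with a single space between the parts.
DATE = re.compile(r"([0-9]{1,2}) (" + "|".join(MONTHS) + r") ([0-9]{4})")

def date_to_xml(text):
    """Return XML for a date such as '7 November 2024', or None if it does not match."""
    m = DATE.fullmatch(text)
    if m is None:
        return None
    date = ET.Element("date")
    day = ET.SubElement(date, "day")
    for ch in m.group(1):                      # one <digit> element per digit
        ET.SubElement(day, "digit").text = ch
    ET.SubElement(date, "month").text = m.group(2)
    year = ET.SubElement(date, "year")
    for ch in m.group(3):
        ET.SubElement(year, "digit").text = ch
    return ET.tostring(date, encoding="unicode")

print(date_to_xml("7 November 2024"))
```

The point of ixml is that you write only the description, not code like this.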
Warning: the top-level rule (date in this case) must always be the first rule in the ixml.
If you are not interested in the digit elements, you can suppress them by marking the rule with "-". So change
digit: "0"; "1"; "2"; "3"; "4"; "5"; "6"; "7"; "8"; "9".
to
-digit: "0"; "1"; "2"; "3"; "4"; "5"; "6"; "7"; "8"; "9".
giving
<date> <day>7</day> <month>November</month> <year>2024</year> </date>
Try it yourself
Changing
year: digit, digit, digit, digit.
to
@year: digit, digit, digit, digit.
gives
<date year='2024'> <day>7</day> <month>November</month> </date>
Make them all attributes
If the last exercise worked, you should have got an output that looks like this:
<date day='7' month='November' year='2024'> </date>
It's not empty because all input ends up in the XML. Exclude it by changing:
date: day, " ", month, " ", year.
to
date: day, -" ", month, -" ", year.
Try it. Experiment a bit.
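The effect of the marks seen so far can be sketched in Python. This is an illustration only, not how any ixml processor is implemented: the tuple layout `(name, mark, children)` and the function names are assumptions made for this example. A mark of "-" drops the element but keeps its content, "@" turns the node into an attribute of its parent, and deleted literals are simply left out of the tree.

```python
from xml.sax.saxutils import escape, quoteattr

def text_of(node):
    """Concatenate all character content below a node."""
    if isinstance(node, str):
        return node
    return "".join(text_of(child) for child in node[2])

def collect(children):
    """Return (attributes, serialized content) for a list of children."""
    attrs, parts = [], []
    for child in children:
        if isinstance(child, str):
            parts.append(escape(child))
        elif child[1] == "@":                  # becomes an attribute
            attrs.append((child[0], text_of(child)))
        elif child[1] == "-":                  # element hidden, content kept
            sub_attrs, sub_body = collect(child[2])
            attrs.extend(sub_attrs)
            parts.append(sub_body)
        else:                                  # ordinary element
            parts.append(serialize(child))
    return attrs, "".join(parts)

def serialize(node):
    """Serialize a (name, mark, children) tree as XML text."""
    attrs, body = collect(node[2])
    attr_text = "".join(f" {name}={quoteattr(value)}" for name, value in attrs)
    return f"<{node[0]}{attr_text}>{body}</{node[0]}>"

def digits(text):
    return [("digit", "-", [ch]) for ch in text]

tree = ("date", "", [("day", "", digits("7")), " ",
                     ("month", "", ["November"]), " ",
                     ("year", "@", digits("2024"))])
print(serialize(tree))
```

With every rule marked "@" and the spaces deleted, the same function yields an empty date element carrying three attributes.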
A day is:
day: digit; digit, digit.
but you can also write
day: digit, digit?.
Make the change. Does it matter which digit is made optional? What happens if you make the year optional?
If you make the year optional like this:
date: day, -" ", month, -" ", year?.
the year is optional, but the space before it isn't.
You could make a separate rule for the year:
date: day, -" ", month, optional-year?.
-optional-year: -" ", year.
Better is to use grouping:
date: day, -" ", month, (-" ", year)?.
If you run this on a date with no year, no year element is produced:
<date> <day>7</day> <month>November</month> </date>
If you were to make the year optionally a two-digit number or a four-digit number, you could do it with
year: digit, digit, digit, digit; digit, digit.
Do it with grouping instead.
year: digit, digit, (digit, digit)?.
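For readers who know regular expressions, the optional items and groups above behave much like `?` and `(?:...)?`. The following Python sketch is an analogy, not ixml: `split_date` is a hypothetical helper, and the month is simplified to any capitalised word.

```python
import re

# day:  digit, digit?.                      ->  [0-9][0-9]?
# date: day, -" ", month, (-" ", year)?.    ->  the space lives inside the optional group
# year: digit, digit, (digit, digit)?.      ->  [0-9]{2}(?:[0-9]{2})?
DATE = re.compile(r"([0-9][0-9]?) ([A-Z][a-z]+)(?: ([0-9]{2}(?:[0-9]{2})?))?")

def split_date(text):
    """Return (day, month, year-or-None), or None when the text does not match."""
    m = DATE.fullmatch(text)
    return m.groups() if m else None

print(split_date("7 November 2024"))
print(split_date("7 November"))      # the year and its space are both absent
print(split_date("7 November "))     # a trailing space without a year is rejected
```

Because the space is inside the group, a date without a year must also omit the space before it.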
Adding a "+" after any item means "one or more":
date: day, -" "+, month, -" "+, year.
Adding a "*" means "zero or more":
date: -" "*, day, -" "+, month, -" "+, year.
Same effect but with an explicit rule for spaces:
-s: -" "+.
and use that, saying it is optional before a date:
date: s?, day, s, month, s, year.
Allow any number of dates in our input:
dates: date+.
Make the changes.
You will see later why it is inadvisable to add spaces after a date as well if you include that last rule for dates.
(In brief: if you have a list of dates, you don't need to have spaces following a date, since spaces can already precede the next one.)
You can include any Unicode character in a literal string (except control characters), but occasionally it is useful to be able to be explicit which character is intended.
For instance a tab or non-breaking space character is hard to distinguish from a space.
In such cases you can specify the character as a hexadecimal number, its position in the Unicode sequence:
tab: #9.
nonbreak: #a0.
space: " "; #9; #a0.
These match a single character in the input.
You can't include them inside strings (and don't need to): "#a0" stands for the three characters "#", "a", "0".
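The difference between an encoded character and a quoted string can be checked in Python. This is a small sketch; `encoded_char` is a made-up helper that interprets the hexadecimal number the same way, as a position in Unicode.

```python
def encoded_char(encoding):
    """Return the character for an encoding such as '#9' or '#a0'."""
    return chr(int(encoding[1:], 16))

tab = encoded_char("#9")
nonbreak = encoded_char("#a0")

print(len(tab), len(nonbreak))   # each encoding denotes exactly one character
print(len("#a0"))                # inside quotes it is three ordinary characters
print(nonbreak == " ")           # looks like a space, but is a different character
```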
Update your previous exercise to include newlines.
Most operating systems use #a. Windows uses #d, #a. Extra points for dealing with both.
A shorthand for a rule like
digit: "0"; "1"; "2"; "3"; "4"; "5"; "6"; "7"; "8"; "9".
is:
digit: ["0"-"9"].
Similarly:
letter: ["a"-"z"; "A"-"Z"].
operator: ["+-×÷"].
hexdigit: ["0123456789"; "ABCDEF"; "abcdef"].
space: [" "; #9; #a0].
These all match a single character taken from the set.
Use "~" to mean "any character except what is in the set"
string: '"', ~['"']*, '"'.
which says: a quote, any number of characters that aren't a quote, followed by a quote.
Since [ ] doesn't match anything, ~[ ] matches any single character.
Write a description of a list of strings, separated by spaces. For instance, the input
"Now" "is the" "time" ""
should produce an output like:
<strings> <string>Now</string> <string>is the</string> <string>time</string> <string/> </strings>
strings: string, (s, string)*.
string: -'"', c*, -'"'.
-c: ~['"'].
-s: -" "*.
We'll see a way of doing this differently shortly.
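As a cross-check on the exercise, the same list of strings can be recognised with a regular expression in Python. This is a rough equivalent for comparison only: the function name `strings` is chosen to mirror the rule name, and it returns a plain list rather than XML.

```python
import re

# string: -'"', c*, -'"'.    with    -c: ~['"'].
STRING = r'"([^"]*)"'
# strings: string, (s, string)*.    with    -s: -" "*.
STRINGS = re.compile(rf'{STRING}(?: *{STRING})*')

def strings(text):
    """Return the contents of each quoted string, or None if the input is not a list of strings."""
    if STRINGS.fullmatch(text) is None:
        return None
    return re.findall(STRING, text)

print(strings('"Now" "is the" "time" ""'))
```

Note how `[^"]` plays the role of the exclusion `~['"']`.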
ixml uses Unicode, which has many thousands of characters (144,697 as of Unicode 14.0).
Unicode classifies characters in different ways, and ixml uses those classifications. They are only used in character sets.
For instance, class Ll covers lower case letters, and Lu is upper case. Class L covers all characters that are considered letters (there are 130,000 of them!). So you could define a name as:
name: [L]+.
Similarly, class Nd covers all decimal digit characters, so you could define a number as:
number: [Nd]+.
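Python's standard `unicodedata` module exposes the same general categories, which makes it easy to see what a class covers. The helpers below are illustrative names, not part of ixml, and the results depend on the Unicode version bundled with your Python.

```python
import unicodedata

def in_class(ch, cls):
    """True if the general category of ch starts with cls, e.g. 'L', 'Lu' or 'Nd'."""
    return unicodedata.category(ch).startswith(cls)

def is_name(text):       # name: [L]+.
    return len(text) > 0 and all(in_class(ch, "L") for ch in text)

def is_number(text):     # number: [Nd]+.
    return len(text) > 0 and all(in_class(ch, "Nd") for ch in text)

print(unicodedata.category("a"), unicodedata.category("A"))
print(is_name("Zoë"), is_name("R2D2"))
print(is_number("2024"), is_number("\u0662\u0660\u0662\u0664"))  # Arabic-Indic digits are also Nd
```

A two-letter class such as Ll is a subset of the one-letter class L, which is why a prefix test is enough here.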
Other useful classes: P for punctuation, Zs for space characters (but not tab or newline), Sc for currency symbols like $, £, €.
Make an ixml description that accepts a number of lines, where each line contains a name and an amount of money.
amounts: person*.
person: name, s, amount, nl.
name: [L]+.
-s: [Zs; #9]+.
amount: currency, [Nd]+, ([".,"], [Nd]*)?.
@currency: [Sc].
-nl: [#a; #d]+.
We've seen repetition in two forms, zero or more:
spaces: " "*.
and one or more:
dates: date+.
Separators specify what comes between each repetition. For example, a list of numbers, separated by commas:
numbers: number++",".
To separate the numbers by commas and spaces, you could use:
numbers: number++(",", " "*).
Works with "**" as well.
Anything can follow the ++ or **: a rule name, a group, a literal, a character set.
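One way to read the separator forms is as a shorthand for "an item, then any number of (separator, item) pairs". The Python sketch below spells that out by building regular expressions; `one_or_more` and `zero_or_more` are hypothetical helpers for this illustration.

```python
import re

def one_or_more(item, sep):
    """Regex for item++sep: an item, then any number of (sep, item) pairs."""
    return f"(?:{item})(?:(?:{sep})(?:{item}))*"

def zero_or_more(item, sep):
    """Regex for item**sep: as ++, but the whole list may be absent."""
    return f"(?:{one_or_more(item, sep)})?"

# numbers: number++(",", " "*).
NUMBERS = re.compile(one_or_more(r"[0-9]+", r", *"))

print(bool(NUMBERS.fullmatch("1,2, 3")))   # separators only between items
print(bool(NUMBERS.fullmatch("1,2,")))     # a trailing separator is not allowed
print(bool(re.fullmatch(zero_or_more(r"[0-9]+", ","), "")))  # ** also accepts nothing
```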
Change the description of a list of strings, so that strings are now separated by commas and spaces. For instance, the input
"Now", "is the","time", ""
should produce output like:
<strings> <string>Now</string> <string>is the</string> <string>time</string> <string/> </strings>
strings: string**(-",", s).
string: -'"', c*, -'"'.
-c: ~['"'].
-s: -" "*.
Apart from using "?", you can specify something optional in a different way. For instance:
year: digit, digit, digit, digit; .
Or to make it more obvious:
year: digit, digit, digit, digit; empty.
-empty: .
Applied to a date without a year, this gives an empty year element, rather than an absent one:
<date> <day>7</day> <month>November</month> <year/> </date>
Define an email address, so that its structure is revealed
email: user, -"@", host.
user: ~["@"]+.
host: domain++-".".
domain: ["a"-"z"; "A"-"Z"; "0"-"9"]+.
The purpose of ixml:
For instance, our date format also accepts input like
You may not care, because you know the data is correct, and you will only be processing correct dates.
To use ixml to check the data, we can tighten up the definition, by excluding zero as a single digit, and only allowing two digit numbers up to 31:
day: "0"?, ["123456789"]; ["12"], digit; "30"; "31".
which says that a day is a single digit (excluding 0), optionally preceded by 0, or 1 or 2 followed by any digit, or 30, or 31.
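The tightened day rule translates almost directly into a regular expression, which gives a quick way to convince yourself that it accepts exactly 1 to 31 (with an optional leading zero on single digits). This Python check is a sketch alongside the ixml, not a replacement for it; `is_day` is an illustrative name.

```python
import re

# day: "0"?, ["123456789"]; ["12"], digit; "30"; "31".
DAY = re.compile(r"0?[1-9]|[12][0-9]|30|31")

def is_day(text):
    """True when the whole text is a day number the rule accepts."""
    return DAY.fullmatch(text) is not None

accepted = [n for n in range(0, 40) if is_day(str(n))]
print(accepted[0], accepted[-1], len(accepted))
print(is_day("07"), is_day("00"), is_day("32"))
```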
Try checking dates.
For extra points, add extra checking for months that can be 30 days long, those that can be 31 days long, and February.
You might also want to restrict which years are accepted.
dates: (date, nl)+.
date: s?, day, s, month, s, year.
day: "0"?, sdigit; ["12"], digit; "30"; "31".
-sdigit: ["123456789"].
-digit: ["0123456789"].
month: "January"; "February"; "March"; "April"; "May"; "June"; "July"; "August"; "September"; "October"; "November"; "December".
year: digit, digit, digit, digit.
-s: -" "+.
-nl: -[#a; #d]+.
Let's add iso dates to the format, which look like "2024-11-07"
date: day, s, month, s, year; iso.
iso: year, -"-", nmonth, -"-", day.
nmonth: digit, digit.
Now our ixml accepts both sorts of date. For iso dates we get:
<date> <iso> <year>2024</year> <nmonth>11</nmonth> <day>07</day> </iso> </date>
HTTP, the web protocol, uses a date format like this in its headers:
Tue, 15 Nov 1994 13:45:26 GMT
(It must be GMT).
Write ixml to accept it. (You don't need to check it, just accept it)
http-date: weekday, -", ", day, -" ", month, -" ", year, -" ", time, -" GMT".
weekday: "Mon"; "Tue"; "Wed"; "Thu"; "Fri"; "Sat"; "Sun".
day: dd.
-dd: digit, digit.
month: "Jan"; "Feb"; "Mar"; "Apr"; "May"; "Jun"; "Jul"; "Aug"; "Sep"; "Oct"; "Nov"; "Dec".
year: digit, digit, digit, digit.
time: h, -":", m, -":", s.
h: dd.
m: dd.
s: dd.
-digit: ["0"-"9"].
Let's start again, with another date format, the one that uses slashes like 31/12/1999:
date: day, -"/", month, -"/", year.
day: ("0"; "1"; "2"), digit; "30"; "31".
month: "0", digit; "10"; "11"; "12".
year: digit, digit, digit, digit.
-digit: ["0"-"9"].
But the US writes such dates with the month first, 12/31/1999, so let's add that as well:
date: us; world.
world: day, -"/", month, -"/", year.
us: month, -"/", day, -"/", year.
"31/12/1999" gives us:
<date> <world> <day>31</day> <month>12</month> <year>1999</year> </world> </date>
"12/31/1999" gives us:
<date> <us> <month>12</month> <day>31</day> <year>1999</year> </us> </date>
But "7/11/2024" produces:
<!-- AMBIGUOUS at date[1.1:2.1]: us[:2.1] | date[1.1:2.1]: world[:2.1] | --> <date ixml:state="ambiguous" xmlns:ixml="http://invisiblexml.org/NS"> <us> <month>7</month> <day>11</day> <year>2024</year> </us> </date>
7/11/2024 can be both a US date and a World date, and ixml can't tell the difference.
It has chosen one, warning that the result is ambiguous, and may not be what you were expecting.
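The source of the ambiguity is easy to see if you test each reading separately. The Python sketch below is an approximation under stated assumptions: `readings` is a made-up function, and the ranges 1 to 12 and 1 to 31 stand in for the month and day rules.

```python
def readings(text):
    """Return which of the rules 'us' and 'world' accept a string like 7/11/2024."""
    first, second, _year = text.split("/")
    a, b = int(first), int(second)
    found = []
    if 1 <= a <= 12 and 1 <= b <= 31:     # us: month, "/", day, "/", year.
        found.append("us")
    if 1 <= a <= 31 and 1 <= b <= 12:     # world: day, "/", month, "/", year.
        found.append("world")
    return found

for date in ("31/12/1999", "12/31/1999", "7/11/2024"):
    print(date, readings(date))
```

Any date whose first two numbers are both 12 or less has two readings, which is exactly when a processor reports ambiguity.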
Try it out
Ambiguity:
For example
dates: date+.
date: -" "*, day, -" ", month, -" ", year, -" "*.
With
7 November 2024 8 November 2024
ixml doesn't know whether the spaces between the dates follow the first, or precede the second.
Similarly, if you define
expr: number; expr, op, expr.
Then with
1+2+3
ixml doesn't know if that's
(1+2)+3
or
1+(2+3)
both of which match.
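You can count how many ways the expr rule matches an input with a few lines of Python. This is a sketch of the idea only: `count_parses` is a hypothetical helper, the input is assumed to be already split into tokens, and op is assumed to be "+" or "-" since the rule for it is not shown.

```python
from functools import lru_cache

def count_parses(tokens):
    """Count parse trees for  expr: number; expr, op, expr.  over a token list."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        """Number of ways tokens[i:j] can be read as an expr."""
        if j - i == 1:
            return 1 if tokens[i].isdigit() else 0
        total = 0
        for k in range(i + 1, j - 1):          # k is the position of the top-level op
            if tokens[k] in ("+", "-"):
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(tokens))

print(count_parses(["1", "+", "2"]))
print(count_parses(["1", "+", "2", "+", "3"]))
print(count_parses(["1", "+", "2", "+", "3", "+", "4"]))
```

The number of readings grows quickly with the length of the expression, which is why such rules are usually rewritten to force one grouping.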
Rewrite the example from the previous chapter so that instead of ixml telling you that it is ambiguous, you have three classes of dates: US, World, and ambig:
<dates> <date> <world> <day>31</day> <month>12</month> <year>1999</year> </world> </date> <date> <us> <month>12</month> <day>31</day> <year>1999</year> </us> </date> <date> <ambig> <daymonth>04</daymonth> <daymonth>10</daymonth> <year>2021</year> </ambig> </date> </dates>
date: us; world; ambig.
world: day, -"/", month, -"/", year.
us: month, -"/", day, -"/", year.
ambig: daymonth, -"/", daymonth, -"/", year.
day: {above 12 for the day is unambiguous} "1", ["3456789"]; "2", digit; "30"; "31".
daymonth: {up to 12 is ambiguous} -month.
month: "0"?, sdigit; "10"; "11"; "12".
year: digit, digit, digit, digit.
-digit: ["0"-"9"].
-sdigit: ["1"-"9"].
ixml can only describe 'context-free' languages.
For instance, Fortran strings like "6Hstring" can't be described in a general way, nor languages where the structure is only implied by the indentation, nor languages where structures are 'silently' closed off.
Luckily most languages are designed to be context-free.
However, it is possible to describe non-context-free cases up to a fixed limit, for example indentation up to a fixed nesting depth.
Nested lists
Monday
 tea
 coffee
Tuesday
Wednesday
 morning
  coffee
  biscuits
 afternoon
  tea
  cakes
Thursday
 closed
to generate XML like:
<list> <word>Monday</word> <list1> <word>tea</word> <word>coffee</word> </list1> <word>Tuesday</word> <word>Wednesday</word> <list1> <word>morning</word> <list2> <word>coffee</word> <word>biscuits</word> </list2> <word>afternoon</word> <list2> <word>tea</word> <word>cakes</word> </list2> </list1> <word>Thursday</word> <list1> <word>closed</word> </list1> </list>
list: (word, newline, list1?)+.
list1: (indent, word, newline, list2?)+.
list2: (indent, indent, word, newline, list3?)+.
list3: (indent, indent, indent, word, newline, list4?)+.
list4: (indent, indent, indent, indent, word, newline)+.
-indent: -" ".
-newline: -#d?, -#a.
word: [L]*.
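For contrast, a general-purpose language can follow indentation to any depth by keeping a stack of open lists, which is the kind of memory a context-free description lacks and the reason the rules above stop at a fixed depth. The Python sketch below assumes one space of indent per level, as in the example; `nest` is an illustrative name.

```python
def nest(lines):
    """Turn indented lines into nested Python lists (one space per level)."""
    root = []
    stack = [root]                       # stack[d] is the open list at depth d
    for line in lines:
        word = line.lstrip(" ")
        depth = len(line) - len(word)
        del stack[depth + 1:]            # silently close any deeper lists
        while len(stack) <= depth:       # open new sublists as needed
            sub = []
            stack[-1].append(sub)
            stack.append(sub)
        stack[depth].append(word)
    return root

text = """Monday
 tea
 coffee
Tuesday
Wednesday
 morning
  coffee
  biscuits
 afternoon
  tea
  cakes
Thursday
 closed"""
print(nest(text.split("\n")))
```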
As we have seen, you can hide an element:
-digit: [Nd].
or output it as an attribute:
@number: digit+.
but you can override these defaults when you use a rule.
For instance, your input contains either a single date, or a pair of dates, representing start and end dates:
2024-11-07:2024-11-08
You could write rules:
data: date; range.
range: start, -":", end.
date: year, -"-", month, -"-", day.
year: d, d, d, d.
month: d, d.
day: -month.
-d: [Nd].
start: -date.
end: -date.
Here we say that start has the same format as a date but will be output as a start element.
With input:
2024-11-07
you get:
<data> <date> <year>2024</year> <month>11</month> <day>07</day> </date> </data>
With input
2024-11-07:2024-11-08
you get:
<data> <range> <start> <year>2024</year> <month>11</month> <day>07</day> </start> <end> <year>2024</year> <month>11</month> <day>08</day> </end> </range> </data>
You can make the root element hidden, as long as its resulting content is a single element (because of XML rules). So if we replace the data rule with:
-data: date; range.
then we will get
<range> <start> <year>2024</year> <month>11</month> <day>07</day> </start> <end> <year>2024</year> <month>11</month> <day>08</day> </end> </range>
etc.
To override a rule that is hidden or an attribute to make it an element, you use "^".
input: ^digit; number.
number: digit, digit+. {2 or more digits}
-digit: [N].
then for input "2024" we get:
<input> <number>2024</number> </input>
and for input "6" we get:
<input> <digit>6</digit> </input>
Similarly, you can override with "@":
input: @digit; @number.
number: digit, digit+. {2 or more digits}
-digit: [N].
would give
<input digit="6"/>
and
<input number="2024"/>
Create an ixml definition for URLs, like
http://www.example.com/documents/index.html
Make your own decisions about the structure of the XML.
url: scheme, -":", -"//", authority, -"/", path, (-"?", query)?, (-"#", fragment)?.
scheme: letter+.
authority: host, port?.
host: sub++-".".
sub: letter+.
path: segment**-"/".
segment: ~["/?#"]*.
query: ~["# "]*.
fragment: ~[" "; #a]*.
-letter: ["a"-"z"]; ["A"-"Z"]; ["0"-"9"].
port: -":", ["0"-"9"]+.
Up to now, all output characters came from the input.
But you can add other characters, by putting a "+" before a string or encoding.
For instance, with this input, representing a series of positive and negative numbers:
100,200,(300),400
the following ixml adds an extra attribute to the root element, adds plus signs to positive numbers, and deletes the brackets and adds a minus sign to negative numbers:
data: value++-",", @source.
source: +"ixml".
value: pos; neg.
-pos: +"+", digit+.
-neg: +"-", -"(", digit+, -")".
-digit: ["0"-"9"].
The above input would produce
<data source="ixml"> <value>+100</value> <value>+200</value> <value>-300</value> <value>+400</value> </data>
Accept dates with two- or four-digit years, but output them with four digits, so that 31/12/23 and 31/12/2023 give the same result.
date: day, -"/", month, -"/", year.
day: d, d?.
month: d, d.
year: d, d, d, d; +"20", d, d.
-d: ["0"-"9"].
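The effect of the insertion in the year rule can be mimicked in a few lines of Python, which may help when checking your own answer. This is only a sketch: `parse` is a hypothetical name, a dictionary stands in for the XML, and, like the rule above, it assumes every two-digit year belongs to the 2000s.

```python
import re

# day: d, d?.   month: d, d.   year: d, d, d, d; +"20", d, d.
DATE = re.compile(r"([0-9][0-9]?)/([0-9]{2})/([0-9]{4}|[0-9]{2})")

def parse(text):
    """Return the parts of a slash date with a four-digit year, or None."""
    m = DATE.fullmatch(text)
    if m is None:
        return None
    day, month, year = m.groups()
    if len(year) == 2:
        year = "20" + year        # the inserted "20" of the second alternative
    return {"day": day, "month": month, "year": year}

print(parse("31/12/23"))
print(parse("31/12/2023"))
```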
You might like to marvel at the ixml definition of ixml in the specification, a thing of rare wonder. For instance a rule is
rule: (mark, s)?, name, s, ["=:"], s, -alts, ".".
which you can now read: an optional mark, a name, a colon or equals, and alternatives, followed by a point.
Note how this rule is defining itself.