Invisible Markup is a method for discovering structure in (textual) documents.
Follow along at http://cwi.nl/~steven/ixml/tutorial/
If we don't finish everything today, fear not! The tutorial is also designed for self-teaching.
And if that wasn't enough, there is also a video. (And these slides are online)
Example answers to all the exercises are at the back.
You have one or more documents in some textual format.
You supply a description of that format, including how it should be converted to XML.
The ixml processor then uses the description to read and convert the documents to XML.
11 June 2022
Describe this:
date: day, month, year.
And then the parts:
day: digit; digit, digit.
A digit:
digit: "0"; "1"; "2"; "3"; "4"; "5"; "6"; "7"; "8"; "9".
Literals can be single or double quoted: "0" or '0'.
month: "January"; "February"; "March"; "April"; "May"; "June"; "July"; "August"; "September"; "October"; "November"; "December".
And finally
year: digit, digit, digit, digit.
11 June 2022
Describe this:
date: day, month, year.
should be:
date: day, " ", month, " ", year.
<date> <day> <digit>1</digit> <digit>1</digit> </day> <month>June</month> <year> <digit>2</digit> <digit>0</digit> <digit>2</digit> <digit>2</digit> </year> </date>
Now you.
We are not interested in the digit elements. So change
digit: "0"; "1"; "2"; "3"; "4"; "5"; "6"; "7"; "8"; "9".
to
-digit: "0"; "1"; "2"; "3"; "4"; "5"; "6"; "7"; "8"; "9".
giving
<date> <day>11</day> <month>June</month> <year>2022</year> </date>
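For readers who know regular expressions, here is a rough Python analogue of the date description with the digits hidden. It is only a sketch for comparison, not ixml and not how an ixml processor works; the names MONTHS, DATE and date_to_xml are made up for this illustration. Unlike the ixml at this point in the tutorial, the sketch also drops the separating spaces.

```python
import re
import xml.etree.ElementTree as ET

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# day: one or two digits; month: one of the names; year: four digits.
DATE = re.compile(
    r"(?P<day>[0-9]{1,2}) (?P<month>" + "|".join(MONTHS) + r") (?P<year>[0-9]{4})"
)

def date_to_xml(text):
    """Return XML for a date like '11 June 2022', or None if it does not match."""
    m = DATE.fullmatch(text)
    if m is None:
        return None
    date = ET.Element("date")
    for name in ("day", "month", "year"):
        # Each named group becomes a child element, like a visible ixml rule.
        ET.SubElement(date, name).text = m.group(name)
    return ET.tostring(date, encoding="unicode")

print(date_to_xml("11 June 2022"))
# <date><day>11</day><month>June</month><year>2022</year></date>
```

The difference in effort is the point: in ixml the element names and nesting come from the rule names, with no tree-building code to write.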
Now you.
Changing
year: digit, digit, digit, digit.
to
@year: digit, digit, digit, digit.
gives
<date year='2022'> <day>11</day> <month>June</month> </date>
Now you: make them all attributes
If the last exercise worked, you should have got an output that looks like this:
<date day='11' month='June' year='2022'> </date>
The date element is not empty because all input ends up in the XML, including the two spaces. Exclude them by changing:
date: day, " ", month, " ", year.
to
date: day, -" ", month, -" ", year.
Try it.
A day is:
day: digit; digit, digit.
but you can also write
day: digit, digit?.
Do it. Does it matter which one you make optional? Make the year optional as well.
If you make the year optional like this:
date: day, -" ", month, -" ", year?.
the year is optional, but the space before it isn't.
You could make a separate rule for the year:
date: day, -" ", month, optional-year?.
-optional-year: -" ", year.
Better is to use grouping:
date: day, -" ", month, (-" ", year)?.
If you run this on a date with no year, no year element is produced:
<date> <day>11</day> <month>June</month> </date>
Exercise: 2 digit year with grouping
Adding a "+" after any item means "one or more":
date: day, -" "+, month, -" "+, year.
Adding a "*" means "zero or more":
date: -" "*, day, -" "+, month, -" "+, year.
Same effect but with an explicit rule for spaces:
-s: -" "+.
and use that, saying it is optional before a date:
date: s?, day, s, month, s, year.
Allow any number of dates in our input:
dates: date+.
Exercise: try it
You can include any Unicode character in a literal string (except control characters), but occasionally it is useful to be able to be explicit which character is intended.
For instance a tab or non-breaking space character is hard to distinguish from a space.
In such cases you can specify the character as a hexadecimal number, its position in the Unicode sequence:
tab: #9.
nonbreak: #a0.
space: " "; #9; #a0.
These match a single character in the input.
You can't include them inside strings (and don't need to): "#a0" stands for the three characters "#", "a", "0".
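If you think in code points, the encodings above correspond to the following Python sketch. It is only an analogy; the names TAB, NONBREAK, SPACES and is_space are invented for the illustration.

```python
TAB = chr(0x9)        # ixml: #9
NONBREAK = chr(0xA0)  # ixml: #a0

# space: " "; #9; #a0.   Any one of these three characters.
SPACES = {" ", TAB, NONBREAK}

def is_space(ch):
    """True if the single character ch is a space, tab or non-breaking space."""
    return ch in SPACES

# Inside a string there is no encoding: "#a0" is just three characters.
print(len("#a0"))                               # 3
print([hex(ord(ch)) for ch in sorted(SPACES)])  # ['0x9', '0x20', '0xa0']
```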
Exercise: add newlines (#a) to the previous example. Windows uses #d, #a. Extra points for dealing with both.
A shorthand for a rule like
digit: "0"; "1"; "2"; "3"; "4"; "5"; "6"; "7"; "8"; "9".
is:
digit: ["0"-"9"].
Similarly:
letter: ["a"-"z"; "A"-"Z"].
operator: ["+-×÷"].
hexdigit: ["0123456789"; "ABCDEF"; "abcdef"].
space: [" "; #9; #a0].
These all match a single character taken from the set.
ixml uses Unicode, which has many thousands of characters (currently 144,697).
Unicode classifies characters in different ways, and ixml uses those classifications. They are only used in character sets.
For instance, class Ll covers lower case letters, and Lu is upper case. Class L covers all characters that are considered letters (there are 130,000 of them!). So you could define a name as:
name: [L]+.
Similarly, class Nd covers all decimal digit characters, so you could define a number as:
number: [Nd]+.
Other useful classes: P for punctuation, Zs for space characters (but not tab or newline), Sc for currency symbols like $, £, €.
Exercise: Make an ixml description that accepts a number of lines, where each line contains a name and an amount of money.
amounts: person*.
person: name, s, amount, nl.
name: [L]+.
-s: [Zs; #9]+.
amount: currency, [Nd]+, ([".,"], [Nd]*)?.
@currency: [Sc].
-nl: [#a; #d]+.
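The Unicode classes used above are the general categories that Python exposes through its standard unicodedata module, so you can check which class a character falls in. This is a small sketch for exploration; the helper name in_class is invented, and it assumes that a one-letter class such as L covers every category beginning with that letter.

```python
import unicodedata

def in_class(ch, cls):
    """True if ch's general category is cls, or starts with cls (so 'L' covers 'Ll', 'Lu', ...)."""
    return unicodedata.category(ch).startswith(cls)

# Print the category of a few characters mentioned in the tutorial.
for ch in ["a", "A", "7", "$", "€", " ", "\t"]:
    print(repr(ch), unicodedata.category(ch))

print(in_class("é", "L"))    # an accented letter is still a letter
print(in_class("\t", "Zs"))  # tab is a control character, not a Zs space
```

That last line is why the answer above defines s as [Zs; #9]: the tab has to be added explicitly.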
Use "~" to mean "any character except what is in the set"
string: '"', ~['"']*, '"'.
which says: a quote, any number of characters that aren't a quote, followed by a quote.
Since [] doesn't match anything, ~[] matches any single character.
Exercise: list of strings:
"Now" "is the" "time" ""
should produce an output like:
<strings> <string>Now</string> <string>is the</string> <string>time</string> <string/> </strings>
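If you are used to regular expressions, the exclusion ~['"'] plays the same role as a negated character class. The Python sketch below reads the example input that way. It is an analogy, not an ixml answer to the exercise; the names STRING, STRINGS and parse_strings are invented, and it assumes the strings are separated by exactly one space, which the exercise does not specify.

```python
import re

# '"', ~['"']*, '"'   a quote, any non-quote characters, a quote
STRING = r'"[^"]*"'

# One or more strings, separated by a single space (an assumption of this sketch).
STRINGS = re.compile(STRING + r"(?: " + STRING + r")*")

def parse_strings(text):
    """Return the contents of each string in text, or None if text is not a list of strings."""
    if STRINGS.fullmatch(text) is None:
        return None
    # Strip the surrounding quotes, as -'"' would in ixml.
    return [s[1:-1] for s in re.findall(STRING, text)]

print(parse_strings('"Now" "is the" "time" ""'))
# ['Now', 'is the', 'time', '']
```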
We've seen repetition in two forms, zero or more, and one or more:
spaces: " "*.
dates: date+.
Separators specify what comes between each repetition. For example, a list of numbers, separated by commas:
numbers: number++",".
To separate the numbers by commas and spaces, you could use:
numbers: number++(",", " "*).
Works with "**" as well.
Anything can follow the ++ or **: a rule name, a group, a literal, a character set.
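One way to read a separated repetition: x++sep matches the same input as x, (sep, x)*. The Python sketch below spells that out for the comma-and-spaces example. It is an analogy only; the names NUMBERS and parse_numbers are invented, and a number is assumed to be a run of ASCII digits.

```python
import re

# number++(",", " "*)   read as   number, ((",", " "*), number)*
NUMBERS = re.compile(r"[0-9]+(?:, *[0-9]+)*")

def parse_numbers(text):
    """Return the numbers in a comma-separated list, or None if text is not such a list."""
    if NUMBERS.fullmatch(text) is None:
        return None
    return re.findall(r"[0-9]+", text)

print(parse_numbers("1,2,  3"))  # ['1', '2', '3']
print(parse_numbers("1,,2"))     # None: a separator must sit between two numbers
```

The separator only ever appears between items, so a trailing comma is rejected too.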
Exercise: list of strings separated by commas
Apart from using "?", you can specify something optional in a different way. For instance:
year: digit, digit, digit, digit; .
Or to make it more obvious:
year: digit, digit, digit, digit; empty.
-empty: .
Applied to a date without a year, this gives an empty year element, rather than an absent one:
<date> <day>11</day> <month>June</month> <year/> </date>
Exercise: define an email address, so that its structure is revealed
email: user, -"@", host.
user: ~["@"]+.
host: domain++-".".
domain: ["a"-"z"; "A"-"Z"; "0"-"9"]+.
The purpose of ixml:
For instance, our date format also accepts input like
You may not care, because you know the data is correct, and you will only be processing correct dates.
To use ixml to check the data, we can tighten up the definition, by excluding zero as a single digit, and only allowing two digit numbers up to 31:
day: "0"?, ["123456789"]; ["12"], digit; "30"; "31".
which says that a day is a single digit (excluding 0), optionally preceded by 0; or 1 or 2 followed by any digit; or 30; or 31.
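To convince yourself that the tightened day rule accepts what you expect, you can transcribe it into a regular expression and try every candidate. This Python sketch is a cross-check, not ixml; the names DAY and is_day are invented for the illustration.

```python
import re

# day: "0"?, ["123456789"]; ["12"], digit; "30"; "31".
DAY = re.compile(r"0?[1-9]|[12][0-9]|30|31")

def is_day(text):
    """True if the whole of text is a valid day under the tightened rule."""
    return DAY.fullmatch(text) is not None

# Which of the numbers 0..99, written without a leading zero, are accepted?
accepted = [str(n) for n in range(100) if is_day(str(n))]
print(accepted[0], accepted[-1], len(accepted))  # 1 31 31

# Leading zeros are allowed on single digits only, and 0 itself is rejected.
print(is_day("07"), is_day("00"), is_day("32"))  # True False False
```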
Exercise: restricted dates
Let's add ISO dates to the format, which look like "2022-06-11":
date: day, s, month, s, year; iso.
iso: year, -"-", nmonth, -"-", day.
nmonth: digit, digit.
Now our ixml accepts both sorts of date. For iso dates we get:
<date> <iso> <year>2022</year> <nmonth>06</nmonth> <day>11</day> </iso> </date>
Exercise: HTTP dates
Let's start again, with another date format, the one that uses slashes like 31/12/1999:
date: day, -"/", month, -"/", year.
day: ("0"; "1"; "2"), digit; "30"; "31".
month: "0", digit; "10"; "11"; "12".
year: digit, digit, digit, digit.
-digit: "0"; "1"; "2"; "3"; "4"; "5"; "6"; "7"; "8"; "9".
But the US writes such dates with the month first, 12/31/1999, so let's add that as well:
date: us; world.
world: day, -"/", month, -"/", year.
us: month, -"/", day, -"/", year.
"31/12/1999" gives us:
<date> <world> <day>31</day> <month>12</month> <year>1999</year> </world> </date>
"12/31/1999" gives us:
<date> <us> <month>12</month> <day>31</day> <year>1999</year> </us> </date>
But "11/06/2022" produces:
<!-- AMBIGUOUS at date[1.1:2.1]: us[:2.1] | date[1.1:2.1]: world[:2.1] | --> <date ixml:state="ambiguous" xmlns:ixml="http://invisiblexml.org/NS"> <us> <month>11</month> <day>06</day> <year>2022</year> </us> </date>
11/06/2022 can be both a US date and a World date, and ixml can't tell the difference.
It has chosen one, warning that the result is ambiguous, and may not be what you were expecting.
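To see where the ambiguity comes from, you can list every reading of a slashed date by hand. The Python sketch below tries both orders and reports all that fit. It only illustrates the idea and is not how an ixml processor finds parses; the names SLASHED, DAY, MONTH and readings are invented.

```python
import re

SLASHED = re.compile(r"([0-9]{2})/([0-9]{2})/([0-9]{4})")
DAY = re.compile(r"[012][0-9]|30|31")   # day: ("0"; "1"; "2"), digit; "30"; "31".
MONTH = re.compile(r"0[0-9]|10|11|12")  # month: "0", digit; "10"; "11"; "12".

def readings(text):
    """Return every way text can be read, as a list of (kind, fields) pairs."""
    m = SLASHED.fullmatch(text)
    if m is None:
        return []
    first, second, year = m.groups()
    found = []
    # us: month, "/", day, "/", year.
    if MONTH.fullmatch(first) and DAY.fullmatch(second):
        found.append(("us", {"month": first, "day": second, "year": year}))
    # world: day, "/", month, "/", year.
    if DAY.fullmatch(first) and MONTH.fullmatch(second):
        found.append(("world", {"day": first, "month": second, "year": year}))
    return found

for text in ["31/12/1999", "12/31/1999", "11/06/2022"]:
    print(text, [kind for kind, _ in readings(text)])
# 31/12/1999 ['world']
# 12/31/1999 ['us']
# 11/06/2022 ['us', 'world']
```

More than one reading means the input is ambiguous; an ixml processor reports that and serialises one of them.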
Exercise: try it.
Ambiguity:
For example
dates: date+.
date: -" "*, day, -" ", month, -" ", year, -" "*.
With
10 June 2022 11 June 2022
ixml doesn't know whether the spaces between the dates follow the first, or precede the second.
Similarly, if you define
expr: number; expr, op, expr.
Then with
1+2+3
ixml doesn't know if that's
(1+2)+3
or
1+(2+3)
both of which match.
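The number of readings grows quickly as the expression gets longer. As a side calculation (a sketch, not part of ixml; the function name count_parses is invented), this Python function counts how many different trees the expr rule allows for n numbers joined by operators, by choosing each operator in turn as the top-level split.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_parses(n):
    """Count parse trees for n numbers under  expr: number; expr, op, expr."""
    if n == 1:
        return 1  # a single number can only be read one way
    # Pick the top-level operator: k numbers on its left, n - k on its right.
    return sum(count_parses(k) * count_parses(n - k) for k in range(1, n))

print(count_parses(3))  # 2: (1+2)+3 and 1+(2+3)
print(count_parses(4))  # 5
print(count_parses(5))  # 14
```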
Exercise: make the date ambiguity explicit.
date: us; world; ambig.
world: day, -"/", month, -"/", year.
us: month, -"/", day, -"/", year.
ambig: sday, -"/", month, -"/", year.
day: {above 12 for the day is unambiguous} "1", ["3456789"]; "2", digit; "30"; "31".
sday: {up to 12 is ambiguous} "0"?, sdigit; "10"; "11"; "12".
month: "0", sdigit; "10"; "11"; "12".
year: digit, digit, digit, digit.
-digit: ["0"-"9"].
-sdigit: ["1"-"9"].
ixml can only describe 'context-free' languages.
For instance, Fortran strings like "6Hstring" can't be described in a general way, nor languages where the structure is only implied by the indentation, nor languages where structures are 'silently' closed off.
Luckily most languages are designed to be context-free.
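The Fortran example shows what goes wrong: the number at the front says how many characters belong to the string, so the reader has to count. General-purpose code can do that easily, as in the Python sketch below, but a context-free description has no way to say "as many characters as the number just read". The function name read_hollerith is invented for this illustration.

```python
import re

def read_hollerith(text):
    """Split text such as '6Hstring' into (string, rest), or return None if it is malformed."""
    m = re.match(r"([0-9]+)H", text)
    if m is None:
        return None
    length = int(m.group(1))             # the count that the grammar cannot use
    body = text[m.end():m.end() + length]
    if len(body) != length:
        return None                      # fewer characters than promised
    return body, text[m.end() + length:]

print(read_hollerith("6Hstring"))  # ('string', '')
print(read_hollerith("3Habcdef"))  # ('abc', 'def')
print(read_hollerith("6Habc"))     # None
```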
Exercise: nested lists
Monday
 tea
 coffee
Tuesday
Wednesday
 morning
  coffee
  biscuits
 afternoon
  tea
  cakes
Thursday
 closed
list: ( word, newline, list1?)+.
list1: ( indent, word, newline, list2?)+.
list2: ( indent, indent, word, newline, list3?)+.
list3: ( indent, indent, indent, word, newline, list4?)+.
list4: (indent, indent, indent, indent, word, newline )+.
-indent: -" ".
-newline: -#d?, -#a.
word: [L]*.
As we have seen, you can hide an element:
-digit: [Nd].
or output it as an attribute:
@number: digit+.
but you can override these defaults when you use a rule.
For instance, your input contains either a single date, or a pair of dates, representing start and end dates:
2022-06-09:2022-06-11
You could write rules:
data: date; range.
range: start, ":", end.
date: year, "-", month, "-", day.
year: d, d, d, d.
month: d, d.
day: -month.
-d: [Nd].
start: -date.
end: -date.
Here we say that start has the same format as a date but will be output as a start element.
With input:
2022-06-11
you get:
<data> <date> <year>2022</year>- <month>06</month>- <day>11</day> </date> </data>
With input
2022-06-09:2022-06-11
you get:
<data> <range> <start> <year>2022</year>- <month>06</month>- <day>09</day> </start>: <end> <year>2022</year>- <month>06</month>- <day>11</day> </end> </range> </data>
You can make the root element hidden, as long as its resulting content is a single element (because of XML rules). So if we replace the data rule with:
-data: date; range.
then we will get
<range> <start> <year>2022</year>- <month>06</month>- <day>09</day> </start>: <end> <year>2022</year>- <month>06</month>- <day>11</day> </end> </range>
etc.
To override a rule that is hidden or an attribute to make it an element, you use "^".
input: ^digit; number.
number: digit, digit+. {2 or more digits}
-digit: [N].
then for input "2022" we get:
<input> <number>2022</number> </input>
and for input "6" we get:
<input> <digit>6</digit> </input>
Similarly, you can override with @:
input: @digit; @number.
number: digit, digit+. {2 or more digits}
-digit: [N].
would give
<input digit="6"/>
and
<input number="2022"/>
Exercise: URLs
Up to now, all output characters came from the input.
But you can add other characters, by putting a "+" before a string or encoding.
For instance, with this input, representing a series of positive and negative numbers:
100,200,(300),400
the following ixml adds an extra attribute to the root element, adds plus signs to positive numbers, and deletes the brackets and adds a minus sign to negative numbers:
data: value++-",", @source.
source: +"ixml".
value: pos; neg.
-pos: +"+", digit+.
-neg: +"-", -"(", digit+, -")".
-digit: ["0"-"9"].
The above input would produce
<data source="ixml"> <value>+100</value> <value>+200</value> <value>-300</value> <value>+400</value> </data>
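For comparison, here is the same transformation written by hand in Python: delete the brackets, and insert a sign that was not in the input. It is a sketch for readers who want to see the effect in ordinary code, not an ixml implementation; the function name normalise is invented.

```python
import re

def normalise(text):
    """Turn '100,200,(300),400' into signed values, or return None on bad input."""
    values = []
    for item in text.split(","):
        negative = re.fullmatch(r"\(([0-9]+)\)", item)
        if negative:
            # -neg: +"-", -"(", digit+, -")".   drop the brackets, insert "-"
            values.append("-" + negative.group(1))
        elif re.fullmatch(r"[0-9]+", item):
            # -pos: +"+", digit+.   insert "+"
            values.append("+" + item)
        else:
            return None
    return values

print(normalise("100,200,(300),400"))
# ['+100', '+200', '-300', '+400']
```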
Exercise: accept dates with two or four digit years, but output them with four digits, so that 31/12/23 and 31/12/2023 give the same result.
date: day, "/", month, "/", year.
day: d, d?.
month: d, d.
year: d, d, d, d; +"20", d, d.
-d: ["0"-"9"].
You might like to marvel at the ixml definition of ixml in the specification, a thing of rare wonder. For instance a rule is
rule: (mark, s)?, name, s, ["=:"], s, -alts, ".".
which you can now read: an optional mark, a name, a colon or equals, and alternatives, followed by a point.
Note how this rule is defining itself.