Invisible XML (ixml) is a method for treating non-XML documents as if they were XML.
ixml is a work in progress: what you learn here is close to the final version, but may differ from it in detail.
The software you will be using is not yet the final version, nor industrial-strength. Please bear with us!
Follow along at http://www.cwi.nl/~steven/ixml/tutorial/
You have one or more documents in a non-XML format.
You supply an ixml description of that format, that includes how it should be converted to XML.
The ixml processor then uses the description to read and convert the documents to XML.
4 November 2021
Describe this:
date: day, month, year.
And then the parts:
day: digit; digit, digit.
A digit:
digit: "0"; "1"; "2"; "3"; "4"; "5"; "6"; "7"; "8"; "9".
Literals can be single or double quoted: "0" or '0'.
month: "January"; "February"; "March"; "April"; "May"; "June"; "July"; "August"; "September"; "October"; "November"; "December".
And finally
year: digit, digit, digit, digit.
But the input also contains spaces, so
date: day, month, year.
should be:
date: day, " ", month, " ", year.
<date> <day> <digit>4</digit> </day> <month>November</month> <year> <digit>2</digit> <digit>0</digit> <digit>2</digit> <digit>1</digit> </year> </date>
Now you.
We are not interested in the digit elements, so change
digit: "0"; "1"; "2"; "3"; "4"; "5"; "6"; "7"; "8"; "9".
to
-digit: "0"; "1"; "2"; "3"; "4"; "5"; "6"; "7"; "8"; "9".
giving
<date> <day>4</day> <month>November</month> <year>2021</year> </date>
Now you.
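To make the mapping from rules to XML concrete, here is a minimal Python sketch of the same date grammar with digit hidden. It is only an illustration, not how an ixml processor works internally: the rules are hard-coded as one regular expression, and the names MONTHS, DATE and date_to_xml are invented for this example.

```python
import re
import xml.etree.ElementTree as ET

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# date: day, " ", month, " ", year.   (with -digit, so no <digit> elements)
DATE = re.compile(
    rf"(?P<day>[0-9]{{1,2}}) (?P<month>{MONTHS}) (?P<year>[0-9]{{4}})")

def date_to_xml(text):
    """Return the serialised <date> element for text, or None if it does not match."""
    m = DATE.fullmatch(text)
    if m is None:
        return None
    date = ET.Element("date")
    for name in ("day", "month", "year"):
        child = ET.SubElement(date, name)
        child.text = m.group(name)
        # The " " literals are not hidden, so they stay in the output as text.
        if name != "year":
            child.tail = " "
    return ET.tostring(date, encoding="unicode")

print(date_to_xml("4 November 2021"))
# <date><day>4</day> <month>November</month> <year>2021</year></date>
```

Note that the spaces between the parts survive as text inside the date element, just as every character matched by an unhidden literal does in ixml.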
Changing
year: digit, digit, digit, digit.
to
@year: digit, digit, digit, digit.
gives
<date year='2021'> <day>4</day> <month>November</month> </date>
Now you: make them all attributes.
If the last exercise worked, you should have got an output that looks like this:
<date day='4' month='November' year='2021'> </date>
The date element is not empty: it still contains the two spaces, because all input ends up in the XML unless you exclude it. Exclude the spaces by changing:
date: day, " ", month, " ", year.
to
date: day, -" ", month, -" ", year.
Try it.
A day is:
day: digit; digit, digit.
but you can also write
day: digit, digit?.
Do it. Does it matter which one you make optional? Make the year optional as well.
Exercise.
If you make the year optional like this:
date: day, -" ", month, -" ", year?.
the year is optional, but the space before it isn't.
You could make a separate rule for the year:
date: day, -" ", month, optional-year?.
-optional-year: -" ", year.
Better is to use grouping:
date: day, -" ", month, (-" ", year)?.
If you run this on a date with no year, no year element is produced:
<date> <day>4</day> <month>November</month> </date>
Exercise: 2 digit year with grouping
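The effect of grouping can be sketched in Python with an optional regex group. This is only an analogy, under two assumptions: the month is simplified to a capitalised word, and the names DATE and parse_date are made up for the example.

```python
import re

# date: day, -" ", month, (-" ", year)?.
# The space and the year sit in one optional group, so they appear or vanish together.
DATE = re.compile(
    r"(?P<day>[0-9]{1,2}) (?P<month>[A-Z][a-z]+)(?: (?P<year>[0-9]{4}))?")

def parse_date(text):
    """Return a dict of the parts that were present, or None if text is not a date."""
    m = DATE.fullmatch(text)
    if m is None:
        return None
    # groupdict() maps unmatched optional groups to None; drop those entries.
    return {name: value for name, value in m.groupdict().items()
            if value is not None}

print(parse_date("4 November 2021"))  # day, month and year
print(parse_date("4 November"))       # no 'year' key at all
```

A trailing space with no year after it is rejected, because the space is only allowed as part of the optional group.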
Adding a "+" after any item means "one or more":
date: day, -" "+, month, -" "+, year.
Adding a "*" means "zero or more":
date: -" "*, day, -" "+, month, -" "+, year.
Same effect but with an explicit rule for spaces:
-s: -" "+.
and use that, saying it is optional before a date:
date: s?, day, s, month, s, year.
Allow any number of dates in our input:
dates: date+.
Exercise: try it
A shorthand for a rule like
digit: "0"; "1"; "2"; "3"; "4"; "5"; "6"; "7"; "8"; "9".
is:
digit: ["0"-"9"].
Similarly:
letter: ["a"-"z"; "A"-"Z"].
operator: ["+-×÷"].
hexdigit: ["0123456789"; "ABCDEF"; "abcdef"].
These all match a single character taken from the set.
Use "~" to mean "any character except what is in the set"
string: '"', ~['"']*, '"'.
which says: a quote, any number of characters that aren't a quote, followed by a quote.
Since [] doesn't match anything, ~[] matches any single character.
Exercise: list of strings
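Readers who know regular expressions may find it helpful to compare exclusions with negated character classes. The Python sketch below is only an analogy; STRING, ANY and is_string are names invented for it.

```python
import re

# string: '"', ~['"']*, '"'.
STRING = re.compile(r'"[^"]*"')

# ~[] : any single character (re.DOTALL so that a newline counts too)
ANY = re.compile(r".", re.DOTALL)

def is_string(text):
    """True if the whole of text is a quote, non-quote characters, and a quote."""
    return STRING.fullmatch(text) is not None

print(is_string('"hello"'))   # a well-formed string
print(is_string('"a"b"'))     # a quote inside the string is not allowed
```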
We've seen repetition in two forms, zero or more, and one or more:
spaces: " "*.
dates: date+.
Separators specify what comes between each repetition. For example, a list of numbers, separated by commas:
numbers: number+",".
To separate the numbers by commas and spaces, you could use:
numbers: number+(",", " "*).
Works with "*" as well.
Anything can follow the + or *: a rule name, a group, a literal, a character set.
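One way to think about separators is as "an item, then zero or more (separator, item) pairs". The Python sketch below builds an equivalent regular expression on that assumption; repeat_with_separator and the variable names are hypothetical helpers for this illustration only.

```python
import re

def repeat_with_separator(item, separator, allow_empty=False):
    """Build a regex for item+separator, or item*separator if allow_empty is set."""
    body = f"(?:{item})(?:(?:{separator})(?:{item}))*"
    return re.compile(f"(?:{body})?" if allow_empty else body)

NUMBER = r"[0-9]+"

numbers = repeat_with_separator(NUMBER, r",")        # numbers: number+",".
spaced = repeat_with_separator(NUMBER, r", *")       # numbers: number+(",", " "*).
maybe = repeat_with_separator(NUMBER, r",", True)    # numbers: number*",".

print(bool(numbers.fullmatch("1,22,333")))   # separators only between items
print(bool(numbers.fullmatch("1,2,")))       # a trailing separator is rejected
```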
Exercise: list of strings separated by commas
Apart from using "?", you can specify something optional in a different way. For instance:
year: digit, digit, digit, digit; .
Or to make it more obvious:
year: digit, digit, digit, digit; empty.
-empty: .
Applied to a date without a year, this gives an empty year element rather than an absent one:
<date> <day>4</day> <month>November</month> <year/> </date>
Exercise: email address
The purpose of ixml:
For instance, our date format also accepts input like
You may not care, because you know the data is correct, and you will only be processing correct dates.
To use ixml to check the data, we can tighten up the definition, by excluding zero as a single digit, and only allowing two digit numbers up to 31:
day: "0"?, ["123456789"]; ["12"], digit; "30"; "31".
which says that a day is a single digit (excluding 0), optionally preceded by 0; or 1 or 2 followed by any digit; or 30; or 31.
Exercise: restricted dates
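A quick way to convince yourself that the tightened day rule accepts exactly 1 to 31 is to translate it into a regular expression and try every candidate. This Python sketch does that; DAY and is_day are names made up for the check.

```python
import re

# day: "0"?, ["123456789"]; ["12"], digit; "30"; "31".
DAY = re.compile(r"0?[1-9]|[12][0-9]|30|31")

def is_day(text):
    """True if the whole of text is a valid day number under the rule above."""
    return DAY.fullmatch(text) is not None

# Try every number from 0 to 99 written without a leading zero.
accepted = [d for d in range(0, 100) if is_day(str(d))]
print(accepted == list(range(1, 32)))   # True

print(is_day("04"), is_day("00"), is_day("32"))   # True False False
```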
Let's add ISO dates to the format, which look like "2021-11-04":
date: day, s, month, s, year; iso.
iso: year, -"-", nmonth, -"-", day.
nmonth: digit, digit.
Now our ixml accepts both sorts of date. For the ISO date 2021-11-04 we get:
<date> <iso> <year>2021</year> <nmonth>11</nmonth> <day>04</day> </iso> </date>
Exercise: http dates
Let's start again, with another date format, the one that uses slashes like 31/12/1999:
date: day, -"/", month, -"/", year.
day: ("0"; "1"; "2"), digit; "30"; "31".
month: "0", digit; "10"; "11"; "12".
year: digit, digit, digit, digit.
-digit: "0"; "1"; "2"; "3"; "4"; "5"; "6"; "7"; "8"; "9".
But the US writes such dates with the month first, 12/31/1999, so let's add that as well:
date: us; world.
world: day, -"/", month, -"/", year.
us: month, -"/", day, -"/", year.
"31/12/1999" gives us:
<date> <world> <day>31</day> <month>12</month> <year>1999</year> </world> </date>
"12/31/1999" gives us:
<date> <us> <month>12</month> <day>31</day> <year>1999</year> </us> </date>
But "04/10/2021" produces:
<!-- AMBIGUOUS at date[1.1:2.1]: us[:2.1] | date[1.1:2.1]: world[:2.1] | --> <date ixml:state="ambiguous" xmlns:ixml="http://invisiblexml.org/NS"> <us> <month>04</month> <day>10</day> <year>2021</year> </us> </date>
04/10/2021 can be both a US date and a World date, and ixml can't tell the difference.
It has chosen one, warning that the result is ambiguous, and may not be what you were expecting.
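The ambiguity can be seen by testing each alternative of date separately. The Python sketch below does this with regular expressions that mirror the day, month and year rules; RULES and readings are names invented for the illustration, and a real ixml processor finds ambiguity while parsing rather than by trying alternatives one by one.

```python
import re

DAY = r"(?:[012][0-9]|30|31)"     # day: ("0"; "1"; "2"), digit; "30"; "31".
MONTH = r"(?:0[0-9]|10|11|12)"    # month: "0", digit; "10"; "11"; "12".
YEAR = r"[0-9]{4}"                # year: digit, digit, digit, digit.

RULES = {
    "world": re.compile(f"{DAY}/{MONTH}/{YEAR}"),
    "us": re.compile(f"{MONTH}/{DAY}/{YEAR}"),
}

def readings(text):
    """Return the name of every alternative of date that matches all of text."""
    return [name for name, rule in RULES.items() if rule.fullmatch(text)]

print(readings("31/12/1999"))   # ['world']
print(readings("12/31/1999"))   # ['us']
print(readings("04/10/2021"))   # ['world', 'us']
```

Whenever more than one name comes back, the input is ambiguous for this grammar.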
Exercise: try it.
Ambiguity:
For example
dates: date+.
date: -" "*, day, -" ", month, -" ", year, -" "*.
With
4 November 2021 5 November 2021
ixml doesn't know whether the spaces between the dates follow the first, or precede the second.
Similarly, if you define
expr: number; expr, op, expr.
Then with
1+2+3
ixml doesn't know if that's
(1+2)+3
or
1+(2+3)
both of which match.
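To see how quickly this kind of ambiguity grows, the following Python sketch counts the parse trees that the expr rule allows for an input. It assumes single-character tokens and that op is "+" or "-" (the rule for op is not given above); count_parses is a hypothetical helper, not part of any ixml tool.

```python
from functools import lru_cache

def count_parses(tokens):
    """Count the parse trees of tokens under  expr: number; expr, op, expr."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def expr(i, j):
        # Number of ways tokens[i:j] can be read as an expr.
        if j - i == 1:
            return 1 if tokens[i].isdigit() else 0
        total = 0
        for k in range(i + 1, j - 1):      # candidate position of the top-level op
            if tokens[k] in ("+", "-"):
                total += expr(i, k) * expr(k + 1, j)
        return total

    return expr(0, len(tokens))

print(count_parses("1+2+3"))     # 2: (1+2)+3 and 1+(2+3)
print(count_parses("1+2+3+4"))   # 5
```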
Exercise: make the date ambiguity explicit.
ixml can only describe 'context-free' languages.
For instance, Fortran strings like "6Hstring" can't be described in a general way, nor languages where the structure is only implied by the indentation, nor languages where structures are 'silently' closed off.
Luckily most languages are designed to be context-free.
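The Fortran case shows what "not context-free" means in practice: how many characters belong to the string depends on a number read earlier in the input. Hand-written code handles this easily, as in the Python sketch below (read_hollerith is a name invented here), but a fixed set of context-free rules cannot express it for arbitrary lengths.

```python
def read_hollerith(text):
    """Read one counted string such as '6Hstring' from the start of text.

    Returns (string, rest), or None if text does not start with a counted string.
    """
    i = 0
    while i < len(text) and text[i] in "0123456789":
        i += 1
    if i == 0 or i >= len(text) or text[i] != "H":
        return None
    count = int(text[:i])      # the length is data, not part of the grammar
    start = i + 1
    if start + count > len(text):
        return None
    return text[start:start + count], text[start + count:]

print(read_hollerith("6Hstring"))   # ('string', '')
print(read_hollerith("3Habcdef"))   # ('abc', 'def')
```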
Exercise: nested expressions
ixml uses Unicode, which has many thousands of characters (currently 144,697).
Unicode classifies characters in different ways, and ixml uses those classifications. They are only used in character sets.
For instance, class Ll covers lower case letters, and Lu is upper case. Class L covers all characters that are considered letters (there are 130,000 of them!). So you could define a name as:
name: [L]+.
Similarly, class Nd covers all decimal digit characters, so you could define a number as:
number: [Nd]+.
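If you want to check which class a character belongs to, Python's standard unicodedata module reports the same Unicode general categories. The sketch below uses it to imitate the two rules above; in_class, is_name and is_number are names made up for the illustration, and the exact results depend on the Unicode version of your Python installation.

```python
import unicodedata

def in_class(ch, cls):
    """True if ch's general category starts with cls, e.g. 'L', 'Ll' or 'Nd'."""
    return unicodedata.category(ch).startswith(cls)

def is_name(text):        # name: [L]+.
    return len(text) > 0 and all(in_class(ch, "L") for ch in text)

def is_number(text):      # number: [Nd]+.
    return len(text) > 0 and all(in_class(ch, "Nd") for ch in text)

print(unicodedata.category("a"), unicodedata.category("A"))   # Ll Lu
print(is_name("Gr\u00fc\u00dfe"))                  # letters outside ASCII count too
print(is_number("\u0662\u0660\u0662\u0661"))       # Arabic-Indic digits are class Nd
```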
Exercise: list of names and numbers
As we have seen, you can hide an element:
-digit: [Nd].
or output it as an attribute:
@number: digit+.
but you can override these defaults when you use a rule.
For instance, your input contains either a single date, or a pair of dates, representing start and end dates:
2021-10-04:2021-10-05
You could write rules:
data: date; range.
range: start, ":", end.
date: year, "-", month, "-", day.
year: d, d, d, d.
month: d, d.
day: -month.
-d: [Nd].
start: -date.
end: -date.
Here we say that start has the same format as a date but will be output as a start element.
With input:
2021-10-04
you get:
<data> <date> <year>2021</year>- <month>10</month>- <day>04</day> </date> </data>
With input
2021-10-04:2021-10-05
you get:
<data> <range> <start> <year>2021</year>- <month>10</month>- <day>04</day> </start>: <end> <year>2021</year>- <month>10</month>- <day>05</day> </end> </range> </data>
You can make the root element hidden, as long as its resulting content is a single element (because of XML rules). So if we replace the data rule with:
-data: date; range.
then we will get
<range> <start> <year>2021</year>- <month>10</month>- <day>04</day> </start>: <end> <year>2021</year>- <month>10</month>- <day>05</day> </end> </range>
etc.
To override a rule that is hidden or an attribute, and make it an element, use "^". For example, with
input: ^digit; number.
number: digit, digit+. {2 or more digits}
-digit: [N].
then for input "2021" we get:
<input> <number>2021</number> </input>
and for input "4" we get:
<input> <digit>4</digit> </input>
Similarly, you can override with @:
input: @digit; @number.
number: digit, digit+. {2 or more digits}
-digit: [N].
would give
<input digit="4"/>
and
<input number="2021"/>
Exercise: URLs
You can include any Unicode character in a literal string, but occasionally it is useful to be able to be explicit which character is intended.
For instance a Tab or non-breaking space character is hard to distinguish from a space.
In such cases you can specify the character as a hexadecimal number, its position in the Unicode sequence:
tab: #9.
nonbreak: #a0.
These match a single character in the input. You can also include them in character sets:
space: [" "; #9; #a0].
You can't include them inside strings (and don't need to): "#a0" stands for the three characters "#", "a", "0".
Exercise: tab separated numbers.
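To see which characters the hexadecimal encodings denote, you can compute them in Python. This is only a sketch: encoded is a hypothetical helper, and the constants mirror the tab, nonbreak and space rules above.

```python
def encoded(hex_digits):
    """Return the character that an encoded character such as #a0 stands for."""
    return chr(int(hex_digits, 16))

TAB = encoded("9")                  # tab: #9.
NONBREAK = encoded("a0")            # nonbreak: #a0.
SPACE_SET = {" ", TAB, NONBREAK}    # space: [" "; #9; #a0].

print(repr(TAB), repr(NONBREAK))    # '\t' '\xa0'
print(len("#a0"))                   # 3: inside a string, "#a0" is just three characters
```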
You might like to marvel at the ixml definition of ixml in the specification, a thing of rare wonder. For instance a rule is
rule: (mark, S)?, name, S, ["=:"], S, -alts, ".".
which you can now read: an optional mark, a name, a colon or equals, and alternatives, followed by a point.
Note how this rule is defining itself.