ixml Hands On

Steven Pemberton, CWI, Amsterdam

Introduction

Invisible XML (ixml) is a method for treating non-XML documents as if they were XML.

ixml is a work in progress, so what you learn here, while close to the final version, may differ from it in detail.

The software you will be using is not yet the final version, nor industrial-strength. Please bear with us!

Follow along at http://www.cwi.nl/~steven/ixml/tutorial/

How it works

You have one or more documents in a non-XML format.

You supply an ixml description of that format, which includes how it should be converted to XML.

The ixml processor then uses the description to read and convert the documents to XML.

Example: dates

4 November 2021

Describe this:

date: day, month, year.

And then the parts:

day: digit;
     digit, digit.

A digit:

digit: "0"; "1"; "2"; "3"; "4"; "5"; "6"; "7"; "8"; "9".

Literals can be single or double quoted: "0" or '0'.

Example

month: "January"; "February"; "March";
       "April"; "May"; "June";
       "July"; "August"; "September";
       "October"; "November"; "December".

And finally

year: digit, digit, digit, digit.

One thing missing

4 November 2021

Describe this:

date: day, month, year.

should be:

date: day, " ", month, " ", year.

Result

<date>
   <day>
      <digit>4</digit>
   </day> 
   <month>November</month> 
   <year>
      <digit>2</digit>
      <digit>0</digit>
      <digit>2</digit>
      <digit>1</digit>
   </year>
</date>

Now you.

Serialisation

We are not interested in the digit elements, so change

digit: "0"; "1"; "2"; "3"; "4"; "5"; "6"; "7"; "8"; "9".

to

-digit: "0"; "1"; "2"; "3"; "4"; "5"; "6"; "7"; "8"; "9".

giving

<date>
   <day>4</day> 
   <month>November</month> 
   <year>2021</year>
</date>
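To get a feel for what the processor does with these rules, here is a minimal hand-written sketch in Python. It is not how an ixml processor is implemented (a real processor reads the grammar and uses a general parsing algorithm); the names parse_date and take_digits are made up for this illustration. It follows the date, day, month and year rules, and leaves digit out of the output, as the "-" mark does.

```python
MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]


def take_digits(text, pos, low, high):
    """Consume between low and high digits from text, starting at pos."""
    end = pos
    while end < len(text) and end - pos < high and text[end] in "0123456789":
        end += 1
    if end - pos < low:
        raise ValueError(f"expected a digit at position {end}")
    return text[pos:end], end


def parse_date(text):
    """Hand-coded equivalent of: date: day, " ", month, " ", year."""
    day, pos = take_digits(text, 0, 1, 2)         # day: digit; digit, digit.
    if text[pos:pos + 1] != " ":
        raise ValueError("expected a space after the day")
    pos += 1
    for month in MONTHS:                          # month: "January"; ...
        if text.startswith(month, pos):
            pos += len(month)
            break
    else:
        raise ValueError("expected a month name")
    if text[pos:pos + 1] != " ":
        raise ValueError("expected a space after the month")
    year, pos = take_digits(text, pos + 1, 4, 4)  # year: digit, digit, digit, digit.
    if pos != len(text):
        raise ValueError("unexpected input after the year")
    # digit is hidden, so only its characters appear; the two spaces are
    # still part of the output, as in the result above.
    return f"<date><day>{day}</day> <month>{month}</month> <year>{year}</year></date>"
```

Calling parse_date("4 November 2021") returns the same content as the output above, on a single line.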

Now you.

Attributes

Changing

year: digit, digit, digit, digit.

to

@year: digit, digit, digit, digit.

gives

<date year='2021'>
   <day>4</day> 
   <month>November</month>
</date>

Now you, make them all attributes

Excluding terminals

If the last exercise worked, you should have got an output that looks like this:

<date day='4' month='November' year='2021'> </date>

The date element is not empty because all input ends up in the XML, including the two spaces. Exclude them by changing:

date: day, " ", month, " ", year.

to

date: day, -" ", month, -" ", year.

Try it.

Options

A day is:

day: digit;
     digit, digit.

but you can also write

day: digit, digit?.

Do it. Does it matter which one you make optional? Make the year optional as well.

Exercise.

Grouping

If you make the year optional like this:

date: day, -" ", month, -" ", year?.

the year is optional, but the space before it isn't.

You could make a separate rule for the year:

date: day, -" ", month, optional-year?.
-optional-year: -" ", year.

Better is to use grouping:

date: day, -" ", month, (-" ", year)?.

If you run this on a date with no year, no year element is produced:

<date>
   <day>4</day> 
   <month>November</month>
</date>

Exercise: 2 digit year with grouping

Repetition

Adding a "+" after any item means "one or more":

date: day, -" "+, month, -" "+, year.

Adding a "*" means "zero or more":

date: -" "*, day, -" "+, month, -" "+, year.

Same effect but with an explicit rule for spaces:

-s: -" "+.

and use that, saying it is optional before a date:

date: s?, day, s, month, s, year.

Allow any number of dates in our input:

dates: date+.

Exercise: try it

Characters

A shorthand for a rule like

digit: "0"; "1"; "2"; "3"; "4"; "5"; "6"; "7"; "8"; "9".

is:

digit: ["0"-"9"].

Similarly:

letter: ["a"-"z"; "A"-"Z"].
operator: ["+-×÷"].
hexdigit: ["0123456789"; "ABCDEF"; "abcdef"].

These all match a single character taken from the set.

Exclusions

Use "~" to mean "any character except what is in the set":

string: '"', ~['"']*, '"'.

which says: a quote, any number of characters that aren't a quote, followed by a quote.

Since [] doesn't match anything, ~[] matches any single character.
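If you know regular expressions, character sets and exclusions behave much like character classes. The following Python sketch is only an analogy, not ixml itself; the names LETTER, HEXDIGIT, STRING, ANY and matches are made up for this illustration.

```python
import re

# letter: ["a"-"z"; "A"-"Z"].
LETTER = re.compile(r"[a-zA-Z]")
# hexdigit: ["0123456789"; "ABCDEF"; "abcdef"].
HEXDIGIT = re.compile(r"[0123456789ABCDEFabcdef]")
# string: '"', ~['"']*, '"'.
STRING = re.compile(r'"[^"]*"')
# ~[] matches any single character (DOTALL so that a newline counts too).
ANY = re.compile(r".", re.DOTALL)


def matches(pattern, text):
    """True if the whole of text matches pattern."""
    return pattern.fullmatch(text) is not None
```

For instance, matches(STRING, '"hello"') is true, while matches(LETTER, "qq") is false, because a set matches exactly one character.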

Exercise: list of strings

Repetition with separators

We've seen repetition in two forms, zero or more, and one or more:

spaces: " "*.
dates: date+.

Separators specify what comes between each repetition. For example, a list of numbers, separated by commas:

numbers: number+",".

To separate the numbers by commas and spaces, you could use:

numbers: number+(",", " "*).

Works with "*" as well.

Anything can follow the + or *: a rule name, a group, a literal, a character set.
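A repetition with a separator is shorthand for "an item, followed by zero or more (separator, item) pairs". A small Python sketch of the comma-and-spaces rule, assuming for the sake of the example that a number is one or more ASCII digits (the slides don't define number here); split_numbers is a made-up name.

```python
import re

# Assumption for this sketch: a number is one or more ASCII digits.
NUMBER = r"[0-9]+"
SEPARATOR = r", *"          # (",", " "*)

# number+(",", " "*)  is the same as  number, ((",", " "*), number)*
NUMBERS = re.compile(rf"{NUMBER}(?:{SEPARATOR}{NUMBER})*")


def split_numbers(text):
    """Return the numbers in text, or None if text is not a valid list."""
    if NUMBERS.fullmatch(text) is None:
        return None
    return re.findall(NUMBER, text)
```

Note that the separator only goes between items: "1,2," is rejected, because nothing follows the last comma.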

Exercise: list of strings separated by commas

Alternate Options

Apart from using "?", you can specify something optional in a different way. For instance:

year: digit, digit, digit, digit; .

Or to make it more obvious:

year: digit, digit, digit, digit; empty.
-empty: .

Applied to a date without a year, this gives an empty year element, rather than an absent one:

<date>
   <day>4</day>
   <month>November</month>
   <year/>
</date>

Pause

With exercise: email address

Accepting versus checking

The purpose of ixml is to recognise input and turn it into XML, not necessarily to check that the input is correct.

For instance, our date format also accepts input like 99 November 0000, which is not a real date.

You may not care, because you know the data is correct, and you will only be processing correct dates.

Checking

To use ixml to check the data, we can tighten up the definition, by excluding zero as a single digit, and only allowing two digit numbers up to 31:

day: "0"?, ["123456789"];
     ["12"], digit;
     "30"; "31".

which says that a day is a single digit (excluding 0), optionally preceded by 0; or 1 or 2 followed by any digit; or 30; or 31.
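One way to convince yourself that the tightened rule accepts exactly the days 1 to 31 is to transcribe it and try every case. This Python sketch uses a regular expression as a stand-in for the rule; DAY and is_day are names made up for the illustration.

```python
import re

# day: "0"?, ["123456789"];  ["12"], digit;  "30"; "31".
DAY = re.compile(r"0?[1-9]|[12][0-9]|30|31")


def is_day(text):
    """True if text is accepted by the tightened day rule."""
    return DAY.fullmatch(text) is not None
```

Every number from 1 to 31 is accepted, with or without a leading zero for the single digits, while "0", "00" and "32" are rejected.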

Exercise: restricted dates

Extending the format

Let's add ISO dates to the format, which look like "2021-11-04".

date: day, s, month, s, year;
      iso.
iso: year, -"-", nmonth, -"-", day.
nmonth: digit, digit.

Now our ixml accepts both sorts of date. For ISO dates we get:

<date>
   <iso>
      <year>2021</year>
      <nmonth>11</nmonth>
      <day>04</day>
   </iso>
</date>

Exercise: http dates

Extending the format 2

Let's start again, with another date format, the one that uses slashes like 31/12/1999:

date: day, -"/", month, -"/", year.
day: ("0"; "1"; "2"), digit;
     "30"; "31".
month: "0", digit;
       "10"; "11"; "12".
year: digit, digit, digit, digit.
-digit: "0"; "1"; "2"; "3"; "4";
        "5"; "6"; "7"; "8"; "9".

But the US writes such dates with the month first, 12/31/1999, so let's add that as well:

date: us; world.
world: day, -"/", month, -"/", year.
us: month, -"/", day, -"/", year.

dates

"31/12/1999" gives us:

<date>
   <world>
      <day>31</day>
      <month>12</month>
      <year>1999</year>
   </world>
</date>

"12/31/1999" gives us:

<date>
   <us>
      <month>12</month>
      <day>31</day>
      <year>1999</year>
   </us>
</date>

But "04/10/2021" produces:

<!-- AMBIGUOUS at 
  date[1.1:2.1]: us[:2.1] | 
  date[1.1:2.1]: world[:2.1] | 
-->
<date ixml:state="ambiguous" xmlns:ixml="http://invisiblexml.org/NS">
   <us>
      <month>04</month>
      <day>10</day>
      <year>2021</year>
   </us>
</date>

04/10/2021 can be both a US date and a World date, and ixml can't tell the difference.

It has chosen one, warning that the result is ambiguous, and may not be what you were expecting.
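To see why only some inputs are ambiguous, this Python sketch tries both readings and reports which ones fit. The regular expressions transcribe the day, month and year rules given for this format; readings is a made-up helper name, and this is an illustration, not what an ixml processor does internally.

```python
import re

DAY = r"(?:[012][0-9]|30|31)"      # day: ("0"; "1"; "2"), digit; "30"; "31".
MONTH = r"(?:0[0-9]|10|11|12)"     # month: "0", digit; "10"; "11"; "12".
YEAR = r"[0-9]{4}"                 # year: digit, digit, digit, digit.

WORLD = re.compile(rf"{DAY}/{MONTH}/{YEAR}")   # world: day, -"/", month, -"/", year.
US = re.compile(rf"{MONTH}/{DAY}/{YEAR}")      # us: month, -"/", day, -"/", year.


def readings(text):
    """Return the names of the rules that accept text."""
    found = []
    if US.fullmatch(text):
        found.append("us")
    if WORLD.fullmatch(text):
        found.append("world")
    return found
```

readings("31/12/1999") gives ["world"] and readings("12/31/1999") gives ["us"], but readings("04/10/2021") gives ["us", "world"]: both rules fit, so the input is ambiguous.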

Exercise: try it.

Ambiguity

Ambiguity arises when an input can be parsed in more than one way.

For example

dates: date+.
date: -" "*, day, -" ", month, -" ", year, -" "*.

With

4 November 2021  5 November 2021

ixml doesn't know whether the spaces between the dates follow the first, or precede the second.

Accidental ambiguity

Similarly, if you define

expr: number;
      expr, op, expr.

Then with

1+2+3

ixml doesn't know if that's

(1+2)+3

or

1+(2+3)

both of which match.
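The number of possible readings grows quickly with the length of the input. This Python sketch lists every way the expr rule can match a sequence of tokens; parses is a made-up name, and for simplicity it assumes well-formed input in which every number and operator is a single character.

```python
def parses(tokens):
    """All bracketings of tokens under: expr: number; expr, op, expr."""
    if len(tokens) == 1:                 # expr: number.
        return [tokens[0]]
    results = []
    # Try each operator in turn as the op of "expr, op, expr".
    for i in range(1, len(tokens), 2):
        for left in parses(tokens[:i]):
            for right in parses(tokens[i + 1:]):
                results.append(f"({left}{tokens[i]}{right})")
    return results
```

parses(list("1+2+3")) returns the two readings "(1+(2+3))" and "((1+2)+3)"; with one more operand, "1+2+3+4", there are already five.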

Exercise: make the date ambiguity explicit.

What can't be described

ixml can only describe 'context-free' languages.

For instance, Fortran strings like "6Hstring" can't be described in a general way, nor languages where the structure is only implied by the indentation, nor languages where structures are 'silently' closed off.

Luckily most languages are designed to be context-free.

Exercise: nested expressions

Character classes

ixml uses Unicode, which has many thousands of characters (currently 144,697).

Unicode classifies characters in different ways, and ixml uses those classifications. They are only used in character sets.

For instance, class Ll covers lower case letters, and Lu is upper case. Class L covers all characters that are considered letters (there are 130,000 of them!). So you could define a name as:

name: [L]+.

Similarly, class Nd covers all decimal digit characters, so you could define a number as:

number: [Nd]+.
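You can explore these classes with Python's standard unicodedata module, which reports the general category of a character. The helper names in_class, is_name and is_number are made up for this sketch, and the version of Unicode that Python knows may differ from the one your ixml processor uses.

```python
import unicodedata


def in_class(char, name):
    """True if char's Unicode general category starts with name, e.g. "L" or "Nd"."""
    return unicodedata.category(char).startswith(name)


def is_name(text):
    """Sketch of name: [L]+."""
    return len(text) > 0 and all(in_class(c, "L") for c in text)


def is_number(text):
    """Sketch of number: [Nd]+."""
    return len(text) > 0 and all(in_class(c, "Nd") for c in text)
```

Note that [Nd] is wider than ["0"-"9"]: the Arabic-Indic digits are in class Nd too, so is_number("\u0664\u0662") is true, while a Roman numeral character such as "\u2163" is in class Nl and is rejected.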

Exercise: list of names and numbers

Serialisation Overrides

As we have seen, you can hide an element:

-digit: [Nd].

or output it as an attribute:

@number: digit+.

but you can override these defaults when you use a rule.

Overrides

For instance, your input contains either a single date, or a pair of dates, representing start and end dates:

2021-10-04:2021-10-05

You could write rules:

data: date; range.

range: start, ":", end.

date: year, "-", month, "-", day.
year: d, d, d, d.
month: d, d.
day: -month.
-d: [Nd].

start: -date.
end: -date.

Here we say that start has the same format as a date but will be output as a start element.

With input:

2021-10-04

you get:

<data>
   <date>
      <year>2021</year>-
      <month>10</month>-
      <day>04</day>
   </date>
</data>

With input

2021-10-04:2021-10-05

you get:

<data>
   <range>
      <start>
         <year>2021</year>-
         <month>10</month>-
         <day>04</day>
      </start>:
      <end>
         <year>2021</year>-
         <month>10</month>-
         <day>05</day>
      </end>
   </range>
</data>

Root element

You can make the root element hidden, as long as its resulting content is a single element (because of XML rules). So if we replace the data rule with:

-data: date; range.

then we will get

<range>
   <start>
      <year>2021</year>-
      <month>10</month>-
      <day>04</day>
   </start>:
   <end>
      <year>2021</year>-
      <month>10</month>-
      <day>05</day>
   </end>
</range>

etc.

overriding hidden or attribute

To override a rule that is hidden or an attribute to make it an element, you use "^".

input: ^digit; number.
number: digit, digit+. {2 or more digits}
-digit: [N].

then for input "2021" we get:

<input>
   <number>2021</number>
</input>

and for input "4" we get:

<input>
   <digit>4</digit>
</input>

override with @

Similarly, you can override with @:

input: @digit; @number.
number: digit, digit+. {2 or more digits}
-digit: [N].

would give

<input digit="4"/>

and

<input number="2021"/>

Exercise: URLs

Encoded Characters

You can include any Unicode character in a literal string, but occasionally it is useful to be able to be explicit which character is intended.

For instance a Tab or non-breaking space character is hard to distinguish from a space.

In such cases you can specify the character by its hexadecimal code point, that is, its position in the Unicode sequence:

tab: #9.
nonbreak: #a0.

These match a single character in the input. You can also include them in character sets:

space: [" "; #9; #a0].

You can't include them inside strings (and don't need to): "#a0" stands for the three characters "#", "a", "0".
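If you want to check which character a hexadecimal number stands for, Python's built-in chr does the same lookup. A small sketch; encoded, TAB, NONBREAK and SPACE_SET are names made up for the illustration.

```python
def encoded(hex_digits):
    """Return the character that an ixml encoded character such as #a0 denotes."""
    return chr(int(hex_digits, 16))


TAB = encoded("9")                 # tab: #9.
NONBREAK = encoded("a0")           # nonbreak: #a0.
SPACE_SET = {" ", TAB, NONBREAK}   # space: [" "; #9; #a0].
```

Here TAB is the tab character and NONBREAK is the non-breaking space, while the string "#a0" remains three ordinary characters.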

Exercise: tab separated numbers.

The end

You might like to marvel at the ixml definition of ixml in the specification, a thing of rare wonder. For instance a rule is

rule: (mark, S)?, name, S, ["=:"], S, -alts, ".".

which you can now read: an optional mark, a name, a colon or equals, and alternatives, followed by a point.

Note how this rule is defining itself.