Tutorial: Invisible Markup

Steven Pemberton, CWI, Amsterdam

http://cwi.nl/~steven/ixml/tutorial/

Introduction

Invisible Markup is a method for discovering structure in (textual) documents.

Follow along at http://cwi.nl/~steven/ixml/tutorial/

If we don't finish everything today, fear not! The tutorial is also designed for self-teaching.

And if that isn't enough, there is also a video, and these slides are online.

Example answers to all the exercises are at the back.

How it works

You have one or more documents in some textual format.

You supply a description of that format, which also says how it should be converted to XML.

The ixml processor then uses the description to read and convert the documents to XML.

Example: dates

7 November 2024

Describe this:

date: day, month, year.

And then the parts:

day: digit;
     digit, digit.

A digit:

digit: "0"; "1"; "2"; "3"; "4"; "5"; "6"; "7"; "8"; "9".

Literals can be single or double quoted: "0" or '0'.
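
One practical consequence, sketched here with made-up rule names: the choice of quote style lets a literal contain the other kind of quote, and a quote of the same kind is written doubled inside the literal.

```ixml
dquote: '"'.         {a double quote, written inside single quotes}
squote: "'".         {a single quote, written inside double quotes}
 dont1: "don't".     {no doubling needed inside double quotes}
 dont2: 'don''t'.    {the same five characters, with the quote doubled}
```

These are independent rules for illustration, not one grammar.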

Example

month: "January"; "February"; "March";
       "April"; "May"; "June";
       "July"; "August"; "September";
       "October"; "November"; "December".

And finally

year: digit, digit, digit, digit.

One thing missing

7 November 2024

Describe this:

date: day, month, year.

should be:

date: day, " ", month, " ", year.
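
Gathered in one place, the description so far reads as follows. Nothing here is new; it is the rules above with the top-level rule first, which is where an ixml processor starts.

```ixml
 date: day, " ", month, " ", year.
  day: digit;
       digit, digit.
month: "January"; "February"; "March";
       "April"; "May"; "June";
       "July"; "August"; "September";
       "October"; "November"; "December".
 year: digit, digit, digit, digit.
digit: "0"; "1"; "2"; "3"; "4"; "5"; "6"; "7"; "8"; "9".
```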

Result

<date>
   <day>
      <digit>7</digit>
   </day> 
   <month>November</month> 
   <year>
      <digit>2</digit>
      <digit>0</digit>
      <digit>2</digit>
      <digit>4</digit>
   </year>
</date>

Exercise

Run it yourself.

  1. Open the in-browser ixml processor.
  2. Copy and paste the required bits of ixml above into the Grammar field.
  3. Copy or type the required input into the Input field.
  4. Click on Go.

Warning:

Serialisation

We are not interested in the individual digit elements, so change

digit: "0"; "1"; "2"; "3"; "4"; "5"; "6"; "7"; "8"; "9".

to

-digit: "0"; "1"; "2"; "3"; "4"; "5"; "6"; "7"; "8"; "9".

giving

<date>
   <day>7</day> 
   <month>November</month> 
   <year>2024</year>
</date>

Exercise

Try it yourself

Attributes

Changing

year: digit, digit, digit, digit.

to

@year: digit, digit, digit, digit.

gives

<date year='2024'>
   <day>7</day>
   <month>November</month>
</date>

Exercise

Make them all attributes

Excluding terminals

If the last exercise worked, you should have got an output that looks like this:

<date day='7' month='November' year='2024'> </date>

The date element is not empty because all input ends up in the XML, including the spaces. Exclude them by changing:

date: day, " ", month, " ", year.

to

date: day, -" ", month, -" ", year.

Exercise

Try it. Experiment a bit.

Options

A day is:

day: digit;
     digit, digit.

but you can also write

day: digit, digit?.

Exercise

Make the change. Does it matter which digit is made optional? What happens if you make the year optional?

Grouping

If you make the year optional like this:

date: day, -" ", month, -" ", year?.

the year is optional, but the space before it isn't.

You could make a separate rule for the year:

          date: day, -" ", month, optional-year?.
-optional-year: -" ", year.

Better is to use grouping:

date: day, -" ", month, (-" ", year)?.

If you run this on a date with no year, no year element is produced:

<date>
   <day>7</day>
   <month>November</month>
</date>

Exercise

If you were to make the year optionally a two-digit number or a four-digit number, you could do it with

year: digit, digit, digit, digit;
      digit, digit.

Do it with grouping instead.

Sample answer

year: digit, digit, (digit, digit)?.

Repetition

Adding a "+" after any item means "one or more":

date: day, -" "+, month, -" "+, year.

Adding a "*" means "zero or more":

date: -" "*, day, -" "+, month, -" "+, year.

Repetition

Same effect but with an explicit rule for spaces:

-s: -" "+.

and use that, saying it is optional before a date:

date: s?, day, s, month, s, year.

Allow any number of dates in our input:

dates: date+.

Exercise

Make the changes.

You will see later why it is inadvisable to add spaces after a date as well if you include that last rule for dates.

(In brief: if you have a list of dates, you don't need to have spaces following a date, since spaces can already precede the next one).

Encoded Characters

You can include any Unicode character in a literal string (except control characters), but occasionally it is useful to be able to be explicit which character is intended.

For instance a tab or non-breaking space character is hard to distinguish from a space.

In such cases you can specify the character as a hexadecimal number, its position in the Unicode sequence:

     tab: #9.
nonbreak: #a0.
   space: " "; #9; #a0.

These match a single character in the input.

You can't include them inside strings (and don't need to): "#a0" stands for the three characters "#", "a", "0".
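
Since an encoded character cannot go inside a string, a sequence is used to mix the two. A small sketch (the rule names header and eol are made up for illustration):

```ixml
header: "Name", #9, "Amount", eol.   {two strings with a tab between them}
  -eol: -#d?, -#a.                   {optional carriage return, then line feed}
```

This would match a line containing Name, a tab, and Amount, ended by either style of line ending.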

Exercise

Update your previous exercise to include newlines.

Most operating systems use #a.

Windows uses #d, #a. Extra points for dealing with both.

Character sets

A shorthand for a rule like

digit: "0"; "1"; "2"; "3"; "4"; "5"; "6"; "7"; "8"; "9".

is:

digit: ["0"-"9"].

Similarly:

  letter: ["a"-"z"; "A"-"Z"].
operator: ["+-×÷"].
hexdigit: ["0123456789"; "ABCDEF"; "abcdef"].
   space: [" "; #9; #a0].

These all match a single character taken from the set.

Exclusions

Use "~" to mean "any character except what is in the set"

string: '"', ~['"']*, '"'.

which says: a quote, any number of characters that aren't a quote, followed by a quote.

Since [ ] doesn't match anything, ~[ ] matches any single character.
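
Exclusions are handy for "everything up to some delimiter". A minimal sketch, assuming lines end with a bare #a (the rule names are illustrative):

```ixml
lines: line+.
 line: ~[#a; #d]*, -#a.   {any characters other than line ends, then the line feed, which is dropped}
```

Each line of input becomes one line element, and an empty line gives an empty element.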

Exercise

Write a description of a list of strings, separated by spaces. For instance, the input

"Now" "is the" "time" ""

should produce an output like:

<strings>
   <string>Now</string>
   <string>is the</string>
   <string>time</string>
   <string/>
</strings>

Sample answer

strings: string, (s, string)*.
 string: -'"', c*, -'"'.
     -c: ~['"'].
     -s: -" "+.

We'll see a way of doing this differently shortly.

Character classes

ixml uses Unicode, which has many thousands of characters (144,697 as of Unicode 14.0).

Unicode classifies characters in different ways, and ixml uses those classifications. They are only used in character sets.

For instance, class Ll covers lower case letters, and Lu is upper case. Class L covers all characters that are considered letters (there are 130,000 of them!). So you could define a name as:

name: [L]+.

Similarly, class Nd covers all decimal digit characters, so you could define a number as:

number: [Nd]+.

Other useful classes:
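
A few classes that come up often, shown as a sketch. The class names are Unicode general categories; the rule names are made up, and these are independent fragments rather than one grammar.

```ixml
   word: [L]+.                     {L: any letter}
capital: [Lu], [Ll]*.              {Lu, Ll: upper- and lower-case letters}
  space: [Zs].                     {Zs: space separators, including non-breaking space}
  price: [Sc], [Nd]+.              {Sc: currency symbols such as $, £, €}
  punct: [P].                      {P: punctuation of all kinds}
  ident: [L; "_"], [L; Nd; "_"]*.  {classes can be mixed with literals in one set}
```

Note that a tab is a control character, not a member of Zs, so it has to be added explicitly, as in [Zs; #9].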

Exercise

Make an ixml description that accepts a number of lines, where each line contains a name and an amount of money.

Sample Answer

  amounts: person*.
   person: name, s, amount, nl.
     name: [L]+.
       -s: [Zs; #9]+.
   amount: currency, [Nd]+, ([".,"], [Nd]*)?.
@currency: [Sc].
      -nl: [#a; #d]+.

Repetition with separators

We've seen repetition in two forms, zero or more:

spaces: " "*.

and one or more:

 dates: date+.

Separators specify what comes between each repetition. For example, a list of numbers, separated by commas:

numbers: number++",".

To separate the numbers by commas and spaces, you could use:

numbers: number++(",", " "*).

The same works with "**" as well.

Anything can follow the ++ or **: a rule name, a group, a literal, a character set.
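
As an illustration, here is a small sketch that uses a group and a literal as separators. The format (a function call with dotted names) and all rule names are invented for the example.

```ixml
call: name, -"(", args, -")".
args: arg**(-",", -" "*).    {a group as separator; ** also allows no args at all}
 arg: name++-".".            {a literal as separator, excluded from the output}
name: [L]+.
```

With the input f(a.b, c) this should give:

```
<call>
   <name>f</name>
   <args>
      <arg>
         <name>a</name>
         <name>b</name>
      </arg>
      <arg>
         <name>c</name>
      </arg>
   </args>
</call>
```

and f() gives an empty args element.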

Exercise

Change the description of a list of strings, so that strings are now separated by commas and spaces. For instance, the input

"Now", "is the","time", ""

should produce output like:

<strings>
   <string>Now</string>
   <string>is the</string>
   <string>time</string>
   <string/>
</strings>

Sample answer

strings: string**(-",", s).
 string: -'"', c*, -'"'.
     -c: ~['"'].
     -s: -" "*.

Alternate Options

Apart from using "?", you can specify something optional in a different way. For instance:

year: digit, digit, digit, digit; .

Or to make it more obvious:

  year: digit, digit, digit, digit; empty.
-empty: .

Applied to a date without a year, this gives an empty year element, rather than an absent one:

<date>
   <day>7</day>
   <month>November</month>
   <year/>
</date>

Exercise: Go it alone

Define an email address, so that its structure is revealed.

Sample Answer

 email: user, -"@", host.
  user: ~["@"]+.
  host: domain++-".".
domain: ["a"-"z"; "A"-"Z"; "0"-"9"]+.

Accepting versus checking

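The purpose of ixml is to discover the structure in a document, not necessarily to check that the document is correct.

For instance, our date format also accepts input like

99 November 2024

You may not care, because you know the data is correct, and you will only be processing correct dates.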

Checking

To use ixml to check the data, we can tighten up the definition, by excluding zero as a single digit, and only allowing two digit numbers up to 31:

day: "0"?, ["123456789"];
     ["12"], digit;
     "30"; "31".

which says that a day is a single digit (excluding 0), optionally preceded by 0, or 1 or 2 followed by any digit, or 30, or 31.

Exercise

Try checking dates.

For extra points, add extra checking for months that can be 30 days long, those that can be 31 days long, and February.

You might also want to restrict which years are accepted.

Sample answer

  dates: (date, nl)+.
   date: s?, day, s, month, s, year.
    day: "0"?, sdigit;
         ["12"], digit;
         "30"; "31".
-sdigit: ["123456789"].
 -digit: ["0123456789"].
  month: "January"; "February"; "March"; "April"; "May"; "June";
         "July"; "August"; "September"; "October"; "November"; "December".
   year: digit, digit, digit, digit.
     -s: -" "+.
    -nl: -[#a; #d]+.
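
The sample answer accepts any four-digit year. A sketch of one possible restriction, here to the years 1900 to 2099 (the range is an arbitrary choice for illustration), replaces the year rule:

```ixml
  year: ("19"; "20"), digit, digit.   {1900 to 2099 only}
-digit: ["0"-"9"].
```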

Extending the format

Let's add ISO dates to the format, which look like "2024-11-07":

  date: day, s, month, s, year;
        iso.
   iso: year, -"-", nmonth, -"-", day.
nmonth: digit, digit.

Now our ixml accepts both sorts of date. For ISO dates we get:

<date>
   <iso>
      <year>2024</year>
      <nmonth>11</nmonth>
      <day>07</day>
   </iso>
</date>
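
If the extra iso wrapper is not wanted, one option, sketched here, is to hide that rule with the "-" mark seen under Serialisation. The rules for day, month, year, s and digit are assumed to be as before.

```ixml
  date: day, s, month, s, year;
        iso.
  -iso: year, -"-", nmonth, -"-", day.
nmonth: digit, digit.
```

For "2024-11-07" the year, nmonth and day elements then appear directly inside date.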

Exercise

HTTP, the web protocol, uses a date format like this in its headers:

Tue, 15 Nov 1994 13:45:26 GMT

(It must be GMT).

Write ixml to accept it. (You don't need to check it, just accept it)

Sample answer

http-date: weekday, -", ", day, -" ", month, -" ", year, -" ", time, -" GMT".
  weekday: "Mon"; "Tue"; "Wed"; "Thu"; "Fri"; "Sat"; "Sun".
      day: dd.
      -dd: digit, digit.
    month: "Jan"; "Feb"; "Mar"; "Apr"; "May"; "Jun";
           "Jul"; "Aug"; "Sep"; "Oct"; "Nov"; "Dec".
     year: digit, digit, digit, digit.
     time: h, -":", m, -":", s.
        h: dd.
        m: dd.
        s: dd.
   -digit: ["0"-"9"].

Extending the format 2

Let's start again, with another date format, the one that uses slashes like 31/12/1999:

  date: day, -"/", month, -"/", year.
   day: ("0"; "1"; "2")?, digit;
        "30"; "31".
 month: "0"?, digit;
        "10"; "11"; "12".
  year: digit, digit, digit, digit.
-digit: ["0"-"9"].

But the US writes such dates with the month first, 12/31/1999, so let's add that as well:

 date: us; world.
world: day, -"/", month, -"/", year.
   us: month, -"/", day, -"/", year.

Dates

"31/12/1999" gives us:

<date>
   <world>
      <day>31</day>
      <month>12</month>
      <year>1999</year>
   </world>
</date>

Dates

"12/31/1999" gives us:

<date>
   <us>
      <month>12</month>
      <day>31</day>
      <year>1999</year>
   </us>
</date>

Dates

But "7/11/2024" produces:

<!-- AMBIGUOUS at 
  date[1.1:2.1]: us[:2.1] | 
  date[1.1:2.1]: world[:2.1] | 
-->
<date ixml:state="ambiguous" xmlns:ixml="http://invisiblexml.org/NS">
   <us>
      <month>7</month>
      <day>11</day>
      <year>2024</year>
   </us>
</date>

7/11/2024 can be both a US date and a World date, and ixml can't tell the difference.

It has chosen one, warning that the result is ambiguous, and may not be what you were expecting.

Exercise

Try it out

Ambiguity

Ambiguity: the input can be matched by the description in more than one way.

For example

dates: date+.
 date: -" "*, day, -" ", month, -" ", year, -" "*.

With

7 November 2024  8 November 2024

ixml doesn't know whether the spaces between the dates follow the first, or precede the second.
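
One way out, as a sketch: let the spaces belong to exactly one place. Here they belong to the list rather than to each date, so there is only one way to read them. The rules for day, month and year are assumed to be as defined earlier.

```ixml
dates: -" "*, date++(-" "+), -" "*.
 date: day, -" ", month, -" ", year.
```

Spaces before the first date, between dates, and after the last date are each matched by a single part of the dates rule.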

Accidental ambiguity

Similarly, if you define

expr: number;
      expr, op, expr.

Then with

1+2+3

ixml doesn't know if that's

(1+2)+3

or

1+(2+3)

both of which match.
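
A common way to remove this kind of ambiguity, sketched here with assumed definitions of number and op, is to allow the recursion on one side only:

```ixml
  expr: number;
        expr, op, number.   {the right-hand operand can no longer be a whole expr}
number: ["0"-"9"]+.
    op: ["+-"].
```

With this version 1+2+3 can only be read as (1+2)+3.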

Exercise

Rewrite the example from the previous chapter so that instead of ixml telling you that it is ambiguous, you have three classes of dates: US, World, and ambig:

<dates>
   <date>
      <world>
         <day>31</day>
         <month>12</month>
         <year>1999</year>
      </world>
   </date>
   <date>
      <us>
         <month>12</month>
         <day>31</day>
         <year>1999</year>
      </us>
   </date>
   <date>
      <ambig>
         <daymonth>04</daymonth>
         <daymonth>10</daymonth>
         <year>2021</year>
      </ambig>
   </date>
</dates>

Sample Answer

    date: us; world; ambig.
   world: day, -"/", month, -"/", year.
      us: month, -"/", day, -"/", year.
   ambig: daymonth, -"/", daymonth, -"/", year.
     day: {above 12 for the day is unambiguous}
         "1", ["3456789"];
         "2", digit;
         "30"; "31".
daymonth: {up to 12 is ambiguous}
         -month.
   month: "0"?, sdigit;
          "10"; "11"; "12".
    year: digit, digit, digit, digit.
  -digit: ["0"-"9"].
 -sdigit: ["1"-"9"].

What can't be described

ixml can only describe 'context-free' languages.

For instance, Fortran strings like "6Hstring" can't be described in a general way, nor languages where the structure is only implied by the indentation, nor languages where structures are 'silently' closed off.

Luckily most languages are designed to be context-free.

However, a format that is not context-free can still be described up to a fixed limit, such as indentation-based nesting up to a fixed depth.

Exercise

Describe nested lists, where the nesting is shown by indentation, such as

Monday
   tea
   coffee
Tuesday
Wednesday
   morning
      coffee
      biscuits
   afternoon
      tea
      cakes
Thursday
   closed

to generate XML like:

<list>
   <word>Monday</word>
   <list1>
      <word>tea</word>
      <word>coffee</word>
   </list1>
   <word>Tuesday</word>
   <word>Wednesday</word>
   <list1>
      <word>morning</word>
      <list2>
         <word>coffee</word>
         <word>biscuits</word>
      </list2>
      <word>afternoon</word>
      <list2>
         <word>tea</word>
         <word>cakes</word>
      </list2>
   </list1>
   <word>Thursday</word>
   <list1>
      <word>closed</word>
   </list1>
</list>

Sample Answer

    list: (                                word, newline, list1?)+.
   list1: (                        indent, word, newline, list2?)+.
   list2: (                indent, indent, word, newline, list3?)+.
   list3: (        indent, indent, indent, word, newline, list4?)+.
   list4: (indent, indent, indent, indent, word, newline        )+.
 -indent: -"   ".
-newline: -#d?, -#a.
    word: [L]+.

Serialisation Overrides

As we have seen, you can hide an element:

-digit: [Nd].

or output it as an attribute:

@number: digit+.

but you can override these defaults when you use a rule.

Overrides

For instance, your input contains either a single date, or a pair of dates, representing start and end dates:

2024-11-07:2024-11-08

You could write rules:

 data: date; range.

range: start, -":", end.

 date: year, -"-", month, -"-", day.
 year: d, d, d, d.
month: d, d.
  day: -month.
   -d: [Nd].

start: -date.
  end: -date.

Here we say that start has the same format as a date but will be output as a start element.

Overrides

With input:

2024-11-07

you get:

<data>
   <date>
      <year>2024</year>
      <month>11</month>
      <day>07</day>
   </date>
</data>

Overrides

With input

2024-11-07:2024-11-08

you get:

<data>
   <range>
      <start>
         <year>2024</year>
         <month>11</month>
         <day>07</day>
      </start>
      <end>
         <year>2024</year>
         <month>11</month>
         <day>08</day>
      </end>
   </range>
</data>

Root element

You can make the root element hidden, as long as its resulting content is a single element (because of XML rules). So if we replace the data rule with:

-data: date; range.

then we will get

<range>
   <start>
      <year>2024</year>
      <month>11</month>
      <day>07</day>
   </start>
   <end>
      <year>2024</year>
      <month>11</month>
      <day>08</day>
   </end>
</range>

etc.

Overriding hidden or attribute

To override a rule that is hidden or an attribute to make it an element, you use "^".

input: ^digit; number.
number: digit, digit+. {2 or more digits}
-digit: [N].

then for input "2024" we get:

<input>
   <number>2024</number>
</input>

and for input "6" we get:

<input>
   <digit>6</digit>
</input>

Override with @

Similarly, you can override with @:

input: @digit; @number.
number: digit, digit+. {2 or more digits}
-digit: [N].

would give

<input digit="6"/>

and

<input number="2024"/>

Exercise

Create an ixml definition for URLs, like http://www.example.com/documents/index.html

Make your own decisions about the structure of the XML.

Sample answer

      url: scheme, -":", -"//", authority, -"/", 
           path, (-"?", query)?, (-"#", fragment)?.
   scheme: letter+.
authority: host, port?.
     host: sub++-".".
      sub: letter+.
     path: segment**-"/".
  segment: ~["/?#"]*.
    query: ~["# "]*.
 fragment: ~[" "; #a]*.
  -letter: ["a"-"z"]; ["A"-"Z"]; ["0"-"9"].
     port: -":", ["0"-"9"]+.

Insertions

Up to now, all output characters came from the input.

But you can add other characters, by putting a "+" before a string or an encoded character.

For instance, with this input, representing a series of positive and negative numbers:

100,200,(300),400

the following ixml adds an extra attribute to the root element, adds plus signs to positive numbers, and deletes the brackets and adds a minus sign to negative numbers:

  data: value++-",", @source.
source: +"ixml".
 value: pos; neg.
  -pos: +"+", digit+.
  -neg: +"-", -"(", digit+, -")".
-digit: ["0"-"9"].

The above input would produce

<data source="ixml">
   <value>+100</value>
   <value>+200</value>
   <value>-300</value>
   <value>+400</value>
</data>
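
Deletion and insertion together can also normalise input. As a further sketch (the format and rule names are invented for illustration), this accepts numeric dates written with dots, slashes or hyphens, and always outputs hyphens:

```ixml
  date: n, sep, n, sep, n.
    -n: digit+.
  -sep: -["./-"], +"-".   {drop whichever separator was used, insert a hyphen}
-digit: ["0"-"9"].
```

With the input 7.11.2024 this should produce

```
<date>7-11-2024</date>
```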

Exercise

Accept dates with two- or four-digit years, but output them with four digits, so that 31/12/23 and 31/12/2023 give the same result.

Sample Answer

 date: day, -"/", month, -"/", year.
  day: d, d?.
month: d, d.
 year: d, d, d, d;
       +"20", d, d.
   -d: ["0"-"9"].

The end

You might like to marvel at the ixml definition of ixml in the specification, a thing of rare wonder. For instance a rule is

rule: (mark, s)?, name, s, ["=:"], s, -alts, ".".

which you can now read: an optional mark, a name, a colon or equals, and alternatives, followed by a point.

Note how this rule is defining itself.