Invisible Structure

Steven Pemberton, CWI, Amsterdam

The author

Abstract

Of the huge amounts of data in the world, very little is structured, restricting its usability and availability. Adding structuring is costly.

Invisible Markup is a method of adding structuring to data without touching the data itself.

Exponential world

Moore's Law is not dead

We live in an exponential world.

We have known Moore's Law since 1965: the prediction that the number of components on a chip would double, at constant price, every two years.

Other Doublings

These are mostly taken from a 1960s book, "Little Science, Big Science... and Beyond", by Derek J. de Solla Price. Most of them I haven't checked against modern data.

Data

Number of scientific journals, 1665-1960

We knew in the 1960s that data had been doubling every 15 years.

Data (2007)

Explosion of digital data

Since the turn of the millennium, new data has been mainly digital.

In 2007 there was nearly 300 EB of data, of which 94% was digital.

Data point: the Internet Archive, at 2 TB in 1995, is currently at 175 PB.

Prefixes

How to remember the prefixes (if you know Greek):

TETRA = 4 → TERA = 1000⁴
PENTA = 5 → PETA = 1000⁵
HEX = 6 → EXA = 1000⁶
HEPTA = 7 → ZETTA = 1000⁷
OCTO = 8 → YOTTA = 1000⁸

(FYI: the next two are ronna and quetta, but you won't need those for a few years yet: about 10 years for ronna and 20 for quetta.)

Big Numbers, Little Numbers

If 1 byte = 1 second, then

1KB = 17 mins
1MB = 12 days
1GB = 34 years
1TB = 35 millennia
1PB = 36M years
1EB = 2.6 × age of universe
1ZB = 1000 × that
1YB = 1000 × that

In 2025 about 181 ZB of new data was produced.
In 2015 that was 15 ZB.
This is a doubling time of 2.78 years.

This means that every three years, we produce as much data as we already have.

Which means that in the next three years we will produce about 1 YB of new data, and that about 1 YB is also roughly how much we now have.
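The arithmetic above can be checked with a few lines of Python. The 15 ZB and 181 ZB figures are the ones quoted in the text, and the time figures assume binary units (1 GB = 2³⁰ bytes):

```python
import math

SECONDS_PER_YEAR = 60 * 60 * 24 * 365.25

# 1 byte = 1 second: 1 GB (2**30 bytes) comes out at about 34 years
years_per_gb = 2**30 / SECONDS_PER_YEAR
print(f"1GB = {years_per_gb:.0f} years")        # 34 years

# Doubling time from the two data points: 15 ZB in 2015, 181 ZB in 2025
growth = 181 / 15                               # roughly twelvefold in 10 years
doubling = 10 * math.log(2) / math.log(growth)
print(f"doubling time = {doubling:.2f} years")  # 2.78 years
```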

The good news is that disk capacity is growing faster than the data.

Unstructured data

Most data is unstructured, or only slightly structured.

Thus, every program has to have an input function that recognises and reads the data and creates internal data structures.

And even though we are in the 4th decade of the internet age, we are still providing images of paper rather than real data!

The web imitating the old

A receipt

Examples include ticketing, contracts, and receipts.

These are all typically PDFs, with no machine-processable elements. They are a picture of a paper version of the thing.

The only thing that has happened is that the paper has been digitised away, and is sent to you electronically. Otherwise it is the same as it ever was.

Machine-readable information

Information can be used in two ways (at the same time): to communicate to people, and to communicate between machines.

For instance there is a service, TripIt, to which you send tickets, hotel bookings, and so on. It assembles them and creates an itinerary for you automatically. Really handy: everything in one place.

But it has to know what the information is.

It has to try to work out what is in these things in order to do something useful with them, and it often gets it wrong.

Example ticket

A train ticket as PDF

This is the sort of thing TripIt has to deal with: a ticket sent to me as a PDF.

A half megabyte of picture and no data.

They have to OCR it, and try to work out what information it contains.

The Data

This is the essence of that PDF ticket (250 bytes):

 document: ticket
     type: train
 supplier: Eurostar
reference: PCX4GZ
passenger: Steven Pemberton
    train: 9114
    leave: 2023-07-20T08:16:00+02:00
     from: Amsterdam CS
       to: St Pancras International
   arrive: 2023-07-20T13:51:00+01:00
    class: SP
    coach: 3
     seat: 21

Making this pretty for a human reader is a trivial task, and the technology already exists to do that.

Automatically getting the information out of a PDF is not trivial: it is hard, and it often goes wrong.
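To illustrate how easy the structured essence is to process, compared with OCR'ing a PDF, here is a short Python sketch (using a subset of the fields from the example above) that reads the key: value form into a dictionary:

```python
# A subset of the 250-byte essence of the ticket
ticket = """\
 document: ticket
     type: train
 supplier: Eurostar
reference: PCX4GZ
passenger: Steven Pemberton
"""

# One line of logic: split each line at the first ": "
data = dict(line.strip().split(": ", 1)
            for line in ticket.strip().splitlines())

print(data["supplier"])   # Eurostar
print(data["reference"])  # PCX4GZ
```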

Structure

Textual documents have an implicit structure that is mostly recognisable to human readers.

For computers the structure has to be made explicit to enable processing.

That is why markup languages were invented: rather than the structure being implicit, and needing a separate input routine, it is made explicit, and all programs can then have a single input routine to read all data.

Furthermore, the data is self-describing, making it easier to analyse, combine, and reuse.

The only problem is that converting unstructured data into a marked-up version is a lot of work.

Invisible Markup

Which is why Invisible Markup was invented.

Rather than add markup to a document, a description of the structure of the class of documents is created, and that is used to automatically structure the data.

You get the best of both worlds:

How it works:

Extra information in the grammar controls the serialisation.

Very Simple Example

A date:

04/12/2025

Description:

 date: day, -"/", month, -"/", year.
  day: d, d?.
month: d, d?.
 year: d, d, d, d.
   -d: ["0"-"9"].

Generates:

<date>
   <day>04</day>
   <month>12</month>
   <year>2025</year>
</date>
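An actual ixml processor derives the XML directly from the grammar; purely as an illustration of the effect, the behaviour of this particular grammar can be emulated in Python with a regular expression whose named groups mirror the nonterminals (this is not how ixml implementations work internally):

```python
import re

# Named groups mirror the grammar's nonterminals; the separators,
# marked -"/" in the grammar, are matched but not captured.
date = re.compile(r'(?P<day>\d{1,2})/(?P<month>\d{1,2})/(?P<year>\d{4})')

m = date.fullmatch("04/12/2025")
xml = (f"<date><day>{m['day']}</day><month>{m['month']}</month>"
       f"<year>{m['year']}</year></date>")
print(xml)  # <date><day>04</day><month>12</month><year>2025</year></date>
```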

Flexibility

Dates in different formats:

04/12/2025

and

2025-12-04

 date: day, -"/", month, -"/", year;
       year, -"-", month, -"-", day.
  day: d, d?.
month: d, d?.
 year: d, d, d, d.
   -d: ["0"-"9"].
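The semicolon in the first rule separates alternatives: the processor tries each in turn. The same effect can be sketched in Python, again only as an illustration, by trying each pattern until one matches:

```python
import re

# Two alternatives, as in the grammar's first rule
patterns = [
    re.compile(r'(?P<day>\d{1,2})/(?P<month>\d{1,2})/(?P<year>\d{4})'),
    re.compile(r'(?P<year>\d{4})-(?P<month>\d{1,2})-(?P<day>\d{1,2})'),
]

def parse_date(text):
    """Return the fields of the first matching alternative, or None."""
    for p in patterns:
        m = p.fullmatch(text)
        if m:
            return m.groupdict()
    return None

print(parse_date("04/12/2025"))  # {'day': '04', 'month': '12', 'year': '2025'}
print(parse_date("2025-12-04"))
```

Both input formats yield the same three fields, so downstream code no longer cares which format was used.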

Larger example

bibliography: (biblioentry, nl)*.
 biblioentry: -"[", @abbrev, -"] ",
              (author; editor), -", ",
              title, -", ",
              publisher, -", ", 
              pubdate, -", ", 
              (artpagenums, -", ")?,
              (bibliomisc; biblioid)**-", ".
author: name.
editor: name, -" (ed.)".
 -name: firstname, surname.

etc.

[spec] Steven Pemberton (ed.), Invisible XML Specification, 
   invisiblexml.org, 2022,
   https://invisiblexml.org/ixml-specification.html

giving

<biblioentry abbrev='spec'>
   <editor>
      <firstname>Steven</firstname>
      <surname>Pemberton</surname>
   </editor>
   <title>Invisible XML Specification</title>
   <publisher>invisiblexml.org</publisher>
   <pubdate>2022</pubdate>
   <link href='https://invisiblexml.org/ixml-specification.html'/>
</biblioentry>

Use cases

I designed Invisible Markup in order to get unstructured data into a system that required structure.

However, since it became available, it has been put to an incredibly diverse range of uses. For instance:

Roundtripping

The resulting structured data contains much more semantic data than the original unstructured data.

As a result it is rather easy, after processing, to reconstruct the data in its original format, if that is needed.
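Because the grammar records which literals were deleted (such as the -"/" separators in the date example), re-serialisation is straightforward. A minimal sketch, using the date example from earlier: read the XML back and reinsert the deleted separators:

```python
import xml.etree.ElementTree as ET

xml = "<date><day>04</day><month>12</month><year>2025</year></date>"
root = ET.fromstring(xml)

# Reinsert the separators the grammar marked as deleted (-"/")
original = "/".join(root.find(tag).text for tag in ("day", "month", "year"))
print(original)  # 04/12/2025
```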

Implementations

There are now a number of different implementations, six in use that I know of.

Conclusion

There's a vast amount of data in the world

Most of that data is unstructured

Computer programs need structured data

Adding markup is expensive and slow

Invisible Markup gives the best of both worlds:

ADVERT

First international Invisible Markup Symposium, Feb 26/27

Two afternoons, Online, Free to attend.

https://invisiblexml.org/events/