Invisible Structure

Steven Pemberton, CWI, Amsterdam

The author

Abstract

Of the huge amounts of data in the world, very little is structured, restricting its usability and availability. Adding structuring is costly.

Invisible Markup is a method of adding structuring to data without touching the data itself.

Exponential world

Moore's Law is not dead

We live in an exponential world.

We have known Moore's Law since 1965: the prediction that the number of components on a chip would double, at constant price, every two years.

Other Doublings

These are mostly taken from a 1960s book, "Little Science, Big Science... and Beyond", by Derek J. de Solla Price. Most of them I haven't checked against modern data.

Data

Number of scientific journals, 1665-1960

We knew in the 1960s that data had been doubling every 15 years.

Data (2007)

Explosion of digital data

Since the turn of the millennium, new data has been mainly digital.

In 2007 there was nearly 300 EB of data, of which 94% was digital.

Data point: the Internet Archive, at 2 TB in 1995, is currently at 175 PB.

Prefixes

How to remember the prefixes (if you know Greek):

TETRA = 4 → TERA = 1000⁴
PENTA = 5 → PETA = 1000⁵
HEX = 6 → EXA = 1000⁶
HEPTA = 7 → ZETTA = 1000⁷
OCTO = 8 → YOTTA = 1000⁸

(FYI: the next two are ronna and quetta, but you won't need those for a few years yet: about 10 years for ronna and 20 for quetta.)

Big Numbers, Little Numbers

If 1 byte = 1 second, then

1KB = 17 mins
1MB = 12 days
1GB = 34 years
1TB = 35 millennia
1PB = 36M years
1EB = 2.6 × age of universe
1ZB = 1000 × that
1YB = 1000 × that

In 2025 about 181 ZB of new data was produced.
In 2015 that was 15 ZB.
This is a doubling time of 2.78 years.

This means that every three years, we produce as much data as we already have.

Which means that in the next three years we will produce about 1 YB of new data, and that about 1 YB is also roughly how much we now have.
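The arithmetic above can be checked with a few lines of Python. The 15 ZB and 181 ZB figures are the ones quoted in the text, and the time figures assume binary units (1 GB = 2³⁰ bytes):

```python
import math

SECONDS_PER_YEAR = 60 * 60 * 24 * 365.25

# 1 byte = 1 second: 1 GB (2**30 bytes) comes out at about 34 years
years_per_gb = 2**30 / SECONDS_PER_YEAR
print(f"1GB = {years_per_gb:.0f} years")        # 34 years

# Doubling time from the two data points: 15 ZB in 2015, 181 ZB in 2025
growth = 181 / 15                               # roughly twelvefold in 10 years
doubling = 10 * math.log(2) / math.log(growth)
print(f"doubling time = {doubling:.2f} years")  # 2.78 years
```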

The good news is that disk capacity is growing faster than the data.

Unstructured data

Most data is unstructured, or only slightly structured.

Thus, every program has to have an input function that recognises and reads the data and creates internal data structures.

And even though we are in the 4th decade of the internet age, we are still providing images of paper rather than real data!

The web imitating the old

A receipt

Examples include ticketing, contracts, and receipts.

These are all typically PDFs, with no machine-processable elements. They are a picture of a paper version of the thing.

The only thing that has happened is that the paper has been digitised away, and is sent to you electronically. Otherwise it is the same as it ever was.

Machine-readable information

Information can be used in two ways (at the same time): to communicate to people, and to communicate between machines.

For instance there is a service, TripIt, to which you send tickets, hotel bookings, and so on. It assembles them and creates an itinerary for you automatically. Really handy: everything in one place.

But it has to know what the information is.

It has to try to work out what is in these things in order to do something useful with them, and it often gets it wrong.

Example ticket

A train ticket as PDF

This is the sort of thing TripIt has to deal with: a ticket sent to me as a PDF.

A half megabyte of picture and no data.

They have to OCR it, and try to work out what information it contains.

The Data

This is the essence of that PDF ticket (250 bytes):

 document: ticket
     type: train
 supplier: Eurostar
reference: PCX4GZ
passenger: Steven Pemberton
    train: 9114
    leave: 2023-07-20T08:16:00+02:00
     from: Amsterdam CS
       to: St Pancras International
   arrive: 2023-07-20T13:51:00+01:00
    class: SP
    coach: 3
     seat: 21

Making this pretty for a human reader is a trivial task, and the technology already exists to do that.

Automatically getting the information out of a PDF is not trivial: it is hard, and it often goes wrong.
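To illustrate how easy the structured essence is to process, compared with OCR'ing a PDF, here is a short Python sketch (using a subset of the fields from the example above) that reads the key: value form into a dictionary:

```python
# A subset of the 250-byte essence of the ticket
ticket = """\
 document: ticket
     type: train
 supplier: Eurostar
reference: PCX4GZ
passenger: Steven Pemberton
"""

# One line of logic: split each line at the first ": "
data = dict(line.strip().split(": ", 1)
            for line in ticket.strip().splitlines())

print(data["supplier"])   # Eurostar
print(data["reference"])  # PCX4GZ
```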

Structure

Textual documents have an implicit structure that is mostly recognisable to human readers.

For computers the structure has to be made explicit to enable processing.

That is why markup languages were invented: rather than the structure being implicit, and needing a separate input routine, it is made explicit, and all programs can then have a single input routine to read all data.

Furthermore, the data is self-describing, making it easier to analyse, combine, and reuse.

The only problem is that converting unstructured data into a marked-up version is a lot of work.

Invisible Markup

Which is why Invisible Markup was invented.

Rather than add markup to a document, a description of the structure of the class of documents is created, and that is used to automatically structure the data.

You get the best of both worlds:

How it works:

Extra information in the grammar controls the serialisation.

Very Simple Example

A date:

04/12/2025

Description:

 date: day, -"/", month, -"/", year.
  day: d, d?.
month: d, d?.
 year: d, d, d, d.
   -d: ["0"-"9"].

Generates:

<date>
   <day>04</day>
   <month>12</month>
   <year>2025</year>
</date>
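An actual ixml processor derives the XML directly from the grammar; purely as an illustration of the effect, the behaviour of this particular grammar can be emulated in Python with a regular expression whose named groups mirror the nonterminals (this is not how ixml implementations work internally):

```python
import re

# Named groups mirror the grammar's nonterminals; the separators,
# marked -"/" in the grammar, are matched but not captured.
date = re.compile(r'(?P<day>\d{1,2})/(?P<month>\d{1,2})/(?P<year>\d{4})')

m = date.fullmatch("04/12/2025")
xml = (f"<date><day>{m['day']}</day><month>{m['month']}</month>"
       f"<year>{m['year']}</year></date>")
print(xml)  # <date><day>04</day><month>12</month><year>2025</year></date>
```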

Flexibility

Dates in different formats:

04/12/2025

and

2025-12-04

 date: day, -"/", month, -"/", year;
       year, -"-", month, -"-", day.
  day: d, d?.
month: d, d?.
 year: d, d, d, d.
   -d: ["0"-"9"].
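The semicolon in the first rule separates alternatives: the processor tries each in turn. The same effect can be sketched in Python, again only as an illustration, by trying each pattern until one matches:

```python
import re

# Two alternatives, as in the grammar's first rule
patterns = [
    re.compile(r'(?P<day>\d{1,2})/(?P<month>\d{1,2})/(?P<year>\d{4})'),
    re.compile(r'(?P<year>\d{4})-(?P<month>\d{1,2})-(?P<day>\d{1,2})'),
]

def parse_date(text):
    """Return the fields of the first matching alternative, or None."""
    for p in patterns:
        m = p.fullmatch(text)
        if m:
            return m.groupdict()
    return None

print(parse_date("04/12/2025"))  # {'day': '04', 'month': '12', 'year': '2025'}
print(parse_date("2025-12-04"))
```

Both input formats yield the same three fields, so downstream code no longer cares which format was used.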

Larger example

bibliography: (biblioentry, nl)*.
 biblioentry: -"[", @abbrev, -"] ",
              (author; editor), -", ",
              title, -", ",
              publisher, -", ", 
              pubdate, -", ", 
              (artpagenums, -", ")?,
              (bibliomisc; biblioid)**-", ".
author: name.
editor: name, -" (ed.)".
 -name: firstname, surname.

etc.

[spec] Steven Pemberton (ed.), Invisible XML Specification, 
   invisiblexml.org, 2022,
   https://invisiblexml.org/ixml-specification.html

giving

<biblioentry abbrev='spec'>
   <editor>
      <firstname>Steven</firstname>
      <surname>Pemberton</surname>
   </editor>
   <title>Invisible XML Specification</title>
   <publisher>invisiblexml.org</publisher>
   <pubdate>2022</pubdate>
   <link href='https://invisiblexml.org/ixml-specification.html'/>
</biblioentry>

Use cases

I designed Invisible Markup in order to get unstructured data into a system that required structure.

However, since it became available, it has been put to an incredibly diverse range of uses. For instance:

Roundtripping

The resulting structured data contains much more semantic data than the original unstructured data.

As a result it is rather easy, after processing, to reconstruct the data in its original format, if that is needed.
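Because the grammar records which literals were deleted (such as the -"/" separators in the date example), re-serialisation is straightforward. A minimal sketch, using the date example from earlier: read the XML back and reinsert the deleted separators:

```python
import xml.etree.ElementTree as ET

xml = "<date><day>04</day><month>12</month><year>2025</year></date>"
root = ET.fromstring(xml)

# Reinsert the separators the grammar marked as deleted (-"/")
original = "/".join(root.find(tag).text for tag in ("day", "month", "year"))
print(original)  # 04/12/2025
```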

Implementations

There are now a number of different implementations, six in use that I know of.

Conclusion

There's a vast amount of data in the world

Most of that data is unstructured

Computer programs need structured data

Adding markup is expensive and slow

Invisible Markup gives the best of both worlds:

ADVERT

First international Invisible Markup Symposium, Feb 26/27

Two afternoons, Online, Free to attend.

https://invisiblexml.org/events/