Of the huge amount of data in the world, very little is structured, which restricts its usability and availability; adding structure is costly.
Invisible Markup is a method of adding structure to data without touching the data itself.
We live in an exponential world.
We have known Moore's Law since 1965: the prediction that the number of components on a chip, at constant price, would double every two years.
The following examples are mostly taken from Derek J. de Solla Price's book "Little Science, Big Science... and Beyond" (the original edition dates from the 1960s). Most of them I haven't checked against modern data.
Entries in dictionaries of national biography
Labour force
Population (I checked this one, and got 58 years)
Number of universities
Gross National Product (I got 10 years for UK 1955-2012)
Important discoveries
Important physicists
Number of chemical elements known
Accuracy of instruments
College entrants/1000 population
B.A., B.Sc.
Scientific journals
Membership of scientific institutes
Number of chemical compounds known
Number of scientific abstracts, all fields
The amount of light at constant price
Number of asteroids known
Literature in many scientific disciplines
Number of telephones in United States
Number of engineers in United States
Speed of transportation
Kilowatt-hours of electricity generated
Number of overseas telephone calls
Magnetic permeability of iron
The amount of light from LEDs at constant price (doubling time about 3 years)
The stock market
Number of components on a chip at constant price (Recently checked)
Million electron volts of accelerators. (With 1960 data I got about 1.7 years. With modern data, I got more or less exactly 2 years.)
Peak Internet throughput at AMSIX (28 doublings over 43 years since 1988)
The number of people getting Covid-19 before restrictions were imposed (after 40 days you've got 1000, after 80 days a million).
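As a sanity check on that last item (a sketch, using only the quoted numbers), both figures correspond to a doubling time of about four days:

$$1000 \approx 2^{10} \;\Rightarrow\; \tfrac{40\text{ days}}{10} = 4\text{ days per doubling}, \qquad 10^{6} \approx 2^{20} \;\Rightarrow\; \tfrac{80\text{ days}}{20} = 4\text{ days per doubling}.$$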
We knew in the 1960s that data had been doubling every 15 years.
Since the turn of the millennium, new data has been mainly digital.
In 2007 there was nearly 300 EB of data, of which 94% was digital.
Data point: the Internet Archive, which stood at 2 TB in 1995, is currently at 175 PB.
How to remember the prefixes (if you know Greek):
TETRA = 4 → TERA = 1000⁴
PENTA = 5 → PETA = 1000⁵
HEX = 6 → EXA = 1000⁶
HEPTA = 7 → ZETTA = 1000⁷
OCTO = 8 → YOTTA = 1000⁸
(FYI: the next two are Ronna and Quetta, but you won't need those for a few years yet, about 10 for Ronna and 20 for Quetta.)
If 1 byte = 1 second, then
1KB = 17 mins
1MB = 12 days
1GB = 34 years
1TB = 35 millennia
1PB = 36M years
1EB = 2½ × age of universe
1ZB = 1000 × that
1YB = 1000 × that
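These equivalences appear to assume binary prefixes (1 KB = 2¹⁰ bytes, and so on); as a check on two of them:

$$1\,\text{GB} = 2^{30}\,\text{s} \approx 1.07\times10^{9}\,\text{s} \approx 34\text{ years}, \qquad 1\,\text{TB} = 2^{40}\,\text{s} \approx 1.1\times10^{12}\,\text{s} \approx 35\text{ millennia}.$$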
In 2025 about 181 ZB of new data was produced.
In 2015 that was 15 ZB.
This is a doubling time of 2.78 years.
This means that every three years, we produce as much data as we already have.
Which means that in the next three years we will produce about 1YB of new data, which suggests that is how much we now have.
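The doubling time follows directly from those two figures: production grew by a factor of about 12 in ten years, which is roughly 3.6 doublings:

$$\frac{181}{15} \approx 12.1 \approx 2^{3.6} \quad\Rightarrow\quad \text{doubling time} \approx \frac{10\text{ years}}{3.6} \approx 2.78\text{ years}.$$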
The good news is that disk capacity is growing faster than the data.
Most data is unstructured, or only slightly structured.
Thus, every program has to have an input function that recognises and reads the data and creates internal data structures.
And even though we are in the 4th decade of the internet age, we are still providing images of paper rather than real data!
Examples include ticketing, contracts, and receipts.
These are all typically PDFs, with no machine processable elements. They are a picture of a paper version of the thing.
The only thing that has happened is that the paper has been digitised away, and is sent to you electronically. Otherwise it is the same as it ever was.
Information can be used in two ways (at the same time): to communicate to people, and to communicate between machines.
For instance, there is a service, tripit, to which you send tickets, hotel bookings, and so on. It assembles them and creates an itinerary for you automatically. Really handy: everything in one place.
But it has to know what the information is.
It has to try and work out what is in these things, in order to do something useful with it. It often gets it wrong.
This is the sort of thing tripit has to deal with: a ticket sent to me as a PDF.
A half megabyte of picture and no data.
They have to OCR it, and try and work out what information is contained in it.
This is the essence of that PDF ticket (250 bytes):
document: ticket
type: train
supplier: Eurostar
reference: PCX4GZ
passenger: Steven Pemberton
train: 9114
leave: 2023-07-20T08:16:00+02:00
from: Amsterdam CS
to: St Pancras International
arrive: 2023-07-20T13:51:00+01:00
class: SP
coach: 3
seat: 21
Making this pretty for a human reader is a trivial task, and the technology already exists to do that.
Automatically getting the information out of a PDF is not trivial: it is hard, and it often goes wrong.
Textual documents have an implicit structure that is mostly recognisable to human readers.
For computers, the structure has to be made explicit to enable processing.
That is why markup languages were invented: rather than the structure being implicit, and needing a separate input routine, it is made explicit, and all programs can then have a single input routine to read all data.
Furthermore, the data is self-describing, making it easier to analyse, combine, and reuse.
The only problem is that converting unstructured data into a marked-up version is a lot of work.
Which is why Invisible Markup was invented.
Rather than add markup to a document, a description of the structure of the class of documents is created, and that is used to automatically structure the data.
You get the best of both worlds: the data stays in its original, human-readable form, while programs see it as fully structured, self-describing data.
How it works:
Extra information in the grammar controls the serialisation: a "-" mark hides an item from the output (a hidden terminal disappears entirely, while a hidden rule keeps its content but loses its element tags), and "@" serialises a rule as an attribute.
A date:
04/12/2025
Description:
date: day, -"/", month, -"/", year.
day: d, d?.
month: d, d?.
year: d, d, d, d.
-d: ["0"-"9"].
Generates:
<date>
   <day>04</day>
   <month>12</month>
   <year>2025</year>
</date>
Dates in different formats:
04/12/2025
and
2025-12-04
date: day, -"/", month, -"/", year;
year, -"-", month, -"-", day.
day: d, d?.
month: d, d?.
year: d, d, d, d.
-d: ["0"-"9"].
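Applied to the second form, the elements come out in the order they were matched; the result would be something like:

<date>
   <year>2025</year>
   <month>12</month>
   <day>04</day>
</date>

(the first form gives the day/month/year version shown earlier).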
bibliography: (biblioentry, nl)*.
biblioentry: -"[", @abbrev, -"] ",
(author; editor), -", ",
title, -", ",
publisher, -", ",
pubdate, -", ",
(artpagenums, -", ")?,
(bibliomisc; biblioid)**-", ".
author: name.
editor: name, -" (ed.)".
-name: firstname, surname.
etc.
[spec] Steven Pemberton (ed.), Invisible XML Specification, invisiblexml.org, 2022, https://invisiblexml.org/ixml-specification.html
giving
<biblioentry abbrev='spec'>
<editor>
<firstname>Steven</firstname>
<surname>Pemberton</surname>
</editor>
<title>Invisible XML Specification</title>
<publisher>invisiblexml.org</publisher>
<pubdate>2022</pubdate>
<link href='https://invisiblexml.org/ixml-specification.html'/>
</biblioentry>
I designed Invisible Markup in order to get unstructured data into a system that required structured data.
However, since it became available, it has been put to an incredible diversity of uses.
The resulting structured data contains much more semantic data than the original unstructured data.
As a result it is rather easy, after processing, to reconstruct the data in its original format, if that is needed.
There are now a number of different implementations, six in use that I know of.
There's a vast amount of data in the world
Most of that data is unstructured
Computer programs need structured data
Adding markup is expensive and slow
Invisible Markup gives the best of both worlds: the data stays in its original form, and programs get it fully structured.
First international Invisible Markup Symposium, Feb 26/27
Two afternoons, Online, Free to attend.