On the Design of the URL

Abstract

Notations can affect the way we think, and how we operate; consider as a simple example the difference between Roman Numerals and Arabic Numerals, where Arabic Numerals allow us not only to more easily represent numbers, but also ease manipulations of numbers and calculations.

One of the innovations of the World Wide Web was the URL. In the last 30 years, URLs have become an ever-present element of everyday life, so present that we scarcely even grant them a second thought. And yet they are a designed artefact: there is nothing natural about their structure -- each part is there as part of a design.

This talk looks at the design issues behind the URL, what a URL is meant to represent, and how it relates to the resources it identifies, and its relationship with representational state transfer (REST) and the protocols that REST is predicated on. The talk considers, with hindsight, how the design, if at all, could have been improved.

While it is too late now to change the design of URLs, are there any lessons that we can draw from their design, and if so, can they be used to direct the future designs of notations?

Birthplace of the European Open Internet

In November 1988 the first European open internet connection was established here, with the mind-boggling speed of 64kbps. Since then it has nearly doubled every year, and is currently peaking at nearly 8.4 Tbps. (Note how visible the lockdown is).

That is 27 doublings in 32 years.
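
The figure of 27 doublings can be checked with a little arithmetic. This sketch only uses the two speeds quoted above (64 kbps and roughly 8.4 Tbps) and counts the doublings between them.

```python
import math

start_bps = 64e3     # 64 kbps, the 1988 link speed quoted above
peak_bps = 8.4e12    # roughly 8.4 Tbps, the 2020 peak quoted above

# Number of times the starting speed must double to reach the peak.
doublings = math.log2(peak_bps / start_bps)
print(round(doublings, 1))  # about 27
```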

Amsix stats 2020, peaking at 8.3 Tb/s

The Web

The introduction of the internet in Europe was an impetus for the creation of the World Wide Web.

Tim Berners-Lee wrote his proposal in March 1989; the first web server was running in 1990, and was announced in 1991.

It couldn't have been created in a more international environment: by a Brit and a Dutch-speaking Belgian with a French surname on designated international territory straddling the Swiss and French borders.

The URL

The World Wide Web cleverly combined several existing technologies such as hypertext, SGML, and the internet to create a successful global information system.

One of its innovations was the URL, allowing resources over the whole world to be identified for retrieval.

References

If you are interested you can read an early design document: Universal Document Identifiers on the Network (OSI-DS 29) (1992)

The first formal definition was: RFC 1738 (1994)

The most recent is: RFC 3986 (2005)

Internationalised in: RFC 3987 (2005)

Updated for IPv6 in: RFC 6874 (2013)

See also: URI Design and Ownership (2014)

The Role of the URL

There are four terms used: URL, URN, URI, and IRI.

URL: Uniform Resource Locator, to locate and retrieve a resource.
```
http://www.cwi.nl/~steven/Talks/2020/10-09-urls/
```
URN: Uniform Resource Name, to give a name to a resource, without specifying how to retrieve it.
```
urn:isbn:0451450523
```
URI: Uniform Resource Identifier, the umbrella name for URL and URN.
IRI: Internationalised Resource Identifier, using a wider range of Unicode characters:
```
https://www.石川.日本/雅康#mimasa
```

The Structure of the URL (and of this talk)

A URL consists of several distinct parts, each with its own special character to demarcate either its start or end:

scheme: //authority /path ?query #fragment

where an authority for HTTP is

user:password@ host :port

A client uses a URL to retrieve a resource from a server.

The client uses all parts except the fragment to retrieve the resource.

The server only sees and uses the path and the query parts.

The client uses the fragment once the resource has been retrieved.
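
As an illustration of the parts listed above, here is a minimal sketch using Python's standard urllib.parse module. The URL itself is hypothetical, made up only so that every part is present.

```python
from urllib.parse import urlsplit

# Hypothetical URL, chosen only to exercise every part.
url = "http://user:secret@www.example.com:8080/Talks/2020/slides.html?page=3#intro"
parts = urlsplit(url)

print(parts.scheme)    # http
print(parts.username)  # user
print(parts.password)  # secret
print(parts.hostname)  # www.example.com
print(parts.port)      # 8080
print(parts.path)      # /Talks/2020/slides.html
print(parts.query)     # page=3
print(parts.fragment)  # intro
```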

Syntax

The syntax of a URI is very general, and different parts may have different syntaxes. The URI specs only define the characters that may be used, but not the syntax of those parts.

Owners of other parts are:

The scheme definition defines the syntax of the authority and path
The media type definition of a retrieved resource defines the syntax of a fragment.

It is a little vague who owns the syntax of a query.

Leaving parts out

Every single part of a URL is optional, though there are some combinations you never see in real life, such as leaving the scheme out but not the host:

//www.example.com/

Or retaining the scheme but not the host:

http:/file.txt

In fact even the empty string is a valid URL (meaning "the current resource"), and it even has a few uses.
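
Parts that are left out are filled in from the URL of the current resource. A small sketch of that resolution, using urljoin from Python's standard library; the base URL is a hypothetical "current resource".

```python
from urllib.parse import urljoin

# Hypothetical URL of the current resource.
base = "http://www.example.org/Talks/2020/slides.html"

# Scheme left out, host kept: the scheme is taken from the base.
print(urljoin(base, "//www.example.com/"))  # http://www.example.com/

# Only a path: scheme and host are taken from the base.
print(urljoin(base, "/index.html"))         # http://www.example.org/index.html

# The empty string: the current resource itself.
print(urljoin(base, "") == base)            # True
```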

Scheme

The scheme principally represents a couple of things:

The protocol to be used
The default port number

Not all schemes have a formally defined protocol (or port for that matter), for instance mailto: and file: which depend on native properties of the client platform.

Some naughty applications specify schemes in order to identify document types for that application. This is wrong.

Schemes are used for defining how to retrieve resources.

The returned media type of the retrieved resource should be used for initiating an application.

HTTP

The most important scheme is clearly http.

An obvious observation is that there is no reason for the name of the scheme to match the name of the protocol it identifies.

It is one of the great missed chances not to have used web: as the name of the main scheme, rather than the technical-looking http:

We will say something about the https: scheme at the end of the talk.

The Double Slash

The double slash announces the start of the authority, the name for everything up to the path, in other words user:password, hostname, and port.

If there is no authority part, you don't need to use the double slash; that notwithstanding, you still see a lot of file: URLs of the form

file:///home/user/documents/file.txt

when in fact

file:/home/user/documents/file.txt

means the same thing.
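
One way to see that the two forms are equivalent is to parse both; this sketch uses Python's urlsplit as the parser, which is only one implementation among many.

```python
from urllib.parse import urlsplit

with_slashes = urlsplit("file:///home/user/documents/file.txt")
without = urlsplit("file:/home/user/documents/file.txt")

# Neither form has an authority ...
print(with_slashes.netloc == "" and without.netloc == "")  # True
# ... and both carry the same path.
print(with_slashes.path == without.path)                   # True
```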

Why Double Slash?

Any suitable character would have done to announce the start of the authority.

Berners-Lee reports that he took the double slash from the Apollo file system, where they needed a syntax to differentiate a local file

/folder/file

from a file over the network:

//machine/folder/file

As he said:

"When I was designing the Web, I tried to use forms which people would recognize from elsewhere."

Design

In fact Berners-Lee has been reported as saying the double slash was the only thing he would change in his design because "it is unnecessary".

I tend to disagree, because it is useful to be able to distinguish between the machine (the authority) and the path. You need to be able to distinguish between

http:machine/file

and

http:directory/file

unless you dictate that a scheme always requires an authority.

User and password

It may seem bizarre in the context of the modern web that there is a place in a URL for a username and password: they are both exposed in any document containing the URL, and exposed again as they are passed over the connection (there is no secure method of transmitting them in the protocol).

The reason they are there is almost certainly because ftp required them to retrieve documents, and so they had to be present in the URL. The username was almost always anonymous.

We will say more about passwords later.

Host

The host part was already a given fact, having been defined for the DNS system in the mid 1980s.

It can be either a hostname using DNS, or an IP address.

Interestingly, a hostname can also be a number, so you may not be able to tell them apart until the very last segment:

192.168.178.com
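
A sketch of how a client might tell the two apart: try to parse the host as an IP address, and treat it as a hostname if that fails. The helper name is_ip_address is made up for this example.

```python
import ipaddress

def is_ip_address(host: str) -> bool:
    """Return True if host parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False

print(is_ip_address("192.168.178.1"))    # True
print(is_ip_address("192.168.178.com"))  # False: a hostname under .com
```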

The host part of a URL originally identified a unique machine.

It was only realised later that several hostnames could point to the same machine, and a single machine could easily host several servers.

It required an update to the HTTP protocol to support this properly though.

www

Because originally the host identified a single machine, many sites kept one machine specifically reserved as a Web server.

The first two URLs published in 1991 (as far as I can tell) were

http://cernvm.cern.ch/FIND

and

http://info.cern.ch/hypertext/WWW/TheProject.html

Notably neither server was called "www".

And yet within a short period, nearly all webserver hosts were called exactly that, so that many people thought that "www" somehow was an essential part of a URL.

Again, it is frankly weird that people didn't settle on 'web' instead of 'www', since www, with 9 syllables, is an acronym that is 3 times longer (in English) than the 3 syllable phrase it stands for.

Port

Each protocol has a default port. For instance, http uses 80. However it doesn't have to be at 80, and this is the place you can change it.

However, it is rather odd that it is here, since it is associated with the protocol, not the authority.

It should really have been something like

http-8080://example.com/doc.html

or similar.
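
The default-port rule described above can be sketched as a lookup keyed on the scheme. The table below is illustrative rather than exhaustive, and effective_port is a name invented for this example.

```python
from urllib.parse import urlsplit

# Default ports for a few schemes; illustrative, not exhaustive.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}

def effective_port(url: str):
    """Port given in the URL, else the scheme's default (None if unknown)."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port
    return DEFAULT_PORTS.get(parts.scheme)

print(effective_port("http://example.com/doc.html"))       # 80
print(effective_port("http://example.com:8080/doc.html"))  # 8080
```

Note that the lookup key is the scheme, not the host, which is the point made above: the default belongs to the protocol.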

Endianism

If I write an address in the UK, it looks something like

Steven Pemberton
21, Sandridge Road,
St. Albans,
Herts,
UK

What you see is that it is almost entirely little-endian: from left to right and top to bottom the information gets more and more significant (except the 21, which is big-endian).

Endianism

On the other hand, an address in Soviet Russia looked like this:

РОССИЯ
г.Москва 125252
ул.Куусинена 21-Б
Международный Центр Научной и Технической Информации
Чичикову П.И.

which is entirely big-endian, even down to the person's name with family name first.

Endianism

We're confronted with clashes in endianism all the time:

Times: 12:15:45 - big-endian

Dates: 31/12/1999 - little-endian

Dates (in USA): 12/31/1999 - confused-endian

Numbers

We inherited our number system from Arabic.

The interesting thing is if you look at a piece of Arabic writing, such as this from the Wikipedia page on the Second World War:

الحرب العالمية الثانية: هي نزاع دولي مدمر بدأ في الأول من سبتمبر 1939 في أوروبا وانتهى في الثاني من سبتمبر 1945، شاركت فيه الغالبية العظمى من دول العالم، في حلفين رئيسيين هما: قوات الحلفاء ودول المحور. وقد وضعت الدول الرئيسية كافة قدراتها العسكرية والاقتصادية والصناعية والعلمية في خدمة المجهود الحربي، وتعد الحرب العالمية الثانية من الحروب الشمولية، وأكثرها كلفة في تاريخ البشرية لاتساع بقعة الحرب وتعدد مسارح المعارك والجبهات فيها، حيث شارك فيها أكثر من 100 مليون جندي، وتسببت بمقتل ما بين 50 إلى 85 مليون شخص ما بين مدنيين وعسكريين، أي ما يعادل 2.5% من سكان العالم في تلك الفترة.

what is notable is that even though the text is read right-to-left, the numbers are still what an English-language reader would consider "the right way round".

In other words, in the original Arabic form, numerals were little-endian (which has some slight arithmetical advantages by the way), but were imported into Western languages back-to-front, and so became big-endian.

History

It is interesting to note that historically, before the introduction of Arabic numerals, we spoke numbers (at least under one hundred) little endian:

sixteen,
four-and-twenty blackbirds baked in a pie,
the time is five and twenty past three.

Arabic numerals started being introduced during the enlightenment, which may have affected how we spoke numbers, since in English at least we now speak numbers big-endian, with the exception of the numbers between 13 and 19 inclusive.

Shakespeare

Shakespeare used both styles of numbering.

I did a bit of research and discovered that he used little-endian numbers (four and twenty) about twice as often as big-endian ones (twenty four), which may show that the style of speaking numbers was changing at that time (though you need to take into account the requirements of the metre).

Endianism in URLs

At the point where the authority ends and the path begins, a URL shifts from little-endian to big-endian.

www.example.com - little-endian

/Talks/2020/09/slides.html - big-endian

Berners-Lee has in the past suggested that he would have preferred

http:com/example/www/Talks/2020/10/slides.html

but this does have a problem that it is hard to locate where the authority ends, and where the path begins, and could even be ambiguous.
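
The ambiguity can be made concrete with a small sketch. The function big_endian is hypothetical: it simply reverses the host labels and runs them straight into the path, as in the form above, and the second host name is invented for the comparison.

```python
def big_endian(host: str, path: str) -> str:
    """Reverse the host labels and run them straight into the path."""
    labels = host.split(".")
    return "/".join(reversed(labels)) + path

print(big_endian("www.example.com", "/Talks/2020/10/slides.html"))
# com/example/www/Talks/2020/10/slides.html

# Two different host/path splits collapse to the same string:
a = big_endian("www.example.com", "/Talks/slides.html")
b = big_endian("Talks.www.example.com", "/slides.html")
print(a == b)  # True
```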

Path

The path suggests a file path, and in the first implementations, and many current ones, that is exactly what it is; but it isn't a requirement.

In fact, officially, the path is opaque: you cannot necessarily conclude anything from it.

For instance, there is no a priori reason why

/2020/Talks/

and

/Talks/2020/

shouldn't refer to the same thing.

Type

At the end of a vast number of URLs is a file type

http://www.example.com/index.html

and

http://example.org/logo.gif

and this reflects that many implementations map a URL directly onto the filestore.

This is a great pity, because it means we lose out on a great under-used property of http.

Accept

HTTP has a number of headers that allow you to characterise properties of the resource that you are interested in. This is called content negotiation, though there is no actual negotiation going on:

language: which languages you would prefer, and which languages you will accept
document type: similarly, which you prefer, and which you are willing to accept
character set/encoding
compression types your machine can deal with.

The latter two are more for machines, but the first two are definitely useful for people.
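
A sketch of how a client could express the first two preferences. Accept and Accept-Language are the real HTTP header names, and q-values are the standard way of ranking alternatives; the helper preference_header and the particular choices are made up for this example, and the helper assumes fewer than ten choices.

```python
def preference_header(choices):
    """Join choices in preference order, giving each later one a lower q-value.

    Assumes fewer than ten choices, so that every q-value stays above zero.
    """
    parts = []
    for rank, choice in enumerate(choices):
        q = 1.0 - 0.1 * rank
        parts.append(choice if rank == 0 else f"{choice};q={q:.1f}")
    return ", ".join(parts)

headers = {
    "Accept": preference_header(["application/pdf", "text/html"]),
    "Accept-Language": preference_header(["nl", "en"]),
}
print(headers["Accept"])           # application/pdf, text/html;q=0.9
print(headers["Accept-Language"])  # nl, en;q=0.9
```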

Accept type

If a page specified an image as

<img src="logo"/>

(rather than logo.gif) then the client program could include an accept header that specified which image types it preferred, and the site could include more alternatives.

If a site had a document available in several different types, a client program could list them in preference order, or a machine could check if a particular type was available by saying it only accepted that one type.

http://example.org/documents/declaration

Accept lang

Many sites offer their content in a variety of languages, and allow you to click to select a different one.

And yet, the HTTP protocol says which languages you prefer. They don't need to ask.

google.com in particular is really bad at this. If you take your laptop to another country, their interface suddenly switches to the/a language of that country, even if your browser is telling them which language you want.

In the international train between Amsterdam and London I get the Swedish interface!

Missing in the URL

One of the shortcomings of HTTP is that you can't specify these features in the URL.

For instance it would be useful if you could say you wanted the Dutch version of a PDF document with

http://example.com/doc[type=pdf;lang=nl]

Or that you preferred it, but were willing to accept something else:

http://example.com/doc[type=pdf,*;lang=nl,en]

Nothing acceptable

Another shortcoming of many servers is that while they do use the Accept: headers to decide what to give you, they don't return the correct response code when they fail to supply what you asked for. Typically you get a 200, as if all were well.

This is nearly as bad as returning a 200 when a 404 should have been returned, and it means that even if you wanted to, you couldn't write a program to discover what a site offered.
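
The status code HTTP defines for this case is 406 (Not Acceptable). A minimal sketch of what a server could do, assuming it has a list of the media types it can supply; parse_accept and negotiate are invented names, and the sketch ignores partial wildcards such as image/*.

```python
def parse_accept(header: str):
    """Return the media types in an Accept header, most preferred first."""
    weighted = []
    for item in header.split(","):
        fields = [f.strip() for f in item.split(";")]
        q = 1.0
        for field in fields[1:]:
            if field.startswith("q="):
                q = float(field[2:])
        if fields[0] and q > 0:
            weighted.append((q, fields[0]))
    # sorted() is stable, so equal q-values keep their original order
    return [t for q, t in sorted(weighted, key=lambda pair: -pair[0])]

def negotiate(accept: str, available):
    """Return (status, media type): 200 with the best match, or 406."""
    for wanted in parse_accept(accept):
        if wanted == "*/*" and available:
            return 200, available[0]
        if wanted in available:
            return 200, wanted
    return 406, None

print(negotiate("application/pdf, text/html;q=0.5", ["text/html"]))  # (200, 'text/html')
print(negotiate("application/pdf", ["text/html"]))                   # (406, None)
```

With a reliable 406, a program can probe for a particular type by saying it accepts only that one.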

Query

The query allows you to send a small amount of data with the URL, rather than in the body of the HTTP request.

The major design blunder of the URL for HTML was using an ampersand to separate the values, since it is a special character in HTML, meaning that you can't just paste a URL into HTML, but have to replace every occurrence of & with &amp;.

There was later a rear-guard action to try and replace it with semicolon, but too late.

Another weirdness of the HTML query is that spaces should officially be replaced with a "+", and therefore "+" should be replaced with %2B.
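
Both quirks show up in a short sketch using Python's standard library: urlencode builds an HTML-form-style query, and html.escape does the extra step needed before the URL can be pasted into HTML. The search parameters and host are hypothetical.

```python
from html import escape
from urllib.parse import urlencode

# Hypothetical search parameters.
query = urlencode({"q": "C++ URL design", "lang": "nl"})
print(query)  # q=C%2B%2B+URL+design&lang=nl

url = "http://www.example.com/search?" + query
# Escaping for use inside HTML turns each & into &amp;
print(escape(url))  # http://www.example.com/search?q=C%2B%2B+URL+design&amp;lang=nl
```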

Fragment

As mentioned earlier, the fragment is for use after the resource is retrieved. Its syntax and how it is used is up to the specification for the media type of the resource returned.

Even though they are sometimes called 'fragment identifiers', and for HTML must be identifiers, there is no syntactic requirement in general that they be identifiers.

Berners-Lee reports that he chose the # sign, because of its similar use in postal addresses in the USA, indicating a sub-location in an address.

Special Characters

As we have seen, a URL uses a number of special characters for its syntax. They are, in order,

: Used in three different places - to signal the end of the scheme, to separate username and password, and to signal the start of the port.
// To signal the start of the authority
@ To signal the end of the user/password
/ To signal the start of the path
? To signal the start of the query
# To signal the start of the fragment.

The general rule is that to the left of the last use of the character in the URL syntax, the character must be encoded, and to the right, it doesn't have to be. For instance:

a colon in a password has to be encoded, but not in a path.
a question mark in the path has to be encoded, but not in a query or fragment.

One exception, and a design mistake: this doesn't hold for #, which still has to be encoded in a fragment.
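
The two examples of the rule can be tried out with a sketch like the following. The user name, password, host and path are all made up; quote and urlsplit are from Python's standard library.

```python
from urllib.parse import quote, urlsplit

# A colon in a password and a question mark in a path must be encoded.
password = quote("se:cret", safe="")  # se%3Acret
path = "/" + quote("what?", safe="")  # /what%3F

# In the query, the question mark can be left as it is.
url = f"http://user:{password}@www.example.com{path}?really?"
parts = urlsplit(url)

print(parts.password)  # se%3Acret
print(parts.path)      # /what%3F
print(parts.query)     # really?
```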

Encoding

The web (and URLs) were defined pre-Unicode, and used Latin-1.

This was already better than the rest of the internet, which used pure ASCII (for instance DNS is ASCII).

Unfortunately, when encoding a character, there is no terminator. It is just a percent sign and two hex digits: a space is %20.

This meant that when Unicode came along there was no direct way to encode its characters. You would want to just use the code point of a character; instead, you have to know how to encode it in UTF-8, and then encode each of those bytes.

So for

search?q=♥

while ideally you might want to write

search?q=%2665;

you have to write

search?q=%E2%99%A5

This is a shame.
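
The two steps, code point to UTF-8 bytes and bytes to percent-encoding, can be followed in a short sketch with Python's standard library:

```python
from urllib.parse import quote, unquote

heart = "\u2665"                      # the heart character, code point U+2665
print(f"{ord(heart):X}")              # 2665
print(heart.encode("utf-8").hex())    # e299a5
print(quote(heart))                   # %E2%99%A5
print(unquote("%E2%99%A5") == heart)  # True
```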

Spaces

A URL may not contain an (unencoded) space.

The early UDI design document gives the reason:

"The use of white space characters has been avoided in UDIs: spaces are not legal characters. This was done because of the frequent introduction of extraneous white space when lines are wrapped by systems such as mail, or sheer necessity of narrow column width, and because of the inter-conversion of various forms of white space which occurs during character code conversion and the transfer of text between applications."

However, there is the weirdness that spaces everywhere but in a query should be encoded with %20, but in a query with +, therefore meaning that + has to be encoded only in a query.
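
Python's standard library reflects this split with two pairs of functions, one for each convention. A small sketch, with a made-up piece of text:

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

text = "a + b"

# Path style: spaces become %20, and a literal + may stay as it is.
print(quote(text, safe="+"))  # a%20+%20b

# Query style: spaces become +, so a literal + must become %2B.
print(quote_plus(text))       # a+%2B+b

# The same string decodes differently under the two conventions.
print(unquote("a+b"))         # a+b
print(unquote_plus("a+b"))    # a b
```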

HTTPS

https is a mistake, one of a number of mistakes introduced by Netscape, such as <blink>, <img>, and <frame>.

The URLs

http://www.example.com/doc.html

and

https://www.example.com/doc.html

locate the same document in every way. The only difference is in details of the protocol being used to retrieve the document.

In other words, whether the connection is encrypted or not should be part of the protocol; we as users shouldn't have to care about, nor be confronted by, the negotiation between the machines about how they are going to transport the data.

Now we are being forced to change all our documents to use https links. That shouldn't have been necessary.

Passwords

The password situation on the web is a disaster.

Basic authentication in the URL must be ignored of course.

The way it is done now is that every site has its own password system (and rules for passwords), and requires every site builder to understand the finer points of passwords (which they don't), and protect the passwords sufficiently (which they don't).

The combination is a royal security mess.

Public Key Cryptography

Everyone has two matched keys:

One is public: anyone can get a copy.
One is private: only you have it.
You can lock a message with either key, making it unreadable.
If you lock with one key, only the other key can open it.

('Keys' are very large numbers, 300 digits or more; 'locking' means scrambling the message using those large numbers)

Identity

I lock a message with my private key (so it can only be opened with my public key).

I send the locked message to you.

You get a copy of my public key: if it opens the message, you know it was really from me.

No more phishing!
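
The steps above can be sketched with textbook RSA, which is one well-known public key scheme; the talk does not say which scheme is intended. The primes here are tiny and the message is just a number, so this is an illustration only and offers no security at all.

```python
# Toy key pair built from tiny primes: illustration only, never use in practice.
p, q = 61, 53
n = p * q                           # modulus, shared by both keys
e = 17                              # public exponent
d = pow(e, -1, (p - 1) * (q - 1))   # private exponent (needs Python 3.8+)

def lock(message: int, exponent: int) -> int:
    """'Lock' (or unlock) a number smaller than n with one of the keys."""
    return pow(message, exponent, n)

message = 1234                      # a message, encoded as a number below n
locked = lock(message, d)           # I lock it with my private key
opened = lock(locked, e)            # you open it with my public key
print(opened == message)            # True
```

Locking with the public key and opening with the private key works the same way, which is the "either key" property listed earlier.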

Registering with a site

You want to create an account at an online shop. You click on the "Create an account" button.

The site asks you for a username, which you provide, and click OK.

The site checks the username is free, and returns a message to your browser that you want to register.

The browser asks if you want to register.

If you click yes, the browser and the site exchange public keys.

Logging in to a website

You fill in your username, and click on Log in.

The site generates a one-time password, locks it with your public key, and its own private key, and returns it to the browser.

Your browser knows it is really from the site (and not someone pretending).

The browser asks if you really want to log in with that username.

If you say yes, it decodes the password and sends it back, this time locked with your private key and the site's public key.

The site therefore knows it is really you and lets you in, without you having to type in a password!
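
The exchange above can be walked through with a toy sketch. It uses textbook RSA with tiny primes, purely as an assumption for illustration: the key sizes offer no security, and the names make_keys and lock are invented here. The site's modulus is deliberately the larger one, so that a value locked with a user key still fits under the site key.

```python
import secrets

def make_keys(p: int, q: int, e: int = 17):
    """Toy RSA key pair from two tiny primes, each key as (exponent, modulus)."""
    n = p * q
    d = pow(e, -1, (p - 1) * (q - 1))  # needs Python 3.8+
    return (e, n), (d, n)              # (public, private)

def lock(value: int, key) -> int:
    """Lock or unlock a number with one key."""
    exponent, modulus = key
    return pow(value, exponent, modulus)

# Hypothetical keys, exchanged at registration.
user_public, user_private = make_keys(61, 53)  # modulus 3233
site_public, site_private = make_keys(89, 97)  # modulus 8633

# Site: make a one-time password, lock it with the user's public key,
# then with its own private key.
one_time = secrets.randbelow(3233 - 2) + 2
challenge = lock(lock(one_time, user_public), site_private)

# Browser: open with the site's public key (so it came from the site),
# then with the user's private key.
recovered = lock(lock(challenge, site_public), user_private)

# Browser: send it back, locked with the user's private key
# and the site's public key.
response = lock(lock(recovered, user_private), site_public)

# Site: open with its own private key, then the user's public key.
confirmed = lock(lock(response, site_private), user_public)
print(confirmed == one_time)  # True
```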

Conclusion

The password situation is a disaster; it can still be fixed in the protocol!

URLs are useful, and have been durable.

They exhibit some design faults, which is easy to say in hindsight.

Those faults are lessons we can take on board for future designs of notations.
