On the Descriptions of Data: The Usability of Notations

Steven Pemberton, CWI, Amsterdam

Background

I originally introduced ixml at Balisage 2013.

We choose which representation of our data to use (JSON, CSV, XML, or whatever) depending on habit, convenience, or the context in which we want to use that data.

On the other hand, having an interoperable generic toolchain such as that provided by XML to process data is of immense value.

How do we resolve the conflicting requirements of convenience, habit, and context, and still enable a generic toolchain?

Invisible XML (ixml) is a method for treating non-XML documents as if they were XML, enabling authors to write documents and data in a format they prefer while providing XML for processes that are more effective with XML content.

Review

"This is clearly a submission that needs to be shredded, burned, and the ashes buried in multiple locations"

"I think the audience will eat him alive. But I want to be there to hear it."

I lived to tell the tale

It was a proposal.

I did a pilot implementation.

I also have a background in usability, and applied usability principles.

The following talk resulted.

People

Different people have different psychologies.

This seems almost too obvious to be true, but it is surprising how many people don't properly understand it.

My favourite description of how people – particularly programmers – differ is in chapter 15 of Bruce Tognazzini's book Tog on Interface.

When Sensories drive to work, they are aware of the birds, the trees, the hills turning green. They notice a cow lowing in the field. [...]

Intuitives live in their own private universe, depending on an internal model of external events. [...]

When Intuitives drive to work, they watch the tectonic plates, deep in the earth's crust, rubbing together. They run into the cow.

HCI

The problem is that the people designing things are usually not the people who will be using those things, and they tend to design for themselves.

So... you have to use HCI techniques:

Usability

Usability is about designing things (software/programming languages/cookers) to allow people to do their work:

Efficient, Error-free, Enjoyable or
Fast, Faultless and Fun

Don't confuse usability with learnability: they are distinct and different.

Notations

No one really talks seriously about the usability of notations.

Notations

Notations affect what you can do with them.

For instance, Roman numerals:

Hypothesis: Programmers are human too

Imagine, hypothetically, that programmers are humans...despite all evidence to the contrary:

Also pretend, just for a moment, that their chief method of communicating with a computer was with programming languages.

What should you do?

ABC

We designed a programming language: ABC

We used the HCI principles:

ABC: results

ABC went on to form the basis of Python.

Another notation: 2 letter US state codes

Looking at

it appears that there is no real rule.

Apparently:

Codes

NE: Nevada or Nebraska?

It's Nebraska, but NB would have been a better choice

MI: Mississippi, Missouri, Michigan, or Minnesota?

It's Michigan, but MG would have been a better choice

MS: Mississippi, Missouri, or Minnesota?

It's Mississippi, but MP would have been a better choice.
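
The ambiguity can be made concrete with a small sketch. It assumes a deliberately simple model of how a reader guesses: a code could stand for a state if its first letter is the state's first letter and its second letter occurs somewhere later in the name. The function names and the six-state list are illustrative only; this is not the program mentioned in this talk.

```python
# Hypothetical reading model: a 2-letter code "could be" a state if the first
# letters match and the second letter occurs later in the state's name.
def could_abbreviate(code: str, name: str) -> bool:
    code, name = code.upper(), name.upper()
    return len(code) == 2 and name[0] == code[0] and code[1] in name[1:]


# Only the six states named above, not the full list of fifty.
STATES = ["Nevada", "Nebraska", "Mississippi", "Missouri", "Michigan", "Minnesota"]


def candidates(code: str) -> list[str]:
    """All states in STATES that a reader might take `code` to mean."""
    return [state for state in STATES if could_abbreviate(code, state)]


for code in ["NE", "NB", "MI", "MG", "MS", "MP"]:
    print(code, candidates(code))
```

Under this model the official codes NE, MI and MS each match several of these states, while NB, MG and MP each match exactly one.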

Active and Passive

But solving these problems with reading 2-letter codes would still not solve the problem of writing them.

Winter-school was open in December

Water is warm

Even if we could solve the problems of recognising a state code, it still wouldn't help you with remembering them.

Doing it better

I couldn't believe it wasn't possible to do the 2-letter codes better. So I wrote a program.

The simplest rule I came up with:

It can be done!

The point

My point here is that the 2-letter codes were introduced for automation.

The solution they chose was technically sufficient.

That is no excuse for ignoring the needs of people.

Invisible XML

A method for treating any context-free parsable document as XML.

Example: Expression

The input

pi×(10+b)

can result in the XML

<prod>
   <id>pi</id>
   <sum>
      <number>10</number>
      <id>b</id>
   </sum>
</prod>

or

<prod>
   <id name='pi'/>
   <sum>
      <number value='10'/>
      <id name='b'/>
   </sum>
</prod>

Example: URL

The input

http://www.w3.org/TR/1999/xhtml.html

can give

<url>
   <scheme name='http'/>
   <authority>
      <host>
         <sub name='www'/>
         <sub name='w3'/>
         <sub name='org'/>
      </host>
   </authority>
   <path>
      <seg sname='TR'/>
      <seg sname='1999'/>
      <seg sname='xhtml.html'/>
   </path>
</url>

Example: JSON

{"name": "pi", "value": 3.1415926}

can give

<json>
   <object>
      <pair string='name'>
         <string>pi</string>
      </pair>
      <pair string='value'>
         <number>3.1415926</number>
      </pair>
   </object>
</json>

Example: XML :-)

<test lang="en" class="test">
  This <em>is</em> a test.
</test>

gave

<xml>
   <element name='test' close='test'>
      <attribute name='lang' value='en'/>
      <attribute name='class' value='test'/>
      <content>  This 
         <element name='em' close='em'>
            <content>is</content>
         </element> a test.</content>
   </element>
</xml>

Why?

Getting all sorts of other stuff into XForms

Possibly: Creating a non-XML version of XForms.

Already used in at least one Dutch Government project

Grammars

ixml works by describing the document to be treated in a grammar:

expr: term; sum.
sum: expr, "+", term.
term: factor; prod.
prod: term, "×", factor.
factor: id; number; "(", expr, ")".
id: letter+.
number: digit+.
letter: ["a"-"z"].
digit: ["0"-"9"].

(This is the notation we are interested in).
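
To show what "parsing a document against this grammar" produces, here is a minimal hand-written sketch for this one grammar only. It is not a general ixml processor: the left-recursive rules for sum and prod are handled with loops, and the tree representation (a `(name, children)` tuple) and the function names are assumptions made for illustration.

```python
def parse(text):
    """Parse `text` with the expression grammar; return the full parse tree.

    A node is a (name, children) tuple; each child is a node or a literal string.
    """
    pos = 0

    def peek():
        return text[pos] if pos < len(text) else ""

    def take():
        nonlocal pos
        ch = peek()
        pos += 1
        return ch

    def is_letter(ch):
        return "a" <= ch <= "z"

    def is_digit(ch):
        return "0" <= ch <= "9"

    def run(name, item, allowed):
        # id: letter+.   number: digit+.
        items = []
        while allowed(peek()):
            items.append((item, [take()]))
        return (name, items)

    def factor():
        # factor: id; number; "(", expr, ")".
        ch = peek()
        if ch == "(":
            take()
            inner = expr()
            if take() != ")":
                raise SyntaxError("expected ')'")
            return ("factor", ["(", inner, ")"])
        if is_letter(ch):
            return ("factor", [run("id", "letter", is_letter)])
        if is_digit(ch):
            return ("factor", [run("number", "digit", is_digit)])
        raise SyntaxError(f"unexpected {ch!r} at position {pos}")

    def term():
        # term: factor; prod.   prod: term, "×", factor.
        node = ("term", [factor()])
        while peek() == "×":
            node = ("term", [("prod", [node, take(), factor()])])
        return node

    def expr():
        # expr: term; sum.   sum: expr, "+", term.
        node = ("expr", [term()])
        while peek() == "+":
            node = ("expr", [("sum", [node, take(), term()])])
        return node

    tree = expr()
    if pos != len(text):
        raise SyntaxError(f"unexpected {peek()!r} at position {pos}")
    return tree


def to_xml(node):
    """Serialise a parse tree without added whitespace (no escaping needed here)."""
    if isinstance(node, str):
        return node
    name, children = node
    return f"<{name}>" + "".join(to_xml(child) for child in children) + f"</{name}>"


print(to_xml(parse("pi×(10+b)")))
```

Every nonterminal becomes an element and every literal is kept as text, so the output is the unpruned parse tree.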

Initial design

In the initial design, the document was parsed to a parse-tree, and then the parse-tree was serialised to XML, using marks that you added to the grammar definition rules:

expr: term; ^sum.
sum: expr, "+", term.
term: factor; ^prod.
prod: term, "×", factor.
factor: ^id; ^number; "(", expr, ")".
id: letter+.
number: digit+.
letter: ^["a"-"z"].
digit: ^["0"-"9"].

Improvements

After user-testing we identified a number of changes that could be made to make ixml more usable.

Improvements: reduce not build

It is easier to design the data description by starting from the full parse tree, and incrementally pruning the parts that are not needed.

Improvements: prune whole rules

Many nonterminals are not needed in the final serialisation at all, and it is more sensible to prune these at the definition rather than at the point of use.

-expr: term; sum.

Improvements: point of use has priority

Occasionally you want to prune all uses of a nonterminal but one, so it is useful to be able to mark a definition as deleted, but mark it as inserted at a use-point.

-expr: term; sum.
 ...
factor: id; number; "(", ^expr, ")".

Improvements: characters

There are occasions where you need to say "any character except those in this list is acceptable at this position". (As a consequence, a notation for character sets became necessary, something that had been rejected in the initial design.)

string: '"', ~["]*, '"'.

Improvements: options

It is useful to have an explicit notation for something that is optional.

number: sign?, digit+.

Improvements: character classes

It is useful to be able to use Unicode character classes.

letter: [lc].

Example: pi×(10+b)

<expr>
   <term>
      <prod>
         <term>
            <factor>
               <id>
                  <letter>p</letter>
                  <letter>i</letter>
               </id>
            </factor>
         </term>×
         <factor>(
            <expr>
               <sum>
                  <expr>
                     <term>
                        <factor>
                           <number>
                              <digit>1</digit>
                              <digit>0</digit>
                           </number>
                        </factor>
                     </term>
                  </expr>+
                  <term>
                     <factor>
                        <id>
                           <letter>b</letter>
                        </id>
                     </factor>
                  </term>
               </sum>
            </expr>)
         </factor>
      </prod>
   </term>
</expr>

Remove term and factor:

expr: term; sum.
sum: expr, "+", term.
-term: factor; prod.
prod: term, "×", factor.
-factor: id; number; "(", expr, ")".
id: letter+.
number: digit+.
letter: ["a"-"z"].
digit: ["0"-"9"].

This removes the element from the serialisation, but not its children.
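
As a rough sketch of what "removing the element but not its children" means, the function below splices the children of every hidden element into its parent. The tuple representation of the tree and the name `prune` are assumptions for illustration, not part of ixml.

```python
def prune(node, hidden):
    """Return a list of nodes: `node` with every element named in `hidden`
    replaced by its (recursively pruned) children."""
    if isinstance(node, str):          # literal text is always kept
        return [node]
    name, children = node
    kept = [item for child in children for item in prune(child, hidden)]
    return kept if name in hidden else [(name, kept)]


# Full parse tree for the input "b", as (name, children) tuples.
tree = ("expr", [("term", [("factor", [("id", [("letter", ["b"])])])])])

print(prune(tree, {"term", "factor"}))
# [('expr', [('id', [('letter', ['b'])])])]
```

Because the function returns a list, hiding the root element simply yields its children, which mirrors how hiding expr later leaves prod as the outermost element.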

Result

<expr>
   <prod>
      <id>
         <letter>p</letter>
         <letter>i</letter>
      </id>×(
      <expr>
         <sum>
            <expr>
               <number>
                  <digit>1</digit>
                  <digit>0</digit>
               </number>
            </expr>+
            <id>
               <letter>b</letter>
            </id>
         </sum>
      </expr>)
   </prod>
</expr>

Remove letter and digit

expr: term; sum.
sum: expr, "+", term.
-term: factor; prod.
prod: term, "×", factor.
-factor: id; number; "(", expr, ")".
id: letter+.
number: digit+.
-letter: ["a"-"z"].
-digit: ["0"-"9"].

Result

<expr>
   <prod>
      <id>pi</id>×(
      <expr>
         <sum>
            <expr>
               <number>10</number>
            </expr>+
            <id>b</id>
         </sum>
      </expr>)
   </prod>
</expr>

Remove expr

-expr: term; sum.
sum: expr, "+", term.
-term: factor; prod.
prod: term, "×", factor.
-factor: id; number; "(", expr, ")".
id: letter+.
number: digit+.
-letter: ["a"-"z"].
-digit: ["0"-"9"].

Result

<prod>
   <id>pi</id>×(
   <sum>
      <number>10</number>+
      <id>b</id>
   </sum>)
</prod>

You can delete the extraneous characters if you wish:

sum: expr, -"+", term.
prod: term, -"×", factor.
-factor: id; number; -"(", expr, -")".

Adding attributes

Changing

id: letter+.
number: digit+.

to

id: @name.
name: letter+.
number: @value.
value: digit+.

or

id: name.
@name: letter+.
number: value.
@value: digit+.

gives

<prod>
   <id name='pi'/>
   <sum>
      <number value='10'/>
      <id name='b'/>
   </sum>
</prod>
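
One way to picture the effect of the `@` mark is a serialiser that writes marked children as attributes of their parent instead of as child elements. The sketch below assumes a `(name, children)` tuple tree and a set of attribute names; both are illustrative choices, and no escaping of attribute values is attempted.

```python
def text_of(node):
    """Concatenate all literal text below a node."""
    if isinstance(node, str):
        return node
    return "".join(text_of(child) for child in node[1])


def to_xml(node, attributes):
    """Serialise `node`; children named in `attributes` become XML attributes."""
    name, children = node
    attrs, body = "", ""
    for child in children:
        if isinstance(child, str):
            body += child
        elif child[0] in attributes:
            attrs += f" {child[0]}='{text_of(child)}'"
        else:
            body += to_xml(child, attributes)
    return f"<{name}{attrs}>{body}</{name}>" if body else f"<{name}{attrs}/>"


# Pruned tree for pi×(10+b), with name and value still present as nodes.
tree = ("prod", [
    ("id", [("name", ["pi"])]),
    ("sum", [
        ("number", [("value", ["10"])]),
        ("id", [("name", ["b"])]),
    ]),
])

print(to_xml(tree, {"name", "value"}))
# <prod><id name='pi'/><sum><number value='10'/><id name='b'/></sum></prod>
```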

Result

-expr: term; sum.
sum: expr, -"+", term.
-term: factor; prod.
prod: term, -"×", factor.
-factor: id; number; -"(", expr, -")".
id: @name.
name: ["a"-"z"]+.
number: @value.
value: ["0"-"9"]+.

Data really wants to be format-neutral

Strictly speaking, there is no reason why the parse tree has to be serialised as XML; it could equally well be serialised in some other form, such as JSON.

Serialising to JSON

With

<expr>
    <prod>
        <letter>a</letter>
        <sum>
            <digit>3</digit>
            <letter>b</letter>
        </sum>
    </prod>
</expr>

You might be tempted to say:

{"expr":
    {"prod": 
        {"letter": "a",
         "sum": {"digit":"3", "letter":"b"}
        }
    }
}

But JSON object members are more like XML attributes than child elements:

Serialising to JSON

The solution is to use arrays and single-member objects:

{"expr":
    [{"prod":
        [{"letter": "a"},
         {"sum":
            [{"digit": "3"},
             {"letter": "b"}]
         }]
    }]
}
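
The arrays-and-single-member-objects idea can be sketched as a short recursive function. The `(name, children)` tuple tree and the name `to_json` are assumptions for illustration; an element containing only text becomes `{name: text}`, and any other element becomes `{name: [children...]}`, so order and repeated names are preserved.

```python
import json


def to_json(node):
    """Map a parse-tree node to single-member objects and arrays."""
    if isinstance(node, str):
        return node
    name, children = node
    if len(children) == 1 and isinstance(children[0], str):
        return {name: children[0]}
    return {name: [to_json(child) for child in children]}


tree = ("expr", [
    ("prod", [
        ("letter", ["a"]),
        ("sum", [("digit", ["3"]), ("letter", ["b"])]),
    ]),
])

print(json.dumps(to_json(tree)))
# {"expr": [{"prod": [{"letter": "a"}, {"sum": [{"digit": "3"}, {"letter": "b"}]}]}]}
```

Two sibling elements with the same name, which would collide as members of one JSON object, simply become two entries in the array.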

Conclusion

If a notation is to be human-facing, then it is not enough to make it functionally sufficient.

HCI techniques, although usually applied to interaction, are also applicable to make notations more usable for the people using them.