Author: Joost Geurts
Date: 3-5-2004

Annotation Template and Document Structure

One of the lessons learned mentioned in the iswc2004 paper was the need for an annotation template which describes the annotations. This need arose from the fact that the annotations in Media Streams (MS) were, in contrast to traditional semantic web applications, not instances of the MS ontology. Instead the annotations mirrored the data structure of the original MS application. This structure consists of:

- TIMELINE, which contains information about the video as a whole (title, file etc.).
- OCCURRENCE, which indicates the interval of an annotation (it has a start and end frame).
- COMPOUND, a container, associated with a stream, which holds the descriptions that make up an annotation (e.g. a character compound).
- SLOT; a compound has slots, and a slot has a name and a value. The name of a slot can be a literal or a reference to a CIDI (a term in the ontology).
- CIDI, a term from the ontology.

I described this structure using a simple RDFS schema, and although it captures the basic MS structure there were things I could not express conveniently. (A hedged sketch of what such an annotation looks like in RDF is given after this discussion.) The problems mostly resulted from ambiguous annotations that were already present in the original MS application. For example, MS describes actions by defining special slots SUBJECT, ACTION, OBJECT, START-POINT, END-POINT etc. One example is "SUBJECT: door, ACTION: open", which denotes "door opens". An alternative description, however, can be "SUBJECT: man, ACTION: open, OBJECT: door", denoting "a man opens the door". These annotations might have slightly different meanings, but what they are exactly is impossible to tell intuitively. Part of the problem is inherent to annotating and has to do with the subjectivity of the annotator and how he or she sees the event. However, to make sense of annotations in a generative environment these ambiguities should be avoided wherever possible.

Part of a solution is a controlled annotation environment where the annotations are entered in a particular pre-defined order: for example, first SUBJECT, then ACTION, then OBJECT, analogous to a simple subject-verb-object sentence. On top of that come simple constraints stating that there should always be a SUBJECT and an ACTION and optionally an OBJECT, that if there is a START-POINT there should also be an END-POINT, etc. (a sketch of such constraints in XML Schema follows below). In addition to syntactic constraints there are also semantic constraints which could, for example, prevent bogus annotations like "car walks". Note that this is very much domain and application specific, since there are numerous cartoons showing walking cars. Nevertheless, within a certain domain such semantic constraints are a way of regulating your annotations.

Such constraint rules are very much like grammars, focusing specifically on order. RDF, however, is graph based, and order is therefore absent from its inherent structure. Languages such as XML Schema are, in contrast, more natural for expressing these rules, since they are specifically designed to express grammars. A similar conclusion, the need for an XML Schema-like language, was reached by Raphael (Raphael in fact needs an even more expressive, constraint-based language because of temporal reasoning). He approached it from a different angle, though, based on document structure. Document structure in his sense is the structure of a video document, which consists of scenes, shots and frames. On a higher level, genres such as interview, news, soap etc. add semantics on top of the structure of a document.
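To make the MS structure above concrete, below is a minimal sketch of what a single annotation could look like in RDF/XML under the simple RDFS schema mentioned above. The namespace and the property names (ms:hasOccurrence, ms:hasSlot etc.) are made up for illustration and are not the actual schema; the example encodes the "door opens" annotation.

  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:ms="http://example.org/ms#">    <!-- hypothetical namespace -->
    <ms:Timeline rdf:about="#video1">            <!-- the video as a whole -->
      <ms:title>Example video</ms:title>
      <ms:hasOccurrence>
        <ms:Occurrence>                          <!-- interval of the annotation -->
          <ms:startFrame>120</ms:startFrame>
          <ms:endFrame>240</ms:endFrame>
          <ms:hasCompound>
            <ms:Compound>                        <!-- container for the description -->
              <ms:hasSlot>
                <ms:Slot>
                  <ms:slotName>SUBJECT</ms:slotName>
                  <!-- slot value pointing to a CIDI term (assumed here) -->
                  <ms:slotValue rdf:resource="http://example.org/cidi#door"/>
                </ms:Slot>
              </ms:hasSlot>
              <ms:hasSlot>
                <ms:Slot>
                  <ms:slotName>ACTION</ms:slotName>
                  <ms:slotValue rdf:resource="http://example.org/cidi#open"/>
                </ms:Slot>
              </ms:hasSlot>
            </ms:Compound>
          </ms:hasCompound>
        </ms:Occurrence>
      </ms:hasOccurrence>
    </ms:Timeline>
  </rdf:RDF>

Nothing in this graph says that SUBJECT comes before ACTION, which is exactly the ordering problem discussed above.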
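The syntactic constraints sketched above (a SUBJECT and an ACTION are required, an OBJECT is optional, and a START-POINT only makes sense together with an END-POINT) are exactly the kind of thing a grammar-based language states directly. A minimal, hypothetical XML Schema sketch, with element names chosen here purely for illustration, could look like this:

  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="action-annotation">
      <xs:complexType>
        <xs:sequence>
          <!-- fixed order: SUBJECT, then ACTION, then (optionally) OBJECT -->
          <xs:element name="SUBJECT" type="xs:string"/>
          <xs:element name="ACTION"  type="xs:string"/>
          <xs:element name="OBJECT"  type="xs:string" minOccurs="0"/>
          <!-- START-POINT and END-POINT occur together or not at all -->
          <xs:sequence minOccurs="0">
            <xs:element name="START-POINT" type="xs:string"/>
            <xs:element name="END-POINT"   type="xs:string"/>
          </xs:sequence>
        </xs:sequence>
      </xs:complexType>
    </xs:element>
  </xs:schema>

The semantic constraints (no "car walks") are of course outside the reach of such a grammar and would still have to come from domain knowledge.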
This genre-level structure, however, is not inherent to the medium (video) itself but to the content; typically there are alternative descriptions which fit a genre (e.g. a news item starts with an anchorman and optionally has some on-site reports, but not necessarily). Raphael claims these structures are more conveniently expressed by MPEG-7 descriptors, the reason being that MPEG-7 is based on XML Schema.

There are two types of video annotation described so far. (1) The Marc view, which almost completely ignores the structure of the original video, precisely because structure and context largely influence the perceived meaning of a video: since we would like to generate this structure and context ourselves, we do not want it baked into the annotations. This is very much content-oriented annotation. (2) For Raphael, structure and context are important, since for his work the video only exists as a whole. He can retrieve fragments, but the meaning of a fragment should not be altered, so he needs to include as much context as necessary to make sure the material is interpreted in the proper (original) way.

Note that these two types of annotation can perfectly well live together. In fact, in Jane's schemas-for-video-metadata-representation paper she describes a schema in which a video consists of sequences, which consist of scenes, which consist of shots etc., down to a level of "objects" which are part of a frame. One can argue that objects need to be described at multiple frame levels, but I believe MS will suffice to describe objects, or content if you wish. (A rough sketch of how the two could be combined is given at the end of this note.)

In conclusion, both document structure and content structure are more conveniently expressed through a grammar-like language which at least supports a notion of order. To what extent semantic web languages such as OWL can be used to describe such constraints as well, I cannot really say. Furthermore, is describing document structure and content structure something you should want to do in RDF at all? Alternatively, you can annotate your HTML document, whose document structure is defined by a schema (a DTD or XML Schema), with RDF statements; can we do the same for video documents (SMIL?), or do we really need the document structure in our reasoning process?

A related problem which keeps popping up in my mind is document structure versus presentation structure (described in a previous bluebook note, http://homepages.cwi.nl/~media/blue_book/iswc2003.txt). Document structure refers in that case to traditional document structures such as book, chapter, letter, section etc. There, part of the presentation structure can be expressed through (domain-independent) document structure. To present the parts which cannot be expressed that way, for example in a multimedia presentation which has only a shallow document structure, we need to fall back on domain knowledge (= content descriptions) to decide how to (re)present them. This is similar to Raphael's work in the sense that high-level narrative structures are conveyed through document structure/genre. However, such a schema is too shallow to also describe objects, and we need domain knowledge to do so. Marc's approach, in contrast, does this in a controlled way by having a fixed structure for content descriptions. To what extent can we describe content with generic schemas, and are these schemas like document structures and therefore easier to express in a language like XML Schema?
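As a rough illustration of how the two kinds of annotation could live together, the hypothetical XML skeleton below (element names are mine and do not follow Jane's schema literally) shows document structure down to the frame level, with the "object" level as the hook where an MS-style content description such as the RDF sketch above would attach:

  <video title="news item">
    <sequence>
      <scene>
        <shot>
          <frame number="120">
            <!-- the object level is where content-oriented (MS-style)
                 descriptions attach; everything above it is document
                 structure in Raphael's/Jane's sense -->
            <object id="anchorman"/>
          </frame>
        </shot>
      </scene>
    </sequence>
  </video>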