Author: Joost Geurts
Date: 3-5-2004

Annotation Template and Document Structure

One of the lessons learned mentioned in the iswc2004 paper was the need for an annotation template which describes the annotations. This need arose from the fact that the annotations in Media Streams (MS) were, in contrast to traditional semantic web applications, not instances of the MS ontology. Instead the annotations mirrored the data structure of the original MS application. This structure consists of:

- TIMELINE, which contains information about the video as a whole (title, file etc.).
- OCCURRENCE, which indicates the interval of an annotation (it has a start and end frame).
- COMPOUND, a container, associated with a stream, which holds the descriptions that make up an annotation (e.g. a character compound).
- SLOT; a compound has slots, and a slot has a name and a value. The name of a slot can be a literal or a reference to a CIDI (a term in the ontology).
- CIDI, a term from the ontology.

I described this structure using a simple RDFS schema, and although it captures the basic MS structure there were things I could not express conveniently. (A hedged sketch of what such an annotation looks like in RDF is given after this discussion.) The problems mostly resulted from ambiguous annotations that were already present in the original MS application. For example, MS describes actions by defining special slots SUBJECT, ACTION, OBJECT, START-POINT, END-POINT etc. One example is "SUBJECT: door, ACTION: open", which denotes "door opens". An alternative description, however, can be "SUBJECT: man, ACTION: open, OBJECT: door", denoting "a man opens the door". These annotations might have slightly different meanings, but what they are exactly is impossible to tell intuitively. Part of the problem is inherent to annotating and has to do with the subjectivity of the annotator and how he or she sees the event. However, to make sense of annotations in a generative environment these ambiguities should be avoided wherever possible.

Part of a solution is a controlled annotation environment where the annotations are entered in a particular pre-defined order: for example, first SUBJECT, then ACTION, then OBJECT, analogous to a simple subject-verb-object sentence. On top of that come simple constraints stating that there should always be a SUBJECT and an ACTION and optionally an OBJECT, that if there is a START-POINT there should also be an END-POINT, etc. (a sketch of such constraints in XML Schema follows below). In addition to syntactic constraints there are also semantic constraints which could, for example, prevent bogus annotations like "car walks". Note that this is very much domain and application specific, since there are numerous cartoons showing walking cars. Nevertheless, within a certain domain such semantic constraints are a way of regulating your annotations.

Such constraint rules are very much like grammars, focusing specifically on order. RDF, however, is graph based, and order is therefore absent from its inherent structure. Languages such as XML Schema are, in contrast, more natural for expressing these rules, since they are specifically designed to express grammars. A similar conclusion, the need for an XML Schema-like language, was reached by Raphael (Raphael in fact needs an even more expressive, constraint-based language because of temporal reasoning). He approached it from a different angle, though, based on document structure. Document structure in his sense is the structure of a video document, which consists of scenes, shots and frames. On a higher level, genres such as interview, news, soap etc. add semantics on top of the structure of a document.
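To make the MS structure above concrete, below is a minimal sketch of what a single annotation could look like in RDF/XML under the simple RDFS schema mentioned above. The namespace and the property names (ms:hasOccurrence, ms:hasSlot etc.) are made up for illustration and are not the actual schema; the example encodes the "door opens" annotation.

  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:ms="http://example.org/ms#">    <!-- hypothetical namespace -->
    <ms:Timeline rdf:about="#video1">            <!-- the video as a whole -->
      <ms:title>Example video</ms:title>
      <ms:hasOccurrence>
        <ms:Occurrence>                          <!-- interval of the annotation -->
          <ms:startFrame>120</ms:startFrame>
          <ms:endFrame>240</ms:endFrame>
          <ms:hasCompound>
            <ms:Compound>                        <!-- container for the description -->
              <ms:hasSlot>
                <ms:Slot>
                  <ms:slotName>SUBJECT</ms:slotName>
                  <!-- slot value pointing to a CIDI term (assumed here) -->
                  <ms:slotValue rdf:resource="http://example.org/cidi#door"/>
                </ms:Slot>
              </ms:hasSlot>
              <ms:hasSlot>
                <ms:Slot>
                  <ms:slotName>ACTION</ms:slotName>
                  <ms:slotValue rdf:resource="http://example.org/cidi#open"/>
                </ms:Slot>
              </ms:hasSlot>
            </ms:Compound>
          </ms:hasCompound>
        </ms:Occurrence>
      </ms:hasOccurrence>
    </ms:Timeline>
  </rdf:RDF>

Nothing in this graph says that SUBJECT comes before ACTION, which is exactly the ordering problem discussed above.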
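The syntactic constraints sketched above (a SUBJECT and an ACTION are required, an OBJECT is optional, and a START-POINT only makes sense together with an END-POINT) are exactly the kind of thing a grammar-based language states directly. A minimal, hypothetical XML Schema sketch, with element names chosen here purely for illustration, could look like this:

  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="action-annotation">
      <xs:complexType>
        <xs:sequence>
          <!-- fixed order: SUBJECT, then ACTION, then (optionally) OBJECT -->
          <xs:element name="SUBJECT" type="xs:string"/>
          <xs:element name="ACTION"  type="xs:string"/>
          <xs:element name="OBJECT"  type="xs:string" minOccurs="0"/>
          <!-- START-POINT and END-POINT occur together or not at all -->
          <xs:sequence minOccurs="0">
            <xs:element name="START-POINT" type="xs:string"/>
            <xs:element name="END-POINT"   type="xs:string"/>
          </xs:sequence>
        </xs:sequence>
      </xs:complexType>
    </xs:element>
  </xs:schema>

The semantic constraints (no "car walks") are of course outside the reach of such a grammar and would still have to come from domain knowledge.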
This genre-level structure, however, is not inherent to the medium (video) itself but to the content; typically there are alternative descriptions which fit a genre (e.g. a news item starts with an anchorman and optionally has some on-site reports, but not necessarily). Raphael claims these structures are more conveniently expressed by MPEG-7 descriptors, the reason being that MPEG-7 is based on XML Schema.

There are two types of video annotation described so far. (1) The Marc view, which almost completely ignores the structure of the original video, precisely because structure and context largely influence the perceived meaning of a video: since we would like to generate this structure and context ourselves, we do not want it baked into the annotations. This is very much content-oriented annotation. (2) For Raphael, structure and context are important, since for his work the video only exists as a whole. He can retrieve fragments, but the meaning of a fragment should not be altered, so he needs to include as much context as necessary to make sure the material is interpreted in the proper (original) way.

Note that these two types of annotation can perfectly well live together. In fact, in Jane's schemas-for-video-metadata-representation paper she describes a schema in which a video consists of sequences, which consist of scenes, which consist of shots etc., down to a level of "objects" which are part of a frame. One can argue that objects need to be described at multiple frame levels, but I believe MS will suffice to describe objects, or content if you wish. (A rough sketch of how the two could be combined is given at the end of this note.)

In conclusion, both document structure and content structure are more conveniently expressed through a grammar-like language which at least supports a notion of order. To what extent semantic web languages such as OWL can be used to describe such constraints as well, I cannot really say. Furthermore, is describing document structure and content structure something you should want to do in RDF at all? Alternatively, you can annotate your HTML document, whose document structure is defined by a schema (a DTD or XML Schema), with RDF statements; can we do the same for video documents (SMIL?), or do we really need the document structure in our reasoning process?

A related problem which keeps popping up in my mind is document structure versus presentation structure (described in a previous bluebook note, http://homepages.cwi.nl/~media/blue_book/iswc2003.txt). Document structure refers in that case to traditional document structures such as book, chapter, letter, section etc. There, part of the presentation structure can be expressed through (domain-independent) document structure. To present the parts which cannot be expressed that way, for example in a multimedia presentation which has only a shallow document structure, we need to fall back on domain knowledge (= content descriptions) to decide how to (re)present them. This is similar to Raphael's work in the sense that high-level narrative structures are conveyed through document structure/genre. However, such a schema is too shallow to also describe objects, and we need domain knowledge to do so. Marc's approach, in contrast, does this in a controlled way by having a fixed structure for content descriptions. To what extent can we describe content with generic schemas, and are these schemas like document structures and therefore easier to express in a language like XML Schema?
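As a rough illustration of how the two kinds of annotation could live together, the hypothetical XML skeleton below (element names are mine and do not follow Jane's schema literally) shows document structure down to the frame level, with the "object" level as the hook where an MS-style content description such as the RDF sketch above would attach:

  <video title="news item">
    <sequence>
      <scene>
        <shot>
          <frame number="120">
            <!-- the object level is where content-oriented (MS-style)
                 descriptions attach; everything above it is document
                 structure in Raphael's/Jane's sense -->
            <object id="anchorman"/>
          </frame>
        </shot>
      </scene>
    </sequence>
  </video>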