date: 17-1-2003
author: Joost
description: This is a braindump after working for about two months on investigating the possibilities of extending the semantic inferencing as described in the ecdl paper. The conclusion must be that there are currently too many technical problems which need to be overcome before this aspect can become of interest. A conceptual model is presented, mainly for personal use, to get a grip on what still needs to be done. Some ideas for the semanticweb paper.

WARNING! THIS IS NOT A COHERENT STORY

Semantic inferencing

The Cuypers automatic multimedia presentation generator serves authors by allowing a 'high level' specification of a presentation. This means authors define relations between media items stating their role within the presentation. These relations can then be used to adapt the presentation to the specific circumstances in which it is played. These circumstances include hardware constraints such as screen size and bandwidth, but also user constraints such as expertise and interest. The gain for authors is that they need to author only a single presentation instead of creating one for every situation.

The flexibility mentioned above still assumes an author with a message which needs to be conveyed across different users and platforms. In some cases, though, the collection of data is too large and addresses too many topics to create individual presentations. For example, museum repositories store large amounts of information about their artefacts; the number of possible relations within these is virtually unlimited. Other examples include OAI, which defines a uniform interface to expose metadata from different repositories. Combining these resources provides a valuable information source; exposing the hidden information, however, is a time consuming task. For these cases one can take the metadata associated with the media items, use it to find some of these hidden relationships, and use those to automatically generate a multimedia presentation.

The ecdl paper describes the idea of using Dublin Core metadata to infer some of these relations. Although the demo proves that the presented ideas work for a tailored repository, it is difficult to scale them up to be more generally applicable. This has a number of reasons:

- The inference rules used are tailored to a people domain and, to some extent, to artists. The assumption is that the repository contains media items about people and the works they have created. The metadata can then be used to infer relations between, for example, a creator and the objects created, or to infer that two or more people were colleagues if they are creators of a single artefact. Although the assumption is already quite restrictive in its application, the inferred relations can still be in error. For example, a scanned picture of a painting can have four creators: the painter, the photographer, the person who scanned it and the person who added the object to the repository; these can hardly be considered to be colleagues.

- The values of Dublin Core elements are free text strings. There are no schemas which specify in what format data should be entered. As a result there are a dozen different ways to represent similar information. For example "Abraham Lincoln", "Lincoln, Abraham", "Lincoln" and "Abraham Lincoln (1809-1865)" all refer to the same person. The rules which are used to infer relationships are based upon string matching. In order to make "Lincoln, Abraham" equivalent to "Abraham Lincoln" the string matching algorithm needs to be quite 'loose'. As a result the chance of making a wrong inference is quite large.
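To make the previous point concrete, a rough sketch of the kind of 'loose' matching involved (illustrative only: the normalisation steps and the 0.85 threshold are made up here and are not the parameters of the actual python script):

    # Illustrative sketch of 'loose' string matching on creator/subject values.
    from difflib import SequenceMatcher

    def normalize(name):
        # "Lincoln, Abraham (1809-1865)" -> "abraham lincoln"
        name = name.split("(")[0].strip().lower()
        if "," in name:
            last, first = [part.strip() for part in name.split(",", 1)]
            name = first + " " + last
        return name

    def same_entity(a, b, threshold=0.85):
        # the threshold plays the role of the 'percentage of allowed difference'
        # parameter of the python script mentioned further down in these notes
        return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

    print(same_entity("Lincoln, Abraham", "Abraham Lincoln (1809-1865)"))  # True
    print(same_entity("Lincoln", "Abraham Lincoln"))  # False; a looser threshold would
                                                      # catch this but also produce false matches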
[[ add stuff about related works such as GATE ]]

- The Dublin Core metadata scheme is used to describe digital objects. The scheme however leaves the annotator a high degree of freedom in filling in the fields, and as a result a number of ambiguities can occur which may lead to unexpected results when doing inferences on them. Examples of these ambiguities include:
  + Ambiguity on the object description level. Within Dublin Core it is not clear at what level the media object is being described. A photograph or a painting of a building can both be the intended object of description, but which is meant is not made clear by the Dublin Core scheme.
  + The Dublin Core element values allow free text strings. The fields are primarily intended to be interpreted by human beings, so the level of ambiguity is comparable to the ambiguities in natural language. For example the Dublin Core element subject with the value "Lincoln" can be about a president, a city or even sheep.
  + DC describes a media item in terms of strings, while inferring relationships results in creating relations between media items. Sometimes this carries the implicit assumption that the media item in some way represents the concept to which you are relating. For example the creator relationship states that a media item is the creator of another media item, which can cause some confusion, since the concept represented by the object is supposed to be the creator, not the media object itself. In contrast, versionOf, precedes and in some cases shareContext do relate the media objects themselves and not the concepts they represent. In addition the topic of a presentation is often not represented by a media item but relates more to a concept.

- Searching in a Dublin Core repository results in a set of objects which have no immediate and obvious relationship except for the fact that they share a common keyword. After applying the inference rules there may be a few more relationships which tell us something about the content of the media and its relationships with other content. Still, a mapping needs to be made from relations to presentation patterns. Presentation, however, imposes requirements not only on the content of the media but also on the type of media. It might be advisable not to have two videos play synchronously, for example. This kind of information can (at least in theory) be obtained from the Dublin Core fields, so such situations can be prevented. Information about the functions a media item can fulfill is harder to determine. For example, to find out whether a piece of text is suitable to accompany a couple of images requires the text to have specific features, such as being of a higher conceptual level than the images [[lynda: covered by mpeg7]], and whether it explains, gives an overview of, or is an example of the topic. In addition the topic of the text should be bounded by the examples. Similar examples can be found for other types of media. It needs to be made clear in what context media items can be used within a presentation. This information is insufficiently described within the Dublin Core metadata.

- Currently the inference rules mentioned in the ecdl paper are implemented in a python script. This has some serious disadvantages:
  + A general problem is that the number of tests needed to check whether a relationship holds becomes intractable for a large number of media items (n): r * n(n+1)/2 tests, with r the number of rules. With, say, 1000 media items and 10 rules this is already about 5 million tests. In addition the metadata is stored in a mysql database.
Whenever a value is needed within a rule this results in a database access. Since a typical rule (e.g. created) uses 5 values, the overhead becomes an issue with a large number of media items.
  + The rules are not easily extensible and adaptable, since they are programmed in python and need to be compiled to java in order to use them within the cuypers-tomcat environment.
  + The inferred relationships are lost after the presentation has been generated. Besides being inefficient from a computational point of view, this also prevents a second inference iteration. For example, when one person has created works with two other people, those two people have a mutual acquaintance, or at least lived in the same period of time.

- From graph to tree. Relations are organized in a graph structure; a multimedia presentation is organized as a tree (or multiple trees). Therefore a transformation from a graph to a tree needs to be made. An important issue then is to determine the central question, or main focus, of the presentation. This becomes the root of the tree; all other nodes are related with respect to the root node. There are several methods to determine a central node. For example, the node which has the most relationships might qualify. This however will often result in a node which has a lot of shareContext relations. In practice shareContext behaves as a transitive relation, so all nodes related by shareContext have the same number of relations, and the node which happens to have one additional relation gets chosen as the central node. This does not lead to the desired result, since a shareContext relation doesn't tell you much about the importance of the media item; it may well be that a describes relation would be more suitable to supply the central node. It turns out that it is hard to define a satisfying heuristic to choose a central node. In semin we attempted to solve this by introducing an artificial media item which served as the title of the presentation. The content of the title was the keywords of the search. All media items have a relation to the title and therefore the title is always chosen as the central node. Although the relationships could be different, the resulting graph is 'star' shaped (one node with a lot of outgoing relations). This structure is just as hard or even harder to transform into a presentation structure, so the idea was dropped.

Because of the sparse data we use in our test environment and the experience that some relations rarely occur, we could define the presentation structure in terms of possible occurrences of relations. Combined with simple grammar rules which define the 'narrative', this leads to the generation of simple presentations:

    presentation -> intro content end ; intro content ; content end ; content
    intro        -> SHARE_CONTEXT([A,B..])
    content      -> DESCRIBES(A,[B,C..]) ; SHARE_CONTEXT([A,B..])
    end          -> list of creators retrieved from metadata

[[lynda: relate to stefano's rules]]

Some media items can have multiple relationships which are used in different contexts during the presentation. For example the introduction of a topic can be presented by a quick collage of images which have a shareContext relation; the same image can also be used later in the presentation to clarify a concept or to serve as an example. In contrast, a piece of text should in principle be used only once, which means that some relations might be ignored.
The rules specified above are ordered; to satisfy the rule for 'presentation', first 'intro content end' is tried, then 'intro content', etc. The rules take a set of relations as input and return a (possibly reduced) set of relations. The first rule for 'content', for example, queries for describes(A,B) relations. If it succeeds it removes all relations which have node A in them, since A is a text. In contrast, the 'intro' rule leaves the relations as they are, so they can be reused. Since the 'content' part of the presentation can be considered the most important part, this rule is executed before the 'intro' and 'end' rules; this ensures the 'content' can still use all available relations. The result of applying a grammar rule will, in addition to a (possibly) reduced set of relations, be a 'presentation device' stating in what way the information should be presented. Examples of these include sequence, slide show, nbox and collage. Later on these are mapped to formatting objects. An example of such a grammar rule (remove_used/3, which drops the relations mentioning the media that have just been consumed, is left out):

    content(RelationsIn, RelationsOut, nbox(MediaA, slideshow(MediaList))) :-
        findall(M, member(describes(MediaA, M), RelationsIn), MediaList),
        MediaList \= [],
        remove_used([MediaA|MediaList], RelationsIn, RelationsOut).

- The possible relations two media items can have are to some extent hierarchically organized. For example, when two media items have a describes relationship they often have a shareContext relationship as well (not always, though). Some rules infer more detailed relations than others, and those relations are therefore scarcer. One could argue that these relationships are more important and that they should be presented in favour of other, less detailed relations. Although this is true in the case of a shareContext and a describes relation, in general it is more subtle. For example, when two objects share both a shareContext relation and a colleagueOf relation, the shareContext relationship might be preferred: the colleagueOf relation could be too far off-topic to deserve a separate presentation construct, as illustrated by the following example:

    a describes b,c,d
    b,c,d all share context
    c colleagueOf d
    -> present a next to a slideshow of b,c,d (ignoring c colleagueOf d)

For the limited number of rules as they are now implemented in the semin demo the possible outcomes are still predictable. However, when more rules and relations are added the results are no longer predictable. In addition the rules are not manageable.

-> Inference rules need to be formally specified and have formal semantics. Properties of the relationships are currently implicit:
  - shareContext is symmetric and transitive
  - depicts and describes are inverses
  - precedes and follows imply versionOf
  - etc...

You need to relate concepts. Media items are not concepts, but they can represent concepts.

OAI in RDF & Sesame

Sesame is a storage and inferencing engine for RDF documents. It can serve to some extent the same functionality as the python script, with some improvements: performance, and the use of 'standard' formats such as RDF and RQL. The possibility of combining other RDF knowledge resources made it worthwhile to convert the database to RDF format. Sesame allows the assertion of new facts: once a relation has been inferred it can be added to the repository and be used in future retrievals. This means the inference rules only need to be applied when the repository changes, in contrast to the python script, which did the inferencing every time a presentation was generated. In practice the number of additions is smaller than the number of presentation generations, and therefore performance increased. In addition the inferencing process within Sesame can be scheduled at a convenient time, while the python script requires the user to wait while it is inferencing.
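A minimal sketch of this query-then-assert cycle, using rdflib and SPARQL purely as stand-ins for Sesame and RQL (the shareContext property, the file name and the rule itself are illustrative, not the actual semin vocabulary):

    # Illustrative only: infer shareContext from a shared dc:subject and dc:date,
    # then assert the new triples back into the store so that later retrievals can
    # use them without re-running the rule.
    from rdflib import Graph, Namespace

    DC = Namespace("http://purl.org/dc/elements/1.1/")
    SC = Namespace("http://example.org/semin#")      # hypothetical vocabulary

    g = Graph()
    g.parse("repository.rdf")                        # assumed RDF dump of the OAI data

    q = """
    SELECT ?o1 ?o2 WHERE {
        ?o1 dc:subject ?s . ?o2 dc:subject ?s .      # exact string match; see the
        ?o1 dc:date ?d .    ?o2 dc:date ?d .         # 'loose' matching issue above
        FILTER (?o1 != ?o2)
    }"""
    for o1, o2 in g.query(q, initNs={"dc": DC}):
        g.add((o1, SC.shareContext, o2))             # assert the inferred relation

    g.serialize("repository.rdf", format="xml")      # persist, so the rule only needs
                                                     # to run again when the data changes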
The format of the RQL queries is intuitive and maps almost one to one onto the inferencing rules described in the ecdl paper. There are however some differences which need to be addressed.

[[ add stuff about why rdf provides more flexibility than the current semin implementation ]]

- RQL only retrieves data; it cannot make the implication and assert the inferred knowledge. Currently the assertion is done in Prolog: first an RQL query retrieves the items for which a relation holds, then Prolog constructs the new relation triples and sends them to Sesame, where they are asserted.

- The string matching within Sesame is too simple. Sesame allows some kind of string matching, but it is by far insufficient to perform the task as done by the python script, which had a parameter stating a percentage of allowed difference. Although one could argue that such mismatches are errors in the metadata, it is a fact that most metadata is manually annotated and often no formal schema is used. In order to still make some sense out of it, these kinds of techniques are the only ones that help.

- An RQL query does not allow ORs if part of the 'or' clause is undefined (= has no instantiations). For example the following query (which defines part of the shareContext relation) is correct:

    select O1,O2,S1,S2,D1,D2,C1,C2
    from {O1} dc:subject {S1}, {O2} dc:subject {S2},
         {O1} dc:date {D1}, {O2} dc:date {D2},
         {O1} dc:coverage {C1}, {O2} dc:coverage {C2}
    where O1 != O2 AND S1=S2 AND ( D1=D2 OR C1=C2 )

  In the next query the OR clause is extended. Because of the generalization the result set should be a superset of the previous result set. However, since there are no dc:contributors defined in the repository, it returns an empty set.

    select O1,O2,S1,S2,D1,D2,C1,C2,CC1,CC2
    from {O1} dc:subject {S1}, {O2} dc:subject {S2},
         {O1} dc:date {D1}, {O2} dc:date {D2},
         {O1} dc:coverage {C1}, {O2} dc:coverage {C2},
         {O1} dc:contributor {CC1}, {O2} dc:contributor {CC2}
    where O1 != O2 AND S1=S2 AND ( D1=D2 OR C1=C2 OR CC1=CC2)

- RQL does not allow queries on graph paths of undefined length. For example, retrieving chains like A precedes B, B precedes C, C precedes D, etc. is only possible if we specify the chain explicitly (= know its length).

graph -> presentation structure (central node)
oai in rdf (sesame)
rql has too simple string matching mechanisms
infer concepts from strings
use graph structure to make presentation
match semantic patterns to presentation template
make difference between first level concepts and lower level concepts
combining knowledge means centralizing

-------------------

Currently Cuypers implements the whole chain from user query to an automatically generated presentation. Cuypers's development basically happened bottom up, working from the final presentation towards the user query. Currently general consensus is only established from the formatting objects level up to the final presentation. The processing of user query, semantics, discourse and narrative is rather thin, and the focus of our research will therefore mainly be on these areas. A point of interest there is the mapping from RST to formatting objects. Currently this is a one to one mapping, implying that the RST structure is equal to the presentation structure.
I think that is not the case. RST is primarily intended as an analysis tool for texts. Besides the fact that the analysis of a text is subjective and can therefore differ between two analyses, going back from the analysis to the original text cannot be done, since the order of the relations is lost. Nevertheless one can construct different texts containing the same information, organized differently. *We ignore the narrative for the moment*. Taking this analogy to multimedia presentations means it should be possible to construct different presentations from a single RST tree. The current implementation does not really disallow this view, but because of the simple RST structure and the Prolog depth-first bias the possibility never occurs.

Within RST every node represents a piece of text; although it can be built up from different nodes, every node is in a sense atomic and its semantics are clear. Within the RST view of a multimedia presentation this isn't the case. Only the leaf nodes carry (implicit) semantics; the higher composite nodes have more of a presentation structuring function than that they define a rhetorical structure. In addition the RST structure should, at least in theory, be dependent on the query. Currently the possible queries are limited to make sure they fit the RST template as it is now.

Composite semantics: there are only a limited number of RST relationships, and the way they are mapped to a presentation depends on the context in which they are used. For example an elaboration on a node at the bottom of the tree can be realized by a mouse-over and pop-up text; an elaboration on a higher level needs to be treated differently.

On a lower level one can make different mappings for RST relations based on content and media type. Currently all RST relations are mapped to a single HFO, with some alternatives if they don't work. This is not very efficient, and in addition one may want to treat specific occurrences differently based on other parameters than whether they fulfill the constraints posed on them. A typical example is to treat landscape images differently from portrait images. Other examples are content related: images of paintings might be presented differently from photographs. Finally, the media type influences the mapping from RST to HFO.

Especially the dependency on content might be interesting from a semanticweb point of view. One can define rules which differentiate the presentation based on semantic properties. By using a taxonomy one can infer that labels and captions can be presented in the same way because they belong to the same class. In addition one can decide to differentiate between paintings and sculptures.

Narrative is choosing a path/order in the RST tree.

-------------------

Objective: extend the functionality of Cuypers by making the mapping from RST to HFO explicit. This is realized by using an ontology describing different types of multimedia items, such as image, text and video on the higher levels, and specific subclasses of these such as title, caption and header, but also biography, comment, painting, photograph etc. This level of detail is domain dependent and might be influenced by an author/domain expert. The higher levels should in principle be reusable between different domains. In addition there is an ontology of (RST) relations and a transformation sheet defining rules which map relations, nuclei and satellites to HFOs.
Because both the relations and the nuclei/satellites are hierarchically organized, these rules have a range of application comparable to the functionality of style-sheet rules.

Example ontology of relations (this is just an example):

    rst (R1)
      - nucleus/satellite (R2)
        - elaboration (R3)
        - example (R4)
      - multinucleus (R5)
        - joint (R6)
        - sequence (R7)

'Type' ontology:

    media (A)
      - image (B)
        - painting (C)
        - drawing (D)
      - video (E)
      - audio (F)
        - background music (G)
        - voice over (H)
      - text (I)
        - one-liner (J)
          - title (K)
          - caption (L)
          - header (M)
        - fulltext (N)
          - biography (O)
          - comment (P)

Transformation:

    relation,nuc,sat   scope                                 hfo
    ------------------------------------------------------------------------
    R1,A,A             everything                            hfo(R1,A,A)
    R3,J,B             images with a one line elaboration    nbox(R3,J,B)
    R2,B,H             images with a voice over              par

The terms described in the ontology map to transformation rules which map ontology terms to HFO objects. These rules have names, and the mapping is made by a designer. Besides the ontology describing media types, there is a mapping between the terms from the ontology and the RST-to-HFO rules. Currently the presentation structure is implicitly coded into the RST tree: the rules which transform RST to HFO 'know' that at the top level there is a nucleus title and two satellites, one elaboration text and a sequence of examples. The resulting HFO transforms both underlying structures of the satellites, but in addition it also implicitly determines the relation between the two satellites, which is that they can/should be presented next to each other.

Presentation structures are domain independent and represent the logical structure (grouping, ordering) in the presentation. A presentation in this context can be a mm-presentation, but also a hypermedia document or even a textual document. Examples of a holistic structure for multimedia are mm-presentation, scene, subscene; for a paper document or book: book, chapter, section, subsection. Presentation structures define the initial parameters for a presentation, such as the dimensions which can be used (temporal, spatial), the possible navigational options, linking, etc. Presentation structures define the possible presentation space.

A ps is at presentation time represented by an hfo. Each ps is represented by one hfo; this can however be of different types. For example an hbox (left to right) might be substituted by a vbox (top to bottom). At the atomic level ps and hfo are quite similar, since there is not much logical structure in an atomic media item. There are composite hfos which define structure at a low level (e.g. slideshow); such an hfo is however not a ps, since it implicitly assumes a temporal dimension. A ps has made this choice explicitly and can choose to represent the choice by using a slideshow, whereas a paper document can choose to use a non-temporal grid for the same purpose.

We distinguish presentation structures based on holistic structure. Examples of holistic structures are multimedia presentation, hypermedia presentation, paper document, powerpoint presentation etc. What they all have in common are the grouping, order and priority specifications. The implementations into hfos are all different though. If we can define a correspondence between the different ps, such as a scene in a mm presentation corresponding to a section in a paper document and a page in a hypermedia document, then we might be able to define a kind of master structure applicable to all holistic structures.
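To illustrate the style-sheet-like behaviour described above, a minimal sketch (not Cuypers code): a rule written for a general class also applies to its subclasses, and the most specific applicable rule wins. The class names mirror the example ontologies; the specificity scoring is a made-up heuristic.

    # Illustrative: hierarchy-aware selection of a transformation rule.
    REL_PARENT = {"elaboration": "nucleus/satellite", "example": "nucleus/satellite",
                  "nucleus/satellite": "rst", "multinucleus": "rst",
                  "joint": "multinucleus", "sequence": "multinucleus"}

    TYPE_PARENT = {"painting": "image", "drawing": "image", "image": "media",
                   "voice over": "audio", "audio": "media",
                   "title": "one-liner", "caption": "one-liner", "one-liner": "text",
                   "text": "media"}

    RULES = [  # (relation, nucleus type, satellite type) -> hfo, as in the table above
        (("rst", "media", "media"), "hfo"),
        (("elaboration", "one-liner", "image"), "nbox"),
        (("nucleus/satellite", "image", "voice over"), "par"),
    ]

    def ancestors(cls, parents):
        chain = [cls]
        while cls in parents:
            cls = parents[cls]
            chain.append(cls)
        return chain

    def select_hfo(rel, nuc, sat):
        best, best_cost = None, None
        for (r, n, s), hfo in RULES:
            try:  # distance to the rule's classes; fails if the rule does not apply
                cost = (ancestors(rel, REL_PARENT).index(r)
                        + ancestors(nuc, TYPE_PARENT).index(n)
                        + ancestors(sat, TYPE_PARENT).index(s))
            except ValueError:
                continue
            if best_cost is None or cost < best_cost:
                best, best_cost = hfo, cost
        return best

    # a title elaborating a painting still counts as 'images with a one line
    # elaboration' (R3,J,B), so nbox is chosen over the generic R1,A,A rule
    print(select_hfo("elaboration", "title", "painting"))   # -> nbox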
The hfo generation needs to take into account the possibilities it can use (e.g. a temporal dimension in a mm presentation). There are however style issues which make things complicated. The base layout of an MTV-style mm presentation is different from that of a CNN-style mm presentation. If we have to take both into account at this level, the hfo generation becomes too complicated, I am afraid.

Within the PS structure tree there are only composites and atomics. Atomics are media items without children; composites have children and define a group. There are specialised types of composites such as presentation, slideshow, hypermedia document etc.
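As a minimal data-structure sketch of such a PS tree (class names are illustrative, this is not the Cuypers representation):

    # Illustrative only: atomics are leaves wrapping a media item, composites group
    # children; specialised composites are just subclasses of Composite.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Atomic:
        media_item: str                 # reference to the media item

    @dataclass
    class Composite:
        children: List[object] = field(default_factory=list)   # Atomic or Composite

    class Presentation(Composite): pass
    class Slideshow(Composite): pass
    class HypermediaDocument(Composite): pass

    scene = Slideshow(children=[Atomic("image1"), Atomic("image2")])
    ps = Presentation(children=[Atomic("title"), scene])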