date: 17-1-2003
author: Joost
description: This is a braindump after working for about two months on investigating the possibilities of extending the semantic inferencing as described in the ecdl paper. The conclusion must be that there are currently too many technical problems which need to be overcome before this aspect can become of interest. A conceptual model is presented, mainly for personal use, to get a grip on what still needs to be done. Some ideas for the semanticweb paper.

WARNING! THIS IS NOT A COHERENT STORY

Semantic inferencing

The Cuypers automatic multimedia presentation generator serves authors by allowing a 'high level' specification of a presentation. This means authors define relations between media items stating their role within the presentation. These relations can then be used to adapt the presentation to the specific circumstances in which it is played. These circumstances include hardware constraints such as screen size and bandwidth, but also user constraints such as expertise and interest. The gain for authors is that they need to author only a single presentation instead of creating one for every situation.

The flexibility mentioned above still assumes an author with a message which needs to be conveyed across different users and platforms. In some cases, though, the collection of data is too large and addresses too many topics to create individual presentations. For example, museum repositories store large amounts of information about their artefacts; the number of possible relations within these is virtually unlimited. Other examples include OAI, which defines a uniform interface to expose metadata from different repositories. Combining these resources provides a valuable information source; exposing the hidden information, however, is a time consuming task. For these cases one can take the metadata associated with the media items, use it to find some of these hidden relationships, and use those to automatically generate a multimedia presentation.

The ecdl paper describes the idea of using Dublin Core metadata to infer some of these relations. Although the demo proves that the presented ideas work for a tailored repository, it is difficult to scale them up to be more generally applicable. This has a number of reasons:

- The inference rules used are tailored to a people domain and, to some extent, to artists. The assumption is that the repository contains media items about people and the works they have created. The metadata can then be used to infer relations between, for example, a creator and the objects created, or to infer that two or more people were colleagues if they are creators of a single artefact. Although the assumption is already quite restrictive in its application, the inferred relations can still be in error. For example, a scanned picture of a painting can have four creators: the painter, the photographer, the person who scanned it and the person who added the object to the repository; these can hardly be considered to be colleagues.

- The values of Dublin Core elements are free text strings. There are no schemas which specify in what format data should be entered. As a result there are a dozen different ways to represent similar information. For example "Abraham Lincoln", "Lincoln, Abraham", "Lincoln" and "Abraham Lincoln (1809-1865)" all refer to the same person. The rules which are used to infer relationships are based upon string matching. In order to make "Lincoln, Abraham" equivalent to "Abraham Lincoln" the string matching algorithm needs to be quite 'loose'. As a result the chance of making a wrong inference is quite large.
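To make the previous point concrete, a rough sketch of the kind of 'loose' matching involved (illustrative only: the normalisation steps and the 0.85 threshold are made up here and are not the parameters of the actual python script):

    # Illustrative sketch of 'loose' string matching on creator/subject values.
    from difflib import SequenceMatcher

    def normalize(name):
        # "Lincoln, Abraham (1809-1865)" -> "abraham lincoln"
        name = name.split("(")[0].strip().lower()
        if "," in name:
            last, first = [part.strip() for part in name.split(",", 1)]
            name = first + " " + last
        return name

    def same_entity(a, b, threshold=0.85):
        # the threshold plays the role of the 'percentage of allowed difference'
        # parameter of the python script mentioned further down in these notes
        return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

    print(same_entity("Lincoln, Abraham", "Abraham Lincoln (1809-1865)"))  # True
    print(same_entity("Lincoln", "Abraham Lincoln"))  # False; a looser threshold would
                                                      # catch this but also produce false matches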
[[ add stuff about related works such as GATE ]]

- The Dublin Core metadata scheme is used to describe digital objects. The scheme however leaves the annotator a high degree of freedom in filling in the fields, and as a result a number of ambiguities can occur which may lead to unexpected results when doing inferences on them. Examples of these ambiguities include:
  + Ambiguity on the object description level. Within Dublin Core it is not clear at what level the media object is being described. A photograph or a painting of a building can both be the intended object of description, but which is meant is not made clear by the Dublin Core scheme.
  + The Dublin Core element values allow free text strings. The fields are primarily intended to be interpreted by human beings, so the level of ambiguity is comparable to the ambiguities in natural language. For example the Dublin Core element subject with the value "Lincoln" can be about a president, a city or even sheep.
  + DC describes a media item in terms of strings, while inferring relationships results in creating relations between media items. Sometimes this carries the implicit assumption that the media item in some way represents the concept to which you are relating. For example the creator relationship states that a media item is the creator of another media item, which can cause some confusion, since the concept represented by the object is supposed to be the creator, not the media object itself. In contrast, versionOf, precedes and in some cases shareContext do relate the media objects themselves and not the concepts they represent. In addition the topic of a presentation is often not represented by a media item but relates more to a concept.

- Searching in a Dublin Core repository results in a set of objects which have no immediate and obvious relationship except for the fact that they share a common keyword. After applying the inference rules there may be a few more relationships which tell us something about the content of the media and its relationships with other content. Still, a mapping needs to be made from relations to presentation patterns. Presentation, however, imposes requirements not only on the content of the media but also on the type of media. It might be advisable not to have two videos play synchronously, for example. This kind of information can (at least in theory) be obtained from the Dublin Core fields, so such situations can be prevented. Information about the functions a media item can fulfill is harder to determine. For example, to find out whether a piece of text is suitable to accompany a couple of images requires the text to have specific features, such as being of a higher conceptual level than the images [[lynda: covered by mpeg7]], and whether it explains, gives an overview of, or is an example of the topic. In addition the topic of the text should be bounded by the examples. Similar examples can be found for other types of media. It needs to be made clear in what context media items can be used within a presentation. This information is insufficiently described within the Dublin Core metadata.

- Currently the inference rules mentioned in the ecdl paper are implemented in a python script. This has some serious disadvantages:
  + A general problem is that the number of tests needed to check whether a relationship holds becomes intractable for a large number of media items (n): r * n(n+1)/2 tests, with r the number of rules. With, say, 1000 media items and 10 rules this is already about 5 million tests. In addition the metadata is stored in a mysql database.
Whenever a value is needed within a rule this results in a database access. Since a typical rule (e.g. created) uses 5 values, the overhead becomes an issue with a large number of media items.
  + The rules are not easily extensible and adaptable, since they are programmed in python and need to be compiled to java in order to use them within the cuypers-tomcat environment.
  + The inferred relationships are lost after the presentation has been generated. Besides being inefficient from a computational point of view, this also prevents a second inference iteration. For example, when one person has created works with two other people, those two people have a mutual acquaintance, or at least lived in the same period of time.

- From graph to tree. Relations are organized in a graph structure; a multimedia presentation is organized as a tree (or multiple trees). Therefore a transformation from a graph to a tree needs to be made. An important issue then is to determine the central question, or main focus, of the presentation. This becomes the root of the tree; all other nodes are related with respect to the root node. There are several methods to determine a central node. For example, the node which has the most relationships might qualify. This however will often result in a node which has a lot of shareContext relations. In practice shareContext behaves as a transitive relation, so all nodes related by shareContext have the same number of relations, and the node which happens to have one additional relation gets chosen as the central node. This does not lead to the desired result, since a shareContext relation doesn't tell you much about the importance of the media item; it may well be that a describes relation would be more suitable to supply the central node. It turns out that it is hard to define a satisfying heuristic to choose a central node. In semin we attempted to solve this by introducing an artificial media item which served as the title of the presentation. The content of the title was the keywords of the search. All media items have a relation to the title and therefore the title is always chosen as the central node. Although the relationships could be different, the resulting graph is 'star' shaped (one node with a lot of outgoing relations). This structure is just as hard or even harder to transform into a presentation structure, so the idea was dropped.

Because of the sparse data we use in our test environment and the experience that some relations rarely occur, we could define the presentation structure in terms of possible occurrences of relations. Combined with simple grammar rules which define the 'narrative', this leads to the generation of simple presentations:

    presentation -> intro content end ; intro content ; content end ; content
    intro        -> SHARE_CONTEXT([A,B..])
    content      -> DESCRIBES(A,[B,C..]) ; SHARE_CONTEXT([A,B..])
    end          -> list of creators retrieved from metadata

[[lynda: relate to stefano's rules]]

Some media items can have multiple relationships which are used in different contexts during the presentation. For example the introduction of a topic can be presented by a quick collage of images which have a shareContext relation; the same image can also be used later in the presentation to clarify a concept or to serve as an example. In contrast, a piece of text should in principle be used only once, which means that some relations might be ignored.
The rules specified above are ordered; to satisfy the rule for 'presentation', first 'intro content end' is tried, then 'intro content', etc. The rules take a set of relations as input and return a (possibly reduced) set of relations. The first rule for 'content', for example, queries for describes(A,B) relations. If it succeeds it removes all relations which have node A in them, since A is a text. In contrast, the 'intro' rule leaves the relations as they are, so they can be reused. Since the 'content' part of the presentation can be considered the most important part, this rule is executed before the 'intro' and 'end' rules; this ensures the 'content' can still use all available relations. The result of applying a grammar rule will, in addition to a (possibly) reduced set of relations, be a 'presentation device' stating in what way the information should be presented. Examples of these include sequence, slide show, nbox and collage. Later on these are mapped to formatting objects. An example of such a grammar rule (remove_used/3, which drops the relations mentioning the media that have just been consumed, is left out):

    content(RelationsIn, RelationsOut, nbox(MediaA, slideshow(MediaList))) :-
        findall(M, member(describes(MediaA, M), RelationsIn), MediaList),
        MediaList \= [],
        remove_used([MediaA|MediaList], RelationsIn, RelationsOut).

- The possible relations two media items can have are to some extent hierarchically organized. For example, when two media items have a describes relationship they often have a shareContext relationship as well (not always, though). Some rules infer more detailed relations than others, and those relations are therefore scarcer. One could argue that these relationships are more important and that they should be presented in favour of other, less detailed relations. Although this is true in the case of a shareContext and a describes relation, in general it is more subtle. For example, when two objects share both a shareContext relation and a colleagueOf relation, the shareContext relationship might be preferred: the colleagueOf relation could be too far off-topic to deserve a separate presentation construct, as illustrated by the following example:

    a describes b,c,d
    b,c,d all share context
    c colleagueOf d
    -> present a next to a slideshow of b,c,d (ignoring c colleagueOf d)

For the limited number of rules as they are now implemented in the semin demo the possible outcomes are still predictable. However, when more rules and relations are added the results are no longer predictable. In addition the rules are not manageable.

-> Inference rules need to be formally specified and have formal semantics. Properties of the relationships are currently implicit:
  - shareContext is symmetric and transitive
  - depicts and describes are inverses
  - precedes and follows imply versionOf
  - etc...

You need to relate concepts. Media items are not concepts, but they can represent concepts.

OAI in RDF & Sesame

Sesame is a storage and inferencing engine for RDF documents. It can serve to some extent the same functionality as the python script, with some improvements: performance, and the use of 'standard' formats such as RDF and RQL. The possibility of combining other RDF knowledge resources made it worthwhile to convert the database to RDF format. Sesame allows the assertion of new facts: once a relation has been inferred it can be added to the repository and be used in future retrievals. This means the inference rules only need to be applied when the repository changes, in contrast to the python script, which did the inferencing every time a presentation was generated. In practice the number of additions is smaller than the number of presentation generations, and therefore performance increased. In addition the inferencing process within Sesame can be scheduled at a convenient time, while the python script requires the user to wait while it is inferencing.
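A minimal sketch of this query-then-assert cycle, using rdflib and SPARQL purely as stand-ins for Sesame and RQL (the shareContext property, the file name and the rule itself are illustrative, not the actual semin vocabulary):

    # Illustrative only: infer shareContext from a shared dc:subject and dc:date,
    # then assert the new triples back into the store so that later retrievals can
    # use them without re-running the rule.
    from rdflib import Graph, Namespace

    DC = Namespace("http://purl.org/dc/elements/1.1/")
    SC = Namespace("http://example.org/semin#")      # hypothetical vocabulary

    g = Graph()
    g.parse("repository.rdf")                        # assumed RDF dump of the OAI data

    q = """
    SELECT ?o1 ?o2 WHERE {
        ?o1 dc:subject ?s . ?o2 dc:subject ?s .      # exact string match; see the
        ?o1 dc:date ?d .    ?o2 dc:date ?d .         # 'loose' matching issue above
        FILTER (?o1 != ?o2)
    }"""
    for o1, o2 in g.query(q, initNs={"dc": DC}):
        g.add((o1, SC.shareContext, o2))             # assert the inferred relation

    g.serialize("repository.rdf", format="xml")      # persist, so the rule only needs
                                                     # to run again when the data changes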
The format of the RQL queries is intuitive and maps almost one to one onto the inferencing rules described in the ecdl paper. There are however some differences which need to be addressed.

[[ add stuff about why rdf provides more flexibility than the current semin implementation ]]

- RQL only retrieves data; it cannot make the implication and assert the inferred knowledge. Currently the assertion is done in Prolog: first an RQL query retrieves the items for which a relation holds, then Prolog constructs the new relation triples and sends them to Sesame, where they are asserted.

- The string matching within Sesame is too simple. Sesame allows some kind of string matching, but it is by far insufficient to perform the task as done by the python script, which had a parameter stating a percentage of allowed difference. Although one could argue that such mismatches are errors in the metadata, it is a fact that most metadata is manually annotated and often no formal schema is used. In order to still make some sense out of it, these kinds of techniques are the only ones that help.

- An RQL query does not allow ORs if part of the 'or' clause is undefined (= has no instantiations). For example the following query (which defines part of the shareContext relation) is correct:

    select O1,O2,S1,S2,D1,D2,C1,C2
    from {O1} dc:subject {S1}, {O2} dc:subject {S2},
         {O1} dc:date {D1}, {O2} dc:date {D2},
         {O1} dc:coverage {C1}, {O2} dc:coverage {C2}
    where O1 != O2 AND S1=S2 AND ( D1=D2 OR C1=C2 )

  In the next query the OR clause is extended. Because of the generalization the result set should be a superset of the previous result set. However, since there are no dc:contributors defined in the repository, it returns an empty set.

    select O1,O2,S1,S2,D1,D2,C1,C2,CC1,CC2
    from {O1} dc:subject {S1}, {O2} dc:subject {S2},
         {O1} dc:date {D1}, {O2} dc:date {D2},
         {O1} dc:coverage {C1}, {O2} dc:coverage {C2},
         {O1} dc:contributor {CC1}, {O2} dc:contributor {CC2}
    where O1 != O2 AND S1=S2 AND ( D1=D2 OR C1=C2 OR CC1=CC2)

- RQL does not allow queries on graph paths of undefined length. For example, retrieving chains like A precedes B, B precedes C, C precedes D, etc. is only possible if we specify the chain explicitly (= know its length).

graph -> presentation structure (central node)
oai in rdf (sesame)
rql has too simple string matching mechanisms
infer concepts from strings
use graph structure to make presentation
match semantic patterns to presentation template
make difference between first level concepts and lower level concepts
combining knowledge means centralizing

-------------------

Currently Cuypers implements the whole chain from user query to an automatically generated presentation. Cuypers's development basically happened bottom up, working from the final presentation towards the user query. Currently general consensus is only established from the formatting objects level up to the final presentation. The processing of user query, semantics, discourse and narrative is rather thin, and the focus of our research will therefore mainly be on these areas. A point of interest there is the mapping from RST to formatting objects. Currently this is a one to one mapping, implying that the RST structure is equal to the presentation structure.
I think that is not the case. RST is primarily intended as an analysis tool for texts. Besides the fact that the analysis of a text is subjective and can therefore differ between two analyses, going back from the analysis to the original text cannot be done, since the order of the relations is lost. Nevertheless one can construct different texts containing the same information, organized differently. *We ignore the narrative for the moment*. Taking this analogy to multimedia presentations means it should be possible to construct different presentations from a single RST tree. The current implementation does not really disallow this view, but because of the simple RST structure and the Prolog depth-first bias the possibility never occurs.

Within RST every node represents a piece of text; although it can be built up from different nodes, every node is in a sense atomic and its semantics are clear. Within the RST view of a multimedia presentation this isn't the case. Only the leaf nodes carry (implicit) semantics; the higher composite nodes have more of a presentation structuring function than that they define a rhetorical structure. In addition the RST structure should, at least in theory, be dependent on the query. Currently the possible queries are limited to make sure they fit the RST template as it is now.

Composite semantics: there are only a limited number of RST relationships, and the way they are mapped to a presentation depends on the context in which they are used. For example an elaboration on a node at the bottom of the tree can be realized by a mouse-over and pop-up text; an elaboration on a higher level needs to be treated differently.

On a lower level one can make different mappings for RST relations based on content and media type. Currently all RST relations are mapped to a single HFO, with some alternatives if they don't work. This is not very efficient, and in addition one may want to treat specific occurrences differently based on other parameters than whether they fulfill the constraints posed on them. A typical example is to treat landscape images differently from portrait images. Other examples are content related: images of paintings might be presented differently from photographs. Finally, the media type influences the mapping from RST to HFO.

Especially the dependency on content might be interesting from a semanticweb point of view. One can define rules which differentiate the presentation based on semantic properties. By using a taxonomy one can infer that labels and captions can be presented in the same way because they belong to the same class. In addition one can decide to differentiate between paintings and sculptures.

Narrative is choosing a path/order in the RST tree.

-------------------

Objective: extend the functionality of Cuypers by making the mapping from RST to HFO explicit. This is realized by using an ontology describing different types of multimedia items, such as image, text and video on the higher levels, and specific subclasses of these such as title, caption and header, but also biography, comment, painting, photograph etc. This level of detail is domain dependent and might be influenced by an author/domain expert. The higher levels should in principle be reusable between different domains. In addition there is an ontology of (RST) relations and a transformation sheet defining rules which map relations, nuclei and satellites to HFOs.
Because both the relations and the nuclei/satellites are hierarchically organized, these rules have a range of application comparable to the functionality of style-sheet rules.

Example ontology of relations (this is just an example):

    rst (R1)
      - nucleus/satellite (R2)
        - elaboration (R3)
        - example (R4)
      - multinucleus (R5)
        - joint (R6)
        - sequence (R7)

'Type' ontology:

    media (A)
      - image (B)
        - painting (C)
        - drawing (D)
      - video (E)
      - audio (F)
        - background music (G)
        - voice over (H)
      - text (I)
        - one-liner (J)
          - title (K)
          - caption (L)
          - header (M)
        - fulltext (N)
          - biography (O)
          - comment (P)

Transformation:

    relation,nuc,sat   scope                                 hfo
    ------------------------------------------------------------------------
    R1,A,A             everything                            hfo(R1,A,A)
    R3,J,B             images with a one line elaboration    nbox(R3,J,B)
    R2,B,H             images with a voice over              par

The terms described in the ontology map to transformation rules which map ontology terms to HFO objects. These rules have names, and the mapping is made by a designer. Besides the ontology describing media types, there is a mapping between the terms from the ontology and the RST-to-HFO rules. Currently the presentation structure is implicitly coded into the RST tree: the rules which transform RST to HFO 'know' that at the top level there is a nucleus title and two satellites, one elaboration text and a sequence of examples. The resulting HFO transforms both underlying structures of the satellites, but in addition it also implicitly determines the relation between the two satellites, which is that they can/should be presented next to each other.

Presentation structures are domain independent and represent the logical structure (grouping, ordering) in the presentation. A presentation in this context can be a mm-presentation, but also a hypermedia document or even a textual document. Examples of a holistic structure for multimedia are mm-presentation, scene, subscene; for a paper document or book: book, chapter, section, subsection. Presentation structures define the initial parameters for a presentation, such as the dimensions which can be used (temporal, spatial), the possible navigational options, linking, etc. Presentation structures define the possible presentation space.

A ps is at presentation time represented by an hfo. Each ps is represented by one hfo; this can however be of different types. For example an hbox (left to right) might be substituted by a vbox (top to bottom). At the atomic level ps and hfo are quite similar, since there is not much logical structure in an atomic media item. There are composite hfos which define structure at a low level (e.g. slideshow); such an hfo is however not a ps, since it implicitly assumes a temporal dimension. A ps has made this choice explicitly and can choose to represent the choice by using a slideshow, whereas a paper document can choose to use a non-temporal grid for the same purpose.

We distinguish presentation structures based on holistic structure. Examples of holistic structures are multimedia presentation, hypermedia presentation, paper document, powerpoint presentation etc. What they all have in common are the grouping, order and priority specifications. The implementations into hfos are all different though. If we can define a correspondence between the different ps, such as a scene in a mm presentation corresponding to a section in a paper document and a page in a hypermedia document, then we might be able to define a kind of master structure applicable to all holistic structures.
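To illustrate the style-sheet-like behaviour described above, a minimal sketch (not Cuypers code): a rule written for a general class also applies to its subclasses, and the most specific applicable rule wins. The class names mirror the example ontologies; the specificity scoring is a made-up heuristic.

    # Illustrative: hierarchy-aware selection of a transformation rule.
    REL_PARENT = {"elaboration": "nucleus/satellite", "example": "nucleus/satellite",
                  "nucleus/satellite": "rst", "multinucleus": "rst",
                  "joint": "multinucleus", "sequence": "multinucleus"}

    TYPE_PARENT = {"painting": "image", "drawing": "image", "image": "media",
                   "voice over": "audio", "audio": "media",
                   "title": "one-liner", "caption": "one-liner", "one-liner": "text",
                   "text": "media"}

    RULES = [  # (relation, nucleus type, satellite type) -> hfo, as in the table above
        (("rst", "media", "media"), "hfo"),
        (("elaboration", "one-liner", "image"), "nbox"),
        (("nucleus/satellite", "image", "voice over"), "par"),
    ]

    def ancestors(cls, parents):
        chain = [cls]
        while cls in parents:
            cls = parents[cls]
            chain.append(cls)
        return chain

    def select_hfo(rel, nuc, sat):
        best, best_cost = None, None
        for (r, n, s), hfo in RULES:
            try:  # distance to the rule's classes; fails if the rule does not apply
                cost = (ancestors(rel, REL_PARENT).index(r)
                        + ancestors(nuc, TYPE_PARENT).index(n)
                        + ancestors(sat, TYPE_PARENT).index(s))
            except ValueError:
                continue
            if best_cost is None or cost < best_cost:
                best, best_cost = hfo, cost
        return best

    # a title elaborating a painting still counts as 'images with a one line
    # elaboration' (R3,J,B), so nbox is chosen over the generic R1,A,A rule
    print(select_hfo("elaboration", "title", "painting"))   # -> nbox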
The hfo generation needs to take into account the possibilities it can use (e.g. a temporal dimension in a mm presentation). There are however style issues which make things complicated. The base layout of an MTV-style mm presentation is different from that of a CNN-style mm presentation. If we have to take both into account at this level, the hfo generation becomes too complicated, I am afraid.

Within the PS structure tree there are only composites and atomics. Atomics are media items without children; composites have children and define a group. There are specialised types of composites such as presentation, slideshow, hypermedia document etc.
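As a minimal data-structure sketch of such a PS tree (class names are illustrative, this is not the Cuypers representation):

    # Illustrative only: atomics are leaves wrapping a media item, composites group
    # children; specialised composites are just subclasses of Composite.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Atomic:
        media_item: str                 # reference to the media item

    @dataclass
    class Composite:
        children: List[object] = field(default_factory=list)   # Atomic or Composite

    class Presentation(Composite): pass
    class Slideshow(Composite): pass
    class HypermediaDocument(Composite): pass

    scene = Slideshow(children=[Atomic("image1"), Atomic("image2")])
    ps = Presentation(children=[Atomic("title"), scene])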