Blue note about Video Generation

Stefano Bocconi

Last modified: Fri Dec 12 14:33:46 CET 2003

General Introduction

The work I have been doing up till now can be considered an exploration in the field of assembling existing pieces of information to form a new document. So far the existing pieces of information have all belonged to the same data source, in this case the Rijksmuseum, even though the Semantic Web promises to deliver more data sources and we (as members of the I2RP project) promised to use more data sources to carry out our research.

The fact that the data source is unique gives a sort of homogeneity to the information items: you can expect the items to have some similarity with each other, be it in content or style, because they actually belong to the same "document", in this case the Rijksmuseum website. In the following I use the word "document" loosely to indicate a document or a collection of documents, possibly online (a website) and possibly retrieved from a database.

This homogeneity is also the reason I prefer the term reassembling over assembling: the information items were already "assembled"; we break the document into pieces and put some of them back into a new document. Why do we want to do this? Chances are that the new document will be qualitatively worse, because no one (no human) designed the information items to be part of it: they were designed to "live" in the original document.

The answer is well known and basically boils down to the fact that the original document does not answer the questions I have when I access it, i.e. it does not satisfy my information need. Not that users always have questions, but in case they do, the existing "document" will probably or possibly not be able to answer them, or not in the way they would prefer (of course a well-designed system will go a long way towards achieving that).

So the work we do can be seen as (part of) the attempt to disclose and deliver information from information sources (note that the delivering part includes adaptation to the user's hardware).

Repurposing of existing information

If I have to give a general statement of what my thesis is about, I like the following: repurposing of existing information. The previous paragraphs presented the scenario we are operating in: the information is there, but it was NOT created and designed to suit the new purpose. Here the first two research questions pop up: reuse is a process composed of two parts, retrieving the information and putting it together. I am not so concerned with retrieval techniques (I assume we can retrieve all information contained in the source, as long as we know what we want) as with the combination of the information. All efforts are directed at creating a document (in our case a multimedia presentation) that is relevant and coherent. Again, of the two aspects I am more concerned with coherence, because I am interested in the meaning-making mechanisms: I see A and then B, and I understand C. Can an engine "understand" that if it presents the user with A and B, the user might understand C?

This probably sounds far-fetched for multimedia research, but it has a more multimedia-oriented formulation: how can I control, with the tools I have in multimedia (i.e. the structure and format of a presentation), that C IS or IS NOT generated from A and B?

DISC

We started the investigation in this field with DISC. The approach taken was to use stereotypical characters (actants) to select the information and structure the presentation. These stereotypical characters are genre-dependent roles; in a biography, for example, they could be, apart from the main character, characters from the main character's family, from his/her professional life (teachers, colleagues, students), and so on.

In DISC, information is selected to create these actants and structured to group them in the chapter of the story where they stereotypically belong (e.g. a Teacher in the professional-life chapter). The assumption is that by using this stereotypical mechanism the system exploits a sort of common understanding or socio-cultural background, which increases the probability that the information is understood and perceived as coherent.
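As an illustration of the grouping step only, a minimal sketch in Python could look like the following; the role names, chapter labels and items are invented for illustration and are not DISC's actual vocabulary.

# Toy illustration of grouping information items by stereotypical role
# (actant); the role-to-chapter mapping and the items are invented.
ROLE_TO_CHAPTER = {
    "Parent":    "family life",
    "Teacher":   "professional life",
    "Colleague": "professional life",
    "Student":   "professional life",
}

items = [
    {"title": "Letter to his teacher",  "role": "Teacher"},
    {"title": "Portrait of his mother", "role": "Parent"},
    {"title": "Drawing by a pupil",     "role": "Student"},
]

chapters = {}
for item in items:
    chapter = ROLE_TO_CHAPTER.get(item["role"], "miscellaneous")
    chapters.setdefault(chapter, []).append(item["title"])

for chapter, titles in chapters.items():
    print(chapter, "->", titles)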

A next step in the research related to DISC is to see how the system behaves with other information sources, i.e. to investigate the two research questions mentioned above. This is in line with the research roadmap for the I2RP project.

The next step

My project is to automatically generate presentations, this time using existing video footage (www.interviewwithamerica.com). Recently I wrote a program that randomly selects segments of the video footage and assembles them into a new video with transitions between the segments (the result is a SMIL file). The idea was to get a feeling for what kind of presentations can be generated from the material. This experimental phase was useful because I had to adjust my initial, ambitious plan.
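To give an idea of what such a random assembly step can look like, here is a minimal sketch in Python that writes a SMIL 2.0 file with a cross-fade between consecutive clips. It is not the actual program: the clip URIs, durations and the choice of transition are made up for illustration.

import random

# Hypothetical list of (clip URI, duration in seconds); the real footage
# from www.interviewwithamerica.com is of course not organized this way.
CLIPS = [("rtsp://server/clip01.rm", 12.0),
         ("rtsp://server/clip02.rm", 8.5),
         ("rtsp://server/clip03.rm", 20.0),
         ("rtsp://server/clip04.rm", 15.0)]

def random_smil(clips, n=3):
    # Pick n clips at random and emit a SMIL 2.0 document that plays them
    # in sequence, each clip fading in over the previous one.
    chosen = random.sample(clips, n)
    lines = ['<smil xmlns="http://www.w3.org/2001/SMIL20/Language">',
             ' <head><transition id="xfade" type="fade" dur="1s"/></head>',
             ' <body><seq>']
    for src, dur in chosen:
        lines.append('  <video src="%s" dur="%.1fs" transIn="xfade"/>'
                     % (src, dur))
    lines.extend([' </seq></body>', '</smil>'])
    return "\n".join(lines)

print(random_smil(CLIPS))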

As I see it now, the subject of the generated presentations will at first be interviews on a specific subject, e.g. the interviewees' opinions on one issue. This means that the engine has to select all fragments where the issue is discussed. Once such a mechanism is in place, I can start studying how the composition of these fragments reinforces or weakens a particular interpretation, and what interpretations can be suggested to the viewer.
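A minimal sketch of that selection step, assuming each fragment has already been annotated with the issues it discusses; the field names and example data are hypothetical.

# Hypothetical annotated fragments; "topics" lists the issues a fragment
# discusses. The field names are invented for illustration.
fragments = [
    {"src": "interview01.mpg", "begin": 30.0, "end": 55.0,
     "speaker": "interviewee A", "topics": {"war", "economy"}},
    {"src": "interview02.mpg", "begin": 10.0, "end": 42.0,
     "speaker": "interviewee B", "topics": {"war"}},
    {"src": "interview03.mpg", "begin": 0.0, "end": 25.0,
     "speaker": "interviewee C", "topics": {"education"}},
]

def fragments_on_issue(fragments, issue):
    # Return all fragments whose annotations mention the given issue.
    return [f for f in fragments if issue in f["topics"]]

for f in fragments_on_issue(fragments, "war"):
    print(f["speaker"], f["src"], f["begin"], f["end"])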

My initial plan was a bit too ambitious: "create some sort of stories, in my intention stories that are not described in the video. What kind of stories should become clear after the experimenting phase. One possibility I have in mind is that the viewer specifies an emotional effect the story should provoke and the engine generates it."

The research direction is the same though: I want to use editing knowingly to create certain interpretations, even if I think that limiting the scope of the generated presentations to the original subject of the video footage (i.e. interviews) is more realistic.

How it fits in the whole

Everything is driven by the problem of reusing existing information in a new context. This requires an understanding of "compositional semantics", i.e. the effect of putting several things together, in order to evaluate the compositional semantics of both the existing information source and the generated presentation. This is needed to understand what is required to break the original document into reusable pieces, and to control the idea of the whole presentation a viewer forms by linking its different information items, i.e. the meaning-making process.

In a generated presentation the engine should be aware of this process as much as possible and control or drive it in the viewer. Awareness and control are interrelated but not necessarily equally achievable.

How it fits in Multimedia

In controlling this process of meaning making, especially in this phase with video, the engine generating the presentation will use specific video properties that can be exemplified by the Kuleshov effect. The two phases of awareness and control of the meaning-making process will be tested in the automatic creation of video in a way that differs from the presentations generated from the Rijksmuseum: there we assume there is a single story to tell (the art-historical truth) and the engine should convey the message without creating misunderstanding in the viewer, while the video material is more controversial and the idea is to play with point of view and interpretation. Both approaches rely on mastering the relationship (or meaning) creation process, and the first can benefit from the second.

Literature

The literature search is (and will be) proceeding along two lines:
  1. Film theory
  2. Automatic Narrative generation
From film theory I mainly expect to gather an "editing vocabulary" to use to provoke a certain effect on the viewer. From automatic narrative generation I mainly expect to get an overview of the techniques the engine I want to implement can use (e.g. grammars, agent-based approaches). A nice example is TALE-SPIN, and the first paragraph of that paper is also an inspiration to play with rules and see how and when the generation breaks down, considering that video has more ways to fool a viewer than text (it is more reality-like). An interesting idea about grammars and new languages is sketched in Digital Mantras, which is in fact one of my sources of inspiration. Considering that the system should try to suggest an interpretation to the viewer, Terminal Time immediately comes to mind, and it will be a source of both inspiration and differentiation. Frank's thesis, which I am now reading, should also provide some insights.

System design

I expect the theory to focus on the metadata, on the process, or on both, depending on where the larger contribution lies. On the metadata side I expect some contribution/insight as to what annotations support the story-building process, e.g.:
  1. what subjects (actants)
  2. what temporal relations
  3. how to deal with continuity
while on the process side I expect insights as to what process can best use the editing vocabulary to induce meaning in the viewer. The implementation will go in parallel with the theory and will focus on the selection mechanism, probably starting with a sort of grammar (if possible) based on the cultural aspects and maybe shifting to more complex methods. I am particularly interested in grammars because it would be a very interesting result to see whether meaning can be achieved with such a simple mechanism, or more generally how far you can get with a grammar. As I see it, this is strongly related to the philosophy sketched in Digital Mantras.
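As a toy illustration of the grammar idea only: the rules, category names and the purely random choices below are invented, not a proposal for the actual vocabulary.

import random

# Toy rewriting grammar: non-terminals expand into sequences of categories,
# terminals are categories to be filled with annotated video fragments.
# All rule and category names are invented for illustration.
GRAMMAR = {
    "story":    [["introduction", "opinions", "closing"]],
    "opinions": [["pro", "contra"], ["pro", "contra", "opinions"]],
}

def expand(symbol):
    # Recursively expand a symbol into a flat list of fragment categories.
    if symbol not in GRAMMAR:              # terminal: a category label
        return [symbol]
    rule = random.choice(GRAMMAR[symbol])  # pick one production at random
    return [category for part in rule for category in expand(part)]

def fill(categories, fragments_by_category):
    # Pick one annotated fragment per category, skipping empty categories.
    sequence = []
    for category in categories:
        candidates = fragments_by_category.get(category, [])
        if candidates:
            sequence.append(random.choice(candidates))
    return sequence

print(expand("story"))  # e.g. ['introduction', 'pro', 'contra', 'closing']

Even a grammar this small already produces sequences of different lengths and shapes; the open question is whether such sequences can be made to carry meaning for the viewer.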

What research contributions

Considering my intended first subject for the presentations, one contribution (especially to the field of Digital Libraries) is to provide content-based access to a large body of video material. Another contribution I expect comes from familiarization with MPEG-7 and related annotation tools. These two points will possibly yield some research insights, but they are not in themselves research contributions; the real research contributions I expect are:
  1. The annotation process
  2. The generation process
The annotation process consists of defining how to annotate and what to put in the annotation. These two issues are clear in the video example: what annotation can support the engine in creating a story out of the video footage, and how do you attach annotation to the footage so that a selected scene can become a building block of an automatically generated story (there are temporal and editing issues here).
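A toy stand-in for the kind of annotation I have in mind: in MPEG-7 terms a fragment would roughly correspond to a video segment described by a start point plus a duration, with descriptors attached to it. All the field names below are my own assumptions, not an MPEG-7 schema.

from dataclasses import dataclass, field

# Toy annotation record for one video fragment; "continuity" is a free-text
# placeholder for whatever is needed to judge whether two fragments cut
# together well.
@dataclass
class FragmentAnnotation:
    source: str            # media file or URI the fragment belongs to
    start: float           # start time within the source, in seconds
    duration: float        # length of the fragment, in seconds
    actants: list = field(default_factory=list)  # stereotypical roles involved
    topics: list = field(default_factory=list)   # issues discussed
    continuity: str = ""   # e.g. shot size, gaze direction, camera side

    def end(self):
        return self.start + self.duration

frag = FragmentAnnotation("interview05.mpg", 120.0, 35.0,
                          actants=["interviewee"], topics=["war"],
                          continuity="medium shot, looking left")
print(frag.end())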

The generation process must create some sort of story, and different approaches from the literature can be used, from grammars to independent agents. The actant idea comes into play here, because the actants should generate the story either by being characters in it or by influencing the selection process (for example emotional actants).

Using existing material (not meant to be used in the way I want to use it) is also interesting because it represents a sort of default mode in which people create video material: from our experience and difficulties with it we could in principle derive a set of guidelines on how to create material that maximizes the possibility of reuse in the future.

Some References

J. R. Meehan. TALE-SPIN, an interactive program that writes stories. In Proceedings of the Fifth International Joint Conference on Artificial Intelligence, August 1977.

S. R. Holtzman. Digital Mantras: The Language of Abstract and Virtual Worlds.

F. Nack. AUTEUR: The Application of Video Semantics and Theme Representation for Automated Film Editing. Ph.D. Thesis, Lancaster University, August 1996.

M. Davis. Media Streams: Representing Video for Retrieval and Repurposing. Ph.D. Thesis, Massachusetts Institute of Technology, Cambridge, Massachusetts, 1995. http://garage.sims.berkeley.edu/pdfs/1995_Marc_Davis_Dissertation.pdf

Remarks

Lloyd observes that the media items I will use for the second step are not contained in a pre-existing structure, i.e. the video footage was not shot for a particular structure and now has to be put into another one; therefore the idea of repurposing does not apply to it as well as it does to the ARIA database. This is a correct remark, and I think the relation with the whole repurposing issue is rather that trying to automatically edit video footage should give us a clue about the kind of problems we might encounter when we automatically put media items together. Whether the presence of a structure containing the to-be-reused media items makes the problem of assembling them easier or more difficult, I still do not know.
Another point that Lloyd stresses is that annotations should be presentation-independent, but we have to deal with annotation that is specifically presentation-dependent (the ARIA database).
Another interesting aspect to consider is the time constraints a presentation should satisfy. This can open up an interesting field of research that has already been approached (also in Cuypers) through the use of RST nucleus-satellite relations (and there is a paper about how RST can serve the purpose of selecting the most salient points in the domain of news). Michel Crampes has done some work on this; more on it when I read his paper again.
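A crude sketch of what a time-constraint mechanism along those lines could look like, assuming fragments are labelled as nucleus or satellite and the policy is simply to drop satellites first; both the labelling and the policy are my assumptions, not what Cuypers or Crampes actually do.

# Crude sketch: trim a presentation to a time budget by dropping satellite
# fragments (less salient) before nuclei. Labels and policy are assumptions.
fragments = [
    {"id": "intro",   "role": "nucleus",   "dur": 20.0},
    {"id": "detail1", "role": "satellite", "dur": 35.0},
    {"id": "opinion", "role": "nucleus",   "dur": 40.0},
    {"id": "detail2", "role": "satellite", "dur": 25.0},
]

def fit_to_budget(fragments, budget):
    # Drop the longest satellites first until the total duration fits the
    # budget; nucleus fragments are always kept.
    kept = list(fragments)
    satellites = sorted((f for f in kept if f["role"] == "satellite"),
                        key=lambda f: f["dur"], reverse=True)
    while sum(f["dur"] for f in kept) > budget and satellites:
        kept.remove(satellites.pop(0))
    return kept

print([f["id"] for f in fit_to_budget(fragments, 90.0)])  # drops "detail1"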
Just a small idea: I am also curious whether SMIL's zooming capability can be used in document editing.