Blue note about Video Generation
Stefano Bocconi
Last modified: Fri Dec 12 14:33:46 CET 2003
General Introduction
The work I have been doing up till now can be considered as an
exploration in the field of assembling existing pieces of information
together to form a new document. Up till now the existing pieces of
information all belong to the same data source, in this case the
Rijksmuseum, even though the Semantic Web promises to deliver more
data sources and we (as members of the I2RP project) promised to use
more data sources to carry out our research.
The fact that the data source is unique gives a sort of
homogeneity to the information items; in other words, you can
expect that the information items have some similarity with each
other, be it in content or style. This is because they actually
belong to the same "document", in this case the Rijksmuseum
website. In the following I use the word "document" in a loose way
to indicate a document or a collection of documents, possibly
online (a website) and possibly retrieved from a database.
This homogeneity is also the reason I prefer the term
reassembling over the term assembling: the information items were
already "assembled"; we break the document into pieces and put some
of them back in a new document. Why do we want to do this? Chances
are that the new document will be qualitatively worse, because no
one (no human) designed the information items to be part of
it. The information items were designed to "live" in the original
document.
The answer is well known and basically boils down to the fact
that the original document is not the answer to the questions I
have when I access it, i.e. it will not satisfy my information
need. Not that users always have questions, but in case they do,
the existing "document" will possibly not be able to answer
them, or not in the way they would prefer (of course a well
designed system will go a long way in achieving that).
So the work we do can be seen as (a part of) the attempt to
disclose and deliver information from information sources (note
that the delivering part includes the adaptation to the user's
hardware).
Repurposing of existing information
If I have to give a general statement of what my thesis is about,
I like the following: repurposing of existing information. In the
previous paragraph I presented the scenario we are operating in:
the information is there, but it was NOT created and designed to suit:
- All user information needs
- Our effort to reuse it to satisfy all user information needs
Here the first two research questions pop up:
- How do we reuse this information to answer information needs?
- How should an information source be annotated to enable this
  reuse? This can be decomposed into two more questions:
  - What "class" of metadata do we need?
  - What improvement can be achieved in the reuse by having
    better suited "classes" of metadata?
The reuse is a process composed of two parts: retrieving the
information and putting it together. I am not so concerned with
retrieval techniques (I assume we can retrieve all the information
contained in the source, as long as we know what we want), but
rather with the combination of the information. All efforts are
directed at the creation of a document (in our case a multimedia
presentation) that is relevant and coherent. Again, of the two
aspects I am more concerned with coherence, because I am
interested in the meaning-making mechanisms: I see A and then B,
and I understand C. Can an engine "understand" that if it presents
the user with A and B, the user might understand C?
This probably sounds far-fetched for multimedia research, but it
has a more multimedia-oriented formulation: how can I control, with
the tools I have in multimedia (i.e. the structure and format of a
presentation), whether C IS OR IS NOT generated from A and B?
DISC
We started the investigation in this field with DISC. The
approach taken was to use stereotypical characters (actants) to
select the information and structure the presentation. These
stereotypical characters are genre-dependent roles; e.g. in a
biography they could be, apart from the main character, characters
from the main character's family or from his/her professional life
(teachers, colleagues, students), and so on.
In DISC, information is selected to create these actants and
structured to group them in the chapter of the story where they
stereotypically belong (e.g. a Teacher in the professional life
chapter), as sketched below. The assumption is that by using this
stereotypical mechanism, the system exploits a sort of common
understanding or socio-cultural background that increases the
probability that the information is understood and perceived as
coherent.
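To illustrate, a minimal sketch in Python of this grouping step;
the roles, chapters and items are made up, not DISC's actual
vocabulary:

    # Map each annotated item to an actant role and group the roles
    # into the chapter where they stereotypically belong (biography genre).
    CHAPTER_OF = {                  # genre-dependent role -> chapter table
        "Teacher": "professional life",
        "Colleague": "professional life",
        "Student": "professional life",
        "Parent": "family",
        "Sibling": "family",
    }

    items = [                       # hypothetical annotated items
        {"text": "studied under X", "actant": "Teacher"},
        {"text": "born to Y",       "actant": "Parent"},
        {"text": "worked with Z",   "actant": "Colleague"},
    ]

    story = {}
    for item in items:
        chapter = CHAPTER_OF[item["actant"]]
        story.setdefault(chapter, []).append(item["text"])

    print(story)                    # items grouped per biography chapter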
A next step in the research related to DISC is to see how the
system behaves with other information sources, i.e. investigating
the two research questions mentioned above. This is in line with
the research roadmap for the I2RP project.
The next step
My project is to automatically generate presentations, this time
using existing video footage
(www.interviewwithamerica.com). Recently I have made a program
that randomly selects segments of the video footage and assembles
them into a new video with transitions between the segments (the
result is a SMIL file). The idea was to get a feeling for what
kind of presentations can be generated from the material. This
experimental phase was useful because I had to adjust my initial
ambitious plan.
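For illustration, a minimal sketch in Python of what such a
generator could look like (this is not the actual program; the
clip names and durations are made up):

    import random

    CLIPS = {"interview01.mpg": 300, "interview02.mpg": 240}  # clip -> seconds

    def random_segments(n=5, seg_len=10):
        """Pick n random seg_len-second intervals from the available clips."""
        for _ in range(n):
            clip = random.choice(list(CLIPS))
            begin = random.uniform(0, CLIPS[clip] - seg_len)
            yield clip, begin, begin + seg_len

    def to_smil(segments):
        """Write the segments into a SMIL <seq> with a fade between them."""
        items = "\n".join(
            f'      <video src="{clip}" clipBegin="{b:.1f}s"'
            f' clipEnd="{e:.1f}s" transIn="fade"/>'
            for clip, b, e in segments)
        return f"""<smil xmlns="http://www.w3.org/2001/SMIL20/Language">
      <head>
        <transition id="fade" type="fade" dur="1s"/>
      </head>
      <body>
        <seq>
    {items}
        </seq>
      </body>
    </smil>"""

    print(to_smil(random_segments()))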
As I see it now, the subject of the generated presentations will
at first be interviews about a specific subject, e.g. the opinions
of the interviewed people on one issue. This means that the
engine has to select all the fragments where the issue is
addressed, as in the sketch below. Once such a mechanism is in
place, I can start studying how the composition of these fragments
reinforces or weakens a particular interpretation, and what
interpretations can be suggested to the viewer.
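A minimal sketch of this selection step, assuming the fragments
have already been annotated with the subjects they address (the
data here is made up):

    fragments = [                   # hypothetical annotations
        {"clip": "int01.mpg", "begin": 10, "end": 35,
         "subjects": {"war", "economy"}},
        {"clip": "int01.mpg", "begin": 60, "end": 90,
         "subjects": {"education"}},
        {"clip": "int02.mpg", "begin": 5, "end": 40,
         "subjects": {"economy"}},
    ]

    def select(fragments, issue):
        """Return every fragment whose annotation mentions the issue."""
        return [f for f in fragments if issue in f["subjects"]]

    print(select(fragments, "economy"))  # the two fragments on the economy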
My initial plan was a bit too ambitious: "create some sort of
stories, in my intention stories that are not described in the
video. What kind of stories should be clear after the
experimenting phase. One possibility I have in mind is that the
viewer gives an emotional effect the story should provoke and the
engine generates it."
The research direction is the same though: I want to knowingly use
editing to create certain interpretations, even if I think that
limiting the scope of the generated presentations to the original
subject of the video footage (i.e. interviews) is more realistic.
How it fits in the whole
Everything is driven by the problem of reusing existing
information in a new context. This requires an understanding of
"compositional semantics", i.e. the effect of putting together
several things, in order to evaluate the compositional semantics
of the existing information source and of the generated
presentation. This is needed to understand what is required to
break the original document into reusable pieces, and to control
the idea of the whole presentation that a viewer gets by linking
the presentation's different information items, i.e. the
meaning-making process.
In a generated presentation the engine should be aware of this
process as much as possible and control or drive it in the
viewer. Awareness and control are interrelated but not
necessarily equally achievable.
How it fits in Multimedia
In controlling this process of meaning making, especially in
this phase with video, the engine generating the presentation
will use specific video properties that can be symbolized by the
Kuleshov effect (viewers attribute different meanings to the same
shot depending on the shots it is juxtaposed with). The two phases
of awareness and control of the meaning-making process will be
tested in the automatic creation of video in a way which is
different from the presentations generated from the Rijksmuseum
data: there we assume the story to tell is one (the art-historical
truth) and the engine should convey the message without creating
misunderstanding in the viewer, while the video material is more
controversial and the idea is to play with point of view and
interpretation. Both approaches rely on mastering the relationship
(or meaning) creation process, and the first can benefit from the
second.
Literature
The literature search is going along two lines:
- Film theory
- Automatic narrative generation
From film theory I mainly expect to gather an "editing
vocabulary" to use to provoke a certain effect on the
viewer. From automatic narrative generation I mainly expect to
get an overview of the techniques that the engine I want to
implement can use (e.g. grammars, agent-based approaches). A nice
example is TALE-SPIN, and the first paragraph of that paper is
also inspiring: try and play with rules to see how and when the
generation breaks down, considering that video has more ways to
fool a viewer than text (it is more reality-like).
An interesting idea about grammars and new languages is sketched
in Digital Mantras, which in fact is one of my sources of
inspiration.
Considering that the system should try to suggest an
interpretation to the viewer, Terminal Time immediately comes to
mind, and it will be a source of inspiration and
differentiation.
Frank's thesis, which I am now reading, should also provide some
insights.
System design
I expect the theory to focus on the metadata, on the
process, or on both, depending on where more contribution can be
made. On the metadata side I expect some contribution/insight as
to what annotations support the story-building process, e.g.:
- what subjects (actants)
- what temporal relations
- how to deal with continuity
while on the process side I expect insights as to what process can
best use the editing vocabulary to induce meaning in the viewer.
The implementation will go in parallel with the theory
development and will focus on the selection mechanism, probably
first using a sort of grammar (if possible) based on the
cultural aspects and maybe shifting to more complex methods
later. I am particularly interested in grammars because I think it
would be a very interesting result to see whether meaning can be
achieved with such a simple mechanism, or more generally how far
you can get with a grammar; a toy example follows. As I see it,
this is strongly related to the philosophy sketched in Digital
Mantras mentioned above.
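A toy version of such a grammar in Python, with made-up rules and
a made-up fragment vocabulary (this is not a claim about the
actual rules I will use):

    import random

    # Toy context-free grammar: each rule rewrites a symbol into one of
    # several sequences; lowercase symbols are terminals, i.e. types of
    # interview fragments to be fetched from the annotated footage.
    RULES = {
        "Presentation": [["Intro", "Body", "Closing"]],
        "Intro":        [["establishing_shot"], ["interviewer_question"]],
        "Body":         [["Opinion"], ["Opinion", "Body"]],  # one or more
        "Opinion":      [["pro_fragment"], ["contra_fragment"],
                         ["pro_fragment", "contra_fragment"]],  # juxtaposition
        "Closing":      [["summary_fragment"]],
    }

    def generate(symbol="Presentation"):
        """Expand a symbol into a flat list of terminal fragment types."""
        if symbol not in RULES:               # terminal
            return [symbol]
        production = random.choice(RULES[symbol])
        return [t for s in production for t in generate(s)]

    print(generate())
    # e.g. ['establishing_shot', 'pro_fragment', 'contra_fragment',
    #       'summary_fragment']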
What research contributions
Considering my intended first subject of the presentations, a
contribution (especially to the field of Digital Libraries) is
to provide content-based access to large amounts of video
material. Another contribution I expect comes from the
familiarization with the usage of MPEG-7 and related annotation
tools.
These two points will possibly provide some research insights,
but they themselves are not research contributions; the real
research contributions I expect to be:
- The annotation process
- The generation process
The annotation process consists of defining how to annotate and
what to put in the annotation. These two issues are clear in the
video example: what annotation can support the engine in
creating a story out of the video footage, and how do you attach
the annotation to the footage so that a selected scene can
become a building block of an automatically generated story
(there are temporal and editing issues here).
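As an illustration, a minimal sketch of what a fragment annotation
could contain; the field names here are hypothetical, the actual
vocabulary would come from MPEG-7 and the actant model:

    from dataclasses import dataclass, field

    @dataclass
    class FragmentAnnotation:
        clip: str            # source video file
        start: float         # begin time in seconds within the clip
        end: float           # end time in seconds
        subject: str         # issue addressed, e.g. "economy"
        stance: str          # e.g. "pro", "contra", "neutral"
        actant: str          # role of the speaker, e.g. "Worker"
        # continuity features (e.g. gaze direction, framing) to check
        # editing continuity between consecutive fragments
        continuity: dict = field(default_factory=dict)

    fragment = FragmentAnnotation(
        clip="interview03.mpg", start=12.0, end=27.5,
        subject="economy", stance="contra", actant="Worker",
        continuity={"gaze": "left", "shot": "medium close-up"},
    )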
The generation process must create some sort of story, and
different approaches from the literature can be used, from
grammars to independent agents. The actant idea comes into play
here, because the actants should generate the story either by
being characters in it or by influencing the selection process
(for example emotional actants).
Using existing material (not meant to be used in the way I want
to use it) is also interesting because it represents the
default mode in which people create video material: from our
experience and difficulties with it we could in principle
derive a set of guidelines on how to create material that
maximizes the possibility of reuse in the future.
Some References
Meehan, J. R. (1977). TALE-SPIN, an Interactive Program that
Writes Stories. In Proceedings of the Fifth International Joint
Conference on Artificial Intelligence, August 1977.
Holtzman, S. R. (1994). Digital Mantras: The Languages of
Abstract and Virtual Worlds. MIT Press.
Nack, F. (1996). AUTEUR: The Application of Video Semantics and
Theme Representation for Automated Film Editing. Ph.D. Thesis,
Lancaster University.
Davis, M. (1995). Media Streams: Representing Video for Retrieval
and Repurposing. Ph.D. Thesis, Massachusetts Institute of
Technology, Cambridge, Massachusetts.
http://garage.sims.berkeley.edu/pdfs/1995_Marc_Davis_Dissertation.pdf
Remarks
Lloyd observes that the media items I will use for the second
step are not contained in a pre-existing structure, i.e. the video
footage was not shot for a particular structure and now has to be
put into another one; therefore the idea of repurposing does not
apply to it as well as it does to the ARIA database. This is a
correct remark, and I think the relation with the whole
repurposing issue is rather that trying to automatically edit
video footage should give us a clue about the kind of problems we
might find when we automatically put media items together. Whether
the fact that there might be a structure containing these
to-be-reused media items makes the problem of assembling them
easier or more difficult, I still do not know.
Another point that Lloyd stresses is that annotations should be
presentation-independent, whereas we have to deal with annotation
that is specifically presentation-dependent (the ARIA database).
An interesting aspect to consider is also the time constraints
that a presentation should satisfy. This can open up an
interesting field of research that has already been approached
(also in Cuypers) through the use of RST nucleus-satellite
relations (and there is a paper about how RST can serve the
purpose of selecting the most salient points in the domain of
news); a small sketch follows. Michel Crampes has done some work
on this; more when I read his paper again.
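A minimal sketch of that idea: satellites are dropped, least
salient first, until the presentation fits in the available time
(the structure and numbers are made up):

    from dataclasses import dataclass

    @dataclass
    class Segment:
        name: str
        duration: float     # seconds
        nucleus: bool       # nuclei are essential, satellites optional
        salience: float     # higher = more important to keep

    def fit_to_time(segments, max_duration):
        """Drop the least salient satellites until the total fits."""
        kept = list(segments)
        total = sum(s.duration for s in kept)
        # candidate satellites, least salient first
        for sat in sorted((s for s in segments if not s.nucleus),
                          key=lambda s: s.salience):
            if total <= max_duration:
                break
            kept.remove(sat)
            total -= sat.duration
        return kept

    segments = [
        Segment("opinion A",  20, nucleus=True,  salience=1.0),
        Segment("background", 15, nucleus=False, salience=0.3),
        Segment("opinion B",  25, nucleus=True,  salience=1.0),
        Segment("anecdote",   30, nucleus=False, salience=0.5),
    ]
    print([s.name for s in fit_to_time(segments, 60)])
    # ['opinion A', 'opinion B']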
Just a small idea: I am also curious whether SMIL's capability of
zooming in can be used in document editing.