Workshop on Multimodal Output Generation (MOG 2007)

25 and 26 January 2007, Aberdeen, Scotland


Trip Report by Zeljko Obrenovic


Past research in human-computer interaction provides evidence that the use of multiple output modalities makes systems more robust and efficient to use. A quick and successful interaction is expected when, for instance, the system's output is presented to the user via multimedia/hypermedia in which text and graphics are merged, or by a conversational agent that combines the use of speech and gesture. In such multimodal systems sophisticated specifications are needed to combine the different modalities in such a way that each bit of information is presented in the most appropriate manner (i.e., the system should select the most suitable modalities and modality combinations to convey information to the user).

Work on multimodal output generation is spread across disciplines and separate subfields such as multimodal natural language generation, the generation of multimedia/hypermedia presentations, and research on output modalities for conversational agents. Presentation of this work has been scattered across various events. One of our objectives of the MOG 2007 workshop is therefore to bring this work together in one workshop that aims to present the state of the art, and identify future research needs, in multimodal output generation by focusing on two research questions:

(1) what output modalities are most suitable in which situation?
(2) how should different output modalities be combined?

The intention is to provide a forum for researchers on multimodal output generation to meet, exchange ideas and engage in scientific and academic research collaboration.


The workshop was very well organized and very interesting. All the presentations were well prepared, and many people took part in the discussions.

Although the CfP for the workshop included general problems of multimodal interaction, most of the papers addressed natural language processing and virtual characters. However, all the talks were interesting, and some of them are particularly relevant to INS2.

In the rest of the report, I give the list of authors and some notes on their presentations. For the main author of each paper, I have provided a link to their home page, along with a list of their general research interests.


(Legend: bold marks authors present at the workshop.)

Jon Oberlander
University of Edinburgh

  • Discourse generation, individual differences, and multimodality:

    • Intelligent labeling (ILEX, M-PIRO)

    • Personality-based differences in discourse generation (CrAg),

    • Diagrammatic reasoning and communication (MAGIC, COMIC).

Invited talk: What Are You Looking At? A Personal View on Multimodal Output Generation
  • users prefer virtual characters to text, but not to pure voice: T < A < S
  • RUTH talking head, Festival2
10.45-11.15 Erwin Marsi and Ferdi van Rooden
Tilburg University
  • Natural language and speech from a computational point of view.

    • Recognizing Textual Entailment

    • Text-to-text generation and sentence fusion

    • Dependency Parsing

    • Prosody prediction

    • Speech Synthesis, both Text-to-Speech and Concept-to-Speech conversion

    • Talking head animation

    • Natural Language Generation

    • Corpus annotation and validation

    • Morphological analysis and POS tagging of Arabic

    • Machine Learning (memory-based learning in particular)

Expressing Uncertainty with a Talking Head in a Multimodal Question-Answering System
  • Main question: how to add cues of uncertainty to a talking head when it answers a question with or without certainty
  • Problems with experiments and evaluations
11.15-11.45 Dr. Jan Peter de Ruiter
Max Planck Institute for Psycholinguistics, Nijmegen
  • Durational Aspects of Turn-Taking in Spontaneous Face-to-Face and Telephone Dialogues
Some Multimodal Signals in Humans
11.45-12.15 Markus Guhe
University of Edinburgh
  • Cognitive processes underlying communication
  • Processes and mechanisms with which communicative intentions are expressed. Mainly language, but also encompasses other modalities: gesture, facial expression, body posture, expression of the affective state, non-linguistic vocal expression
  • Building computational cognitive models to gain scientific insights
Towards a Cognitive Model of Multimodal Output for Language Production
  • where to put multimodal fission
13.15-14.00 Harry Bunt
Tilburg University
Invited talk: Towards Standardization in Semantic Annotation
  • as undertaken by the ISO organisation
    • giving standard for multimodal representation (ISO group)
    • recent workshop
    • ISO, LIRICS projects, ACL SIGSEM Working Group...
    • ACL-SIGSEM Working Group on the Representation of Multimodal Semantic Information
    • 2005, LIRICS Linguistics...
    • "to prepare int. standards and guidelines for effective language resource management in the multilingual information society"
    • - diversity of theoretical approaches
    • - limited researchers' freedom
    • + reuse and integration of language resources from different sources
    • ISO semantic annotation
      • registry of data categories: temporal information, reference relations, semantic roles, dialog acts, discourse relations
    • LIRICS (European),
      • temporal information, reference relations, semantic roles, dialog acts, but NOT discourse relations
    • no semantic annotation without a semantics
    • trying to design a metamodel is a useful approach to see differences/similarities among approaches
  • Defining semantic roles
    • approaches to semantic role: description model (event/verb-dependent), semantic granularity (coarse, medium, fine)
    • roles of FrameNet
    • semantic roles metamodels
    • current work: test/validate in annotation experiments
  • Data categories for reference annotation
    • central notion is the "markable"
    • additional relational and objectal relations (in addition to lexical ones: synonymy, hyponymy...)
    • punctual and extended events
    • ISO-TimeML
  • Dialog acts (favorite subject)
    • dialogue, turns, sender, overhearer, addressees, utterances, dialog act, semantic content, communicative function
    • communicative functions stressed by dialog
    • utterances have multiple functions ==> multidimensionality
    • DAMSL
    • what is a dimension in dialog: an aspect of communication that can be addressed by dialog acts, independently of other aspects
    • feedback, turn-taking, time, contact attention, opening, closing
    • general purpose functions (to any dimension): informs, question
    • contact management, auto-feedback...
    • pilot testing: for usability by multiple annotators with little training
    • segmentation in dimensions, not in dialog...
    • dimension specific functions
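As a rough illustration of the multidimensional view in the notes above (my own sketch, not something shown in the talk), an utterance can carry several dialog acts at once, each tagged with a dimension and a communicative function. The class names `DialogAct` and `Utterance` and the function label "turn-assign" are hypothetical; only the dimension names and the general-purpose function "question" come from the notes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DialogAct:
    """One communicative function in one dimension (hypothetical structure)."""
    dimension: str   # e.g. "feedback", "turn-taking", "contact"
    function: str    # general-purpose ("inform", "question") or dimension-specific


@dataclass
class Utterance:
    """An utterance annotated with zero or more dialog acts."""
    sender: str
    addressee: str
    text: str
    acts: List[DialogAct] = field(default_factory=list)

    def dimensions(self) -> List[str]:
        # Distinct dimensions addressed by this utterance, in sorted order.
        return sorted({act.dimension for act in self.acts})


# A single utterance that is multifunctional: it checks contact
# and at the same time hands the turn to the addressee.
utterance = Utterance(sender="A", addressee="B", text="Are you still there?")
utterance.acts.append(DialogAct("contact", "question"))
utterance.acts.append(DialogAct("turn-taking", "turn-assign"))  # illustrative label

print(utterance.dimensions())
```

Keeping one act per dimension, rather than one label per utterance, is what lets annotators segment by dimension instead of by dialog turn.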
15.30-16.00 Yulia Bachvarova, Betsy van Dijk and Anton Nijholt
University of Twente
  • Working on a PhD thesis that focuses on developing a formal, computational model of how different modalities communicate within a multimedia presentation. This model is used in the automatic generation of multimedia presentations to provide the generation engine with the knowledge base and algorithms it needs to assign appropriate modality combinations.
  • ICIS/CHIM project
Towards a Unified Knowledge-Based Approach To Modality Choice
16.00-16.30 Željko Obrenovic, Raphaël Troncy and Lynda Hardman
Centrum voor Wiskunde en Informatica, Amsterdam
Vocabularies for Describing Accessibility Issues in Multimodal User Interfaces
16.30-17.00 Charles Callaway
University of Edinburgh
  • Natural Language Generation, research, as well as its integration into cultural activities, learning environments, interfaces for animated agents, and large-scale generation projects.
  • Discourse planning, sentence planning, surface realization, document planning, revision, multimodal explanation generation, spatial expressions, pronominalization, discourse markers, self-explaining documents and multilingual generation
  • PhD Thesis:
Non-localized, Interactive Multimodal Direction Giving
9.00-9.45 Elisabeth André
University of Augsburg, Germany
Invited talk: From Annotated Multimodal Corpora to Simulated Human-like Behaviors
  • Variants of Information Presentation Systems with Virtual Characters
  • TV style, role plays, face-to-face dialogs, multi-party dialogs (future)
  • how to acquire knowledge about multimodal human-human communication (intuitively, not model), how to code such knowledge, how to implement such behaviors in an embodied conversational agent
  • data, models and ECA: analysis-by-observation, analysis-by-synthesis ... synthesis by observation
  • always good to start from models, and then implement it and test it
  • capturing knowledge on human-like behaviors, motion capturing, study of literature, video recordings
  • use a corpus to derive typical behaviors
  • use a corpus to compare human-human and human-agent communication
  • HUMAINE European Network of Excellence: emotions in man-machine communication
  • facial action coding system: automatic generation of mimics based on MPEG-4 standards
  • Ekman 1992 model, four ways how people lie: micro expressions, masks, timing, asymmetry
  • the effect appears only in situations that allow users to concentrate fully on the agent's face
  • Modeling politeness: people apply politeness norms when talking to computers, users feel better if the computer appears polite, book by Brown and Levinson
  • gesture classes: hand/arm/movement, non-communicative (adaptor), communicative (emblem, deictic, illustrative (iconic, metaphoric))
  • annotation of corpora
  • Anvil annotation tool
  • new scheme for annotating politeness gestures
  • how to exploit the knowledge to control the behaviour of an ECA: copy-synthesis approach, over-generate and filter, derivation of rules
  • cross-cultural aspects of politeness: American vs. German students, statistically significant, slight differences were mainly caused by problems in translation
10.15-10.45 Mary Ellen Foster
Technische Universität München
  • Research interests:
    • Multimodal generation
    • Generation in dialogue systems
    • Example-driven generation
    • Variation in generation
    • Practical implementations of all of the above
  • Current projects:
    • JAST - Joint Action Science and Technology
  • Previous projects:
    • COMIC - Conversational Multimodal Interaction with Computers
Issues for Corpus-based Multimodal Generation
  • corpora in text generation, multimodal corpora
  • corpora provide guidance for human developers
  • more direct use: design decision making, automated evaluation (cross-validation)
  • role of variations
  • multimodal corpora: recorded annotated collection of human behavior, annotated on multiple layers with implicit and explicit links between the layers
  • uses of multimedia corpora: analysis, indexing and retrieval, summarisation, generation
  • non-verbal behaviours for ECAs, using human behavior to decide how an agent should work
  • contextual information: characteristics of the speech signal, information structure and affect, motion and error type, intended prosody, syntactic structure, dialog history, user model
  • representing context
  • this could also be an interesting contribution to canonical processes of media production: Canonical Processes of corpus-based multimodal generation


10.45-11.15 Dirk Heylen
University of Twente
  • Generating Expressive Speech for Storytelling Applications
Multimodal Backchannel Generation for Conversational Agents
  • Sensitive Artificial Listener
  • nose tracker device
  • interest in listening heads
11.15-11.45 Paul Piwek
The Open University
Modality Choice for Generation of Referring Acts: Pointing versus Describing
  • AIM: challenge two assumptions common in generation algorithms for multimodal referring acts: non-verbal means of referring are secondary to verbal means, and there is a single strategy for this
  • previous work: point when you cannot express it with words
  • cost of pointing
Adrian Bangerter and Eric Chevalley
University of Neuchâtel
  • Research interest:
    • Social interaction and practices in selection and appraisal interviews
    • Coordination in collaborative work
    • Group processes and teamwork
    • Discourse and conversation analysis of task-related communication
    • Interplay of language and non-verbal communication (gesture)
    • Social representations, Diffusion of ideas and social construction of knowledge
    • Life course research
Pointing and Describing in Referential Communication: When Are Pointing Gestures Used to Communicate?
  • study gestures used in communicative situations
  • show audience design
  • gestures may be functional both for speakers and for communication
  • the question is not whether, but when gestures communicate
  • functions of pointing gestures
13.15-13.45 John Bateman and Renate Henschel
University of Bremen

  • Research
    • Ontologies, particularly for natural language
    • Discourse structure
    • The presentation of information combining presentation modalities: texts, pictures, graphics, layout, video and so on.
    • The automatic production of natural language texts and discourse
    • Various aspects of SFL, particularly on the intersection of SFL and computational linguistic description.
Generating Text, Diagrams and Layout Appropriately According to Genre
  • GeM project
  • communicative artifacts that adopt a page metaphor are combining an increasing array of simultaneous modes
  • systematic means for exploring kinds of means
  • Twyman's classification of the combination of modes in documents (pure linear, linear interrupted, list, linear branching, matrix...)
  • things organized about space
  • Genre and multimodality
  • Kress and van Leeuwen, Waller Rod, Martin (genre)
  • Kress / van Leeuwen a semiotic mode
  • Waller's model of document design
  • corpus basics, GeM annotation scheme (CSS3, XSL-FO)
  • content structure, rhetorical structure, layout structure, navigation structure, linguistic structure
  • basic vocabulary of the page, layout units, ...
  • xml, multilayered annotation ==> non-time based annotation -> no tools for it!
  • gradually transfer the implicit spatial information in the visual image to explicit representational structure
  • import from XSL-FO
  • RST and Layout Structure often diverge => generally layout consequences
  • Xalan-J, XSLT, XSL-FO, FOP, pdf
  • break conditions for a particular genre
  • derive genre constraints
  • problem of working in XSLT framework
  • notion of the virtual canvas
13.45-14.15 Charlotte van Hooijdonk, Emiel Krahmer, Fons Maes
Tilburg University
Mariët Theune and Wauter Bosma
University of Twente
  • Cognitive processing and representation of hyperlinked documents
  • Information presentation in a multimodal environment
  • Experience concerning Culture and Web Design
Towards Automatic Generation of Multimodal Answers to Medical Questions: A cognitive engineering approach
  • IMIX INteractive Multimodal Information eXtraction
  • answer modalities
  • media allocation problem
  • experiment: obtain a corpus of (multimodal) answers to different types of medical questions
  • functions of media: decorational (no informativity), representational (removing it does not alter informativity, but its presence makes the answer clearer), additional (removing it alters informativity)
  • Cohen's kappa scores for agreement between annotators
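For readers unfamiliar with the agreement measure mentioned above, here is a minimal sketch of Cohen's kappa for two annotators, using the three media functions as labels. The function name `cohens_kappa` and the toy annotations are my own assumptions for illustration; they are not data from the IMIX experiment.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions, summed.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy annotations (invented for illustration only).
annotator_1 = ["decorational", "representational", "additional", "representational"]
annotator_2 = ["decorational", "representational", "additional", "additional"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters here because with only three categories two annotators would agree fairly often even when guessing.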
14.15-14.45 Christopher Habel and Cengiz Acartürk
University of Hamburg


  • Representation of knowledge about space, time and events
  • Language comprehension and language production
  • Granularity



On Reciprocal Improvement in Multimodal Generation: Co-reference by text and information graphics
  • combining text & information graphics
  • co-reference in multimodal documents
  • improvement: the role of comprehension in producing complex multimodal documents
  • the conceptual and lexical bases of cross-modal comprehension
  • text: written language vs. speech, monological text vs. dialogues
  • figures: information graphics: line graphs, bar charts; drawings, photographs
  • tables: two dimensional text
  • equations, formulas: part of text vs. separated from text
  • combining modalities is good for sensory substitution
  • Pinker's model of graph comprehension
14.45-15.15 Somayajulu Sripada and Feng Gao
University of Aberdeen
Summarising Dive Computer Data: A case study in integrating textual and graphical presentations of numerical data
  • presenting numerical data in text and graphics
  • quantitative information (QI), always presented using graphical displays
  • do the reduction, and then present the reduced data
  • data summaries integrating TEXT and GRAPHICS could relieve data overload
  • Scuba Diving
  • Dive Computer (DC) - swatch; records all the data about the dive
