Collected thoughts on generating presentations

Patrick Schmitz, Ludicrum Enterprises
  <cogit@ludicrum.org>

Notes on Oscar's thesis

These are just some ideas that arose as I read his thesis. I have added further notes as I read other papers, but his ideas about attention, and some omissions in the architecture, spurred my thinking in this area.

Using function/intent of media rather than media type

It seems critical to consider the function/use/author-intent of a piece of media (e.g., an image) in a presentation, and not just the fact of its media type, or even of its contents. Simply combining "image" as a class with other content is much less meaningful unless the function/intentional-use is included in the analysis.

If the work done to classify audio is moved onto a slightly different axis, I think it becomes much more meaningful, and is also general to all media types (not just audio). This axis is the intended level of user focus. Sticking with his three levels of distinction, we get content that is:

ambient
This is content that is not the central focus in the presentation. It may well be important to create presence, mood, etc. but it is not central to communicating the core concept of the presentation. Background music, graphical backgrounds, margin styling and color, and in some cases, even some video may be largely ambient. Ambient media should require minimal cognitive processing.
focal
This is the media that is carrying the main weight of the communication. It should be the true focus of the user's attention, and is presumably the most important media from the perspective of teaching or communicating. The bulk of the user's cognitive processing should be consumed with media in this category.
alert
This is media that is designed specifically to interrupt, alert, guide or prompt the user. It is typically short, and requires immediate attention but little (or only short-duration) cognitive processing. Some guide content is persistent (e.g., control buttons and iconic guides), but it is still processed in short bursts (when searching for, or actually using, the guides or controls), and so fits well into this category.
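As a rough illustration (not something from the thesis), the function/intended level of focus could be carried as a first-class attribute of a media item, alongside its media type; the names and fields below are hypothetical:

    from dataclasses import dataclass
    from enum import Enum, auto

    class MediaFunction(Enum):
        AMBIENT = auto()   # presence, mood; minimal cognitive processing
        FOCAL = auto()     # carries the main weight of the communication
        ALERT = auto()     # interrupts, guides or prompts; short bursts of attention

    @dataclass
    class MediaItem:
        uri: str
        media_type: str          # "image", "audio", "video", "text", ...
        function: MediaFunction  # intended level of user focus, not just the type

    background_music = MediaItem("bach.mp3", "audio", MediaFunction.AMBIENT)
    portrait = MediaItem("rembrandt.jpg", "image", MediaFunction.FOCAL)

The point is simply that the same media type can play any of the three roles, so the role has to be recorded (or deduced) separately from the type.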

This of course raises the question: how do we determine/deduce/declare the function/intentional-use of a piece of media? When we are generating media, I would think it important that the rhetorical and discourse models note the function of the media, and so this should come for free, to some extent. Stated another way, this would be one of my design criteria or requirements for a model of rhetorical structure or discourse.

In an existing presentation, I suspect that some heuristics might be applied that could help with this. These might well be inspired by the classification scheme that Oscar describes. However, it makes more sense to define a model that handles the broader case (including the exceptions to his rules that he mentions). Once we have a model like this in place, it may make sense to use his simpler rules to approximate the proper model when the function/intentional-use is otherwise unknown.
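Purely as a sketch of what such an approximation might look like (the rules and thresholds here are invented, not Oscar's), a fallback classifier could key off a few easily observed attributes when the function/intentional-use is not declared:

    def guess_function(media_type, is_background_layer, is_interactive_control,
                       duration_s=None):
        # Crude stand-in for the proper model when function/intent is unknown.
        if is_interactive_control:
            return "alert"      # control buttons, iconic guides, prompts
        if duration_s is not None and duration_s < 2.0:
            return "alert"      # very short cues are likely interruptions
        if is_background_layer:
            return "ambient"    # backgrounds, margin styling, mood music
        return "focal"          # otherwise assume it carries the message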

Modeling the temporal duration of static media 

The general assumption (or definition) is that static media has no intrinsic duration. It is usually modeled as having zero or indefinite duration, such that other constraints or some explicit authoring directive controls the effective duration. However, in the context of generated presentations in which media is selected to fit or function in some context, often in association with other media, I think that even static media can often be understood to have an implicit duration. I see two approaches to defining this implicit duration, using an image as an example.

  1. Associate the image with the duration of some time-based media. This may sound obvious, but the point is that if you have some audio talking about Rembrandt and you pick an image of him to show, the duration of the image is implicitly the duration of the associated audio. More generally, the duration of an instance of static media is the duration of the associative context.
    This requires that the timed media be sufficiently annotated to indicate subject portions (sub-segments) within the timed media (e.g., from when to when in the audio Rembrandt is discussed). I would presume that this sort of annotation would be generally useful (if not required) in order to generate a useful presentation that reasonably associates images with audio (for example). This is probably not a new idea.
  2. Understand the cognitive requirements of the image (or more generally of a given piece of static media, or perhaps even more generally of any piece of media), and associate a duration with this. This putative duration value could be based upon simple heuristics that use some general attributes or derived features of the media, and/or it could be based upon experience, where a system can learn, and refine initial values. The duration could possibly be scaled to account for factors like the contextual complexity, cognitive resources of the viewer (based upon profile, accumulated experience, etc.), and fidelity of display (device characteristics). This is probably also not a new idea.
    Some simple examples of this would be:
    1. For text, compute the time it would take to generate speech from the text at a normal pace. This is well understood, and provides a reasonable base value for the duration. Given the ability to derive semantic features from the text, associated with text extents, it should even be possible to produce begin and end times for semantic sections of the text, as a basis for the approach described in 1) above.
    2. For images, high-level features can be combined with annotations to estimate the cognitive processing required. This could combine the image display size, the class of image (photo versus graphic/diagram; landscape, human face, object, etc.; illustration, diagram or map), and the amount of high-frequency information in the image. A rough sketch of such a heuristic follows this list.

    Naturally, one of the cognitive factors for media will be the function or intentional use of the media. For focal media, rules like the above are most likely to be useful. However, ambient media will have an implicit (useful) duration determined by the associated focal content in the presentation. By the same token, alert/interrupt/guide content will either be short (usually constrained to some stylistic rule for consistency across the presentation), or it will be persistent (as with menus, control buttons, guides, etc.), and will be constrained only by the associated presentation context as for ambient media.
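A minimal sketch of such duration heuristics, assuming invented base values and scale factors (the numbers are placeholders, not derived from any study):

    import math

    WORDS_PER_MINUTE = 150.0   # assumed "normal pace" for speech/reading

    def text_duration_seconds(text):
        # Base duration for text: the time it would take to speak it aloud.
        words = len(text.split())
        return 60.0 * words / WORDS_PER_MINUTE

    def image_duration_seconds(display_area_px, image_class, detail_score,
                               complexity_scale=1.0):
        # Very rough duration for an image, combining a few derived features:
        #   image_class      - e.g. "photo", "diagram", "map"
        #   detail_score     - 0..1 estimate of high-frequency information
        #   complexity_scale - contextual / viewer-profile / device adjustment
        base = {"photo": 4.0, "diagram": 8.0, "map": 10.0}.get(image_class, 5.0)
        size_factor = math.log10(max(display_area_px, 10_000)) / 5.0
        return complexity_scale * base * size_factor * (1.0 + detail_score)

A system could start from values like these and then refine them with accumulated experience, as suggested above.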

General observations on the model 

Animation is not really covered. It raises lots of issues w.r.t. attraction, attention, the classification of the media, etc. For the purposes of this discussion, animation basically breaks down into three general categories:

  1. Animation used for style, for transitions, for fun and distraction. This includes motion transitions, wipes, fading between slides, etc. This is really just a kind of styling, and so is largely unimportant to the model. It impacts the temporal layout, and must be considered when solving the temporal constraints. This in turn may mean that what was a last-stage process (styling and pretty effects available in a given language and/or on a particular runtime platform) may ripple back up into the temporal layout. In the terms I described above for media, this class of animation is essentially ambient.
  2. Animation used as content, especially focal content. This is different from video in some ways, and may be significant in terms of the platform, network delivery constraints, etc. However, for the most part this kind of animation is essentially equivalent to video, and so should not require additional thought.
  3. Animation as a compensational tool for layout. This allows for content models that change their layout constraints over time (either in response to user interaction or to a described timeline). For example, if there is not enough room to display all four of a set of closely related images, one alternative is to display iconic images that can indicate the semantic relation, and then grow them over time (e.g. in sequence) or as the user explores them (interactively). 
    This can be more effective than splitting the images over several pages, or converting them to a sequence (slideshow), because the grouping of the images together conveys a particular semantic relation. In particular, scaled-down images can work visually/cognitively much like a low-frequency wavelet function: while the larger image form is required to make details visible, the general visual composition, the choice and balance of color and light, and other "low frequency" information is still clear. Thus, for example, the end user can see and compare the use of color and composition in a set of related paintings, using timed or interactive animation to scale the images up large enough to view details.
    In order to be effective, this may require that the generation engine can synchronize the animation of images with audio that describes features of the paintings. This should not be difficult for the engine, given the appropriate information about the audio (see also the discussion of Modeling the temporal duration of static media); a rough sketch of such a plan follows this list.
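A minimal sketch of how the engine might plan this synchronization, assuming the audio annotations described above are available as (subject, begin, end) segments; the names and default scales are hypothetical:

    from dataclasses import dataclass

    @dataclass
    class AudioSegment:
        subject: str    # e.g. the painting being described in this span of audio
        begin_s: float
        end_s: float

    @dataclass
    class ScaleAnimation:
        image_uri: str
        begin_s: float
        end_s: float
        from_scale: float = 0.25   # iconic / thumbnail size
        to_scale: float = 1.0      # large enough to view details

    def plan_scale_animations(segments, image_for_subject):
        # Grow each image to full size during the span in which its subject
        # is discussed in the audio; other images stay at iconic size.
        plans = []
        for seg in segments:
            uri = image_for_subject.get(seg.subject)
            if uri is not None:
                plans.append(ScaleAnimation(uri, seg.begin_s, seg.end_s))
        return plans

For interactive exploration, the same plan could be triggered by user events instead of the audio timeline.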

Audio is not considered as multichannel or positional. It should be modeled in 3-D as well: audio has position (which leads to stereo balance when spatial processing is done), although it has no extent. Position is particularly important when animated, to enhance presence in a 2-D or 3-D navigation space. It has also been used to model human interruptions (whispering a reminder or a hint in one ear, by placing the audio immediately offscreen to one side).

Audio volume is not considered. To an extent not directly comparable to visual media, audio has volume, which can be used for layering. This can be independent of position (as an attribute of the media), or it can be modeled with spatialized audio (which can include motion animation), and/or it can be controlled by styling (e.g., keying off the function/intent of the audio in context). Text and images can be layered with opacity, but this is not as commonly used as volume is for audio.
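As a sketch of styling keyed off function/intent (the level values below are invented defaults, not measured ones):

    # Default layering levels keyed to the function of the media in context.
    AUDIO_VOLUME   = {"focal": 1.0, "ambient": 0.3, "alert": 0.8}
    VISUAL_OPACITY = {"focal": 1.0, "ambient": 0.4, "alert": 1.0}

    def layering_style(media_type, function):
        # Volume for audio, opacity for visual media; both default to full level.
        if media_type == "audio":
            return {"volume": AUDIO_VOLUME.get(function, 1.0)}
        return {"opacity": VISUAL_OPACITY.get(function, 1.0)}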

Images versus text - a whine. Images are inherently limited in the information they can convey (as are most media types). They are not necessarily richer than text - it depends on what must be communicated. This needs to be better accounted for.  OTOH, if there is no good means of conveying an idea with an image, then an image is unlikely to exist, so the alternative will not present itself.

Open question on text-to-speech: Oscar describes some rules that govern how and when speech can be effective in concert with other media types. Should these rules be considered or extended to support compensational layout tools such as text-to-speech? Note that in particular for small display devices, text-to-speech may be an important strategy to compensate for the lack of screen real-estate and poor text display. Similarly, highly passive modalities (e.g., watching TV) and non-visual modalities (e.g., working with an autoPC) will prefer speech forms. If there is only text, then text-to-speech will be an important tool. It could be modeled as orthogonal (as though it just exists and is "chosen" when media is chosen from the MMDB), but it may also be a compensation tool to deal with constraints much later/deeper in the process/model.
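One way to model the compensation decision, purely as a sketch (the thresholds and modality names are assumptions, not from the thesis):

    def prefer_speech(screen_width_px, screen_height_px, modality):
        # Decide whether text content should be rendered as speech instead.
        # modality: "interactive", "passive" (e.g. watching TV),
        #           or "non-visual" (e.g. an autoPC)
        if modality == "non-visual":
            return True                                    # no useful text display
        too_small = screen_width_px * screen_height_px < 320 * 240
        return modality == "passive" or too_small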

Narrative and non-narrative content 

I think there is an important distinction between narrative content and non-narrative content [1]. My use of these terms is based upon folks I knew who were English and Comparative Literature majors in school, for whom narrative forms were the means of telling stories in the general sense, and included novels, textbooks, movies, poetry, etc. Non-narrative forms are things like query results, which do not generally include a discourse model or equivalent.

The narrative forms lend themselves to a degree of authoring, even if only at some abstract level. Non-narrative forms can incorporate graphic design, and perhaps some intelligent ordering and grouping, but not really authoring as such; presentation generation for them has tended to be more automatic and simplistic, and less heuristic or analytical. By the same token, narrative forms also lend themselves to the application of cognitive modeling, while non-narrative forms will either not be as amenable to cognitive modeling, or will have a simpler and more general cognitive model associated with the app/content model in general, and not per presentation.

Nevertheless, in the area of semantic web queries and applications, it may be very interesting to develop some models for synthesizing discourse, narrative or cognitive models for non-narrative content. I would like to keep this in mind as we explore the application of semantic relations to generation and translation of narrative/discourse/cognitive models.


Footnotes:

[1] I am not sure if there is such a thing as non-narrative expository content (which would sit in between these two); I am pretty sure this can be ignored.