Issue

Article

Vol.28 No.2, April 1996

Article

Issue

Gesture at the User Interface

A CHI '95 Workshop

Alan Wexelblat

Coutaz
Maggioni
Altman
Gao
Fels
Kuehn
Hataoka
Cavazza
Wexelblat
Pook
Mulder
Hibbits
Appendix A

Panel on Issues in Vision and Gesture

Panel on Issues in Gesture Recognition

Panel on Issues in Multimodal Interaction

Panel on Issues in Application Development

About the Author
Address

At the CHI'95 conference, Alan Wexelblat and Dr. Marc Cavazza ran a one-and-one-half day workshop on Gesture at the User Interface. The workshop was attended by ourselves and ten other people working in the field. The attendees came from several different countries and had backgrounds in research, academia, industry, as well as related disciplines such as law and music.

The workshop gave time to each participant to discuss his or her work and included focussed group discussions around the areas of Vision & Gesture, Gesture Recognition, Multimodal Interaction, and Application Development. These discussions addressed a number of issues with lively debate among the workshop attendees. No formal consensus was reached, but the issues are listed in Appendix A so that readers can get a sense of what was discussed.

This report is written by Wexelblat, and any errors should be attributed solely to him.

The attendees were:

Edward Altman: a researcher for ATR, Japan, working on ways to extract continuous motion and structure from gesture and use the results to give symbolic commands to computer systems.
Joelle Coutaz: a software engineer and researcher with IMAG in Grenoble, using active vision to look at people and trying to develop a taxonomic understanding of the functional role of gesture in HCI.
Sidney Fels: a developer working for Virtual Technologies. His recently completed Ph.D. involved creating a system to translation continuous motions into inputs for the formants of a speech synthesizer or musical instrument.
Wen Gao: a visiting professor working with Dr. Rodney Brooks in MIT's AI lab on ways to interpret the full range of human language, including body language.
Nobuo Hataoka: a researcher working for the intelligent systems research department of Hitachi in Japan, interested in multimodal interaction capturing both speech and gesture.
Bernard Hibbits: a law professor from the University of Pittsburgh, doing research on the history of gesture in the law, particularly legal gesture and its relation to other forms of communication.
Volker Kuehn: a consultant for the German government working on the problems of three-dimensional interaction and input with an eventual goal of trying to create a 3-D scanner.
Christoph Maggioni: a developer with Siemens building systems to recognize hand and head gestures in real time on PCs.
Axel Mulder: an independent software developer working in Vancouver on building virtual musical instruments.
Polly Pook: finishing her Ph.D. at the University of Rochester on gesture as a symbolic means of communication with a robot,

Additionally, Cavazza presented his group's work on building a multimodal (speech and deictic gesture) system at Thomson in France and Wexelblat talked about the gesture analysis system he built for his Master's thesis at the MIT Media Lab.

The workshop opened with an introduction from Cavazza, urging people to spend their time talking about what does not run, rather than what does and to share the technical tricks and hard lessons learned from real implementations. This set the tone for the workshop, which focussed largely on practical matters and actual experiences.

Coutaz

Coutaz's presentation served as a good introduction to the topic of gesture itself. She laid out three major functions for gesture:

semiotic: it is used to communicate meaningful information;
ergotic: it is used to perform manipulation in the real world;
epistemic: it is used in learning from the environment (by touching and manipulating objects).

In her experience with HCI, gesture is used primarily for its ergotic uses, particularly in Virtual Reality handling and manipulation of objects. The problem is that this kind of interface too often comes with artificial add-ons which wire the user to the computer; here we hope computer vision can help.

She presented a vision-based gesture system called FingerPaint, which captured 2 1/2-D gestures: the movement of the index finger over a surface and the lifting up or touching down of the finger. The tracked finger created a computer-painted picture, as the system name suggests.

Problems with the system were primarily related to robustness, especially under conditions of changing illumination. A "trick" was used to capture the up/down motion: a small microphone on the desk detected the sound of the finger tapping versus sliding on the surface. The workshop participants expressed the sentiment that this was less a trick than a clever combination of two input modes to do easily what would have been very hard to do with just one mode.

Maggioni

Maggioni presented the GestureComputer project on which he was working, attempting to build a wireless gesture interface. In his opinion, Virtual Reality is only useful for games or entertainment; for in-situ applications such as an office, an augmented reality system would be better.

His hand tracker captures the position, orientation, gesture movement and manipulation of objects at 200-400 pixel resolution. The system also has a head tracking component, which observes the head position and captures changes of viewpoint so that the image being observed by the user can be moved correctly relative to the head. This is done via a trained classifier which works with image samples of skin colors. So far, this works only for light-colored skin, but was resilient to lighting changes.

The intended application areas include imaging CAD drawings, control of view of rooms, and games.

In his opinion, the major engineering problem is to reduce a potentially huge data stream (realtime video plus gesture information) to a smaller representation, on the order of 50 points per image. The major design problem is selecting and implementing the correct set of degrees of freedom -- for a given application it is not always clear ahead of time how many degrees of freedom the user will be able to comprehend and make use of.

He also discussed a number of high-level issues which stood in the way of commercial gestural applications, including ways to integrate input modes such as the head and hand, how to sell products that are gesture-based, and how to package gestural interactions into understandable metaphors -- what are the gestural equivalents for "double-click" or "click&drag"?

Altman

Altman presented his work on recognition of dynamic gestures, specifically symbolic gestures which occur in natural conversation and sign languages such as American Sign Language (ASL) and Japanese Sign Language (JSL). Altman described his focus on one of the hardest parts of gesture recognition -- the subtle changes in meaning that occur as slight variations are made in the movement of a gesture. For example, in ASL, a different way of moving the hands is used to differentiate "give a person some things" from "give each person in turn a thing."

His goal is to be able to create a highly structured representation of gestural movement, capturing the embedding of features that goes on in real-life sign language. His system is intended to be applicable to a large family of motions and shapes, with rapid response time (similar to what is done for speech input processing).

For his work he uses a dynamical system, based on a hand-position sensor feeding information to a PC to capture time-varying signals. Once the raw signals are capture, the system tries to abstract the input for data reduction purposes and then applies recognition algorithms.

His key insight is the application of chaotic processes to do trajectory recognition in the data. The system allows for dynamically formed trajectories, dynamic segmentation of inputs, and transitions between signs. He explained that nonlinear dynamics provides a rich domain for modeling, and one which has not yet been explored by researchers in this field. The idea is to map the gesture in physical space onto a manifold surface and represent the result by a system of equations.

The immediate challenge is to find proper ways to decompose the motion and synchronize the dynamical system and the input so that the hand is accurately modeled by system. One approach is to create multiple models and use the best one to synchronize. Segmentation of motion is done by transition from one manifold to another.

Gao

Gao is trying to use approaches from image processing and speech recognition technology to recognize hand gestures. The ultimate goal is to transform 2D input data into a 1D vector of meanings.

His method involves producing a chain code using the position and orientation of hands and fingers as captured in video images. Motion is determined via difference frame-to-frame. He has a neural network trained on 13 gestures; this approach was chosen because of a basic belief that gesture recognition is a pattern classification process.

The intended end application is a hand-talker for deaf people; that is, a system which would recognize gestures and translate them into spoken language for people who do not speak ASL or another sign language. He is also working on a room-based system which will allow remote control of cameras and associated technologies (such as VCRs, multiple screens, and so on) via deictic gesture integrated with speech.

Fels

The impact of Fels' presentation cannot be adequately captured in written form. His primary presentation was the showing of a video of his system, which uses gestures to control an artificial vocal tract. Speech is the end product, but the process much more closely resembles playing a musical instrument than anything else.

This system, though clearly a "gesture recognizer" didn't fit any of the classifications or taxonomies used by others in the workshop and provoked a great deal of discussion as a result. Fels also pointed out that this system was but one of a spectrum of systems which could perform hand-to-speech mappings, ranging from a word generator to a syllable generator to finger spelling (which is quite common) to a phoneme generator or (as in his current system) an artificial vocal tract.

The artificial vocal tract differs from many other approaches in not being a signal mapping problem. There is a form of class lookup, but no discretization -- it uses continuous motion. Several different approaches were tried. In the final version of the system the user (a trained pianist) controlled a formant-based synthesizer, mapping a cognitive model of gesture space into some task space that has the same characteristics of continuous parameters, high dimensionality, etc. The user's final proficiency was the result of many hours of training.

While the artificial vocal tract is probably not a practical application in itself, it did point out completely new ways of thinking about the problems of gesture input.

Kuehn

Kuehn works with the GIVEN (Gesture-based Interactions in Virtual ENvironments) toolkit. This kit is based around neural networks to account for inter-user variability. He pointed out that gesture-based interactions are powerful for specific tasks, but usually require memorization. In addition, complex dialogs require context switches which can be difficult in ordinary interfaces. The GIVEN solution is to provide 3D widgets and menus while using gesture as an additional input stream.

The goal is to create solutions for (limited) interaction capabilities in gesture-based virtual environments, and the construction of new gesture- and speech-based interactions. To do this GIVEN accepts 6-degrees-of-freedom input, collision results, gesture recognition, and speech recognition. Dialogs can be built using AND or OR capabilities from each channel (e.g. deletion may require both "OK" speech and a correct gesture).

Their current demonstration system uses an interface agent, a graphical representation of a servant, named James. The idea of using a graphical agent is to overcome prompt-/feedback requirements and to give the user an easy visualization of the system's status. For example, when speech is recognized, James' ears become visible; when a gesture is recognized, his eyes open, and when the whole system recognizes an input, an object appears on the tray.

The next steps planned are to change the system over to camera input (no wires), begin experimenting with cooperative work (over ISDN), and test out applications in telemedicine and tele-teaching.

Hataoka

Hataoka's work is on a multimodal system using speech and pointing gestures on a prototype application of a household interior design system. An initial evaluation of the multimodal interface was done to assess the effectiveness of multimodal interfaces with speech and pointing and to clarify a desirable specification for the interface, especially from the viewpoint of utterances.

Further evaluations were done using a comparison wizard-of-oz system. Both the prototype and the mock-up system allowed users a choice of 3 input means:

pointing only (select action, select object, select location),
verbal command and pointing,
sentence utterance and pointing.

Over both experiments the second option came out best, presumably due to errors in sentence recognition, because with wizard-of-oz study, sentence & pointing rated best. If sentence-level recognition can be improved in an actual system, the experiment should be redone to test this hypothesis.

Their future work involves extension to their existing agent-type interface (which subjects have rated "unfriendly") to give better replies, and to increase its politeness. This will be done by the use of more dynamic pointing gestures and by a study of theoretical methods of information integration for asynchronous multimodal inputs.

Cavazza

Cavazza is working on a multimodal control&command system with both desktop and immersive interfaces. In their system, a natural language module extracts voice from a speech recognizer and a gesture feature parser extracts semantic features to produce combined semantic representation. Their gestural interface concentrates in this first version on deictic gestures only. They have three objectives:

1. to explore the relevance of analyzing the semantic content of gestures (beyond their use as pragmatic markers, e.g anaphora deictics);

2. to experiment with how the processing of real communication situations (human-human interaction) can be envisioned with unified formalisms (wherein the final knowledge representation encodes both speech & gesture)

3. to study the real influence of gesture in communication, with applications to HCI.

Cavazza believes that human-human interaction can be envisioned with unified formalisms wherein the final knowledge representation encodes both speech and gesture. The goal of such a unified formalism would be to study the real influence of gesture in communication, with applications to HCI. For example, we know that gesture interacts with natural language at several levels, but seemingly not at syntactic; what does this imply for interface dialogs, which are often organized around a particular (artificial) syntax?

Wexelblat

Wexelblat talked about the difference between gesture recognition and gesture analysis. In his work, he separates what is commonly called recognition into analysis and interpretation. By analogy, this is like what speech recognizers do when they take sound waves from the air and produce words according to some grammar as a result. The words may or may not have meaning in the application context, but that's a separate determination from whether the sound waves map to words in the grammar.

His system produces a computational representation of the gesture (hand position, configuration and movement); this representation is fed to a separate multimodal interpreter which takes speech and eyegaze input as well. The interpreter is responsible for extracting meaning from the inputs and application context.

Wexelblat also operates off a joint-based model of the upper body/arm/hand. Much of the problem involves recovering the actual gesture from the data stream, which contains artificial divisions as a result of the digitization process and the model. Wexelblat's system does not require the user to explicitly mark the start and stop boundaries of a gesture. Since it is not recognizing specific gestures, no training is needed (though the system has to be configured at start-up for the user in order to accommodate different body sizes).

His system tries to capture all the gestures made in natural human conversation, even though any given application may use only a small fraction of these gestures. The gesture analyzer has no knowledge of what the upstream application wants, which limits its ability to perform semantic operations on gestures, but allows it to recognize any performable human motion rather than just recognizing a limited set of templates.

Pook

Pook is working on using gestures to direct robotic manipulators. Robots are interesting in that they're heavily situated in their tasks. Fully autonomous robots have a hard time recovering if the situation changes; in teleoperation task you get high-level control, but lack of force feedback, and lack of awareness of hardware constraints makes control hard for a real-time user.

Her solution is to use a combination: the person is in control for high level direction, but the robot has autonomy for low-level action. Essentially, people teleoperate the robot and record their actions in terms of state variables of robot. A Hidden Markov Model (HMM) is used to provide context for the primitive operations.

The problem is that gesture provides a huge feature space; this leads to the central question of what features are salient parameters for high-level control? Her answer: intention (what are you trying to do) and geometric parameters (spatial binding of actions to relative coordinate frame).

This is implemented as a selected set of pre-shapes for user's instrumented hand. The user makes one of these shapes and the robot executes a low-level program of motion/grasping. The motor program is autonomous.

Mulder

Mulder is interested in the relationships between gesture and music and in automating a process of tracking gestures (postures and movements) and doing a form of analysis or recognition then translating the results into a form which can be fed to a synthesizer to produce sound (music). In general, you can look at conventionally physical musical instruments as mapping a fixed gesture space into a sound space; electronic musical instruments allow mapping into different sets of sound spaces. The end goal is to produce virtual musical instruments which would allow continuous change in mapping from gesture to sound spaces and would allow enlarged gesture spaces.

In the ideal implementation, the entire instrument would be virtual -- musicians would just make hand/arm/finger motions with auditory and kinaesthetic feedback. An advantage of this would be allowing multiple levels of abstraction -- musicians could play virtual musical instruments directly or conduct them in an indirect fashion. The virtual instruments would be given intelligent, programmable object behavior.

This could lead to possible changes in musical performance technique:

sound sculpture tools,
active listeners (instrumental karaoke),
conductor and virtual orchestra,
live dance: interactive accompaniment (look at the dance steps and create music to go with them).

Hibbits

Hibbits works in the field of legal history. The law is traditionally identified with writing, however an emerging new view (influenced by radio and audio-tape) emphasizes a more oral tradition. This tradition has long historical roots.

Gesture -- purposive body motion or symbolic motion -- can signify a legal change or condition such as the establishment of a contract, oath, or land-transfer. Gesture used to be an essential part of legal relationships, some of which are still preserved (e.g. "raise your right hand" to take oath) today.

There are some different categories of legal gesture:

indicative: indicates or signals that a change is taking place and marks parties/objects to transactions;
ordinative: orders or organizes meaning within a broader legal ceremony;
punctuative: divides into bits or chunks;
evidentiary: manifests intent or willingness and often supports speech;
demonstrative: shows the nature of the legal agreements or creates community in space -- you have to be present to witness the gesture -- and in time -- via formalized gestures that are passed down from ancient to present;
mnemonic: gives dynamic visual image to aid memory and draw attention.

Hibbits pointed out that the survival of gesture in the modern era indicates the power of the communication mode; with TV/video (such as video wills) challenging the dominance of text legal gesture may again prosper. Machines may become depositories of legal documents and may need to understand gestures. One interesting aspect of this is the symbolic manipulation of virtual objects -- what happens when this becomes imbued with legal significance?

Appendix A

Panel on Issues in Vision and Gesture

What features of the task can be used to drive the vision; how do we constrain what we expect? What are the right levels of abstraction?
What computer vision techniques are useful for gesture recognition?
What about tactile/force feedback? How much feedback can we expect from the real world versus added from devices?
What is the relationship of computer clothing/wearables to computer vision?
2D versus 3D gestures -- vision may be more useful for 2D. How to integrate vision with other inputs such as direct manipulation of 3D objects?
If we're mimicking human interaction -- what do people see, really?
Time for processing? Button-push is very fast, but vision processing never fast enough?

Panel on Issues in Gesture Recognition

What features are salient?
Is gesture recognition really pattern classification?
Is gesture recognition enough for sign language recognition?
Why used particular technologies (HMM, neural nets) for recognition?
Is recognition even sensible for human gesture?
How many degrees of freedom are really needed?
Continuous versus discrete?
Can any recognition be done without context?
Free/empty-hand gestures versus pen-gesture?
What constitutes a gesture? (How much of hand/arm/body/gaze do we count?)
Are there semantics of gesture at all? Are there cross-cultural semantics?

Panel on Issues in Multimodal Interaction

What do we do about temporal windowing and synchronization of multiple modes of input?
How to handle environments with multiple people speaking to computers?
What modes when?
How far can we go in simultaneous processing of speech and gesture?
Are there verbal utterances which are ambiguous, where the gesture disambiguates/clarifies/corrects, other than referential cases?

Panel on Issues in Application Development

What are the "right" applications?
What are the ergonomic issues of using gestures?
What is the role of learning/what can we expect of the user?
What is the role of feedback (especially in terms of establishing context)?
Can we work with "spontaneous" gestures?
Viewpoint manipulation versus object manipulation
What technologies do we need to develop for gesture applications to be practical?
Is taxonomy an application-specific problem?
What are the differences between gesture as a novel input mode to applications versus gesture as replacement for other input devices
How much can we do with a small set of gestures? (If it's hard to learn a large set of gestures, what's the right minimal set?)
What aspects/features of applications cause us to want to use gestures?
How can we map gestures into useful application action (intuition, level of abstraction, level of autonomy)?
Does everything really depend on the application? Or is there another notion of context that is meaningful for gestures?

About the Author

Alan Wexelblat is currently a Ph.D. candidate in the Autonomous Agents group at the MIT Media Lab; he completed his Masters work there in May, 1994, in the Advanced Human Interface Group. His thesis described a system and process for gesture analysis.

Address

Alan Wexelblat
MIT Media Lab - Intelligent Agents Group
20 Ames St., E15-305
Cambridge, MA 02139 USA
wex@media.mit.edu

Issue
Article
Vol.28 No.2, April 1996
Article
Issue