Vol.28 No.2, April 1996
At the CHI'95 conference, Alan Wexelblat and Dr. Marc Cavazza ran a one-and-one-half-day workshop on Gesture at the User Interface. The workshop was attended by ourselves and ten other people working in the field. The attendees came from several different countries and had backgrounds in research, academia, and industry, as well as in related disciplines such as law and music.
The workshop gave time to each participant to discuss his or her work and included focussed group discussions around the areas of Vision & Gesture, Gesture Recognition, Multimodal Interaction, and Application Development. These discussions addressed a number of issues with lively debate among the workshop attendees. No formal consensus was reached, but the issues are listed in Appendix A so that readers can get a sense of what was discussed.
This report is written by Wexelblat, and any errors should be attributed solely to him.
The attendees were: Coutaz, Maggioni, Altman, Gao, Fels, Kuehn, Hataoka, Pook, Mulder, and Hibbits; their presentations are summarized below.
Additionally, Cavazza presented his group's work on building a multimodal (speech and deictic gesture) system at Thomson in France and Wexelblat talked about the gesture analysis system he built for his Master's thesis at the MIT Media Lab.
The workshop opened with an introduction from Cavazza, urging people to spend their time talking about what does not run, rather than what does and to share the technical tricks and hard lessons learned from real implementations. This set the tone for the workshop, which focussed largely on practical matters and actual experiences.
Coutaz's presentation served as a good introduction to the topic of gesture itself. She laid out three major functions for gesture:
1. ergotic: using the hand to act on and modify the physical environment (manipulation);
2. epistemic: using the hand to learn about the environment through touch and haptic exploration;
3. semiotic: using the hand to communicate meaningful information.
In her experience with HCI, gesture is used primarily for its ergotic function, particularly in Virtual Reality for handling and manipulating objects. The problem is that this kind of interface too often comes with artificial add-ons which wire the user to the computer; here we hope computer vision can help.
She presented a vision-based gesture system called FingerPaint, which captured 2 1/2-D gestures: the movement of the index finger over a surface and the lifting up or touching down of the finger. The tracked finger created a computer-painted picture, as the system name suggests. Problems with the system were primarily related to robustness, especially under conditions of changing illumination. A "trick" was used to capture the up/down motion: a small microphone on the desk detected the sound of the finger tapping versus sliding on the surface. The workshop participants expressed the sentiment that this was less a trick than a clever combination of two input modes to do easily what would have been very hard to do with just one mode.
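For illustration, the combination might look something like the following sketch; this is not FingerPaint's code, and the frame rate, energy threshold, and data layout are assumptions made for the example.

    import numpy as np

    def finger_events(positions, mic_samples, frame_rate=30, sample_rate=8000,
                      tap_threshold=0.5):
        """Fuse a vision-tracked finger path with microphone energy.

        positions   : (N, 2) array of per-frame finger coordinates from the tracker
        mic_samples : 1-D array of audio samples recorded in sync with the frames
        Returns (frame_index, x, y, touching) tuples, where `touching` is decided
        from the short-term audio energy rather than from the image.
        """
        samples_per_frame = sample_rate // frame_rate
        events = []
        for i, (x, y) in enumerate(positions):
            chunk = mic_samples[i * samples_per_frame:(i + 1) * samples_per_frame]
            # A loud chunk is taken as the finger tapping or sliding on the desk.
            energy = np.sqrt(np.mean(chunk ** 2)) if len(chunk) else 0.0
            events.append((i, float(x), float(y), energy > tap_threshold))
        return events

    if __name__ == "__main__":
        # Tiny synthetic demo: a straight stroke, with the mic "hearing" contact
        # during the second half of the movement.
        pos = np.stack([np.linspace(0, 10, 6), np.linspace(0, 5, 6)], axis=1)
        audio = np.concatenate([np.zeros(800), 0.9 * np.ones(800)])
        for ev in finger_events(pos, audio):
            print(ev)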
Maggioni presented the GestureComputer project on which he was working, attempting to build a wireless gesture interface. In his opinion, Virtual Reality is only useful for games or entertainment; for in-situ applications such as an office, an augmented reality system would be better.
His hand tracker captures the position, orientation, and movement of the hand and the manipulation of objects at 200-400 pixel resolution. The system also has a head-tracking component, which observes the head position and captures changes of viewpoint so that the image presented to the user can be moved correctly relative to the head. This is done via a trained classifier which works from image samples of skin colors. So far this works only for light-colored skin, but it is resilient to lighting changes.
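A minimal sketch of what a color-sample-trained skin classifier can look like is given below. It is not Maggioni's classifier; the sample pixels, the normalized r-g color space (which is what buys some lighting resilience), and the distance threshold are all assumptions made for the example.

    import numpy as np

    def train_skin_model(skin_pixels):
        """Fit a mean/covariance model in normalized r-g space.

        skin_pixels : (N, 3) array of RGB samples taken from skin regions.
        Normalizing by brightness (r = R/(R+G+B), g = G/(R+G+B)) is what gives
        the model some resilience to illumination changes.
        """
        rgb = np.asarray(skin_pixels, dtype=float)
        s = rgb.sum(axis=1, keepdims=True) + 1e-6
        rg = rgb[:, :2] / s
        mean = rg.mean(axis=0)
        cov = np.cov(rg, rowvar=False) + 1e-5 * np.eye(2)   # regularize small samples
        return mean, np.linalg.inv(cov)

    def is_skin(pixel, mean, inv_cov, threshold=9.0):
        """Classify one RGB pixel by Mahalanobis distance in r-g space."""
        pixel = np.asarray(pixel, dtype=float)
        rg = pixel[:2] / (pixel.sum() + 1e-6)
        d = rg - mean
        return float(d @ inv_cov @ d) < threshold

    if __name__ == "__main__":
        # Invented training samples standing in for pixels labelled as skin.
        samples = np.array([[200, 150, 120], [180, 130, 100], [220, 170, 140]])
        mean, inv_cov = train_skin_model(samples)
        print(is_skin([190, 140, 110], mean, inv_cov))   # likely skin
        print(is_skin([40, 60, 200], mean, inv_cov))     # likely not skin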
The intended application areas include imaging CAD drawings, control of view of rooms, and games.
In his opinion, the major engineering problem is to reduce a potentially huge data stream (realtime video plus gesture information) to a smaller representation, on the order of 50 points per image. The major design problem is selecting and implementing the correct set of degrees of freedom -- for a given application it is not always clear ahead of time how many degrees of freedom the user will be able to comprehend and make use of.
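As one illustration of that kind of reduction (a sketch, not the GestureComputer's algorithm; the choice of arc-length resampling and the 50-point target are assumptions), a traced hand contour can be cut down to a fixed number of evenly spaced points:

    import numpy as np

    def resample_contour(points, n_out=50):
        """Reduce an arbitrarily long 2-D contour to n_out points spaced
        evenly along its arc length, keeping the overall shape.

        points : (N, 2) array of pixel coordinates (N typically in the hundreds).
        Returns an (n_out, 2) array.
        """
        points = np.asarray(points, dtype=float)
        seg = np.linalg.norm(np.diff(points, axis=0), axis=1)
        arc = np.concatenate([[0.0], np.cumsum(seg)])
        targets = np.linspace(0.0, arc[-1], n_out)
        x = np.interp(targets, arc, points[:, 0])
        y = np.interp(targets, arc, points[:, 1])
        return np.stack([x, y], axis=1)

    if __name__ == "__main__":
        # A 400-point circle reduced to 50 representative points.
        t = np.linspace(0, 2 * np.pi, 400)
        contour = np.stack([np.cos(t), np.sin(t)], axis=1)
        print(resample_contour(contour).shape)   # (50, 2)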
He also discussed a number of high-level issues which stood in the way of commercial gestural applications, including ways to integrate input modes such as the head and hand, how to sell products that are gesture-based, and how to package gestural interactions into understandable metaphors -- what are the gestural equivalents for "double-click" or "click&drag"?
Altman presented his work on recognition of dynamic gestures, specifically symbolic gestures which occur in natural conversation and sign languages such as American Sign Language (ASL) and Japanese Sign Language (JSL). Altman described his focus on one of the hardest parts of gesture recognition -- the subtle changes in meaning that occur as slight variations are made in the movement of a gesture. For example, in ASL, a different way of moving the hands is used to differentiate "give a person some things" from "give each person in turn a thing."
His goal is to be able to create a highly structured representation of gestural movement, capturing the embedding of features that goes on in real-life sign language. His system is intended to be applicable to a large family of motions and shapes, with rapid response time (similar to what is done for speech input processing).
For his work he uses a dynamical system, based on a hand-position sensor feeding information to a PC to capture time-varying signals. Once the raw signals are captured, the system tries to abstract the input for data-reduction purposes and then applies recognition algorithms.
His key insight is the application of chaotic processes to do trajectory recognition in the data. The system allows for dynamically formed trajectories, dynamic segmentation of inputs, and transitions between signs. He explained that nonlinear dynamics provides a rich domain for modeling, and one which has not yet been explored by researchers in this field. The idea is to map the gesture in physical space onto a manifold surface and represent the result by a system of equations.
The immediate challenge is to find proper ways to decompose the motion and synchronize the dynamical system with the input so that the hand is accurately modeled by the system. One approach is to create multiple models and use the best one to synchronize. Segmentation of motion is done by the transition from one manifold to another.
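The "multiple models, pick the best one" idea can be illustrated with a much simpler stand-in than Altman's nonlinear dynamics. The sketch below uses a bank of one-step predictors and declares a segment boundary whenever the best-fitting predictor changes; the model bank, window size, and test trajectory are invented for the example.

    import numpy as np

    def best_model_segments(samples, models, window=2):
        """Segment a 1-D trajectory by which predictive model fits best.

        models : dict name -> function taking the last `window` samples and
                 returning a prediction for the next sample.
        Each incoming sample is labelled with the model whose one-step prediction
        was closest; a boundary is declared whenever the winning model changes.
        """
        samples = np.asarray(samples, dtype=float)
        labels, boundaries = [], []
        for i in range(window, len(samples)):
            history = samples[i - window:i]
            errors = {name: abs(predict(history) - samples[i])
                      for name, predict in models.items()}
            labels.append(min(errors, key=errors.get))
            if len(labels) > 1 and labels[-1] != labels[-2]:
                boundaries.append(i)
        return labels, boundaries

    if __name__ == "__main__":
        # Invented model bank: a "hold still" model vs. a "keep moving" model.
        models = {
            "constant": lambda w: w[-1],
            "linear":   lambda w: w[-1] + (w[-1] - w[-2]),
        }
        # A hold followed by a steady sweep: the winner switches partway through.
        traj = np.concatenate([np.zeros(20), np.arange(0, 20, 1.0)])
        labels, boundaries = best_model_segments(traj, models)
        print(boundaries)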
Gao is trying to use approaches from image processing and speech recognition technology to recognize hand gestures. The ultimate goal is to transform 2D input data into a 1D vector of meanings.
His method involves producing a chain code using the position and orientation of hands and fingers as captured in video images. Motion is determined via frame-to-frame differencing. He has a neural network trained on 13 gestures; this approach was chosen because of a basic belief that gesture recognition is a pattern-classification process.
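A rough sketch of the two front-end steps he described -- frame differencing and chain coding -- is shown below. The threshold, the 8-direction code, and the toy data are assumptions, and the trained network itself is not shown.

    import numpy as np

    def motion_centroid(prev_frame, frame, threshold=30):
        """Centroid of the pixels that changed between two grey-level frames."""
        diff = np.abs(frame.astype(int) - prev_frame.astype(int)) > threshold
        ys, xs = np.nonzero(diff)
        if len(xs) == 0:
            return None
        return float(xs.mean()), float(ys.mean())

    def chain_code(points):
        """Quantize a sequence of (x, y) positions into 8-direction Freeman codes
        (0 = east, counting counter-clockwise in 45-degree steps)."""
        codes = []
        for (x0, y0), (x1, y1) in zip(points, points[1:]):
            angle = np.arctan2(y1 - y0, x1 - x0)
            codes.append(int(round(angle / (np.pi / 4))) % 8)
        return codes

    if __name__ == "__main__":
        prev = np.zeros((4, 4), dtype=np.uint8)
        curr = prev.copy()
        curr[1, 2] = 255                        # one pixel changed between frames
        print(motion_centroid(prev, curr))      # (2.0, 1.0)
        # A centroid path moving right then up quantizes to codes [0, 0, 2, 2];
        # code strings like this are what the 13-gesture classifier is trained on.
        path = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
        print(chain_code(path))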
The intended end application is a hand-talker for deaf people; that is, a system which would recognize gestures and translate them into spoken language for people who do not speak ASL or another sign language. He is also working on a room-based system which will allow remote control of cameras and associated technologies (such as VCRs, multiple screens, and so on) via deictic gesture integrated with speech.
The impact of Fels' presentation cannot be adequately captured in written form. His primary presentation was the showing of a video of his system, which uses gestures to control an artificial vocal tract. Speech is the end product, but the process much more closely resembles playing a musical instrument than anything else.
This system, though clearly a "gesture recognizer," didn't fit any of the classifications or taxonomies used by others in the workshop and provoked a great deal of discussion as a result. Fels also pointed out that this system was but one of a spectrum of systems which could perform hand-to-speech mappings, ranging from a word generator to a syllable generator to finger spelling (which is quite common) to a phoneme generator or (as in his current system) an artificial vocal tract.
The artificial vocal tract differs from many other approaches in not being a signal-mapping problem. There is a form of class lookup, but no discretization -- it uses continuous motion. Several different approaches were tried. In the final version of the system the user (a trained pianist) controlled a formant-based synthesizer, mapping a cognitive model of gesture space into a task space with the same characteristics: continuous parameters, high dimensionality, and so on. The user's final proficiency was the result of many hours of training.
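To make the contrast with discrete classification concrete, here is a small sketch of a continuous gesture-to-formant mapping. It is not Fels' implementation; the vowel targets, the hand-space anchor points, and the inverse-distance interpolation are all choices made for the example.

    import numpy as np

    # Rough first/second formant targets (Hz) for three vowels -- textbook
    # reference values used for illustration only.
    VOWELS = {
        "a": np.array([730.0, 1090.0]),
        "i": np.array([270.0, 2290.0]),
        "u": np.array([300.0,  870.0]),
    }
    # Hypothetical hand-space anchor points for each vowel (normalized x, y).
    ANCHORS = {
        "a": np.array([0.5, 0.0]),
        "i": np.array([0.0, 1.0]),
        "u": np.array([1.0, 1.0]),
    }

    def hand_to_formants(hand_xy):
        """Continuously interpolate formant targets from a 2-D hand position.

        Uses inverse-distance weighting over the vowel anchors, so moving the
        hand smoothly sweeps the formants rather than snapping between classes
        -- the key difference from a discrete gesture classifier.
        """
        hand_xy = np.asarray(hand_xy, dtype=float)
        weights, targets = [], []
        for name, anchor in ANCHORS.items():
            d = np.linalg.norm(hand_xy - anchor)
            weights.append(1.0 / (d + 1e-6))
            targets.append(VOWELS[name])
        weights = np.array(weights) / np.sum(weights)
        return weights @ np.stack(targets)

    if __name__ == "__main__":
        for pos in [(0.5, 0.0), (0.25, 0.5), (0.0, 1.0)]:
            print(pos, hand_to_formants(pos))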
While the artificial vocal tract is probably not a practical application in itself, it did point out completely new ways of thinking about the problems of gesture input.
Kuehn works with the GIVEN (Gesture-based Interactions in Virtual ENvironments) toolkit. This kit is based around neural networks to account for inter-user variability. He pointed out that gesture-based interactions are powerful for specific tasks, but usually require memorization. In addition, complex dialogs require context switches which can be difficult in ordinary interfaces. The GIVEN solution is to provide 3D widgets and menus while using gesture as an additional input stream.
The goal is to create solutions for (limited) interaction capabilities in gesture-based virtual environments, and the construction of new gesture- and speech-based interactions. To do this GIVEN accepts 6-degrees-of-freedom input, collision results, gesture recognition, and speech recognition. Dialogs can be built using AND or OR capabilities from each channel (e.g. deletion may require both "OK" speech and a correct gesture).
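A minimal sketch of that kind of AND/OR dialog composition follows; the event structure and token names are invented placeholders, not GIVEN's API.

    from dataclasses import dataclass

    @dataclass
    class ChannelEvent:
        channel: str   # e.g. "speech", "gesture"
        token: str     # recognized word or gesture label

    def make_rule(name, all_of=(), any_of=()):
        """Build a dialog rule over multimodal channel events.

        all_of : (channel, token) pairs that must ALL be present (AND).
        any_of : (channel, token) pairs of which at least ONE must be present (OR).
        """
        def check(events):
            seen = {(e.channel, e.token) for e in events}
            ok_and = all(req in seen for req in all_of)
            ok_or = (not any_of) or any(opt in seen for opt in any_of)
            return name if ok_and and ok_or else None
        return check

    if __name__ == "__main__":
        # The deletion example from the workshop: requires BOTH the spoken "OK"
        # and the matching gesture.
        delete_rule = make_rule("delete",
                                all_of=[("speech", "OK"), ("gesture", "delete")])
        events = [ChannelEvent("gesture", "delete"), ChannelEvent("speech", "OK")]
        print(delete_rule(events))       # "delete"
        print(delete_rule(events[:1]))   # None -- gesture alone is not enough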
Their current demonstration system uses an interface agent, a graphical representation of a servant named James. The idea of using a graphical agent is to meet prompting and feedback requirements and to give the user an easy visualization of the system's status. For example, when speech is recognized, James' ears become visible; when a gesture is recognized, his eyes open; and when the whole system recognizes an input, an object appears on the tray.
The next steps planned are to change the system over to camera input (no wires), begin experimenting with cooperative work (over ISDN), and test out applications in telemedicine and tele-teaching.
Hataoka's work is on a multimodal system using speech and pointing gestures, prototyped as a household interior design application. An initial evaluation of the multimodal interface was done to assess the effectiveness of combining speech and pointing and to clarify a desirable specification for the interface, especially with respect to users' utterances.
Further evaluations were done using a comparison wizard-of-oz system. Both the prototype and the mock-up system allowed users a choice of 3 input means:
Across both experiments the second option came out best, presumably due to errors in sentence recognition: in the wizard-of-oz study, where recognition was simulated, sentence plus pointing rated best. If sentence-level recognition can be improved in an actual system, the experiment should be redone to test this hypothesis.
Their future work involves extending their existing agent-type interface (which subjects have rated "unfriendly") to give better replies and to increase its politeness. This will be done by the use of more dynamic pointing gestures and by a study of theoretical methods of information integration for asynchronous multimodal inputs.
Cavazza is working on a multimodal command-and-control system with both desktop and immersive interfaces. In their system, a natural language module processes the output of a speech recognizer and a gesture feature parser extracts semantic features; the two are merged into a combined semantic representation. In this first version, the gestural interface concentrates on deictic gestures only. They have three objectives:
1. to explore the relevance of analyzing the semantic content of gestures (beyond their use as pragmatic markers, e.g. anaphoric deictics);
2. to experiment with how the processing of real communication situations (human-human interaction) can be envisioned with unified formalisms, wherein the final knowledge representation encodes both speech and gesture;
3. to study the real influence of gesture in communication, with applications to HCI.
In Cavazza's view, the payoff of such a unified formalism is to understand the real influence of gesture in communication and to apply that understanding to HCI. For example, we know that gesture interacts with natural language at several levels, but seemingly not at the syntactic level; what does this imply for interface dialogs, which are often organized around a particular (artificial) syntax?
Wexelblat talked about the difference between gesture recognition and gesture analysis. In his work, he separates what is commonly called recognition into analysis and interpretation. By analogy, this is like what speech recognizers do when they take sound waves from the air and produce words according to some grammar as a result. The words may or may not have meaning in the application context, but that's a separate determination from whether the sound waves map to words in the grammar.
His system produces a computational representation of the gesture (hand position, configuration and movement); this representation is fed to a separate multimodal interpreter which takes speech and eyegaze input as well. The interpreter is responsible for extracting meaning from the inputs and application context.
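The division of labor might be sketched as follows; the data structures, the toy analyzer, and the deictic-resolution rule are invented for illustration and are not Wexelblat's implementation.

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class GestureDescription:
        """Output of the analyzer: what the hand did, with no meaning attached."""
        path: List[tuple]          # sampled hand positions
        handshape: str             # e.g. "point", "flat", "fist"

    def analyze(raw_samples) -> GestureDescription:
        """Analysis stage: turn raw sensor samples into a description.

        Here just a toy reduction -- keep every fifth position and tag a fixed
        handshape; a real analyzer would work from a joint-based body model.
        """
        return GestureDescription(path=list(raw_samples)[::5], handshape="point")

    def interpret(gesture: GestureDescription,
                  speech: str,
                  objects: dict) -> Optional[str]:
        """Interpretation stage: combine the description with speech and context.

        Only at this point does the gesture acquire application meaning.
        """
        if gesture.handshape == "point" and "that" in speech and gesture.path:
            # Resolve the deictic "that" against the final pointed-at position.
            end = gesture.path[-1]
            target = min(objects,
                         key=lambda n: (objects[n][0] - end[0]) ** 2
                                     + (objects[n][1] - end[1]) ** 2)
            return f"select {target}"
        return None

    if __name__ == "__main__":
        samples = [(i * 0.1, 0.0) for i in range(30)]          # hand sweeping right
        scene = {"lamp": (3.0, 0.0), "chair": (0.0, 3.0)}      # invented context
        print(interpret(analyze(samples), "move that", scene))  # select lamp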
Wexelblat's system operates on a joint-based model of the upper body, arm, and hand. Much of the problem involves recovering the actual gesture from the data stream, which contains artificial divisions introduced by the digitization process and the model. Wexelblat's system does not require the user to explicitly mark the start and stop boundaries of a gesture. Since it is not recognizing specific gestures, no training is needed (though the system has to be configured at start-up to accommodate different body sizes).
His system tries to capture all the gestures made in natural human conversation, even though any given application may use only a small fraction of these gestures. The gesture analyzer has no knowledge of what the upstream application wants, which limits its ability to perform semantic operations on gestures, but allows it to recognize any performable human motion rather than just recognizing a limited set of templates.
Pook is working on using gestures to direct robotic manipulators. Robots are interesting in that they're heavily situated in their tasks. Fully autonomous robots have a hard time recovering if the situation changes; in a teleoperation task the person has high-level control, but the lack of force feedback and of awareness of hardware constraints makes real-time control hard for the operator.
Her solution is to use a combination: the person is in control for high-level direction, but the robot has autonomy for low-level action. Essentially, people teleoperate the robot and their actions are recorded in terms of the robot's state variables. A Hidden Markov Model (HMM) is used to provide context for the primitive operations.
The problem is that gesture provides a huge feature space; this leads to the central question: which features are salient parameters for high-level control? Her answer: intention (what you are trying to do) and geometric parameters (spatial binding of actions to a relative coordinate frame).
This is implemented as a selected set of pre-shapes for the user's instrumented hand. The user makes one of these shapes and the robot executes a low-level program of motion and grasping. The motor program is autonomous.
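A bare-bones sketch of the pre-shape dispatch idea is given below; the shape names, flexion templates, and motor-program labels are invented, and the nearest-template rule stands in for whatever recognition Pook's system actually uses.

    # Each recognized pre-shape dispatches an autonomous low-level motor program.
    PRESHAPES = {
        "pinch": "precision_grasp",
        "palm":  "push",
        "point": "move_to_indicated_point",
    }

    def classify_preshape(finger_flexion):
        """Map raw finger-flexion readings (0 = straight, 1 = fully bent)
        onto one of the named pre-shapes with a crude nearest-template rule."""
        templates = {
            "pinch": [0.8, 0.8, 0.2, 0.2, 0.2],
            "palm":  [0.0, 0.0, 0.0, 0.0, 0.0],
            "point": [0.2, 0.0, 0.9, 0.9, 0.9],
        }
        def dist(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))
        return min(templates, key=lambda name: dist(finger_flexion, templates[name]))

    def dispatch(finger_flexion):
        """Pick the motor program for the recognized pre-shape; the robot would
        then run that program autonomously."""
        shape = classify_preshape(finger_flexion)
        return shape, PRESHAPES[shape]

    if __name__ == "__main__":
        print(dispatch([0.75, 0.85, 0.1, 0.15, 0.2]))   # ('pinch', 'precision_grasp')
        print(dispatch([0.1, 0.05, 0.95, 0.9, 0.85]))   # ('point', 'move_to_indicated_point')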
Mulder is interested in the relationships between gesture and music, and in automating the process of tracking gestures (postures and movements), performing a form of analysis or recognition on them, and translating the results into a form which can be fed to a synthesizer to produce sound. In general, conventional physical musical instruments can be seen as mapping a fixed gesture space into a sound space; electronic musical instruments allow mapping into different sets of sound spaces. The end goal is to produce virtual musical instruments which would allow continuous change in the mapping from gesture space to sound space and would allow enlarged gesture spaces.
In the ideal implementation, the entire instrument would be virtual -- musicians would just make hand/arm/finger motions with auditory and kinaesthetic feedback. An advantage of this would be allowing multiple levels of abstraction -- musicians could play virtual musical instruments directly or conduct them in an indirect fashion. The virtual instruments would be given intelligent, programmable object behavior.
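The "changeable mapping" idea can be sketched as follows; the parameter names, ranges, and the two example mappings are invented for illustration.

    def scale(value, lo, hi):
        """Map a normalized 0..1 gesture value into a synthesizer range."""
        return lo + max(0.0, min(1.0, value)) * (hi - lo)

    # Two alternative gesture-to-sound mappings for the same virtual instrument;
    # the routing itself can be swapped while playing.
    MAPPINGS = {
        "theremin_like": {
            "hand_height":   ("pitch_hz",  110.0, 880.0),
            "hand_openness": ("amplitude",   0.0,   1.0),
        },
        "filter_play": {
            "hand_height":   ("cutoff_hz", 200.0, 8000.0),
            "hand_openness": ("resonance",   0.0,    0.9),
        },
    }

    def gesture_to_sound(hand_params, mapping_name):
        """Translate tracked hand parameters into synthesizer control values
        under the currently selected mapping."""
        controls = {}
        for gesture_param, (synth_param, lo, hi) in MAPPINGS[mapping_name].items():
            controls[synth_param] = scale(hand_params.get(gesture_param, 0.0), lo, hi)
        return controls

    if __name__ == "__main__":
        hand = {"hand_height": 0.5, "hand_openness": 0.8}
        print(gesture_to_sound(hand, "theremin_like"))  # pitch/amplitude control
        print(gesture_to_sound(hand, "filter_play"))    # same gesture, new sound space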
This could lead to changes in musical performance technique:
Hibbits works in the field of legal history. The law is traditionally identified with writing; however, an emerging view (influenced by radio and audio-tape) emphasizes a more oral tradition. This tradition has long historical roots.
Gesture -- purposive body motion or symbolic motion -- can signify a legal change or condition such as the establishment of a contract, an oath, or a land transfer. Gesture used to be an essential part of legal relationships, some of which are still preserved today (e.g. raising the right hand to take an oath).
There are some different categories of legal gesture:
Hibbits pointed out that the survival of gesture in the modern era indicates the power of the communication mode; with TV and video (such as video wills) challenging the dominance of text, legal gesture may again prosper. Machines may become depositories of legal documents and may need to understand gestures. One interesting aspect of this is the symbolic manipulation of virtual objects -- what happens when this becomes imbued with legal significance?
Alan Wexelblat is currently a Ph.D. candidate in the Autonomous Agents group at the MIT Media Lab; he completed his Masters work there in May, 1994, in the Advanced Human Interface Group. His thesis described a system and process for gesture analysis.
Alan Wexelblat
MIT Media Lab - Intelligent Agents Group
20 Ames St., E15-305
Cambridge, MA 02139 USA
wex@media.mit.edu