Vol.28 No.4, October 1996
During the last decade there has been significant progress in the development of Automatic Speech Recognition (ASR) systems. As a result of technical advances in speech modeling techniques, recognition search strategies, and other areas, combined with the increased processing power of workstations and PCs, large vocabulary continuous speech recognition is now feasible even under the constraints and demanding conditions imposed by the public switched telephone network. These new technical capabilities, along with advances in Natural Language Processing, have opened up the possibility of a wide range of new services and applications, and have made it possible to incorporate more natural styles of human-computer verbal interactions.
The purpose of the workshop was to bring together a small group of researchers and practitioners and focus discussion on how to design applications and services that rely on speech as the primary medium for communication between the user and the system. The workshop participants represented an interesting mix of researchers and practitioners affiliated with academia, the computer industry, telecommunication providers, and vendors of speech technology. Workshop participants from academia included Maxine Cohen from Nova Southeastern University, Sharon Oviatt from the Oregon Graduate Institute of Science & Technology, and Bernhard Suhm from Carnegie Mellon University. Barry Arons, from Speech Interaction Research, and David Duff and Susann Luperfoy from the MITRE Corporation provided the perspective of consultants who are active in the field. Michael Cohen from Nuance Communications, Nancy Gardner from Dragon Systems, Caroline Henton from Voice Processing Corporation (now at Digital), and Matt Marx from Applied Language Technologies work for vendors of speech technology. Stephan Gamm from Philips, Joseph Mankoski from Apple, Catherine Wolf from IBM, and Nicole Yankelovich from Sun brought the perspective of the computer industry, while Susan Boyce, Candy Kamm and Amir Mané from AT&T, Demetrios Karis from GTE Labs, and Eileen Schwab from Ameritech brought the perspective of the telecommunications industry.
We asked would-be participants to send us a position paper discussing one of the following topics: What implications does the current state of recognition technology have for design? What design principles should one follow? And what design methodology should be used? We also asked participants to provide us with examples of successful and not-so-successful existing applications. As part of the preparation for the workshop we shared all the position papers and provided all participants with an opportunity to call in to several demonstrations of voice interfaces. This report follows the path of our discussion and captures its highlights. It represents the collective learning of the workshop organizers and may not represent the views of the other workshop participants.
The design of user interfaces for speech-based applications is dominated by the underlying ASR technology. More often than not, design decisions are based more on the kind of recognition the technology can support than on the best dialogue for the user. The type of design will depend, broadly, on the answer to this question: What type of speech input can the system handle, and when can it handle it? When isolated words are all the recognizer can handle, the success of the application will depend on our ability as designers to construct dialogues that lead the user to respond using single words. Word spotting and the ability to support more complex grammars open up additional flexibility in the design, but can also make the design more difficult by allowing a more diverse set of responses from the user. Some current systems allow a limited form of natural language input, but only within a very specific domain at any particular point in the interaction. For example, when dates are requested, acceptable phrases might include "June 15," "Next Wednesday," and "Thursday, December 12th". Even in these cases, the prompts must constrain the natural language within acceptable bounds. No systems allow unconstrained natural language interaction, and it's important to note that most human-human transactions over the phone do not permit unconstrained natural language either. Typically, a customer service representative will structure the conversation by asking a series of questions.
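To make this kind of constraint concrete, here is a minimal sketch, in Python, of a prompt-specific date grammar. It is purely illustrative and not drawn from any system discussed at the workshop: only the phrasings the designer anticipated are accepted, and anything else would trigger a re-prompt.

```python
import re

# Illustrative only: a tiny, hand-built grammar for the date prompt described
# above. Only anticipated phrasings are accepted; everything else is rejected
# and the caller would be re-prompted.
MONTHS = r"(january|february|march|april|may|june|july|august|september|october|november|december)"
WEEKDAYS = r"(monday|tuesday|wednesday|thursday|friday|saturday|sunday)"

DATE_PATTERNS = [
    MONTHS + r"\s+\d{1,2}(st|nd|rd|th)?",                        # "June 15"
    r"next\s+" + WEEKDAYS,                                       # "Next Wednesday"
    WEEKDAYS + r",?\s+" + MONTHS + r"\s+\d{1,2}(st|nd|rd|th)?",  # "Thursday, December 12th"
]

def accepts_date(utterance):
    """Return True if the utterance matches one of the allowed date phrasings."""
    text = utterance.lower().strip()
    return any(re.fullmatch(pattern, text) for pattern in DATE_PATTERNS)

if __name__ == "__main__":
    for phrase in ["June 15", "Next Wednesday", "Thursday, December 12th", "whenever you like"]:
        print(phrase, "->", "accepted" if accepts_date(phrase) else "rejected: re-prompt")
```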
With "barge-in" (also called "cut-through") a caller can interrupt prompts and the system will still be able to process the speech, although recognition performance will generally be lower. This obviously has a dramatic influence on the prompt design, because when barge-in is available it's possible to write longer more informative prompts and let experienced users barge-in. Interruptions are very common in human-human conversations, and in many applications designers have found that without barge-in people often have problems. There are a variety of situations, however, in which it may not be possible to implement barge-in. In these cases, it is still usually possible to implement successful applications, but particular care must be taken in the dialogue design and error messages.
Another situation in which technology influences design involves error recovery. It is especially frustrating when a system makes the same mistake twice, but when the active vocabulary can be updated dynamically, recognizer choices that the user has rejected can be eliminated, and the recognizer will never make the same mistake twice. Also, when more than one choice is available (this is not always the case, as some recognizers return only the top choice), then after the top choice is disconfirmed, the second choice can be presented.
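A minimal sketch of this recovery strategy, assuming a hypothetical recognizer that returns an n-best list of (word, confidence) pairs: once the caller rejects a hypothesis, it is dropped from the candidate set, so the same mistake is never offered twice.

```python
# Hypothetical sketch: walk down the n-best list, dropping each hypothesis the
# caller rejects so it is never offered again.
def recover(n_best, confirm):
    """n_best: list of (word, confidence) pairs, best first.
    confirm: asks the caller "Did you say X?" and returns True or False."""
    candidates = [word for word, _ in n_best]
    while candidates:
        guess = candidates.pop(0)     # try the best remaining hypothesis
        if confirm(guess):
            return guess
        # Rejected: the word stays out of the candidate set from here on.
    return None                       # everything rejected; re-prompt or escalate

if __name__ == "__main__":
    n_best = [("Austin", 0.62), ("Boston", 0.58), ("Houston", 0.21)]
    answers = iter([False, True])     # caller says "no" to Austin, "yes" to Boston
    print("Confirmed:", recover(n_best, lambda word: next(answers)))
```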
It is interesting that some of the demonstrations described below involved very large vocabularies and worked quite well, even over the public telephone network (PSTN). Only a few years ago, most experts were predicting that it would be a long, long time before commercial systems could operate in real time, over the PSTN, and deal with vocabularies of over a thousand items. Raw recognition power has increased dramatically and made this possible, while the transition to more conversational dialogs is happening much more slowly. Given the complexity and variability of human conversational behavior, this is not surprising.
Although many recognition vendors boast about the size of the vocabulary their systems can support, with some having dictionaries of more than 100,000 words (never active simultaneously, of course), we should keep in mind that any one particular conversation uses a surprisingly small set of unique words (often called "types"). In human-human conversations for both air-travel planning (Kowtko & Price, 1989) and for installing, disconnecting, or inquiring about telephone service (Karis & Dobroth, 1995), the total number of types (across multiple conversations) was less than 1,200. In a typical twelve-minute conversation to install telephone service, only 188 different words were used on average.
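For readers unfamiliar with the type/token distinction, here is a small illustrative sketch (the sample sentence is invented) that counts unique words in a transcript:

```python
import re
from collections import Counter

def type_token_counts(transcript):
    """Count unique words ("types") and total words ("tokens") in a transcript."""
    tokens = re.findall(r"[a-z']+", transcript.lower())
    return len(Counter(tokens)), len(tokens)

if __name__ == "__main__":
    sample = "I'd like to order new phone service, please, new phone service at my new address."
    n_types, n_tokens = type_token_counts(sample)
    print(f"{n_types} types across {n_tokens} tokens")
```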
Despite the fact that people often attribute more intelligence to simple systems than is warranted, and have been exposed to completely natural human-computer dialogues on TV and in the movies, in many interactions with speech-based systems people seem to want explicit guidance about what to say and how to say it. As explained below, one of the most critical tasks for a designer of speech recognition systems is to communicate effectively to the user what is appropriate to say at each point in the interaction.
Mike Cohen, from Nuance Communications, gave a presentation explaining why there has been such a lag between the impressive developments in research labs and the release of viable commercial products. Mike started by providing a chronology of recent advances in speech recognition. A major factor in the rapid advancement during the last decade has been the sustained support from the Advanced Research Projects Agency (ARPA) of the US Department of Defense. The fruits of this research are now slowly making their way into the marketplace, often via start-up companies with connections to the original ARPA grant recipients. Although the research systems often performed superbly in the laboratory on the tasks and domains for which they were designed, there are several difficulties in creating commercial products from these research systems: accuracy on complex tasks with large vocabularies is still not high enough for commercial systems; research systems were originally trained using speech collected under unrealistic conditions (e.g., "read" speech recorded in a studio environment); the emphasis was on accuracy and not rejection, which is critically important for commercial applications; and there was little emphasis placed on developing systems that could run in real time (many, in fact, ran in more than 100x real time).
In the new world order of telecommunication deregulation, the cost of recognition resources is becoming an important factor in the decision about whether to deploy new speech-based services. Even for telecommunication applications in which recognition occurs in the network, there is pressure for DSP-based rather than workstation-based recognizers in order to reduce costs. Given the software engineering and coding required to implement current recognition algorithms on DSP cards, the top recognizers from many of the new start-up companies can only run on workstations, or on a combination of DSP cards and workstations, and this is slowing their widespread deployment. Commercial electronic products employing ASR, in contrast, need to run on a single chip.
While speech technologists still view recognition accuracy as the yardstick for their progress, there was agreement among the participants that other factors were also critical. Participants expressed a need for better rejection and, in general, for better information about the recognition process, including a list of possible recognizer outcomes along with confidence values for each. More flexible tools are also needed that allow grammars to be constructed dynamically in near real time, along with better language understanding tools. Another need, discussed further in the methodology session, was for prototyping tools that would allow designers to specify how the interface will be implemented and to test the usability of part or all of the interface early in the development cycle.
Eileen Schwab of Ameritech gave a brief presentation in which she argued that speech recognition will only succeed in telecommunications applications in which:
a) The caller does not have an easily available alternative, or
b) There is a desperate need to use speech rather than touchtones.
She then described several applications that illustrate her point. Ameritech's Automated Alternate Billing System (AABS), which allows callers to place a collect or calling card call, is an example of an application that succeeds because callers do not have an easily accessible alternative. A second application, Home Incarceration, is successful for the same reason. The Home Incarceration service uses speaker verification technology to allow some non-violent prisoners to serve their sentences at home rather than in prison. The system calls the inmate 10-20 times per day at random intervals and uses speaker verification to authenticate the person's identity. Schwab reported that users of the Home Incarceration service accept the technology because the alternative of being sent to prison is much worse.
The second class of successful speech technology applications discussed consisted of those in which users had a desperate need to use speech rather than touchtones. Examples of these kinds of applications were voice dialing for cell phones and voice control of voice messaging from a cell phone environment. Schwab noted that voice dialing applications for regular land-line environments didn't meet either criterion and have met internal resistance that stopped them from reaching the marketplace.
Several interesting examples of the challenges of using speech recognition in the public telephone network came up in discussion. Schwab related some experiences she has had with certain voices mimicking the sound of a touchtone (the problem of "talk-off"), which can cause the system to respond in confusing ways (you might say "Mom" or "Bob," but the system recognizes a keypress of 0). In addition, there were problems with certain kinds of answering machines that did not release the phone line when expected and actually placed outgoing calls when the network recognizer interpreted the answering machine's recorded greeting as a voice dialing command.
Several sections of the workshop dealt specifically with user interface issues. Roughly, these issues were broken down into three broad topic areas: multimodal interaction, dialog management, and error handling.
Cathy Wolf from IBM's T.J. Watson Research Center gave a short presentation about user interface issues related to multimodal speech systems. She defined three types of multimodal interfaces:
A major design challenge is to select the appropriate modality for each user task. This involves matching the strengths of a particular modality to the task. For example, speech is probably not a good choice if the user's task involves interacting with a computer while talking to another person, as when a nurse takes a patient's medical history. In many applications, however, the expressive power of language can be a valuable complement to other modalities.
Another challenge, according to Cathy, is to set users' expectations appropriately. A system that uses speech output and sounds "fluent" may be misleading to the user if the system can only accept limited speech input. It is not in people's common experience to encounter a person who can speak fluently, but not understand very well. Non-native speakers, for example, typically have the opposite problem. They can often understand the language better than they can produce it.
Combining speech with a GUI is perhaps the most common type of multimodal interface involving speech. Cathy makes a clear distinction between a graphical interface that includes speech, and a speech interface that includes graphics.
In the first case, of a graphical interface that includes speech, there are several potential benefits to including both modalities. Users can specify actions and objects that are not visible (e.g., while in a map application, "Show me the intersection of Hope Street and George Street," or while in a spreadsheet, "What's on my calendar?"). Users can also specify an action and an object in a single chunk (e.g., "Copy the heading to page 3."), or in parallel (e.g., "Move this there."). Finally, speech output is a way to present short segments of information without cluttering the screen or obscuring the current display (e.g., for presenting spot help or the results of a short query).
According to Cathy, the advantages of including graphical output in a primarily speech interface are also numerous. Graphical output can be used as feedback to confirm a user's request. Since graphical output is more persistent than speech, users can refer back to the visual display if they have forgotten what was spoken aloud. Graphical output can also be used as a prompt to let the user know what he or she can say. This potentially cuts down on the need to speak prompts aloud. Since speech is a slow output channel, the graphical display can be a time-saver for the user. Another benefit of an integrated GUI is that the user can specify an object by pointing rather than by naming or describing. Once an object is selected, this selection can be used to constrain the speech input. For example, after selecting a circle in a speech-enabled graphics editor, the user's input can be constrained to commands related to circles, such as "Make the line thicker," or "Change the color to light blue." Finally, keypad, keyboard, or mouse input can act as a fallback input mechanism if speech recognition fails.
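The idea of using a GUI selection to constrain speech input can be sketched very simply. The command phrases and object names below are invented for illustration; the point is only that the active grammar shrinks to what makes sense for the selected object.

```python
# Invented example: the active grammar depends on what is selected in the GUI.
GLOBAL_COMMANDS = ["undo", "help", "save the drawing"]

OBJECT_COMMANDS = {
    "circle":   ["make the line thicker", "change the color to light blue", "make it bigger"],
    "text box": ["make it bold", "change the font", "delete the text"],
}

def active_phrases(selected_object=None):
    """Phrases the recognizer should listen for, given the current selection."""
    phrases = list(GLOBAL_COMMANDS)
    if selected_object is not None:
        phrases += OBJECT_COMMANDS.get(selected_object, [])
    return phrases

if __name__ == "__main__":
    print(active_phrases("circle"))   # global commands plus circle-specific ones
    print(active_phrases())           # nothing selected: global commands only
```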
During the discussion in this portion of the workshop, Sharon Oviatt shared her experiences with multimodal map design. She described a series of user studies that included a graphical map interface combined with pen and voice input (see her paper in the CHI '96 Proceedings, "Multimodal Interfaces for Dynamic Interactive Maps"). Consistent with other research findings, her group determined that users have a strong preference for using multiple input modalities, even if the task could be accomplished using any one modality alone.
Caroline Henton, formerly of Voice Processing Corporation (VPC) and now at Digital, opened the segment on dialog management and proposed a set of topics that fall under this broad category. These include dialog design for different user types, prompt design, use of anthropomorphism, feedback messages, on-line help, and localization issues.
In describing the Phone Wizard system she helped design at VPC, Caroline shared with the group the speech user interface (SUI) principles she followed. Here's a condensed, generalized version of the principles:
A good portion of the discussion in this segment of the workshop focused on the controversial issue of anthropomorphism. All participants agreed that a system should be cooperative; however, there was no agreement on whether or not anthropomorphism was the best way to achieve this goal. The Wildfire call-in demo (see below), which exhibits strong anthropomorphism, was particularly controversial. Some argued that, as users become expert with the system, the use of personality and characterization would grow old. Others felt that, if done well, a system with or without anthropomorphism could be equally effective.
Colloquial dialogue, which sometimes accompanies anthropomorphism, was another controversial topic. Some participants felt colloquial phrases such as "Oops, my mistake," or "Got it" would help to put users at ease, while others felt this too would become tiresome.
On a less controversial note, participants were unanimous in the opinion that speech recognizers overlaid onto existing touch tone interfaces without any redesign usually produce disastrous results. Asking callers to "say or press one" offers no improvement over touch tones, except for those callers with rotary or cellular telephones.
In discussing various types of speech interfaces, a common problem that emerged involved users' inability to interpret silence. Especially in speech-only systems, silence can mean either that the speech recognizer didn't hear an utterance or that it is processing the user's input. In a demonstration provided by Matt Marx from Applied Language Technologies, an "audio watch cursor" that covers silence with a signature percolating sound provided an effective method of eliminating one of the two types of silence. This technique leaves silence to mean, unambiguously, that the system did not hear the request.
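A bare-bones sketch of the audio watch cursor idea, with a printed placeholder standing in for the percolating sound and an artificial delay standing in for recognition:

```python
import threading
import time

def percolate(stop_event):
    """Loop a short 'working' sound until told to stop (print is a stand-in)."""
    while not stop_event.is_set():
        print("~ perc ~")
        time.sleep(0.3)

def recognize_with_watch_cursor(do_work):
    """Run the slow recognition step while the watch-cursor sound plays."""
    stop = threading.Event()
    threading.Thread(target=percolate, args=(stop,), daemon=True).start()
    try:
        return do_work()
    finally:
        stop.set()      # silence now means only one thing: nothing was heard

def fake_recognition():
    time.sleep(1.0)     # stand-in for recognition plus a database lookup
    return "flight times to Boston"

if __name__ == "__main__":
    print("RESULT:", recognize_with_watch_cursor(fake_recognition))
```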
In a slightly different vein, the discussion moved to the intellectual property issues that could arise from "copying" others' designs. Except for specific trademark sounds, the consensus was that designers should be free to copy effective aspects of other designs, as is common practice in the realm of graphical interface design.
Bernhard Suhm from Carnegie Mellon University presented his thoughts on error correction in another segment of the workshop. He described two phases: error detection and error recovery. Before an error can be corrected, either the user or the system must detect that an error has occurred. The system can do this using confidence measures, but often the burden is on the user to detect errors. In simple cases, the user indicates an error by rejecting the system's request for a confirmation (e.g., Computer: "Did you say to hang up?", User: "No"). In a dialog system, the user might be able to indicate and repair errors conversationally (e.g., User: "No, I said to hold on."). And in a multimodal system, the user may be able to point to or highlight an error.
Once an error is detected, there is a range of techniques that a system might use to allow a user to correct it, in addition to the conversational repair illustrated above. The simplest means of error correction is for the user to repeat the input. This might be done by respeaking, by spelling, by typing, or by handwriting the correction. In a multimodal system, the user may also be able to select from a list of next-best choices. Often, paraphrasing the input is more helpful than simply repeating it.
Susann Luperfoy outlined a slightly more detailed scheme for error correction. In the systems that she's worked with, the steps for correcting an error are as follows:
The same basic design principles that apply to GUIs and multimodal systems apply to speech-based systems. User-centered design, iterative cycles of design-test-modify, and other techniques and approaches can be employed whether the interface involves speech, a GUI, a control panel, or various combinations of speech and other modalities. This seemed to be accepted by all participants, and there was little discussion about the design process. There was strong agreement, however, that the designer must not work in isolation. The user-interface designer needs to work closely with a "speech technologist" who has an intimate knowledge of the recognizer and its capabilities, as well as with a usability specialist (assuming the designer is not also versed in usability testing).
There was discussion about user-interface specifications, the need for better development tools, and simulation techniques. There was great variability among the participants in both the use and format of specifications. On one hand, there are some fairly complex commercial applications for which there are no detailed paper specifications at all -- the code running the application is all there is. This tends to happen when the same group of people is both designing and building the application. On the other hand, when a design has to be handed off, either between units in the same company, or from one company to another, then specifications are essential. This is usually the case in the telecommunications industry, where outside vendors build a system under contract according to specifications, or else a separate development organization within the company builds according to the specification.
Several participants noted that, although essential, preparing and maintaining specifications is a very time-consuming task, and that information transfer between designers and application developers is still inefficient. Current systems are frequently specified using flow charts, although other techniques are also used, such as a text-based state notation format. There was agreement that new techniques would have to be developed to specify the more conversational systems now being planned, and that a complete specification might include several different documents (e.g., general instructions on how time-outs, touch tone input, and other aspects should be handled; a flow chart or state notation format representing the relationship between various states; the grammar for each state; and a list of the prompts and any other words or sounds played to the caller).
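As one possible shape for such a specification, here is a hypothetical fragment of a text-based state notation, expressed as a Python data structure. The state names, prompts, and grammar names are invented; the point is that each state declares its prompt, its grammar, and how time-outs, touch-tone input, and rejections are handled.

```python
# Hypothetical fragment of a dialogue specification in a state-notation style.
SPEC = {
    "GetPaymentAmount": {
        "prompt":       "How much would you like to pay?",
        "grammar":      "dollar_amounts",
        "on_success":   "ConfirmPayment",
        "on_timeout":   {"reprompt": "Please say the amount, for example 'forty dollars'.",
                         "max_tries": 2,
                         "then": "TransferToAgent"},
        "on_touchtone": "GetPaymentAmountDTMF",
        "on_reject":    "GetPaymentAmountRetry",
    },
    "ConfirmPayment": {
        "prompt":     "I heard {amount}. Is that correct?",
        "grammar":    "yes_no",
        "on_success": "ProcessPayment",
        "on_reject":  "GetPaymentAmount",
    },
}
```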
A variety of companies now provide development environments that make creating applications much easier than in the past, when everything had to be coded in low level languages. As designers, however, we need even higher level scripting languages and modules to work with. For example, many applications ask for confirmation or other yes-no requests, dates and times, locations, dollar amounts, and so on. Having multi-state dialogue modules that would also include error recovery routines for these common elements would speed development of new applications enormously. There would be a need to customize these modules, of course, but having the basic structure to start with would be very helpful.
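A sketch of what one such reusable module might look like: a yes/no confirmation sub-dialogue with its own re-prompting and fallback. The play_prompt and recognize callbacks are assumed to be supplied by the platform and are faked here for the example.

```python
# Hypothetical reusable module: a yes/no confirmation with built-in error recovery.
def confirm(play_prompt, recognize, question, max_tries=3):
    """Return True or False, or None if the caller could not be understood."""
    for attempt in range(max_tries):
        prompt = question if attempt == 0 else "Sorry, please answer yes or no. " + question
        play_prompt(prompt)
        result = recognize(grammar="yes_no")   # may return "yes", "no", or None
        if result == "yes":
            return True
        if result == "no":
            return False
        # Time-out or rejection: fall through and re-prompt.
    return None   # hand off to a human operator or a touch-tone fallback

if __name__ == "__main__":
    canned = iter([None, "yes"])               # first attempt fails, second succeeds
    answer = confirm(lambda p: print("PROMPT:", p),
                     lambda grammar: next(canned),
                     "Did you say you want to transfer five hundred dollars?")
    print("Confirmed:", answer)
```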
Wizard-of-Oz Simulations: The use of a Wizard-of-Oz technique, in which a human (the Wizard) simulates an automated system, is very effective during design. If there are easy-to-use specifications for a system, this type of simulation is easy to carry out. The specifications (for example, flow charts with associated prompts) are spread out over a table, or on the walls around the room, and the experimenter reads the appropriate prompts from the specs, waits for a response from the subject (or no response), checks the specs on how to proceed, and then speaks the next prompt. This technique can be used with the subject in the same room, or over the phone, or more sophisticated computer controlled versions can be set up in which the experimenter chooses the correct system response and a recorded prompt is played to the subject. These simulation techniques, even the more "primitive" versions, are very effective in uncovering problems with logic, navigation, awkward sequences of prompts, omissions, and so on. Even with effective prototyping tools, the non-computer-mediated Wizard technique is often preferable for quickly evaluating different design alternatives. An open issue is how to simulate recognizer errors, and how detailed a knowledge one must have of the recognizer in order to do this effectively. Also, what are the limitations of simulation studies? That is, under what circumstances will they be misleading or unproductive? We were unable to answer these questions.
Simple applications that involve only speech are very easy to simulate using Wizard-of-Oz techniques, but some participants pointed out that simulating more complex applications can be difficult. The system response to a user's utterance may depend upon a variety of constraints and variables, and processing these in real time can be difficult for the Wizard. Simulating multimodal systems is also possible, and in some cases researchers (including Sharon Oviatt) have developed fairly elaborate computer-based simulation environments to support the Wizard.
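As a rough illustration of a computer-mediated Wizard setup, the fragment below lets the Wizard pick the system's next response from a menu while a pre-recorded prompt would be played to the subject. The states and prompts are invented for the example.

```python
# Invented example of a simple Wizard-of-Oz console.
STATES = {
    "1": ("Greeting",   "Welcome to the banking demo. What would you like to do?"),
    "2": ("GetAccount", "Which account: checking or savings?"),
    "3": ("Reject",     "Sorry, I didn't understand. Please try again."),
    "4": ("Goodbye",    "Thank you for calling. Goodbye."),
}

def wizard_console():
    while True:
        for key, (name, _) in STATES.items():
            print(f"  [{key}] {name}")
        choice = input("Wizard: choose the next system response (q to quit): ").strip()
        if choice == "q":
            break
        if choice in STATES:
            print("TO SUBJECT:", STATES[choice][1])   # stand-in for playing a recorded prompt

if __name__ == "__main__":
    wizard_console()
```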
An important part of the workshop was the use of call-in ASR demonstrations. Before the workshop we distributed a list of demos in order to create a "common ground" of understanding of current ASR capabilities and designs, and as a means to promote discussion. We included demos that were publicly advertised and available, and we also made special arrangements with a number of leading speech recognition companies to permit access, for a short period of time before the workshop, to some of their current applications that were under development. The demos that were available only to workshop participants included a system for purchasing music over the telephone (the caller could select from a list of nearly 2,500 artists and over 4,000 albums), a system for travelers that automatically connected them to a company's 800 number after they spoke the name, a call router with over 2,000 names on it, two banking applications, Sun Microsystems' SpeechActs system that allows users to call in and access their electronic mail and their on-line calendar, a voice dialing application that includes a personal directory, and a soon-to-be-released competitor to Wildfire called Phone Wizard (produced by VPC). In addition, Amir Mané demonstrated an AT&T car reservation system during the workshop.
We also included a variety of interactive voice response systems in our list of demos. We need to be familiar with these systems because many of our users will be familiar with them, and because many applications are now combining both touch tone and speech input.
Here are a few of the publicly available ASR demos. We can't guarantee that they'll be up when you call, and we should point out that these demos include examples of both very good and very bad designs. We do not, of course, recommend any of these companies or products; they are listed here only to provide experience with ASR systems to members of the CHI community.
It is hard to capture in this report the highly interactive nature of the workshop experience and the passion with which participants argued their positions on several of the key issues, such as whether "to beep" or "not to beep" (at the end of a prompt, indicating that it is the user's turn to speak). We believe that this excitement and involvement are also a testament to the vigor with which the field is growing.
The progress made over the last few years in speech technology has generated new opportunities and new challenges for designers. One common need that was expressed by the participants is for a rich set of tools for specifying and prototyping user interfaces, and for a rich exchange of information between the part of the application that governs the user interface and the part that analyzes the user's speech.
The workshop successfully covered a gamut of issues that underlie the design of speech-based interfaces. For each we had a theoretical discussion, peppered with concrete examples. Based on the feedback that the attendees provided to the organizers of CHI, we concluded that most participants were very pleased with the workshop. One common sentiment was that, with this theoretical discussion as background, we can now focus more on the practical challenges of designing speech-based interfaces. Adding another day to the workshop at CHI 96 was not feasible; therefore, it is our intent to conduct another workshop at CHI 97. This time, we plan to follow the format of a design exercise, one in which various sub-teams will come up with alternatives for a given problem and the group as a whole will compare and critique the various solutions. If this sparks your interest, please follow the upcoming issues of the SIGCHI Bulletin and the CHI 97 web site for details.
Karis, D., & Dobroth, K. M. (1995). Psychological and human factors issues in the design of speech recognition systems. In A. Syrdal, R. Bennett, & S. Greenspan (Eds.), Applied Speech Technology. CRC Press.
Kowtko, J. C., & Price, P. J. (1989). Data collection and analysis in the air travel planning domain. In Proceedings of the DARPA Speech and Natural Language Workshop (pp. 119-125). Los Altos, CA: Morgan Kaufmann.
Amir Mané AT&T 101 Crawfords Corner Road Holmdel, NJ 07733 amir.mane@att.com
Amir Mané is a Senior Technical Staff Member at AT&T. The main focus of his work is the design of voice interfaces for interaction with Intelligent Agents.
Susan Boyce AT&T 101 Crawfords Corner Road Holmdel, NJ 07733 sjboyce@att.com
Susan Boyce is a Senior Technical Staff Member at AT&T. She conducts research on speech recognition user interface design issues.
Demetrios Karis GTE Laboratories 40 Sylvan Road Waltham, MA 02254 dkaris@gte.com
Demetrios Karis is a Principal Member of Technical Staff at GTE Laboratories. He is involved in the design and development of a variety of speech-based telecommunication services.
Nicole Yankelovich Sun Microsystems Laboratories Two Elizabeth Drive Chelmsford, MA 01824 nicole.yankelovich@east.sun.com
Nicole Yankelovich is the Co-Principal Investigator of the Speech Applications project at Sun Microsystems Laboratories. Her work focuses on the design of speech interfaces in a variety of application areas.