SIGCHI Bulletin
Vol.30 No.2, April 1998

Speech User Interface Design Challenges

CHI 97 Workshop Report

Susan Boyce, Demetrios Karis, Amir Mané, Nicole Yankelovich

Speech recognition technology has made dramatic improvements in the last few years, and commercial products based on this technology are becoming increasingly common. Despite this progress, it has become clear that recognition performance in the lab is not a good predictor of success in the field, and that extensive work on dialogue design and human factors "tuning" is required before most services can be used successfully by the general population. As in the early days of graphical user interfaces, there is no clear body of knowledge that a designer can turn to when developing new services that rely on speech.

The purpose of the 1997 speech workshop was to bring together speech researchers and practitioners in a non-competitive setting to explore, share, and discuss speech user interface design ideas. As preparation for the workshop, participants wrote position papers in which they suggested ideas for design exercises. In addition, they were encouraged to call and interact with speech systems available over the telephone from companies such as Applied Language Technologies, BBN, Lernout & Hauspie, Nortel, Nuance, Sun Microsystems Laboratories, Texas Instruments, Voice Control Systems, and Voice Processing Corporation.

During the course of the workshop, participants divided into teams of four to five people to work on solutions to two real-world design exercises derived from the position papers. In formulating solutions, participants were asked to consider only technology that was currently available or would reasonably be available in the next few years. There were no "correct" solutions to these exercises; rather, they were intended as a catalyst for participants to learn from each other and share successful speech user interface techniques and strategies.

At the conclusion of each exercise, each group was asked to state their assumptions about technology and target users, and then present their design solution.

Design Exercise 1: Telephony Application

The first exercise was to design a telephone-based service for selling used cars. One part of the service was for the seller to enter information about the car he or she wanted to list for sale. The second part of the service was for the potential buyer to search through cars listed. This exercise was chosen because it was a realistically complex service. In particular, we hoped to generate discussions about various entry methods for a variety of data types (digit strings, addresses, alphabetic characters, dollar amounts) using either Automatic Speech Recognition (ASR), touch-tone or some combination of both. We also hoped to share information on confirmation strategies, error recovery, formulation of queries, and presentation of long lists.

The text of the problem that was used in the workshop is below.

The CARBANK Exercise

Marketing & Product Management of ACME Services Inc. are in agreement that there is a great opportunity for a nationwide telephone based clearinghouse for used cars. People who wish to sell their car can dial 1 900 CARBANK and for $9.95 list their car. People looking to buy a used car can dial 1 800 CARBANK and find out about cars that are in the market.

1 900 CARBANK

To list a car for sale, a caller must provide the following:

  1. The name of a town in the US where the car is located
  2. The name of the owner
  3. The make, year, and model of the vehicle
  4. The condition of the car (mint, excellent, good, average, poor, junk)
  5. Number of miles
  6. The asking price
  7. A telephone number where the seller can be reached
  8. An optional 20 sec. voice announcement

1 800 CARBANK

Potential buyers can call and request to search the cars database. The following are the parameters that can be used in the search:

  1. Make (Optional)
  2. Model (Optional, but when model is requested Make must be specified)
  3. Condition (Optional)
  4. Buyer's location (Mandatory)
  5. Purchase radius (Optional, by default is set by system to within a one-hour driving distance)
  6. Price range (Optional)

Each team will be assigned a unique task. The following are the assignments for the four teams:

Team 1: Design the data entry for the people listing the car, make sure you cover the entry of alpha characters and use of DTMF (Dual-tone multi-frequency signaling, or touch-tone).

Team 2: Design the data entry for the people listing the car, make sure you cover error recovery.

Team 3: Design the buyers' dialogue; make sure you cover formulation of query.

Team 4: Design the buyers' dialogue; make sure you cover presentation of results.

Each design team had less than 1.5 hours to complete a solution to their piece of this problem. Given this short period of time, the solutions offered were not finished products, but rather initial designs and thoughts about how to tackle the design problems. After the time was up, each team presented its solution to the rest of the workshop. What follows is a summary of each team's design and a report of the general discussion that followed the presentations.

Team 1 was asked to design the data entry for someone who wanted to list a used car. This team was instructed to determine where and when they might use touch-tone and alphabetic entry in addition to ASR.

Example dialogue from the Team 1 design:

System: Welcome to the CarBank Used Car Listing System. This system allows you to enter information about the used car you wish to sell that will then be made available to interested buyers. The system will ask several questions about your car and how you can be contacted. When your car is listed in our system, you will be billed $9.95 on your next telephone bill. If you would like further information about how to use this system, say "Help" at any time.

System: May I have your name please?

Caller: John Smith (the system simply records this)

System: Please enter or say your area code and phone number.

Caller: [silence]

System: Excuse me? [pause 1.5s] I need your phone number so interested buyers can contact you. Please enter or say your area code and phone number.

Caller: 908 555 1111 {spoken digits}

System: Please enter or say your zip code.

Team 1's proposal was based largely on speech recognition, with the system taking the initiative and asking the caller a series of questions. The dialogue was a step-by-step progression through the data entry rather than a more open-ended, mixed-initiative dialogue. The designers chose this approach because they felt most callers would be novices: selling a used car is not an everyday activity for most people. They also thought that in most situations callers prefer a system-initiative dialogue.

Touch-tone was offered as an option for the digit entry tasks by including the phrase "enter or say". It was the consensus of the team that in some cases touch-tone entry could be faster and less error-prone and therefore should always be offered for digit entry. The only place where alphanumeric input was used was to allow people to spell words when the system was having particular difficulty recognizing a word, such as recognizing a make or model of a particular car.
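The Team 1 approach, a fixed sequence of fields with an escalated reprompt on silence, can be sketched in a few lines. The field names, prompt wording, and reprompt policy below are illustrative assumptions, not part of any real CarBank implementation:

```python
# A minimal sketch of a system-initiative dialogue in the style of Team 1's
# design: the system walks through a fixed list of fields, and a silent turn
# triggers one reprompt that explains why the field is needed.

FIELDS = [
    ("name", "May I have your name please?",
     "I need your name so buyers know who to ask for."),
    ("phone", "Please enter or say your area code and phone number.",
     "I need your phone number so interested buyers can contact you."),
    ("zip", "Please enter or say your zip code.",
     "I need your zip code to tell buyers where the car is located."),
]

def collect_listing(answers):
    """Walk through FIELDS; `answers` maps each field to an iterator of
    caller turns. An empty string simulates silence, which triggers a single
    escalated reprompt. Returns the listing and a (speaker, text) transcript."""
    listing = {}
    transcript = []
    for field, prompt, explanation in FIELDS:
        transcript.append(("system", prompt))
        turns = answers[field]
        reply = next(turns)
        if reply == "":  # silence: reprompt with the reason the field is needed
            transcript.append(("system", "Excuse me? " + explanation + " " + prompt))
            reply = next(turns)
        listing[field] = reply
        transcript.append(("caller", reply))
    return listing, transcript
```

A real service would of course also handle recognition errors and touch-tone input on the digit fields; this sketch captures only the step-by-step call flow.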

Team 2 was also assigned to design the user interface for the used car seller. This team was told to focus particularly on error recovery strategies.

Example dialogue from the Team 2 design:

System: Welcome to ACME Service's 1-900 CarBank. Please say either Introduction, Tutorial, or Start Now.

Caller: Start now.

System: It will cost you $9.95 to list your car. What major credit card would you like to use?

Caller: Visa

System: Please say the number.

Caller: 1234 1234 3456 1234

System: Please say the month and year of the expiration date.

Caller: November ninety eight

System: Thank you.

System: Please say the vehicle make, such as Ford or Nissan.

Caller: Toyota

System: Please say the model of Toyota, such as Camry, or say Backup if Toyota is incorrect.

This design differs from the first in that it does not assume all callers are novices. Team 2 assumed that this service might be useful for used car sales staff, so the tutorial information was not included in the main call flow, but was available upon request for the novice user. This decision made the interface more streamlined for repeat callers. Additionally, Team 2 added the ability to list multiple cars in a single session. The differences between the two teams' interfaces stem largely from the assumptions they made about the user population.

One interesting aspect of the design offered by Team 2 was the suggestion to combine the confirmation of the previous entry with the next question to eliminate steps in the dialogue, as in the example above. They provided the keyword "backup" to correct errors in the previous entry. They also agreed that the confirmations should be done in blocks after multiple pieces of data are collected from the caller. When the caller replies that something is an error, then a menu of items to correct is offered.

System: Toyota Camry, 1994. Is this correct?

Caller: No

System: What do you want to change? Make, model or year?

Team 2, like Team 1, proposed that touch-tone be used for fields requiring digit entry and that spelling out of words be reserved as an error recovery strategy.
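Team 2's block-confirmation strategy, read back several fields at once and offer a menu of items to correct on a "no", can be sketched as follows. The field names and prompt wording are illustrative assumptions:

```python
# A sketch of block confirmation in the style of Team 2's design: several
# collected values are read back together, and a "no" leads to a menu of
# items to correct, repeating until the caller confirms.

def confirm_block(fields, ask):
    """fields: dict of collected values (e.g. make/model/year);
    ask(prompt) -> caller reply. Returns the (possibly corrected) fields."""
    summary = ", ".join(str(v) for v in fields.values())
    reply = ask(summary + ". Is this correct?")
    while reply.lower() != "yes":
        menu = ", ".join(fields)  # e.g. "make, model, year"
        which = ask("What do you want to change? " + menu + "?")
        fields[which] = ask("Please say the " + which + ".")
        summary = ", ".join(str(v) for v in fields.values())
        reply = ask(summary + ". Is this correct?")
    return fields
```

For example, a caller correcting the year would answer "no", then "year", then "1995", and finally confirm the re-read summary. A production system would also need to handle an unrecognized menu choice rather than assuming the reply names a known field.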

Team 3 was assigned to design the buyer interface. In particular, this team focused on the entry of the buyer's query to the car database. Team 3 determined that the order of the questions the user answered was important because some attributes of the car constrained others. For example, if the caller specified the model of the car being sought early in the transaction, there was no need to ask for the make of the car.

Team 3 decided to make some default assumptions initially to simplify the user interface. For example, they assumed that callers would not be willing to travel more than 50 miles from their home in search of a car. The database was queried dynamically throughout the dialogue. That is, after each response from the caller, the database would be queried and the number of results evaluated. The dialogue would continue until the list of suitable cars was pruned down to a pre-set number of results. In this way they would be less likely to come up with a list too long to present in an auditory interface, and also less likely to come up with no matches at all. The relative importance of the car attributes could be tuned for each query by asking the caller questions like "What is more important, price or location?"
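Team 3's dynamic-query idea, re-query the database after each answer and stop asking questions once the result set is small enough to read aloud, can be sketched like this. The car records, field names, and 10-result threshold are made-up illustrations:

```python
# A sketch of dynamic query refinement in the style of Team 3's design:
# after each caller response the (toy, in-memory) database is filtered, and
# the dialogue stops asking questions as soon as the match list is short
# enough to present over the phone.

MAX_RESULTS = 10  # assumed presentable list length for an auditory interface

def refine_search(cars, questions, ask, max_results=MAX_RESULTS):
    """cars: list of dicts; questions: ordered (field, prompt) pairs;
    ask(prompt) -> caller reply, where '' means no preference."""
    matches = list(cars)
    for field, prompt in questions:
        if len(matches) <= max_results:
            break  # short enough to read aloud; stop constraining
        reply = ask(prompt)
        if reply:
            matches = [c for c in matches if c.get(field) == reply]
    return matches
```

This mirrors the observation in the text: a query for a common car keeps prompting for more constraints, while a query for a rare one ends early.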

Team 4 also worked on the buyer interface, but focused on how to present the results of the database query to the caller. Team 4 began the exercise by assigning some team members to imagine they were users and the others designers. Then they employed a user-centered design methodology to determine user needs, which then fueled the design. From this exercise they determined that the most important aspects of cars to buyers are price, make/model, and condition.

Team 4 began their design by assuming that the system would have a spoken natural language interface. This would allow for the recognition of continuous fluent speech. Further, their system would have to be smart enough to take some base-rate information into account and know that certain queries would require further refinement while others might not. In this way they could address the issue that a query for a Toyota Camry is likely to yield thousands of hits while a request for a VW bug will not.

After the presentation of all the designs there was a general discussion about the exercise. This discussion began by having each team comment on the process they used to arrive at their design. As previously mentioned, one team simulated a user-centered design methodology. Others used brainstorming or tried to define a metaphor or vision for the service.

The biggest issue discussed was how best to design for novices and experts within a single interface. There was general agreement that a highly structured interface is annoying to experts, but somehow novices have to be educated about the service in order to become experts. Lexical entrainment, using the same words in prompts that you want users to use in their responses, was suggested as one way to train novices. Team 2 decided to solve the novice/expert problem by building an expert user interface with a clearly identified route to instructions and tutorials.

Many workshop members commented that novice users rarely seek help in an interface. It was generally agreed that this was probably due to bad experiences with poorly designed help which "trapped" callers into listening to information that did not address their particular problem. Users know that listening to help requires a large investment of time, and so will try unsuccessfully many times before turning to system help. The workshop concluded that the best solution was to offer many different kinds of help: hints in reprompts, a simple list of available commands, or more detailed information about each of the available commands.
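The escalating help styles the workshop settled on can be sketched as a single reprompt policy. The command set and prompt wording below are invented for illustration:

```python
# A sketch of tiered help: each failed turn escalates the help level, from a
# hint embedded in the reprompt (lexical entrainment), to a short list of
# commands, to detailed descriptions of each command.

COMMANDS = {"search": "find cars matching your criteria",
            "list": "put your own car up for sale",
            "help": "hear these instructions again"}

def help_prompt(failures):
    """Return a reprompt appropriate to the number of failed turns so far."""
    if failures == 0:
        return "What would you like to do?"
    if failures == 1:  # hint in the reprompt, using the expected words
        return "Sorry. You can say Search or List. What would you like to do?"
    if failures == 2:  # simple list of available commands
        return "You can say: " + ", ".join(COMMANDS) + "."
    # persistent trouble: detailed information about each command
    return " ".join('Say "%s" to %s.' % (c, d) for c, d in COMMANDS.items())
```

Experts who answer on the first try never hear the longer help, while novices are educated a little more on each failure, one compromise between the two audiences.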

Design Exercise 2: Multimedia Training

In the afternoon we again divided into four design teams, this time working on the multimedia exercise presented below. This differed dramatically from the morning's exercise in that the teams could assume the existence of a graphical display in addition to speech input and output. The basic idea was to create a system that could be used to learn a task requiring the use of the hands; since the hands were busy, or dirty, or both, they could not be used to operate a keyboard or mouse. We did assume, however, that the person using the system would be able to remain in one place and see a computer monitor (although there are, of course, a variety of portable multimedia options now available). The similarities among the groups were striking. There was a large overlap of command words and navigation mechanisms, and in all cases the user controlled the pace of the interaction. There was also similarity in the way controls were displayed on the screen. Here is the text of the exercise:

Multimedia training using speech

Sketch out the rough design of a multimedia instructional system using speech input, speech output, and any other combination of media types that you wish to include. With the system, a user should be able to learn a task that requires the use of their hands. The system should be able to accommodate any subject matter, but for this exercise, select a specific, limited task to teach users. Be sure someone in your group is a subject-matter expert in the task you select. Here are some examples of hands-busy tasks:

Using one of the example tasks above or one of your own choosing, create a storyboard that includes sketches of the screen and samples of the dialog that might accompany each screen. The design should include, but is not limited to, the following features. Users should be able to:

It is not necessary to work out every detail of the system. Do just enough to provide a flavor of the interaction.

Each group chose a different task: bicycle repair, cooking (making pancakes), origami (making a jumping frog), and putting together a bookshelf or CD rack. After struggling during the morning session with all the difficulties inherent in a speech-only interface, we were struck by how "easy" it was to design a system when both speech and a visual display could be used. It makes a tremendous difference having a second medium available. With a visual display, some of the difficult problems of a speech-only interface go away, or become much easier to solve. For example:

People noted that the tasks chosen strongly influenced the dialogue and interface design, and that these tasks were all fairly simple in the sense that they could be decomposed into discrete linear steps. All groups did, in general, provide options to control the pace of the interaction, to stop, to back up, to ask for additional details, and so on. There was also consensus that the user's options should be integrated into the visual display and not treated separately. Designers should also anticipate typical questions, and provide answers either directly as part of the normal flow of the interaction or indirectly in the form of a tutorial or help module.

In all these tasks, it was very important to present information in manageable chunks and to describe well-bounded discrete steps. In some cases these were obvious, while in others some user testing or additional analysis would be required. It also helped in these tasks to be able to present an optional overview to the user before beginning. Highlights of the entire sequence could be presented visually, and a view of the final product could also be presented. This could make it easier for the user to understand intermediate steps.

One group noted how important it was to understand the process before working on the dialogue. This, of course, is essential and may seem obvious, but it is a rule that is sometimes violated. If the designer does not have a clear understanding of how something can be done, then it will be impossible to develop an easy-to-use interface. Just knowing the individual elements of information that the system requires is not enough, as it may often be beneficial to combine these into meaningful groups. Once this is done, these groups can provide a framework for navigation, confirmation, and error correction. For example, mechanisms can be provided to jump from one group to another, and confirmation and correction can operate on multiple elements within a single group simultaneously. The group working on a system to help with bookcase or CD rack assembly listed a number of steps, but also combined them into four high-level groupings: inventory, frame assembly, back assembly, and shelf assembly. An outline of the process that included these steps remained on the side of the display, with the current location highlighted and expanded. Command words included the names of the four high-level groupings to make it easy to navigate between them. Other command words included "Next Step," "Repeat," "Back," "Start over," "Skip," "Show Overview," "Undo," "Troubleshooting," and "Help".
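The bookcase group's navigation scheme, step groupings whose names double as jump commands alongside generic commands like "Next Step" and "Back", can be sketched as a small command interpreter. The individual step names below are invented; only the four grouping names come from the report:

```python
# A sketch of voice navigation over grouped steps in the style of the
# bookcase/CD-rack design: the four high-level groupings are themselves
# command words, and generic commands move through the flat step list.

GROUPS = {
    "inventory": ["check parts against the packing list"],
    "frame assembly": ["attach sides to top", "attach sides to bottom"],
    "back assembly": ["nail on the back panel"],
    "shelf assembly": ["insert shelf pins", "place shelves"],
}
# Flatten into an ordered list of (group, step) pairs for navigation.
STEPS = [(g, s) for g, steps in GROUPS.items() for s in steps]

def navigate(position, command):
    """Apply one voice command to the current step index; return the new index."""
    if command == "next step":
        return min(position + 1, len(STEPS) - 1)
    if command == "back":
        return max(position - 1, 0)
    if command == "start over":
        return 0
    if command in GROUPS:  # jump straight to the first step of a grouping
        return next(i for i, (g, _) in enumerate(STEPS) if g == command)
    return position  # "repeat" and unrecognized commands stay put
```

Keeping the outline on screen with the current position highlighted, as the group proposed, would make the effect of each of these commands visible to the user.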

One group chose bicycle repair, and in particular the task of truing a wheel. Their target user was between 14 and 30 years old and had no special mechanical skills or training. They used both yes-no questions as well as success criteria to determine whether it was appropriate to advance to the next step. In addition, they planned to have available a number of remedial video clips to help the user trouble-shoot. One of their innovative suggestions was to hook sensors up to the bike to provide an independent assessment of the user's progress (in addition to the user's responses to questions). For example, sensors might record tire pressure, wheel wobble, and so on. Such sensors were not essential to the design, but would certainly make trouble shooting much easier.

All groups noted that video clips, diagrams, and short animations would be very useful. These clips could show how to join two pieces of wood, how to use a special tool on the bicycle, how to prepare or combine ingredients while cooking, or how to fold paper for the origami project. The origami exercise brought this home dramatically during the discussion period when two conditions were demonstrated. First, a series of steps were read without any visual information; in this situation they seemed overwhelming and almost incomprehensible. Second, the same instructions were read while a person demonstrated the actions. In this situation, it was very easy to follow the instructions and to understand what to do.

Conclusion

One interesting observation can be made about the way the teams managed their time. It was striking to see how much time was spent before the teams "put pen to paper". On the first exercise, for example, each team spent at least 20 of the 90 minutes allotted to the exercise mapping out the various factors that needed to be considered before actually starting the design; one team spent as much as 40 minutes before starting design work. These factors fell into three major categories: the task, the users, and the technology.

Of course, we also spent time on, and had fun with, the actual design of the interfaces. Some tricks of the trade were exchanged, many of them addressing the need to balance two opposing requirements. For example, how do you verify that the system accurately understood each piece of information entered by voice, while keeping the dialogue moving and avoiding the tedium of explicit confirmation after every other question? Or how do you support both the expert and the novice, letting the former zoom along unhindered while providing the information the latter needs to complete the task? As one would expect, there was an intense discussion of error recovery. Much of the work that went into the design of an application focused on handling situations where the outcome was not the one initially expected. Whether the breakdown in communication is a result of the limitations of speech recognition or a by-product of limitations of the interaction itself, anticipating the possible points of difficulty is important. Each one serves as a trigger for a modification to the call flow: a side conversation that brings the system and the user together again, similar to the way repair is achieved in a conversation between two people.

The CHI 97 workshop on Speech User Interface Design Challenges was a sequel of sorts to the CHI 96 workshop on the same topic (Mané et al., 1996). In the first workshop we had an exciting academic discussion. This time around we had an opportunity to revisit many of these topics in the context of actual design. The contrast between the phone-based exercise and the multimedia exercise echoed much of our discussion of the unique limitations of a speech-only interface. Our study of both the processes and the guidelines one should follow in solving design problems illustrated how our insights can be put into practice.

Speech recognition technology has made dramatic improvements in the last few years, but the number of people who focus their efforts on the design of interfaces that make use of this technology has remained rather small. For some of us, it was a rare opportunity to discuss with peers the design challenges that we usually face alone. This, no doubt, contributed to the atmosphere of sharing and led to a pleasant and invigorating experience.

Reference

Mané, A., Boyce, S., Karis, D., and Yankelovich, N. (1996). Designing the User Interface for Speech Recognition Applications. SIGCHI Bulletin, Vol. 28, No. 4, pp. 29-34.

Participants in the CHI97 ASR Workshop

Susan Boyce, AT&T
Doug Brems, Sprint PCS
Maxine Cohen, Nova Southeastern University
Wayne Hank, Unisys
Demetrios Karis, GTE Labs
Mika Kivimiki, Nokia Mobile Phones Ltd.
Jeffrey Kurtz, MITRE
Jennifer Lai, IBM
Gina-Anne Levow, MIT AI Lab
Susann LuperFoy, MITRE
Amir Mané, AT&T Labs
Matt Marx, Applied Language Technologies (ALTech)
Jennifer Ockerman, Georgia Institute of Technology
Angelien Sanderman, KPN Research
Noi Sukaviriya, IBM
Shirley Tobias, Lucent Technologies
Nicole Yankelovich, Sun Microsystems Labs
Vincent van Amerongen, KPN Research

About the Authors

Susan Boyce is a Principal Technical Staff Member at AT&T Labs. She conducts research on speech recognition user interface design issues.

Demetrios Karis is a Principal Member of Technical Staff at GTE Laboratories. He is involved in the design and development of a variety of speech-based telecommunication services.

Amir Mané is a Principal Technical Staff Member in AT&T Labs. The main focus of his work is the design of voice interfaces for interaction with Intelligent Agents.

Nicole Yankelovich is the Principal Investigator of the Speech Applications project at Sun Microsystems Laboratories. Her work focuses on the design of speech interfaces in a variety of application areas.

Authors' Addresses

Susan Boyce
AT&T Labs
101 Crawfords Corner Road
Holmdel, NJ 07733 USA

sjboyce@att.com

Demetrios Karis
GTE Laboratories
40 Sylvan Road
Waltham, MA 02254, USA

dkaris@gte.com

Amir Mané
AT&T Labs
101 Crawfords Corner Road
Holmdel, NJ 07733, USA

amane@att.com

Nicole Yankelovich
Sun Microsystems Laboratories
Two Elizabeth Drive
Chelmsford, MA 01824, USA

nicole.yankelovich@sun.com
