SIGCHI Bulletin
Vol.30 No.2, April 1998

Interaction-Driven Speech Input

A Data-Driven Approach to the Capture of both Local and Global Language Constraints

Jerome R. Bellegarda

Introduction

Speech technology (both recognition -- speech-to-text -- and synthesis -- text-to-speech) has made great strides over the past decade. Commercial applications such as automated voice response, command and control, and domain-specific dictation are now common. Thanks to ever more sophisticated acoustic models [1], under typical conditions the technology works quite well for a reasonably wide range of users.

Still, the user experience, for speech recognition in particular, is far from ideal. In small vocabulary applications such as command and control, users are heavily constrained regarding how to proceed at any given point in time. In large vocabulary applications such as dictation, the output text, even when transcribed with high accuracy, often fails to satisfy elementary gender, number, or tense agreements, thereby requiring tedious proof-reading and error correction.

In large part, these shortcomings can be traced to the language modeling component of the speech recognizer. This component purports to capture some of the constraints present in the language [2]. In small vocabulary applications, it typically relies on a finite-state grammar to define the list of allowable sentences. In large vocabulary applications, where such an approach is not feasible, it usually relies on a low order statistical n-gram language model such as a bigram or trigram [3].

Since they operate at the level of the entire sentence, finite state grammars capture both local and global constraints in the language. Because any attempt to speak an "out-of-grammar" utterance is guaranteed to fail, however, they effectively require the user to memorize what can (and cannot) be said. This is probably the biggest problem hindering the widespread adoption of speech input for command and control, since it essentially rules out natural dialogs and queries.

This problem is alleviated in the case of the n-gram paradigm, due to its data-driven nature. Because n-grams are trained on a finite amount of text, however, there is an inherent trade-off between n and the reliability of the parameter estimates [4]. This imposes an artificially local horizon on the language model. In other words, global constraints are not taken into account, causing the large-span discrepancies mentioned above.

What seems to be needed is a data-driven framework to capture some of the global constraints in the language. Then conceivably this approach could be combined with the standard n-gram paradigm to support more flexible dialogs and queries based on large vocabulary recognition. This in turn would enable a more natural interaction-driven speech input experience.

This paper proposes such a framework, based on a concept originally formulated in the context of information retrieval, called latent semantic analysis [5]. We start with a brief review of this concept below. Next, we use the latent semantic analysis framework to derive a large-span semantic language model and discuss its predictive power. Then, we address the integration of the new framework with conventional n-gram language models. Finally, a series of experimental results illustrates some of the benefits associated with the integrated language model.

Latent Semantic Analysis

Consider predicting the word fell from the word stocks in the two equivalent phrases: "stocks fell sharply as a result of the announcement" and "stocks, as a result of the announcement, sharply fell". In the first phrase, the prediction is straightforward using a bigram language model. In the second phrase, however, an n-gram approach would require n=9, an unrealistic proposition at the present time.

This suggests that a different framework is desirable to extract long distance information. Note that a formal parsing mechanism is not necessarily adequate either, if one wants to preserve the data-driven nature of n-grams. The problem is to relate to one another those words which are found to be semantically linked from the evidence presented in the training text database, without regard to the particular syntax used to express that semantic link. In other words, we would like to exploit the correlations between the current word and features of the word history.

This is where the latent semantic paradigm comes into play. In latent semantic indexing [5], co-occurrence analysis takes place across much larger spans than with a traditional n-gram approach. The span of choice is a document, which can be defined as a semantically homogeneous set of sentences embodying a given storyline. An important benefit of this generalization is the systematic integration of long-term dependencies into the analysis.

To take advantage of the concept of document, we of course have to assume that the available training data is tagged at the document level, i.e., there is a way to identify article boundaries. This is the case, for example, with the ARPA North American Business (NAB) News corpus [6]. This assumption enables the construction of a matrix of co-occurrences between words and documents. This matrix is accumulated from the available training data by simply keeping track of which word is found in what document.
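
As an illustration only (not part of the original work), the following Python sketch accumulates such a word-document count matrix from a corpus with known article boundaries. It uses raw counts; implementations in the LSA literature typically apply an additional entropy-based weighting, which is omitted here.

from collections import Counter
from scipy.sparse import lil_matrix

def build_word_doc_matrix(documents, vocabulary):
    # documents: list of token lists, one per article; vocabulary: list of M words
    word_index = {w: i for i, w in enumerate(vocabulary)}
    counts = lil_matrix((len(vocabulary), len(documents)))  # M x N word-document matrix
    for j, doc in enumerate(documents):
        for word, count in Counter(doc).items():
            i = word_index.get(word)
            if i is not None:            # ignore out-of-vocabulary words
                counts[i, j] = count
    return counts.tocsr()                # sparse form suitable for the SVD step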

Note that, in marked contrast with n-gram modeling, word order is ignored. This means that the latent semantic paradigm not only does not exploit syntactic information, but effectively throws it away. Thus, it should not be expected to replace conventional n-grams, but rather to complement them.

We have recently used the latent semantic analysis (LSA) approach to derive a novel word clustering algorithm [7]. For the sake of brevity, we refer the reader to [7] for further details on the mechanics of LSA, and only summarize it briefly here. See also [8] and [9] for additional details.

After the word-document matrix of co-occurrences is constructed, LSA proceeds by computing the singular value decomposition (SVD) of the word-document matrix. The left singular vectors in this SVD represent the words in the given vocabulary, and the right singular vectors represent the documents in the given corpus. Thus, the role of the SVD is to establish a one-to-one mapping between words/documents and some vectors in a space of appropriate dimension. Specifically, this space is spanned by the singular vectors resulting from the SVD.
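
As a sketch of this step, assuming the count matrix from the previous snippet, the truncated SVD can be computed with any sparse solver; scipy's routine stands in here for the large-scale Lanczos-based software actually used (cf. the Performance section).

import numpy as np
from scipy.sparse.linalg import svds

def lsa_vectors(counts, rank):
    # Truncated SVD: counts ~ U diag(s) Vt, keeping the `rank` largest singular values
    U, s, Vt = svds(counts.asfptype(), k=rank)
    scale = np.sqrt(s)              # S^(1/2)
    word_vectors = U * scale        # row i is ui S^(1/2), one row per vocabulary word
    doc_vectors = Vt.T * scale      # row j is vj S^(1/2), one row per training document
    return word_vectors, doc_vectors, U, s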

An important property of this space is that two words whose representations are "close" (in some suitable metric) tend to appear in the same kind of documents, whether or not they actually occur within identical word contexts in those documents. Conversely, two documents whose representations are "close" tend to convey the same semantic meaning, whether or not they contain the same word constructs. Thus, we can expect that the respective representations of words and documents that are semantically linked would also be "close" in that space. This property is what makes the framework useful for language modeling purposes.

LSA Language Modeling

Let V, |V| = M, be some vocabulary of interest and T a training text corpus, i.e., a collection of N articles (documents) from a variety of sources. (Typically, M and N are on the order of ten thousand and hundred thousand, respectively; T might comprise a hundred million words or so.) As described in [7-9], the LSA approach defines a mapping between the sets V, T and a vector space S, whereby each word wi in V is represented by a vector ui in S and each document dj in T is represented by a vector vj in S.

Recall that the matrix of co-occurrences embodies structural associations between words and documents. Thus, the extent to which word wi and document dj co-occur in the training corpus can be inferred from the (i,j) cell of the matrix. From the SVD formalism, it follows that this can be characterized by taking the dot product between the ith row of the matrix U S^(1/2) and the jth row of the matrix V S^(1/2), namely ui S^(1/2) and vj S^(1/2). In other words, how "close" ui is to vj in the space S can be characterized by the dot product between ui S^(1/2) and vj S^(1/2) [7-9].
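
In code, with the scaled vectors returned by the sketch above, this closeness measure reduces to a dot product (again purely illustrative):

def word_doc_closeness(word_vectors, doc_vectors, i, j):
    # Dot product between ui S^(1/2) and vj S^(1/2)
    return float(word_vectors[i] @ doc_vectors[j])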

Let wq denote the word about to be predicted, Hq-1 the admissible history (context) for this particular word, and Pr(wq | Hq-1) the associated language model probability. In the case of an n-gram language model, for example, Pr(wq | Hq-1) = Pr(wq | wq-1 wq-2 ... wq-n+1). To take the LSA framework into account, we have to consider the slightly modified expression Pr(wq | Hq-1) = Pr(wq | Hq-1, S), where the conditioning on S reflects the fact that in the proposed derivation the probability depends on the particular vector space arising from the SVD representation.

Thus, to construct an LSA language model, there are two issues that need to be addressed: (i) specify what Hq-1 is; and (ii) find a way to compute Pr(wq| Hq-1,S).

Since the SVD operates on a matrix of co-occurrences between words and documents, the nominal history is the document in which wq appears. However, to be admissible, the context must be causal, and therefore be truncated at word wq-1. Thus, in practice, we have to define Hq-1 to be the current document up to word wq-1.

Obviously, the current document will not (normally) have been seen in T, therefore qualifying as a pseudo-document in the terminology of [7-9]. If we denote this pseudo-document by dq-1, then it is possible (cf. [7-9]) to derive a vector representation vq-1 in S associated with this pseudo-document. The language model thus becomes:

Pr(wq| Hq-1,S) = Pr(wq | dq-1),

where Pr(wq | dq-1) is computed directly from the representations of wq and dq-1 in the space S. In other words, this expression can be directly inferred from the "closeness" between uq and vq-1 in S.
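
The exact construction of vq-1 is given in [7-9]. As a stand-in, the sketch below uses the generic LSA folding-in formula v = d^T U S^(-1), where d holds the word counts of the current history up to wq-1; this is an assumption made for illustration, not necessarily the weighting actually used.

import numpy as np

def fold_in_history(history_tokens, word_index, U, s):
    # Raw word counts for the pseudo-document (current document up to wq-1);
    # generic LSA fold-in, used here as an assumption rather than the method of [7-9]
    d = np.zeros(U.shape[0])
    for word in history_tokens:
        i = word_index.get(word)
        if i is not None:
            d[i] += 1.0
    return (d @ U) / s              # vq-1 = d^T U S^(-1)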

From the discussion above, a natural metric to consider for this "closeness" is the cosine of the angle between uq S^(1/2) and vq-1 S^(1/2), say K(uq, vq-1). A value of K(uq, vq-1) = 1 means that dq-1 is a strong semantic predictor of wq, while smaller values mean that the history carries correspondingly less information about the current word. Note that this functional does not satisfy the definition of a distance over the space S; however, it is straightforward to modify it so that it does. For a given history (document dq-1), the resulting distance induces an empirical multivariate distribution in the space S, which in turn allows for the computation of Pr(wq | dq-1).
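
Purely for illustration, the sketch below computes K(uq, vq-1) for every word in the vocabulary and then normalizes shifted cosine values into a distribution; the actual distance-based derivation of Pr(wq | dq-1) in [7-9] is more involved than this crude stand-in.

import numpy as np

def lsa_word_probabilities(word_vectors, v_history, s):
    # Cosine K(ui, vq-1) between each scaled word vector and the scaled history vector,
    # then a simple normalization; a crude stand-in for the distribution described in [7-9]
    v_scaled = v_history * np.sqrt(s)                                 # vq-1 S^(1/2)
    norms = np.linalg.norm(word_vectors, axis=1) * np.linalg.norm(v_scaled)
    cosines = (word_vectors @ v_scaled) / np.maximum(norms, 1e-12)
    scores = cosines - cosines.min()                                  # shift to non-negative values
    return scores / scores.sum()                                      # one value per word in V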

Interestingly, Pr(wq | dq-1) reflects the "relevance" of word wq to the admissible history, as observed through dq-1. As such, it will be highest for words whose meaning aligns most closely with the semantic fabric of dq-1 (i.e., relevant "content" words), and lowest for words which do not convey any particular information about this fabric (e.g., "function" words like the).

Since content words tend to be rare and function words tend to be frequent, this translates into a relatively high value of perplexity, a common measure of performance in language modeling (perplexity measures the average branching factor at the current word, cf., e.g., [2]). Thus, even though this model appears to have the same order as a standard unigram, on its own it will likely exhibit significantly weaker predictive power than that unigram. This was to be expected, since the LSA language model does not exploit positional information at all.
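
For reference, perplexity is obtained from the probabilities the model assigns to each word of a test text, as in the short sketch below.

import math

def perplexity(log_probs):
    # log_probs: natural-log probability assigned by the model to each test word
    return math.exp(-sum(log_probs) / len(log_probs))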

Integration with n-Grams

To remedy this shortcoming, we now integrate the LSA approach as described above with the standard n-gram formalism. The overall reasoning is that the global constraints captured by LSA will complement the local constraints provided by the n-gram. This amounts to leveraging both short-span syntactic and long-span semantic information to derive a language model with the benefits of both.

This integration can occur in a number of ways, such as straightforward interpolation, or within the maximum entropy framework [4]. In [8-9], we have developed an alternative formulation to obtain a modified n-gram language model incorporating the semantic information in a Bayesian way. The resulting integrated language model probability can be expressed as:

Pr(wq | Hq-1) = A / B,     (1)
where A = Pr(wq | wq-1 wq-2 ... wq-n+1) Pr(dq-1 | wq)
and B = Sum over wi in V of Pr(wi | wq-1 wq-2 ... wq-n+1) Pr(dq-1 | wi)
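
Given the two component models, (1) is straightforward to evaluate. The sketch below assumes the n-gram probabilities and the quantities Pr(dq-1 | wi) are available as arrays over the vocabulary; the function and argument names are illustrative only.

import numpy as np

def integrated_probability(q, ngram_probs, doc_given_word):
    # ngram_probs[i]    = Pr(wi | wq-1 ... wq-n+1) for every word wi in V
    # doc_given_word[i] = Pr(dq-1 | wi), the LSA term appearing in (1)
    numerator = ngram_probs[q] * doc_given_word[q]
    denominator = float(np.dot(ngram_probs, doc_given_word))   # sum over the vocabulary
    return numerator / denominator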

Performance

To evaluate the performance of the language model proposed in the previous section, we trained it on the so-called WSJ0 part of the NAB News corpus. This was convenient for comparison purposes since conventional bigram and trigram language models are readily available, trained on exactly the same data [6].

Thus, the training text corpus T was composed of about N = 87,000 documents spanning the years 1987 to 1989, comprising approximately 42 million words. In addition, about 2 million words from 1992 and 1994 were used for test purposes. The vocabulary V was constructed by taking the 20,000 most frequent words of the NAB News corpus, augmented by some words from an earlier release of the Wall Street Journal corpus, for a total of M = 23,000 words.

We performed the singular value decomposition of the matrix of co-occurrences between words and documents using the single-vector Lanczos method, as implemented by Berry [10]. Over the course of this decomposition, we experimented with the number of singular values retained, and found that R = 125 seemed to achieve an adequate balance between reconstruction error (as measured by Frobenius norm differences) and noise suppression (as measured by trace ratios).
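
To give a flavor of this trade-off, the sketch below computes the Frobenius reconstruction error for a few candidate values of R, using the fact that the squared error of the best rank-R approximation equals the energy in the discarded singular values; the trace-ratio criterion is not reproduced here.

import numpy as np
from scipy.sparse.linalg import svds, norm as sparse_norm

def reconstruction_errors(counts, candidate_ranks):
    # ||W - W_R||_F^2 equals the total energy minus the energy in the retained singular values
    counts = counts.asfptype()
    total_energy = sparse_norm(counts, "fro") ** 2
    errors = {}
    for r in candidate_ranks:
        _, s, _ = svds(counts, k=r)
        errors[r] = float(np.sqrt(max(total_energy - np.sum(s ** 2), 0.0)))
    return errors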

Using the resulting vector space S of dimension 125, we constructed the LSA language model as described above, and combined it with the standard bigram, according to (1). We then measured the resulting perplexity on the test data, and found a value of 147. This result is to be compared with the baseline results obtained with the standard bigram and trigram language models of [6], found to be 215 and 142, respectively.

Thus, compared to the standard bigram, we obtain a 32% reduction in perplexity with the integrated bigram/LSA language model (1), which brings it to the same level of performance as the standard trigram. We conclude that the new integrated language model is quite effective in combining global semantic prediction with the usual local predictive power of the bigram language model. In addition, we expect that much of the reduction in perplexity observed at the bigram level would carry over to a combined trigram/LSA language model.

Conclusion

We have described a data-driven language modeling approach based on the latent semantic analysis paradigm, which tracks hidden redundancies across documents. The resulting LSA language models capture semantically oriented, large span relationships between words. This stands in marked contrast with conventional n-grams, which inherently rely on syntactically-oriented, short-span relationships. Hence, one paradigm is better suited to account for the local constraints in the language, while the other one is more adept at handling global constraints.

To harness the synergy between the two, we have derived an integrative formulation to combine a standard n-gram with a LSA language model. The resulting multi-span language model was shown to substantially outperform the associated standard n-gram on a subset of the NAB News corpus. Because of its ability to capture both local and global constraints present in the language, this multi-span language model can be expected to contribute to a more natural interaction-driven speech input experience.

Acknowledgements

The author would like to thank Noah Cocarro, an intern at Apple in the summer of 1995, for many profitable discussions on information retrieval and latent semantic indexing.

References

[1] L.R. Bahl, F. Jelinek, and R.L. Mercer, "A Maximum Likelihood Approach to Continuous Speech Recognition," IEEE Trans. Pattern Anal. Mach. Intel., Vol. PAMI-5, No. 2, pp. 179-190, March 1983.

[2] F. Jelinek, "Self-Organized Language Modeling for Speech Recognition," in Readings in Speech Recognition, A. Waibel and K.F. Lee, Eds., Morgan Kaufmann Publishers, pp. 450-506, 1990.

[3] T. Niesler and P. Woodland, "A Variable-Length Category-Based N-Gram Language Model," in Proc. 1996 Int. Conf. Acoustics, Speech, Signal Proc., Atlanta, GA, pp. I164-I167, May 1996.

[4] R. Rosenfeld, "A Maximum Entropy Approach to Adaptive Statistical Language Modeling," Computer Speech and Language, Vol. 10, London: Academic Press, pp. 187-228, July 1996.

[5] S. Deerwester et al., "Indexing by Latent Semantic Analysis," J. Am. Soc. Inform. Science, Vol. 41, pp. 391-407, 1990.

[6] F. Kubala et al., "The Hub and Spoke Paradigm for CSR Evaluation," in Proc. ARPA Speech and Natural Language Workshop, Morgan Kaufmann Publishers, pp. 40-44, March 1994.

[7] J.R. Bellegarda et al., "A Novel Word Clustering Algorithm Based on Latent Semantic Analysis," in Proc. 1996 Int. Conf. Acoustics, Speech, Signal Proc., Atlanta, GA, pp. I172-I175, May 1996.

[8] J.R. Bellegarda, "A Latent Semantic Analysis Framework for Large-Span Language Modeling," in Proc. EuroSpeech'97, Rhodes, Greece, Vol. 3, pp. 1451-1454, September 1997.

[9] J.R. Bellegarda, "A Multi-Span Language Modeling Framework for Large Vocabulary Speech Recognition," IEEE Trans. Speech and Audio, in press.

[10] M.W. Berry, "Large-Scale Sparse Singular Value Computations," Int. J. Supercomp. Appl., Vol. 6, No. 1, pp. 13-49, 1992.

About the Author

Jerome Bellegarda has worked on speech recognition technology for the past decade. He has been with the Spoken Language Group at Apple Computer since 1994. His interests include voice-driven man-machine communications and multiple input/output modalities.

Author's Address

Jerome R. Bellegarda
Apple Computer, Inc.
MS 301-3KM, One Infinite Loop
Cupertino CA 95014, USA

Email: jerome@apple.com
Tel: +1-408-974 7647
