
Speech recognition

IBM Japan/HTC/Yoko 1996.1998


Overview

Research on HMM (Hidden Markov Model)-based speech recognition has been conducted at the IBM T. J. Watson Research Center for over 20 years. In cooperation with the Watson speech group, TRL is doing research on Japanese speech recognition, focusing on Japanese word recognition and aiming at a large vocabulary, high accuracy, and speaker adaptation.



Research items

Our current research topics, aimed at the practical use of speech recognition technology, are as follows:

  • A speaker-independent Japanese dictation system for spontaneous speech
  • Speech recognition technology for telephony

Speaker-independent speech recognition requires a large quantity of speech data and efficient modeling of the many variations in speech signals. A Japanese dictation system needs a vocabulary of over ten thousand words, even in a limited domain. The homonym problem is also unavoidable. We are tackling these problems by using HMM-based acoustic modeling and a Japanese language model.

The figure below shows an overview of our dictation system. In each time frame, a feature vector is extracted by acoustic processing. Candidate words are selected by a fast match (preselection), and the decoder outputs the final results on the basis of the language context and the likelihood calculated by the detailed match.
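
The decoding flow described above can be sketched roughly as follows. This is a minimal illustration, not the actual decoder: the function `decode_step`, the score tables, and all numeric values are hypothetical, and scores are treated as log-likelihoods. It shows a cheap fast match pruning the candidate list, after which the detailed-match score and the language-model score are combined to pick a word, which is also how context can separate homonyms.

```python
def decode_step(candidates, fast_score, detailed_score, lm_logprob,
                history, shortlist_size=3):
    """Choose the next word: cheap preselection, then detailed match plus context."""
    # Fast match (preselection): keep only the best candidates under the cheap score.
    shortlist = sorted(candidates, key=fast_score, reverse=True)[:shortlist_size]
    # Detailed match + language context: add the two log scores for each survivor.
    return max(shortlist,
               key=lambda w: detailed_score(w) + lm_logprob(history, w))


# Hypothetical scores: the two readings of "hashi" sound identical,
# so only the language context ("kawa no", i.e. "of the river") separates them.
fast = {"hashi_bridge": -1.0, "hashi_chopsticks": -1.0, "kashi": -3.0, "hako": -5.0}
detailed = {"hashi_bridge": -2.0, "hashi_chopsticks": -2.0, "kashi": -6.0, "hako": -9.0}
context = {("kawa", "no"): {"hashi_bridge": -1.0, "hashi_chopsticks": -6.0}}


def lm(history, word):
    # Unseen (history, word) pairs get a low default log-probability.
    return context.get(tuple(history), {}).get(word, -10.0)


best_word = decode_step(fast.keys(), fast.get, detailed.get, lm, ["kawa", "no"])
print(best_word)  # hashi_bridge
```

In this toy setup, "hako" is dropped by the fast match before the more expensive scores are ever computed, which is the point of preselection.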


Statistical language modeling is a method for estimating the probability of a sentence from the probabilities of sequences of N consecutive words (n-grams). These sequences are called bigrams when N=2 and trigrams when N=3. Naturally, longer sequences (larger N) make it possible to predict a sentence more accurately. However, the number of possible n-grams increases exponentially with N, and a huge amount of data is required for learning their probabilities. In previous research on English language models, trigrams have usually been used.
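
As a small illustration of the idea, the sketch below estimates bigram probabilities from counts and scores a sentence as the sum of log-probabilities of its consecutive word pairs. It is a toy example under stated assumptions: the function names, the tiny English corpus, and the add-one smoothing (`alpha`) are illustrative choices, not a description of the model used in our system.

```python
from collections import Counter
import math


def train_bigram(sentences):
    """Count histories and word pairs, with sentence boundary markers."""
    histories, bigrams = Counter(), Counter()
    for words in sentences:
        tokens = ["<s>"] + words + ["</s>"]
        histories.update(tokens[:-1])            # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # consecutive word pairs
    return histories, bigrams


def sentence_logprob(words, histories, bigrams, vocab_size, alpha=1.0):
    """Log-probability of a sentence under an additively smoothed bigram model."""
    tokens = ["<s>"] + words + ["</s>"]
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        p = (bigrams[(prev, cur)] + alpha) / (histories[prev] + alpha * vocab_size)
        total += math.log(p)
    return total


corpus = [["i", "like", "tea"], ["i", "like", "coffee"], ["you", "like", "tea"]]
histories, bigrams = train_bigram(corpus)
vocab_size = len({w for s in corpus for w in s} | {"</s>"})  # words that can follow a history

seen_order = sentence_logprob(["i", "like", "tea"], histories, bigrams, vocab_size)
odd_order = sentence_logprob(["tea", "like", "i"], histories, bigrams, vocab_size)
print(seen_order > odd_order)  # True
```

A trigram model follows the same pattern with two-word histories, at the cost of many more parameters to estimate.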

When we apply this method to Japanese, the ambiguity of word boundaries is the biggest problem. Some researchers have used morphemes instead of words. However, short units are difficult to identify from acoustic information, and the scope of the n-grams becomes small. On the other hand, excessively long units are not acceptable, because the number of units becomes too large and utterances in shorter units cannot be handled. We made a model of the subconscious word units that exist in the minds of Japanese speakers by comparing texts segmented by human subjects with those segmented by our Japanese morphological analyzer. About 80,000 of the units obtained from the model account for 99% of all words in newspapers. On the basis of this model, we created a speech recognition system that accepts both discrete and continuous utterances.

However, further research is needed before the system can be used widely, in particular for spontaneous speech. In spontaneous speech, pronunciation is often unclear, and unnecessary words (e.g., filled pauses such as "uhh") and disfluencies are frequently observed. The current language model, created mainly from newspaper articles, cannot handle colloquial Japanese sufficiently. The system would also be more flexible if it did not require a headset microphone (hands-free operation). We are now focusing on spontaneous speech in terms of both the acoustic model and the language model.
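
A coverage figure such as the one quoted above (about 80,000 units for 99% of newspaper words) can be obtained by sorting units by frequency and counting how many are needed to reach a target share of the running text. The sketch below shows only that calculation; the function name and the toy data are hypothetical, and it does not reproduce our actual measurement.

```python
from collections import Counter


def units_needed_for_coverage(tokens, target=0.99):
    """Number of most frequent units needed to cover `target` of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    for rank, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered / total >= target:
            return rank
    return len(counts)  # only reached for empty input


# Toy corpus: "a" makes up 90% of the tokens, "b" 9%, and "c" 1%.
toy_tokens = ["a"] * 90 + ["b"] * 9 + ["c"]
print(units_needed_for_coverage(toy_tokens, target=0.95))  # 2
```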

Because Japanese text has no explicit word boundaries, dividing the training texts into words as accurately as possible is indispensable for building a good language model. At the Tokyo Research Laboratory, we have been conducting research on Japanese natural language processing for about fifteen years. We are now studying stochastic methods for estimating word boundaries and for parsing the output of a speech recognizer.
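
One common textbook way to estimate word boundaries stochastically is to choose the segmentation with the highest total probability under a unigram model, using dynamic programming (Viterbi search). The sketch below illustrates that generic technique only; it is not the method developed at TRL, and the function `segment`, the unit list, and the probabilities are assumptions made for the example.

```python
import math


def segment(text, unit_logprob, max_len=4):
    """Most probable split of `text` into known units under a unigram model."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best log-probability of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last unit in text[:i]
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in unit_logprob:
                score = best[start] + unit_logprob[piece]
                if score > best[end]:
                    best[end], back[end] = score, start
    if best[n] == -math.inf:
        return None  # the text cannot be covered by the known units
    units, end = [], n
    while end > 0:
        units.append(text[back[end]:end])
        end = back[end]
    return units[::-1]


# Hypothetical unit probabilities; "東京都" can be split as 東京/都 or 東/京都.
probs = {"東京": 0.3, "都": 0.1, "東": 0.05, "京都": 0.2, "に": 0.2, "住む": 0.1}
logprobs = {unit: math.log(p) for unit, p in probs.items()}
print(segment("東京都に住む", logprobs))  # ['東京', '都', 'に', '住む']
```

Here 東京/都 wins because its combined probability (0.3 × 0.1) is higher than that of 東/京都 (0.05 × 0.2).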

Systems/Products

The results of our research are used in some of IBM's speech products. The first Japanese dictation product was shipped in March 1997. We subsequently released ViaVoice Gold, which allowed continuous speech, in December 1997, and ViaVoice 98, with an enriched vocabulary (60,000), in July 1998. The current product is ViaVoice for Windows, Version 10 Japanese. For these contributions, four researchers at TRL and one former member received the prize for Outstanding Technological Development in Acoustics from the Acoustical Society of Japan. Our technology is also used in ViaVoice Telephony.

Publications

The list of our papers

Last modified 22 April 2003