Monday, October 26, 2009

Speech Recognition System




Fundamentals of Speech Recognition

AUTOMATIC SPEECH RECOGNITION (ASR).

The concept of a machine that can recognize the human voice has long been an accepted feature in science fiction. From ‘Star Trek’ to George Orwell’s ‘1984’ - “Actually he was not used to writing by hand. Apart from very short notes, it was usual to dictate everything into the speakwrite.” - it has been commonly assumed that one day it will be possible to converse naturally with an advanced computer-based system. Indeed, in his book ‘The Road Ahead’, Bill Gates (co-founder of Microsoft Corp.) hails ASR as one of the most important innovations for future computer operating systems.

From a technological perspective it is possible to distinguish between two broad types of ASR: ‘direct voice input’ (DVI) and ‘large vocabulary continuous speech recognition’ (LVCSR). DVI devices are primarily aimed at voice command-and-control, whereas LVCSR systems are used for form filling or voice-based document creation. In both cases the underlying technology is more or less the same. DVI systems are typically configured for small to medium sized vocabularies (up to several thousand words) and might employ word or phrase spotting techniques. Also, DVI systems are usually required to respond immediately to a voice command. LVCSR systems involve vocabularies of perhaps hundreds of thousands of words, and are typically configured to transcribe continuous speech. Also, LVCSR need not be performed in real-time - for example, at least one vendor has offered a telephone-based dictation service in which the transcribed document is e-mailed back to the user.

From an application viewpoint, the benefits of using ASR derive from providing an extra communication channel in hands-busy, eyes-busy human-machine interaction (HMI), or simply from the fact that talking can be faster than typing. Also, whilst speaking to a machine cannot be described as natural, it can nevertheless be considered intuitive; as one ASR advertisement declared, “you have been learning since birth the only skill needed to use our system”.

ASR products have existed in the marketplace since the 1970s. However, early systems were expensive hardware devices that could only recognize a few isolated words (i.e. words with pauses between them), and needed to be trained by users repeating each of the vocabulary words several times. The 1980s and 90s witnessed a substantial improvement in ASR algorithms and products, and the technology developed to the point where, in the late 1990s, software for desktop dictation became available ‘off-the-shelf’ for only a few tens of dollars. As a consequence, the markets for ASR systems have now grown to include:

· large vocabulary dictation - for RSI sufferers and quadriplegics, and for formal document preparation in legal or medical services

· interactive voice response - for callers who do not have tone pads, for the automation of call centers, and for access to information services such as stock market quotes

· telecom assistants - for repertory dialing and personal management systems

· process and factory management - for stocktaking, measurement and quality control

The progress in ASR has been fuelled by a number of key developments, not least the relentless increase in the power of desktop computing. Also, R&D has been greatly stimulated by the introduction of competitive public system evaluations, particularly those sponsored by the US Defense Advanced Research Projects Agency (DARPA). However, scientifically, the key step has been the introduction of statistical techniques for modeling speech patterns coupled with the availability of vast quantities of recorded speech data for training the models.

The main breakthrough in ASR has been the discovery that recognition can be viewed as an integrated search process, and this first appeared in the 1970s with the introduction of a powerful mathematical search technique known as ‘dynamic programming’ (DP) or ‘Viterbi search’. Initially DP was used to implement non-linear time alignment in a whole-word template-based approach, and this became known as ‘dynamic time warping’ (DTW).
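As a rough sketch of the idea, the fragment below aligns an unknown utterance against a stored whole-word template using DTW. The frame features, the Euclidean local distance and the simple step pattern are illustrative assumptions here, not the details of any particular system.

import numpy as np

def dtw_distance(template, utterance):
    # Each argument is a sequence of acoustic feature vectors (one per frame).
    # cost[i][j] holds the best accumulated distance aligning the first i
    # template frames with the first j utterance frames.
    n, m = len(template), len(utterance)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(np.asarray(template[i - 1]) - np.asarray(utterance[j - 1]))
            # Diagonal, vertical and horizontal steps allow non-linear
            # stretching and compression of the time axis.
            cost[i, j] = d + min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    return cost[n, m]

# A whole-word recognizer then simply reports the template with the
# smallest DTW distance to the incoming utterance, e.g.
# recognized = min(templates, key=lambda w: dtw_distance(templates[w], features))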

DTW-based systems were quite successful, and could even be configured to recognize connected words. However, another significant step came in the late 1980s when pattern matching was replaced by ‘hidden Markov modeling’. This not only allowed systems to be configured for large numbers of users – providing so-called ‘speaker-independent’ systems – but, through ‘sub-word HMMs’, also enabled the recognition of words that had not been encountered in the training material.

A hidden Markov model (HMM) is a stochastic generative process that is particularly well suited to modeling time-varying patterns such as speech. HMMs represent speech as a sequence of observation vectors derived from a probabilistic function of a first-order Markov chain. Model ‘states’ are associated with output probability distributions that describe pronunciation variation, and states are connected by probabilistic ‘transitions’ that capture durational structure. An HMM-based recognizer can thus be used as a ‘maximum likelihood classifier’ to find the most probable sequence of words given a sequence of acoustic observations.
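Written out, the decision rule is the standard maximum a posteriori formulation: choose the word sequence W that is most probable given the acoustic observations O, which by Bayes’ rule splits into an acoustic model term and a language model term (P(O) does not depend on W and can be ignored in the maximization):

\hat{W} = \arg\max_{W} P(W \mid O) = \arg\max_{W} \frac{P(O \mid W)\,P(W)}{P(O)} = \arg\max_{W} P(O \mid W)\,P(W)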

Figure 1 illustrates a contemporary ASR system. Incoming speech is subject to some form of front-end signal processing - usually ‘cepstral’ analysis – that outputs a sequence of acoustic vectors. Using Viterbi search, this sequence is compared with an integrated network of HMM states in order to find the path that corresponds to the most likely explanation of the observations. The path reveals the recognized sequence of words.
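A minimal sketch of the Viterbi recursion over such a network is given below, assuming the transition and emission scores are already in the log domain and that the initial state distribution is uniform; in a real recognizer the states belong to the compiled network of sub-word HMMs and the recovered state path is mapped back to a word sequence.

import numpy as np

def viterbi(log_trans, log_emit, observations):
    # log_trans[i, j]: log probability of a transition from state i to state j.
    # log_emit(s, o) : log probability of state s emitting observation o.
    n_states = log_trans.shape[0]
    T = len(observations)
    delta = np.full((T, n_states), -np.inf)    # best partial-path score per state
    back = np.zeros((T, n_states), dtype=int)  # back-pointers for path recovery
    delta[0] = [log_emit(s, observations[0]) for s in range(n_states)]
    for t in range(1, T):
        for s in range(n_states):
            scores = delta[t - 1] + log_trans[:, s]
            back[t, s] = int(np.argmax(scores))
            delta[t, s] = scores[back[t, s]] + log_emit(s, observations[t])
    # Trace back from the best final state to recover the most likely path.
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return list(reversed(path))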

The key to this approach is the process for compiling the HMM network. Two training corpora are involved: one consisting of many hours of annotated speech material, and another comprising several million words of text. The first is used to estimate the parameters of the ‘acoustic model’ – an inventory of context-sensitive sub-word HMMs such as ‘diphones’ or ‘triphones’ – and the second is used to estimate the parameters of an n-gram ‘language model’. Each word in the target vocabulary is then expressed as a sequence of phonetic sub-word units, and compiled into a network together with the language model and non-speech HMMs (to accommodate noise).
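As a toy illustration of the language-model half of this process, the fragment below estimates a bigram model from raw text; the add-k smoothing is purely illustrative, since production systems use far larger corpora and more sophisticated smoothing and back-off schemes.

from collections import Counter

def train_bigram_lm(sentences, k=1.0):
    # Count word pairs, then return P(next word | previous word) with add-k smoothing.
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        vocab.update(words)
        unigrams.update(words[:-1])
        bigrams.update(zip(words[:-1], words[1:]))
    V = len(vocab)
    def prob(prev, word):
        return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * V)
    return prob

# For example:
# p = train_bigram_lm(["recognize speech", "it is hard to wreck a nice beach"])
# p("recognize", "speech") is larger than p("recognize", "beach")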

This mainstream approach to ASR is not without its detractors. It is difficult to construct such a system to exhibit accurate discriminative behavior. As a result, a handful of researchers have investigated ‘artificial neural networks’ (ANNs), particularly for sub-word modeling. However, such systems have not outperformed HMMs on benchmark tests. A more general criticism – primarily leveled at the dominance of the DARPA-sponsored evaluations – has been concerned with the inadvertent suppression of scientific diversity (Bourlard et al., 1996). Participation in such prestigious evaluations not only ties up a large research effort, thereby severely reducing the opportunity for lateral thinking, but also discourages approaches that carry a short-term risk of worse benchmark performance.

Finally, a comprehensive comparison between ASR and human speech recognition (HSR) accuracy was performed in 1997, when Richard Lippmann presented comparative word error rates for a range of tasks and conditions (Lippmann, 1997). The results indicated that ASR performed roughly an order of magnitude worse than a human listener.

Bibliography

Bourlard, H., Hermansky, H. & Morgan, N. (1996). Towards increasing speech recognition error rates. Speech Communication (Vol. 18, pp. 205-231). Elsevier.

Deller, J. R., Proakis, J. G. & Hansen, J. H. L. (2000). Discrete-time processing of speech signals, IEEE Press Classic Reissue, Piscataway, NJ, IEEE Press.

Gibbon, D., Moore, R. K. & Winski, R. eds. (1997). Handbook of Standards and Resources for Spoken Language Systems, Mouton de Gruyter.

Gold, B. & Morgan, N. (2000). Speech and Audio Signal Processing, New York: John Wiley & Sons.

Holmes, J.N. & Holmes, W.J. (2001). Speech Synthesis and Recognition (second edition), Taylor and Francis, London.

Jelinek, F. (1997). Statistical Methods for Speech Recognition, Cambridge, MA: MIT Press.

Lippmann, R. (1997). Speech recognition by machines and humans. Speech Communication (Vol. 22, pp. 1-15). Elsevier.

O'Shaughnessy, D. (2000). Speech Communications: human and machine, Second Edition, Piscataway, NJ: IEEE Press.

Rabiner, L. R. and Juang, B.-H. (1993). Fundamentals of Speech Recognition, Englewood Cliffs, NJ: Prentice Hall.

Young, S. J. (1996). A review of large-vocabulary continuous-speech recognition. IEEE Signal Processing Magazine (pp. 45–57).




