
Speech Recognition Using Stochastic Approach: A Review

Ranu Dixit1 Navdeep Kaur2
M.Tech Student, Information Technology, Chandigarh Engineering College, Landran, Mohali, Punjab, India1
Faculty of Information Technology, Chandigarh Engineering College, Landran, Mohali, Punjab, India2

Abstract

In today's world, speech recognition is very important and popular. An Automatic Speech Recognition system consists of three phases: preprocessing, feature extraction and recognition. Speech recognition is the process of converting spoken words into text. Researchers mainly follow three different approaches to speech recognition, namely the acoustic-phonetic approach, the pattern recognition approach and the artificial intelligence approach. The main goal of this paper is to discuss the various techniques of speech recognition and to study the Hidden Markov Model of the stochastic approach, with a view to developing voice-based, user-friendly interface software.

Keywords

Automatic Speech Recognition (ASR), Hidden Markov Model (HMM), Feature Extraction, Algorithms, Stochastic Approach

I. INTRODUCTION

A. Speech Recognition: Definition and Issues: Speech recognition is the process of taking an input acoustic signal (spoken words in audio form) and recognising the words contained in the speech. These recognised words can be the final result, serving for command and control, or they may serve as input to further language processing. In simple words, speech recognition is the ability to take audio as input and generate the corresponding text as output.
Speech recognition [1] [2] involves the following steps (a minimal end-to-end sketch is given after the list):
1. Voice recording
2. Word boundary detection
3. Feature extraction [3]
4. Recognition with the help of language models [4]
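The sketch below walks through these four steps on synthetic data. Every component here is a simplified stand-in written for illustration only; none of it comes from a real ASR toolkit.

```python
import numpy as np

def extract_features(frame, dim=3):
    # Step 3 (feature extraction): reduce a 10 ms frame to a small feature
    # vector; crude spectral band energies stand in for real features here.
    bands = np.array_split(np.abs(np.fft.rfft(frame)), dim)
    return np.array([band.mean() for band in bands])

signal = np.random.randn(16000)              # step 1: stand-in for a 1 s recording at 16 kHz
frames = signal.reshape(-1, 160)             # split into 10 ms frames
energies = (frames ** 2).mean(axis=1)
speech = frames[energies > energies.mean()]  # step 2: crude energy-based boundary detection
features = np.array([extract_features(f) for f in speech])
print(features.shape)                        # step 4 would score these vectors against
                                             # word models (see Sections III and IV)
```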

II. SPEECH RECOGNITION APPROACHES

The speech recognition process must deal with speech variability and account for learning the relationship between a specific utterance and the corresponding word or words [5]. There has been steady progress in the field of speech recognition over recent years, with two trends [6]. The first is the academic approach; the second is the pragmatic one, which includes technology that provides simple low-level interaction with machines, replacing buttons and switches. The second approach is useful now, while the former mainly makes promises for the future. There are three approaches to speech recognition [7] [8] [9]:
A. Acoustic-phonetic approach [10][11][12][13]
B. Artificial Intelligence approach
C. Pattern recognition approach
A. Acoustic-phonetic Approach: The earliest approaches to speech recognition were based on finding speech sounds and providing appropriate labels to these sounds. This is the basis of the acoustic-phonetic approach (Hemdal and Hughes 1967), which postulates that there exist finite, distinctive phonetic units (phonemes) in spoken language and that these units are broadly characterized by a set of acoustic properties that are manifested in the speech signal over time. Even though the acoustic properties of phonetic units are highly variable, both across speakers and with neighboring sounds (the so-called coarticulation effect), it is assumed in the acoustic-phonetic approach that the rules governing the variability are straightforward and can be readily learned by a machine. The first step in the acoustic-phonetic approach is a spectral analysis of the speech, combined with a feature detection that converts the spectral measurements into a set of features describing the broad acoustic properties of the different phonetic units. The next step is a segmentation and labeling phase, in which the speech signal is segmented into stable acoustic regions and one or more phonetic labels are attached to each segmented region, resulting in a phoneme-lattice characterization of the speech. The last step attempts to determine a valid word (or string of words) from the phonetic label sequences produced by segmentation and labeling. In the validation process, linguistic constraints on the task (i.e., the vocabulary, the syntax, and other semantic rules) are invoked in order to access the lexicon for word decoding based on the phoneme lattice. The acoustic-phonetic approach has not been widely used in most commercial applications [14].
B. Artificial Intelligence Approach (Knowledge-Based Approach): The artificial intelligence approach [15] is a hybrid of the acoustic-phonetic approach and the pattern recognition approach, exploiting the ideas and concepts of both. The knowledge-based approach uses linguistic, phonetic and spectrographic information. Some speech researchers developed recognition systems that used acoustic-phonetic knowledge to develop classification rules for speech sounds. While template-based approaches have been very effective in the design of a variety of speech recognition systems, they provide little insight into human speech processing, thereby making error analysis and knowledge-based system enhancement difficult. On the other hand, a large body of linguistic and phonetic literature provides insight into and understanding of human speech processing. In its pure form, knowledge engineering design involves the direct and explicit incorporation of expert speech knowledge into a recognition system. This knowledge is usually derived from careful study of spectrograms and is incorporated using rules or procedures. Pure knowledge engineering was also motivated by the interest and research in expert systems. However, this approach had only limited success, largely due to the difficulty of quantifying expert knowledge. Another difficult problem is the integration of the many levels of human knowledge: phonetics, phonotactics, lexical access, syntax, semantics and pragmatics. Alternatively, combining independent and asynchronous knowledge sources optimally remains an unsolved problem. In more indirect forms, knowledge has also been used to guide the design of the models and algorithms of other techniques such as template matching and stochastic modeling. This form of knowledge application makes an important distinction between knowledge and algorithms: algorithms enable us to solve problems, while knowledge enables the algorithms to work better. This form of knowledge-based system enhancement has contributed considerably to the design of all successful strategies reported. It plays an important role in the selection of a suitable input representation, the definition of units of speech, and the design of the recognition algorithm itself.
C. Pattern Recognition Approach: The pattern-matching approach (Itakura 1975; Rabiner 1989; Rabiner and Juang 1993) involves two essential steps, namely pattern training and pattern comparison. The essential feature of this approach is that it uses a well-formulated mathematical framework and establishes consistent speech pattern representations, for reliable pattern comparison, from a set of labeled training samples via a formal training algorithm. A speech pattern representation can be in the form of a speech template or a statistical model (e.g., a hidden Markov model or HMM [21] [22] [23] [24] [25] [26]) and can be applied to a sound (smaller than a word), a word, or a phrase. In the pattern-comparison stage, a direct comparison is made between the unknown speech (the speech to be recognized) and each possible pattern learned in the training stage, in order to determine the identity of the unknown according to the goodness of match of the patterns. The pattern-matching approach has become the predominant method for speech recognition in the last six decades [14]. Within it there exist two methods, namely the template-based approach and the stochastic approach.
1. Template-Based Approach: Template-based approaches [15] to speech recognition have provided a family of techniques that have advanced the field considerably during the last six decades. The underlying idea is simple: a collection of prototypical speech patterns is stored as reference patterns representing the dictionary of candidate words. Recognition is then carried out by matching an unknown spoken utterance against each of these reference templates and selecting the category of the best-matching pattern. Usually, templates for entire words are constructed. This has the advantage that errors due to segmentation or classification of smaller, acoustically more variable units such as phonemes can be avoided. In turn, each word must have its own full reference template, so template preparation and matching become prohibitively expensive or impractical as the vocabulary size increases beyond a few hundred words. One key idea in the template method is to derive a typical sequence of speech frames for a pattern (a word) via some averaging procedure, and to rely on local spectral distance measures to compare patterns. Another key idea is to use some form of dynamic programming to temporally align patterns, to account for differences in speaking rates across talkers as well as across repetitions of the word by the same talker.
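To make the dynamic-programming alignment concrete, here is a minimal dynamic time warping (DTW) sketch in Python. It is an illustration under simplifying assumptions: the templates are random toy data and plain Euclidean distance stands in for the local spectral distance measures mentioned above.

```python
import numpy as np

def dtw_distance(template, utterance):
    # Dynamic time warping: temporally align two sequences of feature
    # vectors and return the cumulative distance of the best alignment.
    n, m = len(template), len(utterance)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(template[i - 1] - utterance[j - 1])
            # diagonal, vertical and horizontal moves absorb differences in
            # speaking rate across talkers and across repetitions of a word
            cost[i, j] = d + min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    return cost[n, m]

# Recognition: match the unknown utterance against every reference template
# and select the word of the best-matching pattern (toy 13-dim features).
templates = {"yes": np.random.randn(20, 13), "no": np.random.randn(25, 13)}
utterance = np.random.randn(22, 13)
print(min(templates, key=lambda w: dtw_distance(templates[w], utterance)))
```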
2. Stochastic Approach: Stochastic modeling [15] entails the use of probabilistic models to deal with uncertain or incomplete information. In speech recognition, uncertainty and incompleteness arise from many sources: for example, confusable sounds, speaker variability, contextual effects, and homophones. Thus, stochastic models are a particularly suitable approach to speech recognition. The most popular stochastic approach today is hidden Markov modeling. A hidden Markov model is characterized by a finite-state Markov model and a set of output distributions. The transition parameters of the Markov chain model temporal variability, while the parameters of the output distributions model spectral variability. These two types of variability are the essence of speech recognition. Compared to the template-based approach, hidden Markov modeling is more general and has a firmer mathematical foundation: a template-based model is simply a continuous-density HMM with identity covariance matrices and a slope-constrained topology. Although templates can be trained on fewer instances, they lack the probabilistic formulation of full HMMs and typically underperform them. Compared to knowledge-based approaches, HMMs [27] [28] [29] [30] [31] [32] enable easy integration of knowledge sources into a compiled architecture. A negative side effect of this is that HMMs do not provide much insight into the recognition process; as a result, it is often difficult to analyze the errors of an HMM system in an attempt to improve its performance. Nevertheless, prudent incorporation of knowledge has significantly improved HMM-based systems.

III. HIDDEN MARKOV MODEL

The hidden Markov model (HMM) approach [16][17][18] is one of the key technologies developed in the 1980s. An HMM is a doubly stochastic process: it has an underlying stochastic process that is not observable (hence the term hidden), but that can be observed through another stochastic process which produces a sequence of observations. Although the HMM was well known and understood in a few laboratories (primarily IBM, the Institute for Defense Analysis (IDA) and Dragon Systems), it was not until the widespread publication of the methods and theory of HMMs in the mid-1980s that the technique became widely applied in virtually every speech recognition research laboratory in the world. In the early 1970s, Lenny Baum of Princeton University invented a mathematical approach to recognising speech called hidden Markov modeling.
The HMM pattern-matching strategy was eventually adopted by each of the major companies pursuing the commercialization of speech recognition technology (SRT). The U.S. Department of Defense sponsored many practical research projects during the 1970s that involved several contractors, including IBM, Dragon, AT&T, Philips and others. Progress was slow in those early years. The HMM technique has been broadly accepted in today's state-of-the-art ASR systems mainly for two reasons: its capability to model the non-linear dependencies of each speech unit on the adjacent units, and the powerful set of analytical approaches it provides for estimating model parameters [19] [20].
A. Definition and Description of HMM: A Hidden Markov Model (HMM) [33][34][35][36][37][38][39][40] is a state machine. The states of the model are represented as nodes and the transitions are represented as edges. The difference in the case of an HMM is that a symbol does not uniquely identify a state: the new state is determined by the symbol together with the transition probabilities from the current state to a candidate state. A tutorial on HMMs showing how they can be used is given in [1]. Figure 1 shows a diagrammatic representation of an HMM. Nodes, denoted as circles, are states; O1 to O5 are observations. Observation O1 takes us to state S1. aij defines the transition probability between Si and Sj. It can be observed that the states also have self-transitions. If we are at state S1 and observation O2 is observed, we can either go to state S2 or stay in state S1; the decision is made depending on the probability of the observation at both states and the transition probability.
Thus an HMM is defined as:
λ = (Q, O, A, B, π)
where Q = {qi} is the set of all possible states,
O = {vk} is the set of all possible observations,
A = {aij}, where aij = P(Xt+1 = qj | Xt = qi) (transition probabilities),
B = {bi}, where bi(k) = P(Ot = vk | Xt = qi) (probability of observing vk in state qi),
π = {πi}, where πi = P(X0 = qi) (initial state probabilities),
Xt denotes the state at time t, and Ot denotes the observation at time t.
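As a concrete illustration of these five components, the following toy two-state, two-symbol HMM is written out in Python; the probabilities are invented purely for illustration.

```python
import numpy as np

Q = ["q0", "q1"]              # all possible states
V = ["v0", "v1"]              # all possible observation symbols
A = np.array([[0.7, 0.3],     # A[i, j] = P(X_{t+1} = qj | X_t = qi)
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],     # B[i, k] = P(O_t = vk | X_t = qi)
              [0.2, 0.8]])
pi = np.array([0.6, 0.4])     # pi[i] = P(X_0 = qi)

# Each row of A and B is a probability distribution, so it must sum to one:
assert np.allclose(A.sum(axis=1), 1) and np.allclose(B.sum(axis=1), 1)
print(pi @ A)                 # state distribution after one transition: [0.58 0.42]
```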
B. HMM and Speech Recognition: HMMs can be classified according to various criteria:
1. Values of occurrences
– Discrete
– Continuous
2. Dimension
– One-dimensional
– Multi-dimensional
3. Probability density function
– Continuous density (Gaussian distribution) based
– Discrete density (vector quantisation) based
While using an HMM for recognition, we provide the occurrences to the model and it returns a number: the probability with which the model could have produced that output (the occurrences). In speech recognition, the occurrences are feature vectors rather than simple symbols; each observation is a vector of real numbers. Thus, what we need for speech recognition is a continuous, multi-dimensional HMM [39][40][41][42][43][44][45].
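In a continuous-density HMM, each state scores a feature vector with a Gaussian (or a mixture of Gaussians) instead of looking up a discrete symbol probability. A minimal single-Gaussian sketch, with invented numbers, follows:

```python
import numpy as np

def gaussian_log_density(x, mean, cov):
    # log N(x; mean, cov): the per-state observation score b_i(x)
    # used by a continuous-density HMM on feature vectors.
    d = len(mean)
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + diff @ np.linalg.solve(cov, diff))

x = np.array([0.5, -1.0, 0.2])        # a 3-dimensional feature vector
print(gaussian_log_density(x, mean=np.zeros(3), cov=np.eye(3)))
```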
C. Implementation: Several HMM libraries were examined:
– HTK (HMM Tool Kit): a mature HMM implementation, but its usage license does not allow redistribution of code.
– A C++ implementation of HMM by Prof. Dekang Lin: a discrete HMM implementation only.
– GHMM: an open-source HMM library supporting both discrete and continuous HMMs, but with no support for more than one dimension.
A continuous HMM library, which supports vectors as observations, has therefore been implemented in the project. The library uses a Gaussian probability distribution function. The HMM is specified in an XML file; the sample has five states with a vector size of three. The root tag in the HMM file is HMM, which indicates that the file contains an HMM model. The tag has two attributes, states and vector-size, indicating the number of states and the vector size of an observation, respectively. Each state consists of the outgoing edges with their probabilities; these outgoing edges are stored as transition tags inside the state. Each transition tag has the target state id and the probability of the transition. A state also has one or more mixtures. A mixture consists of a vector of means and a matrix of variances, one for each dimension. These means and variances are used to calculate the probability of an occurrence; the way of calculating this probability is discussed in Section IV.
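The exact schema is specific to the project, so the snippet below is a guessed reconstruction of such a specification file (a two-state example for brevity, with diagonal variances written as one value per dimension), together with minimal parsing code:

```python
import xml.etree.ElementTree as ET

# Guessed reconstruction of the HMM specification format described above;
# the project's actual file may differ in detail.
sample = """\
<HMM states="2" vector-size="3">
  <state id="0">
    <transition target="0" probability="0.6"/>
    <transition target="1" probability="0.4"/>
    <mixture>
      <mean>0.0 0.0 0.0</mean>
      <variance>1.0 1.0 1.0</variance>
    </mixture>
  </state>
  <state id="1">
    <transition target="1" probability="1.0"/>
    <mixture>
      <mean>1.0 1.0 1.0</mean>
      <variance>1.0 1.0 1.0</variance>
    </mixture>
  </state>
</HMM>
"""

root = ET.fromstring(sample)
print("states:", root.get("states"), "vector-size:", root.get("vector-size"))
for state in root.findall("state"):
    for t in state.findall("transition"):
        print(" ", state.get("id"), "->", t.get("target"), "p =", t.get("probability"))
```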

IV. RECOGNITION USING HMM

To recognise a word using the existing word models, the sound recorder records the sound when it detects the presence of a word. The recorded sound is then passed through the feature vector extractor module, whose output is a list of features taken every 10 msec. These features are then passed to the recognition module. The list of all the words that the system has been trained on, together with their corresponding models, is given in a file called models present in the HMMs directory, and all the word models are loaded into memory. The feature vectors generated by the feature vector generator module act as the list of observations for the recognition module. The probability of generating the observations given a model, P(O|λ), is calculated for each model using the find-probability function. The word corresponding to the HMM [46][47][48][49][50] that gives the highest probability, provided this probability is above a threshold, is considered to be the spoken word; a sketch of this selection step is given below.
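A minimal sketch of the selection step; find_probability is the forward procedure of the next subsection, and the threshold value here is only a placeholder.

```python
def recognise(observations, models, find_probability, threshold=1e-30):
    # models maps each trained word to its HMM; pick the word whose model
    # gives the highest P(O | lambda), provided it clears the threshold.
    scores = {word: find_probability(hmm, observations) for word, hmm in models.items()}
    word = max(scores, key=scores.get)
    return word if scores[word] >= threshold else None
```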
A. Forward Variable
The forward variable is used to find the probability of a list of occurrences given an HMM. For a model with N states, the probability of the observation sequence given the model is defined in terms of the forward variable as

P(O|λ) = Σ(i=1..N) αT(i)

where αt+1 is recursively defined as

αt+1(j) = [ Σ(i=1..N) αt(i) aij ] bj(Ot+1), for 1 ≤ t ≤ T-1

with the initialisation α1(i) = πi bi(O1), for 1 ≤ i ≤ N.
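A direct numpy transcription of this recursion, written as a sketch for the discrete case (a continuous HMM would replace the lookup B[:, o] with a Gaussian density such as the one sketched earlier):

```python
import numpy as np

def find_probability(hmm, observations):
    # Forward procedure for a discrete HMM given as a dict with keys
    # 'A' (N x N transitions), 'B' (N x K outputs) and 'pi' (N initial probs).
    A, B, pi = hmm["A"], hmm["B"], hmm["pi"]
    alpha = pi * B[:, observations[0]]      # alpha_1(i) = pi_i * b_i(O_1)
    for o in observations[1:]:
        alpha = (alpha @ A) * B[:, o]       # alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] b_j(O_{t+1})
    return alpha.sum()                      # P(O|lambda) = sum_i alpha_T(i)

hmm = {"A": np.array([[0.7, 0.3], [0.4, 0.6]]),
       "B": np.array([[0.9, 0.1], [0.2, 0.8]]),
       "pi": np.array([0.6, 0.4])}
print(find_probability(hmm, [0, 1, 0]))
```

For long observation sequences this product underflows, so practical implementations scale alpha at each step or work with log probabilities.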
B. Training the Model
The train command is used to train the system for a new word. The command takes at least three parameters:
1. The number of states N that the HMM model should have.
2. The size of the feature vector, D.
3. One or more filenames, each containing a training set.
For generating an initial HMM we take N equally spaced observations (feature vectors) from the first training set; each one is used to seed a separate state. After this, each state has a mean vector of size D and a variance matrix of size D x D containing all zeros. Then, for each of the remaining observations, we find the Euclidean distance between it and the mean vector of each state, and assign the observation to the closest state for training. The states assigned to consecutive observations are tracked to estimate the transition probabilities, as sketched below.
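A sketch of this initialisation under the assumptions stated above (Euclidean assignment of observations to states, transition counts from consecutive assignments); a full trainer would follow this with parameter re-estimation, e.g. Baum-Welch.

```python
import numpy as np

def initialise_hmm(training_set, n_states):
    # training_set: T x D matrix of feature vectors from the first training file.
    T, D = training_set.shape
    # Seed each state's mean with one of N equally spaced observations.
    seeds = np.linspace(0, T - 1, n_states).astype(int)
    means = training_set[seeds].copy()
    # Assign every observation to the state with the closest mean (Euclidean).
    dists = np.linalg.norm(training_set[:, None, :] - means[None, :, :], axis=2)
    assignment = dists.argmin(axis=1)
    # Track the states of consecutive observations to count transitions.
    counts = np.zeros((n_states, n_states))
    for s, s_next in zip(assignment[:-1], assignment[1:]):
        counts[s, s_next] += 1
    A = counts / counts.sum(axis=1, keepdims=True).clip(min=1)
    # Re-estimate each state's mean from the observations assigned to it.
    for i in range(n_states):
        members = training_set[assignment == i]
        if len(members):
            means[i] = members.mean(axis=0)
    return means, A

means, A = initialise_hmm(np.random.randn(100, 3), n_states=5)
print(A.round(2))
```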

V. CONCLUSION

In this paper, we discussed the various techniques of speech recognition and studied the Hidden Markov Model (HMM). We observed that the HMM performs best among the modeling techniques considered. This study has been carried out to develop voice-based, user-friendly interface software, which can be used in various applications as a real interface. In future work, we intend to use it for blind and visually impaired persons as their virtual eye.

ACKNOWLEDGEMENT

The authors remain thankful to Rajvinder Singh (H.O.D. of Computer Science Engineering, CGC Landran, Mohali) for his useful discussions and suggestions during the preparation of this technical paper.

Figure 1. Diagrammatic representation of an HMM.

References

  1. Ripul Gupta (2011), “Speech Recognition for Hindi,” M.Tech Thesis, IIT Bombay.
  2. Abhisek Paul (2011), “Speech Recognition in Hindi,” M.Tech Thesis, National Institute of Technology, Rourkela.
  3. Q. Zhu and A. Alwan (2003), “Non-linear feature extraction for robust speech recognition in stationary and non-stationary noise,” Computer Speech and Language, vol. 17, no. 4, pp. 381–402.
  4. F. Jelinek, B. Merialdo, S. Roukos, and M. Strauss, “A Dynamic Language Model for Speech Recognition,” IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, NY 10598.
  5. Anusuya and Katti(2009), “Speech Recognition by Machine: A Review,” International Journal of Computer Science and Information Security, Vol.6, No. 3, pp.181-205
  6. Abdul Kadir, K. (2010), “Recognition of Human Speech using q-Bernstein Polynomials,” International Journal of Computer Applications, Vol. 2, No. 5, pp. 22–28.
  7. Reddy, R. (1976), “Speech Recognition by Machine: A Review,” Proceedings of the IEEE, Vol. 64, No. 4, pp. 501–531.
  8. Gaikwad, Gawali and Yannawar (2010), “A Review on Speech Recognition Technique,” International Journal of Computer Applications, Vol. 10, No. 3, pp. 16–24.
  9. Rohini B. Shinde and V. P. Pawar (2012), “A Review on Acoustic Phonetic Approach for Marathi Speech Recognition,” International Journal of Computer Applications, 59(2): 40–44.
  10. Friesen, L. M., Shannon, R. V., Başkent, D., and Wang, X. (2001), “Speech recognition in noise as a function of the number of spectral channels: Comparison of acoustic hearing and cochlear implants,” J. Acoust. Soc. Am., 110(2), 1150–1163.
  11. A. Mohamed, G. Dahl, and G. Hinton (2012), “Acoustic modeling using deep belief networks,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 14–22.
  12. L. Deng (2003), “Switching dynamic system models for speech articulation and acoustics,” in Mathematical Foundations of Speech and Language Processing, pp. 115–134, Springer-Verlag, New York.
  13. L. Deng and D. Yu (2007), “Use of differential cepstra as acoustic features in hidden trajectory modelling for phonetic recognition,” in Proc. ICASSP, pp. 445–448.
  14. Dat Tat Tran (2000), “Fuzzy Approaches to Speech and Speaker Recognition,” a thesis submitted for the degree of Doctor of Philosophy of the University of Canberra.
  15. R. K. Moore (1994), “Twenty things we still don’t know about speech,” in Proc. CRIM/FORWISS Workshop on ‘Progress and Prospects of Speech Research and Technology’.
  16. J. Ferguson, Ed. (1980), “Hidden Markov models for speech,” IDA, Princeton, NJ.
  17. L. R. Rabiner (1989), “A tutorial on hidden Markov models and selected applications in speech recognition,” Proc. IEEE, 77(2), pp. 257–286.
  18. L. R. Rabiner and B. H. Juang (1993), “Fundamentals of Speech Recognition,” Prentice-Hall, Englewood Cliffs, New Jersey.
  19. Picone, J. (1990), “Continuous Speech Recognition using Hidden Markov Models,” IEEE ASSP Magazine, Vol. 7, Issue 3, pp. 26–41.
  20. Flaherty, M. J. and Sidney, T. (1994), “Real-time implementation of HMM speech recognition for telecommunication applications,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 6, pp. 145–148.
  21. Leggetter, C. and Woodland, P. (1995), “Maximum likelihood linear regression for speaker adaptation of continuous density HMMs,” Computer Speech and Language, 9, 171–186.
  22. Sak, H., Saraclar, M., and Gungor, T. (2012), “Morpholexical and Discriminative Language Models for Turkish Automatic Speech Recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 8, pp. 2341–2351.
  23. M. Richardson, J. Bilmes, and C. Diorio (2003), “Hidden-articulator Markov models for speech recognition,” Speech Communication, vol. 41, no. 2–3, pp. 511–529.
  24. Otsuka, T., and Ohya, J. (1998), “Spotting segments displaying facial expression from image sequences using HMM,” in Proc. Second IEEE International Conference on Automatic Face and Gesture Recognition (FG’98), Nara, Japan, pp. 442–447.
  25. Nefian, A., Liang, L. H., and Pi, X. B. (2002), “A coupled HMM for audio-visual speech recognition,” in Proc. Int’l Conf. Acoustics, Speech and Signal Processing, pp. 2013–2016.
  26. T. Masuko, K. Tokuda, T. Kobayashi, and S. Imai (1996), “Speech synthesis from HMMs using dynamic features,” in Proc. ICASSP, pp. 389–392.
  27. T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura (1999), “Simultaneous modeling of spectrum, pitch, and duration in HMM-based speech synthesis,” in Proc. EUROSPEECH, pp. 2347–2350.
  28. H. Zen and N. Braunschweiler (2009), “Context-dependent additive log F0 model for HMM-based speech synthesis,” in Proc. Interspeech, pp. 2091–2094.
  29. K. Oura, H. Zen, Y. Nankaku, A. Lee, and K. Tokuda (2010), “A Covariance-Tying Technique for HMM-based Speech Synthesis,” IEICE, vol. E93-D, no. 3, pp. 595–601.
  30. K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi (1999), “Hidden Markov Models Based on Multi-Space Probability Distribution for Pitch Pattern Modeling,” in Proc. ICASSP, pp. 229–232.
  31. G. Heigold, R. Schlüter, and H. Ney (2007), “On the equivalence of Gaussian HMM and Gaussian HMM-like hidden conditional random fields,” in Proc. Interspeech, pp. 1721–1724.
  32. J. Kaiser, B. Horvat, and Z. Kacic (2000), “A novel loss function for the overall risk criterion based discriminative training of HMM models,” in Proc. ICSLP.
  33. M J F Gales (1998), “Maximum likelihood linear transformations for HMM-based speech recognition,” Computer Speech and Language, vol. 12, pp. 75–98.
  34. A. Acero, L. Deng, T. Kristjansson, and J. Zhang (2000), “HMM Adaptation using Vector Taylor Series for Noisy Speech Recognition,” in Proc. ICSLP, Beijing, China.
  35. K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura (2000),“Speech parameter generation algorithms for HMM-based speech synthesis,” in Proc. ICASSP 2000, pp. 1315–1318.
  36. H. Zen, K. Tokuda, and T. Kitamura (2007), “Reformulating the HMM as a trajectory model by imposing explicit relationships between static and dynamic feature vector sequences,” Computer Speech and Language, vol. 21, no. 1, pp. 153–173.
  37. C. Wellekens (1987), “Explicit time correlation in hidden Markov models for speech recognition,” in Proc. ICASSP, vol. 12, 1987, pp. 384–386.
  38. P. Kenny, M. Lennig, and P. Mermelstein (1990), “A linear predictive HMM for vector-valued observations with applications to speech recognition,” IEEE Trans. Acoust., Speech, Signal Process., vol. 38, no. 2, pp. 220–225.
  39. P. Woodland (1992), “Hidden Markov models using vector linear prediction and discriminative output distributions,” in Proc. ICASSP 1992, pp. 509–512.
  40. B. Juang and L. Rabiner (1985), “Mixture autoregressive hidden Markov models for speech signals,” IEEE Trans. Acoust., Speech, Signal Process.,vol. 33, no. 6, pp. 1404–1413.
  41. M. Shannon and W. Byrne (2009), “Autoregressive HMMs for speech synthesis,” in Proc. Interspeech 2009, pp. 400–403.
  42. C. Quillen (2012), “Autoregressive HMM speech synthesis,” in Proc. ICASSP, 2012, pp. 4021–4024.
  43. HTS working group (2012), “HMM-based speech synthesis system (HTS),” http://hts.sp.nitech.ac.jp
  44. T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura (1998), “Duration modeling for HMM-based speech synthesis,” in Proc. ICSLP.
  45. H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura (2007), “A hidden semi-Markov model-based speech synthesis system,” IEICE Trans. Inf. Syst., vol. E90-D, no. 5, pp. 825–834.
  46. S. Yu and H. Kobayashi (2003), “An efficient forward-backward algorithm for an explicit-duration hidden Markov model,” IEEE Signal Process. Lett., vol. 10, no. 1, pp. 11–14.
  47. H. Zen (2007), “Implementing an HSMM-based speech synthesis system using an efficient forward-backward algorithm,” Nagoya Institute of Technology, Technical Report TR-SP-0001.
  48. M. Shannon and W. Byrne (2009), “A formulation of the autoregressive HMM for speech synthesis,” Department of Engineering, University of Cambridge, UK, Technical Report CUED/F-INFENG/TR.629, http://mi.eng.cam.ac.uk/sms46/papers/shannon2009fah.pdf.
  49. T. Toda and K. Tokuda (2007), “A speech parameter generation algorithm considering global variance for HMM-based speech synthesis,” IEICE Trans. Inf. Syst., vol. E90-D, no. 5, pp. 816–824.
  50. T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura (2004),“Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis,” IEICE Trans. Inf. Syst. (Japanese edition), vol. J87-D-II, no. 8, pp. 1565–1571.