ISOLATED SPEECH RECOGNITION
USING MFCC AND DTW


Shivanker Dev Dhingra¹, Geeta Nijhawan², Poonam Pandit³

Student, Dept. of ECE, MRIU, Faridabad, Haryana, India
Associate Professor, Dept. of ECE, MRIU, Faridabad, Haryana, India
Assistant Professor, Dept. of ECE, MRIU, Faridabad, Haryana, India


Visit for more related articles at International Journal of Advanced Research in Electrical, Electronics and Instrumentation Engineering

Abstract

This paper describes an approach to isolated speech recognition using Mel-Frequency Cepstral Coefficients (MFCC) and Dynamic Time Warping (DTW). An experimental database of five speakers, each speaking 10 digits, was collected in an acoustically controlled room. MFCC features are extracted from the speech signal of each spoken word. To cope with different speaking speeds, Dynamic Time Warping (DTW) is used. DTW is an algorithm for measuring the similarity between two sequences that may vary in time or speed.

Keywords

MATLAB, Mel frequency cepstral coefficients (MFCC), Speech Recognition, Dynamic Time Warping (DTW)

I. INTRODUCTION

Speech recognition is the process of automatically recognizing the words spoken by a person based on information in the speech signal. Each spoken word is created from a phonetic combination of vowel, semivowel, and consonant speech sound units. The most popular spectral parameters used in recognition are the Mel Frequency Cepstral Coefficients (MFCC). MFCCs are coefficients that represent audio based on the perception of the human auditory system. The basic difference between the FFT or DCT and the MFCC is that in the MFCC the frequency bands are positioned logarithmically (on the mel scale), which approximates the human auditory system's response more closely than the linearly spaced frequency bands of the FFT or DCT.

Because the feature extraction algorithm is relatively simple to implement, a set of MFCC coefficients corresponding to the mel-scale frequencies of the speech cepstrum is extracted from the spoken word samples in the database [1].

Two utterances of the same word by the same speaker can differ in duration. For example, "two" can be pronounced briefly, as "to", or drawn out, as "too". DTW resolves this problem by aligning the two utterances in time and calculating the minimum distance between them. A local distance matrix is formed over all the segments of the sample word and the template word.

II. DATABASE

The database consists of two groups of speech samples recorded in an environmentally controlled room, so that acoustic interference with the quality of the samples is kept as low as possible. The first group (train) comprises five speakers, each speaking the 10 digits from one to zero; the second group (test) contains the same sound samples. All speech signals are recorded under conditions that are as similar as possible, such as the same recording length and the same sound amplitude level.

In training, a MATLAB program named 'train' extracts the features of all 50 words and stores them in a file named 'allfeatures.mat'. In the testing phase, a MATLAB program named 'test' prompts the user to choose any prerecorded speech sample from the test group. The MFCC stage then extracts the features of the chosen sample, and 'allfeatures.mat' is loaded for feature matching. DTW first matches the features of the selected sample locally against those in 'allfeatures.mat' by measuring the local distance. DTW then measures the global distance, and the stored word that best matches the chosen sample is reported by the 'test' program in the command window.
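The matching step of the testing phase can be sketched as a nearest-template search. This is a minimal Python illustration, not the paper's MATLAB code: the names `recognize`, `templates`, and `toy_distance` are hypothetical, and the distance function is passed in as a parameter so that a DTW global distance could be used in its place.

```python
def recognize(sample, templates, distance):
    """Return (label, score) of the stored template closest to `sample`.

    templates: dict mapping a word label to its stored feature sequence
               (a stand-in for the contents of the stored feature file)
    distance:  callable(a, b) -> float; the paper uses the DTW global distance
    """
    best_label, best_score = None, float("inf")
    for label, template in templates.items():
        score = distance(sample, template)
        if score < best_score:
            best_label, best_score = label, score
    return best_label, best_score


def toy_distance(a, b):
    # Placeholder distance (difference of means), used only for this demo.
    return abs(sum(a) / len(a) - sum(b) / len(b))


# Toy usage with scalar "features" instead of MFCC vectors.
templates = {"one": [1.0, 1.2, 0.9], "two": [2.1, 1.9, 2.0]}
label, score = recognize([2.0, 2.2], templates, toy_distance)
print(label)
```

Keeping the distance function as an argument separates the search over templates from the choice of distance measure.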

III. METHODOLOGY

Feature Extraction

Several feature extraction algorithms can be used for this task, such as Linear Predictive Coefficients (LPC), Linear Predictive Cepstral Coefficients (LPCC), Mel Frequency Cepstral Coefficients (MFCC), and Human Factor Cepstral Coefficients (HFCC) [2].

The MFCC algorithm is used to extract the features. The feature extraction function returns the outputs [x_cep, x_E, x_delta, x_acc]. MFCC are chosen for the following reasons:

1. MFCC are among the most important features and are required in many kinds of speech applications.

2. They give highly accurate results for clean speech.

3. MFCC can be regarded as the "standard" features in speaker recognition as well as speech recognition.

A. Preprocessing

To enhance the accuracy and efficiency of the extraction process, speech signals are normally pre-processed before features are extracted. Pre-processing has two steps:

1. Pre-emphasis.

2. Voice Activity Detection (VAD).

1. Pre-emphasis

The digitized speech waveform has a high dynamic range and suffers from additive noise. To reduce this range and spectrally flatten the speech signal, pre-emphasis is applied. A first-order high-pass FIR filter is used to pre-emphasize the higher frequency components.
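A first-order high-pass FIR pre-emphasis filter is commonly written as y[n] = x[n] - a·x[n-1]. The Python sketch below assumes that form; the coefficient a = 0.97 is a typical value chosen for illustration and is not stated in the paper.

```python
def pre_emphasis(signal, alpha=0.97):
    """Apply y[n] = x[n] - alpha * x[n-1] (first-order high-pass FIR).

    alpha = 0.97 is an assumed, commonly used coefficient.
    The first sample is passed through unchanged.
    """
    if not signal:
        return []
    out = [signal[0]]
    for n in range(1, len(signal)):
        out.append(signal[n] - alpha * signal[n - 1])
    return out


# A constant (zero-frequency) signal is strongly attenuated after the first sample.
print(pre_emphasis([1.0, 1.0, 1.0, 1.0]))
```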

2. Voice Activity Detection (VAD)

VAD facilitates speech processing and is used to deactivate some processes during the non-speech sections of an audio sample. The speech sample is divided into non-overlapping blocks of 20 ms. VAD then distinguishes the blocks that contain voice from the blocks that contain only silence.
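One simple way to label 20 ms blocks as speech or silence is a short-time energy threshold. The Python sketch below assumes that decision rule, which the paper does not specify; the function name `energy_vad` and the threshold value are hypothetical and would need tuning for real recordings.

```python
def energy_vad(signal, sample_rate, block_ms=20, threshold=0.01):
    """Label non-overlapping blocks as speech (True) or silence (False).

    A block counts as speech when its mean squared amplitude exceeds
    `threshold`. Both the rule and the threshold are assumptions.
    """
    block_len = int(sample_rate * block_ms / 1000)
    if block_len <= 0:
        raise ValueError("block length must be at least one sample")
    flags = []
    for start in range(0, len(signal) - block_len + 1, block_len):
        block = signal[start:start + block_len]
        energy = sum(x * x for x in block) / block_len
        flags.append(energy > threshold)
    return flags


# 40 ms of silence followed by 40 ms of a louder signal, sampled at 1 kHz.
toy_signal = [0.0] * 40 + [0.5] * 40
print(energy_vad(toy_signal, sample_rate=1000))
```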

B. Frame Blocking

The speech signal is split into several frames so that each frame can be analysed over a short time instead of analysing the entire signal at once. The frame size is in the range 0-20 ms. The frames are then overlapped. Overlapping is needed because a Hamming window is applied to each individual frame, and the window attenuates some of the information at the beginning and end of each frame. Overlapping reincorporates this information into the extracted features.
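Frame blocking with overlap can be sketched in Python as a sliding slice over the samples. The frame and hop lengths below are illustrative assumptions; a hop shorter than the frame length produces the overlap described above.

```python
def frame_signal(signal, frame_len, hop_len):
    """Split `signal` into frames of `frame_len` samples.

    Consecutive frames start `hop_len` samples apart, so they overlap by
    frame_len - hop_len samples. Trailing samples that do not fill a
    whole frame are dropped in this sketch.
    """
    if frame_len <= 0 or hop_len <= 0:
        raise ValueError("frame_len and hop_len must be positive")
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        frames.append(signal[start:start + frame_len])
    return frames


# Frames of 4 samples with a hop of 2, giving 50% overlap.
for frame in frame_signal(list(range(10)), frame_len=4, hop_len=2):
    print(frame)
```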

C. Windowing

Windowing is performed to avoid unnatural discontinuities in the speech segment and distortion in the underlying spectrum [3]. The choice of window is a trade-off between several factors. In speech recognition, the most commonly used window shape is the Hamming window [4].
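The symmetric Hamming window is w[n] = 0.54 - 0.46·cos(2πn/(N-1)) for n = 0, …, N-1. The Python sketch below applies it to a single frame; the helper names are illustrative.

```python
import math


def hamming_window(length):
    """Return the symmetric Hamming window of the given length."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def apply_window(frame):
    """Multiply a frame sample-by-sample with a Hamming window."""
    window = hamming_window(len(frame))
    return [x * w for x, w in zip(frame, window)]


# The window tapers the frame edges to about 0.08 and leaves the centre at 1.0.
print(apply_window([1.0, 1.0, 1.0, 1.0, 1.0]))
```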

D. Fast Fourier Transform

The purpose of the fast Fourier transform is to convert the convolution of the glottal pulse and the vocal tract impulse response in the time domain into a multiplication in the frequency domain [5]. Spectral analysis shows that different timbres in speech signals correspond to different energy distributions over frequency. Therefore, the FFT is applied to obtain the magnitude frequency response of each frame and to prepare the signal for the next stage, mel-frequency warping.
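The magnitude response of one frame can be sketched in Python with a small recursive radix-2 FFT. This is an illustration only: it assumes the frame length is a power of two, and a library FFT routine would normally be used instead.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def magnitude_spectrum(frame):
    """Magnitudes of the non-negative frequency bins (0 .. N/2)."""
    spectrum = fft(frame)
    return [abs(c) for c in spectrum[:len(frame) // 2 + 1]]


# A cosine with two cycles per 8-sample frame peaks at bin 2.
frame = [math.cos(2.0 * math.pi * 2 * n / 8) for n in range(8)]
print([round(m, 6) for m in magnitude_spectrum(frame)])
```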

E. Mel-frequency warping

Human perception of the frequency content of speech sounds does not follow a linear scale. Therefore, for each tone with an actual frequency f, measured in Hz, a subjective pitch is measured on a scale called the "mel" scale. The mel frequency scale has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. To compute the mels for a given frequency f in Hz, the following approximate formula is used:

$$\mathrm{Mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$$
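The approximate formula above can be written directly in Python. The inverse mapping `mel_to_hz` is not given in the paper; it is obtained here by solving the same formula for f.

```python
import math


def hz_to_mel(f_hz):
    """Mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel, derived algebraically from the same formula."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# 1000 Hz maps to roughly 1000 mel under this approximation.
print(hz_to_mel(1000.0))
```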

The subjective spectrum is simulated with the use of a filter bank, one filter for each desired mel-frequency component. The filter bank has a triangular band pass frequency response, and the spacing as well as the bandwidth is determined by a constant mel-frequency interval.
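A triangular filter bank with centres spaced at a constant mel interval might be built as in the Python sketch below. The number of filters, the FFT size, the sample rate, and the rounding of centre frequencies to FFT bins are all illustrative assumptions, not values taken from the paper.

```python
import math


def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(num_filters, fft_size, sample_rate):
    """Return num_filters triangular filters over bins 0 .. fft_size/2.

    Filter edges are equally spaced on the mel scale between 0 Hz and
    the Nyquist frequency, then rounded down to FFT bin indices.
    """
    high_mel = hz_to_mel(sample_rate / 2.0)
    step = high_mel / (num_filters + 1)
    edges_hz = [mel_to_hz(i * step) for i in range(num_filters + 2)]
    bins = [int(math.floor((fft_size + 1) * f / sample_rate)) for f in edges_hz]

    num_bins = fft_size // 2 + 1
    bank = [[0.0] * num_bins for _ in range(num_filters)]
    for m in range(1, num_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):        # rising slope
            bank[m - 1][k] = (k - left) / (center - left)
        for k in range(center, right):       # peak and falling slope
            bank[m - 1][k] = (right - k) / (right - center)
    return bank


bank = mel_filterbank(num_filters=10, fft_size=256, sample_rate=8000)
print(len(bank), len(bank[0]))
```

Multiplying each filter with the magnitude spectrum of a frame and summing the products gives one mel spectrum coefficient per filter.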

F. Cepstrum

In this final step, the log mel spectrum is converted back to the time domain. The result is called the Mel Frequency Cepstrum Coefficients (MFCC). The cepstral representation of the speech spectrum provides a good representation of the local spectral properties of the signal for the given frame analysis. Because the mel spectrum coefficients (and so their logarithms) are real numbers, they can be converted to the time domain using the discrete cosine transform (DCT). Applying the DCT also removes the contribution of the pitch [6].

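The MFCCs, c_n, can be calculated as

$$c_n = \sum_{k=1}^{K} (\log S_k) \cos\left[ n \left( k - \frac{1}{2} \right) \frac{\pi}{K} \right], \quad n = 1, 2, \ldots, K$$

where S_k, k = 1, 2, …, K, are the outputs of the last step.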
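The cepstrum step can be sketched in Python as a cosine transform of the log mel spectrum coefficients. The sketch assumes the usual DCT-II form, in which each coefficient weights log S_k by cos(n(k - 1/2)π/K); the function name and the small floor applied before the logarithm are illustrative choices.

```python
import math


def mfcc_from_mel_energies(mel_energies, num_coeffs):
    """Compute c_n = sum_k log(S_k) * cos(n * (k - 1/2) * pi / K), n = 1..num_coeffs.

    mel_energies holds S_1 .. S_K, the mel filter bank outputs.
    Values are floored at 1e-10 to avoid log(0); the floor is an assumption.
    """
    K = len(mel_energies)
    log_s = [math.log(max(s, 1e-10)) for s in mel_energies]
    coeffs = []
    for n in range(1, num_coeffs + 1):
        # The list index k runs from 0, so (k + 0.5) equals (k_paper - 1/2).
        c_n = sum(log_s[k] * math.cos(n * (k + 0.5) * math.pi / K)
                  for k in range(K))
        coeffs.append(c_n)
    return coeffs


# A flat mel spectrum has no spectral shape, so these coefficients are near zero.
print(mfcc_from_mel_energies([math.e] * 8, num_coeffs=4))
```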

Feature Matching

There are many feature-matching techniques used in speaker recognition, such as Dynamic Time Warping (DTW), Hidden Markov Modeling (HMM), and Vector Quantization (VQ). In this work, the DTW technique is used for feature matching.

Dynamic Time Warping (DTW)

The time alignment of different utterances is the core problem for distance measurement in speech recognition. Even a small shift can lead to incorrect identification. Dynamic Time Warping is an efficient method for solving the time alignment problem. The DTW algorithm aligns two sequences of feature vectors by warping the time axis iteratively until an optimal match between the two sequences is found. The algorithm performs a piecewise linear mapping of the time axis to align both signals.

Consider two sequences of feature vectors in an n-dimensional space.

The two sequences are arranged on the sides of a grid, with one along the top and the other along the left-hand side. Both sequences start at the bottom left of the grid.

In each cell, a distance measure is placed, comparing the corresponding elements of the two sequences. The distance between the two points is calculated via the Euclidean distance.
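Filling the grid with local distances can be sketched in Python as follows. The function names are illustrative, and each sequence is assumed to be a list of equal-length feature vectors.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def local_distance_matrix(x, y):
    """Cell (i, j) holds the distance between x[i] and y[j]."""
    return [[euclidean(xi, yj) for yj in y] for xi in x]


x = [[0.0, 0.0], [3.0, 4.0]]
y = [[0.0, 0.0], [3.0, 0.0]]
for row in local_distance_matrix(x, y):
    print(row)
```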

The best match or alignment between these two sequences is the path through the grid that minimizes the total distance between them; this total is termed the global distance. In principle, the global distance is found by going through all the possible routes through the grid and computing the overall distance for each one.

The global distance is the minimum of the sum of the distances (Euclidean distances) between the individual elements on the path, divided by the sum of the weighting function. For sequences of any considerable length, the number of possible paths through the grid is very large. The global distance measure is therefore obtained using a recursive formula.

Here, GD = global distance (overall distance) and LD = local distance (Euclidean distance).
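A common recursive form, assumed here because the formula itself is not reproduced in the text, is GD(i, j) = LD(i, j) + min(GD(i-1, j-1), GD(i-1, j), GD(i, j-1)). The Python sketch below implements that recursion with dynamic programming; it omits the division by the weighting sum mentioned above, and the function names are illustrative.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dtw_global_distance(x, y):
    """Unnormalized DTW distance between two sequences of feature vectors.

    Assumed recursion:
        GD(i, j) = LD(i, j) + min(GD(i-1, j-1), GD(i-1, j), GD(i, j-1))
    with GD(0, 0) = 0 and infinite cost outside the grid.
    """
    n, m = len(x), len(y)
    inf = float("inf")
    gd = [[inf] * (m + 1) for _ in range(n + 1)]
    gd[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            ld = euclidean(x[i - 1], y[j - 1])
            gd[i][j] = ld + min(gd[i - 1][j - 1], gd[i - 1][j], gd[i][j - 1])
    return gd[n][m]


# The same pattern spoken at a different speed aligns with zero distance.
fast = [[0.0], [1.0], [2.0]]
slow = [[0.0], [1.0], [1.0], [2.0]]
print(dtw_global_distance(fast, slow))
```

The table-based recursion visits each grid cell once, so the cost grows with the product of the two sequence lengths rather than with the number of possible paths.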

IV. CONCLUSION

The main aim of this project was to recognize isolated speech using MFCC and DTW techniques. Feature extraction was done using Mel Frequency Cepstral Coefficients (MFCC), and feature matching was done with the Dynamic Time Warping (DTW) technique. The features extracted by the MFCC algorithm were stored in a .mat file. A distortion measure based on minimizing the Euclidean distance was used when matching the unknown speech signal against the speech signal database. The experimental results were analysed in MATLAB and indicate that the approach is effective. The process can be extended to any number of speakers. The project suggests that DTW is an effective nonlinear feature-matching technique for speech identification, with low error rates and fast computation. DTW is expected to be important for speech recognition in voice-based automated teller machines.


References

[1] Chadawan Ittichaichareon, Siwat Suksri and Thaweesak Yingthawornsuk, "Speech Recognition using MFCC," International Conference on Computer Graphics, Simulation and Modeling (ICGSM'2012), July 28-29, 2012, Pattaya (Thailand).

[2] http://www.springerlink.com/content/n1fxnn5gpkuelu9k.

[3] B. Gold and N. Morgan, Speech and Audio Signal Processing, John Wiley and Sons, New York, NY, 2000.

[4] C. Becchetti and Lucio Prina Ricotti, Speech Recognition, John Wiley and Sons, England, 1999.

[5] E. Karpov, "Real Time Speaker Identification," Master's thesis, Department of Computer Science, University of Joensuu, 2003.

[6] Vibha Tiwari, "MFCC and its applications in speaker recognition," Deptt. of Electronics Engg., Gyan Ganga Institute of Technology and Management, Bhopal, (MP) India (Received 5 Nov., 2009, Accepted 10 Feb., 2010).