The performance of a text-independent speaker verification (SV) system degrades when speaker models are trained in one environment and testing is carried out in another, owing to mismatches between training and testing conditions in the phonetic content of the utterances, the recording environment, session variability and sensor variability. The robustness of SV systems is commonly improved by Voice Activity Detection (VAD), by feature-level techniques such as Cepstral Mean Normalization (CMN) and Cepstral Variance Normalization (CVN), and by score-level normalization techniques. In this paper we report experiments carried out on the recently collected Arunachali Language Speech Database (ALS-DB). The database is evaluated with a Gaussian Mixture Model with Universal Background Model (GMM-UBM) system using Mel-Frequency Cepstral Coefficients (MFCC) with their first- and second-order derivatives, together with prosodic features, as front-end feature vectors. The performance of the speaker verification system is improved by applying CVN at the feature level and the score normalization technique test normalization (T-Norm) at the score level, and improves further when CVN and T-Norm are applied simultaneously. We observe that combining MFCC with prosodic features improves the performance of the SV system by 7.08%, while T-Norm yields an improvement of 3.22% and CVN an improvement of 3.90%.
Keywords: Speaker Verification; GMM-UBM; Mel-Frequency Cepstral Coefficients; CVN; T-Norm
INTRODUCTION
Automatic Speaker Verification (ASV) is one of the most natural and economical methods for addressing unauthorised use of computer and communication systems and for multilevel access control [1]. Speaker verification (SV) is a biometric task in which a person's claimed identity is accepted or rejected on the basis of his or her speech utterances. In general, an ASV system consists of five phases: speech data acquisition in digital format, extraction of speaker-related features, enrolment to generate speaker models, pattern matching, and finally making an accept or reject decision. Figure 1 illustrates the basic structure of a speaker verification system.
For speaker modeling, state-of-the-art speaker verification systems use Gaussian mixture models (GMM) with means adapted from a universal background model (UBM) [2], or support vector machines (SVM) over GMM supervectors [3]. Mel-frequency cepstral coefficients (MFCC) are the most popular features, although traditional MFCCs are sensitive to noise interference, which, together with the mismatch between training and testing conditions, degrades the performance of the SV system; combining MFCC with supra-segmental information such as prosody and speaking style improves performance [4]. The main applications of SV technology are person authentication and forensic science [5]. With the growth of wireless telecommunication, many of these applications are now accessed through mobile phones.
RELATED WORKS
Speaker recognition systems have typically used generative models such as Vector Quantization (VQ) [12] or Gaussian Mixture Models (GMMs) [2]. A single-state hidden Markov model (HMM), now known as the Gaussian Mixture Model (GMM), was proposed by Rose as a robust parametric modelling technique for text-independent speaker recognition [6]. Generative models are usually trained under the maximum likelihood (ML) principle. The main disadvantage of the ML approach is that, with a finite amount of training material, it does not generalize well to unseen speech data. To address this problem, the Maximum a Posteriori (MAP) training approach is used together with a universal background model (UBM) [2]. In the MAP approach, prior knowledge of the distribution of the model parameters is incorporated into the modeling process [15].
Besides GMM-UBM, other speaker modeling techniques have been developed recently. The most successful ones include the support vector machine using GMM supervectors (GSV-SVM) [9], which concatenates the GMM mean vectors as the input for SVM training and testing. Another important development is joint factor analysis (JFA) [11], which jointly models the channel subspace and the speaker subspace. Although these innovative methods have achieved rapid progress, GMM-UBM is still the basis of these recent developments.
The primary aim of Voice Activity Detection (VAD), an important sub-component of an SV system, is to identify the speech segments in a given audio signal [17]. In our system, the VAD computes the energy of every frame and discards frames whose energy falls below a threshold, set here to -55 dB.
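A minimal sketch of such an energy-based VAD in Python (the function name, frame sizes and NumPy implementation are illustrative assumptions, not the system's actual code; it assumes 16 kHz audio, 20 ms frames with a 10 ms hop, and log energies measured relative to the loudest frame):

```python
import numpy as np

def energy_vad(signal, frame_len=320, hop=160, threshold_db=-55.0):
    """Drop frames whose relative log energy falls below a fixed threshold.

    Frame sizes assume 16 kHz audio (20 ms frames, 10 ms hop); the
    -55 dB threshold matches the setting described in the text.
    """
    frames = np.array([signal[s:s + frame_len]
                       for s in range(0, len(signal) - frame_len + 1, hop)])
    # Log energy per frame, relative to the maximum frame energy.
    energy = np.sum(frames ** 2, axis=1) + 1e-12
    energy_db = 10.0 * np.log10(energy / energy.max())
    keep = energy_db > threshold_db
    return frames[keep], keep
```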
The principle of feature normalization is to use generic noise suppression techniques to enhance the quality of the extracted features [17]. We use Cepstral Mean Normalization (CMN) and Cepstral Variance Normalization (CVN), which reduce the channel effect.
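A sketch of CMN/CVN over a matrix of cepstral features, assuming a (frames x coefficients) NumPy array (the function is illustrative, not the system's actual implementation):

```python
import numpy as np

def cmvn(features, norm_var=True):
    """Cepstral mean (and optionally variance) normalization.

    The per-utterance mean is subtracted from every frame (CMN), and
    dividing by the per-utterance standard deviation (CVN) maps each
    coefficient to zero mean and unit variance, as described in the text.
    """
    out = features - features.mean(axis=0)
    if norm_var:
        out = out / np.maximum(features.std(axis=0), 1e-8)
    return out
```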
Score normalization transforms the output scores of the speaker verification system to make the detection threshold more effective by aligning the score distributions of the speaker models. T-Norm is one of the most popular score normalization methods: a set of T-Norm speaker models is scored in parallel with the target speaker model [13]. Since the adapted universal background model (UBM) provides fast scoring, T-Norm is efficient in an adapted-UBM system [14].
The rest of the paper is organized as follows: Section III describes the speaker verification database. Section IV details the speaker modeling technique. Feature and score normalization techniques are explained in Section V. The experimental setup, the data used in the experiments and the results obtained are described in Sections VI and VII. Finally, the paper is concluded in Section VIII.
SPEAKER VERIFICATION CORPUS
In this work we use the recently collected Arunachali Language Speech Database (ALS-DB), consisting of 200 speakers from Arunachal Pradesh [19][20]. Arunachal Pradesh in North East India is one of the linguistically richest and most diverse regions in all of Asia, being home to at least thirty and possibly as many as fifty distinct languages in addition to innumerable dialects and subdialects thereof [7].
To study the impact of language variability as well as channel variability on the speaker verification task, ALS-DB was collected in a multilingual, multi-channel environment. Each speaker was recorded in three languages: English, Hindi and a local language belonging to one of the four major Arunachali languages (Adi, Nyishi, Galo and Apatani). Each recording is 4-5 minutes long. Speech data were recorded in parallel across the four recording devices listed in Table 1.
The speech data were collected in a laboratory environment with the air conditioner, server and other equipment switched on. The data were contributed by 120 male and 80 female informants from the 20-50 year age group. During recording, the subject was asked to read a 4-5 minute story from a school book twice in each language, and the second reading was recorded. Each informant participated in four recording sessions, with a gap of at least one week between consecutive sessions.
GMM-UBM AS SPEAKER MODEL
The GMM-UBM approach to speaker verification can be considered primarily a four-phase process. In the first phase, a gender-independent UBM is generated: a GMM trained with the Expectation-Maximization (EM) algorithm on utterances from a very large population of speakers [2]. The target speaker models are then obtained by adapting the means of the UBM to the speaker's training speech using a modified realization of the maximum a posteriori (MAP) approach [2]. In the testing phase, a fast scoring procedure is used to reduce the number of computations [2]: for each feature vector, the top few scoring mixtures in the UBM are determined, and the likelihood of the target speaker model is computed using only those mixtures. This process is repeated for all feature vectors in the test utterance to obtain the average log-likelihood score for the UBM and for the target speaker model. Finally, UBM-based normalization is performed by subtracting the log-likelihood score of the UBM from that of the target speaker model. This serves firstly to minimize the effect of unseen data and secondly to deal with data quality mismatch [2].
Normally, SV systems use mel-frequency cepstral coefficients (MFCCs) as feature vectors, and the speaker model is parameterized by the set λ = {w_i, μ_i, Σ_i}, i = 1, ..., M, where w_i are the mixture weights, μ_i are the mean vectors, and Σ_i are the covariance matrices. In the testing stage, feature vectors X are extracted from a test signal. A log-likelihood ratio Λ(X) is computed by scoring the test feature vectors against the claimant model λ_claimant and the UBM λ_UBM:

Λ(X) = log p(X | λ_claimant) - log p(X | λ_UBM)
The claimant speaker is accepted if Λ(X) ≥ θ and rejected otherwise. An important problem in SV is finding the decision threshold θ [21][22]; the uncertainty in θ is mainly due to score variability between trials.
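The scoring and decision rule can be sketched as follows. This hypothetical example uses scikit-learn's GaussianMixture purely for illustration (its score method returns the mean per-frame log-likelihood), not the paper's actual implementation:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def llr_decision(X, speaker_gmm, ubm, theta=0.0):
    """Average per-frame log-likelihood ratio Lambda(X).

    GaussianMixture.score returns the mean log-likelihood per frame,
    so the difference below is the UBM-normalized score Lambda(X);
    the claimant is accepted when Lambda(X) >= theta.
    """
    llr = speaker_gmm.score(X) - ubm.score(X)
    return llr, llr >= theta

# Toy demonstration with synthetic "speakers".
rng = np.random.RandomState(0)
ubm = GaussianMixture(n_components=2, random_state=0).fit(rng.randn(500, 2))
speaker = GaussianMixture(n_components=2, random_state=0).fit(rng.randn(200, 2) + 5.0)
score, accepted = llr_decision(rng.randn(50, 2) + 5.0, speaker, ubm)
```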
FEATURES AND SCORE NORMALIZATION
Normalizations at the feature extraction stage are applied to reduce the effects of noise, speech signal distortion and channel distortion. State-of-the-art speaker recognition systems use several approaches to enhance performance at the feature level. Cepstral mean subtraction (CMS) [8] is a blind deconvolution that subtracts the utterance mean of the cepstral coefficients from each feature vector. Similarly, cepstral variance normalization (CVN) scales the features so that they fit a zero-mean, unit-variance distribution. Another well-known feature normalization is RASTA (RelAtive SpecTrA) [23]. While CMS focuses on the stationary convolutive noise due to the channel, RASTA reduces the effect of a time-varying channel by removing low and high modulation frequencies [16]. These three are the most commonly used feature normalization techniques in SV systems.
In score normalization, the final score of the SV system is normalized relative to a set of other speaker models termed a cohort. Score normalization techniques derive mainly from the study of Li and Porter [18]. Their main purpose is to transform scores from different speakers into a similar range so that a common, speaker-independent verification threshold can be used [17]. In an SV system the score variability comes from various sources. First, the probable mismatch between the enrollment data used for training speaker models and the data used for testing is one of the main problems in SV. Secondly, the nature and properties of the enrollment data can vary between speakers: the phonetic content, the duration, the environmental noise, and hence the quality of the speaker model training. Intra-speaker and inter-speaker variability also affect the performance of the SV system. Finally, changes in the transmission channel, the recording devices or the acoustic environment are potential factors affecting the reliability of the decision boundaries. Score normalization techniques were introduced to cope with this score variability and to make speaker-independent decision threshold tuning easier [12]. The basic idea of these techniques is to center the imposter score distribution by applying a transformation to each score generated by the SV system.
The general formula for score normalization of a speech signal X against speaker model λ is given as follows:

šλ(X) = (Sλ(X) - μλ) / σλ

where šλ(X) is the normalized score, Sλ(X) is the raw output score, and μλ and σλ are the normalization parameters: the estimated mean and standard deviation of the imposter score distribution. The imposter distribution accounts for the largest part of the score distribution variance.
Several score normalization techniques are used in speaker recognition systems, including Z-Norm, H-Norm, T-Norm, HT-Norm and C-Norm.
Zero normalization (Z-Norm) was widely used in SV in the mid-nineties. Its advantage is that the normalization parameters can be estimated offline during speaker model training.
Test normalization (T-Norm) is performed online during testing: the incoming speech signal is compared with the claimed speaker model as well as with a set of imposter models, in order to estimate the imposter score distribution and, from it, the normalization parameters [12].
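A minimal sketch of T-Norm, assuming the raw score of the target model and the scores of the imposter cohort on the same test utterance are already available (the function and argument names are illustrative):

```python
import numpy as np

def t_norm(target_score, cohort_scores):
    """Test normalization: the raw target score is normalized by the
    mean and standard deviation of the scores that a cohort of imposter
    models produce on the SAME test utterance."""
    mu = np.mean(cohort_scores)
    sigma = np.std(cohort_scores)
    return (target_score - mu) / max(sigma, 1e-8)
```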
Handset T-Norm (HT-Norm) was introduced to deal with handset-type information. Here, handset-dependent normalization parameters are estimated by testing each incoming speech signal against handset-dependent imposter models [12].
EXPERIMENT SETUP OF THE BASELINE SYSTEM
In this work, the baseline speaker verification system was developed using the Gaussian Mixture Model with Universal Background Model (GMM-UBM) modeling approach, with the following specification:
An energy-based silence detector (VAD) with a -55 dB threshold is used to identify and discard silence frames prior to feature extraction.
Front-end features: 42 dimensions in total (36 MFCC-based + 6 prosodic).
MFCC features: 36 dimensions, consisting of 12 MFCC coefficients (excluding the 0th cepstral coefficient) with their first- and second-order derivatives.
Prosodic features: a 6-dimensional vector consisting of pitch, short-time energy and their first- and second-order derivatives (Δpitch, Δenergy, ΔΔpitch and ΔΔenergy).
The coefficients were extracted from speech sampled at 16 kHz with 16 bits/sample resolution.
Frame size: 20 ms
Frame rate: 10 ms
Window: Hamming
Pre-emphasis filter: H(z) = 1 - a·z^-1 with pre-emphasis factor a = 0.97
Number of filterbanks: 24
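The assembly of the 42-dimensional front end described above can be sketched as follows, assuming the base MFCC, pitch and energy streams have already been computed; the regression-style delta formula is a standard choice, not necessarily the exact one used here:

```python
import numpy as np

def deltas(feat, width=2):
    """Regression-based delta features over +/- `width` frames:
    d_t = sum_n n*(c_{t+n} - c_{t-n}) / (2 * sum_n n^2)."""
    T = len(feat)
    padded = np.pad(feat, ((width, width), (0, 0)), mode='edge')
    denom = 2 * sum(n * n for n in range(1, width + 1))
    out = np.zeros_like(feat, dtype=float)
    for n in range(1, width + 1):
        out += n * (padded[width + n:width + n + T] -
                    padded[width - n:width - n + T])
    return out / denom

def assemble_features(mfcc, pitch, energy):
    """Stack 12 MFCCs with deltas and delta-deltas (36 dims), plus
    pitch, energy and their first/second derivatives (6 dims) -> 42 dims.
    `mfcc` is (T, 12); `pitch` and `energy` are length-T vectors."""
    d, dd = deltas(mfcc), deltas(deltas(mfcc))
    pros = np.column_stack([pitch, energy])
    pd, pdd = deltas(pros), deltas(deltas(pros))
    return np.hstack([mfcc, d, dd, pros, pd, pdd])
```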
Cepstral Mean Subtraction (CMS) is applied to all features to reduce the effect of channel mismatch. Cepstral Variance Normalization (CVN), which forces the feature vectors to follow a zero-mean, unit-variance distribution, can additionally be applied at the feature level for more robust results.
A Gaussian mixture model with 512 Gaussian components is used for both the UBM and the speaker models. The UBM was created by training a 256-component male model on 50 male speakers' data and a 256-component female model on 50 female speakers' data with the Expectation-Maximization (EM) algorithm, and then pooling the two models into a single 512-component UBM. The speaker models were created by adapting only the mean parameters of the UBM to the speaker-specific data using the Maximum a Posteriori (MAP) approach.
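Mean-only MAP adaptation can be sketched as follows, given the UBM means and the per-frame mixture posteriors; the relevance factor r = 16 is a conventional default, an assumption rather than the paper's stated value:

```python
import numpy as np

def map_adapt_means(ubm_means, responsibilities, X, r=16.0):
    """Mean-only MAP adaptation of a UBM toward speaker data.

    `responsibilities` is a (T, M) matrix of posteriors gamma_t(i) of
    each mixture i for each frame (e.g. from the UBM's E-step), X is
    (T, D) speaker data, and `r` is the relevance factor. Each mean is
    an interpolation between the data mean and the UBM prior mean,
    weighted by the soft count of frames assigned to that mixture.
    """
    n = responsibilities.sum(axis=0)                        # soft counts n_i
    Ex = (responsibilities.T @ X) / np.maximum(n, 1e-8)[:, None]
    alpha = (n / (n + r))[:, None]                          # adaptation weights
    return alpha * Ex + (1.0 - alpha) * ubm_means
```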
Finally, we applied the T-Norm technique to improve the performance of the SV system. The normalization parameters are estimated using scores derived from a set of imposter speaker models from the 2003 NIST SRE database, trained with the same baseline system.
The detection error trade-off (DET) curve is plotted using the log-likelihood ratio between the claimed model and the UBM, and the equal error rate (EER) obtained from the DET curve is used as the performance measure of the speaker verification system. The minimum detection cost function (MinDCF) value is also evaluated.
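A simple way to compute the EER from lists of target and imposter trial scores is sketched below; this is an illustrative version (real evaluations typically interpolate along the DET curve):

```python
import numpy as np

def compute_eer(target_scores, impostor_scores):
    """Equal error rate: sweep the decision threshold over all observed
    scores and return the operating point where the false-acceptance
    rate (FAR) and false-rejection rate (FRR) are closest."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    far = np.array([np.mean(impostor_scores >= t) for t in thresholds])
    frr = np.array([np.mean(target_scores < t) for t in thresholds])
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2.0
```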
EXPERIMENT AND RESULT
All the experiments reported in this paper were carried out on the ALS-DB database described in Section III. An energy-based VAD is used to identify and discard silence frames prior to feature extraction. Only data from the headset microphone (Device 2) is considered in the present study. All four available sessions were used: each speaker model was trained on the first two complete sessions, and the test sequences were extracted from the remaining two. The training set consists of 120 seconds of speech per speaker. The test set consists of segments of 15, 30 and 45 seconds; it contains more than 3500 test segments of varying length, and each segment is evaluated against 11 hypothesized speakers of the same sex as the segment speaker [10]. The overall performance of the SV system is shown in Fig. 2 as DET curves, together with MinDCF values.
CONCLUSION
In the present study, experiments were carried out on a recently collected speech database (ALS-DB) to evaluate the effectiveness of GMM-UBM speaker verification with feature and score normalization techniques. The experiments show that the performance of the speaker verification system is very poor when only prosodic features are used, but improves when acoustic (MFCC) and prosodic features are combined. We also observe that the performance improves markedly when CVN at the feature level and T-Norm at the score level are applied simultaneously: we obtain an EER of 6.40% with a minimum DCF value of 0.1085. Combining MFCC with prosodic features improves the performance of the SV system by 7.08%, while T-Norm yields an improvement of 3.22% and CVN of 3.90%.
ACKNOWLEDGEMENT
This work has been supported by the ongoing project grant No. 12(12)/2009-ESD sponsored by the Department of Information Technology, Government of India.
Table 1
Table 2

Figure 1
Figure 2
References
- J. P. Campbell (1997), 'Speaker recognition: a tutorial', Proceedings of the IEEE, vol. 85, no. 9, September 1997, pp. 1437-1462.
- D. A. Reynolds, T. F. Quatieri, and R. B. Dunn (2000), 'Speaker Verification Using Adapted Gaussian Mixture Models', Digital Signal Processing, vol. 10(1-3), pp. 19-41.
- W. M. Campbell, D. E. Sturim, and D. A. Reynolds (2006), 'Support vector machines using GMM supervectors for speaker verification', IEEE Signal Processing Letters, vol. 13, pp. 308-311.
- A. Adami, R. Mihaescu, D. Reynolds, and J. Godfrey (2003), 'Modeling prosodic dynamics for speaker recognition', in Proc. ICASSP '03, vol. 4, pp. IV-788-91.
- B. C. Haris, G. Pradhan, A. Misra, S. Shukla, R. Sinha and S. R. M. Prasanna (2011), 'Multi-variability Speech Database for Robust Speaker Recognition', in Proc. NCC, pp. 1-5.
- R. C. Rose and D. A. Reynolds (1990), 'Text independent speaker identification using automatic acoustic segmentation', in Proc. ICASSP, pp. 293-296.
- Arunachal Pradesh, http://en.wikipedia.org/wiki/Arunachal_Pradesh.
- S. Furui (1981), 'Cepstral analysis technique for automatic speaker verification', IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 29, no. 2, pp. 254-272.
- W. M. Campbell, D. E. Sturim and D. A. Reynolds (2006), 'Support vector machines using GMM supervectors for speaker verification', IEEE Signal Processing Letters, vol. 13, no. 5, pp. 308-311.
- NIST 2003 Evaluation plan, http://www.itl.nist.gov/iad/mig/tests/sre/2003/2003-spkrec-evalplan-v2.2.
- S. C. Yin, R. Rose, P. Kenny and P. Dumouchel (2007), 'A joint factor analysis approach to progressive model adaptation in text-independent speaker verification', IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 7, pp. 1999-2010.
- F. J. Bimbot (2004), 'A Tutorial on Text-Independent Speaker Verification', EURASIP Journal on Applied Signal Processing, 2004:4, pp. 430-451.
- D. A. Reynolds and D. E. Sturim, 'Speaker Adaptive Cohort Selection for Tnorm in text-independent speaker verification', MIT Lincoln Laboratory, Lexington, MA, USA.
- D. A. Reynolds (1997), 'Comparison of Background Normalization Methods for Text-Independent Speaker Verification', in Proc. EUROSPEECH '97, Rhodes, Greece, pp. 963-966.
- V. Hautamaki, T. Kinnunen, I. Karkkainen, J. Saastamoinen, M. Tuononen and P. Franti (2008), 'Maximum a Posteriori Adaptation of the Centroid Model for Speaker Verification', IEEE Signal Processing Letters, vol. 15.
- L. P. G. Perera, R. A. Lopez and J. N. Flores (2011), 'Speaker Verification in Different Database Scenarios', Computación y Sistemas, vol. 15, no. 1, pp. 17-26.
- T. Kinnunen and H. Li (2010), 'An Overview of Text-Independent Speaker Recognition: from Features to Supervectors', Speech Communication, vol. 52, no. 1, pp. 12-40.
- K. P. Li and J. E. Porter (1988), 'Normalizations and selection of speech segments for speaker recognition scoring', in Proc. ICASSP '88, vol. 1, pp. 595-598.
- U. Bhattacharjee and K. Sarmah (2012), 'A Multilingual Speech Database for Speaker Recognition', in Proc. IEEE ISPCC.
- U. Bhattacharjee and K. Sarmah (2013), 'Development of a Speech Corpus for Speaker Verification Research in Multilingual Environment', International Journal of Soft Computing and Engineering (IJSCE), ISSN: 2231-2307, vol. 2, no. 6.
- R. Auckenthaler, M. Carey, and H. Lloyd-Thomas (2000), 'Score normalization for text-independent speaker verification systems', Digital Signal Processing, vol. 10, no. 1, pp. 42-54.
- J. Fierrez-Aguilar, D. Garcia-Romero, J. Ortega-Garcia, and J. Gonzalez-Rodriguez (2004), 'Exploiting general knowledge in user-dependent fusion strategies for multimodal biometric verification', in Proc. ICASSP, vol. 5, pp. 617-620.
- H. Hermansky and N. Morgan (1994), 'RASTA processing of speech', IEEE Transactions on Speech and Audio Processing, vol. 2, pp. 578-589.