ISSN ONLINE(2320-9801) PRINT (2320-9798)

Duration Modeling For Telugu Language with Recurrent Neural Network

V.S.Ramesh Bonda, P.N.Girija
Professor, School of Computer & Information Sciences, University of Hyderabad, Hyderabad, Andhra Pradesh, India

International Journal of Innovative Research in Computer and Communication Engineering

Abstract

In this paper, a novel syllable duration modeling approach for Telugu speech is proposed. The duration of a syllable is influenced by its positional and contextual variations. Multiple linguistic features of syllables at different levels, such as positional and contextual features, are extracted from text. Duration values of syllables are extracted using the speech analysis software PRAAT. The duration of a syllable is predicted by a Recurrent Neural Network (RNN) algorithm. As preliminary work, a small speech database is used to predict syllable duration with the proposed RNN algorithm. Experiments are conducted with different sets of features.

Keywords

duration, speech synthesis, recurrent neural networks, syllables, Parts of Speech, Positional and contextual features

INTRODUCTION

In recent years, most speech researchers have used unit selection procedures for speech synthesis. First, the input text is normalized by expanding abbreviations, acronyms, numbers and all non-standard words. Recent findings suggest the use of data-driven methods, such as neural networks or statistical methods, to generate prosodic information and to achieve naturalness and fluency in automatic speech synthesizers [1]. Since duration is one of the important prosodic features, this work proposes to predict it.
The human brain [2] has three types of memory, long-term, short-term and mid-term, which were studied in [3, 4] and [5]. It has been shown that RNNs use short-term memory. In an RNN, the connections between units form a directed cycle, which allows the network to exhibit dynamic temporal behavior. One or more feedback connections pass the output of a neuron in a certain layer to the previous layer(s). Because of these cycles, the network cannot be strictly divided into layers. RNNs are reported to be better than traditional machine learning methods at learning many behaviors, sequence processing tasks, algorithms and programs.
A feed-forward neural network has been used to predict duration for Telugu [6]. RNNs have been used to predict prosodic information for Persian, Chinese and Mandarin [7]. Recurrent data input also helps to smooth the output parameter tracks [8]. RNNs inherently implement short-term memory by allowing the output of a neuron to influence its input, either directly or indirectly via its effect on other neurons [9]. Cognitive processes and more practical applications are expected to require higher-level architectures. This is a solid reason to investigate recurrent neural networks, even though feed-forward networks have shown good results in many practical applications, from classification to time-series prediction. The present work therefore proposes to predict syllable duration for Telugu with an RNN, since RNNs are better at learning sequence processing tasks than simple feed-forward networks. Linguistic features are presented at the input nodes of the RNN so that it learns the duration rules of syllables automatically, and the predicted syllable duration is obtained at the output node.

RELATED WORK

Neural networks are useful for applications such as pattern recognition and data classification, which they handle through a learning process. RNNs have been applied in a variety of areas, including pattern recognition, classification, image processing, combinatorial optimization and communication systems [10]. A suitable algorithm should be chosen for modeling the duration of basic units. RNNs can incorporate contextual or temporal dependencies in a natural way and can include cyclic connections between neurons. They preserve some history of previous states through their recurrent links and have accordingly been used widely in the processing of temporal patterns [11]. Recurrent networks are built in such a way that the outputs of some neurons are fed back to the same neurons or to neurons in the preceding layers [12]. This helps in handling forward and backward coarticulation effects.
RNNs have an intrinsic dynamic memory: their outputs at a given instant reflect the current input as well as previous inputs and outputs, whose influence is gradually quenched. It has been shown how the synergistic combination of different local plasticity mechanisms can shape the global structure and dynamics of RNNs in meaningful and adaptive ways [13]. A multilayer perceptron (MLP) and an RNN are employed as local experts to discriminate time-invariant and time-variant phonemes, respectively [14]. An RNN can learn the temporal relationships of speech data and is capable of modeling time-dependent phonemes.
An RNN can be trained to associate unknown input data with learned words [15]. Neural network recognizers based on static networks, such as the MLP, and on dynamic networks, such as the RNN [16] or the Time Delay Neural Network (TDNN) [17], use a parametric representation of the activation function. The exact label of the phoneme is determined at low-level classification using an RNN [18]. RNNs exhibit better performance in the nonlinear channel equalization problem [19]. To circumvent this difficulty, an ad hoc solution has been suggested to back-propagate the output error through this heterogeneous configuration.
RNNs are a powerful connectionist model that can be applied to many challenging sequential problems, including problems that naturally arise in language and speech [20]. However, RNNs are extremely hard to train on problems with long-term dependencies, where events must be remembered for many time steps before they are used. The Temporal-Kernel Recurrent Neural Network (TKRNN) is reported to be very efficient for long-term dependencies, but it is outside the scope of this work and is not discussed here.
Several types of features have been used for duration modeling, and the most relevant work is briefly reviewed here. One approach employs a simple three-layer RNN to learn the relationship between input prosodic features, with input syllable boundaries, and output word-boundary information [21]. Its experimental results show that the proposed Recurrent Fuzzy Neural Network (RFNN) can generate proper prosodic features, including pitch means, pitch shapes, maximum energy levels, syllable duration and pause duration. The linguistic representation is usually a complex structure that includes information about the word sequence, Parts of Speech (POS) tags, prosodic phase information, fundamental frequency, energy and pause. A mixture-of-RNN-experts type of model provides robustness against changes of the features during learning, but it lacks the ability to extract common patterns contained in the sequences because of the independence of the local representation [22]. The local representation is constructed in orthogonal units, while the global representation is constructed in internal units using the connection weights between I/O units and internal units. Methods for processing speech data are described in [23].
The input to the neural network consists of a set of features corresponding to phonological, positional and contextual information extracted from the text [24]. The relative importance of the positional and contextual features is examined separately. A two-stage duration model is proposed for improving the accuracy of duration prediction. A multi-level prosodic model based on the estimation of prosodic features is considered in [25]. Different linguistic units are used at each level to represent different scales of prosodic variation (local and global) for syllable-based duration modeling. Local and global variations are associated with the phonological properties of these levels (coarticulation, syllabic structure, accentuation). Intermediate variations are defined on sets of units that are larger than the syllable and more or less linguistically well defined (accentual group, interpausal group, prosodic group, intonational phrase, period, verbal construction, discourse sequence, ...). They are associated with more or less linguistic factors: physiological (f0 declination), modalities (questions, ...), syntactical (prosodic contrasts related to some specific syntactical sequence), semantic (informational structure) and discourse. For duration, a phone-based [26] or a syllable-based [27,28] representation is considered.
A syllable-based duration model based on multi-level context-dependent analysis is proposed in [30]. In contrast to models that represent durational features on a single linguistic unit (phoneme, syllable), the proposed approach shows several advantages: 1) distinguishing several linguistic units in the representation of durational feature variation makes explicit the superposition of prosodic forms jointly observed on a given unit; 2) each prosodic level (speech rate, duration syllabic residual, ...) can be modeled and controlled independently of the others; and 3) the set of linguistic features affecting each linguistic unit can be estimated independently. That experiment uses low-level linguistic features such as location features (position of a given unit within higher-level units), weight features (number of observations of a given linguistic unit within higher-level units) and phonological features (syllabic structure and prominence).
A combination of constraints and statistical analysis in multiword acquisition is outlined in [29]. Word forms in inflectional languages encode rich morpho-syntactic information that constrains the possible syntactic structures. This type of information is useful for extracting linguistic knowledge by means of machine learning and statistical learning methods without resorting to parsing. Some approaches also make use of basic linguistic knowledge in the form of a heuristic method that uses language-specific character frequencies together with language-specific lists of function words and word endings [30]. Common to all of these approaches is that the granularity of language identification is either a sentence or at most a word.

SPEECH DATABASE

In the current work, the speech of a TV9 male speaker is recorded. Since Telugu is syllabic in nature, duration is predicted at the syllable level. Both the production and the perception of Telugu speech can be described in terms of syllable-like units, and such units also capture some coarticulation effects. The syllable-like units considered are V, CV, CCV, CCVC and CVCC, where C is a consonant and V is a vowel. The database consists of TV9 news data in the Telugu language. Syllables of the form CV, CVC, CCVC and CVCC are extracted from text. The speech signal is sampled at 16 kHz and encoded as 16-bit data. The speech utterances are manually transcribed into text using WX notation. Telugu has a set of 56 characters, which can be represented in V, CV and CCV forms. The TV9 speech is organized into syllables, words and phrases.
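The syllable structures listed above can be illustrated with a minimal Python sketch that maps a transliterated syllable to its C/V pattern. The vowel set `WX_VOWELS`, the helper names and the example syllables are assumptions made for illustration; the paper does not list its symbol inventory, and the sketch assumes one symbol per phoneme.

```python
# Assumption: a simplified vowel inventory for a WX-style romanization in which
# every phoneme is written with exactly one symbol. Adjust to the scheme in use.
WX_VOWELS = set("aAiIuUeEoOq")


def cv_pattern(syllable: str) -> str:
    """Map each symbol to 'V' (vowel) or 'C' (consonant), e.g. 'kA' -> 'CV'."""
    return "".join("V" if ch in WX_VOWELS else "C" for ch in syllable)


def group_by_pattern(syllables):
    """Group syllables by their C/V structure, preserving input order."""
    groups = {}
    for syl in syllables:
        groups.setdefault(cv_pattern(syl), []).append(syl)
    return groups


if __name__ == "__main__":
    # Hypothetical syllables, not taken from the paper's database.
    print(group_by_pattern(["A", "kA", "prA", "rAm"]))
```

Grouping syllables this way is one possible first step before measuring how duration varies across structures such as CV and CVC.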

DURATION MODELING

In continuous speech, different factors affect the duration of the basic units. They are classified into phonological, positional and contextual factors. The duration of a syllable may be influenced by the category of the vowel present in the syllable, the category of the consonant(s) associated with the vowel, the position of the vowel and so on. Positional variations depend on where the basic unit occurs, for example in word-initial position, in word-final position, at a phrase boundary or at the end of a sentence. Similarly, contextual variations occur due to the influence of the preceding and following units on the present unit.
The RNN architecture [24] consists of three layers: an input layer, a hidden layer and an output layer. The input layer has 25 nodes. Their output is passed to a hidden layer of 40 nodes, which is connected to a single output node. The tanh activation function is used at the hidden layer. The most widely used training algorithm for RNNs is error back propagation. The aim of the algorithm is to adjust the weights from the output units to the hidden-layer units, and in turn from the hidden-layer units to the input units, so as to minimize the discrepancy between the network's output and its desired target output. In back propagation this is done by propagating the error (t - o), where t is the target vector and o is the network's output vector for a given training vector, back through the network in such a way that the weights are gradually adjusted toward optimal values. In this work the objective is to adjust the weights of the network to minimize the mean squared error of each syllable's duration. This process is not deterministic, and the networks do not always converge to the same solution.
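The forward pass of such a 25-40-1 network and the mean squared error objective can be sketched as follows. This is a minimal illustration, assuming an Elman-style recurrence (the hidden state is fed back into the hidden layer), a linear output node, no bias terms and small random initial weights; none of these details are specified in the paper, and the class and function names are hypothetical. Training by back propagation is omitted.

```python
import math
import random

N_IN, N_HID, N_OUT = 25, 40, 1  # layer sizes stated in the text


def init_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; the initialization scheme is an assumption."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class ElmanRNN:
    """Assumed Elman-style network: tanh hidden layer with recurrent feedback."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.w_in = init_matrix(N_HID, N_IN, rng)    # input -> hidden
        self.w_rec = init_matrix(N_HID, N_HID, rng)  # hidden -> hidden (feedback)
        self.w_out = init_matrix(N_OUT, N_HID, rng)  # hidden -> output

    def forward(self, sequence):
        """Return one predicted duration per 25-dimensional input vector."""
        h = [0.0] * N_HID  # hidden state starts at zero
        outputs = []
        for x in sequence:
            # The new hidden state depends on the current input and the previous state.
            h = [
                math.tanh(
                    sum(w * xi for w, xi in zip(self.w_in[j], x))
                    + sum(w * hi for w, hi in zip(self.w_rec[j], h))
                )
                for j in range(N_HID)
            ]
            # Linear output node (an assumption; the paper names only tanh at the hidden layer).
            outputs.append(sum(w * hi for w, hi in zip(self.w_out[0], h)))
        return outputs


def mean_squared_error(targets, outputs):
    """Average of (t - o)^2 over all syllables."""
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / len(targets)
```

Because the hidden state carries over from one syllable to the next, the prediction for a syllable depends on the syllables that precede it in the sequence, which is the short-term memory property discussed earlier.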
In this work, linguistic features such as lexical features (syllable identity, syllable nucleus), positional features and contextual features are used. The syllable's position in a phrase and in a word, the syllable identity, the context of the syllable and the syllable nucleus are considered as input features for the RNN [12]. Details of the features considered at different levels are shown in Table 1.
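As a rough illustration of positional and contextual features, the sketch below derives a few of them from a phrase given as a list of words, each a list of syllables. The field names and the example phrase are hypothetical and do not reproduce the feature set of Table 1.

```python
def syllable_features(phrase):
    """phrase: list of words, each word a list of syllable strings.

    Returns one feature dict per syllable. The fields are illustrative only.
    """
    # Flatten the phrase while remembering each syllable's word and in-word index.
    flat = [
        (w_idx, s_idx, syl)
        for w_idx, word in enumerate(phrase)
        for s_idx, syl in enumerate(word)
    ]
    features = []
    for i, (w_idx, s_idx, syl) in enumerate(flat):
        word = phrase[w_idx]
        features.append({
            "syllable": syl,                       # lexical: syllable identity
            "pos_in_word": s_idx,                  # positional features
            "word_initial": s_idx == 0,
            "word_final": s_idx == len(word) - 1,
            "pos_in_phrase": i,
            "phrase_final": i == len(flat) - 1,
            "prev_syllable": flat[i - 1][2] if i > 0 else None,  # contextual features
            "next_syllable": flat[i + 1][2] if i + 1 < len(flat) else None,
        })
    return features


if __name__ == "__main__":
    # Hypothetical two-word phrase, not taken from the paper's database.
    for row in syllable_features([["rA", "mu"], ["va", "ccA", "du"]]):
        print(row)
```

Before being presented to a network, such symbolic features would still have to be encoded numerically; that encoding is not described in the text and is left out here.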
The experiment is carried out in two phases, training and testing, using a train set and a test set respectively. In the training phase, the durations of the syllables are first measured manually. For each syllable, the features extracted from text are given as the input vector. The corresponding manually measured syllable durations are given as target outputs to the RNN models, and these models are trained for 100 epochs. The training error is estimated for different combinations of input features and is shown in Figure 1, Figure 2 and Figure 3. In the test phase, the predicted syllable duration is compared with the duration of the corresponding syllable in the test data. The difference between the actual duration and the predicted duration is taken as the duration deviation. The deviation of duration for different syllable classes is shown in Table 2.
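The duration deviation described above can be sketched as a per-class average. This is a minimal illustration, assuming the deviation is the absolute difference in milliseconds (the text only says "difference"); the function names and any sample values passed to them are hypothetical and are not the paper's results.

```python
from collections import defaultdict


def duration_deviation(actual_ms, predicted_ms):
    """Absolute difference between measured and predicted duration (assumed)."""
    return abs(actual_ms - predicted_ms)


def mean_deviation_by_class(records):
    """records: iterable of (syllable_class, actual_ms, predicted_ms) tuples.

    Returns the mean deviation for each syllable class.
    """
    totals = defaultdict(lambda: [0.0, 0])  # class -> [sum of deviations, count]
    for cls, actual, predicted in records:
        totals[cls][0] += duration_deviation(actual, predicted)
        totals[cls][1] += 1
    return {cls: total / count for cls, (total, count) in totals.items()}
```

A signed difference could be used instead if the direction of the error (over- or under-prediction) matters for the analysis.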

RESULTS

The outputs of the RNN with different input features are shown in Figure 1, Figure 2 and Figure 3. Figure 1 shows that the training error decreases to zero when the RNN is trained with lexical, positional and contextual input features. In Figure 2, where the RNN is trained with lexical and positional input features, the training error does not reach zero, which indicates that these input features alone are not sufficient to predict duration correctly. In Figure 3, the training error comes close to zero with lexical and contextual input features, but at the beginning it decreases more slowly than with lexical, positional and contextual features together. These experiments indicate that the combination of lexical, positional and contextual input features gives better duration prediction.

CONCLUSIONS AND FUTURE WORK

A Recurrent Neural Network is used for predicting syllable duration. Duration values are predicted based on phonological, positional and contextual information about syllables at the phrase and word levels. The predicted duration values are observed to be similar to those reported in [2,3]. The performance of the neural network is evaluated through error values obtained with different combinations of input features. These error values are computed as the difference between the actual duration and the predicted duration. The values are shown in Table 2, and it is observed that the deviation of duration follows the order voiced sounds, nasals < dentals < palatals, labials, stops < fricatives, alveolar < velars, liquids < affricates and unvoiced sounds. In future, performance may be improved by considering the accent and prominence of the syllable as additional features. It is also proposed to study further additional input features. Instead of presenting all input features to the RNN at a single layer, a hierarchical approach with increasing size of the basic unit is to be tried.
 

Tables at a glance

Table 1 Table 2

Figures at a glance

Figure 1 Figure 2 Figure 3
 

References