Speech Recognition using Neural Networks
Shrawan Gaur, Uttam Singh
Department of Computer Science & Engineering,
Faculty of Engineering & Technology, Raja Balwant Singh College,
Bichpuri Campus, Agra (Uttar Pradesh), India.
ABSTRACT: Speech is a natural mode of communication for people. Today, most speech recognition systems are based on the Hidden Markov Model, which supports both acoustic [1] and temporal modeling but makes a number of suboptimal modeling assumptions that limit its potential effectiveness; neural networks avoid these assumptions but do not model temporal structure effectively. We therefore suggest a new system, the NN-HMM hybrid, in which neural networks perform acoustic modeling and HMMs perform temporal modeling. Our experiments and previous research confirm that the NN-HMM hybrid is advantageous over the HMM and NN approaches taken separately.
KEYWORDS: Speech recognition, neural networks, hidden Markov models, hybrid systems, backpropagation, global optimization.
1. INTRODUCTION: We learn all the relevant skills during early childhood, without instruction, and we continue to rely on speech communication throughout our lives. It comes so naturally to us that we do not realize what a complex phenomenon speech is. The human vocal tract and articulators are biological organs with nonlinear properties, whose operation is not only under conscious control but is also affected by factors ranging from gender to upbringing to emotional state [2]. As a result, vocalizations can vary widely in terms of their accent, pronunciation, articulation, roughness, pitch, volume, and speed; moreover, during transmission, our irregular speech patterns can be further distorted by background noise and echoes, as well as by the electrical characteristics of telephones and other electronic equipment. All these sources of variability in speech generation make speech recognition a very complex problem.
2. FUNDAMENTALS OF SPEECH RECOGNITION: Speech recognition is a multileveled pattern recognition task, in which acoustical signals are examined and structured into a hierarchy of subword units, words, and sentences. Each level may provide additional temporal constraints, which can compensate for errors at lower levels. This hierarchy of constraints is best exploited by combining decisions probabilistically at all lower levels and making discrete decisions only at the highest level. The elements of a standard speech recognition system are as follows:
1. Raw speech.
2. Signal analysis.
3. Speech frames.
4. Acoustic models.
5. Acoustic analysis and frame scores.
6. Time alignment.
7. Word sequence[2].
2.2. HIDDEN MARKOV MODELS: The most flexible and successful approach to speech recognition so far has been Hidden Markov Models (HMMs).
2.2.1. BASIC CONCEPTS: A Hidden Markov Model is a collection of states connected by transitions; it begins in a designated initial state. In each discrete time step, a transition is taken into a new state, and then one output symbol is generated in that state. The choice of transition and output symbol is random, governed by probability distributions. The HMM can be thought of as a black box where the sequence of output symbols generated over time is observable, but the sequence of states visited over time is hidden from view. This is why it is called a Hidden Markov Model [6,7]. HMMs have a variety of applications. When an HMM is applied to speech recognition, the states are interpreted as acoustic models, indicating what sounds are likely to be heard during their corresponding segments of speech, while the transitions provide temporal constraints, indicating how the states may follow each other in sequence. Because speech always goes forward in time, transitions in a speech application always go forward (or make a self-loop, allowing a state to have arbitrary duration).
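To make this generative picture concrete, here is a minimal sketch, assuming an invented two-state, two-symbol model whose probabilities are purely illustrative and not drawn from any real speech data:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy HMM, invented for illustration: 2 hidden states, 2 output symbols.
initial = np.array([1.0, 0.0])          # begin in the designated initial state 0
transitions = np.array([[0.6, 0.4],     # P(next state | current state)
                        [0.0, 1.0]])    # state 1 only self-loops (goes "forward")
emissions = np.array([[0.9, 0.1],       # P(output symbol | state)
                      [0.2, 0.8]])
symbols = ["A", "B"]

def generate(n_steps):
    """Run the HMM as a black box: outputs are observable, states stay hidden."""
    state = rng.choice(2, p=initial)
    outputs, states = [], []
    for _ in range(n_steps):
        state = rng.choice(2, p=transitions[state])  # take a random transition
        outputs.append(symbols[rng.choice(2, p=emissions[state])])  # emit symbol
        states.append(state)
    return outputs, states

outputs, hidden = generate(5)
print("observed:", outputs)   # what a recognizer gets to see
print("hidden:  ", hidden)    # the state sequence it never sees
```

Only the printed output sequence is observable; the state sequence remains internal, which is exactly the sense in which the model is "hidden."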
2.2.3. LIMITATIONS OF HMMS: Despite their state-of-the-art performance, HMMs are handicapped by several well-known weaknesses, namely:
1. THE FIRST-ORDER ASSUMPTION, which says that all probabilities depend solely on the current state, is false for speech applications. One consequence is that HMMs have difficulty modeling coarticulation, because acoustic distributions are in fact strongly affected by recent state history.
2. THE INDEPENDENCE ASSUMPTION, which says that there is no correlation between adjacent input frames, is also false for speech applications. In accordance with this assumption, HMMs examine only one frame of speech at a time. In order to benefit from the context of neighboring frames, HMMs must absorb those frames into the current frame (e.g., by introducing multiple streams of data such as delta coefficients).
3. THE HMM PROBABILITY DENSITY MODELS have suboptimal modeling accuracy. Specifically, discrete-density HMMs suffer from quantization errors, while continuous- or semi-continuous-density HMMs suffer from model mismatch, i.e., a poor match between their a priori choice of statistical model and the true density of acoustic space. [2,4]
4. THE MAXIMUM LIKELIHOOD TRAINING CRITERION leads to poor discrimination between the acoustic models (given limited training data and correspondingly limited models). Discrimination can be improved using the Maximum Mutual Information training criterion, but this is more complex and difficult to implement properly.
We will argue that neural networks mitigate each of the above weaknesses (except the First-Order Assumption) while requiring relatively few parameters, so that a neural-network-based speech recognition system can achieve equivalent or better performance with less complexity.
3. REVIEW OF NEURAL NETWORKS: There are many different types of neural networks, but they all have four basic attributes (illustrated in the sketch after this list):
• A set of processing units;
• A set of connections;
• A computing procedure;
• A training procedure.[4]
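A minimal sketch shows how these four attributes map onto concrete data structures; the layer sizes and the sigmoid nonlinearity are arbitrary choices for illustration:

```python
import numpy as np

# 1. Processing units: activation values for a layer of input units and
#    a layer of output units (sizes chosen only for illustration).
n_in, n_out = 16, 4
x = np.random.rand(n_in)                          # input unit activations

# 2. Connections: a weight matrix linking every input unit to every output unit.
W = np.random.uniform(-0.1, 0.1, (n_out, n_in))
b = np.zeros(n_out)

# 3. Computing procedure: each unit sums its weighted inputs and passes the
#    result through a nonlinearity (here, the sigmoid).
def forward(x):
    return 1.0 / (1.0 + np.exp(-(W @ x + b)))

y = forward(x)                                    # output unit activations

# 4. Training procedure: a rule for adjusting W and b from examples, such as
#    the backpropagation algorithm sketched in the next section.
```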
3.5. BACKPROPAGATION: Backpropagation, also known as Error Backpropagation or the Generalized Delta Rule, is the most widely used supervised training algorithm for neural networks.
Backpropagation is a relatively fast learning procedure, but it can still take a long time to converge to an optimal set of weights. Because backpropagation is a simple gradient descent procedure, it is unfortunately susceptible to the problem of local minima, i.e., it may converge upon a set of weights that are locally optimal but globally suboptimal. Experience has shown that local minima tend to cause more problems for artificial domains (as in boolean logic) than for real domains (as in perceptual processing), reflecting a difference of terrain in weight space. In any case, it is possible to deal with the problem of local minima by adding noise to the weight modifications. [5]
3.5.1. Algorithm for implementing backpropagation learning (a code sketch follows the list):
1. Present a training sample to the neural network.
2. Compare the network's output to the desired output for that sample, and calculate the error in each output neuron.
3. For each output neuron, calculate what the output should have been, along with a scaling factor indicating how much lower or higher the output must be adjusted to match the desired output. This is the local error.
4. Adjust the weights of each neuron to lower its local error.
5. Assign "blame" for the local error to neurons at the previous level, giving greater responsibility to neurons connected by stronger weights.
6. Repeat the steps above for the neurons at the previous level, using each one's "blame" as its error. [3]
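The sketch below implements these six steps for a one-hidden-layer sigmoid network; the layer sizes, learning rate, and XOR toy data are assumptions made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

n_in, n_hid, n_out, lr = 2, 4, 1, 0.5            # illustrative sizes only
W1 = rng.uniform(-1, 1, (n_hid, n_in)); b1 = np.zeros(n_hid)
W2 = rng.uniform(-1, 1, (n_out, n_hid)); b2 = np.zeros(n_out)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
T = np.array([[0], [1], [1], [0]], float)        # XOR targets

for epoch in range(5000):
    for x, t in zip(X, T):
        # Step 1: present a training sample (forward pass).
        h = sigmoid(W1 @ x + b1)
        y = sigmoid(W2 @ h + b2)
        # Steps 2-3: output error, scaled by the sigmoid slope y(1-y),
        # gives each output neuron's local error (its "delta").
        d_out = (t - y) * y * (1 - y)
        # Step 5: assign "blame" to hidden neurons in proportion to the
        # strength of their connections to the output layer.
        d_hid = (W2.T @ d_out) * h * (1 - h)
        # Steps 4 and 6: adjust weights at both levels to lower the errors.
        W2 += lr * np.outer(d_out, h); b2 += lr * d_out
        W1 += lr * np.outer(d_hid, x); b1 += lr * d_hid

for x, t in zip(X, T):                            # network now approximates XOR
    print(x, "->", sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2))
```

As noted above, a little noise could also be added to each weight update to help the procedure escape local minima.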
4. NN-HMM HYBRIDS: We know that neural networks are excellent at acoustic modeling and parallel implementations, but weak at temporal and compositional modeling. We also know that Hidden Markov Models are good models overall, but they have some weaknesses too. In this section we will review ways in which researchers have tried to combine these two approaches into various hybrid systems, capitalizing on the strengths of each approach.
Lippmann and Gold (1987) introduced the Viterbi Net, a neural network that implements the Viterbi algorithm. The input is a temporal sequence of speech frames, presented one at a time, and the final output (after T time frames) is the cumulative score along the Viterbi alignment path, permitting isolated word recognition via subsequent comparison of the outputs of several Viterbi Nets running in parallel.
4.1. FRAME LEVEL TRAINING: Rather than simply reimplementing an HMM using neural networks, most researchers have been exploring ways to enhance HMMs by designing hybrid systems that capitalize on the respective strengths of each technology: temporal modeling in the HMM and acoustic modeling in neural networks. In particular, neural networks are often trained to compute emission probabilities for HMMs. Neural networks are well suited to this mapping task, and they also have a theoretical advantage over HMMs: unlike discrete-density HMMs, they can accept continuous-valued inputs and hence do not suffer from quantization errors; and unlike continuous-density HMMs, they make no dubious assumptions about the parametric shape of the density function. The simplest approach is to map frame inputs directly to emission symbol outputs and to train such a network on a frame-by-frame basis; this is called frame-level training.
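A minimal sketch of frame-level training follows, using synthetic frames and labels as stand-ins for real aligned speech; dividing the network's posteriors P(state | frame) by the class priors to obtain scaled likelihoods, as in the last lines, is a standard step in hybrid systems, since the HMM needs emission scores proportional to P(frame | state):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: 1000 frames of 16 coefficients, each labeled with one
# of 10 phoneme-like HMM states (real systems use actual aligned speech).
n_frames, n_coeffs, n_states = 1000, 16, 10
frames = rng.normal(size=(n_frames, n_coeffs))
labels = rng.integers(0, n_states, n_frames)

# A single softmax layer; a deeper network would be trained the same way.
W = np.zeros((n_states, n_coeffs)); b = np.zeros(n_states)

def posteriors(x):
    z = W @ x + b
    e = np.exp(z - z.max())
    return e / e.sum()                      # P(state | frame)

lr = 0.05
for epoch in range(20):
    for x, y in zip(frames, labels):        # trained on a frame-by-frame basis
        g = posteriors(x)
        g[y] -= 1.0                         # cross-entropy gradient w.r.t. logits
        W -= lr * np.outer(g, x); b -= lr * g

# At recognition time, divide posteriors by the class priors to get scaled
# likelihoods, usable as emission scores inside the HMM.
priors = np.bincount(labels, minlength=n_states) / n_frames
def emission_scores(x):
    return posteriors(x) / priors
```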
4.2. SEGMENT LEVEL TRAINING: An alternative to frame-level training is segment-level training, in which a neural network receives input from an entire segment of speech (e.g., the whole duration of a phoneme), rather than from a single frame or a fixed window of frames. This allows the network to take better advantage of the correlation that exists among all the frames of the segment, and also makes it easier to incorporate segmental information, such as duration. The drawback of this approach is that the speech must first be segmented before the neural network can evaluate the segments.
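One simple, purely illustrative way to feed a whole variable-length segment to a fixed-size network is to resample it to a fixed number of frames and append the duration as an explicit segmental feature:

```python
import numpy as np

def segment_input(segment, n_slices=5):
    """Condense a (duration, n_coeffs) segment into one fixed-size input.

    The segment is resampled to n_slices evenly spaced frames, so the network
    sees the whole phoneme at once, and the duration is appended so segmental
    information is available explicitly.
    """
    duration = len(segment)
    idx = np.linspace(0, duration - 1, n_slices).round().astype(int)
    return np.concatenate([segment[idx].ravel(), [duration]])

seg = np.random.rand(23, 16)     # a 23-frame segment of 16 coefficients
x = segment_input(seg)
print(x.shape)                   # (81,) = 5 slices * 16 coefficients + duration
```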
4.3. WORD LEVEL TRAINING: A natural extension of segment-level training is word-level training, in which a neural network receives input from an entire word and is directly trained to optimize word classification accuracy. Word-level training is appealing because it brings the training criterion still closer to the ultimate testing criterion of sentence recognition accuracy. Unfortunately, the extension is nontrivial: in contrast to a simple phoneme, a word cannot be adequately modeled by a single state but requires a sequence of states, and the activations of these states cannot simply be summed over time but must first be segmented by a dynamic time warping (DTW) procedure, identifying which states apply to which frames. Thus, word-level training requires that DTW be embedded into a neural network.
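A minimal sketch of that DTW step is given below, assuming per-frame state scores are already available (e.g., from a network); only "stay" and "advance one state" moves are allowed, matching the left-to-right word models discussed earlier:

```python
import numpy as np

def dtw_align(frame_scores):
    """frame_scores[t, s]: score of word-state s for frame t (higher is better).

    Returns the state assigned to each frame along the best left-to-right
    path, where each step either stays in a state or advances to the next.
    """
    T, S = frame_scores.shape
    cost = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    cost[0, 0] = frame_scores[0, 0]            # path must start in state 0
    for t in range(1, T):
        for s in range(S):
            stay = cost[t - 1, s]
            advance = cost[t - 1, s - 1] if s > 0 else -np.inf
            back[t, s] = s if stay >= advance else s - 1
            cost[t, s] = max(stay, advance) + frame_scores[t, s]
    path, s = [], S - 1                        # path must end in the last state
    for t in range(T - 1, -1, -1):             # trace back pointers
        path.append(s)
        s = back[t, s]
    return path[::-1]

scores = np.random.rand(20, 4)                 # 20 frames, a 4-state word model
print(dtw_align(scores))                       # e.g. [0, 0, 1, 1, 1, ..., 3]
```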
4.4. GLOBAL OPTIMIZATION: The trend in NN-HMM hybrids has been towards global optimization of system parameters, i.e., relaxing the rigidities in a system so its performance is less handicapped by false assumptions. Segment-level training and word-level training are two important steps towards global optimization, as they bypass the rigid assumption that frame accuracy is correlated with word accuracy, making the training criterion more consistent with the testing criterion. Another step towards global optimization, pursued by Bengio et al. (1992), is the joint optimization of the input representation with the rest of the system. Bengio proposed an NN-HMM hybrid in which the speech frames are produced by a combination of signal analysis and neural networks; the speech frames then serve as inputs for an ordinary HMM. The neural networks are trained to produce increasingly useful speech frames by backpropagating an error gradient that derives from the HMM's own optimization criterion, so that the neural networks and the HMM are optimized simultaneously. This technique was evaluated on the task of speaker-independent plosive recognition, i.e., distinguishing between the phonemes /b,d,g,p,t,k,dx,other/. When the HMM was trained separately from the neural networks, recognition accuracy was only 75%; but when it was trained with global optimization, recognition accuracy jumped to 86%. [2,5]
5. EXPERIMENTAL RESULTS: We performed our experiments on NN-HMM hybrids using two different databases: the CMU Conference Registration database and the DARPA Resource Management database. The CMU Conference Registration database (Wood 1992) consists of 204 English sentences using a vocabulary of 402 words, comprising 12 hypothetical dialogs in the domain of conference registration. Training and testing versions of this database were recorded with a close-speaking microphone in a quiet office by multiple speakers, for speaker-dependent experiments. Recordings were digitized at a sampling rate of 16 kHz; a Hamming window and an FFT were then applied, to produce 16 mel-scale spectral coefficients every 10 msec.
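That preprocessing can be sketched in plain NumPy; the 25 msec analysis window and 512-point FFT are assumptions, since the text above specifies only the 16 kHz sampling rate, the Hamming window, the 10 msec frame rate, and the 16 mel-scale coefficients:

```python
import numpy as np

def hz_to_mel(f):  return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m):  return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fbank[i - 1, k] = (k - l) / max(c - l, 1)   # rising edge
        for k in range(c, r):
            fbank[i - 1, k] = (r - k) / max(r - c, 1)   # falling edge
    return fbank

def mel_spectral_frames(signal, sr=16000, frame_ms=25, hop_ms=10, n_mels=16):
    """Hamming window + FFT + mel filterbank, one frame every hop_ms msec."""
    frame_len, hop, n_fft = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000), 512
    window = np.hamming(frame_len)
    fbank = mel_filterbank(n_mels, n_fft, sr)
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        spectrum = np.abs(np.fft.rfft(signal[start:start + frame_len] * window,
                                      n_fft)) ** 2
        feats.append(np.log(fbank @ spectrum + 1e-10))  # 16 log-mel coefficients
    return np.array(feats)

signal = np.random.randn(16000)    # one second of noise as a stand-in waveform
print(mel_spectral_frames(signal).shape)   # (98, 16): one frame per 10 msec
```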
In order to fairly compare our results against those of researchers outside of CMU, we also ran experiments on the DARPA speaker-independent Resource Management database (Price et al. 1988). This is a standard database consisting of 3990 training sentences in the domain of naval resource management, recorded by 109 speakers contributing roughly 36 sentences each; this training set has been supplemented by periodic releases of speaker-independent testing data over the years, for comparative evaluations. [2,5] The following table compares the word accuracy and parameter count of several systems evaluated on this database:
System      | Type   | Parameters | Word Accuracy
MLP         | NN-HMM | 41,000     | 89.2%
MS-TDNN     | NN-HMM | 67,000     | 90.5%
MLP (ICSI)  | NN-HMM | 156,000    | 87.2%
CI-Sphinx   | HMM    | 111,000    | 84.4%
CI-Decipher | HMM    | 126,000    | 86.0%
Decipher    | HMM    | 5,500,000  | 95.1%
Sphinx-II   | HMM    | 9,217,000  | 96.2%
6. CONCLUSION: The field of speech recognition has seen tremendous activity in recent years. Hidden Markov Models still dominate the field, but many researchers have begun to explore ways in which neural networks can enhance the accuracy of HMM-based systems. Researchers working on NN-HMM hybrids have explored many techniques (e.g., frame-level training, segment-level training, word-level training, global optimization), many issues (e.g., temporal modeling, parameter sharing, context dependence, speaker independence), and many tasks (e.g., isolated word recognition, continuous speech recognition, word spotting). These explorations have especially proliferated since 1990. Finally, NN-HMM hybrids offer several theoretical advantages over standard HMM speech recognizers. Specifically:
1. Modeling accuracy.
2. Context sensitivity.
3. Discrimination.
4. Economy.
7. ACKNOWLEDGMENTS
The authors wish to express their appreciation to Reader Er. B.K. Singh of F.E.T., R.B.S. College, for several fruitful discussions on neural networks and artificial intelligence. The authors would also like to thank the Department of Computer Science & Engineering of F.E.T., R.B.S. College, for providing the basic infrastructure and for their constant support.
8. REFERENCES:
1. Waibel, A., Jain, A., McNair, A., Saito, H., Hauptmann, A., and Tebelskis, J. (1991). Janus: A Speech-to-Speech Translation System using Connectionist and Symbolic Processing Strategies. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, 1991.
2. Tebelskis, J. (1995). Speech Recognition using Neural Networks. PhD Thesis, CMU-CS-95-142, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213-3890.
3. Lang, K. (1989). A Time-Delay Neural Network Architecture for Speech Recognition. PhD Thesis, Carnegie Mellon University.
4. Lippmann, R. (1989). Review of Neural Networks for Speech Recognition. Neural Computation 1(1):1-38, Spring 1989. Reprinted in Waibel and Lee (1990).
5. Sondhi, M. M., and Roe, D. (1983), unpublished report, AT&T Bell Labs.
6. "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of lhe IEEE. 77, 257-286.
7. Ferguson, J. D. (1980). Hidden Markov Analysis: An Introduction. In Hidden Markov Models for Speech, ed. J. D. Ferguson. Princeton, NJ: Institute for Defense Analyses, pp. 8-15.