Speech enhancement is the term used to describe algorithms or devices whose purpose is to improve perceptual aspects (quality) and intelligibility of speech for the human listener, or to improve the speech signal so that it may be better exploited by other speech processing algorithms. Speech enhancement algorithms have been applied to problems as diverse as background noise removal, cancellation of reverberation, and multi-speaker separation in modern speech communication systems. In a quiet environment, when speaker and listener are near each other, communication is usually easy and accurate.
However, at a large distance or in a noisy environment, the listener's ability to understand suffers. Background noise is a severe problem in speech communication systems, as it degrades the perceptual quality of the speech signal. In this paper, several short-time spectral amplitude (STSA) based speech enhancement methods are discussed and their performance is evaluated for different types of noise. Experimental and simulation results are presented.
These methods are very effective in suppressing additive white noise, but spectral subtraction based STSA methods generate an artefact known as musical noise. Modifications are suggested to overcome the reported problems.
Speech, the primary means of communication among humans, results from a complex interaction between vocal-fold vibration in the larynx and voluntary movements of the articulators (mouth, tongue, jaw, etc.). Human beings use speech to communicate messages. In a quiet environment, when speaker and listener are near each other, communication is usually easy and accurate. However, at a large distance or in a noisy environment, the listener's ability to understand suffers.
Historically, the need for advanced speech signal enhancement in communication engineering arose soon after Alexander Graham Bell invented the telephone in 1876. At first, however, speech signal transmission, processing and reception were analog in nature, used only wired links, and served a limited number of users.
Meaningful work in this field began after the establishment of Bell Telephone Laboratories in New Jersey, USA. Since then, the rapid growth of purely digital speech signal processing applications such as speech coding, speech synthesis, speech recognition, speaker verification and identification, and speech enhancement has been driven by evolving discrete-time signal processing techniques together with developments in digital hardware and software technologies.
Today the wireless communications industry depends heavily on advanced speech coding techniques, while the integration of computers and voice technology (speech recognition, synthesis, etc.) is poised for growth. Both speech coding and recognition require some speech enhancement strategy to be embedded into them. The requirements and scope of speech enhancement, and its real-time implementation, are explained here.
In electronic communication systems the speech signal is sent electrically; the conversion media (earphones, loudspeakers, headphones, microphones) as well as the transmission media (wired or wireless) typically introduce distortions, yielding a noisy and distorted speech signal. Such distortion can decrease the intelligibility (the likelihood of being correctly understood) and/or the quality (naturalness, freedom from distortion, and ease of listening) of speech. Speech enhancement techniques are required to improve the quality and intelligibility of the speech signal in many applications, either for the human listener or so that the signal may be better exploited by other speech processing algorithms. A speech enhancement technique should aim for both high quality and intelligibility for human listeners, whereas quality is largely irrelevant if the enhanced speech serves as input to a recognizer [2-4].
For coders or recognizers, speech could actually be "enhanced" in a way that sounds worse to a human yet still serves its overall purpose, if the "enhanced" input allows more efficient parameter estimation in a coder or higher accuracy in a recognizer. For example, pre-emphasis, used to balance relative amplitudes across frequency in anticipation of broadband channel noise (which may distort many frequencies), does not enhance the speech as such, but allows easier noise removal later (via de-emphasis). Many current speech enhancement techniques improve speech quality but do not increase intelligibility; some, indeed, reduce intelligibility as well as quality. Improving aspects of quality is a worthwhile general objective; however, when speech is subject to distortion, it is ordinarily more important to render it intelligible than merely more pleasant.
Lately, the growing use of wireless communication in cellular and mobile phones (with or without hands-free operation), voice messaging services (voice mail), call service centers, voice over internet protocol (VoIP) phones, cordless hearing aids, etc. requires efficient real-time speech enhancement strategies to combat the additive noise and convolutive distortion (e.g., reverberation and echo) that occur in any communication system. Other application areas include aircraft and military communication, aids for hearing-impaired persons, communication inside vehicles and telephone booths, and the enhancement of emergency calls and black-box recordings. Speech enhancement is also required as a pre-processing block in other speech processing systems such as speech recognition, speaker recognition, speaker identification and speech coding, and as a post-processing stage in the codecs used in 3G cellular mobile phones.
Most speech enhancement algorithms need to detect intervals of the noisy signal where speech is absent, so that characteristics of the noise alone can be estimated. This task is done by a voice activity detector (VAD), and voice activity detection is therefore an integral part of most speech enhancement techniques. The performance of most speech enhancement algorithms depends strongly on the VAD, so speech detection and enhancement must work together. A VAD also finds application in mobile phones, where detecting speech/silence allows power consumption to be reduced during non-speech periods.
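The speech/silence detection described above can be sketched with a crude frame-energy threshold. This is an illustrative toy, not the VAD used in the thesis: the frame length, the -30 dB threshold, and the relative-energy criterion are all assumptions chosen for the example.

```python
import numpy as np

def energy_vad(x, frame_len=160, threshold_db=-30.0):
    """Crude energy-based voice activity detector (illustrative only).

    Marks a frame as speech when its energy exceeds `threshold_db`
    relative to the loudest frame. Practical VADs also use zero-crossing
    rate, spectral features, or statistical models.
    """
    n_frames = len(x) // frame_len
    frames = np.reshape(x[:n_frames * frame_len], (n_frames, frame_len))
    energy = np.sum(frames ** 2, axis=1) + 1e-12      # avoid log of zero
    energy_db = 10.0 * np.log10(energy / energy.max())
    return energy_db > threshold_db                   # boolean mask per frame

# Example: near-silence followed by a louder "speech" burst
rng = np.random.default_rng(0)
sig = np.concatenate([0.001 * rng.standard_normal(800),
                      0.5 * rng.standard_normal(800)])
print(energy_vad(sig))
```

Frames flagged False would then be used to update the noise estimate, while True frames are passed to the enhancement stage.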
Background noise removal, cancellation of reverberation and multi-speaker separation are major problems in modern speech communication systems, and they are difficult to handle. Speech enhancement algorithms are applied to problems as diverse as these. This is outlined in figure 1.1, which shows an original speech signal degraded by additive background noise. A speech enhancement algorithm is used to restore the quality of the speech before it is finally presented to the listener, which may be either a human or a machine [7-8]. Other sources of degradation also exist for speech signals, such as distortion from the microphone or reverberation from the surrounding environment.
The approach to speech enhancement varies considerably depending on the type of degradation. Speech enhancement techniques can be divided into two basic categories: (i) single channel and (ii) multiple channel (array processing), based on whether the speech is received from a single microphone or from multiple microphones. In real environments, however, often only a single-channel (one-microphone) signal is available for pickup, and hence the focus here is on single-channel speech enhancement methods. The methods should also have other desirable characteristics: real-time implementation, reasonable computational complexity, a low level of speech distortion, operation at low SNR, adaptation to background noise, a controlled level of noise suppression, the possibility of using a graphic equalizer to remove stationary interference, and easy integration with target applications.
The pioneering work in the field was done by Lim and Oppenheim in 1979. Since then, several methods for single-channel speech enhancement have been proposed in the literature over the last thirty years. Major contributors in this area include Boll and Berouti (1979), Ephraim and Malah (1984), Scalart (1996), Virag (1999), Kamath (2002), and others [10-18]. The approach to speech enhancement varies considerably with the type of degradation; the various domains of speech enhancement are discussed in chapter 2. Most methods assume the noise to be stationary, with a VAD estimating the noise characteristics during speech pauses or silent periods [19-20]. Some researchers, however, have proposed methods to handle non-stationary noise.
The limitations of these methods still pose a considerable challenge to researchers in this area. The objectives of speech enhancement vary widely: reducing the noise level, increasing intelligibility, reducing auditory fatigue, etc. For communication systems, two general objectives depend on the nature of the noise, and often on the signal-to-noise ratio (SNR) of the distorted speech. At medium-to-high SNR (e.g., > 5 dB), reducing the noise level can produce a subjectively natural speech signal at a receiver (e.g., over a telephone line) or obtain reliable transmission (e.g., in a tandem vocoder application). At low SNR (e.g., ≤ 5 dB), the objective may be to decrease the noise level while retaining or increasing intelligibility and reducing the fatigue caused by heavy noise (e.g., train or street noise). In the present work, the goal is to design a single-channel speech enhancement algorithm with good noise-suppression characteristics in the low SNR range (0-5 dB) for various noise types.
The research topic is motivated by the fact that speech is the most important signal transmitted over communication systems and is always subject to background noise and distortion. If speech enhancement is embedded into such a system and works in real time, the performance of the speech communication system is greatly improved. Several strategies have been suggested in the past, yet some challenges remain unsolved. Hence new strategies for speech enhancement and detection are required for communication applications [37-39]. Technical computing tools such as MATLAB, SIMULINK and related toolboxes [42-48] simplify simulation studies as well as the design of graphical user interfaces.
The work described in the thesis includes:
The proposed speech enhancement methods were designed using MATLAB software. The methodology started with the adoption of parameters; a number of parameters were varied to minimize speech distortion, background noise, musical noise and transient distortion. In each method, parameters introduced for noise cancellation and minimization were made adaptive according to the residual noise and musical noise. In the proposed method, the smoothing constant was made adaptive within a range, and finally a tradeoff between noise reduction and distortion was achieved.
There are several reasons for speech degradation, viz.
Considerable research effort has been devoted to enhancing speech, mostly speech distorted by background noise (occurring at the source or in transmission) - both wideband (and usually stationary) noise and, less often, narrowband noise, clicks and other non-stationary disturbances [1-7]. In most cases the noise changes slowly (i.e., it is locally stationary over the analysis frames of interest), so that it can be characterized in terms of its mean and variance (i.e., second-order statistics), estimated either from non-speech intervals (pauses) of the input signal or via a second microphone (a reference microphone) receiving little speech input.
Ideally, the quality and/or intelligibility of the original speech should not be degraded for human subjects with normal speech production and perception systems. In practical scenarios, quality and/or intelligibility are degraded, and listeners may experience confusion; the focus of speech enhancement is therefore to enhance both quality and intelligibility. Except when input from multiple microphones is available (in some specially arranged cases), it is very difficult for speech enhancement systems to increase intelligibility. Thus most speech enhancement methods increase quality while minimizing any loss in intelligibility. Certain aspects of speech are more perceptually important than others: the hearing system is more sensitive to the presence of energy than to its absence, and tends to ignore many aspects of phase. Speech enhancement algorithms therefore often focus on accurate modeling of peaks in the speech amplitude spectrum, paying less attention to phase relationships or to energy at weaker frequencies. Unvoiced speech is not as perceptually important for preserving quality as voiced speech, with its high amplitude and concentration of energy at low frequencies; hence speech enhancement usually emphasizes and improves the periodic portions of speech. A good representation of the spectral amplitudes at harmonic frequencies, and especially in the first three formant regions, is essential for high speech quality. All enhancement algorithms can introduce their own distortion and artifacts, and care must be taken to minimize such distortion.
Weaker unvoiced energy is important for intelligibility, but obstruents are often the first sounds to be lost in noise and the most difficult to recover. Some perceptual studies suggest that strong voiced sounds are more important than such sounds (e.g., replacing the latter by noise of corresponding level causes only a small decrease in intelligibility). Generally speaking, the portions of the signal (both voiced and unvoiced) where the spectrum changes rapidly (corresponding to vocal-tract movement) are very important. Speech enhancement also attempts to exploit knowledge beyond simple SNR estimates in different frequency bands. Some systems combine speech enhancement and speech recognition, adapting the enhancement using speech-segment estimates derived from the recognition components; ASR of broad phonetic classes is simpler and more robust than full ASR of noisy speech, yet still allows improved speech enhancement.
Different suppression techniques may be required for different kinds of interference. A noise signal can be continuous, impulsive or periodic, and its amplitude may vary across frequency (occupying broad or narrow spectral ranges); e.g., background or transmission noise is usually continuous and broadband (sometimes modeled as "white noise" - uncorrelated time samples with a flat spectrum). Some distortions are abrupt and powerful, while others persist over long periods (e.g., radio static, fading). Noise from machinery or from AC power lines is also continuous, but present only at a few frequencies. These noises are usually additive in nature, and most speech enhancement techniques handle additive background noise. Noise that is not additive (multiplicative or convolutional) may be handled by applying a logarithmic transformation to the corrupted signal, either in the time domain (for multiplicative noise) or in the frequency domain (for convolutional noise); this converts the distortion to an additive one, allowing basic speech enhancement methods to be applied. Other classes of techniques have been devised to handle convolutive distortion and reverberation.
Interfering speakers introduce a unique problem for speech enhancement. When listeners hear many sound sources, they can usually direct their attention to one specific source and perceptually ignore the others. Binaural reception facilitates this "cocktail party effect" via the listener's two ears: the waves reaching each ear are slightly different (e.g., in time delay and amplitude), so one can usually localize the position of a source and attend to it, suppressing perception of other sounds. How the brain suppresses such interference, however, is still poorly understood. Monaural listening (e.g., via a telephone handset) provides no directional cues, and the listener must rely on the desired sound source being stronger (or having its major energy at different frequencies) than competing sources. When a desired source can be monitored by several microphones, techniques can exploit the spacing between the microphones. Most practical speech enhancement applications, however, involve monaural listening, with input from one microphone.
Artifacts of echo and background noise can usually be minimized by directional and head-mounted noise-cancelling microphones. The speech of interfering speakers occupies the same overall frequency range as that of the desired speaker; however, voiced speech typically has its fundamental (pitch) frequency F0 and harmonics at different frequencies for different speakers. Speech enhancement strategies therefore try to identify the strong frequencies of either the desired speaker or the unwanted source, or to separate their spectral components to the extent that the components do not overlap. Like speech, interfering music has some of these properties, permitting its suppression via similar strategies (except that some musical chords have more than one F0, spreading energy to more frequencies than speech does). Multi-speaker separation (speaker separation) generally needs a multiple-microphone solution; for this type of interference, single-microphone techniques are not sufficient. Little literature is available, and the problem has not been fully resolved for the general case.
The approach to speech enhancement varies significantly with the type of degradation. Speech enhancement methods can be divided into two basic categories: (i) single channel and (ii) multiple channel (array processing), based on whether speech is received from a single microphone or from multiple microphones. However, since a single-channel (one-microphone) signal is what is usually available for pickup in real environments, the focus here is on single-channel speech enhancement strategies. Figure 2.1 shows a chart of the newest single-channel speech enhancement methods for three different types of problems.
Additive Noise Removal:
In most cases, background random noise adds to the desired speech signal, forming an additive mixture that is picked up by the microphone. The noise may be stationary or non-stationary, white or colored, and has no correlation with the desired speech signal. Various methods have been suggested in the literature to overcome this problem; one class of them is the following.
Transform domain methods:
Transform domain methods are the most commonly used and most conventional methods. They transform the time-domain signal into another domain, apply some kind of filtering to suppress noise, and then inverse-transform the filtered signal back into the time domain. They follow the analysis-modify-synthesis approach. The transformation most often used is the DFT.
DFT based (STSA) methods: These are the most popular, as they require low processing complexity and are simple to implement. They are based on the short-time DFT (STDFT) and have been intensively investigated; they are also called spectral processing methods. They rest on the fact that the human auditory system is not sensitive to spectral phase, but the clean spectral amplitude must be properly extracted from the noisy speech to obtain acceptable quality at the output; hence they are known as short-time spectral amplitude (STSA) based techniques [5,7]. In practice the power spectral density of the signal is employed rather than the amplitude. Methods in this category remove an estimate of the noise from the noisy signal using spectral subtraction (SS). The noise power spectrum estimate is obtained by averaging over multiple frames of a known noise segment, which may be detected using a voice activity detector (VAD). The basic SS method suppresses noise, but its limitation is a side effect (artefact) called musical noise, which gives rise to distortion in the enhanced speech. Boll and Berouti et al. introduced several modifications to the basic technique to reduce the musical noise, but these require very careful parameter choices. Another modification of basic SS uses McAulay's maximum likelihood (ML) estimation of the output speech, which assumes noise with a complex Gaussian distribution. In general, all SS methods estimate an a posteriori SNR, and SS methods are suitable for stationary white noise only. Solutions to these problems have been suggested using a smoothing time-varying filter called the Wiener filter. The combination of SS and Wiener filtering is employed in most real applications.
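The analysis-modify-synthesis flow of spectral subtraction can be sketched as follows. This is a minimal sketch, not the thesis implementation: the Hann window, 50% overlap, the over-subtraction factor `alpha` and spectral floor `beta` (Berouti-style) are illustrative assumptions.

```python
import numpy as np

def spectral_subtraction(noisy, noise_est, frame_len=256, alpha=2.0, beta=0.01):
    """Magnitude spectral subtraction, frame by frame (illustrative sketch).

    `noise_est` is an average noise magnitude spectrum, e.g. taken from
    frames a VAD has marked as silence. `alpha` is the over-subtraction
    factor and `beta` the spectral floor; both values are illustrative.
    """
    hop = frame_len // 2
    window = np.hanning(frame_len)
    out = np.zeros(len(noisy))
    for start in range(0, len(noisy) - frame_len + 1, hop):
        frame = noisy[start:start + frame_len] * window
        spec = np.fft.rfft(frame)
        mag, phase = np.abs(spec), np.angle(spec)
        # subtract the noise estimate; clamp to a spectral floor
        clean_mag = np.maximum(mag - alpha * noise_est, beta * mag)
        # keep the noisy phase (hearing is insensitive to spectral phase)
        frame_out = np.fft.irfft(clean_mag * np.exp(1j * phase))
        out[start:start + frame_len] += frame_out * window  # overlap-add
    return out
```

The spectral floor is what suppresses the isolated residual peaks perceived as musical noise; setting `beta = 0` reproduces the basic SS method and its artefact.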
The optimal Wiener filter for the noisy speech can be designed in the frequency domain using the estimated ratio of the power spectrum of clean speech (the object power spectrum) to that of the noisy speech (the a priori SNR). This spectrally variable attenuation accommodates coloured noise and may be updated at any desired frame rate to handle non-stationary noise. A significant drawback of this approach is the need to estimate the background noise spectrum at each frame, which is limited by the performance of the VAD; noise adaptation is needed in the VAD for each frame. Moreover, estimating the object power spectrum from the current frame alone is unrealistic for a time-varying, non-stationary process. A solution was suggested by Ephraim and Malah, called the decision-directed (DD) rule, which estimates the a priori SNR of the current frame from the a posteriori SNR of the current frame, the noise estimate for the current frame, and the clean-speech estimate from the previous frame. In practice the DD approach is therefore combined with a Wiener filter to obtain a realistic system. The Wiener filter shows a substantial reduction in musical noise artefacts compared with SS strategies.
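The decision-directed rule feeding a Wiener gain can be sketched per frequency bin as follows. This is an illustrative sketch: the smoothing constant `a = 0.98` and the floor `xi_min` are commonly quoted choices, not values taken from this text.

```python
import numpy as np

def dd_wiener_gains(noisy_mags, noise_psd, a=0.98, xi_min=1e-3):
    """Decision-directed a priori SNR estimate feeding a Wiener gain.

    `noisy_mags` is a (frames x bins) array of noisy magnitude spectra and
    `noise_psd` the estimated noise power per bin. The smoothing constant
    `a` and the floor `xi_min` are illustrative choices.
    """
    gains = np.zeros_like(noisy_mags)
    prev_clean_power = np.zeros(noisy_mags.shape[1])
    for t, mag in enumerate(noisy_mags):
        gamma = (mag ** 2) / noise_psd                 # a posteriori SNR
        # DD rule: mix last frame's clean estimate with current excess SNR
        xi = a * prev_clean_power / noise_psd + (1 - a) * np.maximum(gamma - 1, 0)
        xi = np.maximum(xi, xi_min)
        gain = xi / (1 + xi)                           # Wiener gain
        gains[t] = gain
        prev_clean_power = (gain * mag) ** 2
    return gains
```

Because the a priori SNR is smoothed across frames, the gain varies slowly, which is why the DD/Wiener combination exhibits far less musical noise than frame-by-frame spectral subtraction.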
Realistic, optimal estimation of the object power spectrum without artefacts requires model-based statistical methods. Ephraim and Malah suggested stochastic estimation methods such as the minimum mean square error (MMSE) estimator and its variant, the MMSE log-spectral amplitude (LSA) estimator, which are commonly used. They are based on modelling the spectral components of the speech and noise processes as independent Gaussian variables. Most of the literature reports that the performance of the Wiener filter and MMSE LSA is outstanding in terms of both subjective and objective evaluations. The stochastic estimation method known as MAP (maximum a posteriori) approaches the performance of MMSE LSA with simpler computations. All of these methods assume speech is present in the frequency bin under consideration, which is not perceptually true. They can be extended by incorporating a two-state speech presence/absence model, which leads to soft-decision based spectral estimation and further improves performance at the cost of computational complexity. Further enhancement was observed by using a Laplacian model for the speech spectral coefficients instead of a Gaussian model. The various noise adaptation techniques used (hard/soft/mixed decision) also affect performance; soft-decision based noise adaptation was found satisfactory in removing the musical artefact, but at the cost of increased processing requirements.
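The MMSE-LSA gain function has a closed form, G(xi, gamma) = xi/(1+xi) * exp(E1(v)/2) with v = xi*gamma/(1+xi), where E1 is the exponential integral. The sketch below evaluates it; the home-made quadrature for E1 is an assumption made only to keep the example dependency-free (a real implementation would use scipy.special.exp1).

```python
import numpy as np

def _e1(v, upper=40.0, n=4001):
    """Exponential integral E1(v) by simple trapezoidal quadrature.

    Accurate enough for illustration; scipy.special.exp1 is the proper tool.
    """
    v = np.atleast_1d(np.asarray(v, dtype=float))
    t = np.linspace(1e-6, upper, n)
    dt = t[1] - t[0]
    integrand = np.exp(-(v[:, None] + t)) / (v[:, None] + t)
    return dt * (integrand.sum(axis=1) - 0.5 * (integrand[:, 0] + integrand[:, -1]))

def mmse_lsa_gain(xi, gamma):
    """Ephraim-Malah MMSE log-spectral amplitude gain per frequency bin.

    xi is the a priori SNR and gamma the a posteriori SNR.
    """
    xi = np.asarray(xi, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    v = xi * gamma / (1.0 + xi)
    return xi / (1.0 + xi) * np.exp(0.5 * _e1(v))
```

For large v the correction term vanishes and the gain reduces to the plain Wiener gain xi/(1+xi); for small v the extra attenuation of the log-domain criterion appears.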
Motorola developed a background noise suppression system that is included as a feature in IS-127, the TIA/EIA standard for the Enhanced Variable Rate Codec (EVRC) used in CDMA-based telephone systems. EVRC was modified to EVRC-B and soon replaced by the Selectable Mode Vocoder (SMV), which maintained speech quality while improving network capacity. More recently, the new CDMA2000 4GV codecs have replaced SMV itself; 4GV is the next-generation 3GPP2 standards-based EVRC-B codec. The EVRC-based codec uses a combination of STSA-based approaches as a pre-processor for background noise suppression: multiband spectral subtraction (MBSS) and a minimum mean square error (MMSE) LSA gain-function estimator. A voice activity detector (VAD) used to detect speech/silence frames is embedded within the algorithm. Its quality has been proven good through commercial products; yet it may not be sufficiently good over the wide range of SNRs that were not given much attention when it was standardized. Another algorithm, suggested by A. Sugiyama, M. Kato and M. Serizawa, uses a modified MMSE-STSA approach based on weighted noise estimation. Subjective tests on this algorithm claim a maximum difference in mean opinion score (MOS) of 0.35 to 0.40 compared with EVRC, and its later version is provided inside 4G handsets. The modified STSA-MMSE algorithm based on weighted noise estimation is used in millions of 4G handsets as the only commercially available 3GPP-endorsed noise suppressor. Open questions remain, however, such as how the parameters of the statistical models can be estimated robustly and what constitute meaningful optimization criteria for speech enhancement; these require further research.
Speech Enhancement and Detection Techniques
This chapter describes the most established and general techniques for additive noise removal, which are transform-domain methods based on the short-time Fourier transform (STFT). The discrete short-time Fourier transform is employed as the transformation tool in most techniques used nowadays [1-2, 4]. These methods follow the analysis-modify-synthesis approach. They use a fixed analysis window length (usually 20-25 ms) and frame-based processing. They rest on the fact that the human auditory system is not sensitive to spectral phase, but the clean spectral amplitude must be properly extracted from the noisy speech to obtain acceptable quality at the output; hence they are known as short-time spectral amplitude or attenuation (STSA) based methods. The phase of the noisy speech is preserved in the enhanced speech. The synthesis is usually done using the overlap-add method. These have been among the best-known and most thoroughly investigated techniques for additive noise reduction; they also require little computational complexity and are easy to implement. The detailed mathematical expression for the transfer (gain) function of each method is presented, together with the terms used in the function. The relative pros and cons of all the available methods, as well as their applications, are discussed. The chapter starts with a brief review of the analysis and synthesis procedures employed in the methods. The other transformation used is the discrete wavelet transform (DWT), and techniques based on the DWT are also described briefly here.
Performance evaluation of any algorithm is very important for comparison. Several objective and subjective measures are available to evaluate speech enhancement algorithms; they are described briefly in this chapter.
The signal processing tools used by the STSA algorithms are explained here in brief.
The STFT of a signal x(m) is defined as X(n, ω) = Σ_m x(m) w(n − m) e^(−jωm), where x(m) is the input signal and w(n − m) is the analysis window, which is time-reversed and shifted by n samples as shown in figure 3.1. The STFT is a function of two variables: the discrete time index n and the (continuous) frequency variable ω. To obtain X(n + 1, ω), slide the window by one sample, multiply it with x(m), and compute the Fourier transform of the windowed signal. Continuing in this way generates a set of STFTs for successive values of n until the end of the signal is reached.
A discrete version of the STFT is obtained by sampling the frequency variable ω at N uniformly spaced frequencies, i.e., at ω_k = 2πk/N, k = 0, 1, …, N − 1. The resulting discrete STFT is defined as X(n, k) = X(n, ω)|_(ω = 2πk/N) = Σ_m x(m) w(n − m) e^(−j2πkm/N).
The STFT X(n, ω) can be interpreted in two distinct ways, depending on how one treats the time (n) and frequency (ω) variables. If n is fixed but ω varies, X(n, ω) may be viewed as the discrete-time Fourier transform of the windowed sequence x(m) w(n − m); as such, it has the same properties as the DTFT. If ω is fixed and the time index n varies, a filtering interpretation emerges.
The STFT is a two-dimensional function of time n and frequency ω. In principle, X(n, ω) can be evaluated for each value of n; in practice, however, X(n, ω) is decimated in time, partly due to the heavy computational load involved and partly due to the redundancy of information contained in consecutive values (e.g., between X(n, ω) and X(n + 1, ω)). Hence, in most practical applications X(n, ω) is not evaluated for every sample but for every R-th sample, where R is the decimation factor, often expressed as a fraction of the window length. The sampling, in both time and frequency, has to be done in such a way that x(m) can be recovered from X(n, ω) without aliasing.
Considering the sampling of X(n, ω) in the time domain, it can be shown from equation 3.2 that the bandwidth of the sequence X(n, ω) (along n, for a fixed frequency ω) is less than or equal to the bandwidth B of the analysis window w(n). This suggests that X(n, ω) has to be sampled in time at twice the bandwidth of the window to satisfy the Nyquist sampling criterion.
For an L-point Hamming window, X(n, ω) has to be sampled in time at a minimum rate of 2B samples/sec, which works out to every L/4 samples, to avoid time aliasing; this corresponds to a minimum overlap of 75% between adjacent windows. This strict requirement on the minimum overlap can be relaxed if zeros are allowed in the window transform. In speech enhancement applications it is quite common to use a 50% rather than 75% overlap between adjacent windows; that is, X(n, ω) is evaluated every L/2 samples, i.e., decimated by a factor of L/2, where L is the window length. Since the STFT (for fixed n) is the DTFT of the windowed sequence, recovering the windowed sequence with no aliasing requires that the frequency variable ω be sampled at N ≥ L uniformly spaced frequencies.
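The decimated analysis and overlap-add synthesis described above can be sketched as follows. A Hann window at 50% overlap is used here because its shifted copies sum (almost exactly) to a constant, so analysis followed directly by synthesis reconstructs the interior of the signal; the text's Hamming window behaves the same way up to a constant scale factor. Frame length and hop are illustrative.

```python
import numpy as np

def stft(x, frame_len=256, hop=128):
    """Discrete STFT X(n, k): window, decimate in time by `hop`, then FFT."""
    w = np.hanning(frame_len)
    starts = range(0, len(x) - frame_len + 1, hop)
    return np.array([np.fft.rfft(x[s:s + frame_len] * w) for s in starts])

def istft(X, frame_len=256, hop=128):
    """Overlap-add synthesis: inverse FFT each frame and sum the overlaps."""
    n_frames = X.shape[0]
    out = np.zeros((n_frames - 1) * hop + frame_len)
    for i, spec in enumerate(X):
        out[i * hop:i * hop + frame_len] += np.fft.irfft(spec, frame_len)
    return out

# Analysis followed by synthesis (no spectral modification) reconstructs
# the interior samples, since the shifted windows sum to ~1 there.
t = np.arange(2048)
x = np.sin(2 * np.pi * 0.01 * t)
y = istft(stft(x))
```

An enhancement algorithm would modify the magnitude of each row of `stft(x)` (keeping the noisy phase) before calling `istft`.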
The spectrogram describes the relative energy concentration of the speech signal in frequency as a function of time and, as such, reflects the time-varying properties of the speech waveform. Frequency is plotted vertically and time horizontally. Amplitude, or loudness, is depicted by gray-scale or color intensity; color spectrograms represent maximum intensity as red, gradually decreasing through orange, yellow, green and blue (illustrated in figure 3.5). Two forms of spectrogram, narrow-band and wide-band, can be produced depending on the window length used in the computation of X(n, ω).
A long-duration window (at least two pitch periods long) is usually used in computing the narrow-band spectrogram, and a short window in computing the wide-band spectrogram. The narrow-band spectrogram provides good frequency resolution but poor time resolution. The fine frequency resolution permits the individual harmonics of speech to be resolved; these harmonics appear as horizontal striations in the spectrogram (figure 3.5, top panel). The main drawback of long windows is the risk of temporally smearing short-duration segments of speech, such as the stop consonants. The wide-band spectrogram uses short-duration windows (less than a pitch period) and provides good temporal resolution but poor frequency resolution. The main consequence of the poor frequency resolution is the smearing (in frequency) of the individual harmonics, yielding only the spectral envelope of the spectrum (figure 3.5, bottom panel). The fundamental frequency (reciprocal of the pitch period) ranges over about 60-150 Hz for male speakers and 200-400 Hz for females and children, so the pitch period varies roughly over 2-20 ms. In practice, therefore, a compromise is made by setting a practical window duration of 20-30 ms; this accommodates a broad range of speakers and represents the harmonic structure of speech fairly well. These values are used throughout the research work.
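The narrow-band versus wide-band trade-off can be demonstrated numerically. The sketch below is illustrative: the 8 kHz rate, the two tones 100 Hz apart (standing in for adjacent harmonics of a 100 Hz F0), and the 50 ms / 5 ms window lengths are assumptions chosen for the example, not figures from the text.

```python
import numpy as np

def spectrogram(x, frame_len, hop, n_fft=1024):
    """Magnitude spectrogram with a Hamming analysis window (sketch)."""
    w = np.hamming(frame_len)
    starts = range(0, len(x) - frame_len + 1, hop)
    return np.array([np.abs(np.fft.rfft(x[s:s + frame_len] * w, n_fft))
                     for s in starts])

fs = 8000
t = np.arange(fs) / fs
# two "harmonics" 100 Hz apart, as in voiced speech with F0 = 100 Hz
x = np.sin(2 * np.pi * 300 * t) + np.sin(2 * np.pi * 400 * t)

narrow = spectrogram(x, frame_len=400, hop=100)   # 50 ms window
wide = spectrogram(x, frame_len=40, hop=10)       # 5 ms window
```

In the narrow-band output the two components appear as separate ridges (the horizontal striations of figure 3.5), while the short wide-band window smears them into a single broad lobe showing only the envelope.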