A nonuniform sampling technique based on inflection point detection and its application to speech coding

Byeong-Gwan Iem a)
Gangneung-Wonju National University, Gangneung, Gangwondo, 210-702, South Korea

(Received 9 May 2013; revised 15 May 2014; accepted 11 June 2014)

To reduce the amount of data, the nonuniform sampling (NUS) method retains only selected samples of a signal, such as local maxima and minima. To overcome the sparseness problem of the NUS method, an inflection point detection (IPD) method is proposed to sample a signal nonuniformly. The IPD samples a signal not only at the local maxima and minima, but also at the inflection points where the slope of the signal changes. To show its usefulness, the IPD is applied to speech coding. The encoder transmits the time instants and sample amplitude values of the inflection points. At the receiver, the decoder estimates the sample amplitude values at the noninflection points by interpolating the received information. Simulation results show that the IPD method produces a 7% improvement in mean square error over the NUS method. With a small threshold to detect inflection points, the proposed coding method shows a 0.38–8.72 dB improvement in signal-to-noise ratio (SNR) and a 0.5–1.3 improvement in mean opinion score, compared to the continuously variable slope delta modulation (CVSDM) algorithm. The IPD method produces up to 8.5 dB improvement in SNR over the CVSDM at bit error rates (BER) below 5 × 10⁻⁵, while the IPD method becomes worse than the CVSDM at BER above 5 × 10⁻⁵.

© 2014 Acoustical Society of America. [http://dx.doi.org/10.1121/1.4884882]

PACS number(s): 43.72.Gy, 43.60.Dh, 43.60.Ek [CYE]

I. INTRODUCTION

The nonuniform sampling (NUS) technique is based on the observation that, over a very short time period, a signal can be approximated by a combination of nearly linear segments (Davisson, 1968). Therefore, in the technique, the signal is sampled only at local maxima and minima (Budaes and Goras, 2005), at random points (Fjällbrant, 1977), or at level crossing points (Mark and Todd, 1981). Random sampling can reduce the sample rate to below the Nyquist rate. However, the method still has the restriction of skipping no more than two consecutive samples. The resulting nonuniform samples may still have some redundancy. The level crossing detection (LCD) method transmits the sample time instants at which the signal crosses predetermined levels. The LCD method may produce coarsely sampled results if the signal changes rapidly between predetermined levels. In maxima and minima detection (MMD), the signal is sampled at the local maxima and minima, and the data rate is reduced accordingly. The MMD-based method can result in an imprecisely sampled signal. Extra samples can be inserted to overcome possible sparseness of samples if the distance between two sampled locations is greater than a predetermined value (Budaes and Goras, 2005). The MMD-based NUS technique has been applied to speech coding problems (Bae et al., 1996; Budaes and Goras, 2005; Ghosh and Sreenivas, 2006). Instead of adding extra samples to avoid sparseness of samples, this study exploits the geometrical structure of speech signals by sampling the local maxima and minima, and the points where the slope of the signal abruptly changes. It is expected that the reconstructed speech via interpolation will more closely resemble the original speech. Thus, the proposed nonuniform sampling can be used as an alternative to the pulse code modulation (PCM) coding of speech.

a) Author to whom correspondence should be addressed. Electronic mail: [email protected]

J. Acoust. Soc. Am. 136 (2), August 2014, Pages: 903–909

II. INFLECTION POINT DETECTION FOR NONUNIFORM SAMPLING

A. Types of inflection points

A speech signal can be considered as the output of a vocal tract filter, which is modeled as a linear time-varying filter. Thus, it inherently shows nonstationary characteristics. However, when speech is examined in small time windows, such as 5 ms segments, the signal may appear nearly linear within each window, so it can be approximated as a connected sequence of linear segments. Figure 1 shows a speech signal in a full sentence of 1 s, a segment of 500 ms, and voiced and unvoiced parts of 5 ms, respectively. Figure 1(a) can be considered as the nonstationary output of a vocal tract filter. However, in the short 5 ms intervals shown in Figs. 1(c) and 1(d), the speech signal appears as a connection of linear segments. Therefore, to reduce the amount of speech data, the NUS technique samples the speech at the local maxima and minima. One disadvantage of the NUS technique is that the resulting speech may be coarsely sampled. In this study, sample points where the slope of the signal changes abruptly are also sampled to overcome the sparseness problem of the NUS method. These samples are called inflection points. Figure 2 shows various types of inflection points, including local maxima and minima.

B. Algorithm for inflection point detection

FIG. 1. Speech signal example: (a) full sentence of 1 s, (b) a segment of 500 ms, (c) the voiced component in 5 ms, and (d) the unvoiced component in 5 ms.

To locate an inflection point, the encoder checks the slope of the signal. Since the slope can be obtained by differentiation, the central difference equation is applied to approximate the differentiation. For a normalized speech signal x[n], the central difference is defined as follows (Boashash, 1992):

$$ y[n] = \frac{1}{4}\left( x[n+1] - x[n-1] \right). \qquad (1) $$
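As a minimal sketch of Eq. (1), the central difference can be evaluated for a whole signal at once. The function name `central_difference`, the `scale` parameter, and the choice to set the two end samples to zero (where x[n-1] or x[n+1] does not exist) are assumptions made here for illustration; the paper does not specify how the signal edges are handled.

```python
import numpy as np


def central_difference(x, scale=0.25):
    """Approximate the slope of x with the central difference of Eq. (1).

    `scale` is the constant multiplying x[n+1] - x[n-1]; 0.25 follows Eq. (1).
    The first and last outputs are set to 0 because one neighbour is missing
    there (an assumption, not taken from the paper).
    """
    x = np.asarray(x, dtype=float)
    y = np.zeros_like(x)
    y[1:-1] = scale * (x[2:] - x[:-2])
    return y


# A straight line has a constant slope estimate away from the edges.
ramp = np.arange(6, dtype=float)
print(central_difference(ramp))
```

For the unit-slope ramp, each interior value is 0.25 × 2 = 0.5, which shows that the output is proportional to the slope rather than equal to it. The threshold used for detection therefore has to be read on the same scale.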

The encoder classifies a speech sample as an inflection point if the sample satisfies one of the following conditions:

(i) the product of consecutive differences is negative, i.e., $y[n] \cdot y[n-1] < 0$;
(ii) the difference of consecutive differences is larger than a predetermined threshold, i.e.,

$$ \left| y[n] - y[n-1] \right| > \mathrm{thr}. \qquad (2) $$

The first condition corresponds to samples at the local maximum or minimum, while the second condition corresponds to samples at points where the slope abruptly changes. Figure 3 shows the flow chart for the inflection point detection (IPD).

FIG. 2. Types of detected inflection points in dots: (a) local maxima, (b) local minima, and (c) inflection points.

FIG. 3. Flow chart for the detection of inflection points.

III. APPLICATION OF NONUNIFORM SAMPLING TO SPEECH CODING

A. Structure of the coder

Figure 4 shows the structure of a simple speech coder that is based on the IPD algorithm. At the transmitter, the encoder detects inflection samples, and transmits their sampling time information and sample values. For noninflection samples, only time information is transmitted. The time information of the inflection and noninflection samples is specified by a one-bit flag. A speech sample amplitude value follows the flag bit if the sample is flagged as an inflection point. At the receiver, the decoder estimates values for the noninflection points by interpolating the received inflection point information.

FIG. 4. Structure of the speech codec.

B. Application example of a speech signal

In this section, the IPD-based NUS scheme is applied to a speech signal to show its usefulness. Figures 5(a) and 5(c) show a full sentence of an original speech signal and the reconstructed speech signal at the decoder, respectively. The sentence, spoken by a female adult, is “Should we chase those cowboys?” (Childers, 2000). For decoding, the linear interpolation technique is used (Press et al., 1986). The original speech signal is 2.1248 s long, and is sampled at 10 kHz. Therefore, the number of samples is 21 248. The number of detected inflection points is 8902. Consequently, there are 12 346 noninflection points. If 8 bits are used for quantization, then the number of bits for transmission is 8902 × 9 bits/sample + 12 346 × 1 bits/sample = 92 464 bits.

FIG. 5. Application example of the proposed method: (a) original speech in a full sentence of 2 s, (b) original speech in a segment of 0.3 s from 0.45 to 0.75 s, (c) reconstructed speech by inflection point detection and interpolation in a full sentence of 2 s, and (d) reconstructed speech in a segment of 0.3 s from 0.45 to 0.75 s.

FIG. 6. Application example of the proposed method for the voiced component: (a) original speech in full sentence, (b) voiced part of original speech in detailed plot, (c) detected inflection points, and (d) reconstructed speech in solid line and original speech in dashed line.

The IPD and interpolation are applied to both the voiced and the unvoiced speech components, whose characteristics are different. Figures 6 and 7 show the original speech signal, the detected inflection points, and the interpolated signal in 5 ms segments for the voiced and unvoiced speech signal components, respectively. For comparison, the original signal shown by the dashed line is superimposed with the reconstructed signal from inflection points in Figs. 6(d) and 7(d). The voiced part in Fig. 6(d) is better reconstructed than the unvoiced part in Fig. 7(d) because the voiced speech has mainly low-frequency components.
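The detection rule of Eqs. (1) and (2), the flag-plus-amplitude bit budget, and the interpolating decoder described in this section can be sketched together as follows. This is an illustrative sketch under stated assumptions, not the author's implementation: all function names are hypothetical, the slope at the signal edges is taken as zero, the first and last samples are always flagged so that interpolation covers the whole signal, amplitude quantization is omitted, and a synthetic sinusoid stands in for speech.

```python
import numpy as np


def detect_inflection_points(x, thr, scale=0.25):
    """Return one boolean flag per sample: True where conditions (i)/(ii) hold."""
    x = np.asarray(x, dtype=float)
    y = np.zeros_like(x)
    y[1:-1] = scale * (x[2:] - x[:-2])       # central difference, Eq. (1)
    prev = np.concatenate(([0.0], y[:-1]))   # y[n-1]; y before the start taken as 0
    flags = (y * prev < 0) | (np.abs(y - prev) > thr)  # condition (i) or Eq. (2)
    flags[0] = flags[-1] = True              # assumption: always send both end points
    return flags


def bit_count(n_inflection, n_total, bits_per_sample=8):
    """One flag bit per sample, plus an amplitude word per inflection point."""
    return n_inflection * (1 + bits_per_sample) + (n_total - n_inflection)


def encode(x, thr, bits_per_sample=8):
    """Return the flags, the amplitudes that follow set flags, and the bit total."""
    flags = detect_inflection_points(x, thr)
    values = np.asarray(x, dtype=float)[flags]   # unquantized here (assumption)
    return flags, values, bit_count(int(flags.sum()), len(flags), bits_per_sample)


def decode(flags, values):
    """Linearly interpolate the noninflection samples from the received ones."""
    n = np.arange(len(flags))
    return np.interp(n, np.flatnonzero(flags), values)


# Synthetic stand-in for speech: 20 ms of a 300 Hz tone sampled at 10 kHz.
t = np.arange(200) / 10_000.0
x = 0.8 * np.sin(2 * np.pi * 300 * t)
flags, values, n_bits = encode(x, thr=0.016)
x_hat = decode(flags, values)
print(int(flags.sum()), n_bits, float(np.mean((x - x_hat) ** 2)))

# Bit budget for the sentence quoted in the text: 8902 of 21 248 samples flagged.
print(bit_count(8902, 21248))  # 92464
```

Dividing 92 464 bits by the 2.1248 s duration gives an average rate of roughly 43.5 kbps for that sentence. The rate varies with the signal and the threshold, because only flagged samples carry an amplitude word.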

FIG. 7. Application example of the proposed method for the unvoiced component: (a) original speech in full sentence, (b) unvoiced part of original speech in detailed plot, (c) detected inflection points, and (d) reconstructed speech in solid line and original speech in dashed line.

IV. PERFORMANCE COMPARISON AND CONSIDERATION

A. Comparison of maxima/minima detection and inflection point detection

Objective performance measures, such as mean square error (MSE) and data rate, are closely related to the number of detected inflection points, which depends on the threshold level from Eq. (2) in the inflection point detection algorithm shown in Fig. 4. For small threshold values, the IPD-based method detects more inflection points, and shows lower MSE and higher data rate than the MMD-based NUS method. As the threshold increases, the number of detected inflection points decreases, and the resulting data rate and MSE of the IPD get close to those of the MMD. Figure 8 shows the MSE and data rate of the MMD-based and the IPD-based methods for changing threshold values. For threshold values above 0.06, the data rate of the IPD-based method converges to that of the MMD-based method, but the MSE of the IPD is always lower than that of the MMD.

FIG. 8. Comparison of the MMD-based NUS and the IPD-based NUS for various threshold values: (a) mean square error vs threshold, and (b) data rate vs threshold.

B. Objective performance analysis
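Both objective measures in Sec. IV, the MSE of Sec. IV A and the SNR of Sec. IV B, are computed from the difference between the original and the reconstructed signal. A minimal sketch is given here, assuming the conventional energy-ratio definition of SNR in decibels. The paper does not state the formula explicitly, and the function names `mse` and `snr_db` are hypothetical.

```python
import numpy as np


def mse(original, reconstructed):
    """Mean square error between the original and the reconstructed signal."""
    e = np.asarray(original, dtype=float) - np.asarray(reconstructed, dtype=float)
    return float(np.mean(e ** 2))


def snr_db(original, reconstructed):
    """SNR in dB, with noise defined as original minus reconstructed."""
    x = np.asarray(original, dtype=float)
    noise = x - np.asarray(reconstructed, dtype=float)
    noise_energy = float(np.sum(noise ** 2))
    if noise_energy == 0.0:
        return float("inf")                  # perfect reconstruction
    return float(10.0 * np.log10(np.sum(x ** 2) / noise_energy))


# A reconstruction that is 10% too small has 1% of the signal energy as noise.
x = np.array([1.0, -1.0, 1.0, -1.0])
print(mse(x, 0.9 * x), snr_db(x, 0.9 * x))   # about 0.01 and about 20 dB
```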

The objective performance measure used is the signal-to-noise ratio (SNR) in a quiet environment. Noise is defined as the difference between the original speech signal and the reconstructed signal. The result is provided in the fourth column of Table I. Speech samples 1–3 are selected to contain both voiced and unvoiced components. Sample 4 has only the voiced component. The four speech signals are shown in Fig. 9 (Childers, 2000). For comparison, the performance results of the 32 kbps continuously variable slope delta modulation (CVSDM) are also provided (Kondoz, 1994). For each speech signal, the inflection point detection algorithm uses two different thresholds, thr = 0.16 and 0.016. With the lower threshold value, the coding method detects and transmits as many inflection points as possible. Thus, in this simulation, the data rate is high, above 35 kbps, and the SNR is relatively high, since the decoder recovers the signal in detail. In contrast, with the larger threshold value, the coder yields a smaller number of inflection points, data rates of below 35 kbps, and poorer SNR results. For sample 4, the IPD method shows a lower data rate than for samples 1–3. This is because sample 4 has voiced components only, which produce fewer inflection points than unvoiced components do.

TABLE I. Performance analysis based on SNR and MOS under noiseless condition.

                        Proposed IPD-based speech coder                  32 kbps CVSDM
Sample    Threshold     Data rate     SNR       MOS         SNR       MOS
speech    in Eq. (2)    (kbps)        (dB)                  (dB)
#1        0.016         44            17        4.03        12.22     2.71
          0.16          32.3          11.1      2.98
#2        0.016         41.4          21.1      4.06        12.38     3.03
          0.16          34.4          11.4      3.03
#3        0.016         44.2          16.8      3.99        11.88     2.63
          0.16          34.8          9.6       2.78
#4        0.016         38.6          16.68     3.85        16.3      3.38
          0.16          26.6          9.62      2.72

The SNR performance of the IPD-based coder and the CVSDM under random bit error condition is also shown in Fig. 10. With lower bit error rates (BER), the IPD-based method produces better SNR than the CVSDM. However, as the BER increases, the performance of the IPD-based method becomes worse, whereas the CVSDM shows consistent SNR performance. This is because bits in the CVSDM bit stream have no relative weights, whereas bits in the bit stream of the IPD-based method have different weights, since they carry PCM-coded amplitudes and flag bits. That is, a bit error in the CVSDM only affects the increment or decrement in an integrator of the CVSDM, whereas a bit error in the IPD-based PCM coding may happen in the least significant bit, the most significant bit (MSB), or the flag. Thus, with higher bit error rates, the probability of MSB or flag bit errors becomes high, and the SNR performance worsens in the IPD-based PCM coding.

FIG. 9. Four speech signals used for performance analysis: (a) “The rain in Spain falls mainly on the plain.”, (b) “A bird in the hand is worth two in the bush.”, (c) “The supplies were stored for everyone.”, and (d) “We were away a year ago.”

C. Subjective performance analysis

Because the end user of a speech signal is a human listener, a subjective opinion is as important as the objective performance. The mean opinion score (MOS) is used as the subjective performance measure (Rabiner and Schafer, 1978). The listeners were 134 trained adults with normal hearing, aged between 20 and 25 years. Each speech sample was played twice before rating. The rating scale was from 1 to 5 for the worst to the best. The results are shown in the fifth column of Table I. The lower threshold value yields higher MOS than the larger threshold value, and the IPD-based coding technique shows better subjective results when the speech signal contains unvoiced components (samples 1–3). This is because the unvoiced components produce more inflection points.

D. Computational cost

The IPD method requires relatively less computation than the CVSDM. For the IPD algorithm in Fig. 4, for N speech samples, the encoder requires N multiplications, 2N additions, and at most 3N comparisons for inflection point detection. Table II summarizes the computational requirements for one speech sample. For multiplication, the proposed variable bit-rate coding method needs half or less of the computational requirements of the CVSDM.

TABLE II. Computational cost in number of operations for a speech sample.

Coding method          No. of multiplications    No. of additions    No. of comparisons
The CVSDM              2–3                       2–3                 2
The proposed method    1                         2                   2

FIG. 10. SNR performance of the IPD-based method and the CVSDM under noisy situation.

V. CONCLUSIONS

In this study, an inflection point detection-based nonuniform sampling technique has been proposed. The method detects speech samples at the inflection points, including local maxima and minima. Compared to the conventional nonuniform sampling method that detects only maximum and minimum points, the inflection point detection method shows reductions in the MSE. Application to speech coding has been considered to show the usefulness of the IPD-based NUS. With a small threshold value, the speech coder based on the IPD technique produces higher data rates of above 35 kbps, and better objective and subjective performance than the CVSDM. At low bit error rates (BER) below 5 × 10⁻⁵, the IPD-based speech coder exhibits up to 8.5 dB improvement in SNR over the CVSDM.

Bae, M., Lee, W., and Im, S. (1996). “On a new vocoder technique by the nonuniform sampling,” Proc. of IEEE MILCOM, McLean, VA, Vol. 2, pp. 649–652.
Boashash, B. (1992). “Estimating and interpreting the instantaneous frequency of a signal. Part 2: Algorithms and applications,” Proc. IEEE 80, 540–568.
Budaes, M., and Goras, L. (2005). “On speech signal reconstruction from local extreme values,” Proc. of ISSCS, Iasi, Romania, Vol. I, pp. 315–318.
Childers, D. G. (2000). Speech Processing and Synthesis Toolboxes (Wiley, New York), accompanied CD-ROM.
Davisson, L. D. (1968). “Data compression using straight line interpolation,” IEEE Trans. Inf. Theory IT-14(3), 390–394.
Fjällbrant, T. (1977). “Method of data reduction of sampled speech signals by using nonuniform sampling and a time-variable digital filter,” Electron. Lett. 13(11), 334–335.
Ghosh, P. K., and Sreenivas, T. V. (2006). “Dynamic programming based optimum non-uniform samples for speech reconstruction and coding,” Proc. ICASSP, Toulouse, France, Vol. I, pp. 1221–1224.
Kondoz, A. M. (1994). Digital Speech (Wiley, Chichester, UK), pp. 117–236.
Mark, J. W., and Todd, T. D. (1981). “A nonuniform sampling approach to data compression,” IEEE Trans. Commun. COM-29(1), 24–32.
Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T. (1986). Numerical Recipes: The Art of Scientific Computing (Cambridge University Press, London), pp. 77–101.
Rabiner, L., and Schafer, R. (1978). Digital Processing of Speech Signals (Prentice-Hall, Englewood Cliffs, NJ), pp. 172–240.


