Abstract: This article describes in detail how speech is converted into acoustic features, and introduces how different acoustic features are used in different models.

This article is shared from the Huawei Cloud Community " Do you really understand the principles behind voice features? ", author: White Horse Crossing Pingchuan.

Speech data is widely used in artificial intelligence tasks, but unlike images, raw speech usually cannot be fed directly into a model for training. Over long time spans the waveform shows no obvious feature changes, which makes it hard for a model to learn from. In addition, time-domain speech is typically sampled at 16 kHz, i.e. 16,000 sample points per second, so feeding raw samples into a model means a huge amount of training data and makes it difficult to reach practical results. Speech tasks therefore usually convert the audio into acoustic features that serve as the model's input or output. This article gives a detailed introduction to the process of converting speech into acoustic features, and to how different acoustic features are used in different models.
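
To get a feel for the amount of raw time-domain data, the minimal sketch below loads an audio file with librosa and inspects the samples (the file name audio.wav and the 16 kHz target rate are assumptions for illustration):

import librosa

# Load an example file resampled to 16 kHz (path and rate are example choices)
wav, sr = librosa.load("audio.wav", sr=16000)
print(sr)           # 16000 samples per second
print(wav.shape)    # e.g. a 10-second clip already contains 160,000 sample points
print(wav.dtype)    # float32 samples in the range [-1.0, 1.0]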

First of all, understanding how speech is produced is very helpful for understanding speech itself. People produce sound through the vocal tract, and the shape of the vocal tract determines what sound is made. The shape of the vocal tract involves the tongue, teeth, and so on. If we can determine this shape accurately, we can accurately describe the phoneme being produced. The shape of the vocal tract shows up in the envelope of the short-time power spectrum of speech. Obtaining the power spectrum, or the spectral envelope on top of the power spectrum, therefore gives us the characteristics of the speech.

1. Time domain diagram

Figure 1: Time domain diagram of audio

In a time-domain diagram, the speech signal is represented directly by its waveform over time. Figure 1 above shows the audio opened in Adobe Audition; it indicates that the quantization precision of this speech waveform is 16 bit. From the figure you can see the starting position of each sound, but it is hard to extract much more useful information. However, if we zoom in to a 100 ms window, we get the image shown in Figure 2 below.

Figure 2: Short-time time domain diagram of audio

From the figure above we can see that, over short time spans, the speech waveform has a certain periodicity, and different pronunciations correspond to different periodic patterns. Therefore, in the short-time domain, we can apply the Fourier transform to the waveform to obtain a frequency-domain representation and observe the periodic characteristics of the audio, which yields useful audio features.

The short-time Fourier transform (STFT) is the most classic time-frequency analysis method. As the name implies, it is a Fourier transform applied to short segments of the signal. Since the speech waveform is only approximately periodic over short time spans, the short-time Fourier transform lets us observe the frequency-domain behavior of speech more accurately. A schematic of the Fourier transform is shown below:

Figure 3: Schematic diagram of Fourier transform from time domain to frequency domain

The figure above shows how the Fourier transform converts a time-domain waveform into a frequency-domain spectrum. However, directly computing the discrete Fourier transform has O(N^2) complexity, which is impractical on a computer; in practice the Fast Fourier Transform (FFT), with O(N log N) complexity, is used instead. For the detailed derivation, refer to the Zhihu article linked in the FFT section below.
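
To make the complexity difference concrete, the minimal sketch below compares a naive O(N^2) evaluation of the DFT definition with numpy's FFT; the signal length here is an arbitrary example:

import numpy as np

def naive_dft(x):
    # Direct O(N^2) evaluation of X[k] = sum_n x[n] * exp(-2j*pi*k*n/N)
    N = len(x)
    n = np.arange(N)
    k = n.reshape((N, 1))
    return np.exp(-2j * np.pi * k * n / N) @ x

x = np.random.randn(1024)
# Same result as the FFT, but at a vastly higher cost for large N
assert np.allclose(naive_dft(x), np.fft.fft(x))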

2. Obtaining audio features

From the previous section we know the method and principle for obtaining the frequency-domain features of audio. Converting raw audio into the audio features used for model training, however, still requires a number of auxiliary operations. The overall process is shown in Figure 4 below:

Figure 4: Flow chart of audio conversion into audio features

(1) Pre-emphasis

Pre-emphasis passes the speech signal through a high-pass filter:

y(n) = x(n) − μ · x(n−1)

Here μ is usually taken to be 0.97. The purpose of pre-emphasis is to boost the high-frequency part so that the spectrum of the signal becomes flatter, allowing the spectrum to be computed with the same signal-to-noise ratio over the whole band from low to high frequencies. It also compensates for the suppression of high-frequency components by the vocal cords and lips during articulation, highlighting the high-frequency formants.

import numpy as np

pre_emphasis = 0.97
emphasized_signal = np.append(original_signal[0], original_signal[1:] - pre_emphasis * original_signal[:-1])  # y(n) = x(n) - μ·x(n-1); the first sample is kept unchanged

(2) Framing

The Fourier transform requires the input signal to be stationary; applying it to a non-stationary signal is meaningless. As noted above, speech is non-stationary over long time spans but approximately stationary and periodic over short ones. Macroscopically, speech is not stationary: as your mouth moves, the characteristics of the signal change. Microscopically, however, the mouth does not move that fast, so over a sufficiently short interval the speech signal can be regarded as stationary and a segment can be taken for the Fourier transform. This is why a framing operation is needed, i.e. short speech segments are extracted.

So how long is a frame? The frame length must meet two conditions:

  • From a macro point of view, it must be short enough that the signal within the frame is stationary. As mentioned earlier, changes in the shape of the mouth cause the signal to be non-stationary, so the mouth shape must not change noticeably within one frame; that is, a frame should be shorter than a phoneme. At a normal speaking rate, a phoneme lasts roughly 50 to 200 milliseconds, so the frame length is generally less than 50 milliseconds.
  • From a micro point of view, it must contain enough vibration periods, because the Fourier transform analyzes frequency, and a frequency can only be resolved if enough repetitions are observed. The fundamental frequency of speech is about 100 Hz for male voices and 200 Hz for female voices, corresponding to periods of 10 ms and 5 ms. Since a frame must contain multiple periods, it is generally at least 20 milliseconds.

Note: framing does not cut the audio into strictly disjoint segments; it uses a frame shift. That is, a frame window size is chosen, and the window is moved by the frame shift each time to extract a short audio segment. The frame shift is usually 5-10 ms, and the window size is usually 2-3 times the frame shift, i.e. 20-30 ms. The main reason for using a frame shift is the subsequent windowing operation.
The specific framing process is as follows:

Figure 5: Schematic diagram of audio framing
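
A minimal framing sketch under these conventions is shown below; it assumes the emphasized_signal produced by the pre-emphasis code above, a 16 kHz sample rate, and a 25 ms window with a 10 ms frame shift (all example choices):

import numpy as np

sample_rate = 16000                        # example sample rate
frame_length = int(0.025 * sample_rate)    # 25 ms window  -> 400 samples
frame_step = int(0.010 * sample_rate)      # 10 ms frame shift -> 160 samples

signal_length = len(emphasized_signal)
num_frames = 1 + int(np.ceil((signal_length - frame_length) / frame_step))

# Zero-pad the signal so that the last frame is complete
pad_length = (num_frames - 1) * frame_step + frame_length
padded = np.append(emphasized_signal, np.zeros(pad_length - signal_length))

# Row i of the index matrix holds the sample indices belonging to frame i
indices = (np.tile(np.arange(frame_length), (num_frames, 1)) +
           np.tile(np.arange(num_frames) * frame_step, (frame_length, 1)).T)
frames = padded[indices]                   # shape: (num_frames, frame_length)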

(3) Windowing

Before the Fourier transform is applied, each extracted frame must be "windowed", i.e. multiplied by a "window function", as shown in the figure below:

Figure 6: Schematic diagram of audio windowing

The purpose of windowing is to make the amplitude of a frame taper gradually to 0 at both ends. This tapering benefits the Fourier transform: it makes the peaks in the spectrum sharper and less likely to blur together (technically, it reduces spectral leakage). The price of windowing is that the two ends of a frame are attenuated and no longer count as much as the center. This is compensated by not taking frames back to back but letting them partially overlap; the time difference between the start positions of two adjacent frames is the frame shift.

We usually apply a Hamming window: each frame produced by the framing step is multiplied by a Hamming window to increase the continuity between its left and right ends. Assume the framed signal is S(n), n = 0, 1, …, N−1, where N is the frame size. After windowing, the signal becomes S'(n) = S(n) × W(n), where the Hamming window W(n) is:

W(n) = 0.54 − 0.46 · cos(2πn / (N − 1)),  0 ≤ n ≤ N − 1

Implementation code:

import numpy as np

# Build a simple Hamming window by hand
N = 200
x = np.arange(N)
y = 0.54 * np.ones(N) - 0.46 * np.cos(2 * np.pi * x / (N - 1))

# Apply the Hamming window to the framed signal (frames and frame_length come from the framing step)
frames *= np.hamming(frame_length)

(4) Fast Fourier transform (FFT)

Since the characteristics of a signal are usually hard to see from its time-domain form, it is normally converted to an energy distribution in the frequency domain for observation: different energy distributions represent the characteristics of different speech sounds. After multiplying by the Hamming window, each frame is passed through a fast Fourier transform to obtain its spectrum, i.e. the energy distribution over frequency. The principle of the fast Fourier transform was mentioned above and is not repeated here; for a detailed derivation and implementation, see the Zhihu article on the FFT (https://zhuanlan.zhihu.com/p/31584464).

Note: the fast Fourier transform of the audio returns complex numbers; the modulus of each value gives the amplitude of that frequency component, and its angle gives the phase.

Many libraries provide FFT/STFT functions; here are a few:

import librosa
import torch
import scipy.fftpack

# STFT with librosa (wav is a 1-D numpy array)
x_stft = librosa.stft(wav, n_fft=fft_size, hop_length=hop_size, win_length=win_length)
# STFT with PyTorch (wav must be a torch.Tensor)
x_stft = torch.stft(wav, n_fft=fft_size, hop_length=hop_size, win_length=win_length)
# Plain FFT with SciPy
x_stft = scipy.fftpack.fft(wav)

To obtain the amplitude spectrum of the audio, take the modulus of the complex spectrum; squaring it gives the power spectrum of the speech signal, which is the "linear spectrum" often referred to in speech synthesis.
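
A minimal sketch of this step, assuming the windowed frames from the code above and an FFT size NFFT of 512 (an example value):

import numpy as np

NFFT = 512
# Real FFT of each frame; result shape: (num_frames, NFFT // 2 + 1), complex values
mag_frames = np.absolute(np.fft.rfft(frames, NFFT))   # amplitude (magnitude) spectrum
pow_frames = (1.0 / NFFT) * (mag_frames ** 2)          # power spectrum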

(5) Mel spectrum

The range of frequencies the human ear can hear is 20-20,000 Hz, but the ear's perception of pitch is not linear in Hz. For example, if we get used to a 1000 Hz tone and then raise the frequency to 2000 Hz, our ears perceive only a slight increase in pitch, not a doubling of the frequency. The ordinary frequency scale can therefore be converted to the Mel frequency scale, which matches human auditory perception better. The mapping is given by the following formula:
m = 2595 · log10(1 + f / 700)
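
As a quick sanity check of this perceptual compression, the small sketch below converts a few frequencies to mel (the helper name hz_to_mel is just for illustration):

import numpy as np

def hz_to_mel(f):
    return 2595 * np.log10(1 + f / 700)

print(hz_to_mel(1000))   # ~1000 mel
print(hz_to_mel(2000))   # ~1521 mel: doubling the frequency adds far less than double the mel value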

In computation, the conversion from the linear frequency scale to the Mel scale is usually implemented with a bank of band-pass filters, typically triangular band-pass filters. The triangular band-pass filters serve two main purposes: they smooth the spectrum and remove the effect of harmonics, highlighting the formants of the original speech. The structure of the triangular band-pass filter bank is shown below:

Figure 7: Schematic diagram of triangular bandpass filter structure

This is a filter bank of non-uniform triangular band-pass filters: because the human ear is less sensitive to high-frequency energy, the filters are denser and narrower at low frequencies, so the low-frequency region retains noticeably more detail than the high-frequency region. The construction code of the triangular filter bank is as follows:

import numpy

# sample_rate, NFFT and pow_frames come from the previous steps
nfilt = 40  # number of Mel filters (example value)
low_freq_mel = 0
high_freq_mel = (2595 * numpy.log10(1 + (sample_rate / 2) / 700))  # Convert Hz to Mel
mel_points = numpy.linspace(low_freq_mel, high_freq_mel, nfilt + 2)  # Equally spaced in Mel scale
hz_points = (700 * (10**(mel_points / 2595) - 1))  # Convert Mel back to Hz
bin = numpy.floor((NFFT + 1) * hz_points / sample_rate)  # FFT bin index of each filter edge
fbank = numpy.zeros((nfilt, int(numpy.floor(NFFT / 2 + 1))))

for m in range(1, nfilt + 1):
    f_m_minus = int(bin[m - 1])   # left edge
    f_m = int(bin[m])             # center
    f_m_plus = int(bin[m + 1])    # right edge

    # Rising and falling slopes of the m-th triangular filter
    for k in range(f_m_minus, f_m):
        fbank[m - 1, k] = (k - bin[m - 1]) / (bin[m] - bin[m - 1])
    for k in range(f_m, f_m_plus):
        fbank[m - 1, k] = (bin[m + 1] - k) / (bin[m + 1] - bin[m])

# Apply the filter bank to the power spectrum frames
filter_banks = numpy.dot(pow_frames, fbank.T)
filter_banks = numpy.where(filter_banks == 0, numpy.finfo(float).eps, filter_banks)  # Numerical stability
filter_banks = 20 * numpy.log10(filter_banks)  # dB

Multiplying the linear (power) spectrum by the triangular filter bank and taking the logarithm yields the mel spectrum. For speech synthesis tasks, audio feature extraction usually stops here: the mel spectrum as an audio feature basically meets the needs of most speech synthesis tasks. In speech recognition, however, a discrete cosine transform (DCT) is applied on top of it. Because adjacent Mel filters overlap, their outputs are correlated; the DCT removes these correlations, which improves recognition accuracy. In speech synthesis this correlation should be preserved, so the DCT is only used in recognition. For a detailed explanation of the DCT, see the Zhihu article on the DCT (https://zhuanlan.zhihu.com/p/85299446).
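
As a reference, a mel spectrogram can also be obtained directly with librosa, and the DCT step used in recognition (yielding MFCC-style features) can be done with scipy; the parameter values below are example choices, not the only valid ones:

import librosa
import numpy as np
from scipy.fftpack import dct

# Mel spectrogram in one call (wav is a 1-D numpy array, sr its sample rate)
mel_spec = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=512, hop_length=160, n_mels=40)
log_mel = np.log(np.maximum(mel_spec, 1e-10))          # log-mel features, as used in synthesis

# For recognition: decorrelate the log filter-bank outputs with a DCT, keeping the first 13 coefficients
mfcc = dct(log_mel.T, type=2, axis=1, norm='ortho')[:, :13]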

If you want to learn more practical AI techniques, you are welcome to visit the AI section of HUAWEI CLOUD, which currently offers six hands-on camps, including AI programming and Python, free for everyone to learn. (Link to the six hands-on camps: http://su.modelarts.club/qQB9)

Click Follow to be the first to learn about Huawei Cloud's latest technologies~

