This is Senior Xiao Wang, who loves to think. Today we continue our tour of speech recognition technology. The content is a bit more technical this time, so you may want to read it together with the previous article on roaming speech recognition technology.
In the last article we briefly covered what speech recognition is, its history, and the basic principles of recognition. This time, Senior Xiao Wang will take you deeper into the more technical side of speech recognition.
Table of contents (skim it first):
1. Basics of Speech Recognition
2. Signal Processing
1) Noise reduction
① Wavelet transform denoising
② Spectral subtraction
③ Adaptive noise cancellation
④ Audio filters
2) Pre-emphasis
3) Framing and windowing
4) Endpoint detection
3. Feature Extraction
4. Speech Recognition Methods
1) Acoustic model
2) Language model
3) Decoder
4) End-to-end learning methods
5. Deep Learning: A CNN Hands-on Example
6. Agora One-Stop Intelligent Voice Recognition Solution
7. Speech Recognition Development Platforms
Deep learning platforms
Speech recognition development platforms
8. Open source learning materials related to speech recognition
Open-source datasets
Open-source speech recognition projects
About the author
1. Basics of Speech Recognition
Speaking of speech recognition, we should first ask: what is sound?
Sound is usually thought of as a wave propagating through the air. Unlike a water wave, it does not propagate changes in height; it propagates changes in air density. For example, when we clap our hands, the vibration of the palms compresses the air: where the air is compressed the pressure rises above the surrounding atmospheric pressure, and where the air has been pushed away the pressure drops. The high-pressure region spreads outward from the palms, with the low-pressure region following close behind. A wave in which the air density changes periodically in this way is called a compression wave. When a compression wave in the air reaches a thin membrane such as the eardrum, the membrane vibrates. A microphone's job is to capture this vibration as an electrical signal. You can refer to the picture below.
Several waveforms and their superposition (click the image for the source)
With the amplitude of vibration as the vertical axis and time as the horizontal axis, the sound can be visualized.
In other words, sound travels as a wave, i.e., a sound wave. Viewed this way, amplitude (magnitude), frequency, and phase are the building blocks of every sound wave and of every superposition of sound waves; the different pitches, volumes (loudness), and timbres we perceive are likewise combinations of these basic quantities.
Every sound wave in the world can be decomposed into basic waves, which is also the basic idea of the Fourier transform. Different sound waves have different frequencies and amplitudes (which determine loudness), and the human ear has its own range of reception, roughly 20 Hz to 20 kHz. Sound waves above this range are called ultrasound, and those below it infrasound, although other animals can hear different ranges of sound.
Everyone should now have a preliminary picture of ASR. Speech recognition is a statistical optimization problem: given the input (observation) sequence O = {O1, ..., On}, find the most likely word sequence W = {W1, ..., Wm}, i.e., the word sequence that maximizes the probability P(W|O). By Bayes' rule:
W* = argmax_W P(W|O) = argmax_W P(O|W)·P(W) / P(O)
Here P(O|W) is called the acoustic model; it describes the probability of observing the acoustics O given the words W. P(W) is called the language model; it gives the probability of a particular word sequence. P(O), the probability of the observation sequence, is fixed for a given utterance, so we only need to maximize the numerator.
The basic unit chosen for speech is the frame. A frame of data is produced from a short segment of speech by the acoustic feature extraction module of the ASR front end, and the whole utterance can then be organized as a sequence of frame vectors. The dimension of each frame is fixed, while the span of text it maps to can vary to suit different text units such as phonemes, characters, words, and sentences.
Most speech recognition research has modeled the acoustic and language models separately, with much of the effort going into improving the acoustic model. More recently, end-to-end methods based on deep learning and big data have emerged that fold the acoustic and language models together and compute P(W|O) directly.
2. Signal Processing
1. Noise reduction processing
Before noise reduction, let me tell you why noise reduction is necessary.
When we record audio, a lot of noise gets mixed in, and the noise differs across environments and situations. The irregular ripples in the noise corrupt the inherent acoustic characteristics of the signal, degrade the quality of the signal being analyzed, and significantly affect the recognition result of a speech recognition system. We therefore need to reduce noise before analyzing and processing the sound signal. (For a more detailed classification of noise, see my earlier article.)
Let's look at a few common methods of noise reduction:
①Wavelet transform denoising method
The wavelet transform denoising method is usually just called wavelet denoising; for sound, the wavelet threshold denoising method is the most widely used. The idea is that in a noisy signal, the effective sound signal and the noise have different wavelet coefficients at different frequencies: the energy of the effective signal is concentrated, so its wavelet coefficients have relatively large absolute values in the regions where the energy is concentrated, whereas the energy of the noise is scattered, so its coefficients have relatively small absolute values. Based on this property, the wavelet transform decomposes the noisy signal into different frequency bands, a threshold is set to adjust the coefficients, the wavelet coefficients belonging to the effective signal are retained, and finally the effective sound signal is reconstructed with the wavelet reconstruction algorithm, achieving the denoising effect.
That is the basic principle; the threshold can be set with either a hard-threshold or a soft-threshold rule. If you are interested in the specific formulas and calculations, you can look them up or leave me a comment. Below is a before-and-after comparison obtained with wavelet denoising (in a MATLAB environment):
Noisy signal waveform
Waveform after wavelet denoising
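To make the idea concrete, here is a minimal sketch of wavelet threshold denoising in Python using the PyWavelets library. The wavelet choice, decomposition level, and universal threshold are illustrative assumptions, not the exact settings used for the figures above.

```python
import numpy as np
import pywt  # PyWavelets

def wavelet_denoise(signal, wavelet="db4", level=4):
    # Decompose the noisy signal into wavelet coefficients
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    # Estimate the noise level from the finest detail coefficients (median absolute deviation)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    # Universal threshold; soft thresholding shrinks small coefficients and keeps large ones
    threshold = sigma * np.sqrt(2 * np.log(len(signal)))
    denoised = [coeffs[0]] + [pywt.threshold(c, threshold, mode="soft") for c in coeffs[1:]]
    # Reconstruct the signal from the thresholded coefficients
    return pywt.waverec(denoised, wavelet)[: len(signal)]
```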
②Spectral subtraction
Spectral subtraction, also called spectral subtraction noise reduction, is a denoising method that relies on the additivity of noise, its local stationarity, and its lack of correlation with the effective sound signal. It does not require a noise reference signal. The main idea is that since the noisy signal is the effective signal plus noise, the power of the noisy signal equals the power of the effective speech signal plus the power of the noise. The noise spectrum is estimated from a "silent" segment (one containing no effective signal, only system or environmental noise), this estimate stands in for the noise spectrum present while speech exists, and subtracting it from the spectrum of the noisy signal yields an estimate of the spectrum of the effective sound signal.
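As an illustration, here is a minimal magnitude spectral subtraction sketch in Python. It assumes the first noise_dur seconds of the recording are "silent" (noise only), which is an assumption made for the example, and it uses librosa for the STFT.

```python
import numpy as np
import librosa

def spectral_subtraction(noisy, sr, noise_dur=0.5, n_fft=512, hop=128):
    # Short-time Fourier transform of the noisy signal
    spec = librosa.stft(noisy, n_fft=n_fft, hop_length=hop)
    mag, phase = np.abs(spec), np.angle(spec)
    # Estimate the noise magnitude spectrum from the leading "silent" segment
    noise_frames = max(1, int(noise_dur * sr / hop))
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)
    # Subtract the noise estimate and floor negative values at zero
    clean_mag = np.maximum(mag - noise_mag, 0.0)
    # Reconstruct the time-domain signal using the original phase
    return librosa.istft(clean_mag * np.exp(1j * phase), hop_length=hop)
```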
③Adaptive noise cancellation method
The core components of adaptive noise cancellation are the adaptive algorithm and the adaptive filter. The adaptive algorithm automatically adjusts the filter's weighting coefficients so that the filter achieves the best filtering effect; the key to the method is therefore finding an algorithm that can adjust the weights automatically.
The main idea is this: besides the noisy signal x(t) = s(t) + n(t), assume a reference signal r(t) is available that is correlated with the noise n(t) but uncorrelated with the effective sound signal s(t). The Widrow LMS algorithm (an adaptive algorithm that approximates steepest descent) can then be used to cancel the noise in the noisy signal and achieve the denoising effect.
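Here is a minimal LMS (least-mean-squares) adaptive noise canceller sketch. It assumes a reference microphone signal that picks up (a filtered version of) the noise but not the speech; the filter length and step size are illustrative values.

```python
import numpy as np

def lms_noise_cancel(noisy, reference, filter_len=32, mu=0.01):
    """LMS adaptive noise canceller: noisy = s(t) + n(t), reference correlates with n(t)."""
    w = np.zeros(filter_len)      # adaptive filter weights
    out = np.zeros(len(noisy))    # estimated clean signal (the canceller's error signal)
    for i in range(filter_len, len(noisy)):
        x = reference[i - filter_len:i][::-1]  # most recent reference samples
        noise_est = w @ x                      # filter output approximates the noise
        e = noisy[i] - noise_est               # error = cleaned sample
        w += 2 * mu * e * x                    # LMS weight update (approximate steepest descent)
        out[i] = e
    return out
```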
④Sound filter
As an important part of digital signal processing, digital filters remove noise components through numerical computation. There are many kinds; classified by the time-domain characteristics of the impulse response, they fall into two types: Infinite Impulse Response (IIR) filters and Finite Impulse Response (FIR) filters. Both types can implement the four basic functions of low-pass, high-pass, band-pass, and band-stop filtering.
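As a small example, here is an IIR band-pass filter built with SciPy; the cutoff frequencies (roughly the telephone speech band) and filter order are illustrative choices.

```python
from scipy.signal import butter, lfilter

def bandpass_filter(signal, sr, low_hz=300.0, high_hz=3400.0, order=4):
    # Design a Butterworth band-pass IIR filter with normalized cutoff frequencies
    nyquist = sr / 2.0
    b, a = butter(order, [low_hz / nyquist, high_hz / nyquist], btype="band")
    # Apply the filter to the signal
    return lfilter(b, a, signal)
```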
2. Pre-emphasis
Pre-emphasis is a signal processing method that compensates the high-frequency components of the input signal at the transmitting end. As the signal rate increases, the signal suffers significant loss during transmission, and to obtain a good waveform at the receiving end the damaged signal must be compensated. The idea of pre-emphasis is to boost the high-frequency components of the signal at the start of the transmission line to compensate for their excessive attenuation along the way. Pre-emphasis has no effect on the noise, so it effectively improves the output signal-to-noise ratio. (This is the standard textbook explanation.)
The principle of pre-emphasis: the energy of the speech signal is concentrated in the low-frequency band, while the high-frequency band carries much less energy; meanwhile, the power spectral density of the noise at the output of the frequency discriminator grows with the square of the frequency (noise is small at low frequencies and large at high frequencies). As a result, the signal-to-noise ratio is high at low frequencies but clearly insufficient at high frequencies, which weakens high-frequency transmission and makes it difficult. Therefore the high-frequency part of the signal is emphasized before transmission and de-emphasized at the receiving end, improving the quality of signal transmission.
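In speech front ends, pre-emphasis is usually implemented as a simple first-order high-pass difference. A minimal sketch follows; the coefficient 0.97 is a commonly used value, chosen here for illustration.

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    # y[n] = x[n] - alpha * x[n-1]: a first-order high-pass that boosts high frequencies
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])
```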
3. Framing and windowing
"Framing" is to divide a sound signal into some audio signals of equal time length. It can be obtained by smoothly moving a window function of a prescribed length on the pre-emphasized sound signal. The window size of the window function is determined by the sampling frequency of the sound signal. Using a window function that can move with time to "overlap and framing" pig sound signals can prevent the omission of effective sound signals during framing, and also ensure that each sound signal maintains stability and continuity when sliding .
Several commonly used window functions: power window, rectangular window, triangular window, Hanning window, Hamming window, Gaussian window
Hamming window example
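Here is a minimal sketch of overlapped framing with a Hamming window; the 25 ms frame length and 10 ms frame shift are typical values used for illustration, and the signal is assumed to be at least one frame long.

```python
import numpy as np

def frame_and_window(signal, sr, frame_ms=25, hop_ms=10):
    frame_len = int(sr * frame_ms / 1000)   # samples per frame
    hop_len = int(sr * hop_ms / 1000)       # frame shift (overlap = frame_len - hop_len)
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    window = np.hamming(frame_len)          # Hamming window
    frames = np.stack([
        signal[i * hop_len : i * hop_len + frame_len] * window
        for i in range(n_frames)
    ])
    return frames  # shape: (n_frames, frame_len)
```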
4. Endpoint detection
Endpoint detection means determining the starting point and ending point of the effective signal within a sound recording. The collected sound signal contains invalid segments; performing endpoint detection to locate the start and end of the speech signal removes a large amount of interference, cuts out silent segments, reduces the computation required for subsequent feature extraction, and shortens the extraction time.
MATLAB endpoint detection comparison
Common methods:
The short-time zero-crossing rate is the number of times each frame of the signal crosses zero. The algorithm counts how many times the sign of the amplitude changes within a frame: if adjacent samples have the same sign, no zero crossing occurs; if the sign changes between adjacent samples, the signal has crossed zero.
Short-time energy reflects, to some extent, the amplitude variation of the sound signal. It can be used to distinguish unvoiced from voiced sounds, because the energy of unvoiced segments is much smaller than that of voiced segments, and to distinguish silence from speech, because the short-time energy of a silent segment is essentially zero while a speech segment has energy.
The dual-threshold endpoint detection method is one of the most commonly used. It determines the endpoints of the sound signal from both the short-time energy and the short-time average zero-crossing rate. Using the zero-crossing rate alone may detect a start and end that are too wide, slowing down the processing system; using short-time energy alone may include noise, making the extracted signal inaccurate. Combining the two, i.e., the dual-threshold detection method, gives more reliable endpoints for the sound signal.
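Below is a deliberately simplified dual-threshold sketch in Python, operating on windowed frames like those produced in the framing step above. The threshold ratios are illustrative assumptions and would need tuning in practice.

```python
import numpy as np

def short_time_energy(frames):
    # Sum of squared amplitudes in each frame
    return np.sum(frames ** 2, axis=1)

def short_time_zcr(frames):
    # Number of sign changes between adjacent samples in each frame
    return np.sum(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)

def dual_threshold_endpoints(frames, high_ratio=0.25, low_ratio=0.05, zcr_ratio=0.4):
    energy = short_time_energy(frames)
    zcr = short_time_zcr(frames)
    # Step 1: frames whose energy exceeds the high threshold are treated as definite speech
    core = np.where(energy > high_ratio * energy.max())[0]
    if len(core) == 0:
        return None, None
    start, end = core[0], core[-1]
    # Step 2: extend both ends while the energy stays above the low threshold
    #         or the zero-crossing rate stays high (to keep unvoiced consonants)
    low_e = low_ratio * energy.max()
    high_z = zcr_ratio * zcr.max()
    while start > 0 and (energy[start - 1] > low_e or zcr[start - 1] > high_z):
        start -= 1
    while end < len(frames) - 1 and (energy[end + 1] > low_e or zcr[end + 1] > high_z):
        end += 1
    return start, end  # frame indices of the detected speech segment
```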
3. Feature Extraction
Next, let's learn about MFCC feature extraction in detail.
First, a word on MFCC. When the human ear receives a signal, different frequencies cause vibrations in different parts of the cochlea; the cochlea acts like a spectrum analyzer that automatically extracts features from the speech signal. In speech recognition, MFCC (Mel-Frequency Cepstral Coefficients) is the most commonly used feature. The steps of MFCC extraction are as follows:
- Frame the speech signal
- Estimate the power spectrum of each frame with the periodogram method
- Apply a Mel filter bank to the power spectrum and compute the energy in each filter
- Take the logarithm of each filter's energy
- Apply the discrete cosine transform (DCT)
- Keep DCT coefficients 2-13 and discard the rest
Among them, the first two steps are short-time Fourier transform, and the latter steps mainly involve the Mel spectrum.
Basic flow chart
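In practice you rarely implement these steps by hand; librosa, for instance, wraps the whole pipeline. A minimal sketch, where the file name and frame settings are illustrative:

```python
import librosa

# Load 16 kHz audio and compute 13 MFCCs per frame; librosa performs the framing,
# power spectrum, Mel filter bank, log, and DCT steps internally.
y, sr = librosa.load("speech.wav", sr=16000)            # "speech.wav" is a placeholder file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)  # 25 ms frames, 10 ms shift at 16 kHz
print(mfcc.shape)                                       # (13, number_of_frames)
```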
Important feature-extraction concepts that everyone should master:
The zero-crossing rate is the rate of sign changes of the signal, i.e., the number of times per frame that the speech signal goes from positive to negative or from negative to positive. This feature is widely used in speech recognition and music information retrieval, and is usually higher for percussive, high-energy sounds such as metal and rock. In general, the larger the zero-crossing rate, the higher the approximate frequency.
The spectral centroid is one of the important physical parameters describing timbre. It is the center of gravity of the frequency components: the energy-weighted average frequency over a given frequency range, measured in Hz, and it carries important information about the frequency and energy distribution of the signal. Perceptually, the spectral centroid describes the brightness of a sound: dark, dull sounds have more low-frequency content and a relatively low centroid, while bright, cheerful sounds concentrate their energy at high frequencies and have a relatively high centroid. This parameter is often used in the analysis of instrument timbre.
The spectral roll-off is a measure of the shape of the signal's spectrum: the frequency below which a specified percentage of the total spectral energy lies.
Mel-frequency cepstral coefficients (MFCC) are cepstral parameters extracted on the Mel-scale frequency axis. The Mel scale describes the non-linear way the human ear perceives frequency: for example, if a tone's frequency is raised from 1000 Hz to 2000 Hz, our ears perceive only a slight increase in pitch rather than a doubling. Converting frequencies to the Mel scale makes the features match human auditory perception better.
Chroma frequencies are an interesting and powerful representation of music audio in which the entire spectrum is projected onto 12 bins representing the 12 semitones of the musical octave.
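For reference, all of the descriptors above can be computed with librosa; a short sketch, where the file name is a placeholder:

```python
import librosa

y, sr = librosa.load("speech.wav", sr=None)                                # placeholder file
zcr = librosa.feature.zero_crossing_rate(y)                                # zero-crossing rate per frame
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)                   # spectral centroid in Hz
rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr, roll_percent=0.85)  # 85% roll-off frequency
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)                         # Mel-frequency cepstral coefficients
chroma = librosa.feature.chroma_stft(y=y, sr=sr)                           # 12-bin chroma features
```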
4. Speech Recognition Methods
In today's mainstream speech recognition systems, the acoustic model is a hybrid model: a Hidden Markov Model (HMM) handles the transitions between states, and a deep neural network predicts the state from the current frame.
1. Acoustic model
Hidden Markov Model (HMM) is a common model used to model discrete time series. It has been used in speech recognition for decades and is considered a typical acoustic model.
The main topics in an HMM are: two sequences (the hidden states and the observations), three kinds of probabilities (initial state probabilities, state transition probabilities, and emission probabilities), three basic problems (computing the probability of generating an observation sequence, decoding the best hidden state sequence, and training the model itself), and the common algorithms for these problems (the forward/backward algorithm, the Viterbi algorithm, and the EM algorithm). Using a speech recognition system in practice corresponds to the decoding problem, so running the recognizer is also called decoding.
Before studying the HMM, let's briefly review the Markov chain, a way of modeling random processes. A simple weather example: whether it rains today is related to whether it rained the day before. Applied to speech recognition: we can observe the spectrum of the speech, but not the hidden meaning behind it; from the history of observed spectra we can infer the result corresponding to a new spectrum.
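Since the decoding problem mentioned above is typically solved with the Viterbi algorithm, here is a minimal sketch of Viterbi decoding for a discrete HMM; the probability matrices are assumed to be given in the log domain.

```python
import numpy as np

def viterbi(obs_logprob, trans_logprob, init_logprob):
    """Find the most likely hidden state path.
    obs_logprob:   (T, S) log P(observation_t | state)
    trans_logprob: (S, S) log P(next state | current state)
    init_logprob:  (S,)   log P(state at t=0)
    """
    T, S = obs_logprob.shape
    delta = init_logprob + obs_logprob[0]        # best log-prob of a path ending in each state
    backptr = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + trans_logprob  # (S, S): previous state -> current state
        backptr[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + obs_logprob[t]
    # Trace back the best path from the final best state
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]
```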
The Gaussian Mixture Model (GMM) is mainly used to obtain the probability of the acoustic observation given a phoneme state, i.e., the emission probability.
In speech recognition, HMMs are used for acoustic modeling at the sub-word level (e.g., phonemes). Usually a phoneme is modeled with a 3-state HMM, the states representing the beginning, middle, and end of the phoneme.
Today's popular speech systems no longer use a GMM but a neural network whose input is the feature vector of the current frame (possibly with some context frames before and after) and whose output is the probability of each phoneme state. For example, with 50 phonemes and 3 states per phoneme, the network has 50 × 3 = 150 outputs. Such an acoustic model is called a "hybrid" or HMM-DNN system; it differs from the earlier HMM-GMM model, but the HMM itself is still in use.
2. Language model
The problem the language model solves is how to compute P(W). The commonly used approaches are based on n-gram grammars or RNNs; at present, n-gram language models and RNN language models are the main options.
The n-gram language model is a typical autoregressive model. In an RNN language model the current output depends on the preceding context, so a unidirectional recurrent neural network can be used for the modeling. There is much more to this topic if you are interested; here I will only cover the important points.
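As a toy illustration of the n-gram idea, here is a tiny bigram model with add-one smoothing; the mini-corpus is made up for the example.

```python
from collections import Counter, defaultdict

# A toy bigram language model with add-one (Laplace) smoothing, for illustration only
corpus = [["i", "like", "speech"], ["i", "like", "music"], ["you", "like", "speech"]]

bigram = defaultdict(Counter)   # bigram[prev][cur] = count of (prev, cur)
context = Counter()             # context[prev] = count of prev as a bigram context
for sent in corpus:
    for prev, cur in zip(sent, sent[1:]):
        bigram[prev][cur] += 1
        context[prev] += 1

vocab = {w for sent in corpus for w in sent}

def p_bigram(cur, prev):
    # P(cur | prev) with add-one smoothing
    return (bigram[prev][cur] + 1) / (context[prev] + len(vocab))

print(p_bigram("speech", "like"))  # (2 + 1) / (3 + 5) = 0.375
```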
3. Decoder
As noted above for P(W|O), our ultimate goal is to choose the W that maximizes P(O|W)P(W), so decoding is essentially a search problem. Mainstream systems use a Weighted Finite-State Transducer (WFST) to represent the search space and search for the optimal path in a unified framework (just grasp the idea for now).
4. End-to-end learning methods
CTC (Connectionist Temporal Classification) was proposed and applied to speech recognition as early as 2006, but it really took off after 2012, when CTC research spread widely. CTC is essentially a loss function: the input is a sequence and the output is also a sequence, and the loss encourages the model's output sequence to fit the target sequence as closely as possible. Previously you needed a frame-level alignment between the speech and the labels; with CTC you do not, since it only cares whether the predicted output sequence is close to (or the same as) the true sequence.
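A minimal sketch of how a CTC loss is used, here with PyTorch's nn.CTCLoss; the tensor shapes and the blank index follow that API's convention, and the data is random just to show the call.

```python
import torch
import torch.nn as nn

# The model output is (T, N, C) log-probabilities over C classes (blank at index 0);
# targets are unaligned label sequences, which is exactly what CTC allows.
ctc_loss = nn.CTCLoss(blank=0)
T, N, C = 50, 4, 28                                        # time steps, batch size, classes (27 labels + blank)
log_probs = torch.randn(T, N, C).log_softmax(dim=2)        # stand-in for network outputs
targets = torch.randint(1, C, (N, 10), dtype=torch.long)   # random label sequences (no blanks)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
```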
Attention model: after reading many definitions, I still find it easiest to start from an everyday example.
When we humans look at something, what we attend to at any moment is a particular part of what we are looking at; when our gaze moves elsewhere, our attention moves with it. The attention mechanism works similarly: the intermediate outputs of the LSTM encoder over the input sequence are retained, and a model is trained to selectively learn from these inputs and to associate them with the output sequence as the model generates it.
5. Deep Learning: A CNN Hands-on Example
Having covered so much theory, let me now use Python code to briefly walk through a CNN model, so that you can understand the audio classification process through an example. I also recommend taking a look at this PPT (click to view and download); it is genuinely useful material!
# Build the CNN model (Keras; imports shown for completeness)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPool2D, Dropout, Flatten, Dense
model = Sequential()
# Input size
input_dim = (16, 8, 1)
model.add(Conv2D(64, (3, 3), padding="same", activation="tanh", input_shape=input_dim))  # convolution layer
model.add(MaxPool2D(pool_size=(2, 2)))  # max pooling
model.add(Conv2D(128, (3, 3), padding="same", activation="tanh"))  # convolution layer
model.add(MaxPool2D(pool_size=(2, 2)))  # max pooling layer
model.add(Dropout(0.1))
model.add(Flatten())  # flatten
model.add(Dense(1024, activation="tanh"))
model.add(Dense(20, activation="softmax"))  # output layer: 20 units give the probabilities of 20 classes
# Compile the model: set the loss function, optimizer, and evaluation metric
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()
# Train the model
model.fit(X_train, Y_train, epochs=20, batch_size=15, validation_data=(X_test, Y_test))
# Predict on the test set
import glob
import os
import numpy as np
import pandas as pd
import librosa
from tqdm import tqdm

def extract_features(test_dir, file_ext="*.wav"):
    feature = []
    for fn in tqdm(glob.glob(os.path.join(test_dir, file_ext))[:]):  # iterate over all files in the dataset
        X, sample_rate = librosa.load(fn, res_type='kaiser_fast')
        mels = np.mean(librosa.feature.melspectrogram(y=X, sr=sample_rate).T, axis=0)  # use the mean Mel spectrogram as the feature
        feature.extend([mels])
    return feature

X_test = extract_features('./test_a/')
X_test = np.vstack(X_test)
predictions = model.predict(X_test.reshape(-1, 16, 8, 1))
preds = np.argmax(predictions, axis=1)
preds = [label_dict_inv[x] for x in preds]
path = glob.glob('./test_a/*.wav')
result = pd.DataFrame({'name': path, 'label': preds})
result['name'] = result['name'].apply(lambda x: x.split('/')[-1])
result.to_csv('submit.csv', index=None)
# Notebook sanity checks: count the test files and the lines of the submission file
!ls ./test_a/*.wav | wc -l
!wc -l submit.csv
6. Agora One-Stop Intelligent Voice Recognition Solution
Now that we have covered the prerequisite knowledge of speech recognition, let's think about what happens as it is used more and more widely in voice chat, music-based social apps, live video streaming, and other "sound"-related social scenarios. The most prominent problem is that with existing voice content moderation plus real-time audio/video services, the cost of deployment, debugging, and operation is high, and many solutions recognize audio poorly in the presence of background music and noise.
Senior Xiao Wang has looked into many solutions and found Agora's one-stop intelligent speech recognition solution quite good, so I recommend it to everyone. You will surely ask why I think it is good, and what exactly is good about it.
Let me first describe the existing traditional solution, which in simple terms has three steps:
- The content is transcoded or directly streamed to the CDN;
- Content review vendors pull the stream from CDN, and then conduct AI and manual content review;
- After the review is completed, the result is sent back to the server.
Traditional real-time audio and video content review process (click the image for the source)
The problems with this approach: first, developers need to integrate with three vendors and go through multiple rounds of deployment and debugging, which adds cost and risk. Moreover, when the CDN fails it takes a long time to troubleshoot, and pulling the stream incurs additional cost.
On the other hand, current solutions still have to deal with noise. Scenarios such as voice-based social apps and voice FM often come with background music and environmental noise, which lowers the recognition rate of existing content review solutions.
Agora now provides a one-stop intelligent speech recognition solution that is unique in the industry:
Developers only need to integrate the Agora SDK into their application, and the audio content can be recognized and reviewed while it is transmitted in real time over the Agora SD-RTN™ network. The solution also integrates the industry's top speech recognition services, removes background sounds with Agora's self-developed AI audio noise reduction engine, and optimizes audio quality so that speech is clearer.
The advantages of Agora's speech recognition solution:
1. RESTful API, one-stop access: after integrating the Agora SDK into the application, developers can add voice content review to their apps by calling a RESTful API. Compared with traditional content review solutions, this saves development time and server access costs.
2. AI noise reduction, higher recognition rate: Agora's AI audio noise reduction engine optimizes the audio to improve the speech recognition rate.
3. Low-latency voice interaction: the SDK achieves end-to-end real-time audio and video transmission with a global latency of 76 ms. The Agora SD-RTN™ real-time communication network transmits over a private UDP protocol and uses software-defined routing to choose the optimal path, automatically avoiding network congestion and backbone network failures.
So, after comparing the pros and cons of Agora's solution with the traditional one, don't you think the one-stop solution looks very attractive?
In addition, I would like to recommend a useful tool: Agora's "Crystal Ball".
In short, Crystal Ball is the first quality monitoring and data analysis tool in the RTC industry, launched by Agora; it mainly addresses the problem that the feedback chain from end users is too long. If you want to know more, you can click here.
Features:
1. Self-built monitoring
2. Integration of multiple RTC monitoring tools
3. The quality investigation tool provided by the RTC service provider
7. Speech Recognition Development Platforms
Deep learning platforms
Carefully compiled by Senior Xiao Wang (bookmark it now)
Speech recognition development platforms
Carefully compiled by Senior Xiao Wang (bookmark it now)
8. Open source learning materials related to speech recognition
Open-source datasets
- Speech separation datasets
- ByteDance's release of the world's largest piano MIDI dataset
- Chinese and English NLP dataset search library: CLUEDatasetSearch
- Tsinghua Chinese speech dataset THCHS-30
- Chinese celebrity voiceprint dataset CN-Celeb2
Open-source speech recognition projects
- Agora's open-source audio encoder SOLO
- https://github.com/Uberi/speech_recognition
- https://github.com/xxbb1234021/speech_recognition
- https://github.com/SeanNaren/deepspeech.pytorch
- https://github.com/srvk/eesen
- https://github.com/kaldi-asr/kaldi
(Friends, remember to like and bookmark this after reading. Senior Xiao Wang hopes it helps everyone~)