(Reproduced from Microsoft Research AI headlines)
Editor's note: Although deep neural network models have made progress on a wide range of speech tasks in recent years, they still rely on large amounts of annotated data. The emergence and development of self-supervised training methods has alleviated this problem to some extent. Recently, building on discriminative self-supervised pre-training and following the Transformer architecture used for natural language pre-training by the Natural Language Computing Group of Microsoft Research Asia, researchers from Microsoft Research Asia and the Microsoft Azure Speech Group proposed a new Denoising Masked Speech Modeling framework. Pre-trained on 94,000 hours of English speech, the resulting universal speech pre-training model, WavLM, surpassed all previous models on all 13 speech tasks of the SUPERB benchmark, ranking first, and also proved highly effective on 4 other classic speech evaluation datasets.
In the past two years, pre-trained models have attracted widespread attention from academia and industry in natural language processing and computer vision. Models pre-trained on large-scale unsupervised data generalize very well: fine-tuning on a small amount of labeled data is enough to improve the corresponding downstream tasks. Although pre-trained models have made some progress in speech processing, they had previously only been validated on speech recognition tasks.
To this end, researchers from Microsoft Research Asia and the Microsoft Azure Speech Group proposed the universal speech pre-training model WavLM. Using the Denoising Masked Speech Modeling framework, the researchers adapted WavLM to 17 tasks and achieved very strong results, extending speech pre-training from speech recognition to non-content speech tasks. Trained on 94,000 hours of unsupervised English data, WavLM also achieved state-of-the-art results on multiple speech-related datasets. The model has been open-sourced and integrated into Hugging Face's Transformers framework for easy use.
- Paper: https://arxiv.org/pdf/2110.13900.pdf
- Open-source code: https://aka.ms/wavlm
- Hugging Face integration: https://huggingface.co/microsoft/wavlm-large
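Since the model is integrated into Hugging Face's Transformers library (see the link above), the pre-trained encoder can be loaded in a few lines. The sketch below assumes the microsoft/wavlm-large checkpoint and its bundled feature extractor; the dummy waveform only illustrates the expected 16 kHz input format.

```python
import torch
from transformers import AutoFeatureExtractor, WavLMModel

# Load the released checkpoint linked above (weights are downloaded on first use).
feature_extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-large")
model = WavLMModel.from_pretrained("microsoft/wavlm-large")
model.eval()

# One second of dummy 16 kHz audio stands in for a real waveform.
waveform = torch.randn(16000)
inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Frame-level representations that downstream tasks can build on.
print(outputs.last_hidden_state.shape)  # (batch, frames, hidden_size)
```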
Speech tasks through the lens of self-supervised pre-training
Generative & discriminative self-supervised pre-training methods
Over the past few years, although deep neural network models have made breakthroughs on a variety of speech tasks, they remain constrained by the large amount of annotated data required for training. The emergence of self-supervised pre-training methods has alleviated this problem to some extent: a model is first pre-trained on large-scale unsupervised data and then fine-tuned on a small amount of labeled data. Studies have shown that self-supervised pre-training improves performance on a variety of speech tasks.
Depending on the pre-training objective, self-supervised pre-training methods can be divided into generative and discriminative approaches. Generative methods restore the original speech features through continuous or discrete latent variables, for example with an autoencoder, by predicting features at future time steps, or by reconstructing features that have been masked out. Discriminative methods pre-train the model through contrastive learning or by predicting discretized indices (ids), as in wav2vec 2.0 and HuBERT. After pre-training on 60,000 hours of data, both wav2vec 2.0 and HuBERT achieved state-of-the-art performance on the Librispeech speech recognition dataset. Both methods take the raw waveform as input and down-sample it with a CNN module; the down-sampled features are randomly masked and fed into a Transformer encoder. wav2vec 2.0 trains the model with contrastive learning: a vector quantizer discretizes the unmasked CNN outputs, and an InfoNCE loss is computed on the Transformer outputs at masked positions, where the positive sample is the discretized vector at that position and the negative samples are discretized vectors at other positions in the sequence. HuBERT borrows the masked language modeling loss from BERT and trains the model by predicting the discrete id of each masked position with the Transformer. HuBERT generates its training targets, i.e. the discrete id of each frame, iteratively: k-means clustering on the MFCC features of the speech produces the discrete ids used to train the first-generation HuBERT model, and the output representations of each trained generation are then clustered again to generate new ids for the next round of training.
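To make the iterative target generation described above concrete, here is a minimal sketch of the first-iteration step: clustering MFCC frames with k-means to obtain one discrete id per frame, which the Transformer is then trained to predict at masked positions. The function name, the codebook size, and the use of torchaudio and scikit-learn are illustrative assumptions, not the authors' exact recipe.

```python
# A minimal sketch of first-iteration target generation in the HuBERT style:
# k-means over MFCC frames yields one discrete id per frame, and the model is
# trained to predict these ids at masked positions.
import torch
import torchaudio
from sklearn.cluster import KMeans

N_CLUSTERS = 100  # illustrative codebook size

mfcc = torchaudio.transforms.MFCC(sample_rate=16000, n_mfcc=13)

def frame_targets(waveforms, n_clusters=N_CLUSTERS):
    """Cluster MFCC frames of 1-D waveforms into per-frame discrete ids."""
    feats = [mfcc(w).transpose(0, 1) for w in waveforms]  # (frames, n_mfcc) each
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(torch.cat(feats).numpy())
    # One id sequence per utterance, used as prediction targets at masked frames.
    return [km.predict(f.numpy()) for f in feats]

targets = frame_targets([torch.randn(16000), torch.randn(24000)])  # dummy audio
```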
Although wav2vec 2.0 and HuBERT have made very good progress, their performance had only been verified on speech recognition, and they can only handle single-speaker tasks, performing suboptimally on multi-speaker tasks such as speaker separation. In addition, because both models are pre-trained on the audiobook dataset LibriLight, their performance on out-of-domain downstream tasks is not ideal.
The new Denoising Masked Speech Modeling framework
Following the Transformer architecture that the Natural Language Computing Group of Microsoft Research Asia uses for natural language pre-training, the researchers proposed the Denoising Masked Speech Modeling pre-training scheme. As shown in the figure below, the WavLM model consists of a convolutional encoder (CNN Encoder) and a Transformer encoder. The convolutional encoder has 7 layers, each containing a temporal convolution, layer normalization, and a GELU activation. In the Transformer encoder, the researchers introduce a gated relative position bias into the attention computation to better model local information. During training, WavLM randomly transforms the input waveform, for example by mixing it with another waveform or adding background noise. It then randomly masks about 50% of the audio signal and predicts the label of each masked position at the output. WavLM follows the idea of HuBERT, converting the continuous signal into discrete labels with k-means and using those discrete labels as modeling targets. Formally, given an input speech signal X, its labels Y are first extracted; X is then noised and masked to produce X̂, and the Transformer model takes X̂ as input and predicts the labels Y at the masked positions.
Figure 1: WavLM model network structure
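The following sketch illustrates the input corruption described above: utterance mixing or noise addition, followed by masking of roughly 50% of the frames. All names, mixing weights, and span lengths are illustrative assumptions rather than the exact WavLM training configuration.

```python
import torch

def noise_and_mask(wav, other_wav=None, noise=None, mask_ratio=0.5, span=10):
    """Corrupt a 1-D waveform and build a frame-level mask (True = masked)."""
    x = wav.clone()
    if other_wav is not None:                 # mix in part of a second utterance
        seg = other_wav[: x.numel() // 2]
        x[: seg.numel()] += 0.5 * seg
    if noise is not None:                     # add background noise (assumed >= len(wav))
        x += 0.1 * noise[: x.numel()]
    n_frames = x.numel() // 320               # ~20 ms frames after CNN downsampling
    mask = torch.zeros(n_frames, dtype=torch.bool)
    while mask.float().mean() < mask_ratio:   # cover roughly half the frames in spans
        start = torch.randint(0, n_frames - span, (1,)).item()
        mask[start : start + span] = True
    return x, mask

corrupted, mask = noise_and_mask(torch.randn(16000), other_wav=torch.randn(16000))
```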
Large-scale training data
WavLM was pre-trained on 94,000 hours of English speech, currently the largest amount of open-source data used to train an English speech model. Large-scale unsupervised speech data from different domains helps improve WavLM's robustness. Most previous studies used only the LibriSpeech or LibriLight datasets for pre-training; since that audio comes from audiobooks, the generalization ability of the resulting models is limited. Moreover, the acoustic conditions of audiobooks differ from real-world scenarios, which are often accompanied by more noise.
Therefore, the researchers used two additional data sets to expand the training data:
(1) 10,000 hours of GigaSpeech data, collected from audiobooks, podcasts, and YouTube, covering a variety of topics such as art, science, and sports.
(2) VoxPopuli data. This is a large-scale multilingual unlabeled audio dataset consisting of more than 400,000 hours of audio in 23 languages, collected from 2009-2020 European Parliament (EP) recordings. The researchers used only the English portion of VoxPopuli, about 24,000 hours, for pre-training.
Together with the audiobook data from LibriLight, the researchers collected a total of 94,000 hours of data (LibriLight, VoxPopuli, and GigaSpeech). Researchers at Microsoft Research Asia believe that such a rich dataset improves the robustness of the model because it contains varied audio backgrounds, more speakers, and more diverse content. For brevity, the researchers call this dataset Mix 94k hr.
Task evaluation and experimental results
SUPERB (13 speech task benchmark)
The Speech processing Universal PERformance Benchmark (SUPERB) is an evaluation benchmark jointly proposed by National Taiwan University, the Massachusetts Institute of Technology, Carnegie Mellon University, and Meta. It contains 13 speech understanding tasks used to evaluate how well a pre-trained model performs. The 13 tasks are: Speaker Identification, Automatic Speaker Verification, Speaker Diarization, Phoneme Recognition, Automatic Speech Recognition, Keyword Spotting, Query by Example Spoken Term Detection (QbE), Intent Classification, Slot Filling, Emotion Recognition, Speech Separation, Speech Enhancement, and Speech Translation.
When fine-tuning for each task, the parameters of the pre-trained model are frozen, so as to measure whether the pre-trained model has learned the corresponding information during pre-training. The evaluation results show that WavLM surpasses previous pre-trained models, and even the base model, with fewer parameters, surpasses the previous best HuBERT Large model.
Figure 2: WavLM's performance on SUPERB Leaderboard
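The frozen-upstream protocol can be sketched as follows: the pre-trained encoder's weights stay fixed, and only a small downstream head is trained, here a learnable weighted sum over layers plus a linear classifier, loosely following the SUPERB convention. The checkpoint name and head design are illustrative assumptions, not the official evaluation code.

```python
import torch
from torch import nn
from transformers import WavLMModel

class FrozenWavLMClassifier(nn.Module):
    """Frozen WavLM encoder + learnable layer weights + linear head (sketch)."""

    def __init__(self, n_classes, checkpoint="microsoft/wavlm-base-plus"):
        super().__init__()
        self.upstream = WavLMModel.from_pretrained(checkpoint)
        for p in self.upstream.parameters():  # pre-trained weights are never updated
            p.requires_grad = False
        n_layers = self.upstream.config.num_hidden_layers + 1
        self.layer_weights = nn.Parameter(torch.zeros(n_layers))
        self.head = nn.Linear(self.upstream.config.hidden_size, n_classes)

    def forward(self, input_values):
        with torch.no_grad():
            out = self.upstream(input_values, output_hidden_states=True)
        hidden = torch.stack(out.hidden_states)                    # (layers, B, T, H)
        w = torch.softmax(self.layer_weights, dim=0)
        pooled = (w[:, None, None, None] * hidden).sum(0).mean(1)  # average over time
        return self.head(pooled)

clf = FrozenWavLMClassifier(n_classes=10)
logits = clf(torch.randn(2, 16000))  # two dummy one-second utterances
```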
Speaker Verification
The Speaker Verification task verifies whether two utterances are spoken by the same person, which has important applications in speech processing. The researchers used VoxCeleb2 as the training set and tested on VoxCeleb1, whose test set is split into three parts: Vox1-O, Vox1-E, and Vox1-H. For this task, the researchers chose the classic ECAPA-TDNN as the downstream model and showed that the pre-trained model can greatly reduce the error rate of speaker verification.
Table 1: The performance of WavLM on the speaker verification tasks Vox1-O, Vox1-E and Vox1-H
With the pre-trained model, the Equal Error Rate (EER) of the ECAPA-TDNN model drops by more than 50% relative, greatly improving accuracy, and WavLM again outperforms HuBERT on this task.
Thanks to its excellent performance on speaker verification, Hugging Face used WavLM as the seed model for fine-tuning and created an online demo that detects whether two utterances come from the same speaker.
Figure 3: Demo screenshot
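A speaker-verification check of this kind can be sketched with the Hugging Face x-vector head on top of WavLM. The checkpoint name and decision threshold below are assumptions based on the publicly released speaker-verification fine-tune, not necessarily the exact model behind the demo.

```python
import torch
from transformers import AutoFeatureExtractor, WavLMForXVector

checkpoint = "microsoft/wavlm-base-plus-sv"      # speaker-verification fine-tune
extractor = AutoFeatureExtractor.from_pretrained(checkpoint)
model = WavLMForXVector.from_pretrained(checkpoint)

# Two 16 kHz utterances to compare (dummy audio stands in for real recordings).
wav1, wav2 = torch.randn(16000), torch.randn(16000)
inputs = extractor([wav1.numpy(), wav2.numpy()], sampling_rate=16000,
                   return_tensors="pt", padding=True)

with torch.no_grad():
    emb = model(**inputs).embeddings
emb = torch.nn.functional.normalize(emb, dim=-1)

similarity = torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=-1)
print("same speaker" if similarity > 0.86 else "different speakers")  # illustrative threshold
```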
Speaker Diarization
The speaker diarization task, also known as voiceprint or speaker segmentation and clustering, answers the question of "who spoke when". Given a recording in which multiple people talk in turn, the task is to determine who is speaking at each point in time. For example, for a recording of a call between a customer and an agent, the task determines which time periods are spoken by the customer and which by the agent.
Table 2: The performance of WavLM on the speaker diarization task (CALLHOME dataset)
The researchers evaluated the model on the CALLHOME dataset and chose EEND-vector clustering as the overall diarization pipeline, which is divided into a speaker-vector extraction module and a clustering module. The experimental results show that the WavLM model greatly reduces the diarization error rate (DER).
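As a much simplified illustration of the second stage, the sketch below clusters chunk-level speaker embeddings into speakers; the real EEND-vector clustering system is considerably more involved, and all names and sizes here are illustrative.

```python
import torch
from sklearn.cluster import AgglomerativeClustering

def cluster_chunks(chunk_embeddings, n_speakers):
    """Assign a speaker label to each chunk-level speaker embedding."""
    emb = torch.nn.functional.normalize(chunk_embeddings, dim=-1).numpy()
    return AgglomerativeClustering(n_clusters=n_speakers).fit_predict(emb)

# e.g. embeddings from the speaker-vector branch for 20 chunks of a call
speaker_ids = cluster_chunks(torch.randn(20, 256), n_speakers=2)
```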
Speech Separation
The goal of speech separation is to split a recording containing multiple speakers so that each output source contains speech from only one person. The researchers used the LibriCSS dataset to evaluate speech separation; this benchmark runs an ASR model on the audio produced by the separation model and measures the word error rate (WER). The researchers chose the Conformer model as the downstream model, with the results shown in the table below:
Table 3: The performance of WavLM on the speech separation task LibriCSS dataset
WavLM greatly improves the quality of the audio produced by the separation model, exceeding the baseline both at 40% overlap and at 0% overlap.
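The evaluation idea, running ASR on each separated stream and scoring it against the reference transcript, can be illustrated with a small WER computation. The separation and ASR models themselves are treated as black boxes here, and the jiwer package is an assumed convenience, not part of the LibriCSS toolkit.

```python
from jiwer import wer  # pip install jiwer

reference = "the quick brown fox jumps over the lazy dog"   # ground-truth transcript
hypothesis = "the quick brown fox jumps over a lazy dog"    # ASR output on separated audio
print(f"WER: {wer(reference, hypothesis):.2%}")             # fraction of word errors
```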
Speech Recognition
The goal of speech recognition is to convert speech into text. For this task, the researchers used the Librispeech dataset, which contains a total of 960 hours of audiobook recordings, to verify the effect of WavLM pre-training. They considered four supervised subsets of different sizes for fine-tuning: train-1h, train-10h, train-clean-100h, and the full 960 hours of Librispeech, and compared results on the standard test sets test-clean and test-other. The following table shows the results for the 1h, 10h, and 100h subsets. Without a language model, WavLM significantly surpasses wav2vec 2.0; when decoding with various language models, WavLM's results are comparable to, or even better than, those of wav2vec 2.0 and HuBERT.
Table 4: The results of WavLM on the speech recognition task Librispeech dataset 1h, 10h and 100h
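A CTC-based fine-tuning step of the kind used for these subsets can be sketched as follows. The character-level processor is borrowed from wav2vec 2.0's Librispeech release purely to keep the example self-contained, and the hyper-parameters are illustrative; this is not the exact fine-tuning recipe from the paper.

```python
import torch
from transformers import WavLMForCTC, Wav2Vec2Processor

# Borrow a character-level processor so the sketch is self-contained;
# the actual recipe builds its own vocabulary.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = WavLMForCTC.from_pretrained(
    "microsoft/wavlm-large",               # a new CTC head is added and trained
    vocab_size=len(processor.tokenizer),
    ctc_loss_reduction="mean",
)

waveform = torch.randn(16000)              # dummy 16 kHz utterance
inputs = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
labels = processor.tokenizer("HELLO WORLD", return_tensors="pt").input_ids

loss = model(inputs.input_values, labels=labels).loss  # CTC loss
loss.backward()                                        # gradients for one fine-tuning step
```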
The following table shows the results of fine-tuning on the full 960-hour Librispeech dataset. WavLM surpasses all supervised models and achieves results comparable to wav2vec 2.0 and HuBERT. The experiments show that although WavLM introduces artificial noise and multi-speaker input during pre-training to improve performance on multi-speaker tasks, this does not hurt performance on single-speaker speech recognition; on the contrary, WavLM exceeds the baselines on several fine-tuning subsets, demonstrating the effectiveness of WavLM pre-training.
Table 5: The results of WavLM on the speech recognition task Librispeech dataset 960h
In the future, researchers at Microsoft Research Asia will continue to explore how to train larger models for better performance, as well as model compression methods so that models can run fast inference on low-resource devices. They will also further investigate joint training on large-scale unsupervised speech and text data.