Recently, the collaborative paper "Time-Frequency Attention for Monaural Speech Enhancement Based on Time-Frequency Perceptual Domain Model" by the Alibaba Cloud Video Cloud Audio Technology Team and the team of Professor Li Haizhou from the National University of Singapore was accepted by ICASSP 2022, and the team was invited to present the research to academia and industry at the conference in May this year. ICASSP (International Conference on Acoustics, Speech and Signal Processing) is the world's largest and most comprehensive top-level conference in the speech field, spanning signal processing, statistical learning, and wireless communications.
Author | Qi Qi
This collaborative paper proposes a TF attention (TFA) module that incorporates the distribution characteristics of speech and significantly improves the objective metrics of speech enhancement while adding almost no extra parameters.
arxiv link: https://arxiv.org/abs/2111.07518
Review of previous research results:
INTERSPEECH 2021: "Temporal Convolutional Network with Frequency Dimension Adaptive Attention for Speech Enhancement"
Link:
https://www.isca-speech.org/archive/pdfs/interspeech_2021/zhang21b_interspeech.pdf
1. Background
Speech enhancement algorithms are designed to remove unwanted signal components, such as background noise, from speech signals. Speech enhancement is a basic component of many speech processing applications, such as online video conferencing and calls, smart short-video editing, live video streaming, social entertainment, and online education.
2. Summary
In most current research on supervised learning algorithms for speech enhancement, the energy distribution of speech in the time-frequency (TF) representation, which is critical for accurate mask or spectrum prediction, is usually not explicitly considered in the modeling process. In this paper, we propose a simple yet effective TF attention (TFA) module, which makes it possible to explicitly introduce prior knowledge about the distributional properties of speech into the modeling process. To verify the effectiveness of the proposed TFA module, we use the residual temporal convolutional network (ResTCN) as the base model and conduct experiments separately with two training targets commonly used in speech enhancement: the ideal ratio mask (IRM) [1] and the phase-sensitive mask (PSM) [2]. Our experimental results show that the proposed TFA module significantly improves five commonly used objective evaluation metrics with almost no additional parameters, and that the ResTCN+TFA model consistently outperforms the other baseline models by a large margin.
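For reference, the two training targets follow the standard definitions in [1] and [2]. Writing \( S(l,k) \), \( N(l,k) \), and \( Y(l,k) \) for the STFT coefficients of the clean speech, noise, and noisy mixture at frame \( l \) and frequency bin \( k \):

\[
\mathrm{IRM}(l,k) = \left( \frac{|S(l,k)|^{2}}{|S(l,k)|^{2} + |N(l,k)|^{2}} \right)^{1/2}, \qquad
\mathrm{PSM}(l,k) = \frac{|S(l,k)|}{|Y(l,k)|} \cos\bigl(\theta_{S}(l,k) - \theta_{Y}(l,k)\bigr),
\]

where \( \theta_{S} \) and \( \theta_{Y} \) are the phases of the clean and noisy speech, respectively; in practice the PSM is typically truncated to the range \([0, 1]\).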
3. Method analysis
Figure 1 shows the network structure of the proposed TFA module, where the TA and FA modules are marked with black and blue dashed boxes, respectively. AvgPool and Conv1D denote the average pooling and 1-D convolution operations, respectively. ⊗ and ⊙ denote matrix multiplication and element-wise multiplication, respectively.
Figure 1. Network structure of the proposed TFA module
The TFA module takes the transformed time-frequency representation \( Y \in \mathbb{R}^{L \times d_{model}} \) as input and uses two independent branches to generate a 1-D time-frame attention map \( T_{A} \in \mathbb{R}^{L \times 1} \) and a 1-D frequency-dimension attention map \( F_{A} \in \mathbb{R}^{1 \times d_{model}} \). The two maps are then fused by matrix multiplication into the desired 2-D TF attention map \( TF_{A} = T_{A} \otimes F_{A} \in \mathbb{R}^{L \times d_{model}} \), and the final result can be written as \( \widetilde{Y} = Y \odot TF_{A} \).
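As a concrete illustration, the sketch below implements the data flow just described in PyTorch: average pooling along one dimension, a 1-D convolution, a sigmoid to form each attention map, an outer product (⊗) fusing \( T_{A} \) and \( F_{A} \) into \( TF_{A} \), and element-wise multiplication (⊙) with the input. The kernel size, the single convolution layer per branch, and the sigmoid gating are assumptions for illustration only; the paper's exact hyper-parameters may differ.

```python
import torch
import torch.nn as nn


class TFA(nn.Module):
    """Sketch of the TF attention (TFA) module (structure per Figure 1; details assumed)."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        # TA branch: models context along the L time frames.
        self.ta_conv = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2)
        # FA branch: models context along the d_model frequency bins.
        self.fa_conv = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2)
        self.sigmoid = nn.Sigmoid()

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # y: (batch, L, d_model) time-frequency representation Y
        # TA branch -> T_A in R^{L x 1}
        t = y.mean(dim=2, keepdim=True)          # AvgPool over frequency: (batch, L, 1)
        t = self.ta_conv(t.transpose(1, 2))      # Conv1D along time:      (batch, 1, L)
        t_a = self.sigmoid(t).transpose(1, 2)    # attention map:          (batch, L, 1)

        # FA branch -> F_A in R^{1 x d_model}
        f = y.mean(dim=1, keepdim=True)          # AvgPool over time:      (batch, 1, d_model)
        f_a = self.sigmoid(self.fa_conv(f))      # Conv1D + sigmoid:       (batch, 1, d_model)

        # Fuse: TF_A = T_A ⊗ F_A, then gate the input: Y_tilde = Y ⊙ TF_A
        tf_a = torch.matmul(t_a, f_a)            # (batch, L, d_model)
        return y * tf_a


# Hypothetical shapes: 4 utterances, 100 frames, 257 frequency bins.
tfa = TFA()
y = torch.randn(4, 100, 257)
y_tilde = tfa(y)                                 # same shape as the input
```

Because each branch only applies pooling and a tiny 1-D convolution, the attention maps add a negligible number of parameters relative to the base model, which is consistent with the "almost no additional parameters" claim above.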
4. Experimental results
Training error curves
Figures 2 and 3 show the training and validation error curves for each model over 150 epochs of training. It can be seen that ResTCN with the proposed TFA (ResTCN+TFA) produces significantly lower training and validation errors than ResTCN, which confirms the effectiveness of the TFA module. Meanwhile, compared with ResTCN+SA and MHANet, ResTCN+TFA achieves the lowest training and validation errors and shows a clear advantage. Among the three baseline models, MHANet performs best, and ResTCN+SA outperforms ResTCN. Furthermore, the comparison among ResTCN, ResTCN+FA, and ResTCN+TA demonstrates the efficacy of the TA and FA modules.
Figure 2. Training error curve under the IRM training target
Figure 3. Training error curve under the PSM training target
Evaluation of Objective Metrics for Speech Enhancement
We use five metrics to evaluate enhancement performance: wideband perceptual evaluation of speech quality (PESQ) [3], extended short-time objective intelligibility (ESTOI) [4], and three composite metrics [5], namely the mean opinion score (MOS) predictors of signal distortion (CSIG), background-noise intrusiveness (CBAK), and overall signal quality (COVL).
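As a side note, PESQ and ESTOI scores for an enhanced utterance can be reproduced with common open-source tools. The snippet below is a minimal sketch assuming the third-party `pesq`, `pystoi`, and `soundfile` Python packages and hypothetical file names; none of these are prescribed by the paper.

```python
import soundfile as sf
from pesq import pesq      # wideband PESQ (ITU-T P.862.2)
from pystoi import stoi    # STOI / ESTOI

# Hypothetical file names; a 16 kHz sampling rate is assumed for wideband PESQ.
clean, fs = sf.read("clean.wav")
enhanced, _ = sf.read("enhanced.wav")

pesq_score = pesq(fs, clean, enhanced, "wb")            # wideband PESQ
estoi_score = stoi(clean, enhanced, fs, extended=True)  # extended=True -> ESTOI

print(f"PESQ:  {pesq_score:.2f}")
print(f"ESTOI: {estoi_score * 100:.2f} %")
```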
Tables 1 and 2 show the average PESQ and ESTOI scores at each SNR level (over four noise sources), respectively. The results show that the proposed ResTCN+TFA consistently achieves significant improvements over ResTCN in PESQ and ESTOI under both IRM and PSM with a negligible parameter increase, which demonstrates the effectiveness of the TFA module. Specifically, at 5 dB SNR, compared with the ResTCN baseline, ResTCN+TFA under the IRM training target improves PESQ by 0.18 and ESTOI by 4.94%. Compared with MHANet and ResTCN+SA, ResTCN+TFA performs best in all cases and shows a clear performance advantage. Among the three baseline models, the overall ranking is MHANet > ResTCN+SA > ResTCN. ResTCN+FA and ResTCN+TA also show considerable improvements over ResTCN, which further confirms the effectiveness of the FA and TA modules.
Table 3 lists the average CSIG, CBAK, and COVL scores over all test conditions. Consistent with the trends observed in Tables 1 and 2, the proposed ResTCN+TFA significantly outperforms ResTCN on all three metrics and outperforms all other models. Specifically, compared with ResTCN, ResTCN+TFA under the PSM training target improves CSIG by 0.21, CBAK by 0.12, and COVL by 0.18.
About the Alibaba Cloud Video Cloud Audio Technology Team
The Alibaba Cloud Video Cloud Audio Technology Team focuses on end-to-end audio technologies spanning capture, playback, analysis, processing, and transmission, serving real-time communication, live streaming, video on demand, media production, media processing, and long- and short-form video. By combining neural networks with traditional signal processing, the team continues to refine industry-leading 3A technology, works deeply on device management, adaptation, and QoS technology, and keeps improving the live-streaming and real-time audio communication experience across scenarios.
References
[1] Y. Wang, A. Narayanan, and D. Wang, "On training targets for supervised speech separation," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 22, no. 12, pp. 1849–1858, 2014.
[2] H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux, "Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks," in Proc. ICASSP, 2015, pp. 708–712.
[3] ITU-T Recommendation P.862.2, "Wideband extension to Recommendation P.862 for the assessment of wideband telephone networks and speech codecs," International Telecommunication Union.
[4] J. Jensen and C. H. Taal, "An algorithm for predicting the intelligibility of speech masked by modulated noise maskers," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 24, no. 11, pp. 2009–2022, 2016.
[5] Y. Hu and P. C. Loizou, "Evaluation of objective quality measures for speech enhancement," IEEE Trans. Audio, Speech, Lang. Process., vol. 16, no. 1, pp. 229–238, 2008.
"Video Cloud Technology", your most noteworthy public account of audio and video technology, pushes practical technical articles from the frontline of Alibaba Cloud every week, where you can communicate with first-class engineers in the audio and video field. Reply to [Technology] in the background of the official account, you can join the Alibaba Cloud video cloud product technology exchange group, discuss audio and video technology with industry leaders, and obtain more latest industry information.