Hello everyone, I'm Mr. CV. I have been studying voice for a while. Today I will briefly describe how voice quality is evaluated before and after transmission — that is, how we judge the quality of our voice as it passes through microphones and other audio equipment.
In terms of speech quality, we have three overall evaluation approaches: objective evaluation with a reference, objective evaluation without a reference, and subjective evaluation.
Each of these can be subdivided further, and many algorithms and evaluation ideas are in use.
Voice quality is extremely important. It keeps noise out of our everyday chats, makes military communication more reliable, and at every festive season, when you think of your relatives, it lets you relive a long-missed, genuine, warm call with family — their words and their voices.
How did we evaluate it in the past?
Subjective evaluation research can mainly refer to the Chinese communications industry standard YD/T 2309-2011 on subjective audio testing and analysis. That standard was itself developed with reference to the international standards: ITU-T P.800 (methods for subjective determination of transmission quality in telephone systems), ITU-T P.830 (subjective performance assessment of telephone-band and wideband digital codecs), and ITU-T P.805 (subjective evaluation of conversational quality).
Mr. CV found this earlier evaluation method on the official website; it is quite comprehensive.
Figure 1: Test method in YDT2309-2011 standard
Scoring standard
The scoring scale can be 5-point or 7-point. If the scoring values are defined in advance, normalization is not required; otherwise, the scores need to be normalized.
Figure 2: YDT2309-2011 scoring standards
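For instance, if ratings were collected on a 7-point scale, they can be linearly rescaled to the usual 1–5 range. A minimal sketch of min-max normalization — the function name and the rescaling scheme are my own choice, not part of the standard:

```python
import numpy as np

def normalize_scores(scores, lo=1.0, hi=5.0):
    """Linearly rescale raw ratings (e.g. from a 7-point scale) to [lo, hi]."""
    scores = np.asarray(scores, dtype=float)
    smin, smax = scores.min(), scores.max()
    if smax == smin:                       # all ratings identical: nothing to rescale
        return np.full_like(scores, (lo + hi) / 2)
    return lo + (scores - smin) * (hi - lo) / (smax - smin)

print(normalize_scores([1, 4, 7]))         # 7-point scale mapped onto 1..5
```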
Evaluation dimensions
The subjective-evaluation standard lists many evaluation dimensions; in practice, dimensions need to be deleted or added based on the actual product.
Mr. CV believes objective test standards are usually divided into value-based (index) methods and model-based methods. This section first discusses value-based quality metrics; below are some common ones and related experience.
Value-based metrics
Applicable EU Standard 1-65899
Volume testing is a typical example: the test can be run at one or several dynamic levels, and it is among the most widely used audio measurements, exercising the device from different angles.
Objective evaluation: model-based
(1) Background and standards
The earliest voice quality evaluation was based only on the radio index (RxQual). But actual voice passes through many nodes — radio, transmission, switching, routing — and a problem at any link degrades the user's perceived speech quality. Looking at radio indices alone cannot find or localize voice quality problems; therefore, voice quality evaluation based on user perception became the most important standard for evaluating the user's voice quality.
Commonly used methods for evaluating speech quality can be divided into subjective evaluation and objective evaluation. Early evaluation was subjective: people judged speech quality with their own ears after making a call. In 1996, the International Telecommunication Union began work on subjective test methods to investigate and quantify users' listening behavior and perceived voice quality.
MOS points: on a GSM network, a score of 3 sounds noticeably better than a score of 1~
However, in real life it is hard to scale up human listening tests, which is why the International Telecommunication Union standardized objective sound quality testing. Objective evaluation algorithms such as PESQ were released one after another; they take the perceiving listener as the starting point of the method and replace manual listening with a quantitative algorithm that computes an audio quality level. PESQ is the voice quality evaluation algorithm released by the ITU in February 2001 (ITU-T P.862); thanks to its robustness and good correlation with subjective scores, it became the most widely adopted objective voice quality algorithm in end-to-end networks. Its algorithm model (see the figure) works roughly as follows: the reference and degraded signals are level-aligned to a standard listening level, passed through an input filter that models the handset, time-aligned, and then compared in a perceptual domain. Broadly speaking, the bigger the difference between the degraded signal and the reference, the lower the score.
Earlier objective algorithms, such as the MNB-based method, apply only to specific codings and coding types, which limits where they can be used; PESQ (P.862) lifted many of these restrictions. (2) Test method: MOS testing is usually done in one of two setups. Drive-test systems embed the PESQ algorithm module and collect MOS statistics over the wireless network. Alternatively, an audio analyzer is used: its PESQ module plays the reference audio file, captures the degraded signal, and computes a MOS automatically — for example, the Rohde & Schwarz UPV audio analyzer offers such a MOS test (see the figure below).
Figure 9 Rohde & Schwarz audio analyzer with mos test
Summary
Mr. CV has now covered the voice evaluation methods people used in the past, summarized as follows:
Based on subjective judgment:
Subjective evaluation is the most natural and direct: real listeners rate the speech. But it is expensive, hard to reproduce, and the results vary from listener to listener.
Based on objective indices: index-based tests are easy to run and verify, but the indices can look good while the speech still sounds bad — good metrics do not guarantee good perceived quality.
Model-based: instead of isolated indices, the evaluation models the perceptual effect of the whole chain (codec, bit errors, packet loss, filtering, etc.) and maps it to a quality score.
What methods are we using now? Subjectively, MOS, CMOS, and ABX tests are the usual ways to measure perceived quality. Objectively, a widely used measure is the MCD (Mel cepstral distortion) value, computed from the signals themselves. There are also models that detect quality attributes automatically without a reference, such as MOSNet. Methods based on deep learning train a network (for example a CNN) on rated data and predict the score directly from the audio — this is what we are most interested in, and it also makes large-scale comparisons possible, for example across long-form speech.
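The MCD measure mentioned above can be computed directly from two Mel-cepstrum sequences. A minimal sketch, assuming the two sequences are already time-aligned (real evaluations usually align them with DTW first); the function name is my own:

```python
import numpy as np

def mel_cepstral_distortion(mc_ref, mc_syn):
    """MCD in dB between two aligned Mel-cepstrum sequences (frames x coeffs).

    The 0th coefficient (energy) is conventionally excluded, and the frames
    are assumed to correspond one-to-one (no DTW here)."""
    diff = np.asarray(mc_ref)[:, 1:] - np.asarray(mc_syn)[:, 1:]
    per_frame = (10.0 / np.log(10)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(per_frame.mean())
```

Identical sequences give 0 dB; larger values mean the synthesized cepstra drift further from the reference.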
Comparing several metrics
MOS is rated on a 1–5 scale, with 1 = bad and 5 = excellent; the value is provided by human scorers. It has well-known limitations. MOS values reported in different papers are not directly comparable: each test has its own listeners, conditions, and systems, so only scores collected within the same test can be meaningfully compared. Presentation also matters. Google's study published at SSW10, "Evaluating long-form text-to-speech: comparing the ratings of sentences and paragraphs", compared several evaluation setups for synthesized long-form speech and showed that how the audio samples are presented significantly affects the subjects' results — rating a sentence from a long text in isolation, without context, gives different results from rating it together with the surrounding material.
The standard subjective procedure is ACR (Absolute Category Rating), with scores reported as MOS per ITU-T P.800.1; through this procedure, participants rate overall quality on the absolute category scale. As a rule of thumb, a MOS of 4 or higher indicates good quality; if the MOS is below 3.6, many subjects are not satisfied. Typical MOS test requirements: enough samples and listeners; every listener uses the same playback device and volume; and each audio sequence receives the same number of ratings. DCR (Degradation Category Rating) is the complementary procedure: listeners rate degradation relative to a reference rather than absolute quality. Finally, do not report a bare MOS value — report it together with a 95% confidence interval. A MOS counting script is attached below.
Here Mr. CV found a copy of the code; take a look — it is fairly simple, so I won't go through it line by line.
# -*- coding: utf-8 -*-
import math

import numpy as np
import pandas as pd
from scipy.linalg import solve
from scipy.stats import t


def calc_mos(data_path: str):
    """Compute the mean MOS and its 95% confidence interval from a CSV of
    ratings (one axis = sentences/utterances, the other = raters)."""
    data = pd.read_csv(data_path)
    mu = np.mean(data.values)
    var_uw = (data.std(axis=1) ** 2).mean()   # mean variance within each row
    var_su = (data.std(axis=0) ** 2).mean()   # mean variance within each column
    mos_data = np.asarray([x for x in data.values.flatten() if not math.isnan(x)])
    var_swu = mos_data.std() ** 2             # total variance of all ratings

    # Solve for the sentence, worker, and residual variance components.
    x = np.asarray([[0, 1, 1], [1, 0, 1], [1, 1, 1]])
    y = np.asarray([var_uw, var_su, var_swu])
    [var_s, var_w, var_u] = solve(x, y)
    M = min(data.count(axis=0))               # smallest rating count per column
    N = min(data.count(axis=1))               # smallest rating count per row
    var_mu = var_s / M + var_w / N + var_u / (M * N)
    df = min(M, N) - 1
    t_interval = t.ppf(0.975, df, loc=0, scale=1)   # two-sided 95% t quantile
    interval = t_interval * np.sqrt(var_mu)
    print('{}: {} ± {}'.format(data_path, round(float(mu), 3), round(float(interval), 3)))


if __name__ == '__main__':
    data_path = ''   # path to the ratings CSV
    calc_mos(data_path)
Perceptual evaluation of speech quality (PESQ)
Here is an outline of the processing: first, the algorithm aligns the levels of the original and degraded signals to a standard listening level, then passes both through an input filter that models the handset. After filtering, the two signals are time-aligned; the alignment compensates both for linear filtering in the path and for the delay between the two signals (for example, by extracting matching sections from both and estimating the time offset). The perceptual difference is then mapped to a MOS-style score.
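The first two of these steps can be sketched in plain numpy. This is a toy illustration of level alignment and bulk-delay estimation via cross-correlation, not the real P.862 implementation; the function name and the target RMS are my own choices:

```python
import numpy as np

def align(ref, deg, target_rms=0.1):
    """Toy version of PESQ's first steps: scale both signals to a common RMS
    level, then estimate the bulk delay of the degraded signal relative to
    the reference by cross-correlation."""
    ref = np.asarray(ref, float)
    deg = np.asarray(deg, float)
    ref = ref * target_rms / (np.sqrt(np.mean(ref ** 2)) + 1e-12)
    deg = deg * target_rms / (np.sqrt(np.mean(deg ** 2)) + 1e-12)
    corr = np.correlate(deg, ref, mode="full")     # peak marks the bulk delay
    delay = int(np.argmax(corr)) - (len(ref) - 1)
    return ref, deg, delay
```

In the real algorithm the alignment is done per-utterance and per-section, and the filtered, aligned signals are then compared in a perceptual domain.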
Mr. CV will also introduce an algorithm to compare with PESQ: the P.563 algorithm, which is very easy to use.
Objective quality, single-ended method: P.563
PESQ (P.862) needs the original signal as a reference, so it only applies where a reference is available; P.563 is single-ended and works on the degraded signal alone, which makes it more widely usable, although its accuracy is lower than PESQ's. In outline, P.563 has three stages: preprocessing, parameter extraction, and a mapping model. Preprocessing: the input signal is level-aligned, filtered to model a standard handset, and passed through a voice activity detector (VAD). The VAD works on short frames (on the order of 4 ms), comparing frame energy against a dynamically adapted threshold that represents the speech power. To improve VAD accuracy, the raw results are postprocessed: very short detected sections (around 12 ms) are discarded, and speech sections separated by gaps of less than 200 ms are joined. Parameter extraction: from the detected speech, the algorithm extracts a set of key parameters describing the vocal tract and the unnaturalness of the speech, additional noise, and interruptions or clipping. Mapping model: the extracted parameters are combined by a mapping model (in fact essentially a linear one) to compute the final quality value.
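The VAD step can be illustrated with a toy energy detector. The frame length and threshold below are arbitrary choices for the sketch; the real P.563 VAD also adapts its threshold dynamically and merges short segments as described above:

```python
import numpy as np

def simple_vad(x, frame_len=64, thresh_ratio=0.1):
    """Toy energy-based VAD: cut the signal into frames and mark frames whose
    mean energy exceeds a fraction of the overall mean frame energy."""
    n = len(x) // frame_len
    frames = np.asarray(x[: n * frame_len], float).reshape(n, frame_len)
    energy = (frames ** 2).mean(axis=1)
    return energy > thresh_ratio * energy.mean()
```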
The time-domain analysis is handled separately. Among the key parameters is the background noise, expressed as a signal-to-noise ratio (SNR); with heavy background noise the MOS value typically ends up between 1 and 3. The dominant distortion class is then selected, and it drives the final quality estimate, as shown in the figure below:
By now, Mr. CV finds, these algorithms should look quite familiar.
Mapping model of objective evaluation results
In P.563, the mapping model is a straight-line (linear) model: the final score is computed as a linear combination of the extracted parameters, with coefficients defined by the standard.
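Such a straight-line mapping is just a least-squares fit from an objective score to subjective MOS. A minimal sketch with made-up data — these are not the coefficients from the standard:

```python
import numpy as np

def fit_mos_mapping(objective_scores, subjective_mos):
    """Fit the kind of linear mapping used as a final step in objective
    metrics: least-squares fit of MOS ~ a * score + b."""
    a, b = np.polyfit(objective_scores, subjective_mos, deg=1)
    return a, b

# toy calibration data: a perfectly linear relation for illustration
a, b = fit_mos_mapping([1, 2, 3, 4], [1.5, 2.5, 3.5, 4.5])
```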
Next: no-reference methods based on deep learning
NISQA: no-reference speech quality assessment for voice communication networks
Mr. CV will take you through a quick review — this algorithm was introduced in a previous article~
The deep network can perform feature extraction automatically, so this type of method can feed the Mel spectrogram coefficients or MFCCs directly into the model. Mr. CV has to say that the Mel spectrogram works very well.
As shown in Mr. CV's figure above, the complete network is easy to follow: to measure speech quality, the MFCC features are fed into a CNN, and a MOS score is obtained at the output end.
CNN design details are as follows:
The model can be applied to different sound systems; depending on the target — for example a standard TTS model or a communication system — you can choose a single-ended (no-reference) or double-ended (full-reference) variant.
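One design choice in such models is how per-frame quality estimates are pooled over time into one utterance-level score. A toy numpy sketch of softmax-attention pooling, in the spirit of the time-pooling stage of models like NISQA — the CNN that would produce the per-frame scores and attention logits is omitted, and the names are mine:

```python
import numpy as np

def attention_pool(frame_scores, frame_weight_logits):
    """Pool per-frame quality estimates into one utterance-level estimate
    using softmax attention weights: frames with larger logits contribute
    more to the final score."""
    w = np.exp(frame_weight_logits - np.max(frame_weight_logits))
    w /= w.sum()
    return float(np.dot(w, frame_scores))
```

With uniform logits this reduces to plain mean pooling; learned logits let the model down-weight silent or uninformative frames.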
Abstract
Measuring speech quality has been studied for many years. In practice, the methods can be divided into two families: with-reference and no-reference.
noise
Mr. CV will now talk about noise, because it greatly affects quality.
Equipment noise: such as single-frequency sound, notebook fan sound and so on.
Environmental noise: whistle, etc.
Signal overflow (clipping): popping sounds
There is also the low-volume problem, including low playback volume on the device and a speaker talking quietly.
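Clipping and low volume are easy to screen for automatically. A minimal sketch on float audio in [-1, 1] — the thresholds below are arbitrary choices for illustration, not from any standard:

```python
import numpy as np

def signal_checks(x, clip_level=0.99, low_rms_db=-40.0):
    """Flag two of the problems above: clipping (samples pinned near full
    scale) and overall low volume (RMS level below a dB threshold)."""
    x = np.asarray(x, float)
    clipped = np.mean(np.abs(x) >= clip_level) > 0.001   # >0.1% of samples pinned
    rms_db = 20 * np.log10(np.sqrt(np.mean(x ** 2)) + 1e-12)
    return {"clipping": bool(clipped), "too_quiet": bool(rms_db < low_rms_db)}
```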
Solution
There are some suggestions here. Mr. CV thinks dedicated detection methods can be used to accurately detect each of these noise types, and then handle them separately.
This includes training detection models for environmental noise and for hardware noise.
Next, Mr. CV introduces a solution for echo:
Echo cancellation: principle and debugging
This section covers what happens during a call: the far-end voice played by your loudspeaker is picked up again by your microphone and sent back, so the other side hears their own voice delayed — and this is one of the most important factors degrading call quality. Echo is generally divided into acoustic echo and line echo; line echo arises on the wired path and is caused by the 2-wire/4-wire hybrid conversion. Echo cancellation (EC) is a key part of the voice processing chain, and it has two aspects: the cancellation principle and its debugging. 1) Principle: an adaptive filter plus an adaptive algorithm. The filter can be FIR or IIR; FIR is the usual choice for stability. The following figure shows the typical configuration of the adaptive filter.
The figure above shows the adaptive filter driven by the input signal and the error signal. The adaptive algorithm used with the filter is a member of the stochastic-gradient (LMS) family.
2) The echo cancellation process.
The following figure is a block diagram of the basic principle of echo cancellation:
Mr. CV will walk you through the processing: (a) determine the delay of the echo path between the far-end reference and the microphone signal; (b) run the adaptive FIR filter on the far-end input and subtract its output from the microphone signal to obtain the error signal e(n) — e(n) is the echo-cancelled output, and it is also used to update the filter coefficients. The author has debugged this, and the EC debugging steps are roughly as follows. 1) First understand the algorithm: EC is hard to use without solid basics, so read the algorithm code until you can follow it end to end. 2) Build a test application around the algorithm: feed it a far-end file and a near-end (microphone) file, write the EC output to a file, and listen to the result. If the echo is gone, congratulations; otherwise some parameters in the algorithm need to be changed. 3) A useful trick when debugging the delay: take the far-end PCM data, delay it by the configured amount, and feed it back in as the near-end input — with a correct delay the EC output should be essentially silent. Each hardware platform is different: chip vendors ship demo boards, every customer has its own hardware, and the buffering delay changes from device to device. Mobile apps typically measure the delay per phone model during testing and ship a configured delay value; after that, the cancellation works properly.
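The principle in (a)/(b) can be sketched as a textbook NLMS adaptive FIR canceller in numpy. This is a toy version, not production AEC — there is no double-talk detection or delay estimation, and the names are my own:

```python
import numpy as np

def nlms_echo_cancel(far, mic, taps=32, mu=0.5, eps=1e-8):
    """Minimal NLMS adaptive FIR echo canceller: the far-end signal is the
    filter input, the microphone signal is the desired output, and the
    error e(n) is the echo-cancelled signal that also drives adaptation."""
    w = np.zeros(taps)
    err = np.zeros(len(mic))
    for n in range(len(mic)):
        x = far[max(0, n - taps + 1): n + 1][::-1]     # newest sample first
        x = np.pad(x, (0, taps - len(x)))              # zero history at start
        y = np.dot(w, x)                               # echo estimate
        err[n] = mic[n] - y                            # echo-free residual
        w += mu * err[n] * x / (np.dot(x, x) + eps)    # normalized LMS update
    return err
```

If the true echo path (here: a pure delay with attenuation) fits within the filter length, the residual converges toward silence.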
Finally
Speech enhancement against noise, and its evaluation methods
Noise type
Common distortions are:
Additive noise: background sound recorded by the microphone during recording.
Channel effects: the channel's limited response to the signal (the channel impulse response); this convolutional distortion needs to be removed.
Nonlinear distortion: for example, improper signal input gain.
Speech enhancement
Mr. CV has just introduced the noise categories, so we can now apply some targeted solutions. Signal degradation can be divided into the 3 categories above:
Additive noise is superimposed on the desired speech and can mask it. Some additive noise is fixed and some changes over time; for noise components correlated with a reference channel, an adaptive filter can be used, while uncorrelated components cannot simply be identified and removed. Channel effects depend on the position and frequency response of the microphone, the microphone amplifier, and the bandwidth limits of the codec, which shape the signal relative to the expected sound. Processing is done frame by frame: w(n) is the window function, M is the frame shift, N is the window length, and the frames typically overlap by 50%. To reduce spectral leakage, a Hanning window is commonly used. A classic approach is spectral subtraction: the noise spectrum is estimated during non-speech segments and subtracted from each frame's magnitude spectrum. If too little is subtracted, residual noise remains; if too much, speech components are deleted as well, so the subtraction must be kept in balance.
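The spectral-subtraction idea above can be sketched as follows — a toy magnitude-domain version with a Hanning window and 50% overlap. The noise spectrum is estimated from the first few frames, which I assume here to be speech-free, and the over-subtraction factor and spectral floor are arbitrary choices:

```python
import numpy as np

def spectral_subtract(x, noise_frames=5, frame_len=256, over=2.0, floor=0.01):
    """Toy magnitude spectral subtraction: Hanning window, 50% overlap,
    noise spectrum averaged over the first (assumed speech-free) frames,
    over-subtraction with a small spectral floor against musical noise."""
    hop = frame_len // 2
    win = np.hanning(frame_len)
    # estimate the noise magnitude spectrum from the leading frames
    noise = np.mean([np.abs(np.fft.rfft(x[i * hop: i * hop + frame_len] * win))
                     for i in range(noise_frames)], axis=0)
    out = np.zeros(len(x))
    for i in range((len(x) - frame_len) // hop + 1):
        seg = x[i * hop: i * hop + frame_len] * win
        spec = np.fft.rfft(seg)
        # subtract the noise magnitude, keep a small floor, reuse the phase
        mag = np.maximum(np.abs(spec) - over * noise, floor * np.abs(spec))
        out[i * hop: i * hop + frame_len] += np.fft.irfft(mag * np.exp(1j * np.angle(spec)), frame_len)
    return out
```

On a noise-only input the output energy drops sharply; on real speech the balance between `over` and `floor` decides how much residual noise versus distortion you get.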
Summary
This article is very long, but very meaningful. It summarizes how the quality of transmitted voice — before and after the voice codec — has been evaluated in past years and today. In addition, it puts forward solutions for several kinds of noise so that we can solve these problems better.
If you are interested in this article, you might as well take a look at another article I wrote on InfoQ on sound-network algorithms and noise-related solutions; for reasons of length, I will integrate it and introduce it next time. It also covers approaches using reinforcement learning, generative adversarial methods, and other particularly strong problem-solving techniques, which can be analyzed in detail later.