Today I mainly want to talk about the quality of real-time voice. I will go over some of the existing methods in this field, and then some of the things I want to do in the future.

Voice quality assessment method

First, a brief introduction to speech quality evaluation. Methods fall into two groups: subjective and objective. Subjective evaluation is based on human perception, and there are two ways to do it. In the first, I give you no original reference signal at all: I only give you a piece of speech, you listen to it, and afterwards you tell me what score you think it deserves. In the second, I give you an anchor, tell you that this is the worst case, and ask you to rate the speech relative to it. This second approach is also the subjective method used most often in current papers.

Objective evaluation method

Objective evaluation methods are divided into methods with reference and methods without reference, depending on whether the original, undamaged reference signal is needed. The earliest standard, P.861, dates from about 1996. It was the first to propose the basic approach: take a clean signal and a degraded version of it, compare them for similarity or for audible impairment, and output a score. P.862 (PESQ) followed around 2001, and around 2004 came PESQ-WB, which extended PESQ's test range from 8 kHz to 16 kHz sampling. PESQ-WB is what we generally use today, and many papers, for example on noise reduction, lossless processing, and so on, still evaluate with it. Around 2011 the ITU released a new standard, P.863. This POLQA method is essentially an upgraded PESQ with some improvements in handling noise suppression, and its accuracy is quite high. Accuracy here means agreement with subjective listening: the closer the POLQA result is to the score given by human listeners, the more accurate the measurement.

Objective evaluation methods with reference

  • P.861 PSQM: the earliest standard
  • P.862 PESQ and PESQ-WB: the most widely used evaluation methods with reference
  • P.863 POLQA: the latest evaluation method with reference

img
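To make the "with reference" idea concrete, here is a minimal sketch that compares a clean signal with a degraded copy and returns one number. It is not PSQM, PESQ, or POLQA, which rely on perceptual models; it is plain segmental SNR, swapped in only because it is short. The function name, the 160-sample frame, and the clamping range are assumptions made for illustration.

```python
import math

def segmental_snr(reference, degraded, frame_len=160, floor_db=-10.0, ceil_db=35.0):
    """Toy full-reference measure: mean per-frame SNR in dB, clamped per frame."""
    n = min(len(reference), len(degraded))
    scores = []
    for start in range(0, n - frame_len + 1, frame_len):
        ref = reference[start:start + frame_len]
        deg = degraded[start:start + frame_len]
        signal = sum(r * r for r in ref)
        noise = sum((r - d) ** 2 for r, d in zip(ref, deg))
        if signal == 0.0:
            continue  # a silent reference frame carries no information
        snr = ceil_db if noise == 0.0 else 10.0 * math.log10(signal / noise)
        scores.append(max(floor_db, min(ceil_db, snr)))
    if not scores:
        raise ValueError("no usable frames")
    return sum(scores) / len(scores)

# Hypothetical test material: a 440 Hz tone at 8 kHz with two levels of interference.
ref = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(1600)]
interference = [math.sin(2 * math.pi * 1234 * t / 8000) for t in range(1600)]
light = [r + 0.05 * i for r, i in zip(ref, interference)]
heavy = [r + 0.30 * i for r, i in zip(ref, interference)]

print(segmental_snr(ref, light), segmental_snr(ref, heavy))
```

The more heavily degraded signal gets the lower value, which is the behaviour any reference-based score needs. The standardized methods differ in how closely that value tracks what listeners actually hear.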

Objective evaluation methods without reference

  1. P.563: the best-known narrowband evaluation method without reference
  2. ANIQUE: according to its authors, more accurate than the reference-based PESQ
  3. E-Model/P.1201: parameter-domain evaluation methods
  4. xxNet: evaluation methods based on deep learning

img

There are actually quite a lot of these. The most commonly used is the ITU method P.563: you only give it a piece of speech, with no original clean signal, and it looks at features such as the integrity of the speech, the noise level, and whether the speech is smooth enough, to judge whether the speech is acceptable. If all of these features look fine, it gives a high score. If some features show large anomalies, for example breaks in the speech or excessive noise, it gives a relatively low score. After P.563 came ANIQUE, a standard from the United States. According to the literature, its accuracy exceeds that of the PESQ method mentioned above.

Then there are the parameter-domain methods. These do not process the speech signal itself; they use state information to make an estimate. Take the E-Model: from capture, through echo, to the whole encoding chain, any module that introduces impairment contributes an impairment factor, and these factors are subtracted from the overall rating. There is also the relatively new P.1201 standard, which covers both audio and video evaluation. Its audio part mainly uses network parameters, the codec, volume parameters, and so on.
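The "subtract impairment factors from the whole" idea can be sketched in a few lines. The version below is a simplified reading of the E-Model, assuming a default overall rating of 93.2 from which a delay impairment and an equipment impairment are subtracted; the function and parameter names are mine, and the full model has many more inputs. The mapping from the rating R to a MOS-style score is the usual E-Model conversion formula.

```python
def r_factor(base_rating=93.2, delay_impairment=0.0,
             equipment_impairment=0.0, advantage=0.0):
    """Simplified E-Model rating: start from a base value, subtract impairments."""
    return base_rating - delay_impairment - equipment_impairment + advantage

def r_to_mos(r):
    """Map the rating R onto a 1.0 to 4.5 MOS-style scale."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6

# Hypothetical impairment values, chosen only to show the direction of the effect.
clean = r_to_mos(r_factor())
impaired = r_to_mos(r_factor(delay_impairment=10.0, equipment_impairment=20.0))
print(round(clean, 2), round(impaired, 2))
```

No audio is touched here, which is why parameter-domain methods are cheap to run online. The cost is that they only know about the impairments they have a parameter for.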

Pain points of objective evaluation methods

  • Methods with reference: can only be used before going online
  • Methods without reference, traditional signal domain: narrow application scenarios and poor robustness
  • Methods without reference, traditional parameter domain: accuracy holds only under a limited set of weak-network conditions
  • Methods without reference, deep learning: limited application scenarios and corpora, and slightly higher complexity

Online offline testing

An online quality perception capability should have high accuracy, wide coverage, low complexity, and strong robustness. In other words, the assessment should be accurate enough, cover most business scenarios, avoid adding much algorithmic complexity, and depend only weakly on the voice content.

Downlink quality evaluation method

The standard downlink path is encoding, transmission, decoding, and playback, so the factors involved are codec performance, network quality, the quality of the weak-network countermeasure algorithms, device playback capability, and so on. We ran a set of tests: across test cases covering multiple weak-network conditions, multiple devices, and multiple modes, the scores from this method compared with the POLQA reference scores have an MAE below 0.1 points, an MSE below 0.01, and a maximum error below 0.15 points. The following figure shows the results for multiple weak-network conditions on one device and mode:

img
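For readers who want to run the same kind of comparison, the three error figures quoted above can be computed as follows. This is a minimal sketch: the function names are mine, and the score lists are placeholders, not the data from our tests.

```python
def error_stats(predicted, reference):
    """Return (MAE, MSE, maximum absolute error) between two score lists."""
    if not predicted or len(predicted) != len(reference):
        raise ValueError("score lists must be non-empty and of equal length")
    errors = [p - r for p, r in zip(predicted, reference)]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    max_err = max(abs(e) for e in errors)
    return mae, mse, max_err

def meets_targets(mae, mse, max_err):
    """Check the thresholds quoted in the text: 0.1, 0.01, and 0.15."""
    return mae < 0.1 and mse < 0.01 and max_err < 0.15

# Placeholder scores for four hypothetical weak-network cases.
online_scores = [4.05, 3.42, 2.61, 1.93]
polqa_scores = [4.00, 3.50, 2.55, 2.00]
print(error_stats(online_scores, polqa_scores))
```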

Uplink quality evaluation method

The uplink contains many modules, and each module is independent, so first of all each module has its own detection capability. Take the echo module: echo may currently be leaking through, and you need to know that. Then, after every module has checked itself, there is a unified detection module before encoding, which acts as a gatekeeper for the whole pipeline. Extracting what all scenarios have in common, we can summarize it in four points:

  • Equipment collection stability
  • Echo cancellation capability
  • Noise suppression
  • Volume adjustment ability

Causes of Echo Leakage

We very much want to know whether echo is currently leaking. The causes of echo leakage generally fall into four categories:

  • Delay jitter: there can be many causes. For example, a thread stalls and the signal is not delivered in time. It may also be that the external amplifier is severely nonlinear, that two devices are in use, or that the system is non-causal, which is usually a buffering problem
  • Large reverberation: the reverberation length exceeds the filter length
  • Captured signal overflow: causes the filter to fail to converge
  • Double-talk: relies heavily on NLP (nonlinear processing)
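One way to see delay jitter is to estimate the delay between the far-end reference and the captured signal block by block and watch whether the estimate moves. The sketch below does this with a brute-force cross-correlation. It is an illustration only, not how our echo module works: the function name, the search range, and the synthetic signals are all assumptions, and a production estimator would be far more robust.

```python
import random

def estimate_delay(far_end, mic, max_delay):
    """Return the lag (in samples) at which mic best correlates with far_end."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_delay + 1):
        n = min(len(far_end), len(mic) - lag)
        score = sum(far_end[i] * mic[i + lag] for i in range(n))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

# Synthetic far-end signal and two captures with different echo-path delays.
rng = random.Random(0)
far_end = [rng.uniform(-1.0, 1.0) for _ in range(400)]
mic_early = [0.0] * 7 + [0.5 * x for x in far_end]
mic_late = [0.0] * 12 + [0.5 * x for x in far_end]

delays = [estimate_delay(far_end, m, 20) for m in (mic_early, mic_late)]
print(delays, "jitter:", max(delays) - min(delays))
```

If the estimate jumps between blocks like this, an adaptive filter that assumed a fixed delay has to reconverge, and echo can leak in the meantime.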

Causes of noise

  • Device noise: single-frequency tones, power-line noise, laptop fan noise, disordered noise
  • Environmental noise: babble, whistles, etc.
  • Signal overflow: popping sounds
  • Introduced by algorithms: residual echo, etc.
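Signal overflow is the easiest of these causes to check directly. A minimal sketch, assuming 16-bit PCM samples held as Python integers: count how many samples sit at full scale. The function name and the `margin` parameter are hypothetical, and any alarm threshold on the ratio would have to be tuned.

```python
def clipping_ratio(samples, full_scale=32767, margin=0):
    """Fraction of 16-bit samples at or beyond (full_scale - margin)."""
    if not samples:
        return 0.0
    limit = full_scale - margin
    clipped = sum(1 for s in samples if abs(s) >= limit)
    return clipped / len(samples)

# Hypothetical frame: two of the four samples are pinned at the int16 limits.
frame = [0, 32767, -32768, 100]
print(clipping_ratio(frame))
```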

Low volume

  • Weak device capture or a quiet speaker: most cases
  • Weak device playback: at the far end
  • Small analog gain or analog boost gain: on PC
  • Small digital gain: two-way gain
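Whatever the cause, low volume shows up as a low signal level, so a basic check is the RMS level of the captured frame in dBFS. The sketch below assumes 16-bit PCM samples; the function names and the -40 dBFS threshold are placeholders for illustration, not values from our system.

```python
import math

def rms_dbfs(samples, full_scale=32768.0):
    """RMS level of a frame relative to digital full scale, in dB."""
    if not samples:
        raise ValueError("empty frame")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")  # digital silence
    return 10.0 * math.log10(mean_square / (full_scale * full_scale))

def is_low_volume(samples, threshold_dbfs=-40.0):
    """Flag a frame whose level falls below an assumed threshold."""
    return rms_dbfs(samples) < threshold_dbfs

quiet_frame = [100] * 160    # roughly -50 dBFS
normal_frame = [16384] * 160  # roughly -6 dBFS
print(is_low_volume(quiet_frame), is_low_volume(normal_frame))
```

A check like this only says that the level is low. Telling a quiet speaker apart from a small analog or digital gain needs the gain settings as well.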

Independent detection module

  • Howling detection: detection and suppression
  • Noise detection: early warning
  • Noise detection: quantifying the impact of noise
  • Hardware detection: estimating the external performance of the device

Future

Integration of perception, feedback and monitoring

  • The internal state is more detailed
  • Experience coverage is wider
  • Feedback is faster
  • More comprehensive coverage
