
The rise of online conferences, online education, e-commerce live streaming and other scenarios has pushed real-time interactive technology from behind the scenes to center stage, attracting more and more attention. A range of RTE-related technologies, such as codecs, network transmission, and computer vision, are showing renewed vitality. In 2021, powered by deep learning, 5G and other technologies, what new possibilities will RTE give rise to?

The Agora developer community and InfoQ jointly planned this series, inviting a number of technical experts from the Agora developer community to write from the perspectives of video transmission, computer vision, codec standards, WebRTC, machine learning, audio technology, and more. The "2021 Real-time Interactive Technology Outlook Series" offers a glimpse of emerging technology trends. This article is based on an interview with Chen Ruofei, Director of Audio Experience and Engineering at Agora. The series is jointly planned by the Agora developer community and first published on InfoQ.

There are many details in audio technology that affect the real-time interactive experience. As technologies and application scenarios evolve, audio is also being combined with more disciplines and techniques. In real-time interactive scenarios, what factors affect the audio experience? Compared with video technology, has audio technology developed more slowly? What changes does audio technology need to make for RTC scenarios? To answer these questions, we interviewed Chen Ruofei, Director of Audio Experience and Engineering at Agora, and asked him to talk about the changes and opportunities for audio technology in real-time interactive scenarios.

Q: Compared with engineers working on network architecture or the front end, there are relatively few engineers working on audio. Specifically, what technologies do audio engineers study?

Chen Ruofei: Sound is a carrier of information and emotion, so audio research essentially focuses on how to transmit, perceive, and understand information and emotion better. The audio field is highly specialized and subdivided, with many research directions and a wide range of interdisciplinary subjects. From the perspective of the interacting parties, it can be divided into two categories: human-computer interaction audio and human-to-human interaction audio. From the perspective of real-time requirements, it can be divided into real-time and non-real-time interaction. Human-computer interaction mainly studies how to make machines understand and generate sound better, accomplishing the tasks humans want machines to complete through technologies such as ASR, MIR, and TTS. Human-to-human interaction is more closely tied to the human perceptual system, and its optimization goal revolves around how to make people perceive audio better. Real-time human-to-human audio interaction adds further constraints on top of this: optimization must be carried out with lower delay, less computation, and within a causal system. Agora, where I work, focuses on research in real-time interactive audio, so we study how to deliver the best possible interactive audio experience with the lowest possible delay and computation across the whole link of capture and playback, encoding and decoding, pre- and post-processing, and transmission.
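To give a feel for why every stage of that link must stay within a tight latency budget, here is a back-of-the-envelope mouth-to-ear breakdown in Python. The individual numbers are illustrative assumptions for this sketch, not Agora's measurements or targets.

```python
"""Back-of-the-envelope mouth-to-ear latency budget for a real-time audio link.
All stage durations below are illustrative assumptions."""

budget_ms = {
    "capture + ADC buffering":        10,
    "pre-processing (3A)":            10,
    "encoding (one 20 ms frame)":     20,
    "network + jitter buffer":        60,
    "decoding + post-processing":     10,
    "playout buffering + DAC":        20,
}

for stage, ms in budget_ms.items():
    print(f"{stage:<32}{ms:>4} ms")
print(f"{'mouth-to-ear total':<32}{sum(budget_ms.values()):>4} ms")
# Interactive conversation generally needs the total to stay well under ~400 ms,
# which is why each stage is optimized for delay, not only for quality.
```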

Q: Before we talk about technical changes, let's first sort out the concepts. In real-time interactive scenarios, what factors affect the audio experience?

Chen Ruofei: Real-time interactive audio is an end-to-end, mouth-to-ear experience, so every component on the link can affect it. We can break down the impact on the audio experience into five aspects: capture, playback, pre- and post-processing, encoding and decoding, and transmission.

First, capture. Differences in the acoustic properties of microphones, from pickup distance and directionality to accuracy, have a decisive influence on the audio experience. The picked-up signal then undergoes analog-to-digital conversion, and sampling also introduces loss: the higher the sampling rate, the better the sound details are preserved. A high-quality microphone therefore provides a better source from the very start, and likewise a high-quality playback device preserves more sound detail at the other end.

Pre- and post-processing is another very important part of the audio link; the "3A" technologies (acoustic echo cancellation, noise suppression, and automatic gain control) that people often hear about belong to this category. Pre- and post-processing performs secondary processing on the captured signal or the signal about to be played, filtering out interference such as echo, noise, and howling while enhancing the volume and perceived quality of the target audio. On top of this, sound-effect features such as voice changing and voice beautification are also achieved by processing the signal.

Finally, codec and transmission, which are strongly coupled. In principle, the higher the sampling rate and bit rate of the encoding, the better the fidelity and the listening experience. In reality, however, network bandwidth is limited, and adverse conditions such as packet loss and jitter occur frequently. A good codec algorithm, through a deep understanding of acoustic models and information redundancy, can preserve high-quality sound at a relatively low bit rate and keep performance stable under various weak-network conditions. At the same time, we need to develop source- and channel-level countermeasures for weak networks to reduce the audible impact of packet loss and jitter while keeping latency low.
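To make the jitter and packet-loss point concrete, here is a minimal, illustrative jitter-buffer sketch in Python. It is not Agora's implementation: it simply buffers a few frames before playout and conceals missing frames by repeating the last good one, a deliberately naive stand-in for real packet-loss concealment.

```python
"""Minimal jitter-buffer sketch (illustrative only).

Assumptions: 20 ms audio frames at 16 kHz, 16-bit mono, each carrying an
RTP-like sequence number; frames may arrive out of order or be lost."""
import heapq

class JitterBuffer:
    def __init__(self, target_depth=3):
        self.target_depth = target_depth          # frames to hold before playout starts
        self.heap = []                            # min-heap of (seq, payload)
        self.next_seq = None                      # next sequence number to play
        self.last_frame = b"\x00" * 640           # 20 ms of silence (16 kHz, 16-bit mono)

    def push(self, seq, payload):
        """Insert a received frame; frames already played past are dropped."""
        if self.next_seq is not None and seq < self.next_seq:
            return
        heapq.heappush(self.heap, (seq, payload))

    def pop(self):
        """Return the next frame to play, concealing gaps by repetition."""
        if self.next_seq is None:
            if len(self.heap) < self.target_depth:
                return self.last_frame            # still pre-buffering: play silence
            self.next_seq = self.heap[0][0]
        while self.heap and self.heap[0][0] < self.next_seq:
            heapq.heappop(self.heap)              # discard duplicates / stale frames
        if self.heap and self.heap[0][0] == self.next_seq:
            _, payload = heapq.heappop(self.heap)
            self.last_frame = payload
        else:
            payload = self.last_frame             # lost or late frame: naive concealment
        self.next_seq += 1
        return payload

# Usage sketch: frames 0-5, arriving out of order, with frame 3 lost entirely.
jb = JitterBuffer(target_depth=2)
for seq, frame in [(1, b"f1"), (0, b"f0"), (2, b"f2"), (4, b"f4"), (5, b"f5")]:
    jb.push(seq, frame)
print([jb.pop() for _ in range(6)])               # frame 3 is replaced by the previous frame
```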

Q: There is a view in the industry that, compared with video technology, audio technology seems to have developed more slowly. What do you think of the current state of audio technology?

Chen Ruofei: Technological progress is driven by demand. Audio technology in the telephone era went through a period of intense development, and classic theories such as linear prediction and adaptive filtering solved the basic problems of making voice communication usable; many of those techniques are still in use today. Over recent decades, VoIP technology has also developed considerably. The growing share of communication minutes carried over VoIP today is inseparable from the long-term, solid work and continuous progress of audio researchers. Audio has a high technical threshold, the weakest-link effect across the whole link is pronounced, device coupling is severe and fragmented, and subjective improvements are not easily perceived. These factors mean that audio work demands the patience to toil in obscurity and a commitment to long-termism.
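As an illustration of the classic adaptive-filtering methods mentioned above, here is a toy NLMS (normalized least mean squares) filter, the textbook building block behind acoustic echo cancellation. This is a sketch under simplifying assumptions, not a production AEC, which would also need delay estimation, double-talk detection, and nonlinear residual suppression.

```python
"""Toy NLMS adaptive filter for echo cancellation (illustrative sketch)."""
import numpy as np

def nlms_echo_cancel(far_end, mic, taps=128, mu=0.5, eps=1e-6):
    """Adaptively estimate the echo path from far_end to mic and
    return the error signal (mic with the linear echo removed)."""
    w = np.zeros(taps)                          # adaptive FIR estimate of the echo path
    out = np.zeros_like(mic)
    for n in range(taps - 1, len(mic)):
        x = far_end[n - taps + 1:n + 1][::-1]   # current + past reference samples
        y_hat = w @ x                           # predicted echo
        e = mic[n] - y_hat                      # residual = near-end signal + misadjustment
        w += mu * e * x / (x @ x + eps)         # normalized LMS update
        out[n] = e
    return out

# Tiny sanity check: mic is the far-end signal passed through a short random "room".
rng = np.random.default_rng(0)
far = rng.standard_normal(16000)
room = rng.standard_normal(64) * 0.1
mic = np.convolve(far, room)[:16000]
res = nlms_echo_cancel(far, mic)
print("echo energy before / after adaptation:",
      float(np.mean(mic**2)), float(np.mean(res[2000:]**2)))
```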

The rise of AI in recent years has injected new vitality into audio and offered new ideas for many long-unsolved problems. Human-computer voice interaction has become a new hotspot in the audio field, and related technologies are booming; considerable progress has been made in recognition and synthesis. In recent years we have also seen many practical results from combining AI with RTC, and people now see huge room for further improving the audio experience. From the perspective of the external environment, after the wave of video live streaming, more and more people are starting to enjoy audio-based social interaction, which carries less psychological burden and leaves more room for imagination, and a new wave has recently begun to appear in the industry. I believe that with this combination of internal and external factors, more people will start working on the real-time interactive audio experience, and I very much look forward to the new experiences this industry will bring.

Q: From a practical point of view, what technical challenges remain in the real-time audio field?

Chen Ruofei: There are still many technical challenges to overcome in real-time interactive audio; let me mention two big ones. First, fragmentation. Traditional phone manufacturers tune their algorithms device by device and pass acoustic tests one model at a time. If we want to provide a consistently high-quality audio experience across different devices, environments, and network conditions, we need to find new breakthroughs. In the coming era of the Internet of Everything, this demand will only grow stronger, and technological breakthroughs here will bring tremendous value. Second, subjectivity. Audio experience is highly subjective, and people's perceptions and preferences differ widely. We need better ways to match these personalized preferences and a better quantitative evaluation system.

Q: Based on your observations of industry and academia, what changes does audio technology need to make next for RTC scenarios? (For example, in the combination of algorithms and technologies.)

Chen Ruofei: I think the future of real-time interactive audio has three parts. First, the deep integration of AI and signal processing. Classic signal processing and acoustic models already help us solve many problems, but many problems remain poorly solved. Integrated effectively, AI can supplement the shortcomings of traditional algorithms and solve our problems better at a reasonable cost, rather than being treated as a panacea. Second, evaluation standards that fit the times. Many current audio standards were designed for telephony; a truly engaging interactive experience needs corresponding evaluation standards, and how to better evaluate interactivity and immersion is something we need to explore. Third, a real sense of immersion and companionship. People are no longer satisfied with pure information exchange and are pursuing face-to-face interactive experiences and emotional companionship. As networks and devices mature further, this kind of future becomes possible. The entire audio link needs to be upgraded, from sound-field capture to reproduction, and even augmented reality, to create a truly immersive experience. This will be a long road of exploration. At Agora we have been committed to tackling these long-standing industry problems, and we welcome anyone with ideas and ambition to get in touch, exchange and explore together, and knock on the door of the future of audio with us.

Related reading in this series

2021 Technology Outlook | Real-time generation technology towards the future

2021 Technology Outlook | Extreme real-time video communication under weak network

2021 Technology Outlook | 5G will force transmission protocols and algorithms to make more improvements

2021 Technology Outlook | AV1 Status and Outlook in RTC Application Practice

