Looking back on the development of the Internet, from the PC-LAN era to the mobile Internet, the immersiveness of Internet use has gradually increased and the distance between the virtual and the real has gradually shrunk. Immersive audio and communication technology will greatly enhance the user experience in the future. A Metaverse that bridges the virtual and the real places high demands on immersion, participation, and sustainability, and will therefore need to be supported by many independent tools, platforms, infrastructures, protocols, and so on. As AR, VR, 5G, cloud computing, and other technologies mature, communication technology based on immersive audio is expected to move gradually from concept to reality in the Metaverse.

In this article, we explore together with industry partners the impact of Metaverse-related technology on the communications industry, the future development trend of immersive audio, and the application of communication technology in the VR, AR, and AI industries.

A brief introduction to the concept of the Metaverse
The Metaverse refers to a virtual world parallel to real life, with almost no difference in experience. Using virtual identities, people can work, socialize, play games, and even buy and sell in this virtual world. In short, in the Metaverse you can do whatever you can imagine; boundless imagination gives you unlimited freedom.

The Metaverse creates a virtual digital "second world" independent of the real world, in which users can live freely with digital identities. VR, AR, and AI, as the technical foundation of the Metaverse, will usher in a period of rapid growth. The global market size of the virtual reality industry in 2020 was approximately RMB 90 billion, and the average annual growth rate from 2020 to 2024 is expected to be about 54%. According to the China Academy of Information and Communications Technology, global shipments of VR devices will accelerate from 2021 and are expected to reach 75 million units in 2024 (data source: Tianfeng Securities, "Metaverse Research Report"). As the VR industry chain gradually matures, VR's empowerment of the industry will show a strong flywheel effect.

So how can we move, step by step, from the real world into the Metaverse?

The dimensions of realism
The realism users experience in a Metaverse scene can be divided into two dimensions: "immersion" and "degree of freedom". The origin of both axes is the unmediated perception of reality, that is, you, reading this article right now. The depth of immersion and the degree of freedom together determine whether the user experience in the Metaverse feels real enough.

Levels of realism

Lv1: The initial step from unmediated perception toward a virtual world
Lv2: A virtual world that the brain partially perceives as real
Lv3: A virtual world so convincing that it completely deceives the brain
Max: A virtual world with the same depth as the original world

The current development trend of the Metaverse
At this stage, most capabilities in the Metaverse industry chain, such as interactive experience and human-computer interaction, sit between Lv1 and Lv2, and only a small number of cutting-edge companies are moving toward Lv3. How, and whether, the Max level can eventually be reached remains unknown.

The Lv1-Lv2 industry chain has matured steadily, and applications such as 3D motion-sensing movies, open sandbox games, and VR, AR, and MR games have already been implemented.

If the Lv2 user experience is a "semi-real" experience built up from a few immersion or freedom factors, then the upgrade to the "fully real" experience of Lv3 is a qualitative leap: "immersion" and "freedom" must both be deep enough and must reinforce each other. Can digital visual and auditory perception completely fool our brains? Can the 3D engine provide a sufficiently free experience? Can AI be sustainable and grow on its own? Can network transmission become effectively latency-free? If any one of these factors falls short, a "fully real" user experience cannot truly be achieved. It is clear that the difficulty of realization rises sharply when moving from "semi-real" to "fully real".

Beyond Lv3, the next stage of the Metaverse is its ultimate goal: allowing human consciousness to live permanently in the virtual world. Whether this can be achieved depends not only on hardware, software, communications, and other technological factors, but also on biology and medicine, and it remains unknown for now.

The progress of leading manufacturers

1. Facebook
In September 2020, at the Facebook Connect 2020 conference, Facebook released 15 important strategic plans for AR/VR. The AR/VR announcements made at the conference covered the latest hardware products, software products, solutions, developer services, and cutting-edge technology research.

Among them, the VR headset Oculus Quest 2, backed by the games and software support provided by the platform, has become the mainstream VR headset on the market.

Also worth noting is Project Aria, released at the conference: a research device built by Facebook to help researchers understand the software and hardware required for AR glasses. It uses sensors to capture video and audio from the wearer's perspective, calculates position via GPS, and records multi-channel audio.

2. Apple
The well-known American technology blog Scobleizer predicts that Apple's product plans announced within the next year will include a new AR/VR headset. More specifically, Apple plans to launch a variety of products over the next ten years, including AR/VR glasses and AR/VR contact lenses (expected between 2022 and 2025, respectively). This means Apple will have to move from 2D screens, interfaces, and experiences to 3D forms.

Scobleizer says that Apple's AR/VR headset will cover both the user's eyes and ears; once it is on, you can neither see the surrounding environment nor hear the surrounding sound directly. In other words, a major feature of Apple's AR/VR headset is combined visual and auditory immersion. What is interesting is that it does not completely isolate the user from the outside world: through an AR passthrough function, you can still perceive your surroundings. After the headset is turned on, you can see a virtual image of the surrounding environment and hear the surrounding sounds.

Also worth attention is Apple's in-car surround audio technology. According to Scobleizer, it can create surround sound effects in spaces such as a car interior or a home, and with the LiDAR module of Apple's AR/VR headset, 3D audio can be positioned in space. Based on his own experience, he says the technology can simulate the audio effect of being at the scene in person.

Current status of RTC communication technology
RTC audio transmission technology converts analog signals into digital signals through sampling, quantization, encoding, and compression. At present, two-channel sampling is common, that is, the left and right channels of stereo. After compression, transmission takes up little bandwidth, which meets the transmission-efficiency needs of most current business scenarios. With the advent of 5G, network bandwidth is no longer a bottleneck, and on top of guaranteed transmission efficiency, people will pursue a 3D immersive audio experience. Two-channel sampling will no longer meet future needs; multi-channel acquisition and transmission (for example, an Ambisonics microphone that captures 4 channels with a tetrahedral array) may become the mainstream of future communication technology.
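To make the channel counts concrete, here is a minimal Python sketch of how a mono source can be encoded into first-order Ambisonics (B-format), the kind of 4-channel signal a tetrahedral Ambisonics microphone is designed to produce. The function name and the traditional W/X/Y/Z convention are illustrative choices for this article, not any specific SDK API.

```python
import numpy as np

def encode_foa(mono, azimuth_deg, elevation_deg):
    """Encode a mono signal into first-order Ambisonics (B-format: W, X, Y, Z).

    mono: 1-D array of samples for a single source.
    azimuth_deg / elevation_deg: direction of the source relative to the listener.
    Returns a (4, N) array -- four channels instead of the usual stereo pair.
    """
    az = np.radians(azimuth_deg)
    el = np.radians(elevation_deg)
    w = mono * (1.0 / np.sqrt(2.0))        # omnidirectional component
    x = mono * np.cos(az) * np.cos(el)     # front-back component
    y = mono * np.sin(az) * np.cos(el)     # left-right component
    z = mono * np.sin(el)                  # up-down component
    return np.stack([w, x, y, z])

# Example: a 1 kHz tone placed 45 degrees to the left of the listener, at ear height.
sr = 48_000
t = np.arange(sr) / sr
foa = encode_foa(np.sin(2 * np.pi * 1000 * t), azimuth_deg=45, elevation_deg=0)
print(foa.shape)  # (4, 48000) -- the 4 channels an Ambisonics pipeline would transmit
```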

Besides the methods above, are there other ways to give users an immersive audio experience? Let's look at the immersive audio technologies that are mature today.

Immersive audio technology
Currently, immersive audio is mainly divided into three categories: channel-based audio (CBA), object-based audio (OBA), and scene-based audio (SBA). Scene-based audio is mainly used to describe the sound field of a scene, and its core underlying algorithm is Higher-Order Ambisonics (HOA).
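One practical consequence of HOA is that the channel count grows quadratically with the Ambisonics order, which is why higher orders quickly become expensive to transmit. The quick calculation below shows the standard full-sphere channel count as a small Python helper.

```python
def hoa_channel_count(order: int) -> int:
    """Number of audio channels needed for a full-sphere HOA sound field of a given order."""
    return (order + 1) ** 2

for n in range(1, 5):
    print(f"order {n}: {hoa_channel_count(n)} channels")
# order 1: 4, order 2: 9, order 3: 16, order 4: 25
```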

According to analyses by industry experts, professional VR audio will mainly follow two major trends: object-based audio and Ambisonics (HOA).

So, in what VR social scenarios can VR audio technology be applied?

Correspondence with social scenarios
At the current stage of the Metaverse's development, social scenarios mainly exist in VR games, VR live streaming, and VR social software.

Object-based audio involves a large amount of data and computation: in addition to the audio channels themselves, it carries metadata about each sound source, such as the source itself (position, size, velocity, shape, etc.) and the environment it sits in (reverberation, reflections/echo, attenuation, geometry). It is therefore better suited to games running on a VR host; a sketch of such metadata follows below.
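As an illustration of the kind of per-source metadata described above, here is a small Python sketch. The field names and the inverse-distance attenuation model are hypothetical simplifications for this article; real object-based formats (such as ADM or MPEG-H) define their own, much richer schemas.

```python
from dataclasses import dataclass

@dataclass
class AudioObject:
    """Illustrative metadata that travels alongside an object-based audio stream."""
    source_id: int
    position: tuple        # (x, y, z) in metres, relative to the listener
    velocity: tuple        # (vx, vy, vz) in m/s, useful for Doppler effects
    size: float            # apparent source radius in metres
    reverb_send: float     # 0.0-1.0 amount fed to the room reverb
    min_distance: float = 1.0

    def distance_gain(self) -> float:
        """A simple inverse-distance attenuation the renderer might apply per frame."""
        d = max(sum(c * c for c in self.position) ** 0.5, self.min_distance)
        return self.min_distance / d

# Example: footsteps 5 m away from the listener.
footsteps = AudioObject(source_id=7, position=(3.0, 4.0, 0.0),
                        velocity=(0.0, -1.0, 0.0), size=0.2, reverb_send=0.3)
print(round(footsteps.distance_gain(), 2))  # 0.2
```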

The characteristic of Ambisonics is that sound sources are attached to a pre-rendered panoramic sphere, so players may not be able to place a source wherever they want in the scene; even when a source exists, it is effectively compressed onto the sphere. This makes it well suited to mobile devices and streaming video.

How to use immersive audio and communication technology to enhance future experience
Given the analysis above, how can RTC audio transmission technology be used to give users an immersive audio experience?

1. Directly transmit audio in immersive format
Using Ambisonics, sound capture and processing are handled by the app or the VR sound engine, and the RTC channel is responsible only for transmission, as sketched below.
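A minimal sketch of what "the RTC channel is only responsible for transmission" could look like on the sending side, assuming the RTC SDK in use exposes some custom audio-input hook. The `push_external_audio_frame` call and the session object are placeholders for that hook, not a specific vendor API.

```python
import numpy as np

FRAME_MS = 20
SAMPLE_RATE = 48_000
CHANNELS = 4  # first-order Ambisonics: W, X, Y, Z
SAMPLES_PER_FRAME = SAMPLE_RATE * FRAME_MS // 1000  # 960 samples per 20 ms frame

def send_foa_stream(foa: np.ndarray, rtc_session) -> None:
    """Split a (4, N) float B-format buffer into 20 ms interleaved PCM frames and
    hand them to the RTC layer, which treats them as opaque multi-channel audio."""
    pcm = (np.clip(foa, -1.0, 1.0) * 32767).astype(np.int16)
    for start in range(0, pcm.shape[1] - SAMPLES_PER_FRAME + 1, SAMPLES_PER_FRAME):
        frame = pcm[:, start:start + SAMPLES_PER_FRAME]   # shape (4, 960)
        interleaved = frame.T.reshape(-1).tobytes()       # W X Y Z, W X Y Z, ...
        # Placeholder for whatever custom-audio-input hook the RTC SDK exposes.
        rtc_session.push_external_audio_frame(interleaved,
                                              sample_rate=SAMPLE_RATE,
                                              channels=CHANNELS)
```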

2. After preprocessing, hand it over to the receiving end to restore
This corresponds to object-based audio technology: Ambisonics is still used for sound capture, but before transmission the signal is reduced to two channels for encoding, so that web and mobile clients remain compatible. The receiving end then takes the two-channel data, restores it to Ambisonics, renders it in real time according to changes in the virtual scene, and finally plays it back to the user, as illustrated below.
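As a sketch of the receive-side rendering step, assuming the first-order Ambisonics signal has already been restored from the two received channels, the Python below shows one way to rotate the sound field with the listener's head yaw and decode it for playback. The rotation convention and the simple two-virtual-speaker decode are illustrative; a production renderer would typically use HRTF-based binaural decoding instead.

```python
import numpy as np

def rotate_foa_yaw(foa: np.ndarray, yaw_deg: float) -> np.ndarray:
    """Rotate a first-order Ambisonics (W, X, Y, Z) sound field around the
    vertical axis, e.g. to follow the listener's head orientation each frame."""
    a = np.radians(yaw_deg)
    w, x, y, z = foa
    xr = x * np.cos(a) - y * np.sin(a)
    yr = x * np.sin(a) + y * np.cos(a)
    return np.stack([w, xr, yr, z])

def decode_to_stereo(foa: np.ndarray) -> np.ndarray:
    """Very simple decode to two virtual speakers at +/-90 degrees (left/right)."""
    w, x, y, z = foa
    left = np.sqrt(0.5) * w + 0.5 * y   # Y is positive toward the left
    right = np.sqrt(0.5) * w - 0.5 * y
    return np.stack([left, right])

# Per audio frame on the receiving end (restoration step comes from the text above):
#   foa = restore_foa(received_two_channel_frame)
#   out = decode_to_stereo(rotate_foa_yaw(foa, head_yaw_from_vr_headset))
```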

3. Realized by text and voice conversion technology
If the virtual scene is a "two-dimensional" (anime-style) world, we not only want to avoid reproducing the player's real voice directly, but also want each character's voice to match its persona in that world.
In this case, you can use RongCloud IM technology together with mutual conversion between voice and text (ASR and TTS). After the player's voice is captured, it is first converted into text, then fed into a voice model, and finally synthesized into the voice of the character, as sketched below.
This approach lets every player's voice match the setting of the game world, thereby enhancing the sense of immersion.
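A minimal Python sketch of this voice-to-text-to-character-voice pipeline follows. The `speech_to_text` and `synthesize_speech` stubs are placeholders for whichever ASR and TTS services the application integrates; they are not a specific vendor API.

```python
def speech_to_text(pcm_chunk: bytes) -> str:
    """Placeholder for the ASR step; a real app would call its speech-recognition service here."""
    raise NotImplementedError("wire this to the ASR service in use")

def synthesize_speech(text: str, voice: str) -> bytes:
    """Placeholder for the TTS step; a real app would call its voice-synthesis service here."""
    raise NotImplementedError("wire this to the TTS service in use")

def convert_to_character_voice(pcm_chunk: bytes, character_voice: str) -> bytes:
    """Voice -> text -> character voice, matching the pipeline described above."""
    text = speech_to_text(pcm_chunk)        # 1. recognise what the player actually said
    if not text:
        return b""                          # nothing recognised: send nothing
    return synthesize_speech(text, voice=character_voice)  # 2. re-speak it in the character's voice
```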

Concluding remarks
The continuous upgrading and advancement of related technologies will keep pushing the Metaverse concept forward. The development of industry chains such as VR, AR, 5G, AI, and professional engines and platforms will continue to drive users' pursuit of immersive experiences, and immersive audio communication may become the mainstream of future communication. We keep a close eye on the market and hope to carry out in-depth exploration and research with partners across the industry; immersive audio and communication technology may become a breakthrough point for the communications business in the future.

