Agora recently released its "MetaChat" and "MetaKTV" solutions, aiming to provide new metaverse experiences for social interaction and online karaoke. In both, the core technology of "3D spatial audio" plays a key role in enhancing users' immersion and listening experience, and we have received many inquiries from developers about it. Today we present a technical deep dive into Agora's 3D spatial audio, revealing its core features and the technical principles behind them.
In real life, thanks to the binaural effect, we hear sounds differently depending on whether they come from above, below, the left, or the right, and can quickly locate the speaker. In the virtual space of the metaverse, increasing immersion through hearing is a problem many vendors urgently need to solve. Imagine a 3D virtual chat room: you steer your avatar to chat with other users, hearing the host's voice from the front, chatter from every corner of the room, and elegant BGM surrounding you, as if you were at a real offline party. The fidelity of such key details is what builds a sense of presence and immersion. When they are missing, the experience falls flat; when every detail is handled well, users get a far more realistic listening experience.
Three core technologies, 3D spatial sound, air attenuation simulation, and voice blur, faithfully simulate real hearing
Agora's 3D spatial audio uses a purely software-based algorithm to simulate the stereo sound field around the listener's head, giving users a sense of space in what they hear. When a user moves their avatar through a virtual scene, the audio changes according to the avatar's facing direction and the sound source's direction, distance, and height, faithfully reproducing a real auditory experience. Three technologies play the key roles here: "3D spatial sound", air attenuation simulation, and voice blur.
01 3D spatial sound: simulating the position and orientation of the sound source to produce timbre differences
As mentioned at the beginning of the article, we can perceive sound coming from different directions in real life. Let's first briefly introduce how the "sense of direction" here is generated.
■ Figure 1: Schematic diagram of sound reception by the auricle
As Figure 1 shows, when the auricle of the human ear receives sound from different directions, the sound waves are conducted to the inner ear along different paths. Because of the pinna's shape, sound arriving from different directions therefore reaches the inner ear with a different timbre. In addition, since we have two ears, the time at which a sound wave reaches each ear differs with the direction of the source. We can understand this with the help of Figure 2.
■Figure 2: Schematic diagram of binaural effect
Figure 2 shows that if the sound source is on your right, the right ear receives the sound wave first; conversely, if the source is on your left, the left ear hears it first. The head itself also affects transmission: if the source is on the right, the wave must cross the "barrier" of the head to reach the left ear, so the timbre and high frequencies heard by the left ear are attenuated relative to the right. Ultimately, we rely on the interaural differences in volume, time, and timbre to judge the direction of a sound.
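The interaural time difference described above can be estimated with the classic Woodworth spherical-head model; this is a minimal sketch (the head radius is an assumed average, not a value from the article):

```python
import math

HEAD_RADIUS_M = 0.0875   # assumed average head radius in metres
SPEED_OF_SOUND = 343.0   # m/s in air at roughly 20 degrees C

def itd_woodworth(azimuth_rad: float) -> float:
    """Interaural time difference (seconds) for a distant source at the
    given azimuth, using the Woodworth spherical-head approximation:
    the wave travels an extra a*(sin(theta) + theta) to the far ear."""
    theta = abs(azimuth_rad)
    return HEAD_RADIUS_M / SPEED_OF_SOUND * (math.sin(theta) + theta)

# A source directly to one side (90 degrees) yields the maximum ITD,
# roughly 0.65 ms, consistent with measured human interaural delays.
max_itd = itd_woodworth(math.pi / 2)
```

Delays of well under a millisecond are enough for the auditory system to resolve left-right direction, which is why a renderer must reproduce them precisely.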
Having introduced the principle of the binaural effect, let's look at how Agora simulates it in virtual space. In the virtual world, we need to render two-channel audio based on the relative position of the sound source and the listener, so that a user wearing headphones can experience sounds at different spatial positions and "locate by listening". Some traditional solutions simply adjust the left and right ear volumes, but this can only render the left-right direction. For the more complex front, back, up, and down directions, the fine differences in interaural timbre and delay must also be reproduced to accurately simulate the position of the sound source.
In the research and implementation of spatial hearing, the head-related impulse response (HRIR) occupies a very important position. Agora has developed a complete 3D sound field rendering engine based on the HRTF (head-related transfer function), psychoacoustics, sound source directivity simulation, and other algorithms. It dynamically simulates how sound from any angle and orientation in space changes as it travels to the left and right ears, achieving high-precision directional rendering. To pursue both the best possible listening experience and practical usability, the engine supports 48 kHz full-band, multi-channel audio rendering with minimal computing requirements, so mobile users can enjoy high-definition interactive audio without worrying about bandwidth or compute.
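The core of HRIR-based rendering is a pair of convolutions: the mono source signal is convolved with the left-ear and right-ear impulse responses measured for the source's direction. A minimal sketch with toy HRIRs (real HRIRs come from measurements, not the hand-made values below):

```python
import numpy as np

def render_binaural(mono: np.ndarray,
                    hrir_left: np.ndarray,
                    hrir_right: np.ndarray) -> np.ndarray:
    """Convolve a mono source with the HRIR pair for one direction,
    producing a 2-channel (left, right) binaural signal."""
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return np.stack([left, right], axis=0)

# Toy HRIRs: the right ear receives the impulse 5 samples later and
# quieter, mimicking a source on the listener's left.
hrir_l = np.zeros(32); hrir_l[0] = 1.0
hrir_r = np.zeros(32); hrir_r[5] = 0.6
stereo = render_binaural(np.random.randn(480), hrir_l, hrir_r)
```

Because the HRIRs encode direction-dependent delay and spectral shaping together, this single operation reproduces the time, volume, and timbre differences described in the binaural-effect section.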
To acquire the HRIR data, Agora collected a pair of HRIRs for each angle in an anechoic chamber, finally forming a spherical data set that enables accurate simulation of each angle. Figure 3 below is a schematic of the spherical HRIR coordinates: the center is the position of the human head, and the surrounding red dots mark the sound source directions used during HRIR acquisition.
■ Figure 3: Spatial distribution of HRIR acquisition points
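A renderer must map an arbitrary requested direction onto this measured spherical grid; the simplest strategy is nearest-neighbour lookup by great-circle angle (production engines typically interpolate between neighbouring points instead). A sketch, with a hypothetical 30-degree grid standing in for the real data set:

```python
import math

def angular_distance(az1, el1, az2, el2):
    """Great-circle angle (radians) between two directions given as
    (azimuth, elevation) pairs in radians."""
    cos_d = (math.sin(el1) * math.sin(el2)
             + math.cos(el1) * math.cos(el2) * math.cos(az1 - az2))
    return math.acos(max(-1.0, min(1.0, cos_d)))  # clamp for safety

def nearest_grid_point(az, el, grid):
    """Pick the measured HRIR direction closest to the requested one."""
    return min(grid, key=lambda p: angular_distance(az, el, p[0], p[1]))

# Hypothetical measurement grid: every 30 degrees at zero elevation.
grid = [(math.radians(a), 0.0) for a in range(0, 360, 30)]
best = nearest_grid_point(math.radians(40), 0.0, grid)
```

The denser the acquisition grid in Figure 3, the smaller the worst-case angular error of such a lookup.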
The orientation of the sound source also directly affects what we hear. For example, when a person speaks with their back to you, their voice sounds more "muffled" than when they face you, because the sound must bend around their body, and the energy of different frequencies attenuates differently when passing this obstacle. Agora's 3D spatial sound therefore also provides a sound source orientation feature: through acoustic modeling, it can simulate the differences caused by any source orientation, reproducing the feel of real hearing.
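One simple way to model such directivity is a frequency-dependent radiation pattern: a cardioid-like gain that falls off behind the source, more sharply at high frequencies. This is an illustrative sketch only; the coefficients are invented, not Agora's acoustic model:

```python
import math

def directivity_gain(angle_rad: float, freq_hz: float) -> float:
    """Gain (0..1) for a source that radiates mostly forward.
    angle_rad: angle between the source's facing direction and the
    listener (0 = facing the listener, pi = back turned).
    Higher frequencies are attenuated more behind the source;
    the sharpness formula below is an assumption for illustration."""
    base = 0.5 * (1.0 + math.cos(angle_rad))      # cardioid pattern
    sharpness = 1.0 + freq_hz / 4000.0            # sharper at high freq
    return base ** sharpness

# A voice heard from behind: the 4 kHz content drops far more than
# the 200 Hz content, which is why it sounds "muffled".
g_low = directivity_gain(math.radians(150), 200.0)
g_high = directivity_gain(math.radians(150), 4000.0)
```

Applying such a gain per frequency band reproduces the muffled quality of a speaker whose back is turned.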
In addition, in real environments people perceive the up-down and front-back directions relatively vaguely. This is because the two ears are essentially symmetric: compared with the horizontal plane, the interaural volume and delay differences are nearly zero in the vertical plane, which is not enough to distinguish direction. In virtual space, Agora has therefore enhanced the sense of hearing in these directions, giving users a beyond-real ability to identify sounds in the "virtual space".
02 Air attenuation simulation: reproducing a real acoustic phenomenon to make sound more realistic
Simulating the positions and orientations of different sound sources for "locating by listening" is only the first step in reproducing real hearing with Agora's 3D spatial audio. We have also implemented air attenuation simulation. In reality, sound waves attenuate as they travel through air, and the attenuation is frequency-dependent: high-frequency sounds, such as the hum of mosquitoes or bird calls, attenuate quickly, while low-frequency sounds, such as a deep male voice, the wind, or a water pump, attenuate slowly. Consequently, of two sounds at the same volume, the one with more high-frequency content sounds closer. The air attenuation feature simulates exactly this real-world phenomenon to make sound more realistic. Two figures show this intuitively: the attenuation curves in Figure 4 below show that low-frequency sound travels farther, while sound above 8 kHz becomes hard to hear beyond 1 km.
■Figure 4: Acoustic air attenuation curves of different frequencies
As the time-frequency diagram in Figure 5 shows, the upper half simulates air attenuation, while the lower half attenuates volume only. The comparison shows that with air attenuation, as distance gradually increases, frequencies above 8 kHz decay far more rapidly.
■Figure 5: Spectrum comparison of air attenuation effect
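The frequency-dependent loss behind Figures 4 and 5 can be sketched numerically. Classical air absorption grows roughly with the square of frequency; the coefficient below is an illustrative round number chosen to match the "8 kHz is inaudible beyond 1 km" observation, not a measured atmospheric value (the real figure depends on temperature and humidity):

```python
def air_absorption_db(freq_hz: float, distance_m: float,
                      coeff_db_per_km: float = 1.6) -> float:
    """Extra attenuation (dB) from air absorption, on top of the usual
    1/r spreading loss. Loss scales with frequency squared and with
    distance; the default coefficient is an illustrative assumption."""
    return coeff_db_per_km * (freq_hz / 1000.0) ** 2 * (distance_m / 1000.0)

# Over 1 km an 8 kHz tone loses about 100 dB to the air alone,
# while a 200 Hz tone loses well under 1 dB: low sounds carry farther.
loss_8k = air_absorption_db(8000.0, 1000.0)
loss_200 = air_absorption_db(200.0, 1000.0)
```

In a renderer, evaluating this loss per frequency band and applying it as a distance-dependent low-pass filter reproduces the upper half of Figure 5.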
In everyday life many people have no conscious perception of "air attenuation", but in virtual space, the combination of 3D spatial sound and air attenuation simulation further restores the real auditory experience and creates a more convincing "sound-immersive" feeling in the metaverse.
03 Voice blur: a lively atmosphere and a "quiet" chat can coexist
We often run into this in noisy bars or live houses: you just want to hear your friends, but you don't want to eliminate everyone else's noise completely, because then you lose the vibe of the place. Offline this may be impossible, but in virtual space it isn't. The voice blur feature of Agora's 3D spatial audio blurs the sounds you don't want to hear: you can still hear the voices around you in the space, but you can no longer make out what they are saying, so the ambience is preserved without interfering with your conversation with friends.
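One simple way to achieve this "audible but unintelligible" effect is heavy low-pass filtering: speech intelligibility lives mostly in the higher frequencies, while presence and loudness survive below. A crude sketch (the one-pole filter and the 500 Hz cutoff are illustrative assumptions, not Agora's actual algorithm or parameters):

```python
import numpy as np

def voice_blur(samples: np.ndarray, sample_rate: int,
               cutoff_hz: float = 500.0) -> np.ndarray:
    """Crude 'voice blur': a one-pole low-pass filter that keeps a
    voice audible but strips the high frequencies carrying most of
    its intelligibility. Cutoff choice is an assumption."""
    # one-pole smoothing coefficient derived from the cutoff frequency
    alpha = 1.0 - np.exp(-2.0 * np.pi * cutoff_hz / sample_rate)
    out = np.empty_like(samples, dtype=np.float64)
    acc = 0.0
    for i, x in enumerate(samples):
        acc += alpha * (x - acc)   # y[n] = y[n-1] + a*(x[n] - y[n-1])
        out[i] = acc
    return out
```

Run over a neighbouring conversation, this keeps the murmur of the room while removing the consonant detail you would otherwise involuntarily parse.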
In addition to the three core features above, Agora's 3D spatial audio also supports playing local sound source files, letting you customize the background sound, accompaniment, and sound effects in a scene. For developers, several flexible integration modes are supported:
- API mode: integrate the Agora SDK directly and call its APIs to customize spatial audio; the customer (a central server is required) supplies parameters from their own virtual world, such as sound source positions, listener position, and orientation.
- Server mode: the Agora server handles coordinate synchronization and the parameter calculations spatial audio requires, while audio rendering is performed on the client.
- Local rendering mode: the client applies spatial audio rendering to locally supplied audio, enabling features such as background music and ambient sound rendering.
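Whichever mode is used, the essential inputs are the same: the positions and orientations of the listener and each source. As a minimal sketch of the geometry involved (all names are hypothetical and not the Agora SDK API; a flat 2-D scene is assumed), here is how a client or server might derive the distance and head-relative azimuth a spatial renderer consumes:

```python
import math
from dataclasses import dataclass

@dataclass
class Pose:
    """Position (metres) and facing angle (radians, world frame) of an
    avatar in a flat 2-D scene; a simplified stand-in for the
    position/orientation parameters the integration modes expect."""
    x: float
    y: float
    yaw: float  # 0 = facing +x

def source_relative_to_listener(listener: Pose, source: Pose):
    """Distance and head-relative azimuth of a source, the kind of
    values fed to a spatial renderer (illustrative, not SDK API)."""
    dx, dy = source.x - listener.x, source.y - listener.y
    distance = math.hypot(dx, dy)
    azimuth = math.atan2(dy, dx) - listener.yaw   # rotate into head frame
    azimuth = (azimuth + math.pi) % (2 * math.pi) - math.pi  # [-pi, pi)
    return distance, azimuth

# A speaker 3 m directly ahead of a listener who is facing +y:
d, az = source_relative_to_listener(Pose(0, 0, math.pi / 2), Pose(0, 3, 0.0))
```

In API mode the application computes and submits such values itself; in server mode the same calculation runs on the Agora server from synchronized coordinates.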
Having covered the core features and technical principles of Agora's 3D spatial audio, let's look at its application scenarios. The metaverse, regarded as the next form of the Internet, has already been integrated into many industry scenarios, such as games, chat rooms, online karaoke, virtual concerts, VR, and AR. For scenarios such as chat rooms, online meetings, virtual events, and online education, 3D spatial audio can effectively enhance users' online interaction and listening experience; for the metaverse, VR, AR, virtual concerts, and online games, it is expected to redefine users' immersion and auditory experience in the virtual world.
■Figure 6: Some application scenarios of 3D spatial audio
We will more intuitively demonstrate the effects of 3D spatial audio through several application scenarios.
1. Voice chat room: with 3D spatial audio, you can hear 360° audio from the front, back, left, and right of the room; when laughter reaches you from all around, it feels like attending a real offline gathering, and neither the audience nor the speakers find the room dull. Combined with the voice blur feature, the "cocktail party effect" can also be triggered in a voice chat room: in a mix of many voices, you can focus on one voice and still make out what that person is saying. Users tire less, stay more immersed in the conversation, and chat for significantly longer.
2. Online games: game voice is a standard feature of many online games; adding real-time voice makes it easier for players to communicate, collaborate, and win. However, a plain game voice solution is little more than a communication bridge and does little to improve the game experience itself. Combining real-time interaction with 3D spatial audio rebuilds the experience of fighting side by side. For example, in an FPS, when a teammate talks to you through game voice with 3D spatial audio, you can feel that he is standing on your right giving you attack instructions, as if you were really playing CS together offline. Such an experience all but overturns traditional game voice communication, turning a simple voice chat feature into a core driver of immersion and collaboration in the game.
3. Virtual concerts: the virtual concert is an emerging form of online performance. Through motion capture, a singer can take on a virtual avatar projected onto a virtual stage, while audience members also become virtual people applauding below it. Adding 3D spatial audio is expected to transform the auditory experience of virtual concerts: sitting below the stage as an avatar, a listener hears sound from every corner, from the left and right of the stage, from the singer at the center, and from audience members all around, just as in a real concert venue, and the same holds for the singer.
Compared with voice chat rooms and online meetings, the success of a virtual concert hinges on how good the singer's voice sounds to the audience. 3D spatial audio fundamentally changes how the singing voice reaches the audience's ears and compensates for the missing "live atmosphere" of virtual concerts. In the future, the audience's immersion and auditory experience at virtual concerts may rival offline ones.
Whether in today's various online activities or the future virtual worlds of the metaverse, communication and interaction between players depend on real-time engagement (RTE), which is why RTE is regarded as part of the metaverse's underlying infrastructure. By adding 3D spatial audio, new spatial information is introduced into the virtual experience, letting users perceive what is happening behind them or elsewhere in the environment entirely without their eyes. While establishing real-time interaction between users, RTE will also become the infrastructure that helps metaverse scenes increase presence and immersion, building a more realistic and advanced audio metaverse.