
This article is part of the "Dev for Dev Column" series. The author is Li Song, an audio algorithm engineer at Shengwang.

With the rise of the metaverse, spatial audio technology has gradually entered the public eye. For the basic principles of spatial audio, we made a popular-science video, "The Principles Behind Spatial Audio"; readers who want to learn more can find the link at the end of this article.

In this article, we mainly discuss object-based real-time spatial audio rendering, that is, the rendering ideas and solutions when each rendered object is a sound source, in playback scenarios such as headphones.

01 Rendering of virtual sound

Virtual sound refers to a virtual sound source synthesized using spatial audio technology.

In real life, people use both ears to perceive the position of a real sound source. Virtual sound rendering imitates the process by which sound from a real source reaches our ears, so that the listener can perceive information such as the position of the virtual sound source in space.

During rendering, we need several pieces of basic information for signal processing, such as the source signal, the room model, the positions of the source and the listener, the orientation of the source, and so on. When there is no obstruction between the source and the listener, the emitted sound reaches the listener's ears directly; we call this sound the direct sound.

After the direct sound arrives, reflections of the source off the walls, floor, ceiling, or other obstacles reach the listener's ears one after another. These reflections start out sparse, become denser over time, and decay exponentially in energy. Usually, we call the sparse reflections at the beginning early reflections, generally arriving within the first 50 ms to 80 ms; the dense reflections after this period are called late reverberation (the exact boundary depends on environmental factors such as room size).

02 Implementing real-time spatial audio rendering

The above describes how sound travels through the air of a room to the human ear. Next, we discuss how to reproduce this process. Since the propagation from source to ear is generally linear, the simplest rendering method, shown in Figure 1, is to convolve a known monophonic source with a pair of binaural room impulse responses (BRIRs).

■Figure 1: Convolving with a BRIR to obtain binaural sound
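
As a minimal sketch of Figure 1 (assuming NumPy/SciPy and a measured BRIR pair already loaded, e.g. from the IoSR-Surrey database mentioned below), the whole rendering collapses to two convolutions:

```python
import numpy as np
from scipy.signal import fftconvolve

def render_with_brir(mono: np.ndarray, brir_left: np.ndarray,
                     brir_right: np.ndarray) -> np.ndarray:
    """Convolve a mono source with a left/right BRIR pair -> binaural audio."""
    left = fftconvolve(mono, brir_left)    # direct sound plus all reflections
    right = fftconvolve(mono, brir_right)  # are baked into the measured BRIR
    return np.stack([left, right], axis=-1)  # shape: (num_samples, 2)
```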

A BRIR is the binaural room impulse response from the sound source to the ears in a room, including the direct sound and the reflections (both early reflections and late reverberation). We can think of it as the sound heard when the source emits an impulse signal (such as a hand clap or a starter pistol). As mentioned earlier, the BRIR depends on the room, the sound source, and the positions of the source and listener.

Figure 2 shows a pair of BRIRs measured in a real room at a horizontal angle of 30°, where the blue and red lines represent the BRIRs for the left and right ears, respectively. The direct sound, early reflections, and late reverberation are clearly visible in the time domain.

■Figure 2: A pair of binaural room impulse responses measured in a real room (source: University of Surrey, IoSR-Surrey database)

So how do we obtain a BRIR? Measurement is the most faithful and accurate approach, especially for augmented reality (AR) scenarios. But even setting personalization aside, it is impractical to measure BRIRs with a dummy head at every possible position in every room.

In addition, the length of a BRIR depends strongly on how reverberant the room is, and real-time convolution with a long BRIR requires a lot of computing power. Therefore, synthesis is usually used to approximate the real BRIR.

1) Synthesis of direct sound

The first part of the BRIR is the impulse response of the direct sound. The direct sound can be synthesized by convolving the source signal with a head-related impulse response (HRIR). The HRIR is the time-domain representation of the head-related transfer function (HRTF).

You may already be familiar with the HRTF. It represents the transfer function from the sound source to the ear, but does not include the reflection parts (early reflections and late reverberation). That is, if a source is rendered with an HRTF alone, the virtual listening scene is reflection-free, similar to special venues such as an anechoic chamber or a snowy mountaintop.

■Figure 3: Rendering of the direct-sound part

Figure 3 shows the rendering chain of the direct-sound part, including rendering of the source's orientation, the change of sound pressure with distance, the simulation of air attenuation, and filtering with a pair of HRTFs for the specified direction. Each of these modules was introduced in a previous article (see Further Reading at the end of this article).
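
A minimal sketch of two stages of this chain, assuming NumPy/SciPy and an already-selected HRIR pair (the function name and the 1 m clamp are our own illustration, not Shengwang's implementation; air attenuation is sketched separately later):

```python
import numpy as np
from scipy.signal import fftconvolve

def render_direct(mono, distance_m, hrir_left, hrir_right):
    """Apply 1/r distance falloff, then filter with the direction's HRIR."""
    gain = 1.0 / max(distance_m, 1.0)  # clamp inside 1 m to avoid blow-up
    src = gain * np.asarray(mono, dtype=float)
    return np.stack([fftconvolve(src, hrir_left),
                     fftconvolve(src, hrir_right)], axis=-1)
```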

In real-time audio processing, given the length of the HRIR, the convolution is generally not performed directly in the time domain but in the frequency domain via the Fast Fourier Transform (FFT). When the position of the source changes, the corresponding HRIR must be switched in. To avoid audible artifacts when switching HRIRs, fade-out and fade-in operations are applied between the outputs of the old and new HRIRs.
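
The crossfade can be sketched as follows (assumptions: one audio block per position update, a linear ramp, and the convolution tails beyond the block are dropped for brevity rather than carried over by overlap-add):

```python
import numpy as np
from scipy.signal import fftconvolve

def crossfade_hrir(block, hrir_old, hrir_new):
    """Filter one block with both HRIRs and crossfade to hide the switch."""
    out_old = fftconvolve(block, hrir_old)[:len(block)]
    out_new = fftconvolve(block, hrir_new)[:len(block)]
    fade = np.linspace(0.0, 1.0, len(block))  # linear fade-in of the new HRIR
    return (1.0 - fade) * out_old + fade * out_new
```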

If the measured HRIR directions are not dense enough, real-time interpolation is required during rendering to obtain the HRIR for the target direction. Alternatively, interpolation can be performed offline until the grid is dense enough (below the just-noticeable difference), and the HRTF closest to the target position can then be selected directly during real-time rendering.

This reduces the computing cost of real-time interpolation, but increases the storage footprint of the HRIR set. There is therefore a trade-off between HRIR storage size and the computing power needed for real-time interpolation.
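
For the dense-grid variant, nearest-neighbour selection is just an angular-distance search; a sketch assuming the grid is stored as (azimuth, elevation) pairs in radians:

```python
import numpy as np

def nearest_hrtf_index(grid_az, grid_el, az, el):
    """Return the index of the grid direction closest to (az, el)."""
    # great-circle (angular) distance between the target and each grid point
    cos_d = (np.sin(grid_el) * np.sin(el)
             + np.cos(grid_el) * np.cos(el) * np.cos(grid_az - az))
    return int(np.argmin(np.arccos(np.clip(cos_d, -1.0, 1.0))))
```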

2) Synthesis of early reflections

The second part of the BRIR is the impulse response of the early reflections. Early reflections have a great impact on timbre, sound source localization, and more. Since early reflections are sparse, each one is treated as a virtual sound source. The positions of these virtual sources can be obtained by methods such as the image-source (mirror) method, ray tracing, or numerical solvers.

The image-source method is computationally efficient and fairly intuitive, but it is generally only suitable for smooth walls and cannot simulate the scattering caused by uneven surfaces. Ray tracing can simulate both reflection and scattering, but the choice of direction and number of rays requires careful consideration, and for low-order early reflections it costs more compute than the image-source method. Numerical methods are the most accurate, but their computational complexity is very high, making them suitable only for offline simulation.

When calculating early reflections in real time, computing-power constraints mean we generally compute only the first- and second-order early reflections accurately. The order here refers to the number of collisions with walls; for example, a first-order reflection reaches the ear after bouncing off a wall once. Once the positions of these early reflections are calculated, each reflection is processed the same way as the direct sound, including the orientation of the reflected source, the distance-dependent sound pressure and air attenuation, and the corresponding HRTF filtering.
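
A sketch of the image-source idea for first-order reflections in a cuboid room, including the delay relative to the direct sound discussed next (the single wall-absorption coefficient and the 343 m/s speed of sound are simplifying assumptions):

```python
import numpy as np

C = 343.0  # speed of sound, m/s

def first_order_images(src, listener, room, absorption=0.3):
    """src, listener: (x, y, z) in metres; room: (Lx, Ly, Lz) of a cuboid."""
    src, listener = np.asarray(src, float), np.asarray(listener, float)
    direct = np.linalg.norm(src - listener)
    reflections = []
    for axis in range(3):
        for wall in (0.0, room[axis]):       # the two walls on this axis
            image = src.copy()
            image[axis] = 2.0 * wall - image[axis]  # mirror across the wall
            dist = np.linalg.norm(image - listener)
            reflections.append({
                "position": image,               # rendered like a source
                "delay_s": (dist - direct) / C,  # relative to direct sound
                "gain": (1.0 - absorption) / max(dist, 1.0),  # wall loss + 1/r
            })
    return reflections
```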

Note that since each reflection reaches the ear at a different time, the delay of each reflection relative to the direct sound must be calculated. Strictly speaking, the direct sound itself also has a propagation delay from source to ear.

If, in a virtual scene, the source is 100 meters away from the listener, an additional delay of about 291 milliseconds would have to be introduced (100 m / 343 m/s ≈ 291 ms). Although this matches the physics, it adds extra latency, and the absolute delay of the direct sound contributes nothing to the sense of space. Therefore only the delay of each reflection relative to the direct sound is applied.

In addition, since each reflected source is produced by reflection/scattering off a wall, the energy loss caused by the wall must also be simulated. Doing this realistically requires room information such as the wall materials.

■Figure 4: Rendering of early reflections

The general flow of generating early reflections is shown in Figure 4. In the figure, each early reflection is processed separately, and when the binaural sound is finally generated, each reflection is filtered by its corresponding HRTF. An alternative is to first synthesize a combined impulse response of the early reflections; its length grows with the size of the room, but the source then only needs to be convolved with this impulse response once.
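
Reusing the (delay, gain) taps from the image-source sketch above, the combined early-reflection impulse response can be assembled like this (per-ear HRTF filtering is omitted for brevity):

```python
import numpy as np

def early_reflection_ir(reflections, fs=48000):
    """Place one tap per reflection; convolve the source with this IR once."""
    length = int(max(r["delay_s"] for r in reflections) * fs) + 1
    ir = np.zeros(length)
    for r in reflections:
        ir[int(r["delay_s"] * fs)] += r["gain"]
    return ir
```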

3) Synthesis of late reverberation

The final part of the BRIR is the late reverberation. In principle, every reflection could be computed the way early reflections are, but computing higher-order reflections requires a lot of computing power, and the human ear is not sensitive to the locations of higher-order reflections. We therefore do not need to compute them accurately, i.e., to trace the exact paths of sound waves reflected/scattered by the walls many times over.

Typically, we use an artificial reverberator to render the late reverberation. Common artificial reverberators include the all-pass filter, the feedback comb filter, and combinations of all-pass and comb filters. Among reverberators, the feedback delay network (FDN) is a common and cost-effective implementation.
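
As one concrete example of the comb-plus-all-pass family, here is a minimal Schroeder-style reverberator; the delay lengths and gains are illustrative values, not a tuned design, and the sample-by-sample loops favour clarity over speed:

```python
import numpy as np

def feedback_comb(x, delay, g):
    """y[n] = x[n] + g * y[n - delay]"""
    x, y = np.asarray(x, float), np.zeros(len(x))
    for n in range(len(x)):
        y[n] = x[n] + (g * y[n - delay] if n >= delay else 0.0)
    return y

def allpass(x, delay, g):
    """y[n] = -g * x[n] + x[n - delay] + g * y[n - delay]"""
    x, y = np.asarray(x, float), np.zeros(len(x))
    for n in range(len(x)):
        x_d = x[n - delay] if n >= delay else 0.0
        y_d = y[n - delay] if n >= delay else 0.0
        y[n] = -g * x[n] + x_d + g * y_d
    return y

def schroeder_reverb(x):
    """Four parallel feedback combs summed, then one all-pass."""
    combs = sum(feedback_comb(x, d, 0.75) for d in (1557, 1617, 1491, 1422))
    return allpass(combs / 4.0, 225, 0.7)
```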

4) Real-time spatial audio rendering process

■Figure 5: Binaural rendering process

The overall real-time spatial audio rendering flow is shown in Figure 5. Note that each module can be designed to be complex or simple. For example, every room has a different shape, but to reduce computational complexity a simple cuboid room model is sometimes used instead. Likewise, the simulation of air attenuation can be implemented with FIR filters, but in practice low-order IIR filters are used to approximate the response and save computing power.
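
For instance, one common simplification is a first-order IIR lowpass whose cutoff falls with distance; the cutoff mapping below is a made-up heuristic for illustration, not a fitted air-absorption model:

```python
import numpy as np
from scipy.signal import butter, lfilter

def air_attenuation(x, distance_m, fs=48000):
    """Approximate high-frequency air absorption with a 1st-order lowpass."""
    cutoff = float(np.clip(20000.0 / (1.0 + 0.05 * distance_m),
                           1000.0, 20000.0))
    b, a = butter(1, cutoff / (fs / 2))  # normalized to Nyquist
    return lfilter(b, a, x)
```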

Core modules such as the HRTF are also a hot research topic. HRTFs are highly individual: using a non-personalized HRTF may cause inaccurate localization or front-back confusion. Although measuring a personalized HRTF is the most accurate option, it is impractical in real applications. Predicting a personalized HRTF, or selecting one based on photos of the ear, may be a more suitable approach in practice.

Beyond personalization, the HRTF is independent of distance in the far field (generally, a source-listener distance greater than 1 meter). That is, for the same direction, the spectral shape of the HRTF at 2 meters is the same as at 3 meters; only the amplitude differs.

In the near field (generally, a source-listener distance of less than 1 meter), however, the HRTF spectrum changes significantly with distance. To render near-field sources more realistically, the near-field HRTF must be synthesized, for example by extrapolating from the far-field HRTF.

03 Conclusion

When the sound source is not an object but a sound field, or a large number of objects must be rendered at the same time, the method above is no longer applicable, and the Ambisonics method can be used for real-time rendering.

Another question: assuming we render room-based spatial audio correctly, will we actually get a good immersive experience? We will discuss how to use the Ambisonics method for real-time spatial audio rendering, as well as the subjective perception of spatial audio, in a later article. Please stay tuned to the Dev for Dev column.

Further reading:

1. 3D spatial audio, air attenuation simulation, and voice blurring: three technologies that faithfully simulate real hearing

2. RTC Science | The principles behind spatial audio

https://www.bilibili.com/video/BV13Y4y187z3

About Dev for Dev

Dev for Dev stands for Developer for Developer. The column is a developer-driven innovation initiative jointly launched by Shengwang and the RTC developer community.

Through various forms of technology sharing, exchange of ideas, and project co-construction from the engineer's perspective, it gathers the power of developers, mines and delivers the most valuable technical content and projects, and fully releases the creativity of technology.

