Implementing high-quality music accompaniment in voice chat rooms

Watch a horror movie with the sound turned off and the frightening scenes lose most of their impact; let a burst of tense music cut into a warm, harmonious scene and the audience immediately knows that something is about to go wrong.

If everyone has their own BGM, then every scene and situation also has music that can stir emotions and resonate with people.

In the recently popular voice chat room business, the hot scenarios likewise cannot do without the "accompaniment" of music. In voice radio, music sets the mood for the theme and makes users more immersed; in multi-person social rooms, it fills awkward silences and makes conversation more comfortable; in karaoke rooms, accompaniment is simply indispensable.

Moreover, music is a common language of humankind. Even when an overseas business faces users from different cultural backgrounds, music can evoke a range of core emotions, which makes it an essential "skill" for voice social applications.

The RongCloud Voice Chat Room SDK covers all popular voice chat room scenarios and is highly extensible. Naturally, it also puts a lot of effort into music management: it provides online music and playlist management; it supports loading and playing local music in most file formats; it provides ambient sound effects such as welcome, applause, and laughter; and it offers mixing controls for vocal volume, local music volume, remote music volume, and an in-ear monitoring switch.

With such rich music management capabilities, how can a high-quality music experience be guaranteed on the technical side? This article shares how the RongCloud Voice Chat Room SDK implements high-quality music playback.

Main forms of music playback in voice chat rooms and their processing challenges

In voice chat room scenarios, music is mainly played in the following ways:

Play music on a third-party device, mix it with the host's voice through an external sound card, and then send the result from the host's side.

Play the accompaniment in the chat room app and mix it with the host's voice inside the app.

Play the accompaniment out loud on the mobile device, so that the microphone captures the host's voice and the device's playback together.

Broadcast from an Android emulator on a PC: the accompaniment is played on the PC, captured through the PC's internal recording, and mixed with the host's voice.

These approaches differ in how the music is processed. Handled badly, they can leave the audience with music that is too quiet, music that is missing part of its high and low frequencies, or sound that is swallowed altogether, all of which seriously hurt the user experience.

Generally speaking, after the audio signal is captured by the mobile device's capture module, it goes through preprocessing such as echo cancellation, noise reduction, and automatic gain control, and is then encoded by the audio encoder.

This preprocessing consists of Acoustic Echo Cancellation (AEC), Automatic Gain Control (AGC), and Automatic Noise Suppression (ANS), commonly known as 3A.
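To make one of these stages concrete, here is a minimal sketch of gain control. It is a deliberately naive, frame-by-frame AGC written for illustration only, not the algorithm used by any particular SDK; the function names, the `target_rms` and `max_gain` parameters, and the test tones are all assumptions. It shows that a quiet and a loud passage come out at the same level, which is helpful for speech but flattens the dynamics of music.

```python
import math


def rms(frame):
    """Root-mean-square level of one frame of float samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def simple_agc(frames, target_rms=0.1, max_gain=10.0):
    """Scale every frame toward target_rms (hypothetical, naive AGC)."""
    out = []
    for frame in frames:
        level = rms(frame)
        # Silent frames are passed through unchanged to avoid dividing by zero.
        gain = min(max_gain, target_rms / level) if level > 0 else 1.0
        out.append([s * gain for s in frame])
    return out


def tone(amplitude, n=480, freq=440.0, rate=48000):
    """Generate n samples of a sine tone at the given amplitude."""
    return [amplitude * math.sin(2 * math.pi * freq * i / rate) for i in range(n)]


quiet_passage = tone(0.02)
loud_passage = tone(0.5)
processed = simple_agc([quiet_passage, loud_passage])

# Loudness ratio between the two passages before and after gain control.
ratio_before = rms(loud_passage) / rms(quiet_passage)
ratio_after = rms(processed[1]) / rms(processed[0])
print(round(ratio_before, 3), round(ratio_after, 3))
```

In this toy case the loud passage starts out 25 times louder than the quiet one, and after gain control the two are at the same level.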

The 3A algorithms also damage the captured sound to varying degrees, and the impact on music is more noticeable than in voice call scenarios. For example, when the host sings, the noise suppression and echo cancellation algorithms may treat the accompaniment as background noise, so that parts of the accompaniment are lost and the sound quality degrades severely.

(Principle of echo cancellation)
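The principle can be sketched with a normalized least-mean-squares (NLMS) adaptive filter, a common textbook approach to echo cancellation. This is an illustration under stated assumptions, not the canceller used in the SDK: the function name, the tap count, the step size `mu`, and the simulated echo path are all hypothetical. The filter learns how the reference signal (what the speaker plays) shows up in the microphone signal and subtracts that estimate.

```python
import random


def nlms_echo_cancel(mic, reference, taps=8, mu=0.5, eps=1e-8):
    """Subtract an adaptively estimated echo of `reference` from `mic`."""
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent reference samples, newest first
    out = []
    for d, x in zip(mic, reference):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - echo_estimate  # what remains after removing the estimate
        norm = sum(h * h for h in history) + eps
        # NLMS update: step along the reference history, scaled by its energy.
        weights = [w + mu * error * h / norm for w, h in zip(weights, history)]
        out.append(error)
    return out


def energy(samples):
    return sum(s * s for s in samples)


random.seed(0)
reference = [random.uniform(-1.0, 1.0) for _ in range(4000)]

# Hypothetical echo path: the speaker signal reaches the mic delayed and attenuated.
echo_path = [0.0, 0.0, 0.6, 0.3]
mic = [
    sum(echo_path[k] * reference[n - k] for k in range(len(echo_path)) if n - k >= 0)
    for n in range(len(reference))
]

residual = nlms_echo_cancel(mic, reference)
tail_in = energy(mic[-500:])
tail_out = energy(residual[-500:])
print(tail_out < tail_in)
```

Here the microphone contains only echo, so the residual shrinks toward zero once the filter has converged. When the microphone also carries singing and accompaniment, anything correlated with the reference is subtracted as well, which is why music is easily damaged at this stage.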

Therefore, scenarios with background music, which are more complex and demand higher sound quality, need a more refined preprocessing strategy.

Implementing high-quality music in RongCloud voice chat rooms

After extensive technical and market research, we grouped the ways a voice chat room captures music into the following categories:

The sound source captured by the device microphone is already a mix of singing and accompaniment. This kind of mixing mainly includes:

The device plays the accompaniment out loud, and it mixes with the host's singing in the air

The accompaniment and the host's singing are mixed through an external sound card

The accompaniment and the host's vocals are mixed through the PC's internal recording

The device captures the host's singing through the microphone, and the app then mixes it with the music that the app itself plays. Depending on the host's habits, this scenario can be further divided into use with and without earphones.

After extensive testing and theoretical verification, we adopted the following approach: for music mixed outside the app, preserve the sound source as faithfully as possible; for music mixed inside the app, perform the mixing after 3A preprocessing, which improves the user experience while preserving sound quality.

The reason is that if the in-app music is mixed with the voice before 3A, the preprocessing damages the music, leaving it with unstable volume and partial loss of sound quality. It also affects the level of the reference signal used in echo cancellation, which causes echo during the chat. Mixing after 3A instead both keeps noise reduction working for the conversation and preserves the quality of the music, so that users get a high-quality sound experience.

(In-app music and sound mixing process)
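The ordering described above can be sketched as follows. This is a minimal illustration with 16-bit integer samples, not the SDK's implementation: `mix_after_preprocessing`, `toy_noise_gate` (a crude stand-in for the whole 3A chain), the gain parameters, and the sample values are all assumptions. Preprocessing runs on the microphone frame only, and the untouched music is mixed in afterwards.

```python
def clip_int16(value):
    """Clamp a mixed sample to the signed 16-bit PCM range."""
    return max(-32768, min(32767, int(round(value))))


def mix_after_preprocessing(mic_frame, music_frame, preprocess,
                            voice_gain=1.0, music_gain=1.0):
    """Run preprocessing on the mic frame only, then mix in the music."""
    clean_voice = preprocess(mic_frame)
    return [
        clip_int16(voice_gain * v + music_gain * m)
        for v, m in zip(clean_voice, music_frame)
    ]


def toy_noise_gate(frame, threshold=500):
    """Hypothetical stand-in for 3A: mute every sample quieter than threshold."""
    return [0 if abs(s) < threshold else s for s in frame]


mic = [100, -100, 8000, -9000]        # low-level noise, then singing
music = [300, -300, 30000, -30000]    # quiet intro, then a loud passage

# Mixing after preprocessing: the quiet music survives untouched.
mixed_after = mix_after_preprocessing(mic, music, toy_noise_gate)

# Mixing before preprocessing, for contrast: the quiet music is gated away.
mixed_before = toy_noise_gate([clip_int16(v + m) for v, m in zip(mic, music)])

print(mixed_after)   # [300, -300, 32767, -32768]
print(mixed_before)  # [0, 0, 32767, -32768]
```

The two gain parameters correspond loosely to separate vocal and music volume controls. The final clamp keeps loud passages within range, although a production mixer would more likely use a limiter than hard clipping.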

When the music is mixed outside the app, there are two cases: the host uses a headset, or the host does not.

When the host uses a headset, capturing the sound source through the headset shields background noise, and there is no echo problem. In this case, no noise reduction or echo processing is applied to the mix of accompaniment and voice captured by the microphone, which preserves the integrity of the mixed source.

When the host does not use a headset, we tuned the 3A parameters based on extensive testing across many device models, to get the best possible quality for both the accompaniment mix and the conversation, achieving high-quality music playback and a better user experience.

(Out-of-app music and sound mixing process)
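Taken together, the strategy in this section amounts to a small decision table, sketched below. The `CapturePlan` type, its field names, and the `choose_plan` function are hypothetical and are not part of any SDK API. The article does not state the tuned 3A parameter values, so the sketch records only that tuned parameters are used, and it treats in-app mixing the same way with and without a headset, which is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CapturePlan:
    """Hypothetical summary of how one capture scenario is processed."""
    echo_cancellation: bool
    noise_suppression: bool
    music_tuned_parameters: bool
    mix_music_after_3a: bool


def choose_plan(music_in_app: bool, headset: bool) -> CapturePlan:
    """Pick a processing plan from where the music is mixed and headset use."""
    if music_in_app:
        # In-app music: preprocess the voice, then mix the music in afterwards.
        return CapturePlan(True, True, False, True)
    if headset:
        # External mix captured through a headset: keep the source intact.
        return CapturePlan(False, False, False, False)
    # External mix without a headset: keep 3A on, with music-tuned parameters.
    return CapturePlan(True, True, True, False)


print(choose_plan(music_in_app=False, headset=True))
```

An app using such a table would re-evaluate it whenever the host plugs in or removes a headset, or switches between in-app and external music.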

Beyond high-quality music playback, the RongCloud Voice Chat Room SDK also supports a voice beautification extension to sound preprocessing, which enhances the magnetism and richness of the host's voice and supports switching between scenarios such as karaoke and concerts.

融云RongCloud
