
In social interaction, the desire to share is an important driver for maintaining relationships, whether among acquaintances, in dating and friendship, or in interest groups.

This desire to share shows up in concrete social scenarios such as chat rooms and co-hosted (Lianmai) live streams. Sharing is not only a way to maintain relationships; it also carries the emotional value of companionship, for example listening to a favorite song together in a chat room, or watching a long-awaited movie together with friends.

Providing users with a high-quality interactive experience built around sharing requires real-time audio mixing. This article shares the exploration and practice of the Rongyun real-time audio and video web SDK in audio mixing, covering the technical principles, the technical solution, and the interface design.


Web-side audio sharing implementation scheme

For a web SDK that supports publishing local or online audio and video resources, the simplest way to share audio in an application is to publish the microphone resource and the audio resource to be shared at the same time; the subscriber then subscribes to both audio tracks and hears both sounds.

The internal flow of this approach is as follows:

(Dual channels get two audio channels)
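In plain WebRTC terms, this "publish both" approach amounts to adding two separate audio tracks to the peer connection, roughly as in the following sketch (micStream and musicStream are illustrative names for the microphone stream and the shared audio stream):

// Add the microphone track and the shared music track to the same connection;
// the remote side then has to subscribe to and play two separate audio tracks
const pc = new RTCPeerConnection()
micStream.getAudioTracks().forEach((track) => pc.addTrack(track, micStream))
musicStream.getAudioTracks().forEach((track) => pc.addTrack(track, musicStream))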

However, the implementation scheme of "publishing two audio tracks and subscribing to two audio tracks" is obviously not elegant enough.

The solution adopted by Rongyun uses real-time audio mixing: the microphone audio and the shared audio are mixed into a single audio track before publishing, so the subscriber only needs to subscribe to one track. The process is as follows:

(real-time audio mixing)

Publishing after mixing not only saves bandwidth, but also reduces the operations required on the subscriber side, giving a simpler integration and a better experience.

Technical principle

All operations in the Web Audio API are based on the audio node, AudioNode, a generic interface for audio processing modules. Audio node types include source nodes, effect nodes, output (destination) nodes, analysis nodes, and so on.

Every audio node has inputs and outputs. A node outputs its processed audio and is connected, via connect(), to the input of the next node, so that audio data flows from node to node.

Connecting different audio nodes together forms an audio graph (Audio Graph). Node operations in the graph depend on the audio context, AudioContext, and each audio node belongs to exactly one audio context.

In practice, the AudioContext is used to create the various audio nodes: the graph starts from one or more source nodes, passes through one or more processing nodes (such as a filter node BiquadFilterNode or a volume control node GainNode), and finally connects to a destination (AudioContext.destination).

AudioContext.destination returns an AudioDestinationNode, the last node in the audio graph, which is responsible for delivering the audio data to the speakers or headphones.

The following is a simplified audio graph (Audio Graph):

(Single Track Audio Graph)
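Expressed in code, such a single-track graph could look like the following minimal sketch (the GainNode is just an example of a processing node, and the audio file path is illustrative):

// Create the audio context that owns all nodes in the graph
const audioContext = new AudioContext()

// Source node: an <audio> element playing a local file
const audioEl = new Audio('/media/sample.mp3')
const sourceNode = audioContext.createMediaElementSource(audioEl)

// Processing node: a GainNode used for volume control
const gainNode = audioContext.createGain()
gainNode.gain.value = 0.8

// Source -> Gain -> Destination (speakers or headphones)
sourceNode.connect(gainNode)
gainNode.connect(audioContext.destination)
audioEl.play()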

Multiple audio source nodes can be routed to the same AudioDestinationNode, which means the microphone audio and the background music can be connected to the same output for playback. The audio graph then becomes the following:

(Dual Track Audio Graph)

Connecting two different audio sources to the same output achieves the mixing.


Technical solutions

Create the audio sources

An audio source can come from an ArrayBuffer of audio data read locally or fetched online, from an audio element, or from a MediaStream generated by the browser's microphone. Taking MediaStream-based sources as an example, audio source nodes are created as follows:

Get the microphone resource and generate an audio source

// Create an audio context instance
const audioContext = new AudioContext();

// Ask the browser for a mediaStream object containing the microphone audio track
let micSourceNode = null;
navigator.mediaDevices.getUserMedia({ audio: true }).then((mediaStream) => {
  // Generate a mediaStream audio source node once the stream is available
  micSourceNode = audioContext.createMediaStreamSource(mediaStream);
});

Play local background music and generate an audio source

// Create an audio element to play the local audio file
const audioEl = document.createElement('audio');
audioEl.src = '/media/bgmusic.mp3';
audioEl.loop = true;
audioEl.play();

// Capture a mediaStream object from the audio element (Chrome is used as the example);
// captureStream is available once the metadata has loaded
let bgMusicSourceNode = null;
audioEl.onloadedmetadata = () => {
  const mediaStream = audioEl.captureStream();

  // Generate a mediaStream audio source node for the local audio
  bgMusicSourceNode = audioContext.createMediaStreamSource(mediaStream);
};
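The list above also mentions an ArrayBuffer of audio data as a possible source. For completeness, here is a minimal sketch of that path (the helper function and URL are illustrative, not part of the SDK), using decodeAudioData and an AudioBufferSourceNode:

// Fetch an audio file, decode it, and expose it as a source node in the graph
async function createBufferSourceNode(url) {
  const response = await fetch(url)
  const audioBuffer = await audioContext.decodeAudioData(await response.arrayBuffer())

  // An AudioBufferSourceNode can be connected like any other source node
  const bufferSourceNode = audioContext.createBufferSource()
  bufferSourceNode.buffer = audioBuffer
  bufferSourceNode.loop = true
  bufferSourceNode.start()
  return bufferSourceNode
}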

Create the playback destination

// Create the destination node that the audio sources will be mixed into
const destination = audioContext.createMediaStreamDestination()

Connect the nodes

Connect two audio source nodes to the same output.

// Connect the microphone source to the output destination
micSourceNode.connect(destination)

// Connect the background music source to the output destination
bgMusicSourceNode.connect(destination)

Get the mixed track

Get the mixed audio track MediaStreamTrack from the output node.

// Get the mixed audioTrack
const mixAudioTrack = destination.stream.getAudioTracks()[0]

After obtaining mixAudioTrack, the remaining problem is how to let the other end hear both the microphone and the shared music in a real audio and video scenario, without re-establishing the peer connection.

Using the mixed audio track in WebRTC

At this point the mixAudioTrack is ready to use. Find the RTCRtpSender that sends the microphone audio on the RTCPeerConnection, and use mixAudioTrack to replace the published microphone audio track. The implementation is as follows:

// The RTCPeerConnection instance created before publishing the microphone resource
const rtcPeerConnection = new RTCPeerConnection(configuration)

// Get the RTCRtpSender object that sends the audio
const micSender = rtcPeerConnection.getSenders().find((sender) => {
  return sender.track.kind === 'audio'
})

// Replace the published microphone audioTrack with the mixAudioTrack
// obtained from the audioContext
micSender.replaceTrack(mixAudioTrack)

Interface design

The Rongyun real-time audio and video SDK provides the createMicrophoneAudioTrack and createLocalFileTracks methods for creating microphone tracks and local or online audio tracks, respectively.

Mixing still operates on the instances created by these two methods: startMixAudio and stopMixAudio methods are added to RCRTCRoom and RCLivingRoom to support starting and stopping the mix.

The following shows the mixing flow at the business layer; rtcClient and rtcRoom are the RTC client instance and room instance obtained by the business side.

Create and publish a microphone resource

// Create a microphone audio track
const micAudioTrack = rtcClient.createMicrophoneAudioTrack()

// Publish the microphone audio track
rtcRoom.publish([micAudioTrack])

Create a local audio track

// Create a local audio track; file is the File instance obtained from <input type='file'>
const localAudioTrack = rtcClient.createLocalFileTracks('bgmusic', file)

Start mixing

Mix the local audio into the published microphone audio. The first parameter of startMixAudio is the published audio track, and the second is the local audio track to be mixed in.

// Start mixing
rtcRoom.startMixAudio(micAudioTrack, localAudioTrack)

Stop mixing

Pass in the track whose mix should be stopped; the SDK strips the local audio from the microphone audio.

// Stop mixing
rtcRoom.stopMixAudio(micAudioTrack)
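For reference, the following is a purely conceptual sketch of what stopping a mix can look like at the Web Audio / WebRTC level, mirroring the earlier graph; it is not necessarily how the SDK implements stopMixAudio internally (micTrack stands for the original microphone MediaStreamTrack):

// Conceptual sketch only; not necessarily the SDK's internal implementation.
// Disconnect the background music source from the mixing destination ...
bgMusicSourceNode.disconnect(destination)

// ... and restore the original microphone track on the RTCRtpSender
micSender.replaceTrack(micTrack)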

With this more elegant and concise solution, developers can integrate audio mixing conveniently and at low cost, lowering the learning curve during integration. Going forward, Rongyun will continue to improve the technology and optimize the experience so that developers can implement their business more efficiently and simply.

