This article is shared by the ELab technical team. The original title was "On the Principles and Applications of WebRTC Technology"; it has been revised and edited.
1. Basic introduction
WebRTC (Web Real-Time Communication) is web-based real-time communication: a technical solution that lets web browsers hold real-time voice or video conversations. From the perspective of front-end development, it is a set of callable API standards.
Before WebRTC was released, developing real-time audio and video applications was very expensive, with many technical issues to consider: audio and video encoding/decoding, data transmission, latency, packet loss, jitter, echo processing and cancellation, and so on. To support real-time audio and video communication in the browser, users also had to install additional plug-ins.
May 2010: Google acquired the GIPS engine of VoIP software developer Global IP Solutions for $68.2 million and renamed it "WebRTC" (see "The Great WebRTC: The Ecology Is Getting Better, Or Turning Real-Time Audio and Video Technology into Cabbage"). The goal was to establish a platform for real-time communication between Internet browsers and make WebRTC part of the HTML5 standard.
January 2012: Google integrated the software into the Chrome browser, and Opera shipped initial WebRTC support.
June 2013: Mozilla Firefox [5] released version 22.0, officially integrating and supporting WebRTC.
November 2017: the W3C WebRTC 1.0 draft was officially finalized.
January 2021: WebRTC was published as an official standard by the W3C and IETF (see "WebRTC 1.0: Real-Time Communication Between Browsers").
(This article has been simultaneously published on: http://www.52im.net/thread-3804-1-1.html )
2. Significance
The emergence and development of WebRTC, and its recognition by industry standards bodies such as the W3C, are of great significance to the current and future development of front-end technology.
Lowering the threshold for audio and video development on the web:
1) Audio and video development used to present a real technical barrier for web developers;
2) with WebRTC, web developers can quickly build audio and video applications by calling its JavaScript APIs.
Avoiding secondary problems caused by dependencies and plug-ins:
1) Building audio and video applications used to depend on various plug-ins, software, and servers;
2) now end-to-end audio and video interaction can be achieved with nothing but a mainstream browser.
Unification and standardization, smoothing over the differences of traditional audio and video environments:
1) Audio and video interaction used to have to contend with different NATs and firewalls, which made establishing media P2P connections a serious challenge;
2) WebRTC includes the open source libjingle project for P2P hole punching, supporting STUN, TURN, and other protocols.
More efficient algorithms and technologies that improve interaction performance:
1) WebRTC uses NACK and FEC to recover from packet loss without relaying through a server, reducing latency and bandwidth consumption;
2) technologies such as TCC + SVC + PACER and the jitter buffer are applied to optimize audio and video fluency.
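To give a feel for what the jitter buffer mentioned above does, here is a minimal illustrative sketch (WebRTC's real jitter buffer is native code and far more sophisticated, with adaptive delay and loss concealment): packets that arrive out of order are held briefly and released in sequence-number order.

```javascript
// Minimal illustrative jitter buffer: holds out-of-order packets and
// releases them in sequence-number order. Not WebRTC's actual
// implementation -- a simplified model of the idea only.
class JitterBuffer {
  constructor() {
    this.buffer = new Map(); // seq -> payload
    this.nextSeq = 0;        // next sequence number to release
  }
  push(seq, payload) {
    // ignore packets that arrive too late to be played
    if (seq >= this.nextSeq) this.buffer.set(seq, payload);
  }
  // release the longest run of consecutive packets starting at nextSeq
  pop() {
    const out = [];
    while (this.buffer.has(this.nextSeq)) {
      out.push(this.buffer.get(this.nextSeq));
      this.buffer.delete(this.nextSeq);
      this.nextSeq++;
    }
    return out;
  }
}

const jb = new JitterBuffer();
jb.push(1, "B"); // arrives out of order
jb.push(0, "A");
console.log(jb.pop()); // ["A", "B"] -- released in playback order
jb.push(3, "D");       // packet 2 still missing
console.log(jb.pop()); // [] -- waits for the gap to be filled (or NACKed)
```

In the real engine, a persistent gap like packet 2 above is what triggers a NACK (retransmission request) or is repaired from FEC data.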
3. Technical characteristics
WebRTC covers a lot of ground; its main technical characteristics include the following.
1) Real-time communication:
WebRTC is a real-time communication technology that allows web applications or sites to establish peer-to-peer connections between browsers without an intermediary, enabling the transfer of video streams, audio streams, or other arbitrary data.
2) No dependencies/plugins:
The standards included in WebRTC enable users to create peer-to-peer data sharing and conference calls without installing any plug-ins or third-party software.
3) Many protocol stacks:
WebRTC is not a single protocol. It comprises multiple protocol standards covering media, encryption, and the transport layer, plus a set of JavaScript-based APIs, and includes functions such as audio/video capture, encoding and decoding, network transmission, and rendering. Through a simple, easy-to-use JavaScript API, the browser gains the ability to share audio, video, and data peer-to-peer without installing any plug-ins.
Diagram of the numerous protocol stacks WebRTC relies on:
At the same time, WebRTC is not an isolated protocol: its signaling is flexible and can easily interface with existing SIP and telephone network systems.
4. Compatible coverage
At present, most mainstream browsers support WebRTC:
▲ The above picture is quoted from "Introduction to the Overall Architecture of WebRTC Real-Time Audio and Video Technology"
For more detailed browser and version compatibility, you can take a look at the following figure:
▲ The above picture is quoted from " https://caniuse.com/rtcpeerconnection "
All mainstream browsers support the standard WebRTC API, making plug-in-free audio and video communication between browsers possible and greatly lowering the threshold for audio and video development. Developers only need to call the WebRTC API to quickly build audio and video applications.
5. Technical framework
As shown in the figure below, the technical framework describes the core content of WebRTC and its API design for different kinds of developers.
WebRTC technology framework diagram:
▲ The above picture is quoted from "Zero Basic Introduction: Based on Open Source WebRTC, Real-time Audio and Video Chat Function from 0 to 1"
As can be seen from the figure, WebRTC is mainly designed around APIs for three kinds of developers:
1) API for web developers: the framework includes a set of JavaScript-based, W3C-standardized APIs, so that web developers can build instant-messaging applications on top of them;
2) API for browser manufacturers: the framework also includes a C++-based low-level WebRTC interface, which is friendly to browser vendors integrating at the bottom layer;
3) parts customizable by browser manufacturers: the framework also includes hooks that vendors can override, such as audio and video capture extensions.
6. Technical core
As can be seen from the framework in the previous section, WebRTC mainly consists of three parts: the voice engine, the video engine, and transport, each containing many protocols and methods.
1) Voice Engine:
a. iSAC/iLBC codecs (audio codecs; the former for wideband and ultra-wideband, the latter for narrowband);
b. NetEQ for Voice (handles network jitter and voice packet loss);
c. Echo Canceler (echo cancellation) / Noise Reduction (noise suppression).
2) Video Engine:
a. VP8 codec (video image codec);
b. Video Jitter Buffer (handles video jitter and video packet loss);
c. Image Enhancements (image quality enhancement).
3) Transport.
7. Technical principle
7.1 Basic situation
Main technical features of WebRTC:
1) SRTP: the Secure Real-time Transport Protocol, for audio and video streaming;
2) Multiplexing: carrying multiple streams over one transport;
3) P2P: STUN + TURN + ICE, used for NAT and firewall traversal;
4) DTLS: Datagram Transport Layer Security, used for encrypted transmission and key negotiation;
5) UDP: WebRTC media transport is based on UDP.
Due to space limitations, the remaining sections will not cover audio and video capture, encoding, and processing in detail; they introduce only the core principles of how a real-time connection is established.
7.2 Public network IP mapping: clear network location information
WebRTC is implemented on top of browser peer-to-peer (P2P) connections.
Since there is no server relaying the media, each side must learn the other's network address: the public IP and port of the corresponding host, obtained with the help of NAT traversal techniques such as ICE, STUN, and TURN.
A known network location is the basis for establishing end-to-end direct communication.
NAT traversal schematic:
The STUN server is used to assist intranet penetration to obtain the public network address and port information of the corresponding host:
▲ The above picture is quoted from "Introduction to the Overall Architecture of WebRTC Real-Time Audio and Video Technology"
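The candidates gathered this way surface in the browser as candidate strings whose `typ` field records how each address was obtained. As an illustration (the sample candidate lines below are made up for this sketch, not taken from a real session):

```javascript
// Classify an ICE candidate line by its "typ" field:
//   host  - local (LAN) address
//   srflx - server-reflexive: public address discovered via STUN
//   prflx - peer-reflexive: learned during connectivity checks
//   relay - address allocated on a TURN relay server
function candidateType(candidateLine) {
  const m = candidateLine.match(/ typ (host|srflx|prflx|relay)/);
  return m ? m[1] : "unknown";
}

// hypothetical sample candidates, for illustration only
const samples = [
  "candidate:1 1 udp 2122260223 192.168.1.7 54321 typ host",
  "candidate:2 1 udp 1686052607 203.0.113.9 54321 typ srflx raddr 192.168.1.7 rport 54321",
  "candidate:3 1 udp 41885439 198.51.100.2 3478 typ relay raddr 203.0.113.9 rport 54321"
];
samples.forEach(c => console.log(candidateType(c)));
// host, srflx, relay
```

The `srflx` candidate is exactly the public address/port the STUN server reflects back; `relay` only appears when a TURN server has been configured.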
7.3 Signaling Server: Network Negotiation and Information Exchange
The role of the signaling server is to relay information over a duplex channel.
The relayed information includes the network location obtained from public IP mapping (public IP and port) as well as the media descriptions needed to set up the streams.
Concept map:
Signaling server information exchange process diagram:
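Concretely, what the signaling server relays is usually small JSON envelopes; it never touches the media itself. A minimal sketch of such an envelope (the field names are illustrative, not part of any standard; section 8 uses a similar shape):

```javascript
// A signaling server only relays opaque envelopes between peers.
// "event", "fromID", and "data" are made-up field names for this sketch.
function makeEnvelope(event, fromID, data) {
  return JSON.stringify({ event, fromID, data });
}

function readEnvelope(raw) {
  const msg = JSON.parse(raw);
  return { event: msg.event, fromID: msg.fromID, data: msg.data };
}

// e.g. peer "A" relaying a session description
const wire = makeEnvelope("relaySessionDescription", "A",
                          { type: "offer", sdp: "v=0..." });
console.log(readEnvelope(wire).event); // "relaySessionDescription"
```

Any duplex transport works for carrying these envelopes; WebSocket is the common choice, as in the signaling code of section 8.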
7.4 Session Description Protocol SDP: Unified Media Negotiation
The role of SDP:
1) Different terminals/browsers support different encoding formats for media stream data, such as VP8 and VP9; the capabilities of the session members are not equal, and user environments and configurations are inconsistent;
2) WebRTC communication therefore needs to determine and exchange local and remote audio/video media information, such as resolution and codec capabilities. This media-configuration signaling is performed by exchanging an Offer and an Answer using the Session Description Protocol (SDP);
3) the SDP exchange must precede the exchange of audio and video streams. Its content includes basic session information, media descriptions, and so on.
// The structure of an SDP
Session description
    v=  (protocol version)
    o=  (originator and session identifier)
    s=  (session name)
    c=* (connection information -- not required if included in all media)
    One or more Time descriptions ("t=" and "r=" lines; see below)
    a=* (zero or more session attribute lines)
    Zero or more Media descriptions

Time description
    t=  (time the session is active)

Media description, if present
    m=  (media name and transport address)
    c=* (connection information -- optional if included at session level)
    a=* (zero or more media attribute lines)
An example of SDP is as follows:
v=0 // protocol version; currently always v=0
o=- 3883943731 1 IN IP4 127.0.0.1 // originator and session identifier
s= // session name
t=0 0 // time the session is active
a=group:BUNDLE audio video // transport-layer multiplexing information
m=audio 1 RTP/SAVPF 103 104 0 8 106 105 13 126 // audio media description and payload types
a=ssrc:2223794119 label:H4fjnMzxy3dPIgQ7HxuCTLb4wLLLeRHnFxh81
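SDP is line-oriented text: each line is a single-letter field, "=", then a value. A minimal parser sketch makes the structure above concrete (illustrative only; a real parser must follow RFC 4566/8866 far more carefully, including session-level vs. media-level scoping):

```javascript
// Split an SDP blob into { field, value } pairs.
// Illustrative sketch -- does not handle scoping or validation.
function parseSdp(sdp) {
  return sdp
    .split(/\r?\n/)
    .filter(line => line.length > 1 && line[1] === "=")
    .map(line => ({ field: line[0], value: line.slice(2) }));
}

const sample = [
  "v=0",
  "o=- 3883943731 1 IN IP4 127.0.0.1",
  "t=0 0",
  "a=group:BUNDLE audio video",
  "m=audio 1 RTP/SAVPF 103 104 0 8 106 105 13 126"
].join("\r\n");

for (const { field, value } of parseSdp(sample)) {
  console.log(field, "=>", value);
}
// the m= line, for example, carries the offered audio payload types
```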
7.5 One-to-one connection establishment process
Take the process of establishing a one-to-one WebRTC connection as an example.
One-to-one process diagram:
Brief process diagram:
As shown in the image above:
1) the two peers exchange SDP to obtain each other's media configuration;
2) a STUN server helps them discover and exchange network information such as public addresses and ports;
3) a TURN server relays the audio and video media stream when no direct path can be established.
work flow chart:
As shown in the image above:
1) A and B first call getUserMedia to open the local camera and obtain the local media stream to be sent;
2) each sends a request to the signaling server to join the room;
3) Peer A creates an offer SDP, saves it with PeerConnection's setLocalDescription method, and sends it to Peer B through the signaling server; Peer B receives it, creates an answer SDP, saves it with setLocalDescription, and returns it to Peer A through the signaling server;
4) during this offer/answer exchange, Peer A and Peer B create the corresponding audio and video channels according to the SDP and begin collecting candidate data (local IP addresses, public IP addresses, and addresses allocated by the relay server);
5) when Peer A has collected candidate information, it sends it to Peer B through the signaling server; Peer B does the same in the opposite direction.
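During this offer/answer dance the browser tracks a signalingState on each RTCPeerConnection. The following is a deliberately simplified model of those transitions (the real state machine also covers provisional answers, rollback, and the closed state):

```javascript
// Simplified model of RTCPeerConnection.signalingState transitions
// during the offer/answer exchange described in the steps above.
const transitions = {
  "stable":            { setLocalOffer: "have-local-offer",
                         setRemoteOffer: "have-remote-offer" },
  "have-local-offer":  { setRemoteAnswer: "stable" },
  "have-remote-offer": { setLocalAnswer: "stable" }
};

function applySignaling(state, action) {
  const next = (transitions[state] || {})[action];
  if (!next) throw new Error(`invalid ${action} in state ${state}`);
  return next;
}

// Peer A: apply its own offer, then the remote answer
let a = "stable";
a = applySignaling(a, "setLocalOffer");   // "have-local-offer"
a = applySignaling(a, "setRemoteAnswer"); // negotiation complete
console.log(a); // "stable"
```

Calling the description methods out of order (say, applying an answer while still "stable") is rejected, which is exactly the error the real API raises when signaling messages arrive in the wrong sequence.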
7.6 Many-to-many establishment
The concept diagram below illustrates many-to-many peer-to-peer connections, taking three users as an example:
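In such a full-mesh topology every pair of users needs its own RTCPeerConnection, so the number of connections grows quadratically, which is one reason mesh-based WebRTC calls do not scale to large groups. A quick illustration:

```javascript
// In a full mesh, each of n users connects to the other n - 1 users,
// giving n * (n - 1) / 2 distinct peer connections
// (and each client uploads its stream n - 1 times).
function meshConnections(n) {
  return n * (n - 1) / 2;
}

[2, 3, 4, 10].forEach(n =>
  console.log(`${n} users -> ${meshConnections(n)} connections`));
// 2 users -> 1, 3 users -> 3, 4 users -> 6, 10 users -> 45
```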
7.7 The main JavaScript interface of WebRTC
getUserMedia(): access data streams, such as from the user's camera and microphone
// request media types
const constraints = {
  video: true,
  audio: true
};
const video = document.querySelector('video');
// mount the stream on the corresponding DOM element to display the local media stream
function handleSuccess(stream) {
  video.srcObject = stream;
}
function handleError(error) {
  console.error('getUserMedia error: ', error);
}
// use the camera to capture the multimedia stream
navigator.mediaDevices.getUserMedia(constraints)
  .then(handleSuccess)
  .catch(handleError);
RTCPeerConnection: Enables audio or video calls with encryption and bandwidth management tools
// RTC server configuration
const servers = {
  "iceServers": [{ "urls": "stun:stun.stunprotocol.org" }]
};
// create a local connection
const localPeerConnection = new RTCPeerConnection(servers);
// collect candidate data
localPeerConnection.onicecandidate = function(event) {
  // ...
};
// listen for incoming media tracks
localPeerConnection.ontrack = function(event) {
  // ...
};
RTCDataChannel: supports point-to-point communication of general data, often used for data point-to-point transmission
const pc = new RTCPeerConnection();
const dc = pc.createDataChannel("my channel");
// receive data
dc.onmessage = function(event) {
  console.log("received: " + event.data);
};
// channel opened
dc.onopen = function() {
  console.log("datachannel open");
};
// channel closed
dc.onclose = function() {
  console.log("datachannel close");
};
8. Application cases
Here is a walkthrough of a multi-person video case built with WebRTC, as a practice exercise.
8.1 Design Framework
Basic frame diagram of multi-person video:
8.2 Key code
8.2.1) Media Capture:
Get the browser's camera permission, capture the local video media stream, attach the stream to a Video element, and display the local video. The code is as follows.
// camera compatibility handling (legacy API shim)
navigator.getUserMedia = (navigator.getUserMedia ||
                          navigator.webkitGetUserMedia ||
                          navigator.mozGetUserMedia ||
                          navigator.msGetUserMedia);
// get the local audio and video stream
navigator.mediaDevices.getUserMedia({
  "audio": false,
  "video": true
}).then((stream) => {
  // display the local output stream by mounting it on the page's Video element
  document.getElementById("myVido").srcObject = stream;
});
Capture a screenshot of the display result of the local video media stream:
Create an RTCPeerConnection object for each new client connection:
// STUN and TURN server configuration
const iceServer = {
  "iceServers": [{
    urls: "stun:stun.l.google.com:19302"
  }]
};
// create an RTCPeerConnection for the peer-to-peer connection
const peerRTCConn = new RTCPeerConnection(iceServer);
8.2.2) Network Negotiation:
The main tasks are: creating the peer-to-peer connection, collecting ICE candidates, and mounting the media stream to the DOM when it arrives.
Interactive Connectivity Establishment (ICE) is a framework that allows peers to discover and connect to each other. It lets peers learn enough about each other's topology to find one or more communication paths between them. The ICE agent is responsible for collecting local IP and port tuple candidates, performing connectivity checks between peers, and sending connection keep-alives. (For an introduction to ICE, see "Detailed Explanation of STUN, TURN, and ICE of P2P Technology".)
// send ICE candidates to the other clients
peerRTCConn.onicecandidate = function(event) {
  if (event.candidate) {
    // forward the collected ICE candidate to the signaling server
    socket.send(JSON.stringify({
      "event": "relayICECandidate",
      "data": {
        'iceCandidate': {
          'sdpMLineIndex': event.candidate.sdpMLineIndex,
          'candidate': event.candidate.candidate
        }
      },
      "fromID": signalMsg['data']['peerId']
    }));
  }
}
// mount the remote media stream to the DOM when a track arrives
peerRTCConn.ontrack = function(event) {
  let v = document.createElement("video")
  v.autoplay = true
  v.style = "width:200px"
  document.getElementById("peer").appendChild(v)
  v.srcObject = event.streams[0]
}
8.2.3) Media negotiation:
An offer is created by the initiator. The peer saves the session description on its RTCPeerConnection with the setLocalDescription method and relays it through the signaling server; the other peer returns a corresponding answer. The SDP exchange process:
// a newly joined node initiates an offer
if (canOffer) {
  peerRTCConn.createOffer(
    function(localDescription) {
      peerRTCConn.setLocalDescription(localDescription,
        function() {
          // send the description to the signaling server
          socket.send(JSON.stringify({
            "event": "relaySessionDescription",
            "data": localDescription,
            "fromID": peerId
          }))
        },
        function() { alert("offer failed"); }
      );
    },
    function(error) {
      console.log("error sending offer: ", error);
    }
  )
}
Create an answer when responding. The session description includes the audio and video information. When the initiator sends an offer-type description to the responder, the responder returns an answer-type description:
// create an answer session
peer.createAnswer(
  function(_remoteDescription) {
    peer.setLocalDescription(_remoteDescription,
      function() {
        // send the description to the signaling server
        socket.send(JSON.stringify({
          "event": "relaySessionDescription",
          "data": _remoteDescription,
          "callerID": signalMsg['fromId'],
          "fromID": signalMsg['fromId']
        }))
      },
      function() { alert("answer failed"); }
    );
  },
  function(error) {
    console.log("error creating answer: ", error);
  });
When an ICE candidate share is received, the ICE candidate is added to the remote peer description:
// the corresponding RTCPeerConnection
const peer = peers[signalMsg["fromID"]];
// add the ICE candidate to the remote peer description
peer.addIceCandidate(new RTCIceCandidate(signalMsg["data"].iceCandidate));
Screenshot of the multi-person video result (local simulation):
8.2.4) Signaling relay:
The key code of the signaling service part:
wss.on('connection', function(ws) {
  ws.on('message', function(message) {
    let messageObj = JSON.parse(message)
    // exchange ICE candidates
    if (messageObj['event'] == 'relayICECandidate') {
      wss.clients.forEach(function(client) {
        console.log("send iceCandidate")
        client.send(JSON.stringify({
          "event": "iceCandidate",
          "data": messageObj['data'],
          "fromID": messageObj['fromID']
        }));
      });
    }
    // exchange SDP
    if (messageObj['event'] == 'relaySessionDescription') {
      console.log(messageObj["fromID"], messageObj["data"].type)
      wss.clients.forEach(function(client) {
        if (client != ws) {
          client.send(JSON.stringify({
            "event": "sessionDescription",
            "fromId": messageObj["fromID"],
            "data": messageObj["data"],
          }));
        }
      });
    }
  })
})
9. Summary
The advantages of WebRTC are mainly:
1) Convenience: for users, real-time communication before WebRTC required installing plug-ins and clients, and for many users downloading plug-ins and installing and updating software is complicated and error-prone; now WebRTC is built into the browser, so users can communicate in real time without any plug-ins or extra software. For developers, before Google open-sourced WebRTC, browser-to-browser communication technology was held by large companies and developing it was difficult; now developers can implement web audio/video communication with simple HTML tags and JavaScript APIs.
2) Free: although WebRTC is relatively mature and integrates first-class audio/video engines and advanced codecs, Google does not charge for these technologies.
3) Strong hole-punching capability: WebRTC includes key NAT and firewall traversal technologies such as STUN, ICE, TURN, and RTP-over-TCP, and supports proxies.
The disadvantages of WebRTC are mainly:
1) Lack of design and deployment of server solutions.
2) Transmission quality is hard to guarantee. WebRTC's transmission design is P2P-based, which makes quality difficult to guarantee and leaves limited room for optimization. For example, quality in cross-region, cross-carrier, low-bandwidth, or high-packet-loss scenarios is basically left to luck, and those are exactly the typical scenarios of domestic Internet applications.
3) WebRTC is better suited to one-to-one chat. Although it can be extended to group chat, it has not been optimized for group chat, especially very large groups.
4) Device adaptation problems, such as echo and recording failures, emerge one after another. This is especially true on Android: because there are so many Android device manufacturers, each customizing the standard Android framework, there are many usability problems (e.g., failure to access the microphone) and quality problems (e.g., echo and howling).
5) Insufficient support for native development. As the name suggests, WebRTC is mainly aimed at web applications. Although it can be used for native development, it involves a lot of domain knowledge (audio and video capture, processing, encoding/decoding, real-time transport, etc.), and even just compiling the project is not easy.
10. References
[1] Status of open source real-time audio and video technology WebRTC
[2] Briefly describe the advantages and disadvantages of the open source real-time audio and video technology WebRTC
[3] Interview with the father of the WebRTC standard: the past, present and future of WebRTC
[4] Sharing of Conscience: WebRTC Zero Basic Developer Tutorial (Chinese) [Attachment Download]
[5] Introduction to the overall architecture of WebRTC real-time audio and video technology
[6] Getting Started: What exactly is a WebRTC server and how does it connect to calls?
[7] WebRTC real-time audio and video technology foundation: basic architecture and protocol stack
[8] Talking about the technical points of developing a real-time video live broadcast platform
[9] Is it reliable to develop real-time audio and video based on open source WebRTC? What are the 3rd party SDKs?
[10] A concise compilation tutorial of open source real-time audio and video technology WebRTC under Windows
[11] WebRTC, real-time audio and video technology on the web page: It looks beautiful, but how many pits are there to fill in the production application?
[12] The Great WebRTC: The Ecology Is Getting Better, Or Turning Real-Time Audio and Video Technology into Cabbage
[13] Zero-based entry: Based on open source WebRTC, real-time audio and video chat function from 0 to 1
[14] Detailed explanation of P2P technology (1): Detailed explanation of NAT - detailed principle, P2P introduction
[15] Detailed explanation of P2P technology (2): Detailed explanation of NAT traversal (hole punching) scheme in P2P (basic principle)