Preface
In recent years, real-time audio and video communication applications have grown explosively. Behind these real-time communication technologies stands one technology that must be mentioned: WebRTC.
In January of this year, WebRTC was published as an official standard by the W3C and IETF. According to a report by the research firm Grand View Research, the global WebRTC market is expected to reach 21.023 billion US dollars by 2025; compared with the market size of 2.3 billion US dollars in 2019, that is a five-year compound annual growth rate of 43.6%.
This series will explore several questions with you: why is WebRTC favored by developers and enterprises? How will WebRTC develop in the future? How does Agora carry out secondary development on top of WebRTC, and how will it support the WebRTC NV version?
WebRTC can be regarded as the browser's native means of real-time communication: it runs without installing any plug-ins or downloading any extra programs. Different clients can communicate in real time and see each other simply by navigating to the same URL in a browser (the same browser or different ones). But this is only a bird's-eye-view description; the technical framework and implementation details behind it are far from simple.
Basic concepts
Before we start discussing how WebRTC works, let's clarify a few key technical concepts.
P2P
The ability to achieve real-time point-to-point audio and video (i.e., multimedia) communication is WebRTC's most notable feature. To communicate through a web browser, each person's browser needs to agree to start the connection, learn the other party's network location, traverse network security devices and firewalls, and transmit all multimedia communication in real time.
In browser-based peer-to-peer communication, how to locate and establish a network connection with another computer's Web browser and perform efficient data transmission is one of its biggest challenges.
When you want to visit a website, you usually enter the URL directly or click a link to open the page. In this process, you are actually making a request to a server, which responds by serving web pages (HTML, CSS, and JavaScript). The key is that you issue an HTTP request to a known server that is easy to locate (via DNS), and receive a response (i.e., the web page).
At first glance this may not seem difficult, but consider an example: suppose I want to have a video call with a colleague. How can we make a request and actually receive the other party's audio and video data directly?
The problem in the above scenario is solved by P2P (peer-to-peer) technology, and WebRTC itself is built on peer-to-peer connections; RTCPeerConnection is the API responsible for establishing the P2P connection and transmitting multimedia data.
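As a minimal sketch of what this looks like in a browser, the snippet below creates a peer connection and listens for discovered network candidates. The STUN server URL is Google's commonly used public test server, not something WebRTC mandates; in production you would use your own.

```javascript
// Minimal sketch (browser environment assumed): create a peer
// connection and observe candidate discovery. The STUN URL is an
// illustrative public server, not part of the WebRTC standard.
function createPeer() {
  const pc = new RTCPeerConnection({
    iceServers: [{ urls: "stun:stun.l.google.com:19302" }],
  });
  // Fired once per discovered network candidate; each one would be
  // forwarded to the remote side over your signaling channel.
  pc.onicecandidate = (event) => {
    if (event.candidate) {
      console.log("local candidate:", event.candidate.candidate);
    }
  };
  return pc;
}
```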
Firewalls and NAT traversal
In daily life, most of us access the Internet through a work or home network. Our devices usually sit behind a firewall and a network address translation (NAT) device, so they are not assigned a static public IP address. Looking closer, the NAT device translates private IP addresses inside the firewall into a public-facing IP address, both for security and because of the limited supply of public IPv4 addresses.
Back to the earlier example: with NAT devices involved, how can I learn my colleague's IP address so that I can send audio and video data to it? And likewise, how can he learn my IP address so that he can send audio and video data back? This is the problem that STUN (Session Traversal Utilities for NAT) and TURN (Traversal Using Relays around NAT) servers solve.
For WebRTC to work properly, a device first requests its public-facing IP address from a STUN server. If the request is answered and we receive a public-facing IP address and port, we can tell others how to connect to us directly. The other party can do the same using STUN or TURN servers.
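When several candidate addresses are available (local, STUN-derived, TURN-relayed), ICE ranks them with a priority formula defined in RFC 8445 §5.1.2. A small sketch, using the type-preference values the RFC recommends:

```javascript
// ICE candidate priority (RFC 8445 §5.1.2):
//   priority = 2^24 * typePreference + 2^8 * localPreference
//              + (256 - componentId)
// Type preferences below are the RFC-recommended values:
// host > peer-reflexive > server-reflexive (STUN) > relayed (TURN).
const TYPE_PREFERENCE = { host: 126, prflx: 110, srflx: 100, relay: 0 };

function candidatePriority(type, localPreference, componentId) {
  return (
    (1 << 24) * TYPE_PREFERENCE[type] +
    (1 << 8) * localPreference +
    (256 - componentId)
  );
}

// A host candidate outranks a TURN relay candidate, so a direct
// connection is preferred whenever one is reachable.
console.log(candidatePriority("host", 65535, 1));  // 2130706431
console.log(candidatePriority("relay", 65535, 1)); // 16777215
```

This is why a relayed TURN path is only chosen as a fallback: any working direct or STUN-derived path carries a strictly higher priority.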
Signaling & sessions
Because of NAT, WebRTC cannot directly establish a connection with the peer. Devices therefore need to discover each other and negotiate through a signaling service before exchanging real-time audio and video. The network-information discovery described above is one part of the broader topic of signaling, which in WebRTC is based on the JavaScript Session Establishment Protocol (JSEP) standard. Signaling covers network discovery and NAT traversal, session creation and management, communication security and coordination, and error handling.
WebRTC does not specify which implementation must be used for signaling. This is to allow developers to use more flexible technologies and protocols.
At present, the WebSocket + JSON/SDP combination is the most widely used in the industry: WebSocket provides the signaling transport channel, while JSON/SDP encapsulates the actual signaling content:
WebSocket is built on top of TCP and provides long-lived connections, avoiding HTTP's limitations of half-duplex communication and redundant header information. WebSocket lets the server and client push messages at any time, independent of previous requests. A significant advantage is that almost every browser supports WebSocket.
JSON is a common serialization format on the Web, used here to encapsulate user-defined signaling content. (It is essentially just a serialization tool, so alternatives such as Protobuf or Thrift are entirely feasible.)
SDP (Session Description Protocol) is a session description protocol used to encapsulate the signaling content for negotiating streaming-media capabilities. Two WebRTC agents share all the state required to establish a connection through this protocol.
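To make the combination concrete, here is a sketch of a JSON signaling envelope carrying an SDP payload. The message shape (`kind`, `room`, `payload`) is our own convention rather than any standard, and the SDP fragment is a truncated illustration, not a complete offer:

```javascript
// Sketch of a JSON signaling envelope carrying an SDP payload.
// The field names are our own convention; the SDP is illustrative.
const offerMessage = {
  kind: "offer",
  room: "demo-room",
  payload: [
    "v=0",                                        // protocol version
    "o=- 4611731400430051336 2 IN IP4 127.0.0.1", // session origin
    "s=-",                                        // session name
    "m=audio 9 UDP/TLS/RTP/SAVPF 111",            // an audio media section
    "a=rtpmap:111 opus/48000/2",                  // codec: Opus, 48 kHz, stereo
  ].join("\r\n"),
};

// SDP is line-oriented: every line has the form "<type>=<value>".
function sdpLines(sdp) {
  return sdp.split("\r\n").map((line) => ({
    type: line[0],
    value: line.slice(2),
  }));
}

const lines = sdpLines(offerMessage.payload);
console.log(lines.filter((l) => l.type === "m").length); // 1 media section
```

Over the wire, the whole envelope would be `JSON.stringify`-ed and pushed through the WebSocket; the receiver parses the JSON, then reads the SDP inside it.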
If the conceptual content is not easy to understand, then we can imagine it as a daily communication process:
When we are about to talk with a stranger, or a stranger wants to join our chat, a message has to be exchanged first; whether we accept or decline, we must respond to it. Only after this exchange do we have enough information to judge whether we can chat happily together. What helps you quickly summarize this information is the SDP (Session Description Protocol), which describes things such as which agent is in use, what hardware it supports, and what kinds of media it wants to exchange.
Then when two people want to start chatting, one person always needs to speak first👇👇👇
Me: I speak Chinese, I'm 17, in high school, I like playing basketball, and I want to learn English, so I'd like to chat with you to see whether you can help me improve my English (i.e., the Offer SDP).
Peer: I speak Chinese, I'm 23, I work, I like playing basketball too, and my English is average; I may not be able to help you, but we can play basketball together (i.e., the Answer SDP).
The purpose of exchanging this information and getting to know each other is to confirm whether we can communicate further, or whether we cannot communicate at all. It does not matter who sends the first message; what matters is that whoever receives a message, even out of mere politeness, must give the other party a response, so that the dialogue can be effective.
Related protocols
A protocol is a standard or convention; a protocol stack is the implementation of that protocol, which can be understood as code and libraries for upper-level applications to call. The protocol stacks in WebRTC implement the underlying code in conformance with the protocol standards, providing developers with ready-made functional modules. Developers only need to care about application logic: where data comes from and goes to, how it is stored, and the communication sequence between devices in the system.
WebRTC draws on multiple standards and protocols, including data streaming, STUN/TURN servers, signaling, JSEP, ICE, SIP, SDP, and more.
The WebRTC protocol stack
Signaling
- Application layer: WebSocket/HTTP
- Transport layer: TCP
Media streams
- Application layer: RTP/RTCP/SRTP
- Transport layer: SCTP/QUIC/UDP
Security
- DTLS: used to negotiate keys for the media streams
- TLS: used to negotiate keys for signaling
ICE (Interactive Connectivity Establishment)
- STUN
- TURN
Among these, ICE (Interactive Connectivity Establishment), STUN, and TURN are necessary for establishing and maintaining end-to-end connections. DTLS is used to secure data transfer between peers. SCTP and SRTP sit on top of UDP and provide multiplexing, congestion and flow control, partially reliable delivery, and other additional services.
Basic structure
With the above introduction, you should now have a grasp of WebRTC's key concepts. Next, let's look at WebRTC's basic component architecture, which is crucial for understanding how WebRTC works in the sections that follow.
Basic component architecture
WebRTC's component architecture is divided into two layers: the application layer and the core layer. The green part of the figure above shows the core functionality WebRTC provides, and the dark purple part is the JS API exposed by the browser (that is, the browser wraps the C++ API of the WebRTC core layer into a JS interface).
The light purple arrow at the top of the figure represents the upper-level application, which can directly access the browser-provided API and ultimately calls down into the core layer.
The core functional layer consists of four main parts:
- C++ API layer
The number of APIs is small; the main one is PeerConnection. The PeerConnection API covers transmission quality, quality reports, various statistics, various streams, and so on. (A design technique: keep the API exposed to the upper layer simple to ease application development, and keep the complexity internal.)
- Session layer (context management layer)
If the application creates audio, video, or non-audio/video data transmissions, the management-related logic for them is handled in the Session layer.
- Engine/transport layer (the most important, core part)
This part is divided into three modules: the Voice Engine (audio engine), the Video Engine, and Transport (the transmission module), which decouple audio, video, and transmission from one another.
The Voice Engine (audio engine) contains a series of audio functions such as audio capture, audio codecs, and audio enhancement (including noise suppression, echo cancellation, etc.):
- iSAC/iLBC codecs;
- NetEQ: an adaptive jitter buffer that compensates for network jitter;
- Echo cancellation: a focal point that determines the quality of an audio/video product. WebRTC provides very mature algorithms here; during development you generally only need to tune parameters. The engine also includes noise suppression and automatic gain control.
The Video Engine contains functions such as video capture, video codecs, dynamic adjustment of video transmission quality according to network jitter, and image processing:
- VP8 and OpenH264 codecs;
- Video jitter buffer: prevents video jitter;
- Image enhancement.
In WebRTC, all audio and video data is sent and received through the Transport (transmission) module. The transport layer performs packet-loss detection and link-quality detection, estimates the available network bandwidth, and transmits audio, video, and non-audio/video data according to that bandwidth:
- UDP at the bottom, with SRTP (secure, encrypted RTP) above it;
- Multiplexing: Multiple streams multiplex the same channel;
- P2P layer (including STUN+TURN+ICE).
- Hardware layer
- Video capture and rendering;
- Audio capture;
- Network IO, etc.
There is no video rendering in the WebRTC core layer; all rendering is done by the browser layer.
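The jitter buffers mentioned in the engine layer above can be illustrated with a toy sketch: packets may arrive out of order, and the buffer holds them until they can be released in sequence-number order. This is only the reordering idea; a real component like NetEQ also adapts its depth and conceals losses.

```javascript
// Toy reorder buffer illustrating the core idea of a jitter buffer.
// Not NetEQ: real jitter buffers also adapt depth and conceal losses.
class JitterBuffer {
  constructor() {
    this.packets = new Map(); // sequence number -> payload
    this.nextSeq = 0;         // next sequence number to play out
  }
  push(seq, payload) {
    this.packets.set(seq, payload);
  }
  // Release every packet that is now playable, in order.
  popReady() {
    const out = [];
    while (this.packets.has(this.nextSeq)) {
      out.push(this.packets.get(this.nextSeq));
      this.packets.delete(this.nextSeq);
      this.nextSeq += 1;
    }
    return out;
  }
}

const jb = new JitterBuffer();
jb.push(1, "B"); // arrives early, held back
jb.push(0, "A");
console.log(jb.popReady()); // [ 'A', 'B' ]
jb.push(3, "D"); // packet 2 is still missing, so nothing is playable
console.log(jb.popReady()); // []
```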
Working principle
In fact, WebRTC involves many complex technical topics, such as audio capture, video capture, and codecs. Since this chapter aims to present a simple, easy-to-understand WebRTC workflow, we will not go deeper into the implementation details here. If you are interested, please browse the #WebRTC# column yourself.
In the first part of this series, "Why WebRTC | Past and Present", we mentioned that WebRTC is a set of W3C JavaScript APIs that let web browsers hold real-time audio and video conversations. These JavaScript APIs are chiefly used to generate and transmit multimedia data for real-time communication.
WebRTC's main APIs include navigator.mediaDevices.getUserMedia (access the microphone and camera), RTCPeerConnection (create and negotiate a peer-to-peer connection), and RTCDataChannel (a bidirectional data channel between peers).
The workflow of WebRTC is perhaps most intuitive through the scenario of "how to implement a 1:1 call":
- Both parties first call getUserMedia to open the local camera;
- Each sends a join-room request to the signaling server;
- Peer A creates an Offer SDP object via PeerConnection, saves it with the SetLocalDescription method, and sends it to Peer B through the signaling server;
- Peer B receives Peer A's Offer SDP object, saves it with SetRemoteDescription, creates an Answer SDP object, saves that with SetLocalDescription, and sends it to Peer A through the signaling server;
- During this SDP offer/answer exchange, Peer A and Peer B create the corresponding audio and video channels based on the SDP information, and start gathering Candidate data (local IP addresses, public IP addresses, and addresses allocated by the relay server);
- When Peer A finishes gathering Candidate information, it sends it to Peer B through the signaling server; by the same process, Peer B sends its Candidates to Peer A.
In this way, Peer A and Peer B have exchanged media and network information. If they can reach an agreement (find the intersection of their capabilities), they can start communicating.
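The steps above can be condensed into a sketch written from Peer A's side. The `signaling` object stands for your own channel (for example, a WebSocket wrapper); its `send`/`onmessage` shape is an assumption of this sketch, not part of WebRTC itself.

```javascript
// Condensed sketch of the 1:1 call flow, Peer A's side.
// `signaling` is a placeholder for your own channel; its
// send/onmessage shape is assumed, not defined by WebRTC.
async function startCall(signaling) {
  // 1. Open the local camera and microphone.
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: true,
    video: true,
  });
  const pc = new RTCPeerConnection({
    iceServers: [{ urls: "stun:stun.l.google.com:19302" }],
  });
  stream.getTracks().forEach((track) => pc.addTrack(track, stream));

  // 2. Forward each gathered Candidate to the peer via signaling.
  pc.onicecandidate = ({ candidate }) => {
    if (candidate) signaling.send({ kind: "candidate", candidate });
  };

  // 3. Offer/answer: create the offer, save it locally, send it out.
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  signaling.send({ kind: "offer", sdp: offer.sdp });

  // 4. Apply the peer's answer and Candidates as they arrive.
  signaling.onmessage = async (msg) => {
    if (msg.kind === "answer") {
      await pc.setRemoteDescription({ type: "answer", sdp: msg.sdp });
    } else if (msg.kind === "candidate") {
      await pc.addIceCandidate(msg.candidate);
    }
  };
  return pc;
}
```

Peer B's side mirrors this: on receiving the offer it calls setRemoteDescription, then createAnswer and setLocalDescription, and sends the answer back.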
To help everyone better understand WebRTC technology, the latest issue of "Agora Talk" invited engineers from Agora's WebRTC team.
They will share and explore more useful and interesting technical details around two themes: "An RTC Hybrid Development Framework Based on Web Engine Extension Technology" and "Next-Generation WebRTC: Prospects for Real-Time Communication".
In the next chapter, we will cover the current pain points of WebRTC development, commonly used development tools, and the optimizations we have made in the Agora Web SDK.
Stay tuned~