
Preface

In recent years, real-time audio and video communication applications have grown explosively. Behind these real-time communication technologies, one technology must be mentioned: WebRTC.

In January of this year, WebRTC was published as an official standard by the W3C and IETF. According to a report by the research firm Grand View Research, the global WebRTC market is expected to reach 21.023 billion U.S. dollars by 2025, up from 2.3 billion U.S. dollars in 2019, a compound annual growth rate of 43.6% over five years.

This series will discuss: why is WebRTC favored by developers and enterprises? How will WebRTC develop in the future? And how does Agora carry out secondary development based on WebRTC, and how will it support the WebRTC NV version?

WebRTC can be regarded as the browser's native way of doing real-time communication: it runs without installing any plug-ins or downloading any extra programs. Different clients, on the same or different browsers, can communicate in real time and see each other simply by navigating to the same URL. But that is only a bird's-eye view; the technical framework and implementation details behind it are far from simple.

Basic concepts

Before we start discussing how WebRTC works, let's clarify a few key technical concepts.

P2P

The ability to realize real-time point-to-point audio and video (i.e., multimedia) communication is WebRTC's most significant feature. For two web browsers to communicate, each needs to agree to start the connection, learn the other party's network location, bypass network security and firewall protections, and transmit all multimedia communication in real time.

In browser-based peer-to-peer communication, locating the other computer's web browser, establishing a network connection with it, and transferring data efficiently are among the biggest challenges.

When you visit a website, you usually type the URL or click a link and the page appears. Under the hood you make a request, and a server responds by providing the web page (HTML, CSS, and JavaScript). The key to this access is that you make an HTTP request to a known, easy-to-locate server (found via DNS) and get back a response (the web page).

At first glance the problem does not seem that hard, but consider an example: suppose I want to have a video call with a colleague. How can we make a request and actually receive the other party's audio and video data directly?

The problem in this scenario is solved by P2P (peer-to-peer) technology. WebRTC itself is built on peer-to-peer connections, and RTCPeerConnection is the API responsible for establishing P2P connections and transmitting multimedia data.
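As a minimal sketch (a complete call also needs signaling and ICE, covered below), RTCPeerConnection is created like any other browser object, and remote media surfaces through its track event; the #remote video element is an assumption about the page:

```typescript
// Minimal sketch: RTCPeerConnection is the API that establishes the P2P
// connection and transports multimedia between browsers.
const pc = new RTCPeerConnection();

// Once the connection is up, remote audio/video arrives via "track" events.
pc.ontrack = (event: RTCTrackEvent) => {
  // Assumes the page has a <video id="remote" autoplay> element.
  const remoteVideo = document.querySelector<HTMLVideoElement>("#remote")!;
  remoteVideo.srcObject = event.streams[0];
};
```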

Firewalls and NAT traversal

In daily life, most of us access the Internet through a work or home network, where our devices usually sit behind a firewall and a network address translation (NAT) device, and therefore are not assigned static public IP addresses. Looking closer, the NAT device maps private IP addresses inside the firewall to a public-facing IP address, both for security and because the pool of available public IPv4 addresses is limited.

Let us return to the earlier example. With a NAT device in the way, how do I learn my colleague's IP address so I can send audio and video data to it? And how does my colleague learn mine, so that audio and video data can be sent back? This is the problem that STUN (Session Traversal Utilities for NAT) and TURN (Traversal Using Relays around NAT) servers solve.

For WebRTC to work properly, a client first requests a public-facing IP address from a STUN server. If the request is answered and we receive a public-facing IP address and port, we can tell others how to connect to us directly. The other party does the same through its own STUN or TURN server.
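In the browser this is pure configuration: when creating an RTCPeerConnection, you list the STUN/TURN servers it may use. A minimal sketch, where the server URLs and credentials are placeholders rather than real deployments:

```typescript
// Placeholders only: stun.example.com / turn.example.com are not real servers.
const config: RTCConfiguration = {
  iceServers: [
    // STUN: ask "what is my public-facing IP:port?"
    { urls: "stun:stun.example.com:3478" },
    // TURN: relay media when a direct connection cannot be established.
    {
      urls: "turn:turn.example.com:3478",
      username: "demo-user",
      credential: "demo-pass",
    },
  ],
};

const pc = new RTCPeerConnection(config);
```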

Signaling and sessions

Because of NAT, WebRTC cannot directly establish a connection with the peer. Devices must therefore discover each other and negotiate the real-time audio/video exchange through a signaling service. The network-information discovery described above is just one part of the broader topic of signaling, which in WebRTC is based on the JavaScript Session Establishment Protocol (JSEP) standard. Signaling covers network discovery and NAT traversal, session creation and management, communication security and coordination, and error handling.

WebRTC deliberately does not specify which signaling implementation must be used, leaving developers free to choose among technologies and protocols.

At present, the most widely used solution in the industry is WebSocket + JSON/SDP, where WebSocket provides the signaling transport channel and JSON/SDP encapsulates the signaling content:

WebSocket is built on top of TCP and provides a long-lived connection, avoiding HTTP's limitations of half-duplex communication and redundant header information. With WebSocket, either the server or the client can push a message at any time, independent of any previous request. A significant advantage of WebSocket is that it is supported by almost every browser.

JSON is a common serialization format on the Web, used here to encapsulate user-defined signaling content. (It is essentially just a serialization tool, so alternatives such as Protobuf or Thrift are equally feasible.)

SDP (Session Description Protocol) is a session description protocol used to encapsulate the signaling content of streaming-media capability negotiation. Through it, two WebRTC agents share all the state required to establish a connection.
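WebRTC leaves this layer entirely to the application, so the message shape below is a self-defined convention, not a standard. A minimal sketch of WebSocket + JSON signaling against a hypothetical wss://example.com/signal endpoint:

```typescript
// Minimal WebSocket + JSON signaling sketch (hypothetical server and schema).
const ws = new WebSocket("wss://example.com/signal");

// Wrap any signaling payload (SDP, candidates, ...) in a JSON envelope.
function sendSignal(type: "offer" | "answer" | "candidate", payload: unknown): void {
  ws.send(JSON.stringify({ type, payload, room: "demo-room" }));
}

ws.onmessage = (event: MessageEvent) => {
  const msg = JSON.parse(event.data);
  // Dispatch on the self-defined message type; the handlers are sketched
  // in the offer/answer and Candidate examples later in this article.
  if (msg.type === "offer") handleRemoteOffer(msg.payload);
  else if (msg.type === "answer") handleRemoteAnswer(msg.payload);
  else if (msg.type === "candidate") handleRemoteCandidate(msg.payload);
};
```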

If these concepts are hard to grasp, we can picture the process as an everyday conversation:

When you are about to chat with a stranger, or a stranger wants to join your conversation, a first message has to be sent, and whether you accept or decline, you must exchange that message with the other party. Only after this exchange do you have enough information to judge whether the two of you can chat happily. What summarizes this information for you is the SDP (Session Description Protocol), which records things such as which agent is used, what hardware it supports, and what types of media it wants to exchange.

Then, when two people want to start chatting, someone always has to speak first 👇

Me: I speak Chinese, am 17, attend high school, and like playing basketball. I want to learn English now, so I'd like to chat with you and see whether you can help me improve it (i.e., the Offer SDP).

Peer: I speak Chinese, am 23, have a job, also like basketball, and my English is average. I may not be able to help you, but we can play ball together (i.e., the Answer SDP).


The purpose of exchanging this information is to confirm whether the next step of communication is possible at all, or whether it is not. It does not matter who sends the first message; what matters is that whoever receives it gives a response, even if only out of politeness, so that the dialogue can be effective.
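In code, this offer/answer exchange maps directly onto the standard RTCPeerConnection API. A minimal sketch, where sendSignal is the hypothetical helper from the signaling example above:

```typescript
// Caller side: create an Offer SDP and send it through signaling.
async function makeOffer(): Promise<void> {
  const offer = await pc.createOffer();    // "I speak Chinese, 17, like basketball..."
  await pc.setLocalDescription(offer);     // remember what we proposed
  sendSignal("offer", offer);              // hand it to the signaling channel
}

// Callee side: answer an incoming offer.
async function handleRemoteOffer(offer: RTCSessionDescriptionInit): Promise<void> {
  await pc.setRemoteDescription(offer);    // learn what the caller proposed
  const answer = await pc.createAnswer();  // "I speak Chinese, 23, English is average..."
  await pc.setLocalDescription(answer);
  sendSignal("answer", answer);
}

// Caller side: the returning answer completes the negotiation.
async function handleRemoteAnswer(answer: RTCSessionDescriptionInit): Promise<void> {
  await pc.setRemoteDescription(answer);
}
```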

Related protocols

A protocol is a standard or convention; a protocol stack is an implementation of protocols, which can be understood as code and function libraries for upper-layer applications to call. WebRTC's protocol stack implements the underlying code in conformance with the protocol standards and provides developers with functional modules to call. Developers only need to care about application logic: where data comes from and goes to, how it is stored, and the communication sequence between devices in the system.

WebRTC builds on multiple standards and protocols, including data streams, STUN/TURN servers, signaling, JSEP, ICE, SIP, SDP, and more.

[Figure: WebRTC protocol stack]

Signaling

  • Application layer: WebSocket/HTTP
  • Transport layer: TCP

Media stream

  • Application layer: RTP/RTCP/SRTP
  • Transport layer: SCTP/QUIC/UDP

Security

  • DTLS: used to negotiate keys for the media stream
  • TLS: used to negotiate keys for signaling

ICE (Interactive Connectivity Establishment)

  • STUN
  • TURN

Among them, ICE, STUN, and TURN are necessary for establishing and maintaining end-to-end connections. DTLS is used to secure data transmission to the peer. SCTP and SRTP sit partly on top of UDP to provide multiplexing, congestion and flow control, and partially reliable delivery, among other services.
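To see where ICE surfaces in the browser API, you can watch the gathering and connection states on an RTCPeerConnection. A small sketch, reusing the pc object from the earlier examples:

```typescript
// Observe ICE progress on an existing RTCPeerConnection (pc).
pc.addEventListener("icegatheringstatechange", () => {
  // new -> gathering -> complete
  console.log("ICE gathering state:", pc.iceGatheringState);
});

pc.addEventListener("iceconnectionstatechange", () => {
  // checking -> connected/completed; "failed" may mean a TURN relay is needed
  console.log("ICE connection state:", pc.iceConnectionState);
});
```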

Basic architecture

With the above introduction, you should now have a grasp of WebRTC's key concepts. Next, let's look at WebRTC's basic component architecture, which matters for understanding how WebRTC works later on.

[Figure: Basic component architecture]

WebRTC's component architecture is divided into two layers: the application layer and the core layer. The green part of the figure above shows the core functionality WebRTC provides, and the dark purple part is the JS API provided by the browser (that is, the browser wraps the WebRTC core layer's C++ API into a JavaScript interface).

The light purple arrow at the top of the figure represents the upper-layer application; in the browser, it can directly access the browser-provided API, which ultimately calls down into the core layer.

The core functional layer has four main parts:

  • C++ API layer

The number of APIs here is small; the main one is PeerConnection, whose API covers transmission quality, quality reports, various statistics, various streams, and so on. (A design technique: the API exposed to the upper layer is kept simple to ease application-layer development, while the complexity stays inside.)

  • Session layer (context management layer)

Whether the application creates audio, video, or non-audio/video data transmissions, the related management logic is handled in the Session layer.

  • Engine layer / transport layer (the core and most important part)

    This part is divided into three modules: the Voice Engine (audio engine), the Video Engine (video engine), and Transport (the transmission module), which decouple audio, video, and transmission from one another.

    Voice Engine (audio engine): contains a series of audio functions such as audio capture, audio codecs, and audio optimization (including noise suppression, echo cancellation, and so on).

    • iSAC/iLBC codecs;
    • NetEQ (buffer): network adaptation that counters network jitter;
    • Echo canceller: a key component whose quality largely determines the audio quality of the product; WebRTC provides very mature algorithms, and you only need to tune their parameters during development. Also noise suppression and automatic gain control.

    Video Engine (video engine): contains video capture, video codecs, dynamic adjustment of video transmission quality according to network jitter, image processing, and so on.

    • VP8 and OpenH264 codecs;
    • Video jitter buffer: counters video jitter;
    • Image enhancements: image enhancement.

    Transport (transmission module): all WebRTC audio and video is sent and received through it. The transport layer includes packet-loss detection and network link-quality detection, estimates the available network bandwidth accordingly, and transmits audio, video, files, and other non-audio/video data within that bandwidth.

    • UDP at the bottom layer, SRTP (secure, encrypted RTP) above it;
    • Multiplexing: multiple streams share the same channel;
    • P2P layer (including STUN + TURN + ICE).
  • Hardware layer

    • Video capture and rendering;
    • Audio capture;
    • Network IO, etc.

Note that there is no video rendering in the WebRTC core layer; all rendering is done by the browser layer.

Working principle

WebRTC actually involves many complex technical issues, such as audio capture, video capture, and codecs. Since this chapter aims to present a simple, easy-to-understand WebRTC workflow, we will not go deeper into implementation details here. If you are interested, see the #WebRTC# column.

We mentioned in the first part of this series, "Why WebRTC | Past and Present", that "WebRTC is a set of W3C JavaScript APIs that enable real-time audio and video conversations in web browsers." What these JavaScript APIs actually do is generate and transmit multimedia data for real-time communication.

The main WebRTC APIs are Navigator.getUserMedia (opens the microphone and camera), RTCPeerConnection (creates and negotiates a peer-to-peer connection), and RTCDataChannel (represents a bidirectional data channel between peers).
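A small sketch of the three APIs working together. Modern browsers expose getUserMedia as navigator.mediaDevices.getUserMedia; the channel name "chat" is an arbitrary choice here:

```typescript
// Sketch: the three core WebRTC APIs in one place.
async function start(): Promise<void> {
  // getUserMedia: open the microphone and camera.
  const media = await navigator.mediaDevices.getUserMedia({ audio: true, video: true });

  // RTCPeerConnection: negotiate the P2P link and carry the captured tracks.
  const pc = new RTCPeerConnection();
  media.getTracks().forEach((track) => pc.addTrack(track, media));

  // RTCDataChannel: a bidirectional channel for arbitrary application data.
  const channel = pc.createDataChannel("chat");
  channel.onopen = () => channel.send("hello");
  channel.onmessage = (e) => console.log("peer:", e.data);
}
```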

The WebRTC workflow is easiest to grasp through a "how to implement a 1:1 call" scenario:


  1. Both parties first call getUserMedia to open their local cameras;
  2. Each sends a request to the signaling server to join the room;
  3. Peer A creates an Offer SDP object, saves it through PeerConnection's SetLocalDescription method, and sends it to Peer B via the signaling server; Peer B receives it, creates an Answer SDP object, saves it through SetLocalDescription, and sends it back to Peer A via the signaling server;
  4. During this SDP offer/answer exchange, Peer A and Peer B create the corresponding audio and video channels based on the SDP information and start gathering Candidate data (the local IP address, the public IP address, and the address assigned by the relay server);
  5. When Peer A has gathered its Candidate information, it sends it to Peer B through the signaling server; Peer B does the same in the other direction.

In this way, Peer A and Peer B exchange both media information and network information. If they can reach agreement (find an intersection of capabilities), they can start communicating. The sketch below condenses the Candidate-exchange part of this flow.
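A minimal sketch of steps 4 and 5, assuming the sendSignal helper from the signaling example and the config object from the STUN/TURN example above:

```typescript
// Each locally gathered Candidate is relayed to the peer over signaling.
const pc = new RTCPeerConnection(config);

pc.onicecandidate = (event) => {
  // A null candidate means gathering is complete.
  if (event.candidate) sendSignal("candidate", event.candidate);
};

// Called when a "candidate" message arrives from the other peer.
async function handleRemoteCandidate(candidate: RTCIceCandidateInit): Promise<void> {
  await pc.addIceCandidate(candidate);
}
```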

To help everyone better understand WebRTC, the latest issue of "Agora Talk" invited engineers from the WebRTC team at Agora (Shengwang).

They will share and explore useful and interesting technical details around two themes: "RTC Hybrid Development Framework Based on Web Engine Extension Technology" and "Next-Generation WebRTC: Prospects for Real-Time Communication".


In the next chapter, we will cover the current difficulties in WebRTC development, commonly used development tools, and the optimizations we have made in the Agora Web SDK.

Stay tuned~

