Audio and Video Communication Extra: WebRTC from Start to Finish

杨成功
This article comes from the WeChat official account Programmer Success.

Recently I needed to build a live-streaming platform for online classrooms. Weighing clarity against latency, we agreed that WebRTC was the best fit.

There were two reasons. First, "point-to-point communication" is very attractive to us: no intermediate server is required and the clients connect directly, which makes communication convenient. Second, browsers support WebRTC natively and other clients support it well, so unlike traditional live streaming, which needs flv.js for compatibility, it gives us a unified standard.

Embarrassingly, though, I read several community articles that laid out a pile of theory and architecture, and none of their code would run. WebRTC brings many new concepts, and understanding its communication flow is the most important point, yet that is exactly what is rarely described.

So I tinkered with it myself, and after a few days I had it figured out. Below, drawing on my own hands-on experience and following the key steps as I understand them, I will introduce you to this remarkable friend, WebRTC, from the perspective of application scenarios.

Online Preview: Local Communication Demo

Outline preview

The content presented in this article includes the following aspects:

  • What is WebRTC?
  • Get media stream
  • Peer-to-peer connection process
  • Local simulation communication source code
  • Communication between both ends of the local area network
  • One-to-many communication
  • I want to learn more

What is WebRTC?

WebRTC (Web Real-Time Communications) is a real-time communication technology that lets web applications or sites establish peer-to-peer connections between browsers, without an intermediary, to transmit video streams, audio streams, or other arbitrary data.

Simply put, WebRTC lets browsers exchange audio and video over a direct (point-to-point) connection, without the help of a media server.

If you've worked with live-streaming technology, you know how surprising "no media server" is. Most earlier live-streaming technologies were built on push/pull streaming: to broadcast live audio and video, you needed a streaming media server as a relay that forwards the data. This push/pull approach has two problems:

  1. Higher latency
  2. Clarity is difficult to guarantee

Because traffic between the two ends must pass through the server first, it is like taking a half-circle detour on what should be a straight road. That always takes more time, so a live broadcast inevitably lags; even a low-latency setup has a delay of more than 1s.

Clarity is essentially a matter of data volume. Imagine taking the subway to work every day: the more people there are in the morning rush hour, the more easily the way into the station gets blocked, and once it is blocked you stop and go. Add a detour, and you arrive at the office even later.

Apply this to high-definition live streaming: the large data volume makes network congestion likely, congestion makes playback stutter, and latency goes up as well.

WebRTC is different. It does not need a media server; the two points are connected directly in a straight line. First, latency is greatly reduced. Second, because the transmission route is shorter, high-definition streams arrive more easily and congestion is less likely, so playback is less prone to stuttering. This gives us both clarity and low latency.

Of course, WebRTC also supports intermediate media servers, and server forwarding is indeed indispensable in some scenarios. We only discuss the peer-to-peer mode in this article, aiming to help everyone understand and get started with WebRTC more easily.

Get media stream

The first step in point-to-point communication must be that the initiator obtains the media stream.

There are three common media devices: cameras, microphones and screens. The camera and the screen can each be turned into a video stream, and the microphone into an audio stream. Audio and video streams combine to form the media stream we usually work with.

Taking the Chrome browser as an example, the video streams of the camera and the screen are obtained in different ways. For the camera and microphone, use the following API:

 var stream = await navigator.mediaDevices.getUserMedia({ audio: true, video: true })

Screen recording uses a different API. Its limitation is that it is mainly for video: audio capture through it is not supported in every browser, so do not rely on it for sound:

 var stream = await navigator.mediaDevices.getDisplayMedia()

Note: I ran into a problem here: the editor reported that navigator.mediaDevices was undefined. The cause was that my TypeScript version was older than 4.4; upgrading fixed it.

Both APIs for obtaining media streams can only be used when one of the following two conditions is met:

  • The domain name is localhost
  • The protocol is HTTPS

If neither is satisfied, the value of navigator.mediaDevices is undefined.
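
A defensive check makes this failure easier to diagnose. The sketch below is only one way to guard the calls: the helper names mediaDevicesAvailable and explainMediaSupport are made up for illustration, and the navigator object is passed in as a parameter so the functions can also be tried outside a browser.

```javascript
// Hypothetical helper: reports whether the media-capture API can be used.
// `nav` is expected to be the browser's `navigator` object.
function mediaDevicesAvailable(nav) {
  return Boolean(
    nav &&
    nav.mediaDevices &&
    typeof nav.mediaDevices.getUserMedia === 'function'
  )
}

// Hypothetical helper: names the most likely cause when the API is missing.
// `isSecure` is expected to be `window.isSecureContext` in a browser.
function explainMediaSupport(nav, isSecure) {
  if (mediaDevicesAvailable(nav)) return 'ok'
  return isSecure
    ? 'This browser does not expose navigator.mediaDevices'
    : 'Page is not a secure context: serve it over HTTPS or from localhost'
}
```

In a page this would be called as explainMediaSupport(navigator, window.isSecureContext) before requesting any stream.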

Both methods take a parameter, constraints, a configuration object known as media constraints. Its most useful feature is that you can choose to capture audio only, video only, or both.

For example, if I only want video, not audio, I can do this:

 let stream = await navigator.mediaDevices.getDisplayMedia({
  audio: false,
  video: true
})

Beyond simply enabling video, you can also configure parameters related to video quality, such as resolution and frame rate. For example, if I need 1080p full-HD video, I can configure it like this:

 var stream = await navigator.mediaDevices.getDisplayMedia({
  audio: false,
  video: {
    width: 1920,
    height: 1080
  }
})

Of course, configuring a 1080p resolution here does not mean the video you actually get is 1080p. For example, if my camera is 720p, then even if I configure 2K resolution I will get at most 720p; the result depends on the hardware and the network.
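
Since the requested and the delivered resolution can differ, it helps to ask for a target with ideal values and then read back what the browser actually delivered. This is a minimal sketch under my own assumptions: buildVideoConstraints and describeVideoTrack are hypothetical names, while the ideal constraint and MediaStreamTrack.getSettings() are standard parts of the media capture API.

```javascript
// Hypothetical helper: builds a constraints object that asks for a target
// resolution and frame rate as preferences rather than hard requirements.
function buildVideoConstraints(width, height, frameRate) {
  return {
    audio: false,
    video: {
      width: { ideal: width },
      height: { ideal: height },
      frameRate: { ideal: frameRate }
    }
  }
}

// Hypothetical helper: reads what the browser actually delivered.
// `track` is a MediaStreamTrack; getSettings() belongs to that interface.
function describeVideoTrack(track) {
  const { width, height, frameRate } = track.getSettings()
  return `${width}x${height} @ ${frameRate}fps`
}
```

In a page, the first helper would feed getUserMedia, for example getUserMedia(buildVideoConstraints(1920, 1080, 30)), and the second would be applied to stream.getVideoTracks()[0] to see whether the camera really produced 1080p.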

As mentioned above, a media stream is made up of audio and video. More precisely, a media stream (MediaStream) contains multiple media tracks (MediaStreamTrack), so we can get the audio and video tracks separately from the media stream:

 // Video tracks
let videoTracks = stream.getVideoTracks()
// Audio tracks
let audioTracks = stream.getAudioTracks()
// All tracks
stream.getTracks()

What is the point of getting tracks individually? For example, getDisplayMedia, the screen-capture API above, cannot be relied on for audio, but a live broadcast needs both the screen and the sound. In that case we can capture audio and video separately and then combine them into a new media stream. The implementation is as follows:

 const getNewStream = async () => {
  var stream = new MediaStream()
  let audio_stm = await navigator.mediaDevices.getUserMedia({
    audio: true
  })
  let video_stm = await navigator.mediaDevices.getDisplayMedia({
    video: true
  })
  audio_stm.getAudioTracks().forEach(track => stream.addTrack(track))
  video_stm.getVideoTracks().forEach(track => stream.addTrack(track))
  return stream
}

Peer-to-peer connection process

If I had to name what is inelegant about WebRTC, the first thing would be the complicated connection steps. Many people give up because their connection never succeeds.

The peer-to-peer connection, that is, the point-to-point connection mentioned above, is implemented at its core by RTCPeerConnection. A point-to-point connection between two browsers is essentially a connection between two RTCPeerConnection instances.

Once the connection is established, the two instances created with the RTCPeerConnection constructor can transmit video, audio, or arbitrary binary data (the latter requires support for the RTCDataChannel API). RTCPeerConnection also provides connection-state monitoring and a method to close the connection. Note that in the flow described here, media travels one way only, from the initiator to the receiver, because only the initiator adds tracks.

Now let's walk through the specific connection steps, following the core API.

Step 1: Create a connection instance

First create two connection instances; these are the two parties that will communicate with each other.

 var peerA = new RTCPeerConnection()
var peerB = new RTCPeerConnection()

In the following, the end that starts the live broadcast is called the initiator, and the end that receives it is called the receiver.

Neither connection instance carries any data yet. Assuming peerA is the initiator and peerB is the receiver, the peerA side obtains the media stream as in the previous section and then adds it to the peerA instance. The implementation is as follows:

 var stream = await navigator.mediaDevices.getUserMedia({ audio: true, video: true })
stream.getTracks().forEach(track => {
  peerA.addTrack(track, stream)
})

Since peerA adds media data, peerB will receive that media data at some later point in the connection process. So set a listener on peerB to obtain it:

 peerB.ontrack = async event => {
  let [ remoteStream ] = event.streams
  console.log(remoteStream)
}

Note: peerA must add the media data before proceeding to the next step! Otherwise peerB's ontrack event will not fire later on, and no media stream will be received.
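
One detail of the ontrack listener is worth sketching: the event fires once per track, so with an audio and a video track the handler runs twice with the same stream. The helper below, whose name attachRemoteStream I made up for illustration, attaches the stream only once and tolerates an event that carries no stream.

```javascript
// Hypothetical helper: attach the incoming stream to a <video> element.
// `videoEl` is an HTMLVideoElement, `event` is the RTCTrackEvent from ontrack.
function attachRemoteStream(videoEl, event) {
  const [remoteStream] = event.streams
  // event.streams is empty when the sender added the track without a stream
  if (!remoteStream) return false
  // Skip the assignment when this stream is already attached
  if (videoEl.srcObject === remoteStream) return false
  videoEl.srcObject = remoteStream
  return true
}
```

On the receiving page it could be wired up as peerB.ontrack = event => attachRemoteStream(videoEl, event), where videoEl is whichever video element should play the stream.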

Step 2: Establish a peering connection

Once the data is added, both ends can begin establishing peer-to-peer connections.

The most important role in establishing a connection is played by the SDP (RTCSessionDescription), or session description. Each side of the connection needs to create its own SDP, and the two differ: the initiator's SDP is called the offer, and the receiver's SDP is called the answer.

In essence, establishing a peer-to-peer connection between the two ends means exchanging SDPs. During the exchange the two ends check each other's descriptions, and only when that check succeeds can the connection be established.

Now we create the SDPs for both ends. peerA creates the offer:

 var offer = await peerA.createOffer()

The receiver peerB must first set the offer as its remote description; only then can it create the answer and set it as its local description:

 await peerB.setRemoteDescription(offer)
var answer = await peerB.createAnswer()
await peerB.setLocalDescription(answer)

Note: After peerB.setRemoteDescription runs, the peerB.ontrack event fires. The precondition, of course, is that media data was added to peerA in step 1.

This is easy to understand. The offer is created by peerA, which from peerB's point of view is the other end of the connection, so it is set as the "remote description". The answer is created by peerB itself, so naturally it is set as the "local description".

By the same logic, once peerB is set up, peerA sets the offer as its local description and the answer as its remote description. The local description has to come first, because an answer can only be applied to a connection that already has an offer:

 await peerA.setLocalDescription(offer)
await peerA.setRemoteDescription(answer)

At this point, the mutual exchange of SDP has been completed. But the communication is not over yet, there is still one last step.

When peerA calls setLocalDescription, it starts firing the onicecandidate event. We need to define a handler for this event, ahead of that call, and pass each candidate to peerB inside it:

 peerA.onicecandidate = event => {
  if (event.candidate) {
    peerB.addIceCandidate(event.candidate)
  }
}

At this point, end-to-end communication is truly established! If everything went well, peerB's ontrack event has already received the media stream, and you only need to attach it to a video tag to play it.

Again: these steps look simple, but their order matters a great deal, and a single misstep will make the connection fail! If you run into problems in practice, go back and check the steps for errors.
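
Why the order matters becomes clearer with a small model of the signalingState that every RTCPeerConnection keeps. The sketch below is a simplification I wrote for illustration: nextSignalingState is a hypothetical helper, and provisional answers and rollback are deliberately left out.

```javascript
// Simplified model of RTCPeerConnection.signalingState transitions.
// Assumption: provisional answers ("pranswer") and rollback are left out.
const TRANSITIONS = {
  'stable': {
    'local:offer': 'have-local-offer',
    'remote:offer': 'have-remote-offer'
  },
  'have-local-offer': {
    'remote:answer': 'stable',
    'local:offer': 'have-local-offer'
  },
  'have-remote-offer': {
    'local:answer': 'stable',
    'remote:offer': 'have-remote-offer'
  }
}

// Hypothetical helper: returns the next state, or throws for an invalid step.
// `side` is 'local' or 'remote'; `type` is 'offer' or 'answer'.
function nextSignalingState(state, side, type) {
  const next = (TRANSITIONS[state] || {})[`${side}:${type}`]
  if (!next) {
    throw new Error(`Cannot set ${side} ${type} while in state "${state}"`)
  }
  return next
}
```

In this model the initiator walks stable, have-local-offer, stable, and the receiver walks stable, have-remote-offer, stable. Setting an answer on a connection that is still stable is rejected, which is the kind of mistake a real browser reports as an InvalidStateError.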

Finally, we add a state listener event for peerA to detect whether the connection is successful:

 peerA.onconnectionstatechange = event => {
  if (peerA.connectionState === 'connected') {
    console.log('Peer connection established!')
  }
  if (peerA.connectionState === 'disconnected') {
    console.log('Connection lost!')
  }
}
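
The listener above handles two states, but connectionState has six possible values, and during debugging it is handy to log all of them. A minimal sketch, where watchConnectionState and the wording of the notes are my own:

```javascript
// The values RTCPeerConnection.connectionState can take, with a short note each.
const CONNECTION_STATE_NOTES = {
  new: 'created, nothing negotiated yet',
  connecting: 'transports are being set up',
  connected: 'media and data can flow',
  disconnected: 'connectivity lost, may recover on its own',
  failed: 'connection could not be established or restored',
  closed: 'close() was called on the connection'
}

// Hypothetical helper: report every state change through one callback.
// `peer` is an RTCPeerConnection; `report` receives (state, note).
function watchConnectionState(peer, report) {
  peer.onconnectionstatechange = () => {
    const state = peer.connectionState
    report(state, CONNECTION_STATE_NOTES[state] || 'unknown state')
  }
}
```

For example, watchConnectionState(peerA, (state, note) => console.log(state, note)) prints each transition as it happens.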

Local simulation communication source code

In the previous section we walked through the point-to-point communication process; that really is all the main code there is. In this section we string these pieces together into a simple demo of locally simulated communication and run it to see the effect.

First, the page layout, which is very simple: two video tags and one play button:

 <div class="local-stream-page">
  <video autoplay controls muted id="elA"></video>
  <video autoplay controls muted id="elB"></video>
  <button onclick="onStart()">Play</button>
</div>

Then set up the global variables:

 var peerA = null
var peerB = null
var videoElA = document.getElementById('elA')
var videoElB = document.getElementById('elB')

The button is bound to an onStart method, which obtains the media data:

 const onStart = async () => {
  try {
    var stream = await navigator.mediaDevices.getUserMedia({
      audio: true,
      video: true
    })
    // Play the media stream in the video tag
    videoElA.srcObject = stream
    peerInit(stream) // Initialize the connection
  } catch (error) {
    console.log('error:', error)
  }
}

The peerInit method is called in the onStart function, and the connection is initialized in this method:

 const peerInit = stream => {
  // 1. Create the connection instances (assign the globals, do not redeclare them)
  peerA = new RTCPeerConnection()
  peerB = new RTCPeerConnection()
  // 2. Add the media stream tracks
  stream.getTracks().forEach(track => {
    peerA.addTrack(track, stream)
  })
  // Pass candidates to peerB
  peerA.onicecandidate = event => {
    if (event.candidate) {
      peerB.addIceCandidate(event.candidate)
    }
  }
  // Watch the connection state
  peerA.onconnectionstatechange = event => {
    if (peerA.connectionState === 'connected') {
      console.log('Peer connection established!')
    }
  }
  // Listen for incoming media
  peerB.ontrack = async event => {
    const [remoteStream] = event.streams
    videoElB.srcObject = remoteStream
  }
  // Exchange SDPs
  transSDP()
}

After initializing the connection, exchange SDP in the transSDP method to establish the connection:

 const transSDP = async () => {
  // 1. Create the offer
  let offer = await peerA.createOffer()
  await peerB.setRemoteDescription(offer)
  // 2. Create the answer
  let answer = await peerB.createAnswer()
  await peerB.setLocalDescription(answer)
  // 3. Set the SDPs on the initiator
  await peerA.setLocalDescription(offer)
  await peerA.setRemoteDescription(answer)
}

Note: The order of the code in this method is very important. If the order is changed, the connection will most likely fail!
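
For comparison, here is a variant of the same exchange written as a reusable function. It is a sketch, not the demo's exact code: negotiate is a name I made up, it sets the initiator's local description right after creating the offer (another order that respects the rules), and it forwards candidates in both directions, which is the more general arrangement once the two ends sit on different machines.

```javascript
// Hypothetical helper: runs the offer/answer exchange between two
// RTCPeerConnection instances that live on the same page.
async function negotiate(caller, callee) {
  // Register candidate forwarding before any description is set,
  // so no candidate is missed.
  caller.onicecandidate = event => {
    if (event.candidate) callee.addIceCandidate(event.candidate)
  }
  callee.onicecandidate = event => {
    if (event.candidate) caller.addIceCandidate(event.candidate)
  }

  // Initiator: create the offer and apply it locally.
  const offer = await caller.createOffer()
  await caller.setLocalDescription(offer)

  // Receiver: apply the offer, then create and apply the answer.
  await callee.setRemoteDescription(offer)
  const answer = await callee.createAnswer()
  await callee.setLocalDescription(answer)

  // Initiator: apply the answer to finish the exchange.
  await caller.setRemoteDescription(answer)
}
```

With the demo's globals this would be called as negotiate(peerA, peerB) after the tracks have been added to peerA.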

If all goes well, the connection is established at this point.

[Screenshot of the running demo]

We used two video tags and three methods to implement a demo of locally simulated communication. "Locally simulated communication" simply means simulating the communication between peerA and peerB with both clients on one page. A real deployment never looks like this, of course; the demo just helps clarify the communication process.

I have uploaded the complete demo code to GitHub; if you want to look at it, see here. Pull the code and open index.html directly to see the effect.

Next we explore a real scenario: how communication works over a local area network.

Communication between both ends of the local area network

The previous section implemented locally simulated communication, simulating both ends of the connection on one page. Now think about it: if peerA and peerB are two clients on the same local area network, how does the code need to change?

We used two video tags and three methods for the local simulation. Once the two ends are separated, the peerA and peerB instances and the events bound to them must be defined separately, and the same goes for the two video tags. The onStart method that obtains the media stream belongs on the initiator, peerA, which is no problem, but the transSDP method that swaps SDPs no longer works.

Why? Take the peerA side, for example:

 // peerA side
let offer = await peerA.createOffer()
await peerA.setLocalDescription(offer)
await peerA.setRemoteDescription(answer)

The answer is used here to set the remote description, so where does the answer come from?

In the local simulation, we defined the variables in the same file, so each side could access the other's. But now peerB is on another client, and so is the answer. The answer therefore has to be created on the peerB side and then transmitted to the peerA side.

Likewise, after peerA creates the offer, it must be passed to peerB. This requires the two clients to exchange SDPs remotely, a process called signaling.

Yes, signaling is the process of exchanging SDPs remotely, not some kind of credential.

The two clients need to actively push data to each other, so a server is required to provide the connection and transport. The best fit for this kind of "active exchange" is WebSocket, so we need to build a signaling server based on WebSocket to carry out the SDP exchange.

However, this article will not explain the signaling server in detail; I will write a separate article on building one. For now we use two variables, socketA and socketB, to represent the WebSocket connections of peerA and peerB, and then modify the peer connection logic accordingly.
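
A browser WebSocket carries strings or binary data, not plain objects, so signaling messages need to be serialized on the way out and parsed on the way in. The sketch below shows one way to do that with a little validation; encodeSignal and decodeSignal are hypothetical helpers, and the list of types simply collects the message types used in this article.

```javascript
// Message types used by the examples in this article.
const SIGNAL_TYPES = ['offer', 'answer', 'candid', 'join']

// Hypothetical helper: turn a signaling message into a string for socket.send().
function encodeSignal(type, data) {
  if (!SIGNAL_TYPES.includes(type)) {
    throw new Error(`Unknown signal type: ${type}`)
  }
  return JSON.stringify({ type, data })
}

// Hypothetical helper: parse evt.data from onmessage.
// Returns null for anything that is not a well-formed signaling message.
function decodeSignal(raw) {
  let message
  try {
    message = JSON.parse(raw)
  } catch {
    return null
  }
  if (!message || !SIGNAL_TYPES.includes(message.type)) return null
  return message
}
```

Sending then looks like socketA.send(encodeSignal('offer', offer)), and a receiving handler starts with const message = decodeSignal(evt.data) followed by a null check.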

First, modify the code that sends and receives SDPs on the peerA side:

 // peerA side
const transSDP = async () => {
  let offer = await peerA.createOffer()
  // Send the offer to peerB (a WebSocket carries strings, so serialize it)
  socketA.send(JSON.stringify({ type: 'offer', data: offer }))
  // Receive the answer sent back by peerB
  socketA.onmessage = async evt => {
    let { type, data } = JSON.parse(evt.data)
    if (type == 'answer') {
      await peerA.setLocalDescription(offer)
      await peerA.setRemoteDescription(data)
    }
  }
}

The logic is: after the initiator peerA creates the offer, it immediately passes it to peerB. When peerB has run its own code and created an answer, it sends the answer back to peerA, and only then does peerA set its own descriptions.

The candidates also need to be passed remotely:

 // peerA side
peerA.onicecandidate = event => {
  if (event.candidate) {
    socketA.send(JSON.stringify({ type: 'candid', data: event.candidate }))
  }
}

The peerB side is slightly different. It can only create the answer after receiving the offer and setting it as the remote description; once created, the answer is sent to the peerA side. peerB also receives the candidate data:

 // peerB side: receive the offer sent by peerA
socketB.onmessage = async evt => {
  let { type, data } = JSON.parse(evt.data)
  if (type == 'offer') {
    await peerB.setRemoteDescription(data)
    let answer = await peerB.createAnswer()
    await peerB.setLocalDescription(answer)
    // Send the answer to peerA
    socketB.send(JSON.stringify({ type: 'answer', data: answer }))
  }
  if (type == 'candid') {
    peerB.addIceCandidate(data)
  }
}

In this way the two ends pass data to each other remotely, so that two clients on a local area network can connect and communicate.

To sum up, the two clients listen for each other's WebSocket messages, receive each other's SDP, and set it as the remote description. The receiver also needs to obtain the candidate data. With that, the signaling process works end to end.
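
Until a real signaling server exists, the message flow can be tried out with an in-memory stand-in for the two sockets. This is purely a testing aid that I am assuming here, not part of WebRTC or WebSocket: createLoopbackSockets is a hypothetical helper whose objects only mimic send and onmessage, and it delivers messages synchronously, which a real network never does.

```javascript
// Hypothetical in-memory stand-in for the two WebSocket connections.
// Whatever one side sends is delivered to the other side's onmessage handler.
function createLoopbackSockets() {
  const a = { onmessage: null }
  const b = { onmessage: null }
  a.send = raw => {
    if (b.onmessage) b.onmessage({ data: raw })
  }
  b.send = raw => {
    if (a.onmessage) a.onmessage({ data: raw })
  }
  return [a, b]
}
```

Calling const [socketA, socketB] = createLoopbackSockets() gives the initiator and receiver code something to talk through on a single page, so the order of offer, answer and candidate messages can be checked before any server is written.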

One-to-many communication

As mentioned earlier, both the local simulation and the communication between two ends of a local area network are "one-to-one" communication.

In many scenarios, however, such as a live online class, one teacher may face 20 students, a typical one-to-many scenario. But a WebRTC connection is point-to-point, that is, one connection links exactly two clients, so what do we do in this situation?

Remember what I said earlier: a point-to-point connection between two clients is essentially a connection between two RTCPeerConnection instances.

So let's make a change. Suppose the receivers are now several clients, peerB, peerC, peerD and so on. The logic for establishing each connection stays the same as before. Can the initiator, then, go from "one connection instance" to "multiple connection instances"?

In other words, although the initiator is a single client, nothing stops it from creating several RTCPeerConnection instances at the same time. The one-to-one nature of each connection is unchanged, but several connection instances live on one client and each connects to a different receiver, which achieves one-to-many communication in a roundabout way.

The specific idea is: the initiator maintains an array of connection instances. When a receiver asks to connect, the initiator creates a new connection instance to communicate with that receiver and pushes the instance into the array. When the connection drops, the instance is removed from the array.

I have tested this approach myself and it works. Next we modify the initiator's code. A message whose type is join indicates that a receiver is requesting a connection.

 // Initiator
var Peers = [] // array of connection instances

// A receiver requests a connection and passes its identifying id
const newPeer = async id => {
  // 1. Create the connection
  let peer = new RTCPeerConnection()
  // 2. Add the media stream tracks
  stream.getTracks().forEach(track => {
    peer.addTrack(track, stream)
  })
  // 3. Create the offer on this new connection and send it
  let offer = await peer.createOffer()
  socketA.send(JSON.stringify({ type: 'offer', data: { id, offer } }))
  // 4. Save the connection together with its own offer
  Peers.push({ id, peer, offer })
}

// Listen for messages from the receivers
socketA.onmessage = async evt => {
  let { type, data } = JSON.parse(evt.data)
  // A receiver requests a connection
  if (type == 'join') {
    newPeer(data)
  }
  if (type == 'answer') {
    let item = Peers.find(row => row.id == data.id)
    if (item) {
      await item.peer.setLocalDescription(item.offer)
      await item.peer.setRemoteDescription(data.answer)
    }
  }
}
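
The idea described earlier also removes an instance when its connection drops, which the core code does not show. One possible shape for that bookkeeping is sketched below. PeerRegistry is a hypothetical class of my own, the factory is injected so it can be tried without a browser, and treating only failed and closed as final (disconnected may still recover) is my assumption rather than a rule.

```javascript
// Hypothetical registry for the initiator's connection instances.
// `createPeer` is injected; in a page it would be () => new RTCPeerConnection().
class PeerRegistry {
  constructor(createPeer) {
    this.createPeer = createPeer
    this.peers = new Map() // receiver id -> connection instance
  }

  // Create and store a connection for one receiver.
  add(id) {
    this.remove(id) // drop any stale connection for the same receiver
    const peer = this.createPeer()
    peer.onconnectionstatechange = () => {
      if (['failed', 'closed'].includes(peer.connectionState)) {
        this.remove(id)
      }
    }
    this.peers.set(id, peer)
    return peer
  }

  get(id) {
    return this.peers.get(id)
  }

  // Close and forget a connection; returns false when the id is unknown.
  remove(id) {
    const peer = this.peers.get(id)
    if (!peer) return false
    peer.onconnectionstatechange = null
    peer.close()
    return this.peers.delete(id)
  }

  get size() {
    return this.peers.size
  }
}
```

With this in place, the join branch would call registry.add(id) instead of creating the connection by hand, and the answer branch would look the connection up with registry.get(id).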

This is the core logic. It really is not difficult; once the idea is clear, it is quite simple.

Real one-to-many communication needs a signaling server, which we have not covered in detail, so here I only present the approach and the core code. For a fuller implementation, I will work through one-to-many communication again in the next article, which introduces the signaling server, and present the complete source code there.

I want to learn more

To better protect my original work, I will publish follow-up articles first on the WeChat official account Programmer Success. The account carries only original content, with at least one high-quality article per week, focusing on front-end engineering and architecture, explorations at the edges of Node.js, and practice and reflections on integrated development and application delivery.

I have also set up a WeChat group for people interested in these topics to exchange ideas and learn together. If you are interested too, add me on WeChat (ruidoc) and I will bring you into the group so we can make progress together~
