
This article was shared by Mogujie front-end engineer "Three Body"; the original title, "Exploration of Mogujie Cloud Live Streaming: Setting Sail", has been revised.

1 Introduction

With faster mobile networks and lower data costs, live video streaming has gradually been accepted by more and more users as a new form of entertainment. In recent years especially, live video has moved beyond traditional show and game streaming and grown rapidly as a new model for e-commerce.

By walking through the live video technology stack, including the commonly used push-pull streaming architectures and transmission protocols, this article gives you a basic understanding of today's mainstream live video technology.

Learning and exchange:

  • Introductory article on mobile IM development: "One entry is enough for beginners: developing mobile IM from scratch"
  • Open source IM framework source code: https://github.com/JackJiang2011/MobileIMSDK (click here for alternate address)

(This article has been published simultaneously at: http://www.52im.net/thread-3922-1-1.html )

2. Overview of the live broadcast architecture of Mogujie

At present, the main push and pull streaming pipeline of Mogujie Live relies on the services of a cloud live streaming provider.

The cloud live streaming service provides two ways to push streams:

1) Pushing the stream by integrating the vendor SDK (used when broadcasting from a mobile phone);
2) Pushing the stream to a remote server over the RTMP protocol (used when broadcasting from a PC or professional console equipment).

In addition to push-pull streaming, the cloud platform also provides cloud services such as cloud communication (IM instant messaging capability) and live recording, forming a set of basic services required for live broadcasting.

3. Push-pull flow architecture 1: Manufacturer SDK push-pull flow

As shown in the figure above, this push-pull streaming architecture relies on the mobile interactive live streaming SDK provided by vendors such as Tencent. By integrating the SDK into both the host-side app and the viewer-side app, both ends gain the ability to push and pull streams.

The logical principle of this push-pull flow architecture is as follows:

1) The host and the client respectively establish long-term connections with the interactive live broadcast background of cloud live broadcast;
2) The host pushes audio and video streams to the interactive live background through the UDT private protocol;
3) After the interactive live broadcast background receives the audio and video stream, it forwards it, and sends it directly to the client that establishes a connection with it.

This push-pull flow method has several advantages:

1) Only the SDK needs to be integrated into the client: broadcasts can be started from a mobile phone, the bar for hosts is relatively low, and it suits rapid rollout of a live streaming business;
2) The interactive live streaming backend only does forwarding: there is no extra transcoding or uploading to a CDN, so the overall latency is relatively low;
3) Both the host side and the viewer side can upload audio and video: this suits scenarios such as co-hosting (connecting mics) and video calls.

4. Push-pull flow architecture 2: bypass push flow

The previous sections introduced pushing and pulling streams through the mobile SDK, which seems to cover the scenario of watching live streams in the mobile client.

So here comes the question: If I want to watch the live broadcast in other scenarios such as H5, applet, etc., but there is no way to access the SDK, what should I do?

At this time, a new concept needs to be introduced - bypass push flow.

Bypass push streaming refers to connecting audio and video streams to a standard live CDN system through protocol conversion.

At present, once bypass streaming is enabled, the interactive live streaming backend also pushes the audio and video streams to the cloud live streaming backend, which transcodes them into common protocol formats and pushes them to the CDN. H5 pages, mini programs, and other endpoints can then pull the streams in these common formats from the CDN for playback.

Currently, Mogujie Live enables three bypass protocols: HLS, FLV, and RTMP, which together cover all playback scenarios. These protocols are introduced in detail in the following chapters.

5. Push-pull streaming architecture 3: RTMP streaming

As the live streaming business developed, some hosts became dissatisfied with the quality of phone-based broadcasts, and e-commerce live streaming needs to show products on screen with high fidelity, which calls for broadcasting with higher-definition professional equipment. RTMP streaming arose to meet this need.

Using a streaming recording program such as OBS, we mix the feeds captured by professional equipment and upload the combined audio and video stream to a designated push address. Since OBS pushes over the RTMP protocol, we call this type of streaming RTMP push streaming.

We first apply for a push address and stream key in the cloud live streaming console, configure them in OBS, and adjust the streaming parameters. After clicking "Start Streaming", OBS pushes the audio and video streams to that address over the RTMP protocol.
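For illustration only, the pieces that get configured in OBS can be thought of as one push URL; the domain, path, and parameter name below are hypothetical placeholders, not the cloud provider's real format:

// A purely illustrative sketch of composing an RTMP push address.
// The domain, stream name, and "key" parameter are hypothetical placeholders.
const pushDomain = 'rtmp://push.example.com/live';
const streamName = 'room_12345';
const streamKey = 'secret-key-from-console';

const pushUrl = `${pushDomain}/${streamName}?key=${streamKey}`;
// In OBS, the part up to the application path goes into "Server",
// and the stream name plus key goes into "Stream Key".
console.log(pushUrl);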

The difference between this method and SDK streaming is that the audio and video streams are pushed directly to the cloud live streaming backend for transcoding and uploading to the CDN; there is no downlink path that delivers the stream straight to viewers, so the end-to-end latency is longer.

To sum up, the advantages and disadvantages of RTMP streaming are more obvious.

The main advantages are:

1) Professional live camera and microphone can be connected, and the overall effect of live broadcast is obviously better than that of mobile phone broadcast;
2) OBS already has many mature plug-ins. For example, Mogujie hosts currently use the YY assistant for beauty processing, and OBS itself supports filters, green screen, multi-source video compositing, and other functions that are more powerful than what a phone offers.

The disadvantages are mainly:

1) The configuration of OBS itself is relatively complex, requiring professional equipment support, and significantly higher requirements for anchors, usually requiring a fixed venue for live broadcast;
2) RTMP requires cloud transcoding, and OBS itself adds GOP and buffering settings on the upload side, so the overall latency is relatively long.

6. High-availability architecture solution: cloud mutual backup

After the business develops to a certain stage, we have higher requirements for its stability. For example, if a cloud service provider's service has a problem and we have no backup plan, the business can only wait on the provider's repair progress.

Therefore, the cloud mutual backup solution appeared: cloud mutual backup refers to the simultaneous connection of live broadcast services to multiple cloud service providers. When a cloud service provider encounters a problem, it can quickly switch to the service nodes of other service providers to ensure that the business is not affected.

In the live streaming business, we often encounter cases where a provider's CDN node has slow downlink speed or serves a broken live stream. Such problems are regional and hard to troubleshoot, so the current cloud mutual backup solution mainly backs up the CDN nodes.

At present, Mogujie's overall streaming pipeline already relies on the primary cloud platform's services, so in the cloud live streaming backend we forward a copy of the stream to the backup cloud platform. After the backup cloud receives the live stream, it transcodes it and uploads it to its own CDN system. If the primary platform's CDN nodes have a problem, we simply swap the delivered playback address for the backup cloud's address, so the service recovers quickly and viewers do not notice.

7. Principle of decapsulation of live video data stream

Before introducing the streaming protocols, let's first look at how many steps it takes to go from a chunk of data pulled from the cloud to the audio and video data we can finally play.

As shown in the figure above, in general, there are four steps from acquiring data to finally playing audio and video.

Step 1: Unpack the protocol.

When data is wrapped in a transport protocol, it usually carries header descriptions or signaling data. This part has no effect on audio and video playback, so we need to strip it and extract the container-format data inside. The protocols we commonly use in live streaming are HTTP and RTMP.

Step 2: Decapsulation.

After obtaining the container-format data, we need to demux it, extracting the compressed audio stream and the compressed video stream separately. Familiar container formats include MP4 and AVI; in live streaming, the ones we deal with most are TS and FLV.

Step 3: Decode audio and video.

So far we have obtained the compressed encoding data of audio and video.

Commonly heard video codecs include the H.26x series and the MPEG series, and familiar audio codecs include MP3 and AAC.

The reason there are so many codecs is that different organizations have proposed their own standards and keep putting forward new ones, but because of promotion and licensing issues, only a few are truly mainstream today.

After obtaining the compressed data, we need to decode it to get uncompressed picture data and uncompressed audio samples. The picture format we are most familiar with is RGB, but the format commonly used in video is YUV, which describes each pixel with a luminance component (Y) and two chrominance components (U and V). Audio samples are usually represented as PCM.
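As a small aside (not part of the Mogujie pipeline), the relationship between YUV and RGB can be made concrete with a per-pixel conversion; the sketch below uses the BT.601 full-range formula, and the function names are just for this example:

// A minimal sketch: convert one YUV pixel (BT.601, full range) to RGB.
// yuvToRgb and clamp are illustrative names, not from any library.
function clamp(value) {
  return Math.min(255, Math.max(0, Math.round(value)));
}

function yuvToRgb(y, u, v) {
  const r = y + 1.402 * (v - 128);
  const g = y - 0.344136 * (u - 128) - 0.714136 * (v - 128);
  const b = y + 1.772 * (u - 128);
  return { r: clamp(r), g: clamp(g), b: clamp(b) };
}

// Example: a pure gray pixel has no chroma offset.
console.log(yuvToRgb(128, 128, 128)); // { r: 128, g: 128, b: 128 }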

Step 4: Play audio and video synchronously.

Finally, we need to compare the time axis of the audio and video, and send the decoded data of the audio and video to the video card and sound card for synchronous playback.
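Conceptually, synchronization means driving video rendering off the audio clock. The sketch below only illustrates that idea; the frame object, getAudioClock(), renderFrame(), and the 40 ms drop threshold are all assumptions, not a real player implementation:

// A conceptual sketch only: schedule a decoded video frame against the audio clock.
// frame.pts and the audio clock are both in seconds.
function scheduleVideoFrame(frame, getAudioClock, renderFrame) {
  const diffMs = (frame.pts - getAudioClock()) * 1000; // how far ahead the frame is
  if (diffMs < -40) {
    return; // frame is already too late: drop it to catch up with the audio
  }
  // render now, or wait until the audio clock reaches the frame's timestamp
  setTimeout(() => renderFrame(frame), Math.max(0, diffMs));
}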

PS: If you don’t quite understand the above process, it is recommended to read the following series of articles further:

"Detailed explanation of real-time audio and video live broadcast technology on mobile terminals (1): the beginning"
"Detailed Explanation of Mobile Real-time Audio and Video Live Technology (2): Collection"
"Detailed explanation of real-time audio and video live broadcast technology on mobile terminals (3): processing"
"Detailed Explanation of Mobile Real-time Audio and Video Live Technology (4): Coding and Encapsulation"
"Detailed explanation of real-time audio and video live broadcast technology on mobile terminals (5): streaming and transmission"
"Detailed explanation of real-time audio and video live broadcast technology on mobile terminals (6): Delay optimization"

In addition: For articles on audio and video codec technology, you can also learn the following articles in detail:

Video coding and decoding: "Theoretical Overview", "Introduction to Digital Video", "Basics of Coding", "Introduction to Prediction Technology"
"Understanding the Mainstream Video Coding Technology H.264"
"How to start learning audio codec technology"
"Introduction to Audio Fundamentals and Coding Principles"
"Common real-time voice communication coding standards"
"Features and advantages of real-time video encoding H.264", "The past and present of video encoding H.264 and VP8"
"Detailed explanation of the principle, evolution and application selection of audio codec", "Zero Foundation, Introduction to the Most Popular Video Coding Technology in History"

8. Live video transmission protocol 1: HLS

First, let's introduce the HLS protocol. HLS is the abbreviation of HTTP Live Streaming, which is a streaming media network transmission protocol proposed by Apple.

It is obvious from the name: this set of protocols is transmitted based on the HTTP protocol.

To understand HLS, the first thing to know is that it plays video in segments, as a sequence of slices. The slice format used by the protocol is TS, the container format mentioned earlier.

Before fetching the TS files, the protocol first requires requesting an M3U8 file. M3U8 is an index file that lists the TS segment addresses in a fixed format; from it we can get the CDN address of each TS segment. By loading those TS addresses and playing the segments in order, we reconstruct the complete video.

When playing a video over HLS, an M3U8 file is requested first. For on-demand playback, it only needs to be fetched once during initialization to obtain all the TS segments; for live streaming, the M3U8 has to be polled constantly to pick up newly generated TS segments.

After getting the M3U8, we can take a look at its contents. It starts with some general description fields, such as the sequence number of the first segment and the maximum and total segment durations, followed by the list of TS addresses. For a live stream, each request for the M3U8 returns a TS list updated with the latest slices, which is how the live effect is achieved.
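To make the live polling behavior concrete, here is a minimal sketch (not hls.js internals); the playlist URL and the 2-second polling interval are illustrative assumptions:

// A minimal sketch: poll a live M3U8 playlist and collect new TS segment URLs.
const seenSegments = new Set();

async function pollPlaylist(m3u8Url, onNewSegment) {
  const text = await (await fetch(m3u8Url)).text();
  text.split('\n')
    .map((line) => line.trim())
    .filter((line) => line && !line.startsWith('#')) // non-comment lines are segment URIs
    .forEach((uri) => {
      const tsUrl = new URL(uri, m3u8Url).href; // resolve relative segment paths
      if (!seenSegments.has(tsUrl)) {
        seenSegments.add(tsUrl);
        onNewSegment(tsUrl);
      }
    });
}

setInterval(() => {
  pollPlaylist('https://example-cdn.com/live/stream.m3u8',
    (ts) => console.log('new segment:', ts));
}, 2000);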

The HLS format of slice playback is more suitable for on-demand playback, and some large video websites also use this protocol as a playback solution.

First, slice-based playback is especially well suited to hot-switching resolutions and language tracks in on-demand scenarios. For example, suppose we start a video in standard definition and, halfway through, decide it is not sharp enough and want ultra HD. We only need to swap the SD M3U8 for the UHD M3U8; when playback reaches the next TS boundary, the player automatically switches to the UHD TS segments, without reinitializing the video.

Secondly: the form of slice playback can also easily insert advertisements and other content into the video.

HLS is also a common choice in live scenarios. Its biggest advantage is Apple's backing, which makes the protocol easy to roll out, especially on mobile: simply feeding the M3U8 address to a video tag is enough to play, and on PC most browsers can support it through MSE-based decoding. However, because of its segmented loading, live latency is relatively high. For example, if an M3U8 lists 5 TS files and each TS plays for 2 seconds, the playlist covers 10 seconds of playback, which means the live position of this M3U8 is at least 10 seconds behind; that is a big disadvantage for live scenarios.

In the TS container used by HLS, the video codec is usually H.264 or MPEG-4, and the audio codec is AAC or MP3.

A TS stream consists of multiple fixed-length packets, usually 188 bytes each. Each packet consists of a header and a payload. The header contains basic information such as the sync byte, error indicators, and packet position; the payload can be loosely understood as the audio and video data, although there are actually two more layers of encapsulation (PES and ES) beneath it. Once those are unpacked, we obtain the encoded audio and video stream data.
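As an illustration of the fixed 188-byte packet layout just described, a sketch of parsing the 4-byte TS packet header might look like this (parseTsPacketHeader is a made-up name, not from any library):

// A minimal sketch: parse the 4-byte header of a 188-byte MPEG-TS packet.
// packet is a Uint8Array of length 188.
function parseTsPacketHeader(packet) {
  if (packet[0] !== 0x47) {
    throw new Error('lost sync: TS packets must start with the sync byte 0x47');
  }
  return {
    transportError: (packet[1] & 0x80) !== 0,   // transport_error_indicator
    payloadUnitStart: (packet[1] & 0x40) !== 0, // start of a PES packet or section
    pid: ((packet[1] & 0x1f) << 8) | packet[2], // 13-bit packet identifier
    hasAdaptationField: (packet[3] & 0x20) !== 0,
    hasPayload: (packet[3] & 0x10) !== 0,
    continuityCounter: packet[3] & 0x0f,
  };
}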

9. Live video transmission protocol 2: HTTP-FLV

The HTTP-FLV protocol, as can be clearly seen from the name, is a protocol that transmits the FLV encapsulation format through the HTTP protocol.

FLV is short for Flash Video, a container with a small file size that is well suited to network transmission. The video codec in FLV is usually H.264, and the audio codec is AAC or MP3.

In the live broadcast, HTTP-FLV transmits FLV packet data to the requesting end by means of HTTP long connection.

In a live broadcast, we can pull chunked data from an HTTP-FLV pull URL.

Opening the dump and reading the byte stream in hexadecimal, then comparing it against the FLV packaging structure, we can see that this is exactly the FLV data we need.

First, the header information: 0x46 0x4C 0x56 is the ASCII for the three characters "FLV"; 0x01 is the version number; in 0x05, viewed in binary, the 6th and 8th bits indicate whether the stream contains audio and video respectively; and the trailing 0x00000009 gives the header length in bytes.

What follows is the actual audio and video data, encapsulated as a sequence of FLV tags. Each tag also has header information indicating whether it carries audio, video, or script data; by parsing the tags we can extract the compressed audio and video streams.
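As a sketch of the header layout described above (illustrative names, not flv.js code):

// A minimal sketch: parse the 9-byte FLV file header.
// data is a Uint8Array; parseFlvHeader is an illustrative name.
function parseFlvHeader(data) {
  const signature = String.fromCharCode(data[0], data[1], data[2]); // should be "FLV"
  if (signature !== 'FLV') {
    throw new Error('not an FLV stream');
  }
  const flags = data[4]; // e.g. 0x05 = audio + video
  return {
    version: data[3],                 // usually 0x01
    hasAudio: (flags & 0x04) !== 0,   // bit 2 of the flags byte
    hasVideo: (flags & 0x01) !== 0,   // bit 0 of the flags byte
    headerSize: (data[5] << 24) | (data[6] << 16) | (data[7] << 8) | data[8], // usually 9
  };
}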

The FLV container is not natively supported by the browser's video element. To play this format, the compressed video data has to be decoded through MSE, so the browser must support the MSE API. And since HTTP-FLV transmits a continuous file stream over a long-lived connection, the browser also needs to support stream IO or fetch, so the compatibility requirements are relatively high.

In terms of latency, FLV is far better than slice-based HLS. At present, FLV's latency is mainly determined by the GOP length configured at encoding time.

Here is a brief introduction to GOP: H.264 encoding produces three frame types: I-frames, B-frames, and P-frames. An I-frame is what we usually call a key frame; it contains complete intra-frame information and can be used directly as a reference frame. To compress data further, B-frames and P-frames must reconstruct their content by referencing other frames, so the interval between two I-frames can be regarded as the minimum playable segment of the video. For push stability, we also require hosts to set a fixed keyframe interval, usually 1-3 seconds, so even setting aside other factors, our live playback carries a 1-3 second delay.

10. Live video transmission protocol 3: RTMP

The RTMP protocol can actually be classified into the same type as the HTTP-FLV protocol.

Their container format is FLV in both cases; the difference is that HTTP-FLV uses HTTP as the transport protocol, while RTMP pull streaming uses RTMP itself as the transport.

RTMP is a set of real-time message transmission protocols made by Adobe based on TCP, which is often used in conjunction with Flash players.

The advantages and disadvantages of RTMP protocol are very obvious.

The advantages of RTMP protocol are mainly:

1) First, like HTTP-FLV, the latency is relatively low;
2) Secondly, its stability is very good, and it is suitable for long-term playback (because it borrows the powerful functions of Flash player during playback, even if multiple streams are played at the same time, it can ensure that the page does not freeze, which is very suitable for monitoring and other scenarios).

However, the Flash player is now being pushed out of the web: mainstream browsers have announced one after another that they no longer support the Flash plug-in; on a Mac it consumes so many resources that it can practically turn the computer into a grill; and on mobile, H5 basically does not support it at all. Compatibility is its biggest problem.

11. Live video transmission protocol 4: MPEG-DASH

The MPEG-DASH protocol belongs to an emerging force. Like HLS, it is played by slicing video.

The background of its creation is that, in the early days, each big company built its own protocol: Apple created HLS, Microsoft created MSS, and Adobe created HDS, leaving users to suffer compatibility problems across multiple protocol stacks.

So the big guys got together, integrated the previous streaming media protocol solutions of various companies, and created a new protocol.

Since it likewise plays video in slices, DASH has advantages and disadvantages similar to HLS: it supports switching between multiple bitrates and audio tracks at slice boundaries, which makes it well suited to on-demand playback, but it still carries a long delay in live streaming.

12. How to choose the optimal live video transmission protocol

Two very critical points for the selection of live video protocols have been mentioned in the previous article, namely low latency and better compatibility.

First, from the latency perspective (ignoring cloud transcoding and uplink/downlink costs): even with shortened slice durations, HLS and MPEG-DASH have a latency of roughly 10 seconds, while RTMP and FLV are theoretically on par at about 2-3 seconds. So in terms of latency: HLS ≈ DASH > RTMP ≈ FLV.

From the compatibility perspective: HLS > FLV > RTMP. As for DASH, due to some historical reasons in the project and because its positioning overlaps with HLS, we have not tested its compatibility in detail for now and have excluded it from the selection.

To sum up: we can detect the environment dynamically and choose the lowest-latency protocol available in it. The general strategy is to prefer HTTP-FLV, fall back to HLS, and switch to RTMP via manual configuration in a few special scenarios.

For HLS and HTTP-FLV, we can use hls.js and flv.js directly for decoding and playback; both decode through MSE internally. They first extract the corresponding audio and video chunks according to the container format, create a SourceBuffer each for audio and video on a MediaSource, and feed the encoded data into the SourceBuffers; the video tag's src is then pointed at an object URL created from the MediaSource for playback.
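A bare-bones illustration of that MSE flow is sketched below; the codec strings and the fetchNextChunk() helper are assumptions, and real players such as flv.js additionally handle buffering queues and error states:

// A minimal sketch of MSE-based playback: attach a MediaSource to a <video>
// and feed demuxed, fMP4-packaged chunks into SourceBuffers.
const video = document.querySelector('video');
const mediaSource = new MediaSource();
video.src = URL.createObjectURL(mediaSource);

mediaSource.addEventListener('sourceopen', () => {
  const videoBuffer = mediaSource.addSourceBuffer('video/mp4; codecs="avc1.42E01E"');
  const audioBuffer = mediaSource.addSourceBuffer('audio/mp4; codecs="mp4a.40.2"');

  async function appendChunks() {
    // fetchNextChunk() is a hypothetical demuxer/transmuxer output, not a real API.
    const { videoChunk, audioChunk } = await fetchNextChunk();
    videoBuffer.appendBuffer(videoChunk); // ArrayBuffer of fMP4 video segments
    audioBuffer.appendBuffer(audioChunk); // ArrayBuffer of fMP4 audio segments
  }
  appendChunks();
});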

When detecting the playback environment, we can borrow the detection approach used inside flv.js, calling the MSE capability check and simulating a request to determine whether MSE and stream IO are available:

// Check whether the browser supports MediaSource, and whether it can decode H.264 video and AAC audio
window.MediaSource && window.MediaSource.isTypeSupported('video/mp4; codecs="avc1.42E01E,mp4a.40.2"');

If FLV playback is not supported, we need to downgrade to HLS. At that point we check whether the environment is a mobile browser: on mobile, hls.js and MSE decoding are usually unnecessary, and handing the M3U8 address directly to the video tag's src is enough. On PC, we check whether MSE is available and, if so, use hls.js to decode and play.
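The overall selection logic can be sketched roughly as follows; the function name and the naive user-agent test are illustrative, not Mogujie's actual code:

// A minimal sketch of the playback-protocol selection described above.
function choosePlayback() {
  const probe = document.createElement('video');
  const mseSupported =
    window.MediaSource &&
    window.MediaSource.isTypeSupported('video/mp4; codecs="avc1.42E01E,mp4a.40.2"');
  const streamIoSupported = typeof window.fetch === 'function' && 'ReadableStream' in window;
  const isMobile = /Android|iPhone|iPad|iPod/i.test(navigator.userAgent);

  if (mseSupported && streamIoSupported && !isMobile) {
    return 'http-flv';   // prefer flv.js playback: lowest latency
  }
  if (isMobile && probe.canPlayType('application/vnd.apple.mpegurl')) {
    return 'native-hls'; // mobile browsers usually play the M3U8 directly via video.src
  }
  if (mseSupported) {
    return 'hls-mse';    // PC fallback: decode HLS through MSE with hls.js
  }
  return 'rtmp';         // last resort, enabled only by manual configuration
}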

These checks can be done in advance in our own logic so that we only pull the CDN script of the decoder library we actually need, instead of loading a third-party library first and then relying on its internal detection; this avoids downloading every library and speeds up loading.

13. How to solve the same layer playback

E-commerce live streaming requires more viewer operations and interactions than traditional live streaming, so during product design many functional modules are floated above the live video to save screen space. At this point you run into a big problem with mobile players: same-layer (inline) playback.

The same-layer playback problem: in mobile H5 pages, some browser kernels replace the video tag with a native player in the name of improving the user experience, which prevents other elements from being overlaid on top of the player.

For example, we want to add a chat window above the player in the live room, positioning it absolutely with a higher z-index. On PC this works fine, but in some mobile browsers the video is replaced by the native player, whose layer sits above ordinary page elements, so the chat window actually ends up rendered underneath the player.

To solve this problem, we must first divide into multiple scenarios.

First, on iOS: by default the video tag automatically plays full screen, but iOS 10 and above provide inline-playback attributes. Adding playsinline / webkit-playsinline to the video tag solves the same-layer problem in most browsers on iOS. For the remaining browsers on lower system versions and some in-app webview containers (such as Weibo), these attributes do not work; calling the third-party library iphone-inline-video solves most of those remaining cases.

On Android: most webview containers built into Tencent-family apps use the X5 kernel, which replaces the video with a customized native player to enhance certain functions. X5 also provides a same-layer solution (the official documentation link for it no longer opens); writing the X5 same-layer attributes onto the video tag enables inline playback in the X5 kernel. However, the same-layer attributes behave differently across X5 versions (for example, older X5 versions require the X5 full-screen playback mode for MSE-played video to take effect in the same layer), so you need to distinguish versions carefully.
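For reference, here is a sketch of applying these inline-playback hints; playsinline/webkit-playsinline are the standard iOS attributes mentioned above, while the X5 attribute names shown (x5-video-player-type, x5-video-player-fullscreen) are the commonly cited ones and should be verified against the X5 version actually embedded:

// A sketch of applying inline ("same-layer") playback hints to a video element.
const liveVideo = document.querySelector('video');

// iOS: ask the browser to play inline instead of forcing full screen (iOS 10+).
liveVideo.setAttribute('playsinline', '');
liveVideo.setAttribute('webkit-playsinline', '');

// Android X5 kernel: request the H5 same-layer player (names may vary by X5 version).
liveVideo.setAttribute('x5-video-player-type', 'h5');
liveVideo.setAttribute('x5-video-player-fullscreen', 'true');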

In the Mogujie app, the currently integrated X5 kernel is relatively old, and the X5 same-layer attributes do not take effect when MSE is used. Integrating a newer X5 kernel would require regression testing a large number of online pages, which is costly, so we adopted a compromise: a switch parameter is added to the page URL, and when the container reads it, it downgrades that page from the X5 kernel to the system's native browser kernel. This solves the video same-layer problem while confining the impact of the kernel change to a single page.

14. Related Articles

[1] Detailed explanation of real-time audio and video live broadcast technology on mobile terminals (4): Encoding and packaging
[2] Detailed explanation of real-time audio and video live broadcast technology on mobile terminals (5): Streaming and transmission
[3] Practical sharing of realizing 1080P real-time audio and video live broadcast with a delay of less than 500 milliseconds
[4] Talking about the technical points of developing a real-time video live broadcast platform
[5] Chat technology of live broadcast system (7): Practice of architectural design difficulties of massive chat messages in live broadcast room
[6] From 0 to 1: 10,000 people online live audio and video live technology practice sharing (video + PPT) [Attachment download]
[7] Features and advantages of real-time video coding H.264
[8] The past and present of video encoding H.264 and VP8
[9] Zero foundation, the most popular introduction to video coding technology in history
[10] Coding basis of video codec
[11] Getting Started with Zero Basics: A Comprehensive Inventory of the Basics of Real-time Audio and Video Technology
[12] Necessary for real-time audio and video viewing: quickly master 11 basic concepts related to video technology
[13] Introductory outline of real-time audio and video technology written to Xiaobai

(This article has been published simultaneously at: http://www.52im.net/thread-3922-1-1.html )

