Author: Zhang Lingzhu (with rope)

This is the second article in the "Youku Play Black Technology" series. The first article is "Youku Play Black Technology | Free View Technology Experience Optimization Practice". Follow us so you don't miss the rest of the series.

Live content differs from on-demand content: it attracts users through its real-time nature and strong interactivity. While watching, users can like, comment, and send rewards. That kind of interaction centers on social behavior. Content-based interaction entered the public eye with the arrival of "interactive episodes" in on-demand video, where users choose what a character does at key points in the plot and then enter different branches of the story. Content-based interaction is also being explored and applied in live broadcasting. One example is broadcasting multiple viewing angles simultaneously and letting users switch to the angle they want to watch. In medium and large live events (such as hip-hop shows and the Champions League), fans have clear favorites (celebrities, football stars), so offering multiple viewing angles, or a dedicated focus angle to switch to, would greatly improve the user experience.

Based on this, after research and exploration, we built a low-latency, high-definition multi-view live broadcast capability based on WebRTC. It officially launched for the annual finals of "It's Hip-hop". The effect is shown in the following video:

The video is available in the original post: Youku broadcast black technology | WebRTC-based live broadcast "cloud multi-view" technology analysis

Scheme selection

Whether a solution can deliver low-latency switching, high definition, and coverage of all device models was the key criterion in choosing the technology. Application layer protocols commonly used in the industry include:

  • HLS (including LHLS)
  • RTMP
  • RTP (strictly speaking, RTP is between the application layer and the transport layer)

Of these, RTMP, given its trade-offs (low latency but easily blocked by firewalls), is used for pushing live streams and is not covered further here. The rest of this section analyzes the pull side. By the number of streams the client pulls, the candidate solutions are:

  • Single-stream pseudo multi-view live broadcast: based on HLS. The client pulls only one stream at a time, and switching views requires switching the stream address and restarting playback;
  • Multi-stream multi-view live broadcast: based on HLS. The client pulls multiple streams at the same time and decodes and renders all of them, so switching views does not require switching the stream address or restarting playback;
  • Single-stream multi-view live broadcast: based on RTP. The client pulls only one stream at a time, and switching views does not require restarting playback.

Horizontal contrast

Comparison table (the original marks the weaker characteristics in red):

Plan | Protocol | Simultaneous preview | Seamless switching | Bit rate | Performance burden | Incremental cost
Single-stream pseudo multi-view live broadcast | HLS | No | No | Ordinary | Ordinary | None
Multi-stream multi-view live broadcast | HLS | Yes | Yes | High | High | CDN
Single-stream multi-view live broadcast | RTP | Yes | Yes | Ordinary | Ordinary | Edge service and traffic bandwidth

After this comparison, we chose the RTP-based single-stream multi-view live broadcast, that is, the edge solution.

WebRTC overview

WebRTC (Web Real-Time Communication) is an API that lets web browsers hold real-time voice or video conversations. It was open-sourced on June 1, 2011 and, with the support of Google, Mozilla, and Opera, was adopted as a W3C Recommendation. By default WebRTC transmits audio and video data over UDP (in practice, RTP/RTCP on top of UDP), although TCP transport is also possible. Today it is mainly used for video conferencing and co-hosting (Lianmai).

WebRTC internal structure

The Voice Engine handles audio capture and transmission, and includes functions such as noise reduction and echo cancellation. The Video Engine handles network jitter optimization and codec optimization for Internet transmission. A set of C++ APIs sits on top of the audio and video engines.

WebRTC protocol stack

ICE, STUN, and TURN are used for NAT traversal, which solves the problem of obtaining and binding externally mapped addresses. DTLS (the datagram counterpart of TLS) encrypts the transmitted content. RTP/SRTP carries audio and video data with strict real-time requirements, and RTCP/SRTCP controls RTP transmission quality. SCTP carries custom application data.

System design

The end-to-end multi-view live broadcast pipeline involves several stages: stream production, the domain name scheduling service, the edge service, the confluence service, and the live broadcast control service. The overall architecture is as follows:

  • Stream production: cameras capture the streams and upload them to the live broadcast center, which encodes and optimizes them;
  • Confluence service: aligns the audio and video of multiple input streams by timestamp, then outputs multiple mixed streams according to a template;
  • Edge service: pre-caches the output streams of the confluence service, encodes the output streams, and transmits them to the client over an RTP connection;
  • Domain name scheduling service: configures the IP and port used for communication between the client and the edge service;
  • Broadcast control service: provides broadcast control functions such as configuring confluence service request parameters and multi-view business capabilities.

Detailed design

To reuse as much of the main broadcast control link on the client as possible and to minimize intrusion into the main playback link, we added a multi-view player. It exposes the same calling interface as the other players, and the broadcast control layer decides whether to create a multi-view player according to the service type. A multi-view plug-in is added to the business layer; it is decoupled from other business logic and can be mounted and removed easily.

The client-side structure is designed as follows:

Core process

The main start-up process of the user entering the multi-view mode is as follows:

Currently, 3 display modes are supported, namely mix mode, cover mode and source mode. The following figure shows the specific mode and the switching process.

Client-side instructions

The client and the edge node interact mainly through instructions and transfer data over RTP. The general header format is as follows; PT=96 denotes H.264 media data.

  • Connection establishment instruction: blocking. The client establishes an RTP connection with the edge node and waits for the edge node's response. If no response arrives within a certain period, the connection attempt is considered failed;
  • Disconnect instruction: non-blocking. The client disconnects from the edge node without waiting for a response;
  • Play instruction: non-blocking. The client issues the play instruction with the ID of the stream to play and the OSS stream configuration information, so that the edge node can find the correct stream;
  • Stream cut instruction: non-blocking. The client issues a view switch instruction and passes the frame timestamp of the original view to the edge service, so that the new view stays synchronized with the original one.
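
As a minimal sketch of the header mentioned above, the following Python parses the 12-byte fixed RTP header as defined in RFC 3550 and checks the payload type. The article's own header diagram is not reproduced here, so the layout shown is the standard one and may differ from any custom fields the edge service adds. The PT=96 mapping to H.264 comes from the text; the names `RtpHeader`, `parse_rtp_header`, and `H264_PAYLOAD_TYPE` are illustrative only.

```python
import struct
from typing import NamedTuple

H264_PAYLOAD_TYPE = 96  # from the article: PT=96 carries H.264 media data


class RtpHeader(NamedTuple):
    version: int
    padding: bool
    extension: bool
    csrc_count: int
    marker: bool
    payload_type: int
    sequence_number: int
    timestamp: int
    ssrc: int


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Parse the 12-byte fixed RTP header (RFC 3550, section 5.1)."""
    if len(packet) < 12:
        raise ValueError("RTP packet shorter than the fixed header")
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    return RtpHeader(
        version=b0 >> 6,
        padding=bool(b0 & 0x20),
        extension=bool(b0 & 0x10),
        csrc_count=b0 & 0x0F,
        marker=bool(b1 & 0x80),
        payload_type=b1 & 0x7F,
        sequence_number=seq,
        timestamp=ts,
        ssrc=ssrc,
    )


# Build a sample packet (version 2, marker set, PT=96) and parse it back.
sample = struct.pack("!BBHII", 0x80, 0x80 | 96, 1234, 90000, 0xDEADBEEF) + b"\x00"
header = parse_rtp_header(sample)
is_h264 = header.payload_type == H264_PAYLOAD_TYPE
```

Payload types from 96 upward are dynamic in RTP, so the meaning of 96 is agreed between the client and the edge service rather than fixed by the protocol.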

Project implementation

Playback ability adjustment

  • Audio sampling rate adjustment

WebRTC currently does not support a 48 kHz audio sampling rate by default, while large-scale live broadcasts generally use relatively high sampling rates. If the audio is delivered to the client unadjusted, the audio decoding module crashes;

  • AAC audio decoding capability expansion

WebRTC encodes audio with Opus by default, but most current live broadcasts use AAC. The client therefore needs to add AAC codec information to the offer SDP and implement an AAC decoder, while the edge service delivers the AAC data in RTP packets accordingly;

  • H.264 encoding support

To keep latency as low as possible, WebRTC encodes video with VP8 and VP9; this needs to be migrated to the more widely used H.264;

  • Transmission method

WebRTC uses P2P for media transport, which only solves the end-to-end problem and is clearly unsuitable for large-scale PGC and OGC live content. The network transport layer therefore contains no hole-punching logic: streaming media data is transmitted over an RTP connection, and RTCP handles transmission quality control.

Access to existing playback system

To minimize the impact on the main player, the multi-view player keeps the same data request flow, the same playback capability interface, and the same event-tracking timing as the main player:

  • Multi-view player: wraps WebRTC into a dedicated player for multi-view video playback;
  • Broadcast control logic adjustment: adjust the live bypass data acquisition process, and merge the data returned by service, AMDC, and other services into the data flow;
  • Playback capability and statistics adjustment: keep the original playback capabilities and callback interfaces, and differentiate by extending interface parameters;
  • Extended error codes: extend the multi-view player's error codes based on the main player's error rules.

Client-side problem solving

On the client, the main playback window shown in the figure below must be rendered at the same time as several sub-windows:

The scrolling list on the right is implemented with UITableView (RecyclerView on Android). A rendering instance, GLKView (managed by the RTCEAGLVideoView wrapper), is added for each sub-window, and each sub-window's render frame is created, removed, and updated at the appropriate time, so that a single stream drives a synchronized multi-view live broadcast.

Along the way we mainly solved two problems: flicker when switching views (an experience issue) and black-screen rendering caused by a memory leak (a stability issue). Both are briefly described below.

Playback flicker

[Problem description] With N viewing angles, our scheme has N*3 stream layouts in total (see the 3 display modes in the system design). Each time the user taps a view to switch, the actual flow is: the player sends an RTP stream cut instruction, the edge node switches the stream ID and returns an SEI message, and after receiving the SEI the player cuts over and updates each window. The problem is that the windows take some time to update after the switch succeeds, so the first frame or first few frames of the new stream are still rendered with the old stream's template, and the user sees a momentary residual image.

[Solution] After the playback kernel receives the stream cut instruction, it opens a short frame-dropping window and resumes rendering once it receives the SEI message indicating that the stream cut succeeded. Because no new frames are rendered during this window, the user sees a static frame, which hides the frames that would otherwise scramble the sub-window content. The UI layer no longer shows residual-frame flicker, and the switch feels essentially smooth to the user.
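
The frame-dropping window can be sketched as a small gate in front of the renderer. This is only an illustration of the idea described above, not the playback kernel's actual code: the class name `SwitchGate`, its methods, and the `max_dropped_frames` safety limit are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SwitchGate:
    """Drops frames between a stream cut request and its SEI confirmation."""

    max_dropped_frames: int = 30  # assumed safety limit, not from the article
    pending_stream_id: Optional[str] = None
    dropped: int = 0
    rendered: List[str] = field(default_factory=list)

    def request_switch(self, stream_id: str) -> None:
        # Called when the stream cut instruction is sent to the edge node.
        self.pending_stream_id = stream_id
        self.dropped = 0

    def on_sei(self, stream_id: str) -> None:
        # The SEI message confirms that the cut to stream_id succeeded.
        if stream_id == self.pending_stream_id:
            self.pending_stream_id = None

    def on_frame(self, frame: str) -> bool:
        """Return True if the frame was rendered, False if it was dropped."""
        if self.pending_stream_id is not None:
            self.dropped += 1
            if self.dropped <= self.max_dropped_frames:
                return False  # the last rendered frame stays on screen
            self.pending_stream_id = None  # stop waiting and resume rendering
        self.rendered.append(frame)
        return True


gate = SwitchGate(max_dropped_frames=3)
gate.on_frame("a1")          # rendered
gate.request_switch("B")
gate.on_frame("a2")          # dropped while waiting for the SEI
gate.on_sei("B")
gate.on_frame("b1")          # rendered again
```

Bounding the window keeps the picture from freezing indefinitely if the SEI confirmation is lost; how long the real window lasts is not stated in the article.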

Memory leak

[Problem description] When the user keeps switching views, the list is refreshed and cells are recreated, so memory usage keeps growing. Once it reaches the upper limit, the windows render black.

[Solution] The leak occurred because, each time cells were recreated, the kernel created a new batch of OpenGL rendering instances without destroying the old ones in time. We first confirmed that the upper-level business code does release the UIView objects through removeFromSuperview, yet printing the reference count in lldb showed it was still not 0, so the problem lay in the kernel's instance management. __bridge_retained means that once an Objective-C object is converted to a Core Foundation object, its memory must be released manually. Therefore, when switching views, the relevant Filter must be called to destroy and release the instances at the C++ level. Memory behavior after the fix:

Service concurrency optimization

The pre-optimization version had been used to support live broadcasts of CUBA events, but it had many problems: high cost, high switching latency, and no support for large-scale concurrency. The goal of this optimization was to support live broadcasts of large-scale events and greatly reduce cost while preserving the experience.

Pre-encoding

The figure below shows the pipeline before optimization. Each client requires its own layout and encoding pass, and both operations consume a lot of resources. Even with a T4 GPU, only 20 concurrent viewing channels could be supported.

Analysis shows that in the multi-view scenario the set of combinations a user can watch is limited. In the street dance (hip-hop) setup there are 3*N viewing layouts, where N is the number of original capture angles. If we pre-produce these 3N streams, the edge service can simply copy the already encoded H.264 data, which greatly reduces CPU consumption and increases the number of users served by at least an order of magnitude.
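
To make the 3*N count concrete, here is a small sketch that enumerates the streams to pre-produce for a given number of capture angles. The mode names come from the three display modes mentioned earlier; the stream ID format `view{n}-{mode}` and the function name are hypothetical.

```python
from itertools import product
from typing import List

DISPLAY_MODES = ("mix", "cover", "source")  # the three display modes in the article


def preproduced_stream_ids(view_count: int) -> List[str]:
    """List every (view, mode) stream that would be produced ahead of time."""
    views = range(1, view_count + 1)
    return [f"view{v}-{m}" for v, m in product(views, DISPLAY_MODES)]


streams_for_4_views = preproduced_stream_ids(4)
streams_for_9_views = preproduced_stream_ids(9)
```

Because the set is fixed by the number of angles and not by the number of viewers, the encoding cost no longer grows with the audience.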

The confluence service in the figure only needs to produce the streams once, and the output can be shared by all edge service nodes.

But this simple stream cut has a problem: it cannot achieve frame-level alignment. When the user selects a different view, the new view is not continuous in time with the previous one. To solve this, we asked the Alibaba Cloud director service to align streams by absolute timestamp when merging them, and to pass this timestamp information through to the edge service.

Solving PTS alignment still does not give true frame-level alignment, because GOP alignment is also a problem. Readers familiar with H.264 know that video is encoded and decoded in units of GOPs. A GOP may last 1 s to 10 s or even longer, and a typical length in live broadcasting is 2 s. If the user happens to switch exactly at the start of a new GOP, a simple stream cut works, but we cannot dictate when users may or may not switch. Users want to switch whenever they like. Our solution: if no GOP starts at the moment the user switches, the edge service produces one.

As the figure above shows, when the user switches from view 1 to view 2, we generate a new GOP at the switching point, so that once the stream is pushed to the client, the client can decode and render the new view seamlessly.

These steps greatly reduced encoding consumption. However, to respond quickly to a view switch, we had to keep the raw frames (YUV420) of all source views ready, so that they could be used directly whenever a new GOP had to be generated. But requirements keep changing. With 4 views we could handle 4*3=12 streams. When the business asked for 9*3=27 streams, decoding 27 channels of 1080p video simultaneously overwhelmed an entire 32-core machine, not to mention that even more sources might be required in the future.
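
The switching rule can be sketched as a per-frame decision on the target view. This is a simplified model under stated assumptions, not the edge service's implementation: frames are assumed to be listed in presentation order with aligned PTS values, and the edge service is assumed to return to copying pre-encoded data at the next original keyframe. `Frame` and `plan_switch` are illustrative names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Frame:
    pts: int           # presentation timestamp, aligned across views
    is_keyframe: bool  # True for the first frame of a GOP


def plan_switch(target: List[Frame], switch_pts: int) -> List[Tuple[int, str]]:
    """Label each target-view frame from switch_pts onward.

    "copy"     : forward the pre-encoded H.264 data unchanged
    "new_idr"  : the edge service encodes this frame as the start of a new GOP
    "reencode" : the edge service encodes this frame inside that new GOP
    """
    plan: List[Tuple[int, str]] = []
    copying = False
    for frame in target:
        if frame.pts < switch_pts:
            continue
        if frame.is_keyframe:
            copying = True  # an original GOP starts here, so copying is safe
        if copying:
            plan.append((frame.pts, "copy"))
        elif not plan:
            plan.append((frame.pts, "new_idr"))
        else:
            plan.append((frame.pts, "reencode"))
    return plan


# Target view with GOPs starting at pts 0 and pts 4.
view2 = [Frame(pts=i, is_keyframe=(i % 4 == 0)) for i in range(8)]
mid_gop_plan = plan_switch(view2, switch_pts=2)
aligned_plan = plan_switch(view2, switch_pts=4)
```

Only the frames between the switch point and the next original keyframe need encoding work, which is why this costs far less than encoding a full stream per viewer.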

On-demand decoding

Users want more views, and we need to deliver them. That meant abandoning the previous approach of pre-decoding everything, so we implemented on-demand decoding: only when decoding is actually needed do we prepare the raw frames (YUV420) of that stream. The biggest challenge is real-time performance, because at the moment the user switches, the picture may be anywhere within a GOP, and we can only decode from the first frame of that GOP. After several rounds of optimization, however, the delay is imperceptible to the user.
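
The cost of on-demand decoding depends on how far into a GOP the switch lands. The sketch below, with the hypothetical helper `frames_to_decode`, finds the GOP start at or before the switch point and lists the frames that have to be decoded to reach it. It assumes decode order equals presentation order (no B-frames), which is a simplification.

```python
from bisect import bisect_right
from typing import List


def frames_to_decode(
    keyframe_pts: List[int], frame_pts: List[int], switch_pts: int
) -> List[int]:
    """Return the PTS values that must be decoded to obtain the frame at switch_pts.

    Both lists are sorted ascending. Decoding has to start at the last
    keyframe at or before switch_pts.
    """
    idx = bisect_right(keyframe_pts, switch_pts) - 1
    if idx < 0:
        raise ValueError("no GOP start at or before the switch point")
    gop_start = keyframe_pts[idx]
    return [p for p in frame_pts if gop_start <= p <= switch_pts]


all_frames = list(range(12))   # pts 0..11
gop_starts = [0, 4, 8]         # a keyframe every 4 frames
late_in_gop = frames_to_decode(gop_starts, all_frames, switch_pts=6)
at_gop_start = frames_to_decode(gop_starts, all_frames, switch_pts=8)
```

A switch late in a GOP requires decoding more frames before the raw frame is available, which is the real-time problem the paragraph above refers to.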

Client dynamic cache

Anyone who has worked on audio and video will be familiar with the stutter rate, which year after year proves hard to bring down. Multi-view live broadcast makes it harder still: the most basic way to reduce stuttering is to increase the playback cache, but we also want users to see a new view quickly, which requires the cache not to be too large. So we introduced dynamic caching. Put simply, when the user switches, we use a very low cache water level to keep switching fast; when the user is not switching, the cache slowly recovers to a certain level to provide some resistance to network jitter.
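
A minimal sketch of the dynamic cache policy follows. The article does not give the actual water levels or the recovery curve, so the millisecond values and the linear recovery step are assumptions, as are the names `DynamicBuffer`, `on_switch`, and `on_tick`.

```python
from dataclasses import dataclass


@dataclass
class DynamicBuffer:
    """Target playback buffer that drops on a view switch and recovers gradually."""

    low_ms: int = 100      # assumed low-water target right after a switch
    normal_ms: int = 1000  # assumed steady-state target
    step_ms: int = 100     # assumed recovery per tick
    target_ms: int = 1000

    def on_switch(self) -> None:
        # Prioritize switching speed: shrink the buffer target immediately.
        self.target_ms = self.low_ms

    def on_tick(self) -> None:
        # Called periodically while the user is not switching.
        self.target_ms = min(self.normal_ms, self.target_ms + self.step_ms)


buffer = DynamicBuffer()
buffer.on_switch()
target_after_switch = buffer.target_ms
for _ in range(3):
    buffer.on_tick()
target_after_three_ticks = buffer.target_ms
```

The trade-off is explicit in the two targets: the low level favors fast switching, and the normal level favors resistance to network jitter.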

Summary and outlook

The multi-view capability officially launched for the hip-hop finals. As a unique and innovative Youku feature, it received a lot of positive feedback from users on Weibo and WeChat Moments. Going forward, Youku plans to further improve the user experience through interaction and latency optimizations.
