(Figure 1: Rongyun "Real-time Communication Technology" special session at the QCon Global Software Development Conference)

Today I am sharing "Full-link Quality Tracking and Metric System Construction for Real-time Communication". The content covers five parts: the quality challenges faced by real-time audio and video platforms, the overall architecture of Rongyun's real-time audio and video quality control, client-side real-time audio and video quality detection, the establishment and quality tracking of the SDN "big network", and the tracking and querying of link problems.

To get the complete slide deck, reply "Xu Jie" to the Rongyun official WeChat account.


(Figure 2: Xu Jie delivering a keynote speech at QCon)

Quality challenges faced by real-time audio and video platforms

The challenges faced by real-time audio and video platforms can be summarized as three "diversifications":

First, terminals are diversified. This includes OS version features, hardware codec differences, multiple app SDK versions in the wild, and so on.

As we all know, iOS and Android are relatively young systems still in rapid iteration. During this iteration, issues with both new and legacy features, including vendor compatibility, are quite prominent, which poses great challenges for a ToB platform. Unlike the PC side, where upgrades can be pushed by a background engine, the situation on mobile is very difficult: many apps are still running versions from several years ago.

The so-called "hard coding" refers to hardware encoding and decoding, which has become ubiquitous in the mobile Internet era. Many streamers broadcast live outdoors; if pure software were used for encoding and decoding, the battery would quickly run out. This is why, although newer codecs have been released one after another, the most popular is still H.264: its dominance has a lot to do with hardware support.

Second, the global network is diversified. This involves differences in network types, cross-border egress, differences in overseas infrastructure, restrictions on dedicated-line construction, and more. Everyone has likely felt the first two; in the process of helping a large number of enterprises go overseas, Rongyun has also deeply experienced how different overseas infrastructure is from China's.

Third, user scenarios are diversified: 1v1 calls, education, finance, conferencing, low-latency live streaming, chat rooms, and so on. A simple example: if the average viewing time in an app's live broadcasts is normally 10 minutes but suddenly drops to 5 minutes during a certain period, we can investigate whether this fluctuation is caused by quality, such as degraded picture quality, or by stuttering so severe that users lose patience and switch rooms frequently. In this way, QoE monitoring lets us infer quality indirectly.
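The viewing-time example above can be sketched as a simple alert rule. This is a minimal illustration, not Rongyun's actual QoE logic; the drop-ratio threshold is a hypothetical parameter.

```python
def qoe_alert(baseline_minutes, current_minutes, drop_ratio=0.3):
    """Flag a possible quality issue when average viewing time drops
    sharply versus its baseline. drop_ratio is illustrative only."""
    if baseline_minutes <= 0:
        return False
    drop = (baseline_minutes - current_minutes) / baseline_minutes
    return drop >= drop_ratio

# A fall from 10 minutes to 5 minutes is a 50% drop, which would
# trigger an investigation into picture quality and stuttering.
```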

The overall architecture of Rongyun real-time audio and video quality control

The demand-side "challenges" above must be translated into technical terms when we switch to a problem-solving perspective: what problems does the quality architecture need to solve?

The first is algorithm optimization, including codec optimization and weak-network optimization. Starting from the laboratory environment, we check whether our R&D quality meets production requirements and how results generalize to the production environment; we first push changes to the production network at a small scale, and roll them back if problems occur.

The second is the SDN. Building the SDN means evaluating a server's coverage capacity after a site is selected, especially in overseas regions such as Southeast Asia where users are widely distributed; the coverage of base stations and data centers also needs strict review. Beyond that, the current quality of dedicated lines, global link quality, and disaster recovery all need timely monitoring.

The third is customer experience. When customers place their trust in us, how do we provide reliable real-time global status monitoring? Once a problem shows up in a summarized data report, how do we query the link details and quickly locate the cause? These are all issues that need to be resolved.

With these questions in mind, let's look at the overall architecture of Rongyun's real-time audio and video quality control system (Figure 3).


(Figure 3 Overall architecture of Rongyun RTC quality system)

The bottom layer is R&D management, including use-case management, automated verification, link tracing, data dashboards, and early warning.

The middle layer is the construction of Rongyun's global "big network" SDN. For example, the log system pulls logs from more than a dozen large nodes around the world for real-time analysis, and routing management is the "brain" of the entire system, covering path planning, disaster recovery, and so on.

At the top is the customer platform, with reports, data dashboards, and a link-information self-check system.

Client real-time audio and video quality detection

Client-side RTC quality detection is the focus of this talk.


(Figure 4 RTC audio and video processing flow & quality factors)

As shown in Figure 4, on the sending side data is captured, pre-processed, encoded, and sent; on the receiving side it is decoded, post-processed, and rendered. This is the typical RTC data processing pipeline.

Obviously, this pipeline is linear. The trouble this causes is that once one stage goes wrong, the quality of every subsequent stage suffers, just like a water pipe: a blockage anywhere stops the flow.

Let's look at the stages one by one:

On the capture side there can be hidden problems with hardware, focus, and noise. Hardware issues include the resolution of the device itself. Focus issues are often overlooked: software autofocus can in fact fail. Anyone familiar with SLR cameras knows that with a telephoto lens the depth of field is very shallow, so the picture may not be sharp. Noise means that electronic sensors produce noise at high sensitivity, and these random white dots have a large impact on overall quality.

Once there is noise, pre-processing is needed, that is, processing before encoding, including beautification, down-sampling, denoising, and so on. At present single-frame denoising is the mainstream, and the effect is acceptable. The drawback is that neighboring frames in the video are not referenced, so the average values differ from frame to frame and the final residual is large.

In the encoding stage, color conversion, transform, and quantization are the biggest causes of picture degradation, because encoding is constrained by network transmission. As we all know, the focus of RTC is to optimize the network: maximize available bandwidth, improve resilience to packet loss and jitter, and thereby create better conditions for encoding.

Decoding is the reverse of encoding, and conversion and transform problems are encountered again. As for "filtering": the encoders we use are basically hybrid, block-based encoders, and the error in average value between blocks of an image is noticeable, so a deblocking filter is applied to remove this blocking artifact.

In post-processing, besides the video sampling mentioned above, the main problem in audio is packet loss. To compensate for the auditory effect of lost packets, features such as time-scale modification without pitch change, and even comfort noise, are added.

Rendering issues are mainly hardware-related, such as insufficient display speed: decoding is fine, yet the final frame rate is not uniform.

For the above problems, the industry has some commonly used evaluation metrics, in two main categories: subjective and objective.


(Figure 5 Commonly used evaluation indicators)

The most representative subjective metric is MOS. Its advantage is that it is people-oriented; its disadvantages are that it is expensive, cannot be accurately reproduced, and judgments vary with the tester's mood.

Therefore, R&D personnel hope to replace manual evaluation with machines. There are typically two types: full-reference and no-reference.

No-reference metrics, such as blurriness and blockiness, have the advantage of requiring only receiver-side data. The disadvantage is that judgment is weaker and problems cannot be localized to inside or outside the system: if the final image looks bad, we cannot tell whether the source itself was bad or a problem was introduced along the way.

Full-reference metrics, such as PSNR and VMAF, are technically easy to operate, can be run repeatedly, and are accurately reproducible, which makes it convenient to locate problems quickly. The disadvantage is that data from both sides is required, and the original and target images must be strictly matched. The complication for RTC is that it is designed to tolerate network packet loss.
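PSNR, the simplest of the full-reference metrics, can be computed directly once sender and receiver frames are matched. A minimal sketch for 8-bit frames, assuming they are available as flat lists of pixel values:

```python
import math

def psnr(original, received, max_val=255):
    """Full-reference PSNR between two equally sized 8-bit frames,
    each given as a flat list of pixel values."""
    assert len(original) == len(received)
    mse = sum((a - b) ** 2 for a, b in zip(original, received)) / len(original)
    if mse == 0:
        return float("inf")  # identical frames
    return 10 * math.log10(max_val ** 2 / mse)
```

The need for a strict one-to-one frame correspondence is exactly why the alignment schemes below are required.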

For example, the sender sends ten pictures but the receiver only receives eight. This breaks the one-to-one correspondence required by full reference, and we don't know which two are missing. How do we solve this?


(Figure 6 Video alignment scheme)

As shown in Figure 6, we refer to Intel's solution. When we get an original image, we overlay a frame number in its upper-left corner; the numbered image is fed into RTC as the source. The receiving end decodes it and extracts the number via deep-learning text recognition, so the two sides can be matched up.
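Once the numbers have been extracted on both sides, the matching step itself is straightforward. A sketch, assuming each frame record carries its recognized number (the dict layout here is hypothetical):

```python
def match_frames(sent, received):
    """Pair sender and receiver frames by the overlaid frame number
    (assumed already extracted, e.g. by text recognition on the
    receiver side). Returns (matched_pairs, lost_numbers)."""
    recv_by_num = {f["num"]: f for f in received}
    pairs, lost = [], []
    for f in sent:
        r = recv_by_num.get(f["num"])
        if r is None:
            lost.append(f["num"])   # frame never arrived
        else:
            pairs.append((f, r))    # ready for full-reference scoring
    return pairs, lost
```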

That covers video alignment; what about audio? Audio alignment mainly exploits a physical fact: the energy of human speech is concentrated below 4 kHz, so 4 kHz can be used as the dividing line.


(Figure 7 Audio alignment scheme)

Figure 7 shows our audio alignment solution. After getting the audio signal, we first transform from the time domain to the frequency domain and apply low-pass filtering there; the original information is retained for subsequent comparison with the target. After the audio number is encoded, the signal is converted back to the time domain and sent through RTC. The receiver converts from the time domain to the frequency domain, extracts the audio number, and finally recovers the original sound through low-pass filtering.
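To illustrate the low-pass step, here is a crude single-pole time-domain filter with a 4 kHz cutoff. This is only a sketch of the idea; the scheme above actually filters in the frequency domain after a transform, and the sample rate here is an assumption.

```python
import math

def lowpass(samples, cutoff_hz=4000, sample_rate=48000):
    """Crude first-order low-pass filter: attenuates content above
    cutoff_hz while passing the speech band below it."""
    rc = 1.0 / (2 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = dt / (rc + dt)  # smoothing factor derived from the cutoff
    out, prev = [], 0.0
    for x in samples:
        prev = prev + alpha * (x - prev)
        out.append(prev)
    return out
```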

In addition to the video and audio alignment scheme, another important scheme is clock alignment.

In a 1v1 call, a 500-millisecond delay affects the interactive experience. We have all encountered this with real-time communication products: one person tries to speak between the other's sentences, but because the delay is too large (over 500 milliseconds) the other party has not yet received the signal and does not stop. The two end up repeatedly talking over and yielding to each other, and the whole interaction descends into chaos. This is the main reason we need to measure latency.

In addition, automated verification often uses multiple terminal devices, whose clocks can actually differ by several seconds. That is far from the millisecond-level error we want, so we must first synchronize the terminals and keep the error within 100 milliseconds. In our laboratory environment the error can even be kept within tens of milliseconds.


(Figure 8 Clock alignment scheme)

Figure 8 shows the clock alignment scheme. Before testing, each client sends its own timestamp to a unified time server. The time server returns client A's timestamp together with the server's own timestamp; when the client receives the reply packet, it records its own timestamp B.

Corrected time = server timestamp + (client timestamp B - client timestamp A) / 2. In other words, all our clients align with the time server, so even if the time server itself has an error it does not matter: everyone shares the same reference.
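The formula above can be sketched directly. The half-round-trip term assumes the request and reply take roughly symmetric paths:

```python
def corrected_time(server_ts, client_ts_a, client_ts_b):
    """Corrected client time per the scheme above: the server's
    timestamp plus half the request round trip, where A and B are the
    client's timestamps at send and receive."""
    return server_ts + (client_ts_b - client_ts_a) / 2

# The difference between corrected time and the client's own timestamp B
# is the offset to apply to all of that client's subsequent logs.
```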

In summary, with video, audio, and clock alignment in place, we can complete the full-reference comparison.


(Figure 9 Analysis method)

As shown in Figure 9, the table on the left is the sender's, recording frame number, timestamp, and image binary info; the receiver's table is similar. Here we simulate packet loss (frames 3 and 4 are lost). With these two tables, quality analysis can detect frame loss, frame-rate and fluency changes, full-reference picture quality, delay, jitter, and other anomalies.
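A sketch of the table comparison, assuming both logs map frame number to a timestamp in milliseconds and the clocks have already been aligned as described above (the jitter definition here, max minus min delay, is a simplification):

```python
def analyze(send_log, recv_log):
    """Compare sender/receiver logs keyed by frame number.
    Returns lost frame numbers, per-frame delay, and a simple jitter
    figure (spread between the largest and smallest delays)."""
    lost = sorted(set(send_log) - set(recv_log))
    delays = {n: recv_log[n] - send_log[n]
              for n in recv_log if n in send_log}
    jitter = max(delays.values()) - min(delays.values()) if delays else 0
    return lost, delays, jitter
```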

Let's summarize the full-reference model described above.


(Figure 10 Audio and video end-side feature processing)

As shown in Figure 10, the upper and lower dashed boxes are the video and audio processing flows. Before sending to WebRTC, the original video and a timestamp log must be recorded; after the receiver recognizes the frame number, the target video and its timestamp log are recorded as well. Audio is similar. In addition, some of WebRTC's own internal state is recorded, which makes it easier to quickly localize the final result and form a preliminary judgment of the problem.

Figure 11 is the overall framework of our automated testing.


(Figure 11 Automation framework)

On the far left, a verification management server acts as the master controller. Given a test, it first simulates the network environment, automatically configuring the weak-network instrument, servers, and terminals; then it runs the actual test to generate all the files mentioned above; finally it completes the summary analysis and writes the results to the database.

As shown in Figure 12, the automated verification process works as follows: R&D changes coding, network, or other characteristics and submits a new version. The version is first deployed to the verification environment; use-case information is obtained, the weak-network instrument is configured, the test is executed, and the results are analyzed. If further use cases are found, the loop continues; finally a report is generated.

(Figure 12 Automated verification process)

As for trigger conditions: whether it is weak-network algorithm optimization, audio/video algorithm optimization, process optimization, or OS feature follow-up, no matter what is modified, in principle everything must be verified, covering network environments, backward compatibility, special-model coverage, historical coverage, accurate measurement, and so on.

SDN "big network" establishment and quality tracking

SDN big-network construction has two main goals:

One is real-time construction. The main functions are node-status collection, path planning, and line optimization. Specific indicators include multi-line QoS between nodes, node degree and betweenness, node load pressure, number of reachable paths, and so on.

The other is rapid self-healing. Once the network is established, day-to-day operation can see node failures and temporary private-line outages, which require disaster-recovery handling. Disaster recovery first needs an error collection and feedback mechanism, plus path re-planning and key-line load balancing; when problems arise, continuous optimization is needed.
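Path re-planning over measured inter-node QoS can be sketched as a shortest-path search that skips failed links. This is an illustration of the idea, not Rongyun's routing-management implementation; the node names and RTT figures are made up.

```python
import heapq

def best_path(links, src, dst, failed=frozenset()):
    """Pick the lowest-RTT path over measured inter-node links,
    excluding any link marked as failed (a sketch of path re-planning
    for disaster recovery). links maps (a, b) -> RTT in ms."""
    graph = {}
    for (a, b), rtt in links.items():
        if (a, b) in failed or (b, a) in failed:
            continue  # damaged line: exclude from planning
        graph.setdefault(a, []).append((b, rtt))
        graph.setdefault(b, []).append((a, rtt))
    heap, seen = [(0, src, [src])], set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, rtt in graph.get(node, []):
            if nxt not in seen:
                heapq.heappush(heap, (cost + rtt, nxt, path + [nxt]))
    return None  # no surviving route
```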

There are four main link modes between nodes: public network, in-cloud private line, dedicated private line, and SD-WAN.


(Figure 13 Link mode and quality characteristics between nodes)

The public network is cheap but relatively unstable. In-cloud private lines are cheaper than other private lines and very convenient to deploy, but they cannot extend beyond the cloud provider, and repair times are longer than with SD-WAN. A dedicated private line is pulled directly when, for example, the quality of the cross-border link from Shanghai to Singapore is poor; the disadvantage is that some operators cannot be covered. SD-WAN is software-defined networking, with strong link capability and automatic repair, at a somewhat higher cost.

In practice, several link modes are used at the same time; once one path has a problem, traffic can be switched promptly.

In selecting cascade links, we mainly balance three factors: quality, scenario, and cost. For example, 1v1 calls require lower delay, while live streaming relaxes delay but demands higher total bandwidth; cost is then weighed in comprehensively.
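That three-factor balance can be sketched as a weighted score. The weights and candidate fields here are purely illustrative, not Rongyun's actual values.

```python
def pick_link(candidates, w_quality=0.5, w_scenario=0.3, w_cost=0.2):
    """Choose a cascade link by weighing quality, scenario fit, and
    cost. Each candidate carries 0-1 scores, with the cost score
    already inverted so that cheaper links score higher."""
    def score(c):
        return (w_quality * c["quality"]
                + w_scenario * c["scenario_fit"]
                + w_cost * c["cost_score"])
    return max(candidates, key=score)
```

For a 1v1 call one would raise the weight on delay-sensitive quality; for live streaming, scenario fit would emphasize total bandwidth instead.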


(Figure 14 Real-time quality information collection)

As shown in Figure 14, when collecting real-time quality information, all service data within a node is aggregated to that node's Agent for preliminary analysis, then sent to the real-time status management server, which synchronizes across nodes. In this way we know the load and current bandwidth quality of every node in the global network.

Tracking and querying of link problems

Rongyun has built a monitoring system for its global links. Through this platform we can query a user's access point, access device, and Rongyun SDK version, obtain the user's published and subscribed streams during a session, and monitor the overall status of audio and video services through various monitoring points.


(Figure 15 Global Link Monitoring Kanban)

In this way we can monitor service quality and performance in real time, analyze the causes affecting customer experience, and provide developers with detailed location information, accurate parameters, and actual scene conditions, ultimately helping R&D quickly locate the underlying problem and formulate precise optimization plans.

