Introduction: On October 21, 2021, the QCon Global Software Development Conference will be held in Shanghai. As producer, Chen Gong, VP of NetEase Intelligent Enterprise Technology, is launching the special session "Converged Communication Technology in the AI Era" and has invited technical experts from NetEase Yunxin, the NetEase Audio and Video Laboratory, and NetEase Cloud Music to share the trends and evolution of converged communication technology, the exploration and practice of key video communication technologies, the practice of audio AI algorithms in RTC, and the cross-platform practice of the NetEase Cloud Music network library.

We are introducing the four talks one by one. This is the fourth installment: the cross-platform practice of the NetEase Cloud Music network library.

Guest introduction: Chen Songmao joined NetEase Cloud Music at the end of 2020 and is the technical lead for its one-stop network solution. He currently works on cross-platform network solutions, aiming to reduce both the development cost of network work across multiple platforms and the integration cost across multiple apps, so that sustained and considerable gains in performance and energy efficiency can be obtained at a lower cost. He previously worked at Alibaba, has long been engaged in the research and application of Chromium-related technologies, and has rich experience in browser development and kernel upgrades.

Foreword

When optimizing the network, differences between the system network libraries force you to adapt the same optimization separately on every platform, and the workload feels as if it is growing exponentially.

Because several adapted versions exist, consistency cannot be fully guaranteed. In addition, the system network libraries expose different data collection capabilities, so it is hard to collect fully aligned data, which means the server has to handle the differences.

With the rise of the mobile Internet, the PC is becoming a niche platform. That does not mean the PC needs no network optimization or monitoring; you simply have no time to take care of it. Large companies can staff a huge team to dig into the protocol stack and redevelop and optimize it at any cost, but you hope to get a considerable improvement in network performance with limited resources.

Do these scenarios hit your pain points, or have they troubled you before? If I tell you that this talk can help with all of the problems above, would you be interested? And if I tell you that you may not need to do anything and can still shorten network response time by 30% to 40%, would you be even more interested?

The topic of this talk is the cross-platform practice of the NetEase Cloud Music network library. It focuses on going cross-platform and is mainly about engineering ideas. It has four parts: background introduction, scheme design, landing practice, and follow-up planning.

Background introduction

The following figure shows the current architecture of the Cloud Music network library on each platform. The upper-layer network strategies and monitoring services are implemented independently on each platform, which causes the following problems:

  1. Reinventing the wheel: every network strategy and every piece of quality monitoring is implemented separately on each platform. A new strategy has to be implemented again on every platform, which is a serious waste of people.
  2. Lack of consistency: differences between the network libraries and between the people involved mean that these wheels are built to different standards and to different degrees of completeness.
  3. Unequal resources: the mobile side has many developers and the PC side has very few, so many network strategies and much of the quality monitoring are missing on the PC side.
  4. Deep optimization is difficult: deep optimization of a network library requires a full understanding of the network modules. At present we can dig deep on Android, but it is impossible to have network engineers doing dedicated optimization on all four platforms. As a result, each platform gets only shallow optimization, simply because there is no one to do more.

The analysis above makes it clear that implementing a separate set on each platform is not sustainable. Given these pain points, a common demand naturally arises: let all platforms share one network solution. Going a step further, can all apps share one network solution?

The most important step toward solving these problems is to make the whole solution cross-platform, but that brings many challenges of its own.

  • The first is going cross-platform. At the functional level, is it only the strategies that need to sink into the shared layer? Once the strategies have sunk, can third-party SDKs sink too? In addition, there are many cross-platform approaches: should the solution be cross-platform only on mobile, or on both mobile and desktop?
  • The second is capability reuse. Can a network strategy rewritten in C++ be reused in different scenarios? Beyond restructuring the business logic, can some frameworks be extracted, and can those frameworks be reused? And when new requirements arrive later, can they be added easily on the existing basis?
  • The last is later promotion. It was clear from the start of the project that it would need to be promoted beyond NetEase Cloud Music. So what does it cost another app to integrate the network library, and is integration easy? That determines whether it can be promoted smoothly later. In addition, each app has its own characteristics. For example, NetEase Cloud Music has a free-data feature that other apps may not need, so different apps must be able to customize the library to their actual situation. This is also a challenge we face.

The scheme design that addresses these challenges is the core of this talk.

Scheme design

Cross-platform design ideas

The initial design idea was very simple. The green part on the right of the figure below is an SDK containing the core logic rewritten in C++. Android and iOS each integrate this SDK, and the network library underneath stays unchanged. This transformation is cheap to build but expensive to integrate: when a mobile app integrates, the SDKs we provide have to be repackaged. So we asked ourselves whether, instead of shipping five or six SDKs, we could merge them into a single SDK at integration time.

That led to a second solution: merge all the green upper-layer parts into one SDK. This requires abstracting a network proxy layer on top of the existing network libraries and building the strategies and business logic on that proxy layer. All the logic can then be merged into one SDK, and the overall integration cost drops sharply. However, the network proxy layer is very thick. As time passes the underlying network libraries change, and the proxy layer above them has to be adapted again and again, so the adaptation work never ends.

Thinking further, can the network library itself be unified? We removed the network proxy layer and replaced it with a common cross-platform network library, then built our strategies and services on top of it, so that the whole chain becomes cross-platform. This was the third idea.

To summarize, there are many ways to go cross-platform. At NetEase Cloud Music we chose the most thorough one: the entire network solution, including the network library, the upper-layer strategies, and some services, is cross-platform.

Cross-platform network library design

As mentioned above, the most critical step in going cross-platform is choosing a suitable cross-platform network library. We chose Cronet for the following reasons. It covers five platforms and fully supports the Cloud Music scenarios. At the protocol level it supports traditional HTTP, HTTP/2, and QUIC. In terms of adoption, Google's own apps all use Cronet; in China, Baidu, Weibo, and NetEase Media have integrated the official Cronet and built their own strategies and some monitoring services on top of it, while Toutiao and Mogujie have gone further and done secondary custom development on Cronet to form their own network libraries. Since Chromium was open sourced, domestic browsers have all been built on it, and the project is very active and constantly evolving. The last point is the open source license: Chromium is mainly under the BSD license, so we can modify Cronet without being forced to open source our changes. This is critical for a commercial company and was an important consideration in choosing Cronet.

The following figure shows the overall architecture of the entire cross-platform network library.

At the bottom are OS, Base, and Net. On top of Net we wrap a Common API layer, which acts as a glue layer that isolates Net and adds some basic capabilities of our own, including BI, FunctionBridge, and pluggable services. On top of these general capabilities we built a layer of components, including network policy, APM monitoring, and the HTTPDNS service. Finally, at the interface layer, we expose component APIs for the app to call.

We can use the Cronet API to send and receive network traffic, and we have added some extension APIs on top of it. For example, the original Cronet API does not support timeout settings (for connection establishment or between packets), does not return the specific IP for successful or failed requests, and is not very convenient for HTTPDNS.

Besides Cronet, the app also sees a component system. The app uses a unified configuration to enable internal components on demand (for example, HTTPDNS and APM are enabled by default here, but network policy is not necessarily enabled). In theory components are completely isolated from one another, but there are special cases, such as the network policy needing to talk to the HTTPDNS component. For these we wrapped some pluggable services internally: the HTTPDNS component exposes its capability as a pluggable service, and the network policy calls the HTTPDNS component through that service. The app communicates with components directly through a bridge (similar to JSBridge), so the cost of integrating and extending components later is very low.
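The pluggable-service idea can be sketched as a small name-based registry. This is only an illustration under assumptions: every class name, the service name "httpdns.resolve", and the placeholder host and address are hypothetical and do not come from Cronet or from the Cloud Music code base.

```cpp
#include <functional>
#include <map>
#include <optional>
#include <string>
#include <utility>

// Hypothetical names throughout; none of these are Cronet APIs.
// A service is a named function from a request string to a reply string.
using Service = std::function<std::string(const std::string&)>;

class ServiceRegistry {
 public:
  // A component publishes a capability under a well-known name.
  void Register(const std::string& name, Service service) {
    services_[name] = std::move(service);
  }

  // A caller looks the capability up by name. A component that is not
  // enabled yields std::nullopt instead of a link-time dependency.
  std::optional<std::string> Call(const std::string& name,
                                  const std::string& arg) const {
    auto it = services_.find(name);
    if (it == services_.end()) return std::nullopt;
    return it->second(arg);
  }

 private:
  std::map<std::string, Service> services_;
};

// HTTPDNS component: exposes host resolution as a pluggable service.
class HttpDnsComponent {
 public:
  explicit HttpDnsComponent(ServiceRegistry& registry) {
    registry.Register("httpdns.resolve", [](const std::string& host) {
      // Placeholder answer; a real component would query an HTTPDNS server.
      return host == "music.example.com" ? std::string("203.0.113.7")
                                         : std::string();
    });
  }
};

// Network policy component: depends only on the registry, not on HTTPDNS.
class NetworkPolicyComponent {
 public:
  explicit NetworkPolicyComponent(const ServiceRegistry& registry)
      : registry_(registry) {}

  // Returns the IP to connect to, or the host itself when HTTPDNS is
  // disabled or has no answer (leaving resolution to system DNS).
  std::string PickEndpoint(const std::string& host) const {
    auto ip = registry_.Call("httpdns.resolve", host);
    return (ip && !ip->empty()) ? *ip : host;
  }

 private:
  const ServiceRegistry& registry_;
};
```

With this shape, the policy component keeps working when HTTPDNS is switched off in the configuration; it simply falls back to the host name.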

Components do not communicate with the network kernel directly. We wrap basic capabilities on top of the network kernel, including the necessary request interception, request forwarding, and network monitoring.

Through this cross-platform practice we hope to distill a reusable network framework. The framework can be reused within NetEase and contains no business logic. Its Cronet kernel will be updated regularly, including security patches (we also started from the official Cronet release and, after running it for a while, found some crashes online that we had to fix by patching). On top of Cronet we provide basic component management and basic capability wrappers, which are designed to reduce the cost of customization at the C++ level.

On top of the reusable network framework we have built an extensible capability set. All services live in the capability set, are turned into third-party libraries, and are reused as components, so an integrating party can combine them flexibly according to its own needs. The capability set can provide general capabilities (such as APM monitoring and the HTTPDNS service) as well as personalized ones (such as the free-data service), and is designed to meet the changing business scenarios and customization requirements of integrating parties.

To summarize the network library design briefly: starting directly from the open source Cronet, we built Cloud Music's own unified network library solution and sank the business logic into it using a "reusable network framework + extensible capability set" model.

Cronet upgrade

We did a lot of research online when choosing this approach and found that most companies use Cronet directly rather than doing secondary development and customization on it. They all give the same reason: once Cronet is customized, later kernel upgrades become hard to control or too costly.

So how do we deal with upgrades? First, why upgrade at all? The reasons are simple. One is to fix problems (for example, a security vulnerability that we want to fix by upgrading or patching). Another is to obtain features. For example, QUIC is evolving rapidly: the mainstream in China is gQUIC, but the iQUIC standard is gradually being unified and Google is gradually moving toward iQUIC, so we hope to support iQUIC directly by upgrading. Finally, Google keeps optimizing Cronet, and we hope to enjoy those optimizations directly by upgrading.

What are the specific pain points of an upgrade? Put simply, an upgrade can be compared to committing code. If you commit promptly, pushing each small change as soon as it is done, you almost never hit conflicts and everything goes smoothly. Some developers instead write two or three days of code and commit it all at once, which makes conflicts more likely and may take some time to resolve. Slow the pace further, as with our secondary development on Cronet, and it may be half a year, a year, or even longer before we merge with the latest official Cronet. By the time Cronet needs to be upgraded, the framework of the new version may have changed, and merging the code is very painful.

In our earlier browser work, every kernel upgrade produced tens of thousands of conflicting files, and merging the code took half a month. Worse, while merging you often do not know how a conflict should be resolved, because there are too many changes. And even if compilation and linking happen to pass, you find that many features have regressed.

With this in mind we came up with some countermeasures. For code conflicts, we keep the modifications made to the source for secondary customization as small as possible, and we isolate them so that modified places are easy to identify. Whoever does the merge can then see clearly which places need to be merged and which do not. For functional regressions, the most effective measure is unit testing. The Chromium project's own unit test coverage is very broad, so you only need to add unit tests for the code you have added.

The following focuses on how to reduce intrusion and how to isolate changes. Both are simple techniques with no real difficulty, and you can easily borrow and apply them.

Our approach to reducing intrusion has three steps: first extract interfaces, then build a framework on those interfaces, and finally extend components on the framework.

When we intrude into Cronet, we mostly add interfaces rather than hack the code directly. Extracting interfaces means adding proxies, observers, or interceptors (borrowing ideas from design patterns) so that all modifications converge on a few small points. Over the whole practice, after nearly a year of modifications, the interface part accounts for 5% or less of our changes. On the new interfaces we built a network framework that mainly wraps capabilities, including the component mechanism, channel wrappers, and some pluggable services. Finally, we stack the various strategies, monitoring, and business logic on the components, so the intrusion into the source code stays under control overall.
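A minimal sketch of the three steps, using an interceptor as the extracted interface. All types here (Request, RequestInterceptor, InterceptorChain, and the two sample interceptors) are hypothetical stand-ins invented for illustration; they are not Cronet or Chromium classes, and the real interfaces may look quite different.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a kernel request; not a Cronet type.
struct Request {
  std::string url;
  std::vector<std::pair<std::string, std::string>> headers;
};

// Step 1: the extracted interface. This is all the kernel learns about us.
class RequestInterceptor {
 public:
  virtual ~RequestInterceptor() = default;
  // Returns false to cancel the request.
  virtual bool OnBeforeSend(Request& request) = 0;
};

// Step 2: the framework. The patched kernel calls Dispatch() at one point.
class InterceptorChain {
 public:
  void Add(std::unique_ptr<RequestInterceptor> interceptor) {
    interceptors_.push_back(std::move(interceptor));
  }

  // Runs interceptors in registration order; stops at the first veto.
  bool Dispatch(Request& request) const {
    for (const auto& interceptor : interceptors_) {
      if (!interceptor->OnBeforeSend(request)) return false;
    }
    return true;
  }

 private:
  std::vector<std::unique_ptr<RequestInterceptor>> interceptors_;
};

// Step 3: components stacked on the framework, outside the kernel sources.
class TraceHeaderInterceptor : public RequestInterceptor {
 public:
  bool OnBeforeSend(Request& request) override {
    request.headers.emplace_back("X-Trace-Id", "demo");
    return true;
  }
};

class BlockPlainHttpInterceptor : public RequestInterceptor {
 public:
  bool OnBeforeSend(Request& request) override {
    return request.url.rfind("https://", 0) == 0;  // true if url has the prefix
  }
};
```

The design point is that adding a new strategy means adding another class like the last two, with no further edits to the kernel call site.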

The code in the figure below is an example of one of our modifications to the Cronet source. On the left is the socket reuse code in the original source file. On the right is our modified version: we added a callback so that the business side decides whether a socket can be reused. At first glance it is impossible to tell the two apart, which is a serious problem for kernel upgrades. The code has no isolation at all, so at merge time we no longer know what we changed, and the merge cannot be done.

What can be done? Some developers might think of adding comments. When we first started building browsers, our team also used comments. This method is simple and "efficient", but it always felt a little uncomfortable. The changes are isolated, but sometimes, when locating a problem, you suspect the bug comes from our modification rather than from Cronet. You then need to compile the original code to verify, and comments do not let you quickly build an original version.

For C++ the answer is simple: add a macro switch. With a macro switch you can see at a glance which changes were made, and you can quickly switch back to the original source. This comes from my previous team's long-term practice.

As you can see, every change puts the original source code first and our change after it. Why? A big advantage is that the code above is untouched, so there are fewer conflicts during a merge. If problems arise later, you can quickly identify the differences with a merge tool.
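A macro switch with the original code first might look like the sketch below, applied to a toy version of the socket reuse check. The macro name WOW_SOCKET_REUSE_CALLBACK, the function, the callback type, and the 300-second limit are all hypothetical; this is not the actual Chromium code or the authors' patch.

```cpp
// Hypothetical switch, normally set by the build system. Building with it
// defined to 0 compiles the untouched upstream logic, which helps decide
// whether a bug comes from upstream or from the customization.
#ifndef WOW_SOCKET_REUSE_CALLBACK
#define WOW_SOCKET_REUSE_CALLBACK 1
#endif

#if WOW_SOCKET_REUSE_CALLBACK
// Set by the business side; nullptr means "no opinion".
using ReuseCallback = bool (*)(int idle_seconds);
ReuseCallback g_reuse_callback = nullptr;
#endif

// Toy stand-in for a kernel function; not real Chromium code.
bool CanReuseSocket(int idle_seconds) {
#if !WOW_SOCKET_REUSE_CALLBACK
  // Upstream code comes first and is left exactly as it was, so a merge
  // tool sees no difference in this region.
  return idle_seconds < 300;
#else
  // Our change comes second: same upstream check, then the callback decides.
  if (idle_seconds >= 300) return false;
  return g_reuse_callback == nullptr || g_reuse_callback(idle_seconds);
#endif
}
```

Searching the tree for the macro name then lists every customized spot, which is what makes the merge reviewable.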

For code other than C++, we made a switch similar to the macro switch to isolate our changes, as shown in the figure below.

Finally, we also isolate at the file level. The red box marks the newly added directory. We put all our changes under wow, isolating them in our own files. For example, if we need to modify a file in the net directory, we create a new file in the net directory under wow and add the wow prefix to its name.
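The naming rule can be written down as a small path-mapping helper. This is one reading of the convention described above (mirror the upstream directory under wow and prefix the file name with wow_); the exact layout, the underscore, and the helper name WowPathFor are assumptions made for illustration.

```cpp
#include <string>

// Maps an upstream path such as "net/socket/client_socket_pool.cc" to the
// assumed location of its customized sibling under the "wow" root:
// "wow/net/socket/wow_client_socket_pool.cc".
std::string WowPathFor(const std::string& upstream_path) {
  const std::string::size_type slash = upstream_path.rfind('/');
  if (slash == std::string::npos) {
    return "wow/wow_" + upstream_path;  // file at the repository root
  }
  return "wow/" + upstream_path.substr(0, slash + 1) + "wow_" +
         upstream_path.substr(slash + 1);
}
```

A rule this mechanical means anyone doing a merge can find the customized counterpart of an upstream file without a lookup table.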

Cronet upgrades are unavoidable, and there is no silver bullet. We use these low-cost, practical techniques to keep the whole upgrade relatively simple.

Pitfall avoidance guide

With the upgrade concerns resolved, are you eager to try? Not so fast. There is one last step: get the Cronet code running first.

The figure below shows the official Cronet documentation, which describes in detail how to check out, build, and run the source. You only need to follow it step by step. Cronet is not open sourced separately: its code lives alongside Chromium's, sharing the same repository and build system, so pulling the code and preparing the build environment are the same as for Chromium. For building Cronet itself, Google provides a separate document and separate build scripts for Android and iOS, so it comes down to a single command line.

We evaluated Cronet at the beginning of the year. At that time the Chromium kernel version was M88, and pulling, compiling, and debugging the code all went smoothly, although we only tested on Windows. Our original reason for adopting Cronet was purely to go cross-platform. As the research deepened, we saw how major Internet companies were using QUIC and the performance improvements they reported for different versions. Since Cronet supports QUIC natively, we decided that once the cross-platform transformation of the network library was complete we would also turn on QUIC and see the effect. Any improvement would be a bonus.

However, when we looked into QUIC we found that the main QUIC version supported by domestic cloud vendors is gQUIC 43, and Chromium M88 no longer supports gQUIC 43 by default. That meant falling back to a version that supports gQUIC 43 by default. To make data comparison easier and to reduce the risk of adopting QUIC, we chose M72, the same version NetEase Media uses.

M72 and M88 are nearly two years apart. When we switched the Chromium code back to M72, it would not compile on Windows, Ubuntu, or Mac, and we hit all kinds of strange problems. After struggling for a long time, it dawned on us that the documentation might be the problem, since two years had passed. But how do you find the official documentation that matched M72 two years ago?

Chromium's official documentation is managed together with the code. You can find the matching documents in the src directory, or look them up online by tag. With matching documents, there are far fewer compile and link problems.

The following figure shows the development tools, build environment, and build system for Cronet on each platform, based on the official documentation and our own habits. Two points deserve emphasis: when building the iOS version, point the command line tools in Xcode at the lower version, and on Windows 10 remember to install the corresponding SDK. In addition, the whole SDK is built with GN + Ninja, which may feel unfamiliar at first, but a few days of practice is enough to get started.

Next is debugging. Every developer knows that the simplest way to learn something is to pull the code, compile it, and debug it. To our surprise, the Cronet SDK does not natively support debugging on Android. The official recommendation is VLOG first and NetLog second. We have developers on each of the three platforms. I am used to Windows, for example, so I build a demo directly on Cronet and debug there, which covers most scenarios; whatever it cannot cover, I debug with VLOG. This remains a pain point, and we have not yet found a good solution.

The last is app integration. Cronet offers two ways to issue a request. One is the asynchronous UrlRequest. The other is the HttpURLConnection and NSURLProtocol implementations, which wrap stream operations on top of UrlRequest and conform to the standard interfaces of the mobile platforms.

Because we extended the network request interface, the solution on the left only requires modifying UrlRequest, while the solution on the right requires modifying both UrlRequest and the upper-layer wrappers. To reduce intrusion into Cronet, we preferred to have both platforms use the same network interface, UrlRequest, and that is what we did when integrating the main site's API requests. For CDN requests, however, UrlRequest is not very convenient for streaming operations, and the business side would need more adaptation and modification to integrate. In the end, to lower the integration cost for the app, we switched to HttpURLConnection and NSURLProtocol across the board.
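To see why a stream wrapper is more convenient for streaming consumers, here is a rough C++ sketch of the general technique: asynchronous callbacks feed a queue, and the consumer pulls bytes with a blocking read. BlockingBodyStream and its methods are hypothetical and written only for illustration; this is not how Cronet's HttpURLConnection or NSURLProtocol wrappers are actually implemented.

```cpp
#include <algorithm>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical adapter; not part of Cronet. The producer methods stand in
// for asynchronous request callbacks; Read() is what a stream consumer calls.
class BlockingBodyStream {
 public:
  // Producer side: a chunk of the response body has arrived.
  void OnData(std::string chunk) {
    if (chunk.empty()) return;  // an empty chunk would look like end of stream
    {
      std::lock_guard<std::mutex> lock(mutex_);
      chunks_.push_back(std::move(chunk));
    }
    ready_.notify_one();
  }

  // Producer side: the response is complete.
  void OnFinished() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      finished_ = true;
    }
    ready_.notify_all();
  }

  // Consumer side: blocks until data is available or the stream has ended.
  // Copies up to `capacity` bytes into `out`; returns 0 at end of stream.
  // `capacity` must be greater than zero.
  std::size_t Read(char* out, std::size_t capacity) {
    std::unique_lock<std::mutex> lock(mutex_);
    ready_.wait(lock, [this] { return !chunks_.empty() || finished_; });
    if (chunks_.empty()) return 0;
    std::string& front = chunks_.front();
    const std::size_t n = std::min(capacity, front.size());
    front.copy(out, n);   // copy the first n bytes to the caller's buffer
    front.erase(0, n);    // keep the remainder for the next Read()
    if (front.empty()) chunks_.pop_front();
    return n;
  }

 private:
  std::mutex mutex_;
  std::condition_variable ready_;
  std::deque<std::string> chunks_;
  bool finished_ = false;
};
```

A consumer such as a media player can then loop on Read() without knowing anything about callbacks, which is the convenience the business side was missing with the asynchronous interface alone.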

We paid some trial-and-error costs and share them here so that you can avoid the detours. This pitfall guide is mainly about secondary development on Cronet. Getting the code to run is only a small pitfall; once you are past it, the road ahead is smooth.

Landing practice

We did not know much about Cronet at the beginning. We first integrated the official Cronet SDK and tested it. Then we built a network framework on Cronet and sank the business components into the network library one by one, rewriting them in C++. After that we rolled it out on Android at full scale, then gradually turned on QUIC, which took only one line of code. Finally, iOS integration is in progress.

In the online data, with QUIC turned off, Cronet's response time is 16% to 20% better than OkHttp's. With QUIC turned on, the improvement rises to 37% to 41%. A gain of about 40% is very impressive, and a year of hand optimization might not match it. To make the upper-layer business cross-platform, we ended up making the underlying network library cross-platform as well, and because of that choice we directly enjoyed the network performance optimizations that Cronet brings.

Follow-up planning

The cross-platform network library has been partially rolled out in the main NetEase Cloud Music app, and Windows and Mac will be covered next. Once the main Cloud Music app is fully covered, we will roll the library out across the Cloud Music product matrix one product at a time, and later promote it within NetEase.

Integrating Cronet is only the start of the tuning work, which includes preconnection, parameter tuning, "racing" optimization, connection reuse rate, connection migration, independent deployment of QUIC clusters, and more.

The standardized version of QUIC was released this year. When the time is right, we will support iQUIC directly by upgrading the Cronet kernel version.

In our experience, the network library is the part most worth making cross-platform, and the part that most should be. We recommend Cronet as the first choice for a cross-platform network library; there is a small threshold, but it is not high. In our online data, Cronet over HTTP/2 already has a performance advantage of about 20%, and turning on QUIC widens the advantage further. As for adopting Cronet, we were also cautious at the beginning. You can start as Baidu, Weibo, and NetEase Media did, by integrating the official Cronet, and then decide whether to do secondary development on it.

