
1. Introduction

The content of this article is compiled from Li Qingxin's talk at the 2021 vivo Developers Conference (the original talk can be downloaded from the attachment at the end of this article). It shares the technical practice and lessons behind the architecture design of vivo's system-level push platform. This is also the first time a mobile phone manufacturer has shared the technical details of a self-built system-level push platform, giving us a glimpse of the technical depth behind a manufacturer's ROM-level push channel.


Learning and exchange:

  • Introductory article on mobile IM development: "One entry is enough for beginners: developing mobile IM from scratch"
  • Open source IM framework source code: https://github.com/JackJiang2011/MobileIMSDK (click here for alternate address)
  • (This article has been published simultaneously at: http://www.52im.net/thread-4008-1-1.html )

2. About the author

Li Qingxin, architect of the vivo Internet server team.

3. Why do you need message push?

Message push is a very common feature of mobile apps: breaking news in news apps, system notifications in social apps, offline chat messages in IM apps, and so on.

Without message push, an APP loses the ability to reach users in real time. For the application, user "stickiness" drops sharply; for users, timely access to information is lost, and the overall experience suffers.

4. Technical barriers to message push

For IM applications, the most common in our daily life, pushing offline messages is a must-have capability. But as the Android system has kept upgrading, offline push can no longer be taken for granted as simply keeping a background service and a persistent connection alive.


On early versions of Android, implementing offline message push for IM was not hard: a background service plus a long-lived socket connection was basically enough. But as Android was upgraded, restrictions on background processes and network services kept tightening. To keep offline push working, developers had to fight the system with all kinds of keep-alive black tech: dual-process daemons after Android 4.0, anti-kill and resurrection tricks after Android 6.0, and later Tencent TIM's "process immortality" technology; for a while it was a wild free-for-all. (Interested readers can see "The ultimate secret of Android process immortality technology: the underlying principle of process killing, APP fighting skills", a summary of all the keep-alive black tech.)

With the arrival of Android 9.0, the system has essentially blocked every keep-alive black-tech route (see "The official version of Android P is coming: the real nightmare of background application keep-alive and message push"), and the ROM-level, system-level push channels of the various Android manufacturers emerged as the times demanded: Huawei Push, Xiaomi Push, Meizu Push, OPPO Push, vivo Push. Overnight, the users' nightmare (keep-alive black tech is genuinely annoying to users) turned into the developers' nightmare, one that continues to this day: to do offline push well, today's IM developers must integrate with every phone manufacturer's push channel, which is tiresome indeed.

And don't ask why developers don't simply use Android's official FCM service (if you can reliably open that link from mainland China, I stand corrected; as for why, you know...), and don't bring up the Unified Push Alliance either (four or five years on, it looks like we still have to keep waiting).

Therefore, to keep receiving offline message push, IM developers currently have only two options:

1) Surrender to the system, give up the keep-alive black tech, and guide users to manually add the app to the whitelist (see "Android Keep Alive from Getting Started to Giving Up: Obediently Guide Users to Whitelist");
2) Integrate the system-level push channels of the various manufacturers one by one (Huawei, Xiaomi, Meizu, OPPO, vivo; and, tragically, some niche manufacturers have no self-built push at all).

With the "blocking" of the Android system for developers to keep the black technology alive, it is natural for mobile phone manufacturers to develop their own system-level push channels to "sparse". Among these manufacturers, vivo's system-level push channel appeared relatively late. The remaining technical content of this article is the technical details of the self-built system-level push platform shared by mobile phone manufacturers for the first time so far. Let's learn together.

5. Understand the push platform from a technical point of view

What does a push platform do?
From a technical point of view, a push platform is a platform that delivers messages to users over long-lived TCP connections. In essence, it sends messages to user devices through a network channel.

Think of the express-delivery notifications we all receive: when the courier drops a package into a locker, the delivery backend automatically pushes a message to notify you. If you work in operations, you will surely appreciate such an efficient way of reaching users automatically. If you are interested, you can visit the vivo open platform, select "message push", and learn more technical details there; we won't expand on them here.

6. Short and long connections

The essence of a message push platform is to connect content, services, and users through long connections, distributing content to users and giving terminal devices real-time, two-way communication capability. But what exactly is a long connection? A long connection is a network connection that the client and server keep open for a relatively long time and communicate over repeatedly (for example, a long-lived TCP connection).

Why do we build the platform's underlying network communication on long connections rather than short ones? Consider message delivery over short connections first. The short-connection approach is polling: the client periodically asks the backend whether there are messages for device A, and the backend returns them when there are. Much of this polling is futile and wastes traffic, and when the backend does have a message for device A, it cannot deliver it until device A happens to come and fetch it.

With a long connection, when a message for device A arrives, the backend sends it to device A immediately instead of waiting for the device to pull it. Long connections make data interaction more natural and efficient.
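To make the contrast concrete, here is a minimal sketch of the client side of a long connection, assuming Java, a hypothetical gateway address, and simple line-based framing (not vivo's actual protocol):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.Socket;

public class LongConnectionClient {
    public static void main(String[] args) throws Exception {
        // A short-connection client would instead loop:
        //   connect -> ask "any messages for device A?" -> disconnect -> sleep -> repeat,
        // wasting traffic when there is nothing to fetch and delaying delivery
        // until the next poll.
        //
        // With a long connection we connect once and block on read(): the server
        // writes the moment a message for this device arrives.
        try (Socket socket = new Socket("push.example.com", 9000); // hypothetical gateway
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream()))) {
            String message;
            while ((message = in.readLine()) != null) { // blocks until the server pushes
                System.out.println("pushed: " + message);
            }
        }
    }
}
```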

7. Business requirements drive architecture upgrade

A system's technical architecture is dynamic and differs across stages, and the driving force behind its evolution is mainly business requirements. Let's review the platform's business development.

[image]

Since the project was established in 2015, business volume has kept growing, and functions and features have been continuously added to enrich the system's capabilities for different business scenarios, such as support for full-message content review, IM, IoT, and WebSocket communication. As the figure shows, business volume has grown by billions of messages almost every year, and this continuous growth has challenged the system: problems in the original architecture gradually surfaced, such as latency and performance bottlenecks. Architecture serves the business: before 2018, all of the platform's services ran on the cloud, while the internal services we depended on were deployed in self-built data centers.

[image]

As business volume grew, data transfer to and from the self-built data center introduced latency that kept worsening and hindered the expansion of platform features. So in the second half of 2018 we adjusted the deployment architecture: all core logic modules were migrated to the self-built data center. This optimization completely solved the latency problem and laid the foundation for further architecture evolution. As the figure above shows, the access gateway was also optimized into a three-region deployment.

Why a three-region deployment rather than deployments in more regions? Mainly three considerations:
1) user distribution and cost;
2) giving users nearby access;
3) giving the access gateway a degree of disaster tolerance.

Imagine the alternative: without the three-region deployment, a failure in the access gateway's data center would paralyze the whole platform. As the platform's business scale expanded further, daily throughput reached the order of a billion messages, and users demanded ever more timeliness and concurrency. By 2018 the logical-service architecture could no longer meet the business's high-concurrency requirements, or could only meet them at a much higher server cost. So, starting from platform functionality and cost optimization, we rebuilt the system in 2019 to provide richer product features and a more stable, higher-performance platform.

8. Using long-connection capabilities to empower more services

图片

As a relatively large-scale long-connection service platform within the company, the team has accumulated rich long-connection experience, and we have kept thinking about how to let long-connection capability empower more businesses. On the server side, the platform's modules call each other through RPC, a very efficient development model in which no developer needs to care about the underlying network-layer packets.

We imagined: if the client could also call the backend through RPC, that would surely be a great development experience. So in the future we will provide a VRPC communication framework to solve the communication and development-efficiency problems between client and backend, offering a consistent development experience on both sides so that more developers can stop worrying about network communication and concentrate on business logic. For a push platform with a throughput above ten billion messages, stability, high performance, and security are all critical. Next, I will share our practical experience with each of them.

9. Domain model of vivo push platform

[image]

As the domain model above shows, the push platform takes communication services as its core capability. On top of that core we also provide big-data services and operations systems, exposing different functions and services through different interfaces. For a push platform centered on communication services, stability and performance directly affect message timeliness.

Message timeliness is the time it takes for a message to travel from the service that initiates it to the device that receives it. So how do we measure it? Read on.

10. How to monitor and measure message timeliness?

[image]

The traditional way to measure message timeliness is shown on the left of the figure above: the sender and the receiver sit on two different devices; timestamp t1 is taken when the message is sent and t2 when it is received, and t2 - t1 gives the message latency. But this method is not rigorous. Why? Because the clocks of the two devices are very likely not synchronized.

The solution we adopted is shown on the right of the figure: the sender and the receiver are placed on the same device, which eliminates the clock-reference problem. On top of this we built a dial-testing system to actively monitor the distribution of message-delivery latency.
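A minimal sketch of this same-device idea, with a SynchronousQueue standing in for the push pipeline (in the real dial-testing system the message traverses the actual network channel and comes back to the same device); since both timestamps come from one monotonic clock, t2 - t1 is meaningful:

```java
import java.util.concurrent.SynchronousQueue;

public class DialTest {
    public static void main(String[] args) throws Exception {
        // Sender and receiver run on the same device (here: the same JVM), so both
        // timestamps come from a single monotonic clock and subtraction is valid.
        SynchronousQueue<String> channel = new SynchronousQueue<>(); // stand-in for the push channel

        long t1 = System.nanoTime(); // taken when the message is sent
        new Thread(() -> {
            try {
                channel.put("dial-test message"); // "push" the message back to ourselves
            } catch (InterruptedException ignored) { }
        }).start();

        channel.take(); // "receive" the pushed message
        long t2 = System.nanoTime(); // taken when the message is received
        System.out.printf("message latency: %.3f ms%n", (t2 - t1) / 1e6);
    }
}
```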

11. How to achieve a high-performance and stable long-connection gateway?

Over the past decade, the discussion of single-machine long-connection performance has centered on the C10K problem (10,000 connections per machine). As a platform with hundreds of millions of devices online simultaneously, we have to face one million connections per machine instead.

图片

A long-connection gateway's main responsibility is to maintain TCP connections with devices and forward data packets, so we should keep the gateway itself as lightweight as possible.

We refactored and optimized top-down in the following areas:
1) architecture design;
2) coding;
3) operating system configuration;
4) hardware feature usage.

Concrete measures include, for example:
1) raising the system-wide and per-process limits on open file handles;
2) balancing the NIC's soft-interrupt load, or enabling NIC multi-queue and RPS/RFS;
3) tuning TCP parameters such as keepalive (adjusted to the session lifetime of the host) and turning off TIME_WAIT recycling;
4) using the AES-NI instruction set to accelerate data encryption and decryption in hardware.

After these optimizations, a production 8C/32GB server can stably support 1.7 million long connections.
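To make "lightweight" concrete, below is a minimal sketch of a gateway bootstrap, assuming Netty with the Linux epoll transport (vivo has not published its gateway code; the options shown are the kind such a gateway typically tunes):

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.ChannelOption;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.epoll.EpollEventLoopGroup;
import io.netty.channel.epoll.EpollServerSocketChannel;
import io.netty.channel.socket.SocketChannel;

public class GatewayServer {
    public static void main(String[] args) throws Exception {
        EventLoopGroup boss = new EpollEventLoopGroup(1);   // accepts connections
        EventLoopGroup workers = new EpollEventLoopGroup(); // handles connected sockets
        try {
            ServerBootstrap b = new ServerBootstrap();
            b.group(boss, workers)
             .channel(EpollServerSocketChannel.class)
             .option(ChannelOption.SO_BACKLOG, 4096)         // deep accept queue for connect storms
             .childOption(ChannelOption.SO_KEEPALIVE, true)  // TCP keepalive as a backstop
             .childOption(ChannelOption.TCP_NODELAY, true)   // push packets are small; don't batch
             .childHandler(new ChannelInitializer<SocketChannel>() {
                 @Override
                 protected void initChannel(SocketChannel ch) {
                     // Keep the pipeline minimal; a real gateway adds only a frame
                     // decoder and a forwarding handler here, no business logic.
                     ch.pipeline().addLast(new ChannelInboundHandlerAdapter());
                 }
             });
            b.bind(9000).sync().channel().closeFuture().sync();
        } finally {
            boss.shutdownGracefully();
            workers.shutdownGracefully();
        }
    }
}
```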

Another major difficulty is connection keep-alive: an end-to-end TCP connection passes through many routers and gateways, each with limited resources, so they cannot hold every TCP connection's state indefinitely. To prevent intermediate devices from reclaiming that state and silently breaking the connection, we must send heartbeat requests periodically to keep connections active (why does TCP have this problem? If you are interested, read these two articles: "Why does TCP-based mobile IM still need a heartbeat keep-alive mechanism?" and "Comprehend the KeepAlive mechanism of the TCP protocol layer thoroughly").

How often should heartbeats be sent? Too fast wastes power and traffic; too slow and they stop working. So, to cut unnecessary heartbeats and improve connection stability, we use smart heartbeats: different frequencies for different network environments.
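A minimal sketch of the adaptive idea (the intervals, step size, and ping/reconnect stubs are illustrative assumptions; a production client would also react to network-type changes such as Wi-Fi vs. cellular):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SmartHeartbeat {
    private static final long MIN_SEC = 60, MAX_SEC = 300; // assumed bounds
    private long intervalSec = MIN_SEC;                    // start conservatively
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();

    public void start() {
        timer.schedule(this::beat, intervalSec, TimeUnit.SECONDS);
    }

    private void beat() {
        if (sendPing()) {
            // The NAT mapping survived this interval: probe a longer one next time
            // to save power and traffic.
            intervalSec = Math.min(intervalSec + 30, MAX_SEC);
        } else {
            // The connection was reclaimed somewhere along the path:
            // reconnect and fall back to a known-safe interval.
            intervalSec = MIN_SEC;
            reconnect();
        }
        timer.schedule(this::beat, intervalSec, TimeUnit.SECONDS);
    }

    private boolean sendPing() { return true; } // stub: write a ping frame, await ack
    private void reconnect() { }                // stub: re-establish the TCP connection
}
```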

For more details on the heartbeat mechanism of long-term connections, you can refer to:
"Teach you to use Netty to implement the heartbeat mechanism and disconnection reconnection mechanism of network communication programs"
"Understanding the Network Heartbeat Packet Mechanism in Instant Messaging Applications: Function, Principle, Implementation Ideas, etc."
"Mobile IM Practice: Realizing the Intelligent Heartbeat Mechanism of Android WeChat"
"Mobile IM Practice: Analysis of Heartbeat Strategy of WhatsApp, Line and WeChat"
"Discussion on the design and implementation of an Android-side IM intelligent heartbeat algorithm (including sample code)"
"Correctly understand the IM long connection, heartbeat and reconnection mechanism, and implement it by hand"
"The 4D Long Article: Teach you to implement a set of efficient IM long connection adaptive heartbeat keep-alive mechanism"
"Web-side instant messaging practice dry goods: how to make your WebSocket disconnect and reconnect faster? 》

12. How to achieve load balancing of hundreds of millions of devices?

Our platform has more than 100 million devices online at the same time. When a device connects to a long-connection gateway, the traffic scheduling system balances the load: when the client requests an IP address, the scheduling system returns the IPs of several of the nearest access gateways:

[image]

So how does the scheduling system ensure that the IPs it hands out are usable? Take a moment to think about it.

In our case, we combine four strategies:
1) nearby access;
2) public-network probing;
3) machine load;
4) interface success rate.

What do these strategies address? Consider these two questions:

1) If a gateway is reachable over the internal network, is it necessarily reachable from the public network?
2) Is a server with few connections necessarily available?

The answer to both is no. The long-connection gateways and the traffic scheduling system keep each other alive via heartbeats over the intranet, so a gateway can look perfectly healthy to the scheduler while its public-network connectivity is broken, for example because public-network access was never enabled on it.

Therefore, we must combine multiple strategies to evaluate a node's availability, keeping the system's load balanced and underpinning its stability.
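A sketch of how the four strategies might combine when the scheduler picks IPs for a client (the node fields, the 99% threshold, and the ordering are illustrative assumptions):

```java
import java.util.Comparator;
import java.util.List;

/** Hypothetical view of one gateway node as seen by the traffic scheduler. */
record GatewayNode(String ip, String region, boolean publicProbeOk,
                   double load, double apiSuccessRate) { }

public class TrafficScheduler {
    /** Return several candidate IPs, combining the four strategies. */
    static List<String> pickIps(List<GatewayNode> nodes, String clientRegion, int count) {
        return nodes.stream()
                .filter(GatewayNode::publicProbeOk)          // 2) public-network probe passed
                .filter(n -> n.apiSuccessRate() > 0.99)      // 4) interface success rate healthy
                .sorted(Comparator
                        // 1) prefer the client's own region (nearby access)
                        .comparingInt((GatewayNode n) -> n.region().equals(clientRegion) ? 0 : 1)
                        // 3) then prefer the least-loaded machines
                        .thenComparingDouble(GatewayNode::load))
                .limit(count)                                // hand out several IPs for failover
                .map(GatewayNode::ip)
                .toList();
    }
}
```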

13. How to meet high concurrency requirements?

Consider this scenario: pushing a message to hundreds of millions of users at 1,000 messages per second would leave some users receiving it only days later, which badly hurts the user experience. So for push, high concurrency is essential to message timeliness.

[image]

Looking at the push flow in the figure above, you might ask: won't TiDB become the performance bottleneck of push? Actually, no. At first glance TiDB serves as the central storage, but because we use a distributed cache, the data in central storage is cached onto each business node according to a defined policy, which makes full use of server resources and raises system performance and throughput. Our production distributed cache hit rate is 99.9%, so the cache shields central storage from most requests; even if TiDB fails briefly, the impact on us is relatively small.
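A minimal cache-aside sketch of that read path: each business node answers most lookups from its local cache, and only misses fall through to central storage (the loader function stands in for a TiDB query; a real deployment adds TTL, eviction, and invalidation):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class NodeCache<K, V> {
    private final Map<K, V> local = new ConcurrentHashMap<>();
    private final Function<K, V> centralStore; // stand-in for a TiDB query

    public NodeCache(Function<K, V> centralStore) {
        this.centralStore = centralStore;
    }

    public V get(K key) {
        // With a 99.9% hit rate, almost every request is served here
        // and never reaches the central storage.
        return local.computeIfAbsent(key, centralStore);
    }

    public void invalidate(K key) {
        local.remove(key); // called when central storage is updated
    }
}
```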

14. How to ensure system stability?

14.1 Overview

As a push platform, our traffic consists mainly of external calls and internal calls between upstream and downstream modules. Large fluctuations in either will affect system stability, so rate limiting and speed control are needed to keep the system running smoothly.

14.2 Push gateway rate limiting

The push gateway is the traffic entrance, so its stability matters greatly. To make it run stably, we must first balance the traffic, i.e., avoid traffic skew, because skewed traffic can easily trigger an avalanche. We use a round-robin mechanism to balance the load, under two prerequisites:
1) all push gateway nodes must have identical server configurations, otherwise under-provisioned nodes are likely to be overloaded;
2) the concurrency entering the system must be controlled, so that a traffic flood cannot punch through the push gateway and overload the backend services.
We use the token bucket algorithm to control each push gateway's delivery speed and thereby protect the downstream nodes. So how many tokens are appropriate? Set too low, downstream resources go underused; set too high, the downstream nodes may be unable to cope.

We adopt an active + passive dynamic adjustment strategy (see the sketch below):
1) when traffic exceeds the downstream cluster's processing capacity, the downstream notifies the upstream to slow down;
2) when calls to downstream interfaces time out beyond a certain proportion, the upstream throttles itself.
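A minimal in-process token bucket of the kind described above, as a sketch (the parameters are illustrative; the active/passive adjustment in 1) and 2) then amounts to resizing the rate at runtime):

```java
public class TokenBucket {
    private final long capacity;     // maximum burst size
    private double ratePerNano;      // steady delivery speed, tokens per nanosecond
    private double tokens;
    private long lastRefill = System.nanoTime();

    public TokenBucket(long capacity, double tokensPerSecond) {
        this.capacity = capacity;
        this.ratePerNano = tokensPerSecond / 1e9;
        this.tokens = capacity;
    }

    /** Take one token; false means the gateway must slow its delivery down. */
    public synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        tokens = Math.min(capacity, tokens + (now - lastRefill) * ratePerNano);
        lastRefill = now;
        if (tokens >= 1) {
            tokens -= 1;
            return true;
        }
        return false;
    }

    /** Active/passive dynamic adjustment: downstream feedback resizes the rate. */
    public synchronized void setRate(double tokensPerSecond) {
        this.ratePerNano = tokensPerSecond / 1e9;
    }
}
```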


14.3 Internal speed limit of the system: smooth delivery of label push

Since the push gateway already limits incoming traffic, why do we also limit speed between internal nodes?

This is determined by the business characteristics of our platform. The platform supports full-user and label push, and we must prevent high-performing modules from exhausting downstream node resources. The label push module (which provides both full-user and label push) is a high-performance service; to keep it from crushing its downstream, we built a smooth-push function based on Redis and the token bucket algorithm, controlling the push speed of each label task to protect the downstream nodes.


In addition, the platform allows an application to create multiple label pushes, whose speeds add up, so smoothing a single label task is not enough: the push delivery module must also limit speed at the application granularity, to keep an overly fast push from pressuring the business backend.

14.4 Internal speed limit of the system: speed-limited sending of messages

[image]

To achieve application-level rate limiting, we implemented a distributed leaky-bucket scheme on Redis, as shown in the figure above. Why do we do consistent hashing on the clientId (the device's unique identifier) here rather than on the application ID? Mainly for load balancing: hashing by clientId spreads a single application's traffic evenly across nodes. Since this feature went live, business teams no longer worry that an overly fast push will overwhelm their own servers.
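As a sketch of what a Redis-based distributed bucket can look like (Jedis plus a Lua script for atomicity; the key layout and second-level granularity are assumptions, and vivo's actual implementation has not been published):

```java
import java.util.Arrays;
import java.util.Collections;
import redis.clients.jedis.Jedis;

public class RedisLeakyBucket {
    // Atomically: drain the bucket for the elapsed time, then try to add one unit.
    private static final String LUA =
        "local key, rate, cap, now = KEYS[1], tonumber(ARGV[1]), tonumber(ARGV[2]), tonumber(ARGV[3]) " +
        "local data = redis.call('HMGET', key, 'water', 'ts') " +
        "local water = tonumber(data[1]) or 0 " +
        "local ts = tonumber(data[2]) or now " +
        "water = math.max(0, water - (now - ts) * rate) " + // leaked since last call
        "if water + 1 > cap then return 0 end " +           // bucket full: reject for now
        "redis.call('HMSET', key, 'water', water + 1, 'ts', now) " +
        "redis.call('EXPIRE', key, 3600) " +
        "return 1";

    private final Jedis jedis;

    public RedisLeakyBucket(Jedis jedis) { this.jedis = jedis; }

    /** One bucket per application; every node sees the same state via Redis. */
    public boolean tryAcquire(String appId, double leakPerSecond, long capacity) {
        long nowSec = System.currentTimeMillis() / 1000;
        Object ok = jedis.eval(LUA,
                Collections.singletonList("leaky:" + appId),
                Arrays.asList(String.valueOf(leakPerSecond),
                              String.valueOf(capacity),
                              String.valueOf(nowSec)));
        return Long.valueOf(1L).equals(ok);
    }
}
```

Note that the consistent hashing by clientId only decides which node runs the check; the bucket state itself is shared in Redis and keyed by application, so the limit holds across all nodes.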


So are speed-limited messages dropped? Of course not. We hold them in a local cache and scatter them into Redis; the scattering is necessary to avoid creating storage hotspots later.
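One common way to implement that kind of scattering, sketched below: shard the Redis key with a random suffix so that one application's deferred messages do not all pile onto a single key (the shard count and key naming are illustrative assumptions):

```java
import java.util.concurrent.ThreadLocalRandom;

public class DeferredMessageKeys {
    private static final int SHARDS = 64; // assumption: tuned to the Redis cluster size

    /**
     * Without sharding, every rate-limited message of a hot application would
     * land under one Redis key, recreating the hotspot we want to avoid.
     */
    static String shardedKey(String appId) {
        int shard = ThreadLocalRandom.current().nextInt(SHARDS);
        return "deferred:" + appId + ":" + shard; // e.g. deferred:app42:17
    }
    // A drain job later walks deferred:<app>:0 .. deferred:<app>:63
    // and releases the messages at the permitted speed.
}
```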

14.5 Circuit breaking and degradation

On a push platform, emergencies and hot news bring large bursts of traffic to the system. How should we deal with burst traffic?

[image]

As shown on the left of the figure above: in a traditional architecture, a large number of machines are deployed redundantly to absorb burst traffic, which is costly and wasteful; and when a burst exceeds even that capacity, the system cannot be expanded in time and the push success rate drops. Our approach is to add a buffer channel built on message queues, combined with containers (a solution that required few changes to the system). With no burst traffic we run a small number of machines; when a burst arrives, no manual intervention is needed: the system automatically scales out and back in according to load.

15. Automated testing based on Karate

In day-to-day development, interface boundary testing is often neglected in the rush to ship features, which poses a serious quality risk to online services. Besides, you may have noticed that different roles on a team communicate through different media, such as Word, Excel, and XMind documents, and the information degrades to varying degrees as it crosses them. To improve on both problems, we built an automated testing platform based on Karate to raise testing efficiency and interface test-case coverage, we use a unified domain language to reduce the information loss between roles, and we manage test cases centrally, which makes iteration and maintenance easier.

16. Content Security

As a push platform, we naturally must audit content for safety. We provide content-auditing capability, with automatic review as the primary mechanism and manual review as a supplement to improve efficiency, combined with auditing strategies based on impact scope and application classification. As the figure below shows, a business request is forwarded through the access gateway to the content-auditing system for a first layer of local-rule review; if no local rule is hit, we call the moderation system for an anti-spam content review.

[image]
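A toy sketch of that two-layer flow, where only content missed by the fast local rules is escalated to the remote anti-spam review (the rule, verdicts, and the remote stub are all illustrative):

```java
enum Verdict { PASS, REJECT, UNKNOWN }

public class ContentAuditor {
    /** First layer: cheap local rules, e.g. keyword lists. */
    private Verdict localRules(String content) {
        if (content.contains("forbidden-word")) return Verdict.REJECT; // toy rule
        return Verdict.UNKNOWN; // no local rule hit
    }

    /** Returns true when the message may be pushed. */
    public boolean audit(String content) {
        Verdict v = localRules(content);
        if (v != Verdict.UNKNOWN) {
            return v == Verdict.PASS; // local rules decided: done
        }
        return remoteAntiSpam(content); // second layer: anti-spam review
    }

    private boolean remoteAntiSpam(String content) {
        return true; // stub: in reality an RPC to the moderation system
    }
}
```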

17. Future planning

So far we have covered the push platform's architecture evolution in recent years and the practices around system stability, high performance, and security along the way. Next, our key work for the future.

To provide easier-to-use, more stable, and more secure push, we will continue to invest in the following areas:
1) building on per-module data consistency to achieve system-wide data consistency;
2) further improving each system's circuit-breaking and degradation capabilities;
3) continuously optimizing the platform's usability to provide more convenient platform services;
4) building the capability to identify abnormal traffic.

18. Download the attachment of the speech

Download the original talk attachment corresponding to this article: vivo push platform architecture evolution (52im.net).pdf (1.93 MB)

19. References

[1] Dual-process daemon keep-alive practice below Android 6.0
[2] Keep-alive practices for Android 6.0 and above (process anti-killing)
[3] Why does the TCP-based mobile IM still need the heartbeat keep-alive mechanism?
[4] Android version of WeChat background keep-alive actual combat sharing (process keep-alive)
[5] Realize the intelligent heartbeat mechanism of WeChat for Android
[6] The official version of Android P is coming: the real nightmare of background application keep-alive and message push
[7] Practice of network link keep-alive technology of Rongyun Android IM products
[8] Correctly understand the heartbeat and reconnection mechanism of IM long connection, and implement it
[9] The strongest Android keep-alive idea in history: In-depth analysis of Tencent TIM's process immortality technology
[10] The ultimate secret of Android process immortality technology: the underlying principle of process killing, APP fighting skills
[11] Web-side instant messaging practice dry goods: how to make your WebSocket disconnect and reconnect faster?


