Pike 2.0 is committed to providing Meituan with an easy-to-access, highly reliable, high-performance two-way message delivery service. This article first introduces the technical evolution of Pike 2.0 in terms of the system architecture upgrade, the working mode upgrade, and the long-term keep-alive mechanism upgrade, and then covers Pike 2.0's feature support for new business scenarios such as live streaming and games. We hope this article offers some help and inspiration to readers who are interested in message delivery services or engaged in related work.

1 Pike's past and present

1.1 The birth background of Pike 1.0

In 2015, Meituan launched the Shark terminal network channel, which provides long-connection proxy acceleration services for the company's mobile apps. By deploying network access points globally and maintaining persistent connections, Shark improves the end-to-end success rate of network requests, reduces end-to-end delay, and improves user experience.

Pike 1.0 is an in-app push service built on the Shark connection channel. Because its underlying transmission rides on Shark's long connections, Pike 1.0 inherits qualities such as low latency, high reliability, and resistance to DNS hijacking. Today, Pike 1.0 is widely used within Meituan in business scenarios such as real-time interaction, marketing push, status delivery, and configuration synchronization.

1.2 Pike 1.0 workflow

After each long connection is successfully established, the mobile SDK registers with the server using the APPID and the device's unique identifier UnionID (the Meituan unique identifier, the Dianping unique identifier, etc.). Once registration succeeds, the business server can actively push messages to the App on that device through the interface provided by the Pike 1.0 server SDK. Pushed messages reach the client over the long-connection channel and are finally delivered to the business party through the registered callback interface. The overall workflow is shown in the figure below:

Figure 1: Pike 1.0 workflow
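To make the flow concrete, here is a minimal client-side sketch of the register-then-callback pattern described above. All names here (PikeService, MessageReceiver, register) are hypothetical illustrations rather than the actual Pike SDK API.

```java
// Hypothetical sketch of the Pike 1.0 client flow: register with APPID +
// UnionID after the long connection is up, then receive pushes via callback.
// None of these names come from the real SDK.
public class PikePushDemo {

    /** Business-side callback through which pushed messages are delivered. */
    interface MessageReceiver {
        void onMessage(byte[] payload);
    }

    static class PikeService {
        private MessageReceiver receiver;

        /** Called once the underlying Shark long connection is established. */
        void register(String appId, String unionId, MessageReceiver receiver) {
            this.receiver = receiver;
            // In the real service this would send a registration packet carrying
            // appId and unionId so the server can route pushes to this device.
        }

        /** Simulates a downlink message arriving over the long connection. */
        void simulateIncomingPush(byte[] payload) {
            if (receiver != null) receiver.onMessage(payload);
        }
    }

    public static void main(String[] args) {
        PikeService pike = new PikeService();
        pike.register("demo-app", "union-id-123",
                payload -> System.out.println("push received: " + payload.length + " bytes"));
        pike.simulateIncomingPush("hello".getBytes());
    }
}
```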

1.3 Advantages of Pike 1.0

Because the underlying transmission of Pike 1.0 rides on the Shark long-connection channel, Pike 1.0 performs well in the following aspects:

  1. Anti-DNS hijacking: The underlying channel connects directly by IP, which saves time-consuming DNS resolution and avoids the risk of DNS hijacking.
  2. Low latency: Shark maintains a long connection to the nearest access point, eliminating the repeated connection establishment and handshakes of traditional HTTP transmission, so end-to-end data transmission delay is greatly reduced compared to HTTP.
  3. Good security: Shark transmits data with a custom binary protocol and applies channel-level TLS encryption, making it tamper-proof and safer.
  4. Better overseas experience: Pike 1.0 shares service clusters with Shark, and the Shark long-connection channel has deployed access points in many overseas locations for proxy acceleration, so its network delay and success rate are better than those of regular requests.

1.4 Pain points of Pike 1.0

As a derivative of Shark, Pike 1.0 has its bright spots, but its strong reliance on Shark creates pain points that developers complain about. The main ones are as follows.

1.4.1 Code structure coupling

In the client SDK, the Pike 1.0 code is coupled with the Shark code structure, sharing the logic of the underlying channel connection, data encryption and decryption, and the binary protocol. The figure below shows the relationship between Pike 1.0 and Shark in the code structure.

Figure 2: Pike 1.0 and Shark code structure

Disadvantage 1: Code optimization and upgrades are difficult. A change to one SDK often requires extra consideration of whether it negatively affects the other SDK and whether that impact is controllable, which unreasonably increases development costs.

Disadvantage 2: Shark and Pike 1.0 share the network configuration environment. As shown in the figure, a network environment configured for SharkTunnel through DebugPanel takes effect for both Shark and Pike 1.0 at the same time, yet the business side usually pays attention to only one of the SDKs. This interference between SDKs has introduced many customer service issues and added noise when troubleshooting them.

1.4.2 The account system is chaotic

Pike 1.0 supports only one device unique identifier, the UnionID, on a given App, and the UnionID registered differs between Apps: for example, Meituan uses the Meituan unique identifier while Dianping uses the Dianping unique identifier. If a business runs on only one App, Pike 1.0 naturally works well, but the same business may need to run on multiple Apps at the same time (as shown in the figure). If the business side does not make its account systems compatible, a service that uses the Meituan unique identifier as its push identifier will not work on the Dianping App, and a service that uses the Dianping unique identifier will not work on the Meituan App. As a result, the push identifier logic of the same service becomes very complicated across Apps, and the backend must maintain mappings between multiple account systems at the same time to work around the confusion.

Figure 3: Pike 1.0 account system incompatibility

1.4.3 Push connection is unstable

Because Pike 1.0 shares Shark's channel logic, it lacks optimizations specific to push scenarios and is not good enough at detecting channel anomalies and recovering from disconnections. In terms of channel availability, the SLAs of Shark and Pike 1.0 are also very different.

For example, when the long-connection channel is unavailable, Shark can downgrade to short connections to keep the success rate of business network requests from dropping. But for Pike 1.0, if the channel cannot be recovered quickly, business message delivery fails, which directly hurts the message delivery success rate. Therefore, the Shark channel's shared connection keep-alive logic cannot be applied as-is to the Pike 1.0 business scenario.

Although Pike 1.0 strengthened heartbeat detection at the protocol layer on top of the Shark channel to improve availability, the channel still could not detect anomalies in time. In addition, the internal event distribution mechanism of Pike 1.0 is not 100% reliable, so customer service issues of abnormal disconnections causing failed pushes are sporadically reported. In short, the demand for dedicated optimization of the unstable push connection kept coming onto the agenda.

1.5 Birth of Pike 2.0

The pain points of Pike 1.0 face many challenges as business scenarios grow richer. To solve the problems encountered while operating Pike 1.0 on Android and iOS, we reorganized the product architecture and code implementation on the one hand, and on the other hand decided to integrate the product with Pike Web, another message delivery service built by the Basic Technology Department for H5, and then launched the upgraded product: Pike 2.0.

The figure below shows the product panorama of Pike 2.0. In view of the status quo of Pike 1.0, Pike 2.0 made many optimizations on both the front end and the back end, including a technical architecture upgrade, cluster independence, and protocol expansion. On the client side, Pike 2.0 provides SDKs implemented in multiple languages to serve multiple platforms; on the server side, Pike provides services through a distributed cluster of Java applications.

Figure 4: Pike 2.0 product panorama

This article elaborates the technical design of the Pike 2.0 client SDK from the client's perspective and explains, at the level of principles, the technical advantages Pike 2.0 brings.

2 Pike 2.0 architecture design

In response to the aforementioned pain point of Pike 1.0's code-structure coupling, Pike 2.0 carried out a new architecture upgrade, keeping the product isolated from Shark in code structure, environment configuration, and service clusters.

2.1 Design ideas

After nearly a year of technology accumulation, the TunnelKit long-connection kernel component and the TNTunnel general channel component extracted from Shark have stabilized, so Pike 2.0 chose to build its two-way message channel service on top of TunnelKit and TNTunnel. The specific advantages are:

  1. Pike 2.0 is built on the TunnelKit long-connection kernel, so it can effectively reuse existing long-connection control functions and avoid unnecessary development work.
  2. Pike 2.0 shares the general capabilities of TNTunnel, such as Shark protocol encapsulation and data encryption and decryption, keeping maintenance costs low.
  3. The Pike 2.0 protocol is carried as the payload of the Shark protocol, so Pike can flexibly customize the protocol around its own characteristics.

2.2 Overall architecture

Figure 5: Client architecture evolution

The overall architecture is shown in the figure, comprising the Pike interface layer, the Pike channel layer, the TNTunnel channel layer, and the TunnelKit long-connection kernel layer.

2.2.1 Pike interface layer

The Pike interface layer aims to provide simple and reliable interfaces for all businesses in the mainstream front-end technology stacks that require in-app messaging services:

  1. Pike 2.0 provides SDKs for the company's mainstream technology stacks such as Android, iOS, and MRN, and businesses can choose flexibly according to their needs.
  2. Pike 2.0 designed two different clients for different message QPS levels. For services with more than 50 messages per second, such as live barrage push, we recommend the aggregated-message client; for other services with lower message volume, the ordinary message client is sufficient.
  3. Pike 2.0 provides a business-transparent migration path for the online Pike 1.0 system, so the business side can migrate from Pike 1.0 to Pike 2.0 for sending and receiving messages without any manual effort.

2.2.2 Pike channel layer

The Pike channel layer is where features are implemented. All API calls from the Pike interface layer are converted, via thread scheduling, into encapsulated Tasks that complete their specific operations in the Pike channel layer. The Pike channel layer runs a single-threaded model, which avoids thread-safety issues to the greatest extent.
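One common way to realize such a single-threaded channel layer is to funnel every API call through a single dedicated executor. The sketch below illustrates the idea under assumed names; it is not the actual SDK implementation.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Minimal sketch of a single-threaded channel layer: every API call from the
// interface layer is wrapped as a task and run on one dedicated thread, so
// channel state never needs locks. All names here are illustrative.
public class ChannelLayer {
    private final ExecutorService channelThread =
            Executors.newSingleThreadExecutor(r -> new Thread(r, "pike-channel"));

    private boolean connected;  // channel state, touched only on the channel thread

    public void start() {
        channelThread.execute(() -> connected = true);  // task: bring the channel up
    }

    public void sendMessage(byte[] payload) {
        channelThread.execute(() -> {                   // task: send one message
            if (!connected) {
                return;  // drop or queue; decided safely on the single thread
            }
            // ... encode payload into the Pike protocol and write to the tunnel
        });
    }
}
```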

Pike features are as follows:

  1. Disconnect and reconnect: Given the inherent instability of long connections, the Pike 2.0 channel uses a reconnection mechanism so that, as long as the network is not down, the business party can treat the channel as continuously available.
  2. Business authentication: The business backend can monitor connection changes through the Pike 2.0 channel and, at the same time, judge whether the client devices accessing the network are available.
  3. Alias mechanism: Business identifiers are separated for different business parties, and each business can customize its own message push identifier, which removes the Pike 1.0 pain point of forcing different services on the same App platform to share one identifier.
  4. Uplink/downlink messages: Pike 2.0 is a two-way channel service. It retains Pike 1.0's original push capability, i.e., the server sends downlink messages to the client, and also supports the client actively sending uplink messages to the server. As long as a business goes through the Pike 2.0 system, a closed message loop can be formed.
  5. Grouped/aggregated messages: Pike 2.0 supports message grouping and message aggregation to serve high-QPS business scenarios. Message grouping lets a service broadcast messages to a group of users via custom labels; message aggregation batches burst messages within a short window before delivery to improve system throughput.
  6. Message ordering: Pike 2.0 supports orderly delivery of uplink messages sent by the same client to a fixed business server.
  7. Independent channel: By default, all services in Pike 2.0 share one channel. Services with high volume or throughput requirements can automatically switch to an exclusive channel to guarantee message delivery success rate and delay.
  8. Channel keep-alive: Pike 2.0 adds channel inspection on top of connection keep-alive, so the channel can be restarted automatically when an anomaly is detected, further improving channel availability in environments that demand long-term stability.

2.2.3 TNTunnel channel layer

The TNTunnel channel layer encapsulates the general channel logic and mainly involves common core modules such as channel state management, protocol encapsulation, and data encryption and decryption; it is an independent layer that extracts the shared channel logic from the original Shark channel. Although the Pike protocol is an application-layer protocol built on top of the existing Shark protocol, the Pike channel is completely decoupled, logically, from the original Shark channel. On the one hand, Pike 2.0 reuses the mature parts of the Shark protocol as much as possible without depending on the original Shark logic; on the other hand, subsequent upgrades to the binary protocol, the security protocol, and other protocols can serve Pike 2.0 at the same time.

2.2.4 TunnelKit long-connection kernel layer

The TunnelKit long-connection kernel layer mainly wraps sockets to handle sending and receiving TCP or UDP data and to manage the availability of each connection. Each Pike 2.0 channel maintains one connection in TunnelKit, whose heartbeat keep-alive mechanism and connection management ensure that, under a normal network environment, there is always a connection available to carry Pike data. As the foundation of all channel layers, TunnelKit is the layer that most determines the stability of the long-connection channel above it.

3 Pike 2.0 working mechanism

On top of the new architecture, Pike redesigned and improved its working mechanism to address the aforementioned Pike 1.0 pain points of a chaotic account system and unstable push connections.

PikeClient, as the portal through which the Pike system connects to the business side, plays a vital role in the entire Pike 2.0 system. This article introduces Pike's working mechanism starting from PikeClient.

3.1 PikeClient life cycle

To better maintain the internal state of Pike 2.0, PikeClient uses a state machine to manage its life cycle.

Figure 6: PikeClient life cycle

As shown in the figure, the life cycle of PikeClient mainly includes the following parts:

  1. onStart: PikeClient enters this state after the business party calls StartClient or RestartClient; at this point PikeClient has started normally. Pike 2.0 then initiates business authentication and transitions to other states according to the result, as shown in the figure: if authentication fails it enters the onStop state, and if authentication succeeds it enters the running state.
  2. onStop: PikeClient enters this state after the business party calls StopClient or business authentication fails; at this point PikeClient has stopped working, and the client must be restarted before it can be used again.
  3. running: This is PikeClient's steady working state, in which Pike 2.0 is waiting to respond to downlink messages pushed by the service and is ready to send uplink messages at any time. As a two-way message channel, Pike 2.0 processes uplink and downlink messages fully in parallel.
  4. onReceive: PikeClient enters this state after successfully receiving a downlink message; after delivering the message to the business party, it returns to the running state and waits for the next operation.
  5. onSendSuccess/onSendFailure: PikeClient enters one of these states after sending an uplink message; the business party can learn the result of the send by monitoring these states.

This state-machine-based life cycle management not only strictly defines PikeClient's workflow but also accurately monitors its internal state, improving PikeClient's maintainability.
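As a rough illustration, the life cycle can be modeled as an explicit state enum. The state names below follow the figure, while the transition logic and helper methods are simplified assumptions.

```java
// Simplified sketch of the PikeClient life cycle as a state machine. The state
// names mirror the figure; transitions and helper methods are assumptions.
public class PikeClientStateMachine {
    enum State { ON_START, RUNNING, ON_RECEIVE, ON_SEND_SUCCESS, ON_SEND_FAILURE, ON_STOP }

    private State state = State.ON_STOP;

    synchronized void startClient() {
        state = State.ON_START;
        // Business authentication decides the next state, as in the figure.
        state = authenticate() ? State.RUNNING : State.ON_STOP;
    }

    synchronized void onDownlinkMessage(byte[] msg) {
        if (state != State.RUNNING) return;
        state = State.ON_RECEIVE;
        deliverToBusiness(msg);          // hand the message to the business callback
        state = State.RUNNING;           // back to the steady state
    }

    synchronized void onUplinkResult(boolean success) {
        // Transient state that listeners can observe, then back to steady state.
        state = success ? State.ON_SEND_SUCCESS : State.ON_SEND_FAILURE;
        state = State.RUNNING;
    }

    private boolean authenticate() { return true; }   // placeholder
    private void deliverToBusiness(byte[] msg) { }    // placeholder
}
```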

3.2 PikeClient working mode

In response to Pike 1.0's chaotic account system, Pike 2.0 designed a new working mode. As shown in the figure below, Pike provides two modes, shared channel and independent channel, through the channel proxy module to meet the needs of different business scenarios.

Figure 7: PikeClient working modes

3.2.1 Shared channel mode

The shared channel mode is the basic working mode of Pike 2.0, and new business parties will use this mode to access Pike 2.0 by default.

In Pike 2.0, PikeClients and the Pike channel service have a many-to-one sharing relationship. Each business party has its own PikeClient, and each PikeClient can customize its own message push identifier instead of using a global one, so the business backend can streamline its push identification logic and avoid maintaining multiple account systems at the same time.

PikeClients of different services are isolated only at the access level; unified management is handled by the Pike channel service inside the Pike 2.0 channel. This many-to-one sharing relationship lets all Pike businesses share the Pike 2.0 channel's characteristics while still allowing message processing to be tailored to each business's usage scenario. Each business party that accesses Pike 2.0 only needs to pay attention to its own PikeClient.

3.2.2 Independent channel mode

Independent channel mode is an extension of shared channel mode; Pike 2.0 decides whether to switch to it through configuration control.

By default, all business parties in Pike 2.0 share the same Pike channel service. However, business scenarios differ, and each business has different requirements for SLA indicators such as message throughput and delay; for example, the game business has a low tolerance for long message delays. For such special services, Pike 2.0 provides independent channel switching support.

All PikeClients connect to Pike channel services through the Pike channel proxy module, which can direct a PikeClient to a specific Pike channel service via switch configuration. Using the proxy pattern preserves the integrity of the original structure: independent channel support is added without touching the Pike channel code logic, and channel switching can be extended and managed effectively, letting the Pike 2.0 channel maximize its service capability while avoiding wasted resources.
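In proxy form, the routing decision can be sketched roughly as follows; the class names and the configuration switch are illustrative assumptions, not the real implementation.

```java
// Sketch of the channel proxy described above: PikeClient talks only to the
// proxy, and the proxy decides (by configuration) whether to route the client
// to the shared channel or to a dedicated one. All names are illustrative.
interface PikeChannel {
    void send(byte[] payload);
}

class ChannelProxy implements PikeChannel {
    private final PikeChannel sharedChannel;
    private final PikeChannel independentChannel;
    private volatile boolean useIndependent;   // driven by remote switch configuration

    ChannelProxy(PikeChannel shared, PikeChannel independent) {
        this.sharedChannel = shared;
        this.independentChannel = independent;
    }

    @Override
    public void send(byte[] payload) {
        // Client code is unchanged whichever channel actually carries the data.
        (useIndependent ? independentChannel : sharedChannel).send(payload);
    }

    void setUseIndependent(boolean on) { useIndependent = on; }
}
```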

3.3 PikeClient keep-alive mechanism

PikeClient keep-alive relies entirely on the keep-alive of the Pike 2.0 channel. To address Pike 1.0's unstable push connection, the Pike 2.0 channel absorbed the keep-alive techniques accumulated in Pike 1.0 and kept optimizing, finally arriving at a triple keep-alive mechanism built on heartbeat detection, reconnection, and channel inspection. The keep-alive mechanism works as follows:

Figure 8: Long-connection channel keep-alive mechanism

3.3.1 Heartbeat detection

Heartbeat detection is a common way to check the status of a network connection. A Pike long connection is a TCP connection, and TCP is a virtual connection: if something in the actual physical link, such as a faulty network node, breaks the connection and neither the client nor the server detects it in time, the connection stays in the ESTABLISHED state even though it may already be dead. Heartbeat detection is the technical solution to this kind of network anomaly.

When the period set by the heartbeat patrol timer arrives, the client checks whether the last heartbeat has timed out. If it has, the connection is considered unavailable, removed from the connection pool, and the reconnection mechanism described below is triggered. To detect channel anomalies faster, Pike 2.0 makes both the heartbeat period and the heartbeat timeout configurable, so they can be tuned for different App scenarios; moreover, every time uplink data is sent, the client promptly checks whether the last heartbeat has timed out, so heartbeat detection results need not wait for the next heartbeat cycle.
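The patrol logic reads roughly as in the sketch below, where the fields and the dead-connection callback are assumed names standing in for Pike's real, configurable settings.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of the heartbeat patrol described above: a timer fires every
// heartbeat period and checks whether the previous heartbeat timed out;
// if so the connection is treated as dead and reconnection is triggered.
public class HeartbeatPatrol {
    private final long heartbeatPeriodMs;   // configurable per App scenario
    private final long heartbeatTimeoutMs;  // configurable per App scenario
    private volatile long lastHeartbeatSentAt;
    private volatile boolean ackReceived = true;

    HeartbeatPatrol(long periodMs, long timeoutMs) {
        this.heartbeatPeriodMs = periodMs;
        this.heartbeatTimeoutMs = timeoutMs;
    }

    void start(ScheduledExecutorService timer, Runnable onDeadConnection) {
        timer.scheduleAtFixedRate(() -> {
            long now = System.currentTimeMillis();
            if (!ackReceived && now - lastHeartbeatSentAt > heartbeatTimeoutMs) {
                onDeadConnection.run();   // remove from pool, trigger reconnection
                return;
            }
            lastHeartbeatSentAt = now;
            ackReceived = false;
            // ... send the heartbeat packet on the connection
        }, heartbeatPeriodMs, heartbeatPeriodMs, TimeUnit.MILLISECONDS);
    }

    void onHeartbeatAck() { ackReceived = true; }
}
```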

Pike 2.0 does not send heartbeat packets at a fixed frequency; it uses the channel's uplink and downlink data packets to dynamically reduce the number of heartbeat packets sent. In addition, smart heartbeats remain a topic Pike 2.0 continues to explore.

3.3.2 Reconnection mechanism

The reconnection mechanism is the core feature of Pike 2.0 as a long connection channel, and it is also the most important part of Pike 2.0's connection stability construction.

The client decides whether to trigger reconnection at three points: sending a message, receiving a message, and heartbeat detection. On the one hand, if it finds that the available connections in the connection pool are insufficient, it automatically starts the reconnection mechanism; on the other hand, reconnection is also triggered automatically when an existing connection is closed.

During reconnection, Pike 2.0 uses a Fibonacci-sequence backoff algorithm to issue connection requests until a connection is successfully established. On the one hand, this guarantees that as long as the network is available, Pike 2.0 always keeps an available long connection serving business messages; on the other hand, it avoids hammering the system with continuous connection attempts when the network stays unavailable.
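A Fibonacci backoff loop can be sketched as follows; the cap on the delay is an assumption added for illustration, not a documented Pike parameter.

```java
// Sketch of Fibonacci-sequence backoff for reconnection: wait 1, 1, 2, 3, 5, 8...
// seconds between attempts, capped so a long outage does not grow the delay forever.
public class FibonacciBackoff {
    public static void reconnectWithBackoff(java.util.function.BooleanSupplier tryConnect)
            throws InterruptedException {
        long prev = 0, delaySeconds = 1;
        final long capSeconds = 60;                 // assumed cap, not from the article
        while (!tryConnect.getAsBoolean()) {        // returns true once connected
            Thread.sleep(Math.min(delaySeconds, capSeconds) * 1000);
            long next = prev + delaySeconds;        // advance the Fibonacci sequence
            prev = delaySeconds;
            delaySeconds = next;
        }
    }
}
```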

3.3.3 Channel inspection

Channel inspection is an effective mechanism to further improve the stability of Pike 2.0 based on the heartbeat detection and reconnection mechanism.

The client sets a global inspection timer according to the heartbeat period. When the timer fires, the client runs the channel anomaly detection logic and tries to restart the channel once an anomaly is found.

When channel anomaly detection is triggered, Pike 2.0 first obtains the current channel status; if the channel was not actively closed yet is in an unavailable state, Pike 2.0 forces a self-restart. In addition, during channel inspection, the inspection manager keeps collecting the timeout exceptions that occur while sending and receiving messages; when the number of consecutive timeout exceptions exceeds the configured maximum threshold, Pike 2.0 judges the current channel's availability to be low, forcibly closes it, and performs a self-restart.
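A minimal sketch of the timeout-counting part of channel inspection, assuming hypothetical names and threshold handling:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the channel inspection described above: consecutive send/receive
// timeouts are counted, and once they exceed a configured threshold the channel
// is judged unhealthy and force-restarted. Names and threshold are assumptions.
public class ChannelInspector {
    private final int maxConsecutiveTimeouts;
    private final AtomicInteger consecutiveTimeouts = new AtomicInteger();

    ChannelInspector(int maxConsecutiveTimeouts) {
        this.maxConsecutiveTimeouts = maxConsecutiveTimeouts;
    }

    void onMessageResult(boolean timedOut, Runnable restartChannel) {
        if (!timedOut) {
            consecutiveTimeouts.set(0);      // healthy traffic resets the count
            return;
        }
        if (consecutiveTimeouts.incrementAndGet() >= maxConsecutiveTimeouts) {
            consecutiveTimeouts.set(0);
            restartChannel.run();            // force close and self-restart
        }
    }
}
```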

4 New features of Pike 2.0

As the upgraded successor to Pike 1.0, Pike 2.0 does not stop at solving Pike 1.0's pain points; opening up new application scenarios through new features is an equal focus of Pike 2.0.

4.1 Aggregated messages

With the rise of live streaming inside the company, many internal business parties also used Pike 1.0 as the transmission channel for downlink real-time messages such as barrage, comments, and live room control signaling. However, given its original architecture, Pike 1.0 gradually became inadequate at providing reliable service for barrage and comment scenarios that must process a flood of messages in a short time: when QPS rises sharply, the message delivery success rate drops, delay grows, and system performance overhead climbs. By introducing aggregated messages, Pike offers a more general solution for message delivery in live streaming scenes.

4.1.1 Design Idea

The messages involved in the live broadcast scenario mainly have the following characteristics:

  1. As the carrier of real-time interaction, barrage must deliver a large amount of text and image information within a short time; without aggregation, a lot of bandwidth is wasted.
  2. Compared with ordinary push scenarios, the audience in a live room has already joined the room, so user behavior is relatively uniform and controllable, and messages can be handled uniformly as group messages.
  3. Different kinds of messages in a live room can be processed with different priorities: lotteries and control signaling require reliability and must not be dropped, while barrage may be dropped as appropriate depending on room heat and what the service can afford.

The design of aggregated messages mainly adopts the following ideas:

  1. Aggregate messages along the time dimension to reduce unnecessary bandwidth consumption.
  2. Adopt a message classification strategy that assigns different priorities by message type to guarantee the reliability of important messages.
  3. Abstract an aggregation unit, analogous to a live room, to uniformly manage the users who join it.
  4. Adopt a client-driven pull strategy. Compared with the traditional server push strategy, active pulling exploits the client's naturally distributed nature to keep user state on the client, so the server can reserve more resources for business processing by maintaining less state.
  5. Provide uplink message capability to offer a more complete message circulation path.

4.1.2 Scheme process

Pike 2.0 maintains a message list for each aggregation unit using a circular queue. Messages sent to the aggregation unit are, after priority filtering, inserted at the position indicated by the queue's tail pointer; as messages keep accumulating and the queue reaches its maximum length, the head pointer moves forward to make room for the tail pointer. By capping the circular queue's length, the aggregation unit avoids the service performance problems that a short-term burst of messages would otherwise cause.

When the client actively pulls, it carries the offset of the last message it obtained from the circular queue, and the service aggregates all messages between that offset and the tail pointer and returns them to the client in one batch. Each client maintains its own offset, sparing the server from keeping per-client state.
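The per-unit circular queue and offset-based pull can be sketched as below; the capacity, types, and method names are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the server-side aggregation unit: a bounded ring buffer holds the
// message list; producers append at the tail, the head effectively advances
// when the buffer is full, and each client pulls everything between its own
// offset and the tail. Sizes and types are illustrative assumptions.
public class AggregationUnit {
    private final byte[][] ring;
    private long tail;   // monotonically increasing logical offset of the next slot

    AggregationUnit(int capacity) { this.ring = new byte[capacity][]; }

    synchronized void append(byte[] message) {
        ring[(int) (tail % ring.length)] = message;  // oldest slot overwritten when full
        tail++;
    }

    // The client carries the offset of the last message it received; the server
    // returns everything newer in one batch, keeping no per-client state.
    synchronized List<byte[]> pull(long clientOffset) {
        long from = Math.max(clientOffset, tail - ring.length); // skip overwritten slots
        List<byte[]> batch = new ArrayList<>();
        for (long i = from; i < tail; i++) {
            batch.add(ring[(int) (i % ring.length)]);
        }
        return batch;   // empty when the client is already at the tail
    }
}
```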

The specific interaction between client and server is shown in the figure. The client starts pulling actively after joining an aggregation unit. If the offset carried in a pull yields aggregated messages from the service's circular queue, those messages are called back to the business and the next pull is issued immediately. If the carried offset is already at the tail pointer, the server does not respond; the client waits for the pull timeout before starting the next pull, and the process repeats until the client leaves the aggregation unit. Meanwhile, if the business server has a message to push, it sends it to the Pike server via RPC, and the message processing module inserts the messages that pass the classification-strategy filter into the circular queue.

Figure 9: Aggregated message interaction flow

4.2 Message order preservation

Pike 1.0 was designed for message push scenarios only, while Pike 2.0 evolved on that basis into a two-way message delivery service that supports not only downlink push but also uplink delivery. For uplink delivery, Pike 2.0 further adds message order preservation, which carries two guarantees: first, messages sent by a given business client reach the same business server to the greatest extent possible; second, those messages arrive at the business server in the order the client sent them.

4.2.1 Sticky sessions

To have the messages sent by each business client reach the same business server as much as possible, Pike 2.0 introduces the concept of sticky sessions. A sticky session means messages on the same client connection are forwarded to one particular business machine, and after the client disconnects and reconnects, messages on the new connection are still forwarded to that same machine.

Sticky sessions work as follows. At the first business login, the Pike 2.0 server selects a business server with a load balancing algorithm and returns that server's routing identifier to the client via the login result; the client saves it. Afterwards, while the channel is stable, all uplink messages are delivered to that business server. If the channel fluctuates and the connection drops, Pike 2.0 reconnects and performs the business login again, this time reporting the saved routing identifier, so the Pike 2.0 server can re-bind the business server that the identifier indicates. Of course, if that business server has stopped serving, the Pike 2.0 server selects a new one through load balancing, the client obtains the new routing identifier, and the same logic repeats until the Pike 2.0 client exits.
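Seen from the client, the sticky-session handshake can be sketched as follows; all names are illustrative, and the real login protocol is not modeled.

```java
// Sketch of the sticky-session handshake described above, from the client side:
// the routing identifier returned by the first business login is cached and
// re-reported after every reconnection. All names are illustrative.
public class StickySessionClient {
    private String savedRouteId;   // routing identifier of the bound business server

    void onConnected(PikeLoginApi api) {
        // Re-report the saved route so the server rebinds the same business
        // machine; on first login savedRouteId is null and the server picks one
        // via load balancing.
        savedRouteId = api.businessLogin(savedRouteId);
    }

    interface PikeLoginApi {
        // Returns the routing identifier actually bound by the server, which may
        // differ from the requested one if that business server went offline.
        String businessLogin(String requestedRouteId);
    }
}
```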

4.2.2 Timing consistency

TCP itself delivers data in order, so on the same TCP connection, how can messages sent by the client reach the business server out of order? The reason is that the Pike 2.0 server reads messages off the TCP connection and then forwards them to the business server via asynchronous RPC calls, which may complete out of order.

The simplest fix is for the client to cap its send window at 1: each message is ACKed only after the Pike 2.0 server has delivered it to the business server, and only then is the next message sent. But since network transmission delay on the link is much larger than processing delay on either end, this scheme's QPS is bottlenecked by network transmission: with an RTT of 200 ms, it can theoretically reach only 5 QPS (1000 ms / 200 ms).

To raise the QPS of ordered uplink delivery, Pike 2.0 instead places a message queue buffer on the server side. As shown in the figure, the client may send multiple messages at once within its send window; the server buffers the received messages in the queue in arrival order and then delivers them to the business server serially, one RPC call at a time. This order-preserving scheme moves the QPS bottleneck from the link's network transmission delay to the RPC call delay, and since an RPC call in practice takes only a few milliseconds, far less than transmission delay on the link, QPS improves significantly.

Figure 10: Timing consistency
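A minimal server-side sketch of this buffering scheme, under assumed names; the real Pike server's RPC machinery is not modeled.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of the server-side order-preserving buffer: messages read from the TCP
// connection are enqueued in arrival order, and a single consumer delivers them
// to the business server one RPC at a time, so ordering survives the otherwise
// asynchronous RPC step. Names are illustrative.
public class OrderedDeliveryBuffer {
    private final BlockingQueue<byte[]> buffer = new LinkedBlockingQueue<>();

    void onUplinkMessage(byte[] msg) {
        buffer.offer(msg);   // producer: TCP reader thread, arrival order kept
    }

    void startDeliveryLoop(java.util.function.Consumer<byte[]> rpcCall) {
        Thread consumer = new Thread(() -> {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    byte[] msg = buffer.take();
                    rpcCall.accept(msg);   // serial RPC: next message waits for this one
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "ordered-delivery");
        consumer.start();
    }
}
```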

5 Pike 2.0 stability guarantee

5.1 Pike 2.0 monitoring system

Pike 2.0 builds its monitoring on Raptor, Meituan's monitoring platform; both the server and the client maintain comprehensive metric monitoring of their own. Using Raptor's end-to-end and custom metric capabilities, the Pike 2.0 client reports more than ten monitoring metrics covering channel establishment, message delivery, business login, system exceptions, and other dimensions, monitoring the Pike system in real time. On top of real-time monitoring, Pike 2.0 configures alarm thresholds per metric; taking push messages as an example, if the minute-level aggregate data of a particular App fluctuates by more than 10%, Raptor pushes alarm information to the members of the Pike project team.

Based on all Raptor monitoring indicators, Pike 2.0 refines the core SLA indicators as follows:

Indicator name | Indicator definition | Indicator meaning
Uplink message delivery success rate | Success rate of an uplink message from being sent to receiving an ACK | Represents Pike 2.0's uplink message delivery capability
Uplink message delivery delay | Uplink message delivery RTT | Same as above
Downlink message delivery success rate | Success rate of a downlink message from being sent to receiving an ACK | Represents Pike 2.0's downlink message delivery capability
Downlink message delivery delay | Downlink message delivery RTT | Same as above
Channel availability time | Time from channel establishment until messages can be delivered | Represents Pike 2.0's channel connection capability
Average daily message volume | Total number of uplink and downlink messages delivered per day | Represents Pike 2.0's product service capability

Pike 2.0 regularly produces large-scale data reports based on these core SLA indicators, and the data can be filtered by App, business type, network type, and other dimensions to meet different users' needs for indicator data.

5.2 Pike 2.0 case user tracking

The monitoring system reflects the stability of the Pike 2.0 system from a global perspective; for individual users, the Pike management platform provides complete link tracing information.

Each Pike 2.0 connection is distinguished by a unique Token. Through the "connection sniffing" module of the Pike management platform, active probing by that Token retrieves the interaction flow of all signaling on the corresponding connection. As shown in the figure, the flow clearly marks the client's signaling for establishing the connection, initiating authentication, binding aliases, and so on; clicking a signaling entry jumps to its details, where the information it carries can be examined. Combined with the offline logs that the SDK writes to Logan, Meituan's logging service, problems can be found and located quickly.

Figure 11: Case user tracking signaling interaction

6 Pike 2.0 construction results

As of June 2021, Pike serves 200+ businesses with an average daily message volume of about 5 billion; the Pike 2.0 message arrival rate exceeds 99.5% (0.4% higher than Pike 1.0), and its average end-to-end delay is below 220 ms (about 37% lower than Pike 1.0).

Some application cases:

  • Message service solution for live streaming scenes: supports the live interactive features of the live streaming business, with the capacity for million-scale concurrent online audiences in large live broadcasts.
  • Real-time reach solutions such as message push and feed pre-loading: supports real-time push of marketing, control, and other business messages, raising business message arrival rates by up to 10% and cutting long-connection channel establishment time by 5%.
  • IoT device access solution: supports IoT access for the dining cabinet business, helping raise its business message arrival rate from 98.4% to 99.6%.
  • Message delivery solution for mini-game scenes: supports communication for Meituan mini-game scenarios, with a message arrival rate above 99.8% and uplink delay as low as 195 ms.

7 Summary and future prospects

Pike is widely used across Meituan, currently focused on business scenarios such as real-time reach, interactive live streaming, and mobile-device synchronization. As the company's business develops rapidly, Pike faces higher requirements for availability, usability, and scalability, and aims to improve the network experience in every business scenario. Pike's future planning therefore focuses on providing multi-terminal, multi-scenario network communication solutions, continuously refining the protocol ecosystem, and fighting complex networks across application scenarios:

  • Expand Pike's general basic capabilities and improve channel performance. Further improve throughput and stability and reduce push delay by optimizing the order-preserving scheme, providing dedicated channels, and optimizing the transmission protocol.
  • Build Pike IoT and provide an IoT access-layer solution. Provide a unified IoT access-layer solution for the company's Internet-of-Things scenarios (bicycles, power banks, dining cabinets, smart helmets, warehouses, store equipment, etc.), supporting multiple access protocols (HTTP, MQTT, CoAP, etc.) and offering businesses safe, reliable device connection and communication capabilities.
  • Optimize the communication experience in weak-network environments. Based on Meituan's self-developed MQUIC network protocol library, explore Pike over QUIC on mobile and IoT terminals and WebTransport technology on the desktop, fully supporting the QUIC protocol to improve network performance in weak-network and large-packet scenarios and reduce the long tail of request latency.

About the Author

Jianwu, Jiameng, Lu Kai, Feng Jiang, and others, all from the Front-end Technology Center of the Meituan Basic Technology Department.

Job Offers

The Front-end Technology Center of the Meituan Basic Technology Department is hiring senior engineers and technical experts, based in Shanghai and Beijing. We build high-performance, high-availability, high-experience front-end basic technical services for Meituan's massive business, covering network communications, resource hosting, design systems, R&D collaboration, low-code, performance monitoring, operations assurance, Node infrastructure, IoT infrastructure, and other technical fields. Interested candidates are welcome to send resumes to: edp.itu.zhaopin@meituan.com.

