Author | Yan Gaofei
Source | Alibaba Cloud Native

Dubbo is a lightweight open source Java service framework and the first choice of many enterprises when building a distributed service architecture. Industrial and Commercial Bank of China (ICBC) has been exploring the transformation to a distributed architecture since 2014 and has independently developed a distributed service platform based on open source Dubbo.

The Dubbo framework runs stably and performs well when each provider serves a small number of consumers. With the growing demand for online, diversified, and intelligent banking services, the foreseeable future will bring scenarios in which a single provider serves thousands or even tens of thousands of consumers.

Under such a high load, if the server-side program is not designed well, the network service may become inefficient or even completely paralyzed when handling tens of thousands of client connections; this is the classic C10K problem. So, can a Dubbo-based distributed service platform cope with complex C10K scenarios? To answer this, we built a large-scale connection environment and simulated service calls to conduct a series of explorations and verifications.

A large number of transaction failures occur in Dubbo service calls in the C10K scenario

1. Prepare the environment

We used Dubbo 2.5.9 (whose default Netty version is 3.2.5.Final) to write a service provider and the corresponding service consumers. The provider's service method has no actual business logic and only sleeps for 100ms; the consumer side configures the service timeout as 5s, and each consumer calls the service once per minute after starting.
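As a rough illustration of this setup (not the actual test code; the interface name DemoService, the method name, and the lookup helper are assumptions), the provider method only sleeps for 100ms, and each consumer invokes the service once per minute with a 5s timeout configured on its Dubbo reference:

```java
// Hypothetical service used for the verification; names are illustrative.
interface DemoService {
    String invoke(String payload);
}

// Provider implementation: no business logic, just a 100 ms sleep, as described above.
class DemoServiceImpl implements DemoService {
    @Override
    public String invoke(String payload) {
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "ok";
    }
}

// Consumer side: the Dubbo reference would be configured with timeout=5000 (5 s);
// after startup, each consumer calls the service once per minute.
class ConsumerLoop {
    public static void main(String[] args) throws InterruptedException {
        DemoService service = lookupDubboReference(); // assumed helper standing in for the Dubbo reference config
        while (true) {
            service.invoke("ping");
            Thread.sleep(60_000);
        }
    }

    private static DemoService lookupDubboReference() {
        // Placeholder: in the real test this is wired up through Dubbo's consumer configuration.
        throw new UnsupportedOperationException("wire up via the Dubbo reference config");
    }
}
```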

One 8C16G server is used to deploy the service provider in a container, and hundreds of 8C16G servers are used to deploy 7,000 service consumers in containers.

Start the Dubbo monitoring center to monitor the service invocation.

2. Customize the verification scenario and observe the verification results


The verification results were not satisfactory: in the C10K scenario, Dubbo service calls may time out and fail.

If a distributed service call takes a long time, every node on the full link from service consumer to service provider occupies thread pool resources for a long time, adding extra performance overhead. When call concurrency surges, the full link can easily become blocked, affecting calls to other services, further degrading the performance of the whole service cluster, or even making it unavailable and triggering an avalanche. The service call timeout problem therefore cannot be ignored, so we analyzed the Dubbo service call timeouts in this C10K scenario in detail.

C10K scenario problem analysis

Following the service call transaction link, we first suspected that the transaction timeouts were caused by the provider or consumer process itself, or by network delay.


Therefore, we enabled GC logging on the provider and consumer servers where transactions failed, printed the process jstack multiple times, and captured network packets on the host machines.

1. Observe GC logs and jstack

The GC duration, GC interval, memory usage, and thread stacks of the provider and consumer processes show no obvious abnormalities, which temporarily rules out conjectures such as GC-triggered stop-the-world pauses causing timeouts, or improper thread design causing blocking and timeouts.

2. Observe failed transactions in two scenarios

For the failed transactions in the two scenarios above, we examined the network captures separately and observed two different phenomena:

Scenario 1: Transactions time out during stable operation of the provider

Tracking the network captures and transaction logs of the provider and consumers shows that after a consumer initiates a service call request, the request message is captured on the provider side almost immediately, but the provider takes 2s+ from receiving the request message to starting to process the transaction.


Observing the data flow of the transaction response at the same time: after the provider's business method finishes, it takes another 2s+ before the return packet is sent to the consumer. The consumer then receives the transaction response quickly, but by this time the total transaction time has exceeded the 5s service call timeout, so a timeout exception is thrown.


Therefore, it is determined that the cause of the transaction timeout is not on the consumer side, but on the provider side.

Scenario 2: A large number of transactions time out after the provider restarts

After a service call request is initiated, the provider quickly receives the request message from the consumer, but instead of delivering the transaction message to the application layer normally, it responds with an RST packet, and the transaction times out and fails.


A large number of RST packets are observed within 1-2 minutes after the provider restarts. A deployment script printed the number of established connections every 10ms after the restart; the number of connections did not recover to 7,000 quickly, but only returned to normal after 1-2 minutes. During this period, the connection status to the provider was queried on each consumer side, and all connections were reported as established, so we suspected that the provider had unilateral (half-open) connections.

We continue to analyze these two abnormal scenarios separately.

Scenario 1: The provider takes a long time before and after the actual transaction, causing the transaction to time out

We collected the provider's operating status and performance indicators in detail:

  1. jstack output of the service provider is collected every 3s on the provider server, and it is observed that the netty worker threads process heartbeats intensively roughly every 60s.
  2. Printing top -H at the same time shows that 9 of the top 10 threads by CPU time slice are netty worker threads. Because the provider server has 8 cores and Dubbo defaults to cores + 1, i.e. 9, netty worker threads, this means all 9 netty worker threads are busy.


  3. The system performance collection tool nmon is deployed on the server, and it is observed that the CPU spikes every 60 seconds; the number of network packets spikes at the same time.


  4. ss -ntp is run continuously to print the data backlog in the network receive and send queues, and it is observed that there is more queue accumulation around the time points of the slow transactions.


  5. In the Dubbo service framework, providers and consumers send heartbeat messages (message length 17) with a period of 60s, which matches the interval above. Combined with the network capture, there are many heartbeat packets around the time points of the slow transactions.


According to the heartbeat mechanism of the Dubbo framework, when the number of consumers is large, the heartbeat packets the provider sends and the consumer heartbeats it must answer become very dense. We therefore suspected that the intensive heartbeats kept the netty worker threads busy, which affected the processing of transaction requests and in turn increased transaction time.
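As a simplified sketch of this heartbeat idea (not Dubbo's actual heartbeat task; the channel bookkeeping shown here is an assumption), each side periodically scans its channels and sends a heartbeat on every channel that has seen no read or write within one heartbeat cycle, so a provider with thousands of idle consumer channels emits thousands of heartbeat packets on each tick:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Minimal model of the heartbeat scan described above.
class HeartbeatScanner {
    static final long HEARTBEAT_MS = 60_000;

    // channel id -> last time a packet was read or written on that channel (assumed bookkeeping)
    final Map<String, Long> lastActivity = new ConcurrentHashMap<>();

    void start() {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleWithFixedDelay(this::scan, HEARTBEAT_MS, HEARTBEAT_MS, TimeUnit.MILLISECONDS);
    }

    void scan() {
        long now = System.currentTimeMillis();
        for (Map.Entry<String, Long> entry : lastActivity.entrySet()) {
            if (now - entry.getValue() >= HEARTBEAT_MS) {
                sendHeartbeat(entry.getKey()); // one small packet per idle channel, all handled by the IO threads
            }
        }
    }

    void sendHeartbeat(String channelId) {
        // In the real framework this writes a heartbeat request on the channel and expects a response; stubbed here.
    }
}
```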


Further analyzing the operating mechanism of the netty worker threads, we recorded the time each netty worker thread spends in three key steps: processing connection requests, processing the write queue, and processing selected keys. We observed that roughly every 60s (consistent with the heartbeat interval), reading and processing packets takes longer, and transaction time increases during these periods. The network capture at the same time shows the provider receiving more heartbeat packets.


This confirms the suspicion above: intensive heartbeats keep the netty worker threads busy, which increases transaction time.

Scenario 2: Unilateral connections cause transaction timeouts

  • Analyze the causes of unilateral connections

During the three-way handshake of TCP connection establishment, if the full connection queue is full, a unilateral connection results.


The size of the full connection queue is the minimum of the system parameter net.core.somaxconn and the backlog passed to listen(), i.e. min(somaxconn, backlog). somaxconn is a Linux kernel parameter with a default value of 128; the backlog is set when the socket is created, and the default backlog in Dubbo 2.5.9 is 50. Therefore the full connection queue in this environment is 50. The ss command (Socket Statistics) also shows that the full connection queue size is 50.
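As a minimal illustration of where these two values meet (plain JDK sockets rather than Dubbo's Netty server; the port is just Dubbo's default and the backlog value is arbitrary), the application requests a backlog when binding, and the kernel caps the effective accept queue at min(backlog, net.core.somaxconn), so both values must be raised together:

```java
import java.net.InetSocketAddress;
import java.net.ServerSocket;

class BacklogDemo {
    public static void main(String[] args) throws Exception {
        // The second argument to bind() is the backlog the application asks for.
        // The kernel uses min(requested backlog, net.core.somaxconn) as the real
        // full connection (accept) queue size.
        ServerSocket server = new ServerSocket();
        server.bind(new InetSocketAddress(20880), 1024);
        System.out.println("Listening on 20880 with a requested backlog of 1024");
        Thread.sleep(60_000); // keep listening so the queue size can be inspected, e.g. with ss -lnt
        server.close();
    }
}
```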


Observing the TCP connection queue status confirms that the full connection queue has overflowed.


This means that the insufficient capacity of the full connection queue results in a large number of unilateral connections. In this verification scenario, too many consumers subscribe to the provider; when the provider restarts, the registry pushes the provider's online notification to the consumers, and all consumers reconnect to the provider almost simultaneously, causing the full connection queue to overflow.

  • Analyze the impact scope of unilateral connections

The impact of a unilateral connection mostly falls on the consumer's first transaction; occasionally the first 2-3 transactions fail consecutively.

When a unilateral connection is created, the transaction does not necessarily fail. After the full connection queue fills during the three-way handshake, if the half connection queue is still free, the provider starts a timer to retransmit syn+ack to the consumer. By default it retransmits 5 times, and the retransmission interval doubles each time, starting at 1s, for a total of 31s. If the full connection queue frees up within the retransmission window, the consumer responds with an ack, the connection is established successfully, and the transaction succeeds.
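Spelled out, the five retransmissions with doubling intervals account for the 31s window mentioned above:

$$1\,\mathrm{s} + 2\,\mathrm{s} + 4\,\mathrm{s} + 8\,\mathrm{s} + 16\,\mathrm{s} = 31\,\mathrm{s}$$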


If the full connection queue remains full throughout the retransmission window, the new transaction fails once the timeout is reached.

Once the retransmission limit is reached, the connection is dropped. If the consumer then sends a request, the provider responds with an RST, and the transaction fails when the timeout is reached.


According to Dubbo's service invocation model, after the provider sends an RST, the consumer throws a "Connection reset by peer" exception and disconnects from the provider. The consumer cannot receive the response for the current transaction, so a timeout exception results. Meanwhile, a consumer-side timer checks the connection to the provider every 2s; if the connection is abnormal, it initiates a reconnection and the connection is restored. Transactions are normal from then on.
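A simplified sketch of this recovery behavior (not Dubbo's actual reconnect task; the connection check and reconnect calls are stubs): a consumer-side scheduled task checks the connection every 2s and re-establishes it when it is found broken.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Minimal model of the consumer-side connection check described above.
class ReconnectChecker {
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

    void start() {
        // Every 2 s, verify the connection to the provider and reconnect if it is down.
        timer.scheduleWithFixedDelay(() -> {
            if (!isConnected()) {
                reconnect();
            }
        }, 2, 2, TimeUnit.SECONDS);
    }

    boolean isConnected() {
        // In the real framework this inspects the client channel state; stubbed here.
        return true;
    }

    void reconnect() {
        // Re-establish the TCP connection to the provider; stubbed here.
    }
}
```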


3. C10K scenario analysis summary

To sum up, there are two reasons for the above transaction timeout:

  • The heartbeat mechanism keeps the netty worker threads busy. In each heartbeat task, the provider sends a heartbeat to every consumer with which it has neither sent nor received a message within one heartbeat cycle, and each consumer does the same toward its providers. With many consumers connected to the provider, heartbeat packets pile up, and heartbeat processing consumes considerable CPU, which delays the processing of business packets.
  • The full connection queue capacity is too small. After the provider restarts, the queue overflows, resulting in a large number of unilateral connections. The first transaction over a unilateral connection has a high probability of timing out and failing.

4. Next steps

  1. For scenario 1 above: how to reduce the time a single netty worker thread spends processing heartbeats and improve IO thread efficiency? The following schemes were initially conceived:
  • Reduce the processing time of a single heartbeat
  • Increase the number of netty worker threads to reduce the load on each IO thread
  • Spread out the heartbeats to avoid dense bursts of processing
  2. For scenario 2 above: how to avoid first-transaction failures caused by a large number of unilateral connections? The following schemes were envisaged (a rough Netty-level sketch of these directions follows this list):
  • Increase the length of the TCP full connection queue, which involves the operating system, the container, and Netty
  • Improve the speed at which the server accepts connections
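To make these two directions concrete, here is a rough Netty 3 sketch (illustrative only, not Dubbo's NettyServer code; the worker thread count and the backlog value are assumptions): raise the worker thread count so heartbeat and business IO are spread across more threads, and request a larger listen backlog so a reconnect storm after a restart is less likely to overflow the accept queue.

```java
import java.net.InetSocketAddress;
import java.util.concurrent.Executors;

import org.jboss.netty.bootstrap.ServerBootstrap;
import org.jboss.netty.channel.ChannelPipeline;
import org.jboss.netty.channel.ChannelPipelineFactory;
import org.jboss.netty.channel.Channels;
import org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory;

class TunedNettyServer {
    public static void main(String[] args) {
        // Assumption: more IO worker threads than Dubbo's default of cores + 1.
        int workerThreads = Runtime.getRuntime().availableProcessors() * 2;

        ServerBootstrap bootstrap = new ServerBootstrap(
                new NioServerSocketChannelFactory(
                        Executors.newCachedThreadPool(),  // boss threads: accept connections
                        Executors.newCachedThreadPool(),  // worker threads: read/write IO
                        workerThreads));

        // Request a larger listen backlog; the kernel still caps it at net.core.somaxconn.
        bootstrap.setOption("backlog", 1024);

        bootstrap.setPipelineFactory(new ChannelPipelineFactory() {
            public ChannelPipeline getPipeline() {
                return Channels.pipeline(); // codec/handler setup omitted in this sketch
            }
        });

        bootstrap.bind(new InetSocketAddress(20880));
        System.out.println("Netty 3 server bound with " + workerThreads + " worker threads");
    }
}
```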

Improving the processing efficiency of transaction messages

1. Optimize layer by layer

Based on the above ideas, we carried out extensive optimizations at both the system level and the Dubbo framework level to improve transaction processing efficiency in the C10K scenario and increase the performance capacity of service calls.

The optimizations cover multiple aspects, spanning the operating system level and several layers of the Dubbo framework.

After verifying each optimization item one by one, each measure brought an improvement to a varying degree.

2. Verification effect of the combined optimizations

Applying all of the above optimizations together gives the best result. In this verification scenario, where 1 provider connects to 7,000 consumers, the system runs for a long time after the provider restarts without any transaction timeouts. Comparing before and after optimization: the provider's peak CPU usage drops by 30%, the processing time difference between consumer and provider is kept within 1ms, and the P99 transaction time drops from 191ms to 125ms. Besides improving the transaction success rate, the optimizations effectively reduce consumer waiting time, reduce the resources occupied by the service, and improve system stability.

3. Actual operation effect in production

Based on the above verification results, Industrial and Commercial Bank of China integrated these optimizations into its distributed service platform. As of the date of publication, there are already production scenarios in which tens of thousands of consumers connect to a single provider. After this optimized version was rolled out, no abnormal transaction timeouts occurred during provider version upgrades or long-term operation, and the actual results met expectations.

Future outlook

Industrial and Commercial Bank of China is deeply involved in building the Dubbo community and has encountered many technical challenges in applying Dubbo at financial-grade scale. To meet the demanding operational requirements of highly sensitive financial transactions, it has carried out large-scale independent development, extending and customizing the Dubbo framework to continuously improve the stability of its service system, and it keeps contributing general enhancements back to the open source community under the principle of "originating from open source and giving back to open source".

In the future, we will continue to focus on large-scale financial-grade applications of Dubbo, work with the community to keep improving Dubbo's performance capacity and availability, help accelerate digital innovation and transformation in the financial industry, and keep the foundational core technology fully independent and controllable.

About the Author

Yan Gaofei is an architect in the field of microservices, mainly engaged in research and development of service discovery and high-performance network communication. He specializes in ZooKeeper, Dubbo, RPC protocols, and related technical directions.
