This article was originally shared by the Himalaya technical team. The original title, "Practice of Himalaya's Self-Developed Gateway Architecture", has been changed.
1. Introduction
The gateway is by now a fairly mature product. Almost every major Internet company runs a gateway as middleware to host common shared features, and a gateway can be updated and iterated quickly. Without one, shipping a change to a common feature means pushing every business team to update and redeploy, which is extremely inefficient; with a gateway, this stops being a problem.
Himalaya is no exception. Our user base has grown past 600 million and we now run 500+ web services. Today our gateway handles 20 billion+ calls per day, with a single-machine peak of 40,000+ QPS.
Beyond the basic reverse-proxy function, the gateway provides common features such as black/white lists, flow control, authentication, circuit breaking, API publishing, monitoring, and alarms. Driven by business needs, we have also implemented traffic scheduling, traffic copying, pre-release, intelligent upgrade/downgrade, traffic warm-up, and other related functions.
Technically, the evolution roadmap of the Himalaya API gateway is roughly as follows:
This article shares the technical evolution and the practical lessons of the Himalaya API gateway at the scale of hundreds of millions of calls.
Learning and exchange:
- Instant messaging/push technology development exchange group 5: 215477170 [recommended]
- Introduction to mobile IM development: "One Entry Is Enough for Novices: Developing Mobile IM from Scratch"
- Open-source IM framework source code: https://github.com/JackJiang2011/MobileIMSDK
(This article was published simultaneously at: http://www.52im.net/thread-3564-1-1.html)
2. Series article catalog
This article is the fifth in a series. The series so far:
- "Long-Connection Gateway Technology (1): A Summary of Jingdongmai's Production-Grade TCP Gateway Technology Practice"
- "Long-Connection Gateway Technology (2): Zhihu's High-Performance Long-Connection Gateway Practice with Tens of Millions of Concurrent Connections"
- "Long-Connection Gateway Technology (3): The Technology Evolution of Mobile Taobao's 100-Million-Level Access-Layer Gateway"
- "Long-Connection Gateway Technology (4): iQIYI's WebSocket Real-Time Push Gateway Technology Practice"
- "Long-Connection Gateway Technology (5): Himalaya's Self-Developed 100-Million-Level API Gateway Technology Practice" (* this article)
3. Version 1: Tomcat NIO + Async Servlet
The most critical point in a gateway's architecture design is that, after receiving a request, the gateway must not block while calling the back-end service; otherwise its throughput cannot grow, because the remote call to the back-end service is the most time-consuming step.
If that call blocks, all of Tomcat's worker threads end up blocked waiting for back-end responses and no other requests can be processed, so this step must be asynchronous.
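The non-blocking idea can be modeled in plain Java with CompletableFuture: the worker registers a callback for the back-end response and returns immediately. This is only an illustrative sketch, assuming a hypothetical `callBackend` stand-in for the gateway's NIO HTTP client, not the actual implementation:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class AsyncProxySketch {

    // Hypothetical stand-in for the real NIO HTTP client call to a back-end service.
    static CompletableFuture<String> callBackend(String request) {
        // Simulate 50 ms of back-end latency without blocking the caller.
        return CompletableFuture.supplyAsync(
                () -> "echo:" + request,
                CompletableFuture.delayedExecutor(50, TimeUnit.MILLISECONDS));
    }

    public static void main(String[] args) throws Exception {
        // The worker thread attaches a callback and is immediately free again;
        // it never blocks waiting for the back-end response.
        CompletableFuture<String> response =
                callBackend("hello").thenApply(body -> "HTTP/1.1 200 " + body);
        System.out.println("worker thread is free");
        System.out.println(response.get()); // demo only; the gateway never blocks on get()
    }
}
```

The point is the shape of the control flow, not the classes involved: in the real gateway the callback writes the response back through the Push layer instead of printing it.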
The architecture diagram of version 1 is as follows:
In this version we implemented a separate Push layer: after the gateway receives the back-end response, this layer writes the response back to the client. Communication with back-end services uses HttpNioClient, and functions such as black/white lists, flow control, authentication, and API publishing are supported.
However, this version only met the gateway's functional requirements, and its processing capacity soon became the bottleneck: at a single-machine QPS of 5K, Full GCs occurred continuously.
Analysis of an online heap dump later showed that Tomcat was caching many HTTP requests: Tomcat caches 200 requestProcessor objects by default, and each processor is associated with a request.
Tomcat's Servlet 3.0 asynchronous implementation also has a memory leak. Reducing the cache configuration later produced an obvious improvement,
but performance inevitably dropped. In summary, using Tomcat as the access layer has the following problems.
Tomcat's own problems:
- 1) Too much caching: Tomcat uses object-pooling techniques heavily; with limited memory, high traffic easily triggers GC;
- 2) Memory copies: Tomcat uses heap memory by default, so data must first be read into the heap, while our back-end services use Netty with off-heap memory, so the data goes through several copies;
- 3) Blocking body reads: Tomcat's NIO model differs from the reactor model, and reading the request body is a blocking operation.
Here is a diagram of Tomcat's buffers:
As the figure shows, Tomcat is well encapsulated externally, but internally it performs three copies by default.
HttpNioClient's problem: both acquiring and releasing a connection require locking. In a proxy scenario like a gateway, where connections are established and closed frequently, this inevitably hurts performance.
Given these problems with Tomcat, we later rebuilt the access side, using Netty as both the access layer and the service-call layer. That is our second version; it completely solves the problems above and reaches the performance we wanted.
4. Version 2: Netty + Fully Asynchronous
Building on Netty's strengths, we implemented a fully asynchronous, lock-free, layered architecture.
Let's take a look at the architecture diagram of our access terminal based on Netty:
PS: If you know little about Netty and Java NIO, be sure to read the following articles first:
- "Less Long-Winded! One Minute to Understand the Difference Between Java's NIO and Classic IO"
- "Java BIO and NIO Are Hard to Understand? Code Practice Shows You; If You Still Don't Get It, I'll Change Careers!"
- "The Strongest Java NIO Introduction in History: For Those Going from Getting Started to Giving Up, Read This!"
- "Written for Beginners: Learning Methods and Advanced Strategies for Netty, the High-Performance Java NIO Framework"
- "Novice Introduction: The Most Thorough Analysis So Far of Netty's High-Performance Principles and Framework Architecture"
- "The Most Popular Netty Framework Introduction in History: Basics, Environment Setup, Hands-On Practice"
4.1 Access layer
Netty's IO threads are responsible for encoding and decoding the HTTP protocol, and for monitoring and alarming on protocol-level anomalies.
We optimized HTTP encoding/decoding and made the monitoring of abnormal and hostile requests visual. For example, we limit the sizes of the HTTP request line and request headers: Tomcat applies a single combined limit (8K by default) to the request line and headers, while Netty limits each separately.
If a client sends a request exceeding the threshold (requests carrying cookies easily do), Netty normally responds directly with a 400.
After our modification, we instead read only the portion within the limit and mark the request as having failed protocol parsing; once it reaches the business layer, we can determine which service has this problem. Hostile requests, such as those sending only the request header and no body (or only part of one), also need monitoring and alarms.
4.2 Business logic layer
This layer is responsible for a series of common supporting logic, such as API routing and traffic scheduling, implemented with the chain-of-responsibility pattern; no IO operations happen in this layer.
In the industry and at major vendors, the gateway's business logic layer is generally designed as a chain of responsibility, with common business logic implemented in this layer.
We follow the same approach, supporting:
- 1) User authentication and login verification, configurable at the interface level;
- 2) Black/white lists: global and per-application, plus IP-level and parameter-level;
- 3) Flow control: both automatic and manual; automatic control intercepts abnormally large traffic and is implemented with a token-bucket algorithm;
- 4) Intelligent circuit breaking: improved on the basis of Hystrix, with automatic upgrade and downgrade. Ours is fully automatic: when a service's error ratio reaches the threshold, the breaker trips automatically. Manual configuration for immediate breaking is also supported;
- 5) Gray release: for traffic to newly started machines we support a slow-start mechanism similar to TCP's, giving the machine a warm-up window;
- 6) Unified degradation: all requests that fail to be forwarded go through unified degradation logic; any business with degradation rules configured gets degraded. Rules go down to the parameter level, including values in request headers, which is very fine-grained. We also integrate with Varnish to support graceful degradation through Varnish;
- 7) Traffic scheduling: a business can route traffic matching filter rules to specific machines, and can also restrict a machine to accept only the filtered traffic. This is very useful for troubleshooting and for verifying new releases: a small portion of traffic can be verified first, before the full rollout;
- 8) Traffic copying: we can copy original online requests, according to rules, and write them to MQ or other upstream systems for cross-datacenter verification and stress testing;
- 9) Request-log sampling: all failed requests are sampled and persisted to help businesses troubleshoot, and businesses can also configure their own sampling rules. We sample data from the entire request life cycle, including both the request and the response.
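The token-bucket algorithm behind the automatic flow control in item 3 can be sketched as follows. This is a minimal single-node illustration of the technique, not the production filter:

```java
/** Minimal token-bucket sketch: requests pass while tokens remain; tokens refill at a fixed rate. */
public class TokenBucket {
    private final long capacity;        // maximum burst size
    private final double refillPerNano; // tokens added per nanosecond
    private double tokens;
    private long lastRefill;

    public TokenBucket(long capacity, double tokensPerSecond) {
        this.capacity = capacity;
        this.refillPerNano = tokensPerSecond / 1_000_000_000.0;
        this.tokens = capacity;          // start full so bursts up to capacity pass
        this.lastRefill = System.nanoTime();
    }

    /** Returns true if the request may pass, false if it should be throttled. */
    public synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        tokens = Math.min(capacity, tokens + (now - lastRefill) * refillPerNano);
        lastRefill = now;
        if (tokens >= 1) {
            tokens -= 1;
            return true;
        }
        return false;
    }
}
```

A gateway filter would call `tryAcquire()` per request and short-circuit the chain with a throttling response when it returns false.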
Everything above is about traffic management. Each function is a filter; a filter failure does not affect the forwarding flow, and the metadata for all of these rules is initialized when the gateway starts.
There are no IO operations during execution. Some designs run multiple filters concurrently; since ours are all in-memory operations with low overhead, we currently do not support concurrent execution.
Rules can also be modified: when we change a rule, the gateway services are notified to refresh in real time. Internally, these metadata-update requests are processed on a dedicated thread, so their IO cannot affect the business threads.
4.3 Service call layer
Service invocation is the key to a proxy gateway, and it must be asynchronous. We implement it with Netty and take full advantage of Netty's connection pool, so that both acquiring and releasing connections are lock-free.
4.3.1) Asynchronous Push:
After the gateway initiates a service call, the worker thread continues processing other requests instead of waiting for the server to respond.
The design is as follows: we create a context for each request, and after sending the request we bind the context to the connection it went out on. When Netty receives the server's response, it performs a read on that connection;
after decoding, the context is retrieved from the connection, and through the context we obtain the access-layer session.
The Push layer then writes the response back to the client through that session. This design relies on the exclusivity of an HTTP connection, i.e., one connection carries one in-flight request, so the connection can be bound to the request context.
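The connection-to-context binding described above can be modeled without Netty. In the real gateway the context lives in a Netty channel attribute; here a plain map keyed by a connection id stands in for it, and all names are illustrative:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Sketch of the binding idea: because an HTTP/1.x back-end connection carries
 * exactly one in-flight request, the request context can be attached to the
 * connection and looked up when the response arrives on it.
 */
public class ContextBinding {

    static class RequestContext {
        final String clientSessionId; // used later to push the response to the client
        RequestContext(String clientSessionId) { this.clientSessionId = clientSessionId; }
    }

    // connectionId -> context of the single request in flight on that connection
    private final Map<String, RequestContext> inFlight = new ConcurrentHashMap<>();

    /** Bind the context before the response can possibly arrive. */
    void onRequestSent(String connectionId, RequestContext ctx) {
        inFlight.put(connectionId, ctx);
    }

    /** On decode, retrieve and unbind; the connection can then be reused. */
    RequestContext onResponseDecoded(String connectionId) {
        return inFlight.remove(connectionId);
    }
}
```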
4.3.2) Connection pool:
The principle of the connection pool is as follows:
Besides initiating remote calls asynchronously, the service-call layer also has to manage connections to back-end services.
HTTP differs from RPC: an HTTP connection is exclusive to one request at a time, so releasing one requires care; you must wait for the server's response before releasing it, and closing connections also needs careful handling.
The cases are summarized as follows:
- 1) Connection: close received;
- 2) idle timeout: close the connection;
- 3) read timeout: close the connection;
- 4) write timeout: close the connection;
- 5) FIN or RST received.
These are the scenarios in which the connection must be closed. Below we focus on Connection: close and write timeout; the others are more common, such as read timeout, idle timeout, and receiving FIN/RST.
4.3.3) Connection: close:
Our back-end services run on Tomcat, which limits the number of times a connection can be reused; the default is 100.
When the limit is reached, Tomcat adds Connection: close to the response header to make the client close the connection; sending another request on that connection would produce a 400.
Also, if a request carries Connection: close, Tomcat closes the connection right after that response instead of waiting for the 100th reuse.
Adding Connection: close to the response header turns the connection into a short one. Watch out for this when maintaining long connections to Tomcat: if you want to keep reusing the connection, you must actively strip the close header.
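The resulting pool rule can be sketched in one helper: a back-end connection whose response carried Connection: close must not be returned to the pool. This is an illustrative check, not the gateway's actual code:

```java
/** Sketch: decide whether a pooled back-end connection may be reused. */
public class ConnectionReuse {

    /**
     * headerValue is the back-end's Connection response-header value, or null
     * if the header was absent. Tomcat sends "Connection: close" once its
     * keep-alive limit (100 reuses by default) is reached; such a connection
     * must be closed rather than returned to the pool, or the next request
     * on it would fail.
     */
    static boolean reusable(String headerValue) {
        return headerValue == null || !headerValue.equalsIgnoreCase("close");
    }
}
```

Symmetrically, the gateway strips Connection: close from outgoing requests so Tomcat does not close the connection after a single use.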
4.3.4) Write timeout:
First, when should the gateway start counting a service's timeout? If it counts from the call to writeAndFlush, that also includes Netty's HTTP-encoding time and the time the request waits in the queue before being flushed, which is unfair to the back-end service.
Timing should therefore start only after the flush actually succeeds, which is the moment closest to the server. It still includes the network round trip and the kernel protocol stack's processing time, but those are unavoidable and basically constant.
So we start the timeout task in the flush-success callback.
One caveat: the flush callback may not fire quickly, for example for a large POST request with a big body. When Netty sends data, the first write is 1K by default;
if the data is not fully written, Netty increases the write size and continues, and if it still has not finished after 16 attempts, it stops, submits a flushTask to the task queue, and continues sending on the next execution.
In that case the flush callback takes a long time, such requests cannot be closed in time, and the back-end Tomcat stays blocked reading the body. Based on this analysis, we need a write timeout: for large-body requests, the write timeout closes the connection in time.
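The timing rule can be sketched with a plain scheduler: the response-timeout task starts only in the flush-success callback and is cancelled when the response arrives. In the real gateway this logic runs inside the ChannelFuture listener of writeAndFlush; the method names below are hypothetical stand-ins:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/**
 * Sketch: the timeout clock starts on flush success, not on writeAndFlush(),
 * so encode time and queueing time are not charged to the back-end service.
 */
public class TimeoutAfterFlush {

    // Daemon thread so the sketch never prevents JVM exit.
    static final ScheduledExecutorService TIMER =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "timeout-timer");
                t.setDaemon(true);
                return t;
            });

    /** Called from the flush-success callback: now the back-end's clock starts. */
    static ScheduledFuture<?> onFlushSucceeded(long timeoutMs, Runnable onTimeout) {
        return TIMER.schedule(onTimeout, timeoutMs, TimeUnit.MILLISECONDS);
    }

    /** Called when the response is decoded: the timeout task is cancelled. */
    static void onResponseReceived(ScheduledFuture<?> timeoutTask) {
        timeoutTask.cancel(false);
    }
}
```

The write timeout described above is the complementary task: it is scheduled before the flush and cancelled by the flush-success callback itself.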
5. Full link timeout mechanism
The figure above shows our mechanism for handling timeouts along the entire link:
- 1) protocol parsing timeout;
- 2) waiting-queue timeout;
- 3) connection establishment timeout;
- 4) waiting-for-connection timeout;
- 5) a check for elapsed timeout before writing;
- 6) write timeout;
- 7) response timeout.
6. Monitoring and alarms
What the business side sees of the gateway is monitoring and alarms. We implement second-level alarms and second-level monitoring: monitoring data is reported periodically to our management system, which aggregates the statistics and stores them in InfluxDB.
We monitor and alarm comprehensively on the HTTP protocol, at both the protocol layer and the application layer.
Protocol layer:
- 1) hostile requests that send only the header, or no/partial body: sampled and persisted so the scene can be reconstructed, with an alarm raised;
- 2) oversized request line, headers, or body: sampled and persisted, scene reconstructed, alarm raised.
Application layer:
- 1) latency monitoring: slow requests, timed-out requests, tp99, tp999, etc.;
- 2) OPS monitoring and alarms;
- 3) bandwidth monitoring and alarms: the request and response line, headers, and body can each be monitored separately;
- 4) response-code monitoring: especially 400 and 404;
- 5) connection monitoring: we monitor access-layer connections, connections to back-end services, and the number of bytes pending on each back-end connection;
- 6) failed-request monitoring;
- 7) traffic-jitter alarms: very necessary, since traffic jitter is either a problem or the precursor of one.
Overall architecture:
7. Performance optimization practice
7.1 Object Pool Technology
For a high-concurrency system, frequently creating objects not only incurs memory-allocation overhead but also pressures the GC. We pool and reuse frequently used objects, such as thread-pool tasks and StringBuffers, to reduce the overhead of frequent allocation.
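One common form of this reuse is a per-thread builder that is reset rather than reallocated for each request. A minimal sketch under that assumption (illustrative only; our production pools cover more object types, such as thread-pool tasks):

```java
/**
 * Sketch: a thread-local StringBuilder that is cleared and reused on each
 * acquire, so hot request paths avoid per-request allocations and GC pressure.
 */
public class BuilderPool {
    private static final ThreadLocal<StringBuilder> LOCAL =
            ThreadLocal.withInitial(() -> new StringBuilder(256));

    /** Returns this thread's reusable builder, emptied and ready for use. */
    static StringBuilder acquire() {
        StringBuilder sb = LOCAL.get();
        sb.setLength(0); // reset instead of reallocating; capacity is retained
        return sb;
    }
}
```

The same idea generalizes: any short-lived, frequently created object with a cheap reset can be pooled this way, as long as it never escapes the owning thread.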
7.2 Context switch
High-concurrency systems are usually designed asynchronously, and once asynchronous you must consider thread context switching.
Our threading model is as follows:
Our gateway as a whole involves no blocking IO operations, yet our business logic still runs asynchronously from Netty's IO codec threads,
for two reasons:
- 1) to prevent code written by developers from blocking the IO threads;
- 2) business logic may log heavily, especially in emergencies. For the push stage, however, we support running on Netty's IO thread instead, since little work is done there. Changing that stage from asynchronous to synchronous (adjustable via configuration) reduced CPU context switches by 20% and improved overall throughput. In other words, never go asynchronous merely for its own sake; Zuul 2's design is similar to ours in this respect.
7.3 GC optimization
GC optimization is unavoidable in high-concurrency systems. With object pooling and off-heap memory in use, objects rarely reach the old generation. We also configure a larger young generation, with SurvivorRatio=2 and the maximum tenuring threshold of 15, so that objects get collected in the young generation as much as possible. Monitoring nevertheless showed the old generation still growing slowly. Dump analysis revealed that every connection to a back-end service has a socket, and the socket's AbstractPlainSocketImpl overrides the finalize method of Object.
It is implemented as follows:
/**
* Cleans up if the user forgets to close it.
*/
protected void finalize() throws IOException {
close();
}
This is a safety net for cases where we forget to close the connection: when GC reclaims the object, the connection's resources are released first.
But finalize is processed by the JVM's Finalizer thread, whose priority is not high (8 by default). An object must wait until the Finalizer thread has executed its finalize method via the ReferenceQueue, and then wait for the next GC before it can be reclaimed. So the objects created per connection cannot be reclaimed promptly in the young generation and are promoted to the old generation, which is why the old generation kept growing slowly.
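The remedy is to close connections deterministically, so sockets never depend on the Finalizer path and their objects can die in the young generation. A try-with-resources sketch (illustrative; in the gateway, pooled connections are closed explicitly in the cases listed in section 4.3.2):

```java
import java.io.IOException;
import java.net.Socket;

/**
 * Sketch: closing the socket on the calling thread means finalize() has
 * nothing left to clean up, so the object avoids the Finalizer queue and
 * the slow promotion to the old generation described above.
 */
public class ExplicitClose {
    static void call(String host, int port) throws IOException {
        try (Socket s = new Socket(host, port)) {
            s.getOutputStream().write("ping".getBytes());
            s.getOutputStream().flush();
        } // closed here, deterministically, not by the Finalizer thread
    }
}
```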
7.4 Log
Under high concurrency, Netty's IO threads must, besides IO reads and writes, also execute asynchronous tasks and timer tasks on the same thread. If an IO thread cannot keep up with the tasks in its queue, newly submitted asynchronous tasks may be rejected.
When can that happen? Asynchronous IO itself does not consume much CPU; what is most likely to block an IO thread is our logging.
In Log4j, the ConsoleAppender's immediateFlush attribute defaults to true, meaning every log call synchronously flushes to disk, which is very slow compared with memory operations.
Meanwhile, the AsyncAppender blocks the calling thread when its queue is full: Log4j's default buffer size is 128, and blocking mode is on.
That is, once the buffer holds 128 events, the thread writing the log blocks. Under heavy concurrent logging, especially with large stack traces, Log4j's Dispatcher thread slows down because it must flush,
so the buffer cannot drain fast enough, log events pile up, and the Netty IO thread blocks. Logging must therefore be kept lean.
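The stall mechanism can be modeled with a bounded BlockingQueue: `put()` blocks the caller once the queue is full, while `offer()` drops instead. This is an illustrative model of the bufferSize/blocking semantics described above, not Log4j's code:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/**
 * Sketch of a bounded async-log buffer. In blocking mode the producing
 * (IO) thread stalls when the consumer falls behind; drop mode trades
 * lost events for a never-blocking IO thread.
 */
public class AsyncLogBuffer {
    private final BlockingQueue<String> buffer;

    AsyncLogBuffer(int size) {
        this.buffer = new ArrayBlockingQueue<>(size);
    }

    /** Blocking mode: the calling thread waits while the buffer is full. */
    void logBlocking(String event) throws InterruptedException {
        buffer.put(event);
    }

    /** Drop-on-full mode: returns false instead of stalling the caller. */
    boolean logOrDrop(String event) {
        return buffer.offer(event);
    }
}
```

With a capacity of 128 and a slow flushing consumer, `logBlocking` is exactly the stall that freezes a Netty IO thread.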
8. Future planning
Today we run entirely on HTTP/1. Compared with HTTP/1, HTTP/2 multiplexes at the connection level: multiple HTTP requests can share one connection.
An HTTP/2 connection can then be used like an RPC connection: establish just a few, which eliminates the connection-setup and slow-start overhead that HTTP/1 pays whenever a connection cannot be reused.
We are upgrading to HTTP/2 on top of Netty. Besides technology upgrades, we keep optimizing monitoring and alarms, working to deliver accurate, noise-free alarms to the business side.
Degradation is another focus: as the unified access gateway, we keep building all-round degradation measures with the business side, and ensuring that any site-wide failure can be degraded through the gateway immediately is also a priority for us.
9. Final words
The gateway is by now standard equipment at Internet companies. This article has summarized some of our experience and lessons from practice, in the hope of offering references and ideas for solving similar problems. We are still improving, and more projects are under way; you are welcome to get in touch and exchange ideas.