Simultaneous high-frequency access by large numbers of users is a hard problem for any platform, and one the industry never tires of researching. Fortunately, although business scenarios differ, the underlying design and optimization ideas stay largely the same. This article combines real business requirements with the core techniques of high-concurrency system design to analyze architecture tuning in depth.
This article is compiled from the keynote "The Evolution of Billion-Level Traffic System Architecture" delivered by Roger Lin, senior engineer at Authing Identity Cloud, at the Open Talk Technology Salon in Beijing. The live video and slides are available through the original post.
I believe everyone agrees that the rapid development of the Internet has changed many of our habits. Online shopping, bank transfers, and similar services no longer require an offline visit, which makes life much easier. Behind this convenience, of course, those of us working in the Internet industry face ever bigger challenges and have to put more effort into evolving system architecture.
Understanding high-concurrency systems
High-concurrency systems are characterized by high concurrency, high performance, high availability, distribution, clustering, and security.
Let's first look at high concurrency, high performance, and high availability, the "three highs" we often talk about. When traffic is very large, all three must be guaranteed. High concurrency means supporting many concurrent users; high performance means keeping performance excellent under that concurrency; high availability means the system as a whole keeps serving even when an individual node fails. It follows that the main characteristics of a three-high system are distribution and clustering, and the main remaining problem to solve is security.
The picture above shows some common high-concurrency scenarios that touch our daily lives. The e-commerce flash sale in the upper left is the most familiar: during last year's epidemic, masks were scarce and huge numbers of people clicked the same page at the same moment, driving concurrency extremely high. The upper right is ticket grabbing, which everyone knows well; people working away from home who need to travel back for Spring Festival keep a ticket-grabbing app open and refresh constantly, which also produces very heavy concurrent traffic. The lower left is the banking transaction system: every online and offline scan-to-pay ultimately passes through the banks, so their daily transaction volume is enormous. Finally there is Authing's identity service. We provide a complete identity authentication and user management system, so developers do not have to rebuild identity over and over, write less code, and work more efficiently. The figure below shows an example:
The picture shows our core component. On the surface it looks like a simple login box, the user authentication interface, but behind it sits a large backend made up of a user system, a management system, an authentication system, and other services. Although the user only entered a username and password, we have to handle secure authentication, multiple login methods, and simultaneous authentication by many users. On top of that, we must enable many kinds of customers, including private-deployment users, to achieve high availability, rapid deployment, and quick integration.
Anyone who has worked on high concurrency will be familiar with the CAP theorem. Its main point is that a distributed system cannot satisfy consistency, availability, and partition tolerance all at once; it can only satisfy two of them. That is, a distributed system can be CA, CP, or AP, but never CAP at the same time. In practice, if we satisfy availability and partition tolerance, we may have to sacrifice strong consistency and settle for eventual consistency. The theorem tells us to make trade-offs.
Starting from the monolithic application architecture
The monolithic application architecture shown above was a common pattern in the early days. With manpower short, the Web layer and the Server were usually developed together, deployed together, and then connected to the database to provide the service. The benefit is simple maintenance; the drawback is that iteration is troublesome.
Now that front end and back end are separated, we usually deploy Web and Server as two services, which makes rapid iteration easier: if one Server needs a fix, we can change and deploy the code for that service alone and get it back online quickly. The downside is that as the business grows the server accumulates more and more logic, the coupling deepens, and the service slows down. I know this from experience. Years ago a friend of mine had exactly this architecture problem; for a while he bought a bag of melon seeds every weekend and came over to my place to think it through. Why the melon seeds? The coupling was so deep that the service took five minutes to start and another five minutes to restart after every change, so we chatted and cracked melon seeds while we waited.
Beyond the complex dependencies and the bloat just described, a monolithic application also has the following problems:
- Single-point bottleneck
- Poor stability
- Poor scalability
- Weak business modeling
- Hard to extend to new business
- Lack of basic business process capabilities
- Heavy front-end coupling
- Messy, hard-to-maintain APIs
Since the pain points are so obvious, how to optimize matters a great deal. But before we discuss it, we need to consider a new question: does adding more CPU always improve performance?
In most cases it does, because more CPU means more computing capacity. But it is not absolute: if the program relies heavily on locks, it cannot exploit multiple threads and cores, and extra CPU may make little visible difference. In that situation many companies consider splitting the service instead, which raises cost questions; in other words, adding CPU is not the optimal solution, and we still need to think about optimizing the locking. Before looking at concrete optimizations, though, let's first understand pooling.
The figure above shows pooling as an abstract concept: a connection or thread is taken from the resource pool when needed and returned to it once the work is done. Four related concepts matter here: the connection pool, the thread pool, the constant pool, and the memory pool.
Connection pools are used the most, because calls between systems and requests to external services all go through connections. We used to use short-lived connections, but since every HTTP connection repeats the establish-and-close handshake, it is very time-consuming; with a connection pool, the connection created for one request can be reused, which saves a lot of overhead. Likewise, our work ultimately gets broken into tasks, and those asynchronous tasks run on a thread pool. The constant pool and memory pool follow the same idea: we allocate a large block of memory once and reuse it.
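As a minimal sketch of the connection-pool idea, assuming Go's standard database/sql package (the driver import and DSN below are placeholders): sql.DB is itself a pool whose size and lifetime we tune, rather than a single connection opened and closed per request.

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql" // hypothetical driver choice
)

func main() {
	// sql.DB is a connection pool, not a single connection.
	db, err := sql.Open("mysql", "user:pass@tcp(127.0.0.1:3306)/authing") // placeholder DSN
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	db.SetMaxOpenConns(100)                 // upper bound on concurrent connections
	db.SetMaxIdleConns(20)                  // connections kept warm for reuse
	db.SetConnMaxLifetime(30 * time.Minute) // recycle connections periodically

	// Every query borrows a connection from the pool and returns it afterwards.
	var count int
	if err := db.QueryRow("SELECT COUNT(*) FROM users").Scan(&count); err != nil {
		log.Fatal(err)
	}
	log.Printf("users: %d", count)
}
```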
With pooling understood, let's return to the concrete optimizations.
Application architecture optimization
Web Server optimization
Let's first look at Web Server optimization, which is mainly achieved through code optimization, hotspot caching, and algorithm optimization.
The first step is code optimization: fix unreasonable code. For example, a query interface that pulls back far more data than it needs will compute slowly and should be optimized first.
The second step is hotspot caching: cache hot data to cut database operations as much as possible. In Authing's authentication, for example, we cannot hit the database every time a token arrives, or QPS would be very low; caching the hot data raises QPS substantially (a small sketch of such a cache follows the third step).
The third step is algorithm optimization. Because business logic is usually complex, this is a broad topic. For example, when querying a list, should we return the whole list at once, or finish the computation in memory and hand only the result to the front end? Different business scenarios call for different optimizations to improve performance.
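For the hotspot-caching step above, here is a minimal in-process sketch (the names and TTL are illustrative assumptions): look up locally first, and fall back to the slower load function, such as a database query, only on a miss or after expiry.

```go
package cache

import (
	"sync"
	"time"
)

type entry struct {
	value    string
	expireAt time.Time
}

// TokenCache is a tiny in-process hot cache: hits avoid the database entirely.
type TokenCache struct {
	mu    sync.RWMutex
	items map[string]entry
	ttl   time.Duration
	load  func(token string) (string, error) // e.g. a database lookup
}

func New(ttl time.Duration, load func(string) (string, error)) *TokenCache {
	return &TokenCache{items: make(map[string]entry), ttl: ttl, load: load}
}

func (c *TokenCache) Get(token string) (string, error) {
	c.mu.RLock()
	e, ok := c.items[token]
	c.mu.RUnlock()
	if ok && time.Now().Before(e.expireAt) {
		return e.value, nil // cache hit: no database round trip
	}
	v, err := c.load(token) // miss: load once and remember the result
	if err != nil {
		return "", err
	}
	c.mu.Lock()
	c.items[token] = entry{value: v, expireAt: time.Now().Add(c.ttl)}
	c.mu.Unlock()
	return v, nil
}
```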
Separate deployment
After the monolith has been optimized, if all of these services still run on the same server they compete for CPU and memory. At this point we can move the Web layer and the cache-loaded application onto separate servers, and put all static resources on a CDN so pages load from the nearest node. These measures let Authing meet its target of responding within 50 milliseconds. Separate deployment is also well suited to needs between systems: whatever your business scenario, if you need to improve response time, it is worth considering.
Vertical split
After that, we need to split the business. There are three ways to do it:
- Split by business scenario, for example into users, orders, and accounts.
- Split by synchronous versus asynchronous business. The advantage is that asynchronous traffic can be controlled on its own and will not affect the operation of our core services.
- Split by model. Since the main purpose of splitting is to break the heavy coupling between systems and minimize cross-system changes later on, the model must be designed as well as possible early on.
Once the split is done, we need to judge how much load the optimized system can carry and how much we have actually gained, which means running a stress test. Stress testing brings in the familiar bucket theory: think of the system as a wooden bucket; how much water it holds is determined by its shortest stave. So during a stress test we do not need to watch the parts that consume few resources; we focus on the parts that hit the system's bottleneck and use them to uncover the system's latent problems.
Horizontal split
Even after the vertical split, the system may fall short as request volume keeps growing. At that point we split the system horizontally and scale out: if one node is not enough, add a second or a third, and let a load balancer distribute requests evenly across these horizontal nodes. We usually use Nginx as the load balancer.
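Conceptually, the load balancer just spreads requests across identical nodes. Below is a minimal round-robin sketch in Go (the backend addresses are placeholders); in production this job is done by Nginx or LVS rather than hand-rolled code.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"sync/atomic"
)

func main() {
	// Hypothetical horizontal nodes behind the load balancer.
	backends := []string{"http://10.0.0.1:8080", "http://10.0.0.2:8080", "http://10.0.0.3:8080"}

	var proxies []*httputil.ReverseProxy
	for _, b := range backends {
		u, err := url.Parse(b)
		if err != nil {
			log.Fatal(err)
		}
		proxies = append(proxies, httputil.NewSingleHostReverseProxy(u))
	}

	var next uint64
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Round-robin: each request goes to the next node in turn.
		i := atomic.AddUint64(&next, 1) % uint64(len(proxies))
		proxies[i].ServeHTTP(w, r)
	})
	log.Fatal(http.ListenAndServe(":80", nil))
}
```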
The picture above shows our load-balancing layer. Beneath the load balancer sit many gateway systems, and in the middle is an Nginx cluster. We all know Nginx can withstand a very large amount of concurrency, so this cluster is unnecessary when traffic is small; if you need it, your concurrency is already huge. When concurrency grows so large that even the Nginx cluster cannot cope, it is better not to stack yet another layer of Nginx in front of it, because the gain is not obvious. I also do not personally recommend F5: it is a hardware appliance and relatively expensive. My recommendation is LVS, the virtual server facility in Linux; well configured, its performance is entirely comparable to F5.
After talking about load balancing, we return to horizontal splitting.
We cannot ignore caching when splitting horizontally. In stand-alone mode all caches are local; once we go distributed, if one server obtains a token and stores it locally, a request that lands on another server fails because that server does not have the token. So we introduce a distributed cache, for example putting the cache in Redis and having every application fetch it from there.
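A minimal sketch of that shared token cache, assuming the go-redis client (the address, key prefix, and TTL are illustrative):

```go
package session

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9" // assumed client library
)

var rdb = redis.NewClient(&redis.Options{Addr: "redis:6379"}) // placeholder address

// SaveToken stores the token centrally so any node can validate it.
func SaveToken(ctx context.Context, token, userID string) error {
	return rdb.Set(ctx, "token:"+token, userID, 30*time.Minute).Err()
}

// LookupToken behaves the same on every horizontally scaled node.
func LookupToken(ctx context.Context, token string) (string, bool, error) {
	userID, err := rdb.Get(ctx, "token:"+token).Result()
	if err == redis.Nil {
		return "", false, nil // not in the shared cache: token unknown or expired
	}
	if err != nil {
		return "", false, err
	}
	return userID, true, nil
}
```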
After splitting horizontally we also have to think about distributed IDs, because the way a single node generates IDs may not suit a distributed service. Take timestamp-based IDs: on a single node the ID generated for each request is unique, but with several servers receiving requests, duplicates can appear and uniqueness is lost. So we need a dedicated ID service to generate IDs.
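One common approach for such an ID service is a snowflake-style generator. The sketch below is a simplified version; the epoch and bit widths are illustrative assumptions, not Authing's actual scheme.

```go
package idgen

import (
	"sync"
	"time"
)

// Illustrative snowflake-style layout: 41 bits of milliseconds since a
// custom epoch, 10 bits of node ID, 12 bits of per-millisecond sequence.
const epoch = int64(1600000000000) // assumed custom epoch in ms

type Generator struct {
	mu       sync.Mutex
	nodeID   int64 // must be unique per deployed instance (0..1023)
	lastTime int64
	seq      int64
}

func New(nodeID int64) *Generator { return &Generator{nodeID: nodeID & 0x3FF} }

func (g *Generator) Next() int64 {
	g.mu.Lock()
	defer g.mu.Unlock()

	now := time.Now().UnixMilli()
	if now == g.lastTime {
		g.seq = (g.seq + 1) & 0xFFF // 12-bit sequence within one millisecond
		if g.seq == 0 {             // sequence exhausted: wait for the next ms
			for now <= g.lastTime {
				now = time.Now().UnixMilli()
			}
		}
	} else {
		g.seq = 0
	}
	g.lastTime = now
	return (now-epoch)<<22 | g.nodeID<<12 | g.seq
}
```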
Configuration Center
After splitting services horizontally and vertically, keeping configuration synchronized across all of them becomes a problem. Ideally, once we modify a configuration, every service should become aware of the change at the same time and then apply it on its own. That is why we introduce a configuration center.
The figure above shows the general flow of a configuration center. Two solutions are popular at the moment: Nacos, open-sourced by Alibaba, and Spring Cloud Config from the Spring Cloud ecosystem. Readers who are interested can look into them.
Let's walk through the diagram in detail. The Server is the console where our configuration lives. Developers usually modify configuration through the console's API, and the modified configuration is persisted to MySQL or another database. The Client side covers all of our applications: each has a listener that watches the Server for configuration changes and pulls the new configuration when one occurs, so every application updates promptly after a change is made. To guard against an application missing an update because of network problems, we also keep a local snapshot; when the network fails, the application can degrade to reading the local file.
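Here is a conceptual sketch of that client behavior, not the Nacos or Spring Cloud Config API (the endpoint, snapshot path, and polling interval are assumptions): pull the latest configuration, refresh a local snapshot on success, and degrade to the snapshot when the network fails.

```go
package config

import (
	"io"
	"net/http"
	"os"
	"time"
)

const snapshotPath = "/tmp/app-config.snapshot" // assumed local snapshot location

// Fetch asks the (hypothetical) config server for the latest configuration;
// on success it refreshes the local snapshot, on failure it degrades to the snapshot.
func Fetch(serverURL string) ([]byte, error) {
	client := &http.Client{Timeout: 3 * time.Second}
	resp, err := client.Get(serverURL + "/config/my-app") // placeholder endpoint
	if err == nil {
		defer resp.Body.Close()
		if resp.StatusCode == http.StatusOK {
			if data, readErr := io.ReadAll(resp.Body); readErr == nil {
				_ = os.WriteFile(snapshotPath, data, 0o644) // refresh snapshot for next time
				return data, nil
			}
		}
	}
	// Network or server problem: fall back to the last known good snapshot.
	return os.ReadFile(snapshotPath)
}

// Watch polls for changes so every instance picks up updates promptly.
// (Real config centers push or long-poll; polling keeps the sketch simple.)
func Watch(serverURL string, interval time.Duration, apply func([]byte)) {
	for {
		if data, err := Fetch(serverURL); err == nil {
			apply(data)
		}
		time.Sleep(interval)
	}
}
```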
Database split
We have split the system, set up load balancing, and added a configuration center. As long as request volume is not too large, that effectively completes the system-side optimization. But as the business keeps expanding, the bottleneck we run into is no longer the system itself but the database. How do we solve that?
The first approach is master-slave replication with read-write splitting. Read-write splitting addresses the problem of all reads and writes hitting the same database: with master and slave libraries, the master handles writes while reads go to the slaves, spreading the write pressure and improving database performance. Later, as volume keeps growing and a single master-slave setup can no longer keep up, we move to the second approach.
The second approach is a vertical split, similar in concept to the business split: we divide the database by service into Users, Orders, Apps, and so on, so each service has its own database instead of everything hitting one, which improves concurrency. As the business grows further, even a single database per service reaches its limit, and we need the third approach.
The third approach is a horizontal split, for example splitting the tables in the Users database further into Users1, Users2, Users3, and so on. To complete this split we must decide how queries will find the right table, and that depends on the specific business. To query users, for instance, we can shard by user ID: hash the ID into a fixed range, and on every access compute the hash to jump straight to the right shard. This sharding idea is also used in the design of Authing's multi-tenancy, as shown in the figure below.
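A tiny sketch of that hash-based routing (the shard count and table names are illustrative):

```go
package shard

import (
	"fmt"
	"hash/fnv"
)

const shardCount = 4 // illustrative: Users1..Users4

// TableFor maps a user ID onto one of the split tables, so the same user
// always lands on the same shard.
func TableFor(userID string) string {
	h := fnv.New32a()
	h.Write([]byte(userID))
	return fmt.Sprintf("Users%d", h.Sum32()%shardCount+1)
}

// UserQuery builds a lookup against the right shard (placeholder SQL).
func UserQuery(userID string) string {
	return fmt.Sprintf("SELECT * FROM %s WHERE user_id = ?", TableFor(userID))
}
```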
Service rate limiting
When business volume reaches a certain scale, rate limiting becomes unavoidable; it is, in effect, a degradation strategy in disguise. Ideally the system could absorb ever more users, but resources are always limited, so a limit must be set.
Request rejection
There are two main rate-limiting algorithms, the leaky bucket and the token bucket; the picture above shows them quite vividly. In the leaky bucket algorithm, think of the traffic as water poured into a bucket and limit the rate at which it flows out: no matter how fast water pours in, it leaves at a constant rate. The token bucket sets up a task that issues tokens, and every request must obtain a token first; if requests arrive faster than tokens are issued, the rate-limiting policy kicks in once the tokens run out. Besides these two, there is also the familiar counter algorithm; interested readers can look it up, and we will not go into detail here.
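A minimal token-bucket sketch, assuming the golang.org/x/time/rate package (the rate and burst values are illustrative): requests that cannot get a token are rejected immediately.

```go
package main

import (
	"log"
	"net/http"

	"golang.org/x/time/rate" // assumed dependency
)

// Allow roughly 100 requests per second, with bursts of up to 200 tokens.
var limiter = rate.NewLimiter(rate.Limit(100), 200)

func limited(next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if !limiter.Allow() { // no token available: reject instead of queueing
			http.Error(w, "too many requests", http.StatusTooManyRequests)
			return
		}
		next(w, r)
	}
}

func main() {
	http.HandleFunc("/api", limited(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	}))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```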
In essence, these algorithms reject the excess requests when traffic goes over the limit. Besides this rejection strategy, we also have a queuing strategy.
Message queue
When the business cannot simply limit or reject requests, we turn to message queues.
As the figure shows, the core idea of a message queue is that producers put messages into the queue and consumers take them out and process them. We usually use MQ, Redis, or Kafka as the queue. The queue itself handles publish/subscribe and client push/pull, while on the producer side it solves the following problems (a minimal sketch follows the list):
- Buffering: absorb excess traffic at the entrance with a buffer
- Peak clipping: similar in effect to buffering
- System decoupling: if two services do not depend on each other's call results, they can be decoupled through the message queue
- Asynchronous communication
- Extensibility: with a message queue in place, many listeners can be attached for monitoring
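The sketch below illustrates only the buffering and peak-clipping idea, using an in-process Go channel to stand in for the queue; in production the queue would be Kafka, Redis, or another MQ.

```go
package main

import (
	"fmt"
	"time"
)

type Task struct{ ID int }

func main() {
	// The buffered channel plays the role of the queue: bursts from the
	// producer are absorbed here instead of hitting the consumer directly.
	queue := make(chan Task, 1000)

	// Consumer drains the queue at its own steady pace (peak clipping).
	go func() {
		for t := range queue {
			fmt.Println("processing task", t.ID)
			time.Sleep(10 * time.Millisecond) // simulated downstream work
		}
	}()

	// Producer: a sudden burst of requests is buffered, not rejected.
	for i := 0; i < 100; i++ {
		queue <- Task{ID: i}
	}
	close(queue)
	time.Sleep(2 * time.Second) // give the consumer time to finish (demo only)
}
```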
Service circuit breaking
When the business provides services normally, we may encounter the following situation:
Services A and B call services C and D respectively, and both C and D call service E. Once service E goes down, the backlog of requests drags down every service upstream of it. We usually call this a service avalanche.
To avoid this, we introduce a circuit breaker, which acts like an electrical fuse: once the failures of service E reach a certain threshold, the next requests are no longer forwarded to service E but fail fast with an error response, preventing a pile-up of requests from continuing to call service E.
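A minimal circuit-breaker sketch follows; the threshold, cooldown, and the lack of a half-open probing state are simplifications (real libraries such as Hystrix or Sentinel do considerably more).

```go
package breaker

import (
	"errors"
	"sync"
	"time"
)

var ErrOpen = errors.New("circuit open: failing fast")

// Breaker trips after `threshold` consecutive failures and stays open for
// `cooldown`, during which calls fail immediately instead of piling up on
// the struggling downstream service.
type Breaker struct {
	mu        sync.Mutex
	failures  int
	threshold int
	cooldown  time.Duration
	openUntil time.Time
}

func New(threshold int, cooldown time.Duration) *Breaker {
	return &Breaker{threshold: threshold, cooldown: cooldown}
}

func (b *Breaker) Call(fn func() error) error {
	b.mu.Lock()
	if time.Now().Before(b.openUntil) {
		b.mu.Unlock()
		return ErrOpen // breaker open: reject without calling service E
	}
	b.mu.Unlock()

	err := fn()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.failures >= b.threshold {
			b.openUntil = time.Now().Add(b.cooldown) // trip the breaker
			b.failures = 0
		}
		return err
	}
	b.failures = 0 // success resets the failure count
	return nil
}
```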
Put simply, this is a form of service degradation. Common degradation strategies include:
- Page degradation: disable buttons in the UI or switch to static pages
- Delayed service: for example, defer scheduled tasks, or push messages into MQ and process them later
- Write degradation: directly refuse write-related requests
- Read degradation: directly refuse read-related requests
- Cache degradation: serve frequently read interfaces from the cache
- Service shutdown: turn off unimportant features and free resources for core services
Stress test
The picture above shows what we need to pay attention to in stress testing. First, a stress test is a closed loop: we may go around it many times, repeatedly finding a problem, fixing it, verifying the fix, and finding the next problem, until we finally hit the stress-test target.
Before the test starts we set a target and prepare the environment accordingly. The test model can be online or offline. Offline testing is usually chosen for cost reasons and run on a single machine or a small cluster, which can make the results less accurate, so most people test online or in the production machine room, where the data is more reliable. During the test we discover new problems, solve them, and verify the results until the target is reached.
During stress testing we watch the following metrics. The first is QPS, queries per second. It differs from TPS in that TPS has the notion of a transaction, and only a complete transaction counts as one request, whereas QPS counts every query that returns a result. The second is RT (response time), which deserves close attention; the more concurrent the system, the more RT matters. We also watch how much concurrency and throughput the system can carry. The success rate tells us whether, as pressure keeps rising, the business still executes as planned and returns the expected results. GC, garbage collection, is another big issue: if the code is poorly written, GC grows more and more frequent under pressure and can eventually bring the system to a halt.
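As a rough rule of thumb for relating these metrics (a form of Little's law; the numbers are purely illustrative):

$$
\text{QPS} \approx \frac{\text{average concurrency}}{\text{average RT}},
\qquad \text{e.g. } \frac{200 \text{ in-flight requests}}{0.05\,\text{s}} = 4000\ \text{QPS}.
$$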
After that come the hardware metrics: CPU, memory, network, and I/O usage; any one of them getting stuck can become the system bottleneck. Last is the database, which I won't go into here.
Logging
How do we know what problems occurred during the stress test? We rely on logs. They make the system observable and let us find the root cause of a problem.
How do we produce useful logs? Mainly through instrumentation points: for example, record the time a request enters each system and each layer and the time the response comes back, and the difference between the two tells us where the time was spent. Only with clear instrumentation can problems be located accurately.
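A minimal instrumentation point written as Go HTTP middleware (the log format is an assumption; in practice such lines would be shipped by the pipeline described below):

```go
package middleware

import (
	"log"
	"net/http"
	"time"
)

// Timing is a simple instrumentation point: it records when a request
// enters this layer and logs how long the layer took to respond.
func Timing(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next.ServeHTTP(w, r)
		log.Printf("method=%s path=%s cost=%s", r.Method, r.URL.Path, time.Since(start))
	})
}
```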
The picture above shows a fairly typical log-processing pipeline: logs produced by each service are collected by Filebeat, shipped to Kafka, then to Logstash, and finally into Elasticsearch, with Kibana as the visual interface that makes log analysis convenient.
The picture above is Authing's logging and monitoring system. In the middle is the K8s cluster, on the left the business message queue, and on the right our monitoring system. We use Grafana to alert on business metrics, for example configuring an alert when the success rate falls below a set threshold. The logging side uses Logstash to pull log files into ES, with Kibana for viewing.
Finally, no highly available system should forget one core idea: multi-site active-active deployment. We need to prepare multiple machine rooms in multiple locations, with off-site backup and disaster recovery. The picture above is my summary of all the application-architecture optimizations discussed here. I hope it serves as a useful reference. Thank you.
Recommended reading
Out-of-the-box microservice framework Go-zero (advanced)