Summary: This article teaches you how to design a spike (flash-sale) system architecture: from the e-commerce system architecture to the spike system, from high-concurrency "black technology" and winning tricks to server hardware optimization, mastering spike system architecture in all dimensions!

This article is shared from the HUAWEI CLOUD COMMUNITY article "Decrypting the strongest spike system architecture on the whole network: not all spikes are real spikes!", by Glacier.

E-commerce system architecture

In the field of e-commerce, there is a typical spike business scenario. What is a spike scenario? Simply put, the number of would-be buyers of a product far exceeds its inventory, so the product sells out in a very short time. Business scenarios such as the annual 618 and Double 11 promotions, or Xiaomi's new product sales, are typical spike scenarios.

We can simplify the architecture of the e-commerce system as shown in the figure below.
[Figure: simplified e-commerce system architecture]

As shown in the figure, we can divide the core of an e-commerce system into three layers: the load balancing layer, the application layer, and the persistence layer. Next, we estimate the concurrency each layer can handle.

• If the load balancing layer uses high-performance Nginx, its maximum concurrency can be estimated at 100,000+, i.e., on the order of a hundred thousand.
• Assuming the application layer uses Tomcat, Tomcat's maximum concurrency can be estimated at around 800, i.e., on the order of hundreds.
• Assuming the persistence layer uses Redis for caching and MySQL for the database, MySQL's maximum concurrency can be estimated at about 1,000 (on the order of thousands), and Redis's at about 50,000 (on the order of tens of thousands).

Therefore, the load balancing layer, application layer, and persistence layer each handle very different levels of concurrency. So what solutions can we usually adopt to improve the overall concurrency of the system?

(1) System scaling

System scaling includes vertical scaling (upgrading machine configurations) and horizontal scaling (adding machines); it is effective in most scenarios.

(2) Cache

Local or centralized caching reduces network IO and serves reads from memory; it is effective in most scenarios.

(3) Read and write separation

Read-write separation divides and conquers the load and increases the machines' parallel processing capability.

Characteristics of a spike system

For a spike system, we can describe its characteristics from two perspectives: business and technology.

Business characteristics of a spike system

Here, we can use the 12306 website as an example. During the Spring Festival travel rush each year, its traffic is enormous, while at other times traffic is relatively flat. In other words, every Spring Festival the traffic on the 12306 website surges almost instantaneously.

For another example, the Xiaomi spike system sells products at 10 am: traffic before 10 am is relatively flat, and at 10 o'clock sharp there is an instantaneous surge in concurrency.

Therefore, the traffic and concurrency of the spike system can be represented by the following figure.
[Figure: spike system traffic curve with an instantaneous peak]

It can be seen from the figure that the concurrency of the spike system has the characteristic of instantaneous peaks, which is also called the traffic spike phenomenon.

We can summarize the characteristics of the spike system as follows.
[Figure: business characteristics of a spike system]

(1) Limited time, limited quantity, limited price

The activity runs within a specified time window; the quantity of goods in the spike activity is limited; and the goods are sold at a price far below the original price.

For example, a spike activity may be limited to 10:00-10:30 am on a certain day, with only 100,000 items available, at a very low price, such as a "1 yuan purchase".

The time, quantity, and price limits can apply individually or in combination.

(2) Activity preheating

The activity needs to be configured in advance; before it starts, users can view its details; and before the spike begins, the activity is heavily promoted.

(3) Short duration

The number of people buying is huge; merchandise will be sold out quickly.

In terms of system traffic, this appears as a sudden spike: the number of concurrent visits is extremely high, and in most spike scenarios the goods sell out in a very short time.

Technical characteristics of a spike system

We can summarize the technical characteristics of the spike system as follows.
[Figure: technical characteristics of a spike system]

(1) The instantaneous concurrency is very high

A large number of users will rush to buy goods at the same time; the instantaneous concurrency peak is very high.

(2) Far more reads than writes

The product page receives a huge number of visits, but the number of actual purchases is very small; reads of the inventory far outnumber purchases of the product.

Rate limiting measures are often added to the product page. For example, early spike systems added a verification code to the product page to smooth out front-end access, and more recent spike systems prompt the user to log in when opening the product detail page. These are all measures to limit access to the system.

(3) Simple process

The business process of the spike system is generally relatively simple; in general, the business process of the spike system can be summarized as: placing an order and reducing inventory.

The three stages of a spike

Usually, a spike goes through three stages from beginning to end:

Preparation stage: also called the system warm-up stage, in which the spike system's business data is preheated in advance. During this stage, users keep refreshing the spike page to check whether the activity has started; to some extent, this continuous refreshing can itself be used to preload some data into Redis.
Spike stage: the activity itself, which generates instantaneous high-concurrency traffic and puts enormous pressure on system resources, so system protection must be done well in this phase.
Settlement stage: completes the data processing after the spike, such as handling data consistency, dealing with exceptions, and restocking returned goods.

For a system with such a short burst of heavy traffic, scaling out is not suitable: even if we scaled the system, the extra capacity would only be used for a short time, while most of the time the system can be accessed normally without it. So what solutions can we adopt to improve the system's spike performance?

Spike system solution

In view of the characteristics of the spike system, we can take the following measures to improve the performance of the system.
[Figure: measures to improve spike system performance]

(1) Asynchronous decoupling

Break the overall process apart, and control the core process through queues.

(2) Rate limiting and anti-bot measures

Control the overall website traffic and raise the bar for requests to avoid exhausting system resources.

(3) Resource control

Control resource scheduling across the overall process, playing to each component's strengths.

Because the application layer can carry far less concurrency than the cache, in a high-concurrency system we can use OpenResty to access the cache directly from the load balancing layer, avoiding the performance loss of calling through the application layer. You can visit https://openresty.org/cn/ to learn more about OpenResty. At the same time, because the number of goods in a spike is relatively small, we can also use dynamic rendering and CDN technology to accelerate the website.

If the concurrency is too high at the start of the spike activity, we can place users' requests in a queue for processing and pop up a queuing page for the user.
[Figure: queuing page]

Note: The picture is from Meizu

Spike system sequence diagrams

Many spike systems and solutions described on the Internet are not real spike systems: they simply process requests synchronously, and once concurrency truly rises, the performance of these so-called spike systems drops sharply. Let's first look at the sequence diagram of a spike system that places orders synchronously.

Synchronous ordering process

[Figure: sequence diagram of the synchronous order process]

1. The user initiates a spike request

In the synchronous order process, the user first initiates a spike request, and the mall service must execute the following steps in order to process it.

(1) Check whether the verification code is correct

The mall service determines whether the verification code submitted when the user initiates a spike request is correct.

(2) Determine whether the activity has ended

Verify whether the current spike activity has ended.

(3) Verify whether the access request is in the blacklist

In e-commerce there is a lot of malicious competition: competitors may maliciously hit the spike system through improper means, consuming large amounts of system bandwidth and other resources. A risk control system is then needed to implement a blacklist mechanism. For simplicity, a blacklist can also be implemented with an interceptor that counts access frequency, as sketched below.
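A minimal sketch of such a frequency-counting blacklist check, assuming an in-memory counter; the class name and threshold are illustrative, and a real implementation would reset the counters per time window and persist the blacklist:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class BlacklistCounter {
    // Illustrative threshold: more requests than this per window marks the caller as malicious
    private static final int MAX_REQUESTS_PER_WINDOW = 100;
    private final Map<String, AtomicInteger> counters = new ConcurrentHashMap<>();

    // Record one access; returns true if the caller should now be blacklisted
    public boolean recordAndCheck(String clientId) {
        AtomicInteger count = counters.computeIfAbsent(clientId, k -> new AtomicInteger());
        return count.incrementAndGet() > MAX_REQUESTS_PER_WINDOW;
    }
}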

(4) Verify whether the real inventory is sufficient

The system needs to verify whether the product's actual inventory is sufficient to support the inventory allocated to this spike activity.

(5) Deduction of inventory in the cache

In spike business, information such as product inventory is usually stored in the cache. Here we must also verify that the inventory allocated to the spike activity is sufficient, and then deduct from it.

(6) Calculate the price of the spike

In the spike activity, the spike price of the product differs from its normal price, so the spike price needs to be calculated.

Note: if the business in the spike scenario is more complex, more business operations will be involved; here I only list some common ones.

2. Submit the order

(1) Order entry

Save the order information submitted by the user in the database.

(2) Deduction of real inventory

After the order is persisted, the quantity sold in this order must be deducted from the product's actual inventory.

If we develop a spike system with the above process, then when users initiate spike requests, every business step executes serially, so overall system performance will not be high. When concurrency gets too high, we pop up the following queuing page to ask the user to wait.
[Figure: queuing page]

Note: The picture is from Meizu

The queuing time here may be 15 seconds, 30 seconds, or even longer. This creates a problem: the connection between the client and the server is not released between the moment the user initiates the spike request and the moment the server returns the result, which ties up a large amount of server resources.

Many articles on the Internet that introduce how to implement a spike system use this method. So can this method serve as a spike system? The answer is yes, but the concurrency it supports is not high. Some readers may say: our company built its spike system this way, and it has run fine since launch! What I want to say is: synchronous ordering can indeed serve as a spike system, but its performance will not be high. The reason your company's synchronous spike system has not had major problems is simply that its concurrency has never reached a level that would crush it; in other words, your spike system's concurrency is actually not high.

Therefore, many so-called spike systems do have spike business, but they cannot be called real spike systems, because their synchronous order process limits the system's concurrent traffic. The reason they have had no major problems since launch is that their concurrency is not high enough to crush the whole system.

If the spike systems of 12306, Taobao, Tmall, JD, Xiaomi and other large platforms were built this way, those systems would be crushed to death sooner or later, and their system engineers would surely be fired! Therefore, in a spike system, this solution of synchronously processing the order flow is not advisable.

The above is the entire process operation of synchronous ordering. If the ordering process is more complicated, more business operations will be involved.

Asynchronous order process

Since a spike system with a synchronous order process cannot be called a real spike system, we need to adopt an asynchronous order process, which does not cap the system's high concurrent traffic.
[Figure: sequence diagram of the asynchronous order process]

1. The user initiates a spike request

After the user initiates a spike request, the mall service will go through the following business processes.

(1) Check whether the verification code is correct

When the user initiates a spike request, the verification code will be sent along with it, and the system will check whether the verification code is valid and correct.

(2) Check whether to rate-limit

The system judges whether the user's request should be rate-limited. Here, we can decide based on the length of the message queue: since we place users' requests in the message queue, where they accumulate, we can decide whether to rate-limit based on the number of requests currently pending in the queue.

For example, if the spike activity sells 1,000 items and there are already 1,000 requests in the message queue, then for any further spike requests we need not process them at all and can directly return a "sold out" message to the user.
[Figure: rate limiting by message queue length]

Therefore, with rate limiting in place, we can respond to user requests faster and release connection resources sooner. A minimal sketch of this follows.
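A minimal sketch of queue-length-based rate limiting, assuming the pending requests sit in a bounded in-memory queue whose capacity equals the spike stock; in a real system the queue would be the message queue itself, and these names are illustrative:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class SeckillRateLimiter {
    private final BlockingQueue<String> pendingRequests;

    public SeckillRateLimiter(int stockCount) {
        // Capacity equals the number of items for sale, e.g. 1,000
        this.pendingRequests = new ArrayBlockingQueue<>(stockCount);
    }

    // Returns false ("sold out") once the queue already holds as many requests as there is stock
    public boolean tryEnqueue(String userId) {
        return pendingRequests.offer(userId);
    }
}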

(3) Send to the MQ

After the user's spike request passes the preceding checks, we can send the request parameters and other information to the MQ for asynchronous processing, and at the same time respond to the user with a result. In the mall service, a dedicated asynchronous task module consumes the requests from the message queue and handles the subsequent steps.

When a user initiates a spike request, the asynchronous order process performs fewer inline business operations than the synchronous one: it hands subsequent operations to the asynchronous processing module via the MQ, quickly returns a response to the user, and releases the request connection.
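A sketch of this asynchronous hand-off, with a BlockingQueue standing in for the real MQ (a broker such as RocketMQ or Kafka in production); class and method names are illustrative:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class SeckillDispatcher {
    private final BlockingQueue<String> mq = new LinkedBlockingQueue<>();

    public SeckillDispatcher() {
        // The dedicated asynchronous task module: consumes queued requests off the request thread
        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    String request = mq.take(); // blocks until a request arrives
                    processAsync(request);      // blacklist check, cache stock deduction, token generation...
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.setDaemon(true);
        consumer.start();
    }

    // Producer side: enqueue the request and return immediately, releasing the connection
    public String submit(String requestParams) {
        mq.offer(requestParams);
        return "request accepted; poll for the spike result";
    }

    private void processAsync(String request) {
        // the subsequent business steps of the order flow run here
    }
}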

2. Asynchronous processing

We can asynchronously process the following operations of the order process.

(1) Determine whether the event has ended

(2) Determine whether this request is on the system blacklist. To guard against malicious competition in the e-commerce field, a blacklist mechanism can be added to the system, placing malicious requests on the blacklist; it can be implemented with an interceptor that counts access frequency.

(3) Deduct the inventory quantity of spike products in the cache.

(4) Generate a spike Token bound to the current user and the current spike activity. Only requests that obtain a spike Token are qualified for the spike activity (see the sketch after this list).

Here we have introduced an asynchronous processing mechanism. With asynchronous processing, we can control how many resources the system uses and how many threads are allocated to process the corresponding tasks.
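A minimal sketch of step (4), generating a spike Token bound to the user and the activity, assuming Spring Data Redis; the key layout seckill:token:{activityId}:{userId} and the 30-minute expiry are illustrative assumptions, not from the original article:

import java.util.UUID;
import java.util.concurrent.TimeUnit;
import org.springframework.data.redis.core.StringRedisTemplate;

public class SeckillTokenService {
    private final StringRedisTemplate redisTemplate;

    public SeckillTokenService(StringRedisTemplate redisTemplate) {
        this.redisTemplate = redisTemplate;
    }

    // Bind a one-time token to this user and this activity; only token holders may proceed to settlement
    public String generateToken(String activityId, String userId) {
        String token = UUID.randomUUID().toString();
        String key = "seckill:token:" + activityId + ":" + userId;
        redisTemplate.opsForValue().set(key, token, 30, TimeUnit.MINUTES);
        return token;
    }
}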

3. Short polling to query the spike result

Here, the client can use short polling to ask whether it has obtained spike qualification. For example, the client can poll the server every 3 seconds. On the server side, we simply check whether a spike Token exists for the current user: if the server has generated one, the user is qualified; otherwise the client keeps polling until it times out or the server returns "sold out" or "not qualified".

When using short polling to query the spike result, we can still show the user a queuing page, but now the client polls the server every few seconds for the qualification status rather than holding a long-lived request connection, as the synchronous order process does.

Some readers may ask: with short polling, is it possible that a user never learns, before the timeout, whether they are qualified? The answer is: yes, it is possible! Consider the real purpose of a spike: merchants run spike activities not essentially to make money, but to boost sales volume and their own visibility, attracting more users to buy their products. Therefore, we do not need to guarantee that users can query their spike qualification 100% of the time.
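Continuing the hypothetical SeckillTokenService sketched earlier, the server side of short polling reduces to checking whether a token exists for this user and activity:

    // Returns the token if the user is qualified; null means "keep polling"
    // until the client-side timeout or a "sold out" response
    public String pollSeckillResult(String activityId, String userId) {
        String key = "seckill:token:" + activityId + ":" + userId;
        return redisTemplate.opsForValue().get(key);
    }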

4. Spike settlement

(1) Verify the spike Token

When the client submits the spike settlement, it submits the spike Token to the server, and the mall service verifies whether the current spike Token is valid.

(2) Add to spike shopping cart

After verifying that the spike Token is valid, the mall service adds the user's spike products to the spike shopping cart.

5. Submit the order

(1) Order storage

Save the order information submitted by the user in the database.

(2) Delete Token

After the spike order is successfully saved, the spike Token is deleted.

Here, everyone can ponder a question: why do we apply asynchronous processing only in the pink part of the asynchronous order flow, and not take asynchronous peak-shaving measures elsewhere?

This is because, in the design of the asynchronous order process, both in product design and interface design, we limit the user's request the moment they initiate it: the rate limiting sits at the very front of the flow. Once the rate limit is applied at request time, the system's traffic peak has already been smoothed out, so further downstream the system's concurrency and traffic are simply not very high anymore.

Therefore, when many articles and posts on the Internet claim that spike systems apply asynchronous peak shaving and rate limiting at the order-placing step, that is all nonsense! The order operation comes late in the overall spike flow; rate limiting must be done up front, and rate limiting in the later stages of the spike flow is useless.

High-concurrency "black technology" and winning tricks

Assume that we use Redis for caching in the spike system, and that Redis can handle roughly 50,000 concurrent reads and writes, while our mall's spike business needs to support about 1 million concurrent requests. If all 1 million requests hit Redis, Redis will very likely go down. So how do we solve this problem? Let's discuss it together.

In a high-concurrency spike system that caches data in Redis, the concurrent processing capacity of the Redis cache is the key, because many of the upstream operations need to access Redis. Asynchronous peak shaving is only the basic step; the key is to guarantee Redis's concurrent processing capacity.

The key idea for solving this problem is: divide and conquer, splitting the product inventory apart.

Splitting the inventory

When we store the spike product's inventory in Redis, we can "segment" the stock to increase Redis's read/write concurrency.

For example, suppose the spike product's id is 10001 and its inventory is 1,000 items, stored in Redis as (10001, 1000). We split the inventory into 5 parts of 200 items each, so what we store in Redis becomes (10001_0, 200), (10001_1, 200), (10001_2, 200), (10001_3, 200), (10001_4, 200).
[Figure: product inventory split into five segments in Redis]

After the split, each inventory segment is stored under the product id plus a numeric suffix. Hashing each of these keys produces different results, which means the keys storing the segments are very likely not in the same Redis slot; this improves Redis's request-processing performance and concurrency.

After splitting the inventory, we also need to store in Redis a mapping from the product id to the segment keys: the key of the mapping is the product id, 10001, and the value is the set of keys holding the segmented inventory, i.e., 10001_0, 10001_1, 10001_2, 10001_3, 10001_4. In Redis, we can store these values in a List.

When actually processing inventory, we can first query from Redis all the segment keys for the spike product, use an AtomicLong to count incoming requests, and take the request count modulo the number of segment keys; the result is 0, 1, 2, 3, or 4. Prepending the product id gives the key of the real inventory cache, and with that key we can fetch the corresponding inventory information directly from Redis. A minimal sketch follows.
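A minimal sketch of this segment selection, assuming the list of segment keys has already been read from the Redis List described above; the class name is illustrative:

import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

public class StockSegmentRouter {
    // Counts incoming requests, as described above
    private final AtomicLong requestCounter = new AtomicLong(0);

    // segmentKeys is the list read from Redis, e.g. [10001_0, ..., 10001_4];
    // returns the real inventory cache key, e.g. "10001_3"
    public String selectStockKey(String goodsId, List<String> segmentKeys) {
        long index = requestCounter.getAndIncrement() % segmentKeys.size();
        return goodsId + "_" + index;
    }
}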

Grafting: answering requests at the load balancing layer

In high-concurrency business scenarios, we can use Lua scripts (via OpenResty) to access the cache directly from the load balancing layer.

Consider a scenario: in a spike, the goods sell out in an instant. After that, when users keep initiating spike requests, if the system still routes each request from the load balancing layer through the application-layer services, which in turn access the cache and the database, the whole exercise is essentially pointless, because the goods are already sold out; validating layer by layer through the application layer no longer makes sense! And since the application layer's concurrency capacity is on the order of hundreds, going through it reduces the system's overall concurrency.

To solve this problem, we can extract the user id, product id, and spike activity id carried in the user's request at the load balancing layer, and use Lua scripts or similar techniques to read the inventory information in the cache directly. If the spike product's inventory is less than or equal to 0, we directly return a "sold out" message to the user, without going through the application layer's layer-by-layer validation. For this architecture, refer to the e-commerce architecture diagram at the beginning of this article (the first image in the text).

Redis helps the spike system

We can design a Hash data structure in Redis to support the deduction of commodity inventory, as shown below.

seckill:goodsStock:${goodsId} {
    totalCount:200,
    initStatus:0,
    seckillCount:0
}

There are three key attributes in the Hash data structure we designed.

  • totalCount: the total number of items participating in the spike. Before the spike activity starts, we load this value into the Redis cache in advance.
  • initStatus: designed as a boolean flag. Before the spike starts, this value is 0, meaning the spike has not begun; a scheduled task or a back-office operation can set it to 1, meaning the spike has started.
  • seckillCount: the number of items already sold in the spike. Its upper bound is totalCount; when it reaches totalCount, the spike for this product is finished.

During the warm-up phase of the spike, we can use the following code snippet to load the data of the products participating in the spike into the cache.

import java.util.HashMap;
import java.util.Map;
import org.springframework.data.redis.core.RedisTemplate;

/**
 * @author binghe
 * @description Example of building the product cache before the spike starts
 */
public class SeckillCacheBuilder {
    private static final String GOODS_CACHE = "seckill:goodsStock:";
    // Assumes a Spring-managed RedisTemplate is injected
    private RedisTemplate<String, Object> redisTemplate;

    private String getCacheKey(String id) {
        return GOODS_CACHE.concat(id);
    }

    public void prepare(String id, int totalCount) {
        String key = getCacheKey(id);
        Map<String, Integer> goods = new HashMap<>();
        goods.put("totalCount", totalCount);
        goods.put("initStatus", 0);
        goods.put("seckillCount", 0);
        // Write all three fields into the Redis Hash seckill:goodsStock:{id}
        redisTemplate.opsForHash().putAll(key, goods);
    }
}

When the spike starts, the code must first check whether the seckillCount value in the cache is less than totalCount; only if it is can we lock the inventory. But these two steps, check then deduct, are not atomic. In a distributed environment where multiple machines operate on the Redis cache at the same time, synchronization problems occur, leading to the serious consequence of "overselling".

In e-commerce there is a technical term for this: "oversold". As the name implies, overselling means selling more items than are in stock, which is a very serious problem in the e-commerce field. So how do we solve the "oversold" problem?

A Lua script perfectly solves the oversold problem

How do we solve the synchronization problem of multiple machines operating on Redis at the same time? A good solution is a Lua script: we can encapsulate the inventory-deduction operation in a Lua script executed inside Redis, making it atomic and thereby solving the synchronization problem in a high-concurrency environment.

For example, we can write the following Lua script code to perform the inventory deduction operation in Redis.

local resultFlag = "0"                   -- returned on failure
local n = tonumber(ARGV[1])              -- quantity to deduct
local key = KEYS[1]                      -- e.g. seckill:goodsStock:10001
local goodsInfo = redis.call("HMGET", key, "totalCount", "seckillCount")
local total = tonumber(goodsInfo[1])     -- total stock in this spike activity
local alloc = tonumber(goodsInfo[2])     -- quantity already sold
if not total then
    return resultFlag                    -- product not loaded into the cache
end
if total >= alloc + n then
    local ret = redis.call("HINCRBY", key, "seckillCount", n)
    return tostring(ret)                 -- success: return the updated seckillCount
end
return resultFlag                        -- not enough stock left

We can use the following Java code to call the above Lua script.

// 'script' is the RedisScript object holding the Lua script above (see the wiring sketch below)
public int secKill(String id, int number) {
    String key = getCacheKey(id);
    // KEYS[1] = the product's stock key, ARGV[1] = the quantity to deduct
    Object seckillCount = redisTemplate.execute(script, Arrays.asList(key), String.valueOf(number));
    return Integer.parseInt(seckillCount.toString());
}
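For completeness, here is one possible way to load the Lua script into the script object used above, assuming Spring Data Redis and the script saved as a classpath resource named seckill.lua (both assumptions, not from the original article):

import org.springframework.core.io.ClassPathResource;
import org.springframework.data.redis.core.script.DefaultRedisScript;
import org.springframework.scripting.support.ResourceScriptSource;

public class SeckillScriptConfig {
    // Build the RedisScript that secKill() executes
    public static DefaultRedisScript<String> seckillScript() {
        DefaultRedisScript<String> script = new DefaultRedisScript<>();
        script.setScriptSource(new ResourceScriptSource(new ClassPathResource("seckill.lua")));
        script.setResultType(String.class);
        return script;
    }
}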

In this way, the deduction performed during the spike activity is atomic, which effectively avoids data synchronization issues and solves the "oversold" problem.

To handle the spike system's high-concurrency, high-traffic business scenarios, besides the business architecture itself we also need to further optimize server performance. Next, let's look at how to optimize the server.

Optimize server performance

Operating system

Here, the operating system I am using is CentOS 8. We can run the following command to view the operating system version (reading /etc/redhat-release is one common way).

cat /etc/redhat-release

The output is:

CentOS Linux release 8.0.1905 (Core)

For high-concurrency scenarios, we mainly optimize the operating system's network performance. The operating system exposes many network protocol parameters, and optimizing server network performance mainly means tuning these system parameters to improve the access performance of our applications.

System parameters

In the CentOS operating system, we can use the following command to view all system parameters.

/sbin/sysctl -a
Some of the output results are shown below.
[Figure: partial output of sysctl -a]

There are too many parameters here, probably more than a thousand, and in a high-concurrency scenario we cannot possibly tune all of them. We care mostly about network-related parameters. To find those, we first need the top-level categories of system parameters; the following command lists them.

/sbin/sysctl -a|awk -F "." '{print $1}'|sort -k1|uniq

The result information output by running the command is as follows.

abi
crypto
debug
dev
fs
kernel
net
sunrpc
user
vm

The net category contains the network-related operating system parameters we need to pay attention to. We can list the subtypes under the net type with the following command.

/sbin/sysctl -a|grep "^net."|awk -F "[.| ]" '{print $2}'|sort -k1|uniq

The output result information is as follows.

bridge
core
ipv4
ipv6
netfilter
nf_conntrack_max
unix

In the Linux operating system, these network-related parameters can be modified in the /etc/sysctl.conf file; if a parameter is not present in /etc/sysctl.conf, we can add it to the file ourselves.

Under the net type, the subtypes we need to focus on are core and ipv4.

Optimize socket buffer

If the server's network socket buffers are too small, applications need multiple reads and writes to process the same data, which greatly hurts our program's performance. Setting the socket buffers large enough can improve program performance to a certain extent.

We can enter the following commands on the server's command line to obtain information about the server socket buffer.

/sbin/sysctl -a|grep "^net."|grep "[r|w|_]mem[_| ]"

The output result information is as follows.

net.core.rmem_default = 212992
net.core.rmem_max = 212992
net.core.wmem_default = 212992
net.core.wmem_max = 212992
net.ipv4.tcp_mem = 43545        58062   87090
net.ipv4.tcp_rmem = 4096        87380   6291456
net.ipv4.tcp_wmem = 4096        16384   4194304
net.ipv4.udp_mem = 87093        116125  174186
net.ipv4.udp_rmem_min = 4096
net.ipv4.udp_wmem_min = 4096

Among these, the suffixes max, default, and min denote the maximum, default, and minimum values respectively; the keywords mem, rmem, and wmem denote total memory, receive buffer memory, and send buffer memory respectively.

Note that the parameters with the rmem and wmem keywords are measured in bytes, while those with only the mem keyword are measured in pages. A "page" is the smallest unit in which the operating system manages memory; in Linux, a page is 4KB by default.

How to optimize the frequent sending and receiving of large files

If large files need to be sent and received frequently in a high concurrency scenario, how can we optimize the performance of the server?

Here, the system parameters we can modify are as follows.

net.core.rmem_default
net.core.rmem_max
net.core.wmem_default
net.core.wmem_max
net.ipv4.tcp_mem
net.ipv4.tcp_rmem
net.ipv4.tcp_wmem

Let's make an assumption: suppose the system can allocate at most 2GB of memory to TCP, with a minimum of 256MB and a pressure threshold of 1.5GB. With 4KB pages, the minimum, pressure, and maximum values of tcp_mem are 65536, 393216, and 524288 pages respectively (for example, 2GB / 4KB = 524288).

Suppose the average packet of file data is 512KB, and each socket read/write buffer should hold at least 2 packets, 4 by default, and at most 10. Then we can calculate that the minimum, default, and maximum values of tcp_rmem and tcp_wmem are 1048576, 2097152, and 5242880 bytes respectively (for example, 2 × 512KB = 1048576 bytes). Accordingly, rmem_default and wmem_default are 2097152, and rmem_max and wmem_max are 5242880.

Note: these values are derived directly from the assumptions above.

Note also that when the buffer size exceeds 65535, the net.ipv4.tcp_window_scaling parameter must be set to 1.

Based on the above analysis, we end up with the following system tuning parameters (write them into /etc/sysctl.conf and apply them with /sbin/sysctl -p).

net.core.rmem_default = 2097152
net.core.rmem_max = 5242880
net.core.wmem_default = 2097152
net.core.wmem_max = 5242880
net.ipv4.tcp_mem = 65536  393216  524288
net.ipv4.tcp_rmem = 1048576  2097152  5242880
net.ipv4.tcp_wmem = 1048576  2097152  5242880

Optimize TCP connection

Anyone with some knowledge of computer networking knows that a TCP connection goes through the three-way handshake and four-way teardown, and relies on a series of reliability techniques such as slow start, sliding windows, and Nagle's algorithm (the "sticky packet" algorithm). While these guarantee the reliability of the TCP protocol, they can sometimes hurt our program's performance.

So, in high concurrency scenarios, how do we optimize TCP connections?

(1) Disable Nagle's algorithm (the "sticky packet" algorithm)

If users are sensitive to request latency, we need to set the TCP_NODELAY option on the TCP socket to disable Nagle's algorithm, so that data packets are sent out immediately. At the same time, we can also set the net.ipv4.tcp_syncookies parameter to 1.

(2) Avoid frequent creation and recycling of connection resources

Creating and tearing down network connections is very expensive. We can optimize server performance by closing idle connections and reusing already-allocated connection resources. Reuse of allocated resources should be familiar: thread pools and database connection pools reuse threads and database connections respectively.

We can close the idle connection of the server and reuse the allocated connection resources through the following parameters.

net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 1800

(Note: net.ipv4.tcp_tw_recycle was removed in Linux kernel 4.12, so it should be omitted on newer kernels.)

(3) Avoid repeated sending of data packets

TCP supports a timeout-retransmission mechanism: if the sender has sent a packet to the receiver but receives no acknowledgment within the configured time interval, TCP's timeout retransmission is triggered. To avoid retransmitting packets that were actually received successfully, we should set the server's net.ipv4.tcp_sack parameter to 1.

(4) Increase the number of server file descriptors

In the Linux operating system, each network connection occupies a file descriptor: the more connections, the more file descriptors in use. If the file descriptor limit is set too low, it will also hurt server performance, so we need to raise the server's file descriptor limit.

For example: fs.file-max = 10240000, which means that the server can open 10240000 files at most.
