
From the network card to the application, a data packet passes through a series of components. What does the driver do? What does the kernel do? What can we do to optimize? The whole path involves many finely tunable hardware and software parameters that influence each other, and there is no once-and-for-all "silver bullet". In this article, Yang Peng, a senior cloud system development engineer, draws on his own hands-on experience to explain how to find the optimal configuration for a given scenario based on a deep understanding of the underlying mechanisms.

The article is organized from Yang Peng's keynote speech "Performance Optimization: Receive Data Faster" at the Beijing stop of the Open Talk technology salon. The live video and slides are available through the original post.

Hello, everyone. I'm Yang Peng, a development engineer at Youpaiyun, where I have worked for four years on the underlying CDN systems, responsible for scheduling, caching, load balancing and other core CDN components. I'm very happy to share my experience with network data processing with you. Today's topic is "How to Receive Data Faster", and it mainly covers methods and practices for accelerating the network receive path. I hope it helps you understand how to push optimization as far as possible at the system level, ideally without the application even noticing. With that, let's get into the topic.

First of all, what should be the first thing that comes to mind when attempting any optimization? In my view, it is measurement. Before making any change, you must know clearly which indicators reflect the current problem; then, after making the corresponding adjustment, those same indicators can verify the actual effect.

Around that core of measurement there is a basic principle for today's topic: optimization at the network level ultimately comes down to one thing. If you can monitor the packet loss rate at every level of the network stack, that core indicator tells you clearly where the problem is. With clear, monitorable indicators, it is easy to make the corresponding adjustment and verify the actual effect afterwards. Admittedly these two points are somewhat abstract; what follows is the concrete part.

As shown in the figure above, a received data packet flows through many components on its way from the network card to the application layer. At this stage there is no need to look at every step; just pay attention to a few critical paths:

  • First, the data packet arrives at the network card;
  • Second, the network card, having received the packet, generates an interrupt to tell the CPU that data has arrived;
  • Third, the kernel takes over from there, pulls the data out of the network card, and hands it to the kernel protocol stack for processing.

These are the three key paths. The hand-drawn diagram on the right of the figure refers to these three steps and deliberately uses two colors. The reason for this distinction is that the talk is organized along those two parts: the driver part on top, and the kernel-related part below. Of course the kernel is huge; this article only touches the kernel network subsystem, and more specifically the interaction between the kernel and the driver.

NIC driver

The network card is hardware and the driver is the software that operates it, and this section mostly concerns the driver. It can be split into four topics: initialization, startup, monitoring, and tuning, starting with the initialization process.

NIC driver - initialization

The driver initialization process is hardware specific, so there is no need to pay too much attention to it. The one thing worth noting is the registration of a series of ethtool operations. ethtool can perform all kinds of operations on the network card: it can not only read the card's configuration, but also change its configuration parameters. It is a very powerful tool.

How does it control the network card? During initialization, each network card driver registers, through a kernel interface, the set of ethtool operations it supports. ethtool is a very general interface: it may define, say, 100 operations, but each network card model supports only a subset, and which ones are supported is declared in this step.
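As a user-space illustration of this, ethtool can show which driver sits behind an interface and which of the optional features that driver registered support for; a trimmed sketch, assuming an interface called eth0 driven by igb (the exact output varies by card and driver, values here are illustrative):

$ ethtool -i eth0            # which driver is behind this interface
driver: igb
version: 5.6.0-k
firmware-version: 1.63.0
bus-info: 0000:01:00.0

$ ethtool -k eth0            # which offload features this driver supports
Features for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: on
generic-receive-offload: on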

The excerpt in the figure above shows the assignment of this structure during initialization. Look briefly at the first two members: at initialization the driver tells the kernel which callback functions to invoke when operating this network card, and the most important ones are open and close. Anyone who has used ifconfig to operate a network card will be familiar with this: when you ifconfig up/down an interface, the callbacks registered here at initialization are the ones that get called.

NIC driver - startup

After initialization comes the startup (open) process, which consists of four steps: allocate the rx/tx queue memory, enable NAPI, register the interrupt handler, and enable interrupts. Registering the interrupt handler and enabling interrupts are natural: any piece of hardware attached to the machine has to do this so that it can later notify the system of events through interrupts.

NAPI, the second step, will be explained in detail later; here the focus is the memory allocation during startup. When the network card receives data, it has to copy the data from the link layer into the machine's memory, and that memory is requested from the kernel through an interface when the network card is started. Once the memory is allocated and its address fixed, the network card can later transfer received packets directly to that fixed memory address via DMA, without the CPU even being involved.

The allocation of queue memory can be seen in the figure above. Long ago network cards were single-queue, but most modern network cards are multi-queue, which allows the machine's receive traffic to be load-balanced across multiple CPUs; that is why multiple queues are provided. This concept will come up again in more detail later.
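The rx/tx ring sizes allocated at open time can be inspected, and changed within the hardware maximums, from user space; a trimmed sketch assuming eth0 (the numbers are illustrative):

$ ethtool -g eth0                 # ring sizes: hardware maximum vs. current
Ring parameters for eth0:
Pre-set maximums:
RX:   4096
TX:   4096
Current hardware settings:
RX:   256
TX:   256

$ sudo ethtool -G eth0 rx 1024    # grow the RX ring (re-allocates queue memory)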

Now for the second step of the startup process, NAPI, a very important extension of the modern packet processing framework. The NAPI mechanism is a big part of why very high-speed network cards such as 10G, 20G, or 25G can be supported at all. NAPI itself is not complicated; its core is two points: interrupt plus polling. Traditionally, for every packet the network card receives it generates an interrupt and the packet is processed in the interrupt handler, in an endless cycle of receive packet, handle interrupt, receive next packet, handle next interrupt. The advantage of NAPI is that only one interrupt is needed: after it fires, all the data sitting in the queue memory can be drained by polling, which is very efficient.

NIC driver - monitoring

Next is the monitoring that can be done at the driver layer, and it is worth knowing where each piece of data comes from.


$ sudo ethtool -S eth0
NIC statistics:
     rx_packets: 597028087
     tx_packets: 5924278060
     rx_bytes: 112643393747
     tx_bytes: 990080156714
     rx_broadcast: 96
     tx_broadcast: 116
     rx_multicast: 20294528
     .... 

First, the very important ethtool tool, which reports the card's statistics: general information such as the number of packets received, the amount of traffic handled, and so on. What deserves the most attention is the error counters.


$ cat /sys/class/net/eth0/statistics/rx_dropped
2

Second, through the sysfs interface you can see the number of packets dropped by the network card, which is a sign that something in the system is abnormal.

The third way is ifconfig; the information it shows overlaps with the previous two, and the format is a bit messy, so just be aware of it.

The picture above is an online case worth sharing. At the time a business anomaly was being investigated and the suspicion finally landed on the network card, so further analysis was needed. Some of the card's statistics can be viewed directly with ifconfig. In the figure you can see that the errors counter of the network card is very high, so clearly something is wrong. The more interesting point is that the frame counter to the right of errors has exactly the same value. errors is the accumulation of many kinds of card errors, and since the adjacent dropped and overruns counters are both zero, it means that in the current state essentially all of the card's errors come from frame errors.

Of course, that is only an instantaneous snapshot. The lower part of the figure is the monitoring data, where the fluctuation is clearly visible; it really is an anomaly on one specific machine. Frame errors are generally caused by CRC check failures when the network card receives a packet: the card checksums the packet contents, and when the computed value does not match the checksum carried in the frame, the packet is considered damaged and is dropped directly.

The cause is relatively easy to reason about: two points and one line. The machine's network card is connected by a cable to the uplink switch, so when something goes wrong here it is either the cable, the machine's own network card, or the port on the other end, that is, the uplink switch's port. Following the most likely cause first, we asked operations to replace the machine's network cable, and the indicators reflected the result immediately: the error counter dropped until it disappeared completely, the errors no longer occurred, and the upper-layer business quickly returned to normal as well.

NIC driver - tuning

With monitoring covered, let's look at tuning. There is not much that can be adjusted at this level; it is mainly the network card's multi-queue configuration, which is fairly intuitive: you can adjust the number and size of the queues, the weights between the queues, and even the hash fields used to distribute packets among them.

$ sudo ethtool -l eth0
Channel parameters for eth0:
Pre-set maximums:
RX:   0
TX:   0
Other:    0
Combined: 8
Current hardware settings:
RX:   0
TX:   0
Other:    0
Combined: 4

The output above shows the multi-queue configuration. To illustrate the earlier concept: suppose a web server is bound to CPU2 on a multi-CPU machine, and the machine's network card is multi-queue, with one of the queues processed by CPU2. There is a problem here: traffic for port 80 will land in only one of the queues, and if that queue happens not to be the one processed by CPU2, the data has to be moved. When the lower layers hand the data up to the application layer, it must be migrated from, say, CPU1 over to CPU2, which invalidates the CPU cache, a costly operation for a fast CPU.

So what can we do? With the tools mentioned above, we can deliberately steer TCP traffic for port 80 to the network card queue handled by CPU2. The effect is that a packet stays on the same CPU all the way from arriving at the network card, through kernel processing, up to the application layer. The biggest win is the cache: the CPU cache stays hot, so overall the latency and throughput are very good. Of course this example is not entirely practical; it mainly illustrates the kind of effect that can be achieved.
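If the card supports ntuple filtering, the steering described above can be expressed with ethtool; a sketch of the idea, assuming eth0, that queue 2 is the one handled by CPU2, and that the driver actually supports these options:

$ sudo ethtool -K eth0 ntuple on                           # enable hardware flow steering
$ sudo ethtool -N eth0 flow-type tcp4 dst-port 80 action 2 # send TCP port 80 traffic to RX queue 2
$ sudo ethtool -n eth0                                     # list the installed rules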

Kernel network subsystem

With the network card driver covered, the next part is the kernel subsystem, split into two topics: soft interrupts and the initialization of the network subsystem.

Soft interrupt

The NETDEV shown in the figure above is the conference held every year by the Linux networking community. An amusing detail is that each edition is numbered in a special notation: the figure shows the 0x15 edition, and as you will have noticed that is a hexadecimal number; 0x15 is simply 21 in decimal, which is a rather geeky style. Anyone interested in the network subsystem can follow it.

Back to the topic. The kernel has multiple mechanisms for deferring work, and soft interrupts are just one of them. The figure above shows the basic structure of Linux: user space on top, the kernel in the middle, and the hardware at the bottom, a very schematic layering. User space and the kernel interact in two ways: through system calls, or by trapping into the kernel via exceptions. How does the underlying hardware interact with the kernel? The answer is interrupts: whenever the hardware needs the kernel to handle an event, it must raise an interrupt signal to notify the CPU and the kernel.

Such a mechanism is fine in the general case, but for network traffic, one interrupt per packet leads to two obvious problems.

Problem 1: while an interrupt is being handled, further interrupt signals are masked. If handling one interrupt takes a very long time, interrupt signals that arrive during that time are lost. Suppose processing one packet took ten seconds and five more packets arrived during those ten seconds; because their interrupt signals were lost, those packets will never be processed, even after the first one is done. On the TCP side: the client sends a packet to the server, the server spends seconds processing it, and meanwhile the client sends three more packets; afterwards the server knows nothing about them and believes it only ever received one packet, while the client keeps waiting for the server's reply. Both sides end up stuck, which shows that losing interrupt signals is an extremely serious problem.

Problem 2: if every packet triggers an interrupt, then when packets arrive in bulk an enormous number of interrupts is generated. At 100,000, 500,000, or even millions of pps, the CPU does nothing but handle network interrupts and has no time left for anything else.

The solution to both problems is to make interrupt handling as short as possible. Concretely, the work cannot stay in the interrupt handler; it has to be pulled out and handed to the softirq mechanism. The end result is that the hardware interrupt handler does very little, and the necessary work of actually receiving the data is done by the softirq, which is exactly why softirqs exist.

static struct smp_hotplug_thread softirq_threads = {
  .store              = &ksoftirqd,
  .thread_should_run  = ksoftirqd_should_run,
  .thread_fn          = run_ksoftirqd,
  .thread_comm        = "ksoftirqd/%u",
};

static __init int spawn_ksoftirqd(void)
{
  register_cpu_notifier(&cpu_nfb);

  BUG_ON(smpboot_register_percpu_thread(&softirq_threads));

  return 0;
}
early_initcall(spawn_ksoftirqd);

The softirq mechanism is implemented with kernel threads, and the code above shows how they are set up: every CPU gets a ksoftirqd kernel thread, so a multi-CPU machine has one per CPU. Note the .thread_comm member of the structure, "ksoftirqd/%u": on a machine with three CPUs there will be three kernel threads, ksoftirqd/0, ksoftirqd/1 and ksoftirqd/2.
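These per-CPU threads are easy to spot on a running machine; on a four-CPU box a process listing would look roughly like this (PIDs and times are illustrative):

$ ps -e | grep ksoftirqd
     9 ?        00:00:12 ksoftirqd/0
    18 ?        00:00:09 ksoftirqd/1
    24 ?        00:00:11 ksoftirqd/2
    30 ?        00:00:10 ksoftirqd/3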

Information about softirqs can be seen in /proc/softirqs. There are not many softirq types, only a handful; the ones to watch for networking are NET_TX and NET_RX, the send and receive sides of network data.
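A trimmed, illustrative read of /proc/softirqs; each column is a CPU, each row a softirq type, and the NET_RX row shows how many times receive processing has run on each CPU:

$ cat /proc/softirqs
                    CPU0       CPU1       CPU2       CPU3
          HI:          1          0          0          0
       TIMER:    1650579    1521199    1600528    1580103
      NET_TX:       2419       1962       2147       2015
      NET_RX:   96337442    8209462    7918401    8132609
       BLOCK:     341422     313151     309228     317825
     TASKLET:        189        182        175        168
       SCHED:    1202571    1100321    1123652    1098702
         RCU:    2612340    2450126    2489113    2460998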

Kernel initialization

With softirqs laid out, let's look at the initialization of the kernel network subsystem, which has two main steps:

  • For each CPU, create a per-CPU data structure with many members hanging off it, which is closely involved in the later processing;
  • Register the softirq handlers, corresponding to the NET_TX and NET_RX softirqs mentioned above.

The figure above is a hand-drawn processing flow of a data packet:

  • First, the network card receives a data packet;
  • Second, the packet is copied into memory via DMA;
  • Third, an interrupt is generated to notify the CPU, and interrupt handling begins. The key part of interrupt handling has two steps: mask further interrupt signals, and wake up the NAPI mechanism.

static irqreturn_t igb_msix_ring(int irq, void *data)
{
  struct igb_q_vector *q_vector = data;
  
  /* Write the ITR value calculated from the previous interrupt. */
  igb_write_itr(q_vector);
  
  napi_schedule(&q_vector->napi);
  
  return IRQ_HANDLED;
}

The code above is what the igb network card driver's interrupt handler does. Leaving aside the variable declaration at the start and the return at the end, this interrupt handler is only two lines of code, extremely short. The line to note is the second one: inside the hardware interrupt handler, nothing is done except scheduling the NAPI softirq processing, so the handler returns very quickly.

NAPI activation


/* Called with irq disabled */
static inline void ____napi_schedule(struct softnet_data *sd, struct napi_struct *napi)
{
  list_add_tail(&napi->poll_list, &sd->poll_list);
  __raise_softirq_irqoff(NET_RX_SOFTIRQ);
}

Activating NAPI is also very simple, essentially two steps. When the kernel network subsystem is initialized, each CPU gets a softnet_data structure, and the queue's NAPI information is inserted into that structure's poll list. In other words, when a network card queue receives data, it registers its queue information with the corresponding CPU, binding the two together to ensure that a specific CPU processes a specific queue.

In addition, just as a hardware interrupt is raised, the soft interrupt has to be raised too. The figure puts many steps together; the earlier ones will not be repeated. What matters is how the softirq is triggered. Like hardware interrupts, soft interrupts have an interrupt vector table in which each interrupt number corresponds to a handler function; when an interrupt needs processing, the handler is simply looked up in that table, exactly like hardware interrupt handling.

Data reception - monitoring

After the mechanism, let's see where monitoring can be done. /proc/interrupts holds a lot of information and shows how interrupts are being handled. The first column is the interrupt number; every device has its own interrupt number, fixed for that machine. For networking you only need to look at the numbers belonging to the network card, which are 65, 66, 67, 68 in the figure. The absolute numbers are not meaningful by themselves; what matters is the distribution: whether the interrupts are handled by different CPUs. If they are all handled by a single CPU, adjustments are needed to spread them out.
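For reference, a trimmed and illustrative /proc/interrupts read filtered to the NIC's queues; the interrupt type and queue names vary by driver, and what matters is how the counts are spread across the per-CPU columns:

$ cat /proc/interrupts | grep eth0
  65:  1342817913          0          0          0  IR-PCI-MSI  eth0-TxRx-0
  66:           7  923561842          0          0  IR-PCI-MSI  eth0-TxRx-1
  67:           5          0  931247190          0  IR-PCI-MSI  eth0-TxRx-2
  68:           3          0          0  945721081  IR-PCI-MSI  eth0-TxRx-3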

Data reception - tuning

There are two kinds of adjustment that can be made for interrupts: interrupt coalescing (merging) and interrupt affinity.

Adaptive interrupt merging

  • rx-usecs: how long to delay, in microseconds, after a data frame arrives before generating an interrupt
  • rx-frames: the maximum number of data frames to accumulate before generating an interrupt
  • rx-usecs-irq: if interrupt processing is currently running, how long to delay the next interrupt before delivering it to the CPU
  • rx-frames-irq: if interrupt processing is currently running, the maximum number of data frames to accumulate

The knobs listed above are all implemented by the network card hardware. NAPI is, in essence, an interrupt-coalescing mechanism in software: when many packets arrive, NAPI can get by with a single interrupt, so strictly speaking the hardware's help is not required; its practical effect is the same as NAPI's, reducing the total number of interrupts.
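These knobs are read and written through ethtool's coalescing interface; a sketch assuming eth0 and a driver that supports adaptive coalescing (the values are illustrative):

$ ethtool -c eth0                                  # show the current coalescing settings
$ sudo ethtool -C eth0 adaptive-rx on              # let the driver adapt to the traffic rate
$ sudo ethtool -C eth0 rx-usecs 50 rx-frames 64    # or pin fixed values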

Interrupt affinity

$ sudo bash -c 'echo 1 > /proc/irq/8/smp_affinity'

This is closely tied to the network card's multi-queue support. If the card has multiple queues, you can manually specify which CPU handles each one and spread the data-processing load evenly across the machine's available CPUs. The configuration is simple: write a number into the corresponding file under /proc. The number is a CPU bitmask interpreted in binary: writing 1 means CPU0 handles it; writing 4, which is 100 in binary, hands it to CPU2.
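As a sketch of spreading a four-queue card across CPU0-CPU3, assuming the queue interrupts are numbers 65-68 as in the earlier example (the masks are hexadecimal CPU bitmasks):

# mask: 1 -> CPU0, 2 -> CPU1, 4 -> CPU2, 8 -> CPU3
$ sudo bash -c 'echo 1 > /proc/irq/65/smp_affinity'
$ sudo bash -c 'echo 2 > /proc/irq/66/smp_affinity'
$ sudo bash -c 'echo 4 > /proc/irq/67/smp_affinity'
$ sudo bash -c 'echo 8 > /proc/irq/68/smp_affinity'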

One small caveat: many distributions ship an irqbalance daemon (http://irqbalance.github.io/irqbalance), which will override manual affinity settings. At its core this program does exactly the same thing as writing to those files by hand; if you are interested you can read its code (https://github.com/Irqbalance/irqbalance/blob/master/activate.c), it simply opens the same file and writes the corresponding number into it.

Kernel - data processing

The last part is data processing. Once data has arrived at the network card and been placed into the queue memory, the kernel has to pull it out of that memory. If the machine is doing hundreds of thousands or even a million pps and the CPU did nothing but process network data, no other business logic would ever run; packet processing must not monopolize the CPU, and the key question is how to limit it.

There are two main limits addressing this problem: an overall limit and a per-queue limit.

/* excerpt from net_rx_action(), the NET_RX softirq handler */
while (!list_empty(&sd->poll_list)) {
  struct napi_struct *n;
  int work, weight;

  /* If softirq window is exhausted then punt.
   * Allow this to run for 2 jiffies since which will allow
   * an average latency of 1.5/HZ.
   */
  if (unlikely(budget <= 0 || time_after_eq(jiffies, time_limit)))
    goto softnet_break;

The overall limit is easy to understand: a CPU does not necessarily map to exactly one queue, and if there are fewer CPUs than queues, one CPU has to poll several of them, so one round of softirq processing has a total budget (and a time limit) across all the queues handled by that CPU.
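The overall budget corresponds to the net.core.netdev_budget sysctl (300 by default); a sketch of reading and raising it, noting that newer kernels also expose a separate time budget:

$ sysctl net.core.netdev_budget
net.core.netdev_budget = 300
$ sudo sysctl -w net.core.netdev_budget=600         # more packets per softirq round
$ sudo sysctl -w net.core.netdev_budget_usecs=8000  # time limit, on kernels that have it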

weight = n->weight;

work = 0;
if (test_bit(NAPI_STATE_SCHED, &n->state)) {
        work = n->poll(n, weight);
        trace_napi_poll(n);
}

WARN_ON_ONCE(work > weight);

budget -= work;

The per-queue limit caps how many packets a single queue may process in one round; once its weight is used up, it stops and waits for the next round.
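The per-queue weight corresponds to the net.core.dev_weight sysctl, 64 by default; a sketch of inspecting and raising it:

$ sysctl net.core.dev_weight
net.core.dev_weight = 64
$ sudo sysctl -w net.core.dev_weight=128   # let one queue drain more packets per poll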

softnet_break:
  sd->time_squeeze++;
  __raise_softirq_irqoff(NET_RX_SOFTIRQ);
  goto out;

Having to stop like this is a critical event, and fortunately there are counters that record it, such as time_squeeze. With this information you can judge whether the machine's network processing has hit a bottleneck and how often it is forced to break off.

The figure above shows the data used to monitor these per-CPU indicators, /proc/net/softnet_stat. The format is very simple: one line per CPU, values separated by spaces, printed in hexadecimal. So what does each column mean? Unfortunately there is no documentation; you have to check the kernel version you run and read the corresponding code (a sample read is shown after the field list below).

seq_printf(seq,
     "%08x %08x %08x %08x %08x %08x %08x %08x %08x %08x %08x\n",
     sd->processed, sd->dropped, sd->time_squeeze, 0,
     0, 0, 0, 0, /* was fastroute */
     sd->cpu_collision, sd->received_rps, flow_limit_count);

The following explains where each field in the file comes from. The actual situation may differ: as the kernel iterates, the number and order of the fields can change. The field related to how often network processing is cut short is time_squeeze:

  • sd->processed: packets processed (with multi-NIC bonding this may be more than the actual number of packets received)
  • sd->dropped: packets dropped because the queue was full
  • sd->time_squeeze: number of times the net_rx_action softirq processing was forced to break off
  • sd->cpu_collision: device-lock collisions while sending data, e.g. multiple CPUs transmitting at the same time
  • sd->received_rps: number of times this CPU was woken up (via inter-processor interrupt)
  • sd->flow_limit_count: number of times the flow limit was triggered
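As referenced above, a sample read of /proc/net/softnet_stat (values are illustrative): one row per CPU, hexadecimal columns in the order printed by the code above, with the third column being time_squeeze:

$ cat /proc/net/softnet_stat
6dcad223 00000000 00000001 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
6f0e1565 00000000 00000002 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
660774ec 00000000 00000003 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
61c99331 00000000 000002ea 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000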

The following figure is a related problem encountered in production, which was eventually traced down to the CPU level. Figure 1 is the output of the top command, showing per-CPU usage; CPU4, marked with the red box, is abnormal, in particular the si column, which reaches 89%. si is short for softirq and represents the percentage of time the CPU spends in softirq processing, and for CPU4 it is clearly far too high. Figure 2 is the corresponding /proc/net/softnet_stat output; CPU4 is the fifth row, and the value in its third column is much higher than on the other CPUs, showing that its network processing is being cut short very frequently.

Given the above, the inference was that CPU4 had degraded in performance, perhaps due to a hardware defect or some other reason. To verify this, I wrote a simple Python script: an endless loop that does nothing but increment a counter. Each run was pinned to one CPU, and the run times on different CPUs were compared. The final comparison showed that CPU4 took several times longer than the other CPUs, confirming the inference. After operations replaced the CPU, the corresponding indicators returned to normal.
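This is not the original Python script, but a rough shell equivalent of the same idea: pin a fixed amount of busy work to one CPU at a time and compare the wall-clock times (the CPU numbers are just for illustration):

# run the same counting loop pinned to a healthy CPU and to the suspect CPU4
$ time taskset -c 1 bash -c 'i=0; while [ $i -lt 10000000 ]; do i=$((i+1)); done'
$ time taskset -c 4 bash -c 'i=0; while [ $i -lt 10000000 ]; do i=$((i+1)); done'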

Summary

Everything above covers only the packet's journey from the network card into the kernel, before it even reaches the familiar protocols. That is only the first step of a long march; a series of steps follow, such as packet merging (GRO), software multi-queue (RPS), and RFS, which adds flow awareness (the IP/port four-tuple) on top of that load balancing, before the data is finally handed to the IP layer and on to the familiar TCP layer.

In general, today's sharing centered on the driver. The core point of performance optimization I want to emphasize is metrics: if you can't measure it, it's hard to improve it. You need indicators so that every optimization is meaningful.


