Introduction
Kubernetes requires each pod in the cluster to have a unique, routable IP. Kubernetes itself does not assign an IP, but leaves the task to a third-party solution.
The goal of this project is to find the solution with the lowest latency, highest throughput, and lowest setup cost. Since my load is latency-sensitive, my intent is to measure high-percentile latencies under relatively high network utilization, specifically at 30% to 50% of the maximum load, because I believe this best represents the most common use cases of a non-overloaded system.
Competitors
Docker with --net=host
This is one of our reference setups; all other competitors are compared against it. The --net=host option means that the container inherits the IP of its host, i.e. no container networking is involved.
A priori, no container networking solution should perform better than no container networking at all; that is why this setup is used as a reference.
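For reference, the baseline looks like this; the image and options other than --net=host are placeholders:
$ docker run --rm --net=host nginx    # container shares the host's network namespace and IP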
Flannel
Flannel is a virtual network solution maintained by CoreOS. It is a well-tested, production-ready solution, so it has the lowest setup cost. When a worker node with Flannel is added to the k8s cluster, Flannel does three things:
- Use etcd to assign a subnet to the new worker node
- Create a virtual bridge interface (called docker0 bridge) on the machine
- Set up the packet forwarding backend:
aws-vpc
Registers the machine subnet in the Amazon AWS instance table. The number of records in this table is limited to 50, i.e. a cluster cannot have more than 50 nodes if Flannel is used with aws-vpc. Also, this backend works only with Amazon AWS.
host-gw
Create an IP route to the subnet through the remote host IP. Requires direct layer 2 connections between hosts running Flannel.
vxlan
Create a virtual VXLAN interface
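The backend is chosen in Flannel's network configuration stored in etcd. A minimal sketch (the CIDR is illustrative, and the key assumes Flannel's default etcd prefix and the etcd v2 etcdctl):
$ etcdctl set /coreos.com/network/config '{"Network": "10.2.0.0/16", "Backend": {"Type": "host-gw"}}'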
Since Flannel uses a bridge interface to forward packets, each packet traverses two network stacks on its way from one container to another.
IPvlan
IPvlan is a driver in the Linux kernel, which can create a virtual interface with a unique IP without using a bridge interface.
To assign an IP to a container with IPvlan, you have to (a minimal sketch follows this list):
- Create a container without any network interface
- Create an ipvlan interface in the default network namespace
- Move this interface into the container's network namespace
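A minimal sketch of these steps, assuming a Docker container; the interface names, addresses, and ipvlan mode are illustrative, not the exact setup used in this benchmark:
$ docker run -d --net=none --name web nginx                 # container with no network interface
$ pid=$(docker inspect -f '{{.State.Pid}}' web)             # PID of the container's init process
$ ip link add link eth0 ipvl0 type ipvlan mode l2           # ipvlan interface in the default namespace
$ ip link set ipvl0 netns $pid                              # move it into the container's namespace
$ nsenter -t $pid -n ip addr add 10.0.0.2/24 dev ipvl0      # assign the container's unique IP
$ nsenter -t $pid -n ip link set ipvl0 up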
IPvlan is a relatively new solution, so there are no ready-made tools to automate this process. This makes it hard to deploy IPvlan across many servers and containers, i.e. the setup cost is high.
However, IPvlan does not require a bridge interface and forwards packets directly from the NIC to the virtual interface, so we expected it to perform better than Flannel.
Load test plan
For each competitor, perform the following steps:
- Set up the network on two machines
- Run tcpkali in a container on one machine and let it send requests at a constant rate
- Run Nginx in a container on the other machine and let it respond with a fixed-size file
- Capture system metrics and tcpkali results
We ran the benchmark at a request rate of 50,000 to 450,000 requests per second (RPS).
On each request, Nginx responds with a static file of a fixed size: 350 B (100 B of content, 250 B of headers) or 4 KB.
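For example, the fixed-size response bodies can be prepared as static files in Nginx's document root; the paths and the exact way the files were generated are assumptions:
$ head -c 100 /dev/urandom > /usr/share/nginx/html/payload_100b    # 100 B body; HTTP headers add ~250 B
$ head -c 4096 /dev/urandom > /usr/share/nginx/html/payload_4k     # 4 KB body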
Test Results
- The results show that IPvlan has the lowest latency and the highest maximum throughput, followed by the host-gw and aws-vpc Flannel backends; host-gw shows better results under maximum load.
- Flannel with vxlan showed the worst results in all tests. However, I suspect that its anomalous 99.999%ile results are caused by a bug.
- The results for the 4 KB responses are similar to those for the 350 B responses, with two noticeable differences: the maximum RPS is much lower, since about 270k RPS of 4 KB responses (roughly 270,000 × 4 KB ≈ 8.8 Gbit/s of payload alone, before headers) is enough to fully load a 10 Gbps NIC.
- In the throughput test, IPvlan comes very close to --net=host.
Our current choice is Flannel in host-gw mode. It does not have many dependencies (e.g. it requires neither AWS nor a new Linux kernel version), it is easy to deploy compared with IPvlan, and its performance is sufficient. IPvlan is the backup option; if Flannel adds IPvlan support at some point, we will switch to it.
Although aws-vpc performs slightly better than host-gw, its 50-node limitation and the fact that it is hardwired to Amazon AWS are a deal-breaker for us.
50,000 RPS, 350 B
At a rate of 50,000 requests per second, all candidates showed acceptable performance. The main trends are already visible: IPvlan performs the best, host-gw and aws-vpc follow closely, and vxlan performs the worst.
150,000 RPS, 350 B
IPvlan is marginally better than host-gw and aws-vpc, but it is the worst on the 99.99%ile; host-gw performs slightly better than aws-vpc.
250,000 RPS, 350 B
This load is also very common in production, so these results are particularly important.
IPvlan again shows the best performance, but aws-vpc is better on the 99.99 and 99.999%ile, and host-gw is better than aws-vpc on the 95 and 99%ile.
350,000 RPS, 350 B
In most cases, the latency is close to the 250,000 RPS (350 B) case, but it grows rapidly above the 99.5%ile, which means that we are already close to the maximum RPS.
450,000 RPS, 350 B
This is the largest RPS that still produced reasonable results. IPvlan leads once again, with latency about 30% worse than --net=host:
Interestingly, the performance of host-gw is better than aws-vpc:
500,000 RPS, 350 B
At 500,000 RPS, only IPvlan still works, even outperforming --net=host, but the latency is so high that we consider it unusable for latency-sensitive applications.
50k RPS, 4 KB
Larger responses result in higher network usage, but the test results look almost the same as smaller responses:
150k RPS, 4 KB
Host-gw has a surprisingly bad 99.999%ile, but it still shows good results at the lower percentiles.
250k RPS, 4 KB
This is the maximum RPS with the large (4 KB) response. Unlike the small-response case, aws-vpc performs much better than host-gw, and vxlan is once again excluded from the chart.
Test environment
Background
In order to understand this article and reproduce our test environment, you should be familiar with the basics of high-performance networking.
These articles provide useful insights on the subject:
- How to receive a million packets per second, by CloudFlare
- How to achieve low latency with 10Gbps Ethernet, by CloudFlare
- Scaling in the Linux Networking Stack, from the Linux kernel documentation
Server specification list
- Two Amazon AWS EC2 c4.8xlarge instances running CentOS 7, both with enhanced networking enabled.
- Each instance has 2 processors (one per NUMA node) with 9 cores each and 2 hyperthreads per core, which effectively allows 36 threads to run on each instance.
- Each instance has a 10Gbps network interface card (NIC) and 60 GB of memory.
- To support enhanced networking and IPvlan, Linux kernel 4.3.0 with the Intel ixgbevf driver was installed (the driver in use can be verified as shown below).
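A quick way to check which driver the NIC is using (the interface name is an assumption; the output is instance-specific):
$ ethtool -i eth0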
Installation and deployment
Modern NICs provide Receive Side Scaling (RSS) over multiple interrupt request (IRQ) lines. EC2 provides only two interrupt lines in a virtualized environment, so we tested several RSS and Receive Packet Steering (RPS) configurations and ended up with the following configuration, partly suggested by the Linux kernel documentation:
IRQ
The first core on each of the two NUMA nodes is configured to receive interrupts from the NIC.
Use lscpu to match the CPU with the NUMA node:
$ lscpu | grep NUMA
NUMA node(s): 2
NUMA node0 CPU(s): 0-8,18-26
NUMA node1 CPU(s): 9-17,27-35
This is achieved by writing 0 and 9 to /proc/irq/<num>/smp_affinity_list, where the IRQ number is obtained by grep eth0 /proc/interrupts:
$ echo 0 > /proc/irq/265/smp_affinity_list
$ echo 9 > /proc/irq/266/smp_affinity_list
RPS
Several RPS combinations were tested. To reduce latency, we relieved the IRQ-handling processors by using only CPUs 1–8 and 10–17 for packet processing. Unlike IRQ's smp_affinity, the rps_cpus sysfs entry has no _list counterpart, so we use a bitmask to list the CPUs to which RPS may forward traffic:
$ echo "00000000,0003fdfe" > /sys/class/net/eth0/queues/rx-0/rps_cpus
$ echo "00000000,0003fdfe" > /sys/class/net/eth0/queues/rx-1/rps_cpus
Transmit Packet Steering (XPS)
Set all NUMA node 0 processors (including hyperthreads, i.e. CPUs 0-8 and 18-26) to tx-0, and all NUMA node 1 processors (CPUs 9-17 and 27-35) to tx-1:
$ echo "00000000,07fc01ff" > /sys/class/net/eth0/queues/tx-0/xps_cpus
$ echo "0000000f,f803fe00" > /sys/class/net/eth0/queues/tx-1/xps_cpus
Receive Flow Steering (RFS)
We planned to use about 60k persistent connections, and the official documentation recommends rounding this up to the nearest power of two. Since the NIC has two receive queues, each queue's rps_flow_cnt is set to half of rps_sock_flow_entries (the kernel documentation suggests rps_sock_flow_entries / N for N queues):
$ echo 65536 > /proc/sys/net/core/rps_sock_flow_entries
$ echo 32768 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt
$ echo 32768 > /sys/class/net/eth0/queues/rx-1/rps_flow_cnt
Nginx
Nginx uses 18 workers, each worker pinned to its own CPU (0-17). This is set by the worker_cpu_affinity option:
worker_processes 18;
worker_cpu_affinity 1 10 100 1000 10000 ...;
Tcpkali
Tcpkali does not have built-in CPU affinity support. To take advantage of RFS, we run tcpkali under taskset and tune the scheduler to make thread migration rare:
$ echo 10000000 > /proc/sys/kernel/sched_migration_cost_ns
$ taskset -ac 0-17 tcpkali --threads 18 ...
Compared with the other settings we have tried, this setting distributes the interrupt load more evenly among the CPU cores and achieves better throughput with the same latency.
CPUs 0 and 9 deal exclusively with NIC interrupts and do not serve packets, but they are still the busiest.
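Per-CPU utilization like this can be observed with mpstat from the sysstat package (a monitoring suggestion, not part of the original setup):
$ mpstat -P ALL 1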
Red Hat's tuned was also used, with the network-latency profile enabled. To minimize the impact of nf_conntrack, NOTRACK rules were added, and kernel parameters were tuned to support a large number of TCP connections:
fs.file-max = 1024000
net.ipv4.ip_local_port_range = 2000 65535
net.ipv4.tcp_max_tw_buckets = 2000000
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_fin_timeout = 10
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_low_latency = 1
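The NOTRACK rules mentioned above would look roughly like this; the exact rules used were not published, so treat this as an assumption:
$ iptables -t raw -A PREROUTING -p tcp -j NOTRACK
$ iptables -t raw -A OUTPUT -p tcp -j NOTRACK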