Author

Liu Xu is a senior engineer at Tencent Cloud focusing on cloud-native container technology. He has many years of experience in large-scale Kubernetes cluster management and microservice governance, and is currently responsible for the architecture design and development of the data plane of Tencent Cloud Mesh (TCM).

Introduction

At present, service meshes, represented by Istio [1], generally adopt the Sidecar architecture and use iptables to hijack traffic into the Sidecar proxy. The advantage is that this is non-intrusive to the application, but the Sidecar proxy increases request latency and resource consumption.

Performance has always been a major concern for users and a key factor when evaluating whether to adopt a service mesh product. The Tencent Cloud TCM team has been committed to optimizing service mesh performance. Last week at KubeCon we shared our approach of using eBPF instead of iptables to optimize service mesh data plane performance.

Traffic hijacking with iptables

Let's first look at the iptables-based traffic hijacking scheme currently used by the community. The following figure shows the process of creating a Pod: the sidecar injector injects two containers, istio-init and istio-proxy, into the Pod.

  • istio-init is an init container responsible for creating the iptables rules used for traffic hijacking; it exits after the rules are created
  • istio-proxy runs envoy, which proxies the Pod's network traffic; iptables hijacks requests to istio-proxy for processing

The following figure shows the entire process of traffic hijacking with iptables. Here is a brief description; interested readers can refer to [2].

  • Inbound iptables redirects incoming traffic to port 15006, envoy's VirtualInboundListener; envoy then forwards the request to the application's port according to the request's original destination address.
  • Outbound iptables redirects outgoing traffic to port 15001, envoy's VirtualOutboundListener; envoy then routes the request to the appropriate backend according to its original destination address and Host header.

Traffic hijacking with eBPF

eBPF (extended Berkeley Packet Filter) [3] is a technology that can run user-written programs in the Linux kernel without modifying kernel code or loading kernel modules. It is widely used in networking, security, and observability. The earliest and most influential eBPF-based project in the Kubernetes community is Cilium [4], which uses eBPF instead of iptables to optimize Service performance.

Inbound

Let's first look at inbound traffic hijacking, which is implemented mainly by an eBPF program that hooks the bind system call.

The eBPF program hijacks the bind system call and modifies the address: for example, an application binding 0.0.0.0:80 is rewritten to bind 127.0.0.1:80. Since the application may also bind an IPv6 address, there are two eBPF programs handling IPv4 and IPv6 bind respectively.

Unlike iptables, whose rules can be configured per netns, an eBPF program attached to a hook point takes effect for the entire system. For example, once attached to the bind system call, every process on the node that calls bind, not only those inside Pods, triggers the eBPF program, so we need to distinguish which calls come from Pods whose traffic should be hijacked.

In K8s, except for Pods using hostNetwork, each Pod has its own netns, and each netns has a unique cookie. We therefore store the netns cookies of Pods that need traffic hijacking in cookie_map, and the eBPF program decides whether to modify the bind address by checking whether the netns cookie of the current socket is in cookie_map.
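
For illustration, a minimal sketch of what the IPv4 bind hook could look like is shown below, assuming a kernel that allows bpf_get_netns_cookie from cgroup sock_addr programs; the program and map names (istio_bind4, the cookie_map layout) are illustrative rather than the actual TCM implementation.

```c
// Minimal sketch of an IPv4 bind hook (illustrative, not production code).
// Assumes bpf_get_netns_cookie() is usable from cgroup sock_addr programs.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

// netns cookies of Pods whose traffic should be hijacked
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 65536);
    __type(key, __u64);    /* netns cookie */
    __type(value, __u32);  /* unused flag  */
} cookie_map SEC(".maps");

SEC("cgroup/bind4")
int istio_bind4(struct bpf_sock_addr *ctx)
{
    __u64 cookie = bpf_get_netns_cookie(ctx);

    /* Only rewrite binds coming from netns that we manage. */
    if (!bpf_map_lookup_elem(&cookie_map, &cookie))
        return 1;   /* allow, untouched */

    /* 0.0.0.0:port -> 127.0.0.1:port, so the application is only
     * reachable via loopback and envoy fronts all inbound traffic. */
    if (ctx->user_ip4 == 0)
        ctx->user_ip4 = bpf_htonl(0x7f000001);

    return 1;       /* allow the (possibly rewritten) bind */
}

char LICENSE[] SEC("license") = "GPL";
```

An analogous cgroup/bind6 program handles the IPv6 case.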

After modifying the application's bind address, we also need to push a pod_ip:80 listener configuration to envoy; this listener forwards requests to 127.0.0.1:80, the address the application actually listens on, completing inbound traffic hijacking. One remaining problem: since istio starts envoy as the istio-proxy user, a non-root user cannot bind privileged ports below 1024 by default. We solve this with sysctl net.ipv4.ip_unprivileged_port_start=0.

Comparing inbound traffic hijacking between iptables and eBPF: the iptables solution requires conntrack processing for every packet, while the eBPF program runs only once, when the application calls bind, which reduces the performance overhead.

Outbound

Now let's look at outbound traffic hijacking, which is more complicated and, depending on the protocol, splits into two cases: TCP and UDP.

TCP traffic hijacking

The process of hijacking TCP outgoing traffic:

  • _connect4 hijacks the connect system call, modifies the destination address to 127.0.0.1:15001 (envoy's VirtualOutboundListener), and saves the connection's original destination address in sk_storage_map (a simplified sketch of this hook follows the list)
  • After the TCP connection is established, sockops reads the original destination address from sk_storage_map and writes it into origin_dst_map, keyed by the four-tuple (source IP, destination IP, source port, destination port)
  • _getsockopt hijacks the getsockopt system call, reads the original destination address from origin_dst_map, and returns it to envoy
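
As referenced in the first item above, here is a simplified sketch of the connect hook. It covers only the _connect4 step; the sockops and _getsockopt programs that hand the original destination back to envoy are omitted, and the names and struct layout are illustrative.

```c
// Simplified sketch of the outbound TCP connect hook (illustrative only).
// The original destination is kept in per-socket storage so that later
// sockops / getsockopt programs can hand it back to envoy.
// Assumes a kernel that allows bpf_sk_storage_get() from this hook.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct orig_dst {
    __u32 ip4;   /* original destination IPv4, network byte order */
    __u32 port;  /* original destination port, as stored in user_port */
};

struct {
    __uint(type, BPF_MAP_TYPE_SK_STORAGE);
    __uint(map_flags, BPF_F_NO_PREALLOC);
    __type(key, int);
    __type(value, struct orig_dst);
} sk_storage_map SEC(".maps");

SEC("cgroup/connect4")
int istio_connect4(struct bpf_sock_addr *ctx)
{
    struct orig_dst *dst;

    /* Remember where the application actually wanted to connect. */
    dst = bpf_sk_storage_get(&sk_storage_map, ctx->sk, NULL,
                             BPF_SK_STORAGE_GET_F_CREATE);
    if (!dst)
        return 1;

    dst->ip4  = ctx->user_ip4;
    dst->port = ctx->user_port;

    /* Redirect the connection to envoy's VirtualOutboundListener. */
    ctx->user_ip4  = bpf_htonl(0x7f000001);   /* 127.0.0.1 */
    ctx->user_port = bpf_htons(15001);

    return 1;
}

char LICENSE[] SEC("license") = "GPL";
```

The same hook also covers the connected-UDP DNS case described in the next section, where the traffic is diverted to port 15053 instead.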

UDP traffic hijacking

Istio introduced a smart DNS proxy in version 1.8 [5]. When it is enabled, iptables hijacks DNS requests to the Sidecar for processing, and we need to implement the same logic with eBPF. Hijacking TCP DNS is similar to the TCP case above; hijacking UDP DNS is shown in the figure below.

The process of hijacking UDP outgoing traffic:

  • _connect4 and _sendmsg4 are responsible for rewriting the UDP destination address to 127.0.0.1:15053 and saving the original destination address in sk_storage_map, because Linux provides two ways to send UDP data (a sketch of the sendmsg4/recvmsg4 pair follows the list)

    • calling connect first and then send, which is handled by _connect4
    • calling sendto directly, which is handled by _sendmsg4
  • _recvmsg4 rewrites the source address of the reply packet back to the original destination address by reading sk_storage_map, because some applications, such as nslookup, verify the source address of the reply
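
As mentioned in the list above, a sketch of the _sendmsg4/_recvmsg4 pair might look roughly like this. It reuses the per-socket storage idea from the TCP case; the DNS port filter and all names are assumptions for illustration.

```c
// Illustrative sketch of UDP DNS hijacking with cgroup sendmsg4/recvmsg4
// hooks. dns_sk_storage plays the same role as sk_storage_map in the TCP
// case: it remembers the original DNS server address per socket.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct orig_dst {
    __u32 ip4;
    __u32 port;
};

struct {
    __uint(type, BPF_MAP_TYPE_SK_STORAGE);
    __uint(map_flags, BPF_F_NO_PREALLOC);
    __type(key, int);
    __type(value, struct orig_dst);
} dns_sk_storage SEC(".maps");

SEC("cgroup/sendmsg4")
int istio_sendmsg4(struct bpf_sock_addr *ctx)
{
    struct orig_dst *dst;

    /* Only rewrite DNS traffic (port 53). */
    if (ctx->user_port != bpf_htons(53))
        return 1;

    dst = bpf_sk_storage_get(&dns_sk_storage, ctx->sk, NULL,
                             BPF_SK_STORAGE_GET_F_CREATE);
    if (!dst)
        return 1;

    dst->ip4  = ctx->user_ip4;
    dst->port = ctx->user_port;

    /* Redirect the query to envoy's DNS proxy. */
    ctx->user_ip4  = bpf_htonl(0x7f000001);   /* 127.0.0.1 */
    ctx->user_port = bpf_htons(15053);
    return 1;
}

SEC("cgroup/recvmsg4")
int istio_recvmsg4(struct bpf_sock_addr *ctx)
{
    struct orig_dst *dst;

    /* Make the reply appear to come from the original DNS server,
     * otherwise tools like nslookup reject it. */
    dst = bpf_sk_storage_get(&dns_sk_storage, ctx->sk, NULL, 0);
    if (!dst)
        return 1;

    ctx->user_ip4  = dst->ip4;
    ctx->user_port = dst->port;
    return 1;
}

char LICENSE[] SEC("license") = "GPL";
```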

For TCP and connected UDP, every packet in the iptables solution has to go through conntrack, while the overhead of the eBPF solution is one-time: the programs run only when the socket is established, which reduces the performance overhead.

Sockmap

The idea of using sockmap to optimize service mesh performance was first proposed by Cilium, and our solution also draws on Cilium's implementation [7]. Here we borrow two pictures from Cilium to illustrate the optimization effect.

Before optimization, network communication between the Sidecar proxy and the application has to go through the TCP/IP stack.

After optimization, network communication between the Sidecar proxy and the application bypasses the TCP/IP stack, and if two Pods are on the same node, communication between them can be accelerated as well. Here is a brief description of how sockmap works; interested readers can refer to [6].

  • sock_hash is an eBPF map that stores socket information; its key is the four-tuple (source IP, destination IP, source port, destination port)
  • _sockops monitors socket events and stores the socket information in sock_hash
  • _sk_msg intercepts the sendmsg system call, looks up the peer socket in sock_hash, and if it is found, calls bpf_msg_redirect_hash to send the data directly to the peer socket (a condensed sketch follows the list)
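
For reference, a condensed sketch of the _sockops/_sk_msg pair, in the spirit of cilium's sockops code [7], is shown below. The byte-order handling of the port fields is simplified and the names are illustrative.

```c
// Condensed sockops + sk_msg sketch (illustrative). The sk_msg program is
// attached to sock_hash as a BPF_SK_MSG_VERDICT program. Byte-order
// handling of local_port/remote_port is simplified for readability.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct sock_key {
    __u32 sip4;    /* source IPv4 */
    __u32 dip4;    /* destination IPv4 */
    __u32 sport;   /* source port, host byte order */
    __u32 dport;   /* destination port, host byte order */
};

struct {
    __uint(type, BPF_MAP_TYPE_SOCKHASH);
    __uint(max_entries, 65536);
    __type(key, struct sock_key);
    __type(value, __u32);
} sock_hash SEC(".maps");

SEC("sockops")
int istio_sockops(struct bpf_sock_ops *skops)
{
    struct sock_key key = {};

    switch (skops->op) {
    case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
    case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
        /* Record every locally established socket in sock_hash. */
        key.sip4  = skops->local_ip4;
        key.dip4  = skops->remote_ip4;
        key.sport = skops->local_port;               /* host order  */
        key.dport = bpf_ntohl(skops->remote_port);   /* net -> host */
        bpf_sock_hash_update(skops, &sock_hash, &key, BPF_NOEXIST);
        break;
    }
    return 0;
}

SEC("sk_msg")
int istio_sk_msg(struct sk_msg_md *msg)
{
    struct sock_key peer = {};

    /* Look up the peer socket: swap the local/remote sides. */
    peer.sip4  = msg->remote_ip4;
    peer.dip4  = msg->local_ip4;
    peer.sport = bpf_ntohl(msg->remote_port);
    peer.dport = msg->local_port;

    /* If the peer lives on this host, splice the data straight into
     * its receive queue and skip the TCP/IP stack. */
    bpf_msg_redirect_hash(msg, &sock_hash, &peer, BPF_F_INGRESS);
    return SK_PASS;
}

char LICENSE[] SEC("license") = "GPL";
```

If the peer socket is not found in sock_hash (i.e. the other end is not local), the redirect simply does not happen and the data falls back to the normal TCP/IP path.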

Problem

However, using the four-tuple as the key can cause conflicts: for example, in two Pods on the same node, envoy may use the same source port 50000 to request port 80 of its local application.

To solve this problem, we added the netns cookie to the key and set the cookie to 0 for non-localhost requests. This not only ensures that keys do not conflict, but also accelerates network communication between two Pods on the same node.

However, earlier kernels do not support getting the netns cookie from sockops and sk_msg programs, so we submitted two patches [8][9] to the kernel community; they have been merged in kernel 5.15.
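
With those patches in place, the sock_hash key from the sketch above can be extended roughly as follows. Again this is only a sketch: bpf_get_netns_cookie can be called from sockops/sk_msg programs only on kernels carrying the two patches (5.15+).

```c
// Sketch of the extended sock_hash key. bpf_get_netns_cookie() is only
// available to sockops/sk_msg programs on kernels with the two patches
// mentioned above (merged in 5.15).
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct sock_key {
    __u64 netns_cookie;  /* 0 for non-localhost traffic */
    __u32 sip4;
    __u32 dip4;
    __u32 sport;
    __u32 dport;
};

static __always_inline void fill_key(struct sock_key *key,
                                     struct bpf_sock_ops *skops)
{
    __u32 loopback = bpf_htonl(0x7f000001);

    key->sip4  = skops->local_ip4;
    key->dip4  = skops->remote_ip4;
    key->sport = skops->local_port;
    key->dport = bpf_ntohl(skops->remote_port);

    /* Disambiguate identical four-tuples from different Pods: only
     * localhost traffic (envoy <-> app inside one netns) carries the
     * netns cookie; everything else uses 0, so two Pods on the same
     * node still compute matching keys for their shared connection. */
    if (key->sip4 == loopback && key->dip4 == loopback)
        key->netns_cookie = bpf_get_netns_cookie(skops);
    else
        key->netns_cookie = 0;
}
```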

Architecture

The architecture of the whole solution is shown in the figure. istio-ebpf runs on each node as a DaemonSet and is responsible for loading/attaching the eBPF programs and creating the eBPF maps. The istio-init container is retained, but instead of creating iptables rules it updates the eBPF maps: it stores the Pod's netns cookie in cookie_map. We also modified istiod so that it issues different xDS configurations according to the Pod's traffic hijacking mode (iptables/eBPF).
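
For illustration only, one possible way for an init component to register a Pod's netns cookie in cookie_map is sketched below. It assumes the SO_NETNS_COOKIE socket option (kernel 5.14+) and a hypothetical bpffs pin path; the actual istio-ebpf/istio-init flow may differ.

```c
// Hypothetical userspace sketch: read the current netns cookie via
// SO_NETNS_COOKIE (kernel 5.14+) and register it in a pinned cookie_map.
// The pin path and the overall flow are assumptions for illustration.
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/socket.h>
#include <bpf/bpf.h>          /* libbpf: bpf_obj_get, bpf_map_update_elem */

#ifndef SO_NETNS_COOKIE
#define SO_NETNS_COOKIE 71
#endif

int main(void)
{
    uint64_t cookie = 0;
    socklen_t len = sizeof(cookie);
    uint32_t flag = 1;

    /* Any socket created in this netns reports the netns cookie. */
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0 ||
        getsockopt(sock, SOL_SOCKET, SO_NETNS_COOKIE, &cookie, &len) < 0) {
        perror("SO_NETNS_COOKIE");
        return 1;
    }
    close(sock);

    /* cookie_map is assumed to be pinned by the istio-ebpf DaemonSet. */
    int map_fd = bpf_obj_get("/sys/fs/bpf/cookie_map");
    if (map_fd < 0) {
        perror("bpf_obj_get");
        return 1;
    }

    if (bpf_map_update_elem(map_fd, &cookie, &flag, BPF_ANY) < 0) {
        perror("bpf_map_update_elem");
        return 1;
    }

    printf("registered netns cookie %llu\n", (unsigned long long)cookie);
    return 0;
}
```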

Performance comparison

Test environment: Ubuntu 21.04, kernel 5.15.7

  • Under the same conditions, using eBPF can reduce System CPU usage by 20%
  • Under the same conditions, using eBPF can increase QPS by 20%
  • Under the same conditions, using eBPF can reduce request latency

Summary

The Sidecar architecture of a service mesh inevitably adds request latency and resource consumption. We use eBPF instead of iptables to implement traffic hijacking, and use sockmap to accelerate network communication between the Sidecar proxy and the application, which reduces request latency to a certain extent. Due to kernel version requirements and other constraints, this solution is expected to be launched early next year. The TCM team will continue to explore new directions for performance optimization.

References

[1] https://istio.io

[2] https://jimmysong.io/blog/sidecar-injection-iptables-and-traffic-routing

[3] https://ebpf.io

[4] https://cilium.io

[5] https://istio.io/latest/blog/2020/dns-proxy

[6] https://arthurchiao.art/blog/socket-acceleration-with-ebpf-zh

[7] https://github.com/cilium/cilium/tree/v1.11.0/bpf/sockops

[8] https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=6cf1770d

[9] https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=fab60e29f

