Preface
Kubernetes has become the de facto standard for container management, and the network system is a core part of Kubernetes. As more and more services are deployed on Kubernetes, new requirements have emerged for container networks:
- How to improve network observability, a need that is especially prominent for serverless products
- How to minimize the network performance overhead introduced by containers
These requirements challenge traditional firewall and load-balancing technologies such as iptables and IPVS. They also prompt us to consider whether the container network access path can avoid depending on the node, thereby shortening the path and improving network performance.
eBPF is a revolutionary technology that can safely run programs at many hook points in the kernel. It is highly programmable, does not require maintaining kernel modules, and is easy to maintain, which makes it possible to satisfy the above requirements. Cilium is an open-source container network project built on eBPF that provides solutions for network connectivity, service load balancing, security and observability.
Based on Cilium and eBPF, Tencent Cloud Container Service (TKE) has therefore implemented a high-performance ClusterIP Service solution for the independent network card mode. TKE is committed to providing a higher-performance, more secure and easier-to-use container network, so it will continue to follow Cilium and other cutting-edge container network technologies, and will roll out more complete Cilium productization capabilities in the future.
Independent network card Service solution
TKE launched a new generation of container network solution last year, in which a Pod can exclusively use an elastic network card and its traffic no longer passes through the node's network protocol stack (the default network namespace). However, the current kube-proxy implementation of ClusterIP relies on iptables rules configured in the node-side network protocol stack, so it no longer applies to this independent network card solution.
One candidate is Cilium, which provides eBPF-based address translation to support ClusterIP Services. However, its native data planes only cover veth pair and ipvlan l3; it does not support a data plane in which Pod traffic bypasses the node network protocol stack entirely, so it cannot natively solve ClusterIP access for independent network cards.
Therefore, TKE extended Cilium with a third data plane in addition to the natively supported veth and ipvlan l3 ones, as shown in the figure (assuming the Pod accesses the Service IP 172.16.0.2). On the data plane, the bpf programs originally attached to the node-side veth are instead attached to the independent network card (also an elastic network card) inside the Pod, so that the Pod's packets undergo DNAT (destination network address translation) when they are sent, and the return packets undergo reverse DNAT when they are received by the network card, thereby supporting ClusterIP access (see the sketch below). This data plane approach is general and can also be adapted to scenarios such as ipvlan l2 and SR-IOV.
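To make the idea concrete, below is a heavily simplified sketch of what a tc BPF program attached to the ENI's egress hook inside the Pod's network namespace might look like. It is not Cilium's actual datapath code: the hard-coded SERVICE_IP and BACKEND_IP, the TCP-only handling, the single-backend assumption and the function name eni_egress_dnat are purely illustrative, whereas the real implementation selects backends from BPF maps maintained by the agent, tracks connections, and performs reverse DNAT on the ingress path.

```c
// Illustrative sketch only: egress DNAT of a hard-coded ClusterIP to a
// hard-coded backend, attached via tc (clsact egress) to the ENI inside
// the Pod's network namespace. Not Cilium's real code.
#include <stddef.h>
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#define SERVICE_IP bpf_htonl(0xAC100002) /* 172.16.0.2: the ClusterIP           */
#define BACKEND_IP bpf_htonl(0x0A000042) /* 10.0.0.66: an assumed backend PodIP */

SEC("classifier")
int eni_egress_dnat(struct __sk_buff *skb)
{
    void *data     = (void *)(long)skb->data;
    void *data_end = (void *)(long)skb->data_end;

    /* Parse Ethernet + IPv4 headers with the usual bounds checks. */
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
        return TC_ACT_OK;

    struct iphdr *iph = (void *)(eth + 1);
    if ((void *)(iph + 1) > data_end)
        return TC_ACT_OK;
    if (iph->daddr != SERVICE_IP || iph->protocol != IPPROTO_TCP)
        return TC_ACT_OK;

    __u32 old_ip = iph->daddr;
    __u32 new_ip = BACKEND_IP;
    int daddr_off   = ETH_HLEN + offsetof(struct iphdr, daddr);
    int l3_csum_off = ETH_HLEN + offsetof(struct iphdr, check);
    int l4_csum_off = ETH_HLEN + iph->ihl * 4 + offsetof(struct tcphdr, check);

    /* Rewrite the destination address and patch the TCP and IP checksums. */
    bpf_l4_csum_replace(skb, l4_csum_off, old_ip, new_ip,
                        BPF_F_PSEUDO_HDR | sizeof(new_ip));
    bpf_l3_csum_replace(skb, l3_csum_off, old_ip, new_ip, sizeof(new_ip));
    bpf_skb_store_bytes(skb, daddr_off, &new_ip, sizeof(new_ip), 0);

    return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";
```

Return traffic is handled symmetrically: a program on the ENI's tc ingress hook rewrites the backend source address back to the ClusterIP (reverse DNAT) before the packet reaches the application.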
On the control plane, Cilium is deeply integrated with TKE's VPC-CNI mode (both the shared network card mode and the independent network card mode), so users can use Cilium's features without any changes to their business logic.
Performance comparison
This article uses the wrk tool to run a performance stress test against the productized Cilium solution. The test ensures that the client Pod and the server Pod are scheduled on different nodes.
Test environment: a TKE cluster with 4 CVM nodes; the server nodes are S5.2XLARGE8 and the client nodes are S5.SMALL2.
The test data shows that the Cilium-based independent network card ClusterIP access scheme performs best. In the short-connection scenario, its QPS is 48% and 74% higher than the shared network card iptables and IPVS solutions respectively, and 62% and 91% higher than the global-routing iptables and IPVS solutions. In the long-connection scenario, its QPS is 33% and 57% higher than the shared network card iptables and IPVS solutions, and 49% and 66% higher than the global-routing iptables and IPVS solutions. iptables outperforms IPVS here because the test environment does not contain many Services; the advantage of IPVS shows in scenarios with a large number of Services.
Issues found during productization
While productizing Cilium, the TKE team also discovered some issues in the Cilium project. The corresponding fixes, together with Cilium's support for the new data plane, will be organized and submitted to the Cilium community as PRs in the near future.
ClusterIP self-access under the independent network card scheme
In fact, the above solution does not completely solve ClusterIP access: there is a special scenario in which access fails, namely when a Pod accesses a ClusterIP whose backends include the Pod itself. In this scenario, packets sent from the independent network card Pod go straight to the IaaS layer, which is not what we expect.
Since the independent network card Pod actually contains only two network devices, the loopback device (lo) and the elastic network card, a simple idea is to use the bpf_redirect helper to forward self-access traffic directly to the loopback (lo) device before the packet is sent out. Based on this idea, the TKE team modified the relevant bpf code in Cilium and provided a fix; testing shows that it solves the ClusterIP self-access problem under the independent network card solution (a sketch follows).
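As an illustration of the idea (and not the actual patch submitted by the TKE team), a check along the following lines could sit on the ENI's egress path after service translation: if the translated destination is the Pod's own IP, the packet is redirected to lo's ingress queue instead of leaving through the elastic network card toward the IaaS layer. POD_IP, the loopback ifindex of 1 and the function name are illustrative assumptions.

```c
// Illustrative sketch only: keep self-access traffic inside the Pod by
// redirecting it to the loopback device instead of the elastic network card.
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#define POD_IP     bpf_htonl(0x0A000042) /* 10.0.0.66: this Pod's own IP (assumed) */
#define LO_IFINDEX 1                     /* lo is normally ifindex 1 in a netns    */

SEC("classifier")
int eni_egress_self_access(struct __sk_buff *skb)
{
    void *data     = (void *)(long)skb->data;
    void *data_end = (void *)(long)skb->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
        return TC_ACT_OK;

    struct iphdr *iph = (void *)(eth + 1);
    if ((void *)(iph + 1) > data_end)
        return TC_ACT_OK;

    /* After DNAT, traffic destined to the Pod itself must not leave the
     * ENI; hand it to the loopback device's ingress path instead. */
    if (iph->daddr == POD_IP)
        return bpf_redirect(LO_IFINDEX, BPF_F_INGRESS);

    return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";
```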
The bpf programs loaded by Cilium are missing names
The Cilium project has a debuggability problem: its bpf programs were developed relatively early and are loaded under the hood by older tools such as tc.
The old tc was designed against older kernels (< 4.15) and ignored the bpf program name when loading, so the bpf programs loaded by Cilium end up unnamed. This hurts code comprehension, tracing and debugging.
To address this, the TKE team modified the tc tool against the newer kernel so that it correctly passes the program name when loading a bpf program. With this name, you can tell which bpf function is actually running, which improves Cilium's debuggability, as sketched below.
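The kernel-side mechanism behind this is that, since version 4.15, the BPF_PROG_LOAD command accepts a program name that later shows up in introspection tools such as bpftool; older tc versions simply left that field empty. The sketch below is an illustration of this mechanism, not the TKE patch itself; it assumes libbpf v0.7+ and root privileges, and the name "tc_example" is made up.

```c
// Illustrative sketch: load a trivial classifier program and give it a
// name, which kernels >= 4.15 record and expose to debugging tools.
#include <stdio.h>
#include <linux/bpf.h>
#include <bpf/bpf.h>

int main(void)
{
    /* The smallest valid program: "r0 = 0; exit" (i.e. return TC_ACT_OK). */
    struct bpf_insn insns[] = {
        { .code = BPF_ALU64 | BPF_MOV | BPF_K, .dst_reg = BPF_REG_0, .imm = 0 },
        { .code = BPF_JMP | BPF_EXIT },
    };

    /* The second argument is the program name the kernel will remember. */
    int fd = bpf_prog_load(BPF_PROG_TYPE_SCHED_CLS, "tc_example", "GPL",
                           insns, sizeof(insns) / sizeof(insns[0]), NULL);
    if (fd < 0) {
        perror("bpf_prog_load");
        return 1;
    }
    printf("loaded program 'tc_example' with fd %d\n", fd);
    return 0;
}
```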
Usage
Enable the corresponding option in the advanced settings when creating a TKE cluster, and ClusterIP Services can then be accessed from independent network card Pods.
Summary and outlook
This article introduced the high-performance ClusterIP Service solution implemented by the TKE team in the independent network card mode based on Cilium and eBPF. Compared with the current traditional network solutions based on iptables and IPVS, this solution improves performance significantly (by 33% to 91%).
Of course, Cilium's capabilities do not stop there. Built on the revolutionary eBPF technology, it also offers capabilities in security, observability and QoS, and providing a higher-performance, more secure and easier-to-use container network is exactly TKE's goal. Therefore, TKE will actively participate in the Cilium community and work with the community to deliver stronger and more complete container network capabilities.
Reference
- Cilium project official website
- eBPF Introduction and Reference Guide
- Tencent Cloud Container Service TKE launches a new generation of zero-loss container network
- Kubernetes Service