Preface
In Tencent Cloud TKE-Cilium-based Unified Hybrid Cloud Container Network (Part 1), we introduced the TKE hybrid cloud cross-plane network interconnection solution and the TKE hybrid cloud Overlay network solution. To support adding third-party IDC nodes to public cloud TKE clusters, and to meet the needs of customers in different usage scenarios (especially a low tolerance for network performance loss), the TKE hybrid cloud network solution also provides an Underlay network solution based on BGP direct routing. This network model is implemented with GoBGP and, building on Cilium, opens up Node-to-Pod and Pod-to-Pod connectivity, which ensures high network performance and supports large-scale cluster expansion.
Before its launch on the TKE public cloud, this network solution had already been validated at scale in privatized environments of the Tencent Cloud proprietary cloud agile PaaS platform, and was integrated and open sourced in Tencent Cloud TKEStack. This article describes in detail the design and implementation of the TKE hybrid cloud Underlay container network solution based on BGP direct routing.
Background
The diversity of customer requirements, especially a low tolerance for network performance loss, makes an Underlay network solution essential. Why choose BGP? Compared with interior gateway protocols such as OSPF and RIP, BGP focuses on controlling route propagation and selecting the best path. BGP's greatest advantage is its strong scalability, which meets the requirements of large-scale horizontal cluster expansion. It is also simple and stable, and the industry already has successful cases of deploying BGP-based designs in production.
Depending on the size of the cluster, different BGP routing topologies apply. When the cluster is small, the Full Mesh interconnection mode can be used: all BGP speakers in the same AS are fully connected, and all externally learned routing information must be redistributed to the other routers in the same AS. As the cluster grows, the efficiency of the Full Mesh mode drops sharply, since n speakers require n(n-1)/2 sessions. Route Reflection (RR) is the mature alternative: a designated BGP speaker, the Route Reflector, is allowed to re-advertise routes learned from one iBGP peer to other iBGP peers, which greatly reduces the number of BGP peer connections.
Compared with existing solutions, Tencent hybrid cloud uses GoBGP to implement Cilium's Underlay datapath. The solution implements its own BGP Agent on top of the clean programming interface provided by GoBGP, which gives it good extensibility (a minimal sketch of such an agent follows the feature list below). Its characteristics are as follows:
- Support the expansion of large-scale clusters
- Support BGP neighbor discovery
- Support network visualization
- Support VIP and PodCIDR routing announcement
- Support advanced forwarding features such as ECMP
- Implement Cilium's native-routing mode
- Support Layer 3 network communication
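
To make the design concrete, here is a minimal sketch of what such a GoBGP-based agent might look like on a node. It assumes the GoBGP v3 Go API; the ASN, router ID, and PodCIDR values are placeholders for illustration, not the actual TKE configuration.

```go
package main

import (
	"context"
	"log"

	api "github.com/osrg/gobgp/v3/api"
	"github.com/osrg/gobgp/v3/pkg/server"
	apb "google.golang.org/protobuf/types/known/anypb"
)

func main() {
	ctx := context.Background()

	// Run an embedded BGP speaker on the node.
	s := server.NewBgpServer()
	go s.Serve()

	// Start BGP with the AS shared by all nodes under the same
	// access layer switch (values are illustrative).
	if err := s.StartBgp(ctx, &api.StartBgpRequest{
		Global: &api.Global{
			Asn:      64512,
			RouterId: "10.2.0.2",
		},
	}); err != nil {
		log.Fatalf("start bgp: %v", err)
	}

	// Announce this node's PodCIDR with the node itself as next hop,
	// so peers route Pod traffic directly to this node.
	nlri, _ := apb.New(&api.IPAddressPrefix{Prefix: "10.2.0.64", PrefixLen: 26})
	origin, _ := apb.New(&api.OriginAttribute{Origin: 0}) // IGP
	nextHop, _ := apb.New(&api.NextHopAttribute{NextHop: "10.2.0.2"})

	if _, err := s.AddPath(ctx, &api.AddPathRequest{
		Path: &api.Path{
			Family: &api.Family{Afi: api.Family_AFI_IP, Safi: api.Family_SAFI_UNICAST},
			Nlri:   nlri,
			Pattrs: []*apb.Any{origin, nextHop},
		},
	}); err != nil {
		log.Fatalf("announce podcidr: %v", err)
	}

	select {} // keep the agent running
}
```

The later snippets in this article extend this sketch, reusing `s` and `ctx`.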
Tencent Hybrid Cloud Underlay Container Network Solution
Without changing the internal network topology of the IDC data center, the access layer switches establish BGP connections with the core layer switch using the data center's existing routing policy. PodCIDRs are allocated according to the physical location of each node, and every node announces its PodCIDR to its access layer switch over BGP, making the entire network routable.
- Each access layer switch and the nodes in its Layer 2 domain form one AS. A BGP service runs on each node and announces the node's routing information.
- The core layer switch and each access layer switch occupy separate ASes, are physically directly connected, and run BGP between them. The core layer switch learns the routing information of the entire network, while each access layer switch learns the routes of the nodes directly attached to it.
- Each node has only one default route, pointing to its access layer switch. For nodes under the same access layer switch, the next hop for node-to-node communication points directly to the peer node.
Neighbor discovery
In a BGP-based cluster network, nodes are frequently added and removed. With statically configured peers, every such change requires operating on the switch to add or delete peers, which creates heavy maintenance work and hinders horizontal cluster expansion. To avoid manually operating the switch, we support two modes of dynamic BGP neighbor discovery: configuring Dynamic Neighbors on the access layer switch, or using a route reflector implemented in software.
Dynamic neighbor discovery through the access layer switch
The access layer switch acts as the border router and enables the Dynamic Neighbors function; for H3C, Cisco, and Huawei devices, refer to the vendor's official documentation for how to enable it. The BGP service on each node actively establishes an iBGP connection to the access layer switch and announces its local routes, and the switch then advertises the learned routes to the entire data center. A sketch of the node side follows.
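
On the node side, the agent only needs to add the access layer switch as a peer; the switch's Dynamic Neighbors listen range accepts the incoming session. Continuing the earlier sketch against the GoBGP v3 API (the address and ASN are illustrative):

```go
// Peer with the access layer switch over iBGP (same AS as the node).
// 10.2.0.1 stands in for the switch's address; adjust to the real topology.
peer := &api.Peer{
	Conf: &api.PeerConf{
		NeighborAddress: "10.2.0.1",
		PeerAsn:         64512, // iBGP: same AS as the local speaker
	},
}
if err := s.AddPeer(ctx, &api.AddPeerRequest{Peer: peer}); err != nil {
	log.Fatalf("add peer: %v", err)
}
```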
Dynamic neighbor discovery through a route reflector
A physical switch or a node acts as the route reflector (RR). The RR establishes an iBGP connection with the access layer switch, and the BGP service on each node connects to the RR. A node announces its local routes to the RR, the RR reflects them to the access layer switch, and the access layer switch then advertises them to the entire data center.
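
When the RR runs in software, GoBGP's dynamic neighbor support removes the need to enumerate every node. A minimal sketch of the RR side, assuming the GoBGP v3 API, with an illustrative peer group name, cluster ID, and node prefix:

```go
// Define a peer group for cluster nodes and mark its members as
// route reflector clients so their routes are reflected onward.
if err := s.AddPeerGroup(ctx, &api.AddPeerGroupRequest{
	PeerGroup: &api.PeerGroup{
		Conf: &api.PeerGroupConf{
			PeerGroupName: "k8s-nodes",
			PeerAsn:       64512,
		},
		RouteReflector: &api.RouteReflector{
			RouteReflectorClient:    true,
			RouteReflectorClusterId: "10.2.0.1",
		},
	},
}); err != nil {
	log.Fatalf("add peer group: %v", err)
}

// Accept sessions from any node in the cluster range without
// configuring each neighbor explicitly.
if err := s.AddDynamicNeighbor(ctx, &api.AddDynamicNeighborRequest{
	DynamicNeighbor: &api.DynamicNeighbor{
		Prefix:    "10.2.0.0/24",
		PeerGroup: "k8s-nodes",
	},
}); err != nil {
	log.Fatalf("add dynamic neighbor: %v", err)
}
```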
Next hop
Each node runs the BGP service and announces its PodCIDR to the access layer switch, so each access layer switch perceives the PodCIDRs on all of its directly connected nodes. Nodes under the same access layer switch learn routes from each other and install them locally, and traffic is forwarded at Layer 2 through the access layer switch. For communication across access layer switches, the next hop points to the access layer switch; for communication under the same access layer switch, the next hop points to the peer node. The following figure shows route learning for nodes under the same access layer switch and across access layer switches; the next hop address can be read directly from the routing table.
- Communication under the same access layer switch: node 10.2.0.2 and node 10.2.0.3 are under the same access layer switch and are reachable at Layer 2, so packets are sent directly to the peer without Layer 3 forwarding.
- Communication across access layer switches: node 10.2.0.2 and node 10.3.0.3 are under different access layer switches, so packets must be routed through the access layer switches and the core switch to reach the peer. A sketch of how learned routes land on the node follows this list.
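
For Cilium's native-routing datapath to work, routes learned over BGP must end up in the node's kernel routing table with the right next hop. The following is a hypothetical sketch of that step using the vishvananda/netlink library; the prefixes and next hops mirror the example above, and the real agent's logic is more involved:

```go
package main

import (
	"log"
	"net"

	"github.com/vishvananda/netlink"
)

// installRoute programs a learned PodCIDR into the kernel routing table.
// For a node under the same access layer switch, nextHop is the peer node
// itself; for a node behind another access layer switch, traffic simply
// follows the default route toward the access layer switch.
func installRoute(podCIDR, nextHop string) error {
	_, dst, err := net.ParseCIDR(podCIDR)
	if err != nil {
		return err
	}
	return netlink.RouteReplace(&netlink.Route{
		Dst: dst,
		Gw:  net.ParseIP(nextHop),
	})
}

func main() {
	// Illustrative: PodCIDR of node 10.2.0.3, learned over BGP under
	// the same access layer switch, with the peer node as next hop.
	if err := installRoute("10.2.0.128/26", "10.2.0.3"); err != nil {
		log.Fatalf("install route: %v", err)
	}
}
```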
BMP monitoring
We developed a BMP Server based on the BGP Monitoring Protocol (BMP) to monitor the running status of BGP sessions in real time, including the establishment and teardown of peer relationships and the routing information exchanged. The collected BMP messages make it possible to locate faults directly.
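
GoBGP can stream session state to an external BMP collector. Assuming GoBGP v3's BMP API and an illustrative collector address, the node-side hookup might look like this, continuing the earlier sketch:

```go
// Point the embedded speaker at a BMP collector so peer state changes
// and route updates are mirrored there for monitoring.
if err := s.AddBmp(ctx, &api.AddBmpRequest{
	Address: "10.0.0.100", // BMP Server address (illustrative)
	Port:    11019,        // IANA-registered BMP port
	Policy:  api.AddBmpRequest_PRE, // pre-policy route monitoring
}); err != nil {
	log.Fatalf("add bmp: %v", err)
}
```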
Graceful restart
BGP is a routing protocol that runs over TCP. When the TCP connection is abnormally disconnected, a switch with Graceful Restart enabled does not delete its RIB and FIB: it keeps forwarding packets according to the existing forwarding entries and starts a route aging timer for the RIB. Graceful Restart takes effect only when both ends of a BGP peering enable it. It effectively prevents BGP route flapping and improves the availability of the underlay network.
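
On the GoBGP side this is a per-peer setting. Below is a variant of the peer definition shown earlier, assuming the v3 API; the restart time is an illustrative value:

```go
// Enable Graceful Restart on the session to the access layer switch.
// Both ends must enable it for the feature to take effect.
peer := &api.Peer{
	Conf: &api.PeerConf{
		NeighborAddress: "10.2.0.1",
		PeerAsn:         64512,
	},
	GracefulRestart: &api.GracefulRestart{
		Enabled:     true,
		RestartTime: 120, // seconds to keep forwarding state (illustrative)
	},
}
if err := s.AddPeer(ctx, &api.AddPeerRequest{Peer: peer}); err != nil {
	log.Fatalf("add peer with graceful restart: %v", err)
}
```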
Custom IPAM
In the community solution, the `allocate-node-cidrs` and `configure-cloud-routes` options of kube-controller-manager are used to allocate a PodCIDR to each node and to configure routes. However, this solution limits each node to a single PodCIDR that cannot be dynamically expanded. Such a one-PodCIDR-per-node policy is too rigid and leads to low IP utilization: if the segment is small, busy nodes run out of addresses, while if it is large, idle nodes waste them.
In the hybrid cloud scenario, we found that customers have higher requirements for IPAM:
- The PodCIDR of a node should support multiple segments
- The PodCIDR of a node should support dynamic expansion and on-demand reclamation
To solve this problem, we implemented a custom IPAM function in our own tke-ipamd component. The principle is shown in the figure below:
- kube-controller-manager no longer allocates PodCIDRs to nodes; instead, the tke-ipamd component allocates PodCIDRs to nodes in a unified way
- Cilium Agent reads the PodCIDRs allocated by tke-ipamd through the CiliumNode object, and responds to CNI requests to assign IPs to Pods
- tke-ipamd monitors node IP resource usage through the list-watch mechanism, and dynamically expands a node's PodCIDR when the node's IP utilization is too high

A hypothetical sketch of the expansion step follows.
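As a purely illustrative sketch of the mechanism (tke-ipamd itself lives in TKEStack and is not reproduced here), the code below appends a new segment to a node's `spec.ipam.podCIDRs` on its CiliumNode object using the Kubernetes dynamic client. The CRD group/version `cilium.io/v2` and the field path are Cilium's; the function name and values are made up for illustration:

```go
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

var ciliumNodeGVR = schema.GroupVersionResource{
	Group:    "cilium.io",
	Version:  "v2",
	Resource: "ciliumnodes",
}

// expandPodCIDR appends an extra PodCIDR segment to a node's CiliumNode
// object; Cilium Agent watches the object and starts allocating Pod IPs
// from the new segment. (Error handling for missing fields is omitted.)
func expandPodCIDR(ctx context.Context, client dynamic.Interface, node, cidr string) error {
	cn, err := client.Resource(ciliumNodeGVR).Get(ctx, node, metav1.GetOptions{})
	if err != nil {
		return err
	}
	spec := cn.Object["spec"].(map[string]interface{})
	ipam := spec["ipam"].(map[string]interface{})
	ipam["podCIDRs"] = append(ipam["podCIDRs"].([]interface{}), cidr)

	_, err = client.Resource(ciliumNodeGVR).Update(ctx, cn, metav1.UpdateOptions{})
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	// Give node-1 (illustrative) a second PodCIDR segment.
	if err := expandPodCIDR(context.Background(), client, "node-1", "10.2.1.0/26"); err != nil {
		log.Fatal(err)
	}
}
```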
Performance Testing
To better understand the performance of the TKE hybrid cloud Underlay container network, we ran performance tests with the netperf tool. The results show that the Underlay network has essentially no performance loss in network throughput or bandwidth.
Summary and outlook
Following the introduction of the hybrid cloud scenario and TKE's Cilium-based cross-plane network interconnection and Overlay network solutions, this article has focused on the Underlay network solution based on BGP direct routing. The TKE hybrid cloud Underlay container network solution takes advantage of BGP's scalability to meet the requirements of large-scale horizontal cluster expansion, while keeping performance essentially lossless relative to the node network and providing customers with higher data plane forwarding performance. Before its launch on the TKE public cloud, this solution had already been validated at scale in privatized environments of the Tencent Cloud proprietary cloud agile PaaS platform TCNS, and was integrated and open sourced in Tencent Cloud TKEStack.
The combination of hybrid cloud and containers is attracting more and more enterprise customers. In scenarios such as resource scaling, multi-site disaster recovery, and distributed business deployment, it improves the utilization of existing enterprise computing resources and brings significant benefits to customers. The Tencent Cloud container team bridges the differences between public cloud and IDC environments, provides customers with a unified management view, and unifies multi-cloud, IDC, and edge scenarios. Beyond unified single-cluster capabilities, the team also offers unified solutions for cluster registration, multi-cluster management, and cross-cloud, cross-cluster mutual access. You are welcome to follow along and try them out.