Preface
In Tencent Cloud TKE-Cilium-based Unified Hybrid Cloud Container Network (Part 1), we introduced the TKE hybrid cloud cross-plane network interconnection solution and the TKE hybrid cloud Overlay network solution. To support adding third-party IDC nodes to public cloud TKE clusters, and to meet the needs of customers in different usage scenarios (especially a low tolerance for network performance loss), the TKE hybrid cloud network solution also provides an Underlay network solution based on BGP direct routing. This network model is implemented with GoBGP and, building on Cilium, opens up Node-to-Pod and Pod-to-Pod connectivity, which ensures high network performance and supports large-scale cluster expansion.
Before its launch on the TKE public cloud, this network solution had already been validated at scale in privatized environments of the Tencent Cloud proprietary cloud agile PaaS platform, and was integrated and open sourced in Tencent Cloud TKEStack. This article describes in detail the design and implementation of the TKE hybrid cloud Underlay container network solution based on BGP direct routing.
Background
The diversity of customer requirements, especially a low tolerance for network performance loss, makes an Underlay network solution essential. Why choose BGP? Compared with interior gateway protocols such as OSPF and RIP, BGP focuses on controlling route propagation and selecting the best path. BGP's greatest advantage is its strong scalability, which meets the requirements of large-scale horizontal cluster expansion. It is also simple and stable, and the industry already has successful cases of deploying BGP-based designs in production.
Depending on the size of the cluster, different BGP routing topologies apply. When the cluster is small, the Full Mesh interconnection mode can be used: all BGP speakers in the same AS are fully connected, and all externally learned routing information must be redistributed to the other routers in the same AS. As the cluster grows, the efficiency of the Full Mesh mode drops sharply, since n speakers require n(n-1)/2 sessions. Route Reflection (RR) is the mature alternative: a designated BGP speaker, the Route Reflector, is allowed to re-advertise routes learned from one iBGP peer to other iBGP peers, which greatly reduces the number of BGP peer connections.
Compared with existing solutions, Tencent hybrid cloud uses GoBGP to implement Cilium's Underlay datapath. The solution implements its own BGP Agent on top of the clean programming interface provided by GoBGP, which gives it good extensibility (a minimal sketch of such an agent follows the feature list below). Its characteristics are as follows:
- Support the expansion of large-scale clusters
- Support BGP neighbor discovery
- Support network visualization
- Support VIP and PodCIDR routing announcement
- Support advanced forwarding features such as ECMP
- Implement Cilium's native-routing mode
- Support Layer 3 network communication
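
To make the design concrete, here is a minimal sketch of what such a GoBGP-based agent might look like on a node. It assumes the GoBGP v3 Go API; the ASN, router ID, and PodCIDR values are placeholders for illustration, not the actual TKE configuration.

```go
package main

import (
	"context"
	"log"

	api "github.com/osrg/gobgp/v3/api"
	"github.com/osrg/gobgp/v3/pkg/server"
	apb "google.golang.org/protobuf/types/known/anypb"
)

func main() {
	ctx := context.Background()

	// Run an embedded BGP speaker on the node.
	s := server.NewBgpServer()
	go s.Serve()

	// Start BGP with the AS shared by all nodes under the same
	// access layer switch (values are illustrative).
	if err := s.StartBgp(ctx, &api.StartBgpRequest{
		Global: &api.Global{
			Asn:      64512,
			RouterId: "10.2.0.2",
		},
	}); err != nil {
		log.Fatalf("start bgp: %v", err)
	}

	// Announce this node's PodCIDR with the node itself as next hop,
	// so peers route Pod traffic directly to this node.
	nlri, _ := apb.New(&api.IPAddressPrefix{Prefix: "10.2.0.64", PrefixLen: 26})
	origin, _ := apb.New(&api.OriginAttribute{Origin: 0}) // IGP
	nextHop, _ := apb.New(&api.NextHopAttribute{NextHop: "10.2.0.2"})

	if _, err := s.AddPath(ctx, &api.AddPathRequest{
		Path: &api.Path{
			Family: &api.Family{Afi: api.Family_AFI_IP, Safi: api.Family_SAFI_UNICAST},
			Nlri:   nlri,
			Pattrs: []*apb.Any{origin, nextHop},
		},
	}); err != nil {
		log.Fatalf("announce podcidr: %v", err)
	}

	select {} // keep the agent running
}
```

The later snippets in this article extend this sketch, reusing `s` and `ctx`.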
Tencent Hybrid Cloud Underlay Container Network Solution
Without changing the internal network topology of the IDC data center, the access layer switches establish BGP connections with the core layer switch using the data center's existing routing policy. PodCIDRs are allocated according to the physical location of each node, and every node announces its PodCIDR to its access layer switch over BGP, making the entire network routable.
- Each access layer switch and the nodes in its Layer 2 domain form one AS. A BGP service runs on each node and announces the node's routing information.
- The core layer switch and each access layer switch occupy separate ASes, are physically directly connected, and run BGP between them. The core layer switch learns the routing information of the entire network, while each access layer switch learns the routes of the nodes directly attached to it.
- Each node has only one default route, pointing to its access layer switch. For nodes under the same access layer switch, the next hop for node-to-node communication points directly to the peer node.
Neighbor discovery
In a BGP-based cluster network, nodes are frequently added and removed. With statically configured peers, every such change requires operating on the switch to add or delete peers, which creates heavy maintenance work and hinders horizontal cluster expansion. To avoid manually operating the switch, we support two modes of dynamic BGP neighbor discovery: configuring Dynamic Neighbors on the access layer switch, or using a route reflector implemented in software.
Dynamic neighbor discovery through the access layer switch
The access layer switch acts as the border router and enables the Dynamic Neighbors function; for H3C, Cisco, and Huawei devices, refer to the vendor's official documentation for how to enable it. The BGP service on each node actively establishes an iBGP connection to the access layer switch and announces its local routes, and the switch then advertises the learned routes to the entire data center. A sketch of the node side follows.
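
On the node side, the agent only needs to add the access layer switch as a peer; the switch's Dynamic Neighbors listen range accepts the incoming session. Continuing the earlier sketch against the GoBGP v3 API (the address and ASN are illustrative):

```go
// Peer with the access layer switch over iBGP (same AS as the node).
// 10.2.0.1 stands in for the switch's address; adjust to the real topology.
peer := &api.Peer{
	Conf: &api.PeerConf{
		NeighborAddress: "10.2.0.1",
		PeerAsn:         64512, // iBGP: same AS as the local speaker
	},
}
if err := s.AddPeer(ctx, &api.AddPeerRequest{Peer: peer}); err != nil {
	log.Fatalf("add peer: %v", err)
}
```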
Dynamic neighbor discovery through a route reflector
A physical switch or a node acts as the route reflector (RR). The RR establishes an iBGP connection with the access layer switch, and the BGP service on each node connects to the RR. A node announces its local routes to the RR, the RR reflects them to the access layer switch, and the access layer switch then advertises them to the entire data center.
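
When the RR runs in software, GoBGP's dynamic neighbor support removes the need to enumerate every node. A minimal sketch of the RR side, assuming the GoBGP v3 API, with an illustrative peer group name, cluster ID, and node prefix:

```go
// Define a peer group for cluster nodes and mark its members as
// route reflector clients so their routes are reflected onward.
if err := s.AddPeerGroup(ctx, &api.AddPeerGroupRequest{
	PeerGroup: &api.PeerGroup{
		Conf: &api.PeerGroupConf{
			PeerGroupName: "k8s-nodes",
			PeerAsn:       64512,
		},
		RouteReflector: &api.RouteReflector{
			RouteReflectorClient:    true,
			RouteReflectorClusterId: "10.2.0.1",
		},
	},
}); err != nil {
	log.Fatalf("add peer group: %v", err)
}

// Accept sessions from any node in the cluster range without
// configuring each neighbor explicitly.
if err := s.AddDynamicNeighbor(ctx, &api.AddDynamicNeighborRequest{
	DynamicNeighbor: &api.DynamicNeighbor{
		Prefix:    "10.2.0.0/24",
		PeerGroup: "k8s-nodes",
	},
}); err != nil {
	log.Fatalf("add dynamic neighbor: %v", err)
}
```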
Next hop
Each node runs the BGP service and announces its PodCIDR to the access layer switch, so each access layer switch perceives the PodCIDRs on all of its directly connected nodes. Nodes under the same access layer switch learn routes from each other and install them locally, and traffic is forwarded at Layer 2 through the access layer switch. For communication across access layer switches, the next hop points to the access layer switch; for communication under the same access layer switch, the next hop points to the peer node. The following figure shows route learning for nodes under the same access layer switch and across access layer switches; the next hop address can be read directly from the routing table.
- Communication under the same access layer switch: node 10.2.0.2 and node 10.2.0.3 are under the same access layer switch and are reachable at Layer 2, so packets are sent directly to the peer without Layer 3 forwarding.
- Communication across access layer switches: node 10.2.0.2 and node 10.3.0.3 are under different access layer switches, so packets must be routed through the access layer switches and the core switch to reach the peer. A sketch of how learned routes land on the node follows this list.
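
For Cilium's native-routing datapath to work, routes learned over BGP must end up in the node's kernel routing table with the right next hop. The following is a hypothetical sketch of that step using the vishvananda/netlink library; the prefixes and next hops mirror the example above, and the real agent's logic is more involved:

```go
package main

import (
	"log"
	"net"

	"github.com/vishvananda/netlink"
)

// installRoute programs a learned PodCIDR into the kernel routing table.
// For a node under the same access layer switch, nextHop is the peer node
// itself; for a node behind another access layer switch, traffic simply
// follows the default route toward the access layer switch.
func installRoute(podCIDR, nextHop string) error {
	_, dst, err := net.ParseCIDR(podCIDR)
	if err != nil {
		return err
	}
	return netlink.RouteReplace(&netlink.Route{
		Dst: dst,
		Gw:  net.ParseIP(nextHop),
	})
}

func main() {
	// Illustrative: PodCIDR of node 10.2.0.3, learned over BGP under
	// the same access layer switch, with the peer node as next hop.
	if err := installRoute("10.2.0.128/26", "10.2.0.3"); err != nil {
		log.Fatalf("install route: %v", err)
	}
}
```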
BMP monitoring
We developed a BMP Server based on the BGP Monitoring Protocol (BMP) to monitor the running status of BGP sessions in real time, including the establishment and teardown of peer relationships and the routing information exchanged. The collected BMP messages make it possible to locate faults directly.
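
GoBGP can stream session state to an external BMP collector. Assuming GoBGP v3's BMP API and an illustrative collector address, the node-side hookup might look like this, continuing the earlier sketch:

```go
// Point the embedded speaker at a BMP collector so peer state changes
// and route updates are mirrored there for monitoring.
if err := s.AddBmp(ctx, &api.AddBmpRequest{
	Address: "10.0.0.100", // BMP Server address (illustrative)
	Port:    11019,        // IANA-registered BMP port
	Policy:  api.AddBmpRequest_PRE, // pre-policy route monitoring
}); err != nil {
	log.Fatalf("add bmp: %v", err)
}
```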
Graceful restart
BGP is a routing protocol that runs over TCP. When the TCP connection is abnormally disconnected, a switch with Graceful Restart enabled does not delete its RIB and FIB: it keeps forwarding packets according to the existing forwarding entries and starts a route aging timer for the RIB. Graceful Restart takes effect only when both ends of a BGP peering enable it. It effectively prevents BGP route flapping and improves the availability of the underlay network.
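
On the GoBGP side this is a per-peer setting. Below is a variant of the peer definition shown earlier, assuming the v3 API; the restart time is an illustrative value:

```go
// Enable Graceful Restart on the session to the access layer switch.
// Both ends must enable it for the feature to take effect.
peer := &api.Peer{
	Conf: &api.PeerConf{
		NeighborAddress: "10.2.0.1",
		PeerAsn:         64512,
	},
	GracefulRestart: &api.GracefulRestart{
		Enabled:     true,
		RestartTime: 120, // seconds to keep forwarding state (illustrative)
	},
}
if err := s.AddPeer(ctx, &api.AddPeerRequest{Peer: peer}); err != nil {
	log.Fatalf("add peer with graceful restart: %v", err)
}
```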
Custom IPAM
In the community solution, the `allocate-node-cidrs` and `configure-cloud-routes` options of kube-controller-manager are used to allocate a PodCIDR to each node and to configure routes. However, this solution limits each node to a single PodCIDR that cannot be dynamically expanded. Such a one-PodCIDR-per-node policy is too rigid and leads to low IP utilization: if the segment is small, busy nodes run out of addresses, while if it is large, idle nodes waste them.
In the hybrid cloud scenario, we found that customers have higher requirements for IPAM:
- The PodCIDR of a node should support multiple segments
- The PodCIDR of a node should support dynamic expansion and on-demand reclamation
To solve this problem, we implemented a custom IPAM function in our own tke-ipamd component. The principle is shown in the figure below:
- kube-controller-manager no longer allocates PodCIDRs to nodes; instead, the tke-ipamd component allocates PodCIDRs to nodes in a unified way
- Cilium Agent reads the PodCIDRs allocated by tke-ipamd through the CiliumNode object, and responds to CNI requests to assign IPs to Pods
- tke-ipamd monitors node IP resource usage through the list-watch mechanism, and dynamically expands a node's PodCIDR when the node's IP utilization is too high

A hypothetical sketch of the expansion step follows.
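As a purely illustrative sketch of the mechanism (tke-ipamd itself lives in TKEStack and is not reproduced here), the code below appends a new segment to a node's `spec.ipam.podCIDRs` on its CiliumNode object using the Kubernetes dynamic client. The CRD group/version `cilium.io/v2` and the field path are Cilium's; the function name and values are made up for illustration:

```go
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

var ciliumNodeGVR = schema.GroupVersionResource{
	Group:    "cilium.io",
	Version:  "v2",
	Resource: "ciliumnodes",
}

// expandPodCIDR appends an extra PodCIDR segment to a node's CiliumNode
// object; Cilium Agent watches the object and starts allocating Pod IPs
// from the new segment. (Error handling for missing fields is omitted.)
func expandPodCIDR(ctx context.Context, client dynamic.Interface, node, cidr string) error {
	cn, err := client.Resource(ciliumNodeGVR).Get(ctx, node, metav1.GetOptions{})
	if err != nil {
		return err
	}
	spec := cn.Object["spec"].(map[string]interface{})
	ipam := spec["ipam"].(map[string]interface{})
	ipam["podCIDRs"] = append(ipam["podCIDRs"].([]interface{}), cidr)

	_, err = client.Resource(ciliumNodeGVR).Update(ctx, cn, metav1.UpdateOptions{})
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	// Give node-1 (illustrative) a second PodCIDR segment.
	if err := expandPodCIDR(context.Background(), client, "node-1", "10.2.1.0/26"); err != nil {
		log.Fatal(err)
	}
}
```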
Performance Testing
To better understand the performance of the TKE hybrid cloud Underlay container network, we ran performance tests with the netperf tool. The results show that the Underlay network has essentially no performance loss in network throughput or bandwidth.
Summary and outlook
Following the introduction of the hybrid cloud scenario and TKE's Cilium-based cross-plane network interconnection and Overlay network solutions, this article has focused on the Underlay network solution based on BGP direct routing. The TKE hybrid cloud Underlay container network solution takes advantage of BGP's scalability to meet the requirements of large-scale horizontal cluster expansion, while keeping performance essentially lossless relative to the node network and providing customers with higher data plane forwarding performance. Before its launch on the TKE public cloud, this solution had already been validated at scale in privatized environments of the Tencent Cloud proprietary cloud agile PaaS platform TCNS, and was integrated and open sourced in Tencent Cloud TKEStack.
The combination of hybrid cloud and containers is attracting more and more enterprise customers. In scenarios such as resource scaling, multi-site disaster recovery, and distributed business deployment, it improves the utilization of existing enterprise computing resources and brings significant benefits to customers. The Tencent Cloud container team bridges the differences between public cloud and IDC environments, provides customers with a unified management view, and unifies multi-cloud, IDC, and edge scenarios. Beyond unified single-cluster capabilities, the team also offers unified solutions for cluster registration, multi-cluster management, and cross-cloud, cross-cluster mutual access. You are welcome to follow along and try them out.