
Kubernetes Network, from Shallow to Deep

These are mainly my study notes from the in-depth Kubernetes analysis column course.

One, single-host container network

Terminology

  • Network stack: The network stack includes the network card ( Network Interface ), loopback device ( Loopback Device ), routing table ( Routing Table ) and Iptables rules. For a process, these elements constitute the basic environment for it to initiate and respond to network requests.
  • Bridge ( Bridge ): a bridge is a virtual network device, so it has the characteristics of a network device and can be configured with IP and MAC addresses. A bridge is a virtual switch with functions similar to a physical switch; it works at the data link layer.
  • Veth Pair : a virtual network cable used to connect a container to the bridge. Veth devices are always created in pairs (the other end is called the Veth Peer ), and a packet sent from one of the two network cards automatically appears on the other one, even if the two cards are in different Network Namespace s. See the sketch after this list.
  • ARP : the protocol that resolves an IP address to the corresponding layer-2 MAC address.
  • CAM table: the table in which the virtual switch (here, the bridge) learns and maintains the mapping between MAC addresses and ports.
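To make the Bridge and Veth Pair terms concrete, here is a minimal sketch of wiring them together by hand with iproute2; the names br0 , veth-a , veth-b and the namespace ns1 are made up for illustration and are not what Docker itself creates:

# Create a bridge (a virtual switch) and bring it up
$ ip link add br0 type bridge
$ ip link set br0 up
# Create a veth pair; a frame entering veth-a comes out of veth-b and vice versa
$ ip link add veth-a type veth peer name veth-b
# Move one end into its own network namespace, like a container's eth0
$ ip netns add ns1
$ ip link set veth-b netns ns1
# Plug the other end into the bridge, like Docker plugs a container into docker0
$ ip link set veth-a master br0
$ ip link set veth-a up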

Host network

A container started with --net=host directly uses the host's Network Namespace .

$ docker run -d --net=host --name nginx-1 nginx

The advantage of using the host network is better network performance, because the host's network stack is used directly. The disadvantage is that it introduces shared-network-resource problems such as port conflicts. Therefore, in most cases, we want each container to use the network stack in its own Network Namespace , with its own IP and ports.
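A quick way to see this trade-off in practice is to check that nginx-1 's port is bound directly on the host; this is only a sketch and assumes port 80 is free on the host before the container starts:

# nginx-1 shares the host's network namespace, so nginx listens on the host's port 80 directly
$ curl -sI http://127.0.0.1:80 | head -n 1
HTTP/1.1 200 OK
# Starting a second host-network nginx would fail to bind port 80 -- the port-conflict problem mentioned above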

How to communicate

[Figure: single-node container network communication]

The figure above shows the communication process of a single-node container network. The following describes the interaction in detail, based on the process of C1 accessing C2 .

# First create two containers to simulate the request: start two centos containers and install net-tools in them so that the ifconfig command is available
# Create C1 and install net-tools
$ docker run -d -it --name c1 centos /bin/bash
$ docker exec -it c1 bash
[root@60671509044e /]# yum install -y net-tools
# Create C2 and install net-tools
$ docker run -d -it --name c2 centos /bin/bash
$ docker exec -it c2 bash
[root@94a6c877b01a /]# yum install -y net-tools
  • After containers C1 and C2 are started, each container has a default routing rule: traffic in the current container's network segment goes out through the eth0 network device.

    • C1
# Enter the c1 container and check its IP and routing table
$ docker exec -it c1 bash
# Check the IP
$ ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.17.0.7  netmask 255.255.0.0  broadcast 172.17.255.255
        ether 02:42:ac:11:00:07  txqueuelen 0  (Ethernet)
        RX packets 6698  bytes 9678058 (9.2 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 3518  bytes 195061 (190.4 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
# Check the routing table
$ route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         _gateway        0.0.0.0         UG    0      0        0 eth0
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 eth0
  • C2
# Enter the C2 container and check its IP and routing table
$ docker exec -it c2 bash
$ ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.17.0.8  netmask 255.255.0.0  broadcast 172.17.255.255
        ether 02:42:ac:11:00:08  txqueuelen 0  (Ethernet)
        RX packets 6771  bytes 9681937 (9.2 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 3227  bytes 179347 (175.1 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
# Check the routing table
$ route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         _gateway        0.0.0.0         UG    0      0        0 eth0
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 eth0

Each of the containers above has its own IP and MAC address, and each has a default route via _gateway going out through the eth0 network card; the MAC address corresponding to _gateway already exists in the local ARP cache.

  • Communication on the local network ultimately uses MAC addresses, which is how hosts are identified at the data link layer. When C1 accesses C2 , it first looks in its local ARP cache for the MAC address corresponding to C2 's IP 172.17.0.8 . If it is not there, it sends an ARP request to resolve the MAC address.
# c1 -> c2: an ARP request is sent first to find the MAC address; the MAC for a given IP can be seen in the container's ARP cache
$ docker exec -it c1 bash
# First check the local ARP cache
[root@60671509044e /]# arp
Address                  HWtype  HWaddress           Flags Mask            Iface
_gateway                 ether   02:42:2e:8d:21:d6   C                     eth0
# Running the ping command triggers an ARP resolution request
$ ping 172.17.0.8
# Check the local ARP cache again: the MAC address is now present
[root@60671509044e /]# arp
Address                  HWtype  HWaddress           Flags Mask            Iface
172.17.0.8               ether   02:42:ac:11:00:08   C                     eth0
_gateway                 ether   02:42:2e:8d:21:d6   C                     eth0

ARP resolution process: the C1 container sends an ARP request which, after passing through the container's local routing, reaches the bridge. The bridge ( Bridge ) acts as a virtual switch and floods the ARP request to all the other devices plugged into it; when C2 receives the ARP request, it replies with its MAC address.

  • Once C2 's MAC address is found, communication can begin; the sketch below shows the host-side view of the bridge.
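From the host side, the fact that both containers' veth peers are plugged into the same docker0 bridge can be verified with iproute2; the veth interface names below are illustrative, since Docker generates them randomly:

# List the interfaces attached to the docker0 bridge (run on the host)
$ ip link show master docker0
5: veth1a2b3c@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 ... master docker0 ...
7: veth4d5e6f@if6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 ... master docker0 ...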

Two, cross-host container communication

Cross-host container communication is divided into two network structures, Overlay and Underlay , according to whether it depends on the underlying network environment. An Overlay network only requires that the hosts can reach each other over the network; the hosts do not need to be in the same layer-2 domain. Underlay solutions place requirements on the underlying infrastructure, and those requirements differ by implementation. For example, the Flannel host-gw backend requires the hosts to be in the same layer-2 domain, that is, the hosts must be connected to the same switch.
Terminology
  • Overlay Network (overlay network): On top of the existing host network, a virtual network that covers the host network and connects all containers is built through software.
  • Tun device (Tunnel device): in Linux , a TUN device is a virtual network device that works at layer 3 (the Network Layer ). A TUN device passes IP packets between the operating system kernel and user-space applications.
  • VXLAN : Virtual Extensible LAN ( Virtual Extensible LAN ), a network virtualization technology supported by the Linux kernel. VXLAN performs the encapsulation and decapsulation of network packets entirely in kernel mode (see the sketch after this list).
  • VTEP : Virtual tunnel endpoint device, which has both IP and MAC addresses.
  • BGP : Border Gateway Protocol ( Border Gateway Protocol ), a routing protocol natively supported by the Linux kernel and used especially in large-scale data centers to maintain routing information between different autonomous systems.
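As a concrete illustration of the VXLAN and VTEP terms above, a VTEP can be created by hand with iproute2; the device name vxlan100 , the VNI 42 , the address and the eth0 uplink are arbitrary examples, not what Flannel or Calico would configure:

# Create a VXLAN device (a VTEP): it has its own MAC address, can be given an IP,
# and encapsulates layer-2 frames into UDP packets (default destination port 4789)
$ ip link add vxlan100 type vxlan id 42 dev eth0 dstport 4789
$ ip addr add 10.244.0.1/24 dev vxlan100
$ ip link set vxlan100 up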

Cross-host communication

For cross-host container communication, an Overlay Network is usually used. There are many ways to implement an Overlay Network .

[Figure: Overlay Network]

Overlay mode

1. Three-layer Flannel UDP
Flannel UDP mode is the simplest and easiest-to-implement cross-host container networking solution provided by Flannel . Although it is rarely used in practice because of its performance (see below), it is of great reference value for understanding Overlay networks.

Let's use an example to describe the network access process. There are two hosts and four containers; we need the Container-1 container to access Container-4 .

The Container-1 container initiates a request to Container-4 . Docker0 is located in the Root Network Namespace ; one end of the veth peer is connected to the container's Network Namespace , and the other end is connected to the Docker0 virtual network device in the Root Network Namespace .

  • The container 100.96.1.2 accesses 100.96.2.2 . Since the destination address is not in the network segment of the Docker0 bridge (that is, the target container cannot be found on this bridge via ARP ), the packet follows the default routing rule in Container-1 , default via 172.17.0.1 dev eth0 , shown below. This corresponds to step 1 in the figure above.
# The default routing rules inside the container
[root@94a6c877b01a /]# ip route
default via 172.17.0.1 dev eth0 
172.17.0.0/16 dev eth0 proto kernel scope link src 172.17.0.2 
# The next hop is 172.17.0.1, going out through the eth0 device; inspecting the docker network shows that 172.17.0.1 is the gateway IP of the bridge device
lengrongfu@MacintoshdeMacBook-Pro ~ % docker network ls        
NETWORK ID     NAME                               DRIVER    SCOPE
e522990979b3   bridge                             bridge    local
# Inspect the network
lengrongfu@MacintoshdeMacBook-Pro ~ % docker network inspect e522990979b3
[
    {
        "Name": "bridge",
        "Id": "e522990979b365e9df4d967c3600483e598e530361deb28513b6e75b8b66bedf",
        "Created": "2021-04-12T12:11:57.321486866Z",
        "Scope": "local",
        "Driver": "bridge",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "172.17.0.0/16",
                    "Gateway": "172.17.0.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            "94a6c877b01ac3a1638f1c5cde87e7c58be9ce0aafd4a78efcb96528ab00ed94": {
                "Name": "c2",
                "EndpointID": "a5c12fb3800991228f8dc3a2a8de1d6f4865439701a83558e4430c2aebf783a8",
                "MacAddress": "02:42:ac:11:00:02",
                "IPv4Address": "172.17.0.2/16",
                "IPv6Address": ""
            }
        },
        "Options": {
            "com.docker.network.bridge.default_bridge": "true",
            "com.docker.network.bridge.enable_icc": "true",
            "com.docker.network.bridge.enable_ip_masquerade": "true",
            "com.docker.network.bridge.host_binding_ipv4": "0.0.0.0",
            "com.docker.network.bridge.name": "docker0",
            "com.docker.network.driver.mtu": "1500"
        },
        "Labels": {}
    }
]
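When only the gateway is of interest, the full JSON above can be filtered with a Go template; this is a sketch against the default bridge network and assumes an otherwise standard Docker setup:

# Print just the subnet and gateway of the default bridge network
$ docker network inspect -f '{{range .IPAM.Config}}{{.Subnet}} {{.Gateway}}{{end}}' bridge
172.17.0.0/16 172.17.0.1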
  • After the packet enters the Docker0 bridge, the host's routing table determines where it goes next. The host's routing table is shown below: a packet for the target IP 100.96.2.2 hits the second rule, which sends traffic for the 100.96.0.0/16 segment to the flannel0 device, with the preferred source IP 100.96.1.0 . This corresponds to step 2 in the figure above.
# Node1 routing table
$ ip route
1 default via 10.168.0.1 dev eth0
2 100.96.0.0/16 dev flannel0 proto kernel scope link src 100.96.1.0
3 100.96.1.0/24 dev docker0 proto kernel scope link src 100.96.1.1
4 10.168.0.0/24 dev eth0 proto kernel scope link src 10.168.0.2
Flannel0 device
  • As mentioned above, flannel0 is a TUN virtual layer-3 network device, mainly used to pass IP packets between kernel mode and user mode. Continuing the analysis: after the packet reaches the flannel0 device from kernel mode, it is handed over to the process that created the device, the flanneld process. flanneld sees that the destination address is 100.96.2.2 and sends the packet to the UDP port that the flanneld process on Node2 is listening on; that is, flanneld encapsulates the original IP packet in a UDP packet and sends it out. This corresponds to steps 3, 4, 5, and 6 in the figure above.

    • How does flanneld know that the IP 100.96.2.2 is on Node2 ? Because it uses subnets: when each node starts, it is assigned a subnet segment, so the subnet determines which node an IP belongs to, and the subnet-to-node mapping is stored in etcd , as sketched below.
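A sketch of what those subnet records can look like, assuming Flannel's default etcd v2 prefix /coreos.com/network (the paths and values are illustrative):

# Each node's subnet is registered under the Flannel prefix in etcd
$ etcdctl ls /coreos.com/network/subnets
/coreos.com/network/subnets/100.96.1.0-24
/coreos.com/network/subnets/100.96.2.0-24
# The record for a subnet points back to the owning node's host IP
$ etcdctl get /coreos.com/network/subnets/100.96.2.0-24
{"PublicIP":"10.168.0.3"}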
  • After the flanneld process on Node2 receives the UDP packet, it unpacks it and writes the inner IP packet to the flannel0 device. This is a transfer from user mode back to kernel mode, so the Linux kernel network stack takes over this IP packet and decides its next hop by looking up the local routing table. This corresponds to steps 7 and 8 in the figure above.
# Routing table on node2
$ ip route
1 default via 10.168.0.1 dev eth0
2 100.96.0.0/16 dev flannel0 proto kernel scope link src 100.96.2.0
3 100.96.2.0/24 dev docker0 proto kernel scope link src 100.96.2.1
4 10.168.0.0/24 dev eth0 proto kernel scope link src 10.168.0.3
  • Looking up the target IP 100.96.2.2 , the third routing rule is the more specific match: packets destined for the 100.96.2.0/24 network are sent to the docker0 device, whose own IP on that subnet is 100.96.2.1 . This corresponds to step 9 in the figure above.
  • After the packet enters the docker0 device, the docker0 bridge acts as a layer-2 switch and delivers the packet to the correct veth pair ; through that device the packet enters the destination container's network protocol stack. This corresponds to step 10 in the figure above.

Flannel UDP mode provides a layer-3 Overlay network: it first encapsulates the sender's IP packet in a UDP packet, decapsulates it at the receiving end to recover the original IP packet, and then forwards that IP packet to the destination container.

Flannel UDP mode has serious performance problems. The main reason is that, because a TUN device is used, delivering a single IP packet requires three data copies between user mode and kernel mode.

[Figure: tun]

2. Three-layer Calico ipip
3. Layer 2 + Layer 3 VXLAN
A VXLAN network overlays the existing layer-3 network with a virtual layer-2 network maintained by the kernel VXLAN module, so that endpoints connected to this VXLAN layer-2 network can communicate freely, as if they were in the same local area network.

In order to open tunnels over this layer-2 network, VXLAN sets up a special network device on each host to serve as the two ends of the tunnel. This device is called a VTEP , in full: VXLAN Tunnel End Point (virtual tunnel endpoint).

The role of the VTEP device is similar to that of the flanneld process in UDP mode: it encapsulates and decapsulates packets, except that it operates on layer-2 data frames, and the whole workflow is completed in the kernel.
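When Flannel runs in VXLAN mode, the VTEP it creates is typically named flannel.1 ; the following is a sketch of how its neighbor and forwarding-database entries could be inspected (the MAC and IP values are illustrative):

# The VTEP device created by Flannel's VXLAN backend
$ ip -d link show flannel.1
# ARP entry: remote subnet gateway IP -> remote VTEP MAC address
$ ip neigh show dev flannel.1
100.96.2.0 lladdr 5e:f8:4f:00:e3:37 PERMANENT
# FDB entry: remote VTEP MAC address -> remote host IP (the outer UDP destination)
$ bridge fdb show dev flannel.1
5e:f8:4f:00:e3:37 dst 10.168.0.3 self permanent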

Underlay mode

1. Three-layer BGP

[Figure: BGP network topology]

The figure above is a typical BGP network topology. By using Route1 and Route2 as border routing gateways and writing each LAN's routing information into the other's router, the two LANs achieve full layer-3 connectivity.

Calico BGP usage

After understanding BGP , the Calico project is easy to understand: it treats each host node as a border router, so every node stores the routing information of all the other nodes. Let's analyze its implementation; it consists of three parts:

  • The CNI plug-in of Calico : this is the part that connects Calico to Kubernetes .
  • BIRD : the BGP client, responsible for distributing routing information in the cluster (see the sketch after this list).
  • Felix : a DaemonSet , responsible for inserting routing rules on the host (into the Linux kernel's FIB ) and maintaining the network devices required by Calico .
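One way to see the BGP sessions that BIRD has established from a given node is calicoctl ; the output below is only a sketch, with the peer address made up for illustration:

# Show this node's BGP peers (run on a Calico node)
$ calicoctl node status
Calico process is running.

IPv4 BGP status
+--------------+-------------------+-------+----------+-------------+
| PEER ADDRESS |     PEER TYPE     | STATE |  SINCE   |    INFO     |
+--------------+-------------------+-------+----------+-------------+
| 192.168.0.3  | node-to-node mesh | up    | 08:30:12 | Established |
+--------------+-------------------+-------+----------+-------------+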

Calico 's BGP mode differs from Flannel 's host-gw mode in that Calico does not create any virtual bridge device. How Calico works is illustrated in the figure below.

[Figure: Calico BGP mode]

The figure above shows the network interactions in Calico BGP mode. Taking container1 accessing Container3 as an example, let's analyze how the traffic gets there. Because there is no cni0 virtual bridge device, one end of each veth device pair is in the container's Network Namespace and the other end is in the host's network namespace.

  • First of all, the Calico CNI plug-in needs to set up a Veth Pair device for each container and add, on the host, a routing rule pointing to each container's host-side veth device so that incoming IP packets can be delivered to the container, for example:
# Routes on node 192.168.0.2
$ ip route
10.20.0.2 dev cali1 scope link
10.20.0.3 dev cali2 scope link

# Routes on node 192.168.0.3
$ ip route
10.20.1.2 dev cali3 scope link
10.20.1.3 dev cali4 scope link
  • Each node also has routes to the other nodes, advertised via BGP , for example:
# On 192.168.0.2 there is a route pointing to 192.168.0.3
$ ip route
10.20.1.0/24 via 192.168.0.3 dev eth0
# On 192.168.0.3 there is a route pointing to 192.168.0.2
$ ip route
10.20.0.0/24 via 192.168.0.2 dev eth0
  • By default, Calico BGP uses the Node-to-Node Mesh mode, in which the number of BGP connections grows on the order of N^2 with the number of nodes, so it is generally recommended only for clusters of fewer than about 100 nodes. In larger clusters a Route Reflector is needed: all nodes report their routes to a few central nodes, and the other nodes synchronize routes from those central nodes.

Calico BGP mode, like Flannel host-gw mode, depends on the underlying network facilities and requires that the cluster hosts be reachable at layer 2. If the hosts are in different LAN s, Calico 's ipip Overlay mode has to be used instead.

2. Layer 2 VLAN
3. Flannel host-gw
The realization principle of Flannel host-gw mode can be explained clearly with one picture.

[Figure: Flannel host-gw mode]

  • The CNI0 device is a layer-3 switch: it has the functions of a layer-2 switch plus its own independent IP .

flannel starts a flanneld process on each node as a DaemonSet to maintain the routing information on that node; the whole mechanism is implemented with local routing rules.

For example, the route 192.168.1.0/24 via 10.20.0.3 dev eth0 says that the next hop for 192.168.1.0/24 is 10.20.0.3 , going out through the eth0 device.

Then, when the IP packet is encapsulated into a frame and sent out, the next hop from the routing table is used to set the destination MAC address; in this way the destination host can be reached over the layer-2 network.

Because it uses the next hop's MAC address as the destination MAC , this mode requires the hosts to be reachable from each other at layer 2, so that the ARP protocol can resolve the next hop's IP into a MAC address.
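Conceptually, what flanneld does in host-gw mode is equivalent to installing an ordinary kernel route on each host; a sketch, reusing the subnet and next hop from the example above (the MAC address shown is illustrative):

# host-gw boils down to: route the remote container subnet via the remote host's IP
$ ip route add 192.168.1.0/24 via 10.20.0.3 dev eth0
# When the frame is built, the destination MAC is the next hop's MAC, resolved by ARP
$ ip neigh show 10.20.0.3
10.20.0.3 dev eth0 lladdr 0e:31:ac:5d:10:22 REACHABLE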

