Author: Zhang Wei (Xie Shi)
This article is fairly long, and much of it digs into the kernel implementation. If you are not interested in that part, I suggest reading the three questions in the first section and then jumping straight to our answers.
Raising the questions
Note: all code in this article is from the Linux kernel source, version 5.16.2.
Have you heard of VLANs? The full name is Virtual Local Area Network, and it is used to isolate broadcast domains in Ethernet. It was born very early: in 1995 the IEEE published the 802.1Q standard [1], which defines the VLAN tag format in Ethernet frames and is still in use today. If you know VLANs, have you also heard of MACVlan and IPVlan? With the continuing rise of container technology, IPVlan and MACVlan, two Linux virtual network devices, have slowly come to the foreground. In Docker Engine 1.13.1, released in 2017 [2], IPVlan and MACVlan were introduced as container network solutions.
So have you also had the following questions?
1. What is the relationship between VLAN, IPVlan and MACVlan? Why is "VLAN" part of their names?
2. Why do IPVlan and MACVlan have various modes and flags, such as VEPA, Private and Passthrough? What are the differences between them?
3. What are the advantages of IPVlan and MACVlan? Under what circumstances should you use them?
I had the same questions. In today's article, we will explore these three questions together.
Background knowledge
The following is some background knowledge; if you are already familiar with the Linux kernel itself, feel free to skip it.
- Kernel abstraction of network devices
In Linux, we operate on a network device with nothing more than the ip command or the ifconfig command. The ip command is implemented by iproute2, which in turn relies on the netlink messaging mechanism provided by the kernel. For every type of network device (real or virtual), the kernel abstracts a structure that responds to netlink messages, all implemented according to the rtnl_link_ops structure, which handles the creation, destruction and modification of a device. Take the fairly intuitive Veth device as an example:
static struct rtnl_link_ops veth_link_ops = {
    .kind              = DRV_NAME,
    .priv_size         = sizeof(struct veth_priv),
    .setup             = veth_setup,
    .validate          = veth_validate,
    .newlink           = veth_newlink,
    .dellink           = veth_dellink,
    .policy            = veth_policy,
    .maxtype           = VETH_INFO_MAX,
    .get_link_net      = veth_get_link_net,
    .get_num_tx_queues = veth_get_num_queues,
    .get_num_rx_queues = veth_get_num_queues,
};
For a network device, the way Linux operates it and the way the hardware responds also need a set of conventions; Linux abstracts these into the net_device_ops structure. If you are interested in device drivers, this is mainly what you will be dealing with. Again using the Veth device as an example:
static const struct net_device_ops veth_netdev_ops = {
    .ndo_init            = veth_dev_init,
    .ndo_open            = veth_open,
    .ndo_stop            = veth_close,
    .ndo_start_xmit      = veth_xmit,
    .ndo_get_stats64     = veth_get_stats64,
    .ndo_set_rx_mode     = veth_set_multicast_list,
    .ndo_set_mac_address = eth_mac_addr,
#ifdef CONFIG_NET_POLL_CONTROLLER
    .ndo_poll_controller = veth_poll_controller,
#endif
    .ndo_get_iflink      = veth_get_iflink,
    .ndo_fix_features    = veth_fix_features,
    .ndo_set_features    = veth_set_features,
    .ndo_features_check  = passthru_features_check,
    .ndo_set_rx_headroom = veth_set_rx_headroom,
    .ndo_bpf             = veth_xdp,
    .ndo_xdp_xmit        = veth_ndo_xdp_xmit,
    .ndo_get_peer_dev    = veth_peer_dev,
};
From the above definition, we can see several methods with very intuitive semantics: ndo_start_xmit is used to send packets, and newlink is used to create a new device.
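To make the role of net_device_ops concrete, here is a minimal, hedged sketch of a toy virtual device, modeled loosely on the Veth example above. The device name demo%d and the behavior (the xmit handler simply counts and drops packets) are illustrative assumptions, not code from any real driver; it is meant to be built as an out-of-tree module against your kernel headers.

#include <linux/module.h>
#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/skbuff.h>

static struct net_device *demo_dev;

/* our ndo_start_xmit: a real driver would hand the skb to hardware or a peer */
static netdev_tx_t demo_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
    dev->stats.tx_packets++;
    dev->stats.tx_bytes += skb->len;
    dev_kfree_skb(skb);
    return NETDEV_TX_OK;
}

static int demo_open(struct net_device *dev)
{
    netif_start_queue(dev);
    return 0;
}

static int demo_stop(struct net_device *dev)
{
    netif_stop_queue(dev);
    return 0;
}

/* the counterpart of veth_netdev_ops shown above */
static const struct net_device_ops demo_netdev_ops = {
    .ndo_open       = demo_open,
    .ndo_stop       = demo_stop,
    .ndo_start_xmit = demo_start_xmit,
};

static void demo_setup(struct net_device *dev)
{
    ether_setup(dev);                  /* ethernet-style defaults */
    dev->netdev_ops = &demo_netdev_ops;
    eth_hw_addr_random(dev);           /* the same helper macvlan uses later */
}

static int __init demo_init(void)
{
    int err;

    demo_dev = alloc_netdev(0, "demo%d", NET_NAME_UNKNOWN, demo_setup);
    if (!demo_dev)
        return -ENOMEM;
    err = register_netdev(demo_dev);
    if (err)
        free_netdev(demo_dev);
    return err;
}

static void __exit demo_exit(void)
{
    unregister_netdev(demo_dev);
    free_netdev(demo_dev);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");

After insmod, the interface shows up in ip link like any other device. Note that the netlink side (rtnl_link_ops) is not wired up in this toy; a real driver such as veth registers both structures.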
As for receiving packets, Linux does not let each process receive packets by itself. The ksoftirqd kernel thread is responsible for taking packets from the driver, running them through the network layer (ip, iptables) and the transport layer (tcp, udp), and finally placing them into the receive buffer of the socket held by the user process, after which the kernel notifies the user process to handle them. For virtual devices, all the differences are concentrated before the network layer, where there is a unified entry point: __netif_receive_skb_core.
- The 802.1q protocol's definition of VLAN
In the 802.1q protocol, the VLAN field added to the Ethernet frame header is 32 bits wide, with the following structure:
As shown above, 16 bits mark the protocol (TPID), 3 bits mark the priority, 1 bit marks the format, and 12 bits hold the VLAN id. At this point you can easily calculate how many broadcast domains VLANs can carve out: 2^12 = 4096, and after subtracting the reserved all-0 and all-1 values, 4094 usable broadcast domains remain. (Before the rise of OpenFlow, the earliest VPC prototypes in cloud computing relied on VLANs to separate networks; because of this limit the approach was quickly abandoned, which also gave birth to another term you may be familiar with, VxLAN. The two are very different, but the lineage of the name is there.)
VLAN was originally a switch-side concept, like bridging, but Linux implements it in software. Linux uses a 16-bit vlan_proto field and a 16-bit vlan_tci field in the metadata of each Ethernet frame to implement the 802.1q protocol. For each VLAN, a sub-device is created to process packets once the VLAN tag has been removed. Yes, VLANs have their own sub-devices, namely VLAN sub-interfaces; different VLAN sub-devices physically send and receive packets through a single master device. Does this concept sound familiar? That's right, it is exactly the principle behind ENI-Trunking.
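To make the bit layout concrete, here is a minimal userspace sketch (not from the article) that unpacks a 16-bit TCI value the same way the kernel stores it in vlan_tci; the sample value is purely illustrative.

#include <stdio.h>
#include <stdint.h>

/* 802.1Q TCI layout: PCP (3 bits) | DEI (1 bit) | VID (12 bits) */
static void parse_tci(uint16_t tci)
{
    unsigned int pcp = tci >> 13;
    unsigned int dei = (tci >> 12) & 0x1;
    unsigned int vid = tci & 0x0fff;

    printf("pcp=%u dei=%u vid=%u\n", pcp, dei, vid);
}

int main(void)
{
    parse_tci(0x2064);                               /* priority 1, VLAN id 100 */
    printf("usable VLAN ids: %d\n", (1 << 12) - 2);  /* 4094: 0 and 4095 reserved */
    return 0;
}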
Deep dive into the kernel implementation of VLAN/MACVlan/IPVlan
With the background knowledge covered, let's start with the VLAN sub-device and see how the Linux kernel implements it. All the kernel code here is from version 5.16.2.
VLAN sub-device
- Device creation
The VLAN sub-device was not originally treated as a separate virtual device; after all, it appeared very early and its code is somewhat scattered, but the core logic lives under net/8021q/. From the background section we know that the netlink mechanism provides the entry point for creating network devices. For VLAN sub-devices, the netlink handling structure is vlan_link_ops, and the vlan_newlink method is responsible for creating a VLAN sub-device. The kernel's initialization flow is as follows:
- First, a generic Linux net_device structure is created to hold the device's configuration. After entering vlan_newlink, vlan_check_real_dev checks whether the requested VLAN id is available; internally it calls vlan_find_dev, which, given a master device and a VLAN id, finds the matching sub-device (we will use it again later). Here is part of the code:
static int vlan_newlink(struct net *src_net, struct net_device *dev,
                        struct nlattr *tb[], struct nlattr *data[],
                        struct netlink_ext_ack *extack)
{
    struct vlan_dev_priv *vlan = vlan_dev_priv(dev);
    struct net_device *real_dev;
    unsigned int max_mtu;
    __be16 proto;
    int err;

    /* parameter validation omitted here */

    /* set the sub-device's VLAN information: the defaults for the protocol,
     * VLAN id, priority and flags described in the background section */
    vlan->vlan_proto = proto;
    vlan->vlan_id = nla_get_u16(data[IFLA_VLAN_ID]);
    vlan->real_dev = real_dev;
    dev->priv_flags |= (real_dev->priv_flags & IFF_XMIT_DST_RELEASE);
    vlan->flags = VLAN_FLAG_REORDER_HDR;

    err = vlan_check_real_dev(real_dev, vlan->vlan_proto, vlan->vlan_id,
                              extack);
    if (err < 0)
        return err;

    /* MTU setup happens here */

    err = vlan_changelink(dev, tb, data, extack);
    if (!err)
        err = register_vlan_dev(dev, extack);
    if (err)
        vlan_dev_uninit(dev);
    return err;
}
- Next, the vlan_changelink method sets the device's attributes; any special configuration overrides the defaults.
- Finally, register_vlan_dev fills the information prepared above into the net_device structure and registers the device with the kernel through the unified device-management interface. (A small userspace sketch of triggering this creation path follows.)
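To connect this to userspace: the sketch below (not from the article) uses the libnl-route-3 library to send the RTM_NEWLINK request that ends up in vlan_newlink, the equivalent of `ip link add link eth0 name eth0.100 type vlan id 100`. The parent name eth0 and the VLAN id 100 are placeholders, and the exact calls reflect my understanding of libnl's API rather than anything in the article.

/* build (assumed): gcc vlan_add.c $(pkg-config --cflags --libs libnl-route-3.0) */
#include <stdio.h>
#include <net/if.h>
#include <linux/netlink.h>
#include <netlink/netlink.h>
#include <netlink/route/link.h>
#include <netlink/route/link/vlan.h>

int main(void)
{
    struct nl_sock *sk = nl_socket_alloc();
    struct rtnl_link *link;
    int parent, err;

    if (!sk || nl_connect(sk, NETLINK_ROUTE) < 0) {
        fprintf(stderr, "cannot connect to rtnetlink\n");
        return 1;
    }
    parent = if_nametoindex("eth0");      /* hypothetical master device */

    link = rtnl_link_vlan_alloc();        /* kind "vlan" -> handled by vlan_link_ops */
    rtnl_link_set_link(link, parent);     /* becomes real_dev in vlan_newlink */
    rtnl_link_set_name(link, "eth0.100");
    rtnl_link_vlan_set_id(link, 100);     /* IFLA_VLAN_ID, read via nla_get_u16 above */

    err = rtnl_link_add(sk, link, NLM_F_CREATE | NLM_F_EXCL);
    if (err < 0)
        fprintf(stderr, "rtnl_link_add: %s\n", nl_geterror(err));

    rtnl_link_put(link);
    nl_socket_free(sk);
    return err < 0;
}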
- Packet reception
Judging from the creation process, the difference between a VLAN sub-device and an ordinary device is that it can be looked up by master device plus VLAN id via vlan_find_dev; this point matters for what follows.
Next, let's look at packet reception. From the background section we know that after the physical device receives a packet, and before it enters the protocol stack, the common entry point is __netif_receive_skb_core. Starting from this entry, the kernel's flow is as follows:
Following the diagram above, let's walk through the relevant part of __netif_receive_skb_core:
- First, at the start of packet processing, skb_vlan_untag is performed. For VLAN packets, the protocol field is always ETH_P_8021Q. skb_vlan_untag extracts the VLAN information from the frame into the packet's vlan_tci field and calls vlan_set_encap_proto to update the protocol to the encapsulated network-layer protocol. At this point, part of the VLAN handling is done and the packet already looks like a normal one.
- A packet that carries a VLAN tag (skb_vlan_tag_present) then enters vlan_do_receive. The core of vlan_do_receive is to find the sub-device via vlan_find_dev, set the packet's dev to that sub-device, and clean up VLAN-related information such as the priority. At this point, the VLAN packet has become a normal packet destined for the VLAN sub-device.
- After vlan_do_receive completes, execution jumps to another_round and __netif_receive_skb_core runs again following the normal path: the packet passes through the rx_handler on the sub-device just like an ordinary packet (the same kind of rx_handler the master device has) and heads up to the network layer.
static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
                                    struct packet_type **ppt_prev)
{
    rx_handler_func_t *rx_handler;
    struct sk_buff *skb = *pskb;
    struct net_device *orig_dev;
    /* some local declarations omitted */

another_round:
    skb->skb_iif = skb->dev->ifindex;

    /* try to strip the vlan tag from the frame itself, i.e. fill in the two
     * vlan-related fields described in the background section */
    if (eth_type_vlan(skb->protocol)) {
        skb = skb_vlan_untag(skb);
        if (unlikely(!skb))
            goto out;
    }

    /* this is the tcpdump capture point you know so well; pt_prev records the
     * previous handler of the packet -- as you can see, one skb may be handled
     * in many places, including pcap */
    list_for_each_entry_rcu(ptype, &ptype_all, list) {
        if (pt_prev)
            ret = deliver_skb(skb, pt_prev, orig_dev);
        pt_prev = ptype;
    }

    /* if a vlan tag is present and a pt_prev already exists, do a deliver_skb
     * first, so other handlers work on a copy and the original packet is not
     * modified */
    if (skb_vlan_tag_present(skb)) {
        if (pt_prev) {
            ret = deliver_skb(skb, pt_prev, orig_dev);
            pt_prev = NULL;
        }
        /* this is the core part: after vlan_do_receive the packet has become a
         * normal one and goes around again */
        if (vlan_do_receive(&skb))
            goto another_round;
        else if (unlikely(!skb))
            goto out;
    }

    /* this is where a normal packet should arrive; pt_prev means a handler has
     * been found, and rx_handler takes the packet up to the higher layers */
    rx_handler = rcu_dereference(skb->dev->rx_handler);
    if (rx_handler) {
        if (pt_prev) {
            ret = deliver_skb(skb, pt_prev, orig_dev);
            pt_prev = NULL;
        }
        switch (rx_handler(&skb)) {
        case RX_HANDLER_CONSUMED:
            ret = NET_RX_SUCCESS;
            goto out;
        case RX_HANDLER_ANOTHER:
            goto another_round;
        case RX_HANDLER_EXACT:
            deliver_exact = true;
            break;
        case RX_HANDLER_PASS:
            break;
        }
    }

    if (unlikely(skb_vlan_tag_present(skb)) && !netdev_uses_dsa(skb->dev)) {
check_vlan_id:
        if (skb_vlan_tag_get_id(skb)) {
            /* handle the case where the vlan id was not stripped correctly,
             * usually because the vlan id is invalid or has no local
             * sub-device */
        }
    }
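You can observe this untagging from userspace: after vlan_do_receive the tag lives only in the packet metadata, and a packet socket on the master device can ask for it via PACKET_AUXDATA. Below is a hedged sketch (not from the article); the interface name eth0 is a placeholder, error handling is minimal, and CAP_NET_RAW is required.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <net/if.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>

int main(void)
{
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    int on = 1;
    struct sockaddr_ll sll = {
        .sll_family   = AF_PACKET,
        .sll_protocol = htons(ETH_P_ALL),
        .sll_ifindex  = if_nametoindex("eth0"),   /* hypothetical master device */
    };

    if (fd < 0 || bind(fd, (struct sockaddr *)&sll, sizeof(sll)) < 0)
        return 1;
    /* ask the kernel to hand us the stripped tag as ancillary data */
    setsockopt(fd, SOL_PACKET, PACKET_AUXDATA, &on, sizeof(on));

    for (;;) {
        char frame[2048];
        char cbuf[CMSG_SPACE(sizeof(struct tpacket_auxdata))];
        struct iovec iov = { .iov_base = frame, .iov_len = sizeof(frame) };
        struct msghdr msg = {
            .msg_iov = &iov, .msg_iovlen = 1,
            .msg_control = cbuf, .msg_controllen = sizeof(cbuf),
        };

        if (recvmsg(fd, &msg, 0) <= 0)
            break;
        for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c; c = CMSG_NXTHDR(&msg, c)) {
            if (c->cmsg_level != SOL_PACKET || c->cmsg_type != PACKET_AUXDATA)
                continue;
            struct tpacket_auxdata *aux = (struct tpacket_auxdata *)CMSG_DATA(c);
            if (aux->tp_status & TP_STATUS_VLAN_VALID)
                printf("stripped tag: vid=%u pcp=%u\n",
                       aux->tp_vlan_tci & 0x0fff, aux->tp_vlan_tci >> 13);
        }
    }
    close(fd);
    return 0;
}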
- Packet transmission
The transmission entry point of the VLAN sub-device is vlan_dev_hard_start_xmit. Compared with reception, transmission is actually much simpler. The kernel's flow when sending is as follows:
When a packet is sent, the VLAN sub-device enters vlan_dev_hard_start_xmit, its implementation of the ndo_start_xmit interface. It fills the VLAN-related information into the packet via __vlan_hwaccel_put_tag, then changes the packet's device to the master device and calls the master device's dev_queue_xmit, re-entering the master device's transmit queue for sending. Here is the key part:
static netdev_tx_t vlan_dev_hard_start_xmit(struct sk_buff *skb,
                                            struct net_device *dev)
{
    struct vlan_dev_priv *vlan = vlan_dev_priv(dev);
    struct vlan_ethhdr *veth = (struct vlan_ethhdr *)(skb->data);
    unsigned int len;
    int ret;

    /* this is the vlan_tci filling mentioned above; the information belongs to
     * the sub-device itself */
    if (veth->h_vlan_proto != vlan->vlan_proto ||
        vlan->flags & VLAN_FLAG_REORDER_HDR) {
        u16 vlan_tci;
        vlan_tci = vlan->vlan_id;
        vlan_tci |= vlan_dev_get_egress_qos_mask(dev, skb->priority);
        __vlan_hwaccel_put_tag(skb, vlan->vlan_proto, vlan_tci);
    }

    /* the packet's device is switched from the sub-device to the master
     * device -- as direct as it gets */
    skb->dev = vlan->real_dev;
    len = skb->len;
    if (unlikely(netpoll_tx_running(dev)))
        return vlan_netpoll_send_skb(vlan, skb);

    /* now the master device can be called directly to send the packet */
    ret = dev_queue_xmit(skb);
    ...
    return ret;
}
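From an application's point of view none of this is visible: you simply send on the sub-device, the kernel tags the frame, and the master device transmits it. A minimal sketch (not from the article), assuming a VLAN sub-interface named eth0.100 already exists and the program has CAP_NET_RAW for SO_BINDTODEVICE:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    const char ifname[] = "eth0.100";       /* hypothetical VLAN sub-device */
    struct sockaddr_in dst = {
        .sin_family = AF_INET,
        .sin_port   = htons(9000),
    };

    inet_pton(AF_INET, "192.0.2.10", &dst.sin_addr);   /* documentation address */

    /* pin the socket to the sub-device; the kernel adds the 802.1q tag in
     * vlan_dev_hard_start_xmit and transmits via the master device */
    if (setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE, ifname, sizeof(ifname)) < 0) {
        perror("SO_BINDTODEVICE");
        return 1;
    }
    sendto(fd, "hello", 5, 0, (struct sockaddr *)&dst, sizeof(dst));
    close(fd);
    return 0;
}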
MACVlan device
Having looked at the VLAN sub-device, let's analyze MACVlan next. The difference between MACVlan and the VLAN sub-device is that MACVlan no longer relies on a capability of Ethernet itself; it is a virtual network device with its own driver. This shows first of all in the independence of the driver code, which lives mostly in drivers/net/macvlan.c.
MACVlan devices have five modes; apart from source mode, the other four appeared earlier. They are defined as follows:
enum macvlan_mode {
    MACVLAN_MODE_PRIVATE  = 1,  /* don't talk to other macvlans */
    MACVLAN_MODE_VEPA     = 2,  /* talk to other ports through ext bridge */
    MACVLAN_MODE_BRIDGE   = 4,  /* talk to bridge ports directly */
    MACVLAN_MODE_PASSTHRU = 8,  /* take over the underlying device */
    MACVLAN_MODE_SOURCE   = 16, /* use source MAC address list to assign */
};
Keep the behavior of these modes in mind; why they exist is a question we will answer later.
- Device creation
For a MACVlan device, the netlink response structure is macvlan_link_ops, and the response method for creating a device is macvlan_newlink. Starting from this entry, the overall flow of creating a MACVlan device is as follows:
- macvlan_newlink calls macvlan_common_newlink to perform the actual creation. macvlan_common_newlink first performs validity checks; note the netif_is_macvlan check: if a MACVlan sub-device is passed in as the master device, the master device of that sub-device is automatically used as the master of the newly created interface.
- Next, a random mac address is generated for the MACVlan device via eth_hw_addr_random. Yes, the mac address of a MACVlan sub-device is random; this is important and will come up again later (a small userspace check follows the kernel excerpt below).
- With the mac address in place, the MACVlan logic on the master device is initialized. There is a check here: if the master device has never hosted a MACVlan device before, macvlan_port_create prepares it for MACVlan, and the core of this initialization is calling netdev_rx_handler_register to install MACVlan's rx_handler, macvlan_handle_frame, in place of the rx_handler the device originally registered.
- After initialization completes, a port is obtained, and the sub-device's information is then set.
- Finally, device creation is completed through register_netdevice. Here is some of the core logic:
int macvlan_common_newlink(struct net *src_net, struct net_device *dev,
                           struct nlattr *tb[], struct nlattr *data[],
                           struct netlink_ext_ack *extack)
{
    ...
    /* check whether the master device is itself a macvlan device; if so, use
     * its master device directly */
    if (netif_is_macvlan(lowerdev))
        lowerdev = macvlan_dev_real_dev(lowerdev);

    /* a random mac address is generated here */
    if (!tb[IFLA_ADDRESS])
        eth_hw_addr_random(dev);

    /* initialization happens here, i.e. the rx_handler is replaced */
    if (!netif_is_macvlan_port(lowerdev)) {
        err = macvlan_port_create(lowerdev);
        if (err < 0)
            return err;
        create = true;
    }
    port = macvlan_port_get_rtnl(lowerdev);

    /* a long stretch of mode setup is omitted here */
    vlan->lowerdev = lowerdev;
    vlan->dev = dev;
    vlan->port = port;
    vlan->set_features = MACVLAN_FEATURES;
    vlan->mode = MACVLAN_MODE_VEPA;

    /* finally the device is registered */
    err = register_netdevice(dev);
    if (err < 0)
        goto destroy_macvlan_port;
}
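A quick way to see the effect of eth_hw_addr_random is to compare the MAC of the master device and of a sub-device from userspace. Below is a small sketch (not from the article) using the classic SIOCGIFHWADDR ioctl; the interface names are placeholders. For a macvlan sub-device the two addresses differ, while an ipvlan sub-device (covered below) reports the master's address.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>

static void print_mac(int fd, const char *name)
{
    struct ifreq ifr;

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);
    if (ioctl(fd, SIOCGIFHWADDR, &ifr) < 0) {
        perror(name);
        return;
    }
    unsigned char *m = (unsigned char *)ifr.ifr_hwaddr.sa_data;
    printf("%-10s %02x:%02x:%02x:%02x:%02x:%02x\n",
           name, m[0], m[1], m[2], m[3], m[4], m[5]);
}

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    print_mac(fd, "eth0");      /* hypothetical master device */
    print_mac(fd, "macvlan0");  /* hypothetical macvlan sub-device: random MAC */
    close(fd);
    return 0;
}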
- Packet reception
Packet reception for a MACVlan device still starts from the __netif_receive_skb_core entry point. The code flow is as follows:
- When the master device receives a packet, __netif_receive_skb_core enters macvlan_handle_frame, the method registered by the MACVlan driver, which handles multicast packets first and then unicast packets.
- For multicast packets (is_multicast_ether_addr), macvlan_hash_lookup first finds the source sub-device from the information in the packet; processing then depends on the mode: if it is private or passthru, the packet is delivered to that sub-device alone via macvlan_broadcast_one; if it is bridge or VEPA, all sub-devices receive the broadcast via macvlan_broadcast_enqueue.
- For unicast packets, source mode and passthru mode are handled first, triggering delivery to the upper layer directly. For the other modes, macvlan_hash_lookup is performed on the destination mac; if a sub-device is found, the packet's dev is set to that sub-device.
- Finally, the packet's pkt_type is set and RX_HANDLER_ANOTHER is returned, running __netif_receive_skb_core once more. In that pass, since the dev is already the sub-device, the flow ends with RX_HANDLER_PASS and the packet goes up to the higher layers.
- For MACVlan's receive path, the most important piece is how the sub-device is selected after the master device receives a packet. That code is as follows:
static struct macvlan_dev *macvlan_hash_lookup(const struct macvlan_port *port,
                                               const unsigned char *addr)
{
    struct macvlan_dev *vlan;
    u32 idx = macvlan_eth_hash(addr);

    hlist_for_each_entry_rcu(vlan, &port->vlan_hash[idx], hlist,
                             lockdep_rtnl_is_held()) {
        /* this is the core of how macvlan finds a sub-device: compare the mac
         * addresses */
        if (ether_addr_equal_64bits(vlan->dev->dev_addr, addr))
            return vlan;
    }
    return NULL;
}
- Packet transmission
MACVlan's transmit path also starts from the sub-device receiving the ndo_start_xmit callback; its entry point is macvlan_start_xmit. The overall kernel flow is as follows:
- When a packet enters macvlan_start_xmit, the actual sending is mainly done by macvlan_queue_xmit.
- macvlan_queue_xmit first handles bridge mode. From the mode definitions we know that only bridge mode allows direct communication between different sub-devices inside the same master device, so that special case is handled here: multicast packets and unicast packets destined for other sub-devices are delivered to those sub-devices directly.
- Other packets are sent through dev_queue_xmit_accel, which calls the master device's netdev_start_xmit directly to perform the real transmission.
static int macvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev)
{
    ...
    /* the logic for bridge mode comes first, since communication between
     * different sub-devices has to be considered */
    if (vlan->mode == MACVLAN_MODE_BRIDGE) {
        const struct ethhdr *eth = skb_eth_hdr(skb);
        /* send to other bridge ports directly */
        if (is_multicast_ether_addr(eth->h_dest)) {
            skb_reset_mac_header(skb);
            macvlan_broadcast(skb, port, dev, MACVLAN_MODE_BRIDGE);
            goto xmit_world;
        }
        /* packets destined for another sub-device of the same master are
         * simply forwarded to it here */
        dest = macvlan_hash_lookup(port, eth->h_dest);
        if (dest && dest->mode == MACVLAN_MODE_BRIDGE) {
            /* send to lowerdev first for its network taps */
            dev_forward_skb(vlan->lowerdev, skb);
            return NET_XMIT_SUCCESS;
        }
    }
xmit_world:
    skb->dev = vlan->lowerdev;
    /* the packet's device has been set to the master device, which now does
     * the sending */
    return dev_queue_xmit_accel(skb,
                                netdev_get_sb_channel(dev) ? dev : NULL);
}
IPVlan device
Compared with MACVlan and the VLAN sub-device, the IPVlan sub-device has a more complex model. Unlike MACVlan, IPVlan uses flags to define the interworking behavior between sub-devices, and it additionally provides three modes, defined as follows:
/* originally there were only l2 and l3; after Linux gained l3mdev, l3s was
 * added. The main difference between them is on the rx side. */
enum ipvlan_mode {
    IPVLAN_MODE_L2 = 0,
    IPVLAN_MODE_L3,
    IPVLAN_MODE_L3S,
    IPVLAN_MODE_MAX
};

/* there is also a bridge behavior: since bridge is the default, it has no
 * flag. The semantics are the same as for macvlan. */
#define IPVLAN_F_PRIVATE 0x01
#define IPVLAN_F_VEPA    0x02
- Device creation
After analyzing the previous two sub-devices, we can follow the same approach for IPVlan. The netlink message handling structure of the IPVlan device is ipvlan_link_ops, and the entry method for creating a device is ipvlan_link_new. The flow is as follows:
- Enter ipvlan_link_new and perform validity checks. As with MACVlan, if an IPVlan device is passed in as the master device, the master device of that IPVlan device is automatically used as the master of the new device.
- The mac address of the IPVlan device is set via eth_hw_addr_set to the mac address of the master device; this is the most visible difference between IPVlan and MACVlan.
- Then comes the unified register_netdevice flow. During this process, if the master device currently has no IPVlan sub-device, an initialization like MACVlan's runs (ipvlan_init): it creates an ipvl_port on the master device, replaces the master device's original rx_handler with IPVlan's, and also starts a dedicated kernel worker to process multicast packets. In other words, for IPVlan all multicast packets are handled centrally.
- Next, the new sub-device itself is processed: ipvlan_set_port_mode saves it into the master device's bookkeeping, and for an l3s sub-device its l3mdev handling method is registered with nf_hook. Yes, this is the biggest difference from the devices above: for l3s, the handoff of packets between the master device and the sub-device actually happens at the network layer.
For IPVlan network devices, here is part of ipvlan_port_create:
static int ipvlan_port_create(struct net_device *dev)
{
    /* as you can see here, the port is the core of how the master device
     * manages its sub-devices */
    struct ipvl_port *port;
    int err, idx;

    /* all sub-device attributes live in the port; note that the default mode
     * is l3 (allocation and error handling omitted) */
    write_pnet(&port->pnet, dev_net(dev));
    port->dev = dev;
    port->mode = IPVLAN_MODE_L3;

    /* for ipvlan, multicast packets are always processed separately */
    skb_queue_head_init(&port->backlog);
    INIT_WORK(&port->wq, ipvlan_process_multicast);

    /* the usual move: this is what lets the master device's receive path
     * cooperate with ipvlan */
    err = netdev_rx_handler_register(dev, ipvlan_handle_frame, port);
    if (err)
        goto err;
}
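If you want to create such a device from your own code rather than with the ip command, newer libnl-route-3 versions expose an ipvlan link type. The sketch below is an assumption about that API (function names from libnl's route/link/ipvlan module, which requires a libnl build with ipvlan support), roughly the equivalent of `ip link add link eth0 name ipvl0 type ipvlan mode l2`; eth0 and ipvl0 are placeholders.

/* build (assumed): gcc ipvlan_add.c $(pkg-config --cflags --libs libnl-route-3.0) */
#include <stdio.h>
#include <net/if.h>
#include <linux/netlink.h>
#include <netlink/netlink.h>
#include <netlink/route/link.h>
#include <netlink/route/link/ipvlan.h>   /* assumed: ipvlan support in libnl */

int main(void)
{
    struct nl_sock *sk = nl_socket_alloc();
    struct rtnl_link *link;
    int err;

    if (!sk || nl_connect(sk, NETLINK_ROUTE) < 0)
        return 1;

    link = rtnl_link_ipvlan_alloc();                  /* kind "ipvlan" -> ipvlan_link_ops */
    rtnl_link_set_link(link, if_nametoindex("eth0")); /* hypothetical master device */
    rtnl_link_set_name(link, "ipvl0");
    /* the kernel default is l3 (see ipvlan_port_create above); ask for l2 here */
    rtnl_link_ipvlan_set_mode(link, rtnl_link_ipvlan_str2mode("l2"));

    err = rtnl_link_add(sk, link, NLM_F_CREATE | NLM_F_EXCL);
    if (err < 0)
        fprintf(stderr, "rtnl_link_add: %s\n", nl_geterror(err));

    rtnl_link_put(link);
    nl_socket_free(sk);
    return err < 0;
}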
- Packet reception
The three modes of the IPVlan sub-device have different receive paths. The kernel flow is as follows:
- As with MACVlan, packets first reach ipvlan_handle_frame (registered at creation time) via __netif_receive_skb_core; at this point the packet still belongs to the master device.
- In l2 mode, only multicast packets get special treatment: they are placed into the multicast processing queue initialized at creation time; unicast packets are handed straight to ipvlan_handle_mode_l3.
- For l3 mode, and for unicast packets in l2 mode, processing continues in ipvlan_handle_mode_l3: it obtains the network-layer header via ipvlan_get_L3_hdr, finds the matching sub-device by ip address, and finally calls ipvlan_rcv_frame to set the packet's dev to the IPVlan sub-device, returning RX_HANDLER_ANOTHER for another round of receive processing.
- In l3s mode, ipvlan_handle_frame returns RX_HANDLER_PASS directly, which means l3s packets enter the network layer while still belonging to the master device. For l3s, the previously registered nf_hook fires at NF_INET_LOCAL_IN and executes ipvlan_l3_rcv, which finds the sub-device by address, updates the packet's destination device accordingly, and then goes straight into ip_local_deliver for the rest of the network-layer processing.
- Packet transmission
Although the implementation of IPVlan packet transmission is relatively complex, at its core every sub-device still tries to send through the master device. When an IPVlan sub-device sends a packet, it first enters ipvlan_start_xmit; the core sending logic is in ipvlan_queue_xmit, and the kernel flow is as follows:
- ipvlan_queue_xmit selects the sending method according to the sub-device's mode: l2 goes through ipvlan_xmit_mode_l2, while l3 and l3s go through ipvlan_xmit_mode_l3.
- ipvlan_xmit_mode_l2 first checks whether the packet is locally addressed and whether VEPA mode is set. For a locally addressed packet with VEPA disabled, it uses ipvlan_addr_lookup to check whether the target is another IPVlan sub-device under the same master; if so, the packet is handed to that sub-device's receive path via ipvlan_rcv_frame; if not, it is handed to the master device via dev_forward_skb.
- Next, ipvlan_xmit_mode_l2 handles multicast packets: before processing, ipvlan_skb_crossing_ns clears the packet's netns-related information (priority and so on), and the packet is then placed into ipvlan_multicast_enqueue, triggering the multicast processing flow described above.
- Packets that are not locally addressed are sent out through the master device's dev_queue_xmit.
- ipvlan_xmit_mode_l3 also checks VEPA first. For packets in non-VEPA mode, ipvlan_addr_lookup checks whether the destination is another sub-device; if so, ipvlan_rcv_frame hands the packet to that device's receive path.
- The remaining packets (including all packets in VEPA mode) first go through ipvlan_skb_crossing_ns and then ipvlan_process_outbound, which selects ipvlan_process_v4_outbound or ipvlan_process_v6_outbound according to the packet's network-layer protocol.
- Taking ipvlan_process_v4_outbound as an example, it first looks up the route via ip_route_output_flow and then sends the packet directly through the network layer's ip_local_out, continuing transmission at the master device's network layer.
Answering the questions
After the analysis above, plus a bit of experience and thought, I think at least the first question can now be answered easily:
The relationship between VLAN and MACVlan/IPVlan
What is the relationship between VLAN, IPVlan and MACVlan? Why is "VLAN" part of their names?
Since MACVlan and IPVlan chose to carry this name, there must be similarities in some respects. Our analysis shows that the core logic of VLAN sub-devices is very similar to that of MACVlan and IPVlan:
- The master device is responsible for physically sending and receiving packets.
- The master device manages its sub-devices as a set of ports and finds the right port by some key, such as VLAN information, mac address or ip address (vlan_find_dev, macvlan_hash_lookup, ipvlan_addr_lookup).
- After the master device receives a packet, it has to take another trip through __netif_receive_skb_core.
- Sub-devices send packets simply by changing the packet's dev and letting the master device do the work.
So it is not hard to deduce that the internal logic of MACVlan/IPVlan is largely based on the implementation of Linux VLAN. Linux first added MACVlan in version 2.6.23, merged in June 2007 [3]. Its description reads:
The new "MACVlan" driver allows the system administrator to create virtual interfaces mapped to and from specific MAC addresses.
IPVlan was introduced for the first time in version 3.19, whose merge window ran in December 2014 [4]. Its description reads:
The new "IPVlan" driver enables the creation of virtual network devices for container interconnection. It is designed to work well with network namespaces. IPVlan is much like the existing MACVlan driver, but it does its multiplexing at a higher level in the stack.
As for VLAN, it appeared much earlier: as far back as Linux 2.4, the first versions of many device drivers already supported VLANs. Linux's hwaccel implementation for VLAN, however, arrived with 2.6.10 in 2004 [5]; among the large number of features added at the time, this one appears:
I was poking about in the National Semi 83820 driver, and I happened to notice that the chip supports VLAN tag add/strip assist in hardware, but the driver wasn't making use of it. This patch adds in the driver support to use the VLAN tag add/remove hardware, and enables the drivers use of the kernel VLAN hwaccel interface.
In other words, VLAN sub-interfaces came first: to accelerate the processing of VLAN packets, Linux virtualized the different VLANs of a physical device into devices of their own. MACVlan and IPVlan, which came later, followed the same idea and gave virtual devices an even wider range of uses.
In this way, their relationship is more like a tribute.
About VEPA/passthrough/bridge/private
Why do IPVlan and MACVlan have various modes and flags, such as VEPA, private, passthrough, etc.? What is the difference between them?
In fact, during the kernel analysis we have already roughly seen how these modes behave. If the master device were a DingTalk group in which every member can message the outside world, then the modes become quite intuitive:
- In private mode, members are muted to each other, both inside the group and outside it.
- In bridge mode, members can chat freely inside the group.
- In VEPA mode, members are muted inside the group but can still chat with each other privately outside it, rather like a group-wide mute during the red-envelope round of the annual party.
- In passthrough mode, you are the group owner, and nobody else may speak.
So why do these modes exist? Looking at how the kernel behaves, whether we talk about ports or bridges, these are all networking concepts. In other words, from the very beginning Linux has tried to present itself as a proper network: the master device plays the part of a switch, and the sub-devices are the devices at the end of each cable. Seen that way, it all looks quite reasonable.
That is indeed the case: VEPA, private and the rest were originally physical-network concepts. And it is not just Linux; many projects dedicated to emulating a physical network follow the same behavioral patterns, Open vSwitch [6] for example.
Application of MACVlan and IPVlan
What are the advantages of IPVlan and MACVlan? Under what circumstances should you use them?
This is where the original motivation for this article comes in. From the second question we found that both IPVlan and MACVlan are doing one thing: virtualizing the network. Why do we want virtual networks? There are many answers, but as with the value of cloud computing itself, virtual networks, as one of cloud computing's foundational technologies, ultimately exist to improve resource utilization.
MACVlan and IPVlan serve that ultimate purpose. The extravagant days of dedicating a whole physical machine to a hello-world are over; from virtualization to containerization, the times keep demanding higher network density. After container technology was born, veth took the stage first, but while it offers the density, its performance is not efficient enough. MACVlan and IPVlan, which use sub-devices to increase density while keeping things efficient, came into being (and so, of course, did our ENI-Trunking).
Having said that, I would like to recommend the IPVlan solution [7], a new high-performance, high-density network solution from Alibaba Cloud Container Service for Kubernetes (ACK).
ACK implements its IPVlan-based Kubernetes network solution on top of the Terway plug-in. Terway is ACK's self-developed network plug-in: it assigns native elastic network interfaces (ENIs) to Pods to build the Pod network, supports standard Kubernetes NetworkPolicy to define access rules between containers, and is compatible with Calico network policies.
With the Terway network plug-in, each Pod has its own network stack and IP address. Communication between Pods on the same ECS instance is forwarded directly inside the machine, while traffic between Pods on different ECS instances is forwarded directly through the VPC's elastic network interfaces. Since there is no need to encapsulate packets with VxLAN or other tunneling technologies, Terway offers high communication performance. Terway's network mode is shown in the following figure:
When creating a cluster in ACK with the Terway network plug-in, customers can choose to use Terway IPvlan mode. Terway IPvlan mode uses IPvlan virtualization and eBPF kernel technology to deliver a high-performance Pod and Service network.
Unlike the default Terway network mode, IPvlan mode mainly optimizes the performance of the Pod network, Service, and NetworkPolicy:
- The Pod network is implemented directly through IPvlan L2 sub-interfaces of the ENI, which greatly simplifies forwarding on the host. Pod network performance is almost the same as the host's, with latency 30% lower than in the traditional mode.
- The Service network uses eBPF instead of the original kube-proxy mode, so traffic no longer needs to be forwarded through iptables or IPVS on the host. Performance stays nearly flat as clusters grow, and scalability is better; under heavy new-connection and port-reuse workloads, request latency is significantly lower than in the IPVS and iptables modes.
- Pod NetworkPolicy also uses eBPF instead of the original iptables implementation, so large numbers of iptables rules no longer need to be generated on the host, minimizing the performance impact of network policy.
So, by using IPVlan to give every business Pod its own IPVlan interface, we keep network density high while greatly improving on the performance of the traditional Veth solution (see reference [7] for details). At the same time, Terway IPvlan mode provides a high-performance Service implementation: based on eBPF, it avoids the long-criticized conntrack performance problems.
I believe that no matter what the business scenario is, ACK with IPVlan is a better choice.
Finally, thank you for reading this far. Behind this question there is actually another hidden one: do you know why we chose IPVlan rather than MACVlan? If you understand virtual network technology, then combined with the content above you should arrive at an answer soon. You are welcome to leave your thoughts in the comments as well.
Reference links:
[1] "About IEEE 802.1Q"
https://en.wikipedia.org/wiki/IEEE_802.1Q
[2] "Docker Engine release notes"
https://docs.docker.com/engine/release-notes/prior-releases/
[3] "Merged for MACVlan 2.6.23"
https://lwn.net/Articles/241915/
[4] "MACVlan 3.19 Merge window part 2"
https://lwn.net/Articles/626150/
[5] "VLan 2.6.10-rc2 long-format changelog"
https://lwn.net/Articles/111033/
[6] "[ovs-dev] VEPA support in OVS"
https://mail.openvswitch.org/pipermail/ovs-dev/2013-December/277994.html
[7] "Alibaba Cloud Kubernetes Cluster Accelerates Pod Network Using IPVlan"
https://developer.aliyun.com/article/743689