toa kernel module analysis

Origin of TOA

LVS originally offered three load-balancing modes: DR, NAT, and Tunnel. Each has its own shortcomings: DR and NAT require the virtual server and the real servers to be in the same subnet, and Tunnel is more complicated to operate and maintain. To allow more flexible deployment, a fourth mode, FULLNAT, was developed.

FULLNAT mode extends NAT mode: it rewrites the source IP as well as the destination IP. As a result, the virtual server and the real servers are no longer constrained by the back-end network layout and do not have to be in the same subnet.

However, this mode introduces a problem: the real server can no longer see the client's real IP address. Many services need the requester's IP address for their business logic. The most common example is whitelist verification, where only addresses on the whitelist are allowed to access the service. Another is scheduling based on the client's IP, as in CDN services, which must choose the most suitable resources for each request according to the client's address.

TOA was created to solve this problem. It is a TCP option field that occupies 8 bytes (kind = 0xfe, length = 0x08, value = 4-byte client IP + 2-byte client port). The source code defines it as follows:

/* MUST be 4 bytes alignment */
struct toa_data {
    __u8 opcode;
    __u8 opsize;
    __u16 port;
    __u32 ip;
};

After the real server is patched, the real client IP address can be obtained in LVS FULLNAT mode through the getsockopt system call.

Use of TOA

To support TOA, FULLNAT patches the kernel source directly, which means recompiling the kernel and is troublesome in practice. Alternatively, toa can be loaded into the kernel as a .ko module. The following command checks whether the toa module is loaded on the current machine:

lsmod | grep toa
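
The same check can be done from a program. The sketch below reads a /proc/modules-style listing, which is where lsmod gets its data, and looks for an exact module name. The function names module_listed and toa_loaded are hypothetical. Unlike grep, it matches the whole name rather than a substring.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Return 1 if a module called `name` appears in a /proc/modules-style
 * stream, 0 otherwise. Each line starts with the module name followed
 * by a space.
 */
int module_listed(FILE *fp, const char *name)
{
    char line[512];
    size_t n;
    int at_line_start = 1;

    assert(fp != NULL && name != NULL);
    n = strlen(name);

    while (fgets(line, sizeof(line), fp) != NULL) {
        size_t len = strlen(line);

        if (at_line_start && strncmp(line, name, n) == 0 && line[n] == ' ')
            return 1;
        /* A chunk without '\n' means the next read continues this line. */
        at_line_start = (len > 0 && line[len - 1] == '\n');
    }
    return 0;
}

/* Convenience wrapper: 1 if loaded, 0 if not, -1 if the file is unreadable. */
int toa_loaded(void)
{
    FILE *fp = fopen("/proc/modules", "r");
    int found;

    if (fp == NULL)
        return -1;
    found = module_listed(fp, "toa");
    fclose(fp);
    return found;
}
```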

For instructions on compiling the toa module, refer to the document TOA Plugin Configuration.

Implementation principle of TOA

TOA works by hooking kernel functions so that it can parse the toa data from the TCP options.

Note: The Linux source version used in the following analysis is 3.2.101.

toa_init is the initialization function of the toa module:

/* module init */
static int __init
toa_init(void)
{
    ...
    /* hook funcs for parse and get toa */
    hook_toa_functions();
    ...
}

Some details are omitted above. The key call is hook_toa_functions, which installs the hooks. The IPv4 path is used as the example here.

/* replace the functions with our functions */
static inline int
hook_toa_functions(void)
{
    /* hook inet_getname for ipv4 */
    struct proto_ops *inet_stream_ops_p =
            (struct proto_ops *)&inet_stream_ops;
    
    /* hook tcp_v4_syn_recv_sock for ipv4 */
    struct inet_connection_sock_af_ops *ipv4_specific_p =
            (struct inet_connection_sock_af_ops *)&ipv4_specific;
    ...
    inet_stream_ops_p->getname = inet_getname_toa;
    ...
    ipv4_specific_p->syn_recv_sock = tcp_v4_syn_recv_sock_toa;
    return 0;
}

In the Linux source code, the IPv4 connection-socket operations are defined as follows:

/* net/ipv4/tcp_ipv4.c */
const struct inet_connection_sock_af_ops ipv4_specific = {
    ...
    .send_check    = tcp_v4_send_check,
    .conn_request  = tcp_v4_conn_request,
    .syn_recv_sock = tcp_v4_syn_recv_sock,
    .get_peer      = tcp_v4_get_peer,
};
EXPORT_SYMBOL(ipv4_specific);

The operations for stream sockets are defined as follows:

/* net/ipv4/af_inet.c */
const struct proto_ops inet_stream_ops = {
    .family   = PF_INET,
    .bind     = inet_bind,
    .connect  = inet_stream_connect,
    .accept   = inet_accept,
    .getname  = inet_getname,
    .listen   = inet_listen,
    .shutdown = inet_shutdown,
    ...
};
EXPORT_SYMBOL(inet_stream_ops);

Putting the Linux source and the toa code together, there are two key hooks:

the syn_recv_sock function pointer: tcp_v4_syn_recv_sock -> tcp_v4_syn_recv_sock_toa;
the getname function pointer: inet_getname -> inet_getname_toa.
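
The hooking technique itself is ordinary function-pointer replacement. The userspace sketch below illustrates the pattern with a made-up operations table. The names demo_ops, stream_ops, getname_hooked and the integer "addresses" are all hypothetical stand-ins. In the kernel the tables are declared const and the module casts that away. The demo table is deliberately non-const, because writing to a const object in ordinary C is undefined behaviour.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a kernel ops table such as proto_ops. */
struct demo_ops {
    int (*getname)(int *addr_out);
};

/* The "original" handler: reports the address the socket really sees. */
int orig_getname(int *addr_out)
{
    assert(addr_out != NULL);
    *addr_out = 10;           /* pretend: the load balancer's address */
    return 0;
}

static struct demo_ops stream_ops = { .getname = orig_getname };

static int real_client_addr = 42;   /* pretend: value taken from TOA */

/* The replacement: run the original first, then override its result. */
int getname_hooked(int *addr_out)
{
    int ret = orig_getname(addr_out);

    if (ret == 0)
        *addr_out = real_client_addr;
    return ret;
}

/* Install and remove the hook by swapping the pointer in the table. */
void hook_install(void) { stream_ops.getname = getname_hooked; }
void hook_remove(void)  { stream_ops.getname = orig_getname; }
```

Callers keep invoking stream_ops.getname as before. Only the pointer in the table changes, which is why code that goes through the ops table needs no modification.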

syn_recv_sock call

syn_recv_sock is invoked when the server receives the ACK packet of the third handshake step. The call path is tcp_v4_do_rcv -> tcp_v4_hnd_req -> tcp_check_req -> syn_recv_sock.

/* net/ipv4/tcp_minisocks.c */
struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
               struct request_sock *req,
               struct request_sock **prev)
{
    ...
    child = inet_csk(sk)->icsk_af_ops->syn_recv_sock(sk, skb, req, NULL);
    if (child == NULL)
        goto listen_overflow;
    ...
}

Reading this part of the Linux source also shows that the server socket is still in the TCP_LISTEN state when it receives the third handshake packet.

int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
{
    ...
    if (sk->sk_state == TCP_LISTEN) {
        struct sock *nsk = tcp_v4_hnd_req(sk, skb);
        if (!nsk)
            goto discard;

        /* the third handshake step produced a new socket, so this branch is taken */
        if (nsk != sk) {
            sock_rps_save_rxhash(nsk, skb);
            if (tcp_child_process(sk, nsk, skb)) {
                rsk = nsk;
                goto reset;
            }
            return 0;
        }
    }
    ...
}

The third handshake step creates a new socket whose initial state is TCP_SYN_RECV and which then moves to TCP_ESTABLISHED.

Now look at the replacement function tcp_v4_syn_recv_sock_toa:

static struct sock *
tcp_v4_syn_recv_sock_toa(struct sock *sk, struct sk_buff *skb,
            struct request_sock *req, struct dst_entry *dst)
{
    struct sock *newsock = NULL;

    /* run the original logic first */
    newsock = tcp_v4_syn_recv_sock(sk, skb, req, dst);

    /* parse the toa data and store it in newsock->sk_user_data */
    if (NULL != newsock && NULL == newsock->sk_user_data) {
        newsock->sk_user_data = get_toa_data(skb);
        ...
    }
    return newsock;
}

The function that parses the toa data is get_toa_data. Its core job is to locate the TOA option among the TCP options and store it, as a toa_data value, in sk_user_data. It is not analyzed further here.
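
Although get_toa_data is not analyzed here, the general idea of walking TCP options can be sketched in userspace. The code below is not the module's implementation. It is a minimal scan over a raw option buffer, using the standard option encoding (kind 0 ends the list, kind 1 is a one-byte no-op, every other option carries a length byte). The function name find_toa and the OPT_* constants are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OPT_EOL     0     /* end of option list */
#define OPT_NOP     1     /* one-byte padding */
#define OPT_TOA     254   /* TOA kind (0xfe) */
#define OPT_TOA_LEN 8

struct toa_data {
    uint8_t  opcode;
    uint8_t  opsize;
    uint16_t port;
    uint32_t ip;
};

/*
 * Scan opts[0..len) for a TOA option. On success copy its 8 bytes
 * into *out and return 1. Return 0 if absent or the list is malformed.
 */
int find_toa(const uint8_t *opts, size_t len, struct toa_data *out)
{
    size_t i = 0;

    assert(opts != NULL && out != NULL);
    while (i < len) {
        uint8_t kind = opts[i];
        uint8_t size;

        if (kind == OPT_EOL)
            return 0;
        if (kind == OPT_NOP) {
            i++;
            continue;
        }
        if (i + 1 >= len)
            return 0;                  /* no room for the length byte */
        size = opts[i + 1];
        if (size < 2 || i + size > len)
            return 0;                  /* malformed or truncated option */
        if (kind == OPT_TOA && size == OPT_TOA_LEN) {
            memcpy(out, &opts[i], sizeof(*out));
            return 1;
        }
        i += size;                     /* skip an option we do not care about */
    }
    return 0;
}
```

The bounds checks matter because the option bytes come from the network: a length byte smaller than 2 would otherwise loop forever, and one larger than the remaining buffer would read past its end.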

inet_getname call

When the client IP needs to be read from a socket, inet_getname is called.

One way to reach it is through the accept system call.

#include <sys/socket.h>
int accept(int sockfd, struct sockaddr *restrict addr,
           socklen_t *restrict addrlen);

If a sockaddr variable is passed in, the inet_getname call path is triggered:

/* net/socket.c */
SYSCALL_DEFINE4(accept4, int, fd, struct sockaddr __user *, upeer_sockaddr,
        int __user *, upeer_addrlen, int, flags)
{
    ...
    if (upeer_sockaddr) {
        if (newsock->ops->getname(newsock, (struct sockaddr *)&address,
                      &len, 2) < 0) {
            err = -ECONNABORTED;
            goto out_fd;
        }
        ...
    }
    ...
}

It can also be triggered by other system calls, such as getpeername and getsockopt (with the SO_PEERNAME option).
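
From the application's point of view nothing special is required: the usual peer-address calls are enough. The following userspace sketch asks for the peer of a connected IPv4 socket with getpeername and formats it. The helper names format_peer and peer_of are hypothetical. On a real server with the toa module loaded, the address returned would be expected to be the client's rather than the load balancer's.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* Format an IPv4 socket address as "a.b.c.d:port". Returns 0 or -1. */
int format_peer(const struct sockaddr_in *sin, char *buf, size_t len)
{
    char ip[INET_ADDRSTRLEN];
    int n;

    assert(sin != NULL && buf != NULL);
    if (inet_ntop(AF_INET, &sin->sin_addr, ip, sizeof(ip)) == NULL)
        return -1;
    n = snprintf(buf, len, "%s:%u", ip, (unsigned)ntohs(sin->sin_port));
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Ask the kernel for the peer address of a connected TCP socket.
 * This goes through the socket's getname operation in the kernel.
 */
int peer_of(int fd, char *buf, size_t len)
{
    struct sockaddr_in sin;
    socklen_t slen = sizeof(sin);

    memset(&sin, 0, sizeof(sin));
    if (getpeername(fd, (struct sockaddr *)&sin, &slen) < 0)
        return -1;
    if (sin.sin_family != AF_INET)
        return -1;
    return format_peer(&sin, buf, len);
}
```

A server would typically call peer_of on the descriptor returned by accept, or pass a sockaddr_in directly to accept and hand it to format_peer.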

Next, look at the implementation of the replacement function inet_getname_toa.

static int
inet_getname_toa(struct socket *sock, struct sockaddr *uaddr,
        int *uaddr_len, int peer)
{
    int retval = 0;
    struct sock *sk = sock->sk;
    struct sockaddr_in *sin = (struct sockaddr_in *) uaddr;
    struct toa_data tdata;

    /* call the original function */
    retval = inet_getname(sock, uaddr, uaddr_len, peer);
    
    /* if sk_user_data holds data, copy it out */
    if (retval == 0 && NULL != sk->sk_user_data && peer) {
        if (sk_data_ready_addr == (unsigned long) sk->sk_data_ready) {
            memcpy(&tdata, &sk->sk_user_data, sizeof(tdata));
            if (TCPOPT_TOA == tdata.opcode &&
                TCPOLEN_TOA == tdata.opsize) {
                sin->sin_port = tdata.port;
                sin->sin_addr.s_addr = tdata.ip;
            }
            ...
        }
        ...
    } 
    return retval;
}

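When sk_user_data holds data and that data is a valid TOA option, the address and port returned by inet_getname are replaced with the values from the option, so the caller gets the real client IP and port. Note that memcpy reads from &sk->sk_user_data, the address of the pointer field, so the 8 bytes of toa_data are kept in the pointer field itself rather than in memory it points to.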
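
The memcpy from &sk->sk_user_data suggests the option is kept by value inside the pointer-sized field. The userspace sketch below illustrates that storage trick under the assumption of 64-bit pointers. The names demo_sock, stash_toa and load_toa are hypothetical and do not appear in the module.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TCPOPT_TOA  254
#define TCPOLEN_TOA 8

struct toa_data {
    uint8_t  opcode;
    uint8_t  opsize;
    uint16_t port;
    uint32_t ip;
};

/* Hypothetical stand-in for the sk_user_data field of struct sock. */
struct demo_sock {
    void *sk_user_data;
};

/* The trick only works if the struct fits inside one pointer. */
_Static_assert(sizeof(void *) >= sizeof(struct toa_data),
               "sketch assumes 64-bit pointers");

/* Store the 8 option bytes directly in the pointer field. */
void stash_toa(struct demo_sock *sk, const struct toa_data *t)
{
    assert(sk != NULL && t != NULL);
    sk->sk_user_data = NULL;
    memcpy(&sk->sk_user_data, t, sizeof(*t));
}

/*
 * Copy the bytes back out and validate them, mirroring the opcode and
 * opsize check in inet_getname_toa. Returns 1 if they form a TOA option.
 */
int load_toa(const struct demo_sock *sk, struct toa_data *out)
{
    assert(sk != NULL && out != NULL);
    memcpy(out, &sk->sk_user_data, sizeof(*out));
    return out->opcode == TCPOPT_TOA && out->opsize == TCPOLEN_TOA;
}
```

Keeping the value inline avoids a separate allocation per connection, at the cost of the field no longer being a dereferenceable pointer. This is why the opcode and opsize are checked before the contents are trusted.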

In summary, the toa module parses the toa data into sk_user_data during the third handshake step, and then substitutes it into the result each time the peer address is requested.
