Principle Analysis

synproxy is a convenient feature for mitigating SYN flood attacks; Linux supports it through a connection-tracking extension.

The following post explains in detail why Linux synproxy was implemented this way: https://lwn.net/Articles/563151/

The following patches against nf-next.git implement a SYN proxy for
netfilter. The series applies on top of the patches I sent last week
and is split into five patches:

- a patch to split out sequence number adjustment from NAT and make it
  usable from other netfilter subsystems. This is used to translate
  sequence numbers from the server to the client once the full connection
  has been established.

  This patch contains a bit of churn, but the core is to simply move the
  code to a new file and move the sequence number adjustment data into a
  ct extend.

- a patch to extract the TCP stack independent parts of syncookie generation
  and validation and make them usable from netfilter

- the SYN proxy core and IPv4 SYNPROXY target. See below for more details.

- a similar patch to the second one for IPv6

- an IPv6 version of the SYNPROXY target


The SYNPROXY operates by marking the initial SYN from the client as UNTRACKED
and directing it to the SYNPROXY target. The target responds with a SYN/ACK
containing a cookie and encodes options such as window scaling factor, SACK
perm etc. into the timestamp, if timestamps are used (similar to TCP syncookies). The
window size is set to zero. The response is also sent as untracked packet.

When the final ACK is received the cookie is validated, the original options
extracted and a SYN to the original destination is generated. The SYN to the
original destination uses the advertised window from the final ACK and the
options from the initial SYN packet. The SYN is not sent as untracked, so
from a connection tracking POV it will look like the original packet from
the client and instantiate a new connection. When the server responds with
a SYN/ACK a final ACK for the server is generated and a window update with
the window size announced by the server is sent to the client. At this
point the connection is handed off to conntrack and the only thing the
target is still involved in is timestamp translation through the registered
hooks.

Since the SYN proxy can't know the options the server supports, they have
to be specified as parameters to the SYNPROXY target. The assumption is that
these options are constant as long as you don't change settings on the
server. Since the SYN proxy can't know the initial sequence number and
timestamp values the server will use, both have to be translated in the
direction server->client. Sequence number translation is done using the
standard sequence number translation mechanisms originally only used for
NAT, timestamps are translated in a hook registered by the SYNPROXY target.

Martin Topholm made some performance measurements with an earlier version
(that should still be valid, the only difference was that the core and IPv4
parts were in the same file) and measured a load of about 7% on an 8-way
system with 2 million SYNs per second, which without the target basically
killed the server (Martin, please correct me if I'm wrong).

The iptables patches will follow in a separate thread, testing can be done
by:

iptables -t raw -A PREROUTING -i eth0 -p tcp --dport 80 --syn -j NOTRACK
iptables -A INPUT -i eth0 -p tcp --dport 80 -m state --state UNTRACKED,INVALID \
    -j SYNPROXY --sack-perm --timestamp --mss 1480 --wscale 7 --ecn

echo 0 > /proc/sys/net/netfilter/nf_conntrack_tcp_loose

The second rule catches untracked packets and directs them to the target.
The purpose of disabling loose tracking is to have the final ACK from the
client not be picked up by conntrack, so it won't create a new conntrack
entry and will be marked INVALID and also get directed to the target.

Unfortunately I couldn't come up with a nicer way to catch just the
first SYN and final ACK from the client and not have any more packets
hit the target, but even though it doesn't look too nice, it works well.

Comments welcome.

How the synproxy tool works

  • With synproxy enabled, the proxy is transparent to the client. The three-way handshake first takes place between the client and synproxy:
  • The client sends a TCP SYN to server A.
  • When the packet reaches the firewall, the first rule above marks it UNTRACKED, so the SYN is not connection-tracked.
  • The UNTRACKED TCP SYN then matches the second rule, which executes the SYNPROXY action.
  • SYNPROXY captures the packet, records the relevant information, then impersonates server A and sends a TCP SYN+ACK back to the client (with the server's IP as source). This reply leaves through the OUTPUT hook; because the SYN was not tracked and nf_conntrack_tcp_loose=0, conntrack marks the SYN+ACK INVALID (note: not UNTRACKED) and creates no CT entry.
  • The client answers with a TCP ACK, which is likewise marked INVALID, matches the second rule, and is handled by SYNPROXY.
  • Once the client has completed the three-way handshake with SYNPROXY, SYNPROXY immediately completes a three-way handshake with the real server, forging a SYN so the real server believes the client is connecting to it:
  • SYNPROXY sends a TCP SYN to the real server A. This is a new connection; the packet enters netfilter through the OUTPUT hook and creates a conntrack entry in state NEW. The source IP is the client's, the destination IP is the real server's.
  • The real server A sends a SYN+ACK towards the client.
  • When SYNPROXY receives the server's SYN+ACK, it replies with an ACK, and the CT state becomes ESTABLISHED.
  • Once the conntrack entry reaches ESTABLISHED, SYNPROXY lets the client and the real server communicate directly.

So SYNPROXY can handle any kind of TCP traffic. It also works for encrypted traffic, since SYNPROXY never looks at the TCP payload.

The actual synproxy implementation differs slightly from the description above, because synproxy also has to take care of several TCP details, mainly:

  • syncookies
  • TCP options
  • sequence numbers
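As background, the basic SYN-cookie idea the target relies on can be sketched in userspace C. This is a deliberately simplified toy model: the real kernel computes the cookie with a cryptographic hash over the 4-tuple plus a time counter and encodes the MSS differently; the hash function, table and names below are stand-ins for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified SYN-cookie sketch.  A toy mixing function stands in for the
 * kernel's real hash.  The low 2 bits of the ISN carry an index into a
 * small MSS table, so the proxy can recover the client's MSS from the
 * returning ACK (whose ack_seq - 1 echoes this ISN) without keeping state. */
static const uint16_t mss_table[] = { 536, 1300, 1440, 1460 };

static uint32_t toy_hash(uint32_t saddr, uint32_t daddr,
                         uint16_t sport, uint16_t dport, uint32_t count)
{
    uint32_t h = saddr * 2654435761u ^ daddr * 2246822519u;
    h ^= ((uint32_t)sport << 16 | dport) * 3266489917u;
    h ^= count * 668265263u;            /* count: coarse time counter */
    return h;
}

/* Build an ISN that hides the MSS table index in the two low bits. */
static uint32_t cookie_make(uint32_t saddr, uint32_t daddr,
                            uint16_t sport, uint16_t dport,
                            uint32_t count, int mss_idx)
{
    return (toy_hash(saddr, daddr, sport, dport, count) & ~3u)
           | (uint32_t)mss_idx;
}

/* Validate the echoed cookie and recover the MSS; -1 on hash mismatch. */
static int cookie_check(uint32_t saddr, uint32_t daddr,
                        uint16_t sport, uint16_t dport,
                        uint32_t count, uint32_t cookie)
{
    if ((cookie & ~3u) !=
        (toy_hash(saddr, daddr, sport, dport, count) & ~3u))
        return -1;
    return mss_table[cookie & 3u];
}
```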

Source Code Analysis

The SYNPROXY target implementation

The three-way-handshake packets from the client match rule 1 above:

iptables -t raw -A PREROUTING -i eth0 -p tcp --dport 80 --syn -j NOTRACK
echo 0 > /proc/sys/net/netfilter/nf_conntrack_tcp_loose

so netfilter creates no conntrack entry for them; they then match the second rule and are handed to SYNPROXY.

iptables -A INPUT -i eth0 -p tcp --dport 80 -m state --state UNTRACKED,INVALID \
    -j SYNPROXY --sack-perm --timestamp --mss 1480 --wscale 7 --ecn
The target is registered as:

static struct xt_target synproxy_tg4_reg __read_mostly = {
    .name        = "SYNPROXY",
    .family        = NFPROTO_IPV4,
    .hooks        = (1 << NF_INET_LOCAL_IN) | (1 << NF_INET_FORWARD),
    .target        = synproxy_tg4,
    .targetsize    = sizeof(struct xt_synproxy_info),
    .checkentry    = synproxy_tg4_check,
    .destroy    = synproxy_tg4_destroy,
    .me        = THIS_MODULE,
};

Handling TCP options

Since SYNPROXY cannot know which TCP options the real server supports, the relevant option parameters have to be configured on the SYNPROXY target; once set they never change, effectively acting as constants. Rule 2 sets five options, --sack-perm --timestamp --mss 1480 --wscale 7 --ecn, which are stored in the following structure:

#define XT_SYNPROXY_OPT_MSS        0x01
#define XT_SYNPROXY_OPT_WSCALE        0x02
#define XT_SYNPROXY_OPT_SACK_PERM    0x04
#define XT_SYNPROXY_OPT_TIMESTAMP    0x08
#define XT_SYNPROXY_OPT_ECN        0x10

struct xt_synproxy_info {
    __u8    options;//bitmask of the supported options
    __u8    wscale;//window scale factor
    __u16    mss;//maximum segment size
};
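A minimal userspace sketch of how this configured bitmask is intersected with what the client's SYN actually offered, mirroring the logic shown later in synproxy_tg4 (the function name negotiate_options is invented for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Flag values copied from xt_synproxy_info */
#define XT_SYNPROXY_OPT_MSS       0x01
#define XT_SYNPROXY_OPT_WSCALE    0x02
#define XT_SYNPROXY_OPT_SACK_PERM 0x04
#define XT_SYNPROXY_OPT_TIMESTAMP 0x08
#define XT_SYNPROXY_OPT_ECN       0x10

/* Only options that both the client offered and the rule configured
 * survive.  Without timestamps, wscale/sack-perm/ecn cannot be carried
 * inside the cookie and are cleared, as synproxy_tg4() does. */
static uint8_t negotiate_options(uint8_t client_opts, uint8_t rule_opts)
{
    uint8_t opts = client_opts & rule_opts;

    if (!(opts & XT_SYNPROXY_OPT_TIMESTAMP))
        opts &= ~(XT_SYNPROXY_OPT_WSCALE |
                  XT_SYNPROXY_OPT_SACK_PERM |
                  XT_SYNPROXY_OPT_ECN);
    return opts;
}
```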

Initialization

static int __net_init synproxy_net_init(struct net *net)
{
    struct synproxy_net *snet = synproxy_pernet(net);
    struct nf_conn *ct;
    int err = -ENOMEM;
    //create a ct template; this sets the IPS_TEMPLATE_BIT flag
    ct = nf_ct_tmpl_alloc(net, &nf_ct_zone_dflt, GFP_KERNEL);
    if (!ct)
        goto err1;
    //add the sequence-number adjustment extension
    if (!nfct_seqadj_ext_add(ct))
        goto err2;
    //add the synproxy extension
    if (!nfct_synproxy_ext_add(ct))
        goto err2;
    //set IPS_CONFIRMED_BIT
    __set_bit(IPS_CONFIRMED_BIT, &ct->status);
    nf_conntrack_get(&ct->ct_general);
    snet->tmpl = ct;

    snet->stats = alloc_percpu(struct synproxy_stats);
    if (snet->stats == NULL)
        goto err2;

    err = synproxy_proc_init(net);
    if (err < 0)
        goto err3;

    return 0;

err3:
    free_percpu(snet->stats);
err2:
    nf_ct_tmpl_free(ct);
err1:
    return err;
}

synproxy_tg4

static unsigned int
synproxy_tg4(struct sk_buff *skb, const struct xt_action_param *par)
{
    //fetch the SYNPROXY configuration parameters
    const struct xt_synproxy_info *info = par->targinfo;
    struct net *net = xt_net(par);
    struct synproxy_net *snet = synproxy_pernet(net);
    struct synproxy_options opts = {};
    struct tcphdr *th, _th;

    if (nf_ip_checksum(skb, xt_hooknum(par), par->thoff, IPPROTO_TCP))
        return NF_DROP;

    th = skb_header_pointer(skb, par->thoff, sizeof(_th), &_th);
    if (th == NULL)
        return NF_DROP;
    //parse the options of the SYN packet
    if (!synproxy_parse_options(skb, par->thoff, th, &opts))
        return NF_DROP;
    
    if (th->syn && !(th->ack || th->fin || th->rst)) {//SYN packet
        /* Initial SYN from client */
        this_cpu_inc(snet->stats->syn_received);

        if (th->ece && th->cwr)
            opts.options |= XT_SYNPROXY_OPT_ECN;
        //intersect with the configured options
        opts.options &= info->options;
        //Syncookies can carry the extra options only when both client and
        //server support the timestamp option: sack-perm, MSS and the window
        //scale factor appear only in the SYN, so the syncookie stashes them
        //in the timestamp option and recovers them from the ACK.
        if (opts.options & XT_SYNPROXY_OPT_TIMESTAMP)
            synproxy_init_timestamp_cookie(info, &opts);
        else//without timestamp support, these options cannot be advertised.
            opts.options &= ~(XT_SYNPROXY_OPT_WSCALE |
                      XT_SYNPROXY_OPT_SACK_PERM |
                      XT_SYNPROXY_OPT_ECN);
        //send the SYN+ACK to the client
        synproxy_send_client_synack(net, skb, th, &opts);
        consume_skb(skb);
        return NF_STOLEN;
    } else if (th->ack && !(th->fin || th->rst || th->syn)) {//ACK packet
        /* ACK from client */
        if (synproxy_recv_client_ack(net, skb, th, &opts, ntohl(th->seq))) {
            consume_skb(skb);
            return NF_STOLEN;
        } else {
            return NF_DROP;
        }
    }

    return XT_CONTINUE;
}

Sending the SYN+ACK to the client

static void
synproxy_send_client_synack(struct net *net,
                const struct sk_buff *skb, const struct tcphdr *th,
                const struct synproxy_options *opts)
{
    struct sk_buff *nskb;
    struct iphdr *iph, *niph;
    struct tcphdr *nth;
    unsigned int tcp_hdr_size;
    u16 mss = opts->mss;

    iph = ip_hdr(skb);

    tcp_hdr_size = sizeof(*nth) + synproxy_options_size(opts);
    nskb = alloc_skb(sizeof(*niph) + tcp_hdr_size + MAX_TCP_HEADER,
             GFP_ATOMIC);
    if (nskb == NULL)
        return;
    skb_reserve(nskb, MAX_TCP_HEADER);
    //build the IP header
    niph = synproxy_build_ip(net, nskb, iph->daddr, iph->saddr);

    skb_reset_transport_header(nskb);
    nth = skb_put(nskb, tcp_hdr_size);
    nth->source    = th->dest;
    nth->dest    = th->source;
    //Compute synproxy's initial sequence number as a syncookie: the MSS is
    //folded into the ISN and recovered when the client's ACK arrives.
    nth->seq    = htonl(__cookie_v4_init_sequence(iph, th, &mss));//the MSS is an input to the syncookie
    nth->ack_seq    = htonl(ntohl(th->seq) + 1);
    tcp_flag_word(nth) = TCP_FLAG_SYN | TCP_FLAG_ACK;
    if (opts->options & XT_SYNPROXY_OPT_ECN)
        tcp_flag_word(nth) |= TCP_FLAG_ECE;
    nth->doff    = tcp_hdr_size / 4;
    nth->window    = 0;
    nth->check    = 0;
    nth->urg_ptr    = 0;
    //build the TCP options
    synproxy_build_options(nth, opts);
    //the first packet was NOTRACKed, so there is no ct
    synproxy_send_tcp(net, skb, nskb, skb_nfct(skb),
              IP_CT_ESTABLISHED_REPLY, niph, nth, tcp_hdr_size);
}

Sending the SYN to the real server

//Send a SYN to the real server.
//recv_seq is the sequence number of the client's ACK, which is one greater
//than that of its SYN; we use recv_seq minus 1 as the sequence number of the
//SYN sent to the server, so the request direction needs no sequence adjustment.
static void
synproxy_send_server_syn(struct net *net,
             const struct sk_buff *skb, const struct tcphdr *th,
             const struct synproxy_options *opts, u32 recv_seq)
{
    struct synproxy_net *snet = synproxy_pernet(net);
    struct sk_buff *nskb;
    struct iphdr *iph, *niph;
    struct tcphdr *nth;
    unsigned int tcp_hdr_size;

    iph = ip_hdr(skb);
    //compute the header size
    tcp_hdr_size = sizeof(*nth) + synproxy_options_size(opts);
    nskb = alloc_skb(sizeof(*niph) + tcp_hdr_size + MAX_TCP_HEADER,
             GFP_ATOMIC);
    if (nskb == NULL)
        return;
    skb_reserve(nskb, MAX_TCP_HEADER);
    //build the IP header
    niph = synproxy_build_ip(net, nskb, iph->saddr, iph->daddr);
    //reset the transport header
    skb_reset_transport_header(nskb);
    nth = skb_put(nskb, tcp_hdr_size);
    nth->source    = th->source;
    nth->dest    = th->dest;
    nth->seq    = htonl(recv_seq - 1);//the ACK's sequence number minus 1 becomes the SYN's sequence number
    /* ack_seq is used to relay our ISN to the synproxy hook to initialize
     * sequence number translation once a connection tracking entry exists.
     * Setting nth->ack_seq here matters: the client's ack_seq minus 1 (which
     * is synproxy's initial send sequence number) is stashed in nth->ack_seq
     * so the hook can record it in the synproxy extension; see
     * ipv4_synproxy_hook for details.
     */
    nth->ack_seq    = htonl(ntohl(th->ack_seq) - 1);
    tcp_flag_word(nth) = TCP_FLAG_SYN;//set the SYN flag
    if (opts->options & XT_SYNPROXY_OPT_ECN)//set the ECN flags
        tcp_flag_word(nth) |= TCP_FLAG_ECE | TCP_FLAG_CWR;
    nth->doff    = tcp_hdr_size / 4;
    nth->window    = th->window;//use the client's window
    nth->check    = 0;
    nth->urg_ptr    = 0;
    //build the options
    synproxy_build_options(nth, opts);
    //The SYN flag is set; this packet will create the request-direction ct for
    //the proxy. The ct template &snet->tmpl->ct_general is passed to
    //synproxy_send_tcp, which attaches it to the skb's nfct; a real ct is later
    //created from the template at the OUTPUT hook. The template carries the
    //seqadj and synproxy extensions.
    synproxy_send_tcp(net, skb, nskb, &snet->tmpl->ct_general, IP_CT_NEW,
              niph, nth, tcp_hdr_size);
}

static void
synproxy_send_tcp(struct net *net,
          const struct sk_buff *skb, struct sk_buff *nskb,
          struct nf_conntrack *nfct, enum ip_conntrack_info ctinfo,
          struct iphdr *niph, struct tcphdr *nth,
          unsigned int tcp_hdr_size)
{
    nth->check = ~tcp_v4_check(tcp_hdr_size, niph->saddr, niph->daddr, 0);
    nskb->ip_summed   = CHECKSUM_PARTIAL;
    nskb->csum_start  = (unsigned char *)nth - nskb->head;
    nskb->csum_offset = offsetof(struct tcphdr, check);

    skb_dst_set_noref(nskb, skb_dst(skb));
    nskb->protocol = htons(ETH_P_IP);
    if (ip_route_me_harder(net, nskb, RTN_UNSPEC))
        goto free_nskb;

    if (nfct) {
        nf_ct_set(nskb, (struct nf_conn *)nfct, ctinfo);
        nf_conntrack_get(nfct);
    }

    ip_local_out(net, nskb->sk, nskb);
    return;

free_nskb:
    kfree_skb(nskb);
}

synproxy_tg4 completes synproxy's three-way handshake with the client and starts the handshake with the server. These packets are handled by synproxy_tg4 itself; the hook functions synproxy registers do not process them.
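The sequence-number bookkeeping of the server-bound SYN shown above boils down to two lines of wrap-safe u32 arithmetic; a sketch (helper names invented for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* The client's final ACK carries SEG.SEQ = client_isn + 1, so recv_seq - 1
 * reproduces the client's ISN exactly; the client->server direction therefore
 * never needs sequence adjustment.  Likewise, the client's ack_seq - 1
 * recovers the ISN the proxy chose in its SYN+ACK, which the hook later
 * stores in synproxy->isn. */
static uint32_t server_syn_seq(uint32_t recv_seq)
{
    return recv_seq - 1;            /* == client ISN, wrap-safe u32 math */
}

static uint32_t relayed_proxy_isn(uint32_t client_ack_seq)
{
    return client_ack_seq - 1;      /* == ISN from the proxy's SYN+ACK */
}
```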

The hook functions registered by synproxy

static const struct nf_hook_ops ipv4_synproxy_ops[] = {
    {
        .hook        = ipv4_synproxy_hook,
        .pf        = NFPROTO_IPV4,
        .hooknum    = NF_INET_LOCAL_IN,
        .priority    = NF_IP_PRI_CONNTRACK_CONFIRM - 1,//very low priority, just before CONFIRM.
    },
    {
        .hook        = ipv4_synproxy_hook,
        .pf        = NFPROTO_IPV4,
        .hooknum    = NF_INET_POST_ROUTING,
        .priority    = NF_IP_PRI_CONNTRACK_CONFIRM - 1,//very low priority, just before CONFIRM.
    },
};

ipv4_synproxy_hook

After synproxy receives the client's ACK it sends a SYN to the server; at that point a CT entry is created from the template, in state NEW. On its way through NF_INET_POST_ROUTING the packet passes through the ipv4_synproxy_hook function.

static unsigned int ipv4_synproxy_hook(void *priv,
                       struct sk_buff *skb,
                       const struct nf_hook_state *nhs)
{
    struct net *net = nhs->net;
    struct synproxy_net *snet = synproxy_pernet(net);
    enum ip_conntrack_info ctinfo;
    struct nf_conn *ct;
    struct nf_conn_synproxy *synproxy;
    struct synproxy_options opts = {};
    const struct ip_ct_tcp *state;
    struct tcphdr *th, _th;
    unsigned int thoff;
    //the initial syn, syn-ack and ack packets carry no ct, so bail out
    ct = nf_ct_get(skb, &ctinfo);
    if (ct == NULL)
        return NF_ACCEPT;
    //fetch the conntrack's synproxy extension
    synproxy = nfct_synproxy(ct);
    if (synproxy == NULL)
        return NF_ACCEPT;
    //skip packets from the loopback interface and non-TCP packets
    if (nf_is_loopback_packet(skb) ||
        ip_hdr(skb)->protocol != IPPROTO_TCP)
        return NF_ACCEPT;
    //locate the TCP header
    thoff = ip_hdrlen(skb);
    th = skb_header_pointer(skb, thoff, sizeof(_th), &_th);
    if (th == NULL)
        return NF_DROP;

    state = &ct->proto.tcp;
    switch (state->state) {
    case TCP_CONNTRACK_CLOSE:
        if (th->rst && !test_bit(IPS_SEEN_REPLY_BIT, &ct->status)) {
            nf_ct_seqadj_init(ct, ctinfo, synproxy->isn -
                              ntohl(th->seq) + 1);
            break;
        }

        if (!th->syn || th->ack ||
            CTINFO2DIR(ctinfo) != IP_CT_DIR_ORIGINAL)
            break;

        /* Reopened connection - reset the sequence number and timestamp
         * adjustments, they will get initialized once the connection is
         * reestablished.
         */
        nf_ct_seqadj_init(ct, ctinfo, 0);
        synproxy->tsoff = 0;
        this_cpu_inc(snet->stats->conn_reopened);

        /* fall through */
    case TCP_CONNTRACK_SYN_SENT:
        if (!synproxy_parse_options(skb, thoff, th, &opts))
            return NF_DROP;
/*
        ** The conntrack ends up in this state when the proxy's SYN to the server,
        ** or the server's SYN+ACK, was lost, and the client then sends a keep-alive.
        ** Keep-alive probes:
        ** A keep-alive probe is an empty segment (or one byte) whose sequence number
        ** equals the highest sequence number ACKed by the peer minus 1. Since that
        ** byte was already received successfully, the probe does not disturb the
        ** data stream, but the response to it tells whether the connection is still
        ** alive. The receiver treats it as old, already-received data, so it is not
        ** added to the stream, yet an ACK is still sent. Neither the probe nor its
        ** response is retransmitted on loss; the prober simply does not retransmit
        ** and the responder's ACK cannot retransmit itself, which is why a probe
        ** count is needed.
*/
        if (!th->syn && th->ack &&
            CTINFO2DIR(ctinfo) == IP_CT_DIR_ORIGINAL) {
            /* Keep-Alives are sent with SEG.SEQ = SND.NXT-1,
             * therefore we need to add 1 to make the SYN sequence
             * number match the one of first SYN.
             */
            if (synproxy_recv_client_ack(net, skb, th, &opts,
                             ntohl(th->seq) + 1)) {
                this_cpu_inc(snet->stats->cookie_retrans);
                consume_skb(skb);
                return NF_STOLEN;
            } else {
                return NF_DROP;
            }
        }
        //Earlier, the server-bound SYN's th->ack_seq was filled with synproxy's
        //initial send sequence number.
        synproxy->isn = ntohl(th->ack_seq);
        //record the cookie timestamp
        if (opts.options & XT_SYNPROXY_OPT_TIMESTAMP)
            synproxy->its = opts.tsecr;//the echoed timestamp is the SYN+ACK's initial timestamp

        nf_conntrack_event_cache(IPCT_SYNPROXY, ct);
        break;
    case TCP_CONNTRACK_SYN_RECV://the conntrack is in this state when the server's SYN+ACK arrives.
        if (!th->syn || !th->ack)
            break;
        //parse the options
        if (!synproxy_parse_options(skb, thoff, th, &opts))
            return NF_DROP;
        //This code is misplaced: it corrupts the timestamp of the ACK sent to
        //the server. It should run after synproxy_send_server_ack.
        if (opts.options & XT_SYNPROXY_OPT_TIMESTAMP) {
            //record the timestamp delta
            synproxy->tsoff = opts.tsval - synproxy->its;
            nf_conntrack_event_cache(IPCT_SYNPROXY, ct);
        }

        opts.options &= ~(XT_SYNPROXY_OPT_MSS |
                  XT_SYNPROXY_OPT_WSCALE |
                  XT_SYNPROXY_OPT_SACK_PERM);
        //opts.tsecr is synproxy's composite cookie timestamp; opts.tsval is
        //the server SYN+ACK's timestamp. We need to ACK the server, so swap them.
        swap(opts.tsval, opts.tsecr);
        //Send the ACK to the server. Its timestamp will be adjusted, which is
        //a bug: this packet should not actually be modified.
        synproxy_send_server_ack(net, state, skb, th, &opts);
        //Initialize the sequence-number adjustment context (this is the reply
        //direction of the conntrack). Sequence numbers from server to client
        //must be shifted, because during the first handshake the client learned
        //the ISN synproxy generated; the offset is the difference between
        //synproxy's ISN and the real server's. ACK numbers from client to
        //server are shifted back by the same amount. th->seq here is the
        //server's initial send sequence number towards the client.
        nf_ct_seqadj_init(ct, ctinfo, synproxy->isn - ntohl(th->seq));
        nf_conntrack_event_cache(IPCT_SEQADJ, ct);
        //Send the client an ACK to synchronize the timestamp, window scale
        //factor and other options. The timestamps are swapped back, the reverse
        //of the server-bound ACK; this packet leaves through OUTPUT, re-enters
        //this hook at POST_ROUTING, and synproxy_tstamp_adjust fixes its
        //timestamp there.
        swap(opts.tsval, opts.tsecr);
        synproxy_send_client_ack(net, skb, th, &opts);

        consume_skb(skb);
        return NF_STOLEN;
    default:
        break;
    }
    //TCP window-update packets sent to the client get their timestamps adjusted here.
    synproxy_tstamp_adjust(skb, thoff, th, ct, ctinfo, synproxy);
    return NF_ACCEPT;
}

Timestamp adjustment

TCP timestamps serve the following purposes (adapted from the web; for details see https://blog.csdn.net/u011130...):

(1) RTT (Round Trip Time) measurement

   RTT matters greatly for congestion control (for example, deciding how long to wait before retransmitting). The usual way to measure RTT is to record the send time t1 of a segment and the time t2 when its acknowledgment arrives; t2 - t1 gives the RTT. But TCP uses delayed acknowledgments, and ACKs can be lost, so when an ACK arrives it may be impossible to tell which segment it acknowledges.

(2) Protection against fast sequence-number wraparound

  TCP decides whether data is new or old by checking whether its sequence number falls within snd.una to snd.una + 2^31, while the whole sequence space is only 2^32, about 4.29G. On a 10-gigabit LAN, wrapping through 4.29 GB of data takes only a few seconds, at which point TCP can no longer reliably tell new data from old.

(3) Carrying option information for SYN cookies

  With SYN cookies enabled, the server keeps no connection state after receiving a SYN, so the options the SYN carried (WScale, SACK) cannot be saved; once the cookie validates and the connection is established, those options cannot be enabled.

  The timestamp option solves all three problems.

  Solution to (1): write the send time into the timestamp option of each outgoing segment; the echoed value in an incoming ACK tells when the acknowledged segment was sent, and current time minus the echo gives an RTT sample.

  Solution to (2): record the timestamp of each received segment and compare it with the next one. The timestamp wraps at a rate that depends only on the peer's clock frequency. Linux uses its local clock tick count (jiffies) as the timestamp; if one tick takes 1 ms, wrapping halfway takes about 24.8 days, so as long as segment lifetime stays below that, new and old data cannot be confused. This mechanism is PAWS (Protect Against Wrapped Sequence numbers). It solves problem (2) for now, but as clock frequencies rise, timestamps wrap faster, so using timestamps against sequence wraparound will eventually run into trouble as well.

  Solution to (3): encode the WScale and SACK information into the 32-bit timestamp value; when the ACK arrives during connection setup, decode its echoed timestamp to recover WScale and SACK (see section "3.6 SYN Cookie" of the referenced blog).
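A userspace sketch of this option-in-timestamp encoding, modeled on the kernel's synproxy_init_timestamp_cookie (the bit layout follows the kernel: wscale in bits 0-3 with 0xf meaning "no wscale", SACK-perm in bit 4, ECN in bit 5; the helper names are invented for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* The proxy rounds its tsval down to a multiple of 64 and hides the
 * negotiated options in the 6 low bits.  The client echoes this tsval in
 * its ACK's tsecr, letting the proxy recover the options statelessly. */
#define TS_WSCALE_MASK 0x0fu
#define TS_SACK_BIT    (1u << 4)
#define TS_ECN_BIT     (1u << 5)

static uint32_t ts_cookie_encode(uint32_t now, int wscale, int sack, int ecn)
{
    uint32_t tsval = now & ~0x3fu;      /* keep the clock in the high bits */

    tsval |= (wscale >= 0 ? (uint32_t)wscale : TS_WSCALE_MASK);
    if (sack)
        tsval |= TS_SACK_BIT;
    if (ecn)
        tsval |= TS_ECN_BIT;
    return tsval;
}

static void ts_cookie_decode(uint32_t tsecr, int *wscale, int *sack, int *ecn)
{
    uint32_t w = tsecr & TS_WSCALE_MASK;

    *wscale = (w == TS_WSCALE_MASK) ? -1 : (int)w;  /* -1: no wscale */
    *sack = !!(tsecr & TS_SACK_BIT);
    *ecn  = !!(tsecr & TS_ECN_BIT);
}
```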

The first timestamp the client sees is the syncookie composite timestamp synproxy put in its SYN+ACK; later packets carry the server's timestamps, which may be smaller or larger than that initial value. The first timestamp the server sees is the one in the timestamp option of the client's ACK.

#   From       To         Type                   Timestamp (tsval)       Timestamp echo (tsecr)
1   client     synproxy   SYN                    client-timestamp1       0
2   synproxy   client     SYN+ACK                syn-cookie-timestamp    client-timestamp1
3   client     synproxy   ACK                    client-timestamp2       syn-cookie-timestamp
4   synproxy   server     SYN                    client-timestamp2       syn-cookie-timestamp
5   server     synproxy   SYN+ACK                server-timestamp1       client-timestamp2
6   synproxy   server     ACK                    client-timestamp2 (*)   server-timestamp1
7   synproxy   client     ACK (window update)    server-timestamp1 (**)  client-timestamp2

(*) this timestamp gets adjusted, although it should not be; this is a bug.
(**) this timestamp gets adjusted.

In short: the request direction needs its echoed timestamp adjusted, while the reply direction needs its sent timestamp adjusted.

struct nf_conn_synproxy {
    u32    isn;//synproxy's initial send sequence number (for the reply direction), relayed via the server-bound SYN's ack_seq
    u32    its;//initial timestamp: the syn-cookie timestamp synproxy generated
    u32    tsoff;//timestamp delta: the server SYN+ACK's tsval minus its
};
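The two adjustments synproxy_tstamp_adjust performs reduce to simple offset arithmetic; a toy sketch (function names invented for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* tsoff = server_tsval - its (the proxy's cookie timestamp), computed when
 * the server's SYN+ACK arrives.  Server->client packets get their tsval
 * shifted down by tsoff, so the client keeps seeing the timestamp clock it
 * learned from the proxy's SYN+ACK; the tsecr the client echoes is shifted
 * back up before it reaches the server. */
static uint32_t adjust_reply_tsval(uint32_t server_tsval, uint32_t tsoff)
{
    return server_tsval - tsoff;    /* reply direction: rewrite tsval */
}

static uint32_t adjust_orig_tsecr(uint32_t client_tsecr, uint32_t tsoff)
{
    return client_tsecr + tsoff;    /* request direction: rewrite tsecr */
}
```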

unsigned int synproxy_tstamp_adjust(struct sk_buff *skb,
                    unsigned int protoff,
                    struct tcphdr *th,
                    struct nf_conn *ct,
                    enum ip_conntrack_info ctinfo,
                    const struct nf_conn_synproxy *synproxy)
{
    unsigned int optoff, optend;
    __be32 *ptr, old;

    if (synproxy->tsoff == 0)
        return 1;

    optoff = protoff + sizeof(struct tcphdr);
    optend = protoff + th->doff * 4;

    if (!skb_make_writable(skb, optend))
        return 0;

    while (optoff < optend) {
        unsigned char *op = skb->data + optoff;

        switch (op[0]) {
        case TCPOPT_EOL:
            return 1;
        case TCPOPT_NOP:
            optoff++;
            continue;
        default:
            if (optoff + 1 == optend ||
                optoff + op[1] > optend ||
                op[1] < 2)
                return 0;
            if (op[0] == TCPOPT_TIMESTAMP &&
                op[1] == TCPOLEN_TIMESTAMP) {
                if (CTINFO2DIR(ctinfo) == IP_CT_DIR_REPLY) {//reply direction: adjust the sent timestamp (tsval)
                    ptr = (__be32 *)&op[2];
                    old = *ptr;
                    *ptr = htonl(ntohl(*ptr) -
                             synproxy->tsoff);
                } else {//request direction: adjust the echoed timestamp (tsecr)
                    ptr = (__be32 *)&op[6];
                    old = *ptr;
                    *ptr = htonl(ntohl(*ptr) +
                             synproxy->tsoff);
                }
                inet_proto_csum_replace4(&th->check, skb,
                             old, *ptr, false);
                return 1;
            }
            optoff += op[1];
        }
    }
    return 1;
}

seqadj adjustment

Because synproxy sends the server a SYN with the same sequence number the client used towards synproxy, seqadj only needs to rewrite the acknowledgment numbers in the request direction and the sequence numbers in the reply direction.

While the client is handshaking with synproxy, synproxy cannot know the server's initial sequence number, and it generates its own ISN via syncookies; the ISN synproxy hands the client therefore differs from the server's ISN, which is what creates the need for seqadj.

When synproxy receives the server's SYN+ACK, it calls nf_ct_seqadj_init(ct, ctinfo, synproxy->isn - ntohl(th->seq)); to initialize the sequence-adjustment extension; the actual rewriting is then done from ipv4_confirm.

int nf_ct_seqadj_init(struct nf_conn *ct, enum ip_conntrack_info ctinfo,
              s32 off)
{
    enum ip_conntrack_dir dir = CTINFO2DIR(ctinfo);
    struct nf_conn_seqadj *seqadj;
    struct nf_ct_seqadj *this_way;

    if (off == 0)
        return 0;

    set_bit(IPS_SEQ_ADJUST_BIT, &ct->status);

    seqadj = nfct_seqadj(ct);
    this_way = &seqadj->seq[dir];
    this_way->offset_before     = off;//both values are identical here, unlike in the ALG case.
    this_way->offset_after     = off;
    return 0;
}
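The resulting offset arithmetic can be sketched in userspace (helper names invented for illustration): with off = proxy_isn - server_isn, reply-direction sequence numbers get off added and request-direction acknowledgment numbers get off subtracted.

```c
#include <assert.h>
#include <stdint.h>

/* off = proxy_isn - server_isn in wrap-safe u32 arithmetic.  Sequence
 * numbers the server sends are shifted into the ISN space the client
 * negotiated with the proxy; the acknowledgment numbers the client sends
 * are shifted back into the server's space. */
static uint32_t seqadj_reply_seq(uint32_t seq, uint32_t proxy_isn,
                                 uint32_t server_isn)
{
    return seq + (proxy_isn - server_isn);
}

static uint32_t seqadj_orig_ack(uint32_t ack_seq, uint32_t proxy_isn,
                                uint32_t server_isn)
{
    return ack_seq - (proxy_isn - server_isn);
}
```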

The role of nf_conntrack_tcp_loose

echo 0 > /proc/sys/net/netfilter/nf_conntrack_tcp_loose

Setting this to 0 makes connection tracking strictly validate TCP state transitions, so packets that do not follow the three-way-handshake order will not create a CT entry (from user space they show up as INVALID). It has to be combined with the command below, which exempts SYN packets from tracking and thereby breaks the handshake as conntrack sees it.

iptables -t raw -A PREROUTING -i eth0 -p tcp --dport 80 --syn -j NOTRACK

Because the SYN created no conntrack entry, the subsequent SYN+ACK and ACK packets enter conntrack as new flows and go through the following function:

/* Called when a new connection for this protocol found. */
static bool tcp_new(struct nf_conn *ct, const struct sk_buff *skb,
            unsigned int dataoff, unsigned int *timeouts)
{
    ...

    /* Don't need lock here: this conntrack not in circulation yet. Look up the next state. */
    new_state = tcp_conntracks[0][get_conntrack_index(th)][TCP_CONNTRACK_NONE];
    //new_state will be TCP_CONNTRACK_MAX
    /* Invalid: delete conntrack; unsuitable state, give up tracking */
    if (new_state >= TCP_CONNTRACK_MAX) {
        pr_debug("nf_ct_tcp: invalid new deleting.\n");
        return false;
    }

    if (new_state == TCP_CONNTRACK_SYN_SENT) {/* first packet: initialize the state */
        memset(&ct->proto.tcp, 0, sizeof(ct->proto.tcp));
        /* SYN packet */
        ct->proto.tcp.seen[0].td_end =
            segment_seq_plus_len(ntohl(th->seq), skb->len,
                         dataoff, th);
        ct->proto.tcp.seen[0].td_maxwin = ntohs(th->window);
        if (ct->proto.tcp.seen[0].td_maxwin == 0)
            ct->proto.tcp.seen[0].td_maxwin = 1;
        ct->proto.tcp.seen[0].td_maxend =
            ct->proto.tcp.seen[0].td_end;

        tcp_options(skb, dataoff, th, &ct->proto.tcp.seen[0]);
    } else if (tn->tcp_loose == 0) {/* strict mode: refuse to pick up mid-stream; this is what nf_conntrack_tcp_loose == 0 does */
        /* Don't try to pick up connections. */
        return false;//return false: no conntrack entry is created.
    } else 
    ...
    return true;
}

synproxy conntrack handling

When synproxy sends the SYN packet to the server, it makes the call below; in this case a template CT is attached to the packet. That is the biggest difference from all other cases, and it directly shapes how the CT entry is created when the packet enters conntrack at the OUTPUT hook:

//The SYN flag is set; this creates the request-direction ct for the syn proxy.
synproxy_send_tcp(net, skb, nskb, &snet->tmpl->ct_general, IP_CT_NEW,
                  niph, nth, tcp_hdr_size);
unsigned int
nf_conntrack_in(struct net *net, u_int8_t pf, unsigned int hooknum,
        struct sk_buff *skb)
{
    const struct nf_conntrack_l3proto *l3proto;
    const struct nf_conntrack_l4proto *l4proto;
    struct nf_conn *ct, *tmpl;
    enum ip_conntrack_info ctinfo;
    unsigned int *timeouts;
    unsigned int dataoff;
    u_int8_t protonum;
    int ret;
    //this SYN packet carries a ct template
    tmpl = nf_ct_get(skb, &ctinfo);
    if (tmpl || ctinfo == IP_CT_UNTRACKED) {
        /* Previously seen (loopback or untracked)?  Ignore. */
        if ((tmpl && !nf_ct_is_template(tmpl)) ||//synproxy sets the template flag, so this branch is skipped
             ctinfo == IP_CT_UNTRACKED) {
            NF_CT_STAT_INC_ATOMIC(net, ignore);
            return NF_ACCEPT;
        }
        skb->_nfct = 0;//clear it
    }

    ...

    /* It may be an special packet, error, unclean...
     * inverse of the return code tells to the netfilter
     * core what to do with the packet. */
    if (l4proto->error != NULL) {
        ret = l4proto->error(net, tmpl, skb, dataoff, pf, hooknum);
        if (ret <= 0) {
            NF_CT_STAT_INC_ATOMIC(net, error);
            NF_CT_STAT_INC_ATOMIC(net, invalid);
            ret = -ret;
            goto out;
        }
        /* ICMP[v6] protocol trackers may assign one conntrack. */
        if (skb->_nfct)
            goto out;
    }
repeat:
    //Look up the corresponding conntrack; for a first packet, one is created.
    ret = resolve_normal_ct(net, tmpl, skb, dataoff, pf, protonum,
                l3proto, l4proto);
    ...

    return ret;
}

/* On success, returns 0, sets skb->_nfct | ctinfo */
static int
resolve_normal_ct(struct net *net, struct nf_conn *tmpl,
          struct sk_buff *skb,
          unsigned int dataoff,
          u_int16_t l3num,
          u_int8_t protonum,
          const struct nf_conntrack_l3proto *l3proto,
          const struct nf_conntrack_l4proto *l4proto)
{
    const struct nf_conntrack_zone *zone;
    struct nf_conntrack_tuple tuple;
    struct nf_conntrack_tuple_hash *h;
    enum ip_conntrack_info ctinfo;
    struct nf_conntrack_zone tmp;
    struct nf_conn *ct;
    u32 hash;
    ... 

    /* look for tuple match; the zone comes from the template */
    zone = nf_ct_zone_tmpl(tmpl, skb, &tmp);
    hash = hash_conntrack_raw(&tuple, net);
    h = __nf_conntrack_find_get(net, zone, &tuple, hash);
    if (!h) {
        //initialize based on the template
        h = init_conntrack(net, tmpl, &tuple, l3proto, l4proto,
                   skb, dataoff, hash);
        if (!h)
            return 0;
        if (IS_ERR(h))
            return PTR_ERR(h);
    }
    ...
    return 0;
}

/* Allocate a new conntrack: we return -ENOMEM if classification
   failed due to stress.  Otherwise it really is unclassifiable. */
static noinline struct nf_conntrack_tuple_hash *
init_conntrack(struct net *net, struct nf_conn *tmpl,
           const struct nf_conntrack_tuple *tuple,
           const struct nf_conntrack_l3proto *l3proto,
           const struct nf_conntrack_l4proto *l4proto,
           struct sk_buff *skb,
           unsigned int dataoff, u32 hash)
{
    struct nf_conn *ct;
    struct nf_conn_help *help;
    struct nf_conntrack_tuple repl_tuple;
    struct nf_conntrack_ecache *ecache;
    struct nf_conntrack_expect *exp = NULL;
    const struct nf_conntrack_zone *zone;
    struct nf_conn_timeout *timeout_ext;
    struct nf_conntrack_zone tmp;
    unsigned int *timeouts;

    if (!nf_ct_invert_tuple(&repl_tuple, tuple, l3proto, l4proto)) {
        pr_debug("Can't invert tuple.\n");
        return NULL;
    }

    zone = nf_ct_zone_tmpl(tmpl, skb, &tmp);
    //allocate the conntrack
    ct = __nf_conntrack_alloc(net, zone, tuple, &repl_tuple, GFP_ATOMIC,
                  hash);
    if (IS_ERR(ct))
        return (struct nf_conntrack_tuple_hash *)ct;
    //the template must carry the synproxy extension
    if (!nf_ct_add_synproxy(ct, tmpl)) {
        nf_conntrack_free(ct);
        return ERR_PTR(-ENOMEM);
    }
    //the timeout is taken from the template
    timeout_ext = tmpl ? nf_ct_timeout_find(tmpl) : NULL;
    if (timeout_ext) {
        timeouts = nf_ct_timeout_data(timeout_ext);
        if (unlikely(!timeouts))
            timeouts = l4proto->get_timeouts(net);
    } else {
        timeouts = l4proto->get_timeouts(net);
    }

    /* Protocol-specific initialization; for TCP this checks the state transition, e.g. a first packet that is not a SYN cannot establish a connection. */
    if (!l4proto->new(ct, skb, dataoff, timeouts)) {
        nf_conntrack_free(ct);
        pr_debug("can't track with proto module\n");
        return NULL;
    }

    if (timeout_ext)
        nf_ct_timeout_ext_add(ct, rcu_dereference(timeout_ext->timeout),
                      GFP_ATOMIC);

    nf_ct_acct_ext_add(ct, GFP_ATOMIC);
    nf_ct_tstamp_ext_add(ct, GFP_ATOMIC);
    nf_ct_labels_ext_add(ct);

    ecache = tmpl ? nf_ct_ecache_find(tmpl) : NULL;
    nf_ct_ecache_ext_add(ct, ecache ? ecache->ctmask : 0,
                 ecache ? ecache->expmask : 0,
                 GFP_ATOMIC);
    ...

    return &ct->tuplehash[IP_CT_DIR_ORIGINAL];
}
//adding the synproxy extension also adds the seqadj extension
static inline bool nf_ct_add_synproxy(struct nf_conn *ct,
                      const struct nf_conn *tmpl)
{
    if (tmpl && nfct_synproxy(tmpl)) {
        if (!nfct_seqadj_ext_add(ct))
            return false;

        if (!nfct_synproxy_ext_add(ct))
            return false;
    }

    return true;
}

SYN cookie computation and purpose

See the blog posts below, which cover this in great detail:

https://blog.csdn.net/u011130...

https://blog.csdn.net/u011130...

Experiment

Performed on a machine running Ubuntu 18.04:

#in one terminal, run nginx; its IP is 172.17.0.2
admin@ubuntu:~$ sudo docker run -it --name synproxynginx nginx bash
then run nginx inside the container

#in another terminal, run curl; its IP is 172.17.0.3
admin@ubuntu:~$ sudo docker run -it --name synproxyclient ubuntu bash
then run curl 172.17.0.2 inside the container

#in a third terminal, run
admin@ubuntu:~$ sudo iptables -t raw -A PREROUTING -p tcp --dport 80 --syn -j NOTRACK
admin@ubuntu:~$ sudo iptables -A FORWARD -i eth0 -p tcp --dport 80 -m state --state UNTRACKED,\
                        INVALID -j SYNPROXY --sack-perm --timestamp --mss 1480 --wscale 7 --ecn

admin@ubuntu:~$ sudo sysctl -w net.netfilter.nf_conntrack_tcp_loose=0

#capture packets on the veth interfaces connecting the two containers to the docker0 bridge and analyze


