Original link: http://tabalt.net/blog/a-lazy-network-card-leads-to-server-load-imbalance/

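We recently brought a batch of physical machines online as proxies for our HTTPS rollout. After launch, we noticed that one machine in one of our data centers was unusually idle: its load was low, and its inbound and outbound NIC traffic was well below that of the other machines. So I set out to see what was going on.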

First I compared the Nginx /server-status pages. The problem machine had far more active connections.

curl http://127.0.0.1/server-status

# Problem machine
Active connections: 4353 
server accepts handled requests
 33401902 33401902 37655622 
Reading: 0 Writing: 85 Waiting: 4231 

# Normal machine
Active connections: 2686 
server accepts handled requests
 14810542 14810542 8026702 
Reading: 0 Writing: 69 Waiting: 2567
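
To compare the two pages without eyeballing them, the counters can be parsed into numbers. The following is a minimal sketch, assuming the page has the plain-text layout shown above; the function name `parse_stub_status` and the embedded samples are illustrative, not part of the original investigation.

```python
import re

def parse_stub_status(text):
    """Parse the plain-text status page shown above into a dict of ints."""
    active = re.search(r"Active connections:\s*(\d+)", text)
    # The counters line holds three integers: accepts, handled, requests.
    counters = re.search(
        r"^[ \t]*(\d+)[ \t]+(\d+)[ \t]+(\d+)[ \t]*$", text, re.MULTILINE
    )
    states = re.search(
        r"Reading:\s*(\d+)\s+Writing:\s*(\d+)\s+Waiting:\s*(\d+)", text
    )
    if not (active and counters and states):
        raise ValueError("unrecognized status page layout")
    accepts, handled, requests = map(int, counters.groups())
    reading, writing, waiting = map(int, states.groups())
    return {
        "active": int(active.group(1)),
        "accepts": accepts,
        "handled": handled,
        "requests": requests,
        "reading": reading,
        "writing": writing,
        "waiting": waiting,
    }

# Samples copied from the two machines above.
PROBLEM = """Active connections: 4353
server accepts handled requests
 33401902 33401902 37655622
Reading: 0 Writing: 85 Waiting: 4231
"""
NORMAL = """Active connections: 2686
server accepts handled requests
 14810542 14810542 8026702
Reading: 0 Writing: 69 Waiting: 2567
"""

problem = parse_stub_status(PROBLEM)
normal = parse_stub_status(NORMAL)
ratio = problem["active"] / normal["active"]
print(f"active connections: {problem['active']} vs {normal['active']} ({ratio:.2f}x)")
```

In practice the text would come from the same `curl` call as above; here it is embedded so the sketch runs on its own.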

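Next I looked at the Nginx error log and found many entries that I attributed to users closing the page before it finished loading, which suggests pages were loading very slowly for users routed to this machine. If a large share of users give up early, this machine's traffic would indeed be much lower. (Strictly speaking, the message below is logged when the upstream closes the connection before Nginx has read the response header, so it points at the path between the proxy and the upstream rather than at the clients.)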

tail -f /data/nginx/logs/mydomain_com_error.log

# Problem machine
upstream prematurely closed connection while reading response header from upstream

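I compared the main Nginx conf and the vhost configs on the problem machine and a normal machine, and found no differences. I also checked disk I/O with iostat, and the two were about the same.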

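Stumped, I went to ask an expert for help. He made short work of it: he pinged the upstream IP and found that latency from the problem machine was around 300 ms, while from a normal machine it was only about 0.2 ms: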

# Problem machine
64 bytes from 10.10.10.10: icmp_seq=1 ttl=61 time=299 ms
64 bytes from 10.10.10.10: icmp_seq=2 ttl=61 time=246 ms
64 bytes from 10.10.10.10: icmp_seq=3 ttl=61 time=349 ms
64 bytes from 10.10.10.10: icmp_seq=4 ttl=61 time=291 ms

# Normal machine
64 bytes from 10.10.10.10: icmp_seq=1 ttl=61 time=0.239 ms
64 bytes from 10.10.10.10: icmp_seq=2 ttl=61 time=0.083 ms
64 bytes from 10.10.10.10: icmp_seq=4 ttl=61 time=0.112 ms
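
To put a number on the gap, the round-trip times can be pulled out of the ping output and averaged. This is a small sketch, assuming the `time=<value> ms` format shown above; `ping_times_ms` is a hypothetical helper name, and the samples are the lines pasted above.

```python
import re
from statistics import mean

def ping_times_ms(output):
    """Return the round-trip times (in ms) found in ping output lines."""
    return [float(t) for t in re.findall(r"time=([\d.]+) ms", output)]

# Samples copied from the ping output above.
PROBLEM_PING = """64 bytes from 10.10.10.10: icmp_seq=1 ttl=61 time=299 ms
64 bytes from 10.10.10.10: icmp_seq=2 ttl=61 time=246 ms
64 bytes from 10.10.10.10: icmp_seq=3 ttl=61 time=349 ms
64 bytes from 10.10.10.10: icmp_seq=4 ttl=61 time=291 ms
"""
NORMAL_PING = """64 bytes from 10.10.10.10: icmp_seq=1 ttl=61 time=0.239 ms
64 bytes from 10.10.10.10: icmp_seq=2 ttl=61 time=0.083 ms
64 bytes from 10.10.10.10: icmp_seq=4 ttl=61 time=0.112 ms
"""

problem_avg = mean(ping_times_ms(PROBLEM_PING))
normal_avg = mean(ping_times_ms(NORMAL_PING))
print(f"problem: {problem_avg:.2f} ms, normal: {normal_avg:.3f} ms")
print(f"the problem machine is roughly {problem_avg / normal_avg:.0f}x slower")
```

With these few samples the averages are about 296 ms and 0.145 ms, a difference of three orders of magnitude to the same upstream.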

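Suspecting a network problem, we asked Ops to take a look. With the generous help of several Ops experts, we found that the NIC was actually saturated! Yet judging by the real traffic on our normal machines, peak inbound traffic per machine was only about 140 Mbit/s, nowhere near enough to saturate a gigabit NIC.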
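
The arithmetic behind that doubt, as a minimal sketch: the 140 Mbit/s peak and the gigabit link speed come from the paragraph above, while the function name `link_utilization` is made up for illustration.

```python
def link_utilization(traffic_mbit_s, link_speed_mbit_s):
    """Fraction of the link capacity consumed by the given traffic."""
    return traffic_mbit_s / link_speed_mbit_s

PEAK_INBOUND_MBIT_S = 140   # peak per-machine inbound traffic quoted above
GIGABIT_MBIT_S = 1000       # nominal speed of a gigabit NIC

utilization = link_utilization(PEAK_INBOUND_MBIT_S, GIGABIT_MBIT_S)
print(f"peak utilization on a gigabit link: {utilization:.0%}")
```

At roughly 14% of capacity, a healthy gigabit link should have plenty of headroom, which is why a saturated NIC looked so suspicious.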

Digging further, we uncovered a shocking secret:

sudo ethtool eth0

# Problem machine
Settings for eth0:
    Supported ports: [ TP ]
    Supported link modes:   10baseT/Half 10baseT/Full 
                            100baseT/Half 100baseT/Full 
                            1000baseT/Full 
    Supports auto-negotiation: Yes
    Advertised link modes:  10baseT/Half 10baseT/Full 
                            100baseT/Half 100baseT/Full 
                            1000baseT/Full 
    Advertised pause frame use: No
    Advertised auto-negotiation: Yes
    Speed: 100Mb/s
    Duplex: Full
    Port: Twisted Pair
    PHYAD: 1
    Transceiver: internal
    Auto-negotiation: on
    MDI-X: Unknown
    Supports Wake-on: pumbg
    Wake-on: g
    Current message level: 0x00000003 (3)
    Link detected: yes

# Normal machine
Settings for eth0:
    Supported ports: [ TP ]
    Supported link modes:   10baseT/Half 10baseT/Full 
                            100baseT/Half 100baseT/Full 
                            1000baseT/Full 
    Supports auto-negotiation: Yes
    Advertised link modes:  10baseT/Half 10baseT/Full 
                            100baseT/Half 100baseT/Full 
                            1000baseT/Full 
    Advertised pause frame use: No
    Advertised auto-negotiation: Yes
    Speed: 1000Mb/s
    Duplex: Full
    Port: Twisted Pair
    PHYAD: 1
    Transceiver: internal
    Auto-negotiation: on
    MDI-X: Unknown
    Supports Wake-on: pumbg
    Wake-on: g
    Current message level: 0x00000003 (3)
    Link detected: yes

So much for the gigabit NIC we were promised. Why had it negotiated only 100Mb/s on the problem machine?
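
A mismatch like this can be spotted mechanically by comparing the negotiated `Speed` with the fastest entry under `Supported link modes`. The sketch below assumes the `ethtool` output layout shown above; the helper names are hypothetical, and the samples are trimmed excerpts of that output.

```python
import re

def negotiated_speed_mbit(ethtool_output):
    """Return the negotiated speed in Mb/s, or None if it is not reported."""
    m = re.search(r"^\s*Speed:\s*(\d+)Mb/s", ethtool_output, re.MULTILINE)
    return int(m.group(1)) if m else None

def supported_speeds_mbit(ethtool_output):
    """Collect the speeds listed under 'Supported link modes'."""
    speeds = []
    in_section = False
    for line in ethtool_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("Supported link modes:"):
            in_section = True
            stripped = stripped.split(":", 1)[1]
        elif ":" in stripped:
            # Any other "Key: value" line ends the section.
            in_section = False
        if in_section:
            speeds += [int(s) for s in re.findall(r"(\d+)base", stripped)]
    return speeds

def is_degraded(ethtool_output):
    """True when the link negotiated below the fastest supported mode."""
    speed = negotiated_speed_mbit(ethtool_output)
    supported = supported_speeds_mbit(ethtool_output)
    if speed is None or not supported:
        return False
    return speed < max(supported)

# Trimmed excerpts of the ethtool output above.
COMMON = """Settings for eth0:
    Supported link modes:   10baseT/Half 10baseT/Full
                            100baseT/Half 100baseT/Full
                            1000baseT/Full
    Supports auto-negotiation: Yes
"""
PROBLEM_NIC = COMMON + "    Speed: 100Mb/s\n    Duplex: Full\n"
NORMAL_NIC = COMMON + "    Speed: 1000Mb/s\n    Duplex: Full\n"

print("problem machine degraded:", is_degraded(PROBLEM_NIC))
print("normal machine degraded:", is_degraded(NORMAL_NIC))
```

At 100Mb/s, the roughly 140 Mbit/s of peak inbound traffic seen on the normal machines already exceeds what the link can carry, which is consistent with the NIC being saturated.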

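The NetOps experts once again came to our rescue: they went to the data center in person and put our slacking NIC back to work. With that, the problem was fully resolved.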

tabalt