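I spent the past week chasing a DNS resolution failure in filebeat. filebeat is deployed to our k8s clusters as a DaemonSet and collects the logs of every pod on each host. On clusters whose host OS is CentOS 7.4, everything works. On clusters whose host OS is CentOS 7.6, DNS resolution fails and the logs cannot be sent to the Kafka cluster.

The filebeat error log shows: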

Failed to connect to broker sg.main2.kafka.metis.service:9092: dial tcp: lookup sg.main2.kafka.metis.service: Try again

So the debugging began. My first suspect was CoreDNS, so I exec'd into the pod and ran dig.

dig @10.247.3.10 sg.main2.kafka.metis.service

; <<>> DiG 9.12.4-P2 <<>> @10.247.3.10 sg.main2.kafka.metis.service
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 44350
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;sg.main2.kafka.metis.service. IN A

;; ANSWER SECTION:
sg.main2.kafka.metis.service. 30 IN A 10.21.42.97

;; Query time: 1 msec
;; SERVER: 10.247.3.10#53(10.247.3.10)
;; WHEN: Sun Jan 05 14:13:26 UTC 2020
;; MSG SIZE rcvd: 101

Resolution works fine inside the pod, so the problem has to be in the code.

Time to bring in strace.

[Screenshot: strace output of the filebeat process (shareit.png)]

strace shows filebeat sending its DNS queries to 127.0.0.1:53. Unsurprisingly, resolution fails.

The next step is to match this against the Go source.

// Copyright 2009 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

// +build aix darwin dragonfly freebsd linux netbsd openbsd solaris

// Read system DNS config from /etc/resolv.conf

package net

import (
    "internal/bytealg"
    "os"
    "sync/atomic"
    "time"
)

var (
    defaultNS   = []string{"127.0.0.1:53", "[::1]:53"}
    getHostname = os.Hostname // variable for testing
)

type dnsConfig struct {
    servers       []string      // server addresses (in host:port form) to use
    search        []string      // rooted suffixes to append to local name
    ndots         int           // number of dots in name to trigger absolute lookup
    timeout       time.Duration // wait before giving up on a query, including retries
    attempts      int           // lost packets before giving up on server
    rotate        bool          // round robin among servers
    unknownOpt    bool          // anything unknown was encountered
    lookup        []string      // OpenBSD top-level database "lookup" order
    err           error         // any error that occurs during open of resolv.conf
    mtime         time.Time     // time of resolv.conf modification
    soffset       uint32        // used by serverOffset
    singleRequest bool          // use sequential A and AAAA queries instead of parallel queries
    useTCP        bool          // force usage of TCP for DNS resolutions
}

// See resolv.conf(5) on a Linux machine.
func dnsReadConfig(filename string) *dnsConfig {
    conf := &dnsConfig{
        ndots:    1,
        timeout:  5 * time.Second,
        attempts: 2,
    }
    file, err := open(filename)
    if err != nil {
        conf.servers = defaultNS
        conf.search = dnsDefaultSearch()
        conf.err = err
        return conf
    }
    defer file.close()
    if fi, err := file.file.Stat(); err == nil {
        conf.mtime = fi.ModTime()
    } else {
        conf.servers = defaultNS
        conf.search = dnsDefaultSearch()
        conf.err = err
        return conf
    }
    for line, ok := file.readLine(); ok; line, ok = file.readLine() {
        if len(line) > 0 && (line[0] == ';' || line[0] == '#') {
            // comment.
            continue
        }
        f := getFields(line)
        if len(f) < 1 {
            continue
        }
        switch f[0] {
        case "nameserver": // add one name server
            if len(f) > 1 && len(conf.servers) < 3 { // small, but the standard limit
                // One more check: make sure server name is
                // just an IP address. Otherwise we need DNS
                // to look it up.
                if parseIPv4(f[1]) != nil {
                    conf.servers = append(conf.servers, JoinHostPort(f[1], "53"))
                } else if ip, _ := parseIPv6Zone(f[1]); ip != nil {
                    conf.servers = append(conf.servers, JoinHostPort(f[1], "53"))
                }
            }

        case "domain": // set search path to just this domain
            if len(f) > 1 {
                conf.search = []string{ensureRooted(f[1])}
            }

        case "search": // set search path to given servers
            conf.search = make([]string, len(f)-1)
            for i := 0; i < len(conf.search); i++ {
                conf.search[i] = ensureRooted(f[i+1])
            }

        case "options": // magic options
            for _, s := range f[1:] {
                switch {
                case hasPrefix(s, "ndots:"):
                    n, _, _ := dtoi(s[6:])
                    if n < 0 {
                        n = 0
                    } else if n > 15 {
                        n = 15
                    }
                    conf.ndots = n
                case hasPrefix(s, "timeout:"):
                    n, _, _ := dtoi(s[8:])
                    if n < 1 {
                        n = 1
                    }
                    conf.timeout = time.Duration(n) * time.Second
                case hasPrefix(s, "attempts:"):
                    n, _, _ := dtoi(s[9:])
                    if n < 1 {
                        n = 1
                    }
                    conf.attempts = n
                case s == "rotate":
                    conf.rotate = true
                case s == "single-request" || s == "single-request-reopen":
                    // Linux option:
                    // http://man7.org/linux/man-pages/man5/resolv.conf.5.html
                    // "By default, glibc performs IPv4 and IPv6 lookups in parallel [...]
                    //  This option disables the behavior and makes glibc
                    //  perform the IPv6 and IPv4 requests sequentially."
                    conf.singleRequest = true
                case s == "use-vc" || s == "usevc" || s == "tcp":
                    // Linux (use-vc), FreeBSD (usevc) and OpenBSD (tcp) option:
                    // http://man7.org/linux/man-pages/man5/resolv.conf.5.html
                    // "Sets RES_USEVC in _res.options.
                    //  This option forces the use of TCP for DNS resolutions."
                    // https://www.freebsd.org/cgi/man.cgi?query=resolv.conf&sektion=5&manpath=freebsd-release-ports
                    // https://man.openbsd.org/resolv.conf.5
                    conf.useTCP = true
                default:
                    conf.unknownOpt = true
                }
            }

        case "lookup":
            // OpenBSD option:
            // https://www.openbsd.org/cgi-bin/man.cgi/OpenBSD-current/man5/resolv.conf.5
            // "the legal space-separated values are: bind, file, yp"
            conf.lookup = f[1:]

        default:
            conf.unknownOpt = true
        }
    }
    if len(conf.servers) == 0 {
        conf.servers = defaultNS
    }
    if len(conf.search) == 0 {
        conf.search = dnsDefaultSearch()
    }
    return conf
}

// serverOffset returns an offset that can be used to determine
// indices of servers in c.servers when making queries.
// When the rotate option is enabled, this offset increases.
// Otherwise it is always 0.
func (c *dnsConfig) serverOffset() uint32 {
    if c.rotate {
        return atomic.AddUint32(&c.soffset, 1) - 1 // return 0 to start
    }
    return 0
}

func dnsDefaultSearch() []string {
    hn, err := getHostname()
    if err != nil {
        // best effort
        return nil
    }
    if i := bytealg.IndexByteString(hn, '.'); i >= 0 && i < len(hn)-1 {
        return []string{ensureRooted(hn[i+1:])}
    }
    return nil
}

func hasPrefix(s, prefix string) bool {
    return len(s) >= len(prefix) && s[:len(prefix)] == prefix
}

func ensureRooted(s string) string {
    if len(s) > 0 && s[len(s)-1] == '.' {
        return s
    }
    return s + "."
}
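The part of this listing that matters for the 127.0.0.1:53 observation is the fallback: if resolv.conf cannot be read, or it yields no usable nameserver line, the server list becomes defaultNS. Below is a simplified sketch of just that behavior, written against exported standard library functions only. `parseNameservers` is a hypothetical helper for illustration; it is not the real implementation and ignores search domains, options, and IPv6 zones.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"strings"
)

// defaultNS mirrors the fallback list in the Go source quoted above.
var defaultNS = []string{"127.0.0.1:53", "[::1]:53"}

// parseNameservers is a hypothetical, simplified stand-in for the
// nameserver handling in dnsReadConfig. It keeps at most three servers
// and falls back to defaultNS when none are found.
func parseNameservers(resolvConf string) []string {
	var servers []string
	sc := bufio.NewScanner(strings.NewReader(resolvConf))
	for sc.Scan() {
		f := strings.Fields(sc.Text())
		// Comment lines and other directives never have "nameserver" as
		// their first field, so they are skipped here too.
		if len(f) < 2 || f[0] != "nameserver" || len(servers) >= 3 {
			continue
		}
		// Only literal IP addresses are accepted, as in the original.
		if net.ParseIP(f[1]) != nil {
			servers = append(servers, net.JoinHostPort(f[1], "53"))
		}
	}
	if len(servers) == 0 {
		return defaultNS
	}
	return servers
}

func main() {
	fmt.Println(parseNameservers("nameserver 10.247.3.10\n"))
	fmt.Println(parseNameservers("search svc.cluster.local\n"))
}
```

The second call has no nameserver line, so it returns the loopback defaults, which is the address seen in the strace output.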

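Since the same code runs without problems on the CentOS 7.4 clusters, I suspected a compatibility problem between the alpine 3.8 base image and CentOS 7.6.

Go's DNS resolution supports two modes, cgo and pure Go. Perhaps some setting makes Go resolve through cgo, and alpine ships musl, a rather unusual libc, which may be incompatible with CentOS 7.6.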

var lookupOrderName = map[hostLookupOrder]string{
    hostLookupCgo:      "cgo",
    hostLookupFilesDNS: "files,dns",
    hostLookupDNSFiles: "dns,files",
    hostLookupFiles:    "files",
    hostLookupDNS:      "dns",
}

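hostLookupCgo is a category of its own: it means the lookup calls libc's getaddrinfo directly.

The name resolution functions are called indirectly by Dial, and directly by LookupHost and LookupAddr. The implementation differs by operating system. On Unix systems there are two ways to resolve a name:

     - A pure Go resolver: it reads the list of local DNS server addresses from /etc/resolv.conf, sends DNS requests (UDP packets), and collects the results.

     - A cgo resolver: it ends up calling the C standard library's getaddrinfo or getnameinfo (not recommended, since it is unfriendly to goroutines).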

The GODEBUG environment variable selects Go's default DNS resolver, pure Go or cgo:
export GODEBUG=netdns=go    # force pure Go resolver
export GODEBUG=netdns=cgo   # force cgo resolver
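Besides the environment variable, a program can request the pure Go resolver for a single resolver instance through the `PreferGo` field of `net.Resolver`. The sketch below is illustrative only and is not taken from filebeat; `newPureGoResolver` is a hypothetical helper name, and the hostname is the one from the error log.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// newPureGoResolver returns a resolver that asks for Go's built-in DNS
// client instead of the libc one, scoped to this resolver only.
func newPureGoResolver() *net.Resolver {
	return &net.Resolver{PreferGo: true}
}

func main() {
	r := newPureGoResolver()

	// Bound the lookup so a broken DNS setup does not hang the program.
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	addrs, err := r.LookupHost(ctx, "sg.main2.kafka.metis.service")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println(addrs)
}
```

This only affects code that uses the returned resolver explicitly; libraries that go through the default resolver still follow GODEBUG and the build settings.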

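To confirm the hypothesis and trace Go's resolution flow, I forced export GODEBUG=netdns=go+9 and the problem went away. With export GODEBUG=netdns=cgo+9 the problem came back. With Go 1.11, the lookup takes the cgo path.

So I disabled cgo when building filebeat: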

CGO_ENABLED=0 go build --ldflags -w -o filebeat

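That fixes it once and for all.

I also added logging at the point where Go calls into the C function (getaddrinfo). The arguments are identical in the working and failing cases, but the behavior inside the libc differs from what happens on the older OS version, so this is a libc compatibility problem.

Conclusion

  • In an alpine environment, it is best to disable cgo for Go code.
  • In a k8s cluster, it is best to pick images from the same distribution as the host OS.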

iyacontrol