Source version
kubernetes version: v1.3.0
Introduction to kubelet GC
While analyzing the kubelet startup flow we keep running into various kinds of GC, so this post pulls them out for a closer look.
kubelet's garbage collection consists of two main parts:
containerGC: removes containers that have already exited, following the configured container GC policy
imageManager: manages the lifecycle of all images in k8s; under the hood it also relies on cAdvisor.
imageManager
Policy initialization
The imageManager GC policy is described by the following structure:
type ImageGCPolicy struct {
// Any usage above this threshold will always trigger garbage collection.
// This is the highest usage we will allow.
HighThresholdPercent int
// Any usage below this threshold will never trigger garbage collection.
// This is the lowest threshold we will try to garbage collect to.
LowThresholdPercent int
// Minimum age at which a image can be garbage collected.
MinAge time.Duration
}
The factory defaults for this structure are set up in UnsecuredKubeletConfig() in cmd/kubelet/app/server.go.
func UnsecuredKubeletConfig(s *options.KubeletServer) (*KubeletConfig, error) {
...
imageGCPolicy := kubelet.ImageGCPolicy{
MinAge: s.ImageMinimumGCAge.Duration,
HighThresholdPercent: int(s.ImageGCHighThresholdPercent),
LowThresholdPercent: int(s.ImageGCLowThresholdPercent),
}
...
}
The KubeletServer fields used in these assignments are themselves initialized in NewKubeletServer() in cmd/kubelet/app/options/options.go:
func NewKubeletServer() *KubeletServer {
return &KubeletServer{
...
ImageMinimumGCAge: unversioned.Duration{Duration: 2 * time.Minute},
ImageGCHighThresholdPercent: 90,
ImageGCLowThresholdPercent: 80,
...
}
}
From these defaults we can conclude the following (a quick worked example follows the list):
when image disk usage is above 90%, image GC keeps being triggered
when image disk usage is below 80%, image GC is never triggered
image GC tries to delete the least recently used images first, but an image that has existed for less than 2 minutes will not be deleted.
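For intuition, here is a small self-contained Go sketch with made-up numbers; the arithmetic mirrors the formula used in GarbageCollect() further down, it is not kubelet code itself:
package main

import "fmt"

func main() {
    capacity := int64(100) * 1024 * 1024 * 1024 // hypothetical 100 GiB image filesystem
    available := int64(8) * 1024 * 1024 * 1024  // 8 GiB free => 92% used

    usagePercent := 100 - int(available*100/capacity) // 92
    highThreshold, lowThreshold := 90, 80             // the defaults above

    if usagePercent >= highThreshold {
        // Free enough bytes to bring usage back down to lowThreshold.
        amountToFree := capacity*int64(100-lowThreshold)/100 - available
        fmt.Printf("usage %d%% >= %d%%, try to free %d bytes (~12 GiB)\n",
            usagePercent, highThreshold, amountToFree)
    }
}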
imageManager initialization
Everything above covers the initialization of the imageManager's GC policy; now let's look at the imageManager itself.
Location: pkg/kubelet/image_manager.go
The interface is defined as follows:
type imageManager interface {
// Applies the garbage collection policy. Errors include being unable to free
// enough space as per the garbage collection policy.
GarbageCollect() error
// Start async garbage collection of images.
Start() error
GetImageList() ([]kubecontainer.Image, error)
// TODO(vmarmol): Have this subsume pulls as well.
}
As you can see, imageManager is an interface; the concrete struct that actually gets instantiated is realImageManager:
type realImageManager struct {
// Container runtime
runtime container.Runtime
// Records of images and their use.
imageRecords map[string]*imageRecord
imageRecordsLock sync.Mutex
// The image garbage collection policy in use.
policy ImageGCPolicy
// cAdvisor instance.
cadvisor cadvisor.Interface
// Recorder for Kubernetes events.
recorder record.EventRecorder
// Reference to this node.
nodeRef *api.ObjectReference
// Track initialization
initialized bool
}
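The imageRecords map keeps per-image bookkeeping that freeSpace() reads later on (its firstDetected, lastUsed and size fields). For reference, a sketch of what the record and the eviction helper type plausibly look like in image_manager.go, with the field names inferred from the code shown below:
// Per-image bookkeeping kept by realImageManager (sketch; see image_manager.go
// in this version for the real definition).
type imageRecord struct {
    // When detectImages() first saw this image.
    firstDetected time.Time
    // When a container was last seen using this image.
    lastUsed time.Time
    // Image size in bytes, as reported by the runtime.
    size int64
}

// evictionInfo pairs an image ID with its record so GC candidates can be sorted.
type evictionInfo struct {
    id string
    imageRecord
}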
To see how it is constructed, we go back to NewMainKubelet() in pkg/kubelet/kubelet.go:
func NewMainKubelet(
hostname string,
nodeName string,
...
) (*Kubelet, error) {
...
// setup containerGC
containerGC, err := kubecontainer.NewContainerGC(klet.containerRuntime, containerGCPolicy)
if err != nil {
return nil, err
}
klet.containerGC = containerGC
// setup imageManager
imageManager, err := newImageManager(klet.containerRuntime, cadvisorInterface, recorder, nodeRef, imageGCPolicy)
if err != nil {
return nil, fmt.Errorf("failed to initialize image manager: %v", err)
}
klet.imageManager = imageManager
...
}
As shown above, NewMainKubelet() initializes both containerGC and imageManager. We'll look at imageManager first and come back to containerGC later.
newImageManager() looks like this:
func newImageManager(runtime container.Runtime, cadvisorInterface cadvisor.Interface, recorder record.EventRecorder, nodeRef *api.ObjectReference, policy ImageGCPolicy) (imageManager, error) {
// Validate the policy parameters
if policy.HighThresholdPercent < 0 || policy.HighThresholdPercent > 100 {
return nil, fmt.Errorf("invalid HighThresholdPercent %d, must be in range [0-100]", policy.HighThresholdPercent)
}
if policy.LowThresholdPercent < 0 || policy.LowThresholdPercent > 100 {
return nil, fmt.Errorf("invalid LowThresholdPercent %d, must be in range [0-100]", policy.LowThresholdPercent)
}
if policy.LowThresholdPercent > policy.HighThresholdPercent {
return nil, fmt.Errorf("LowThresholdPercent %d can not be higher than HighThresholdPercent %d", policy.LowThresholdPercent, policy.HighThresholdPercent)
}
// Build the realImageManager struct
im := &realImageManager{
runtime: runtime,
policy: policy,
imageRecords: make(map[string]*imageRecord),
cadvisor: cadvisorInterface,
recorder: recorder,
nodeRef: nodeRef,
initialized: false,
}
return im, nil
}
Looking at this constructor, the imageManager depends on the container runtime, cAdvisor, an EventRecorder, a nodeRef and the Policy.
Some educated guesses about what each of them is for:
runtime is used to actually delete images
cAdvisor is used to find out how much disk space images occupy
EventRecorder is used to emit events about garbage collection
Policy is the garbage collection policy itself
nodeRef is less obvious at first glance; the code below shows that it is the node object reference the recorder attaches GC warning events to.
Starting the imageManager
Once all the parameters are initialized, the actual GC start-up happens; for this we look at CreateAndInitKubelet().
File: cmd/kubelet/app/server.go
Call chain: main -> app.Run -> run -> RunKubelet -> CreateAndInitKubelet
The function:
func CreateAndInitKubelet(kc *KubeletConfig) (k KubeletBootstrap,
pc *config.PodConfig, err error) {
...
k.StartGarbageCollection()
return k, pc, nil
}
This function calls StartGarbageCollection(), which starts the GC loops:
func (kl *Kubelet) StartGarbageCollection() {
go wait.Until(func() {
if err := kl.containerGC.GarbageCollect(kl.sourcesReady.AllReady()); err != nil {
glog.Errorf("Container garbage collection failed: %v", err)
}
}, ContainerGCPeriod, wait.NeverStop)
go wait.Until(func() {
if err := kl.imageManager.GarbageCollect(); err != nil {
glog.Errorf("Image garbage collection failed: %v", err)
}
}, ImageGCPeriod, wait.NeverStop)
}
This starts one goroutine for containerGC and one for imageManager: containerGC is triggered every 1 minute, imageManager every 5 minutes.
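Both periods come from constants in pkg/kubelet/kubelet.go; in this version they should look roughly like this:
const (
    // How often the container garbage collector runs.
    ContainerGCPeriod = time.Minute
    // How often the image garbage collector runs.
    ImageGCPeriod = 5 * time.Minute
)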
The imageManager's GarbageCollect() is a method on the realImageManager struct we built earlier; let's step into kl.imageManager.GarbageCollect():
func (im *realImageManager) GarbageCollect() error {
// Get disk usage stats for the image filesystem on this node from cAdvisor
fsInfo, err := im.cadvisor.ImagesFsInfo()
if err != nil {
return err
}
// Total capacity and available space
capacity := int64(fsInfo.Capacity)
available := int64(fsInfo.Available)
if available > capacity {
glog.Warningf("available %d is larger than capacity %d", available, capacity)
available = capacity
}
// Check valid capacity.
if capacity == 0 {
err := fmt.Errorf("invalid capacity %d on device %q at mount point %q", capacity, fsInfo.Device, fsInfo.Mountpoint)
im.recorder.Eventf(im.nodeRef, api.EventTypeWarning, container.InvalidDiskCapacity, err.Error())
return err
}
// Check whether image disk usage has reached HighThresholdPercent
usagePercent := 100 - int(available*100/capacity)
if usagePercent >= im.policy.HighThresholdPercent {
// Try to bring image disk usage back below LowThresholdPercent
amountToFree := capacity*int64(100-im.policy.LowThresholdPercent)/100 - available
glog.Infof("[ImageManager]: Disk usage on %q (%s) is at %d%% which is over the high threshold (%d%%). Trying to free %d bytes", fsInfo.Device, fsInfo.Mountpoint, usagePercent, im.policy.HighThresholdPercent, amountToFree)
// The call that actually frees space
freed, err := im.freeSpace(amountToFree, time.Now())
if err != nil {
return err
}
if freed < amountToFree {
err := fmt.Errorf("failed to garbage collect required amount of images. Wanted to free %d, but freed %d", amountToFree, freed)
im.recorder.Eventf(im.nodeRef, api.EventTypeWarning, container.FreeDiskSpaceFailed, err.Error())
return err
}
}
return nil
}
The key call here is im.freeSpace(), which does the actual reclamation.
It takes two arguments: the amount of space we want to free in this round, and the time at which the GC pass was invoked.
Let's step into it:
func (im *realImageManager) freeSpace(bytesToFree int64, freeTime time.Time) (int64, error) {
// Walk all existing images via im.runtime and refresh im.imageRecords, which is used below
err := im.detectImages(freeTime)
if err != nil {
return 0, err
}
// Lock protecting imageRecords
im.imageRecordsLock.Lock()
defer im.imageRecordsLock.Unlock()
// Collect all known images
images := make([]evictionInfo, 0, len(im.imageRecords))
for image, record := range im.imageRecords {
images = append(images, evictionInfo{
id: image,
imageRecord: *record,
})
}
sort.Sort(byLastUsedAndDetected(images))
// The loop below removes images until enough space has been freed
var lastErr error
spaceFreed := int64(0)
for _, image := range images {
glog.V(5).Infof("Evaluating image ID %s for possible garbage collection", image.id)
// Images that are currently in use were given a newer lastUsed.
if image.lastUsed.Equal(freeTime) || image.lastUsed.After(freeTime) {
glog.V(5).Infof("Image ID %s has lastUsed=%v which is >= freeTime=%v, not eligible for garbage collection", image.id, image.lastUsed, freeTime)
break
}
// Avoid garbage collect the image if the image is not old enough.
// In such a case, the image may have just been pulled down, and will be used by a container right away.
// Check whether the image has been around long enough; if not, do not delete it
// (the minimum age comes from the GC policy's MinAge)
if freeTime.Sub(image.firstDetected) < im.policy.MinAge {
glog.V(5).Infof("Image ID %s has age %v which is less than the policy's minAge of %v, not eligible for garbage collection", image.id, freeTime.Sub(image.firstDetected), im.policy.MinAge)
continue
}
// Ask the runtime (i.e. Docker) to remove the image
glog.Infof("[ImageManager]: Removing image %q to free %d bytes", image.id, image.size)
err := im.runtime.RemoveImage(container.ImageSpec{Image: image.id})
if err != nil {
lastErr = err
continue
}
// Drop the removed image from imageRecords (this is why we took the lock above)
delete(im.imageRecords, image.id)
// Add the removed image's size to the running total
spaceFreed += image.size
// Stop once enough space has been freed
if spaceFreed >= bytesToFree {
break
}
}
return spaceFreed, lastErr
}
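The eviction order comes from the sort.Sort(byLastUsedAndDetected(images)) call near the top: least recently used images are removed first, with ties broken by which image was detected earlier. A sketch of that comparator (the real one sits next to realImageManager in image_manager.go):
type byLastUsedAndDetected []evictionInfo

func (ev byLastUsedAndDetected) Len() int      { return len(ev) }
func (ev byLastUsedAndDetected) Swap(i, j int) { ev[i], ev[j] = ev[j], ev[i] }
func (ev byLastUsedAndDetected) Less(i, j int) bool {
    // Least recently used first; break ties by earliest detection time.
    if ev[i].lastUsed.Equal(ev[j].lastUsed) {
        return ev[i].firstDetected.Before(ev[j].firstDetected)
    }
    return ev[i].lastUsed.Before(ev[j].lastUsed)
}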
That is essentially the whole imageManager flow; from here you could dig deeper into the cAdvisor and Docker runtime implementations behind these calls.
containerGC
Policy initialization
The containerGC policy structure looks like this:
type ContainerGCPolicy struct {
// Minimum age at which a container can be garbage collected, zero for no limit.
MinAge time.Duration
// Max number of dead containers any single pod (UID, container name) pair is
// allowed to have, less than zero for no limit.
MaxPerPodContainer int
// Max number of total dead containers, less than zero for no limit.
MaxContainers int
}
This structure is filled in by CreateAndInitKubelet() in cmd/kubelet/app/server.go.
Call chain: main -> app.Run -> RunKubelet -> CreateAndInitKubelet
func CreateAndInitKubelet(kc *KubeletConfig) (k KubeletBootstrap, pc *config.PodConfig, err error) {
var kubeClient clientset.Interface
if kc.KubeClient != nil {
kubeClient = kc.KubeClient
// TODO: remove this when we've refactored kubelet to only use clientset.
}
// Initialize the containerGC policy
gcPolicy := kubecontainer.ContainerGCPolicy{
MinAge: kc.MinimumGCAge,
MaxPerPodContainer: kc.MaxPerPodContainerCount,
MaxContainers: kc.MaxContainerCount,
}
...
}
The actual values come from the kc struct, which is populated in UnsecuredKubeletConfig() in cmd/kubelet/app/server.go.
Call chain: main -> app.Run -> UnsecuredKubeletConfig
func UnsecuredKubeletConfig(s *options.KubeletServer) (*KubeletConfig, error) {
...
MaxContainerCount: int(s.MaxContainerCount),
MaxPerPodContainerCount: int(s.MaxPerPodContainerCount),
MinimumGCAge: s.MinimumGCAge.Duration,
...
}
The values originally come from the KubeletConfiguration structure embedded in KubeletServer; the relevant fields are:
type KubeletConfiguration struct {
...
// containerGC reclaims containers that have already exited, but a container
// must have been dead for longer than MinimumGCAge before it can be collected. Default: 1min
MinimumGCAge unversioned.Duration `json:"minimumGCAge"`
// Maximum number of dead containers each pod (UID, container name) pair may keep. Default: 2
MaxPerPodContainerCount int32 `json:"maxPerPodContainerCount"`
// Maximum total number of dead containers to keep on the node
MaxContainerCount int32 `json:"maxContainerCount"`
}
These fields, in turn, get their defaults in NewKubeletServer() in cmd/kubelet/app/options/options.go:
func NewKubeletServer() *KubeletServer {
...
MaxContainerCount: 240,
MaxPerPodContainerCount: 2,
MinimumGCAge: unversioned.Duration{Duration: 1 * time.Minute},
From these defaults we can see that:
the node retains at most 240 dead (exited) containers in total
each pod (UID, container name) pair retains at most 2 dead containers
a container becomes eligible for containerGC only after it has been dead for at least 1 minute
So that is the basic containerGC policy.
containerGC initialization
With the policy in place, the last step is to construct the containerGC object itself, in NewMainKubelet() in pkg/kubelet/kubelet.go:
func NewMainKubelet(...) {
...
// setup containerGC
containerGC, err := kubecontainer.NewContainerGC(klet.containerRuntime, containerGCPolicy)
if err != nil {
return nil, err
}
klet.containerGC = containerGC
...
}
Next, let's see what NewContainerGC() in pkg/kubelet/container/container_gc.go does:
func NewContainerGC(runtime Runtime, policy ContainerGCPolicy) (ContainerGC, error) {
if policy.MinAge < 0 {
return nil, fmt.Errorf("invalid minimum garbage collection age: %v", policy.MinAge)
}
return &realContainerGC{
runtime: runtime,
policy: policy,
}, nil
}
The function is simple: it validates the policy and wraps it, together with the runtime, into a realContainerGC struct. That is all it needs: garbage collecting containers necessarily involves the runtime (listing container state, removing containers, and so on), so carrying the runtime in the struct is expected.
Keep an eye on the methods this object exposes; we will use them shortly.
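For reference, the interface and the struct returned by NewContainerGC() are both small; a sketch based on pkg/kubelet/container/container_gc.go:
// ContainerGC manages garbage collection of dead containers.
type ContainerGC interface {
    // GarbageCollect applies the policy. allSourcesReady tells the collector
    // whether all pod sources have been observed, i.e. whether it is safe to
    // remove all containers of pods that no longer exist.
    GarbageCollect(allSourcesReady bool) error
}

// realContainerGC only carries the runtime and the policy; the real work is
// delegated to the runtime's own GarbageCollect() implementation, shown next.
type realContainerGC struct {
    runtime Runtime
    policy  ContainerGCPolicy
}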
Starting containerGC
Once all the parameters are initialized, the GC start-up flow kicks in; this was already covered in the imageManager section, so let's get straight to the point.
containerGC is started by StartGarbageCollection():
func (kl *Kubelet) StartGarbageCollection() {
go wait.Until(func() {
if err := kl.containerGC.GarbageCollect(kl.sourcesReady.AllReady()); err != nil {
glog.Errorf("Container garbage collection failed: %v", err)
}
}, ContainerGCPeriod, wait.NeverStop)
go wait.Until(func() {
if err := kl.imageManager.GarbageCollect(); err != nil {
glog.Errorf("Image garbage collection failed: %v", err)
}
}, ImageGCPeriod, wait.NeverStop)
}
Next up is containerGC's GarbageCollect(). To find it, recall the containerGC initialization above: the object actually returned there is a realContainerGC, so GarbageCollect() is a method on that struct:
func (cgc *realContainerGC) GarbageCollect(allSourcesReady bool) error {
return cgc.runtime.GarbageCollect(cgc.policy, allSourcesReady)
}
At this point it is clear that containerGC follows the same pattern as imageManager, so the same approach applies.
The runtime in use here is Docker, so we need Docker's implementation of GarbageCollect(). Runtime initialization was covered in the earlier post <Kubelet源码分析(二) DockerClient>, so we skip it and go straight to the implementation.
Docker's GarbageCollect() lives in pkg/kubelet/dockertools/container_gc.go:
func (cgc *containerGC) GarbageCollect(gcPolicy kubecontainer.ContainerGCPolicy, allSourcesReady bool) error {
// Split all containers into those that may be evicted and those we cannot identify
// evictUnits: identifiable containers that are dead and older than the policy's MinAge
// unidentifiedContainers: containers that cannot be identified
evictUnits, unidentifiedContainers, err := cgc.evictableContainers(gcPolicy.MinAge)
if err != nil {
return err
}
// Remove the unidentifiable containers first
for _, container := range unidentifiedContainers {
glog.Infof("Removing unidentified dead container %q with ID %q", container.name, container.id)
err = cgc.client.RemoveContainer(container.id, dockertypes.ContainerRemoveOptions{RemoveVolumes: true})
if err != nil {
glog.Warningf("Failed to remove unidentified dead container %q: %v", container.name, err)
}
}
// Once all pod sources are ready, dead containers belonging to deleted pods can be removed
if allSourcesReady {
for key, unit := range evictUnits {
if cgc.isPodDeleted(key.uid) {
cgc.removeOldestN(unit, len(unit)) // Remove all.
delete(evictUnits, key)
}
}
}
// Walk all evictUnits and drop the containers that exceed the per-pod limit
if gcPolicy.MaxPerPodContainer >= 0 {
cgc.enforceMaxContainersPerEvictUnit(evictUnits, gcPolicy.MaxPerPodContainer)
}
// Enforce the node-wide limit on dead containers:
// if the node holds more than MaxContainers, remove the excess,
// oldest containers first
if gcPolicy.MaxContainers >= 0 && evictUnits.NumContainers() > gcPolicy.MaxContainers {
// Work out how many containers each evict unit may keep
numContainersPerEvictUnit := gcPolicy.MaxContainers / evictUnits.NumEvictUnits()
if numContainersPerEvictUnit < 1 {
numContainersPerEvictUnit = 1
}
// Re-apply the per-unit limit with the recomputed budget
cgc.enforceMaxContainersPerEvictUnit(evictUnits, numContainersPerEvictUnit)
// If containers still need to be removed, remove the oldest ones first
numContainers := evictUnits.NumContainers()
if numContainers > gcPolicy.MaxContainers {
flattened := make([]containerGCInfo, 0, numContainers)
for uid := range evictUnits {
// Flatten all remaining containers into one list
flattened = append(flattened, evictUnits[uid]...)
}
sort.Sort(byCreated(flattened))
// Remove the (numContainers - gcPolicy.MaxContainers) oldest containers
cgc.removeOldestN(flattened, numContainers-gcPolicy.MaxContainers)
}
}
// After removing containers, clean up their dangling log symlinks
logSymlinks, _ := filepath.Glob(path.Join(cgc.containerLogsDir, fmt.Sprintf("*.%s", LogSuffix)))
for _, logSymlink := range logSymlinks {
if _, err = os.Stat(logSymlink); os.IsNotExist(err) {
err = os.Remove(logSymlink)
if err != nil {
glog.Warningf("Failed to remove container log dead symlink %q: %v", logSymlink, err)
}
}
}
return nil
}
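To make the MaxContainers arithmetic concrete, here is a small self-contained sketch with made-up numbers; it is not kubelet code, just the logic of the two trimming passes above:
package main

import "fmt"

func main() {
    maxContainers := 240 // gcPolicy.MaxContainers
    numEvictUnits := 100 // evictUnits.NumEvictUnits(): (pod UID, container name) pairs
    numContainers := 300 // evictUnits.NumContainers(): dead containers on the node

    // First pass: spread the global budget across evict units, never below 1 each.
    perUnit := maxContainers / numEvictUnits // 240 / 100 = 2
    if perUnit < 1 {
        perUnit = 1
    }
    fmt.Printf("each evict unit keeps at most %d dead containers\n", perUnit)

    // Suppose the per-unit cap only brought the total down to 260.
    numContainers = 260
    if numContainers > maxContainers {
        // Second pass: flatten all remaining containers, sort by creation time,
        // and remove the oldest (numContainers - maxContainers) of them.
        fmt.Printf("remove the %d oldest dead containers node-wide\n", numContainers-maxContainers)
    }
}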
User Configuration
The sections above walked through the imageManager and containerGC implementations, including their GC policies. The policy defaults can be overridden by passing flags to the kubelet.
imageManager-related flags
image-gc-high-threshold: disk usage above this percentage triggers image garbage collection. Default: 90%
image-gc-low-threshold: the disk usage that image GC tries to free down to; if usage is already below this value, GC is not triggered. Default: 80%
containerGC-related flags
minimum-container-ttl-duration: how long a container must have been dead before it can be garbage collected. Default: 1m
maximum-dead-containers-per-container: the maximum number of dead containers to retain per (pod, container) pair. Default: 2
maximum-dead-containers: the maximum number of dead containers to retain on the node. Default: 240
Containers become eligible for garbage collection once they stop running, but it is often worth keeping some of them around: a container may have exited abnormally, and its logs or other useful data can help developers track down the problem.
maximum-dead-containers-per-container and maximum-dead-containers let us balance these two needs.
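For example, a kubelet could be started with flags like these (the values are purely illustrative):
kubelet --image-gc-high-threshold=85 \
        --image-gc-low-threshold=75 \
        --minimum-container-ttl-duration=2m \
        --maximum-dead-containers-per-container=1 \
        --maximum-dead-containers=100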