Kubernetes Stability Assurance Manual: Insight + Plan


Author | Wu Peng
Source | Alibaba Cloud Native Public

"Kubernetes Stability Assurance Manual" series of articles:

Kubernetes Stability Assurance Manual - Minimalist
Kubernetes Stability Assurance Manual - Log topic
Kubernetes Stability Assurance Manual -
Kubernetes Stability Assurance Manual - Insight + Plan (this article)

Summary

Stability assurance is a complicated topic. Ensuring the stability of a cluster requires an approach that is effective, iterable, and sustainable. A systematic method may be able to solve this problem.

To form a systematic method, you can sort out where the complexity of stability assurance comes from, formulate a data model to describe it, and then digitize and visualize that model. The data model is the core around which the understanding, practice, and experience of stability assurance are continuously iterated.

Sources of stability assurance complexity

The complexity of stability assurance generally comes from the following dimensions:

Number of system components and their interactions: changes continuously over time
Dynamic behavior of system components and their interactions: hard to derive and observe
Types and quantities of system resources: change continuously over time
Dynamic behavior of system resources: hard to derive and observe
Cluster stability assurance actions: hard to standardize and to carry out safely

In summary, there are two questions:

How to gain effective and comprehensive insight into the cluster (insight)
How to perform stability assurance actions safely (plan)

Data model

The data model of insights and plans can be abstracted into 4 diagrams and 3 tables:

4 diagrams

Architecture relationship diagram: describes the cluster components and their interactions
Architecture operation diagram: describes the dynamic characteristics of cluster components and their interactions
Resource composition diagram: describes the composition of cluster resources
Resource operation diagram: describes the dynamic usage characteristics of cluster resources

3 tables

Event list: describes the events generated by the cluster that need attention
Operation list: describes the management operations that can be performed in the cluster
Plan list: describes the relationship between events and operations in the cluster


Insight

The cluster's functions are provided by its architecture, and the functional components run on cluster resources. Therefore, the core of insight into cluster stability is a firm grasp of the cluster architecture and the cluster resources.

1. Architecture diagram

The cluster architecture can usually be represented as a graph, where nodes represent components and edges represent their interactions. The graph structure gives an intuitive grasp of the cluster architecture, as shown in the following figure:

It can be described by the following data structure:

{
    "nodes": [
        {
            "_id": "0ce0e913f6e5516846c654dbd81db6ecab1f684e",
            "name": "kube-apiserver",
            "description": "Inside XXX VPC",
            "type": "managed component",
            "dependencies": {}
        },
        {
            "_id": "f0740d8bb67520857061a9b71d4a9e4fc50bfe3d",
            "name": "etcd",
            "description": "Inside XXX VPC",
            "type": "managed component | storage",
            "dependencies": {}
        },
        {
            "_id": "05952a825e91cb50a81cbaf23c6941d5c3bb2c89",
            "name": "eni-operator",
            "description": "Inside XXX VPC; manages ENIs",
            "type": "component",
            "dependencies": {
                "serviceaccount": "enioperator",
                "clusterrole": "enioperator",
                "clusterrolebinding": "enioperator",
                "configmaps": ["eniconfig"],
                "secrets": ["enioperator"]
            }
        },
        {
            "_id": "42699513a7561e89a5f99881d7b05653a1625c51",
            "name": "Network Service",
            "description": "Management service for cloud network resources such as VPC/VSwitch",
            "type": "cloud service"
        }
    ],
    "edges": [
        {
            "_id": "38bce9ca8a0cec6d8586d96298bd63b0523fc946",
            "source": "eni-operator", "target": "kube-apiserver",
            "description": "ENI management requests"
        },
        {
            "_id": "93f3c21247165f0be3a969fc80f72bc1a402e9f5",
            "source": "eni-operator", "target": "Network Service",
            "description": "Calls the Alibaba Cloud ECS OpenAPI to manage network resources such as VPC/VSwitch"
        }
    ]
}
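
As a minimal sketch of how such a structure might be consumed, the Python below loads a trimmed copy of the graph and builds adjacency maps, so that questions like "what does eni-operator depend on?" become simple lookups. The function name build_adjacency and the check for unknown components are illustrative assumptions, not part of the original model.

```python
import json
from collections import defaultdict

# Trimmed copy of the architecture graph (only the fields used here).
ARCHITECTURE = json.loads("""
{
  "nodes": [
    {"name": "kube-apiserver", "type": "managed component"},
    {"name": "etcd", "type": "managed component | storage"},
    {"name": "eni-operator", "type": "component"},
    {"name": "Network Service", "type": "cloud service"}
  ],
  "edges": [
    {"source": "eni-operator", "target": "kube-apiserver"},
    {"source": "eni-operator", "target": "Network Service"}
  ]
}
""")


def build_adjacency(graph):
    """Return (outgoing, incoming) maps keyed by component name."""
    names = {node["name"] for node in graph["nodes"]}
    outgoing, incoming = defaultdict(list), defaultdict(list)
    for edge in graph["edges"]:
        # Reject edges that point at components missing from the node list.
        if edge["source"] not in names or edge["target"] not in names:
            raise ValueError(f"edge references unknown component: {edge}")
        outgoing[edge["source"]].append(edge["target"])
        incoming[edge["target"]].append(edge["source"])
    return dict(outgoing), dict(incoming)


outgoing, incoming = build_adjacency(ARCHITECTURE)
print(outgoing["eni-operator"])
print(incoming["kube-apiserver"])
```

Edges here refer to components by name, as in the sample; a stricter variant could key on "_id" instead.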

2. Architecture operation diagram

While the cluster is running, the internal state of components and their interactions can be inferred from external observation data such as logs, metrics, and traces. Combined with the architecture diagram, dynamic insight data can be superimposed on the static architecture to give a more intuitive grasp of cluster health, as shown in the following figure:

The numbers represent insight data, such as the number of exceptions or the request traffic. Besides numbers, insight can also be conveyed visually, for example by using color to indicate health status or line thickness to indicate traffic volume.

It can be described by the following data structure:

{
    "nodes": [
        {
            "_id": "ea4538dc0625d06b0dc93579998e04288656050f",
            "name": "mutatehook",
            "deploy": {
                "type": "K8s:Deployment",
                "namespace": "kube-system",
                "replicas": 3
            },
            "insight": [
                {
                    "source": {
                        "vendor": "cloud:aliyun:sls",
                        "log_project": "xxx",
                        "log_store": "mutatehook",
                        "log_url": "https://sls.console.aliyun.com/lognext/project/xxx"
                    },
                    "signal": {
                        "exception": {
                            "fuzzy": "fail OR Fail OR error OR Error"
                        }
                    }
                }
            ]
        }
    ],
    "edges": [
        {
            "_id": "38bce9ca8a0cec6d8586d96298bd63b0523fc946",
            "source": "eni-operator", "target": "kube-apiserver",
            "insight":[
                {
                    "source": {
                        "vendor": "cloud:aliyun:sls",
                        "log_project": "xxx",
                        "log_store": "xxx",
                        "log_url": "https://sls.console.aliyun.com/lognext/project/xxx"
                    },
                    "signal": {
                        "exception": {
                            "unauthorized": "Unauthorized",
                            "throttling": "'Throttling' OR 'throttling'"
                        }
                    }
                }
            ]
        }
    ]
}
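
To illustrate how the "signal" entries could turn into the numbers and colors drawn on the diagram, here is a rough Python sketch. It treats each expression as a list of substrings joined by OR, which is a deliberate simplification and not the real query semantics of the log service. The sample log lines, the function names, and the color thresholds are all hypothetical.

```python
def parse_terms(expression):
    """Split an 'a OR b' expression into plain search terms (quotes stripped)."""
    return [term.strip().strip("'\"") for term in expression.split(" OR ")]


def count_signals(signals, log_lines):
    """Count, per signal name, the log lines containing any of its terms."""
    counts = {}
    for name, expression in signals.items():
        terms = parse_terms(expression)
        counts[name] = sum(any(t in line for t in terms) for line in log_lines)
    return counts


def health_color(total, warn_at=1, critical_at=10):
    """Map an exception count to a color; the thresholds are illustrative."""
    if total >= critical_at:
        return "red"
    if total >= warn_at:
        return "yellow"
    return "green"


# The "exception" signals of the eni-operator -> kube-apiserver edge.
edge_signals = {
    "unauthorized": "Unauthorized",
    "throttling": "'Throttling' OR 'throttling'",
}
# Hypothetical log lines standing in for data fetched from the log store.
sample_logs = [
    "GET /api/v1/pods 401 Unauthorized",
    "request throttling applied to eni-operator",
    "Throttling: rate limit exceeded",
    "GET /healthz 200 OK",
]
counts = count_signals(edge_signals, sample_logs)
color = health_color(sum(counts.values()))
print(counts, color)
```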

3. Resource composition diagram

Resource management is a complex topic. The composition of resources in the cluster can also be modeled as a graph, where nodes represent resources and edges represent ownership or binding relationships between them.

It can be described by the following data structure:

{
    "kinds": ["vpc", "vswitch", "securitygroup", "ecs", "clb", "rds", "nat", "eip"],
    "tags": {
        "cluster/product": "xxx",
        "cluster/id": "2736f42d4e882ad6825d6364545a3f1cb5136859",
        "cluster/name": "xxx",
        "cluster/env": "staging"
    },
    "nodes": [
        {
            "kind": "vpc",
            "nodes": [
                {
                    "_id": "c505f21871bac7385c1387988cf226310af0831e",
                    "id": "vpc-xxx",
                    "description": "",
                    "ipv4": "xxx",
                    "tags": {
                        "resource/creator": "product",
                        "resource/role": ""
                    },
                    "url": "https://vpc.console.aliyun.com/vpc/xxx"
                }
            ]
        },
        {
            "kind": "ecs",
            "nodes": [
                {
                    "_id": "47c4fe5cc2585a49f07798a0b8b69cda7f8d4a23",
                    "id": "xxx",
                    "az": "xxx",
                    "interfaces": {
                        "primary": {
                            "ip": "xxx",
                            "eni": "xxx",
                            "mac": "xxx"
                        }
                    },
                    "instance-type-family": "xxx",
                    "instance-type": "xxx",
                    "tags": {
                        "resource/creator": "product",
                        "resource/role": "worker",
                        "node/container-runtime": "xxx",
                        "node/user-networking": "xxx",
                        "node/system-networking": "xxx"
                    },
                    "status": "",
                    "condition": "",
                    "url": "https://ecs.console.aliyun.com/#/server/xxx"
                }
            ]
        }
    ],
    "edges": [
        {
            "_id": "a754c748b2723a25c017421dd0969d00df3c000b",
            "source": "vsw-xxx", "target": "vpc-xxx",
            "description": ""
        },
        {
            "_id": "c34b164eba2897cfb2b574a576672d8aa441d709",
            "source": "eip-xxx", "target": "ngw-xxx",
            "description": ""
        }
    ]
}
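
Because resources are grouped by kind while edges refer to resource ids, a consumer of this structure would likely need a flat index first. The sketch below shows one way to build it and to flag edges whose endpoints are not in the inventory. The helper names and the tiny sample are assumptions made for illustration.

```python
def index_resources(composition):
    """Flatten the per-kind node groups into {resource id: (kind, node)}."""
    index = {}
    for group in composition["nodes"]:
        for node in group["nodes"]:
            index[node["id"]] = (group["kind"], node)
    return index


def dangling_edges(composition):
    """Return edges whose source or target is not a known resource id."""
    index = index_resources(composition)
    return [
        edge for edge in composition["edges"]
        if edge["source"] not in index or edge["target"] not in index
    ]


# Hypothetical, trimmed composition: the second edge points at resources
# that were never listed under "nodes".
composition = {
    "nodes": [
        {"kind": "vpc", "nodes": [{"id": "vpc-xxx"}]},
        {"kind": "vswitch", "nodes": [{"id": "vsw-xxx"}]},
    ],
    "edges": [
        {"source": "vsw-xxx", "target": "vpc-xxx"},
        {"source": "eip-xxx", "target": "ngw-xxx"},
    ],
}
print(index_resources(composition)["vsw-xxx"][0])
print(dangling_edges(composition))
```

A dangling edge may simply mean the inventory is incomplete, but it can also be a hint of a leaked or manually created resource.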

4. Resource operation diagram

While resources are in use, the internal state of resources and the relationships between them can likewise be inferred from external observation data such as logs, metrics, and events. Combined with the resource composition diagram, dynamic insight data can be superimposed on the static resources to give an intuitive grasp of how cluster resources are being used.

It can be described by the following data structure:

{
    "nodes": [
        {
            "_id": "35103ac62d4ef0a314e2a5128f44c684205bea2f",
            "id": "vpc",
            "insight": [
                {
                    "source": {
                        "vendor": "cloud:aliyun:vpc",
                        "type": "OpenAPI"
                    },
                    "signal": {
                        "vpc/exist": "DescribeVpcs",
                        "vswitch/count": "DescribeVSwitches"
                    }
                },
                {
                    "source": {
                        "vendor": "cloud:aliyun:ecs",
                        "type": "OpenAPI"
                    },
                    "signal": {
                        "ecs/count": "DescribeInstances",
                        "securitygroup/count": "DescribeSecurityGroups"
                    }
                }
            ]
        },
        {
            "_id": "6450e07dc67027f76f29fbfcb841e57200855196",
            "id": "ecs",
            "insight": [
                {
                    "source": {
                        "vendor": "cloud:aliyun:ecs",
                        "type": "OpenAPI"
                    },
                    "signal": {
                        "ecs/exist": "DescribeInstances",
                        "ecs/count": "DescribeInstances",
                        "ecs/usage": "DescribeInstanceMonitorData"
                    }
                },
                {
                    "source": {
                        "vendor": "cloud:aliyun:ecs",
                        "type": "auto"
                    },
                    "signal": {
                        "ecs/state_change": ""
                    }
                }
            ]
        }
    ],
    "edges": [
        {
            "_id": "caa1e395c713f47766ca7bcfc20419c0be0f0803",
            "source": "i-xxx", "target": "sg-xxx",
            "insight": [
                {
                    "source": {
                        "vendor": "cloud:aliyun:ecs",
                        "type": "OpenAPI"
                    },
                    "signal": {
                        "exist": "DescribeInstances"
                    }
                }
            ]
        },
        {
            "_id": "537dc478d95714792b3694674d6164f72b361bb0",
            "source": "eip-xxx", "target": "ngw-xxx",
            "insight": [
                {
                    "source": {
                        "vendor": "cloud:aliyun:vpc",
                        "type": "OpenAPI"
                    },
                    "signal": {
                        "exist": "DescribeEipAddresses"
                    }
                }
            ]
        }
    ]
}
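
Several signals in this structure map to the same OpenAPI operation (for example, DescribeInstances backs both "ecs/exist" and "ecs/count"). One possible way to use the model, sketched below, is to group signals by vendor and API so that a collector calls each API once per polling round. The function plan_api_calls and the rule of skipping sources that are not of type "OpenAPI" are assumptions, not something the original specifies.

```python
from collections import defaultdict


def plan_api_calls(operation_graph):
    """Group signal names by (vendor, API) so each OpenAPI is polled once."""
    calls = defaultdict(set)
    for item in operation_graph["nodes"] + operation_graph["edges"]:
        for insight in item["insight"]:
            source = insight["source"]
            if source["type"] != "OpenAPI":
                continue  # assumed: other source types are not polled
            for signal, api in insight["signal"].items():
                calls[(source["vendor"], api)].add(signal)
    return {key: sorted(names) for key, names in calls.items()}


# Trimmed copy of the "ecs" node and the i-xxx -> sg-xxx edge.
operation_graph = {
    "nodes": [
        {
            "id": "ecs",
            "insight": [
                {
                    "source": {"vendor": "cloud:aliyun:ecs", "type": "OpenAPI"},
                    "signal": {
                        "ecs/exist": "DescribeInstances",
                        "ecs/count": "DescribeInstances",
                        "ecs/usage": "DescribeInstanceMonitorData",
                    },
                },
                {
                    "source": {"vendor": "cloud:aliyun:ecs", "type": "auto"},
                    "signal": {"ecs/state_change": ""},
                },
            ],
        }
    ],
    "edges": [
        {
            "source": "i-xxx",
            "target": "sg-xxx",
            "insight": [
                {
                    "source": {"vendor": "cloud:aliyun:ecs", "type": "OpenAPI"},
                    "signal": {"exist": "DescribeInstances"},
                }
            ],
        }
    ],
}
calls = plan_api_calls(operation_graph)
for (vendor, api), signals in calls.items():
    print(vendor, api, signals)
```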

Plan

Cluster exceptions are inevitable, and they must be handled safely and effectively when they occur.

Exceptions can be characterized by events. Safe and effective operations are those that have been reviewed and rehearsed in drills. Binding events to operations, so that an exception triggers the corresponding operations, yields a reviewed and rehearsed plan that can handle cluster exceptions safely and effectively.

1. Event List

While the cluster is running, it generates events that need attention. The event format can follow the community CloudEvents standard (https://github.com/cloudevents/spec/blob/v1.0.1/spec.md).

It can be described by the following data structure:

{
    "events": [
        {
            "_id": "a1ab5b61857be35a5c5b203dd84b49248161c823",
            "description": "restart workload manually",
            "event": {
                "id": "restart-workload",
                "source": "xxx",
                "specversion": "1.0",
                "type": "com.aliyun.trigger.manual",
                "datacontenttype": "application/json",
                "data": "{\"NAMESPACE\": \"\", \"NAME\": \"\", \"TYPE\": \"\"}"
            }
        }
    ]
}
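
CloudEvents 1.0 requires the context attributes id, source, specversion, and type. A consumer of the event list might validate entries along the following lines; this is a sketch using only the standard library, and the function name parse_event is an assumption.

```python
import json

# Required context attributes in CloudEvents 1.0.
REQUIRED_ATTRIBUTES = ("id", "source", "specversion", "type")


def parse_event(entry):
    """Validate one event-list entry and return (event type, decoded data)."""
    event = entry["event"]
    missing = [name for name in REQUIRED_ATTRIBUTES if not event.get(name)]
    if missing:
        raise ValueError(f"event is missing required attributes: {missing}")
    data = event.get("data", "{}")
    # The sample stores data as a JSON-encoded string, so decode it here.
    if event.get("datacontenttype") == "application/json" and isinstance(data, str):
        data = json.loads(data)
    return event["type"], data


entry = {
    "_id": "a1ab5b61857be35a5c5b203dd84b49248161c823",
    "description": "restart workload manually",
    "event": {
        "id": "restart-workload",
        "source": "xxx",
        "specversion": "1.0",
        "type": "com.aliyun.trigger.manual",
        "datacontenttype": "application/json",
        "data": '{"NAMESPACE": "", "NAME": "", "TYPE": ""}',
    },
}
event_type, data = parse_event(entry)
print(event_type, sorted(data))
```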

2. Operation list

To reduce the chance of misoperation and to avoid running unreviewed, unverified operations when an exception occurs, the list of operations that may be performed in the cluster needs to be defined in advance.

It can be described by the following data structure:

{
    "actions": [
        {
            "_id": "47abc5cd9d64018ebf96dc5b2d6a4fbd35a3cb6d",
            "name": "Action Restart Workload",
            "exec": "restart-workload",
            "env": [
                "NAMESPACE",
                "NAME",
                "TYPE"
            ]
        }
    ]
}
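
The original does not say how "exec" is run. As one hedged illustration, an executor could at least check that the caller supplies exactly the environment variables an action declares before anything is launched. The function prepare_action and its strictness about unexpected parameters are assumptions.

```python
def prepare_action(action, params):
    """Return (executable name, environment) or raise on bad parameters."""
    required = set(action["env"])
    missing = sorted(required - set(params))
    unexpected = sorted(set(params) - required)
    if missing or unexpected:
        raise ValueError(f"missing={missing} unexpected={unexpected}")
    # Environment values must be strings; keep the declared order.
    return action["exec"], {name: str(params[name]) for name in action["env"]}


action = {
    "_id": "47abc5cd9d64018ebf96dc5b2d6a4fbd35a3cb6d",
    "name": "Action Restart Workload",
    "exec": "restart-workload",
    "env": ["NAMESPACE", "NAME", "TYPE"],
}
# Hypothetical parameters for restarting the mutatehook Deployment.
executable, env = prepare_action(
    action, {"NAMESPACE": "kube-system", "NAME": "mutatehook", "TYPE": "deployment"}
)
print(executable, env)
```

Rejecting unexpected parameters keeps the reviewed operation from being widened at call time.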

3. Plan list

With the event list and the operation list in place, events can be associated with operations so that exceptions are handled in an event-driven manner. Each such association is a plan.

It can be described by the following data structure:

{
    "plans": [
        {
            "_id": "29a091c48d8992991ed69e8694b017a11abe3eec",
            "name": "Plan Restart Workload",
            "description": "Restart the workload",
            "event": "29a091c48d8992991ed69e8694b017a11abe3eec".replace("29a091c48d8992991ed69e8694b017a11abe3eec", "a1ab5b61857be35a5c5b203dd84b49248161c823"),
            "actions": ["47abc5cd9d64018ebf96dc5b2d6a4fbd35a3cb6d"]
        }
    ]
}
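
A plan links an event to actions through their "_id" values. The sketch below shows how a dispatcher might resolve an incoming event to the actions it should run; resolve_plans is a hypothetical helper, and the records are trimmed copies of the samples in the three lists.

```python
def resolve_plans(event_id, plans, actions):
    """Return, for each plan bound to the event, its ordered list of actions."""
    actions_by_id = {action["_id"]: action for action in actions}
    resolved = {}
    for plan in plans:
        if plan["event"] != event_id:
            continue
        # A KeyError here means the plan names an action that is not on the
        # reviewed operation list, which should never be executed.
        resolved[plan["name"]] = [actions_by_id[a] for a in plan["actions"]]
    return resolved


actions = [
    {
        "_id": "47abc5cd9d64018ebf96dc5b2d6a4fbd35a3cb6d",
        "name": "Action Restart Workload",
        "exec": "restart-workload",
    }
]
plans = [
    {
        "_id": "29a091c48d8992991ed69e8694b017a11abe3eec",
        "name": "Plan Restart Workload",
        "event": "a1ab5b61857be35a5c5b203dd84b49248161c823",
        "actions": ["47abc5cd9d64018ebf96dc5b2d6a4fbd35a3cb6d"],
    }
]
resolved = resolve_plans("a1ab5b61857be35a5c5b203dd84b49248161c823", plans, actions)
for plan_name, plan_actions in resolved.items():
    print(plan_name, [a["exec"] for a in plan_actions])
```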

Globally visualized stability assurance

Based on the 4 diagrams and 3 tables above, the insight + plan approach to cluster stability assurance can be turned into a globally visualized stability assurance service.

Such a service has the following key points:

Global perspective
Digitization
Visualization

This service rests on two principles:

People process images far more efficiently than text
A global perspective makes it possible to understand the system end to end, locate problems precisely, and handle problems safely

Take the traffic map in daily life as an example:

With a traffic map, you can quickly understand the road layout and key junctions in an area, and the customary red, yellow, and green colors intuitively convey how congested each road is. Richer traffic maps also show important events such as roadworks and road closures.

In this way, visualization lets you quickly understand the traffic and geography of an area.

The underlying data model is the foundation, and visualization makes the value of the data easier to realize.

An implementation

1) Deployment form

Deployed per region
Serves a single cluster or multiple clusters within the region

2) User experience

Following best practices for stability assurance, the service is divided into the following sections:

Running link diagram
- This section is the most frequently used area in day-to-day stability assurance. Visualization makes the occurrence, scope, and impact of exceptions directly visible, and exceptions can be handled through a graphical console with visualization.
Deployment architecture diagram
- This section is used to understand the cluster's deployment architecture and to detect and handle problems at the deployment level
- Capacity management (including node management, capacity planning, etc.) is carried out in this section
Business flowchart
- This section accumulates functional flowcharts of the business. It helps the business keep functional complexity under control and understand the current state of its functions, and together these support business iteration
- Business-related data analysis can be placed in this section
Data analysis: this section serves two kinds of data needs
- Business needs
  - View: SLI information such as cluster size, SLO information such as cluster stability
  - Query: statistics by attribute (for example, querying resource requests by label)
- Stability assurance needs
  - View: SLI information such as cluster utilization levels, SLO information such as the effect of cluster stability assurance
  - Query: statistics by attribute (for example, querying all associated resources or leaked resources by label)
Observability management
- This section manages observability-related matters, including:
  - Observation data generation
  - Observation data collection
  - Observation data processing
  - Observation data consumption
Controllability management
- This section manages control-related operations, including:
  - Release management
  - Disaster recovery management
  - Plan management
  - Resource management
  - Chaos engineering
  - Security management
  - Regular health checks

During normal operation of the system:

In the "Data analysis" section, confirm the cluster's coverage and accuracy in terms of observability and controllability
In the "Observability management" section, manage the observable dimensions, including adding and governing data sources, monitoring, and alarms
In the "Controllability management" section:
- Based on problems found in observation data, configure plans, manage issues, etc.
- Based on problems found in chaos engineering or drills, configure plans, etc.
In the "Running link diagram" and "Deployment architecture diagram" sections, visually associate the configured monitoring, alarms, and plans with components or links

During system exceptions and recovery, in the "Running link diagram" section:

Detect that an exception has occurred through the cluster running link diagram or through alarms
Start issue tracking, automatically or manually
Identify the abnormal components, abnormal links, and severity from the colors of components and interactions in the cluster running link diagram
Click the exception count on a component in the cluster running link diagram to see the associated exception details, or jump to the logging or tracing system for manual queries
Based on the exception details or platform prompts, determine the plan to execute and the components involved
Execute the plan in the cluster running link diagram (to contain the problem or restore service)
Confirm the effect of the plan from the colors of components and interactions in the cluster running link diagram
End issue tracking, automatically or manually

The main items recorded during issue tracking are:

The issue
The time the exception occurred
The actions performed during exception handling
A snapshot of the running link diagram
The time the exception was resolved
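
The recorded items could be modeled as a small record type. The dataclass below is only a sketch: the field names, the choice to store one diagram snapshot per action, and the timestamps in the usage example are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class IssueRecord:
    """One tracked issue, from the exception occurring to its recovery."""
    issue: str
    occurred_at: datetime
    actions: List[str] = field(default_factory=list)
    snapshots: List[dict] = field(default_factory=list)
    recovered_at: Optional[datetime] = None

    def record_action(self, name: str, snapshot: dict) -> None:
        """Log an executed action together with a diagram snapshot."""
        self.actions.append(name)
        self.snapshots.append(snapshot)

    def close(self, recovered_at: datetime) -> None:
        """End tracking; recovery cannot precede the occurrence."""
        if recovered_at < self.occurred_at:
            raise ValueError("recovery time precedes occurrence time")
        self.recovered_at = recovered_at

    @property
    def duration(self) -> Optional[timedelta]:
        """Time to recovery, or None while the issue is still open."""
        if self.recovered_at is None:
            return None
        return self.recovered_at - self.occurred_at


start = datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc)
record = IssueRecord(issue="mutatehook exceptions rising", occurred_at=start)
record.record_action("Plan Restart Workload", {"mutatehook": "red"})
record.close(start + timedelta(minutes=12))
print(record.duration)
```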

Data model and competitiveness analysis

The data model is the medium for iterating on, sharing, and applying best practices for stability assurance. General-purpose insights and plans can be turned into standardized services. Custom insights and plans can be described with a fixed structure and then put into practice by a general-purpose controller.

Insight + plan is built on the data model. Its technical core is:

Insight model
- Key questions:
  - How to gain insight into cluster stability?
  - How to gain insight into the efficiency of business iteration?
Data model
- Key question:
  - How to define an effective and extensible data description?

On top of the core technology, we can iterate on the following areas of competitiveness:

Insight
- Global
- Digital
- Visual
Efficiency
- Shortest operating path
- Lowest cost of use
Advancement
- Best practices built into processes

Summary

Through specs for the 7 data models, insight + plan can be characterized with structured descriptions. With this at the core, we will keep iterating on the practice and understanding of stability assurance and accelerate business iteration. Going one step further, the model can also feed back into the business from the development side.

If you are interested, you are welcome to join the discussion in the comments.
