Introduction: Stability assurance is a complex topic. It needs to be effective, iterable, and sustainable in order to keep a cluster stable. A systematic approach may help solve this problem.


Author | Wu Peng
Source | Alibaba Cloud Native official account

This article is part of the "Kubernetes Stability Assurance Manual" series.

Summary


Stability assurance is a complicated topic. It needs to be effective, iterable, and sustainable in order to ensure the stability of the cluster. A systematic approach may be able to solve this problem.

To form a systematic approach, you can sort out the sources of complexity in stability assurance, formulate a data model to describe them, and then digitize and visualize that model. The data model is the core on which the understanding, practice, and experience of stability assurance are continuously iterated.

Sources of stability assurance complexity


The complexity of stability assurance generally comes from the following dimensions:

  • Number of system components and their interactions: changes continuously over time
  • Dynamic behavior of system components and their interactions: not easy to derive or observe
  • Types and quantities of system resources: change continuously over time
  • Dynamic behavior of system resources: not easy to derive or observe
  • Cluster stability assurance actions: not easy to standardize or execute safely

In summary, the questions are:

  • How to gain effective and comprehensive insight into the cluster (insight)
  • How to safely perform stability assurance actions (plan)

Data model


The data model for insight and plans can be abstracted into 4 diagrams and 3 tables:

4 diagrams

  • Architecture relationship diagram: describes cluster components and their interactions
  • Architecture operation diagram: describes the dynamic characteristics of cluster components and their interactions
  • Resource composition diagram: describes the composition of cluster resources
  • Resource operation diagram: describes the dynamic usage characteristics of cluster resources

3 tables

  • Event list: describes the events generated by the cluster that need attention
  • Operation list: describes the management operations that can be performed in the cluster
  • Plan list: describes the relationship between events and operations in the cluster

As shown in the following figure:

1.png

Insight


The cluster's functions are provided by the cluster architecture, and the functional components run on cluster resources. Therefore, the core of insight into cluster stability is to grasp both the cluster architecture and the cluster resources.

1. Architecture relationship diagram


The cluster architecture can usually be represented as a graph, where nodes represent components and edges represent interactions. The graph structure makes the cluster architecture easy to grasp intuitively, as shown in the following figure:

2.png

It can be described by the following data structure:

{
    "nodes": [
        {
            "_id": "0ce0e913f6e5516846c654dbd81db6ecab1f684e",
            "name": "kube-apiserver",
            "description": "Inside XXX VPC",
            "type": "managed component",
            "dependencies": {}
        },
        {
            "_id": "f0740d8bb67520857061a9b71d4a9e4fc50bfe3d",
            "name": "etcd",
            "description": "Inside XXX VPC",
            "type": "managed component | storage",
            "dependencies": {}
        },
        {
            "_id": "05952a825e91cb50a81cbaf23c6941d5c3bb2c89",
            "name": "eni-operator",
            "description": "Inside XXX VPC; manages ENIs",
            "type": "component",
            "dependencies": {
                "serviceaccount": "enioperator",
                "clusterrole": "enioperator",
                "clusterrolebinding": "enioperator",
                "configmaps": ["eniconfig"],
                "secrets": ["enioperator"]
            }
        },
        {
            "_id": "42699513a7561e89a5f99881d7b05653a1625c51",
            "name": "Network Service",
            "description": "Management service for cloud network resources such as VPC/VSwitch",
            "type": "cloud service"
        }
    ],
    "edges": [
        {
            "_id": "38bce9ca8a0cec6d8586d96298bd63b0523fc946",
            "source": "eni-operator", "target": "kube-apiserver",
            "description": "ENI management requests"
        },
        {
            "_id": "93f3c21247165f0be3a969fc80f72bc1a402e9f5",
            "source": "eni-operator", "target": "Network Service",
            "description": "Calls the Alibaba Cloud ECS OpenAPI to manage network resources such as VPC/VSwitch"
        }
    ]
}
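
As a minimal sketch of how this structure might be consumed, the Python below loads a trimmed copy of the graph, checks that every edge refers to a declared node, and builds an adjacency map. The function names `dangling_edges` and `adjacency` are hypothetical helpers for illustration and are not part of the original design; matching edges to nodes by `name` is also an assumption based on the sample above.

```python
import json
from collections import defaultdict

# Trimmed copy of the architecture graph above (only names, types, and edges kept).
ARCHITECTURE = json.loads("""
{
    "nodes": [
        {"name": "kube-apiserver", "type": "managed component"},
        {"name": "etcd", "type": "managed component | storage"},
        {"name": "eni-operator", "type": "component"},
        {"name": "Network Service", "type": "cloud service"}
    ],
    "edges": [
        {"source": "eni-operator", "target": "kube-apiserver"},
        {"source": "eni-operator", "target": "Network Service"}
    ]
}
""")


def dangling_edges(graph):
    """Return edges whose source or target is not a declared node name."""
    names = {node["name"] for node in graph["nodes"]}
    return [
        edge for edge in graph["edges"]
        if edge["source"] not in names or edge["target"] not in names
    ]


def adjacency(graph):
    """Map each component name to the sorted list of components it calls."""
    result = defaultdict(list)
    for edge in graph["edges"]:
        result[edge["source"]].append(edge["target"])
    return {source: sorted(targets) for source, targets in result.items()}


if __name__ == "__main__":
    print(dangling_edges(ARCHITECTURE))
    print(adjacency(ARCHITECTURE))
```

A check like `dangling_edges` could run whenever the diagram is edited, so that a renamed component does not silently leave an interaction pointing at nothing.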

2. Architecture operation diagram


While the cluster is running, the internal state of components and their interactions can be inferred from external observation data such as logs, metrics, and traces. Combined with the cluster architecture diagram, dynamic insight data can be overlaid on the static architecture to give a more intuitive view of cluster health, as shown in the following figure:

3.png

The numbers represent insight data, such as the number of exceptions or the request traffic. Besides numbers, insight can also be conveyed by using color to indicate health status, line thickness to indicate traffic volume, and so on.

It can be described by the following data structure:

{
    "nodes": [
        {
            "_id": "ea4538dc0625d06b0dc93579998e04288656050f",
            "name": "mutatehook",
            "deploy": {
                "type": "K8s:Deployment",
                "namespace": "kube-system",
                "replicas": 3
            },
            "insight": [
                {
                    "source": {
                        "vendor": "cloud:aliyun:sls",
                        "log_project": "xxx",
                        "log_store": "mutatehook",
                        "log_url": "https://sls.console.aliyun.com/lognext/project/xxx"
                    },
                    "signal": {
                        "exception": {
                            "fuzzy": "fail OR Fail OR error OR Error"
                        }
                    }
                }
            ]
        }
    ],
    "edges": [
        {
            "_id": "38bce9ca8a0cec6d8586d96298bd63b0523fc946",
            "source": "eni-operator", "target": "kube-apiserver",
            "insight":[
                {
                    "source": {
                        "vendor": "cloud:aliyun:sls",
                        "log_project": "xxx",
                        "log_store": "xxx",
                        "log_url": "https://sls.console.aliyun.com/lognext/project/xxx"
                    },
                    "signal": {
                        "exception": {
                            "unauthorized": "Unauthorized",
                            "throttling": "'Throttling' OR 'throttling'"
                        }
                    }
                }
            ]
        }
    ]
}
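
To illustrate how a `signal` entry could be turned into the "number of exceptions" shown on the diagram, here is a rough sketch. It assumes that each exception pattern is a plain list of case-sensitive keywords joined by `OR`; the real log service query syntax is richer than this, and the helper names and sample log lines are invented for illustration.

```python
def parse_or_query(query):
    """Split a simple 'a OR b OR c' query into terms, stripping quotes."""
    return [term.strip().strip("'\"") for term in query.split(" OR ")]


def count_exceptions(signal, log_lines):
    """Count log lines matching each named exception pattern in a signal."""
    counts = {}
    for name, query in signal["exception"].items():
        terms = parse_or_query(query)
        counts[name] = sum(
            1 for line in log_lines if any(term in line for term in terms)
        )
    return counts


# Signal taken from the eni-operator -> kube-apiserver edge above.
EDGE_SIGNAL = {
    "exception": {
        "unauthorized": "Unauthorized",
        "throttling": "'Throttling' OR 'throttling'",
    }
}

# Hypothetical log lines, invented for illustration only.
SAMPLE_LOGS = [
    "request failed: Unauthorized",
    "Throttling: rate limit exceeded",
    "client-side throttling, waiting 1s",
    "ENI attached",
]

if __name__ == "__main__":
    print(count_exceptions(EDGE_SIGNAL, SAMPLE_LOGS))
```

In practice the counting would more likely be pushed down to the log backend; the point here is only that each named pattern yields one number that can be attached to a node or edge.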

3. Resource composition diagram

Resource management is a complex subject. By analyzing how cluster resources are composed, you can also use a graph to characterize the structure of cluster resources, where nodes represent resources and edges represent binding or ownership relationships between resources.

It can be described by the following data structure:

{
    "kinds": ["vpc", "vswitch", "securitygroup", "ecs", "clb", "rds", "nat", "eip"],
    "tags": {
        "cluster/product": "xxx",
        "cluster/id": "2736f42d4e882ad6825d6364545a3f1cb5136859",
        "cluster/name": "xxx",
        "cluster/env": "staging"
    },
    "nodes": [
        {
            "kind": "vpc",
            "nodes": [
                {
                    "_id": "c505f21871bac7385c1387988cf226310af0831e",
                    "id": "vpc-xxx",
                    "description": "",
                    "ipv4": "xxx",
                    "tags": {
                        "resource/creator": "product",
                        "resource/role": ""
                    },
                    "url": "https://vpc.console.aliyun.com/vpc/xxx"
                }
            ]
        },
        {
            "kind": "ecs",
            "nodes": [
                {
                    "_id": "47c4fe5cc2585a49f07798a0b8b69cda7f8d4a23",
                    "id": "xxx",
                    "az": "xxx",
                    "interfaces": {
                        "primary": {
                            "ip": "xxx",
                            "eni": "xxx",
                            "mac": "xxx"
                        }
                    },
                    "instance-type-family": "xxx",
                    "instance-type": "xxx",
                    "tags": {
                        "resource/creator": "product",
                        "resource/role": "worker",
                        "node/container-runtime": "xxx",
                        "node/user-networking": "xxx",
                        "node/system-networking": "xxx"
                    },
                    "status": "",
                    "condition": "",
                    "url": "https://ecs.console.aliyun.com/#/server/xxx"
                }
            ]
        }
    ],
    "edges": [
        {
            "_id": "a754c748b2723a25c017421dd0969d00df3c000b",
            "source": "vsw-xxx", "target": "vpc-xxx",
            "description": ""
        },
        {
            "_id": "c34b164eba2897cfb2b574a576672d8aa441d709",
            "source": "eip-xxx", "target": "ngw-xxx",
            "description": ""
        }
    ]
}
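
The nested layout (resource groups per `kind`, each with its own `nodes`) can be flattened for queries such as "all resources with a given tag". The sketch below shows one possible way to do that and to spot edge endpoints that have no matching resource entry. The helper names and the sample ids (`vpc-a`, `vsw-a`, `i-a`) are hypothetical.

```python
def flatten_resources(diagram):
    """Yield (kind, resource) pairs from the nested per-kind node groups."""
    for group in diagram["nodes"]:
        for resource in group["nodes"]:
            yield group["kind"], resource


def find_by_tag(diagram, key, value):
    """Return ids of resources whose tags contain key == value."""
    return [
        resource["id"]
        for _, resource in flatten_resources(diagram)
        if resource.get("tags", {}).get(key) == value
    ]


def unknown_edge_endpoints(diagram):
    """Return edge endpoints that do not match any resource id in nodes."""
    known = {resource["id"] for _, resource in flatten_resources(diagram)}
    endpoints = set()
    for edge in diagram["edges"]:
        endpoints.update((edge["source"], edge["target"]))
    return sorted(endpoints - known)


# Hypothetical sample following the structure above, with made-up ids.
RESOURCES = {
    "nodes": [
        {"kind": "vpc", "nodes": [
            {"id": "vpc-a", "tags": {"resource/creator": "product"}},
        ]},
        {"kind": "ecs", "nodes": [
            {"id": "i-a", "tags": {"resource/creator": "product",
                                   "resource/role": "worker"}},
        ]},
    ],
    "edges": [
        {"source": "vsw-a", "target": "vpc-a"},
    ],
}

if __name__ == "__main__":
    print(find_by_tag(RESOURCES, "resource/role", "worker"))
    print(unknown_edge_endpoints(RESOURCES))
```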

4. Resource operation diagram


While resources are in use, the internal state of resources and the relationships between them can also be inferred from external observation data such as logs, metrics, and events. Combined with the resource composition diagram, dynamic insight data can be overlaid on the static resources to give an intuitive view of how cluster resources are being used.

It can be described by the following data structure:

{
    "nodes": [
        {
            "_id": "35103ac62d4ef0a314e2a5128f44c684205bea2f",
            "id": "vpc",
            "insight": [
                {
                    "source": {
                        "vendor": "cloud:aliyun:vpc",
                        "type": "OpenAPI"
                    },
                    "signal": {
                        "vpc/exist": "DescribeVpcs",
                        "vswitch/count": "DescribeVSwitches"
                    }
                },
                {
                    "source": {
                        "vendor": "cloud:aliyun:ecs",
                        "type": "OpenAPI"
                    },
                    "signal": {
                        "ecs/count": "DescribeInstances",
                        "securitygroup/count": "DescribeSecurityGroups"
                    }
                }
            ]
        },
        {
            "_id": "6450e07dc67027f76f29fbfcb841e57200855196",
            "id": "ecs",
            "insight": [
                {
                    "source": {
                        "vendor": "cloud:aliyun:ecs",
                        "type": "OpenAPI"
                    },
                    "signal": {
                        "ecs/exist": "DescribeInstances",
                        "ecs/count": "DescribeInstances",
                        "ecs/usage": "DescribeInstanceMonitorData"
                    }
                },
                {
                    "source": {
                        "vendor": "cloud:aliyun:ecs",
                        "type": "auto"
                    },
                    "signal": {
                        "ecs/state_change": ""
                    }
                }
            ]
        }
    ],
    "edges": [
        {
            "_id": "caa1e395c713f47766ca7bcfc20419c0be0f0803",
            "source": "i-xxx", "target": "sg-xxx",
            "insight": [
                {
                    "source": {
                        "vendor": "cloud:aliyun:ecs",
                        "type": "OpenAPI"
                    },
                    "signal": {
                        "exist": "DescribeInstances"
                    }
                }
            ]
        },
        {
            "_id": "537dc478d95714792b3694674d6164f72b361bb0",
            "source": "eip-xxx", "target": "ngw-xxx",
            "insight": [
                {
                    "source": {
                        "vendor": "cloud:aliyun:vpc",
                        "type": "OpenAPI"
                    },
                    "signal": {
                        "exist": "DescribeEipAddresses"
                    }
                }
            ]
        }
    ]
}
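
Several signals in this structure name the same OpenAPI action (for example `DescribeInstances`), so a collector would probably want to call each action once and derive multiple signals from the response. The sketch below only groups the distinct actions per vendor and makes no real API calls. Treating sources whose type is not `OpenAPI` (such as `auto`) as not polled is an assumption, and `openapi_actions` is a hypothetical helper name.

```python
from collections import defaultdict


def openapi_actions(diagram):
    """Group the distinct OpenAPI actions named in signals by vendor."""
    actions = defaultdict(set)
    for item in diagram["nodes"] + diagram["edges"]:
        for insight in item["insight"]:
            source = insight["source"]
            if source.get("type") != "OpenAPI":
                continue  # assumption: non-OpenAPI sources are not polled
            for action in insight["signal"].values():
                if action:
                    actions[source["vendor"]].add(action)
    return {vendor: sorted(names) for vendor, names in actions.items()}


# Trimmed copy of the resource operation diagram above.
DIAGRAM = {
    "nodes": [
        {"id": "vpc", "insight": [
            {"source": {"vendor": "cloud:aliyun:vpc", "type": "OpenAPI"},
             "signal": {"vpc/exist": "DescribeVpcs",
                        "vswitch/count": "DescribeVSwitches"}},
        ]},
        {"id": "ecs", "insight": [
            {"source": {"vendor": "cloud:aliyun:ecs", "type": "OpenAPI"},
             "signal": {"ecs/exist": "DescribeInstances",
                        "ecs/count": "DescribeInstances",
                        "ecs/usage": "DescribeInstanceMonitorData"}},
            {"source": {"vendor": "cloud:aliyun:ecs", "type": "auto"},
             "signal": {"ecs/state_change": ""}},
        ]},
    ],
    "edges": [
        {"source": "eip-xxx", "target": "ngw-xxx", "insight": [
            {"source": {"vendor": "cloud:aliyun:vpc", "type": "OpenAPI"},
             "signal": {"exist": "DescribeEipAddresses"}},
        ]},
    ],
}

if __name__ == "__main__":
    print(openapi_actions(DIAGRAM))
```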

Plan


Cluster exceptions are inevitable, and they need to be handled safely and effectively when they occur.

Exceptions can be characterized by events. Safe and effective operations are those that have been reviewed and drilled. By associating exceptions with operations, so that an exception triggers the corresponding operations, you get a reviewed and drilled plan that can handle cluster exceptions safely and effectively.

1. Event List


Events that need attention are generated while the cluster is running. The event format can follow the community CloudEvents standard: https://github.com/cloudevents/spec/blob/v1.0.1/spec.md

It can be described by the following data structure:

{
    "events": [
        {
            "_id": "a1ab5b61857be35a5c5b203dd84b49248161c823",
            "description": "restart workload manually",
            "event": {
                "id": "restart-workload",
                "source": "xxx",
                "specversion": "1.0",
                "type": "com.aliyun.trigger.manual",
                "datacontenttype": "application/json",
                "data": "{\"NAMESPACE\": \"\", \"NAME\": \"\", \"TYPE\": \"\"}"
            }
        }
    ]
}
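
CloudEvents v1.0 requires the `id`, `source`, `specversion`, and `type` attributes on every event. A small check along those lines might look like the sketch below, which also decodes the JSON-encoded `data` string used in the sample. The helper names are hypothetical, and this is not a full validator for the specification.

```python
import json

# Attributes that CloudEvents v1.0 marks as REQUIRED.
REQUIRED_ATTRIBUTES = ("id", "source", "specversion", "type")


def missing_attributes(event):
    """Return the required CloudEvents attributes that are absent or empty."""
    return [name for name in REQUIRED_ATTRIBUTES if not event.get(name)]


def event_parameters(event):
    """Decode the JSON-encoded data field into a parameter dictionary."""
    return json.loads(event.get("data", "{}"))


# The "event" object from the sample above.
RESTART_EVENT = {
    "id": "restart-workload",
    "source": "xxx",
    "specversion": "1.0",
    "type": "com.aliyun.trigger.manual",
    "datacontenttype": "application/json",
    "data": "{\"NAMESPACE\": \"\", \"NAME\": \"\", \"TYPE\": \"\"}",
}

if __name__ == "__main__":
    print(missing_attributes(RESTART_EVENT))
    print(event_parameters(RESTART_EVENT))
```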

2. Operation list


To reduce the chance of operator error and to avoid running operations that have not been reviewed and verified when an exception occurs, the list of operations that may be performed in the cluster needs to be defined in advance.

It can be described by the following data structure:

{
    "actions": [
        {
            "_id": "47abc5cd9d64018ebf96dc5b2d6a4fbd35a3cb6d",
            "name": "Action Restart Workload",
            "exec": "restart-workload",
            "env": [
                "NAMESPACE",
                "NAME",
                "TYPE"
            ]
        }
    ]
}

3. Plan list


Building on the event list and the operation list, events can be associated with operations so that exceptions are handled in an event-driven manner. This association is the plan.

It can be described by the following data structure:

{
    "plans": [
        {
            "_id": "29a091c48d8992991ed69e8694b017a11abe3eec",
            "name": "Plan Restart Workload",
            "description": "Restart workload",
            "event": "a1ab5b61857be35a5c5b203dd84b49248161c823",
            "actions": ["47abc5cd9d64018ebf96dc5b2d6a4fbd35a3cb6d"]
        }
    ]
}
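
Putting the three lists together, a controller could resolve an incoming event to the actions of its plan and bind the event parameters to each action's `env` names. The sketch below is one possible reading of that flow, not the original implementation: `resolve_plan` is a hypothetical helper, the parameter values in `PAYLOAD` are invented, and the sketch stops at producing `(exec, env)` pairs without executing anything.

```python
import json


def resolve_plan(event_id, payload, events, actions, plans):
    """Return (exec, env) pairs for every action of the plans bound to an event."""
    event = next(e for e in events if e["_id"] == event_id)
    defaults = json.loads(event["event"]["data"])
    params = {**defaults, **payload}  # payload overrides the empty defaults
    actions_by_id = {a["_id"]: a for a in actions}
    commands = []
    for plan in plans:
        if plan["event"] != event_id:
            continue
        for action_id in plan["actions"]:
            action = actions_by_id[action_id]
            missing = [k for k in action["env"] if not params.get(k)]
            if missing:
                raise ValueError(f"missing parameters: {missing}")
            env = {k: params[k] for k in action["env"]}
            commands.append((action["exec"], env))
    return commands


EVENT_ID = "a1ab5b61857be35a5c5b203dd84b49248161c823"
ACTION_ID = "47abc5cd9d64018ebf96dc5b2d6a4fbd35a3cb6d"

EVENTS = [{"_id": EVENT_ID, "event": {
    "id": "restart-workload",
    "data": "{\"NAMESPACE\": \"\", \"NAME\": \"\", \"TYPE\": \"\"}",
}}]
ACTIONS = [{"_id": ACTION_ID, "name": "Action Restart Workload",
            "exec": "restart-workload", "env": ["NAMESPACE", "NAME", "TYPE"]}]
PLANS = [{"name": "Plan Restart Workload", "event": EVENT_ID,
          "actions": [ACTION_ID]}]

# Hypothetical parameter values supplied when the event is triggered.
PAYLOAD = {"NAMESPACE": "kube-system", "NAME": "mutatehook", "TYPE": "deployment"}

if __name__ == "__main__":
    print(resolve_plan(EVENT_ID, PAYLOAD, EVENTS, ACTIONS, PLANS))
```

Refusing to run when a required parameter is empty keeps the plan from executing an operation against an unspecified target, which fits the goal of only running reviewed, well-defined operations.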

Globally visualized stability assurance


Based on the 4 diagrams and 3 tables above, insight + plan for cluster stability assurance can be formed, and from that a globally visualized stability assurance service can be derived.

Such a service has the following key points:

  • Global perspective
  • Digitization
  • Visualization

The service is based on two principles:

  • People process images far more efficiently than text
  • A global perspective makes it possible to “understand the system end-to-end”, “precisely locate problems”, and “handle problems safely”

Take the traffic map in daily life as an example:

4.png

With a traffic map, you can quickly understand the road layout and key junctions in an area, and the familiar red, yellow, and green colors intuitively express how congested each road is. Richer traffic maps also show important events such as road repairs and road closures.

In this way, based on visualization, you can quickly understand the traffic and geographic conditions of an area.

The underlying data model is the foundation, and visualization makes the value of the data easier to realize.

An implementation


5.png

1) Deployment model

  • Deployed per region
  • Serves a single cluster or multiple clusters within the region

2) User experience


Following best practices for stability assurance, the service is divided into the following sections:

  • Running link diagram

    • This section is the one used most frequently in day-to-day stability assurance. Visualization lets you intuitively perceive the occurrence, scope, and impact of anomalies, and handle them in a GUI-based ("white screen"), visual way
  • Deployment architecture diagram

    • This section is used to understand the deployment architecture of the cluster, and to detect and handle problems at the deployment level
    • Capacity management (including node management, capacity planning, etc.) is carried out in this section
  • Business flowchart

    • This section accumulates functional flowcharts of the business. It helps the business control functional complexity and understand the current state of its functions, which together support business iteration
    • Business-related data analysis can be placed in this section
  • Data analysis: this section serves two kinds of data needs

    • Business needs

      • View: SLI information such as cluster size, SLO information such as cluster stability
      • Query: statistics based on characteristics (for example, querying resource requests by label)
    • Stability assurance needs

      • View: SLI information such as cluster utilization (water level), SLO information such as the effect of cluster stability assurance
      • Query: statistics based on characteristics (for example, querying all associated resources or leaked resources by label)
  • Observability management

    • This section is used to manage observability-related matters, including:

      • Observation data generation
      • Observation data collection
      • Observation data processing
      • Observation data consumption
  • Controllability management

    • This section is used to manage control-related operations, including:

      • Release management
      • Disaster recovery management
      • Plan management
      • Resource management
      • Chaos engineering
      • Security management
      • Regular health checks

During normal system operation:

  • In the "Data analysis" section, confirm the coverage and accuracy of the cluster's "observability" and "controllability"
  • In the "Observability management" section, manage the observable dimensions, including adding and governing data sources, monitoring, and alarms
  • In the "Controllability management" section:

    • Based on problems found in the observation data, configure plans, manage issues, etc.
    • Based on problems found in chaos engineering or drills, configure plans, etc.
  • In the "Running link diagram" and "Deployment architecture diagram" sections, visually associate the configured monitoring, alarms, and plans with components or links

During system anomalies and recovery, in the "Running link diagram" section:

  • Detect the occurrence of anomalies through the cluster running link diagram or alarms
  • Start issue tracking, automatically or manually
  • Identify abnormal components, abnormal links, and severity from the colors of components and interactions in the cluster running link diagram
  • Click the exception count of a component in the cluster running link diagram to get the associated exception details, or jump to the logging or tracing system for manual queries
  • Based on the exception details or platform prompts, determine the plan to execute and the associated components
  • Execute the plan in the cluster running link diagram (to contain the problem or restore the service)
  • Confirm the effect of the plan from the colors of components and interactions in the cluster running link diagram
  • End issue tracking, automatically or manually
The main items recorded during issue tracking are:

  • The issue
  • The time the exception occurred
  • The actions performed during exception handling
  • Snapshots of the running link diagram
  • The time the exception was resolved
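
The recorded items above map naturally onto a small record type. The following is a sketch under assumptions: the class and field names are hypothetical, snapshots are stored as opaque strings, and one snapshot is taken per action.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class IssueRecord:
    """One tracked issue, from the first exception to recovery."""
    issue: str
    occurred_at: datetime
    actions: list = field(default_factory=list)
    snapshots: list = field(default_factory=list)
    recovered_at: Optional[datetime] = None

    def record_action(self, name, snapshot):
        """Store an executed action with a link-diagram snapshot reference."""
        self.actions.append(name)
        self.snapshots.append(snapshot)

    def close(self, recovered_at):
        """Mark the issue as recovered at the given time."""
        self.recovered_at = recovered_at

    def time_to_recover(self):
        """Return the recovery duration, or None while the issue is open."""
        if self.recovered_at is None:
            return None
        return self.recovered_at - self.occurred_at


if __name__ == "__main__":
    # Hypothetical example values.
    record = IssueRecord("mutatehook errors", datetime(2021, 1, 1, 10, 0))
    record.record_action("Plan Restart Workload", "snapshot-1")
    record.close(datetime(2021, 1, 1, 10, 15))
    print(record.time_to_recover() == timedelta(minutes=15))
```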

Data model and competitiveness analysis


The data model is a medium for iterating on, sharing, and applying best practices for stability assurance. General insights and plans can become standardized services. Customized insights and plans can be described through a fixed structure and then put into practice by a general-purpose controller.

Insight + plan is built on the data model. The technical core is:

  • Insight model

    • Key questions:

      • How to gain insight into cluster stability?
      • How to gain insight into the efficiency of business iteration?
  • Data model

    • Key question:

      • How to define an effective and extensible data description?

On top of this technical core, we can iterate around the following competitive strengths:

  • Insight

    • Global
    • Digitized
    • Visualized
  • Effectiveness

    • Shortest operation path
    • Lowest cost of use
  • Advancement

    • Best practices turned into processes

Summary


With specs for the 7 data models (4 diagrams and 3 tables), we can characterize insight + plan through structured descriptions. With this as the core, we will continue to iterate on the practice and understanding of stability assurance and accelerate business iteration. Going one step further, this model could also feed back into the direction of business development.

If you are interested, feel free to discuss in the comments.
