For cloud services, a system failure can cause significant losses. To minimize those losses, we have to keep probing the conditions under which the system fails, ideally narrowing them down to whether a specific parameter change will break it. With the rise of cloud native, however, microservices are being decoupled ever further, and massive data volumes and user scale have pushed the infrastructure toward large-scale distribution, so failures in the system have become increasingly unpredictable. We need to continuously run experiments against the system to proactively uncover its defects. This practice is called Chaos Engineering. After all, practice is the only criterion for testing truth: chaos engineering helps us grasp more thoroughly how the system actually behaves and improves its resilience.

Litmus is an open source cloud-native chaos engineering toolset focused on Kubernetes clusters. It runs simulated failure tests to help developers and SREs find defects in clusters and applications, thereby improving the robustness of the system.

Litmus architecture

The architecture of Litmus is shown in the figure:

The components of Litmus can be divided into two parts:

  1. Portal
  2. Agents

Portal is a set of Litmus components that acts as the control plane (Web UI) for cross-cloud management of chaos experiments; it is used to coordinate and observe the chaos experiment workflows on the Agents.

Agent is also a set of Litmus components and covers the chaos experiment workflows that run in a Kubernetes cluster.

Using the Portal, users can create and schedule new chaos experiment workflows on an Agent and observe the results in the Portal. Users can also connect more clusters to the Portal and use it as a single entry point for cross-cloud chaos engineering management.

Portal components

  • Litmus WebUI

    Litmus WebUI provides the web user interface where users can easily build and observe chaos experiment workflows. It also serves as the control plane for cross-cloud chaos experiments.

  • Litmus Server

    As middleware, Litmus Server processes API requests from the user interface and stores configuration and result details in the database. It also serves as the communication bridge for those requests and schedules workflows to the Agents.

  • Litmus DB

    Litmus DB stores the chaos experiment workflows and the details of their test results.

Agent components

  • Chaos Operator

    Chaos Operator watches ChaosEngine resources and executes the chaos experiments referenced in the CR. Chaos Operator is namespace-scoped and runs in the litmus namespace by default. After an experiment completes, Chaos Operator calls chaos-exporter to export the experiment's metrics to the Prometheus database.

  • CRDs

    The following CRDs are created during the Litmus installation (a quick way to verify them with kubectl is shown after this component list):

    chaosexperiments.litmuschaos.io
    chaosengines.litmuschaos.io
    chaosresults.litmuschaos.io
  • Chaos Experiment

    Chaos Experiment is the basic unit in the LitmusChaos architecture. Users can pull experiments from Chaos Hub or create their own in order to build the chaos experiment workflow they need. Simply put, a Chaos Experiment is a CRD resource that defines which operations a test supports, which parameters can be passed in, and which kinds of objects can be targeted. Experiments usually fall into three categories: general tests (memory, disk, CPU, and so on), application tests (for example, tests against Nginx), and platform tests (tests for a specific cloud platform such as AWS, Azure, or GCP). For details, please refer to the Chaos Hub documentation.

  • Chaos Engine

    ChaosEngine binds the behavior defined by a Chaos Experiment to an application in a namespace. The CR is watched by Chaos Operator (a concrete example appears in the Workflow section later in this article).

  • Chaos Results

    ChaosResult stores the results of a chaos experiment. It is created or updated while the experiment runs and contains information such as the ChaosEngine configuration and the state of the experiment. chaos-exporter reads the results and exports them to the Prometheus database.

  • Chaos Probes

    Chaos Probes are pluggable checks that can be defined in the ChaosEngine of any chaos experiment. The experiment Pod performs the corresponding checks according to the mode defined for each probe, and their success is a necessary condition for declaring the experiment successful (together with the standard built-in checks).

  • Chaos Exporter

    Exporting metrics to the Prometheus database is optional; Chaos Exporter exposes a Prometheus metrics endpoint for this purpose.

  • Subscriber

    Subscriber runs on the Agent and interacts with Litmus Server to exchange the detailed results of the chaos experiment workflows.
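
To quickly confirm that the CRDs listed above were installed on the Agent cluster, you can query them with kubectl. This is only a simple sanity check; the expected names are the three CRDs given in the CRDs item:

$ kubectl get crd | grep litmuschaos.io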

Prepare KubeSphere application template

KubeSphere integrates OpenPitrix to provide application lifecycle management. OpenPitrix is a multi-cloud application management platform; KubeSphere uses it to implement the app store and app templates so that applications can be deployed and managed visually. For applications that are not in the app store, users can upload Helm charts to KubeSphere's public repository or import a private app repository to provide app templates.

This tutorial will use KubeSphere's application template to deploy Litmus.

To deploy an application from an app template, you need to create a workspace, a project, and two user accounts ( ws-admin and project-regular ). ws-admin must be granted the workspace-admin role in the workspace, and project-regular must be granted the operator role in the project. Before creating them, let's review KubeSphere's multi-tenant architecture.

Multi-tenant architecture

KubeSphere's multi-tenant system has three levels: clusters, workspaces, and projects. A project in KubeSphere is equivalent to a Kubernetes namespace.

You need to create a new workspace to work in rather than using the system workspace, which runs system resources that are mostly view-only. For security reasons, it is strongly recommended to grant different tenants different permissions so they can collaborate within the workspace.

You can create multiple workspaces in a KubeSphere cluster, and multiple projects can be created in each workspace. KubeSphere provides several built-in roles for each level by default, and you can also create roles with custom permissions. The KubeSphere multi-level structure suits enterprise users with different teams or organizations that require different roles within each team.

Create accounts

After installing KubeSphere, you need to add users with different roles to the platform so that they can work at different levels on the resources they are authorized for. Initially, the system has only one default account, admin , with the platform-admin role. In this step, you will create an account user-manager and then use user-manager to create additional accounts.

  1. Sign in to the web console as admin with the default account and password ( admin/P@88w0rd ).

    For security reasons, it is strongly recommended that you change your password the first time you log in: open the drop-down menu in the upper-right corner, go to Personal Settings, and set a new password under Password Settings. You can also change the console language in Personal Settings.
  2. After logging in to the console, click Platform in the upper-left corner, and then select Access Control.

    Under Account Roles, there are four built-in roles, as shown below. The first account you create next will be assigned the users-manager role.

    Built-in role        Description
    workspaces-manager   Workspace administrator; manages all workspaces on the platform.
    users-manager        User administrator; manages all users on the platform.
    platform-regular     Regular platform user; has no permission to operate on resources until invited to a workspace or cluster.
    platform-admin       Platform administrator; can manage all resources on the platform.
  3. In Account Management, click Create. In the pop-up window, provide all the required information (marked with *), and then select users-manager in the Role field. Please refer to the example below.

    When finished, click OK. The newly created account will be displayed in the account list under Account Management.

  4. Log out and log back in as user-manager, then create the following three new accounts.

    Account           Role                 Description
    ws-manager        workspaces-manager   Creates and manages all workspaces.
    ws-admin          platform-regular     Manages all resources in a specified workspace (this account is used to invite project-regular to join the workspace).
    project-regular   platform-regular     Used to create workloads, pipelines, and other resources in the specified project.
  5. Verify that the three accounts have been created.

Create a workspace

In this step, you will create a workspace using the ws-manager account. As the basic logical unit for managing projects, creating workloads, and organizing members, the workspace is the foundation of KubeSphere's multi-tenant system.

  1. Log in as ws-manager, which has the authority to manage all workspaces on the platform. Click Platform in the upper-left corner and select Access Control. Under Workspaces, you can see that only the default workspace system-workspace is listed; this is the system workspace, which runs system-related components and services and cannot be deleted.

  2. Click Create on the right, name the new workspace demo-workspace , and set the user ws-admin as the workspace administrator, as shown in the following figure:

    When finished, click Create.

  3. Log out of the console and log back in as ws-admin . In Workspace Settings, select Workspace Members, and then click Invite Member.

  4. Invite project-regular to the workspace and grant it the workspace-viewer role.

    The actual role name follows the format <workspace name>-<role name> . For example, in the workspace demo-workspace , the actual name of the viewer role is demo-workspace-viewer .

  5. After adding project-regular to the workspace, click OK. Under Workspace Members, you can see the two members listed.

    Account           Role               Description
    ws-admin          workspace-admin    Manages all resources in the specified workspace (in this example, this account is used to invite new members into the workspace and to create projects).
    project-regular   workspace-viewer   Used to create workloads and other resources in the specified project.

Create a project

In this step, you will use the ws-admin account created in the previous step to create a project. A project in KubeSphere is the same as a namespace in Kubernetes and provides virtual isolation for resources. For more information, see Namespaces.

  1. Log in as ws-admin . In Project Management, click Create.

  2. Enter the project name (for example, litmus ), and then click OK to finish. You can also add an alias and a description for the project.

  3. In Project Management, click the project you just created to view its details.

  4. Invite project-regular to the project and grant the user the operator role. Please refer to the figure below for the specific steps.

    Users with the operator role are project maintainers and can manage all resources in the project except users and roles.

Add application repository

  1. Log in to the KubeSphere web console as ws-admin . In your workspace, go to App Repositories under App Management, and click Add Repository.

  2. In the pop-up dialog box, set the repository name to litmus and the repository URL to https://litmuschaos.github.io/litmus-helm/ , click Validate to verify the URL, and then click OK to go to the next step. (A command-line equivalent with Helm is shown after this list.)

  3. After the app repository is imported successfully, it will be displayed in the list as shown in the figure below.
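
If you prefer the command line, the same chart repository can also be added with the Helm CLI. This is just an equivalent sketch for reference and is not required for the KubeSphere-based flow in this tutorial:

$ helm repo add litmus https://litmuschaos.github.io/litmus-helm/
$ helm repo update
$ helm search repo litmus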

Deploy the Litmus control plane

After importing the Litmus application repository, you can deploy Litmus through the application template.

  1. Log out of KubeSphere and log back in as the project-regular user. In your project, go to Applications under Application Workloads, and then click Deploy New Application.

  2. In the pop-up dialog box, select From App Template.

    From App Store: choose from built-in applications and applications uploaded individually as Helm charts.

    From App Template: choose applications from private app repositories and the workspace app pool.

  3. Select the previously added private app repository litmus from the drop-down list.

  4. Select litmus-2-0-0-beta to deploy.

  5. You can view the application information and configuration files, choose a version from the version drop-down list, and then click Deploy.

  6. Set the application name, confirm the application version and deployment location, and click Next.

  7. On the application configuration page, you can manually edit the manifest file or click Deploy directly.

  8. Wait for Litmus to be created and start running.

Access the Portal service

The Portal's service is named litmusportal-frontend-service . You can first check its NodePort on the Services page:
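
Alternatively, assuming Litmus was deployed into the litmus project (namespace) created earlier, you can read the NodePort with kubectl:

$ kubectl get svc litmusportal-frontend-service -n litmus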

Use ${Node IP}:${NodePort} to access the Portal:

The default username and password:

Username: admin
Password: litmus

Deploy Agent (optional)

Litmus contains two types of agents:

  • Self Agent
  • External Agent

By default, the cluster where Litmus is installed is automatically registered as the Self Agent, and the Portal runs chaos experiments on the Self Agent by default.

As mentioned earlier, the Portal is a cross-cloud control plane for chaos experiments. In other words, users can connect multiple External Agents deployed in external Kubernetes clusters to the current Portal, so that chaos experiments can be dispatched to those Agents and the results observed in the Portal.

For how to deploy an External Agent, please refer to the official Litmus documentation.

Create Chaos Experiment

After the Portal is installed, you can create chaos experiments through the Portal interface. First, create an application to test against:

$ kubectl create deployment nginx --image=nginx --replicas=2 --namespace=default
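
Optionally, confirm that the two Nginx Pods are running before scheduling any experiments (kubectl create deployment applies the app=nginx label automatically):

$ kubectl get pods -n default -l app=nginx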

Let’s start creating an experiment.

  1. Log in to the Portal.

  2. Enter the Workflows page and click [Schedule a workflow]

  3. Select Agent, such as Self-Agent:

  4. Choose to add chaos experiment from Chaos Hub:

  5. Set the name of Workflow:

  6. Click [Add a new experiment] to add chaos experiment to Workflow:

  7. Select the pod-delete experiment:

  8. Start scheduling immediately:

  9. In KubeSphere, you can see that the Pod has been deleted and rebuilt:

  10. You can also see that the experiment was successful in the Portal interface:

    Click a specific Workflow node to see its detailed log:

  11. Repeat the above steps to create a pod-cpu-hog chaos experiment:

    In KubeSphere, you can see that the Pod's CPU usage is close to 1 core:

  12. The next experiment simulates Pod network packet loss. Before starting it, scale the Nginx Deployment down to 1 replica (the corresponding commands are shown after this list):

    Now there is only one Pod, with IP 10.233.71.170 :

    Now repeat the above steps to create a pod-network-loss chaos experiment, and set the packet loss rate to 50% :

    Back in the KubeSphere interface, open the toolbox and select Kubectl from the pop-up menu.

    Test the packet loss rate by pinging the Pod's IP; you can see that the loss rate is close to 50% , so the experiment succeeded:
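
The shell steps behind item 12 look roughly like the following, assuming the Nginx Deployment created earlier in the default namespace and the Pod IP shown above; run the ping from the kubectl terminal opened through the KubeSphere toolbox:

$ kubectl scale deployment nginx --replicas=1 -n default
$ kubectl get pods -n default -o wide    # note the Pod IP, for example 10.233.71.170
$ ping 10.233.71.170                     # packet loss should approach 50% while pod-network-loss is running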

All of the experiments above target Pods. Besides Pods, you can also run experiments against Nodes, Kubernetes components, and various services. Interested readers can try these themselves.

Workflow in detail

A Workflow is simply a workflow of chaos experiments. Although each Workflow in the previous section's demonstration contained only one experiment, a Workflow can in fact contain multiple experiments and execute them in order.

Workflows are implemented as CRDs. You can view them in the KubeSphere console; here you can see all the Workflows created earlier:
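
If you prefer kubectl over the console, the same Workflows can be listed directly. This assumes the Self Agent components run in the litmus namespace and that, as in Litmus 2.x, the Portal schedules them as Argo Workflow resources:

$ kubectl get workflows.argoproj.io -n litmus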

Take pod-network-loss as an example to see which parameters are available:

Each experiment in a Workflow is also defined by a CR, whose kind is ChaosEngine .

The meaning of each of the main fields and environment variables is explained below:

  • appns : The namespace of the object to be executed.
  • experiments : The name of the test to be performed (such as network delay test, Pod deletion test, etc.), you can use kubectl get chaosexperiments -n test to view the supported experiments.
  • chaosServiceAccount : the sa to be used.
  • jobCleanUpPolicy : Whether to retain the job that executes this test, the field can be delete/retain.
  • annotationCheck : Whether to perform annotation check, if not, all Pods will be tested, the field can be true/false.
  • engineState : The state of this test can be set to active/stop.
  • TOTAL_CHAOS_DURATION : Chaos test duration, the default is 15s.
  • CHAOS_INTERVAL : Chaos test time interval, the default is 5s.
  • FORCE : Whether to use the --force option to delete a pod.
  • TARGET_CONTAINER : Delete a container in the Pod (the first one is deleted by default).
  • PODS_AFFECTED_PERC : The percentage of total Pods to target; the default is 0 (equivalent to 1 replica).
  • RAMP_TIME : The time to wait before and after the chaos test.
  • SEQUENCE : Test execution strategy, the default is parallel (parallel) execution, can be set to serial/parallel.
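
To show how these fields and variables fit together, here is a minimal ChaosEngine sketch for the pod-delete experiment. It only illustrates the structure; names such as nginx-chaos and pod-delete-sa are placeholders, and in the tutorial above the Portal generates this CR for you as part of the Workflow:

$ kubectl apply -f - <<EOF
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos                  # placeholder name
  namespace: default
spec:
  appinfo:
    appns: default                   # namespace of the application under test
    applabel: app=nginx              # label selector of the target workload
    appkind: deployment
  engineState: active                # active / stop
  annotationCheck: "false"           # target all matching Pods without annotation filtering
  chaosServiceAccount: pod-delete-sa # placeholder ServiceAccount with the required RBAC
  jobCleanUpPolicy: delete           # delete / retain the experiment job
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "15"            # seconds
            - name: CHAOS_INTERVAL
              value: "5"             # seconds between deletions
            - name: FORCE
              value: "false"         # graceful Pod deletion
EOF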

The detailed parameters of the other experiments will not be repeated here; interested readers can consult the relevant documentation.

Summary

This article introduced the architecture of the Litmus chaos engineering framework and how to deploy it on KubeSphere, and verified the ability of the infrastructure and services to withstand failures through a series of chaos experiments. Litmus is an excellent chaos engineering framework with strong community support, and more and more experiments will be built into its experiment store (i.e. Chaos Hub). You can deploy these chaos experiments to the cluster with one click and use the visual interface to display the results intuitively, verifying the resilience of the cluster. With Litmus, we can not only face failures directly but also proactively create failures to find system defects and avoid black swan events.
