
K8S Internals Series: Issue 1

Now that Kubernetes has won the container orchestration battle, K8S has become the operating system of the cloud-native era. K8S makes everything simpler, yet K8S itself keeps growing more complex. The [K8S Internals Series] column covers many aspects of the K8S ecosystem: the Boyun Container Cloud R&D team will regularly share hot topics such as scheduling, security, networking, performance, storage, and application scenarios. We hope that, while enjoying the efficiency and convenience K8S brings, you can also come to know its inner workings as intimately as the cook in the old fable knew the ox he carved.

1. Introduction to Pod Security Policy

Because Pod Security Admission is meant to replace Pod Security Policy, we first introduce Pod Security Policy. A Pod Security Policy defines a set of conditions that Pods must satisfy in order to run, along with default values for the related fields. A Pod must meet these conditions to be created successfully. The spec of a PodSecurityPolicy object contains the following fields, which are the aspects a Pod Security Policy can control:

Control aspect                                      Field names
Running privileged containers                       privileged
Using host namespaces                               hostPID, hostIPC
Using host networking and ports                     hostNetwork, hostPorts
Controlling which volume types may be used          volumes
Using the host filesystem                           allowedHostPaths
Allowing specific FlexVolume drivers                allowedFlexVolumes
Allocating an FSGroup that owns the Pod's volumes   fsGroup
Requiring a read-only root filesystem               readOnlyRootFilesystem
Setting the container's user and group IDs          runAsUser, runAsGroup, supplementalGroups
Restricting escalation to root privileges           allowPrivilegeEscalation, defaultAllowPrivilegeEscalation
Linux capabilities                                  defaultAddCapabilities, requiredDropCapabilities, allowedCapabilities
Setting the container's SELinux context             seLinux
Allowed proc mount types for the container          allowedProcMountTypes
AppArmor profile used by the container              annotations
seccomp profile used by the container               annotations
sysctl profile used by the container                forbiddenSysctls, allowedUnsafeSysctls

Of these, AppArmor and seccomp are configured by adding annotations to the PodSecurityPolicy object:

seccomp.security.alpha.kubernetes.io/allowedProfileNames: 'docker/default'
seccomp.security.alpha.kubernetes.io/defaultProfileName: 'docker/default'
apparmor.security.beta.kubernetes.io/allowedProfileNames: 'runtime/default'
apparmor.security.beta.kubernetes.io/defaultProfileName: 'runtime/default'

Pod Security Policy is a cluster-level resource. Let's take a look at its usage process:

[Figure: PSP usage process]

Using PSPs requires creating a ClusterRole/Role and a ClusterRoleBinding/RoleBinding that bind them to service accounts, so it is hard to see which PSPs are in effect, and even harder to see which security rules are blocking Pod creation.

2. Why does Pod Security Admission appear?

If you have used PodSecurityPolicy, you have probably run into its problems: there is no dry-run or audit mode, it is inconvenient to switch on and off, and its behavior is hard to reason about. Because of these shortcomings, PodSecurityPolicy was marked as deprecated in Kubernetes v1.21 and is scheduled for removal in v1.25. Its replacement, Pod Security Admission, was added in Kubernetes v1.22.

3. Introduction to Pod Security Admission

Pod Security Admission is an admission controller built into Kubernetes. Its feature gate is enabled by default as of Kubernetes v1.23. In v1.22 it must be enabled with the kube-apiserver flag --feature-gates="...,PodSecurity=true". On Kubernetes versions older than v1.22, you can install the Pod Security Admission webhook yourself.

Pod security admission restricts the creation of pods in a cluster by enforcing built-in Pod Security Standards.

3.1 Pod Security Standards

To cover a wide range of security scenarios, the Pod Security Standards define three policies that range from highly permissive to highly restrictive:

Profile      Description
Privileged   Unrestricted policy that provides the widest possible range of permissions. This policy allows known privilege escalations.
Baseline     Minimally restrictive policy that prevents known privilege escalations. It allows the default (minimally specified) Pod configuration.
Restricted   Heavily restricted policy that follows current best practices for hardening Pods.

See Pod Security Standards for details.

3.2 Pod Security Standards Implementation Method

Once the Pod Security Admission feature gate is enabled in the Kubernetes cluster, the Pod Security Standards are applied by setting labels on a namespace. Three modes are available:

Mode      Description
enforce   Pods that violate the policy are rejected.
audit     Policy violations add an audit annotation to the event recorded in the audit log, but the request is otherwise allowed.
warn      Policy violations trigger a user-facing warning, but the request is otherwise allowed.
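
To make the difference between the three modes concrete, here is a minimal sketch of how violations found under each mode could be folded into a single admission decision. The `Response` type, the `decide` function, the message wording, and the annotation key are all hypothetical simplifications for illustration, not the controller's actual types.

```go
package main

import (
	"fmt"
	"strings"
)

// Response is a simplified, hypothetical stand-in for an admission response.
type Response struct {
	Allowed          bool
	Message          string
	Warnings         []string
	AuditAnnotations map[string]string
}

// decide folds the violations found under each mode into one response.
// Only enforce violations reject the request; audit and warn are advisory.
func decide(enforce, audit, warn []string) Response {
	resp := Response{Allowed: true, AuditAnnotations: map[string]string{}}
	if len(enforce) > 0 {
		resp.Allowed = false
		resp.Message = "violates PodSecurity: " + strings.Join(enforce, ", ")
	}
	if len(audit) > 0 {
		// Illustrative annotation key, attached to the audit log entry.
		resp.AuditAnnotations["audit-violations"] = strings.Join(audit, ", ")
	}
	if len(warn) > 0 {
		resp.Warnings = append(resp.Warnings,
			"would violate PodSecurity: "+strings.Join(warn, ", "))
	}
	return resp
}

func main() {
	// A Pod that passes the enforce level but fails the audit and warn levels.
	v := []string{"runAsNonRoot != true"}
	resp := decide(nil, v, v)
	fmt.Println(resp.Allowed, resp.Warnings, resp.AuditAnnotations)
}
```

In this sketch a Pod with only audit or warn violations is still admitted, which is what makes those two modes useful for trying out a stricter level before enforcing it.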

The label template is as follows:

# Set the mode and the security standard level.
# MODE must be one of `enforce`, `audit`, or `warn`.
# LEVEL must be one of `privileged`, `baseline`, or `restricted`.
pod-security.kubernetes.io/<MODE>: <LEVEL>

# Optional: pins which version of the security standards is used.
# MODE must be one of `enforce`, `audit`, or `warn`.
# VERSION must be a valid Kubernetes minor version (for example v1.23), or `latest`.
pod-security.kubernetes.io/<MODE>-version: <VERSION>

A namespace can set any single mode, or set several modes with a different security standard level for each.
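
As an illustration of the label scheme, the following sketch parses a namespace's labels into a per-mode level and version. The `Policy` type and `parsePolicy` function are hypothetical; falling back to `privileged` and `latest` for unlabeled modes is an assumption that matches the defaults in the sample configuration in this section, and the version check is deliberately loose.

```go
package main

import (
	"fmt"
	"strings"
)

// labelPrefix is the label prefix used by Pod Security Admission.
const labelPrefix = "pod-security.kubernetes.io/"

// Policy holds the level and pinned version for each mode.
// It is a simplified, hypothetical stand-in for the real policy type.
type Policy struct {
	Level   map[string]string // mode -> level
	Version map[string]string // mode -> version
}

var (
	modes       = []string{"enforce", "audit", "warn"}
	validLevels = map[string]bool{"privileged": true, "baseline": true, "restricted": true}
)

// parsePolicy reads the pod-security labels of a namespace. Modes without a
// label fall back to "privileged" and "latest" (an assumed default).
func parsePolicy(labels map[string]string) (Policy, error) {
	p := Policy{Level: map[string]string{}, Version: map[string]string{}}
	for _, mode := range modes {
		p.Level[mode] = "privileged"
		p.Version[mode] = "latest"
		if lv, ok := labels[labelPrefix+mode]; ok {
			if !validLevels[lv] {
				return p, fmt.Errorf("invalid level %q for mode %q", lv, mode)
			}
			p.Level[mode] = lv
		}
		if v, ok := labels[labelPrefix+mode+"-version"]; ok {
			// Loose check for the sketch: "latest" or something like "v1.23".
			if v != "latest" && !strings.HasPrefix(v, "v1.") {
				return p, fmt.Errorf("invalid version %q for mode %q", v, mode)
			}
			p.Version[mode] = v
		}
	}
	return p, nil
}

func main() {
	p, err := parsePolicy(map[string]string{
		labelPrefix + "enforce":      "baseline",
		labelPrefix + "warn":         "restricted",
		labelPrefix + "warn-version": "v1.23",
	})
	fmt.Println(p.Level, p.Version, err)
}
```

Here the namespace enforces `baseline` while only warning about `restricted` violations, a combination that is handy when tightening a namespace gradually.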

The default configuration for pod security admission can be set via the admission controller configuration file:

 apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: PodSecurity
  configuration:
    apiVersion: pod-security.admission.config.k8s.io/v1beta1
    kind: PodSecurityConfiguration
    # Defaults applied when a mode label is not set.
    #
    # Level label values must be one of:
    # - "privileged" (default)
    # - "baseline"
    # - "restricted"
    #
    # Version label values must be one of:
    # - "latest" (default) 
    # - specific version like "v1.23"
    defaults:
      enforce: "privileged"
      enforce-version: "latest"
      audit: "privileged"
      audit-version: "latest"
      warn: "privileged"
      warn-version: "latest"
    exemptions:
      # Array of authenticated usernames to exempt.
      usernames: []
      # Array of runtime class names to exempt.
      runtimeClassNames: []
      # Array of namespaces to exempt.
      namespaces: []

Pod security admission can exempt Pods from security standard checks along three dimensions: username, runtimeClassName, and namespace.
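
The exemption lists can be thought of as three simple membership tests. The sketch below shows one way to express that; the `Exemptions` and `Request` types and the `exemptReason` function are hypothetical, and the order in which the dimensions are checked is an assumption.

```go
package main

import "fmt"

// Exemptions mirrors the three exemption lists in the configuration file.
type Exemptions struct {
	Usernames         []string
	RuntimeClassNames []string
	Namespaces        []string
}

// Request carries only the fields this sketch needs (hypothetical type).
type Request struct {
	Username         string
	RuntimeClassName string
	Namespace        string
}

// contains reports whether s appears in list.
func contains(list []string, s string) bool {
	for _, item := range list {
		if item == s {
			return true
		}
	}
	return false
}

// exemptReason returns which dimension exempts the request,
// or an empty string when the request must be checked.
func exemptReason(e Exemptions, r Request) string {
	switch {
	case contains(e.Namespaces, r.Namespace):
		return "namespace"
	case contains(e.Usernames, r.Username):
		return "user"
	case r.RuntimeClassName != "" && contains(e.RuntimeClassNames, r.RuntimeClassName):
		return "runtimeClass"
	}
	return ""
}

func main() {
	e := Exemptions{Namespaces: []string{"kube-system"}}
	fmt.Println(exemptReason(e, Request{Username: "alice", Namespace: "kube-system"}))
	fmt.Println(exemptReason(e, Request{Username: "alice", Namespace: "dev"}) == "")
}
```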

3.3 Pod Security Standards Implementation Demonstration

  • Environment: kubernetes v1.23

Containers at runtime are exposed to many attack risks, such as container escape and resource exhaustion attacks from containers.

3.3.1 Baseline Policy

The Baseline policy targets common containerized applications and prevents known privilege escalations. According to the official documentation, it is aimed at application operators and developers of non-critical applications. The policy covers:

Prohibiting shared host namespaces and privileged containers, restricting Linux capabilities, prohibiting hostPath volumes, restricting host ports, and restricting AppArmor, SELinux, seccomp, and sysctl settings, among others.

The following demonstrates setting the Baseline policy.

Risks of violating the Baseline policy:

  • A privileged container can see the host's devices
  • With procfs mounted, a container can see the host's processes, which breaks process isolation
  • Network isolation can be broken
  • With the runtime socket mounted, a container can talk to the container runtime without restriction

Any of these risks can lead to container escape.

  1. Create a namespace named my-baseline-namespace, and set both the enforce and warn modes to correspond to the Baseline-level Pod security standard policy:
 apiVersion: v1
kind: Namespace
metadata:
  name: my-baseline-namespace
  labels:
    pod-security.kubernetes.io/enforce: baseline  
    pod-security.kubernetes.io/enforce-version: v1.23
    pod-security.kubernetes.io/warn: baseline
    pod-security.kubernetes.io/warn-version: v1.23
  2. Create a Pod

    • Create a pod that violates the baseline policy
     apiVersion: v1
    kind: Pod
    metadata:
      name: hostnamespaces2
      namespace: my-baseline-namespace
    spec:
      containers:
      - image: bitnami/prometheus:2.33.5
        name: prometheus
        securityContext:
          allowPrivilegeEscalation: true
          privileged: true
          capabilities:
            drop:
            - ALL
      hostPID: true
      securityContext:
        runAsNonRoot: true
        seccompProfile:
          type: RuntimeDefault
    • Run the apply command. The output shows that hostPID=true and securityContext.privileged=true are not allowed, so Pod creation is rejected. A privileged container with hostPID enabled is not isolated from the host's processes, which makes container escape easy:
     [root@localhost podSecurityStandard]# kubectl apply -f fail-hostnamespaces2.yaml
    Error from server (Forbidden): error when creating "fail-hostnamespaces2.yaml": pods "hostnamespaces2" is forbidden: violates PodSecurity "baseline:v1.23": host namespaces (hostPID=true), privileged (container "prometheus" must not set securityContext.privileged=true)
    • Create a pod that does not violate the baseline policy, set the Pod's hostPID=false, securityContext.privileged=false
     apiVersion: v1
    kind: Pod
    metadata:
      name: hostnamespaces2
      namespace: my-baseline-namespace
    spec:
      containers:
      - image: bitnami/prometheus:2.33.5
        name: prometheus
        securityContext:
          allowPrivilegeEscalation: false
          privileged: false
          capabilities:
            drop:
            - ALL
      hostPID: false
      securityContext:
        runAsNonRoot: true
        seccompProfile:
          type: RuntimeDefault
    • Execute the apply command and the pod is allowed to be created:
     [root@localhost podSecurityStandard]# kubectl apply -f pass-hostnamespaces2.yaml
    pod/hostnamespaces2 created
3.3.2 Restricted Policy

The Restricted policy implements current best practices for hardening Pods. According to the official documentation, it is aimed at operators and developers of security-critical applications, as well as lower-trust users. It includes everything in the Baseline policy and adds the following: it limits non-core volume types to those defined through PersistentVolumes, prohibits privilege escalation (for example via set-user-ID or set-group-ID file modes), requires containers to run as non-root users (runAsUser must not be set to 0), and requires containers to drop ALL capabilities, with only NET_BIND_SERVICE allowed to be added back.

The Restricted policy further limits root privileges and Linux kernel capabilities inside the container. For example, a man-in-the-middle attack on the Kubernetes network needs the Linux CAP_NET_RAW capability to send ARP packets.
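
The capability rule described above (drop ALL, add back at most NET_BIND_SERVICE) can be sketched as follows. The `Capabilities` type, the function name, and the message wording are hypothetical and only illustrate the rule; this is not the controller's own check.

```go
package main

import "fmt"

// Capabilities mirrors the add/drop lists of a container security context.
type Capabilities struct {
	Add  []string
	Drop []string
}

// restrictedCapabilityViolations reports why a container's capability
// settings would fail the Restricted rule: ALL must be dropped, and only
// NET_BIND_SERVICE may be added back.
func restrictedCapabilityViolations(caps *Capabilities) []string {
	var violations []string
	dropsAll := false
	if caps != nil {
		for _, c := range caps.Drop {
			if c == "ALL" {
				dropsAll = true
			}
		}
	}
	if !dropsAll {
		violations = append(violations, `must set capabilities.drop=["ALL"]`)
	}
	if caps != nil {
		for _, c := range caps.Add {
			if c != "NET_BIND_SERVICE" {
				violations = append(violations, fmt.Sprintf("must not add capability %q", c))
			}
		}
	}
	return violations
}

func main() {
	// No capabilities configured at all: ALL was never dropped.
	fmt.Println(restrictedCapabilityViolations(nil))
	// ALL dropped, but NET_RAW added back: the ARP spoofing risk returns.
	fmt.Println(restrictedCapabilityViolations(&Capabilities{
		Drop: []string{"ALL"},
		Add:  []string{"NET_RAW"},
	}))
}
```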

  1. Create a namespace named my-restricted-namespace, and set both the enforce and warn modes to correspond to the Restricted-level Pod security standard policy:

    apiVersion: v1
    kind: Namespace
    metadata:
      name: my-restricted-namespace
      labels:
        pod-security.kubernetes.io/enforce: restricted
        pod-security.kubernetes.io/enforce-version: v1.23
        pod-security.kubernetes.io/warn: restricted
        pod-security.kubernetes.io/warn-version: v1.23
  2. Create a Pod

    • Create a pod that violates the Restricted policy
     apiVersion: v1
    kind: Pod
    metadata:
      name: runasnonroot0
      namespace: my-restricted-namespace
    spec:
      containers:
      - image: bitnami/prometheus:2.33.5
        name: prometheus
        securityContext:
          allowPrivilegeEscalation: false
      securityContext:
        seccompProfile:
          type: RuntimeDefault
    • Run the apply command. The output shows that securityContext.runAsNonRoot=true and securityContext.capabilities.drop=["ALL"] must be set, so Pod creation is rejected. A container running as root has excessive privileges, and without dropping Linux capabilities it is exposed to risks such as a man-in-the-middle attack on the Kubernetes network:
     [root@localhost podSecurityStandard]# kubectl apply -f fail-runasnonroot0.yaml
    Error from server (Forbidden): error when creating "fail-runasnonroot0.yaml": pods "runasnonroot0" is forbidden: violates PodSecurity "restricted:v1.23": unrestricted capabilities (container "prometheus" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "prometheus" must set securityContext.runAsNonRoot=true)
    • Create a pod that does not violate the Restricted policy, set the Pod's securityContext.runAsNonRoot=true, and drop all linux capabilities.
     apiVersion: v1
    kind: Pod
    metadata:
      name: runasnonroot0
      namespace: my-restricted-namespace
    spec:
      containers:
      - image: bitnami/prometheus:2.33.5
        name: prometheus
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
      securityContext:
        runAsNonRoot: true
        seccompProfile:
          type: RuntimeDefault
    • Execute the apply command and the pod is allowed to be created:
     [root@localhost podSecurityStandard]# kubectl apply -f pass-runasnonroot0.yaml
    pod/runasnonroot0 created

3.4 Current limitations of pod security admission

If PodSecurityPolicy is already configured in your cluster, migrating to pod security admission will take some work.

The first thing to consider is whether the current pod security admission is suitable for your cluster. Currently it is designed to meet the most common security needs out of the box. Compared to the PSP, it has the following differences:

  • Pod security admission only validates Pods against the security standards. It does not mutate Pods and cannot set default security configuration for them.
  • Pod security admission supports only the three officially defined security standard policies and does not support custom policies. PSP rules therefore cannot always be migrated one-to-one, and each security rule needs to be considered individually.
  • Unlike PSP, pod security admission cannot be bound to specific users. It only supports exempting specific users, RuntimeClasses, and namespaces.

4. pod security admission source code analysis

A Kubernetes admission controller is a plug-in that is decoupled from the API server logic at the code level. It intercepts requests to create, update, or delete objects before they are persisted to etcd and runs its own logic on them. The typical flow of a request through the API server is shown in the following figure:

[Figure: API request processing flow]

4.1 Main logic flow of the source code

[Figure: Pod Security Admission code flow]

The main logic flow of pod security admission is shown in the figure. The admission controller first parses the intercepted request and then handles it according to the resource type:

  • Namespace: If the resource is a Namespace, the admission controller first parses the level, mode, and pinned Pod security standard version from the namespace's labels. If the labels carry no Pod security standard policy information, the request is allowed immediately. Otherwise, the controller determines whether the request creates a new namespace or updates an existing one. On create, it checks that the configuration is valid; on update, it evaluates whether the Pods in the namespace comply with the newly set security standard policy.
  • Pod: If the resource is a Pod, the admission controller first obtains the Pod security standard policy set on the Pod's namespace. If the namespace has no policy set, the request is allowed; otherwise the Pod is evaluated for compliance with the security standard policy.
  • Others: The admission controller first obtains the Pod security standard policy set on the resource's namespace. If the namespace has no policy set, the request is allowed; otherwise the resource is parsed further to determine whether it is a resource that contains a PodSpec, such as PodTemplate, ReplicationController, ReplicaSet, Deployment, DaemonSet, StatefulSet, Job, or CronJob. Once the PodSpec has been extracted, it is evaluated for compliance with the security standard policy.

4.2 Initialize Pod security admission

Like many Go programs, Pod Security Admission uses github.com/spf13/cobra to create its start command, which calls runServer on startup to initialize and start the webhook service. The options carry default settings such as DefaultClientQPSLimit, DefaultClientQPSBurst, DefaultPort, and DefaultInsecurePort.

// NewServerCommand creates a *cobra.Command object with default parameters
func NewServerCommand() *cobra.Command {
    opts := options.NewOptions()

    cmdName := "podsecurity-webhook"
    if executable, err := os.Executable(); err == nil {
        cmdName = filepath.Base(executable)
    }
    cmd := &cobra.Command{
        Use: cmdName,
        Long: `The PodSecurity webhook is a standalone webhook server implementing the Pod
Security Standards.`,
        RunE: func(cmd *cobra.Command, _ []string) error {
            verflag.PrintAndExitIfRequested()
            // Initialize and start the webhook service
            return runServer(cmd.Context(), opts)
        },
        Args: cobra.NoArgs,
    }
    opts.AddFlags(cmd.Flags())
    verflag.AddFlags(cmd.Flags())

    return cmd
}

The runServer function loads the admission controller's configuration, initializes the server, and finally starts it.

 func runServer(ctx context.Context, opts *options.Options) error {
    // Load the configuration
    config, err := LoadConfig(opts)
    if err != nil {
        return err
    }
    // Initialize the server from the configuration
    server, err := Setup(config)
    if err != nil {
        return err
    }
    
    ctx, cancel := context.WithCancel(ctx)
    defer cancel()
    go func() {
        stopCh := apiserver.SetupSignalHandler()
        <-stopCh
        cancel()
    }()
    // Start the server
    return server.Start(ctx)
}

The following excerpt shows the main part of the Setup function. Setup creates an Admission object containing:

  • PodSecurityConfig: The admission controller configuration, including the default Pod security standard level, mode, and pinned Kubernetes version, as well as the exempt Usernames, RuntimeClasses, and Namespaces.
  • Evaluator: The evaluator, which defines the concrete methods for checking the security standard policy.
  • Metrics: Used to collect Prometheus metrics.
  • PodSpecExtractor: Used to parse the PodSpec from the request object.
  • PodLister: Used to get the Pods in a given namespace.
  • NamespaceGetter: Used to get the namespace of the resource intercepted in the request.
 // Setup creates an Admission object to handle the admission logic.
func Setup(c *Config) (*Server, error) {
    ...
    s.delegate = &admission.Admission{
        Configuration:    c.PodSecurityConfig,
        Evaluator:        evaluator,
        Metrics:          metrics,
        PodSpecExtractor: admission.DefaultPodSpecExtractor{},
        PodLister:        admission.PodListerFromClient(client),
        NamespaceGetter:  admission.NamespaceGetterFromListerAndClient(namespaceLister, client),
    }
   ...
    return s, nil
}

After the admission controller service starts, the HandleValidate method is registered to handle admission validation. It calls the Validate method to perform the actual Pod security standard checks.

// Handle requests intercepted by the webhook
func (s *Server) HandleValidate(w http.ResponseWriter, r *http.Request) {
    defer utilruntime.HandleCrash(func(_ interface{}) {
        // Assume the crash happened before the response was written.
        http.Error(w, "internal server error", http.StatusInternalServerError)
    })
     ...
    // Perform the actual validation
    response := s.delegate.Validate(ctx, attributes)
    response.UID = review.Request.UID // Response UID must match request UID
    review.Response = response
    writeResponse(w, review)
}
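
The elided part of HandleValidate decodes the incoming AdmissionReview and, as the excerpt shows, the response must echo the request UID. The sketch below illustrates that round trip with the standard library only. The struct definitions are hand-written minimal stand-ins for the admission.k8s.io/v1 types, and `respond` and `handleValidate` are hypothetical names, not the webhook's own code.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"net/http"
)

// Minimal stand-ins for the AdmissionReview types (assumed field subset).
type admissionRequest struct {
	UID string `json:"uid"`
}

type admissionResponse struct {
	UID     string `json:"uid"`
	Allowed bool   `json:"allowed"`
}

type admissionReview struct {
	APIVersion string             `json:"apiVersion"`
	Kind       string             `json:"kind"`
	Request    *admissionRequest  `json:"request,omitempty"`
	Response   *admissionResponse `json:"response,omitempty"`
}

// respond decodes an AdmissionReview, runs validate, and fills in a response
// whose UID matches the request UID.
func respond(body []byte, validate func(admissionRequest) bool) (admissionReview, error) {
	var review admissionReview
	if err := json.Unmarshal(body, &review); err != nil {
		return review, err
	}
	if review.Request == nil {
		return review, errors.New("AdmissionReview has no request")
	}
	review.Response = &admissionResponse{
		UID:     review.Request.UID,
		Allowed: validate(*review.Request),
	}
	review.Request = nil
	return review, nil
}

// handleValidate wraps respond in an HTTP handler.
func handleValidate(validate func(admissionRequest) bool) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, "cannot read request body", http.StatusBadRequest)
			return
		}
		review, err := respond(body, validate)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(review)
	}
}

func main() {
	body := []byte(`{"apiVersion":"admission.k8s.io/v1","kind":"AdmissionReview","request":{"uid":"abc-123"}}`)
	review, err := respond(body, func(admissionRequest) bool { return true })
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(review.Response.UID, review.Response.Allowed)
}
```

Keeping the decode-and-answer logic in `respond`, separate from the HTTP plumbing, makes it easy to exercise without starting a server. A real webhook would additionally have to serve over TLS.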

4.3 Admission Validation Logic

The Validate method dispatches to a different validation method depending on the resource type in the request. All three branches eventually call the EvaluatePod method to evaluate the Pod against the security standard policy.

 // Validate admits an API request.
// The objects in admission attributes are expected to be external v1 objects that we care about.
// The returned response may be shared and must not be mutated.
func (a *Admission) Validate(ctx context.Context, attrs api.Attributes) *admissionv1.AdmissionResponse {
    var response *admissionv1.AdmissionResponse
    switch attrs.GetResource().GroupResource() {
    case namespacesResource:
        response = a.ValidateNamespace(ctx, attrs)
    case podsResource:
        response = a.ValidatePod(ctx, attrs)
    default:
        response = a.ValidatePodController(ctx, attrs)
    }
    return response
}

EvaluatePod uses the security standard level and version set on the namespace to select the checks to run against the Pod.

func (r *checkRegistry) EvaluatePod(lv api.LevelVersion, podMetadata *metav1.ObjectMeta, podSpec *corev1.PodSpec) []CheckResult {
    // If the configured level is Privileged (the permissive policy), return immediately
    if lv.Level == api.LevelPrivileged {
        return nil
    }
    // If the newest registered check version is older than the version set on the namespace,
    // fall back to the newest registered version
    if r.maxVersion.Older(lv.Version) {
        lv.Version = r.maxVersion
    }

    var checks []CheckPodFn
    // The configured level is Baseline
    if lv.Level == api.LevelBaseline {
        checks = r.baselineChecks[lv.Version]
    } else {
        // includes non-overridden baseline checks
        // Any other level runs the Restricted checks
        checks = r.restrictedChecks[lv.Version]
    }

    var results []CheckResult
    // Run every check and collect the results
    for _, check := range checks {
        results = append(results, check(podMetadata, podSpec))
    }
    return results
}
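
EvaluatePod looks up a ready-made list of checks for the requested version, so some revision of each check has to be chosen per version. The sketch below shows one plausible selection rule (pick the newest revision that is not newer than the requested minor version) together with the clamp to the newest registered version seen above. The `variant` type and both functions are hypothetical and are not taken from the source.

```go
package main

import "fmt"

// variant is one revision of a check, valid from minor version MinMinor on.
type variant struct {
	MinMinor int
	Label    string
}

// selectVariant picks the newest variant whose MinMinor does not exceed the
// requested minor version. Variants must be sorted by ascending MinMinor.
func selectVariant(variants []variant, minor int) (variant, bool) {
	var chosen variant
	found := false
	for _, v := range variants {
		if v.MinMinor > minor {
			break
		}
		chosen, found = v, true
	}
	return chosen, found
}

// clampMinor mirrors the maxVersion guard in EvaluatePod: a namespace pinned
// to a version newer than anything registered falls back to the newest one.
func clampMinor(requested, maxRegistered int) int {
	if maxRegistered < requested {
		return maxRegistered
	}
	return requested
}

func main() {
	// Hypothetical revisions of a single check.
	revisions := []variant{{0, "rev-a"}, {8, "rev-b"}, {19, "rev-c"}}
	minor := clampMinor(30, 23) // namespace asks for v1.30, registry knows v1.23
	v, ok := selectVariant(revisions, minor)
	fmt.Println(minor, v.Label, ok)
}
```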

Let's look at one specific check to see how the Pod security standard is verified. This check verifies that every container in the Pod has disabled allowPrivilegeEscalation. AllowPrivilegeEscalation controls whether a process in the container can gain more privileges than its parent process. It is usually set together with running as a non-root user (MustRunAsNonRoot).

 func allowPrivilegeEscalation_1_8(podMetadata *metav1.ObjectMeta, podSpec *corev1.PodSpec) CheckResult {
    var badContainers []string
    visitContainers(podSpec, func(container *corev1.Container) {
        // Flag the container if its security context is unset, AllowPrivilegeEscalation is unset, or AllowPrivilegeEscalation is true.
        if container.SecurityContext == nil || container.SecurityContext.AllowPrivilegeEscalation == nil || *container.SecurityContext.AllowPrivilegeEscalation {
            badContainers = append(badContainers, container.Name)
        }
    })

    if len(badContainers) > 0 {
        // At least one container violates the Pod security standard policy, so return the details
        return CheckResult{
            Allowed:         false,
            ForbiddenReason: "allowPrivilegeEscalation != false",
            ForbiddenDetail: fmt.Sprintf(
                "%s %s must set securityContext.allowPrivilegeEscalation=false",
                pluralize("container", "containers", len(badContainers)),
                joinQuote(badContainers),
            ),
        }
    }
    return CheckResult{Allowed: true}
}
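
One detail of this check is easy to miss: leaving allowPrivilegeEscalation unset counts as a violation, because the field is a pointer and only an explicit false passes. The following sketch reproduces the same condition on simplified stand-in types (`SecurityContext`, `Container`, and `boolPtr` are hypothetical) so that it can be run without the Kubernetes libraries.

```go
package main

import "fmt"

// Simplified stand-ins for the corev1 types used by the check.
type SecurityContext struct {
	AllowPrivilegeEscalation *bool
}

type Container struct {
	Name            string
	SecurityContext *SecurityContext
}

// boolPtr returns a pointer to b, for filling optional fields.
func boolPtr(b bool) *bool { return &b }

// containersAllowingEscalation applies the same condition as the check above:
// a container passes only when the field is explicitly set to false.
func containersAllowingEscalation(containers []Container) []string {
	var bad []string
	for _, c := range containers {
		sc := c.SecurityContext
		if sc == nil || sc.AllowPrivilegeEscalation == nil || *sc.AllowPrivilegeEscalation {
			bad = append(bad, c.Name)
		}
	}
	return bad
}

func main() {
	containers := []Container{
		{Name: "unset"},
		{Name: "explicit-true", SecurityContext: &SecurityContext{AllowPrivilegeEscalation: boolPtr(true)}},
		{Name: "explicit-false", SecurityContext: &SecurityContext{AllowPrivilegeEscalation: boolPtr(false)}},
	}
	// prints [unset explicit-true]
	fmt.Println(containersAllowingEscalation(containers))
}
```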

Summary

In Kubernetes v1.23, Pod Security Admission was promoted to beta. Its current feature set is still limited, but it is a feature worth watching.


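Boyun (博云)

The Boyun technical community regularly shares cloud-native technical insights and real-world practice on containers, microservices, DevOps, and more.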