K8S Pod New Security Policy: Pod Security Admission Introduction | K8S Internals Series 1

K8S Internals Series: Issue 1

Since Kubernetes won the container orchestration battle, K8S has become the operating system of the cloud-native era. K8S makes many things simple, but it has itself grown more and more complex. The K8S Internals Series column covers many aspects of the K8S ecosystem: the Boyun Container Cloud R&D team regularly shares topics such as scheduling, security, networking, performance, storage, and application scenarios. We hope that, while enjoying the efficiency and convenience K8S brings, you can also come to appreciate the inner workings of its core mechanisms.

1. Introduction to Pod Security Policy

Because Pod Security Admission is meant to replace Pod Security Policy, it helps to introduce Pod Security Policy first. A Pod Security Policy defines a set of conditions that Pods must satisfy to run, along with default values for the related fields. A Pod must meet these conditions to be created successfully. The Spec of the Pod Security Policy object contains the following fields, which are the aspects that Pod Security Policy can control:

Control aspect	Field names
Running privileged containers	privileged
Using host namespaces	hostPID, hostIPC
Using the host's network and ports	hostNetwork, hostPorts
Controlling which volume types can be used	volumes
Using the host file system	allowedHostPaths
Allowing specific FlexVolume drivers	allowedFlexVolumes
Allocating the FSGroup that owns the Pod's volumes	fsGroup
Requiring a read-only root file system	readOnlyRootFilesystem
Setting the container's user and group IDs	runAsUser, runAsGroup, supplementalGroups
Restricting escalation to root privileges	allowPrivilegeEscalation, defaultAllowPrivilegeEscalation
Linux capabilities	defaultAddCapabilities, requiredDropCapabilities, allowedCapabilities
Setting the container's SELinux context	seLinux
Specifying the proc mount types a container may use	allowedProcMountTypes
Specifying the AppArmor profile used by containers	annotations
Specifying the seccomp profile used by containers	annotations
Specifying the sysctl profile used by containers	forbiddenSysctls, allowedUnsafeSysctls

Among them, AppArmor and seccomp need to be set by adding annotations to the PodSecurityPolicy object:

 seccomp.security.alpha.kubernetes.io/allowedProfileNames: 'docker/default'
seccomp.security.alpha.kubernetes.io/defaultProfileName: 'docker/default'
apparmor.security.beta.kubernetes.io/allowedProfileNames: 'runtime/default'
apparmor.security.beta.kubernetes.io/defaultProfileName: 'runtime/default'

Pod Security Policy is a cluster-level resource. Let's take a look at its usage process:

Figure: PSP usage process

Because using PSPs requires creating ClusterRoles/Roles and ClusterRoleBindings/RoleBindings that bind service accounts, it is hard to tell which PSPs are in effect, and harder still to tell which security rules blocked a Pod from being created.

2. Why was Pod Security Admission introduced?

Anyone who has used PodSecurityPolicy will have run into its problems: there is no dry-run or audit mode, it is inconvenient to switch on and off, and its behavior is not transparent. Because of these shortcomings, PodSecurityPolicy was marked as deprecated in Kubernetes v1.21 and is scheduled for removal in v1.25, and a new feature, Pod Security Admission, was added in Kubernetes v1.22.

3. Introduction to Pod Security Admission

Pod security admission is an admission controller built into Kubernetes. Its feature gate is enabled by default in Kubernetes v1.23. In v1.22 it must be enabled with the kube-apiserver parameter --feature-gates="...,PodSecurity=true". On Kubernetes versions lower than v1.22, you can install the Pod Security Admission Webhook yourself.

Pod security admission restricts the creation of pods in a cluster by enforcing built-in Pod Security Standards.

3.1 Pod Security Standards

To cover a wide range of security scenarios, Pod Security Standards define three policies, each more restrictive than the last:

Profile	Description
Privileged	Unrestricted policy that provides the widest possible level of permissions. This policy allows known privilege escalations.
Baseline	Minimally restrictive policy that prevents known privilege escalations. It allows the default (minimally specified) Pod configuration.
Restricted	Heavily restricted policy that follows current best practices for hardening Pods.

See Pod Security Standards for details.

3.2 Pod Security Standards Implementation Method

Once the pod security admission feature gate is enabled in the Kubernetes cluster, Pod Security Standards are applied by setting labels on the namespace. Three modes are available:

Mode	Description
enforce	Pods that violate the policy are rejected.
audit	Policy violations add an audit annotation to the event recorded in the audit log, but the request is otherwise allowed.
warn	Policy violations trigger a user-facing warning, but the request is otherwise allowed.
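The difference between the three modes can be sketched in a few lines of Go. This is only an illustration of the table above, not the controller's code: the `outcome` type and the `decide` function are hypothetical names, and each argument stands for the violation message found for that mode's level (an empty string means no violation).

```go
package main

import "fmt"

// outcome is a hypothetical summary of what the three modes produce for one request.
type outcome struct {
	Allowed  bool
	Warnings []string
	Audit    []string
}

// decide combines the violation message found for each mode's level.
// An empty message means the pod satisfies the level configured for that mode.
func decide(enforceViolation, auditViolation, warnViolation string) outcome {
	out := outcome{Allowed: true}
	if enforceViolation != "" {
		out.Allowed = false // enforce: the request is rejected
	}
	if auditViolation != "" {
		out.Audit = append(out.Audit, auditViolation) // audit: recorded, still allowed
	}
	if warnViolation != "" {
		out.Warnings = append(out.Warnings, warnViolation) // warn: shown to the user, still allowed
	}
	return out
}

func main() {
	// A pod that passes the enforce level but violates the audit and warn levels.
	out := decide("", "runAsNonRoot != true", "runAsNonRoot != true")
	fmt.Printf("allowed=%v warnings=%d audit=%d\n", out.Allowed, len(out.Warnings), len(out.Audit))
}
```

Only the enforce mode changes whether the request is admitted; audit and warn merely attach information to it.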

Label template:

 # Set the mode and the security standard level.
# MODE must be one of `enforce`, `audit`, or `warn`.
# LEVEL must be one of `privileged`, `baseline`, or `restricted`.
pod-security.kubernetes.io/<MODE>: <LEVEL>

# Optional: pins the version of the security standard to use.
# MODE must be one of `enforce`, `audit`, or `warn`.
# VERSION must be a valid Kubernetes minor version (for example v1.23) or `latest`.
pod-security.kubernetes.io/<MODE>-version: <VERSION>

A namespace can configure any or all of the modes, and can set a different security standard level for each mode.
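To make the label scheme concrete, here is a minimal Go sketch that reads these labels from a namespace's label map. The `policy` type and `parseLabels` function are hypothetical names for this example, and the error handling is simplified; the real controller's behavior on invalid labels may differ. Modes without a label fall back to `privileged` and `latest`.

```go
package main

import (
	"fmt"
	"strings"
)

// labelPrefix is the label prefix used by Pod Security Admission.
const labelPrefix = "pod-security.kubernetes.io/"

// policy holds the level and pinned version for one mode (hypothetical type).
type policy struct {
	Level   string
	Version string
}

var validModes = map[string]bool{"enforce": true, "audit": true, "warn": true}
var validLevels = map[string]bool{"privileged": true, "baseline": true, "restricted": true}

// parseLabels extracts the per-mode policy from namespace labels.
// Labels outside the pod-security prefix are ignored.
func parseLabels(labels map[string]string) (map[string]policy, error) {
	out := map[string]policy{}
	for mode := range validModes {
		out[mode] = policy{Level: "privileged", Version: "latest"}
	}
	for key, value := range labels {
		if !strings.HasPrefix(key, labelPrefix) {
			continue
		}
		name := strings.TrimPrefix(key, labelPrefix)
		mode := strings.TrimSuffix(name, "-version")
		if !validModes[mode] {
			return nil, fmt.Errorf("unknown mode in label %q", key)
		}
		p := out[mode]
		if strings.HasSuffix(name, "-version") {
			p.Version = value
		} else {
			if !validLevels[value] {
				return nil, fmt.Errorf("invalid level %q for %q", value, key)
			}
			p.Level = value
		}
		out[mode] = p
	}
	return out, nil
}

func main() {
	policies, err := parseLabels(map[string]string{
		"pod-security.kubernetes.io/enforce":         "baseline",
		"pod-security.kubernetes.io/enforce-version": "v1.23",
		"pod-security.kubernetes.io/warn":            "restricted",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("enforce=%+v warn=%+v audit=%+v\n", policies["enforce"], policies["warn"], policies["audit"])
}
```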

The default configuration for pod security admission can be set via the admission controller configuration file:

 apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: PodSecurity
  configuration:
    apiVersion: pod-security.admission.config.k8s.io/v1beta1
    kind: PodSecurityConfiguration
    # Defaults applied when a mode label is not set.
    #
    # Level label values must be one of:
    # - "privileged" (default)
    # - "baseline"
    # - "restricted"
    #
    # Version label values must be one of:
    # - "latest" (default) 
    # - specific version like "v1.23"
    defaults:
      enforce: "privileged"
      enforce-version: "latest"
      audit: "privileged"
      audit-version: "latest"
      warn: "privileged"
      warn-version: "latest"
    exemptions:
      # Array of authenticated usernames to exempt.
      usernames: []
      # Array of runtime class names to exempt.
      runtimeClassNames: []
      # Array of namespaces to exempt.
      namespaces: []

Pod security admission can exempt Pods from security standard checks along three dimensions: username, runtimeClassName, and namespace.
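The exemption rule can be sketched as follows. The `exemptions` struct mirrors the three arrays in the configuration file, but the type and the `isExempt` method are hypothetical and only illustrate the idea that matching any one dimension is enough to skip the checks.

```go
package main

import "fmt"

// exemptions mirrors the three exemption lists of the configuration (hypothetical type).
type exemptions struct {
	Usernames         []string
	RuntimeClassNames []string
	Namespaces        []string
}

// contains reports whether s is present in list.
func contains(list []string, s string) bool {
	for _, item := range list {
		if item == s {
			return true
		}
	}
	return false
}

// isExempt reports whether a request matches any of the three exemption dimensions.
// An empty runtimeClassName means the pod does not set one and never matches.
func (e exemptions) isExempt(username, runtimeClassName, namespace string) bool {
	if contains(e.Usernames, username) {
		return true
	}
	if runtimeClassName != "" && contains(e.RuntimeClassNames, runtimeClassName) {
		return true
	}
	return contains(e.Namespaces, namespace)
}

func main() {
	e := exemptions{Namespaces: []string{"kube-system"}}
	fmt.Println(e.isExempt("alice", "", "kube-system")) // exempt by namespace
	fmt.Println(e.isExempt("alice", "", "default"))     // not exempt
}
```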

3.3 Pod Security Standards Implementation Demonstration

Environment: kubernetes v1.23

Containers at runtime are exposed to many attack risks, such as container escape and resource exhaustion attacks from containers.

3.3.1 Baseline policy

The Baseline policy targets common containerized applications and prevents known privilege escalations. According to the official documentation, it is aimed at application operators and developers of non-critical applications. The policy includes:

Sharing host namespaces, privileged containers, and hostPath volumes are prohibited; Linux capabilities and host ports are restricted; and AppArmor, SELinux, seccomp, and sysctl settings are constrained.

The following demonstrates setting the Baseline policy.

Risks of violating Baseline policy:

Privileged containers can see the host device
After mounting procfs, you can see the host process, breaking process isolation
Can break network isolation
After mounting the runtime socket, you can communicate with the runtime without restrictions

Any of the risks above, among others, can lead to container escape.

Create a namespace named my-baseline-namespace, and set both the enforce and warn modes to correspond to the Baseline-level Pod security standard policy:

 apiVersion: v1
kind: Namespace
metadata:
  name: my-baseline-namespace
  labels:
    pod-security.kubernetes.io/enforce: baseline  
    pod-security.kubernetes.io/enforce-version: v1.23
    pod-security.kubernetes.io/warn: baseline
    pod-security.kubernetes.io/warn-version: v1.23

Create a Pod

Create a Pod that violates the Baseline policy:

 apiVersion: v1
kind: Pod
metadata:
  name: hostnamespaces2
  namespace: my-baseline-namespace
spec:
  containers:
  - image: bitnami/prometheus:2.33.5
    name: prometheus
    securityContext:
      allowPrivilegeEscalation: true
      privileged: true
      capabilities:
        drop:
        - ALL
  hostPID: true
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault

Running kubectl apply shows that hostPID=true and securityContext.privileged=true are not allowed, and Pod creation is rejected. A privileged container with hostPID enabled is not isolated from the host's processes, which makes container escape much easier:

 [root@localhost podSecurityStandard]# kubectl apply -f fail-hostnamespaces2.yaml
Error from server (Forbidden): error when creating "fail-hostnamespaces2.yaml": pods "hostnamespaces2" is forbidden: violates PodSecurity "baseline:v1.23": host namespaces (hostPID=true), privileged (container "prometheus" must not set securityContext.privileged=true)
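The two violations reported above can be reproduced with a small Go sketch. It covers only the host namespace and privileged container checks, uses simplified stand-in types (`podSpec`, `container`) instead of the real Kubernetes API types, and the function name `baselineViolations` is an assumption made for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// container and podSpec are simplified stand-ins for the Kubernetes API types.
type container struct {
	Name       string
	Privileged bool
}

type podSpec struct {
	HostNetwork bool
	HostPID     bool
	HostIPC     bool
	Containers  []container
}

// baselineViolations returns one reason per violated rule.
// Only two of the Baseline rules are modelled here.
func baselineViolations(spec podSpec) []string {
	var reasons []string

	// Rule 1: host namespaces must not be shared.
	var shared []string
	if spec.HostNetwork {
		shared = append(shared, "hostNetwork=true")
	}
	if spec.HostPID {
		shared = append(shared, "hostPID=true")
	}
	if spec.HostIPC {
		shared = append(shared, "hostIPC=true")
	}
	if len(shared) > 0 {
		reasons = append(reasons, "host namespaces ("+strings.Join(shared, ", ")+")")
	}

	// Rule 2: containers must not be privileged.
	for _, c := range spec.Containers {
		if c.Privileged {
			reasons = append(reasons, fmt.Sprintf(
				"privileged (container %q must not set securityContext.privileged=true)", c.Name))
		}
	}
	return reasons
}

func main() {
	spec := podSpec{HostPID: true, Containers: []container{{Name: "prometheus", Privileged: true}}}
	fmt.Println(strings.Join(baselineViolations(spec), ", "))
}
```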

Create a Pod that does not violate the Baseline policy by setting the Pod's hostPID=false and securityContext.privileged=false:

 apiVersion: v1
kind: Pod
metadata:
  name: hostnamespaces2
  namespace: my-baseline-namespace
spec:
  containers:
  - image: bitnami/prometheus:2.33.5
    name: prometheus
    securityContext:
      allowPrivilegeEscalation: false
      privileged: false
      capabilities:
        drop:
        - ALL
  hostPID: false
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault

Execute the apply command and the pod is allowed to be created:

 [root@localhost podSecurityStandard]# kubectl apply -f pass-hostnamespaces2.yaml
pod/hostnamespaces2 created

3.3.2 Restricted Policy

The Restricted policy implements current best practices for hardening Pods. According to the official documentation, it is aimed at operators and developers of security-critical applications, as well as lower-trust users. It contains everything in the Baseline policy and adds the following: volume types are limited to a core set, with non-core volume types available only through PersistentVolumes; privilege escalation (via SetUID or SetGID file mode) is prohibited; containers must run as non-root users; runAsUser must not be set to 0; and containers must drop ALL capabilities and may add back only NET_BIND_SERVICE.

The Restricted policy further limits root privileges and access to Linux kernel capabilities inside the container. For example, a man-in-the-middle attack on the Kubernetes network requires the Linux CAP_NET_RAW capability to send ARP packets.
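The capability rule of the Restricted policy can be sketched in Go as follows. This is an illustration under simplifying assumptions, not the upstream check: `capabilityViolations` is a hypothetical function that takes a container's `add` and `drop` capability lists as plain string slices.

```go
package main

import "fmt"

// capabilityViolations checks the Restricted capability rule for one container:
// the drop list must contain ALL, and the add list may only contain NET_BIND_SERVICE.
func capabilityViolations(add, drop []string) []string {
	var reasons []string

	dropsAll := false
	for _, c := range drop {
		if c == "ALL" {
			dropsAll = true
		}
	}
	if !dropsAll {
		reasons = append(reasons, `must set securityContext.capabilities.drop=["ALL"]`)
	}

	for _, c := range add {
		if c != "NET_BIND_SERVICE" {
			reasons = append(reasons, fmt.Sprintf("must not add capability %q", c))
		}
	}
	return reasons
}

func main() {
	// NET_RAW is the capability needed for the ARP-based attack mentioned above.
	for _, reason := range capabilityViolations([]string{"NET_RAW"}, nil) {
		fmt.Println(reason)
	}
}
```

A container that drops ALL and adds nothing, or adds only NET_BIND_SERVICE, yields no violations in this sketch.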

Create a namespace named my-restricted-namespace, and set both the enforce and warn modes to correspond to the Restricted-level Pod security standard policy:

 apiVersion: v1
kind: Namespace
metadata:
  name: my-restricted-namespace
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: v1.23
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/warn-version: v1.23

Create a Pod

Create a Pod that violates the Restricted policy:

 apiVersion: v1
kind: Pod
metadata:
  name: runasnonroot0
  namespace: my-restricted-namespace
spec:
  containers:
  - image: bitnami/prometheus:2.33.5
    name: prometheus
    securityContext:
      allowPrivilegeEscalation: false
  securityContext:
    seccompProfile:
      type: RuntimeDefault

Running kubectl apply shows that securityContext.runAsNonRoot=true and securityContext.capabilities.drop=["ALL"] must be set, and Pod creation is rejected. A container running as root has excessive privileges, and without dropping Linux capabilities it exposes the cluster to the risk of network man-in-the-middle attacks:

 [root@localhost podSecurityStandard]# kubectl apply -f fail-runasnonroot0.yaml
Error from server (Forbidden): error when creating "fail-runasnonroot0.yaml": pods "runasnonroot0" is forbidden: violates PodSecurity "restricted:v1.23": unrestricted capabilities (container "prometheus" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "prometheus" must set securityContext.runAsNonRoot=true)

Create a Pod that does not violate the Restricted policy by setting the Pod's securityContext.runAsNonRoot=true and dropping all Linux capabilities:

 apiVersion: v1
kind: Pod
metadata:
  name: runasnonroot0
  namespace: my-restricted-namespace
spec:
  containers:
  - image: bitnami/prometheus:2.33.5
    name: prometheus
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault

Execute the apply command and the pod is allowed to be created:

 [root@localhost podSecurityStandard]# kubectl apply -f pass-runasnonroot0.yaml
pod/runasnonroot0 created

3.4 Current limitations of pod security admission

If PodSecurityPolicy is already configured in your cluster, migrating those policies to pod security admission will take some work.

The first thing to consider is whether pod security admission in its current form suits your cluster. It is designed to meet the most common security needs out of the box. Compared with PSP, it differs as follows:

Pod security admission only validates Pods against the security standards. It does not mutate Pods and cannot set default security configuration for them.
Pod security admission supports only the three officially defined security standard policies and does not support custom policies. PSP rules therefore cannot always be migrated one-to-one, and some rules need to be handled case by case.
Unlike PSP, pod security admission cannot be bound to specific users. It only supports exempting specific users, RuntimeClasses, and namespaces.

4. pod security admission source code analysis

A Kubernetes admission controller is a plug-in that is decoupled from the API server logic at the code level. It intercepts requests that create, update, or delete objects before they are persisted to etcd and runs specific logic on them. The typical flow of a request through the API server is shown in the following figure:

Figure: API request processing flow

4.1 Main logic flow of the source code

Figure: Pod security admission code flow

The main logic flow of pod security admission is shown in the figure. The admission controller first parses the intercepted request and then handles it according to the resource type:

Namespace: If the resource is a Namespace, the admission controller first parses the level, mode, and pinned Pod security standard version from the namespace labels. If no Pod security standard policy information is present, the request is allowed. Otherwise the controller determines whether the namespace is being created or updated. On create, it checks that the configuration is valid; on update, it evaluates whether the Pods in the namespace comply with the newly set policy.
Pod: If the resource is a Pod, the admission controller first obtains the Pod security standard policy set on the Pod's namespace. If the namespace has no policy set, the request is allowed; otherwise the Pod is evaluated for compliance with the policy.
Others: The admission controller first obtains the Pod security standard policy set on the resource's namespace. If the namespace has no policy set, the request is allowed; otherwise the resource is parsed further to determine whether it is one of the resources that contain a PodSpec, such as PodTemplate, ReplicationController, ReplicaSet, Deployment, DaemonSet, StatefulSet, Job, or CronJob. Once the PodSpec has been extracted, it is evaluated for compliance with the policy.
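The third branch depends on pulling a PodSpec out of whatever resource was submitted. A minimal Go sketch of that idea uses a type switch; the types below (`pod`, `deployment`, `cronJob`, `configMap`) are simplified stand-ins invented for this example, not the Kubernetes API types, and `extractPodSpec` is a hypothetical name.

```go
package main

import "fmt"

// Simplified stand-ins for API objects; only the nesting of the pod spec matters here.
type podSpec struct{ HostPID bool }

type pod struct{ Spec podSpec }

type deployment struct{ Template podSpec }

type cronJob struct{ JobTemplate podSpec }

type configMap struct{}

// extractPodSpec returns the pod spec embedded in obj, or false if obj
// is not a resource type that carries one.
func extractPodSpec(obj interface{}) (*podSpec, bool) {
	switch o := obj.(type) {
	case *pod:
		return &o.Spec, true
	case *deployment:
		return &o.Template, true
	case *cronJob:
		return &o.JobTemplate, true
	default:
		return nil, false
	}
}

func main() {
	objects := []interface{}{
		&deployment{Template: podSpec{HostPID: true}},
		&configMap{},
	}
	for _, obj := range objects {
		if spec, ok := extractPodSpec(obj); ok {
			fmt.Printf("%T: evaluate pod spec (hostPID=%v)\n", obj, spec.HostPID)
		} else {
			fmt.Printf("%T: no pod spec, nothing to evaluate\n", obj)
		}
	}
}
```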

4.2 Initialize Pod security admission

Like many Go programs, Pod security admission uses github.com/spf13/cobra to create its start command, which calls runServer on startup to initialize and start the webhook service. The options carry default settings such as DefaultClientQPSLimit, DefaultClientQPSBurst, DefaultPort, and DefaultInsecurePort.

 // NewServerCommand creates a *cobra.Command object with default parameters
func NewServerCommand() *cobra.Command {
    opts := options.NewOptions()

    cmdName := "podsecurity-webhook"
    if executable, err := os.Executable(); err == nil {
        cmdName = filepath.Base(executable)
    }
    cmd := &cobra.Command{
        Use: cmdName,
        Long: `The PodSecurity webhook is a standalone webhook server implementing the Pod
Security Standards.`,
        RunE: func(cmd *cobra.Command, _ []string) error {
            verflag.PrintAndExitIfRequested()
            // Initialize and start the webhook server
            return runServer(cmd.Context(), opts)
        },
        Args: cobra.NoArgs,
    }
    opts.AddFlags(cmd.Flags())
    verflag.AddFlags(cmd.Flags())

    return cmd
}

The runServer function loads the admission controller configuration, initializes the server, and finally starts it.

 func runServer(ctx context.Context, opts *options.Options) error {
    // Load the configuration
    config, err := LoadConfig(opts)
    if err != nil {
        return err
    }
    // Initialize the server from the configuration
    server, err := Setup(config)
    if err != nil {
        return err
    }

    ctx, cancel := context.WithCancel(ctx)
    defer cancel()
    go func() {
        stopCh := apiserver.SetupSignalHandler()
        <-stopCh
        cancel()
    }()
    // Start the server
    return server.Start(ctx)
}
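The shutdown wiring in runServer follows a common Go pattern: derive a cancellable context and cancel it when a stop channel fires. The sketch below isolates that pattern using only the standard library. `runUntilStopped` is a hypothetical function, and the stop channel is closed by hand instead of being fed by a signal handler.

```go
package main

import (
	"context"
	"fmt"
)

// runUntilStopped derives a cancellable context, cancels it when stopCh is
// closed, and returns the context error once the cancellation is observed.
func runUntilStopped(stopCh <-chan struct{}) error {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	go func() {
		<-stopCh // in the webhook this channel comes from the signal handler
		cancel()
	}()

	<-ctx.Done() // a real server would block in its Start(ctx) call here
	return ctx.Err()
}

func main() {
	stopCh := make(chan struct{})
	close(stopCh) // simulate a termination signal
	fmt.Println("server stopped:", runUntilStopped(stopCh))
}
```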

The following is an excerpt of the main part of the Setup function. Setup creates an Admission object containing:

Configuration: The admission controller configuration (c.PodSecurityConfig), including the default level, mode, and pinned Kubernetes version of the Pod security standard policy, as well as the exempt Usernames, RuntimeClasses, and Namespaces.
Evaluator: The evaluator, which defines the concrete methods for checking the security standard policy.
Metrics: Used to collect Prometheus metrics.
PodSpecExtractor: Used to parse the PodSpec from the request object.
PodLister: Used to get the Pods in a given namespace.
NamespaceGetter: Used to get the namespace of the resource intercepted in the request.

 // Setup creates an Admission object to handle the admission logic.
func Setup(c *Config) (*Server, error) {
    ...
    s.delegate = &admission.Admission{
        Configuration:    c.PodSecurityConfig,
        Evaluator:        evaluator,
        Metrics:          metrics,
        PodSpecExtractor: admission.DefaultPodSpecExtractor{},
        PodLister:        admission.PodListerFromClient(client),
        NamespaceGetter:  admission.NamespaceGetterFromListerAndClient(namespaceLister, client),
    }
   ...
    return s, nil
}

After the admission controller service starts, the HandleValidate method is registered to handle admission validation. It calls the Validate method to check the request against the Pod security standard policy.

 // HandleValidate processes the requests intercepted by the webhook
func (s *Server) HandleValidate(w http.ResponseWriter, r *http.Request) {
    defer utilruntime.HandleCrash(func(_ interface{}) {
        // Assume the crash happened before the response was written.
        http.Error(w, "internal server error", http.StatusInternalServerError)
    })
     ...
    // Run the actual validation
    response := s.delegate.Validate(ctx, attributes)
    response.UID = review.Request.UID // Response UID must match request UID
    review.Response = response
    writeResponse(w, review)
}

4.3 Admission Validation Logic

The Validate method dispatches to a different validation method depending on the resource type in the request. All three branches eventually call the EvaluatePod method to evaluate the Pod against the security standard policy.

 // Validate admits an API request.
// The objects in admission attributes are expected to be external v1 objects that we care about.
// The returned response may be shared and must not be mutated.
func (a *Admission) Validate(ctx context.Context, attrs api.Attributes) *admissionv1.AdmissionResponse {
    var response *admissionv1.AdmissionResponse
    switch attrs.GetResource().GroupResource() {
    case namespacesResource:
        response = a.ValidateNamespace(ctx, attrs)
    case podsResource:
        response = a.ValidatePod(ctx, attrs)
    default:
        response = a.ValidatePodController(ctx, attrs)
    }
    return response
}

The EvaluatePod method uses the security standard level and version set on the namespace to select the checks to run against the Pod.

 func (r *checkRegistry) EvaluatePod(lv api.LevelVersion, podMetadata *metav1.ObjectMeta, podSpec *corev1.PodSpec) []CheckResult {
    // If the configured level is Privileged (the permissive policy), return immediately
    if lv.Level == api.LevelPrivileged {
        return nil
    }
    // If the highest registered check version is older than the version set on the namespace, use the highest registered version
    if r.maxVersion.Older(lv.Version) {
        lv.Version = r.maxVersion
    }

    var checks []CheckPodFn
    // If the configured level is Baseline
    if lv.Level == api.LevelBaseline {
        checks = r.baselineChecks[lv.Version]
    } else {
        // includes non-overridden baseline checks
        // Otherwise run the Restricted checks
        checks = r.restrictedChecks[lv.Version]
    }

    var results []CheckResult
    // Run every check and collect the results
    for _, check := range checks {
        results = append(results, check(podMetadata, podSpec))
    }
    return results
}
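The level and version selection in EvaluatePod can be isolated in a small, self-contained sketch. Everything here is a simplification made for illustration: versions are plain minor numbers, `latest` is modelled as -1, checks are represented by their names, and the `registry` type and `checksFor` method are hypothetical.

```go
package main

import "fmt"

// latest is a sentinel for the "latest" version label in this sketch.
const latest = -1

// registry maps a minor version to the names of the checks registered for it.
type registry struct {
	maxMinor   int
	baseline   map[int][]string
	restricted map[int][]string
}

// checksFor selects the checks for a level and requested minor version.
// Versions newer than the newest registered one are clamped to it.
func (r registry) checksFor(level string, minor int) []string {
	if level == "privileged" {
		return nil // the permissive level runs no checks
	}
	if minor == latest || minor > r.maxMinor {
		minor = r.maxMinor
	}
	if level == "baseline" {
		return r.baseline[minor]
	}
	return r.restricted[minor] // restricted includes the baseline checks
}

func main() {
	r := registry{
		maxMinor:   23,
		baseline:   map[int][]string{22: {"hostNamespaces"}, 23: {"hostNamespaces", "privileged"}},
		restricted: map[int][]string{22: {"hostNamespaces", "runAsNonRoot"}, 23: {"hostNamespaces", "privileged", "runAsNonRoot"}},
	}
	fmt.Println(r.checksFor("baseline", 25))     // clamped to minor 23
	fmt.Println(r.checksFor("restricted", 22))   // uses the checks pinned to minor 22
	fmt.Println(r.checksFor("privileged", latest)) // no checks
}
```

Clamping means a namespace pinned to a newer version than the controller knows about is still evaluated, using the newest checks available.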

Let's look at one specific check to see how the Pod security standard is verified. This check verifies that every container in the Pod has disabled allowPrivilegeEscalation. AllowPrivilegeEscalation controls whether a process in the container can gain more privileges than its parent process. It is usually set together with the non-root user requirement (MustRunAsNonRoot).

 func allowPrivilegeEscalation_1_8(podMetadata *metav1.ObjectMeta, podSpec *corev1.PodSpec) CheckResult {
    var badContainers []string
    visitContainers(podSpec, func(container *corev1.Container) {
        // Check whether the container's security context is set, whether AllowPrivilegeEscalation is set, and whether AllowPrivilegeEscalation is set to false.
        if container.SecurityContext == nil || container.SecurityContext.AllowPrivilegeEscalation == nil || *container.SecurityContext.AllowPrivilegeEscalation {
            badContainers = append(badContainers, container.Name)
        }
    })

    if len(badContainers) > 0 {
        // A violation of the Pod security standard policy was found; return the details
        return CheckResult{
            Allowed:         false,
            ForbiddenReason: "allowPrivilegeEscalation != false",
            ForbiddenDetail: fmt.Sprintf(
                "%s %s must set securityContext.allowPrivilegeEscalation=false",
                pluralize("container", "containers", len(badContainers)),
                joinQuote(badContainers),
            ),
        }
    }
    return CheckResult{Allowed: true}
}
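The individual check results end up in the rejection message seen in the demonstrations. A hedged sketch of how such results could be combined into that message format follows; the `checkResult` type copies the three fields used above, while `forbiddenMessage` is a hypothetical helper written for this example and not the upstream aggregation code.

```go
package main

import (
	"fmt"
	"strings"
)

// checkResult carries the same three fields as the CheckResult values above.
type checkResult struct {
	Allowed         bool
	ForbiddenReason string
	ForbiddenDetail string
}

// forbiddenMessage joins all failed checks into one message.
// It returns an empty string when every check passed.
func forbiddenMessage(level, version string, results []checkResult) string {
	var parts []string
	for _, r := range results {
		if r.Allowed {
			continue
		}
		parts = append(parts, fmt.Sprintf("%s (%s)", r.ForbiddenReason, r.ForbiddenDetail))
	}
	if len(parts) == 0 {
		return ""
	}
	return fmt.Sprintf("violates PodSecurity %q: %s", level+":"+version, strings.Join(parts, ", "))
}

func main() {
	results := []checkResult{
		{Allowed: false, ForbiddenReason: "host namespaces", ForbiddenDetail: "hostPID=true"},
		{Allowed: true},
		{
			Allowed:         false,
			ForbiddenReason: "allowPrivilegeEscalation != false",
			ForbiddenDetail: `container "prometheus" must set securityContext.allowPrivilegeEscalation=false`,
		},
	}
	fmt.Println(forbiddenMessage("restricted", "v1.23", results))
}
```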

Summary

In Kubernetes v1.23, Pod Security Admission was promoted to beta. Although its functionality is still limited, the feature is worth looking forward to.
