CreateContainerError - OpenShift Error

Sep 17, 2024 | Andrew Wilson

openshift
createcontainererror
kubernetes
selinux

After one of our Managed Kubernetes Hosting clients reported that their scheduled cron workloads were running up to 20 minutes late, we discovered that the cron pods in question were failing with a status of CreateContainerError and the message "context deadline exceeded".

The problem

Drilling down into the pod’s event log, we found that the failing pods would repeatedly attempt to start for up to 20 minutes before finally succeeding. Once started, the task completed as normal without any additional delays.

Events:
  Type     Reason          Age                 From               Message
  ----     ------          ----                ----               -------
  Normal   Scheduled       23m                 default-scheduler  Successfully assigned sample-namespace/sample-cron-28285183-tzmt2 to ip-10-0-169-183.ap-southeast-2.compute.internal
  Normal   AddedInterface  23m                 multus             Add eth0 [10.131.0.68/23] from ovn-kubernetes
  Warning  Failed          11m                 kubelet            Error: kubelet may be retrying requests that are timing out in CRI-O due to system load. Currently at stage container volume configuration: the requested container k8s_sample-cron_sample-cron-28285183-tzmt2_sample-namespace_51cd9f1d-7a54-4210-90aa-2e4111755480_2 is now ready and will be provided to the kubelet on next retry: error reserving ctr name k8s_sample-cron_sample-cron-28285183-tzmt2_sample-namespace_51cd9f1d-7a54-4210-90aa-2e4111755480_2 for id 59ef5eec8345e9bbb573fd6300d8d8c89e9f614664125fecf339d55c59147887: name is reserved
  Normal   Pulled          7m3s (x9 over 23m)  kubelet            Container image "image-registry.openshift-image-registry.svc:5000/openshift/cron-container" already present on machine
  Warning  Failed          7m3s (x7 over 21m)  kubelet            Error: context deadline exceeded
  Normal   Created         6m                  kubelet            Created container sample-cron
  Normal   Started         6m                  kubelet            Started container sample-cron

Initial attempts to replicate this issue in an isolated environment were unsuccessful: pods started up in a matter of seconds and completed each iteration.

The fact that pod creation was timing out at the container volume configuration stage, and that the issue affected only a single client, led us to investigate the attached persistent storage volume. There was plenty of free disk space and the volume held less than 50 GB of data, but enumerating its contents revealed millions of small files.
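
A count like this can be obtained with a few lines of Python. The following is a minimal sketch, assuming the volume is mounted at a readable path; the function name count_files and the command-line handling are illustrative choices rather than anything from the original investigation.

```python
import os
import sys


def count_files(root: str) -> int:
    """Return the number of non-directory entries found under root, recursively."""
    total = 0
    # os.walk does not descend into symlinked directories by default and
    # silently skips directories it cannot read.
    for _dirpath, _dirnames, filenames in os.walk(root):
        total += len(filenames)
    return total


if __name__ == "__main__":
    # The mount path is a placeholder argument; default to the current directory.
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    print(f"{count_files(root)} files under {root}")
```

On a volume with millions of files the walk itself takes a while, which is already a hint at how expensive a per-file operation at every pod start will be.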

This was a problem because of the way the container runtime (in this case CRI-O) handles file permissions when mounting a persistent volume into a pod: as the pod starts up, CRI-O mounts the volume and then relabels the SELinux context on every single file to ensure that the pod has sufficient access. Each individual relabel is fast, but the total completion time grows with the number of files that require relabelling.

The solution

Once the problem had been identified, we found a configurable parameter in CRI-O that handles SELinux relabelling in a simplified manner: TrySkipVolumeSELinuxLabel.

When set, this parameter checks only the files and directories on the topmost level of the volume to ensure that their SELinux labels are correct. If an inconsistency is found during this check, the entire volume is relabelled; otherwise the pod starts under the assumption that the files on subsequent levels are labelled correctly, which makes for a considerably faster operation.
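
The effect on start-up cost can be illustrated with a toy model. This is a sketch of the decision as described above, not CRI-O's actual code: the Volume class, the files_to_relabel function and the example labels are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Volume:
    top_level_label: str  # SELinux label found on the top level of the volume
    file_count: int       # number of files below the top level


def files_to_relabel(volume: Volume, wanted_label: str, try_skip: bool) -> int:
    """Toy cost model: how many files a pod start would have to relabel."""
    if try_skip and volume.top_level_label == wanted_label:
        # Top level already matches: assume everything below is correct too.
        return 0
    # Option unset, or a mismatch was found: relabel the whole volume.
    return volume.file_count


if __name__ == "__main__":
    label = "system_u:object_r:container_file_t:s0:c12,c34"  # example label only
    other = "system_u:object_r:container_file_t:s0:c1,c2"    # example label only
    volume = Volume(top_level_label=label, file_count=3_000_000)
    print(files_to_relabel(volume, label, try_skip=False))  # full relabel
    print(files_to_relabel(volume, label, try_skip=True))   # skipped
    print(files_to_relabel(volume, other, try_skip=True))   # mismatch, full relabel
```

The saving only materialises when the label on the top level already matches the one the pod needs; a mismatch falls back to the full walk.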

Unlocking the ability to set this parameter involves first creating the following custom MachineConfig:

---
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-selinux-fix  # an arbitrary name we’re choosing for this MachineConfig
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - path: /etc/crio/crio.conf.d/99-selinux-fix.conf  # this filename is arbitrary but must be in this directory
        overwrite: true
        contents:
          source: data:text/plain;charset=utf-8;base64,W2NyaW8ucnVudGltZS5ydW50aW1lcy5zZWxpbnV4LWZpeF0KcnVudGltZV9wYXRoID0gIi91c3IvYmluL3J1bmMiCnJ1bnRpbWVfcm9vdCA9ICIvcnVuL3J1bmMiCnJ1bnRpbWVfdHlwZSA9ICJvY2kiCmFsbG93ZWRfYW5ub3RhdGlvbnMgPSBbCiAgICAiaW8ua3ViZXJuZXRlcy5jcmktby5UcnlTa2lwVm9sdW1lU0VMaW51eExhYmVsIiwKXQo=

When the base64 payload in the file’s contents.source field is decoded, the configuration simply declares a runtime for which the TrySkipVolumeSELinuxLabel annotation is allowed:

[crio.runtime.runtimes.selinux-fix]   # here we're creating the 'selinux-fix' runtime
runtime_path = "/usr/bin/runc"
runtime_root = "/run/runc"
runtime_type = "oci"
allowed_annotations = [
    "io.kubernetes.cri-o.TrySkipVolumeSELinuxLabel",   # this references the built-in functionality in CRI-O that we're looking for
]

The above defines a new runtime alongside the CRI-O default; the default runtime does not allow this annotation. The machineconfiguration.openshift.io/role label means that the MachineConfig will be applied to all worker nodes.
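
Before applying a MachineConfig like this, it is worth decoding the payload locally to confirm exactly what will be written to the node, including the runtime name that the RuntimeClass handler has to match. The following is a minimal sketch using only the Python standard library; the helper name decode_data_url is an illustrative choice.

```python
import base64

# The contents.source value from the MachineConfig, copied verbatim.
DATA_URL = "data:text/plain;charset=utf-8;base64,W2NyaW8ucnVudGltZS5ydW50aW1lcy5zZWxpbnV4LWZpeF0KcnVudGltZV9wYXRoID0gIi91c3IvYmluL3J1bmMiCnJ1bnRpbWVfcm9vdCA9ICIvcnVuL3J1bmMiCnJ1bnRpbWVfdHlwZSA9ICJvY2kiCmFsbG93ZWRfYW5ub3RhdGlvbnMgPSBbCiAgICAiaW8ua3ViZXJuZXRlcy5jcmktby5UcnlTa2lwVm9sdW1lU0VMaW51eExhYmVsIiwKXQo="


def decode_data_url(url: str) -> str:
    """Decode the payload of a base64 'data:' URL into text."""
    header, _, payload = url.partition(",")
    if not header.endswith(";base64"):
        raise ValueError("expected a base64-encoded data URL")
    return base64.b64decode(payload).decode("utf-8")


if __name__ == "__main__":
    print(decode_data_url(DATA_URL))
```

The first line of the output is the [crio.runtime.runtimes.<name>] table header, and <name> is the value a RuntimeClass must use as its handler.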

Once the above is applied, the following RuntimeClass needs to be created. A pod can then specify this RuntimeClass in order to use the runtime that we’ve configured on the node.

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: selinux-fix-runtime-class
handler: selinux-fix  # this references the runtime created above; it must match the name in [crio.runtime.runtimes.<name>]

Now that the cluster has been prepared, pods can be created to run with the new RuntimeClass. Here is a sample Pod manifest referencing the RuntimeClass defined above:

apiVersion: v1
kind: Pod
metadata:
  name: cron-test
  annotations:
    io.kubernetes.cri-o.TrySkipVolumeSELinuxLabel: 'true'  # this annotation tells CRI-O to skip SELinux relabelling of child directories when the top level is already labelled correctly
spec:
  containers:
  - name: cron-test
    image: alpine
    command: ['sh', '-c', 'echo "Hello world" && sleep infinity']
  runtimeClassName: selinux-fix-runtime-class  # this ensures the pod runs in a runtime that allows the above annotation
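
For cron workloads, the same two settings belong on the pod template of the CronJob rather than on the CronJob object itself. The following sketch patches a batch/v1 CronJob manifest held as a Python dictionary; the helper name with_selinux_skip and the sample manifest (its name and schedule in particular) are hypothetical.

```python
import copy
import json

ANNOTATION = "io.kubernetes.cri-o.TrySkipVolumeSELinuxLabel"


def with_selinux_skip(cronjob: dict, runtime_class: str) -> dict:
    """Return a copy of a batch/v1 CronJob manifest with the annotation and RuntimeClass set."""
    patched = copy.deepcopy(cronjob)
    template = patched["spec"]["jobTemplate"]["spec"]["template"]
    # The annotation must end up on the pods, so it goes on the pod template metadata.
    annotations = template.setdefault("metadata", {}).setdefault("annotations", {})
    annotations[ANNOTATION] = "true"
    template["spec"]["runtimeClassName"] = runtime_class
    return patched


if __name__ == "__main__":
    # Hypothetical manifest, for illustration only.
    cronjob = {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": "sample-cron"},
        "spec": {
            "schedule": "*/5 * * * *",
            "jobTemplate": {"spec": {"template": {"spec": {
                "restartPolicy": "OnFailure",
                "containers": [{"name": "sample-cron", "image": "alpine"}],
            }}}},
        },
    }
    print(json.dumps(with_selinux_skip(cronjob, "selinux-fix-runtime-class"), indent=2))
```

Working on a deep copy keeps the original manifest untouched, which makes it easy to diff the two before applying the change.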

The takeaway

Once the solution had been tested and proven to fix the acute issue for the affected client, we found that applying the annotation to all workloads resulted not only in faster pod start times, but also in a noticeable decrease in the load average on all worker nodes, thanks to the reduction in disk I/O operations.
