CreateContainerError - OpenShift Error

Sep 17, 2024 | Andrew Wilson

After one of our Managed Kubernetes Hosting clients reported that their scheduled cron workloads were starting up to 20 minutes late, we found that the cron pods in question were failing to start with a status of CreateContainerError and the message "context deadline exceeded".

The problem

Drilling down into the pod’s event log, we found that the failing pods would repeatedly attempt to start for up to 20 minutes before succeeding. Once started, the task completed as normal without any additional delays.

Events:
  Type     Reason          Age                 From               Message
  ----     ------          ----                ----               -------
  Normal   Scheduled       23m                 default-scheduler  Successfully assigned sample-namespace/sample-cron-28285183-tzmt2 to ip-10-0-169-183.ap-southeast-2.compute.internal
  Normal   AddedInterface  23m                 multus             Add eth0 [10.131.0.68/23] from ovn-kubernetes
  Warning  Failed          11m                 kubelet            Error: kubelet may be retrying requests that are timing out in CRI-O due to system load. Currently at stage container volume configuration: the requested container k8s_sample-cron_sample-cron-28285183-tzmt2_sample-namespace_51cd9f1d-7a54-4210-90aa-2e4111755480_2 is now ready and will be provided to the kubelet on next retry: error reserving ctr name k8s_sample-cron_sample-cron-28285183-tzmt2_sample-namespace_51cd9f1d-7a54-4210-90aa-2e4111755480_2 for id 59ef5eec8345e9bbb573fd6300d8d8c89e9f614664125fecf339d55c59147887: name is reserved
  Normal   Pulled          7m3s (x9 over 23m)  kubelet            Container image "image-registry.openshift-image-registry.svc:5000/openshift/cron-container" already present on machine
  Warning  Failed          7m3s (x7 over 21m)  kubelet            Error: context deadline exceeded
  Normal   Created         6m                  kubelet            Created container sample-cron
  Normal   Started         6m                  kubelet            Started container sample-cron

Initial attempts to replicate this issue in an isolated environment were unfruitful, and pods were able to start up in a matter of seconds and complete each iteration.

Two facts led us to investigate the attached persistent storage volume: pod creation was timing out at the container volume configuration stage, and the issue affected only a single client. There was plenty of free disk space and the volume was using less than 50 GB, but enumerating its contents revealed millions of small files.
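Since free space says nothing about this kind of problem, counting entries is the more useful check. The following is a minimal Python sketch of such a count, assuming the volume is mounted somewhere readable; the function name and the /mnt/data path are hypothetical and not taken from the affected cluster.

```python
import os


def count_entries(root: str) -> tuple[int, int]:
    """Return (files, directories) found beneath root.

    Symlinked directories are counted but not descended into.
    """
    files = 0
    dirs = 0
    for _dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        dirs += len(dirnames)
        files += len(filenames)
    return files, dirs


if __name__ == "__main__":
    # Hypothetical mount point; substitute the path where the volume is mounted.
    n_files, n_dirs = count_entries("/mnt/data")
    print(f"{n_files} files, {n_dirs} directories")
```

On a volume with millions of files the walk itself takes a while, which is a reasonable hint at how long a per-file relabel would take.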

This caused a problem because of the way the container runtime (in this case CRI-O) handles SELinux labels when mounting a persistent volume into a pod. As the pod starts up, CRI-O mounts the volume and then relabels the SELinux context on every single file to ensure that the pod has sufficient access. Each relabel is quick, but the total time depends on how many files need relabelling, and with millions of files it was long enough for the kubelet's requests to CRI-O to time out.

The solution

Once the problem had been identified, we found a CRI-O option that handles SELinux relabelling in a simplified manner: the pod annotation io.kubernetes.cri-o.TrySkipVolumeSELinuxLabel.

When set, CRI-O checks only the files and directories at the topmost level of the volume to confirm that the SELinux labels are set correctly. If this check finds an inconsistency, the entire volume is relabelled; otherwise the pod starts on the assumption that the files at deeper levels are labelled correctly, which is considerably faster.
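The decision described above can be modelled in a few lines. This is only an illustrative Python sketch of the behaviour as we have described it, not CRI-O's implementation (which is written in Go); relabel_volume, get_label and set_label are hypothetical names, with the two callables standing in for reading and writing an SELinux label.

```python
import os
from typing import Callable


def relabel_volume(
    root: str,
    wanted: str,
    get_label: Callable[[str], str],
    set_label: Callable[[str, str], None],
    try_skip: bool,
) -> int:
    """Relabel root and everything beneath it; return how many paths were relabelled.

    With try_skip=True, root and its immediate children are inspected first and
    the full walk is skipped when all of them already carry the wanted label.
    """
    if try_skip:
        top_level = [root] + [os.path.join(root, name) for name in os.listdir(root)]
        if all(get_label(path) == wanted for path in top_level):
            return 0  # assume the deeper levels are labelled correctly

    changed = 0
    set_label(root, wanted)
    changed += 1
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            set_label(os.path.join(dirpath, name), wanted)
            changed += 1
    return changed
```

The cost of the shortcut is proportional to the number of top-level entries rather than to the total number of files, which is where the time saving comes from. The trade-off is that a mislabelled file deep in the tree goes unnoticed as long as the top level looks correct.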

Unlocking the ability to set this parameter involves first creating the following custom MachineConfig:

---
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-selinux-fix # an arbitrary name we're choosing for this MachineConfig
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - path: /etc/crio/crio.conf.d/99-selinux-fix.conf # this filename is arbitrary but must be in this directory
          overwrite: true
          contents:
            # base64 of the CRI-O drop-in; it decodes to a [crio.runtime.runtimes.selinux-fix-runtime] table
            source: data:text/plain;charset=utf-8;base64,W2NyaW8ucnVudGltZS5ydW50aW1lcy5zZWxpbnV4LWZpeC1ydW50aW1lXQoKcnVudGltZV9wYXRoID0gIi91c3IvYmluL3J1bmMiCnJ1bnRpbWVfcm9vdCA9ICIvcnVuL3J1bmMiCnJ1bnRpbWVfdHlwZSA9ICJvY2kiCmFsbG93ZWRfYW5ub3RhdGlvbnMgPSBbCiAgICAiaW8ua3ViZXJuZXRlcy5jcmktby5UcnlTa2lwVm9sdW1lU0VMaW51eExhYmVsIiwKXQo=

When the value of spec.config.storage.files[].contents.source is decoded, the configuration simply defines a runtime for which the TrySkipVolumeSELinuxLabel annotation is allowed:

[crio.runtime.runtimes.selinux-fix-runtime] # here we're creating the 'selinux-fix-runtime' runtime
runtime_path = "/usr/bin/runc"
runtime_root = "/run/runc"
runtime_type = "oci"
allowed_annotations = [
    "io.kubernetes.cri-o.TrySkipVolumeSELinuxLabel", # this references the built-in functionality in CRI-O that we're looking for
]
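Hand-editing a base64 blob is error-prone, so it can be safer to generate the contents.source value from the plain-text drop-in and decode it again as a check. Below is a minimal Python sketch of that round trip; the helper names are hypothetical. The table name after crio.runtime.runtimes. is the value a RuntimeClass handler has to match, so it is worth confirming in the decoded output.

```python
import base64

PREFIX = "data:text/plain;charset=utf-8;base64,"


def to_data_url(text: str) -> str:
    """Encode plain text as the data URL form used in contents.source."""
    return PREFIX + base64.b64encode(text.encode("utf-8")).decode("ascii")


def from_data_url(url: str) -> str:
    """Decode a data URL produced by to_data_url back into plain text."""
    if not url.startswith(PREFIX):
        raise ValueError("unexpected data URL prefix")
    return base64.b64decode(url[len(PREFIX):]).decode("utf-8")


# The CRI-O drop-in from this article, without the explanatory comments.
CRIO_DROP_IN = """\
[crio.runtime.runtimes.selinux-fix-runtime]
runtime_path = "/usr/bin/runc"
runtime_root = "/run/runc"
runtime_type = "oci"
allowed_annotations = [
    "io.kubernetes.cri-o.TrySkipVolumeSELinuxLabel",
]
"""

source = to_data_url(CRIO_DROP_IN)
assert from_data_url(source) == CRIO_DROP_IN  # round trip is lossless
print(source)
```

The printed value can be pasted into contents.source, and from_data_url can be pointed at an existing MachineConfig to see exactly what it will write to the node.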

The above creates a new runtime in addition to the CRI-O default; the default runtime does not accept this annotation. The machineconfiguration.openshift.io/role label means that the MachineConfig will be applied to all worker nodes.

Once the above is applied, the following RuntimeClass needs to be created. A pod can then specify this RuntimeClass in order to use the selinux-fix-runtime that we’ve configured on the node.

apiVersion: node.k8s.io/v1
handler: selinux-fix-runtime # this references the runtime created above
kind: RuntimeClass
metadata:
  annotations:
  name: selinux-fix-runtime-class

Now that the cluster has been prepared, pods can be created to run with the new RuntimeClass. Here is a sample Pod manifest that uses the RuntimeClass defined above:

apiVersion: v1
kind: Pod
metadata:
  name: cron-test
  annotations:
    io.kubernetes.cri-o.TrySkipVolumeSELinuxLabel: 'true' # this annotation tells CRI-O to skip child directory SELinux relabelling
spec:
  containers:
    - name: cron-test
      image: alpine
      command: ['sh', '-c', 'echo "Hello world" && sleep infinity']
  runtimeClassName: selinux-fix-runtime-class # this ensures the pod runs in a runtime that allows the above annotation
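The affected workloads were cron workloads rather than bare pods. For a Kubernetes CronJob, the same two settings belong on the pod template at spec.jobTemplate.spec.template. The following Python sketch applies them to a manifest that has already been loaded into a plain dict (for example from YAML or JSON); opt_in is a hypothetical helper, and the default RuntimeClass name is the one used in this article.

```python
import copy

ANNOTATION = "io.kubernetes.cri-o.TrySkipVolumeSELinuxLabel"


def opt_in(cronjob: dict, runtime_class: str = "selinux-fix-runtime-class") -> dict:
    """Return a copy of a CronJob manifest whose pod template opts in to the skip."""
    patched = copy.deepcopy(cronjob)  # leave the caller's manifest untouched
    template = patched["spec"]["jobTemplate"]["spec"]["template"]

    metadata = template.setdefault("metadata", {})
    annotations = metadata.get("annotations") or {}  # tolerate a missing or null key
    annotations[ANNOTATION] = "true"  # annotation values must be strings
    metadata["annotations"] = annotations

    template["spec"]["runtimeClassName"] = runtime_class
    return patched
```

Both settings are needed together: the annotation is only honoured when the pod runs under a runtime whose allowed_annotations list includes it.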

The takeaway

Once the solution was tested and proven to fix the acute issue for the affected client, we found that applying the annotation to all workloads resulted not only in faster pod start times but also in a noticeable decrease in the load average on all worker nodes, owing to the reduction in disk I/O operations.
