25. 08. 2023 Lorenzo Candeago DevOps

Debug and Workarounds for a Stuck Update on OpenShift 4.13.6

Today we wanted to update our OpenShift cluster, and after a while we came up against the following error:

Not good…

Let’s start by checking the clusterversion to see whether it reports any errors:

oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.6    True        True          91m     Unable to apply 4.13.9: an unknown error has occurred: MultipleErrors

Looking at the full YAML of the clusterversion, we found the following error:

oc get clusterversion -o yaml

[...]
    - lastTransitionTime: "2023-08-25T13:55:14Z"
      message: |-
        Multiple errors are preventing progress:
        * Cluster operator kube-apiserver is updating versions
        * deployment openshift-etcd-operator/etcd-operator is not available MinimumReplicasUnavailable (Deployment does not have minimum availability.) or progressing ProgressDeadlineExceeded (ReplicaSet "etcd-operator-74cc7479b7" has timed out progressing.)
      reason: MultipleErrors
      status: "True"
      type: Failing
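Rather than scanning the full YAML, the message of the Failing condition can be extracted directly with a JSONPath filter. The following is only a minimal sketch: the function name cv_failing_message is made up for illustration, and it assumes the ClusterVersion object is named "version", as in the output above.

```shell
# Print only the message of the "Failing" condition of the ClusterVersion.
# Assumption: the ClusterVersion resource is named "version".
# cv_failing_message is a hypothetical helper name, not an oc subcommand.
cv_failing_message() {
  oc get clusterversion version \
    -o jsonpath='{.status.conditions[?(@.type=="Failing")].message}'
}

# Usage (needs cluster access):
#   cv_failing_message
```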

It seems that the etcd-operator deployment is not able to start its pod. Let’s investigate further and see which pods are running in the openshift-etcd-operator namespace:

oc get pods -n openshift-etcd-operator
NAME                             READY   STATUS                 RESTARTS   AGE
etcd-operator-74cc7479b7-h8t64   0/1     CreateContainerError   0          97m
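In a namespace with more pods, it can help to filter the listing down to the ones that are not healthy. A small sketch, assuming the default single-namespace column layout shown above (STATUS in the third column); unhealthy_pods is a hypothetical helper name, and with -A the columns shift by one.

```shell
# Filter `oc get pods` output (read from stdin) down to pods whose STATUS
# is neither Running nor Completed. Prints "<name> <status>" per pod.
# Assumption: STATUS is the third column (no -A / --all-namespaces).
unhealthy_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Usage (needs cluster access):
#   oc get pods -n openshift-etcd-operator | unhealthy_pods
```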

OK, what’s going on with the pod?

oc describe pod etcd-operator-74cc7479b7-h8t64 -n openshift-etcd-operator
[...]
Warning  Failed          96m (x3 over 97m)      kubelet            (combined from similar events): Error: container create failed: time="2023-08-25T12:25:04Z" 
level=error msg="runc create failed: unable to start container process:
unable to init seccomp: error loading seccomp filter into kernel: error loading seccomp filter: errno 524"

According to the Red Hat release notes, the issue "is due to a CoreOS limitation in the number of seccomp profiles that can be created on the worker node" and is related to the following bug: https://issues.redhat.com/browse/OCPBUGS-2637.

Bad news: it looks like a kernel bug specific to the amd64 architecture.

After some further searching, we were able to pin the issue down to the specific OpenShift version 4.13.6: https://issues.redhat.com/browse/OCPBUGS-16655.

The bug still isn’t solved, but at least there’s a suggested workaround that has to be run on all worker nodes:

sudo sysctl -w net.core.bpf_jit_limit=364241152
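If logging in to each node by hand is impractical, the same setting could be applied from a workstation through oc debug. This is only a sketch under assumptions that are not part of the original workaround: the workers carry the label node-role.kubernetes.io/worker, the current user may start debug pods on nodes, and apply_bpf_jit_limit is a hypothetical helper name.

```shell
# Apply the suggested sysctl value on every worker node via `oc debug`.
# Assumptions: workers carry the label node-role.kubernetes.io/worker,
# and the current user is allowed to start debug pods on the nodes.
BPF_JIT_LIMIT=364241152

apply_bpf_jit_limit() {
  for node in $(oc get nodes -l node-role.kubernetes.io/worker -o name); do
    echo "Setting net.core.bpf_jit_limit on ${node}"
    oc debug "${node}" -- chroot /host \
      sysctl -w "net.core.bpf_jit_limit=${BPF_JIT_LIMIT}"
  done
}

# Usage (needs cluster access):
#   apply_bpf_jit_limit
```

Keep in mind that a value set with sysctl -w only changes the running kernel and is not kept across a reboot, so it may have to be reapplied after a node restarts.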

After that, the update seems to be working again!

Lorenzo Candeago

DevOps Engineer at Würth Phoenix
