Kubernetes cluster is broken: FailedSync and SandboxChanged

I have a Kubernetes 1.7.5 cluster that has somehow entered a semi-broken state. Scheduling a new deployment on this cluster partially fails: 1 of 2 pods starts normally, but the second pod does not start. The events are:

default 2017-09-28 03:57:02 -0400 EDT 2017-09-28 03:57:02 -0400 EDT 1 hello-4059723819-8s35v Pod spec.containers{hello} Normal Pulled kubelet, k8s-agentpool1-18117938-2 Successfully pulled image "myregistry.azurecr.io/mybiz/hello"
default 2017-09-28 03:57:02 -0400 EDT 2017-09-28 03:57:02 -0400 EDT 1 hello-4059723819-8s35v Pod spec.containers{hello} Normal Created kubelet, k8s-agentpool1-18117938-2 Created container
default 2017-09-28 03:57:03 -0400 EDT 2017-09-28 03:57:03 -0400 EDT 1 hello-4059723819-8s35v Pod spec.containers{hello} Normal Started kubelet, k8s-agentpool1-18117938-2 Started container
default 2017-09-28 03:57:13 -0400 EDT 2017-09-28 03:57:01 -0400 EDT 2 hello-4059723819-tj043 Pod Warning FailedSync kubelet, k8s-agentpool1-18117938-3 Error syncing pod
default 2017-09-28 03:57:13 -0400 EDT 2017-09-28 03:57:02 -0400 EDT 2 hello-4059723819-tj043 Pod Normal SandboxChanged kubelet, k8s-agentpool1-18117938-3 Pod sandbox changed, it will be killed and re-created.
default 2017-09-28 03:57:24 -0400 EDT 2017-09-28 03:57:01 -0400 EDT 3 hello-4059723819-tj043 Pod Warning FailedSync kubelet, k8s-agentpool1-18117938-3 Error syncing pod
default 2017-09-28 03:57:25 -0400 EDT 2017-09-28 03:57:02 -0400 EDT 3 hello-4059723819-tj043 Pod Normal SandboxChanged kubelet, k8s-agentpool1-18117938-3 Pod sandbox changed, it will be killed and re-created.
[...]

The last two log messages keep repeating.

The dashboard view of the failed pod shows:

Dashboard of failed pod

The dashboard ultimately shows this error:

Error: failed to start container "hello": Error response from daemon: {"message":"cannot join network of a non running container: 7e95918c6b546714ae20f12349efcc6b4b5b9c1e84b5505cf907807efd57525c"}

This cluster runs on Azure, using the Azure CNI network plugin. Everything worked fine until I enabled --runtime-config=batch/v2alpha1=true on the API server in order to use the CronJob feature. Now, even after removing that API level again and rebooting the master, the problem persists.
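For context, on Kubernetes 1.7 the CronJob type only exists in the batch/v2alpha1 API group, which is why the flag was needed. Roughly what I did, as a sketch (the hello-cron name and the schedule below are placeholders, not my exact config):

# kube-apiserver flag that enables the alpha batch API group
--runtime-config=batch/v2alpha1=true

# Minimal CronJob against the v2alpha1 API (name/schedule are illustrative)
cat <<EOF | kubectl apply -f -
apiVersion: batch/v2alpha1
kind: CronJob
metadata:
  name: hello-cron
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: hello
            image: myregistry.azurecr.io/mybiz/hello
          restartPolicy: OnFailure
EOF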

The kubelet log on the affected node shows that an IP address could not be allocated:

E0928 20:54:01.733682 1750 pod_workers.go:182] Error syncing pod 65127a94-a425-11e7-8d64-000d3af4357e ("hello-4059723819-xx16n_default(65127a94-a425-11e7-8d64-000d3af4357e)"), skipping: failed to "CreatePodSandbox" for "hello-4059723819-xx16n_default(65127a94-a425-11e7-8d64-000d3af4357e)" with CreatePodSandboxError: "CreatePodSandbox for pod \"hello-4059723819-xx16n_default(65127a94-a425-11e7-8d64-000d3af4357e)\" failed: rpc error: code = 2 desc = NetworkPlugin cni failed to set up pod \"hello-4059723819-xx16n_default\" network: Failed to allocate address: Failed to delegate: Failed to allocate address: No available addresses"
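For reference, this is roughly how I collected the information above (a sketch; the node name comes from the events, and it assumes kubelet runs as a systemd unit on the agent nodes):

# Events for the stuck pod
kubectl describe pod hello-4059723819-tj043

# All recent events in the namespace, oldest first
kubectl get events --sort-by=.metadata.creationTimestamp

# Kubelet log on the affected node
ssh k8s-agentpool1-18117938-3 'journalctl -u kubelet --no-pager | grep -i "failed to allocate"'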

It turns out this is an Azure CNI bug: IP addresses are not always reclaimed from terminated pods. See this issue: https://github.com/Azure/azure-container-networking/issues/76.

The reason this started happening after enabling the CronJob feature is that cron job containers are (usually) short-lived, and an IP is allocated every time one runs. If those IPs are not reclaimed and reused by the underlying networking system (CNI, in this case), they run out very quickly.
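Until that is fixed, the practical options are to make the cron jobs allocate fewer IPs (e.g. a less aggressive schedule) and to recover nodes whose address pool is already exhausted. A rough recovery sketch, assuming (per the linked issue) that the stale allocations are kept in the Azure CNI IPAM state file on the node; the file path and steps below are assumptions, so verify them against the issue thread before relying on this:

# Keep new pods off the node while cleaning up
kubectl cordon k8s-agentpool1-18117938-3
kubectl drain k8s-agentpool1-18117938-3 --ignore-daemonsets --force

# On the node: drop the stale Azure CNI IPAM state and restart kubelet
# (state file path is an assumption based on azure-container-networking defaults)
ssh k8s-agentpool1-18117938-3 'sudo rm /var/run/azure-vnet-ipam.json && sudo systemctl restart kubelet'

# Allow scheduling again
kubectl uncordon k8s-agentpool1-18117938-3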

