Debugging Pods Stuck In Init/ContainerCreating State

Here at Ascenda Loyalty, we use AWS's managed Kubernetes service (EKS) to run our applications.

Some background info Link to heading

EKS (Elastic Kubernetes Service) is a managed Kubernetes service offered by AWS. AWS manages the control plane of the Kubernetes cluster and, in the case of EKS Fargate, the worker nodes as well.

Security groups for pods are used for our application pods and some internal services. This lets us manage network security outside of the Kubernetes layer and between AWS resources (e.g. RDS, ElastiCache). Due to the limitations of security groups for pods (they are only supported on most Nitro-based Amazon EC2 instance families, and there is a cap on the number of pods per node that can have a security group), we are unable to use them for all of our pods.
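For reference, attaching a security group to pods is configured through a SecurityGroupPolicy object; a minimal sketch, where the policy name, label, and security group ID are all placeholders:

```yaml
apiVersion: vpcresources.k8s.aws/v1beta1
kind: SecurityGroupPolicy
metadata:
  name: webapp-sg-policy        # hypothetical policy name
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: webapp               # pods with this label get the security group
  securityGroups:
    groupIds:
      - sg-0123456789abcdef0    # placeholder security group ID
```

Each pod matched by such a policy gets its own branch network interface (pod ENI), which is what the per-instance limits below apply to.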

What happened? Link to heading

Some pods were randomly staying in Init/ContainerCreating state for more than 10 minutes after the pod replicas were increased.

Figure 1. kubectl get pods -o wide -A output

So… What gives?

The pods are unable to obtain a private IPv4 address from the CNI. This causes them to stay in Init/ContainerCreating state until an IP is assigned. We can rule out a scheduling issue, as the pods were successfully scheduled onto nodes.

The first thing that comes to mind is to check whether the subnet's pool of available private IPv4 addresses is exhausted.

Figure 2. Screenshot of the available IPv4 addresses in the subnets

This is not the case as shown in figure 2.

The next thing that comes to mind is that the branch network interface (pod ENI) has hit its limit on the affected worker nodes.

  • kubectl get pods -A -o wide
    Check which pods are affected and the nodes they are scheduled on.
  • kubectl describe -n <namespace> pods <pod-name>
    Check the status of the pod. If it's due to the pod ENI hitting its limit, it will show up in the status.
  • kubectl describe nodes <node-name>
    Check the allocated resources; e.g. for an m5a.xlarge instance, the max pod ENI count is 18 per instance.
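To sketch what we were looking for in that last step: the pod ENI capacity and current allocation both appear under `vpc.amazonaws.com/pod-eni` in the node description. The excerpt below is a hypothetical, trimmed sample pasted into a variable so the filtering can be shown without a live cluster.

```shell
# Hypothetical excerpt of `kubectl describe node <node-name>` output;
# values here are made up for illustration.
describe_output='Allocatable:
  cpu:                       4
  vpc.amazonaws.com/pod-eni: 18
Allocated resources:
  Resource                   Requests  Limits
  vpc.amazonaws.com/pod-eni  9         9'

# Pull out the branch ENI capacity (first match) and the current
# allocation (second match) so they can be compared at a glance.
echo "$describe_output" | grep 'pod-eni'
```

Against a real cluster this would simply be `kubectl describe node <node-name> | grep pod-eni`.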

It doesn’t seem that we’ve maxed out our branch ENI usage either. 🤔

Let’s dig a little further elsewhere, since this is related to pods not getting any IP address. One thing that came to mind is the AWS CNI that we use. The version deployed at that time was v1.7.10. There might be a bug in that version causing these random failures.

A quick Google search brought us here. Most of the solutions point to upgrading the AWS CNI to version ≥ v1.7.7 (which we were already on). There were also comments stating that certain environment variables needed to be set to use security groups for pods (which we had configured correctly). The AWS CNI had newer releases at that time, with the latest being v1.9.0, and with no options left, we upgraded to the latest CNI version.

Everything seemed fine for a few hours, until the same issue popped up to haunt us.


Fast forward Link to heading

After opening an AWS support ticket and going back and forth with an AWS engineer, we found that it was indeed due to hitting the max pod ENI. Our usage of security groups for pods was ultimately causing the failed to assign an IP address to container error.

Although there are shortfalls in using security groups for pods in EKS (fewer pods per node), we’re still using them to maintain a high level of security between AWS resources such as RDS and ElastiCache.

Why didn’t we notice that we ran out of pod eni in the first place? Link to heading

For each application, we deploy a Kubernetes Job that runs a DB migration step before deploying a set of webapp and worker pods. These consume pod ENIs, as they use security groups for pods.

When we first checked whether we were hitting the pod ENI limit, we ran these commands:

  • kubectl get pods -A -o wide
  • kubectl describe -n <namespace> pods <pod name>
  • kubectl describe nodes <node name>

Upon further inspection of the output from kubectl describe nodes <node name>, there’s a discrepancy between the reported allocated resources and the number of pods that use pod ENIs. We can verify this by running kubectl get pods -o wide -A | grep <node name> and counting the number of pods that use security groups for pods.
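The counting step can be sketched like this, using a hypothetical sample of `kubectl get pods -o wide -A` output (pod and node names are made up) so it runs without a cluster:

```shell
# Hypothetical sample of `kubectl get pods -o wide -A` output; against a
# real cluster you would pipe the actual command into the grep below.
pods='default   webapp-7d9f    1/1  Running    ip-10-0-1-23.ec2.internal
default   worker-5c2a    1/1  Running    ip-10-0-1-23.ec2.internal
default   db-migrate-xk8p  0/1  Completed  ip-10-0-1-23.ec2.internal'

# Count every pod scheduled on the node, including Completed job pods:
# each one that used a security group still holds a branch ENI.
echo "$pods" | grep -c 'ip-10-0-1-23'
```

The key point is that Completed pods must be included in the count, which is exactly where the node's reported allocation diverges.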

Figure 3. List of pods scheduled on the affected node

Counting all the pods that use security groups in Figure 3, we are way past the pod ENI limit of 18. Comparing the total number of pods against the allocated pod ENIs reported by kubectl describe node <node name>, there is a discrepancy of between 1 and 6 pods.

What’s causing this discrepancy? Link to heading

It’s the DB migration jobs. These use Kubernetes Jobs together with security groups for pods. On completion, the allocated pod ENI does not get detached, and this is not reflected in the output of kubectl describe node <node name>: that command only reports running pods and does not include completed ones.

What now? Link to heading

These are some of the possible solutions:

  1. 🤔 Specify .spec.ttlSecondsAfterFinished in the Job manifest. Not possible for us at the moment: this feature is in alpha in Kubernetes v1.19, and EKS does not enable pre-beta features.
  2. ✅ Have the CI/CD system delete the Kubernetes Job after it completes successfully. This is the suitable solution for us: a successful job serves no purpose sitting around while consuming one pod ENI per pod.
  3. ๐Ÿ‘Ž๐Ÿฝ Run the db migration job as part of the webapp initcontainer. We would be freeing up one pod eni per application since it’ll be running in the same pod. However, this requires a bit of work on our CI/CD, helm charts and we would have a bit of certainty on the impact to the system. Not forgetting how nonsensical this idea is, db migration basically runs when the pod scales up or restarts.

Update Link to heading

27 September 2021: We have since updated the CI/CD pipeline to remove the Kubernetes Job on completion, and it’s been running for a week without any incident, even with the increased number of pods in the cluster! 🎉