Troubleshooting - Useful Kubernetes Commands

The full range of kubectl and its subcommands is too broad for this guide, but several commands are especially useful for debugging and are recommended here.

Distressed Pods

"Distressed pods" is the Platform9 term for the pods to look for first when debugging issues in a large cluster. In our experience, the primary command is one that lists all pods and filters out those in a "Running", "Completed", or "Terminating" state. It is the simplest and most effective way to get an overview of the issues affecting pods. The command can take a while to return, depending on the number of pods running.

Bash
    
kubectl --kubeconfig ./k get pods -A | grep -v -i running | grep -v -i completed | grep -v -i terminating

NAMESPACE                       NAME                                                  READY   STATUS              RESTARTS   AGE
argus-n2-310                    kplane-usermgr-5c774f85cc-gbgcd                       1/2     ImagePullBackOff    0          7h47m
cert-manager                    cert-manager-cainjector-6886449cf8-szgts              0/1     CrashLoopBackOff    69         5h32m
du-argus-ab-368                 clarity-6f5fb449b8-7x29c                              4/5     ImagePullBackOff    0          7m

In the example above, the kplane-usermgr pod in argus-n2-310 cannot pull its image (ImagePullBackOff), and the cert-manager-cainjector pod is stuck in CrashLoopBackOff. Further debugging is required; the sections below cover the main ways to do it.
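The grep chain above matches the words anywhere on the line, so it also hides a pod whose name happens to contain "running", and it cannot show a pod that is in the Running state while its containers are not all ready (READY 0/1, for example). A minimal sketch of a column-aware filter follows. The function name distressed_pods is hypothetical, and the sketch assumes the default column order of `kubectl get pods -A`.

```shell
# Hypothetical helper (not part of kubectl): reads `kubectl get pods -A`
# output on stdin and keeps the header plus any pod that looks unhealthy.
# Assumes the default columns: NAMESPACE NAME READY STATUS RESTARTS AGE.
distressed_pods() {
  awk '
    NR == 1 { print; next }                            # always keep the header row
    $4 == "Completed" || $4 == "Terminating" { next }  # same exclusions as the grep chain
    {
      split($3, ready, "/")                            # READY looks like "1/2"
      if ($4 != "Running" || ready[1] != ready[2]) print
    }
  '
}

# Usage (requires cluster access):
#   kubectl --kubeconfig ./k get pods -A | distressed_pods
```

Because the filter looks only at the READY and STATUS columns, a Running pod with an incomplete READY count is reported alongside pods in states such as ImagePullBackOff or CrashLoopBackOff.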

Events

This is one of the most underappreciated tools available to Kubernetes operators. The kubectl get events command gives users a view of what is going on within a Kubernetes cluster, or even inside a given pod, as the following examples illustrate.

Bash

kubectl --kubeconfig ./k get events -n cert-manager

LAST SEEN   TYPE      REASON      OBJECT                                         MESSAGE
3m33s       Warning   BackOff     pod/cert-manager-cainjector-6886449cf8-szgts   Back-off restarting failed container
104s        Warning   Unhealthy   pod/cert-manager-webhook-c677f4f7-c8brx        Readiness probe failed: HTTP probe failed with statuscode: 500

In the example above, pod/cert-manager-webhook-c677f4f7-c8brx is failing its readiness probe with HTTP status code 500, which should be debugged further. You can also run get events on the whole cluster using the -A flag, piping the output to the more command.

Bash
    
kubectl --kubeconfig ./k get events -A | more

NAMESPACE                       LAST SEEN   TYPE      REASON                                                                                                   OBJECT                                                    MESSAGE
argus-mithil-310                19m         Normal    BackOff                                                                                                  pod/kplane-usermgr-5c774f85cc-gbgcd                       Back-off pulling image "514845858982.dkr.ecr.us-west-1.amazonaws.com/kplane-usermgr:5.4.0-1335"
argus-mithil-310                4m36s       Warning   Failed                                                                                                   pod/kplane-usermgr-5c774f85cc-gbgcd                       Error: ImagePullBackOff
cert-manager                    6m3s        Warning   BackOff                                                                                                  pod/cert-manager-cainjector-6886449cf8-szgts              Back-off restarting failed container
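On a busy cluster the cluster-wide event list can be long, so it can help to reduce it to a count of Warning events per object. The following is a rough sketch rather than a definitive tool. The function name warning_summary is hypothetical, and it assumes the default column layout of `kubectl get events -A` shown above.

```shell
# Hypothetical helper: reads `kubectl get events -A` output on stdin and
# prints "count namespace object reason" for Warning events only.
# Assumes the default columns: NAMESPACE, LAST SEEN, TYPE, REASON, OBJECT, MESSAGE.
warning_summary() {
  awk '
    NR > 1 && $3 == "Warning" { seen[$1 " " $5 " " $4]++ }   # key: namespace object reason
    END { for (k in seen) print seen[k], k }
  ' | sort -k1,1nr -k2                                       # most frequent first
}

# Usage (requires cluster access):
#   kubectl --kubeconfig ./k get events -A | warning_summary
```

Objects that appear at the top of the summary are usually the best candidates for a closer look with the namespaced get events command.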

Pod Logs

We also recommend using a log aggregation system for your pods. If that is not workable, you can review the Pod logs directly, which often provide valuable insights.

Bash
    
kubectl --kubeconfig ./k -n cert-manager logs cert-manager-webhook-c677f4f7-c8brx | more

E1019 05:10:12.492522       1 dynamic_source.go:88] cert-manager/webhook "msg"="Failed to generate initial serving certificate, retrying..." "error"="failed verifying CA keypair: tls: failed to find any PEM data in certificate input"  "interval"=1000000000
E1019 05:10:13.493607       1 dynamic_source.go:88] cert-manager/webhook "msg"="Failed to generate initial serving certificate, retrying..." "error"="failed verifying CA keypair: tls: failed to find any PEM data in certificate input"  "interval"=1000000000
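A pod that retries every second, as in the log above, repeats the same error many times. The sketch below collapses such repeats into one line per distinct message with a count. The function name klog_error_summary is hypothetical, and it assumes the container writes klog-style lines like those shown above (a severity letter, the month and day, a timestamp, and a thread id before the source location).

```shell
# Hypothetical helper: reads pod logs on stdin, keeps klog-style error and
# fatal lines (leading "E" or "F" followed by month and day), strips the
# timestamp and thread id, and counts how often each message occurs.
klog_error_summary() {
  grep -E '^[EF][0-9]{4} ' |
    sed -E 's/^[EF][0-9]{4} [0-9:.]+ +[0-9]+ //' |
    sort | uniq -c | sort -rn
}

# Usage (requires cluster access):
#   kubectl --kubeconfig ./k -n cert-manager logs cert-manager-webhook-c677f4f7-c8brx | klog_error_summary
# For a pod in CrashLoopBackOff, adding --previous to the logs command
# returns the logs of the previous container instance.
```

Logs in other formats (JSON, for example) would need a different pattern, so treat the regular expressions here as a starting point.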

Miscellaneous

Another check we find very useful is to list all nodes and their associated pods.

Bash
    
kubectl --kubeconfig ./k get nodes -o wide

NAME                                       STATUS                        ROLES    AGE     VERSION   INTERNAL-IP   EXTERNAL-IP      OS-IMAGE             KERNEL-VERSION    CONTAINER-RUNTIME
ip-10-0-1-116.us-west-2.compute.internal   Ready                         worker   2d12h   v1.20.5   10.0.1.116    34.222.12.206    Ubuntu 18.04.3 LTS   4.15.0-1054-aws   docker://19.3.11
ip-10-0-1-120.us-west-2.compute.internal   Ready                         worker   17h     v1.20.5   10.0.1.120    35.86.121.102    Ubuntu 18.04.3 LTS   4.15.0-1054-aws   docker://19.3.11
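With many nodes, it may be easier to show only those that are not plainly Ready. This is a small sketch under the assumption that STATUS is the second column of the `kubectl get nodes` output, as in the listing above. The function name not_ready_nodes is hypothetical.

```shell
# Hypothetical helper: reads `kubectl get nodes` output on stdin and keeps
# the header plus any node whose STATUS is not exactly "Ready"
# (for example NotReady or Ready,SchedulingDisabled).
not_ready_nodes() {
  awk 'NR == 1 || $2 != "Ready"'
}

# Usage (requires cluster access):
#   kubectl --kubeconfig ./k get nodes -o wide | not_ready_nodes
```

If only the header is printed, every node reported a plain Ready status.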

You can also find which node each pod is running on.

Bash
    
kubectl --kubeconfig ./k get pods -o wide -n cert-manager

NAME                                       READY   STATUS             RESTARTS   AGE     IP             NODE                                       NOMINATED NODE   READINESS GATES
cert-manager-84b96f99c7-v6pst              1/1     Running            0          6h      10.20.40.91    ip-10-0-2-94.us-west-2.compute.internal    <none>           <none>
cert-manager-cainjector-6886449cf8-szgts   0/1     CrashLoopBackOff   73         5h50m   10.20.90.34    ip-10-0-1-120.us-west-2.compute.internal   <none>           <none>
cert-manager-webhook-c677f4f7-c8brx        0/1     Running            0          7h25m   10.20.90.239   ip-10-0-1-120.us-west-2.compute.internal   <none>           <none>
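When several unhealthy pods share a node, as the two on ip-10-0-1-120 do above, the node itself may be worth investigating. The following sketch counts pods per node from the wide listing. The function name pods_per_node is hypothetical, and it assumes kubectl pads its columns with spaces so that the NODE value starts at the same character offset as the NODE header.

```shell
# Hypothetical helper: reads `kubectl get pods -o wide` output on stdin
# (with or without -A) and prints "count node" for each node.
# The NODE column is located by its character offset in the header, so
# values containing spaces in earlier columns do not shift the result.
pods_per_node() {
  awk '
    NR == 1 { col = index($0, "NODE"); next }   # offset of the NODE header
    col > 0 {
      split(substr($0, col), f, " ")            # first token at that offset
      count[f[1]]++
    }
    END { for (n in count) print count[n], n }
  ' | sort -k2
}

# Usage (requires cluster access):
#   kubectl --kubeconfig ./k get pods -o wide -n cert-manager | pods_per_node
```

Piping the distressed pods listing (with -o wide added) through the same function would show whether the failures cluster on particular nodes.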
