Troubleshooting - Useful Kubernetes Commands
While kubectl and its subcommands are too broad a topic for this guide, several commands are particularly useful for debugging and are recommended here.
Distressed Pods
"Distressed Pods" is the Platform9 term for the pods to look for when debugging issues in a large cluster. In our experience, the most useful first step is to list all pods and filter out those in a "Running" state, along with those that are Completed or Terminating. This is the simplest and most effective way to get an overview of the issues affecting pods. The command can take a while to return, depending on the number of pods in the cluster.
kubectl --kubeconfig ./k get pods -A | grep -v -i running | grep -v -i completed | grep -v -i terminating
NAMESPACE NAME READY STATUS RESTARTS AGE
argus-n2-310 kplane-usermgr-5c774f85cc-gbgcd 1/2 ImagePullBackOff 0 7h47m
cert-manager cert-manager-cainjector-6886449cf8-szgts 0/1 CrashLoopBackOff 69 5h32m
du-argus-ab-368 clarity-6f5fb449b8-7x29c 4/5 ImagePullBackOff 0 7m
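The grep chain above matches text anywhere on the line, so a pod whose name happens to contain "running" would also be hidden. As a hedged alternative, the sketch below filters on the STATUS column only. The function name distressed_pods is hypothetical, and the sketch assumes -A output, where READY is the third field and STATUS the fourth. It also reports pods that are Running but not fully ready, which is an extension beyond the original command.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `kubectl get pods -A` output on stdin.
# Prints the header plus every pod that is either
#   - in a STATUS other than Running, Completed, or Terminating, or
#   - Running but with fewer ready containers than expected (e.g. 0/1).
# Assumption: -A output, so READY is field 3 and STATUS is field 4.
distressed_pods() {
  awk 'NR == 1 { print; next }
       {
         split($3, ready, "/")
         healthy = ($4 == "Completed" || $4 == "Terminating" ||
                    ($4 == "Running" && ready[1] == ready[2]))
         if (!healthy) print
       }'
}
```

A possible invocation is kubectl --kubeconfig ./k get pods -A | distressed_pods.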
In the example above, the kplane-usermgr pod in argus-n2-310 and the clarity pod in du-argus-ab-368 are failing to pull images (ImagePullBackOff), and the cert-manager-cainjector pod is crash-looping (CrashLoopBackOff). Further debugging is required, and it can be done in the following ways:
Events
Events are one of the most underappreciated tools available to Kubernetes operators. The kubectl get events command gives users a unique view of what is happening within a Kubernetes cluster, or even inside a given pod, as the following examples illustrate.
kubectl --kubeconfig ./k get events -n cert-manager
LAST SEEN TYPE REASON OBJECT MESSAGE
3m33s Warning BackOff pod/cert-manager-cainjector-6886449cf8-szgts Back-off restarting failed container
104s Warning Unhealthy pod/cert-manager-webhook-c677f4f7-c8brx Readiness probe failed: HTTP probe failed with statuscode: 500
In the example above, it is clear that pod/cert-manager-webhook-xxx is failing its readiness probe, which should be debugged further. You can also run get events across the whole cluster using the -A flag, piping the output to the more command.
kubectl --kubeconfig ./k get events -A | more
NAMESPACE LAST SEEN TYPE REASON OBJECT MESSAGE
argus-mithil-310 19m Normal BackOff pod/kplane-usermgr-5c774f85cc-gbgcd Back-off pulling image "514845858982.dkr.ecr.us-west-1.amazonaws.com/kplane-usermgr:5.4.0-1335"
argus-mithil-310 4m36s Warning Failed pod/kplane-usermgr-5c774f85cc-gbgcd Error: ImagePullBackOff
cert-manager 6m3s Warning BackOff pod/cert-manager-cainjector-6886449cf8-szgts Back-off restarting failed container
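On a busy cluster, the cluster-wide event list is dominated by Normal events. The sketch below keeps only the Warning rows. The function name warning_events is hypothetical, and the field position assumes -A output as shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `kubectl get events -A` output on stdin and
# keeps the header plus the rows whose TYPE is Warning.
# Assumption: -A output, where TYPE is the 3rd field of each data row.
# (The header has one extra word because "LAST SEEN" contains a space,
# so the header position of TYPE does not match the data rows.)
warning_events() {
  awk 'NR == 1 || $3 == "Warning"'
}
```

A possible invocation is kubectl --kubeconfig ./k get events -A | warning_events | more. kubectl should also be able to do this filtering on the server side with --field-selector type=Warning, and to order events with --sort-by=.metadata.creationTimestamp; verify both against your kubectl version.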
Pod Logs
We recommend using a log aggregation system for your pods. If that is not feasible, you can review the Pod logs directly, which often provide valuable insights.
kubectl --kubeconfig ./k -n cert-manager logs cert-manager-webhook-c677f4f7-c8brx | more
E1019 05:10:12.492522 1 dynamic_source.go:88] cert-manager/webhook "msg"="Failed to generate initial serving certificate, retrying..." "error"="failed verifying CA keypair: tls: failed to find any PEM data in certificate input" "interval"=1000000000
E1019 05:10:13.493607 1 dynamic_source.go:88] cert-manager/webhook "msg"="Failed to generate initial serving certificate, retrying..." "error"="failed verifying CA keypair: tls: failed to find any PEM data in certificate input" "interval"=1000000000
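A retrying component, as in the output above, repeats the same error line many times. The sketch below collapses the repeats into a count per distinct message. The function name klog_error_summary is hypothetical, and the pattern assumes the klog-style prefix seen above: a severity letter followed by the date, a timestamp, and a thread id.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads pod logs on stdin and prints each distinct
# error or fatal message once, prefixed by how often it occurred.
# Assumption: klog-style lines such as
#   E1019 05:10:12.492522 1 dynamic_source.go:88] message...
klog_error_summary() {
  grep -E '^[EF][0-9]{4} ' \
    | sed -E 's/^[EF][0-9]{4} [0-9:.]+ +[0-9]+ //' \
    | sort | uniq -c | sort -rn
}
```

A possible invocation is kubectl --kubeconfig ./k -n cert-manager logs cert-manager-webhook-c677f4f7-c8brx | klog_error_summary. For a container that keeps restarting, the --previous flag of kubectl logs shows the log of the prior container instance, which is often where the crash reason appears.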
Miscellaneous
Another check we find very useful is listing all nodes and the pods associated with them.
kubectl --kubeconfig ./k get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
ip-10-0-1-116.us-west-2.compute.internal Ready worker 2d12h v1.20.5 10.0.1.116 34.222.12.206 Ubuntu 18.04.3 LTS 4.15.0-1054-aws docker://19.3.11
ip-10-0-1-120.us-west-2.compute.internal Ready worker 17h v1.20.5 10.0.1.120 35.86.121.102 Ubuntu 18.04.3 LTS 4.15.0-1054-aws docker://19.3.11
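With many nodes, it can help to show only the ones that need attention. The sketch below keeps nodes whose STATUS is anything other than plain Ready, which includes NotReady and cordoned nodes (shown as Ready,SchedulingDisabled). The function name not_ready_nodes is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `kubectl get nodes` output (with or without
# -o wide) on stdin and keeps the header plus every node whose STATUS,
# the 2nd field, is not exactly "Ready".
not_ready_nodes() {
  awk 'NR == 1 || $2 != "Ready"'
}
```

A possible invocation is kubectl --kubeconfig ./k get nodes -o wide | not_ready_nodes. If only the header is printed, every node reported Ready.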
Or, find which node each pod is running on.
kubectl --kubeconfig ./k get pods -o wide -n cert-manager
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
cert-manager-84b96f99c7-v6pst 1/1 Running 0 6h 10.20.40.91 ip-10-0-2-94.us-west-2.compute.internal <none> <none>
cert-manager-cainjector-6886449cf8-szgts 0/1 CrashLoopBackOff 73 5h50m 10.20.90.34 ip-10-0-1-120.us-west-2.compute.internal <none> <none>
cert-manager-webhook-c677f4f7-c8brx 0/1 Running 0 7h25m 10.20.90.239 ip-10-0-1-120.us-west-2.compute.internal <none> <none>
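In the output above, both unhealthy pods sit on the same node, which can hint at a node-level problem. The sketch below counts pods per node to make that kind of clustering easier to spot. The function name pods_per_node is hypothetical. It assumes namespaced -o wide output as shown above, where NODE is the seventh field and RESTARTS is a single number; newer kubectl versions may print extra text in RESTARTS, which would shift the columns.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `kubectl get pods -o wide -n <namespace>`
# output on stdin and prints "<pod count> <node>" lines, busiest node first.
# Assumption: NODE is the 7th whitespace-separated field of each data row.
pods_per_node() {
  awk 'NR > 1 { count[$7]++ }
       END { for (node in count) print count[node], node }' \
    | sort -rn
}
```

A possible invocation is kubectl --kubeconfig ./k get pods -o wide -n cert-manager | pods_per_node. To list the pods on one specific node, kubectl also accepts --field-selector spec.nodeName=<node-name> on get pods.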