Troubleshooting - Useful Kubernetes Commands
While the full use of kubectl and its subcommands is too broad for this guide, several commands are particularly useful and are recommended for debugging.
Distressed Pods
"Distressed pods" is the Platform9 term for the pods to look for when debugging issues in a large cluster. In our experience, the primary command is one that lists the pods and shows whether any of them are not in a "Running" state. This is the simplest and most effective command for getting an overview of the issues affecting the pods. Its output may be delayed depending on the number of pods running.
    kubectl --kubeconfig ./k get pods -A | grep -v -i running | grep -v -i completed | grep -v -i terminating
    NAMESPACE         NAME                                       READY   STATUS             RESTARTS   AGE
    argus-n2-310      kplane-usermgr-5c774f85cc-gbgcd            1/2     ImagePullBackOff   0          7h47m
    cert-manager      cert-manager-cainjector-6886449cf8-szgts   0/1     CrashLoopBackOff   69         5h32m
    du-argus-ab-368   clarity-6f5fb449b8-7x29c                   4/5     ImagePullBackOff   0          7m
In the example above, the pod in argus-n2-310 has an image pull problem and a cert-manager pod is in CrashLoopBackOff. Further debugging is required, using the approaches described in the sections below.
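One caveat about the grep filter above: it matches text anywhere on the line, so a distressed pod whose name or namespace happens to contain "running" would be hidden as well. A minimal sketch of a stricter filter follows. It assumes the column layout of `kubectl get pods -A` shown above, where STATUS is the fourth column; the function name `distressed_pods` is hypothetical.

```shell
# Hypothetical helper: reads `kubectl get pods -A` output on stdin and keeps
# the header plus every row whose STATUS column (assumed to be field 4) is
# not Running, Completed, or Terminating.
distressed_pods() {
  awk 'NR == 1 { print; next }
       {
         status = tolower($4)
         if (status != "running" && status != "completed" && status != "terminating")
           print
       }'
}

# Usage (requires cluster access, so it is left as a comment here):
#   kubectl --kubeconfig ./k get pods -A | distressed_pods
```

Because only the STATUS field is compared, a pod named running-x that is in ImagePullBackOff would still be reported.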
Events
This is one of the most underappreciated tools available to Kubernetes operators. The kubectl get events command gives users a unique glimpse into what is going on within a Kubernetes cluster, or even inside a given pod. The following examples illustrate this.
    kubectl --kubeconfig ./k get events -n cert-manager
    LAST SEEN   TYPE      REASON      OBJECT                                         MESSAGE
    3m33s       Warning   BackOff     pod/cert-manager-cainjector-6886449cf8-szgts   Back-off restarting failed container
    104s        Warning   Unhealthy   pod/cert-manager-webhook-c677f4f7-c8brx        Readiness probe failed: HTTP probe failed with statuscode: 500
In the example above, it is clear that pod/cert-manager-webhook-xxx is failing its readiness probe, which should be debugged further. You can also run get events across the whole cluster using the -A flag, with the output piped to the more command.
    kubectl --kubeconfig ./k get events -A | more
    NAMESPACE          LAST SEEN   TYPE      REASON    OBJECT                                         MESSAGE
    argus-mithil-310   19m         Normal    BackOff   pod/kplane-usermgr-5c774f85cc-gbgcd            Back-off pulling image "514845858982.dkr.ecr.us-west-1.amazonaws.com/kplane-usermgr:5.4.0-1335"
    argus-mithil-310   4m36s       Warning   Failed    pod/kplane-usermgr-5c774f85cc-gbgcd            Error: ImagePullBackOff
    cert-manager       6m3s        Warning   BackOff   pod/cert-manager-cainjector-6886449cf8-szgts   Back-off restarting failed container
Pod Logs
We also recommend using a log aggregation system for your pods. If that is not workable, you can review the Pod logs directly, which will provide valuable insights.
    kubectl --kubeconfig ./k -n cert-manager logs cert-manager-webhook-c677f4f7-c8brx | more
    E1019 05:10:12.492522 1 dynamic_source.go:88] cert-manager/webhook "msg"="Failed to generate initial serving certificate, retrying..." "error"="failed verifying CA keypair: tls: failed to find any PEM data in certificate input" "interval"=1000000000
    E1019 05:10:13.493607 1 dynamic_source.go:88] cert-manager/webhook "msg"="Failed to generate initial serving certificate, retrying..." "error"="failed verifying CA keypair: tls: failed to find any PEM data in certificate input" "interval"=1000000000
Miscellaneous
Another check we find very useful is to list all nodes and their associated pods.
    kubectl --kubeconfig ./k get nodes -o wide
    NAME                                       STATUS   ROLES    AGE     VERSION   INTERNAL-IP   EXTERNAL-IP     OS-IMAGE             KERNEL-VERSION    CONTAINER-RUNTIME
    ip-10-0-1-116.us-west-2.compute.internal   Ready    worker   2d12h   v1.20.5   10.0.1.116    34.222.12.206   Ubuntu 18.04.3 LTS   4.15.0-1054-aws   docker://19.3.11
    ip-10-0-1-120.us-west-2.compute.internal   Ready    worker   17h     v1.20.5   10.0.1.120    35.86.121.102   Ubuntu 18.04.3 LTS   4.15.0-1054-aws   docker://19.3.11
Or, find which pod is running on which node.
    kubectl --kubeconfig ./k get pods -o wide -n cert-manager
    NAME                                       READY   STATUS             RESTARTS   AGE     IP             NODE                                       NOMINATED NODE   READINESS GATES
    cert-manager-84b96f99c7-v6pst              1/1     Running            0          6h      10.20.40.91    ip-10-0-2-94.us-west-2.compute.internal    <none>           <none>
    cert-manager-cainjector-6886449cf8-szgts   0/1     CrashLoopBackOff   73         5h50m   10.20.90.34    ip-10-0-1-120.us-west-2.compute.internal   <none>           <none>
    cert-manager-webhook-c677f4f7-c8brx        0/1     Running            0          7h25m   10.20.90.239   ip-10-0-1-120.us-west-2.compute.internal   <none>           <none>
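When a cluster has many pods, a per-node count can be easier to scan than the wide listing above. The following is a minimal sketch, not part of the original guide: it assumes node names arrive one per line on stdin, for example from kubectl's custom-columns output, and the function name `pods_per_node` is hypothetical.

```shell
# Hypothetical helper: reads one node name per line on stdin and prints
# "<pod count> <node name>", with the busiest node first.
pods_per_node() {
  awk 'NF { count[$1]++ }
       END { for (node in count) print count[node], node }' |
    sort -k1,1nr -k2,2
}

# Usage (requires cluster access, so it is left as a comment here):
#   kubectl --kubeconfig ./k get pods -A --no-headers \
#     -o custom-columns=NODE:.spec.nodeName | pods_per_node
```

Pods that have not been scheduled yet have no node name, so they should appear grouped under a placeholder value rather than under a real node.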