Determining the Root Cause(s) of Pod Termination Issues
Problem
- Pods on a specific node are observed terminating and being rescheduled onto other nodes.
- How can the cause of these terminations or evictions be determined?
Environment
- Platform9 Edge Cloud - v5.3.0 and Higher.
- Platform9 Managed Kubernetes - v5.6 and Higher.
- Platform9 Self Managed Cloud - v5.9 and Higher.
Diagnostic Steps
- Listed below are the most frequently observed causes of pod termination, each with an explanation and a sample log trace. Examine the kubelet logs on the affected node for any of the errors below, then take action accordingly.
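As a starting point, the kubelet logs can usually be followed with journalctl. This is a minimal sketch that assumes the kubelet runs as a systemd unit named kubelet; on Platform9 nodes the unit name may differ, and logs are also written under /var/log/pf9:
# journalctl -u kubelet --since "1 hour ago" | grep -i <pod_name>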
- SyncLoop DELETE: Indicates the kubelet received a request to terminate the pod from the API server.
Jan 17 11:05:12 node-01 kubelet[12345]: I0117 11:05:12.123456 12345 kubelet.go:1906] "SyncLoop DELETE" source="api" pod="default/example-pod"
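To see what acted on the pod around that time, list the pod's events while they are still retained (by default, events expire after roughly an hour). The pod name and namespace below are taken from the sample trace above:
# kubectl get events -n default --field-selector involvedObject.name=example-pod --sort-by=.lastTimestamp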
- Killing Pod/Container: The kubelet starts terminating the pod and sends signals to stop running containers.
Jan 17 11:05:13 node-01 kubelet[12345]: I0117 11:05:13.234567 12345 kuberuntime_manager.go:868] "Killing pod" podName="example-pod" podNamespace="default" podUID="e7b6d3f9-d0c3-4c1b-94ff-823e82a93157"
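If containers appear stuck in this phase, their state can be inspected directly on the node with crictl, assuming a CRI runtime such as containerd and that crictl is installed (the pod name is from the sample trace above):
# crictl ps -a | grep example-pod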
- Cleaning Up Volumes: The kubelet unmounts and removes volumes associated with the pod.
Jan 17 11:05:18 node-01 kubelet[12345]: I0117 11:05:18.789012 12345 kubelet_volumes.go:165] "Cleaning up pod volumes" podUID="e7b6d3f9-d0c3-4c1b-94ff-823e82a93157"
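Volume cleanup problems typically leave the pod's directory behind on the node. It can be checked using the podUID from the trace above, assuming the default kubelet root directory /var/lib/kubelet:
# ls /var/lib/kubelet/pods/e7b6d3f9-d0c3-4c1b-94ff-823e82a93157/volumes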
- Evicted Pod: If the termination is due to resource pressure or eviction, logs indicate the reason.
Jan 17 11:05:19 node-01 kubelet[12345]: I0117 11:05:19.890123 12345 eviction_manager.go:211] "Pod has been evicted" podName="example-pod" podNamespace="default"
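Evictions are normally driven by node pressure. The node's conditions (MemoryPressure, DiskPressure, PIDPressure) and any recent Evicted events can be reviewed as follows (node-01 is the sample node name from the traces above):
# kubectl describe node node-01
# kubectl get events --all-namespaces --field-selector reason=Evicted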
- Teardown Network: The CNI plugin tears down the pod's network configuration.
Jan 17 11:05:17 node-01 kubelet[12345]: I0117 11:05:17.678901 12345 cni.go:333] "Teardown network for pod" podName="example-pod" podNamespace="default"
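This message is expected during normal teardown. If pods instead hang in Terminating, one quick check is whether the CNI configuration is still present on the node; the standard CNI configuration path is assumed here, and the exact contents vary by CNI plugin:
# ls /etc/cni/net.d/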
- Readiness/Liveness Probe Failures: Containers are restarted (liveness) or taken out of service (readiness) when their probes fail repeatedly:
I1202 10:51:45.445830 12357 prober.go:117] Liveness probe for "application--d2r96_kube-system(36f0e9b8-6d67-4875-8abd-e4e7175f45a0):app-node" failed (failure):
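Probe failures also surface as Unhealthy events on the pod, and the probe definition itself appears in the pod description. The pod name and namespace below are taken from the sample trace above:
# kubectl describe pod application--d2r96 -n kube-system
# kubectl get events -n kube-system --field-selector reason=Unhealthy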
Additional Information
- It is recommended that before rebooting the node (if a reboot is attempted at all) to resolve the issue, all the necessary logs are captured, specifically a tarball of the /var/log/pf9 directory on the affected node.
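A minimal example, assuming /tmp/pf9-logs.tar.gz as the archive path (the file name is arbitrary):
# tar -czvf /tmp/pf9-logs.tar.gz /var/log/pf9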
- Share the /tmp/cluster-dump.tar.gz file generated using the below commands:
# kubectl cluster-info dump --namespaces <affected_namespace> -o yaml --output-directory=/tmp/cluster/cluster-dump
# tar -czvf /tmp/cluster-dump.tar.gz /tmp/cluster/cluster-dump