Determining Root Cause/s for Pod Termination Issues

Problem

  • It is observed that the pods on a specific node start terminating and are scheduled on other nodes.
  • How to determine the causes of these evictions?

Environment

  • Platform9 Edge Cloud - v5.3.0 and Higher.
  • Platform9 Managed Kubernetes - v5.6 and Higher.
  • Self Managed Cloud Platform9- v5.9 and Higher.

Diagnostic Steps

  • Listed are most frequently observed issues/causes with explanation and sample logtraces for pods termination issues. Observe and identify the kubelet logs on the affected node if any below errors are seen and then take actions accordingly.
  • SyncLoop DELETE: Indicates the kubelet received a request to terminate the pod from the API server.
Kubelet log
Copy
  • Killing Pod/Container: The kubelet starts terminating the pod and sends signals to stop running containers.
Kubelet log
Copy
  • Cleaning Up Volumes: The kubelet unmounts and removes volumes associated with the pod.
Kubelet log
Copy
  • Evicted Pod: If the termination is due to resource pressure or eviction, logs indicate the reason.
Kubelet log
Copy
  • Teardown Network: The CNI plugin tears down the pod's network configuration.
Kubelet log
Copy
  • Readiness/Liveness Probe Failures: Pods failing due to Liveness or Readiness probe failures:
Kubelet log
Copy

Additional Information

  • It is recommended that before rebooting the node (if done at all) to resolve the issue, all the necessary logs are captured. Specifically a tarball of the directory /var/log/pf9 .
  • Share the /tmp/cluster-dump.tar.gz file generated using below commands:
Master node
Copy
Type to search, ESC to discard
Type to search, ESC to discard
Type to search, ESC to discard