Troubleshooting a Cluster

How to Troubleshoot a Cluster

In the normal course of cluster administration, difficulties will occur. To accurately identify these issues, we must understand the root cause and its effect on the system. Typically, the monitoring system tied to the cluster will produce error messages in one of the logs associated with the affected object. Most administrators have some form of automation set up that notifies them of the issue. They can then review the logs to determine where the failure occurred and act quickly to bring the affected system(s) back online. Typical problems fall into two major areas:

Cluster level issues: ensuring nodes are registered, ruling out API, scheduler, or etcd issues
Application level issues: verifying pod states and debugging services
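
As a rough illustration of the two areas above, the following sketch groups a few standard kubectl commands into one function per area. The function names are hypothetical, and the assumption that control plane components are visible as pods in the kube-system namespace may not hold for every cluster.

```shell
# Cluster-level checks: node registration and control plane health.
cluster_checks() {
  # Are all nodes registered and Ready?
  kubectl get nodes -o wide
  # /readyz is served by the API server; "verbose" lists each individual check.
  kubectl get --raw '/readyz?verbose'
  # Assumption: scheduler, controller manager, and etcd show up as pods here.
  kubectl get pods -n kube-system -o wide
}

# Application-level checks for one namespace (defaults to "default").
app_checks() {
  ns="${1:-default}"
  # Pod states (Pending, CrashLoopBackOff, and so on).
  kubectl get pods -n "$ns" -o wide
  # Services and the endpoints backing them.
  kubectl get services,endpoints -n "$ns"
  # Recent events, oldest first.
  kubectl get events -n "$ns" --sort-by=.lastTimestamp
}
```

Calling `cluster_checks` first and then `app_checks <namespace>` mirrors the order of the list: rule out the cluster itself before debugging the workload.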

Issues can involve individual containers, multiple pods, a component of the control plane, or any combination of these objects. The troubleshooting process in a Kubernetes cluster is generally a three-step procedure:

Identification: finding the issue(s) and where they occur (master, nodes, pods, services, replication controllers, and networking)
Correction: fixing the root cause and any cascading failures that followed from it
Prevention: taking preemptive steps to keep issues from (re)occurring

Troubleshooting also includes ongoing maintenance and mitigation of faults. This means taking measures in advance to prevent problems before they arise in any Kubernetes components or objects. Platform9 also provides services like Advanced Remote Support to aid clients in resolving problems with their clusters.

Identification Tools

A wide variety of software tools, scripts, and internal commands can be used to identify and resolve most issues. The affected area determines which tools to use. Since most clusters are Linux-based, the following commands are just a few that can provide a wealth of information about the issue.

kubectl: Running the kubectl command on the master yields a significant amount of information about the state of the cluster, its elements, and available services. Kubectl allows us to list information and perform other tasks on the pods and other objects.
journalctl: The journalctl command checks Kubernetes systemd services. It runs on both the master and nodes to examine Kubernetes system failures. All daemon logging in Kubernetes uses the systemd journal.
systemctl: Many systemd services are configured with Kubernetes to assist with communications between the master and nodes. Those services must also be active and enabled.
du_ctl: A tool used to gather support information about the host from the Management Server.
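
A minimal sketch of how systemctl and journalctl can be combined: the hypothetical helper below reports whether a unit is active and enabled, and prints the tail of its journal only when the unit is not active. The unit name is passed in as an argument because the exact service names on a given host are not listed here.

```shell
# Report whether a systemd unit is active and enabled; if it is not active,
# show its most recent journal entries.
# Usage: check_unit <unit-name> [journal-lines]
check_unit() {
  unit="$1"
  lines="${2:-50}"
  # Both subcommands exit non-zero for inactive/disabled units, hence "|| true".
  active="$(systemctl is-active "$unit" 2>/dev/null || true)"
  enabled="$(systemctl is-enabled "$unit" 2>/dev/null || true)"
  echo "$unit: active=${active:-unknown} enabled=${enabled:-unknown}"
  if [ "$active" != "active" ]; then
    journalctl -u "$unit" -n "$lines" --no-pager
  fi
}
```

For example, `check_unit pf9-hostagent 100` would inspect the host agent, assuming that is the unit name on the host; `systemctl list-units 'pf9*'` can confirm which units actually exist.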

Components

hostagent: Used to configure and manage Platform9's host-side software. Also used to support other host-side software.
sidekick: Acts as a backup mechanism for the hostagent. It is also used to support host-side software.
pf9-kube: Works in coordination with nodelet for starting, stopping, and status checks of the kube stack.
etcdctl or curl: The etcd service can run on the master or on other systems. The etcdctl command-line tool is available on certain systems for retrieving in-depth system information. The curl command can also be used to query the etcd service.
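
The etcd queries mentioned above might look like the following sketch. The default endpoint is an assumption, as is the function name; TLS credentials, if the etcd service requires them, are taken from the ETCDCTL_CACERT, ETCDCTL_CERT, and ETCDCTL_KEY environment variables that etcdctl itself reads, and are passed on to curl only when set.

```shell
# Query etcd health twice: once with etcdctl (v3 API), once over HTTP with curl.
# Usage: etcd_health [endpoint]
etcd_health() {
  # Assumption: etcd listens locally on the standard client port.
  endpoint="${1:-http://127.0.0.1:2379}"
  # etcdctl picks up ETCDCTL_CACERT / ETCDCTL_CERT / ETCDCTL_KEY on its own.
  ETCDCTL_API=3 etcdctl --endpoints="$endpoint" endpoint health \
    || echo "etcdctl health check failed" >&2
  # etcd exposes a /health endpoint on its client URL.
  curl --silent --show-error \
    ${ETCDCTL_CACERT:+--cacert "$ETCDCTL_CACERT"} \
    ${ETCDCTL_CERT:+--cert "$ETCDCTL_CERT"} \
    ${ETCDCTL_KEY:+--key "$ETCDCTL_KEY"} \
    "$endpoint/health"
}
```

Running both checks is deliberate: etcdctl may not be installed on every system, while curl usually is.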

Node Config

Node installs typically fail if you do not have the correct configuration or dependencies. There are a few things you need to take care of:

OS Version support: Please refer to the supported version matrix for a list of supported operating systems for your version.
Swap: Turn off swap on the node.
Networking: Set the required kernel parameters, such as enabling IP forwarding. For IPv6-specific configuration needs, see IPv6 Support.
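
Two of the prerequisites above can be verified before an install with a small preflight check. This is only a sketch: the function name is hypothetical, it covers IPv4 forwarding only, and the file paths are parameters so the check can be tried against sample files.

```shell
# Preflight check for two node prerequisites: swap off, IPv4 forwarding on.
# Usage: node_preflight [swaps-file] [ip-forward-file]
node_preflight() {
  swaps_file="${1:-/proc/swaps}"
  forward_file="${2:-/proc/sys/net/ipv4/ip_forward}"
  status=0
  # /proc/swaps has one header line; any further line is an active swap device.
  if [ "$(tail -n +2 "$swaps_file" | grep -c .)" -gt 0 ]; then
    echo "FAIL: swap is enabled (disable with: swapoff -a)"
    status=1
  else
    echo "OK: swap is off"
  fi
  if [ "$(cat "$forward_file")" = "1" ]; then
    echo "OK: IPv4 forwarding is enabled"
  else
    echo "FAIL: IPv4 forwarding is disabled (enable with: sysctl -w net.ipv4.ip_forward=1)"
    status=1
  fi
  return "$status"
}
```

Note that `swapoff -a` and `sysctl -w` only last until the next reboot; persistent settings belong in /etc/fstab and /etc/sysctl.d respectively.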

See the Troubleshooting Node Issues article for more details on node installation status issues.

Nodelet

See the Nodelet guide for how to retrieve Nodelet details and for a description of its phases. These phases will help you debug cluster creation.

Log File Location

Each PMK node stores log files for the various PMK components at /var/log/pf9.

The /var/log/pf9/kube/kube.log file stores information about installation of the Kubernetes role on this node and the output of periodic status checks performed on the node. Consult this file on the node for more information if you are running into issues with attaching the node to the cluster or if the node is reported as ‘Unhealthy’ in the PMK UI.
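
When consulting kube.log, a quick first pass is to pull out only the lines that mention a problem. The helper below is a sketch under the assumption that relevant lines contain "error" or "fail"; the actual log format is not specified here, so the pattern may need adjusting.

```shell
# Print the most recent lines in a log file that mention an error or failure.
# Usage: recent_errors [logfile] [max-lines]
recent_errors() {
  logfile="${1:-/var/log/pf9/kube/kube.log}"
  max="${2:-20}"
  # Case-insensitive match; keep only the last $max matching lines.
  grep -i -E 'error|fail' "$logfile" | tail -n "$max"
}
```

To watch the periodic status checks as they happen, `tail -f /var/log/pf9/kube/kube.log` on the node works as well.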

Config File Location

Platform9 components on the host have config files located in the /etc/pf9 directory.

The directory contains various configuration files as well as certificates used for various communication purposes. While you should treat them as internal to Platform9, there is one file in particular which you will find useful. The file /etc/pf9/kubeconfigs/admin.yaml can be used to access the Kubernetes cluster when the Platform9 Management Plane is inaccessible.
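
Using that kubeconfig is a matter of passing it to kubectl with the standard --kubeconfig flag. The wrapper below is a small sketch; its name and the PF9_KUBECONFIG override variable are hypothetical, and reading the file will likely require elevated privileges on the node.

```shell
# Run kubectl against the cluster using the node-local admin kubeconfig.
# Usage: local_kubectl get nodes
local_kubectl() {
  # PF9_KUBECONFIG is a hypothetical override for a non-default location.
  kubectl --kubeconfig "${PF9_KUBECONFIG:-/etc/pf9/kubeconfigs/admin.yaml}" "$@"
}
```

For instance, `local_kubectl get nodes` lists the nodes without going through the Management Plane.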

Miscellaneous

Replicating an issue in a production environment can be difficult at best. With this in mind, one of the next steps is to try to reproduce the issue locally so that the working cluster is not affected directly.

Additionally, because Kubernetes is resilient in nature, automatic restarts can sometimes hinder the troubleshooting process, making it difficult to locate the root cause of an issue. This is where having generous logging comes into play. Using a system like Prometheus or Grafana to record and retain any metrics related to the problem could mean the difference between success and failure.
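
When a container has already been restarted, its earlier output is not lost immediately: kubectl can still show the logs of the previous container instance. The sketch below, with a hypothetical function name, gathers that output together with the pod's description.

```shell
# Collect evidence from a pod whose containers keep restarting.
# Usage: restart_evidence <pod> [namespace]
restart_evidence() {
  pod="$1"
  ns="${2:-default}"
  # Restart counts, last termination state, and recent events for the pod.
  kubectl describe pod "$pod" -n "$ns"
  # Logs from the previous (terminated) container instance, if one exists.
  kubectl logs "$pod" -n "$ns" --previous \
    || echo "no previous container logs available for $pod" >&2
}
```

For pods with more than one container, adding `-c <container>` to the `kubectl logs` call selects which container's previous logs to read.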

Troubleshooting Matrix

The image below outlines the general steps to take when troubleshooting a Kubernetes cluster.

Remember, Platform9 has multiple avenues of recourse when trying to resolve an issue. You can search our documentation, ask a question in our forums, inquire in multiple Slack channels, review our blog posts or open a support ticket with us.
