Catapult has 56 built-in rules. Each rule falls into one of the categories below:
The tables below list each rule name, where its data is collected, the component it monitors, and the suggested action to take when the alert is received.
Rule Name | Data Collection Point | Monitored Component | Action Required |
---|
EtcdBackupJobFailed | Cluster | etcd | Review etcd backup task. Possibly lack of disk space. |
EtcdDown | Cluster | etcd | Verify if nodelet can repair etcd. Check for CPU, memory, or storage issues on the node. Review logs of etcd container. |
EtcdInsufficientMembers | Cluster | etcd | Etcd quorum lacks an odd number of masters. |
EtcdNoLeader | Cluster | etcd | Etcd quorum has no leader. Investigate network issues causing etcd members not reaching consensus. |
EtcdHighNumberOfLeaderChanges | Cluster | etcd | If leader of etcd quorum is changing frequently, verify master nodes are not constantly failing. Possible network issues between master nodes. |
EtcdHighNumberOfFailedGrpcRequests | Cluster | etcd | Investigate network issues causing failed gRPC requests in the etcd quorum. Review etcd container logs. |
EtcdHighNumberOfFailedProposals | Cluster | etcd | Inspect network issues causing failed proposals in the etcd quorum. Check etcd container logs. |
EtcdHighFsyncDurations | Cluster | etcd | Review disk or storage issues which may cause problems when an etcd member tries to commit data to disk. |
EtcdHighCommitDurations | Cluster | etcd | Review disk or storage issues which may cause problems when an etcd member tries to commit data to disk. |
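
The EtcdInsufficientMembers and EtcdNoLeader rows both come down to whether a majority of etcd members can still reach each other. As a rough illustration of why an odd number of masters is preferred, the sketch below computes the majority size and the number of member failures a cluster can absorb. The function names are hypothetical and are not part of Catapult or etcd.

```python
def quorum_size(members: int) -> int:
    """Smallest majority of an etcd cluster with `members` voting members."""
    if members < 1:
        raise ValueError("an etcd cluster needs at least one member")
    return members // 2 + 1


def failure_tolerance(members: int) -> int:
    """Number of members that can fail while a majority is still available."""
    return members - quorum_size(members)


if __name__ == "__main__":
    # Print members, quorum size, and tolerated failures for small clusters.
    for n in range(1, 6):
        print(n, quorum_size(n), failure_tolerance(n))
```

Going from three to four members raises the quorum size without raising the failure tolerance, which is why even-sized clusters add cost but no resilience.
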
Rule Name | Data Collection Point | Monitored Component | Action Required |
---|
KubeAPIServerDown | Cluster | Kube API server | Verify API Server is responding on the node. |
KubernetesApiServerErrors | Cluster | Kube API server | Check for API Server overload. Review logs. |
KubernetesApiClientErrors | Cluster | Kube API server | Check for API Server overload. Survey logs to audit client requests. |
KubeSchedulerDown | Cluster | Kube Scheduler | Audit logs for k8s master pod restarts. |
KubeControllerManagerDown | Cluster | kube controller | Audit logs for k8s master pod restarts. |
KubeProxyDown | Cluster | kube proxy | Verify kube proxy container is running on the node. |
KubeProxyRuleSyncLatency | Cluster | kube proxy | Explore kube proxy overload via logs. |
Rule Name | Data Collection Point | Monitored Component | Action Required |
---|
KubeNodeNotReady | Cluster | kube state metrics | Review issues with kubelet on the node. |
KubernetesMemoryPressure | Cluster | kube state metrics | Check available memory on the node. |
KubernetesDiskPressure | Cluster | kube state metrics | Check available space on the node. |
KubernetesJobFailed | Cluster | kube state metrics | Check logs for failed jobs. |
KubernetesContainerTerminated | Cluster | kube state metrics | K8s container was terminated, possibly by the OOM killer. |
KubePodCrashLooping | Cluster | kube state metrics | Review logs on the crashed pod. |
KubePodNotReady | Cluster | kube state metrics | Pod may be in pending state, or awaiting a resource. |
KubeDeploymentReplicasMismatch | Cluster | kube state metrics | K8s deployment has not reconciled the expected number of replicas. Possibly related to CPU, memory requests, or node taints. |
KubeStatefulSetReplicasMismatch | Cluster | kube state metrics | K8s StatefulSet has not reconciled the expected number of replicas. Possibly related to CPU, memory requests, or node taints. |
KubeDaemonSetRolloutStuck | Cluster | kube state metrics | K8s DaemonSet has not reconciled the expected number of pods. Possibly related to CPU, memory requests, or node taints. |
KubernetesPersistentvolumeclaimPending | Cluster | kube state metrics | PVC is in a pending state as K8s was unable to create a PV. Verify details of storage class, CSI driver etc. |
KubernetesPersistentvolumeError | Cluster | kube state metrics | PV is in an error state. Check storage class or CSI drivers, and logs. |
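
The replica-mismatch rows describe a workload whose ready replicas do not match the desired count. The sketch below shows one way to make the same comparison by hand on a workload object that has already been loaded as a dictionary (for example, from `kubectl get deployment <name> -o json`). It is only an illustration: the expression Catapult actually evaluates is not given here, and `replicas_mismatch` is a hypothetical helper.

```python
def replicas_mismatch(workload: dict) -> bool:
    """Return True when ready replicas differ from the desired replica count.

    `workload` is a Deployment or StatefulSet object as a plain dict.
    """
    # spec.replicas defaults to 1 when it is omitted from the object.
    desired = workload.get("spec", {}).get("replicas", 1)
    # status.readyReplicas is omitted from the object when it is zero.
    ready = workload.get("status", {}).get("readyReplicas", 0)
    return ready != desired


if __name__ == "__main__":
    example = {"spec": {"replicas": 3}, "status": {"readyReplicas": 2}}
    print(replicas_mismatch(example))
```

A mismatch that persists is the cue to inspect pod events for unschedulable pods, which is where CPU or memory requests and node taints usually show up.
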
Rule Name | Data Collection Point | Monitored Component | Action Required |
---|
HostHighCpuLoad | Node | Node exporter | High CPU usage on node. Check processes or responsible container. |
HostOutOfMemory | Node | Node exporter | High memory usage on node. Check processes or responsible container. |
HostMemoryUnderMemoryPressure | Node | Node exporter | High memory usage on node. Check processes or responsible container. |
NodeFilesystemAlmostOutOfSpace | Node | Node exporter | Node is almost out of disk space. |
NodeFilesystemAlmostOutOfFiles | Node | Node exporter | Node is almost out of inodes. |
HostUnusualNetworkThroughputIn | Node | Node exporter | Host is experiencing unusual inbound network throughput. |
HostUnusualNetworkThroughputOut | Node | Node exporter | Host is experiencing unusual outbound network throughput. |
NodeNetworkReceiveErrs | Node | Node exporter | Host is experiencing network receive errors. |
NodeNetworkTransmitErrs | Node | Node exporter | Host is experiencing network transmit errors. |
HostUnusualDiskWriteRate | Node | Node exporter | Host is experiencing unusual disk write rate. |
HostUnusualDiskReadRate | Node | Node exporter | Host is experiencing unusual disk read rate. |
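
When NodeFilesystemAlmostOutOfSpace or NodeFilesystemAlmostOutOfFiles fires, a quick check on the node itself can confirm which resource is running low. The sketch below reads free space and free inodes with `os.statvfs` (available on Unix-like systems). The 10% threshold and the function names are assumptions for illustration, not the values Catapult uses.

```python
import os


def filesystem_headroom(path: str = "/") -> dict:
    """Return the free fraction of blocks and inodes for the filesystem holding `path`."""
    st = os.statvfs(path)
    # f_bavail: blocks available to unprivileged users; f_blocks: total blocks.
    space_free = st.f_bavail / st.f_blocks if st.f_blocks else 1.0
    # f_ffree: free inodes; f_files: total inodes (0 if the filesystem does not report them).
    inodes_free = st.f_ffree / st.f_files if st.f_files else 1.0
    return {"space_free": space_free, "inodes_free": inodes_free}


def below_threshold(headroom: dict, threshold: float = 0.10) -> list:
    """List the resources whose free fraction is under `threshold` (an assumed value)."""
    return [name for name, free in headroom.items() if free < threshold]


if __name__ == "__main__":
    headroom = filesystem_headroom("/")
    print(headroom, below_threshold(headroom))
```

Running it against each mount point that holds container or etcd data gives a quick view of whether disk space or inodes are the constraint.
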
Rule Name | Data Collection Point | Monitored Component | Action Required |
---|
ClusterStatusNotOK | PMK | SaaS Mgmt Plane | Cluster status is not ok in PMK Mgmt Plane DB. |
NodeNotReady | PMK | SaaS Mgmt Plane | K8s node has entered a not ready state as noted by a HostAgent extension. |
K8sApiNotResponding | PMK | SaaS Mgmt Plane | PMK Mgmt Plane cannot reach k8s apiserver. |
WorkerNodeNotResponding | PMK | SaaS Mgmt Plane | PMK Mgmt Plane cannot reach worker node. |
Rule Name | Data Collection Point | Monitored Component | Action Required |
---|
Host Availability | PMK | SaaS Mgmt Plane | Node heartbeat is failing. This could be caused by a node outage or a failed service. Review the node to ensure that all Platform9 services are running. |
Hosts disconnected | PMK | SaaS Mgmt Plane | Partial node availability. The heartbeat is passing; review the node to ensure that all Platform9 services are running. |
host-down | PMK | SaaS Mgmt Plane | The node is completely disconnected from Platform9. Ensure the node is running and all services are operating. |
Rule Name | Data Collection Point | Monitored Component | Action Required |
---|
AddonNotHealthy | PMK | Nodelet | ClusterAddon’s .status.health shows addon is not healthy. |
AddonNotConverging | PMK | Nodelet | ClusterAddon’s .status.phase is not in Installed state. |
AddonInstallError | PMK | Nodelet | ClusterAddon’s .status.phase is in an InstallError state. |
AddonUninstallError | PMK | Nodelet | ClusterAddon’s .status.phase is in an UninstallError state. |
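
The four addon rules are all derived from a ClusterAddon's `.status.phase` and `.status.health` fields. The sketch below restates the table as a small function, which can help when reading a ClusterAddon object by hand. It is an assumption-laden illustration: `addon_alerts` is a hypothetical helper, the encoding of `.status.health` is not given here (so the caller passes a plain boolean), and any evaluation window the real rules apply is ignored.

```python
def addon_alerts(phase: str, healthy: bool) -> list:
    """Return the addon rule names from the table that match the given status.

    `phase` is the value of .status.phase; `healthy` is the caller's reading
    of .status.health (its exact encoding is not assumed here).
    """
    alerts = []
    if not healthy:
        alerts.append("AddonNotHealthy")
    if phase != "Installed":
        alerts.append("AddonNotConverging")
    if phase == "InstallError":
        alerts.append("AddonInstallError")
    if phase == "UninstallError":
        alerts.append("AddonUninstallError")
    return alerts


if __name__ == "__main__":
    print(addon_alerts("InstallError", True))
```

Read literally, an addon in an InstallError phase also matches AddonNotConverging, so the two alerts may arrive together.
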
Rule Name | Data Collection Point | Monitored Component | Action Required |
---|
PromHTTPRequestErrors | Node | calico-felix | Prometheus unable to pull data from calico. Underlying network or calico-node pod issues. |
CalicoDatapaneFailuresHigh | Node | calico-felix | Calico-node pod is congested. Reduce load or restart the calico-node pod. |
CalicoIpsetErrorsHigh | Node | calico-felix | Calico-node pod is congested. Reduce load or restart the calico-node pod. |
CalicoIptableSaveErrorsHigh | Node | calico-felix | Calico-node pod is congested. Reduce load or restart the calico-node pod. |
CalicoIptableRestoreErrorsHigh | Node | calico-felix | Calico-node pod is congested. Reduce load or restart the calico-node pod. |
TyphaPingLatency | Node | calico-typha | Check network connectivity to the calico-typha pods. |
TyphaClientWriteLatency | Node | calico-typha | Verify connectivity between calico-typha and the Kubernetes API server/etcd. |