Catapult has 56 built-in rules, each of which falls into one of the categories below.

The tables below list each rule name, where its data is collected from, the monitored component, and the suggested action to remedy the issue.
| Rule Name | Data Collection Point | Monitored Component | Action Required |
|---|---|---|---|
| EtcdBackupJobFailed | Cluster | etcd | Review etcd backup task. Possibly lack of disk space. |
| EtcdDown | Cluster | etcd | Verify if nodelet can repair etcd. Check for CPU, memory, or storage issues on the node. Review logs of etcd container. |
| EtcdInsufficientMembers | Cluster | etcd | Etcd quorum has insufficient members. Verify that an odd number of master nodes are running and reachable. |
| EtcdNoLeader | Cluster | etcd | Etcd quorum has no leader. Investigate network issues causing etcd members not reaching consensus. |
| EtcdHighNumberOfLeaderChanges | Cluster | etcd | If leader of etcd quorum is changing frequently, verify master nodes are not constantly failing. Possible network issues between master nodes. |
| EtcdHighNumberOfFailedGrpcRequests | Cluster | etcd | Investigate network issues causing failed grpc requests in the etcd quorum. Review etcd container logs. |
| EtcdHighNumberOfFailedProposals | Cluster | etcd | Inspect network issues causing failed proposals in the etcd quorum. Check etcd container logs. |
| EtcdHighFsyncDurations | Cluster | etcd | Review disk or storage issues which may cause problems when an etcd member tries to commit data to disk. |
| EtcdHighCommitDurations | Cluster | etcd | Review disk or storage issues which may cause problems when an etcd member tries to commit data to disk. |
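The etcd alerts above can be triaged from a master node with `etcdctl`. A minimal sketch follows; the endpoint and certificate paths are assumptions and should be replaced with the paths used by your deployment.

```shell
# Sketch: inspect etcd quorum health (EtcdDown, EtcdInsufficientMembers, EtcdNoLeader).
# Endpoint and certificate paths are assumptions -- adjust for your cluster.
export ETCDCTL_API=3
ETCD="etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/etcd/ca.crt --cert=/etc/etcd/client.crt --key=/etc/etcd/client.key"

$ETCD endpoint health              # is this member responding?
$ETCD member list -w table         # is there an odd number of members?
$ETCD endpoint status -w table     # IS LEADER column shows the current leader
```

For `EtcdHighFsyncDurations` and `EtcdHighCommitDurations`, the `endpoint status` output combined with node-level disk latency checks usually points to the slow member.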
| Rule Name | Data Collection Point | Monitored Component | Action Required |
|---|---|---|---|
| KubeAPIServerDown | Cluster | Kube API server | Verify API Server is responding on the node. |
| KubernetesApiServerErrors | Cluster | Kube API server | Check for API Server overload. Review logs. |
| KubernetesApiClientErrors | Cluster | Kube API server | Check for API Server overload. Survey logs to audit client requests. |
| KubeSchedulerDown | Cluster | Kube Scheduler | Audit logs for k8s master pod restarts. |
| KubeControllerManagerDown | Cluster | kube controller | Audit logs for k8s master pod restarts. |
| KubeProxyDown | Cluster | kube proxy | Verify kube proxy container is running on the node. |
| KubeProxyRuleSyncLatency | Cluster | kube proxy | Explore kube proxy overload via logs. |
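A quick first pass at the control-plane alerts above can be done with `kubectl` on a master node. This is a sketch; component pod names and labels vary by distribution, and the `k8s-app=kube-proxy` label is an assumption based on common defaults.

```shell
# Sketch: triaging control-plane alerts (KubeAPIServerDown, KubeSchedulerDown, etc.).
kubectl get --raw='/healthz'                  # API server liveness check
kubectl -n kube-system get pods -o wide       # scheduler/controller/proxy pod status
# Inspect kube-proxy logs for overload or rule-sync latency (label is an assumption):
kubectl -n kube-system logs -l k8s-app=kube-proxy --tail=50
```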
| Rule Name | Data Collection Point | Monitored Component | Action Required |
|---|---|---|---|
| KubeNodeNotReady | Cluster | kube state metrics | Review issues with kubelet on the node. |
| KubernetesMemoryPressure | Cluster | kube state metrics | Check available memory on the node. |
| KubernetesDiskPressure | Cluster | kube state metrics | Check available space on the node. |
| KubernetesJobFailed | Cluster | kube state metrics | Check logs for failed jobs. |
| KubernetesContainerTerminated | Cluster | kube state metrics | K8s container was terminated, possible OOM killer. |
| KubePodCrashLooping | Cluster | kube state metrics | Review logs on the crashed pod. |
| KubePodNotReady | Cluster | kube state metrics | Pod may be in pending state, or awaiting a resource. |
| KubeDeploymentReplicasMismatch | Cluster | kube state metrics | K8s deployment has not reconciled the expected number of replicas. Possibly related to CPU, memory requests, or node taints. |
| KubeStatefulSetReplicasMismatch | Cluster | kube state metrics | K8s StatefulSet has not reconciled the expected number of replicas. Possibly related to CPU, memory requests, or node taints. |
| KubeDaemonSetRolloutStuck | Cluster | kube state metrics | K8s DaemonSet rollout has stalled and pods are not running on all expected nodes. Possibly related to CPU, memory requests, or node taints. |
| KubernetesPersistentvolumeclaimPending | Cluster | kube state metrics | PVC is in a pending state as K8s was unable to create a PV. Verify details of storage class, CSI driver etc. |
| KubernetesPersistentvolumeError | Cluster | kube state metrics | PV is in an error state. Check storage class or CSI drivers, and logs. |
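Most of the kube-state-metrics alerts above can be narrowed down with a few standard `kubectl` queries. The deployment name below is a hypothetical placeholder.

```shell
# Sketch: triaging kube-state-metrics alerts.
kubectl get nodes                                            # KubeNodeNotReady
kubectl get pods -A --field-selector=status.phase!=Running   # KubePodNotReady / CrashLooping
kubectl get events -A --sort-by=.lastTimestamp | tail -20    # recent failures, OOM kills
kubectl get pvc -A | grep -v Bound                           # pending or failed PVCs
DEPLOY=my-deployment    # hypothetical name -- substitute the mismatched deployment
kubectl describe deployment "$DEPLOY"                        # replica mismatch details
```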
| Rule Name | Data Collection Point | Monitored Component | Action Required |
|---|---|---|---|
| HostHighCpuLoad | Node | Node exporter | High CPU usage on node. Check processes or responsible container. |
| HostOutOfMemory | Node | Node exporter | High memory usage on node. Check processes or responsible container. |
| HostMemoryUnderMemoryPressure | Node | Node exporter | High memory usage on node. Check processes or responsible container. |
| NodeFilesystemAlmostOutOfSpace | Node | Node exporter | Node out of disk space. |
| NodeFilesystemAlmostOutOfFiles | Node | Node exporter | Node out of inodes. |
| HostUnusualNetworkThroughputIn | Node | Node exporter | Host is experiencing unusual inbound network throughput. |
| HostUnusualNetworkThroughputOut | Node | Node exporter | Host is experiencing unusual outbound network throughput. |
| NodeNetworkReceiveErrs | Node | Node exporter | Host is experiencing network receive errors. |
| NodeNetworkTransmitErrs | Node | Node exporter | Host is experiencing network transmit errors. |
| HostUnusualDiskWriteRate | Node | Node exporter | Host is experiencing unusual disk write rate. |
| HostUnusualDiskReadRate | Node | Node exporter | Host is experiencing unusual disk read rate. |
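The node-exporter alerts above map to standard Linux diagnostics that can be run directly on the affected node:

```shell
# Quick node-level checks corresponding to the alerts above.
free -h                   # HostOutOfMemory / HostMemoryUnderMemoryPressure
df -h                     # NodeFilesystemAlmostOutOfSpace
df -i                     # NodeFilesystemAlmostOutOfFiles (inode exhaustion)
top -b -n 1 | head -15    # HostHighCpuLoad: top CPU consumers
cat /proc/net/dev         # per-interface RX/TX byte and error counters
```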
| Rule Name | Data Collection Point | Monitored Component | Action Required |
|---|---|---|---|
| ClusterStatusNotOK | PMK | SaaS Mgmt Plane | Cluster status is not ok in PMK Mgmt Plane DB. |
| NodeNotReady | PMK | SaaS Mgmt Plane | K8s node has entered a not ready state as noted by a HostAgent extension. |
| K8sApiNotResponding | PMK | SaaS Mgmt Plane | PMK Mgmt Plane cannot reach k8s apiserver. |
| WorkerNodeNotResponding | PMK | SaaS Mgmt Plane | PMK Mgmt Plane cannot reach worker node. |
| Rule Name | Data Collection Point | Monitored Component | Action Required |
|---|---|---|---|
| Host Availability | PMK | SaaS Mgmt Plane | Node heartbeat is failing. This could be caused by a node outage or a failed service. Review the node to ensure that all Platform9 services are running. |
| Hosts disconnected | PMK | SaaS Mgmt Plane | Partial node availability, heartbeat is passing, review the node to ensure that all Platform9 services are running. |
| host-down | PMK | SaaS Mgmt Plane | The node is completely disconnected from Platform9. Ensure the node is running and all services are operating. |
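On a node that is failing heartbeat or disconnected, the Platform9 host services can be checked with systemd. The specific unit names below are assumptions; list the actual `pf9-*` units on your node first.

```shell
# Sketch: verify Platform9 services on a node failing heartbeat.
# Unit names are assumptions -- enumerate the real ones first:
systemctl list-units 'pf9-*'
systemctl status pf9-hostagent            # assumed host agent unit name
journalctl -u pf9-hostagent --since "1 hour ago" | tail -20
```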
| Rule Name | Data Collection Point | Monitored Component | Action Required |
|---|---|---|---|
| AddonNotHealthy | PMK | Nodelet | ClusterAddon’s .status.health shows addon is not healthy. |
| AddonNotConverging | PMK | Nodelet | ClusterAddon’s .status.phase is not in Installed state. |
| AddonInstallError | PMK | Nodelet | ClusterAddon’s .status.phase is in an InstallError state. |
| AddonUninstallError | PMK | Nodelet | ClusterAddon’s .status.phase is in an UninstallError state. |
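Since these alerts are driven by ClusterAddon status fields, the current state can be read directly from the CRD. The resource name `clusteraddons` is an assumption based on the ClusterAddon kind referenced above.

```shell
# Sketch: inspect ClusterAddon phase and health for the addon alerts above.
# Resource name "clusteraddons" is an assumption -- verify with `kubectl api-resources`.
kubectl get clusteraddons \
  -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,HEALTH:.status.health
```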
| Rule Name | Data Collection Point | Monitored Component | Action Required |
|---|---|---|---|
| PromHTTPRequestErrors | Node | calico-felix | Prometheus unable to pull data from calico. Underlying network or calico-node pod issues. |
| CalicoDatapaneFailuresHigh | Node | calico-felix | Calico-node pod is congested. Reduce load or restart calico-node pod. |
| CalicoIpsetErrorsHigh | Node | calico-felix | Calico-node pod is congested. Reduce load or restart calico-node pod. |
| CalicoIptableSaveErrorsHigh | Node | calico-felix | Calico-node pod is congested. Reduce load or restart calico-node pod. |
| CalicoIptableRestoreErrorsHigh | Node | calico-felix | Calico-node pod is congested. Reduce load or restart calico-node pod. |
| TyphaPingLatency | Node | calico-typha | Check network connectivity to the calico-typha pods. |
| TyphaClientWriteLatency | Node | calico-typha | Verify connectivity between calico-typha and the Kubernetes API server/etcd. |
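For the congestion alerts above, the recommended restart of calico-node can be done via its DaemonSet. This is a sketch; it assumes the standard `k8s-app=calico-node` label and a DaemonSet named `calico-node` in `kube-system`, which may differ in your installation.

```shell
# Sketch: check and restart congested calico-node pods (labels/names are assumptions).
kubectl -n kube-system get pods -l k8s-app=calico-node -o wide
kubectl -n kube-system logs -l k8s-app=calico-node -c calico-node --tail=50
# Restart all calico-node pods via a rolling DaemonSet restart:
kubectl -n kube-system rollout restart daemonset calico-node
```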