Catapult Rules & Alarms

Catapult has 56 built-in rules. Each rule falls into one of the categories below.

Below is a summary of each rule: the rule name, where the data is collected, the monitored component, and the suggested action to remedy the issue when the alert is received.

Rule Name	Data Collection Point	Monitored Component	Action Required
EtcdBackupJobFailed	Cluster	etcd	Review etcd backup task. Possibly lack of disk space.
EtcdDown	Cluster	etcd	Verify if nodelet can repair etcd. Check for CPU, memory, or storage issues on the node. Review logs of etcd container.
EtcdInsufficientMembers	Cluster	etcd	Etcd quorum lacks an odd number of masters.
EtcdNoLeader	Cluster	etcd	Etcd quorum has no leader. Investigate network issues causing etcd members not reaching consensus.
EtcdHighNumberOfLeaderChanges	Cluster	etcd	If leader of etcd quorum is changing frequently, verify master nodes are not constantly failing. Possible network issues between master nodes.
EtcdHighNumberOfFailedGrpcRequests	Cluster	etcd	Investigate network issues causing failed grpc requests in the etcd quorum. Review etcd container logs.
EtcdHighNumberOfFailedProposals	Cluster	etcd	Inspect network issues causing failed proposals in the etcd quorum. Check etcd container logs.
EtcdHighFsyncDurations	Cluster	etcd	Review disk or storage issues which may cause problems when an etcd member tries to commit data to disk.
EtcdHighCommitDurations	Cluster	etcd	Review disk or storage issues which may cause problems when an etcd member tries to commit data to disk.
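
The EtcdInsufficientMembers and EtcdNoLeader entries both come down to quorum: etcd needs a majority of its members to be reachable before it can elect a leader or commit writes. The arithmetic can be sketched as follows. The function names are illustrative only and are not part of Catapult or etcd.

```python
def quorum_size(members: int) -> int:
    """Smallest number of members that forms a majority."""
    if members < 1:
        raise ValueError("an etcd cluster needs at least one member")
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """Members that can be lost while a majority is still reachable."""
    return members - quorum_size(members)


if __name__ == "__main__":
    for n in range(1, 6):
        print(f"{n} members: quorum {quorum_size(n)}, "
              f"tolerates {failures_tolerated(n)} failure(s)")
```

Three members and four members both tolerate a single failure, which is why an odd number of masters is the usual recommendation: the extra even member adds no fault tolerance.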

Rule Name	Data Collection Point	Monitored Component	Action Required
KubeAPIServerDown	Cluster	Kube API server	Verify API Server is responding on the node.
KubernetesApiServerErrors	Cluster	Kube API server	Check for API Server overload. Review logs.
KubernetesApiClientErrors	Cluster	Kube API server	Check for API Server overload. Survey logs to audit client requests.
KubeSchedulerDown	Cluster	Kube Scheduler	Audit logs for k8s master pod restarts.
KubeControllerManagerDown	Cluster	kube controller	Audit logs for k8s master pod restarts.
KubeProxyDown	Cluster	kube proxy	Verify kube proxy container is running on the node.
KubeProxyRuleSyncLatency	Cluster	kube proxy	Explore kube proxy overload via logs.

Rule Name	Data Collection Point	Monitored Component	Action Required
KubeNodeNotReady	Cluster	kube state metrics	Review issues with kubelet on the node.
KubernetesMemoryPressure	Cluster	kube state metrics	Check available memory on the node.
KubernetesDiskPressure	Cluster	kube state metrics	Check available space on the node.
KubernetesJobFailed	Cluster	kube state metrics	Check logs for failed jobs.
KubernetesContainerTerminated	Cluster	kube state metrics	K8s container was terminated, possible OOM killer.
KubePodCrashLooping	Cluster	kube state metrics	Review logs on the crashed pod.
KubePodNotReady	Cluster	kube state metrics	Pod may be in pending state, or awaiting a resource.
KubeDeploymentReplicasMismatch	Cluster	kube state metrics	K8s deployment has not reconciled the expected number of replicas. Possibly related to CPU, memory requests, or node taints.
KubeStatefulSetReplicasMismatch	Cluster	kube state metrics	K8s StatefulSet has not reconciled the expected number of replicas. Possibly related to CPU, memory requests, or node taints.
KubeDaemonSetRolloutStuck	Cluster	kube state metrics	K8s DaemonSet has not reconciled the expected number of pods. Possibly related to CPU, memory requests, or node taints.
KubernetesPersistentvolumeclaimPending	Cluster	kube state metrics	PVC is in a pending state as K8s was unable to create a PV. Verify details of storage class, CSI driver etc.
KubernetesPersistentvolumeError	Cluster	kube state metrics	PV is in an error state. Check storage class or CSI drivers, and logs.
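
When following up on KubePodCrashLooping, KubePodNotReady, or KubernetesContainerTerminated, the relevant details are in the pod's status. The sketch below is a minimal, hedged example that inspects a single Pod object already loaded as a Python dict (for instance, one entry of `.items` from `kubectl get pods -o json`). The function name `flag_pod` is hypothetical, and the checks are only a starting point, not the logic Catapult itself uses.

```python
from typing import Any


def flag_pod(pod: dict[str, Any]) -> list[str]:
    """Return human-readable findings for one Pod object given as a dict."""
    findings: list[str] = []
    name = pod.get("metadata", {}).get("name", "<unknown>")
    status = pod.get("status", {})

    # A Pending pod is often waiting on scheduling or on a resource such as a PVC.
    if status.get("phase") == "Pending":
        findings.append(f"{name}: pod is Pending")

    for cs in status.get("containerStatuses", []):
        container = cs.get("name", "<unknown>")

        # Kubernetes reports a crash loop as a waiting state with this reason.
        waiting = cs.get("state", {}).get("waiting") or {}
        if waiting.get("reason") == "CrashLoopBackOff":
            findings.append(f"{name}/{container}: CrashLoopBackOff")

        # The previous termination reason shows whether the OOM killer was involved.
        terminated = cs.get("lastState", {}).get("terminated") or {}
        if terminated.get("reason") == "OOMKilled":
            findings.append(f"{name}/{container}: last terminated by OOM killer")

    return findings
```

A pod with no findings returns an empty list, so the result can be used directly in a condition.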

Rule Name	Data Collection Point	Monitored Component	Action Required
HostHighCpuLoad	Node	Node exporter	High CPU usage on node. Check processes or responsible container.
HostOutOfMemory	Node	Node exporter	High memory usage on node. Check processes or responsible container.
HostMemoryUnderMemoryPressure	Node	Node exporter	High memory usage on node. Check processes or responsible container.
NodeFilesystemAlmostOutOfSpace	Node	Node exporter	Node is almost out of disk space.
NodeFilesystemAlmostOutOfFiles	Node	Node exporter	Node is almost out of inodes.
HostUnusualNetworkThroughputIn	Node	Node exporter	Host is experiencing unusual inbound network throughput.
HostUnusualNetworkThroughputOut	Node	Node exporter	Host is experiencing unusual outbound network throughput.
NodeNetworkReceiveErrs	Node	Node exporter	Host is experiencing network receive errors.
NodeNetworkTransmitErrs	Node	Node exporter	Host is experiencing network transmit errors.
HostUnusualDiskWriteRate	Node	Node exporter	Host is experiencing unusual disk write rate.
HostUnusualDiskReadRate	Node	Node exporter	Host is experiencing unusual disk read rate.
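
For the two filesystem rules, disk space and inode usage can be confirmed directly on the node. The following is a minimal sketch using only the Python standard library. It assumes a Unix-like node (`os.statvfs` is not available on Windows). The function names and the 0.9 threshold are illustrative assumptions, not the thresholds Catapult uses.

```python
import os
import shutil


def disk_used_fraction(path: str = "/") -> float:
    """Fraction of disk space in use on the filesystem containing path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.used / usage.total


def inode_used_fraction(path: str = "/") -> float:
    """Fraction of inodes in use on the filesystem containing path."""
    st = os.statvfs(path)
    if st.f_files == 0:
        # Some filesystems do not report an inode count.
        return 0.0
    return (st.f_files - st.f_ffree) / st.f_files


if __name__ == "__main__":
    THRESHOLD = 0.9  # hypothetical warning level
    for label, value in (("disk", disk_used_fraction()),
                         ("inodes", inode_used_fraction())):
        state = "WARN" if value >= THRESHOLD else "ok"
        print(f"{label}: {value:.1%} used [{state}]")
```

A filesystem can run out of inodes while still showing free space, so it is worth checking both values.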

Rule Name	Data Collection Point	Monitored Component	Action Required
ClusterStatusNotOK	PMK	SaaS Mgmt Plane	Cluster status is not ok in PMK Mgmt Plane DB.
NodeNotReady	PMK	SaaS Mgmt Plane	K8s node has entered a not ready state as noted by a HostAgent extension.
K8sApiNotResponding	PMK	SaaS Mgmt Plane	PMK Mgmt Plane cannot reach k8s apiserver.
WorkerNodeNotResponding	PMK	SaaS Mgmt Plane	PMK Mgmt Plane cannot reach worker node.

Rule Name	Data Collection Point	Monitored Component	Action Required
Host Availability	PMK	SaaS Mgmt Plane	Node Heartbeat is failing. This could be caused by a node outage or a failed service. Review the node to ensure that all Platform9 services are running.
Hosts disconnected	PMK	SaaS Mgmt Plane	Partial node availability: the heartbeat is passing. Review the node to ensure that all Platform9 services are running.
host-down	PMK	SaaS Mgmt Plane	The node is completely disconnected from Platform9. Ensure the node is running and all services are operating.

Rule Name	Data Collection Point	Monitored Component	Action Required
AddonNotHealthy	PMK	Nodelet	ClusterAddon’s .status.health shows addon is not healthy.
AddonNotConverging	PMK	Nodelet	ClusterAddon’s .status.phase is not in Installed state.
AddonInstallError	PMK	Nodelet	ClusterAddon’s .status.phase is in an `InstallError` state.
AddonUninstallError	PMK	Nodelet	ClusterAddon’s .status.phase is in an `UninstallError` state.

Rule Name	Data Collection Point	Monitored Component	Action Required
PromHTTPRequestErrors	Node	calico-felix	Prometheus unable to pull data from calico. Underlying network or calico-node pod issues.
CalicoDatapaneFailuresHigh	Node	calico-felix	Calico-node pod is congested. Reduce load or restart calico-node pod.
CalicoIpsetErrorsHigh	Node	calico-felix	Calico-node pod is congested. Reduce load or restart calico-node pod.
CalicoIptableSaveErrorsHigh	Node	calico-felix	Calico-node pod is congested. Reduce load or restart calico-node pod.
CalicoIptableRestoreErrorsHigh	Node	calico-felix	Calico-node pod is congested. Reduce load or restart calico-node pod.
TyphaPingLatency	Node	calico-typha	Check network connectivity to the calico-typha pods.
TyphaClientWriteLatency	Node	calico-typha	Verify connectivity between calico-typha and Kubernetes API server/etcd.

Last updated by Chris Jones on Apr 28, 2022
