Catapult Rules & Alarms

Catapult has 56 built-in rules. Each rule falls into one of the categories below.

Each category lists the rule names, where the data is collected, the component that is monitored, and the action required if the alert is received.

etcd

Below is a list of the rule names, where the data is collected from, the monitored components, and the suggested action to remedy the issue.

| Rule Name | Data Collection Point | Monitored Component | Action Required |
| --- | --- | --- | --- |
| EtcdBackupJobFailed | Cluster | etcd | Review etcd backup task. Possibly lack of disk space. |
| EtcdDown | Cluster | etcd | Verify if nodelet can repair etcd. Check for CPU, memory, or storage issues on the node. Review logs of etcd container. |
| EtcdInsufficientMembers | Cluster | etcd | Etcd quorum lacks an odd number of masters. |
| EtcdNoLeader | Cluster | etcd | Etcd quorum has no leader. Investigate network issues causing etcd members not reaching consensus. |
| EtcdHighNumberOfLeaderChanges | Cluster | etcd | If leader of etcd quorum is changing frequently, verify master nodes are not constantly failing. Possible network issues between master nodes. |
| EtcdHighNumberOfFailedGrpcRequests | Cluster | etcd | Investigate network issues causing failed grpc requests in the etcd quorum. Review etcd container logs. |
| EtcdHighNumberOfFailedProposals | Cluster | etcd | Inspect network issues causing failed proposals in the etcd quorum. Check etcd container logs. |
| EtcdHighFsyncDurations | Cluster | etcd | Review disk or storage issues which may cause problems when an etcd member tries to commit data to disk. |
| EtcdHighCommitDurations | Cluster | etcd | Review disk or storage issues which may cause problems when an etcd member tries to commit data to disk. |
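
The EtcdInsufficientMembers and EtcdNoLeader rules both come down to quorum. As a worked illustration of why an odd number of masters is preferred, the sketch below computes the majority size and the number of member failures an etcd cluster can tolerate. The function names are illustrative only and are not part of Catapult or etcd.

```python
def quorum_size(members: int) -> int:
    """Smallest majority of an etcd cluster with the given member count."""
    if members < 1:
        raise ValueError("an etcd cluster needs at least one member")
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """How many members can fail while a majority is still reachable."""
    return members - quorum_size(members)


if __name__ == "__main__":
    for n in range(1, 6):
        print(
            f"members={n} quorum={quorum_size(n)} "
            f"tolerated_failures={fault_tolerance(n)}"
        )
```

Going from 3 to 4 members raises the quorum from 2 to 3 but still tolerates only one failure, which is why even member counts add cost without adding resilience.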

Kubernetes API Monitoring

| Rule Name | Data Collection Point | Monitored Component | Action Required |
| --- | --- | --- | --- |
| KubeAPIServerDown | Cluster | Kube API server | Verify API Server is responding on the node. |
| KubernetesApiServerErrors | Cluster | Kube API server | Check for API Server overload. Review logs. |
| KubernetesApiClientErrors | Cluster | Kube API server | Check for API Server overload. Survey logs to audit client requests. |
| KubeSchedulerDown | Cluster | Kube Scheduler | Audit logs for k8s master pod restarts. |
| KubeControllerManagerDown | Cluster | kube controller | Audit logs for k8s master pod restarts. |
| KubeProxyDown | Cluster | kube proxy | Verify kube proxy container is running on the node. |
| KubeProxyRuleSyncLatency | Cluster | kube proxy | Explore kube proxy overload via logs. |

Kube State

| Rule Name | Data Collection Point | Monitored Component | Action Required |
| --- | --- | --- | --- |
| KubeNodeNotReady | Cluster | kube state metrics | Review issues with kubelet on the node. |
| KubernetesMemoryPressure | Cluster | kube state metrics | Check available memory on the node. |
| KubernetesDiskPressure | Cluster | kube state metrics | Check available space on the node. |
| KubernetesJobFailed | Cluster | kube state metrics | Check logs for failed jobs. |
| KubernetesContainerTerminated | Cluster | kube state metrics | K8s container was terminated, possibly by the OOM killer. |
| KubePodCrashLooping | Cluster | kube state metrics | Review logs on the crashed pod. |
| KubePodNotReady | Cluster | kube state metrics | Pod may be in pending state, or awaiting a resource. |
| KubeDeploymentReplicasMismatch | Cluster | kube state metrics | K8s deployment has not reconciled the expected number of replicas. Possibly related to CPU, memory requests, or node taints. |
| KubeStatefulSetReplicasMismatch | Cluster | kube state metrics | K8s StatefulSet has not reconciled the expected number of replicas. Possibly related to CPU, memory requests, or node taints. |
| KubeDaemonSetRolloutStuck | Cluster | kube state metrics | K8s DaemonSet rollout has not reconciled the expected number of pods. Possibly related to CPU, memory requests, or node taints. |
| KubernetesPersistentvolumeclaimPending | Cluster | kube state metrics | PVC is in a pending state as K8s was unable to create a PV. Verify details of storage class, CSI driver etc. |
| KubernetesPersistentvolumeError | Cluster | kube state metrics | PV is in an error state. Check storage class or CSI drivers, and logs. |
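
When triaging KubePodCrashLooping, KubePodNotReady, or KubernetesContainerTerminated, it can help to scan pod objects for the relevant status fields. The following is a minimal sketch, assuming pod objects in the JSON shape returned by the Kubernetes API (for example from `kubectl get pods -o json`). The function name `pod_findings` and the sample pod are hypothetical, and this is not how Catapult itself evaluates these rules.

```python
from typing import Any

# Hypothetical sample pod, trimmed to the fields the sketch reads.
SAMPLE_POD: dict[str, Any] = {
    "metadata": {"name": "web-0"},
    "status": {
        "phase": "Running",
        "containerStatuses": [
            {
                "name": "app",
                "state": {"waiting": {"reason": "CrashLoopBackOff"}},
                "lastState": {"terminated": {"reason": "OOMKilled"}},
            }
        ],
    },
}


def pod_findings(pod: dict[str, Any]) -> list[str]:
    """Return human-readable findings for one pod object."""
    findings: list[str] = []
    name = pod.get("metadata", {}).get("name", "<unknown>")
    status = pod.get("status", {})

    # A Pending pod is usually waiting for scheduling or for a resource.
    if status.get("phase") == "Pending":
        findings.append(f"{name}: pod is Pending")

    for cs in status.get("containerStatuses", []):
        container = cs.get("name", "<unknown>")
        waiting = cs.get("state", {}).get("waiting") or {}
        if waiting.get("reason") == "CrashLoopBackOff":
            findings.append(f"{name}/{container}: CrashLoopBackOff")
        terminated = cs.get("lastState", {}).get("terminated") or {}
        if terminated.get("reason") == "OOMKilled":
            findings.append(f"{name}/{container}: last terminated by OOMKilled")
    return findings


if __name__ == "__main__":
    for line in pod_findings(SAMPLE_POD):
        print(line)
```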

Node OS

| Rule Name | Data Collection Point | Monitored Component | Action Required |
| --- | --- | --- | --- |
| HostHighCpuLoad | Node | Node exporter | High CPU usage on node. Check processes or responsible container. |
| HostOutOfMemory | Node | Node exporter | High memory usage on node. Check processes or responsible container. |
| HostMemoryUnderMemoryPressure | Node | Node exporter | High memory usage on node. Check processes or responsible container. |
| NodeFilesystemAlmostOutOfSpace | Node | Node exporter | Node is almost out of disk space. |
| NodeFilesystemAlmostOutOfFiles | Node | Node exporter | Node is almost out of inodes. |
| HostUnusualNetworkThroughputIn | Node | Node exporter | Host is experiencing unusual inbound network throughput. |
| HostUnusualNetworkThroughputOut | Node | Node exporter | Host is experiencing unusual outbound network throughput. |
| NodeNetworkReceiveErrs | Node | Node exporter | Host is experiencing network receive errors. |
| NodeNetworkTransmitErrs | Node | Node exporter | Host is experiencing network transmit errors. |
| HostUnusualDiskWriteRate | Node | Node exporter | Host is experiencing unusual disk write rate. |
| HostUnusualDiskReadRate | Node | Node exporter | Host is experiencing unusual disk read rate. |
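
To confirm a NodeFilesystemAlmostOutOfSpace or NodeFilesystemAlmostOutOfFiles alert directly on a node, free space and free inodes can be read with the standard library. This is a minimal sketch for Unix-like hosts. The 10% threshold and the function names are assumptions chosen for illustration and do not reflect the thresholds Catapult uses.

```python
import os


def filesystem_usage(path: str = "/") -> dict[str, float]:
    """Return the percentage of free space and free inodes for a mount point."""
    st = os.statvfs(path)  # Unix only
    total_bytes = st.f_blocks * st.f_frsize
    avail_bytes = st.f_bavail * st.f_frsize
    # Some filesystems report zero total inodes; treat those as not constrained.
    space_free = 100.0 * avail_bytes / total_bytes if total_bytes else 100.0
    inode_free = 100.0 * st.f_favail / st.f_files if st.f_files else 100.0
    return {"space_free_pct": space_free, "inode_free_pct": inode_free}


def low_resources(usage: dict[str, float], min_free_pct: float = 10.0) -> list[str]:
    """Name the resources whose free percentage is below the threshold.

    ASSUMPTION: min_free_pct is an illustrative value, not a Catapult setting.
    """
    low = []
    if usage["space_free_pct"] < min_free_pct:
        low.append("space")
    if usage["inode_free_pct"] < min_free_pct:
        low.append("inodes")
    return low


if __name__ == "__main__":
    usage = filesystem_usage("/")
    print(usage, low_resources(usage))
```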

Environment Status

| Rule Name | Data Collection Point | Monitored Component | Action Required |
| --- | --- | --- | --- |
| ClusterStatusNotOK | PMK | SaaS Mgmt Plane | Cluster status is not ok in PMK Mgmt Plane DB. |
| NodeNotReady | PMK | SaaS Mgmt Plane | K8s node has entered a not ready state as noted by a HostAgent extension. |
| K8sApiNotResponding | PMK | SaaS Mgmt Plane | PMK Mgmt Plane cannot reach k8s apiserver. |
| WorkerNodeNotResponding | PMK | SaaS Mgmt Plane | PMK Mgmt Plane cannot reach worker node. |

Node Connectivity

| Rule Name | Data Collection Point | Monitored Component | Action Required |
| --- | --- | --- | --- |
| Host Availability | PMK | SaaS Mgmt Plane | Node heartbeat is failing. This could be caused by a node outage or a failed service. Review the node to ensure that all Platform9 services are running. |
| Hosts disconnected | PMK | SaaS Mgmt Plane | Partial node availability: the heartbeat is passing. Review the node to ensure that all Platform9 services are running. |
| host-down | PMK | SaaS Mgmt Plane | The node is completely disconnected from Platform9. Ensure the node is running and all services are operating. |

Managed Add-ons

| Rule Name | Data Collection Point | Monitored Component | Action Required |
| --- | --- | --- | --- |
| AddonNotHealthy | PMK | Nodelet | ClusterAddon’s .status.health shows addon is not healthy. |
| AddonNotConverging | PMK | Nodelet | ClusterAddon’s .status.phase is not in Installed state. |
| AddonInstallError | PMK | Nodelet | ClusterAddon’s .status.phase is in an InstallError state. |
| AddonUninstallError | PMK | Nodelet | ClusterAddon’s .status.phase is in an UninstallError state. |
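
The four add-on rules are all driven by two fields of the ClusterAddon status. The sketch below shows one way that mapping could be expressed, following the table literally. It assumes `.status.health` is a boolean where `False` means unhealthy, which the table does not confirm, and `addon_alerts` is a hypothetical helper rather than Catapult's implementation.

```python
def addon_alerts(status: dict) -> list[str]:
    """Map a ClusterAddon .status dict to the rule names that would apply."""
    alerts: list[str] = []
    phase = status.get("phase")

    if phase == "InstallError":
        alerts.append("AddonInstallError")
    elif phase == "UninstallError":
        alerts.append("AddonUninstallError")

    # Read literally, any phase other than Installed counts as not converging.
    if phase != "Installed":
        alerts.append("AddonNotConverging")

    # ASSUMPTION: health is a boolean; the real field type may differ.
    if status.get("health") is False:
        alerts.append("AddonNotHealthy")
    return alerts


if __name__ == "__main__":
    print(addon_alerts({"phase": "Installed", "health": True}))
    print(addon_alerts({"phase": "InstallError", "health": False}))
```

In practice an alerting rule would likely also require the condition to persist for some time before firing, which this sketch omits.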

Calico

| Rule Name | Data Collection Point | Monitored Component | Action Required |
| --- | --- | --- | --- |
| PromHTTPRequestErrors | Node | calico-felix | Prometheus unable to pull data from calico. Underlying network or calico-node pod issues. |
| CalicoDatapaneFailuresHigh | Node | calico-felix | Calico-node pod is congested. Reduce load or restart calico-node pod. |
| CalicoIpsetErrorsHigh | Node | calico-felix | Calico-node pod is congested. Reduce load or restart calico-node pod. |
| CalicoIptableSaveErrorsHigh | Node | calico-felix | Calico-node pod is congested. Reduce load or restart calico-node pod. |
| CalicoIptableRestoreErrorsHigh | Node | calico-felix | Calico-node pod is congested. Reduce load or restart calico-node pod. |
| TyphaPingLatency | Node | calico-typha | Check network connectivity to the calico-typha pods. |
| TyphaClientWriteLatency | Node | calico-typha | Verify connectivity between calico-typha and kubernetes API server/etcd. |
  Last updated by Chris Jones