Calico Monitoring
Calico Rules
Below is the YAML rule set that Catapult uses for Calico alerting. Each rule defines an alert name, an expression, a timeframe (how long the condition must hold before the alert fires), labels, and annotations containing the description and summary.
bash-5.0# cat calico.yml
groups:
- name: calico.rules
  rules:

  #----------------------------------- Calico Node -------------------------------------
  # Felix Prometheus statistics: https://projectcalico.docs.tigera.io/reference/felix/prometheus

  - alert: PromHTTPRequestErrors
    expr: (sum(rate(promhttp_metric_handler_requests_total{job="calico-node",code=~"(4|5).."}[5m] offset 5m)) by (instance, job, cluster, host) / sum(rate(promhttp_metric_handler_requests_total{job="calico-node"}[5m] offset 5m)) by (instance, job, cluster, host)) * 100 > 1
    for: 10m
    labels:
      severity: warning
      type: calico-node
    annotations:
      description: "Cluster {{ $labels.cluster }}: HTTP requests errors on host {{ $labels.host }}."
      summary: Calico HTTP requests errors on cluster {{ $labels.cluster }}

  - alert: CalicoDatapaneFailuresHigh
    expr: increase(felix_int_dataplane_failures[1h] offset 5m) > 5
    for: 1h
    labels:
      severity: warning
      type: calico-node
    annotations:
      description: 'Felix cluster {{ $labels.cluster }} has seen {{ $value }} dataplane failures within the last hour'
      summary: 'A high number of dataplane failures within Felix are happening'

  - alert: CalicoIpsetErrorsHigh
    expr: increase(felix_ipset_errors[1h] offset 5m) > 5
    for: 1h
    labels:
      severity: warning
      type: calico-node
    annotations:
      description: 'Felix cluster {{ $labels.cluster }} has seen {{ $value }} ipset errors within the last hour'
      summary: 'A high number of ipset errors within Felix are happening'

  - alert: CalicoIptableSaveErrorsHigh
    expr: increase(felix_iptables_save_errors[1h] offset 5m) > 5
    for: 1h
    labels:
      severity: warning
      type: calico-node
    annotations:
      description: 'Felix cluster {{ $labels.cluster }} has seen {{ $value }} iptable save errors within the last hour'
      summary: 'A high number of iptable save errors within Felix are happening'

  - alert: CalicoIptableRestoreErrorsHigh
    expr: increase(felix_iptables_restore_errors[1h] offset 5m) > 5
    for: 1h
    labels:
      severity: warning
      type: calico-node
    annotations:
      description: 'Felix cluster {{ $labels.cluster }} has seen {{ $value }} iptable restore errors within the last hour'
      summary: 'A high number of iptable restore errors within Felix are happening'

  #----------------------------------- Calico Kube Controllers -----------------------
  # kube-controllers Prometheus statistics: https://projectcalico.docs.tigera.io/reference/kube-controllers/prometheus
  # curl http://<pod_ip>:9094/metrics

  #----------------------------------- Calico Typha ----------------------------------
  # Typha Prometheus statistics: https://projectcalico.docs.tigera.io/reference/typha/prometheus

  - alert: TyphaPingLatency
    expr: rate(typha_ping_latency_sum[1m] offset 5m) / rate(typha_ping_latency_count[1m] offset 5m) > 0.1 and rate(typha_ping_latency_count[1m] offset 5m) > 0
    for: 2m
    labels:
      severity: warning
      type: calico-typha
    annotations:
      summary: Typha Round-trip ping latency to client (cluster {{ $labels.cluster }})
      description: "Typha latency is growing (ping operations > 100ms)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

  - alert: TyphaClientWriteLatency
    expr: rate(typha_client_write_latency_secs_sum[1m] offset 5m) / rate(typha_client_write_latency_secs_count[1m] offset 5m) > 0.1 and rate(typha_client_write_latency_secs_count[1m] offset 5m) > 0
    for: 2m
    labels:
      severity: warning
      type: calico-typha
    annotations:
      summary: Typha unusual write latency (instance {{ $labels.cluster }})
      description: "Typha client latency is growing (write operations > 100ms)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
bash-5.0#
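The threshold arithmetic behind the PromHTTPRequestErrors and Typha latency expressions above can be sketched in a few lines of Python. This is only an illustration, not part of the Catapult rule set: the function names are hypothetical, and per_second_rate is a simplified stand-in for PromQL's rate(), which additionally handles counter resets and extrapolation. The 1% and 0.1 second thresholds are taken from the rules.

```python
from typing import Optional


def per_second_rate(first: float, last: float, window_seconds: float) -> float:
    """Simplified stand-in for PromQL rate(): counter growth per second."""
    return (last - first) / window_seconds


def error_percentage(error_rate: float, total_rate: float) -> Optional[float]:
    """Share of 4xx/5xx requests, as in: errors / total * 100."""
    if total_rate == 0:
        # PromQL would yield NaN here, and the comparison drops NaN samples.
        return None
    return error_rate / total_rate * 100


def http_errors_breach(error_rate: float, total_rate: float,
                       threshold_percent: float = 1.0) -> bool:
    """True when the PromHTTPRequestErrors condition (> 1) would hold."""
    pct = error_percentage(error_rate, total_rate)
    return pct is not None and pct > threshold_percent


def typha_latency_breach(sum_rate: float, count_rate: float,
                         threshold_seconds: float = 0.1) -> bool:
    """True when the average latency (sum rate / count rate) exceeds the threshold.

    The count_rate > 0 check mirrors the 'and rate(..._count) > 0' guard.
    """
    if count_rate <= 0:
        return False
    return sum_rate / count_rate > threshold_seconds


if __name__ == "__main__":
    # Hypothetical samples: 60 pings in a 60 s window taking 9 s in total.
    sum_rate = per_second_rate(0.0, 9.0, 60.0)
    count_rate = per_second_rate(0.0, 60.0, 60.0)
    print("Typha latency condition holds:", typha_latency_breach(sum_rate, count_rate))
    print("HTTP error condition holds:", http_errors_breach(0.01, 0.5))
```

Dividing the rate of the _sum series by the rate of the _count series gives the average duration per operation over the window, which is why a threshold of 0.1 corresponds to the 100ms mentioned in the descriptions. In the rules themselves, a condition must also stay true for the whole "for" duration before the alert fires; the sketch does not model that.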