etcd Monitoring
Rules for etcd
The Prometheus agent is configured to report numerous etcd metrics. Below is the YAML rule set that Catapult uses. Each rule defines an alert name, a PromQL expression, a "for" duration that the condition must hold before the alert fires, labels for severity and type, and annotations that carry the summary and description.
bash-5.0# cat etcd.yml
groups:
- name: etcd
  rules:
  - alert: EtcdBackupJobFailed
    expr: kube_job_status_failed{job_name=~"etcd-backup.*"} offset 5m > 0
    for: 0m
    labels:
      severity: high
      type: etcd
    annotations:
      summary: Etcd Backup Job failed for cluster {{ $labels.cluster }}
      description: "Cluster {{ $labels.cluster }}: Etcd Backup Job {{$labels.namespace}}/{{$labels.job_name}} failed to complete reason: {{$labels.reason}}"
  - alert: EtcdDown
    expr: up{job="etcd"} offset 5m == 0
    for: 10m
    labels:
      severity: critical
      type: etcd
    annotations:
      description: Etcd container down on cluster {{ $labels.cluster }}
      summary: "Cluster {{ $labels.cluster }}: Etcd container down on host {{ $labels.host }}"
  - alert: EtcdInsufficientMembers
    expr: count(etcd_server_id) by (cluster) % 2 == 0
    for: 1m
    labels:
      severity: critical
      type: etcd
    annotations:
      summary: Etcd insufficient members on cluster {{ $labels.cluster }}
      description: "Cluster {{ $labels.cluster }}: Etcd cluster should have an odd number of members\n VALUE = {{ $value }}"
  - alert: EtcdNoLeader
    expr: etcd_server_has_leader == 0
    for: 1m
    labels:
      severity: critical
      type: etcd
    annotations:
      summary: Etcd no Leader on cluster {{ $labels.cluster }}
      description: "Cluster {{ $labels.cluster }}: Etcd cluster have no leader\n VALUE = {{ $value }}"
  - alert: EtcdHighNumberOfLeaderChanges
    expr: increase(etcd_server_leader_changes_seen_total[10m] offset 5m) > 2
    for: 0m
    labels:
      severity: high
      type: etcd
    annotations:
      summary: Etcd high number of leader changes on cluster {{ $labels.cluster }}
      description: "Cluster {{ $labels.cluster }}: Etcd leader changed more than 2 times during 10 minutes\n VALUE = {{ $value }}"
  - alert: EtcdHighNumberOfFailedGrpcRequests
    expr: sum(rate(grpc_server_handled_total{grpc_code!="OK"}[1m] offset 5m)) BY (grpc_service, grpc_method, cluster, host) / sum(rate(grpc_server_handled_total[1m] offset 5m)) BY (grpc_service, grpc_method, cluster, host) > 0.01
    for: 2m
    labels:
      severity: warning
      type: etcd
    annotations:
      summary: Etcd high number of failed GRPC requests on cluster {{ $labels.cluster }}
      description: "Cluster {{ $labels.cluster }}: More than 1% GRPC request failure detected on Etcd host {{ $labels.host }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: EtcdHighNumberOfFailedProposals
    expr: increase(etcd_server_proposals_failed_total[1h]) > 5
    for: 2m
    labels:
      severity: warning
      type: etcd
    annotations:
      summary: Etcd high number of failed proposals on cluster {{ $labels.cluster }}
      description: "Cluster {{ $labels.cluster }}: Etcd server got more than 5 failed proposals past hour\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: EtcdHighFsyncDurations
    expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[1m] offset 5m)) > 0.5
    for: 2m
    labels:
      severity: warning
      type: etcd
    annotations:
      summary: Etcd high fsync durations on cluster {{ $labels.cluster }}
      description: "Cluster {{ $labels.cluster }}: Etcd WAL fsync duration increasing, 99th percentile is over 0.5s on Etcd host {{ $labels.host }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: EtcdHighCommitDurations
    expr: histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[1m] offset 5m)) > 0.25
    for: 2m
    labels:
      severity: warning
      type: etcd
    annotations:
      summary: Etcd high commit durations on cluster {{ $labels.cluster }}
      description: "Cluster {{ $labels.cluster }}: Etcd commit duration increasing, 99th percentile is over 0.25s on Etcd host {{ $labels.host }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
bash-5.0#
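A rule file like this can be exercised offline with Prometheus rule unit tests. The following is a minimal sketch, not part of the Catapult configuration: the test file name etcd_test.yml, the cluster value demo, and the host value etcd-0 are hypothetical, and the expected annotations assume that the EtcdNoLeader summary and description are templated with {{ $labels.cluster }} and {{ $value }}. It feeds a series in which etcd reports no leader for several minutes and expects EtcdNoLeader to be firing at the three-minute mark.

```yaml
# etcd_test.yml (hypothetical file name), placed next to etcd.yml
rule_files:
  - etcd.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    # Synthetic input: the member reports "no leader" (0) at every sample.
    # The cluster and host label values are made up for this test.
    input_series:
      - series: 'etcd_server_has_leader{cluster="demo", host="etcd-0"}'
        values: '0 0 0 0 0'
    alert_rule_test:
      # The rule uses "for: 1m", so by 3m the alert should be firing.
      - eval_time: 3m
        alertname: EtcdNoLeader
        exp_alerts:
          - exp_labels:
              severity: critical
              type: etcd
              cluster: demo
              host: etcd-0
            # Assumed rendering of the rule's annotation templates.
            exp_annotations:
              summary: Etcd no Leader on cluster demo
              description: "Cluster demo: Etcd cluster have no leader\n VALUE = 0"
```

If promtool is available, the test would typically be run with `promtool test rules etcd_test.yml`, and `promtool check rules etcd.yml` can be used to validate the syntax of the rule file itself.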