Node Connectivity Monitoring
Rules for the Hostagent
The Prometheus agent is configured to report numerous node-based metrics. Below is the YAML rule set that Catapult uses. Each alert rule defines an alert name, an expression, a timeframe (how long the expression must hold before the alert fires), and annotations. The annotations contain the summary and description of the notification, as well as the SaaS management plane (du) and the hostname.
bash-5.0# cat resmgr.yml
groups:
- name: Host Availability Alerts
  rules:
  # du_sidekick_last_checkin_seconds will be the time in seconds since a host checked in via sidekick
  # or -1 if the host is not reported by sidekick at all. The latter case may occur if
  # sidekick has restarted and a host hasn't connected since.
  - expr: |-
      (
        sum by (host_id,host_name) (
          time() - sidekick_host_last_heartbeat_time{job="sidekickserver"} / 1000
        )
      )
      OR ignoring(host_name)
      (
        sum by (host_id,host_name) (
          (resmgr_host_up{job="resmgr"} == 0) - 1
        )
      )
    record: du_sidekick_last_checkin_seconds
- name: Hosts disconnected
  rules:
  # resmgr says it's been down for at least 10m, sidekick says still reporting
  # NOTE: the `for` delay _must_ exceed the cutoff (currently 600 seconds) + the scrape
  # period (1m) or else both this and host-down will fire off simultaneously
  - alert: host-disconnected
    expr: |-
      sum by (du, host_id, host_name) (du_sidekick_last_checkin_seconds < 600)
      AND ON(host_id)
      resmgr_host_up{job="resmgr"} == 0
    for: 15m
    annotations:
      summary: host-disconnected
      description: "{{ $labels.host_name }} disconnected from control plane {{ $labels.du }} for more than 10 minutes"
      du: "{{ $labels.du }}"
      host_name: "{{ $labels.host_name }}"
  # resmgr says it's been down for at least 10m, sidekick "agrees"
  - alert: host-down
    expr: |-
      sum by (du, host_id, host_name) (du_sidekick_last_checkin_seconds >= 600 OR du_sidekick_last_checkin_seconds == -1)
      AND ON(host_id)
      resmgr_host_up{job="resmgr"} == 0
    for: 10m
    annotations:
      summary: host-down
      description: "{{ $labels.host_name }} down {{ $labels.du }} for more than 10 minutes"
      du: "{{ $labels.du }}"
      host_name: "{{ $labels.host_name }}"
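To make the rule logic above easier to follow, here is a minimal Python sketch that evaluates the same conditions for a single host. It is an illustration only, not part of Catapult or Prometheus: the function names last_checkin_seconds and matching_alerts are hypothetical, the inputs stand in for the sidekick_host_last_heartbeat_time and resmgr_host_up series, and the sketch checks only whether each alert expression matches at one instant. It does not model the "for" durations (15m and 10m) that must elapse before an alert actually fires.

```python
from typing import List, Optional

# Threshold (in seconds) that appears in both alert expressions.
CUTOFF_SECONDS = 600


def last_checkin_seconds(
    now_s: float, heartbeat_ms: Optional[float], resmgr_up: bool
) -> Optional[float]:
    """Approximate the du_sidekick_last_checkin_seconds recording rule for one host.

    heartbeat_ms is the sidekick heartbeat timestamp in milliseconds, or None
    when sidekick reports no series for the host (hypothetical input shape).
    """
    if heartbeat_ms is not None:
        # Left side of the OR: time() - heartbeat / 1000.
        return now_s - heartbeat_ms / 1000
    if not resmgr_up:
        # Right side of the OR: (resmgr_host_up == 0) - 1 evaluates to -1.
        return -1
    # Neither side of the OR produces a sample for this host.
    return None


def matching_alerts(checkin: Optional[float], resmgr_up: bool) -> List[str]:
    """Return the alert names whose expression matches at this instant."""
    # Both alerts require resmgr_host_up == 0 and an existing check-in sample.
    if resmgr_up or checkin is None:
        return []
    alerts = []
    if checkin < CUTOFF_SECONDS:
        alerts.append("host-disconnected")
    if checkin >= CUTOFF_SECONDS or checkin == -1:
        alerts.append("host-down")
    return alerts


if __name__ == "__main__":
    now = 1000.0
    for heartbeat in (900_000, 100_000, None):
        checkin = last_checkin_seconds(now, heartbeat, resmgr_up=False)
        print(heartbeat, checkin, matching_alerts(checkin, resmgr_up=False))
```

One detail the sketch surfaces: as the expressions are written, a value of -1 satisfies both "< 600" and "== -1", so a host that sidekick does not report at all matches both alert expressions while resmgr reports it as down.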