Catapult, Kubernetes Monitoring Re-imagined Remotely
Catapult – Remote Monitoring
Simplifying Kubernetes management requires more than installation and upgrades. To truly simplify the complexities inherent to Kubernetes, a platform provider needs to reduce the burden across a variety of factors and increase the productivity of the operations teams tasked with running the Kubernetes environment. For example, every request from a developer for access to a cluster takes time to fulfill; ultimately, access should be self-service and integrated with SSO. Every failed app deployment requires investigation, tweaking of manifests, and more testing, delaying the launch of new services. Troubleshooting should be easy, simplified by the platform and open to all users. Outages at 2am are overwhelming, have the potential to continue for hours, and could impact revenue and erode a hard-earned reputation. If the underlying issue is found early and remediated by automation, the outage could be avoided entirely.
As part of our most recent release, Platform9 5.5, we made significant changes to our remote monitoring capabilities, expanding the Kubernetes metrics we collect and creating new alerts to help reduce, or in some cases completely avoid, the 2am wake-up call.
Every user of Platform9, whether Free, Growth or Enterprise, has access to Catapult: purpose-built remote monitoring that runs 24/7, surveying every cluster for critical failures. Catapult is built on 100% open source, Kubernetes-native tooling and monitors the critical functions in each cluster, including:
- etcd
- API Server
- Node Monitoring
- Pods, Deployments and Services
- Calico networking
- Platform9 Managed Add-ons Health
- Platform9 Connectivity
We have written 56 new rules that inspect each metric in real time and, when a failure condition is met, send an alert to Platform9 support and to your operations team. To ensure security and safeguard users’ data, Catapult runs within each user’s dedicated SaaS Management Plane; nothing is ever shared between instances. Because it runs in the SaaS Management Plane, Catapult is backed by our 99.9% SLA.
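To make the idea of a rule concrete, here is a minimal Python sketch of how a failure condition might be evaluated against a stream of samples. It is not one of the 56 Catapult rules: the rule name, the 10% threshold and the 10-minute hold period are all hypothetical, and the hold period simply mimics the common alerting practice of requiring a condition to persist before firing.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Rule:
    name: str
    condition: Callable[[float], bool]  # True when a sample is in the failure state
    hold_seconds: float                  # how long the failure must persist before firing


def first_firing_time(rule: Rule, samples: List[Tuple[float, float]]) -> Optional[float]:
    """Return the timestamp at which the rule first fires, or None if it never does.

    samples is a list of (timestamp_seconds, value) pairs in time order.
    """
    pending_since: Optional[float] = None
    for ts, value in samples:
        if rule.condition(value):
            if pending_since is None:
                pending_since = ts  # condition just became true
            if ts - pending_since >= rule.hold_seconds:
                return ts
        else:
            pending_since = None  # condition cleared, so reset the timer
    return None


# Hypothetical rule: a node reports less than 10% free disk for 10 minutes.
disk_rule = Rule("NodeDiskAlmostFull", lambda free: free < 0.10, hold_seconds=600)

# (timestamp in seconds, fraction of disk free)
samples = [(0, 0.30), (300, 0.08), (600, 0.07), (900, 0.05), (1200, 0.04)]
fired_at = first_firing_time(disk_rule, samples)
print(disk_rule.name, "fires at", fired_at)
```

Requiring the condition to hold for a period trades a little detection latency for fewer false alarms from short-lived spikes.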
Catapult is built on Prometheus and Alertmanager, both open source and unchanged. The architecture has been designed to be resilient and distributed, with multiple data collection points ensuring coverage even if nodes within a cluster go offline. The key feature that enables Catapult to scale is Prometheus Agent Mode. Introduced late in 2021, Agent Mode enables Prometheus to stream data in real time back to a central Prometheus using the remote write mechanism. Each node that is part of a Platform9-built Kubernetes cluster has Prometheus installed and running in Agent Mode. The instance on each node is configured to scrape data from the OS, Kubernetes and etcd and stream it to the SaaS Management Plane, where a central SaaS instance of Prometheus is running. As the data streams are written to the Prometheus TSDB, rules evaluate the data and, if required, trigger notifications via Alertmanager. In addition to the data collected by each Prometheus Agent Mode instance, Platform9 has configured the Prometheus that’s running within the SaaS Management Plane to scrape our own metric endpoints.
Within each user’s SaaS Management Plane are a number of Kubernetes management services, for example, Add-on Manager. Add-on Manager deploys our managed add-ons and maintains data on their health, lifecycle and phase. Prometheus is configured to scrape this data and data from other Platform9 services. The data from each Kubernetes management service is collected in addition to the cluster-side metrics and combined to create a holistic view of each cluster. The data covers every node, OS resources (Disk, RAM, CPU, Network), Kubernetes application metrics (CrashLoopBackOff), etcd (leader changes), managed add-ons (failed installation) and Platform9 Services (HostAgent Down).
The result is 24/7 remote monitoring as a service that, at 2am, notifies our support team and your operations team so issues can be resolved fast, before they impact your customers.
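As a rough illustration of the per-node setup described above, the Python sketch below assembles the general shape of a Prometheus configuration that scrapes local targets and forwards samples with `remote_write`. The keys (`global`, `scrape_configs`, `static_configs`, `remote_write`) are standard Prometheus configuration fields, but the job names, ports, scrape interval and remote write URL are assumptions for illustration, not Platform9's actual settings, and TLS and authentication are omitted. Agent Mode itself is enabled by a command-line flag (`--enable-feature=agent` in the 2.x releases that introduced it) rather than by this file.

```python
import json


def agent_config(remote_write_url: str, node_ip: str) -> dict:
    """Build a minimal per-node config: scrape locally, stream to a central Prometheus."""
    return {
        "global": {"scrape_interval": "30s"},  # assumed interval
        "scrape_configs": [
            # OS metrics, assuming node_exporter on its default port
            {"job_name": "node", "static_configs": [{"targets": [f"{node_ip}:9100"]}]},
            # Kubernetes metrics, assuming the kubelet's default secure port
            {"job_name": "kubelet", "scheme": "https",
             "static_configs": [{"targets": [f"{node_ip}:10250"]}]},
            # etcd metrics, assuming they are served on the client port
            {"job_name": "etcd", "static_configs": [{"targets": [f"{node_ip}:2379"]}]},
        ],
        # Samples are forwarded here instead of being queried locally
        "remote_write": [{"url": remote_write_url}],
    }


# Hypothetical endpoint and node address
config = agent_config("https://management-plane.example.invalid/api/v1/write", "10.0.0.5")
print(json.dumps(config, indent=2))
```

JSON output is used only to keep the sketch dependency-free; a real Prometheus configuration file is written in YAML.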
What about the in-cluster monitoring that Platform9 provides?
Prometheus is also made available as part of the set of Platform9 managed add-ons; this is referred to as ‘in-cluster’ monitoring. The in-cluster monitoring provides users with a managed monitoring package that includes Prometheus, Alertmanager and Grafana and can be installed into any cluster, whether imported from EKS, AKS or GKE, or built using Platform9 on-premises, at the edge, or in AWS and Azure. In-cluster monitoring is a data-rich view into cluster performance, pod metrics and node performance that is optional and complementary to Catapult. Catapult remotely captures data and streams it to the SaaS Management Plane, where it is used to send events on critical issues impacting the cluster; this is how our support team and self-healing actions know there are issues. The in-cluster monitoring is built to help developers and operations teams quickly view detailed metrics across the cluster and your applications. Data can be viewed directly from Platform9, or from Prometheus and Grafana running in the cluster. To integrate the in-cluster monitoring into an enterprise monitoring strategy, we recommend users configure the remote write proxy and set up specific receivers to send alarms directly to the teams that can take action.
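The idea of sending alarms to the teams that can act on them can be sketched as label-based routing. The Python below is a simplified model, loosely following Alertmanager's behaviour of checking routes in order and stopping at the first match by default; the label names, receiver names and routes are all hypothetical and would differ in a real deployment.

```python
from typing import Dict, List, Tuple

# Each route pairs a set of label matchers with a receiver name (all hypothetical).
Route = Tuple[Dict[str, str], str]

ROUTES: List[Route] = [
    ({"component": "etcd"}, "platform-oncall"),
    ({"severity": "critical"}, "ops-pager"),
]
DEFAULT_RECEIVER = "ops-email"


def pick_receiver(labels: Dict[str, str],
                  routes: List[Route] = ROUTES,
                  default: str = DEFAULT_RECEIVER) -> str:
    """Return the receiver of the first route whose matchers all equal the alert's labels."""
    for matchers, receiver in routes:
        if all(labels.get(key) == value for key, value in matchers.items()):
            return receiver
    return default  # no route matched, so fall back to the catch-all receiver


print(pick_receiver({"component": "etcd", "severity": "critical"}))
print(pick_receiver({"severity": "critical", "component": "calico"}))
print(pick_receiver({"severity": "warning"}))
```

Ordering the more specific routes first keeps component owners ahead of the general on-call rotation when an alert matches both.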
Conclusion
Simply put, Catapult helps Platform9 ensure you don’t hit issues at 2am, while the in-cluster monitoring helps your team build, deploy and scale cloud native applications.
To learn more, check out our documentation on Catapult.