24/7 OpenStack Monitoring: Architecture and Tools

While the OpenStack Ceilometer project is very popular for telemetry, it’s not really an OpenStack monitoring solution. It is a data collection service designed for gathering usage data on objects managed by OpenStack (instances, volumes, networks, and so on), but it does not provide an out-of-the-box way to monitor OpenStack for errors or anomalies. Since one of our core values to customers is that we monitor OpenStack for them, we had to think about how to achieve proactive, 24/7 monitoring for OpenStack.

In this post, we’ll look at how Platform9 approaches monitoring in our OpenStack managed service and the various tools we use to implement it. Proactive monitoring is a key ingredient of the secret sauce behind our 100% customer satisfaction rating, and helps us deliver an industry-leading SLA for OpenStack. If you are planning a monitoring strategy for your own OpenStack environment, this information should prove helpful.

What is Monitoring and Why Does it Matter?

Monitoring is a systematic process of collecting, analyzing and using data to track the health of complex systems, such as a private cloud. In very simple terms, monitoring is alerting on what matters.

Monitoring a cloud requires reliable data collection on the physical and virtual resources that comprise the cloud and the actions of its users and services. The collected data is typically analyzed in real time to trigger actions based on predefined criteria. Usually, the data is also persisted to analyze trends in the medium to long term.

At a minimum, monitoring must report failure events so that the system can be brought back to a healthy, functioning state. Purely reactive monitoring of this kind, however, still leads to downtime and user dissatisfaction. Proactive monitoring goes further: it checks for error conditions, as well as anomalies in system or user behavior, while also keeping an eye on system performance and event logs.

The data gathered from monitoring is used for troubleshooting errors and issues and preemptively catching them before they affect users. In addition, the insights from monitoring can assist in capacity planning for cloud resources and performing audits for compliance or other reasons.

Current State of OpenStack Monitoring

In a typical OpenStack configuration, at a minimum, the various controller services need to be monitored: Nova, Glance, Cinder, Keystone, Neutron, and others. In addition, all compute and storage nodes must be monitored. The telemetry requirements of a complete OpenStack environment increase with the range of OpenStack services being used. The Ceilometer project has traction in the OpenStack community, but it is designed for usage metering of objects that OpenStack manages (such as instances, volumes, and networks), not as an infrastructure for monitoring the health of OpenStack’s internal components. The various attempts to retrofit Ceilometer for overall OpenStack monitoring have been clunky, since it wasn’t designed for this purpose.

Most popular monitoring solutions are custom built using tools like Nagios, Zabbix, Graphite, and the ELK Stack (Elasticsearch, Logstash, and Kibana).

Why 24/7 OpenStack Monitoring Matters to Platform9

Platform9 aims to make private clouds as easy as possible for any customer, at any scale. Our solution delivers OpenStack as a managed service. This requires us to manage the OpenStack cloud controllers as separate OpenStack environments and take responsibility for deploying, monitoring, troubleshooting, and upgrading them on behalf of our customers. We manage the customer’s infrastructure resources, which, along with all their data, remain completely inside the customer’s data center.

[Figure: Managed approach requires 24/7 OpenStack monitoring]

To truly deliver a highly available, production-ready service, it was imperative that we have 24/7 OpenStack monitoring, so that we know when there is a problem anywhere in our customers’ cloud infrastructure.

What Does Platform9 Monitor?

To put it succinctly: everything that could impact our SLA.

Platform9 actively monitors all components that could impact the OpenStack SLA we guarantee for our customers. The same monitoring infrastructure also monitors bare metal servers and cloud instances in our development infrastructure, to help our devops teams be more productive. The goal is to make sure that both our customers and our developers are getting the resources they need.

To ensure this, we instrument all OpenStack services, the underlying compute, storage, and network infrastructure, and the OpenStack data-plane components. This instrumentation allows us to gather the information required to generate actionable events, which are then routed through our alerting infrastructure. Our support team can then proactively reach out to the customer when we notice a problem or when the customer needs to resolve an issue with their hardware.

Alerts are routed to specific engineers based on the type of notification and possibly other parameters, such as its source. For instance, if an OpenStack controller node becomes unexpectedly unavailable, our devops team is notified. When a customer’s host experiences an unplanned outage, our customer support team is notified so they can communicate with the customer. A developer might ask for some domain-specific alerts to be routed to her during internal development. This targeted routing helps prevent alert blindness; without it, critical errors tend to sneak by amidst the noise.
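The routing idea described above can be sketched as a small rule table. This is only an illustration, not Platform9's actual implementation: the `Alert` fields, the rule tuples, and the team names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    kind: str     # hypothetical alert type, e.g. "controller_down"
    source: str   # hypothetical origin, e.g. "customer_host"
    message: str


# Hypothetical rules: (kind, source, team). "*" matches any value.
ROUTES = [
    ("controller_down", "*", "devops"),
    ("host_down", "customer_host", "support"),
]
DEFAULT_TEAM = "devops"  # assumed fallback so that no alert is dropped


def route(alert, routes=ROUTES):
    """Return the team named by the first rule matching the alert."""
    for kind, source, team in routes:
        if kind in ("*", alert.kind) and source in ("*", alert.source):
            return team
    return DEFAULT_TEAM
```

A developer's temporary subscription could be modeled by placing an extra rule in front of the shared table, for example `route(alert, [("nova_api_error", "*", "dev:alice")] + ROUTES)`.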

The major requirements for our OpenStack monitoring approach are:

  • Status monitoring – lost connections or unresponsive messaging queue
  • Health monitoring – insufficient capacity or performance
  • Reporting – maximum number of instances run by a user, average duration of compute instances
  • Anomalies – a customer operation affects more than n instances (or volumes, snapshots, or networks) within a short time interval
  • Log analysis – store errors from all logs in a central location for analysis
  • Supportability – support teams need read-only access to the current state of a specific customer
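As an illustration of the anomaly requirement, here is a minimal sliding-window sketch that flags a customer whose operations touch more than n objects within a time interval. The class name, parameters, and thresholds are assumptions made for the example, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import defaultdict, deque


class BurstDetector:
    """Flag customers whose operations affect more than `max_objects`
    objects within the last `window_seconds` seconds."""

    def __init__(self, max_objects, window_seconds):
        self.max_objects = max_objects
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # customer -> (timestamp, count)
        self._totals = defaultdict(int)    # customer -> objects in window

    def record(self, customer, timestamp, affected):
        """Record one operation; return True if the customer is now
        over the threshold."""
        events = self._events[customer]
        events.append((timestamp, affected))
        self._totals[customer] += affected
        # Evict operations that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while events and events[0][0] <= cutoff:
            _, old_count = events.popleft()
            self._totals[customer] -= old_count
        return self._totals[customer] > self.max_objects
```

Each customer has an independent window, so a burst from one tenant does not mask or trigger alerts for another.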

The types of data that must be ingested include:

  • Simple time-series data – controller load average, RabbitMQ health indicator
  • Transactional data – Nova API measurements, instance details across all users
  • First class object state transitions – a transition of a cinder volume into an error state

Architecting OpenStack Monitoring using 7 Key Tools

To meet the requirements we established for comprehensive OpenStack monitoring, the Platform9 team uses a portfolio of tools, and the diagram below shows their interrelationship and interdependencies.

[Figure: OpenStack monitoring architecture and tools]

  • VictorOps – An alerting service that is used to page people when there is a serious problem with cloud services. Alert messages can be tailored to individual preferences and delivered by push notification, SMS, or phone call. Platform9 talks to VictorOps through webhooks. We can also send messages directly to a Slack channel.
  • SumoLogic – A data analytics service that provides insight across the collective set of log files generated by the various OpenStack services and other subsystems. It gives us a GUI to search and view live tails of all of a system’s log files. SumoLogic also lets us create saved searches for future analysis and route interesting messages into Slack channels.
  • ServerDensity – A server monitoring service that checks our servers externally and alerts us about operational events. It pings Platform9 servers from different geographic locations and sends messages through VictorOps. We use ServerDensity for connectivity checks, alerts, charts, and tracking load times of our website. The metrics collected include load average, disk usage, network traffic, and memory usage.
  • SignalFx – A data visualization service that lets us explore data trends through a powerful graphical UI. Data is ingested through REST APIs into a time series database. The team can zoom in on a single customer to see their usage and check whether they had any problems.
  • Slack – Our standard messaging app across all teams. Alerts and interesting messages from other tools can end up in Slack channels. Additionally, we route emails from third-party systems into Slack channels.
  • ZenDesk – A helpdesk service that we use to route support tickets, both from customers and from certain types of internal alerts, to our support teams. The insights have helped us strengthen customer relationships and become a trusted partner.
  • Custom monitoring services – In addition to these external services, we created two custom services:
    • A monitoring agent (internal codename “Reachability”) that runs internally on our servers to ping services and check the health of OpenStack and Platform9 APIs. It tracks load average, total processes, free memory, disk space, and so on.
    • A contextual data service (internal codename “Whistle”) that is designed to gain context around events and errors. For example, it connects a request ID across different services so that related stack traces can be easily retrieved for troubleshooting.
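To make the request ID idea concrete, the following sketch groups log lines from several services by the request ID they mention. It is not how “Whistle” is implemented; it assumes log lines carry IDs in the `req-<uuid>` form that OpenStack services commonly emit, and the `correlate` function and its input shape are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: request IDs appear in log lines as "req-<uuid>".
REQUEST_ID = re.compile(
    r"req-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)


def correlate(log_lines):
    """Group (service, line) pairs by the first request ID in each line.

    Lines without a request ID are skipped.
    """
    by_request = defaultdict(list)
    for service, line in log_lines:
        match = REQUEST_ID.search(line)
        if match:
            by_request[match.group(0)].append((service, line))
    return dict(by_request)
```

Given the combined logs of, say, the Nova and Cinder services, looking up one request ID in the result returns every line that request produced, in input order, regardless of which service wrote it.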

Recommendations for Setting Up Your Own Monitoring

We’ve spent a lot of time thinking about and implementing OpenStack monitoring, and here are a few things to consider as you weigh your options:

  • A one-size-fits-all approach does not work for monitoring solutions. Constantly tuning the systems and alerts is essential to keep them useful and relevant.
  • When using services on any cloud, design for instance failures and unresponsive APIs. Requests should be retried and should fail gracefully.
  • Monitor servers and REST APIs from inside the network and also from the public web. In this way local network errors will not trip up the operations team.
  • SaaS solutions can save you time and effort compared with homegrown or DIY solutions. They are typically modern and support REST APIs and webhooks, so they are easy to set up and connect with the other monitoring tools you’ll end up using.
  • Existing products may not offer some specific features that are relevant to your domain or your unique situation. A unified monitoring dashboard depends on that domain knowledge, so parts of it may need to be custom-built.
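The advice to retry requests and fail gracefully can be sketched as a small helper. This is a minimal example under assumed defaults (three attempts, exponential backoff starting at half a second); the function name and its return convention are illustrative, not taken from any particular library.

```python
import time


def call_with_retries(operation, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call `operation` up to `attempts` times with exponential backoff.

    Returns (True, result) on success, or (False, last_exception) once
    every attempt has failed, so the caller can degrade gracefully
    instead of crashing.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return True, operation()
        except Exception as error:  # a real client would catch narrower errors
            last_error = error
            if attempt < attempts - 1:
                # Wait base_delay, then 2x, 4x, ... before the next try.
                sleep(base_delay * 2 ** attempt)
    return False, last_error
```

The `sleep` parameter exists so that tests or callers can substitute their own wait function. Returning a success flag rather than raising leaves the fallback decision (cached data, a degraded response, an alert) to the caller.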

Experience 24/7 OpenStack Monitoring

If you’re looking for an easier solution for OpenStack Monitoring, consider a managed OpenStack service like Platform9. Not only does Platform9 save you the effort and complexity of fully instrumenting and monitoring your OpenStack cloud, but we also provide the industry’s best support experience for private clouds. If you’d like to learn more, please check out our product demos.

Share Your Best Practices

I’d love to hear about other OpenStack monitoring best practices that have worked well for you. Please share your feedback in the comments below, or tweet us @platform9Sys.

