Anyone who has ever attempted to upgrade a cloud platform, be it from Microsoft, OpenStack, or VMware, will readily attest that doing so in a live production environment is not for the faint of heart. No Cloud Admin wants to suffer through a failed upgrade that leaves a production cloud degraded, unavailable, or at risk of losing data. Yet too many admins have spent sleepless nights restoring, or even completely rebuilding, a platform after an upgrade went badly wrong.
That’s why new Platform9 customers are often genuinely and happily surprised by our painless, zero-touch OpenStack upgrade, as detailed in previous posts here and here. It’s certainly one of the value propositions we deliver as a Cloud Management-as-a-Service provider that gets people’s attention. So how can Platform9 deliver something that still seems so difficult for so many providers and software vendors? Time to look under the hood of Platform9 upgrades and to think cloud-native.
Why Do Painless Upgrades Matter?
Before delving into how we do an OpenStack upgrade at Platform9, let’s talk briefly about why customers do upgrades at all. Historically, users have upgraded their environments for a number of reasons including the following:
- Platform Stability – Over time, bugs and other problems may surface that require patches to the code. The OpenStack project regularly introduces patches and hot fixes to the code base, but most customers choose to apply those patches only at specific intervals, and often only after thorough testing in-house and/or by a partner or vendor.
- New Capabilities – Users often upgrade to gain new features and capabilities that they believe are necessary or desirable. For example, OpenStack adds new projects, and new features to existing projects, with each twice-yearly release. Many of these are based on customer feedback on what is needed in a production cloud platform.
- Ongoing Support – Most users choose to go with a vendor or service provider to support their environment instead of relying on self-support. This requires users to be at a certain minimum level of code for them to receive support since a vendor or provider cannot support an infinite regression of code releases. It also means that the minimum level of code changes with each new release since support typically only extends back a certain number of releases.
Whatever the motive, all users can agree that they want upgrades to be simple and non-disruptive. The rule my team and I followed back in my sysadmin days, when it came to upgrades, was “Do no harm.” I still have nightmares about a particular storage firmware upgrade that took a production system down for a weekend and 2 business days while we repaired the system and restored all data.
Horror stories like these have motivated Platform9 to put time and effort into designing a system that makes painless upgrades a possibility for all our customers.
It Starts with the Right Design Patterns
One of the things that pioneers in the cloud computing space, such as Amazon and Netflix, have taught us is that you can build systems with high availability and service uptime, even on fragile underlying infrastructure, by following the right cloud-native design patterns. That is the key to Platform9 being able to make upgrades painless with little to no workload disruption. To understand how we built some of these patterns into the Platform9 cloud management service, it may be helpful to review them:
- Automation – Operating any system at scale requires automating as many operational procedures as possible in order to reduce the risk of operator error. To provide the same high level of service to every customer, Platform9 must avoid manual processes, especially repetitive ones. Our biggest risk of failure occurs when a human is required to manually complete those repetitive tasks, especially tasks with multiple steps.
- Service Orientation – System management takes on new meaning when you focus on overall service uptime instead of component uptime. At scale, the availability of a service is much more important than the availability of any individual server. At Netflix, for example, engineers focus on maintaining the video streaming service, not on any individual Virtual Machine (VM) hosted on AWS. Similarly for Platform9, it’s about keeping the cloud management services and APIs available, not the individual VMs that make up our service. Taking this approach means designing an architecture with no single points of failure that could cause service outages during a failed upgrade.
- Immutable Infrastructure – I first heard the term “Immutable Infrastructure” from Chad Fowler, who coined the phrase in this blog post. The concept behind Immutable Infrastructure is that if you’ve taken the steps required to automate the building of your infrastructure, then you can treat your server/VM instances as disposable units. Among the changes in thinking that result from this mindset: instead of layering code upgrade after code upgrade onto a server/VM instance, build a new instance with the upgraded code and dispose of the old instance. This makes the upgrade to new software much more predictable and error-free, since you are always starting with a new instance in a known-good configuration state, not one that has had multiple changes applied to it.
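The contrast between mutating a running instance and replacing it can be sketched in a few lines. This is a minimal illustration of the mindset, not Platform9’s tooling; the image names are made up:

```python
from dataclasses import dataclass

# Minimal sketch of the Immutable Infrastructure mindset: an instance
# is an immutable record built from a specific image, and "upgrading"
# means creating a fresh instance from the upgraded image rather than
# patching the running one. Image names here are illustrative.

@dataclass(frozen=True)
class Instance:
    image: str    # e.g. an AMI baked with a specific code release
    release: str  # the software release the image was built from

def upgrade(old: Instance, new_image: str, new_release: str) -> Instance:
    # Build a brand-new instance; the old one is left untouched and
    # can simply be disposed of once the new one is verified.
    return Instance(image=new_image, release=new_release)

blue = Instance(image="ami-havana-base", release="havana")
green = upgrade(blue, new_image="ami-juno-base", new_release="juno")
print(green.release)  # -> juno; blue still exists, unchanged
```

Because `Instance` is frozen, there is no way to mutate a running instance in place, which is exactly the discipline the pattern enforces.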
There are other important design patterns involved in building cloud-native applications, many of which are assumed by the ones I discussed above. For this blog post, we’ll focus on the ones I outlined.
OpenStack Upgrade with Platform9: A Deep Dive
In a previous blog post, I reviewed the architecture behind our cloud management-as-a-service platform. There, I explained that our service runs in a three-tier architecture, with the Core Services (CS) and Deployment Unit (DU) tiers currently running in Amazon Web Services (AWS) and the On-Premises tier running in customer data centers. Also recall that a unique Deployment Unit, with its own OpenStack services, is created for each Platform9 customer.
One of the benefits of deploying Platform9 as a SaaS offering, and a primary reason for doing so on a cloud infrastructure such as AWS, is the flexibility it gives us to architect the CS and DU tiers as cloud-native applications. We can then apply those cloud-native design patterns to the way we manage and upgrade our platform. In particular, it allows us to deliver painless upgrades to our OpenStack-based cloud management service residing in the Deployment Unit tier.
- Automation – We’ve automated the configuration management of our platform using Snape, a toolbox of scripts and utilities written by Platform9, and Ansible, a well-known open source automation tool. Snape is responsible for a number of tasks, including:
- Creating AMI-based instances that are part of a customer DU
- Applying networking configuration to instances in a customer DU
- Creating Amazon Relational Database Service (RDS) instances for each customer DU
- Applying security management, including SSH key and SSL certificate management, for each customer DU
- Deploying monitoring, logging, and analytics tools to each customer DU
Snape does its work before and after Ansible runs and is also the mechanism for kicking off an Ansible run at the appropriate time. Ansible is used for a number of automation tasks, including deploying OpenStack services in each customer DU. The combination of Snape and Ansible gives us a way to perform predictable upgrades to our platform.
- Service Orientation – Key to how we deploy Platform9 is that we run as many components as possible as stateless services (and we are moving more and more toward microservices). For example, components such as our alerting service and logging service don’t maintain persistent data locally, but in shared repositories and databases. Each customer DU has its own database, an RDS instance replicated between Availability Zones. All OpenStack services run in AMI-based instances that maintain no local state. That means we can instantiate new instances when required, whether for workload spikes or, in this case, as part of the “upgrade” process. This aligns with our service-oriented approach, where keeping specific instances up is not important; only service uptime matters.
- Immutable Infrastructure – Automating our processes and focusing on services enables us to treat our customer DUs as Immutable Infrastructure. From an upgrade perspective, this is important because it allows us to use an approach called Blue/Green Deployment. A logical companion to Immutable Infrastructure, popularized by many including Fowler and Martin, the Blue/Green Deployment concept proposes the following:
- Stand up two production-grade deployments: the Blue one is live and the Green one is a standby replica. (The two deployments can be identical, running the same code level; or, if applying the Immutable Infrastructure concept, the Green deployment is built already running the new code level.)
- If the two deployments are identical, upgrade the Green replica to the target code level and test. Once testing has passed, label the Green deployment as ready to go live.
- Move services and tenants to the Green deployment running the target code level. That can be done by making configuration changes or, if possible, by rerouting requests through a load balancer.
- Once everything is running on the Green deployment, you can choose to destroy the Blue deployment at your convenience. If a problem occurs, you have the option of rolling back to the Blue deployment.
With this approach, we don’t upgrade our customers in the traditional sense of applying code changes to an existing production environment. Instead, we instantiate a new customer DU, including a new release of OpenStack, and cut everyone over to the new DU. This approach is what enabled us to perform painless upgrades from the Havana to the Juno release of OpenStack for all our customers.
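The Blue/Green cutover described above can be sketched as a simple router swap. This is a toy model, not Platform9’s implementation: the dict stands in for a load balancer such as ELB, and the health check stands in for the automated testing of the Green deployment:

```python
# Toy sketch of a Blue/Green cutover. A router (standing in for a load
# balancer) maps a service name to the deployment currently serving it.
# Names and the health check are illustrative assumptions.

class Deployment:
    def __init__(self, name: str, release: str):
        self.name = name
        self.release = release

    def healthy(self) -> bool:
        # A real system would run its automated test suite here.
        return True

def cut_over(router: dict, service: str, green: Deployment) -> Deployment:
    """Point the service at Green; return the old Blue for rollback."""
    if not green.healthy():
        raise RuntimeError("Green failed verification; Blue stays live")
    blue = router[service]
    router[service] = green  # reroute requests to the Green deployment
    return blue              # keep Blue around in case we must roll back

router = {"cloud-mgmt": Deployment("blue", "havana")}
old_blue = cut_over(router, "cloud-mgmt", Deployment("green", "juno"))
print(router["cloud-mgmt"].release)  # -> juno
```

Note that the cutover never touches the Blue deployment’s code; rollback is just pointing the router back at the object returned by `cut_over`.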
It is worth noting that I have not talked about the on-premises tier. Here we are still using the traditional upgrade approach since we cannot assume that our customers are using Immutable Infrastructure or Blue/Green Deployment principles. Also, we follow a shared responsibility approach at this tier in which we upgrade the host agent or gateway but the customer has responsibility for upgrading their hypervisors and other parts of their infrastructure as needed.
Putting It All Together
Here’s what the high-level “upgrade” workflow looks like for Platform9:
- Use Snape to create new AMI-based instances for the standby Green DU
- Use Snape to apply networking configuration to the standby Green DU
- Use Snape to apply security management to the standby Green DU
- Use Ansible to install OpenStack services running target code release
- Use Ansible to apply Platform9 customer account information to standby Green DU
- Use Snape to install management software as needed for the standby Green DU
- Use Snape to run automated tests on standby Green DU
- Confirm that both Blue and Green deployments are running properly
- Cut customers over to the standby Green DU by rerouting requests from the Blue DU, using the Amazon Elastic Load Balancing (ELB) service
- Delete Blue DU including all instances
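The workflow above can be sketched as a single orchestration function. The `snape`/`ansible`/`elb` step names below are stand-ins for the real tooling; each call merely records the step so the ordering is visible:

```python
# Hypothetical sketch of the DU "upgrade" workflow as one orchestration
# function. The real Snape and Ansible invocations are replaced by a
# step recorder, so this only demonstrates the sequencing.

steps = []

def run(step: str) -> None:
    steps.append(step)  # a real orchestrator would execute the tool here

def upgrade_customer_du() -> None:
    run("snape: create AMI-based instances for Green DU")
    run("snape: apply networking configuration")
    run("snape: apply security management")
    run("ansible: install OpenStack at target release")
    run("ansible: apply customer account information")
    run("snape: install management software")
    run("snape: run automated tests on Green DU")
    run("verify Blue and Green deployments")
    run("elb: reroute requests from Blue to Green")
    run("delete Blue DU")

upgrade_customer_du()
print(len(steps))  # -> 10
```

Encoding the sequence in one function is what removes the human from the loop: the steps always run in the same order, or the run fails loudly partway through with Blue still live.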
By automating these steps, focusing on service orientation, and applying other cloud-native design patterns such as Immutable Infrastructure, Platform9 is able to take the pain out of upgrades while providing our customers with a stable cloud management platform. A stable platform, in turn, delivers new capabilities that let those customers continually innovate and succeed.