Running Platform9 on Platform9
Can we have our cake and eat it too? After months of development, from conception to full production use, that is a question we’ve been asking ourselves about our own private instance of Platform9 Managed OpenStack. We affectionately call it “dogfood” and use it to run a major part of our daily business: continuous integration testing.
Let me back up a bit. As you may have heard, Platform9 is in the business of enabling enterprise customers to aggregate their private infrastructure resources into flexible pools that can be carved out and utilized by a variety of state-of-the-art open source software, which Platform9 runs, manages and provides as a service. Managed OpenStack, our first offering based on this model, gives a customer a full OpenStack-powered private cloud that exploits internal servers, storage and networks, while hiding the complexities of installing, configuring, scaling, backing up, and upgrading the software. How do we do this? By hosting the cloud’s metadata and control software in a collection of elements called a Deployment Unit (DU). For security, each customer is assigned their own, isolated DU, which Platform9 manages.
In its simplest form, a DU is made up of a database and a server running OpenStack controller services, such as Glance and the control elements of Nova (API, scheduler, conductor, network). As a customer’s needs grow, the DU can scale to multiple servers, possibly spread geographically, providing a boost to both performance and availability. For simplicity, this article focuses on the 1-server DU form. We architected DUs to run on almost any compute infrastructure, such as a public cloud or our own data center. Like most young software companies, we are starting out by fulfilling our compute needs with Amazon Web Services (AWS).
+-------------------------------------------------------------+
| Deployment Unit (DU)                                         |
| +----------------+  +-----------------------+  +----------+ |
| | PF9 controller |  | OpenStack controllers |  | Database | |
| +----------------+  +-----------------------+  +----------+ |
+-------------------------------------------------------------+
                              /
                             /
          Internet          /
- - - - - - - - - - - - - -/- - - - - - - - - - - - - - - - - -
          Customer Intranet
                           /
+-----------------------------------------+      +--------+
| +----------------+ +------------------+ |      |        |
| | PF9 Host agent | | OpenStack agents | |      |        |
| +----------------+ +------------------+ |......|        |
| Host 1                                  |      | Host N |
+-----------------------------------------+      +--------+
Fig. 1
Like many contemporary cloud software designs, OpenStack uses a controller/slave architecture to manage and monitor infrastructure elements. For example, to manage compute hosts, OpenStack Nova requires a slave agent called nova-compute on each managed host. In the first Managed OpenStack release, a host can be any server that runs one of the supported Linux distributions (currently Ubuntu 12 and CentOS 6). To add compute capacity to their private cloud, a customer first deploys the zero-configuration Platform9 host agent to a set of hosts. Each host agent registers itself with the customer’s DU using a secure outbound connection. Once admitted through a step called authorization, the agent queries the DU for the correct software and configuration for its host, then automatically downloads, installs and configures that software. The software appropriate for a particular host is determined by the host’s assigned role(s). For example, a typical compute host receives a nova-compute agent, whereas an image library host receives a Platform9-specific software package that turns it into an image caching server. It is possible for a host to have multiple roles. Fig. 1 illustrates what a typical customer-specific Managed OpenStack deployment looks like.
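To make that flow concrete, here is a highly simplified, hypothetical sketch of an agent’s register-and-converge loop. The endpoint paths, payload fields and helper function are illustrative assumptions only; they are not the actual Platform9 agent protocol.

import time
import requests  # generic HTTPS client; the real agent's transport is an assumption here

DU_URL = "https://example-customer.platform9.net"  # hypothetical DU address
HOST_ID = "host-1234"                              # hypothetical host identity

def install_and_configure(role):
    """Stub: download, install and configure the packages for one role."""
    print(f"converging role: {role}")

def agent_loop():
    # Register with the customer's DU over a secure outbound connection.
    requests.post(f"{DU_URL}/hosts/{HOST_ID}/register", timeout=30)
    while True:
        # Ask the DU which software and configuration this host should have,
        # based on its assigned role(s), e.g. compute or image library.
        desired = requests.get(f"{DU_URL}/hosts/{HOST_ID}/desired_state",
                               timeout=30).json()
        for role in desired.get("roles", []):  # e.g. ["nova-compute"]
            install_and_configure(role)
        time.sleep(60)  # poll again after a minute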
Continuous Integration Testing
Ensuring high quality in our product is a top priority for Platform9. In order to test Managed OpenStack (MO), our continuous build & integration server (TeamCity) schedules a combination of unit/functional tests and integration tests. We generally maintain one git repository per functional component. Creating the binaries required for a Managed OpenStack release involves building and testing code from a total of about 12 repositories, some of them Platform9 versions of OpenStack components, and others containing pure Platform9 source code. The standard OpenStack code base includes extensive functional tests that are run upon every check-in to any OpenStack repo.
Integration tests are more complex and take longer to complete, since they test the product as a whole and therefore involve deploying and connecting components in their final form. Since the topology of a Managed OpenStack deployment is different from that of a typical OpenStack deployment, we developed many integration tests from scratch in order to test Platform9-specific scenarios, such as user authentication, host authorization, image library-served instance provisioning, network features, and NFS support. As of this writing, 17 large integration tests need to complete successfully for a Managed OpenStack build to be considered successful. Each test creates its own deployment composed of one DU and between 1 and 5 hosts (about 3 on average), and runs up to a dozen smaller sub-tests. By default, a test DU is composed of a single compute instance. A typical MO build therefore employs 17 DU instances and 3*17 ~= 50 host instances. During peak periods, the continuous integration server can trigger MO builds up to 10 times per day, which means we can instantiate and tear down upwards of 170 DU instances and 500 host instances in a single day.
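The back-of-the-envelope math behind those numbers is straightforward; the short sketch below simply restates the figures quoted above.

tests_per_build = 17   # integration tests per Managed OpenStack build
hosts_per_test = 3     # average hosts per test (actual range is 1 to 5)
dus_per_test = 1       # each test gets its own single-instance DU
builds_per_day = 10    # peak number of MO builds triggered per day

du_instances_per_day = tests_per_build * dus_per_test * builds_per_day      # 170
host_instances_per_day = tests_per_build * hosts_per_test * builds_per_day  # ~500

print(du_instances_per_day, host_instances_per_day)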
A Test Infrastructure Built on Virtualization
The previously discussed test load estimate is a snapshot of where we are today; of course, it was much lighter in the early days. Like most young companies, our size and needs have grown significantly over the first year, and so have the mistakes made and the lessons learned from them. To adapt to our testing needs, the infrastructure on which we run test instances, and the way we automate them, have evolved through the phases described below. Of particular interest is how we’ve leveraged virtualization and stacked the software layers.
Phase 1
When we started, we naturally tried to make test environments as close to production deployments as possible. This meant running DUs on AWS, and using real servers residing in Platform9’s modest datacenter as compute hosts, in order to recreate the same setup a typical customer would experience. We faced two issues with this approach. First, since the hosts were real Linux computers running CentOS or Ubuntu, we developed scripts that tried to restore them to a pristine state before each integration test. As anyone familiar with the concept of “bit rot” knows, this is easier said than done, and despite our efforts, over time our servers’ state drifted out of control, resulting in inconsistent test results and difficult-to-diagnose bugs. The larger problem was scale: we simply couldn’t afford to purchase enough hardware to match our growing test loads.
  +----------------------------------------------------------+
2 | Nova instance created during integration test             |
  +----------------------------------------------------------+
1 | PF9 compute/imagelib Linux KVM host on real server        |
  +----------------------------------------------------------+

Fig. 2
Phase 2
To address these issues, we modified the test infrastructure to run PF9 hosts as virtual machines on a hypervisor (initially ESX and later native Linux with KVM), as shown in Fig. 3. This introduced an additional layer, resulting in nested virtualization: a virtual hypervisor (layer 2) runs as a VM on a real hypervisor (layer 1). Both ESX and KVM support nested virtualization with varying degrees of reliability (we found ESX to be rock solid but slower, whereas KVM is faster but has some bugs & gotchas, something worthy of an article of its own).
  +----------------------------------------------------------+
3 | Nova instance created during integration test             |
  +----------------------------------------------------------+
2 | PF9 compute/imagelib Linux KVM host created by each test  |
  +----------------------------------------------------------+
1 | ESX or Linux KVM on real server                           |
  +----------------------------------------------------------+

Fig. 3
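As an aside, when setting up an environment like this it helps to verify that the layer-1 Linux/KVM host actually exposes hardware virtualization to its guests. The check below is a generic Linux convenience, not part of our test framework; it only reads standard sysfs/procfs locations.

from pathlib import Path

def nested_kvm_enabled():
    # Hardware virtualization flags advertised by the CPU (Intel VT-x / AMD-V).
    cpuinfo = Path("/proc/cpuinfo").read_text()
    has_hw_virt = "vmx" in cpuinfo or "svm" in cpuinfo

    # The kvm_intel / kvm_amd modules expose whether nesting is enabled.
    nested = False
    for module in ("kvm_intel", "kvm_amd"):
        param = Path(f"/sys/module/{module}/parameters/nested")
        if param.exists():
            nested = param.read_text().strip() in ("Y", "y", "1")
    return has_hw_virt and nested

if __name__ == "__main__":
    print("nested KVM available:", nested_kvm_enabled())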
We developed a set of library functions responsible for creating and tearing down the VMs holding the Platform9 Managed OpenStack hosts (layer 2). It was around this time that we completed the development of our first end-to-end integration tests, as well as a first version of our Python-based test framework. Our TeamCity integration server tied everything together by triggering builds and tests in response to code change events. This marked the first time in our history that we had end-to-end automated integration between code, builds and tests, and it significantly boosted our confidence in our ability to deliver Managed OpenStack at the level of quality we expect.
Phase 3: The Birth of “Dogfood”
As our components matured and started to gel into a cohesive whole that began to look like an enterprise product, we had more and more debates about the right feature set, the right user experience, and what expectations to set for reliability, security, performance and scalability. After endless whiteboarding, wireframing, and prototyping, we concluded that many questions could only be answered by turning ourselves into our own users; in other words, by running our business (or a major part of it) on Platform9 Managed OpenStack. In the carrot-and-stick metaphor, this was our stick.
Luckily, our continuous build, integration & test operations supplied the carrot. Our sprawling set of bash scripts and Python code for managing the virtual PF9 hosts (layer 2) on the bare metal hypervisors (layer 1) had grown in complexity and become a maintenance nightmare. We had no clean abstractions for dealing with template images, placement logic across hosts, and the different types of hypervisors (ESX, CentOS Linux, Ubuntu Linux). It was clear that OpenStack was better at solving these problems, with its consistent APIs and CLIs, resource abstractions, and automatic placement algorithms.
The “dogfood” project was thus born, and it has become one of the company’s biggest engineering investments to date. Dogfood is simply a full instance of Platform9 Managed OpenStack, with its DU running in “the cloud” (currently on AWS), exposed as a SaaS offering with a public DNS address, and with its compute, storage and network resources all residing in Platform9’s private datacenter.
In the first implementation, we virtualized the Platform9 hypervisor hosts, i.e. made them VMs running on the company’s base hypervisors. This meant inserting another virtualization layer into the stack, resulting in the diagram in Fig. 4. Since we couldn’t find a reliable way to nest an additional layer with full hardware virtualization support, we had to run the top-most virtual machine using QEMU emulation, which is extremely slow. Though we knew this would hurt performance, we did it for several reasons. First, virtual (as opposed to real) dogfood hypervisor hosts are easy to create, manage, and tear down, simplifying operations such as scaling up the number of hosts when needed. Second, this approach allowed us to use existing underutilized hardware, and gave us flexibility in planning the addition of more physical servers. Third, at that time, the majority of the tests that spawn instances only needed to check that the instances reached the active state, at which point the test could terminate early, so the speed at which the guest inside the instance ran didn’t matter much.
  +----------------------------------------------------------+
4 | Nova instance created during integration test             |
  +----------------------------------------------------------+
3 | PF9 compute/imagelib Linux KVM host created by each test  |
  +----------------------------------------------------------+
2 | Virtual Linux KVM on VM managed by PF9 dogfood            |
  +----------------------------------------------------------+
1 | ESX or Linux KVM on real server                           |
  +----------------------------------------------------------+

Fig. 4
As planned, we modified our automation code to take advantage of the OpenStack Nova API and CLIs. Bash scripts typically use the CLIs (e.g. the nova utility), whereas Python code makes REST calls. The life cycle of template images and compute instances could now be managed in a consistent way. For example, we use “pruning” scripts to delete old templates and instances from prior builds and test runs after a grace period. For various reasons, it is sometimes necessary to protect specific templates or instances from deletion. To implement this feature, we took advantage of Nova and Glance’s object metadata support to tag both templates and instances in a uniform way, using a metadata key named “dont_delete”.
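To illustrate the idea, here is a trimmed-down sketch of what such a pruning pass can look like using the standard python-novaclient bindings and the same “dont_delete” key; the grace period, credential handling and error handling are simplified assumptions rather than our production script.

import os
from datetime import datetime, timedelta, timezone
from novaclient import client as nova_client  # standard python-novaclient bindings

GRACE_PERIOD = timedelta(days=2)  # assumed grace period; the real one is configurable

# Credentials come from the usual OS_* OpenStack environment variables.
nova = nova_client.Client("2",
                          os.environ["OS_USERNAME"],
                          os.environ["OS_PASSWORD"],
                          os.environ["OS_TENANT_NAME"],
                          os.environ["OS_AUTH_URL"])

cutoff = datetime.now(timezone.utc) - GRACE_PERIOD
for server in nova.servers.list():
    # Instances tagged with the "dont_delete" metadata key are protected.
    if "dont_delete" in server.metadata:
        continue
    created = datetime.strptime(server.created, "%Y-%m-%dT%H:%M:%SZ")
    if created.replace(tzinfo=timezone.utc) < cutoff:
        server.delete()  # prune leftovers from old builds and test runs

An analogous pass over the image list handles stale templates.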
Phase 4: Dogfood Generation 2
The first implementation of dogfood worked well enough while we only had a handful of integration tests and the company was small. This changed over the following months as many new developers joined the team, more features were added, and a growing emphasis on product quality resulted in an explosion in the number of builds and test runs. Furthermore, the performance of QEMU emulation at the top layer was becoming unacceptable for some newly developed tests that do more interesting work with the guest OS running inside instances; for example, most of our networking tests verify connectivity between guests and other guests or external systems, which requires the guest to fully boot its operating system. Together, those issues caused random operations to time out and tests to fail. Our engineering productivity ground to a halt as a result.
We solved the performance problem by purchasing dedicated hardware to serve as dogfood hosts, eliminating a virtualization layer, as illustrated in Fig. 5. We figured that the loss in flexibility from having to install and configure Linux/KVM on bare metal (layer 1) was a reasonable tradeoff. We compensated by automating as much of the process as possible, for example by using PXE for the initial OS installation, and the Ansible configuration management tool for bringing each node to the desired software configuration.
  +----------------------------------------------------------+
3 | Nova instance created during integration test             |
  +----------------------------------------------------------+
2 | PF9 compute/imagelib Linux KVM host created by each test  |
  +----------------------------------------------------------+
1 | Linux KVM on real server managed by PF9 dogfood           |
  +----------------------------------------------------------+

Fig. 5
Phase 5: Dogfood Generation 3
Up to this point, recall that every integration test’s deployment unit (DU) had always been spawned in the public cloud (AWS). As mentioned earlier, this was done to stay as close to a true production deployment as possible. However, with the number of daily test DUs approaching, and then crossing, 100, our AWS bills quickly grew beyond our comfort level, a consequence not just of the cost of running instances, but also of auxiliary fees such as EBS storage, network traffic, and RDS databases.
We re-evaluated what each integration test was trying to achieve, and concluded that the majority of our tests retained their benefits even when their DUs ran on private infrastructure. We therefore started to convert our tests to have the option of creating their DUs on dogfood rather than AWS. To be safe, we kept both AWS and dogfood variants of our tests. We gradually shifted the load to the configuration we have today, wherein the dogfood variant forms the bulk of our daily continuous integration and is triggered after almost every code check-in, while the AWS variant runs periodically at a slower rate. Our AWS costs have since declined significantly and stabilized. However, running dogfood and supplying it with adequate hardware resources also costs time and money, so an accurate cost comparison would require a detailed analysis, a great topic we’re saving for another blog post.
Frequent Upgrades
Dogfood receives frequent upgrades and updates. Our policy is to fully deploy all code changes and feature additions to it weeks or months before we roll them to customer DUs. We currently perform those updates manually. We are planning to move to an automatic nightly upgrade.
Personal Use
In addition to being used for automated integration testing, our private cloud is available to every employee for personal use. On our Mac and Linux machines, it is a simple matter of installing the nova command line utilities and setting up a standard set of OpenStack environment variables:
[leb:/devel/pf9-main/testing]$ env|grep OS_
OS_USERNAME=XXXXXXX
OS_PASSWORD=XXXXXXX
OS_AUTH_URL=https://XXXX.platform9.net/keystone/v2.0
OS_TENANT_NAME=service
Most developers spawn temporary instances to troubleshoot a bug or work on a feature that takes a few days to implement. Others create longer-lasting instances for longer-term projects, or simply to host a personal development environment. We extensively use OpenStack’s metadata tagging feature to annotate instances and images with information about their owner and purpose. We also frequently take snapshots of existing instances, thereby creating images from a running instance, in case we want to revert to that state after running an experiment. In terms of the UI, I’d say that half of all users are comfortable with the CLIs, while the other half tend to prefer the Platform9 web user interface.
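For those who prefer to script these chores, the same tagging and snapshotting can be done with a few lines of python-novaclient; in this sketch the instance name, metadata keys and snapshot name are hypothetical examples.

import os
from novaclient import client as nova_client

# Build a client from the same OS_* environment variables shown above.
nova = nova_client.Client("2",
                          os.environ["OS_USERNAME"],
                          os.environ["OS_PASSWORD"],
                          os.environ["OS_TENANT_NAME"],
                          os.environ["OS_AUTH_URL"])

server = nova.servers.find(name="leb-dev-env")  # hypothetical instance name

# Annotate the instance with owner/purpose metadata (and protect it from pruning).
nova.servers.set_meta(server, {"owner": "leb",
                               "purpose": "personal dev environment",
                               "dont_delete": "true"})

# Snapshot the running instance into an image we can revert to after an experiment.
nova.servers.create_image(server, "leb-dev-env-before-experiment")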
Lessons Learned
An important prediction we made when creating dogfood was that, by running a major part of our business on Platform9 Managed OpenStack, any product defects we discovered would have an immediate impact on the company. This would give us the incentive to troubleshoot, root-cause and fix those defects as quickly as possible. This turned out to be absolutely true. Using the product in our daily lives and relying on it for mission-critical tasks revealed issues that were not exposed by our tests, for various reasons.
One example is the sheer scale of operations. Whereas our first generation of integration tests used fewer than 5 hosts and spawned between 1 and 10 compute instances, our daily build & test load involves running hundreds – and soon thousands – of instances. We learned that even when all compute hosts are idle, multiple subsystems in each nova-compute agent send periodic status updates to the Nova controller, such as the number of KVM virtual machines and their state. Each of those status messages increases the overall load on the controller and, at high enough loads, causes it to slow down when handling public API requests. This produces a cascading effect, as other build tasks start failing due to API call timeouts.
Another scale-related issue that manifested itself in various ways is the boot storm problem. As described earlier, an overall product integration build currently triggers 17 different integration tests. Those tests start in parallel and initiate a sequence of steps that ultimately result in the spawning of about 50 compute instances. Our integration build server is configured to allow up to 3 builds to run concurrently, which means that in the peak case more than a hundred instances can be spawned at around the same time. Despite using several reasonably powerful compute servers, we found that several software subsystems didn’t handle the spike as well as we expected. When confronted with a batch of requests, the nova-scheduler service (which makes placement decisions) occasionally appears to spawn too many instances on the same host, thereby overwhelming it. It seems as if its load balancing algorithm does not adapt quickly enough to the changing load on the host. We do not yet know whether this is a configuration or design issue, and are still investigating it. Also, when a host receives too many spawn requests at the same time, we found that the high CPU and I/O load from processing instance disks exposed a timing bug in the ‘libguestfs’ subsystem, which is used for configuring the network settings of guest operating systems. We’ve since fixed that bug.
Dogfood also allowed us to catch defects that surface in lightly loaded but long-running systems. The Nova services use numerous background tasks that wake up periodically to perform maintenance activities. Each activity has its own timer period setting. Some activities depend on the state left behind by other activities, and there is a wide spread in timer period settings, ranging from minutes to hours. This can sometimes lead to hard-to-find bugs that manifest themselves only after several days, and only under certain conditions.
Outside of finding defects and performance issues, the daily use of our own product also allowed us to identify product gaps that, if filled, would be extremely valuable not only to us, but also to our customers. For instance, using the web UI, an employee accidentally selected a large group of instances and deleted them. The loss of several important instances led us to debate various ways we could prevent such accidents, and every viable solution proposed was entered as a separate ticket in our defect tracking system. Like all tickets, those feature additions were prioritized, allowing us to decide which product version, if any, to release them in. On this specific issue, a UI enhancement was implemented to make the user more aware of a large destructive operation, with a final confirmation required; this enhancement is already in the product. A more advanced enhancement involves the addition of a write-protect flag implemented at the product’s API level, and is being planned for future releases.
Future Directions
We are planning to expand our use of dogfood in significant ways in the near future. First, we will expand our use of the product beyond test workloads. Most of our build and integration servers currently run on bare hypervisors (ESX and Linux KVM) and are manually managed; we intend to move those mission-critical systems onto hypervisors managed by our private cloud. Other candidates for migration include infrastructure systems (e.g. DHCP, DNS, and NFS) and employees’ personal development environments.
Secondly, a major feature planned for a future release is support for managing Linux containers, such as Docker containers. We are therefore exploring innovative ways to exploit containers in our own business. One example is an ongoing effort to take advantage of the light overhead of Docker for large-scale testing of thousands of compute nodes. Another use of this technology is to solve some state inconsistencies we’re experiencing with our build servers and to simplify their life cycle management. Containers should allow us to ensure each server always starts with a clean, consistent, known state, and starts quickly. While initial container management will be done manually with custom scripts, we will eventually adapt them to perform their work via the Platform9 Managed OpenStack APIs.
Final Words
Running our business on Platform9 Managed OpenStack has undoubtedly made our product better. By subjecting it to the daily rigors of our continuous build and test system, we exposed, and then fixed, many important issues that would have been missed under synthetic test scenarios. We also learned the strengths and weaknesses of the product through real world usage scenarios, and responded by closing important functionality gaps and refining the user experience. That was the “cake” part of the initiative, and it was by design.
Has the effort also made us more agile and efficient? The initial investment cost was undoubtedly high in terms of time, new hardware, and human capital. But now that the system has been stable and humming along for a while, we can say with confidence that Platform9 Managed OpenStack has significantly improved our dev/test operations and simplified large portions of our automation code base. And to top it off, we believe the effort is beginning to pay off in terms of cost savings on public cloud expenses as well. So yes, we believe that as long as we treat this project as a long-term investment, it becomes a gift that keeps on giving, and both we and our customers ultimately benefit.