How do you run Kubernetes at scale in production and simultaneously unleash the potential of your engineering teams?
The question may seem either redundant or to conflate the solution to one problem with the solution to another. In truth, it does neither.
Unleashing the potential of development teams at scale
The solutions for running at scale in production and unlocking the potential of engineering teams are closely related and depend on each other. Scale in production can mean a few things: a small number of very large clusters (1,000 nodes), a large number of smaller clusters (50+), or a set of clusters running across multiple cloud regions and cloud providers. Each has its own set of unique problems.
Organizations running at massive scale with a few large clusters hit the ‘blast radius’ problem: if 100% of user-facing applications operate on a single Kubernetes cluster and that cluster fails, then all services are offline.
Organizations running a large number of small clusters hit the ‘pet’ problem; each cluster becomes so important that any change is a monumental lift.
Organizations running across regions and clouds hit the ‘consistency’ problem; as configurations, or manifests, are tuned, the reasons, logic, and benefits behind each change are forgotten, and troubleshooting becomes a nightmare.
Unfortunately, solving for blast radius leads to pets, and solving pets leads to consistency issues.
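The blast radius trade-off can be made concrete with a little arithmetic. The sketch below is illustrative only: the service names, cluster names, and the `blast_radius` helper are hypothetical, and it assumes each service runs on exactly one cluster.

```python
from collections import Counter


def blast_radius(placement: dict[str, str]) -> dict[str, float]:
    """Map each cluster to the fraction of services lost if that cluster fails.

    `placement` maps a service name to the single cluster it runs on
    (an assumption made to keep the example small).
    """
    if not placement:
        return {}
    per_cluster = Counter(placement.values())
    total = len(placement)
    return {cluster: count / total for cluster, count in per_cluster.items()}


# Hypothetical placements: twelve services on one cluster vs. spread over four.
single = {f"svc-{i}": "prod" for i in range(12)}
spread = {f"svc-{i}": f"prod-{i % 4}" for i in range(12)}

worst_single = max(blast_radius(single).values())  # every service is lost
worst_spread = max(blast_radius(spread).values())  # a quarter of services are lost
```

Spreading the services shrinks the worst case, but it also quadruples the number of clusters to operate, which is where the ‘pet’ and ‘consistency’ problems come in.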
Furthermore, solving each independently creates entirely new sets of operational challenges, such as:
- A team that once maintained a single monolithic Kubernetes cluster is now operating many more, which instantly puts pressure on any manual process.
- Each cluster needs to be identical, and consistency is a must, otherwise application deployments fail.
- Recovery from a cloud region going offline needs to be fast, and what works in AWS may not work on-prem or in Azure.
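One way to think about the consistency requirement is as a fingerprint comparison across clusters. The following is a minimal sketch, not a real tool: the cluster names and configuration keys are made up, and it assumes each cluster's configuration is already available as a plain dictionary.

```python
import hashlib
import json
from collections import defaultdict


def fingerprint(config: dict) -> str:
    """Hash a configuration in a key-order-independent way."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_drift(clusters: dict[str, dict]) -> list[str]:
    """Return the clusters whose config differs from the most common one."""
    groups = defaultdict(list)
    for name, config in clusters.items():
        groups[fingerprint(config)].append(name)
    if len(groups) <= 1:
        return []  # every cluster matches (or there are none)
    baseline = max(groups.values(), key=len)
    return sorted(
        name
        for members in groups.values()
        if members is not baseline
        for name in members
    )


# Hypothetical cluster configurations; names and keys are illustrative.
clusters = {
    "aws-us-east": {"ingress": "nginx", "replicas": 3},
    "aws-eu-west": {"replicas": 3, "ingress": "nginx"},
    "azure-westeurope": {"ingress": "nginx", "replicas": 2},
}
drifted = find_drift(clusters)
```

Here `drifted` contains only `azure-westeurope`, because the two AWS entries differ in key order but not in content. Treating the majority configuration as the baseline is a simplification; in practice the baseline would more likely come from a versioned source of truth.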
Adding further torment to operations teams is a paradigm shift. Clusters shouldn’t be long-lived, and troubleshooting clusters shouldn’t be a drawn-out, futile exercise. Clusters should be disposable.
Engineering needs: The other side of the coin
These issues represent only half of the problem space. Engineering organizations have needs of their own that must still be met.
Engineering teams are moving on to Shadow IT 2.0: they’re moving to Kubernetes and they’re choosing when, where, and how. Containerized applications and microservices are the pathways to the promised land of high availability, better scaling, greater resiliency, and happy customers. Cloud-native is the goal, and engineering teams are finding their own ways to execute without limitations.
Knowing this, operations teams are trying to deliver, but they are often too slow, enforce processes that degrade performance, and limit user access in ways that further impede velocity. Shadow IT 2.0 stems from the shift to cloud-native, but the cause is a quest for simplicity, speed, and in some cases, a wish to avoid working with an over-encumbered operations team. Either way, engineering teams are charging forward no matter what centralized operations teams are saying or, more importantly, implementing.
Approaches to delivering cloud-native to engineering teams
There are a few common approaches DevOps and PlatformOps teams take to deliver cloud-native environments to their engineering teams.
The first is local development with Docker Desktop, Minikube, or kind. Second comes dedicated long-lived clusters. Third are namespace-bound shared clusters with quota limits and complex role-based access controls. Fourth comes virtual clusters using virtual kubelet. Fifth are homegrown platforms that sit atop one or more Kubernetes clusters and abstract away raw K8s. Finally, the sixth approach is the keys to the kingdom: build your own and we will figure it out in production.
Each approach introduces overhead and limitations of its own. Each ultimately reduces productivity and increases operations overhead, a lose-lose scenario.
- Local development limits integration tests, and often pushes the most grueling work into shared environments where engineers have limited access.
- Dedicated clusters improve speed for engineers, but configuration drift and lost manifests result in production failures, not to mention more clusters to upgrade, patch, and support.
- Namespace-based logical separation with native Kubernetes access introduces resource contention, large clusters with massive blast radiuses, quota management complications, increasingly complex RBAC, and networking conflicts for operations teams to manage. At the same time, it does little to improve productivity for engineers, who often don’t have the access they need to move quickly or get stuck waiting for helpdesk tickets to be resolved.
- Virtual kubelet builds on top of the namespace approach, essentially virtualizing clusters so that engineers have ‘their own’ cluster. In reality, no dedicated infrastructure exists behind it. This is a functional solution that appears to be a win-win. In practice, operations teams come to realize that the virtual clusters still need to be managed, that configurations can be lost and manifests changed, and that shared resources increase cluster sizes, so outages become monumental. Engineers gain freedom, often with wide-sweeping access, and can move quickly; however, integration testing often means deploying more than just their own service.
- PaaS, the final frontier. As organizations look holistically at the changes and challenges of cloud-native, a PaaS can appear to be the solution. Remove the complexity, abstract Kubernetes away so that engineers don’t need to ‘know’ about it, remove the need for access, and go. This approach is fundamentally flawed, requires a massive investment in work hours, and ultimately requires unending upkeep to handle the rate of change in Kubernetes. This is captured in a story Adobe recently shared on The New Stack about their own Development Platform, where, in their own words, their “(teams’) requirements were more and more varied and didn’t quite fit the mold of what our abstraction could provide”.
- The keys to the kingdom: the ultimate in speed and the ultimate in complexity. Let each team build as they like and handle everything in an integration, staging, or production environment. In the initial stages this may function well; however, as the number of clusters increases, so too does the number of configurations and application manifests to manage. And as workloads and services grow in number, their need for integration testing keeps pace.
The balance between developer productivity and operational overhead
To avoid missed business objectives, outages in production, and more shadow IT, a balance must exist between developer productivity and operational overhead. That balance – a mix of each of these solutions – delivers an environment where teams can self-service real clusters that are built as complete environments, where changes are not lost, where policies are consistently applied, and where resources are not shared.
Operations teams must:
- Avoid operating large clusters, and instead run multiple small ones
- Not be dependent on a single cloud region, and instead operate across multiple
- Be able to recover fast in any location
- Treat clusters as disposable units
- Maintain security and constantly improve
- Enable engineers to work with Kubernetes without burden
- Share the responsibility for application configuration
To do this, a new approach to Infrastructure-as-Code and GitOps is required.
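As a rough illustration of what treating clusters as disposable units can look like in a declarative model, the sketch below compares a desired set of clusters with the set that actually exists and plans the difference. Everything here is an assumption for illustration: the `ClusterSpec` fields, the cluster names, and the `plan` function are hypothetical and do not represent any particular tool's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterSpec:
    """Hypothetical declarative description of one cluster."""
    name: str
    region: str
    version: str


def plan(desired: list[ClusterSpec], actual: list[ClusterSpec]) -> list[tuple[str, str]]:
    """Return (action, cluster name) pairs that move `actual` toward `desired`.

    A cluster whose spec has changed is replaced rather than patched in
    place, which is the 'disposable' part of the model.
    """
    want = {c.name: c for c in desired}
    have = {c.name: c for c in actual}
    actions = []
    for name in sorted(want.keys() | have.keys()):
        if name not in have:
            actions.append(("create", name))
        elif name not in want:
            actions.append(("delete", name))
        elif want[name] != have[name]:
            actions.append(("replace", name))
    return actions


# Illustrative inputs: the desired state would normally live in version control.
desired = [
    ClusterSpec("a", "us-east", "1.28"),
    ClusterSpec("b", "eu-west", "1.28"),
]
actual = [
    ClusterSpec("a", "us-east", "1.27"),
    ClusterSpec("c", "us-west", "1.27"),
]
actions = plan(desired, actual)
```

With these inputs the plan replaces `a`, creates `b`, and deletes `c`. Because the desired state is plain data, it can be reviewed and versioned like any other change, which is the property the GitOps approach relies on.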
This blog is the first in a multi-part series on the challenges enterprises face in balancing operational and business needs with developer productivity, and on the tools available to help with these challenges.