Build infrastructure for apps
This is the fourth post in Chris Jones’s series on scaling Kubernetes for applications, GitOps, and increasing developer productivity. You can read part 3 here, or go back and read parts 1 and 2 here and here.
For the last decade, operations teams have delivered Infrastructure-as-Code (IaC) on public clouds and virtualized infrastructure, using tools from a marketplace that has stagnated and consolidated. But the landscape is changing. Organizations that are transforming their customer-facing services to meet demand, scale for new markets, and exceed customer expectations require a new approach. Infrastructure now needs to deliver a diverse set of services beyond traditional virtual machines, networking, and storage: messaging services, databases, observability, containers, and more. This change in service delivery, coupled with heightened security requirements and developers demanding velocity, has created a new market.
The need to deliver more than just compute started with the shift to containers. Platforms such as Docker Swarm and Mesos were born to simplify container orchestration for application teams; however, they provided little help for infrastructure teams. The continued evolution of containerized applications into cloud-native computing has blurred the line between infrastructure and application. This has created an opportunity for Infrastructure-as-Code to simultaneously deliver compute and additional services such as Kafka for messaging, Redis and Elastic for storage, and Prometheus for observability.
Infrastructure-as-Code in the cloud native world moves beyond virtual machines, load balancers, block storage, and DNS. It includes the core, critical applications that deliver essential services for development and production. Finally, infrastructure can be delivered as code and simultaneously include applications. The problem now is: how can teams that have only needed to deliver infrastructure pivot to support infrastructure and applications?
The solution is a combination of two existing disciplines: DevOps, which has invested in application delivery that automates build, test, and deployment, and operations, which has been creating scripts that leverage APIs to lay down the foundational infrastructure. Merging both approaches creates unified workflows that build infrastructure and manage the lifecycle of applications, all based on code. To do this in 2013, the tools would have included VMware + Puppet and Jenkins + Bamboo. To achieve this today, it’s Kubernetes + ? and ?
One approach would be to use Terraform for cluster lifecycle, and the Terraform Provider for Kubernetes to deploy the core applications. Terraform is a valid solution; however, Flux and ArgoCD are two ‘built for Kubernetes’ tools that provide more native controls. Either tool can be used to deliver a new Infrastructure-as-Code experience, one that handles infrastructure as well as applications. Given this, there is a case to be made for reviewing existing practices. Ask the question: which tools should I adopt today to enable success in the future?
Continuous Delivery for Core Infrastructure & Applications
The first step in delivering complete clusters at scale is to answer several questions: which applications must be deployed in each cluster across your environments? How will these applications be versioned? How will they be packaged? How will each application manifest (Helm, YAML, Kustomize) be controlled? And lastly, how will they be deployed?
Teams using Terraform to build a cluster may immediately lean toward the providers for Kubernetes and Helm. Teams not leveraging Terraform may look to the market for a solution that can deliver Kubernetes infrastructure plus the essential applications and policies that are required for a complete cluster.
To help everyone on this journey, the next section comprises a comparative view of Terraform, ArgoCD and Flux. These tools have been selected because each can be used to automate cluster lifecycles and applications, something that makes them unique solutions when compared to Jenkins, Jenkins X, Harness, and GitHub Actions. Critically, they’re available as open source and support declarative models of object management.
We are evaluating each tool across a number of factors, including:
- Ability to manage infrastructure & applications.
- How the tool manages and stores declared state.
- How drift is detected and resolved.
- How changes are rolled out.
- Architecture for multi-cluster, multi-region deployments.
- How each tool operates in a team/multi-user environment.
Assessment Overview
The matrix below details each measurement dimension, the assessment, and the scoring system. Below the matrix is a detailed introduction to each dimension, and further explanation on scoring.
Dimension | Assessment | Score (0-10) |
Infrastructure & App Management | Can the tool manage Kubernetes infrastructure as well as application deployments simultaneously? Community plug-ins or extensions are included in this assessment. | 0 – No support for infrastructure and application deployment. 5 – Limited; either infrastructure or applications are not based on declared state stored in Git. 10 – Both are supported and 100% GitOps compliant. |
Declared State Management | Where is the declared state stored, and is it shared by design? Is the state storage designed for code, and does it support versioning? Can the declared state be integrated into a CI/CD workflow? Is a multi-user environment an issue? | 0 – The declared state is not stored in a GitOps compliant repository. 5 – Declared state is stored in a GitOps compliant repository, but updates/changes are not version controlled. Multi-user access may cause blocking/locking issues. 10 – Declared state is 100% GitOps compliant and version controlled, and multi-user access is not an issue. |
Drift Detection, Resolution, & Diff | Does the platform support notifications as a standard feature? Does the platform support displaying declared state and running state as a code-based diff? How flexible is the resolution of drift? Can individual elements have different compliance/resolution settings? | 0 – No notifications are possible. Drift is overwritten without any mechanism for intervention. No diff support. 5 – Notifications may be configured; drift reconciliation may be intervened. Diff is complex, not native. 10 – Notifications are native. Reconciliation intervention is native and individual elements may have unique reconciliation settings. Diff is built in, and code/line based. |
Updates & Change Management | Are changes approved by a native workflow or is an external system required? Are changes GitOps compliant and is Push supported? | 0 – No change management exists. Changes only exist once a command has been executed. 5 – Limited change management exists. Changes need to be approved before implementation. 10 – Changes must be approved, history is tracked natively over time, and changes are initiated via a push mechanism. |
Multi-cluster, Multi-region Deployment | Is the architecture built for global scale? Is the deployment centralized, distributed, or both? Can users’ access be based on an enterprise identity? Can users’ access be based on granular RBAC configurations? | 0 – Each cluster and region requires its own deployment. 5 – Deployment is centralized and user RBAC is strict. No collisions can occur. 10 – Deployments can be centralized or decentralized, strict user RBAC is supported, and users cannot lock, block, or cause collisions. |
Multi-User Environment | Can multiple users edit a file simultaneously? Can users self-service to deploy environments? | 0 – Only a single user can augment the configuration files at a time. 5 – Multiple users can simultaneously interact with configuration files without issue. 10 – Multiple users can self-service the creation of environments. |
Note: Tools plus community add-ons/plugins.
This assessment may include the core tool and additional add-ons/plug-ins that have been built by the community. If an add-on/plug-in is included, it will be documented as part of the assessment.
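For readers who want to run the same assessment against their own shortlist, the matrix above can be recorded as a small scorecard. The following is only a sketch: the `Scorecard` class, its method names, and the `addons` field are hypothetical and not part of any tool discussed here, and it contains no actual scores.

```python
from dataclasses import dataclass, field

# The six dimensions from the assessment matrix.
DIMENSIONS = (
    "Infrastructure & App Management",
    "Declared State Management",
    "Drift Detection, Resolution, & Diff",
    "Updates & Change Management",
    "Multi-cluster, Multi-region Deployment",
    "Multi-User Environment",
)


@dataclass
class Scorecard:
    """Scores for one tool (hypothetical helper, for illustration only)."""

    tool: str
    scores: dict = field(default_factory=dict)
    # Community add-ons/plug-ins that were counted in the assessment.
    addons: list = field(default_factory=list)

    def rate(self, dimension: str, score: int) -> None:
        """Record a 0-10 score for one known dimension."""
        if dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dimension}")
        if not 0 <= score <= 10:
            raise ValueError("score must be between 0 and 10")
        self.scores[dimension] = score

    def unrated(self) -> list:
        """Dimensions that still need a score."""
        return [d for d in DIMENSIONS if d not in self.scores]

    def total(self) -> int:
        """Sum of all recorded scores."""
        return sum(self.scores.values())
```

Calling `rate()` once per dimension and checking that `unrated()` is empty before comparing `total()` values keeps the comparison between tools like-for-like.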
Assessment Criteria Explained
Infrastructure & App Management
Deploying only infrastructure is not enough to support Kubernetes. By design, a cluster supplies the means to run cloud native applications, but many of the essential elements are intentionally missing. CoreDNS, Metrics Server, and Load Balancers are three such essential elements. Many organizations may also view observability, logging, storage, and message bus services as essential and require a means to deliver these to Kubernetes clusters as part of the core infrastructure. In some scenarios, such as public cloud, providers may include optional ‘add-ons’ as part of the cluster configuration. Even though an add-on is installed as part of the cluster, it must still be configured with a desired state, upgraded, and maintained throughout the cluster’s lifecycle.
To score a 10, the tool needs to enable the simultaneous and uninterrupted deployment of a new cluster and the deployment of essential applications. This deployment must not require any human/manual intervention and must be achieved through minimal enhancements beyond the core product.
Declared State Management
Once a state is defined in Kubernetes, the control loop ensures it’s maintained. This ensures the cluster operates as configured but provides no assistance with understanding if the running state is the correct declared state. The practice of codifying your infrastructure and application manifests and then storing them in a Git repository is the foundation for GitOps, and provides an answer to what the declared state is.
Leveraging a Git repository introduces version control, distributed management, code reviews, workflows, and more. These capabilities make it the perfect location to store a declared state. It also removes the need to debate if the cluster is running the correct code, as the definitive source of truth is not the cluster.
To score a 10, the tool needs to leverage Git and seamlessly incorporate all features of Git, without creating any additional overhead or requiring additional plugins/add-ons.
Drift Detection, Resolution & Diff
Correctly reacting to drift, or a divergence from the declared state, is critical. React the wrong way and the consequences could be far-reaching. It may appear self-evident that changes should be immediately reverted/reconciled; however, flexibility is key to adoption.
To score a 10, the tool needs to assist users in quickly and clearly identifying which component is not compliant. The tool needs to be able to notify users of drift using external systems such as Slack, support reconciling of individual components, and provide out-of-the-box code level diff capabilities.
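To make "code level diff" concrete, here is a minimal sketch of a line-based comparison between a declared manifest and the running one, using only Python's standard `difflib` module. The function names and the sample manifests are assumptions for illustration; real tools typically compare structured objects and ignore fields the cluster populates itself, which this sketch does not attempt.

```python
import difflib


def manifest_diff(declared: str, running: str) -> list:
    """Return unified-diff lines between declared and running manifests."""
    return list(
        difflib.unified_diff(
            declared.splitlines(),
            running.splitlines(),
            fromfile="declared (git)",
            tofile="running (cluster)",
            lineterm="",
        )
    )


def has_drifted(declared: str, running: str) -> bool:
    """Drift exists when the two texts differ on any line."""
    return bool(manifest_diff(declared, running))


# Hypothetical example: someone scaled the Deployment by hand.
declared = "kind: Deployment\nspec:\n  replicas: 3\n"
running = "kind: Deployment\nspec:\n  replicas: 5\n"

if has_drifted(declared, running):
    for line in manifest_diff(declared, running):
        print(line)
```

The same diff lines could be forwarded to a notification channel, or used to decide per component whether to reconcile automatically or wait for a human.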
Updates & Change Management
The entire reason for leveraging code is to be able to systematically make changes, quantify the impact of a change, and move changes forward from lower environments, such as development, onto upper environments, such as production.
To score a 10, the tool needs to track a history of changes, support workflows with approvals, and execute changes via a push methodology. These three factors help reduce risk and remove human error.
Multi-cluster, Multi-region Deployment
Scale – the enemy of any good system. Scale can manifest in many ways; for the purpose of this analysis, we are looking at scale in terms of more than 5 clusters, more than 3 regions, and a team of at least 5 users. Such an environment multiplies the number of things that need to be maintained, and its distributed nature adds complexity.
To score a 10, the tool needs to simplify management, not multiply endpoints, and, importantly, not limit users’ access. Ideally, the tool will support a minimum of 3 role types (Read Only, Edit, and Admin). The more granular the multi-user controls, the higher the likelihood the tool will score a 10.
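As a rough illustration of the three role types named above, the sketch below models per-cluster grants and a permission check. The action names, the `REQUIRED_ROLE` mapping, and the shape of the `grants` dictionary are all hypothetical; each tool defines its own RBAC model.

```python
from enum import IntEnum


class Role(IntEnum):
    """The three role types, ordered from least to most privileged."""

    READ_ONLY = 1
    EDIT = 2
    ADMIN = 3


# Hypothetical mapping of actions to the minimum role they require.
REQUIRED_ROLE = {
    "view": Role.READ_ONLY,
    "sync": Role.EDIT,
    "delete_cluster": Role.ADMIN,
}


def is_allowed(grants: dict, user: str, cluster: str, action: str) -> bool:
    """Check a user's role on one cluster against the action's minimum role.

    `grants` maps (user, cluster) pairs to a Role, so the same user can
    hold different roles on different clusters or regions.
    """
    role = grants.get((user, cluster))
    if role is None:
        return False  # no grant on this cluster means no access
    return role >= REQUIRED_ROLE[action]


# Example grants: the same user is an editor in dev but read-only in prod.
grants = {
    ("alice", "dev-us-east"): Role.EDIT,
    ("alice", "prod-us-east"): Role.READ_ONLY,
}
```

Keying grants by cluster is one way to get more granular than a single global role, which is the kind of control this dimension rewards.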
Multi-User Environment
The traditional users of Infrastructure-as-Code are Operations and DevOps teams. These teams strive to configure environments that change infrequently and follow strict workflows. However, there is an inevitable requirement that multiple users will be interacting with the core files that control the environment, and this can cause issues. Furthermore, cloud-native developers can benefit greatly from self-service access to infrastructure, introducing a whole new dimension to ‘multi-user’ requirements for Infrastructure-as-Code.
To score a 10, the tool needs to allow multi-user concurrent updates to the core files that drive the infrastructure as well as multi-user access for self-service deployments.
Looking ahead
In the next few blog posts, we’ll dive into several tools and evaluate them using these criteria. In the meantime, check out the earlier posts in this series. Or, if you’re looking for a solution that can help you better manage and orchestrate your Kubernetes infrastructure, schedule a meeting with me. Even if you’re not in the market for a platform, I’d love to discuss how Platform9 might be able to help you in the future.