Kubernetes in Production: Operating etcd with etcdadm

etcd is a critical part of the Kubernetes stack. etcd stores the state of the Kubernetes cluster, including node and workload information. Having multiple etcd members (a cluster) is a common strategy for ensuring the availability of the Kubernetes control plane, including the Kubernetes API. A highly available Kubernetes control plane is critical in production, especially when applications depend on ingress controllers, service meshes, functions-as-a-service frameworks, API gateways, and other services that integrate tightly with the Kubernetes API.

There are three high-level requirements for operating an etcd cluster in production:

Each etcd member must be bootstrapped: The etcd binary has to be on the host and the runtime parameters must be defined.
The list of members must be kept up to date: This list must come from some external source of truth, such as an infrastructure API or the administrators themselves.
Periodic backups of the etcd state must be configured.
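
The first two requirements can be sketched together: given a member list from some source of truth, derive the runtime parameters for one member. What follows is a minimal sketch, not how etcdadm itself does it. The `Member` type, the `etcd_flags` helper, the member names, the addresses, and the data directory are all hypothetical, and plain HTTP is used only for brevity (a production cluster would use TLS). Only the flag names and the default ports (2380 for peers, 2379 for clients) come from etcd.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    """One etcd member as reported by the source of truth (hypothetical type)."""
    name: str
    ip: str


def etcd_flags(me: Member, members: list[Member], new_cluster: bool = True) -> list[str]:
    """Build the etcd command-line flags for `me`, given the full member list."""
    if me not in members:
        raise ValueError(f"{me.name} is not in the member list")
    peer_url = f"http://{me.ip}:2380"      # peer traffic, etcd default port
    client_url = f"http://{me.ip}:2379"    # client traffic, etcd default port
    # Every member must agree on this name=peer-url list when bootstrapping.
    initial_cluster = ",".join(f"{m.name}=http://{m.ip}:2380" for m in members)
    state = "new" if new_cluster else "existing"
    return [
        f"--name={me.name}",
        "--data-dir=/var/lib/etcd",  # assumed path, adjust per host
        f"--listen-peer-urls={peer_url}",
        f"--initial-advertise-peer-urls={peer_url}",
        f"--listen-client-urls={client_url}",
        f"--advertise-client-urls={client_url}",
        f"--initial-cluster={initial_cluster}",
        f"--initial-cluster-state={state}",
    ]


if __name__ == "__main__":
    # Hypothetical three-member cluster.
    members = [
        Member("etcd-0", "10.0.0.10"),
        Member("etcd-1", "10.0.0.11"),
        Member("etcd-2", "10.0.0.12"),
    ]
    print(" ".join(["etcd", *etcd_flags(members[0], members)]))
```

Passing `new_cluster=False` switches `--initial-cluster-state` to `existing`, which is the value a member needs when it joins a cluster that is already running rather than bootstrapping a new one.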

Earlier this week, I had the pleasure of presenting in a CNCF-hosted webinar featuring etcdadm – a project recently adopted by the Kubernetes Cluster Lifecycle SIG.

etcdadm is a command-line tool with a kubeadm-like interface that makes it easy to meet the first requirement by simplifying the bootstrap process. The project’s goal is to also make it easy to meet the second and third requirements by programmatically querying a variety of APIs for cluster membership, and by integrating existing services to store backups. etcdadm will enable the automated operation of etcd clusters with a user experience analogous to CoreOS’s etcd operator, but without the prerequisite of a functional Kubernetes cluster.

I explained the etcdadm design in depth, and shared how our experience running etcd clusters in production informed that design. I discussed important etcd runtime parameters and caveats of dynamic cluster reconfiguration, and then demoed deploying a highly available etcd cluster using etcdadm, recovering the cluster from both partial and complete failures.
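
The difference between a partial and a complete failure comes down to quorum. etcd is built on Raft, so a cluster of n members needs a majority, floor(n/2) + 1, to commit writes or membership changes. The arithmetic can be sketched as follows; the function names are my own and are not part of etcd or etcdadm.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """How many members can fail while the cluster keeps quorum."""
    return members - quorum(members)


def has_quorum(total: int, healthy: int) -> bool:
    """True if `healthy` out of `total` members still form a majority."""
    return healthy >= quorum(total)


if __name__ == "__main__":
    for n in range(1, 6):
        print(f"members={n} quorum={quorum(n)} tolerates={failures_tolerated(n)}")
```

A two-member cluster has a quorum of two, so it tolerates no failures, while three members tolerate one. This is why dynamic reconfiguration needs care: adding a member raises the quorum before the new member has proven healthy.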

Watch the CNCF webinar replay to learn about etcdadm:

Note: while demonstrating how to create a new cluster seeded with a snapshot, I ran into an issue joining the third member to the cluster. At that point, the two-member cluster was healthy. Shortly after the webinar, I was able to join the third member successfully. I have not been able to reproduce the issue since, but I will keep trying until I find the root cause.

You can download the slides here.
etcdadm is also part of Klusterkit – an open-source toolkit to simplify deployment and operations of production-grade, highly available, multi-master Kubernetes clusters in on-prem, air-gapped environments. Learn more about Klusterkit here.

Get etcdadm on GitHub

Author

Platform9
