Kubernetes for Machine Learning

Machine Learning (ML) is rapidly becoming essential to businesses and institutions across the globe.  Each organization must meet the challenge of provisioning a computational infrastructure that can support a resource-intensive machine-learning pipeline. While moving computation to the cloud has become the standard response to the challenge of scalability, machine learning teams have specific needs that must be considered.

Fortunately, Kubernetes, the popular container orchestrator, addresses these requirements so that teams can apply the flexibility of cloud-native development and infrastructure to their machine-learning applications.


The phases of the machine learning process vary in the amount, type, and timing of the resources they require. Data preparation requires consistent, intensive computation, while model inference is less resource-heavy. Deep learning is characterized by bursts of activity that will bring the process to a halt if sufficient resources are not available. The machine learning workflow works best when each step in the process can be scaled up when needed and scaled back down when done.

The ability to scale resources on demand is one of the primary benefits of cloud computing. Kubernetes makes this process simple to orchestrate: users can allocate additional resources as needed by adding more physical or virtual servers to their clusters. Kubernetes also tracks which resources are provisioned and in use, such as the type and number of CPUs or GPUs present, or the amount of RAM available. The scheduler takes this existing allocation into account when assigning jobs to nodes, so that resources are allocated efficiently.
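
As a rough illustration of how a workload tells the scheduler what it needs, the sketch below builds a Pod manifest with CPU and memory requests and prints it as JSON, which kubectl accepts alongside YAML. The function name, pod name, and image are hypothetical and only serve the example.

```python
import json


def training_pod(name, image, cpu, memory):
    """Build a Pod manifest whose container declares resource requests and limits."""
    resources = {
        # The scheduler places the pod on a node with at least this much free capacity.
        "requests": {"cpu": cpu, "memory": memory},
        # Limits cap what the container may consume once it is running.
        "limits": {"cpu": cpu, "memory": memory},
    }
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {"name": "worker", "image": image, "resources": resources}
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical pod name and image, used only for illustration.
    manifest = training_pod("feature-prep", "registry.example.com/ml/prep:latest", "4", "16Gi")
    print(json.dumps(manifest, indent=2))
```

Setting requests equal to limits, as here, is one possible choice that makes a job's footprint predictable; teams with bursty steps may prefer lower requests and higher limits.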

While simply being able to scale resources is a great advantage over needing to physically allocate servers, this scaling must be automated in order to ensure efficient utilization.  This is especially true in machine-learning workloads that require unpredictable bursts of additional computation power.

Kubernetes offers the Horizontal Pod Autoscaler, which scales the number of pod replicas, and the Vertical Pod Autoscaler, which adjusts the memory or CPU allocated to existing pods. If you need another layer of automation on top of this, you can also autoscale the cluster itself, so that new nodes are added when the existing ones cannot host the extra pods generated by these autoscalers.
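
The Kubernetes documentation describes the Horizontal Pod Autoscaler's core calculation as the current replica count multiplied by the ratio of the observed metric to its target, rounded up. The sketch below reproduces that arithmetic in simplified form. The function name and the default replica bounds are assumptions for illustration, and the real controller also accounts for details such as pod readiness and stabilization windows that are left out here.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Simplified version of the Horizontal Pod Autoscaler scaling rule."""
    ratio = current_metric / target_metric
    # Skip scaling when the metric is close enough to the target
    # (0.1 is the documented default tolerance).
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Keep the result within the configured replica bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # Four replicas averaging 90% CPU against a 60% target.
    print(desired_replicas(4, 90, 60))
```

With four replicas at 90% utilization and a 60% target, the ratio is 1.5, so the rule asks for six replicas. A burst that would call for more than `max_replicas` is capped at that bound.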

While the inherent scalability of Kubernetes ensures that ML teams can manage their resources efficiently, being able to automate that scaling and support granular scaling ensures that they actually will, even when they can’t predict their needs.

GPU Support

If possible, GPUs (graphics processing units) should be used for deep-learning training, because they are significantly faster than CPUs for this kind of work. However, managing an end-to-end GPU stack requires layers of drivers and dependencies in which small discrepancies can lead to big headaches. Containers make it easy to reproduce the environment necessary to support computation on GPUs.

Kubernetes can manage AMD and NVIDIA GPUs through vendor device plugins, and NVIDIA offers “Kubernetes on NVIDIA GPUs” software that automates the management of GPU-accelerated application containers. These tools enable ML teams to leverage the speed of GPUs within a containerized workflow.
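
A container asks for GPUs through an extended resource name in its limits, such as `nvidia.com/gpu` or `amd.com/gpu`. The sketch below builds such a manifest. It assumes the matching vendor device plugin is already installed on the GPU nodes, and the function name, pod name, and image are hypothetical.

```python
import json


def gpu_training_pod(name, image, gpus=1, gpu_resource="nvidia.com/gpu"):
    """Build a Pod manifest that requests whole GPUs via an extended resource."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "OnFailure",
            "containers": [
                {
                    "name": "trainer",
                    "image": image,
                    # GPUs are requested in whole units under limits.
                    "resources": {"limits": {gpu_resource: gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical pod name and image, used only for illustration.
    pod = gpu_training_pod("resnet-train", "registry.example.com/ml/train:latest", gpus=2)
    print(json.dumps(pod, indent=2))
```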

Data Management

Machine learning requires large sets of data for training and testing. Although early feature engineering may be done manually, the transformation of this data should be automated as soon as possible in order to ensure reproducibility. The results of these transformations may also need to be saved temporarily, which can suddenly double the team’s storage needs.

A flexible data repository, such as a cloud-based data lake, is a more efficient way to meet these needs than static, on-premises hardware. This solution can also support the high-throughput access required by training and inference workloads, without requiring additional data replication.

Kubernetes provides a single access point for diverse data sources and manages volume lifecycle, enabling teams to provision exactly the cloud-based storage they require, while reducing complexity.  Kubernetes also works with a host of third-party tools, like Apache Hadoop or Spark, which allow complex data transformations to be automated and executed across clusters.
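
One way teams request storage in Kubernetes is a PersistentVolumeClaim, which states how much space is needed and leaves provisioning to the cluster. The sketch below builds such a claim. The function name, claim name, and storage class name are hypothetical, and the available storage classes depend on the cluster and cloud provider.

```python
import json


def dataset_claim(name, size, storage_class, access_mode="ReadWriteOnce"):
    """Build a PersistentVolumeClaim manifest for a dataset or scratch volume."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": [access_mode],
            # The storage class selects which provisioner backs the volume.
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


if __name__ == "__main__":
    # Hypothetical claim for temporary feature-engineering output.
    claim = dataset_claim("transformed-features", "500Gi", "standard")
    print(json.dumps(claim, indent=2))
```

A claim like this can be created when a transformation step starts and deleted when its results are no longer needed, which fits the temporary spikes in storage described above.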


The machine learning process is made up of a diverse set of workloads, which are often managed by separate teams. However, separating these workloads into their own dedicated hardware environments creates unnecessary siloing and inefficient use of resources. It is both simpler and more efficient to build the process on shared environments that can support the needs of multiple concurrent workloads.

Kubernetes offers the ‘namespaces’ feature, which enables a single cluster to be partitioned into multiple virtual clusters. Each namespace can be configured with its own resource quotas and access control policies. This allows a single cluster to more easily support different steps of the machine learning workflow.
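
As a sketch of how a per-team partition might be declared, the code below builds a Namespace together with a ResourceQuota that caps the CPU, memory, and GPUs the team's pods may request. The function name, team name, and quota values are hypothetical, and the GPU entry assumes NVIDIA's `nvidia.com/gpu` resource name.

```python
import json


def team_namespace(team, cpu, memory, gpus):
    """Build a Namespace and a ResourceQuota that limits what it may request."""
    namespace = {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": team},
    }
    quota = {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": f"{team}-quota", "namespace": team},
        "spec": {
            "hard": {
                # Totals are summed across all pods in the namespace.
                "requests.cpu": cpu,
                "requests.memory": memory,
                "requests.nvidia.com/gpu": gpus,
            }
        },
    }
    return [namespace, quota]


if __name__ == "__main__":
    # Hypothetical team name and limits, used only for illustration.
    for manifest in team_namespace("model-training", "64", "256Gi", "8"):
        print(json.dumps(manifest, indent=2))
```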


Data pipeline abstraction

Orchestrating a machine learning pipeline is a tricky business, and managing the details of how the input from one step gets passed to the next can be overwhelming. Many machine learning teams rely on libraries like TensorFlow or Keras to abstract the process so they can focus on the overall logic. Fortunately, these libraries are compatible with Kubernetes and there is ample documentation available for how to incorporate them into a containerized workflow.

Infrastructure abstraction

As much as possible, the details of the infrastructure should be abstracted as well. The task of managing the resources that run machine learning models should not be more complex than the task of developing them. Kubernetes provides a layer of abstraction for managing containers, with a number of easy-to-access interfaces from which workloads can be manipulated. Thus, while managing containers is certainly a skill that takes time to learn, Kubernetes at least shields users from the complexity of the underlying tech stack.

For even further abstraction, there are platforms available that simplify the process of getting set up with Kubernetes for ML use cases. One popular framework is Kubeflow, which offers a user interface from which teams can configure and monitor their containerized machine learning pipeline. Seldon is an additional tool, which extends Kubeflow and is specifically built for Kubernetes deployments. For teams considering more complex Kubernetes ML workflows, these services are particularly useful. Managed services, such as Platform9 Managed Kubernetes, further help organizations unify their ML processes for hybrid cloud models, which use both on-prem data stores and cloud services. They do so by simplifying the management and Day 2 operations of the underlying Kubernetes infrastructure across different environments. By enlisting the help of these third-party tools, data scientists can reap the benefits of Kubernetes while still focusing on their ML models.


A containerized, cloud-based machine learning workflow orchestrated by Kubernetes meets many of the challenges posed by the computational requirements of machine learning. It provides scalable access to CPUs and GPUs that automatically ramps up when computation needs spike. It provides multiple teams with access to data storage that grows to meet their needs and supports the tools they rely on for manipulating those datasets.

A layer of abstraction gives data scientists access to these services without worrying about the details of the infrastructure underneath it. As more groups look to leverage machine learning to make sense of their data, Kubernetes makes it easier for them to access the resources they need.

