Why and How to Run Machine Learning Workloads on Kubernetes

If you’re a data scientist, you probably spend a fair amount of time thinking about how to deploy your machine learning models efficiently. You look for ways to scale models, distribute models across clusters of servers, and optimize model performance using techniques like GPU offloading.

These all happen to be tasks that Kubernetes handles very well. Kubernetes may not have been designed specifically as a machine learning deployment platform; indeed, Kubernetes will happily orchestrate any type of workload you throw at it. Yet, Kubernetes and machine learning are becoming fast friends as more and more data scientists look to K8s to run their models.

This article explains why Kubernetes is such a great fit for machine learning and how to get started deploying machine learning models on Kubernetes with the help of Kubeflow.

The Machine Learning Tools that Data Scientists Need

The typical workflow of a data scientist can be broken down into several basic steps:

Write a machine learning algorithm. The algorithm will analyze data that you feed into it in order to create a machine learning model.
Find a data set (or create one, if it doesn’t already exist) that you can feed into the algorithm to “train” it.
Prepare the data so that it’s ready to use for training. Data preparation may involve tasks like restructuring the data and fixing data quality issues.
Train the model based on the data set you’ve prepared. Training allows the algorithm to find patterns or other insights within the data set that the model can use to make predictions, generate recommendations, or otherwise create real-world value for your organization.
Test the model to ensure that it accurately delivers the insights that it was intended to provide.
After the model has been trained and tested, deploy it for production use, where it can generate insights for real users.

This is a simple overview of the machine learning process. Your approach may be somewhat different, but in general, these steps capture the essentials of machine learning.
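
To make the steps above concrete, here is a minimal sketch of the same workflow in plain Python. Everything in it is an assumption made for illustration: the synthetic data set, the helper names (prepare, train, predict, evaluate), and the choice of a one-variable least-squares fit as the "algorithm." A real project would use a library like TensorFlow or PyTorch, but the shape of the process is the same.

```python
import json
import random


def prepare(raw):
    """Data preparation: drop records with missing values and cast to float."""
    return [(float(x), float(y)) for x, y in raw if x is not None and y is not None]


def train(rows):
    """Training: fit y = slope * x + intercept by ordinary least squares."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    var = sum((x - mean_x) ** 2 for x, _ in rows)
    slope = cov / var
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def predict(model, x):
    """Inference: apply the fitted line to a new input."""
    return model["slope"] * x + model["intercept"]


def evaluate(model, rows):
    """Testing: mean squared error on rows the model has not seen."""
    return sum((predict(model, x) - y) ** 2 for x, y in rows) / len(rows)


# Synthetic data set (hypothetical): y = 3x + 2 plus a little noise,
# with one broken record that the preparation step should remove.
rng = random.Random(0)
raw = [(x, 3 * x + 2 + rng.gauss(0, 0.1)) for x in range(50)] + [(None, 1.0)]

rows = prepare(raw)
rng.shuffle(rows)
train_rows, held_out_rows = rows[:40], rows[40:]

model = train(train_rows)
mse = evaluate(model, held_out_rows)

# "Deployment" here is only serializing the model so that a separate
# serving process could load it.
artifact = json.dumps(model)
print(f"held-out MSE: {mse:.4f}")
print(f"model artifact: {artifact}")
```

Each function maps to one step in the list, which is also roughly how the work gets split into containers once it moves onto a cluster.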

The Role of Jupyter Notebooks

Data scientists often use Jupyter Notebooks to help manage the various steps outlined above. By inserting the code, data, and documentation that they need to train and deploy machine learning models in Jupyter Notebooks, data scientists can store all of the resources they require in a single location. They can also interactively modify Jupyter Notebooks as they proceed, which makes it easy to update the algorithm if, for example, testing reveals that the model is not performing as required.

On top of this, Jupyter Notebooks are easy to share, because the backend server that hosts them often runs remotely rather than being tied to a user’s local computer. This architecture also means that data scientists can leverage the computing power of remote servers when training models with Jupyter Notebooks rather than being limited by the resources of their local machines.

We’ll explain what Jupyter Notebooks have to do with Kubernetes in a bit.

Why Kubernetes Is the Right Solution for Machine Learning

First, though, let’s look at why data scientists are increasingly turning to Kubernetes to help streamline model training and deployment.

To be clear, you certainly don’t need Kubernetes to train a machine learning model (or to use Jupyter Notebooks, for that matter). You can run an algorithm, prepare a data set, and train a model on virtually any machine, or you can use certain cloud services (like Amazon SageMaker or Azure Machine Learning) that are designed specifically for model training.

However, Kubernetes offers some key advantages as a platform for training and deploying machine learning models. To understand what those advantages are, let’s first discuss some of the major challenges that data scientists face when running models:

Automated deployment: Unless you set up an automated release pipeline to run your models, you’ll have to deploy them manually. By default, there are no easily reusable code or commands that you can re-invoke for each new deployment.
Scaling: Model training often requires massive computing power. But scaling up training by spreading out the load across multiple servers can be difficult because there is no easy way to take a model that is running on one machine and scale it to another, unless the machines are organized into a cluster.
Infrastructure failure: If you do distribute your models across multiple servers, you run the risk that training will fail in the event that one of those servers crashes and you lack a means of automatically “failing over” to a different server.
Shared infrastructure: If a data science team invests time and resources in setting up a cluster, it’s likely that the team wants to run multiple models on it at the same time. However, depending on how the cluster is managed, there may not be an easy way to isolate workloads from each other.
GPU offloading: If you set up a cluster using virtual machines (an approach that offers more flexibility than running your models directly on bare-metal servers), it becomes harder to offload computation to GPUs. Offloading is a common technique for speeding up training, because GPUs typically offer significantly better performance than standard CPUs for this use case.

Kubernetes offers a solution to each of these challenges:

Reusable deployment resources: When you deploy any kind of workload on Kubernetes, you define the deployment in a YAML or JSON file, or by creating a Helm chart. Either way, you can easily tweak and reuse the deployment files when you need to redeploy. In this sense, Kubernetes provides an automated deployment solution by default, without requiring users to perform extra work or set up additional tools (although you could always create a CI/CD pipeline that feeds into Kubernetes if you want even more deployment automation).
Automated scaling: Kubernetes schedules workloads across a cluster of servers based on the resources they request, and it can scale them up or down automatically once you configure autoscaling (for example, with a Horizontal Pod Autoscaler).
Infrastructure management: Kubernetes automatically reschedules workloads in the event that one server (or node, to use Kubernetes parlance) in the cluster begins to fail or stops responding. That means a node failure doesn’t leave your training stalled until someone intervenes, although a rescheduled job picks up where it left off only if your training code saves checkpoints.
Multi-tenancy: Kubernetes supports multi-tenancy through features such as namespaces, resource quotas, and role-based access control, which let users isolate workloads from each other. Data scientists can create a single cluster and share it among multiple workloads or multiple teams.
GPU access: If you deploy models inside containers and properly provision your nodes, your models can directly access GPUs. And even if you run models in virtual machines, add-on frameworks like KubeVirt, which makes it possible to orchestrate VMs with Kubernetes, support GPU passthrough. Thus, however you choose to run your models on Kubernetes, they can leverage GPU offloading.
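
As a rough illustration of the reusable-deployment and GPU points above, the sketch below builds a Kubernetes Job manifest from a small Python function and prints it as JSON. The function name training_job, the job name, and the registry.example.com image are hypothetical. The nvidia.com/gpu resource only works on nodes where the NVIDIA device plugin and drivers are already installed, which is the "properly provision your nodes" part.

```python
import json


def training_job(name, image, namespace="default", gpus=0, retries=3):
    """Build a batch/v1 Job manifest for a single training run."""
    container = {"name": "trainer", "image": image}
    if gpus:
        # Extended resource advertised by the NVIDIA device plugin.
        # GPUs are requested through limits and cannot be overcommitted.
        container["resources"] = {"limits": {"nvidia.com/gpu": gpus}}
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Number of times a failed pod is retried before the Job
            # is marked as failed.
            "backoffLimit": retries,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [container],
                }
            },
        },
    }


# Hypothetical job name and image; change the arguments to redeploy
# a new model version with the same definition.
manifest = training_job("train-churn-v2", "registry.example.com/ml/train:0.2", gpus=1)
print(json.dumps(manifest, indent=2))
```

Saving the output to a file such as job.json and running kubectl apply -f job.json would submit it. Redeploying with a different image or GPU count is then a matter of changing the function arguments rather than repeating manual steps.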

In each of these ways, Kubernetes smooths over some of the most significant challenges that data scientists face in running models at scale.
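
For the shared-cluster case in particular, one common pattern is to give each team its own namespace with a resource quota. The sketch below is one possible way to generate those two objects; the function name team_tenant, the team name, and the quota numbers are all made up for illustration.

```python
import json


def team_tenant(team, cpu, memory, gpus):
    """Build a Namespace plus a ResourceQuota that caps what the team can request."""
    namespace = {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": team},
    }
    quota = {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": f"{team}-quota", "namespace": team},
        "spec": {
            "hard": {
                "requests.cpu": str(cpu),
                "requests.memory": memory,
                # Extended resources such as GPUs are limited through
                # the "requests." prefix.
                "requests.nvidia.com/gpu": str(gpus),
            }
        },
    }
    # A v1 List lets both objects be applied from a single file.
    return {"apiVersion": "v1", "kind": "List", "items": [namespace, quota]}


# Hypothetical team and limits.
bundle = team_tenant("fraud-models", cpu=16, memory="64Gi", gpus=2)
print(json.dumps(bundle, indent=2))
```

With a quota like this in place, one team’s oversized training job fails to schedule instead of starving the other teams’ workloads.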

Making things even sweeter is the fact that you can easily deploy JupyterHub on Kubernetes using Helm charts. That means that, in addition to running your models, Kubernetes can host all of your Jupyter Notebooks and their associated resources. Rather than having to rely on a third party to host the Notebooks or manually set up a single server, you can scalably host the Notebooks in the same Kubernetes cluster that you use to train and deploy your models.
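
As a hedged example of what that looks like in practice, the sketch below generates a small values file for the JupyterHub Helm chart that caps the CPU, memory, and storage of each user’s Notebook server. The function name jupyterhub_values and the specific limits are assumptions. The singleuser keys follow the chart’s documented configuration, but check them against the chart version you install.

```python
import json


def jupyterhub_values(cpu_limit, memory_limit, storage):
    """Build Helm values that bound each user's Notebook server."""
    return {
        "singleuser": {
            "cpu": {"limit": cpu_limit},
            "memory": {"limit": memory_limit},
            "storage": {"capacity": storage},
        }
    }


# Hypothetical per-user limits.
values = jupyterhub_values(cpu_limit=2, memory_limit="4G", storage="10Gi")
print(json.dumps(values, indent=2))
```

Helm reads values files as YAML, and JSON is valid YAML, so the printed output should work as a values file. Assuming you save it as values.json, something like helm repo add jupyterhub https://hub.jupyter.org/helm-chart/ followed by helm upgrade --install jhub jupyterhub/jupyterhub --namespace jhub --create-namespace --values values.json would install the hub (the release name and namespace jhub are placeholders).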

Getting Started with Machine Learning on Kubernetes Using Kubeflow

If the benefits that Kubernetes offers for machine learning deployment sound cool, here’s something that makes it even cooler: Kubeflow, an open source project that provides even more automation for training and running models on Kubernetes.

In a nutshell, the purpose of Kubeflow is to provide an easy-to-deploy, easy-to-use toolchain that lets data scientists integrate the various resources they need to run models on Kubernetes – such as Jupyter Notebooks, Kubernetes deployment files, and machine learning libraries like PyTorch and TensorFlow.

Using Kubeflow is simple and straightforward. You can quickly install it on any mainstream Kubernetes distribution using a prebuilt package (or, for distributions without official support, you can install from a Kubernetes manifest).

Once installed, Kubeflow offers Web UIs for managing a variety of training jobs, as well as preconfigured resources for deploying them. For example, check out this tutorial, which explains how to run a TensorFlow training job using TFJob, a custom resource provided by Kubeflow. As the tutorial shows, you can run the job by creating a YAML file that points TFJob to the container that hosts your training code, along with any credential data required to run the workload. You can then deploy it all with a few simple commands.

In addition to making it easy to train models with libraries like TensorFlow without having to install and set up the libraries manually, Kubeflow also automates distributed training, so you can take full advantage of the parallel computing capacity of your Kubernetes cluster. The Kubeflow tooling also integrates with GPUs that you have made accessible to your cluster, which simplifies GPU offloading. One downside is that Kubeflow currently doesn’t set up GPU access for you; you have to handle that on your own.
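
To give a sense of what a TFJob definition contains, here is a minimal sketch that builds one in Python and prints it as JSON (which kubectl accepts in place of YAML). The function name tfjob, the job name, the image, and the Secret name are hypothetical, and the fields reflect the kubeflow.org/v1 TFJob resource as we understand it; verify them against the Kubeflow version you run.

```python
import json


def tfjob(name, image, workers=2, gpus_per_worker=0, secret=None, namespace="kubeflow"):
    """Build a TFJob manifest with a configurable number of worker replicas."""
    # The training operator looks for a container named "tensorflow".
    container = {"name": "tensorflow", "image": image}
    if gpus_per_worker:
        # Only schedulable if GPU access is already set up on the nodes.
        container["resources"] = {"limits": {"nvidia.com/gpu": gpus_per_worker}}
    if secret:
        # Expose credentials from an existing Kubernetes Secret as
        # environment variables in the training container.
        container["envFrom"] = [{"secretRef": {"name": secret}}]
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "TFJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "tfReplicaSpecs": {
                "Worker": {
                    "replicas": workers,
                    "restartPolicy": "OnFailure",
                    "template": {"spec": {"containers": [container]}},
                }
            }
        },
    }


# Hypothetical job: four workers with one GPU each.
job = tfjob(
    "mnist-distributed",
    "registry.example.com/ml/mnist-train:1.0",
    workers=4,
    gpus_per_worker=1,
    secret="training-credentials",
)
print(json.dumps(job, indent=2))
```

Raising the workers argument is all it takes to request more replicas. The operator passes cluster information to each replica through the TF_CONFIG environment variable, but the training code inside the container still has to use a TensorFlow distribution strategy to take advantage of it.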

Another common machine learning task that Kubeflow radically simplifies is working with Jupyter Notebooks. Kubeflow includes built-in Notebook services, which you can access within the UI in order to set up Notebooks and share them with your team or teams.

Tips for Using Kubeflow in Production

While Kubeflow makes it simple to use Kubernetes as a place to run machine learning models, you’ll want to be aware of Kubeflow’s limitations before deploying it into production.

One limitation is that, although you can collect data from containers that are deployed with Kubeflow just as you would for any other type of container in Kubernetes, Kubeflow provides little in the way of native monitoring tooling. The takeaway here is that, in order to deploy Kubeflow-based workloads in production, you’ll need to educate yourself in the basics of Kubernetes monitoring and take steps to monitor your clusters. Otherwise, you may deploy your workloads only to find that Kubernetes performance issues cause training jobs to fail or run very slowly.

Along similar lines, Kubeflow doesn’t automatically verify that your cluster actually has sufficient resources to run a job before you deploy it. If you attempt to deploy a job without enough resources, the containers will hang in a “pending” state. This can be a risk if you attempt to automate deployments, leading to situations where you think a job was deployed, but your cluster sits there idly until you fix the resource issue.
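
One way to guard against this in an automated pipeline is to check for unschedulable pods after submitting a job. The sketch below is an assumption-laden example rather than a Kubeflow feature: the function name unschedulable_pods is made up, and the sample data is hand-written to mimic the structure that kubectl get pods -o json returns.

```python
def unschedulable_pods(pod_list):
    """Return (name, message) pairs for pods stuck in Pending because no node fits."""
    stuck = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        if status.get("phase") != "Pending":
            continue
        for cond in status.get("conditions", []):
            # The scheduler records why it could not place the pod
            # in the PodScheduled condition.
            if (
                cond.get("type") == "PodScheduled"
                and cond.get("status") == "False"
                and cond.get("reason") == "Unschedulable"
            ):
                stuck.append((pod["metadata"]["name"], cond.get("message", "")))
    return stuck


# Hand-written sample in the shape of `kubectl get pods -o json` output.
sample = {
    "items": [
        {
            "metadata": {"name": "train-worker-0"},
            "status": {
                "phase": "Pending",
                "conditions": [
                    {
                        "type": "PodScheduled",
                        "status": "False",
                        "reason": "Unschedulable",
                        "message": "0/3 nodes are available: 3 Insufficient nvidia.com/gpu.",
                    }
                ],
            },
        },
        {
            "metadata": {"name": "train-worker-1"},
            "status": {
                "phase": "Running",
                "conditions": [{"type": "PodScheduled", "status": "True"}],
            },
        },
    ]
}

for name, message in unschedulable_pods(sample):
    print(f"{name} cannot be scheduled: {message}")
```

In a real pipeline, you would replace the sample with the parsed output of kubectl get pods -n <namespace> -o json and fail the deployment step (or send an alert) whenever the function returns anything.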

Keep in mind as well that, while the Kubeflow Web UI simplifies some deployment and administration tasks, you can’t do everything you need to deploy workloads of any complexity by pointing and clicking within the UI. So, expect to have to learn the ins and outs of YAML and the Kubernetes CLI in order to run jobs in production.

Finally, a persistent issue for the Kubeflow project has been out-of-date documentation. As of the summer of 2021, many topics that are likely to be important for any data scientist using Kubeflow in production, such as logging and monitoring, are addressed only in archival versions of the documentation that may no longer be accurate. This isn’t the end of the world; you can likely find the help you need for working with Kubeflow within online user communities. Still, the spotty documentation can be a challenge if you are someone who likes to look up the information you need yourself rather than asking others.

Of course, it’s worth noting that Kubeflow is only one solution for deploying machine learning workloads on Kubernetes. Even if Kubeflow goes bust – an eventuality that some members of the data science community have predicted – you can take advantage of Kubernetes as a machine learning solution even without the automations and integrations that Kubeflow provides.

Conclusion

For data scientists seeking a scalable, reliable, automated means of running machine learning models with the help of Jupyter Notebooks, Kubernetes is an excellent solution. Although tools designed specifically for machine learning on Kubernetes, like Kubeflow, are not yet as mature as they ideally would be, spending a little time learning about Kubernetes and the features it offers for machine learning use cases is well worth it.

Author

Platform9
