An argument for Kubernetes and Machine Learning

There’s a big problem brewing in the world of data engineering, high-performance computing (HPC), and machine learning. Since the release of ChatGPT, the once quiet but critical work of the world’s data teams has faced rising demands, new scrutiny, and questions about ROI. ChatGPT spurred awareness, with ‘features’ that almost anyone can understand, but the simultaneous macroeconomic shift has made this ‘simple’ form of AI the layman’s benchmark, for better or worse. The question now stands: how do data teams respond?

AI or machine learning models that deliver a competitive advantage require accuracy, constant refinement, monitoring, and the ability to release to production frequently. To deliver this, data teams are faced with an onslaught of new tools and new infrastructure. But here’s the thing: a lot of data science teams aren’t ready for this change. You can’t simply ask ChatGPT and magically transform.

The choices are overwhelming, and in a field that is comparatively young, organizations need to take a measured approach and ensure their resources are positioned to drive the highest ROI. The real question is which approach will work, and where a data team should focus to see the best outcomes. A recent study by PwC offers an informed opinion, outlining the approach that the most successful organizations have taken:

What sets these companies apart, the data indicates, is that instead of focusing first on one goal, then moving to the next, they’re advancing with AI in three areas at once: business transformation, enhanced decision-making and modernized systems and processes. Of the 1,000 respondents in our survey, 364 “AI leaders” are taking this holistic approach and reaping the rewards. 

If new demands, questions on ROI, and tough decisions were not enough of a problem, data teams are also facing the fact that there simply isn’t enough physical infrastructure available to execute model training, and there’s a serious shortage of skilled professionals.  

It’s a real struggle for data teams to get things done. 

So, what can leaders do to help increase the odds of success? Leaders need to focus on what drives ROI, such as the ability to easily access the right AI/ML tools, and remove the burden of necessary but complicated and time-consuming aspects, such as managing the infrastructure the workloads run on. In other words, solve the “modernized systems and processes” piece without adding a mountain of work.  

Data teams need a system that’s ready to go, without a lot of hassle. That way, data scientists can focus on what they do best: training and fine-tuning models. And MLOps teams can concentrate on important tasks like monitoring models, deploying into production, and making sure the user experience for their data science team is smooth and self-serve. 

What options exist today to help? Much of the industry is allowing teams to self-select tooling, often landing on client tools that run on an individual’s workstation and may be able to run jobs remotely. For every tool choice, there is a corresponding infrastructure choice. Below is a high-level summary to illustrate the trade-offs.

Today’s Options – Cloud Services 

Public Cloud Services (e.g., SageMaker): 100% public cloud-based turn-key solutions limit users to a single cloud and limit capabilities to the vendor’s ability to iterate. For organizations looking to gain a competitive advantage, this style of service may not be the best fit. Costs are dependent on the cloud’s pricing, and lock-in is a real risk.

Today’s Options – Data Science Client Tools  

Client Tooling and Public Cloud IaaS: Client tools such as Jupyter, combined with batch services or letting users SSH into single machines, provide limited economies of scale. Resources are easily wasted, jobs get stuck in single-threaded queues, or costs skyrocket as more and more compute is deployed. This is where many teams find themselves: leveraging client-based tooling that has been in place for years and has little friction or learning curve for data science users.

One way to increase ROI would be to leverage infrastructure outside of the public cloud and continue to use client tooling.  

VMs / Physical Servers: Purchased hardware that is specialized for AI/ML workloads can be significantly cheaper to buy than to lease from public cloud providers. This does, however, require an ability to forecast demand and, as with public cloud, velocity suffers without software that allows for seamless scheduling.

Users can reduce the operational burden of managing hardware by using:

Managed Data Centers: A refined approach to operating infrastructure that can strike a balance between renting and owning (OpEx vs. CapEx).

Co-los: More hands-on than a managed data center, but reduces costs by shifting the operational burden to internal staff.

On-Premises: 100% owned and operated, offering the ultimate in flexibility and privacy. But unless you are already operating a data center, this is unlikely to be an option.

Another approach is to look at tooling that enables more of a shared compute environment, such as Metaflow or Nextflow.

A shift in the software industry 

With the industry trending toward building new software with container technology, much of the market is adopting Kubernetes as the underlying infrastructure for hosting and running generic applications and services. Compounding this, nearly all open-source solutions are container-first, and AI/ML workloads are no exception.

For data science teams, and their AI/ML workloads specifically, it’s a perfect storm: model development and training have been trending toward Docker for many years, and the tooling that supports Docker applications has all landed on Kubernetes.

The combination of existing containerized ML models and new tooling built to run natively on Kubernetes means that data science teams face headwinds if they attempt to operate on anything other than Kubernetes.

Options for Kubernetes and AI/ML 

The simple approach would be to leverage client tools that interact with Kubernetes, minimizing the change a data scientist experiences. Both Metaflow and Nextflow enable this, as the sketch below illustrates.
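To make that concrete, here is a minimal Metaflow sketch in which a single decorator pushes one step of a locally defined workflow onto a Kubernetes cluster. The resource values are illustrative, and this assumes a Metaflow deployment already configured with a Kubernetes backend:

```python
from metaflow import FlowSpec, step, kubernetes


class TrainFlow(FlowSpec):
    @step
    def start(self):
        # Runs locally on the data scientist's workstation.
        self.next(self.train)

    @kubernetes(cpu=4, memory=16000)  # This step is shipped to the cluster.
    @step
    def train(self):
        # Training code runs in a container on Kubernetes;
        # the client-side workflow definition stays unchanged.
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    TrainFlow()
```

The appeal of this style is that the data scientist’s day-to-day experience is still “run a Python script,” while the heavy lifting moves to shared, scheduled infrastructure.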

Client/Server Tooling on Public Cloud K8s: An evolution of IaaS, Kubernetes allows for 100% elasticity and natively supports short- and long-running workloads as well as scheduling. Kubernetes is popular in the application world because it can drive a higher ROI on compute spend, but it drastically increases operational complexity. For data science teams, Kubernetes is the target platform for much of the new generation of tooling, and since most ML workloads are already containerized, the evolution seems clear. Public cloud provides a fast-start button but doesn’t remove the need to support upgrades, troubleshoot, or manage all the essential Kubernetes add-ons.
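As an illustration of what “natively supports short- and long-running workloads” means in practice, here is a hedged sketch that submits a containerized training run as a Kubernetes Job using the official Python client; the image, command, and namespace are placeholders:

```python
from kubernetes import client, config

config.load_kube_config()  # Or load_incluster_config() when running in-cluster.

job = client.V1Job(
    metadata=client.V1ObjectMeta(name="train-demo"),
    spec=client.V1JobSpec(
        backoff_limit=2,  # Retry a failed training run up to twice.
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="trainer",
                        image="registry.example.com/train:latest",  # hypothetical image
                        command=["python", "train.py"],
                        resources=client.V1ResourceRequirements(
                            requests={"cpu": "4", "memory": "16Gi"},
                        ),
                    )
                ],
            )
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="ml", body=job)
```

The scheduler finds capacity for the Job, runs it to completion, and frees the resources, which is exactly the economy of scale the single-machine SSH approach lacks.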

Kubernetes in Managed Data Centers, Co-lo, or On-Prem: Outside of public cloud, Kubernetes requires a managed data center, a co-lo, or on-premises infrastructure. Users then must choose how to deploy and manage Kubernetes. This requires a significant investment and is prone to outages, especially during upgrades. It’s also important to point out that, unlike public cloud, there is no free fast-start option.

As burdensome as Kubernetes may appear, it’s the closest thing to a perfect platform for AI/ML workloads. Why?  

  1. Most training jobs are containerized already. 
  2. The vast majority of open-source tools, client- or server-based, are containerized. 
  3. Model serving is best achieved on serverless or container platforms. 
  4. Elastic demands on compute require a platform that can deliver elasticity (see the autoscaling sketch below). 
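To make points 3 and 4 concrete, here is a minimal sketch that attaches a CPU-based HorizontalPodAutoscaler to a hypothetical model-serving Deployment named model-server, again using the official Kubernetes Python client; the namespace and replica bounds are assumptions:

```python
from kubernetes import client, config

config.load_kube_config()

hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="model-server-hpa"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="model-server"
        ),
        min_replicas=1,    # Scale to a single replica when traffic is quiet.
        max_replicas=10,   # Cap spend during inference spikes.
        target_cpu_utilization_percentage=70,
    ),
)

client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="ml", body=hpa
)
```

Inference capacity then follows load automatically, rather than sitting idle between spikes.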

As pointed out by PwC, success requires focus across three dimensions: business transformation, enhanced decision-making, and modernized systems and processes. The final one is a critical piece that should not be ignored.

With all the tradeoffs, is moving to Kubernetes a realistic goal?  

A Real-World Example

There are large teams that still use spreadsheets and manually schedule time slots to train their models. These teams hurt their ROI through limited parallelization, low utilization of infrastructure, and delayed execution that ultimately degrades their ability to release to production. This holds true no matter where the compute is consumed: public cloud, managed data centers, co-locations, or on-premises. The compounding factor is a failure to mature the practice and adopt new tooling.

The market is showing early signs of adopting Kubernetes as the underlying infrastructure for training, due in part to its ability to scale dynamically and run across heterogeneous compute types, and for inference, thanks to its ability to scale with load and leverage both generic x86 and specialized infrastructure.
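Much of that heterogeneity comes down to standard Kubernetes scheduling primitives. The following sketch pins a training container to GPU nodes via a node selector, toleration, and GPU resource limit; the node label and image are assumptions about how a cluster might be set up, and GPU limits require the NVIDIA device plugin to be installed:

```python
from kubernetes import client

# Pod spec that pins GPU training work to GPU nodes, while CPU-only
# inference pods are free to land on generic x86 nodes.
gpu_pod_spec = client.V1PodSpec(
    restart_policy="Never",
    node_selector={"accelerator": "nvidia-a100"},  # assumed node label
    tolerations=[
        client.V1Toleration(
            key="nvidia.com/gpu", operator="Exists", effect="NoSchedule"
        )
    ],
    containers=[
        client.V1Container(
            name="trainer",
            image="registry.example.com/train:latest",  # hypothetical image
            resources=client.V1ResourceRequirements(
                limits={"nvidia.com/gpu": "1"}  # needs the NVIDIA device plugin
            ),
        )
    ],
)
```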

At the Open Source Summit, IKEA presented their 100% open-source ML training and inference platform. Their data science team spent 18 months evaluating options and ultimately landed on a platform that allows training on-premises and inference in the cloud, all provided on a single compute stack.

Further, Merantix, which focuses on AI, recently shared its approach and, importantly, its ‘why’. In their presentation, Fabio Grätz and Thomas Wollmann dive into the details of the infrastructure they use for deep learning and give an overview of the Merantix devtools that help them implement best practices as code and infrastructure as code.

The Challenge for Data Science Teams 

It’s immediately apparent that adopting a platform similar to IKEA’s or Merantix’s is no small task. Not only must a team learn Kubernetes, it must also select and adopt a number of disparate open-source solutions. Once those are selected, installed, and running, the team must then upgrade, support, and troubleshoot them, and maintain governance and security policies.

Data science teams need to choose carefully what they want to own, which tools they want to operate, and where their time is best spent. This includes deciding whether public cloud infrastructure is the best choice or whether purchasing GPU hardware may drive a higher ROI.

If you’re looking for infrastructure flexibility without the operational burden, Platform9’s Kubernetes platform is a great foundation to start with. If you want a turn-key cloud-native experience, a platform such as Union may be a good fit. It leverages Flyte, which is open source and runs on Kubernetes, so you get the best of both worlds and can run on-prem or in the cloud.
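For a feel of the developer experience, here is a minimal flytekit sketch in which each task runs as its own container on the underlying Kubernetes cluster, wherever that cluster lives; the storage paths are hypothetical:

```python
from flytekit import Resources, task, workflow


@task(requests=Resources(cpu="2", mem="4Gi"))
def preprocess() -> str:
    # Each task executes in its own container on Kubernetes.
    return "s3://bucket/features"  # hypothetical dataset location


@task(requests=Resources(cpu="4", mem="16Gi", gpu="1"))
def train(features: str) -> str:
    # The GPU request is scheduled by Kubernetes like any other resource.
    return "s3://bucket/model"  # hypothetical model artifact


@workflow
def training_pipeline() -> str:
    return train(features=preprocess())
```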

If you can’t invest in a managed data center or co-lo, and public cloud constraints are impacting your timelines, offerings from Equinix Metal and Lambda Labs can be combined with Platform9 and Union to get a scalable platform ready in under a day.

There are a lot of moving parts to AI/ML, and that’s without considering the underlying infrastructure that supports data science teams and their ability to bring value to an organization. It’s possible – and in many cases, preferable – to get specialized assistance in the operational aspects of AI/ML, so that your teams can stay focused on insights and outcomes. If you’re interested in learning more about how Platform9 can help, you can schedule a free consultation with us. 
