SpotQuake: Using AWS Spot Instances to run Platform9 Freedom
Spot Instances
AWS is unambiguous about the type of workloads that should use EC2 Spot Instances: ‘fault-tolerant’ ones.
This is how Amazon describes Spot Instances:
Amazon EC2 Spot Instances let you take advantage of unused EC2 capacity in the AWS cloud. Spot Instances are available at up to a 90% discount compared to On-Demand prices. You can use Spot Instances for various stateless, fault-tolerant, or flexible applications such as big data, containerized workloads, CI/CD, web servers, high-performance computing (HPC), and test & development workloads. Because Spot Instances are tightly integrated with AWS services such as Auto Scaling, EMR, ECS, CloudFormation, Data Pipeline and AWS Batch, you can choose how to launch and maintain your applications running on Spot Instances.
We use Spot Instances, a lot of them, to run the Platform9 clusters that provide the infrastructure for our Freedom Plan. As the name suggests, the Freedom Plan is 100% free, forever, for 2 clusters and 8 nodes. You’re allowed to build new clusters and import existing clusters across VMs, physical servers, AWS, Azure and Google Cloud. Once you have a cluster, new or existing, you can leverage Platform9 to deploy workloads, troubleshoot Pods, Deployments and Services, dig into the containers to see what’s going on, execute upgrades, manage user RBAC within clusters and more. That’s enough on the platform; I want to discuss Spot. As the book says, “See Spot Run, See Spot Jump, See Spot Play”. Sometimes, though, Spot goes away, and this can create a bit of a bumpy experience. A SpotQuake!
There are two factors that may cause Spot Instances to become unavailable:
In both scenarios, any running EC2 instances will be removed. When this happens, AWS sends a notification that starts the clock on a 2-minute window. Within this window your workloads need to be evacuated and rescheduled onto a new instance to avoid interruption to your service. When this happens en masse, we call it a SpotQuake.
Interruption notifications are well documented. First, you can find information on your Spot Instance by using the describe-spot-instance-requests API; see the AWS documentation for details.
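As a rough illustration of what that API returns, here is a minimal Python sketch that scans a describe-spot-instance-requests response for requests whose instance is being reclaimed. It assumes the response has already been fetched (for example with boto3's `describe_spot_instance_requests`). The status codes listed are only a subset of the ones AWS documents, and the sample data and function name are made up for this example.

```python
# A subset of the Spot request status codes that indicate an interruption.
INTERRUPTED_CODES = {
    "marked-for-termination",
    "instance-terminated-by-price",
    "instance-terminated-no-capacity",
}


def interrupted_instances(response: dict) -> list:
    """Return (instance_id, status_code) pairs for interrupted Spot requests."""
    hits = []
    for request in response.get("SpotInstanceRequests", []):
        code = request.get("Status", {}).get("Code", "")
        if code in INTERRUPTED_CODES:
            hits.append((request.get("InstanceId", "unknown"), code))
    return hits


# Hypothetical sample, shaped like the API response (IDs are invented).
sample_response = {
    "SpotInstanceRequests": [
        {"InstanceId": "i-0aaa", "State": "active",
         "Status": {"Code": "fulfilled"}},
        {"InstanceId": "i-0bbb", "State": "active",
         "Status": {"Code": "marked-for-termination"}},
    ]
}

for instance_id, code in interrupted_instances(sample_response):
    print(f"{instance_id}: {code}")
```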
The data from the describe API lets you understand the status of your instance. To avoid outages, it is critical to configure Amazon Simple Notification Service (SNS) to send event notifications when a Spot Instance is about to be terminated. That way you can use the 2-minute notice to kick off automated actions.
Learn more about Amazon Simple Notification Service.
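To make the notification flow a little more concrete, here is a minimal Python sketch of the receiving side. EC2 emits the warning as an event with the detail-type "EC2 Spot Instance Interruption Warning". The sketch assumes that event has been routed to an SNS topic and that the subscriber receives the raw event JSON as the message body. The function name is hypothetical, and treating the event timestamp plus two minutes as the deadline is an approximation, not something AWS guarantees.

```python
import json
from datetime import datetime, timedelta, timezone

WARNING_TYPE = "EC2 Spot Instance Interruption Warning"
NOTICE_WINDOW = timedelta(minutes=2)  # the 2-minute notice described above


def parse_interruption(message: str):
    """Return (instance_id, approximate_deadline), or None for other events."""
    event = json.loads(message)
    if event.get("source") != "aws.ec2" or event.get("detail-type") != WARNING_TYPE:
        return None
    sent_at = datetime.strptime(event["time"], "%Y-%m-%dT%H:%M:%SZ")
    sent_at = sent_at.replace(tzinfo=timezone.utc)
    return event["detail"]["instance-id"], sent_at + NOTICE_WINDOW


# Hypothetical message body (instance ID and timestamp are invented).
sample_message = json.dumps({
    "source": "aws.ec2",
    "detail-type": WARNING_TYPE,
    "time": "2020-03-01T12:00:00Z",
    "detail": {"instance-id": "i-0123456789abcdef0",
               "instance-action": "terminate"},
})

parsed = parse_interruption(sample_message)
if parsed is not None:
    instance_id, deadline = parsed
    print(f"Evacuate {instance_id} before {deadline.isoformat()}")
```

Whatever automation you run can then be given the deadline as a hard time budget.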
How does a SpotQuake impact the Platform9 Freedom Plan environment?
The Platform9 Freedom Plan operates across a number of dedicated Kubernetes clusters, built using Platform9, in AWS. All of these clusters rely on Spot Instances: the control plane nodes are set to consume On-Demand or Reserved Instances, and the worker nodes run on Spot. Each Platform9 customer, irrespective of plan (Freedom, Growth or Enterprise), runs in a dedicated Namespace with dedicated services. These services run the SaaS Management Plane, build and operate your clusters, and run our remote monitoring. Growth and Enterprise are backed by our 99.9% uptime SLA; the clusters that run Growth and Enterprise do not use Spot Instances, EVER.
When a SpotQuake occurs, we kick off automation to move the services for each customer instance. Sometimes, when the magnitude of the SpotQuake is large enough, the results are catastrophic. Since we launched the Freedom Plan in March 2020, we have seen occasions where 50 nodes or more were lost across each cluster, and within the 2-minute window we were not able to deploy new instances and tell Kubernetes to reschedule the workloads. Ultimately, this resulted in a degradation of service.
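For readers wondering what "telling Kubernetes to reschedule" might look like, here is a minimal Python sketch that builds a `kubectl drain` command sized to fit inside the remaining notice window. It is an illustration only and not a description of Platform9's actual automation. The function name, node name and 30-second reserve are assumptions made for this example.

```python
import shlex


def drain_command(node: str, seconds_left: int, reserve: int = 30) -> list:
    """Build a kubectl drain command that gives up before the node disappears.

    `reserve` is a hypothetical safety margin kept back from the time budget.
    """
    budget = max(seconds_left - reserve, 1)
    return [
        "kubectl", "drain", node,      # drain also cordons the node first
        "--ignore-daemonsets",         # DaemonSet pods cannot be moved elsewhere
        "--delete-emptydir-data",      # allow eviction of pods using emptyDir
        "--force",                     # evict pods not managed by a controller
        f"--timeout={budget}s",        # stop waiting once the budget is spent
    ]


# Hypothetical node name; pass the list to subprocess.run() to execute it.
command = drain_command("ip-10-0-1-23.ec2.internal", seconds_left=120)
print(shlex.join(command))
```

Draining only helps if there is spare capacity for the evicted Pods to land on, which is exactly what runs short when many nodes disappear at once.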
There is one important thing to know when this happens: your cluster will remain 100% operational. Platform9 in no way impacts the availability of your clusters. When your instance is impacted by a failed service, or a SpotQuake, you will notice different outcomes in our platform depending on what is impacted. If Keystone is down, you will not be able to log into your instance. If ResManager is impacted, your nodes’ statuses will be marked as unavailable. If the Kubernetes Proxy is impacted, requests that talk directly to the API Server will fail. When any of these issues occur, your cluster is still running and your workloads will continue to run.
How do I know if a SpotQuake is happening?
You can check the status of the Platform9 Freedom Plan on our status page.
Who can I contact for help?
If you are running a Freedom instance, reach out on Slack. Our community Slack is staffed by our support team, engineers and product management.
Whenever anything is impacting Freedom Plan we post status updates to the Slack General channel.
You can also check the health of your dedicated instance at any time. Log in, click the Notification Tray (it’s next to your avatar), then click “Management Plane Health” in the Notification Tray menu. This opens a view of each service that is running. If any of these services have failed, reach out on Slack and the team will investigate.
Conclusion
AWS Spot Instances are a great way to help manage and reduce cloud spend; however, their volatility can be a bit of a pain to manage. If you want to chat about using Spot, reach out on Slack. We are here to help.
If you want to get started with Platform9 on AWS, check out this blog post:
How to Set Up a Shared Kubernetes Cluster on AWS using Platform9 Managed Kubernetes