There are many tools in the cloud-native and microservices tool chest. Kubernetes is the go-to for container management, giving organizations superpowers for running container applications at scale. However, running an enterprise-grade, production-level Kubernetes deployment takes more than running Kubernetes by itself.
Kubernetes can simplify the management of your containerized applications and services across different cloud services. It can be a double-edged sword, though, as it also adds complexity to your system by introducing a lot of new layers and abstractions, which translates to more components and services that need to be monitored. This makes Kubernetes monitoring and overall observability even more critical.
Because containers are ephemeral and transient, monitoring, security, and data protection are fundamentally different from their counterparts in virtualized or bare metal applications. Optimizing the tooling that supports a Kubernetes deployment is not a trivial task.
In many cases, this means that tooling aimed at virtualized environments doesn’t translate well into containerized platforms. Replacing these tools may be better than retrofitting legacy tooling.
Types of Observability
Let’s go back to basics first. Looking at the observability space for container-based microservices landscapes, we can distinguish three separate types of tooling:
- Monitoring (or metrics): collecting operational telemetry about applications, application services, middleware, databases, operating systems, and virtual or physical machines
- Logging: collecting error messages, debug or stack traces, and more detailed messages
- Tracing: collecting user transactions and performance data across a single or distributed system
In a DevOps or Site Reliability Engineering (SRE) world, these three disciplines collectively make up observability.
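To make the distinction concrete, here is a minimal Python sketch of one request emitting all three signals. It uses only the standard library, and the in-memory lists stand in for real backends. The names `handle_checkout`, `count`, `log`, and `span` are hypothetical and chosen for illustration; they are not the API of any particular observability tool.

```python
import json
import time
import uuid
from contextlib import contextmanager

# In-memory sinks standing in for real metric, log, and trace backends.
metrics = {}   # counter name -> value
logs = []      # structured log records (JSON strings)
spans = []     # finished trace spans


def count(name, amount=1):
    """Metric: a numeric counter that is cheap to aggregate over time."""
    metrics[name] = metrics.get(name, 0) + amount


def log(level, message, **fields):
    """Log: a detailed, structured record of a single event."""
    logs.append(json.dumps({"level": level, "message": message, **fields}))


@contextmanager
def span(name, trace_id):
    """Trace: time one step of a request and tag it with a shared trace ID."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append({
            "trace_id": trace_id,
            "name": name,
            "duration_s": time.perf_counter() - start,
        })


def handle_checkout(order_id):
    """Hypothetical request handler that emits all three signals."""
    trace_id = uuid.uuid4().hex
    with span("checkout", trace_id):
        count("checkout_requests_total")
        with span("charge_card", trace_id):
            log("info", "charging card", order_id=order_id, trace_id=trace_id)
    return trace_id
```

The shared `trace_id` is what lets you jump from a slow span to the log lines written during that same request, while the counter answers the aggregate question of how many checkouts happened at all.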
Each discipline provides valuable insights into every layer of the layer cake that makes up the increasingly complex application and infrastructure landscape of containers. DevOps engineers and SREs use the insights from these tools to improve resilience and performance, as well as triage errors, fix bugs, and improve availability and reliability.
Finally, they use these tools to gauge how users are interacting with the system. The tools help figure out which functionality visitors use or don’t use, and where performance bottlenecks lie.
As application landscapes expand due to digital transformation, the number of microservices and individual containers explodes, making it harder to see the inner workings of systems. So it shouldn’t be a surprise that executing a good observability strategy is one of the deciding factors of a successful Kubernetes deployment.
Layers of Monitoring
A good place to start with monitoring is by collecting metrics and operational telemetry from Kubernetes constructs like clusters and pods, as well as metrics on resource usage like CPU, memory, networking, and storage. Starting with these bottom two layers is relatively easy and a good way of becoming comfortable with observability tooling.
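Resource usage in Kubernetes is expressed as quantity strings such as `250m` (a quarter of a CPU core) or `128Mi` (128 mebibytes). As a rough sketch of what working with these numbers looks like, the Python below converts such strings into plain numbers and computes how much of a limit is in use. It handles only a subset of the suffixes Kubernetes accepts, and `parse_quantity` and `utilization` are hypothetical helper names, not part of any Kubernetes client.

```python
from decimal import Decimal

# A subset of the suffixes used by Kubernetes resource quantities.
_SUFFIXES = {
    "Ki": Decimal(2) ** 10,
    "Mi": Decimal(2) ** 20,
    "Gi": Decimal(2) ** 30,
    "Ti": Decimal(2) ** 40,
    "n": Decimal("0.000000001"),
    "u": Decimal("0.000001"),
    "m": Decimal("0.001"),
    "k": Decimal(10) ** 3,
    "M": Decimal(10) ** 6,
    "G": Decimal(10) ** 9,
    "T": Decimal(10) ** 12,
}


def parse_quantity(text):
    """Convert a quantity string such as '250m' or '128Mi' to a Decimal."""
    # Try two-character suffixes first so 'Mi' is not mistaken for 'i'-less 'M'.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * _SUFFIXES[suffix]
    return Decimal(text)  # no suffix: plain cores or bytes


def utilization(usage, limit):
    """Return usage as a fraction of the limit (0.5 means 50 percent)."""
    return float(parse_quantity(usage) / parse_quantity(limit))
```

For example, `utilization("64Mi", "128Mi")` gives the fraction of a container's memory limit that is currently in use, which is the kind of number you would typically alert on.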
Infrastructure monitoring and logging are key capabilities because it’s important to know the activities of your physical infrastructure. A substantial amount of your application’s performance and resilience comes from correctly functioning servers and networking.
As the application landscape expands, a well-executed infrastructure monitoring and logging strategy also builds a shared understanding of application performance across teams, preventing miscommunication between application development, cloud platform, and other teams.
Visibility into infrastructure and the shared understanding it builds are crucial, but of course they don’t give the entire picture. For that, you need to move up the stack, starting with application performance monitoring (APM).
For many organizations, the application monitoring journey starts with monitoring (or metrics collection) and logging containerized workloads. For Kubernetes-based environments, there are natural combinations to start with, like the open source Prometheus for metrics and Fluentd for logging, which make it easier to run monitoring and logging.
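Prometheus collects metrics by scraping a plain-text endpoint that each workload exposes. To show how simple that text format is, the sketch below renders a single counter by hand in Python. The function name `render_counter` and the metric in the usage note are made up for illustration; in practice you would normally use an official Prometheus client library rather than formatting the output yourself.

```python
def _escape(value):
    """Escape backslashes, quotes, and newlines inside a label value."""
    return (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def render_counter(name, help_text, samples):
    """Render one counter in the Prometheus text exposition format.

    `samples` maps a tuple of (label, value) pairs to the counter value,
    for example {(("method", "GET"),): 3}.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        label_text = ",".join(
            f'{key}="{_escape(val)}"' for key, val in labels
        )
        selector = "{" + label_text + "}" if label_text else ""
        lines.append(f"{name}{selector} {value}")
    return "\n".join(lines) + "\n"
```

Calling `render_counter("http_requests_total", "Total HTTP requests.", {(("method", "GET"), ("code", "200")): 3})` produces the `# HELP` and `# TYPE` header lines followed by one sample line per label combination.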
Making Observability Work for Your Business
This journey up the stack is an opportunity to align monitoring, logging, and tracing to business objectives, mining more insights from the increased visibility. It allows teams to gain visibility into more than just technical metrics, generating business-oriented metrics, too.
By measuring business-oriented metrics (such as the dollar value of the shopping basket, the number of abandoned baskets, and metrics on popular or even disused features), product owners can align development priorities to what their users really want, optimize performance in areas where it actually matters, and fix technical debt to accommodate further growth. Naturally, these insights fuel business growth and revenue.
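As a small illustration of what such business-oriented metrics could look like in code, the Python sketch below derives basket value and abandonment figures from a list of shopping baskets. The `Basket` type, its fields, and the metric names are assumptions made for this example, not a prescribed schema.

```python
from dataclasses import dataclass


@dataclass
class Basket:
    """Hypothetical record of one shopping basket."""
    value_usd: float
    checked_out: bool


def basket_metrics(baskets):
    """Summarize baskets into a few business-oriented metrics."""
    completed = [b for b in baskets if b.checked_out]
    abandoned = len(baskets) - len(completed)
    return {
        "baskets_total": len(baskets),
        "baskets_abandoned": abandoned,
        # Share of baskets that never reached checkout.
        "abandonment_rate": abandoned / len(baskets) if baskets else 0.0,
        # Average dollar value of baskets that did check out.
        "average_order_value_usd": (
            sum(b.value_usd for b in completed) / len(completed)
            if completed
            else 0.0
        ),
    }
```

Exported alongside technical metrics, figures like these let a product owner see, for instance, whether a slow checkout service coincides with a rising abandonment rate.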
When tooling is aligned to the business and customer experience, the tools can be used by more than just IT teams, allowing business teams to gain insights into their applications and their users.
Think Mid-Term to Long-Term
The tools you choose for observability should serve your needs for several years. This requires you to think about how your business is changing and how that will change your observability requirements in the long term.
Migrating to a new, more capable APM platform can be costly, and it won’t immediately give you additional functionality. The new capabilities require additional engineering and implementation work before they are fully unlocked.
And let’s not forget that moving to another APM platform requires you to retrain staff, and that teams need time to regain confidence in the metrics and insights. All of this reduces the value the APM platform brings in the short term. That’s why it makes sense to choose your tooling wisely from the start, keeping the long-term goals in mind.
In other words, while you won’t need the most complex or feature-rich solution now, look at what features you’ll need to support evolving requirements in the future. Invest in your team and people and start with the APM capabilities you need now.
You don’t need to enable, implement, and incorporate every feature the tooling provides from the get-go. It’s OK to start simple, build up confidence along the way, continuously evolve your knowledge of the tool, and expand its use in sync with changing requirements.