Everyone in tech is busy discussing Kubernetes, containers, and microservices as if the basics of DevOps and continuous delivery are all figured out. In practice, the classic blame game between dev and ops is still alive and kicking. We might be nicer to each other now, with a common sense of shared ownership, but those niceties often hide the obstacles that keep engineers from moving faster.
In this post, we will cover 5 of the biggest problems major engineering organizations are still facing, and how to approach solving them.
Think of this as your cheatsheet when you need to communicate the problems caused by a lack of observability.
TL;DR: Here’s the Gist of the Point We’re Making
- There’s not enough high-level data to inform application owners about application behavior
- There’s not enough granular data to inform developers about application behavior
- There’s a disconnect between how high-level data relates to granular machine data
- Data is isolated into silos across adjacent teams and management
- Unreachable data hides unknown errors under the radar of both dev and ops (but not the end user)
(Pssst… keep scrolling through, there’s a pretty cool chart and more info that goes deeper into this)
The Lay of the DevOps Land
Let’s say we’re looking at a random engineering organization within an enterprise. Here’s an abstraction of how it operates its business units and their supporting applications:
We have the company, 2 separate business units, critical-app-1 supporting the first business unit, critical-app-2 with capabilities shared across both business units, and app-3, which is operated by business-unit-2.
Each of these applications is developed with an agile methodology, has different versions running in parallel, and is deployed anywhere from weekly to daily, or even multiple times per hour. Each has multiple server instances, and those instances might run multiple microservices, distributed or not, containerized or not. The lay of the land gets quite complex.
While running, these applications generate billions of events. For example, exceptions, log messages, and slowdowns:
However, and this is something many of us often forget, these events are abstractions of what’s really going on under the surface, just as the application itself is an abstraction of all the individual services that make it run.
Machine data is the flow of information that runs through the infrastructure and code. Let’s look at this from the perspective of the application-related data and environment-related data:
There’s a high-pressure machine data firehose generated by applications. Events are just a fraction of that flow, a drop in the bucket of what’s really going on under the hood. Accessing, investigating, and uncovering insights from machine data is the technical requirement for any observability strategy. Tying it all back to the code and its high-level implications is the ultimate goal.
Can You Find the 5 Problems Hiding in This Chart?
Now, on top of this landscape, we have silos, unknown errors, and a huge machine data blindspot. Management is isolated, development groups are neither talking to each other nor sharing information, unknown errors are happening without anyone’s knowledge, and granularity is missing. It’s a mess.
There are probably more than 5 problems, if you find more, please let us know in the comments section below!
Let’s try to untangle and isolate the essence of the most problematic areas:
1. There’s Not Enough High-Level Data to Inform Application Owners About Application Behavior
Kicking this off from a bird’s-eye view: how do you know whether a new deployment will introduce a new error? How about its impact on the overall application? Is it failing all the time, or just some of the time? And what is its relation to the business unit it supports? Technical, yet high-level, data allows us to make code-aware decisions and tie them to outcomes that product metrics alone don’t cover.
For example: if we think about this from the perspective of the software development lifecycle, most engineering teams have a hard time defining effective quality gates between dev, test, staging, and production. A pretty common and honest answer to “how do you know if the new release is successful?” is “a few hours/days pass and no one is angry at us”. Less than ideal to say the least.
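To make the quality-gate idea concrete, here’s a minimal sketch of what an automated gate between staging and production could look like. Everything in it is hypothetical (the `shouldPromote` method, the threshold, the error maps); a real gate would pull these numbers from a metrics store:

```java
import java.util.Map;

// Minimal, hypothetical sketch of a release quality gate. A real gate
// would query a metrics store instead of taking maps as arguments.
class QualityGate {

    // Block promotion if the candidate release introduces an error type
    // the baseline never saw, or if its overall error rate regresses
    // past the allowed threshold.
    static boolean shouldPromote(Map<String, Long> baselineErrors,
                                 Map<String, Long> candidateErrors,
                                 long candidateRequests,
                                 double maxErrorRate) {
        for (String errorType : candidateErrors.keySet()) {
            if (!baselineErrors.containsKey(errorType)) {
                return false; // new error type introduced by this release
            }
        }
        long totalErrors = candidateErrors.values().stream()
                .mapToLong(Long::longValue).sum();
        double errorRate = (double) totalErrors / candidateRequests;
        return errorRate <= maxErrorRate;
    }
}
```

Even a crude gate like this beats “no one is angry at us yet” as a release criterion, because it turns the question into one with a measurable answer.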
2. There’s Not Enough Granular Data to Inform Developers About Application Behavior
Let’s change perspective and zoom in. When looking into a specific error, how do you approach finding a solution? How do you recreate it? And when you do, how do you isolate the code-aware insights that point to a solution? How much time is spent on that versus pushing the roadmap forward? The root cause for ops is just the first step in the journey to the root cause for dev. The data that does get captured is a shallow, “low resolution” representation of what actually caused the trouble.
The DevOps root cause, the real root cause, is a whole different story.
For example: let’s say ops identify that the root cause of a new issue is tied to a specific deployment on one microservice. Devs, on the other side of the release cycle, supposedly have a root cause to work with but need to rely on logs to understand what happened. More often than not, logs don’t contain all of the information that’s needed to find a solution.
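As an illustration (a hypothetical handler, not from any real codebase), consider how little a typical log line preserves:

```java
// Hypothetical payment handler: the log line records *that* something
// failed, but none of the variable state needed to reproduce it.
class PaymentHandler {
    static String process(String orderId, double amount, String currency) {
        try {
            if (amount <= 0) {
                throw new IllegalArgumentException("invalid amount");
            }
            return "charged " + orderId;
        } catch (IllegalArgumentException e) {
            // All a developer sees later is this one line -- no orderId,
            // no amount, no currency, no surrounding state:
            System.err.println("payment failed: " + e.getMessage());
            return "failed";
        }
    }
}
```

The developer chasing this error later has the message “invalid amount”, but not the order ID, the amount, or the currency that would actually let them reproduce the failure.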
3. There’s a Disconnect Between How High-Level Data Relates to Granular Machine Data
The DevOps mindset is supposed to help both dev and ops act as one team. However, how can you be one team when data is fragmented? How can you connect the high-level outcome to the granular pieces of data that caused it? How can you escape the abstraction to dive into the details? Are you even solving for the right problem? You don’t only miss high-level data and granular machine data; you also miss out on the links between them.
For example: a single piece of code in a new deployment fails 30% of the times it is called; 50% of those failures happen for one reason, 30% for a second reason, and 20% for a third. Failures from the first two reasons are detected, but the remaining 20% are not. How do you separate those failures? How do you find what’s causing that undetected 20%? How do you tie it back to the changes in the new deployment?
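The arithmetic here is worth making explicit: the undetected slice is 20% of the failures, which works out to 30% × 20% = 6% of all calls. A tiny sketch (hypothetical names):

```java
// Illustrative helper: converts "share of failures" into "share of calls".
class FailureBreakdown {
    // A reason's share of *calls* is the overall failure rate times
    // that reason's share of *failures*.
    static double callRateForReason(double overallFailureRate,
                                    double shareOfFailures) {
        return overallFailureRate * shareOfFailures;
    }
}
```

Six in every hundred calls failing invisibly is exactly the kind of number that never shows up when detection is keyed only to the failure reasons you already know about.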
4. Data Is Isolated into Silos Across Adjacent Teams and Management
It’s not all about the merits of machine data alone. The human factor shapes data barriers no less than the tech we use to capture the data. Sometimes one team’s data or workflow holds the answer to another team’s problem. In those cases, how can you let information flow freely? How can you make sure everyone has access to all the relevant data they need? And how can you verify that it all complies with security and privacy guidelines? These data barriers are either productivity blockers when they’re too strict, or security incidents in the making when they’re too loose.
Today, it seems most engineering organizations are on either one of those extremes.
For example: we don’t have to go as far as teams working on different applications, since data can be siloed even within a single team. One person’s set of go-to Splunk or regex queries is not the same as the next person’s. Different people often look at different data, because logs and metrics are so inconsistent.
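To illustrate (with made-up log formats and made-up engineers), two perfectly reasonable go-to queries for the “same” timeout error can look at completely different data:

```java
import java.util.regex.Pattern;

// Two engineers' go-to patterns for the "same" timeout error -- each
// matches only the log format their own service emits. Both formats
// here are hypothetical; real log lines vary even more.
class SiloedQueries {
    static final Pattern ALICE = Pattern.compile("ERROR .* TimeoutException");
    static final Pattern BOB   = Pattern.compile("level=error msg=\"timeout\"");

    static boolean aliceSees(String line) { return ALICE.matcher(line).find(); }
    static boolean bobSees(String line)   { return BOB.matcher(line).find(); }
}
```

Run against each other’s logs, both queries come up empty, and each engineer concludes the other team’s problem “isn’t happening” on their side.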
5. Unreachable Data Hides Unknown Errors Under the Radar of Both Dev and Ops (But Not the End User)
The first 4 problems we described apply to the events we already know of: the known unknowns. Now, let’s look at the unknown unknowns. Are you aware of the errors that never hit your logs? The bad user experiences that are never reported? How often are you labeling user complaints as “could not reproduce”? These errors are definitely happening, and they’re not unknown to end users; if anything, they’re well-known. The existing data that engineering teams depend on is incomplete not only in resolution, but also in identification.
For example: if an exception is thrown, caught, but not logged, there’s no way of knowing that it happened. It might seem like a bad coding practice that would be filtered out by an attentive code review stage, but it’s more common than most organizations would like to admit. Some people call them swallowed exceptions, hidden errors, or just a silent killer of applications.
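A minimal sketch of how an exception gets swallowed (hypothetical code, but a pattern you’ll find in most large codebases):

```java
// A "swallowed" exception: thrown, caught, and silently discarded.
// Nothing reaches the logs, so neither dev nor ops ever knows --
// only the user who gets a wrong or default result does.
class SwallowedException {
    static int parseQuantity(String raw) {
        try {
            return Integer.parseInt(raw);
        } catch (NumberFormatException e) {
            // The exception is discarded here; nothing is logged.
            return 0; // the user silently gets a default instead of an error
        }
    }
}
```

From the logs’ point of view, `parseQuantity("three")` never failed; from the user’s point of view, their quantity just vanished.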
What’s the Essence of All These Problems?
The catch-22 of data lies at the core of all these issues: noisy, unstructured data on one end and a data blindspot on the other, both obscuring the understanding of what’s happening. Engineers on both the dev and ops ends of the release cycle are stuck between a rock and a hard place when choosing the abstraction level of the data they capture.
Most, if not all, engineering organizations are on either end of this scale. Either too much noisy, unstructured data or too many blindspots.
What Are Companies Doing to Address This? Machine Data as Code
At OverOps, we approached this problem by looking into the engine rather than capturing the exhaust. Instead of relying on log events to capture data, we use a native micro-agent to capture 2 critical components from applications:
- The code graph (static) – an abstract representation of all possible code execution paths that gets updated with every deployment and code change
- Snapshots (dynamic) – a slice of memory that we capture every time an event of interest occurs, whether it was caught, uncaught, or “swallowed”
Then, we treat this machine data as code, generating a unique digital fingerprint for every event and processing the information by cross-referencing the code graph with the snapshots. This allows us to selectively capture, process, and display code-aware data.
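For intuition only, here’s one way such a fingerprint could be built: hash the event’s type, its code location, and the deployment that produced it. This is an illustrative sketch, not OverOps’s actual algorithm, and all the names in it are made up:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

// Illustrative sketch only: derive a stable fingerprint for an error
// event from its type, its stack frames, and the deployment id.
class EventFingerprint {
    static String fingerprint(String exceptionType,
                              List<String> stackFrames,
                              String deploymentId) {
        try {
            String key = exceptionType + "|" + String.join(";", stackFrames)
                       + "|" + deploymentId;
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(key.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available
        }
    }
}
```

The useful property is stability: the same error at the same code location collapses into one fingerprint, while a new deployment yields a new one, which is what makes analytics like “new vs. resurfaced” possible.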
This data includes:
- The call stack and source code
- The value of every variable within the stack
- The last 250 log statements (including DEBUG and INFO statements, even in production)
- And event analytics (frequency, failure rate, first and last seen, and new or resurfaced status, for 100% of errors and exceptions)
Most importantly, we tie all this new data, from granular to high level, back to the specific microservice and deployment that introduced it.
To learn more about how we do this and the value we provide, check out our website, watch our recent webinar recording about Continuous Reliability, and stay tuned for more news by subscribing to the blog for additional updates!
Dev, Ops, DevOps, and the SRE movement are all working to increase their velocity to keep innovating while maintaining the reliability of the product. In order to do that, we need to be aware of the limitations of the data and work towards making it contextual and code-aware from top to bottom.
After all, it’s a shared responsibility, and we’re all one team.
Do you relate to the problems covered in this post? How are you working towards solving them today? We’re curious to learn more about how you are tackling this within your own organization, please let us know in the comments section below!