It's time for problem detection!
By Tony Meehan, Co-Founder & CTO
A familiar software story
Kelsey Hightower launched the first problem detection workshop at KubeCon by connecting it to the origins of Kubernetes. Like then, new technology revolutions spark when a motivated group confronts the chaos and complexity with a fundamentally different approach.
Software complexity is at an all-time high. And it's outpacing engineering capacity to wrangle it. Abstractions and dependencies are increasing. AI adoption is accelerating. But engineering teams and budgets are shrinking. Software and the world it’s eating are primed for an explosion of unmanaged problems.
Reliability suffers as the gulf between growing software complexity and shrinking engineering capacity widens. As engineers, we see more mysterious failures and performance issues in code we didn’t write. And as users, we feel the impact in more aspects of our daily lives.
In the early 2000s a story like this unfolded in cybersecurity.
It was a similar explosion of software problems. With a front row seat as a vulnerability researcher hunting bugs at the National Security Agency (NSA), I lived through the evolution of a radically new approach to confront the chaos.
At first, defenders combed through long dashboards and a mountain of data in isolation to react to security problems. It was exhausting and lonely. But with the rise of threat intelligence, Common Vulnerabilities and Exposures (CVEs), threat research teams, and products operationalizing this knowledge through detection rules, a global community emerged: a network of organizations hunting for security problems together. Defenders were no longer alone.
Everyone benefited from the emergence of a new architecture. Security products began proactively executing rules at the edge where the problems happen. This allowed more low-level data to be analyzed to detect problems in real time. It was a game changer. By the time Lyndon and I left the NSA to build one of the first endpoint detection and response products on this new architecture at Endgame, the cybersecurity revolution was in full swing.
It’s a stark and sobering contrast to the story of software reliability.
Today, when users report an incident, that grenade usually lands on an individual surrounded by sprawling rearview dashboards on another mountain of copied data. Investigation noise is amplified by screaming alerts, merely highlighting symptoms that hint at a deeper, hidden problem. Sound exhausting and lonely?
It is! The unlucky ones endlessly pivot between data sources searching for elusive breadcrumbs, researching esoteric metrics, and plotting charts in a lonely hunt to connect the dots. With third-party code constituting 80% of software applications, too often the investigation concludes with a GitHub issue—a somber repository of shared frustration where countless others have wrestled with the same problem.
It’s a familiar story, one that Lyndon and I both experienced building products at Elastic, Mandiant, and Endgame. Fortunately, this doesn't have to be the end of the story. It's just the beginning. We can do better as a community!
This is why Lyndon and I started Prequel. We're obsessed with software bugs. And we're taking cybersecurity's hard-won insights and transforming them into software reliability superpowers with problem detection. Our mission is to empower the global reliability community to confront software complexity together.
We are launching Prequel as the next step of this important mission. Prequel is the only framework for building and running problem detectors at scale. And our customers on this mission with us love problem detection:
“Prequel independently flagged issues that were the underlying cause of complex bug reports. We're able to catch problems earlier and save time on troubleshooting.” — Andy Martin, Director, Infrastructure, Schrödinger, a leading provider of software solutions for the life and material sciences industries.
What is problem detection?
Traditional monitoring and alerting is like the ‘check engine’ light in your car. At best, it tells you something is wrong, but not exactly what, why, or how to fix it.
In contrast, problem detection is:
- precise and specific, not relying on traditional alerting mechanisms such as thresholds or anomalies, which are low-fidelity and prone to false positives
- based on hard-earned codified knowledge from a community of engineers, getting smarter over time
- about identifying, mitigating, and stopping problems before they escalate into incidents and downtime
This means:
- interrogating systems at the lowest levels to uncover silent and unreported failures not discernible from errors, logs, or even synthetic monitoring
- cutting through “random” signals to tell you exactly what is wrong at a level that is actionable to mitigate it
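To make the contrast concrete, here is a minimal Python sketch. It is an illustration only, not how Prequel implements detection. The rule ID, the event signature, and the function names `threshold_alert` and `detect_known_problem` are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Detection:
    rule_id: str     # hypothetical identifier, not a real CRE
    title: str       # what exactly is wrong
    mitigation: str  # what to do about it


# Hypothetical codified rule: a specific signature paired with a known mitigation.
RULE_ID = "example-0001"
SIGNATURE = re.compile(r"connection pool exhausted .* max=(\d+)")


def threshold_alert(error_rate: float, limit: float = 0.05) -> bool:
    """'Check engine' style alert: something is wrong, but not what or why."""
    return error_rate > limit


def detect_known_problem(events: list[str]) -> Optional[Detection]:
    """Match events against the codified signature and return a specific finding."""
    for event in events:
        match = SIGNATURE.search(event)
        if match:
            return Detection(
                rule_id=RULE_ID,
                title=f"Connection pool exhausted (max={match.group(1)})",
                mitigation="Raise the pool size or release connections sooner.",
            )
    return None
```

The threshold function can only return true or false. The rule returns a named problem and a suggested mitigation, which is the kind of actionable output described above.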
How does Prequel work?
It starts with intelligence. We created the industry’s first Reliability Research Team, working with users and the community to curate actionable reliability intelligence buried in customer environments, GitHub threads, Discord channels, and mailing lists. This enables us to build detectors for known issues, best practices, anti-patterns, and misconfigurations.
We created the Common Reliability Enumerations (CREs) standard—a way for engineers and products to understand and share knowledge of software problems. It unlocks collaboration by making reliability intelligence shareable and actionable. Earlier this year we kicked off detect.sh, the open problem detection and resolution community, to rally engineers, amplify awareness, and push problem detection forward.
Like cybersecurity, problem detection requires a new architecture. Instead of bringing data to dashboards, detectors need to be brought to the data at the edge where problems happen. This gives teams real-time, up-to-date, and actionable insights without the penalty and unpredictable cost of storing sensitive data somewhere else.
This is how problem detection works. And with Prequel deployed in customer environments for the last year, words cannot describe how rewarding it is to help teams improve production-readiness, prevent incidents, and unblock releases. I'm embarrassed to admit I often jump out of my office chair screaming and clapping in excitement with each customer breakthrough made possible by problem detection.
We’ve watched individual engineers—now backed by community for the first time—use Prequel to run problem detectors across thousands of services, identifying problems and applying mitigations. All before their first coffee (or tea) break.
We’ve seen the results of customers quickly understanding how a given problem cascades through their systems. Prequel’s service graph continuously builds a map of neighboring services to find the chain of distributed cause and impact. This enumerates related symptoms and delivers more holistic insights within a single view of the underlying problem.
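As a rough illustration of tracing impact across neighboring services, the sketch below walks a service map breadth-first from the service where a problem was detected. It assumes a hypothetical `dependents` map and is not a description of how Prequel's service graph is built or queried.

```python
from collections import deque


def impact_chain(graph: dict[str, list[str]], origin: str) -> list[str]:
    """Return the services reachable from origin, in breadth-first order."""
    seen = {origin}
    order: list[str] = []
    queue = deque([origin])
    while queue:
        service = queue.popleft()
        for dependent in graph.get(service, []):
            if dependent not in seen:
                seen.add(dependent)
                order.append(dependent)
                queue.append(dependent)
    return order


# Hypothetical map: each key lists the services that depend on it.
dependents = {
    "postgres": ["orders", "billing"],
    "orders": ["checkout"],
    "billing": ["checkout"],
    "checkout": [],
}

print(impact_chain(dependents, "postgres"))  # ['orders', 'billing', 'checkout']
```

Breadth-first order lists the closest neighbors first, so the services most directly exposed to the underlying problem appear at the front of the chain.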
We’ve also seen the power of a fast and simple install process and drop-in capabilities. Problem detection is powerful in part because it's easy to use. That's why we leverage eBPF to make installation easy and avoid the need for manual instrumentation. This also allows reliability intelligence rules to analyze asynchronous data from protocol messages, metrics, logs, stack traces, Kubernetes, container runtimes, and Linux to detect anti-patterns, misconfigurations, and known failures.
For too long the industry was told that issues are unique snowflakes. And yet we’ve detected hundreds of problems by applying the same, rapidly expanding detection library across different organizations.
Here are a few examples:
- At one customer, we detected a developer-introduced N+1 query anti-pattern, a database performance problem, in a Java service that was impacting customer retention. The same detection uncovered a similar problem at another customer in a Golang service.
- By detecting configuration errors in technologies like RabbitMQ, we prevented inter-node traffic from reaching capacity and triggering failures in neighboring services.
- And at another customer, we detected a known problem in Kafka that stopped the creation of new topics.
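For readers unfamiliar with the N+1 anti-pattern from the first example, the sketch below shows one simple heuristic for spotting it: group the queries issued during a single request by shape and flag any shape that repeats many times. This is an assumption-laden illustration, not Prequel's detector. The function names and the `min_repeats` cutoff are hypothetical.

```python
import re
from collections import Counter

# Matches integer literals and single-quoted string literals.
_LITERAL = re.compile(r"\b\d+\b|'[^']*'")


def normalize(sql: str) -> str:
    """Replace literals with '?' so queries that differ only in values group together."""
    return _LITERAL.sub("?", sql.strip().lower())


def find_n_plus_one(queries: list[str], min_repeats: int = 5) -> list[tuple[str, int]]:
    """Return (query shape, count) for shapes repeated at least min_repeats times."""
    counts = Counter(normalize(q) for q in queries)
    return [(shape, n) for shape, n in counts.items() if n >= min_repeats]


# One parent query followed by one child query per row: the classic N+1 shape.
request_queries = ["SELECT * FROM orders WHERE user_id = 7"] + [
    f"SELECT * FROM items WHERE order_id = {i}" for i in range(10)
]
print(find_n_plus_one(request_queries))
```

Because the heuristic looks at query shape rather than the language of the calling service, the same check applies whether the queries come from a Java service or a Golang one.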
Join us!
We started Prequel because we’ve spent our entire careers obsessed with finding and stopping software failure. We’ve seen how a new approach started a cybersecurity revolution. We are on a mission to do the same thing in reliability with problem detection. It’s time for a radically new approach.
Join us and the problem detection community! Let’s write the next chapter in software reliability and fight software complexity together.