Google SRE Pioneer Niall Murphy Discusses Problem Detection

Software systems are expected to run consistently without major disruptions, but no matter how well they're designed, issues crop up—often at the worst possible times. This is where problem management comes in: the process of identifying issues (and intervening) before they escalate into critical incidents.

A recent discussion between Niall Murphy, CEO of Stanza Systems and co-author of Site Reliability Engineering: How Google Runs Production Systems, and Tony Meehan, co-founder and CTO of Prequel, sheds light on the strategies and tools that help teams tackle problem detection and management effectively.

They explore organizational, technical, and cultural barriers that stifle problem management and offer actionable insights for teams eager to get ahead of failures.

Let’s break down their conversation and dive into building better approaches to problem detection and reliability management.

The Difference Between Problems, Issues, and Incidents

To address reliability effectively, it’s crucial to understand the relationship between problems, issues, and incidents:

Problem: An underlying condition or flaw in a system. For example, a memory leak in an application.
Issue: A noticeable effect caused by the problem, like a low memory warning.
Incident: The direct, real-time disruption users experience. Example: servers crash due to memory exhaustion.

Think of it like a pipeline. The problem is the source, the issue is the leakage, and the incident is the flood hitting users. Understanding the distinctions helps teams track the cause rather than only patching the symptoms.

A rolling server crash caused by a memory leak might seem like random bad luck. But with the right analysis, teams can discover that a third-party library isn’t releasing memory correctly and patch it.
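To make the memory leak example concrete, here is a minimal sketch of how a team might spot the problem before it becomes an issue or an incident. It fits a straight line to evenly spaced memory readings and estimates how long until a limit is reached. The names estimate_leak and LeakEstimate, the sampling interval, and the memory limit are all assumptions made for illustration; they do not come from the discussion or from any particular tool.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class LeakEstimate:
    slope_mb_per_min: float            # fitted growth rate of memory usage
    minutes_to_limit: Optional[float]  # None when memory is not growing


def estimate_leak(
    samples_mb: Sequence[float],
    interval_min: float,
    limit_mb: float,
) -> LeakEstimate:
    """Fit a least-squares line to evenly spaced memory samples.

    A steady positive slope is only a hint of a leak, not proof:
    caches warming up or rising traffic can look the same.
    """
    n = len(samples_mb)
    if n < 2:
        raise ValueError("need at least two samples")

    # Timestamps implied by the fixed sampling interval.
    xs = [i * interval_min for i in range(n)]
    mean_x = sum(xs) / n
    mean_y = sum(samples_mb) / n

    # Ordinary least-squares slope: cov(x, y) / var(x).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples_mb))
    slope = cov_xy / var_x

    if slope <= 0:
        return LeakEstimate(slope, None)

    # Project from the latest reading to the limit at the fitted rate.
    remaining = max(limit_mb - samples_mb[-1], 0.0)
    return LeakEstimate(slope, remaining / slope)
```

With five readings that grow by 10 MB per minute from 500 MB and a hypothetical 1000 MB limit, the sketch reports a slope of 10 MB per minute and about 46 minutes of headroom. That is enough time to investigate the cause instead of waiting for the crash.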

Why Problem Detection Matters

Problem detection is the first step to avoiding system downtime, service disruptions, and customer dissatisfaction. In many organizations, the challenge lies in identifying cracks in the system before they lead to big failures. Problems are often silent until they manifest as issues, and if unresolved, they evolve into full-blown incidents.

A system failure doesn't just cost productivity. It impacts revenue, customer trust, and team morale. Proactively detecting problems is about protecting your business from unnecessary risks.

Overcoming Barriers to Effective Problem Management

Unfortunately, problem detection isn’t as simple as it seems. Teams face roadblocks—some are technical, others organizational. Addressing these can build a stronger foundation for reliability.

Cultural Barriers

In many teams, reliability takes a backseat to feature delivery. Leadership may reward quick rollouts at the expense of system health, sending the message that problem detection isn’t a priority.

This attitude leads to firefighting. Teams spend most of their time reacting to incidents instead of preventing them. Shifting this mindset requires leadership buy-in—reinforcing the value of addressing reliability as a competitive edge.

Technical Barriers

Modern engineering teams have access to more data than ever—logs, metrics, traces—but making sense of it is harder than it seems. Overloaded dashboards and unfiltered alerts create noise, while engineers scramble to find useful insights buried in the chaos.

Without proper tools for identifying trends, teams rely on intuition or guesswork, wasting time and leaving issues unresolved.
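One simple way to cut through alert noise is to collapse repeated alerts into a short, ranked summary so that recurring signals stand out. The sketch below assumes each alert can be reduced to a (service, message) pair and that an alert seen at least three times is worth a look. Both the alert shape and the threshold are hypothetical choices for illustration, not something described in the discussion.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical alert shape: (service name, normalized message).
Alert = Tuple[str, str]


def summarize_alerts(
    alerts: Iterable[Alert],
    min_count: int = 3,
) -> List[Tuple[Alert, int]]:
    """Collapse duplicate alerts and keep only the recurring ones.

    Returns (alert, count) pairs, most frequent first. Ties are
    ordered by service and message so the output is stable.
    """
    counts = Counter(alerts)
    recurring = [(alert, n) for alert, n in counts.items() if n >= min_count]
    recurring.sort(key=lambda item: (-item[1], item[0]))
    return recurring


if __name__ == "__main__":
    raw = (
        [("api", "low memory")] * 4
        + [("db", "slow query")]
        + [("cache", "evictions")] * 3
    )
    for (service, message), n in summarize_alerts(raw):
        print(f"{service}: {message} (seen {n} times)")
```

Counting exact duplicates is the crudest possible grouping. Real alert streams usually need messages normalized first (for example, stripping request IDs or timestamps) before identical problems collapse into one line.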

Organizational Barriers

Resource constraints also play a major role. As teams shrink or budgets tighten, proactive work, like problem detection, often gets deprioritized. Engineering efforts gravitate toward short-term deliverables that add features or fix immediate bugs, leaving deeper systemic problems untouched.

With fewer people focused on reliability, incidents increase, adding more stress to already-overworked teams.

The Power of Proactive Detection

Being reactive is exhausting. Playing catch-up with incidents gives teams little time to improve systems or prevent future issues. Proactive detection flips this dynamic.

A proactive team uses its time to:

Identify weak spots before they fail.
Address recurring problems to reduce incident frequency.
Give customers a more seamless experience by providing error-free sessions.

Building Smarter Problem Detection Systems

Technology can help, but only if it’s applied effectively. Better tools for managing reliability don’t just add more data—they streamline analysis and focus attention on the right areas.

Tooling That Makes a Difference

What should a good tool for problem detection include? At a minimum:

Precision: Prioritizing meaningful problems without generating a lot of noise.
Multi-Data Source: Leveraging logs, metrics, and low-level data.
Comprehensive: Spanning problems caused by developers, open source software, and misconfigurations.

Poor tooling often overwhelms teams instead of helping them. Thoughtful systems that present actionable insights, rather than endless charts or graphs, are game changers.

Conclusion

Teams that prioritize detecting issues early see fewer disruptions, enjoy happier customers, and avoid the burnout caused by reactive firefighting. By addressing cultural, technical, and organizational barriers, and adopting smarter tools and strategies, you can transform how your team manages reliability.

Whether it’s a community-driven approach from a product like Prequel or organizational learnings from Stanza Systems, now’s the time to focus on creating systems that don’t just work—they endure.

Investing in problem detection today means fewer unexpected failures tomorrow. Take a look at how Prequel enables problem detection for engineering teams.
