Goodbye AIOps: Automating SREs - the next $100B opportunity
We've seen countless buzzwords come and go. "AIOps" is the latest in a long line of catchy but ultimately misguided terms that fail to capture the true potential of AI in the world of IT ops and observability.
Here's why we believe we're on the cusp of a fundamental shift in how organizations monitor, debug, and optimize their increasingly complex software systems.
The Problem with "AIOps"
The term "AIOps" implies simply layering AI on top of existing operations processes. This vastly undersells the potential of what could happen in this space - currently, the existing solutions get us to automating a few alerts or providing slightly smarter dashboards. The entire paradigm of how we approach observability and root cause analysis is poised for disruption.
The Current State of Observability
The observability market has been fragmented for years, with leading vendors like Datadog, New Relic, and Splunk rarely capturing more than 20% market share. Why? Because fundamentally, observability has been treated as a big data problem rather than an intelligence problem.
Modern distributed systems generate an astronomical amount of telemetry data – often petabytes per day. This data comes in heterogeneous formats: unstructured logs, structured metrics, and complex distributed traces. Each of these data types traditionally requires its own specialized storage and query engine, leading to a proliferation of tools and data silos.
However, the core challenge isn't collecting or storing massive amounts of telemetry data. It's making sense of that data quickly enough to drive real business value.
The challenges:
1. Data Volume and Velocity: The sheer scale of data generation in modern systems is staggering. Real-time ingestion and indexing at this scale remain computationally expensive, pushing the limits of even advanced platforms like Elasticsearch or InfluxDB.
2. Heterogeneous Data Formats: Logs are typically unstructured text, metrics are time-series data, and traces form directed acyclic graphs. Each requires specialized tools: Elasticsearch for logs, Prometheus for metrics, and Jaeger for traces, for instance.
3. Lack of Unified Data Model: There's no standardized way to correlate events across logs, metrics, and traces. While initiatives like OpenTelemetry aim to address this, adoption is still in early stages.
4. Query Complexity: Each observability tool has its own query language. Elasticsearch uses Lucene query syntax, Prometheus has PromQL, and many tracing tools use SQL-like languages. Mastering these diverse query languages is a significant barrier for many teams (the sketch after this list shows the same question expressed in three of these dialects).
5. High-Cardinality Problem: Modern microservices architectures lead to an explosion in the number of unique label combinations. Traditional time-series databases like InfluxDB or Prometheus struggle with high-cardinality data, often leading to performance issues or increased costs.
6. Alert Correlation: A single root cause often triggers cascading alerts across multiple systems. In its general formulations, correlating these alerts programmatically is an NP-hard problem, making automated root cause analysis extremely challenging.
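To make the query-fragmentation point concrete, here is a minimal sketch of the same question ("5xx errors for the payment service over the last hour") expressed in three dialects. The index, label, and tag names are illustrative, not any particular deployment's schema.

```python
# Illustrative only: field and label names are assumptions, not a real schema.

# Elasticsearch / Lucene query string over raw logs
lucene_query = 'service:"payment" AND status:[500 TO 599] AND @timestamp:[now-1h TO now]'

# PromQL over metrics
promql_query = 'sum(rate(http_requests_total{service="payment", code=~"5.."}[1h]))'

# Jaeger-style trace search parameters (exact parameters vary by backend)
trace_search = {"service": "payment", "tags": '{"http.status_code": "500"}', "lookback": "1h"}
```

Three dialects, three mental models, for one question.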
The average enterprise juggles 7-10 different observability tools, each with its own query language and data model. This makes it incredibly difficult to get a holistic view of system health. Engineers often spend up to 30% of their time just triaging alerts, many of which are false positives or symptoms rather than root causes. Even worse, the Mean Time to Resolution (MTTR) for critical incidents still averages 4-5 hours and can stretch to days in exceptional cases.
Companies like Splunk, Elastic, and Grafana Labs have made strides in unifying some of these data types, but a truly integrated solution remains elusive. Other entrants like Honeycomb and Lightstep (now part of ServiceNow) have focused on high-cardinality data and distributed tracing, but the challenge of unifying all observability data persists.
Why Previous "Smart" Observability Attempts Have Fallen Short
Earlier forays into applying AI to observability, including efforts by established players like Dynatrace and AppDynamics, have often disappointed. The reasons are multifaceted and deeply technical.
Supervised learning approaches struggle with the lack of labeled training data for rare failure modes. Feature engineering across heterogeneous data sources proves to be a Herculean task, often failing to capture the complex interactions in distributed systems. Black-box models, while sometimes accurate, fail to provide the explanations necessary to gain the trust of DevOps teams.
Perhaps most challenging is the issue of concept drift. In the world of continuous deployment, system behavior is constantly evolving. Traditional machine learning models require frequent retraining to maintain accuracy, a luxury rarely afforded in fast-paced production environments.
You knew what was coming here: LLMs
LLMs offer a unified approach to data understanding. Their ability to process and correlate heterogeneous data types – logs, metrics, and traces – in their raw formats breaks down some of the silos that have plagued observability. The transformer architecture underlying LLMs excels at capturing long-range dependencies, crucial for understanding system-wide patterns. LLMs also bring the power of zero-shot and few-shot learning, meaning they can adapt to new failure modes without extensive retraining, addressing the perennial issue of concept drift in rapidly evolving systems.
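Here is a minimal sketch of what few-shot adaptation looks like in practice, assuming a generic call_llm(prompt) helper that wraps whichever model and provider you use; the helper, the labels, and the example log lines are all hypothetical.

```python
# Few-shot failure-mode classification: no retraining, just in-context examples.
FEW_SHOT_PROMPT = """Classify each log line into a failure mode.

log: "connection reset by peer while calling payments-db"
failure_mode: downstream_dependency_failure

log: "Java heap space: OutOfMemoryError in checkout-service"
failure_mode: resource_exhaustion

log: "{log_line}"
failure_mode:"""

def classify_log_line(log_line: str, call_llm) -> str:
    """Label a previously unseen log line by prompting, not by retraining a model."""
    return call_llm(FEW_SHOT_PROMPT.format(log_line=log_line)).strip()
```

When a new failure mode appears, you add one example to the prompt rather than collecting a labeled dataset and retraining.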
One of the most exciting things is the introduction of natural language interfaces to observability. Imagine being able to ask, "Show me all HTTP 500 errors in the payment service correlated with high CPU usage in the last hour," and getting an instant, accurate response. This democratizes access to powerful debugging capabilities, no longer requiring expertise in multiple query languages.
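A sketch of such a natural-language interface, under the assumption that the LLM emits a small, app-defined JSON query plan rather than raw backend queries; call_llm and the schema keys are placeholders, not a real product's API.

```python
import json

# Hypothetical schema: the model fills in a small, app-defined query plan.
SYSTEM = (
    "Translate the user's question into JSON with keys: "
    "signal (logs|metrics|traces), service, filter, correlate_with, time_range."
)

def to_query_plan(question: str, call_llm) -> dict:
    """Turn a plain-English question into a structured plan the backends can execute."""
    raw = call_llm(f"{SYSTEM}\n\nQuestion: {question}\nJSON:")
    return json.loads(raw)

# Expected shape for the example question (illustrative):
# {"signal": "logs", "service": "payment", "filter": "status:500",
#  "correlate_with": "cpu_usage", "time_range": "1h"}
```

Each plan can then be compiled down to whichever backend actually holds the data, which is what keeps the user out of the query-language business.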
LLMs can also provide context-aware analysis by ingesting not just telemetry data, but also system documentation, code repositories, and historical incident reports. This allows for reasoning that incorporates deep domain knowledge, going far beyond simple pattern matching.
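A rough sketch of that context assembly, assuming a retrieve() function over whatever store your runbooks, READMEs, and postmortems are indexed in; all names and fields here are illustrative.

```python
# Retrieval-augmented incident prompt: telemetry plus institutional knowledge.
def build_incident_prompt(alert: dict, telemetry_excerpt: str, retrieve) -> str:
    context_docs = retrieve(query=f"{alert['service']} {alert['symptom']}", top_k=3)
    context = "\n---\n".join(doc["text"] for doc in context_docs)
    return (
        "You are assisting with a production incident.\n"
        f"Alert: {alert}\n"
        f"Recent telemetry:\n{telemetry_excerpt}\n"
        f"Relevant runbooks and past incidents:\n{context}\n"
        "Explain the likely causes and the next diagnostic step."
    )
```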
We believe LLMs offer a path to truly automated root cause analysis. By understanding the complex causal relationships in distributed systems, they can rapidly correlate events across the entire stack to pinpoint root cause, potentially reducing Mean Time to Resolution (MTTR) by an order of magnitude.
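One way this can work in practice, sketched here under assumed field names: normalize events from logs, metric anomalies, traces, and deploys into a single ordered timeline, then hand that timeline to the model with a root-cause question.

```python
# Cross-signal event timeline: the raw material for a root-cause prompt.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Event:
    ts: datetime
    source: str   # "log" | "metric" | "trace" | "deploy"
    service: str
    summary: str

def build_timeline(events: list[Event], window_start: datetime, window_end: datetime) -> str:
    """Merge and order events from all signals within the incident window."""
    in_window = sorted(
        (e for e in events if window_start <= e.ts <= window_end),
        key=lambda e: e.ts,
    )
    return "\n".join(
        f"{e.ts.isoformat()} [{e.source}] {e.service}: {e.summary}" for e in in_window
    )

# The resulting text block, prefixed with "what is the most likely root cause?",
# becomes the prompt for the model (or a very readable artifact for an engineer).
```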
Beware the technical challenges
While the potential of LLMs in observability is immense, significant technical hurdles persist. Real-time processing, crucial in observability contexts, remains challenging due to current LLM inference latencies and costs. Moreover, telemetry data is confidential and, in large companies, often contains personally identifiable information (PII), raising legitimate data privacy and security concerns.
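One mitigation is to put a redaction step in front of any hosted model. A deliberately tiny sketch follows; real deployments need far more than two regexes, this only illustrates the control point.

```python
# Redact obvious PII from log lines before they ever reach an external LLM.
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def redact(line: str) -> str:
    line = EMAIL.sub("<EMAIL>", line)
    line = IPV4.sub("<IP>", line)
    return line

assert redact("user jane@example.com from 10.0.0.17 got HTTP 500") == \
    "user <EMAIL> from <IP> got HTTP 500"
```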
LLMs currently struggle with both tabular and time series data, common formats in observability. Although we anticipate that innovations in newer architectures, multimodality, and multi-agent systems will mitigate some of these challenges over time, near-term solutions will require creative workarounds from builders.
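A common near-term workaround, sketched below, is to compress a raw metric series into a short textual summary (trend, mean, tail) before it ever reaches the prompt, instead of pasting thousands of data points; the metric name and sample values are made up.

```python
# Summarize a metric series into prompt-friendly text instead of raw numbers.
import statistics

def summarize_series(name: str, values: list[float], interval_s: int) -> str:
    mean = statistics.fmean(values)
    p95 = sorted(values)[int(0.95 * (len(values) - 1))]
    trend = "rising" if values[-1] > values[0] else "flat or falling"
    return (
        f"{name}: {len(values)} samples at {interval_s}s intervals, "
        f"mean={mean:.2f}, p95={p95:.2f}, last={values[-1]:.2f}, trend={trend}"
    )

print(summarize_series("payment_service_cpu", [0.42, 0.45, 0.61, 0.88, 0.93], 60))
# payment_service_cpu: 5 samples at 60s intervals, mean=0.66, p95=0.88, last=0.93, trend=rising
```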
Furthermore, while LLMs excel at identifying correlations, true root cause analysis often demands causal reasoning. A more promising direction lies in integrating LLMs with causal graphical models, bridging the gap between correlation and causation in complex systems.
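A sketch of that direction: a service-dependency DAG (built here with networkx) restricts the hypothesis space to upstream services, and the LLM is then asked to explain and rank only those candidates. The edges and service names are invented for illustration.

```python
# Pair a causal/dependency graph with the LLM: the graph narrows the candidates,
# the model explains and ranks them.
import networkx as nx

deps = nx.DiGraph()
deps.add_edges_from([
    ("payments-db", "payment-service"),   # edge = "failure can propagate to"
    ("auth-service", "payment-service"),
    ("payment-service", "checkout-ui"),
])

def upstream_candidates(graph: nx.DiGraph, alerting_service: str) -> set[str]:
    """Services whose failures could causally explain the alert."""
    return nx.ancestors(graph, alerting_service)

candidates = upstream_candidates(deps, "checkout-ui")
# {'payments-db', 'auth-service', 'payment-service'}
# Only these services' telemetry, not the whole estate, gets packed into the
# root-cause prompt.
```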
Conclusion
The term "AIOps" will soon feel as dated as "Big Data." We're moving beyond simply applying narrow AI to existing ops processes. The future is LLM-powered, intelligent, unified observability that fundamentally transforms how organizations build, run, and optimize their software systems.
The economic impact of this shift will be profound. By dramatically reducing MTTR, preventing outages, and freeing up engineering time, these technologies will be a force multiplier for software-driven innovation across industries. While Gartner predictions for AIOps are modest ($3.1B by 2025), we believe that automating SREs is worth 50X that: a $100B+ opportunity.
For startups entering this space, success will require a rare combination of deep expertise in ML, LLMs and distributed systems, along with a keen understanding of the practical challenges faced by DevOps teams. The ability to ingest and process heterogeneous data at scale, provide explainable insights, and deliver immediate value will be crucial.
While there are a handful of start-ups going after this opportunity, we believe that the playing field is wide open, and there will be multiple decacorns built in this category. If you are building in this space, email agarg@foundationcap.com and jgupta@foundationcap.com.