Design gracious failures: what happens when agents break

Demand for AI is surging. Real LLM workflows are being pushed into production more and more often.

AI SDRs that send messages to your inbox and book calls, AI customer support assistants, code debuggers, loan origination assistants and accounting agents are all becoming more common. By the end of 2025, I expect them to handle more than half of our everyday interactions across many domains.

Despite all the excitement and the economics on the table, we also need to consider the potential pitfalls in designing these experiences. In areas where the cost of a mistake is high, in regulated domains, or in complex workflows where errors can accumulate, accommodating agentic failure and intentionally designing fallbacks is key.

I call this the gracious failure mode. The better you design it, the further you can push automation and synthetic intelligence. The two go hand in hand. In fact, a well-thought-through failure mode can outcompete more expensive and slower intelligence that relies too heavily on run-time compute.

Definition

What is gracious failure?

Gracious failure, more commonly called graceful failure, refers to a system's ability to handle errors or unexpected conditions without causing a complete breakdown or producing harmful outcomes. An intentionally designed failure mode boosts AI adoption and helps ensure safety, reliability and compliance.

Co-pilots vs action agents

If the AI is a co-pilot, where humans are the final decision makers and action takers, failures are easier to catch and account for, and they often never reach end customers. When an agent takes action itself, the situation changes.

Let's consider a simple form-filling agent that books travel or orders food for a board meeting.

Consider an edge case where one of the people in the room has an allergy, and the agent either did not capture it or left it out of the notes on the order form. Significant harm can follow. To cover this scenario you can either rely on traditional human-in-the-loop approvals, which are slow and costly, or add extra loops in which the agent double-checks the allergy question and adds a note to every single order.
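The second option can be sketched in a few lines. This is only an illustration under assumptions: the `CateringOrder` type, its fields and the `prepare_order` function are hypothetical names, not part of any real ordering system. The idea is that the agent may not submit an order until the allergy question has an explicit answer, and that the answer is written into every order.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class CateringOrder:
    # Hypothetical order shape, for illustration only.
    items: List[str]
    attendees: int
    # None means nobody asked; an empty list means "asked, no allergies".
    allergies: Optional[List[str]] = None
    notes: List[str] = field(default_factory=list)


def prepare_order(order: CateringOrder) -> Tuple[str, CateringOrder]:
    """Return ("needs_confirmation", order) or ("ready", order)."""
    if order.allergies is None:
        # Fail graciously: stop and ask instead of guessing.
        return "needs_confirmation", order
    allergy_text = ", ".join(order.allergies) if order.allergies else "none reported"
    # The allergy note goes on every order, even when there is nothing to report.
    order.notes.append(f"Allergies: {allergy_text}")
    return "ready", order
```

Distinguishing "not asked" from "asked, none" is the design choice that matters here: a missing answer pauses the order instead of silently passing through.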

Implementing gracious failure strategies

To implement graceful failure you need both technical steps and process.

The first and most critical step is error handling. Most errors occur at the edges: on exceptions and in scenarios where an agent has no relevant data or patterns to draw on.

Error handling

Exception Handling: Wrap agent steps in try-catch blocks so that exceptions are handled without crashing the system. Test these paths well in advance, and keep testing them once you are in production.
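As a minimal sketch of this pattern in Python (where try-catch is spelled `try`/`except`), the wrapper below runs one agent step and switches to a fallback when a known failure occurs. `AgentStepError`, `run_step` and the shape of the returned dictionary are assumptions made for the example.

```python
import logging

logger = logging.getLogger("agent")


class AgentStepError(Exception):
    """Hypothetical error raised when an agent step fails in a known way."""


def run_step(step, payload, fallback):
    """Run one agent step; on a known failure, log it and use the fallback."""
    try:
        return {"ok": True, "result": step(payload)}
    except (AgentStepError, ValueError, TimeoutError) as exc:
        # Expected failure types are handled; anything else still surfaces loudly.
        logger.warning("step failed, using fallback: %s", exc)
        return {"ok": False, "result": fallback(payload), "error": str(exc)}
```

Because the failure path is an ordinary return value, it can be exercised in tests before launch and again in production by passing in a step that is known to fail.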

Validation Checks: Input validation is becoming more and more important as a way to detect and reject invalid or malicious data.
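A validation check for the travel-booking example might look like the sketch below. The field names (`traveler_name`, `depart_date`, `seats`) and the limits are assumptions chosen for illustration. The function collects every problem instead of stopping at the first one, so the agent or the user can fix them in a single pass.

```python
from datetime import date
from typing import List, Optional


def validate_booking(raw: dict, today: Optional[date] = None) -> List[str]:
    """Return a list of problems; an empty list means the input is acceptable."""
    today = today or date.today()
    errors = []

    name = raw.get("traveler_name")
    if not isinstance(name, str) or not name.strip():
        errors.append("traveler_name is required")
    elif len(name) > 100:
        errors.append("traveler_name is too long")

    try:
        depart = date.fromisoformat(str(raw.get("depart_date", "")))
        if depart < today:
            errors.append("depart_date is in the past")
    except ValueError:
        errors.append("depart_date must be an ISO date such as 2025-06-01")

    seats = raw.get("seats")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(seats, int) or isinstance(seats, bool) or not 1 <= seats <= 20:
        errors.append("seats must be an integer between 1 and 20")

    return errors
```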

Monitoring and Alerts: Finally, real-time monitoring and alerts are evergreen tools for catching failures in all software systems, including LLM-enabled ones.
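One simple way to picture this is a rolling failure-rate monitor. The `FailureMonitor` class, its window size and its threshold below are hypothetical; in a real deployment the `alert` callback would page someone or post to a channel, while here it defaults to `print`.

```python
from collections import deque


class FailureMonitor:
    """Track the last `window` agent runs and alert when too many have failed."""

    def __init__(self, window=50, threshold=0.2, alert=print):
        self.outcomes = deque(maxlen=window)  # oldest outcomes drop off automatically
        self.threshold = threshold
        self.alert = alert

    def failure_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def record(self, success: bool) -> None:
        self.outcomes.append(success)
        rate = self.failure_rate()
        # Only alert once the window is full, to avoid noise on the first few runs.
        if len(self.outcomes) == self.outcomes.maxlen and rate > self.threshold:
            self.alert(f"failure rate {rate:.0%} over last {len(self.outcomes)} runs")
```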

Fallback management

Designing fallback systems that activate once the primary model or task fails to deliver is key. The fallback can be either a human or a backup model. Allowing systems and agents to reduce functionality instead of shutting down is another path, one that enhances user trust and improves retention.
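A fallback chain of this kind can be sketched as follows. Everything here is an assumption for illustration: each handler is a plain function returning an answer and a confidence score, and 0.7 is an arbitrary cutoff. The request moves from the primary model to a backup and finally to a human queue, so the system never simply shuts down.

```python
def answer_with_fallbacks(question, handlers, min_confidence=0.7):
    """Try each (name, handler) pair in order; hand off to a human if none is confident.

    Each handler is assumed to return a (text, confidence) tuple.
    """
    for name, handler in handlers:
        try:
            text, confidence = handler(question)
        except Exception:
            # This tier failed outright; move on to the next one.
            continue
        if confidence >= min_confidence:
            return {"source": name, "answer": text}
    # No tier was confident enough, so degrade to a human review queue.
    return {"source": "human", "answer": None, "queued_for_review": True}
```

Returning the `source` alongside the answer keeps the degradation visible, which is useful both for the user-facing message and for later analysis of how often each tier is used.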

You also need to invest in user overrides that let end users correct the system and stay in full control.
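A user override can be as small as keeping the agent's proposal and the user's correction side by side. The `Decision` class below is a hypothetical sketch: the user's value always wins, and the agent's original value is retained so the correction can be reviewed later.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    # Hypothetical record of one value the agent filled in.
    name: str
    agent_value: str
    user_value: Optional[str] = None

    def override(self, value: str) -> None:
        """Record the user's correction without discarding the agent's proposal."""
        self.user_value = value

    @property
    def final_value(self) -> str:
        # The user's correction, when present, always takes precedence.
        return self.agent_value if self.user_value is None else self.user_value
```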

Clear messaging

Finally, well-written, transparent error messages improve everyone's experience, particularly when core expectations or agentic functions fail.
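One way to keep messages consistent is to map internal failure codes to plain-language text that says what happened, what was or was not done, and what comes next. The codes and wording below are invented for illustration.

```python
# Hypothetical failure codes mapped to user-facing text.
USER_MESSAGES = {
    "missing_info": (
        "I could not finish this because some details are missing: {detail}. "
        "Nothing has been submitted yet."
    ),
    "low_confidence": (
        "I am not confident enough to do this on my own, "
        "so I have passed it to a teammate for review."
    ),
    "tool_unavailable": (
        "The booking service is not responding. "
        "I will retry shortly, and no order has been placed."
    ),
}

DEFAULT_MESSAGE = (
    "Something went wrong on our side and no action was taken. "
    "A teammate has been notified."
)


def user_message(code: str, detail: str = "") -> str:
    """Return a transparent, user-facing message for an internal failure code."""
    template = USER_MESSAGES.get(code, DEFAULT_MESSAGE)
    return template.format(detail=detail)
```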

Epilogue

In one of his recent interviews, Terence Tao mentions that the way humans get better at a domain, even mathematics, is by making many mistakes and gaining an understanding of where they fail. AI systems do not have such data and need more of it to improve.

Designing for graceful failure lets you collect key data points on end-to-end experiences with AI agents, as well as on the cases where they make errors or fall short. Capturing this data will get you both product excellence and a plausible lead in deployments and adoption.
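Capturing that data can start very simply, for instance as one JSON line per failure. The record fields and function names below are assumptions, not a prescribed schema. The point is to store what the agent produced next to what a human eventually corrected it to.

```python
import json
from datetime import datetime, timezone


def failure_record(task_id, stage, error, agent_output=None, human_correction=None):
    """Build one structured record describing where and how an agent fell short."""
    return {
        "task_id": task_id,
        "stage": stage,
        "error": error,
        "agent_output": agent_output,
        "human_correction": human_correction,
        "logged_at": datetime.now(timezone.utc).isoformat(),
    }


def to_jsonl_line(record: dict) -> str:
    """Serialize a record as one line, ready to append to a .jsonl log file."""
    return json.dumps(record, ensure_ascii=False) + "\n"
```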

Mo Awadalla, CFP®

Financial Institutions GTM Innovator | Personalized Financial Data Advocate | LLM & GenAI Adoption Advisor | Ethical AI Strategist

3mo

Designing for graceful failure is crucial as we move AI-driven workflows from hypothetical to real-world production. I call it 'failing fast', building reliability and driving towards an impact sooner rather than later.

Sabina Zakaradze

Senior Quality Assurance Specialist at Ntropy

3mo

Thanks for sharing

Naré Vardanyan

Chief Executive @Ntropy. Building the data layer for AI workforce and workflows in finance . Spending most of my time on finding the right prompt
