OpenAI and CrowdStrike Learnings: Prevention With Phased Rollouts Through Enhanced Observability
In the realm of cloud services and cybersecurity, especially those relied on by innovative teams within the enterprise, system reliability is critical. Notable outages this year, most recently at OpenAI and earlier at CrowdStrike, have underscored the need for robust deployment strategies and comprehensive observability frameworks. By dissecting these incidents and examining phased rollouts as a potential solution, companies can better safeguard against similar disruptions and ensure service continuity before an issue even starts. Observability is key to enabling these rollout improvements and reducing risk.
Overview of the OpenAI and CrowdStrike Outages
OpenAI, known for its innovative artificial intelligence models, faced an outage that impacted several of its services, including the popular GPT models. The disruption happened 5 days ago and was caused by an oversight in an update rollout that did not adequately account for certain operational variables, specifically at scale, leading to widespread service failures. This class of issue is difficult to test for, since few testing environments can fully mirror production load at scale; even so, rolling out to all of production at once introduces unnecessary risk. In OpenAI's statement, the very first item they mentioned moving forward speaks to this:
Robust phased rollouts: We're continuing our work on improved phased rollouts with better monitoring for all infrastructure changes to ensure that any failure has limited impact and is detected early. All infrastructure-related configuration changes moving forward will follow a robust phased rollout process, with improved continuous monitoring that ensures that both the service workloads and the clusters (including the Kubernetes control plane) are healthy.
CrowdStrike, a cybersecurity giant, also experienced a significant service disruption affecting its Falcon platform. This outage compromised endpoint protection across its global user base, temporarily leaving numerous systems unprotected. The root cause was identified as a configuration error during a standard update process, which inadvertently initiated a cascade of failures. Both incidents not only disrupted services but also exposed vulnerabilities in traditional update mechanisms. CrowdStrike's published RCA report states as its last item:
Template Instances should have staged deployment: Staged deployment mitigates impact if a new Template Instance causes failures such as system crashes, false-positive detection volume spikes or performance issues. New Template Instances that have passed canary testing are to be successively promoted to wider deployment rings or rolled back if problems are detected. Each ring is designed to identify and mitigate potential issues before wider deployment. Promoting a Template Instance to the next successive ring is followed by additional bake-in time, where telemetry is gathered to determine the overall impact of the Template Instance on the endpoint.
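The ring-promotion logic described in that RCA item can be sketched as a simple gate. This is a minimal illustration, not CrowdStrike's actual implementation: the telemetry fields and threshold values below are hypothetical, chosen only to show how bake-in telemetry might drive a promote/wait/rollback decision.

```python
from dataclasses import dataclass

@dataclass
class BakeInTelemetry:
    """Telemetry gathered during bake-in on one deployment ring (hypothetical fields)."""
    crash_rate: float            # fraction of endpoints reporting system crashes
    false_positive_spike: float  # detection volume relative to baseline (1.0 = no change)
    hours_baked: float           # elapsed bake-in time in this ring

def promotion_decision(t: BakeInTelemetry,
                       max_crash_rate: float = 0.001,
                       max_fp_spike: float = 1.5,
                       min_bake_hours: float = 24.0) -> str:
    """Return 'rollback', 'wait', or 'promote' for the next wider ring."""
    if t.crash_rate > max_crash_rate or t.false_positive_spike > max_fp_spike:
        return "rollback"  # problems detected: pull the update from this ring
    if t.hours_baked < min_bake_hours:
        return "wait"      # healthy so far, but the bake-in window has not elapsed
    return "promote"       # healthy for the full bake window: widen the rollout
```

The key design point is that promotion is the *last* outcome checked: an update must both stay healthy and survive the full bake-in window before it reaches more endpoints.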
Minimizing Outages with Phased or Staggered Rollouts
Nearly 15 years ago, when I was a developer within Microsoft Azure, we already had this concept. It was called Update Domains, or "UDs": a fault-tolerance mechanism used in cloud environments, particularly within Azure, to ensure high availability during updates or maintenance. An application or service is distributed across multiple UDs, which are logical groupings of resources like virtual machines. Updates or maintenance tasks are applied sequentially, one UD at a time, ensuring that only a portion of the infrastructure is affected at any given moment. This staggered approach minimizes downtime and reduces the risk of service disruption. By isolating updates to specific UDs, Microsoft enables continuous service availability, even during scheduled updates or unforeseen infrastructure changes.
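The UD mechanism above can be sketched in a few lines. This is an illustrative sketch, not Azure's implementation: the function names, the health probe, and the soak delay are all assumptions standing in for real orchestration machinery.

```python
import time

def rollout_by_update_domain(update_domains, apply_update, is_healthy,
                             soak_seconds=0.0):
    """Apply an update one UD at a time, halting at the first unhealthy domain.

    update_domains: list of lists of hosts (one inner list per UD)
    apply_update:   callable(host) that updates a single host
    is_healthy:     callable(host) -> bool, a post-update health probe
    Returns the list of UD indices that were updated and verified healthy.
    """
    completed = []
    for index, domain in enumerate(update_domains):
        for host in domain:
            apply_update(host)
        time.sleep(soak_seconds)  # let telemetry settle before judging health
        if not all(is_healthy(host) for host in domain):
            # Halt here: only this UD's slice of capacity is affected,
            # and the remaining domains keep running the old version.
            break
        completed.append(index)
    return completed
```

Because the loop stops at the first unhealthy domain, a bad update can degrade at most one UD's worth of capacity rather than the whole fleet.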
These types of phased or staggered rollouts represent a strategic approach to software updates, where new changes are gradually introduced to segments of the user base or infrastructure. This method offers numerous advantages for minimizing the risks associated with direct, full-scale deployments.
For OpenAI and CrowdStrike, implementing phased rollouts could have significantly mitigated the impact of their respective outages. By limiting the initial release to a small fraction of their environments, both entities could have identified and addressed the critical issues without a full-scale system disruption.
The Critical Role of Observability in Phased Rollouts
Observability plays a pivotal role in the success of phased rollouts. It involves collecting, analyzing, and acting on data from applications and infrastructure to gain a comprehensive understanding of system health and performance. Effective observability strategies enable real-time monitoring and proactive problem-solving during deployment phases, which is crucial for minimizing potential downtime. Key components include comprehensive logging, detailed performance metrics, and distributed tracing.
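To make the link between observability and rollout decisions concrete, here is a minimal sketch of one such signal: a rolling error-rate monitor whose output a deployment pipeline could consult before promoting a phase. The class name, window size, and threshold are illustrative assumptions, not any vendor's API.

```python
from collections import deque

class ErrorRateMonitor:
    """Rolling error rate over the last `window` requests (illustrative sketch)."""

    def __init__(self, window=100, threshold=0.05):
        self.outcomes = deque(maxlen=window)  # True means the request failed
        self.threshold = threshold

    def record(self, failed: bool) -> None:
        self.outcomes.append(failed)

    def error_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_halt_rollout(self) -> bool:
        """Tell the pipeline to pause promotion when errors exceed the threshold."""
        return self.error_rate() > self.threshold
```

In practice this signal would come from a metrics backend rather than an in-process counter, but the gating idea is the same: each rollout phase promotes only while its health signals stay under agreed thresholds.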
For companies like OpenAI and CrowdStrike, enhancing their observability frameworks could provide deeper insights into their deployment processes, enabling them to detect and address issues more swiftly and effectively during phased rollouts. Integrating comprehensive logging, detailed performance metrics, and thorough tracing capabilities would allow for a more controlled and informed update process, significantly reducing the likelihood of widespread service outages.
Conclusion
The outages experienced by OpenAI and CrowdStrike serve as continuing reminders of the complexities and risks inherent in software deployment. By adopting phased or staggered rollouts, companies can not only reduce the impact of potential outages but also enhance their ability to manage and mitigate risks. Coupled with robust observability practices, these strategies form a foundational approach to maintaining service reliability and trust in the ever-evolving landscape of technology services.