Case Study : The CrowdStrike Outage
Date : 21st July 2024
Author : Vishal Tripathi
The global outage was caused by a faulty software update from cybersecurity firm CrowdStrike. This Texas-based company develops software to help companies detect and block hacks. Their update, intended to protect endpoints, interacted poorly with Windows systems, leading to widespread IT disruptions. Banks, airlines, and other businesses were affected, with services disrupted and computers rebooting due to the update issue. Microsoft is taking mitigation action to address the lingering impact of the outage.
CrowdStrike Root Cause Analysis: Case Study
1. Introduction
The global IT outage caused by CrowdStrike was a significant disruption that affected various industries, including banks, airlines, and businesses worldwide.
The Falcon Sensor, a component of CrowdStrike’s security platform, was responsible for the global outage. This sensor runs locally on users’ devices, scanning for malware. A faulty update caused it to malfunction, leading to widespread disruptions in Microsoft Windows systems worldwide.
1.1. Background of CrowdStrike
CrowdStrike is a Texas-based cybersecurity vendor that develops software to detect and block hacks. It’s widely used by Fortune 500 companies, including global banks, health-care providers, and energy companies. Their software focuses on endpoint security, applying cyber protections to devices connected to the internet.
2. Case Study
This case study covers the recent worldwide CrowdStrike-related outage, its root cause analysis, and ways to avoid a repeat in the future. Here’s what happened:
1. The Issue: On Friday, 19 July 2024, a software update to CrowdStrike’s Falcon product caused a cascade effect. Falcon is designed to prevent cyber breaches using cloud technology. Unfortunately, this update caused Windows machines to crash with the infamous “blue screen of death” error and reboot.
2. Impact: Industries faced disruptions—banks, health-care services, TV broadcasters, and air travel were all affected. Planes were grounded, services delayed, and businesses grappled with ongoing outages.
3. Rollback: CrowdStrike is currently rolling back the faulty update globally.
2.1. Kernel crash dump
Remember, this wasn’t a cyberattack but an unintended consequence of a flawed software update.
3. Root Cause Analysis
Memory in your computer is laid out as one giant array of numbers. We represent these numbers here in hexadecimal (base 16) because it's easier to work with... for reasons. The problem area? The computer tried to read memory address 0x9c (aka 156).
Why is this bad? This is an invalid region of memory for any program. Any program that tries to read from this region WILL IMMEDIATELY GET KILLED BY WINDOWS. That is what you see here with this stack dump.
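To make the hexadecimal arithmetic concrete, here is a minimal sketch that checks 0x9c really is 156 and prints an address in both notations. The names kFaultAddress and describe_address are made up for this illustration and do not come from any crash-analysis tool.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// The faulting address discussed in the text.
constexpr std::uintptr_t kFaultAddress = 0x9c;

// Hex digit 9 is worth 9 * 16, hex digit c is worth 12.
static_assert(kFaultAddress == 9 * 16 + 12, "0x9c expands to 9*16 + 12");
static_assert(kFaultAddress == 156, "0x9c is 156 in decimal");

// Formats a value as "0x<hex> (<decimal>)". Hypothetical helper.
std::string describe_address(std::uintptr_t address) {
    std::ostringstream out;
    out << "0x" << std::hex << address << " (" << std::dec << address << ")";
    return out.str();
}
```

Calling describe_address(kFaultAddress) yields the string "0x9c (156)".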
3.1. Addresses and some maths
So why is something trying to read memory address 0x9c? Well because... programmer error. It turns out that C++, the language CrowdStrike is using, likes to use address 0x0 as a special value to mean "there's nothing here", don't try to access it or you'll die.
Programmers in C++ are supposed to check for this when they pass objects around by "checking for null". Usually you'll see something like this:
std::string* p = get_name();
if (p == NULL)
{
    // Nothing to read here, so report the problem instead of dereferencing p.
    std::cerr << "Could not get name\n";
}
The string* part means we have a "pointer" to the start of the string value. If it's null, then there's nothing there, don't try to access it. So let's take a generic object with stuff in it:
struct Obj { int a; int b; };
If we create a pointer to it:
Obj* obj = new Obj();
we can get its start address. Let's say it's something random like 0x9030 = 36912 (I'm using small numbers). Then, with 4-byte ints, the address of:
obj is 0x9030
obj->a is 0x9030 + 0x0
obj->b is 0x9030 + 0x4
Each member is an offset from the start address, and the first member sits at offset 0.
Now let's assume the following:
Obj* obj = NULL;
Then the address of:
obj is 0
obj->a is 0 + 0
obj->b is 0 + 4
So if I do this on a NULL pointer:
std::cout << obj->b;
the program crashes with a stack dump like the one you see above, because it cannot read address 0x00000004.
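Member offsets can be checked rather than guessed. The sketch below assumes 4-byte integers (hence std::int32_t instead of int) and uses offsetof to confirm that the first member sits at offset 0 and the second at offset 4. The helper member_address is a made-up name, and it only does integer arithmetic, so nothing is ever dereferenced.

```cpp
#include <cstddef>
#include <cstdint>

// Same shape as the Obj in the text, with fixed-width 4-byte members.
struct Obj {
    std::int32_t a;
    std::int32_t b;
};

static_assert(offsetof(Obj, a) == 0, "first member starts at the object's address");
static_assert(offsetof(Obj, b) == 4, "second member follows the 4-byte first member");

// Address a member access would touch: base address plus member offset.
// Hypothetical helper; pure arithmetic, no memory is read.
constexpr std::uintptr_t member_address(std::uintptr_t base, std::size_t offset) {
    return base + offset;
}
```

With a base of 0x9030, member_address(0x9030, offsetof(Obj, b)) is 0x9034. With a NULL base of 0, the same call gives 4, which is why reading a member through a NULL pointer touches a tiny, invalid address.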
In this stack dump you see that it's trying to read memory address 0x9c. In human numbers, that's 156. So what happened is that the programmer forgot to check that the object it's working with is valid, and it tried to access one of the object's member variables...
NULL + 0x9C = 0x9C = 156. That's an invalid region of memory. And what's bad about this is that the code is part of a special program called a system driver, which has PRIVILEGED access to the computer. So the operating system is forced to, out of an abundance of caution, crash immediately.
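As an illustration of how a member could end up at offset 0x9C, here is a sketch with an invented structure. The Record layout is purely hypothetical and is not taken from CrowdStrike's driver; it only shows that a field preceded by 156 bytes of other data would be read at address 0x9C when the object pointer is NULL.

```cpp
#include <cstddef>
#include <cstdint>

// HYPOTHETICAL layout, invented for illustration only.
struct Record {
    std::int32_t header[39];  // 39 * 4 bytes = 156 bytes of earlier fields
    std::int32_t field;       // therefore starts at offset 156 = 0x9c
};

static_assert(offsetof(Record, field) == 0x9c, "field sits 0x9c bytes into Record");

// Address that reading a member at `offset` would touch when the object
// pointer is NULL (address 0). Integer arithmetic only, nothing is read.
constexpr std::uintptr_t address_touched_via_null(std::size_t offset) {
    return static_cast<std::uintptr_t>(0) + offset;
}
```

Here address_touched_via_null(offsetof(Record, field)) evaluates to 0x9c, matching the NULL + 0x9C arithmetic above.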
This is what is causing the blue screen of death. A computer can recover from a crash in non-privileged code by simply terminating the program, but not from a crash in a system driver. When your whole computer crashes, it's usually because of a crash in a system driver.
4. Best Practices
If the programmer had done a check for NULL, or if they had used modern tooling that checks these sorts of things, it could have been caught. But somehow it made it into production and then got pushed as a forced update by CrowdStrike... OOPS!
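A NULL check of the kind described here could look like the following minimal sketch. It assumes C++17 for std::optional, and read_b is a made-up name. The idea is that the caller gets an explicit "nothing here" result instead of a crash.

```cpp
#include <optional>

struct Obj {
    int a;
    int b;
};

// Returns obj->b, or std::nullopt when obj is null.
// Hypothetical helper: the check happens before any member is touched.
std::optional<int> read_b(const Obj* obj) {
    if (obj == nullptr) {
        return std::nullopt;  // nothing to read, report absence
    }
    return obj->b;
}
```

A caller then has to handle the empty case explicitly, for example with read_b(ptr).value_or(0), rather than finding out about a bad pointer at run time.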
The fix going forward is that Microsoft needs better policies for rolling back defective drivers, instead of letting risky updates go straight to customers. CrowdStrike will likely have their code safety officer put in code sanitization tools that catch this automatically.
And CrowdStrike will likely take a hard look at rewriting their system driver from C++ into a more modern language like Rust, whose safe subset has no null references.
4.1. Proactive Measures
C++ is hard. For mission-critical software like this, CrowdStrike should have set up automated testing using AddressSanitizer and ThreadSanitizer that runs on every code update.
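As a rough sketch of what such a check could look like, the code below pairs a bounds-checked accessor with a tiny regression check that could run on every code update. The names field_at and run_regression_checks are hypothetical. With GCC or Clang, a user-mode build like this can be compiled with -fsanitize=address to get AddressSanitizer. ThreadSanitizer uses -fsanitize=thread and needs its own separate build, because the two cannot be combined in one binary.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Returns element `index` of `fields`, or std::nullopt if it is out of range.
// Hypothetical accessor: the bounds check happens before the read.
std::optional<int> field_at(const std::vector<int>& fields, std::size_t index) {
    if (index >= fields.size()) {
        return std::nullopt;
    }
    return fields[index];
}

// A tiny regression check intended to run on every code update.
// Returns true only if valid reads succeed and the out-of-range read is rejected.
bool run_regression_checks() {
    const std::vector<int> fields{10, 20, 30};
    return field_at(fields, 0) == 10 &&
           field_at(fields, 2) == 30 &&
           !field_at(fields, 3).has_value();
}
```

If the bounds check were removed, the out-of-range read in the last check is exactly the kind of access a sanitizer build is designed to flag.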
5. Conclusion
The crash dump points to a NULL pointer dereference in code written in C++, a memory-unsafe language. It’s a reminder of how critical software updates are and the need for thorough testing before deployment.
For people looking for a conspiracy: some have speculated that this could be a plot to move mission-critical code to Rust, the only language other than C that Linux allows in the kernel. Nothing in the crash dump supports that. As noted above, this was an unintended consequence of a flawed software update.
6. Technical Updates
[ Updates from CrowdStrike official blog ]
Technical Details on the outage can be found here: