The CrowdStrike outage has been making headlines everywhere. If you’re curious about what happened and why it matters, you’re in the right place. This post breaks down the incident, its implications, and the lessons we can all take from it.
The Incident
On July 19, 2024, at approximately noon IST, numerous organizations were hit by the same problem: Windows machines suddenly showing the Blue Screen of Death (BSOD), causing extensive disruption. The cause was traced back to an update released by the cybersecurity firm CrowdStrike, which affected critical systems across sectors such as hospitals, airlines, and emergency services. The disruption led to flight delays, hospital system outages, and emergency services becoming unavailable, affecting the lives of many people.
What is CrowdStrike?
CrowdStrike is a cybersecurity firm that safeguards endpoint systems against cyber threats through a lightweight agent installed on the operating system. Its primary offering, EDR (Endpoint Detection and Response), protects more than 29,000 organizations by detecting and mitigating malware and ransomware. Because CrowdStrike’s agent is deeply integrated with Windows, the impact of this incident was extensive on Windows machines running that agent.
Here is a detailed breakdown of the CrowdStrike Windows outage and how to fix it:
First, let’s understand why we saw the infamous Blue Screen of Death (BSOD) on screens around the world today.
BSOD – Blue Screen of Death
A BSOD signifies a critical error, often in kernel-level code, which has privileged access to system resources. Three Windows “modes” are worth telling apart here:
- User mode – Where most programs run, with limited access to system resources. The browser you are reading this in is most likely running in user space. When a user-mode process crashes, the operating system reports the crash and you can start the program again. Hopefully it works and the issue doesn’t recur, but either way you can still use your computer.
- Kernel mode – Where critical software runs, with direct hardware access. This includes system drivers and EDR software; when you hear “system drivers”, they are most likely running here. Kernel mode should be treated like Spider-Man’s powers (“With great power comes great responsibility”). The number of major companies affected around the world shows how much damage a fault in kernel-mode software can do.
- S mode – This is a Windows-only, locked-down configuration that only runs Microsoft Store apps. Unlike user and kernel mode it is not a processor privilege level, so we don’t need to say much about it in this context, but know that it exists.
A kernel-mode crash usually causes a BSOD. EDR software like CrowdStrike’s needs kernel access to monitor system events and to stop actual malware from taking action on a system. When CrowdStrike’s kernel driver hit an error, it caused the BSOD you saw in the news headlines. Because it is a required boot-start driver, it loads early on every boot, so affected machines crashed again after each restart.
But how did that happen?
CrowdStrike pushed what they called a “Channel File Update” to all customer systems. This file (C-00000291-00000000-00000032.sys) bypasses customer-configured Sensor Update Policies and is a background update to the core components of all installed agents. Usually these updates are applied without issue and no action is ever needed from the user.
However, as the stack trace described below shows, the update triggered a null pointer error once the kernel driver (CSagent.sys) tried to load this file. I won’t go into pointers and C++ memory management here, but understand that low-level languages let you do unsafe things that crash the program unless you write checks into your code. CrowdStrike’s sensor is written in C++, where failing to check for NULL pointers can cause crashes. This error led to a memory access violation, forcing Windows to crash the entire system.
A stack trace from one of the crashes shows the error “Access Violation”, meaning the crash came from a problem accessing memory. The read address is 0x9c, and the faulting instruction is a move (mov), which is assembly for “copy data from here to there”. Unfortunately, address 0x9c is not accessible: it lies in the lowest region of the address space, which Windows leaves unmapped so that null pointer mistakes are caught. When kernel-mode code tries to read such an address, Windows cannot recover and crashes the whole system.
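To make the 0x9c address concrete, here is a minimal C++ sketch. It is not CrowdStrike’s code: the `ChannelEntry` struct, its `flags` field, and the function names are hypothetical, chosen only so that a field sits 0x9c bytes into a structure. It shows why reading a field through a null pointer yields a small address like 0x9c, and what a NULL check could look like.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

// Hypothetical structure, invented for illustration only.
// The 0x9c bytes of header place `flags` at offset 0x9c.
struct ChannelEntry {
    std::uint8_t header[0x9c];
    std::uint32_t flags;
};

static_assert(offsetof(ChannelEntry, flags) == 0x9c,
              "flags is expected at offset 0x9c");

// Address the CPU would read for `entry->flags`: base address plus
// field offset. With a null base (0) the result is simply 0x9c.
std::uintptr_t address_of_flags(std::uintptr_t base) {
    return base + offsetof(ChannelEntry, flags);
}

// Unchecked access would be `return entry->flags;`, which is
// undefined behavior (in practice a crash) when entry is null.

// Checked access: returns an empty optional instead of reading
// through a null pointer.
std::optional<std::uint32_t> read_flags_checked(const ChannelEntry* entry) {
    if (entry == nullptr) {
        return std::nullopt;
    }
    return entry->flags;
}
```

Called with a null pointer, `read_flags_checked` returns an empty `std::optional`, so the caller can reject the bad input instead of crashing.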
Resolving the CrowdStrike BSOD Bug
To recover your machine:
- Access the Windows Recovery Environment: tap the F8 key repeatedly during startup until you see the Recovery screen. On many Windows 10 and 11 machines F8 is disabled by default; there, Windows typically opens the recovery environment on its own after repeated failed boots.
- Navigate: Troubleshoot -> Advanced options -> Command Prompt.
- In the Command Prompt window, type the commands needed to remove the faulty file, as suggested in the code below ⬇️
To delete:

    del C:\Windows\System32\drivers\CrowdStrike\C-00000291*.sys

To disable (this script assumes the file name, minus its extension, matches a registered service name; the channel file is a data file rather than a kernel driver, so the sc config step is likely to fail, and the delete command above is the safer route):

    @echo off
    setlocal

    REM Define the driver file pattern
    set "driver_pattern=C-00000291*.sys"

    REM Define the target directory
    set "target_dir=C:\Windows\System32\drivers\CrowdStrike"

    REM Change to the target directory
    cd /d "%target_dir%" || (
        echo Failed to change directory to %target_dir%
        goto :error
    )

    REM Find the first file matching the pattern
    for %%f in (%driver_pattern%) do (
        set "driver_file=%%f"
        goto :found
    )

    echo No driver file matching %driver_pattern% found.
    goto :error

    :found
    REM Strip the 4-character ".sys" extension to get the base name
    REM (ASSUMPTION: the base name matches a registered service name)
    set "driver_name=%driver_file:~0,-4%"

    REM Disable the driver; fails if no service with this name exists
    sc config "%driver_name%" start= disabled || (
        echo Failed to disable the driver %driver_name%
        goto :error
    )

    echo Successfully disabled the driver %driver_name%

    REM Reboot the system
    shutdown /r /t 0
    goto :eof

    :error
    echo An error occurred. Exiting without reboot.
    endlocal
    pause
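For illustration, the delete step can also be sketched in C++17 with `std::filesystem`. This is a sketch under stated assumptions rather than an official tool: the function names are hypothetical, the match is case-sensitive (unlike `del` on Windows), and the directory is supplied by the caller instead of being hard-coded.

```cpp
#include <cassert>
#include <cstddef>
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// True when a file name matches the pattern C-00000291*.sys
// (case-sensitive, which is a simplification).
bool is_faulty_channel_file(const std::string& name) {
    const std::string prefix = "C-00000291";
    const std::string suffix = ".sys";
    if (name.size() < prefix.size() + suffix.size()) {
        return false;
    }
    return name.compare(0, prefix.size(), prefix) == 0 &&
           name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Deletes matching regular files directly inside `dir` and returns how
// many were removed. Returns 0 if the directory cannot be read.
std::size_t remove_faulty_channel_files(const fs::path& dir) {
    std::error_code ec;
    std::vector<fs::path> matches;

    // Collect first, then delete, so the directory is not modified
    // while it is being iterated.
    for (fs::directory_iterator it(dir, ec), end; !ec && it != end; it.increment(ec)) {
        std::error_code type_ec;
        if (it->is_regular_file(type_ec) &&
            is_faulty_channel_file(it->path().filename().string())) {
            matches.push_back(it->path());
        }
    }

    std::size_t removed = 0;
    for (const fs::path& p : matches) {
        std::error_code rm_ec;
        if (fs::remove(p, rm_ec)) {
            ++removed;
        }
    }
    return removed;
}
```

A caller would pass the CrowdStrike drivers directory of the affected Windows installation and check the returned count before rebooting.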
Other resolution steps include:
- Uninstall or Disable CrowdStrike
  - Safe Mode Uninstall: In Safe Mode, navigate to Control Panel > Programs > Programs and Features, find CrowdStrike Falcon, and uninstall it.
  - Disable Sensor: If uninstalling is not feasible, disable the CrowdStrike Falcon sensor from the CrowdStrike management console.
- Update Drivers and System Files
  - Driver Update: Ensure all system drivers are up to date. This can often resolve conflicts causing BSODs.
  - Windows Update: Run Windows Update to install the latest patches and updates from Microsoft.
- Reinstall CrowdStrike
  - Updated Version: Download the latest version of CrowdStrike Falcon from the official website or management console.
  - Installation: Reinstall the updated version on the affected devices.
- Test and Monitor
  - Initial Testing: After reinstallation, perform a series of tests to ensure the system operates without issues.
  - Continuous Monitoring: Use CrowdStrike’s monitoring tools to keep an eye on system performance and quickly identify any future issues.
Preventive Measures for the Future
- Regular Backups: Ensure regular backups of critical data to mitigate the risk of data loss during system crashes.
- Staged Rollouts: Implement phased rollouts for updates to critical software like CrowdStrike, allowing time to identify and address issues before widespread deployment.
- System Compatibility Checks: Regularly perform compatibility tests for new software updates with existing system configurations to prevent conflicts.
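The staged rollout idea can be sketched in a few lines of C++. This is an illustrative sketch, not how CrowdStrike or any particular vendor ships updates: the ring percentages, the function names, and the crash-rate threshold are all assumptions. Hosts are assigned to rings by a stable hash, and the rollout widens only while the observed crash rate stays under the threshold.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical ring layout: percentage of the fleet covered once each
// ring is enabled (1%, then 10%, then 50%, then everyone).
constexpr int kRingCoverage[] = {1, 10, 50, 100};
constexpr std::size_t kRingCount = sizeof(kRingCoverage) / sizeof(kRingCoverage[0]);

// 32-bit FNV-1a hash: stable across runs and platforms, unlike std::hash.
std::uint32_t fnv1a(const std::string& s) {
    std::uint32_t h = 2166136261u;
    for (unsigned char c : s) {
        h ^= c;
        h *= 16777619u;
    }
    return h;
}

// Maps a host to the first ring whose coverage includes its bucket (0..99).
std::size_t ring_for_host(const std::string& host_id) {
    const int bucket = static_cast<int>(fnv1a(host_id) % 100);
    for (std::size_t ring = 0; ring < kRingCount; ++ring) {
        if (bucket < kRingCoverage[ring]) {
            return ring;
        }
    }
    return kRingCount - 1;  // not reached: the last ring covers 100%
}

// A host receives the update only once its ring has been enabled.
bool should_receive_update(const std::string& host_id, std::size_t enabled_ring) {
    return ring_for_host(host_id) <= enabled_ring;
}

// Gate for widening the rollout: advance only if the crash rate seen in
// the rings updated so far is at or below the allowed maximum.
bool can_advance(std::size_t crashed_hosts, std::size_t updated_hosts,
                 double max_crash_rate) {
    if (updated_hosts == 0) {
        return false;  // no evidence yet, so do not widen the rollout
    }
    const double rate = static_cast<double>(crashed_hosts) /
                        static_cast<double>(updated_hosts);
    return rate <= max_crash_rate;
}
```

With a gate like `can_advance`, a faulty update that crashes the first ring stops there instead of reaching every machine at once.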
Implications and Lessons
This incident underscores the critical nature of kernel-level software and the importance of thorough testing and robust error handling. It also highlights the need for effective crisis management strategies. Organizations must balance the need for quick updates with the importance of safety and reliability.
Moving Forward
The CrowdStrike outage serves as a wake-up call for the IT community to re-evaluate its internal processes. Rigorous testing, gradual rollout of updates, and robust error handling are crucial. Additionally, organizations should strengthen their resilience and adaptability so they can handle such crises effectively.
Final Thoughts
While CrowdStrike has been a trusted name in cybersecurity, this incident is a reminder of the complexities and risks involved in the field. It is an opportunity for the entire industry to learn, improve, and reinforce its commitment to making the digital world a safer place. This event calls for a collective effort to improve technical practices, build resilience, and keep a continuous learning mindset in the face of challenges.