CrowdStrike blames testing bugs for security update that took down 8.5M Windows PCs

July 24, 2024:

CrowdStrike's Falcon security software brought down as many as 8.5 million Windows PCs over the weekend.

CrowdStrike

Security firm CrowdStrike has posted a preliminary post-incident report about the botched update to its Falcon security software that caused as many as 8.5 million Windows PCs to crash over the weekend, delaying flights, disrupting emergency response systems, and generally wreaking havoc.

The detailed post explains exactly what happened: Just after midnight Eastern time, CrowdStrike deployed “a content configuration update” to allow its software to “gather telemetry on possible novel threat techniques.” CrowdStrike says that these Rapid Response Content updates are tested before being deployed, and one of the steps involves checking updates using something called the Content Validator. In this case, “a bug in the Content Validator” meant it failed to detect “problematic content data” in the update, which went on to crash systems.

CrowdStrike says it is making changes to its testing and deployment processes to prevent something like this from happening again. The company is specifically including “additional validation checks to the Content Validator” and adding more layers of testing to its process.

The biggest change will probably be “a staggered deployment strategy for Rapid Response Content” going forward. In a staggered deployment system, updates are initially released to a small group of PCs, and then availability is slowly expanded once it becomes clear that the update isn’t causing major problems. Microsoft adopted phased rollouts for Windows security and feature updates after a couple of major hiccups during the Windows 10 era. To support its own phased rollout, CrowdStrike will “improve monitoring for both sensor and system performance” to help “guide a phased rollout.”
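For readers curious what a staggered deployment looks like in practice, here is a minimal Python sketch of the general technique. It is not CrowdStrike's or Microsoft's implementation; the stage percentages, the crash-rate threshold, and every function name are hypothetical choices made for illustration. Each machine is hashed into a stable bucket from 0 to 99, an update is offered only to buckets below the current stage's percentage, and the rollout halts if monitoring reports too many crashes.

```python
import hashlib

# Hypothetical rollout stages: cumulative percentage of machines
# that are offered the update at each stage.
STAGES = [1, 10, 50, 100]


def bucket(host_id: str) -> int:
    """Map a host ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(host_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_eligible(host_id: str, stage: int) -> bool:
    """Return True if this host should receive the update at this stage."""
    return bucket(host_id) < STAGES[stage]


def next_stage(stage: int, crash_rate: float, max_crash_rate: float = 0.001) -> int:
    """Advance one stage if the fleet looks healthy; return -1 to halt.

    crash_rate is the fraction of updated machines reporting crashes,
    as supplied by whatever monitoring the operator has (assumed here).
    """
    if crash_rate > max_crash_rate:
        return -1
    return min(stage + 1, len(STAGES) - 1)
```

Because the bucket depends only on the host ID, a machine that receives the update in an early stage stays eligible in every later one, and a bad update caught at the first stage reaches roughly 1 percent of the fleet rather than all of it.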

CrowdStrike says it will also give its customers more control over when Rapid Response Content updates are deployed so that updates that take down millions of systems aren’t deployed at (say) midnight when fewer people are around to notice or fix things. Customers will also be able to subscribe to release notes about these updates.

Recovery of affected systems is ongoing. Rebooting systems multiple times (as many as 15, according to Microsoft) can give them enough time to grab a new, non-broken update file before they crash, resolving the issue. Microsoft has also created tools that can boot systems via USB or a network so that the bad update file can be deleted, allowing systems to restart normally.

In addition to this preliminary incident report, CrowdStrike says it will release “the full Root Cause Analysis” once it has finished investigating the issue.
