Cloudflare suffered its worst outage since 2019. The incident was caused by a bug in the company's Bot Management system.
Matthew Prince, Cloudflare co-founder and CEO, has published a detailed article on the company's blog to explain what went wrong.
The outage, which began on 18 November 2025 at 11:20 UTC, had a global impact and took many websites offline. Prince clarified that the outage was not caused by a cyberattack such as a DDoS attack, as had initially been suspected. That's important to clarify, because people may have been worried that the network had been compromised. Once the cause had been identified, Cloudflare replaced the faulty feature file with an earlier, known-good version. The company says that its core traffic was largely back to normal by 14:30, about three hours after the issue began. Cloudflare managed to restore all systems by 17:06.
Cloudflare described the issue as follows: "A change in our underlying ClickHouse query behavior (explained below) that generates this file caused it to have a large number of duplicate “feature” rows. This changed the size of the previously fixed-size feature configuration file, causing the bots module to trigger an error."
Here's a simpler explanation. Cloudflare had changed the permissions on one of its database systems, and this caused the database to output duplicate entries into a “feature file” used by the company's Bot Management system. The feature file in question keeps the Bot Management system up to date so it can handle new threats. This system has many modules, including a machine learning model that is used to generate bot scores. Every request that passes through Cloudflare's network gets a bot score, and Cloudflare's customers (websites and online services) use these scores to decide which bots to allow on their site and which to block.
The "feature file" I mentioned earlier is a configuration file for the model, which uses it to judge whether a request is automated. The file is distributed to the servers across Cloudflare's network, where the software that routes traffic reads it. But, due to the bug, the file had doubled in size and grown beyond what the software expected, causing the software to fail, and the failure propagated across the entire network. Websites that had deployed rules to block bots saw a large number of false positives, and hence became unreachable.
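To make the failure a bit more concrete, here's a rough Python sketch of how duplicate rows can push a file past a fixed limit even though the set of features hasn't changed. This is not Cloudflare's code: the limit of 200, the 120 features, and every function name below are assumptions I made up for illustration.

```python
# Illustrative only: all names and numbers here are hypothetical.
MAX_FEATURES = 200  # assumed fixed capacity the consumer was built for


class FeatureLimitExceeded(Exception):
    """Raised when a feature file has more rows than the consumer allows."""


def load_features(rows, limit=MAX_FEATURES):
    """Naive loader: trusts the file and fails hard if it is too large."""
    if len(rows) > limit:
        raise FeatureLimitExceeded(f"{len(rows)} rows exceeds limit of {limit}")
    return list(rows)


def build_feature_file(names, copies=1):
    """Simulate the generating query; copies > 1 models duplicate rows."""
    return [name for name in names for _ in range(copies)]


names = [f"feature_{i}" for i in range(120)]
good_file = build_feature_file(names)            # one row per feature
bad_file = build_feature_file(names, copies=2)   # same features, each twice
```

Calling `load_features(good_file)` succeeds, while `load_features(bad_file)` raises `FeatureLimitExceeded`, even though both files describe exactly the same 120 features. If nothing upstream handles that error, the component reading the file goes down with it.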
Cloudflare says it will harden its system against failures like this by making these changes:
- Hardening ingestion of Cloudflare-generated configuration files in the same way we would for user-generated input
- Enabling more global kill switches for features
- Eliminating the ability for core dumps or other error reports to overwhelm system resources
- Reviewing failure modes for error conditions across all core proxy modules
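As a rough idea of what the first two items in that list could look like in practice, here's a minimal Python sketch. It validates an internally generated file the way you would validate untrusted input, keeps the last known-good version if the new one fails the checks, and includes a simple kill switch. The `FeatureStore` class, its fields, and the limit of 200 are all hypothetical; Cloudflare hasn't published how it will implement these changes.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureStore:
    """Holds the active feature list and refuses to replace it with a bad one."""
    limit: int = 200      # assumed capacity, not a documented value
    enabled: bool = True  # hypothetical global kill switch for the feature
    active: list = field(default_factory=list)

    def validate(self, rows):
        """Return a list of problems; an empty list means the file is usable."""
        problems = []
        if not rows:
            problems.append("file is empty")
        if len(rows) != len(set(rows)):
            problems.append("duplicate feature rows")
        if len(rows) > self.limit:
            problems.append(f"{len(rows)} rows exceeds limit of {self.limit}")
        return problems

    def ingest(self, rows):
        """Swap in the new file only if it validates; otherwise keep the old one."""
        problems = self.validate(rows)
        if problems:
            return False, problems
        self.active = list(rows)
        return True, []

    def score_inputs(self):
        """Features handed to the model; empty when the kill switch is off."""
        return self.active if self.enabled else []


store = FeatureStore()
good = [f"feature_{i}" for i in range(120)]
accepted, _ = store.ingest(good)             # valid file becomes active
rejected, problems = store.ingest(good * 2)  # duplicated file is refused
```

Here the second `ingest` call returns `False` along with the reasons, and `store.active` still holds the 120 rows from the first file. The point of the design is that a bad file produces a rejected update rather than a crash, and setting `enabled = False` gives operators a way to switch the feature off entirely.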
You may have noticed that Ghacks was down for a while yesterday too. We couldn't access the admin website either; it kept throwing internal server errors. While it wasn't as disruptive as last year's CrowdStrike outage, Cloudflare took a large chunk of the internet down. It didn't just affect websites: several online services were unusable, including ChatGPT, X, Spotify, and Downdetector. Cloudflare’s status page, which lets you check if the service is experiencing any disruptions, was unreachable as well.
The company says that the status page is hosted outside its own network, so when that went down as well, it suspected an attack. However, it was able to trace the problem to a bug and fix it. It's kind of scary to think about how many websites depend on Cloudflare's network. In case you are wondering, about 20% of the internet depends on Cloudflare. I have to admit, I thought that number would be way higher.
This is the third major cloud outage in recent weeks: Amazon Web Services (AWS) suffered an outage on October 20, and Microsoft Azure experienced one on October 29.
Thank you for being a Ghacks reader. The post Cloudflare says the outage on Tuesday was due to a bug in its bot detection system appeared first on gHacks Technology News.