Between November 2025 and February 2026, Cloudflare experienced three major service disruptions. All three shared a disturbing pattern: they weren't caused by hardware failures, attacks, or human error. They were caused by automated systems doing exactly what they were programmed to do.
This is the automation paradox. We build self-healing systems to eliminate human error, but when automation fails, it fails at machine speed and machine scale.
IMAGE PLACEHOLDER 1: Timeline Chart
Timeline showing three Cloudflare outages: Nov 18, Dec 5, Feb 20 with duration and impact metrics
The Three Incidents
November 18, 2025: Bot Management File Generation
An automated system generated a malformed Bot Management configuration file. The file was syntactically valid but contained logic that caused processing failures. Because generation and deployment were fully automated, it propagated globally before detection.
December 5, 2025: React Security Patch (28% of HTTP traffic affected)
Cloudflare rushed to deploy protection for a React Server Components vulnerability. To meet the deadline, engineers disabled a testing tool that was flagging errors.
The change: increase a buffer size to 1 MB. The bug: the code tried to allocate more memory than was available and panicked instead of handling the error gracefully.
From their post-mortem: "If this code had been written in Rust, this error literally couldn't have happened at compile time."
Deeper issue: automated testing was disabled to meet deployment schedule.
February 20, 2026: BYOIP BGP Withdrawal (6 hours 7 minutes)
Automated maintenance ran on the IP management system. A bug caused "update configurations" to be executed as "withdraw BGP announcements."
Result: 1,100 of 6,500 IP prefixes withdrawn globally, leaving 25% of BYOIP customers unreachable for over an hour.
IMAGE PLACEHOLDER 2: BGP Withdrawal Graph
Graph from Cloudflare post-mortem showing prefix announcements dropping from 6,500 to 5,400 in minutes
The Pattern: Speed, Scale, and Missing Guardrails
Speed of propagation: Manual changes roll out gradually. Automated changes deploy globally in seconds.
Scale of impact: Humans affect what they can touch. Automation affects everything it can touch. One bug, 28% of web traffic failing.
Validation gaps: Humans have intuition. Automation has only what you programmed. No programmed check for "withdrawing 1,000 routes seems like a lot"? It won't check.
Cascading failures: In the React incident, automated testing caught the problem; the chosen fix was to disable the testing.
The Numbers
From ThousandEyes Internet Report:
Global network outages:
- November 2025: 421 incidents
- December 2025: 1,170 incidents (178% increase)
US specifically:
- November: 153 incidents
- December: 587 incidents (284% increase)
December surge coincided with multiple automation-related incidents industry-wide.
Why This Matters Beyond Cloudflare
Cloudflare handles 20-25% of HTTP/HTTPS requests globally. When their automation fails, a quarter of the web breaks.
AWS, Google, Azure, Cloudflare combined: 60-70% of internet traffic. We've centralized infrastructure for efficiency, centralized blast radius of failures.
The autonomous agent problem:
From ThousandEyes 2026 risk report: Organizations moving from narrow automation to agents with broader authority. Auto-scalers, remediation tools, AIOps making consequential decisions simultaneously on shared infrastructure.
Each agent operates correctly in isolation. When multiple agents interact, you get "interaction failures" - systems working as designed creating states nobody anticipated.
IMAGE PLACEHOLDER 3: Automation Cascade Diagram
Flowchart showing how one automation bug triggers multiple system interactions, each amplifying the problem
Cloudflare's Response: Code Orange
After November and December outages, Cloudflare declared "Code Orange: Fail Small."
Goal: Prevent failures in one part from cascading globally.
Approach:
- New system "Aegis" for configuration management
- Mandatory 7-day canary deployments for critical systems
- 92% automated testing coverage
- Chaos engineering program
- Detection within 60-90 seconds
Key insight: "We can't prevent all bugs. We need to prevent all bugs from becoming global outages."
Lessons for Infrastructure Engineers
1. Automation Needs Different Testing
Traditional testing asks: "Does this work correctly?" Automation testing must also ask: "What happens when this fails?"
Test the failure modes. What if buffer allocation fails? What if we're withdrawing 1,000 routes not 10? What if automated rollback also fails?
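A sketch of what failure-mode testing can look like, using the February incident's numbers. `plan_withdrawal` is a hypothetical guard, not a real Cloudflare API; the point is that the tests assert a routine change passes while an outsized one is refused.

```rust
// Hypothetical pre-flight guard for a route-withdrawal routine.
// Refuses any plan that would withdraw more than a safety threshold.
fn plan_withdrawal(to_withdraw: usize, total: usize, max_fraction: f64) -> Result<usize, String> {
    if total == 0 {
        return Err("no announced prefixes".into());
    }
    let fraction = to_withdraw as f64 / total as f64;
    if fraction > max_fraction {
        return Err(format!(
            "refusing to withdraw {:.0}% of prefixes (limit {:.0}%)",
            fraction * 100.0,
            max_fraction * 100.0
        ));
    }
    Ok(to_withdraw)
}

fn main() {
    // Normal maintenance: 10 of 6,500 prefixes is fine.
    assert!(plan_withdrawal(10, 6_500, 0.05).is_ok());
    // The February bug's magnitude: 1,100 of 6,500 must be rejected, not executed.
    assert!(plan_withdrawal(1_100, 6_500, 0.05).is_err());
    println!("failure-mode checks passed");
}
```

The threshold of 5% is an illustrative assumption; the real value is an operational decision.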
2. Speed Controls Are Not Optional
Every automation system needs:
- Rate limits (don't change more than X things per minute)
- Magnitude checks (alert if changing >Y% of total)
- Rollback verification (ensure rollback works before deploying forward)
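The first control can be sketched as a sliding-window rate limiter in Rust; the struct, window, and limit are illustrative assumptions, not any particular production system.

```rust
use std::time::{Duration, Instant};

// Sliding-window rate limiter for an automation pipeline:
// at most `max_per_window` changes may proceed per window.
struct ChangeRateLimiter {
    window: Duration,
    max_per_window: usize,
    timestamps: Vec<Instant>,
}

impl ChangeRateLimiter {
    fn new(window: Duration, max_per_window: usize) -> Self {
        Self { window, max_per_window, timestamps: Vec::new() }
    }

    // Returns true if the change may proceed now.
    fn allow(&mut self, now: Instant) -> bool {
        // Drop timestamps that have aged out of the window.
        self.timestamps.retain(|t| now.duration_since(*t) < self.window);
        if self.timestamps.len() < self.max_per_window {
            self.timestamps.push(now);
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut limiter = ChangeRateLimiter::new(Duration::from_secs(60), 10);
    let now = Instant::now();
    let allowed = (0..25).filter(|_| limiter.allow(now)).count();
    // Only 10 of 25 back-to-back changes get through; the rest must wait.
    assert_eq!(allowed, 10);
    println!("allowed {allowed} of 25 changes in one window");
}
```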
3. Staged Rollouts for Everything
Minimum stages:
- Test environment
- 1% production (monitor 1 hour)
- 10% production (monitor 4 hours)
- 100% production
Each stage: automated metrics checking, manual approval gates for critical systems.
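The stages above can be expressed as data driving a rollout loop. `run_rollout` and the `healthy` closure are hypothetical stand-ins for real metric checks and approval gates:

```rust
// Hypothetical staged-rollout driver. The `healthy` closure stands in
// for automated metrics checking at each stage.
fn run_rollout(stages: &[(f64, &str)], healthy: impl Fn(f64) -> bool) -> Result<(), String> {
    for &(pct, name) in stages {
        if !healthy(pct) {
            return Err(format!("halted at {name}: metrics unhealthy, rolling back"));
        }
    }
    Ok(())
}

fn main() {
    let stages = [(0.0, "test env"), (0.01, "1% prod"), (0.10, "10% prod"), (1.0, "100% prod")];
    // Simulate a regression that only shows up at 10% of production traffic.
    let result = run_rollout(&stages, |pct| pct < 0.10);
    // The rollout stops at the 10% stage instead of reaching 100%.
    assert!(result.is_err());
    println!("{}", result.unwrap_err());
}
```

The value of the staging is exactly this: a bug that survives the test environment and the 1% stage still never touches most of production.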
4. Strong Typing Prevents Bug Classes
Rust would have prevented the React bug at compile time. Move critical automation to languages that prevent undefined behavior.
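A minimal Rust sketch of the pattern the post-mortem describes, assuming the failure was an oversized buffer allocation. `try_reserve` surfaces allocation failure as a `Result` the caller is forced to handle, rather than a panic; `grow_buffer` is illustrative, not Cloudflare's actual code.

```rust
// Grow a buffer toward a target capacity without panicking on failure.
// `try_reserve` returns Err instead of aborting, so the caller must
// decide what degraded behavior looks like.
fn grow_buffer(buf: &mut Vec<u8>, target: usize) -> Result<(), String> {
    let additional = target.saturating_sub(buf.capacity());
    buf.try_reserve(additional)
        .map_err(|e| format!("allocation of {} bytes refused: {e}", additional))
}

fn main() {
    let mut buf: Vec<u8> = Vec::new();
    // 1 MB target, as in the December 5 change.
    match grow_buffer(&mut buf, 1 << 20) {
        Ok(()) => println!("buffer grown to {} bytes", buf.capacity()),
        Err(e) => eprintln!("degraded mode: {e}"), // fall back instead of crashing
    }
}
```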
5. Independent Monitoring
The system making changes can't be the only system checking whether those changes are correct.
Required:
- BGP monitoring from external vantage points
- Configuration diff tools running independently
- Alert on magnitude of changes, not just failures
- Human review triggers for large-scale changes
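A sketch of magnitude-based alerting fed by external observation rather than by the system that made the change. The function name and 5% threshold are assumptions; the prefix counts are the February 20 numbers.

```rust
// Compare what external BGP vantage points observe before and after a
// change, and alert on large drops regardless of whether anything "failed".
fn magnitude_alert(before: usize, after: usize, max_drop_fraction: f64) -> Option<String> {
    let dropped = before.saturating_sub(after);
    if before > 0 && (dropped as f64 / before as f64) > max_drop_fraction {
        Some(format!("ALERT: {dropped} of {before} prefixes withdrawn; page a human"))
    } else {
        None
    }
}

fn main() {
    // The February 20 drop: 6,500 -> 5,400 announced prefixes.
    if let Some(alert) = magnitude_alert(6_500, 5_400, 0.05) {
        println!("{alert}");
    }
    // A handful of routine withdrawals stays quiet.
    assert!(magnitude_alert(6_500, 6_490, 0.05).is_none());
}
```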
IMAGE PLACEHOLDER 4: Code Orange Framework
Diagram showing Cloudflare's fail-small architecture with isolation boundaries and circuit breakers
The Uncomfortable Truth
These outages aren't anomalies. They're predictable outcomes of how we build infrastructure.
We optimize for deployment speed, developer velocity, operational efficiency.
We don't optimize for failure isolation, graceful degradation, human oversight of automation.
December 5 incident: Engineers disabled testing to hit deadlines. Not a Cloudflare problem. An industry problem. We've all been there.
Question isn't "How could Cloudflare let this happen?"
Question is "How many other companies have similar automation bombs waiting?"
What Comes Next
ThousandEyes predicts interaction failures increase in 2026 as autonomous agents proliferate.
Example scenario:
- Auto-scaler sees high load, spins up 1,000 instances
- Cost optimization agent sees spike, kills "unnecessary" instances
- Health check agent sees failures, restarts services
- Traffic router sees restarts, shifts load
- Loop back to step 1
Each agent operates correctly. Together they create oscillation looking like DDoS. But self-inflicted.
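The loop above can be made concrete with a toy two-agent simulation (all numbers illustrative): each agent's rule is locally sensible, yet the fleet size never settles.

```rust
// Auto-scaler: target capacity is observed load plus 20% headroom.
fn autoscale(load: i64) -> i64 {
    (load as f64 * 1.2) as i64
}

// Cost optimizer: anything above bare minimum looks "unnecessary".
fn optimize_cost(load: i64) -> i64 {
    load
}

fn main() {
    let load = 900; // requests/sec, held constant
    let mut history = Vec::new();
    for _ in 0..8 {
        history.push(autoscale(load));      // scale up to 1,080
        history.push(optimize_cost(load));  // trim back to 900
    }
    // The fleet oscillates between 1,080 and 900 instances indefinitely.
    assert!(history.windows(2).all(|w| w[0] != w[1]));
    println!("instance counts: {:?}", &history[..6]);
}
```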
Organizations that will avoid this:
- Treat agent coordination as first-class design concern
- Build instrumentation capturing agent interactions
- Implement circuit breakers for automation
- Design for observability across full service delivery chain
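One of those circuit breakers might look like this sketch: after a run of consecutive failures, automation locks itself out until a human resets it. The struct and threshold are illustrative assumptions, not a specific product.

```rust
// Circuit breaker wrapping automated actions: after `threshold`
// consecutive failures the breaker opens and refuses further action.
struct AutomationBreaker {
    consecutive_failures: u32,
    threshold: u32,
    open: bool,
}

impl AutomationBreaker {
    fn new(threshold: u32) -> Self {
        Self { consecutive_failures: 0, threshold, open: false }
    }

    // Runs the action unless the breaker is open; None means
    // "automation is locked out, hand control back to humans".
    fn attempt(&mut self, action: impl Fn() -> Result<(), ()>) -> Option<Result<(), ()>> {
        if self.open {
            return None;
        }
        let result = action();
        match result {
            Ok(()) => self.consecutive_failures = 0,
            Err(()) => {
                self.consecutive_failures += 1;
                if self.consecutive_failures >= self.threshold {
                    self.open = true;
                }
            }
        }
        Some(result)
    }
}

fn main() {
    let mut breaker = AutomationBreaker::new(3);
    for _ in 0..5 {
        breaker.attempt(|| Err(())); // a misbehaving remediation loop
    }
    // After 3 consecutive failures the breaker opens; later attempts are refused.
    assert!(breaker.attempt(|| Ok(())).is_none());
    println!("breaker open: {}", breaker.open);
}
```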
Conclusion
Cloudflare's transparency in its post-mortems is the gold standard; be grateful they share these failures.
But the failures themselves are a warning. The more we automate, the more catastrophic automation failures become.
Solution is not less automation. Can't put that genie back. Solution is:
- Better-designed automation with guardrails, rate limits, failure isolation
- Stronger typing to catch bugs at compile time
- Staged rollouts always, for everything
- Independent monitoring, automation can't be its own watchdog
- Cultural change: "ship fast" can't mean "disable safety checks"
The automation paradox: Systems meant to prevent human error become the error when they fail. And they fail at scale and speed no human ever could.
We built tools to automate infrastructure at global scale. Now we need discipline to do it safely.
---
Further Reading:
- Cloudflare Nov 18 Post-Mortem
- Cloudflare Dec 5 Post-Mortem
- Cloudflare Feb 20 Post-Mortem
- Code Orange: Fail Small
- ThousandEyes 2026 Outage Risk Report
All technical details from official Cloudflare post-mortems and ThousandEyes Internet Reports.