I don’t have any insight into Cloudflare’s architecture or internal tooling, but reading their root cause analysis, a few things stand out:

This is, I think, a very typical way for a large, widespread outage to begin:
> This outage was caused by a change that was part of a long-running project to increase resilience in our busiest locations.
From the description, it seems they fell victim to the “let’s do the change on the less important locations first” syndrome. I am speculating, but it looks like they rolled out in phases, and then all 19 of the busiest locations (which have a specific feature called MCP) received the change at once (probably as the last phase).
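To illustrate the failure mode (my speculation about the pattern, not Cloudflare’s actual tooling): if a staged rollout orders locations by importance and lumps everything remaining into a final phase, the busiest sites all receive the change simultaneously. A minimal sketch, with made-up location names and traffic numbers:

```python
# Hypothetical sketch of a staged rollout plan: locations are sorted by
# traffic and released in waves of fixed sizes. Note that the final wave
# ends up containing *all* of the busiest locations at once -- exactly the
# pattern that turns a phased rollout into one simultaneous change.

def rollout_waves(locations, wave_sizes):
    """Split locations into consecutive waves, least busy first;
    whatever is left over lands together in one final wave."""
    ordered = sorted(locations, key=lambda loc: loc[1])  # ascending traffic
    waves, start = [], 0
    for size in wave_sizes:
        waves.append(ordered[start:start + size])
        start += size
    if start < len(ordered):
        waves.append(ordered[start:])  # the busiest sites, all together
    return waves

# Made-up data: (name, requests/sec)
locations = [("edge-a", 10), ("edge-b", 20), ("edge-c", 50),
             ("big-1", 900), ("big-2", 950), ("big-3", 1000)]

waves = rollout_waves(locations, wave_sizes=[2, 1])
# The last wave holds every high-traffic location simultaneously.
assert [name for name, _ in waves[-1]] == ["big-1", "big-2", "big-3"]
```

A safer plan keeps shrinking nothing: the highest-traffic sites should themselves be split across several waves, so no single phase can take down all of the busiest locations at once.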
Additionally, the change was described in a hard-to-read diff format, and multiple peer reviewers missed the issue.
1. It took them 3 minutes to declare an incident – very good visibility into what is happening
2. It took them 18 minutes to have a fix ready to test on one location
3. It took them 1 hour 15 minutes to roll the fix out everywhere