- cross-posted to:
- programming@programming.dev
The issue was not caused, directly or indirectly, by a cyber attack or malicious activity of any kind. Instead, it was triggered by a change to one of our database systems’ permissions which caused the database to output multiple entries into a “feature file” used by our Bot Management system. That feature file, in turn, doubled in size. The larger-than-expected feature file was then propagated to all the machines that make up our network.
The software running on these machines to route traffic across our network reads this feature file to keep our Bot Management system up to date with ever changing threats. The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail.
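In other words, the crash path was roughly of this shape (a minimal Rust sketch; the names and the limit are made up, not Cloudflare’s actual code):

```rust
// Minimal sketch of the failure mode described above. All names and the
// limit are invented; this is not Cloudflare's actual code.
const MAX_FEATURES: usize = 200; // assumed hard limit on feature count

fn load_features(raw: &str) -> Result<Vec<String>, String> {
    let features: Vec<String> = raw.lines().map(|s| s.to_string()).collect();
    if features.len() > MAX_FEATURES {
        // A feature file with duplicated entries trips this branch.
        return Err(format!(
            "feature file has {} entries, limit is {}",
            features.len(),
            MAX_FEATURES
        ));
    }
    Ok(features)
}

fn main() {
    // Duplicated rows double the file and push it past the limit.
    let doubled_file = "feature_a\n".repeat(400);
    // An unwrap() here turns a recoverable error into a process-wide panic.
    let _features = load_features(&doubled_file).unwrap();
}
```

The size check itself is reasonable; the damage comes from the over-limit case crashing the process everywhere the file is propagated instead of being handled.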

Wasn’t it crowdstrike? Close enough though
The crowd was in the cloud.
Shit, you’re right. Oh well.
So I work in the IT department of a pretty large company. One of the things that we do on a regular basis is staged updates, so we’ll get a small number of computers and we’ll update the software on them to the latest version or whatever. Then we leave it for about a week, and if the world doesn’t end we update the software onto the next group and then the next and then the next until everything is upgraded. We don’t just slap it onto production infrastructure and then go to the pub.
But apparently our standards are slightly higher than those of an international organisation whose whole purpose is cyber security.
My assumption is that the pattern you describe is possible/doable at certain scales and with certain combinations of technologies. But doing this across a distributed system with as many nodes, and as many different kinds of nodes, as Cloudflare has, while still having a system that can be updated quickly (responding to DDoS attacks, for example) is a lot harder.
If you really feel like you have a better solution please contact them and consult for them, the internet would thank you for it.
They know this, it’s not like any of this is a revelation. But the company has been lazy and would rather just test in production because that’s cheaper and most of the time perfectly fine.
You would do well to go read up on the 1990 AT&T long distance network collapse. A single line of changed code, rolled out months earlier, ultimately triggered what you might call these days a DDoS attack that took down all 114 long distance telephone switches in their nationwide network. Over 50 million long distance calls were blocked in the 9 hours it took them to identify the cause and roll out a fix.
AT&T prided itself on the thoroughness of its testing & rollout strategy for any code changes. The bug that took them down was both timing-dependent and load-dependent, making it extremely difficult to test for, and it required fairly specific real world conditions to trigger. That’s how it went unnoticed for months before it finally fired.
When are people going to realise that routing a huge chunk of the internet through one private company is a bad idea? The entire point of the internet is that it’s a decentralized network of networks.
I hate it but there really isn’t much in the way of an alternative. Which is why they’re dominant, they’re the only game in town
How come?
You can route traffic without Cloudflare.
You can use CDNs other than Cloudflare’s.
You can use tunneling from other providers.
There are providers of DDOS protection and CAPTCHA other than Cloudflare.
Sure, Cloudflare is probably the closest to a single, integrated solution for the full web delivery stack. It’s also not prohibitively expensive, depending on who needs what.
So the true explanation, as always, is laziness.
there really isn’t much in the way of an alternative
Bunny.net covers some of the use cases, like DNS and CDN. I think they just rolled out a WAF too.
There’s also the “traditional” providers like AWS, Akamai, etc. and CDN providers like KeyCDN and CDN77.
I guess one of the appeals of Cloudflare is that it’s one provider for everything, rather than having to use a few different providers?
Someone always chimes into these discussions with the experience of being DDOSed and Cloudflare being the only option to prevent it.
Sounds a lot like a protection racket to me.
Companies like OVH have good DDoS protection too.
Meaning it was an internal error, like the two prior ones.
Almost like one big provider with 99.9999% availability is worse than 10 with maybe 99.9%
Except, if you choose the wrong one of those 10 and your company is the only one down for a day, you get fire-bombed. If “TEH INTERNETS ARE DOWN” and your website is down for a day, no one even calls you.
Note that this outage by itself, based on their chart, was kicking out errors over a span of about 8 hours. This one outage alone would have almost entirely blown their downtime allowance under a 99.9% availability criterion.
If one big provider actually provided 99.9999%, that would be 30 seconds of all outages over a typical year. Not even long enough for people to generally be sure there was an ‘outage’ as a user. That wouldn’t be bad at all.
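Back-of-the-envelope, assuming a 365-day year (hypothetical helper, just the arithmetic):

```rust
// Back-of-the-envelope downtime budgets for a 365-day year.
fn downtime_seconds_per_year(availability: f64) -> f64 {
    (1.0 - availability) * 365.0 * 24.0 * 3600.0
}

fn main() {
    // 99.9%    -> ~31,536 s/year, i.e. roughly 8.8 hours
    // 99.9999% -> ~31.5 s/year, i.e. roughly half a minute
    println!("99.9%    -> {:.0} s/year", downtime_seconds_per_year(0.999));
    println!("99.9999% -> {:.1} s/year", downtime_seconds_per_year(0.999999));
}
```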
deleted by creator
We are going to see a lot more of this type of bullshit now that there are no standards anymore. Fuck everything else and make that money people!
really reminds me of the self-owned crowdstrike bullshit
This is just the beginning of the coming vibe code apocalypse.
Somewhere, that dev who was told that having clustered databases in nonprod was too expensive and not needed is now updating the deploy scripts
Sadly, in my case, even after almost destroying a production cluster, they still decided a test cluster is too expensive and they’ll just live with the risk.
Before today, ClickHouse users would only see the tables in the default database when querying table metadata from ClickHouse system tables such as system.tables or system.columns.
Since users already have implicit access to underlying tables in r0, we made a change at 11:05 to make this access explicit, so that users can see the metadata of these tables as well.
I’m no expert, but this feels like something you’d need to ponder very carefully before deploying. You’re basically changing the result of every metadata query against your db. I don’t work there, but I’m sure there are plenty of places in the codebase with a bunch of “query this and pick column 5 from the result”.
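As an illustration only (made-up names, not the actual query or code from the report), this is the kind of pattern that silently doubles its output when a second database’s rows show up in the metadata:

```rust
// Hypothetical illustration of the brittle pattern: code that assumes each
// column name appears exactly once in a system.columns-style result set.
fn build_feature_list(rows: &[(String, String)]) -> Vec<String> {
    // rows are (database, column_name) pairs from a metadata query that
    // doesn't filter on the database name.
    rows.iter().map(|(_db, col)| col.clone()).collect()
}

fn main() {
    let before = vec![("default".to_string(), "feature_a".to_string())];
    let after = vec![
        ("default".to_string(), "feature_a".to_string()),
        // After the permissions change, the same table is also visible via r0,
        // so every column shows up a second time.
        ("r0".to_string(), "feature_a".to_string()),
    ];
    assert_eq!(build_feature_list(&before).len(), 1);
    assert_eq!(build_feature_list(&after).len(), 2); // output silently doubled
}
```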
“Claude said it was fine, ship it.”
Mortem
Wishful thinking :)
deleted by creator
Zero for the triggering action. A human rolled out a permissions change in a database that led to an unexpected failure in a different system because that other system was missing some safety checks when loading the data (non-zero chance that code was authored in some way by AI).
Classic example of how dangerous rust is.
If they had just used Python and run the whole thing in a try block with a bare except, this would never have been an issue.
Edit: this was a joke, and not well done. I thought the foolishness would come through.
honestly this was a coding cock-up. there’s a code snippet in the article that unwraps on a `Result`, which you don’t do unless you’re fine with that part of the code crashing. i think they are turning linters back up to max and rooting through all their rust code as we speak
This can happen regardless of language.
The actual issue is that they should be canarying changes. Push them to a small percentage of servers, and ensure nothing bad happens before pushing them more broadly. At my workplace, config changes are automatically tested on one server, then an entire rack, then an entire cluster, before fully rolling out. The rollout process watches the core logs for things like elevated HTTP 5xx errors.
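Roughly the shape of that rollout loop, with hypothetical stage names, thresholds, and a stubbed-out monitoring call:

```rust
// Sketch of a staged (canary) rollout: deploy to progressively larger scopes
// and abort if the error rate regresses at any stage. Everything here is
// hypothetical, not any particular company's tooling.
struct Stage {
    name: &'static str,
}

fn error_rate_after_deploy(_stage: &Stage) -> f64 {
    // Stand-in for real monitoring (scraping core logs, metrics queries, etc.).
    0.001
}

fn rollout(stages: &[Stage], max_error_rate: f64) -> Result<(), String> {
    for stage in stages {
        println!("deploying to {}", stage.name);
        let rate = error_rate_after_deploy(stage);
        if rate > max_error_rate {
            return Err(format!(
                "aborting rollout: 5xx rate {:.3} at stage '{}'",
                rate, stage.name
            ));
        }
    }
    Ok(())
}

fn main() {
    let stages = [
        Stage { name: "one server" },
        Stage { name: "one rack" },
        Stage { name: "one cluster" },
        Stage { name: "whole fleet" },
    ];
    match rollout(&stages, 0.01) {
        Ok(()) => println!("rollout complete"),
        Err(e) => eprintln!("{e}"),
    }
}
```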
I hope you’re joking. If anything, Rust makes error handling easier by returning them as values using the `Result` monad. As someone else pointed out, they literally used `unwrap` in their code, which basically means “panic if this ever returns an error”. You don’t do this unless it’s impossible to handle the error inside the program, or if panicking is the behavior you want due to e.g. security reasons.
Even as an absolute amateur, whenever I post any Rust to the public, the first thing I do is get rid of `unwrap` as much as possible, unless I intentionally want the application to crash. Even then, I use `expect` instead of `unwrap` to have some logging. This is definitely the work of some underpaid intern.
Also, Python is sloooowwww.
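To make the `unwrap` / `expect` / handle-it distinction concrete, a toy sketch (not the code from the article):

```rust
use std::fs;

fn main() {
    // unwrap(): panics with a generic message if the file can't be read.
    // let config = fs::read_to_string("features.txt").unwrap();

    // expect(): still panics, but at least says what went wrong and where.
    // let config = fs::read_to_string("features.txt")
    //     .expect("failed to read features.txt");

    // Actually handling the Err keeps the process alive.
    match fs::read_to_string("features.txt") {
        Ok(config) => println!("loaded {} bytes of config", config.len()),
        Err(e) => eprintln!("could not read features.txt, keeping old config: {e}"),
    }
}
```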
I was joking, but oof it did not go over well.
Ah that makes sense. To be fair tho, there’s a lot of unwarranted hate towards Rust so it can be hard to tell.
I should bite the bullet and learn it.
I decided to learn Zig recently; it feels like crafting artisanal software, which is what I liked C for. But it’s kinda janky in that major features come and go with each point version (see io and async).
There’s a place for engineering software, which is what Rust seems great at. It definitely seems like a tool I could/would use, as Rust is taking over many of my tool workflows.








