when a null pointer dereference breaks the internet lol

  126,414 views

ForrestKnight

A day ago

but it may not be the devs fault.
If you're a developer, sign up to my free newsletter Dev Notes 👉 www.devnotesda...
If you're a student, checkout my Notion template Studious: notionstudent.com
Don't know why you'd want to follow me on other socials. I don't even post. But here you go.
🐱‍🚀 GitHub: github.com/for...
🐦 Twitter: / forrestpknight
💼 LinkedIn: / forrestpknight
📸 Instagram: / forrestpknight

Comments: 763
@fknight
@fknight Ай бұрын
UPDATE: New info reveals it was a logic flaw in Channel File 291 that controls named pipe execution, not a null pointer dereference like many of us thought (although the stack trace indicates it was a null pointer issue, so Crowdstrike could be covering). Devs fault 100% (in addition to having systems in place that allow this sort of thing). Updates to Channel Files like these happen multiple times a day.
@kofiz7355
@kofiz7355 Ай бұрын
This should be pinned
@ingiford175
@ingiford175 Ай бұрын
Thanks for the update. Have not used named pipes in a Long time....
@anaveragehuman2937
@anaveragehuman2937 Ай бұрын
Source please?
@fknight
@fknight Ай бұрын
@@anaveragehuman2937 - www.crowdstrike.com/blog/falcon-update-for-windows-hosts-technical-details/
@user-in3xs9gn2o
@user-in3xs9gn2o Ай бұрын
Then delete this video.
@vilian9185
@vilian9185 Ай бұрын
Fun fact: the null pointer dereference was also in the Linux CrowdStrike agent, the Linux kernel just handled it like a boss
@oleksandrlytvyn532
@oleksandrlytvyn532 Ай бұрын
There seem to be articles saying that some time prior, maybe a month or a few months before, CrowdStrike allegedly did the same thing to Debian 12 or Rocky Linux as we see now for Windows. Potentially because of the smaller blast radius it went unnoticed in the media. But I myself don't know if it was true or not, so take it with a grain of salt
@Songfugel
@Songfugel Ай бұрын
well, a null pointer dereference is something that should throw an error, not be allowed and silenced without the dev explicitly handling it correctly. The problem here is, how was this ever allowed to be mass-delivered everywhere at once with such a glaring, general-case bug that should have shown up in any sort of testing? So Linux handling it quietly by itself might not be the own you think it is
@vilian9185
@vilian9185 Ай бұрын
@@Songfugel it threw an error, it just didn't kill itself like Windows did, where the fix was to reboot your machine 15 times and hope that the network came up before the driver 💀
@Songfugel
@Songfugel Ай бұрын
@@vilian9185 But a failure like that should kill the program and not let it continue, and since it was in the kernel itself, it should fail to boot past it. Windows had exactly the correct reaction to this very serious error; the problem is that it should never have gotten past the first patch or tests
@HappyGick
@HappyGick Ай бұрын
​@@Songfugel No, you should leave it to the developer to handle a crashed driver. The end user using the driver does not care if it crashed or not, only that the program using it works, *and most importantly,* that the machine works. Fail silently (let the user boot), inform whatever's using the driver that it doesn't work and why, and let the developer handle it. There are times and places for loud failures. This is one of the occasions where it's better to silently fail and inform the developer. A crashed driver almost took down society with it. Edit because seems like it wasn't clear: no, I'm not saying we should dereference the null pointer. Of course not. I'm saying that we should crash the driver only, and let the system move on without it loaded. Or unload it if it crashed at runtime. If another program tries to use it, it will raise an error and will be able to recover why the driver failed. In enterprise environments it's much better to have the system running vulnerable than not running at all. A vulnerability costs millions on one company. A company-wide crash costs billions. A worldwide crash is incalculable.
@capn_shawn
@capn_shawn Ай бұрын
“You cannot hack into a brick” -Crowdstrike, 2024
@Songfugel
@Songfugel Ай бұрын
@@capn_shawn 😂
@jedipadawan7023
@jedipadawan7023 Ай бұрын
Torvalds has always held that security bugs are just bugs and should not be granted special status whereby, given the obsession of some, all functionality is lost in the name of security. Crowdstrike just proved Torvalds is correct.
@Songfugel
@Songfugel Ай бұрын
@@jedipadawan7023 He is kinda right here, but he is also often wrong and is just a normal person who just managed to get away with mostly plagiarizing (not sure if that is the right word, like Linus I'm a Finn, and not that great with English) Unix into an extended version as Linux
@Acer11818
@Acer11818 Ай бұрын
but you can break it
@666pss
@666pss Ай бұрын
😂😭
@whickervision742
@whickervision742 Ай бұрын
But it's still their fault for pushing it out to everything everywhere all at once.
@brentsaner
@brentsaner Ай бұрын
And they did so ignoring clients' SLA/update management policies, too! Damages as a *result* of breach of contract? Crowdstrike's *done* for.
@marcus141
@marcus141 Ай бұрын
Well in my previous role, I deployed CrowdStrike for a major broadcaster, and one common misconception in all of this is that CrowdStrike can push updates to customer endpoints without their knowledge or consent. It doesn't work like that. Endpoint management is handled centrally by IT admin, and when CrowdStrike release a new Falcon sensor version, after reviewing, we can choose if we want to use the latest version or not. You can of course configure CrowdStrike to auto update the sensors but that would be ludicrous for obvious reasons.
@kellymoses8566
@kellymoses8566 Ай бұрын
@@marcus141 It wasn't a new version, it was just a definition file.
@SahilP2648
@SahilP2648 Ай бұрын
@@kellymoses8566 maybe I am missing something but if the driver file got updated, wouldn't the affected PCs boot into recovery only when shutdown? So, they could still in theory keep running if not shut down?
@maddada
@maddada Ай бұрын
100% agree. They should've updated 5% of users and compared failures to before the update. Not send updates to everyone in one go! Especially for such a huge company writing a critical kernel level software.
@bluegizmo1983
@bluegizmo1983 Ай бұрын
Crowdstrike is DEFINITELY still at fault. You never ever ever push an update out live to millions of computers without extensive testing and staged rollouts, especially when that update involves code that runs at the kernel level!
@vesk4000
@vesk4000 Ай бұрын
Yeah I cannot possibly comprehend how and why this was pushed to everyone so quickly. Also why didn't the clients of crowdstrike say: heyy, do we really have to update everything day 1?
@inderwool
@inderwool Ай бұрын
They're a security company and it becomes a necessity that they roll out security patches to everyone at the same time. A staged rollout means you leave the rest of the customers vulnerable to being compromised.
@vesk4000
@vesk4000 Ай бұрын
@@inderwool I agree if this was some kind of critical security update, but apparently it wasn't.
@ingiford175
@ingiford175 Ай бұрын
Especially code that can execute within the kernel
@batman51
@batman51 Ай бұрын
I am still surprised that everyone apparently just loaded the update. Surely in a big organisation at least, you run it through your test network first. And if you really don't have one, you will know better now.
@stevezelaznik5872
@stevezelaznik5872 Ай бұрын
I still don’t understand how this patch didn’t brick the machines they tested it on, the idea that a company worth $70 billion didn’t catch this in CI or QA is mind blowing
@simoninkin9090
@simoninkin9090 Ай бұрын
They didn’t run it 😅 tested sections of code, but not the integrated product.
@ingiford175
@ingiford175 Ай бұрын
@@simoninkin9090 Or how they did not stage the patch on a small fraction of machines per hour and then pull it back when BSODs happen
@xponen
@xponen Ай бұрын
this company went big because of politics, they are the one who investigated alleged hacking of Democrat email server.
@JeanPierreWhite
@JeanPierreWhite Ай бұрын
@@ingiford175 Right. Rolling out to the world in one fell swoop is really irresponsible. Even given the best QA in the world, mistakes will get by, that's why you stage deployments.
@JamesTSmirk87
@JamesTSmirk87 Ай бұрын
@@stevezelaznik5872 that's just it. They clearly did not do integration testing.
@rickl7604
@rickl7604 Ай бұрын
This is precisely why you actually test the package that is being deployed. If you move release files around, you need to ensure that the checksums of those files match.
@rekko_12
@rekko_12 Ай бұрын
And you don't deploy anything on friday
@rickl7604
@rickl7604 Ай бұрын
@@rekko_12 Amen.
@JamesTSmirk87
@JamesTSmirk87 Ай бұрын
And you don’t deploy to the whole flipping world in one go.
@grzegorzdomagala9929
@grzegorzdomagala9929 Ай бұрын
And md5 checksum all files... They must use some sort of cryptographic signature securing package integrity. It means they don't test "end product" - they probably tested compilation products, then signed the files and sent it to whole world - and somewhere between end test and release one of the files was corrupted. I bet it was something silly - for example not enough disk space :)
@simoninkin9090
@simoninkin9090 Ай бұрын
@@grzegorzdomagala9929exactly my thinking. Just skipped on some critical integration tests - environment mismatch or something of a sort. However I don’t think it got exactly “corrupted”. The only reason the world got into this trouble, was because they have packaged the bug within the artifact.
@mitchbayersdorfer9381
@mitchbayersdorfer9381 Ай бұрын
Saying the root cause was a "null pointer dereference" is like saying the problem with driving into a telephone pole is that "there was a telephone pole in the way." The root cause was sending an update file that was all null bytes. The fact that the operating system executed that file and reported a null pointer dereference as a result is not the fault of the OS, and is not a root cause.
@JamesTSmirk87
@JamesTSmirk87 Ай бұрын
Bingo. And I can’t believe the testing server was apparently the only (apparently single) server in the whole world not affected. I get that we don’t want to make assumptions and point fingers willy nilly, but this one is a bridge way too far.
@paulbarclay4114
@paulbarclay4114 Ай бұрын
the problem is centralized control; that word salad is a tertiary problem
@astronemir
@astronemir Ай бұрын
Well actually because of this shitty OS people missed flights etc. You should never run such critical systems on Windows. Just leave that for your employees' PCs
@zebraforceone
@zebraforceone Ай бұрын
​@astronemir so what alterations would you make on an OS level to avoid this?
@Ryan-xq3kl
@Ryan-xq3kl Ай бұрын
The root cause is ACCEPTING null bytes, just check for them, it's LITERALLY the programmers' fault
@samucabitim
@samucabitim Ай бұрын
giving any software unlimited kernel access is just crazy to me
@mallninja9805
@mallninja9805 Ай бұрын
MSFT: "Should we do something about the kernel, or develop AI screenshot spyware?"
@plaidchuck
@plaidchuck Ай бұрын
Tired of hearing that Y2K was some panic or something that just magically fixed itself or wasn't a big deal. It wasn't a big deal because people spent years beforehand fixing it
@HeeroAvaren
@HeeroAvaren Ай бұрын
Yeah buddy we all watched Office Space.
@kxjx
@kxjx Ай бұрын
​@@HeeroAvarenwell I am old enough to have seen it first hand, I don't remember what office space said but I do remember all the overtime 😅
@successmaker9258
@successmaker9258 Ай бұрын
Welcome to bad reporting by the media, and a general lack of knowledge by the layman of tech
@davidhines7592
@davidhines7592 Ай бұрын
there is another one coming in 2038 when old Unix systems' 4-byte time integer overflows. Jan 19 2038 ought to be interesting if any of those systems haven't been fixed and are doing something critical.
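For the curious, a tiny C++ sketch of what that rollover looks like with a signed 32-bit time value (purely illustrative; systems with a 64-bit time_t are not affected):

```cpp
#include <cstdint>
#include <ctime>
#include <iostream>

int main() {
    // A signed 32-bit time_t counts seconds since 1970-01-01 and tops out at
    // 2^31 - 1 = 2,147,483,647, which lands on 2038-01-19 03:14:07 UTC.
    std::int32_t last = 2147483647;
    std::time_t ok = static_cast<std::time_t>(last);
    std::cout << "last 32-bit second (local time): " << std::ctime(&ok);

    // One tick later a 32-bit counter wraps to a huge negative number,
    // which legacy code reads as a date back in December 1901.
    std::uint32_t next = static_cast<std::uint32_t>(last) + 1;   // 2,147,483,648
    std::int32_t wrapped = static_cast<std::int32_t>(next);      // -2,147,483,648 on two's-complement machines
    std::cout << "after the wrap the raw counter reads: " << wrapped << "\n";
    return 0;
}
```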
@henryvaneyk3769
@henryvaneyk3769 Ай бұрын
I spent many months doing tests and fixing code for Y2K. That nothing happened is testament to the fact that we did our jobs well.
@coltenkrauter
@coltenkrauter Ай бұрын
I have no doubt they will do a thorough investigation as this was such a massive impact with millions and billions of dollars of implications.
@astrocoastalprocessor
@astrocoastalprocessor Ай бұрын
🤔 worldwide? probably trillions 🫣 updated 24h later to add: the peanut gallery is correct, the wikipedia entry makes it more clear that some enterprises and markets were unaffected and some were only affected for a short time 🧐 thanks everyone
@TehIdiotOne
@TehIdiotOne Ай бұрын
@@astrocoastalprocessor Nah, it was big for sure, but i don't think you get quite how much a trillion is.
@jedipadawan7023
@jedipadawan7023 Ай бұрын
I have no doubt Crowdstrike are going to be sued into oblivion. I have been reading the comments from employees reporting how their company's legal departments are being consulted.
@puppy0cam
@puppy0cam Ай бұрын
​@@jedipadawan7023 Just because a legal layperson is trying to find out from a lawyer if there is any legal liability doesn't mean that there actually *is* any legal liability. That doesn't mean people won't try to sue them, and that will be costly fighting them off.
@JamesTSmirk87
@JamesTSmirk87 Ай бұрын
The question is will anyone outside ClownStrike ever hear what actually happened?
@adwaithbinoy5355
@adwaithbinoy5355 Ай бұрын
As the name says, CrowdStrike: every device goes on strike
@suntzu1409
@suntzu1409 Ай бұрын
DoS like a boss
@AmxCsifier
@AmxCsifier Ай бұрын
the name is quite fitting seeing how many people were left stranded in airports
@JohnSmall314
@JohnSmall314 Ай бұрын
Let me guess. Maybe Crowdstrike recently laid off a stack of experienced developers who knew what they were doing, but were expensive, and kept the not so experienced developers who didn't know what they were doing, but were cheaper. Then on top of that because of the reduced head count, but same workload, then under pressure the developers cut corners to rush product out. I'm not saying that is what happened. But I have seen that happen elsewhere, and I'm sure people can come up with loads of examples from their own experiences.
@dmknght8946
@dmknght8946 Ай бұрын
Oh funny enough, there's a topic on reddit (18 hours ago) told this: "In 2023, Crowdstrike laid off a couple hundred people, including engineers, devs, and QA testers…under RTO excuse. Aged like milk." But is there any official (or at least trusted) sources?
@smoocher
@smoocher Ай бұрын
Sounds like a lot of companies
@Palexite
@Palexite Ай бұрын
It's true, but I think it's vice versa. They're keeping the "experienced" programmers while throwing away rookies. At least that's the trend we see with Google and Microsoft. They want to pay less for employment as a whole, and the only way to do that without tearing the whole team apart is kicking people out.
@ABa-os6wm
@ABa-os6wm Ай бұрын
Not at all. They skipped the first test and went directly to the cheap inexperienced suckers.
@davidjulitz7446
@davidjulitz7446 Ай бұрын
Not likely. The underlying issue was obviously introduced a long time ago but never caught. So far only valid "param" files were pushed and parsed by the driver. The error itself is likely easy to fix if you accept it also has to be able to parse invalid files without crashing.
@Bregylais
@Bregylais Ай бұрын
Thank you for your insights. Man, I hope CrowdStrike does a thorough post-mortem for this one. That's the least they're owing the IT professionals at this point.
@mirrikybird
@mirrikybird Ай бұрын
I hope a third party does an investigation
@bart2019
@bart2019 Ай бұрын
It did not break "the internet". It broke a lot of companies' office computers, but those are not on the internet. In fact, the internet chugged along just fine.
@w1l1
@w1l1 Ай бұрын
most of the backlash from developers -> ego devs who write like 2 lines of crap code a day but are (for whatever reason) extremely vocal
@juandesalgado
@juandesalgado Ай бұрын
Narcissism is so prevalent in this profession. As with surgeons, violinists and physicists ;)
@Dead_Goat
@Dead_Goat Ай бұрын
ive been bitching about crowdstrike for a long time.
@FritzTheCat_1030
@FritzTheCat_1030 Ай бұрын
@@juandesalgado Speaking as a retired violinist who now works as a software dev, I feel like physicist might be the next career I should look into!
@CptMartelo
@CptMartelo Ай бұрын
@@FritzTheCat_1030 As a dev that started as a physicist, what are your tips to learn violin?
@juandesalgado
@juandesalgado Ай бұрын
@@FritzTheCat_1030 lol - I hope you keep playing at home, though!
@originalbadboy32
@originalbadboy32 Ай бұрын
First rule of patch management is you don't install patches as soon as they are available. If I know that then why some of these massive companies don't is beyond me. It seems that IT management has forgotten the fundamentals. Also technically it can be done remotely if it's a virtual machine or remote management is enabled.
@rehmanarshad1848
@rehmanarshad1848 Ай бұрын
The problem is that these patches are automated OTA (Over the Air) patches. Which was marketed to businesses as there would be less administrative work in installing patches, since these patches come directly from CrowdStrike the trusted vendor. Thus, they wouldn't need to hire as many qualified IT people for cybersecurity tools patch management. It was like a SaaS service handled by the vendor that they didn't need to worry about. Little did anyone realize that there was no proper isolated testing done before pushing this out to production globally.
@rehmanarshad1848
@rehmanarshad1848 Ай бұрын
The lack of testing and slow gradual rollout + Windows OS architectural design flaws Combined, it created a single point of failure. Didn't help that AzureAD was also down as well so anyone trying to login via Active Directory to remediate the issue and get Bitlocker keys were also screwed. 😅
@ChristianWagner888
@ChristianWagner888 Ай бұрын
The update was more like a virus definition data file. The actual scanning engine driver file was not updated. These types of updates are apparently pushed multiple times a day as new “threats” are encountered. It is astonishing to me, that the Falcon driver cannot handle or prevent garbage data being loaded into it. Also it’s the poor architecture of Windows that driver crashes bring down the OS. Additionally possibly a bad architectural decision by CS to embed their software so deeply into Windows that the OS will crash if the Falcon driver misbehaves.
@Lazy2332
@Lazy2332 Ай бұрын
yeah, for the physical machines, I hope they have vPro set up; if they don't, I bet they're really wishing they had done it sooner. Lol.
@allangibson8494
@allangibson8494 Ай бұрын
@@Lazy2332Virtual machines proved even more unrecoverable than physical machines - you need a physical keyboard connected to enter safe mode (assuming you actually have the bitlocker keys).
@Zulonix
@Zulonix Ай бұрын
As a developer, I always checked if a pointer was null before dereferencing it.
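For what it's worth, that habit is a one-line guard. A minimal C++ sketch (the struct and field names here are made up purely for illustration, not CrowdStrike's actual format):

```cpp
#include <cstdio>

struct ChannelRecord {      // hypothetical structure for the example
    const char* name;
    int action;
};

// Returns false instead of crashing when the caller hands us nothing.
bool applyRecord(const ChannelRecord* rec) {
    if (rec == nullptr) {           // guard before any dereference
        std::fprintf(stderr, "applyRecord: null record, skipping\n");
        return false;
    }
    std::printf("applying %s (action %d)\n", rec->name, rec->action);
    return true;
}

int main() {
    ChannelRecord good{"example-rule", 1};
    applyRecord(&good);     // normal path
    applyRecord(nullptr);   // handled gracefully instead of faulting
    return 0;
}
```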
@isomeme
@isomeme Ай бұрын
At the most fundamental level, it is obvious that CrowdStrike never tested the actual deployment package. Things can go wrong at any stage in the build pipeline, so you ALWAYS test the actual deployment package before deploying it. This is kindergarten-level software deployment management. No sane and vaguely competent engineer would voluntarily omit this step. No sane and vaguely competent manager would order engineers to omit this step. Yet the step was definitely omitted. I hope we get an honest explanation of how and why this happened. Of course, then you get into the question of why they didn't do incremental deployments, which are another ultra-basic deployment best practice. I am beginning to form a mental image of the engineering culture at CrowdStrike, and it's not pretty.
@coltenkrauter
@coltenkrauter Ай бұрын
I have not written operating system code, but generally code is supposed to validate data before operating on it. In my opinion, developers are very likely the cause. Even if there is bad data, the developers should write code that can handle it gracefully. Also, this video asserted that this kind of issue could slip by the test servers. That sounds ridiculous to me. The test servers should fully simulate real world scenarios when dealing with this kind of security software. They should run driver updates against multiple versions of Windows with simulated realistic data. But, I would be surprised if a single developer was at fault. Because there should be many other developers reviewing all of the code. I would expect an entire developer team to be at fault. It'll be interesting to learn more.
@TehIdiotOne
@TehIdiotOne Ай бұрын
I'm just astonished that this got past testing, AND was deployed to everyone at same time. Just screams of flaws in the entire deployment process at crowdstrike.
@famboettinger2041
@famboettinger2041 Ай бұрын
... and also the management for not giving enough resources for testing. It's always features, features, features!
@Ic3q4
@Ic3q4 Ай бұрын
Bc this clip is not worth it dont waste your brain
@kkgt6591
@kkgt6591 Ай бұрын
AI usage is going to make such occurrences common in coming decade.
@dondekeeper2943
@dondekeeper2943 Ай бұрын
The internet was not broken. Not sure why people kept saying it was
@robertbutsch1802
@robertbutsch1802 Ай бұрын
Right. If it was a network-type problem, the IT folks could have just applied a fix across their network from the comfort of their cubicles and then gone home at 5:00 on Friday. Instead some of them had to run around to individual machines and boot them to safe mode while others had to try to remember where the bitlocker keys were last seen.
@oswin4715
@oswin4715 Ай бұрын
Ye just a clickbait title, but I guess everyone is just surprised by the scale of this. Most people couldn't do their job and CrowdStrike probably caused billions in financial losses to these companies, airlines etc
@IKEARiot
@IKEARiot Ай бұрын
"tHe SiTuaTiOn tHaT bRoKe tHe ENtiRe InTernEt" Instant downvote.
@FullFrontalNerdity-e3z
@FullFrontalNerdity-e3z Ай бұрын
What this shows me is that it's a bloody miracle that any computer works at all.
@nathanwhite704
@nathanwhite704 Ай бұрын
Any *Windows computer. The Linux systems that run crowd strike weren’t affected :).
@ItsCOMMANDer_
@ItsCOMMANDer_ Ай бұрын
​@@nathanwhite704they were in april ;)
@glitchy_weasel
@glitchy_weasel Ай бұрын
A guy on Twitter theorized that maybe it was some sort of incomplete write, like when the filesystem records space for a file but stops before copying any data, leaving a hole of just zeroes. If something like that happened on the distribution server or whatever it's called and didn't manifest during testing, well, kaboom!
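If that theory were true, even a crude "is this file just zeroes?" guard before parsing would have refused the input. A rough C++ sketch, illustrative only (a real agent would verify a signature or checksum rather than this check alone):

```cpp
#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>

// Load a file and reject it if it is empty or contains nothing but zero bytes.
bool loadChannelFile(const std::string& path, std::vector<char>& out) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;

    out.assign(std::istreambuf_iterator<char>(in), std::istreambuf_iterator<char>());
    if (out.empty()) return false;

    bool allZero = std::all_of(out.begin(), out.end(),
                               [](char c) { return c == 0; });
    if (allZero) {
        std::cerr << path << ": file is all zero bytes, refusing to parse\n";
        return false;
    }
    return true;
}
```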
@TheUnkow
@TheUnkow Ай бұрын
Would be even "funnier" if it turns out a bug in the file system.
@sirseven3
@sirseven3 Ай бұрын
​@@TheUnkowI'm still suspect of windows. I work enterprise IT at a big defense contractor. I see drivers fail ALOT in windows and most of my job now is just updating drivers. I see memory management, inaccessible boot device, nvpcl.sys crashes, all related to drivers that get rolled back/corrupted from windows updates. I'm just not good enough yet to find it and expose it.
@TheUnkow
@TheUnkow Ай бұрын
@@sirseven3 As a developer myself I know sometimes it is the most weirdest bugs that cause the issue ... just a bit of an incorrect offset and any file or code may become totally useless ... sometimes even hazardous. I haven't been using Windows for a while because of being unable to determine what causes some the issues, I know that using an alternative is not always an option ... but debugging closed source is a really challenging process. Just because we get a set of API's or other functionalities from Microsoft to use ... no one guarantees they are bugfree or security/privacy/memory leaks free. Even if they were 100% ok, on the next update (such as in this case), an issue may be introduced and we will have trouble again. Note that Linux and any other software isn't fully clear of these issues as well, for example just recently they had the RegreSSHion bug, which was also a bug introduced in the update which enabled most serious security vulnerabilities. Still I would say the transparency of open source would make such issues easier to overcome and harder to introduce. Easier life with closed source has it's downs not just ups, we must take precations against that, glad to hear some people like yourself are serious about it.
@bob_kazamakis
@bob_kazamakis Ай бұрын
Doesn’t macOS fail gracefully when a kext misbehaves? If so, you can still technically blame Windows for not handling that situation well
@samyvilar
@samyvilar Ай бұрын
I don’t know about later iterations but from my experience in Big Sur and earlier iterations kexts’ can still cause kernel panics, at least when an invoking an instruction that raised an uncaught/unhandled CPU exception, in my case I was trying to access a non-existing MSR register on my system. The thing is whether it’s kexts/drivers/modules on macOS/windows/linux doesn’t really matter, cause at that point your in ring 0, the code has as much privilege as the kernel, the only safeguards at this level are rudimentary CPU exception handling hence why kernel panics and BSOD always seemed so CRUDE with just a few lines of text, since at this point everything has halted and and the CPU has unconditionally jumped to a single procedure and nothing else seems to be happening …
@k.vn.k
@k.vn.k Ай бұрын
@@samyvilar CrowdStrike does not have kernel level permissions on new Macs, because Apple has been pushing people to move away from kernel extensions, so CrowdStrike runs as a system extension instead which is run outside of kernel. The system files on Mac are mounted as read-only in a separate partition and you need to manually turn SIP off and reboot in order to be able to even write/modify them. Good API designs encourages your developers to adopt more secure practices. CrowdStrike isn't intentionally malicious here, but lax security design in Windows stemming from good old Win32 days allowed such failure to happen.
@JeanPierreWhite
@JeanPierreWhite Ай бұрын
I doubt it because MacOS is like Windows, it does in place upgrades to software. Some versions of Linux and ChromeOS employ blue/green or atomic updates that allow for automated rollbacks if a boot failure occurs.
@samyvilar
@samyvilar Ай бұрын
@@k.vn.k I was under the impression crowdstrike was windows only, for as long as I can remember Enterprise seemed to shy away from macOS, given Apples exorbitant price on its REQUIRED hardware. macOS Darwin kernel is significantly different from windows and Linux for that matter, crowdstrike may or may not need kernel level privileges, for feature parity across the 2 platforms, but make no mistake anything requiring ring 0 does!
@JamesTSmirk87
@JamesTSmirk87 Ай бұрын
So, it wouldn’t show up on the testing server, but it would show up on millions of servers all over the rest of the world? I can’t say that makes sense to me.
@BittermanAndy
@BittermanAndy Ай бұрын
Yeah, this is.......... very dubious.
@grokitall
@grokitall Ай бұрын
we have known how to write software so this does not happen since the moon landings. we have even better tools now. the only way for this to happen is for everyone including microsoft to ignore those lessons.
@cyberbiosecurity
@cyberbiosecurity Ай бұрын
This can not make any sense unless you are delusional. So you're good. The man stated absolute nonsense, like he has no idea what he's talking about.
@grokitall
@grokitall Ай бұрын
@@JamesTSmirk87 of couse if you canary release to the test servers first, then to your own machines, and only then to the rest of the world, it would have been caught.
@MASTERPPA
@MASTERPPA Ай бұрын
It's called deploying to 1% of customers at a time... Maybe starting on a Monday at 6PM.
@JeanPierreWhite
@JeanPierreWhite Ай бұрын
Bingo.
@bokunochannel84207
@bokunochannel84207 Ай бұрын
the thumbnail says "not the devs fault". wrong, totally the devs fault. previous update broke future update, classic.
@soko45
@soko45 Ай бұрын
Pdf weeb
@pen_lord8520
@pen_lord8520 Ай бұрын
@@soko45You have a right to cry about a drawn picture on the internet.
@gearoidoconnell5729
@gearoidoconnell5729 Ай бұрын
I agree, it's test, test and more test. They should roll the test out on some PCs, not all, to check it fully works. You'd think with airlines that part would get the most testing. The code didn't change, just the OS it runs on; clearly it wasn't tested for that OS.
@whickervision742
@whickervision742 Ай бұрын
As I understand it, AssClownStrike has a "secure" conduit to copy anything to the system32 folder. Windows happily runs any driver file there during startup. (Reworded for the pedantic.) Windows is designed to bugcheck (bluescreen) on any driver problem. Always has. Having the ability to send trash to over a billion computers' system32 folders with one command is the real problem.
@robertbutsch1802
@robertbutsch1802 Ай бұрын
Yeah, I’ve heard Windows channel files mentioned. Sounds like a similar process to what MS uses to distribute new Defender signature files.
@allangibson8494
@allangibson8494 Ай бұрын
It wasn’t a driver fault - it was a bad file configuration the driver downloaded automatically.
@sirseven3
@sirseven3 Ай бұрын
​@@allangibson8494but couldn't it have been a bad pull from the WSUS to the clients? The checksum wasn't verified on the client side, but verified before distribution
@reapimuhs
@reapimuhs Ай бұрын
@@allangibson8494 that still sounds like the driver's fault for not gracefully handling that bad file configuration
@allangibson8494
@allangibson8494 Ай бұрын
@@reapimuhs Yes. The file error checking seems sadly deficient. A null file check at least would seem to have been warranted.
@francismcguire6884
@francismcguire6884 Ай бұрын
Thanks for the detailed explanation of why I am spending the first 4 days of my vacation at the airport. Honestly.
@baboon_baboon_baboon
@baboon_baboon_baboon Ай бұрын
He hardly said anything
@itsstudytimemydudes4345
@itsstudytimemydudes4345 Ай бұрын
I am so sorry
@ZY-cr7yg
@ZY-cr7yg Ай бұрын
If the disk corruption occurred before checksum, how come it’s not caught in the CI pipelines. If the corruption happened after the CI pipelines, why don’t they check the checksum before distributing it
@grokitall
@grokitall Ай бұрын
but that assumes that they have a decent testing and deployment strategy, despite all the evidence to the contrary. to paraphrase terry pratchett, in the book raising steam, you can engineer around stupid, but nothing stops bloody stupid! 😊
@raylopez99
@raylopez99 Ай бұрын
Is this man speaking into a cactus in a vacation setting? He's crazy. Subscribed!
@microdesigns2000
@microdesigns2000 Ай бұрын
Pointers from a file, that is nuts. 😂
@bf-696
@bf-696 Ай бұрын
I guess that doing a validation test on a limited scale before pushing to the world just never occurred to anyone at CrowdStrike or MS?
@diogotrindade444
@diogotrindade444 Ай бұрын
All parties need to fix this broken system:
- Security companies cannot ever force-push without testing.
- OSes (especially MS) need to improve all aspects of this scenario with lots of new, well documented, automated testing/check tools for multiple steps in the process.
- Essential companies cannot blindly trust updates without basic checks, and MS should not be the only OS running if you want to make sure that you're online all the time.
We need better software built for failure, especially for essential companies that cannot stop. If companies do not fix this on all levels it can open a new door for failure.
@Murph9000
@Murph9000 Ай бұрын
There is something you can do better for the case of a bug after testing, when you're going to push an update to a massive population of systems. Unless it's an emergency update that needs to be pushed NOW, you do a phased push. Push to 1% of the systems, and wait an hour (or longer); then push to 5%, and wait 5 hours; then push to 10%, and wait 12 hours; finally push it out globally. While you are waiting each time, you monitor support activity closely and/or look for any abnormal telemetry such as high rates of systems reporting errors, going offline, etc. You can also split the application between kernel and user space, so that you have a minimal footprint in kernel space and do the more complicated work in user space. In that model, the kernel code can be hardened and shouldn't change on a regular basis; and the high frequency updates are then to the user space code, which is much less likely to take out the entire system due to bad data.
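A rough sketch of how that kind of percentage gate can be made deterministic per machine, so the same hosts stay in the early ring as each wave widens. Everything here (machine IDs, the use of std::hash) is illustrative; a production system would use a stable, documented hash:

```cpp
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>

// Map a stable machine identifier to a bucket in [0, 100).
// The same machine always lands in the same bucket within a run, so raising
// the rollout percentage only ever adds hosts, never reshuffles them.
int rolloutBucket(const std::string& machineId) {
    std::size_t h = std::hash<std::string>{}(machineId);   // fine for a sketch; not stable across builds
    return static_cast<int>(h % 100);
}

bool shouldReceiveUpdate(const std::string& machineId, int rolloutPercent) {
    return rolloutBucket(machineId) < rolloutPercent;
}

int main() {
    // Wave 1: 1% of hosts, then widen to 5%, 10%, 100% as telemetry stays healthy.
    for (int percent : {1, 5, 10, 100}) {
        std::cout << "host-ab12 at " << percent << "%: "
                  << shouldReceiveUpdate("host-ab12", percent) << "\n";
    }
    return 0;
}
```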
@JSRTales
@JSRTales Ай бұрын
probably they might have laid off the qa who could have caught it
@JeanPierreWhite
@JeanPierreWhite Ай бұрын
And the deployment team who would deploy to say 1% of their customers first to be double dog sure.
@henryvaneyk3769
@henryvaneyk3769 Ай бұрын
Dude, it is still the driver developer's fault. What happened to using MD5 or SHA checksums to validate the contents of a critical file? If the driver did the one simple step to do checksum validation, it would have noticed that the contents of the data file is not valid, and could have refrained from loading the file and could then have issued an alert instead of BSODing. It would be a very simple step to also add the checksum and do validation during the CI/CD pipeline and the installation process.
@chronixchaos7081
@chronixchaos7081 Ай бұрын
The first use of ‘Gnarly Event’ to describe a world wide catastrophe. Well done.
@manojramesh4598
@manojramesh4598 Ай бұрын
Crowdstrike really had the crowd strike!!!!
@ExpensivePizza
@ExpensivePizza Ай бұрын
As a software developer with over 30 years of experience I must say... you couldn't be more wrong. There's no way in the world this wasn't a developers fault. Software developers are responsible for testing the actual thing they're going to ship against the thing they're going to ship it on. If they don't do that, it's on them.
@JeanPierreWhite
@JeanPierreWhite Ай бұрын
The devs did no doubt create the problem and wrote code that is prone to failure. The devs must take some blame for sure. The dudes in charge of deploying the code around the world are also to blame. Why on earth would you not deploy this to a percentage of your clients first until it is proven to be reliable? Deploying to everyone at the same time is not a devs fault. It is still with Crowdstrike tho, very irresponsible of them. I knew something bad would happen like this after I retired lol.
@mallninja9805
@mallninja9805 Ай бұрын
@@JeanPierreWhite Surely there's enough failure here to spread some blame around. Devs should check for & handle null pointers. Test suites should find bad channel files. Engineering department should properly fund & staff. Fortune-100 companies should be wary of all deploying the exact same EDR solution. etc etc.
@AndrewTa530
@AndrewTa530 Ай бұрын
Friends don't let friends write C++
@juanmacias5922
@juanmacias5922 Ай бұрын
3:20 don't get me wrong, it could have even been flipped by a solar flare, but saying something happened after CI/CD and testing still sounds like it should have been implemented better haha Edit: WTF I had never heard of the 2038 bug, makes sense tho, I always found Unix time to be limiting.
@sadhappy8860
@sadhappy8860 Ай бұрын
That's just an extra little thing for us all to worry about for 14 years! Good night, lol.
@astrocoastalprocessor
@astrocoastalprocessor Ай бұрын
@@sadhappy8860😱
@eltreum1
@eltreum1 Ай бұрын
Enterprise gear uses hardware ECC RAM with a separate parity chip for error checking and correction to prevent that. Even if it flipped the same bit in 2 chips at the same time perfectly, the file created would have failed a CRC integrity check and the build should have failed in the pipeline. A failing disk or storage controller in a busy data center is not going to pick on 1 file and would be eating enough data to set off alarms. This was either the perfect storm of multiple human errors, or sabotage.
@SimonBlandford
@SimonBlandford Ай бұрын
The idea that security somehow involves installing a remotely controlled agent that can potentially go full "Smith" on critical servers is the problem.
@joseoncrack
@joseoncrack Ай бұрын
Yes.
@krunkle5136
@krunkle5136 Ай бұрын
Fr. Security should never require critical code that can CRASH the kernel to be continuously deployed so easily.
@kxjx
@kxjx Ай бұрын
AV software is clearly very risky. The industry for some reason seems obsessed with it. Customers keep asking for it on their servers, I keep saying no we have other ways to handle this hazard. Why oh why are you letting a vendor push kernel changes to all your domain controllers at the same time? Why are you in a position where you feel you need this kind of software on critical servers? Are you letting your administrator browse the Web from your file server?
@successmaker9258
@successmaker9258 Ай бұрын
AVs are risky, users are riskier. No, riskiest.
@reapimuhs
@reapimuhs Ай бұрын
@@kxjx then what happens if a bad actor manages to exploit your server in some way to get their malware onto its system and running? without an AV on the server to help catch it then surely it would be capable of running loose for far longer than it would have if an AV was present would it not? what other and better ways do you have to "handle this hazard" without something present on the server to try and identify and deal with it?
@prcvl
@prcvl Ай бұрын
Why should the Null bytes have to do anything with the file? If you deref a nullptr, you crash in cpp
@burtonrodman
@burtonrodman Ай бұрын
good explanation. one additional note that on Windows at least, a null ptr deref is basically a special case of an access violation… the first page of the process is marked as unreadable and any access attempt (like the 9c in this case) causes an access violation and any access violation in the first page is assumed to be a null ptr deref. i’m really surprised people aren’t talking more about why this went out to millions of computers all at once. why aren’t they doing a phased roll out? i bet they will now 😂
@reapimuhs
@reapimuhs Ай бұрын
other comments seem to suggest this wasn't an actual update but rather a faulty definition file that was downloaded, the real problem is why they were not validating the integrity of these files and gracefully handling corrupted ones.
@burtonrodman
@burtonrodman Ай бұрын
@@reapimuhs that makes sense, but regardless of what they call it,imho even config files are a part of the software and require testing and roll out procedures just as if code had been updated.
@SkinnyCow.
@SkinnyCow. Ай бұрын
Microsoft Windows is one patch on top of another patch. There's a reason why Linux and Apple software is preferred by developers.
@evacody1249
@evacody1249 Ай бұрын
🙄
@evacody1249
@evacody1249 Ай бұрын
It's also the reason they have fewer apps and programs they can run, because it would mean a whole rewrite of the majority of apps that have no issues. Also workarounds such as Wine are not the answer.
@ItsCOMMANDer_
@ItsCOMMANDer_ Ай бұрын
Yeah, because unix is a monopoly, you simply cant get good alternatives on windows (jk)
@harisaran1752
@harisaran1752 Ай бұрын
Always roll out slowly. Why the hurry? Roll out to 1%, if it's okay then 10%, and so on
@kellymoses8566
@kellymoses8566 Ай бұрын
Even a simple 32 bit CRC would have detected that the file was corrupt. So incompetent.
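For what it's worth, a CRC-32 check really is only a few lines. A minimal sketch in C++ using the classic reflected 0xEDB88320 polynomial (the "channel file" framing is just for illustration):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Plain bitwise CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320).
std::uint32_t crc32(const unsigned char* data, std::size_t len) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
    }
    return ~crc;
}

int main() {
    const char msg[] = "123456789";
    // The well-known CRC-32 check value for "123456789" is 0xCBF43926.
    std::printf("%08X\n", static_cast<unsigned>(
        crc32(reinterpret_cast<const unsigned char*>(msg), 9)));

    // An all-zero "channel file" still hashes to something, but it won't match
    // the checksum shipped alongside the real file, so it gets rejected.
    unsigned char zeros[64] = {0};
    std::printf("%08X\n", static_cast<unsigned>(crc32(zeros, sizeof zeros)));
    return 0;
}
```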
@mikef8846
@mikef8846 Ай бұрын
Headline should be "level 1 techs save the world."
@jmarti1997jm
@jmarti1997jm Ай бұрын
Crazy how I can understand how one line of assembly code caused everything to just die
@SaHaRaSquad
@SaHaRaSquad Ай бұрын
Wouldn't it be funny if Crowdstrike used their own security product (I know, right?) and had bricked their own computers as well?
@MeriaDuck
@MeriaDuck Ай бұрын
When parsing input data, especially from a kernel driver, one needs to be VERY defensive. Validation should happen and the validation stage should not be able to crash on any input, especially empty or all zero files.
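A sketch of what "validate before you trust it" can look like for a binary blob, assuming a hypothetical header layout (magic, version, record count); none of these names or constants reflect the real channel file format:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical on-disk header; the real channel file layout is not public.
struct FileHeader {
    std::uint32_t magic;
    std::uint32_t version;
    std::uint32_t recordCount;
    std::uint32_t recordSize;
};

constexpr std::uint32_t kExpectedMagic = 0x43463239u;   // "CF29", made up for the example

// Returns true only if the buffer is big enough and internally consistent.
// Anything suspicious is rejected instead of being dereferenced blindly.
bool validate(const std::uint8_t* buf, std::size_t len) {
    if (buf == nullptr || len < sizeof(FileHeader)) return false;

    FileHeader hdr;
    std::memcpy(&hdr, buf, sizeof hdr);                  // no raw/unaligned pointer casts

    if (hdr.magic != kExpectedMagic) return false;       // catches an all-zero file immediately
    if (hdr.recordSize == 0 || hdr.recordCount == 0) return false;

    // Bounds check with overflow in mind: all records must fit inside the buffer.
    std::uint64_t needed = sizeof(FileHeader) +
        static_cast<std::uint64_t>(hdr.recordCount) * hdr.recordSize;
    return needed <= len;
}
```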
@InstaKane
@InstaKane Ай бұрын
No I disagree, they should be testing the updates on Windows, Mac etc by rolling out the updates to the machines, performing restarts on those machines and then running a full Falcon scan to check that the application behaves as expected. Also an engineer can check if a value is null before dereferencing it, so I think this is an engineering/testing issue by CS for sure. But hey, live and learn, I guess their testing processes will be updated to catch these types of bugs going forward.
@alxk3995
@alxk3995 Ай бұрын
A security company of that scale should have their testing and update pipeline figured out. Learning basics at that size is just unacceptable.
@diogotrindade444
@diogotrindade444 Ай бұрын
OSs like openSUSE, Fedora Silverblue, macOS, and Chrome OS use automatic rollback mechanisms to revert to a stable state if an update or configuration change causes a system failure, preventing widespread issues.
@baruchben-david4196
@baruchben-david4196 Ай бұрын
Even I know yer not spozed to dereference a null pointer... How could it not be the devs?
@me99771
@me99771 Ай бұрын
But doesn't this mean that whatever's reading this file isn't checking if the file has valid data? So there is a bug in the code and it was just sitting there for who knows how long? Have they not tested how their code handles files filled with zeros or other invalid data?
@tbavister
@tbavister Ай бұрын
Test your final deliverable *as a customer*
@JeanPierreWhite
@JeanPierreWhite Ай бұрын
Clearly CrowdStrike bypassed all the change management that corporations employ. There is no way that all corporations worldwide decided not to do change management last Thursday. CrowdStrike updated customers' production systems, bypassing all change management. Bad, bad, bad. They will be sued out of existence.
@mrgunman200
@mrgunman200 Ай бұрын
The scale at which it happened is 100% their fault and was preventable, period.
@krunkle5136
@krunkle5136 Ай бұрын
Fr. The buck must stop somewhere. I get protecting people from a mob but...
@mrgunman200
@mrgunman200 Ай бұрын
@@krunkle5136 A single dev isn't to blame, the whole company is. Nonetheless, there can always be situations like this no matter what, which is why it's shocking they don't have rolling updates to minimize such damage. If for example they had only rolled it out to places like McDonald's kiosks first, then slowly to more critical clients, they would have known about the issue before it became a huge cluster fuck
@JeanPierreWhite
@JeanPierreWhite Ай бұрын
Crowdstrike will be sued out of existence due to this. Microsoft also has some blame, only their OS was affected by a bad Crowdstrike release. Their system is too fragile and has no automated recovery or rollback.
@mrgunman200
@mrgunman200 Ай бұрын
@@JeanPierreWhite Hopefully that will bring some good to Windows for a change of pace
@grokitall
@grokitall Ай бұрын
​@@JeanPierreWhitethe issue is not that they had no recovery mechanism in place, it is that go into safe mode and fix it yourself does not work with locked down machines.
@drygordspellweaver8761
@drygordspellweaver8761 Ай бұрын
A simple zero initialization would have prevented this. ZII (zero is initialization)
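For context, a small C++ illustration of the zero-initialization habit (aggregate value-initialization so pointers start as nullptr and get checked rather than guessed); the struct and values are hypothetical:

```cpp
#include <cstdio>

struct ParsedConfig {
    const char* rulePath;   // nullptr until a parser fills it in
    int threshold;
};

void apply(const ParsedConfig& cfg) {
    if (cfg.rulePath == nullptr) {          // a zeroed field reads as "not set"
        std::puts("config incomplete, using safe defaults");
        return;
    }
    std::printf("loading rules from %s (threshold %d)\n", cfg.rulePath, cfg.threshold);
}

int main() {
    ParsedConfig cfg{};     // value-initialization: all members start at zero/null
    apply(cfg);             // the parser never ran, and that is detected instead of crashing
    cfg = ParsedConfig{"/etc/rules.bin", 3};
    apply(cfg);
    return 0;
}
```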
@tahliamobile
@tahliamobile Ай бұрын
Even with a basic tech understanding there seem to be way too many obvious errors involved in this incident. ZIL, staggered rollout, local testing... This is not how a professional software security company operates.
@pauljoseph3081
@pauljoseph3081 Ай бұрын
Just imagine the amount of Jira tickets and story points within CrowdStrike right now... Non-dev folks can leverage this and micromanage all devs moving forward lol
@josb
@josb Ай бұрын
Maybe, but you don't push an update on a kernel driver to all your clients at the same time. Kernel drivers is a serious business you don't want to mess with it
@TheUnkow
@TheUnkow Ай бұрын
You have testing environments for that. If you don't push it to all at the same time you could be sued for giving priority to some customers over others (i.e. discriminating, or downright stealing), as some security updates may be essential. A bug of this caliber simply should not have been allowed to go live; it was a most basic and serious mistake for a security company.
@JeanPierreWhite
@JeanPierreWhite Ай бұрын
@@TheUnkow BS. Most large corporations have change management in place to prevent production software from being updated without going through all the necessary quality steps. CrowdStrike updated clients' systems without their knowledge or permission. In addition, some customers are OK with being a beta customer; those are the ones you target first, then the majority, and finally the customers who say they want to be one release behind. Deploying to everyone at the same time is clearly irresponsible and highly dangerous, as evidenced by this disaster. Making an update available for download by everyone is fine, but pushing said release to everyone at the same time is irresponsible and will be the basis for a slew of lawsuits against CrowdStrike.
@TheUnkow
@TheUnkow Ай бұрын
@@JeanPierreWhite If someone screws up, they can always be sued. If the option of having beta testers is included in the contract, then that is a kind of software model, those rules were agreed upon and that is fine. BUT as a big business I would not want my software provider just deciding that my competitors get some security updates before me, just because they had some kind of additional arrangement or the software provider deemed someone should get them first (even if it was round robin). And yes, almost everyone tries to skip steps until a thing like this happens. Skipping saves capital, and because the first priority of companies is capital, not the lives of humans in hospitals who depend on their software, things like this will continue to happen, but they are not right just because they are being done by many of the corporations.
@awkerper
@awkerper Ай бұрын
It was the Fisher Price dev tools they used!
@HenkvanHoek
@HenkvanHoek Ай бұрын
That is why I like PiKVM on my servers. Although I don't use Crowdstrike or even Windows. But it can happen on Linux or Apple as well.
@brentsaner
@brentsaner Ай бұрын
You kind of downplay the actual largest concurrent global IT outage in history, dude.
@krunkle5136
@krunkle5136 Ай бұрын
Just tech people typically downplaying issues and avoiding accountability.
@TheDa6781
@TheDa6781 Ай бұрын
This guy is on drugs. All they had to do is test it on one PC.
@baboon_baboon_baboon
@baboon_baboon_baboon Ай бұрын
He has 0 clue what he's saying. You can do plenty of null pointer dereference checks, similar to typical null checks
@henningjust
@henningjust Ай бұрын
Respect for mentioning the 2038 problem. Gosh, I hope the code I wrote in 1999 isn’t still running by then😂
@Gengh13
@Gengh13 Ай бұрын
Of course I have never forgotten to check the pointer, never happened in the past, totally impossible👀.
@gaiustacitus4242
@gaiustacitus4242 Ай бұрын
Crowdstrike was anticipating a major worldwide attack by a never before heard of hacks so the development team decided to put Windows into "Super Safe Mode".
@figloalds
@figloalds Ай бұрын
The billion dollar mistake just went nuclear
@stingrae789
@stingrae789 Ай бұрын
Still the dev team's fault. Testing wasn't sufficient and environments weren't close to production. Also technically they could have also done a canary rollout which would have meant only a few servers were affected.
@williamforsyth6667
@williamforsyth6667 Ай бұрын
There is no in-person repair for most of the cases. Servers in data centers usually do not use disks directly but go through some storage network technology. They can access the file systems of the affected machines remotely.
@mariusj.2192
@mariusj.2192 Ай бұрын
You COULD prevent that by signing before testing, so the signature guarantees what you're shipping is what you've tested.
@MALITH666
@MALITH666 Ай бұрын
I won't point fingers at devs. I'd point at releasing production changes on a working day, especially with CS being an agent that sits at the kernel level of the OS. This means zero testing was done.
@RC-1290
@RC-1290 Ай бұрын
Checksums are a thing, right?
@JeremySmithBryan
@JeremySmithBryan Ай бұрын
I am amazed by the fact that it is 2024 and we're still writing software ( especially operating systems and drivers ) in non-memory safe languages and without formal verification. Crazy. That's what we should be using AI for IMHO.
@jamesbutler5570
@jamesbutler5570 Ай бұрын
In c++ you have to check if data is valid!
@jrkorman
@jrkorman Ай бұрын
IF? if it had only happened to certain computers, with certain versions of OS, then I'd maybe believe that testing might not have caught it. But with this many computers, all at the same time, CrowdStrike's pre-delivery testing on a deployment box should have broken also! So, deployment testing was not done properly! If at all?
@pilauopala843
@pilauopala843 Ай бұрын
It’s the developers fault. They didn’t check that the struct pointer was NULL before referencing it.
@kamilkaya5367
@kamilkaya5367 Ай бұрын
That's why there are two phases in product stages: one for the development phase, one for the master or production phase. Even if there is an error after all the merge requests come together and introduce a bug, you would catch it during the development phase and fix it. You wouldn't release it right away. This fault is not justifiable.
@RoterFruchtZwerg
@RoterFruchtZwerg Ай бұрын
Well if your CI/CD and testing allow for file corruption afterwards then they are just set-up wrong. The update files should be signed and have checksums and you should perform your tests on the packaged update. Any corruption afterwards would result in the update simply not being applied. The fact that the update rolled out shows they either package and sign after testing (which is bad) or don't test properly at all (which is even worse and probably the case).
@semibiotic
@semibiotic Ай бұрын
NULL dereference is always a dev's fault, because it is a lack of simple error handling.
@renat1786
@renat1786 Ай бұрын
Here must be a meme with Bret Hitman Hart pointing at null pointer dereferencing in Wrestlemania game code
@LCTesla
@LCTesla Ай бұрын
A pointer's validity (i.e. not being a null reference) can always be checked, so the dev that wrote that does hold some accountability. But what's more important is that code is run in small execution blocks that never take down the whole system when an exception of any kind occurs.
@grokitall
@grokitall Ай бұрын
rubbish. this is kernel level code, and the wrong type of bug will crash the kernel on any operating system. the problem is it should never have been deployed, should have immediately stopped deployment when the machines started crashing, and windows needed a better response to a broken driver than just put your locked down machine into safe mode and fix it yourself.
@MrTrak08
@MrTrak08 Ай бұрын
It would have been trivial to test the driver for this kind of issue; sadly people are too complacent, thinking the worst can never happen
@benyomovod6904
@benyomovod6904 Ай бұрын
Why no canary test, developer 101
@nicnewdigate
@nicnewdigate Ай бұрын
Presumably you don’t understand the difference between the internet and Microsoft windows. And Also absolutely Microsoft’s fault too.
@noelgomile3675
@noelgomile3675 Ай бұрын
This could have been avoided if CrowdStrike used a null safe programming language.
@ItsCOMMANDer_
@ItsCOMMANDer_ Ай бұрын
No it couldn't, it seems like it was read from a file so no memory-safe lang would have helped
@mysticknight9711
@mysticknight9711 Ай бұрын
Need to disagree - a bug introduced after CI/CD (i.e. perhaps in code signing, or code packaging/unpackaging) violates the dictum that "thou shalt test the same bits as delivered to the customer"
@niveZz-
@niveZz- Ай бұрын
basically they immediately tested it on all devices in the world instead of at least one after pushing the update lol crowdstrike is quite a fitting name tho
@shashwa7
@shashwa7 Ай бұрын
Still, it could have been easily avoided if they did incremental/batch/controlled rollouts. Government, travel and healthcare systems must always receive any new update only after a few months of testing in public rollouts.
@johnmcvicker6728
@johnmcvicker6728 Ай бұрын
Deploy to limited test servers for something like this, run it a few days. This seems like a deploy to the world and not a staggered deployment.
@Scott_Stone
@Scott_Stone Ай бұрын
Should've rolled out that update in chunks. That's the real problem
@wengkitt10
@wengkitt10 Ай бұрын
I wish everyone knew what really happened and would stop blaming Microsoft and even the government in those affected countries.
@JeanPierreWhite
@JeanPierreWhite Ай бұрын
It's aliens. It's always aliens.
@grokitall
@grokitall Ай бұрын
but the machines staying down is directly due to microsoft. if they had looked past safe mode and implemented something to detect and recover from bad driver updates, then it would have been a simple case of turning the machine of then on again and letting it recover.
@JeanPierreWhite
@JeanPierreWhite Ай бұрын
@@grokitall Yep. Microsoft have a lot of 'splaining to do.
@BanterMaestro2-y9z
@BanterMaestro2-y9z Ай бұрын
_"How was your day, love? Are you okay? You look frazzled."_ _"No biggie. Just took down the Internet, that's all."_ _"Oh I'm sorry! By the way, I burned dinner and the cat puked in your sock drawer."_
@kasimirdenhertog3516
@kasimirdenhertog3516 Ай бұрын
Fun fact: Tony Hoare, the guy who invented the null reference in 1965, called it his ‘billion-dollar mistake’ - not far from the truth!
@tharnendil
@tharnendil Ай бұрын
It might not be a software dev's fault, but releasing such changes without any form of staged mechanism (first release 5%, then 20%, etc) and observing reports (you can report success after booting the patched OS) is a sign of a lack of good process and of not following industry standards
@JeanPierreWhite
@JeanPierreWhite Ай бұрын
The devs are at fault for writing flaky code. The deployment manager is at fault for deploying to everyone at once. The scale of the disaster is down to deployment methodology or lack thereof.
@Nahash5150
@Nahash5150 Ай бұрын
Those of us who aren't code geeks read: Crowdstrike has way too much power.
@JeanPierreWhite
@JeanPierreWhite Ай бұрын
Truth.
@nathanwhite704
@nathanwhite704 Ай бұрын
@@JeanPierreWhiteeven those of us who are code geeks believe that.
@frederikjacobs552
@frederikjacobs552 Ай бұрын
Look, it's 2024. Both at Microsoft and CrowdStrike you need to assume this can happen and that the impact will be huge. Don't tell me nobody ran a "what if" scenario. At best both Microsoft and CrowdStrike could have done way more to allow some sort of fail-safe mode. For example: you detect your driver was not started or stopped correctly 2 times in a row after a content update > let's try something different and load a previous version or no content at all > check with the mothership for new instructions. Which would still be bad, but only "resolve itself after 2 reboots" bad...
@JeanPierreWhite
@JeanPierreWhite Ай бұрын
Yes; This is why corporations should not use Windows in mission critical systems. Its too fragile with no resiliency or automated rollback built in. Even my lowly chromebook can revert a bad OS update automatically by switching partitions at boot. Microsoft should have provided some level of resilience after all these decades.
@rkw2917
@rkw2917 Ай бұрын
Most of the "sensible" internet does not run Windows on their servers Hopefully some companies will revisit the choice