CrowdStrike Exposes a Fundamental Problem in Software

71,851 views

ArjanCodes

Days ago

Comments: 580
@ArjanCodes 2 months ago
✅ Get the FREE Software Architecture Checklist, a guide for building robust, scalable software systems: arjan.codes/checklist.
@MysticCoder89 2 months ago
Your kernel is crashed. No malicious code can be executed. Your computer is completely protected now. Thank you for choosing our company!
@ropro9817 2 months ago
I think the even more fundamental problem here is the security software mono-culture. I know CrowdStrike is big, but honestly, I was surprised when I heard in the news how broadly sweeping the impact was across companies and even across industries. If everyone's using the same software, that provides a ripe attack vector for hackers. 😒
@FrankmoonDusty 2 months ago
This, 100%. We need to move away from tech monopolies like CrowdStrike, and Microsoft by extension.
@penfold-55 2 months ago
It's a much bigger issue... Most of the biggest companies are concentrated in a very small area on the US West Coast: Microsoft, Google, Nvidia, AMD, Intel, Facebook, Amazon, and so on. The issue is that Europe and Asia are just so far behind America.
@alexivanov4157 2 months ago
Bravo! This is the main point of the issue!
@username7763 2 months ago
It isn't entirely a mono-culture. All it takes is for one layer in a massive distributed system to all be on the same thing. The problem is the crazy complexity of today's IT systems. Everything is a damned service that requires its own cluster and infrastructure.
@robertbutsch1802 2 months ago
No enterprise IT folks in their right minds are going to say, "Look, everyone else is using the best threat protection software in the business, so let's use Acme Security Software so we're not promoting a mono-culture."
@kwas101 2 months ago
It's partly about $$$ and partly about how everything nowadays is expected to happen with speed. Back in the day (30 years ago) I worked for a bank. We maintained a very large enquiries counter system. Before anything got pushed out to branches, it was tested for weeks. We had dozens of test engineers and they would run through every conceivable action. Then and only then a release would happen to a local branch. This would be tested in the wild for a week. Then a small group of branches for two weeks, then a larger group, then finally the main group. The result was that very few (if any) show stoppers made it to production. This meant a slow cadence of releases though. Also this was a large project with extensive management backing, so the cost was not really a factor (within reason). This type of behaviour would never fly today. Everything has to be done on the cheap, with minimal testing, just "get it out there". I call it the "just get it f**king done" attitude - this is very common nowadays, especially among MSPs.
@gzoechi 2 months ago
It's not necessary to go that slow. With proper CI/CD practice this would work as well at high speed. You still need to put a lot of work in to get proper quality.
@enadegheeghaghe6369 2 months ago
If you spend weeks testing your cybersecurity software, you will get hacked for sure before you deploy it. Hackers are a lot more sophisticated now compared to a decade ago.
@manoo2056 2 months ago
The issue is that in the short term the "get it the f**k done" attitude "saves money", but in the long term it explodes. I see it like entropy rising and then some feedback bringing back equilibrium. Let's hope we survive that feedback!! XD
@michaelwills1926 2 months ago
@enadegheeghaghe6369 hackers rise with the level of tech. Besides, you still should sandbox any release, even zero-day patches.
@AMMullan 2 months ago
So we have the CrowdStrike option ENABLED so CrowdStrike won't release the latest version of their software to us (we stay 1 version behind). Apparently they don't actually even check for this, so we got it anyway. Absolutely shoddy development :(
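The "stay 1 version behind" setting described above can be sketched as a small policy check. This is only an illustration: the `Update` type, the `kind` labels, and the `allowed` function are hypothetical names, not anything from CrowdStrike's product. The point of the sketch is that the pin is applied to every kind of update, including content pushes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    version: int  # monotonically increasing release number (hypothetical)
    kind: str     # "sensor" or "content"; both labels are made up for this sketch


def allowed(update: Update, latest: int, versions_behind: int) -> bool:
    """Return True if a host pinned `versions_behind` releases back may install `update`.

    The check deliberately ignores `update.kind`, so content pushes are held
    back by the same policy as full sensor releases.
    """
    return update.version <= latest - versions_behind
```

With `latest=10` and `versions_behind=1`, version 9 is allowed and version 10 is held back, whatever its `kind`.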
@TheGreenRedYellow 2 months ago
Wouldn't you get it in the next release? So technically you are not immune to this update, unless you manually deploy it.
@gcaussade 2 months ago
@AMMullan wow that's interesting information. It's interesting to see what happened. Did they have an emergency release? Maybe they felt there would be a breach if they didn't release something right away? So many questions. I have a hard time believing there are so many incompetent organizations around the world. If these companies were choosing to be one version behind, specifically to avoid something like this, then how did this happen?! That's crazy!
@gcaussade 2 months ago
@TheGreenRedYellow what do you mean? You would assume that people would report the BSOD and they would stop the rollout. The problem with being one version behind is that you're not getting the latest protection. But I could see doing this to avoid this exact situation.
@AMMullan 2 months ago
@gcaussade yeah they killed that update so anyone not getting the latest update wouldn't have received this at all 😕
@TheGreenRedYellow 2 months ago
@gcaussade it is really about how many updates they released. Like, what if they released 2 updates within the same day?
@whatcouldgowrong7914 2 months ago
People seem to be overlooking the glaring fact that they pushed an update that was corrupted or failed its checksum, which means there was a wide-open vulnerability that allowed man-in-the-middle exploits or injecting code with modified files directly into the kernel…
@vitalyl1327 2 months ago
Keep in mind that Clownstrike is a scam company selling a "cybersecurity" snake oil. Just like all the other antivirus companies. There is zero value in their product. They have no incentive whatsoever to do the right thing, because they're consciously scamming their customers anyway.
@fransstar8731 2 months ago
I see a lot of answers/recommendations, but what surprises me is why CrowdStrike works on Linux and not on Windows. Apparently Windows needs an extra driver to let CrowdStrike work. I think it all has to do with the different structure of Windows and CrowdStrike. I think it is time for Microsoft to change its whole structure to be more like Linux. This whole thing is Microsoft's fault. It is clear ethical hacking can only be done with Linux and not Windows. Wake up, people. Linux has sudo and Windows does not; this was and is the main issue. Awaiting comments. Thanks.
@whatcouldgowrong7914 2 months ago
@fransstar8731 They tried to and were blocked by Europe. At the very least Microsoft needs to revoke their WHQL and prevent changes after the fact.
@ying-ym8ut 2 months ago
The CEO of CrowdStrike, George Kurtz, used to be the Chief Technology Officer of McAfee in 2010, when a security update from the antivirus firm crashed tens of thousands of computers.
@KC-uf1rg 2 months ago
He upped the ante now 😂😂😂
@Discoverer-of-Teleportation 2 months ago
😂😂😂😂
@yoyolim538 2 months ago
We got crowd struck
@Jace-yt2zm 2 months ago
Crowdstrike dropped the ball and brought down a big chunk of the world’s commerce and business. While CEO George Kurtz is thoroughly enjoying and consumed by his race-car-driver-lifestyle in events all over the globe!
@James-hb8qu 2 months ago
My career has been leading engineering organizations. This is not a new issue or a unique issue. Bad driver code crashes systems. Because of that, the industry has created well-known and effective ways to prevent problems. You've listed them. The issue here is a company with widespread driver releases that failed to follow those practices. The free market has created a process for handling that, and it is called competition and consumer choice.
@joansparky4439 2 months ago
Markets that prohibit or undermine competition via rules enforced by the market authority (#) do not give the consumer the chance to choose a different supplier. #) The goal is to give one or a few control over the supply, so it can be kept below demand, which guarantees that the consumer always pays more than it cost, which is what profit is. In other words, real free markets would trend towards zero profit for all involved due to competition.
@ChristianSteimel 2 months ago
Most surprising is that PCs still don't use A/B installs of the OS, where you use one copy and update the other copy, then switch over to the updated copy, and you can switch back if the update failed for some reason. With disk space so cheap, you'd think every Linux/Mac/Windows PC would use that by now. In Linux at least you can revert to a prior kernel version.
@incandescentwithrage 2 months ago
Yeah but the same thing happened with Crowdstrike on Linux previously, causing a kernel panic. If you hook into the kernel, changing kernel isn't going to help. A/B is what happens with OS feature updates on Windows already. Nothing preventing people using backup software on the daily.
@JeanPierreWhite 2 months ago
Bingo. Windows is not ready for critical functions. Microsoft has had over 30 years to develop a resilient OS. Time to give up on them and go to Linux systems that support immutable OSes and atomic releases.
@askii3 2 months ago
SUSE MicroOS is essentially capable of such A/B installs. It takes snapshots of atomic transactional updates and can automatically roll back the update on failure. This is what JeanPierreWhite (above) is referring to with immutable Linux distros with atomic updates.
@gzoechi 2 months ago
I found that stupid 30y ago. NixOS does that quite well though.
@gzoechi 2 months ago
@JeanPierreWhite They never even tried to approach the problem. If it had been 300 years they wouldn't have made any more progress on that front.
@on_wheels_80 2 months ago
The CrowdStrike disaster didn't strike because they needed to move fast, but because they obviously hadn't tested this specific update on a single Windows machine. If they had, they'd have immediately noticed it would crash. And they made a similar mistake already in April. That time it could be somewhat forgiven because it only occurred on two distributions of Linux which hadn't been in their test matrix.
@1DwtEaUn 2 months ago
Yeah, you'd think they'd have at least one of every supported OS in a test lab and roll out to that first. Is 30 minutes before global rollout that big of a delay?
@d3stinYwOw 2 months ago
Or they have such a machine, but their testing might be influenced by local changes, or flaky.
@The_Ballo 2 months ago
that, or they did it on purpose
@Travolta12e 2 months ago
Wasn't the .sys file just a bunch of zeroes? I wonder if it was either a compilation or distribution problem that somehow corrupted the file, but the original file was working as intended. I mean, no matter how incompetent they are, it's naive to think that they just push new files to production without minimal testing first.
@stickman1742 2 months ago
How can you say it wasn't because they needed to move fast? One of the most common reasons why software is put out without enough testing is because they are trying to move fast. You may not think they needed to move fast, but internally they may have felt pressure. These kinds of drivers normally have to go through certification tests to be put into Windows, but updates can bypass this to get out more quickly. Don't underestimate the ability of companies, including very big ones, to take shortcuts whenever possible. Not too long ago I spent some time working for a huge financial company that has more money than most companies would know what to do with. They are supposed to have a complete system just for testing to protect everyone's financial data, but they didn't really want to put in the money or effort. Could they afford it? Of course! They just wanted to skip a few things, which would probably make their quarterly report look a little better. That test system was never working, so everyone had to run tests using people's actual financial data. They would just hand out people's real financial records, saying "You're not supposed to see this but we don't have any test data", to any employee just to get the job done. This is the attitude of the biggest institutions running this country.
@askii3 2 months ago
A mechanism rolling back an update after X number of failed boots/etc would help a lot here. My router does this: it keeps a copy of the old firmware it can automatically revert to in case flashing a new firmware image bricks it. SUSE's MicroOS does something similar by having a stateless OS and transactional updates that are snapshotted in the BTRFS file system. If it crashes and reboots, it'll automatically roll back to the snapshot before the update while preserving user data.
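The "roll back after X failed boots" idea can be sketched in a few lines of Python. This is a toy model, not how any particular router or MicroOS implements it: the slot names, the `BootState` type, and the threshold of 3 are all assumptions made for illustration.

```python
from dataclasses import dataclass

MAX_FAILED_BOOTS = 3  # assumed threshold; the comment above only says "X"


@dataclass
class BootState:
    active: str = "A"      # slot the bootloader will try next
    fallback: str = "B"    # last known-good slot
    failed_boots: int = 0  # consecutive failed boots of the active slot


def apply_update(state: BootState) -> None:
    """Pretend the update was written to the inactive slot, then switch to it."""
    state.active, state.fallback = state.fallback, state.active
    state.failed_boots = 0


def record_boot(state: BootState, succeeded: bool) -> str:
    """Record one boot attempt and return the slot to boot next."""
    if succeeded:
        state.failed_boots = 0
        return state.active
    state.failed_boots += 1
    if state.failed_boots >= MAX_FAILED_BOOTS:
        # Too many consecutive failures: revert to the known-good slot.
        state.active, state.fallback = state.fallback, state.active
        state.failed_boots = 0
    return state.active
```

After `apply_update`, three consecutive failed boots of slot B send the machine back to slot A without anyone having to reach safe mode by hand.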
@gzoechi 2 months ago
In NixOS you can configure how many system configs you want to keep. Switching back in the boot menu just changes a bunch of symlinks.
@askii3 2 months ago
@gzoechi yeah, it's really cool
@metamadbooks 2 months ago
But you can have it both ways: it's called rolling updates. You don't deploy software to a billion endpoints in one go.
@JeanPierreWhite 2 months ago
Correct. This was dumb.
@amyhaynes3019 2 months ago
Right
@Julio-ek1lw 2 months ago
I disagree with your comment; the number of deployments doesn't remove the dichotomy.
@samarbid13 2 months ago
This is a reminder of how fragile our IT solutions are. Imagine a solar storm occurring and the devastation it would cause! We need a plan B for critical infrastructures to always be in place!
@henson2k 2 months ago
We need an operating system that can disable drivers on reboot.
@JeanPierreWhite 2 months ago
It's a reminder of how fragile Windows is. Notice how it was only Windows computers that borked?
@gzoechi 2 months ago
How does this increase stock prices in the short term? Yeah, not gonna happen.
@zackang4731 2 months ago
@JeanPierreWhite Because it's software written specifically for Windows? A major bug in the Safari browser could potentially cause the same problem for ALL Mac users, and no Windows computers would be affected, because the browser is not embedded in that system the same way Safari is in macOS.
@STCatchMeTRACjRo 2 months ago
@JeanPierreWhite they could have released a buggy update for non-Windows OSes as well.
@lumeronswift 2 months ago
Something that needs to be more highlighted from this issue is that companies have in recent years been offloading their IT resources but are still adopting external, overseas-managed (i.e. managed in the US) solutions. Companies should always have an in-house team ready to respond to system failures. Informed, careful companies would only have had a couple of hours of downtime...
@_SR375_ 2 months ago
I want to add that the fact that CrowdStrike is so widely used makes it a target for bad actors, and perhaps how it operates internally, which seems to be monolithic, is also a problem. We also do not know what government and military systems were affected by this "bug". Regardless of other bad practices that were at play, CrowdStrike itself may want to take this as a lesson and perhaps break up its platforms into shards, such that entire industries are not impacted by one bad software update or a bad pod.
@yogibarista2818 2 months ago
The issue essentially is that there is a kernel-mode driver, no doubt WHQL certified, that is running uncertified p-code from installable 'definition' files, so that a bug there will cause the kernel-mode driver to execute bad code and bug-check the system. Perhaps the kernel-mode driver needs better checking and self-defence; could the WHQL certification process require this? The 'fix' is to gain access to safe mode, boot without the driver, and then remove the installable definition files, so perhaps a system should identify crashing 'boot-required' drivers and sideline them if they crash repeatedly.
@incandescentwithrage 2 months ago
You mean just like malware would do?
@JeffBartlett-kj6sq 2 months ago
I heard that it ate a file of all zeros. So: 1) no signature bytes, 2) no header, 3) no header checksum, 4) no whole-file checksum, 5) no file encryption or signing. So a bad actor can figure out the p-code and put a definition file in the directory, or do a man-in-the-middle attack and own the machine from ring zero.
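The kinds of checks listed above (signature bytes, header, checksum) can be sketched as follows. The file layout here is invented for illustration: `MAGIC`, the 40-byte header, and the `pack`/`validate` helpers are hypothetical and say nothing about CrowdStrike's real channel file format.

```python
import hashlib
import struct

MAGIC = b"DEF1"                    # hypothetical 4-byte signature
HEADER = struct.Struct("<4sI32s")  # magic, payload length, SHA-256 of payload


def pack(payload: bytes) -> bytes:
    """Build a file in the made-up format: header followed by payload."""
    digest = hashlib.sha256(payload).digest()
    return HEADER.pack(MAGIC, len(payload), digest) + payload


def validate(blob: bytes) -> bytes:
    """Return the payload, or raise ValueError before anything gets parsed."""
    if len(blob) < HEADER.size:
        raise ValueError("file too short for header")
    magic, length, digest = HEADER.unpack_from(blob)
    payload = blob[HEADER.size:]
    if magic != MAGIC:
        raise ValueError("bad signature bytes")
    if length != len(payload):
        raise ValueError("length mismatch")
    if hashlib.sha256(payload).digest() != digest:
        raise ValueError("checksum mismatch")
    return payload
```

A file of all zeros fails at the very first check, because its first four bytes are not the magic value. Note that a plain hash only catches corruption; stopping the man-in-the-middle scenario from the comment would additionally need a cryptographic signature verified against a trusted key.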
@johnhebert9583 2 months ago
Someone else who watched the Dave Plummer video about Crowdstrike. His is the most thorough explanation I've seen.
@wernerlippert5499 2 months ago
Humans tend to think they can sacrifice quality for speed, which works for some time and then fails miserably. It's a bit like the uncertainty principle, there is a fundamental limit that cannot be cheated.
@stickman1742 2 months ago
We are pushing towards these kinds of bad events pretty quickly. Software updates are being pushed out constantly in an effort to move ahead as fast as possible. It wasn't that long ago that this was not the way. Updates were treated very carefully and put out more slowly. Now it is a race to see who can update the fastest. I see computers and devices suddenly stop working on their own all the time now, always because of some recent update. This is already an issue; this is just an event so widespread that everyone is hearing about it. The industry is going to have to come up with more robust systems, as we cannot depend on computers for everything if they often are just not going to work. This is a relatively new issue with all these updates, and the problems will only get far worse with bigger consequences if it continues like this.
@keithnsearle7393 2 months ago
So, basically Crowdstrike could not even secure itself against itself. Well done Crowdstrike, well done! (Slowly clapping) To Microsoft, get rid of Crowdstrike, no IFS and no BUTTS!
@vister6757 2 months ago
Other antivirus products also have access to the kernel due to EU regulator requirements, after McAfee and Symantec brought a case against Microsoft when Microsoft added code to stop third-party software running in its kernel.
@krissn8111 2 months ago
Was there any test in canary environments? I guess not, and how long does it take to test in canary? I can't understand a company like CrowdStrike overlooking best practices.
@MadeleineTakam 2 months ago
I find it utterly incredible that they don’t test the update on a sandboxed system before sending it out.
@Eris123451 2 months ago
I don't. It's a quality assurance issue, and it's turned out, for example, that after years of promoting quality systems and quality assurance, at least 2 of the biggest manufacturing companies in Japan had been falsifying their production records and data for decades. If it's a choice between scrapping millions of pounds of work or passing it on the nod, few if any managers are going to bite the bullet and take that kind of financial hit. That mindset is probably at the root of the majority of major operational failures in almost any industry.
@omriliad659 2 months ago
One (partial) solution is to have a backup computer that stays a version or a few days behind and only comes online if the main server stops responding. It would prevent this problem and could only be exploited if a hacker could take down the most up-to-date server. You could even rotate the servers, so you update 2 versions each time. Another solution is to have canary distribution with faster turnaround. Set the most secret systems to have the update first, have the next group within an hour later, etc. It means you make your last group vulnerable for a few more hours, but you give them the peace of mind that it was tested for a few more hours and is unlikely to crash that fast. The last solution is to disconnect systems from the internet. No computers with an internet connection means no attack surface, and you can still work offline or maybe even with others on the same network. Keep the protection system guarding the gateway, maybe even keep several layers of different software at each layer, but leave the inner network isolated.
@marcelogarcia5539 2 months ago
I thought this was one of the lessons from COVID: resilience is as important as efficiency.
@charlesnicholas4758 2 months ago
Good video but everyone seems to ignore the fundamental problem. How do you compile source code into a file of binary zeros?! At least if it had been a null file the size would have been noticed.
@mitchellsmith4601 2 months ago
This was an embarrassing failure for Crowdstrike. All they had to do was test their patch on Windows PCs prior to release, and they would have seen those PCs blue screen. They could have fixed the issue, tested again, and THEN deployed. The more devices you’re responsible for, the greater the duty to test prior to deployment. This was negligence, pure and simple, and there should be a class action suit against Crowdstrike for the damages they caused. Such a suit would destroy Crowdstrike, of course, but that’s as it should be. Our world needs to deter this negligence in the future.
@raristy1 2 months ago
The basic Security+ cert teaches EXACTLY that. So my question would be, was ANYONE certified at CrowdStrike???
@RiteGuy 2 months ago
All great points, Arjan, and I agree with them. But you left out a biggie: companies want to make as much money as possible, so they cut corners everywhere. You did lightly touch on time by saying sometimes you don't have the time to create a proper fix for a threat. I agree, but there's another time problem. To companies, time = money, so the time allowed to work on things is cut right away even when there isn't a looming threat. Remote updating is a godsend for companies. It lets them ship a product that is incomplete and flawed thanks to time and money constraints. Then, as the product is completed/fixed, the current installations of software are usually automatically updated without the knowledge of the user. These issues and all the ones you mentioned are breaking software. I'm afraid of AI, not for the reasons most people cite but because software code is garbage in this day and age. Why would AI software be any better?
@NickThunnda 2 months ago
In the good old days we had big mainframes running code which took checkpoints and did automatic rollbacks upon failure. They were replaced by lots of networked Microsoft boxes.
@gregorymathy2782 2 months ago
That unfortunately goes beyond IT infrastructure cost… harmonization of process and procedure… CrowdStrike, Boeing, car breakdowns… all of that is unfortunately driven by cost reduction and profit optimization… We are unfortunately only seeing the tip of the iceberg, and I am pretty sure we are only at the beginning of it… I wonder what the next big thing will be…
@sneezyfido 2 months ago
Business culture breeding and promoting incompetence is a huge issue in all large companies
@nurulnurul9270 2 months ago
Ouch. Somehow I found myself agreeing with you.
@galuszkak 2 months ago
I think this is an interesting case where the software design decision to build monolithic kernels 30-40 years ago is showing its consequences today (Linux, Windows etc.). Prof. Andrew Tanenbaum was trying to convince the software industry that microkernels are better for reliability and security, while sacrificing some performance. Looking back, my best guess is that by going with monolithic kernels we built a whole security industry around them because of security flaws that can be there by design.
@pureabsolute4618 2 months ago
It's also how big "kernel space" is in general. Windows NT has graphics in user space. Of course, that was too slow, so they moved it "back" (windows 98 didn't have a protected kernel).
@CallousCoder 2 months ago
The problem with microkernels is that they are complex. GNU Hurd failed because of it. Darwin is the only one now, but x64 (I need to check ARM; I developed assembly on ARM but never from bare metal) only has 2 security rings. We used to have 4, but since all major operating systems and most CPUs since VAX had 2 rings of protection, x64 also settled for two. So you don't have your classical ring 1 for your drivers anymore. So you may be loosely coupling your drivers, but all in all they run in the privileged area; hence Mac OS X on Intel did crash with shitty drivers too.
@gzoechi 2 months ago
NixOS can easily switch back multiple versions of configurations (not just the kernel). That's not a problem where the kernel architecture needs to get involved.
@MartinMaat 2 months ago
It has nothing to do with this. The point of a virus scanner is that it should have control over everything by design. Which is not only a major security issue in itself but also a major privacy issue. As people get scared they tend to accept compromises, all the way down to fascism.
@ra2enjoyer708 2 months ago
@MartinMaat You meant liberalism?
@gzoechi 2 months ago
CrowdStrike has shown that it has become the biggest threat to security
@MartinPHellwig 2 months ago
When you expect something to work all the time in all circumstances, but you can't define what "all the time" actually is or what the specifics of the circumstances mean, you have unrealistic expectations. That is something each individual has to learn; those willing to be realistic will have an easier time, with less severe consequences, learning that.
@ProfessionalBirdWatcher 2 months ago
My rage at everyone downplaying this for CrowdStrike is immeasurable. This is a billion-dollar company, with a B, trusted by critical government, public, and private services, and they shafted each and every one. The lack of outrage from our authorities is absolutely disgusting. Speaks a lot to the state of cybersecurity and tech in general.
@gregharn1 2 months ago
It's not a software problem. The decisions CS made were a tradeoff for functionality. The real root problem is policy. If you're a company running an EDR or even certain AV, you MUST build out a redundant infrastructure, specifically to mitigate bad updates. Which really just means if a system can run on 1 machine, you deploy at least 2 AND run different EDR or AV on each system. If 1 crashes (like last week), no big deal. 2 is 1, 1 is none.
@CraftyF0X 2 months ago
I for one always saw the possibility of something like this happening, hence my reservations against automatic forced background software updates, which would have sounded shady AF in the 90s while today they are a widely accepted daily occurrence. Don't get me wrong, it has its advantages, but something like this case was always in the cards.
@samable9585 2 months ago
For a serious bug or zero-day bug, CrowdStrike should have simply disabled inbound traffic to the host (other than from itself), worked on a fix, and rolled it out in a limited manner. If it succeeds, keep rolling it out... Would you fly a plane with this type of method? We ground planes immediately when there is a threat, but we treat security threats in computers in a slightly business-as-usual way and take chances. This may change it... Act first, disable, and then push changes.
@samchristy6745 2 months ago
Most threats do not require an immediate response; for many, a canary release mechanism based on system criticality would work:
1) deploy to non-critical systems (grocery stores, small businesses, gas stations, government)
2) wait 36 hours
3) deploy to mid-level critical systems (banks, financial institutions)
4) wait 36 hours
5) deploy to critical systems (hospitals, pharmacies, airports)
For the defcon 5 threat level scenarios, then perhaps use the shotgun approach.
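The tiered rollout described above can be sketched as a schedule plus a gate. The tier names and the 36-hour soak time come from the comment; the function names and the single `healthy` flag are assumptions made for this illustration, and a real system would derive health from crash telemetry.

```python
from datetime import datetime, timedelta

TIERS = ["non-critical", "mid-level", "critical"]  # ordering from the comment above
SOAK = timedelta(hours=36)                         # wait between tiers


def schedule(start: datetime, tiers=TIERS, soak=SOAK):
    """Return (tier, earliest deploy time) pairs, one soak period apart."""
    return [(tier, start + i * soak) for i, tier in enumerate(tiers)]


def next_tier(deployed: list, healthy: bool, tiers=TIERS):
    """Return the next tier to deploy to, or None to halt the rollout.

    `healthy` stands in for whatever signal says the previous tiers
    are not crashing.
    """
    if not healthy or len(deployed) == len(tiers):
        return None
    return tiers[len(deployed)]
```

Starting at 04:00 on 19 July 2024, the mid-level tier would become eligible at 16:00 on 20 July and the critical tier at 04:00 on 22 July, and any unhealthy signal from an earlier tier stops the rollout before it gets there.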
@sm5574 2 months ago
A lot of the developers who are doing shoddy work don't know that they are. They may be incompetent, or they may not know the codebase as well as they think they do, or the codebase may itself be a ticking timebomb, full of patches and poor decisions that effectively hide a myriad of bugs. The industry is absolutely broken because it is full of people who are completely unaware of best practices and solid patterns, relying instead on their own unstructured learning that has gone unchecked for decades.
@stickman1742 2 months ago
The only real solution though is that the systems need to be more robust. There are always going to be some software bugs; it would be impossible for everyone to always create software without a single bug. These computers have to have a design that is far more robust so that they won't just refuse to run if there is one bug in even a kernel driver. This is a must if we are going to avoid much bigger problems like this in the future. Problem is, most companies just want to build on the current designs and move quickly, as that is how you can make the most money. It will take a disaster to make everyone stop and say "I guess we really do need a new design for this." Then all companies will be willing to pay for that newly designed system and the computer companies will make it. Is this even big enough to make that happen? It may cause them to look at it a bit, but I kind of doubt it's big enough to push that much change. They'll probably put a band-aid on it.
@sm5574 2 months ago
@stickman1742, I agree, but I would estimate that developers (even at the senior level) who are capable of writing high-quality code are very much in the minority, and the people in charge of hiring rarely understand what to look for, as they are not, themselves, capable of writing such code. Thus, the vast majority of codebases are and will be more error-prone and difficult to maintain than should be considered acceptable.
@johnmoore8599 2 months ago
You aren't thinking through the problem sufficiently. There are two ways to solve this issue at the kernel level to prevent these kinds of problems with monolithic kernels. 1. Build a subsystem that, if the kernel panics, reverts the system to the last known good configuration before the crash. 2. Build a driver subsystem that insulates the kernel from buggy or bad drivers and lets it continue to operate. This was the idea behind Nooks, written by Michael Swift in 2005. Either architectural change would build resiliency into the current OS kernels humans use. For whatever reason, no one is doing this. Their answer is always "develop better drivers", and they put these best practices into place, and along comes some company like, cough, McAfee, or, cough, CrowdStrike, who brings Windows systems down. People get angry and pissed because of ruined plans or lost money, but ultimately nothing significantly changes because someone at Microsoft/Apple/Linux Foundation doesn't want to pay to make their OSes more reliable.
@PerisMartin 2 months ago
Well, the way you solve this is to keep doing what you are doing. Keep teaching and preaching good practices with your videos. You never know the second and third order consequences of your good work. Keep it up!
@robinlioret7998 2 months ago
Add poor patching management in the companies: never apply patches directly in production without testing it in lower environments before...
@gcaussade 2 months ago
This is what really amazes me, the fact that so many companies were just rolling this out. But his point is correct. I give more blame to Microsoft and CrowdStrike. They're the ones that have to work very closely together and do something new, more like real-time testing. It's amazing this hasn't happened before. The largest breach in US history was the UnitedHealthcare Optum breach months ago. That was a result of companies not patching fast enough! And that was remote software, not something near the kernel. Still led to a massive disaster and problems with the health care system for over a month! So if anything, CIOs and CISOs felt more compelled to roll out security software even faster to make sure that it at least is up to date. What would happen if you were breached because you didn't roll out CrowdStrike fast enough? That's the dilemma he brings up.
@xBanki
@xBanki 2 ай бұрын
Going by anecdotal reports online, CrowdStrike likes to push its customers into enabling automatic patch updates. Logically, it makes sense why they would do that. However, historical evidence (and literally any administration handbook) says that blindly accepting updates should not be done, no matter the reputation of the company or the claimed quality of the updates, precisely to prevent outages like the one we saw.
@robertbutsch1802
@robertbutsch1802 2 ай бұрын
This was the equivalent of an AV pushing out a new virus signature file. No enterprise is going to pay the cost of CrowdStrike just to be a week behind on threat protection.
@silmarian
@silmarian 2 ай бұрын
They pushed it using the same channel as signature updates, not the usual upgrade path.
@Lofote
@Lofote 2 ай бұрын
That is not really valid in 2024 anymore. That was the case in earlier times, but in 2024 zero-day attacks are so common and threatening that security updates are considered time-critical. That means the risk of crashing your systems is considered more acceptable than a successful hack in which your data is downloaded by the attacker, which is considered a far bigger disaster. Time is critical in 2024 with security patches :(...
@davidgrisez
@davidgrisez 2 ай бұрын
One main thing that allowed CrowdStrike security software to crash the computer operating system was the fact that this security software must be installed as a device driver operating at the high privilege level of the operating system kernel. Normal program software running at a lower privilege level should not be able to crash the operating system.
@DistortedV12
@DistortedV12 2 ай бұрын
I think one of the problems is this automatic update culture
@robertbutsch1802
@robertbutsch1802 2 ай бұрын
According to CrowdStrike this was not an “update” but a content delivery.
@luciaceba4640
@luciaceba4640 2 ай бұрын
@@robertbutsch1802 which is an update
@CaribouDataScience
@CaribouDataScience 2 ай бұрын
What’s the cliché say about putting all your eggs in one basket?
@bobdowling6932
@bobdowling6932 2 ай бұрын
There is (should be) a standard pre-release test even for time-critical security software: The test should be that the target operating system can at least boot to a point where the updater can allow new versions of the security software to be installed. The test should be run twice: once on an instance that keeps upgrading the software and one on a freshly installed operating system. If just those tests are implemented then, to an extent, you can rush the rest because fixes can be sent out to clean up errors. This test doesn’t need re-writing for each instance of the software. Other tests (does it block the malware, does it not interfere with critical applications, ...) can be run after launch because errors can be cleaned up automatically. There is room for subtlety here: a customer might sign up for the pre-application-testing version or the post-application-testing version. Perhaps they do their own testing. Perhaps they have made a risk-balancing decision. This sounds so obvious. Hindsight is a beautiful thing.
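The pre-release test described above can be sketched as a simple gate in Python. This is a hedged illustration under stated assumptions: the scenario names and the `BootProbe` callable (which would, in practice, apply the candidate to a test machine and check that it boots far enough for the updater to run) are hypothetical and do not describe any vendor's real pipeline.

```python
from typing import Callable

# Hypothetical probe: given a candidate release identifier, return True if a
# test machine still boots far enough for the updater to install a new version.
BootProbe = Callable[[str], bool]

# The two instances named in the comment: one that keeps being upgraded,
# and one freshly installed operating system.
SCENARIOS = ("continuously_upgraded", "fresh_install")


def release_gate(candidate: str, probes: dict[str, BootProbe]) -> bool:
    """Allow a release only if every scenario can still reach the updater."""
    for scenario in SCENARIOS:
        probe = probes.get(scenario)
        # A missing probe counts as a failure: an untested scenario blocks release.
        if probe is None or not probe(candidate):
            return False
    return True
```

Because the gate only checks "can we still ship a fix afterwards?", it does not need to be rewritten for each release, which matches the argument that the remaining tests can run after launch.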
@epiphoney
@epiphoney 2 ай бұрын
Mark Russinovich retweeted using Rust instead of C++ for systems programming, "for no particular reason".
@daviddunkelheit9952
@daviddunkelheit9952 2 ай бұрын
This failure was quicker in onset and damage than SolarWinds. Diversity in systems... NOW! We need to build heterogeneity into the system rather than homogeneity.
@billfrug
@billfrug 2 ай бұрын
So your argument is that there was an imminent security threat that the update addressed? Is there any evidence of that?
@JeanPierreWhite
@JeanPierreWhite 2 ай бұрын
I don't think so. Just a bad update. Fragile endpoints, lack of change management. I retire and a year later the world goes to crap. Geez, that didn't take long ;-)
@theronwolf3296
@theronwolf3296 2 ай бұрын
Nothing I have seen so far even identifies the threat that was SO serious that this rush was essential. That's another part of the problem, security companies just go along and do things.
@AlvaroGilFernandez
@AlvaroGilFernandez 2 ай бұрын
As an IT expert: all my life Windows has been a problem; it always presents some kind of problem to begin with. We need a new operating system that can replace Windows and that is tough and trustworthy.
@GH-oi2jf
@GH-oi2jf 2 ай бұрын
We had one. It was from IBM and was called OS/2. Actually, Microsoft participated in the development of OS/2 1.x. It ran in the early ATMs. OS/2 2.x came out before Windows NT and was an excellent product, but for some reason the shift to Microsoft took place and OS/2 was left behind. Business would have been better off sticking with IBM. I have never worked for IBM (or Microsoft) by the way, but I have worked with OS/2. I have no conflict of interest, just my opinion.
@sergioyichiong7269
@sergioyichiong7269 2 ай бұрын
Non-Windows OSes never had problems? What are you gonna do with the legacy code?
@captainnerd6452
@captainnerd6452 2 ай бұрын
Error checking and error handling design. Don't trust data coming in, and don't trust data being returned from functions. Really don't trust the user.
@mfrunyan
@mfrunyan 2 ай бұрын
Precisely. This is amateur level code running in the kernel space.
@daimajind7231
@daimajind7231 2 ай бұрын
Has anyone considered how a single company can affect so many systems single-handedly, at kernel level, worldwide? Does that mean a single bad actor at the company has the potential to compromise those same systems globally with a silent malicious payload, without anyone knowing or even noticing, thanks to the default automatic update to the bleeding-edge build version?
@theronwolf3296
@theronwolf3296 2 ай бұрын
Maybe the kernel security layer should be virtualized, so that a corruption of the kernel can quickly be switched off. Despite the claimed need for such deep access, if companies like CrowdStrike can corrupt the kernel, hackers (including nation-state actors) could do the same, or worse. At least the CrowdStrike bug just crashed the system, but other bugs can subvert it.
@miraculixxs
@miraculixxs 2 ай бұрын
It was not a security issue. It is a management issue. Perhaps MBAs should not run engineering organizations.
@TimShear-p3s
@TimShear-p3s 2 ай бұрын
It seems to me to be the old problem of shutting the gate after the horses have left. What's needed is a kernel process built with a number of 'gates', where the system will not continue past a gate unless it passes its constraint, plus an error 'fail safe' that lets the system run in safe mode, so that changes can be made to restore operation if a gate fails. This would have saved CrowdStrike. Further, the system should be based on an identity management framework where identities/entities and permissions can't be pivoted or navigated out of to other entities, at the kernel level and beyond. All operations should check credentials before executing programs. This is easy to do if there is a relationship between the entity requesting access and the entity controlling access. No relationship, no access.
@StuartLynne
@StuartLynne 2 ай бұрын
There is no Silver Bullet: "... improvement within a decade in productivity, in reliability, in simplicity." - Fred Brooks, 1986.
@Apstergo
@Apstergo 2 ай бұрын
Knowing these questions is important. So is actively listening more to industry experts and less to corporate experts (they only lead to better return on investment, and now that is AI). This event should be a wake-up call, but I don't think people will think of it like that.
@d3stinYwOw
@d3stinYwOw 2 ай бұрын
CrowdStrike hit Linux a few months ago, too, but nobody said anything since the impact was smaller. CrowdStrike was also able to force such upgrades. Plus, we can have both tests and velocity; Dave, the main front person of Continuous Delivery, has said so as well :)
@allenpierce4575
@allenpierce4575 2 ай бұрын
It doesn't help that the newer versions of Windows don't let you roll back the update without it trying to reinstall right after you remove it.
@k98killer
@k98killer 2 ай бұрын
The driver code itself was not updated, but rather a "channel" file that contained attack detection templates was pushed out with all zeros. The driver contained faulty template verification code that allowed the broken file to be parsed, and this included what should have been a valid pointer offset value. The driver then dereferenced a bad pointer and crashed the system. So really, if they had more thorough testing of their core code, they could have prevented this.
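The kind of verification the comment says was missing can be sketched as follows. This is a user-space Python illustration with an invented file layout (a 4-byte signature followed by a 32-bit offset); `MAGIC`, `HEADER` and `first_template_offset` are hypothetical and bear no relation to CrowdStrike's actual channel file format. The idea is simply that a file is rejected before any offset from it is used.

```python
import struct

MAGIC = b"TMPL"                 # hypothetical 4-byte file signature
HEADER = struct.Struct("<4sI")  # signature, then offset of the first template


class InvalidTemplateFile(ValueError):
    """Raised when a content file fails validation and must not be used."""


def first_template_offset(data: bytes) -> int:
    """Return the validated offset of the first template, or raise."""
    if len(data) < HEADER.size:
        raise InvalidTemplateFile("file shorter than header")
    magic, offset = HEADER.unpack_from(data, 0)
    if magic != MAGIC:
        # An all-zero file fails here, before any offset is trusted.
        raise InvalidTemplateFile("bad signature")
    if not HEADER.size <= offset < len(data):
        raise InvalidTemplateFile("offset points outside the file")
    return offset
```

A caller would catch `InvalidTemplateFile`, keep using the previous content file and report the error, rather than proceeding with data it could not verify.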
@rydmerlin
@rydmerlin 2 ай бұрын
Why couldn't Windows be written to quarantine any driver that behaves like CrowdStrike's? Wouldn't that have allowed quicker recovery?
@MrShoorf
@MrShoorf 2 ай бұрын
It's like installing 2 antiviruses on the same machine. We might even call it *SecondStrike* 🤣
@Together707
@Together707 2 ай бұрын
As far as I understood, it did exactly that. The CrowdStrike update tried to meddle with registries it wasn't supposed to touch, and the response is a blue screen and an immediate reboot to stop the process and revert the changes. Only this time it got stuck in a loop.
@kjetilhvalstrand1009
@kjetilhvalstrand1009 2 ай бұрын
Well, companies other than Microsoft often have their own way of pushing packages. It might not be up to Windows at all and, as I understand it, that was the case here: they had their "driver", which executed other code that was not signed.
@CallousCoder
@CallousCoder 2 ай бұрын
If this clusterfuck has shown one thing, it is that important companies and institutions can't cope with disasters. There are no manual backup processes in place. And it's not only computer systems that can fail; there are also long power outages, internet outages, and long traffic problems that keep goods from getting where they need to be. We need to rely less on government-run global infrastructure and decentralize systems instead. Back when I was in the energy business, I advocated small pebble-bed reactors for towns or larger neighbourhoods instead of massive 1 GW nuclear power stations: the bigger the plant, the more complex it is and the more material it needs, whereas small reactors are simpler to build and even safer. Russia understood this and is moving in that direction. It is also a great defence against acts of war: take out 4 or 5 power plants and you disrupt all of the industrial areas of the Netherlands, while taking out 50-100 smaller ones is a lot more of a hassle. And we should use cash, cash and more cash for our daily shopping. And we should actually buy from local farms much, much more.
@joansparky4439
@joansparky4439 2 ай бұрын
economies of scale drive this, which means this is NOT true: _"the bigger the more complex and the more material is needed"_
@JeanPierreWhite
@JeanPierreWhite 2 ай бұрын
Many companies do have manual processes. However they are very slow and inefficient. If that wasn't the case then we wouldn't deploy computers in the first place.
@CallousCoder
@CallousCoder 2 ай бұрын
@@JeanPierreWhite Many did, but a lot didn't, like hospitals and GP offices. That's just unthinkable! I worked in healthcare software, and we documented a backup process as part of our manual. You could print your agenda, and you could print user details and treatment and medication plans. And most did that. You don't need your computer system to diagnose or treat people. Same with issuing boarding cards. The SITA system was still running: print a passenger manifest and issue the boarding cards manually. Some airlines did, most didn't. So it showed how painfully unprepared we are. And this was only a simple computer outage, let alone something more impactful like a power outage.
@CallousCoder
@CallousCoder 2 ай бұрын
@@joansparky4439 It is true in the case of nuclear power plants and engineering. If something is bigger, it will always require more resources to build. I can't build a reactor thinner. Your statement only holds for consumer goods, where you can make billions. For critical systems the cost isn't in manufacturing the actual system, but in all the security and secondary systems. A single-engine plane will always be simpler and cheaper than a four-engine plane. It's not just bolting an extra engine onto the plane: your mass increases, so that first engine should be able to hold the plane up with the added mass. You will need to monitor the two engines and balance them for wear and tear. You'll need to service the two engines. And this complexity gets worse with four engines. What if two engines stop? Then the other two should take over, but the whole load-bearing structure is also loaded unevenly, and that needs to be designed and tested. Critical systems quickly get more complex once you start adding to the control systems. You probably haven't studied engineering, and especially haven't worked on critical systems, because that's where the general laws of economics don't apply, simply because of the snowball effect. Also, there aren't enough true mission-critical systems to get the benefits of economies of scale. How many nuclear power plants are built every year? How many satellites, etc.?
@CallousCoder
@CallousCoder 2 ай бұрын
@@joansparky4439 Funny story: I got into critical systems by building a very simple device that measured the height of snow/ice on Antarctica. You would knock out a prototype for this in a few hours these days; back in 1993 it took about a week (all in assembly, with no libraries available). But since this system had to run unattended for 5 years, the complexity of the peripheral systems suddenly exploded! We needed two batteries, and charge circuits that would charge the batteries equally (if they don't have the same capacity, you are basically discharging one into the other). These charging circuits had to be redundant, which also meant two solar cells that were cross-connected. You need to be aware that for 6 months of the year there's no charging, so each battery by itself should be able to hold a 7-month charge. So suddenly the batteries became twice as big. The housing became twice as big as a result. But we also needed 3 sets of ultrasound range finders for redundancy, and that added enormous code complexity: checking whether the primary system was working by comparing it to the secondary and, if there was a discrepancy (which is very normal with snow and ice, because they form in heaps), taking the secondary system after comparing it to the tertiary system. If the tertiary system decides the primary and/or secondary system is defective, you don't want to use those range finders, to save crucial energy. As a matter of fact, let's decouple them from the CPU bus. The cost exploded! Not only in resources but mainly in design and development. And you never build enough of these systems to get the economic benefit. Basically all those critical systems, from planes to tanks to satellites to bespoke research equipment, are made by hand by a very select few people. It's not as if you can go to China and let a factory build 2200 satellites. First of all, a factory that can do that doesn't exist and would need to be designed and built. And 2200 is a big number of satellites.
@natduinfo
@natduinfo 2 ай бұрын
NSA has better access to kernel than CrowdStrike. Let that sink in. 😂
@joelmamedov404
@joelmamedov404 2 ай бұрын
Technical glitches can happen. The fundamental problem is not technical. It’s managerial. “Business continuity” planning does not exist anymore. Critical systems and industries must have redundant and durable systems. Unfortunately, all the eggs are in the same basket.
@dannym817
@dannym817 2 ай бұрын
As a software engineer myself:
- Too little time to test, to build well/refactor, to rebuild legacy code. Deadlines push bad or poorly tested software into production.
- Too much stress because of too much firing, people leaving and rehiring all the time. And with those people, the knowledge of parts of the software is gone.
- A lot of bad managers in the IT world, who make the above happen.
- Companies see software development/IT as a cost instead of a win. For example, in some companies I worked at, salespeople get bonuses when they make enough sales of the software, while software engineers don't get anything.
- There is no really easy way for managers and other people who can't read code to see how well or badly a software engineer has been working. Because of this, most companies only look at speed, not at how well the software is written.
This has been happening for a very long time in lots of companies, probably most, leaving legacy code that isn't workable anymore, is very hard to maintain, and should have been replaced years ago.
@mrtnsnp
@mrtnsnp 2 ай бұрын
In part you want to avoid single points of failure. So don't run all your systems on the same security software and the same base OS. A more diverse collection of systems is less likely to go down all at the same time. CrowdStrike for sure isn't the only provider of these kinds of services, and for sure they won't be the first (or last) to introduce bugs in kernel drivers. There are sufficient opportunities for shit to hit the fan. On the other hand: Apple removed access to the kernel for all third-party software. That may be needed for Windows as well, with an API to perform these tasks from user space rather than inside the kernel. And CrowdStrike needs to have better processes for developing their code, but they are not unique there.
@whlewis9164
@whlewis9164 2 ай бұрын
A more diverse collection of systems also introduces complexity of support, management, monitoring, licensing, and contracting.
@mrtnsnp
@mrtnsnp 2 ай бұрын
@@whlewis9164 Yes. Instead of one configuration, you have to support two, you split the options for each functional piece in two, and end up with two sets. In my view this likely beats the downtime that was just experienced.
@whlewis9164
@whlewis9164 2 ай бұрын
@@mrtnsnp I very much doubt our corporate management overlords will opt for the best technical approaches. They will likely continue to squeeze the budgets, ship support overseas, and consider the bottom line over everything else.
@mrtnsnp
@mrtnsnp 2 ай бұрын
@@whlewis9164 They get what they pay for, that is for sure.
@ra2enjoyer708
@ra2enjoyer708 2 ай бұрын
@@mrtnsnp Two? Try O(n^2), aka every OS with its own barely specced configuration format which depends on a specific version of parser which depends on a specific language it was written for. Also this kind of clusterfuck introduces another attack surface in the form of different parts of the stack interpreting the same value differently, in worst case with a race condition on top.
@eglobalsystems2554
@eglobalsystems2554 2 ай бұрын
That has taught us again: SDETs are an important part of our software life cycle!
@richardbloemenkamp8532
@richardbloemenkamp8532 2 ай бұрын
Staged/canary releases should be obligatory unless there is imminent danger, at which point the government should be involved. It is totally ridiculous that millions of PCs install kernel patches that have not even been checked on an initial group of a few thousand computers for at least one or two days. In this case there was no imminent great danger that absolutely required all of the millions of PCs to be updated within a few hours.
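A staged rollout of the kind argued for above can be sketched in a few lines of Python. This is a simplified illustration, not any vendor's deployment system: `deploy` is a hypothetical callable that updates one host and reports whether it is healthy afterwards, the stage fractions are made-up defaults, and the one-or-two-day soak time between waves is left out for brevity.

```python
from typing import Callable, Sequence


def staged_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], bool],  # hypothetical: True if host is healthy after update
    stage_fractions: Sequence[float] = (0.01, 0.1, 1.0),
    max_failure_rate: float = 0.0,
) -> list[str]:
    """Deploy in growing waves and halt at the first wave over the failure budget.

    Returns the hosts that received the update.
    """
    updated: list[str] = []
    done = 0
    for fraction in stage_fractions:
        # Each wave extends the cumulative target, and always adds at least one host.
        target = max(done + 1, int(len(hosts) * fraction))
        wave = hosts[done:target]
        if not wave:
            break
        failures = sum(0 if deploy(host) else 1 for host in wave)
        updated.extend(wave)
        done += len(wave)
        if failures / len(wave) > max_failure_rate:
            break  # halt: the remaining hosts never receive the update
    return updated
```

With 1000 hosts and an update that breaks every machine, only the first wave of 10 hosts is affected; the other 990 never see it.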
@diogotrindade444
@diogotrindade444 2 ай бұрын
All parties need to fix this broken system:
- Security companies must never force-push without testing.
- OS vendors (especially MS) need to improve all aspects of this scenario, with lots of new, well-documented automated testing/check tools for multiple steps in the process.
- Essential companies cannot blindly trust updates without basic checks, and MS should not be the only OS running if you want to make sure that you are online all the time.
We need better software that is built for failure, especially for essential companies that cannot stop. If companies do not fix this on all levels, it can open a new door for failure.
@bernhardkrickl5197
@bernhardkrickl5197 2 ай бұрын
The promise of Continuous Delivery (as Dave Farley explains so often) is that you can release quickly and safely *because* you have lots of tests. You work in small steps to achieve that. There might be an imminent threat and we will have to make a big change to our software to deal with it. You are back at square one: How do you know your change actually deals with the threat? Oh, that's right: By testing. If you say you need to skip that phase you don't believe in testing in the first place. If you skip that phase you get *something* to the market quicker. But will it help? Or are you pouring oil into the fire? The practice of continuous delivery with TDD is the best insurance that your software stays flexible and easy to change so you can deal with such problems quickly when they arise suddenly.
@pureabsolute4618
@pureabsolute4618 2 ай бұрын
First, there is *no way* someone else's driver should be pushed to *your* customers. Second, if it is pushed, protect it with the software-stack equivalent of a try-catch. If that takes too many resources, let customers remove the try-catch guardrails via something like a manual group policy push. But people focus too much on non-scaled performance. If a driver takes 10% more of your computing power by being outside of the kernel, that should be a choice you can make as the customer, and in most cases you should make it. I remember when... Windows XP? crashed because of a bad graphics driver. I was pissed, since the performance hit caused by being outside the kernel wouldn't have affected what we were trying to use "NT" for. Kernels should have the option to be (or by default be) as slim as possible.
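As a rough user-space analogy for the "try-catch around the driver" idea, the Python sketch below runs untrusted code in a child process, so that a crash or hang costs some overhead but cannot take the host down. It is only an analogy: real driver isolation happens at the OS level, and `run_isolated` is a hypothetical helper, not an existing API.

```python
import subprocess
import sys


def run_isolated(plugin_source: str, timeout_s: float = 5.0) -> bool:
    """Run untrusted plugin code in a child process; report whether it exited cleanly."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", plugin_source],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # a hung plugin is treated like a crashed one
    return result.returncode == 0
```

The host keeps running whether the plugin succeeds, raises, or exits abnormally; it just sees `False` and can decide to disable the plugin. The price is process start-up and communication overhead, which is the trade-off the comment says customers should be allowed to choose.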
@Beat0X
@Beat0X 2 ай бұрын
This one feels off.... Opposing speed of development to quality. Hasn't that always been the case finding the balance between stability and throughput ? Isn't that the whole idea behind DevOps? Making sure that rigorous testing and quality isn't an impediment to rapid delivery. Making sure that it is part of your standard flow not a heavy process but sufficient guardrails that you can quickly release and just as fast rollback... I am still curious to see more post mortem info on what went on with crowdstrike but I am not sure testing could have caught this one. Canary releases maybe but anyway still waiting to know more.
@ulrichborchers5632
@ulrichborchers5632 2 ай бұрын
A rant about this is perfectly fine. We need to speak the truth if something clearly goes wrong. To remain silent, not wanting to be "negative" in such a scenario, would be wrong; it only strengthens the wrong approach and thinking. The responsibility of a software engineer includes detecting problems early, thus avoiding them in the first place. This is an essential part of implementing a good solution, by avoiding a bad one. CD is exactly about that. It is not rigid at all, don't fall into that trap. To bypass good engineering practices is never a good idea, especially at this scale. CD with all its techniques supports fast incremental progress into production. It can raise software quality dramatically, minimize defects, and it also prevents disaster, both with respect to releases and to the intrinsic quality of the product. A failed integration test with a widely used OS obviously would have prevented this. The choice here is not whether to allow a quick release into production or not (good CD practices even speed up the release because a high degree of automation is included). The choice whether to apply CD best practices or not is this: do you want to avoid disaster and be notified as early as possible that you have to fix the software before deploying it, or do you really want to deploy a bug into production when you could have detected and fixed it before? Applying CD is not about preventing a quick deployment. In this scenario it would have been about preventing the release of a problem. If people find they are unable to release quickly when they have to, then they DO have quality problems with their software or with the system architecture, or they do not understand CD and how to apply it correctly well enough. They then have to improve their engineering skills instead of blaming the necessary and professional techniques which they have not mastered.
Incompetent people in charge making decisions under pressure, or for whatever reasons, are the actual threat. It is not "AI", but yes, the threat is also about lack of transparency and lack of knowledge. "AI" is a marketing term, nothing more. If a piece of software can detect patterns and even learn to improve the detection of security problems, that is perfectly fine, whatever techniques or tools it uses. The internal mechanics of software are never transparent to the "end user". This is normal, whether "AI" is included or not. But yes, we do have a problem if engineering is done without exact knowledge of what is going on. The ongoing loss of culture, education and priorities is the actual problem, not the use and integration of technology itself. When I notice software "engineers" using the term "AI" internally and following the same belief system, that scares the hell out of me. Technology is not a threat, but incompetence is, and always has been. There is a clear answer to the question of how to solve the global problem: education, thinking, acting with a sense of responsibility and the right priorities as human beings. Now this may read strangely, but what it means is this: stop cheating. If you do something, do it well for the sake of what you are doing... money must never be the top priority; money is not even real. It is a mental construct to make exchanging things more efficient, by comparing the value of things. This requires balance to work. Subtract or add the same on both sides. This is simple math and obvious stuff. If you see money as something to be maximized, then this is abuse of the limited resources we have and of other human beings, of your very own species. Magic does not exist in this world; money cannot be "generated" from nothing. The law of conservation of energy is a fundamental law of physics and it of course applies to everything, including life itself. For every shortcut and for every selfish act, someone else will have to pay the price.
Is there a problem with a software which breaks encapsulation, to access the Kernel of an OS? Yes of course. Does it have to do with EU regulations, forcing MS to allow access to a protected layer? It seems. But the root cause of this may be unfair economic practices to exclude others from "competition". Or it is pretending that a successful company does so and thus invading their space, for the sake of making money. It does not matter which side is right there. That is fundamentally the wrong type of thinking. The actual root cause is a wrong definition of "success" which has invaded our culture in a very harmful way.
@semibiotic
@semibiotic 2 ай бұрын
That is a question of proper, profit-related system administration. Companies lay off their system administrators and turn all updates on automatically? Then they get automatic crashes like that. Big companies should have system administrators to manage infrastructure maintenance, growth and updates safely.
@Colaholiker
@Colaholiker 2 ай бұрын
Surprisingly, my employer was not affected. After all, our IT usually doesn't let any chance to screw up pass without making good use of it. But we are affected by supposed "AI security software". Some years ago, they changed from a common rules-based antivirus to something that supposedly uses AI. And it is so terrible. It likes to target programs that you just compiled (even a simple "Hello World" that you used to test whether a new compiler version is working) and parts of development software that we have used for ages, and now it has even started attacking another supposed security product that supposedly filters web traffic (we think it is just something they use to monitor what we do). Yes, it deletes that software as well. Of course, this is all just stuff that has an effect on individual workstations, not on a global scale, but it is so annoying...
@steves9250
@steves9250 2 ай бұрын
Shows how a product that works 99.99% of the time makes one mistake and it all goes to hell
@durand101
@durand101 2 ай бұрын
The reason the world is so fragile right now is because of a) tech monopolies and b) efficiency over pragmatism. Why does MS have such a large share of the corporate market and why aren't our various regulators challenging that? In nature, monoculture ecosystems are the quickest to be killed by disease. And why does everything have to be automated to the point where there are no humans in the loop? Mostly to be more "efficient" and reduce costs - at the risk of much more expensive black swan events eventually coming to ruin your day.
@askii3
@askii3 2 ай бұрын
I imagine Monte Carlo-like testing methods are going to become more common for testing critical software, as demonstrated by the TigerBeetle database team. The bug was from dereferencing a null pointer. The channel Low Level Learning had a video on this. I saw another video saying it would've been very difficult to catch this in testing (I don't recall why). Clearly automated testing needs to be improved to have higher odds of catching such errors.
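One simple form of Monte Carlo-like testing is random-input (fuzz) testing of a parser, sketched below in Python. This is plain seeded random testing, not the deterministic simulation approach the TigerBeetle team uses, and `parse_record` is a toy parser invented for the example. The property checked is that malformed input is always rejected with a controlled `ValueError` and never with any other exception.

```python
import random


def parse_record(data: bytes) -> bytes:
    """Toy parser: the first byte is a payload length, the rest is the payload."""
    if not data:
        raise ValueError("empty input")
    length = data[0]
    if length > len(data) - 1:
        raise ValueError("declared length exceeds available bytes")
    return data[1 : 1 + length]


def fuzz(parser, iterations: int = 10_000, seed: int = 0) -> int:
    """Feed random byte strings to parser; return how many it rejected cleanly.

    Any exception other than ValueError propagates and fails the run.
    """
    rng = random.Random(seed)  # seeded, so a failing run can be reproduced
    rejected = 0
    for _ in range(iterations):
        blob = bytes(rng.randrange(256) for _ in range(rng.randrange(0, 32)))
        try:
            parser(blob)
        except ValueError:
            rejected += 1
    return rejected
```

A parser that skips the empty-input check would surface an `IndexError` within a few thousand iterations, which is the class of "never validated this case" bug that hand-written tests tend to miss.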
@gamechannel1271
@gamechannel1271 2 ай бұрын
I'll just say there is no reason this software would have a need to "quickly deal with a security threat". The software itself is a security threat. It should be removed from all computers, and the company should be disbanded. See videos from people who have analyzed their driver, and how poorly it is validating its virus definition files. The download of a bad definition file caused this crash, NOT a driver update. Because they wanted to bypass the driver update process.
@kjetilhvalstrand1009
@kjetilhvalstrand1009 2 ай бұрын
Absolutely, an update can be a back door into the kernel. If they do not check what is pushed, it shows how bad this company is.
@kellyaquinastom
@kellyaquinastom 2 ай бұрын
This is called “Experience”
@miyu545
@miyu545 2 ай бұрын
That's what you have when you have no patch management or change management process. Microsoft does not have one. It has the public to do that for them.
@JeanPierreWhite
@JeanPierreWhite 2 ай бұрын
Not true. Microsoft does have change management as do their customers. The problem is that Windows is too fragile. If a problem sneaks by then you have few recovery options.
@johnm838
@johnm838 2 ай бұрын
It's often the case that management decides, even before software has been properly tested, when it will be released. Tests might find flaws that require changes to the code, followed by new tests that potentially find flaws in the revised code and require even more software changes. Management has to wait until software passes all tests before deciding when to upgrade.
@TheWolverine1984
@TheWolverine1984 2 ай бұрын
I thought this all was a long-winded setup for "I wrote a free book about how to be a senior software engineer" It's like "How do we solve all those problems? Well, I don't know, but I wrote a free book about how to be a better software engineer" 😆
@ArjanCodes
@ArjanCodes 2 ай бұрын
Haha, now I only need to write a book 😁.
@TheWolverine1984
@TheWolverine1984 2 ай бұрын
@@ArjanCodes That would be great actually.
@joansparky4439
@joansparky4439 2 ай бұрын
@@ArjanCodes You did not ask the important question: why was CrowdStrike relied on by so many? What about ALTERNATIVES, COMPETITION, REDUNDANCY? How did a competitive economic system create a "monopoly" (which is intrinsically subject to this kind of failure)? _The fundamental problem here is sociological in nature and (if one digs deeper) actually caused by how life itself functions, but that is far outside of programming._
@Ramdileo_sys
@Ramdileo_sys 2 ай бұрын
@@ArjanCodes Why didn't my Win10 computers here crash or anything, and why was everybody working normally? Because I don't let somebody I don't know update my computers just because some as^&%%shole said I have to. I update my computers if I need to, and only after I have tried that update on a non-essential machine for at least some days or a week. Today my Windows is running with the same files it was running yesterday, and last week, and last month, and with the same software that I installed on it last year. It boils my piss that these imbeciles are constantly beta-testing their crap on the computers I use for work, as if this were a 1960 terminal instead of a PERSONAL computer. I don't understand how the hell you people tolerate this nonsense over there. Were medical centers hit by this problem over there? And probably nuclear plants were affected as well. Yes, because those things are connected to the internet sewer, right? Because the stupidity is overwhelming in this world. So yeah, probably.
@gruntaxeman3740 2 months ago
One root cause of the issue is bullshit security: starting with a lot of complexity and then adding more in the form of some security application. In reality, when someone wants to build a reliable system, the amount of complexity is minimized. That is why, in critical places, all unnecessary "moving parts" are removed and the system is locked down tightly. It can be even better if the code is formally verified to avoid bugs. Humanity has the knowledge to do this correctly. I even have that knowledge on my own. Instead we see bloated software stacks and dumb IT departments who think that endpoint security software should be installed on a critical, dedicated system. Or a dumb insurance company that requires it. Another issue is that today 95% of software developers don't even know how a computer works. There is a lack of deep knowledge, and software developers are actually the people who understand technology better than some lawyer at an insurance company.
@salec7592 2 months ago
I believe that concentrating solely on code quality and metrics does not guarantee security. We must always look at the wider context, recognize risks, extract safety invariants, and turn them into critical requirements that are never optional and not gated by any other condition. We need a model for that, perhaps some sort of description language. Another thought: in the current security model, a computer system can potentially do anything, but its mission is to do a very specific set of things, and it is only loosely kept busy doing that mission. Then we have an add-on software process on that computer whose purpose is to stop that particular system from doing anything at all if certain criteria are met, criteria that imply the computer has been taken over by some malevolent outsider. Perhaps dedicated computers should, for starters, be prevented from doing anything except their intended task? Keep the "intelligence", reduce the scope, and reduce the number of things that could go wrong.
@MelloBlend 2 months ago
I want to know what the actual failure was. I saw someone post a clip of the offending jump routine that was trying to move data to or from register R8. Now, these are general-purpose registers, but what was the offending data, address, or executable issue? Did that file we all deleted serve some other purpose that no one is mentioning because they don't know? Was it something nefarious?
@Pengochan 2 months ago
Is it really "fundamental" when what was described is a combination of causes, i.e. the complexity of modern software, AI black boxes, lack of testing due to emergencies, etc.? For example, one argument was that there may not always be time for testing and canary releases, but was that really the case here, or was it just some automation chosen simply because it was "cheaper"? I think the risks of such faulty updates could be substantially mitigated by not applying panic mode to every update on every system. Even when some security breach might need immediate fixing on some systems, other systems might just need to be fixed eventually, because they don't use the affected component. So was that CrowdStrike update really necessary immediately on all those systems? Maybe what we really want is proper risk assessment of when, and on which systems, the cure might pose a greater risk than the disease. Combined with strategies to reduce risk exposure, especially on systems used for very specific tasks, where a lot of services/daemons aren't needed and shouldn't even be installed, the risk of another CrowdStrike could be greatly reduced. Maybe that risk will never be zero, but that doesn't mean we can't do anything about it.
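The per-system risk assessment suggested in this comment could be sketched roughly as follows. This is only an illustration under stated assumptions: the `System` and `Update` fields, the `Rollout` levels, and the policy itself are hypothetical and do not describe how CrowdStrike or any real vendor decides rollouts.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Rollout(Enum):
    IMMEDIATE = "immediate"  # push right away
    STAGED = "staged"        # canary first, then widen
    DEFERRED = "deferred"    # fix eventually, no urgency


@dataclass(frozen=True)
class System:
    name: str
    components: frozenset[str]  # software actually installed on this system
    critical: bool              # e.g. a dedicated system in a hospital


@dataclass(frozen=True)
class Update:
    affected_component: str
    actively_exploited: bool


def plan_rollout(system: System, update: Update) -> Rollout:
    """Hypothetical policy: weigh the risk of the update against the threat."""
    # The system does not use the affected component: no need to panic.
    if update.affected_component not in system.components:
        return Rollout.DEFERRED
    # A live threat on a non-critical system: the cure is the smaller risk.
    if update.actively_exploited and not system.critical:
        return Rollout.IMMEDIATE
    # Everything else, including critical systems, goes through a canary.
    return Rollout.STAGED
```

Under this assumed policy, a locked-down kiosk that does not even have the affected component installed would never receive an emergency push.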
@calkelpdiver 2 months ago
This is why you need to properly test your installation/deployment process and tools. I'm not blaming the test group for this one, as I'm sure they were pressured and overruled on releasing this patch (by both Microsoft and CrowdStrike). Been there and done that a few times in my career. Testing of the deployment process and the installation/configuration tool/app is always overlooked. Always has been. I've tested installation software for commercial products and found it at times to be pretty crappy in how it checks for version differences between files (don't overwrite a similar file, and warn the user if you have to, but again a lot of this is run in silent mode), checks the correct location of files, and validates the changes made to config/INI files or the registry. A lot of installers are "dirty" in their process and basically just jam things on. But this mentality is endemic to software development, and always has been. Companies have to remember that the installer is the first piece of software an end user encounters. If it isn't pretty much bulletproof, you're going to have a pissed-off customer and your reputation will be heavily impacted by it.
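The version check described in this comment (do not overwrite a file unless the incoming one is genuinely newer) might look something like the sketch below. It assumes plain dotted numeric versions such as `1.10.0`; `parse_version` and `should_overwrite` are hypothetical helper names, not part of any real installer framework.

```python
from __future__ import annotations


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '1.10.0' into (1, 10, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def should_overwrite(installed: str | None, incoming: str) -> bool:
    """Overwrite only when nothing is installed or the incoming file is newer."""
    if installed is None:
        return True
    return parse_version(incoming) > parse_version(installed)
```

Comparing tuples of integers rather than raw strings matters here: as strings, "1.9.9" sorts after "1.10.0", which is exactly the kind of mistake a "dirty" installer makes. A silent-mode installer could log a warning whenever `should_overwrite` returns `False` instead of jamming the file on anyway.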
@danjolly9505 2 months ago
You just described outsourcing thinking. My single biggest problem with these systems is
@johnmcway6120 2 months ago
It's just going to happen sometimes. There are construction accidents, there are medical accidents, there are accidents in every field, big or small, regardless of complexity. No manager is going to see this happen and say, "Hey guys, we just spoke with the board and we decided that we can invest twice as much money to ship this feature, and we moved the expected delivery date by 2 months to ensure there's no crunch and the devs are well rested." That's not how business works, not in my experience. A good answer is to always have backup systems in place. When driving a car, one should always keep a fire extinguisher and a spare tire. There are steps that all of us can take, developers, managers, and users, but we won't. That's why we should just get used to the idea that this is simply going to keep happening.
@JeanPierreWhite 2 months ago
There are more resilient OSes than Windows. Companies need to move away from Windows on the desktop for critical systems.
@osark2487 2 months ago
At this point, autopsies have popped up on YouTube channels everywhere. We most definitely know what happened, how, and why.
@kodtech 2 months ago
I remember the old days when you could boot your system and see which files were loading and which driver it got stuck on... and on the next reboot it would bypass that driver.
@douglasengle2704 2 months ago
It has always been risky to use an operating system that became popular because it could run on the preferred video game PC platform, with the same security concerns. There is a huge benefit in going with the crowd by using MS Windows, but for critical systems, a switch to a Linux operating system on the same hardware platform will likely take place at companies like Tesla.
@slm6873 2 months ago
It "has" to run in kernel mode... except on macOS 11+, which uses system extensions so that it works fine from userspace.
@center-q4k 1 month ago
great analysis - mentioned about the dichotomy...
@JordanEdmundsEECS 2 months ago
Perhaps even more fundamental, software quality is always going to come with a price tag, and gotta please them shareholders.
@clarenceawalker1873 2 months ago
That makes you think about a self-driving car with a bug in it. 🦅
@davew-marketer8264 2 months ago
Just good questions! I hope a lot of people will open their eyes more and more. @ArjanCodes I just discovered your channel two weeks ago, and I already love your good and clear videos. Thanks!
@dalenmainerman 2 months ago
Great video as always! Thanks, Arjan! Completely off topic: it would be very interesting to see your take on the game "The Farmer Was Replaced", where you have to write code to automate a farming drone. Thanks!
@Calphool222 2 months ago
In this particular case, there's something more basic that would have caught the problem, and it requires no slowdown of the development process when new malware shows up. When they deploy code, they need their software to phone home once the OS boots up, that's it. If, after deploying a new channel file and rebooting their test servers, they waited for the "phone home" before moving the code down the line for further testing and eventual deployment, they would have caught this problem. The user-land code would never have phoned home because *the OS was stuck* in boot. This is really just basic smoke testing, and it shows how immature their deployment pipeline must be.
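The "phone home" gate described in this comment could be sketched roughly as follows. It is a minimal illustration, not CrowdStrike's actual pipeline: `wait_for_phone_home`, `has_phoned_home`, and `may_promote` are hypothetical names, and how a host reports back (an HTTP ping, a message queue, and so on) is deliberately left to the caller.

```python
from __future__ import annotations

import time
from typing import Callable, Iterable


def wait_for_phone_home(
    hosts: Iterable[str],
    has_phoned_home: Callable[[str], bool],  # hypothetical check supplied by the caller
    timeout_s: float = 300.0,
    poll_interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> set[str]:
    """Return the test hosts that never reported back before the timeout."""
    pending = set(hosts)
    deadline = clock() + timeout_s
    while pending:
        # Drop every host that has checked in since the update and reboot.
        pending = {host for host in pending if not has_phoned_home(host)}
        if not pending or clock() >= deadline:
            break
        sleep(poll_interval_s)
    return pending


def may_promote(silent_hosts: set[str]) -> bool:
    """Promote the update only if every test host came back after rebooting."""
    return not silent_hosts
```

A host stuck in a boot loop never runs the user-land code that reports back, so it stays in the returned set and the update is not promoted. The `sleep` and `clock` parameters exist only so the gate can be tested without real waiting.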
@philipoakley5498 2 months ago
I agree. There is a lack of appreciation in the general 'software' industry of the shaky foundations of logic & perfection that coding is built upon, and how it has permeated the foundations of many other parts of society. Those who are reaching for blame should have a read of the years of study of human error, safety studies, and their ilk to see how these major failures continue to happen. It's just another day in our interconnected world. When the underlying unexpected factor(s) is/are finally identified, they'll likely be tedious and boring from some disused cupboard that had been forgotten about (xkcd/2347).
@jjmalm 2 months ago
If companies really do want better assurance they should be demanding third party audits of test and release management. But I doubt anyone would pay for it, so honestly this is the perpetual level of service companies will need to expect, and have backups if they don't like the fallout. Here in Canada hospitals reverted to paper record keeping during the outage. This is probably the right fallback for most companies if they don't want to pay extra.