It's a really good explanation, particularly the first video on driver certification, kernel mode, etc.
Thanks for that. Good info.
In the first video the explanation of the fix seems impeccable, and the feedback on YouTube is all very positive.
Here is the link for anyone who wants to read the comments:
www.youtube.com/watch?v=wAzEJxOo1ts
Among others, I found the comment by user "zug-zug" particularly interesting; it also explains why so many machines crashed so quickly worldwide.
"
. . .
This broken update IGNORED our staging policies and went to ALL machine at the same time. CS informed us after our business was brought down that this is by design and some updates bypass policies.
So in the end, CS caused untold millions of dollars in damages not just because they pushed a bad update, but because they pushed an update that ignored their customers' staging policies which would have prevented this type of widespread damage. Unbelievable."
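For anyone wondering what a "staging policy" buys you, here is a minimal sketch of a percentage-based staged rollout. It is not how CrowdStrike or any real product does it; the function names and the hashing scheme are my own assumptions, purely to illustrate why an update that bypasses staging reaches every machine at once.

```python
import hashlib

def rollout_bucket(machine_id: str, buckets: int = 100) -> int:
    """Map a machine ID to a stable bucket in [0, buckets)."""
    digest = hashlib.sha256(machine_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets

def should_receive(machine_id: str, rollout_percent: int) -> bool:
    """True if this machine falls inside the current rollout stage."""
    return rollout_bucket(machine_id) < rollout_percent

def machines_for_stage(machine_ids, rollout_percent):
    """Return the machines that would get the update at this stage."""
    return [m for m in machine_ids if should_receive(m, rollout_percent)]

if __name__ == "__main__":
    fleet = [f"host-{i}" for i in range(1000)]  # hypothetical fleet
    for percent in (1, 10, 50, 100):
        print(percent, len(machines_for_stage(fleet, percent)))
```

Because the bucket is derived from the machine ID, a machine that received the update at 10% is still included at 50%, so a bad update caught at an early stage only ever touches that early slice.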
The fix, restarting Windows in Safe Mode and deleting the Falcon-related ".sys" driver files, is almost touching, and it is typical of Windows, which at the end of the day is an extremely fun system, imo, at least for the "advanced" home user who knows the (few) precautions to take. It lets you do a lot of things in a practical and relatively simple way.
It's always been a love-hate relationship, but in my opinion, while you may not like it, I don't think you can really hate it.
However, I wonder how many days the systems were without protection and how system administrators managed to survive without it...
His video is good, but it misses the point that Windows should not allow the sort of process CS used to bypass WHQL certification, nor should it allow third-party boot-start drivers without some additional checks or QA.
However, primarily this is a CS screw-up: a total absence of QA, for which the CEO of CS is a repeat offender.
They should be using eBPF anyway, which I think they now do on Linux (after the last Linux screw-up).
Also - there is no real evidence that CS stuff works any better than other solutions, beyond fear mongering, or so I'm told.
Lastly - this didn't affect home users, mainly just corporates. Perhaps they will now listen to the IT guys who always plead for an IP-based KVM that can be plugged in so that this sort of safe-boot work can be done remotely. Usually that's denied as it costs a bit....
FWIW, El Reg has a new article explaining what the problems were. Apparently it wasn't uploaded code.
It was a configuration file with bad data that caused the CrowdStrike kernel driver to access memory that didn't exist.
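To make the "bad data leads to a bad memory access" idea concrete, here is a rough sketch of validating a binary config file before trusting any offset in it. The file layout (a count, then offset/length pairs, then payload) is entirely made up for illustration and has nothing to do with the real CrowdStrike channel file format; `parse_config` and `ConfigError` are hypothetical names.

```python
import struct
from typing import List

class ConfigError(ValueError):
    """Raised when the config file is malformed."""

def parse_config(blob: bytes) -> List[bytes]:
    """Parse a hypothetical format:
    uint32 count, then count * (uint32 offset, uint32 length), then payload.
    Every offset is checked against the real file size before it is used.
    """
    if len(blob) < 4:
        raise ConfigError("truncated header")
    (count,) = struct.unpack_from("<I", blob, 0)
    table_end = 4 + count * 8
    if table_end > len(blob):
        raise ConfigError("entry table runs past end of file")
    records = []
    for i in range(count):
        offset, length = struct.unpack_from("<II", blob, 4 + i * 8)
        # Reject entries that point into the header or past the end.
        if offset < table_end or offset + length > len(blob):
            raise ConfigError(f"entry {i} points outside the file")
        records.append(blob[offset:offset + length])
    return records
```

In Python a bad offset would just give you a short slice or an exception; in a kernel driver written in C or C++ the same unchecked offset becomes a read of unmapped memory, which is the kind of failure described above. The point of the sketch is only that the check happens before the data is used.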
Oh... the good old "bus error" interrupt.
I would have thought that Microsoft would have that interrupt trapped... actually, I guess it was because the core didn't crash.... but likely they (MS) wouldn't know what to do with it since it was bad code from a user...
So this could possibly be down to a bad pointer? If so, it's human error at its finest. I can't help but wonder if that conclusion is an oversimplification.
How did the C chicken cross the road? He didn't, he pointed himself across it.
How did the C++ chicken cross the road? He instantiated himself across it.
Pointers are neither bad nor good, it's what they point at... but of course, real programmers use handles as pointers to a pointer.... that way you can have some security checking that the underlying pointer list is safe - do that with a background task. But that requires real thought and good design.
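A minimal sketch of the handle idea, for anyone who hasn't met it: callers hold an (index, generation) pair instead of a raw reference, and every access goes through a table that can reject stale or bogus handles. Note this version validates at lookup time rather than with a background task as suggested above, and `HandleTable` and its method names are my own invention, not any particular OS API.

```python
_EMPTY = object()  # sentinel marking a free slot

class HandleTable:
    """Indirection table: handles are (index, generation) pairs."""

    def __init__(self):
        self._slots = []  # each slot is [generation, object or _EMPTY]
        self._free = []   # indices of slots available for reuse

    def allocate(self, obj):
        """Store obj and return a handle to it."""
        if self._free:
            index = self._free.pop()
            self._slots[index][1] = obj
        else:
            index = len(self._slots)
            self._slots.append([0, obj])
        return (index, self._slots[index][0])

    def resolve(self, handle):
        """Return the object behind a handle, or raise KeyError if invalid."""
        index, generation = handle
        if not 0 <= index < len(self._slots):
            raise KeyError("handle index out of range")
        slot_generation, obj = self._slots[index]
        if slot_generation != generation or obj is _EMPTY:
            raise KeyError("stale or freed handle")
        return obj

    def release(self, handle):
        """Free the slot; old copies of the handle stop resolving."""
        self.resolve(handle)  # validates the handle first
        index, _ = handle
        self._slots[index][0] += 1
        self._slots[index][1] = _EMPTY
        self._free.append(index)
```

The generation counter is what catches the "dangling pointer" case: once a slot is released and reused, the old handle no longer matches and is refused instead of silently pointing at someone else's data.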
In my opinion this second article is even "worse" than the first one.
Every authoritative, technically minded person who was asked offers a series of conjectures and then concludes that the real causes are unknown, even though plenty of more or less technical hypotheses are made.
On the other hand, how can we expect the truth from a company that within a few hours effectively blocked millions of Windows PCs all over the world, and which could therefore be sued for damages in many countries, probably amounting to billions of dollars?
And in fact CrowdStrike and Microsoft have been passing the ball back and forth for a while now.
Maybe politics (in general, no particular side) will soon step in to "solve" everything... 😛
And there is no guarantee that it won't happen again, on the contrary.
Ideally Microsoft should write an interrupt handler that traces back the process that caused it and then shuts it down, putting out a message (and a log) of what it found.
In this case, it would have shut down the CS crap and returned to the main screen (no BSOD), and the sysadmins would have had a chance to look it up, not losing access to the machine at all.
Mind you, MS is not at fault.. I mean, who would have thought that someone would be so sloppy? Can you imagine if Wind River were to release a buggy compiler?
I'll remember all that information for the next time I use MS Windows, which will be never.
(Except for tuning my truck engine)
The fact that nothing could be done remotely was an issue within the issue, since of course no employee is allowed (nor would have the skills) to access the physical machine and delete a driver in Safe Mode.
And then there is the boot-start key in the Registry, which kept the machine from reaching the Desktop on every subsequent (possibly forced) reboot.
It seems that some machines managed to boot normally after about 15 restarts, as on each one Falcon tried to fetch a previous (presumably intact) configuration file until it finally reached the Desktop again.
Not really wholly unexpected from George Kurtz... He did it again...
Friday’s CrowdStrike outage is the second major tech meltdown that founder and CEO George Kurtz has been involved in. He was also the Chief Technology Officer of McAfee in 2010, when a security update from the antivirus firm crashed tens of thousands of computers.
https://www.hindustantimes.com/tren...010-global-tech-disaster-101721471586633.html
https://www.businessinsider.com/crowdstrike-ceo-george-kurtz-tech-outage-microsoft-mcafee-2024-7
The whole mess gets a little more amusing for users of that product: a $10 gift card for their troubles. Just do not expect to redeem it.
https://techcrunch.com/2024/07/24/crowdstrike-offers-a-10-apology-gift-card-to-say-sorry-for-outage
BTW, this is not new... the callousness and arrogance with which some of these tech companies conduct their business is incredible.
During a one-month stay in Western Europe, our Verizon International Roaming Plan had a two-day outage. It only affected "American lines"..
We were saved because I had purchased a cheap Android phone and put a local SIM card in it... so all I had to do was hotspot it and we had data and email.... but no texting. I was able to reach the Internet with our American lines.
If I hadn't had a Plan B backup, we would not have been able to access our flat or the car garage, since the Europeans, in their (lack of) wisdom, are all into using the Internet.... no keys. You need Internet access to reach the servers that open your doors.
Verizon offered us a two day rebate... lucky for them we were not locked out.
This type of thread is frustrating because it lumps all use cases into one group. The fact that a home user had no problems with their Mac does not enter into it. The fact that some companies are still running bare metal is troublesome, and it has been a bad idea for a while now. I tend not to talk IT or sysadmin in these forums because many users are not full-stack knowledgeable and believe that their system would never have a problem because they are versed in one part of the issue. I also can't stand arguing over phrasing.
And then there is the double talk. For example: my PC is easy to upgrade, so therefore the design department is wrong for using a Mac. Or, my favorite: Windows is junk and does not work, but then Windows is bashed because a bad actor shut down a million PCs or some such large number. No one says "hey I can read that sign in the terminal so that windows machine is doing its job."
As far as use cases go, the comparisons are absurd. One of my peers had to "reset" approx. 900 PCs / servers. I am not sure exactly what his solution was, but my solutions are: 1. Log into your hypervisor and reset the computer to the snapshot that was taken before the update. The data should not exist on the same service as the OS. 2. Log into the hypervisor and create a quick script to mount the offending OS file systems and delete the corrupted file/s. Have your system set up with fail-over redundancy outside the control of third parties.
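The "quick script" in option 2 could look roughly like the sketch below. It assumes the guest's file system is already mounted somewhere on the hypervisor host (how you do that depends entirely on your hypervisor and is not shown). The directory and file pattern are placeholders to be filled in from the vendor's remediation guidance, and `remove_matching` is just a name I made up.

```python
from pathlib import Path

def remove_matching(mounted_root, relative_dir, pattern, dry_run=True):
    """Find (and optionally delete) files matching pattern under a mounted guest FS.

    mounted_root: where the guest OS volume is mounted on the host
    relative_dir: directory inside the guest, relative to its root (placeholder)
    pattern:      glob for the offending file(s) (placeholder)
    dry_run:      when True, only report what would be deleted
    """
    target = Path(mounted_root) / relative_dir
    if not target.is_dir():
        return []
    matches = sorted(p for p in target.glob(pattern) if p.is_file())
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

Something like `remove_matching("/mnt/guest01", "path/to/driver/dir", "offending-*.sys")` would list candidates first; run it again with `dry_run=False` once the list looks right. Defaulting to a dry run is deliberate, since a wrong glob across 900 machines would be its own outage.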
My point is that the companies affected by this problem for the most part operate on a different level or style to smaller entities.
Now as far as keeping the problem from happening, I don't know because I am not a kernel coder.
Here's a "consolation prize" ...
https://techcrunch.com/2024/07/24/crowdstrike-offers-a-10-apology-gift-card-to-say-sorry-for-outage/
No one says "hey I can read that sign in the terminal so that windows machine is doing its job."
There is no OS needed to put something on the screen, even a legacy BIOS can do that.
But the sign also has to display ads in 4K, too. If it didn't, an 8-bit processor with the entire program in non-volatile memory would do. Someone would also need to know something about programming, though.
@wg_ski beat me to the punch
That is certainly true, but you at least need some compute to manage the timely updates to the information being displayed. This is where the well-intentioned actors step in and create a product to protect the system from malicious bad actors. At some point, in some cases, the potentially well-intentioned actors become unintentional bad actors by way of incompetence, or plain bad actors due to greed.
But again, the use case for this conversation is not what one can do with a legacy BIOS; it is the case of an OS getting hosed and unable to continue to do its job. A job that it presumably has done well during its deployment.
But taking your statement a step further, into a secure and not power-wasting setup, one could, and I think should, use a "wyse 60" booted from a virtual server / service. Then there would be no need for local system management beyond the physical realm (you know, the stuff no one does, and then they don't understand why their machine's fan is on high and it is running very slow).
And yes, I am using wyse 60 generically to mean a low-compute-power, smartish terminal.
"wyse 60" booted from a virtual server / service.
For the last 50 years we've been working hard to implement pervasive, distributed computing and you want to go back to the 60s? Just because IT is too incompetent and insecure and their knee-jerk reaction is to take control of everything...
No thanks.
I don't even want to install "apps" on my devices.
It's HTML or the highway for me.
But then over in the R&D World we don't let IT within 100 miles of our stuff. They're likely to break it.