Hey, I'm Dave, welcome to my shop. I'm Dave Plummer, a retired Microsoft software engineer starting on Windows back in the early 1990s, and today I'm going to update you on all the latest Falcon news, as well as some wanton speculation and even conspiracy theories on the CrowdStrike Falcon IT outage. If you watched my last video, then you already know the specific technical details of what precisely went wrong, so I'll only briefly update them here with some new info. Once we've done that, I'll update you on the latest conspiracy theories as well as consider what broader lessons can be learned from the whole debacle.

The recent CrowdStrike IT outage was caused by a faulty sensor configuration update in their Falcon cybersecurity platform. Here are the key technical details. The update involved a configuration file known as Channel File 291, designed to target newly observed malicious named pipes used in common command-and-control frameworks. The update appears to have been malformed. It then triggered a logic error in the CrowdStrike kernel driver that resulted in system crashes and the infamous blue screen of death on impacted Windows systems. Approximately 8.5 million devices worldwide were impacted, causing significant disruptions across various industries, including banks, airlines, and businesses. Even 911 service was disrupted in some areas. CrowdStrike quickly identified the issue and deployed a "fix" within a few hours. They issued detailed technical guidance for affected customers, including mitigation steps and tools to identify impacted hosts.

Now, I used air quotes around the word "fix" because in this case the fix only fixes the update and prevents more machines from being brought down. For the 8 million or so machines that already took the update, it does nothing at all to fix them. That's going to be up to the system administrators, office managers, and nerdy uncles around the world to fix, because each and every machine will require that a human manually boot the machine into safe mode. From
there, you have to find the corrupted Channel File 291 update in the CrowdStrike folder, delete it, and reboot. And so that's where we're at: a whole lot of techs standing around with their disk in their hand, waiting to safe-boot 8 million blue-screened Windows machines.

It doesn't look very good for Microsoft, which is ironic, because it's primarily a CrowdStrike issue and not something specific to Windows itself. If you don't believe me, consider that on April 19th this year, CrowdStrike issued a flawed update that impacted customers running Debian Linux. The update caused those systems to crash and prevented them from rebooting normally. The issue was acknowledged by CrowdStrike the next day, but it took weeks to determine the exact cause and implement a fix. Another similar issue occurred a month later, on May 13th, this time affecting Rocky Linux. These servers experienced freezes after upgrading to Rocky Linux 9.4. This problem was linked to a Linux sensor operating in user mode combined with specific 6.x kernel versions.

Curiously absent from the list, though, is the Mac. Like a lot of folks, you might just assume that's because it's yet one more piece of software that doesn't even run on the Mac, but you'd be wrong. CrowdStrike does provide security solutions for macOS through its Falcon platform. The Falcon sensor for macOS does not install kernel extensions, especially with the release of macOS Big Sur and later versions, where Apple deprecated the use of kexts entirely. Instead, CrowdStrike has rearchitected its sensor around a framework provided by Apple known as System Extensions.

And while I generally hold Microsoft blameless in how CrowdStrike's mistakes manifested on their platform this time around, it all comes down to the fact that a kernel driver is involved at all. As I explained in the last video, a kernel driver has very intimate access to the system's innermost workings. That comes at a cost, however: if anything goes wrong with the kernel driver, the system must
blue screen to prevent further damage to the user's settings, files, security, and so on. CrowdStrike engages in the risky business of delivering kernel code to the critical path of millions of machines not because they are careless YOLO cowboys, or even in spite of that. They do it because it's the only way on Windows to get the low-level system access to do the security voodoo that they do.

You see, code gets to walk on the wild side in the kernel usually for one of two reasons: either for performance reasons, or because it needs access to information or other kernel goings-on that it simply cannot get at from user mode. Back in the day when my beard was still dark red, as in the days of Windows NT 3.1, not even the video driver ran in kernel mode. It essentially ran entirely in user mode, and when it needed to access the hardware, that would be done by a proxy thread in the kernel on behalf of the video driver, and the parameters and results would be validated and marshaled back and forth between those threads. The problem is that with a Gen 4 x16 GPU connection, that's a metric crap-ton of data to marshal, and it'd be a lot faster if the driver just had direct access to the hardware. And so, for performance reasons, video drivers got moved into kernel space. But the key point is that it was not a necessity. It was a performance decision made at the cost of potentially reduced reliability.

Over time, the decision has been made the other way, in favor of stability, too. The original printer subsystem for Windows used a kernel-mode driver model for printers, and while I would never dare to question the wisdom of printer designers, I'm not sure I want some intern at Brother writing my kernel code. And so, with a little wailing and gnashing of teeth, the printer driver model was moved to user mode to make Windows far more robust.

When it comes to something like CrowdStrike, the Falcon sensor is in kernel mode presumably because it needs to do things that can't be done from user mode, and to me, that's where
Microsoft could be responsible, because on the Windows platform, to the best of my knowledge, some of the CrowdStrike security functionality requires deep integration with the operating system that can currently only be achieved on the kernel side. That's not to say that Microsoft hasn't tried. There's WDAC, or the Windows Defender Application Control API. There's also Windows Defender Device Guard. Together they provide mechanisms for controlling application execution and ensuring that only trusted code runs on a system. They also offer various APIs for antivirus and endpoint protection solutions to interact with the operating system. And I don't know to what extent CrowdStrike does active network filtering, but the Windows Filtering Platform, or WFP, allows applications to interact with the network stack without requiring kernel-level code.

The irony of all this is that at one point Microsoft actually tried to do the right thing. Behind the scenes, sources indicate that Microsoft had been working on a solution that could have potentially prevented such disasters. The tech giant had developed an advanced API designed specifically for security applications like CrowdStrike's. This API promised deeper integration with the Windows operating system, offering enhanced stability, performance, and security. It was a proactive measure aimed at mitigating the risks associated with low-level system interactions, which are often fraught with complexities and potential vulnerabilities.

However, as Microsoft prepared to roll out this game-changing API, they encountered an unexpected obstacle. Regulatory bodies tasked with ensuring fair competition in the tech industry scrutinized the new API. The regulators in the European Union argued that providing such a powerful tool exclusively to certain applications could give Microsoft an unfair advantage, potentially stifling competition from smaller security firms that wouldn't have the same access. Now, despite Microsoft's assurances that the API
would enhance security for all users, the regulators stood firm. They feared that integrating this API could create a dependency on Microsoft's ecosystem, effectively locking out competitors who couldn't leverage the same level of access to the Windows core. Consequently, the API was deemed anti-competitive and its implementation was prohibited. So allocating blame to Microsoft for inaction on an API is actually pretty unfair.

Microsoft is also in a very different position than Apple. Apple is somehow afforded the luxury of being able to do things like break an entire driver model in a new update that requires everything to be rewritten. Conversely, backwards compatibility is so deeply ingrained among Microsoft developers that it simply may not be an option. On my Mac, I've got a Universal Audio Apollo Twin X Thunderbolt sound device, and it requires that you disable all of Apple's driver signing and kernel extension security, and for weeks the machine would pink-screen and reboot until they eventually got their driver more sorted.

Microsoft needs to support and export whatever functionality is needed as an official API so that security providers can build their products without putting the entire operating system at risk. Not because it's the right thing to do, but because the harsh reality is that they've got tens of millions of machines serving in mission-critical roles, like 911 service, that do run kernel-mode code. Those organizations deserve a system that doesn't need to run third-party kernel code to safely do its job, and only Microsoft can fix that, but only if the regulators would let them.

Now, I'm certainly not going to throw Satya under the bus for not throwing CrowdStrike and the EU under it first, but I question the communication and messaging that's coming from the top. The decision to not publicly note that this isn't a failure in Windows itself has led to the widespread misconception amongst my friends and relatives that it was a Windows update that went horribly wrong. I think it'd
be instructive to take a quick look at another PR nightmare that also wasn't the company's fault: the Tylenol crisis back in the 1980s. Now, that might sound like a long time ago, but keep in mind I'm almost 56 now. Damn, I'm sorry.

Anyway, Johnson & Johnson faced a crisis that would become a defining moment in corporate crisis management. In September 1982, seven people in the Chicago area died after ingesting Tylenol capsules that had been laced with cyanide. This event triggered widespread panic and could have easily destroyed the trust and credibility of the Tylenol brand entirely, to say the least. James Burke, the CEO of Johnson & Johnson at the time, spearheaded a response that would set a new standard for corporate crisis management. His approach was characterized by transparency, decisiveness, and a focus on consumer safety.

As soon as the tampering was discovered, Burke ordered a nationwide recall of Tylenol products, totaling around 31 million bottles and costing the company over $100 million. This decisive action underscored Johnson & Johnson's commitment to consumer safety over their short-term financial losses. Burke made it a priority to maintain open lines of communication with the public, the media, and regulatory agencies. He ensured that the company was forthright about the risks and the steps being taken to address the situation. This transparency helped to build trust with the public during a time of fear and uncertainty.

In the aftermath of the crisis, Johnson & Johnson introduced tamper-evident packaging, which became an industry standard. This move not only addressed immediate safety concerns but also restored consumer confidence in the product. The company also launched a major public relations campaign to educate the public about the new safety measures and reassure them about the product's safety. Burke's leadership during the Tylenol crisis was widely praised for its ethical focus. He adhered to the company's Credo, which emphasizes the importance of the company's
responsibility to its consumers, employees, and community. This ethical foundation guided all of Johnson & Johnson's decisions during the crisis. The swift and responsible actions taken by Burke and his team not only helped Tylenol to recover from the crisis but also strengthened the brand's reputation. Tylenol regained its market share within a year, and the company's handling of the crisis became a case study in business schools around the world. James Burke's masterful handling of the Tylenol crisis showcased the power of ethical leadership and set a new benchmark for crisis management. By putting consumer safety first and maintaining transparent communication, Burke was able to navigate one of the most challenging crises in the company's history and emerge stronger.

Of course, the Tylenol crisis and the CrowdStrike outage are very different events, but I think both Microsoft and CrowdStrike would be wise to learn from James Burke's example. And maybe it's time for a tamper-proof kernel. All this would require that the EU reweigh the greater public good in terms of critical infrastructure over competition in the security API business.

And speaking of trust, what about code signing? What went wrong here, that a fully signed driver was able to bork 10 million Windows machines? Remember that Microsoft fully tested and vetted and approved and signed the CrowdStrike driver in the WHQL lab, and the driver didn't change. Just the channel update file did. The channel files are used as input to the driver, and we subsequently learned that the Channel File 291 update was made up entirely of zeros. When the driver ingested that update file, it choked, and because it was in kernel mode, its only choice was to then turn blue and die.

That also means that all of the trusted platform modules and secure boots in the world wouldn't have saved you. The driver was already fully trusted, so even if you were running locked down to signed bits only, the driver never changed. Data files like channel updates aren't
signed, as far as I know, so a digital signature wouldn't have helped. And even if they were signed, an all-zero signed channel file would still likely have crashed the signed driver. So in this case, trusted computing was of little help.

Since there have been very few specific technical details made public so far, it's time to get a little further into the weeds with some speculation before moving on to some outright conspiracy theories. My speculation begins with my assessment of what went wrong inside the CrowdStrike driver. In the last episode, we saw how the driver was access violating and crashing the system. But why? What caused it? The best assessment I can come up with is that the code is dereferencing a null pointer plus an offset into a data structure that it's expecting to find in memory. Why their base pointer for the structure is null is harder to say, but it's almost certainly tied to the fact that the channel update file was all zeros.

A few folks have written to ask me why such code can't just be placed in a try/except block so that if it access violates, operation can continue. And the answer is: you can, in theory. Since the exception will be triggered on the attempt to write to illegal memory, and not merely after the fact, memory itself is protected and preserved. That means that as long as the code with the exception handler can return gracefully, and the callers upstream can in turn cope with the error being returned back to them, all is well. But I don't want to give you the impression that you can just wrap suspect code in a try/except block and eat the exceptions. There's a bit more to it than that.

I think the real failure here is on the part of the CrowdStrike driver, in its lack of properly vetting its input. They're not great about teaching it in college, but one of the first things you learn as a real developer is never to trust user input. And if you're a device driver and your input is a dynamically downloaded channel update file, you can't just implicitly trust it. Even
if the channel files were signed, the code needs to sanity-check the contents. Let's say you're writing a little app to read in a bitmap file and display it on the screen using the graphics card. When you read that file into memory and pass it to the draw-bitmap API, the first thing that the API is going to do is check the bitmap structure and header and make sure that it's all valid. And if you pass that bitmap off to DirectX to render it with the GPU, you can rest assured that the kernel side of the driver is going to carefully inspect the bitmap for validity in every possible sense before attempting to draw it. And CrowdStrike? Man, not so much. Looks like their code just kind of raw-dogged it and hoped for the best. But it is in life as it is in software: you can be lucky sometimes, but if you come to rely on luck, it will eventually run out. And CrowdStrike's appears to have run out when that channel file full of zeros brought down what must be a fairly fragile section of their code.

Following the CrowdStrike outage, various conspiracy theories have emerged on Twitter and Reddit. One popular theory posits that the outage was a deliberate cyber attack signaling the onset of World War III, with some connecting it to warnings from the World Economic Forum about potential global cyber threats. Another theory suggests that the outage was orchestrated by political figures to influence geopolitical events, although there is no evidence supporting any of these claims. As for me, I try to never attribute to malice that which can be sufficiently explained by incompetence.

It's not as simple as one programmer's error, either, though. When I was at Microsoft, I only wrote the odd bit of kernel code, but the culture among the kernel guys was pretty hardcore. The quality bar was extremely high, as was the level of scrutiny that your code would receive from the kernel team if you wandered onto their turf and checked something into their source control. Even so, I'm not going to just condemn the
programmer, especially based on the limited information that we have on the actual bug. But regardless of how egregious the bug is or isn't, there should be several procedural, test, and review layers that would prevent this bug, or any bug, from having the impact that this one had.

There are a lot more lessons to consider here, from whether or not seemingly the entire world's infrastructure should be dependent on a single vendor, to whether critical systems like 911 need to be on an N-1 or an N-2 update schedule, and what that all means. And heaven help you if you are running BitLocker on the affected machine. But all that will have to wait for a future episode.

So if you found today's episode to be any combination of entertaining or informative, please remember that I'm mostly in this for the subs and likes, and I'd be honored if you'd consider subscribing to my channel and leaving a like on the video. If you're already subscribed, thank you. Please consider sending this video to a friend if you think it's covered the subject well. And please do check out the free sample of my new book on Amazon, The Nonvisible Part of the Autism Spectrum. It's intended for folks that don't have ASD but who suspect they might have a few characteristics that put them somewhere on the spectrum. It's everything I know now about living a successful life on the spectrum that I wish I'd known long ago. Check it out at the link in the video description. In the meantime, and in between time, hope to see you next time, right here in Dave's Garage.
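
For anyone who wants the "never trust your input" point from this episode in concrete form, here is a minimal sketch in Python. To be clear about the assumptions: the file layout (`MAGIC`, the header fields, `ENTRY_SIZE`) and the function names are all invented for illustration and have nothing to do with CrowdStrike's real channel file format, which hasn't been made public. It only shows the general idea: check the header and sizes before touching the contents, and let the caller fall back to the last known-good data if the update is rejected.

```python
import struct

# Hypothetical layout, invented for this sketch. NOT CrowdStrike's real format.
MAGIC = b"CHNL"
HEADER = struct.Struct("<4sHHI")  # magic, version, entry_count, payload_length
ENTRY_SIZE = 16                   # assumed fixed size of each rule entry
SUPPORTED_VERSIONS = {1}


class InvalidChannelFile(ValueError):
    """Raised when an update file fails validation."""


def parse_channel_file(blob: bytes) -> list:
    """Validate the blob and return its fixed-size entries, or raise."""
    if len(blob) < HEADER.size:
        raise InvalidChannelFile("file is shorter than the header")
    magic, version, entry_count, payload_length = HEADER.unpack_from(blob, 0)
    if magic != MAGIC:
        # An all-zero file fails right here, before anything is dereferenced.
        raise InvalidChannelFile("bad magic value")
    if version not in SUPPORTED_VERSIONS:
        raise InvalidChannelFile("unsupported version")
    if payload_length != len(blob) - HEADER.size:
        raise InvalidChannelFile("declared length does not match file size")
    if entry_count * ENTRY_SIZE != payload_length:
        raise InvalidChannelFile("entry count does not match payload length")
    payload = blob[HEADER.size:]
    return [payload[i:i + ENTRY_SIZE] for i in range(0, payload_length, ENTRY_SIZE)]


def load_update(blob: bytes, current_rules: list) -> list:
    """Apply an update only if it validates; otherwise keep the old rules."""
    try:
        return parse_channel_file(blob)
    except InvalidChannelFile:
        # The caller copes with the failure instead of crashing.
        return current_rules
```

With this sketch, a buffer of all zeros is rejected at the magic check and `load_update` simply hands back the rules it already had. Note that the exception handler is the backstop, not the plan: the explicit checks do the real work, and the handler only works because the caller has something sensible to fall back to.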