Takeaways From the CrowdStrike Outage

On July 19, 2024, CrowdStrike, an independent cybersecurity company whose primary technology is the Falcon platform, released a software update that began impacting IT systems globally. The outage was not caused by a flaw in Microsoft Windows itself, but by a flaw in CrowdStrike Falcon. Falcon hooks into the Microsoft Windows OS at the kernel level. A logic flaw in Falcon sensor version 7.11 and above caused the sensor to crash. Because Falcon is so tightly integrated with the Windows kernel, that crash took Windows down with it, resulting in a system crash and the blue screen of death (BSOD).

What's a Kernel?

The kernel is a computer program at the core of a computer’s operating system and generally has complete control over everything in the system. It is the portion of the operating system code that is always resident in memory, has full control over all hardware resources, and facilitates interactions between hardware and software components. Most software operates in a user (or application) space that is separate from the kernel. Some critical security software, like CrowdStrike Falcon, runs at the kernel level instead. The reason for this design decision is twofold:
  1. It gives the security software high privileges and the ability to monitor operations in real time across the OS. That lets it block any process or action that is known to be, or appears to be (in the case of some AI-powered systems), malicious.
  2. It allows that monitoring and blocking to happen much faster, without a middleman negotiating the interactions, so the machine and user applications are not slowed down and the user experience stays much better.

Why Microsoft Allows This Access

According to Microsoft, CrowdStrike and other security vendors have this access because of a 2009 European Commission ruling, which stipulates that Microsoft must ensure that third-party products can interoperate with Microsoft’s relevant software products using the same interoperability information, and on an equal footing, as Microsoft’s own products.

Despite this, Microsoft provides several APIs that are meant to provide similar functionality without the need to directly access the kernel. For instance, Windows Defender Application Control and Windows Defender Device Guard both provide mechanisms for controlling application execution, ensuring that only trusted code can be executed. Additionally, the Windows Filtering Platform (WFP) allows applications to interact with the network stack without requiring kernel-level code.

Furthermore, Microsoft sources claim the company had begun developing an advanced API designed specifically for security applications such as CrowdStrike’s. It promised deeper integration with the Windows operating system, along with greater stability, performance and security. According to those sources, the 2009 EU ruling halted such integration efforts because regulators claimed it could give Microsoft an unfair advantage.

Takeaways

Planning, planning, planning. I’ve always believed that, when done right, IT is 80% planning and 20% execution (at least it feels that way). Planning reduces both the work you have to do and the stress of the circumstances under which you do it. MLB pitchers will tell you that a pitch with runners on base is a lot more stressful and requires more energy than one thrown when the bases are empty (football fans – think 2nd and short vs. 3rd and long, backed up against your own goal line). Having gotten my sports metaphors out of the way, here are some practical things to consider:

Test Updates Before Deploying to Production

Many companies, due to a lack of resources or expertise, allow automated updates so that systems are always up to date. The CrowdStrike issue made the underlying risk of that approach apparent. For mission-critical systems, testing updates before deployment, or pushing them through some form of staging environment before production, can help mitigate that risk.
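One way to picture a staged approach is as a series of "rings" that receive an update in order, with the rollout stopping as soon as a ring looks unhealthy. The following is a minimal Python sketch of that idea, not a description of how CrowdStrike or any particular patching tool works. The `Ring` class, the host names, the failure thresholds, and the `faulty_update` function are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Ring:
    """A group of hosts that receives an update at the same time."""
    name: str
    hosts: List[str]
    max_failure_rate: float  # halt the rollout if this is exceeded


def staged_rollout(rings: List[Ring],
                   apply_update: Callable[[str], bool]) -> Dict[str, str]:
    """Apply an update ring by ring, stopping at the first unhealthy ring.

    Returns a status per ring: "ok", "halted", or "skipped".
    """
    status = {ring.name: "skipped" for ring in rings}
    for ring in rings:
        failures = sum(1 for host in ring.hosts if not apply_update(host))
        failure_rate = failures / len(ring.hosts) if ring.hosts else 0.0
        if failure_rate > ring.max_failure_rate:
            status[ring.name] = "halted"
            break  # later rings never receive the update
        status[ring.name] = "ok"
    return status


# Hypothetical environment: a test lab, a small pilot group, then production.
rings = [
    Ring("test-lab", ["lab-01", "lab-02"], max_failure_rate=0.0),
    Ring("pilot", ["pilot-01", "pilot-02", "pilot-03", "pilot-04"],
         max_failure_rate=0.25),
    Ring("production", [f"prod-{i:02d}" for i in range(1, 21)],
         max_failure_rate=0.05),
]


def faulty_update(host: str) -> bool:
    """Simulates a bad update that fails on every host it touches."""
    return False


print(staged_rollout(rings, faulty_update))
# {'test-lab': 'halted', 'pilot': 'skipped', 'production': 'skipped'}
```

In this sketch the bad update takes down two lab machines and never reaches the pilot group or production. The trade-off is that production systems get each update a little later.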

Consider Multiple Vendors for Security Software

For years it’s been a best practice in larger companies to use multiple security vendors to improve the likelihood of thwarting threats. The best example of this is improving edge security by having multiple firewall vendors, the theory being that a flaw that can be exploited in one system most likely doesn’t exist in the other, which provides time for that hole to be closed. Having some portion of your systems protected by one vendor and another portion protected by a different vendor may keep you from experiencing a blanket outage and provide the breathing room you need to get impacted systems back online.
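The effect of splitting vendors can be shown with simple arithmetic. The short Python sketch below compares the share of systems knocked out by a single vendor's faulty update in a single-vendor fleet and in an evenly split fleet. The fleet size, server names, and vendor names are hypothetical.

```python
from collections import Counter
from typing import Dict


def blast_radius(host_vendor: Dict[str, str], failed_vendor: str) -> float:
    """Return the fraction of hosts affected if one vendor's agent fails."""
    if not host_vendor:
        return 0.0
    counts = Counter(host_vendor.values())
    return counts[failed_vendor] / len(host_vendor)


# Hypothetical fleet of 10 servers, all protected by the same vendor.
single_vendor = {f"srv-{i:02d}": "vendor-a" for i in range(10)}

# The same fleet with protection split evenly between two vendors.
split_vendors = {
    f"srv-{i:02d}": ("vendor-a" if i % 2 == 0 else "vendor-b")
    for i in range(10)
}

print(blast_radius(single_vendor, "vendor-a"))  # 1.0
print(blast_radius(split_vendors, "vendor-a"))  # 0.5
```

An even split is only one option. Which systems sit behind which vendor matters as much as the ratio, so that a single bad update cannot take out every instance of a critical service.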

Develop and Document Manual Workarounds

Where viable, manual workarounds can ensure critical business processes continue even when technology fails. In the event of an outage, they serve as a practical fallback that helps mitigate the effects of system failures and ensures businesses can still operate and support their customers.

Perform Disaster Recovery and Business Continuity Planning

Outages can happen for any number of reasons, from cyberattacks to power outages. Having comprehensive disaster recovery and business continuity plans in place is critical. Part of that effort should include identifying non-critical software that could itself be the source of a problem, such as software that interacts with the kernel, so that you don’t replicate that problem in your recovery environment.

How Adtech Can Help

We support organizations of all types with a complete set of IT consulting and managed services that include:

  • IT strategy development
  • IT risk assessment services
  • Aligning your technology strategy with your business strategy
  • IT process development
  • Business continuity and disaster recovery planning and implementation

You can also fill out our contact form and we’ll get back to you within 2 business days.