A single software bug unexpectedly brought much of the world to a standstill. On the morning of Friday 19 July 2024, our time, a global IT outage hit banks, airports and airlines, telecommunications, hospitals, shops and the media. It was all caused by a failed update to Falcon, the cloud-based cybersecurity service from CrowdStrike.
The outage affected computers running Windows. Microsoft Azure and other cloud environments were also disrupted, affecting virtual servers and workstations. As a result, services and applications such as Power BI, Microsoft Fabric, Microsoft Teams, Microsoft 365 admin center, Microsoft Purview, Microsoft Defender, Microsoft Defender for Endpoint, Microsoft Defender Experts, Microsoft Intune, Microsoft OneNote, OneDrive for Business, SharePoint Online, Windows 365 and Viva Engage were unavailable or limited.
What happened
The primary cause was a faulty update to CrowdStrike's Falcon software. It caused Windows devices to crash: users saw the familiar blue screen of death, or their computers got stuck in an endless loop of reboots after the update.
CrowdStrike is one of the leading cybersecurity firms. It delivers advanced endpoint protection software designed for enterprise use. Precisely because that software runs with deep, kernel-level access on every machine where it is installed, from servers to corporate laptops, a single faulty update was able to crash all of them at once.
On a global scale, airport check-in systems collapsed and flights were delayed, rail transport was affected, financial and stock exchange transactions could not be carried out, reservation systems crashed, telecommunications went down, hospitals had to cancel elective procedures, and people could not pay in shops. The cost of the outage has been estimated at a billion dollars.
The Czech Republic was less affected. Some shops, pharmacies and insurance companies experienced difficulties, but hospitals and public offices functioned normally, with minor exceptions.
The fix took several days
The outage affected several million computers worldwide; Microsoft put the figure at roughly 8.5 million Windows devices. Because of this unprecedented scale, full recovery took several days. Many computers had to be repaired manually, one by one, by booting into safe mode and deleting the faulty file. CrowdStrike promptly released a corrected update, so machines that can start up and download it should work properly again.
Causes and consequences
The outage was caused by a combination of factors.
- A faulty update: An undetected bug caused Windows systems to crash. Because Windows is among the most widely used operating systems across industries, the crashes led to widespread service disruption.
- Poor testing: Inadequate testing of the update before deployment contributed to the bug not being detected in time. CrowdStrike has already promised to improve testing methods to prevent something similar from happening again.
Can the risk be mitigated?
Companies can reduce the impact of such events through various measures: deploying better testing tools, backing up data, keeping a regularly updated recovery plan for failures and attacks, diversifying IT infrastructure, and continuously monitoring systems to ensure the fastest possible response.
1. Improving testing procedures
Consider implementing automated testing tools that can quickly identify bugs in software updates before they are deployed.
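Automated tests are most useful when they act as a gate in a staged rollout: the update goes to a small group of machines first and is promoted to the next group only if those machines stay healthy. The following is a minimal sketch of such a gate, not a description of how CrowdStrike or any particular tool works. The names `RingResult`, `may_promote` and `rollout`, and the 99% threshold, are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RingResult:
    """Health of one rollout ring after the update (hypothetical structure)."""
    name: str
    hosts_total: int
    hosts_healthy: int


def may_promote(result: RingResult, min_healthy_ratio: float = 0.99) -> bool:
    """Return True only if enough hosts in the ring stayed healthy."""
    if result.hosts_total == 0:
        return False  # no evidence at all, so do not promote
    return result.hosts_healthy / result.hosts_total >= min_healthy_ratio


def rollout(rings: List[RingResult], min_healthy_ratio: float = 0.99) -> List[str]:
    """Walk the rings in order and stop at the first one that fails the gate.

    Returns the names of the rings that passed; later rings never receive
    the update.
    """
    passed = []
    for ring in rings:
        if not may_promote(ring, min_healthy_ratio):
            break
        passed.append(ring.name)
    return passed
```

With rings such as a test lab, a small canary group and then everyone else, a crash in the canary group stops the update before it reaches the whole fleet.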
2. Backup and recovery
Back up all important systems and data regularly. Develop recovery plans that include detailed steps for quickly restoring systems after an outage, and keep them up to date.
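A backup only helps if it can actually be restored, so a recovery plan usually includes periodic restore tests. As a minimal sketch, assuming a file is restored to a scratch location and compared with its source, the check below uses SHA-256 checksums from Python's standard library. The function names `sha256_of` and `restore_matches_source` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches_source(source: Path, restored: Path) -> bool:
    """A restore test passes only if the restored file has the same checksum."""
    return sha256_of(source) == sha256_of(restored)
```

Running a check like this on a sample of restored files after every scheduled restore test gives early warning of corrupted or incomplete backups.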
It can also help to monitor systems 24/7 to quickly detect and respond to suspicious activity and, for larger companies, set up dedicated incident response teams. These experts will be ready to respond immediately to any security incidents.
Tip: If you care about security but don’t have enough experts in your company, outsource this service to an expert partner.
It’s also a good idea to undergo regular external cybersecurity audits to identify and correct weaknesses.
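The round-the-clock monitoring mentioned above can start very simply: each machine reports a heartbeat, and anything that goes quiet for too long is flagged for the response team. The sketch below illustrates the idea only. The function name `silent_hosts`, the host names and the five-minute threshold are assumptions, and a real deployment would rely on a dedicated monitoring product.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def silent_hosts(
    last_seen: Dict[str, datetime],
    now: datetime,
    max_silence: timedelta = timedelta(minutes=5),
) -> List[str]:
    """Return hosts whose last heartbeat is older than max_silence, sorted by name."""
    return sorted(
        host for host, seen in last_seen.items() if now - seen > max_silence
    )
```

During an incident like the one in July, a sudden jump in the number of silent hosts is often the first sign that something is wrong across the whole fleet and not just on one machine.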
3. Cyber resilience
Deploying redundant systems and diversifying the IT infrastructure reduces reliance on any one vendor or technology, so that a failure in one system does not bring everything down.
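In practice, redundancy means that a service can fall back to an alternative when its preferred system fails. The following sketch shows the selection logic under simple assumptions: each provider is a name paired with a health-check function. The `Provider` type, `pick_provider` and the provider names in the usage note are hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple

# A provider is a name plus a zero-argument health check (hypothetical shape).
Provider = Tuple[str, Callable[[], bool]]


def pick_provider(providers: Sequence[Provider]) -> Optional[str]:
    """Return the name of the first provider whose health check passes, else None."""
    for name, is_healthy in providers:
        try:
            if is_healthy():
                return name
        except Exception:
            # A health check that raises is treated like an unhealthy provider.
            continue
    return None
```

For example, `pick_provider([("primary", check_primary), ("backup", check_backup)])` returns `"backup"` when only the second check succeeds. The fallback only adds resilience if the providers do not share the same vendor or update channel.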
In general, meeting industry standards such as ISO 27001 and implementing compliance programmes increase resilience. The European NIS2 Directive and the new Czech cybersecurity law, which is currently being drafted in line with it, seek to enshrine a minimum level of security in law.
Action by Microsoft
After the incident, Microsoft promised to work on Windows stability and to cooperate with antivirus vendors to help improve and modernise how their products work.
The goal is to prevent, or at least restrict, antivirus software from accessing the operating system kernel. The current approach, where antivirus software runs in kernel mode, has advantages: it can detect infections quickly and is harder for malware to shut down. The disadvantage is that a faulty kernel-mode driver cannot simply be stopped and restarted. As we have seen, its failure takes the entire system down with it and can ultimately become a security risk in itself.
Apple does not allow this approach in its operating systems. Microsoft, however, previously reached an agreement with the EU that prevents it from imposing such restrictions on antivirus vendors, so it has much less room for manoeuvre. It can therefore only offer antivirus vendors voluntary solutions, for example virtualization-based security (VBS), which runs protection in a separate virtualised environment. VBS is already part of Windows 11.
Are you struggling with cybersecurity, or have you not given it enough attention yet? Contact us and get a reliable partner to guide you through the pitfalls of this vital area.