
The CrowdStrike Outage: The importance of Quality Assurance

Jul 25, 2024 JIN


The biggest IT meltdown of the year took place on Friday, July 19th, 2024, and threw businesses around the world into chaos. What looked like a simple routine update brought down millions of Windows machines across thousands of organizations. Many argued that CrowdStrike had failed at software Quality Assurance, or at the very least that its Quality Assurance was not comprehensive enough for what was at stake.

CrowdStrike and its Falcon sensor

Founded in 2011, CrowdStrike is a cybersecurity company whose stated mission is to stop breaches. It promises protection even when endpoints are offline, real-time response, and what it claims is the fastest threat detection in the industry. Before the meltdown, the company was valued at roughly USD 83 billion.

Falcon sensor is the endpoint agent designed to detect and block attacks on the systems where it runs. Once installed, Falcon operates with deep access to the operating system’s kernel so it can detect and prevent threats at a much lower level. That also means a single buggy update can cause damage on a massive scale.

Blue screens at LaGuardia Airport in New York during the CrowdStrike outage. Photo by Yuki Iwamura, Associated Press.

The global crash

Most people have heard of the blue screen of death, aka BSOD, at least once.

The faulty Falcon sensor update was released on July 19th, 2024, at 04:09 UTC and sent Windows machines into reboot loops or recovery mode. As a result, platforms around the world ground to a halt, including Amazon Web Services, eBay, Microsoft 365 (Microsoft’s cloud-based productivity and collaboration suite), Microsoft Azure (Microsoft’s cloud computing platform), Visa (the multinational payment card network), Delta (one of the major US airlines), Instagram (the photo and video sharing platform), and many more.

E-commerce companies were not the only ones hit by this devastating faulty update; government services shared the same fate. The 911 emergency and non-emergency call systems in Alaska and at the New Hampshire Department of Safety were not functioning correctly. Hospitals and emergency healthcare centers could not operate with their systems completely down: surgeries were postponed until further notice, and doctor visits and medical procedures were canceled. Airports were packed and chaotic as check-in systems and flight communication controls went down, and flights were delayed or canceled. TV stations were unable to broadcast. The list goes on.

At SHIFT ASIA, we ran into VPN and networking issues during the incident. Our IT department resolved them quickly, but the experience of the BSOD was not a fun ride.

The CrowdStrike Fix

The outage prevented people from accessing their files in the cloud, locked their bank accounts, grounded their flights, postponed their medical care, and left them unable to pay for their groceries. It also knocked roughly 11% off CrowdStrike’s share price, wiping billions of dollars off the company’s market value.

CrowdStrike moved quickly: at 09:45 UTC, CEO George Kurtz announced that a fix had been deployed and stated that the incident was not a cyberattack and that CrowdStrike customers remained fully protected. He later confirmed that recovery might take longer and could not always be done automatically: without physical access to the machines, many affected computers, virtual machines, and systems could not be brought back.

The root cause of the meltdown was a configuration file, Channel File 291, which triggered a buffer over-read: the sensor read past the memory boundary of a buffer, violating memory safety inside the Windows kernel. To remedy this, administrators could restore a system by booting into safe mode or the recovery environment and deleting the .sys files whose names start with C-00000291. For this spectacular failure, CrowdStrike CEO George Kurtz was summoned by Congress to testify about the global outage, according to the latest news from the Washington Post.
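To make the failure mode concrete, here is a minimal, purely hypothetical C sketch, not CrowdStrike’s code, of the kind of bounds check whose absence leads to a buffer over-read; in a kernel-mode component, reading past the end of a buffer like this can bring down the entire machine. Every name and number below is invented for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical content-file parser: expects EXPECTED_FIELDS fields of
   sizeof(unsigned int) bytes each. All names and numbers are invented. */
#define EXPECTED_FIELDS 8

/* Returns 0 on success, -1 if the buffer is too small to hold every field. */
static int parse_content(const unsigned char *buf, size_t buf_len, unsigned int *out_sum)
{
    /* This bounds check is the whole point: without it, the loop below
       would read past the end of buf whenever the file ships fewer
       fields than the parser expects -- a buffer over-read. */
    if (buf_len < EXPECTED_FIELDS * sizeof(unsigned int))
        return -1;

    unsigned int sum = 0;
    for (size_t i = 0; i < EXPECTED_FIELDS; i++) {
        unsigned int field;
        memcpy(&field, buf + i * sizeof(unsigned int), sizeof field);
        sum += field;
    }
    *out_sum = sum;
    return 0;
}

int main(void)
{
    /* A malformed "content file": only 6 of the expected 8 fields. */
    unsigned char short_file[6 * sizeof(unsigned int)] = {0};
    unsigned int sum;

    if (parse_content(short_file, sizeof short_file, &sum) != 0)
        printf("rejected malformed content file instead of over-reading\n");
    return 0;
}
```

The same idea applies one level up: a release pipeline can refuse to publish a content file whose size or layout does not match what the deployed parser expects, which is the thread picked up in the takeaways below.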

QA mishaps?

The CrowdStrike issue has been controversially discussed and investigated. Experts argued over whether this was a QA failure on both ends: the vendor, CrowdStrike, and its customers. On CrowdStrike’s side, the defective update was released without the defect ever being identified. On the customers’ side, updates were seemingly applied automatically without being tested in a separate staging environment before reaching the production environment. Either way, things clearly could have been handled better, and software quality assurance best practices were frankly not applied and maintained consistently.

Many argue that a company handling cybersecurity the way CrowdStrike does should have thorough testing in place before any public release, if only to safeguard its reputation. What happened was illogical, out of character, and hard to believe. It strongly suggests that user acceptance testing (UAT) was skipped: the routine updates were so frequent that CrowdStrike was confident enough to bypass its regular strict QA testing process.

Others believe the QA testing process should be outsourced to a third-party vendor even if you have an in-house QA team, just as businesses outsource the auditing of their accounts even when they have an accounting department. Third-party auditors give companies an unbiased, objective, and transparent view of the business without conflicts of interest. The same goes for third-party QA outsourcing vendors.

Key Takeaways: Balancing the SDLC

The CrowdStrike event is a wake-up call for business owners everywhere. The incident highlights the importance of robust QA practices, particularly for security software that operates on critical infrastructure. Effective QA could have caught the bug before the update was released, preventing the widespread impact users experienced.

In the product development community, time has always been prioritized above everything else because of how competitive the software industry is. Developers are pushed against colliding deadlines and unrealistic feature schedules in the constant race to keep up with the latest technology, and the most time-consuming process of all, QA testing, is what gets sacrificed.

Finding the break-even point between quality and time in the SDLC is not impossible, but turning it into a habit that sticks is the hard part:

  • Shift-Left Testing: Move testing to the left, that is, to the early stages of development, instead of waiting until the end of the SDLC to do all the testing at once (see the sketch after this list).
  • Continuous Testing & Delivery: If you have a regular release cadence, continuous testing and delivery is your best bet for catching malfunctions and defects at the earliest stage. It gives you control over frequent version updates and makes it easy to roll back to the last stable version when things go south.
  • Operations and Maintenance: Documentation is often left untouched, existing only for the sake of ticking an SDLC box, when in reality it is one of the most critical parts of the cycle. Good documentation helps engineers quickly grasp a complex system’s infrastructure, design, and functionality, making troubleshooting and maintenance much easier.
  • Proactive monitoring and code reviews: Employ a mix of tools, from automated to manual, as long as they fit your team’s working style, to detect shortcomings proactively. Advocate code review practices that refine code quality, sharpen the team’s coding skills, and keep knowledge flowing.
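As a hypothetical illustration of the shift-left and continuous-testing points above, the C sketch below is the kind of small validation gate a CI pipeline could run before publishing a content update. The header value, size limit, and file format are all invented for the example; a real gate would check whatever invariants the deployed parser actually relies on.

```c
#include <stdio.h>

/* Hypothetical release gate: refuse to publish a content update unless the
   file starts with the expected header and has a plausible size.
   Magic value, limits, and format are invented for illustration;
   byte order of the header is glossed over for brevity. */
#define EXPECTED_MAGIC   0xC0FFEE42u
#define MAX_CONTENT_SIZE (1024L * 1024L)

static int validate_content_file(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f) {
        fprintf(stderr, "cannot open %s\n", path);
        return 1;
    }

    unsigned int magic = 0;
    if (fread(&magic, sizeof magic, 1, f) != 1 || magic != EXPECTED_MAGIC) {
        fprintf(stderr, "bad or missing header in %s\n", path);
        fclose(f);
        return 1;
    }

    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fclose(f);

    if (size <= (long)sizeof magic || size > MAX_CONTENT_SIZE) {
        fprintf(stderr, "implausible content size: %ld bytes\n", size);
        return 1;
    }
    return 0;   /* exit code 0 lets the pipeline proceed to rollout */
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <content-file>\n", argv[0]);
        return 2;
    }
    return validate_content_file(argv[1]);
}
```

Wired into a pipeline, a non-zero exit code from a check like this blocks the rollout before anything reaches customers, which is exactly the kind of cheap, early gate shift-left testing argues for.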

This IT meltdown may become a joke we laugh about a few years down the road, but as raw as it has been for millions of people around the world, the lesson is there to learn. Assumption is the lowest form of intelligence, and nobody assumed a routine update could be as harmful, tragic, and expensive as CrowdStrike’s turned out to be.

It might not always be the case, but gearing your business up with proper QA testing is never redundant. Looking into a Quality Assurance outsourcing services provider is a manageable step worth considering in light of this outage. Contact the SHIFT ASIA support team for a private consultation if you need help figuring out where to start.

