This piece originally appeared in ProMarket.
The July 19 CrowdStrike outage, caused by a botched software update, crashed 8.5 million Microsoft Windows devices, forcing millions of users to encounter the Blue Screen of Death (BSOD). Planes were grounded and hospitals could not access electronic medical records. A leading cyber insurer estimates the outage caused $5.4 billion in financial losses to Fortune 500 firms, notably in airlines, healthcare, and banking.
Regulation is no substitute for continuous improvement in design and engineering
Some argue that CrowdStrike’s sheer scale was itself the problem. For example, in response to the outage, Federal Trade Commission Chair Lina Khan tweeted, “These incidents reveal how concentration can create fragile systems.” Fragility is one potent explanation for the incident. The widespread use of a handful of software systems means that more users are affected when things go wrong. But the dynamic cuts both ways: concentration protects against attacks every day.
To see why, remember that security products like those sold by CrowdStrike attempt to prevent, deter, and defend against attacks by state-sponsored cyber criminals. These attacks are growing more frequent, sophisticated, and severe. Among CrowdStrike’s recognized accomplishments are the exposure of North Korea’s hack of Sony Pictures, the Russian infiltration of the Democratic National Committee, and the contours of China’s espionage, information which U.S. authorities used to prosecute cyber criminals. The Federal Bureau of Investigation calculates an annual theft of intellectual property amounting to $225 billion to $600 billion, much of it enabled through cyber intrusion by Chinese state-sponsored hackers. The government relies on firms like CrowdStrike for information it has difficulty getting itself.
As Cornell University computer scientists Fred Schneider and Kenneth Birman observe, software “diversity, whether artificial or true, multiplies the number of distinct attacks that can compromise some platform someplace in the system.” Schneider and Birman were studying software monoculture. Their findings suggest that, contrary to conventional wisdom, deploying a software monoculture (or, in our case, a more concentrated IT market) in combination with automated diversity techniques can be an effective cybersecurity strategy: it allows for focused defense against configuration attacks while mitigating the risks associated with technology attacks.
Theories of market competition are based on firms competing for revenue; they do not presume the presence of malicious government actors attempting to steal, disrupt, and destroy assets. If the Chinese military stormed a United States airport to disrupt a flight, the U.S. Army would come to its defense. Yet when a Chinese soldier hacker attacks American cyber infrastructure, companies must fend for themselves, hence they buy the best-in-class IT security solution. The U.S. military has difficulty enough securing its own cyber defense, let alone protecting the private sector online.
Best-in-class systems can provide a broader plane of resiliency, for example by enforcing cybersecurity standards and addressing zero-day attacks. More broadly, breaking up large companies does not necessarily fix underlying security problems. The same breaches and glitches can still happen, and with more vendors they could be more frequent and more problematic. Whatever the policy, criminals will adapt to exploit vulnerabilities.
The market is already self-correcting without government intervention
The market has moved quickly to respond to the incident, faster than government actors. CrowdStrike lost a quarter of its stock price; Delta Air Lines, which canceled some 5,000 flights because of the outage, hired superstar litigator David Boies to recover $500 million in damages from CrowdStrike and Microsoft. The firm called the incident a “wakeup call” and announced a review of its systems. Because firms suffer a loss of revenue, they have an incentive to avoid such mistakes in the future. Indeed, the CrowdStrike CEO committed to “full transparency on how this occurred and steps we’re taking to prevent anything like this from happening again.” The incident will likely affect the cyberinsurance market, raising premiums for business interruption and liability. Buyers of security software may demand more quality assurance and stricter service level agreements.
CrowdStrike has issued frequent updates on its recovery from the incident, noting on July 31 that 99% of systems were back online. Microsoft noted its work with CrowdStrike, Amazon Web Services, and Google Cloud Platform to fix the update. “This incident demonstrates the interconnected nature of our broad ecosystem: global cloud providers, software platforms, security vendors and other software vendors, and customers. It’s also a reminder of how important it is for all of us across the tech ecosystem to prioritize operating with safe deployment and disaster recovery using the mechanisms that exist,” the company said.
Similarly, the incident signals an opportunity for better, different solutions, including ones that use artificial intelligence (AI).
For example, Jason Arbon, CEO of Checkie.AI, a firm that tests AI programs, and leader of software testing at Google, emailed me his assessment of the CrowdStrike incident, noting that staged rollout, in addition to testing, could reduce the impact of an outage:
“If you look at the pictures of the airport kiosks, they’re all down with blue screen. In a well-managed rollout, you would test it on a couple machines on the side first. Then for an incremental rollout, you would only apply the patch and changes to a subset of those kiosks or machines at a time. At worst, with well-managed IT at the airports, you would have only seen a few machines down.”
He says the solution is not more testing but 1) incremental rollout of the feature by the vendor (rather than relying on customer IT teams to do this) and 2) improved product design (updating the product code to minimize the amount of it that must run inside the Microsoft operating system).
“There will never be bug-free software,” Arbon explains. The failure is easy to diagnose: software checks and basic networking equipment could have detected that machines had become unresponsive and stopped the rollout. “There isn’t monitoring of systems after they have been deployed. Software can go awry for many different reasons long after the change is applied.”
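To make Arbon's point concrete, the following is a minimal sketch of an incremental rollout that checks machine health after each stage and stops when a stage looks bad. It is an illustration only, not a description of CrowdStrike's or Microsoft's actual deployment tooling. The function name `staged_rollout`, the callbacks `apply_update` and `is_healthy`, and the stage sizes (1%, 10%, 50%, 100%) are all assumptions chosen for the example.

```python
from typing import Callable, Iterable, List, Sequence


def staged_rollout(
    machines: Sequence[str],
    apply_update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    stage_fractions: Iterable[float] = (0.01, 0.10, 0.50, 1.0),  # assumed stage sizes
    max_failure_rate: float = 0.0,  # assumed tolerance: any failure halts the rollout
) -> List[str]:
    """Update machines in growing stages and halt if a stage looks unhealthy.

    Returns the machines that received the update before the rollout
    finished or was halted.
    """
    updated: List[str] = []
    for fraction in stage_fractions:
        # Cumulative number of machines that should be updated after this stage.
        target = max(1, round(len(machines) * fraction))
        batch = list(machines[len(updated):target])
        if not batch:
            continue
        for machine in batch:
            apply_update(machine)
        updated.extend(batch)
        # Post-deployment monitoring: check the machines just updated.
        failures = [m for m in batch if not is_healthy(m)]
        if len(failures) / len(batch) > max_failure_rate:
            break  # stop before the faulty update reaches the next stage
    return updated


if __name__ == "__main__":
    # Hypothetical fleet of 1,000 airport kiosks and an update that crashes every machine.
    kiosks = [f"kiosk-{i}" for i in range(1000)]
    crashed = set()
    reached = staged_rollout(kiosks, crashed.add, lambda m: m not in crashed)
    print(f"{len(reached)} of {len(kiosks)} kiosks received the faulty update")
```

In this toy scenario the faulty update stops after the first 1% stage, so 10 kiosks go down rather than all 1,000. Real systems would also need a rollback path and a waiting period between stages, both omitted here.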
Most people take it for granted when security works well, having little idea of the financial and human investment required to keep a global system running safely and economically. It’s only when a system goes down that we pay attention. The incident should make us appreciate the investment needed to make systems resilient and reliable.