Security Bits – 21 July 2024

Feedback & Followups

Deep Dive – The Confused CrowdStrike & Microsoft Kerfuffles

From an American point of view, two significant outages happened overnight on the same night, and in case that wasn’t confusing enough, they were not entirely disconnected.

The First Outage – Microsoft Azure Central US

Things started to go wrong when Microsoft pushed a configuration change to one of the regions in their global cloud network; it just happened to be the one serving much of the United States (Central US).

This change caused the VMs powering this region to respond to connectivity glitches to their backend storage by rebooting rather than pausing, so the available compute power in the region plummeted and it wasn’t enough to meet demand.

This had two effects:

  1. Some of Microsoft’s first-party services in the region became overloaded; Teams & Xbox Live in particular seem to have generated complaints.
  2. Some of the services Microsoft sell to corporations became overloaded, particularly the Power Platform, which is used to power business logic in the cloud with serverless functions (really cool tech actually).

Microsoft seem to have been able to deal with the first problem pretty quickly by migrating the Teams service for US customers to different regions, but that added latency and probably stressed those regions so the service was likely sluggish for a while.

By the time I woke up in Ireland, the Teams issue looked to be under control, but the Power Platform was still orange on the service health dashboard.

For context, like the leader in the field, Amazon Web Services (AWS), Microsoft offer Azure services in multiple geographic regions, and when you provision something you choose not only a primary region but also the level of resilience you want to pay for. The scale starts at none and goes up to full geo-redundancy with resources mirrored in different parts of the world.

American corporations using the Power Platform who chose to take the risk and save money with lesser resilience would have had problems running their business processes, causing outages.

The Second Problem – The Bad CrowdStrike Update

As morning dawned on the other side of the world, a new problem emerged. Some Australians and New Zealanders arriving at their offices found their Windows PCs & Servers stuck on Blue Screens of Death, and reboots didn’t help. MEEP!

It wasn’t all Windows computers, just some, and after some initial confusion, the pattern soon became clear: it was Windows devices protected by the enterprise AV product Falcon from the very well-regarded cybersecurity experts CrowdStrike.

Falcon is a cloud-first real-time AV product driven by AI. It uses lightweight local agents which stream their telemetry up to the cloud and get high-frequency updates pushed down to keep the protection as current as possible. Because all the agents stream their data to the cloud in real time, CrowdStrike can use AI to learn about attacks as they happen and quickly send rules to all their other clients, theoretically nipping even novel attacks in the bud.

You can see why this product is popular with large enterprise customers: unlike more traditional AV, which is great at protecting against known threats and very poor at protecting against newly emerging threats, this is architected to give good protection against even the most novel attacks. Novel attacks first emerge against valuable targets, so the bigger a company is, the more appealing a product like Falcon looks!

A subtle but important point to note is that there is a trade-off here. All updates need testing before they go out, but that adds lag to the process, and the whole point is that the system should be really reactive. The way you balance this is with a massive bank of virtual machines running a wide array of tests in an entirely automated way. In theory, your test suite should cover every possible configuration in use in the real world, but it simply can’t, so there will be gaps.
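To get a feel for why a test suite can’t cover everything, here is a rough Python sketch. The dimensions and counts are invented for illustration and are not CrowdStrike’s real test matrix; the only point is that the number of combinations multiplies quickly, so a fixed bank of test VMs covers a small fraction of them.

```python
from math import prod

# Hypothetical configuration dimensions and how many variants of each exist
# in the wild. These numbers are made up purely for illustration.
DIMENSIONS = {
    "os_build": 40,
    "patch_level": 25,
    "hardware_profile": 30,
    "other_security_tools": 12,
    "agent_version": 8,
}


def total_configurations(dimensions: dict[str, int]) -> int:
    """Count every possible combination of the given dimensions."""
    return prod(dimensions.values())


def coverage(tested: int, dimensions: dict[str, int]) -> float:
    """Fraction of all combinations a test bank of the given size can cover."""
    return min(1.0, tested / total_configurations(dimensions))


if __name__ == "__main__":
    total = total_configurations(DIMENSIONS)
    print(f"{total:,} possible configurations")
    print(f"10,000 test VMs cover {coverage(10_000, DIMENSIONS):.2%} of them")
```

Even with these modest made-up numbers, a large bank of test machines only touches a sliver of the possible configurations, which is where the gaps come from.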

These kinds of systems tend to follow a power-law statistical distribution, so small errors affecting a few customers are massively more probable than big errors affecting lots of customers, but sometimes you get unlucky!
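As a back-of-the-envelope illustration of that power-law intuition, the sketch below uses a Pareto distribution, where the chance of an incident affecting more than x customers falls off as x to the power of minus alpha. The exponent and the customer counts are assumptions picked for illustration, not measured values.

```python
def tail_probability(x: float, alpha: float, x_min: float = 1.0) -> float:
    """P(incident size > x) for a Pareto distribution with minimum size x_min."""
    if x <= x_min:
        return 1.0
    return (x_min / x) ** alpha


def relative_likelihood(small: float, big: float, alpha: float) -> float:
    """How many times more likely an incident exceeds `small` than exceeds `big`."""
    return tail_probability(small, alpha) / tail_probability(big, alpha)


if __name__ == "__main__":
    ALPHA = 1.5  # assumed exponent, purely illustrative
    ratio = relative_likelihood(small=10, big=10_000, alpha=ALPHA)
    print(f"Incidents hitting >10 customers are about {ratio:,.0f}x more likely "
          f"than incidents hitting >10,000")
```

The exact ratio depends entirely on the assumed exponent, but the shape is the same for any positive alpha: big incidents are rare, not impossible.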

At this stage, I don’t think we know enough to understand how something got through testing that affected so many customers, but one worrying piece of anecdata is that this is not the first time this year an entire OS family seems to have been affected: there were bugs crashing two different flavours of Linux earlier this year. They just didn’t get the same kind of press because they didn’t have the same scale of impact.

Why is Recovery so Slow?

CrowdStrike figured out the root cause pretty quickly, and they revoked the problem update, but that only stops more machines from being knocked out; it does nothing to bring the dead machines back!

To compound the problem, it can’t be fixed remotely or automatically because the fix is to boot the device into safe mode, delete a single file, and then reboot. In a corporate environment, most users don’t have local administrator rights on their PCs, so they literally can’t fix the problem themselves; they have to wait for someone from IT to physically restore their device.

Having said that, servers should be easier to restore because most are virtual these days, and any company being run in an even reasonably responsible manner will have daily if not hourly snapshot backups they can roll back to. But an office with servers and no PCs is still not a very functional place!

One Final Connection Back to Microsoft

In case there wasn’t already enough confusion between Microsoft’s part in the day’s drama and CrowdStrike’s part, one of the services Microsoft sell to enterprise customers is virtual desktop PCs. You run your actual work PC in the cloud, and use a thin client to access it from anywhere, even a web browser. Companies manage these virtual PCs as if they were physical, so they push out AV tools to them like they would to any other PC, including Falcon in some cases. As a result, Microsoft reported that many of their cloud desktops also got stuck in infinite reboot loops because of the CrowdStrike bug.

A Sting in the Tail – Cybercriminals try to Cash In

As always happens when something nasty gets headlines, cybercriminals are targeting companies with fake ‘fixes’ from CrowdStrike that are actually malware 🙁

This is a timely reminder that the same dynamic is in play each time there is any kind of bad news, be it a natural disaster, an accident, or a war: baddies will try to exploit the situation for profit.

Can we Learn any Lessons from all this?

Let’s start easy: does the Azure region outage teach us anything? To be honest, nothing new; we’ve seen this before with Amazon, Google, and Microsoft cloud services. It doesn’t happen often, but entire regions do sometimes go down. This is why all these providers offer resiliency as a feature.

When companies choose to accept a higher risk of failure to save money, the risk is real.

Moving on to the CrowdStrike event, I don’t see a clear-cut answer.

You might assume the lesson is not to rely on one vendor for all your AV, but that’s a terrible idea. To have any chance of running an effective cybersecurity operation you need a unified platform. Yes, having all your eggs in one basket is a risk, but having a total hodge-podge is actually worse. Instead of a low risk of a really spectacular outage, you’ll suffer lots of smaller incidents very frequently, and you’ll struggle to contain them. Your cybersecurity team will spend all their time firefighting and filing breach reports, and your reputation will suffer. Better to have a small chance of being one of many, many companies affected at the same time, when everyone knows it’s not your fault but the vendor’s!

You might assume CrowdStrike must be some kind of fly-by-night operation, but they are extremely well respected. The reason they are used by so many big companies is that they are one of the best, and that’s a reputation they’ve earned over many years of hard work.

I’m a little concerned that they seem to have had warnings a few months back that their testing systems were leaky, so it’s possible they deserve some criticism for not reacting to those warnings better, but it’s equally possible they are very busy re-architecting things behind the scenes, and that there are changes in the pipeline already. We have much too little information today to draw any conclusions about whether or not CrowdStrike were in some way negligent. Expect to learn much more in the future, because it seems inevitable that CrowdStrike will need to publish a detailed incident report on all this once they’ve had time to gather all the facts and do the analysis needed to engineer an appropriate response.

For now, my advice is to ignore anyone who tells you that the blame for this is in any way clear. That’s a sign of someone who just doesn’t get that this is a trade-off all the way down:

  1. You need a rapid response, and you need testing, but the more you test, the slower your response.
  2. You need a single cybersecurity platform to be able to run an effective operation, but that makes you vulnerable to a catastrophic failure.

Maybe this is a good argument for allowing your users to choose their end-user OS as long as it’s supported by your cybersecurity platform, and for allowing your sysadmins to use multiple server solutions as long as they too are supported by your platform.

Links

❗ Action Alerts

Worthy Warnings

Notable News

  • 🇪🇺 X (formerly Twitter) joins the ranks of companies with preliminary findings against them for breaking the EU Digital Services Act (DSA) – ec.europa.eu/… (Digital Services, not Digital Markets!)
    • Complaints revolve around the blue checkmark being misleading, the absence of required advertisement transparency reporting, and the lack of data access for researchers.
    • Remember, preliminary findings are official accusations, not convictions; the company now gets to offer a defence.
  • Google have been caught with their fingers in the proverbial cookie jar, though in a surprisingly open way: Google Chrome, Along With Other Popular Chromium Browsers, Grants System Monitoring Privileges to *.google.com Domains – daringfireball.net/…
  • Google have made their Advanced Protection program for at-risk people a little more accessible by allowing users to choose passkeys rather than requiring hardware FIDO tokens – www.bleepingcomputer.com/…
  • After downplaying the weakness for years, Signal have agreed to start encrypting local copies of chats in their desktop apps, making use of OS-level key stores to securely store the keys (e.g. the Keychain on Macs) – www.bleepingcomputer.com/…
  • MacPaw have previewed technology they have developed for real-time on-device phishing detection that promises to be a lot more effective than our existing block-listing approach – appleinsider.com/…
    • Making use of the AI hardware on modern chips, they use on-device AI to pre-load link destinations in the background and check if they imitate known brands
    • This was presented at a research conference, not as a product demo, so we don’t know how or when we’ll get to purchase this, but it looks very promising.
  • Two nice cybersecurity-related announcements from Microsoft:
  • 🇸🇬 Singapore leads the way, and hopefully, many other countries will soon follow: Banks in Singapore to phase out one-time passwords in 3 months – www.bleepingcomputer.com/… (only phishing-resistant MFA is acceptable now, no more codes users have to type in, whether they be via SMS or an authenticator app)

Palate Cleansers

Legend

When the textual description of a link is part of the link, it is the title of the page being linked to; when the text describing a link is not part of the link, it is a description written by Bart.

Emoji Meaning
🎧 A link to audio content, probably a podcast.
❗ A call to action.
flag The story is particularly relevant to people living in a specific country, or, the organisation the story is about is affiliated with the government of a specific country.
📊 A link to graphical content, probably a chart, graph, or diagram.
🧯 A story that has been over-hyped in the media, or, “no need to light your hair on fire” 🙂
💵 A link to an article behind a paywall.
📌 A pinned story, i.e. one to keep an eye on that’s likely to develop into something significant in the future.
🎩 A tip of the hat to thank a member of the community for bringing the story to our attention.
