The Optus Outage: Business Continuity Lessons for Organizations

Thursday, November 9, 2023

Yesterday in Australia, huge portions of the country were left without phone or internet access when a massive Optus telecommunications outage impacted over 10 million customers, including thousands of organizations of all sizes.

What are some key lessons about Business Continuity that organizations might take away from this event?

What will fail?

First, we need to consider what might go wrong that could impact our business.

Okay, how could anyone plan to deal with a national telecom provider breaking connections for 10,000,000 customers and knocking their business offline? The good news is you don't have to. The bad news is the scale doesn't matter as much as you think.

So let's think about the biggest categories of hazards businesses face:

🤖 Technological

Losing your internet connection is a problem for a modern organization of any size, in any sector. Whether a whole country was knocked offline, or just your business, the impact on your bottom line is the same.

Beyond that, technology risks range from other service providers going down, to the iPad powering your POS suddenly not turning on, to a tradie accidentally knocking out all electrical power to your premises.

⛈️ Natural

Natural hazards include obvious causes like extreme weather: storms, heatwaves, blizzards, and many more. And of course, we've all just lived through a global pandemic. Your exposure ultimately depends on your work.

Retail and hospitality will see demand reduced by severe weather, and events may need to be cancelled entirely. Logistics businesses can be extremely exposed to transport challenges. Even geological risks like earthquakes become a huge concern for many organizations - especially governments and large building-owners.

🤷 Humans

Surprising nobody, humans are our own risk category. People also play a part in the categories already covered, even through accidents: if extreme weather blocks the roads, employees can't get to work; illness requires them to take time off; and an accidental mistake can cause a technological problem.

But the most terrifying human risks are the intentional ones. These could be internal, like insider attacks or criminal acts, or external, like cybersecurity breaches or regular theft.

What happens when it fails?

Your organization has critical systems and business processes that it relies on. After thinking about what could happen, we need to think about how those events would impact our business processes.

Start by identifying your core activities, and what's necessary to make them work. Let's go with some common activities in most organizations:

Rostering staff for shifts
Delivering service to customers
Handling customer payments
Receiving goods from suppliers
Processing payroll to staff

Any of these processes failing would be a major problem for our organizations, and how they work will be different for every business.
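As a rough sketch of this kind of impact mapping, the activities above can be listed alongside what each one depends on, and then checked against a given failure. The dependency names below are hypothetical assumptions for illustration only; your own organization's will differ.

```python
# Hypothetical mapping of core activities to what they rely on.
# Every dependency name here is an illustrative assumption.
DEPENDENCIES = {
    "Rostering staff for shifts": {"internet", "rostering app"},
    "Delivering service to customers": {"electricity", "staff on site"},
    "Handling customer payments": {"internet", "electricity", "EFTPOS terminal"},
    "Receiving goods from suppliers": {"road access", "staff on site"},
    "Processing payroll to staff": {"internet", "payroll provider"},
}


def impacted_activities(failed, dependencies=DEPENDENCIES):
    """Return the activities that rely on at least one failed dependency."""
    failed = set(failed)
    return sorted(
        activity
        for activity, needs in dependencies.items()
        if needs & failed  # non-empty intersection means the activity is hit
    )


# Which activities are affected if the internet connection goes down?
for activity in impacted_activities({"internet"}):
    print(activity)
```

Even a simple table like this makes it easier to see which single failure touches the most activities.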

What should we do when it fails?

No matter what, eventually something in our organization will break. It's just a matter of what will break, and when.

"Everything fails, all the time" Werner Vogels (CTO, Amazon.com)

Fortunately, even if a single system or process fails, the actual activity that we need to deliver often isn't completely prevented - there are alternatives we can explore.

But we need to explore them before the incident occurs, so we have a plan for how to react. We also need to exercise these scenarios, to see where they may not work as intended, and so people are familiar with what to do.

Let's look through a few scenarios:

💳 EFTPOS Terminal Failure at Local Café

Most people don't carry cash in 2023, and EFTPOS is vital to retail and hospitality businesses. When these machines break, it's a big problem. In this case, we don't know what's gone wrong - just that it's stopped working.

There are a few steps we could try. With hindsight and context, we might consider them "obvious", but that's never something we should assume, especially in a crisis situation.

Attempt the transaction a second time
Reboot the EFTPOS terminal
Try another EFTPOS terminal (if available)
Check for an internet outage
Enable Store & Forward (S&F) offline mode (if supported)
Use manual EFTPOS vouchers (if available)
Contact the bank/provider
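For the technically inclined, the steps above can be sketched as an ordered checklist that is worked through until payments are restored. This is only an illustration of the idea: the `run_checklist` helper is hypothetical, and the outcomes are simulated with fixed True/False values rather than real checks.

```python
from typing import Callable, List, Optional, Tuple

# Each step pairs a description with a check that returns True if
# payments are working again once that step has been tried.
Step = Tuple[str, Callable[[], bool]]


def run_checklist(steps: List[Step]) -> Optional[str]:
    """Try each step in order; return the first one that restores payments."""
    for description, resolved in steps:
        print(f"Trying: {description}")
        if resolved():
            return description
    return None  # nothing worked, so the incident needs escalating


# Simulated outcomes for illustration only: here the internet is down,
# so nothing helps until the terminal is switched to offline mode.
steps = [
    ("Attempt the transaction a second time", lambda: False),
    ("Reboot the EFTPOS terminal", lambda: False),
    ("Try another EFTPOS terminal", lambda: False),
    ("Check for an internet outage", lambda: False),
    ("Enable S&F offline mode", lambda: True),
    ("Use manual EFTPOS vouchers", lambda: False),
    ("Contact the bank/provider", lambda: False),
]

fix = run_checklist(steps)
print(f"Resolved by: {fix}")
```

Writing the steps down in a fixed order is the point here; whether that lives in code or on a laminated card next to the till matters much less.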

There are also things we might need to consider ahead of time in our planning, such as whether our business has additional EFTPOS machines, or whether we'd be comfortable with enabling Store & Forward on our terminals. If we use manual vouchers, our staff would also need to know how to use them.

This doesn't mean every staff member needs to be deeply familiar with the least-likely scenario of writing out vouchers. But would it be reasonable for the duty supervisors to at least be familiar with them?

⚠️ Cloud Outage at Global SaaS Software Provider

Most of my regular audience are cloud computing experts, and we're all well acquainted with the fact that stuff breaks a lot.

Let's take a look this time at the mitigation aspect, and what we can do before an incident takes place:

Ensure the application has been architected for fault-tolerance
Consider cross-geographical global deployments
Develop rigorous regular testing conditions (chaos engineering)
Ensure inter-service dependencies are well-mapped and understood
Deploy warm or cold standbys that can be activated in case of failures

All of these can be useful, but all come with risks, costs, and trade-offs. Standbys, even cold, are going to cost money, depending on the RTO/RPO required. Deploying solutions globally may not be possible, depending on infrastructure availability and data sovereignty requirements.
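To make the standby idea a little more concrete, here is a minimal sketch of failing over from a primary to a standby. The `call_with_failover`, `primary`, and `standby` names are hypothetical, and the outage is simulated; a real deployment would more likely handle this at the DNS, load balancer, or platform level rather than in application code.

```python
from typing import Callable, Sequence


def call_with_failover(endpoints: Sequence[Callable[[], str]]) -> str:
    """Call each endpoint in priority order and return the first success."""
    errors = []
    for endpoint in endpoints:
        try:
            return endpoint()
        except Exception as exc:  # broad on purpose: any failure triggers failover
            errors.append(exc)
    raise RuntimeError(f"all {len(errors)} endpoints failed")


def primary() -> str:
    # Simulated outage in the primary region.
    raise ConnectionError("primary region unavailable")


def standby() -> str:
    return "served from standby"


result = call_with_failover([primary, standby])
print(result)
```

The code is the easy part. The trade-offs described above (what the standby costs to run, and how stale its data is allowed to be) are what actually decide whether it's worth having.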

We're not just talking about Disaster Recovery or Fault Tolerance either - that's just the technology. These need to be incorporated within the holistic continuity of the business itself too.

Final Thoughts

Business Continuity isn't a simple topic. In fact, there's a whole field dedicated to it. Even if you're not looking at developing a full Business Continuity Plan, it's worth looking at the basics of the topic for your organization, no matter your industry or scale.

Mitigating risks has costs, in both time and money. The question is whether a risk is likely enough to occur that offsetting it is worth the cost.
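One simple way to frame that question is to compare the expected yearly loss a mitigation would avoid against what the mitigation costs each year. The sketch below assumes a very simplified model, and every number in it is made up for illustration; real risk assessments involve far more judgment than a single multiplication.

```python
def expected_annual_loss(probability_per_year: float, cost_per_incident: float) -> float:
    """Expected yearly loss: likelihood of the incident times its cost."""
    return probability_per_year * cost_per_incident


def worth_mitigating(
    probability_per_year: float,
    cost_per_incident: float,
    mitigation_cost_per_year: float,
    risk_reduction: float,
) -> bool:
    """True if the expected loss avoided exceeds the yearly mitigation cost.

    risk_reduction is the fraction of the expected loss the mitigation
    removes (0.0 = no effect, 1.0 = eliminates the risk entirely).
    """
    avoided = expected_annual_loss(probability_per_year, cost_per_incident) * risk_reduction
    return avoided > mitigation_cost_per_year


# Made-up example: an incident expected once every two years, costing
# $4,000 in lost trade, against a $600/year backup that mostly covers it.
print(worth_mitigating(0.5, 4000, 600, 0.9))

# The same backup for an incident expected once every fifty years.
print(worth_mitigating(0.02, 4000, 600, 0.9))
```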

Stephen Sennett

Stephen is a cloud technology leader. He has worked in the industry for over a decade, holds high-level technical certifications, has spoken at events around Australia and internationally, and is recognized as an AWS Community Builder.