Leading Disaster Recovery on AWS Serverless

Did you just waste your company’s time and money with your serverless solution’s disaster recovery strategy on AWS? What if your disaster recovery plan takes longer to get your system back online than the outage lasts? It’s important to have a plan for when a disaster happens. While serverless solutions tend to be highly available and tolerant of datacenter outages, a regional outage can cause significant issues for your business and customers. Regional disaster recovery falls under Pillar 3: Reliability of the Well-Architected Framework, and is also now a requirement for partnering with AWS and with many businesses in the public and private sectors. This means you now need a solid disaster recovery plan.

Managing disaster recovery for serverless solutions is significantly different from managing it for traditional ones, although the goal remains the same: to maintain business continuity when a region suffers an outage, whether because of an issue within AWS or because of a regional catastrophe (such as wide-ranging wildfires which somehow impact all the data centers in the region). In this type of event, it can be important to get your services back up in a timely manner so your customers can access them. In some industries, like medicine or emergency response, your tolerance for these outages is zero and you need your systems back up in seconds. In other industries, such as photo storage, bringing your systems back up within a few days could be acceptable. All of this shapes both the options available to you for disaster recovery and your reasons for doing it.

What is disaster recovery (DR)?

Disaster recovery describes the processes and steps needed to fully restore your system in a different region. Typically, this includes backing up or replicating your data to that other region, either continually or on a schedule such as a set time of day or a specific day of the week. In traditional architectures this process might be handled by your operations team, which would make sure that your virtual machines and databases were being backed up, then once a year restore those backups to a separate datacenter. With serverless technologies, this traditional approach breaks down once you introduce services which store data, event processing, and resources which operate at a global level. These complexities push the responsibility for disaster recovery back onto the serverless team responsible for developing the system.

When it comes to disaster recovery, there are five types:

The recovery time improves significantly with each subsequent approach, with active/active potentially recovering in seconds. Deciding on the best DR approach for your company really comes down to two measurements we use to determine your tolerance: a recovery point objective (RPO) and a recovery time objective (RTO).

Recovery Point Objective

RPO focuses on the amount of data you can afford to lose. Think about a situation where you collect random votes for news articles based on the sentiment in each article, and the votes are averaged at the end of each month. Losing one day of votes for the month would not significantly impact your service. On the other hand, losing medical information could significantly impact your service.

Recovery Time Objective

RTO focuses on the amount of time your service takes to become available again to your customers. If you have a blog, odds are that you want your blog back up, but if it's off for a day or two it’s not the end of the world. On the other hand, if you handle 911 calls, having your service down for a day or two could be extremely impactful.
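
To make the two measurements concrete, the sketch below checks a recovery plan against an RPO and an RTO. It is a minimal illustration rather than a prescribed method: the class name RecoveryPlan, its fields, the function meets_objectives, and all the sample numbers are hypothetical. It also assumes that the worst-case data loss equals the interval between backups.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryPlan:
    """Hypothetical description of a DR plan (names are illustrative only)."""
    backup_interval: timedelta   # time between backups or replication syncs
    restore_duration: timedelta  # time to restore and bring the service back up


def meets_objectives(plan: RecoveryPlan, rpo: timedelta, rto: timedelta) -> bool:
    """Return True when the plan satisfies both objectives.

    Assumption: the worst-case data loss is one full backup interval,
    because a disaster could strike just before the next backup runs.
    """
    worst_case_data_loss = plan.backup_interval
    return worst_case_data_loss <= rpo and plan.restore_duration <= rto


# Nightly backups with a four-hour restore (made-up numbers).
nightly = RecoveryPlan(timedelta(hours=24), timedelta(hours=4))

# A monthly vote tally can tolerate a day of lost data and two days of downtime.
votes_ok = meets_objectives(nightly, rpo=timedelta(days=1), rto=timedelta(days=2))

# An emergency-call service can tolerate almost nothing.
emergency_ok = meets_objectives(
    nightly, rpo=timedelta(seconds=1), rto=timedelta(seconds=30)
)
```

Under these assumed numbers the same nightly-backup plan is acceptable for the vote tally but not for the emergency service, which is why the two objectives have to be set before a DR approach is chosen.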

Nine Other Considerations

While the RPO and RTO will dictate some options, there are also nine other points that you must consider when leading your organization's disaster recovery strategy.

Leadership Level

The first consideration is the level of the technical leaders in your organization. Solid team and technical leadership, which provides a vision and inspires others in the company, can open up your options, elevate the company's ability to respond to disasters, and reduce the costs of your solutions. On the other hand, if a company has a leader who is lacking in either the technical or the team-related aspects, the more advanced disaster recovery paradigms will be out of reach for the organization.

Team Type

The makeup of a team will also impact your organization's choices in disaster recovery. A legacy development team will struggle with more advanced disaster recovery. By contrast, a team built around the organization culture guidance in Pillar 1: Operational Excellence of the Well-Architected Framework will be better placed to take on the more advanced approaches.

Team AWS Skill

Structuring your team is only half the battle. Nothing beats experience, and disaster recovery implementation is no different. Teams with more experienced AWS practitioners will find it easier to design and implement the more advanced approaches, while less experienced teams will struggle to implement even the more basic ones.

Risk Tolerance

Your options change depending on how much your company relies on data being immediately available and how much data it can afford to lose. The more risk your company can take on, the more palatable the lower paradigms of disaster recovery become.

Average AWS Outage

One of the questions I always ask is: what happens if your disaster recovery takes longer than the typical AWS outage? With the average AWS outage being 6 hours, and a large database restore potentially taking twice that long, will your disaster recovery approach be effective, or merely theoretical?
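
The comparison behind that question is simple arithmetic. The sketch below uses the figures quoted above (a 6-hour average outage and a restore that takes twice as long). The function name failover_is_worthwhile is hypothetical, and the model deliberately ignores everything except the two durations.

```python
from datetime import timedelta


def failover_is_worthwhile(recovery_time: timedelta, expected_outage: timedelta) -> bool:
    """Recovering elsewhere only helps if it finishes before the outage would end anyway."""
    return recovery_time < expected_outage


average_outage = timedelta(hours=6)     # figure quoted in the text
database_restore = 2 * average_outage   # "twice that duration"

worthwhile = failover_is_worthwhile(database_restore, average_outage)

# How much longer customers wait if the restore starts when the outage starts,
# compared with simply waiting for the region to come back.
extra_downtime = database_restore - average_outage
```

With these figures the restore finishes six hours after the region would have recovered on its own, so the plan exists on paper but does not shorten the outage.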

Regional Recovery Time Objective (RRTO)

There is some argument that having multiple data centers in a region is a disaster recovery option. In that case, we need to assess what our Regional Recovery Time Objective is, as that better describes what is being targeted. RTO and RRTO can be synonymous in this regard, with the difference being the scope and location of recovery.

Recovery Requiring Design Changes

In a perfect world, infrastructure as code would automatically work in any AWS account or region. In the real world this often isn't the case. A disaster is already an extremely stressful event, and it is made worse if the CloudFormation stacks or other resources you're trying to redeploy fail because globally named resources cause naming conflicts.
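
One common way to sidestep such conflicts is to build the account and region into every globally scoped name. The sketch below is an illustration only: the helper regional_name, the base name "orders-uploads", and the account ID are hypothetical. The default length limit reflects the 63-character maximum for Amazon S3 bucket names, which must be globally unique.

```python
def regional_name(base: str, account_id: str, region: str, max_length: int = 63) -> str:
    """Build a resource name that is distinct per account and region.

    max_length defaults to 63, the S3 bucket name limit. Other resource
    types have different limits, so pass the right value for each.
    """
    name = f"{base}-{account_id}-{region}".lower()
    if len(name) > max_length:
        raise ValueError(f"{name!r} is {len(name)} characters; limit is {max_length}")
    return name


# The same template input yields different names in the primary and recovery regions.
primary = regional_name("orders-uploads", "123456789012", "us-east-1")
recovery = regional_name("orders-uploads", "123456789012", "us-west-2")
```

In a CloudFormation template the same idea is usually expressed with the AWS::AccountId and AWS::Region pseudo parameters, so that a stack deployed to a second region never asks for a name the first region already holds.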

Opportunity Level

Opportunity Level defines what other business capabilities or cost reductions open up as part of your disaster recovery decision making. Building a disaster recovery solution which enables business expansion can change disaster recovery from a cost center to a profit center, making the expense more palatable to the business.

Up Front Development Costs

Everything in IT is about investing in outcomes, and disaster recovery is no different. The up-front costs to build a disaster recovery solution can be a major driving force in your organization's decision making.

Conclusion

Understanding the driving forces behind disaster recovery planning can help you see which options will work for your business and which will not. This article is the first part of a series which will outline the costs of each type of disaster recovery approach, and how each will impact your organization and its use of AWS.