AWS CSA Professional Examination Notes: High Availability and Business Continuity

Let's start with a snippet from the "AWS Certified Solutions Architect Professional Exam Blueprint" that lists the knowledge evaluated for Domain 1.0: High Availability and Business Continuity
1.1 Demonstrate ability to architect the appropriate level of availability based on stakeholder requirements
1.2 Demonstrate ability to implement DR for systems based on RPO and RTO
1.3 Determine appropriate use of multi-Availability Zones vs. multi-Region architectures
1.4 Demonstrate ability to implement self-healing capabilities

This subject area accounts for 15% of the AWS Certified Solutions Architect - Professional exam, so you need a solid understanding of the subject and the use cases around it. The best way to learn is to go through white papers, YouTube videos, AWS Masterclass and deep dive sessions. Although I always prefer to go through the AWS-provided documentation, I am preparing these notes to summarize this vast subject area and to have a handy last-minute reference before the examination. In my view, understanding the subject area is the key success factor. A qualified, efficient solution architect must have strong knowledge of what high availability is and of how to design a highly available architecture, a scalable architecture, a fault-tolerant system architecture and so on. They must also understand business continuity in case of a disaster. This subject is universal, not specific to any technology, product or platform: it is a set of problems common to any business system, which must be resolved in a cost-effective way, taking into account the business impact and the RTO and RPO agreed with stakeholders. For the exam, the solution to a business problem must be built from the available AWS services, products, and offerings. In other words: how can a business solve its problem cost-effectively and achieve its targeted high availability, scalability and fault tolerance, along with disaster recovery within the agreed RTO and RPO, by using the AWS Cloud platform?

What are High Availability, Business Continuity, and Disaster Recovery?

These are three different kinds of planning, but they are interconnected aspects of any IT ecosystem.
High availability means that a service must remain available for a long time without failure. A system must be designed so that your business application stays up and running to serve your customers even when individual components break or fail for any reason. This means designing a system that has no single point of failure: each layer of the system stack should have redundancy. Redundancy alone, however, is not enough to make a system highly available and free of downtime; you also need a failure detection mechanism or process that detects failures early and takes appropriate action to prevent your stack from going down.

Let's see which elements can bring your system down and therefore need redundancy in place. Please note that this is not limited to software or hardware; it could be several things.
  • Environment location - An outage is possible if the data center is in a single location, due to riots, strikes, or natural calamities such as earthquakes and floods that take the system down
  • Hardware infrastructure - Hardware can fail for many reasons, such as a crashed disk, a power failure, a burned-out circuit or overheating. Any known or unknown cause can bring the system down and lead to an outage
  • Software - An outage can also be caused by any software component: the whole operating system may crash, your application components may crash, or some other software component used by your application may crash, and the failure of that independent module stops your service. Root cause analysis and resolution are separate tasks; the first priority is to design systems that can tolerate any kind of fault and remain available for business even if some components are down. Systems should support all critical functionality close to 24x7 throughout the year, taking planned and unplanned outages into account
  • Data - Inconsistent data, which can be caused by several factors, is another potential cause of business system breakdown. Data safety must be considered as part of high availability to prevent failure events. This is not limited to data disk failure
  • Network - This is the communication backbone of any business system; if it fails, the system will be unavailable. There are several causes of network failure, planned or unplanned: a burned-out network device, a cut or unplugged network cable, an interrupted power supply to network devices, and many more
In short, every possible reason behind a system outage must be considered and evaluated when designing a highly available system architecture, so that the business can deploy a system that stays up even when individual components are down.
A premium high availability system alone is not a good deal for a business; the goal is a premium high availability solution at low cost, or at least a very cost-effective one. Therefore the AWS Certified Solutions Architect Professional exam evaluates the candidate's skill in designing the most cost-effective high availability architecture that meets the stakeholder requirements, and in using the AWS platform and products to lower the business's costs. Every business organization has its own strategy for system outages, based on the nature of the business, its criticality, and the cost versus benefit of investing a significant amount in high availability: live-live, active-passive, or backup and restore.

In the AWS Certified Solutions Architect - Professional exam, the expectation is that you demonstrate the ability to architect the appropriate level of availability based on stakeholder requirements. This is the key success factor of any high availability solution: it must meet the stakeholder requirements in a cost-effective way.

"Disaster Recovery" is, again, about the availability of the system, this time in case of a disaster: how quickly the system can be restored (RTO, the recovery time objective) and how much data, measured as a period of time, can be lost (RPO, the recovery point objective). Typically, a high availability system has a more demanding RTO and a more demanding RPO, with almost zero user disruption, than a disaster recovery scenario. A high availability solution provides automated failover to a backup system, so the system remains available and users can continue without disruption. A high availability system must also provide a significantly better recovery point than a non-high-availability system.
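To make RTO and RPO concrete, here is a minimal sketch of checking a recovery plan against the objectives agreed with stakeholders. The `RecoveryPlan` class, its fields and the simplifying assumption that worst-case data loss equals the interval between backups are all illustrative choices of mine, not part of any AWS API.

```python
from dataclasses import dataclass


@dataclass
class RecoveryPlan:
    # Hypothetical fields, chosen only for this illustration.
    backup_interval_min: float  # time between backups or replication points
    restore_time_min: float     # time to detect the failure and restore service


def meets_objectives(plan: RecoveryPlan, rpo_min: float, rto_min: float) -> bool:
    """Return True if the plan satisfies both RPO and RTO.

    Assumption: in the worst case the failure happens just before the next
    backup, so the data lost covers one full backup interval.
    """
    worst_case_data_loss_min = plan.backup_interval_min
    return worst_case_data_loss_min <= rpo_min and plan.restore_time_min <= rto_min


nightly_backup = RecoveryPlan(backup_interval_min=24 * 60, restore_time_min=180)
frequent_snapshots = RecoveryPlan(backup_interval_min=15, restore_time_min=30)

# Stakeholders agreed on RPO = 1 hour and RTO = 4 hours (example values).
print(meets_objectives(nightly_backup, rpo_min=60, rto_min=240))
print(meets_objectives(frequent_snapshots, rpo_min=60, rto_min=240))
```

The nightly backup restores quickly enough but can lose up to a day of data, so it misses the RPO; the more frequent snapshots meet both objectives, usually at a higher cost.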

Business Continuity is the planning of how the entire business keeps functioning during an emergency. This is a major exercise, much bigger than the IT systems of any business. It covers all the bits and pieces needed to keep the business functioning in an emergency (in the aftermath of a disaster). It is not limited to IT infrastructure; it is planning for the whole business unit to keep your full business running. It begins with a very deep understanding of internal and external threats, and it includes logistics for emergencies such as natural disasters, sabotage, power grid failures and fires. Based on the impact, commonly referred to as the "Level", the organization plans in advance for the availability of its IT systems, the locations where people will work and support the business, the equipment to be used, the space to be utilized and the communication between teams.

As a solution architect, how can you design a system that meets an appropriate level of availability based on stakeholder requirements? How can you design an availability solution on the AWS Cloud platform that meets stakeholders' expectations? Many things need to be addressed in an availability design, and the solution is always very specific. The best way to understand availability design is to go through the published AWS white papers and the deep dive sessions available on YouTube, or to search for them online. To score in the AWS Certified Solutions Architect - Professional exam, you must demonstrate the ability to architect the appropriate level of availability based on stakeholder requirements.

What is availability, and what does the number of '9's mean when we talk about availability?

Availability means that a service should be available 24x7 throughout the year, apart from agreed outage hours for planned or unplanned events. The outage hours are normally expressed in SLA documents as a number of '9's. Let's understand what that means as total outage duration per year.

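The estimated downtime depends on the number of '9's: each additional '9' reduces the total hours of outage allowed in a year. For example, 99.9% availability ("three nines") allows about 8.76 hours of downtime per year.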
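The arithmetic behind the number of '9's can be sketched in a few lines of Python. This is only an illustration; it assumes a 365-day year (8,760 hours) and counts planned and unplanned outages together, and the function name is my own.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def allowed_downtime_hours(availability_percent: float) -> float:
    """Hours of outage per year permitted by an availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


# Two, three, four and five nines.
for pct in (99.0, 99.9, 99.99, 99.999):
    hours = allowed_downtime_hours(pct)
    print(f"{pct}% availability -> {hours:.2f} hours/year ({hours * 60:.1f} minutes/year)")
```

Each extra '9' cuts the allowed downtime by a factor of ten, which is why every additional '9' in an SLA is so much harder and more expensive to achieve.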

To achieve the availability SLA, the system must be redundant. A "design for failure, so that nothing fails" approach is the key success factor in designing an availability solution. Availability, continuity, and redundancy are closely related: as the SLA increases, your availability design becomes more complex because it needs more redundancy.
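The link between redundancy and availability can be illustrated with standard reliability arithmetic. The sketch below assumes that components fail independently of each other, which real systems only approximate; the function names and the 99% figures are illustrative.

```python
from math import prod


def series(*availabilities: float) -> float:
    """Every component is required, so the availabilities multiply."""
    return prod(availabilities)


def parallel(availability: float, copies: int) -> float:
    """Redundant copies: the tier is down only if every copy is down."""
    return 1 - (1 - availability) ** copies


# A web tier and a database tier, each 99% available, with no redundancy.
single = series(0.99, 0.99)

# The same two tiers, each deployed as two redundant copies.
redundant = series(parallel(0.99, 2), parallel(0.99, 2))

print(f"no redundancy:   {single:.4%}")
print(f"with redundancy: {redundant:.4%}")
```

Chaining components lowers availability, while adding redundant copies at each layer raises it, which is why a more demanding SLA leads to a more complex and more redundant design.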

As the availability SLA becomes more demanding, you need to implement redundancy at each layer of your system and follow availability best practices and principles rigorously. These are:
1. No single point of failure
2. Multi-locations
3. Scalable systems
4. Self-healing
5. Loose coupling 

Designing High Availability Systems

In a traditional infrastructure environment, high availability is a complex process that requires a lot of analysis, accurate forecasting of demand and resources, and upfront investment. With cloud computing, much of this is eliminated and high availability solution design is simplified. There is no need to guess capacity; it can be scaled out or scaled down in a few minutes. A cloud platform provides programmable infrastructure that brings many benefits and advantages to a customer who wants a cost-effective, efficient high availability solution for critical IT systems. Using the AWS Cloud platform simplifies the process, cost and planning around all three: High Availability, Disaster Recovery and Business Continuity. Let's briefly discuss the AWS products, features and best practices that can help a customer build robust, effective, low-cost high availability solutions.

Programmable IT Resources - This is the key benefit of cloud computing. All IT assets can be provisioned through a script in a few minutes. This removes much of the capacity planning and forecasting of resource needs for peak and off-peak hours. It also helps the customer eliminate the cost of keeping expensive resources idle: you can spin up resources quickly when you need them and stop or terminate (decommission) them when your job is done or the load is down. With cloud computing, dynamic scaling is possible in a very short time, and capacity can follow resource demand based on load, lowering your cost through the "Pay as you Go" billing model.
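The idea of capacity following load can be sketched as a simple scaling decision. This is a simplified illustration, not how any particular AWS service computes capacity; the function name, the per-instance capacity and the size limits are hypothetical.

```python
import math


def desired_instances(requests_per_sec: float,
                      capacity_per_instance: float,
                      min_size: int = 1,
                      max_size: int = 10) -> int:
    """Return how many instances are needed for the current load.

    Assumption: each instance handles a fixed number of requests per second,
    and the fleet is kept between a minimum and a maximum size.
    """
    needed = math.ceil(requests_per_sec / capacity_per_instance)
    return max(min_size, min(max_size, needed))


# Off-peak, normal and peak load for instances that each handle 100 requests/sec.
for load in (0, 120, 950, 5000):
    print(load, "requests/sec ->", desired_instances(load, 100), "instances")
```

Because resources are billed only while they run, scaling in at off-peak hours is what turns this calculation into a cost saving.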

AWS provides servers, databases, storage and higher-level application components that can be instantiated very quickly. These resources are treated as temporary infrastructure and can be disposed of after use. This lets the customer approach change management, reliability, testing and capacity planning differently from the way they would with a traditional data-center-based infrastructure.

Highly Available, Global Infrastructure with Virtually Unlimited Capacity - The AWS Cloud platform is organized into Regions, which are physical locations around the world where AWS has its data centers. A Region consists of multiple Availability Zones, and each Availability Zone is made up of one or more data centers. An Availability Zone is isolated from the other Availability Zones in the Region, so a failure in one Availability Zone should not affect the others. All data centers are equipped with redundant power supply, networking, and security. Multiple Availability Zones in a Region allow you to run a production environment across them, and a database can be deployed across multiple Availability Zones to achieve high availability and fault tolerance. The AWS global infrastructure also meets business requirements such as proximity to the customer, compliance requirements, and data residency laws and regulations. It also reduces latency for end users by serving content from edge locations through the CloudFront content delivery network. AWS provides virtually unlimited capacity; you just need to think differently when you plan your IT infrastructure expansion.

Managed services: AWS provides managed services that are highly scalable and available while lowering complexity and cost. They enable customers to use high-end, emerging technology without in-house expertise. This speeds up implementation, improves time to market and reduces implementation risk.

Security built in: Instead of periodic, manual security audits, AWS provides governance capabilities that enable continuous monitoring of AWS resources. The programmable nature of AWS resources also allows formalized security policies to be embedded in the design of the infrastructure. Because AWS resources are temporary in nature, you can also spin up an environment just to test security policies. AWS provides a plethora of native security and encryption features that a solution architect can leverage to achieve high levels of data protection and compliance.

Design Principles 

Before a deep dive into the AWS Cloud platform and the product offerings that simplify high availability and disaster recovery solutions, let's have a quick look at the design principles that play a vital role in high availability solution design.

Scalability - If an application is expected to grow over time, it should be built on a scalable architecture. A scalable architecture supports growth in user volume, traffic or data size without any drop in application performance. It allows extra capacity to be added in a roughly linear manner to support that growth without compromising performance or user experience. There are two general ways to scale your application.
  • Vertical Scalability - Also known as "scale-up". You can add more CPU cores or memory to an existing box to increase the overall processing capacity of the application server, up to a limit. This is the simplest way to scale a system. However, the instance must be stopped to add CPU cores or memory, so an outage is required. Although vertical scaling can be used for both stateless and stateful applications, it is most appropriate for stateful applications such as databases, because of their architecture
  • Horizontal Scalability - Also known as "scale-out". You can add extra instances to your system and distribute the increased load across the newly added instances. Horizontal scaling does not need any outage, but you do have to deal with application sessions. It is most appropriate for applications that support stateless transactions. For stateful applications, horizontal scaling is implemented by taking care of user sessions or previous transaction information. There are several solutions, each with pros and cons. You can choose session affinity (sticky sessions), which routes existing users' transactions to the old instances and new user sessions to the newly added instance; this approach does not redistribute the existing load to the new instances. Alternatively, you can manage state in a database or in HTTP browser cookies, but this also has drawbacks: users can tamper with cookies, and the extra lookups can increase request latency. Whatever you choose, session state should not live in local files or caches on an instance; it should be kept in a shared store
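The shared-store approach to sessions can be sketched as follows. This is a toy, in-memory illustration: `SessionStore` stands in for an external database or cache, and all class and instance names are hypothetical.

```python
from itertools import cycle


class SessionStore:
    """Stand-in for a shared store (a database or cache) outside the instances."""

    def __init__(self):
        self._sessions = {}

    def get(self, session_id):
        # Create an empty session the first time a user is seen.
        return self._sessions.setdefault(session_id, {"cart": []})


class AppInstance:
    """A stateless application instance: it keeps no session data locally."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def add_to_cart(self, session_id, item):
        session = self.store.get(session_id)
        session["cart"].append(item)
        return self.name, list(session["cart"])


store = SessionStore()
instances = [AppInstance("app-1", store), AppInstance("app-2", store)]
round_robin = cycle(instances)  # simplest possible load balancing

# Two requests from the same user land on different instances.
first = next(round_robin).add_to_cart("user-42", "book")
second = next(round_robin).add_to_cart("user-42", "pen")
print(first)
print(second)
```

Because the cart lives in the shared store, the second instance sees the item added through the first one, so new instances can take existing users' traffic immediately and no sticky sessions are needed.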

Disposable Resources instead of Fixed Resources (Self-Healing): The second design principle is to use disposable resources. This means using temporary resources to process your task and releasing them back to the pool once the task is completed. If an instance is unhealthy, it is replaced by a new instance. With a traditional architecture, resources are dedicated even when they are sitting idle while other processes wait for resources, which in most cases is not an optimized use. A disposable resource architecture lets you create a VM from a predefined image, use it to process requests, and terminate it once the requests have been processed. This design pattern also addresses configuration drift on long-running instances: they can be taken out of service for patch upgrades or security configuration changes and replaced by new instances, which ensures that instances have the latest configuration and have been tested properly. The pattern is based on "Infrastructure as code": you create a script that builds your instances according to your configuration, AMIs and bootstrapping.
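The self-healing idea, replacing unhealthy instances rather than repairing them, can be sketched as a small simulation. Nothing here calls AWS; `launch_from_image` is a hypothetical stand-in for launching an instance from a predefined image, and the ID format is made up.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Instance:
    instance_id: str
    healthy: bool = True


_ids = count(1)


def launch_from_image() -> Instance:
    """Hypothetical stand-in for launching a fresh instance from an image."""
    return Instance(instance_id=f"i-{next(_ids):04d}")


def heal(fleet, desired_size):
    """Drop unhealthy instances and launch replacements up to the desired size."""
    survivors = [inst for inst in fleet if inst.healthy]
    while len(survivors) < desired_size:
        survivors.append(launch_from_image())
    return survivors


fleet = [launch_from_image() for _ in range(3)]
fleet[1].healthy = False  # a health check has failed for the second instance

fleet = heal(fleet, desired_size=3)
print([inst.instance_id for inst in fleet])
```

The unhealthy instance is never patched or debugged in place; it is discarded and a new one built from the same image takes over, which is also what keeps configuration drift from accumulating.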

Loose Coupling - Systems should be designed so that interdependencies between components are kept to a minimum, and a change or failure in one component does not cascade to the others. Well-defined interfaces reduce interdependencies: components can interact through technology-agnostic REST APIs that hide each component's internal implementation from the others. You also need to implement service discovery so that services can find and interact with each other. Because these services run on different instances that may be new or replaced at any time, the traditional approach of using fixed IP addresses is not appropriate; there must be a dedicated mechanism to discover the services to be called.
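Service discovery can be sketched as a registry that callers query by service name instead of hard-coding IP addresses. This in-memory `ServiceRegistry` is a hypothetical illustration, not a specific AWS feature, and the endpoints are made-up example values.

```python
class ServiceRegistry:
    """Toy registry mapping a service name to its currently live endpoints."""

    def __init__(self):
        self._endpoints = {}

    def register(self, service, endpoint):
        self._endpoints.setdefault(service, set()).add(endpoint)

    def deregister(self, service, endpoint):
        self._endpoints.get(service, set()).discard(endpoint)

    def lookup(self, service):
        endpoints = sorted(self._endpoints.get(service, ()))
        if not endpoints:
            raise LookupError(f"no endpoint registered for {service!r}")
        return endpoints[0]  # a real registry would balance across endpoints


registry = ServiceRegistry()
registry.register("orders", "10.0.1.15:8080")
before = registry.lookup("orders")

# The instance is replaced: the old endpoint disappears, a new one registers.
registry.deregister("orders", "10.0.1.15:8080")
registry.register("orders", "10.0.2.27:8080")
after = registry.lookup("orders")

print(before, "->", after)
```

Callers only ever know the name "orders", so replacing the instance behind it requires no change on their side.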

--> To be continued in upcoming Part 2 (AWS Product offerings for High availability and business continuity - efficient  and cost-effective) 

