High availability, fault tolerance, and disaster recovery are crucial aspects to consider when designing a system.
These terms are frequently used by developers and architects. They are not, however, the same thing, and understanding the difference will save you a lot of headaches, time, and money.
This article discusses the distinctions between the three terms and explains how to implement them in AWS.
Highly Available vs. Fault Tolerant vs. Disaster Recovery
A highly available system strives to be up and running as often as possible. Downtime is still possible in a highly available system; the goal of high availability is to reduce the amount of downtime, not to eliminate it completely.
A fault-tolerant system can keep running through a failure without interruption. The goal of fault tolerance is to prevent any disruption of service.
In the event of a total system failure, however, high availability and fault tolerance may not be enough. A disaster recovery plan outlines how a system will continue to function when a system-wide failure has wiped out both its high availability and its fault tolerance.
What Does High Availability Mean?
Before we begin, let’s define what high availability does not mean. It does not mean that the system is in constant operation or never experiences downtime. A highly available system is merely one that aims to be up and running as often as possible.
Imagine that we have a pizza restaurant that is open every day of the year. If the restaurant is run by just one chef, its availability, that is, its capacity to handle orders, won’t be 100%. This is because the chef will only work around 8 hours per day with a one-hour break, that is, seven hours per day, seven days a week.
A higher availability restaurant
Even if you hire enough chefs to cover every hour of the day, 100% availability is only a theoretical figure, since it assumes that no chef is ever absent during the year. That isn’t realistic: chefs may get sick, their cars can break down on the way to work, or they may need to leave early to collect their children.
Let’s assume that all chef downtime adds up to five hours over a year. That gives you an availability of 99.94 percent.
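The arithmetic behind that figure can be checked with a short Python sketch. It assumes the restaurant is meant to take orders during every hour of a 365-day year (8,760 hours); the function name is purely illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, assuming a non-leap year and round-the-clock opening


def availability_percent(downtime_hours: float, total_hours: float = HOURS_PER_YEAR) -> float:
    """Return availability as a percentage of the total scheduled hours."""
    if not 0 <= downtime_hours <= total_hours:
        raise ValueError("downtime must be between 0 and the total hours")
    return (total_hours - downtime_hours) / total_hours * 100


# Five hours of chef downtime over the year.
print(f"{availability_percent(5):.2f}%")  # prints 99.94%
```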
What can you do to make your restaurant more available? Find standby chefs who are willing to come to the restaurant at a moment’s notice. But this comes at a cost, since you will have to pay the chefs to stay on call until they’re required.
What these standby chefs provide is the ability to recover quickly when there aren’t enough chefs to satisfy customer orders. There is no way to guarantee 100% availability, given the limitations of the real world. You can only get closer to it at an ever-increasing cost.
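One way to see why each extra standby chef buys less than the last is a simple redundancy model. The sketch below assumes that every chef is independently available the same fraction of the time and that the restaurant is up whenever at least one chef is. Both are simplifying assumptions that the article does not make, and the 0.9 figure is invented for illustration.

```python
def combined_availability(single: float, replicas: int) -> float:
    """Availability of `replicas` independent, identical units when one is enough."""
    if not 0 <= single <= 1:
        raise ValueError("single must be a fraction between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The system is down only when every replica is down at the same time.
    return 1 - (1 - single) ** replicas


# Hypothetical chefs who are each available 90% of the time.
for chefs in range(1, 5):
    print(chefs, f"{combined_availability(0.9, chefs):.4%}")
```

In this model each additional chef removes a smaller slice of downtime than the previous one while the wage bill keeps growing, and the result approaches 100 percent without ever reaching it.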
What is Availability in a System?
Availability refers to the likelihood that the system can respond to a request.
It is important to note that high availability says nothing about the quality of the pizzas or the speed at which they’re delivered. High availability is concerned only with the restaurant’s capacity to fulfill customers’ pizza orders.
Major cloud providers typically publish SLAs (service level agreements) that define the availability of their services.
Consider blob storage. Amazon S3 Standard, for example, is designed for 99.99 percent availability, and Azure Blob Storage and Google Cloud Storage publish comparable availability targets.
What does 99.99 percent availability mean? It means that, over any given year, the service is expected to be up and running 99.99 percent of the time. An uptime of 99.99 percent equals a downtime of 0.01 percent, which amounts to around 53 minutes, just under an hour, over the entire year.
What about a system that is available 99.9 percent of the time? Such a system has a downtime of 0.1 percent, which equals about 8.8 hours per year.
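The conversion from an availability percentage to yearly downtime is simple enough to script. This is a minimal sketch that assumes a 365-day year; the function name is illustrative only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def yearly_downtime_minutes(availability_pct: float) -> float:
    """Return the downtime per year, in minutes, implied by an availability percentage."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be between 0 and 100")
    return (100 - availability_pct) / 100 * MINUTES_PER_YEAR


for target in (99.9, 99.99):
    minutes = yearly_downtime_minutes(target)
    print(f"{target}% -> {minutes:.1f} minutes/year ({minutes / 60:.2f} hours)")
```

Running it gives roughly 525.6 minutes (8.76 hours) for 99.9 percent and 52.6 minutes for 99.99 percent, in line with the rounded figures quoted in the text.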
Although 99.9 percent availability may sound high, for a payment processor, an air traffic management system, or another critical system, that much downtime may not be acceptable.
What level of availability should you aim for? It depends on the needs of the system that you are building.
What Does Fault Tolerance Mean?
If a failure occurs within a system, can the system keep functioning without interruption? If so, the system is fault-tolerant.
What’s the difference between high availability and fault tolerance? When a system is highly available, downtime is still possible. Failures can occur, just not often, and the system can recover from them. However, while the system is down, it is unable to respond to requests.
Let’s use the pizza shop to illustrate. If the restaurant suffers a power outage, no matter how many chefs are at the counter or standing by, they will not be able to cook pizzas for customers, because the ovens require power.
An emergency generator that kicks in immediately when a power failure is detected would make the restaurant fault-tolerant.
Another excellent example is a multi-engine commercial jet. These aircraft are designed to be fault-tolerant: when one engine fails, the aircraft can keep flying and land without interruption and without the need to repair the failed engine in flight.
Single-engine aircraft and helicopters, however, aren’t fault-tolerant. The failure of the engine means the aircraft can no longer fly. These failures can be severe and are the main reason for the higher rate of crashes among single-engine aircraft and helicopters compared to twin-engine aircraft.
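In software terms, the backup generator corresponds to a redundant component that takes over the moment the primary one fails, so the caller never sees the failure. A minimal Python sketch of that idea follows. The `PowerSource` class and the `draw_power` function are hypothetical names invented for illustration, not part of any library.

```python
from typing import Sequence


class PowerSource:
    """A hypothetical power source that can be switched off to simulate a failure."""

    def __init__(self, name: str, healthy: bool = True) -> None:
        self.name = name
        self.healthy = healthy

    def supply(self) -> str:
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down")
        return self.name


def draw_power(sources: Sequence[PowerSource]) -> str:
    """Try each source in order and return the name of the first one that works."""
    for source in sources:
        try:
            return source.supply()
        except RuntimeError:
            continue  # Fail over to the next source without surfacing the error.
    raise RuntimeError("all power sources are down")


grid = PowerSource("grid")
generator = PowerSource("generator")

print(draw_power([grid, generator]))  # the grid serves the request
grid.healthy = False                  # simulate a power cut
print(draw_power([grid, generator]))  # the generator takes over
```

The ovens keep running through the power cut. Only when every source has failed does the call raise an error, and at that point fault tolerance has nothing left to offer.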
What Does Disaster Recovery Mean?
If a failure is so severe that the system’s high availability and fault tolerance have both been neutralized, can the system continue to function?
Let’s return to the restaurant example. If a flood, fire, or another disaster strikes your pizza shop, how can you keep making pizzas for your customers?
This is a tongue-in-cheek scenario since, in a real incident, worrying about customers’ orders would not be the top concern. The reasoning behind the example remains valid, however.
In this situation, high availability is not going to help. Even with an infinite number of chefs in the kitchen or on standby, a restaurant engulfed in flames produces no pizza for customers.
Fault tolerance is also ineffective. A backup generator is of no use if the appliances it’s designed to power have been destroyed.
The only way for the system (the restaurant) to continue to function is by transferring orders to a nearby restaurant that is not affected by the fire. Disaster recovery is a planned course of action that outlines the steps to get back on track following a catastrophe.
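The plan of redirecting orders to a nearby restaurant can be sketched as a routing rule: orders go to the primary site while it is operational and to a recovery site once it is not. The site names and the `Restaurant` and `route_order` definitions below are hypothetical, and a real disaster recovery plan involves far more than a routing rule.

```python
from dataclasses import dataclass


@dataclass
class Restaurant:
    """A hypothetical site that can take orders unless it has been put out of action."""
    name: str
    operational: bool = True


def route_order(order: str, primary: Restaurant, recovery: Restaurant) -> str:
    """Send the order to the primary site, or to the recovery site after a disaster."""
    if primary.operational:
        return f"{order} -> {primary.name}"
    if recovery.operational:
        return f"{order} -> {recovery.name}"
    raise RuntimeError("no site is able to take orders")


main_street = Restaurant("Main Street")
riverside = Restaurant("Riverside")

print(route_order("margherita", main_street, riverside))  # handled by the primary site
main_street.operational = False  # simulate a fire at the primary site
print(route_order("margherita", main_street, riverside))  # handled by the recovery site
```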