High availability vs fault tolerance – what are they? These are two IT concepts that are closely related, but have different requirements and outcomes. This article will explore the difference between high availability (HA) and fault tolerance (FT) in regard to IT infrastructure, highlighting the key differences, use cases, and benefits of each so that you can ensure that your infrastructure meets its unique reliability and availability requirements.
What is high availability (HA)?
“Availability” in this context refers to whether users or services can access a system. If a system is not available, it can’t perform its task. Availability is usually expressed as the percentage of time a service is available per year: An application with 99% availability is expected to be available for 361.35 days per year (0.99 multiplied by 365 days in a year). Many businesses that provide online services state the level of availability they guarantee in a contract, as part of their service level agreement (SLA).
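The same arithmetic can be turned around to show how much downtime a given availability target permits. The following is a minimal sketch; the function name `allowed_downtime_hours` and the 365-day year are assumptions made for illustration, not part of any standard.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year, matching the 365-day figure above


def allowed_downtime_hours(availability: float) -> float:
    """Return the yearly downtime budget, in hours, for an availability fraction."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


# Print the downtime budget for a few common availability targets
for target in (0.99, 0.999, 0.9999):
    hours = allowed_downtime_hours(target)
    print(f"{target:.2%} availability allows about {hours:.2f} hours of downtime per year")
```

At 99%, the budget works out to roughly 87.6 hours (3.65 days) per year, which is why each additional "nine" in an SLA is a significant commitment.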
High availability is an architecture in which systems and software are deployed and configured to prioritize uptime and the ability to respond to requests. High availability is usually considered to mean 99% uptime or greater, and any service interruption is generally expected to be momentary and infrequent, if it occurs at all. An application that promises 99% availability should not be offline for days at a time, or even for minutes at a time, even if the total downtime still falls within the permitted amount.
Even if you aren’t providing services to other companies, many organizations require high availability to ensure the uninterrupted operation of their own businesses. For example, if you have servers hosting your sales or e-commerce software, those servers should be designed for high availability to ensure that customers are never prevented from spending money with you.
What is fault tolerance (FT), and how is it different from HA?
Fault tolerance takes things a step further than high availability, aiming to create infrastructure that doesn’t suffer any interruptions at all. The goal is the same as high availability — preventing service interruption — but the mechanisms used to add fault tolerance to systems are much more complex and costly.
Because of this, many tech teams have to decide whether 100% uptime is worth the overhead, or whether a certain level of high availability is more practical and cost-effective than a fully fault-tolerant system. For example, an online store designed for high availability going offline for a couple of seconds when there are no customers (to fix a bug or install an update) might be considered acceptable. An air traffic control system dropping out periodically for any amount of time, on the other hand, is completely unacceptable, and all systems involved must have zero downtime.
How do they work?
High availability
High availability implements several key mechanisms, depending on the kind of services being provided:
- Replication and mirroring: Replication is used to ensure that there are multiple up-to-date copies of data stored on servers running on separate hardware (and even in separate locations). If one fails, users can be quickly redirected to another server. Data integrity should be monitored and checked across systems to ensure that all data is current, complete, and accurate.
- Clustering and load balancers: Web servers, databases, and other services can be hosted on multiple servers, and configured to spread traffic across them so that no individual server is overwhelmed. If an issue causes one server in a cluster to fail, service is uninterrupted as the others continue to operate.
- Redundancy and backup: Regular backups, in addition to replicated copies, ensure that if data is corrupted or damaged, or a system becomes unrecoverable, it can be restored to a known good state.
- Redundant hardware: RAID is a common technology deployed to protect against disk failure. At the network level, having multiple failover devices available (for example, a secondary router and internet connection in case the primary connection drops) ensures that problems like power surges and cut cables do not prevent continued operation.
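To illustrate the clustering and load balancing mechanism described above, here is a minimal, in-memory sketch of a round-robin balancer that skips servers marked as unhealthy. The `Backend` and `RoundRobinBalancer` classes and the server names are hypothetical; a real deployment would rely on a dedicated load balancer with its own health checks rather than code like this.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    """A hypothetical server in the cluster."""
    name: str
    healthy: bool = True


class RoundRobinBalancer:
    """Rotate through backends in order, skipping any that are unhealthy."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("at least one backend is required")
        self._backends = list(backends)
        self._next = 0  # index of the next backend to try

    def pick(self) -> Backend:
        # Try each backend at most once per call
        for _ in range(len(self._backends)):
            backend = self._backends[self._next]
            self._next = (self._next + 1) % len(self._backends)
            if backend.healthy:
                return backend
        raise RuntimeError("no healthy backends available")


pool = [Backend("web-1"), Backend("web-2"), Backend("web-3")]
balancer = RoundRobinBalancer(pool)

pool[1].healthy = False  # simulate web-2 failing
print([balancer.pick().name for _ in range(4)])
```

With `web-2` marked unhealthy, requests alternate between `web-1` and `web-3`, so the failure of one server does not interrupt service.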
Fault tolerance
Fault tolerance adds in further redundancy and fault detection to ensure full resilience against interruption. In essence, a fault-tolerant system combines multiple high availability systems to ensure that as a whole, downtime can be completely avoided even if one of these HA systems completely fails:
- Redundant HA systems (including geographical redundancy): Multiple highly available systems, either complete duplicates of an application's infrastructure or duplicates of its individual components, can be switched to instantly in the event of an outage. Ideally, one or more of these redundant systems will be located in a different geographical region to protect against natural disasters or other emergencies that may result in an entire data center being taken offline.
- Automated fault detection and reversion: Fault tolerant systems must be able to detect issues as they occur and immediately mitigate them, for example by switching to a redundant system while the problem is investigated and repaired. When the issue is resolved, the system should automatically revert to its designed state.
- Removing single points of failure: Any FT system must have no single point of failure. The architecture of the system and of all of its components should be regularly inspected to ensure full redundancy (and functionality of the redundant components).
- Fault containment: Some errors can affect other infrastructure, cascading through systems. For example, unexpected data values in a data pipeline may cause applications to malfunction as the values are passed along. FT systems should monitor for non-critical faults and isolate them (for example, by placing bad messages in a dead letter queue rather than letting them cause potential critical errors that may affect uptime).
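The dead letter queue idea from the fault containment point can be sketched in a few lines. This is an illustrative, in-memory example only: `process_messages`, `parse_order_total`, and the message shape are hypothetical, and a production pipeline would typically use the dead letter facilities of its message broker.

```python
from collections import deque


def process_messages(messages, handler):
    """Apply handler to each message; divert failures to a dead letter queue."""
    processed = []
    dead_letters = deque()
    for message in messages:
        try:
            processed.append(handler(message))
        except (ValueError, TypeError, KeyError) as error:
            # Contain the fault: set the bad message aside with the reason,
            # and keep processing the rest of the stream
            dead_letters.append((message, repr(error)))
    return processed, dead_letters


def parse_order_total(message):
    """Hypothetical handler: expects a dict with a numeric 'total' field."""
    return float(message["total"])


incoming = [{"total": "19.99"}, {"total": "n/a"}, {"amount": 5}, {"total": 3}]
good, bad = process_messages(incoming, parse_order_total)

print("processed:", good)
print("dead letters:", len(bad))
```

The two malformed messages end up in the dead letter queue for later inspection, while the valid ones are processed normally instead of the whole batch failing.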
One of the key technologies used for both HA and FT systems is automated monitoring and alerting: If an issue occurs, tech teams need to be notified so that they can respond immediately to ensure no service interruption.
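Automated detection, failover, reversion, and alerting can be combined in a single control loop. The sketch below is a simplified simulation under stated assumptions: `FailoverController`, the region names, and the `alert` callback are hypothetical, the health probe is supplied by the caller, and the standby is assumed to be healthy when it is needed.

```python
class FailoverController:
    """Switch to a standby when the primary fails, and revert once it recovers."""

    def __init__(self, primary, standby, alert):
        self.primary = primary
        self.standby = standby
        self.alert = alert      # callable that notifies the engineering team
        self.active = primary   # the system currently serving traffic

    def check(self, is_healthy):
        """Run one monitoring cycle and return the system that should serve traffic."""
        primary_ok = is_healthy(self.primary)
        if self.active == self.primary and not primary_ok:
            # Fault detected: fail over and notify
            self.active = self.standby
            self.alert(f"{self.primary} failed its health check; switched to {self.standby}")
        elif self.active == self.standby and primary_ok:
            # Primary repaired: revert to the designed state and notify
            self.active = self.primary
            self.alert(f"{self.primary} recovered; reverted from {self.standby}")
        return self.active


alerts = []
health = {"region-a": True, "region-b": True}
controller = FailoverController("region-a", "region-b", alerts.append)

health["region-a"] = False            # simulate an outage in the primary region
print(controller.check(health.get))   # traffic moves to the standby

health["region-a"] = True             # the primary is repaired
print(controller.check(health.get))   # traffic reverts to the primary
print(alerts)
```

In practice, `check` would run on a schedule and the alert callback would page an on-call engineer or post to an incident channel.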
High availability vs fault tolerance: Benefits and drawbacks
When deciding on the level of high availability you want to achieve for your system, or whether it is appropriate to implement full fault tolerance, you should consider the cost, performance, and technical implications.
Fault tolerant systems are much more costly and complex to implement and maintain than systems designed only for high availability. This is because every system must have multiple redundant duplicates, and for each duplicate, the implementation cost multiplies, as do the ongoing hosting and maintenance costs. Additionally, the larger your IT infrastructure, the more difficult it is to manage with a small team without the right tools.
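The trade-off between redundancy and cost can be estimated with some simple arithmetic. The sketch below assumes that each duplicate fails independently of the others (which is optimistic, since shared dependencies often cause correlated failures) and that cost scales linearly with the number of copies. The function names and the monthly cost figure are hypothetical.

```python
def combined_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant systems, assuming independent failures.

    The service is down only when every copy is down at the same time.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - single) ** copies


def total_cost(unit_cost: float, copies: int) -> float:
    """Simplified cost model: every duplicate costs as much as the original."""
    return unit_cost * copies


UNIT_COST = 1_000.0  # hypothetical monthly cost of one deployment

for copies in (1, 2, 3):
    availability = combined_availability(0.99, copies)
    cost = total_cost(UNIT_COST, copies)
    print(f"{copies} copies: {availability:.6%} available, cost {cost:,.0f} per month")
```

Under these assumptions, each extra copy multiplies the cost while delivering a progressively smaller absolute gain in availability, which is the core of the HA versus FT cost decision.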
Performance and scalability are additional considerations; redundant components in FT do not add to the performance (as they are only called into action in the event of a failure). For HA, when clustering is used to provide failover, other running instances can take some of the workload using a load balancer, improving performance. This can of course be addressed in FT systems by using multiple redundant high availability systems (at great cost).
In both high availability and fault tolerant systems, maintaining redundant copies of data adds resource overhead, as the data must be transferred and verified. This is a potential concern in latency-sensitive use cases (for example, automated stock trading).
High availability and fully redundant infrastructure are designed with two main factors in mind: avoiding downtime, and fast recovery if something does go wrong. The recovery time objective (RTO) of fault tolerant applications is expected to be zero, so recovery has to be instant. For instant recovery, data integrity of replicas must be absolute, so FT systems tend to put a greater focus on the integrity of live data replicas, while HA systems can pause and spend a few minutes recovering from a known good backup.
High availability and fault tolerance: Best practices
When implementing your own high availability or fault tolerant systems, you should ensure that you have the following key technologies and processes covered:
- Full redundancy: Ensure that no single point of failure exists.
- Automated detection and response: Implement systems that identify and mitigate problems automatically, and alert engineering teams of any potential problems.
- Geographical resiliency: For FT systems, ensure redundant systems are available in a different region. For HA, ensure that at least a current data mirror is stored in a different region from which it can be quickly restored.
- Mind your costs: Do not underestimate the compounding costs of deployment and maintenance, especially in fault tolerant systems.
You may decide that high availability or fault tolerance is not worth the increased costs and resources for your organization’s use case. However, this does not make one of the tools they rely on any less critical: backups. Regardless of how complex (or simple) your IT infrastructure is, you must have a robust backup plan in place to make sure that accidents, theft, or disaster do not wipe out the valuable information you rely on.