Site icon Bitwarsoft

A Brief Introduction To Fault Tolerance

Summary: Fault tolerance means the ability of the system to continue to operate uninterruptedly, even if one or more of its components fails. In this article, we will give a more detailed introduction to fault tolerance.

Definition Of Fault Tolerance

Fault tolerance refers to the property that enables the system to continue to function correctly even when some of its components fail. In other words, fault tolerance means how an operating system (OS) responds and allows hardware or software malfunctions and fails.

The ability of OS to recover and tolerate faults can be handled through software, hardware, or a combination solution that leverages load balancers. Some computer systems use multiple duplicate fault tolerance systems to handle faults gracefully, which is called a fault-tolerant network.

Fault-tolerant computing includes several levels of tolerance:

Fault-tolerant systems use backup components that automatically replace failed components to ensure that no break occurs in service.

Fault Tolerance Techniques

  1. ReplicationIt provides multiple identical instances of the same system or subsystem, direct tasks or requests to all instances in parallel, and select the correct results based on arbitration.
  2. Failure-oblivious computingIt enables computer programs to continue to execute despite errors, which can be applied in different contexts.
  3. Recovery shepherdingIt is a lightweight technique that enables software programs to recover from otherwise fatal errors.
  4. Circuit breaker: This design pattern is a technique to prevent catastrophic failures in distributed systems.

Requirements Of Fault Tolerance

The following are the primary characteristic requirements for fault tolerance:

  1. No single point of failureIf the system fails, it must continue to operate during repair without interruption.
  2. Fault isolation to the failing components: In the event of a failure, the system must be able to isolate the fault to the component in question. This requires the addition of dedicated fault detection mechanisms that exist only for fault isolation. Recovery from a fault state requires classification of faults or faulty components
  3. Fault containment to prevent the spread of the failureSome failure mechanisms can cause system failures by the propagation of faults to the rest of the system. The “rogue transmitter” is an example of such a failure that leads to legitimate communication in the system and causes complete system failure. A malicious transmitter or failed component needs to be isolated to protect the system’s firewall or other mechanisms.
  4. Availability of reversion modes.

Disadvantages Of Fault Tolerance

Examples Of Fault Tolerance

Sometimes hardware fault tolerance requires that damaged parts be removed and replaced with new parts while the system is still running. Such systems implemented using a single backup are called single-point tolerance and represent the vast majority of fault-tolerant systems.

Fault tolerance succeeds in computer applications. Tandem Computers build their entire business on such computers, which use a single point tolerance to create their nonstop systems, which are picked up in years.

A fail-safe architecture may also include computer software, such as replication through processes.

The data formats can also be designed to degrade naturally. For example, HTML is designed to be forward-compatible, allowing web browsers that don’t understand them without rendering the document unusable to ignore new HTML entities.

Exit mobile version