Introduction
Fault tolerance (also called fault-tolerant capability) is the ability of a system to keep operating normally when some of its components (one or more) fail.
If the operating quality of the system degrades at all, the degradation is proportional to the severity of the fault. By contrast, a system designed without regard to fault tolerance may break down completely even after a small fault. Fault tolerance is particularly sought after in high-availability or life-critical systems.
The ability of a system to keep functioning when parts of it fail is called graceful degradation, also known as flexible degradation.
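The contrast between a naive design and a gracefully degrading one can be illustrated with a minimal sketch. The function names, the list of unit states, and the per-unit throughput figure are hypothetical, chosen only for illustration:

```python
from typing import Sequence


def naive_throughput(units_ok: Sequence[bool], per_unit: float) -> float:
    # Naive design (assumed for illustration): any single failed unit
    # stops the whole system, so output drops to zero.
    return per_unit * len(units_ok) if all(units_ok) else 0.0


def degraded_throughput(units_ok: Sequence[bool], per_unit: float) -> float:
    # Gracefully degrading design: only the failed units stop contributing,
    # so the loss is proportional to the number of faults.
    return per_unit * sum(units_ok)


# Hypothetical system of four units, one of which has failed.
units = [True, True, False, True]
naive = naive_throughput(units, 100.0)
degraded = degraded_throughput(units, 100.0)
```

With one of four units down, the naive system delivers nothing, while the degrading system still delivers three quarters of its full output.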
Related Concepts
A fault-tolerant system is not the same as a system that rarely fails. For example, the Western Electric crossbar switching systems had a failure rate of two hours per forty years, so they were highly fault resistant.
However, when a fault did occur, they stopped running completely, so they were not fault tolerant.
Measurement Indicators
Fault tolerance refers to the ability of software to detect errors in the software or hardware it runs on and to recover from them.
It is usually measured by:
1. System reliability
2. System availability
3. System testability, etc.
Reliability is especially important for critical applications such as rocket launches.
For general-purpose computers, an important indicator is system availability.
Availability
Availability refers to the share of time, for example over a year, during which the system runs without failure.
Testability
Testability is also a very important indicator in the design of fault-tolerant systems: if a system cannot be tested, how can we guarantee that it is free of problems? Another indicator is MTBF (mean time between failures), that is, how long the system runs normally before a failure occurs.
Reliability
MTTR (mean time to repair) is the time the system needs to clear a fault. MTTR directly affects the availability of the system, while MTBF reflects its reliability.
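The relationship between these two indicators and availability is often written as availability = MTBF / (MTBF + MTTR). The sketch below assumes that common steady-state formula, with MTBF taken as the mean running time between failures; the function names and sample figures are illustrative and do not come from the text above.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up.

    Assumes the common formula MTBF / (MTBF + MTTR).
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must be non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(avail: float, hours_per_year: float = 8760.0) -> float:
    """Expected downtime in a 365-day year for a given availability."""
    return (1.0 - avail) * hours_per_year


# Hypothetical system: runs 999 hours on average, then needs 1 hour of repair.
a = availability(999.0, 1.0)
yearly_downtime = downtime_hours_per_year(a)
```

Under these assumed figures the availability is 0.999, which corresponds to roughly 8.76 hours of downtime per year. Halving MTTR has about the same effect on availability as doubling MTBF, which is why repair time matters as much as failure frequency.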
Examples
Fault tolerance, strictly speaking, means tolerating faults, not tolerating errors.
For example, in a dual-machine fault-tolerant system, when one machine fails, the other can take over, thereby ensuring the normal operation of the system. This arrangement was relatively common in the early days of computing, when hardware was less reliable.
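The dual-machine arrangement can be sketched as a simple failover loop. This is a minimal illustration, not a description of any particular product: the `primary` and `standby` functions stand in for two hypothetical machines, and a real system would also need health checks and state synchronization.

```python
from typing import Callable, Optional, Sequence


def run_with_failover(machines: Sequence[Callable[[int], int]], request: int) -> int:
    """Try each machine in order and return the first successful result."""
    last_error: Optional[Exception] = None
    for machine in machines:
        try:
            return machine(request)
        except Exception as exc:
            # This machine has failed; remember why and try the next one.
            last_error = exc
    raise RuntimeError("all machines failed") from last_error


def primary(x: int) -> int:
    # Hypothetical primary machine that is currently down.
    raise ConnectionError("primary machine is down")


def standby(x: int) -> int:
    # Hypothetical standby machine that performs the same work.
    return x * 2


result = run_with_failover([primary, standby], 21)
```

Here the request still succeeds because the standby takes over when the primary fails. The caller only sees an error if every machine fails.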
Today's hardware is far more reliable and stable than it used to be, but tolerance of hardware faults remains very important for systems in which errors cannot be allowed.