Fault-Tolerant Design of Computer Systems
An Introductory Course

A Five Day Short Course at City University, London

Companies place increasing reliance on computer systems for the very survival of their business; computer applications become ever more complex, yet they are often built from unreliable components, hardware or software.

Fault tolerance - design for surviving component failures - is becoming a necessity for a growing number of companies, far beyond its traditional application areas such as aerospace and telecommunications.

This course - organised as five one-day lectures that can be taken individually - addresses the needs of:

IT and engineering managers who face new dependability requirements for their (or their customers') computer applications;
Software designers or system integrators who want an introduction to the problems found in designing for fault tolerance and to the range of design solutions.

This course has been developed by the Centre for Software Reliability with funding from the Engineering and Physical Sciences Research Council (Grant Number 00711ENG95) as part of their individual MSc Modules Programme.

The Centre for Software Reliability is a Registered Provider for the IEE Continuing Professional Development (CPD) Scheme.


Introduction and Background

Computer failures can have crippling effects on an organisation's ability to function. Any company, not just software-related businesses, can become bankrupt as a result of computer failure. In the next two or three years, the "Millennium bug" alone will generate many vital errors. And yet increasingly, business-critical computing systems are being assembled from off-the-shelf components never designed for high reliability, availability or safety.

This course offers a unique opportunity for engineering managers and software designers to learn about fault tolerance - about systems surviving failure. It is about keeping systems working despite the failure of some of their parts, that is, without uncontrolled disruption of service. This is not rocket science; if you know the basic principles, you can apply them to everyday design and purchasing decisions.

The participants will learn the basic concepts necessary for decisions about the form and extent of redundancy to be employed during the design or procurement of computer systems. These concepts have been developed by researchers throughout the history of computing, but their application has been mostly limited to safety-critical and other high-risk, high-budget applications. By contrast, this course will consider the range of techniques available to organisations with different dependability requirements and budgets for fault tolerance. We will cover the integration of automatic and manual procedures, and will specifically address software-caused and operator-caused failures. The course will thus satisfy the needs of companies that have to choose among the fault-tolerant commercial products on the market, and of companies that need to integrate a fault-tolerant system out of non-fault-tolerant products.

Course Objectives

At the end of this course you should:

understand the risk of computer failures and their peculiarities compared with other equipment failures;
know the different advantages and limits of fault avoidance and fault tolerance techniques;
be aware of the threat from software defects and human operator error as well as from hardware failures;
understand the basics of redundant design;
know the different forms of redundancy and their applicability to different classes of dependability requirements;
be able to choose among commercial platforms (fault-tolerant or non-fault-tolerant) on the basis of dependability requirements;
be able to specify the use of fault tolerance in the design of application software;
understand the relevant factors in evaluating alternative system designs for a specific set of requirements;
be aware of the subtle failure modes of "fault-tolerant" distributed systems, and the existing techniques for guarding against them;
understand cost-dependability trade-offs and the limits of computer system dependability.

Detailed contents

The standard timetable below includes ample time for class discussions and group problem sessions. The presentation of the material will emphasise examples in practical contexts.

Day 1 Fundamentals of design for dependability and fault tolerance

Faults and failures of computer systems: what is special about computer failures
elements of systems theory
definitions: fault, error, failure; fault avoidance and fault tolerance
reliability, availability and other dependability measures
deriving dependability requirements
forms of redundancy
organisation of fault tolerance
dependability modelling: combinatorial and non-combinatorial languages.
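
To give a flavour of the Day 1 material on dependability measures and redundancy, the sketch below computes two textbook quantities: steady-state availability from mean time to failure (MTTF) and mean time to repair (MTTR), and the reliability of a 2-out-of-3 redundant configuration. It is only an illustration, not course material: the function names are invented for this example, and the redundancy formula assumes independent replica failures and a perfect voter, which real systems rarely satisfy.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the long-run fraction of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


def two_out_of_three_reliability(r: float) -> float:
    """Probability that at least 2 of 3 replicas work.

    Assumes each replica works with probability r, that replicas fail
    independently, and that the voter itself never fails.
    """
    return 3 * r**2 - 2 * r**3


if __name__ == "__main__":
    # Illustrative numbers only; they do not describe any real product.
    print(f"availability: {availability(1000.0, 10.0):.4f}")
    for r in (0.9, 0.5, 0.4):
        print(f"replica {r:.2f} -> 2-out-of-3 {two_out_of_three_reliability(r):.4f}")
```

Under these assumptions, redundancy helps only when each replica is already more likely to work than to fail: for replica reliability below 0.5 the redundant configuration is worse than a single unit.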

Day 2 Methods for error detection, confinement and recovery

Phases of response to fault manifestation:
Error detection mechanisms in hardware and software
Damage containment: error propagation paths, protection mechanisms
Damage assessment and diagnosis, reconfiguration
Forward and backward recovery. Problems of backward recovery.
Atomic actions and transactions.

Day 3 Recovery, modular redundancy and fault tolerance in distributed systems

modular redundancy design problems: synchronisation, communication, adjudication, replica determinism
choice of redundant configuration: trade-offs among reliability, availability, safety
distributed systems: advantages and design problems
communication faults
distributed support for replication, consistency in distributed systems, multicast communication and "Byzantine" failures
structuring fault-tolerant, distributed applications
recovery and atomic transactions in concurrent and distributed systems
real-time issues
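
As an illustration of the adjudication problem listed for Day 3, the sketch below shows a minimal majority voter over the outputs of replicated modules. The function name and interface are assumptions made for this example. It compares outputs for exact equality, which only works if replicas are deterministic; that is precisely the replica determinism problem mentioned above.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value produced by a strict majority of replicas.

    Returns None when no value has a strict majority, so a replica
    output of None cannot be told apart from "no agreement" in this
    simplified sketch.
    """
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None


if __name__ == "__main__":
    print(majority_vote([42, 42, 41]))  # one faulty replica is outvoted
    print(majority_vote([1, 2, 3]))     # no agreement among replicas
```

A voter of this kind masks a minority of wrong outputs, but it cannot help when a majority of replicas fail in the same way, for example because of a shared design fault.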

Day 4 Fault tolerance against software and design faults, and against operator error

importance of these failure sources and motivations for using fault tolerance
software faults and failures
exception handling and defensive programming
module-structured methods for software fault tolerance
effectiveness of software fault tolerance
how to achieve diversity
industrial experience with software fault tolerance
operator error: basics; classification into slips, mistakes and violations
user-centred design principles; fault-tolerant solutions
design trade-offs
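
One well-known scheme for software fault tolerance is the recovery block, which combines diverse alternates with an acceptance test and backward recovery. The sketch below is a simplified illustration under stated assumptions, not the course's own material: the names recovery_block and RecoveryBlockFailure are invented here, the state is assumed to be copyable with copy.deepcopy, and the acceptance test is assumed to be reliable.

```python
import copy
from typing import Any, Callable, Sequence


class RecoveryBlockFailure(Exception):
    """Raised when every alternate fails or is rejected by the acceptance test."""


def recovery_block(
    state: Any,
    alternates: Sequence[Callable[[Any], Any]],
    acceptance_test: Callable[[Any], bool],
) -> Any:
    """Try each alternate in turn, always starting from the saved state."""
    checkpoint = copy.deepcopy(state)  # recovery point taken on entry
    for alternate in alternates:
        # Backward recovery: each attempt works on a fresh copy of the
        # checkpoint, so a failed alternate leaves no damage behind.
        candidate = copy.deepcopy(checkpoint)
        try:
            result = alternate(candidate)
        except Exception:
            continue  # a crash counts as a failed alternate
        if acceptance_test(result):
            return result
    raise RecoveryBlockFailure("no alternate passed the acceptance test")
```

The protection this gives depends on the diversity of the alternates and on the quality of the acceptance test: an alternate that fails in a way the test does not detect will be accepted.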

Day 5 Commercial fault-tolerant systems; decisions in design, procurement and deployment of fault-tolerant systems

examples of commercial systems: general-purpose, transaction-processing systems, applications in process control, telecommunications and safety systems
systematic design approaches to fault tolerance
choice of the level (application, platform, hardware) and degree of fault tolerance
fault assumptions and design trade-offs; role of complexity and novelty
automatic fault tolerance vs manual procedures, disaster recovery
organisational and training aspects, maintenance
cost of increasing dependability and limits to achievable dependability

Each day starts at 9:30 and ends at 16:30.

Course Teaching

The course is prepared and taught by the Centre for Software Reliability (CSR) at City University, which is recognised internationally as a centre of excellence in software reliability and measurement. The course leader is Prof Lorenzo Strigini, who has 18 years' experience of research in hardware and software fault tolerance, as well as of consulting and teaching industrial courses.
