Fault-Tolerant Design of Computer Systems
An Introductory Course
A Five Day Short Course at City University, London
Companies place increasing reliance on computer systems for the very survival of their
business; computer applications become ever more complex, yet they are often built from
unreliable components, hardware or software.
Fault tolerance - design for surviving component failures - is becoming a necessity for
a growing number of companies, far beyond its traditional application areas, such as
aerospace and telecommunications.
This course - organised as five one-day lectures that can be taken individually -
addresses the needs of:
- IT and engineering managers who have to address new needs for dependability of their (or
their customers') computer applications;
- Software designers or system integrators who want an introduction to the problems found
in designing for fault tolerance and to the range of design solutions.
This course has been developed by the Centre
for Software Reliability with funding from the Engineering and Physical Sciences
Research Council (Grant Number 00711ENG95) as part of their individual MSc Modules
Programme.
The Centre for Software Reliability is a Registered Provider for the IEE Continuing
Professional Development (CPD) Scheme.
Introduction and Background
Computer failures can have crippling effects on an organisation's ability to function. Any
company, not just software-related businesses, can become bankrupt as a result of
computer failure. In the next two or three years, the "Millennium bug" alone
will generate many vital errors. And yet increasingly, business-critical computing systems
are being assembled from off-the-shelf components never designed for high reliability,
availability or safety.
This course offers a unique opportunity for engineering managers
and software designers to learn about fault tolerance - about systems surviving failure.
It is about keeping systems running despite the failure of some of their parts, in
other words, without uncontrolled disruption of service. This is not rocket
science; if you know the basic principles, you can apply them to everyday design and
purchasing decisions.
The participants will learn the basic concepts necessary for
decisions about the form and extent of redundancy to be employed during the design or
procurement of computer systems. These concepts have been developed by researchers during
the whole history of computing, but their application has been mostly limited to
safety-critical and other high-risk, high-budget applications. By contrast, this course
will consider the range of techniques available to organisations with different
dependability requirements and budgets for fault tolerance. We will cover the integration
of automatic and manual procedures, and will specifically address software-caused and
operator-caused failures. The course will thus satisfy the needs of companies that have to
choose among the fault-tolerant commercial products on the market, and/or have to
integrate a fault-tolerant system out of non-fault-tolerant products.
Course Objectives
At the end of this course you should:
- understand the risk of computer failures and their peculiarities compared with other
equipment failures;
- know the different advantages and limits of fault avoidance and fault tolerance
techniques;
- be aware of the threat from software defects and human operator error as well as from
hardware failures;
- understand the basics of redundant design;
- know the different forms of redundancy and their applicability to different classes of
dependability requirements;
- be able to choose among commercial platforms (fault-tolerant or non-fault-tolerant) on
the basis of dependability requirements;
- be able to specify the use of fault tolerance in the design of application software;
- understand the relevant factors in evaluating alternative system designs for a specific
set of requirements;
- be aware of the subtle failure modes of "fault-tolerant" distributed systems,
and the existing techniques for guarding against them;
- understand cost-dependability trade-offs and the limits of computer system dependability.
Detailed contents
The standard timetable below includes ample time for class discussions and group problem sessions. The
presentation of the material will emphasise examples in practical contexts.
Day 1 Fundamentals of design for dependability and fault tolerance.
- Faults and failures of computer systems: what is special about
computer failures
- elements of systems theory
- definitions: fault, error, failure; fault avoidance and fault
tolerance
- reliability, availability and other dependability measures
- deriving dependability requirements
- forms of redundancy
- organisation of fault tolerance
- dependability modelling: combinatorial and non-combinatorial
languages.
Day 2 Methods for error detection, confinement and recovery.
- Phases of response to fault manifestation:
- Error detection mechanisms in hardware and software
- Damage containment: error propagation paths, protection mechanisms
- Damage assessment and diagnosis, reconfiguration
- Forward and backward recovery. Problems of backward recovery.
- Atomic actions and transactions.
Day 3 Recovery, modular redundancy and fault tolerance in distributed systems.
- modular redundancy design problems: synchronisation, communication,
adjudication, replica determinism
- choice of redundant configuration: trade-offs among reliability,
availability, safety
- distributed systems: advantages and design problems
- communication faults
- distributed support for replication, consistency in distributed
systems, multicast communication and "Byzantine" failures
- structuring fault-tolerant, distributed applications
- recovery and atomic transactions in concurrent and distributed
systems
- real-time issues
Day 4 Fault tolerance against software and design faults, and against operator error.
- importance of these failure sources and motivations for using fault
tolerance
- software faults and failures
- exception handling and defensive programming
- module-structured methods for software fault tolerance
- effectiveness of software fault tolerance
- how to achieve diversity
- industrial experience with software fault tolerance
- operator error: basics, classification: slips and mistakes,
violations
- user-centred design principles; fault-tolerant solutions
- design trade-offs
Day 5 Commercial fault-tolerant systems; decisions in design, procurement and deployment of fault-tolerant systems.
- examples of commercial systems: general-purpose,
transaction-processing systems, applications in process control, telecommunications and
safety systems
- systematic design approaches to fault tolerance
- choice of the level (application, platform, hardware) and degree of
fault tolerance
- fault assumptions and design trade-offs; role of complexity and
novelty
- automatic fault tolerance vs manual procedures, disaster recovery
- organisational and training aspects, maintenance
- cost of increasing dependability and limits to achievable
dependability
Each day starts at 9:30 and ends at 16:30.
Course Teaching
The course is prepared and taught by the Centre for Software Reliability (CSR) at City University, which is
recognised internationally as a centre of excellence in software reliability and
measurement. The course leader is Prof Lorenzo Strigini, who has 18 years' research
experience in hardware and software fault tolerance, including consulting and teaching
industrial courses.
To arrange a delivery of the course, to be informed of the next delivery, to obtain additional information on course
contents, to discuss a tailored version of the course, or to be put on a mailing list
for future information, contact the course leader, Prof Lorenzo Strigini.