Fault tolerance in distributed systems

Code 312AA
Credits 6

Learning outcomes

Objectives
The course introduces the main issues and techniques of fault-tolerant computing for distributed systems. Several techniques are discussed: software replication, atomic actions, checkpointing strategies, and rollback recovery protocols. Each technique is introduced in its original context and then in the context of parallel and distributed systems. Critical implementation details and optimizations, such as stable storage, are also described. In addition, the course introduces models for evaluating the overhead that fault tolerance techniques impose on application performance. Part of the course is dedicated to experimenting with existing techniques and implementing them in laboratory sessions.

Syllabus
1) Fault tolerance techniques for distributed systems
a. Software replication
b. Atomic actions
c. Checkpointing and rollback recovery
d. Cost models for fault tolerance supports
2) Supports
a. Group communication primitives
b. Stable storage
c. Message logging
d. Garbage collection of checkpoints
3) Laboratory
a. Implementation of existing checkpointing and rollback recovery protocols
b. Implementation of message logging for communication channels
c. Fault tolerance for parallel programming paradigms

Course structure
6 credits: 3 on techniques (1) and supports (2), and 3 on lab activities. The exam consists of a project studying a selected part of the course; the project does not necessarily involve programming. The project is discussed in a final colloquium.