Fast, Localized Document Recovery for Distributed Apps with K-Optimistic Logging
Computing & Wireless : Computing Methods
Available for non-exclusive licensing
- Vijay Garg, Ph.D. , Electrical and Computer Engineering
- Om Damani, Ph.D. , IBM, Inc.
- Yi-Min Wang, Ph.D. , AT&T Corporation
Log-based rollback-recovery is an effective technique for providing low-cost fault tolerance to distributed applications, and is especially useful for distributed applications that frequently interact with the outside world. It can be used to reduce the amount of lost work due to failures and to enable fast and localized recovery.
Optimistic logging first saves messages in a volatile buffer and later writes several messages to stable storage in a single operation. This methodology incurs a lower failure-free overhead than pessimistic logging, where each message is individually and synchronously logged. However, messages saved in the volatile buffer may be lost upon a failure, and the corresponding lost states may revoke messages and force other non-failed processes to roll back as well.
It is desirable to have a flexible scheme with tunable parameters such that each application can balance this trade-off so that it can be responsive during normal operation, and also control the rollback scope so that it can recover reasonably quickly in the event of a failure.
This invention relates to fault-tolerant systems and methods. More particularly, the invention relates to fault-tolerant systems and methods using optimistic logging with a synchronous recovery in message passing systems.
A fault-tolerant message passing system includes a plurality of interconnected processors with storage and a watchdog process wherein the processors may undergo failure. A method restores a consistent system state using optimistic logging protocol with asynchronous recovery. If, upon receiving an incoming message, a process failure is detected, then the failed process is re-started.
Allows the user to specify a fine-grain tradeoff between failure-free overhead and recovery efficiency.
- The variable K, the maximum number of processes whose failures can revoke a specific message, is a tunable parameter.
- Each process comprises a sequence of state intervals and checkpoints for storing the state of the process, sufficient to re-start execution of the process by restoring the latest checkpoint upon failure detection.
- When a process fails, it broadcasts a rollback announcement. Upon receiving a rollback announcement, dependent processes roll back to undo the orphan states, and start a new incarnation as if it itself has failed.
Fault tolerance, distributed systems
Proof of concept