5/21/2009

Distributed Lease and Failure Detection

I. What Is Lease?
- A grant to use a resource for a certain period of time.
- Let nodes in networked system to gain exclusive/shared ownership over Named Entities.

II. Why Use Lease?
- Deal with resource management in distributed context. (for example, never freed ownership etc.)

III. The Rules
- Client Must Request a Lease from Server in order to Use it.
- Client Must Renew it before Lease Expires.
- Client Should Return Lease to Server when finish using it.
- If client doesn't renew lease for some period of time, server will expire it and may grant it to other clients later.
- If client doesn't receive renew request acknowledge from server for some period of time, it should commit suicide if the requested resource is exclusive(can't be shared).

IV. Case Studies:
- Failure Detection
- Distributed Garbage Collection Using Lease
- Weblogic Leasing Service
- Leasing in JINI

V. Failure Detection Using Lease

- The Idea

The basic idea is to use lease to manage life of processes. If a process's life lease is not renewed before some period of time, that process is considered to be failed.

To maintain a simple and easy to use semantic, the following statements should be satisfied:
- 1. If the server think a client is failed, the client should did failed already
- 2. If a client fail, the server should eventually find this fact
- 3. If the failed process recovered, the whole system should work correctly as well

- The Algorithm

1. For Server Side

[Message Thread]
while ( TRUE ) {
  Receive a msg from Network;
  Send ack back to Client;
  If msg.type = RENEW {
    lease_last_renew_time = GetCurrentTime();
  } Else If msg.type = ACQUIRE {
    grant lease & lease_last_renew_time = GetCurrentTime();
    or deny request
  } Else If msg.type = RELEASE {
    revoke lease;
}

[Check Thread]
While ( TRUE ) {
  Sleep(INTERVAL_TIME)
  If (lease_last_renew_time + SERVER_LEASE_LENGTHGetCurrentTime()) {
    revoke lease;
  }
}

2. For Client Side

[Message Thread]
last_acked_send_time = GetCurrentTime();
While (TRUE) {
  last_send_time = GetCurrentTime();
  Send RENEW to Server;
  Wait for ack;
  If get ack from server {
    last_acked_send_time = last_send_time;  
  }
  Sleep(TIME_INTERVAL);
}

[Check Thread]
While (TRUE) {
  Sleep(TIME_INTERVAL);
  If ((last_acked_send_tme + CLIENT_LEASE_LENGTHGetCurrentTime()) {
    commit_suicide
  }
}

- Configuration
As you can see, there are many parameters need to be determined:
Ts - server side lease time
Tc - client side lease time
Ti - check action interval

Since message passing is independent of status checking:
- client will notice failure after (Tc, Tc + Ti) time since last acknowledged lease message was issued
- server will notice failure after (Ts, Ts + Ti) time since last renew message was received.

[NOTE: the client event - "last acknowledged lease was issued" will definitely occurs before the server event - "last lease was received", because they are just causal sequence]

To maintain the ivariant - "Server must notice Client failure AFTER that Client notice its failure itself" ("Client Notice its failure" means: - 1. the client crashed and can't work any more, or, - 2. It notices that it SHOULD be looked as failed), the following expression should be true: (Tc, Tc + Ti) 〈 (Ts, Ts + Ti) => Tc 〈 Ts - Ti -- [1]

Further More: to ensure the whole system work in normal cases and some small packet loss situations, at least one renew message should be send before server think client failed and before client commit suicide : Ti min(Ts, Tc) / 2 -- [2]

So when you configure the upper algorithm with parameter, condition [1] & [2] must be satisfied.

You can do some failure scenario analysis to ensure the correctness of the upper algorithm:
- what if server failed?
- what if client failed?
- what if network is partitioned?
- will it work in normal situation?

Failure Scenario of Leasing

NOTE:
1. the time intervals of server and client can be different
2. and at client, the lease message sending and liveness checking time interval can be different

update@11/10/09: add reference links about failure detection

[Reference]
1. Lease Service in Distributed System
2. Failure Detector Work @ Cornell
3. Patent - Leasing for Failure Detector, US Patent by SUN
4. Paper - Perfect Failure Detection In Timed Asynchronous Systems
5. Paper - Unreliable Failure Detectors for Reliable Distributed Systems
7. Paper - Quorum-Based Perfect Failure Detection Service
6. PPT - Failure Detectors: A Perspective
8. PPT - Presentation on Failure Detector
9. Tutorial - Fault-Tolerant Distributed Consensus
10. Enforcing Perfect Failure Detection
11. Book - Reliable distributed systems: technologies, Web services, and applications

No comments: