5/29/2009

Implementing Consistency - Protocols for Data Replication and Cache Coherence

Various concrete consistency models were described in [1]; now it's time to discuss how to implement these models.

Consistency semantics were divided into two categories in [1] - "Coherence and Replication are very similar concepts and deal with the same problem, but the former is often used for hardware systems (for example, SMP cache systems) and the latter is often used for software systems (for example, distributed file systems and database systems)".

Some other differences:
1. Coherence deals with data items on the client side, while Replication deals with data items on the server side
2. Coherence deals with replicated data items that have a corresponding copy in a reliable storage backend, while Replication doesn't assume that.

We will discuss protocols for Data Replication and Cache Coherence in different sections.

Part 1 - Data Replication

To keep replicated data items consistent in a distributed storage system, there are some core problems to think about: where and how should the various replicas be stored? what, when and how should updates be propagated to replicas? how is consistency maintained?

I - Replica Server & Replica Content Placement
- As close to the client as possible, to improve performance
- Adjust replica placement dynamically according to access history

II - General Update Propagation Problems

1. What to Propagate?
- Notification (invalidation message)
- Command (operation & parameters)
- Data (updated region)
- Change Log (to accumulate several updates into one)

2. When to Propagate?
- Pull (each replica asks for updates)
- Push (the primary replica sends updates to all secondaries)

3. How to Propagate?
- Unicasting (send updates to each replica one by one)
- Multicasting (send the update message to all replicas at once)

III - Replication Protocols

A replication protocol aims to implement a specific consistency model, so the details of the protocol depend heavily on the target model. Here we only describe protocols that implement models popular in practical systems.

1. Sequential Consistency

1.1 Primary Based Protocol
- All write operations on a data item x are served by one special replica called the primary replica; this replica is responsible for updating the other replicas, and clients only interact with this special replica.

1.2 Replicated-Write Protocol - write operations are sent to every replica to execute:
- Active Replication: a total order of all write operations is required so that each replica executes the write commands in the same order.
- Quorum Based: write operations only need to be executed on a subset of the replicas before returning. It uses votes to prevent write-write conflicts and write-read conflicts (see the sketch after this list):
  • Suppose each data item has N replicas
  • To read a value, a client must contact at least Nr replicas
  • To write a data item, a client must contact at least Nw replicas
  • To ensure no WW and WR conflicts, Nr + Nw > N and Nw + Nw > N should be satisfied
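
As a concrete illustration of the voting rule above, here is a minimal C# sketch (the QuorumConfig name is purely illustrative) that only checks whether a given (N, Nr, Nw) configuration satisfies the two conditions; it is not a full quorum protocol.

// Minimal sketch: validating quorum parameters (N, Nr, Nw).
public static class QuorumConfig
{
    // Returns true if the read/write quorum sizes rule out
    // read-write and write-write conflicts among N replicas.
    public static bool IsValid(int n, int nr, int nw)
    {
        bool noReadWriteConflict  = nr + nw > n;  // every read quorum overlaps every write quorum
        bool noWriteWriteConflict = nw + nw > n;  // any two write quorums overlap
        return noReadWriteConflict && noWriteWriteConflict;
    }
}

// Example: N = 5, Nr = 3, Nw = 3 is valid; N = 5, Nr = 2, Nw = 3 is not,
// because 2 + 3 = 5 is not greater than 5.
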
2. Eventual Consistency Protocol

Two requirements should be met by this kind of protocol:
- All update operations on a data item should reach, and be executed at, all replicas at some point
- These operations should be executed in the same order

Some popular methods to ensure these requirements are:
- Use write sets and read sets (use Nw, Nr to control how many processes must be involved in a write or read operation), so that update operations are ordered.
- Expose the data item's version number to the client, so when the client accesses data it can pass the latest version number it knows to the server, implementing some client-centric consistency model (see the sketch after this list)
- Limit which processes may execute update operations, so write-write conflicts can be resolved easily
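
To make the version-number bullet a bit more concrete, here is a hedged C# sketch of a client-side session that enforces monotonic reads / read-your-writes; the IReplica interface and its members are assumptions made for illustration, not a real API.

using System;

// Hypothetical replica interface: reads return a version, writes return the new version.
public interface IReplica
{
    string Read(string key, out long version);
    long Write(string key, string value);
}

// A client-side session that remembers the highest version it has seen,
// so later reads must return at least that version (monotonic reads, read-your-writes).
public class SessionClient
{
    private readonly IReplica _replica;
    private long _lastSeenVersion;

    public SessionClient(IReplica replica) { _replica = replica; }

    public string Read(string key)
    {
        long version;
        string value = _replica.Read(key, out version);
        if (version < _lastSeenVersion)
            throw new InvalidOperationException("Stale replica - retry on another replica");
        _lastSeenVersion = version;
        return value;
    }

    public void Write(string key, string value)
    {
        long newVersion = _replica.Write(key, value);
        if (newVersion > _lastSeenVersion) _lastSeenVersion = newVersion;
    }
}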

Note:
- All of the above protocols ignore process/node and communication failures, which occur often in practical distributed systems.
- Replication protocols that deal with failures, such as Paxos and Two-Phase Commit, are much more complicated.

Part 2 - Cache Coherence

A Cache Coherence Protocol is used to keep client-side replicas consistent in a context where an authoritative copy of each data item exists in a storage backend (this is true both for hardware systems such as SMP caches and for software systems such as distributed file system caches).

I. Core problems of Cache Coherence Protocol:

1. Coherence Detection Strategy
That is to say, when are inconsistencies actually detected? For example, in a distributed database system, this detection can be performed:
- At the beginning of a transaction
- Parallel with the on going transaction but before it commits
- When transaction commits

2. Coherence Enforcement Strategy
That is to say, how all caches are kept consistent with each other. Generally, there are two methods (see the sketch after this list):
- Invalidating: if a data item is modified in one cache, other caches that hold the same data item receive an invalidation notification.
- Propagating: if a data item is modified in one cache, the update is propagated to the other caches that hold the same data item, so all cached replicas are updated to the same version.
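
A minimal sketch of the two enforcement strategies, assuming a toy CoherentCache class with a list of peer caches (all names are illustrative, not from a real coherence protocol):

using System.Collections.Generic;

// Toy cache that, on a write, either invalidates or updates its peers.
public class CoherentCache
{
    private readonly Dictionary<string, string> _data = new Dictionary<string, string>();
    private readonly List<CoherentCache> _peers = new List<CoherentCache>();
    public bool UseInvalidation = true;   // switch between the two strategies

    public void AddPeer(CoherentCache peer) { _peers.Add(peer); }

    public void Write(string key, string value)
    {
        _data[key] = value;
        foreach (CoherentCache peer in _peers)
        {
            if (UseInvalidation)
                peer.Invalidate(key);          // peers drop their stale copy
            else
                peer.Propagate(key, value);    // peers are updated to the new version
        }
    }

    public void Invalidate(string key) { _data.Remove(key); }
    public void Propagate(string key, string value) { _data[key] = value; }
}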

3. Cache-Server Consistency
That is to say, how to keep data items in the cache and in the storage server consistent with each other. There are several policies (a small sketch follows this list):
- Read-Only Cache: modifications can only be made to items in the storage server, and caches pull these updates
- Write Back: modifications to data items are accumulated on the cache side and written to the storage server at some later time.
- Write Through: modifications are made in the cache and are also propagated to the storage server. In an SMP system, the bus or a directory may be used to serialize all operations, while in distributed file systems an exclusive lock may be needed by the cache that performs modifications, to avoid write-write conflicts.
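
A similar hedged C# sketch contrasting Write Through with Write Back against a storage backend (IBackend stands in for the storage server and is not a real interface):

using System.Collections.Generic;

public interface IBackend { void Put(string key, string value); }

public class ClientCache
{
    private readonly IBackend _backend;
    private readonly Dictionary<string, string> _cache = new Dictionary<string, string>();
    private readonly Dictionary<string, string> _dirty = new Dictionary<string, string>();
    public bool WriteThrough = true;

    public ClientCache(IBackend backend) { _backend = backend; }

    public void Write(string key, string value)
    {
        _cache[key] = value;
        if (WriteThrough)
            _backend.Put(key, value);   // propagate immediately to the storage server
        else
            _dirty[key] = value;        // write back: accumulate, flush later
    }

    // Called periodically or on eviction in write-back mode.
    public void Flush()
    {
        foreach (KeyValuePair<string, string> kv in _dirty) _backend.Put(kv.Key, kv.Value);
        _dirty.Clear();
    }
}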

II. Implementing Cache Coherence

1. Implementation Mechanism

1.1 Directory Based - In a directory-based system, some information about the data being shared is placed in a common directory. The directory acts as a filter through which a processor must ask permission to load an entry from primary memory into its cache. When an entry is changed, the directory either updates or invalidates the other caches holding that entry. [2] It is not as fast as snooping, but it is more scalable.

1.2 Snooping - In such a system, the individual caches monitor address lines for accesses to memory locations that they have cached. When a write to a location of which a cache holds a copy is observed, the cache controller invalidates or updates its own copy of the snooped memory location. [2] Since it requires broadcasting, it is not very scalable.

2. Protocols

The various cache coherence protocols use the Directory or Snooping mechanisms to solve the three core problems described above:

2.1 Invalidation Protocols
- Write-Once
- Berkeley
- Illinois / MESI (a simplified state-transition sketch follows these lists)

2.2 Propagating Protocols
- Firefly
- Dragon
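
To give a rough feel for an invalidation protocol such as MESI, here is a highly simplified C# state-transition sketch for a single cache line; real protocols also involve bus/directory transactions, data transfer and write-back of Modified lines, all of which are omitted here.

public enum MesiState { Modified, Exclusive, Shared, Invalid }

public static class MesiSketch
{
    // Local processor writes the line: it must own it exclusively.
    // (From Shared or Invalid, an invalidation is broadcast first.)
    public static MesiState OnLocalWrite(MesiState s)
    {
        return MesiState.Modified;
    }

    // Local processor reads the line.
    public static MesiState OnLocalRead(MesiState s, bool otherCachesHoldCopy)
    {
        if (s != MesiState.Invalid) return s;                 // cache hit: state unchanged
        return otherCachesHoldCopy ? MesiState.Shared         // miss, a copy exists elsewhere
                                   : MesiState.Exclusive;     // miss, we are the only holder
    }

    // Another cache writes the same line (snooped invalidation).
    public static MesiState OnRemoteWrite(MesiState s)
    {
        return MesiState.Invalid;
    }

    // Another cache reads a line we hold in Modified or Exclusive state.
    public static MesiState OnRemoteRead(MesiState s)
    {
        if (s == MesiState.Modified || s == MesiState.Exclusive) return MesiState.Shared;
        return s;
    }
}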

Most cache protocols in multiprocessors support the sequential consistency model, while in software distributed shared memory, models supporting release consistency or weak consistency are more popular.

[Reference]
[1] http://xcybercloud.blogspot.com/2009/05/data-consistency-model-survey.html
[2] http://en.wikipedia.org/wiki/Cache_coherence
[3] http://www.nedprod.com/NedHAL/Cache%20Coherency%20solutions.html
[4] http://en.wikipedia.org/wiki/Memory_coherence
[5] Shared Memory Architecture (共享存储系统体系结构), 2001, Chinese Higher Education Press (中国高等教育出版社), Weiwu Hu (胡伟武)
[6] Distributed Systems - Principles and Paradigms (2e), 2006, by Tanenbaum & Steen
[7] Distributed Systems - Concepts and Design (4e), 2005, by Coulouris & Dollimore & Kindberg

5/26/2009

Consistency Model - A Survey

Part I - What's Data Consistency Model and Why Should We Care?

Data Consistency Model
- it is a Semantic Contract between a data storage system and its users. Here, a data storage system may be a hardware system (for example, the memory subsystem in DSM, SMP or CMP computers) or a software system (for example, a distributed file system, web caching, or a distributed database).

Essentially, Consistency Model defines what value to return in a read operation.

The most natural semantic for a storage system is - "a read should return the last written value". This is intuitive in a memory system attached to a uniprocessor, where no concurrent access and no data replication exist.

But when concurrent client accesses and multiple replicas exist, it's not easy to identify what "last write" means, mainly due to the lack of a global clock in parallel/distributed systems.

So various data models have been proposed to define data semantics when Contention and Replication exist:
- Data Contention - In the memory subsystem of a parallel computer, multiple processors may access the same memory location during the same period of time; what semantics should each processor expect? (different processes access the same data item)
- Data Replication - In a distributed file/database system, each data block is replicated in multiple places; even access (read/write/modify) requests from the same client may involve more than one replica. What are the semantics from the client's perspective? (the data item is replicated)
- Both Contention & Replication: What if multiple clients access replicated data blocks? What if there is a local cache for each processor in an SMP system? (many processes access replicated data items)

Part II - The Various Models

Many consistency models exist in the computer architecture and distributed storage communities. Why did people invent so many different semantics between a store and its clients? The reason is the trade-off between strict semantics and available optimization space - strict (also simple and easy to understand, use and implement) semantics limit the space we can use to improve availability and optimize performance. So each model has its advantages and drawbacks.

Before describing the various models, we should clarify some concepts first:

Program Order
- the order of operations (on data items) implied by the order of program statements (or assembly instructions). It's the issuing order of operations that come from the same thread/process.

Write Atomicity - a write operation behaves as if no concurrent/overlapping operations could occur. So the meaning of "Atomicity" here differs from that in the DBMS field, where "Atomicity" means that either all ops in a transaction are performed or none of them are.

Consistency vs Coherence/Replication: Consistency focuses on the semantics of a data store (a set of data items) under access by multiple clients, while Coherence/Replication only deals with the semantics (and how to implement them) of a single replicated data item (multiple replicas exist) when it is accessed by multiple clients. Coherence/Replication is just part of a Consistency Model; a Coherence/Replication protocol only describes how to implement part of the semantics of some specific data consistency model.

[Update@07/23: Coherence and Replication are very similar and deal with the same problem, but the former is used in hardware systems (for example, cache coherence) and the latter is used in software systems (for example, in distributed file systems and database systems)]

Following are various Consistency Model descriptions:

1. Atomic Consistency/Strict Consistency

It means that for a particular data item (byte, cache block, file region, etc.), each read returns the latest update, i.e., events are visible to all clients as they occur/are issued. As we already said, due to the lack of a global clock in distributed/parallel systems, this is very hard to implement, if not impossible.

2. Sequential Consistency

This model is defined as - "The result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program." - Leslie Lamport

This definition implies:
1. Events/Operations from the same process are executed in program order
2. Every client sees the same order of all events/updates, which is a sequential one

Goods:
- Ensures everybody sees the same order, which is very good for data replication
- Preserves causality

Bads:
- Many interleavings are valid (two independent programs with M and N memory operations have (M + N)!/(M!N!) valid interleavings - see the snippet below)
- Different runs of a program might act differently
- Execution order may be very different from issuing order
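
The interleaving count mentioned above is just the binomial coefficient C(M + N, M); a tiny C# helper (purely illustrative) makes the growth concrete.

public static class Counting
{
    // (m + n)! / (m! n!) = C(m + n, n): the number of valid interleavings of two
    // independent programs with m and n memory operations.
    public static long Interleavings(int m, int n)
    {
        long result = 1;
        for (int i = 1; i <= n; i++)
            result = result * (m + i) / i;   // stays integral at every step
        return result;
    }
}
// Counting.Interleavings(4, 4) == 70; Counting.Interleavings(10, 10) == 184756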

3. Causal Consistency

Events that are causally related must be seen by everybody in the same order. Unrelated ("concurrent") events can be seen in different orders.

The challenge in implementing this model lies in how to identify "causally related" events.

4. Eventual Consistency

If no updates take place for a long time (the inconsistency window), all replicas will gradually converge to the latest version of the value.

In this model, if a client reads data from different replicas, it's hard to define clear semantics for the returned value. To resolve this problem, client-centric consistency models were introduced. These models focus on the semantics from a single client's perspective.

Monotonic Reads - If a process reads the value of an object, any successive read operations on that object by that process will always return the same or more recent value.

Monotonic Writes - A write operation by a process on a data item x is completed before any successive write operation on x by the same process.

Read Your Writes - The effect of a write operation by a process on data item x will always be seen by a successive read operation on x by the same process. A closely related model is "Session Consistency", in which the Read Your Writes semantic only holds within a session context and not across session boundaries.

Writes Follow Reads - Any successive write operation by a process on a data item x will be performed on a replica of x that is up-to-date with the value most recently read by that process.

5. Weak Consistency

A storage system is said to support weak consistency if:

  1. All synchronization operations are seen by all processors in the same sequential order.
  2. Data operations may be seen in different orders on different processors. (But the order of operations on the same data item is preserved.)
  3. The set of read and write operations between two consecutive synchronization operations is the same at each processor.
An alternative definition is:
  1. Accesses to synchronization variables are sequentially consistent.
  2. No access to a synchronization variable is allowed to be performed until all previous writes have completed everywhere.
  3. No data access (read or write) is allowed to be performed until all previous accesses to synchronization variables have been performed.
This model is weaker than the sequential consistency model, since no ordering assumption can be made about data operations between consecutive synchronization operations.

Weak consistency leverages the fact that, between consecutive sync operations, no other process will use the data being written. So it is safe to reorder write operations to different memory locations, but the order of operations on the same location must be preserved to ensure reasonable memory semantics.

6. Release Consistency

Release consistency is an extension of weak consistency that exploits the information about acquire, release, and nonsynchronization accesses.
  1. Before an ordinary LOAD or STORE access is allowed to perform with respect to any other processor, all previous acquire accesses must be performed, and
  2. Before a release access is allowed to perform with respect to any other processor, all previous ordinary LOAD and STORE accesses must be performed, and
  3. Special accesses are processor consistent with respect to one another.
In the Release Consistency model, all write operations by a certain node are seen by the other nodes after the former releases the object and before the latter acquires it; a node must call acquire to get up-to-date values.

There are two kinds of coherence protocols that implement release consistency:
- eager, where all coherence actions are performed on release operations, and
- lazy, where all coherence actions are delayed until after a subsequent acquire

7. Entry Consistency

Like all variants of Release Consistency, it requires the programmer to use acquire and release at the start and end of each critical section, respectively. However, entry consistency requires each ordinary shared variable to be associated with some synchronization variable such as a lock or barrier.

- An acquire access of a synchronization variable S is not allowed to perform with respect to process Pi until all updates to the guarded shared data Ds have been performed with respect to process Pi.

- Before an exclusive mode access to a synchronization variable S by processor Pi is allowed to perform with respect to Pi, no other processor may hold S in non-exclusive mode.

- After an exclusive mode access to S has been performed, any processor's next non-exclusive mode access to S may not be performed until it is performed with respect to the owner of S.

8. Processor Consistency

- Writes done by a single processor are received by all other processors in the order in which they were issued, but
- Writes from different processors may be seen in a different order by different processors


The motivation behind this model is to better reflect the reality of networks in which the latency between different nodes can be different.

This model is also known as FIFO Consistency and Pipeline Random Access Memory Consistency.

Notes on Next Steps:
1. Why People Invent Each Model?
2. What's Good and What's Bad for Each Model?

[Reference]
[1] Wiki on Consistency Model
[2] Summary on Consistency Model by CTO@Amazon
[3] Shared Memory Consistency Models: A Tutorial
[4] Memory Consistency Models
[5] Notes On Consistency
[6] Entry Consistency, Midway : Shared Memory Parallel Programming with Entry Consistency for Distributed Memory Multiprocessors
[7] Release Consistency, Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors
[8] Weak Consistency, Memory Access Buffering in Multiprocessors
[9] Memory Ordering in Modern Microprocessors (Part I, Part II)
[10] Wiki on Cache Coherence
[11] Distributed Systems - Principles and Paradigms (2e), 2006, by Tanenbaum & Steen
[12] Distributed Systems - Concepts and Design (4e), 2005, by Coulouris & Dollimore & Kindberg

5/24/2009

Popular Concepts in Error/Result Analysis

The following are some popular and also confusing terms used in error/result analysis:

I - False Positive/False Negative

- False Positive - A result that is erroneously positive when the actual situation is normal.
[i.e., a condition is judged to hold when it actually does not]
- False Negative - A result that appears negative but fails to reveal a situation that is actually present.
[i.e., a condition actually holds but is judged not to]

The following table may better describe these concepts:

                        Actual condition: Present                     Actual condition: Absent
Test result: Positive   True Positive                                 False Positive (Type I error)
Test result: Negative   False (invalid) Negative (Type II error)      True (accurate) Negative
(from wikipedia)

What often confuses me is: what counts as positive and what counts as negative? It depends on how you define the test result and the actual condition.

II - Precision/Recall

- Precision can be seen as a measure of exactness or fidelity
[i.e., a measure of how exact/faithful the returned results are]

- Recall is a measure of completeness.
[i.e., a measure of how complete the returned results are]

In the Information Retrieval context, these two concepts can be defined as:

Precision = |{relevant documents} ∩ {documents retrieved}| / |{documents retrieved}|
Recall    = |{relevant documents} ∩ {documents retrieved}| / |{relevant documents}|

While in classification context, these two concepts can be defined as:



                                        correct result: E1        correct result: E2
obtained result / classification: E1    tp (true positive)        fp (false positive)
obtained result / classification: E2    fn (false negative)       tn (true negative)

Precision = tp / (tp + fp)
Recall    = tp / (tp + fn)
(from wikipedia)
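
The two formulas translate directly into code; a minimal C# sketch:

public static class Metrics
{
    public static double Precision(int tp, int fp) { return (double)tp / (tp + fp); }
    public static double Recall(int tp, int fn)    { return (double)tp / (tp + fn); }
}
// Example: tp = 8, fp = 2, fn = 4  =>  Precision = 0.8, Recall ≈ 0.67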

[Reference]

http://en.wikipedia.org/wiki/Precision_and_recall
http://en.wikipedia.org/wiki/Type_I_and_type_II_errors

5/23/2009

Time and Order of Events in Distributed System

1. The Need for Logical Clock

One of the challenges in distributed systems is the lack of a global clock: it's very hard to timestamp events in different processes and order them globally.

To solve the "Time & Order of Events in Distributed System" problem, let's rethink what "Time" and "Order" mean - essentially, Time is a property of events that is used to Order or Sequence them.

In a distributed system, many events occur independently in different processes; we can't, and don't need to, find a deterministic order between two independent events. But are there dependent/related events? Yes, events are related when one causally affects another, for example through communication between processes.

Causality determines an order among events, and we must use some form of information to represent this kind of order. This information captures the causal relationships of events but doesn't need a real clock; it is called Logical Time or a Logical Clock.

2. How to implement Logical Clock?

What do we have in hand?
- Ordered Events within a Process
- Causal Relationships between some Cross-Process Events

With this information, we can define a relationship called "Happened Before":
- 1. if Events E1 and E2 are from the same process, and E1 comes before E2, then E1 Happened Before E2
- 2. if E1 is the sending of a message and E2 is the receiving of the same message, then E1 Happened Before E2
- 3. if E1 Happened Before E2, and E2 Happened Before E3, then E1 Happened Before E3.

In essence, we have a partial order among the events in a distributed system.

2.1 Lamport Clock

Lamport Clock is a mechanism that can extend a partial order into a total order that is consistent with the original partial order.

The algorithm is:
  1. Each process sets an initial clock value (arbitrary) on start-up;
  2. A process increments its counter between two consecutive events in that process; (sending and receiving a message are also events)
  3. When a process sends a message (after doing step 2), it includes its counter value with the message;
  4. On receiving a message (after doing step 2), the receiver process sets its counter to be greater than the clock value in the received message AND greater than or equal to its present clock value;
  5. If two events have the same clock value, use the associated process IDs to determine the order between them.
After applying this algorithm, we get a total order (any pair of events can be compared in terms of order) of all the events in a distributed system (a small sketch follows).
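
A minimal C# sketch of the algorithm for one process; the class shape and the max-plus-one formulation are my own illustration of the steps above, not a particular library.

using System;

public class LamportClock
{
    private readonly int _processId;
    public long Time { get; private set; }

    public LamportClock(int processId, long initialValue)   // step 1
    {
        _processId = processId;
        Time = initialValue;
    }

    public long LocalEvent()                                 // step 2
    {
        return ++Time;
    }

    public long Send()                                       // steps 2 + 3: attach the returned value to the message
    {
        return ++Time;
    }

    public long Receive(long messageTime)                    // step 4
    {
        Time = Math.Max(Time, messageTime) + 1;              // greater than the message clock and than our old value
        return Time;
    }

    // Step 5: break ties on equal clock values with the process id.
    public static bool OrderedBefore(long t1, int p1, long t2, int p2)
    {
        return t1 < t2 || (t1 == t2 && p1 < p2);
    }
}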

The invariant of the Lamport clock algorithm is: if event E1 Happened Before event E2, then Clock(E1) < Clock(E2).

But the drawback of the Lamport Clock is: if Clock(E1) < Clock(E2), we cannot tell whether E1 Happened Before E2 or the two events are simply concurrent. The root cause of this problem is that the Lamport Clock loses some information when extending the partial order. In other words, the resulting total order is just one of the valid total orders.

Sometimes we need to keep the whole partial-order information. To this end, people invented the so-called Vector Clock.

2.2 Vector Clock

The main purpose of the vector clock mechanism is to retain the complete partial-order information (i.e., all possible total orders, not just one of them as with the Lamport Clock) in a logical clock system.

The basic observation is that, in the Lamport Clock algorithm, we only retain the information that a receive event is causally affected by the corresponding send event, while the information that the receive event is NOT causally affected by some events in other processes is lost.

To fix this problem, we should use not only the clock value of the sending process but also the clock values of the other processes, so that we can identify all causal predecessors from all processes in the whole distributed system.

So the basic idea is:
- Use a vector to represent event time/clock
- Each element stores the clock value for one process

Combining the above idea with the Lamport Clock algorithm, we can design a new algorithm that produces a vector clock value for each event in the whole distributed system:
  1. Initially all clock values are set to zero;
  2. Before a process experiences an internal event, it increments its own clock value in the vector by at least one.
  3. Each time a process prepares to send a message, it conducts Step 2 and then sends its entire vector along with the message being sent.
  4. Each time a process receives a message, it conducts Step 2 and then merges the received vector into its own, element by element:
   for each element i do
    local_vector[i] = MAX(local_vector[i], other_vector[i]);

From the algorithm we can see that each element of an event's vector clock means that the event causally depends only on events that happened before that clock value in the corresponding process.

How do we use the Vector Clock obtained by the above algorithm to infer event order? (see the sketch below)
- Let vClock(x) denote the vector clock of event x
- vClock(x) < vClock(y)  <=>  for all z: vClock(x)[z] <= vClock(y)[z], and there exists z': vClock(x)[z'] < vClock(y)[z']
[ In English: vClock(x) is less than vClock(y) if and only if at least one element of vClock(x) is strictly less than the corresponding element of vClock(y), and every other element is less than or equal to the corresponding element of vClock(y). ]
- X Happened Before Y <=> vClock(X) < vClock(Y)

In a word, X Happened Before Y if and only if at least one element of vClock(X) is strictly less than the corresponding element of vClock(Y) and every other element is less than or equal; if neither vClock(X) < vClock(Y) nor vClock(Y) < vClock(X) holds, the two events are concurrent.
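
In code, the comparison rule looks like this (a hedged sketch; the vectors are assumed to have one entry per process and equal length):

public static class VectorClockOrder
{
    // Returns true iff x Happened Before y: every element of x is <= the
    // corresponding element of y, and at least one is strictly smaller.
    public static bool HappenedBefore(long[] x, long[] y)
    {
        bool strictlySmallerSomewhere = false;
        for (int i = 0; i < x.Length; i++)
        {
            if (x[i] > y[i]) return false;
            if (x[i] < y[i]) strictlySmallerSomewhere = true;
        }
        return strictlySmallerSomewhere;
    }

    // Two events are concurrent when neither happened before the other.
    public static bool Concurrent(long[] x, long[] y)
    {
        return !HappenedBefore(x, y) && !HappenedBefore(y, x);
    }
}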

[Reference]


1. Order
 - Partial Order
 - Total Order

2. Logical Clock
 - Lamport Clock
 - Vector Clock

3. Time, Clock and Ordering of Events in a Distributed System by Leslie Lamport (1978)
4. Timestamps in message-passing systems that preserve the partial ordering by Colin J. Fidge (1988)

5/21/2009

Distributed Lease and Failure Detection

I. What Is Lease?
- A grant to use a resource for a certain period of time.
- Lets nodes in a networked system gain exclusive/shared ownership over Named Entities.

II. Why Use Lease?
- Deals with resource management in a distributed context (for example, ownership that is never freed, etc.)

III. The Rules
- A Client must request a Lease from the Server in order to use the resource.
- The Client must renew the Lease before it expires.
- The Client should return the Lease to the Server when it finishes using it.
- If a client doesn't renew a lease for some period of time, the server will expire it and may grant it to other clients later.
- If a client doesn't receive a renew acknowledgment from the server for some period of time, it should commit suicide if the requested resource is exclusive (can't be shared).

IV. Case Studies:
- Failure Detection
- Distributed Garbage Collection Using Lease
- Weblogic Leasing Service
- Leasing in JINI

V. Failure Detection Using Lease

- The Idea

The basic idea is to use leases to manage the lifetime of processes. If a process's life lease is not renewed within some period of time, that process is considered to have failed.

To maintain a simple and easy-to-use semantic, the following statements should be satisfied:
- 1. If the server thinks a client has failed, the client should indeed have failed already
- 2. If a client fails, the server should eventually discover this fact
- 3. If the failed process recovers, the whole system should still work correctly

- The Algorithm

1. For Server Side

[Message Thread]
while ( TRUE ) {
  Receive a msg from Network;
  Send ack back to Client;
  If msg.type = RENEW {
    lease_last_renew_time = GetCurrentTime();
  } Else If msg.type = ACQUIRE {
    grant lease & lease_last_renew_time = GetCurrentTime();
    or deny request
  } Else If msg.type = RELEASE {
    revoke lease;
  }
}

[Check Thread]
While ( TRUE ) {
  Sleep(INTERVAL_TIME)
  If (lease_last_renew_time + SERVER_LEASE_LENGTH < GetCurrentTime()) {
    revoke lease;
  }
}

2. For Client Side

[Message Thread]
last_acked_send_time = GetCurrentTime();
While (TRUE) {
  last_send_time = GetCurrentTime();
  Send RENEW to Server;
  Wait for ack;
  If get ack from server {
    last_acked_send_time = last_send_time;  
  }
  Sleep(TIME_INTERVAL);
}

[Check Thread]
While (TRUE) {
  Sleep(TIME_INTERVAL);
  If (last_acked_send_time + CLIENT_LEASE_LENGTH < GetCurrentTime()) {
    commit_suicide
  }
}

- Configuration
As you can see, there are several parameters that need to be determined:
Ts - server side lease time
Tc - client side lease time
Ti - check action interval

Since message passing is independent of status checking:
- client will notice failure after (Tc, Tc + Ti) time since last acknowledged lease message was issued
- server will notice failure after (Ts, Ts + Ti) time since last renew message was received.

[NOTE: the client event - "the last acknowledged lease message was issued" - definitely occurs before the server event - "the last lease message was received" - because they form a causal sequence]

To maintain the invariant - "the Server must notice a Client failure AFTER that Client has noticed its own failure" ("the Client notices its own failure" means: 1. the client crashed and can't work any more, or 2. it notices that it SHOULD be regarded as failed) - the following expression should be true: (Tc, Tc + Ti) < (Ts, Ts + Ti) => Tc < Ts - Ti -- [1]

Furthermore: to ensure the whole system works in normal cases and under small amounts of packet loss, at least one renew message should be sent before the server thinks the client has failed and before the client commits suicide: Ti < min(Ts, Tc) / 2 -- [2]

So when you configure the above algorithm's parameters, conditions [1] & [2] must be satisfied (a small sketch of this check follows).
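
A tiny C# sketch of that configuration check (names and units are illustrative; the three times are assumed to be in the same unit, e.g. milliseconds):

using System;

public static class LeaseConfig
{
    // ts = server-side lease time, tc = client-side lease time, ti = check interval.
    public static bool IsSafe(double ts, double tc, double ti)
    {
        bool clientNoticesFirst  = tc < ts - ti;               // condition [1]
        bool enoughRenewAttempts = ti < Math.Min(ts, tc) / 2;  // condition [2]
        return clientNoticesFirst && enoughRenewAttempts;
    }
}
// Example: Ts = 30s, Tc = 20s, Ti = 5s satisfies both conditions.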

You can do some failure-scenario analysis to check the correctness of the above algorithm:
- what if server failed?
- what if client failed?
- what if network is partitioned?
- will it work in normal situation?

Failure Scenario of Leasing

NOTE:
1. the time intervals of the server and the client can be different
2. at the client, the lease-renew sending interval and the liveness-checking interval can also be different

update@11/10/09: add reference links about failure detection

[Reference]
1. Lease Service in Distributed System
2. Failure Detector Work @ Cornell
3. Patent - Leasing for Failure Detector, US Patent by SUN
4. Paper - Perfect Failure Detection In Timed Asynchronous Systems
5. Paper - Unreliable Failure Detectors for Reliable Distributed Systems
6. Paper - Quorum-Based Perfect Failure Detection Service
7. PPT - Failure Detectors: A Perspective
8. PPT - Presentation on Failure Detector
9. Tutorial - Fault-Tolerant Distributed Consensus
10. Enforcing Perfect Failure Detection
11. Book - Reliable distributed systems: technologies, Web services, and applications

5/18/2009

Book Notes - Microtrends

Recommended by my friend Dongdan; I read the book recently, and these are my notes.

The book can be read in two parts: the preface and the conclusion (probably not read by many people, yet actually packed with information), and the 75 case studies that make up the main body.


Part I - Content of the Preface and Conclusion

1. The Book's Main Argument

- Modern society is no longer a simple composition of a few classes or blocs; it has been split into many small groups with distinct characteristics. These numerous small but very different groups have a major influence on how society develops, and they are highly instructive for predicting future political and business trends.

2. What Are Small Groups and Microtrends?

- A small group is a small community drawn together by shared needs, habits and preferences; the shared demands and goals of its members are what the author calls a microtrend. Why "small"? Because relative to the total population such groups are tiny, but 1% of the population is the threshold at which the author believes a group or trend can influence the future.

- Today's society is moving in the direction of hundreds (or even more) of "microtrends", rather than a handful of "megatrends".

3. Why Do Small Groups and Microtrends Emerge?

- Economic globalization and the growth of Internet technology have made the world flatter; individuals have more power and influence, forming groups and communicating has become easy, and pursuing individuality has become realistic.
- The Starbucks Economy is replacing the Ford Economy, and the space for individual choice is unprecedentedly wide. In the Ford Economy many people produced one identical mass product; in the Starbucks Economy a few people produce thousands of tailor-made, personalized products. The victory of personalization and free choice lets social trends bloom in every direction.
- The more choices people have, the more they merge themselves into ever smaller slices of society.

4. How to Discover Microtrends?

- People usually rely on unreliable, one-sided sources such as news, media, magazines and personal experience to form their view of the world, so their conclusions are inevitably biased
- We should let numbers speak instead of relying on subjective guesswork
- Trustworthy numbers come from opinion polls and macro-level statistics

5. What Is the Impact When "Microtrends" Become the Norm?

- "Mainstream" notions shrink, the cost of personal choice rises, and choosing requires more thought and wisdom
- Values become more diverse and people become more tolerant: nothing has only one right way of being done
- With interest groups everywhere, governing a democratic society becomes harder
- Mass media lose influence, personalized marketing becomes the dominant approach, and collecting and analyzing personal information becomes ever more important

Part II - The Case Studies

There are 15 sections with 75 cases in total, covering love, work, religion, health, family, politics, teenagers, food, technology, entertainment, education, fashion and more.

Some of the more interesting observations and claims:
1. The highly intelligent, or the elite, care more about individuality, while the masses actually care more about rationality
2. The best way to predict the future is to understand the past (the LRU algorithm of caches?)
3. Technology keeps advancing, yet religion is also becoming milder and more pragmatic, and its influence keeps growing
4. More and more couples are Living Apart Together
5. Thanks to the convenience of the Internet, more and more people treat their own illnesses
6. The number of children educated at home rather than at school is gradually rising
7. Non-profit organizations keep gaining influence in public life
8. Women's aesthetic preferences have become decisive in the design of digital products
9. Tech geeks actually enjoy social activities more than ordinary people do
10. People who dislike tidying up and live rather messily are often well-educated, relatively wealthy "busy people"
11. Quite a few terrorists are in fact idealists from well-off families, well educated and highly capable

Part III - Summary

Strengths:
1. The material is broadly collected, the content (cases) is rich, the information density is high, and the chosen angles are indeed rather novel;
2. For each "microtrend", besides describing the phenomenon itself, the book also analyzes the causes behind it and its impact on future politics and business;
3. The book ends by giving the sources of the numbers and charts cited in each case; this rigorous, serious attitude deserves recognition and emulation.

Weaknesses:
1. The author repeatedly stresses 1%: if a small group reaches more than 1% of the whole population, it becomes a "microtrend" that cannot be ignored. Why 1%, and not 10% or 0.1%? This claim is the cornerstone of most of the case studies, yet the author offers neither strong evidence nor a rigorous argument for it.
2. The book describes many microtrends, gives their sources and analyzes their impact, but how exactly were these microtrends discovered? What method can be used to find new ones? Can we guarantee that all emerging trends will be found? The author never answers these questions well.

Tips:

- The parts most worth reading are the opening preface and the closing conclusion; of the 75 cases in between, read the topics that interest you carefully and just skim the rest.
- http://www.microtrending.com/index.php has the latest information about the book

5/11/2009

Converting an Arbitrary Bool Expression into Conjunctive Normal Form

Bool expressions are widely used in search engines, automatic theorem proving, database query evaluation, etc. In these situations, you need to convert the bool expression into some normal form so that two expressions can be compared for equality, implemented using existing operators, or used to optimize query evaluation. CNF is one such normal form.

In Conjunctive Normal Form (CNF), every sentence is expressed as a conjunction of clauses (where a clause is a disjunction of literals).

The conversion algorithm is simple and straightforward [1]:
1. Drive In Negation (Move Negation Inwards)
- Double-Negation Elimination
- De Morgan
2. Distribute ORs over ANDs
3. Clause Collection
- Remove always-true clauses

The implementation:
1. Define a BoolExpr class to represent bool expressions. The Composite pattern is used, since a bool expression is a tree-like data structure.
2. The recursive method DriveInNegation() implements step 1 of the conversion algorithm.
3. The recursive method ToCNF() implements step 2 of the conversion algorithm. It first converts each child into CNF, then recognizes several cases and handles each one individually:
- if the root op is AND, it's done.
- if the root op is OR and both the left and right ops are OR, it's done.
- if the root op is OR, the left is AND and the right is OR, it is converted using the rule: (A ∧ B) ∨ C ∨ D <=> (A ∨ C ∨ D) ∧ (B ∨ C ∨ D)
- if the root op is OR, the left is OR and the right is AND, it's symmetric to the case above
- if the root op is OR and both the left and right ops are AND, it's converted using the rule: (A ∧ B) ∨ (C ∧ D) <=> (A ∨ C) ∧ (B ∨ C) ∧ (A ∨ D) ∧ (B ∨ D)
4. RemoveDeadClause() implements step 3 of the algorithm. It uses a static map to detect paired literals (for example, Q and ¬Q).

conversion logic code
public BoolExpr ToCNF()
{
    if (IsLeaf())
    // A
    {
        return new BoolExpr(this);
    }

    if (_op == BOP.NOT)
    {
        if (_left.IsLeaf())
        // ┐A
        {
            return new BoolExpr(this);
        }
        else
        // ┐(...)
        {
            BoolExpr expr = _left.DriveInNegation();
            return expr.ToCNF();
        }
    }

    // convert children first
    BoolExpr cnfLeft = null, cnfRight = null;
    if (_left  != null) { cnfLeft  = _left.ToCNF(); }
    if (_right != null) { cnfRight = _right.ToCNF(); }

    if (_op == BOP.AND)
    //   *
    // ?   ?
    { return new BoolExpr(_op, cnfLeft, cnfRight); }

    if (_op == BOP.OR)
    {
        if ((cnfLeft == null || cnfLeft.IsAtomic() || cnfLeft.Op == BOP.OR)
            && (cnfRight == null || cnfRight.IsAtomic() || cnfRight.Op == BOP.OR))
        //   +
        // +   +
        {
            return new BoolExpr(BOP.OR, cnfLeft, cnfRight);
        }
        else if ((cnfLeft != null && cnfLeft.Op == BOP.AND)
                && (cnfRight == null || cnfRight.IsAtomic() || cnfRight.Op == BOP.OR))
        //   +
        // *   +
        {
            BoolExpr newLeft = new BoolExpr(BOP.OR, cnfLeft.Left, cnfRight);
            BoolExpr newRight = new BoolExpr(BOP.OR, cnfLeft.Right, cnfRight);

            return new BoolExpr(BOP.AND, newLeft.ToCNF(), newRight.ToCNF());
        }
        else if ((cnfRight != null && cnfRight.Op == BOP.AND)
                && (cnfLeft == null || cnfLeft.IsAtomic() || cnfLeft.Op == BOP.OR))
        //   +
        // +   *
        {
            BoolExpr newLeft = new BoolExpr(BOP.OR, cnfLeft, cnfRight.Right);
            BoolExpr newRight = new BoolExpr(BOP.OR, cnfLeft, cnfRight.Left);

            return new BoolExpr(BOP.AND, newLeft.ToCNF(), newRight.ToCNF());
        }
        else if ((cnfLeft != null && cnfLeft.Op == BOP.AND)
                && (cnfRight != null && cnfRight.Op == BOP.AND))
        //   +
        // *   *
        {
            BoolExpr newLeft = new BoolExpr(BOP.AND,
                new BoolExpr(BOP.OR, cnfLeft.Left, cnfRight.Left),
                new BoolExpr(BOP.OR, cnfLeft.Right, cnfRight.Left));

            BoolExpr newRight = new BoolExpr(BOP.AND,
                new BoolExpr(BOP.OR, cnfLeft.Left, cnfRight.Right),
                new BoolExpr(BOP.OR, cnfLeft.Right, cnfRight.Right));

            return new BoolExpr(BOP.AND, newLeft.ToCNF(), newRight.ToCNF());
        }
    }

    // error status, should NOT reach here
    System.Console.Out.WriteLine("Error Status");
    return null;
}

console test output
(A ∧ B) ∨ (C ∧ D) => (A ∨ C) ∧ (B ∨ C) ∧ (A ∨ D) ∧ (B ∨ D)
(A ∧ B) ∨ C ∨ D => (A ∨ C ∨ D) ∧ (B ∨ C ∨ D)
¬((A ∧ B) ∨ C ∨ D) => (¬A ∨ ¬B) ∧ ¬C ∧ ¬D
¬¬¬¬¬¬A => A
¬¬¬¬¬A => ¬A
(P ∧ Q) ∨ (¬P ∧ R) ∨ (¬Q ∧ ¬R) => (P ∨ R ∨ ¬Q) ∧ (Q ∨ ¬P ∨ ¬R)


TODO:
1. Eliminate true clauses
2. Reorder literals

source code

Update@5/14/2009
- Always True Clauses are removed in the final result

[Reference]
[1] Artificial Intelligence, a modern approach [2E]
[2] code for algorithms in [1]
[3] course lecture on cnf conversion
[4] cnf conversion in scheme

5/05/2009

QConBeijing 2009

The conference was held a month ago, and most of the material that was going to be made public is now out.

QconBeijing Home Media Report

Day 1 - 04/07/2009

Qcon Opening

KeyNote 1:

C4Media President: Giving China Access to the Knowledge of Experts Worldwide
Talk by Martin Fowler, Chief Scientist at ThoughtWorks
Dylan Schiemann: Facing the Open Web
Jeff Barr: Amazon's Hands-on Experience with Cloud Computing

Track 1 - Agile
1.1 Martin Fowler:使用Ruby将更有效率
1.2 Henrik Kniberg:多团队的Sprint计划
1.3 吕建伟:不要把自己的公司定位为软件公司
1.4 验收测试驱动开发实战(麦天志)

Track 2 (Intro)
2.1 Are We Ready for the Open Web? - Multi-function Hall (Dylan Schiemann)
2.2 Starting from CCTV's Online Olympic Channel - Technology and Application Trends in RIA (邵荣)
2.3 The Design and Development Process in the RIA Field (吕维德)
2.4 An In-depth Look at the Flex Architecture (马鉴)
2.5 Practical Use of RIA Technology in GeoWeb Projects (张剑宇)

Day 2 - 04/08/2009

KeyNote 2:
Rod Johnson: The Era of Simplified Java Has Begun
Randy Shoup: The Challenges eBay Faces on the Internet

Track 3 - Case Study: web architecture
3.1 Lessons from eBay - Best Practices for Scalable Sites (Randy Shoup)
3.2 Architectural Support for Governing Large-scale SOA Systems (程立)
3.3 The Evolution of Douban's Technical Architecture (洪强宁)
3.4 A Conversation with Randy Shoup (程立 & Randy Shoup)
3.5 Large Website Architecture, Seen from Youku (邱丹)

Track 4 - Java in Enterprise Application
4.1 The Present and Future of Spring, and the Challenges of Enterprise Java - Multi-function Hall (Rod Johnson)
4.2 A Conversation with Rod Johnson (Rod Johnson & 毛新生)
4.3 Taobao, Built on Java (林昊)
4.4 JRuby and Rails Bring the Ruby Language into Java Projects (骆古道)
4.5 Java in Enterprise Development (毛新生)

Day 3 - 04/09/2009

Track 5 - Cloud Computing
5.1 Amazon Web Services in Practice (Jeff Barr)
5.2 Sun's Cloud Computing Platform and Its Technical Implementation (曹喜平)
5.3 A Deep Dive into Cloud Computing Development and the Google Technology Platform (栾跃)
5.4 Designing Cloud-based Collaborative Business Intelligence for Enterprises (王翔)

Track 6 - Architect
6.1 Ten Ideas for Improving Architecture Quality (another version) (高焕堂)
6.2 The Trade-offs of Hadoop - Architecting High-performance, High-traffic, Multi-datacenter Internet Applications (于晶纯 & 王迪)
6.3 My Main Views on Architecture (周爱民)
6.4 Architecture and Design of Large, Complex Systems (李伟)
6.5 "Shanzhai" Software Reuse and Architecture Strategy (潘加宇)

Blog Review on QConBeijing

from 许超前: D1, D2, D3.
from InfoQ Cn
from BlueDavy
A concise summary
Conference impressions by Ronghao
from a ThoughtWorker
from a Baiduer

Mp3 for QConBeijing 2009 All Sessions

Update@05/07/2009
Slides for QConBeijing2009 All Sessions

5/03/2009

ganji.com Architecture

Programmer (程序员) magazine, Vol. 4, 2009 has an interview with ganji.com's CEO and VP of Engineering that briefly discusses ganji.com's architecture. Based on the published interview, the system architecture is as follows:

1. Software Stack


Web Server - Nginx
App Lang - php
Database - mysql
OS - Linux

2. Global Optimization

App Cache
- e-accelerator
- memcache
- tt server

Web Cache
- Squid Cluster

CDN
- Detail Unknown

Load Balance
- Citrix Netscaler

3. Database Optimization

Performance - Read/Write Separation (Master/Slave Cluster)
Availability - Active/Passive Failover
Load Balancing - hardware rather than MySQL Proxy

4. Intra-Site Search

- Sphinx

Overall, the architecture is quite conventional: every subsystem uses the most widely accepted solution. A few things I'm curious about:

1. A classified-ads site naturally partitions its content and requests by geography, which makes it a great fit for dynamic load-balancing DNS; LVS, a software load balancer, is also adopted by many systems, and only a very few sites with both money and a strong need go for hardware solutions like F5 or NetScaler. Has ganji.com really reached the scale where load distribution, from the web tier to the db, has to be solved with hardware?

2. Even counting user management and ad publishing, the overall business logic is fairly simple and clear, so why adopt so many different cache solutions?
2. 即使加上用户管理、广告发布,整个业务逻辑还是比较简单清晰,但为什么会采用这么多不同的Cache方案?