8/27/2009

Parallel DBMS vs. Distributed DBMS

 Large-scale data-intensive computing is a hot topic today, and many people have started talking about so-called Parallel Database System and Distributed Database System technologies. But these two concepts can be confusing, so I spent some time trying to make them clear.

A Parallel Database System seeks to improve performance through parallelization of various operations, such as data loading, index building and query evaluation. Although data may be stored in a distributed fashion in such a system, the distribution is governed solely by performance considerations.

In a Distributed Database System, data is physically stored across several sites, and each site is typically managed by a DBMS capable of running independently of the other sites. In contrast to parallel databases, the distribution of data is governed by factors such as local ownership and increased availability.

PDB & DDB Comparison:

1. System Components
- A Distributed DBMS consists of many geographically distributed, autonomous sites connected by low-bandwidth links.
- A Parallel DBMS consists of tightly coupled, non-autonomous nodes connected by high-bandwidth links.

2. Component Role
- Sites in Distributed DBMS can work independently to handle local transactions or work together to handle global transactions.
- Nodes in Parallel DBMS can only work together to handle global transactions.

3. Design Purposes
- Distributed DBMS is for:
 - Sharing Data
 - Local Autonomy
 - High Availability
- Parallel DBMS is for:
 - High Performance
 - High Availability

 But both PDB and DDB need to consider the following problems: 1. Data Distribution (Placement & Replication); 2. Query Parallelization (Distributed Evaluation). Also, since many parallel systems are now built on networks of workstations, the difference between Parallel DB and Distributed DB is becoming smaller.

[Reference]
1. Great paper on PDB & DDB explanation: Distributed and Parallel Database Systems
2. Great paper by Jim Gray: Parallel Database Systems
3. Textbook: Database Management Systems (3rd edition)
4. Textbook: Database System Concepts (5th edition)
5. Textbook: Principles of Distributed Database Systems (2nd edition)
6. DB Textbook List @ Amazon

8/19/2009

Baidu World 2009

  On August 18, 2009, at the China World Hotel (first floor of the China World Trade Center), I attended Baidu World 2009, the technology innovation conference hosted by Baidu. The conference had a main forum and many breakout sessions (details here); I attended the main forum and the search technology session. Here are some thoughts, both technical and non-technical.

  Non-technical side
  1. The conference drew a huge crowd, which shows the enthusiasm inside and outside the industry for the Internet, for search, and for Baidu, as well as the company's influence.
  2. The conference evolved from Baidu World, originally a marketing event; although there were many technical keynotes, the technical depth was limited. The venue was lively, but the positioning felt somewhat unclear.
  3. The organization left much to be desired: little on-site guidance, manual password entry that was completely unnecessary (barcode scanning would have saved a lot of time), misleading signage, rules and procedures that existed in name only, attendance out of control, and so on.
  4. In overall presence and confidence there is still a big gap from first-tier international companies.

(photo from SINA.COM)

  On the technical side I mainly want to discuss three topics: Box Computing, the Aladdin Platform, and the problems search engines face from a product-design perspective.

  1. Box Computing. This was Robin Li's opening keynote and the highlight of the conference.

  The core idea of so-called Box Computing is actually quite simple: the user expresses a query through the traditional input box -> a classification system analyzes the query and decides which category it belongs to -> the request is dispatched to different information sources according to that category.
  This approach has long been used in structured (vertical) search, where a system typically has multiple information sources and must analyze the user's query string before sending targeted requests to specific sources.

  But since Baidu is making such a big push here, the problem is not as simple and ordinary as it might look:
  1). How to understand user intent? A core part of Box Computing is classifying and understanding the user's input. To do this as accurately as possible, it must combine the user's search history, the geographic and temporal context of the request, plus semantic understanding, artificial intelligence, and a whole series of things industry has long wanted to do but never quite achieved.
  2). How to build an open information ecosystem? Baidu will open interfaces to third parties and bring in external information providers. Technically, this greatly increases the amount of information reachable through search and eliminates part of the deep web that traditional crawlers cannot handle. Commercially, it creates a new "information" ecosystem in which information owners get a good chance at user traffic and revenue. For Baidu and for owners of high-value information alike, this is a win-win.
  
  The semantics-related questions remain: what categories do user requests fall into? What format should each category's analysis result take? How do information providers and users interact? These are essentially the same problems faced years ago by people doing enterprise information integration with SOA. Reaching a standard and specification that is reasonable and that everyone actually follows is both a technical and a non-technical problem.

  2. Aladdin Platform. This was the keynote by Baidu principal architect Liao Ruoxue at the afternoon search technology session. The content was fairly high level; the slides and delivery were good, but in substance it does not differ much from Box Computing.

  Main points:
  1) Mine the deep web; collect structured data, not just web page text;
  2) Analyze user intent; present rich, personalized results;
  3) Difference from Box Computing (my understanding): Box Computing is an open ecosystem, while Aladdin is fully controlled by Baidu itself.

  A few observations:
  1). About 25% of user queries go unsatisfied; existing search engines cover roughly 37% of all data
  2). More and more of the important data on the Internet will be structured, non-text data
  3). Diverse data leads to diverse result presentation (tables, video, images, etc.), which places new demands on infrastructure (re-planning and optimizing the server/network layout)
  4). Traditional ranking systems based on links between pages no longer apply; result ranking is a new challenge
  5). Information needs to be updated automatically and promptly, with emphasis on freshness and degree of automation
  6). The goals of the Aladdin Platform have much in common with the positioning of Bing
  7). Search engines are evolving from simple web-page retrieval tools into enormous, do-everything artificial intelligence systems

  3. Search engine problems from the user's perspective. This talk was actually the most informative one, because it contained many facts and observations that are hard to learn elsewhere.

  Problem analysis:
  1. Search demand is rapidly expanding in volume and growing more diverse in form;
    - Users expect the search engine to be not just search but an intelligent question-answering system
    - The spread of mobile and handheld devices adds new demands
  2. The ways user search needs are satisfied are becoming more complex;
    - Search is only one step in the user's overall task
    - Higher freshness requirements
    - Answers depend on their context
  3. User search behavior is becoming more and more "naive";
    - Queries increasingly resemble natural conversation
    - Long queries account for 50% by count and 30% of query volume
  4. Valuable resources on the Internet are getting harder and harder to acquire.
    - A lot of data appears that does not rely on links: blogs, SNS, online libraries
    - Much real-world data has not been digitized or put online
    - How can the knowledge in people's heads be digitized and made searchable?

  Ways forward:
  1. Natural language processing, to understand intent accurately
  2. Rich user interfaces, to satisfy diverse needs
  3. Search communities, to mine the knowledge in people's heads
  4. Third-party structured information, to mine the deep web

[Reference]
http://tech.sina.com.cn/focus/baiduworld/index.shtml

8/03/2009

Consistent Hashing - Theory & Implementation

What is it?

Consistent hashing originated in solving the hot spot problem in Internet systems, i.e., it comes from distributed cache systems. [1][2]

The idea is simple and straightforward (more detail in papers [1][2]):
Hot Spot -> Central Cache -> Distributed Cache (communicating via IP multicast) -> Harvest Cache (tree-structured cache, structured communication) [9] -> Random Tree Cache (a different tree for each cached object, with a hash mapping from tree nodes to machine nodes) -> Random Tree + Consistent Hashing Cache (dealing with machine-node dynamics: machines may leave/fail and join/recover)

Essentially, a consistent hash function is one that changes minimally as the range of the function changes. A hash function can be viewed as a mapping from items to buckets; suppose we have such a mapping m1. If some buckets are added or removed, we get another mapping m2. With an ordinary hash function, going from m1 to m2 moves many items from one bucket to another. A consistent hash function has the property that such item movements are minimal.

How does it accomplish that?

In a consistent hash function:
1. Items and buckets are all hashed (with an ordinary hash) into some integer interval (for example: [0, 1023]).
2. An item is mapped to the bucket that is closest to it.
3. Buckets may be replicated to make the distribution more even.

NOTE: here "closest" means the first bucket the item meets when traversing clockwise along the integer circle (see the diagram below)

Suppose the normal hash output range is [0, 1023], you have 4 buckets and 8 items, and each bucket is replicated twice. One possible consistent hashing is illustrated below:


Current Mapping/Hashing:
Bucket1 - Item1, Item3, Item8
Bucket2 - Item2, Item7
Bucket3 - Item4, Item6
Bucket4 - Item5

If Bucket3 is down/broken, the new Mapping/Hashing will be:


Bucket1 - Item1, Item3, Item8
Bucket2 - Item2, Item6, Item7
Bucket4 - Item4, Item5


You can see that only the items on Bucket3 are moved, and they are redistributed among the remaining buckets.

If a new Bucket5 is added, again only a small number of items move, and the new bucket takes load from each of the original 4 buckets.

How to Implement it?

An ordinary hash function is stateless: its return value is determined only by the input parameter. A consistent hash function, however, is stateful. The state is how the buckets are arranged on the integer circle.

It's natural to store the arrangement of buckets on the integer circle as a search tree, since lookup is in fact a typical search problem: you need to know which segment an item (its hash value) belongs to.

In a practical system, this stateful data structure is stored on each client that uses the consistent hash function. As buckets (machine nodes) join and leave, the state changes. But different clients may see a join/leave at different times, or even in different orders, and thus may produce different hash values from the same consistent hash function (they are said to have different views in paper [1]).

But it is proven in [1] that the number of buckets a single item may belong to, and the number of items on any particular bucket, will not be very large (the so-called spread and load properties).

A C++ version of a consistent hashing function can be found here; it uses the STL map as the binary search tree.

The impact of the bucket replica number can be visualized as below (code can be found in testmain.cxx):

You can see that as the replica count increases, the item distribution over buckets will become more and more even.

[Reference]
1. Theory paper: Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web
2. Practical paper: Web Caching with Consistent Hashing
3. Blog about consistent hashing, with Java code
4. Blog: Understanding Consistent Hash
5. http://en.wikipedia.org/wiki/Consistent_hashing
6. http://en.wikipedia.org/wiki/Distributed_hash_table
7. The Chord P2P system
8. A Hierarchical Internet Object Cache