12/20/2010

Parallel Database for OLTP and OLAP

This is a survey of materials on parallel database products and technologies for OLTP/OLAP applications. It mainly covers the major commercial and academic efforts to build parallel DBMSs that can cope with the ever-growing volume of relational data.
 
Part I - Parallel DBMSs

1.1 Parallel Database for OLAP (Shared-Nothing/MPP)

TeraData
- TeraData Home
- Teradata DBC/1012 Paper
- NCR Teradata VS Oracle Exadata

Vertica
- Vertica Home
- The original research project: C-Store

Paraccel
- Paraccel Home
- MPP Based Architecture
- Columnar Based Storage
- Flash Based Storage

DATAllegro (now MS Madison)
- Design Choices in MPP Data Warehousing Lessons from DATAllegro V3
- Microsoft SQL Server Parallel Data Warehousing

Netezza
- Netezza Home
- Acquired by IBM
- Hadoop & Netezza: Synergy in Data Analytics (Part 1, Part 2)
- Netezza Twinfin VS Oracle Exadata (eBook, Blog)

GreenPlum:
- GreenPlum Home
- Combined: PostGreSQL/ZFS/MapReduce
- Acquired by EMC

Oracle ExaData:
- ExaData Home
- OLTP & OLAP Hybrid Orientation
- 1 * RAC + N * Exadata Cells (Storage Node) + Infiniband Network
- Exadata Cell: Flash Cache + Disk Array + Data Filtering Logic (partial SQL execution)
- Exadata – the Sequel is a great Exadata study article

IBM DB2 Data Partitioning Feature (can work with both OLAP/OLTP)
- Formerly known as DB2 Parallel Edition (A Shorter Overview)
- DB2 At a Glance - Data Partitioning Feature
- Simulating Massively Parallel Database Processing on Linux

AsterData:
- Supercharging Analytics with SQL-MapReduce
- Aster Data brings Applications inside an MPP Database 

Misc Articles:
- What's MPP?
- Comparison of Oracle to IBM DB2 UDB and NCR Teradata for Data Warehousing
- SMP or MPP for Data Warehouse
- Dividing the data Warehousing work among MPP Nodes
- SANs vs. DAS in MPP data Warehousing
- Three ways Oracle or Microsoft could go MPP

1.2 Parallel Database for OLTP (Shared-Disk/SMP)

Oracle Real Application Cluster
- Oracle RAC Concepts
- Oracle Parallel Database Server Concepts
- Oracle RAC Case Study on 16-Node Linux Cluster

IBM DB2 for z/OS (with Sysplex Technology)
- Share Disk and Share Nothing for IBM DB2
- What's DB2 Data Sharing?

IBM DB2 for LUW (with pureScale Technology)
- IBM DB2 pureScale: The Next Big Thing or a Solution Looking for a Problem?
- What is DB2 pureScale?
- DB2 pureScale Scalability (section 1, section 2)

Part II - Academic Readings

2.1 Overview
1). Parallel Database System: The Future of High Performance Database Processing
2). Survey of Architecture of Parallel Database System
3). The Case for Shared Nothing
4). Much Ado About Shared-Nothing 

2.2 Research System
1). XPS: A High Performance Parallel Database Server
2). The Design of XPRS
3). Prototyping Bubba, A Highly Parallel Database System
4). The Gamma Database Machine Project
5). NonStop SQL, A Distributed, High-Performance, High-Availability Implementation of SQL
6). Parallel Query Processing in Shared Disk Database System
7). Architecture of SDC, the Super Database Computer

2.3 Commercial System
1). A Study of A Parallel Database Machine and Its Performance - The NCR/TERADATA DBC/1012
2). A Practical Implementation of the Database Machine - Teradata DBC/1012
3). DB2 Parallel Edition
4). Parallel SQL Execution in Oracle 10g
6). Shared Cache - The Future of Parallel Database
7). Cache Fusion: Extending Shared-Disk Clusters with Shared Caches

12/15/2010

Lecture Notes - AltaVista Indexing and Search Engine

On 01/18/2000, Michael Burrows gave a technical presentation at UW. In the video, he talks about the design of the AltaVista indexing system and the search engine site. The presentation is brief, but it covers many core design concepts that are still used in today's commercial search engines.

The presentation video can be found at uwtv: http://uwtv.org/programs/displayevent.aspx?rid=2123

I have recreated the slides used in his video for future reference. I tried my best to transcribe the text and redraw the diagrams, but errors may have crept in during the process. The copyright belongs to Mike.
I think the most interesting designs are the Location Space and the ISR abstraction. The first makes it possible to store any kind of information using the inverted index mechanism, and the second solves the problem of interpreting complicated search query semantics.

However, it's not easy to fully understand how the whole ISR system works to serve the various query semantics.

In the second part of his presentation, Mike covered many aspects of the AltaVista search engine web site. Many of these experiences and designs are still a good reference for today's Internet web applications.


[Reference]
1. http://www.searchenginehistory.com/
2. http://en.wikipedia.org/wiki/Search_engine
3. http://en.wikipedia.org/wiki/AltaVista

11/23/2010

Source Insight Tips

1. Specify file types to add in a project
Options -> Document Options -> Document Type -> Include when adding to projects

2. Add new language support
Options -> Preferences -> Languages -> Import/Add

3. Associate new file type to some language
Options -> Document Options -> Document Type -> Add Type

4. Add new files to project automatically
Project -> Synchronize Files -> Add new files automatically

5. Show full path of source code file
Options -> Preferences -> Display -> Trim long path names with ellipses (uncheck this option)

6. Project dependency
Options -> Preferences -> Symbol Lookups -> Add Project to Path

7. Create common projects
Options -> Preferences -> Symbol Lookups -> Create Common Projects

8. Colors
Background Color: Options -> Preferences -> Windows background -> Color
Foreground Color: Options -> Preferences -> Default Text -> Color

9. Fonts
Options -> Document Options -> Document Type -> Screen/Print Font

NOTE: Options -> Style Properties gives finer control over each element's font and color. In this dialog box you can also save all your settings to a file on disk and share it with others.

10. Fixed width view
View -> Draft View. This ignores all style settings.

11. Shortcut Keys
You can set them using: Options -> Key Assignments

The common default settings are:
Ctrl+= : jump to definition
Alt+/ : look up references
F3 : search backward
F4 : search forward
F5 : go to line
F7 : look up symbols
F8 : look up local symbols
F9 : indent left
F10 : indent right
F12 : incremental search
Alt+, : jump backward
Alt+. : jump forward
Shift+F3 : search backward for the word under the cursor
Shift+F4 : search forward for the word under the cursor
Shift+F8 : highlight word
Shift+Ctrl+F : search in project

12. Custom Command
Options -> Custom command

There are many substitution chars you can use when invoking the command, for example:
%f - full path of current file
%l - line number of current file
%d - full dir path of current file

Full list can be found in SI's help doc: Command Reference -> Custom Commands -> Command Line Substitutions

13. Macros

Source Insight provides a C-like macro language, which is useful for scripting commands, inserting specially formatted text, and automating editing operations. Macros are saved in a text file with a .EM extension. Once a macro file is part of the project, the macro functions in the file become available as user-level commands in the Key Assignments or Menu Assignments dialog boxes.

For language reference, see "Macro Language Guide" section in SI help doc.

SI's web site also contains some sample macro files: http://www.sourceinsight.com/public/macros

14. Special Features

Conditional Parsing:
- This is similar to conditional compilation in C/C++: you choose which statements to parse
- You can change the settings using: Project -> Project Settings -> Conditions

Token Macro:
- Similar to C/C++ macro feature, but can be used in other languages
- Defined in *.tom file
- Put it in your project data directory

[Reference]

1. Reading WRK code using Source Insight

2. How to use SI to read linux kernel code

3. A good macro file for source insight

10/25/2010

Integer Variable Length in C/C++ on Different Platforms

When writing code for multiple platforms (in terms of both OS and CPU architecture), making the code independent of the exact byte size of each integer type in C/C++ on each platform is a common challenge.

What does the ANSI C standard say about this problem?

The standard defines 5 standard signed integer types:
- signed char
- short int
- int
- long int
- long long int

It also defines limits on these types in limits.h.

But it does not explicitly specify the exact byte size of each type.

The common understanding of the standard is that it requires:
sizeof(short int) <= sizeof(int) <= sizeof(long int) <= sizeof(long long int)

So what does the documentation of popular compilers say about this?

Visual Studio 10 has an MSDN article that describes the exact size of each integer type.

From that article we can see:
sizeof(short int) = 2
sizeof(int) = 4
sizeof(long int) = 4
sizeof(long long int) = 8

And these sizes hold on both 32-bit and 64-bit platforms.

To help programmers stay aware of the exact size of the integer types they use, VS 10 also provides some additional integer types:
__int8, __int16, __int32, __int64 and their unsigned counterparts.

In fact, C99 also defines fixed-width integer types in stdint.h:
uint8_t/int8_t
uint16_t/int16_t
uint32_t/int32_t
uint64_t/int64_t
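
A quick way to check these sizes on a given machine, without writing and compiling a C program, is to ask Python's ctypes module, which exposes the C types of the platform the interpreter was built for. This is only a probing sketch: Python is my choice of tool here, and the dictionary and function names are made up for the example.

```python
import ctypes

# Platform-dependent C integer types, as exposed by ctypes.
PLATFORM_TYPES = {
    "short int": ctypes.c_short,
    "int": ctypes.c_int,
    "long int": ctypes.c_long,
    "long long int": ctypes.c_longlong,
}

# Counterparts of the fixed-width stdint.h types.
FIXED_TYPES = {
    "int8_t": ctypes.c_int8,
    "int16_t": ctypes.c_int16,
    "int32_t": ctypes.c_int32,
    "int64_t": ctypes.c_int64,
}

def sizes(types):
    """Return {type name: size in bytes} for the given ctypes types."""
    return {name: ctypes.sizeof(t) for name, t in types.items()}

platform_sizes = sizes(PLATFORM_TYPES)
fixed_sizes = sizes(FIXED_TYPES)

if __name__ == "__main__":
    for name, size in {**platform_sizes, **fixed_sizes}.items():
        print(f"sizeof({name}) = {size}")
```

The fixed-width entries report the same sizes everywhere, while the value printed for long int typically differs between 64-bit Windows and 64-bit Linux, which is exactly the portability trap described above.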

What about scanf() and printf()? The format specifiers for these types are defined in the standard header inttypes.h. For example, this is inttypes.h for Visual Studio.

And here is a good summary of how to use format strings with integer types in C/C++.


[Reference]
1. stdint.h in C99
2. Integer Types in VS10
3. ANSI C99 Spec
4. Integers in C99

10/14/2010

Alexa and Its Ranking List

I recently read some material talking about the page view ranking of some web sites. It said that the source of the ranking data is alexa.com.

It's good to know the source of the referenced data, but how trustworthy is that source?

I had a look at that website to learn how the ranking list is generated. Here is my understanding:

1. Alexa’s traffic rankings are based on data collected from Alexa Toolbar and other, diverse sources over a rolling 3 month period.

2. A site’s ranking is based on a combined measure of Reach and PageViews.
- Reach is determined by the number of unique Alexa users who visit a site on a given day.
- PageViews are the total number of Alexa user URL requests for a site. However, multiple requests for the same URL on the same day by the same user are counted as a single pageview.

3. Sites with relatively low traffic will not be accurately ranked by Alexa. Traffic rankings of 100,000+ should be regarded as unreliable. Conversely, the closer a site gets to #1, the more reliable its rank. This is because Alexa only uses data sampled from Alexa Toolbar users, who are just a small portion of all Internet users.
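
To make the Reach and PageViews definitions in item 2 concrete, here is a small sketch of how the two numbers could be counted from a toolbar request log. The log format, the function name and the sample data are all my own assumptions, and how Alexa combines the two measures into a rank is not something the FAQ spells out, so that step is left out.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def daily_reach_and_pageviews(requests):
    """requests: iterable of (user_id, day, url) tuples (assumed log format).

    Returns {(site, day): (reach, pageviews)}.
    """
    # The same URL requested by the same user on the same day counts once.
    unique_hits = set(requests)
    users = defaultdict(set)
    pageviews = defaultdict(int)
    for user_id, day, url in unique_hits:
        key = (urlsplit(url).netloc, day)
        users[key].add(user_id)      # reach: unique users per site per day
        pageviews[key] += 1          # pageviews: de-duplicated URL requests
    return {key: (len(users[key]), pageviews[key]) for key in users}

# Made-up sample log.
sample = [
    ("u1", "2010-10-14", "http://example.com/a"),
    ("u1", "2010-10-14", "http://example.com/a"),  # repeat, counted once
    ("u1", "2010-10-14", "http://example.com/b"),
    ("u2", "2010-10-14", "http://example.com/a"),
    ("u2", "2010-10-14", "http://example.org/"),
]
stats = daily_reach_and_pageviews(sample)
```

With this sample, example.com gets a reach of 2 and 3 pageviews for the day, and example.org gets 1 and 1.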

So it seems the ranking list should not be treated as authoritative, since very few people use the toolbar. But why is it so popular and so important to many VCs? I guess it's mainly due to the lack of better alternatives.

Better data providers would be web browser vendors like Microsoft, Mozilla and Google. But obviously, they are not willing to share the data they collect with the community because of privacy concerns and potential legal issues.

[Reference]
1. How Reliable Are Your Traffic Ranking?
http://www.alexa.com/faqs/?p=139

2. How are Alexa’s traffic rankings determined?
http://www.alexa.com/faqs/?p=134

3. About the Alexa Traffic Rankings
http://www.alexa.com/help/traffic_learn_more

9/23/2010

Disciplines in Microsoft Engineering Team

I really want this blog to be a place to express my own ideas and thoughts, but I don't mind quoting other people's great ideas, especially when they are helpful to me or to potential readers.

The following content is copied from an MSDN blog post named Product Development Disciplines at Microsoft; I just highlighted some lines.

"Over the last several months in my role here in China, I have given talks at several leading universities and met with many of the leading faculty and students working on technologies related to the Data Platform. I’ve also spoken at several industry conferences, meeting with customers, partners, analysts and other industry folks. There are many topics that come up at these meetings – changing technology trends, distributed development, the tremendous growth of Asia etc. But one topic that seems to come up more than almost any other is the question of how we organize and conduct our product development in Microsoft. I suppose this is only natural – Microsoft is one of the most successful software companies in the world, and the software industry here in this region is poised for tremendous growth, so it makes sense that people in the industry are eager to learn from the our experience over the last quarter century.
This is actually a very big topic and within Microsoft we have an Engineering Excellence group that actually runs courses that can span several days and provide an overview of Microsoft’s software development methodology, our engineering system, organizational structures, best practices, tools and technologies we use internally ensure quality, reliability, security etc and a variety of related topic. By no means would we claim that we have all this figured out perfectly and have a perfect system, but there is indeed a lot of accumulated knowledge and experience that we can share. And we do actually share this information, in appropriate form, with others in our industry, worldwide and also in this region.
As this is indeed a large topic, I don’t want to get too deep into this here, but I do want to address one aspect of our engineering system – the core disciplines that we organize our R&D teams around and the particular roles that each of these disciplines plays. I want to discuss this because I believe Microsoft does this a little bit differently from the rest of the industry even in the US, and especially here in China there is not a good understanding of these core disciplines and what role each of them plays.
Traditionally, the Microsoft engineering system has consisted of 3 “core” disciplines: “Development”, “Test”, and “Program Management”, also known as Dev/Test/PM for short. I’m going to touch on each of these briefly here, but I like to introduce them in a different order:
PM: When we think of engineering disciplines, most people start with “Dev”. For me however, things really start with the Program Management discipline. At Microsoft, “PM” means many different things, but for me the core essence of the PM role is two things:
1. The first part of the PM’s job is to understand the customer’s requirements and translate that into a functional specification of what we should build. This is where it all begins. If we don’t understand the customer, it is not very likely that we’ll end up building the right thing.
2. The second part of the PM’s job is to work with Dev and Test to translate the initial specification into a living, breathing product.
I find that many people, especially here in China, think “Project Management” when they hear PM. Indeed, Project Management is part of a PM’s job (under #2 above), but it is only a part of the PM’s job. The real skill that a PM brings is the expertise to listen to customers, understand the world from their point of view, and then to design a solution for their problem. This does not just mean giving customers what they ask for literally, but to truly understand them and design a solution that solves their problems even if the customers could never imagine the solution – as the famous saying goes, if we had only listened to customers, we would have looked for a faster horse, not come up with the automobile.
Dev: Of all the engineering disciplines, this one is probably the one people think about the most commonly. Dev is short-hand for “Development”, the folks who responsibility it is to actually design and build the software that we ship. The essential job of Dev is to take the functional specification produced by PM and translate that into an actual implementation. In the world of mission-critical system-level software, this implementation better be extremely reliable, secure, manageable, scalable and high-performance. And the designs and implementations Dev produces better stand the test of time and last for several versions and years to come.
Test: The test discipline in Microsoft is much misunderstood, certainly externally, but sometimes internally as well. When I first came to Microsoft many years ago, I was (pleasantly) surprised to find that Microsoft had almost as many, if not more, testers as developers. Coming from a company that had a much less developed testing discipline (and where as a result, quality assurance was considerably weak), it took a little while to get used to what the essence of the Test discipline really is. The reality is that, in Microsoft, how fast we can ship software depends on not how quickly we can design and implement it but rather on how quickly we can test it. This is because every piece of software we ship, especially on the systems-software side, has to pass an extremely high quality bar. The Test discipline is really an complex area, and one where have learned a lot over the years in terms of different types of testing that we employ – unit tests, functional test, integration tests, stress and long-haul tests, performance tests, security tests, localization tests, etc. The set of tools and techniques we employ in test is truly some of the most impressive and complex – automated test harnesses, automated test generators, automated test failure analyzers, automated security “fuzzers”, fail-point and state-machine based testing.
The three “core” engineering disciplines described above are like the 3 legs of a chair – you need all three of them, and in a balance, to have a proper engineering organization. No one leg can dominate the other – otherwise, you get an organization that may not be in touch with customers needs or one that does not pay enough attention to quality. Indeed, the three disciplines are a little bit like the branches of government – they form a system of checks and balances that ensures we understand what customers want, we design and build that with high quality, and we ensure that we deliver a product that meets customer expectations in every regard.
It is also important to emphasize that we aim to attract the best talent to all three core disciplines – the bar is equally high for all the disciplines, it just happens to be that the passion and skill-set for each is a little different:
- PMs usually have a passion for working with customers, conceptualizing what the product should do, and then working with their Dev and Test peers to coordinate all the work to make sure we deliver exactly that.
- Developers have a passion for building top-quality software – software that is innovative, simple, reliable, secure, scalable, high-performance and stands the test of time.
- Testers are passionate about finding all kinds of ways to break software and making sure making sure we find all the issues and bugs before we ship it to customers.
When we interview candidates, a very important part of what we do is find out which discipline the person’s talent and passion really lie in and directs them accordingly. Of course, over the course of one’s career, one’s passion and talent may change, and the person may change disciplines as a result – I myself started in the Dev discipline before switching to PM. This is only natural and we actually encourage that as a way to build better teams.
Other disciplines
It is also important to point out that although the three disciplines mentioned above are what have traditionally been considered the “core” disciplines at Microsoft, there are several other disciplines that are also becoming increasingly important. For example, User Experience (UX) professionals are essential to ensuring that products are intuitive and natural for users to use. A great user experience can make the difference a product that customers love versus one they merely tolerate. UX is certainly very important for products aimed at end consumers, but it is also important for all our audiences – Developers, IT Professionals, Information Workers.
As we move into the Software+Services era, a variety of disciplines related to architecting, building and running extremely large-scale infrastructure becomes increasingly important. Again, while this has been true for some time for our consumer facing web properties such as MSN and Live, it is now becoming increasingly important for all our product groups as more and more of them take steps to evolve their products along the Software+Services model.
Many candidates I talk to often want to discuss what role at Microsoft would be the best fit for them and how they can grow their careers. The best advice I can think of is to work on a technology and a role that they are really passionate about.
As I mentioned above, we value all the disciplines equally and a well-balanced organization needs great people in all the different roles. While different disciplines appeal to people with different passions and skill-sets, all the disciplines offer opportunities for innovation and great work. And all of them offer opportunities for advancement and leadership. Indeed if you look across the senior levels of Microsoft, there are leaders who emerged from various disciplines – what they shared was a passion for what the work they were doing.
I hope this discussion of the different engineering disciplines at Microsoft and the approach we take to them shall be useful for the many people who seem to be interested in this topic. If you have any questions or comments, feel free to post a reply to his entry."

9/05/2010

Relevance Measuring in Information Retrieval System

One of the challenges an Information Retrieval system faces is relevance quality. It's the main factor that determines end users' happiness. (The other two are latency and corpus size.)

To design and implement an IR system with high relevance quality, we need methods to measure the quality of relevance.

Generally speaking, a measuring system consists of three components:
- Test Corpus (Document Collection for Test purpose)
- Test Query Set (Set of Queries for Test)
- Measuring Parameter (usually a function, used to measure the retrieval result of an IR system for some query in the query set, using the test corpus)

Test Corpus/Query is another story; here we focus only on the measuring parameter/function.

1. Precision/Recall for un-ranked retrieval result

Precision = #Relevant Documents Retrieved / #Retrieved Documents. It's the percentage of the returned documents that are really relevant to the user query.

Recall = #Relevant Documents Retrieved / #Total Relevant Documents. It's the percentage of the relevant documents in the corpus that are retrieved in the query result.
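
As a small worked example of the two formulas, the sketch below computes both values for one query. The function name and the document ids are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: ids of the documents the system returned.
    relevant: ids of all documents in the test corpus judged relevant.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 4 documents returned, 2 of them relevant; 5 relevant documents exist in total.
precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d5", "d6", "d7"],
)
```

Here precision is 2/4 = 0.5 and recall is 2/5 = 0.4. Neither number looks at the order of the results, which is why a separate measure is needed for ranked output.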

2. NDCG for ranked retrieval result

NDCG stands for Normalized Discounted Cumulative Gain, a measure based on human ratings.

Gain - a rater assigns a numeric value (a score) representing how good a returned document is for a specific query.

Cumulative Gain - the rater assigns a gain value to each document in the top K returned results; the values are assigned individually and independently, then summed.
 \mathrm{CG_{p}} = \sum_{i=1}^{p} rel_{i}
Discounted Cumulative Gain - each document's relevance score is weighted by its position in the retrieval result, so that gains at lower positions count less.
 \mathrm{DCG_{p}} = rel_{1} + \sum_{i=2}^{p} \frac{rel_{i}}{\log_{2}i}
Normalized Discounted Cumulative Gain - scales the final value into [0, 1]. Usually, the DCG score of the ideally ordered document list (ordered by gain score), written IDCG, is used as the normalizing factor. So
 \mathrm{nDCG_{p}} = \frac{DCG_{p}}{IDCG_{p}}
For a concrete example of how to compute the NDCG value of a query result, please see the wiki on NDCG.
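
The three formulas above translate almost directly into code. The following is a minimal sketch: the ratings are made up, and the optional judged_gains parameter (my own addition, not part of any standard API) lets the ideal list be built from every judged document for the query rather than only from the returned ones.

```python
import math

def dcg(gains):
    """DCG_p = rel_1 + sum over i = 2..p of rel_i / log2(i); positions are 1-based."""
    total = 0.0
    for position, gain in enumerate(gains, start=1):
        total += gain if position == 1 else gain / math.log2(position)
    return total

def ndcg(gains, judged_gains=None):
    """Normalize DCG by the DCG of the ideal ordering (IDCG).

    gains: ratings of the returned documents, in returned order.
    judged_gains: ratings of every judged document for the query; when
    omitted, the ideal list is just the returned list re-sorted.
    """
    pool = gains if judged_gains is None else judged_gains
    ideal = sorted(pool, reverse=True)[:len(gains)]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

returned = [3, 2, 3, 0, 1, 2]   # made-up ratings on a 0..3 scale
score = ndcg(returned)          # roughly 0.93

# Two weak results in "perfect" order, judged against a richer pool.
weak_but_sorted = ndcg([1, 0], judged_gains=[3, 3, 1, 0])
```

Without judged_gains, the list [1, 0] scores 1.0 simply because it is already sorted; with the full judged pool it scores 1/6, because the ideal top two would both have been rated 3.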

NDCG is widely used in today's commercial search engine evaluation. The problem is that if the returned documents are ordered by decreasing gain score, the NDCG value reaches its maximum of 1.

This means NDCG only measures the ranking algorithm of a search engine and can't tell whether the returned documents are highly related to the user's intention. But from the end user's perspective, the perfect result is a set of highly related documents, ordered properly.

More technically, a typical query serving sub-system of an IR system has two phases: matching (finding highly related documents) and ranking (ordering the matched documents). NDCG may be a proper tool to measure the ranking phase, but definitely not the matching phase. So I think it is not an ideal measuring mechanism for an IR system.

So, tuning the whole system against the NDCG score alone may not be the right direction for improving a search engine.

Update@07/09
- The ideal set, which is used to calculate the normalization factor, is the list of highest-scored documents in proper order, not the proper ordering of the returned documents. So the problem I mentioned above doesn't exist.
- But the final effect of this measuring method depends on the test corpus, the test queries, and the predefined gain scores for each query.

8/05/2010

Lecture Notes - Evolution of Google Search Engine

Jeff Dean gave a keynote, Building Large Scale Information Retrieval Systems, at WSDM 2009. It's actually a presentation on how the Google search engine evolved over the past 10 years. Here are my notes on the lecture.

Part I - Overview of Search Engine Evolution: 1999 VS 2009

Factors to Consider when Designing an Information Retrieval System:
1. Corpus Size (# docs to be indexed)
2. QPS (Queries Per Second)
3. Freshness/Update Rate
4. Query Latency
5. Complexity/Cost of Scoring/Retrieval Algorithm

Parameter Change 1999 -> 2009:
1. Corpus Size: 70M -> *B ~100X
2. QPS: ~1000X
3. Refresh: Months -> Minutes ~10000X
4. Latency: <1s -> <0.2s ~5X
5. Machine Scale: ~1000X

Consider 10x Growth when designing, Rewrite for 100x Growth!

Part II - Evolution of Google Search Engine

~1997 - Circa, Research Prototype

- Simple Architecture and Focus on System Distributing/Partitioning
- Term vs Doc based Partition: Doc based Win
- Disk Based Index, DocID+Posting List with Position Attributes, Byte Aligned Encoding
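
To show what a byte-aligned posting list encoding can look like, here is a sketch of the classic variable-byte scheme: docIDs are stored as gaps from the previous docID, and each gap uses 7 payload bits per byte with the high bit meaning "more bytes follow". The gap encoding, the bit layout and the function names are my assumptions for illustration; the talk notes above don't give the exact on-disk format used in the prototype.

```python
def encode_varint(n):
    """Encode a non-negative int using 7 bits per byte, least significant group first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)         # high bit clear: last byte of this number
            return bytes(out)

def encode_postings(doc_ids):
    """Delta-encode a sorted docID list, then varint-encode each gap."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        out += encode_varint(doc_id - prev)
        prev = doc_id
    return bytes(out)

def decode_postings(data):
    """Inverse of encode_postings: rebuild the sorted docID list."""
    doc_ids, current, shift, prev = [], 0, 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += current          # gap -> absolute docID
            doc_ids.append(prev)
            current, shift = 0, 0
    return doc_ids

postings = [3, 130, 131, 70000]      # made-up docIDs
encoded = encode_postings(postings)  # 6 bytes instead of 4 fixed 32-bit ints
```

Small gaps fit in one byte, so dense posting lists shrink a lot, and decoding never has to deal with values that straddle byte boundaries.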


~1999 - Circa, Production

- Introduced Cache
-- hit rate is low (30~60%) due to index refreshes and long tail queries
-- still very beneficial: it saves a large amount of disk I/O
-- hot terms get first priority in the cache, since they are frequent and costly requests

- Replica Index Data
-- better performance
-- better availability

Some Summary in late 1990's:

Crawler is a simple batch system
- start with very few URLs
- queue new URLs as they are found
- stop when there are enough docs

Index serving using cheap machines
- no failure handling
- added record/chunk checksums

Index Update
- once/month
- wait for traffic to be low -> take replica offline -> do update -> start serving


~2000 - Dealing with Growth

Situation:

- index size: 50M -> 1000M docs
- ~20% query traffic increase/month
- Yahoo! deal

Solution:

add machines constantly
- more index shards for larger index size
- more index replica for bigger query capacity

And improve software constantly
- better disk scheduling
- better index encoding


~2001 - Adding In-Memory Index

- enough machine memory to hold the whole index in memory
- machine role: replica -> holder of micro shards
- balancer: acts as coordinator
- availability: replicate important docs



~2004 - Adding Infrastructure

- Generalize tree structured query flow
- Generalize balancer concept
- New index encoding: group varint encoding
- GFS appears in production


~2007 - Universal Search
- Universal search: combine results from multiple vertical corpora
- Realtime search: fast url finding -> crawling -> indexing -> serving cycle
- Experiment supporting: have idea -> try it on real data offline and tune -> live experiment on small piece -> roll out and launch

Part III - Future Trends

- Cross Language Information Retrieval
- ACLs in large IR systems with huge numbers of users and dynamic requirements
- Automatic construction of efficient IR systems (one binary for realtime and regular web index with different parameter configurations)
- Info extraction from semi-structured data

[Reference]
1. Challenges in Building Large Scale Information Retrieval Systems[PDF, Video]
2. Notes by Another Blogger - http://www.searchenginecaffe.com/2009/02/jeffrey-dean-wsdm-keynote-building.html

7/28/2010

Book Notes - The Long Tail

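This is a book that beats the drum for the Internet economy. A few years ago, practically everyone working in the Internet industry invoked the "Long Tail" in every conversation; otherwise you could hardly show your face in the business. Yet within just a few years, the once fashionable bestseller has almost turned into a classical economics text. As someone who had only pretended to have read it, I finally set aside some time recently to read it through properly. Measured by the time and effort spent, I suspect reading a real economics monograph would take about as much.

Here are some reading notes: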

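Part I - What's Long Tail?


In economics and marketing, the long tail phenomenon means that market sales are no longer dominated by just a few hit products (the Short Head). A growing number of non-hit products (the Long Tail) begin to share the market following a power law, and the combined market share of the non-hit products tends to exceed the combined share of the hits.

Another feature of the long tail is its fractal nature: any contiguous part of a long tail curve is itself a long tail (self-similarity at multiple scales).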

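Part II - Why Long Tail?


1. Before the Long Tail: the Hit-Driven Economy

Hit Driven Economics means that the market consists mainly of a small variety of hit products with huge sales volumes. What lies behind it is the idea of mass production brought by the Industrial Revolution. In the industrial age, manufacturing capacity rose to unprecedented levels, but the marginal cost of production only falls to a level ordinary consumers can accept once production reaches a certain scale. Beyond manufacturing, the cost of shipping and marketing a single item in the physical world also depends heavily on its sales volume.

The keyword of economic activity in this era is scale. For any product, if the expected market demand does not reach a considerable volume, neither manufacturers nor distributors can profit from it, so naturally only the big hits can form a virtuous cycle in the market.

In such an economy, producers naturally focus on the mainstream mass market. Consumers get affordable prices, but a very narrow range of choices. The very few niche markets are mostly luxury markets made up of the upper-class elite.

In such an era, people communicate mainly through mass media that spread information in broadcast mode, such as newspapers, magazines, radio and television. Ordinary people in such a society, for all their different characteristics, receive much the same products, services and social information (speaking of the general public). The price of pursuing individuality is very high.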

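2. The Long Tail Economy

1) Huge demand for personalized goods

As members of human society we have a great deal in common, but we also have many individual traits and needs. After all, whether in our biological or our social attributes, we are born different. The abundance of material production has satisfied basic human needs, while the development of education, the progress of civilization and better means of communication have all strengthened our pursuit of individuality. So the demand for small-scale, personalized goods has long been hidden deep within us.

2) Lower production cost for personalized goods

With the continuous development of technology, driven especially by information technology, manufacturing costs keep falling, making small-scale customized products possible.

Computers, networks and all kinds of digital hardware and software have greatly increased the production capability of the general public, and have also lowered the cost of producing creative works.

The variety and number of goods become extremely rich, and non-hit products can be customized to individual taste at an affordable price.

3) Significantly lower marketing cost

The emergence and spread of network technology has sharply reduced the cost of retailing. Merchants no longer need to rent expensive premises in prime locations, set up glossy display counters, or hire large sales and management staffs, and the physical products no longer have to be shipped to brick-and-mortar stores in every corner of the city.

Continuous improvements in transportation and roads keep lowering logistics costs, bringing the cost of delivering a single item down to an acceptable range.

Given these two preconditions, a merchant only needs a product catalog web site and a low-cost centralized warehouse, plus a relatively small shipping staff, to start a retail business.

Compared with traditional physical stores, the marketing cost per item drops sharply, and so does the sales volume an item needs in order to be profitable. For certain digital products, such as MP3 music and e-books, the per-item marketing cost is so negligible that the merchant can profit even from a single sale.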

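4) Lower purchasing cost for consumers

With online aggregators like Amazon, consumers no longer need to waste time browsing and choosing in physical stores.

Various filters lower the user's cost of choosing:
- Search technology makes it easy to pick out what suits you from a huge number of products
- Easily shared user ratings and reviews provide a trusted reference for purchase decisions
- Tracking users' commercial behavior with information technology, combined with smart algorithms, makes it possible to automatically generate sensible product recommendations
- The boom in personal publishing (Blog, Twitter) and online communities (Facebook) has made online marketing thrive, and has also become a powerful tool for helping consumers filter out the noise in the long tail

In this market environment, the keyword changes from scale to individuality. Anything with a unique quality that meets the needs of even a tiny number of people can be profitable. So everyone from producers to sales channels starts to pay attention to the niche markets, each small but vast in number. Although each niche market brings in little profit, there are so many of them that their cumulative total can already rival the profit from the big hits in the Short Head.
- This is the so-called Long Tail: Selling Less of More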

Part III - More About Long Tail

1. Long Tail VS Short Head

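Although the rise of the long tail draws our attention to the long tail end of the market, we must not neglect the popular head.

Because, as human beings, "For each way that we differ from one another, there are more ways that we’re alike". Most of our basic needs are therefore similar, and in the long tail era these similarities still form a Short but Big Head. They are the Low Hanging Fruit of the market and should be seized first.

On the other hand, an aggregator with only a Long Tail and no Short Head will find it hard to attract and keep customers; each of us is just a mix of basic commonalities and a few individual traits.

But since there is still a Short Head in a Long Tail market, it seems more fitting to change the book's subtitle from "Why the Future of Business IS Selling Less of More" to "Why the Future of Business INCLUDES Selling Less of More".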

2. Long Tail VS Power Law VS Zipf's Law VS Pareto Distribution

帕累托分布(Pareto Distribution): 一个国家内个人收入X不小于某个特定值x的概率与x的常数次幂存在简单的反比关 系:P[X≥k]~x^(-k)。

齐 普夫法则(Zipf's Law): 对于某语言,如果把单词出现的频率按由大到小的顺序排列,则每个单词出现的频率与它的名次(r)的常数次幂存在简单的反比关系:P(r)~r^(-α)。

其 实不论Pareto Distribution还是Zipf's Law都是理论上的Power Law Distribution分别在经济和语言领域的应用,另外Pareto Distribution是从累积分布函数(Cumulative Distribution Function)的角度进行描述的,而Zipf's Law是从概率密度函数(Probability Density Function)的角度进行描述的。

幂次分布(Power Law Distribution)说的是随机变量X的概率密度函数与另一个变量X数值的幂次成比例(通常是反比), i.e., f(x) = http://upload.wikimedia.org/math/2/b/b/2bbff8a3c0be71b037045b90981d611b.png。Long Tail只是直观描述了Power Law Distribution变量的分布特点。

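The so-called 80/20 Principle, also known as the law of the vital few (The Vital Few), is merely a natural logical consequence of the Power Law Distribution. The specific numbers are not that important; it is a qualitative rule rather than a quantitative one. A common misconception is that if the numbers change, the two must still add up to 100. In fact the 80 and the 20 describe different things, so they do not have to sum to 100.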
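
As a worked example of why the two numbers need not add up to 100, here is a small Python sketch. It assumes an idealized continuous Pareto distribution with shape k > 1, for which the top fraction p of the population holds p^(1 - 1/k) of the total; the function name and the sample shapes are illustrative.

```python
import math


def top_share(p, k):
    """Share of the total held by the top fraction p under a Pareto law with shape k."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    if k <= 1:
        raise ValueError("the mean is infinite for k <= 1")
    return p ** (1 - 1 / k)


# The shape for which the top 20% holds exactly 80%: k = log(5) / log(4).
K_80_20 = math.log(5) / math.log(4)

if __name__ == "__main__":
    for k in (K_80_20, 1.5, 2.0):
        for p in (0.2, 0.1):
            print(f"k={k:.3f}: top {p:.0%} holds {top_share(p, k):.1%}")
```

Only one particular shape gives 80/20; other shapes give pairs whose sum is not 100, which is consistent with reading the rule qualitatively.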

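As explained above, the Long Tail still follows a Power Law Distribution. The specific numbers may have shifted from 80/20 to 50/10, but the numbers matter less than the nature and the trend: the Vital Few property has not fundamentally changed, and a Short but Big Head still exists. So in essence the discovery of the Long Tail does not conflict with Pareto's discovery of the Vital Few.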

3. Long Tail VS Choosing Cost

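The main problem the long tail market poses for consumers is the cost of choosing. On the problem of too many choices, there are two interesting conclusions:
1. More choice does expand the overall market (a bigger pie)
2. People want choice, but they resist a complicated selection process

For anyone doing business in a long tail market, helping users choose quickly and easily is an important topic.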

4. Long Tail: Economic VS Culture

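The author also discusses the relationship between Niche Economics and Niche Culture. These observations are not really new; they are largely consistent with the observations and conclusions of another bestseller, Microtrends: modern society is no longer a simple composition of a few classes or blocs, but has been split into many small groups with distinct characteristics.

This conclusion explains the foundation of the long tail economy from the perspective of consumer demand.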

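5. Some Controversies

Anderson's theory has also met with some questioning and debate.

1) Anderson's heated debate with a Harvard Business School professor: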

2) Long Tail Challenge in Search Query
- http://www.seomoz.org/blog/the-long-tail-theory-gets-challenged-just-not-in-search-query-demand

3) Rethinking the Long Tail Theory
http://knowledge.wharton.upenn.edu/article.cfm?articleid=2338

4) Why Chris Anderson's theory of the digital world might be all wrong

http://www.slate.com/id/2195151/

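Part IV - Guidelines for Long Tail Business

1. Lower costs and offer as wide a variety of goods as possible
- Encourage and make use of customer participation in production
- Tap the collective wisdom of online communities
- Use centralized warehousing, or partners' distributed warehousing
- Digitize goods wherever possible to cut transmission and storage costs

2. Put the customer first in everything
- Get users to rate and review products of their own accord
- Provide easy-to-use and objective search and recommendation interfaces
- Present products in full detail from every angle
- Recognize and respect that your customers are diverse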

Part V - Notes on Book Notes

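The book's ideas were quite thought-provoking when it first came out. But its argumentation is not very rigorous: the examples are all wonderful and engagingly told, yet they are often logically disconnected from the point of the chapter they appear in. For example, Chapter 3, a short history of the long tail, is about the history of retailing rather than the history of the long tail phenomenon; and in Chapter 5, on the new producers, the things those so-called new producers create have very little to do with the goods in the book's long tail case studies.

But this seems typical of American bestsellers: put forward an eye-catching idea, then, under that banner, smuggle in cases that are only loosely related but entertaining, carried along by excellent writing. A less careful reader comes away thinking: wow, a fresh and inspiring idea, wonderful and fun cases, a thoroughly enjoyable read.

Besides, today's Internet economy and marketing offer many topics worth studying in depth, many of them concrete technical work, but Anderson mostly observes from a distance as a sociologist: too theoretical and too sociological, not economic or actionable enough. On this point, 长尾经济学 (Long Tail Economics) by the Japanese author 管谷义博 does better.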

[Reference]

01. http://en.wikipedia.org/wiki/Long_Tail
02. http://en.wikipedia.org/wiki/Power_law
03. http://en.wikipedia.org/wiki/Zipf%27s_Law
04. http://en.wikipedia.org/wiki/Pareto_distribution

11. http://www.longtail.com/
12. http://longtail.typepad.com/
13. http://www.thelongtail.com/
14. The Long Tail @ Wired

21. Power Laws, Pareto Distributions and Zipf's Law
22. Zipf, Power-laws, and Pareto - a Ranking Tutorial
23. A Brief History of Power Law Distribution Research (in Chinese)

7/17/2010

Load Test: An Overview

Part I - VSTT Load Test


= Concepts =

Terms Compared with xUnit (xUnit/VSTT):
- Test Suite/Test Class
- Test Case/Test Method
- Test Runner/Test Runner
- Test Fixture Setup/Test Init
- Test Fixture Teardown/Test Cleanup
- Suite Fixture Setup/Class Init
- Suite Fixture Teardown/Class Cleanup
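
The same fixture vocabulary shows up in most xUnit-style frameworks. As an illustration in Python's unittest (a different framework from VSTT, used here only because it is easy to run), the sketch below records the order in which the hooks fire; the class name and the `calls` list are made up for the example.

```python
import io
import unittest

calls = []  # records the order in which the fixture hooks run


class AdditionTests(unittest.TestCase):        # Test Suite / Test Class
    @classmethod
    def setUpClass(cls):                       # Suite Fixture Setup / Class Init
        calls.append("class init")

    @classmethod
    def tearDownClass(cls):                    # Suite Fixture Teardown / Class Cleanup
        calls.append("class cleanup")

    def setUp(self):                           # Test Fixture Setup / Test Init
        calls.append("test init")

    def tearDown(self):                        # Test Fixture Teardown / Test Cleanup
        calls.append("test cleanup")

    def test_addition(self):                   # Test Case / Test Method
        calls.append("test method")
        self.assertEqual(1 + 1, 2)


def run_suite():
    """Run the class with a quiet runner (the Test Runner role) and return the result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(AdditionTests)
    runner = unittest.TextTestRunner(stream=io.StringIO(), verbosity=0)
    return runner.run(suite)


if __name__ == "__main__":
    run_suite()
    print(calls)
```

The class-level hooks wrap the whole class once, while the test-level hooks wrap every individual test method.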

The load simulation architecture consists of a Client, Controller, and Agents:
  • The Client is used to develop tests, select tests to run, and view test results.
  • The Controller is used to administer the agents and collect test results.
  • The Agents are used to run the tests.
  • The controller and agents are collectively called a Rig.
When conducting a load test, each Unit Test/Web Test is independent of the other tests and of the computer on which it is run.

When you create a load test, Visual Studio Team System Test Edition lets you specify a Counter Set. A counter set is a set of performance counters that are useful to monitor during a load test run.

During a load test run, the performance counter data is collected and stored in the Load Test Results Repository, which is a SQL database. The Load Test Results Repository contains performance counter data and any information about recorded errors.

= Practices =

1. Compose Test Cases

- 1.1 Install VSTT + Team Explorer (the latter is required for Web/Load Tests)
- 1.2 Write Web/Unit Test
- 1.3 Compose Load Test
The most complicated part may be specifying the load pattern.

2. Setting Up Test RIG

- 2.1 Install Load Test Controller [LoadAgent]
- 2.2 Install Load Test Agent
- 2.3 Admin Test RIG
== In the Visual Studio menu, click Test -> Administer Test Controller; you can then add, remove and configure test agents in the dialog box.

3. Run Test

- 3.1 Config Test [Specify Test Run Configuration, Setup Load Test on Rig]
- 3.2 Run Load Test [Conduct Test Using Mstest.exe]

4. Analyze Results

- 4.1 View a Test Run on a Rig: http://msdn.microsoft.com/en-us/library/ms243178.aspx
- 4.2 Monitoring and Analyzing a Load Test Result: http://msdn.microsoft.com/en-us/library/aa730850(VS.80).aspx

[VSTT Load Agent - http://msdn.microsoft.com/en-us/teamsystem/aa718815.aspx]
[Controller/Agent/Rig - http://msdn.microsoft.com/en-us/library/ms182634.aspx]

[Reference]
1. Report Visual Studio Team System Load Test Results Via A Configurable Web Site - http://msdn.microsoft.com/en-us/magazine/cc163592.aspx
2. Considerations for Large Load Tests - http://msdn.microsoft.com/en-us/library/ms404664.aspx
3. Real-World Load Testing Tips to Avoid Bottlenecks When Your Web App Goes Live - http://msdn.microsoft.com/en-us/magazine/cc188783.aspx


Part II - LoadRunner


LoadRunner Architecture:
http://test-soft.blogspot.com/2007/06/loadrunner-architecture.html

LoadRunner contains the following components:
- The Virtual User Generator captures end-user business processes and creates
an automated performance testing script, also known as a virtual user script.
- The Controller organizes, drives, manages, and monitors the load test.
- The Load Generators create the load by running virtual users.
- The Analysis helps you view, dissect, and compare the performance results.
- The Launcher provides a single point of access for all of the LoadRunner
components.

The Controller controls load test runs based on "Scenarios" invoking compiled "Scripts" and associated "Run-time Settings". Scripts are crafted using Mercury's "Virtual User script Generator" (named "V U Gen"), which generates C-language script code to be executed by virtual users by capturing network traffic between Internet application clients and servers.

At the end of each run, the Controller combines its monitoring logs with logs obtained from load generators, and makes them available to the "Analysis" program, which can then create run result reports and graphs for Microsoft Word, Crystal Reports, or an HTML webpage browser.


Part III - Other Tools

1. Web Application Stress Tool
2. Apache Benchmark

[Reference]

General:
1. http://en.wikipedia.org/wiki/Stress_testing_(software)
2. http://en.wikipedia.org/wiki/Portal:Software_Testing
3. http://en.wikipedia.org/wiki/Category:Load_testing_tools
4. http://en.wikipedia.org/wiki/Load_testing

7/07/2010

Mechanical Turk, Crowdsourcing and Amazon MTurk Service

I. Mechanical Turk

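The Mechanical Turk was an automatic chess-playing device built by an Austrian in the late eighteenth century. It looked like a fully mechanical, automated system, but it was in fact operated by a real person hidden behind its elaborate exterior. Because the device was so ingenious, few people saw through the hoax, and because its operators played good chess, it was quite popular in Europe for a while. The system used a human-powered implementation behind a mechanical, automated interface; in today's slang, it was a "human-powered chess robot".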

II. Crowd Sourcing

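Crowdsourcing is a term coined by Wired magazine editor Jeff Howe in his 2006 article "The Rise of Crowdsourcing" [3]. By his definition: Crowdsourcing is the act of taking a job traditionally performed by a designated agent (usually an employee) and outsourcing it to an undefined, generally large group of people in the form of an open call.

Compared with outsourcing in the usual sense:
1. The solution providers are a large, unknown crowd from the public rather than a small team or individual known in advance
2. When several candidate solutions exist, the final choice is also made by the crowd
In other words, the final solution depends entirely on the wisdom and strength of the crowd.

Jeff Howe attributes the rise of crowdsourcing mainly to two things:
1. Modern technology has lowered the barrier to specialized work and blurred the line between amateurs and professionals. What people are able to accomplish has become greater than ever
2. Internet technology has made communication between people fast and powerful

The advantages of crowdsourcing are:
1. Because the crowd is not limited to professionals, the final cost of a solution is lower
2. The submitter reaches more problem solvers, has a wider range of choices and more resources to draw on, and gets better results

But some have also pointed out a number of shortcomings of this model:
1. Too little is known about the solution providers and there is too little control over them, so the uncertainty is high
2. Legal norms and protection are lacking, so disputes are troublesome to resolve
3. Quality and continuity are hard to guarantee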

III. Amazon Mechanical Turk

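Inspired by these two ideas, Amazon launched an Internet service called Amazon Mechanical Turk.

Amazon's view is that although technology keeps making computer systems smarter, for the foreseeable future there will still be many data processing tasks that humans handle with a speed and quality no computer system can match. Such a task is called a HIT (Human Intelligence Task). Amazon MTurk aims to provide a fast, simple, cheap and scalable Internet-based solution for this class of problems (make accessing human intelligence simple, scalable, and cost-effective).

Common Human Intelligence Tasks: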
- Identify Object in Photo/Video
- Audio/Video Transcription

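These tasks share another trait: they can usually be split into a huge number of small tasks. In more technical terms, they are all data-parallel tasks.

In terms of technical implementation, Amazon MTurk is just a typical web-based distributed job system: a Requester submits task descriptions through the interface Amazon provides, Workers pick the items they are interested in from the task queue and return them to the system once processed, and the Requester then retrieves the results from the system.

This is essentially no different from the Basic Execution Service and JSDL work that OGF did a few years ago, except that the Workers are no longer machines that passively accept scheduling but fully autonomous humans. It is worth noting that, to improve result quality, Amazon MTurk lets a HIT Requester specify a qualification test, and only Workers who pass this test are eligible to work on the corresponding tasks.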
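
The Requester/Worker flow described above can be sketched as a tiny in-memory job system. This is a toy model in Python, not Amazon's actual API: the class `ToyHitMarket`, its method names and the word-counting task are all hypothetical and exist only to show the publish / qualify / work / fetch cycle.

```python
from collections import deque


class ToyHitMarket:
    """In-memory stand-in for a HIT marketplace (hypothetical, not Amazon's API)."""

    def __init__(self, test_question, test_answer):
        self.test_question = test_question
        self.test_answer = test_answer
        self.pending = deque()   # HITs waiting for a worker: (hit_id, question)
        self.results = {}        # hit_id -> answer submitted by a worker
        self.qualified = set()   # names of workers who passed the test
        self.next_id = 0

    # Requester side
    def publish(self, question):
        hit_id = self.next_id
        self.next_id += 1
        self.pending.append((hit_id, question))
        return hit_id

    def fetch_result(self, hit_id):
        return self.results.get(hit_id)

    # Worker side
    def take_test(self, worker_name, solve):
        passed = solve(self.test_question) == self.test_answer
        if passed:
            self.qualified.add(worker_name)
        return passed

    def work_one(self, worker_name, solve):
        if worker_name not in self.qualified:
            raise PermissionError(f"{worker_name} has not passed the qualification test")
        if not self.pending:
            return None
        hit_id, question = self.pending.popleft()
        self.results[hit_id] = solve(question)
        return hit_id


def count_words(text):      # a careful "worker"
    return len(text.split())


def always_zero(text):      # a careless "worker" that fails the test
    return 0


if __name__ == "__main__":
    market = ToyHitMarket("one two three", 3)
    ids = [market.publish(s) for s in ("a b", "c d e f")]
    print(market.take_test("alice", count_words), market.take_test("bob", always_zero))
    while market.work_one("alice", count_words) is not None:
        pass
    print([market.fetch_result(i) for i in ids])
```

In the real service the "solve" step is a person working in a browser and results arrive asynchronously; the sketch collapses that into a function call to keep the control flow visible.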

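From a technical point of view:
1. Amazon MTurk wraps human intelligence in a Web Service interface that is easy to integrate with larger information systems
2. It promotes the fusion of machine and human intelligence, a step closer to the Man-Computer Symbiosis that Licklider described
3. It is well suited to large-scale intelligence problems that are Data Parallel/Task Parallel

From a business and social point of view:
1. Amazon MTurk essentially lowers the cost of completing a large-scale intelligence task by sidestepping legal, human-rights and procedural risks, because you no longer need to sign contracts with the many temporary workers or attend to their working conditions and social welfare
2. The Requester holds all the initiative in judging results and making payment, which puts Requester and Worker on a completely unequal footing

Compared with the Crowdsourcing described earlier:
1. An Amazon MTurk Requester usually needs to be more involved: the Requester is responsible for splitting the task and for assembling and presenting the final result
2. What a Worker does is fairly mechanical and uniform, and usually requires no communication with other Workers
3. Crowdsourcing values the wisdom of the crowd, broad autonomy and the 1+1>2 effect, and demands more of its participants; MTurk often uses only the basic functions of the human brain and usually demands little of its participants
4. Crowdsourcing suits relatively complex tasks, while MTurk suits tasks that are relatively simple but large in scale
5. In Crowdsourcing, money is often not the participants' main motivation; a sense of achievement and community recognition can also motivate people to take part. In MTurk, money is usually the only incentive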

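As for why Amazon MTurk has not seen sustained growth in the years since its launch, I see a few main reasons:
1. Suitable Workers are not yet widely available
- MTurk Workers are, bluntly put, free and cheap labor on the Internet, but Internet access worldwide has not yet spread far enough for the cheapest labor to get online freely and conveniently
- The cheapest labor on the Internet is often not in English-speaking regions, yet the HITs that Requesters currently submit are all described in English, and their content often depends on English ability too
2. Automating result verification is still a problem
- A Requester needs to check whether a result meets expectations before paying for it. But the result is the product of human intelligence, and it is hard for a computer to judge it automatically. Making fair and objective judgments automatically and at scale seems to be yet another HIT
3. Requesters and Workers lack a good interaction model and oversight
- The Requester holds too much of the initiative in a HIT transaction, and the Worker is in an unfair position whose legitimate interests are not properly protected, which dampens Workers' enthusiasm
- Amazon could well bring the interaction and oversight model from e-commerce into the MTurk service to keep transactions fair and just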

[Reference]
1. http://en.wikipedia.org/wiki/The_Turk
2. http://en.wikipedia.org/wiki/Crowdsourcing
3. Wired: The Rise of Crowdsourcing (Chinese translation: 1, 2, 3, 4)

4. http://en.wikipedia.org/wiki/Amazon_Mechanical_Turk
5. Amazon MTurk Home
6. Amazon MTurk Web Service

6/07/2010

Hashing Without Collision

Today's Internet-scale web data processing involves many data lookup operations, for example mapping a URL to a unique integer index value. It's essentially a basic hashing problem, except that the data and operation scale is very large. Usually the hash function is required to produce no collisions at all, which is called "Perfect Hashing".

A perfect hash function maps a static set of n keys into a set of m integer numbers without collisions, where m is greater than or equal to n. If m is equal to n, the function is called minimal (perfect hash function).
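
To make the definition concrete, here is a minimal Python sketch that finds a minimal perfect hash for a tiny static key set by brute force: it tries seeds until a seeded hash sends every key to a distinct slot. This is only practical for a handful of keys and is not how CMPH or gperf work; the function names, the SHA-256-based hash and the sample URLs are illustrative assumptions.

```python
import hashlib


def slot(key, seed, m):
    """Seeded hash of `key` into the range 0..m-1."""
    digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % m


def find_perfect_seed(keys, m=None, max_tries=100_000):
    """Return a seed for which `slot` is collision-free on `keys` (m defaults to n)."""
    keys = list(keys)
    n = len(keys)
    if len(set(keys)) != n:
        raise ValueError("keys must be unique")
    m = n if m is None else m
    if m < n:
        raise ValueError("m must be at least the number of keys")
    for seed in range(max_tries):
        if len({slot(key, seed, m) for key in keys}) == n:
            return seed
    raise RuntimeError("no collision-free seed found; increase m or max_tries")


if __name__ == "__main__":
    urls = [
        "http://example.com/",
        "http://example.com/a",
        "http://example.com/b",
        "http://example.org/",
        "http://example.net/x",
        "http://example.net/y",
    ]
    seed = find_perfect_seed(urls)          # m == n, so the result is minimal
    for url in urls:
        print(slot(url, seed, len(urls)), url)
```

The expected number of seeds to try grows roughly like n^n / n!, which is why large key sets need the specialized algorithms listed below.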

Constructing such a perfect hash function from a large set of unique keys is not a simple task, but there are many algorithms and implementations for it.

Algorithms:

1. Perfect Hashing for Data Management Applications
2. External Perfect Hashing for Very Large Key Sets
3. Gperf: A Perfect Hash Function Generator
4. Monotone Minimal Perfect Hashing

Implementations:

C Minimal Perfect Hashing Library:
http://cmph.sourceforge.net/

Gperf:
http://www.gnu.org/software/gperf/

Minimal Perfect Hashing by Bob Jenkins:
http://burtleburtle.net/bob/hash/perfect.html

General Purpose Hash Function Algorithms:
http://www.partow.net/programming/hashfunctions

These solutions require a static key set known in advance. Sometimes this is not the case, and the function has to be constructed dynamically. Dynamic perfect hashing is somewhat complicated; an alternative is to use Cuckoo Hashing.

The basic idea is to use two hash functions (h1 and h2) to determine the candidate positions of a key. It's not a perfect hash function, because there is a chance that two distinct keys (k1 and k2) have the same hash values under both h1 and h2 (i.e. h1(k1) == h1(k2) and h2(k1) == h2(k2)). But the probability is quite low, and a lookup needs only two probe operations in the worst case.
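
A minimal Python sketch of the idea follows. It assumes two tables of equal size, derives h1 and h2 from Python's built-in `hash` with a salt, and simply grows and re-salts the tables when an insertion runs into too many evictions; the class name and parameters are illustrative, and real implementations use better hash families and resize policies.

```python
class CuckooHashTable:
    """Two tables, two hash functions; a lookup probes at most two slots."""

    def __init__(self, capacity=11, max_kicks=32):
        self._capacity = capacity
        self._max_kicks = max_kicks
        self._salt = 0
        self._tables = [[None] * capacity, [None] * capacity]

    def _index(self, which, key):
        # h1 (which == 0) and h2 (which == 1), both derived from the built-in hash.
        return hash((self._salt, which, key)) % self._capacity

    def get(self, key, default=None):
        for which in (0, 1):
            entry = self._tables[which][self._index(which, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return default

    def put(self, key, value):
        # Overwrite in place if the key is already stored.
        for which in (0, 1):
            i = self._index(which, key)
            entry = self._tables[which][i]
            if entry is not None and entry[0] == key:
                self._tables[which][i] = (key, value)
                return
        entry = (key, value)
        which = 0
        for _ in range(self._max_kicks):
            i = self._index(which, entry[0])
            # Place the entry and pick up whatever was sitting in the slot.
            entry, self._tables[which][i] = self._tables[which][i], entry
            if entry is None:
                return
            which = 1 - which      # the evicted entry moves to its other table
        self._rehash(entry)        # probable cycle: rebuild with new hash functions

    def _rehash(self, pending):
        entries = [e for table in self._tables for e in table if e is not None]
        entries.append(pending)
        self._capacity = self._capacity * 2 + 1
        self._salt += 1
        self._tables = [[None] * self._capacity, [None] * self._capacity]
        for key, value in entries:
            self.put(key, value)


if __name__ == "__main__":
    table = CuckooHashTable(capacity=5)
    for i in range(50):
        table.put(f"http://example.com/{i}", i)
    print(table.get("http://example.com/7"), table.get("missing"))
```

Insertion may move several entries around, but `get` never looks at more than two slots, which is the property the paragraph above refers to.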

Here are some tutorials about this algorithm:

Cuckoo Hashing for Undergraduate
Cuckoo Hashing - Theory and Practice (Part 1, Part 2 and Part 3)

Paper:

1. Initial paper about Cuckoo Hashing
2. Another version of the paper

4/21/2010

Mainstream Network Library Overview

Part I - .Net Networking Briefing

All .Net networking facilities are built on top of the Winsock2 subsystem, so concepts such as non-blocking I/O, async I/O and I/O multiplexing all apply to the managed world. The .Net network framework just wraps Winsock2 and adds some extra abstraction.

Most of the wrapping is done through the System.Net.Sockets.Socket class. If you are familiar with Winsock, you will feel comfortable with this class. Here I will focus on the extra (OO-style) abstraction that the .Net library team has added on top of this class.

There are 3 abstraction layers for .Net network programming:

- Raw Socket Layer
In this model, you work directly with the System.Net.Sockets.Socket class, which is just a lightweight OO and P/Invoke wrapper around native Winsock2 functionality. You can do anything Winsock2 can do using methods of this class; it is almost a method-to-method mapping. For more info, see .Net Network Programming Using Raw Socket.

- Tcp/Udp Layer
The .Net network framework provides three extra classes, TcpClient, TcpListener and UdpClient, to ease the development of Tcp/Udp applications. They are built on top of the raw socket facilities and, with the help of the NetworkStream class, hide details such as connection management, data buffering and parsing. More info about the Tcp/Udp model, and more specifically about .Net Tcp Programming and .Net Udp Programming

- Request/Response Layer
It's also known as the pluggable protocol layer, in which all communication follows a simple request/response pattern and uses standard URIs to represent Internet resources. This abstraction is built on top of the Raw Socket and Tcp/Udp layers, and it is extensible so you can customize it. If you only need to work in the web world, this is a great facility. More info about programming pluggable protocols, FTP programming and HTTP programming

Source code for example client/server application using .Net network framework.

BTW, the blocking/non-blocking/async models and the three layers above are orthogonal and can be combined arbitrarily. For example, TcpClient/TcpListener can be used in both a blocking and an async style.

[Reference]

1. Network Programming Tutorial
2. Network Programming Sample
3. Core NameSpaces:
- System.Net
- System.Net.Sockets
4. Good .Net Networking Books
- Tcp/Ip Sockets in C#
- C# Network Programming

Part II - Java Networking Briefing

Java has a totally different network library taxonomy.

1. Abstraction Layers

Java doesn't have a real raw socket facility comparable to the BSD socket and Winsock2 interfaces. But it does contain a class named Socket that is used for network communication. Its network library can be roughly divided into a Tcp/Udp layer and a URL layer.

- Tcp/Udp Layer
You deal with Tcp/Udp concepts directly when coding at this layer. Java provides 3 core classes: Socket/ServerSocket (connection-oriented, similar to .Net's TcpClient/TcpListener) and DatagramSocket (connectionless and message-oriented, similar to .Net's UdpClient). Socket exposes I/O streams for reading and writing, while DatagramSocket sends and receives DatagramPacket objects.

- URL Layer
Similar to .Net's pluggable protocol layer, it uses URLs to identify network resources and provides convenient methods to talk to them. URL and URLConnection are the core classes.

2. Network I/O Model

For the network I/O model, the JDK has provided blocking I/O in the java.net package since Java 1.0. Multiplexed (a.k.a. non-blocking) I/O has been available since Java 1.4 and async I/O is added in Java 7, both in the java.nio packages.

Multiplexing I/O consists of three core concepts: SelectableChannel, Selector and SelectionKey. Common practice with Java NIO suggests that a network server is composed of an acceptor that accepts client connections, a dispatcher that monitors readiness events and dispatches them to application handlers, and application handlers backed by a worker thread pool, which run the application logic and generate response data for the client.
I have written a time echo server using Java NIO and the architecture above.
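
The acceptor/dispatcher/handler split is not specific to Java. As a rough illustration, here is the same shape in Python using the standard `selectors` module, where the selector plays the Selector role and the registered callback stands in for the attachment on a SelectionKey. The class name `EchoReactor` is made up, the handler runs inline instead of on a worker thread pool, and error handling is omitted to keep the sketch short.

```python
import selectors
import socket


class EchoReactor:
    """Single-threaded sketch: one selector, an acceptor and an echo handler."""

    def __init__(self, host="127.0.0.1", port=0):
        self.selector = selectors.DefaultSelector()
        self.listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.listener.bind((host, port))   # port 0 lets the OS pick a free port
        self.listener.listen()
        self.listener.setblocking(False)
        # Acceptor: the listening socket becomes readable when a client is waiting.
        self.selector.register(self.listener, selectors.EVENT_READ, self.accept)

    @property
    def address(self):
        return self.listener.getsockname()

    def accept(self, listener):
        conn, _peer = listener.accept()
        conn.setblocking(False)
        self.selector.register(conn, selectors.EVENT_READ, self.echo)

    def echo(self, conn):
        # Application handler: a real server would hand this to a worker pool.
        data = conn.recv(4096)
        if data:
            conn.sendall(data)             # acceptable for small payloads in a sketch
        else:                              # empty read: the peer closed the connection
            self.selector.unregister(conn)
            conn.close()

    def poll_once(self, timeout=1.0):
        """Dispatcher: wait for readiness events and call the registered handler."""
        events = self.selector.select(timeout)
        for key, _mask in events:
            key.data(key.fileobj)
        return len(events)

    def close(self):
        for key in list(self.selector.get_map().values()):
            self.selector.unregister(key.fileobj)
            key.fileobj.close()
        self.selector.close()


if __name__ == "__main__":
    reactor = EchoReactor()
    client = socket.create_connection(reactor.address, timeout=5)
    reactor.poll_once()                    # readiness on the listener: accept
    client.sendall(b"ping")
    reactor.poll_once()                    # readiness on the connection: echo
    print(client.recv(4096))
    client.close()
    reactor.close()
```

A production server would loop on `poll_once` forever and also watch for write readiness instead of calling `sendall` on a non-blocking socket.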

Java Asynchronous I/O (part of NIO.2) is relatively new and not officially released yet. But its basic idea is very simple: you issue the I/O request without blocking, and the system notifies you of the completion of the request by means of one of the following:
- a callback function that the system calls
- a wait handle that you wait on
- a flag that you poll
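
The three notification styles can be tried out with any future-like object. The sketch below uses Python's `concurrent.futures` rather than Java NIO.2, and a sleeping function stands in for a real I/O request; `fake_io` and `demo` are illustrative names.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io(payload, delay=0.05):
    """Stand-in for a slow I/O request."""
    time.sleep(delay)
    return payload.upper()


def demo():
    seen = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        # 1. Callback: the system calls us back when the request completes.
        f1 = pool.submit(fake_io, "callback")
        f1.add_done_callback(lambda fut: seen.append(fut.result()))

        # 2. Wait handle: block on the handle until the result is ready.
        f2 = pool.submit(fake_io, "wait")
        waited = f2.result(timeout=5)

        # 3. Polling: keep checking a completion flag between other work.
        f3 = pool.submit(fake_io, "poll")
        while not f3.done():
            time.sleep(0.01)
        polled = f3.result()
    # Leaving the with-block waits for all submitted work, including f1.
    return seen, waited, polled


if __name__ == "__main__":
    print(demo())
```

In all three cases the call that issues the request (`submit` here) returns immediately; only the way completion is observed differs.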

[Reference]

1. Official Network Tutorial
2. Java Scalable I/O
3. NIO based Server Architecture

Packages:
- java.net
- java.nio

Good Java Networking Books:
- Tcp/Ip Sockets in Java 2E
- Java Network Programming 3E

Other Java Network Frameworks:
- Mina from Apache
- Netty from JBoss
- Grizzly from SUN
- xSocket @ SF
[comparison report]


Part III - C/C++ Network Frameworks

1. LibEvent - http://monkey.org/~provos/libevent

- Libevent is an event notification library not only for network operations but for almost all types of i/o operations. The libevent API provides a mechanism to execute a callback function when a specific event occurs on a file descriptor or after a timeout has been reached.
- It's essentially a Reactor design pattern, in which readiness event is triggered and notified.

- The document file for LibEvent 2.0
- The Sample Code Using LibEvent 2.0

2. APR - http://apr.apache.org/

- APR provides a platform-independent interface to OS functionality, which includes some basic network I/O functions. But the interface is rather basic; only blocking I/O is supported. You can see the list of functions in the APR 1.4 Network I/O Doc.

3. BOOST.ASIO - http://think-async.com/

Boost ASIO is a scalable network framework using the Proactor design pattern (which essentially uses a completion notification mechanism, a.k.a. Async Network I/O).

Here is an in-depth document about Boost.Asio.

4. ACE - http://www.cs.wustl.edu/~schmidt/ACE.html

- It's a heavyweight OO communication framework, but it goes beyond just inter-process communication. From a high-level point of view, it contains three layers:

- OS Adapter Layer: it abstracts the underlying native OS's services using raw C wrappers and provides a unified interface to upper-layer components and application developers. The OS services include:

  .Concurrency and Synchronization
  .Inter-Process Communication
  .Event Demultiplexing
  .File System I/O

- OO Facade Layer: it wraps the OS adapter layer in an OO style, more specifically in C++. Its functionality is the same as the adapter layer's.

- Communication Framework Layer: it integrates and enhances the lower-level C++ wrapper facades and supports the dynamic configuration of concurrent distributed services into applications. The framework contains the following components:
  .Event Demultiplexing - The Reactor and Proactor are extensible, object-oriented demultiplexers.
  .Service Initialization - The Acceptor and Connector components represent the passive and active initialization roles, respectively.
  .Service Configuration - The Service Configurator supports the configuration of applications dynamically.

ACE is very heavyweight and contains many powerful components. It is also a pioneer of concurrent and communication software architecture, and the source of many famous design patterns in this area, for example Reactor and Proactor.

5. ICE - http://www.zeroc.com/ice.html

ICE is a language independent RPC solution and supports Java/C#/C++/Python. It uses an IDL language called Slice(Specification Language for Ice) to define client/server interface.

The overall high level design of ICE can be found at: A New Approach to Object-Oriented Middleware

The ICE website also contains two other great articles:
- The Rise and Fall of CORBA
- API Design Matters

2/20/2010

Essential English Words for Technology Evangelist

Here I summarize some of the most used (and most likely misused) technical English words in my daily activities. Punctuation and number terms are also listed for further reference.

Part I - Paired Words

Homogeneous/Heterogeneous
- Homogeneous[ˌho-mo-ˈjē-nē-əs]: of the same or a similar kind or nature
- Heterogeneous[ˌhe-tə-rə-ˈjē-nē-əs]: consisting of dissimilar elements or parts

Synchronous/Asynchronous
- Synchronous[ˈsiŋ-krə-nəs]:of, used in, or being digital communication (as between computers) in which a common timing signal is established that dictates when individual bits can be transmitted and which allows for very high rates of data transfer
- Asynchronous[ā-ˈsiŋ-krə-nəs]:of, used in, or being digital communication (as between computers) in which there is no timing requirement for transmission and in which the start of each character is individually signaled by the transmitting device

Deterministic/Non-deterministic
- Deterministic[-ˌtər-mə-ˈnis-tik]: of or relating to determinism, i.e. a : a theory or doctrine that acts of the will, occurrences in nature, or social or psychological phenomena are causally determined by preceding events or natural laws b : a belief in predestination
- Non-deterministic: the opposite of deterministic

Intuitive/Heuristic
- Intuitive[in-ˈtü-ə-tiv]: known or perceived without evident rational thought and inference
- Heuristic[hy-ˈris-tik]: involving or serving as an aid to learning, discovery, or problem-solving by experimental and especially trial-and-error methods

Ambiguous/Ambitious
- Ambiguous[am-ˈbi-gyə-wəs]: capable of being understood in two or more possible senses or ways
- Ambitious[am-ˈbi-shəs]: having a desire to achieve a particular goal

Linear/Exponential/Polynomial/Order of Magnitude
- Linear[ˈli-nē-ər]:of the first degree with respect to one or more variables(线性的)
- Polynomial[ˌpä-lə-ˈnō-mē-əl]:a mathematical expression of one or more algebraic terms, each of which consists of a constant multiplied by one or more variables raised to a non-negative integral power (as a + bx + cx^2)(多项式的)
- Exponential[ˌek-spə-ˈnen-chəl]:expressible or approximately expressible by an exponential function; especially : characterized by or being an extremely rapid increase(指数级的)
- Order of Magnitude:a range of magnitude extending from some value to ten times that value(数量级)

Supplementary/Complementary
- Supplementary[sə-plə-ˈmen-tə-rē]: added or serving as something that completes or makes an addition(增加的,额外的)
- Complementary[ˌkäm-plə-ˈmen-t(ə-)rē]:mutually supplying each other's lack(互补的,补缺的)

Mandatory/Canonical
- Mandatory[ˈman-də-ˌtȯr-ē]:containing or constituting a command(强制的)
- Canonical[-ni-kəl]:conforming to a general rule or acceptable procedure(规范的)

Preliminary/Elementary
- Preliminary[pri'liminəri]: something that serves as a preceding event or introduces what follows(初步的)
- Elementary[.elə'mentəri]: easy and not involved or complicated(基本的)

Involve/Evolve/Evaluate/Revolution
- Involve: to engage as a participant
- Evolve: undergo development or evolution
- Evaluate: place a value on; judge the worth of something
- Revolution: a drastic and far-reaching change in ways of thinking and behaving

Theorem/Law/Axiom/Lemma
- Theorem['θiərəm]: an idea accepted as a demonstrable truth(定理)
- Law: the collection of rules imposed by authority(法则)/a generalization that describes recurring facts or events in nature(定律)
- Axiom['æksiəm]: a proposition that is not susceptible of proof or disproof; its truth is assumed to be self-evident(公理)
- Lemma['lemə]: a subsidiary proposition that is assumed to be true in order to prove another proposition(引理)

Principle/Principal
- Principle['prɪnsəp(ə)l]: a rule or standard especially of good behavior
- Principal['prɪnsəp(ə)l]: main, or most important/any one of the most significant participants in an event or a situation

Personal/Personnel
- Personal['pɜrsən(ə)l]: relating to a specific person rather than anyone else/the personal column in a newspaper or magazine
- Personnel[.pɜrsə'nel]: the people employed in an organization, business, or armed force

Illegible/Eligible/Tangible
- Illegible[ɪ'ledʒəb(ə)l]: impossible or very difficult to read
- Eligible['elɪdʒəb(ə)l]: entitled or qualified to do, be, or get something
- Tangible['tændʒəb(ə)l]: capable of being given a physical existence

Paradigm/Diagram/Chart
- Paradigm: an example that serves as a pattern or model for something, especially one that forms the basis of
- Diagram: a simple drawing showing the basic shape, layout, or workings of something
- Chart: a diagram or table displaying detailed information

Compliant/Complement/Compliment
- Compliant: designed to follow a particular law, system, or set of instructions
- Complement: to complete, perfect, or go well with something else
- Compliment: to say something nice to or about someone

Consecutive/Continuous
- Consecutive[kən'sekjətɪv]: following one after another, without interruption, successive
- Continuous[kən'tɪnjuəs]: uninterrupted in time, sequence, substance or extent

In a Nutshell/Cookbook
- In a Nutshell: in very few words, getting right to the main point
- Cookbook: a book containing detailed directions for a process of any kind

Nuts and Bolts/Pros and Cons
- Nuts and Bolts: the most basic components, elements, or constituents of something
- Pros and Cons: the advantages and disadvantages of something

Consensus/Quorum
- Consensus[kən'sensəs]: agreement among all the people involved(一致意见)
- Quorum['kwɔrəm]: the smallest number of people who must be present at a meeting to allow official decisions to be made(法定人数)

Empirical/Theoretical
- Empirical: based on real experience or scientific experiments rather than on theory
- Theoretical: based on theories or ideas instead of on practical experience

Systematic/Ad hoc
- Systematic: carried out in a methodical and organized manner
- Ad hoc: done only when needed for a specific purpose, without planning or preparation

Authentication/Authorization
- Authentication: the act of proving or showing that something is real and not false or copied
- Authorization: the process which aims to confirm that somebody has permission to do something or be somewhere

Orthogonal/Causal
- Orthogonal[ɔ:'θɔgənl]: statistically unrelated (正交的,无关的)
- Causal['kɔz(ə)l]:involving or constituting a cause; causing(因果相关的)

Part II - Single Words

Prune
- to remove something considered unnecessary or unwanted

Dilemma
- a situation in which you have to make a difficult decision

Spin Off
- a product made during the manufacture of something else
- a company "splits off" sections of itself as a separate business

De Facto
- acting or existing in fact but without legal sanction

Part III - Abbreviation/省略词组

- VS Versus
- I.E. Id Est, that is to say
- E.G. exempli gratia, for example
- A.K.A. also known as

Part IV - Punctuation/标点符号

, Comma/逗号
; Semicolon/分号
: Colon/冒号
. Dot/Period/句号
" Quote/引号
- Dash/Hyphen/连字符号
_ Underscore/下划线
... ellipsis/省略号
/ Slash/斜线
() Parentheses/圆括号
[] Bracket/方括号
{} Brace/花括号
<> Angle Bracket/尖括号
| Vertical Bar/单竖线
‖ Parallel/双线号
* Asterisk/Star/星号
& Ampersand/and符号
^ Caret/脱字符
% Per Cent/百分号
‰ Per Mill/千分号
$ Dollar/美元符号
# Pound/井字号
@ At/地址符号
~ Tilde/Swung Dash/波浪号/代字符号
` Grave accent/Back Quote/重音符号/反引号
+ Plus/正
- Minus/负
x Multiply/乘
÷ Divide/除
∞ Infinity/无限
∵ Since/Because/因为
∴ Hence/所以
∷ Equals/等于
∪ Union/并
∩ Intersection/ 交
∫ Integral/积分
∑ Sigma/Summation/总和
℃ Celsius System/摄氏度
§ Section/Division/分节号

Part V - Number

10^1 - ten
10^2 - hundred
10^3 - thousand/kilo-
10^6 - million/mega-
10^9 - billion/giga-
10^12 - trillion/tera-
10^15 - quadrillion/peta-
10^18 - quintillion/exa-

10^-01 - tenth
10^-02 - hundredth
10^-03 - thousandth/milli-
10^-06 - millionth/micro-
10^-09 - billionth/nano-
10^-12 - trillionth/pico-
10^-15 - quadrillionth/femto-
10^-18 - quintillionth/atto-

semi-/半
uni-/一
bi-/二
tri-/三
quad-/四
penta-/五
hex-/六
hept-/七
octa-/八
nona-/九
dec-/十