Parallel Programming - Using OpenMP

OpenMP is a parallel programming model for shared memory parallel computers.

It is based on the fork-join parallel execution pattern and is suitable for both data-parallel and task-parallel applications.

Fork-Join Pattern
- OpenMP programs begin as a single thread - the master thread - which executes sequentially until the first parallel region construct is encountered.
- Fork: the master thread then creates a team of concurrent threads, which execute the user-provided code.
- Join: when the team threads complete, they synchronize with one another at a barrier and then terminate, leaving only the master thread running.

Fork/Join Pattern in OpenMP (from [1])

Work Sharing Constructs
The core functionality of OpenMP is parallel processing of data and execution tasks, i.e., sharing the workload. It provides several constructs to support this:
- for: OpenMP automatically divides the (independent) loop iterations and assigns each to one of the team threads to execute.
- sections: the programmer defines static code sections, each of which is assigned (in parallel) to one of the team threads to execute.
- task: data and code can be (dynamically) packaged as a task and delivered to a team thread for execution.

OpenMP is designed for Fortran and C/C++. Its functionality is exposed in the following forms:
- New Construct as Language Directive
- APIs as runtime library
- Environment Variables

Currently, Visual Studio 2008 supports OpenMP 2.5; see OpenMP@MSDN.

To use OpenMP in VS2008 C++ development, you only need to include the omp.h header and enable the compiler flag /openmp (Project Properties -> C/C++ -> Language -> OpenMP Support).

More detailed tutorials can be found at [4][5].

I have written some OpenMP sample applications; they compile with VS2008 (except the task example).

[1] http://en.wikipedia.org/wiki/OpenMP
[2] http://openmp.org/wp/

[3] Introduction to Parallel Programming
[4] OpenMP tutorial at LLNL
[5] OpenMP hands-on Tutorial at SC08

[6] Parallel Programming with OpenMP and MPI
[7] Blog on OpenMP programming
[8] Intel on OpenMP traps
[9] Parallel Programming Model Comparison
[10] Microsoft on OpenMP version in Visual Studio
[11] More OpenMP sample applications


Parallel Programming - Using POSIX Threads

Pthreads (a.k.a. POSIX Threads) is another parallel programming model for shared memory computers; it belongs to the threads-based category of models (the other being the message-passing category).

Pthreads provides threads by means of pure C-style APIs, while OpenMP does so through compiler directives.

As the process/thread concepts are very popular and well understood in today's developer community, I will skip the basic explanation.

Pthreads APIs can be divided into the following categories:
- Thread Management (Create, Destroy, Join, Cancellation, Scheduling, Thread Specific Data and all related attributes)
- Thread Synchronization (Mutex, Conditional Variable, Barrier, Reader Writer Lock, Spin Lock and all related attributes)

Pthreads is an international standard and is well supported in the *nix world. Microsoft Windows has its own threading interface, but the well-known open source project Pthreads-Win32 provides a Pthreads implementation for the Windows platform.

Multithreaded programming is a very broad topic; this post aims only to give a brief introduction. [3] and [5] are very good hands-on tutorials on Pthreads programming.

[1] http://en.wikipedia.org/wiki/POSIX_Threads
[2] http://sourceware.org/pthreads-win32/

[3] Pthreads Tutorial by LLNL
[4] Pthreads Tutorial by IBM
[5] Pthreads Hands-On Tutorial
[6] Pthreads Tutorial for Linux Platform
[7] Pthreads Info Center


Parallel Computing - An Introduction

Parallel Computing is a form of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently ("in parallel"). The core motivation for parallelization is speeding up computing tasks.

1. Types of Parallelism

There are various forms of parallel processing:
- Bit-level parallelism: each bit of a word is processed at the same time
- Instruction-level parallelism: multiple instructions are executed simultaneously
- Data parallelism (a.k.a. loop-level parallelism): focuses on distributing the data across different parallel processing units
- Task parallelism (a.k.a. functional parallelism): focuses on distributing execution tasks (code + data) across different parallel processing units

2. Types of Parallel Computer

Using Flynn's Taxonomy, computer architecture can be divided into:
- Single Instruction, Single Data stream (SISD)
- Single Instruction, Multiple Data streams (SIMD)
- Multiple Instruction, Single Data stream (MISD)
- Multiple Instruction, Multiple Data streams (MIMD)

Today's parallel computers are practically all of the MIMD type. At a coarser granularity, parallel programs can be further divided into:
- SPMD: Single Program, Multiple Data
- MPMD: Multiple Program, Multiple Data

According to memory architecture, parallel computers can be divided into:

- Shared Memory Computer

In this kind of computer, all processing nodes share the same global memory address space. Programming these computers can be as easy as programming a multicore workstation.

Shared memory computers are easy to program, but since a bus is shared among all processing nodes, the scale is limited (usually to several tens of nodes): bus contention becomes the bottleneck as the scale grows.

Shared memory computer can be further divided into two kinds:
- UMA/cc-UMA: all processing nodes share the same physical memory device through a bus
- NUMA/cc-NUMA: each processing node has local physical memory that is also accessible by other nodes, but the access time depends on the memory location relative to the node

Uniform Memory Access (from [1])

Non-Uniform Memory Access (from [1])

- Distributed Memory Computer

In this kind of computer, each node has its own private memory address space and can't access other nodes' memory directly. Usually, processing nodes are connected by some kind of interconnection network.

Distributed memory computers can scale very large since no bus contention occurs, but it is more complicated to write programs for them.

Distributed Memory (from [1])

- Distributed Shared Memory Computer

In this kind of computer, the hardware architecture is usually the same as a distributed memory system, but the interface presented to application developers is the same as a shared memory system.

DSM is usually implemented as a software extension to the OS, which incurs some performance penalty.

3. Parallel Programming Model

With these powerful computers in hand, how do we program them?

3.1 Conceptually, there are two models for writing parallel programs.

Threads Model

There are two well-known API interfaces for multi-threading:
- OpenMP (Open Multi-Processing), which uses compiler directives, environment variables and a runtime library to provide threading support
- Pthreads (POSIX Threads), which supports threads by means of a library only.

A Process with two Threads (from Wikipedia)

Message Passing Model

In this model, a set of tasks use their local memory for computation; communication among these tasks is conducted by sending and receiving network messages.

There are two standards:
- MPI (Message Passing Interface)
- PVM (Parallel Virtual Machine)

Typical message passing patterns are listed below:

Collective communications examples

Message Passing Pattern (from [1])

3.2 Other factors to Consider when Designing Parallel Programs

a. Problem Decomposition/Partitioning
- Data Partitioning
- Functional Partitioning

b. Communication Considerations [1]
- Latency: the time it takes to send a minimal (0 byte) message from point A to point B.
- Bandwidth: the amount of data that can be communicated per unit of time.
- Asynchronous vs. synchronous communication.
- Point-to-point: involves two tasks, with one task acting as the sender/producer of data and the other acting as the receiver/consumer.
- Collective: involves data sharing among more than two tasks, which are often specified as members of a common group, or collective.

c. Load Balancing

Load balancing refers to the practice of distributing work among tasks so that all tasks are kept busy all of the time.

How? [1]
- Partition the workload equally
- Use dynamic work assignment

d. Granularity [1]

Granularity is measured by the computation/communication ratio, because periods of computation are typically separated from periods of communication by synchronization events.

- Coarse-grained parallelism: relatively large amounts of computational work are done between communication/synchronization events
- Fine-grained parallelism: relatively small amounts of computational work are done between communication events


[1] Parallel Programming Tutorial by LLNL
[2] Parallel Programming Models and Paradigms

[3] Book - Designing and Building Parallel Programs
[4] Book - Introduction to Parallel Computing 2E
[5] Book - Parallel and Distributed Programming Using C++
[6] Book - Patterns for Parallel Programming


Debugging Facilities on Windows Platform

Part I - System/Application Error Collection Tools

These tools are used to collect software data (especially when errors occur) that can be used to identify and fix software defects.

1. Dr. Watson

"Dr. Watson for Windows is a program error debugger that gathers information about your computer when an error (or user-mode fault) occurs with a program. The information obtained and logged by Dr. Watson is the information needed by technical support personnel to diagnose a program error for a computer running Windows."
A text file (Drwtsn32.log) is created whenever an error is detected, and can be delivered to support personnel by the method they prefer. A crash dump file can also be created, which is a binary file that a programmer can load into a debugger.

Starting with Windows Vista, Dr. Watson has been replaced with "Problem Reports and Solutions".

2. Windows Error Report

While Dr. Watson leaves the memory dump on the user's local machine for debugging, Windows Error Reporting offers to send the memory dump to Microsoft over the Internet. More info can be found at http://en.wikipedia.org/wiki/Windows_Error_Reporting

3. Adplus
ADPlus is a console-based Microsoft Visual Basic script. It automates the Microsoft CDB debugger to produce memory dumps and log files that contain debug output from one or more processes. It has many switches to control what data to collect; more info can be found in the Microsoft KB on the ADPlus tool.

[Reference]
- Wiki on Dr. Watson Debugger
- Description of Dr. Watson Tool

Part II - Structured Exception Handling

SEH is usually known as a convenient error handling mechanism for Windows native code programmers, provided by the Windows operating system itself (with compiler support). But it is also a great mechanism for letting applications talk to software debuggers.

1. Various Concepts

Structured exception handling is a mechanism for handling both hardware and software exceptions. To fully understand the SEH mechanism, you should become familiar with the following concepts:
- guarded body of code
- exception handler
- termination handler
- filter expression: follows the __except keyword and is evaluated when the system conducts exception processing
- filter function: can only be called in a filter expression

A filter expression or filter function can only return one of the following three values:
- EXCEPTION_CONTINUE_SEARCH: the system continues its search for an exception handler.
- EXCEPTION_CONTINUE_EXECUTION: the system stops its search for an exception handler, restores the machine state, and continues thread execution at the point at which the exception occurred.
- EXCEPTION_EXECUTE_HANDLER: the system transfers control to the exception handler, and thread execution continues sequentially in the stack frame in which the exception handler is found.
2. Stack Unwinding

If the located exception handler is not in the stack frame in which the exception occurred, the system unwinds the stack, leaving the current stack frame and any other stack frames, until it is back at the exception handler's stack frame.

3. Vectored Exception Handling

Vectored handlers are called in the order in which they were added, after the debugger gets a first-chance notification but before the system begins unwinding the stack. Since they are not frame-based, they are called each time an exception is raised.
4. Exception & Debugger

SEH is also a communication mechanism between Windows applications and debuggers. A detailed description of the whole exception dispatching process can be found here, and of the debugger exception handling process here.
The core concepts here are the first-chance notification and the second-chance (last-chance) notification.
- A first-chance notification notifies the debugger of the exception information before the application gets a chance to process the exception.
- A second-chance notification happens after the Windows system finds that no application-defined exception handler exists.

5. Functions and Keywords
- GetExceptionCode and GetExceptionInformation can be used to get detailed information about the current exception.
- An SEH-compatible compiler recognizes __try, __except, __finally, and __leave as keywords.
- It also interprets the GetExceptionCode, GetExceptionInformation, and AbnormalTermination functions as keywords; their use outside the appropriate exception-handling syntax generates a compiler error.
Part III - Dump File

In Part I, we introduced several tools that gather diagnostic data to enable offline debugging and analysis. The most important diagnostic data is the dump file. There are two types of dump files:

1. Kernel-Mode Dump (system/core/memory dump)

This kind of dump happens when a stop error occurs in the Windows system. The common symptom is that the blue screen shows up and, at the same time, a core dump file is generated.
There are three kinds of core dump files:
- Complete Memory Dump
A Complete Memory Dump is the largest kernel-mode dump file. This file contains all the physical memory for the machine at the time of the fault.
The Complete Memory Dump file is written to %SystemRoot%\Memory.dmp by default.
- Small Memory Dump
A Small Memory Dump is much smaller than the other two kinds of kernel-mode crash dump files. It is exactly 64 KB in size, and requires only 64 KB of pagefile space on the boot drive.
- Kernel Memory Dump
A Kernel Memory Dump contains all the memory in use by the kernel at the time of the crash. This kind of dump file is significantly smaller than the Complete Memory Dump. Typically, the dump file will be around one-third the size of the physical memory on the system.
This dump file does not include unallocated memory or any memory allocated to user-mode applications. It includes only memory allocated to the Windows kernel and hardware abstraction layer (HAL), as well as memory allocated to kernel-mode drivers and other kernel-mode programs.
A great Microsoft KB covers system dump (core dump, blue screen) configuration.

2. User-Mode Dump (application/process dump)
This kind of dump file comes from a specific process, not from the Windows system itself.
- Full Dump
A full user-mode dump is the basic user-mode dump file. This dump file includes the entire memory space of a process, the program's executable image itself, the handle table, and other information that will be useful to the debugger.
- Mini Dump
A mini user-mode dump includes only selected parts of the memory associated with a process. The size and contents of a minidump file vary depending on the program being dumped and the application doing the dumping.
The name "minidump" is misleading, because the largest minidump files actually contain more information than the "full" user-mode dump.
User mode dump files can be created using the following methods:
- using DebugDiag (discuss blog )
- using adplus

- using userdump
- using Process Explorer
- using ProcDump
- using task manager
- using Visual Studio's "Save Dump As ..."

You can also use debugging tools such as Visual Studio and WinDbg to create dump files.

Manipulate Mini Dump Programmatically:
- MiniDumpReadDumpStream()
- MiniDumpWriteDump()

For how to use dump files to analyze software defects, see:
- Effective MiniDump
- Post-Mortem Debugging using MiniDump
- Reading Minidump
- Dump in Visual Studio
- Crash Dump Doc @ MSDN

Reader/Writer Locking and Beyond

The reader/writer problem [2.0] is a synchronization problem that deals with improving a multi-threaded program's overall performance. The core idea is to improve concurrency - to let as many threads as possible run.

The reader/writer lock is a mechanism to solve this problem - shared access for readers, exclusive access for writers.

Part I - Considering Factors

The basic idea behind the reader/writer lock is simple, and implementation algorithms have been published here and there [2.1][2.2][2.3], but there are various aspects to consider:

1. Spin Lock or Blocking Lock

Reader/writer locks are built on basic synchronization primitives - locks/semaphores - of which there are two types: [2.2]
- A busy-wait lock is a proactive locking mechanism: the thread waits while holding the CPU (spin locks and barriers are the most popular constructs of this type)
- A blocking lock uses yield-waiting and scheduler-based blocking: the thread yields the CPU while waiting

2. Re-Entrance/Recursion

Can a thread that already holds one kind of lock acquire another kind of lock? If so, the implementation is said to support recursion.

Roughly speaking, a typical reader/writer lock that supports recursion behaves like this:
- a thread holding a read lock can't request a write lock
- a thread holding a read lock can be granted a read lock without blocking
- a thread holding a write lock can be granted a read lock without blocking
- a thread holding a write lock can be granted a write lock without blocking

3. Time-Out

A lock-requesting thread may want to set a time limit on how long to wait. Usually, APIs with time-outs are named TryXXX ...

4. Fairness/Preference

When multiple threads are waiting for locks, which thread to wake up and grant access to is critical to fairness.

A reader-preference policy will grant a reader access if the resource is currently being accessed by a reader.

A writer-preference policy will grant the longest-waiting writer access if there is one; a requesting reader is blocked if any writer is waiting in the queue.

A fair policy usually ensures that:
1. a reader that arrives while some readers hold the read lock is blocked if a writer is already waiting, i.e., it avoids writer starvation
2. a reader that arrives before a writer is granted access before that writer, i.e., it avoids reader starvation

But which threads to grant access when read locks are allowed is another consideration:
- Consecutive: if a group of reader threads has waited longer than all waiting writer threads, that group is assigned the read lock.
- All-in-one: when readers are allowed access, all waiting readers are released.

5. Upgradable Read(Update) Lock

As we said, a recursive r/w lock doesn't allow a reader (shared) lock holder to request a writer (exclusive) lock (a.k.a. lock upgrade), but sometimes this is a strongly desired feature.

For instance, in a database implementation, an UPDATE statement may first acquire reader locks on all table pages; when it finds rows that need to be modified, a writer lock is requested on the related page.

A simple implementation of lock upgrade may cause deadlocks (p. 556, Chapter 17, Database Management Systems 3E), so the idea of lock downgrade emerged - acquire the writer (exclusive) lock first, and downgrade to a reader (shared) lock when it is safe to do so. Although this idea avoids many deadlocks, it limits concurrency - it requires acquiring the exclusive lock first.

So people invented the upgradable lock (a.k.a. update lock). It is compatible with reader (shared) locks, but not with other update (upgradable) locks or with writer (exclusive) locks. A thread granted an update lock can read the resource and also has the right to upgrade its lock to a write lock (which may block). An update lock holder can also downgrade its lock to a read lock. More explanation of upgradable locks can be found at [3.1.1].

Essentially, the upgradeable lock is intended for cases where a thread usually reads from the protected resource but might need to write to it if some condition is met - exactly the semantics of the UPDATE statement in a DBMS (which is why it is also called the update lock). SQL Server and DB2 support update locks, while Oracle doesn't. [5.1][5.2][5.3][5.4]

Part II - A C++ Implementation

In implementing our reader/writer locks, we made the following decisions:
1. Spin locks are useful when lock holding times are relatively short; we adopt block-waiting locks since it is not clear how these locks will be used
2. They support recursion
3. They support the time-out feature
4. For better flexibility, we implemented three RW locks: the reader-preferred ReaderWriterLock, the writer-preferred WriterReaderLock, and the fair FairSharedLock
5. Currently, we don't support the update lock

The basic algorithms for ReaderWriterLock and WriterReaderLock are those given in paper [2.1]; recursion is supported by introducing a current-writer field, and time-out is supported using Win32 wait functions.

Reader Preferred ReaderWriterLock Algorithm (from [2.1])

Writer Preferred WriterReaderLock Algorithm (from [2.1])

The fair shared lock is implemented using a lockword, which stores the lock state and is manipulated via Interlocked primitives, and a waiting-thread queue, which queues waiting reader/writer threads to keep the information needed for fairness decisions. Some ideas were learned from Solaris [3.8.1] and Jeffrey's OneManyResourceLock [3.1.2].

Source code can be found in Locking.h & Locking.cxx.

Part III - Some Multithreading Better Practices

1. Avoid letting threads transition from user mode to kernel mode
2. Use as few threads as possible (ideally, equal to the CPU/core count)
3. Use multiple threads for tasks that require different resources
4. Don't assign multiple threads to a single resource

Synchronization General Concepts

1.1 Synchronization
1.2 Lock
1.3 Mutex(Mutual Exclusion)
1.4 Semaphore
1.5 Monitor
1.6 Event
1.7 Barrier
1.8 Lock Convoy

General Reader/Writer Lock
2.0 The First, Second and Third Reader/Writer Problems

2.1 Paper: Concurrent Control with "Readers" and "Writers"
- It first introduced the reader/writer problem and gave two algorithms (R-preferred/W-preferred)
2.2 Paper: Algorithms for scalable synchronization on shared-memory multiprocessors
- It summarized busy-wait synchronization mechanisms and introduced a spin lock algorithm based only on local-memory spinning
2.3 Paper: Scalable RW Synchronization for SMP (Its Pseudocode & PPT)
- It introduced:
-- a. R-preferred/W-preferred/fair RW locks using general spin locks on SMP machines
-- b. an improvement of a., using an algorithm that needs only locally-spinning spin locks
2.4 Paper: A Fair Fast Scalable Reader-Writer Lock
- Some improvements on 2.2
2.5 Paper: Scalable Reader-Writer Locks
- The latest work, with a summary of previous work

2.6 Notes on Implementing a Reader/Writer Mutex
2.7. Scalable Read/Write Lock on SMP
2.8 Test Reader/Writer Lock Performance

Reader/Writer Lock Implementations on Various Platforms

- 1. .Net
3.1.1 Reader/Writer Lock in .Net (the Slim Version)
3.1.2 Jeffrey Richter on Reader/Writer Lock implementation
3.1.3 Joe Duffy on Reader/Writer Lock
3.1.4 Implementing Spin-Lock based Reader/Writer Lock and Its Analyzing

- 2. Java
3.2.1 Various Aspects on Java Reader/Writer Lock
3.2.2 Implementing Java Reader/Writer Lock
3.2.3 Java Doc on Java SE Reader/Writer Lock
3.2.4 Java Threading Book on Synchronization

- 3. Boost
3.3.1 DDJ on Boost Thread Library
3.3.2 Boost Thread Library Official Doc
3.3.3 Multithreading for C++0x

- 4. Apache Portable Runtime
3.4.1 APR RWLock Doc

- 5. PThread
3.5.1 PThread RWLock by IBM
3.5.2 PThread RWLock by Sco
3.5.3 PThread Doc

- 6. Intel TBB
3.6.1 Intel TBB Reader/Writer Lock

- 7. Win32
3.7.1 Synchronization Primitives New To Windows Vista
3.7.2 Slim Reader/Writer Lock
3.7.3 Win32 Native Concurrency (Part 1, Part 2, Part 3)

- 8. Solaris
3.8.1 Reader/Writer Lock source code in Open Solaris

- 9. Linux
3.9.1 simple doc on Linux kernel locking
3.9.2 Linux reader/writer lock implementation: header & source

- 10. Misc
3.10.1 Reader/Writer Lock Cxx Source Code (Header, Source)
3.10.2 An FIFO RWLock Source Code

Spin Lock

4.1 http://en.wikipedia.org/wiki/Spinlock
4.2 Spin Lock Implementation and Performance Comparison
4.3 Spin Wait Lock implementation by Jeffry Richter
4.4 User Level Spin Lock
4.5 Introduction on Spin Lock
4.6 Paper on FIFO Queued Spin Lock
4.7 InterlockedExchange atomic fetch_and_store on Windows
4.8 InterlockedCompareExchange atomic compare_and_swap on Windows
4.9 Ticket (Spin) Lock (wikipeida, in Linux Kernel, in Paper)
4.10 FIFO Ticket Spinlock in Linux Kernel
4.11 MCS Spin Lock Design & Implementation

Locking in Database

5.1 Understanding Locking in Sql Server
5.2 Oracle Locking Survival Guide
5.3 DB2 Locking Mechanism
5.4 Sql Server Locking Types


6.1 MSDN Magazine Concurrent Affairs Column
6.2 Many Insights on Thread Synchronization and Scalable Application
6.3 ReaderWriterGate Lock by Jeffrey (R/W lock + Thread Pooling)
6.4 A Richer Mutex & Lock Convoy by Jeffery


BLAS & LAPACK - Math Kernel for Scientists

1. The Standard Interface

BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra PACKage) are standards for linear algebra routines.

2. The Various Implementations

The reference BLAS [2] is the reference implementation of the BLAS standard. It is usually slower than machine-optimised versions, but can be used when no optimised library is accessible.

It is available from http://www.netlib.org/blas/.

The reference LAPACK [1] is the reference implementation of the LAPACK standard. Its performance is heavily dependent on the underlying BLAS implementation.

It is available from http://www.netlib.org/lapack/.

The Intel MKL (Math Kernel Library) implements (among other functionality, such as FFT) the BLAS and LAPACK routines. It is optimised for Intel CPUs.

It's available from http://software.intel.com/en-us/intel-mkl/

The AMD ACML (AMD Core Math Library) is AMD's optimised version of BLAS and LAPACK, and also offers some other functionality (e.g. FFT).

It's available from http://developer.amd.com/acml.jsp

The Goto BLAS [3][4] is a very fast BLAS library, probably the fastest on the x86 architecture.

It is available from http://www.tacc.utexas.edu/software_modules.php.

Its main contributor is Kazushige Gotō, who is famous for creating hand-optimized assembly routines for supercomputing and PC platforms that outperform the best compiler-generated code. Some news reports about him: "Writing the Fastest Code, by Hand, for Fun: A Human Computer Keeps Speeding Up Chips", "The Human Code".

ATLAS (automatically tuned linear algebra software, [5]) contains the BLAS and a subset of LAPACK. It automatically optimises the code for the machine on which it is compiled.

It is available from http://math-atlas.sourceforge.net/.

3. Extensions to Cluster System (Distributed Memory Parallel Computer)

BLACS [6] are the basic linear algebra communication subprograms. They are used as the communication layer by ScaLAPACK. BLACS itself makes use of PVM (parallel virtual machine) or MPI.

BLACS is available from http://www.netlib.org/blacs/

ScaLAPACK [7] is a library for linear algebra on distributed memory architectures. It implements routines from the BLAS and LAPACK standards. ScaLAPACK makes it possible to distribute matrices over the whole memory of a distributed memory machine, and use routines similar to the standard BLAS and LAPACK routines on them.

ScaLAPACK is available from http://www.netlib.org/scalapack/

PLAPACK [8] is also a library for linear algebra on distributed memory architectures. Unlike ScaLAPACK, it attempts to show that by adopting an object-based coding style, already popularized by the Message Passing Interface (MPI), the coding of parallel linear algebra algorithms is simplified compared with more traditional sequential coding approaches.

PLAPACK is available from http://www.cs.utexas.edu/~plapack/


[1] LAPACK User's Guide
[2] Basic Linear Algebra Subprograms for FORTRAN usage

[3] Anatomy of high-performance matrix multiplication
[4] High-performance implementation of the level-3 BLAS

[5] Automated Empirical Optimization of Software and the ATLAS project

[6] A user’s guide to the BLACS
[7] ScaLAPACK: A scalable Linear Algebra Library for Distributed Memory Concurrent Computers
[8] PLAPACK: Parallel Linear Algebra Libraries Design Overview


Extend C++ Standard IOStream Library

Why extend the C++ standard IOStream library?
1. To support new data types
2. To support new devices

The first kind of extension is easy - just overload the following two global operators:

    istream& operator>>(istream&, Complex&);
    ostream& operator<<(ostream&, Complex);

But the second is not so easy.

The stream library actually performs two unrelated tasks: formatting and buffering.

Formatting is the act of translating between binary data and their character representations. It is done by the class ios, the base class for both istream and ostream. The ios class keeps a format state that governs formatting. The format state specifies
  • the field width
  • the fill character
  • field alignment
  • integer base (decimal, hex or octal)
  • floating point format (fixed, scientific or general)
  • whether to show a + sign, trailing decimal point and zeros thereafter, or the integer base
  • whether to use upper- or lowercase letters for E, 0X, and the hex digits A ... F
The ios class concerns itself with formatting, the conversion between binary data and their character representations. Transporting these characters to and from devices is the responsibility of the streambuf class and its descendants.

How to extend C++ standard stream library?

Here is a very good example: http://www.horstmann.com/cpp/iostreams.html

1. The Architecture of Iostream
2. Extending the iostream library
3. IOStream Online Documentation
4. Standard C++ IOStreams and Locales (Google Book Version)
5. Using Iostreams and Library Headers
6. The C++ IOStreams Library


Map/Reduce - in Functional Programming & Parallel Processing Perspectives

Map/Reduce is a very popular term pair in today's technical community, mainly due to the popularity of its "inventor" - Google.

But in fact, the terms and concepts of map & reduce existed in the programming language community long before the G company's successful paper "MapReduce: Simplified Data Processing on Large Clusters", which appeared at OSDI '04.

In this article, I want to summarize what this term pair means in functional programming literature and parallel processing literature respectively.

I - Map/Reduce in Functional Programming Perspective

  Functional programming has a long history in academia, but has not been massively adopted by developer communities yet. It has some beautiful features compared with our daily-use imperative languages. Higher-order functions are one of them. Basically, a higher-order function is a function that takes other functions as input parameters or returns one as its value.

  Among the various higher-order functions, map, fold and filter are the most popular:

- Map is a higher-order function that applies a given function (a.k.a. the transformer) element-wise to a list of elements and returns the list of results. The transformer is applied to each element and produces a new element.

For example, map (toLower) "abcDEFG12!@#" produces "abcdefg12!@#".

- Fold (a.k.a. reduce, accumulate) is a higher-order function that processes a list of elements in some order (using a combiner function) and builds up a return value. The combiner is a function that is applied to the accumulated result and the next element, producing a new accumulated result.

for example: fold (+) 0 [1..5] produces 15, which is the sum of all the elements.

- Filter is a higher-order function that processes a list of elements in some order and produces a result containing exactly those original elements for which a given predicate returns true. The predicate is a function that takes one element as its input parameter and returns either true or false.

for example: filter (isAlpha) "$#!+abcDEF657" produces "abcDEF"

  Essentially, these three higher-order functions apply an operation over a list/array and produce a result: map transforms each element, filter keeps some elements, and fold combines all the elements.
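The three combinators map directly onto Python's built-ins; here is a minimal sketch (the sample lists are my own illustrations):

```python
from functools import reduce

# map: transform each element
squares = list(map(lambda x: x * x, [1, 2, 3, 4, 5]))        # [1, 4, 9, 16, 25]

# filter: keep only the elements satisfying the predicate
evens = list(filter(lambda x: x % 2 == 0, [1, 2, 3, 4, 5]))  # [2, 4]

# reduce (fold): combine all elements into one result
total = reduce(lambda acc, x: acc + x, [1, 2, 3, 4, 5], 0)   # 15
```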

  Pure functional languages, such as Haskell/Lisp, and some mixed languages, such as Python, have built-in functions named exactly Map/Reduce. C# 3.0 introduces some functional features in the LINQ subsystem, where Map is called Select and Reduce is called Aggregate.

  More concrete examples can be found in [2].

II - Map/Reduce in Parallel Processing Perspective

  Map/Reduce is a programming model and also an implementation runtime. The programming model is what you use to express your computation tasks, while the implementation runtime consists of the software components that realize what the model claims.

  This model is called map/reduce, but the meanings are somewhat different:
  - the map semantics are the same as in functional programming languages: the transformer (the mapper in Google's paper) is applied to each element of the list
  - the reduce semantics differ: here, the combiner (the reducer in Google's paper) is applied to multiple subsections of the elements in the list, and thus produces multiple reduce results, while in functional programming languages it is applied to all the elements and produces only one result

  Conceptually, how are the elements divided into multiple subsections?
  To resolve this, the model imposes some structure on the elements produced by the mapper and consumed by the reducer: each element/record has two parts, a key and a value. All the elements are then partitioned according to the key. The records with the same key form a subsection and are passed to a reducer invocation as a whole.
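This key/value grouping can be sketched in a single process; below is a minimal word-count example (the canonical one from Google's paper), where the in-memory grouping stands in for the real runtime's shuffle phase:

```python
from collections import defaultdict

def mapper(document):
    # emit (key, value) records: one ("word", 1) per word
    for word in document.split():
        yield (word, 1)

def reducer(key, values):
    # combine all values that share the same key
    return (key, sum(values))

def map_reduce(documents):
    # "shuffle" phase: group intermediate records by key
    groups = defaultdict(list)
    for doc in documents:
        for key, value in mapper(doc):
            groups[key].append(value)
    # reduce phase: one reducer call per key group
    return dict(reducer(k, vs) for k, vs in groups.items())

print(map_reduce(["a b a", "b c"]))  # {'a': 2, 'b': 2, 'c': 1}
```

In the real runtime each mapper and each reducer invocation can run on a different node, since they share no state.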

  From the implementation's perspective, the most important advantage of this programming model is that it enables automatic parallelization and distribution of large-scale data processing:
  - the mapper is applied to each record independently, so it is a data-parallel problem by itself; we just need to distribute the input data among the processing nodes on record boundaries.
  - the reducer is applied to one subsection at a time, so we just need to distribute the subsections among the processing nodes.

  Another implementation problem is failover: what to do when a failure happens?
  Simple! Just re-execute the specific failed mapper/reducer; the other mappers/reducers aren't affected at all. Because there is no communication among mappers or among reducers, this solution is semantically correct.
  Since the input of a mapper is persisted in a reliable storage system, a failed mapper only needs to be re-executed. But the input of a reducer (which is also the output of a mapper) is persisted in a worker's local storage, so a re-executed reducer may find some of its input unavailable (for example, if an intermediate data node crashed). In this situation, recovering a failed reducer requires re-executing both some mappers and the reducer itself.

1. Functional Programming
2. higher order functions - map, fold and filter
3. Map/Reduce/Filter in Python
4. Map/Reduce in PHP
5. Google's MapReduce Programming Model — Revisited
6. MapReduce: Simplified Data Processing on Large Clusters
7. Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters


Some info about compiler front-end development




For flex/lex documentation, reading "Lex - A Lexical Analyzer Generator" should be enough.

The bison manual is indeed very detailed, but I still recommend the O'Reilly book lex & yacc (2nd edition):
among other things, it contains a simple SQL parser example that you can imitate as an exercise.

The best resource online is http://dinosaur.compilertools.net/, which covers every aspect of flex/bison and lex/yacc.

The lex & yacc book was apparently translated into Chinese at some point, but the translation is out of print; I bought a reprinted copy on Taobao.


If you are starting from scratch, consider trying antlr (http://www.antlr.org); many domain-specific language compilers are built with it: for example, Hibernate's HQL and Yahoo!'s YQL are both based on antlr.


Edit Distance for String, Tree and Graph

  Edit Distance refers to the minimum number of edit steps required to convert one object into another. It's a useful concept in searching, data mining and pattern recognition. It is not limited to text strings; it also applies to structured data such as trees and graphs.
  1. String Edit Distance
  The basic idea is dynamic programming: we can solve the problem by first solving smaller-scale subproblems and then choosing the least-cost option. The algorithm and its proof can be found below:
[1] http://en.wikipedia.org/wiki/Levenshtein_distance
[2] http://nlp.stanford.edu/IR-book/html/htmledition/edit-distance-1.html
  I wrote an implementation that can show the whole edit process; it can be found here.
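The dynamic programming recurrence itself is compact; here is a minimal Python sketch (this is not the visualizing implementation mentioned above, just the core table computation):

```python
def edit_distance(a, b):
    # dp[i][j] = minimum edits to turn a[:i] into b[:j]
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all i characters of a[:i]
    for j in range(n + 1):
        dp[0][j] = j          # insert all j characters of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[m][n]

print(edit_distance("kitten", "sitting"))  # 3
```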

  2. Tree Edit Distance
    This algorithm is also based on dynamic programming, but it is no longer easy to understand. A detailed tutorial on the algorithm can be found at [1]; from the presentation you can see that the recursive formula is rather simple. The challenging part is understanding how the elements of the matrix are computed in the dynamic programming algorithm and why all the steps work. It may take several days to fully understand the algorithm.
  Code for the algorithm implementation and tests can be found under this folder. It is based on the algorithm proposed in [2]; [1] is a very clear tutorial on this algorithm.
References can be found below:
[1] http://www.inf.unibz.it/dis/teaching/ATA  
[2] Kaizhong Zhang and Dennis Shasha. Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6):1245–1262, 1989
[3] A Survey on Tree Edit Distance and Related Problems
  3. Graph Edit Distance
    This is an even harder problem; related papers can be found below:
[1] Bridging the Gap Between Graph Edit Distance and Kernel Machines
[2] http://www.springerlink.com/content/k21v4h4ur2r068w2/
  4. Visualization

When you are dealing with tree/graph algorithms, visualization is very important for debugging and diagnosis. You can use the Graphviz tool to do this.
  Here you can find a lot of papers on graph visualization algorithms.

  More references listed below:
[1] http://en.wikipedia.org/wiki/Graph_drawing
[2] http://graphdrawing.org/index.html
[3] Graphviz and Dynagraph – Static and Dynamic Graph Drawing Tools
[4] Graph Visualization and Navigation in Information Visualization: a Survey
[5] Interactive Visualization of Large Graphs and Networks (PH.D Thesis)
  Update@09/27/2009: Microsoft has a great graph layout library called MSAGL (formerly known as GLEE) that is publicly available. It beats Graphviz in many respects, especially scalability.


Parallel DBMS V.S. Distributed DBMS

 Large-scale data-intensive computing is a hot topic today, and many people have started talking about so-called Parallel Database System and Distributed Database System technologies. But these two concepts seem quite confusing, so I devoted some time to trying to make them clear.

Parallel Database System seeks to improve performance through parallelization of various operations, such as data loading, index building and query evaluating. Although data may be stored in a distributed fashion in such a system, the distribution is governed solely by performance considerations.

In a Distributed Database System, data is physically stored across several sites, and each site is typically managed by a DBMS capable of running independently of the other sites. In contrast to parallel databases, the distribution of data is governed by factors such as local ownership and increased availability.

PDB & DDB Comparison:

1. System Components
- A Distributed DBMS consists of many geo-distributed, autonomous sites connected by low-bandwidth links.
- A Parallel DBMS consists of tightly coupled, non-autonomous nodes connected by high-bandwidth links.

2. Component Role
- Sites in a Distributed DBMS can work independently to handle local transactions or work together to handle global transactions.
- Nodes in a Parallel DBMS can only work together to handle global transactions.

3. Design Purposes
= Distributed DBMS is for:
 - Sharing Data
 - Local Autonomy
 - High Availability
= Parallel DBMS is for:
 - High Performance
 - High Availability
 But both PDB & DDB need to consider the following problems: 1. Data Distribution (Placement & Replication); 2. Query Parallelization (Distributed Evaluation). Also, since many parallel systems now consist of networks of workstations, the difference between Parallel DB and Distributed DB is becoming smaller.

1. Great Paper on PDB & DDB Explanation: Distributed and Parallel Database Systems
2. Great Paper by Jim Gray: Parallel Database Systems
3. Textbook, Database Management Systems (3rd edition)
4. Textbook, Database System Concepts (5th edition)
5. Textbook, Principle of Distributed Database Systems (2nd edition)
6. DB Textbook List @ Amazon


Baidu World 2009


  1. The conference was heavily attended, which shows the enthusiasm of people inside and outside the industry for the Internet, for search and for Baidu, as well as the company's influence
  2. The conference evolved from the earlier Baidu World marketing events; although many technical keynotes appeared, the technical content was thin; the venue was bustling, but the positioning was somewhat unclear
  3. The organization was unimpressive: little on-site guidance; manual password entry was completely unnecessary (barcode scanning would have saved a lot of time); signage was misleading; rules and procedures existed in name only; attendance got out of control; and so on
  4. In overall presence and confidence there is still a large gap from a first-class international company


  On the technical side I mainly want to discuss three topics: Box Computing, the Aladdin Platform, and the problems search engines face from a product-design perspective.

  1. Box Computing - This was Robin Li's opening keynote and the centerpiece of the conference.


  1). How to understand user intent? A core part of Box Computing is classifying and understanding user input. To understand it as accurately as possible, you need context such as the user's search history and the place and time the request was issued, plus semantic understanding, artificial intelligence, and a whole series of things industry has long wanted to do but never accomplished.
  2). How to build an open information ecosystem? Baidu will open interfaces to third parties and bring in external information providers. Technically, this will greatly increase the amount of information reachable through search, eliminating part of the deep web that traditional crawlers cannot handle; commercially, it will build a new "information" ecosystem, in which information owners get good opportunities for user traffic and revenue. For Baidu and for owners of high-value information alike, this is a win-win.
  The semantics-related problems: into which categories should user queries be divided? What format should each category's analysis result take? How do information providers and users interact? These are really no different from the problems faced by people who did enterprise information integration with SOA years ago. Reaching a reasonable standard that everyone can follow is both a technical and a non-technical problem.

  2. Aladdin Platform - This was the keynote by Baidu chief architect Liao Ruoxue at the afternoon search-technology session. The content was fairly high level; the slides and the on-stage delivery were good, but there is no real substantive difference from Box Computing.

  1) Mine the deep web, collecting structured data rather than just web-page text;
  2) Analyze user needs and present rich, personalized results;
  3) Difference from Box Computing (my understanding): Box Computing is an open ecosystem, while Aladdin is fully controlled by Baidu itself.

  1). 25% of user queries are not satisfied, and existing search engines cover roughly 37% of all data
  2). More and more of the important data on the Internet will be structured, non-textual data
  3). Diverse data leads to diverse result presentation (tables, video, images, etc.), which puts new demands on infrastructure (re-layout and optimization of servers/network)
  4). Traditional ranking systems based on inter-page links no longer apply; result ranking is a new challenge
  5). Information must be updated automatically and promptly, with an emphasis on freshness and automation
  6). The goals of the Aladdin Platform resemble the positioning of Bing in many respects
  7). Search engines are evolving from simple web-page retrieval tools into enormous, all-capable artificial intelligence systems

  3. Search engine problems as users see them - This talk actually yielded the most information, since it contained many facts & observations that are hard to learn elsewhere.

  1. Search demand is growing explosively and taking ever more diverse forms;
    - Users expect a search engine to be not just search, but an intelligent question-answering system
    - The spread of mobile and handheld devices adds new demands
  2. The ways user search needs are satisfied are becoming more complex;
    - Search is only one step in the user's overall activity
    - Higher freshness requirements
    - Answers have a surrounding context
  3. User search behavior is becoming more and more casual;
    - Queries increasingly resemble natural conversation
    - Long queries account for 50% of query terms and 30% of query volume
  4. Valuable resources on the Internet are becoming harder and harder to acquire.
    - A lot of data that does not rely on links is appearing: blogs, SNS, online libraries

  1. Natural language processing, to understand needs accurately
  2. Rich user interfaces, to satisfy diverse needs
  3. Search communities, to mine the knowledge in people's heads
  4. Third-party structured information, to mine the deep web



Consistent Hashing - Theory & Implementation

What's it?

Consistent hashing comes from solving the hot-spot problem in Internet systems, i.e., it comes from distributed cache systems. [1][2]

The idea is simple and straightforward (more detail in papers [1][2]):
Hot Spot -> Central Cache -> Distributed Cache (communicating via IP multicast) -> Harvest Cache (tree-structured cache, structured communication) [8] -> Random Tree Cache (a different tree for each cached object, hash mapping from tree nodes to machine nodes) -> Random Tree + Consistent Hashing Cache (deals with machine node dynamics: machines may leave/go down and join/recover)

Essentially, a consistent hash function is one that changes minimally as the range of the function changes. A hash function can be viewed as a mapping from items to buckets; suppose we have such a mapping m1. If some buckets are added/removed, we have another mapping m2. With normal hash functions, going from m1 to m2 introduces many item movements (from one bucket to another). A consistent hash function has the characteristic that item movements are minimal.

How does it accomplish that?

In a consistent hash function:
1. Items and buckets are all hashed (with a normal hash) into some integer interval (for example: [0, 1024]).
2. An item is mapped to the bucket closest to it.
3. Buckets may be replicated to improve the evenness of the distribution.

NOTE: here "closet" means the first bucket it meets when traverse clock wise along the integer circle (see diagram below)

Suppose the normal hash output range is [0, 1023], you have 4 buckets and 8 items, and each bucket is replicated twice. One possible consistent hashing is illustrated below:

Current Mapping/Hashing:
Bucket1 - Item1, Item3, Item8
Bucket2 - Item2, Item7
Bucket3 - Item4, Item6
Bucket4 - Item5

If Bucket3 is down/broken, the new Mapping/Hashing will be:

Bucket1 - Item1, Item3, Item8
Bucket2 - Item2, Item6, Item7
Bucket4 - Item4, Item5

You can see that only Items on Bucket3 are changed, and they are distributed among the remaining buckets.

If a new Bucket5 is added, again only a small number of items change, and the new bucket takes load from each of the original 4 buckets.

How to Implement it?

A normal hash function is stateless: its return value is determined only by the input parameter. A consistent hash function, however, is stateful. The state is how the buckets are arranged on the integer circle.

It's natural to store the arrangement of buckets on the integer circle as a search tree, since lookup is in fact a typical search problem: you need to know which segment an item (its hash value) belongs to.
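A minimal Python sketch of such a stateful structure, using the bisect module over a sorted list as a stand-in for the search tree (the ConsistentHash class, its MD5-based placement and the replica naming scheme are my own illustrative choices, not the paper's exact construction):

```python
import bisect
import hashlib

def _hash(key):
    # a normal, stateless hash mapping keys onto the integer circle
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHash:
    def __init__(self, buckets=(), replicas=4):
        self.replicas = replicas
        self._keys = []   # sorted positions on the circle (the search structure)
        self._ring = {}   # position -> bucket
        for b in buckets:
            self.add(b)

    def add(self, bucket):
        # each bucket is placed at `replicas` points to even out the load
        for r in range(self.replicas):
            pos = _hash("%s#%d" % (bucket, r))
            bisect.insort(self._keys, pos)
            self._ring[pos] = bucket

    def remove(self, bucket):
        for r in range(self.replicas):
            pos = _hash("%s#%d" % (bucket, r))
            self._keys.remove(pos)
            del self._ring[pos]

    def lookup(self, item):
        # first bucket clockwise from the item's position (with wrap-around)
        pos = _hash(item)
        i = bisect.bisect(self._keys, pos) % len(self._keys)
        return self._ring[self._keys[i]]

ch = ConsistentHash(["bucket1", "bucket2", "bucket3", "bucket4"])
print(ch.lookup("item5"))
```

Removing a bucket only remaps the items that were on it; all other items keep their bucket, which is exactly the minimal-movement property described above.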

In a practical system, this stateful data structure is stored on each client that uses the consistent hash function. As buckets (machine nodes) join and leave, the state changes. But different clients may see the joins/leaves at different times, or even in different orders, and thus may produce different hash values using the same consistent hash function (they are said to have different views in paper [1]).

But it is proven in [1] that the number of buckets one item may belong to, and the number of items on any particular bucket, won't be very large (the so-called Spread/Load properties).

A C++ version of a consistent hashing function can be found here; it uses an STL map as the binary search tree.

The impact of the bucket replica count can be visualized as below (code can be found in testmain.cxx):

You can see that as the replica count increases, the item distribution over the buckets becomes more and more even.

1. Theory Paper: Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web
2. Practical Paper Web Caching with Consistent Hashing
3. Blog About Consistent Hashing with Java Code
4. Blog Understanding Consistent Hash
5. http://en.wikipedia.org/wiki/Consistent_hashing
6. http://en.wikipedia.org/wiki/Distributed_hash_table
7. The Chord P2P system
8. A Hierarchical Internet Object Cache


How DNS Works

I - What's DNS & Why DNS?

In the beginning, people used numerical identifiers (IP addresses) to represent network devices. But humans are good at remembering meaningful names, not numbers, so host names were introduced. In the early days, there was a global file storing the name/IP mappings, known as the hosts file.

As more and more devices joined the Internet, a single hosts file could no longer solve the mapping problem. So people invented DNS.

The Domain Name System (DNS) is a hierarchical naming system for computers, services, or any resource participating in the Internet. It associates various information with domain names assigned to each of the participants. Most importantly, it translates domain names meaningful to humans into the numerical (binary) identifiers associated with networking equipment for the purpose of locating and addressing.

DNS essentially functions as a distributed database using a client/server relationship between clients that need name resolution (mapping host names to IP addresses) and the servers that maintain the DNS data.

II - Related Concepts

1. Host & Host Name

Each device on the Internet is called a Host. Whether the host is a computer, printer, router, and so forth, as long as it has a unique IP address, it’s a host. Just as the IP address identifies the host uniquely, so does the Host Name.

2. Zone, Domain & Delegation

A Zone is a portion of the DNS database that contains the resource records with the owner names belonging to a contiguous portion of the DNS namespace.

A Zone starts as a storage database for a single DNS domain name. If other domains are added below the domain used to create the zone, these domains can either be part of the same zone or belong to another zone. Once a subdomain is added, it can then either be:
  • Managed and included as part of the original zone records, or
  • Delegated away to another zone created to support the subdomain
A DNS database can be partitioned into multiple Zones. A DNS server is considered authoritative for a domain name if it loads the Zone file containing that name.

Delegation is a process of assigning responsibility for a portion of a DNS namespace to a DNS server owned by a separate entity.

3. DNS Database Replication

There could be multiple zones representing the same portion of the namespace. Among these zones there are three types:

  • Primary
  • Secondary
  • Stub

Primary is a zone to which all updates for the records that belong to that zone are made. A secondary zone is a read-only copy of the primary zone. A stub zone is a read-only copy of the primary zone that contains only the resource records that identify the DNS servers that are authoritative for a DNS domain name.

Any changes made to the primary zone file are replicated to the secondary zone file. DNS servers hosting a primary, secondary or stub zone are said to be authoritative for the DNS names in the zone. A DNS server hosting a primary zone is said to be the primary DNS server for that zone.

4. Resource Record

A DNS database consists of resource records (RRs). Each RR identifies a particular resource within the database. There are various types of RRs in DNS. The common RR types are: Start of Authority (SOA), Name Server (NS), Mail Exchanger (MX), Host (A), Alias (CNAME). Please read [3] for a detailed description of each RR type.

III - How DNS Works

DNS is essentially a distributed client/server system, where communication is mainly done by sending/receiving DNS queries.

DNS queries can be sent from a DNS client (resolver) to a DNS server, or between two DNS servers. A DNS query is merely a request for DNS resource records of a specified type with a specified DNS name. For example, a DNS query can request all resource records of type A (host) with DNS name "abc.com".

There are two types of DNS queries that may be sent to a DNS server:

  • Recursive
  • Iterative

A recursive query forces a DNS server to respond to a request with either a failure or a successful response. With a recursive query, the DNS server must contact any other DNS servers it needs to resolve the request. When it receives a successful response from the other DNS server(s), it then sends a response to the DNS client.

An iterative query is one in which the DNS server is expected to respond with the best local information it has, based on what the DNS server knows from local zone files or from caching, without contacting other DNS servers. If a DNS server does not have any local information that can answer the query, it simply sends a negative response.

When iteration is used, a DNS server answers a client based on its own specific knowledge about the namespace with regard to the names data being queried. For example, if a DNS server on your intranet receives a query from a local client for “www.microsoft.com”, it might return an answer from its names cache. If the queried name is not currently stored in the names cache of the server, the server might respond by providing a referral - that is, a list of NS and A resource records for other DNS servers that are closer to the name queried by the client.

As shown in the graphic above, a number of queries were used to determine the IP address for www.whitehouse.gov. The query sequence is described below:

  1. Recursive query for www.whitehouse.gov (A resource record)
  2. Iterative query for www.whitehouse.gov (A resource record)
  3. Referral to the .gov name server (NS resource records, for .gov); for simplicity, the iterative A queries issued by the DNS server (on the left) to resolve the IP addresses of the host names of the name servers returned by other DNS servers have been omitted.
  4. Iterative query for www.whitehouse.gov (A resource record)
  5. Referral to the whitehouse.gov name server (NS resource record, for whitehouse.gov)
  6. Iterative query for www.whitehouse.gov (A resource record)
  7. Answer to the iterative query from the whitehouse.gov server (www.whitehouse.gov's IP address)
  8. Answer to the original recursive query from local DNS server to Resolver (www.whitehouse.gov’s IP address)
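The referral-chasing loop above can be simulated with a hypothetical in-memory set of name servers (the server names, records, and the 203.0.113.7 address are all made up for illustration; a real resolver speaks the RFC 1035 wire protocol over UDP/TCP):

```python
# hypothetical in-memory name-server data: each server knows either an
# answer (A record) or a referral (NS) to a server closer to the name
SERVERS = {
    "root":          {"referral": {".gov": "gov-ns"}},
    "gov-ns":        {"referral": {".whitehouse.gov": "whitehouse-ns"}},
    "whitehouse-ns": {"answer": {"www.whitehouse.gov": "203.0.113.7"}},
}

def iterative_query(server, name):
    # an iterative query: the server answers with its best local info,
    # either the A record, a referral to a closer server, or a negative reply
    data = SERVERS[server]
    if name in data.get("answer", {}):
        return ("answer", data["answer"][name])
    for suffix, ns in data.get("referral", {}).items():
        if name.endswith(suffix):
            return ("referral", ns)
    return ("negative", None)

def recursive_resolve(name):
    # the local DNS server handles the client's recursive query by issuing
    # iterative queries, following referrals until it obtains an answer
    server = "root"
    while True:
        kind, value = iterative_query(server, name)
        if kind == "answer":
            return value
        if kind == "referral":
            server = value
        else:
            raise LookupError(name)

print(recursive_resolve("www.whitehouse.gov"))  # 203.0.113.7
```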

IV - RFCs about DNS
  • RFC 1034 -- Domain Names — Concepts and Facilities
  • RFC 1035 -- Domain Names — Implementation and Specification
  • RFC 1123 -- Requirements for Internet Hosts — Application and Support
  • RFC 1886 -- DNS Extensions to Support IP Version 6
  • RFC 1995 -- Incremental Zone Transfer in DNS
  • RFC 1996 -- A Mechanism for Prompt Notification of Zone Changes (DNS NOTIFY)
  • RFC 2136 -- Dynamic Updates in the Domain Name System (DNS UPDATE)
  • RFC 2181 -- Clarifications to the DNS Specification
  • RFC 2308 -- Negative Caching of DNS Queries (DNS NCACHE)
  • RFC 2535 -- Domain Name System Security Extensions (DNSSEC)
  • RFC 2671 -- Extension Mechanisms for DNS (EDNS0)
  • RFC 2782 -- A DNS RR for specifying the location of services (DNS SRV)
1. Wiki On Domain Name System
2. HowStuffWorks on Domain Name System
3. MS TechNet on How DNS Works
4. Understanding Domain Name System (Part I, Part II)

Papers on Designing/Implementing Internet
1. Rethinking the Design of the Internet
2. End to End Argument In System Design
3. End to End Principle