12/27/2009

Parallel Programming - Using OpenMP

OpenMP is a parallel programming model for shared memory parallel computers.

It's based on the Fork-Join parallel execution pattern and is suitable for both Data Parallel and Task Parallel applications.

Fork-Join Pattern
- OpenMP programs begin as a single thread - the master thread, which executes sequentially until the first parallel region construct is encountered.
- Fork: the master thread then creates a team of concurrent threads, which execute the user-provided code.
- Join: when the team threads complete, they synchronize with each other at a barrier and then terminate, leaving only the master thread to continue.


Fork/Join Pattern in OpenMP (from [1])
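As a minimal sketch of this pattern (assuming any compiler with OpenMP support, e.g. /openmp in Visual Studio or -fopenmp in GCC; the printed messages are purely illustrative):

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        printf("master thread runs alone\n");

        #pragma omp parallel            /* Fork: a team of threads is created here */
        {
            printf("hello from team thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }                               /* Join: implicit barrier, team terminates */

        printf("master thread continues alone\n");
        return 0;
    }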

Work Sharing Constructs
The core functionality of OpenMP is to process data or execute tasks in parallel, i.e., to share the workload. It provides several constructs for this (a small sketch follows the list):
- for: OpenMP automatically divides the (independent) loop iterations and assigns them to the team threads for execution.
- sections: the programmer defines static code sections, each of which is assigned to one of the team threads for execution.
- task: data and code can be (dynamically) packaged as a task and delivered to a team thread for execution.
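A small sketch of all three constructs in one program (the process() helper is a hypothetical stand-in for independent per-item work; the task part needs an OpenMP 3.0 compiler, so it will not build with VS2008):

    #include <omp.h>
    #include <stdio.h>

    /* Hypothetical per-item work, independent across items. */
    static void process(int i)
    {
        printf("item %d handled by thread %d\n", i, omp_get_thread_num());
    }

    int main(void)
    {
        int n = 8;

        /* for: loop iterations are divided among the team threads */
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            process(i);

        /* sections: each statically defined section runs on one team thread */
        #pragma omp parallel sections
        {
            #pragma omp section
            process(100);
            #pragma omp section
            process(200);
        }

        /* task: work is packaged dynamically and handed to the team
           (requires OpenMP 3.0, hence not VS2008) */
        #pragma omp parallel
        {
            #pragma omp single
            for (int i = 0; i < n; ++i)
            {
                #pragma omp task firstprivate(i)
                process(i);
            }
        }
        return 0;
    }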

Implementation
OpenMP is designed for Fortran and C/C++. Its functionality is exposed in the following forms:
- New constructs as compiler directives
- APIs as a runtime library
- Environment variables

Currently, Visual Studio 2008 supports OpenMP 2.0, OpenMP@MSDN

To use OpenMP in VS2008 C++ development, you only need to include the omp.h header and enable the /openmp compiler flag (Project Properties -> C/C++ -> Language -> OpenMP Support).

More detailed tutorials can be found at [4][5].

I have written some OpenMP sample applications; they compile with VS2008 (except the task example, since the task construct requires OpenMP 3.0).

[Reference]
[1] http://en.wikipedia.org/wiki/OpenMP
[2] http://openmp.org/wp/

[3] Introduction to Parallel Programming
[4] OpenMP tutorial at LLNL
[5] OpenMP hands-on Tutorial at SC08

[6] Parallel Programming with OpenMP and MPI
[7] Blog on OpenMP programming
[8] Intel on OpenMP traps
[9] Parallel Programming Model Comparison
[10] Microsoft on OpenMP version in Visual Studio
[11] More OpenMP sample applications

12/19/2009

Parallel Programming - Using POSIX Threads

Pthreads (a.k.a. POSIX Threads) is another parallel programming model for shared memory computers. It belongs to the threads-based model (the other category being the message-passing-based model).

Pthreads provides threads by means of pure C-style APIs, while OpenMP does so through compiler directives.

As the process/thread concepts are well understood in today's developer community, I will skip the basic explanation.

Pthreads APIs can be divided into the following categories (a minimal example follows the list):
- Thread Management (create, destroy, join, cancellation, scheduling, thread-specific data and all related attributes)
- Thread Synchronization (mutex, condition variable, barrier, reader/writer lock, spin lock and all related attributes)
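As a minimal example of the two categories (thread management plus mutex-based synchronization; error checking is omitted, and the worker/counter names are just for illustration):

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static long counter = 0;

    static void* worker(void* arg)
    {
        /* Thread synchronization: protect the shared counter with a mutex */
        pthread_mutex_lock(&lock);
        ++counter;
        pthread_mutex_unlock(&lock);
        return arg;
    }

    int main(void)
    {
        pthread_t threads[4];

        /* Thread management: create a small team of worker threads ... */
        for (int i = 0; i < 4; ++i)
            pthread_create(&threads[i], NULL, worker, NULL);

        /* ... and join them before reading the shared result */
        for (int i = 0; i < 4; ++i)
            pthread_join(threads[i], NULL);

        printf("counter = %ld\n", counter);
        return 0;
    }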

Pthreads is an international standard and is well supported in the *nix world. Microsoft Windows has its own threading interface, but there is a well-known open source project, Pthreads-Win32, which provides a Pthreads implementation for the Windows platform.

Multithreaded programming is a very broad topic; this post only aims to give a brief introduction. [3] and [5] are very good hands-on tutorials on Pthreads programming.

[Reference]
[1] http://en.wikipedia.org/wiki/POSIX_Threads
[2] http://sourceware.org/pthreads-win32/

[3] Pthreads Tutorial by LLNL
[4] Pthreads Tutorial by IBM
[5] Pthreads Hands-On Tutorial
[6] Pthreads Tutorial for Linux Platform
[7] Pthreads Info Center

12/15/2009

Parallel Computing - An Introduction

Parallel Computing is a form of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently ("in parallel"). The core motivation for parallelizing is speeding up computing tasks.

1. Types of Parallelism

There are various forms of parallel processing:
- Bit-level parallelism: multiple bits of a word are processed at the same time
- Instruction-level parallelism: multiple instructions are executed simultaneously
- Data parallelism (a.k.a. loop-level parallelism): focuses on distributing the data across different parallel processing units
- Task parallelism (a.k.a. functional parallelism): focuses on distributing execution tasks (code + data) across different parallel processing units

2. Types of Parallel Computer

Using Flynn's Taxonomy, computer architecture can be divided into:
- Single Instruction, Single Data stream (SISD)
- Single Instruction, Multiple Data streams (SIMD)
- Multiple Instruction, Single Data stream (MISD)
- Multiple Instruction, Multiple Data streams (MIMD)

Today's parallel computers are almost all of the MIMD type. At a more coarse-grained level, parallel computers can be further divided into:
- SPMD Single Program, Multiple Data
- MPMD Multiple Program, Multiple Data

According to the memory architecture, parallel computers can be divided into:

- Shared Memory Computer

In this kind of computer, every processing node shares the same global memory address space. Programming these computers can be as easy as programming a multicore workstation.

Shared memory computers are easy to program, but since a bus is shared by all processing nodes, the scale is limited (usually to several tens of nodes), because bus contention becomes the bottleneck as the scale grows.

Shared memory computer can be further divided into two kinds:
- UMA/cc-UMA: all processing nodes share the same physical memory device through a bus
- NUMA/cc-NUMA: each processing node has local physical memory that is also accessible by other nodes, but the access time depends on the memory location relative to the accessing node

Uniform Memory Access (from [1])

Non-Uniform Memory Access (from [1])

- Distributed Memory Computer

In this kind of computer, each node has its own private memory address space and can't access other nodes' memory directly. Usually, processing nodes are connected by some kind of interconnection network.

Distributed memory computers can scale very large since no bus contention occurs, but it is more complicated to write programs for this kind of computer.

Distributed Memory (from[1])

- Distributed Shared Memory Computer

The hardware architecture of this kind of computer is usually the same as a distributed memory system, but the interface it presents to application developers is the same as a shared memory system.

DSM is usually implemented as a software extension to the OS, which incurs some performance penalty.

3. Parallel Programming Model

With these powerful computers in hand, how do we program them?

3.1 Conceptually, there are two models for writing parallel programs.

Threads Model

There are two well-known API interfaces for multithreading:
- OpenMP (Open Multi-Processing), which uses compiler directives, environment variables and a runtime library to provide threading support
- Pthreads (POSIX Threads), which provides threads by means of a library only

A process with two threads (from Wikipedia)

Message Passing Model

In this model, a set of tasks use their local memory for computation; communication among these tasks is conducted by sending and receiving network messages.

There are two standards:
- MPI (Message Passing Interface)
- PVM (Parallel Virtual Machine)

Typical Message Passing Patterns are listed below:

Collective communications examples

Message Passing Pattern (from[1])
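As a rough sketch of the point-to-point flavor of this model (assuming an MPI implementation such as MPICH or Open MPI is available), rank 0 acts as the sender and rank 1 as the receiver:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char** argv)
    {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {                    /* sender/producer task   */
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {             /* receiver/consumer task */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("task 1 received %d from task 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }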

3.2 Other factors to Consider when Designing Parallel Programs

a. Problem Decomposition/Partitioning
- Data Partitioning
- Functional Partitioning

b. Communication Considerations [1]
- Latency, is the time it takes to send a minimal (0 byte) message from point A to point B.
- Bandwidth, is the amount of data that can be communicated per unit of time.
- Async vs Sync.
- Point-to-Point, involves two tasks with one task acting as the sender/producer of data, and the other acting as the receiver/consumer.
- Collective, involves data sharing between more than two tasks, which are often specified as being members in a common group, or collective.

c. Load Balancing

Load balancing refers to the practice of distributing work among tasks so that all tasks are kept busy all of the time.

How? [1]
- Equally partition workload
- Use dynamic work assignment

d. Granularity [1]

Measured by Computation/Communication Ratio, because periods of computation are typically separated from periods of communication by synchronization events.

- Coarse-grain Parallelism, relatively large amounts of computational work are done between communication/synchronization events
- Fine-grain Parallelism, relatively small amounts of computational work are done between communication events

[Reference]

[1] Parallel Programming Tutorial by LLNL
[2] Parallel Programming Models and Paradigms

[3] Book - Designing and Building Parallel Programs
[4] Book - Introduction to Parallel Computing 2E
[5] Book - Parallel and Distributed Programming Using C++
[6] Book - Patterns for Parallel Programming

12/09/2009

Debugging Facilities on Windows Platform

Part I - System/Application Error Collection Tools

These tools are used to collect software data (especially when an error occurs) that can be used to identify and fix software defects.

1. Dr. Watson

"Dr. Watson for Windows is a program error debugger that gathers information about your computer when an error (or user-mode fault) occurs with a program. The information obtained and logged by Dr. Watson is the information needed by technical support personnel to diagnose a program error for a computer running Windows."
A text file (Drwtsn32.log) is created whenever an error is detected, and can be delivered to support personnel by the method they prefer. A crash dump file can also be created, which is a binary file that a programmer can load into a debugger.

Starting with Windows Vista, Dr. Watson has been replaced with "Problem Reports and Solutions"

2. Windows Error Report

While Dr. Watson leaves the memory dump on the user's local machine for debugging, Windows Error Reporting offers to send the memory dump to Microsoft over the internet. More info can be found at http://en.wikipedia.org/wiki/Windows_Error_Reporting

3. Adplus
ADPlus is a console-based Microsoft Visual Basic script. It automates the Microsoft CDB debugger to produce memory dumps and log files that contain debug output from one or more processes. It has many switches to control what data to collect; more info can be found in the Microsoft KB on the ADPlus tool.

[Reference]
- Wiki on Dr. Watson Debugger
- Description of Dr. Watson Tool
Part II - Structured Exception Handling

SEH is usually known as a convenient error-handling mechanism for Windows native-code programmers, provided by the Windows operating system itself (with compiler support). But it is also a great mechanism for enabling applications to talk to software debuggers.

1. Various Concepts

Structured exception handling is a mechanism for handling both hardware and software exceptions. To fully understand the SEH mechanism, you should get familiar with the following concepts:
- guarded body of code
- exception handler
- termination handler
- filter expression: follows the __except keyword and is evaluated when the system conducts exception processing
- filter function: can only be called in a filter expression

filter expression & function can only return the following three values:
-
EXCEPTION_CONTINUE_SEARCH, The system continues its search for an exception handler.
-
EXCEPTION_CONTINUE_EXECUTION, The system stops its search for an exception handler, restores the machine state, and continues thread execution at the point at which the exception occurred.
-
EXCEPTION_EXECUTE_HANDLER, The system transfers control to the exception handler, and thread execution continues sequentially in the stack frame in which the exception handler is found.
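A small sketch tying these concepts together (MSVC syntax; the Filter helper is a hypothetical filter function, and the guarded body deliberately divides by zero):

    #include <windows.h>
    #include <stdio.h>

    /* Filter function: may only be called from a filter expression. */
    static int Filter(DWORD code)
    {
        /* Handle only integer divide-by-zero; keep searching otherwise. */
        return (code == EXCEPTION_INT_DIVIDE_BY_ZERO)
                   ? EXCEPTION_EXECUTE_HANDLER
                   : EXCEPTION_CONTINUE_SEARCH;
    }

    int main()
    {
        int zero = 0;
        __try                                   /* guarded body of code */
        {
            int x = 1 / zero;                   /* raises a hardware exception */
            printf("%d\n", x);
        }
        __except (Filter(GetExceptionCode()))   /* filter expression */
        {
            printf("exception handler ran\n");  /* exception handler */
        }
        return 0;
    }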
2. Stack Unwinding
If the located exception handler is not in the stack frame in which the exception occurred, the system unwinds the stack, leaving the current stack frame and any other stack frames until it is back to the exception handler's stack frame.
3. Vectored Exception Handling
Vectored handlers are called in the order that they were added, after the debugger gets a first-chance notification, but before the system begins unwinding the stack. Since they are not frame-based, they are called every time an exception is raised.
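A hedged sketch of registering a vectored handler with AddVectoredExceptionHandler (the logging handler below is purely illustrative and simply passes every exception on):

    #include <windows.h>
    #include <stdio.h>

    /* Vectored handler: called for every exception, regardless of stack frame,
       before the system starts its frame-based search and unwinding. */
    static LONG CALLBACK LogEveryException(PEXCEPTION_POINTERS info)
    {
        printf("exception 0x%08lx at %p\n",
               info->ExceptionRecord->ExceptionCode,
               info->ExceptionRecord->ExceptionAddress);
        return EXCEPTION_CONTINUE_SEARCH;   /* keep searching for a frame handler */
    }

    int main()
    {
        /* 1 = insert at the head of the vectored handler list */
        PVOID cookie = AddVectoredExceptionHandler(1, LogEveryException);

        __try {
            RaiseException(0xE0000001, 0, 0, NULL);  /* hits the vectored handler first */
        }
        __except (EXCEPTION_EXECUTE_HANDLER) {
            printf("frame-based handler ran afterwards\n");
        }

        RemoveVectoredExceptionHandler(cookie);
        return 0;
    }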
4. Exception & Debugger

SEH is also a communication mechanism between windows application and debugger. The detailed description on the whole exception dispatching process can be found here and the debugger exception handling process can be found here.
The core concepts here are the first-chance notification and the second-chance (last-chance) notification.
- A first-chance notification notifies the debugger of the exception information before the application gets a chance to process the exception.
- A second-chance notification happens after the Windows system finds that no application-defined exception handler exists.

5. Functions and Keywords
- GetExceptionCode and GetExceptionInformation can be used to get detail information about current exception.
- An SEH-compatible compiler recognizes __try, __except, __finally, and __leave as keywords.
- It will also interpret the GetExceptionCode, GetExceptionInformation, and AbnormalTermination functions as keywords, and their use outside the appropriate exception-handling syntax generates a compiler error.
Part III - Dump File
In Part I, we introduced several tools for collecting diagnostic data to enable offline debugging and analysis. The most important diagnostic data is the dump file. There are two types of dump files:
1. Kernel-Mode Dump (system/core/memory dump)
This kind of dump happens when a stop error occurs in the Windows system. The common symptom is that the blue screen shows up and, at the same time, a core dump file is generated.
There are three kinds of core dump files:
- Complete Memory Dump
A Complete Memory Dump is the largest kernel-mode dump file. This file contains all the physical memory for the machine at the time of the fault.
The Complete Memory Dump file is written to %SystemRoot%\Memory.dmp by default.
- Small Memory Dump
A Small Memory Dump is much smaller than the other two kinds of kernel-mode crash dump files. It is exactly 64 KB in size, and requires only 64 KB of pagefile space on the boot drive.
- Kernel Memory Dump
A Kernel Memory Dump contains all the memory in use by the kernel at the time of the crash. This kind of dump file is significantly smaller than the Complete Memory Dump. Typically, the dump file will be around one-third the size of the physical memory on the system.
This dump file will not include unallocated memory, or any memory allocated to user-mode applications. It only includes memory allocated to the Windows kernel and hardware abstraction level (HAL), as well as memory allocated to kernel-mode drivers and other kernel-mode programs.
A great Microsoft KB article describes system dump (core dump, blue screen) configuration.

2. User-Mode Dump (application/process dump)
This kind of dump file is taken from a specific process, not from the Windows system itself.
- Full Dump
A full user-mode dump is the basic user-mode dump file. This dump file includes the entire memory space of a process, the program's executable image itself, the handle table, and other information that will be useful to the debugger.
- Mini Dump
A mini user-mode dump includes only selected parts of the memory associated with a process. The size and contents of a minidump file varies depending on the program being dumped and the application doing the dumping.
The name "minidump" is misleading, because the largest minidump files actually contain more information than the "full" user-mode dump.
User mode dump files can be created using the following methods:
- using DebugDiag (discuss blog )
- using adplus

- using userdump
- using Process Explorer
- using ProcDump
- using Task Manager
- using Visual Studio's "Save Dump As ..."

You can also use debugging tools such as Visual Studio and WinDbg to create dump files.

Manipulate Mini Dump Programmatically:
- MiniDumpReadDumpStream()
- MiniDumpWriteDump()
- MINIDUMP_TYPE ENUM
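A hedged sketch of writing a minidump of the current process with MiniDumpWriteDump (DbgHelp must be linked; the WriteProcessMiniDump name and the choice of MiniDumpNormal are just illustrative):

    #include <windows.h>
    #include <dbghelp.h>
    #pragma comment(lib, "dbghelp.lib")

    /* Write a minidump of the calling process to the given file path. */
    BOOL WriteProcessMiniDump(const wchar_t* path)
    {
        HANDLE file = CreateFileW(path, GENERIC_WRITE, 0, NULL,
                                  CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
        if (file == INVALID_HANDLE_VALUE)
            return FALSE;

        /* MiniDumpNormal is the smallest MINIDUMP_TYPE; richer types
           (e.g. MiniDumpWithFullMemory) produce larger, more detailed dumps. */
        BOOL ok = MiniDumpWriteDump(GetCurrentProcess(),
                                    GetCurrentProcessId(),
                                    file,
                                    MiniDumpNormal,
                                    NULL,   /* exception information (optional) */
                                    NULL,   /* user streams (optional)          */
                                    NULL);  /* callbacks (optional)             */
        CloseHandle(file);
        return ok;
    }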

For how to use dump files to analyze software defects:
- Effective MiniDump
- Post-Mortem Debugging using MiniDump
- Reading Minidump
[Reference]
- Dump in Visual Studio
- Crash Dump Doc @ MSDN

Reader/Writer Locking and Beyond

The Reader/Writer problem [2.0] is a synchronization problem that deals with how to improve a multithreaded program's overall performance. The core idea is improving concurrency - allowing as many threads as possible to run.

Reader/Writer lock is a mechanism to resolve this problem - shared access for readers, exclusive access for writers.

Part I - Considering Factors

The basic idea behind the Reader/Writer lock is simple, and implementation algorithms have been published here and there [2.1][2.2][2.3], but there are various aspects to consider:

1. Spin Lock or Blocking Lock

A Reader/Writer lock is built on some basic synchronization primitive (a lock or semaphore), and there are two types of locks (a minimal sketch of the first follows): [2.2]
- A busy-wait lock is a proactive locking mechanism: the thread waits while holding the CPU (spin locks and barriers are the most popular constructs of this type)
- A blocking lock is a yielding, scheduler-based blocking mechanism: the thread yields the CPU while waiting
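A minimal sketch of the busy-wait flavor on Win32, built on InterlockedExchange (the SpinLock name is just for illustration); a blocking lock would instead use something like a CRITICAL_SECTION or an event so the waiting thread yields the CPU:

    #include <windows.h>

    class SpinLock {
        volatile LONG locked;                       // 0 = free, 1 = held
    public:
        SpinLock() : locked(0) {}
        void Acquire() {
            while (InterlockedExchange(&locked, 1) == 1)
                YieldProcessor();                   // spin while holding the CPU
        }
        void Release() { InterlockedExchange(&locked, 0); }
    };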

2. Re-Entrance/Recursion

Can a thread that already holds one kind of lock acquire another kind of lock? If so, the implementation is said to support recursion.

Roughly speaking, a typical reader/writer lock that supports recursion behaves as follows:
- a thread holding a read lock can't request a write lock
- a thread holding a read lock can be granted another read lock without blocking
- a thread holding a write lock can be granted a read lock without blocking
- a thread holding a write lock can be granted another write lock without blocking

3. Time-Out

A lock-requesting thread may want to set a time limit on how long to wait. Usually, APIs with time-out support are named TryXXX ...

4. Fairness/Preference

When multiple threads are waiting for a lock, which thread to wake up and grant access to is critical to fairness.

A reader-preferred policy grants access to a requesting reader whenever the resource is currently being accessed by a reader.

A writer-preferred policy grants access to the longest-waiting writer if there is one; a requesting reader is blocked if any writer is waiting in the queue.

A fair policy usually ensures that:
1. a reader that arrives while some readers already hold the read lock is blocked if a writer is already waiting, i.e., it avoids writer starvation
2. if a reader arrives before a writer, it is granted access before that writer, i.e., it avoids reader starvation

But which waiting threads to admit when the read lock is granted is another consideration:
- Consecutive: if there is a group of reader threads that have waited longer than all waiting writer threads, that group is assigned the read lock.
- All-in-one: when readers are allowed access, all waiting readers are admitted.

5. Upgradable Read(Update) Lock

As we said, a recursive r/w lock doesn't allow a reader (shared) lock holder to request a writer (exclusive) lock (a.k.a. lock upgrade), but sometimes this is a strongly desired feature.

For instance, in a database implementation, an UPDATE statement may first acquire reader locks on all table pages; when it finds that some row needs to be modified, a writer lock is requested on the related page.

A naive implementation of lock upgrade may cause deadlocks (p. 556, Chapter 17, Database Management Systems 3E), so the idea of lock downgrade came about: acquire the writer (exclusive) lock first, then downgrade to a reader (shared) lock once it is clear that no modification is needed. Although this idea avoids many deadlocks, it limits concurrency - it requires acquiring the exclusive lock up front.

So people invented the upgradable lock (a.k.a. update lock). It is compatible with reader (shared) locks, but not with other update (upgradable) locks, nor with writer (exclusive) locks. If a thread is granted an update lock, it can read the resource and also has the right to upgrade its lock to a write lock (which may block). An update lock holder can also downgrade its lock to a read lock. More explanation of the upgradable lock can be found at [3.1.1].

Essentially, an upgradeable lock is intended for cases where a thread usually reads from the protected resource but might need to write to it if some condition is met - exactly the semantics of the UPDATE statement in a DBMS (which is why it is also named an update lock). SQL Server and DB2 support update locks, while Oracle doesn't. [5.1][5.2][5.3][5.4]

Part II - A C++ Implementation

In implementing the Reader/Writer lock, we made the following decisions:
1. A spin lock is useful when the lock-holding time is relatively short; we adopt a block-waiting lock since how these locks will be used is not known in advance.
2. It supports recursion.
3. It supports the time-out feature.
4. For better flexibility, we implemented three RWLocks: the reader-preferred ReaderWriterLock, the writer-preferred WriterReaderLock, and the fair FairSharedLock.
5. Currently, we don't support the update lock.

The basic algorithms for ReaderWriterLock and WriterReaderLock are those given in paper [2.1]; recursion is supported by introducing a current-writer field, and time-out is supported by using Win32 wait functions.

Reader Preferred ReaderWriterLock Algorithm, from [2.1]

Writer Preferred WriterReaderLock Algorithm, from [2.1]

The fair shared lock is implemented using a lock word, which stores the lock state and is manipulated with Interlocked primitives, and a waiting-thread queue, which records waiting reader/writer threads so that fairness can be judged. Some ideas are borrowed from Solaris [3.8.1] and Jeffrey's OneManyResourceLock [3.1.2].
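The real implementation is more involved; purely as an illustration of the lock-word idea (this is not the Locking.h/Locking.cxx code referenced below), a minimal reader/writer lock can keep its entire state in one LONG manipulated with Interlocked primitives. This sketch spins rather than blocks and supports no recursion, time-out, or fairness:

    #include <windows.h>

    class SpinRWLock {
        volatile LONG state;            // 0 = free, >0 = reader count, -1 = writer
    public:
        SpinRWLock() : state(0) {}

        void AcquireRead() {
            for (;;) {
                LONG s = state;
                if (s >= 0 &&
                    InterlockedCompareExchange(&state, s + 1, s) == s)
                    return;             // became one more reader
                YieldProcessor();       // spin hint
            }
        }
        void ReleaseRead()  { InterlockedDecrement(&state); }

        void AcquireWrite() {
            for (;;) {
                if (InterlockedCompareExchange(&state, -1, 0) == 0)
                    return;             // no readers, no writer: we own it
                YieldProcessor();
            }
        }
        void ReleaseWrite() { InterlockedExchange(&state, 0); }
    };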

The source code can be found at (Locking.h & Locking.cxx).

Part III - Some Multithreading Better Practices

1. Avoid letting threads transition from user mode to kernel mode
2. Use as few threads as possible (ideally, equal to the CPU/core count)
3. Use multiple threads for tasks that require different resources
4. Don't assign multiple threads to a single resource
[Reference]

Synchronization General Concepts

1.1 Synchronization
1.2 Lock
1.3 Mutex(Mutual Exclusion)
1.4 Semaphore
1.5 Monitor
1.6 Event
1.7 Barrier
1.8 Lock Convoy

General Reader/Writer Lock
2.0 The First, Second and Third Reader/Writer Problems

2.1 Paper: Concurrent Control with "Readers" and "Writers"
- It firstly introduced the reader/writer problem and gave 2 algorithms (R preferred/W preferred)
2.2 Paper: Algorithms for scalable synchronization on shared-memory multiprocessors
- It summarized busy-wait sync mechanism and introduced algorithm for Spin Lock based only on local memory spinning
2.3 Paper: Scalable RW Synchronization for SMP (Its Pseudocode & PPT)
- It introduced:
-- a. R-Preferred/W-Preferred/Fair RWLock using general Spin Lock on SMP machine
-- b. Improve a. by using algorithm that needs local spinning only Spin Lock
2.4 Paper: A Fair Fast Scalable Reader-Writer Lock
- Some Improvement on 2.2
2.5 Paper: Scalable Reader-Writer Locks
- Latest and some Summary on previous Works

2.6 Notes on Implementing a Reader/Writer Mutex
2.7. Scalable Read/Write Lock on SMP
2.8 Test Reader/Writer Lock Performance

Reader/Writer Lock Implementations on Various Platforms

- 1. .Net
3.1.1 Reader/Writer Lock in .Net (the Slim Version)
3.1.2 Jeffrey Richter on Reader/Writer Lock implementation
3.1.3 Joe Duffy on Reader/Writer Lock
3.1.4 Implementing Spin-Lock based Reader/Writer Lock and Its Analyzing

- 2. Java
3.2.1 Various Aspects on Java Reader/Writer Lock
3.2.2 Implementing Java Reader/Writer Lock
3.2.3 Java Doc on Java SE Reader/Writer Lock
3.2.4 Java Threading Book on Synchronization

- 3. Boost
3.3.1 DDJ on Boost Thread Library
3.3.2 Boost Thread Library Official Doc
3.3.3 Multithreading for C++0x

- 4. Apache Portable Runtime
3.4.1 APR RWLock Doc

- 5. PThread
3.5.1 PThread RWLock by IBM
3.5.2 PThread RWLock by Sco
3.5.3 PThread Doc

- 6. Intel TBB
3.6.1 Intel TBB Reader/Writer Lock

- 7. Win32
3.7.1 Synchronization Primitives New To Windows Vista
3.7.2 Slim Reader/Writer Lock
3.7.3 Win32 Native Concurrency (Part 1, Part 2, Part 3)

- 8. Solaris
3.8.1 Reader/Writer Lock source code in Open Solaris

- 9. Linux
3.9.1 simple doc on Linux kernel locking
3.9.2 Linux reader/writer lock implementation: header & source

- 10. Misc
3.10.1 Reader/Writer Lock Cxx Source Code (Header, Source)
3.10.2 An FIFO RWLock Source Code

Spin Lock

4.1 http://en.wikipedia.org/wiki/Spinlock
4.2 Spin Lock Implementation and Performance Comparison
4.3 Spin Wait Lock implementation by Jeffrey Richter
4.4 User Level Spin Lock
4.5 Introduction on Spin Lock
4.6 Paper on FIFO Queued Spin Lock
4.7 InterlockedExchange atomic fetch_and_store on Windows
4.8 InterlockedCompareExchange atomic compare_and_swap on Windows
4.9 Ticket (Spin) Lock (Wikipedia, in Linux Kernel, in Paper)
4.10 FIFO Ticket Spinlock in Linux Kernel
4.11 MCS Spin Lock Design & Implementation

Locking in Database

5.1 Understanding Locking in Sql Server
5.2 Oracle Locking Survival Guide
5.3 DB2 Locking Mechanism
5.4 Sql Server Locking Types

Misc

6.1 MSDN Magazine Concurrent Affairs Column
6.2 Many Insights on Thread Synchronization and Scalable Application
6.3 ReaderWriterGate Lock by Jeffrey (R/W lock + Thread Pooling)
6.4 A Richer Mutex & Lock Convoy by Jeffrey

12/02/2009

BLAS & LAPACK - Math Kernel for Scientists

1. The Standard Interface

BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra PACKage) are standards for linear algebra routines.
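As a small sketch of what the standard interface looks like from C/C++ (assuming the CBLAS header that most implementations ship), a matrix-matrix multiply C = alpha*A*B + beta*C goes through dgemm:

    #include <cblas.h>
    #include <stdio.h>

    int main(void)
    {
        /* 2x2 matrices in row-major order */
        double A[] = { 1.0, 2.0,
                       3.0, 4.0 };
        double B[] = { 5.0, 6.0,
                       7.0, 8.0 };
        double C[] = { 0.0, 0.0,
                       0.0, 0.0 };

        /* C = 1.0 * A * B + 0.0 * C */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    2, 2, 2,          /* M, N, K       */
                    1.0, A, 2,        /* alpha, A, lda */
                    B, 2,             /* B, ldb        */
                    0.0, C, 2);       /* beta, C, ldc  */

        printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]);
        return 0;
    }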

2. The Various Implementations

The reference BLAS [2] is the reference implementation of the BLAS standard. It is usually slower than machine-optimised versions, but can be used if no optimised libraries are accessible.

It is available from http://www.netlib.org/blas/.

The reference LAPACK [1] is the reference implementation of the LAPACK standard. Its performance is heavily dependent on the underlying BLAS implementation.

It is available from http://www.netlib.org/lapack/.

The Intel MKL (Math Kernel Library) implements (among other functionality, such as FFT) the BLAS and LAPACK functionality. It is optimised for Intel CPUs.

It's available from http://software.intel.com/en-us/intel-mkl/

The AMD ACML (AMD Core Math Library) is AMD's optimised version of BLAS and LAPACK, and also offers some other functionality (e.g. FFT).

It's available from http://developer.amd.com/acml.jsp

The Goto BLAS [3][4] is a very fast BLAS library, probably the fastest on the x86 architecture.

It is available from http://www.tacc.utexas.edu/software_modules.php.

Its main contributor is Kazushige Gotō, who is famous for creating hand-optimized assembly routines for supercomputing and PC platforms that outperform the best compiler-generated code. Some news reports about him: "Writing the Fastest Code, by Hand, for Fun: A Human Computer Keeps Speeding Up Chips", "The Human Code".

The ATLAS (Automatically Tuned Linear Algebra Software) [5] contains the BLAS and a subset of LAPACK. It automatically optimises the code for the machine on which it is compiled.

It is available from http://math-atlas.sourceforge.net/.

3. Extensions to Cluster System (Distributed Memory Parallel Computer)

BLACS [6] are the Basic Linear Algebra Communication Subprograms. They are used as the communication layer by ScaLAPACK. BLACS itself makes use of PVM (Parallel Virtual Machine) or MPI.

BLACS is available from http://www.netlib.org/blacs/

ScaLAPACK [7] is a library for linear algebra on distributed memory architectures. It implements routines from the BLAS and LAPACK standards. ScaLAPACK makes it possible to distribute matrices over the whole memory of a distributed memory machine, and use routines similar to the standard BLAS and LAPACK routines on them.

ScaLAPACK is available from http://www.netlib.org/scalapack/

PLAPACK [8] is also a library for linear algebra on distributed memory architectures. Unlike ScaLAPACK, it attempts to show that by adopting an object-based coding style, already popularized by the Message-Passing Interface (MPI), the coding of parallel linear algebra algorithms is simplified compared to the more traditional sequential coding approaches.

PLAPACK is available from http://www.cs.utexas.edu/~plapack/

[Reference]

[1] LAPACK User's Guide
[2] Basic Linear Algebra Subprograms for FORTRAN usage

[3] Anatomy of high-performance matrix multiplication
[4] High-performance implementation of the level-3 BLAS

[5] Automated Empirical Optimization of Software and the ATLAS project

[6] A user’s guide to the BLACS
[7] ScaLAPACK: A scalable Linear Algebra Library for Distributed Memory Concurrent Computers
[8] PLAPACK: Parallel Linear Algebra Libraries Design Overview
[9] PLAPACK vs ScaLAPACK