11/25/2009

Extend C++ Standard IOStream Library

Why extend the C++ standard IOStream library?
1. To support new data types.
2. To support new devices.

The first kind of extension is easy - just overload the following two global operators:
    istream& operator>>(istream&, Complex&);
    ostream& operator<<(ostream&, const Complex&);
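
For instance, here is a minimal sketch for a hypothetical Complex type (the re/im members and the "(re, im)" output format are assumptions for illustration):

    #include <iostream>

    // A hypothetical Complex type, just for illustration.
    struct Complex { double re, im; };

    // Extraction: read the real and imaginary parts, e.g. "1.5 -2".
    std::istream& operator>>(std::istream& is, Complex& c) {
        return is >> c.re >> c.im;
    }

    // Insertion: print the value in "(re, im)" form.
    std::ostream& operator<<(std::ostream& os, const Complex& c) {
        return os << '(' << c.re << ", " << c.im << ')';
    }

    int main() {
        Complex c{1.5, -2.0};
        std::cout << c << '\n';   // prints (1.5, -2)
    }
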
But the second kind is not so easy.

The stream library actually performs two unrelated tasks: formatting and buffering.

Formatting is the act of translating between binary data and their character representations. It is done by the class ios, the base class for both istream and ostream. The ios class keeps a format state that governs formatting. The format state specifies
  • the field width
  • the fill character
  • field alignment
  • integer base (decimal, hex or octal)
  • floating point format (fixed, scientific or general)
  • whether to show a + sign, trailing decimal point and zeros thereafter, or the integer base
  • whether to use upper- or lowercase letters for E, 0X, and the hex digits A ... F
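
For illustration, here is a small, self-contained demonstration of these format-state settings using the standard manipulators from <iomanip> (the values are arbitrary):

    #include <iomanip>
    #include <iostream>

    int main() {
        int n = 255;
        // field width, fill character and alignment
        std::cout << std::setw(8) << std::setfill('*') << std::left << n << '\n';   // 255*****
        // floating point format: fixed vs. scientific, with precision
        double d = 1234.5678;
        std::cout << std::fixed << std::setprecision(2) << d << '\n';               // 1234.57
        std::cout << std::scientific << d << '\n';                                  // 1.23e+03
        // integer base with its prefix, using uppercase letters
        std::cout << std::showbase << std::uppercase << std::hex << n << '\n';      // 0XFF
    }
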
The ios class concerns itself with formatting, the conversion between binary data and their character representations. Transporting these characters to and from devices is the responsibility of the streambuf class and its descendants.

How to extend the C++ standard stream library?

Here is a very good example: http://www.horstmann.com/cpp/iostreams.html
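
To give a flavor of the approach (my own minimal sketch, not the code from that page): supporting a new device usually means deriving from streambuf and overriding overflow() for output (and underflow() for input). The device_put_char() function below is a hypothetical stand-in for the real device API.

    #include <cstdio>
    #include <iostream>
    #include <streambuf>

    // Hypothetical device API, standing in for whatever the real device offers.
    void device_put_char(char c) { std::putchar(c); }

    // An unbuffered output streambuf: every character goes straight to the device.
    class DeviceBuf : public std::streambuf {
    protected:
        // Called for each character, because we set up no internal buffer.
        int_type overflow(int_type ch) override {
            if (ch != traits_type::eof())
                device_put_char(traits_type::to_char_type(ch));
            return traits_type::not_eof(ch);  // any value != eof() signals success
        }
    };

    int main() {
        DeviceBuf buf;
        std::ostream out(&buf);   // an ostream that writes to the new device
        out << "hello, device " << 42 << '\n';
    }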


11/10/2009

Map/Reduce - in Functional Programming & Parallel Processing Perspectives

Map/Reduce is a very popular pair of terms in today's technical community, mainly due to the popularity of its "inventor", Google.

But in fact, the terms and concepts of map & reduce existed in the programming language community long before Google's successful paper "MapReduce: Simplified Data Processing on Large Clusters", which appeared at OSDI'04.

In this article, I want to summarize what this pair of terms means in the functional programming literature and the parallel processing literature respectively.

I - Map/Reduce in Functional Programming Perspective

  Functional programming has a long history in academia, but has not been widely adopted in developer communities yet. It has some beautiful features compared with the imperative languages we use daily. Higher-order functions are one of them: basically, a function can be used as an input parameter or as the return value of another function.

  Among the various higher-order functions, map, fold and filter are the most popular ones:

- Map is a higher-order function that applies a given function (a.k.a. the transformer) element-wise to a list of elements and returns the list of results. The transformer is a function that is applied to each element and produces a new element.

For example, map toLower "abcDEFG12!@#" produces the output "abcdefg12!@#".

- Fold (a.k.a. reduce, accumulate) is a higher-order function that processes a list of elements in some order (using a combiner function) and builds up a return value. The combiner is a function that takes an accumulated value and one element, and produces a new accumulated value that can in turn be combined with the remaining elements of the list.

For example, foldl (+) 0 [1..5] produces the output 15, which is the sum of all the elements.

- Filter is a higher-order function that processes a list of elements in some order and produces a result containing exactly those original elements for which a given predicate returns true. The predicate is a function that takes one element as its input parameter and returns either true or false.

For example, filter isAlpha "$#!+abcDEF657" produces the output "abcDEF".

  Essentially, these three higher-order functions apply an operation to a list/array and produce a result: map transforms each element, filter keeps only some of the elements, and reduce combines all the elements.
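
For readers more at home in C++, the same three operations exist as standard algorithms: std::transform, std::copy_if and std::accumulate roughly correspond to map, filter and fold. Here is a sketch mirroring the examples above:

    #include <algorithm>
    #include <cctype>
    #include <iostream>
    #include <numeric>
    #include <string>
    #include <vector>

    int main() {
        // map: apply a transformer to each element
        std::string s = "abcDEFG12!@#";
        std::transform(s.begin(), s.end(), s.begin(),
                       [](unsigned char c) { return (char)std::tolower(c); });
        std::cout << s << '\n';                                       // abcdefg12!@#

        // filter: keep only the elements satisfying a predicate
        std::string t = "$#!+abcDEF657", letters;
        std::copy_if(t.begin(), t.end(), std::back_inserter(letters),
                     [](unsigned char c) { return std::isalpha(c) != 0; });
        std::cout << letters << '\n';                                 // abcDEF

        // fold/reduce: combine all elements with (+), starting from 0
        std::vector<int> v{1, 2, 3, 4, 5};
        std::cout << std::accumulate(v.begin(), v.end(), 0) << '\n';  // 15
    }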

  Pure functional languages, such as Haskell and Lisp, and some multi-paradigm languages, such as Python, have built-in functions named exactly map and reduce. C# 3.0 introduces some functional features in its LINQ subsystem, where map is called Select and reduce is called Aggregate.

  More concrete examples can be found in [2].

II - Map/Reduce in Parallel Processing Perspective

  Map/Reduce is a programming model and also an implementation runtime. The programming model is what you use to express your computation tasks, while the implementation runtime is the set of software components that realize what the model promises.

  The model is called map/reduce, but the meanings of the two terms differ somewhat from their functional counterparts:
  - the map semantics are the same as in functional programming languages: the transformer (the mapper in Google's paper) is applied to each element of the list
  - the reduce semantics differ: here, the combiner (the reducer in Google's paper) is applied to multiple subsections of the list's elements and thus produces multiple reduce results, whereas in functional programming languages it is applied to all the elements and produces only one result

  Conceptually, how are the elements divided into multiple subsections?
  To resolve this problem, the model introduces some structure on the elements that are produced by the mapper and consumed by the reducer: each element/record has two parts, a key and a value. All the elements are then divided according to their keys; the records with the same key form a subsection and are passed to a reducer invocation as a whole.
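
As a toy, single-process sketch of these semantics, consider word counting (the canonical example from Google's paper); the function names and types here are assumptions made up for illustration:

    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>
    #include <utility>
    #include <vector>

    using KV = std::pair<std::string, int>;

    // Mapper: applied to one input record, emits (key, value) pairs.
    std::vector<KV> word_count_mapper(const std::string& line) {
        std::vector<KV> out;
        std::istringstream in(line);
        std::string word;
        while (in >> word) out.push_back({word, 1});
        return out;
    }

    // Reducer: applied to one subsection, i.e. all values sharing a single key.
    int word_count_reducer(const std::string& /*key*/, const std::vector<int>& values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    int main() {
        std::vector<std::string> records = {"a rose is", "a rose"};
        // Grouping (shuffle) step: partition all emitted records by key.
        std::map<std::string, std::vector<int>> groups;
        for (const auto& record : records)
            for (const auto& kv : word_count_mapper(record))
                groups[kv.first].push_back(kv.second);
        for (const auto& g : groups)
            std::cout << g.first << ": "
                      << word_count_reducer(g.first, g.second) << '\n';
        // prints one line per key: a: 2, is: 1, rose: 2
    }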

  From the implementation's perspective, the most important advantage of this programming model is that it enables automatic parallelization and distribution of large-scale data processing:
  - the mapper is applied to each record independently, so it is a data-parallel problem by itself; we just need to distribute the input data, split on record boundaries, among the processing nodes.
  - the reducer is applied to one subsection at a time, so we just need to distribute the subsections among the processing nodes.

  Another implementation problem is failover - what to do when a failure happens?
  Simple! Just re-execute the specific failed mapper/reducer; the other mappers/reducers won't be bothered at all. Because there is no communication among the mappers, nor among the reducers, this solution is semantically correct.
  Since the input of a mapper is persisted in a reliable storage system, a failed mapper only needs to be re-executed by itself. But the input of a reducer (which is also the output of the mappers) is persisted on the workers' local storage, so a re-executed reducer may find some of its input unavailable (for example, because a node holding intermediate data crashed). In this situation, recovering from a failed reducer requires re-executing some mappers as well as the reducer itself.

[Reference]
1. Functional Programming
2. Higher-Order Functions: Map, Fold and Filter
3. Map/Reduce/Filter in Python
4. Map/Reduce in PHP
5. Google's MapReduce Programming Model — Revisited
6. MapReduce: Simplified Data Processing on Large Clusters
7. Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters