Programming Windows Hpc Server - Using MPI Model

Conventionally, HPC/Parallel problems can be roughly divided into the following two categories[ref]:

- Data Parallel: these applications divide the input data into a number of completely independent parts. The same computation is performed on each part, and some kind of post-processing is needed after the computations.

- Task Parallel: these are jobs whose functionality can be divided into many small tasks, each of which can be executed on one CPU core. These tasks may or may not need to communicate with each other.
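
As a rough sketch of the data parallel case, the function below splits n input items into contiguous, nearly equal ranges, one per worker. The names block_range and Range are illustrative only and are not part of MPI or the Hpc Pack; a worker would typically pass its own rank.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Half-open index range [begin, end) owned by one worker.
using Range = std::pair<std::size_t, std::size_t>;

// Split n items across `workers` workers as evenly as possible.
// The first (n % workers) workers receive one extra item.
Range block_range(std::size_t n, std::size_t workers, std::size_t rank) {
    assert(workers > 0 && rank < workers);
    const std::size_t base = n / workers;
    const std::size_t extra = n % workers;
    const std::size_t begin = rank * base + (rank < extra ? rank : extra);
    const std::size_t size = base + (rank < extra ? 1 : 0);
    return {begin, begin + size};
}
```

With 10 items and 4 workers this gives the ranges [0,3), [3,6), [6,8) and [8,10), so every item is processed exactly once.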

Using another taxonomy (orthogonal to data/task parallel), parallel problems can be divided into:

- Embarrassingly Parallel: for these applications, little or no effort is required to separate the problem into a number of small tasks, each of which runs on one CPU core. Little or no post-processing is needed, and the tasks barely need to cooperate.

- Dependent Parallel: these are problems in which there are dependencies among the tasks, so the tasks must communicate. Communication can be accomplished by sharing variables (on shared memory architectures) or by passing (send/receive) messages (typically on distributed memory architectures).

The de facto standard interface for the message passing model is MPI, which is the focus of this article.

Windows Hpc Server Network Topology (from Microsoft)

MPI is just an API standard, and there are various implementations (see the reference section). The programming examples in this article use MS-MPI on Windows Hpc Server.

Part I - Environment for MPI Programming on Windows Hpc Server

To begin with, you should have the following environment:
- The Windows Hpc Cluster (1 Head Node, N Compute Node)
- The Hpc App Dev Machine

The Hpc App Dev Machine should have the following software installed:
- Visual Studio 2005/2008
- Hpc Pack 2008 Client Utilities
- Hpc Pack 2008 SDK

Then you should configure your VS environment:
1. Set the MPI include dir: VS->Tools->Options->Projects and Solutions->VC++ Directories. For each platform (Win32/X64), choose "Include Files" in the "Show directories for" dropdown list and add "$(hpc pack 2008 sdk)\include" ($(hpc pack 2008 sdk) is where your Hpc Pack 2008 SDK is installed).
2. Set the MPI library dir: VS->Tools->Options->Projects and Solutions->VC++ Directories. For each platform (Win32/X64), choose "Library Files" in the "Show directories for" dropdown list and add "$(hpc pack 2008 sdk)\Lib\i386" (Win32) or "$(hpc pack 2008 sdk)\Lib\amd64" (X64).

Part II - Programming Using MPI APIs in Visual Studio

For those who are new to MPI, here is a basic intro (for more info, see [4][5][6]):
1. An MPI application consists of many processes, each of which is called a Task in MPI. Every task has a unique identifier in the range 0 ... N - 1, which is called its Rank in MPI vocabulary. Ranks are used to identify the source and target of a message.
2. MPI is message-oriented, not connection/stream-oriented.
3. MPI uses a Tag to identify the message type.

Now let's write an MPI application that reports where it is running:
1. Create a new empty Win32 console application in VS.
2. Add a new C++ source file named MpiHello.cxx with the following content:
MPI Hello Source Code
#include <mpi.h>
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>     // strlen

// gethostname() comes from Winsock, so link against Ws2_32.lib.
// MPI_Get_processor_name() is the portable MPI alternative.
int main(int argc, char** argv)
{
        int nProc;
        int nThisRank;
        char host[MAX_PATH];
        char msg[1024];

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &nProc);
        MPI_Comm_rank(MPI_COMM_WORLD, &nThisRank);

        gethostname(host, sizeof(host) / sizeof(host[0]));

        if (nThisRank == 0)
        {
                // Rank 0 acts as the master: print one line per worker.
                printf("Master Process is running on host[%s].\n", host);

                char rcvMsg[1024];
                MPI_Status status;
                for (int i = 1; i < nProc; ++i)
                {
                        // Accept the messages in whatever order they arrive.
                        MPI_Recv(rcvMsg,
                                sizeof(rcvMsg),
                                MPI_CHAR,
                                MPI_ANY_SOURCE,
                                MPI_ANY_TAG,
                                MPI_COMM_WORLD,
                                &status);
                        printf("%s\n", rcvMsg);
                }
        }
        else
        {
                // Every other rank reports its host to rank 0 with tag 0.
                sprintf_s(msg,
                        sizeof(msg),
                        "Worker Process [%d] of [%d] is running on host [%s].",
                        nThisRank,
                        nProc,
                        host);

                MPI_Send(msg,
                        (int)strlen(msg) + 1,
                        MPI_CHAR,
                        0,
                        0,
                        MPI_COMM_WORLD);
        }

        MPI_Finalize();

        return 0;
}

3. Some explanation:
- To use the MPI APIs, include the mpi.h header file.
- Each MPI task should start with MPI_Init() and end with MPI_Finalize().
- MPI_Comm_size() returns the number of tasks in this application.
- MPI_Comm_rank() returns this task's rank.

Now build your application and use the MPI launcher mpiexec to run it:
mpiexec -n 4 MpiHello.exe

The console output will look like this (the worker lines may arrive in any order):
Master Process is running on host[hpc-01].
Worker Process [2] of [4] is running on host [hpc-01].
Worker Process [3] of [4] is running on host [hpc-01].
Worker Process [1] of [4] is running on host [hpc-01].

Part III - Deploy and Run MPI Application on Windows Hpc Cluster

1. Deploy
- Copy MpiHello.exe to \\your_head_node\App\
(You may instead copy the application binary, and any extra input data files, to each compute node's local disk. Deploying is almost entirely a matter of copying code and data files.)

2. Submit Windows Hpc Job
- In a Cmd.exe shell, change your dir to $(Hpc Pack 2008)\Bin.
- Run the following command:
Job.exe submit /scheduler:your_head_node /jobname:MpiHello /numprocessors:6-6 /workdir:\\your_head_node\users\your_name /stdout:_OUT.txt /user:your_domain\your_name mpiexec.exe \\your_head_node\app\MpiHello.exe

On return, a job ID is displayed in the console window.

You can now use the Job Management component in Hpc Cluster Manager to monitor the progress of your Hpc application.

When the job has finished successfully, go to \\your_head_node\users\your_name and check the _OUT.txt file. Its content will be very similar to:

Master Process is running on host[hpc-01].
Worker Process [1] of [6] is running on host [hpc-01].
Worker Process [3] of [6] is running on host [hpc-02].
Worker Process [4] of [6] is running on host [hpc-03].
Worker Process [2] of [6] is running on host [hpc-02].
Worker Process [5] of [6] is running on host [hpc-03].


About MPI

1. Message Passing Interface on wikipedia.
2. The MPI standard
4. Tutorial on MPI by William Gropp.
5. MPI Tutorial at ANL
6. Great MPI tutorial by LLNL
7. C++ MPI Exercises by John Burkardt.
8. Book online: MPI The Complete Reference.

About MS-MPI

10. Windows HPC Server 2008 - Using MS-MPI whitepaper .
11. Using Microsoft MPI (@TechNet).
12. MPI.NET Home Page

About Win Hpc Programming

20. MS Hpc Dev Center
21. Classic Hpc Programming Using Visual C++(Doc, Code)
22. Hpc Developing Using .Net: MPI.NET


Programming Windows Hpc Server - Using SOA Model

Conventionally, HPC/Parallel problems can be roughly divided into the following two categories[1][2]:

- Data Parallel: these applications divide the input data into a number of completely independent parts. The same computation is performed on each part, and some kind of post-processing is needed after the computations.

- Task Parallel: these are jobs whose functionality can be divided into many small tasks, each of which can be executed on one CPU core. These tasks may or may not need to communicate with each other.

There is another special kind of parallel problem (orthogonal to data/task parallel):
- Embarrassingly Parallel: for these applications, little or no effort is required to separate the problem into a number of small tasks, each of which runs on one CPU core. Little or no post-processing is needed, and the tasks barely need to cooperate.

Windows Hpc SOA Programming Overview (from Microsoft)

On Windows Hpc Server 2008, the programming model for Embarrassingly Parallel (especially web based) applications is referred to as the SOA Model. In this model, the client sends requests to a service broker, and the service broker forwards these requests to service instances. Service instances never talk to each other, and all communication among these components is service oriented (more specifically, WCF based).

[SOA Programming Model Workflow, from Microsoft]

[3] is detailed documentation on this topic. In this article, I will demo an SOA based PI (3.1415926535...) calculation application using the Monte Carlo method[6] on a real Windows Hpc Cluster.

When I say "real Windows Hpc cluster", I mean:
1. The cluster has multiple nodes (6 nodes: 1 head, 1 broker, 4 compute).
2. The cluster has a dedicated AD/network, which is completely separate from the client/dev machines and their network environment.
3. You use a server called the "Boundary Server" to access the Hpc Cluster. The boundary server has NICs connected to both the cluster private network and the corp/enterprise network.
4. You develop/debug your application on the boundary server, not on the cluster head node (let the head node focus on serving job requests).
5. Your corp network domain account is different from your Hpc Cluster private network domain account.
6. In the following sections, I assume the cluster environment is already correctly set up.
[These environment assumptions are more complex than those in [3], but they are closer to a real production environment.]

The Monte Carlo PI calculation has two parts: the server part and the client part.

The server part is a pure WCF service
- you define the interface/contract and implement it. The core logic is listed below:
1. It is a .NET/C# class library project and a typical WCF service application.
2. It hashes the local machine IP and the current date/time into a random seed, so that different service instances use different random sequences.
3. It uses the .NET built-in random generator to produce a series of random numbers for many independent Monte Carlo experiments. The idea is to generate a random point within a fixed range and check whether it falls inside a circular area of fixed radius.
PiCalcServer Core Logic
public PiCalcResult Calc(UInt64 scale)
{
    // use system time and machine ip to hash out the seed for random number generation
    long ticks = DateTime.Now.Ticks;
    IPHostEntry host = Dns.GetHostEntry(Dns.GetHostName());
    ticks += host.AddressList[0].GetAddressBytes()[0];
    ticks += host.AddressList[0].GetAddressBytes()[1] * 256;
    ticks += host.AddressList[0].GetAddressBytes()[2] * 256 * 256;
    ticks += host.AddressList[0].GetAddressBytes()[3] * 256 * 256 * 256;
    Random rand = new Random((int)ticks);

    // result init
    PiCalcResult calcResult = new PiCalcResult();
    calcResult.InCount = 0;
    calcResult.OutCount = 0;

    // do Monte Carlo exercise
    Int32 x = 0, y = 0;
    for (UInt64 i = 0; i < scale; ++i)
    {
        // NOTE: the lines that draw the random point are missing from the
        // original listing; the two calls below are an assumed reconstruction.
        x = rand.Next(0, RAND_RANGE_MAX);
        y = rand.Next(0, RAND_RANGE_MAX);

        // distance of the point from the origin
        UInt64 d = (UInt64)Math.Round(Math.Sqrt((double)x * (double)x + (double)y * (double)y));
        if (d <= RAND_RANGE_MAX)
        {
            calcResult.InCount++;
        }
        else
        {
            calcResult.OutCount++;
        }
    }

    return calcResult;
}
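
For readers who want to try the idea outside of WCF, here is a minimal C++ sketch of the same experiment. It is not the service code from this article: it draws points in the unit square with std::mt19937_64 instead of ranged integers, and the names calc, estimate_pi and the explicit seed parameter are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

struct PiCalcResult {
    std::uint64_t in_count = 0;   // points inside the quarter circle
    std::uint64_t out_count = 0;  // points outside of it
};

// Run `scale` independent trials: draw a point uniformly in the unit square
// and count whether it falls inside the quarter circle of radius 1.
PiCalcResult calc(std::uint64_t scale, std::uint64_t seed) {
    std::mt19937_64 rng(seed);
    std::uniform_real_distribution<double> unit(0.0, 1.0);
    PiCalcResult result;
    for (std::uint64_t i = 0; i < scale; ++i) {
        const double x = unit(rng);
        const double y = unit(rng);
        if (x * x + y * y <= 1.0) {
            ++result.in_count;
        } else {
            ++result.out_count;
        }
    }
    return result;
}

// The quarter circle covers pi/4 of the square, so pi is about 4 * in / total.
double estimate_pi(const PiCalcResult& r) {
    const std::uint64_t total = r.in_count + r.out_count;
    assert(total > 0);
    return 4.0 * static_cast<double>(r.in_count) / static_cast<double>(total);
}
```

The estimate converges slowly: the error shrinks roughly with the square root of the number of trials, which is why the work is worth spreading over many cores.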

After the WCF service implementation, you should deploy it to the Windows Hpc Cluster. This includes two steps:
1. Compose a service configuration file.(The PiCalcService.config file is contained in the source code package, see [3] for detailed fields explanation)
<service assembly="%CCP_HOME%App\PiCalcServer.dll"
<add name="PATH" value="%MY_SERVICES_HOME%Bin"/>

2. Copy Bin/Conf files to each compute node
clusrun xcopy /y \\FileServer\PrjDir\PiCalcServer.dll "c:\Program Files\Microsoft HPC Pack\App"
clusrun xcopy /y \\FileServer\PrjDir\PiCalcServer.Config "c:\Program Files\Microsoft HPC Pack\ServiceRegistration"

To check whether the deployment succeeded:
1. Go to Start Menu -> Hpc Pack -> Hpc Cluster Manager -> Diagnostics -> Tests -> SOA -> SOA Service Configuration Report and run this test.
2. Go to Diagnostics -> Test Results, which lists the detailed results of the test from step 1. If your deployment succeeded, the report shows the service name/bin/interface/implementation and target arch.

The other part of the solution is the client application.
- It is both a WCF client application and an Hpc cluster application:
1. It is a normal .NET/C# console/WinForms application that calls the remote WCF service and the Hpc scheduler service.
2. As with any WCF client application, use the svcutil tool to generate the WCF client proxy class (async style is used here) and add it to your client application project.
svcutil PiCalcServer.dll
svcutil *.wsdl *.xsd /async /language:C# /out:PiCalcServerProxy.cs

3. When developing an SOA application, you create a session with the Hpc cluster, get the SOA broker service endpoint from the session and call the WCF service through this endpoint.
4. The overall client logic is: create a session, divide the computation into sub-tasks, send the requests, collect the partial results from the sub-tasks and compute the final result.
PiCalcClient Core Logic
//
// Create a session object that specifies the head node to which to connect and the name of
// the WCF service to use.
//
SessionStartInfo ssInfo = new SessionStartInfo(schedulerHost, serviceName);
ssInfo.Username = clusterHeadUser;
ssInfo.Password = clusterHeadPassword;
ssInfo.ResourceUnitType = Microsoft.Hpc.Scheduler.Properties.JobUnitType.Core;
ssInfo.MinimumUnits = 2;
ssInfo.MaximumUnits = 1000;

Console.WriteLine("Creating a session ...");
using (Session session = Session.CreateSession(ssInfo))
{
    Console.WriteLine("Session creation done!");
    Console.WriteLine("Session's Endpoint Reference:{0}", session.EndpointReference.ToString());
    int nodes = session.ServiceJob.AllocatedNodes.Count;

    //
    // Binds session to the client proxy using NetTcp binding (specify only NetTcp binding). The
    // security mode must be Transport and you cannot enable reliable sessions.
    //
    System.ServiceModel.Channels.Binding myTcpBinding = new NetTcpBinding(SecurityMode.Transport, false);
    myTcpBinding.ReceiveTimeout = maxTimeOut;
    myTcpBinding.SendTimeout = maxTimeOut;
    PiCalcServerClient calcServerClient = new PiCalcServerClient(myTcpBinding, session.EndpointReference);
    calcServerClient.ClientCredentials.Windows.ClientCredential.UserName = wcfClientUser;
    calcServerClient.ClientCredentials.Windows.ClientCredential.Password = wcfClientPassword;

    //
    // There is no way to get the accurate allocated core count, just assume each node has avgCoresPerNode cores.
    //
    timeBegin = DateTime.Now;
    int taskCount = session.ServiceJob.AllocatedNodes.Count * avgCoresPerNode;
    asyncCalcCount = taskCount;
    for (int i = 0; i < taskCount; i++)
    {
        UInt64 scale = totalScale / (UInt64)taskCount;
        calcServerClient.BeginCalc(scale, AsyncCalcCallback, new CalcReqContext(calcServerClient, i));
    }
    asyncCalcDone.WaitOne();
    Console.WriteLine("All sub tasks done!");
    timeEnd = DateTime.Now;

    calcServerClient.Close();
    Console.WriteLine("========================================");
    Console.WriteLine("totalIn:{0}, totalOut:{1}", totalIn, totalOut);
    Console.WriteLine("the mc pi value:{0}", (totalIn + 0.0) / (totalScale + 0.0) * 4);
    Console.WriteLine("the total time used:{0}", (timeEnd.Value.ToFileTime() - timeBegin.Value.ToFileTime()) / (10 * 1000 * 1000));
    Console.WriteLine("Please enter any key to continue...");
    Console.ReadLine();
}

1. This is just the core logic; for the full code, see the source code package.
2. serviceName is the service configuration file name without the .config extension. schedulerHost is the machine name of the head node of the Hpc cluster.
3. clusterHeadUser/clusterHeadPassword is used to log in to the head node to submit jobs, while wcfClientUser/wcfClientPassword is used to log in to the compute nodes to access the WCF services. Both should be set explicitly in a real cluster environment. The two accounts are usually the same in most cluster environments, but differ from the domain account you use to log in to the corp network.
4. If a "Can't find file - Microsoft.Hpc.Scheduler.Store.dll" exception is raised when running the client, install the Windows HPC Pack Client Utilities. The Windows HPC Pack SDK alone is not enough for developing/running Hpc applications.
5. It takes some time (about 1 minute in my env) to establish a session with the Hpc cluster.
6. You can see the job status, node heat map etc. in Hpc Cluster Manager while the application is running. The cluster manager is also helpful for investigating errors.
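
The divide and combine steps of the client can be sketched independently of the Hpc session API. The C++ below is only an illustration under assumed names (PartialCount, split_scale and combine_pi are not part of the article's source package). Unlike a plain totalScale / taskCount, split_scale spreads the remainder over the first tasks so that no trial is dropped.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Result returned by one sub-task.
struct PartialCount {
    std::uint64_t in_count;
    std::uint64_t out_count;
};

// Divide total_scale trials into task_count request sizes.
std::vector<std::uint64_t> split_scale(std::uint64_t total_scale,
                                       std::uint64_t task_count) {
    assert(task_count > 0);
    const std::uint64_t base = total_scale / task_count;
    const std::uint64_t extra = total_scale % task_count;
    std::vector<std::uint64_t> scales;
    for (std::uint64_t i = 0; i < task_count; ++i) {
        scales.push_back(base + (i < extra ? 1 : 0));
    }
    return scales;
}

// Sum the per-task counts and turn them into a PI estimate.
double combine_pi(const std::vector<PartialCount>& partials) {
    std::uint64_t total_in = 0;
    std::uint64_t total_out = 0;
    for (const PartialCount& p : partials) {
        total_in += p.in_count;
        total_out += p.out_count;
    }
    const std::uint64_t total = total_in + total_out;
    if (total == 0) {
        return 0.0;  // nothing was computed
    }
    return 4.0 * static_cast<double>(total_in) / static_cast<double>(total);
}
```

Feeding combine_pi the totals from the sample run in this article (4216696722 in, 1152012398 out) gives a value of about 3.1416839.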

Typical Client Application Console Output
Creating a session ...
Session creation done!
Session's Endpoint Reference:net.tcp://dit840-013:9087/broker/206
Sub Task[3] Done = In:263541326, Out:72002994
Sub Task[2] Done = In:263543455, Out:72000865
Sub Task[10] Done = In:263544094, Out:72000226
Sub Task[15] Done = In:263534741, Out:72009579
All sub tasks done!
totalIn:4216696722, totalOut:1152012398
the mc pi value:3.14168387800455
the total time used:1244
Please enter any key to continue...

You can increase the number of experiments to get a more precise PI value, but it will take more time.

full source code download

Some Personal Observations:
1. Windows Hpc Cluster provides convenient management tools and utilities, which make deploying/managing mid-sized (several hundred nodes) computing clusters very easy.
2. The Windows SOA programming model greatly simplifies the development of certain kinds of Hpc applications.
3. The built-in security features of Windows Hpc add complexity to the develop/deploy process, and may degrade performance when large amounts of data move among nodes. This overhead buys very little - does security really matter in a private computing cluster?
4. The Hpc SOA programming model is very similar to the so-called "web server farm" architecture. But as a general programming platform, it does not solve the head/broker node fail-over problem in an elegant and scalable way.
5. The Windows Hpc scheduler is too general purpose and too centralized, which makes session creation very time-consuming. As a result, an SOA application spends a lot of time on initialization.
6. Although it is called "SOA" and uses the popular WCF technology, the Hpc SOA architecture is not at all suitable for web applications (especially for scaling purposes). Microsoft describes the target scenario as "interactive applications", which mainly include Monte Carlo problems, ray tracing, Excel calculation add-ins and BLAST searches.

[3]Microsoft Official SOA doc
[4]submit jobs to head node in another AD
[5]Call WCF services hosted on other nodes with specific client credentials(domain username/password)
[6]Monte Carlo Method
[8] From Sequential to Parallel Code Using Windows Hpc (Doc, Code)


Profiling, Instrumentation and Sampling

For Profiling, wikipedia[1] says:
In software engineering, performance profiling(a.k.a performance analysis), is the investigation of a program's behavior using information gathered as the program executes (i.e. it is a form of dynamic program analysis, as opposed to static code analysis).
There are two ways to obtain profiling information: statistical sampling or code instrumentation[1][2]:
- Statistical Sampling probes the target program's program counter at regular intervals using operating system interrupts. Sampling profiles are typically less accurate and specific, but allow the target program to run at near full speed.
- Code Instrumentation, on the other hand, modifies the target program with additional instructions that collect the required information. It is more disruptive, but it allows the profiler to record every event of interest. The added instructions can also change the performance of the program being measured.
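
To make the instrumentation approach concrete, here is a small hand-written C++ sketch. Real instrumenting profilers inject comparable probes automatically; the names ScopedProbe, ProbeStats and probe_table are hypothetical, and the table is not thread-safe.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <map>
#include <string>

// Statistics gathered for one instrumented function.
struct ProbeStats {
    std::uint64_t calls = 0;
    std::chrono::nanoseconds total{0};
};

// Global table of statistics, keyed by probe name.
inline std::map<std::string, ProbeStats>& probe_table() {
    static std::map<std::string, ProbeStats> table;
    return table;
}

// RAII probe: construction marks function entry, destruction marks exit.
class ScopedProbe {
public:
    explicit ScopedProbe(const char* name)
        : name_(name), start_(std::chrono::steady_clock::now()) {
        assert(name_ != nullptr);
    }
    ~ScopedProbe() {
        ProbeStats& stats = probe_table()[name_];
        ++stats.calls;
        stats.total += std::chrono::duration_cast<std::chrono::nanoseconds>(
            std::chrono::steady_clock::now() - start_);
    }
    ScopedProbe(const ScopedProbe&) = delete;
    ScopedProbe& operator=(const ScopedProbe&) = delete;

private:
    const char* name_;
    std::chrono::steady_clock::time_point start_;
};

// An instrumented function: the probe is the "additional instruction".
std::uint64_t sum_to(std::uint64_t n) {
    ScopedProbe probe("sum_to");
    std::uint64_t total = 0;
    for (std::uint64_t i = 1; i <= n; ++i) {
        total += i;
    }
    return total;
}
```

Every call is recorded exactly, which is the strength of instrumentation, but each call also pays for two clock reads and a map lookup, which is the distortion mentioned above.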

Popular Sampling Profilers are:
Intel VTune
Apple Shark
Amd CodeAnalyst

Popular Instrumenting Profilers are:

Visual Studio Profiler supports profiling using both sampling and instrumenting methods.



Software Tracing VS. Event Logging

In software engineering, Tracing is a specialized use of Logging to record information about a program's execution. This information is typically used by programmers for debugging purpose.
- from Wikipedia

In other words, Logging is the generic term for recording information, while Tracing is the specific form of logging used to debug. Some people differentiate between "logging" that's useful for post-release diagnostics and "tracing" for development purposes.

- Event Logging provides system administrators with information useful for diagnostics, operation and auditing. The different classes of events that will be logged, as well as what details will appear in the event messages, are often considered early in the development cycle.

- Software Tracing provides developers with information useful for debugging. This information is used both during the development cycle and after the software is released.

In practice, however, the distinction between these two terms is not so clear, and they are sometimes used interchangeably.

For example, .NET provides a logging mechanism through the Debug and Trace classes in the System.Diagnostics namespace. Here Trace is used for operational/diagnostic purposes in release/production code, while Debug outputs information only in debug/development builds.
MSDN has a detailed description of this topic.
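
The same split can be mimicked in other languages. The following C++ sketch is only an analogy, not the .NET API: ENABLE_DEBUG_OUTPUT is a hypothetical macro standing in for the DEBUG symbol, and trace_line/debug_line are made-up helper names.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

#ifndef ENABLE_DEBUG_OUTPUT
#define ENABLE_DEBUG_OUTPUT 1  // hypothetical switch; a release build would set it to 0
#endif

// Always compiled in: operational messages that stay in production code.
void trace_line(std::ostream& out, const std::string& message) {
    out << "[trace] " << message << '\n';
}

// Compiled to a no-op when ENABLE_DEBUG_OUTPUT is 0.
void debug_line(std::ostream& out, const std::string& message) {
#if ENABLE_DEBUG_OUTPUT
    out << "[debug] " << message << '\n';
#else
    (void)out;
    (void)message;
#endif
}

// Small demonstration that collects both kinds of output in a string.
std::string demo_output() {
    std::ostringstream out;
    trace_line(out, "service started");
    debug_line(out, "cache size = 3");
    // assert is itself a debug-only check: it is compiled out under NDEBUG.
    assert(out.good());
    return out.str();
}
```

With the switch left at 1 both lines are produced; with it set to 0 only the trace line survives, and the debug call costs nothing at run time.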

Popular systems:
1. Apache Logging can be used for both purposes.
2. SysLog can be used for event logging.
3. Win32 Event Logging is for operational purposes, and Win32 Event Tracing is for developers.


Programming Windows Hpc Server - An Overview

Here is a very brief introduction to building a Windows Hpc Cluster and the various ways to develop applications on this platform.

Part I - Building A Windows Hpc Cluster

1. Build Hpc Cluster Head Node
- OS: Windows Server 2008 x64, Standard/Enterprise/Hpc Edition
- Hpc Pack 2008 (head mode installation)
- Configure Network etc.
- Add Cluster Admin Group/User

2. Build Hpc Cluster Compute Node
- OS: Windows Server 2008 x64, Standard/Enterprise/Hpc Edition
- Hpc Pack 2008 (compute mode installation)
- Add Cluster Admin Group/User

3. Build Hpc App Dev Node
- OS: Windows XP, Windows 2003, Windows Vista, Windows 7
- Hpc Pack 2008 (Client Utilities)
- Visual Studio 2005/2008
- Hpc Pack 2008 SDK

For more info, see the Design & Deploy Guide for Windows Hpc Server 2008.

Part II - Writing Hpc Programs


1 MPI
- Programming model for typical cluster systems (distributed memory/shared nothing (disk) systems)
Classic Hpc Programming Using Visual C++(Doc, Code)
MPI Programming on HPCS


2 SOA
- Programming model using WCF/HPC, suitable for embarrassingly parallel problems.
From Sequential to Parallel Code Using Windows Hpc (Doc, Code)
SOA Programming on HPCS

3 Raw Job Scheduler

Job Types:

Parallel Task Job (from MSDN)
- Parallel Task Job: same bin, communicating among tasks

Parametric Sweep Job(from MSDN)
- Parametric Sweep Job: same bin, no communication among tasks
Task Flow Job (from MSDN)
- Task Flow Job: different bin, dependencies(DAG) among tasks
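
For a Task Flow Job, the dependencies determine which tasks may start when. The sketch below shows one common way to derive a valid execution order from a DAG (Kahn's algorithm). It illustrates the concept only and is not how the Hpc job scheduler is implemented; schedule_order is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// Returns an order in which tasks 0..task_count-1 can run so that, for every
// dependency (first, second), `first` finishes before `second` starts.
// Returns an empty vector if the dependencies contain a cycle.
std::vector<std::size_t> schedule_order(
    std::size_t task_count,
    const std::vector<std::pair<std::size_t, std::size_t>>& deps) {
    std::vector<std::vector<std::size_t>> next(task_count);
    std::vector<std::size_t> pending(task_count, 0);  // unmet dependencies
    for (const auto& d : deps) {
        assert(d.first < task_count && d.second < task_count);
        next[d.first].push_back(d.second);
        ++pending[d.second];
    }

    std::queue<std::size_t> ready;
    for (std::size_t t = 0; t < task_count; ++t) {
        if (pending[t] == 0) ready.push(t);
    }

    std::vector<std::size_t> order;
    while (!ready.empty()) {
        const std::size_t t = ready.front();
        ready.pop();
        order.push_back(t);
        for (std::size_t s : next[t]) {
            if (--pending[s] == 0) ready.push(s);
        }
    }
    if (order.size() != task_count) order.clear();  // cycle detected
    return order;
}
```

Tasks that sit in the ready queue at the same time have no dependency on each other, so a scheduler is free to run them in parallel.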

Job Scheduler Programming
Job Management on HPCS
Job Scheduling on WCCS2003
Job Manager Doc


Hpc Pack @ MSDN
Hpc Pack SDK @ MSDN
Hpc Dev Resources @ MsHpcHome


An Introduction to Windows HPC Server 2008

1. Components of Windows HPC Server

1.1 X64 Windows OS (Windows Server 2008 Standard, Enterprise or Hpc Edition)

1.2 Hpc Subsystem (Hpc Pack 2008)
- scalable Job Scheduler for creating, executing and monitoring Hpc jobs
- MS-MPI programming infrastructure, which is based on MPICH2
- HPC-WCF(HPC-SOA) programming infrastructure
- utilities for Cluster Managing (Configuring, Provisioning, Monitoring, Reporting and Diagnosing)

Note: Hpc Pack 2008 has three major installation types
- Head Node (Hpc Head Core + Management/Job Consoles)
- Compute Node (Hpc Compute Core)
- Client Node (optional, Management/Job Consoles)

2. Architecture of Windows HPC Server

Windows Hpc Server Typical Topology (from Microsoft)
2.1 Nodes
- Head Node (needs Hpc Pack)
- Compute Node (needs Hpc Pack)
- WCF Broker Node (optional, needs Hpc Pack)
- Client Node (optional, needs Hpc Pack Client Utilities)
- Dev Node (optional, needs Hpc Pack SDK + Hpc Pack Client Utilities)

2.2 Networks
- Interface
a. Winsock Direct
b. NetworkDirect
c. TCP/IP interface
- Physical
a. Ethernet
b. InfiniBand
c. Myrinet

Windows Hpc Server Network Stack (from Microsoft)

3. What can we do with it?

3.1 Application Developing

- Batch/Parallel: MPI
- Interactive/SOA: HPC-WCF

3.2 Job Management

A. Job Types

- Parallel Job: same bin, communicating among tasks
- Parametric Sweep Job: same bin, no communication among tasks
- Task Flow Job: different bin, dependencies(DAG) among tasks

B. Job Operations
- Job Scheduler Configuration
- Job Creation and Definition
- Job Execution and Monitoring

C. Interfaces
- Command line (CmdShell & PowerShell)
- COM and .NET Interface
- HPC Basic Profile Interface by Open Grid Forum

3.3 System Management

- Configuration
- Provisioning
- Monitoring, Reporting & Auditing
- Diagnostics

Windows Hpc Server Ecosystem (from Microsoft)


Introduction on Windows Hpc Server 2008

Windows Hpc Server Product Home
Windows Hpc Server Technet Home
Windows Hpc Server 2008 Docs
Windows Hpc Server Team Blog

- Management
HPCS Command Tools
HPCS PowerShell Cmdlets

- App Developing
Job Scheduling on WCCS2003
Job Management on HPCS
SOA Programming on HPCS
MPI Programming on HPCS
Hpc Pack @ MSDN
Hpc Dev Resources @ MsHpcHome

- Community
Hpc Show @ Channel9
WindowsHpc Site


OS provisioning in large scale data centers

In large scale data centers, manually installing the OS is not practical due to its huge time/labor cost. There must be a way to install the OS automatically, and this is where PXE (Preboot eXecution Environment) technology comes in.

PXE is maintained by the Intel Corporation. It is an industry standard client/server interface that allows networked computers that are not yet loaded with an operating system to be configured and booted remotely by an administrator. The PXE code is typically delivered with a new computer on a read-only memory chip or boot disk that allows the computer (a client) to communicate with the network server so that the client machine can be remotely configured and its operating system can be remotely booted.

PXE provides three things:
1) The Dynamic Host Configuration Protocol (DHCP), which allows the client to receive an IP address and so gain access to the network servers.
2) A set of application program interfaces (APIs) that are used by the client's Basic Input/Output System (BIOS) or a downloaded Network Bootstrap Program (NBP) to automate the booting of the operating system and other configuration steps.
3) A standard method of initializing the PXE code in the PXE ROM chip or boot disk.

The PXE process consists of:
1. The client notifies the server that it uses PXE. If the server uses PXE, it sends the client a list of boot servers that contain the available operating systems.
2. The client finds the boot server it needs and receives the name of the file to download. The client then downloads the file using the Trivial File Transfer Protocol (TFTP) and executes it, which loads the operating system.
3. If a client is equipped with PXE and the server is not, the server ignores the PXE code, which prevents disruption of the DHCP and Bootstrap Protocol (BOOTP) operations.

1. MSDN on PXE
2. Wikipedia on PXE
3. PXE wiki
4. PXE Spec


*cast: the any-, broad-, uni- and multi- ones

*cast refers to a family of network addressing and routing schemes.

Anycast denotes that data is routed to the "nearest" or "best" destination as viewed by the routing topology. In extremely scalable online web services, anycast may be used as the first-level load balancing solution (a role that DNS round robin can also play).

Unicast transmission is the sending of information packets to a single destination. The name is a little strange; it was coined to echo the *cast style.

Multicast is a network addressing method for the delivery of information to a group of destinations simultaneously. It is useful for intranet applications such as video conferencing over IP and VOD services.

Broadcast refers to transmitting a packet that will be received (conceptually) by every device on the network. It is widely used to flood messages when the destination address is unknown.
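
In IPv4, part of this distinction is visible in the destination address itself. The sketch below classifies an address using two well-known facts: 224.0.0.0/4 is the multicast block and 255.255.255.255 is the limited broadcast address. Anycast addresses come from the ordinary unicast space, so they cannot be told apart this way, and subnet-directed broadcasts are ignored for simplicity. The names CastKind, ipv4 and classify_ipv4 are illustrative.

```cpp
#include <cassert>
#include <cstdint>

enum class CastKind { Unicast, Multicast, Broadcast };

// Build an IPv4 address in host byte order from its four octets.
inline std::uint32_t ipv4(std::uint32_t a, std::uint32_t b,
                          std::uint32_t c, std::uint32_t d) {
    assert(a < 256 && b < 256 && c < 256 && d < 256);
    return (a << 24) | (b << 16) | (c << 8) | d;
}

// Classify a destination address by its value alone.
inline CastKind classify_ipv4(std::uint32_t dest) {
    if (dest == 0xFFFFFFFFu) {
        return CastKind::Broadcast;   // 255.255.255.255
    }
    if ((dest >> 28) == 0xEu) {
        return CastKind::Multicast;   // 224.0.0.0/4
    }
    return CastKind::Unicast;         // includes anycast addresses
}
```

For example, classify_ipv4(ipv4(239, 255, 255, 250)) yields CastKind::Multicast, while an ordinary host address such as 10.1.2.3 yields CastKind::Unicast.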

The following pictures illustrate the main concepts of each method:


Picture from Wikipedia

Scalable System Design Philosophies

Keep it simple
Complicated systems fail in complicated ways (making them difficult to test and debug)
Start with simplest design (evolve based on experience)

Assume everything can fail
Machines and hardware can fail at any time
Do not assume key points have to be reliable
Build redundancy at all levels

Make failure cases mainline
If an error path is only executed on failure, there is a higher chance of it not working correctly
You don’t want to discover that error handling has problems during a failure
Eliminating different paths of execution keeps things simple and reduces coding and testing

Minimize work done by people
Processes with people involved are slow
They are difficult to test
Error prone and hard to scale
Build logic into the system

Scale out, not up
Don’t depend on bigger computers to scale up a problem

Reduce costs
Commodity hardware provides more processing power per dollar
Highly reliable hardware is much more expensive
It still does not eliminate all failures


Hub, Bridge, Router and Switch

Ever since I studied network technologies in college, I have been confused by many terms in the network world. Recently, I ran into these terms again because of a project requirement, so I studied them again and summarized the most confusing ones here.

A Network is a group of computers connected together in a way that enables information exchange among them.

A Network Node is anything that is connected to the network as an information sender/receiver. It is typically a computer, but may also be a printer or a CD-ROM tower.

A Network Segment is a portion of a computer network wherein every node communicates using the same physical layer.

A Hub works at Layer 1 of the OSI network model. It repeats the signal received on one port to all the other ports, so every node attached to the hub receives all frames sent by the other nodes; the attached devices share a single collision domain (and a single broadcast domain). A hub merely extends the network segment to a larger scale, since it only repeats at the physical layer.

A Bridge connects multiple network segments at the data link layer (layer 2) of the OSI model. Rather than repeating packets to all ports, a bridge analyzes each incoming frame to determine which port should be used to send it toward the destination node, which is usually located in another network segment. In the Ethernet world, the MAC address is used to select the destination port. Bridges create separate network segments, since they switch at the data link layer and isolate the collision domain of each segment; broadcast frames are still forwarded to all segments.
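
The port selection described above is usually implemented with a learned MAC table. The sketch below is a simplified model of a learning bridge, assuming no table aging, VLANs or spanning tree; LearningBridge and handle_frame are illustrative names.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

using MacAddress = std::uint64_t;  // 48-bit MAC stored in the low bits
constexpr MacAddress kBroadcastMac = 0xFFFFFFFFFFFFull;

class LearningBridge {
public:
    explicit LearningBridge(int port_count) : port_count_(port_count) {}

    // Handle a frame arriving on in_port; return the ports it is sent out on.
    std::vector<int> handle_frame(int in_port, MacAddress src, MacAddress dst) {
        assert(in_port >= 0 && in_port < port_count_);
        table_[src] = in_port;  // learn which port the sender lives behind

        std::vector<int> out;
        const auto it = table_.find(dst);
        if (dst != kBroadcastMac && it != table_.end()) {
            // Known destination: forward to its port only, or drop the frame
            // if the destination is on the segment the frame came from.
            if (it->second != in_port) out.push_back(it->second);
            return out;
        }
        // Unknown or broadcast destination: flood to every other port.
        for (int p = 0; p < port_count_; ++p) {
            if (p != in_port) out.push_back(p);
        }
        return out;
    }

private:
    int port_count_;
    std::map<MacAddress, int> table_;
};
```

The first frame to an unknown address is flooded like on a hub; once the destination has sent a frame of its own, traffic toward it goes to a single port.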

A Router works at layer 3 (the network layer, e.g. IP) to connect different subnets. It forwards logically addressed packets from their source toward their ultimate destination through intermediate nodes.

At the conceptual level, routers and bridges do the same work - packet forwarding. The forwarding decision follows one of two policies: routing (by routers) or bridging (by bridges). Routing uses information encoded in a device's address to infer its location on the network, while bridging makes no assumptions about where addresses are located and depends heavily on broadcasting to locate unknown addresses. In other words, routing forwards according to the destination node's network location, while bridging forwards according to the destination node's address alone.
Routing assumes that network addresses are structured and that similar addresses imply proximity within the network. Because structured addresses allow a single routing table entry to represent the route to a group of devices, structured addressing (routing, in the narrow sense) outperforms unstructured addressing (bridging) in large networks.
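
The point that one routing table entry can stand for a whole group of devices can be sketched with a longest-prefix-match lookup. This is a deliberately naive linear scan (real routers use tries or hardware tables), and the Route struct and next-hop labels are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

struct Route {
    std::uint32_t prefix;  // network address, host byte order
    int prefix_len;        // number of leading bits that must match (0..32)
    std::string next_hop;  // label only; hypothetical names
};

// Build an IPv4 address in host byte order from its four octets.
inline std::uint32_t ipv4(std::uint32_t a, std::uint32_t b,
                          std::uint32_t c, std::uint32_t d) {
    return (a << 24) | (b << 16) | (c << 8) | d;
}

// Netmask with prefix_len leading one bits.
inline std::uint32_t mask_of(int prefix_len) {
    assert(prefix_len >= 0 && prefix_len <= 32);
    return prefix_len == 0 ? 0u : 0xFFFFFFFFu << (32 - prefix_len);
}

// Return the most specific route that covers dest, or nullptr if none does.
inline const Route* lookup(const std::vector<Route>& table, std::uint32_t dest) {
    const Route* best = nullptr;
    for (const Route& r : table) {
        const std::uint32_t mask = mask_of(r.prefix_len);
        if ((dest & mask) == (r.prefix & mask)) {
            if (best == nullptr || r.prefix_len > best->prefix_len) {
                best = &r;
            }
        }
    }
    return best;
}
```

A single entry such as 10.0.0.0/8 covers about 16 million addresses, whereas a bridge would need one table entry per MAC address it has seen.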

Switch is a broad and imprecise marketing term. It commonly refers to a network bridge that forwards packets at the data link layer (layer 2) of the OSI model. Switches that additionally process packets at the network layer (layer 3 and above) are often referred to as Layer 3 switches, a term that is also used interchangeably with Router.