Memory Issues on Multicore Platform

On a multi-core platform, pure computation is cheap since there are many processing units, and memory capacity is rarely a problem since it keeps growing. But memory bandwidth remains a bottleneck, because the memory bus is shared by all CPU cores. Efficient memory management is therefore critical for a scalable application on a multicore CPU.

In this article I will point out some memory-related problems on multicore architectures, along with some solutions.

Part I - Memory Contention

Memory contention means that different cores share a common data region (in main memory and cache) that must be kept synchronized among them. Synchronizing data across cores carries a big performance penalty because of bus traffic contention, locking costs, and cache misses. There are two strategies to deal with this problem:

1. Don't Share Writable State Among Cores

To minimize memory bus traffic, minimize core interactions by minimizing shared locations and data, even when the shared data is protected not by a lock but by hardware-level atomic instructions such as InterlockedExchangeAdd64 on the Win32 platform.

The patterns that tend to reduce lock contention also tend to reduce memory traffic, because it is the shared writable state that requires locks and generates contention. In practice, letting each thread work on its own local copy of the data and merging the data after all threads are done can be a very effective strategy.

Let's look at two parallel versions of a sum-calculation program on an eight-core computer. One version uses a shared global variable, updated with InterlockedExchangeAdd64(), to accumulate the intermediate results of all threads. The other gives each thread a private partial-sum variable that is not shared at all; the final sum is computed by adding up these partial sums.

From the console output we can see clearly that the private partial-sum solution is roughly 20x faster than the shared one:

Use Global - Total Sum is:49999995000000, used ticks:904.
Use Local - Total Sum is:49999995000000, used ticks:47.

So even sharing a single variable protected by hardware atomic instructions can carry a very significant performance penalty.
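The two versions can be sketched portably; std::atomic&lt;&gt;::fetch_add plays the role of InterlockedExchangeAdd64 here, and the thread and iteration counts are illustrative, not those of the original test:

```cpp
// Sketch comparing a shared atomic counter with per-thread partial sums.
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

constexpr int kThreads = 8;
constexpr std::int64_t kPerThread = 1'000'000;

// Version 1: every increment contends on one shared cache line.
std::int64_t sum_shared() {
    std::atomic<std::int64_t> total{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < kThreads; ++t)
        workers.emplace_back([&total] {
            for (std::int64_t i = 0; i < kPerThread; ++i)
                total.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& w : workers) w.join();
    return total.load();
}

// Version 2: each thread accumulates privately; merge once at the end.
std::int64_t sum_private() {
    std::vector<std::int64_t> partial(kThreads, 0);
    std::vector<std::thread> workers;
    for (int t = 0; t < kThreads; ++t)
        workers.emplace_back([&partial, t] {
            std::int64_t local = 0;  // stack-local: no sharing at all
            for (std::int64_t i = 0; i < kPerThread; ++i)
                local += 1;
            partial[t] = local;      // one write per thread
        });
    for (auto& w : workers) w.join();
    std::int64_t total = 0;
    for (auto p : partial) total += p;
    return total;
}
```

Both functions compute the same sum; only the sharing pattern differs, which is where the performance gap comes from.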

The general rule for efficient execution on a single core is to pack data tightly, so that it has as small a footprint as possible. On a multi-core processor, however, packing shared data can lead to a severe penalty from false sharing. Generally, the solution is still to pack data tightly, but to give each thread its own private copy to work on and merge the results afterwards.

2. Avoid False Sharing introduced by Core Cache

Good performance depends on processors fetching most of their data from cache instead of main memory. For sequential programs, modern caches generally work well without much tuning. The smallest unit of memory that two processors exchange is a cache line (or cache sector).

Even if we follow strategy 1 and let each thread access only its private data/state, threads running on different cores may still share the same cache line. This is called "false sharing".

Let's use a parallel counter-increment program to see the performance penalty of false sharing. In the first version, each thread modifies its own per-thread counter, but the counters are laid out contiguously (and therefore packed into the same cache line). In the second version, the counters are placed in non-contiguous locations.

The performance numbers are:
Total Time:2012 (first version)
Total Time:468 (second version)

We can see that false sharing introduced a performance penalty of more than 4x. Avoiding it may require aligning variables or objects in memory on cache-line boundaries, so that each core accesses a private cache line that is not shared with others.
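The two layouts can be sketched as follows. A 64-byte cache line is an assumption here; on real hardware you would query the size (e.g. via std::hardware_destructive_interference_size where the compiler provides it):

```cpp
// Sketch of the two counter layouts from the false-sharing experiment.
#include <cstdint>
#include <thread>
#include <vector>

constexpr int kThreads = 4;
constexpr std::int64_t kIters = 1'000'000;

struct Packed {                 // neighbors land on the same cache line
    std::int64_t value;
};
struct alignas(64) Padded {     // one assumed 64-byte line per counter
    std::int64_t value;
};

template <typename Counter>
std::int64_t run() {
    std::vector<Counter> counters(kThreads);   // value-initialized to zero
    std::vector<std::thread> workers;
    for (int t = 0; t < kThreads; ++t)
        workers.emplace_back([&counters, t] {
            for (std::int64_t i = 0; i < kIters; ++i)
                counters[t].value += 1;  // each thread touches only its own slot
        });
    for (auto& w : workers) w.join();
    std::int64_t total = 0;
    for (auto& c : counters) total += c.value;
    return total;
}
```

Both instantiations produce the same result; run&lt;Packed&gt;() suffers false sharing because all counters sit on one line, while run&lt;Padded&gt;() gives each counter a line of its own.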

Part II - Heap Contention

Most developers manage memory with the standard C library's malloc/free or the standard C++ library's new/delete; some use OS APIs directly, for example HeapAlloc()/HeapFree() on the Windows platform.

The C/C++ standard memory management routines are implemented on top of platform-specific memory management APIs, usually based on the concept of a heap. These library routines (whether the single-threaded or the multi-threaded version) allocate and free memory on a single heap, usually called the CRT heap. It is a global resource, shared and contended by all threads within a process.

This heap contention is one of the bottlenecks of memory-intensive multi-threaded applications. The solution is to use a thread-local (private) heap for memory management, which eliminates the resource contention. On the Windows platform, this means creating a dedicated heap for each thread with HeapCreate() and passing the returned heap handle to the HeapAlloc()/HeapFree() functions.

Let's look at a global-heap vs. local-heap example on the Windows platform.

On an 8-core system, the perf test results using 8 threads are:
8 core time:59282, use global heap? true.
8 core time:20112, use global heap? false.

Using a private heap yields around a 3x performance gain.
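The same idea can be sketched portably with a thread_local bump arena standing in for the Windows private heap; Arena and tls_alloc are illustrative names, not a real API, and a production version would use HeapCreate()/HeapAlloc() (or a real pool) instead:

```cpp
// Sketch: one private "heap" per thread, so the allocation path takes no lock.
#include <cstddef>
#include <vector>

class Arena {  // hypothetical stand-in for a thread-private heap
public:
    explicit Arena(std::size_t bytes) : buffer_(bytes), used_(0) {}

    void* allocate(std::size_t n) {
        // Round up to max alignment so returned pointers are usable for any type.
        const std::size_t a = alignof(std::max_align_t);
        n = (n + a - 1) & ~(a - 1);
        if (used_ + n > buffer_.size()) return nullptr;  // arena exhausted
        void* p = buffer_.data() + used_;
        used_ += n;
        return p;
    }

    void reset() { used_ = 0; }  // "free" everything in one shot

private:
    std::vector<unsigned char> buffer_;
    std::size_t used_;
};

// Each thread gets its own arena; allocations never contend across threads.
thread_local Arena tls_arena(1 << 20);

void* tls_alloc(std::size_t n) { return tls_arena.allocate(n); }
```

Because every thread allocates only from its own arena, no synchronization is needed at all, which is exactly what the private-heap measurement above exploits.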

- On the Windows platform, the HEAP_NO_SERIALIZE flag can be set when creating a heap, which removes the synchronization cost of accessing it from multiple threads. It turns out, however, that setting this flag on a thread-private heap makes it very slow on Vista and later operating systems.
- The reason is that in Vista, Microsoft refactored the heap manager code and removed some extra data structures and code paths that were no longer part of the common case for handling heap API calls.
- HEAP_NO_SERIALIZE (and some debug scenarios) disables the Low Fragmentation Heap (LFH), which is now the de facto default policy for heaps and is highly optimized.

Part III - Dynamic Creation/Free of C++ Object

operator new/operator delete are functions: the C++ counterparts of malloc/free, responsible only for acquiring and releasing raw memory. There is a global version (::operator new) and an optional class-level version (the static member class-name::operator new).

The new/delete expressions, on the other hand, handle object construction and destruction in addition to memory management. They are language operators, just like +, -, * and /. A new/delete expression calls the global operator new/delete, or the class-specific operator new/delete if the class defines one.
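A minimal sketch of this distinction: Widget defines its own class-level operator new/delete pair (the allocation counter is only for illustration), while the new/delete expressions still drive construction and destruction:

```cpp
// Sketch: class-level operator new/delete. The new-expression `new Widget`
// first calls Widget::operator new for raw memory, then runs the constructor;
// `delete w` runs the destructor, then Widget::operator delete.
#include <cstdlib>
#include <new>

struct Widget {
    int id = 42;

    static inline int live_allocations = 0;  // bookkeeping for illustration

    static void* operator new(std::size_t size) {
        ++live_allocations;                  // memory management only
        if (void* p = std::malloc(size)) return p;
        throw std::bad_alloc{};
    }
    static void operator delete(void* p) noexcept {
        --live_allocations;
        std::free(p);
    }
};
```

Note that the overloads never touch id: construction (setting id to 42) and destruction are handled by the new/delete expressions themselves.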

To fully parallelize an application that uses STL containers, you may need to write your own allocator that draws from a thread-private heap or a memory pool. Your business logic then stays the same as in the single-core version, while the contention bottleneck is eliminated.

Here is an example of writing your own operator new/delete and allocator.


References

Nehalem Architecture

Cache Organization and Memory Management of the Intel Nehalem Computer Architecture

Cross-Platform Get Cache Line Size

Understanding and Avoiding Memory Issues with Multi-core Processors

Thread/Data Placement for Better/Consistent Performance on Multi-Core/NUMA Architectures

Parallel Memory Management(Allocate/Free) Intensive Applications on Multi-core system
English Version - http://www.codeproject.com/KB/cpp/rtl_scaling.aspx
Chinese Version - http://blog.csdn.net/arau_sh/archive/2010/02/22/5317919.aspx

Intel Guide for Developing Multithreaded Applications

Windows Heap Management/Performance

Memory Optimization for the entire C++ program

C++ Dynamic Memory Management Techniques

Understanding Operator New and Operator Delete

C++ Standard Allocator - Introduction and Implementation

Improve Performance by Allocator using Pooled Memory

Improving STL Allocators

Anatomy of the Linux slab allocator