Parallelization: Harder than it looks

Published on 2010-04-12
Tagged: c++ optimization parallelization

I recently finished a class at UCSD titled parallel microarchitecture. Each assignment in this class involved taking a serial program and parallelizing it. The goal was to reduce the execution time for each program by as much as possible.

The tool we used to parallelize these programs is called Cilk++. Cilk allows you to annotate portions of a normal C/C++ program which can be run in parallel, and the compiler generates parallel code. This allows you to optimize a serial program with minimal effort. There are three main constructs in Cilk:

cilk_spawn - call a function asynchronously
cilk_sync - wait for all asynchronous function calls to complete
cilk_for - execute iterations of a loop in parallel

Under the hood, Cilk runs a number of worker threads, each of which executes tasks from its ready deque (double ended queue). The number of worker threads depends on the number of processors. When a worker encounters a task that can be run in parallel (such as a cilk_spawn call), it pushes the task onto the head of its ready deque. When a worker completes a task, it pops the next task off the head of its deque. If a worker runs out of tasks, it steals tasks from the tail of another worker's ready deque. Since "thieves" and "victims" operate on different ends of ready deques, very little synchronization is required.

Despite the simplicity of the Cilk constructs and the low overhead of its implementation, parallelizing even simple programs can be difficult. I found the presentation Issues in Parallelization particularly useful. In these slides, Matteo Frigo lists four common performance problems for parallelized programs, which I will recapitulate below. NB Intel seems to have removed everything from the Cilk website including the presentation. The link goes to a cached Google page.

Problem 1: Wrong grain size

The easiest way to improve performance with Cilk is to take advantage of Data Level Parallelism. If you are processing a large data set, you should split it up into smaller chunks, if possible, and have each core process a portion of the chunks. This is usually done with cilk_for.

A key factor affecting performance is the size of each chunk. Cilk refers to this as grain size. Larger grain sizes result in less parallelism, especially if there are fewer grains than cores or if grains take different amounts of time to process. Smaller grains are better, but since there are more of them, there is a greater parallelization overhead. Smaller grains may also cause each core to have an irregular memory access pattern, so there will be more cache misses.

If you get lower than expected speedup when adding more cores, your grain size may be too small. If performance does not increase at all when adding more cores, your grain size may be too large.

Consider the following loop:

cilk_for (int i = 0; i < N; ++i)
    a[i] = b[i] + c[i];

Very little work is performed in each iteration of the loop. If the grain size were 1, then each grain would comprise a single iteration, and the overhead would dominate the time spent doing useful work. On the other hand, consider the following loop:

cilk_for (int i = 0; i < M; ++i) {
    int sum = 0;
    for (int j = 0; j < N; ++j)
        sum += a[i][j];
    b[i] = sum;
}

In this case, if N is large, then a fairly large amount of work is performed for each iteration of the outer loop. A small grain size may make sense here since the overhead will be small compared to the amount of work.

Cilk allows you to set the grain size using the cilk_grainsize pragma. For example, the following snippet executes each iteration of the loop separately:

#pragma cilk_grainsize = 1
cilk_for (int i = 0; i < N; ++i)
    ...

You can use any integer-valued expression for the grain size. The expression is evaluated at runtime. You can even use the number of worker threads:

#pragma cilk_grainsize = N / (4 * cilk::current_worker_count)
cilk_for (int i = 0; i < N; ++i)
    ....

The default grain size is min(512, N / (8*p)) where p is the number of workers.

Problem 2: Too little parallelism

Let's imagine a simple alternative to cilk_for:

void loop_body(int i) { ... }

...

for (int i = 0; i < N; ++i)
    cilk_spawn loop_body(i);

This wraps the body of the loop in a separate function, and makes each call to the function asynchronous. This is semantically similar to what cilk_for does, but it would be very inefficient for a large number of iterations. The problem is that the function calls in each iteration are being spawned in serial rather than in parallel. Since less parallelism is available, the program will execute more slowly than if the spawns were made in parallel.

cilk_for solves this problem using a divide and conquer approach. Here is some pseudocode from the Cilk++ Programmer's Guide to give you an idea of how this works.

void run_loop(first, last) {
    if (last - first < grainsize) {
        for (int i = first; i < last; ++i) LOOP_BODY;
    } else {
        int mid = (last-first)/2; 
        cilk_spawn run_loop(first, mid); 
        run_loop(mid, last);
    }
}

Each time run_loop is called with a range of iterations larger than the grain size, the range is divided by 2, and run_loop is called recursively on each half in parallel. If the range is smaller than the grain size, the iterations are executed serially.

Problem 3: Memory bandwidth limitations

In a multicore processor, each core has its own execution resources, so most instructions can be executed in parallel. However, most memory resources are still shared. Caches and RAM have limited bandwidth, and a single core doing nothing else can generate enough traffic to saturate them.

As a result, parallelizing memory bound programs will not produce any speedup. Try parallelizing a loop like this:

cilk_for (int i = 0; i < N; ++i)
    a[i] = b[i] + c[i];

On most systems, you will not get any speedup. In fact, you may see decreasing performance as more cores are added. If this loop is executed in serial, only compulsory cache misses will occur. The processor may be able the prefetch data into the cache as well, since there is a sequential access pattern. However, if multiple threads execute this loop, they will each use a portion of the shared cache, generating a lot of conflict misses.

It is difficult to tell whether bandwidth limitations are the bottleneck in a parallel program. Here's something to try though. If you suspect a loop of being memory bound, try adding some junk computation to it. You will see a much greater speedup if the loop was memory bound, since adding more work will cause it to be compute bound:

void grind(void) {
    int sum = 0;
    for (int i = 0; i < 100; ++i)
        sum += i;
}

cilk_for (int i = 0; i < N; ++i) {
    a[i] = b[i] + c[i];
    grind();
}

Be aware that the compiler may remove the junk code if optimization is enabled.

In general, there's not much you can do to eliminate a memory bottleneck. You can try to reduce memory bandwidth by changing memory access patterns to be more cache-friendly. If this is not possible, you can always buy a new machine with a larger cache and faster memory.

Problem 4: Too little burdened parallelism

Cilk defines parallelism in an idealistic way. Ignoring any overheads, parallelism is defined as T(1) / T(âˆž). T(1) denotes the time for one processor to execute the program. T(âˆž) denotes the time for an infinite number of processors to execute the program. Burdened parallelism is a similar concept which takes parallelization overhead into account. Burdened parallelism is intended to be an upper bound on speedup.

Cilkview is a tool which gives a number of useful statistics about your program, including burdened parallelism. If you find your burdened parallelism is much lower than your parallelism, there are a number of things you can do to reduce overhead.

Use cilk_for on outer loops and not on inner loops
Increase the grain size for cilk_for loops
Combine small parallel tasks into larger serial tasks
Set a threshold in recursive functions for asynchronous calls. For instance, if you were parallelizing a merge sort, you could only make a recursive call for lists above a certain length.