We implement fast barriers for x86-compatible multicore processors. In particular, we target Intel's Core2 Duo and Core2 Extreme, their latest dual-core and 2x dual-core processors.
A few comments:
To fully exploit current multicore processors and parallelize even small, tightly coupled problems, extremely light-weight synchronization is a must. It turns out that the synchronization primitives provided by threading libraries and the operating system, and even the x86 lock instructions, are much too expensive to take real advantage of the new multicore processors. Using our own synchronization implementation, we succeeded in speeding up very small workloads. For instance, on a Core2 Duo we see the first parallel speed-up for an FFT of size 1024, which runs for approximately 10,000 cycles and whose working set fits fully into the L1 data cache of one core.
We compared multiple fast barrier variants. The baseline is the (fast) OpenMP barrier of the Intel C++ compiler. An OS-based or pthreads barrier takes on the order of tens of thousands of cycles.
Barrier.zip (12 KB). The zip file contains a VisualStudio .NET 2003 + Intel C++ Compiler 9.1 project that implements a small threading interface to pthreads/winthreads, the barrier, affinity, timing, and a small correctness/timing program.
rdtsc.h | timing interface, using the RDTSC instruction |
smp.h, smp.c | the actual barrier implementation |
threads.h | threading interface to pthreads and winthreads |
main.c | example and test program |
barrier.sln | VisualStudio .NET 2003 solution file |
spmd.vcproj | VisualStudio .NET 2003 project file |
spmd.icproj | Intel C++ compiler 9.1 project file |
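As an illustration of the kind of cycle-level timing that rdtsc.h is described as providing, here is a minimal sketch built on the `__rdtsc()` compiler intrinsic. It is not the code from the zip file: the names `read_tsc`, `measure_min_cycles`, and `sample_workload` are hypothetical, and the sketch assumes an x86 target and a compiler that ships the intrinsic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(_MSC_VER)
#include <intrin.h>      /* MSVC-compatible compilers */
#else
#include <x86intrin.h>   /* GCC and Clang on x86 */
#endif

/* Read the time stamp counter (hypothetical helper name). */
static inline uint64_t read_tsc(void) {
    return (uint64_t)__rdtsc();
}

/* Run fn() `runs` times and return the smallest observed TSC difference.
   Taking the minimum reduces the influence of interrupts and cold caches. */
uint64_t measure_min_cycles(void (*fn)(void), int runs) {
    assert(fn != NULL && runs > 0);
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < runs; i++) {
        uint64_t start = read_tsc();
        fn();
        uint64_t elapsed = read_tsc() - start;
        if (elapsed < best)
            best = elapsed;
    }
    return best;
}

/* Hypothetical stand-in workload; volatile keeps the loop from being removed. */
void sample_workload(void) {
    volatile int sink = 0;
    for (int i = 0; i < 1000; i++)
        sink += i;
}
```

A call such as `measure_min_cycles(sample_workload, 100)` returns the best-case count over 100 runs. RDTSC is not a serializing instruction, so very short intervals can be skewed by out-of-order execution, and the counter does not necessarily tick at the current core clock rate.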
The following parameters control the behavior of the barrier implementation and the test program.
CHECK_RUNS | number of runs in the correctness check |
MEASURE_RUNS | number of runs in the timing |
FASTBARRIER | use the fast, specialized implementation, instead of the lock xadd variant |
NUMABARRIER | use the tree-based barrier instead of the flat barrier for 4 processors (targets Core2 Extreme) |
SAFETY | turns on the pause and mfence instructions for memory consistency paranoia |
CACHELINE | length of the cache line |
D21, D22, D41, D42 | magic numbers to put data in good cache line locations |
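To make the parameters above more concrete, here is a minimal sketch of a flat, user-space spinning barrier in C11. It is not the implementation in smp.c: it uses the generic sense-reversing technique, `atomic_fetch_add` (which typically compiles to `lock xadd` on x86), and an assumed 64-byte `CACHELINE`. All names, including `spin_barrier_t`, `run_barrier_check`, and `MAX_THREADS`, are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

#define CACHELINE   64   /* assumed cache line length in bytes */
#define MAX_THREADS 16   /* hypothetical limit for this demo */

/* Flat sense-reversing barrier; counter and flag sit on separate cache lines. */
typedef struct {
    _Alignas(CACHELINE) atomic_int count;  /* threads arrived in this episode */
    _Alignas(CACHELINE) atomic_int sense;  /* flips once per barrier episode */
    int nthreads;
} spin_barrier_t;

static void spin_barrier_init(spin_barrier_t *b, int nthreads) {
    atomic_init(&b->count, 0);
    atomic_init(&b->sense, 0);
    b->nthreads = nthreads;
}

/* local_sense is per-thread state and must start at 0. */
static void spin_barrier_wait(spin_barrier_t *b, int *local_sense) {
    int my_sense = !*local_sense;
    *local_sense = my_sense;
    if (atomic_fetch_add(&b->count, 1) == b->nthreads - 1) {
        atomic_store(&b->count, 0);         /* last arrival resets the counter */
        atomic_store(&b->sense, my_sense);  /* then releases the waiters */
    } else {
        while (atomic_load(&b->sense) != my_sense)
            sched_yield();  /* keeps the demo usable when threads outnumber
                               cores; a tuned barrier would spin with pause */
    }
}

typedef struct {
    spin_barrier_t *barrier;
    int *slots;
    int id, nthreads, rounds, errors;
} worker_arg_t;

/* Each round: publish the round number, sync, verify all peers, sync again. */
static void *worker(void *p) {
    worker_arg_t *a = p;
    int local_sense = 0;
    for (int r = 1; r <= a->rounds; r++) {
        a->slots[a->id] = r;
        spin_barrier_wait(a->barrier, &local_sense);
        for (int i = 0; i < a->nthreads; i++)
            if (a->slots[i] != r)
                a->errors++;
        spin_barrier_wait(a->barrier, &local_sense);
    }
    return NULL;
}

/* Returns the number of stale values observed; 0 means the barrier held. */
int run_barrier_check(int nthreads, int rounds) {
    assert(nthreads > 0 && nthreads <= MAX_THREADS);
    spin_barrier_t barrier;
    int slots[MAX_THREADS] = {0};
    pthread_t tid[MAX_THREADS];
    worker_arg_t args[MAX_THREADS];
    int errors = 0;

    spin_barrier_init(&barrier, nthreads);
    for (int i = 0; i < nthreads; i++) {
        args[i] = (worker_arg_t){&barrier, slots, i, nthreads, rounds, 0};
        int rc = pthread_create(&tid[i], NULL, worker, &args[i]);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < nthreads; i++) {
        pthread_join(tid[i], NULL);
        errors += args[i].errors;
    }
    return errors;
}
```

Calling `run_barrier_check(2, 1000)` plays a role similar to a correctness check with a fixed number of runs: every thread verifies after each barrier that all peers have reached the same round. The sketch uses sequentially consistent atomics throughout, which is the conservative choice; it does not attempt the specialized, tree-based, or cache-placement optimizations that the parameters above refer to.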
For more information please email Franz Franchetti, franzf (at) ece.cmu.edu.