C++ OpenMP


  • Description: OpenMP for C++ — fork/join, parallel loops and schedules, data sharing, reductions and atomic, sections and tasks, SIMD, synchronization
  • My Notion Note ID: K2A-B2-2
  • Created: 2020-01-13
  • Updated: 2026-04-30
  • License: Reuse is very welcome. Please credit Yu Zhang and link back to the original on yuzhang.io

1. Model

  • Compiler directives (#pragma omp in C/C++, !$omp sentinels in Fortran) → compiler emits threading code for shared-memory parallelism.
  • Small runtime lib (libgomp for GCC, libomp for Clang/LLVM) provides thread management + timing.

1.1 Fork/Join Model

  • Program starts with one thread (master / initial).
  • #pragma omp parallel region → master forks a team of threads.
  • Team executes region concurrently.
  • End of region → implicit barrier; all threads wait, team joins back to master.
  • Execution between parallel regions = serial.
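The fork/join steps above can be sketched as follows. This is a minimal illustration, assuming a build with -fopenmp; the helper name team_ids and the #ifdef _OPENMP fallback are choices made for this sketch (the fallback only keeps the file buildable when OpenMP is switched off).

```cpp
#include <algorithm>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: returns the sorted ids of all threads that ran the region.
std::vector<int> team_ids() {
    std::vector<int> ids;               // declared before the region, so shared
    #pragma omp parallel                // fork: the initial thread becomes thread 0 of a team
    {
        int tid = 0;                    // declared inside the region, so private
#ifdef _OPENMP
        tid = omp_get_thread_num();
#endif
        #pragma omp critical            // push_back on a shared vector must be serialized
        ids.push_back(tid);
    }                                   // implicit barrier, then the team joins
    std::sort(ids.begin(), ids.end());  // serial again: only the initial thread runs this
    return ids;
}
```

Calling team_ids() with OMP_NUM_THREADS=4 would normally give four ids, though the runtime is allowed to provide fewer threads than requested.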

1.2 Compiling and Checking Version

# GCC and Clang
g++   -fopenmp -O2 main.cpp -o main
clang++ -fopenmp -O2 main.cpp -o main

# MSVC
cl /openmp main.cpp
  • Check supported OpenMP version: _OPENMP expands to the yyyymm release date of the supported spec (e.g. 201511 = 4.5, 201811 = 5.0):
echo | cpp -fopenmp -dM | grep -i openmp
# #define _OPENMP 201511

2. Controlling the Number of Threads

  • Priority order, lowest-precedence first:
  1. Implementation default, chosen by the runtime at startup (typically # online cores).
  2. Env var: OMP_NUM_THREADS=4 ./app.
  3. Runtime call: omp_set_num_threads(4);.
  4. Per-region clause: #pragma omp parallel num_threads(4).
  • Useful runtime fns (<omp.h>):
| Function | Returns |
| --- | --- |
| omp_get_thread_num() | Index of the calling thread within the team (0..N-1) |
| omp_get_num_threads() | Size of the active team |
| omp_get_max_threads() | Upper bound the next parallel region can use |
| omp_get_num_procs() | Number of processors visible to the runtime |
| omp_in_parallel() | true if inside an active parallel region |
| omp_get_wtime() | Wall-clock time in seconds (for timing) |
| omp_get_wtick() | Timer resolution |
  • Gotcha: omp_get_num_threads() outside parallel region returns 1, not the team size. For next-region upper bound → omp_get_max_threads().
  • Nested parallelism — disabled by default. Enable via omp_set_max_active_levels(N) (or OMP_MAX_ACTIVE_LEVELS=N). Older omp_set_nested / OMP_NESTED deprecated since OpenMP 5.0. Often oversubscribes — prefer tasks or larger outer parallelism.
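A small probe can make the precedence rules and the omp_get_num_threads() gotcha concrete. This is a sketch under assumptions: the struct TeamSizes, the function probe_team_sizes, and the serial stand-ins in the #else branch are all hypothetical names introduced here.

```cpp
#ifdef _OPENMP
#include <omp.h>
#else
// Serial stand-ins (an assumption of this sketch) so it also builds without -fopenmp.
inline int omp_get_num_threads() { return 1; }
inline int omp_get_max_threads() { return 1; }
#endif

struct TeamSizes { int outside; int inside; int max_before; };

// Hypothetical probe: compares what the query functions report outside and inside a region.
TeamSizes probe_team_sizes() {
    TeamSizes s{};
    s.outside = omp_get_num_threads();     // serial part of the program: reports 1
    s.max_before = omp_get_max_threads();  // bound for a region that has no num_threads clause
    #pragma omp parallel num_threads(3)    // clause takes precedence over env var and runtime call
    {
        #pragma omp single
        s.inside = omp_get_num_threads();  // actual team size; the runtime may grant fewer than 3
    }
    return s;
}
```

The num_threads clause is a request, so code that sizes per-thread buffers should read the team size inside the region rather than assume the requested value.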

3. Parallel Loops

3.1 Syntax

Full form (parallel region + worksharing for):

#pragma omp parallel
{
    #pragma omp for
    for (int i = 0; i < n; ++i) {
        a[i] = b[i] + c[i];
    }
}

Combined form (common):

#pragma omp parallel for
for (int i = 0; i < n; ++i) {
    a[i] = b[i] + c[i];
}
  • Equivalent when parallel region contains exactly one loop.
  • Use separated form when multiple worksharing constructs should share the same team.
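As an illustration of the separated form, the sketch below runs two worksharing loops on one team, so the threads are forked once rather than twice. The function name two_loops and its signature are hypothetical.

```cpp
#include <vector>

// Hypothetical example: b = 2*a, then c = b + 1, both loops executed by the same team.
void two_loops(const std::vector<double>& a, std::vector<double>& b, std::vector<double>& c) {
    const long n = static_cast<long>(a.size());
    b.resize(a.size());                 // resize serially, before any thread touches the data
    c.resize(a.size());
    #pragma omp parallel                // the team is created once
    {
        #pragma omp for
        for (long i = 0; i < n; ++i) b[i] = 2.0 * a[i];
        // implicit barrier at the end of the first loop: all of b is written
        // before any thread starts reading it below
        #pragma omp for
        for (long i = 0; i < n; ++i) c[i] = b[i] + 1.0;
    }
}
```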

3.2 Loop-Form Restrictions

  • To worksharing-parallelize a loop → must be canonical form:

  • Loop variable is integer (or random-access iterator since OpenMP 3.0).

  • Trip count computable before loop starts.

  • Comparison: <, <=, >, >= (OpenMP 5.0 also accepts !=).

  • Increment: ++i, --i, i += k, i -= k, k invariant in loop.

  • No break, goto, return, exception escaping loop body. (continue is fine.)

  • C++ range-based for: accepted as canonical form only since OpenMP 5.0 (the range must provide random-access iterators). With older compilers, convert to an index or iterator loop.
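Two sketches of working within the canonical-form rules above. The first uses a random-access iterator as the loop variable. The second shows one possible way to replace a search loop that would otherwise need break: visit every element and keep the smallest matching index with a min reduction. Both function names are hypothetical, and the second trades early exit for parallelism, which may or may not pay off.

```cpp
#include <vector>

// Random-access iterators are valid loop variables (OpenMP 3.0+).
void double_all(std::vector<int>& v) {
    #pragma omp parallel for
    for (auto it = v.begin(); it < v.end(); ++it) {
        *it *= 2;
    }
}

// A loop with break is not canonical. One workaround: scan everything and
// reduce to the smallest index whose element is negative.
long first_negative(const std::vector<int>& v) {
    const long n = static_cast<long>(v.size());
    long first = n;                          // n means "not found"
    #pragma omp parallel for reduction(min:first)
    for (long i = 0; i < n; ++i) {
        if (v[i] < 0 && i < first) first = i;
    }
    return first;
}
```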

3.3 Schedules

  • schedule clause picks iteration distribution.
| Schedule | What it does | When to use |
| --- | --- | --- |
| static | Equal contiguous chunks, one per thread, fixed at loop entry | Iterations of roughly equal cost |
| static, n | Chunks of size n dealt round-robin to the threads | Cyclic (interleaved) distribution, e.g. when cost trends with i |
| dynamic, n | Threads grab chunks of n from a queue as they finish | Iterations of variable cost |
| guided, n | Like dynamic, but chunk size shrinks over time, down to a minimum of n | Variable-cost work, tail effects |
| auto | Runtime/compiler picks | Trust the implementation |
| runtime | Picked at run time from OMP_SCHEDULE env var (or omp_set_schedule()) | Tune from outside the binary |
#pragma omp parallel for schedule(dynamic, 64)
for (int i = 0; i < n; ++i) {
    process(i);   // each iteration may take very different time
}
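To compare schedules without recompiling, the loop can defer the choice with schedule(runtime). The sketch below uses a made-up triangular workload (iteration i does O(i) work), which is the kind of imbalance where static and dynamic are expected to differ; the function name prefix_totals is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical triangular workload: later iterations cost more than earlier ones.
std::vector<long> prefix_totals(int n) {
    std::vector<long> out(static_cast<std::size_t>(n), 0);
    #pragma omp parallel for schedule(runtime)   // policy is read from OMP_SCHEDULE at run time
    for (int i = 0; i < n; ++i) {
        long s = 0;
        for (int j = 0; j <= i; ++j) s += j;     // O(i) work per iteration
        out[i] = s;                              // each iteration writes only its own element
    }
    return out;
}
```

Running the same binary as OMP_SCHEDULE="static" ./app and then OMP_SCHEDULE="dynamic,16" ./app, and timing each with omp_get_wtime(), is one way to pick a schedule empirically.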
  • Don't parallelize tiny loops blindly. Team-creation overhead dominates for small n. Gate with if(n > threshold):
#pragma omp parallel for if(n > 1024)
for (int i = 0; i < n; ++i) { ... }
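  • Exceptions + OpenMP don't mix. An exception thrown inside a parallel region or worksharing loop must be caught inside it, by the thread that threw it; a program that lets one escape is non-conforming and usually terminates. Catch inside; communicate failure via shared atomic flag; or wrap entire region body in try/catch.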
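One way to apply the catch-inside pattern is sketched below: each iteration catches its own exception and records the failure in a shared flag written atomically, and the caller inspects the flag after the region. The names checked_sqrt_floor and process_all are hypothetical, and the work function is a stand-in for whatever may throw.

```cpp
#include <stdexcept>
#include <vector>

// Hypothetical per-item work that may throw.
inline int checked_sqrt_floor(int v) {
    if (v < 0) throw std::domain_error("negative input");
    int r = 0;
    while ((r + 1) * (r + 1) <= v) ++r;
    return r;
}

// Returns false if any iteration threw. Failed slots in `out` keep the value -1.
bool process_all(const std::vector<int>& in, std::vector<int>& out) {
    out.assign(in.size(), -1);
    int failed = 0;                              // shared flag, only ever set to 1
    const long n = static_cast<long>(in.size());
    #pragma omp parallel for shared(failed)
    for (long i = 0; i < n; ++i) {
        try {
            out[i] = checked_sqrt_floor(in[i]);
        } catch (const std::exception&) {        // caught by the thread that threw
            #pragma omp atomic write
            failed = 1;
        }
    }
    return failed == 0;                          // safe to throw from here, outside the region
}
```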

4. Data Sharing

4.1 Default Rules

Inside parallel region:

  • Vars declared outside region — shared by default.

  • Vars declared inside region — private.

  • Loop iteration vars on worksharing for (also parallel for, taskloop, distribute) — predetermined-private, regardless of where declared.

  • Static + global vars — always shared.

  • Habit: write default(none) + list every variable explicitly. Forces thinking about each; catches accidental sharing.

int sum = 0;
#pragma omp parallel for default(none) shared(a, n) reduction(+:sum)
for (int i = 0; i < n; ++i) {
    sum += a[i];
}

4.2 private, firstprivate, lastprivate, shared

| Clause | Meaning |
| --- | --- |
| shared(x) | All threads see and modify the same x. Programmer is responsible for race-free access. |
| private(x) | Each thread gets its own x. Uninitialized at entry; the original value is invisible inside, and the original is unchanged on exit. |
| firstprivate(x) | Per-thread copy, initialized from the value before the region. |
| lastprivate(x) | Per-thread copy. After the region, the original x receives the value from the thread that ran the last iteration of the loop (or the last section). |
| default(shared) / default(none) | Sets the data-sharing attribute for variables not listed in any clause. With none, every variable referenced in the region must be listed explicitly. |
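The clauses in the table can be seen together in one small loop. This is an illustrative sketch: the struct ClauseDemo and function clause_demo are hypothetical, and the function assumes n >= 1 so that a sequentially last iteration exists for lastprivate.

```cpp
struct ClauseDemo { int last_square; int offset_sum; };

// Assumes n >= 1.
ClauseDemo clause_demo(int n) {
    int offset = 10;   // read inside the loop: firstprivate copies the value 10 into each thread
    int last = -1;     // written inside the loop: lastprivate copies back the final iteration's value
    int sum = 0;       // accumulated across iterations: reduction
    #pragma omp parallel for default(none) shared(n) firstprivate(offset) lastprivate(last) reduction(+:sum)
    for (int i = 0; i < n; ++i) {
        last = i * i;  // after the loop, `last` holds (n-1)*(n-1)
        sum += offset; // with private(offset) instead, this would read an uninitialized value
    }
    return {last, sum};
}
```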
  • private(x) does NOT initialize. Each thread's copy starts uninitialized; original invisible inside; original unchanged on exit. Use firstprivate if you need previous value.

  • Watch for false sharing. Multiple threads writing different bytes of same cache line (common with per-thread accumulators) → line ping-pongs between cores' caches; speedup collapses.

  • Pad per-thread data to cache line (typically 64 bytes), or restructure so each thread's working set is isolated.
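A sketch of the padding idea, assuming a 64-byte cache line and C++17 (so that std::vector honors the over-alignment). PaddedCount and count_even are hypothetical names. A plain reduction would be simpler for this particular sum; the point here is only the layout of the per-thread slots.

```cpp
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// One counter per cache line (64 bytes assumed), so threads never write to the same line.
struct alignas(64) PaddedCount { long value = 0; };

long count_even(const std::vector<int>& v) {
    int max_threads = 1;
#ifdef _OPENMP
    max_threads = omp_get_max_threads();        // upper bound on the team size below
#endif
    std::vector<PaddedCount> slots(static_cast<std::size_t>(max_threads));
    const long n = static_cast<long>(v.size());
    #pragma omp parallel
    {
        int tid = 0;
#ifdef _OPENMP
        tid = omp_get_thread_num();
#endif
        #pragma omp for
        for (long i = 0; i < n; ++i) {
            if (v[i] % 2 == 0) ++slots[tid].value;   // each thread touches only its own slot
        }
    }
    long total = 0;
    for (const PaddedCount& s : slots) total += s.value;   // serial combine after the join
    return total;
}
```

Replacing PaddedCount with a bare long would keep the result correct but place neighboring counters on one cache line, which is the false-sharing layout the note warns about.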

4.3 Reductions

  • For accumulating across iterations → use reduction, not critical section.
  • Compiler gives each thread a private accumulator; combines at end with operator you specify. Far cheaper.
double sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < n; ++i) {
    sum += a[i] * b[i];
}
  • Built-in reduction operators: +, *, &, |, ^, &&, ||, min, max.
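  • - deprecated since OpenMP 5.2. reduction(-:x) always combined the per-thread partial results by addition, so reduction(+:x) with the same loop body (or + with negated values) is an equivalent replacement.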
  • OpenMP 4.0 added user-defined reductions via #pragma omp declare reduction.
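A user-defined reduction might look like the sketch below, which tracks the minimum and maximum of a vector in one pass. The type Range, the combiner merge_range, the reduction identifier range_merge, and the function find_range are all hypothetical names. The initializer copies the original value into each private copy, which works here because merging a range with itself changes nothing.

```cpp
#include <algorithm>
#include <vector>

struct Range { double lo; double hi; };

// Combiner: folds `in` into `out`.
inline void merge_range(Range& out, const Range& in) {
    out.lo = std::min(out.lo, in.lo);
    out.hi = std::max(out.hi, in.hi);
}

// omp_out / omp_in / omp_priv / omp_orig are the names the OpenMP spec defines for these clauses.
#pragma omp declare reduction(range_merge : Range : merge_range(omp_out, omp_in)) \
    initializer(omp_priv = omp_orig)

// Assumes v is non-empty.
Range find_range(const std::vector<double>& v) {
    Range r{v[0], v[0]};
    const long n = static_cast<long>(v.size());
    #pragma omp parallel for reduction(range_merge : r)
    for (long i = 1; i < n; ++i) {
        r.lo = std::min(r.lo, v[i]);   // updates this thread's private copy
        r.hi = std::max(r.hi, v[i]);
    }
    return r;                          // private copies have been merged into r
}
```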

4.4 Critical Sections and atomic

  • When a real critical section is unavoidable (e.g. mutating a shared container):
#pragma omp parallel for
for (int i = 0; i < n; ++i) {
    int v = compute(i);
    #pragma omp critical
    {
        global_log.push_back(v);
    }
}
  • For single scalar update — atomic much cheaper than critical:
#pragma omp atomic
counter += 1;

#pragma omp atomic update
total += a[i];

#pragma omp atomic capture
{ old = counter; counter += 1; }
  • atomic → typically a single hardware atomic instruction. critical → runtime lock (all unnamed critical sections share the same one). Use atomic whenever the op fits its restricted forms.

  • Loop writing to shared variable without reduction/atomic/critical = data race. Result undefined. May pass tests on some hardware, fail on others. No compile-time check. Always pick one of the three.
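As an example of where atomic capture fits, the sketch below lets each thread claim a unique output slot by reading and incrementing a shared cursor in one atomic step. The function name positive_indices is hypothetical; the final sort is only there because slot order depends on thread timing.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Collects the indices of all positive elements of v.
std::vector<int> positive_indices(const std::vector<int>& v) {
    std::vector<int> out(v.size());              // worst case: every element qualifies
    int next = 0;                                // shared cursor into `out`
    const int n = static_cast<int>(v.size());
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        if (v[i] > 0) {
            int slot;
            #pragma omp atomic capture
            { slot = next; next += 1; }          // read old value and increment, atomically
            out[slot] = i;                       // slot is unique to this iteration, so no race
        }
    }
    out.resize(static_cast<std::size_t>(next));
    std::sort(out.begin(), out.end());           // make the result order deterministic
    return out;
}
```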

5. Sections and Tasks

Sections — parallelize fixed set of unrelated work blocks:

#pragma omp parallel sections
{
    #pragma omp section
    do_a();
    #pragma omp section
    do_b();
    #pragma omp section
    do_c();
}
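A concrete version of the sections pattern, where three independent summaries of the same read-only data are computed side by side. Stats and summarize are hypothetical names; each section writes a different member, so no synchronization is needed beyond the implicit barrier at the end.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

struct Stats { int min; int max; long sum; };

// Assumes v is non-empty.
Stats summarize(const std::vector<int>& v) {
    Stats s{};
    #pragma omp parallel sections
    {
        #pragma omp section
        s.min = *std::min_element(v.begin(), v.end());
        #pragma omp section
        s.max = *std::max_element(v.begin(), v.end());
        #pragma omp section
        s.sum = std::accumulate(v.begin(), v.end(), 0L);
    }   // implicit barrier: all three members are set past this point
    return s;
}
```

With three sections, at most three threads do useful work, so this form suits a small fixed set of coarse blocks rather than scalable data parallelism.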

Tasks (OpenMP 3.0+) — parallelize irregular work (recursion, dynamic graphs, anywhere iteration count unknown upfront):

int fib(int n) {
    if (n < 2) return n;
    int x, y;
    #pragma omp task shared(x)
    x = fib(n - 1);
    #pragma omp task shared(y)
    y = fib(n - 2);
    #pragma omp taskwait
    return x + y;
}

int main() {
    int r;
    #pragma omp parallel
    #pragma omp single
    r = fib(20);
}
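Besides recursion, a common fit for tasks is a traversal whose length is unknown up front. The sketch below walks a linked list on one thread and creates one task per node for the rest of the team to execute. Node and square_all are hypothetical names.

```cpp
struct Node { int value; Node* next; };

// Squares the value stored in every node of the list.
void square_all(Node* head) {
    #pragma omp parallel
    #pragma omp single                       // one thread walks the list and creates tasks
    {
        for (Node* p = head; p != nullptr; p = p->next) {
            #pragma omp task firstprivate(p) // each task captures its own copy of the pointer
            p->value *= p->value;
        }
    }   // all tasks are complete after the barrier that ends the single/parallel region
}
```

One task per node is only worthwhile when the per-node work is much larger than a single multiplication; for cheap work, the task overhead is likely to dominate.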
  • taskloop (OpenMP 4.5) — task-based alternative to parallel for. Use for very uneven iterations or composability with other task work.
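A minimal taskloop sketch, assuming an OpenMP 4.5+ compiler. The function name scale and the grainsize value of 256 are arbitrary choices for illustration; grainsize asks the runtime to put roughly that many iterations into each generated task.

```cpp
#include <vector>

// Multiplies every element of v by f using tasks instead of a worksharing loop.
void scale(std::vector<double>& v, double f) {
    const long n = static_cast<long>(v.size());
    #pragma omp parallel
    #pragma omp single                    // one thread generates the tasks
    #pragma omp taskloop grainsize(256)   // the loop is cut into tasks executed by the team
    for (long i = 0; i < n; ++i) {
        v[i] *= f;
    }
}
```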

6. SIMD

  • #pragma omp simd — asks compiler to vectorize loop using SIMD instructions (no threads). Combine with parallel for for both.
#pragma omp simd
for (int i = 0; i < n; ++i) {
    a[i] = b[i] * c[i];
}

#pragma omp parallel for simd
for (int i = 0; i < n; ++i) {
    a[i] = std::sqrt(b[i] * b[i] + c[i] * c[i]);
}
  • #pragma omp declare simd on function → compiler generates vector version callable from inside simd loop.
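The declare simd directive and a simd loop with a reduction can be combined as in the sketch below. The names clamp01 and sum_clamped are hypothetical. Whether the loop is actually vectorized depends on the compiler and flags, so this shows the annotations rather than a guaranteed speedup.

```cpp
// Ask the compiler to also generate a vector version of this function.
#pragma omp declare simd
inline float clamp01(float x) {
    return x < 0.0f ? 0.0f : (x > 1.0f ? 1.0f : x);
}

// Sums the inputs after clamping each one to [0, 1].
float sum_clamped(const float* a, int n) {
    float s = 0.0f;
    #pragma omp simd reduction(+:s)   // each SIMD lane keeps a partial sum, combined at the end
    for (int i = 0; i < n; ++i) {
        s += clamp01(a[i]);
    }
    return s;
}
```

Because the reduction reorders floating-point additions, results can differ in the last bits from a strictly sequential sum for general inputs.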

7. Synchronization

| Directive | What it does |
| --- | --- |
| #pragma omp barrier | All threads wait until every thread reaches the barrier |
| #pragma omp single | Only one (unspecified) thread executes the block; others wait at end |
| #pragma omp master | Only the master thread executes; no implicit barrier at end (deprecated in favor of masked since OpenMP 5.1) |
| #pragma omp ordered | Inside a loop that carries the ordered clause, forces this block to run in iteration order |
| nowait clause | On for/single/sections, skips the implicit end-barrier |
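Several of the directives in the table can be seen in one region. The sketch below is deliberately contrived: the nowait followed by an explicit barrier behaves like the default loop barrier and is written out only to show where each synchronization point sits. The function name pipeline is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Fills data with 1 + i for i in [0, n) and returns the sum of all elements.
int pipeline(int n, std::vector<int>& data) {
    int total = 0;
    #pragma omp parallel
    {
        #pragma omp single
        data.assign(static_cast<std::size_t>(n), 1);   // one thread allocates; others wait at the end of single

        #pragma omp for nowait                         // no barrier at the end of this loop
        for (int i = 0; i < n; ++i) data[i] += i;

        #pragma omp barrier                            // explicit: every write above is finished

        #pragma omp master                             // only thread 0 runs the sum; no barrier after it
        for (int i = 0; i < n; ++i) total += data[i];
    }   // implicit barrier at the end of the region makes total visible to the caller
    return total;
}
```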

8. References