Description: OpenMP for C++ — fork/join, parallel loops and schedules, data sharing, reductions and atomic, sections and tasks, SIMD, synchronization
My Notion Note ID: K2A-B2-2
Created: 2020-01-13
Updated: 2026-04-30
License: Reuse is very welcome. Please credit Yu Zhang and link back to the original on yuzhang.io

1. Model
- 1.1 Fork/Join Model
- 1.2 Compiling and Checking Version
2. Controlling the Number of Threads
3. Parallel Loops
4. Data Sharing
5. Sections and Tasks
6. SIMD
7. Synchronization
8. References

1. Model

#pragma omp directives in C/C++/Fortran → compiler emits threading code for shared-memory parallelism.
Small runtime lib (libgomp for GCC, libomp for Clang/LLVM) provides thread management + timing.

1.1 Fork/Join Model

Program starts with one thread (master / initial).
#pragma omp parallel region → master forks a team of threads.
Team executes region concurrently.
End of region → implicit barrier; all threads wait, team joins back to master.
Execution between parallel regions = serial.
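
A minimal sketch of the fork/join cycle described above. `team_size` is a hypothetical helper, not an OpenMP call; it counts team members with a reduction so that no runtime header is needed. Built without -fopenmp, the pragma is ignored and the function simply returns 1.

```cpp
// Fork/join in miniature: count how many threads execute the region.
int team_size() {
    int members = 0;                 // serial part: only the initial thread runs
    #pragma omp parallel reduction(+:members)
    {
        members += 1;                // fork: every thread of the team runs this once
    }                                // join: implicit barrier, per-thread counts are summed
    return members;                  // serial again
}
```

The returned value is whatever team size the runtime actually granted, so it can differ between machines and runs.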

1.2 Compiling and Checking Version

# GCC and Clang
g++   -fopenmp -O2 main.cpp -o main
clang++ -fopenmp -O2 main.cpp -o main

# MSVC
cl /openmp main.cpp

Check the supported OpenMP version: _OPENMP expands to the yyyymm release date of the spec (e.g. 201511 = 4.5, 201811 = 5.0, 202011 = 5.1, 202111 = 5.2):

echo | cpp -fopenmp -dM | grep -i openmp
# #define _OPENMP 201511
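
The same check can be done from inside the program. This is a sketch only: `openmp_date` and `openmp_version_name` are hypothetical helper names, and the table covers just the spec releases from 2.5 to 5.2.

```cpp
#include <string>

// Value of the _OPENMP macro, or 0 when compiled without OpenMP support.
long openmp_date() {
#ifdef _OPENMP
    return _OPENMP;
#else
    return 0;
#endif
}

// Map a yyyymm release date to the spec version it denotes.
std::string openmp_version_name(long date) {
    switch (date) {
        case 200505: return "2.5";
        case 200805: return "3.0";
        case 201107: return "3.1";
        case 201307: return "4.0";
        case 201511: return "4.5";
        case 201811: return "5.0";
        case 202011: return "5.1";
        case 202111: return "5.2";
        case 0:      return "none";
        default:     return "unknown";
    }
}
```

Calling `openmp_version_name(openmp_date())` reports what the current build was compiled against.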

2. Controlling the Number of Threads

Priority order, lowest-precedence first:

Implementation default, chosen by the runtime at startup (typically the number of online cores).
Env var: OMP_NUM_THREADS=4 ./app.
Runtime call: omp_set_num_threads(4);.
Per-region clause: #pragma omp parallel num_threads(4).

Useful runtime fns (<omp.h>):

Function	Returns
`omp_get_thread_num()`	Index of the calling thread within the team (0..N-1)
`omp_get_num_threads()`	Size of the active team
`omp_get_max_threads()`	Upper bound the next parallel region can use
`omp_get_num_procs()`	Number of processors visible to the runtime
`omp_in_parallel()`	`true` if inside an active parallel region
`omp_get_wtime()`	Wall-clock time in seconds (for timing)
`omp_get_wtick()`	Timer resolution

Gotcha: omp_get_num_threads() outside parallel region returns 1, not the team size. For next-region upper bound → omp_get_max_threads().
Nested parallelism — disabled by default. Enable via omp_set_max_active_levels(N) (or OMP_MAX_ACTIVE_LEVELS=N). Older omp_set_nested / OMP_NESTED deprecated since OpenMP 5.0. Often oversubscribes — prefer tasks or larger outer parallelism.
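
The precedence order and the omp_get_num_threads() gotcha can be checked with a sketch like the one below. `TeamSizes` and `probe_team_sizes` are hypothetical names. The serial fallbacks are an assumption added so the code still builds without -fopenmp. The runtime may grant fewer threads than requested, so the comments give the usual values, not guaranteed ones.

```cpp
#ifdef _OPENMP
#include <omp.h>
#else
// Serial fallbacks (assumption: lets the sketch build without -fopenmp).
inline int  omp_get_num_threads() { return 1; }
inline int  omp_get_max_threads() { return 1; }
inline void omp_set_num_threads(int) {}
#endif

struct TeamSizes { int outside; int max_before; int inside; };

TeamSizes probe_team_sizes() {
    TeamSizes s{};
    omp_set_num_threads(4);                 // runtime call: overrides OMP_NUM_THREADS
    s.outside    = omp_get_num_threads();   // 1, because no team is active here
    s.max_before = omp_get_max_threads();   // usually 4: bound for the next region
    #pragma omp parallel num_threads(2)     // clause: overrides the runtime call
    {
        #pragma omp single
        s.inside = omp_get_num_threads();   // at most 2
    }
    return s;
}
```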

3. Parallel Loops

3.1 Syntax

Full form (parallel region + worksharing for):

#pragma omp parallel
{
    #pragma omp for
    for (int i = 0; i < n; ++i) {
        a[i] = b[i] + c[i];
    }
}

Combined form (common):

#pragma omp parallel for
for (int i = 0; i < n; ++i) {
    a[i] = b[i] + c[i];
}

Equivalent when parallel region contains exactly one loop.
Use separated form when multiple worksharing constructs should share the same team.
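
A sketch of the separated form with two worksharing loops on one team. `add_then_mirror` is a hypothetical function, and all four vectors are assumed to have the same length. The second loop reads elements written by other threads, so it relies on the implicit barrier at the end of the first loop.

```cpp
#include <vector>

// a = b + c, then d[i] = a[i] + a[n-1-i]; the team is created only once.
void add_then_mirror(const std::vector<double>& b, const std::vector<double>& c,
                     std::vector<double>& a, std::vector<double>& d) {
    const long long n = static_cast<long long>(b.size());
    #pragma omp parallel
    {
        #pragma omp for
        for (long long i = 0; i < n; ++i) a[i] = b[i] + c[i];
        // Implicit barrier: all of a is written before any thread continues.
        #pragma omp for
        for (long long i = 0; i < n; ++i) d[i] = a[i] + a[n - 1 - i];
    }
}
```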

3.2 Loop-Form Restrictions

To worksharing-parallelize a loop → must be canonical form:
Loop variable is integer (or random-access iterator since OpenMP 3.0).
Trip count computable before loop starts.
Comparison: <, <=, >, >= (and != since OpenMP 5.0).
Increment: ++i, --i, i += k, i -= k, k invariant in loop.
No break, goto, return, exception escaping loop body. (continue is fine.)
C++ range-based for: not accepted before OpenMP 5.0. Since 5.0 a range-based for over a random-access range counts as canonical form. For older compilers, convert to an index loop.
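
A sketch of rewriting an early-exit search into canonical form. A serial version would break at the first hit; here every iteration runs and a min reduction (OpenMP 3.1+) keeps the smallest matching index. `first_negative` is a hypothetical example function.

```cpp
#include <vector>

// Index of the first negative element, or v.size() if there is none.
long long first_negative(const std::vector<int>& v) {
    const long long n = static_cast<long long>(v.size());
    long long first = n;
    #pragma omp parallel for reduction(min:first)
    for (long long i = 0; i < n; ++i) {
        if (v[i] < 0 && i < first) first = i;   // no break: trip count stays fixed
    }
    return first;
}
```

The price is that the loop always scans the whole input, which only pays off when hits are rare or late.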

3.3 Schedules

schedule clause picks iteration distribution.

Schedule	What it does	When to use
`static`	Roughly equal contiguous chunks, one per thread, fixed at loop entry	Iterations of roughly equal cost
`static, n`	Chunks of size `n` dealt round-robin to threads	Cost drifts with `i` (e.g. triangular loops); balances load with no scheduling overhead
`dynamic, n`	Threads grab chunks of `n` from a queue as they finish	Iterations of variable cost
`guided, n`	Like dynamic, but chunk size shrinks over time, down to a minimum of `n`	Variable-cost work, tail effects
`auto`	Runtime/compiler picks	Trust the implementation
`runtime`	Picked from `OMP_SCHEDULE` env var	Tune from outside the binary

#pragma omp parallel for schedule(dynamic, 64)
for (int i = 0; i < n; ++i) {
    process(i);   // each iteration may take very different time
}
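
To see a distribution directly, one can record which thread ran each iteration. The sketch below does this for schedule(static, 2) on a requested team of two. `iteration_owners` is a hypothetical helper and the serial fallback is an assumption for builds without -fopenmp. With a full team of two and n = 8, the expected pattern is 0 0 1 1 0 0 1 1; a smaller team gives all zeros.

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
inline int omp_get_thread_num() { return 0; }   // serial fallback (assumption)
#endif

// owner[i] = index of the thread that executed iteration i.
std::vector<int> iteration_owners(int n) {
    std::vector<int> owner(n, -1);
    #pragma omp parallel for schedule(static, 2) num_threads(2)
    for (int i = 0; i < n; ++i) {
        owner[i] = omp_get_thread_num();
    }
    return owner;
}
```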

Don't parallelize tiny loops blindly. Team-creation overhead dominates for small n. Gate with if(n > threshold):

#pragma omp parallel for if(n > 1024)
for (int i = 0; i < n; ++i) { ... }

Exceptions + OpenMP don't mix. Exception inside parallel region must not propagate out. Catch inside; communicate failure via shared atomic flag; or wrap entire region body in try/catch.
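
A sketch of the catch-inside pattern with a shared flag. `checked_half` and `halve_all` are hypothetical; the flag is written with atomic write (OpenMP 3.1+) and read only after the region has ended.

```cpp
#include <stdexcept>
#include <vector>

// Hypothetical per-item work that throws on bad input.
static int checked_half(int x) {
    if (x < 0) throw std::invalid_argument("negative input");
    return x / 2;
}

// Halves every element; returns false if any iteration threw.
bool halve_all(std::vector<int>& v) {
    int failed = 0;                                  // shared flag
    const long long n = static_cast<long long>(v.size());
    #pragma omp parallel for
    for (long long i = 0; i < n; ++i) {
        try {
            v[i] = checked_half(v[i]);
        } catch (const std::exception&) {            // never let it leave the region
            #pragma omp atomic write
            failed = 1;
        }
    }
    return failed == 0;
}
```

Iterations that did not throw still complete, so the caller has to decide whether a partially updated vector is acceptable.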

4. Data Sharing

4.1 Default Rules

Inside parallel region:

Vars declared outside region — shared by default.
Vars declared inside region — private.
Loop iteration vars on worksharing for (also parallel for, taskloop, distribute) — predetermined-private, regardless of where declared.
Static + global vars — always shared.
Habit: write default(none) + list every variable explicitly. Forces thinking about each; catches accidental sharing.

int sum = 0;
#pragma omp parallel for default(none) shared(a, n) reduction(+:sum)
for (int i = 0; i < n; ++i) {
    sum += a[i];
}

4.2 `private`, `firstprivate`, `lastprivate`, `shared`

Clause	Meaning
`shared(x)`	All threads see and modify the same `x`. Programmer is responsible for race-free access.
`private(x)`	Each thread gets its own `x`. Uninitialized at entry; the original value is invisible inside, and the original is unchanged on exit.
`firstprivate(x)`	Per-thread copy, initialized from the value before the region.
`lastprivate(x)`	Per-thread copy. After the region, the original `x` receives the value from the thread that ran the last iteration of the loop (or the last `section`).
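`default(shared | none)`	Sets the data-sharing attribute for variables not named in any clause. With `none`, every variable used in the region must be listed explicitly.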

private(x) does NOT initialize. Each thread's copy starts uninitialized; original invisible inside; original unchanged on exit. Use firstprivate if you need previous value.
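
A small sketch of firstprivate and lastprivate working together. `shift_all` is a hypothetical function and assumes a non-empty vector, since with zero iterations nothing is copied back into `last`.

```cpp
#include <vector>

// Adds offset to every element; returns the value written by the
// sequentially last iteration.
int shift_all(std::vector<int>& v, int offset) {
    const int n = static_cast<int>(v.size());
    int last = 0;
    #pragma omp parallel for firstprivate(offset) lastprivate(last)
    for (int i = 0; i < n; ++i) {
        last = v[i] + offset;   // offset: per-thread copy holding the caller's value
        v[i] = last;
    }
    return last;                // copied out from iteration n - 1
}
```

Swapping firstprivate(offset) for private(offset) would leave each thread's `offset` uninitialized.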
Watch for false sharing. Multiple threads writing different bytes of same cache line (common with per-thread accumulators) → line ping-pongs between cores' caches; speedup collapses.
Pad per-thread data to cache line (typically 64 bytes), or restructure so each thread's working set is isolated.
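
A sketch of padded per-thread accumulators. The 64-byte line size is an assumption taken from the typical value above; `PaddedSum` and `padded_sum` are hypothetical names, and the serial fallbacks are there only so the code builds without -fopenmp. For a plain sum a reduction clause is simpler; the padding pattern matters when per-thread state is richer than one scalar.

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
inline int omp_get_max_threads() { return 1; }   // serial fallbacks (assumption)
inline int omp_get_thread_num()  { return 0; }
#endif

// One accumulator per 64-byte slot, so two threads never write the same line.
// (Over-aligned heap allocation is only guaranteed from C++17 on; the 64-byte
// element size keeps the values on separate lines either way.)
struct alignas(64) PaddedSum { double value = 0.0; };

double padded_sum(const std::vector<double>& a) {
    std::vector<PaddedSum> partial(omp_get_max_threads());
    const long long n = static_cast<long long>(a.size());
    #pragma omp parallel
    {
        PaddedSum& mine = partial[omp_get_thread_num()];
        #pragma omp for
        for (long long i = 0; i < n; ++i) mine.value += a[i];
    }
    double total = 0.0;
    for (const PaddedSum& p : partial) total += p.value;   // serial combine
    return total;
}
```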

4.3 Reductions

For accumulating across iterations → use reduction, not critical section.
Compiler gives each thread a private accumulator; combines at end with operator you specify. Far cheaper.

double sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < n; ++i) {
    sum += a[i] * b[i];
}

Built-in reduction operators: +, *, &, |, ^, &&, ||, min, max.
The - operator is also accepted, but deprecated since OpenMP 5.2; use + with negated values.
OpenMP 4.0 added user-defined reductions via #pragma omp declare reduction.
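
A sketch of a user-defined reduction that tracks the minimum and maximum in one pass. `Range`, `empty_range`, `widen`, and the reduction identifier `widen_range` are all hypothetical names chosen for this example; `omp_out`, `omp_in`, and `omp_priv` are the special identifiers defined by the directive itself.

```cpp
#include <algorithm>
#include <limits>
#include <vector>

struct Range { double lo; double hi; };

// Identity element: an empty range that any value will widen.
inline Range empty_range() {
    const double inf = std::numeric_limits<double>::infinity();
    return Range{inf, -inf};
}

// Smallest range covering both inputs.
inline Range widen(Range a, Range b) {
    return Range{std::min(a.lo, b.lo), std::max(a.hi, b.hi)};
}

#pragma omp declare reduction(widen_range : Range : omp_out = widen(omp_out, omp_in)) \
    initializer(omp_priv = empty_range())

Range value_range(const std::vector<double>& a) {
    Range r = empty_range();
    const long long n = static_cast<long long>(a.size());
    #pragma omp parallel for reduction(widen_range : r)
    for (long long i = 0; i < n; ++i) {
        r = widen(r, Range{a[i], a[i]});   // each thread widens its private copy
    }
    return r;
}
```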

4.4 Critical Sections and `atomic`

When a real critical section is unavoidable:

#pragma omp parallel for
for (int i = 0; i < n; ++i) {
    int v = compute(i);
    #pragma omp critical
    {
        global_log.push_back(v);
    }
}

For single scalar update — atomic much cheaper than critical:

#pragma omp atomic
counter += 1;

#pragma omp atomic update
total += a[i];

#pragma omp atomic capture
{ old = counter; counter += 1; }

atomic → usually a single hardware atomic instruction. critical → a lock (mutex) around the block. Use atomic whenever the op fits its restricted forms.
Loop writing to shared variable without reduction/atomic/critical = data race. Result undefined. May pass tests on some hardware, fail on others. No compile-time check. Always pick one of the three.
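
A sketch of the atomic option on a histogram, where many threads may hit the same bin. `histogram` is a hypothetical function and assumes every value lies in [0, bins). For a small, fixed number of bins, per-thread histograms merged at the end would avoid the contention entirely.

```cpp
#include <vector>

// Counts how many values fall in each bin.
std::vector<int> histogram(const std::vector<int>& data, int bins) {
    std::vector<int> counts(bins, 0);
    int* c = counts.data();                      // plain pointer: simple lvalue for atomic
    const long long n = static_cast<long long>(data.size());
    #pragma omp parallel for
    for (long long i = 0; i < n; ++i) {
        const int b = data[i];
        #pragma omp atomic
        c[b] += 1;                               // race-free increment of a shared scalar
    }
    return counts;
}
```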

5. Sections and Tasks

Sections — parallelize fixed set of unrelated work blocks:

#pragma omp parallel sections
{
    #pragma omp section
    do_a();
    #pragma omp section
    do_b();
    #pragma omp section
    do_c();
}
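
A self-contained variant of the sections pattern, as a sketch: two unrelated passes over the same read-only data, each writing its own result field. `Summary` and `summarize` are hypothetical names, and the input is assumed to be non-empty.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

struct Summary { double sum; double max; };

Summary summarize(const std::vector<double>& a) {
    Summary s{0.0, 0.0};
    #pragma omp parallel sections
    {
        #pragma omp section
        s.sum = std::accumulate(a.begin(), a.end(), 0.0);
        #pragma omp section
        s.max = *std::max_element(a.begin(), a.end());
    }   // implicit barrier: both fields are ready here
    return s;
}
```

With only two sections, at most two threads do useful work, regardless of team size.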

Tasks (OpenMP 3.0+) — parallelize irregular work (recursion, dynamic graphs, anywhere iteration count unknown upfront):

int fib(int n) {
    if (n < 2) return n;
    int x, y;
    #pragma omp task shared(x)
    x = fib(n - 1);
    #pragma omp task shared(y)
    y = fib(n - 2);
    #pragma omp taskwait
    return x + y;
}

int main() {
    int r;
    #pragma omp parallel
    #pragma omp single
    r = fib(20);
}

taskloop (OpenMP 4.5) — task-based alternative to parallel for. Use for very uneven iterations or composability with other task work.
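
A minimal taskloop sketch. `square_all` is a hypothetical function and the grainsize of 256 is an arbitrary illustrative value, not a tuned one. One thread creates the tasks inside single; the rest of the team picks them up.

```cpp
#include <vector>

// Squares every element using tasks of about 256 iterations each.
void square_all(std::vector<double>& a) {
    const long long n = static_cast<long long>(a.size());
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp taskloop grainsize(256)
        for (long long i = 0; i < n; ++i) a[i] = a[i] * a[i];
        // taskloop waits for its tasks here (implicit taskgroup)
    }
}
```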

6. SIMD

#pragma omp simd — asks compiler to vectorize loop using SIMD instructions (no threads). Combine with parallel for for both.

#pragma omp simd
for (int i = 0; i < n; ++i) {
    a[i] = b[i] * c[i];
}

#pragma omp parallel for simd
for (int i = 0; i < n; ++i) {
    a[i] = std::sqrt(b[i] * b[i] + c[i] * c[i]);
}

#pragma omp declare simd on function → compiler generates vector version callable from inside simd loop.
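
A sketch of declare simd paired with a simd loop. `axpy_one` and `axpy` are hypothetical names; `x` and `y` are assumed to have equal length and not to overlap. Whether the compiler actually vectorizes depends on the target and optimization flags.

```cpp
#include <vector>

// Scalar kernel; the directive asks the compiler to also emit a vector variant.
// uniform(alpha) states that alpha is the same for every lane.
#pragma omp declare simd uniform(alpha)
inline float axpy_one(float alpha, float x, float y) {
    return alpha * x + y;
}

// y = alpha * x + y, element-wise.
void axpy(float alpha, const std::vector<float>& x, std::vector<float>& y) {
    const float* xp = x.data();
    float* yp = y.data();
    const int n = static_cast<int>(y.size());
    #pragma omp simd
    for (int i = 0; i < n; ++i) {
        yp[i] = axpy_one(alpha, xp[i], yp[i]);
    }
}
```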

7. Synchronization

Directive	What it does
`#pragma omp barrier`	All threads wait until every thread reaches the barrier
`#pragma omp single`	Only one (unspecified) thread executes the block; others wait at end
`#pragma omp master`	Only the master thread executes; no implicit barrier at end. Deprecated since OpenMP 5.1 in favor of `masked`
`#pragma omp ordered`	Inside a loop that carries the `ordered` clause, forces this block to run in iteration order
`nowait` clause	On `for`/`single`/`sections`, skips the implicit end-barrier
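
A sketch of where nowait is safe and where a barrier is still needed. The first two loops are independent, so the first can drop its end-barrier. The third loop reads elements written by other threads, so it relies on the implicit barrier that ends the second loop. `squares_plus_reversed_cubes` is a hypothetical function.

```cpp
#include <vector>

// out[i] = i^2 + (n-1-i)^3
std::vector<long long> squares_plus_reversed_cubes(int n) {
    std::vector<long long> sq(n), cube(n), out(n);
    #pragma omp parallel
    {
        #pragma omp for nowait          // no end-barrier: threads move straight on
        for (int i = 0; i < n; ++i) sq[i] = 1LL * i * i;

        #pragma omp for                 // implicit barrier at the end of this loop
        for (int i = 0; i < n; ++i) cube[i] = 1LL * i * i * i;

        // Every thread has finished both loops by now, so cross-thread reads are safe.
        #pragma omp for
        for (int i = 0; i < n; ++i) out[i] = sq[i] + cube[n - 1 - i];
    }
    return out;
}
```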

8. References

OpenMP API Specification — actual standard, organized by version.
OpenMP 5.2 Reference Guide (PDF) — concise card with directives, clauses, runtime calls.
LLVM/Clang OpenMP runtime docs — implementation details for libomp.
GCC libgomp manual — implementation details for libgomp.
LLNL OpenMP Tutorial — practical introduction with examples.
