C++ OpenMP
- Description: OpenMP for C++, covering fork/join, parallel loops and schedules, data sharing, reductions and atomic, sections and tasks, SIMD, synchronization
- My Notion Note ID: K2A-B2-2
- Created: 2020-01-13
- Updated: 2026-04-30
- License: Reuse is very welcome. Please credit Yu Zhang and link back to the original on yuzhang.io
Table of Contents
- 1. Model
- 2. Controlling the Number of Threads
- 3. Parallel Loops
- 4. Data Sharing
- 5. Sections and Tasks
- 6. SIMD
- 7. Synchronization
- 8. References
1. Model
- `#pragma omp` directives in C/C++/Fortran → compiler emits threading code for shared-memory parallelism.
- Small runtime lib (`libgomp` for GCC, `libomp` for Clang/LLVM) provides thread management + timing.
1.1 Fork/Join Model
- Program starts with one thread (master / initial).
- `#pragma omp parallel` region → master forks a team of threads.
- Team executes region concurrently.
- End of region → implicit barrier; all threads wait, team joins back to master.
- Execution between parallel regions = serial.
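The fork/join steps above can be observed with a small sketch. The function name `count_team_members` is made up for illustration; each thread of the team bumps a shared counter once, and the implicit barrier at the end of the region guarantees the count is complete before the serial code reads it.

```cpp
// Counts how many threads executed the parallel region.
// Built without -fopenmp, the pragmas are ignored and the result is 1.
int count_team_members() {
    int members = 0;              // declared outside the region: shared
    #pragma omp parallel          // fork: a team of threads runs the block
    {
        #pragma omp atomic        // one race-free increment per thread
        members += 1;
    }                             // join: implicit barrier, back to one thread
    return members;
}
```

With `OMP_NUM_THREADS=4` the function would typically return 4, though the runtime is free to provide fewer threads.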
1.2 Compiling and Checking Version
# GCC and Clang
g++ -fopenmp -O2 main.cpp -o main
clang++ -fopenmp -O2 main.cpp -o main
# MSVC
cl /openmp main.cpp
- Check supported OpenMP version: `_OPENMP` is a `yyyymm` date stamp identifying the spec version (e.g. `201511` = 4.5, `201811` = 5.0):
echo | cpp -fopenmp -dM | grep -i openmp
# #define _OPENMP 201511
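The same macro can be read from inside the program. The sketch below is illustrative only: `openmp_version_macro` and `openmp_spec_name` are hypothetical helper names, and the lookup lists just the date stamps for 4.0, 4.5, 5.0, 5.1 and 5.2.

```cpp
#include <string>

// Returns the _OPENMP date stamp, or 0 when this translation unit was
// compiled without OpenMP support (for example, -fopenmp was omitted).
long openmp_version_macro() {
#ifdef _OPENMP
    return _OPENMP;
#else
    return 0;
#endif
}

// Maps a date stamp to a spec version string; unknown stamps are echoed back.
std::string openmp_spec_name(long stamp) {
    switch (stamp) {
        case 0:      return "disabled";
        case 201307: return "4.0";
        case 201511: return "4.5";
        case 201811: return "5.0";
        case 202011: return "5.1";
        case 202111: return "5.2";
        default:     return "unknown (" + std::to_string(stamp) + ")";
    }
}
```

Guarding on `#ifdef _OPENMP` is also a common way to keep a code base buildable with and without the OpenMP flag.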
2. Controlling the Number of Threads
- Priority order, lowest precedence first:
  - Implementation default (typically the number of online cores).
  - Env var: `OMP_NUM_THREADS=4 ./app`.
  - Runtime call: `omp_set_num_threads(4);`.
  - Per-region clause: `#pragma omp parallel num_threads(4)`.
- Useful runtime fns (`<omp.h>`):
| Function | Returns |
|---|---|
| `omp_get_thread_num()` | Index of the calling thread within the team (0..N-1) |
| `omp_get_num_threads()` | Size of the active team |
| `omp_get_max_threads()` | Upper bound the next parallel region can use |
| `omp_get_num_procs()` | Number of processors visible to the runtime |
| `omp_in_parallel()` | true if inside an active parallel region |
| `omp_get_wtime()` | Wall-clock time in seconds (for timing) |
| `omp_get_wtick()` | Timer resolution |
- Gotcha: `omp_get_num_threads()` outside parallel region returns 1, not the team size. For next-region upper bound → `omp_get_max_threads()`.
- Nested parallelism: disabled by default. Enable via `omp_set_max_active_levels(N)` (or `OMP_MAX_ACTIVE_LEVELS=N`). Older `omp_set_nested` / `OMP_NESTED` deprecated since OpenMP 5.0. Often oversubscribes; prefer tasks or larger outer parallelism.
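A minimal sketch of the `omp_get_num_threads()` gotcha and the `num_threads` clause. `TeamSizes` and `probe_team_sizes` are hypothetical names, and the serial stand-ins in the `#else` branch are an assumption added only so the snippet still builds without `-fopenmp`.

```cpp
#ifdef _OPENMP
#include <omp.h>
#else
// Serial stand-ins (assumption): mimic a one-thread runtime.
inline int omp_get_num_threads() { return 1; }
inline int omp_get_max_threads() { return 1; }
#endif

struct TeamSizes {
    int outside;   // omp_get_num_threads() in serial code
    int inside;    // omp_get_num_threads() inside the region
    int max_next;  // omp_get_max_threads() before the region
};

TeamSizes probe_team_sizes(int requested) {
    TeamSizes s{};
    s.outside  = omp_get_num_threads();   // serial part: always 1
    s.max_next = omp_get_max_threads();   // bound for a region without num_threads
    #pragma omp parallel num_threads(requested)
    {
        #pragma omp single                // one thread records the team size
        s.inside = omp_get_num_threads();
    }
    return s;
}
```

The runtime may hand out fewer threads than requested, so `inside` is at most `requested`, not necessarily equal to it.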
3. Parallel Loops
3.1 Syntax
Full form (parallel region + worksharing for):
#pragma omp parallel
{
#pragma omp for
for (int i = 0; i < n; ++i) {
a[i] = b[i] + c[i];
}
}
Combined form (common):
#pragma omp parallel for
for (int i = 0; i < n; ++i) {
a[i] = b[i] + c[i];
}
- Equivalent when parallel region contains exactly one loop.
- Use separated form when multiple worksharing constructs should share the same team.
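As an illustration of the separated form, the sketch below runs two worksharing loops on one team, so the threads are forked only once. `normalize_in_place` is a made-up example function, and it assumes the elements do not sum to zero.

```cpp
#include <vector>

// Divides every element by the sum of all elements.
void normalize_in_place(std::vector<double>& v) {
    const int n = static_cast<int>(v.size());
    double sum = 0.0;
    #pragma omp parallel
    {
        // Loop 1: each thread sums its share into a private copy of sum.
        #pragma omp for reduction(+:sum)
        for (int i = 0; i < n; ++i) {
            sum += v[i];
        }
        // Implicit barrier: the combined sum is visible to the whole team here.

        // Loop 2: same team, no second fork.
        #pragma omp for
        for (int i = 0; i < n; ++i) {
            v[i] /= sum;
        }
    }
}
```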
3.2 Loop-Form Restrictions
- To worksharing-parallelize a loop → must be canonical form:
  - Loop variable is integer (or random-access iterator since OpenMP 3.0).
  - Trip count computable before loop starts.
  - Comparison: `<`, `<=`, `>`, `>=` (`!=` also accepted since OpenMP 5.0).
  - Increment: `++i`, `--i`, `i += k`, `i -= k`, `k` invariant in loop.
  - No `break`, `goto`, `return`, exception escaping loop body. (`continue` is fine.)
- C++ range-based `for`: not accepted before OpenMP 5.0, which admits it over random-access ranges. With older compilers, convert to an index or iterator loop.
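Two small sketches of staying within the canonical form. Both function names are hypothetical. The first uses a random-access iterator as the loop variable; the second shows one way to replace an early `break` (not allowed) with a logical-or reduction, at the price of always scanning the whole range.

```cpp
#include <vector>

// Iterator loop: begin/end are loop-invariant and the iterator is random-access.
void double_all(std::vector<int>& v) {
    #pragma omp parallel for
    for (auto it = v.begin(); it < v.end(); ++it) {
        *it *= 2;
    }
}

// No early exit: every iteration runs and the per-thread flags are OR-ed.
bool contains_negative(const std::vector<int>& v) {
    const int n = static_cast<int>(v.size());
    bool found = false;
    #pragma omp parallel for reduction(||:found)
    for (int i = 0; i < n; ++i) {
        found = found || (v[i] < 0);
    }
    return found;
}
```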
3.3 Schedules
`schedule` clause picks iteration distribution.
| Schedule | What it does | When to use |
|---|---|---|
| `static` | Equal contiguous chunks, assigned at compile/loop entry | Iterations of roughly equal cost |
| `static, n` | Cyclic chunks of size `n` | Cache-friendly cyclic distribution |
| `dynamic, n` | Threads grab chunks of `n` from a queue as they finish | Iterations of variable cost |
| `guided, n` | Like dynamic, but chunk size shrinks over time | Variable-cost work, tail effects |
| `auto` | Runtime/compiler picks | Trust the implementation |
| `runtime` | Picked from `OMP_SCHEDULE` env var | Tune from outside the binary |
#pragma omp parallel for schedule(dynamic, 64)
for (int i = 0; i < n; ++i) {
process(i); // each iteration may take very different time
}
- Don't parallelize tiny loops blindly. Team-creation overhead dominates for small `n`. Gate with `if(n > threshold)`:
#pragma omp parallel for if(n > 1024)
for (int i = 0; i < n; ++i) { ... }
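- Exceptions + OpenMP don't mix. An exception thrown inside a parallel region must be caught inside that region, by the thread that threw it; it must not propagate out. Catch inside the loop body (or wrap the entire region body in try/catch) and communicate failure via a shared atomic flag.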
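A minimal sketch of the catch-inside pattern. `half_of_non_negative` is a hypothetical stand-in for real work that may throw, and `process_all` is likewise an illustrative name; the failure flag is written with `atomic write` so concurrent failures do not race.

```cpp
#include <stdexcept>
#include <vector>

// Hypothetical per-item work: rejects negative inputs by throwing.
int half_of_non_negative(int v) {
    if (v < 0) throw std::invalid_argument("negative input");
    return v / 2;
}

// Returns true if every item was processed without an exception.
bool process_all(const std::vector<int>& in, std::vector<int>& out) {
    const int n = static_cast<int>(in.size());
    out.assign(in.size(), 0);
    int failed = 0;                        // shared failure flag
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        try {
            out[i] = half_of_non_negative(in[i]);
        } catch (const std::exception&) {  // caught by the throwing thread
            #pragma omp atomic write
            failed = 1;
        }
    }
    return failed == 0;                    // read after the implicit barrier
}
```

The loop still visits the remaining iterations after a failure; stopping early would need a different mechanism.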
4. Data Sharing
4.1 Default Rules
Inside parallel region:
- Vars declared outside region: `shared` by default.
- Vars declared inside region: `private`.
- Loop iteration vars on worksharing `for` (also `parallel for`, `taskloop`, `distribute`): predetermined-private, regardless of where declared.
- Static + global vars: always shared (unless marked `threadprivate`).
- Habit: write `default(none)` + list every variable explicitly. Forces thinking about each; catches accidental sharing.
int sum = 0;
#pragma omp parallel for default(none) shared(a, n) reduction(+:sum)
for (int i = 0; i < n; ++i) {
sum += a[i];
}
4.2 private, firstprivate, lastprivate, shared
| Clause | Meaning |
|---|---|
| `shared(x)` | All threads see and modify the same `x`. Programmer is responsible for race-free access. |
| `private(x)` | Each thread gets its own `x`. Uninitialized at entry; the original value is invisible inside, and the original is unchanged on exit. |
| `firstprivate(x)` | Per-thread copy, initialized from the value before the region. |
| `lastprivate(x)` | Per-thread copy. After the region, the original `x` receives the value from the thread that ran the last iteration of the loop (or the last section). |
| `default(shared \| none)` | Sets the data-sharing attribute for variables not listed in any clause. With `none`, every variable referenced in the region must be listed explicitly. |
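A short sketch of `firstprivate` and `lastprivate` working together. `last_value` is a made-up function; it assumes `n >= 1`, because with zero iterations nothing is assigned to the private copy that `lastprivate` would copy out.

```cpp
// Returns (n-1)*(n-1) + offset, the value computed by the last iteration.
int last_value(int n, int offset) {
    int last = -1;
    // offset: each thread's copy starts with the caller's value.
    // last:   the copy from iteration n-1 is written back after the loop.
    #pragma omp parallel for firstprivate(offset) lastprivate(last)
    for (int i = 0; i < n; ++i) {
        last = i * i + offset;
    }
    return last;
}
```

For example, `last_value(5, 3)` yields 19 regardless of which thread happened to run iteration 4.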
- `private(x)` does NOT initialize. Each thread's copy starts uninitialized; original invisible inside; original unchanged on exit. Use `firstprivate` if you need previous value.
- Watch for false sharing. Multiple threads writing different bytes of same cache line (common with per-thread accumulators) → line ping-pongs between cores' caches; speedup collapses.
- Pad per-thread data to cache line (typically 64 bytes), or restructure so each thread's working set is isolated.
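One possible way to pad per-thread accumulators, sketched under assumptions: the 64-byte figure is the typical value quoted above, not something queried from the hardware; `PaddedCount` and `count_even` are hypothetical names; and the `#else` stand-ins exist only so the snippet builds without `-fopenmp`. For a plain sum like this a `reduction` clause is the simpler tool; the padding pattern matters when per-thread state is richer than a single scalar.

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
// Serial stand-ins (assumption): mimic a one-thread runtime.
inline int omp_get_max_threads() { return 1; }
inline int omp_get_thread_num() { return 0; }
#endif

// sizeof(PaddedCount) is 64, so neighbouring slots never share a 64-byte line.
struct alignas(64) PaddedCount {
    long value = 0;
};

long count_even(const std::vector<int>& v) {
    const int n = static_cast<int>(v.size());
    // One slot per thread the next region may use.
    std::vector<PaddedCount> slots(omp_get_max_threads());
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        if (v[i] % 2 == 0) {
            slots[omp_get_thread_num()].value += 1;  // private slot, no race
        }
    }
    long total = 0;
    for (const PaddedCount& s : slots) {
        total += s.value;                            // serial combine
    }
    return total;
}
```

Storing an over-aligned type in `std::vector` is only guaranteed to honour the alignment from C++17 onward.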
4.3 Reductions
- For accumulating across iterations → use `reduction`, not critical section.
- Compiler gives each thread a private accumulator; combines at end with operator you specify. Far cheaper.
double sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < n; ++i) {
sum += a[i] * b[i];
}
- Built-in reduction operators: `+`, `*`, `&`, `|`, `^`, `&&`, `||`, `min`, `max`.
- `-` deprecated since OpenMP 5.2; use `+` with negated values.
- OpenMP 4.0 added user-defined reductions via `#pragma omp declare reduction`.
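A sketch of a user-defined reduction that finds the smallest value together with its index. All names (`MinLoc`, `pick_min`, `min_loc_identity`, `minloc`, `argmin`) are invented for this example, and the inputs are assumed to be finite.

```cpp
#include <limits>
#include <vector>

struct MinLoc {
    double value;
    int index;
};

// Identity element: larger than any finite value, with index -1.
inline MinLoc min_loc_identity() {
    return MinLoc{std::numeric_limits<double>::infinity(), -1};
}

// Keeps the smaller value; on equal values the smaller index wins,
// which makes the result independent of how threads are combined.
inline MinLoc pick_min(MinLoc a, MinLoc b) {
    if (b.value < a.value) return b;
    if (b.value == a.value && b.index < a.index) return b;
    return a;
}

// omp_out / omp_in / omp_priv are the names OpenMP defines for the combiner
// and the initializer of a declared reduction.
#pragma omp declare reduction(minloc : MinLoc : omp_out = pick_min(omp_out, omp_in)) \
    initializer(omp_priv = min_loc_identity())

// Returns {+inf, -1} for an empty input.
MinLoc argmin(const std::vector<double>& a) {
    const int n = static_cast<int>(a.size());
    MinLoc best = min_loc_identity();
    #pragma omp parallel for reduction(minloc : best)
    for (int i = 0; i < n; ++i) {
        best = pick_min(best, MinLoc{a[i], i});
    }
    return best;
}
```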
4.4 Critical Sections and atomic
- When a real critical section is unavoidable:
#pragma omp parallel for
for (int i = 0; i < n; ++i) {
int v = compute(i);
#pragma omp critical
{
global_log.push_back(v);
}
}
- For single scalar update: `atomic` much cheaper than `critical`:
#pragma omp atomic
counter += 1;
#pragma omp atomic update
total += a[i];
#pragma omp atomic capture
{ old = counter; counter += 1; }
- `atomic` → hardware atomic instruction. `critical` → mutex. Use `atomic` whenever the op fits its restricted forms.
- Loop writing to shared variable without `reduction` / `atomic` / `critical` = data race. Result undefined. May pass tests on some hardware, fail on others. No compile-time check. Always pick one of the three.
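As an example of an update that fits `atomic`, here is a sketch of a shared histogram. `digit_histogram` is a hypothetical function and assumes non-negative inputs; every increment targets a single scalar bin, so no `critical` block is needed.

```cpp
#include <array>
#include <vector>

// Counts how often each last decimal digit (0..9) occurs.
std::array<long, 10> digit_histogram(const std::vector<int>& values) {
    std::array<long, 10> bins{};          // zero-initialized, shared by all threads
    long* b = bins.data();                // plain pointer keeps the atomic operand simple
    const int n = static_cast<int>(values.size());
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        const int d = values[i] % 10;     // assumes values[i] >= 0
        #pragma omp atomic update
        b[d] += 1;                        // race-free scalar increment
    }
    return bins;
}
```

If many threads hit the same bin, per-thread histograms merged at the end may scale better; that trade-off is not measured here.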
5. Sections and Tasks
Sections — parallelize fixed set of unrelated work blocks:
#pragma omp parallel sections
{
#pragma omp section
do_a();
#pragma omp section
do_b();
#pragma omp section
do_c();
}
Tasks (OpenMP 3.0+) — parallelize irregular work (recursion, dynamic graphs, anywhere iteration count unknown upfront):
int fib(int n) {
if (n < 2) return n;
int x, y;
#pragma omp task shared(x)
x = fib(n - 1);
#pragma omp task shared(y)
y = fib(n - 2);
#pragma omp taskwait
return x + y;
}
int main() {
int r;
#pragma omp parallel
#pragma omp single
r = fib(20);
}
`taskloop` (OpenMP 4.5): task-based alternative to `parallel for`. Use for very uneven iterations or composability with other task work.
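A minimal `taskloop` sketch. `scale_all` is a made-up function and the `grainsize(256)` value is an arbitrary illustration, not a tuned number. One thread creates the tasks inside `single`; the rest of the team executes them.

```cpp
#include <vector>

// Multiplies every element by factor using task-based loop parallelism.
void scale_all(std::vector<double>& v, double factor) {
    const int n = static_cast<int>(v.size());
    #pragma omp parallel
    {
        #pragma omp single
        {
            // Iterations are packed into tasks of about 256 iterations each.
            // taskloop waits for its tasks before the single block ends.
            #pragma omp taskloop grainsize(256)
            for (int i = 0; i < n; ++i) {
                v[i] *= factor;
            }
        }
    }
}
```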
6. SIMD
`#pragma omp simd`: asks compiler to vectorize loop using SIMD instructions (no threads). Combine with `parallel for` for both.
#pragma omp simd
for (int i = 0; i < n; ++i) {
a[i] = b[i] * c[i];
}
#pragma omp parallel for simd
for (int i = 0; i < n; ++i) {
a[i] = std::sqrt(b[i] * b[i] + c[i] * c[i]);
}
`#pragma omp declare simd` on function → compiler generates vector version callable from inside `simd` loop.
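A sketch of `declare simd` in use. `axpy_one` and `saxpy` are illustrative names; `uniform(a)` tells the compiler that `a` has the same value in every SIMD lane. Whether the loop is actually vectorized depends on the compiler and flags.

```cpp
#include <vector>

// Scalar kernel; the pragma asks for an additional vector variant.
#pragma omp declare simd uniform(a)
inline float axpy_one(float a, float x, float y) {
    return a * x + y;
}

// y[i] = a * x[i] + y[i]; assumes y.size() >= x.size().
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    const int n = static_cast<int>(x.size());
    const float* xp = x.data();
    float* yp = y.data();
    #pragma omp simd
    for (int i = 0; i < n; ++i) {
        yp[i] = axpy_one(a, xp[i], yp[i]);
    }
}
```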
7. Synchronization
| Directive | What it does |
|---|---|
| `#pragma omp barrier` | All threads wait until every thread reaches the barrier |
| `#pragma omp single` | Only one (unspecified) thread executes the block; others wait at end |
| `#pragma omp master` | Only the master thread executes; no implicit barrier at end (deprecated since OpenMP 5.1 in favor of `masked`) |
| `#pragma omp ordered` | Inside a `for ordered`, forces this block to run in iteration order |
| `nowait` clause | On `for` / `single` / `sections`, skips the implicit end-barrier |
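A sketch of `nowait` next to an implicit barrier. `fill_and_sum` is a hypothetical function. The first two loops touch different arrays, so threads may run ahead from the first into the second; the barrier at the end of the second loop guarantees both arrays are complete before the third loop reads them.

```cpp
#include <vector>

// Fills a[i] = i and b[i] = 2*i, then returns the sum of all a[i] + b[i].
long long fill_and_sum(int n) {
    std::vector<int> a(n), b(n);
    long long total = 0;
    #pragma omp parallel
    {
        #pragma omp for nowait            // no barrier: next loop is independent
        for (int i = 0; i < n; ++i) {
            a[i] = i;
        }

        #pragma omp for                   // implicit barrier at the end
        for (int i = 0; i < n; ++i) {
            b[i] = 2 * i;
        }
        // Every thread has finished both loops once it passes this point.

        #pragma omp for reduction(+:total)
        for (int i = 0; i < n; ++i) {
            total += a[i] + b[i];
        }
    }
    return total;
}
```

Dropping the barrier is only safe because the second loop never reads `a`; with a dependency between the loops, `nowait` would introduce a race.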
8. References
- OpenMP API Specification: actual standard, organized by version.
- OpenMP 5.2 Reference Guide (PDF): concise card with directives, clauses, runtime calls.
- LLVM/Clang OpenMP runtime docs: implementation details for `libomp`.
- GCC libgomp manual: implementation details for `libgomp`.
- LLNL OpenMP Tutorial: practical introduction with examples.