SIMD and SSE
- Description: Single-Instruction Multiple-Data parallelism on x86 — the SSE/AVX/AVX-512 register sets and instruction families, when SIMD pays off, autovectorization vs intrinsics vs portable wrappers (
std::experimental::simd, Highway, xsimd), and the gotchas (alignment, tail handling, downclocking on AVX-512). - My Notion Note ID: K2C-1-1
- Created: 2020-06-01
- Updated: 2026-05-22
- License: Reuse is very welcome. Please credit Yu Zhang and link back to the original on yuzhang.io
Table of Contents
- 1. What SIMD Is
- 2. SSE Family History
- 3. Register and Instruction Layout
- 4. Ways to Get SIMD Code
- 5. When SIMD Helps and When It Doesn't
- 6. Pitfalls
- 7. References
1. What SIMD Is
- Single Instruction, Multiple Data — one CPU instruction operates on a vector of values in parallel.
- One of Flynn's four taxonomy categories (SISD, SIMD, MISD, MIMD).
- On modern x86, the vector width has grown over time:
- MMX (1996) — 64-bit registers (
mm0-mm7), shared with x87 FPU. Integer-only. Obsolete. - SSE family (1999+) — 128-bit
xmmregisters. - AVX (2011) — 256-bit
ymmregisters. - AVX-512 (2016) — 512-bit
zmmregisters, mask registersk0-k7.
- MMX (1996) — 64-bit registers (
- ARM has its own SIMD: NEON (128-bit, ubiquitous on AArch64) and SVE / SVE2 (scalable, length-agnostic).
- On a CPU running at 3 GHz with 256-bit AVX float ops: 8 floats per cycle per FMA unit × 2 FMA units = 16 floats/cycle ≈ 48 GFLOPS per core. The scalar baseline is ~6 GFLOPS. The gap is what makes SIMD worth the trouble.
2. SSE Family History
| Extension | Year | Vector width | Notable additions |
|---|---|---|---|
| SSE | 1999 (Pentium III) | 128-bit xmm |
Single-precision float SIMD, separate register file (no x87/MMX sharing). |
| SSE2 | 2000 (Pentium 4) | 128-bit | Double-precision float, 128-bit integer ops. Baseline for x86-64 — every x86-64 CPU has it. |
| SSE3 | 2004 | 128-bit | Horizontal adds, complex-number ops. |
| SSSE3 | 2006 | 128-bit | PSHUFB byte shuffle, PMADDUBSW. |
| SSE4.1 | 2007 | 128-bit | PMULLD, PMOVZX, BLENDV, dot-product. |
| SSE4.2 | 2008 | 128-bit | String/text processing, CRC32, PCMPESTRI. |
| AVX | 2011 (Sandy Bridge) | 256-bit ymm |
3-operand VEX encoding, double-width floats. |
| AVX2 | 2013 (Haswell) | 256-bit | 256-bit integer ops, gather, FMA. |
| AVX-512 | 2016 (Knights Landing → Skylake-X) | 512-bit zmm |
Mask registers, conflict detection, many sub-extensions (F, CD, BW, DQ, VL, ...). |
| AVX-512 on consumer | mixed — present on Ice Lake, removed on Alder Lake/Raptor Lake (E-cores can't do it), partially back on Zen 4/Zen 5. |
- "SSE" today casually refers to the whole family. The original 1999 SSE is rarely the boundary you care about; SSE2 is the practical baseline for any 64-bit x86 code.
- Detect support at runtime via
cpuid(or__builtin_cpu_supports("avx2")in GCC/Clang).
3. Register and Instruction Layout
3.1 Registers
- MMX: 8× 64-bit
mm0-mm7(aliased with x87 — obsolete). - SSE / SSE2-SSE4.2: 16× 128-bit
xmm0-xmm15(8 in 32-bit mode). - AVX / AVX2: 16× 256-bit
ymm0-ymm15. Lower 128 bits =xmm. - AVX-512: 32× 512-bit
zmm0-zmm31. Lower 256 =ymm, lower 128 =xmm. Plus 8 mask registersk0-k7for predicated ops.
A 128-bit xmm register can hold 4 floats, 2 doubles, 16 bytes, 8 shorts, 4 ints, or 2 int64s — interpretation is per-instruction.
3.2 Operations
- Arithmetic —
ADDPS(add packed singles),MULPD(multiply packed doubles),VFMADD231PS(fused multiply-add). - Comparison —
CMPPS,PCMPEQBproduce a bit-mask result. - Permutation/shuffle —
PSHUFD,PSHUFB,VPERMPSrearrange lanes. Often the bottleneck. - Load/store —
MOVAPS(aligned),MOVUPS(unaligned),VMOVAPS(AVX). Pre-Nehalem (2008), unaligned loads were significantly slower; modern cores have near-identical aligned vs unaligned cost — except across cache-line boundaries. - Mask / predication (AVX-512) —
k-register tells which lanes execute; eliminates a lot of branch-heavy code.
3.3 Calling-convention impact
- AVX/AVX-512 write to wider registers; legacy code only saves the
xmmportion → upper bits get zeroed (VZEROUPPERinstruction). Mixing SSE and AVX without VZEROUPPER causes huge per-call stalls. Compilers handle this; hand-written assembly must not skip it.
4. Ways to Get SIMD Code
Ordered from most-portable / least-effort to most-control:
4.1 Autovectorization
- Compiler turns scalar loops into SIMD automatically.
- Conditions: countable loop, no aliasing, no data dependencies across iterations, simple body, predictable trip count, aligned access (or compiler handles peeling).
- Inspect with
-fopt-info-vec(GCC),-Rpass=loop-vectorize(Clang), or/Qvec-report(MSVC). - Help the compiler:
__restrict__to assert no aliasing, contiguous storage, simple control flow, mark loop with#pragma omp simd.
4.2 OpenMP / #pragma omp simd
- Portable hint to vectorize a loop, even when autovec is too timid.
#pragma omp simd reduction(+:sum)is the typical incantation.
4.3 Portable SIMD wrappers
std::experimental::simd(Parallelism TS v2 — ISO/IEC TS 19570:2018; formalized asstd::simdin C++26).- Google Highway — header-only, supports x86, ARM, RISC-V, WASM. Used by JPEG XL, Chromium.
- xsimd — used by xtensor / mlpack.
- Eve — modern C++20-style, expression templates.
- Write once, the wrapper picks the widest vector type the target supports.
4.4 Intrinsics
- C function names that map 1:1 to SIMD instructions. Define type prefix per width (
__m128,__m256,__m512).
#include <immintrin.h>
// dot product of two arrays of 8 floats with AVX
float dot8(const float* a, const float* b) {
__m256 va = _mm256_loadu_ps(a);
__m256 vb = _mm256_loadu_ps(b);
__m256 vmul = _mm256_mul_ps(va, vb);
// horizontal sum: a bit of shuffling
__m128 lo = _mm256_castps256_ps128(vmul);
__m128 hi = _mm256_extractf128_ps(vmul, 1);
__m128 sum128 = _mm_add_ps(lo, hi);
sum128 = _mm_hadd_ps(sum128, sum128);
sum128 = _mm_hadd_ps(sum128, sum128);
return _mm_cvtss_f32(sum128);
}
- Compile with
-mavx(or-mavx2,-mavx512f). Without the flag, the intrinsic is undefined. - Per-architecture; no portability without
#ifdefwalls.
4.5 Inline / standalone assembly
- Maximum control, minimum portability and readability. Avoid except for the hottest microkernels.
5. When SIMD Helps and When It Doesn't
Helps:
- Tight loops over contiguous arrays of small numeric types.
- Image processing (per-pixel ops), audio (per-sample), physics (per-particle).
- Dense linear algebra — BLAS/LAPACK libs (OpenBLAS, MKL, BLIS, Eigen) are extensively hand-SIMD'd.
- Compression/decompression (zstd, lz4 use SIMD heavily).
- Cryptography hash functions (BLAKE3, SHA-256 with SHA extensions).
- Parsing — JSON (simdjson is the reference example), CSV.
Doesn't help much:
- Branchy code, indirect lookups, pointer chasing — these defeat vectorization.
- Loops dominated by memory bandwidth — SIMD doesn't make DRAM faster.
- Small N — fixed setup cost / function-call overhead eats the gain.
- Data already on GPU — orders of magnitude more parallel there.
6. Pitfalls
- Alignment.
MOVAPSrequires 16-byte alignment; misaligned →#GPfault.MOVUPSworks on any address. Usealignas(32)oraligned_allocwhen in doubt. On modern cores the unaligned penalty is tiny except across cache lines. - Tail handling. Vectorized loop processes
floor(N/W) * Welements; the remainingN mod Welements need scalar cleanup or a mask. Forgetting this = silent UB or skipped data. - AVX-512 downclocking. On Skylake-SP and Cascade Lake, executing 512-bit AVX-512 instructions throttled the core (and sometimes neighbors) by ~10-25%. Workload had to overcome the frequency loss before showing a speedup. Improved on Ice Lake; gone on Sapphire Rapids. Still real on older Xeon-SP fleets.
- VZEROUPPER stall. Calling an SSE routine from AVX code without
VZEROUPPERcauses a ~70-cycle stall on each transition. Compilers insert it automatically; hand-assembly must not omit. - Denormals. SIMD denormal handling is slow; enable FTZ (flush-to-zero) and DAZ (denormals-are-zero) via
_MM_SET_FLUSH_ZERO_MODE/_MM_SET_DENORMALS_ZERO_MODEin numerical hot paths. - CPU dispatch. Binaries built with
-mavx2only run on CPUs with AVX2. Either ship multiple variants + a CPU-dispatch table (__attribute__((target("avx2")))function multi-versioning), or set a baseline (e.g., x86-64-v3 ABI → AVX2+BMI2). - AVX-512 absent on consumer Intel chips (Alder Lake onward) — fragmented. Don't assume it's there; check at runtime.
- Autovec silently giving up. A subtle dependency, aliasing, or function call inside the loop drops vectorization with no error. Check the report flags every time.
7. References
- Intel® Intrinsics Guide — https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html
- Agner Fog's optimization manuals — https://www.agner.org/optimize/
- Google Highway — https://github.com/google/highway
- Daniel Lemire, "Parsing gigabytes of JSON per second" (simdjson) — https://arxiv.org/abs/1902.08318
- Wikipedia: SSE family — https://en.wikipedia.org/wiki/Streaming_SIMD_Extensions
- Wikipedia: AVX — https://en.wikipedia.org/wiki/Advanced_Vector_Extensions
- Wikipedia: AVX-512 — https://en.wikipedia.org/wiki/AVX-512
- Computer Architecture: A Quantitative Approach (Hennessy & Patterson) — SIMD / vector chapter.