Description: Paper note on Calib-Anything — zero-training LiDAR-camera extrinsic calibration via SAM mask-wise attribute consistency; includes paper summary and detailed code walkthrough (arXiv:2306.02656)
My Notion Note ID: K2E-B-C4-1
Created: 2026-03-02
Updated: 2026-06-06
License: Reuse welcome — please credit Yu Zhang and link back to yuzhang.io

1. Summary

Title: Calib-Anything: Zero-training LiDAR-Camera Extrinsic Calibration Method Using Segment Anything
Authors: Zhaotong Luo, Guohang Yan, Yikang Li
Paper: arXiv:2306.02656
Code: OpenCalib/CalibAnything

Calib-Anything proposes a zero-training method for LiDAR-camera extrinsic calibration that uses Segment Anything (SAM) to generate dense 2D masks on RGB images and then searches for the extrinsic transform that maximizes the within-mask consistency of projected LiDAR point attributes. The method avoids explicit 2D-3D correspondences by treating each SAM mask as a region in which LiDAR attributes should be "internally consistent" if the extrinsic is correct.

The paper defines a mask-wise consistency score combining (i) reflectivity/intensity dispersion, (ii) normal-vector agreement, and (iii) a coarse segmentation class obtained from RANSAC plane fitting and Euclidean clustering on the point cloud, and aggregates these scores over all masks to evaluate candidate extrinsics.

Quantitatively, the paper reports improved translation and rotation errors on a KITTI-odometry-based setup compared with baseline [19] and also reports results on an in-house dataset with a HESAI Pandar64 LiDAR and a color camera (FOV = 60 deg).

The official implementation is at OpenCalib/CalibAnything. Several engineering choices visible in the code (mask packing limits, optional mask "edge-only" processing, multi-round shrinking random search, and apparent deviations from the paper's point-count adjustment and weight values) are not fully specified in the short paper and matter for reproduction.

2. Key Contributions

A SAM-enabled, automatic LiDAR-camera extrinsic calibration approach requiring zero extra training — uses SAM as-is, no calibration network trained.
A calibration objective built on mask-wise attribute consistency combining reflectivity/intensity, surface normals, and a coarse point-cloud segmentation class (plane vs. clustered objects), enabling calibration without explicit 2D-3D correspondences.
Experimental validation on (i) a KITTI-odometry-based dataset and (ii) an in-house dataset (Pandar64 + camera), demonstrating accuracy improvements vs. a segmentation-based baseline.

3. Method

3.1 High-level Idea

Given camera intrinsics $K$ (assumed known) and an initial extrinsic guess $T_{\text{init}}$ , the method searches for $T$ such that after projecting LiDAR points onto the image, points landing inside each SAM mask $M_i$ exhibit high consistency in their LiDAR-derived attributes.

flowchart TD
  A[RGB image] -->|Segment Anything| B[2D masks]
  C[LiDAR point cloud] --> D[Normal estimation]
  C --> E[Intensity normalization]
  C --> F[Segmentation class: RANSAC + Euclidean clustering]
  D --> G[Per-point attributes]
  E --> G
  F --> G
  B --> H[Project points using candidate T]
  G --> H
  H --> I[Per-mask point sets]
  I --> J[Consistency score per mask]
  J --> K[Aggregate score]
  K --> L[Search/optimize T]

3.2 Pipeline

Step 1 — 2D Segmentation: Run Segment Anything on the RGB image to generate masks covering the entire image. The repo contains an optional mask post-processing script (processed_mask.py) that for large masks (≥2% of image area) keeps only edge-near regions via gradient-based boundary detection; small masks (<2%) are kept as-is.

Step 2 — Point cloud preprocessing: Each LiDAR point is augmented into an 8-element attribute vector:

p = \{x, y, z, n_x, n_y, n_z, r, c\} \tag{1}

where $n_x, n_y, n_z$ are surface normals (via PCA/KNN neighborhood analysis), $r$ is the normalized intensity, and $c$ is a coarse segmentation class obtained via RANSAC plane fitting (ground/walls) followed by Euclidean clustering on remaining points (vehicles, trees).

Step 3 — Projection: For a candidate extrinsic $T$ , project all LiDAR points onto the image:

\lambda \begin{pmatrix} p_u\\ p_v\\ 1 \end{pmatrix} = K T \begin{pmatrix} p_x\\ p_y\\ p_z\\ 1 \end{pmatrix} \tag{2}

Group projected points by the SAM mask they fall into:

P_i = \{p \mid M_i(p_u,p_v)=1\} \tag{3}

Step 4 — Per-mask consistency scoring: For each mask $M_i$ , compute:

s_i = \big(w_R F_R(P_i) + w_N F_N(P_i) + w_S F_S(P_i)\big)\, f(\|P_i\|) \tag{4}

Note: Eq. (4) uses $F_R$ notation, but Eq. (5) below defines it as $F_I$ . This is a genuine notation inconsistency in the original paper.

Intensity consistency — variance of reflectivity within the mask (low variance = high consistency):

F_I(P_i) = 1-\frac{1}{n}\sum_{p \in P_i}(p_r-\overline{p_r})^2 \tag{5}

Normal consistency — self-similarity of surface normals (high dot products = aligned normals):

F_N(P_i)=\frac{1}{n^2}\sum_{v \in A^T A}|v| \tag{6}

Segment consistency — whether points come from a single segment class ( $k$ controls decay for secondary classes):

F_S(P_i)=\frac{1}{C}\sum_i k^i c_i,\ (i=0,1,\ldots) \tag{7}

C=\sum_i c_i,\ (i=0,1,\ldots) \tag{8}

Point-count adjustment $f(n)$ penalizes masks with very few points:

f(n)=1-k_1 n^{k_2},\ (k_1>0,k_2<0) \tag{9}

Step 5 — Global aggregation: Sum over all masks, weighted by point count:

s=\sum_i w_i s_i,\quad w_i=\frac{\|P_i\|}{\sum_i \|P_i\|} \tag{10}

Paper-stated hyperparameters: $k = 0.4$ in Eq. (7); $w_R = w_N = w_S$ equal (exact numerics unspecified); $k_1=1.5$ , $k_2=-0.4$ for Eq. (9).

3.3 Search Strategy

The paper describes a two-stage search: (1) brute-force rotation search over 3DoF with large strides within $[-A, A]$ degrees, then (2) random search in a smaller 6DoF range.

Inputs: RGB image I, LiDAR point cloud P, camera intrinsics K, initial extrinsic T_init

1) Masks:    M = SAM(I)
2) Point attributes:  normals, normalized intensity, class c (RANSAC + clustering)
   -> p = {x,y,z,nx,ny,nz,r,c}
3) Objective (for candidate T):
     Project all points using K and T
     For each mask M_i:  P_i = {p | M_i(p_u,p_v)=1}
     s_i = (w_R*F_R + w_N*F_N + w_S*F_S) * f(|P_i|)
     s = sum(w_i * s_i)
4) Search:
     Coarse: brute-force 3DoF rotation over [-A, A] with stride s
     Refine: random search in smaller 6DoF range
Return best T

Official-code search strategy (different from paper): BruteForceSearch() exists in the code but is never called. The actual Calibrate() function uses a shrinking random search:

do {
    RandomSearch(search_num, trans_init, DEG2RAD(rot_init), 0);
    rot_init /= 2;
    trans_init /= 1.5;
} while (rot_init > 0.3);

Halves rotation range and shrinks translation range by 1.5× each round until rotation range drops below 0.3 deg. With search_num = 5000, ~5 rounds → ~25,000 total score evaluations.

4. Experimental Setup & Results

4.1 Datasets

Dataset	Description	Unspecified Items
KITTI (odometry-based)	"based on KITTI odometry benchmark"	which sequences, frame count, exact protocol
In-house ("ours")	HESAI Pandar64 + color camera (FOV=60 deg)	scene types, size, annotation/GT method

Metrics: L2 loss and Huber loss for translation ( $\|\Delta t\|$ ) and rotation ( $\|\Delta \alpha\|$ ).

4.2 Main Quantitative Comparison (Paper Table I)

On KITTI, Calib-Anything reduces L2 translation error from 20.2 cm to 10.7 cm and rotation error from 0.34 deg to 0.174 deg relative to baseline [19].

4.3 Qualitative Results (Paper Fig. 4)

5. Ablation & Discussion

No explicit ablation study in the paper — no systematic "remove intensity / normals / class" tables. Component sensitivity is unspecified.

Why SAM masks help: SAM provides object-/region-level grouping without training a dataset-specific segmentation model, enabling calibration constraints across diverse scenes.

Why normals and class help: Normals constrain planar regions (ground, walls); class consistency discourages mixing points from different geometric structures within a mask, improving robustness when intensity alone is insufficient.

When it may struggle: Scenes with weak geometric structure, heavy occlusions, sparse LiDAR returns, or masks with significant overlap/fragmentation could reduce per-mask point support and weaken consistency signals. The paper does not quantify these failure modes.

6. Strengths, Limitations & Future Work

Strengths:

Zero-training: no calibration-specific training; uses SAM as-is
Correspondence-free: avoids explicit 2D-3D correspondences, using mask-wise aggregation
Engineering availability: official code and scripts facilitate reproduction

Limitations:

Dependence on initial guess and local search — numeric search bounds/strides/trials are largely unspecified in the paper
Limited baselines — only compares against a single baseline [19]; no systematic ablations
O(n²) normal scoring — full $N^T N$ matrix per mask, expensive for dense point clouds or large masks

Future work (as stated in the paper):

Evaluate on more datasets
Add more quantitative comparisons
Consider stability-aware evaluation when ground truth may contain error

7. Code Walkthrough

7.1 Hyperparameters & Runtime

Category	Item	Value (paper)	Value (code)	Notes
Consistency weights	$w_R, w_N, w_S$ in Eq. (4)	equal (unspecified numerics)	0.2 / 0.3 / 0.5 (intensity / normal / segment)	Discrepancy: code weights are unequal
Class weighting	$k$ in Eq. (7)	0.4	0.5 (geometric halving: 1, 0.5, 0.25, …)	Discrepancy: different decay factor
Point-count adjustment	$k_1, k_2$ in Eq. (9)	1.5, -0.4	bypassed (`adjust = 1`, commented-out code)	Discrepancy: f(n) not implemented
Normal estimation	neighborhood size	unspecified	K=40 nearest neighbors
Plane fitting (RANSAC)	iterations / thresholds	unspecified	3000 iters / 0.2m distance / 0.2 normal weight
Clustering	tolerance / min-max size	unspecified	config `cluster_tolerance` / min 50, max 10000
Downsampling	voxel size	unspecified	hardcoded 0.1m (config value ignored — bug)
Search	rotation range / trials	unspecified	config `rot_deg`, `trans_m`, `search_num`
Runtime	per calibration	unspecified	~25k score evaluations (~5 rounds × 5000)	No runtime table in paper
SAM AMG settings	stability / NMS / offset	unspecified	0.9 / 0.5 / 0.9 (model: vit_l)	From repo README

7.2 Paper-Code Discrepancies

#	Discrepancy	Details
1	f(n) point-count adjustment bypassed	Paper introduces $f(n) = 1 - k_1 n^{k_2}$ (Eq. 9, Fig. 3), but code sets `adjust = 1` with the power function commented out.
2	Search strategy differs	Paper describes brute-force rotation + random 6DoF refinement. Code defines `BruteForceSearch()` but never calls it — uses shrinking random search instead.
3	Consistency weights differ	Paper states $w_R = w_N = w_S$ (equal). Code uses `0.2intensity + 0.3normal + 0.5segment` plus bonus `-0.0001num_masks`.
4	Segment decay factor differs	Paper states $k = 0.4$ in Eq. (7). Code uses geometric halving ( $k = 0.5$ ).
5	Mask edge processing (code-only)	`processed_mask.py` keeps only edge-near regions for large masks (≥2% of image). Not described in paper.
6	Downsample voxel size ignored	Config provides `down_sample_voxel` but code hardcodes leaf size to 0.1m.
7	Normal scoring is O(n²)	Computes the full $N^T N$ matrix for each mask. Not discussed in paper as a computational concern.

7.3 Project Structure & Build

File	Role
`CMakeLists.txt`	Build system: dependencies, targets
`src/run_lidar2camera.cpp`	Entry point: `main()`, config reader, save result
`src/calibration.cpp`	Core implementation: all calibration logic
`include/calibration.hpp`	Calibrator class declaration, custom point type, JSON params struct
`include/dataloader.hpp`	Static methods to load masks, point clouds, calibration files
`include/utility.hpp`	Math helpers: delta-T from 6-DOF, rotation extraction, stats
`include/logging.hpp`	Printf-based logging macros (defined but never used)
`processed_mask.py`	Python script: edge-only processing for large SAM masks

Build dependencies: PCL 1.10, OpenCV, Eigen 3 (via ${EIGEN_ROOT} env var), Boost system, jsoncpp (linked as static libjsoncpp.a).

Build note: There is a bug in the CMake file where CMAKE_CXX_FLAGS is set to -std=c++14 then immediately overwritten with -pthread, losing the C++14 flag. In practice the compiler's default standard is used.

7.4 Entry Point (`run_lidar2camera.cpp`)

main() flow:

Parse config — reads JSON file via readConfig(json_file, json_params) into a JsonParams struct
Construct Calibrator — Calibrator calibrator(json_params) loads all data and preprocesses point clouds
Run calibration — calibrator.Calibrate() performs the iterative shrinking random search
Save result — SaveExtrinsic(refined_extrinsic) writes the 4×4 matrix to extrinsic.txt

readConfig() JSON fields:

JSON Key	What It Fills
`cam_K`	3×3 or 3×4 camera intrinsic matrix
`cam_dist`	Distortion coefficients
`T_lidar_to_cam`	Initial extrinsic guess (4×4)
`T_lidar_to_cam_gt`	Ground truth extrinsic (optional)
`img_folder` / `mask_folder` / `pc_folder`	File paths (relative to JSON file)
`params.search_range.rot_deg` / `trans_m`	Search range in degrees / meters
`params.min_plane_point_num`	Minimum points for RANSAC plane
`params.cluster_tolerance`	Euclidean clustering distance (meters)
`params.search_num`	Random samples per iteration
`params.point_range.top` / `bottom`	Fraction of image height for FOV check
`params.thread.num_thread`	Thread count (0 = single-threaded)
`params.down_sample.voxel_m`	Voxel size for downsampling (code ignores this — bug)

7.5 Calibrator Constructor

Performs all data loading and preprocessing:

Reads the first image to get IMG_H and IMG_W.
For each frame i:
- Loads all mask PNGs from the frame's mask directory into a packed 4-channel CV_8UC4 image via DataLoader::LoadMaskFile (see § 7.11).
- Loads the point cloud (PCD or BIN format) via DataLoader::LoadLidarFile.
- Calls ProcessPointcloud(pc_origin, pc) to filter, downsample, compute normals, and segment.

7.6 Point Cloud Preprocessing (`ProcessPointcloud`)

void Calibrator::ProcessPointcloud(
    const pcl::PointCloud<pcl::PointXYZI>::Ptr pc_origin,
    pcl::PointCloud<PointXYZINS>::Ptr pc);

Custom PCL point type:

struct PointXYZINS {
    PCL_ADD_POINT4D;      // x, y, z
    PCL_ADD_NORMAL4D;     // normal_x, normal_y, normal_z
    float intensity;      // normalized to [0, 1]
    float curvature;      // surface curvature
    int segment;          // cluster/segment ID (-1 = unsegmented)
};

Step 1: FOV filtering — generates 3^6 = 729 combinations of extreme perturbations of the initial extrinsic (each of 6 DOFs at {-1, 0, +1} × search_range via Util::GenVars()). For each raw point, if it projects onto the image under any of these 729 extrinsics, the point is kept. Tracks intensity_max during this pass.

Step 2: Optional downsampling — two-stage voxel grid: octree at 125m resolution partitions space; within each cell, VoxelGrid at leaf size hardcoded 0.1m downsamples. Config down_sample_voxel is never used (bug).

Step 3: Segmentation — calls Segment_pc() (see § 7.7).

Step 4: Construct PointXYZINS cloud — copies xyz; normalizes intensity as intensity / intensity_max; copies normals and curvature; assigns segment IDs.

7.7 Segmentation (`Segment_pc`)

int Calibrator::Segment_pc(
    const pcl::PointCloud<pcl::PointXYZI>::Ptr cloud,
    pcl::PointCloud<pcl::Normal>::Ptr normals,
    std::vector<pcl::PointIndices>& seg_indices);

Step 1: Normal estimation — KD-tree with K=40 nearest neighbors.

Step 2: Iterative RANSAC plane extraction:

seg.setModelType(pcl::SACMODEL_NORMAL_PLANE);
seg.setNormalDistanceWeight(0.2);
seg.setMethodType(pcl::SAC_RANSAC);
seg.setMaxIterations(3000);
seg.setDistanceThreshold(0.2);  // 0.2m

Loop: fit a plane → if inlier count ≥ min_plane_point_num, store indices, exclude them → repeat until next plane is too small. Extracts ground, walls, other large planar surfaces.

Step 3: Euclidean clustering on non-plane points:

ec.setClusterTolerance(params_.cluster_tolerance);  // from config
ec.setMaxClusterSize(10000);
ec.setMinClusterSize(50);

All plane segments and Euclidean clusters concatenated into seg_indices.

7.8 Scoring Function (`CalScore`)

double Calibrator::CalScore(Eigen::Matrix4f T);

Returns a loss value (lower is better), range approximately [1.0, 2.0].

Step 1: Project all points — for each frame, project every point using ProjectOnImage(). If a point lands on the image, look up the mask IDs at that pixel from the packed 4-channel mask image (up to 4 overlapping masks per pixel). Accumulate per-mask arrays of normals, intensities, and segment IDs.

Step 2: Bottom-of-image sanity check — count points near the bottom. If too few relative to expected density from POINT_PER_PIXEL, assign penalty score of 2.0 (worst) for this frame and skip.

Step 3: Per-mask consistency scoring — for each mask:

Filter: skip if mask pixel count < min(IMG_H * IMG_W / 1200, 2000) or projected point count < 10 or < 10% of expected.
Normal consistency (O(n²)):
```
// N = 3 x num_inside matrix of normal vectors
float normal_sim = (N.transpose() * N).array().abs().mean();
```
Full $N^T N$ matrix, element-wise abs, then mean. If all normals aligned, mean ≈ 1.0.
Intensity consistency: 1 - Util::Std(mask_intensity[i]). Low std → uniform material → high score.
Segment consistency: sorted segment counts weighted geometrically (1, 0.5, 0.25, …). Single-segment mask → 1.0.

Step 4: Weighted aggregation — each mask's score weighted by projected point count, averaged via Util::WeightMean().

Step 5: Final score formula:

score += 2 - 0.3 * normal_score
           - 0.2 * intensity_score
           - 0.5 * segment_score
           - 0.0001 * normal_scores.size();

Component	Weight	Meaning
Normal consistency	0.3	How aligned are surface normals within each mask?
Intensity consistency	0.2	How uniform is LiDAR intensity within each mask?
Segment consistency	0.5	Does each mask contain a single LiDAR segment?
Mask count bonus	0.0001	Tiny bonus per valid mask (encourages more masks)

Base is 2.0 (worst). Perfect alignment → approximately 2 - 0.3 - 0.2 - 0.5 - epsilon = 1.0.

Step 6: Average across frames — score /= N_FILE.

7.9 Search Strategy

RandomSearch(search_count, xyz_range, rpy_range, thread_id):

Sample 6 random values uniformly from [-range, +range] (3 rotation, 3 translation).
Compute candidate extrinsic: T_candidate = extrinsic_ * GetDeltaT(delta).
Score via CalScore(T_candidate).
Track best delta found across search_count iterations.
Acquire mutex, update global max_score_ / best_var_ if better.

RandomSearchThread() spawns num_thread threads, each running RandomSearch with search_count / num_thread iterations, then joins all threads.

Calibrate() — main loop:

float rot_init = search_range_rot + 0.5;     // degrees, padded
float trans_init = search_range_trans + 0.05; // meters, padded

do {
    best_var_ << 0, 0, 0, 0, 0, 0;
    RandomSearchThread(search_num, trans_init, DEG2RAD(rot_init));
    extrinsic_ *= Util::GetDeltaT(best_var_);
    rot_init /= 2;          // halve rotation range
    trans_init /= 1.5;      // reduce translation range by 1.5×
} while (rot_init > 0.3);   // stop when rotation range < 0.3 degrees

Example iteration sequence (starting from rot=5.5 deg, trans=0.55m):

Round	rot_init (deg)	trans_init (m)
1	5.50	0.550
2	2.75	0.367
3	1.375	0.244
4	0.688	0.163
5	0.344	0.109
(stop: 0.172 < 0.3)

~5 rounds × 5000 evaluations = ~25,000 total. Delta applied multiplicatively: extrinsic_ *= deltaT — shrinking random search without gradient information.

BruteForceSearch() exists but is never called.

7.10 Projection (`ProjectOnImage`)

bool Calibrator::ProjectOnImage(
    const Eigen::Vector4f &vec,  // [x, y, z, 1] in LiDAR frame
    const Eigen::Matrix4f &T,    // extrinsic (lidar -> camera)
    int &x, int &y,              // output pixel coordinates
    int margin);

Two modes depending on intrinsic size:

3×4 intrinsic (projection matrix P): vec2 = K * T * vec
3×3 intrinsic (standard K): cam_point = T * vec, then vec2 = K * cam_vec

Rejects if vec2(2) <= 0 (point behind camera). Computes pixel as x = round(vec2[0] / vec2[2]), y = round(vec2[1] / vec2[2]). Returns true if within image bounds.

7.11 Mask Loading & Packing (`DataLoader`)

LoadMaskFile(mask_dir, intrinsic, dist, masks, mask_point_num):

Collects all .png files from the mask directory, sorts alphabetically.
Caps at 255 masks (IDs stored as uint8).
For each mask PNG: reads as grayscale, undistorts if needed, then for each white pixel writes mask_id + 1 into the first available channel of a CV_8UC4 image. A single pixel can belong to up to 4 overlapping masks.

This 4-channel packing is critical because SAM often produces overlapping masks (e.g., a car mask overlapping a larger scene mask).

LoadLidarFile(filename, pc): Supports .pcd (PCL format) and .bin (KITTI format: 4 floats per point packed sequentially).

7.12 Mask Edge Processing (`processed_mask.py`)

For large SAM masks (sky, ground), the filled mask provides little calibration signal — only the edges carry alignment information.

For each mask: binarize, compute gradient via np.gradient(), create edge map.
If n_white < 0.02 * H * W (mask covers <2% of image): copy as-is.
For large masks (≥2%):
- Compute margin_inside = 30 + H * W / n_white (larger masks get smaller margins).
- Dilate each edge pixel by margin_inside pixels.
- Output = original mask AND dilated edges (only the band near the boundary is retained).

Example: For a 1920×1080 image with a mask covering 20% (~414k pixels): margin = 30 + 2073600/414720 = 35 pixels. Only a 35-pixel-wide band inside the mask boundary is kept.

Not described in the paper.

7.13 JSON Config Parameters

{
    "cam_K": { "rows": 3, "cols": 3, "data": [[fx, 0, cx], [0, fy, cy], [0, 0, 1]] },
    "cam_dist": { "cols": 5, "data": [k1, k2, p1, p2, k3] },
    "T_lidar_to_cam": { "rows": 4, "cols": 4, "data": [...] },
    "T_lidar_to_cam_gt": { "available": true, "rows": 4, "cols": 4, "data": [...] },
    "img_folder": "images",
    "mask_folder": "processed_masks",
    "pc_folder": "pc",
    "img_format": ".png",
    "pc_format": ".pcd",
    "file_name": ["000000", "000001"],
    "params": {
        "search_range": { "rot_deg": 5.0, "trans_m": 0.5 },
        "min_plane_point_num": 500,
        "cluster_tolerance": 0.5,
        "search_num": 5000,
        "point_range": { "top": 0.2, "bottom": 0.9 },
        "down_sample": { "is_valid": false, "voxel_m": 0.1 },
        "thread": { "is_multi_thread": true, "num_thread": 8 }
    }
}

References

Ref	Link
Calib-Anything paper	https://arxiv.org/abs/2306.02656
Official code	https://github.com/OpenCalib/CalibAnything
Segment Anything (SAM)	https://arxiv.org/abs/2304.02643
KITTI odometry benchmark	https://www.cvlibs.net/datasets/kitti/eval_odometry.php

Table of Contents