Calib-Anything
- Description: Paper note on Calib-Anything — zero-training LiDAR-camera extrinsic calibration via SAM mask-wise attribute consistency; includes paper summary and detailed code walkthrough (arXiv:2306.02656)
- My Notion Note ID: K2E-B-C4-1
- Created: 2026-03-02
- Updated: 2026-06-06
- License: Reuse welcome — please credit Yu Zhang and link back to yuzhang.io
Table of Contents
- 1. Summary
- 2. Key Contributions
- 3. Method
- 4. Experimental Setup & Results
- 5. Ablation & Discussion
- 6. Strengths, Limitations & Future Work
- 7. Code Walkthrough
- 7.1 Hyperparameters & Runtime
- 7.2 Paper-Code Discrepancies
- 7.3 Project Structure & Build
- 7.4 Entry Point
- 7.5 Calibrator Constructor
- 7.6 Point Cloud Preprocessing
- 7.7 Segmentation
- 7.8 Scoring Function
- 7.9 Search Strategy
- 7.10 Projection
- 7.11 Mask Loading & Packing
- 7.12 Mask Edge Processing
- 7.13 JSON Config Parameters
- References
1. Summary
Title: Calib-Anything: Zero-training LiDAR-Camera Extrinsic Calibration Method Using Segment Anything
Authors: Zhaotong Luo, Guohang Yan, Yikang Li
Paper: arXiv:2306.02656
Code: OpenCalib/CalibAnything
Calib-Anything proposes a zero-training method for LiDAR-camera extrinsic calibration that uses Segment Anything (SAM) to generate dense 2D masks on RGB images and then searches for the extrinsic transform that maximizes the within-mask consistency of projected LiDAR point attributes. The method avoids explicit 2D-3D correspondences by treating each SAM mask as a region in which LiDAR attributes should be "internally consistent" if the extrinsic is correct.
The paper defines a mask-wise consistency score combining (i) reflectivity/intensity dispersion, (ii) normal-vector agreement, and (iii) a coarse segmentation class obtained from RANSAC plane fitting and Euclidean clustering on the point cloud, and aggregates these scores over all masks to evaluate candidate extrinsics.
Quantitatively, the paper reports improved translation and rotation errors on a KITTI-odometry-based setup compared with baseline [19] and also reports results on an in-house dataset with a HESAI Pandar64 LiDAR and a color camera (FOV = 60 deg).
The official implementation is at OpenCalib/CalibAnything. Several engineering choices visible in the code (mask packing limits, optional mask "edge-only" processing, multi-round shrinking random search, and apparent deviations from the paper's point-count adjustment and weight values) are not fully specified in the short paper and matter for reproduction.
2. Key Contributions
- A SAM-enabled, automatic LiDAR-camera extrinsic calibration approach requiring zero extra training — uses SAM as-is, no calibration network trained.
- A calibration objective built on mask-wise attribute consistency combining reflectivity/intensity, surface normals, and a coarse point-cloud segmentation class (plane vs. clustered objects), enabling calibration without explicit 2D-3D correspondences.
- Experimental validation on (i) a KITTI-odometry-based dataset and (ii) an in-house dataset (Pandar64 + camera), demonstrating accuracy improvements vs. a segmentation-based baseline.
3. Method
3.1 High-level Idea
Given camera intrinsics (assumed known) and an initial extrinsic guess , the method searches for such that after projecting LiDAR points onto the image, points landing inside each SAM mask exhibit high consistency in their LiDAR-derived attributes.
flowchart TD
A[RGB image] -->|Segment Anything| B[2D masks]
C[LiDAR point cloud] --> D[Normal estimation]
C --> E[Intensity normalization]
C --> F[Segmentation class: RANSAC + Euclidean clustering]
D --> G[Per-point attributes]
E --> G
F --> G
B --> H[Project points using candidate T]
G --> H
H --> I[Per-mask point sets]
I --> J[Consistency score per mask]
J --> K[Aggregate score]
K --> L[Search/optimize T]
3.2 Pipeline
Step 1 — 2D Segmentation: Run Segment Anything on the RGB image to generate masks covering the entire image. The repo contains an optional mask post-processing script (processed_mask.py) that for large masks (≥2% of image area) keeps only edge-near regions via gradient-based boundary detection; small masks (<2%) are kept as-is.
Step 2 — Point cloud preprocessing: Each LiDAR point is augmented into an 8-element attribute vector:
where are surface normals (via PCA/KNN neighborhood analysis), is the normalized intensity, and is a coarse segmentation class obtained via RANSAC plane fitting (ground/walls) followed by Euclidean clustering on remaining points (vehicles, trees).
Step 3 — Projection: For a candidate extrinsic , project all LiDAR points onto the image:
Group projected points by the SAM mask they fall into:
Step 4 — Per-mask consistency scoring: For each mask , compute:
Note: Eq. (4) uses notation, but Eq. (5) below defines it as . This is a genuine notation inconsistency in the original paper.
- Intensity consistency — variance of reflectivity within the mask (low variance = high consistency):
- Normal consistency — self-similarity of surface normals (high dot products = aligned normals):
- Segment consistency — whether points come from a single segment class ( controls decay for secondary classes):
Point-count adjustment penalizes masks with very few points:
Step 5 — Global aggregation: Sum over all masks, weighted by point count:
Paper-stated hyperparameters: in Eq. (7); equal (exact numerics unspecified); , for Eq. (9).
3.3 Search Strategy
The paper describes a two-stage search: (1) brute-force rotation search over 3DoF with large strides within degrees, then (2) random search in a smaller 6DoF range.
Inputs: RGB image I, LiDAR point cloud P, camera intrinsics K, initial extrinsic T_init
1) Masks: M = SAM(I)
2) Point attributes: normals, normalized intensity, class c (RANSAC + clustering)
-> p = {x,y,z,nx,ny,nz,r,c}
3) Objective (for candidate T):
Project all points using K and T
For each mask M_i: P_i = {p | M_i(p_u,p_v)=1}
s_i = (w_R*F_R + w_N*F_N + w_S*F_S) * f(|P_i|)
s = sum(w_i * s_i)
4) Search:
Coarse: brute-force 3DoF rotation over [-A, A] with stride s
Refine: random search in smaller 6DoF range
Return best T
Official-code search strategy (different from paper): BruteForceSearch() exists in the code but is never called. The actual Calibrate() function uses a shrinking random search:
do {
RandomSearch(search_num, trans_init, DEG2RAD(rot_init), 0);
rot_init /= 2;
trans_init /= 1.5;
} while (rot_init > 0.3);
Halves rotation range and shrinks translation range by 1.5× each round until rotation range drops below 0.3 deg. With search_num = 5000, ~5 rounds → ~25,000 total score evaluations.
4. Experimental Setup & Results
4.1 Datasets
| Dataset | Description | Unspecified Items |
|---|---|---|
| KITTI (odometry-based) | "based on KITTI odometry benchmark" | which sequences, frame count, exact protocol |
| In-house ("ours") | HESAI Pandar64 + color camera (FOV=60 deg) | scene types, size, annotation/GT method |
Metrics: L2 loss and Huber loss for translation () and rotation ().
4.2 Main Quantitative Comparison (Paper Table I)
On KITTI, Calib-Anything reduces L2 translation error from 20.2 cm to 10.7 cm and rotation error from 0.34 deg to 0.174 deg relative to baseline [19].
4.3 Qualitative Results (Paper Fig. 4)
5. Ablation & Discussion
No explicit ablation study in the paper — no systematic "remove intensity / normals / class" tables. Component sensitivity is unspecified.
Why SAM masks help: SAM provides object-/region-level grouping without training a dataset-specific segmentation model, enabling calibration constraints across diverse scenes.
Why normals and class help: Normals constrain planar regions (ground, walls); class consistency discourages mixing points from different geometric structures within a mask, improving robustness when intensity alone is insufficient.
When it may struggle: Scenes with weak geometric structure, heavy occlusions, sparse LiDAR returns, or masks with significant overlap/fragmentation could reduce per-mask point support and weaken consistency signals. The paper does not quantify these failure modes.
6. Strengths, Limitations & Future Work
Strengths:
- Zero-training: no calibration-specific training; uses SAM as-is
- Correspondence-free: avoids explicit 2D-3D correspondences, using mask-wise aggregation
- Engineering availability: official code and scripts facilitate reproduction
Limitations:
- Dependence on initial guess and local search — numeric search bounds/strides/trials are largely unspecified in the paper
- Limited baselines — only compares against a single baseline [19]; no systematic ablations
- O(n²) normal scoring — full matrix per mask, expensive for dense point clouds or large masks
Future work (as stated in the paper):
- Evaluate on more datasets
- Add more quantitative comparisons
- Consider stability-aware evaluation when ground truth may contain error
7. Code Walkthrough
7.1 Hyperparameters & Runtime
| Category | Item | Value (paper) | Value (code) | Notes |
|---|---|---|---|---|
| Consistency weights | in Eq. (4) | equal (unspecified numerics) | 0.2 / 0.3 / 0.5 (intensity / normal / segment) | Discrepancy: code weights are unequal |
| Class weighting | in Eq. (7) | 0.4 | 0.5 (geometric halving: 1, 0.5, 0.25, …) | Discrepancy: different decay factor |
| Point-count adjustment | in Eq. (9) | 1.5, -0.4 | bypassed (adjust = 1, commented-out code) |
Discrepancy: f(n) not implemented |
| Normal estimation | neighborhood size | unspecified | K=40 nearest neighbors | |
| Plane fitting (RANSAC) | iterations / thresholds | unspecified | 3000 iters / 0.2m distance / 0.2 normal weight | |
| Clustering | tolerance / min-max size | unspecified | config cluster_tolerance / min 50, max 10000 |
|
| Downsampling | voxel size | unspecified | hardcoded 0.1m (config value ignored — bug) | |
| Search | rotation range / trials | unspecified | config rot_deg, trans_m, search_num |
|
| Runtime | per calibration | unspecified | ~25k score evaluations (~5 rounds × 5000) | No runtime table in paper |
| SAM AMG settings | stability / NMS / offset | unspecified | 0.9 / 0.5 / 0.9 (model: vit_l) | From repo README |
7.2 Paper-Code Discrepancies
| # | Discrepancy | Details |
|---|---|---|
| 1 | f(n) point-count adjustment bypassed | Paper introduces (Eq. 9, Fig. 3), but code sets adjust = 1 with the power function commented out. |
| 2 | Search strategy differs | Paper describes brute-force rotation + random 6DoF refinement. Code defines BruteForceSearch() but never calls it — uses shrinking random search instead. |
| 3 | Consistency weights differ | Paper states (equal). Code uses 0.2*intensity + 0.3*normal + 0.5*segment plus bonus -0.0001*num_masks. |
| 4 | Segment decay factor differs | Paper states in Eq. (7). Code uses geometric halving (). |
| 5 | Mask edge processing (code-only) | processed_mask.py keeps only edge-near regions for large masks (≥2% of image). Not described in paper. |
| 6 | Downsample voxel size ignored | Config provides down_sample_voxel but code hardcodes leaf size to 0.1m. |
| 7 | Normal scoring is O(n²) | Computes the full matrix for each mask. Not discussed in paper as a computational concern. |
7.3 Project Structure & Build
| File | Role |
|---|---|
CMakeLists.txt |
Build system: dependencies, targets |
src/run_lidar2camera.cpp |
Entry point: main(), config reader, save result |
src/calibration.cpp |
Core implementation: all calibration logic |
include/calibration.hpp |
Calibrator class declaration, custom point type, JSON params struct |
include/dataloader.hpp |
Static methods to load masks, point clouds, calibration files |
include/utility.hpp |
Math helpers: delta-T from 6-DOF, rotation extraction, stats |
include/logging.hpp |
Printf-based logging macros (defined but never used) |
processed_mask.py |
Python script: edge-only processing for large SAM masks |
Build dependencies: PCL 1.10, OpenCV, Eigen 3 (via ${EIGEN_ROOT} env var), Boost system, jsoncpp (linked as static libjsoncpp.a).
Build note: There is a bug in the CMake file where CMAKE_CXX_FLAGS is set to -std=c++14 then immediately overwritten with -pthread, losing the C++14 flag. In practice the compiler's default standard is used.
7.4 Entry Point (run_lidar2camera.cpp)
main() flow:
- Parse config — reads JSON file via
readConfig(json_file, json_params)into aJsonParamsstruct - Construct Calibrator —
Calibrator calibrator(json_params)loads all data and preprocesses point clouds - Run calibration —
calibrator.Calibrate()performs the iterative shrinking random search - Save result —
SaveExtrinsic(refined_extrinsic)writes the 4×4 matrix toextrinsic.txt
readConfig() JSON fields:
| JSON Key | What It Fills |
|---|---|
cam_K |
3×3 or 3×4 camera intrinsic matrix |
cam_dist |
Distortion coefficients |
T_lidar_to_cam |
Initial extrinsic guess (4×4) |
T_lidar_to_cam_gt |
Ground truth extrinsic (optional) |
img_folder / mask_folder / pc_folder |
File paths (relative to JSON file) |
params.search_range.rot_deg / trans_m |
Search range in degrees / meters |
params.min_plane_point_num |
Minimum points for RANSAC plane |
params.cluster_tolerance |
Euclidean clustering distance (meters) |
params.search_num |
Random samples per iteration |
params.point_range.top / bottom |
Fraction of image height for FOV check |
params.thread.num_thread |
Thread count (0 = single-threaded) |
params.down_sample.voxel_m |
Voxel size for downsampling (code ignores this — bug) |
7.5 Calibrator Constructor
Performs all data loading and preprocessing:
- Reads the first image to get
IMG_HandIMG_W. - For each frame
i:- Loads all mask PNGs from the frame's mask directory into a packed 4-channel
CV_8UC4image viaDataLoader::LoadMaskFile(see § 7.11). - Loads the point cloud (PCD or BIN format) via
DataLoader::LoadLidarFile. - Calls
ProcessPointcloud(pc_origin, pc)to filter, downsample, compute normals, and segment.
- Loads all mask PNGs from the frame's mask directory into a packed 4-channel
7.6 Point Cloud Preprocessing (ProcessPointcloud)
void Calibrator::ProcessPointcloud(
const pcl::PointCloud<pcl::PointXYZI>::Ptr pc_origin,
pcl::PointCloud<PointXYZINS>::Ptr pc);
Custom PCL point type:
struct PointXYZINS {
PCL_ADD_POINT4D; // x, y, z
PCL_ADD_NORMAL4D; // normal_x, normal_y, normal_z
float intensity; // normalized to [0, 1]
float curvature; // surface curvature
int segment; // cluster/segment ID (-1 = unsegmented)
};
Step 1: FOV filtering — generates 3^6 = 729 combinations of extreme perturbations of the initial extrinsic (each of 6 DOFs at {-1, 0, +1} × search_range via Util::GenVars()). For each raw point, if it projects onto the image under any of these 729 extrinsics, the point is kept. Tracks intensity_max during this pass.
Step 2: Optional downsampling — two-stage voxel grid: octree at 125m resolution partitions space; within each cell, VoxelGrid at leaf size hardcoded 0.1m downsamples. Config down_sample_voxel is never used (bug).
Step 3: Segmentation — calls Segment_pc() (see § 7.7).
Step 4: Construct PointXYZINS cloud — copies xyz; normalizes intensity as intensity / intensity_max; copies normals and curvature; assigns segment IDs.
7.7 Segmentation (Segment_pc)
int Calibrator::Segment_pc(
const pcl::PointCloud<pcl::PointXYZI>::Ptr cloud,
pcl::PointCloud<pcl::Normal>::Ptr normals,
std::vector<pcl::PointIndices>& seg_indices);
Step 1: Normal estimation — KD-tree with K=40 nearest neighbors.
Step 2: Iterative RANSAC plane extraction:
seg.setModelType(pcl::SACMODEL_NORMAL_PLANE);
seg.setNormalDistanceWeight(0.2);
seg.setMethodType(pcl::SAC_RANSAC);
seg.setMaxIterations(3000);
seg.setDistanceThreshold(0.2); // 0.2m
Loop: fit a plane → if inlier count ≥ min_plane_point_num, store indices, exclude them → repeat until next plane is too small. Extracts ground, walls, other large planar surfaces.
Step 3: Euclidean clustering on non-plane points:
ec.setClusterTolerance(params_.cluster_tolerance); // from config
ec.setMaxClusterSize(10000);
ec.setMinClusterSize(50);
All plane segments and Euclidean clusters concatenated into seg_indices.
7.8 Scoring Function (CalScore)
double Calibrator::CalScore(Eigen::Matrix4f T);
Returns a loss value (lower is better), range approximately [1.0, 2.0].
Step 1: Project all points — for each frame, project every point using ProjectOnImage(). If a point lands on the image, look up the mask IDs at that pixel from the packed 4-channel mask image (up to 4 overlapping masks per pixel). Accumulate per-mask arrays of normals, intensities, and segment IDs.
Step 2: Bottom-of-image sanity check — count points near the bottom. If too few relative to expected density from POINT_PER_PIXEL, assign penalty score of 2.0 (worst) for this frame and skip.
Step 3: Per-mask consistency scoring — for each mask:
-
Filter: skip if mask pixel count <
min(IMG_H * IMG_W / 1200, 2000)or projected point count < 10 or < 10% of expected. -
Normal consistency (O(n²)):
// N = 3 x num_inside matrix of normal vectors float normal_sim = (N.transpose() * N).array().abs().mean();Full matrix, element-wise abs, then mean. If all normals aligned, mean ≈ 1.0.
-
Intensity consistency:
1 - Util::Std(mask_intensity[i]). Low std → uniform material → high score. -
Segment consistency: sorted segment counts weighted geometrically (1, 0.5, 0.25, …). Single-segment mask → 1.0.
Step 4: Weighted aggregation — each mask's score weighted by projected point count, averaged via Util::WeightMean().
Step 5: Final score formula:
score += 2 - 0.3 * normal_score
- 0.2 * intensity_score
- 0.5 * segment_score
- 0.0001 * normal_scores.size();
| Component | Weight | Meaning |
|---|---|---|
| Normal consistency | 0.3 | How aligned are surface normals within each mask? |
| Intensity consistency | 0.2 | How uniform is LiDAR intensity within each mask? |
| Segment consistency | 0.5 | Does each mask contain a single LiDAR segment? |
| Mask count bonus | 0.0001 | Tiny bonus per valid mask (encourages more masks) |
Base is 2.0 (worst). Perfect alignment → approximately 2 - 0.3 - 0.2 - 0.5 - epsilon = 1.0.
Step 6: Average across frames — score /= N_FILE.
7.9 Search Strategy
RandomSearch(search_count, xyz_range, rpy_range, thread_id):
- Sample 6 random values uniformly from
[-range, +range](3 rotation, 3 translation). - Compute candidate extrinsic:
T_candidate = extrinsic_ * GetDeltaT(delta). - Score via
CalScore(T_candidate). - Track best delta found across
search_countiterations. - Acquire mutex, update global
max_score_/best_var_if better.
RandomSearchThread() spawns num_thread threads, each running RandomSearch with search_count / num_thread iterations, then joins all threads.
Calibrate() — main loop:
float rot_init = search_range_rot + 0.5; // degrees, padded
float trans_init = search_range_trans + 0.05; // meters, padded
do {
best_var_ << 0, 0, 0, 0, 0, 0;
RandomSearchThread(search_num, trans_init, DEG2RAD(rot_init));
extrinsic_ *= Util::GetDeltaT(best_var_);
rot_init /= 2; // halve rotation range
trans_init /= 1.5; // reduce translation range by 1.5×
} while (rot_init > 0.3); // stop when rotation range < 0.3 degrees
Example iteration sequence (starting from rot=5.5 deg, trans=0.55m):
| Round | rot_init (deg) | trans_init (m) |
|---|---|---|
| 1 | 5.50 | 0.550 |
| 2 | 2.75 | 0.367 |
| 3 | 1.375 | 0.244 |
| 4 | 0.688 | 0.163 |
| 5 | 0.344 | 0.109 |
| (stop: 0.172 < 0.3) |
~5 rounds × 5000 evaluations = ~25,000 total. Delta applied multiplicatively: extrinsic_ *= deltaT — shrinking random search without gradient information.
BruteForceSearch() exists but is never called.
7.10 Projection (ProjectOnImage)
bool Calibrator::ProjectOnImage(
const Eigen::Vector4f &vec, // [x, y, z, 1] in LiDAR frame
const Eigen::Matrix4f &T, // extrinsic (lidar -> camera)
int &x, int &y, // output pixel coordinates
int margin);
Two modes depending on intrinsic size:
- 3×4 intrinsic (projection matrix P):
vec2 = K * T * vec - 3×3 intrinsic (standard K):
cam_point = T * vec, thenvec2 = K * cam_vec
Rejects if vec2(2) <= 0 (point behind camera). Computes pixel as x = round(vec2[0] / vec2[2]), y = round(vec2[1] / vec2[2]). Returns true if within image bounds.
7.11 Mask Loading & Packing (DataLoader)
LoadMaskFile(mask_dir, intrinsic, dist, masks, mask_point_num):
- Collects all
.pngfiles from the mask directory, sorts alphabetically. - Caps at 255 masks (IDs stored as uint8).
- For each mask PNG: reads as grayscale, undistorts if needed, then for each white pixel writes
mask_id + 1into the first available channel of aCV_8UC4image. A single pixel can belong to up to 4 overlapping masks.
This 4-channel packing is critical because SAM often produces overlapping masks (e.g., a car mask overlapping a larger scene mask).
LoadLidarFile(filename, pc): Supports .pcd (PCL format) and .bin (KITTI format: 4 floats per point packed sequentially).
7.12 Mask Edge Processing (processed_mask.py)
For large SAM masks (sky, ground), the filled mask provides little calibration signal — only the edges carry alignment information.
- For each mask: binarize, compute gradient via
np.gradient(), create edge map. - If
n_white < 0.02 * H * W(mask covers <2% of image): copy as-is. - For large masks (≥2%):
- Compute
margin_inside = 30 + H * W / n_white(larger masks get smaller margins). - Dilate each edge pixel by
margin_insidepixels. - Output = original mask AND dilated edges (only the band near the boundary is retained).
- Compute
Example: For a 1920×1080 image with a mask covering 20% (~414k pixels): margin = 30 + 2073600/414720 = 35 pixels. Only a 35-pixel-wide band inside the mask boundary is kept.
Not described in the paper.
7.13 JSON Config Parameters
{
"cam_K": { "rows": 3, "cols": 3, "data": [[fx, 0, cx], [0, fy, cy], [0, 0, 1]] },
"cam_dist": { "cols": 5, "data": [k1, k2, p1, p2, k3] },
"T_lidar_to_cam": { "rows": 4, "cols": 4, "data": [...] },
"T_lidar_to_cam_gt": { "available": true, "rows": 4, "cols": 4, "data": [...] },
"img_folder": "images",
"mask_folder": "processed_masks",
"pc_folder": "pc",
"img_format": ".png",
"pc_format": ".pcd",
"file_name": ["000000", "000001"],
"params": {
"search_range": { "rot_deg": 5.0, "trans_m": 0.5 },
"min_plane_point_num": 500,
"cluster_tolerance": 0.5,
"search_num": 5000,
"point_range": { "top": 0.2, "bottom": 0.9 },
"down_sample": { "is_valid": false, "voxel_m": 0.1 },
"thread": { "is_multi_thread": true, "num_thread": 8 }
}
}
References
| Ref | Link |
|---|---|
| Calib-Anything paper | https://arxiv.org/abs/2306.02656 |
| Official code | https://github.com/OpenCalib/CalibAnything |
| Segment Anything (SAM) | https://arxiv.org/abs/2304.02643 |
| KITTI odometry benchmark | https://www.cvlibs.net/datasets/kitti/eval_odometry.php |