BootsTAP: Bootstrapped Training for Tracking-Any-Point
- Description: Paper note on BootsTAP, a self-supervised student-teacher method for improving point tracking on real-world videos (ACCV 2024)
- My Notion Note ID: K2E-B-F2-3
- Created: 2026-03-23
- Updated: 2026-06-06
- License: Reuse welcome — please credit Yu Zhang and link back to yuzhang.io
Table of Contents
- 1. Paper Information
- 2. Summary
- 3. Key Contributions
- 4. Background & Related Work
- 5. Method Details & Key Equations
- 6. Training Setup & Datasets
- 7. Main Experiments & Quantitative Results
- 8. Ablations, Limitations & Practical Pointers
- 9. Conclusions & Future Work
- References
1. Paper Information
Title: BootsTAP: Bootstrapped Training for Tracking-Any-Point
Authors: Carl Doersch, Pauline Luc, Yi Yang, Dilara Gokay, Skanda Koppula, Ankush Gupta, Joseph Heyward, Ignacio Rocco, Ross Goroshin, João Carreira, Andrew Zisserman
Affiliations: Google DeepMind; VGG, Department of Engineering Science, University of Oxford
Paper: ACCV 2024 / arXiv:2402.00847
Code: github.com/google-deepmind/tapnet (JAX and PyTorch)
Project page: bootstap.github.io
2. Summary
BootsTAP demonstrates that large-scale, unlabeled, uncurated real-world videos can substantially improve a Tracking-Any-Point (TAP) model using a simple self-supervised student-teacher framework. Starting from a TAPIR model pre-trained on synthetic data (Kubric), it bootstraps on ~15 million real video clips by: (1) using the teacher (EMA of student) to produce pseudo-ground-truth trajectories, (2) applying spatial transformations and corruptions to the student's input, and (3) training the student to reproduce the teacher's predictions with a cycle-consistency filter. The result, BootsTAPIR, achieves state-of-the-art on the TAP-Vid benchmark by wide margins: TAP-Vid-DAVIS improves from 61.3% to 67.4% AJ, and TAP-Vid-Kinetics from 57.2% to 62.5% AJ (strided mode).
3. Key Contributions
- First large-scale bootstrapping pipeline for TAP: Demonstrates that unlabeled real-world video can substantially improve point tracking via self-supervised learning, using two consistency principles: equivariance to spatial transforms and invariance to query point choice along a trajectory.
- Simple but effective formulation: Minimal architectural changes to TAPIR. Only five 2D conv-residual layers (channel multiplier 4) are added to the backbone, roughly doubling its parameters.
- Comprehensive ablations: Analyzes data transformations, pseudo-label filtering, training setup, and data requirements. Affine transformations are critical; cycle-consistency filtering helps; even 1% of data provides gains.
- State-of-the-art results: Surpasses all prior methods on TAP-Vid (Kinetics, DAVIS, RGB-Stacking) and RoboTAP by wide margins. Also demonstrates causal model and high-resolution variants.
4. Background & Related Work
Tracking-Any-Point (TAP): Track any point on solid surfaces in a video. Unlike optical flow (displacement between adjacent frame pairs), TAP outputs per-point trajectories across an entire video (position + occlusion status for every frame), handling long-range tracking, occlusion, and reappearance. TAP-Vid [12] formalized the benchmark with datasets including Kinetics (human actions), DAVIS (object tracking), and RGB-Stacking (robotic manipulation); RoboTAP (real robotic manipulation) is a separate benchmark evaluated alongside it. State-of-the-art methods (TAPIR, CoTracker) rely heavily on synthetic data (Kubric), creating a sim-to-real domain gap.
Self-supervised correspondence via photometric loss: Photometric losses are popular in optical flow but struggle with occlusions, lighting changes, and textureless regions. Multi-frame methods, appearance reconstruction, and high-resolution matching partially address these.
Temporal continuity and cycle-consistency: Temporal continuity provides correspondences for feature learning. Cycle-consistency can yield useful features but fails at occlusions.
Semi-supervised correspondence: Pseudo-GT approaches (using RAFT or TAP-Net) have been used for optical flow and tracking. OmniMotion infers 3D scene representations but doesn't retrain a general TAP model. Li et al. [34] proposed a reconstruction-based self-supervised loss but achieved much lower performance. Concurrent work [57] (CoTracker v2) uses frozen trajectory labels and fine-tunes on target datasets; BootsTAP instead continuously updates the teacher via EMA and trains on a single large, diverse dataset.
5. Method Details & Key Equations
5.1 Core Principles
Two properties of point tracks on solid, opaque surfaces:
- Equivariance: Spatial transformations of the video produce equivalent transformations of the trajectories.
- Invariance: Different query points along the same trajectory should yield the same track (each trajectory is an equivalence class).
Concretely, for equivariance: if the video undergoes a spatial transformation (e.g., affine: resize + translate), the predicted trajectories undergo the same transformation. For invariance: querying any point along a trajectory should reproduce that same track.
A naive Siamese approach (minimizing difference between two augmented views) degrades toward trivial solutions. Instead, a student-teacher framework is used: the teacher provides stop-gradient pseudo-labels; the student receives a harder, augmented view.
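The teacher update can be illustrated with a minimal sketch of an exponential moving average over parameters. The `ema_update` helper, the dict-of-floats parameter layout, and the decay value are all assumptions made for illustration; the note does not record the decay the authors used.

```python
def ema_update(teacher, student, decay=0.99):
    """Return new teacher params: decay * teacher + (1 - decay) * student.

    No gradient ever flows through the teacher; it only trails the student.
    The decay value is a placeholder, not a number taken from the paper.
    """
    return {name: decay * teacher[name] + (1.0 - decay) * student[name]
            for name in teacher}


# Toy parameters: one weight and one bias per model.
teacher_params = {"w": 1.0, "b": 0.0}
student_params = {"w": 2.0, "b": 1.0}
new_teacher = ema_update(teacher_params, student_params, decay=0.9)
```

Because the teacher is a slow average of the student, its pseudo-labels change smoothly during training, which is one common reason EMA teachers are preferred over a symmetric Siamese setup.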
5.2 TAPIR Loss (Baseline)
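Let $\hat{y} = \{\hat{p}_t, \hat{o}_t, \hat{u}_t\}$ be predictions: position $\hat{p}_t$, occlusion logit $\hat{o}_t$, uncertainty logit $\hat{u}_t$. Standard TAPIR loss for a single trajectory, at frame $t$:
$$\ell_{\text{tapir}}(\hat{p}_t, \hat{o}_t, \hat{u}_t) = \text{Huber}(\hat{p}_t, p_t)\,(1 - o_t) + \text{BCE}(\hat{o}_t, o_t) + \text{BCE}(\hat{u}_t, u_t)\,(1 - o_t)$$
Notation: hatted ($\hat{\cdot}$) = model prediction; unhatted = ground truth. Huber = smooth L1 (robust to outliers). BCE = Binary Cross-Entropy (for binary occlusion/visibility and certainty/uncertainty classification). $(1 - o_t)$ = visibility mask: when $o_t = 1$ (occluded), position and uncertainty terms become 0, since there is no ground-truth position to compare. Occlusion loss is NOT masked (the model must always predict occlusion state).
The uncertainty target is $u_t = \mathbb{1}\big(d(p_t, \hat{p}_t) > \delta\big)$ with $\delta = 6$ pixels, where $d$ is the Euclidean distance.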
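A minimal pure-Python sketch of this loss for one trajectory may make the masking concrete. The function names, the Huber threshold of 1.0, the default 6-pixel uncertainty threshold, and summing over frames are assumptions chosen for illustration, not details confirmed by the note.

```python
import math


def huber(pred, target, threshold=1.0):
    """Huber loss on the Euclidean error between two (x, y) points."""
    err = math.dist(pred, target)
    if err <= threshold:
        return 0.5 * err ** 2
    return threshold * (err - 0.5 * threshold)


def bce_with_logits(logit, label):
    """Numerically stable binary cross-entropy for a single logit."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def tapir_loss(pred_pos, pred_occ_logit, pred_unc_logit, gt_pos, gt_occ, delta=6.0):
    """Sum the three loss terms over frames; gt_occ holds 0 (visible) or 1 (occluded)."""
    total = 0.0
    for p_hat, o_hat, u_hat, p, o in zip(
            pred_pos, pred_occ_logit, pred_unc_logit, gt_pos, gt_occ):
        visible = 1.0 - o
        # Uncertainty target: 1 when the prediction is farther than delta pixels.
        u = 1.0 if math.dist(p, p_hat) > delta else 0.0
        total += huber(p_hat, p) * visible          # masked when occluded
        total += bce_with_logits(o_hat, o)          # never masked
        total += bce_with_logits(u_hat, u) * visible  # masked when occluded
    return total
```

For an occluded frame only the occlusion term survives, so a wildly wrong position prediction costs nothing there.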
5.3 Model Capacity Expansion
After pre-training, five 2D conv-residual layers with channel multiplier 4 are added to the backbone, roughly doubling the number of backbone parameters. They are initialized to the identity following "zero init" [18].
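The effect of zero initialization can be shown with a toy residual block. This is only a sketch: scalar weights stand in for convolution kernels, and `residual_block` is a hypothetical helper, not part of the released code.

```python
def residual_block(x, w_in, w_out):
    """Toy residual block: y = x + w_out * relu(w_in * x), elementwise.

    Scalar weights stand in for conv kernels purely for illustration.
    """
    hidden = [max(0.0, w_in * v) for v in x]
    return [v + w_out * h for v, h in zip(x, hidden)]


features = [0.5, -1.0, 2.0]
# With the last layer's weight at zero, the block is exactly the identity,
# so the pre-trained network's outputs are unchanged when the block is added.
unchanged = residual_block(features, w_in=0.7, w_out=0.0)
```

Starting from the identity means the added capacity cannot hurt the pre-trained model at step zero; the new layers only begin to contribute as training moves their weights away from zero.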
5.4 Pseudo-Label Generation
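Let $\hat{y}^{S} = \{\hat{p}^{S}_t, \hat{o}^{S}_t, \hat{u}^{S}_t\}$ be student predictions and $y^{T} = \{p^{T}_t, o^{T}_t, u^{T}_t\}$ be teacher pseudo-labels, derived from the teacher predictions $\hat{y}^{T}$:
$$p^{T}_t = \hat{p}^{T}_t, \qquad o^{T}_t = \mathbb{1}\big(\hat{o}^{T}_t > 0\big), \qquad u^{T}_t = \mathbb{1}\big(d(p^{T}_t, \hat{p}^{S}_t) > \delta\big)$$
where $d$ is the Euclidean distance and $\delta$ is the uncertainty threshold from Section 5.2.
The self-supervised loss has the same form as the TAPIR loss, using teacher predictions as ground truth:
$$\ell_{\text{ssl}}(\hat{p}^{S}_t, \hat{o}^{S}_t, \hat{u}^{S}_t) = \text{Huber}(\hat{p}^{S}_t, p^{T}_t)\,(1 - o^{T}_t) + \text{BCE}(\hat{o}^{S}_t, o^{T}_t) + \text{BCE}(\hat{u}^{S}_t, u^{T}_t)\,(1 - o^{T}_t)$$
The teacher always uses the final (refined) predictions for pseudo-labels, while supervision is applied to unrefined student outputs too, encouraging stronger features for faster convergence.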
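As a rough sketch, turning teacher outputs into pseudo-labels amounts to copying positions, thresholding the occlusion logit at zero, and flagging frames where the student lands far from the teacher. The function name, argument layout, and default threshold below are illustrative assumptions.

```python
import math


def make_pseudo_labels(teacher_pos, teacher_occ_logit, student_pos, delta=6.0):
    """Build per-frame (position, occluded, uncertain) pseudo-labels.

    teacher_pos / student_pos: lists of (x, y) per frame.
    teacher_occ_logit: list of floats, positive meaning "occluded".
    """
    labels = []
    for p_teacher, occ_logit, p_student in zip(
            teacher_pos, teacher_occ_logit, student_pos):
        occluded = 1 if occ_logit > 0.0 else 0
        # Uncertainty target compares the student's position to the teacher's.
        uncertain = 1 if math.dist(p_teacher, p_student) > delta else 0
        labels.append((p_teacher, occluded, uncertain))
    return labels
```

These labels are treated as constants when the student loss is computed, which is the stop-gradient part of the student-teacher setup.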
5.5 Video Degradations
The student receives a degraded view to prevent trivial solutions:
- Spatial: Frames are resized to a smaller resolution and superimposed onto a black background at a random position; the position varies linearly over time, creating a frame-wise affine transformation $\Phi$.
- Non-spatial: Random JPEG compression applied before pasting onto the black background.
- The inverse transform $\Phi^{-1}$ maps student predictions back to the original coordinate space for loss computation.
Why degrade? Prevents the student from finding trivial shortcuts (e.g., copying pixel coordinates). Forces learning of higher-level semantic features (shape, structure) rather than low-level texture matching.
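The spatial degradation and its inverse can be sketched as a per-frame scale and offset that are linearly interpolated across the clip. All names here are hypothetical, and letting the scale vary over time alongside the position is an assumption of this sketch rather than something the note states.

```python
def lerp(start, end, weight):
    """Linear interpolation between start and end."""
    return start + (end - start) * weight


def frame_affine(t, num_frames, scale_start, scale_end, offset_start, offset_end):
    """Scale and (x, y) offset for frame t, interpolated linearly over the clip."""
    weight = t / (num_frames - 1) if num_frames > 1 else 0.0
    scale = lerp(scale_start, scale_end, weight)
    offset = (lerp(offset_start[0], offset_end[0], weight),
              lerp(offset_start[1], offset_end[1], weight))
    return scale, offset


def to_student(point, scale, offset):
    """Map an original-frame point into the shrunken, pasted student frame."""
    return (point[0] * scale + offset[0], point[1] * scale + offset[1])


def to_original(point, scale, offset):
    """Inverse map: bring a student prediction back to original coordinates."""
    return ((point[0] - offset[0]) / scale, (point[1] - offset[1]) / scale)
```

In this sketch, student predictions would be passed through `to_original` frame by frame before being compared with the teacher's pseudo-labels.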
5.6 Query Point Sampling (Trajectory Equivalence Class)
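- Sample teacher query $Q_1 = (q_1, t_1)$, where $q_1$ is a random $(x, y)$ coordinate and $t_1$ a random frame index.
- Sample student query $Q_2 = (q_2, t_2)$ from the teacher's visible trajectory: $q_2 = \hat{p}^{T}_{t_2}$, where $\hat{p}^{T}_{t}$ is the teacher's predicted position at frame $t$ and $t_2$ is a frame the teacher predicts as visible.
- With probability 0.5, use $Q_2 = Q_1$ (same query point); otherwise, sample $t_2$ uniformly from the visible frames.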
Sampling different query points along the teacher's trajectory enforces the invariance property — the student learns that all points on the same trajectory are equivalent and should produce the same track.
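A small sketch of the student-query sampling, under the assumption that "visible" means a non-positive teacher occlusion logit. The function name and the fallback to the teacher query when no frame is visible are illustrative choices, not details from the paper.

```python
import random


def sample_student_query(teacher_query, teacher_pos, teacher_occ_logit,
                         rng, p_same=0.5):
    """Pick the student's query along the teacher's predicted trajectory.

    teacher_query: ((x, y), t1). Returns ((x, y), t2).
    """
    visible_frames = [t for t, logit in enumerate(teacher_occ_logit)
                      if logit <= 0.0]
    # Reuse the teacher's query half the time, or if nothing is visible.
    if not visible_frames or rng.random() < p_same:
        return teacher_query
    t2 = rng.choice(visible_frames)
    return (teacher_pos[t2], t2)


rng = random.Random(0)
track = [(10.0, 10.0), (12.0, 11.0), (14.0, 12.0)]
occ_logits = [-3.0, 2.5, -1.0]  # frame 1 is predicted occluded
query = sample_student_query(((10.0, 10.0), 0), track, occ_logits, rng)
```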
5.7 Cycle-Consistency Filtering
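To filter incorrect teacher predictions, apply a cycle-consistency mask:
$$m_{\text{cycle}} = \mathbb{1}\big(d(\hat{p}^{S}_{t_1}, q_1) < \delta_{\text{cycle}}\big) \cdot \mathbb{1}\big(\hat{o}^{S}_{t_1} \le 0\big)$$
where $\delta_{\text{cycle}} = 4$ pixels, $(q_1, t_1)$ is the teacher query, $d$ is the Euclidean distance, and $\hat{p}^{S}_{t_1}$, $\hat{o}^{S}_{t_1}$ are the student's predicted position and occlusion logit at frame $t_1$.
The student tracks from its own query $Q_2$ back to time $t_1$. If it arrives close to $q_1$ (within 4 pixels) AND predicts the point as visible ($\hat{o}^{S}_{t_1} \le 0$), the trajectory is considered reliable. Otherwise the teacher likely tracked the wrong point, and the trajectory is excluded from the loss.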
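The filter reduces to two checks at the teacher's query frame, which a short sketch makes explicit. The function name and argument layout are assumptions; the 4-pixel threshold and the visibility condition come from the text above.

```python
import math


def cycle_consistent(student_pos, student_occ_logit, teacher_query,
                     delta_cycle=4.0):
    """Return 1.0 if the student's track passes the cycle check, else 0.0.

    student_pos: list of (x, y) per frame; student_occ_logit: list of floats.
    teacher_query: ((x, y), t1), the point the teacher started from.
    """
    q1, t1 = teacher_query
    lands_close = math.dist(student_pos[t1], q1) < delta_cycle
    predicted_visible = student_occ_logit[t1] <= 0.0
    return 1.0 if (lands_close and predicted_visible) else 0.0
```

The returned value can be multiplied into the trajectory's loss so that unreliable pseudo-labels contribute nothing.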
5.8 Final Self-Supervised Loss
The final loss is the self-supervised loss of Section 5.4, masked by the cycle-consistency filter of Section 5.7. The model is trained with 128 query points per input video, averaging the loss over them. To prevent catastrophic forgetting, supervised training on Kubric continues alongside the self-supervised loss, using separate Adam optimizers with separate learning rates (SSL uses half the batch size and half the learning rate).
6. Training Setup & Datasets
Training data (real): ~15 million 24-frame clips from publicly accessible online videos:
- Selected categories with high-quality, realistic motion (lifestyle, one-shot videos)
- Excluded low-visual-complexity or unrealistic motion (tutorials, lyrics, animations)
- Only 60fps videos with >200 views
- First/last 2 seconds trimmed; 5 clips randomly sampled per video
- Overlay/watermark frames excluded via gradient analysis; shot boundaries detected and removed
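The trimming and sampling rules above could be sketched as follows. Treating each clip as 24 consecutive frames, allowing clips to overlap, and the function name itself are assumptions; the note only gives the frame rate, trim length, clip length, and clips per video.

```python
import random


def sample_clip_starts(num_frames, rng, fps=60, trim_seconds=2,
                       clip_len=24, clips_per_video=5):
    """Return start indices for clips drawn from the untrimmed middle of a video.

    Clips are assumed to be clip_len consecutive frames and may overlap.
    """
    first_start = fps * trim_seconds
    last_start = num_frames - fps * trim_seconds - clip_len
    if last_start < first_start:
        return []  # video too short once both ends are trimmed
    # randint is inclusive on both ends.
    return [rng.randint(first_start, last_start) for _ in range(clips_per_video)]


starts = sample_clip_starts(num_frames=600, rng=random.Random(0))
```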
Supervised data: Kubric synthetic dataset (standard TAPIR training)
Multi-task training: Separate Adam optimizers for supervised (Kubric) and self-supervised (real) tasks with separate learning rates. SSL batch size halved (and learning rate halved proportionally) due to the extra forward pass cost.
Evaluation: TAP-Vid benchmark (Kinetics, DAVIS, RGB-Stacking) and RoboTAP, all at 256×256. Metrics: AJ (Average Jaccard), $< \delta^{x}_{avg}$ (position accuracy), OA (occlusion accuracy). Two query modes: strided (queries sampled every 5th frame) and query_first (each point queried at the first frame where it is visible).
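For reference, the three metrics can be sketched for a single track as below. This follows the TAP-Vid definitions as I understand them (pixel thresholds 1, 2, 4, 8, 16; Jaccard counts predicted-visible points that are wrong as false positives). It ignores details of the official evaluator such as which frames are excluded, so treat it as an illustration rather than a replacement for the benchmark code.

```python
import math

THRESHOLDS = (1, 2, 4, 8, 16)  # pixels


def tapvid_metrics(gt_pos, gt_occ, pred_pos, pred_occ):
    """Return (AJ, position accuracy, OA) for one track.

    gt_occ / pred_occ are lists of booleans, True meaning occluded.
    """
    frames = list(zip(gt_pos, gt_occ, pred_pos, pred_occ))
    occlusion_acc = sum(g_o == p_o for _, g_o, _, p_o in frames) / len(frames)
    num_visible = sum(not g_o for _, g_o, _, _ in frames)
    if num_visible == 0:
        return 0.0, 0.0, occlusion_acc
    jaccards, within_fracs = [], []
    for thr in THRESHOLDS:
        true_pos = false_pos = correct_pos = 0
        for g_p, g_o, p_p, p_o in frames:
            within = math.dist(g_p, p_p) < thr
            if within and not g_o:
                correct_pos += 1          # counts toward position accuracy
                if not p_o:
                    true_pos += 1
            if not p_o and (g_o or not within):
                false_pos += 1            # predicted visible but wrong
        within_fracs.append(correct_pos / num_visible)
        jaccards.append(true_pos / (num_visible + false_pos))
    n = len(THRESHOLDS)
    return sum(jaccards) / n, sum(within_fracs) / n, occlusion_acc
```

AJ is the strictest of the three because a point must be both localized within the threshold and correctly predicted visible to count.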
7. Main Experiments & Quantitative Results
7.1 TAP-Vid Benchmark — Strided Mode (Table 1)
| Method | Kinetics AJ / $< \delta^{x}_{avg}$ / OA | DAVIS AJ / $< \delta^{x}_{avg}$ / OA | RGB-Stacking AJ / $< \delta^{x}_{avg}$ / OA |
|---|---|---|---|
| TAP-Net | 46.6 / 60.9 / 85.0 | 38.4 / 53.1 / 82.3 | 59.9 / 72.8 / 90.4 |
| PIPs | 35.3 / 54.8 / 77.4 | 42.0 / 59.4 / 82.1 | 37.3 / 51.0 / 91.6 |
| TAPIR | 57.2 / 70.1 / 87.8 | 61.3 / 73.6 / 88.8 | 62.7 / 74.6 / 91.6 |
| CoTracker | 57.3 / 70.6 / 87.5 | 64.8 / 79.1 / 88.7 | 65.9 / 80.6 / 85.0 |
| BootsTAPIR | 61.4 / 74.2 / 89.7 | 66.2 / 78.1 / 91.0 | 72.4 / 83.1 / 91.2 |
BootsTAPIR outperforms all prior methods in AJ on Kinetics (+4.1 over CoTracker) and RGB-Stacking (+6.5 over CoTracker). On DAVIS it achieves the best AJ (66.2) and OA (91.0), though CoTracker keeps the best position accuracy (79.1 vs 78.1).
7.2 TAP-Vid Benchmark — Query-First Mode (Table 2)
| Method | Kinetics AJ / $< \delta^{x}_{avg}$ / OA | DAVIS AJ / $< \delta^{x}_{avg}$ / OA | RoboTAP AJ / $< \delta^{x}_{avg}$ / OA |
|---|---|---|---|
| TAPIR | 49.6 / 64.2 / 85.0 | 56.2 / 70.0 / 86.5 | 59.6 / 73.4 / 87.0 |
| CoTracker | 48.7 / 64.3 / 86.5 | 60.6 / 75.4 / 89.3 | 54.0 / 65.5 / 78.8 |
| BootsTAPIR | 54.6 / 68.4 / 86.5 | 61.4 / 73.6 / 88.7 | 64.9 / 80.1 / 86.3 |
BootsTAPIR gains 5.3 absolute AJ points on RoboTAP (64.9 vs 59.6 for TAPIR), despite RoboTAP looking very different from typical online videos.
7.3 Released Model with Bug Fix, Snap-to-Occluder, and Extra Data
Publicly released BootsTAPIR includes: coordinate bug fix, "snap-to-occluder" technique (altering query points near occlusion edges, 1 pixel away, to track the foreground rather than background), and training on higher-resolution, longer clips.
Strided mode (Table 4):
| Method | Kinetics AJ / $< \delta^{x}_{avg}$ / OA | DAVIS AJ / $< \delta^{x}_{avg}$ / OA | RGB-Stacking AJ / $< \delta^{x}_{avg}$ / OA |
|---|---|---|---|
| CoTracker | 57.3 / 70.6 / 87.5 | 64.8 / 79.1 / 88.7 | 65.9 / 80.6 / 85.0 |
| BootsTAPIR+fix+snap+data | 62.5 / 74.8 / 89.5 | 67.4 / 79.0 / 91.3 | 77.4 / 86.7 / 93.2 |
7.4 High-Resolution Evaluation (Table 6)
Evaluating at 512×512 instead of 256×256:
| Method | Kinetics AJ / $< \delta^{x}_{avg}$ / OA | DAVIS AJ / $< \delta^{x}_{avg}$ / OA |
|---|---|---|
| BootsTAPIR | 62.5 / 74.8 / 89.5 | 67.4 / 79.0 / 91.3 |
| BootsTAPIR+hires | 63.7 / 76.0 / 88.4 | 70.2 / 81.2 / 91.2 |
+1.2 AJ on Kinetics and +2.8 AJ on DAVIS; OA is roughly flat on DAVIS (91.3 → 91.2) and drops on Kinetics (89.5 → 88.4).
7.5 Causal Model and Perception Test
- Causal BootsTAPIR: 4.6% overall improvement on Kinetics, 3.0% on DAVIS (query-first) vs Causal TAPIR
- Perception Test: BootsTAPIR improves Overall 55.7 → 59.6, Static 57.4 → 61.3, Dynamic 46.3 → 49.7
8. Ablations, Limitations & Practical Pointers
Ablations (Table 3):
(a) Data transformations (DAVIS strided AJ / Kinetics q_first AJ):
- BASE: 65.8 / 54.4
- No JPEG augmentation: 65.7 / 53.5 (JPEG mostly matters for Kinetics)
- No affine transform: 54.4 / 44.7 (massive drop — affine is critical)
- Same queries (teacher=student): 65.6 / 53.2
- Uniform query sampling: 65.6 / 54.3
(b) Pseudo-label filtering:
- No filtering: 65.9 / 54.1 (hurts Kinetics)
- Cycle-consistency instead of confidence: 66.1 / 54.3 (slightly better on DAVIS)
(c) Training setup:
- Full model: 66.2 / 54.6
- Kubric-only (same capacity, no SSL): 65.0 / 52.7 (SSL adds ~1.2% DAVIS, ~2% Kinetics)
- Siamese (no stop-gradient): 49.8 / 29.6 (collapses — stop-gradient is essential)
(d) Training data:
- 2-frame clips: 64.3 / 50.5 (longer clips important)
- 6-frame clips: 63.7 / 50.9
- 1% of real data: 66.2 / 54.0 (surprisingly competitive — even small real data helps)
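To read the Table 3(a) numbers as drops relative to BASE, a few lines of arithmetic suffice. The values are copied from the list above; the dictionary layout is just for illustration.

```python
# (DAVIS strided AJ, Kinetics query-first AJ), copied from Table 3(a) above.
BASE = (65.8, 54.4)
ablations = {
    "no JPEG": (65.7, 53.5),
    "no affine": (54.4, 44.7),
    "same queries": (65.6, 53.2),
    "uniform query sampling": (65.6, 54.3),
}

# Drop in AJ points relative to BASE, rounded to one decimal place.
drops = {
    name: (round(BASE[0] - davis, 1), round(BASE[1] - kinetics, 1))
    for name, (davis, kinetics) in ablations.items()
}
largest_davis_drop = max(drops, key=lambda name: drops[name][0])
```

Removing the affine transform costs about 11.4 AJ on DAVIS and 9.7 on Kinetics, an order of magnitude more than any other ablation in this group.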
Limitations:
- Training computationally expensive (extra teacher forward pass + real data pipeline)
- Outputs a single point estimate per query and frame, so it cannot elegantly handle duplicated or rotationally-symmetric objects
- Performance continues to improve with more training — approach has not saturated
Practical Pointers:
- Affine transformations are by far the most important augmentation: removing them drops DAVIS strided AJ from 65.8 to 54.4 and Kinetics query-first AJ from 54.4 to 44.7
- Stop-gradient (EMA student-teacher) is essential; the Siamese approach collapses (49.8 / 29.6)
- 24-frame clips are clearly better than 2- or 6-frame clips (66.2 vs 64.3 and 63.7 DAVIS AJ); temporal context helps the teacher correct errors
- Even 1% of the real data recovers most of the gain (66.2 / 54.0 vs 65.0 / 52.7 for Kubric-only)
- Released model: BootsTAPIR tracks 10,000 points on a 256×256, 50-frame video in 5.6s (A100 + JAX); causal model tracks 400 points at 30.1 FPS
9. Conclusions & Future Work
BootsTAP presents an effective method for leveraging large-scale unlabeled data to improve point tracking by applying consistency principles (equivariance to spatial transforms, invariance to query point choice). The formulation avoids complex priors like spatial/temporal smoothness. Despite similarities to "unstable" two-frame self-supervised optical flow approaches, the multi-frame setting makes this approach stable and effective. Performance continues to improve with longer training. Future directions include principled solutions to the foreground/background tracking bias and improved handling of duplicated/symmetric objects.
References
- Doersch, C., Luc, P., Yang, Y., Gokay, D., Koppula, S., Gupta, A., Heyward, J., Rocco, I., Goroshin, R., Carreira, J., & Zisserman, A. (2024). BootsTAP: Bootstrapped Training for Tracking-Any-Point. ACCV 2024. arXiv:2402.00847