BootsTAP: Bootstrapped Training for Tracking-Any-Point


  • Description: Paper note on BootsTAP, a self-supervised student-teacher method for improving point tracking on real-world videos (ACCV 2024)
  • My Notion Note ID: K2E-B-F2-3
  • Created: 2026-03-23
  • Updated: 2026-06-06
  • License: Reuse welcome — please credit Yu Zhang and link back to yuzhang.io

1. Paper Information

Title: BootsTAP: Bootstrapped Training for Tracking-Any-Point
Authors: Carl Doersch, Pauline Luc, Yi Yang, Dilara Gokay, Skanda Koppula, Ankush Gupta, Joseph Heyward, Ignacio Rocco, Ross Goroshin, João Carreira, Andrew Zisserman
Affiliations: Google DeepMind; VGG, Department of Engineering Science, University of Oxford
Paper: ACCV 2024 / arXiv:2402.00847
Code: github.com/google-deepmind/tapnet (JAX and PyTorch)
Project page: bootstap.github.io

2. Summary

BootsTAP demonstrates that large-scale, unlabeled, uncurated real-world videos can substantially improve a Tracking-Any-Point (TAP) model using a simple self-supervised student-teacher framework. Starting from a TAPIR model pre-trained on synthetic data (Kubric), it bootstraps on ~15 million real video clips by: (1) using the teacher (EMA of student) to produce pseudo-ground-truth trajectories, (2) applying spatial transformations and corruptions to the student's input, and (3) training the student to reproduce the teacher's predictions with a cycle-consistency filter. The result, BootsTAPIR, achieves state-of-the-art on the TAP-Vid benchmark by wide margins: TAP-Vid-DAVIS improves from 61.3% to 67.4% AJ, and TAP-Vid-Kinetics from 57.2% to 62.5% AJ (strided mode).

3. Key Contributions

  • First large-scale bootstrapping pipeline for TAP: Demonstrates that unlabeled real-world video can substantially improve point tracking via self-supervised learning, using two consistency principles: equivariance to spatial transforms and invariance to query point choice along a trajectory.
  • Simple but effective formulation: Minimal architectural changes to TAPIR. Only 5 2D conv-residual layers (channel multiplier 4) are added to the backbone, roughly doubling the number of backbone parameters.
  • Comprehensive ablations: Analyzes data transformations, pseudo-label filtering, training setup, and data requirements. Affine transformations are critical; cycle-consistency filtering helps; even 1% of data provides gains.
  • State-of-the-art results: Surpasses all prior methods on TAP-Vid (Kinetics, DAVIS, RGB-Stacking) and RoboTAP by wide margins. Also demonstrates causal model and high-resolution variants.

4. Background & Related Work

Tracking-Any-Point (TAP): Track any point on solid surfaces in a video. Unlike optical flow (displacement between adjacent frame pairs), TAP outputs per-point trajectories across an entire video (position + occlusion status for every frame), handling long-range tracking, occlusion, and reappearance. TAP-Vid [12] formalized the benchmark with the Kinetics (human actions), DAVIS (object tracking), and RGB-Stacking (robotic manipulation) datasets; RoboTAP adds real robotic manipulation videos. State-of-the-art methods (TAPIR, CoTracker) rely heavily on synthetic data (Kubric), creating a sim-to-real domain gap.

Self-supervised correspondence via photometric loss: Photometric losses are popular in optical flow but struggle with occlusions, lighting changes, and textureless regions. Multi-frame methods, appearance reconstruction, and high-resolution matching partially address these problems.

Temporal continuity and cycle-consistency: Temporal continuity provides correspondences for feature learning. Cycle-consistency can yield useful features but fails at occlusions.

Semi-supervised correspondence: Pseudo-GT approaches (using RAFT or TAP-Net) have been used for optical flow and tracking. OmniMotion infers 3D scene representations but doesn't retrain a general TAP model. Li et al. [34] proposed a reconstruction-based self-supervised loss but achieved much lower performance. Concurrent work [57] (CoTracker v2) uses frozen trajectory labels and fine-tunes on target datasets; BootsTAP continuously updates the teacher via EMA and trains on a single large diverse dataset.

5. Method Details & Key Equations

5.1 Core Principles

Two properties of point tracks on solid, opaque surfaces:

  1. Equivariance: Spatial transformations of the video produce equivalent transformations of the trajectories.
  2. Invariance: Different query points along the same trajectory should yield the same track (each trajectory is an equivalence class).

Equivariance: if the video undergoes a spatial transformation $\Phi$ (e.g., affine: resize + translate), the predicted trajectories undergo the same $\Phi$. Invariance: querying any point along the same trajectory should produce the same track.

A naive Siamese approach (minimizing difference between two augmented views) degrades toward trivial solutions. Instead, a student-teacher framework is used: the teacher provides stop-gradient pseudo-labels; the student receives a harder, augmented view.
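The teacher in this framework is an EMA of the student. A minimal sketch of that update is below; the decay value and the dictionary-of-arrays parameter layout are assumptions for illustration and are not taken from this note.

```python
import numpy as np

def ema_update(teacher, student, decay=0.99):
    """Move each teacher parameter a small step toward the student's value.

    `decay` is a hypothetical value; the note does not state the one used.
    """
    return {name: decay * teacher[name] + (1.0 - decay) * student[name]
            for name in teacher}

# Toy parameters standing in for network weights.
teacher = {"w": np.zeros(3)}
student = {"w": np.ones(3)}
teacher = ema_update(teacher, student, decay=0.9)
```

Because the teacher is only ever written by this update and never by backpropagation, its outputs act as stop-gradient targets for the student.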

5.2 TAPIR Loss (Baseline)

Let $\hat{y} = \{\hat{p}, \hat{o}, \hat{u}\}$ be the predictions: position $\hat{p} \in \mathbb{R}^{T \times 2}$, occlusion logit $\hat{o} \in \mathbb{R}^T$, uncertainty logit $\hat{u} \in \mathbb{R}^T$. Standard TAPIR loss for a single trajectory:

$$\mathcal{L}_{tapir}(\hat{p}[t], \hat{o}[t], \hat{u}[t]) = \text{Huber}(\hat{p}[t], p[t])(1 - o[t]) + \text{BCE}(\hat{o}[t], o[t]) + \text{BCE}(\hat{u}[t], u[t])(1 - o[t])$$

Notation: $\hat{\cdot}$ = model prediction; unhatted = ground truth. Huber = smooth L1 (robust to outliers). BCE = Binary Cross-Entropy (for binary occlusion/visibility and certainty/uncertainty classification). $(1 - o[t])$ = visibility mask: when $o[t] = 1$ (occluded), the position and uncertainty terms become 0 because there is no ground-truth position to compare against. The occlusion loss is NOT masked (the model must always predict occlusion state).

The uncertainty target is $u[t] = \mathbb{1}(d(\hat{p}[t], p[t]) > \delta)$ with $\delta = 6$ pixels.
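The loss above can be sketched in NumPy as follows. This is an illustrative reading of the formula rather than the released implementation: the Huber transition point (`huber_delta`) and the use of the Euclidean error inside Huber are assumptions, and all function names are made up for this note.

```python
import numpy as np

def huber(pred, target, huber_delta=1.0):
    """Huber penalty on the Euclidean error of each (x, y) prediction."""
    err = np.linalg.norm(pred - target, axis=-1)
    quad = np.minimum(err, huber_delta)
    return 0.5 * quad**2 + huber_delta * (err - quad)

def bce_with_logits(logit, label):
    """Numerically stable binary cross-entropy on raw logits."""
    return np.maximum(logit, 0) - logit * label + np.log1p(np.exp(-np.abs(logit)))

def tapir_loss(p_hat, o_hat, u_hat, p, o, dist_thresh=6.0):
    """Per-frame loss for one trajectory; p_hat, p: [T, 2]; o_hat, u_hat, o: [T]."""
    visible = 1.0 - o                                   # mask: 0 where occluded
    u = (np.linalg.norm(p_hat - p, axis=-1) > dist_thresh).astype(float)
    return (huber(p_hat, p) * visible
            + bce_with_logits(o_hat, o)                 # never masked
            + bce_with_logits(u_hat, u) * visible)

# Frame 2 is occluded, so its large position error must not contribute.
p = np.array([[10., 10.], [12., 10.], [14., 10.]])
o = np.array([0., 0., 1.])
p_hat = p + np.array([[0., 0.], [0., 0.], [50., 0.]])
o_hat = np.array([-5., -5., 5.])
u_hat = np.array([-5., -5., -5.])
loss = tapir_loss(p_hat, o_hat, u_hat, p, o)
```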

5.3 Model Capacity Expansion

After pre-training, 5 layers of 2D conv-residual layers with channel multiplier 4 are added to the backbone, roughly doubling the number of backbone parameters. Initialized to the identity following "zero init" [18].
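To see why zero initialization leaves the pre-trained model unchanged at the start of bootstrapping, here is a small sketch. It uses dense layers as a stand-in for 2D convolutions, and the choice to zero the last layer of the residual branch is an assumption about how the identity is obtained.

```python
import numpy as np

rng = np.random.default_rng(0)

def residual_block(x, w_in, w_out):
    """y = x + w_out @ relu(w_in @ x): a dense stand-in for a conv-residual layer."""
    hidden = np.maximum(w_in @ x, 0.0)
    return x + w_out @ hidden

channels, multiplier = 8, 4
w_in = rng.normal(size=(channels * multiplier, channels))  # ordinary random init
w_out = np.zeros((channels, channels * multiplier))        # zero init on the output layer
x = rng.normal(size=channels)
y = residual_block(x, w_in, w_out)                         # equals x exactly
```

With the residual branch contributing nothing at initialization, the added capacity only changes the network's output once training moves `w_out` away from zero.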

5.4 Pseudo-Label Generation

Let $\hat{y}_S = \{\hat{p}_S, \hat{o}_S, \hat{u}_S\}$ be the student predictions and $y_\mathcal{T} = \{p_\mathcal{T}, o_\mathcal{T}, u_\mathcal{T}\}$ be the teacher pseudo-labels:

$$p_\mathcal{T}[t] = \hat{p}_\mathcal{T}[t] \quad;\quad o_\mathcal{T}[t] = \mathbb{1}(\hat{o}_\mathcal{T}[t] > 0) \quad;\quad u_\mathcal{T}[t] = \mathbb{1}(d(\hat{p}_\mathcal{T}[t], \hat{p}_S[t]) > \delta)$$

The self-supervised loss $\ell_{ssl}$ has the same form as the TAPIR loss, using teacher predictions as ground truth:

$$\ell_{ssl}(\hat{p}_S[t], \hat{o}_S[t], \hat{u}_S[t]) = \text{Huber}(\hat{p}_S[t], p_\mathcal{T}[t])(1 - o_\mathcal{T}[t]) + \text{BCE}(\hat{o}_S[t], o_\mathcal{T}[t]) + \text{BCE}(\hat{u}_S[t], u_\mathcal{T}[t])(1 - o_\mathcal{T}[t])$$

The teacher always uses the final (refined) predictions for pseudo-labels, while supervision is applied to unrefined student outputs too, encouraging stronger features for faster convergence.
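The conversion from raw teacher outputs to hard pseudo-labels can be sketched as below. The function name and argument layout are hypothetical; the thresholds (logit 0 for occlusion, 6 pixels for uncertainty) follow the definitions in this section.

```python
import numpy as np

def make_pseudo_labels(p_teacher, o_logit_teacher, p_student, dist_thresh=6.0):
    """Turn raw teacher outputs into hard targets for one trajectory.

    p_teacher, p_student: [T, 2] positions; o_logit_teacher: [T] occlusion logits.
    """
    p_target = p_teacher                                  # positions are used as-is
    o_target = (o_logit_teacher > 0).astype(float)        # 1 = occluded
    dist = np.linalg.norm(p_teacher - p_student, axis=-1)
    u_target = (dist > dist_thresh).astype(float)         # 1 = student is far from teacher
    return p_target, o_target, u_target

p_teacher = np.array([[0., 0.], [10., 0.]])
o_logit_teacher = np.array([-2., 3.])
p_student = np.array([[1., 0.], [20., 0.]])
p_t, o_t, u_t = make_pseudo_labels(p_teacher, o_logit_teacher, p_student)
```

In an autodiff framework these targets would be wrapped in a stop-gradient so that no gradient flows into the teacher.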

5.5 Video Degradations

The student receives a degraded view to prevent trivial solutions:

  • Spatial: Frames resized to a smaller resolution $r$ and superimposed onto a black background at a random position $(h, w)$. $r$ varies linearly over time, creating a frame-wise affine transformation $\Phi$.
  • Non-spatial: Random JPEG compression applied before pasting onto the black background.
  • The inverse transform $\Phi^{-1}$ maps student predictions back to original coordinate space for loss computation.

Why degrade? Prevents the student from finding trivial shortcuts (e.g., copying pixel coordinates). Forces learning of higher-level semantic features (shape, structure) rather than low-level texture matching.
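The coordinate side of the spatial degradation can be sketched as a per-frame scale plus offset. This sketch covers only the point coordinates (pixel resampling and JPEG compression are omitted), expresses the resize as a scale factor relative to the original frame, and keeps the paste offset fixed over time; those choices and all names are assumptions for illustration.

```python
import numpy as np

def make_affine_params(num_frames, r_start, r_end, offset_xy):
    """Per-frame scale r_t (linear in time) and a fixed paste offset (x, y)."""
    scales = np.linspace(r_start, r_end, num_frames)
    return scales, np.asarray(offset_xy, dtype=float)

def phi(points, scales, offset):
    """Map [T, 2] points from original to degraded (student) coordinates."""
    return points * scales[:, None] + offset

def phi_inv(points, scales, offset):
    """Map [T, 2] student predictions back to original coordinates."""
    return (points - offset) / scales[:, None]

# A static point seen through a view that shrinks from 100% to 50% over 3 frames.
pts = np.array([[100., 50.], [100., 50.], [100., 50.]])
scales, offset = make_affine_params(3, 1.0, 0.5, (20, 10))
degraded = phi(pts, scales, offset)
restored = phi_inv(degraded, scales, offset)
```

The student's loss is computed on `phi_inv` of its predictions, so a student that tracks correctly in the degraded view matches the teacher's track in the original view.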

5.6 Query Point Sampling (Trajectory Equivalence Class)

  1. Sample teacher query $Q_1 = (q_1, t_1)$ where $q_1$ is a random $(x, y)$ coordinate and $t_1$ a random frame.
  2. Sample student query $Q_2 = (q_2, t_2)$ from the teacher's visible trajectory: $Q_2 \in \{(p_\mathcal{T}[t], t);\ t \text{ s.t. } o_\mathcal{T}[t] = 0\}$.
  3. With probability 0.5, use $Q_2 = Q_1$ (same query point); otherwise, sample uniformly from visible points.

Sampling different query points along the teacher's trajectory enforces the invariance property — the student learns that all points on the same trajectory are equivalent and should produce the same track.
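A minimal sketch of the student-query sampling step is below. The function name is hypothetical, and the fallback to the teacher query when the teacher marks every frame as occluded is an assumption, since the note does not cover that case.

```python
import numpy as np

def sample_student_query(q1, t1, p_teacher, o_teacher, rng, p_same=0.5):
    """Pick the student query (q2, t2) from the teacher's visible trajectory.

    p_teacher: [T, 2] teacher positions; o_teacher: [T] hard occlusion labels.
    """
    if rng.random() < p_same:
        return np.asarray(q1, dtype=float), t1            # reuse the teacher query
    visible_frames = np.flatnonzero(o_teacher == 0)
    if visible_frames.size == 0:
        return np.asarray(q1, dtype=float), t1            # fallback (assumption)
    t2 = int(rng.choice(visible_frames))                  # uniform over visible frames
    return p_teacher[t2], t2

rng = np.random.default_rng(0)
p_teacher = np.array([[5., 5.], [6., 5.], [7., 5.], [8., 5.]])
o_teacher = np.array([0, 1, 0, 1])
q2, t2 = sample_student_query(p_teacher[0], 0, p_teacher, o_teacher, rng)
```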

5.7 Cycle-Consistency Filtering

To filter incorrect teacher predictions, apply a cycle-consistency mask:

$$m_{cycle} = \mathbb{1}(d(\hat{p}_S[t_1], q_1) < \delta_{cycle}) \ast \mathbb{1}(\hat{o}_S[t_1] \le 0)$$

where $\delta_{cycle} = 4$ pixels.

The student tracks from $q_2$ back to time $t_1$. If it arrives close to $q_1$ (within 4 pixels) AND predicts the point as visible ($\hat{o}_S[t_1] \le 0$), the trajectory is considered reliable. Otherwise, the teacher likely tracked the wrong point, and the trajectory is excluded from the loss.
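The mask can be computed per trajectory as in this sketch (the function name and argument layout are illustrative, not taken from the released code):

```python
import numpy as np

def cycle_consistency_mask(p_student, o_logit_student, q1, t1, dist_thresh=4.0):
    """Return 1.0 if the student's track passes back through the teacher query.

    p_student: [T, 2] student positions; o_logit_student: [T] occlusion logits;
    (q1, t1): the teacher's query point and frame.
    """
    close = np.linalg.norm(p_student[t1] - q1) < dist_thresh
    visible = o_logit_student[t1] <= 0
    return float(close and visible)

p_student = np.array([[10., 10.], [11., 10.], [12., 10.]])
o_logit_student = np.array([-3., -3., 2.])
keep = cycle_consistency_mask(p_student, o_logit_student, q1=np.array([10., 12.]), t1=0)
drop = cycle_consistency_mask(p_student, o_logit_student, q1=np.array([10., 20.]), t1=0)
```

Here `keep` is 1.0 because the student lands 2 pixels from the teacher query and predicts it visible, while `drop` is 0.0 because the student lands 10 pixels away.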

5.8 Final Self-Supervised Loss

$$\mathcal{L}_{SSL} = \sum_t m_{cycle}^t \ast \ell_{ssl}^t$$

Trained with 128 query points per input video, averaging the loss. To prevent catastrophic forgetting, supervised training on Kubric continues alongside the self-supervised loss, using separate Adam optimizers with separate learning rates (SSL uses half the batch size and half the learning rate).
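Putting the mask and the per-frame loss together for a batch of query tracks might look like the sketch below. Averaging over all query tracks, including the masked-out ones, is an assumption; the note only says the loss is averaged over the 128 queries.

```python
import numpy as np

def masked_ssl_loss(per_frame_loss, cycle_mask):
    """Sum the per-frame loss over time for tracks that pass the cycle check.

    per_frame_loss: [N, T] values of the self-supervised loss;
    cycle_mask: [N] with one 0/1 entry per query track.
    """
    per_track = (per_frame_loss * cycle_mask[:, None]).sum(axis=1)
    return per_track.mean()

per_frame_loss = np.ones((3, 4))          # 3 query tracks, 4 frames each
cycle_mask = np.array([1.0, 0.0, 1.0])    # the second track failed the cycle check
total = masked_ssl_loss(per_frame_loss, cycle_mask)
```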

6. Training Setup & Datasets

Training data (real): ~15 million 24-frame clips from publicly accessible online videos:

  • Selected categories with high-quality, realistic motion (lifestyle, one-shot videos)
  • Excluded low-visual-complexity or unrealistic motion (tutorials, lyrics, animations)
  • Only 60fps videos with >200 views
  • First/last 2 seconds trimmed; 5 clips randomly sampled per video
  • Overlay/watermark frames excluded via gradient analysis; shot boundaries detected and removed
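The trimming and clip-sampling rules above can be sketched as a small helper. It assumes clips may overlap and that start frames are drawn uniformly; watermark and shot-boundary filtering are omitted, and the function name is hypothetical.

```python
import random

def sample_clip_starts(num_frames, fps=60, trim_seconds=2, clip_len=24,
                       clips_per_video=5, seed=0):
    """Return start frames of clips drawn from the trimmed interior of a video."""
    first = trim_seconds * fps                              # skip the first 2 seconds
    last = num_frames - trim_seconds * fps - clip_len       # last valid start frame
    if last < first:
        return []                                           # video too short to sample
    rng = random.Random(seed)
    return [rng.randint(first, last) for _ in range(clips_per_video)]

starts = sample_clip_starts(num_frames=600)   # a 10-second video at 60 fps
```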

Supervised data: Kubric synthetic dataset (standard TAPIR training)

Multi-task training: Separate Adam optimizers for supervised (Kubric) and self-supervised (real) tasks with separate learning rates. SSL batch size halved (and learning rate halved proportionally) due to the extra forward pass cost.

Evaluation: TAP-Vid benchmark (Kinetics, DAVIS, RGB-Stacking) and RoboTAP, all at 256×256. Metrics: AJ (Average Jaccard), $<\delta^x_{avg}$ (position accuracy), OA (occlusion accuracy). Two query modes: strided (queries sampled on every 5th frame) and query_first (each point queried at the first frame where it is visible).
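For reference, the three metrics can be sketched as follows. The definitions (pixel thresholds 1, 2, 4, 8, 16 and the Jaccard counting rule) come from my understanding of the TAP-Vid benchmark rather than from this note, and the sketch pools all points together and does not exclude query frames, so it is a simplification of the official evaluation.

```python
import numpy as np

def tapvid_metrics(p_pred, occ_pred, p_gt, occ_gt, thresholds=(1, 2, 4, 8, 16)):
    """AJ, < delta^x_avg and OA for tracks of shape [N, T, 2] / [N, T] (bool occlusion)."""
    vis_gt, vis_pred = ~occ_gt, ~occ_pred
    dist = np.linalg.norm(p_pred - p_gt, axis=-1)
    oa = np.mean(occ_pred == occ_gt)                      # occlusion accuracy
    pts_within, jaccard = [], []
    for thr in thresholds:
        within = dist < thr
        # Position accuracy: visible ground-truth points predicted within thr pixels.
        pts_within.append(np.sum(within & vis_gt) / np.sum(vis_gt))
        # Jaccard: TP / (TP + FN + FP), where TP + FN = all visible ground-truth points.
        tp = np.sum(within & vis_gt & vis_pred)
        fp = np.sum(vis_pred & (~vis_gt | ~within))
        jaccard.append(tp / (np.sum(vis_gt) + fp))
    return {"AJ": np.mean(jaccard), "delta_avg": np.mean(pts_within), "OA": oa}

# One track, two frames, both visible; the second prediction is 3 pixels off.
p_gt = np.zeros((1, 2, 2))
occ = np.zeros((1, 2), dtype=bool)
p_pred = p_gt.copy()
p_pred[0, 1, 0] = 3.0
metrics = tapvid_metrics(p_pred, occ, p_gt, occ)
```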

7. Main Experiments & Quantitative Results

7.1 TAP-Vid Benchmark — Strided Mode (Table 1)

Each cell lists AJ / $<\delta^x_{avg}$ / OA.

| Method | Kinetics | DAVIS | RGB-Stacking |
|---|---|---|---|
| TAP-Net | 46.6 / 60.9 / 85.0 | 38.4 / 53.1 / 82.3 | 59.9 / 72.8 / 90.4 |
| PIPs | 35.3 / 54.8 / 77.4 | 42.0 / 59.4 / 82.1 | 37.3 / 51.0 / 91.6 |
| TAPIR | 57.2 / 70.1 / 87.8 | 61.3 / 73.6 / 88.8 | 62.7 / 74.6 / 91.6 |
| CoTracker | 57.3 / 70.6 / 87.5 | 64.8 / 79.1 / 88.7 | 65.9 / 80.6 / 85.0 |
| BootsTAPIR | 61.4 / 74.2 / 89.7 | 66.2 / 78.1 / 91.0 | 72.4 / 83.1 / 91.2 |

BootsTAPIR outperforms all methods on Kinetics and RGB-Stacking by large margins (4+ AJ points). On DAVIS, it achieves the best AJ (66.2) and OA (91.0), while CoTracker retains the best position accuracy (79.1 vs 78.1).

7.2 TAP-Vid Benchmark — Query-First Mode (Table 2)

Each cell lists AJ / $<\delta^x_{avg}$ / OA.

| Method | Kinetics | DAVIS | RoboTAP |
|---|---|---|---|
| TAPIR | 49.6 / 64.2 / 85.0 | 56.2 / 70.0 / 86.5 | 59.6 / 73.4 / 87.0 |
| CoTracker | 48.7 / 64.3 / 86.5 | 60.6 / 75.4 / 89.3 | 54.0 / 65.5 / 78.8 |
| BootsTAPIR | 54.6 / 68.4 / 86.5 | 61.4 / 73.6 / 88.7 | 64.9 / 80.1 / 86.3 |

A 5.3-point absolute AJ improvement on RoboTAP (64.9 vs 59.6 for TAPIR), despite RoboTAP looking very different from typical online videos.

7.3 Released Model with Bug Fix, Snap-to-Occluder, and Extra Data

Publicly released BootsTAPIR includes: coordinate bug fix, "snap-to-occluder" technique (altering query points near occlusion edges, 1 pixel away, to track the foreground rather than background), and training on higher-resolution, longer clips.

Strided mode (Table 4):

Each cell lists AJ / $<\delta^x_{avg}$ / OA.

| Method | Kinetics | DAVIS | RGB-Stacking |
|---|---|---|---|
| CoTracker | 57.3 / 70.6 / 87.5 | 64.8 / 79.1 / 88.7 | 65.9 / 80.6 / 85.0 |
| BootsTAPIR+fix+snap+data | 62.5 / 74.8 / 89.5 | 67.4 / 79.0 / 91.3 | 77.4 / 86.7 / 93.2 |

7.4 High-Resolution Evaluation (Table 6)

Evaluating at 512×512 instead of 256×256:

Each cell lists AJ / $<\delta^x_{avg}$ / OA.

| Method | Kinetics | DAVIS |
|---|---|---|
| BootsTAPIR | 62.5 / 74.8 / 89.5 | 67.4 / 79.0 / 91.3 |
| BootsTAPIR+hires | 63.7 / 76.0 / 88.4 | 70.2 / 81.2 / 91.2 |

+1.2 AJ on Kinetics and +2.8 AJ on DAVIS.

7.5 Causal Model and Perception Test

  • Causal BootsTAPIR: 4.6% overall improvement on Kinetics, 3.0% on DAVIS (query-first) vs Causal TAPIR
  • Perception Test: BootsTAPIR improves Overall 55.7 → 59.6, Static 57.4 → 61.3, Dynamic 46.3 → 49.7

8. Ablations, Limitations & Practical Pointers

Ablations (Table 3):

(a) Data transformations (DAVIS strided AJ / Kinetics q_first AJ):

  • BASE: 65.8 / 54.4
  • No JPEG augmentation: 65.7 / 53.5 (JPEG mostly matters for Kinetics)
  • No affine transform: 54.4 / 44.7 (massive drop — affine is critical)
  • Same queries (teacher=student): 65.6 / 53.2
  • Uniform query sampling: 65.6 / 54.3

(b) Pseudo-label filtering:

  • No filtering: 65.9 / 54.1 (hurts Kinetics)
  • Cycle-consistency instead of confidence: 66.1 / 54.3 (slightly better on DAVIS)

(c) Training setup:

  • Full model: 66.2 / 54.6
  • Kubric-only (same capacity, no SSL): 65.0 / 52.7 (SSL adds ~1.2% DAVIS, ~2% Kinetics)
  • Siamese (no stop-gradient): 49.8 / 29.6 (collapses — stop-gradient is essential)

(d) Training data:

  • 2-frame clips: 64.3 / 50.5 (longer clips important)
  • 6-frame clips: 63.7 / 50.9
  • 1% of real data: 66.2 / 54.0 (surprisingly competitive — even small real data helps)

Limitations:

  • Training computationally expensive (extra teacher forward pass + real data pipeline)
  • Single point estimate per video — cannot elegantly handle duplicated or rotationally-symmetric objects
  • Performance continues to improve with more training — approach has not saturated

Practical Pointers:

  • Affine transformations are by far the most important augmentation — removing them causes massive performance collapse
  • Stop-gradient (EMA student-teacher) is essential; Siamese approach collapses
  • 24-frame clips much better than 2- or 6-frame clips — temporal context helps the teacher correct errors
  • Even 1% of real data provides significant gains
  • Released model: BootsTAPIR tracks 10,000 points on a 256×256, 50-frame video in 5.6s (A100 + JAX); causal model tracks 400 points at 30.1 FPS

9. Conclusions & Future Work

BootsTAP presents an effective method for leveraging large-scale unlabeled data to improve point tracking by applying consistency principles (equivariance to spatial transforms, invariance to query point choice). The formulation avoids complex priors like spatial/temporal smoothness. Despite similarities to "unstable" two-frame self-supervised optical flow approaches, the multi-frame setting makes this approach stable and effective. Performance continues to improve with longer training. Future directions include principled solutions to the foreground/background tracking bias and improved handling of duplicated/symmetric objects.

References

  • Doersch, C., Luc, P., Yang, Y., Gokay, D., Koppula, S., Gupta, A., Heyward, J., Rocco, I., Goroshin, R., Carreira, J., & Zisserman, A. (2024). BootsTAP: Bootstrapped Training for Tracking-Any-Point. ACCV 2024. arXiv:2402.00847