Description: Paper note on BootsTAP, a self-supervised student-teacher method for improving point tracking on real-world videos (ACCV 2024)
My Notion Note ID: K2E-B-F2-3
Created: 2026-03-23
Updated: 2026-06-06
License: Reuse welcome — please credit Yu Zhang and link back to yuzhang.io

1. Paper Information
2. Summary
3. Key Contributions
4. Background & Related Work
5. Method Details & Key Equations
6. Training Setup & Datasets
7. Main Experiments & Quantitative Results
8. Ablations, Limitations & Practical Pointers
9. Conclusions & Future Work
References

1. Paper Information

Title: BootsTAP: Bootstrapped Training for Tracking-Any-Point
Authors: Carl Doersch, Pauline Luc, Yi Yang, Dilara Gokay, Skanda Koppula, Ankush Gupta, Joseph Heyward, Ignacio Rocco, Ross Goroshin, João Carreira, Andrew Zisserman
Affiliations: Google DeepMind; VGG, Department of Engineering Science, University of Oxford
Paper: ACCV 2024 / arXiv:2402.00847
Code: github.com/google-deepmind/tapnet (JAX and PyTorch)
Project page: bootstap.github.io

2. Summary

BootsTAP demonstrates that large-scale, unlabeled, uncurated real-world videos can substantially improve a Tracking-Any-Point (TAP) model using a simple self-supervised student-teacher framework. Starting from a TAPIR model pre-trained on synthetic data (Kubric), it bootstraps on ~15 million real video clips by: (1) using the teacher (EMA of student) to produce pseudo-ground-truth trajectories, (2) applying spatial transformations and corruptions to the student's input, and (3) training the student to reproduce the teacher's predictions with a cycle-consistency filter. The result, BootsTAPIR, achieves state-of-the-art on the TAP-Vid benchmark by wide margins: TAP-Vid-DAVIS improves from 61.3% to 67.4% AJ, and TAP-Vid-Kinetics from 57.2% to 62.5% AJ (strided mode).

3. Key Contributions

First large-scale bootstrapping pipeline for TAP: Demonstrates that unlabeled real-world video can substantially improve point tracking via self-supervised learning, using two consistency principles: equivariance to spatial transforms and invariance to query point choice along a trajectory.
Simple but effective formulation: Minimal architectural changes to TAPIR: only five additional 2D conv-residual layers (channel multiplier 4) in the backbone, roughly doubling the number of backbone parameters.
Comprehensive ablations: Analyzes data transformations, pseudo-label filtering, training setup, and data requirements. Affine transformations are critical; cycle-consistency filtering helps; even 1% of data provides gains.
State-of-the-art results: Surpasses all prior methods on TAP-Vid (Kinetics, DAVIS, RGB-Stacking) and RoboTAP by wide margins. Also demonstrates causal model and high-resolution variants.

4. Background & Related Work

Tracking-Any-Point (TAP): Track any point on solid surfaces in a video, outputting position and occlusion status for every frame. Unlike optical flow (displacement between adjacent frame pairs), TAP produces per-point trajectories across the entire video, handling long-range tracking, occlusion, and reappearance. TAP-Vid [12] formalized the benchmark with the datasets Kinetics (human actions), DAVIS (object tracking), and RGB-Stacking (robotic manipulation); RoboTAP adds real robotic manipulation videos. State-of-the-art methods (TAPIR, CoTracker) rely heavily on synthetic data (Kubric), creating a sim-to-real domain gap.

Self-supervised correspondence via photometric loss: Photometric losses are popular in optical flow but struggle with occlusions, lighting changes, and textureless regions. Multi-frame methods, appearance reconstruction, and high-resolution matching partially address these.

Temporal continuity and cycle-consistency: Temporal continuity provides correspondences for feature learning. Cycle-consistency can yield useful features but fails at occlusions.

Semi-supervised correspondence: Pseudo-GT approaches (using RAFT or TAP-Net) used for optical flow and tracking. OmniMotion infers 3D scene representations but doesn't retrain a general TAP model. Li et al. [34] proposed a reconstruction-based self-supervised loss but achieved much lower performance. Concurrent work [57] (CoTracker v2) uses frozen trajectory labels and fine-tunes on target datasets; BootsTAP continuously updates the teacher via EMA and trains on a single large diverse dataset.

5. Method Details & Key Equations

5.1 Core Principles

Two properties of point tracks on solid, opaque surfaces:

Equivariance: Spatial transformations of the video produce equivalent transformations of the trajectories. If the video undergoes a spatial transformation $\Phi$ (e.g., affine: resize + translate), the predicted trajectories undergo the same $\Phi$.
Invariance: Querying any point along the same trajectory should yield the same track (each trajectory is an equivalence class).

A naive Siamese approach (minimizing difference between two augmented views) degrades toward trivial solutions. Instead, a student-teacher framework is used: the teacher provides stop-gradient pseudo-labels; the student receives a harder, augmented view.
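A minimal sketch of the teacher update may help make the student-teacher setup concrete. The Summary states that the teacher is an EMA of the student; the function name `ema_update`, the dictionary-of-floats parameter layout, and the decay value are illustrative assumptions, since this note does not give the actual decay.

```python
def ema_update(teacher_params, student_params, decay=0.99):
    """Return new teacher parameters as an exponential moving average of the student.

    `decay` is a placeholder value (assumption); the note does not state it.
    """
    return {
        name: decay * teacher_params[name] + (1.0 - decay) * student_params[name]
        for name in teacher_params
    }


# Toy parameters: plain floats stand in for weight tensors.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 2.0, "b": 1.0}

# Only the student is updated by gradients; the teacher just trails it.
teacher = ema_update(teacher, student, decay=0.9)
```

In a real framework the teacher outputs would additionally be wrapped in a stop-gradient so that no gradient flows into the teacher branch.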

5.2 TAPIR Loss (Baseline)

Let $\hat{y} = \{\hat{p}, \hat{o}, \hat{u}\}$ be the predictions: position $\hat{p} \in \mathbb{R}^{T \times 2}$, occlusion logit $\hat{o} \in \mathbb{R}^T$, and uncertainty logit $\hat{u} \in \mathbb{R}^T$. The standard TAPIR loss for a single trajectory at frame $t$:

$$\mathcal{L}_{tapir}(\hat{p}[t], \hat{o}[t], \hat{u}[t]) = \text{Huber}(\hat{p}[t], p[t])(1 - o[t]) + \text{BCE}(\hat{o}[t], o[t]) + \text{BCE}(\hat{u}[t], u[t])(1 - o[t])$$

Notation: $\hat{\cdot}$ = model prediction; unhatted = ground-truth. Huber = smooth L1 (robust to outliers). BCE = Binary Cross-Entropy (for binary occlusion/visibility and certainty/uncertainty classification). $(1 - o[t])$ = visibility mask: when $o[t]=1$ (occluded), position and uncertainty terms become 0 — no ground-truth position to compare. Occlusion loss is NOT masked (the model must always predict occlusion state).

The uncertainty target is $u[t] = \mathbb{1}(d(\hat{p}[t], p[t]) > \delta)$ with $\delta = 6$ pixels.
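The per-frame loss above can be sketched in plain Python as follows. This is an illustrative reading of the formula, not the TAPIR implementation: the Huber threshold `huber_delta=1.0` and all function names are assumptions, and the logits are treated as raw pre-sigmoid values.

```python
import math


def huber(pred, target, huber_delta=1.0):
    """Huber loss on the Euclidean distance between two (x, y) points."""
    err = math.dist(pred, target)
    if err <= huber_delta:
        return 0.5 * err * err
    return huber_delta * (err - 0.5 * huber_delta)


def bce_with_logit(logit, label):
    """Numerically stable binary cross-entropy for a raw logit and a 0/1 label."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def tapir_loss_frame(p_hat, o_hat, u_hat, p, o, unc_threshold=6.0):
    """Loss for one frame: position and uncertainty terms are masked when occluded."""
    visible = 1.0 - o
    # Uncertainty target: 1 if the prediction is more than 6 px from the label.
    u = 1.0 if math.dist(p_hat, p) > unc_threshold else 0.0
    return (
        huber(p_hat, p) * visible
        + bce_with_logit(o_hat, o)            # occlusion term is never masked
        + bce_with_logit(u_hat, u) * visible
    )
```

For an occluded frame (`o = 1`) only the occlusion term contributes, which matches the masking described in the notation paragraph.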

5.3 Model Capacity Expansion

After pre-training, five 2D conv-residual layers with channel multiplier 4 are added to the backbone, roughly doubling the number of backbone parameters. The new layers are initialized to the identity following "zero init" [18].

5.4 Pseudo-Label Generation

Let $\hat{y}_S = \{\hat{p}_S, \hat{o}_S, \hat{u}_S\}$ be student predictions and $y_\mathcal{T} = \{p_\mathcal{T}, o_\mathcal{T}, u_\mathcal{T}\}$ be teacher pseudo-labels:

$$p_\mathcal{T}[t] = \hat{p}_\mathcal{T}[t] \quad;\quad o_\mathcal{T}[t] = \mathbb{1}(\hat{o}_\mathcal{T}[t] > 0) \quad;\quad u_\mathcal{T}[t] = \mathbb{1}(d(\hat{p}_\mathcal{T}[t], \hat{p}_S[t]) > \delta)$$

The self-supervised loss $\ell_{ssl}$ has the same form as the TAPIR loss, using teacher predictions as ground truth:

$$\ell_{ssl}(\hat{p}_S[t], \hat{o}_S[t], \hat{u}_S[t]) = \text{Huber}(\hat{p}_S[t], p_\mathcal{T}[t])(1 - o_\mathcal{T}[t]) + \text{BCE}(\hat{o}_S[t], o_\mathcal{T}[t]) + \text{BCE}(\hat{u}_S[t], u_\mathcal{T}[t])(1 - o_\mathcal{T}[t])$$

The teacher always uses the final (refined) predictions for pseudo-labels, while supervision is applied to unrefined student outputs too, encouraging stronger features for faster convergence.
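As a rough illustration of the pseudo-label definitions in this section, the sketch below converts raw teacher outputs into targets for one trajectory. The function name `make_pseudo_labels` and the list-of-tuples layout are assumptions made for the example; the thresholds (logit > 0 for occlusion, 6 pixels for uncertainty) come from the equations in this note.

```python
import math


def make_pseudo_labels(teacher_pos, teacher_occ_logits, student_pos, delta=6.0):
    """Build (p_T, o_T, u_T) for one trajectory.

    teacher_pos / student_pos: lists of (x, y) per frame, in the same coordinate space.
    teacher_occ_logits: list of raw occlusion logits per frame.
    """
    # Position target: the teacher prediction, used as-is.
    p_t = list(teacher_pos)
    # Occlusion target: hard-threshold the teacher logit at 0.
    o_t = [1 if logit > 0 else 0 for logit in teacher_occ_logits]
    # Uncertainty target: 1 where the student is more than delta px from the teacher.
    u_t = [
        1 if math.dist(pt, ps) > delta else 0
        for pt, ps in zip(teacher_pos, student_pos)
    ]
    return p_t, o_t, u_t


p_t, o_t, u_t = make_pseudo_labels(
    teacher_pos=[(10.0, 10.0), (12.0, 11.0), (50.0, 50.0)],
    teacher_occ_logits=[-2.0, 0.5, -0.1],
    student_pos=[(10.5, 10.0), (30.0, 11.0), (50.0, 53.0)],
)
```

Note that the uncertainty target depends on the student's own prediction, so it changes as the student improves.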

5.5 Video Degradations

The student receives a degraded view to prevent trivial solutions:

Spatial: Frames are resized to a smaller resolution $r$ and superimposed onto a black background at a random position $(h, w)$. $r$ varies linearly over time, creating a frame-wise affine transformation $\Phi$.
Non-spatial: Random JPEG compression applied before pasting onto the black background.
The inverse transform $\Phi^{-1}$ maps student predictions back to original coordinate space for loss computation.

Why degrade? Prevents the student from finding trivial shortcuts (e.g., copying pixel coordinates). Forces learning of higher-level semantic features (shape, structure) rather than low-level texture matching.
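The coordinate bookkeeping behind the spatial degradation can be sketched as below. This is a simplified model under stated assumptions: the resize is expressed as an isotropic scale factor (target resolution divided by original resolution), the paste offset is kept fixed across frames, and the names `scale_at`, `phi`, and `phi_inv` are hypothetical.

```python
def scale_at(t, num_frames, s_start, s_end):
    """Linearly interpolate the resize factor across the clip."""
    if num_frames == 1:
        return s_start
    alpha = t / (num_frames - 1)
    return (1.0 - alpha) * s_start + alpha * s_end


def phi(point, t, num_frames, s_start, s_end, offset):
    """Map an original-frame (x, y) to student-frame coordinates at frame t."""
    s = scale_at(t, num_frames, s_start, s_end)
    x, y = point
    ox, oy = offset
    return (x * s + ox, y * s + oy)


def phi_inv(point, t, num_frames, s_start, s_end, offset):
    """Map a student-frame (x, y) back to the original coordinate space."""
    s = scale_at(t, num_frames, s_start, s_end)
    x, y = point
    ox, oy = offset
    return ((x - ox) / s, (y - oy) / s)


# A point at (100, 50) in a 24-frame clip, shrunk from 0.5x to 0.75x, pasted at (20, 10).
args = dict(num_frames=24, s_start=0.5, s_end=0.75, offset=(20.0, 10.0))
student_xy = phi((100.0, 50.0), t=0, **args)
restored_xy = phi_inv(student_xy, t=0, **args)
```

Student predictions would pass through `phi_inv` before being compared with teacher pseudo-labels, so both live in the original coordinate space.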

5.6 Query Point Sampling (Trajectory Equivalence Class)

Sample teacher query $Q_1 = (q_1, t_1)$ where $q_1$ is a random $(x,y)$ coordinate and $t_1$ a random frame.
Sample student query $Q_2 = (q_2, t_2)$ from the teacher's visible trajectory: $Q_2 \in \{(p_\mathcal{T}[t], t); t \text{ s.t. } o_\mathcal{T}[t] = 0\}$.
With probability 0.5, set $Q_2 = Q_1$ (same query point); otherwise, sample $Q_2$ uniformly from the visible points of the teacher trajectory.

Sampling different query points along the teacher's trajectory enforces the invariance property — the student learns that all points on the same trajectory are equivalent and should produce the same track.
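The sampling steps above might look like the following sketch. The function name `sample_student_query`, the explicit `rng` argument, and the fallback to the teacher query when no frame is visible are assumptions for illustration; the 0.5 probability comes from this note.

```python
import random


def sample_student_query(q1, t1, teacher_pos, teacher_occ, rng, p_same=0.5):
    """Pick the student query (q2, t2) given the teacher query and its trajectory.

    teacher_pos: list of (x, y) per frame; teacher_occ: list of 0/1 occlusion labels.
    """
    # With probability p_same, reuse the teacher query unchanged.
    if rng.random() < p_same:
        return q1, t1
    # Otherwise pick uniformly among frames the teacher marks as visible.
    visible_frames = [t for t, occ in enumerate(teacher_occ) if occ == 0]
    if not visible_frames:
        return q1, t1  # fallback (assumption): nothing visible to sample from
    t2 = rng.choice(visible_frames)
    return teacher_pos[t2], t2


rng = random.Random(0)
teacher_pos = [(5.0, 5.0), (6.0, 5.5), (7.0, 6.0), (8.0, 6.5)]
teacher_occ = [0, 1, 0, 1]
q2, t2 = sample_student_query((5.0, 5.0), 0, teacher_pos, teacher_occ, rng, p_same=0.0)
```

Passing a seeded `random.Random` keeps the sampling reproducible, which is convenient when debugging a data pipeline.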

5.7 Cycle-Consistency Filtering

To filter incorrect teacher predictions, apply a cycle-consistency mask:

$$m_{cycle} = \mathbb{1}(d(\hat{p}_S[t_1], q_1) < \delta_{cycle}) \ast \mathbb{1}(\hat{o}_S[t_1] \le 0)$$

where $\delta_{cycle} = 4$ pixels.

The student tracks from $q_2$ back to time $t_1$. If it arrives close to $q_1$ (within 4 pixels) AND predicts the point as visible ($\hat{o}_S[t_1] \le 0$), the trajectory is considered reliable. Otherwise the teacher likely tracked the wrong point, and the trajectory is excluded from the loss.
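A direct transcription of the mask into Python, as a sketch: the function name `cycle_mask` and the scalar-per-trajectory interface are assumptions, while the 4-pixel threshold and the logit test follow the definition in this section.

```python
import math


def cycle_mask(student_pos_t1, student_occ_logit_t1, q1, delta_cycle=4.0):
    """Return 1.0 if the student's track, evaluated at frame t1, lands back on q1.

    student_pos_t1: student's predicted (x, y) at the teacher's query frame t1.
    student_occ_logit_t1: student's raw occlusion logit at t1 (<= 0 means visible).
    """
    close = math.dist(student_pos_t1, q1) < delta_cycle
    visible = student_occ_logit_t1 <= 0
    return 1.0 if (close and visible) else 0.0


kept = cycle_mask((101.0, 52.0), -1.3, q1=(100.0, 50.0))        # ~2.2 px away, visible
dropped_far = cycle_mask((110.0, 50.0), -1.3, q1=(100.0, 50.0))  # 10 px away
dropped_occ = cycle_mask((100.0, 50.0), 0.7, q1=(100.0, 50.0))   # predicted occluded
```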

5.8 Final Self-Supervised Loss

$$\mathcal{L}_{SSL} = \sum_t m_{cycle}^t \ast \ell_{ssl}^t$$

Trained with 128 query points per input video, averaging the loss. To prevent catastrophic forgetting, supervised training on Kubric continues alongside the self-supervised loss, using separate Adam optimizers with separate learning rates (SSL uses half the batch size and half the learning rate).
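The masked sum and the averaging over query points can be sketched as follows. The normalization (dividing by the total number of queries, including masked-out ones) is an assumption, as is the name `ssl_loss`; the note only says the loss is averaged over the 128 queries per video. Masks are taken per frame to mirror the equation, although a mask defined only at $t_1$ would be constant along a trajectory.

```python
def ssl_loss(per_frame_losses, masks):
    """Average over queries of the masked sum over frames.

    per_frame_losses[q][t]: self-supervised loss for query q at frame t.
    masks[q][t]: cycle-consistency mask (0.0 or 1.0) for query q at frame t.
    """
    per_query = [
        sum(m * loss for m, loss in zip(mask_row, loss_row))
        for loss_row, mask_row in zip(per_frame_losses, masks)
    ]
    # Assumption: normalize by the number of queries, not by the number kept.
    return sum(per_query) / len(per_query)


# Two queries, two frames: the second query fails the cycle check and contributes 0.
losses = [[1.0, 2.0], [3.0, 4.0]]
masks = [[1.0, 1.0], [0.0, 0.0]]
total = ssl_loss(losses, masks)
```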

6. Training Setup & Datasets

Training data (real): ~15 million 24-frame clips from publicly accessible online videos:

Selected categories with high-quality, realistic motion (lifestyle, one-shot videos)
Excluded low-visual-complexity or unrealistic motion (tutorials, lyrics, animations)
Only 60fps videos with >200 views
First/last 2 seconds trimmed; 5 clips randomly sampled per video
Overlay/watermark frames excluded via gradient analysis; shot boundaries detected and removed

Supervised data: Kubric synthetic dataset (standard TAPIR training)

Multi-task training: Separate Adam optimizers for supervised (Kubric) and self-supervised (real) tasks with separate learning rates. SSL batch size halved (and learning rate halved proportionally) due to the extra forward pass cost.

Evaluation: TAP-Vid benchmark (Kinetics, DAVIS, RGB-Stacking) and RoboTAP, all at 256×256. Metrics: AJ (Average Jaccard), $<\delta_{avg}^x$ (position accuracy), OA (occlusion accuracy). Two query modes: strided (queries sampled every 5 frames) and query_first (each point queried at the first frame where it is visible).
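To make the position-accuracy metric concrete, here is a small sketch of $<\delta_{avg}^x$ for one trajectory. The thresholds of 1, 2, 4, 8, and 16 pixels follow the TAP-Vid definition as I understand it and are not stated in this note, so treat them as an assumption; the function name and input layout are also illustrative.

```python
import math


def position_accuracy(pred, gt, gt_occluded, thresholds=(1, 2, 4, 8, 16)):
    """Fraction of visible points within each pixel threshold, averaged over thresholds.

    pred / gt: lists of (x, y) per frame; gt_occluded: list of booleans per frame.
    """
    visible = [(p, g) for p, g, occ in zip(pred, gt, gt_occluded) if not occ]
    if not visible:
        raise ValueError("no visible ground-truth points to evaluate")
    fractions = [
        sum(math.dist(p, g) < thr for p, g in visible) / len(visible)
        for thr in thresholds
    ]
    return sum(fractions) / len(fractions)


# Errors of 0 px, 3 px, and 20 px on three visible frames; the occluded frame is ignored.
gt = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
pred = [(0.0, 0.0), (3.0, 0.0), (20.0, 0.0), (99.0, 99.0)]
acc = position_accuracy(pred, gt, gt_occluded=[False, False, False, True])
```

Average Jaccard combines the same thresholds with occlusion predictions, so it penalizes both localization and visibility errors; that computation is not shown here.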

7. Main Experiments & Quantitative Results

7.1 TAP-Vid Benchmark — Strided Mode (Table 1)

Method	Kinetics AJ / $<\delta_{avg}^x$ / OA	DAVIS AJ / $<\delta_{avg}^x$ / OA	RGB-Stacking AJ / $<\delta_{avg}^x$ / OA
TAP-Net	46.6 / 60.9 / 85.0	38.4 / 53.1 / 82.3	59.9 / 72.8 / 90.4
PIPs	35.3 / 54.8 / 77.4	42.0 / 59.4 / 82.1	37.3 / 51.0 / 91.6
TAPIR	57.2 / 70.1 / 87.8	61.3 / 73.6 / 88.8	62.7 / 74.6 / 91.6
CoTracker	57.3 / 70.6 / 87.5	64.8 / 79.1 / 88.7	65.9 / 80.6 / 85.0
BootsTAPIR	61.4 / 74.2 / 89.7	66.2 / 78.1 / 91.0	72.4 / 83.1 / 91.2

BootsTAPIR outperforms all methods on Kinetics and RGB-Stacking by large margins (4+ AJ points). On DAVIS, it achieves the best AJ (66.2) and OA (91.0), while CoTracker retains the best position accuracy (79.1 vs 78.1).

7.2 TAP-Vid Benchmark — Query-First Mode (Table 2)

Method	Kinetics AJ / $<\delta_{avg}^x$ / OA	DAVIS AJ / $<\delta_{avg}^x$ / OA	RoboTAP AJ / $<\delta_{avg}^x$ / OA
TAPIR	49.6 / 64.2 / 85.0	56.2 / 70.0 / 86.5	59.6 / 73.4 / 87.0
CoTracker	48.7 / 64.3 / 86.5	60.6 / 75.4 / 89.3	54.0 / 65.5 / 78.8
BootsTAPIR	54.6 / 68.4 / 86.5	61.4 / 73.6 / 88.7	64.9 / 80.1 / 86.3

5.3-point absolute AJ improvement over TAPIR on RoboTAP (64.9 vs 59.6), despite RoboTAP looking very different from typical online videos.

7.3 Released Model with Bug Fix, Snap-to-Occluder, and Extra Data

Publicly released BootsTAPIR includes: coordinate bug fix, "snap-to-occluder" technique (altering query points near occlusion edges, 1 pixel away, to track the foreground rather than background), and training on higher-resolution, longer clips.

Strided mode (Table 4):

Method	Kinetics AJ / $<\delta_{avg}^x$ / OA	DAVIS AJ / $<\delta_{avg}^x$ / OA	RGB-Stacking AJ / $<\delta_{avg}^x$ / OA
CoTracker	57.3 / 70.6 / 87.5	64.8 / 79.1 / 88.7	65.9 / 80.6 / 85.0
BootsTAPIR+fix+snap+data	62.5 / 74.8 / 89.5	67.4 / 79.0 / 91.3	77.4 / 86.7 / 93.2

7.4 High-Resolution Evaluation (Table 6)

Evaluating at 512×512 instead of 256×256:

Method	Kinetics AJ / $<\delta_{avg}^x$ / OA	DAVIS AJ / $<\delta_{avg}^x$ / OA
BootsTAPIR	62.5 / 74.8 / 89.5	67.4 / 79.0 / 91.3
BootsTAPIR+hires	63.7 / 76.0 / 88.4	70.2 / 81.2 / 91.2

+1.2 AJ on Kinetics and +2.8 AJ on DAVIS.

7.5 Causal Model and Perception Test

Causal BootsTAPIR: 4.6% overall improvement on Kinetics, 3.0% on DAVIS (query-first) vs Causal TAPIR
Perception Test: BootsTAPIR improves Overall 55.7 → 59.6, Static 57.4 → 61.3, Dynamic 46.3 → 49.7

8. Ablations, Limitations & Practical Pointers

Ablations (Table 3):

(a) Data transformations (DAVIS strided AJ / Kinetics q_first AJ):

BASE: 65.8 / 54.4
No JPEG augmentation: 65.7 / 53.5 (JPEG mostly matters for Kinetics)
No affine transform: 54.4 / 44.7 (massive drop — affine is critical)
Same queries (teacher=student): 65.6 / 53.2
Uniform query sampling: 65.6 / 54.3

(b) Pseudo-label filtering:

No filtering: 65.9 / 54.1 (hurts Kinetics)
Cycle-consistency instead of confidence: 66.1 / 54.3 (slightly better on DAVIS)

(c) Training setup:

Full model: 66.2 / 54.6
Kubric-only (same capacity, no SSL): 65.0 / 52.7 (SSL adds ~1.2% DAVIS, ~2% Kinetics)
Siamese (no stop-gradient): 49.8 / 29.6 (collapses — stop-gradient is essential)

(d) Training data:

2-frame clips: 64.3 / 50.5 (longer clips important)
6-frame clips: 63.7 / 50.9
1% of real data: 66.2 / 54.0 (surprisingly competitive — even small real data helps)

Limitations:

Training computationally expensive (extra teacher forward pass + real data pipeline)
The model outputs a single point estimate per query and frame, so it cannot elegantly handle duplicated or rotationally-symmetric objects
Performance continues to improve with more training — approach has not saturated

Practical Pointers:

Affine transformations are by far the most important augmentation — removing them causes massive performance collapse
Stop-gradient (EMA student-teacher) is essential; Siamese approach collapses
24-frame clips much better than 2- or 6-frame clips — temporal context helps the teacher correct errors
Even 1% of real data provides significant gains
Released model: BootsTAPIR tracks 10,000 points on a 256×256, 50-frame video in 5.6s (A100 + JAX); causal model tracks 400 points at 30.1 FPS

9. Conclusions & Future Work

BootsTAP presents an effective method for leveraging large-scale unlabeled data to improve point tracking by applying consistency principles (equivariance to spatial transforms, invariance to query point choice). The formulation avoids complex priors like spatial/temporal smoothness. Despite similarities to "unstable" two-frame self-supervised optical flow approaches, the multi-frame setting makes this approach stable and effective. Performance continues to improve with longer training. Future directions include principled solutions to the foreground/background tracking bias and improved handling of duplicated/symmetric objects.

References

Doersch, C., Luc, P., Yang, Y., Gokay, D., Koppula, S., Gupta, A., Heyward, J., Rocco, I., Goroshin, R., Carreira, J., & Zisserman, A. (2024). BootsTAP: Bootstrapped Training for Tracking-Any-Point. ACCV 2024. arXiv:2402.00847
