Robust Point Tracking with Epipolar Constraints

Point tracking is a core primitive for video understanding, editing, and reconstruction. Modern deep trackers like CoTracker can produce dense tracks over long sequences, but geometric drift still accumulates—especially when there’s significant camera motion—leading to correspondences that violate basic multi-view constraints.

This project explores a simple question: Can we reduce long-horizon drift by explicitly enforcing epipolar geometry?

Links:

  • Code: https://github.com/adithyaknarayan/EpiTrack
  • Full report (PDF): /assets/pdf/Geom_Final_report-2.pdf
  • Project page: /projects/RobustPointTrackingEpipolar/

Key idea: epipolar constraints as “guardrails”

For rigid/static parts of the scene, correspondences between two frames should satisfy $x_t^\top F \, x_0 = 0$, where $F$ is the fundamental matrix between the two views.

When tracks drift, this constraint is violated. We use that signal in two ways.
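
To make that drift signal concrete, here is a minimal NumPy sketch of the point-to-epipolar-line distance; the function name and interface are illustrative, not the project's actual code:

```python
import numpy as np

def epipolar_errors(F, pts0, pts_t):
    """Point-to-epipolar-line distance for correspondences pts0 -> pts_t.

    F     : (3, 3) fundamental matrix between frame 0 and frame t
    pts0  : (N, 2) points in frame 0
    pts_t : (N, 2) tracked points in frame t
    Returns (N,) distances in pixels; large values flag drifted tracks.
    """
    # Homogeneous coordinates.
    h0 = np.hstack([pts0, np.ones((len(pts0), 1))])    # (N, 3)
    ht = np.hstack([pts_t, np.ones((len(pts_t), 1))])  # (N, 3)
    # Epipolar lines in frame t: l = F x_0, so x_t^T F x_0 = x_t . l.
    lines = h0 @ F.T                                   # (N, 3) as (a, b, c)
    # Distance from x_t to the line ax + by + c = 0.
    num = np.abs(np.sum(ht * lines, axis=1))
    den = np.sqrt(lines[:, 0] ** 2 + lines[:, 1] ** 2)
    return num / den
```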

Approach 1: Model-agnostic post-processing (iterative epipolar refinement)

Given tracks from any point tracker:

  • Estimate $F$ with RANSAC using mutually visible correspondences.
  • Identify outliers via epipolar error (point-to-epipolar-line distance).
  • Correct outliers using a local SIFT search along the epipolar line (fallback: snap to the closest point on the line).
  • Re-estimate $F$ with a tighter threshold and iterate until convergence (see the sketch below).

Figure: Post-processing pipeline overview (estimate fundamental matrices and iteratively correct outliers to enforce epipolar consistency).
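
A minimal sketch of this loop, assuming OpenCV's RANSAC-based `cv2.findFundamentalMat` and the snap-to-line fallback; the local SIFT search along the line is omitted, and `refine_tracks` is an illustrative name, not the repository's API:

```python
import cv2
import numpy as np

def refine_tracks(pts0, pts_t, thresh=3.0, iters=5, shrink=0.7):
    """Iterative epipolar refinement, snap-to-line fallback variant.

    pts0, pts_t : (N, 2) float32 correspondences between frame 0 and frame t.
    Returns a corrected copy of pts_t.
    """
    pts_t = pts_t.copy()
    for _ in range(iters):
        # 1) Robustly estimate F from the current correspondences.
        F, _ = cv2.findFundamentalMat(pts0, pts_t, cv2.FM_RANSAC,
                                      ransacReprojThreshold=thresh)
        if F is None:
            break
        # 2) Epipolar lines in frame t (l = F x_0) and point-to-line errors.
        h0 = np.hstack([pts0, np.ones((len(pts0), 1))])
        a, b, c = (h0 @ F.T).T                   # line coefficients ax+by+c=0
        d = (a * pts_t[:, 0] + b * pts_t[:, 1] + c) / (a**2 + b**2)
        err = np.abs(d) * np.sqrt(a**2 + b**2)   # distance in pixels
        outliers = err > thresh
        if not outliers.any():
            break
        # 3) Snap each outlier to the closest point on its epipolar line
        #    (the full method searches SIFT matches along the line first).
        snapped = pts_t - d[:, None] * np.stack([a, b], axis=1)
        pts_t[outliers] = snapped[outliers]
        # 4) Tighten the threshold and re-estimate on the next pass.
        thresh *= shrink
    return pts_t
```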

Approach 2: Weakly-supervised finetuning (soft epipolar loss)

We also finetune CoTracker with a soft epipolar loss on rigid/static regions (a loss sketch follows below), using a teacher–student setup to encourage:

  • stronger geometric consistency on static points
  • minimal disruption to non-rigid/dynamic motion
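
For the loss itself, here is a minimal PyTorch-style sketch of what a soft epipolar penalty on static points could look like, using the Sampson distance with a Huber-style clamp; `soft_epipolar_loss`, the masking scheme, and the clamp are illustrative assumptions, not the exact loss from the report:

```python
import torch

def soft_epipolar_loss(F, x0, xt, static_mask, delta=1.0):
    """Soft epipolar (Sampson) penalty on rigid/static tracks (sketch).

    F           : (B, 3, 3) fundamental matrix per sequence
    x0, xt      : (B, N, 2) predicted track positions in frames 0 and t
    static_mask : (B, N) bool, True where the scene is rigid/static
    """
    ones = torch.ones(*x0.shape[:2], 1, device=x0.device, dtype=x0.dtype)
    h0 = torch.cat([x0, ones], dim=-1)           # (B, N, 3)
    ht = torch.cat([xt, ones], dim=-1)
    Fx0 = torch.einsum('bij,bnj->bni', F, h0)    # epipolar lines in frame t
    Ftxt = torch.einsum('bji,bnj->bni', F, ht)   # F^T x_t, lines in frame 0
    num = torch.einsum('bni,bni->bn', ht, Fx0) ** 2
    den = (Fx0[..., 0] ** 2 + Fx0[..., 1] ** 2
           + Ftxt[..., 0] ** 2 + Ftxt[..., 1] ** 2)
    sampson = num / (den + 1e-8)                 # first-order epipolar error
    # Huber-style clamp: quadratic for small errors, linear beyond delta.
    robust = torch.where(sampson < delta ** 2,
                         sampson,
                         2 * delta * sampson.sqrt() - delta ** 2)
    # Only penalize points believed to be static.
    mask = static_mask.float()
    return (robust * mask).sum() / mask.sum().clamp(min=1.0)
```

The clamp keeps mislabeled dynamic points from dominating the gradient, which matches the goal of minimally disrupting non-rigid motion.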

Results: geometric consistency improves (sometimes at a cost)

Across long sequences, both approaches reduce geometric drift and lower epipolar error. In some settings, enforcing geometry can slightly reduce standard tracking metrics (especially in dynamic regions), but it yields significantly better multi-view consistency over long horizons.

On TAP‑Vid DAVIS (reported in our writeup), we see:

| Method | δ_avg^vis (%) | OA (%) | AJ (%) | Mean Error (px) | Median Error (px) |
|---|---|---|---|---|---|
| PIPs | 64.0 | 77.0 | 63.5 | 4.87 | 3.42 |
| TAPIR | 62.9 | 88.0 | 73.3 | 3.95 | 2.76 |
| CoTracker3 (baseline) | 77.3 | 91.8 | 64.5 | 3.24 | 2.18 |
| CoTracker3 + Finetune (ours) | 75.1 | 90.5 | 62.7 | 2.41 | 1.58 |
| Ours + Post-processing | 75.8 | 90.3 | 63.1 | 1.31 | 0.89 |

Project page

Full summary + figures live here:

  • /projects/RobustPointTrackingEpipolar/


