TAPFormer: Robust Arbitrary Point Tracking via Transient Asynchronous Fusion of Frames and Events

Jiaxiong Liu¹, Zhen Tan¹, Jinpu Zhang¹, Yi Zhou², Hui Shen¹, Xieyuanli Chen¹, Dewen Hu¹

¹National University of Defense Technology, ²Hunan University
CVPR 2026

Abstract

Tracking any point (TAP) is a fundamental yet challenging task in computer vision, requiring high precision and long-term motion reasoning. Recent attempts to combine RGB frames and event streams have shown promise, yet they typically rely on synchronous or non-adaptive fusion, which leads to temporal misalignment and severe degradation when one modality fails. We introduce TAPFormer, a transformer-based framework that performs asynchronous, temporally consistent fusion of frames and events for robust, high-frequency arbitrary point tracking. Our key innovation is a Transient Asynchronous Fusion (TAF) mechanism, which explicitly models the temporal evolution between discrete frames through continuous event updates, bridging the gap between low-rate frames and high-rate events. In addition, a Cross-modal Locally Weighted Fusion (CLWF) module adaptively adjusts spatial attention according to modality reliability, yielding stable and discriminative features even under blur or low light. To evaluate our approach under realistic conditions, we construct a new real-world frame-event TAP dataset covering diverse illumination and motion conditions. Our method outperforms existing point trackers, achieving a 28.2% improvement in average pixel error within threshold. Moreover, on standard point tracking benchmarks, our tracker consistently achieves the best performance. We will release the code and dataset upon acceptance to support future research.

Main Architecture

TAPFormer overview. (a) The overall framework: frames and events are fused by the Transient Asynchronous Fusion (TAF) mechanism and the Cross-modal Locally Weighted Fusion (CLWF) module to produce high-frequency transient features, which are refined by temporal attention and decoded into multi-scale fusion features. These features, together with the initial query point positions q, are fed into a transformer-based optimization module that iteratively predicts tracking trajectories x and occlusion states v, where M denotes the number of iterations. (b) The fusion network: image and event tokens are integrated by locally weighted cross-attention to construct and update the transient representations.
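
The data flow in the caption above can be illustrated with a deliberately simplified sketch. It is not the TAPFormer implementation: the locally weighted cross-attention is replaced by a plain per-location softmax blend of the two modalities, and the scoring vectors `w_frame` and `w_event` are hypothetical stand-ins for a learned reliability head. It only shows how a running transient state, initialized from a frame, could be re-fused with each event slice that arrives before the next frame.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def reliability_weighted_fusion(frame_feat, event_feat, w_frame, w_event):
    """Blend two (H, W, C) feature maps with per-location modality weights.

    w_frame and w_event are hypothetical (C,) scoring vectors; a real model
    would learn the reliability scores instead.
    """
    # One scalar score per location and modality, shape (H, W, 2).
    scores = np.stack([frame_feat @ w_frame, event_feat @ w_event], axis=-1)
    weights = softmax(scores, axis=-1)  # sums to 1 over the two modalities
    fused = weights[..., 0:1] * frame_feat + weights[..., 1:2] * event_feat
    return fused, weights

def transient_features(frame_feat, event_slices, w_frame, w_event):
    """Return one fused feature map per event slice between two frames.

    The running state starts from the frame features and is re-fused with
    every incoming event slice, so outputs exist at the event rate.
    """
    state = frame_feat
    outputs = []
    for event_feat in event_slices:
        state, _ = reliability_weighted_fusion(state, event_feat, w_frame, w_event)
        outputs.append(state)
    return np.stack(outputs)  # (T, H, W, C)
```

With three event slices between two frames, `transient_features` returns three feature maps, one per slice, which is the sense in which the fused representation is "high-frequency" relative to the frame rate.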

Experiments

Quantitative comparison with various modality-based methods on the TAP task.

Quantitative comparison with various modality-based methods on the feature tracking task.

We evaluate our approach on two challenging real-world TAP datasets that we collected, InivTAP and DrivTAP. InivTAP covers complex, representative scenarios, while DrivTAP contains real driving sequences captured in both daytime and nighttime conditions. We further validate our method on the widely used feature point tracking datasets EDS and EC, demonstrating its effectiveness and strong generalization in real-world scenarios.

Qualitative Results

Tracking Any Point Task Result

Feature Tracking Task Result

Additional Experimental Analysis

We analyze how the input frame rate affects tracking performance for our method and CoTracker3 under slow, normal, and fast motion settings. This experiment highlights the adaptability of our transient asynchronous fusion mechanism to low-frame-rate and high-speed conditions, demonstrating its potential for real-world deployment where frame capture rates are constrained.
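
One way such a frame-rate study could be set up is sketched below. The exact protocol used in the experiment is not described here, so this is an assumption: frames are decimated by an integer factor to emulate a slower camera, and the sorted event timestamps are binned into the resulting inter-frame intervals. The helper names `decimate_frames` and `events_per_interval` are hypothetical.

```python
import numpy as np

def decimate_frames(frame_ts, factor):
    """Keep every `factor`-th frame timestamp to emulate a lower frame rate."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    return frame_ts[::factor]

def events_per_interval(event_ts, frame_ts):
    """Count events in each half-open interval [frame_ts[i], frame_ts[i + 1]).

    event_ts must be sorted in ascending order.
    """
    # Index of the first event at or after each frame timestamp.
    idx = np.searchsorted(event_ts, frame_ts, side="left")
    return np.diff(idx)

# Synthetic timestamps in microseconds: 10 Hz frames, 1 kHz events.
frame_ts = np.arange(0, 1_000_000, 100_000)
event_ts = np.arange(0, 900_000, 1_000)
low_rate_ts = decimate_frames(frame_ts, 3)
counts = events_per_interval(event_ts, low_rate_ts)
```

Lowering the frame rate lengthens each interval, so more events must carry the motion information between consecutive frames. This is the regime the experiment above probes.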

We evaluate temporal robustness and discriminability by comparing point features from frame-only, event-only, and fused models. Features sampled along ground-truth trajectories are projected to 2D via PCA. The fused model produces tighter clusters for the same point and clearer separation between different points over time, demonstrating superior temporal coherence and embedding quality compared to single-modality methods.
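
The projection step described above can be reproduced with a few lines of NumPy. The sketch below assumes the features have already been sampled along ground-truth trajectories into an (N, C) array with one integer point ID per row. The `cluster_stats` helper and its two summary numbers are illustrative additions, not metrics reported in the paper.

```python
import numpy as np

def pca_2d(features):
    """Project (N, C) features onto their top two principal components."""
    centered = features - features.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

def cluster_stats(proj, point_ids):
    """Return (mean intra-point spread, mean distance between point centroids).

    Expects at least two distinct point IDs.
    """
    ids = np.unique(point_ids)
    centroids = np.stack([proj[point_ids == i].mean(axis=0) for i in ids])
    # Average distance of each sample to the centroid of its own point.
    spread = np.mean([
        np.linalg.norm(proj[point_ids == i] - c, axis=1).mean()
        for i, c in zip(ids, centroids)
    ])
    # Average pairwise distance between centroids of different points.
    diffs = centroids[:, None, :] - centroids[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    separation = dists[np.triu_indices(len(ids), k=1)].mean()
    return float(spread), float(separation)
```

A small spread combined with a large separation corresponds to the "tighter clusters" and "clearer separation" seen in the plot, so comparing the pair across frame-only, event-only, and fused features gives a rough numeric counterpart to the visual comparison.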

BibTeX

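@inproceedings{liu2026tapformer,
  title={TAPFormer: Robust Arbitrary Point Tracking via Transient Asynchronous Fusion of Frames and Events},
  author={Liu, Jiaxiong and Tan, Zhen and Zhang, Jinpu and Zhou, Yi and Shen, Hui and Chen, Xieyuanli and Hu, Dewen},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}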