Temporal-Aware Encoder for Egocentric Video Retrieval

Date:

Python PyTorch

My final year project at the University of Bristol, supervised by Shawn Shen. The goal is to improve video-text retrieval on egocentric videos by adding temporal modeling to frozen vision-language backbones.

  • Identified that CLIP/PE text encoders give 0.97+ cosine similarity between temporal opposites like “open fridge” vs “close fridge”, and mean pooling over frames destroys temporal order.
  • Designed and compared four temporal adapter architectures (MLP, Transformer, Conv1D, ST-Adapter) on frozen Perception Encoder features.
  • Best model (ST-Adapter with middle fusion) tripled zero-shot baseline R@1 on EPIC-Kitchens-100 (V2T R@1: 1.09% → 2.78%).
  • Ran verb-aware hard negative ablation — negative result showing the bottleneck is in the frozen text encoder, not the training loss.
  • Evaluated on EPIC-Kitchens-100 and Ego4D egocentric video benchmarks.