Temporal-Aware Encoder for Egocentric Video Retrieval
Date:
My final-year project at the University of Bristol, supervised by Shawn Shen. The goal is to improve video-text retrieval on egocentric videos by adding temporal modeling to frozen vision-language backbones.
- Identified that CLIP/PE text encoders assign cosine similarity above 0.97 to temporal opposites such as “open fridge” vs. “close fridge”, and that mean pooling over frames discards temporal order (see the first sketch after this list).
- Designed and compared four temporal adapter architectures (MLP, Transformer, Conv1D, ST-Adapter) on frozen Perception Encoder features; a minimal adapter sketch follows below.
- Best model (ST-Adapter with middle fusion) improved the zero-shot baseline's V2T R@1 on EPIC-Kitchens-100 roughly 2.5× (1.09% → 2.78%).
- Ran a verb-aware hard-negative ablation; the negative result showed that the bottleneck lies in the frozen text encoder, not the training loss (loss sketch below).
- Evaluated on EPIC-Kitchens-100 and Ego4D egocentric video benchmarks.
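A minimal sketch of the text-encoder diagnosis, assuming the `open_clip` library and OpenAI's ViT-B-32 checkpoint (the project used CLIP/PE encoders; the exact model name here is illustrative):

```python
import torch
import open_clip

# Load a CLIP text encoder (illustrative checkpoint; the frozen PE
# encoder used in the project shows the same behaviour).
model, _, _ = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

with torch.no_grad():
    tokens = tokenizer(["open fridge", "close fridge"])
    emb = model.encode_text(tokens)
    emb = emb / emb.norm(dim=-1, keepdim=True)          # unit-normalise
    print(f"cosine similarity: {(emb[0] @ emb[1]).item():.3f}")  # ~0.97

# Mean pooling over frames is order-invariant, so it cannot distinguish
# a clip from the same clip played in reverse:
frames = torch.randn(16, 512)  # stand-in for 16 per-frame features
assert torch.allclose(frames.mean(0), frames.flip(0).mean(0))
```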
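A sketch of the adapter idea in PyTorch, following the ST-Adapter bottleneck pattern (down-project, depthwise temporal convolution, up-project, residual) over frozen per-frame features; all dimensions and hyperparameters here are illustrative, not the project's actual configuration:

```python
import torch
import torch.nn as nn

class TemporalAdapter(nn.Module):
    def __init__(self, dim: int = 1024, bottleneck: int = 256, kernel: int = 3):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        # Depthwise conv over the time axis mixes information across
        # frames, which mean pooling alone cannot do.
        self.temporal = nn.Conv1d(
            bottleneck, bottleneck, kernel, padding=kernel // 2, groups=bottleneck
        )
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D) frozen frame features
        h = self.down(x)                                    # (B, T, bottleneck)
        h = self.temporal(h.transpose(1, 2)).transpose(1, 2)
        h = self.up(self.act(h))
        return x + h  # residual keeps the frozen features intact

frames = torch.randn(2, 16, 1024)                  # e.g. 16 frames of PE features
video_emb = TemporalAdapter()(frames).mean(dim=1)  # pool only after temporal mixing
```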
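And a sketch of the verb-aware hard-negative loss that was ablated (and found not to help): each caption's verb is swapped for its antonym, and the resulting text embedding is appended as an extra negative in InfoNCE. The antonym table and function names are hypothetical:

```python
import torch
import torch.nn.functional as F

# Hypothetical antonym table for illustration.
VERB_ANTONYMS = {"open": "close", "close": "open", "put": "take", "take": "put"}

def swap_verb(caption: str) -> str:
    """Build a hard-negative caption by flipping the verb, e.g.
    'open fridge' -> 'close fridge'."""
    return " ".join(VERB_ANTONYMS.get(w, w) for w in caption.split())

def info_nce_with_hard_negatives(video_emb, text_emb, hard_text_emb, tau=0.07):
    """All inputs are (B, D) and unit-normalised; hard_text_emb encodes
    the verb-swapped captions."""
    logits = video_emb @ text_emb.t() / tau                           # (B, B) in-batch negatives
    hard = (video_emb * hard_text_emb).sum(-1, keepdim=True) / tau    # (B, 1) verb-swapped negatives
    logits = torch.cat([logits, hard], dim=1)                         # (B, B + 1)
    targets = torch.arange(video_emb.size(0), device=video_emb.device)
    return F.cross_entropy(logits, targets)
```

Because the frozen text encoder maps a caption and its verb-swapped negative to nearly identical embeddings, the extra negatives carry almost no gradient signal, which is consistent with the negative ablation result above.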
