The model integrates transformer architectures with graph neural networks to reason about video content: it identifies the relevant visual evidence and highlights the spatial regions that support the answer. Exposing this grounding improves the transparency and interpretability of video AI systems.
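The grounding idea can be sketched in miniature. The following is an illustrative toy, not the paper's method: region features, the graph edges, the single mean-aggregation message-passing step, and the dot-product attention scoring (`message_pass`, `ground_answer`, and all the example vectors) are all hypothetical stand-ins for the real learned components, chosen only to show how answer-relevant spatial regions could be highlighted via attention over a region graph.

```python
import math

def dot(a, b):
    # dot product of two feature vectors
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    # convert raw relevance scores into attention weights that sum to 1
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def message_pass(node_feats, edges):
    # one GNN-style round: each region node averages its own feature
    # with those of its graph neighbors
    out = []
    for i, feat in enumerate(node_feats):
        neigh = [node_feats[j] for a, j in edges if a == i]
        agg = feat[:]
        for n in neigh:
            agg = [x + y for x, y in zip(agg, n)]
        k = 1 + len(neigh)
        out.append([x / k for x in agg])
    return out

def ground_answer(question_vec, node_feats, edges):
    # score each updated region node against the question embedding;
    # the highest attention weight marks the "grounded" spatial region
    updated = message_pass(node_feats, edges)
    weights = softmax([dot(question_vec, f) for f in updated])
    best = max(range(len(weights)), key=weights.__getitem__)
    return weights, best

# toy example: three region nodes, one bidirectional edge
regions = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
edges = [(0, 1), (1, 0)]
question = [1.0, 0.0]
weights, best = ground_answer(question, regions, edges)
```

The returned attention weights make the model's evidence inspectable: rather than only emitting an answer, the system can report which region node received the most weight.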