ST-VLM: Kinematic Instruction Tuning for Spatio-Temporal Reasoning in Vision-Language Models

Dohwan Ko1*, Sihyeon Kim1*, Yumin Suh2, Vijay Kumar2,
Minseo Yoon1, Manmohan Chandraker2, 3, Hyunwoo J. Kim4

1Korea University   2NEC Labs America   3University of California, San Diego   4KAIST

Abstract

Spatio-temporal reasoning is essential for understanding real-world environments in various fields, e.g., autonomous driving and sports analytics. Recent advances have improved the spatial reasoning ability of Vision-Language Models (VLMs) by introducing large-scale data, but these models still struggle to analyze kinematic elements such as the traveled distance and speed of moving objects. To bridge this gap, we construct a spatio-temporal reasoning dataset and benchmark involving kinematic instruction tuning, referred to as STKit and STKit-Bench. They consist of real-world videos with 3D annotations detailing object motion dynamics: traveled distance, speed, movement direction, inter-object distance comparisons, and relative movement direction. To further scale such data construction to videos without 3D labels, we propose an automatic pipeline that generates pseudo-labels using 4D reconstruction at real-world scale. With our kinematic instruction tuning data for spatio-temporal reasoning, we present ST-VLM, a VLM enhanced for spatio-temporal reasoning, which exhibits outstanding performance on STKit-Bench. Furthermore, we show that ST-VLM generalizes robustly across diverse domains and tasks, outperforming baselines on other spatio-temporal benchmarks (e.g., ActivityNet, TVQA+). Finally, by integrating learned spatio-temporal reasoning with existing abilities, ST-VLM enables complex multi-step reasoning.
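To make the kinematic quantities concrete, the sketch below shows how traveled distance, average speed, and net movement direction might be derived from a timestamped 3D trajectory of an object. This is an illustrative assumption, not the paper's actual pseudo-labeling pipeline; the function name `kinematics` and the (t, x, y, z) input format are hypothetical.

```python
import math

def kinematics(traj):
    """Derive basic kinematic labels from a 3D trajectory.

    traj: list of (t, x, y, z) samples, time in seconds, positions
    in meters (real-world scale, as produced by 4D reconstruction).
    Returns (distance_m, avg_speed_mps, heading_deg), where heading
    is the angle of the net ground-plane displacement from +x.
    """
    # Traveled distance: sum of Euclidean steps between consecutive samples.
    dist = 0.0
    for (t0, x0, y0, z0), (t1, x1, y1, z1) in zip(traj, traj[1:]):
        dist += math.dist((x0, y0, z0), (x1, y1, z1))

    # Average speed over the whole clip.
    elapsed = traj[-1][0] - traj[0][0]
    speed = dist / elapsed if elapsed > 0 else 0.0

    # Net movement direction on the ground plane (x, y).
    dx = traj[-1][1] - traj[0][1]
    dy = traj[-1][2] - traj[0][2]
    heading = math.degrees(math.atan2(dy, dx))
    return dist, speed, heading

# A car moving 10 m along +x over 2 s:
# kinematics([(0.0, 0, 0, 0), (1.0, 5, 0, 0), (2.0, 10, 0, 0)])
# -> (10.0, 5.0, 0.0)
```

Inter-object distance comparisons and relative movement direction would follow similarly by comparing such quantities across two trajectories.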


Illustration of ST-VLM Pseudo-labeling Pipeline.


Task examples from the proposed STKit-Bench along with predictions from ST-VLM.


Spatio-temporal reasoning in dynamic videos of moving objects. (left) A challenging case with a complex trajectory. (right) An emerging capability of ST-VLM.


Qualitative results on emerging capabilities of ST-VLM with multi-step reasoning questions.