Building Scalable Video Understanding Benchmarks through Sports
Aniket Agarwal* (IIT Roorkee)
Alex Zhang* (Princeton University)
Karthik Narasimhan (Princeton University)
Igor Gilitschenski (University of Toronto)
Vishvak Murahari^ (Princeton University)
Yash Kant^ (University of Toronto)
*Denotes Equal Contribution
^Equal Advising
tl;dr
ASAP
We propose ASAP, an automated Annotation and Video Stream Alignment Pipeline that aligns video footage of sports matches with online commentary and scorecard information (a rough sketch of the alignment step follows the list below). We specifically target sports because:
- Video understanding over sports is much easier to define than over arbitrary videos.
- Sports have a wide variety of possible events and actions that occur frequently.
- There is an abundance of rich natural language online commentary.
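As a rough illustration of the alignment idea (not the actual ASAP implementation), the sketch below pairs time-stamped ball-by-ball commentary entries with fixed-length windows of the broadcast video. The field names and the helper function are hypothetical, and a real pipeline would need to recover timing from the broadcast itself (e.g., from the on-screen scoreboard) rather than assume timestamps are given.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CommentaryEntry:
    # Hypothetical fields; a real commentary/scorecard feed may differ.
    ball_id: str        # e.g. "12.4" (over.ball)
    text: str           # natural-language commentary for this delivery
    timestamp_s: float  # seconds from the start of the broadcast


def align_commentary_to_video(entries: List[CommentaryEntry],
                              clip_len_s: float = 30.0) -> List[Tuple[str, float, float, str]]:
    """Map each commentary entry to a (start, end) window of the video."""
    segments = []
    for e in sorted(entries, key=lambda e: e.timestamp_s):
        start = max(0.0, e.timestamp_s - clip_len_s / 2)
        end = e.timestamp_s + clip_len_s / 2
        segments.append((e.ball_id, start, end, e.text))
    return segments
```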
LCric
We use ASAP to construct a benchmark for long-horizon video understanding, which we call LCric, by annotating matches from the sport of Cricket. We chose Cricket because it has a relatively small action space, which makes reasoning over it simpler than in most long-video understanding problems. This also makes it feasible as a benchmark for video understanding models, which currently struggle to reason over longer time spans.
Specifically, our dataset is built on top of twelve distinct events that occur in Cricket (see the label-space sketch after this list):
- The number of runs (points) scored by a team off a ball, which can range from 0 to 9
- A foul ball (termed a wide)
- A batter getting out (termed a wicket)
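For concreteness, one possible encoding of this twelve-way label space is sketched below; the label names are hypothetical and not the dataset's actual annotation format.

```python
# Hypothetical per-ball label space covering the twelve events above:
# ten run counts (0-9), a wide, and a wicket.
RUN_LABELS = [f"run_{n}" for n in range(10)]     # runs scored off a single ball
EVENT_LABELS = RUN_LABELS + ["wide", "wicket"]   # 12 classes in total
LABEL_TO_INDEX = {label: i for i, label in enumerate(EVENT_LABELS)}

assert len(EVENT_LABELS) == 12
```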
A model with long-horizon video understanding capabilities must be able to localize and detect the occurrence of distinct important events, while also aggregating that information to form a conclusion. Our dataset therefore includes automatically composed compositional queries that individually test a model's detection and aggregation skills across a long-horizon task.
We automatically compose several types of queries to test the model's reasoning abilities (a rough sketch of how such queries can be generated follows the list below). Specifically:
- binary queries are yes/no questions about the occurrence of some set of events. An example of such a query is shown above.
- multi-choice queries expand on the binary queries by asking the model to use context from the video to select an answer from a set of possible answers.
- regression queries ask the model to determine the number of runs that occurred over a given time horizon.
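To illustrate how such queries can be composed automatically from per-ball labels, here is a minimal sketch built on the hypothetical label space above. The question templates, window convention, and distractor strategy are assumptions for illustration, not the benchmark's exact query generator.

```python
import random


def binary_query(ball_labels, start, end):
    """Yes/no: did a wicket fall anywhere in the window [start, end)?"""
    question = f"Was a wicket taken between ball {start} and ball {end}?"
    return question, any(lbl == "wicket" for lbl in ball_labels[start:end])


def regression_query(ball_labels, start, end):
    """How many runs were scored over the window [start, end)?"""
    runs = sum(int(lbl.split("_")[1]) for lbl in ball_labels[start:end]
               if lbl.startswith("run_"))
    question = f"How many runs were scored between ball {start} and ball {end}?"
    return question, runs


def multichoice_query(ball_labels, start, end, num_options=4):
    """Multiple choice built from the regression answer plus random distractors."""
    question, runs = regression_query(ball_labels, start, end)
    options = {runs}
    while len(options) < num_options:
        options.add(max(0, runs + random.randint(-5, 5)))
    return question, sorted(options), runs
```

Because all three query types are derived from the same per-ball labels, the same video window can be probed at different levels of difficulty, separating event detection from aggregation.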
We evaluate two recent baselines, TQN and MeMViT, on LCric and its compositional query set. We also ran a set of Amazon Mechanical Turk experiments to obtain a human baseline for our dataset; humans greatly outperform both models, suggesting substantial room for improvement for state-of-the-art video understanding models on the LCric benchmark and on long-horizon video understanding in general.
Citation
The website template was borrowed from Michaël Gharbi.