Building Scalable Video Understanding Benchmarks
through Sports





We propose an automated Annotation and Video Stream Alignment Pipeline, which we dub ASAP, that aligns video footage of sports matches with online commentary and scorecard information. We target sports specifically because:

  1. Video understanding is much easier to define for sports than for arbitrary videos.
  2. Sports have a wide variety of possible events and actions that occur frequently.
  3. There is an abundance of rich natural language online commentary.
The full annotation pipeline is summarized below:

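The alignment step at the heart of ASAP can be sketched as follows. This is a minimal illustration only: the `CommentaryEntry` schema and the upstream per-ball segmentation (`ball_timestamps`) are assumptions made for the example, not the pipeline's actual interface.

```python
from dataclasses import dataclass

@dataclass
class CommentaryEntry:
    """One ball's worth of scraped online commentary (hypothetical schema)."""
    ball_index: int   # sequential delivery number within the match
    text: str         # natural-language commentary for this ball
    runs: int         # runs scored off this ball
    is_wide: bool     # whether the ball was a wide
    is_wicket: bool   # whether a wicket fell on this ball

def align_commentary_to_video(entries, ball_timestamps):
    """Pair each commentary entry with the video segment of its ball.

    `ball_timestamps` is a list of (start_sec, end_sec) segments, one per
    delivery, assumed to come from an upstream ball-detection step.
    """
    assert len(entries) == len(ball_timestamps)
    # A one-to-one zip suffices once both streams are ordered by ball index.
    return [
        {"segment": seg, "label": entry}
        for entry, seg in zip(entries, ball_timestamps)
    ]
```

The resulting list of (segment, label) pairs is the kind of weakly supervised annotation the pipeline produces at scale, without any manual labeling per ball.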


We use ASAP to construct a benchmark for long-horizon video understanding, which we call LCric, by annotating matches from the sport of Cricket.

We chose Cricket because its relatively small action space makes reasoning over it simpler than in most long-video understanding problems. This makes it feasible as a benchmark for video understanding models, which currently struggle to reason over longer time spans. Specifically, our dataset is built on top of twelve distinct events that occur in Cricket:

  • The number of runs (points) scored off a ball, which can range from 0 to 9
  • A foul ball (termed a wide)
  • An out ball (termed a wicket)
These "atomic" events are then aggregated over a long time span, so we can ask queries to a model about what happened over the course of the time span. We call this set of queries our "Query Set".

A model with long horizon video understanding capabilities must be able to localize and detect the occurrence of distinct important events, while also aggregating such information to form a conclusion. In our dataset, we also create automated compositional queries that individually test the detection and aggregation skills of a model across a long horizon task.

We automatically compose several types of queries to test the model's reasoning abilities. Specifically,

  • binary queries ask a yes/no question about the occurrence of some set of events. An example of such a query is shown above.
  • multi-choice queries expand on the binary queries by asking the model to use context from the video to select an answer from a set of possible answers.
  • regression queries ask the model to determine the number of runs that occurred over a given time horizon.
Details of how queries are composed can be found in the supplementary sections.
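The three query types above can be composed automatically from the per-ball event labels. The sketch below is illustrative only: the question templates, the dictionary event encoding, and the distractor-generation scheme for multi-choice queries are assumptions for the example, not the paper's actual composition procedure.

```python
import random

def make_binary_query(events, start, end):
    """Yes/no query: does at least one wicket fall in the span?"""
    answer = any(e["is_wicket"] for e in events[start:end])
    question = f"Does a wicket fall between balls {start} and {end}?"
    return question, ("yes" if answer else "no")

def make_regression_query(events, start, end):
    """Regression query: total runs scored over the span."""
    total = sum(e["runs"] for e in events[start:end])
    question = f"How many runs are scored between balls {start} and {end}?"
    return question, total

def make_multichoice_query(events, start, end, rng=random):
    """Multi-choice query: the correct run total among random distractors."""
    question, total = make_regression_query(events, start, end)
    choices = {total}
    # Sample nearby run totals as distractors until we have four options.
    while len(choices) < 4:
        choices.add(max(0, total + rng.randint(-5, 5)))
    options = sorted(choices)
    return question, options, options.index(total)
```

Because every query is generated programmatically from the aligned labels, the query set scales with the number of annotated matches at no extra labeling cost.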

We evaluate two recent baselines, TQN and MeMViT, on LCric and its compositional query set. We also ran a set of Amazon Mechanical Turk experiments to obtain a human baseline for our dataset; humans greatly outperform both models, suggesting substantial room for improvement for state-of-the-art video understanding models on the LCric benchmark and on long-horizon video understanding in general.


The website template was borrowed from Michaël Gharbi.