FunQA: Towards Surprising Video Comprehension

Beijing University of Posts and Telecommunications
S-Lab, Nanyang Technological University
Allen Institute for Artificial Intelligence

*Indicates Equal Contribution

FunQA (overall): 4.3K video clips, 312K free-text QA pairs, 34.2 average answer word length


HumorQA: 1,769 video clips, 142.1K free-text QA pairs, 28.2 average answer word length


CreativeQA: 927 video clips, 78.7K free-text QA pairs, 59.1 average answer word length


MagicQA: 1,672 video clips, 92.1K free-text QA pairs, 22.7 average answer word length

FunQA benchmarks funny, creative, and magic videos on challenging tasks, including
counter-intuitive timestamp localization, detailed video description, reasoning, and beyond.
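
As a quick sanity check of how the headline numbers relate to the three subsets, the short Python sketch below aggregates the per-subset figures listed above. The subset statistics come from this page; the aggregation and the rounding to 4.3K clips and 312K QA pairs are only an illustration.

```python
# Per-subset statistics as listed on this page.
subsets = {
    "HumorQA":    {"clips": 1_769, "qa_pairs": 142_100, "avg_answer_words": 28.2},
    "CreativeQA": {"clips":   927, "qa_pairs":  78_700, "avg_answer_words": 59.1},
    "MagicQA":    {"clips": 1_672, "qa_pairs":  92_100, "avg_answer_words": 22.7},
}

total_clips = sum(s["clips"] for s in subsets.values())    # 4,368   -> reported as 4.3K
total_qa = sum(s["qa_pairs"] for s in subsets.values())    # 312,900 -> reported as 312K

# QA-pair-weighted mean answer length across subsets.
avg_words = sum(s["qa_pairs"] * s["avg_answer_words"] for s in subsets.values()) / total_qa

print(f"{total_clips} clips, {total_qa} QA pairs, {avg_words:.1f} avg answer words")
# -> 4368 clips, 312900 QA pairs, 34.4 avg answer words
#    (close to the reported overall 34.2; the per-subset averages are rounded)
```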

Abstract

Surprising videos, e.g., funny clips, creative performances, or visual illusions, attract significant attention. Enjoyment of these videos is not simply a response to visual stimuli; rather, it hinges on the human capacity to understand (and appreciate) the commonsense violations they depict. We introduce FunQA, a challenging video question answering (QA) dataset specifically designed to evaluate and enhance the depth of video reasoning based on counter-intuitive and fun videos. Unlike most video QA benchmarks, which focus on less surprising contexts, e.g., cooking or instructional videos, FunQA covers three previously unexplored types of surprising videos: 1) HumorQA, 2) CreativeQA, and 3) MagicQA. For each subset, we establish rigorous QA tasks designed to assess the model's capability in counter-intuitive timestamp localization, detailed video description, and reasoning around counter-intuitiveness. We also pose higher-level tasks, such as assigning a fitting and vivid title to the video and scoring the video's creativity. In total, the FunQA benchmark consists of 312K free-text QA pairs derived from 4.3K video clips, spanning 24 hours of video. Moreover, we propose FunMentor, an agent designed for Vision-Language Models (VLMs) that uses multi-turn dialogues to enhance models' understanding of counter-intuitiveness. Extensive experiments with existing VLMs demonstrate the effectiveness of FunMentor and reveal significant performance gaps on FunQA videos across spatial-temporal reasoning, visual-centered reasoning, and free-text generation.
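
FunMentor is described above only at a high level. Purely as an illustration of what a multi-turn mentor loop of this kind could look like, here is a minimal Python sketch; the helper functions `vlm_answer` and `mentor_critique`, the acceptance test, and the round limit are all hypothetical placeholders and are not the paper's implementation.

```python
# Illustrative only: a generic multi-turn "mentor" loop in the spirit of FunMentor.
# vlm_answer() and mentor_critique() are hypothetical stand-ins, not FunQA/FunMentor APIs.

def vlm_answer(video, question, feedback=None):
    """Query a vision-language model, optionally conditioning on mentor feedback."""
    # Placeholder: replace with a real call to a video-capable VLM.
    prompt = question if feedback is None else f"{question}\n(Revise, noting: {feedback})"
    return f"[VLM answer to: {prompt!r}]"

def mentor_critique(question, answer):
    """Judge whether the answer explains the counter-intuitive moment rather than
    merely describing surface events; return (ok, feedback)."""
    # Placeholder critic: in practice this would be an LLM-based judge.
    ok = "because" in answer.lower()
    feedback = None if ok else "Explain why the moment violates commonsense, not just what happens."
    return ok, feedback

def refine_with_mentor(video, question, max_rounds=3):
    """Iteratively refine a VLM answer using mentor feedback."""
    answer, feedback = None, None
    for _ in range(max_rounds):
        answer = vlm_answer(video, question, feedback)
        ok, feedback = mentor_critique(question, answer)
        if ok:
            break
    return answer
```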

Paper

BibTeX

@inproceedings{xie2025funqa,
  title={{FunQA}: Towards surprising video comprehension},
  author={Xie, Binzhu and Zhang, Sicheng and Zhou, Zitang and Li, Bo and Zhang, Yuanhan and Hessel, Jack and Yang, Jingkang and Liu, Ziwei},
  booktitle={European Conference on Computer Vision},
  pages={39--57},
  year={2025},
  organization={Springer}
}