When is the most amusing part?
The funny moment is from frame 70 to 140.
What happened during the entertaining segment of this video?
A man tried to get some ketchup but couldn't pour it out, so he vigorously shook the ketchup bottle, unintentionally splattering the person eating a burger with ketchup.
Why is the segment funny?
This video is funny for two main reasons. First, the comedic element arises from the accidental ketchup spill that lands on the man's face. Second, the man's expression of shock, coupled with his ketchup-covered visage, elicits contagious laughter.
Please give this humorous short video a vivid and appropriate title.
Miracle Tomato Sauce.
What's the most interesting moment of this video?
The most interesting moment is from frame 22 to 211.
Can you describe the humorous segment of this video?
There was a small toy at the door of the bathroom. A black and white cat went behind the bathroom door and started to use its paws to reach for its toy from the doorway.
Why is this video interesting and engaging?
The cat could have taken its toy directly to play with, but instead it went behind the door and tried to reach for the toy from the doorway. As its paws are too short to reach, this behavior results in a cute and funny scene.
Write a suitable title for this funny video.
Legs are never longer than enough.
When is the creative segment of the video?
The creative moment is from frame 2 to 53.
What happened during the creative segment?
In the middle of the stage, a man sat on a chair playing the guitar. A little girl wearing a piano outfit turned around. Then she took off her outermost clothes and turned herself upside down. The man lifted her onto his shoulders, holding the girl's legs with one hand and a small wooden stick with the other.
Why do you think the segment is creative?
The creative point of this video is that the little girl constantly changes outfits and body postures to mimic a violin, while the man mimes playing the violin.
Please give this creative video a title to best represent its shine point. Also, score the creativity of the video on a scale of 0 to 20.
The title is: the 'startled' concert. And I will score this video 19 for creativity.
During what period did the magic take place?
The magic moment is from frame 71 to 119.
Can you describe the magical segment of this video?
The man reached out his hand, took one of the clouds from the sky, and put it into his mouth.
How can you tell that the video showcases magic?
What violates common sense is that a person standing on the ground cannot touch the clouds, so the man could not actually have taken one down.
What method do you think the magician used to accomplish the magic trick?
The magician used video editing technology to combine a video of reaching for the cloud with a video of eating a marshmallow to create the magic effect.
Surprising videos, e.g., funny clips, creative performances, or visual illusions, attract significant attention. Enjoyment of these videos is not simply a response to visual stimuli; rather, it hinges on the human capacity to understand (and appreciate) commonsense violations depicted in these videos. We introduce FunQA, a challenging video question answering (QA) dataset specifically designed to evaluate and enhance the depth of video reasoning based on counter-intuitive and fun videos. Unlike most video QA benchmarks which focus on less surprising contexts, e.g., cooking or instructional videos, FunQA covers three previously unexplored types of surprising videos: 1) HumorQA, 2) CreativeQA, and 3) MagicQA. For each subset, we establish rigorous QA tasks designed to assess the model’s capability in counter-intuitive timestamp localization, detailed video description, and reasoning around counterintuitiveness. We also pose higher-level tasks, such as attributing a fitting and vivid title to the video, and scoring the video creativity. In total, the FunQA benchmark consists of 312K free-text QA pairs derived from 4.3K video clips, spanning a total of 24 video hours. Moreover, we propose FunMentor, an agent designed for Vision-Language Models (VLMs) that uses multi-turn dialogues to enhance models’ understanding of counterintuitiveness. Extensive experiments with existing VLMs demonstrate the effectiveness of FunMentor and reveal significant performance gaps for the FunQA videos across spatial-temporal reasoning, visual-centered reasoning, and free-text generation.
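For readers who want to experiment with free-text QA pairs like the examples above, the sketch below shows one way such records could be represented and loaded in Python. The schema (field names, subset labels, the `load_records` helper) is a hypothetical illustration only and is not the official FunQA annotation format; consult the dataset release for the real layout.

```python
# Minimal sketch of a FunQA-style free-text QA record and loader.
# NOTE: the JSON schema below (field names, values) is an assumed,
# illustrative format, not the official FunQA annotation schema.
import json
from dataclasses import dataclass


@dataclass
class FunQARecord:
    video_id: str   # clip identifier (hypothetical field)
    subset: str     # "HumorQA", "CreativeQA", or "MagicQA"
    task: str       # e.g. localization / description / reasoning / title / score
    question: str
    answer: str     # free-text answer; localization answers give frame spans


# Example records mirroring the HumorQA dialogue shown above.
EXAMPLE_JSON = """
[
  {"video_id": "humor_0001", "subset": "HumorQA", "task": "localization",
   "question": "When is the most amusing part?",
   "answer": "The funny moment is from frame 70 to 140."},
  {"video_id": "humor_0001", "subset": "HumorQA", "task": "title",
   "question": "Please give this humorous short video a vivid and appropriate title.",
   "answer": "Miracle Tomato Sauce."}
]
"""


def load_records(raw: str) -> list[FunQARecord]:
    """Parse a JSON list of QA dicts into typed records."""
    return [FunQARecord(**item) for item in json.loads(raw)]


if __name__ == "__main__":
    for rec in load_records(EXAMPLE_JSON):
        print(f"[{rec.subset}/{rec.task}] Q: {rec.question}\n  A: {rec.answer}")
```

Because all FunQA answers are free text, a flat record like this keeps localization, description, reasoning, titling, and scoring tasks in a single uniform structure; only the `task` field distinguishes them.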
@inproceedings{xie2025funqa,
title={{FunQA}: Towards surprising video comprehension},
author={Xie, Binzhu and Zhang, Sicheng and Zhou, Zitang and Li, Bo and Zhang, Yuanhan and Hessel, Jack and Yang, Jingkang and Liu, Ziwei},
booktitle={European Conference on Computer Vision},
pages={39--57},
year={2025},
organization={Springer}
}