BlackSwan Challenge Leaderboard

MCQ Test Phase — MCQ Test Split

Results

The leaderboard shows overall Accuracy, as well as separate accuracy for the MCQ and YN variants of the Detective and Reporter tasks. For the human baseline, please refer to our paper.

Legend: B = Baseline, P = Participant
Rank  Team                                                      Type  Accuracy  Detective_Accuracy  Reporter_Accuracy
1     cola_lover (v1) (Southeast University, Opus AI Research)  P     75.94     72.33               81.75
2     Boat (v1) (Jiangnan University)                           P     72.88     67.78               81.11
3     UBC-ViL (Baseline-GPT4o)                                  B     69.06     63.18               78.53
4     IMG_AI                                                    -     64.17     56.86               75.96
5     UBC-ViL (Baseline-Gemini)                                 B     62.20     57.09               70.60
6     ASU_Computer_Vision (v2)                                  P     62.06     56.22               71.47
7     UBC-ViL (Baseline-LlavaVideo)                             B     60.63     54.55               70.44
8     casia-base                                                -     59.94     52.07               72.62
9     UBC-ViL (Baseline-VideoLlama2-7B)                         B     53.15     53.27               52.96
10    UBC-ViL (Baseline-VILA-7B)                                B     50.49     49.44               52.19
11    longAI                                                    -     39.27     37.44               45.14
12    UBC-ViL (Baseline-VideoChat2)                             B     36.66     28.55               49.74
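
Note that the overall Accuracy is not the simple mean of the two task accuracies. The sketch below assumes (the page does not state this) that overall Accuracy is a question-count-weighted average of Detective_Accuracy and Reporter_Accuracy, and solves for the implied Detective weight using the top three rows of the table. The function name implied_detective_share is hypothetical, and the result is only approximate because the published figures are rounded to two decimals.

```python
def implied_detective_share(overall, detective, reporter):
    """Solve overall = w * detective + (1 - w) * reporter for w.

    Assumption: overall accuracy is a weighted average of the two
    task accuracies, with w the fraction of Detective questions.
    """
    if detective == reporter:
        raise ValueError("share is undetermined when both task accuracies are equal")
    return (reporter - overall) / (reporter - detective)


# (Accuracy, Detective_Accuracy, Reporter_Accuracy) copied from the leaderboard.
rows = {
    "cola_lover (v1)": (75.94, 72.33, 81.75),
    "Boat (v1)": (72.88, 67.78, 81.11),
    "UBC-ViL (Baseline-GPT4o)": (69.06, 63.18, 78.53),
}

for team, (overall, detective, reporter) in rows.items():
    share = implied_detective_share(overall, detective, reporter)
    print(f"{team}: implied Detective share = {share:.3f}")
```

Under this assumption, all three rows give a Detective share of roughly 0.62, which would be consistent with the Detective task contributing about three fifths of the test questions. Treat this as a sanity check on the table, not as the official scoring rule.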