Results
The leaderboard reports overall Accuracy, plus separate accuracy for the Detective and Reporter tasks, each of which covers multiple-choice (MCQ) and yes/no (YN) question variants. For the human baseline, please refer to our paper.
Legend: B = Baseline, P = Participant
| Rank | Team | Accuracy (%) | Detective Accuracy (%) | Reporter Accuracy (%) |
|---|---|---|---|---|
| 1 | cola_lover (v1) (Southeast University, Opus AI Research) P | 75.94 | 72.33 | 81.75 |
| 2 | Boat (v1) (Jiangnan University) P | 72.88 | 67.78 | 81.11 |
| 3 | UBC-ViL (Baseline-GPT4o) B | 69.06 | 63.18 | 78.53 |
| 4 | IMG_AI | 64.17 | 56.86 | 75.96 |
| 5 | UBC-ViL (Baseline-Gemini) B | 62.20 | 57.09 | 70.60 |
| 6 | ASU_Computer_Vision (v2) P | 62.06 | 56.22 | 71.47 |
| 7 | UBC-ViL (Baseline-LlavaVideo) B | 60.63 | 54.55 | 70.44 |
| 8 | casia-base | 59.94 | 52.07 | 72.62 |
| 9 | UBC-ViL (Baseline-VideoLlama2-7B) B | 53.15 | 53.27 | 52.96 |
| 10 | UBC-ViL (Baseline-VILA-7B) B | 50.49 | 49.44 | 52.19 |
| 11 | longAI | 39.27 | 37.44 | 45.14 |
| 12 | UBC-ViL (Baseline-VideoChat2) B | 36.66 | 28.55 | 49.74 |
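
As a rough illustration of how scores like those in the table could be computed, the sketch below derives per-task and overall accuracy from a list of graded answers. It is not the official evaluation script: the `(task, is_correct)` record layout and the function name `accuracy_report` are hypothetical, and it assumes the overall score pools all questions across both tasks. That assumption fits the fact that every overall value in the table lies between the two per-task values, but the actual weighting is not stated here.

```python
from collections import defaultdict


def accuracy_report(records):
    """Return per-task and overall accuracy, in percent, for graded answers.

    `records` is an iterable of (task, is_correct) pairs. This layout is
    hypothetical and is not the benchmark's official submission format.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for task, is_correct in records:
        totals[task] += 1
        correct[task] += int(is_correct)

    n_all = sum(totals.values())
    if n_all == 0:
        raise ValueError("no records to score")

    # Per-task accuracy: correct answers divided by questions in that task.
    report = {
        task: round(100.0 * correct[task] / totals[task], 2) for task in totals
    }
    # Assumption: overall accuracy pools every question across tasks,
    # so tasks with more questions carry more weight.
    report["overall"] = round(100.0 * sum(correct.values()) / n_all, 2)
    return report


# Toy data only; these are not real benchmark results.
example = [
    ("Detective", True), ("Detective", False), ("Detective", True),
    ("Reporter", True), ("Reporter", True),
]
print(accuracy_report(example))
```

With the toy data, Detective scores 2 of 3, Reporter 2 of 2, and the pooled overall score is 4 of 5. Under pooling, a task with more questions pulls the overall figure toward its own accuracy.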