About the Dataset
BlackSwan is a benchmark for evaluating VLMs' ability to reason about unexpected events through abductive and defeasible reasoning tasks. Our tasks either artificially limit the visual information provided to models while questioning them about hidden unexpected events, or provide new visual evidence that could change an existing hypothesis.
We release our data with two splits:
- Validation Split: Ground truth labels are accessible for model development.
- Test Split: Ground truth labels are hidden; email your predictions to the organizers for evaluation.
We encourage participants to use the validation split during development and submit a final model version for test set evaluation. The validation set contains 827 videos (50% of the data) and the test set contains the remaining 828 videos. Overall, the dataset comprises over 3,800 MCQ tasks spanning 1,655 videos. The challenge evaluates MCQ tasks only.
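Since ground-truth labels are only released for the validation split, a local MCQ accuracy score is the natural development metric. The sketch below shows one way to compute it; the task IDs, option indices, and the `score_mcq` helper are illustrative assumptions, not the organizers' official scorer or file schema.

```python
def score_mcq(predictions, ground_truth):
    """Fraction of MCQ tasks where the predicted option matches the label.

    Both arguments map a task ID to a chosen option index. Only tasks
    present in both mappings are scored. (Hypothetical schema -- check
    the released validation files for the actual format.)
    """
    common = set(predictions) & set(ground_truth)
    if not common:
        return 0.0
    correct = sum(predictions[t] == ground_truth[t] for t in common)
    return correct / len(common)

# Made-up example: 2 of 3 predictions match the validation labels.
preds = {"task_001": 2, "task_002": 0, "task_003": 1}
labels = {"task_001": 2, "task_002": 3, "task_003": 1}
print(f"validation MCQ accuracy: {score_mcq(preds, labels):.3f}")
```

For the test split, the same prediction mapping (without local scoring) would be serialized and emailed to the organizers.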