Visualize VLM evaluations across datasets
Visualize model outputs for the AITW benchmark
Visualize benchmark samples with media and details