SGMRI-VQA (Spatially Grounded MRI Visual Question Answering) evaluates whether vision-language models can spatially ground their clinical findings across volumetric MRI by producing frame-indexed bounding boxes alongside chain-of-thought reasoning.
SGMRI-VQA organizes tasks hierarchically, at both the image and volume level, to mirror clinical reading. At the volume level, models process the full 3D scan as a multi-frame sequence. Each QA pair couples a clinician-aligned chain-of-thought reasoning trace with frame-indexed bounding box coordinates.
Representative examples from each task category. Each QA pair includes a question, chain-of-thought reasoning, and frame-indexed bounding box coordinates for spatial grounding.
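To make the annotation format concrete, a single record might look roughly like the sketch below; the field names and clinical content are illustrative assumptions rather than the released schema.

```python
# Illustrative sketch of one SGMRI-VQA record; field names are assumptions,
# not the released schema. Boxes are indexed by frame within the MRI volume.
example_qa_pair = {
    "volume_id": "fastmri_knee_0042",   # hypothetical identifier
    "task": "localization",             # e.g. detection / classification / localization
    "question": "Where is the meniscal tear located in this knee MRI volume?",
    "chain_of_thought": (
        "The posterior horn of the medial meniscus shows a linear hyperintensity "
        "extending to the articular surface on frames 14-16, consistent with a tear."
    ),
    "answer": "Posterior horn of the medial meniscus.",
    # Frame-indexed bounding boxes: frame number -> [x_min, y_min, x_max, y_max]
    "grounding": {
        14: [112, 198, 156, 240],
        15: [110, 196, 158, 244],
        16: [114, 200, 154, 238],
    },
}
```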
Built on expert radiologist annotations from the fastMRI+ dataset, SGMRI-VQA covers 1,970 MRI volumes (996 brain, 974 knee) with 32,142 image-level and 9,165 volume-level QA pairs.
Sunburst chart of clinical finding categories showing normal vs. abnormal distribution across brain and knee MRI.
Distribution of QA pairs across tasks and anatomical domains.
GPT-4o generates QA pairs from radiologist annotations, but introduces systematic hallucinations. We identify and correct these through automated validation and iterative clinical expert review.
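As a rough sketch of what such automated validation can involve (the specific checks below are illustrative assumptions, not our exact pipeline), one can screen each generated pair for boxes that fall outside the frame or that reference frames without expert annotation:

```python
def validate_qa_pair(qa, frame_height, frame_width, annotated_frames):
    """Flag common hallucination patterns in a generated QA pair.

    Hypothetical checks, not the benchmark's exact validation rules.
    """
    issues = []
    for frame_idx, (x1, y1, x2, y2) in qa["grounding"].items():
        # The box must lie inside the image and be non-degenerate.
        if not (0 <= x1 < x2 <= frame_width and 0 <= y1 < y2 <= frame_height):
            issues.append(f"frame {frame_idx}: box out of bounds or degenerate")
        # The referenced frame must carry a radiologist annotation.
        if frame_idx not in annotated_frames:
            issues.append(f"frame {frame_idx}: no expert annotation on this frame")
    return issues  # an empty list means the pair passes automated checks
```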
We evaluate 10 VLMs with three complementary metrics: A-Score (factual accuracy), AR-Score (clinical reasoning quality), and V-Score (spatial grounding via mIoU).
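Below is a minimal sketch of how a V-Score can be computed as the mean IoU between predicted and ground-truth frame-indexed boxes; the function names and the handling of missing frames are our assumptions, and the official scoring script may differ.

```python
def box_iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def v_score(pred_boxes, gt_boxes):
    """Mean IoU over all ground-truth frames; a missing prediction scores 0."""
    ious = [box_iou(pred_boxes[f], gt) if f in pred_boxes else 0.0
            for f, gt in gt_boxes.items()]
    return 100.0 * sum(ious) / len(ious) if ious else 0.0
```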
| Model | Detection A-Score | Classification A-Score | Diagnosis A-Score | Localization V-Score | Localization AR-Score | Captioning AR-Score | Avg. |
|---|---|---|---|---|---|---|---|
| Proprietary Models | |||||||
| GPT-4o | 50.74 | 52.49 | 44.71 | 3.41 | 26.22 | 16.90 | 32.41 |
| Gemini-2.5-Pro | 19.24 | 31.33 | 34.32 | 3.33 | 14.89 | 11.91 | 19.17 |
| Gemini-2.5-Flash | 67.36 | 86.31 | 89.68 | 5.95 | 27.85 | 21.72 | 49.81 |
| Open-Source Models (7–8B) | |||||||
| LLaVA-Video-7B | 52.41 | 65.41 | 71.58 | 2.29 | 23.11 | 13.31 | 38.02 |
| Eagle2.5-8B | 26.94 | 57.55 | 61.39 | 2.59 | 23.98 | 15.41 | 31.31 |
| Qwen3-VL-8B | 46.11 | 76.02 | 80.29 | 3.11 | 24.55 | 19.86 | 41.66 |
| InternVL2.5-8B | 22.25 | 37.33 | 30.43 | 1.82 | 25.78 | 14.37 | 22.00 |
| Qwen2.5-VL-7B | 12.00 | 50.68 | 39.88 | 1.51 | 26.75 | 18.19 | 24.83 |
| Domain-Specific Medical VLMs (image-level only) | |||||||
| LLaVA-Med-v1.5 (7B) | 6.90 | 85.36 | 59.18 | 0.00 | 23.43 | 12.20 | 31.18 |
| MedGemma-1.5 (4B) | 26.94 | 65.98 | 64.61 | 2.77 | 24.93 | 17.24 | 33.74 |
| Fine-Tuned | |||||||
| Ours (Qwen3-VL-8B-FT) | 95.11 | 97.56 | 94.50 | 15.51 | 28.99 | 25.05 | 59.45 |
| Model | Detection A-Score | Counting A-Score | Classification A-Score | Localization V-Score | Localization AR-Score | Captioning AR-Score | Avg. |
|---|---|---|---|---|---|---|---|
| Proprietary Models | |||||||
| GPT-4o | 70.57 | 35.97 | 67.51 | 1.20 | 24.83 | 20.19 | 36.71 |
| Gemini-2.5-Pro | 16.03 | 4.08 | 28.21 | 0.26 | 12.42 | 12.71 | 12.29 |
| Gemini-2.5-Flash | 54.89 | 17.66 | 56.31 | 1.83 | 23.27 | 22.27 | 29.37 |
| Open-Source Models (7–8B) | |||||||
| Qwen3-VL-8B | 62.23 | 30.98 | 64.72 | 0.16 | 21.93 | 19.05 | 33.18 |
| Eagle2.5-8B | 70.65 | 16.30 | 70.73 | 0.02 | 19.07 | 16.20 | 32.16 |
| InternVL2.5-8B | 28.80 | 14.95 | 44.01 | 0.41 | 17.66 | 16.29 | 20.35 |
| LLaVA-Video-7B | 37.23 | 13.59 | 48.94 | 0.06 | 16.26 | 12.36 | 21.41 |
| Qwen2.5-VL-7B | 24.18 | 25.27 | 62.54 | 0.07 | 15.63 | 16.67 | 24.06 |
| Fine-Tuned | |||||||
| Ours (Qwen3-VL-8B-FT) | 99.18 | 37.77 | 97.70 | 5.97 | 28.24 | 26.54 | 49.23 |
Bounding box predictions from our fine-tuned Qwen3-VL-8B compared to ground truth, demonstrating that targeted spatial supervision enables accurate localization.
@inproceedings{moukheiber2026sgmrivqa,
title = {Beyond a Single Frame: A Benchmark for Multi-Frame Spatially
Grounded Visual Reasoning for MRI},
author = {Moukheiber, Lama and Yeung, Caleb M. and Xue, Haotian
and Helbling, Alec and Zhao, Zelin and Chen, Yongxin},
year = {2026}
}