Representative examples from each task category. Each QA pair includes a question, chain-of-thought reasoning, and frame-indexed bounding box coordinates for spatial grounding.
SGMRI-VQA (Spatially Grounded MRI Visual Question Answering) evaluates whether vision-language models can spatially ground their clinical findings across volumetric MRI. Its annotations pair frame-indexed bounding boxes with chain-of-thought reasoning.
SGMRI-VQA organizes tasks hierarchically, at the image level and the volume level, to mirror clinical reading. At the volume level, models process the full 3D scan as a multi-frame sequence. Each QA pair couples a clinician-aligned chain-of-thought reasoning trace with frame-indexed bounding box coordinates.
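
To make the shape of a QA pair concrete, here is a minimal sketch of how one record might be represented. The class and field names (`BoundingBox`, `QAPair`, `frame_index`, and so on) and the example values are illustrative assumptions, not the released data format.

```python
from dataclasses import dataclass, field


# Hypothetical schema: names and layout are assumptions for illustration only.
@dataclass
class BoundingBox:
    frame_index: int  # index of the slice (frame) within the MRI volume
    x_min: float
    y_min: float
    x_max: float
    y_max: float


@dataclass
class QAPair:
    question: str
    reasoning: str  # chain-of-thought trace
    answer: str
    boxes: list[BoundingBox] = field(default_factory=list)

    def frames(self) -> list[int]:
        """Sorted, de-duplicated frame indices that carry at least one box."""
        return sorted({b.frame_index for b in self.boxes})


# Made-up example values, not taken from the dataset.
example = QAPair(
    question="Is there an abnormality in this volume?",
    reasoning="A focal signal change appears on two adjacent slices.",
    answer="yes",
    boxes=[BoundingBox(12, 40, 52, 88, 97), BoundingBox(11, 42, 50, 90, 95)],
)
print(example.frames())  # [11, 12]
```

Keeping the frame index on each box, rather than on the QA pair, lets a single finding span several slices of the same volume.
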
Built on expert radiologist annotations from the fastMRI+ dataset, SGMRI-VQA covers 1,970 MRI volumes (996 brain, 974 knee) with 32,142 image-level and 9,165 volume-level QA pairs.
Sunburst chart of clinical finding categories showing normal vs. abnormal distribution across brain and knee MRI.
Distribution of QA pairs across tasks and anatomical domains.
GPT-4o generates QA pairs from radiologist annotations, but introduces systematic hallucinations. We identify and correct these through automated validation and iterative clinical expert review.
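
The automated validation step is not specified here. As one plausible kind of check, the sketch below flags generated boxes that point at frames without a radiologist annotation or that fall outside the image. The function name `validate_boxes`, its arguments, and both rules are assumptions for illustration, not the authors' pipeline.

```python
def validate_boxes(qa_boxes, annotated_frames, width, height):
    """Return a list of problems found; an empty list means the pair passes.

    qa_boxes: iterable of (frame_index, (x_min, y_min, x_max, y_max)).
    annotated_frames: set of frame indices that have radiologist annotations.
    Both checks are hypothetical examples of automated validation.
    """
    problems = []
    for frame, (x0, y0, x1, y1) in qa_boxes:
        if frame not in annotated_frames:
            problems.append(f"frame {frame} has no radiologist annotation")
        if not (0 <= x0 < x1 <= width and 0 <= y0 < y1 <= height):
            problems.append(f"frame {frame} box is degenerate or out of bounds")
    return problems


# Made-up inputs: the second box cites an unannotated frame and exceeds the image.
issues = validate_boxes(
    qa_boxes=[(11, (42, 50, 90, 95)), (30, (10, 10, 400, 20))],
    annotated_frames={11, 12},
    width=320,
    height=320,
)
for issue in issues:
    print(issue)
```

Pairs that fail checks of this kind could then be routed to the clinical expert review described above.
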
We evaluate 10 VLMs with three complementary metrics: A-Score (factual accuracy), AR-Score (clinical reasoning quality), and V-Score (spatial grounding via mIoU).
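
As a rough illustration of a V-Score-style measurement, the sketch below computes mean IoU over frame-indexed boxes. The exact protocol is not given here, so the choices in `mean_iou` (one box per frame, averaging over ground-truth frames, and scoring a missing prediction as 0) are assumptions, as are the function names.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(pred, truth):
    """Average IoU over ground-truth frames (assumed aggregation).

    pred and truth map frame_index -> box. A ground-truth frame with no
    predicted box contributes an IoU of 0.
    """
    if not truth:
        return 0.0
    total = 0.0
    for frame, gt_box in truth.items():
        total += box_iou(pred[frame], gt_box) if frame in pred else 0.0
    return total / len(truth)


# Made-up boxes: frame 11 overlaps by half its width, frame 12 is not predicted.
truth = {11: (0, 0, 10, 10), 12: (0, 0, 10, 10)}
pred = {11: (5, 0, 15, 10)}
score = mean_iou(pred, truth)
print(round(100 * score, 2))  # 16.67
```

Under this aggregation, a model that localizes the right region on the wrong frame is penalized as heavily as one that predicts no box at all.
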
| Model | Detection A-Score | Classification A-Score | Diagnosis A-Score | Localization V-Score | Localization AR-Score | Captioning AR-Score | Avg. |
|---|---|---|---|---|---|---|---|
| Proprietary Models | |||||||
| GPT-4o | 50.74 | 52.49 | 44.71 | 3.41 | 26.22 | 16.90 | 32.41 |
| Gemini-2.5-Pro | 19.24 | 31.33 | 34.32 | 3.33 | 14.89 | 11.91 | 19.17 |
| Gemini-2.5-Flash | 67.36 | 86.31 | 89.68 | 5.95 | 27.85 | 21.72 | 49.81 |
| Open-Source Models (7–8B) | |||||||
| LLaVA-Video-7B | 52.41 | 65.41 | 71.58 | 2.29 | 23.11 | 13.31 | 38.02 |
| Eagle2.5-8B | 26.94 | 57.55 | 61.39 | 2.59 | 23.98 | 15.41 | 31.31 |
| Qwen3-VL-8B | 46.11 | 76.02 | 80.29 | 3.11 | 24.55 | 19.86 | 41.66 |
| InternVL2.5-8B | 22.25 | 37.33 | 30.43 | 1.82 | 25.78 | 14.37 | 22.00 |
| Qwen2.5-VL-7B | 12.00 | 50.68 | 39.88 | 1.51 | 26.75 | 18.19 | 24.83 |
| Domain-Specific Medical VLMs (image-level only) | |||||||
| LLaVA-Med-v1.5 (7B) | 6.90 | 85.36 | 59.18 | 0.00 | 23.43 | 12.20 | 31.18 |
| MedGemma-1.5 (4B) | 26.94 | 65.98 | 64.61 | 2.77 | 24.93 | 17.24 | 33.74 |
| Fine-Tuned | |||||||
| Ours (Qwen3-VL-8B-FT) | 95.11 | 97.56 | 94.50 | 15.51 | 28.99 | 25.05 | 59.45 |
| Model | Detection A-Score | Counting A-Score | Classification A-Score | Localization V-Score | Localization AR-Score | Captioning AR-Score | Avg. |
|---|---|---|---|---|---|---|---|
| Proprietary Models | |||||||
| GPT-4o | 70.57 | 35.97 | 67.51 | 1.20 | 24.83 | 20.19 | 36.71 |
| Gemini-2.5-Pro | 16.03 | 4.08 | 28.21 | 0.26 | 12.42 | 12.71 | 12.29 |
| Gemini-2.5-Flash | 54.89 | 17.66 | 56.31 | 1.83 | 23.27 | 22.27 | 29.37 |
| Open-Source Models (7–8B) | |||||||
| Qwen3-VL-8B | 62.23 | 30.98 | 64.72 | 0.16 | 21.93 | 19.05 | 33.18 |
| Eagle2.5-8B | 70.65 | 16.30 | 70.73 | 0.02 | 19.07 | 16.20 | 32.16 |
| InternVL2.5-8B | 28.80 | 14.95 | 44.01 | 0.41 | 17.66 | 16.29 | 20.35 |
| LLaVA-Video-7B | 37.23 | 13.59 | 48.94 | 0.06 | 16.26 | 12.36 | 21.41 |
| Qwen2.5-VL-7B | 24.18 | 25.27 | 62.54 | 0.07 | 15.63 | 16.67 | 24.06 |
| Fine-Tuned | |||||||
| Ours (Qwen3-VL-8B-FT) | 99.18 | 37.77 | 97.70 | 5.97 | 28.24 | 26.54 | 49.23 |
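
The Avg. column in both tables appears to be the unweighted mean of the six task columns. This is an inference from the reported numbers, not something the text states. A quick check on two rows copied from the tables:

```python
from statistics import mean

# Six per-task scores and the reported Avg., copied from the tables above.
rows = {
    "GPT-4o (table with Diagnosis column)": (
        [50.74, 52.49, 44.71, 3.41, 26.22, 16.90], 32.41),
    "Ours (table with Counting column)": (
        [99.18, 37.77, 97.70, 5.97, 28.24, 26.54], 49.23),
}

computed = {}
for name, (scores, reported) in rows.items():
    # Assumption being tested: Avg. = unweighted mean of the six columns.
    computed[name] = round(mean(scores), 2)
    print(f"{name}: computed {computed[name]:.2f}, reported {reported:.2f}")
```

Because the V-Score values are much smaller than the A-Score values, differences in Avg. between models mostly reflect the accuracy columns.
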
Bounding box predictions from our fine-tuned Qwen3-VL-8B compared to ground truth, demonstrating that targeted spatial supervision enables accurate localization.
@misc{moukheiber2026singleframemultiframespatially,
title={Beyond a Single Frame: Multi-Frame Spatially Grounded Reasoning Across Volumetric MRI},
author={Lama Moukheiber and Caleb M. Yeung and Haotian Xue and Alec Helbling and Zelin Zhao and Yongxin Chen},
year={2026},
eprint={2604.15808},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2604.15808},
}