Beyond a Single Frame: Multi-Frame Spatially Grounded Reasoning Across Volumetric MRI

1Georgia Institute of Technology, 2Harvard University, 3Georgetown University

SGMRI-VQA (Spatially Grounded MRI Visual Question Answering) evaluates whether vision-language models can spatially ground their clinical findings across volumetric MRI, providing frame-indexed bounding boxes alongside chain-of-thought reasoning.

Feed a full brain or knee MRI volume, and SGMRI-VQA tests detection, localization, classification, counting, and captioning—requiring models to reason about what is present, where it is, and across which frames it extends.
[Animation] Brain MRI (Axial T1 Post-contrast): edema, mass, posttreatment change
[Animation] Knee MRI (Sagittal PD): cartilage defect, ACL sprain, meniscus tear, joint effusion

Highlights

- Benchmark: 41,307 QA pairs. The largest multi-frame spatially grounded VQA benchmark for MRI, spanning brain and knee across image-level and volume-level tasks.
- Grounding Gap: text vs. spatial disconnect. Models that produce clinically plausible text fail at spatial localization, with V-Scores below 4% across all zero-shot baselines.
- Hierarchical Tasks: radiologist-aligned evaluation. Detection, localization, classification, counting, and captioning mirror how radiologists reconstruct 3D understanding across slices.
- Fine-Tuning: spatial supervision works. Qwen3-VL-8B fine-tuned with bounding box supervision achieves the best overall performance, bridging the grounding gap.
Benchmark

Hierarchical Task Structure

SGMRI-VQA organizes tasks hierarchically to mirror clinical reading. At the volume level, models process the full 3D scan as a multi-frame sequence. Each QA pair couples a clinician-aligned chain-of-thought reasoning trace with frame-indexed bounding box coordinates.

Task examples

Representative examples from each task category. Each QA pair includes a question, chain-of-thought reasoning, and frame-indexed bounding box coordinates for spatial grounding.
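For concreteness, a single record can be pictured as the Python dict below. This is an illustrative sketch only; the field names, identifiers, and values are hypothetical, not the released schema.

```python
# Hypothetical structure of one SGMRI-VQA localization record.
qa_pair = {
    "volume_id": "knee_0042",   # hypothetical identifier
    "task": "localization",
    "question": "Where is the meniscus tear, and on which frames is it visible?",
    "chain_of_thought": (
        "Sagittal PD slices 17-19 show increased intrameniscal signal reaching "
        "the inferior articular surface of the posterior horn, consistent with a tear."
    ),
    "answer": "Posterior horn of the medial meniscus, frames 17-19.",
    "grounding": [  # frame-indexed bounding boxes, [x1, y1, x2, y2] in pixels
        {"frame": 17, "bbox": [212, 301, 268, 352]},
        {"frame": 18, "bbox": [210, 298, 270, 355]},
        {"frame": 19, "bbox": [214, 300, 266, 350]},
    ],
}
```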

Data

Dataset Composition

Built on expert radiologist annotations from the fastMRI+ dataset, SGMRI-VQA covers 1,970 MRI volumes (996 brain, 974 knee) with 32,142 image-level and 9,165 volume-level QA pairs.

Finding distribution

Sunburst chart of clinical finding categories showing normal vs. abnormal distribution across brain and knee MRI.

Dataset statistics

Distribution of QA pairs across tasks and anatomical domains.

Quality

Multi-Stage Quality Assurance

GPT-4o generates QA pairs from radiologist annotations, but introduces systematic hallucinations. We identify and correct these through automated validation and iterative clinical expert review.
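A minimal sketch of what such automated checks can look like, assuming the frame-indexed record layout shown earlier; the function and its checks are our illustration, not the actual pipeline:

```python
def validate_qa_pair(qa, n_frames, height, width, vocab):
    """Illustrative sanity checks on a generated QA pair.
    Any returned error routes the pair to clinical expert review."""
    errors = []
    for g in qa.get("grounding", []):
        # Frame index must fall inside the volume.
        if not 0 <= g["frame"] < n_frames:
            errors.append(f"frame {g['frame']} outside volume of {n_frames} frames")
        # Box must be non-degenerate and within image bounds.
        x1, y1, x2, y2 = g["bbox"]
        if not (0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height):
            errors.append(f"degenerate or out-of-bounds box {g['bbox']}")
    # Finding label must come from the source annotation vocabulary.
    if qa.get("finding") not in vocab:
        errors.append(f"finding {qa.get('finding')!r} not in the annotation vocabulary")
    return errors
```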

GPT-4o errors
Results

The Grounding Gap

We evaluate 10 VLMs with three complementary metrics: A-Score (factual accuracy), AR-Score (clinical reasoning quality), and V-Score (spatial grounding via mIoU).
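Since V-Score rests on mean IoU between predicted and reference boxes, the sketch below shows one way to compute it; the per-frame matching and the zero score for frames with no prediction are our assumptions, and the exact protocol may differ.

```python
def iou(a, b):
    """IoU of two [x1, y1, x2, y2] pixel boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def v_score(pred, gt):
    """Mean IoU over ground-truth frames; frames with no prediction score 0."""
    pred_by_frame = {g["frame"]: g["bbox"] for g in pred}
    ious = [
        iou(pred_by_frame[g["frame"]], g["bbox"]) if g["frame"] in pred_by_frame else 0.0
        for g in gt
    ]
    return 100.0 * sum(ious) / len(ious)
```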

Image-Level Per-Task Performance

All values in %. Best in **bold**, second best in *italics*.

| Model | Detection (A-Score) | Classification (A-Score) | Diagnosis (A-Score) | Localization (V-Score) | Localization (AR-Score) | Captioning (AR-Score) | Avg. |
|---|---|---|---|---|---|---|---|
| **Proprietary Models** | | | | | | | |
| GPT-4o | 50.74 | 52.49 | 44.71 | 3.41 | 26.22 | 16.90 | 32.41 |
| Gemini-2.5-Pro | 19.24 | 31.33 | 34.32 | 3.33 | 14.89 | 11.91 | 19.17 |
| Gemini-2.5-Flash | *67.36* | *86.31* | *89.68* | *5.95* | *27.85* | *21.72* | *49.81* |
| **Open-Source Models (7–8B)** | | | | | | | |
| LLaVA-Video-7B | 52.41 | 65.41 | 71.58 | 2.29 | 23.11 | 13.31 | 38.02 |
| Eagle2.5-8B | 26.94 | 57.55 | 61.39 | 2.59 | 23.98 | 15.41 | 31.31 |
| Qwen3-VL-8B | 46.11 | 76.02 | 80.29 | 3.11 | 24.55 | 19.86 | 41.66 |
| InternVL2.5-8B | 22.25 | 37.33 | 30.43 | 1.82 | 25.78 | 14.37 | 22.00 |
| Qwen2.5-VL-7B | 12.00 | 50.68 | 39.88 | 1.51 | 26.75 | 18.19 | 24.83 |
| **Domain-Specific Medical VLMs (image-level only)** | | | | | | | |
| LLaVA-Med-v1.5 (7B) | 6.90 | 85.36 | 59.18 | 0.00 | 23.43 | 12.20 | 31.18 |
| MedGemma-1.5 (4B) | 26.94 | 65.98 | 64.61 | 2.77 | 24.93 | 17.24 | 33.74 |
| **Fine-Tuned** | | | | | | | |
| Ours (Qwen3-VL-8B-FT) | **95.11** | **97.56** | **94.50** | **15.51** | **28.99** | **25.05** | **59.45** |
Volume-Level Per-Task Performance

All values in %. Best in **bold**, second best in *italics*.

| Model | Detection (A-Score) | Counting (A-Score) | Classification (A-Score) | Localization (V-Score) | Localization (AR-Score) | Captioning (AR-Score) | Avg. |
|---|---|---|---|---|---|---|---|
| **Proprietary Models** | | | | | | | |
| GPT-4o | 70.57 | *35.97* | 67.51 | 1.20 | *24.83* | 20.19 | *36.71* |
| Gemini-2.5-Pro | 16.03 | 4.08 | 28.21 | 0.26 | 12.42 | 12.71 | 12.29 |
| Gemini-2.5-Flash | 54.89 | 17.66 | 56.31 | *1.83* | 23.27 | *22.27* | 29.37 |
| **Open-Source Models (7–8B)** | | | | | | | |
| Qwen3-VL-8B | 62.23 | 30.98 | 64.72 | 0.16 | 21.93 | 19.05 | 33.18 |
| Eagle2.5-8B | *70.65* | 16.30 | *70.73* | 0.02 | 19.07 | 16.20 | 32.16 |
| InternVL2.5-8B | 28.80 | 14.95 | 44.01 | 0.41 | 17.66 | 16.29 | 20.35 |
| LLaVA-Video-7B | 37.23 | 13.59 | 48.94 | 0.06 | 16.26 | 12.36 | 21.41 |
| Qwen2.5-VL-7B | 24.18 | 25.27 | 62.54 | 0.07 | 15.63 | 16.67 | 24.06 |
| **Fine-Tuned** | | | | | | | |
| Ours (Qwen3-VL-8B-FT) | **99.18** | **37.77** | **97.70** | **5.97** | **28.24** | **26.54** | **49.23** |
Qualitative

Spatial Grounding After SFT

Bounding box predictions from our fine-tuned Qwen3-VL-8B compared to ground truth, demonstrating that targeted spatial supervision enables accurate localization.

Bbox results
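One way bounding box supervision can be fed to a VLM during fine-tuning is to serialize the frame-indexed boxes into the text target. The format below is a hypothetical sketch, not the exact training template.

```python
def grounding_to_target(answer: str, grounding: list[dict]) -> str:
    """Serialize frame-indexed boxes into an SFT text target (illustrative format)."""
    spans = "; ".join(
        f"frame {g['frame']}: <box>{','.join(map(str, g['bbox']))}</box>"
        for g in grounding
    )
    return f"{answer} Grounding: {spans}"

# e.g. "Posterior horn of the medial meniscus. Grounding: frame 17: <box>212,301,268,352</box>; ..."
```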

BibTeX

@inproceedings{moukheiber2026sgmrivqa,
  title     = {Beyond a Single Frame: A Benchmark for Multi-Frame Spatially
               Grounded Visual Reasoning for MRI},
  author    = {Moukheiber, Lama and Yeung, Caleb M. and Xue, Haotian
               and Helbling, Alec and Zhao, Zelin and Chen, Yongxin},
  year      = {2026}
}