Beyond a Single Frame: Multi-Frame Spatially Grounded Reasoning Across Volumetric MRI

1Georgia Institute of Technology, 2Harvard University, 3Georgetown University

SGMRI-VQA (Spatially Grounded MRI Visual Question Answering) evaluates whether vision-language models can spatially ground their clinical findings across volumetric MRI, providing frame-indexed bounding boxes alongside chain-of-thought reasoning.

Feed a full brain or knee MRI volume, and SGMRI-VQA tests detection, localization, classification, counting, and captioning—requiring models to reason about what is present, where it is, and across which frames it extends.
[Animation] Brain MRI (Axial T1 Post-contrast): edema, mass, posttreatment change
[Animation] Knee MRI (Sagittal PD): cartilage defect, ACL sprain, meniscus tear, joint effusion

Highlights

- Benchmark: 41,307 QA pairs. The largest multi-frame spatially grounded VQA benchmark for MRI, spanning brain and knee across image-level and volume-level tasks.
- Grounding Gap: text vs. spatial disconnect. Models that produce clinically plausible text fail at spatial localization, with V-Scores below 4% across all zero-shot baselines.
- Hierarchical Tasks: radiologist-aligned evaluation. Detection, localization, classification, counting, and captioning mirror how radiologists reconstruct 3D understanding across slices.
- Fine-Tuning: spatial supervision works. Qwen3-VL-8B fine-tuned with bounding box supervision achieves the best overall performance, bridging the grounding gap.
Benchmark

Hierarchical Task Structure

SGMRI-VQA organizes tasks hierarchically to mirror clinical reading. At the volume level, models process the full 3D scan as a multi-frame sequence. Each QA pair couples a clinician-aligned chain-of-thought reasoning trace with frame-indexed bounding box coordinates.

Task examples

Representative examples from each task category. Each QA pair includes a question, chain-of-thought reasoning, and frame-indexed bounding box coordinates for spatial grounding.
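For concreteness, a single record can be pictured as the Python dict below. This is an illustrative sketch only; the field names, identifiers, and values are hypothetical, not the released schema.

```python
# Hypothetical structure of one SGMRI-VQA localization record.
qa_pair = {
    "volume_id": "knee_0042",   # hypothetical identifier
    "task": "localization",
    "question": "Where is the meniscus tear, and on which frames is it visible?",
    "chain_of_thought": (
        "Sagittal PD slices 17-19 show increased intrameniscal signal reaching "
        "the inferior articular surface of the posterior horn, consistent with a tear."
    ),
    "answer": "Posterior horn of the medial meniscus, frames 17-19.",
    "grounding": [  # frame-indexed bounding boxes, [x1, y1, x2, y2] in pixels
        {"frame": 17, "bbox": [212, 301, 268, 352]},
        {"frame": 18, "bbox": [210, 298, 270, 355]},
        {"frame": 19, "bbox": [214, 300, 266, 350]},
    ],
}
```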

Data

Dataset Composition

Built on expert radiologist annotations from the fastMRI+ dataset, SGMRI-VQA covers 1,970 MRI volumes (996 brain, 974 knee) with 32,142 image-level and 9,165 volume-level QA pairs.

Finding distribution

Sunburst chart of clinical finding categories showing normal vs. abnormal distribution across brain and knee MRI.

Dataset statistics

Distribution of QA pairs across tasks and anatomical domains.

Quality

Multi-Stage Quality Assurance

GPT-4o generates QA pairs from radiologist annotations, but introduces systematic hallucinations. We identify and correct these through automated validation and iterative clinical expert review.
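A minimal sketch of what such automated checks can look like, assuming the frame-indexed record layout shown earlier; the function and its checks are our illustration, not the actual pipeline:

```python
def validate_qa_pair(qa, n_frames, height, width, vocab):
    """Illustrative sanity checks on a generated QA pair.
    Any returned error routes the pair to clinical expert review."""
    errors = []
    for g in qa.get("grounding", []):
        # Frame index must fall inside the volume.
        if not 0 <= g["frame"] < n_frames:
            errors.append(f"frame {g['frame']} outside volume of {n_frames} frames")
        # Box must be non-degenerate and within image bounds.
        x1, y1, x2, y2 = g["bbox"]
        if not (0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height):
            errors.append(f"degenerate or out-of-bounds box {g['bbox']}")
    # Finding label must come from the source annotation vocabulary.
    if qa.get("finding") not in vocab:
        errors.append(f"finding {qa.get('finding')!r} not in the annotation vocabulary")
    return errors
```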

GPT-4o errors
Results

The Grounding Gap

We evaluate 10 VLMs with three complementary metrics: A-Score (factual accuracy), AR-Score (clinical reasoning quality), and V-Score (spatial grounding via mIoU).
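Since V-Score rests on mean IoU between predicted and reference boxes, the sketch below shows one way to compute it; the per-frame matching and the zero score for frames with no prediction are our assumptions, and the exact protocol may differ.

```python
def iou(a, b):
    """IoU of two [x1, y1, x2, y2] pixel boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def v_score(pred, gt):
    """Mean IoU over ground-truth frames; frames with no prediction score 0."""
    pred_by_frame = {g["frame"]: g["bbox"] for g in pred}
    ious = [
        iou(pred_by_frame[g["frame"]], g["bbox"]) if g["frame"] in pred_by_frame else 0.0
        for g in gt
    ]
    return 100.0 * sum(ious) / len(ious)
```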

Image-Level Per-Task Performance

All values in %. Best in **bold**, second best in *italics*.

| Model | Detection (A-Score) | Classification (A-Score) | Diagnosis (A-Score) | Localization (V-Score) | Localization (AR-Score) | Captioning (AR-Score) | Avg. |
|---|---|---|---|---|---|---|---|
| **Proprietary Models** | | | | | | | |
| GPT-4o | 50.74 | 52.49 | 44.71 | 3.41 | 26.22 | 16.90 | 32.41 |
| Gemini-2.5-Pro | 19.24 | 31.33 | 34.32 | 3.33 | 14.89 | 11.91 | 19.17 |
| Gemini-2.5-Flash | *67.36* | *86.31* | *89.68* | *5.95* | *27.85* | *21.72* | *49.81* |
| **Open-Source Models (7–8B)** | | | | | | | |
| LLaVA-Video-7B | 52.41 | 65.41 | 71.58 | 2.29 | 23.11 | 13.31 | 38.02 |
| Eagle2.5-8B | 26.94 | 57.55 | 61.39 | 2.59 | 23.98 | 15.41 | 31.31 |
| Qwen3-VL-8B | 46.11 | 76.02 | 80.29 | 3.11 | 24.55 | 19.86 | 41.66 |
| InternVL2.5-8B | 22.25 | 37.33 | 30.43 | 1.82 | 25.78 | 14.37 | 22.00 |
| Qwen2.5-VL-7B | 12.00 | 50.68 | 39.88 | 1.51 | 26.75 | 18.19 | 24.83 |
| **Domain-Specific Medical VLMs (image-level only)** | | | | | | | |
| LLaVA-Med-v1.5 (7B) | 6.90 | 85.36 | 59.18 | 0.00 | 23.43 | 12.20 | 31.18 |
| MedGemma-1.5 (4B) | 26.94 | 65.98 | 64.61 | 2.77 | 24.93 | 17.24 | 33.74 |
| **Fine-Tuned** | | | | | | | |
| Ours (Qwen3-VL-8B-FT) | **95.11** | **97.56** | **94.50** | **15.51** | **28.99** | **25.05** | **59.45** |
Volume-Level Per-Task Performance

All values in %. Best in **bold**, second best in *italics*.

| Model | Detection (A-Score) | Counting (A-Score) | Classification (A-Score) | Localization (V-Score) | Localization (AR-Score) | Captioning (AR-Score) | Avg. |
|---|---|---|---|---|---|---|---|
| **Proprietary Models** | | | | | | | |
| GPT-4o | 70.57 | *35.97* | 67.51 | 1.20 | *24.83* | 20.19 | *36.71* |
| Gemini-2.5-Pro | 16.03 | 4.08 | 28.21 | 0.26 | 12.42 | 12.71 | 12.29 |
| Gemini-2.5-Flash | 54.89 | 17.66 | 56.31 | *1.83* | 23.27 | *22.27* | 29.37 |
| **Open-Source Models (7–8B)** | | | | | | | |
| Qwen3-VL-8B | 62.23 | 30.98 | 64.72 | 0.16 | 21.93 | 19.05 | 33.18 |
| Eagle2.5-8B | *70.65* | 16.30 | *70.73* | 0.02 | 19.07 | 16.20 | 32.16 |
| InternVL2.5-8B | 28.80 | 14.95 | 44.01 | 0.41 | 17.66 | 16.29 | 20.35 |
| LLaVA-Video-7B | 37.23 | 13.59 | 48.94 | 0.06 | 16.26 | 12.36 | 21.41 |
| Qwen2.5-VL-7B | 24.18 | 25.27 | 62.54 | 0.07 | 15.63 | 16.67 | 24.06 |
| **Fine-Tuned** | | | | | | | |
| Ours (Qwen3-VL-8B-FT) | **99.18** | **37.77** | **97.70** | **5.97** | **28.24** | **26.54** | **49.23** |
Qualitative

Spatial Grounding After SFT

Bounding box predictions from our fine-tuned Qwen3-VL-8B compared to ground truth, demonstrating that targeted spatial supervision enables accurate localization.

Bbox results
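One way bounding box supervision can be fed to a VLM during fine-tuning is to serialize the frame-indexed boxes into the text target. The format below is a hypothetical sketch, not the exact training template.

```python
def grounding_to_target(answer: str, grounding: list[dict]) -> str:
    """Serialize frame-indexed boxes into an SFT text target (illustrative format)."""
    spans = "; ".join(
        f"frame {g['frame']}: <box>{','.join(map(str, g['bbox']))}</box>"
        for g in grounding
    )
    return f"{answer} Grounding: {spans}"

# e.g. "Posterior horn of the medial meniscus. Grounding: frame 17: <box>212,301,268,352</box>; ..."
```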

BibTeX

@inproceedings{moukheiber2026sgmrivqa,
  title     = {Beyond a Single Frame: A Benchmark for Multi-Frame Spatially
               Grounded Visual Reasoning for MRI},
  author    = {Moukheiber, Lama and Yeung, Caleb M. and Xue, Haotian
               and Helbling, Alec and Zhao, Zelin and Chen, Yongxin},
  year      = {2026}
}