MedGRPO
Last updated: 2026-05-03
Summary
MedGRPO is a reinforcement learning framework for heterogeneous medical video understanding. Its key claim is that standard RL can collapse when medical video datasets and tasks have incompatible reward scales, and that balanced multi-dataset learning requires reward normalization plus domain-specific caption evaluation.
Key Points
- The paper introduces MedVidBench, a benchmark with 531,850 video-instruction pairs from 8 medical video sources and 8 task types. source
- MedVidBench spans video-level, segment-level, and frame-level tasks, so models must handle procedural summaries, temporal grounding, region captioning, and instrument localization. source
- The data pipeline converts existing expert annotations into instruction-following QA pairs using annotation-enriched prompting, dual-model generation, and validation. source
- Naive GRPO fails because easy dataset-task pairs produce larger raw rewards than hard pairs, causing optimization to over-focus on easy sources and destabilize learning. source
- MedGRPO applies dataset-task-specific logistic reward normalization so each dataset-task median maps to reward 0.5. This creates "median fairness" across heterogeneous tasks. source
- For medical captioning, the paper argues that generic semantic similarity misses clinically important errors such as wrong instruments, vague anatomy, inaccurate actions, or weak spatial detail. source
- The medical LLM judge scores captions across terminology precision, instrument and anatomy identification, specificity, clinical procedure context, and action/state accuracy. source
- On Qwen2.5-VL-7B, MedGRPO improves on the SFT baseline across most reported grounding and captioning metrics; NAP decreases, which the paper attributes to that metric not being covered by the RL reward set. source
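The "median fairness" idea above can be sketched with a plain logistic squash: each (dataset, task) pair gets its own reference point, and a raw metric equal to that pair's median maps to exactly 0.5. This is a minimal illustration, not the paper's exact formula; the `scale` slope parameter is an assumption.

```python
import math

def normalized_reward(raw: float, median: float, scale: float = 1.0) -> float:
    """Logistic squash so the per-(dataset, task) median maps to reward 0.5.

    `median` is estimated from baseline (e.g. SFT) rollouts for that
    dataset-task pair; `scale` controls the slope and is an assumed free
    parameter, not a value taken from the paper.
    """
    return 1.0 / (1.0 + math.exp(-(raw - median) / scale))

# An easy pair (high raw rewards) and a hard pair (low raw rewards) now
# share a common scale: each pair's own median lands at exactly 0.5.
easy = normalized_reward(0.85, median=0.85)  # -> 0.5
hard = normalized_reward(0.10, median=0.10)  # -> 0.5
```

Because both pairs are centered at 0.5, the RL objective no longer sees systematically larger rewards from the easy sources, which is the failure mode the naive-GRPO bullet describes.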
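The medical-judge criteria listed above (terminology precision, instrument and anatomy identification, specificity, clinical procedure context, action/state accuracy) suggest a simple weighted aggregation of per-criterion judge scores. The weights and the 0-5 score range below are illustrative assumptions, not values from the paper.

```python
# Hypothetical criterion weights, for illustration only; the paper defines
# the rubric, but these exact numbers are assumptions.
CRITERIA = {
    "terminology_precision": 0.25,
    "instrument_anatomy_identification": 0.25,
    "specificity": 0.15,
    "clinical_procedure_context": 0.15,
    "action_state_accuracy": 0.20,
}

def rubric_score(per_criterion: dict[str, float]) -> float:
    """Weighted mean of per-criterion judge scores (assumed 0-5 scale)."""
    return sum(CRITERIA[name] * per_criterion[name] for name in CRITERIA)

score = rubric_score({
    "terminology_precision": 4.0,
    "instrument_anatomy_identification": 3.0,
    "specificity": 5.0,
    "clinical_procedure_context": 4.0,
    "action_state_accuracy": 2.0,
})  # a single scalar reward the RL loop can consume
```

The point of the rubric is that a caption naming the wrong instrument is penalized on its own axis, whereas a generic semantic-similarity metric could still rate it as close to the reference.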
Practical Takeaways
- For multi-dataset RL, do not assume raw metrics are comparable across datasets or tasks.
- Use a held-out or baseline model to estimate reward percentiles before RL, then normalize rewards per dataset-task pair.
- In clinical domains, evaluate generated text with criteria that reflect domain-specific correctness rather than only semantic closeness.
- Pair grounding tasks with captioning tasks when the domain requires both spatial localization and procedural semantics.
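The percentile-estimation takeaway above can be sketched as a pre-RL pass: score baseline rollouts, group them by (dataset, task), and take each group's median as the reference point for normalization. The triple format and dataset/task names below are hypothetical.

```python
from collections import defaultdict
from statistics import median

def estimate_medians(rollouts):
    """Per-(dataset, task) median raw reward from baseline rollouts.

    `rollouts` is an iterable of (dataset, task, raw_reward) triples scored
    with the baseline (e.g. SFT) model before RL training starts.
    """
    by_pair = defaultdict(list)
    for dataset, task, raw in rollouts:
        by_pair[(dataset, task)].append(raw)
    return {pair: median(vals) for pair, vals in by_pair.items()}

# Hypothetical dataset/task names, for illustration only.
meds = estimate_medians([
    ("cholec", "grounding", 0.12), ("cholec", "grounding", 0.20),
    ("cholec", "grounding", 0.15), ("endo", "captioning", 0.70),
])
# meds[("cholec", "grounding")] == 0.15
```

During RL, each raw reward would then be normalized against the median stored for its own dataset-task pair, so no pair's absolute metric scale dominates the update.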
Results To Remember
| Model | CVS acc | STG mIoU | TAG@0.3 | DVC F1 | VS LLM | RC LLM |
|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B SFT | 0.894 | 0.177 | 0.142 | 0.165 | 3.596 | 2.757 |
| Qwen2.5-VL-7B MedGRPO | 0.896 | 0.202 | 0.216 | 0.214 | 4.184 | 3.442 |
| MedGRPO without reward normalization | 0.020 | 0.010 | 0.004 | 0.002 | 3.805 | 3.469 |
Related
Sources
Open Questions
- Is MedVidBench fully downloadable, or is access mediated through the project site?
- Does median-centered reward normalization remain stable if the SFT baseline is weak on most dataset-task pairs?
- How much of the medical LLM judge's reliability depends on GPT-4.1 versus the rubric design?