Introduction
The integration of artificial intelligence into clinical medicine promises to enhance diagnostic accuracy and efficiency. Multimodal Large Language Models (MLLMs), capable of processing both visual and textual information, represent a particularly promising frontier. However, their deployment in clinical settings has been hindered by fundamental challenges in data quality and availability.
Medical imaging data suitable for AI training faces multiple constraints. Privacy regulations limit access to patient data, expert annotation is prohibitively expensive at scale, and existing public datasets often lack the granularity required for clinical decision-making. For instance, a diagnosis of "infective endocarditis" may be technically correct, but clinically insufficient if imaging reveals prosthetic valve involvement—a distinction that fundamentally alters antibiotic selection and surgical planning.
Recent advances have leveraged de-identified images from biomedical literature, particularly PubMed Central, as a scalable data source. The HuatuoGPT-Vision project demonstrated that "unblinded" synthesis—where MLLMs can see both images and contextual text—produces superior training data compared to text-only approaches. However, even vision-enabled synthesis can propagate the inherent ambiguities and generalizations present in academic figure captions.
We hypothesized that enforcing clinical precision during data synthesis, rather than post-hoc filtering, would produce training data that better captures the specificity required for medical decision-making. To test this hypothesis, we developed Moose (Multimodal Objective Optimization for Specificity Enhancement), a framework incorporating a novel Precision-Gated Reward (PGR) mechanism. This approach fundamentally reimagines medical data synthesis as a precision-optimization problem rather than a simple reformatting task.
Pending Peer Review
Full methodology and clinical validation details will be available following formal peer review.
Key Innovations
- Precision-Gated Reward mechanism that enforces clinical specificity during training
- Stricter reward criteria that rejected 28% of candidate outputs during training, compared to standard training approaches
- Focus on diagnostic precision over general accuracy
- State-of-the-art performance on medical VQA benchmarks
Methods
Data Collection and Initial Processing
We utilized the PubMedVision dataset from HuatuoGPT-Vision, which contains 1.3 million VQA pairs generated from 914,960 medical images sourced from PMC-OA, LLaVA-Med PMC subset, and PMC-Inline. These VQA pairs were created using GPT-4V to reformat medical image-text pairs.
Our key contribution is the implementation of a Precision-Gated Reward (PGR) mechanism that enforces stricter clinical precision standards during model training. While HuatuoGPT-Vision focused on data scale and general accuracy, our approach prioritizes diagnostic specificity and clinical relevance through selective reward signals.
Precision-Gated Reward (PGR) Mechanism
The PGR mechanism implements selective reward criteria during the training process, enforcing clinical precision standards.
Training with PGR Criteria
During model training on the PubMedVision dataset, we implemented selective reward signals based on output quality. The PGR mechanism evaluates model responses against the following criteria:
- Specificity Score (0-10): Evaluates use of precise anatomical locations, quantitative measurements, and appropriate medical terminology
- Verifiability Score (0-10): Assesses whether all claims can be directly confirmed from image evidence
- Salience Score (0-10): Measures clinical relevance and diagnostic importance of the highlighted findings
Model outputs achieving a minimum score of 8/10 in each category received positive rewards, while those falling below this threshold received reduced or negative rewards. This strict threshold enforces clinical precision during training.
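As a minimal sketch of this gating logic, the reward can be expressed as an all-or-nothing gate over the three criteria. The names `PGRScores` and `gated_reward` and the specific reward values are illustrative assumptions, not the Moose implementation:

```python
from dataclasses import dataclass


@dataclass
class PGRScores:
    """Per-response scores on the three PGR criteria (each on a 0-10 scale)."""
    specificity: float    # precise anatomy, measurements, terminology
    verifiability: float  # every claim confirmable from the image evidence
    salience: float       # clinical relevance of the highlighted findings


def gated_reward(scores: PGRScores, threshold: float = 8.0,
                 pass_reward: float = 1.0, fail_reward: float = -0.5) -> float:
    """Positive reward only if every criterion clears the 8/10 gate;
    below-threshold outputs receive a reduced or negative reward."""
    criteria = (scores.specificity, scores.verifiability, scores.salience)
    return pass_reward if all(s >= threshold for s in criteria) else fail_reward


# A response that is verifiable and salient but insufficiently specific is penalized.
print(gated_reward(PGRScores(specificity=6.5, verifiability=9.0, salience=8.5)))  # -0.5
```

The essential design choice illustrated here is that a single weak criterion disqualifies the whole response, rather than being averaged away by strong scores on the other criteria.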
Model Architecture and Training
Moose-34B builds upon the same Yi-1.5-34B foundation model used by HuatuoGPT-Vision, applying our stricter PGR mechanism during training (a configuration sketch follows the component list):
- Vision Encoder: CLIP-Large/336 with medical domain adaptation
- Projection Layer: 2-layer MLP (hidden dimension: 4096)
- Language Model: Yi-1.5-34B with expanded medical vocabulary (127,000 additional terms)
- Training Regime: Two-stage process
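For readability, the components above can be summarized in a single configuration sketch. The dictionary keys and the interpretation of the two training stages are our assumptions; the text states only that training follows a two-stage process.

```python
# Summary of the Moose-34B components listed above. Keys and the stage
# description are assumptions for readability, not the actual training code.
moose_34b_config = {
    "vision_encoder": {
        "backbone": "CLIP-Large/336",
        "medical_domain_adaptation": True,
    },
    "projector": {
        "type": "mlp",
        "layers": 2,
        "hidden_dim": 4096,
    },
    "language_model": {
        "base": "Yi-1.5-34B",
        "additional_medical_vocab_terms": 127_000,
    },
    "training": {
        "num_stages": 2,  # e.g., vision-language alignment, then PGR-gated fine-tuning (assumed)
    },
}
```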
Results
Benchmark Performance
Moose-34B demonstrated significant improvements across all evaluated benchmarks:
| Model | VQA-RAD | SLAKE | PathVQA | PMC-VQA | Average |
|---|---|---|---|---|---|
| LLaVA-v1.6-34B | 58.6% | 67.3% | 59.1% | 44.4% | 57.4% |
| HuatuoGPT-Vision-34B | 68.1% | 76.9% | 63.5% | 58.2% | 66.7% |
| Moose-34B (Ours) | 75.1% | 83.2% | 68.9% | 64.6% | 73.5% |
On the MMMU Health & Medicine track, Moose-34B achieved 60.6% accuracy compared to 54.4% for HuatuoGPT-Vision, demonstrating consistent improvements across all medical subcategories including Basic Medical Science, Clinical Medicine, and Diagnostics.
Clinical Precision Analysis
The PGR mechanism's 28% rejection rate during training highlights the prevalence of clinically imprecise content in standard approaches. Analysis of rejected samples reveals common failure patterns (see the illustrative sketch after this list):
- Anatomical vagueness ("lung lesion" vs. "RUL posterior segment mass")
- Missing quantification ("large" vs. "3.2 cm")
- Absent clinical context ("abnormality" vs. "spiculated mass suspicious for malignancy")
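One purely illustrative way to operationalize these failure patterns is with surface-level heuristics. The vocabulary and regular expressions below are hypothetical and are not how the rejected samples were necessarily analyzed:

```python
import re

# Hypothetical heuristics mirroring the failure patterns above; vocabulary and
# patterns are illustrative only, not the study's actual analysis pipeline.
VAGUE_TERMS = {"lesion", "mass", "abnormality", "large", "small", "opacity"}
MEASUREMENT = re.compile(r"\d+(\.\d+)?\s*(mm|cm)")
LOCATION = re.compile(r"(upper|middle|lower|anterior|posterior|lateral|medial)\s+(lobe|segment)", re.I)


def flag_imprecision(finding: str) -> list[str]:
    """Return which failure patterns a candidate finding exhibits."""
    flags = []
    if set(finding.lower().split()) & VAGUE_TERMS and not LOCATION.search(finding):
        flags.append("anatomical vagueness")
    if not MEASUREMENT.search(finding):
        flags.append("missing quantification")
    return flags


print(flag_imprecision("lung lesion"))
# ['anatomical vagueness', 'missing quantification']
print(flag_imprecision("3.2 cm spiculated mass in the RUL posterior segment"))
# []
```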
Discussion
This study demonstrates that enforcing clinical precision during medical AI training significantly improves model performance. The 10.2% relative improvement achieved by Moose-34B over the previous state-of-the-art validates our hypothesis that data quality, specifically clinical precision, is a critical limiting factor in medical AI development.
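The relative figure follows from the benchmark averages reported in the Results table:

$$\frac{73.5\% - 66.7\%}{66.7\%} \approx 10.2\%$$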
Our approach differs fundamentally from prior work in medical data synthesis. While HuatuoGPT-Vision pioneered vision-enabled synthesis and PMC-VQA scaled dataset size, neither specifically optimized for clinical precision. The PGR mechanism represents a paradigm shift from quantity-focused to quality-focused training.
Limitations and Future Directions
Several important limitations must be addressed:
- Clinical Validation: Formal validation with expert radiologists remains essential before deployment
- Training Data Biases: geographic skew (78% of sources are North American or European) and limited pediatric representation (3% of cases)
- Technical Constraints: Cannot process DICOM metadata or perform serial imaging comparison
- Computational Requirements: Training required 3,072 GPU-hours at a cost of $38,400
Conclusion
The Precision-Gated Reward mechanism represents a significant advance in medical AI training methodology. By enforcing clinical specificity during training rather than relying on post-hoc filtering, we created a model that better captures the precision required for medical decision-making.
The success of the PGR approach suggests a broader principle: in high-stakes domains like medicine, the quality and specificity of training data may be more important than quantity alone. Future medical AI development should prioritize precision over scale.