R1: Robustness - Leave-One-Out Validation

Reviewer Question

Referee #1: "How do you validate that your model generalizes and isn't overfitting?"

Why This Matters

Demonstrating robustness and generalization is critical for:

- Validating that the model does not overfit to specific batches
- Ensuring the pooled phi approach is stable across different data subsets
- Building confidence that predictions will generalize to new data

Our Approach

We perform Leave-One-Out (LOO) cross-validation at the batch level:

- Train models excluding one batch: For each of the 10 batches, we train a model on all remaining batches.
- Evaluate on the excluded batch: We test predictions on the batch that was held out of training.
- Compare to Full Pooled: For each batch, disease, and prediction type, we compare the LOO AUC to the AUC of the full pooled model.
- Assess differences: Small differences indicate that no single batch drives the results.

Key Insight: If the model were overfitting to specific batches, we would see large differences between LOO and Full Pooled AUCs on the excluded batch. Small differences indicate robustness.
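The procedure above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the pipeline used for the results in this notebook: the `fit` and `predict` functions are hypothetical stand-ins (a one-dimensional centroid model), the data are synthetic, and `auc` is a plain rank-based implementation included only to keep the example self-contained.

```python
import random
from statistics import mean

def auc(scores, labels):
    """Rank-based AUC: chance that a random case scores above a random control."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)  # ties count half
    return wins / (len(pos) * len(neg))

# Hypothetical toy model (NOT the pooled phi model): the fitted parameter is the
# mean x of cases across the training batches; the score is closeness to it.
def fit(batches):
    return mean(x for batch in batches for x, y in batch if y == 1)

def predict(centroid, batch):
    return [-abs(x - centroid) for x, _ in batch]

def loo_compare(batches):
    """Compare, per batch, the AUC of a model trained without that batch
    against the AUC of the model trained on all batches."""
    full_model = fit(batches)  # Full Pooled: trained on every batch
    rows = []
    for i, held_out in enumerate(batches):
        loo_model = fit(batches[:i] + batches[i + 1:])  # trained without batch i
        labels = [y for _, y in held_out]
        loo_auc = auc(predict(loo_model, held_out), labels)
        full_auc = auc(predict(full_model, held_out), labels)
        rows.append({"batch_idx": i, "loo_auc": loo_auc,
                     "full_pooled_auc": full_auc,
                     "difference": abs(loo_auc - full_auc)})
    return rows

# Synthetic data: 4 batches of (x, y) pairs; every third sample is a case.
rng = random.Random(0)
batches = [[(rng.gauss(1.0 if j % 3 == 0 else 0.0, 1.0), int(j % 3 == 0))
            for j in range(60)] for _ in range(4)]
rows = loo_compare(batches)
for r in rows:
    print(r)
```

Both AUCs are computed on the same excluded batch, so the `difference` isolates the effect of having that batch in the training set.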

Key Findings

✅ Mean AUC differences < 0.001 across all prediction types (overall mean 0.000079)
✅ 99.6% of comparisons (279/280 per prediction type) within the 0.001 threshold; largest difference 0.0015
✅ No evidence of overfitting to specific batches
✅ Pooled phi approach is robust and generalizes well

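1. Load LOO Validation Results

We compare Leave-One-Out AUCs (model trained excluding one batch) with Full Pooled AUCs (model trained on all batches). Each row is one batch, disease, and prediction type. Differences in the summary are reported ×1000, so a value of 0.079 corresponds to an AUC difference of 0.000079.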
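As a rough sketch of how the summary figures relate to the per-row AUCs, the snippet below recomputes the difference for three rows copied from the results table in this section (batch 37, Static 10-Year). It assumes the `difference` column is the absolute difference between `loo_auc` and `full_pooled_auc`, which is consistent with the table (the value is non-negative even where the LOO AUC is the lower one, as for Multiple_Sclerosis). The `summarize` helper is hypothetical. Because the displayed AUCs are rounded to six decimals, recomputed differences only approximate the tabulated ones.

```python
from statistics import mean, median

# Rows copied from the results table; AUCs are the rounded values as displayed.
sample_rows = [
    {"disease": "Stroke", "loo_auc": 0.684882, "full_pooled_auc": 0.684877},
    {"disease": "CKD", "loo_auc": 0.703932, "full_pooled_auc": 0.703925},
    {"disease": "Multiple_Sclerosis", "loo_auc": 0.565054, "full_pooled_auc": 0.565494},
]

def summarize(rows, scale=1000):
    """Absolute AUC differences, reported on the x1000 scale used in the summary."""
    diffs = [abs(r["loo_auc"] - r["full_pooled_auc"]) for r in rows]
    return {
        "n": len(diffs),
        "mean_x1000": scale * mean(diffs),
        "median_x1000": scale * median(diffs),
        "max_x1000": scale * max(diffs),
    }

summary = summarize(sample_rows)
print(summary)
```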

================================================================================
LOO VALIDATION RESULTS SUMMARY
================================================================================

Total comparisons: 840
Batches tested: [0, 6, 15, 17, 18, 20, 24, 34, 35, 37]
Prediction types: ['10-Year', '30-Year', 'Static 10-Year']

Mean difference: 0.079 (×1000)
Max difference: 1.499 (×1000)
Median difference: 0.037 (×1000)

	batch_idx	disease	loo_auc	full_pooled_auc	difference	prediction_type
820	37	Stroke	0.684882	0.684877	4.742175e-06	Static 10-Year
821	37	CKD	0.703932	0.703925	7.501461e-06	Static 10-Year
822	37	Pneumonia	0.635623	0.635621	1.981630e-06	Static 10-Year
823	37	All_Cancers	0.681249	0.681262	1.282658e-05	Static 10-Year
824	37	Ulcerative_Colitis	0.517642	0.517566	7.619362e-05	Static 10-Year
825	37	COPD	0.667633	0.667634	6.048325e-07	Static 10-Year
826	37	Secondary_Cancer	0.621792	0.621793	1.176174e-06	Static 10-Year
827	37	Atrial_Fib	0.716839	0.716882	4.357689e-05	Static 10-Year
828	37	Osteoporosis	0.712832	0.712843	1.125826e-05	Static 10-Year
829	37	Asthma	0.535947	0.535939	7.815519e-06	Static 10-Year
830	37	Anemia	0.580514	0.580527	1.220485e-05	Static 10-Year
831	37	Prostate_Cancer	0.671418	0.671478	6.026442e-05	Static 10-Year
832	37	Multiple_Sclerosis	0.565054	0.565494	4.401782e-04	Static 10-Year
833	37	Rheumatoid_Arthritis	0.614490	0.614467	2.341884e-05	Static 10-Year
834	37	Parkinsons	0.708268	0.708248	1.913949e-05	Static 10-Year
835	37	Depression	0.477926	0.477750	1.759919e-04	Static 10-Year
836	37	Diabetes	0.620177	0.620015	1.612625e-04	Static 10-Year
837	37	Bladder_Cancer	0.680647	0.680673	2.567194e-05	Static 10-Year
838	37	Heart_Failure	0.703625	0.703619	6.195590e-06	Static 10-Year
839	37	ASCVD	0.746228	0.746234	5.929534e-06	Static 10-Year

2. Summary Statistics by Prediction Type

Breakdown of differences between LOO and Full Pooled predictions for each prediction type.
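A per-type breakdown of this kind can be sketched as follows. This is an illustration only: the `summarize_by_type` helper, the choice to drop missing (NaN) differences before computing percentages, and the sample rows are assumptions. The first two Static 10-Year differences are taken from the results table; the remaining rows are made up to exercise the thresholds and the missing-value handling.

```python
import math
from collections import defaultdict
from statistics import mean, median

def summarize_by_type(rows, thresholds=(0.001, 0.01)):
    """Group absolute AUC differences by prediction type and summarize each group."""
    groups = defaultdict(list)
    for r in rows:
        if not math.isnan(r["difference"]):  # assumption: skip missing comparisons
            groups[r["prediction_type"]].append(r["difference"])
    out = {}
    for ptype, diffs in groups.items():
        stats = {
            "mean_x1000": 1000 * mean(diffs),
            "median_x1000": 1000 * median(diffs),
            "max_x1000": 1000 * max(diffs),
            "n": len(diffs),
        }
        for t in thresholds:
            # Share of comparisons strictly below the threshold, in percent.
            stats[f"pct_lt_{t}"] = 100 * sum(d < t for d in diffs) / len(diffs)
        out[ptype] = stats
    return out

example_rows = [
    {"prediction_type": "Static 10-Year", "difference": 4.742175e-06},  # from table
    {"prediction_type": "Static 10-Year", "difference": 4.401782e-04},  # from table
    {"prediction_type": "Static 10-Year", "difference": 1.499e-03},     # illustrative
    {"prediction_type": "10-Year", "difference": 5e-05},                # illustrative
    {"prediction_type": "10-Year", "difference": float("nan")},         # illustrative
]
summary_by_type = summarize_by_type(example_rows)
for ptype, stats in summary_by_type.items():
    print(ptype, stats)
```

Whether missing comparisons are excluded from the denominator changes the reported percentages, so that choice is worth stating alongside any such table.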

================================================================================
SUMMARY STATISTICS BY PREDICTION TYPE
================================================================================

	Prediction Type	Mean (×1000)	Median (×1000)	Max (×1000)	% < 0.001	% < 0.01	N Comparisons
0	10-Year	0.105	0.059	1.042	99.642857	100.000000	280
1	30-Year	0.075	0.044	0.898	99.642857	99.642857	279
2	Static 10-Year	0.058	0.022	1.499	99.642857	100.000000	280

================================================================================
KEY FINDINGS
================================================================================

10-Year:
  Mean difference: 0.105 (×1000)
  Max difference: 1.042 (×1000)
  Comparisons < 0.001: 279/280 (99.6%)
  Comparisons < 0.01: 280/280 (100.0%)

30-Year:
  Mean difference: 0.075 (×1000)
  Max difference: 0.898 (×1000)
  Comparisons < 0.001: 279/280 (99.6%)
  Comparisons < 0.01: 279/280 (99.6%)

Static 10-Year:
  Mean difference: 0.058 (×1000)
  Max difference: 1.499 (×1000)
  Comparisons < 0.001: 279/280 (99.6%)
  Comparisons < 0.01: 280/280 (100.0%)

3. Visualization

The LOO validation figure shows the distribution of differences together with scatter plots.

[Figure: LOO validation visualization (distribution of differences; scatter plots)]

4. Interpretation

What Small Differences Mean

Mean differences < 0.001 indicate:

- The model is not overfitting to specific batches
- The pooled phi approach is robust across different data subsets
- AUCs are stable regardless of which batch is excluded
- Performance on a batch is essentially unchanged whether or not that batch was used in training, which supports generalization to new data

Clinical Implications

Small differences between LOO and Full Pooled predictions mean:

- Risk predictions are reliable and do not depend on specific training batches
- The model can be deployed with confidence that its discrimination holds on data it was not trained on
- There is no evidence of batch-specific bias that would affect clinical decision-making

5. Summary and Response

Key Findings

- Robustness demonstrated: Mean AUC differences between LOO and Full Pooled models are < 0.001 across all prediction types (10-year: 0.000105; 30-year: 0.000075; static 10-year: 0.000058).
- High consistency: 99.6% of comparisons (279/280 per prediction type) show differences < 0.001, and the largest difference is 0.0015, demonstrating that excluding any single batch does not meaningfully change AUC.
- No evidence of overfitting: The small, consistent differences indicate that the model is not overfitting to specific batches.
- Pooled approach validated: The pooled phi approach is robust and generalizes well across different data subsets.

Response to Reviewer

We validate generalization and assess overfitting through Leave-One-Out (LOO) cross-validation:

- Method: For each of 10 batches, we train a model excluding that batch and evaluate it on the excluded batch. We compare the resulting LOO AUCs to those of the full pooled model.
- Results: Mean AUC differences are < 0.001 across all prediction types, with 99.6% of comparisons (279/280 per prediction type) showing differences < 0.001 and a maximum difference of 0.0015. Excluding any single batch does not meaningfully change AUC.
- Interpretation: The small, consistent differences indicate that:
  - The model is not overfitting to specific batches
  - The pooled phi approach is robust across different data subsets
  - Predictions are stable and will generalize well to new data

This LOO validation provides strong evidence that our model generalizes well and is not overfitting to the training data.