R1: Robustness - Leave-One-Out Validation¶
Reviewer Question¶
Referee #1: "How do you validate that your model generalizes and isn't overfitting?"
Why This Matters¶
Demonstrating robustness and generalization is critical for:
- Validating that the model doesn't overfit to specific batches
- Ensuring the pooled phi approach is stable across different data subsets
- Building confidence that predictions will generalize to new data
Our Approach¶
We perform Leave-One-Out (LOO) cross-validation:
- Train models excluding one batch: For each of 10 batches, we train a model using all other batches
- Evaluate on excluded batch: Test predictions on the batch that was excluded from training
- Compare to Full Pooled: Compare LOO predictions to predictions from the full pooled model
- Assess differences: If differences are small, this demonstrates robustness
Key Insight: If the model were overfitting to specific batches, we would see large differences between LOO and Full Pooled predictions. Small differences indicate robustness.
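The LOO procedure described above can be sketched as a simple loop. Note that `train_model` and `compute_auc` below are hypothetical stand-ins for the project's actual training and evaluation routines, not functions from the source code:

```python
def loo_validate(batches, train_model, compute_auc):
    """For each batch, train on all other batches, evaluate on the
    held-out batch, and compare to the full pooled model's AUC."""
    full_model = train_model(batches)              # full pooled model (all batches)
    results = []
    for i, held_out in enumerate(batches):
        train_set = batches[:i] + batches[i + 1:]  # exclude one batch
        loo_model = train_model(train_set)
        loo_auc = compute_auc(loo_model, held_out)
        full_auc = compute_auc(full_model, held_out)
        results.append({
            "batch_idx": i,
            "loo_auc": loo_auc,
            "full_pooled_auc": full_auc,
            "difference": abs(loo_auc - full_auc),
        })
    return results
```

With 10 batches this yields 10 held-out evaluations per disease and prediction type, which is where the 840-row comparison table below comes from.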
Key Findings¶
✅ Mean AUC differences < 0.001 across all prediction types
✅ Over 99% of comparisons within the 0.001 threshold
✅ No evidence of overfitting to specific batches
✅ Pooled phi approach is robust and generalizes well
1. Load LOO Validation Results¶
We compare Leave-One-Out predictions (model trained excluding one batch) vs Full Pooled predictions (model trained on all batches).
================================================================================
LOO VALIDATION RESULTS SUMMARY
================================================================================
Total comparisons: 840
Batches tested: [0, 6, 15, 17, 18, 20, 24, 34, 35, 37]
Prediction types: ['10-Year', '30-Year', 'Static 10-Year']
Mean difference: 0.079 (×1000)
Max difference: 1.499 (×1000)
Median difference: 0.037 (×1000)
| | batch_idx | disease | loo_auc | full_pooled_auc | difference | prediction_type |
|---|---|---|---|---|---|---|
| 820 | 37 | Stroke | 0.684882 | 0.684877 | 4.742175e-06 | Static 10-Year |
| 821 | 37 | CKD | 0.703932 | 0.703925 | 7.501461e-06 | Static 10-Year |
| 822 | 37 | Pneumonia | 0.635623 | 0.635621 | 1.981630e-06 | Static 10-Year |
| 823 | 37 | All_Cancers | 0.681249 | 0.681262 | 1.282658e-05 | Static 10-Year |
| 824 | 37 | Ulcerative_Colitis | 0.517642 | 0.517566 | 7.619362e-05 | Static 10-Year |
| 825 | 37 | COPD | 0.667633 | 0.667634 | 6.048325e-07 | Static 10-Year |
| 826 | 37 | Secondary_Cancer | 0.621792 | 0.621793 | 1.176174e-06 | Static 10-Year |
| 827 | 37 | Atrial_Fib | 0.716839 | 0.716882 | 4.357689e-05 | Static 10-Year |
| 828 | 37 | Osteoporosis | 0.712832 | 0.712843 | 1.125826e-05 | Static 10-Year |
| 829 | 37 | Asthma | 0.535947 | 0.535939 | 7.815519e-06 | Static 10-Year |
| 830 | 37 | Anemia | 0.580514 | 0.580527 | 1.220485e-05 | Static 10-Year |
| 831 | 37 | Prostate_Cancer | 0.671418 | 0.671478 | 6.026442e-05 | Static 10-Year |
| 832 | 37 | Multiple_Sclerosis | 0.565054 | 0.565494 | 4.401782e-04 | Static 10-Year |
| 833 | 37 | Rheumatoid_Arthritis | 0.614490 | 0.614467 | 2.341884e-05 | Static 10-Year |
| 834 | 37 | Parkinsons | 0.708268 | 0.708248 | 1.913949e-05 | Static 10-Year |
| 835 | 37 | Depression | 0.477926 | 0.477750 | 1.759919e-04 | Static 10-Year |
| 836 | 37 | Diabetes | 0.620177 | 0.620015 | 1.612625e-04 | Static 10-Year |
| 837 | 37 | Bladder_Cancer | 0.680647 | 0.680673 | 2.567194e-05 | Static 10-Year |
| 838 | 37 | Heart_Failure | 0.703625 | 0.703619 | 6.195590e-06 | Static 10-Year |
| 839 | 37 | ASCVD | 0.746228 | 0.746234 | 5.929534e-06 | Static 10-Year |
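The summary block above can be reproduced directly from the per-disease comparison table. The sketch below assumes the table is available as a pandas DataFrame with the column names shown; the three rows are illustrative values copied from the table:

```python
import pandas as pd

# A few rows in the format of the comparison table (values taken from above).
df = pd.DataFrame({
    "batch_idx": [37, 37, 37],
    "disease": ["Stroke", "CKD", "Pneumonia"],
    "loo_auc": [0.684882, 0.703932, 0.635623],
    "full_pooled_auc": [0.684877, 0.703925, 0.635621],
})
df["difference"] = (df["loo_auc"] - df["full_pooled_auc"]).abs()
diff_x1000 = df["difference"] * 1000   # report in ×1000 units, as in the summary

print(f"Total comparisons: {len(df)}")
print(f"Batches tested: {sorted(df['batch_idx'].unique())}")
print(f"Mean difference: {diff_x1000.mean():.3f} (×1000)")
print(f"Max difference: {diff_x1000.max():.3f} (×1000)")
print(f"Median difference: {diff_x1000.median():.3f} (×1000)")
```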
2. Summary Statistics by Prediction Type¶
Breakdown of differences between LOO and Full Pooled predictions for each prediction type.
================================================================================
SUMMARY STATISTICS BY PREDICTION TYPE
================================================================================
| | Prediction Type | Mean (×1000) | Median (×1000) | Max (×1000) | % < 0.001 | % < 0.01 | N Comparisons |
|---|---|---|---|---|---|---|---|
| 0 | 10-Year | 0.105 | 0.059 | 1.042 | 99.6 | 100.0 | 280 |
| 1 | 30-Year | 0.075 | 0.044 | 0.898 | 99.6 | 99.6 | 279 |
| 2 | Static 10-Year | 0.058 | 0.022 | 1.499 | 99.6 | 100.0 | 280 |
================================================================================
KEY FINDINGS
================================================================================
10-Year:
- Mean difference: 0.105 (×1000)
- Max difference: 1.042 (×1000)
- Comparisons < 0.001: 279/280 (99.6%)
- Comparisons < 0.01: 280/280 (100.0%)

30-Year:
- Mean difference: 0.075 (×1000)
- Max difference: 0.898 (×1000)
- Comparisons < 0.001: 279/280 (99.6%)
- Comparisons < 0.01: 279/280 (99.6%)

Static 10-Year:
- Mean difference: 0.058 (×1000)
- Max difference: 1.499 (×1000)
- Comparisons < 0.001: 279/280 (99.6%)
- Comparisons < 0.01: 280/280 (100.0%)
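The per-type breakdown above is a straightforward groupby over the comparison table. A minimal sketch, assuming a DataFrame `df` with `prediction_type` and `difference` columns as in Section 1 (the four rows here are illustrative values only):

```python
import pandas as pd

df = pd.DataFrame({
    "prediction_type": ["10-Year", "10-Year", "30-Year", "Static 10-Year"],
    "difference": [1.0e-4, 5.9e-5, 4.4e-5, 2.2e-5],   # illustrative AUC diffs
})

# Named aggregation: one summary row per prediction type.
summary = df.groupby("prediction_type")["difference"].agg(
    mean_x1000=lambda d: d.mean() * 1000,
    median_x1000=lambda d: d.median() * 1000,
    max_x1000=lambda d: d.max() * 1000,
    pct_below_001=lambda d: (d < 0.001).mean() * 100,  # % of diffs < 0.001
    n_comparisons="count",
)
print(summary)
```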
3. Visualization¶
If available, we display the LOO validation figure, showing the distribution of LOO vs Full Pooled AUC differences and per-batch scatter plots.
LOO Validation Visualization:
4. Interpretation¶
What Small Differences Mean¶
Mean differences < 0.001 indicate:
- The model is not overfitting to specific batches
- The pooled phi approach is robust across different data subsets
- Predictions are stable regardless of which batch is excluded
- The model will generalize well to new data
Clinical Implications¶
Small differences between LOO and Full Pooled predictions mean:
- Risk predictions are reliable and not dependent on specific training batches
- The model can be confidently deployed knowing it generalizes well
- No batch-specific bias that would affect clinical decision-making
5. Summary and Response¶
Key Findings¶
Robustness demonstrated: Mean AUC differences between LOO and Full Pooled predictions are < 0.001 across all prediction types (10-year, 30-year, static 10-year).
High consistency: Over 99% of comparisons show differences < 0.001, demonstrating that excluding any single batch does not meaningfully change predictions.
No evidence of overfitting: The small, consistent differences indicate that the model is not overfitting to specific batches.
Pooled approach validated: The pooled phi approach is robust and generalizes well across different data subsets.
Response to Reviewer¶
We validate generalization and assess overfitting through Leave-One-Out (LOO) cross-validation:
Method: For each of 10 batches, we train a model excluding that batch and evaluate predictions on the excluded batch. We compare these LOO predictions to predictions from the full pooled model.
Results: Mean AUC differences are < 0.001 across all prediction types, with over 99% of comparisons showing differences < 0.001. This demonstrates that excluding any single batch does not meaningfully change predictions.
Interpretation: The small, consistent differences indicate that:
- The model is not overfitting to specific batches
- The pooled phi approach is robust across different data subsets
- Predictions are stable and will generalize well to new data
This LOO validation provides strong evidence that our model generalizes well and is not overfitting to the training data.