This documentation is organized into five main sections for reviewers:
Note: Installation instructions are not required for reviewers. A pre-configured environment will be provided for running the code.
Aladynoulli is a comprehensive Bayesian survival model that predicts disease trajectories by integrating genetic and clinical data. The model captures:
| Feature | Description |
|---|---|
| ✅ Scalable | Handles large-scale genetic and clinical datasets (400K+ individuals) |
| ✅ Flexible | Supports both discovery and prediction modes |
| ✅ Robust | Proper Bayesian uncertainty quantification |
| ✅ Fast | GPU-accelerated training and inference |
| ✅ Reproducible | Complete code and data processing pipelines |
The model predicts disease probability at time t as:
π_i,d,t = κ × Σ_k θ_i,k,t × φ_k,d,t
Where:

- θ_i,k,t = softmax(λ_i,k,t) (signature proportions)
- λ_i,k,t ~ GP(μ_k + G_i γ_k, K_λ) (temporal dynamics)
- φ_k,d,t = sigmoid(GP(μ_d + ψ_k,d, K_φ)) (disease probabilities)
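The mixture formula above can be illustrated with a minimal NumPy sketch. It is not the production model: it takes the signature logits λ and the pre-sigmoid disease logits as given arrays (no GP prior is sampled), and the function names, array shapes, and toy sizes are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def disease_probability(lam, phi_logit, kappa=1.0):
    """pi[i, d, t] = kappa * sum_k theta[i, k, t] * phi[k, d, t]."""
    theta = softmax(lam, axis=1)   # (N, K, T), sums to 1 over signatures k
    phi = sigmoid(phi_logit)       # (K, D, T), each entry in (0, 1)
    return kappa * np.einsum("nkt,kdt->ndt", theta, phi)

rng = np.random.default_rng(0)
N, K, D, T = 4, 3, 5, 6            # toy sizes, not the production dimensions
lam = rng.normal(size=(N, K, T))
phi_logit = rng.normal(size=(K, D, T)) - 3.0   # shifted so probabilities are small
pi = disease_probability(lam, phi_logit)       # (N, D, T)
```

With κ = 1, each π_i,d,t is a convex combination of the φ_k,d,t across signatures, so it stays strictly between 0 and 1.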
Comprehensive interactive analyses addressing reviewer questions and model validation.
Reviewer 1 (R1) analyses:

| Analysis | Description | Link |
|---|---|---|
| Clinical Utility | Dynamic risk updating and clinical decision-making | R1_Clinical_Utility_Dynamic_Risk_Updating.html |
| AUC Comparisons | Performance vs. established clinical risk scores | R1_Q9_AUC_Comparisons.html |
| Age-Stratified | Performance across different age groups | R1_Q10_Age_Specific.html |
| Heritability | Genetic architecture and heritability estimates | R1_Q7_Heritability.html |
| GWAS Validation | Genome-wide association studies on signatures; identifies 10 novel loci for Signature 5 not found in individual trait GWAS | R1_Genetic_Validation_GWAS.html |
| Gene-Based RVAS | Rare variant association studies on signatures | R1_Genetic_Validation_Gene_Based_RVAS.html |
| Biological Plausibility | CHIP analysis and biological validation | R1_Biological_Plausibility_CHIP.html |
| LOO Validation | Leave-one-out cross-validation robustness | R1_Robustness_LOO_Validation.html |
| Selection Bias | Assessment of selection bias and participation | R1_Q1_Selection_Bias.html |
| Clinical Meaning | Analysis of Familial hypercholesterolemia patients | R1_Q3_Clinical_Meaning.html |
| ICD vs PheCode | Detailed comparison of coding systems | R1_Q3_ICD_vs_PheCode_Comparison.html |
| Competing Risks | Multi-disease patterns and competing risks | R1_Multi_Disease_Patterns_Competing_Risks.html |

Reviewer 2 (R2) analyses:

| Analysis | Description | Link |
|---|---|---|
| Temporal Leakage | Assessment of temporal leakage and prediction accuracy | R2_Temporal_Leakage.html |
| Washout Comparisons | Multi-approach washout analysis (time horizon, floating prediction, fixed timepoint) | R2_Washout_Comparisons.html |
| Delphi Phecode Mapping | Principled Delphi comparison using Phecode-based ICD mapping | R2_Delphi_Phecode_Mapping.html |
| Model Validity | Model learning and validity assessment | R2_R3_Model_Validity_Learning.html |

Reviewer 3 (R3) analyses:

| Analysis | Description | Link |
|---|---|---|
| Avoiding Reverse Causation | Reverse causation assessment with 0, 1, 3, 6-month washout periods | R3_AvoidingReverseCausation.html |
| Competing Risks | Detailed competing risks analysis | R3_Competing_Risks.html |
| Decreasing Hazards (Censoring Bias) | Analysis of decreasing hazards at older ages due to censoring bias | R3_Q4_Decreasing_Hazards_Censoring_Bias.html |
| Verify Corrected Data | Verification of corrected E matrix and prevalence calculations | R3_Verify_Corrected_Data.html |
| Linear vs Nonlinear | Linear vs nonlinear mixing approaches | R3_Linear_vs_NonLinear_Mixing.html |
| Population Stratification | Ancestry-stratified analysis | R3_Population_Stratification_Ancestry.html |
| Heterogeneity (Main Paper Method) | Main paper method with PRS validation (MI and breast cancer) | R3_Q8_Heterogeneity_MainPaper_Method.html |
| Heterogeneity (Continued) | Complete pathway analysis demonstrating biological heterogeneity | R3_Q8_Heterogeneity_Continued.html |
| Cross-Cohort Similarity | Cross-cohort signature correspondence analysis | R3_Cross_Cohort_Similarity.html |
In the standard (“centered”) formulation, genetic effects (γ) enter only through the GP prior mean for λ. Because λ is a free parameter that enters the likelihood directly, the optimizer can fit the data by adjusting λ alone, so γ receives only a weak gradient signal: the one passed through the GP prior term, which is scaled by W = 10⁻⁴. This can limit the accuracy of γ recovery, which matters because the individual-specific genetic effects are essential for biological interpretation and out-of-sample prediction.
The non-centered formulation addresses this by decomposing λ into a genetic mean and a residual:
λ_i,k,t = μ_k + G_i γ_k + δ_i,k,t
where δ (not λ) carries the GP prior. Now γ flows through the forward pass into the NLL via the chain rule, receiving full gradient signal. Additionally, κ is fixed at 1 rather than learned, since κ and γ are not jointly identifiable (only κ·γ enters the likelihood).
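The difference in gradient signal can be checked numerically on a toy problem. The sketch below is not the production model: the quadratic `toy_nll`, the identity-kernel `toy_prior`, the `target` array, and all sizes are hypothetical stand-ins, and gradients are taken by central finite differences rather than autograd. Only the structure (γ reaching the loss through the W-scaled prior versus through the likelihood) follows the text.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, K, T = 20, 3, 2, 5             # toy sizes (hypothetical)
W = 1e-4                             # prior weight quoted in the text

G = rng.normal(size=(N, P))          # genetic covariates
mu = rng.normal(size=K)              # per-signature baseline
target = rng.normal(size=(N, K, T))  # stand-in for what the data pull lambda toward

def genetic_mean(gamma):
    # mu_k + G_i gamma_k, constant over time: shape (N, K, 1)
    return (mu + G @ gamma)[:, :, None]

def toy_nll(lam):
    # Hypothetical stand-in for the real negative log-likelihood.
    return 0.5 * np.sum((lam - target) ** 2)

def toy_prior(resid):
    # Identity-kernel simplification of the GP prior penalty.
    return 0.5 * np.sum(resid ** 2)

def centered_loss(gamma, lam):
    # lambda is free; gamma appears only inside the W-scaled prior.
    return toy_nll(lam) + W * toy_prior(lam - genetic_mean(gamma))

def noncentered_loss(gamma, delta):
    # lambda = mean(gamma) + delta; gamma reaches the NLL directly.
    return toy_nll(genetic_mean(gamma) + delta) + W * toy_prior(delta)

def grad_wrt_gamma(loss, gamma, other, eps=1e-5):
    # Central finite differences, one entry of gamma at a time.
    grad = np.zeros_like(gamma)
    for idx in np.ndindex(*gamma.shape):
        step = np.zeros_like(gamma)
        step[idx] = eps
        grad[idx] = (loss(gamma + step, other) - loss(gamma - step, other)) / (2 * eps)
    return grad

gamma = rng.normal(size=(P, K))
lam = rng.normal(size=(N, K, T))
delta = lam - genetic_mean(gamma)    # same lambda under both parameterizations

g_centered = grad_wrt_gamma(centered_loss, gamma, lam)
g_noncentered = grad_wrt_gamma(noncentered_loss, gamma, delta)
ratio = np.linalg.norm(g_centered) / np.linalg.norm(g_noncentered)
```

Both gradients are evaluated at the same λ, so `ratio` isolates the effect of the parameterization. In this toy setting the centered gradient is smaller by several orders of magnitude, reflecting the W scaling.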
| Analysis | Description | Link |
|---|---|---|
| Parameter Recovery Simulation | Synthetic data simulation (N=1000, D=50, K=5) comparing γ recovery: centered model (r ≈ 0.80) vs non-centered model (r ≈ 0.95). Uses the actual production model classes. | parameter_recovery_simulation.ipynb |
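
As a rough illustration of how a recovery score such as the r values above can be computed, the sketch below takes the Pearson correlation between flattened true and estimated γ matrices. Whether the notebook flattens γ this way is an assumption, and the number of genetic covariates `P`, the noise level, and the function name are hypothetical. The "estimate" here is synthetic noise added to the truth, not model output.

```python
import numpy as np

def gamma_recovery_r(gamma_true, gamma_est):
    """Pearson correlation between flattened true and estimated gamma."""
    return float(np.corrcoef(gamma_true.ravel(), gamma_est.ravel())[0, 1])

rng = np.random.default_rng(0)
P, K = 36, 5     # P is hypothetical; K = 5 matches the simulation in the table
gamma_true = rng.normal(size=(P, K))
# Synthetic stand-in for a fitted gamma: truth plus Gaussian noise.
gamma_noisy = gamma_true + rng.normal(scale=0.5, size=(P, K))

r_exact = gamma_recovery_r(gamma_true, gamma_true)
r_noisy = gamma_recovery_r(gamma_true, gamma_noisy)
```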
| Component | File | Description |
|---|---|---|
| Discovery Model (Reparam) | clust_huge_amp_vectorized_reparam.py | Non-centered model: λ = μ(γ) + δ, κ=1 fixed. γ and ψ receive NLL gradients directly. |
| Training Script | train_nokappa_v3.py | Constant LR=0.1, W=10⁻⁴, 300 epochs, no cosine scheduling, no gradient clipping |
The Aladynoulli workflow consists of 5 main steps:
| Resource | Description | Link |
|---|---|---|
| Framework Overview | Discovery vs prediction framework - Essential reading | Discovery_Prediction_Framework_Overview.html |
| Complete Workflow Guide | Step-by-step preprocessing → training → prediction | WORKFLOW.md |
| Preprocessing Guide | Preprocessing file creation guide | create_preprocessing_files.html |
| Component | File | Description |
|---|---|---|
| Discovery Model | clust_huge_amp_vectorized.py | Full model that learns phi and psi |
| Prediction Model | clust_huge_amp_fixedPhi_vectorized_fixed_gamma_fixed_kappa.py | Fixed-phi, fixed-gamma, fixed-kappa model for fast predictions |
Note: The prediction model uses fixed gamma (genetic effects) and kappa (calibration parameter) from pooled training batches. This ensures complete separation between training and testing data in each validation fold. Only lambda (individual-specific signature loadings) is learned during prediction.
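
The pooling step can be pictured with the sketch below. It assumes that pooling means a simple average across training-batch checkpoints, which the text does not state. The dictionary keys, the function name, and the toy values are also hypothetical, and the actual pooling script may work differently.

```python
import numpy as np

def pool_fixed_parameters(batch_checkpoints):
    """Average gamma and kappa over training-batch checkpoints (assumed rule)."""
    gamma = np.mean([c["gamma"] for c in batch_checkpoints], axis=0)
    kappa = float(np.mean([c["kappa"] for c in batch_checkpoints]))
    return {"gamma": gamma, "kappa": kappa}

rng = np.random.default_rng(0)
P, K = 4, 3      # toy sizes (hypothetical)
# Stand-ins for checkpoints saved from three training batches.
checkpoints = [
    {"gamma": rng.normal(size=(P, K)), "kappa": 1.0 + 0.1 * b}
    for b in range(3)
]
pooled = pool_fixed_parameters(checkpoints)
```

To preserve the train/test separation described in the note, only checkpoints from training batches would be passed to such a function. The pooled values would then be held fixed while lambda is fitted for the held-out individuals.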
| Script | Location | Purpose |
|---|---|---|
| Preprocessing | preprocessing_utils.py | Preprocessing utilities |
| Batch Training | run_aladyn_batch_vector_e_censor_nolor.py | Batch model training with corrected E (no LR regularization on gamma) |
| Master Checkpoint | create_master_checkpoints.py | Create pooled checkpoints (phi and psi) |
| Pool Gamma & Kappa | pool_kappa_and_gamma_from_nolr_batches.py | Pool gamma (genetic effects) and kappa (calibration) from training batch checkpoints |
| Prediction | run_aladyn_predict_with_master_vector_cenosrE_fixedgk.py | Run enrollment-based predictions using enrollment E matrix (E_enrollment_full.pt) with master checkpoint from corrected E training, using fixed gamma and kappa from pooled training batches (only lambda is learned per batch) |
For 10K individuals, 348 diseases, 52 timepoints:

- Training Time: ~8-10 minutes per batch (converges after ~200 epochs)
- Prediction Time: ~8 minutes per batch
- Memory: ~8GB RAM (peak usage during training)
- CPU: Multi-core recommended (4+ cores); PyTorch uses BLAS for parallel matrix operations

Scaling:

- Full UK Biobank (400K individuals): Processed in 39 batches of ~10K each
- Total training time: ~5-7 hours for all batches (can be parallelized)
- Memory scales linearly: ~8GB per 10K batch
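
The figures above can be sanity-checked with some back-of-envelope arithmetic. The sketch assumes float32 storage, which the text does not state, and it sizes a single dense individuals × diseases × timepoints array rather than measuring the real memory breakdown.

```python
import numpy as np

N, D, T = 10_000, 348, 52        # one batch, as quoted above
n_batches = 39
minutes_per_batch = (8, 10)      # quoted training-time range per batch

# Size of one dense N x D x T array, assuming float32 (4 bytes per entry).
bytes_per_tensor = N * D * T * np.dtype(np.float32).itemsize
gb_per_tensor = bytes_per_tensor / 1e9

# Sequential training time over all batches, in hours (low, high).
total_hours = tuple(n_batches * m / 60 for m in minutes_per_batch)
```

One such array takes roughly 0.72 GB, so a peak of ~8GB would be consistent with several arrays of this size being held at once. Running 39 batches sequentially at 8-10 minutes each gives 5.2-6.5 hours, in line with the quoted ~5-7 hours.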

Why it’s fast:

- Vectorized PyTorch operations (batched matrix decompositions)
- BLAS Level 3 operations for efficient linear algebra
- ~100-fold speedup compared to loop-based implementation
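
To illustrate what a batched matrix decomposition looks like, the sketch below factors a stack of kernel matrices in one call and compares the result with a Python loop. It uses NumPy as a stand-in for PyTorch (`torch.linalg.cholesky` likewise accepts a batch of matrices). The choice of Cholesky, the RBF kernel, the length scales, and the jitter are assumptions for illustration, not details taken from the production code.

```python
import numpy as np

def rbf_kernel(T, length_scale, jitter=1e-4):
    # Squared-exponential kernel on an integer time grid, with diagonal jitter
    # so the matrix is numerically positive definite.
    t = np.arange(T, dtype=float)
    sq_dist = (t[:, None] - t[None, :]) ** 2
    return np.exp(-0.5 * sq_dist / length_scale**2) + jitter * np.eye(T)

T = 52
length_scales = np.linspace(2.0, 8.0, 20)       # hypothetical kernel settings
kernels = np.stack([rbf_kernel(T, ls) for ls in length_scales])  # (20, T, T)

# Loop-based: one factorization call per matrix.
chol_loop = np.stack([np.linalg.cholesky(k) for k in kernels])

# Batched: a single call factors the whole stack.
chol_batched = np.linalg.cholesky(kernels)
```

Both approaches return the same factors. The batched call moves the iteration out of Python and into the linear algebra backend. The sketch only demonstrates the pattern and does not reproduce the quoted speedup.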
This repository contains no patient-identifying information. No individual-level identifiers (EIDs, MRNs, or other participant IDs) from UK Biobank, All of Us, or Mass General Brigham are included in any files or git history. All analyses use de-identified, aggregate-level results only.
Access to the underlying individual-level data from UK Biobank, All of Us, and Mass General Brigham requires separate approval from each institution’s data access committee and is available only with PI permission and institutional authorization.
If you use Aladynoulli in your research, please cite:
@article{aladynoulli2024,
title={Aladynoulli: A Bayesian Survival Model for Disease Trajectory Prediction},
author={Sur, P. and others},
journal={medRxiv},
year={2024},
doi={10.1101/2024.09.29.24314557}
}

For questions or issues, please open an issue on GitHub.
Last Updated: February 2026