Discovery vs Prediction Framework Overview¶
This notebook provides a comprehensive overview of the two modes of Aladynoulli:
- Discovery Mode: Joint phi estimation for understanding disease biology
- Prediction Mode: Fixed phi for clinical predictions
We also demonstrate:
- How thetas update over time with different washout periods
- How to calculate GWAS AUC without genotypes (G) using signature loadings
Part 1: Discovery vs Prediction Framework¶
🔬 Discovery Mode: Joint Phi Estimation¶
Purpose: Learn disease signatures (phi) from data to understand disease connections
How it works:
- Phi is learned from the data (joint estimation with lambda)
- Model learns which diseases cluster together into signatures
- Used for understanding disease biology and relationships
- Lambda (individual signature loadings) is also learned
- Theta = softmax(lambda) = individual signature proportions over time
Location: /Dropbox/enrollment_retrospective_full/
- Generated by:
run_full_retrospective.sh→run_aladyn_batch.py - Outputs:
enrollment_model_W0.0001_batch_*_*.pt(contains learned phi and lambda)
Key Characteristics:
- Phi can vary between batches (batch-specific disease signatures)
- Theta reflects individual signature loadings learned jointly with phi
- Best for: Understanding disease biology, pathway discovery, heritability analysis
🎯 Prediction Mode: Fixed Phi¶
Purpose: Make predictions using pre-learned signatures (stable, generalizable)
How it works:
- Phi is fixed from master checkpoints (pre-learned signatures)
- Only lambda (individual signature loadings) is estimated
- Theta = softmax(lambda) = individual signature proportions over time
- Used for clinical prediction and performance evaluation
Location: /Dropbox/models_fromAWS_enrollment_predictions_fixedphi_RETROSPECTIVE_pooled/
- Generated by:
run_aladyn_predict_with_master.py(on AWS) - Outputs:
pi_enroll_fixedphi_sex_*.pt(pi tensors for predictions)
Key Characteristics:
- Phi is stable across all patients (from master checkpoint)
- Theta reflects individual signature loadings given fixed phi
- Best for: Clinical predictions, performance evaluation, washout analyses
Comparison Table¶
| Aspect | Discovery Mode | Prediction Mode |
|---|---|---|
| Phi | Learned (joint) | Fixed (from master) |
| Lambda | Learned (joint) | Learned (phi fixed) |
| Theta | softmax(lambda) | softmax(lambda) |
| Purpose | Disease biology | Clinical prediction |
| Stability | Batch-specific | Stable across all |
| Use Cases | Pathway discovery, heritability | AUC, washout, age offset |
Part 2: Timeline of Theta Updates¶
How Thetas Update Over Time¶
Thetas (signature proportions) update as more data becomes available. This is demonstrated using the age offset approach:
- Age Offset 0: Predict at enrollment age using data up to enrollment
- Age Offset 5: Predict at enrollment+5 years using data up to enrollment+5 years
- Age Offset 9: Predict at enrollment+9 years using data up to enrollment+9 years
As more data becomes available (higher age offset), the model learns more about each patient's signature loadings, and thetas update accordingly.
================================================================================ THETA UPDATE TIMELINE DEMONSTRATION ================================================================================ This shows how individual signature loadings (thetas) update as more data becomes available over time. Age Offset 0: Data up to enrollment age Age Offset 5: Data up to enrollment+5 years Age Offset 9: Data up to enrollment+9 years As offset increases, thetas become more refined based on observed events.
Visualization function created.
To use this function, you would load thetas from different age offsets:
thetas_by_offset = {
0: load_thetas(offset=0), # Enrollment only
5: load_thetas(offset=5), # Enrollment + 5 years
9: load_thetas(offset=9) # Enrollment + 9 years
}
Loading model from: /Users/sarahurbut/Library/CloudStorage/Dropbox/enrollment_retrospective_full/enrollment_model_W0.0001_batch_40000_50000.pt
Loaded lambda: shape = (10000, 21, 52) (N=10000, K=21, T=52)
Computed theta: shape = (10000, 21, 52)
Loading Y tensor and disease names...
Loaded Y tensor, subsetting to batch 40000-50000
Loaded 349 disease names
First few: ['nan', '1.0', '2.0', '3.0', '4.0']
Loaded cluster assignments: shape (348,)
Cluster assignments map diseases to signatures (clusters)
Signature 5 (cluster 5) diseases: 7 diseases
Examples: ['52.0', '111.0', '112.0', '113.0', '114.0']
Signature 6 (cluster 6) diseases: 8 diseases
Examples: ['13.0', '24.0', '25.0', '26.0', '27.0']
✓ Found patient 4986:
Sig 5 disease (52.0) at time 34 with sig5 loading=0.630
Sig 6 disease (25.0) at time 50 with sig6 loading=0.244
Visualizing patient 4986
Checking for disease events for patient 4986...
Checking 7 sig 5 diseases...
Found sig 5 disease at age 64: 52.0 (sig5=0.630)
Checking 8 sig 6 diseases...
Found sig 6 disease at age 80: 25.0 (sig6=0.244)
Y-axis range: 0 to 0.630 (set to 0.724 with padding)
/var/folders/fl/ng5crz0x0fnb6c6x8dk7tfth0000gn/T/ipykernel_29579/1204367522.py:194: MatplotlibDeprecationWarning: The get_cmap function was deprecated in Matplotlib 3.7 and will be removed in 3.11. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap()`` or ``pyplot.get_cmap()`` instead.
colors = list(cm.get_cmap('tab20')(np.linspace(0, 1, K)))
================================================================================ THETA TRAJECTORY SUMMARY ================================================================================ Model: enrollment_model_W0.0001_batch_40000_50000.pt Patient 4986 has 21 signatures tracked over 52 timepoints (ages 30-81) Theta statistics (mean across timepoints): Signature 0: mean=0.031, AUC=1.60 Signature 1: mean=0.059, AUC=3.03 Signature 2: mean=0.054, AUC=2.80 Signature 3: mean=0.128, AUC=6.47 Signature 4: mean=0.013, AUC=0.68 Signature 5: mean=0.149, AUC=7.77 Signature 6: mean=0.027, AUC=1.11 Signature 7: mean=0.112, AUC=5.76 Signature 8: mean=0.076, AUC=3.86 Signature 9: mean=0.024, AUC=1.22 Signature 10: mean=0.018, AUC=0.92 Signature 11: mean=0.006, AUC=0.33 Signature 12: mean=0.009, AUC=0.44 Signature 13: mean=0.029, AUC=1.47 Signature 14: mean=0.018, AUC=0.95 Signature 15: mean=0.013, AUC=0.68 Signature 16: mean=0.083, AUC=4.17 Signature 17: mean=0.083, AUC=4.30 Signature 18: mean=0.013, AUC=0.69 Signature 19: mean=0.050, AUC=2.49 Signature 20: mean=0.005, AUC=0.28 ✓ Thetas sum to 1 at each timepoint: True ✓ All thetas are non-negative: True Note: Theta AUC values are used as phenotypes in GWAS analyses! To visualize a different patient, change patient_idx (currently 4986, max=9999)
================================================================================ THETA AUC SUMMARY ================================================================================ Theta AUC values (used as phenotypes in GWAS): Signature 0: 1.602 Signature 1: 3.033 Signature 2: 2.796 Signature 3: 6.471 Signature 4: 0.675 Signature 5: 7.766 Signature 6: 1.106 Signature 7: 5.757 Signature 8: 3.859 Signature 9: 1.216 Signature 10: 0.922 Signature 11: 0.333 Signature 12: 0.444 Signature 13: 1.468 Signature 14: 0.947 Signature 15: 0.682 Signature 16: 4.167 Signature 17: 4.302 Signature 18: 0.686 Signature 19: 2.492 Signature 20: 0.275 Total AUC (should equal number of timepoints): 51.0 (expected: 52) ✓ Theta AUCs can be used as phenotypes for GWAS on each signature!
Part 3: GWAS AUC Calculation Without Genotypes (G)¶
Overview¶
One powerful feature of Aladynoulli is that we can perform GWAS on signature loadings without needing genotypes (G). Instead, we use:
- Lambda (raw signature loadings) or Theta (softmax of lambda) from the model
- Theta AUC: Area under the curve of theta trajectories over time
- GWAS: Regress theta AUCs against genotypes
Workflow¶
Step 1: Fit model (with or without G)
↓
Step 2: Extract lambda (signature loadings) for each patient
↓
Step 3: Calculate theta = softmax(lambda) [N, K, T]
↓
Step 4: Calculate theta AUC = ∫ theta(t) dt for each signature [N, K]
↓
Step 5: Run GWAS: theta_AUC ~ genotype for each signature
Key Insight¶
The model learns signature loadings (lambda/theta) from disease data (Y, E), which capture genetic and environmental factors. These loadings can then be used as phenotypes for GWAS, even if G was not used in the model fitting.
def softmax(x):
"""
Compute softmax values for each set of scores in x.
x shape: (n_individuals, n_signatures, n_timepoints) or (n_individuals, n_signatures)
"""
if x.ndim == 3:
# Reshape to (n_individuals * n_timepoints, n_signatures)
x_reshaped = x.transpose(0, 2, 1).reshape(-1, x.shape[1])
# Compute softmax
e_x = np.exp(x_reshaped - np.max(x_reshaped, axis=1, keepdims=True))
softmax_x = e_x / np.sum(e_x, axis=1, keepdims=True)
# Reshape back to original shape
return softmax_x.reshape(x.shape[0], x.shape[2], x.shape[1]).transpose(0, 2, 1)
elif x.ndim == 2:
# Simple 2D case
e_x = np.exp(x - np.max(x, axis=1, keepdims=True))
return e_x / np.sum(e_x, axis=1, keepdims=True)
else:
raise ValueError(f"Expected 2D or 3D array, got {x.ndim}D")
def calculate_theta_aucs(all_lambdas, timepoints=None):
"""
Calculate theta AUCs from lambda values.
Parameters:
- all_lambdas: Array of shape (N, K, T) where N=patients, K=signatures, T=timepoints
- timepoints: Array of timepoints (default: np.arange(T))
Returns:
- theta_aucs: Array of shape (N, K) with AUC values for each patient-signature pair
"""
N, K, T = all_lambdas.shape
if timepoints is None:
timepoints = np.arange(T)
# Calculate thetas using softmax
all_thetas = softmax(all_lambdas) # Shape: (N, K, T)
# Calculate AUC for each patient-signature pair
theta_aucs = np.zeros((N, K))
for i in range(N):
for s in range(K):
theta_aucs[i, s] = np.trapz(all_thetas[i, s, :], timepoints)
return theta_aucs, all_thetas
print("="*80)
print("GWAS AUC CALCULATION WITHOUT GENOTYPES")
print("="*80)
print("\nFunctions created:")
print(" 1. softmax(): Converts lambda to theta")
print(" 2. calculate_theta_aucs(): Calculates AUC for each patient-signature pair")
print("\nWorkflow:")
print(" 1. Load lambda from model (shape: [N, K, T])")
print(" 2. Calculate theta = softmax(lambda)")
print(" 3. Calculate theta_AUC = ∫ theta(t) dt for each signature")
print(" 4. Run GWAS: theta_AUC[:, signature] ~ genotype for each SNP")
print("\nThis allows GWAS on signature loadings without needing G in the model!")
================================================================================ GWAS AUC CALCULATION WITHOUT GENOTYPES ================================================================================ Functions created: 1. softmax(): Converts lambda to theta 2. calculate_theta_aucs(): Calculates AUC for each patient-signature pair Workflow: 1. Load lambda from model (shape: [N, K, T]) 2. Calculate theta = softmax(lambda) 3. Calculate theta_AUC = ∫ theta(t) dt for each signature 4. Run GWAS: theta_AUC[:, signature] ~ genotype for each SNP This allows GWAS on signature loadings without needing G in the model!
Schematic created showing the complete workflow.
Summary¶
Key Takeaways¶
Discovery Mode (Joint Phi):
- Phi and lambda learned jointly from data
- Best for understanding disease biology
- Thetas reflect individual signature loadings learned with batch-specific phi
Prediction Mode (Fixed Phi):
- Phi fixed from master checkpoint
- Lambda learned given fixed phi
- Best for clinical predictions
- Thetas reflect individual signature loadings given stable phi
Theta Updates:
- Thetas update as more data becomes available (age offset approach)
- Higher age offset = more data = more refined thetas
- Theta AUC captures cumulative signature loading over time
GWAS Without G:
- Model learns signature loadings from disease data (Y, E)
- These loadings capture genetic and environmental factors
- Can be used as phenotypes for GWAS even if G wasn't in the model
- Workflow: Lambda → Theta → Theta AUC → GWAS
Applications¶
- Discovery Mode: Pathway analysis, heritability, population stratification
- Prediction Mode: Clinical risk prediction, AUC evaluation, washout validation
- Theta AUC: GWAS on signature loadings, genetic architecture of disease signatures