Discovery vs Prediction Framework Overview¶

This notebook provides a comprehensive overview of the two modes of Aladynoulli:

  1. Discovery Mode: Joint phi estimation for understanding disease biology
  2. Prediction Mode: Fixed phi for clinical predictions

We also demonstrate:

  • How thetas update over time with different washout periods
  • How to calculate GWAS AUC without genotypes (G) using signature loadings

Part 1: Discovery vs Prediction Framework¶

🔬 Discovery Mode: Joint Phi Estimation¶

Purpose: Learn disease signatures (phi) from data to understand disease connections

How it works:

  • Phi is learned from the data (joint estimation with lambda)
  • Model learns which diseases cluster together into signatures
  • Used for understanding disease biology and relationships
  • Lambda (individual signature loadings) is also learned
  • Theta = softmax(lambda) = individual signature proportions over time

Location: /Dropbox/enrollment_retrospective_full/

  • Generated by: run_full_retrospective.sh → run_aladyn_batch.py
  • Outputs: enrollment_model_W0.0001_batch_*_*.pt (contains learned phi and lambda)

Key Characteristics:

  • Phi can vary between batches (batch-specific disease signatures)
  • Theta reflects individual signature loadings learned jointly with phi
  • Best for: Understanding disease biology, pathway discovery, heritability analysis

🎯 Prediction Mode: Fixed Phi¶

Purpose: Make predictions using pre-learned signatures (stable, generalizable)

How it works:

  • Phi is fixed from master checkpoints (pre-learned signatures)
  • Only lambda (individual signature loadings) is estimated
  • Theta = softmax(lambda) = individual signature proportions over time
  • Used for clinical prediction and performance evaluation

Location: /Dropbox/models_fromAWS_enrollment_predictions_fixedphi_RETROSPECTIVE_pooled/

  • Generated by: run_aladyn_predict_with_master.py (on AWS)
  • Outputs: pi_enroll_fixedphi_sex_*.pt (pi tensors for predictions)

Key Characteristics:

  • Phi is stable across all patients (from master checkpoint)
  • Theta reflects individual signature loadings given fixed phi
  • Best for: Clinical predictions, performance evaluation, washout analyses

Comparison Table¶

Aspect Discovery Mode Prediction Mode
Phi Learned (joint) Fixed (from master)
Lambda Learned (joint) Learned (phi fixed)
Theta softmax(lambda) softmax(lambda)
Purpose Disease biology Clinical prediction
Stability Batch-specific Stable across all
Use Cases Pathway discovery, heritability AUC, washout, age offset

Part 2: Timeline of Theta Updates¶

How Thetas Update Over Time¶

Thetas (signature proportions) update as more data becomes available. This is demonstrated using the age offset approach:

  • Age Offset 0: Predict at enrollment age using data up to enrollment
  • Age Offset 5: Predict at enrollment+5 years using data up to enrollment+5 years
  • Age Offset 9: Predict at enrollment+9 years using data up to enrollment+9 years

As more data becomes available (higher age offset), the model learns more about each patient's signature loadings, and thetas update accordingly.

================================================================================
THETA UPDATE TIMELINE DEMONSTRATION
================================================================================

This shows how individual signature loadings (thetas) update
as more data becomes available over time.

Age Offset 0: Data up to enrollment age
Age Offset 5: Data up to enrollment+5 years
Age Offset 9: Data up to enrollment+9 years

As offset increases, thetas become more refined based on observed events.
Visualization function created.

To use this function, you would load thetas from different age offsets:
  thetas_by_offset = {
      0: load_thetas(offset=0),  # Enrollment only
      5: load_thetas(offset=5),  # Enrollment + 5 years
      9: load_thetas(offset=9)   # Enrollment + 9 years
  }
Loading model from: /Users/sarahurbut/Library/CloudStorage/Dropbox/enrollment_retrospective_full/enrollment_model_W0.0001_batch_40000_50000.pt
Loaded lambda: shape = (10000, 21, 52) (N=10000, K=21, T=52)
Computed theta: shape = (10000, 21, 52)

Loading Y tensor and disease names...
  Loaded Y tensor, subsetting to batch 40000-50000
  Loaded 349 disease names
  First few: ['nan', '1.0', '2.0', '3.0', '4.0']

Loaded cluster assignments: shape (348,)
  Cluster assignments map diseases to signatures (clusters)

Signature 5 (cluster 5) diseases: 7 diseases
  Examples: ['52.0', '111.0', '112.0', '113.0', '114.0']
Signature 6 (cluster 6) diseases: 8 diseases
  Examples: ['13.0', '24.0', '25.0', '26.0', '27.0']

✓ Found patient 4986:
  Sig 5 disease (52.0) at time 34 with sig5 loading=0.630
  Sig 6 disease (25.0) at time 50 with sig6 loading=0.244

Visualizing patient 4986

Checking for disease events for patient 4986...
  Checking 7 sig 5 diseases...
    Found sig 5 disease at age 64: 52.0 (sig5=0.630)
  Checking 8 sig 6 diseases...
    Found sig 6 disease at age 80: 25.0 (sig6=0.244)
  Y-axis range: 0 to 0.630 (set to 0.724 with padding)
/var/folders/fl/ng5crz0x0fnb6c6x8dk7tfth0000gn/T/ipykernel_29579/1204367522.py:194: MatplotlibDeprecationWarning: The get_cmap function was deprecated in Matplotlib 3.7 and will be removed in 3.11. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap()`` or ``pyplot.get_cmap()`` instead.
  colors = list(cm.get_cmap('tab20')(np.linspace(0, 1, K)))
No description has been provided for this image
================================================================================
THETA TRAJECTORY SUMMARY
================================================================================

Model: enrollment_model_W0.0001_batch_40000_50000.pt
Patient 4986 has 21 signatures tracked over 52 timepoints (ages 30-81)

Theta statistics (mean across timepoints):
  Signature 0: mean=0.031, AUC=1.60
  Signature 1: mean=0.059, AUC=3.03
  Signature 2: mean=0.054, AUC=2.80
  Signature 3: mean=0.128, AUC=6.47
  Signature 4: mean=0.013, AUC=0.68
  Signature 5: mean=0.149, AUC=7.77
  Signature 6: mean=0.027, AUC=1.11
  Signature 7: mean=0.112, AUC=5.76
  Signature 8: mean=0.076, AUC=3.86
  Signature 9: mean=0.024, AUC=1.22
  Signature 10: mean=0.018, AUC=0.92
  Signature 11: mean=0.006, AUC=0.33
  Signature 12: mean=0.009, AUC=0.44
  Signature 13: mean=0.029, AUC=1.47
  Signature 14: mean=0.018, AUC=0.95
  Signature 15: mean=0.013, AUC=0.68
  Signature 16: mean=0.083, AUC=4.17
  Signature 17: mean=0.083, AUC=4.30
  Signature 18: mean=0.013, AUC=0.69
  Signature 19: mean=0.050, AUC=2.49
  Signature 20: mean=0.005, AUC=0.28

✓ Thetas sum to 1 at each timepoint: True
✓ All thetas are non-negative: True

Note: Theta AUC values are used as phenotypes in GWAS analyses!

To visualize a different patient, change patient_idx (currently 4986, max=9999)
No description has been provided for this image
================================================================================
THETA AUC SUMMARY
================================================================================

Theta AUC values (used as phenotypes in GWAS):
  Signature 0: 1.602
  Signature 1: 3.033
  Signature 2: 2.796
  Signature 3: 6.471
  Signature 4: 0.675
  Signature 5: 7.766
  Signature 6: 1.106
  Signature 7: 5.757
  Signature 8: 3.859
  Signature 9: 1.216
  Signature 10: 0.922
  Signature 11: 0.333
  Signature 12: 0.444
  Signature 13: 1.468
  Signature 14: 0.947
  Signature 15: 0.682
  Signature 16: 4.167
  Signature 17: 4.302
  Signature 18: 0.686
  Signature 19: 2.492
  Signature 20: 0.275

Total AUC (should equal number of timepoints): 51.0 (expected: 52)

✓ Theta AUCs can be used as phenotypes for GWAS on each signature!

Part 3: GWAS AUC Calculation Without Genotypes (G)¶

Overview¶

One powerful feature of Aladynoulli is that we can perform GWAS on signature loadings without needing genotypes (G). Instead, we use:

  1. Lambda (raw signature loadings) or Theta (softmax of lambda) from the model
  2. Theta AUC: Area under the curve of theta trajectories over time
  3. GWAS: Regress theta AUCs against genotypes

Workflow¶

Step 1: Fit model (with or without G)
  ↓
Step 2: Extract lambda (signature loadings) for each patient
  ↓
Step 3: Calculate theta = softmax(lambda) [N, K, T]
  ↓
Step 4: Calculate theta AUC = ∫ theta(t) dt for each signature [N, K]
  ↓
Step 5: Run GWAS: theta_AUC ~ genotype for each signature

Key Insight¶

The model learns signature loadings (lambda/theta) from disease data (Y, E), which capture genetic and environmental factors. These loadings can then be used as phenotypes for GWAS, even if G was not used in the model fitting.

In [9]:
def softmax(x):
    """
    Compute softmax values for each set of scores in x.
    x shape: (n_individuals, n_signatures, n_timepoints) or (n_individuals, n_signatures)
    """
    if x.ndim == 3:
        # Reshape to (n_individuals * n_timepoints, n_signatures)
        x_reshaped = x.transpose(0, 2, 1).reshape(-1, x.shape[1])
        
        # Compute softmax
        e_x = np.exp(x_reshaped - np.max(x_reshaped, axis=1, keepdims=True))
        softmax_x = e_x / np.sum(e_x, axis=1, keepdims=True)
        
        # Reshape back to original shape
        return softmax_x.reshape(x.shape[0], x.shape[2], x.shape[1]).transpose(0, 2, 1)
    elif x.ndim == 2:
        # Simple 2D case
        e_x = np.exp(x - np.max(x, axis=1, keepdims=True))
        return e_x / np.sum(e_x, axis=1, keepdims=True)
    else:
        raise ValueError(f"Expected 2D or 3D array, got {x.ndim}D")


def calculate_theta_aucs(all_lambdas, timepoints=None):
    """
    Calculate theta AUCs from lambda values.
    
    Parameters:
    - all_lambdas: Array of shape (N, K, T) where N=patients, K=signatures, T=timepoints
    - timepoints: Array of timepoints (default: np.arange(T))
    
    Returns:
    - theta_aucs: Array of shape (N, K) with AUC values for each patient-signature pair
    """
    N, K, T = all_lambdas.shape
    
    if timepoints is None:
        timepoints = np.arange(T)
    
    # Calculate thetas using softmax
    all_thetas = softmax(all_lambdas)  # Shape: (N, K, T)
    
    # Calculate AUC for each patient-signature pair
    theta_aucs = np.zeros((N, K))
    for i in range(N):
        for s in range(K):
            theta_aucs[i, s] = np.trapz(all_thetas[i, s, :], timepoints)
    
    return theta_aucs, all_thetas


print("="*80)
print("GWAS AUC CALCULATION WITHOUT GENOTYPES")
print("="*80)
print("\nFunctions created:")
print("  1. softmax(): Converts lambda to theta")
print("  2. calculate_theta_aucs(): Calculates AUC for each patient-signature pair")
print("\nWorkflow:")
print("  1. Load lambda from model (shape: [N, K, T])")
print("  2. Calculate theta = softmax(lambda)")
print("  3. Calculate theta_AUC = ∫ theta(t) dt for each signature")
print("  4. Run GWAS: theta_AUC[:, signature] ~ genotype for each SNP")
print("\nThis allows GWAS on signature loadings without needing G in the model!")
================================================================================
GWAS AUC CALCULATION WITHOUT GENOTYPES
================================================================================

Functions created:
  1. softmax(): Converts lambda to theta
  2. calculate_theta_aucs(): Calculates AUC for each patient-signature pair

Workflow:
  1. Load lambda from model (shape: [N, K, T])
  2. Calculate theta = softmax(lambda)
  3. Calculate theta_AUC = ∫ theta(t) dt for each signature
  4. Run GWAS: theta_AUC[:, signature] ~ genotype for each SNP

This allows GWAS on signature loadings without needing G in the model!
No description has been provided for this image
Schematic created showing the complete workflow.

Summary¶

Key Takeaways¶

  1. Discovery Mode (Joint Phi):

    • Phi and lambda learned jointly from data
    • Best for understanding disease biology
    • Thetas reflect individual signature loadings learned with batch-specific phi
  2. Prediction Mode (Fixed Phi):

    • Phi fixed from master checkpoint
    • Lambda learned given fixed phi
    • Best for clinical predictions
    • Thetas reflect individual signature loadings given stable phi
  3. Theta Updates:

    • Thetas update as more data becomes available (age offset approach)
    • Higher age offset = more data = more refined thetas
    • Theta AUC captures cumulative signature loading over time
  4. GWAS Without G:

    • Model learns signature loadings from disease data (Y, E)
    • These loadings capture genetic and environmental factors
    • Can be used as phenotypes for GWAS even if G wasn't in the model
    • Workflow: Lambda → Theta → Theta AUC → GWAS

Applications¶

  • Discovery Mode: Pathway analysis, heritability, population stratification
  • Prediction Mode: Clinical risk prediction, AUC evaluation, washout validation
  • Theta AUC: GWAS on signature loadings, genetic architecture of disease signatures