================================================================================
THETA UPDATE TIMELINE DEMONSTRATION
================================================================================

This shows how individual signature loadings (thetas) update
as more data becomes available over time.

Age Offset 0: Data up to enrollment age
Age Offset 5: Data up to enrollment+5 years
Age Offset 9: Data up to enrollment+9 years

As offset increases, thetas become more refined based on observed events.

Visualization function created.

To use this function, you would load thetas from different age offsets:
  thetas_by_offset = {
      0: load_thetas(offset=0),  # Enrollment only
      5: load_thetas(offset=5),  # Enrollment + 5 years
      9: load_thetas(offset=9)   # Enrollment + 9 years
  }

Loading model from: /Users/sarahurbut/Library/CloudStorage/Dropbox/enrollment_retrospective_full/enrollment_model_W0.0001_batch_40000_50000.pt
Loaded lambda: shape = (10000, 21, 52) (N=10000, K=21, T=52)
Computed theta: shape = (10000, 21, 52)

Loading Y tensor and disease names...
  Loaded Y tensor, subsetting to batch 40000-50000
  Loaded 349 disease names
  First few: ['nan', '1.0', '2.0', '3.0', '4.0']

Loaded cluster assignments: shape (348,)
  Cluster assignments map diseases to signatures (clusters)

Signature 5 (cluster 5) diseases: 7 diseases
  Examples: ['52.0', '111.0', '112.0', '113.0', '114.0']
Signature 6 (cluster 6) diseases: 8 diseases
  Examples: ['13.0', '24.0', '25.0', '26.0', '27.0']

✓ Found patient 4986:
  Sig 5 disease (52.0) at time 34 with sig5 loading=0.630
  Sig 6 disease (25.0) at time 50 with sig6 loading=0.244

Visualizing patient 4986

Checking for disease events for patient 4986...
  Checking 7 sig 5 diseases...
    Found sig 5 disease at age 64: 52.0 (sig5=0.630)
  Checking 8 sig 6 diseases...
    Found sig 6 disease at age 80: 25.0 (sig6=0.244)
  Y-axis range: 0 to 0.630 (set to 0.724 with padding)

/var/folders/fl/ng5crz0x0fnb6c6x8dk7tfth0000gn/T/ipykernel_29579/1204367522.py:194: MatplotlibDeprecationWarning: The get_cmap function was deprecated in Matplotlib 3.7 and will be removed in 3.11. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap()`` or ``pyplot.get_cmap()`` instead.
  colors = list(cm.get_cmap('tab20')(np.linspace(0, 1, K)))

================================================================================
THETA TRAJECTORY SUMMARY
================================================================================

Model: enrollment_model_W0.0001_batch_40000_50000.pt
Patient 4986 has 21 signatures tracked over 52 timepoints (ages 30-81)

Theta statistics (mean across timepoints):
  Signature 0: mean=0.031, AUC=1.60
  Signature 1: mean=0.059, AUC=3.03
  Signature 2: mean=0.054, AUC=2.80
  Signature 3: mean=0.128, AUC=6.47
  Signature 4: mean=0.013, AUC=0.68
  Signature 5: mean=0.149, AUC=7.77
  Signature 6: mean=0.027, AUC=1.11
  Signature 7: mean=0.112, AUC=5.76
  Signature 8: mean=0.076, AUC=3.86
  Signature 9: mean=0.024, AUC=1.22
  Signature 10: mean=0.018, AUC=0.92
  Signature 11: mean=0.006, AUC=0.33
  Signature 12: mean=0.009, AUC=0.44
  Signature 13: mean=0.029, AUC=1.47
  Signature 14: mean=0.018, AUC=0.95
  Signature 15: mean=0.013, AUC=0.68
  Signature 16: mean=0.083, AUC=4.17
  Signature 17: mean=0.083, AUC=4.30
  Signature 18: mean=0.013, AUC=0.69
  Signature 19: mean=0.050, AUC=2.49
  Signature 20: mean=0.005, AUC=0.28

✓ Thetas sum to 1 at each timepoint: True
✓ All thetas are non-negative: True

Note: Theta AUC values are used as phenotypes in GWAS analyses!

To visualize a different patient, change patient_idx (currently 4986, max=9999)

================================================================================
THETA AUC SUMMARY
================================================================================

Theta AUC values (used as phenotypes in GWAS):
  Signature 0: 1.602
  Signature 1: 3.033
  Signature 2: 2.796
  Signature 3: 6.471
  Signature 4: 0.675
  Signature 5: 7.766
  Signature 6: 1.106
  Signature 7: 5.757
  Signature 8: 3.859
  Signature 9: 1.216
  Signature 10: 0.922
  Signature 11: 0.333
  Signature 12: 0.444
  Signature 13: 1.468
  Signature 14: 0.947
  Signature 15: 0.682
  Signature 16: 4.167
  Signature 17: 4.302
  Signature 18: 0.686
  Signature 19: 2.492
  Signature 20: 0.275

Total AUC (should equal number of timepoints): 51.0 (expected: 52)

✓ Theta AUCs can be used as phenotypes for GWAS on each signature!

def softmax(x):
    """
    Compute softmax values for each set of scores in x.
    x shape: (n_individuals, n_signatures, n_timepoints) or (n_individuals, n_signatures)
    """
    if x.ndim == 3:
        # Reshape to (n_individuals * n_timepoints, n_signatures)
        x_reshaped = x.transpose(0, 2, 1).reshape(-1, x.shape[1])
        
        # Compute softmax
        e_x = np.exp(x_reshaped - np.max(x_reshaped, axis=1, keepdims=True))
        softmax_x = e_x / np.sum(e_x, axis=1, keepdims=True)
        
        # Reshape back to original shape
        return softmax_x.reshape(x.shape[0], x.shape[2], x.shape[1]).transpose(0, 2, 1)
    elif x.ndim == 2:
        # Simple 2D case
        e_x = np.exp(x - np.max(x, axis=1, keepdims=True))
        return e_x / np.sum(e_x, axis=1, keepdims=True)
    else:
        raise ValueError(f"Expected 2D or 3D array, got {x.ndim}D")


def calculate_theta_aucs(all_lambdas, timepoints=None):
    """
    Calculate theta AUCs from lambda values.
    
    Parameters:
    - all_lambdas: Array of shape (N, K, T) where N=patients, K=signatures, T=timepoints
    - timepoints: Array of timepoints (default: np.arange(T))
    
    Returns:
    - theta_aucs: Array of shape (N, K) with AUC values for each patient-signature pair
    """
    N, K, T = all_lambdas.shape
    
    if timepoints is None:
        timepoints = np.arange(T)
    
    # Calculate thetas using softmax
    all_thetas = softmax(all_lambdas)  # Shape: (N, K, T)
    
    # Calculate AUC for each patient-signature pair
    theta_aucs = np.zeros((N, K))
    for i in range(N):
        for s in range(K):
            theta_aucs[i, s] = np.trapz(all_thetas[i, s, :], timepoints)
    
    return theta_aucs, all_thetas


print("="*80)
print("GWAS AUC CALCULATION WITHOUT GENOTYPES")
print("="*80)
print("\nFunctions created:")
print("  1. softmax(): Converts lambda to theta")
print("  2. calculate_theta_aucs(): Calculates AUC for each patient-signature pair")
print("\nWorkflow:")
print("  1. Load lambda from model (shape: [N, K, T])")
print("  2. Calculate theta = softmax(lambda)")
print("  3. Calculate theta_AUC = ∫ theta(t) dt for each signature")
print("  4. Run GWAS: theta_AUC[:, signature] ~ genotype for each SNP")
print("\nThis allows GWAS on signature loadings without needing G in the model!")

================================================================================
GWAS AUC CALCULATION WITHOUT GENOTYPES
================================================================================

Functions created:
  1. softmax(): Converts lambda to theta
  2. calculate_theta_aucs(): Calculates AUC for each patient-signature pair

Workflow:
  1. Load lambda from model (shape: [N, K, T])
  2. Calculate theta = softmax(lambda)
  3. Calculate theta_AUC = ∫ theta(t) dt for each signature
  4. Run GWAS: theta_AUC[:, signature] ~ genotype for each SNP

This allows GWAS on signature loadings without needing G in the model!

Schematic created showing the complete workflow.

Aspect	Discovery Mode	Prediction Mode
Phi	Learned (joint)	Fixed (from master)
Lambda	Learned (joint)	Learned (phi fixed)
Theta	softmax(lambda)	softmax(lambda)
Purpose	Disease biology	Clinical prediction
Stability	Batch-specific	Stable across all
Use Cases	Pathway discovery, heritability	AUC, washout, age offset

Discovery vs Prediction Framework Overview¶

Part 1: Discovery vs Prediction Framework¶

🔬 Discovery Mode: Joint Phi Estimation¶

🎯 Prediction Mode: Fixed Phi¶

Comparison Table¶

Part 2: Timeline of Theta Updates¶

How Thetas Update Over Time¶

Part 3: GWAS AUC Calculation Without Genotypes (G)¶

Overview¶

Workflow¶

Key Insight¶

Summary¶

Key Takeaways¶

Applications¶