R1: Genetic Validation - GWAS of Signature Exposure
Reviewer Question
Referee #1: "The authors say in several places that the models describe clinically meaningful biological processes without giving any proof of the clinical and certainly not biological meaningfulness."
Why This Matters
Demonstrating genetic associations with signature exposure provides evidence that signatures have a genetic basis and capture biologically meaningful pathways.
Our Approach
We performed genome-wide association studies (GWAS) using signature exposure as quantitative phenotypes:
- Calculate Average Signature Exposure (AEX): For each individual, we compute the average signature loading over time
- GWAS on AEX: Test genome-wide SNPs for association with signature exposure
- Identify Signature-Specific Loci: Find genetic variants associated with signatures but not with individual diseases
- Map to Nearest Genes: Annotate significant hits with nearest genes
Key Innovation: We test genetic loading of 0 (baseline genetic effects) to identify loci that influence signature trajectories independently.
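The first two steps above can be illustrated with a minimal sketch: average each individual's signature loading over time to get AEX, then regress AEX on the dosage of a single SNP. Everything here is hypothetical (the toy loadings, the dosages, and the function names), the p-value uses a normal approximation, and the sketch omits the covariate adjustment that a real GWAS pipeline would typically include. It is not the analysis code used for the results in this notebook.

```python
import math
from statistics import mean


def average_signature_exposure(loadings_over_time):
    """Average one individual's signature loading across time points."""
    return mean(loadings_over_time)


def snp_association(dosages, phenotype):
    """Simple linear regression of a quantitative phenotype on SNP dosage.

    Returns (beta, se, log10p); the p-value is two-sided and uses a
    normal approximation, which is only reasonable for large samples.
    """
    n = len(dosages)
    x_bar, y_bar = mean(dosages), mean(phenotype)
    sxx = sum((x - x_bar) ** 2 for x in dosages)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(dosages, phenotype))
    beta = sxy / sxx
    intercept = y_bar - beta * x_bar
    rss = sum((y - intercept - beta * x) ** 2 for x, y in zip(dosages, phenotype))
    se = math.sqrt(rss / (n - 2) / sxx)
    p = math.erfc(abs(beta / se) / math.sqrt(2))
    log10p = -math.log10(p) if p > 0 else float("inf")
    return beta, se, log10p


# Hypothetical toy data: signature loadings at three time points per individual
loadings = {
    "id1": [0.10, 0.12, 0.14],
    "id2": [0.08, 0.10, 0.12],
    "id3": [0.15, 0.16, 0.17],
    "id4": [0.13, 0.14, 0.15],
    "id5": [0.20, 0.21, 0.19],
    "id6": [0.17, 0.18, 0.19],
}
dosages = [0, 0, 1, 1, 2, 2]  # effect-allele counts, same order as loadings

aex = [average_signature_exposure(series) for series in loadings.values()]
beta, se, log10p = snp_association(dosages, aex)
genome_wide_significant = log10p > -math.log10(5e-8)
print(f"beta={beta:.4f} se={se:.4f} LOG10P={log10p:.2f} significant={genome_wide_significant}")
```

The output fields mirror the BETA, SE, and LOG10P columns of the loci table in this notebook; with only six toy individuals the numbers themselves carry no meaning.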
Key Findings
- ✅ Multiple genome-wide significant loci identified for each signature
- ✅ Signature-specific loci found that are not associated with individual diseases
- ✅ Biologically plausible gene associations (e.g., lipid genes for Signature 5)
================================================================================
GWAS LOCI SUMMARY
================================================================================
Total loci: 151
Signatures with loci: 16
Loci per signature:

| | Signature | N_Loci | Percentage |
|---|---|---|---|
| 12 | SIG5 | 56 | 37.1% |
| 7 | SIG17 | 19 | 12.6% |
| 13 | SIG7 | 14 | 9.3% |
| 9 | SIG19 | 11 | 7.3% |
| 1 | SIG1 | 8 | 5.3% |
| 0 | SIG0 | 6 | 4.0% |
| 2 | SIG10 | 6 | 4.0% |
| 10 | SIG2 | 6 | 4.0% |
| 11 | SIG3 | 6 | 4.0% |
| 3 | SIG13 | 5 | 3.3% |
| 4 | SIG14 | 4 | 2.6% |
| 5 | SIG15 | 4 | 2.6% |
| 14 | SIG8 | 3 | 2.0% |
| 6 | SIG16 | 1 | 0.7% |
| 8 | SIG18 | 1 | 0.7% |
| 15 | SIG9 | 1 | 0.7% |
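As a quick arithmetic check on the summary above, the per-signature counts and percentages can be recomputed from a list of per-locus signature labels. This is a small sketch; the function name is hypothetical, and the label list is rebuilt from the counts in the table rather than loaded from the underlying loci file.

```python
from collections import Counter


def loci_per_signature(locus_signatures):
    """Return (signature, n_loci, percentage) rows, largest count first."""
    counts = Counter(locus_signatures)
    total = sum(counts.values())
    return [(sig, n, round(100 * n / total, 1)) for sig, n in counts.most_common()]


# Counts copied from the table above
table_counts = {
    "SIG5": 56, "SIG17": 19, "SIG7": 14, "SIG19": 11, "SIG1": 8, "SIG0": 6,
    "SIG10": 6, "SIG2": 6, "SIG3": 6, "SIG13": 5, "SIG14": 4, "SIG15": 4,
    "SIG8": 3, "SIG16": 1, "SIG18": 1, "SIG9": 1,
}
# Expand to one label per locus, as the labels would appear in a loci table
labels = [sig for sig, n in table_counts.items() for _ in range(n)]

summary = loci_per_signature(labels)
total_loci = sum(n for _, n, _ in summary)
print(total_loci, len(summary), summary[0])
```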
| | SIG | #CHR | POS | UID | EA | OA | EAF | BETA | SE | LOG10P | ... | cb_start | cb_end | cytoband | giemsa | pops | locus_id_chr | locus_id_start | locus_id_end | locus_id | Unnamed: 54 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SIG0 | 20 | 42954982 | 20:42954982:A:G | G | A | 0.011206 | 0.018610 | 0.003373 | 7.46280 | ... | 39759919 | 47823902 | q12 | gpos | NaN | 20 | 42454982 | 43454982 | 116 | NaN |
| 1 | SIG0 | 21 | 30519457 | 21:30519457:T:A | A | T | 0.157962 | 0.005564 | 0.000977 | 7.90813 | ... | -1 | -1 | . | . | NaN | 21 | 30019457 | 31096186 | 119 | NaN |
| 2 | SIG0 | 4 | 111718067 | 4:111718067:G:A | A | G | 0.804755 | -0.007606 | 0.000893 | 16.79370 | ... | 96010721 | 123004616 | q24 | gneg | NaN | 4 | 111107315 | 112231200 | 34 | NaN |
| 3 | SIG0 | 6 | 160997118 | 6:160997118:A:T | T | A | 0.080091 | 0.008127 | 0.001307 | 9.30298 | ... | -1 | -1 | . | . | NaN | 6 | 159742487 | 162167545 | 47 | NaN |
| 4 | SIG0 | 8 | 102490380 | 8:102490380:C:T | T | C | 0.847157 | 0.005424 | 0.000986 | 7.41973 | ... | 84457765 | 109374423 | q31 | gpos | NaN | 8 | 101990380 | 102990380 | 52 | NaN |
| 5 | SIG0 | 9 | 97590631 | 9:97590631:T:A | A | T | 0.678148 | 0.004440 | 0.000759 | 8.30449 | ... | 96441248 | 108417612 | q36 | gneg | NaN | 9 | 97090631 | 98090631 | 57 | NaN |
| 6 | SIG10 | 1 | 196652124 | 1:196652124:T:TA | TA | T | 0.616718 | -0.005430 | 0.000800 | 10.94850 | ... | 194712323 | 203473144 | q36 | gneg | NaN | 1 | 196146176 | 197427791 | 8 | NaN |
| 7 | SIG10 | 10 | 124230024 | 10:124230024:A:C | C | A | 0.212114 | 0.007976 | 0.000954 | 16.21150 | ... | -1 | -1 | . | . | NaN | 10 | 123709684 | 124735355 | 68 | NaN |
| 8 | SIG10 | 11 | 86400443 | 11:86400443:A:G | G | A | 0.613320 | -0.004536 | 0.000800 | 7.84024 | ... | 81728398 | 93518069 | q23 | gvar | NaN | 11 | 85899411 | 86900443 | 73 | NaN |
| 9 | SIG10 | 15 | 27498832 | 15:27498832:G:A | A | G | 0.164789 | -0.006111 | 0.001049 | 8.24171 | ... | 17945541 | 33087091 | p14 | gneg | NaN | 15 | 26998832 | 27998832 | 91 | NaN |
10 rows × 55 columns
2. Top Loci by Signature
For each signature, we identify the top genetic loci (by p-value) and their nearest genes.
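A minimal sketch of this ranking step is shown below, under the assumption that each locus is available as a (signature, SNP, nearest gene, LOG10P) record. The function names are hypothetical. The p-value formatter works on the exponent directly, because evaluating `10 ** (-LOG10P)` as a float underflows to zero for very large LOG10P values.

```python
import math
from collections import defaultdict


def format_p_from_log10p(log10p, digits=2):
    """Format p = 10**(-log10p) in scientific notation without underflow."""
    exponent = math.floor(-log10p)
    mantissa = 10 ** (-log10p - exponent)  # in [1, 10)
    text = f"{mantissa:.{digits}f}"
    if float(text) >= 10:  # rounding pushed the mantissa up to 10.00
        exponent += 1
        text = f"{mantissa / 10:.{digits}f}"
    return f"{text}e{exponent:+03d}"


def top_loci_per_signature(rows, n=10):
    """Group (signature, snp, gene, log10p) rows and keep the n strongest each."""
    by_signature = defaultdict(list)
    for signature, snp, gene, log10p in rows:
        by_signature[signature].append((snp, gene, log10p))
    return {
        signature: sorted(loci, key=lambda locus: locus[2], reverse=True)[:n]
        for signature, loci in by_signature.items()
    }


# A few rows taken from the top-loci table in this notebook
rows = [
    ("SIG16", "rs429358", "APOE", 13.01),
    ("SIG16", "rs7412", "APOE", 58.74),
    ("SIG0", "rs6843082", "PITX2", 16.79),
    ("SIG0", "rs10455872", "LPA", 129.56),
    ("SIG0", "rs74617384", "LPA", 9.30),
]

top = top_loci_per_signature(rows, n=2)
for signature, loci in top.items():
    for snp, gene, log10p in loci:
        print(signature, snp, gene, f"p={format_p_from_log10p(log10p)}")
```

Because the table reports LOG10P rounded to two decimals, p-values recomputed this way can differ in the last digit from those printed from the unrounded values.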
3. Novel Genetic Discoveries: Signature-Specific Loci
A key finding from our signature-based GWAS is the identification of novel loci that are genome-wide significant for composite signatures but not found in individual disease GWAS. This demonstrates the power of joint modeling to detect pleiotropic effects that are too weak to detect in single-disease analyses.
The 10 Unique Signature 5 Discoveries
For Signature 5 (cardiovascular/lipid), we identified 10 unique loci that are genome-wide significant (p < 5×10⁻⁸) in the signature-based GWAS but have no genome-wide significant hit within a 1 Mb window in any constituent trait GWAS (Angina, MI, Hypercholesterolemia, Coronary atherosclerosis, Acute IHD, Chronic IHD). These represent novel discoveries enabled by joint modeling of multiple related diseases.
Key Examples:
- rs1532085 (LIPC; the automated nearest-gene annotation in the table below lists ALDH1A2): Hepatic lipase gene involved in HDL metabolism. Not significant in any individual CV trait GWAS, but genome-wide significant (p = 3.8×10⁻⁸) in Signature 5, consistent with distributed pleiotropic effects across multiple cardiovascular traits.
- rs6687726 (IL6R): Interleukin-6 receptor, key inflammation pathway. Novel discovery through joint modeling.
- rs1499813 (FNDC3B): Insulin signaling and adipogenesis. Strongest novel association (p = 1.1×10⁻¹⁰).
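The novelty criterion described above (no constituent-trait hit near the signature lead variant) can be sketched as a simple positional filter. This is an illustration only: the function name, the locus labels, and the coordinates are hypothetical, and the default of ±500 kb around the lead variant is an assumption about how the 1 Mb window is centered.

```python
def is_novel(lead, trait_hits, window_bp=500_000):
    """True if no constituent-trait hit lies within window_bp of the lead variant.

    lead is (chromosome, position); trait_hits is an iterable of such tuples
    pooled across all constituent-trait GWAS.
    """
    chrom, pos = lead
    return not any(
        hit_chrom == chrom and abs(hit_pos - pos) <= window_bp
        for hit_chrom, hit_pos in trait_hits
    )


# Hypothetical signature lead variants and pooled constituent-trait hits
signature_leads = {
    "locus_A": ("1", 1_000_000),
    "locus_B": ("1", 5_000_000),
    "locus_C": ("2", 1_200_000),
}
trait_hits = [("1", 1_300_000), ("2", 9_000_000)]

novel = [name for name, lead in signature_leads.items() if is_novel(lead, trait_hits)]
print(novel)  # → ['locus_B', 'locus_C']
```

Here locus_A is excluded because a trait hit sits 300 kb away on the same chromosome, while the other two have no nearby trait hit.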
================================================================================
TOP 10 GENETIC LOCI PER SIGNATURE
================================================================================

SIG0 - Heart Failure/Arrhythmia (7 total loci)
  rs10455872   LPA            p=2.75e-130 (LOG10P=129.56)
  rs6843082    PITX2          p=1.61e-17  (LOG10P=16.79)
  rs74617384   LPA            p=4.98e-10  (LOG10P=9.30)
  rs10125609   C9orf3         p=4.96e-09  (LOG10P=8.30)
  rs12627426   MAP3K7CL       p=1.24e-08  (LOG10P=7.91)
  rs77410568   R3HDML         p=3.45e-08  (LOG10P=7.46)
  rs2509765    KB-1562D12.1   p=3.80e-08  (LOG10P=7.42)

SIG16 - Neurodegeneration (2 total loci)
  rs7412       APOE           p=1.84e-59  (LOG10P=58.74)
  rs429358     APOE           p=9.88e-14  (LOG10P=13.01)

SIG17 - GI/Colorectal (29 total loci)
  rs1333042    CDKN2B-AS1     p=9.02e-100 (LOG10P=99.04)
  rs4977575    CDKN2B-AS1     p=1.18e-20  (LOG10P=19.93)
  rs58658771   GREM1          p=1.32e-18  (LOG10P=17.88)
  rs687621     RP11-430N14.4  p=1.72e-17  (LOG10P=16.76)
  rs9275218    HLA-DQB1       p=5.53e-17  (LOG10P=16.26)
  rs4939567    SMAD7          p=5.40e-16  (LOG10P=15.27)
  rs6121558    RPS21          p=3.24e-14  (LOG10P=13.49)
  rs3184504    SH2B3          p=1.13e-13  (LOG10P=12.95)
  rs10774625   ATXN2          p=1.40e-12  (LOG10P=11.85)
  rs16888589   EIF3H          p=5.19e-12  (LOG10P=11.28)

SIG5 - Cardiovascular/Lipid (78 total loci)
  rs10455872   LPA            p=2.75e-130 (LOG10P=129.56)
  rs1333042    CDKN2B-AS1     p=9.02e-100 (LOG10P=99.04)
  rs7412       APOE           p=1.84e-59  (LOG10P=58.74)
  rs11887534   ABCG8          p=1.27e-55  (LOG10P=54.90)
  rs12740374   CELSR2         p=1.66e-42  (LOG10P=41.78)
  rs138294113  LDLR           p=5.90e-39  (LOG10P=38.23)
  rs9349379    PHACTR1        p=6.50e-30  (LOG10P=29.19)
  rs11591147   PCSK9          p=7.50e-25  (LOG10P=24.12)
  rs4977575    CDKN2B-AS1     p=1.18e-20  (LOG10P=19.93)
  rs2351524    NBEAL1         p=2.30e-20  (LOG10P=19.64)

SIG7 - Hypertension/Vascular (23 total loci)
  rs9275218    HLA-DQB1       p=5.53e-17  (LOG10P=16.26)
  rs9272451    HLA-DQA1       p=4.94e-15  (LOG10P=14.31)
  rs3184504    SH2B3          p=1.13e-13  (LOG10P=12.95)
  rs10774625   ATXN2          p=1.40e-12  (LOG10P=11.85)
  rs1275977    KCNK3          p=3.06e-12  (LOG10P=11.51)
  rs7192155    CFDP1          p=4.60e-11  (LOG10P=10.34)
  rs12509595   FGF5           p=1.57e-10  (LOG10P=9.80)
  rs2071278    NOTCH4         p=3.29e-10  (LOG10P=9.48)
  rs3806155    BTNL2          p=3.33e-10  (LOG10P=9.48)
  rs72831343   C10orf107      p=3.82e-10  (LOG10P=9.42)

| | Signature | SNP | Nearest_Gene | LOG10P | P_value |
|---|---|---|---|---|---|
| 0 | SIG0 | rs10455872 | LPA | 129.56 | 2.75e-130 |
| 1 | SIG0 | rs6843082 | PITX2 | 16.79 | 1.61e-17 |
| 2 | SIG0 | rs74617384 | LPA | 9.30 | 4.98e-10 |
| 3 | SIG0 | rs10125609 | C9orf3 | 8.30 | 4.96e-09 |
| 4 | SIG0 | rs12627426 | MAP3K7CL | 7.91 | 1.24e-08 |
| 5 | SIG0 | rs77410568 | R3HDML | 7.46 | 3.45e-08 |
| 6 | SIG0 | rs2509765 | KB-1562D12.1 | 7.42 | 3.80e-08 |
| 7 | SIG16 | rs7412 | APOE | 58.74 | 1.84e-59 |
| 8 | SIG16 | rs429358 | APOE | 13.01 | 9.88e-14 |
| 9 | SIG17 | rs1333042 | CDKN2B-AS1 | 99.04 | 9.02e-100 |
| 10 | SIG17 | rs4977575 | CDKN2B-AS1 | 19.93 | 1.18e-20 |
| 11 | SIG17 | rs58658771 | GREM1 | 17.88 | 1.32e-18 |
| 12 | SIG17 | rs687621 | RP11-430N14.4 | 16.76 | 1.72e-17 |
| 13 | SIG17 | rs9275218 | HLA-DQB1 | 16.26 | 5.53e-17 |
| 14 | SIG17 | rs4939567 | SMAD7 | 15.27 | 5.40e-16 |
| 15 | SIG17 | rs6121558 | RPS21 | 13.49 | 3.24e-14 |
| 16 | SIG17 | rs3184504 | SH2B3 | 12.95 | 1.13e-13 |
| 17 | SIG17 | rs10774625 | ATXN2 | 11.85 | 1.40e-12 |
| 18 | SIG17 | rs16888589 | EIF3H | 11.28 | 5.19e-12 |
| 19 | SIG5 | rs10455872 | LPA | 129.56 | 2.75e-130 |
| 20 | SIG5 | rs1333042 | CDKN2B-AS1 | 99.04 | 9.02e-100 |
| 21 | SIG5 | rs7412 | APOE | 58.74 | 1.84e-59 |
| 22 | SIG5 | rs11887534 | ABCG8 | 54.90 | 1.27e-55 |
| 23 | SIG5 | rs12740374 | CELSR2 | 41.78 | 1.66e-42 |
| 24 | SIG5 | rs138294113 | LDLR | 38.23 | 5.90e-39 |
| 25 | SIG5 | rs9349379 | PHACTR1 | 29.19 | 6.50e-30 |
| 26 | SIG5 | rs11591147 | PCSK9 | 24.12 | 7.50e-25 |
| 27 | SIG5 | rs4977575 | CDKN2B-AS1 | 19.93 | 1.18e-20 |
| 28 | SIG5 | rs2351524 | NBEAL1 | 19.64 | 2.30e-20 |
| 29 | SIG7 | rs9275218 | HLA-DQB1 | 16.26 | 5.53e-17 |
| 30 | SIG7 | rs9272451 | HLA-DQA1 | 14.31 | 4.94e-15 |
| 31 | SIG7 | rs3184504 | SH2B3 | 12.95 | 1.13e-13 |
| 32 | SIG7 | rs10774625 | ATXN2 | 11.85 | 1.40e-12 |
| 33 | SIG7 | rs1275977 | KCNK3 | 11.51 | 3.06e-12 |
| 34 | SIG7 | rs7192155 | CFDP1 | 10.34 | 4.60e-11 |
| 35 | SIG7 | rs12509595 | FGF5 | 9.80 | 1.57e-10 |
| 36 | SIG7 | rs2071278 | NOTCH4 | 9.48 | 3.29e-10 |
| 37 | SIG7 | rs3806155 | BTNL2 | 9.48 | 3.33e-10 |
| 38 | SIG7 | rs72831343 | C10orf107 | 9.42 | 3.82e-10 |
# Load the presence matrix to identify unique Signature 5 loci
from pathlib import Path

import pandas as pd
from IPython.display import display

# Try multiple possible locations
present_matrix_paths = [
    Path("/Users/sarahurbut/Downloads/present_matrix_1mb_sig5.csv"),
    Path("/Users/sarahurbut/Library/CloudStorage/Dropbox-Personal/present_matrix_1mb_sig5.csv"),
]
present_matrix_file = None
for path in present_matrix_paths:
    if path.exists():
        present_matrix_file = path
        break

if present_matrix_file is not None and "loci_df" in locals():
    # Presence matrix (loaded here but not used further in this cell)
    present_df = pd.read_csv(present_matrix_file)

    # The 10 unique Signature 5 SNPs (not in any constituent trait GWAS within 1MB),
    # keyed by SNP ID with a curated gene label
    unique_sig5_snps = {
        'rs6687726': 'IL6R',
        'rs2509121': 'HYOU1',
        'rs4760278': 'R3HDM2',
        'rs1532085': 'LIPC',
        'rs7168222': 'NR2F2-AS1',
        'rs35039495': 'PLCG2',
        'rs8121509': 'OPRL1',
        'rs1499813': 'FNDC3B',
        '4:96088139': 'UNC5C',
        'rs4732365': 'C7orf55'
    }

    # Biological roles for each curated gene label
    biological_roles = {
        'IL6R': 'Inflammation (Interleukin-6 receptor)',
        'HYOU1': 'Hypoxia/ER stress response',
        'R3HDM2': 'RNA binding protein',
        'LIPC': 'HDL metabolism (Hepatic lipase)',
        'NR2F2-AS1': 'Nuclear receptor antisense RNA',
        'PLCG2': 'Platelet activation, immunity',
        'OPRL1': 'Opioid receptor, stress/pain signaling',
        'FNDC3B': 'Insulin signaling, adipogenesis',
        'UNC5C': 'Axon guidance (Netrin receptor)',
        'C7orf55': 'Unknown function'
    }

    # Get Signature 5 loci
    sig5_loci = loci_df[loci_df['locus_SIG5'] == 1].copy()

    # Find the unique loci in our data
    unique_loci_results = []
    for rsid, expected_gene in unique_sig5_snps.items():
        if rsid.startswith('rs'):
            match = sig5_loci[sig5_loci['rsid'] == rsid]
        else:
            # Position-based IDs like 4:96088139: match the CHR:POS prefix of UID
            # (UID has the form CHR:POS:REF:ALT), so 14:96088139 is not matched
            match = sig5_loci[sig5_loci['UID'].str.startswith(f"{rsid}:", na=False)]

        if len(match) > 0:
            row = match.iloc[0]
            unique_loci_results.append({
                'Rank': len(unique_loci_results) + 1,
                'rsID': row['rsid'],
                # Nearest-gene annotation from the loci table; it can differ from
                # the curated label (rs1532085 is annotated ALDH1A2, curated LIPC)
                'Gene': row['nearestgene'],
                'Chr': row['#CHR'],
                'Position': row['POS'],
                'LOG10P': round(row['LOG10P'], 2),
                'P_value': f"{10**(-row['LOG10P']):.2e}",
                'Beta': round(row['BETA'], 4),
                'EAF': round(row['EAF'], 3),
                # Role is looked up by the curated label, not the annotated gene
                'Biological_Role': biological_roles.get(expected_gene, 'Unknown')
            })
        else:
            # SNP not found in our data - still include it with placeholders
            unique_loci_results.append({
                'Rank': len(unique_loci_results) + 1,
                'rsID': rsid,
                'Gene': expected_gene,
                'Chr': '?',
                'Position': '?',
                'LOG10P': '?',
                'P_value': '?',
                'Beta': '?',
                'EAF': '?',
                'Biological_Role': biological_roles.get(expected_gene, 'Unknown')
            })

    unique_sig5_df = pd.DataFrame(unique_loci_results)
    n_found = sum(r['LOG10P'] != '?' for r in unique_loci_results)

    print("="*80)
    print("THE 10 UNIQUE SIGNATURE 5 DISCOVERIES")
    print("="*80)
    print("Loci genome-wide significant in Signature 5 but NOT in constituent trait GWAS")
    print("(Angina, MI, Hypercholesterolemia, Coronary atherosclerosis, Acute IHD, Chronic IHD)")
    print("-"*80)
    print(f"\nFound {n_found} of {len(unique_sig5_snps)} unique loci in the data\n")
    display(unique_sig5_df[['Rank', 'rsID', 'Gene', 'LOG10P', 'P_value', 'Biological_Role']])

    print("\n" + "="*80)
    print("INTERPRETATION")
    print("="*80)
    print("These 10 loci represent novel genetic discoveries enabled by joint modeling.")
    print("Each locus has distributed pleiotropic effects across multiple cardiovascular")
    print("traits that are too weak to detect individually but collectively reach")
    print("genome-wide significance when analyzed jointly through Signature 5.")
    print("="*80)
else:
    if present_matrix_file is None:
        print("⚠️ Presence matrix file not found in any of these locations:")
        for path in present_matrix_paths:
            print(f" {path}")
    if "loci_df" not in locals():
        print("⚠️ loci_df not found. Please run the previous cell to load the GWAS loci data.")
================================================================================
THE 10 UNIQUE SIGNATURE 5 DISCOVERIES
================================================================================
Loci genome-wide significant in Signature 5 but NOT in constituent trait GWAS
(Angina, MI, Hypercholesterolemia, Coronary atherosclerosis, Acute IHD, Chronic IHD)
--------------------------------------------------------------------------------

Found 10 of 10 unique loci in the data

| | Rank | rsID | Gene | LOG10P | P_value | Biological_Role |
|---|---|---|---|---|---|---|
| 0 | 1 | rs6687726 | IL6R | 7.90 | 1.27e-08 | Inflammation (Interleukin-6 receptor) |
| 1 | 2 | rs2509121 | HYOU1 | 8.99 | 1.03e-09 | Hypoxia/ER stress response |
| 2 | 3 | rs4760278 | R3HDM2 | 7.50 | 3.17e-08 | RNA binding protein |
| 3 | 4 | rs1532085 | ALDH1A2 | 7.42 | 3.76e-08 | HDL metabolism (Hepatic lipase) |
| 4 | 5 | rs7168222 | NR2F2-AS1 | 7.58 | 2.66e-08 | Nuclear receptor antisense RNA |
| 5 | 6 | rs35039495 | PLCG2 | 7.58 | 2.63e-08 | Platelet activation, immunity |
| 6 | 7 | rs8121509 | OPRL1 | 8.43 | 3.74e-09 | Opioid receptor, stress/pain signaling |
| 7 | 8 | rs1499813 | FNDC3B | 9.94 | 1.14e-10 | Insulin signaling, adipogenesis |
| 8 | 9 | 4:96088139_ATATG_A | UNC5C | 7.43 | 3.70e-08 | Axon guidance (Netrin receptor) |
| 9 | 10 | rs4732365 | C7orf55 | 8.94 | 1.14e-09 | Unknown function |

================================================================================
INTERPRETATION
================================================================================
These 10 loci represent novel genetic discoveries enabled by joint modeling.
Each locus has distributed pleiotropic effects across multiple cardiovascular
traits that are too weak to detect individually but collectively reach
genome-wide significance when analyzed jointly through Signature 5.
================================================================================
4. Summary and Response
Key Findings
- Genome-wide significant loci identified: Multiple genetic loci are associated with signature exposure (151 total loci across 16 signatures).
- Signature-specific loci: Genetic variants associated with signatures but not with individual diseases.
- Novel discoveries: 10 unique loci for Signature 5 that are not found in constituent trait GWAS.
- Biologically plausible gene associations: Signature 5 is enriched for lipid metabolism genes (e.g., LDLR, APOE, PCSK9, LPA).
Response to Reviewer
We demonstrate biological meaningfulness through genetic association analysis. We performed GWAS using average signature exposure (AEX) as a quantitative phenotype for each signature, identifying 151 loci across 16 signatures. The top Signature 5 (cardiovascular/lipid) loci map to genes with known roles in lipid metabolism (e.g., LDLR, APOE, PCSK9, LPA), providing strong biological validation.
Critically, we identified 10 novel loci for Signature 5 that are genome-wide significant in the joint analysis but not detected in any individual constituent trait GWAS. This demonstrates that signature-based GWAS can discover genetic associations with distributed pleiotropic effects that are too weak to detect in single-disease analyses, providing direct evidence for the biological meaningfulness of our disease signatures.