MACHINE LEARNING–BASED PRIORITIZATION OF CARDIOVASCULAR DISEASE–ASSOCIATED GENES USING WHOLE GENOME SEQUENCING DATA
DOI:
https://doi.org/10.69980/8j187w55
Keywords:
Machine Learning in Genomics, Cardiovascular Disease Genetics, Whole Genome Sequencing, Gene Prioritization, Bioinformatics Analysis, Predictive Modeling
Abstract
Cardiovascular disease (CVD) remains the leading cause of global mortality, driven by complex interactions among genetic, environmental, and lifestyle factors. Although genome-wide studies have identified multiple risk loci, prioritizing disease-relevant genes from whole genome sequencing (WGS) data remains challenging due to data complexity. This study presents an integrative computational framework combining WGS, variant annotation, machine learning–based gene prioritization, and pathway enrichment analysis to identify candidate genes associated with CVD.
High-throughput WGS data comprising over 510 million reads were processed through quality control, trimming, and alignment to the human reference genome (hg38), achieving a 99.94% mapping rate. Variant calling identified 9,622 genetic variants, with 70.62% being nonsynonymous. Functional impact analysis using SIFT classified 13.75% of these variants as potentially deleterious. Gene-level features based on variant burden and functional impact were used to train a Gradient Boosting model, optimized using Optuna and tracked via MLflow. The model achieved an ROC–AUC of 0.709 and generated a ranked list of candidate genes.
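To make the feature and evaluation steps concrete, the following is a minimal sketch and not the study's actual code. It collapses per-variant annotations into per-gene burden features and computes ROC–AUC from ranked scores. The record layout, the gene names, the function names, and the SIFT cutoff of 0.05 (the conventional SIFT threshold) are all assumptions for illustration; the abstract does not state them.

```python
from collections import defaultdict

# Conventional SIFT cutoff: scores at or below this are predicted deleterious.
# Assumed here; the abstract does not state the threshold that was used.
SIFT_DELETERIOUS_MAX = 0.05


def gene_features(variants):
    """Aggregate per-variant annotations into per-gene burden features."""
    counts = defaultdict(lambda: {"n_variants": 0, "n_nonsyn": 0, "n_deleterious": 0})
    for v in variants:
        g = counts[v["gene"]]
        g["n_variants"] += 1
        if v["consequence"] == "nonsynonymous":
            g["n_nonsyn"] += 1
            sift = v.get("sift")
            if sift is not None and sift <= SIFT_DELETERIOUS_MAX:
                g["n_deleterious"] += 1
    return {gene: dict(f) for gene, f in counts.items()}


def roc_auc(labels, scores):
    """ROC-AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy, made-up records purely to exercise the functions (hypothetical genes).
toy_variants = [
    {"gene": "GENE_A", "consequence": "nonsynonymous", "sift": 0.01},
    {"gene": "GENE_A", "consequence": "nonsynonymous", "sift": 0.40},
    {"gene": "GENE_A", "consequence": "synonymous", "sift": None},
    {"gene": "GENE_B", "consequence": "nonsynonymous", "sift": 0.03},
]
features = gene_features(toy_variants)
toy_auc = roc_auc([1, 0, 1, 0], [0.9, 0.4, 0.35, 0.2])
```

In a full pipeline, feature rows of this kind would be passed to a gradient boosting classifier, and the ROC–AUC would be computed on held-out genes rather than on toy scores.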
Key prioritized genes included AHNAK2, MUC17, HRNR, FLNC, and MYH7, associated with cytoskeletal organization, vascular signaling, inflammation, and metabolism. Pathway analysis highlighted immune signaling, MAPK pathways, lipid metabolism, and cellular stress responses as critical mechanisms in CVD.
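The abstract does not say which enrichment statistic was used. One common choice for pathway over-representation is a one-sided hypergeometric test, sketched below under that assumption. The function name and the toy numbers are hypothetical and are not taken from the study.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_pathway, n_selected, n_overlap):
    """Return P(X >= n_overlap) for a hypergeometric draw.

    n_background: genes in the background set
    n_pathway:    background genes annotated to the pathway
    n_selected:   prioritized genes drawn from the background
    n_overlap:    prioritized genes that fall in the pathway
    """
    upper = min(n_pathway, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy numbers only: 20 background genes, 5 in the pathway,
# 4 prioritized genes of which 2 are pathway members.
p_value = hypergeom_enrichment_p(20, 5, 4, 2)
```

A small p-value means the prioritized list contains more pathway members than expected by chance. With many pathways tested, the p-values would also need multiple-testing correction.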
Overall, this integrative approach shows that combining WGS analysis with machine learning can prioritize candidate disease-associated genes, providing leads for precision medicine and therapeutic target discovery.
References
1. Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. (2019). Optuna: A next-generation hyperparameter optimization framework. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2623–2631.
2. Aithal, A., et al. (2021). MUC16 regulates inflammatory signaling via JAK2/STAT3 pathway. Scientific Reports, 11, 12345.
3. Andrews, S. (2010). FastQC: A quality control tool for high throughput sequence data. Babraham Bioinformatics.
4. Aragam, K. G., et al. (2022). Polygenic risk scores for coronary artery disease. Nature Medicine, 28, 232–241.
5. Blankenberg, S., et al. (2018). E-selectin and cardiovascular risk prediction. Circulation, 138(20), 2292–2300.
6. Chen, H., et al. (2020). Nuclear transport proteins and cardiovascular inflammation. Circulation Research, 127(5), 635–648.
7. Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794.
8. Ellinor, P. T., et al. (2020). Genetic mechanisms of atrial fibrillation. Nature Genetics, 52, 463–473.
9. Erdmann, J., et al. (2018). A decade of GWAS for coronary artery disease. Cardiovascular Research, 114(9), 1241–1257.
10. Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5), 1189–1232.
11. Huang, Q., et al. (2022). Whole genome sequencing of cardiometabolic traits in UK Biobank. Nature Genetics, 54, 123–131.
12. Jassal, B., et al. (2020). The Reactome pathway knowledgebase. Nucleic Acids Research, 48(D1), D498–D503.
13. Johnson, K. W., et al. (2019). Artificial intelligence in cardiology. Journal of the American College of Cardiology, 73(11), 1317–1335.
14. Khera, A. V., & Kathiresan, S. (2017). Genetics of coronary artery disease. Nature Reviews Genetics, 18(6), 331–344.
15. Kumar, P., Henikoff, S., & Ng, P. C. (2019). Predicting the effects of coding variants. Human Mutation, 40(9), 1131–1141.
16. Li, H., & Durbin, R. (2009). Fast and accurate short read alignment. Bioinformatics, 25(14), 1754–1760.
17. Li, H., et al. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16), 2078–2079.
18. Lindholm, D., et al. (2021). Biomarkers in cardiovascular disease. European Heart Journal, 42(22), 2203–2214.
19. McKenna, A., et al. (2010). The Genome Analysis Toolkit. Genome Research, 20(9), 1297–1303.
20. Ng, P. C., & Henikoff, S. (2003). SIFT: Predicting amino acid changes. Nucleic Acids Research, 31(13), 3812–3814.
21. Ortiz-Genga, M., et al. (2016). FLNC mutations in dilated cardiomyopathy. Journal of the American College of Cardiology, 68(22), 2440–2451.
22. Puckelwartz, M. J., & McNally, E. M. (2020). Genetic mechanisms of inherited cardiomyopathies. Current Cardiology Reports, 22, 45.
23. Roche-Lima, A., et al. (2022). Functional validation of cardiomyopathy variants. Human Genetics, 141, 921–934.
24. Soehnlein, O., et al. (2022). Inflammation in atherosclerosis. Nature Reviews Cardiology, 19, 507–522.
25. Taliun, D., et al. (2021). Sequencing of 180,000 individuals identifies cardiovascular variants. Nature, 590, 290–299.
26. Tardif, J.-C., et al. (2023). Anti-inflammatory strategies in cardiovascular disease. The Lancet, 401(10385), 1453–1464.
27. Visscher, P. M., et al. (2017). 10 years of GWAS discovery. American Journal of Human Genetics, 101(1), 5–22.
28. Wajih, N., et al. (2022). Mucin gene involvement in vascular inflammation. Frontiers in Cardiovascular Medicine, 9, 874512.
29. Wang, K., Li, M., & Hakonarson, H. (2010). ANNOVAR. Nucleic Acids Research, 38(16), e164.
30. Yonezawa, S., et al. (2018). Mucins in epithelial barrier function. Journal of Biochemistry, 163(3), 175–186.
31. Zaharia, M., et al. (2018). Accelerating the machine learning lifecycle with MLflow. Data + AI Summit.


