1
Nam Y, Kim J, Jung SH, Woerner J, Suh EH, Lee DG, Shivakumar M, Lee ME, Kim D. Harnessing Artificial Intelligence in Multimodal Omics Data Integration: Paving the Path for the Next Frontier in Precision Medicine. Annu Rev Biomed Data Sci 2024; 7:225-250. [PMID: 38768397 DOI: 10.1146/annurev-biodatasci-102523-103801] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/22/2024]
Abstract
The integration of multiomics data with detailed phenotypic insights from electronic health records marks a paradigm shift in biomedical research, offering unparalleled holistic views into health and disease pathways. This review delineates the current landscape of multimodal omics data integration, emphasizing its transformative potential in generating a comprehensive understanding of complex biological systems. We explore robust methodologies for data integration, ranging from concatenation-based to transformation-based and network-based strategies, designed to harness the intricate nuances of diverse data types. Our discussion extends from incorporating large-scale population biobanks to dissecting high-dimensional omics layers at the single-cell level. The review underscores the emerging role of large language models in artificial intelligence, anticipating their influence as a near-future pivot in data integration approaches. Highlighting both achievements and hurdles, we advocate for a concerted effort toward sophisticated integration models, fortifying the foundation for groundbreaking discoveries in precision medicine.
Affiliation(s)
- Yonghyun Nam
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Jaesik Kim
- Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Department of Bioengineering, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Sang-Hyuk Jung
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Jakob Woerner
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Erica H Suh
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Dong-Gi Lee
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Manu Shivakumar
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Matthew E Lee
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Dokyoon Kim
- Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
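
As a rough illustration of the concatenation-based strategy named in the abstract of Nam et al., the sketch below standardizes each omics block feature by feature and then joins the blocks column-wise into one sample-by-feature matrix. The modality names, the toy values, and the helpers `zscore_columns` and `concatenate_modalities` are hypothetical and are not taken from the review; a real pipeline would also have to handle missing samples, batch effects, and very unequal block sizes.

```python
from statistics import mean, pstdev

def zscore_columns(matrix):
    """Standardize each column (feature) of a samples x features matrix."""
    scaled_columns = []
    for col in zip(*matrix):
        mu, sigma = mean(col), pstdev(col)
        # A constant feature carries no signal; map it to zeros instead of dividing by 0.
        scaled_columns.append([0.0 if sigma == 0 else (v - mu) / sigma for v in col])
    return [list(row) for row in zip(*scaled_columns)]

def concatenate_modalities(modalities):
    """Early (concatenation-based) integration: scale each block, then join columns."""
    n_samples = {len(block) for block in modalities.values()}
    if len(n_samples) != 1:
        raise ValueError("all modalities must describe the same samples in the same order")
    scaled = [zscore_columns(block) for block in modalities.values()]
    # Row i of the result is the concatenation of row i from every scaled block.
    return [sum((block[i] for block in scaled), []) for i in range(n_samples.pop())]

# Hypothetical toy data: three samples measured in two modalities.
toy = {
    "expression": [[5.0, 1.0], [7.0, 1.0], [9.0, 1.0]],
    "methylation": [[0.2], [0.4], [0.9]],
}
integrated = concatenate_modalities(toy)  # 3 samples x 3 features
```

Scaling before concatenation is one common design choice so that a block with large raw values does not dominate the joined matrix; transformation-based and network-based strategies instead map each block to an intermediate representation before combining.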
2
Raschia MA, Ríos PJ, Maizon DO, Demitrio D, Poli MA. Methodology for the identification of relevant loci for milk traits in dairy cattle, using machine learning algorithms. MethodsX 2022; 9:101733. [PMID: 35637693 PMCID: PMC9144035 DOI: 10.1016/j.mex.2022.101733] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/08/2022] [Accepted: 05/11/2022] [Indexed: 11/29/2022] Open
Abstract
Machine learning methods were considered efficient in identifying single nucleotide polymorphisms (SNP) underlying a trait of interest. This study aimed to construct predictive models using machine learning algorithms, to identify loci that best explain the variance in milk traits of dairy cattle. Further objectives involved validating the results by comparison with reported relevant regions and retrieving the pathways overrepresented by the genes flanking relevant SNPs. Regression models using XGBoost (XGB), LightGBM (LGB), and Random Forest (RF) algorithms were trained using estimated breeding values for milk production (EBVM), milk fat content (EBVF) and milk protein content (EBVP) as phenotypes and genotypes on 40417 SNPs as predictor variables. To evaluate their efficiency, metrics for actual vs. predicted values were determined in validation folds (XGB and LGB) and out-of-bag data (RF). Fewer than 4500 relevant SNPs were retrieved for each trait. Among the genes flanking them, signaling and transmembrane transporter activities were overrepresented. The models trained:
- Predicted breeding values for animals not included in the dataset.
- Were efficient in identifying a subset of SNPs explaining phenotypic variation.
The results obtained using XGB and LGB algorithms agreed with previous results. Therefore, the method proposed could be applied for future association studies on milk traits.
Affiliation(s)
- María Agustina Raschia
- Instituto Nacional de Tecnología Agropecuaria, CICVyA-CNIA, Instituto de Genética “Ewald A. Favret”. Hurlingham, Buenos Aires, Argentina
- Corresponding author.
- Pablo Javier Ríos
- Universidad de Buenos Aires, Buenos Aires, Argentina
- Facultad de Ciencias Exactas, Universidad Nacional de La Plata, Argentina
- Daniel Omar Maizon
- Instituto Nacional de Tecnología Agropecuaria, E.E.A. Anguil. Anguil, La Pampa, Argentina
- Facultad de Agronomía, Universidad Nacional de La Pampa, Argentina
- Daniel Demitrio
- Instituto Nacional de Tecnología Agropecuaria, Dirección General de Sistemas de Información, Comunicación y Procesos - Gerencia de Informática y Gestión de la Información. Buenos Aires, Argentina
- Facultad de Ciencias Exactas, Universidad Nacional de La Plata, Argentina
- Mario Andrés Poli
- Instituto Nacional de Tecnología Agropecuaria, CICVyA-CNIA, Instituto de Genética “Ewald A. Favret”. Hurlingham, Buenos Aires, Argentina
- Facultad de Ciencias Agrarias y Veterinarias, Universidad del Salvador, Argentina
3
Yoosefzadeh-Najafabadi M, Eskandari M, Belzile F, Torkamaneh D. Genome-Wide Association Study Statistical Models: A Review. Methods Mol Biol 2022; 2481:43-62. [PMID: 35641758 DOI: 10.1007/978-1-0716-2237-7_4] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 06/15/2023]
Abstract
Statistical models are at the core of the genome-wide association study (GWAS). In this chapter, we provide an overview of single- and multilocus statistical models, Bayesian, and machine learning approaches for association studies in plants. These models are discussed in terms of their basic methodology, the cofactor adjustments they account for, statistical power, and computational efficiency. New statistical models and machine learning algorithms are both showing improved performance in detecting missed signals and rare mutations and in prioritizing causal genetic variants; nevertheless, further optimization and validation studies are required to maximize the power of GWAS.
Affiliation(s)
- Milad Eskandari
- Department of Plant Agriculture, University of Guelph, Guelph, ON, Canada
- François Belzile
- Département de Phytologie, Université Laval, Quebec City, QC, Canada
- Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Quebec City, QC, Canada
- Davoud Torkamaneh
- Département de Phytologie, Université Laval, Quebec City, QC, Canada
- Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Quebec City, QC, Canada
4
Arani AA, Sehhati M, Tabatabaiefar MA. Predicting deleterious missense genetic variants via integrative supervised nonnegative matrix tri-factorization. Sci Rep 2021; 11:23747. [PMID: 34887492 PMCID: PMC8660898 DOI: 10.1038/s41598-021-03230-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/02/2021] [Accepted: 11/30/2021] [Indexed: 11/21/2022] Open
Abstract
Among the many kinds of genetic variation, missense variants are a major class, a small subset of which may disrupt protein function and ultimately cause human disease. Various machine learning methods have been proposed to distinguish deleterious from benign missense variants using a large number of features, including structure, sequence, interaction networks, gene-disease associations, and phenotypes. However, a reliable and accurate algorithm for merging heterogeneous information is still needed, as it could capture the complex interactions in the networks in which genes participate. In this study, we propose a new method based on the non-negative matrix tri-factorization clustering method. We outline two versions of the proposed method: a two-source and a three-source algorithm. The two-source algorithm aggregates individual deleteriousness prediction methods and a PPI network, and the three-source algorithm additionally incorporates gene-disease associations. Four benchmark datasets were employed for internal and external validation of both algorithms of our predictor. The results on all datasets confirmed that our method outperforms most state-of-the-art variant prediction tools. Two key features of our variant effect prediction method are worth mentioning. First, although incorporating gene-disease information in the three-source algorithm can improve prediction performance compared with the two-source algorithm, our method is not hindered by type 2 circularity error, unlike some recent ensemble-based prediction methods. Type 2 circularity error occurs when a predictor annotates variants on the basis of the genes in which they are located. Second, the performance of our predictor is superior to that of other ensemble-based methods for variants located in genes for which little pathogenicity information is available.
Affiliation(s)
- Asieh Amousoltani Arani
- Department of Bioelectric and Biomedical Engineering, School of Advanced Technologies in Medicine, Isfahan University of Medical Sciences, Isfahan, Iran
- Student Research Committee, School of Advanced Technologies in Medicine, Isfahan University of Medical Sciences, Isfahan, Iran
- Mohammadreza Sehhati
- Department of Bioinformatics, School of Advanced Technologies in Medicine, Isfahan University of Medical Sciences, Isfahan, Iran
- Deputy of Research and Technology, GTaC Corp, Isfahan University of Medical Sciences, Isfahan, Iran
- Mohammad Amin Tabatabaiefar
- Deputy of Research and Technology, GTaC Corp, Isfahan University of Medical Sciences, Isfahan, Iran
- Department of Genetics and Molecular Biology, School of Medicine, Isfahan University of Medical Sciences, Isfahan, Iran
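
The abstract of Arani et al. builds on non-negative matrix tri-factorization, which approximates a non-negative matrix X by a product F S Gᵀ of three non-negative factors. The sketch below is a generic, unsupervised version using standard multiplicative updates for the squared error; it is not the authors' supervised, multi-source algorithm, and the toy matrix, the function names, and the choice of two row and two column clusters are assumptions made only for illustration.

```python
import random

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def multiplicative_step(m, numerator, denominator, eps=1e-9):
    """Element-wise m * numerator / denominator; entries stay non-negative."""
    return [[v * n / (d + eps) for v, n, d in zip(rm, rn, rd)]
            for rm, rn, rd in zip(m, numerator, denominator)]

def squared_error(x, f, s, g):
    approx = matmul(matmul(f, s), transpose(g))
    return sum((a - b) ** 2 for ra, rb in zip(x, approx) for a, b in zip(ra, rb))

def nmtf(x, k_rows, k_cols, n_iter=200, seed=0):
    """Factor non-negative x (n x m) as f (n x k_rows) @ s (k_rows x k_cols) @ g.T."""
    rng = random.Random(seed)

    def rand(rows, cols):
        return [[rng.uniform(0.1, 1.0) for _ in range(cols)] for _ in range(rows)]

    f, s, g = rand(len(x), k_rows), rand(k_rows, k_cols), rand(len(x[0]), k_cols)
    history = [squared_error(x, f, s, g)]
    for _ in range(n_iter):
        sgt = matmul(s, transpose(g))  # S G^T, held fixed while F is updated
        f = multiplicative_step(f, matmul(x, transpose(sgt)),
                                matmul(f, matmul(sgt, transpose(sgt))))
        fs = matmul(f, s)              # F S, held fixed while G is updated
        g = multiplicative_step(g, matmul(transpose(x), fs),
                                matmul(g, matmul(transpose(fs), fs)))
        ft = transpose(f)
        s = multiplicative_step(s, matmul(matmul(ft, x), g),
                                matmul(matmul(matmul(ft, f), s), matmul(transpose(g), g)))
        history.append(squared_error(x, f, s, g))
    return f, s, g, history

# Hypothetical toy matrix with two row blocks and two column blocks.
x = [[5.0, 5.0, 0.0, 0.0],
     [4.0, 5.0, 0.0, 1.0],
     [0.0, 0.0, 3.0, 3.0],
     [0.0, 1.0, 3.0, 4.0]]
f, s, g, history = nmtf(x, k_rows=2, k_cols=2)
```

In this generic form, rows of `f` and `g` can be read as soft cluster memberships for the rows and columns of `x`, and `s` links the two sets of clusters. Integrating several sources, as the abstract describes, would require additional coupled terms that are not shown here.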
5
Peng GCY, Alber M, Tepole AB, Cannon WR, De S, Dura-Bernal S, Garikipati K, Karniadakis G, Lytton WW, Perdikaris P, Petzold L, Kuhl E. Multiscale modeling meets machine learning: What can we learn? Archives of Computational Methods in Engineering: State of the Art Reviews 2021; 28:1017-1037. [PMID: 34093005 PMCID: PMC8172124 DOI: 10.1007/s11831-020-09405-5] [Citation(s) in RCA: 73] [Impact Index Per Article: 18.3] [Received: 12/30/2019] [Accepted: 02/09/2020] [Indexed: 05/10/2023]
Abstract
Machine learning is increasingly recognized as a promising technology in the biological, biomedical, and behavioral sciences. There can be no argument that this technique is incredibly successful in image recognition with immediate applications in diagnostics including electrophysiology, radiology, or pathology, where we have access to massive amounts of annotated data. However, machine learning often performs poorly in prognosis, especially when dealing with sparse data. This is a field where classical physics-based simulation seems to remain irreplaceable. In this review, we identify areas in the biomedical sciences where machine learning and multiscale modeling can mutually benefit from one another: Machine learning can integrate physics-based knowledge in the form of governing equations, boundary conditions, or constraints to manage ill-posed problems and robustly handle sparse and noisy data; multiscale modeling can integrate machine learning to create surrogate models, identify system dynamics and parameters, analyze sensitivities, and quantify uncertainty to bridge the scales and understand the emergence of function. With a view towards applications in the life sciences, we discuss the state of the art of combining machine learning and multiscale modeling, identify applications and opportunities, raise open questions, and address potential challenges and limitations. We anticipate that it will stimulate discussion within the community of computational mechanics and reach out to other disciplines including mathematics, statistics, computer science, artificial intelligence, biomedicine, systems biology, and precision medicine to join forces towards creating robust and efficient models for biological systems.
Affiliation(s)
- Mark Alber
- University of California, Riverside, USA
- William R Cannon
- Pacific Northwest National Laboratory, Richland, Washington, USA
- Suvranu De
- Rensselaer Polytechnic Institute, Troy, New York, USA
- Linda Petzold
- University of California, Santa Barbara, California, USA
- Ellen Kuhl
- Stanford University, Stanford, California, USA
6
Sun S, Dong B, Zou Q. Revisiting genome-wide association studies from statistical modelling to machine learning. Brief Bioinform 2020; 22:5943789. [PMID: 33126243 DOI: 10.1093/bib/bbaa263] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 08/10/2020] [Revised: 09/06/2020] [Accepted: 09/11/2020] [Indexed: 11/14/2022] Open
Abstract
Over the last decade, genome-wide association studies (GWAS) have discovered thousands of genetic variants underlying complex human diseases and agriculturally important traits. These findings have been utilized to dissect the biological basis of diseases, to develop new drugs, to advance precision medicine and to boost breeding. However, the potential of GWAS is still underexploited due to methodological limitations. Many challenges have emerged, including detecting epistasis and single-nucleotide polymorphisms (SNPs) with small effects and distinguishing causal variants from other SNPs associated through linkage disequilibrium. These issues have motivated advancements in GWAS analyses in two contrasting cultures: statistical modelling and machine learning. In this review, we systematically present the basic concepts and the benefits and limitations of both methods. We further discuss recent efforts to mitigate their weaknesses. Additionally, we summarize the state-of-the-art tools for detecting the missed signals, ultrarare mutations and gene-gene interactions and for prioritizing SNPs. Our work can offer both theoretical and practical guidelines for performing GWAS analyses and for developing further new robust methods to fully exploit the potential of GWAS.
Affiliation(s)
- Shanwen Sun
- Institute of Fundamental and Frontier Sciences at the University of Electronic Science and Technology of China, Chengdu, China
- Benzhi Dong
- College of Computer Science and Engineering, Northeast Forestry University, Harbin, China
- Quan Zou
- Institute of Fundamental and Frontier Sciences at the University of Electronic Science and Technology of China, Chengdu, China
7
Hamaya R, Hoshino M, Yonetsu T, Lee JM, Koo BK, Escaned J, Kakuta T. Defining heterogeneity of epicardial functional stenosis with low coronary flow reserve by unsupervised machine learning. Heart Vessels 2020; 35:1527-1536. [PMID: 32506182 DOI: 10.1007/s00380-020-01640-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/02/2020] [Accepted: 05/29/2020] [Indexed: 10/24/2022]
Abstract
Low coronary flow reserve (CFR) is associated with poor prognosis, but it is a heterogeneous condition with respect to the actual coronary flow, such as high resting or low hyperemic coronary flow, which should have different physiological traits and clinical implications. This study aimed to detect and define the sub-phenotypes of vessels with low-CFR epicardial disease by unsupervised machine-learning methods. Hierarchical clustering was applied to 376 vessels from 364 patients with CFR less than the median and fractional flow reserve ≤ 0.8 from a global, multicenter registry. Detailed features of coronary flow physiology and survival free from vessel-oriented composite outcomes (VOCO) were assessed according to the clusters. Clustering defined three distinct physiological subgroups (PS). PS1 (n = 151) was characterized by high resting coronary flow and predominantly left anterior descending artery (LAD) lesions. PS2 (n = 131), in contrast, had low hyperemic coronary flow and mainly LAD lesions. PS3 (n = 82) mostly consisted of non-LAD lesions with flow status similar to PS1 except for the low hyperemic Pd. VOCO-free survival differed significantly among the clusters (p = 0.005), and PS3 had the highest rate of VOCO. In a Cox proportional hazards model predicting VOCO, there was a significant interaction between PCI and PSs, suggesting potentially different effects of PCI on outcome between PS1 and PS2. The unsupervised machine-learning approaches provided unique insights into the low-CFR condition. Among low-CFR epicardial lesions, high resting flow with low hyperemic Pd might be related to poor prognosis, and low hyperemic flow in LAD could benefit from elective PCI. CLINICAL TRIAL REGISTRATION INFORMATION: https://clinicaltrials.gov/ct2/show/NCT03690713, NCT03690713.
Affiliation(s)
- Rikuta Hamaya
- Division of Cardiovascular Medicine, Tsuchiura Kyodo General Hospital, Ibaraki, Japan; Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Masahiro Hoshino
- Division of Cardiovascular Medicine, Tsuchiura Kyodo General Hospital, Ibaraki, Japan
- Taishi Yonetsu
- Department of Cardiovascular Medicine, Tokyo Medical and Dental University, Tokyo, Japan
- Joo Myung Lee
- Division of Cardiology, Department of Internal Medicine, Heart Vascular Stroke Institute, Samsung Medical Center, Sungkyunkwan University School of Medicine, Seoul, South Korea
- Bon-Kwon Koo
- Department of Internal Medicine and Cardiovascular Center, Seoul National University Hospital, Seoul, South Korea; Institute on Aging, Seoul National University, Seoul, South Korea
- Javier Escaned
- Cardiovascular Institute, Hospital Clinico San Carlos, Madrid, Spain; Centro Nacional de Investigaciónes Cardiovasculares Carlos III (CNIC), Madrid, Spain
- Tsunekazu Kakuta
- Division of Cardiovascular Medicine, Tsuchiura Kyodo General Hospital, Ibaraki, Japan; Department of Cardiology, Tsuchiura Kyodo General Hospital, 4-4-1 Otsuno, Tsuchiura, Ibaraki, 300-0028, Japan
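
The hierarchical clustering step described in the abstract of Hamaya et al. can be pictured with a small agglomerative example. The sketch below assumes Euclidean distance and average linkage, neither of which is stated in the abstract, and it runs on invented, standardized (resting flow, hyperemic flow) pairs rather than registry data; the function names are likewise hypothetical.

```python
from itertools import combinations
from math import dist

def average_linkage(points, cluster_a, cluster_b):
    """Mean pairwise Euclidean distance between members of two clusters."""
    total = sum(dist(points[i], points[j]) for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

def agglomerative_clustering(points, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(points, clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b keeps index a valid
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical standardized (resting flow, hyperemic flow) values, one pair per vessel.
vessels = [(1.9, 1.0), (2.1, 1.2), (2.0, 0.9),
           (-1.0, -1.8), (-1.2, -2.0),
           (0.1, 3.0), (-0.1, 3.2)]
subgroups = agglomerative_clustering(vessels, n_clusters=3)
```

Each returned list holds the indices of the vessels assigned to one subgroup. In practice the number of clusters is usually chosen by inspecting the dendrogram or a validity index rather than fixed in advance.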
8
Nicholls HL, John CR, Watson DS, Munroe PB, Barnes MR, Cabrera CP. Reaching the End-Game for GWAS: Machine Learning Approaches for the Prioritization of Complex Disease Loci. Front Genet 2020; 11:350. [PMID: 32351543 PMCID: PMC7174742 DOI: 10.3389/fgene.2020.00350] [Citation(s) in RCA: 71] [Impact Index Per Article: 14.2] [Received: 12/19/2019] [Accepted: 03/23/2020] [Indexed: 12/21/2022] Open
Abstract
Genome-wide association studies (GWAS) have revealed thousands of genetic loci that underpin the complex biology of many human traits. However, the strength of GWAS - the ability to detect genetic association by linkage disequilibrium (LD) - is also its limitation. Whilst the ever-increasing study size and improved design have augmented the power of GWAS to detect effects, differentiation of causal variants or genes from other highly correlated genes associated by LD remains the real challenge. This has severely hindered the biological insights and clinical translation of GWAS findings. Although thousands of disease susceptibility loci have been reported, causal genes at these loci remain elusive. Machine learning (ML) techniques offer an opportunity to dissect the heterogeneity of variant and gene signals in the post-GWAS analysis phase. ML models for GWAS prioritization vary greatly in their complexity, ranging from relatively simple logistic regression approaches to more complex ensemble models such as random forests and gradient boosting, as well as deep learning models, i.e., neural networks. Paired with functional validation, these methods show important promise for clinical translation, providing a strong evidence-based approach to direct post-GWAS research. However, as ML approaches continue to evolve to meet the challenge of causal gene identification, a critical assessment of the underlying methodologies and their applicability to the GWAS prioritization problem is needed. This review investigates the landscape of ML applications in three parts: selected models, input features, and output model performance, with a focus on prioritizations of complex disease associated loci. Overall, we explore the contributions ML has made towards reaching the GWAS end-game with consequent wide-ranging translational impact.
Affiliation(s)
- Hannah L. Nicholls
- Clinical Pharmacology, William Harvey Research Institute, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
- Centre for Translational Bioinformatics, William Harvey Research Institute, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
- Christopher R. John
- Centre for Translational Bioinformatics, William Harvey Research Institute, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
- Centre for Experimental Medicine and Rheumatology, William Harvey Research Institute, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
- David S. Watson
- Centre for Translational Bioinformatics, William Harvey Research Institute, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
- Oxford Internet Institute, University of Oxford, Oxford, United Kingdom
- Patricia B. Munroe
- Clinical Pharmacology, William Harvey Research Institute, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
- NIHR Barts Biomedical Research Centre, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
- Michael R. Barnes
- Clinical Pharmacology, William Harvey Research Institute, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
- Centre for Translational Bioinformatics, William Harvey Research Institute, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
- NIHR Barts Biomedical Research Centre, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
- The Alan Turing Institute, British Library, London, United Kingdom
- Claudia P. Cabrera
- Clinical Pharmacology, William Harvey Research Institute, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
- Centre for Translational Bioinformatics, William Harvey Research Institute, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
- NIHR Barts Biomedical Research Centre, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom