1
Lu B. Cancer phylogenetic inference using copy number alterations detected from DNA sequencing data. Cancer Pathogenesis and Therapy 2025; 3:16-29. [PMID: 39872371 PMCID: PMC11764021 DOI: 10.1016/j.cpt.2024.04.003] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 01/31/2024] [Revised: 04/05/2024] [Accepted: 04/15/2024] [Indexed: 01/30/2025]
Abstract
Cancer is an evolutionary process involving the accumulation of diverse somatic mutations and clonal evolution over time. Phylogenetic inference from samples obtained from an individual patient offers a powerful approach to unraveling the intricate evolutionary history of cancer and provides insights that can inform cancer treatment. Somatic copy number alterations (CNAs) are important in cancer evolution and are often used as markers, alone or with other somatic mutations, for phylogenetic inferences, particularly in low-coverage DNA sequencing data. Many phylogenetic inference methods using CNAs detected from bulk or single-cell DNA sequencing data have been developed over the years. However, there have been no systematic reviews on these methods. To summarize the state-of-the-art of the field and inform future development, this review presents a comprehensive survey on the major challenges in inference, different types of methods, and applications of these methods. The challenges are discussed from the aspects of input data, models of evolution, and inference algorithms. The different methods are grouped according to the markers used for inference and the types of the reconstructed trees. The applications include using phylogenetic inference to understand intra-tumor heterogeneity, metastasis, treatment resistance, and early cancer development. This review also sheds light on future directions of cancer phylogenetic inference using CNAs, including the improvement of scalability, the utilization of new types of data, and the development of more realistic models of evolution.
Affiliation(s)
- Bingxin Lu
- School of Biosciences and Medicine, University of Surrey, Guildford GU2 7XH, UK
- Surrey Institute for People-Centred Artificial Intelligence, University of Surrey, Guildford GU2 7XH, UK
2
Bereczki Z, Benczik B, Balogh OM, Marton S, Puhl E, Pétervári M, Váczy-Földi M, Papp ZT, Makkos A, Glass K, Locquet F, Euler G, Schulz R, Ferdinandy P, Ágg B. Mitigating off-target effects of small RNAs: conventional approaches, network theory and artificial intelligence. Br J Pharmacol 2025; 182:340-379. [PMID: 39293936 DOI: 10.1111/bph.17302] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/30/2023] [Revised: 05/07/2024] [Accepted: 06/17/2024] [Indexed: 09/20/2024]
Abstract
Three types of highly promising small RNA therapeutics, namely, small interfering RNAs (siRNAs), microRNAs (miRNAs) and the RNA subtype of antisense oligonucleotides (ASOs), offer advantages over small-molecule drugs. These small RNAs can target any gene product, opening up new avenues of effective and safe therapeutic approaches for a wide range of diseases. In preclinical research, synthetic small RNAs play an essential role in the investigation of physiological and pathological pathways as silencers of specific genes, facilitating discovery and validation of drug targets in different conditions. Off-target effects of small RNAs, however, could make it difficult to interpret experimental results in the preclinical phase and may contribute to adverse events of small RNA therapeutics. Of the two major types of off-target effects, we focused on the hybridization-dependent type, especially on miRNA-like off-target effects. Our main aim was to discuss several approaches, including sequence design, chemical modifications and target prediction, to reduce hybridization-dependent off-target effects that should be considered even at the early development phase of small RNA therapy. Because there is no standard way of predicting hybridization-dependent off-target effects, this review provides an overview of all major state-of-the-art computational methods and proposes new approaches, such as the possible inclusion of network theory and artificial intelligence (AI) in the prediction workflows. Case studies and a concise survey of experimental methods for validating in silico predictions are also presented. These methods could help to interpret experimental results, to minimize off-target effects and hopefully to avoid off-target-related adverse events of small RNA therapeutics. LINKED ARTICLES: This article is part of a themed issue Non-coding RNA Therapeutics.
To view the other articles in this section visit http://onlinelibrary.wiley.com/doi/10.1111/bph.v182.2/issuetoc.
Affiliation(s)
- Zoltán Bereczki
- Department of Pharmacology and Pharmacotherapy, Semmelweis University, Budapest, Hungary
- Center for Pharmacology and Drug Research & Development, Semmelweis University, Budapest, Hungary
- HUN-REN-SU System Pharmacology Research Group, Department of Pharmacology and Pharmacotherapy, Semmelweis University, Budapest, Hungary
- Bettina Benczik
- Department of Pharmacology and Pharmacotherapy, Semmelweis University, Budapest, Hungary
- Center for Pharmacology and Drug Research & Development, Semmelweis University, Budapest, Hungary
- HUN-REN-SU System Pharmacology Research Group, Department of Pharmacology and Pharmacotherapy, Semmelweis University, Budapest, Hungary
- Pharmahungary Group, Szeged, Hungary
- Olivér M Balogh
- Department of Pharmacology and Pharmacotherapy, Semmelweis University, Budapest, Hungary
- Center for Pharmacology and Drug Research & Development, Semmelweis University, Budapest, Hungary
- HUN-REN-SU System Pharmacology Research Group, Department of Pharmacology and Pharmacotherapy, Semmelweis University, Budapest, Hungary
- Szandra Marton
- Department of Pharmacology and Pharmacotherapy, Semmelweis University, Budapest, Hungary
- Center for Pharmacology and Drug Research & Development, Semmelweis University, Budapest, Hungary
- Eszter Puhl
- Department of Pharmacology and Pharmacotherapy, Semmelweis University, Budapest, Hungary
- Center for Pharmacology and Drug Research & Development, Semmelweis University, Budapest, Hungary
- Mátyás Pétervári
- Department of Pharmacology and Pharmacotherapy, Semmelweis University, Budapest, Hungary
- Center for Pharmacology and Drug Research & Development, Semmelweis University, Budapest, Hungary
- HUN-REN-SU System Pharmacology Research Group, Department of Pharmacology and Pharmacotherapy, Semmelweis University, Budapest, Hungary
- Sanovigado Kft, Budapest, Hungary
- Máté Váczy-Földi
- Department of Pharmacology and Pharmacotherapy, Semmelweis University, Budapest, Hungary
- Center for Pharmacology and Drug Research & Development, Semmelweis University, Budapest, Hungary
- HUN-REN-SU System Pharmacology Research Group, Department of Pharmacology and Pharmacotherapy, Semmelweis University, Budapest, Hungary
- Zsolt Tamás Papp
- Department of Pharmacology and Pharmacotherapy, Semmelweis University, Budapest, Hungary
- Center for Pharmacology and Drug Research & Development, Semmelweis University, Budapest, Hungary
- HUN-REN-SU System Pharmacology Research Group, Department of Pharmacology and Pharmacotherapy, Semmelweis University, Budapest, Hungary
- András Makkos
- Department of Pharmacology and Pharmacotherapy, Semmelweis University, Budapest, Hungary
- Center for Pharmacology and Drug Research & Development, Semmelweis University, Budapest, Hungary
- HUN-REN-SU System Pharmacology Research Group, Department of Pharmacology and Pharmacotherapy, Semmelweis University, Budapest, Hungary
- Pharmahungary Group, Szeged, Hungary
- Kimberly Glass
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA
- Fabian Locquet
- Physiologisches Institut, Justus-Liebig-Universität Gießen, Giessen, Germany
- Gerhild Euler
- Physiologisches Institut, Justus-Liebig-Universität Gießen, Giessen, Germany
- Rainer Schulz
- Physiologisches Institut, Justus-Liebig-Universität Gießen, Giessen, Germany
- Péter Ferdinandy
- Department of Pharmacology and Pharmacotherapy, Semmelweis University, Budapest, Hungary
- Center for Pharmacology and Drug Research & Development, Semmelweis University, Budapest, Hungary
- HUN-REN-SU System Pharmacology Research Group, Department of Pharmacology and Pharmacotherapy, Semmelweis University, Budapest, Hungary
- Pharmahungary Group, Szeged, Hungary
- Bence Ágg
- Department of Pharmacology and Pharmacotherapy, Semmelweis University, Budapest, Hungary
- Center for Pharmacology and Drug Research & Development, Semmelweis University, Budapest, Hungary
- HUN-REN-SU System Pharmacology Research Group, Department of Pharmacology and Pharmacotherapy, Semmelweis University, Budapest, Hungary
- Pharmahungary Group, Szeged, Hungary
3
Horvath J, Jedlicka P, Kratka M, Kubat Z, Kejnovsky E, Lexa M. Detection and classification of long terminal repeat sequences in plant LTR-retrotransposons and their analysis using explainable machine learning. BioData Min 2024; 17:57. [PMID: 39696434 DOI: 10.1186/s13040-024-00410-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/13/2024] [Accepted: 11/22/2024] [Indexed: 12/20/2024]
Abstract
BACKGROUND Long terminal repeats (LTRs) represent important parts of LTR retrotransposons and retroviruses found in high copy numbers in a majority of eukaryotic genomes. LTRs contain regulatory sequences essential for the life cycle of the retrotransposon. Previous experimental and sequence studies have provided only limited information about LTR structure and composition, mostly from model systems. To enhance our understanding of these key sequence modules, we focused on the contrasts between LTRs of various retrotransposon families and other genomic regions. Furthermore, this approach can be utilized for the classification and prediction of LTRs. RESULTS We used machine learning methods suitable for DNA sequence classification and applied them to a large dataset of plant LTR retrotransposon sequences. We trained three machine learning models using (i) traditional model ensembles (Gradient Boosting), (ii) hybrid convolutional/long short-term memory (LSTM) network models, and (iii) a DNA pre-trained transformer-based model using k-mer sequence representation. All three approaches were successful in classifying and isolating LTRs in these data, as well as providing valuable insights into LTR sequence composition. The best classification (expressed as F1 score) achieved for LTR detection was 0.85 using the hybrid network model. The most accurate classification task was superfamily classification (F1=0.89) while the least accurate was family classification (F1=0.74). The trained models were subjected to explainability analysis. Positional analysis identified a mixture of interesting features, many of which had a preferred absolute position within the LTR and/or were biologically relevant, such as a centrally positioned TATA-box regulatory sequence, and TG..CA nucleotide patterns around both LTR edges.
CONCLUSIONS Our results show that the models used here recognized biologically relevant motifs, such as core promoter elements in the LTR detection task, and a development and stress-related subclass of transcription factor binding sites in the family classification task. Explainability analysis also highlighted the importance of 5'- and 3'- edges in LTR identity and revealed the need to analyze more than just dinucleotides at these ends. Our work shows the applicability of machine learning models to regulatory sequence analysis and classification, and demonstrates the important role of the identified motifs in LTR detection.
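As an illustration of two terms used in this abstract, the sketch below shows one common way to turn a DNA sequence into overlapping k-mers and how a binary F1 score is computed. It is not the authors' pipeline: the function names, the choice of k = 3 and the toy labels are assumptions made only for this example.

```python
def kmer_tokens(sequence: str, k: int = 3, stride: int = 1) -> list[str]:
    """Split a DNA sequence into overlapping k-mers (hypothetical helper)."""
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def f1_score(true_labels: list[int], predicted: list[int]) -> float:
    """Harmonic mean of precision and recall for binary labels (1 = LTR)."""
    pairs = list(zip(true_labels, predicted))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy usage: an invented 7-nt fragment and invented labels.
tokens = kmer_tokens("TGTTACA", k=3)
score = f1_score([1, 1, 0, 0], [1, 0, 1, 0])
```

With stride 1, a sequence of length n yields n - k + 1 tokens, so neighbouring tokens share k - 1 nucleotides.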
Affiliation(s)
- Jakub Horvath
- Faculty of Informatics, Masaryk University, Botanicka 68a, Brno, 60200, Czech Republic.
- Pavel Jedlicka
- Department of Plant Developmental Genetics, Institute of Biophysics of the Czech Academy of Sciences, Kralovopolska 135, Brno, 61200, Czech Republic
- Marie Kratka
- Department of Plant Developmental Genetics, Institute of Biophysics of the Czech Academy of Sciences, Kralovopolska 135, Brno, 61200, Czech Republic
- National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Kamenice 5, Brno, 62500, Czech Republic
- Zdenek Kubat
- Department of Plant Developmental Genetics, Institute of Biophysics of the Czech Academy of Sciences, Kralovopolska 135, Brno, 61200, Czech Republic
- Eduard Kejnovsky
- Department of Plant Developmental Genetics, Institute of Biophysics of the Czech Academy of Sciences, Kralovopolska 135, Brno, 61200, Czech Republic
- Matej Lexa
- Faculty of Informatics, Masaryk University, Botanicka 68a, Brno, 60200, Czech Republic.
4
Alipanahi R, Safari L, Khanteymoori A. DTMP-prime: A deep transformer-based model for predicting prime editing efficiency and PegRNA activity. Molecular Therapy. Nucleic Acids 2024; 35:102370. [PMID: 39654539 PMCID: PMC11626815 DOI: 10.1016/j.omtn.2024.102370] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/01/2024] [Accepted: 10/24/2024] [Indexed: 12/12/2024]
Abstract
Prime editors are CRISPR-based genome engineering tools with significant potential for rectifying patient mutations. However, their usage requires experimental optimization of the prime editing guide RNA (PegRNA) to achieve high editing efficiency. This paper introduces the deep transformer-based model for predicting prime editing efficiency (DTMP-Prime), a tool specifically designed to predict PegRNA activity and prime editing (PE) efficiency. DTMP-Prime facilitates the design of appropriate PegRNA and ngRNA. A transformer-based model was constructed to scrutinize a wide-ranging set of PE data, enabling the extraction of effective features of PegRNAs and target DNA sequences. The integration of these features with the proposed encoding strategy and DNABERT-based embedding has notably improved the predictive capabilities of DTMP-Prime for off-target sites. Moreover, DTMP-Prime is a promising tool for precisely predicting off-target sites in CRISPR experiments. The integration of a multi-head attention framework has additionally improved the precision and generalizability of DTMP-Prime across various PE models and cell lines. Evaluation results based on the Pearson and Spearman correlation coefficients demonstrate that DTMP-Prime outperforms other state-of-the-art models in predicting the efficiency and outcomes of PE experiments.
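For readers unfamiliar with the two evaluation metrics named here, the following minimal sketch computes Pearson and Spearman correlation between measured and predicted values. It is a generic illustration, not code from DTMP-Prime; the function names and the toy numbers are assumptions, and ties are handled with average ranks.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented example: predictions are monotone in the measurements but not linear.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 4.0, 9.0, 16.0]
r_linear = pearson(measured, predicted)
r_rank = spearman(measured, predicted)
```

Pearson rewards a linear relationship, whereas Spearman only asks whether the ranking of the predictions matches the ranking of the measurements, which is why evaluations often report both.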
Affiliation(s)
- Leila Safari
- Department of Computer Engineering, University of Zanjan, Zanjan, Iran
5
Lee HJ, Emani PS, Gerstein MB. Improved Prediction of Ligand-Protein Binding Affinities by Meta-modeling. J Chem Inf Model 2024; 64:8684-8704. [PMID: 39576762 PMCID: PMC11632770 DOI: 10.1021/acs.jcim.4c01116] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/25/2024] [Revised: 10/21/2024] [Accepted: 10/28/2024] [Indexed: 11/24/2024]
Abstract
The accurate screening of candidate drug ligands against target proteins through computational approaches is of prime interest to drug development efforts. Such virtual screening depends in part on methods to predict the binding affinity between ligands and proteins. Many computational models for binding affinity prediction have been developed, but with varying results across targets. Given that ensembling or meta-modeling approaches have shown great promise in reducing model-specific biases, we develop a framework to integrate published force-field-based empirical docking and sequence-based deep learning models. In building this framework, we evaluate many combinations of individual base models, training databases, and several meta-modeling approaches. We show that many of our meta-models significantly improve affinity predictions over base models. Our best meta-models achieve comparable performance to state-of-the-art deep learning tools exclusively based on 3D structures while allowing for improved database scalability and flexibility through the explicit inclusion of features such as physicochemical properties or molecular descriptors. We further demonstrate improved generalization capability by our models using a large-scale benchmark of affinity prediction as well as a virtual screening application benchmark. Overall, we demonstrate that diverse modeling approaches can be ensembled together to gain meaningful improvement in binding affinity prediction.
Affiliation(s)
- Ho-Joon Lee
- Department of Genetics and Yale Center for Genome Analysis, Yale University, New Haven, Connecticut 06510, United States
- Prashant S. Emani
- Department of Molecular Biophysics & Biochemistry, Yale University, New Haven, Connecticut 06520, United States
- Mark B. Gerstein
- Department of Molecular Biophysics & Biochemistry, Yale University, New Haven, Connecticut 06520, United States
- Program in Computational Biology & Bioinformatics, Department of Computer Science, Department of Statistics & Data Science, and Department of Biomedical Informatics & Data Science, Yale University, New Haven, Connecticut 06520, United States
6
Le HN, de Freitas MV, Antunes DA. Strengths and limitations of web servers for the modeling of TCRpMHC complexes. Comput Struct Biotechnol J 2024; 23:2938-2948. [PMID: 39104710 PMCID: PMC11298609 DOI: 10.1016/j.csbj.2024.06.028] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2024] [Revised: 06/22/2024] [Accepted: 06/23/2024] [Indexed: 08/07/2024]
Abstract
Cellular immunity relies on the ability of a T-cell receptor (TCR) to recognize a peptide (p) presented by a class I major histocompatibility complex (MHC) receptor on the surface of a cell. The TCR-peptide-MHC (TCRpMHC) interaction is a crucial step in activating T-cells, and the structural characteristics of these molecules play a significant role in determining the specificity and affinity of this interaction. Hence, obtaining 3D structures of TCRpMHC complexes offers valuable insights into various aspects of cellular immunity and can facilitate the development of T-cell-based immunotherapies. Here, we aimed to compare three popular web servers for modeling the structures of TCRpMHC complexes, namely ImmuneScape (IS), TCRpMHCmodels, and TCRmodel2, to examine their strengths and limitations. Each method employs a different modeling strategy, including docking, homology modeling, and deep learning. The accuracy of each method was evaluated by reproducing the 3D structures of a dataset of 87 TCRpMHC complexes with experimentally determined crystal structures available on the Protein Data Bank (PDB). All selected structures were limited to human MHC alleles, presenting a diverse set of peptide ligands. A detailed analysis of produced models was conducted using multiple metrics, including Root Mean Square Deviation (RMSD) and standardized assessments from CAPRI and DockQ. Special attention was given to the complementarity-determining region (CDR) loops of the TCRs and to the peptide ligands, which define most of the unique features and specificity of a given TCRpMHC interaction. Our study provides an optimistic view of the current state-of-the-art for TCRpMHC modeling but highlights some remaining challenges that must be addressed in order to support the future application of these tools for TCR engineering and computer-aided design of TCR-based immunotherapies.
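The RMSD metric mentioned in this abstract can be sketched as follows. This is a generic illustration rather than the evaluation code used in the study: the `rmsd` function and the toy coordinates are assumptions, and the sketch presumes the two structures have already been superimposed and list the same atoms in the same order.

```python
import math

Point = tuple[float, float, float]


def rmsd(model: list[Point], reference: list[Point]) -> float:
    """Root mean square deviation between matched, pre-aligned atom coordinates."""
    if len(model) != len(reference) or not model:
        raise ValueError("coordinate lists must be non-empty and equally long")
    squared = sum(
        (a - b) ** 2
        for atom_m, atom_r in zip(model, reference)
        for a, b in zip(atom_m, atom_r)
    )
    return math.sqrt(squared / len(model))


# Invented two-atom example: the second atom is displaced by 2 units along z.
modeled = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
crystal = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
deviation = rmsd(modeled, crystal)
```

In practice the value depends on which atoms are compared (for example, only CDR loops or only the peptide) and on how the superposition was done, so RMSD values are only comparable when those choices match.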
Affiliation(s)
- Hoa Nhu Le
- University of Houston, Departments of Biology and Biochemistry, Houston, 77204, TX, USA
- Dinler Amaral Antunes
- University of Houston, Departments of Biology and Biochemistry, Houston, 77204, TX, USA
7
Luo J, Fu J, Lu Z, Tu J. Deep learning in integrating spatial transcriptomics with other modalities. Brief Bioinform 2024; 26:bbae719. [PMID: 39800876 PMCID: PMC11725393 DOI: 10.1093/bib/bbae719] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/09/2024] [Revised: 11/05/2024] [Accepted: 12/30/2024] [Indexed: 01/16/2025]
Abstract
Spatial transcriptomics technologies have been extensively applied in biological research, enabling the study of the transcriptome while preserving the spatial context of tissues. Paired with spatial transcriptomics data, platforms often provide histology and/or chromatin images, which capture cellular morphology and chromatin organization. Additionally, single-cell RNA sequencing (scRNA-seq) data from matching tissues often accompany spatial data, offering a transcriptome-wide gene expression profile of individual cells. Integrating such additional data from other modalities can effectively enhance spatial transcriptomics data, and, conversely, spatial transcriptomics data can supplement scRNA-seq with spatial information. Moreover, the rapid development of spatial multi-omics technology has spurred the demand for the integration of spatial multi-omics data to present a more detailed molecular landscape within tissues. Numerous deep learning (DL) methods have been developed for integrating spatial transcriptomics with other modalities. However, a comprehensive review of DL approaches for integrating spatial transcriptomics data with other modalities remains absent. In this study, we systematically review the applications of DL in integrating spatial transcriptomics data with other modalities. We first delineate the DL techniques applied in this integration and the key tasks involved. Next, we detail these methods and categorize them based on integrated modality and key task. Furthermore, we summarize the integration strategies of these integration methods. Finally, we discuss the challenges and future directions in integrating spatial transcriptomics with other modalities, aiming to facilitate the development of robust computational methods that more comprehensively exploit multimodal information.
Affiliation(s)
- Jiajian Luo
- State Key Laboratory of Digital Medical Engineering, School of Biological Science and Medical Engineering, Southeast University, 2 Sipailou, Xuanwu District, Nanjing 210096, China
- Jiye Fu
- State Key Laboratory of Digital Medical Engineering, School of Biological Science and Medical Engineering, Southeast University, 2 Sipailou, Xuanwu District, Nanjing 210096, China
- Zuhong Lu
- State Key Laboratory of Digital Medical Engineering, School of Biological Science and Medical Engineering, Southeast University, 2 Sipailou, Xuanwu District, Nanjing 210096, China
- Jing Tu
- State Key Laboratory of Digital Medical Engineering, School of Biological Science and Medical Engineering, Southeast University, 2 Sipailou, Xuanwu District, Nanjing 210096, China
8
Nagarajan R, Kondo M, Salas F, Sezgin E, Yao Y, Klotzman V, Godambe SA, Khan N, Limon A, Stephenson G, Taraman S, Walton N, Ehwerhemuepha L, Pandit J, Pandita D, Weiss M, Golden C, Gold A, Henderson J, Shippy A, Celi LA, Hogan WR, Oermann EK, Sanger T, Martel S. Economics and Equity of Large Language Models: Health Care Perspective. J Med Internet Res 2024; 26:e64226. [PMID: 39541580 PMCID: PMC11605263 DOI: 10.2196/64226] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/11/2024] [Revised: 08/28/2024] [Accepted: 09/16/2024] [Indexed: 11/16/2024]
Abstract
Large language models (LLMs) continue to exhibit noteworthy capabilities across a spectrum of areas, including emerging proficiencies across the health care continuum. Successful LLM implementation and adoption depend on digital readiness, modern infrastructure, a trained workforce, privacy, and an ethical regulatory landscape. These factors can vary significantly across health care ecosystems, dictating the choice of a particular LLM implementation pathway. This perspective discusses 3 LLM implementation pathways, namely the training from scratch pathway (TSP), fine-tuned pathway (FTP), and out-of-the-box pathway (OBP), as potential onboarding points for health systems while facilitating equitable adoption. The choice of a particular pathway is governed by needs as well as affordability. Therefore, the risks, benefits, and economics of these pathways across 4 major cloud service providers (Amazon, Microsoft, Google, and Oracle) are presented. While cost comparisons, such as on-demand and spot pricing across the cloud service providers for the 3 pathways, are presented for completeness, the usefulness of managed services and cloud enterprise tools is elucidated. Managed services can complement the traditional workforce and expertise, while enterprise tools, such as federated learning, can overcome sample size challenges when implementing LLMs using health care data. Of the 3 pathways, TSP is expected to be the most resource-intensive regarding infrastructure and workforce while providing maximum customization, enhanced transparency, and performance. Because TSP trains the LLM using enterprise health care data, it is expected to harness the digital signatures of the population served by the health care system with the potential to impact outcomes. The use of pretrained models in FTP is a limitation that may impair its performance, because the training data used in the pretrained model may have hidden bias and may not necessarily be health care-related.
However, FTP provides a balance between customization, cost, and performance. While OBP can be rapidly deployed, it provides minimal customization and transparency without guaranteeing long-term availability. OBP may also present challenges in interfacing seamlessly with downstream applications in health care settings with variations in pricing and use over time. Lack of customization in OBP can significantly limit its ability to impact outcomes. Finally, potential applications of LLMs in health care, including conversational artificial intelligence, chatbots, summarization, and machine translation, are highlighted. While the 3 implementation pathways discussed in this perspective have the potential to facilitate equitable adoption and democratization of LLMs, transitions between them may be necessary as the needs of health systems evolve. Understanding the economics and trade-offs of these onboarding pathways can guide their strategic adoption and demonstrate value while impacting health care outcomes favorably.
Affiliation(s)
- Radha Nagarajan
- Children's Hospital of Orange County, Orange, CA, United States
- Midori Kondo
- Fred Hutch Patient Care, Seattle, WA, United States
- Franz Salas
- Amazon Web Services, Detroit, MI, United States
- Emre Sezgin
- Nationwide Children's Hospital, Columbus, OH, United States
- Yuan Yao
- Amazon Web Services, San Francisco, CA, United States
- Naqi Khan
- Amazon Web Services, Seattle, WA, United States
- Alfonso Limon
- Children's Hospital of Orange County, Orange, CA, United States
- Nephi Walton
- National Institutes of Health, Bethesda, MD, United States
- Jay Pandit
- Scripps Research Translational Institute, La Jolla, CA, United States
- Deepti Pandita
- University of California Irvine Health, Irvine, CA, United States
- Michael Weiss
- Children's Hospital of Orange County, Orange, CA, United States
- Charles Golden
- Children's Hospital of Orange County, Orange, CA, United States
- Adam Gold
- Children's Hospital of Orange County, Orange, CA, United States
- John Henderson
- Children's Hospital of Orange County, Orange, CA, United States
- Leo Anthony Celi
- Massachusetts Institute of Technology, Cambridge, MA, United States
- Terence Sanger
- Children's Hospital of Orange County, Orange, CA, United States
- Steven Martel
- Children's Hospital of Orange County, Orange, CA, United States
- Physicians Specialty Faculty, Orange, CA, United States
9
Abhadionmhen AO, Asogwa CN, Ezema ME, Nzeh RC, Ezeora NJ, Abhadiomhen SE, Echezona SC, Udanor CN. Machine Learning Approaches for Microorganism Identification, Virulence Assessment, and Antimicrobial Susceptibility Evaluation Using DNA Sequencing Methods: A Systematic Review. Mol Biotechnol 2024:10.1007/s12033-024-01309-0. [PMID: 39520638 DOI: 10.1007/s12033-024-01309-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/22/2024] [Accepted: 10/16/2024] [Indexed: 11/16/2024]
Abstract
Microbial infections pose a substantial global health challenge, particularly impacting immunocompromised individuals and exacerbating the issue of antimicrobial resistance (AMR). High virulence of pathogens can lead to severe infections and prolonged antimicrobial treatment, increasing the risk of developing resistant strains. Integrating machine-learning (ML) with DNA sequencing technologies offers potential solutions by enhancing microbial identification, virulence assessment, and antimicrobial susceptibility evaluation. This review explores recent advancements in these integrated approaches, addressing current limitations and identifying gaps in the literature. A comprehensive literature search was conducted across databases including PubMed, Scopus, Web of Science, and IEEE Xplore, covering publications from January 2014 to June 2024. Using a detailed Boolean search string, relevant studies focusing on ML applications in microorganism identification, antimicrobial susceptibility testing, and microbial virulence were included. The screening process involved a two-stage review of titles, abstracts, and full texts, with data extraction and critical appraisal performed using the QIAO tool. Data were analyzed through narrative synthesis to identify common themes and innovations. Out of 1,650 initially identified records, 19 studies met the inclusion criteria. These studies primarily focused on AMR, with additional research on microbial virulence and identification. Machine learning algorithms such as Random Forest, Support Vector Machines, and Convolutional Neural Networks, combined with DNA sequencing techniques like Whole Genome Sequencing and Metagenomic Sequencing, demonstrated significant advancements in predictive accuracy and efficiency. High-quality studies achieved impressive performance metrics, including F1-scores up to 0.88 and AUC scores up to 0.96. 
The integration of ML and DNA sequencing technologies has significantly enhanced microbial analysis, improving the identification of pathogens, assessment of virulence, and evaluation of antimicrobial susceptibility. Despite advancements, challenges such as data quality, high costs, and model interpretability persist. This review highlights the need for continued innovation and provides recommendations for future research to address these limitations and improve disease management and public health strategies. The systematic review is registered with PROSPERO (CRD42024571347).
Affiliation(s)
- Modesta Ero Ezema
- Department of Computer Science, University of Nigeria, Nsukka, Nigeria.
- Royransom Chiemela Nzeh
- Department of Computer Science, University of Nigeria, Nsukka, Nigeria
- School of Computer Science and Communication Engineering, JiangSu University, Zhenjiang, 212013, JiangSu, China
10
de Weerd HA, Guala D, Gustafsson M, Synnergren J, Tegnér J, Lubovac-Pilav Z, Magnusson R. Latent space arithmetic on data embeddings from healthy multi-tissue human RNA-seq decodes disease modules. PATTERNS (NEW YORK, N.Y.) 2024; 5:101093. [PMID: 39568475 PMCID: PMC11573900 DOI: 10.1016/j.patter.2024.101093] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/24/2024] [Revised: 08/26/2024] [Accepted: 10/11/2024] [Indexed: 11/22/2024]
Abstract
Computational analyses of transcriptomic data have dramatically improved our understanding of complex diseases. However, such approaches are limited by small sample sets of disease-affected material. We asked if a variational autoencoder trained on large groups of healthy human RNA sequencing (RNA-seq) data can capture the fundamental gene regulation system and generalize to unseen disease changes. Importantly, we found this model to successfully compress unseen transcriptomic changes from 25 independent disease datasets. We decoded disease-specific signals from the latent space and found them to contain more disease-specific genes than the corresponding differential expression analysis in 20 of 25 cases. Finally, we matched these disease signals with known drug targets and extracted sets of known and potential pharmaceutical candidates. In summary, our study demonstrates how data-driven representation learning enables the arithmetic deconstruction of the latent space, facilitating the dissection of disease mechanisms and drug targets.
Affiliation(s)
- Hendrik A de Weerd
- School of Bioscience, Systems Biology Research Center, University of Skövde, 541 45 Skövde, Sweden
- Department of Physics, Chemistry and Biology, Linköping University, 581 83 Linköping, Sweden
- Department of Biomedical Engineering, Linköping University, 581 83 Linköping, Sweden
- Dimitri Guala
- Department of Biochemistry and Biophysics, Stockholm University, 171 21 Solna, Sweden
- Merck AB, 169 70 Solna, Sweden
- Mika Gustafsson
- Department of Physics, Chemistry and Biology, Linköping University, 581 83 Linköping, Sweden
- Jane Synnergren
- School of Bioscience, Systems Biology Research Center, University of Skövde, 541 45 Skövde, Sweden
- Department of Molecular and Clinical Medicine, Institute of Medicine, The Sahlgrenska Academy at University of Gothenburg, 413 45 Gothenburg, Sweden
- Jesper Tegnér
- Biological and Environmental Science and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
- Unit of Computational Medicine, Department of Medicine, Center for Molecular Medicine, Karolinska Institutet, Karolinska University Hospital, L8:05, 171 76, Stockholm, Sweden
- Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
- Science for Life Laboratory, Tomtebodavägen 23A, 171 65, Solna, Sweden
- Zelmina Lubovac-Pilav
- School of Bioscience, Systems Biology Research Center, University of Skövde, 541 45 Skövde, Sweden
- Rasmus Magnusson
- School of Bioscience, Systems Biology Research Center, University of Skövde, 541 45 Skövde, Sweden
- Department of Biomedical Engineering, Linköping University, 581 83 Linköping, Sweden
11
Zhou J, Zhang B, Li G, Chen X, Li H, Xu X, Chen S, He W, Xu C, Liu L, Gao X. An AI Agent for Fully Automated Multi-Omic Analyses. ADVANCED SCIENCE (WEINHEIM, BADEN-WURTTEMBERG, GERMANY) 2024; 11:e2407094. [PMID: 39361263 PMCID: PMC11600294 DOI: 10.1002/advs.202407094] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/25/2024] [Revised: 08/11/2024] [Indexed: 11/28/2024]
Abstract
With the fast-growing and evolving omics data, the demand for streamlined and adaptable tools to handle bioinformatics analysis continues to grow. In response to this need, Automated Bioinformatics Analysis (AutoBA) is introduced, an autonomous AI agent designed explicitly for fully automated multi-omic analyses based on large language models (LLMs). AutoBA simplifies the analytical process by requiring minimal user input while delivering detailed step-by-step plans for various bioinformatics tasks. AutoBA's unique capacity to self-design analysis processes based on input data variations further underscores its versatility. Compared with online bioinformatic services, AutoBA offers multiple LLM backends, with options for both online and local usage, prioritizing data security and user privacy. In comparison to ChatGPT and open-source LLMs, an automated code repair (ACR) mechanism in AutoBA is designed to improve its stability in automated end-to-end bioinformatics analysis tasks. Moreover, different from the predefined pipeline, AutoBA has adaptability in sync with emerging bioinformatics tools. Overall, AutoBA represents an advanced and convenient tool, offering robustness and adaptability for conventional multi-omic analyses.
Grants
- FCC/1/1976-44-01 Global Collaborative Research, King Abdullah University of Science and Technology
- FCC/1/1976-45-01 Global Collaborative Research, King Abdullah University of Science and Technology
- REI/1/5202-01-01 Global Collaborative Research, King Abdullah University of Science and Technology
- REI/1/5234-01-01 Global Collaborative Research, King Abdullah University of Science and Technology
- REI/1/4940-01-01 Global Collaborative Research, King Abdullah University of Science and Technology
- RGC/3/4816-01-01 Global Collaborative Research, King Abdullah University of Science and Technology
- REI/1/0018-01-01 Global Collaborative Research, King Abdullah University of Science and Technology
- REI/1/5414-01-01 Global Collaborative Research, King Abdullah University of Science and Technology
- REI/1/5289-01-01 Global Collaborative Research, King Abdullah University of Science and Technology
- REI/1/5404-01-01 Global Collaborative Research, King Abdullah University of Science and Technology
- Global Collaborative Research, King Abdullah University of Science and Technology
Affiliation(s)
- Juexiao Zhou
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Center of Excellence on Smart Health, King Abdullah University of Science and Technology, Thuwal 23955-6900, Kingdom of Saudi Arabia
- Bin Zhang
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Center of Excellence on Smart Health, King Abdullah University of Science and Technology, Thuwal 23955-6900, Kingdom of Saudi Arabia
- Guowei Li
- Laboratory of Health Intelligence, Huawei Technologies Co., Ltd, Shenzhen 210000, China
- Xiuying Chen
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Center of Excellence on Smart Health, King Abdullah University of Science and Technology, Thuwal 23955-6900, Kingdom of Saudi Arabia
- Haoyang Li
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Center of Excellence on Smart Health, King Abdullah University of Science and Technology, Thuwal 23955-6900, Kingdom of Saudi Arabia
- Xiaopeng Xu
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Center of Excellence on Smart Health, King Abdullah University of Science and Technology, Thuwal 23955-6900, Kingdom of Saudi Arabia
- Siyuan Chen
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Center of Excellence on Smart Health, King Abdullah University of Science and Technology, Thuwal 23955-6900, Kingdom of Saudi Arabia
- Wenjia He
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Center of Excellence on Smart Health, King Abdullah University of Science and Technology, Thuwal 23955-6900, Kingdom of Saudi Arabia
- Chencheng Xu
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Center of Excellence on Smart Health, King Abdullah University of Science and Technology, Thuwal 23955-6900, Kingdom of Saudi Arabia
- Liwei Liu
- Advanced Computing and Storage Laboratory, Central Research Institute, 2012 Laboratories, Huawei Technologies Co., Ltd, Nanjing, Jiangsu 210000, China
- Xin Gao
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Center of Excellence on Smart Health, King Abdullah University of Science and Technology, Thuwal 23955-6900, Kingdom of Saudi Arabia
12
Silvestro D, Latrille T, Salamin N. Toward a Semi-Supervised Learning Approach to Phylogenetic Estimation. Syst Biol 2024; 73:789-806. [PMID: 38916476 PMCID: PMC11639169 DOI: 10.1093/sysbio/syae029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2023] [Revised: 05/21/2024] [Accepted: 06/24/2024] [Indexed: 06/26/2024]
Abstract
Models have always been central to inferring molecular evolution and to reconstructing phylogenetic trees. Their use typically involves the development of a mechanistic framework reflecting our understanding of the underlying biological processes, such as nucleotide substitutions, and the estimation of model parameters by maximum likelihood or Bayesian inference. However, deriving and optimizing the likelihood of the data is not always possible under complex evolutionary scenarios or even tractable for large datasets, often leading to unrealistic simplifying assumptions in the fitted models. To overcome this issue, we coupled stochastic simulations of genome evolution with a new supervised deep-learning model to infer key parameters of molecular evolution. Our model is designed to directly analyze multiple sequence alignments and estimate per-site evolutionary rates and divergence without requiring a known phylogenetic tree. The accuracy of our predictions matched that of likelihood-based phylogenetic inference when rate heterogeneity followed a simple gamma distribution, but it strongly exceeded it under more complex patterns of rate variation, such as codon models. Our approach is highly scalable and can be efficiently applied to genomic data, as we showed on a dataset of 26 million nucleotides from the clownfish clade. Our simulations also showed that the integration of per-site rates obtained by deep learning within a Bayesian framework led to significantly more accurate phylogenetic inference, particularly with respect to the estimated branch lengths. We thus propose that future advancements in phylogenetic analysis will benefit from a semi-supervised learning approach that combines deep-learning estimation of substitution rates, which allows for more flexible models of rate variation, and probabilistic inference of the phylogenetic tree, which guarantees interpretability and a rigorous assessment of statistical support.
Affiliation(s)
- Daniele Silvestro
- Department of Biology, University of Fribourg and Swiss Institute of Bioinformatics, 1700 Fribourg, Switzerland
- Department of Biological and Environmental Sciences, Gothenburg Global Biodiversity Centre, University of Gothenburg, 40530 Gothenburg, Sweden
- Thibault Latrille
- Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland
- Nicolas Salamin
- Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland
13
de Crécy-Lagard V, Dias R, Friedberg I, Yuan Y, Swairjo MA. Limitations of Current Machine-Learning Models in Predicting Enzymatic Functions for Uncharacterized Proteins. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.07.01.601547. [PMID: 39005379 PMCID: PMC11244979 DOI: 10.1101/2024.07.01.601547] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/16/2024]
Abstract
Thirty to seventy percent of proteins in any given genome have no assigned function and have been labeled as the protein "unknome". This large knowledge gap prevents the biological community from fully leveraging the plethora of genomic data that is now available. Machine-learning approaches are showing some promise in propagating functional knowledge from experimentally characterized proteins to the correct set of isofunctional orthologs. However, they largely fail to predict enzymatic functions unseen in the training set, as shown by dissecting the predictions made for over 450 enzymes of unknown function from the model bacterium Escherichia coli using the DeepECTransformer platform. Lessons from these failures can help the community develop machine-learning methods that assist domain experts in making testable functional predictions for more members of the uncharacterized proteome. Article Summary: Many proteins in any genome, ranging from 30 to 70%, lack an assigned function. This knowledge gap limits the full use of the vast available genomic data. Machine learning has shown promise in transferring functional knowledge from proteins of known functions to similar ones, but largely fails to predict novel functions not seen in its training data. Understanding these failures can guide the development of better machine-learning methods to help experts make accurate functional predictions for uncharacterized proteins.
14
Li ZL, Pei S, Chen Z, Huang TY, Wang XD, Shen L, Chen X, Wang QQ, Wang DX, Ao YF. Machine learning-assisted amidase-catalytic enantioselectivity prediction and rational design of variants for improving enantioselectivity. Nat Commun 2024; 15:8778. [PMID: 39389964 PMCID: PMC11467325 DOI: 10.1038/s41467-024-53048-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/26/2024] [Accepted: 09/30/2024] [Indexed: 10/12/2024]
Abstract
Biocatalysis is an attractive approach for the synthesis of chiral pharmaceuticals and fine chemicals, but assessing and/or improving the enantioselectivity of a biocatalyst towards target substrates is often time and resource intensive. Although machine learning has been used to reveal the underlying relationship between protein sequences and biocatalytic enantioselectivity, the establishment of substrate fitness space is usually disregarded by chemists and is still a challenge. Using 240 datasets collected in our previous works, we adopt chemistry and geometry descriptors and build random forest classification models for predicting the enantioselectivity of amidase towards new substrates. We further propose a heuristic strategy based on these models, by which rational protein engineering can be efficiently performed to synthesize chiral compounds with higher ee values, and the optimized variant results in a 53-fold higher E-value compared to the wild-type amidase. This data-driven methodology is expected to broaden the application of machine learning in biocatalysis research.
Affiliation(s)
- Zi-Lin Li
- Beijing National Laboratory for Molecular Sciences, CAS Key Laboratory of Molecular Recognition and Function, Institute of Chemistry, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
- Shuxin Pei
- Key Laboratory of Theoretical and Computational Photochemistry of Ministry of Education, College of Chemistry, Beijing Normal University, Beijing, China
- Ziying Chen
- Key Laboratory of Theoretical and Computational Photochemistry of Ministry of Education, College of Chemistry, Beijing Normal University, Beijing, China
- Teng-Yu Huang
- Beijing National Laboratory for Molecular Sciences, CAS Key Laboratory of Molecular Recognition and Function, Institute of Chemistry, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
- Xu-Dong Wang
- Beijing National Laboratory for Molecular Sciences, CAS Key Laboratory of Molecular Recognition and Function, Institute of Chemistry, Chinese Academy of Sciences, Beijing, China
- Lin Shen
- Key Laboratory of Theoretical and Computational Photochemistry of Ministry of Education, College of Chemistry, Beijing Normal University, Beijing, China.
- Yantai-Jingshi Institute of Material Genome Engineering, Yantai, China.
- Xuebo Chen
- Key Laboratory of Theoretical and Computational Photochemistry of Ministry of Education, College of Chemistry, Beijing Normal University, Beijing, China.
- Yantai-Jingshi Institute of Material Genome Engineering, Yantai, China.
- Shandong Laboratory of Yantai Advanced Materials and Green Manufacturing, Yantai, China.
- Qi-Qiang Wang
- Beijing National Laboratory for Molecular Sciences, CAS Key Laboratory of Molecular Recognition and Function, Institute of Chemistry, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
- De-Xian Wang
- Beijing National Laboratory for Molecular Sciences, CAS Key Laboratory of Molecular Recognition and Function, Institute of Chemistry, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
- Yu-Fei Ao
- Beijing National Laboratory for Molecular Sciences, CAS Key Laboratory of Molecular Recognition and Function, Institute of Chemistry, Chinese Academy of Sciences, Beijing, China.
- University of Chinese Academy of Sciences, Beijing, China.
15
Li Q, Geng S, Luo H, Wang W, Mo YQ, Luo Q, Wang L, Song GB, Sheng JP, Xu B. Signaling pathways involved in colorectal cancer: pathogenesis and targeted therapy. Signal Transduct Target Ther 2024; 9:266. [PMID: 39370455 PMCID: PMC11456611 DOI: 10.1038/s41392-024-01953-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2024] [Revised: 07/25/2024] [Accepted: 08/16/2024] [Indexed: 10/08/2024]
Abstract
Colorectal cancer (CRC) remains one of the leading causes of cancer-related mortality worldwide. Its complexity is influenced by various signal transduction networks that govern cellular proliferation, survival, differentiation, and apoptosis. The pathogenesis of CRC is a testament to the dysregulation of these signaling cascades, which culminates in the malignant transformation of colonic epithelium. This review aims to dissect the foundational signaling mechanisms implicated in CRC, to elucidate the generalized principles underpinning neoplastic evolution and progression. We discuss the molecular hallmarks of CRC, including the genomic, epigenomic and microbial features of CRC to highlight the role of signal transduction in the orchestration of the tumorigenic process. Concurrently, we review the advent of targeted and immune therapies in CRC, assessing their impact on the current clinical landscape. The development of these therapies has been informed by a deepening understanding of oncogenic signaling, leading to the identification of key nodes within these networks that can be exploited pharmacologically. Furthermore, we explore the potential of integrating AI to enhance the precision of therapeutic targeting and patient stratification, emphasizing their role in personalized medicine. In summary, our review captures the dynamic interplay between aberrant signaling in CRC pathogenesis and the concerted efforts to counteract these changes through targeted therapeutic strategies, ultimately aiming to pave the way for improved prognosis and personalized treatment modalities in colorectal cancer.
Affiliation(s)
- Qing Li
- The Shapingba Hospital, Chongqing University, Chongqing, China
- Chongqing Key Laboratory of Intelligent Oncology for Breast Cancer, Chongqing University Cancer Hospital and School of Medicine, Chongqing University, Chongqing, China
- Key Laboratory of Biorheological Science and Technology, Ministry of Education, College of Bioengineering, Chongqing University, Chongqing, China
- Shan Geng
- Central Laboratory, The Affiliated Dazu Hospital of Chongqing Medical University, Chongqing, China
- Hao Luo
- Key Laboratory of Biorheological Science and Technology, Ministry of Education, College of Bioengineering, Chongqing University, Chongqing, China
- Cancer Center, Daping Hospital, Army Medical University, Chongqing, China
- Wei Wang
- Chongqing Municipal Health and Health Committee, Chongqing, China
- Ya-Qi Mo
- Chongqing Key Laboratory of Intelligent Oncology for Breast Cancer, Chongqing University Cancer Hospital and School of Medicine, Chongqing University, Chongqing, China
- Qing Luo
- Key Laboratory of Biorheological Science and Technology, Ministry of Education, College of Bioengineering, Chongqing University, Chongqing, China
- Lu Wang
- Chongqing Key Laboratory of Intelligent Oncology for Breast Cancer, Chongqing University Cancer Hospital and School of Medicine, Chongqing University, Chongqing, China
- Guan-Bin Song
- Key Laboratory of Biorheological Science and Technology, Ministry of Education, College of Bioengineering, Chongqing University, Chongqing, China.
- Jian-Peng Sheng
- College of Artificial Intelligence, Nanjing University of Aeronautics and Astronautics, Nanjing, China.
- Bo Xu
- Chongqing Key Laboratory of Intelligent Oncology for Breast Cancer, Chongqing University Cancer Hospital and School of Medicine, Chongqing University, Chongqing, China.
16
Capitanchik C, Wilkins OG, Wagner N, Gagneur J, Ule J. From computational models of the splicing code to regulatory mechanisms and therapeutic implications. Nat Rev Genet 2024:10.1038/s41576-024-00774-2. [PMID: 39358547 DOI: 10.1038/s41576-024-00774-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 08/27/2024] [Indexed: 10/04/2024]
Abstract
Since the discovery of RNA splicing and its role in gene expression, researchers have sought a set of rules, an algorithm or a computational model that could predict the splice isoforms, and their frequencies, produced from any transcribed gene in a specific cellular context. Over the past 30 years, these models have evolved from simple position weight matrices to deep-learning models capable of integrating sequence data across vast genomic distances. Most recently, new model architectures are moving the field closer to context-specific alternative splicing predictions, and advances in sequencing technologies are expanding the type of data that can be used to inform and interpret such models. Together, these developments are driving improved understanding of splicing regulatory mechanisms and emerging applications of the splicing code to the rational design of RNA- and splicing-based therapeutics.
Affiliation(s)
- Charlotte Capitanchik
- The Francis Crick Institute, London, UK
- UK Dementia Research Institute at King's College London, London, UK
- Department of Basic and Clinical Neuroscience, Institute of Psychiatry Psychology & Neuroscience, King's College London, London, UK
- Oscar G Wilkins
- The Francis Crick Institute, London, UK
- UCL Queen Square Motor Neuron Disease Centre, Department of Neuromuscular Diseases, UCL Queen Square Institute of Neurology, UCL, London, UK
- Nils Wagner
- School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
- Helmholtz Association - Munich School for Data Science (MUDS), Munich, Germany
- Julien Gagneur
- School of Computation, Information and Technology, Technical University of Munich, Garching, Germany.
- Institute of Human Genetics, School of Medicine, Technical University of Munich, Munich, Germany.
- Computational Health Center, Helmholtz Center Munich, Neuherberg, Germany.
- Jernej Ule
- The Francis Crick Institute, London, UK.
- UK Dementia Research Institute at King's College London, London, UK.
- Department of Basic and Clinical Neuroscience, Institute of Psychiatry Psychology & Neuroscience, King's College London, London, UK.
- National Institute of Chemistry, Ljubljana, Slovenia.
17
Lam HYI, Ong XE, Mutwil M. Large language models in plant biology. TRENDS IN PLANT SCIENCE 2024; 29:1145-1155. [PMID: 38797656 DOI: 10.1016/j.tplants.2024.04.013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/30/2024] [Revised: 04/29/2024] [Accepted: 04/30/2024] [Indexed: 05/29/2024]
Abstract
Large language models (LLMs), such as ChatGPT, have taken the world by storm. However, LLMs are not limited to human language and can be used to analyze sequential data, such as DNA, protein, and gene expression. The resulting foundation models can be repurposed to identify the complex patterns within the data, resulting in powerful, multipurpose prediction tools able to predict the state of cellular systems. This review outlines the different types of LLMs and showcases their recent uses in biology. Since LLMs have not yet been embraced by the plant community, we also cover how these models can be deployed for the plant kingdom.
Affiliation(s)
- Hilbert Yuen In Lam
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore, 637551, Singapore
- Xing Er Ong
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore, 637551, Singapore
- Marek Mutwil
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore, 637551, Singapore.
18
Qiao B, Wang S, Hou M, Chen H, Zhou Z, Xie X, Pang S, Yang C, Yang F, Zou Q, Sun S. Identifying nucleotide-binding leucine-rich repeat receptor and pathogen effector pairing using transfer-learning and bilinear attention network. BIOINFORMATICS (OXFORD, ENGLAND) 2024; 40:btae581. [PMID: 39331576 DOI: 10.1093/bioinformatics/btae581] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/28/2024] [Revised: 08/24/2024] [Accepted: 09/25/2024] [Indexed: 09/29/2024]
Abstract
MOTIVATION: The nucleotide-binding leucine-rich repeat (NLR) family is a class of immune receptors capable of detecting and defending against pathogen invasion. They have been widely used in crop breeding. Notably, the correspondence between NLRs and effectors (CNE) determines the applicability and effectiveness of NLRs. Unfortunately, CNE data are very scarce. In fact, we have found 91,291 NLRs confirmed via wet experiments and bioinformatics methods, but only 387 CNEs are recognized, which greatly restricts the potential application of NLRs. RESULTS: We propose a deep learning algorithm called ProNEP to identify NLR-effector pairs in a high-throughput manner. Specifically, we conceptualized the CNE prediction task as a protein-protein interaction (PPI) prediction task. Then, ProNEP predicts the interaction between NLRs and effectors by combining transfer learning with a bilinear attention network. ProNEP achieves superior performance against state-of-the-art models designed for PPI predictions. Based on ProNEP, we conduct extensive identification of potential CNEs for 91,291 NLRs. With the rapid accumulation of genomic data, we expect that this tool will be widely used to predict CNEs in new species, advancing biology, immunology, and breeding. AVAILABILITY AND IMPLEMENTATION: ProNEP is available at http://nerrd.cn/#/prediction. The project code is available at https://github.com/QiaoYJYJ/ProNEP.
Affiliation(s)
- Baixue Qiao
- Key Laboratory of Saline-Alkali Vegetation Ecology Restoration, Ministry of Education (Northeast Forestry University), Harbin 150001, China
- State Key Laboratory of Tree Genetics and Breeding, Northeast Forestry University, Harbin 150001, China
- Shuda Wang
- Key Laboratory of Saline-Alkali Vegetation Ecology Restoration, Ministry of Education (Northeast Forestry University), Harbin 150001, China
- State Key Laboratory of Tree Genetics and Breeding, Northeast Forestry University, Harbin 150001, China
- Mingjun Hou
- Key Laboratory of Saline-Alkali Vegetation Ecology Restoration, Ministry of Education (Northeast Forestry University), Harbin 150001, China
- Haodi Chen
- Key Laboratory of Saline-Alkali Vegetation Ecology Restoration, Ministry of Education (Northeast Forestry University), Harbin 150001, China
- Zhengwenyang Zhou
- Key Laboratory of Saline-Alkali Vegetation Ecology Restoration, Ministry of Education (Northeast Forestry University), Harbin 150001, China
- Xueying Xie
- Key Laboratory of Saline-Alkali Vegetation Ecology Restoration, Ministry of Education (Northeast Forestry University), Harbin 150001, China
- Shaozi Pang
- Key Laboratory of Saline-Alkali Vegetation Ecology Restoration, Ministry of Education (Northeast Forestry University), Harbin 150001, China
- Chunxue Yang
- College of Landscape Architecture, Northeast Forestry University, Harbin 150001, China
- Fenglong Yang
- Department of Bioinformatics, Fujian Key Laboratory of Medical Bioinformatics, School of Medical Technology and Engineering, Fujian Medical University, Fuzhou 350122, China
- Key Laboratory of Ministry of Education for Gastrointestinal Cancer, School of Basic Medical Sciences, Fujian Medical University, Fuzhou 350122, China
- Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 611731, China
- Shanwen Sun
- Key Laboratory of Saline-Alkali Vegetation Ecology Restoration, Ministry of Education (Northeast Forestry University), Harbin 150001, China
- State Key Laboratory of Tree Genetics and Breeding, Northeast Forestry University, Harbin 150001, China
19
Qin Z, Ren H, Zhao P, Wang K, Liu H, Miao C, Du Y, Li J, Wu L, Chen Z. Current computational tools for protein lysine acylation site prediction. Brief Bioinform 2024; 25:bbae469. [PMID: 39316944 PMCID: PMC11421846 DOI: 10.1093/bib/bbae469] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/31/2024] [Revised: 08/20/2024] [Accepted: 09/07/2024] [Indexed: 09/26/2024]
Abstract
As a main subtype of post-translational modification (PTM), protein lysine acylations (PLAs) play crucial roles in regulating diverse functions of proteins. With recent advancements in proteomics technology, the identification of PTM is becoming a data-rich field. A large amount of experimentally verified data is urgently required to be translated into valuable biological insights. With computational approaches, PLA can be accurately detected across the whole proteome, even for organisms with small-scale datasets. Herein, a comprehensive summary of 166 in silico PLA prediction methods is presented, including a single type of PLA site and multiple types of PLA sites. This recapitulation covers important aspects that are critical for the development of a robust predictor, including data collection and preparation, sample selection, feature representation, classification algorithm design, model evaluation, and method availability. Notably, we discuss the application of protein language models and transfer learning to solve the small-sample learning issue. We also highlight the prediction methods developed for functionally relevant PLA sites and species/substrate/cell-type-specific PLA sites. In conclusion, this systematic review could potentially facilitate the development of novel PLA predictors and offer useful insights to researchers from various disciplines.
Affiliation(s)
- Zhaohui Qin
- Collaborative Innovation Center of Henan Grain Crops, Henan Key Laboratory of Rice Molecular Breeding and High Efficiency Production, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China
- Haoran Ren
- Collaborative Innovation Center of Henan Grain Crops, Henan Key Laboratory of Rice Molecular Breeding and High Efficiency Production, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China
- Pei Zhao
- State Key Laboratory of Cotton Biology, Institute of Cotton Research of Chinese Academy of Agricultural Sciences (CAAS), Anyang 455000, China
- Kaiyuan Wang
- Collaborative Innovation Center of Henan Grain Crops, Henan Key Laboratory of Rice Molecular Breeding and High Efficiency Production, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China
- Huixia Liu
- Collaborative Innovation Center of Henan Grain Crops, Henan Key Laboratory of Rice Molecular Breeding and High Efficiency Production, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China
- Chunbo Miao
- Collaborative Innovation Center of Henan Grain Crops, Henan Key Laboratory of Rice Molecular Breeding and High Efficiency Production, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China
- Yanxiu Du
- Collaborative Innovation Center of Henan Grain Crops, Henan Key Laboratory of Rice Molecular Breeding and High Efficiency Production, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China
- Junzhou Li
- Collaborative Innovation Center of Henan Grain Crops, Henan Key Laboratory of Rice Molecular Breeding and High Efficiency Production, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China
- Liuji Wu
- National Key Laboratory of Wheat and Maize Crop Science, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China
- Zhen Chen
- Collaborative Innovation Center of Henan Grain Crops, Henan Key Laboratory of Rice Molecular Breeding and High Efficiency Production, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China
|
20
|
Li Q, Hu Z, Wang Y, Li L, Fan Y, King I, Jia G, Wang S, Song L, Li Y. Progress and opportunities of foundation models in bioinformatics. Brief Bioinform 2024; 25:bbae548. [PMID: 39461902 PMCID: PMC11512649 DOI: 10.1093/bib/bbae548] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/30/2024] [Revised: 08/20/2024] [Accepted: 10/12/2024] [Indexed: 10/29/2024] Open
Abstract
Bioinformatics has undergone a paradigm shift in artificial intelligence (AI), particularly through foundation models (FMs), which address longstanding challenges in bioinformatics such as limited annotated data and data noise. These AI techniques have demonstrated remarkable efficacy across various downstream validation tasks, effectively representing diverse biological entities and heralding a new era in computational biology. The primary goal of this survey is to conduct a general investigation and summary of FMs in bioinformatics, tracing their evolutionary trajectory, current research landscape, and methodological frameworks. Our primary focus is on elucidating the application of FMs to specific biological problems, offering insights to guide the research community in choosing appropriate FMs for tasks like sequence analysis, structure prediction, and function annotation. Each section delves into the intricacies of the targeted challenges, contrasting the architectures and advancements of FMs with conventional methods and showcasing their utility across different biological domains. Further, this review scrutinizes the hurdles and constraints encountered by FMs in biology, including issues of data noise, model interpretability, and potential biases. This analysis provides a theoretical groundwork for understanding the circumstances under which certain FMs may exhibit suboptimal performance. Lastly, we outline prospective pathways and methodologies for the future development of FMs in biological research, facilitating ongoing innovation in the field. This comprehensive examination not only serves as an academic reference but also as a roadmap for forthcoming explorations and applications of FMs in biology.
Affiliation(s)
- Qing Li
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, 999077, China
- Zhihang Hu
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, 999077, China
- Yixuan Wang
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, 999077, China
- Lei Li
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, 999077, China
- Yimin Fan
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, 999077, China
- Irwin King
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, 999077, China
- Gengjie Jia
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong, 518120, China
- Sheng Wang
- Shanghai Zelixir Biotech Company Ltd., Shanghai, 200030, China
- Shenzhen Institute of Advanced Technology, Xueyuan Avenue, Shenzhen University Town, Nanshan District, Shenzhen, Guangdong, 518055, China
- Le Song
- BioMap, Zhongguancun Life Science Park, Haidian District, Beijing, 100085, China
- Yu Li
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, 999077, China
|
21
|
Qin Z, Yuan B, Qu G, Sun Z. Rational enzyme design by reducing the number of hotspots and library size. Chem Commun (Camb) 2024; 60:10451-10463. [PMID: 39210728 DOI: 10.1039/d4cc01394h] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/04/2024]
Abstract
Biocatalysts that are eco-friendly, sustainable, and highly specific have great potential for applications in the production of fine chemicals, food, detergents, biofuels, pharmaceuticals, and more. However, due to factors such as low activity, narrow substrate scope, poor thermostability, or incorrect selectivity, most natural enzymes cannot be directly used for large-scale production of the desired products. To overcome these obstacles, protein engineering methods have been developed over decades and have become powerful and versatile tools for adapting enzymes with improved catalytic properties or new functions. The vastness of the protein sequence space makes screening a bottleneck in obtaining advantageous mutated enzymes in traditional directed evolution. In the realm of mathematics, there are two major constraints in the protein sequence space: (1) the number of residue substitutions (M); and (2) the number of codons encoding amino acids as building blocks (N). This feature review highlights protein engineering strategies to reduce screening efforts from two dimensions by reducing the numbers M and N, and also discusses representative seminal studies of rationally engineered natural enzymes to deliver new catalytic functions.
Affiliation(s)
- Zongmin Qin
- University of Chinese Academy of Sciences, Beijing 100049, China
- Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China.
- National Center of Technology Innovation for Synthetic Biology, Tianjin 300308, China
- Bo Yuan
- University of Chinese Academy of Sciences, Beijing 100049, China
- Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China.
- National Center of Technology Innovation for Synthetic Biology, Tianjin 300308, China
- Key Laboratory of Engineering Biology for Low-Carbon Manufacturing, Tianjin 300308, China
- Ge Qu
- University of Chinese Academy of Sciences, Beijing 100049, China
- Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China.
- National Center of Technology Innovation for Synthetic Biology, Tianjin 300308, China
- Key Laboratory of Engineering Biology for Low-Carbon Manufacturing, Tianjin 300308, China
- Zhoutong Sun
- University of Chinese Academy of Sciences, Beijing 100049, China
- Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China.
- National Center of Technology Innovation for Synthetic Biology, Tianjin 300308, China
- Key Laboratory of Engineering Biology for Low-Carbon Manufacturing, Tianjin 300308, China
|
22
|
Yang B, Zhou X, Liu S. Tracing the genealogy origin of geographic populations based on genomic variation and deep learning. Mol Phylogenet Evol 2024; 198:108142. [PMID: 38964594 DOI: 10.1016/j.ympev.2024.108142] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/09/2023] [Revised: 05/30/2024] [Accepted: 07/01/2024] [Indexed: 07/06/2024]
Abstract
Assigning a query individual animal or plant to its derived population is a prime task in diverse applications related to organismal genealogy. Such endeavors have conventionally relied on short DNA sequences under a phylogenetic framework. These methods naturally show constraints when the inferred population sources are ambiguously phylogenetically structured, a scenario demanding substantially more informative genetic signals. Recent advances in cost-effective production of whole-genome sequences and artificial intelligence have created an unprecedented opportunity to trace the population origin for essentially any given individual, as long as the genome reference data are comprehensive and standardized. Here, we developed a convolutional neural network method to identify population origins using genomic SNPs. Three empirical datasets (an Asian honeybee, a red fire ant, and a chicken dataset) and two simulated populations are used as proof of concept. The performance tests indicate that our method can accurately identify the genealogy origin of query individuals, with success rates ranging from 93% to 100%. We further showed that the accuracy of the model can be significantly increased by refining the informative sites through FST filtering. Our method is robust to configurations related to batch sizes and epochs, whereas model learning benefits from the setting of a proper preset learning rate. Moreover, we explained the importance score of key sites for algorithm interpretability and credibility, which has been largely ignored. We anticipate that by coupling genomics and deep learning, our method will see broad potential in conservation and management applications that involve natural resources, invasive pests and weeds, and illegal trades of wildlife products.
Affiliation(s)
- Bing Yang
- Department of Entomology, China Agricultural University, Beijing 100193, China
- Xin Zhou
- Department of Entomology, China Agricultural University, Beijing 100193, China.
- Shanlin Liu
- Department of Entomology, China Agricultural University, Beijing 100193, China; Key Laboratory of Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China.
|
23
|
McCarthy S, Gonen S. δ-Conotoxin Structure Prediction and Analysis through Large-Scale Comparative and Deep Learning Modeling Approaches. ADVANCED SCIENCE (WEINHEIM, BADEN-WURTTEMBERG, GERMANY) 2024; 11:e2404786. [PMID: 39033537 PMCID: PMC11425241 DOI: 10.1002/advs.202404786] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/03/2024] [Revised: 06/27/2024] [Indexed: 07/23/2024]
Abstract
The δ-conotoxins, a class of peptides produced in the venom of cone snails, are of interest due to their ability to inhibit the inactivation of voltage-gated sodium channels causing paralysis and other neurological responses, but difficulties in their isolation and synthesis have made structural characterization challenging. Taking advantage of recent breakthroughs in computational algorithms for structure prediction that have made modeling especially useful when experimental data is sparse, this work uses both the deep-learning-based algorithm AlphaFold and comparative modeling method RosettaCM to model and analyze 18 previously uncharacterized δ-conotoxins derived from piscivorous, vermivorous, and molluscivorous cone snails. The models provide useful insights into the structural aspects of these peptides and suggest features likely to be significant in influencing their binding and different pharmacological activities against their targets, with implications for drug development. Additionally, the described protocol provides a roadmap for the modeling of similar disulfide-rich peptides by these complementary methods.
Affiliation(s)
- Stephen McCarthy
- Department of Molecular Biology and Biochemistry, University of California, Irvine, CA, 92697, USA
- Shane Gonen
- Department of Molecular Biology and Biochemistry, University of California, Irvine, CA, 92697, USA
|
24
|
Stock M, Van Criekinge W, Boeckaerts D, Taelman S, Van Haeverbeke M, Dewulf P, De Baets B. Hyperdimensional computing: A fast, robust, and interpretable paradigm for biological data. PLoS Comput Biol 2024; 20:e1012426. [PMID: 39316621 PMCID: PMC11421772 DOI: 10.1371/journal.pcbi.1012426] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/26/2024] Open
Abstract
Advances in bioinformatics are primarily due to new algorithms for processing diverse biological data sources. While sophisticated alignment algorithms have been pivotal in analyzing biological sequences, deep learning has substantially transformed bioinformatics, addressing sequence, structure, and functional analyses. However, these methods are incredibly data-hungry, compute-intensive, and hard to interpret. Hyperdimensional computing (HDC) has recently emerged as an exciting alternative. The key idea is that random vectors of high dimensionality can represent concepts such as sequence identity or phylogeny. These vectors can then be combined using simple operators for learning, reasoning, or querying by exploiting the peculiar properties of high-dimensional spaces. Our work reviews and explores HDC's potential for bioinformatics, emphasizing its efficiency, interpretability, and adeptness in handling multimodal and structured data. HDC holds great potential for various omics data searching, biosignal analysis, and health applications.
Affiliation(s)
- Michiel Stock
- KERMIT Research Unit, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium
- Wim Van Criekinge
- Biobix Research Unit, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium
- Dimitri Boeckaerts
- KERMIT Research Unit, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium
- Laboratory of Applied Biotechnology, Department of Biotechnology, Ghent University, Ghent, Belgium
- Steff Taelman
- KERMIT Research Unit, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium
- Biobix Research Unit, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium
- BioLizard nv, Ghent, Belgium
- Maxime Van Haeverbeke
- KERMIT Research Unit, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium
- Pieter Dewulf
- KERMIT Research Unit, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium
- Bernard De Baets
- KERMIT Research Unit, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium
|
25
|
Peng S, Rajjou L. Advancing plant biology through deep learning-powered natural language processing. PLANT CELL REPORTS 2024; 43:208. [PMID: 39102077 DOI: 10.1007/s00299-024-03294-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/07/2024] [Accepted: 07/19/2024] [Indexed: 08/06/2024]
Abstract
The application of deep learning methods, specifically the utilization of Large Language Models (LLMs), in the field of plant biology holds significant promise for generating novel knowledge on plant cell systems. The LLM framework exhibits exceptional potential, particularly with the development of Protein Language Models (PLMs), allowing for in-depth analyses of nucleic acid and protein sequences. This analytical capacity facilitates the discernment of intricate patterns and relationships within biological data, encompassing multi-scale information within DNA or protein sequences. The contribution of PLMs extends beyond mere sequence patterns and structure-function recognition; it also supports advancements in genetic improvements for agriculture. The integration of deep learning approaches into the domain of plant sciences offers opportunities for major breakthroughs in basic research across multi-scale plant traits. Consequently, the strategic application of deep learning methodologies, particularly leveraging the potential of LLMs, will undoubtedly play a pivotal role in advancing plant sciences, plant production, and plant uses, and in propelling the trajectory toward sustainable agroecological and agro-food transitions.
Affiliation(s)
- Shuang Peng
- Université Paris-Saclay, INRAE, AgroParisTech, Institut Jean-Pierre Bourgin for Plant Sciences (IJPB), 78000, Versailles, France
- Loïc Rajjou
- Université Paris-Saclay, INRAE, AgroParisTech, Institut Jean-Pierre Bourgin for Plant Sciences (IJPB), 78000, Versailles, France.
|
26
|
Lobentanzer S, Rodriguez-Mier P, Bauer S, Saez-Rodriguez J. Molecular causality in the advent of foundation models. Mol Syst Biol 2024; 20:848-858. [PMID: 38890548 PMCID: PMC11297329 DOI: 10.1038/s44320-024-00041-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/17/2024] [Revised: 03/18/2024] [Accepted: 03/21/2024] [Indexed: 06/20/2024] Open
Abstract
Correlation is not causation: this simple and uncontroversial statement has far-reaching implications. Defining and applying causality in biomedical research has posed significant challenges to the scientific community. In this perspective, we attempt to connect the partly disparate fields of systems biology, causal reasoning, and machine learning to inform future approaches in the field of systems biology and molecular medicine.
Affiliation(s)
- Sebastian Lobentanzer
- Heidelberg University, Faculty of Medicine and Heidelberg University Hospital, Institute for Computational Biomedicine, Heidelberg, Germany.
- Pablo Rodriguez-Mier
- Heidelberg University, Faculty of Medicine and Heidelberg University Hospital, Institute for Computational Biomedicine, Heidelberg, Germany
- Julio Saez-Rodriguez
- Heidelberg University, Faculty of Medicine and Heidelberg University Hospital, Institute for Computational Biomedicine, Heidelberg, Germany.
|
27
|
Huang B, Guo L, Yin H, Wu Y, Zeng Z, Xu S, Lou Y, Ai Z, Zhang W, Kan X, Yu Q, Du S, Li C, Wu L, Huang X, Wang S, Wang X. Deep learning enhancing guide RNA design for CRISPR/Cas12a-based diagnostics. IMETA 2024; 3:e214. [PMID: 39135699 PMCID: PMC11316927 DOI: 10.1002/imt2.214] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/07/2024] [Revised: 05/27/2024] [Accepted: 05/27/2024] [Indexed: 08/15/2024]
Abstract
Rapid and accurate diagnostic tests are fundamental for improving patient outcomes and combating infectious diseases. The Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) Cas12a-based detection system has emerged as a promising solution for on-site nucleic acid testing. Nonetheless, the effective design of CRISPR RNA (crRNA) for Cas12a-based detection remains challenging and time-consuming. In this study, we propose an enhanced crRNA design system with deep learning for Cas12a-mediated diagnostics, referred to as EasyDesign. This system employs an optimized convolutional neural network (CNN) prediction model, trained on a comprehensive data set comprising 11,496 experimentally validated Cas12a-based detection cases, encompassing a wide spectrum of prevalent pathogens, achieving Spearman's ρ = 0.812. We further assessed the model performance in crRNA design for four pathogens not included in the training data: Monkeypox Virus, Enterovirus 71, Coxsackievirus A16, and Listeria monocytogenes. The results demonstrated superior prediction performance compared to the traditional experiment screening. Furthermore, we have developed an interactive web server (https://crispr.zhejianglab.com/) that integrates EasyDesign with recombinase polymerase amplification (RPA) primer design, enhancing user accessibility. Through this web-based platform, we successfully designed optimal Cas12a crRNAs for six human papillomavirus (HPV) subtypes. Remarkably, all the top five predicted crRNAs for each HPV subtype exhibited robust fluorescent signals in CRISPR assays, thereby suggesting that the platform could effectively facilitate clinical sample testing. In conclusion, EasyDesign offers a rapid and reliable solution for crRNA design in Cas12a-based detection, which could serve as a valuable tool for clinical diagnostics and research applications.
Affiliation(s)
- Yue Wu
- Zhejiang Lab, Hangzhou, China
- Yufeng Lou
- Department of Laboratory Medicine, The First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Key Laboratory of Clinical In Vitro Diagnostic Techniques of Zhejiang Province, Hangzhou, China
- Institute of Laboratory Medicine, Zhejiang University, Hangzhou, China
- Chao Li
- Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge, UK
- School of Medicine, School of Science and Engineering, University of Dundee, Nethergate, Dundee, UK
- Lina Wu
- School of Food Science and Pharmaceutical Engineering, Nanjing Normal University, Nanjing, China
- Xinjie Wang
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, China
|
28
|
Chen V, Yang M, Cui W, Kim JS, Talwalkar A, Ma J. Applying interpretable machine learning in computational biology-pitfalls, recommendations and opportunities for new developments. Nat Methods 2024; 21:1454-1461. [PMID: 39122941 PMCID: PMC11348280 DOI: 10.1038/s41592-024-02359-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2022] [Accepted: 06/24/2024] [Indexed: 08/12/2024]
Abstract
Recent advances in machine learning have enabled the development of next-generation predictive models for complex computational biology problems, thereby spurring the use of interpretable machine learning (IML) to unveil biological insights. However, guidelines for using IML in computational biology are generally underdeveloped. We provide an overview of IML methods and evaluation techniques and discuss common pitfalls encountered when applying IML methods to computational biology problems. We also highlight open questions, especially in the era of large language models, and call for collaboration between IML and computational biology researchers.
Affiliation(s)
- Valerie Chen
- Machine Learning Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
- Muyu Yang
- Ray and Stephanie Lane Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
- Wenbo Cui
- Machine Learning Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
- Joon Sik Kim
- Machine Learning Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
- Ameet Talwalkar
- Machine Learning Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA.
- Jian Ma
- Ray and Stephanie Lane Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA.
|
29
|
Liu X, Shi J, Jiao Y, An J, Tian J, Yang Y, Zhuo L. Integrated multi-omics with machine learning to uncover the intricacies of kidney disease. Brief Bioinform 2024; 25:bbae364. [PMID: 39082652 PMCID: PMC11289682 DOI: 10.1093/bib/bbae364] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/22/2024] [Revised: 06/20/2024] [Accepted: 07/17/2024] [Indexed: 08/03/2024] Open
Abstract
The development of omics technologies has driven a profound expansion in the scale of biological data and the increased complexity in internal dimensions, prompting the utilization of machine learning (ML) as a powerful toolkit for extracting knowledge and understanding underlying biological patterns. Kidney disease represents one of the major growing global health threats with intricate pathogenic mechanisms and a lack of precise molecular pathology-based therapeutic modalities. Accordingly, there is a need for advanced high-throughput approaches to capture implicit molecular features and complement current experiments and statistics. This review aims to delineate strategies for integrating multi-omics data with appropriate ML methods, highlighting key clinical translational scenarios, including predicting disease progression risks to improve medical decision-making, comprehensively understanding disease molecular mechanisms, and practical applications of image recognition in renal digital pathology. Examining the benefits and challenges of current integration efforts is expected to shed light on the complexity of kidney disease and advance clinical practice.
Affiliation(s)
- Li Zhuo
- Corresponding author. Department of Nephrology, China-Japan Friendship Hospital, Beijing 100029, China; China-Japan Friendship Clinic Medical College, Beijing University of Chinese Medicine, 100029 Beijing, China. E-mail:
|
30
|
Wang H, Wang C, Wang Z, Niu X. Active Discovery of the Allosteric Inhibitor Targeting Botrytis cinerea Chitinase Based on Neural Relational Inference for Food Preservation. JOURNAL OF AGRICULTURAL AND FOOD CHEMISTRY 2024; 72:16128-16139. [PMID: 39003764 DOI: 10.1021/acs.jafc.4c03023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/16/2024]
Abstract
Currently, allosteric inhibitors have emerged as an effective strategy in the development of preservatives against the drug-resistant Botrytis cinerea (B. cinerea). However, the efficiency of their passively driven development has struggled to meet practical demands. Here, leveraging the deep learning Neural Relational Inference (NRI) framework, we actively identified an allosteric inhibitor targeting B. cinerea Chitinase, namely, 2-acetonaphthone. 2-Acetonaphthone binds to the crucial domain of Chitinase, forming a strong interaction with the allosteric sites. Throughout the interaction process, 2-acetonaphthone diminished the overall connectivity of the protein, inducing conformational changes. These findings align with the results obtained from Chitinase activity experiments, revealing an IC50 value of 67.6 μg/mL. Moreover, 2-acetonaphthone exhibited outstanding anti-B. cinerea activity by inhibiting Chitinase. In the gray mold infection model, 2-acetonaphthone significantly extended the preservation time of cherry tomatoes, positioning it as a promising preservative for fruit storage.
Affiliation(s)
- Hongsu Wang
- College of Food Science and Engineering, Jilin University, Changchun 130062, P.R. China
- Chenyang Wang
- College of Food Science and Engineering, Jilin University, Changchun 130062, P.R. China
- Ziyou Wang
- College of Food Science and Engineering, Jilin University, Changchun 130062, P.R. China
- Xiaodi Niu
- College of Food Science and Engineering, Jilin University, Changchun 130062, P.R. China
|
31
|
Ganesan P, Feng R, Deb B, Tjong FVY, Rogers AJ, Ruipérez-Campillo S, Somani S, Clopton P, Baykaner T, Rodrigo M, Zou J, Haddad F, Zaharia M, Narayan SM. Novel Domain Knowledge-Encoding Algorithm Enables Label-Efficient Deep Learning for Cardiac CT Segmentation to Guide Atrial Fibrillation Treatment in a Pilot Dataset. Diagnostics (Basel) 2024; 14:1538. [PMID: 39061675 PMCID: PMC11276420 DOI: 10.3390/diagnostics14141538] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/15/2024] [Revised: 07/07/2024] [Accepted: 07/10/2024] [Indexed: 07/28/2024] Open
Abstract
Background: Segmenting computed tomography (CT) is crucial in various clinical applications, such as tailoring personalized cardiac ablation for managing cardiac arrhythmias. Automating segmentation through machine learning (ML) is hindered by the necessity for large, labeled training data, which can be challenging to obtain. This article proposes a novel approach for automated, robust labeling using domain knowledge to achieve high-performance segmentation by ML from a small training set. The approach, the domain knowledge-encoding (DOKEN) algorithm, reduces the reliance on large training datasets by encoding cardiac geometry while automatically labeling the training set. The method was validated in a hold-out dataset of CT results from an atrial fibrillation (AF) ablation study.
Methods: The DOKEN algorithm parses left atrial (LA) structures, extracts "anatomical knowledge" by leveraging digital LA models (available publicly), and then applies this knowledge to achieve high ML segmentation performance with a small number of training samples. The DOKEN-labeled training set was used to train an nnU-Net deep neural network (DNN) model for segmenting cardiac CT in N = 20 patients. Subsequently, the method was tested in a hold-out set with N = 100 patients (five times larger than the training set) who underwent AF ablation.
Results: The DOKEN algorithm integrated with the nnU-Net model achieved high segmentation performance with few training samples, with a training-to-test ratio of 1:5. The Dice score of the DOKEN-enhanced model was 96.7% (IQR: 95.3% to 97.7%), with a median error in surface distance of boundaries of 1.51 mm (IQR: 0.72 to 3.12) and a mean centroid-boundary distance of 1.16 mm (95% CI: -4.57 to 6.89), similar to expert results (r = 0.99; p < 0.001). In digital hearts, the novel DOKEN approach segmented the LA structures with a mean difference for the centroid-boundary distances of -0.27 mm (95% CI: -3.87 to 3.33; r = 0.99; p < 0.0001).
Conclusions: The proposed novel domain knowledge-encoding algorithm was able to perform the segmentation of six substructures of the LA, reducing the need for large training data sets. The combination of domain knowledge encoding and a machine learning approach could reduce the dependence of ML on large training datasets and could potentially be applied to AF ablation procedures and extended in the future to other imaging, 3D printing, and data science applications.
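The Dice score reported in this abstract is a standard overlap measure between a predicted mask and a reference mask. The sketch below is only an illustration of that metric on flat binary masks; the function name `dice_score` and the convention of returning 1.0 for two empty masks are assumptions made here, not details taken from the paper.

```python
def dice_score(pred, truth):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for two flat binary masks.

    `pred` and `truth` are equal-length sequences of 0/1 values
    (hypothetical inputs; real CT masks would be flattened 3D arrays).
    """
    if len(pred) != len(truth):
        raise ValueError("masks must have the same length")
    # Count voxels labeled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    # Total foreground voxels across both masks.
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    if total == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    return 2.0 * intersection / total


predicted = [1, 1, 0, 0]
reference = [1, 0, 0, 0]
print(round(dice_score(predicted, reference), 4))
```

A score of 1.0 means the two masks coincide exactly, and 0.0 means they share no foreground voxels.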
Affiliation(s)
- Prasanth Ganesan
- Department of Medicine and Stanford Cardiovascular Institute (CVI), Stanford University, Stanford, CA 94305, USA; (P.G.); (R.F.)
- Ruibin Feng
- Department of Medicine and Stanford Cardiovascular Institute (CVI), Stanford University, Stanford, CA 94305, USA; (P.G.); (R.F.)
- Brototo Deb
- Department of Medicine and Stanford Cardiovascular Institute (CVI), Stanford University, Stanford, CA 94305, USA; (P.G.); (R.F.)
- Fleur V. Y. Tjong
- Department of Medicine and Stanford Cardiovascular Institute (CVI), Stanford University, Stanford, CA 94305, USA; (P.G.); (R.F.)
- Heart Center, Department of Clinical and Experimental Cardiology, Amsterdam UMC, University of Amsterdam, 1105 AZ Amsterdam, The Netherlands
- Albert J. Rogers
- Department of Medicine and Stanford Cardiovascular Institute (CVI), Stanford University, Stanford, CA 94305, USA; (P.G.); (R.F.)
- Samuel Ruipérez-Campillo
- Department of Medicine and Stanford Cardiovascular Institute (CVI), Stanford University, Stanford, CA 94305, USA; (P.G.); (R.F.)
- Department of Computer Science, ETH Zurich, 8092 Zurich, Switzerland
- Sulaiman Somani
- Department of Medicine and Stanford Cardiovascular Institute (CVI), Stanford University, Stanford, CA 94305, USA; (P.G.); (R.F.)
- Paul Clopton
- Department of Medicine and Stanford Cardiovascular Institute (CVI), Stanford University, Stanford, CA 94305, USA; (P.G.); (R.F.)
- Tina Baykaner
- Department of Medicine and Stanford Cardiovascular Institute (CVI), Stanford University, Stanford, CA 94305, USA; (P.G.); (R.F.)
- Miguel Rodrigo
- Department of Medicine and Stanford Cardiovascular Institute (CVI), Stanford University, Stanford, CA 94305, USA; (P.G.); (R.F.)
- CoMMLab, Universitat de València, 46100 Valencia, Spain
- James Zou
- Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA
- Francois Haddad
- Department of Medicine and Stanford Cardiovascular Institute (CVI), Stanford University, Stanford, CA 94305, USA; (P.G.); (R.F.)
- Matei Zaharia
- Department of Computer Science, University of California Berkeley, Berkeley, CA 94720, USA
- Sanjiv M. Narayan
- Department of Medicine and Stanford Cardiovascular Institute (CVI), Stanford University, Stanford, CA 94305, USA; (P.G.); (R.F.)
32
Li M, Guo H, Wang K, Kang C, Yin Y, Zhang H. AVBAE-MODFR: A novel deep learning framework of embedding and feature selection on multi-omics data for pan-cancer classification. Comput Biol Med 2024; 177:108614. [PMID: 38796884 DOI: 10.1016/j.compbiomed.2024.108614] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/07/2023] [Revised: 02/27/2024] [Accepted: 05/11/2024] [Indexed: 05/29/2024]
Abstract
Integration analysis of cancer multi-omics data for pan-cancer classification has the potential for clinical applications in various aspects such as tumor diagnosis, analyzing clinically significant features, and providing precision medicine. In these applications, embedding and feature selection on high-dimensional multi-omics data are clinically necessary. Recently, deep learning algorithms have become the most promising methods for cancer multi-omics integration analysis, owing to their powerful capability of capturing nonlinear relationships. Developing effective deep learning architectures for cancer multi-omics embedding and feature selection remains a challenge for researchers in view of high dimensionality and heterogeneity. In this paper, we propose a novel two-phase deep learning model named AVBAE-MODFR for pan-cancer classification. AVBAE-MODFR achieves embedding by a multi2multi autoencoder based on the adversarial variational Bayes method and further performs feature selection utilizing a dual-net-based feature ranking method. AVBAE-MODFR utilizes AVBAE to pre-train the network parameters, which improves the classification performance and enhances feature ranking stability in MODFR. Firstly, AVBAE learns high-quality representation among multiple omics features for unsupervised pan-cancer classification. We design an efficient discriminator architecture to distinguish the latent distributions for updating forward variational parameters. Secondly, we propose MODFR to simultaneously evaluate multi-omics feature importance for feature selection by training a designed multi2one selector network, where the efficient evaluation approach based on the average gradient of random mask subsets can avoid bias caused by input feature drift. We conduct experiments on the TCGA pan-cancer dataset and compare AVBAE-MODFR with four state-of-the-art (SOTA) methods for each phase. The results show the superiority of AVBAE-MODFR over the SOTA methods.
Affiliation(s)
- Minghe Li
- National Key Laboratory of Intelligent Tracking and Forecasting for Infectious Diseases, Engineering Research Center of Trusted Behavior Intelligence, Ministry of Education, College of Artificial Intelligence, Nankai University, Tongyan Road, Tianjin, China
- Huike Guo
- National Key Laboratory of Intelligent Tracking and Forecasting for Infectious Diseases, Engineering Research Center of Trusted Behavior Intelligence, Ministry of Education, College of Artificial Intelligence, Nankai University, Tongyan Road, Tianjin, China
- Keao Wang
- National Key Laboratory of Intelligent Tracking and Forecasting for Infectious Diseases, Engineering Research Center of Trusted Behavior Intelligence, Ministry of Education, College of Artificial Intelligence, Nankai University, Tongyan Road, Tianjin, China
- Chuanze Kang
- National Key Laboratory of Intelligent Tracking and Forecasting for Infectious Diseases, Engineering Research Center of Trusted Behavior Intelligence, Ministry of Education, College of Artificial Intelligence, Nankai University, Tongyan Road, Tianjin, China
- Yanbin Yin
- Department of Food Science and Technology, University of Nebraska - Lincoln, NE, USA
- Han Zhang
- National Key Laboratory of Intelligent Tracking and Forecasting for Infectious Diseases, Engineering Research Center of Trusted Behavior Intelligence, Ministry of Education, College of Artificial Intelligence, Nankai University, Tongyan Road, Tianjin, China.
33
Wossnig L, Furtmann N, Buchanan A, Kumar S, Greiff V. Best practices for machine learning in antibody discovery and development. Drug Discov Today 2024; 29:104025. [PMID: 38762089 DOI: 10.1016/j.drudis.2024.104025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/14/2023] [Revised: 04/25/2024] [Accepted: 05/13/2024] [Indexed: 05/20/2024]
Abstract
In the past 40 years, therapeutic antibody discovery and development have advanced considerably, with machine learning (ML) offering a promising way to speed up the process by reducing costs and the number of experiments required. Recent progress in ML-guided antibody design and development (D&D) has been hindered by the diversity of data sets and evaluation methods, which makes it difficult to conduct comparisons and assess utility. Establishing standards and guidelines will be crucial for the wider adoption of ML and the advancement of the field. This perspective critically reviews current practices, highlights common pitfalls and proposes method development and evaluation guidelines for various ML-based techniques in therapeutic antibody D&D. Addressing challenges across the ML process, best practices are recommended for each stage to enhance reproducibility and progress.
Affiliation(s)
- Leonard Wossnig
- LabGenius Ltd, The Biscuit Factory, 100 Drummond Road, London SE16 4DG, UK; Department of Computer Science, University College London, 66-72 Gower St, London WC1E 6EA, UK.
- Norbert Furtmann
- R&D Large Molecules Research Platform, Sanofi Deutschland GmbH, Industriepark Höchst, Frankfurt Am Main, Germany
- Andrew Buchanan
- Biologics Engineering, R&D, AstraZeneca, Cambridge CB2 0AA, UK
- Sandeep Kumar
- Computational Protein Design and Modeling Group, Computational Science, Moderna Therapeutics, 200 Technology Square, Cambridge, MA 02139, USA
- Victor Greiff
- Department of Immunology and Oslo University Hospital, University of Oslo, Oslo, Norway
34
Baciu-Drăgan MA, Beerenwinkel N. Oncotree2vec - a method for embedding and clustering of tumor mutation trees. Bioinformatics 2024; 40:i180-i188. [PMID: 38940124 PMCID: PMC11211817 DOI: 10.1093/bioinformatics/btae214] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/29/2024] Open
Abstract
Motivation: Understanding the genomic heterogeneity of tumors is an important task in computational oncology, especially in the context of finding personalized treatments based on the genetic profile of each patient's tumor. Tumor clustering that takes into account the temporal order of genetic events, as represented by tumor mutation trees, is a powerful approach for grouping together patients with genetically and evolutionarily similar tumors and can provide insights into discovering tumor subtypes, for more accurate clinical diagnosis and prognosis.
Results: Here, we propose oncotree2vec, a method for clustering tumor mutation trees by learning vector representations of mutation trees that capture the different relationships between subclones in an unsupervised manner. Learning low-dimensional tree embeddings facilitates the visualization of relations between trees in large cohorts and can be used for downstream analyses, such as deep learning approaches for single-cell multi-omics data integration. We assessed the performance and the usefulness of our method in three simulation studies and on two real datasets: a cohort of 43 trees from six cancer types with different branching patterns corresponding to different modes of spatial tumor evolution and a cohort of 123 AML mutation trees.
Availability and implementation: https://github.com/cbg-ethz/oncotree2vec.
Affiliation(s)
- Monica-Andreea Baciu-Drăgan
- Department of Biosystems Science and Engineering, ETH Zürich, Schanzenstrasse 44, Basel 4056, Switzerland
- SIB Swiss Institute of Bioinformatics, Schanzenstrasse 44, Basel 4056, Switzerland
- Niko Beerenwinkel
- Department of Biosystems Science and Engineering, ETH Zürich, Schanzenstrasse 44, Basel 4056, Switzerland
- SIB Swiss Institute of Bioinformatics, Schanzenstrasse 44, Basel 4056, Switzerland
35
Gurusinghe SNS, Wu Y, DeGrado W, Shifman JM. ProBASS - a language model with sequence and structural features for predicting the effect of mutations on binding affinity. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.06.21.600041. [PMID: 38979193 PMCID: PMC11230163 DOI: 10.1101/2024.06.21.600041] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/10/2024]
Abstract
Protein-protein interactions (PPIs) govern virtually all cellular processes. Even a single mutation within a PPI can significantly influence overall protein functionality and potentially lead to various types of diseases. To date, numerous approaches have emerged for predicting the change in free energy of binding (ΔΔGbind) resulting from mutations, yet the majority of these methods lack precision. In recent years, protein language models (PLMs) have been developed and have shown powerful predictive capabilities by leveraging both sequence and structural data from protein-protein complexes. Yet, PLMs have not been optimized specifically for predicting ΔΔGbind. We developed an approach to predict the effects of mutations on PPI binding affinity based on the two most advanced protein language models, ESM2 and ESM-IF1, which incorporate PPI sequence and structural features, respectively. We used the two models to generate embeddings for each PPI mutant and subsequently fine-tuned our model by training on a large dataset of experimental ΔΔGbind values. Our model, ProBASS (Protein Binding Affinity from Structure and Sequence), achieved a correlation with experimental ΔΔGbind values of 0.83 ± 0.05 for single mutations and 0.69 ± 0.04 for double mutations when model training and testing were done on the same PDB. Moreover, ProBASS exhibited very high correlation (0.81 ± 0.02) between prediction and experiment when training and testing were performed on a dataset containing 2325 single mutations in 132 PPIs. ProBASS surpasses the state-of-the-art methods in correlation with experimental data and could be further trained as more experimental data become available. Our results demonstrate that the integration of extensive datasets containing ΔΔGbind values across multiple PPIs to refine the pre-trained PLMs represents a successful approach for achieving a precise and broadly applicable model for ΔΔGbind prediction, greatly facilitating future protein engineering and design studies.
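The abstract evaluates predictions by their correlation with experimental ΔΔGbind values but does not name the coefficient. The sketch below assumes the Pearson correlation coefficient, which is a common choice for this kind of comparison; the function name `pearson_r` and the sample values are hypothetical and are not data from the preprint.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical predicted and measured values, for illustration only.
predicted = [0.2, 1.1, 1.9, 3.2]
measured = [0.0, 1.0, 2.0, 3.0]
print(pearson_r(predicted, measured))
```

Values near 1 indicate that predictions rise and fall with the measurements; the coefficient says nothing about systematic offset or scale errors.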
Affiliation(s)
- Sagara N S Gurusinghe
- Department of Biological Chemistry, The Alexander Silberman Institute of Life Sciences, The Hebrew University of Jerusalem, Jerusalem, Israel
- Yibing Wu
- Department of Pharmaceutical Chemistry, School of Pharmacy, University of California San Francisco, CA, USA
- William DeGrado
- Department of Pharmaceutical Chemistry, School of Pharmacy, University of California San Francisco, CA, USA
- Julia M Shifman
- Department of Biological Chemistry, The Alexander Silberman Institute of Life Sciences, The Hebrew University of Jerusalem, Jerusalem, Israel
36
Wang J, Wang X, Chu Y, Li C, Li X, Meng X, Fang Y, No KT, Mao J, Zeng X. Exploring the Conformational Ensembles of Protein-Protein Complex with Transformer-Based Generative Model. J Chem Theory Comput 2024; 20:4469-4480. [PMID: 38816696 DOI: 10.1021/acs.jctc.4c00255] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/01/2024]
Abstract
Protein-protein interactions are the basis of many protein functions, and understanding the contact and conformational changes of protein-protein interactions is crucial for linking the protein structure to biological function. These changes are difficult to detect experimentally. Molecular dynamics (MD) simulations are widely used to study the conformational ensembles and dynamics of protein-protein complexes, but they face significant limitations in sampling efficiency and computational cost. In this study, a generative neural network was trained on protein-protein complex conformations obtained from molecular simulations to directly generate novel conformations with physical realism. We demonstrated the use of a deep learning model based on the transformer architecture to explore the conformational ensembles of protein-protein complexes through MD simulations. The results showed that the learned latent space can be used to generate unsampled conformations of protein-protein complexes that complement pre-existing ones, making the model an exploratory tool for the analysis and enhancement of molecular simulations of protein-protein complexes.
Affiliation(s)
- Jianmin Wang
- The Interdisciplinary Graduate Program in Integrative Biotechnology, Yonsei University, Incheon 21983, Korea
- Xun Wang
- School of Computer Science and Technology, China University of Petroleum, Qingdao, Shandong 266580, P. R. China
- High Performance Computer Research Center, University of Chinese Academy of Sciences, Beijing 100190, P. R. China
- Yanyi Chu
- Department of Pathology, Stanford University School of Medicine, Stanford, California 94305, United States
- Chunyan Li
- School of Informatics, Yunnan Normal University, Kunming, Yunnan 650500, P. R. China
- Xue Li
- School of Computer Science and Technology, China University of Petroleum, Qingdao, Shandong 266580, P. R. China
- Xiangyu Meng
- School of Computer Science and Technology, China University of Petroleum, Qingdao, Shandong 266580, P. R. China
- Yitian Fang
- School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200030, P. R. China
- Kyoung Tai No
- The Interdisciplinary Graduate Program in Integrative Biotechnology, Yonsei University, Incheon 21983, Korea
- Jiashun Mao
- School of Medical Information and Engineering, Southwest Medical University, Luzhou, Sichuan 646000, P. R. China
- Xiangxiang Zeng
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan 410082, P. R. China
37
Raimondi D, Passemiers A, Verplaetse N, Corso M, Ferrero-Serrano Á, Nazzicari N, Biscarini F, Fariselli P, Moreau Y. Biologically meaningful genome interpretation models to address data underdetermination for the leaf and seed ionome prediction in Arabidopsis thaliana. Sci Rep 2024; 14:13188. [PMID: 38851759 PMCID: PMC11162433 DOI: 10.1038/s41598-024-63855-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/17/2024] [Accepted: 06/03/2024] [Indexed: 06/10/2024] Open
Abstract
Genome interpretation (GI) encompasses the computational attempts to model the relationship between genotype and phenotype with the goal of understanding how the first leads to the second. While traditional approaches have focused on sub-problems such as predicting the effect of single nucleotide variants or finding genetic associations, recent advances in neural networks (NNs) have made it possible to develop end-to-end GI models that take genomic data as input and predict phenotypes as output. However, technical and modeling issues still need to be fixed for these models to be effective, including the widespread underdetermination of genomic datasets, which makes them unsuitable for training large, overfitting-prone NNs. Here we propose novel GI models to address this issue, exploring the use of two types of transfer learning approaches and proposing a novel Biologically Meaningful Sparse NN layer specifically designed for end-to-end GI. Our models predict the leaf and seed ionome in A. thaliana, obtaining results comparable to our previous over-parameterized model while reducing the number of parameters 8.8-fold. We also investigate how population stratification influences the evaluation of performance, highlighting how it leads to (1) an instance of Simpson's paradox and (2) model generalization limitations.
Affiliation(s)
- Massimiliano Corso
- Université Paris-Saclay, INRAE, AgroParisTech, Institute Jean-Pierre Bourgin for Plant Sciences (IJPB), 78000, Versailles, France
- Ángel Ferrero-Serrano
- Department of Biology, Pennsylvania State University, University Park, PA, 16802, USA
- Piero Fariselli
- Department of Medical Sciences, University of Torino, 10123, Turin, Italy
- Yves Moreau
- ESAT-STADIUS, KU Leuven, 3001, Leuven, Belgium
38
Álvarez-Machancoses Ó, Faraggi E, deAndrés-Galiana EJ, Fernández-Martínez JL, Kloczkowski A. Prediction of Deleterious Single Amino Acid Polymorphisms with a Consensus Holdout Sampler. Curr Genomics 2024; 25:171-184. [PMID: 39086995 PMCID: PMC11288160 DOI: 10.2174/0113892029236347240308054538] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/03/2023] [Revised: 08/03/2023] [Accepted: 09/22/2023] [Indexed: 08/02/2024] Open
Abstract
Background: Single Amino Acid Polymorphisms (SAPs) or nonsynonymous Single Nucleotide Variants (nsSNVs) are the most common genetic variations. They result from missense mutations where a single base pair substitution changes the genetic code in such a way that the triplet of bases (codon) at a given position codes for a different amino acid. Since genetic mutations sometimes cause genetic diseases, it is important to comprehend and foresee which variations are harmful and which ones are neutral (not causing changes in the phenotype). This can be posed as a classification problem.
Methods: Computational methods using machine intelligence are gradually replacing repetitive and exceedingly expensive mutagenic tests. By and large, the uneven quality, deficiencies, and irregularities of nsSNV datasets reduce the usefulness of artificial intelligence-based methods. Consequently, more robust and more accurate approaches are needed to address these problems. In the present paper, we show a consensus classifier built on the holdout sampler, which appears robust and precise and outperforms all other popular methods.
Results: We produced 100 holdouts to test the structures and diverse classification variables of diverse classifiers during the training phase. The best-performing holdouts were chosen to develop a consensus classifier and tested using a k-fold (1 ≤ k ≤ 5) cross-validation method. We also examined which protein properties have the biggest impact on the precise prediction of the effects of nsSNVs.
Conclusion: Our Consensus Holdout Sampler outperforms other popular algorithms and gives excellent results that are highly accurate with low standard deviation. The advantage of our method emerges from using a tree of holdouts, where diverse LM/AI-based programs are sampled in diverse ways.
Affiliation(s)
- Óscar Álvarez-Machancoses
- Group of Inverse Problems, Optimization and Machine Learning, Department of Mathematics, University of Oviedo, C. Federico García Lorca, 18, 33007, Oviedo, Spain
- Eshel Faraggi
- School of Science, Indiana University-Purdue University Indianapolis, IN, USA
- Enrique J deAndrés-Galiana
- Group of Inverse Problems, Optimization and Machine Learning, Department of Mathematics, University of Oviedo, C. Federico García Lorca, 18, 33007, Oviedo, Spain
- Department of Computer Science, University of Oviedo, C. Federico García Lorca, 18, 33007, Oviedo, Spain
- Juan L Fernández-Martínez
- Group of Inverse Problems, Optimization and Machine Learning, Department of Mathematics, University of Oviedo, C. Federico García Lorca, 18, 33007, Oviedo, Spain
- Andrzej Kloczkowski
- Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, OH, USA
- Department of Pediatrics, The Ohio State University, Columbus, OH, USA
39
Da Conceição LMA, Cabral LM, Pereira GRC, De Mesquita JF. An In Silico Analysis of Genetic Variants and Structural Modeling of the Human Frataxin Protein in Friedreich's Ataxia. Int J Mol Sci 2024; 25:5796. [PMID: 38891993 PMCID: PMC11172458 DOI: 10.3390/ijms25115796] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/17/2024] [Revised: 05/15/2024] [Accepted: 05/20/2024] [Indexed: 06/21/2024] Open
Abstract
Friedreich's Ataxia (FRDA) stands out as the most prevalent form of hereditary ataxias, marked by progressive movement ataxia, loss of vibratory sensitivity, and skeletal deformities, severely affecting daily functioning. To date, the only medication available for treating FRDA is Omaveloxolone (Skyclarys®), recently approved by the FDA. Missense mutations within the human frataxin (FXN) gene, responsible for intracellular iron homeostasis regulation, are linked to FRDA development. These mutations induce FXN dysfunction, fostering mitochondrial iron accumulation and heightened oxidative stress, ultimately triggering neuronal cell death pathways. This study amalgamated 226 FXN genetic variants from the literature and database searches, with only 18 previously characterized. Predictive analyses revealed a notable prevalence of detrimental and destabilizing predictions for FXN mutations, predominantly impacting conserved residues crucial for protein function. Additionally, an accurate, comprehensive three-dimensional model of human FXN was constructed, serving as the basis for generating genetic variants I154F and W155R. These variants, selected for their severe clinical implications, underwent molecular dynamics (MD) simulations, unveiling flexibility and essential dynamic alterations in their N-terminal segments, encompassing FXN42, FXN56, and FXN78 domains pivotal for protein maturation. Thus, our findings indicate potential interaction profile disturbances in the FXN42, FXN56, and FXN78 domains induced by I154F and W155R mutations, aligning with the existing literature.
Affiliation(s)
- Loiane Mendonça Abrantes Da Conceição
- Laboratory of Bioinformatics and Computational Biology, Federal University of the State of Rio de Janeiro (UNIRIO), Avenida Pasteur, 296, Urca, Rio de Janeiro 22290-250, Brazil (J.F.D.M.)
- Lucio Mendes Cabral
- Pharmaceutical Industrial Technology Laboratory, Federal University of Rio de Janeiro (UFRJ), Avenida Carlos Chagas Filho, 373, Cidade Universitária, Rio de Janeiro 21941-590, Brazil
- Gabriel Rodrigues Coutinho Pereira
- Pharmaceutical Industrial Technology Laboratory, Federal University of Rio de Janeiro (UFRJ), Avenida Carlos Chagas Filho, 373, Cidade Universitária, Rio de Janeiro 21941-590, Brazil
- Laboratory of Molecular Modeling & QSAR, Federal University of Rio de Janeiro (UFRJ), Avenida Carlos Chagas Filho, 373, Cidade Universitária, Rio de Janeiro 21941-590, Brazil
- Joelma Freire De Mesquita
- Laboratory of Bioinformatics and Computational Biology, Federal University of the State of Rio de Janeiro (UNIRIO), Avenida Pasteur, 296, Urca, Rio de Janeiro 22290-250, Brazil (J.F.D.M.)
40
Ranjan R, Srijan S, Balekuttira S, Agarwal T, Ramey M, Dobbins M, Kuhn R, Wang X, Hudson K, Li Y, Varala K. Organ-delimited gene regulatory networks provide high accuracy in candidate transcription factor selection across diverse processes. Proc Natl Acad Sci U S A 2024; 121:e2322751121. [PMID: 38652750 PMCID: PMC11066984 DOI: 10.1073/pnas.2322751121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/04/2024] [Accepted: 03/14/2024] [Indexed: 04/25/2024] Open
Abstract
Organ-specific gene expression datasets that include hundreds to thousands of experiments allow the reconstruction of organ-level gene regulatory networks (GRNs). However, creating such datasets is greatly hampered by the requirement for extensive and tedious manual curation. Here, we trained a supervised classification model that can accurately classify the organ-of-origin for a plant transcriptome. This K-Nearest Neighbor-based multiclass classifier was used to create organ-specific gene expression datasets for the leaf, root, shoot, flower, and seed in Arabidopsis thaliana. A GRN inference approach was used to determine (i) the influential transcription factors (TFs) in each organ and (ii) the most influential TFs for specific biological processes in that organ. These genome-wide, organ-delimited GRNs (OD-GRNs) recalled many known regulators of organ development and processes operating in those organs. Importantly, many previously unknown TF regulators were uncovered as potential regulators of these processes. As a proof-of-concept, we focused on experimentally validating the predicted TF regulators of lipid biosynthesis in seeds, an important food and biofuel trait. Of the top 20 predicted TFs, eight are known regulators of seed oil content, e.g., WRI1, LEC1, FUS3. Importantly, we validated our prediction of MybS2, TGA4, SPL12, AGL18, and DiV2 as regulators of seed lipid biosynthesis. We elucidated the molecular mechanism of MybS2 and show that it induces purple acid phosphatase family genes and lipid synthesis genes to enhance seed lipid content. This general approach has the potential to be extended to any species with sufficiently large gene expression datasets to find unique regulators of any trait-of-interest.
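The organ-of-origin classifier in this abstract is described as K-Nearest Neighbor-based. The sketch below illustrates only the general k-NN majority-vote idea on toy two-gene expression vectors; the function name `knn_predict`, the Euclidean distance, the value k = 3, and all data values are assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training samples."""
    if k < 1 or k > len(train_x):
        raise ValueError("k must be between 1 and the number of samples")
    # Pair each training sample's Euclidean distance to the query with its label.
    neighbors = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    # Count labels among the k closest samples and return the most common one.
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]


# Toy expression vectors (hypothetical values for two marker genes).
samples = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9)]
organs = ["leaf", "leaf", "root", "root"]
print(knn_predict(samples, organs, (0.95, 0.15), k=3))
```

With real transcriptomes the vectors would have thousands of dimensions, so the choice of distance and of k would need to be validated rather than assumed.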
Affiliation(s)
- Rajeev Ranjan
- Department of Horticulture and Landscape Architecture, Purdue University, West Lafayette, IN47907
- Center for Plant Biology, Purdue University, West Lafayette, IN47907
- Sonali Srijan
- Department of Horticulture and Landscape Architecture, Purdue University, West Lafayette, IN47907
- Somaiah Balekuttira
- Department of Horticulture and Landscape Architecture, Purdue University, West Lafayette, IN47907
- Tina Agarwal
- Department of Horticulture and Landscape Architecture, Purdue University, West Lafayette, IN47907
- Center for Plant Biology, Purdue University, West Lafayette, IN47907
- Melissa Ramey
- Department of Horticulture and Landscape Architecture, Purdue University, West Lafayette, IN47907
- Madison Dobbins
- Department of Horticulture and Landscape Architecture, Purdue University, West Lafayette, IN47907
- Rachel Kuhn
- Department of Horticulture and Landscape Architecture, Purdue University, West Lafayette, IN47907
| | - Xiaojin Wang
- Department of Horticulture and Landscape Architecture, Purdue University, West Lafayette, IN47907
- Center for Plant Biology, Purdue University, West Lafayette, IN47907
| | - Karen Hudson
- United States Department of Agriculture-Agricultural Research Service Crop Production and Pest Control Research Unit, West Lafayette, IN47907
| | - Ying Li
- Department of Horticulture and Landscape Architecture, Purdue University, West Lafayette, IN47907
- Center for Plant Biology, Purdue University, West Lafayette, IN47907
| | - Kranthi Varala
- Department of Horticulture and Landscape Architecture, Purdue University, West Lafayette, IN47907
- Center for Plant Biology, Purdue University, West Lafayette, IN47907
| |
Collapse
|
41
|
Lu H, Zhang J, Cao Y, Wu S, Wei Y, Yin R. Advances in applications of artificial intelligence algorithms for cancer-related miRNA research. Zhejiang Da Xue Xue Bao Yi Xue Ban 2024; 53:231-243. [PMID: 38650448 PMCID: PMC11057993 DOI: 10.3724/zdxbyxb-2023-0511] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/30/2023] [Accepted: 01/30/2024] [Indexed: 04/25/2024]
Abstract
MiRNAs are a class of small non-coding RNAs that regulate gene expression post-transcriptionally through partially complementary base pairing. Aberrant miRNA expression has been reported in tumor tissues and peripheral blood of cancer patients. In recent years, artificial intelligence algorithms such as machine learning and deep learning have been widely used in bioinformatic research. Compared with traditional bioinformatic tools, miRNA target prediction tools based on artificial intelligence algorithms have higher accuracy and can successfully predict the subcellular localization and redistribution of miRNAs, which deepens our understanding of miRNAs. Additionally, the construction of clinical models based on artificial intelligence algorithms could significantly improve the efficiency of mining miRNAs for use as biomarkers. In this article, we summarize recent developments in bioinformatic miRNA tools based on artificial intelligence algorithms, focusing on the potential of machine learning and deep learning in cancer-related miRNA research.
Affiliation(s)
- Hongyu Lu
  - School of Pharmacy, Jiangsu University, Zhenjiang 212013, Jiangsu Province, China
- Jia Zhang
  - School of Pharmacy, Jiangsu University, Zhenjiang 212013, Jiangsu Province, China
- Yixin Cao
  - Department of Medical Oncology, Affiliated Hospital of Jiangsu University, Zhenjiang 212013, Jiangsu Province, China
- Shuming Wu
  - School of Pharmacy, Jiangsu University, Zhenjiang 212013, Jiangsu Province, China
- Yuan Wei
  - School of Pharmacy, Jiangsu University, Zhenjiang 212013, Jiangsu Province, China
- Runting Yin
  - School of Pharmacy, Jiangsu University, Zhenjiang 212013, Jiangsu Province, China
|
42
|
Kalleberg J, Rissman J, Schnabel RD. Overcoming Limitations to Deep Learning in Domesticated Animals with TrioTrain. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.04.15.589602. [PMID: 38659907 PMCID: PMC11042298 DOI: 10.1101/2024.04.15.589602] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/26/2024]
Abstract
Variant calling across diverse species remains challenging as most bioinformatics tools default to assumptions based on human genomes. DeepVariant (DV) excels without joint genotyping while offering fewer implementation barriers. However, the growing appeal of a "universal" algorithm has magnified the unknown impacts when used with non-human genomes. Here, we use bovine genomes to assess the limits of human-genome-trained models in other species. We introduce the first multi-species DV model that achieves a lower Mendelian Inheritance Error (MIE) rate during single-sample genotyping. Our novel approach, TrioTrain, automates extending DV for species without Genome In A Bottle (GIAB) resources and uses region shuffling to mitigate barriers for SLURM-based clusters. To offset imperfect truth labels for animal genomes, we remove Mendelian discordant variants before training, where models are tuned to genotype the offspring correctly. With TrioTrain, we use cattle, yak, and bison trios to build 30 model iterations across five phases. We observe remarkable performance across phases when testing the GIAB human trios with a mean SNP F1 score >0.990. In HG002, our phase 4 bovine model identifies more variants at a lower MIE rate than DeepTrio. In bovine F1-hybrid genomes, our model substantially reduces inheritance errors with a mean MIE rate of 0.03 percent. Although constrained by imperfect labels, we find that multi-species, trio-based training produces a robust variant calling model. Our research demonstrates that exclusively training with human genomes restricts the application of deep-learning approaches for comparative genomics.
Affiliation(s)
- Jenna Kalleberg
  - University of Missouri, Division of Animal Sciences, Columbia, MO, 65201 USA
- Jacob Rissman
  - University of Missouri, Division of Animal Sciences, Columbia, MO, 65201 USA
- Robert D Schnabel
  - University of Missouri, Division of Animal Sciences, Columbia, MO, 65201 USA
  - University of Missouri, Genetics Area Program, Columbia, MO, 65201 USA
|
43
|
Reveguk I, Simonson T. Classifying protein kinase conformations with machine learning. Protein Sci 2024; 33:e4918. [PMID: 38501429 PMCID: PMC10962494 DOI: 10.1002/pro.4918] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2023] [Revised: 01/02/2024] [Accepted: 01/22/2024] [Indexed: 03/20/2024]
Abstract
Protein kinases are key actors of signaling networks and important drug targets. They cycle between active and inactive conformations, distinguished by a few elements within the catalytic domain. One is the activation loop, whose conserved DFG motif can occupy DFG-in, DFG-out, and some rarer conformations. Annotation and classification of the structural kinome are important, as different conformations can be targeted by different inhibitors and activators. Valuable resources exist; however, large-scale applications will benefit from increased automation and interpretability of structural annotation. Interpretable machine learning models are described for this purpose, based on ensembles of decision trees. To train them, a set of catalytic domain sequences and structures was collected, somewhat larger and more diverse than existing resources. The structures were clustered based on the DFG conformation and manually annotated. They were then used as training input. Two main models were constructed, which distinguished active/inactive and in/out/other DFG conformations. They initially considered 1692 structural variables, spanning the whole catalytic domain, then identified ("learned") a small subset that sufficed for accurate classification. The first model correctly labeled all but 3 of 3289 structures as active or inactive, while the second assigned the correct DFG label to all but 17 of 8826 structures. The most potent classifying variables were all related to well-known structural elements in or near the activation loop, and their ranking gives insights into the conformational preferences. The models were used to automatically annotate 3850 kinase structures predicted recently with the AlphaFold2 tool, showing that AlphaFold2 reproduced the active/inactive but not the DFG-in proportions seen in the Protein Data Bank. We expect the models will be useful for understanding and engineering kinases.
Affiliation(s)
- Ivan Reveguk
  - Laboratoire de Biologie Structurale de la Cellule (CNRS UMR7654), Ecole Polytechnique, Palaiseau, France
- Thomas Simonson
  - Laboratoire de Biologie Structurale de la Cellule (CNRS UMR7654), Ecole Polytechnique, Palaiseau, France
|
44
|
Song H, Chu J, Li W, Li X, Fang L, Han J, Zhao S, Ma Y. A Novel Approach Utilizing Domain Adversarial Neural Networks for the Detection and Classification of Selective Sweeps. ADVANCED SCIENCE (WEINHEIM, BADEN-WURTTEMBERG, GERMANY) 2024; 11:e2304842. [PMID: 38308186 PMCID: PMC11005742 DOI: 10.1002/advs.202304842] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/17/2023] [Revised: 01/10/2024] [Indexed: 02/04/2024]
Abstract
The identification and classification of selective sweeps are of great significance for improving the understanding of biological evolution and exploring opportunities for precision medicine and genetic improvement. Here, a domain adaptation sweep detection and classification (DASDC) method is presented to balance the alignment of two domains and the classification performance through a domain-adversarial neural network and its adversarial learning modules. DASDC effectively addresses the issue of mismatch between training data and real genomic data in deep learning models, leading to a significant improvement in its generalization capability, prediction robustness, and accuracy. The DASDC method demonstrates improved identification performance compared to existing methods and excels in classification performance, particularly in scenarios where there is a mismatch between application data and training data. The successful implementation of DASDC in real data of three distinct species highlights its potential as a useful tool for identifying crucial functional genes and investigating adaptive evolutionary mechanisms, particularly with the increasing availability of genomic data.
Affiliation(s)
- Hui Song
  - Key Laboratory of Agricultural Animal Genetics, Breeding, and Reproduction of the Ministry of Education & Key Laboratory of Swine Genetics and Breeding of the Ministry of Agriculture, Huazhong Agricultural University, Wuhan 430070, China
- Jinyu Chu
  - Key Laboratory of Agricultural Animal Genetics, Breeding, and Reproduction of the Ministry of Education & Key Laboratory of Swine Genetics and Breeding of the Ministry of Agriculture, Huazhong Agricultural University, Wuhan 430070, China
- Wangjiao Li
  - Key Laboratory of Agricultural Animal Genetics, Breeding, and Reproduction of the Ministry of Education & Key Laboratory of Swine Genetics and Breeding of the Ministry of Agriculture, Huazhong Agricultural University, Wuhan 430070, China
- Xinyun Li
  - Key Laboratory of Agricultural Animal Genetics, Breeding, and Reproduction of the Ministry of Education & Key Laboratory of Swine Genetics and Breeding of the Ministry of Agriculture, Huazhong Agricultural University, Wuhan 430070, China
  - Hubei Hongshan Laboratory, Wuhan 430070, China
- Lingzhao Fang
  - Center for Quantitative Genetics and Genomics, Aarhus University, Aarhus 8000, Denmark
- Jianlin Han
  - Key Laboratory of Agricultural Animal Genetics, Breeding, and Reproduction of the Ministry of Education & Key Laboratory of Swine Genetics and Breeding of the Ministry of Agriculture, Huazhong Agricultural University, Wuhan 430070, China
  - CAAS-ILRI Joint Laboratory on Livestock and Forage Genetic Resources, Institute of Animal Science, Chinese Academy of Agricultural Sciences (CAAS), Beijing 100193, China
  - Livestock Genetics Program, International Livestock Research Institute (ILRI), Nairobi 00100, Kenya
- Shuhong Zhao
  - Key Laboratory of Agricultural Animal Genetics, Breeding, and Reproduction of the Ministry of Education & Key Laboratory of Swine Genetics and Breeding of the Ministry of Agriculture, Huazhong Agricultural University, Wuhan 430070, China
  - Hubei Hongshan Laboratory, Wuhan 430070, China
  - Lingnan Modern Agricultural Science and Technology Guangdong Laboratory, Guangzhou 510642, China
- Yunlong Ma
  - Key Laboratory of Agricultural Animal Genetics, Breeding, and Reproduction of the Ministry of Education & Key Laboratory of Swine Genetics and Breeding of the Ministry of Agriculture, Huazhong Agricultural University, Wuhan 430070, China
  - Hubei Hongshan Laboratory, Wuhan 430070, China
  - Lingnan Modern Agricultural Science and Technology Guangdong Laboratory, Guangzhou 510642, China
|
45
|
Kumar N, Srivastava R. Deep learning in structural bioinformatics: current applications and future perspectives. Brief Bioinform 2024; 25:bbae042. [PMID: 38701422 PMCID: PMC11066934 DOI: 10.1093/bib/bbae042] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/17/2023] [Revised: 01/05/2024] [Accepted: 01/18/2024] [Indexed: 05/05/2024]
Abstract
In this review article, we explore the transformative impact of deep learning (DL) on structural bioinformatics, emphasizing its pivotal role in a scientific revolution driven by extensive data, accessible toolkits and robust computing resources. As big data continue to advance, DL is poised to become an integral component in healthcare and biology, revolutionizing analytical processes. Our comprehensive review provides detailed insights into DL, featuring specific demonstrations of its notable applications in bioinformatics. We address challenges tailored for DL, spotlight recent successes in structural bioinformatics and present a clear exposition of DL, from basic shallow neural networks to advanced models such as convolutional, recurrent, artificial and transformer neural networks. This paper discusses the emerging use of DL for understanding biomolecular structures, anticipating ongoing developments and applications in the realm of structural bioinformatics.
Affiliation(s)
- Niranjan Kumar
  - School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi, India
- Rakesh Srivastava
  - Center for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad, India
|
46
|
Safdar Ali Khan M, Husen A, Nisar S, Ahmed H, Shah Muhammad S, Aftab S. Offloading the computational complexity of transfer learning with generic features. PeerJ Comput Sci 2024; 10:e1938. [PMID: 38660182 PMCID: PMC11041970 DOI: 10.7717/peerj-cs.1938] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2023] [Accepted: 02/19/2024] [Indexed: 04/26/2024]
Abstract
Deep learning approaches are generally complex, requiring extensive computational resources and having high time complexity. Transfer learning is a state-of-the-art approach to reducing the need for high computational resources by using pre-trained models without compromising accuracy and performance. In conventional studies, pre-trained models are trained on datasets from different but similar domains with many domain-specific features. The computational requirements of transfer learning depend directly on the number of features, which include both domain-specific and generic features. This article investigates the prospects of reducing the computational requirements of transfer learning models by discarding domain-specific features from a pre-trained model. The approach is applied to breast cancer detection using the Curated Breast Imaging Subset of the Digital Database for Screening Mammography (CBIS-DDSM) dataset and is evaluated with performance metrics such as precision, accuracy, recall, and F1-score, along with computational requirements. Discarding the domain-specific features up to a specific limit provides significant performance improvements and minimizes the computational requirements in terms of training time (reduced by approx. 12%), processor utilization (reduced by approx. 25%), and memory usage (reduced by approx. 22%). The proposed transfer learning strategy increases accuracy (by approx. 7%) and offloads computational complexity expeditiously.
Affiliation(s)
- Muhammad Safdar Ali Khan
  - Department of Computer Science and Information Technology, Virtual University of Pakistan, Lahore, Punjab, Pakistan
- Arif Husen
  - Department of Computer Science and Information Technology, Virtual University of Pakistan, Lahore, Punjab, Pakistan
  - Department of Computer Science, COMSATS Institute of Information Technology, Lahore, Punjab, Pakistan
- Shafaq Nisar
  - Department of Computer Science and Information Technology, Virtual University of Pakistan, Lahore, Punjab, Pakistan
- Hasnain Ahmed
  - Department of Computer Science and Information Technology, Virtual University of Pakistan, Lahore, Punjab, Pakistan
- Syed Shah Muhammad
  - Department of Computer Science and Information Technology, Virtual University of Pakistan, Lahore, Punjab, Pakistan
- Shabib Aftab
  - Department of Computer Science and Information Technology, Virtual University of Pakistan, Lahore, Punjab, Pakistan
|
47
|
This S, Costantino S, Melichar HJ. Machine learning predictions of T cell antigen specificity from intracellular calcium dynamics. SCIENCE ADVANCES 2024; 10:eadk2298. [PMID: 38446885 PMCID: PMC10917351 DOI: 10.1126/sciadv.adk2298] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/10/2023] [Accepted: 01/29/2024] [Indexed: 03/08/2024]
Abstract
Adoptive T cell therapies rely on the production of T cells with an antigen receptor that directs their specificity toward tumor-specific antigens. Methods for identifying relevant T cell receptor (TCR) sequences, predominantly achieved through the enrichment of antigen-specific T cells, represent a major bottleneck in the production of TCR-engineered cell therapies. Fluctuation of intracellular calcium is a proximal readout of TCR signaling and candidate marker for antigen-specific T cell identification that does not require T cell expansion; however, calcium fluctuations downstream of TCR engagement are highly variable. We propose that machine learning algorithms may allow for T cell classification from complex datasets such as polyclonal T cell signaling events. Using deep learning tools, we demonstrate accurate prediction of TCR-transgenic CD8+ T cell activation based on calcium fluctuations and test the algorithm against T cells bearing a distinct TCR as well as polyclonal T cells. This provides the foundation for an antigen-specific TCR sequence identification pipeline for adoptive T cell therapies.
Affiliation(s)
- Sébastien This
  - Centre de recherche de l'Hôpital Maisonneuve-Rosemont, Montréal, Québec, Canada
  - Département de Microbiologie, Infectiologie et Immunologie, Université de Montréal, Montréal, Québec, Canada
  - Department of Microbiology and Immunology, Goodman Cancer Institute, McGill University, Montréal, Québec, Canada
- Santiago Costantino
  - Centre de recherche de l'Hôpital Maisonneuve-Rosemont, Montréal, Québec, Canada
  - Département d’Ophtalmologie, Université de Montréal, Montréal, Québec, Canada
- Heather J. Melichar
  - Centre de recherche de l'Hôpital Maisonneuve-Rosemont, Montréal, Québec, Canada
  - Department of Microbiology and Immunology, Goodman Cancer Institute, McGill University, Montréal, Québec, Canada
  - Département de Médecine, Université de Montréal, Montréal, Québec, Canada
|
48
|
Chafai N, Bonizzi L, Botti S, Badaoui B. Emerging applications of machine learning in genomic medicine and healthcare. Crit Rev Clin Lab Sci 2024; 61:140-163. [PMID: 37815417 DOI: 10.1080/10408363.2023.2259466] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/19/2023] [Accepted: 09/12/2023] [Indexed: 10/11/2023]
Abstract
The integration of artificial intelligence technologies has propelled the progress of clinical and genomic medicine in recent years. The significant increase in computing power has facilitated the ability of artificial intelligence models to analyze and extract features from extensive medical data and images, thereby contributing to the advancement of intelligent diagnostic tools. Artificial intelligence (AI) models have been utilized in the field of personalized medicine to integrate clinical data and genomic information of patients. This integration allows for the identification of customized treatment recommendations, ultimately leading to enhanced patient outcomes. Notwithstanding the notable advancements, the application of artificial intelligence (AI) in the field of medicine is impeded by various obstacles such as the limited availability of clinical and genomic data, the diversity of datasets, ethical implications, and the inconclusive interpretation of AI models' results. In this review, a comprehensive evaluation of multiple machine learning algorithms utilized in the fields of clinical and genomic medicine is conducted. Furthermore, we present an overview of the implementation of artificial intelligence (AI) in the fields of clinical medicine, drug discovery, and genomic medicine. Finally, a number of constraints pertaining to the implementation of artificial intelligence within the healthcare industry are examined.
Affiliation(s)
- Narjice Chafai
  - Laboratory of Biodiversity, Ecology, and Genome, Faculty of Sciences, Department of Biology, Mohammed V University in Rabat, Rabat, Morocco
- Luigi Bonizzi
  - Department of Biomedical, Surgical and Dental Science, University of Milan, Milan, Italy
- Sara Botti
  - PTP Science Park, Via Einstein - Loc. Cascina Codazza, Lodi, Italy
- Bouabid Badaoui
  - Laboratory of Biodiversity, Ecology, and Genome, Faculty of Sciences, Department of Biology, Mohammed V University in Rabat, Rabat, Morocco
  - African Sustainable Agriculture Research Institute (ASARI), Mohammed VI Polytechnic University (UM6P), Laâyoune, Morocco
|
49
|
Leuchtenberger AF, von Haeseler A. Learning From an Artificial Neural Network in Phylogenetics. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2024; 21:278-288. [PMID: 38198267 DOI: 10.1109/tcbb.2024.3352268] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/12/2024]
Abstract
We show that an iterative ansatz of deep learning and human-intelligence-guided simplification may lead to surprisingly simple solutions for a difficult problem in phylogenetics. Distinguishing Farris and Felsenstein trees is a longstanding problem in phylogenetic tree reconstruction. The Artificial Neural Network F-zoneNN solves this problem for 4-taxon alignments evolved under the Jukes-Cantor model. It distinguishes between Farris and Felsenstein trees but, owing to its complexity, lacks transparency in its mechanism of discernment. Based on the simplification of F-zoneNN and alignment properties, we constructed the function FarFelDiscerner. In contrast to F-zoneNN, FarFelDiscerner's decision process is understandable. Moreover, FarFelDiscerner is significantly simpler than F-zoneNN. Despite its simplicity, this function infers the tree type almost perfectly on noise-free data and also performs well on simulated noisy alignments of finite length. We applied FarFelDiscerner to the historical Holometabola alignments, where it places Strepsiptera with beetles, concordant with the current scientific view.
|
50
|
Xu P, Lin NQ, Zhang ZQ, Liu JZ. Strategies to increase the robustness of microbial cell factories. ADVANCED BIOTECHNOLOGY 2024; 2:9. [PMID: 39883204 PMCID: PMC11740849 DOI: 10.1007/s44307-024-00018-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 01/16/2024] [Revised: 02/15/2024] [Accepted: 02/19/2024] [Indexed: 01/31/2025]
Abstract
Engineered microbial cell factories have achieved much progress in producing fuels, natural products and bulk chemicals. However, in industrial fermentation, microbial cells often face various predictable and stochastic disturbances resulting from intermediate metabolite or end-product toxicity, metabolic burden and harsh environments. These perturbations can decrease productivity and titer. Therefore, strain robustness is essential to ensure reliable and sustainable production efficiency. In this review, the current strategies to improve host robustness are summarized, including knowledge-based engineering approaches, such as engineering of transcription factors, membranes/transporters and stress proteins, and traditional adaptive laboratory evolution based on natural selection. Computation-assisted design of robust industrial hosts (e.g., using genome-scale metabolic models (GEMs), deep learning and machine learning) is also introduced. Furthermore, the challenges and future perspectives on engineering microbial host robustness are proposed to promote the development of green, efficient and sustainable biomanufacturers.
Affiliation(s)
- Pei Xu
  - State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-Sen University, Guangzhou, 510275, China
- Nuo-Qiao Lin
  - State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-Sen University, Guangzhou, 510275, China
- Zhi-Qian Zhang
  - Tidetron Bioworks Technology (Guangzhou) Co., Ltd., Guangzhou, 510399, China
- Jian-Zhong Liu
  - State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-Sen University, Guangzhou, 510275, China
  - Joint Research Center of Engineering Biology Technology of Sun Yat-Sen University and Tidetron Bioworks, Guangzhou, 510275, China
|