1. Mendoza-Revilla J, Trop E, Gonzalez L, Roller M, Dalla-Torre H, de Almeida BP, Richard G, Caton J, Lopez Carranza N, Skwark M, Laterre A, Beguir K, Pierrot T, Lopez M. A foundational large language model for edible plant genomes. Commun Biol 2024; 7:835. PMID: 38982288; PMCID: PMC11233511; DOI: 10.1038/s42003-024-06465-2.
Abstract
Significant progress has been made in plant genomics, as demonstrated by the increased use of high-throughput methodologies that enable the characterization of multiple genome-wide molecular phenotypes. These findings have provided valuable insights into plant traits and their underlying genetic mechanisms, particularly in model plant species. Nonetheless, effectively leveraging them to make accurate predictions represents a critical step in crop genomic improvement. We present AgroNT, a foundational large language model trained on genomes from 48 plant species, with a predominant focus on crop species. We show that AgroNT obtains state-of-the-art predictions for regulatory annotations, promoter/terminator strength, and tissue-specific gene expression, and can prioritize functional variants. We conduct a large-scale in silico saturation mutagenesis analysis on cassava to evaluate the regulatory impact of over 10 million mutations and provide their predicted effects as a resource for variant characterization. Finally, we propose the use of the diverse datasets compiled here as the Plants Genomic Benchmark (PGB), providing a comprehensive benchmark for deep learning-based methods in plant genomic research. The pre-trained AgroNT model is publicly available on HuggingFace at https://huggingface.co/InstaDeepAI/agro-nucleotide-transformer-1b for future research purposes.
2. Si Y, Zou J, Gao Y, Chuai G, Liu Q, Chen L. Foundation models in molecular biology. Biophys Rep 2024; 10:135-151. PMID: 39027316; PMCID: PMC11252241; DOI: 10.52601/bpr.2024.240006.
Abstract
Determining correlations between molecules at various levels is an important topic in molecular biology. Large language models have demonstrated a remarkable ability to capture correlations from large amounts of data in natural language processing and image generation. Because the correlations they capture transfer to a wide range of specific downstream tasks, large language models are also referred to as foundation models. The massive amount of data in molecular biology provides an excellent basis for developing foundation models, and their recent emergence has substantially advanced the field. We summarize the foundation models developed for RNA sequence data, DNA sequence data, protein sequence data, single-cell transcriptome data, and spatial transcriptome data, and further discuss research directions for the development of foundation models in molecular biology.
Affiliation(s)
- Yunda Si: Key Laboratory of Systems Health Science of Zhejiang Province, School of Life Science, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Hangzhou 310024, China
- Jiawei Zou: Shanghai Institute of Biochemistry and Cell Biology, Center for Excellence in Molecular Cell Science, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai 200031, China
- Yicheng Gao: Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Frontier Science Center for Stem Cell Research, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China; Shanghai Research Institute for Intelligent Autonomous Systems, Shanghai 201804, China
- Guohui Chuai: Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Frontier Science Center for Stem Cell Research, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China; Shanghai Research Institute for Intelligent Autonomous Systems, Shanghai 201804, China
- Qi Liu: Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Frontier Science Center for Stem Cell Research, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China; Shanghai Research Institute for Intelligent Autonomous Systems, Shanghai 201804, China
- Luonan Chen: Key Laboratory of Systems Health Science of Zhejiang Province, School of Life Science, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Hangzhou 310024, China; Shanghai Institute of Biochemistry and Cell Biology, Center for Excellence in Molecular Cell Science, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai 200031, China
3. Lin YJ, Menon AS, Hu Z, Brenner SE. Variant Impact Predictor database (VIPdb), version 2: trends from 25 years of genetic variant impact predictors. bioRxiv [Preprint] 2024:2024.06.25.600283. PMID: 38979289; PMCID: PMC11230257; DOI: 10.1101/2024.06.25.600283.
Abstract
Background: Variant interpretation is essential for identifying patients' disease-causing genetic variants amongst the millions detected in their genomes. Hundreds of Variant Impact Predictors (VIPs), also known as Variant Effect Predictors (VEPs), have been developed for this purpose, with a variety of methodologies and goals. To facilitate the exploration of available VIP options, we have created the Variant Impact Predictor database (VIPdb).
Results: VIPdb version 2 presents a collection of VIPs developed over the past 25 years, summarizing their characteristics, ClinGen calibrated scores, CAGI assessment results, publication details, access information, and citation patterns. We previously summarized 217 VIPs and their features in VIPdb in 2019. Building upon this foundation, we identified and categorized an additional 186 VIPs, resulting in a total of 403 VIPs in VIPdb version 2. The majority of the VIPs can predict the impacts of single nucleotide variants and nonsynonymous variants. More VIPs tailored to predict the impacts of insertions and deletions have been developed since the 2010s. In contrast, relatively few VIPs are dedicated to the prediction of splicing, structural, synonymous, and regulatory variants. The increasing rate of citations to VIPs reflects the ongoing growth in their use, and the evolving trends in citations reveal the development of the field and of individual methods.
Conclusions: VIPdb version 2 summarizes 403 VIPs and their features, potentially facilitating VIP exploration for various variant interpretation applications.
Availability: VIPdb version 2 is available at https://genomeinterpretation.org/vipdb.
Affiliation(s)
- Yu-Jen Lin: Department of Molecular and Cell Biology, University of California, Berkeley, California 94720, USA; Center for Computational Biology, University of California, Berkeley, California 94720, USA
- Arul S. Menon: Department of Molecular and Cell Biology, University of California, Berkeley, California 94720, USA; College of Computing, Data Science, and Society, University of California, Berkeley, California 94720, USA
- Zhiqiang Hu: Department of Plant and Microbial Biology, University of California, Berkeley, California 94720, USA; currently at Illumina, Foster City, California 94404, USA
- Steven E. Brenner: Department of Molecular and Cell Biology; Center for Computational Biology; College of Computing, Data Science, and Society; and Department of Plant and Microbial Biology, University of California, Berkeley, California 94720, USA
4. Zhai J, Gokaslan A, Schiff Y, Berthel A, Liu ZY, Miller ZR, Scheben A, Stitzer MC, Romay MC, Buckler ES, Kuleshov V. Cross-species modeling of plant genomes at single nucleotide resolution using a pre-trained DNA language model. bioRxiv [Preprint] 2024:2024.06.04.596709. PMID: 38895432; PMCID: PMC11185591; DOI: 10.1101/2024.06.04.596709.
Abstract
Understanding the function and fitness effects of diverse plant genomes requires transferable models. Language models (LMs) pre-trained on large-scale biological sequences can learn evolutionary conservation and are thus expected to offer better cross-species prediction than supervised deep learning models when fine-tuned on limited labeled data. We introduce PlantCaduceus, a plant DNA LM based on the Caduceus and Mamba architectures, pre-trained on a carefully curated dataset of 16 diverse angiosperm genomes. Fine-tuning PlantCaduceus on limited labeled Arabidopsis data for four tasks involving transcription and translation modeling demonstrated high transferability to maize, which diverged 160 million years ago, outperforming the best baseline model by 1.45-fold to 7.23-fold. PlantCaduceus also enables genome-wide deleterious mutation identification without multiple sequence alignment (MSA). Compared with MSA-based methods, PlantCaduceus demonstrated a threefold enrichment of rare alleles among prioritized deleterious mutations and matched state-of-the-art protein LMs. PlantCaduceus is a versatile pre-trained DNA LM expected to accelerate plant genomics and crop breeding applications.
Affiliation(s)
- Jingjing Zhai: Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853, USA
- Aaron Gokaslan: Department of Computer Science, Cornell University, Ithaca, NY 14853, USA
- Yair Schiff: Department of Computer Science, Cornell University, Ithaca, NY 14853, USA
- Ana Berthel: Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853, USA
- Zong-Yan Liu: Section of Plant Breeding and Genetics, Cornell University, Ithaca, NY 14853, USA
- Zachary R. Miller: Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853, USA
- Armin Scheben: Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY 11724, USA
- M. Cinta Romay: Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853, USA; Section of Plant Breeding and Genetics, Cornell University, Ithaca, NY 14853, USA
- Edward S. Buckler: Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853, USA; Section of Plant Breeding and Genetics, Cornell University, Ithaca, NY 14853, USA; USDA-ARS, Ithaca, NY 14853, USA
- Volodymyr Kuleshov: Department of Computer Science, Cornell University, Ithaca, NY 14853, USA
5. Tripathi S, Gabriel K, Tripathi PK, Kim E. Large language models reshaping molecular biology and drug development. Chem Biol Drug Des 2024; 103:e14568. PMID: 38898381; DOI: 10.1111/cbdd.14568.
Abstract
Large language models (LLMs) represent a significant advance in medicine and clinical informatics, with revolutionary potential for scientific breakthroughs and customized therapies. Trained on large datasets, LLMs can comprehend and analyze intricate biological data, encompassing genomic sequences, protein structures, and clinical health records. By drawing on this understanding of the language of biology, they can reveal hidden patterns and insights that may evade human researchers. LLMs have been shown to positively impact many aspects of molecular biology, including genomic analysis, drug development, precision medicine, biomarker development, experimental design, collaborative research, and access to specialized expertise. However, the obstacles and ethical implications involved must be acknowledged and addressed: data bias and generalization, data privacy and security, explainability and interpretability, and responsible application all require careful consideration. Resolving these obstacles will enable the capabilities of LLMs to be fully utilized, leading to substantial progress in molecular biology and pharmaceutical research, with benefits for both individuals and the broader community.
Affiliation(s)
- Satvik Tripathi: Drexel University, Philadelphia, Pennsylvania, USA; Harvard Medical School, Boston, Massachusetts, USA
- Kyla Gabriel: Harvard Medical School, Boston, Massachusetts, USA
- Edward Kim: Drexel University, Philadelphia, Pennsylvania, USA
6. Telenti A, Auli M, Hie BL, Maher C, Saria S, Ioannidis JPA. Large language models for science and medicine. Eur J Clin Invest 2024; 54:e14183. PMID: 38381530; DOI: 10.1111/eci.14183.
Abstract
Large language models (LLMs) are a type of machine learning model that learns statistical patterns over text, such as predicting the next word in a sequence. Both general-purpose and task-specific LLMs have demonstrated potential across diverse applications. Science and medicine have many data types that are highly suitable for LLMs, such as scientific texts (publications, patents, and textbooks), electronic medical records, large databases of DNA and protein sequences, and chemical compounds. Carefully validated systems that can understand and reason across all these modalities may maximize benefits. Despite the inevitable limitations and caveats of any new technology, and some uncertainties specific to LLMs, LLMs have the potential to be transformative in science and medicine.
Affiliation(s)
- Amalio Telenti: Department of Integrative Structural and Computational Biology, Scripps Research, La Jolla, California, USA; Vir Biotechnology, Inc., San Francisco, California, USA
- Brian L Hie: FAIR, Meta, Menlo Park, California, USA; Department of Chemical Engineering, Stanford University, Stanford, California, USA
- Cyrus Maher: Vir Biotechnology, Inc., San Francisco, California, USA
- Suchi Saria: Malone Center for Engineering and Healthcare, Johns Hopkins University, Baltimore, Maryland, USA
- John P A Ioannidis: Department of Medicine; Department of Epidemiology and Population Health; Department of Biomedical Data Science; Department of Statistics; and Meta-Research Innovation Center at Stanford (METRICS), Stanford University, Stanford, California, USA
7. Lam HYI, Ong XE, Mutwil M. Large language models in plant biology. Trends Plant Sci 2024:S1360-1385(24)00118-3. PMID: 38797656; DOI: 10.1016/j.tplants.2024.04.013.
Abstract
Large language models (LLMs), such as ChatGPT, have taken the world by storm. However, LLMs are not limited to human language and can be used to analyze sequential data, such as DNA, protein, and gene expression. The resulting foundation models can be repurposed to identify the complex patterns within the data, resulting in powerful, multipurpose prediction tools able to predict the state of cellular systems. This review outlines the different types of LLMs and showcases their recent uses in biology. Since LLMs have not yet been embraced by the plant community, we also cover how these models can be deployed for the plant kingdom.
Affiliation(s)
- Hilbert Yuen In Lam: School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore 637551, Singapore
- Xing Er Ong: School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore 637551, Singapore
- Marek Mutwil: School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore 637551, Singapore
8. Schraiber JG, Edge MD, Pennell M. Unifying approaches from statistical genetics and phylogenetics for mapping phenotypes in structured populations. bioRxiv [Preprint] 2024:2024.02.10.579721. PMID: 38496530; PMCID: PMC10942266; DOI: 10.1101/2024.02.10.579721.
Abstract
In both statistical genetics and phylogenetics, a major goal is to identify correlations between genetic loci or other aspects of the phenotype or environment and a focal trait. In these two fields, there are sophisticated but disparate statistical traditions aimed at these tasks. The disconnect between their respective approaches is becoming untenable as questions in medicine, conservation biology, and evolutionary biology increasingly rely on integrating data from within and among species, and once-clear conceptual divisions are becoming increasingly blurred. To help bridge this divide, we derive a general model describing the covariance between the genetic contributions to the quantitative phenotypes of different individuals. Taking this approach shows that standard models in both statistical genetics (e.g., Genome-Wide Association Studies; GWAS) and phylogenetic comparative biology (e.g., phylogenetic regression) can be interpreted as special cases of this more general quantitative-genetic model. Because these models share the same core architecture, we can build a unified understanding of the strengths and limitations of different methods for controlling for genetic structure when testing for associations. We develop intuition for why and when spurious correlations may occur using analytical theory and conduct population-genetic and phylogenetic simulations of quantitative traits. The structural similarity of problems in statistical genetics and phylogenetics enables us to take methodological advances from one field and apply them in the other. We demonstrate this by showing how a standard GWAS technique, including both the genetic relatedness matrix (GRM) and its leading eigenvectors (the principal components of the genotype matrix) in a regression model, can mitigate spurious correlations in phylogenetic analyses. As a case study, we re-examine an analysis testing for co-evolution of expression levels between genes across a fungal phylogeny, and show that including covariance matrix eigenvectors as covariates decreases the false positive rate while simultaneously increasing the true positive rate. More generally, this work provides a foundation for more integrative approaches to understanding the genetic architecture of phenotypes and how evolutionary processes shape it.
9. Tang Z, Koo PK. Evaluating the representational power of pre-trained DNA language models for regulatory genomics. bioRxiv [Preprint] 2024:2024.02.29.582810. PMID: 38464101; PMCID: PMC10925287; DOI: 10.1101/2024.02.29.582810.
Abstract
The emergence of genomic language models (gLMs) offers an unsupervised approach to learn a wide diversity of cis-regulatory patterns in the non-coding genome without requiring labels of functional activity generated by wet-lab experiments. Previous evaluations have shown pre-trained gLMs can be leveraged to improve prediction performance across a broad range of regulatory genomics tasks, albeit using relatively simple benchmark datasets and baseline models. Since the gLMs in these studies were tested upon fine-tuning their weights for each downstream task, determining whether gLM representations embody a foundational understanding of cis-regulatory biology remains an open question. Here we evaluate the representational power of pre-trained gLMs to predict and interpret cell-type-specific functional genomics data that span DNA and RNA regulation. Our findings suggest that current gLMs do not offer substantial advantages over conventional machine learning approaches that use one-hot encoded sequences. This work highlights a major limitation with current gLMs, raising potential issues in conventional pre-training strategies for the non-coding genome.
Affiliation(s)
- Ziqi Tang: Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, NY, USA
- Peter K Koo: Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, NY, USA
10. Liu J, Yang M, Yu Y, Xu H, Li K, Zhou X. Large language models in bioinformatics: applications and perspectives. arXiv [Preprint] 2024:arXiv:2401.04155v1. PMID: 38259343; PMCID: PMC10802675.
Abstract
Large language models (LLMs) are a class of deep learning-based artificial intelligence models that achieve strong performance across a variety of tasks, especially in natural language processing (NLP). LLMs typically consist of artificial neural networks with numerous parameters, trained on large amounts of unlabeled data using self-supervised or semi-supervised learning. Their potential for solving bioinformatics problems may even exceed their proficiency in modeling human language. In this review, we summarize the prominent large language models used in natural language processing, such as BERT and GPT, and explore the applications of large language models at different omics levels in bioinformatics, including genomics, transcriptomics, proteomics, drug discovery, and single-cell analysis. Finally, we summarize the potential and prospects of large language models for solving bioinformatics problems.
Affiliation(s)
- Jiajia Liu: Center for Computational Systems Medicine, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas 77030, USA
- Mengyuan Yang: School of Life Sciences, Zhengzhou University, Zhengzhou, Henan 450001, China
- Yankai Yu: School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu, Sichuan 611756, China
- Haixia Xu: The Center of Gerontology and Geriatrics, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China; West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
- Kang Li: West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
- Xiaobo Zhou: Center for Computational Systems Medicine, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas 77030, USA; McGovern Medical School, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA; School of Dentistry, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA