1
Shum MHH, Lee Y, Tam L, Xia H, Chung OLW, Guo Z, Lam TTY. Binding affinity between coronavirus spike protein and human ACE2 receptor. Comput Struct Biotechnol J 2024; 23:759-770. [PMID: 38304547 PMCID: PMC10831124 DOI: 10.1016/j.csbj.2024.01.009] [Received: 09/15/2023] [Revised: 01/14/2024] [Accepted: 01/15/2024] [Indexed: 02/03/2024] Open
Abstract
Coronaviruses (CoVs) pose a major risk to global public health due to their ability to infect diverse animal species and potential for emergence in humans. The CoV spike protein mediates viral entry into the cell and plays a crucial role in determining the binding affinity to host cell receptors. With particular emphasis on α- and β-coronaviruses that infect humans and domestic animals, current research on CoV receptor use suggests that the exploitation of the angiotensin-converting enzyme 2 (ACE2) receptor poses a significant threat for viral emergence with pandemic potential. This review summarizes the approaches used to study binding interactions between CoV spike proteins and the human ACE2 (hACE2) receptor. Solid-phase enzyme immunoassays and cell binding assays allow qualitative assessment of binding but lack quantitative evaluation of affinity. Surface plasmon resonance, bio-layer interferometry, and microscale thermophoresis, on the other hand, provide accurate affinity measurement through equilibrium dissociation constants (KD). In silico modeling predicts affinity through binding structure modeling, protein-protein docking simulations, and binding energy calculations but reveals inconsistent results due to the lack of a standardized approach. Machine learning and deep learning models utilize simulated and experimental protein-protein interaction data to elucidate the critical residues associated with CoV binding affinity to hACE2. Further optimization and standardization of existing approaches for studying binding affinity could aid pandemic preparedness. Specifically, prioritizing surveillance of CoVs that can bind to human receptors stands to mitigate the risk of zoonotic spillover.
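As an illustration of the equilibrium dissociation constant mentioned in the abstract, the sketch below uses the standard 1:1 binding relations, KD = k_off / k_on and fraction bound = [L] / ([L] + KD). The function names and the rate constants are hypothetical values chosen for illustration; they are not taken from the review or from any measured spike-hACE2 interaction.

```python
def dissociation_constant(k_on: float, k_off: float) -> float:
    """Return KD (in M) for a 1:1 interaction.

    k_on  : association rate constant, in 1/(M*s)
    k_off : dissociation rate constant, in 1/s
    """
    if k_on <= 0:
        raise ValueError("k_on must be positive")
    return k_off / k_on


def fraction_bound(ligand_conc: float, kd: float) -> float:
    """Equilibrium fraction of receptor occupied at a given free ligand concentration (M)."""
    return ligand_conc / (ligand_conc + kd)


# Hypothetical rate constants, for illustration only.
kd = dissociation_constant(k_on=1e5, k_off=1e-3)
print(f"KD = {kd:.1e} M")
print(f"Fraction bound at [L] = KD: {fraction_bound(kd, kd)}")
```

A lower KD means tighter binding; when the free ligand concentration equals KD, half of the receptor is occupied.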
Affiliation(s)
- Marcus Ho-Hin Shum
  - State Key Laboratory of Emerging Infectious Diseases, The University of Hong Kong, Hong Kong, China
  - School of Public Health, The University of Hong Kong, Hong Kong, China
  - Laboratory of Data Discovery for Health (D24H), Hong Kong Science Park, Hong Kong, China
- Yang Lee
  - School of Public Health, The University of Hong Kong, Hong Kong, China
  - Centre for Immunology and Infection (C2i), Hong Kong Science Park, Hong Kong, China
- Leighton Tam
  - School of Public Health, The University of Hong Kong, Hong Kong, China
  - Laboratory of Data Discovery for Health (D24H), Hong Kong Science Park, Hong Kong, China
- Hui Xia
  - Department of Chemistry, South University of Science and Technology of China, China
  - Department of Chemistry, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, China
- Oscar Lung-Wa Chung
  - Department of Chemistry, South University of Science and Technology of China, China
- Zhihong Guo
  - Department of Chemistry, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, China
- Tommy Tsan-Yuk Lam
  - State Key Laboratory of Emerging Infectious Diseases, The University of Hong Kong, Hong Kong, China
  - School of Public Health, The University of Hong Kong, Hong Kong, China
  - Laboratory of Data Discovery for Health (D24H), Hong Kong Science Park, Hong Kong, China
  - Centre for Immunology and Infection (C2i), Hong Kong Science Park, Hong Kong, China
2
Chow CFW, Ghosh S, Hadarovich A, Toth-Petroczy A. SHARK enables sensitive detection of evolutionary homologs and functional analogs in unalignable and disordered sequences. Proc Natl Acad Sci U S A 2024; 121:e2401622121. [PMID: 39383002 PMCID: PMC11494347 DOI: 10.1073/pnas.2401622121] [Received: 01/24/2024] [Accepted: 08/30/2024] [Indexed: 10/11/2024] Open
Abstract
Intrinsically disordered regions (IDRs) are structurally flexible protein segments with regulatory functions in multiple contexts, such as in the assembly of biomolecular condensates. Since IDRs undergo more rapid evolution than ordered regions, identifying homology of such poorly conserved regions remains challenging for state-of-the-art alignment-based methods that rely on position-specific conservation of residues. Thus, systematic functional annotation and evolutionary analysis of IDRs have been limited, despite them comprising ~21% of proteins. To accurately assess homology between unalignable sequences, we developed an alignment-free sequence comparison algorithm, SHARK (Similarity/Homology Assessment by Relating K-mers). We trained SHARK-dive, a machine learning homology classifier, which achieved superior performance to standard alignment-based approaches in assessing evolutionary homology in unalignable sequences. Furthermore, it correctly identified dissimilar but functionally analogous IDRs in IDR-replacement experiments reported in the literature, whereas alignment-based tools were incapable of detecting such functional relationships. SHARK-dive not only predicts functionally similar IDRs at a proteome-wide scale but also identifies cryptic sequence properties and motifs that drive remote homology and analogy, thereby providing interpretable and experimentally verifiable hypotheses of the sequence determinants that underlie such relationships. SHARK-dive acts as an alternative to alignment to facilitate systematic analysis and functional annotation of the unalignable protein universe.
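To make the idea of alignment-free, k-mer-based comparison concrete, here is a minimal sketch that scores two sequences by the Jaccard overlap of their k-mer sets. This is a generic textbook technique used only for illustration; it is not the SHARK scoring scheme or the SHARK-dive classifier, and the function names and toy sequences are hypothetical.

```python
from collections import Counter


def kmer_counts(seq: str, k: int = 3) -> Counter:
    """Count all overlapping k-mers in a sequence (empty if the sequence is shorter than k)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_jaccard(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of the k-mer sets of two sequences; no alignment is computed."""
    ka, kb = set(kmer_counts(a, k)), set(kmer_counts(b, k))
    if not ka and not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


# Toy sequences, for illustration only.
print(kmer_jaccard("ACDE", "CDEF"))      # shares one of three distinct 3-mers
print(kmer_jaccard("ACDEFG", "ACDEFG"))  # identical sequences
```

Because the score depends only on which k-mers occur, it is insensitive to where in the sequence they occur, which is the property that lets alignment-free methods compare regions that cannot be aligned position by position.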
Affiliation(s)
- Chi Fung Willis Chow
  - Max Planck Institute of Molecular Cell Biology and Genetics, Dresden 01307, Germany
  - Center for Systems Biology Dresden, Dresden 01307, Germany
  - Cluster of Excellence Physics of Life, Technische Universität Dresden, Dresden 01062, Germany
- Soumyadeep Ghosh
  - Max Planck Institute of Molecular Cell Biology and Genetics, Dresden 01307, Germany
  - Center for Systems Biology Dresden, Dresden 01307, Germany
- Anna Hadarovich
  - Max Planck Institute of Molecular Cell Biology and Genetics, Dresden 01307, Germany
  - Center for Systems Biology Dresden, Dresden 01307, Germany
- Agnes Toth-Petroczy
  - Max Planck Institute of Molecular Cell Biology and Genetics, Dresden 01307, Germany
  - Center for Systems Biology Dresden, Dresden 01307, Germany
  - Cluster of Excellence Physics of Life, Technische Universität Dresden, Dresden 01062, Germany
3
Vazzana G, Savojardo C, Martelli PL, Casadio R. Testing the Capability of Embedding-Based Alignments on the GST Superfamily Classification: The Role of Protein Length. Molecules 2024; 29:4616. [PMID: 39407545 PMCID: PMC11478096 DOI: 10.3390/molecules29194616] [Received: 09/06/2024] [Revised: 09/19/2024] [Accepted: 09/20/2024] [Indexed: 10/20/2024] Open
Abstract
In order to shed light on the usage of protein language model-based alignment procedures, we attempted the classification of Glutathione S-transferases (GST; EC 2.5.1.18) and compared our results with the ARBA/UNI rule-based annotation in UniProt. GST is a protein superfamily involved in cellular detoxification from harmful xenobiotics and endobiotics, widely distributed in prokaryotes and eukaryotes. What is particularly interesting is that the superfamily is characterized by different classes, comprising proteins from different taxa that can act in different cell locations (cytosolic, mitochondrial and microsomal compartments) with different folds and different levels of sequence identity with remote homologs. For this reason, GST functional annotation in a specific class is problematic: unless a structure is released, the protein can be classified only on the basis of sequence similarity, which excludes the annotation of remote homologs. Here, we adopt an embedding-based alignment to classify 15,061 GST proteins automatically annotated by the UniProt-ARBA/UNI rules. Embedding is based on the Meta ESM2-15b protein language. The embedding-based alignment reaches more than a 99% rate of perfect matching with the UniProt automatic procedure. Data analysis indicates that 46% of the UniProt automatically classified proteins do not conserve the typical length of canonical GSTs, whose structure is known. Therefore, 46% of the classified proteins do not conserve the template/s structure required for their family classification. Our approach finds that 41% of 64,207 GST UniProt proteins not yet assigned to any class can be classified consistently with the structural template length.
Affiliation(s)
- Pier Luigi Martelli
  - Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, 40126 Bologna, Italy; (G.V.); (C.S.)
- Rita Casadio
  - Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, 40126 Bologna, Italy; (G.V.); (C.S.)
4
Wei G, Wu N, Zhao K, Yang S, Wang L, Liu Y. DeepCheck: multitask learning aids in assessing microbial genome quality. Brief Bioinform 2024; 25:bbae539. [PMID: 39438078 PMCID: PMC11495869 DOI: 10.1093/bib/bbae539] [Received: 05/31/2024] [Revised: 08/26/2024] [Accepted: 10/09/2024] [Indexed: 10/25/2024] Open
Abstract
Metagenomic analyses facilitate the exploration of the microbial world, advancing our understanding of microbial roles in ecological and biological processes. A pivotal aspect of metagenomic analysis involves assessing the quality of metagenome-assembled genomes (MAGs), crucial for accurate biological insights. Current machine learning-based methods often treat completeness and contamination prediction as separate tasks, overlooking their inherent relationship and limiting models' generalization. In this study, we present DeepCheck, a multitasking deep learning framework for simultaneous prediction of MAG completeness and contamination. DeepCheck consistently outperforms existing tools in accuracy across various experimental settings and demonstrates comparable speed while maintaining high predictive accuracy even for new lineages. Additionally, we employ interpretable machine learning techniques to identify specific genes and pathways that drive the model's predictions, enabling independent investigation and assessment of these biological elements for deeper insights.
Affiliation(s)
- Guo Wei
  - State Key Laboratory of Pharmaceutical Biotechnology, School of Life Sciences, Nanjing University, 163 Xianlin Avenue, Qixia District, Nanjing 210000, China
- Nannan Wu
  - State Key Laboratory of Pharmaceutical Biotechnology, School of Life Sciences, Nanjing University, 163 Xianlin Avenue, Qixia District, Nanjing 210000, China
- Kunyang Zhao
  - State Key Laboratory of Pharmaceutical Biotechnology, School of Life Sciences, Nanjing University, 163 Xianlin Avenue, Qixia District, Nanjing 210000, China
- Sihai Yang
  - State Key Laboratory of Pharmaceutical Biotechnology, School of Life Sciences, Nanjing University, 163 Xianlin Avenue, Qixia District, Nanjing 210000, China
  - Co-Innovation Center for Sustainable Forestry in Southern China, Nanjing Forestry University, 159 Panlong road, Xuanwu District, Nanjing 210000, China
- Long Wang
  - State Key Laboratory of Pharmaceutical Biotechnology, School of Life Sciences, Nanjing University, 163 Xianlin Avenue, Qixia District, Nanjing 210000, China
- Yan Liu
  - Department of Computer Science, Yangzhou University, 196 Huaxi Road, Hanjiang District, Yangzhou 225100, China
5
Eren AM, Banfield JF. Modern microbiology: Embracing complexity through integration across scales. Cell 2024; 187:5151-5170. [PMID: 39303684 PMCID: PMC11450119 DOI: 10.1016/j.cell.2024.08.028] [Received: 06/22/2024] [Revised: 08/14/2024] [Accepted: 08/14/2024] [Indexed: 09/22/2024]
Abstract
Microbes were the only form of life on Earth for most of its history, and they still account for the vast majority of life's diversity. They convert rocks to soil, produce much of the oxygen we breathe, remediate our sewage, and sustain agriculture. Microbes are vital to planetary health as they maintain biogeochemical cycles that produce and consume major greenhouse gases and support large food webs. Modern microbiologists analyze nucleic acids, proteins, and metabolites; leverage sophisticated genetic tools, software, and bioinformatic algorithms; and process and integrate complex and heterogeneous datasets so that microbial systems may be harnessed to address contemporary challenges in health, the environment, and basic science. Here, we consider an inevitably incomplete list of emergent themes in our discipline and highlight those that we recognize as the archetypes of its modern era that aim to address the most pressing problems of the 21st century.
Affiliation(s)
- A Murat Eren
  - Helmholtz Institute for Functional Marine Biodiversity, 26129 Oldenburg, Germany; Alfred Wegener Institute, Helmholtz Centre for Polar and Marine Research, Bremerhaven, Germany; Institute for Chemistry and Biology of the Marine Environment, University of Oldenburg, Oldenburg, Germany; Marine Biological Laboratory, Woods Hole, MA, USA; Max Planck Institute for Marine Microbiology, Bremen, Germany.
- Jillian F Banfield
  - Department of Earth and Planetary Sciences, University of California, Berkeley, Berkeley, CA, USA; Earth and Environmental Sciences, Lawrence Berkeley National Laboratory, Berkeley, CA, USA; Innovative Genomics Institute, University of California, Berkeley, Berkeley, CA, USA; Biomedicine Discovery Institute, Monash University, Clayton, VIC, Australia; Department of Environmental Science Policy, and Management, University of California, Berkeley, Berkeley, CA, USA.
6
Bordin N, Scholes H, Rauer C, Roca-Martínez J, Sillitoe I, Orengo C. Clustering protein functional families at large scale with hierarchical approaches. Protein Sci 2024; 33:e5140. [PMID: 39145441 PMCID: PMC11325189 DOI: 10.1002/pro.5140] [Received: 03/27/2024] [Revised: 07/22/2024] [Accepted: 07/24/2024] [Indexed: 08/16/2024]
Abstract
Proteins, fundamental to cellular activities, reveal their function and evolution through their structure and sequence. CATH functional families (FunFams) are coherent clusters of protein domain sequences in which the function is conserved across their members. The increasing volume and complexity of protein data enabled by large-scale repositories like MGnify or AlphaFold Database requires more powerful approaches that can scale to the size of these new resources. In this work, we introduce MARC and FRAN, two algorithms developed to build upon and address limitations of GeMMA/FunFHMMER, our original methods developed to classify proteins with related functions using a hierarchical approach. We also present CATH-eMMA, which uses embeddings or Foldseek distances to form relationship trees from distance matrices, reducing computational demands and handling various data types effectively. CATH-eMMA offers a highly robust and much faster tool for clustering protein functions on a large scale, providing a new tool for future studies in protein function and evolution.
Affiliation(s)
- Nicola Bordin
  - Institute of Structural and Molecular Biology, University College London, London, UK
- Harry Scholes
  - Institute of Structural and Molecular Biology, University College London, London, UK
- Clemens Rauer
  - Institute of Structural and Molecular Biology, University College London, London, UK
  - Universidad Autonoma de Madrid, Ciudad Universitaria de Cantoblanco, Madrid, Spain
- Joel Roca-Martínez
  - Institute of Structural and Molecular Biology, University College London, London, UK
- Ian Sillitoe
  - Institute of Structural and Molecular Biology, University College London, London, UK
- Christine Orengo
  - Institute of Structural and Molecular Biology, University College London, London, UK
7
Tan Y, Li M, Zhou B, Zhong B, Zheng L, Tan P, Zhou Z, Yu H, Fan G, Hong L. Simple, Efficient, and Scalable Structure-Aware Adapter Boosts Protein Language Models. J Chem Inf Model 2024; 64:6338-6349. [PMID: 39110130 DOI: 10.1021/acs.jcim.4c00689] [Indexed: 08/27/2024]
Abstract
Fine-tuning pretrained protein language models (PLMs) has emerged as a prominent strategy for enhancing downstream prediction tasks, often outperforming traditional supervised learning approaches. Parameter-efficient fine-tuning, a widely applied and powerful technique in natural language processing, could potentially enhance the performance of PLMs. However, the direct transfer to life science tasks is nontrivial due to the different training strategies and data forms. To address this gap, we introduce SES-Adapter, a simple, efficient, and scalable adapter method for enhancing the representation learning of PLMs. SES-Adapter incorporates PLM embeddings with structural sequence embeddings to create structure-aware representations. We show that the proposed method is compatible with different PLM architectures and across diverse tasks. Extensive evaluations are conducted on 2 types of folding structures with notable quality differences, 9 state-of-the-art baselines, and 9 benchmark data sets across distinct downstream tasks. Results show that, compared to vanilla PLMs, SES-Adapter improves downstream task performance by a maximum of 11% and an average of 3%, and accelerates convergence by a maximum of 1034% and an average of 362%; training efficiency is also improved by approximately 2 times. Moreover, positive optimization is observed even with low-quality predicted structures. The source code for SES-Adapter is available at https://github.com/tyang816/SES-Adapter.
Affiliation(s)
- Yang Tan
  - School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
  - Shanghai Artificial Intelligence Laboratory, Shanghai 200232, China
  - Shanghai National Center for Applied Mathematics (SJTU Center), Shanghai 200240, China
- Mingchen Li
  - School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
  - Shanghai Artificial Intelligence Laboratory, Shanghai 200232, China
  - Shanghai National Center for Applied Mathematics (SJTU Center), Shanghai 200240, China
- Bingxin Zhou
  - Shanghai National Center for Applied Mathematics (SJTU Center), Shanghai 200240, China
  - Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai 200240, China
- Bozitao Zhong
  - Shanghai National Center for Applied Mathematics (SJTU Center), Shanghai 200240, China
  - Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai 200240, China
- Lirong Zheng
  - Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai 200240, China
  - Department of Cell and Developmental Biology & Michigan Neuroscience Institute, University of Michigan Medical School, Ann Arbor, Michigan 48104, United States
- Pan Tan
  - Shanghai Artificial Intelligence Laboratory, Shanghai 200232, China
  - Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai 200240, China
- Ziyi Zhou
  - Shanghai National Center for Applied Mathematics (SJTU Center), Shanghai 200240, China
  - Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai 200240, China
- Huiqun Yu
  - School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
- Guisheng Fan
  - School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
- Liang Hong
  - Shanghai Artificial Intelligence Laboratory, Shanghai 200232, China
  - Shanghai National Center for Applied Mathematics (SJTU Center), Shanghai 200240, China
  - Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai 200240, China
  - Zhangjiang Institute for Advanced Study, Shanghai Jiao Tong University, Shanghai 200240, China
8
Margelevičius M. GTalign: spatial index-driven protein structure alignment, superposition, and search. Nat Commun 2024; 15:7305. [PMID: 39181863 PMCID: PMC11344802 DOI: 10.1038/s41467-024-51669-z] [Received: 01/04/2024] [Accepted: 08/14/2024] [Indexed: 08/27/2024] Open
Abstract
With protein databases growing rapidly due to advances in structural and computational biology, the ability to accurately align and rapidly search protein structures has become essential for biological research. In response to the challenge posed by vast protein structure repositories, GTalign offers an innovative solution to protein structure alignment and search: an algorithm that achieves optimal superposition at high speeds. Through the design and implementation of spatial structure indexing, GTalign parallelizes all stages of superposition search across residues and protein structure pairs, yielding rapid identification of optimal superpositions. Rigorous evaluation across diverse datasets reveals GTalign as the most accurate among structure aligners while presenting orders of magnitude in speedup at state-of-the-art accuracy. GTalign's high speed and accuracy make it useful for numerous applications, including functional inference, evolutionary analyses, protein design, and drug discovery, contributing to advancing understanding of protein structure and function.
9
Kabir A, Moldwin A, Bromberg Y, Shehu A. In the twilight zone of protein sequence homology: do protein language models learn protein structure? Bioinformatics Advances 2024; 4:vbae119. [PMID: 39183802 PMCID: PMC11344590 DOI: 10.1093/bioadv/vbae119] [Received: 04/01/2024] [Revised: 08/01/2024] [Accepted: 08/12/2024] [Indexed: 08/27/2024]
Abstract
Motivation: Protein language models based on the transformer architecture are increasingly improving performance on protein prediction tasks, including secondary structure, subcellular localization, and more. Despite being trained only on protein sequences, protein language models appear to implicitly learn protein structure. This paper investigates whether sequence representations learned by protein language models encode structural information and to what extent. Results: We address this by evaluating protein language models on remote homology prediction, where identifying remote homologs from sequence information alone requires structural knowledge, especially in the "twilight zone" of very low sequence identity. Through rigorous testing at progressively lower sequence identities, we profile the performance of protein language models ranging from millions to billions of parameters in a zero-shot setting. Our findings indicate that while transformer-based protein language models outperform traditional sequence alignment methods, they still struggle in the twilight zone. This suggests that current protein language models have not sufficiently learned protein structure to address remote homology prediction when sequence signals are weak. Availability and implementation: We believe this opens the way for further research both on remote homology prediction and on the broader goal of learning sequence- and structure-rich representations of protein molecules. All code, data, and models are made publicly available.
Affiliation(s)
- Anowarul Kabir
  - Department of Computer Science, George Mason University, Fairfax, VA 22030, United States
- Asher Moldwin
  - Department of Computer Science, George Mason University, Fairfax, VA 22030, United States
- Yana Bromberg
  - Department of Computer Science, Emory University, Atlanta, GA 30307, United States
- Amarda Shehu
  - Department of Computer Science, George Mason University, Fairfax, VA 22030, United States
10
Espinoza JL, Phillips A, Prentice MB, Tan GS, Kamath PL, Lloyd KG, Dupont CL. Unveiling the microbial realm with VEBA 2.0: a modular bioinformatics suite for end-to-end genome-resolved prokaryotic, (micro)eukaryotic and viral multi-omics from either short- or long-read sequencing. Nucleic Acids Res 2024; 52:e63. [PMID: 38909293 DOI: 10.1093/nar/gkae528] [Received: 03/08/2024] [Revised: 05/21/2024] [Accepted: 06/10/2024] [Indexed: 06/24/2024] Open
Abstract
The microbiome is a complex community of microorganisms, encompassing prokaryotic (bacterial and archaeal), eukaryotic, and viral entities. This microbial ensemble plays a pivotal role in influencing the health and productivity of diverse ecosystems while shaping the web of life. However, many software suites developed to study microbiomes analyze only the prokaryotic community and provide limited to no support for viruses and microeukaryotes. Previously, we introduced the Viral Eukaryotic Bacterial Archaeal (VEBA) open-source software suite to address this critical gap in microbiome research by extending genome-resolved analysis beyond prokaryotes to encompass the understudied realms of eukaryotes and viruses. Here we present VEBA 2.0 with key updates including a comprehensive clustered microeukaryotic protein database, rapid genome/protein-level clustering, bioprospecting, non-coding/organelle gene modeling, genome-resolved taxonomic/pathway profiling, long-read support, and containerization. We demonstrate VEBA's versatile application through the analysis of diverse case studies including marine water, Siberian permafrost, and white-tailed deer lung tissues with the latter showcasing how to identify integrated viruses. VEBA represents a crucial advancement in microbiome research, offering a powerful and accessible software suite that bridges the gap between genomics and biotechnological solutions.
Affiliation(s)
- Josh L Espinoza
  - Department of Environment and Sustainability, J. Craig Venter Institute, La Jolla, CA 92037, USA
  - Department of Genomic Medicine and Infectious Diseases, J. Craig Venter Institute, La Jolla, CA 92037, USA
- Allan Phillips
  - Department of Environment and Sustainability, J. Craig Venter Institute, La Jolla, CA 92037, USA
  - Department of Genomic Medicine and Infectious Diseases, J. Craig Venter Institute, La Jolla, CA 92037, USA
- Melanie B Prentice
  - School of Food and Agriculture, University of Maine, Orono, ME 04469, USA
- Gene S Tan
  - Department of Genomic Medicine and Infectious Diseases, J. Craig Venter Institute, La Jolla, CA 92037, USA
- Pauline L Kamath
  - School of Food and Agriculture, University of Maine, Orono, ME 04469, USA
  - Maine Center for Genetics in the Environment, University of Maine, Orono, ME 04469, USA
- Karen G Lloyd
  - Microbiology Department, University of Tennessee, Knoxville, TN 37917, USA
- Chris L Dupont
  - Department of Environment and Sustainability, J. Craig Venter Institute, La Jolla, CA 92037, USA
  - Department of Genomic Medicine and Infectious Diseases, J. Craig Venter Institute, La Jolla, CA 92037, USA
11
Hong L, Hu Z, Sun S, Tang X, Wang J, Tan Q, Zheng L, Wang S, Xu S, King I, Gerstein M, Li Y. Fast, sensitive detection of protein homologs using deep dense retrieval. Nat Biotechnol 2024:10.1038/s41587-024-02353-6. [PMID: 39123049 DOI: 10.1038/s41587-024-02353-6] [Received: 09/15/2023] [Accepted: 07/12/2024] [Indexed: 08/12/2024]
Abstract
The identification of protein homologs in large databases using conventional methods, such as protein sequence comparison, often misses remote homologs. Here, we offer an ultrafast, highly sensitive method, dense homolog retriever (DHR), for detecting homologs on the basis of a protein language model and dense retrieval techniques. Its dual-encoder architecture generates different embeddings for the same protein sequence and easily locates homologs by comparing these representations. Its alignment-free nature improves speed and the protein language model incorporates rich evolutionary and structural information within DHR embeddings. DHR achieves a >10% increase in sensitivity compared to previous methods and a >56% increase in sensitivity at the superfamily level for samples that are challenging to identify using alignment-based approaches. It is up to 22 times faster than traditional methods such as PSI-BLAST and DIAMOND and up to 28,700 times faster than HMMER. The new remote homologs exclusively found by DHR are useful for revealing connections between well-characterized proteins and improving our knowledge of protein evolution, structure and function.
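The retrieval step described in the abstract, locating homologs by comparing embedding vectors rather than aligning sequences, can be sketched as a nearest-neighbor search under cosine similarity. The example below is only an illustration of that general idea: the three-dimensional vectors stand in for protein language model embeddings, the names `protA`, `protB`, and `protC` are hypothetical, and nothing here reproduces DHR's dual-encoder model or its indexing.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k(query: Sequence[float], database: Dict[str, Sequence[float]], k: int = 2) -> List[str]:
    """Return the names of the k database entries most similar to the query embedding."""
    scored = sorted(
        ((cosine(query, vec), name) for name, vec in database.items()),
        reverse=True,
    )
    return [name for _, name in scored[:k]]


# Placeholder embeddings; a real system would obtain these from a trained encoder.
database = {
    "protA": [1.0, 0.0, 0.0],
    "protB": [0.9, 0.1, 0.0],
    "protC": [0.0, 1.0, 0.0],
}
print(top_k([1.0, 0.0, 0.0], database, k=2))
```

This brute-force scan is linear in the database size; large-scale dense retrieval systems typically replace it with an approximate nearest-neighbor index.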
Affiliation(s)
- Liang Hong
  - Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong SAR, China
- Zhihang Hu
  - Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong SAR, China
- Siqi Sun
  - Research Institute of Intelligent Complex Systems, Fudan University, Shanghai, China
  - Shanghai AI Laboratory, Shanghai, China
- Xiangru Tang
  - Department of Computer Science, Yale University, New Haven, CT, USA
- Jiuming Wang
  - Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong SAR, China
  - OneAIM Ltd., Hong Kong SAR, China
- Qingxiong Tan
  - Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong SAR, China
- Liangzhen Zheng
  - Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
  - Shanghai Zelixir Biotech Company Ltd., Shanghai, China
- Sheng Wang
  - Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
  - Shanghai Zelixir Biotech Company Ltd., Shanghai, China
- Sheng Xu
  - Research Institute of Intelligent Complex Systems, Fudan University, Shanghai, China
  - Shanghai AI Laboratory, Shanghai, China
- Irwin King
  - Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong SAR, China
- Mark Gerstein
  - Department of Computer Science, Yale University, New Haven, CT, USA
  - Computational Biology and Bioinformatics Program, Yale University, New Haven, CT, USA
  - Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
  - Department of Statistics and Data Science, Yale University, New Haven, CT, USA
- Yu Li
  - Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong SAR, China
  - Shanghai AI Laboratory, Shanghai, China
  - Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA, USA
  - Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, MA, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - The Chinese University of Hong Kong Shenzhen Research Institute, Shenzhen, China
12
|
Fazekas Z, K Menyhárd D, Perczel A. LoCoHD: a metric for comparing local environments of proteins. Nat Commun 2024; 15:4029. [PMID: 38740745 DOI: 10.1038/s41467-024-48225-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/29/2023] [Accepted: 04/22/2024] [Indexed: 05/16/2024] Open
Abstract
Protein folds and the local environments they create can be compared using a variety of differently designed measures, such as the root mean squared deviation, the global distance test, the template modeling score or the local distance difference test. Although these measures have proven to be useful for a variety of tasks, each fails to fully incorporate the valuable chemical information inherent to atoms and residues, and considers these only partially and indirectly. Here, we develop the highly flexible local composition Hellinger distance (LoCoHD) metric, which is based on the chemical composition of local residue environments. Using LoCoHD, we analyze the chemical heterogeneity of amino acid environments and identify valines as having the most conserved, and arginines the most variable, chemical environments. We use LoCoHD to investigate structural ensembles, to evaluate critical assessment of structure prediction (CASP) competitors, to compare the results with the local distance difference test (lDDT) scoring system, and to evaluate a molecular dynamics simulation. We show that LoCoHD measurements provide unique information about protein structures that is distinct from, for example, that derived using the alignment-based RMSD metric, or the similarly distance matrix-based but alignment-free lDDT metric.
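The Hellinger distance that gives the metric its name can be written out for two discrete compositions. The sketch below is only the textbook Hellinger distance applied to residue-type counts; the way LoCoHD defines its environments, weights them, or aggregates over distances is not reproduced here, and the function name and the example counts are assumptions.

```python
import math

def hellinger(counts_p, counts_q):
    """Hellinger distance between two compositions given as {type: count} dicts.

    H(P, Q) = sqrt(0.5 * sum_i (sqrt(p_i) - sqrt(q_i))**2), ranging from 0 to 1.
    """
    total_p = sum(counts_p.values())
    total_q = sum(counts_q.values())
    if total_p == 0 or total_q == 0:
        raise ValueError("each composition needs at least one count")
    acc = 0.0
    for key in set(counts_p) | set(counts_q):
        p = counts_p.get(key, 0) / total_p  # normalise counts to probabilities
        q = counts_q.get(key, 0) / total_q
        acc += (math.sqrt(p) - math.sqrt(q)) ** 2
    return math.sqrt(acc / 2.0)

# Hypothetical neighbour counts around two residues.
env_a = {"VAL": 3, "LEU": 2, "ARG": 1}
env_b = {"VAL": 1, "ASP": 4, "ARG": 1}
d = hellinger(env_a, env_b)
```

Identical compositions give 0 and compositions with no shared types give 1, so the value is directly comparable across environments of different sizes.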
Affiliation(s)
- Zsolt Fazekas
  - Laboratory of Structural Chemistry and Biology, Institute of Chemistry, ELTE Eötvös Loránd University, Budapest, Hungary
  - ELTE Hevesy György PhD School of Chemistry, ELTE Eötvös Loránd University, Budapest, Hungary
- Dóra K Menyhárd
  - Laboratory of Structural Chemistry and Biology, Institute of Chemistry, ELTE Eötvös Loránd University, Budapest, Hungary
  - HUN-REN-ELTE Protein Modeling Research Group, ELTE Eötvös Loránd University, Budapest, Hungary
- András Perczel
  - Laboratory of Structural Chemistry and Biology, Institute of Chemistry, ELTE Eötvös Loránd University, Budapest, Hungary
  - HUN-REN-ELTE Protein Modeling Research Group, ELTE Eötvös Loránd University, Budapest, Hungary
13
Liu W, Wang Z, You R, Xie C, Wei H, Xiong Y, Yang J, Zhu S. PLMSearch: Protein language model powers accurate and fast sequence search for remote homology. Nat Commun 2024; 15:2775. [PMID: 38555371 PMCID: PMC10981738 DOI: 10.1038/s41467-024-46808-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/28/2023] [Accepted: 03/08/2024] [Indexed: 04/02/2024] Open
Abstract
Homologous protein search is one of the most commonly used methods for protein annotation and analysis. Compared to structure search, detecting distant evolutionary relationships from sequences alone remains challenging. Here we propose PLMSearch (Protein Language Model), a homologous protein search method with only sequences as input. PLMSearch uses deep representations from a pre-trained protein language model and trains its similarity prediction model on a large amount of real structural similarity data. This enables PLMSearch to capture the remote homology information concealed behind the sequences. Extensive experimental results show that PLMSearch can search millions of query-target protein pairs in seconds like MMseqs2 while increasing the sensitivity by more than threefold, and is comparable to state-of-the-art structure search methods. In particular, unlike traditional sequence search methods, PLMSearch can recall most remote homology pairs with dissimilar sequences but similar structures. PLMSearch is freely available at https://dmiip.sjtu.edu.cn/PLMSearch.
Affiliation(s)
- Wei Liu
  - Institute of Science and Technology for Brain-Inspired Intelligence and MOE Frontiers Center for Brain Science, Fudan University, 200433, Shanghai, China
- Ziye Wang
  - Institute of Science and Technology for Brain-Inspired Intelligence and MOE Frontiers Center for Brain Science, Fudan University, 200433, Shanghai, China
- Ronghui You
  - Institute of Science and Technology for Brain-Inspired Intelligence and MOE Frontiers Center for Brain Science, Fudan University, 200433, Shanghai, China
- Chenghan Xie
  - School of Mathematical Sciences, Fudan University, 200433, Shanghai, China
- Hong Wei
  - School of Mathematical Sciences, Nankai University, 300071, Tianjin, China
- Yi Xiong
  - Department of Bioinformatics and Biostatistics, Shanghai Jiao Tong University, 200240, Shanghai, China
- Jianyi Yang
  - Ministry of Education Frontiers Science Center for Nonlinear Expectations, Research Center for Mathematics and Interdisciplinary Science, Shandong University, 266237, Qingdao, China
- Shanfeng Zhu
  - Institute of Science and Technology for Brain-Inspired Intelligence and MOE Frontiers Center for Brain Science, Fudan University, 200433, Shanghai, China
  - Shanghai Qi Zhi Institute, Shanghai, China
  - Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, Shanghai, China
  - Shanghai Key Lab of Intelligent Information Processing and Shanghai Institute of Artificial Intelligence Algorithm, Fudan University, Shanghai, China
  - Zhangjiang Fudan International Innovation Center, Shanghai, China
14
Harihar B, Saravanan KM, Gromiha MM, Selvaraj S. Importance of Inter-residue Contacts for Understanding Protein Folding and Unfolding Rates, Remote Homology, and Drug Design. Mol Biotechnol 2024:10.1007/s12033-024-01119-4. [PMID: 38498284 DOI: 10.1007/s12033-024-01119-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/16/2023] [Accepted: 02/10/2024] [Indexed: 03/20/2024]
Abstract
Inter-residue interactions in protein structures provide valuable insights into protein folding and stability. Understanding these interactions can be helpful in many crucial applications, including rational design of therapeutic small molecules and biologics, locating functional protein sites, and predicting protein-protein and protein-ligand interactions. The process of developing machine learning models incorporating inter-residue interactions has been improved recently. This review highlights the theoretical models incorporating inter-residue interactions in predicting folding and unfolding rates of proteins. Utilizing contact maps to depict inter-residue interactions aids researchers in developing computer models for detecting remote homologs and interface residues within protein-protein complexes which, in turn, enhances our knowledge of the relationship between sequence and structure of proteins. Further, the application of contact maps derived from inter-residue interactions is highlighted in the field of drug discovery. Overall, this review presents an extensive assessment of the significant models that use inter-residue interactions to investigate folding rates, unfolding rates, remote homology, and drug development, providing potential future advancements in constructing efficient computational models in structural biology.
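A contact map of the kind mentioned in this abstract can be sketched as a binary matrix over residue pairs. The example assumes one representative coordinate per residue and an 8 Å distance cutoff; both are illustrative assumptions rather than choices taken from the review, and `contact_map` is a hypothetical helper.

```python
import math

def contact_map(coords, cutoff=8.0, min_separation=1):
    """Return a symmetric 0/1 matrix marking residue pairs within `cutoff`.

    coords: list of (x, y, z) tuples, one per residue (assumed representation).
    min_separation: minimum sequence separation for a pair to be considered.
    """
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                cmap[i][j] = cmap[j][i] = 1
    return cmap

# Three hypothetical residues placed along the x axis (distances in angstroms).
cmap = contact_map([(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (20.0, 0.0, 0.0)])
```

Raising `min_separation` filters out trivially close sequence neighbours, which is a common way to focus on medium- and long-range contacts.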
Affiliation(s)
- Balasubramanian Harihar
  - Department of Bioinformatics, School of Life Sciences, Bharathidasan University, Tiruchirappalli, Tamil Nadu, 620024, India
  - Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, Indian Institute of Technology Madras, Chennai, Tamil Nadu, 600036, India
- Konda Mani Saravanan
  - Department of Bioinformatics, School of Life Sciences, Bharathidasan University, Tiruchirappalli, Tamil Nadu, 620024, India
  - Department of Biotechnology, Bharath Institute of Higher Education and Research, Chennai, Tamil Nadu, 600073, India
- Michael M Gromiha
  - Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, Indian Institute of Technology Madras, Chennai, Tamil Nadu, 600036, India
- Samuel Selvaraj
  - Department of Bioinformatics, School of Life Sciences, Bharathidasan University, Tiruchirappalli, Tamil Nadu, 620024, India
15
Taujale R, Gravel N, Zhou Z, Yeung W, Kochut K, Kannan N. Informatic challenges and advances in illuminating the druggable proteome. Drug Discov Today 2024; 29:103894. [PMID: 38266979 DOI: 10.1016/j.drudis.2024.103894] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/15/2023] [Revised: 01/08/2024] [Accepted: 01/17/2024] [Indexed: 01/26/2024]
Abstract
The understudied members of the druggable proteomes offer promising prospects for drug discovery efforts. While large-scale initiatives have generated valuable functional information on understudied members of the druggable gene families, translating this information into actionable knowledge for drug discovery requires specialized informatics tools and resources. Here, we review the unique informatics challenges and advances in annotating understudied members of the druggable proteome. We demonstrate the application of statistical evolutionary inference tools, knowledge graph mining approaches, and protein language models in illuminating understudied protein kinases, pseudokinases, and ion channels.
Affiliation(s)
- Rahil Taujale
  - Department of Biochemistry and Molecular Biology, University of Georgia, Athens, GA, USA
- Nathan Gravel
  - Institute of Bioinformatics, University of Georgia, Athens, GA, USA
- Wayland Yeung
  - Institute of Bioinformatics, University of Georgia, Athens, GA, USA
- Krystof Kochut
  - School of Computing, University of Georgia, Athens, GA, USA
- Natarajan Kannan
  - Department of Biochemistry and Molecular Biology, University of Georgia, Athens, GA, USA
  - Institute of Bioinformatics, University of Georgia, Athens, GA, USA
16
Liu GY, Yu D, Fan MM, Zhang X, Jin ZY, Tang C, Liu XF. Antimicrobial resistance crisis: could artificial intelligence be the solution? Mil Med Res 2024; 11:7. [PMID: 38254241 PMCID: PMC10804841 DOI: 10.1186/s40779-024-00510-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/18/2023] [Accepted: 01/08/2024] [Indexed: 01/24/2024] Open
Abstract
Antimicrobial resistance is a global public health threat, and the World Health Organization (WHO) has announced a priority list of the most threatening pathogens against which novel antibiotics need to be developed. The discovery and introduction of novel antibiotics are time-consuming and expensive. According to WHO's report of antibacterial agents in clinical development, only 18 novel antibiotics have been approved since 2014. Therefore, novel antibiotics are critically needed. Artificial intelligence (AI) has been rapidly applied to drug development since its recent technical breakthrough and has dramatically improved the efficiency of the discovery of novel antibiotics. Here, we first summarized recently marketed novel antibiotics, and antibiotic candidates in clinical development. In addition, we systematically reviewed the involvement of AI in antibacterial drug development and utilization, including small molecules, antimicrobial peptides, phage therapy, essential oils, as well as resistance mechanism prediction, and antibiotic stewardship.
Affiliation(s)
- Guang-Yu Liu
  - Department of Immunology and Pathogen Biology, School of Basic Medical Sciences, Hangzhou Normal University, Key Laboratory of Aging and Cancer Biology of Zhejiang Province, Key Laboratory of Inflammation and Immunoregulation of Hangzhou, Hangzhou Normal University, Hangzhou, 311121, China
- Dan Yu
  - National Key Discipline of Pediatrics Key Laboratory of Major Diseases in Children Ministry of Education, Laboratory of Dermatology, Beijing Pediatric Research Institute, Beijing Children's Hospital, Capital Medical University, National Center for Children's Health, Beijing, 100045, China
- Mei-Mei Fan
  - Department of Immunology and Pathogen Biology, School of Basic Medical Sciences, Hangzhou Normal University, Key Laboratory of Aging and Cancer Biology of Zhejiang Province, Key Laboratory of Inflammation and Immunoregulation of Hangzhou, Hangzhou Normal University, Hangzhou, 311121, China
- Xu Zhang
  - Robert and Arlene Kogod Center on Aging, Mayo Clinic, Rochester, MN, 55905, USA
  - Department of Biochemistry and Molecular Biology, Mayo Clinic, Rochester, MN, 55905, USA
- Ze-Yu Jin
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
- Christoph Tang
  - Sir William Dunn School of Pathology, University of Oxford, Oxford, OX1 3RE, UK
- Xiao-Fen Liu
  - Institute of Antibiotics, Huashan Hospital, Fudan University, Key Laboratory of Clinical Pharmacology of Antibiotics, National Health Commission of the People's Republic of China, National Clinical Research Centre for Aging and Medicine, Huashan Hospital, Fudan University, Shanghai, 200040, China
17
Duan N, Hand E, Pheko M, Sharma S, Emiola A. Structure-guided discovery of anti-CRISPR and anti-phage defense proteins. Nat Commun 2024; 15:649. [PMID: 38245560 PMCID: PMC10799925 DOI: 10.1038/s41467-024-45068-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/24/2023] [Accepted: 01/12/2024] [Indexed: 01/22/2024] Open
Abstract
Bacteria use a variety of defense systems to protect themselves from phage infection. In turn, phages have evolved diverse counter-defense measures to overcome host defenses. Here, we use protein structural similarity and gene co-occurrence analyses to screen >66 million viral protein sequences and >330,000 metagenome-assembled genomes for the identification of anti-phage and counter-defense systems. We predict structures for ~300,000 proteins and perform large-scale, pairwise comparison to known anti-CRISPR (Acr) and anti-phage proteins to identify structural homologs that otherwise may not be uncovered using primary sequence search. This way, we identify a Bacteroidota phage Acr protein that inhibits Cas12a, and an Akkermansia muciniphila anti-phage defense protein, termed BxaP. Gene bxaP is found in loci encoding Bacteriophage Exclusion (BREX) and restriction-modification defense systems, but confers immunity independently. Our work highlights the advantage of combining protein structural features and gene co-localization information in studying host-phage interactions.
Affiliation(s)
- Ning Duan
  - Microbial Therapeutics Unit, National Institute of Dental and Craniofacial Research, National Institutes of Health, Bethesda, MD, USA
- Emily Hand
  - Microbial Therapeutics Unit, National Institute of Dental and Craniofacial Research, National Institutes of Health, Bethesda, MD, USA
- Mannuku Pheko
  - Microbial Therapeutics Unit, National Institute of Dental and Craniofacial Research, National Institutes of Health, Bethesda, MD, USA
- Shikha Sharma
  - Microbial Therapeutics Unit, National Institute of Dental and Craniofacial Research, National Institutes of Health, Bethesda, MD, USA
- Akintunde Emiola
  - Microbial Therapeutics Unit, National Institute of Dental and Craniofacial Research, National Institutes of Health, Bethesda, MD, USA
18
Guo C, Wu JY. Pathogen Discovery in the Post-COVID Era. Pathogens 2024; 13:51. [PMID: 38251358 PMCID: PMC10821006 DOI: 10.3390/pathogens13010051] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/11/2023] [Revised: 12/22/2023] [Accepted: 01/03/2024] [Indexed: 01/23/2024] Open
Abstract
Pathogen discovery plays a crucial role in the fields of infectious diseases, clinical microbiology, and public health. During the past four years, the global response to the COVID-19 pandemic highlighted the importance of early and accurate identification of novel pathogens for effective management and prevention of outbreaks. The post-COVID era has ushered in a new phase of infectious disease research, marked by accelerated advancements in pathogen discovery. This review encapsulates the recent innovations and paradigm shifts that have reshaped the landscape of pathogen discovery in response to the COVID-19 pandemic. Primarily, we summarize the latest technology innovations, applications, and causation proving strategies that enable rapid and accurate pathogen discovery for both acute and historical infections. We also explored the significance and the latest trends and approaches being employed for effective implementation of pathogen discovery from various clinical and environmental samples. Furthermore, we emphasize the collaborative nature of the pandemic response, which has led to the establishment of global networks for pathogen discovery.
Affiliation(s)
- Cheng Guo
  - Center for Infection and Immunity, Mailman School of Public Health, Columbia University, New York, NY 10032, USA
- Jian-Yong Wu
  - School of Public Health, Xinjiang Medical University, Urumqi 830017, China
19
Pantolini L, Studer G, Pereira J, Durairaj J, Tauriello G, Schwede T. Embedding-based alignment: combining protein language models with dynamic programming alignment to detect structural similarities in the twilight-zone. Bioinformatics 2024; 40:btad786. [PMID: 38175775 PMCID: PMC10792726 DOI: 10.1093/bioinformatics/btad786] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 07/10/2023] [Revised: 10/27/2023] [Accepted: 12/29/2023] [Indexed: 01/06/2024] Open
Abstract
MOTIVATION: Language models are routinely used for text classification and generative tasks. Recently, the same architectures were applied to protein sequences, unlocking powerful new approaches in the bioinformatics field. Protein language models (pLMs) generate high-dimensional embeddings on a per-residue level and encode a "semantic meaning" of each individual amino acid in the context of the full protein sequence. These representations have been used as a starting point for downstream learning tasks and, more recently, for identifying distant homologous relationships between proteins. RESULTS: In this work, we introduce a new method that generates embedding-based protein sequence alignments (EBA) and show how these capture structural similarities even in the twilight zone, outperforming both classical methods as well as other approaches based on pLMs. The method shows excellent accuracy despite the absence of training and parameter optimization. We demonstrate that the combination of pLMs with alignment methods is a valuable approach for the detection of relationships between proteins in the twilight zone. AVAILABILITY AND IMPLEMENTATION: The code to run EBA and reproduce the analysis described in this article is available at: https://git.scicore.unibas.ch/schwede/EBA and https://git.scicore.unibas.ch/schwede/eba_benchmark.
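The general idea of pairing per-residue embeddings with dynamic programming can be sketched as follows. This is not the EBA code: it is a plain Needleman-Wunsch global alignment in which the substitution score is the cosine similarity of two residue embeddings, and the gap penalty, the toy embeddings and the function names are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def align_score(emb_a, emb_b, gap=-0.5):
    """Global alignment score over two lists of per-residue embeddings."""
    n, m = len(emb_a), len(emb_b)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap  # leading gaps in the second sequence
    for j in range(1, m + 1):
        dp[0][j] = j * gap  # leading gaps in the first sequence
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = dp[i - 1][j - 1] + cosine(emb_a[i - 1], emb_b[j - 1])
            dp[i][j] = max(match, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[n][m]

# Toy two-dimensional "embeddings" for two short sequences (assumption).
seq_a = [[1.0, 0.0], [0.0, 1.0]]
seq_b = [[1.0, 0.0]]
score_self = align_score(seq_a, seq_a)
score_ab = align_score(seq_a, seq_b)
```

Replacing a fixed substitution matrix with an embedding similarity is what lets the alignment reward residues that play similar roles even when their amino acid identities differ.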
Affiliation(s)
- Lorenzo Pantolini
  - Biozentrum, University of Basel, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Basel 4056, Switzerland
- Gabriel Studer
  - Biozentrum, University of Basel, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Basel 4056, Switzerland
- Joana Pereira
  - Biozentrum, University of Basel, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Basel 4056, Switzerland
- Janani Durairaj
  - Biozentrum, University of Basel, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Basel 4056, Switzerland
- Gerardo Tauriello
  - Biozentrum, University of Basel, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Basel 4056, Switzerland
- Torsten Schwede
  - Biozentrum, University of Basel, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Basel 4056, Switzerland
20
Bosch JA, Keith N, Escobedo F, Fisher WW, LaGraff JT, Rabasco J, Wan KH, Weiszmann R, Hu Y, Kondo S, Brown JB, Perrimon N, Celniker SE. Molecular and functional characterization of the Drosophila melanogaster conserved smORFome. Cell Rep 2023; 42:113311. [PMID: 37889754 PMCID: PMC10843857 DOI: 10.1016/j.celrep.2023.113311] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/02/2022] [Revised: 08/24/2023] [Accepted: 10/04/2023] [Indexed: 10/29/2023] Open
Abstract
Short polypeptides encoded by small open reading frames (smORFs) are ubiquitously found in eukaryotic genomes and are important regulators of physiology, development, and mitochondrial processes. Here, we focus on a subset of 298 smORFs that are evolutionarily conserved between Drosophila melanogaster and humans. Many of these smORFs are conserved broadly in the bilaterian lineage, and ∼182 are conserved in plants. We observe remarkably heterogeneous spatial and temporal expression patterns of smORF transcripts, indicating widespread tissue-specific and stage-specific mitochondrial architectures. In addition, an analysis of annotated functional domains reveals a predicted enrichment of smORF polypeptides localizing to mitochondria. We conduct an embryonic ribosome profiling experiment and find support for translation of 137 of these smORFs during embryogenesis. We further embark on functional characterization using CRISPR knockout/activation, RNAi knockdown, and cDNA overexpression, revealing diverse phenotypes. This study underscores the importance of identifying smORF function in disease and phenotypic diversity.
Affiliation(s)
- Justin A Bosch
  - Department of Genetics, Blavatnik Institute, Harvard Medical School, Boston, MA 02115, USA
- Nathan Keith
  - Division of Biological Systems and Engineering, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
  - Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Felipe Escobedo
  - Department of Genetics, Blavatnik Institute, Harvard Medical School, Boston, MA 02115, USA
- William W Fisher
  - Division of Biological Systems and Engineering, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- James Thai LaGraff
  - Department of Genetics, Blavatnik Institute, Harvard Medical School, Boston, MA 02115, USA
- Jorden Rabasco
  - Department of Genetics, Blavatnik Institute, Harvard Medical School, Boston, MA 02115, USA
- Kenneth H Wan
  - Division of Biological Systems and Engineering, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Richard Weiszmann
  - Division of Biological Systems and Engineering, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Yanhui Hu
  - Department of Genetics, Blavatnik Institute, Harvard Medical School, Boston, MA 02115, USA
- Shu Kondo
  - Laboratory of Invertebrate Genetics, National Institute of Genetics, Mishima, Shizuoka 411-8540, Japan
- James B Brown
  - Division of Biological Systems and Engineering, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
  - Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Norbert Perrimon
  - Department of Genetics, Blavatnik Institute, Harvard Medical School, Boston, MA 02115, USA
  - Howard Hughes Medical Institute, Harvard Medical School, Boston, MA 02115, USA
- Susan E Celniker
  - Division of Biological Systems and Engineering, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA