1
Panda P, Giri SJ, Sherman LA, Kihara D, Aryal UK. Proteomic changes orchestrate metabolic acclimation of a unicellular diazotrophic cyanobacterium during light-dark cycle and nitrogen fixation states. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.07.30.605809. [PMID: 39131303 PMCID: PMC11312527 DOI: 10.1101/2024.07.30.605809] [Indexed: 08/13/2024]
Abstract
Cyanobacteria have developed an impressive array of proteins and pathways, each tailored for specific metabolic attributes, to execute photosynthesis and biological nitrogen (N2)-fixation. An understanding of these biologically incompatible processes provides important insights into how they can be optimized for renewable energy. To expand upon our current knowledge, we performed label-free quantitative proteomic analysis of the unicellular diazotrophic cyanobacterium Crocosphaera subtropica ATCC 51142 grown with and without nitrate under 12-hour light-dark cycles. Results showed significant shifts in metabolic activities, including photosynthesis, respiration, biological nitrogen fixation (BNF), and proteostasis, in response to different growth conditions. We identified 14 nitrogenase enzymes, which were among the most highly expressed proteins in the dark under nitrogen-fixing conditions, emphasizing their importance in BNF. Nitrogenase enzymes were not expressed under non-nitrogen-fixing conditions, suggesting a regulatory mechanism based on nitrogen availability. The synthesis of key respiratory enzymes and uptake hydrogenase (HupSL) synchronized with the synthesis of nitrogenase, indicating coordinated regulation of processes involved in energy production and BNF. The data suggest that cells utilize alternative pathways, such as the oxidative pentose phosphate (OPP) and 2-oxoglutarate (2-OG) pathways, to produce ATP and support the bioenergetics of BNF. The data also indicate an important role for uptake hydrogenase in removing O2 to support BNF. Overall, this study expands upon our knowledge regarding molecular responses of Crocosphaera 51142 to nitrogen and light-dark phases, shedding light on potential applications and optimization for renewable energy.
Affiliation(s)
- Punyatoya Panda
- Department of Comparative Pathobiology, Purdue University, West Lafayette, IN 47907
- Swagarika J Giri
- Department of Computer Science, Purdue University, West Lafayette, IN 47907
- Louis A Sherman
- Department of Biological Sciences, Purdue University, West Lafayette, IN 47907
- Daisuke Kihara
- Department of Computer Science, Purdue University, West Lafayette, IN 47907
- Department of Biological Sciences, Purdue University, West Lafayette, IN 47907
- Uma K Aryal
- Department of Comparative Pathobiology, Purdue University, West Lafayette, IN 47907
- Purdue Proteomics Facility, Bindley Bioscience Center, Purdue University, West Lafayette, IN 47907
2
Panda P, Giri SJ, Sherman L, Kihara D, Aryal UK. Proteomic analysis of unicellular cyanobacterium Crocosphaera subtropica ATCC 51142 under extended light or dark growth. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.07.29.605499. [PMID: 39131394 PMCID: PMC11312443 DOI: 10.1101/2024.07.29.605499] [Indexed: 08/13/2024]
Abstract
The daily light-dark cycle is a recurrent and predictable environmental phenomenon to which many organisms, including cyanobacteria, have evolved to adapt. Understanding how cyanobacteria alter their metabolic attributes in response to subjective light or dark growth may provide key features for developing strains with improved photosynthetic efficiency and applications in enhanced carbon sequestration and renewable energy. Here, we undertook a label-free proteomic approach to investigate the effect of extended light (LL) or extended dark (DD) conditions on the unicellular cyanobacterium Crocosphaera subtropica ATCC 51142. We quantified 2287 proteins, of which 603 differed significantly between the two growth conditions. These proteins represent several biological processes, including photosynthetic electron transport, carbon fixation, stress responses, translation, and protein degradation. One significant observation is the regulation of over two dozen proteases, including ATP-dependent Clp proteases (endopeptidases) and metalloproteases, the majority of which were upregulated in LL compared to DD. This suggests that proteases play a crucial role in the regulation and maintenance of photosynthesis, especially the PSI and PSII components. The higher protease activity in LL indicates a need for more frequent degradation and repair of certain photosynthetic components, highlighting the dynamic nature of protein turnover and quality control mechanisms in response to prolonged light exposure. The results enhance our understanding of how Crocosphaera subtropica ATCC 51142 adjusts its molecular machinery in response to extended light or dark growth conditions.
Affiliation(s)
- Punyatoya Panda
- Department of Comparative Pathobiology, Purdue University, West Lafayette, IN 47907
- Swagarika J. Giri
- Department of Computer Science, Purdue University, West Lafayette, IN 47907
- Louis Sherman
- Department of Biological Sciences, Purdue University, West Lafayette, IN 47907
- Daisuke Kihara
- Department of Computer Science, Purdue University, West Lafayette, IN 47907
- Department of Biological Sciences, Purdue University, West Lafayette, IN 47907
- Uma K. Aryal
- Department of Comparative Pathobiology, Purdue University, West Lafayette, IN 47907
- Purdue Proteomics Facility, Bindley Bioscience Center, Purdue University, West Lafayette, IN 47907
3
Ulusoy E, Doğan T. Mutual annotation-based prediction of protein domain functions with Domain2GO. Protein Sci 2024; 33:e4988. [PMID: 38757367 PMCID: PMC11099699 DOI: 10.1002/pro.4988] [Received: 11/07/2023] [Revised: 02/25/2024] [Accepted: 03/30/2024] [Indexed: 05/18/2024]
Abstract
Identifying unknown functional properties of proteins is essential for understanding their roles in both health and disease states. The domain composition of a protein can reveal critical information in this context, as domains are structural and functional units that dictate how the protein should act at the molecular level. The expensive and time-consuming nature of wet-lab experimental approaches prompted researchers to develop computational strategies for predicting the functions of proteins. In this study, we proposed a new method called Domain2GO that infers associations between protein domains and function-defining gene ontology (GO) terms, thus redefining the problem as domain function prediction. Domain2GO uses documented protein-level GO annotations together with proteins' domain annotations. Co-annotation patterns of domains and GO terms in the same proteins are examined using statistical resampling to obtain reliable associations. As a use-case study, we evaluated the biological relevance of examples selected from the Domain2GO-generated domain-GO term mappings via literature review. Then, we applied Domain2GO to predict unknown protein functions by propagating domain-associated GO terms to proteins annotated with these domains. For function prediction performance evaluation and comparison against other methods, we employed Critical Assessment of Function Annotation 3 (CAFA3) challenge datasets. The results demonstrated the high potential of Domain2GO, particularly for predicting molecular function and biological process terms, along with advantages such as producing interpretable results and having an exceptionally low computational cost. The approach presented here can be extended to other ontologies and biological entities to investigate unknown relationships in complex and large-scale biological data. The source code, datasets, results, and user instructions for Domain2GO are available at https://github.com/HUBioDataLab/Domain2GO. 
Additionally, we offer a user-friendly online tool at https://huggingface.co/spaces/HUBioDataLab/Domain2GO, which simplifies the prediction of functions of previously unannotated proteins solely using amino acid sequences.
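The co-annotation idea in this abstract can be illustrated with a small sketch. The abstract only says that "statistical resampling" is used, so the permutation test below is an assumption chosen for illustration and is not necessarily the resampling scheme Domain2GO implements. All function names and the toy identifiers (`P1`, `D1`, `GO:A`, and so on) are hypothetical.

```python
import random
from collections import Counter


def coannotation_counts(domains_by_protein, go_by_protein):
    """Count, for each (domain, GO term) pair, the proteins annotated with both."""
    counts = Counter()
    for protein, domains in domains_by_protein.items():
        for domain in domains:
            for go_term in go_by_protein.get(protein, ()):
                counts[(domain, go_term)] += 1
    return counts


def permutation_p_values(domains_by_protein, go_by_protein, n_perm=1000, seed=0):
    """Empirical p-value per observed pair (assumed scheme, not the published one).

    GO annotation sets are shuffled across proteins; a pair's p-value is the
    fraction of shuffles whose co-annotation count reaches the observed count.
    """
    rng = random.Random(seed)
    proteins = sorted(domains_by_protein)
    go_sets = [go_by_protein.get(p, set()) for p in proteins]
    observed = coannotation_counts(domains_by_protein, go_by_protein)
    exceed = Counter()
    for _ in range(n_perm):
        shuffled = go_sets[:]
        rng.shuffle(shuffled)
        null = coannotation_counts(domains_by_protein, dict(zip(proteins, shuffled)))
        for pair, obs in observed.items():
            if null[pair] >= obs:
                exceed[pair] += 1
    # Add-one correction keeps every p-value strictly positive.
    return {pair: (exceed[pair] + 1) / (n_perm + 1) for pair in observed}


def propagate_go_terms(domains_by_protein, associations):
    """Give each protein the GO terms associated with any of its domains."""
    go_by_domain = {}
    for domain, go_term in associations:
        go_by_domain.setdefault(domain, set()).add(go_term)
    return {
        protein: set().union(*(go_by_domain.get(d, set()) for d in domains))
        for protein, domains in domains_by_protein.items()
    }


if __name__ == "__main__":
    # Toy data, purely illustrative.
    domains = {"P1": {"D1"}, "P2": {"D1"}, "P3": {"D2"}, "P4": {"D2"}}
    go = {"P1": {"GO:A"}, "P2": {"GO:A"}, "P3": {"GO:B"}, "P4": {"GO:B"}}
    p_values = permutation_p_values(domains, go, n_perm=200)
    kept = {pair for pair, p in p_values.items() if p < 0.5}  # arbitrary cutoff
    print(propagate_go_terms({"Q1": {"D1"}}, kept))
```

With only four toy proteins the p-values are coarse; the point is the shape of the computation (count co-annotations, compare against a resampled null, then propagate the retained domain-GO pairs to proteins carrying those domains), not the particular threshold.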
Affiliation(s)
- Erva Ulusoy
- Biological Data Science Lab, Department of Computer Engineering, Hacettepe University, Ankara, Turkey
- Department of Bioinformatics, Graduate School of Health Sciences, Hacettepe University, Ankara, Turkey
- Tunca Doğan
- Biological Data Science Lab, Department of Computer Engineering, Hacettepe University, Ankara, Turkey
- Department of Bioinformatics, Graduate School of Health Sciences, Hacettepe University, Ankara, Turkey
4
Giri SJ, Ibtehaz N, Kihara D. GO2Sum: generating human-readable functional summary of proteins from GO terms. NPJ Syst Biol Appl 2024; 10:29. [PMID: 38491038 PMCID: PMC10943200 DOI: 10.1038/s41540-024-00358-0] [Received: 12/14/2023] [Accepted: 03/05/2024] [Indexed: 03/18/2024]
Abstract
Understanding the biological functions of proteins is of fundamental importance in modern biology. Gene Ontology (GO), a controlled vocabulary, is frequently used to represent protein function because it is easily handled by computer programs and avoids open-ended text interpretation. In particular, the majority of current protein function prediction methods rely on GO terms. However, the extensive list of GO terms that describe a protein function can pose challenges for biologists when it comes to interpretation. In response to this issue, we developed GO2Sum (Gene Ontology terms Summarizer), a model that takes a set of GO terms as input and generates a human-readable summary using the T5 large language model. GO2Sum was developed by fine-tuning T5 on GO term assignments and free-text function descriptions for UniProt entries, enabling it to recreate function descriptions by concatenating GO term descriptions. Our results demonstrated that GO2Sum significantly outperforms the original T5 model that was trained on the entire web corpus in generating Function, Subunit Structure, and Pathway paragraphs for UniProt entries.
Affiliation(s)
- Nabil Ibtehaz
- Department of Computer Science, Purdue University, West Lafayette, IN, USA
- Daisuke Kihara
- Department of Computer Science, Purdue University, West Lafayette, IN, USA
- Department of Biological Sciences, Purdue University, West Lafayette, IN, USA
5
Bukhman YV, Morin PA, Meyer S, Chu LF, Jacobsen JK, Antosiewicz-Bourget J, Mamott D, Gonzales M, Argus C, Bolin J, Berres ME, Fedrigo O, Steill J, Swanson SA, Jiang P, Rhie A, Formenti G, Phillippy AM, Harris RS, Wood JMD, Howe K, Kirilenko BM, Munegowda C, Hiller M, Jain A, Kihara D, Johnston JS, Ionkov A, Raja K, Toh H, Lang A, Wolf M, Jarvis ED, Thomson JA, Chaisson MJP, Stewart R. A High-Quality Blue Whale Genome, Segmental Duplications, and Historical Demography. Mol Biol Evol 2024; 41:msae036. [PMID: 38376487 PMCID: PMC10919930 DOI: 10.1093/molbev/msae036] [Received: 03/08/2023] [Revised: 01/11/2024] [Accepted: 01/22/2024] [Indexed: 02/21/2024]
Abstract
The blue whale, Balaenoptera musculus, is the largest animal known to have ever existed, making it an important case study in longevity and resistance to cancer. To further this and other blue whale-related research, we report a reference-quality, long-read-based genome assembly of this fascinating species. We assembled the genome from PacBio long reads and utilized Illumina/10×, optical maps, and Hi-C data for scaffolding, polishing, and manual curation. We also provided long-read RNA-seq data to facilitate the annotation of the assembly by NCBI and Ensembl. Additionally, we annotated both haplotypes using TOGA and measured the genome size by flow cytometry. We then compared the blue whale genome with those of other cetaceans and artiodactyls, including the vaquita (Phocoena sinus), the world's smallest cetacean, to investigate the blue whale's unique biological traits. We found a dramatic amplification of several genes in the blue whale genome resulting from a recent burst in segmental duplications, though the possible connection between this amplification and giant body size requires further study. We also discovered sites in the insulin-like growth factor-1 gene correlated with body size in cetaceans. Finally, using our assembly to examine the heterozygosity and historical demography of Pacific and Atlantic blue whale populations, we found that the genomes of both populations are highly heterozygous and that their genetic isolation dates to the last interglacial period. Taken together, these results indicate how a high-quality, annotated blue whale genome will serve as an important resource for biology, evolution, and conservation research.
Affiliation(s)
- Yury V Bukhman
- Regenerative Biology, Morgridge Institute for Research, Madison, WI 53715, USA
- Phillip A Morin
- Southwest Fisheries Science Center, National Oceanic and Atmospheric Administration (NOAA), La Jolla, CA 92037, USA
- Susanne Meyer
- Neuroscience Research Institute, University of California, Santa Barbara, CA, USA
- Li-Fang Chu
- Regenerative Biology, Morgridge Institute for Research, Madison, WI 53715, USA
- Department of Comparative Biology and Experimental Medicine, University of Calgary, Calgary, Canada
- Daniel Mamott
- Regenerative Biology, Morgridge Institute for Research, Madison, WI 53715, USA
- Maylie Gonzales
- Neuroscience Research Institute, University of California, Santa Barbara, CA, USA
- Cara Argus
- Regenerative Biology, Morgridge Institute for Research, Madison, WI 53715, USA
- Jennifer Bolin
- Regenerative Biology, Morgridge Institute for Research, Madison, WI 53715, USA
- Mark E Berres
- University of Wisconsin Biotechnology Center, Bioinformatics Resource Center, University of Wisconsin - Madison, Madison, WI 53706, USA
- Olivier Fedrigo
- Vertebrate Genome Lab, The Rockefeller University, New York, NY 10065, USA
- John Steill
- Regenerative Biology, Morgridge Institute for Research, Madison, WI 53715, USA
- Scott A Swanson
- Regenerative Biology, Morgridge Institute for Research, Madison, WI 53715, USA
- Peng Jiang
- Center for Gene Regulation in Health and Disease (GRHD), Cleveland State University, Cleveland, OH, USA
- Department of Biological, Geological and Environmental Sciences, Cleveland State University, Cleveland, OH, USA
- Center for RNA Science and Therapeutics, School of Medicine, Case Western Reserve University, Cleveland, OH, USA
- Arang Rhie
- Genome Informatics Section, National Human Genome Research Institute, Bethesda, MD 20892, USA
- Giulio Formenti
- Laboratory of Neurogenetics of Language, The Rockefeller University/HHMI, New York, NY 10065, USA
- Adam M Phillippy
- Genome Informatics Section, National Human Genome Research Institute, Bethesda, MD 20892, USA
- Robert S Harris
- Department of Biology, Pennsylvania State University, University Park, PA 16802, USA
- Kerstin Howe
- Tree of Life, Wellcome Sanger Institute, Cambridge CB10 1SA, UK
- Bogdan M Kirilenko
- LOEWE Centre for Translational Biodiversity Genomics, 60325 Frankfurt, Germany
- Senckenberg Research Institute, 60325 Frankfurt, Germany
- Institute of Cell Biology and Neuroscience, Faculty of Biosciences, Goethe University Frankfurt, 60438 Frankfurt, Germany
- Chetan Munegowda
- LOEWE Centre for Translational Biodiversity Genomics, 60325 Frankfurt, Germany
- Senckenberg Research Institute, 60325 Frankfurt, Germany
- Institute of Cell Biology and Neuroscience, Faculty of Biosciences, Goethe University Frankfurt, 60438 Frankfurt, Germany
- Michael Hiller
- LOEWE Centre for Translational Biodiversity Genomics, 60325 Frankfurt, Germany
- Senckenberg Research Institute, 60325 Frankfurt, Germany
- Institute of Cell Biology and Neuroscience, Faculty of Biosciences, Goethe University Frankfurt, 60438 Frankfurt, Germany
- Aashish Jain
- Department of Computer Science, Purdue University, West Lafayette, IN 47907, USA
- Daisuke Kihara
- Department of Computer Science, Purdue University, West Lafayette, IN 47907, USA
- Department of Biological Sciences, Purdue University, West Lafayette, IN 47907, USA
- J Spencer Johnston
- Department of Entomology, Texas A&M University, College Station, TX 77843, USA
- Alexander Ionkov
- Regenerative Biology, Morgridge Institute for Research, Madison, WI 53715, USA
- Kalpana Raja
- Regenerative Biology, Morgridge Institute for Research, Madison, WI 53715, USA
- Huishi Toh
- Neuroscience Research Institute, University of California, Santa Barbara, CA, USA
- Aimee Lang
- Southwest Fisheries Science Center, National Oceanic and Atmospheric Administration (NOAA), La Jolla, CA 92037, USA
- Magnus Wolf
- Institute for Evolution and Biodiversity (IEB), University of Muenster, 48149, Muenster, Germany
- Senckenberg Biodiversity and Climate Research Centre (BiK-F), Frankfurt am Main, Germany
- Erich D Jarvis
- Vertebrate Genome Lab, The Rockefeller University, New York, NY 10065, USA
- Laboratory of Neurogenetics of Language, The Rockefeller University/HHMI, New York, NY 10065, USA
- James A Thomson
- Regenerative Biology, Morgridge Institute for Research, Madison, WI 53715, USA
- Department of Molecular, Cellular and Developmental Biology, University of California Santa Barbara, Santa Barbara, CA 93106, USA
- Department of Cell and Regenerative Biology, University of Wisconsin School of Medicine and Public Health, Madison, WI 53726, USA
- Mark J P Chaisson
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, Los Angeles, CA 90089, USA
- Ron Stewart
- Regenerative Biology, Morgridge Institute for Research, Madison, WI 53715, USA
6
Bukhman YV, Meyer S, Chu LF, Abueg L, Antosiewicz-Bourget J, Balacco J, Brecht M, Dinatale E, Fedrigo O, Formenti G, Fungtammasan A, Giri SJ, Hiller M, Howe K, Kihara D, Mamott D, Mountcastle J, Pelan S, Rabbani K, Sims Y, Tracey A, Wood JMD, Jarvis ED, Thomson JA, Chaisson MJP, Stewart R. Chromosome level genome assembly of the Etruscan shrew Suncus etruscus. Sci Data 2024; 11:176. [PMID: 38326333 PMCID: PMC10850158 DOI: 10.1038/s41597-024-03011-x] [Received: 07/28/2023] [Accepted: 01/26/2024] [Indexed: 02/09/2024]
Abstract
Suncus etruscus is one of the world's smallest mammals, with an average body mass of about 2 grams. The Etruscan shrew's small body is accompanied by a very high energy demand and numerous metabolic adaptations. Here we report a chromosome-level genome assembly using PacBio long-read sequencing, 10X Genomics linked short reads, optical mapping, and Hi-C linked reads. The assembly is partially phased, with a 2.472 Gbp primary pseudohaplotype and a 1.515 Gbp alternate. We manually curated the primary assembly and identified 22 chromosomes, including X and Y sex chromosomes. The NCBI genome annotation pipeline identified 39,091 genes, 19,819 of them protein-coding. We also identified segmental duplications, inferred GO term annotations, and computed orthologs of human and mouse genes. This reference-quality genome will be an important resource for research on mammalian development, metabolism, and body size control.
Affiliation(s)
- Yury V Bukhman
- Regenerative Biology, Morgridge Institute for Research, 330 N. Orchard St., Madison, WI, 53715, USA
- Susanne Meyer
- Neuroscience Research Institute, University of California - Santa Barbara, 494 UCEN Rd, Isla Vista, CA, 93117, USA
- Li-Fang Chu
- Department of Comparative Biology and Experimental Medicine, University of Calgary, 2500 University Drive NW, Calgary, Alberta, T2N 1N4, Canada
- Linelle Abueg
- Vertebrate Genome Lab, The Rockefeller University, 1230 York Avenue, New York, NY, 10065, USA
- Jennifer Balacco
- Vertebrate Genome Lab, The Rockefeller University, 1230 York Avenue, New York, NY, 10065, USA
- Michael Brecht
- BCCN/Humboldt University Berlin, Philippstr, 13 House 6, 10115, Berlin, Germany
- Erica Dinatale
- Max Planck Institute for Biology Tübingen, Max-Planck-Ring 5, 72076, Tübingen, Germany
- Olivier Fedrigo
- Vertebrate Genome Lab, The Rockefeller University, 1230 York Avenue, New York, NY, 10065, USA
- Giulio Formenti
- Laboratory of Neurogenetics of Language, The Rockefeller University/HHMI, 1230 York Avenue, New York, NY, 10065, USA
- Swagarika Jaharlal Giri
- Department of Computer Science, Purdue University, 249 S. Martin Jischke Dr, West Lafayette, IN, 47907, USA
- Michael Hiller
- LOEWE Centre for Translational Biodiversity Genomics, Senckenberganlage 25, 60325, Frankfurt, Germany
- Senckenberg Research Institute, Senckenberganlage 25, 60325, Frankfurt, Germany
- Institute of Cell Biology and Neuroscience, Faculty of Biosciences, Goethe University Frankfurt, Max-von-Laue-Str. 9, 60438, Frankfurt, Germany
- Kerstin Howe
- Tree of Life, Wellcome Sanger Institute, Cambridge, CB10 1SA, UK
- Daisuke Kihara
- Department of Computer Science, Purdue University, 249 S. Martin Jischke Dr, West Lafayette, IN, 47907, USA
- Department of Biological Sciences, Purdue University, 249 S. Martin Jischke Dr., West Lafayette, IN, 47907, USA
- Daniel Mamott
- Regenerative Biology, Morgridge Institute for Research, 330 N. Orchard St., Madison, WI, 53715, USA
- Jacquelyn Mountcastle
- Vertebrate Genome Lab, The Rockefeller University, 1230 York Avenue, New York, NY, 10065, USA
- Sarah Pelan
- Tree of Life, Wellcome Sanger Institute, Cambridge, CB10 1SA, UK
- Keon Rabbani
- Department of Quantitative and Computational Biology, University of Southern California, 1050 Childs Way RRI 408, Los Angeles, CA, 90089, USA
- Ying Sims
- Tree of Life, Wellcome Sanger Institute, Cambridge, CB10 1SA, UK
- Alan Tracey
- Tree of Life, Wellcome Sanger Institute, Cambridge, CB10 1SA, UK
- Erich D Jarvis
- Vertebrate Genome Lab, The Rockefeller University, 1230 York Avenue, New York, NY, 10065, USA
- Laboratory of Neurogenetics of Language, The Rockefeller University/HHMI, 1230 York Avenue, New York, NY, 10065, USA
- James A Thomson
- Regenerative Biology, Morgridge Institute for Research, 330 N. Orchard St., Madison, WI, 53715, USA
- Department of Molecular, Cellular and Developmental Biology, University of California Santa Barbara, Santa Barbara, CA, 93106, USA
- Department of Cell and Regenerative Biology, University of Wisconsin School of Medicine and Public Health, Madison, WI, 53726, USA
- Mark J P Chaisson
- Department of Quantitative and Computational Biology, University of Southern California, 1050 Childs Way RRI 408, Los Angeles, CA, 90089, USA
- Ron Stewart
- Regenerative Biology, Morgridge Institute for Research, 330 N. Orchard St., Madison, WI, 53715, USA
7
Giri SJ, Ibtehaz N, Kihara D. GO2Sum: Generating Human Readable Functional Summary of Proteins from GO Terms. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.11.10.566665. [PMID: 38014080 PMCID: PMC10680659 DOI: 10.1101/2023.11.10.566665] [Indexed: 11/29/2023]
Abstract
Understanding the biological functions of proteins is of fundamental importance in modern biology. Gene Ontology (GO), a controlled vocabulary, is frequently used to represent protein function because it is easily handled by computer programs and avoids open-ended text interpretation. In particular, the majority of current protein function prediction methods rely on GO terms. However, the extensive list of GO terms that describe a protein function can pose challenges for biologists when it comes to interpretation. In response to this issue, we developed GO2Sum (Gene Ontology terms Summarizer), a model that takes a set of GO terms as input and generates a human-readable summary using the T5 large language model. GO2Sum was developed by fine-tuning T5 on GO term assignments and free-text function descriptions for UniProt entries, enabling it to recreate function descriptions by concatenating GO term descriptions. Our results demonstrated that GO2Sum significantly outperforms the original T5 model that was trained on the entire web corpus in generating Function, Subunit Structure, and Pathway paragraphs for UniProt entries.
Affiliation(s)
- Nabil Ibtehaz
- Department of Computer Science, Purdue University, West Lafayette, IN, United States
- Daisuke Kihara
- Department of Computer Science, Purdue University, West Lafayette, IN, United States
- Department of Biological Sciences, Purdue University, West Lafayette, IN, United States
8
Ibtehaz N, Kagaya Y, Kihara D. Domain-PFP allows protein function prediction using function-aware domain embedding representations. Commun Biol 2023; 6:1103. [PMID: 37907681 PMCID: PMC10618451 DOI: 10.1038/s42003-023-05476-9] [Received: 05/17/2023] [Accepted: 10/17/2023] [Indexed: 11/02/2023]
Abstract
Domains are functional and structural units of proteins that govern various biological functions performed by the proteins. Therefore, the characterization of domains in a protein can serve as a proper functional representation of proteins. Here, we employ a self-supervised protocol to derive functionally consistent representations for domains by learning domain-Gene Ontology (GO) co-occurrences and associations. The domain embeddings we constructed turned out to be effective in performing actual function prediction tasks. Extensive evaluations showed that protein representations using the domain embeddings are superior to those of large-scale protein language models in GO prediction tasks. Moreover, the new function prediction method built on the domain embeddings, named Domain-PFP, substantially outperformed the state-of-the-art function predictors. Additionally, Domain-PFP demonstrated competitive performance in the CAFA3 evaluation, achieving the best overall performance among the top teams that participated in the assessment.
Affiliation(s)
- Nabil Ibtehaz
- Department of Computer Science, Purdue University, West Lafayette, IN, USA
- Yuki Kagaya
- Department of Biological Sciences, Purdue University, West Lafayette, IN, USA
- Daisuke Kihara
- Department of Computer Science, Purdue University, West Lafayette, IN, USA
- Department of Biological Sciences, Purdue University, West Lafayette, IN, USA
9
Ibtehaz N, Kagaya Y, Kihara D. Domain-PFP: Protein Function Prediction Using Function-Aware Domain Embedding Representations. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.08.23.554486. [PMID: 37662252 PMCID: PMC10473699 DOI: 10.1101/2023.08.23.554486] [Indexed: 09/05/2023]
Abstract
Domains are functional and structural units of proteins that govern various biological functions performed by the proteins. Therefore, the characterization of domains in a protein can serve as a proper functional representation of proteins. Here, we employ a self-supervised protocol to derive functionally consistent representations for domains by learning domain-Gene Ontology (GO) co-occurrences and associations. The domain embeddings we constructed turned out to be effective in performing actual function prediction tasks. Extensive evaluations showed that protein representations using the domain embeddings are superior to those of large-scale protein language models in GO prediction tasks. Moreover, the new function prediction method built on the domain embeddings, named Domain-PFP, significantly outperformed the state-of-the-art function predictors. Additionally, Domain-PFP demonstrated competitive performance in the CAFA3 evaluation, achieving the best overall performance among the top teams that participated in the assessment.
Affiliation(s)
- Nabil Ibtehaz
- Department of Computer Science, Purdue University, West Lafayette, IN, United States
- Yuki Kagaya
- Department of Biological Sciences, Purdue University, West Lafayette, IN, United States
- Daisuke Kihara
- Department of Computer Science, Purdue University, West Lafayette, IN, United States
- Department of Biological Sciences, Purdue University, West Lafayette, IN, United States
10
Zheng R, Huang Z, Deng L. Large-scale predicting protein functions through heterogeneous feature fusion. Brief Bioinform 2023:bbad243. [PMID: 37401369 DOI: 10.1093/bib/bbad243] [Received: 03/12/2023] [Revised: 05/18/2023] [Accepted: 06/12/2023] [Indexed: 07/05/2023]
Abstract
As the volume of protein sequence and structure data grows rapidly, the functions of the overwhelming majority of proteins cannot be experimentally determined. Automated annotation of protein function at a large scale is becoming increasingly important. Existing computational prediction methods are typically based on expanding the relatively small number of experimentally determined functions to large collections of proteins with various clues, including sequence homology, protein-protein interaction, gene co-expression, etc. Although there has been some progress in protein function prediction in recent years, the development of accurate and reliable solutions still has a long way to go. Here we exploit AlphaFold-predicted three-dimensional structural information, together with other non-structural clues, to develop a large-scale approach termed PredGO to annotate Gene Ontology (GO) functions for proteins. We use a pre-trained language model, geometric vector perceptrons and attention mechanisms to extract heterogeneous features of proteins and fuse these features for function prediction. The computational results demonstrate that the proposed method outperforms other state-of-the-art approaches for predicting GO functions of proteins in terms of both coverage and accuracy. The improvement in coverage arises because the number of structures predicted by AlphaFold has greatly increased and because PredGO can make extensive use of non-structural information for functional prediction. Moreover, we show that over 205 000 (~100%) entries in UniProt for human are annotated by PredGO, over 186 000 (~90%) of which are based on predicted structure. The webserver and database are available at http://predgo.denglab.org/.
Affiliation(s)
- Rongtao Zheng
- School of Computer Science and Engineering, Central South University, 410000 Changsha, China
- Zhijian Huang
- School of Computer Science and Engineering, Central South University, 410000 Changsha, China
- Lei Deng
- School of Computer Science and Engineering, Central South University, 410000 Changsha, China
|
11
|
Song J, Sun J, Wang Y, Ding Y, Zhang S, Ma X, Chang F, Fan B, Liu H, Bao C, Meng W. CeRNA network identified hsa-miR-17-5p, hsa-miR-106a-5p and hsa-miR-2355-5p as potential diagnostic biomarkers for tuberculosis. Medicine (Baltimore) 2023; 102:e33117. [PMID: 36930090 PMCID: PMC10019109 DOI: 10.1097/md.0000000000033117] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2022] [Accepted: 02/08/2023] [Indexed: 03/18/2023] Open
Abstract
This study aims to analyze the regulatory non-coding RNAs in the pathological process of tuberculosis (TB) and to identify novel diagnostic biomarkers. A longitudinal study was conducted in 5 newly diagnosed pulmonary tuberculosis patients; peripheral blood samples were collected before and after 6 months of anti-TB treatment. After whole transcriptome sequencing, differentially expressed RNAs (DE RNAs) were filtered using |log2(fold change)| > log2(1.5) and P value < .05 as screening criteria. Functional annotation was then performed by gene ontology enrichment analysis, and pathway enrichment analysis was conducted with the Kyoto Encyclopedia of Genes and Genomes database. Finally, the competitive endogenous RNA (ceRNA) regulatory network was established according to the interaction of ceRNA pairs and miRNA-mRNA pairs. Five young women were recruited and completed this study. Based on the differential expression analysis, a total of 1469 mRNAs, 996 long non-coding RNAs, 468 circular RNAs, and 86 miRNAs were identified as DE RNAs. Functional annotation demonstrated that those DE-mRNAs were strongly involved in the cellular process (n = 624), metabolic process (n = 513), single-organism process (n = 505), cell (n = 651), cell part (n = 650), organelle (n = 569), and binding (n = 629). Enrichment pathway analysis revealed that the differentially expressed genes were mainly enriched in HTLV-I infection, T cell receptor signaling pathway, glycosaminoglycan biosynthesis-heparan sulfate/heparin, and Hippo signaling pathway. CeRNA networks revealed that hsa-miR-17-5p, hsa-miR-106a-5p and hsa-miR-2355-5p might be regarded as potential diagnostic biomarkers for TB. Immunomodulation-related genes are differentially expressed in TB patients, and hsa-miR-106a-5p, hsa-miR-17-5p, hsa-miR-2355-5p might serve as potential diagnostic biomarkers.
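The screening criteria quoted in this abstract (an absolute log2 fold change above log2(1.5) together with a P value below .05) can be illustrated with a minimal sketch. The function name, the record layout, and the toy values below are hypothetical and are not taken from the study; the authors' actual pipeline is not described here.

```python
import math


def is_differentially_expressed(log2_fc, p_value, fc_cutoff=1.5, p_cutoff=0.05):
    """Return True when both screening criteria from the abstract are met.

    log2_fc   -- log2 of the fold change between the two conditions
    p_value   -- P value of the differential expression test
    fc_cutoff -- fold-change threshold on the linear scale (1.5 in the abstract)
    p_cutoff  -- significance threshold (.05 in the abstract)
    """
    return abs(log2_fc) > math.log2(fc_cutoff) and p_value < p_cutoff


# Hypothetical toy records, for illustration only.
records = [
    {"id": "rna_a", "log2_fc": 1.2, "p": 0.01},    # passes both criteria
    {"id": "rna_b", "log2_fc": -0.4, "p": 0.001},  # fold change too small
    {"id": "rna_c", "log2_fc": -0.9, "p": 0.03},   # passes (down-regulated)
    {"id": "rna_d", "log2_fc": 2.0, "p": 0.20},    # not significant
]

de_ids = [r["id"] for r in records
          if is_differentially_expressed(r["log2_fc"], r["p"])]
print(de_ids)
```

Taking the absolute value keeps both up- and down-regulated RNAs, which is why a negative log2 fold change can still pass the filter.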
Affiliation(s)
- Jie Song
- School of Public Health, Xinxiang Medical University, Xinxiang, China
- Jiaguan Sun
- School of Public Health, Xinxiang Medical University, Xinxiang, China
- Yuqing Wang
- The 4th People’s Hospital of Qinghai Province, Xining, China
- Yuehe Ding
- The 4th People’s Hospital of Qinghai Province, Xining, China
- Shengrong Zhang
- The 4th People’s Hospital of Qinghai Province, Xining, China
- Xiuzhen Ma
- The 4th People’s Hospital of Qinghai Province, Xining, China
- Fengxia Chang
- The 4th People’s Hospital of Qinghai Province, Xining, China
- Bingdong Fan
- The 4th People’s Hospital of Qinghai Province, Xining, China
- Hongjuan Liu
- The 4th People’s Hospital of Qinghai Province, Xining, China
- Chenglan Bao
- The 4th People’s Hospital of Qinghai Province, Xining, China
- Weimin Meng
- The 4th People’s Hospital of Qinghai Province, Xining, China
|
12
|
Machine learning for the identification of respiratory viral attachment machinery from sequences data. PLoS One 2023; 18:e0281642. [PMID: 36862685 PMCID: PMC9980812 DOI: 10.1371/journal.pone.0281642] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/04/2022] [Accepted: 01/27/2023] [Indexed: 03/03/2023] Open
Abstract
At the outset of an emergent viral respiratory pandemic, sequence data is among the first molecular information available. As viral attachment machinery is a key target for therapeutic and prophylactic interventions, rapid identification of viral "spike" proteins from sequence can significantly accelerate the development of medical countermeasures. For six families of respiratory viruses, covering the vast majority of airborne and droplet-transmitted diseases, host cell entry is mediated by the binding of viral surface glycoproteins that interact with a host cell receptor. In this report it is shown that sequence data for an unknown virus belonging to one of the six families above provides sufficient information to identify the protein(s) responsible for viral attachment. Random forest models that take as input a set of respiratory viral sequences can classify the protein as "spike" vs. non-spike based on predicted secondary structure elements alone (with 97.3% correctly classified) or in combination with N-glycosylation related features (with 97.0% correctly classified). Models were validated through 10-fold cross-validation, bootstrapping on a class-balanced set, and an out-of-sample extra-familial validation set. Surprisingly, we showed that secondary structural elements and N-glycosylation features were sufficient for model generation. The ability to rapidly identify viral attachment machinery directly from sequence data holds the potential to accelerate the design of medical countermeasures for future pandemics. Furthermore, this approach may be extendable for the identification of other potential viral targets and for viral sequence annotation in general in the future.
|
13
|
Yan TC, Yue ZX, Xu HQ, Liu YH, Hong YF, Chen GX, Tao L, Xie T. A systematic review of state-of-the-art strategies for machine learning-based protein function prediction. Comput Biol Med 2023; 154:106446. [PMID: 36680931 DOI: 10.1016/j.compbiomed.2022.106446] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 11/22/2022] [Revised: 12/07/2022] [Accepted: 12/19/2022] [Indexed: 12/24/2022]
Abstract
New drug discovery is inseparable from the discovery of drug targets, and the vast majority of the known targets are proteins. At the same time, proteins are essential structural and functional elements of living cells necessary for the maintenance of all forms of life. Therefore, protein functions have become the focus of many pharmacological and biological studies. Traditional experimental techniques are no longer adequate for rapidly growing annotation of protein sequences, and approaches to protein function prediction using computational methods have emerged and flourished. A significant trend has been to use machine learning to achieve this goal. In this review, approaches to protein function prediction based on the sequence, structure, protein-protein interaction (PPI) networks, and fusion of multi-information sources are discussed. The current status of research on protein function prediction using machine learning is considered, and existing challenges and prominent breakthroughs are discussed to provide ideas and methods for future studies.
Affiliation(s)
- Tian-Ci Yan
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
- Zi-Xuan Yue
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
- Hong-Quan Xu
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
- Yu-Hong Liu
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
- Yan-Feng Hong
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
- Gong-Xing Chen
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
- Lin Tao
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
- Tian Xie
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
|
14
|
Efremenko E, Aslanli A, Lyagin I. Advanced Situation with Recombinant Toxins: Diversity, Production and Application Purposes. Int J Mol Sci 2023; 24:ijms24054630. [PMID: 36902061 PMCID: PMC10003545 DOI: 10.3390/ijms24054630] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 12/30/2022] [Revised: 02/14/2023] [Accepted: 02/20/2023] [Indexed: 03/04/2023] Open
Abstract
Today, the production and use of various samples of recombinant protein/polypeptide toxins is known and is actively developing. This review presents state-of-the-art in research and development of such toxins and their mechanisms of action and useful properties that have allowed them to be implemented into practice to treat various medical conditions (including oncology and chronic inflammation applications) and diseases, as well as to identify novel compounds and to detoxify them by diverse approaches (including enzyme antidotes). Special attention is given to the problems and possibilities of the toxicity control of the obtained recombinant proteins. The recombinant prions are discussed in the frame of their possible detoxification by enzymes. The review discusses the feasibility of obtaining recombinant variants of toxins in the form of protein molecules modified with fluorescent proteins, affine sequences and genetic mutations, allowing us to investigate the mechanisms of toxins' bindings to their natural receptors.
Affiliation(s)
- Elena Efremenko
- Correspondence: Tel.: +7-(495)-939-3170; Fax: +7-(495)-939-5417
|
15
|
Suleman MT, Khan YD. m1A-pred: Prediction of Modified 1-methyladenosine Sites in RNA Sequences through Artificial Intelligence. Comb Chem High Throughput Screen 2022; 25:2473-2484. [PMID: 35718969 DOI: 10.2174/1386207325666220617152743] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 01/20/2022] [Revised: 04/06/2022] [Accepted: 04/11/2022] [Indexed: 01/27/2023]
Abstract
BACKGROUND: The process of nucleotides modification or methyl groups addition to nucleotides is known as post-transcriptional modification (PTM). 1-methyladenosine (m1A) is a type of PTM formed by adding a methyl group to the nitrogen at the 1st position of the adenosine base. Many human disorders are associated with m1A, which is widely found in ribosomal RNA and transfer RNA. OBJECTIVE: The conventional methods such as mass spectrometry and site-directed mutagenesis proved to be laborious and burdensome. Systematic identification of modified sites from RNA sequences is gaining much attention nowadays. Consequently, an extreme gradient boost predictor, m1A-Pred, is developed in this study for the prediction of modified m1A sites. METHODS: The current study involves the extraction of position and composition-based properties within nucleotide sequences. The extraction of features helps in the development of the features vector. Statistical moments were endorsed for dimensionality reduction in the obtained features. RESULTS: Through a series of experiments using different computational models and evaluation methods, it was revealed that the proposed predictor, m1A-pred, proved to be the most robust and accurate model for the identification of modified sites. AVAILABILITY AND IMPLEMENTATION: To enhance the research on m1A sites, a friendly server was also developed, which was the final phase of this research.
Affiliation(s)
- Muhammad Taseer Suleman
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, Lahore, Pakistan
- Yaser Daanial Khan
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, Lahore, Pakistan
|
16
|
Toh H, Yang C, Formenti G, Raja K, Yan L, Tracey A, Chow W, Howe K, Bergeron LA, Zhang G, Haase B, Mountcastle J, Fedrigo O, Fogg J, Kirilenko B, Munegowda C, Hiller M, Jain A, Kihara D, Rhie A, Phillippy AM, Swanson SA, Jiang P, Clegg DO, Jarvis ED, Thomson JA, Stewart R, Chaisson MJP, Bukhman YV. A haplotype-resolved genome assembly of the Nile rat facilitates exploration of the genetic basis of diabetes. BMC Biol 2022; 20:245. [DOI: 10.1186/s12915-022-01427-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/10/2022] [Accepted: 09/29/2022] [Indexed: 11/09/2022] Open
Abstract
Background
The Nile rat (Arvicanthis niloticus) is an important animal model because of its robust diurnal rhythm, a cone-rich retina, and a propensity to develop diet-induced diabetes without chemical or genetic modifications. A closer similarity to humans in these aspects, compared to the widely used Mus musculus and Rattus norvegicus models, holds the promise of better translation of research findings to the clinic.
Results
We report a 2.5 Gb, chromosome-level reference genome assembly with fully resolved parental haplotypes, generated with the Vertebrate Genomes Project (VGP). The assembly is highly contiguous, with contig N50 of 11.1 Mb, scaffold N50 of 83 Mb, and 95.2% of the sequence assigned to chromosomes. We used a novel workflow to identify 3613 segmental duplications and quantify duplicated genes. Comparative analyses revealed unique genomic features of the Nile rat, including some that affect genes associated with type 2 diabetes and metabolic dysfunctions. We discuss 14 genes that are heterozygous in the Nile rat or highly diverged from the house mouse.
Conclusions
Our findings reflect the exceptional level of genomic resolution present in this assembly, which will greatly expand the potential of the Nile rat as a model organism.
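The contig and scaffold N50 figures quoted in the Results are contiguity statistics: N50 is the sequence length L such that sequences of length L or longer together cover at least half of the total assembly length. A minimal sketch of that calculation follows; the function name and the toy lengths are hypothetical and unrelated to the Nile rat assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Sequences are visited from longest to shortest; the N50 is the length
    of the sequence at which the running total first reaches half of the
    summed length of all sequences.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must contain at least one sequence")


# Hypothetical toy assembly: total length 300, so half is 150.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_lengths))  # 80 + 70 = 150 reaches half, so N50 is 70
```

Because N50 depends only on the length distribution, a high value indicates contiguity but says nothing on its own about assembly correctness.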
|
17
|
Suleman MT, Alkhalifah T, Alturise F, Khan YD. DHU-Pred: accurate prediction of dihydrouridine sites using position and composition variant features on diverse classifiers. PeerJ 2022; 10:e14104. [PMID: 36320563 PMCID: PMC9618264 DOI: 10.7717/peerj.14104] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 05/26/2022] [Accepted: 09/01/2022] [Indexed: 01/21/2023] Open
Abstract
Background: Dihydrouridine (D) is a modified transfer RNA post-transcriptional modification (PTM) that occurs abundantly in bacteria, eukaryotes, and archaea. The D modification assists in the stability and conformational flexibility of tRNA. The D modification is also responsible for pulmonary carcinogenesis in humans. Objective: For the detection of D sites, mass spectrometry and site-directed mutagenesis have been developed. However, both are labor-intensive and time-consuming methods. The availability of sequence data has provided the opportunity to build computational models for enhancing the identification of D sites. Based on the sequence data, the DHU-Pred model was proposed in this study to find possible D sites. Methodology: The model was built by employing comprehensive machine learning and feature extraction approaches. It was then validated using in-demand evaluation metrics and rigorous experimentation and testing approaches. Results: The DHU-Pred revealed an accuracy score of 96.9%, which was considerably higher compared to the existing D site predictors. Availability and Implementation: A user-friendly web server for the proposed model was also developed and is freely available for the researchers.
Affiliation(s)
- Muhammad Taseer Suleman
- Department of Computer Science, School of Systems and Technology, University of Management & Technology, Lahore, Pakistan
- Tamim Alkhalifah
- Department of Computer, College of Science and Arts in Ar Rass Qassim University, Ar Rass, Qassim, Saudi Arabia
- Fahad Alturise
- Department of Computer, College of Science and Arts in Ar Rass Qassim University, Ar Rass, Qassim, Saudi Arabia
- Yaser Daanial Khan
- Department of Computer Science, School of Systems and Technology, University of Management & Technology, Lahore, Pakistan
|
18
|
Fenoy E, Edera AA, Stegmayer G. Transfer learning in proteins: evaluating novel protein learned representations for bioinformatics tasks. Brief Bioinform 2022; 23:6618242. [PMID: 35758229 DOI: 10.1093/bib/bbac232] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 01/25/2022] [Revised: 05/13/2022] [Accepted: 05/18/2022] [Indexed: 11/13/2022] Open
Abstract
A representation method is an algorithm that calculates numerical feature vectors for samples in a dataset. Such vectors, also known as embeddings, define a relatively low-dimensional space able to efficiently encode high-dimensional data. Very recently, many types of learned data representations based on machine learning have appeared and are being applied to several tasks in bioinformatics. In particular, protein representation learning methods integrate different types of protein information (sequence, domains, etc.), in supervised or unsupervised learning approaches, and provide embeddings of protein sequences that can be used for downstream tasks. One task that is of special interest is the automatic function prediction of the huge number of novel proteins that are being discovered nowadays and are still totally uncharacterized. However, despite its importance, to date there has been no fair benchmark study of the predictive performance of existing proposals on the same large set of proteins and for very concrete and common bioinformatics tasks. This lack of benchmark studies prevents the community from using adequate predictive methods for accelerating the functional characterization of proteins. In this study, we performed a detailed comparison of protein sequence representation learning methods, explaining each approach and comparing them with an experimental benchmark on several bioinformatics tasks: (i) determining protein sequence similarity in the embedding space; (ii) inferring protein domains and (iii) predicting ontology-based protein functions. We examine the advantages and disadvantages of each representation approach over the benchmark results. We hope the results and the discussion of this study can help the community to select the most adequate machine learning-based technique for protein representation according to the bioinformatics task at hand.
Affiliation(s)
- Emilio Fenoy
- Research Institute for Signals, Systems and Computational Intelligence sinc(i) (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina
- Alejandro A Edera
- Research Institute for Signals, Systems and Computational Intelligence sinc(i) (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina
- Georgina Stegmayer
- Research Institute for Signals, Systems and Computational Intelligence sinc(i) (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina
|
19
|
Kagaya Y, Flannery ST, Jain A, Kihara D. ContactPFP: Protein Function Prediction Using Predicted Contact Information. FRONTIERS IN BIOINFORMATICS 2022; 2. [PMID: 35875419 PMCID: PMC9302406 DOI: 10.3389/fbinf.2022.896295] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Indexed: 11/13/2022] Open
Abstract
Computational function prediction is one of the most important problems in bioinformatics as elucidating the function of genes is a central task in molecular biology and genomics. Most of the existing function prediction methods use protein sequences as the primary source of input information because the sequence is the most available information for query proteins. There are attempts to consider other attributes of query proteins. Among these attributes, the three-dimensional (3D) structure of proteins is known to be very useful in identifying the evolutionary relationship of proteins, from which functional similarity can be inferred. Here, we report a novel protein function prediction method, ContactPFP, which uses predicted residue-residue contact maps as input structural features of query proteins. Although 3D structure information is known to be useful, it has not been routinely used in function prediction because the 3D structure is not experimentally determined for many proteins. In ContactPFP, we overcome this limitation by using residue-residue contact prediction, which has become increasingly accurate due to rapid development in the protein structure prediction field. ContactPFP takes a query protein sequence as input and uses predicted residue-residue contact as a proxy for the 3D protein structure. To characterize how predicted contacts contribute to function prediction accuracy, we compared the performance of ContactPFP with several well-established sequence-based function prediction methods. The comparative study revealed the advantages and weaknesses of ContactPFP compared to contemporary sequence-based methods. There were many cases where it showed higher prediction accuracy. We examined factors that affected the accuracy of ContactPFP using several illustrative cases that highlight the strength of our method.
Affiliation(s)
- Yuki Kagaya
- Department of Biological Sciences, Purdue University, West Lafayette, IN, United States
- Sean T. Flannery
- Department of Computer Science, Purdue University, West Lafayette, IN, United States
- Aashish Jain
- Department of Computer Science, Purdue University, West Lafayette, IN, United States
- Daisuke Kihara
- Department of Biological Sciences, Purdue University, West Lafayette, IN, United States
- Department of Computer Science, Purdue University, West Lafayette, IN, United States
- *Correspondence: Daisuke Kihara
|
20
|
Reijnders MJMF, Waterhouse RM. CrowdGO: Machine learning and semantic similarity guided consensus Gene Ontology annotation. PLoS Comput Biol 2022; 18:e1010075. [PMID: 35560159 PMCID: PMC9132264 DOI: 10.1371/journal.pcbi.1010075] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/18/2021] [Revised: 05/25/2022] [Accepted: 04/04/2022] [Indexed: 11/29/2022] Open
Abstract
Characterising gene function for the ever-increasing number and diversity of species with annotated genomes relies almost entirely on computational prediction methods. These software are also numerous and diverse, each with different strengths and weaknesses as revealed through community benchmarking efforts. Meta-predictors that assess consensus and conflict from individual algorithms should deliver enhanced functional annotations. To exploit the benefits of meta-approaches, we developed CrowdGO, an open-source consensus-based Gene Ontology (GO) term meta-predictor that employs machine learning models with GO term semantic similarities and information contents. By re-evaluating each gene-term annotation, a consensus dataset is produced with high-scoring confident annotations and low-scoring rejected annotations. Applying CrowdGO to results from a deep learning-based, a sequence similarity-based, and two protein domain-based methods, delivers consensus annotations with improved precision and recall. Furthermore, using standard evaluation measures CrowdGO performance matches that of the community’s best performing individual methods. CrowdGO therefore offers a model-informed approach to leverage strengths of individual predictors and produce comprehensive and accurate gene functional annotations.
New technologies mean that we are able to read the genetic blueprints in the form of complete genome sequences from many different species. We are also able to use computational methods combined with evidence from experiments to map out the locations in the genomes of many thousands of genes and other important regions. However, discovering and characterising the biological functions of all these genes and their protein products requires considerably more experimental work. In order to gain insights into the possible functions of the many genes currently lacking functional information from experiments we must therefore rely on methods that computationally predict protein functions. Many different software tools have been developed to tackle this challenge, each with their own strengths and weaknesses as shown by several community-based competitions that assess the performance of the predictors. Taking advantage of powerful modern machine learning techniques, we developed CrowdGO, a new software that aims to combine predictions from several tools and produce comprehensive and accurate gene functional annotations. CrowdGO is able to computationally assess agreements and conflicts amongst annotations from different predictors to then re-evaluate the results and deliver enhanced predictions of protein functions.
Affiliation(s)
- Maarten J. M. F. Reijnders
- Department of Ecology and Evolution, University of Lausanne, and Swiss Institute of Bioinformatics, Lausanne, Switzerland
- * E-mail: (MJMFR); (RMW)
- Robert M. Waterhouse
- Department of Ecology and Evolution, University of Lausanne, and Swiss Institute of Bioinformatics, Lausanne, Switzerland
- * E-mail: (MJMFR); (RMW)
|
21
|
|
22
|
Yousafi Q, Sarfaraz A, Saad Khan M, Saleem S, Shahzad U, Abbas Khan A, Sadiq M, Ditta Abid A, Sohail Shahzad M, ul Hassan N. In silico annotation of unreviewed acetylcholinesterase (AChE) in some lepidopteran insect pest species reveals the causes of insecticide resistance. Saudi J Biol Sci 2021; 28:2197-2209. [PMID: 33911936 PMCID: PMC8071828 DOI: 10.1016/j.sjbs.2021.01.007] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 11/04/2020] [Revised: 12/11/2020] [Accepted: 01/06/2021] [Indexed: 02/07/2023] Open
Abstract
Lepidoptera is the second most diverse insect order, outnumbered only by the Coleoptera. Acetylcholinesterase (AChE) is the major target site for insecticides. Extensive use of insecticides to inhibit the function of this enzyme has resulted in the development of insecticide resistance. Complete knowledge of the target proteins is very important for understanding the cause of resistance. Computational annotation of insect acetylcholinesterase can be helpful for the characterization of this important protein. Acetylcholinesterase of fourteen lepidopteran insect pest species was annotated by using different bioinformatics tools. AChE in all the species was hydrophilic and thermostable. All the species showed lower values for instability index except L. orbonalis, S. exigua and T. absoluta. The highest percentages of Arg, Asp, Asn, Gln and Cys were recorded in P. rapae. The high percentage of Cys and Gln might be a reason for insecticide resistance development in P. rapae. Phylogenetic analysis revealed that AChE in T. absoluta, L. orbonalis and S. exigua is closely related and emerged from the same primary branch. Three functional motifs were predicted in eleven species, while only two were found in L. orbonalis, S. exigua and T. absoluta. AChE in eleven species followed the secretory pathway and had signal peptides. No signal peptides were predicted for S. exigua, L. orbonalis and T. absoluta, which follow a non-secretory pathway. Arginine methylation and cysteine palmitoylation were found in all species except S. exigua, L. orbonalis and T. absoluta. A glycosylphosphatidylinositol (GPI) anchor was predicted in only nine species.
Affiliation(s)
- Qudsia Yousafi
- COMSATS University Islamabad, Sahiwal Campus, Sahiwal, Punjab, Pakistan
- Corresponding author.
- Ayesha Sarfaraz
- COMSATS University Islamabad, Sahiwal Campus, Sahiwal, Punjab, Pakistan
- Shahzad Saleem
- COMSATS University Islamabad, Sahiwal Campus, Sahiwal, Punjab, Pakistan
- Umbreen Shahzad
- College of Agriculture, Bahauddin Zakariya University, Bahadur Campus, Layyah, Pakistan
- Azhar Abbas Khan
- College of Agriculture, Bahauddin Zakariya University, Bahadur Campus, Layyah, Pakistan
- Mazhar Sadiq
- COMSATS University Islamabad, Sahiwal Campus, Sahiwal, Punjab, Pakistan
|
23
|
Makrodimitris S, van Ham RCHJ, Reinders MJT. Automatic Gene Function Prediction in the 2020's. Genes (Basel) 2020; 11:E1264. [PMID: 33120976 PMCID: PMC7692357 DOI: 10.3390/genes11111264] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 09/28/2020] [Revised: 10/19/2020] [Accepted: 10/21/2020] [Indexed: 02/06/2023] Open
Abstract
The current rate at which new DNA and protein sequences are being generated is too fast to experimentally discover the functions of those sequences, emphasizing the need for accurate Automatic Function Prediction (AFP) methods. AFP has been an active and growing research field for decades and has made considerable progress in that time. However, it is certainly not solved. In this paper, we describe challenges that the AFP field still has to overcome in the future to increase its applicability. The challenges we consider are how to: (1) include condition-specific functional annotation, (2) predict functions for non-model species, (3) include new informative data sources, (4) deal with the biases of Gene Ontology (GO) annotations, and (5) maximally exploit the GO to obtain performance gains. We also provide recommendations for addressing those challenges, by adapting (1) the way we represent proteins and genes, (2) the way we represent gene functions, and (3) the algorithms that perform the prediction from gene to function. Together, we show that AFP is still a vibrant research area that can benefit from continuing advances in machine learning with which AFP in the 2020s can again take a large step forward reinforcing the power of computational biology.
Affiliation(s)
- Stavros Makrodimitris
  - Delft Bioinformatics Lab, Delft University of Technology, 2628XE Delft, The Netherlands
  - Keygene N.V., 6708PW Wageningen, The Netherlands
- Roeland C. H. J. van Ham
  - Delft Bioinformatics Lab, Delft University of Technology, 2628XE Delft, The Netherlands
  - Keygene N.V., 6708PW Wageningen, The Netherlands
- Marcel J. T. Reinders
  - Delft Bioinformatics Lab, Delft University of Technology, 2628XE Delft, The Netherlands
  - Leiden Computational Biology Center, Leiden University Medical Center, 2333ZC Leiden, The Netherlands
24
de Witt RN, Kroukamp H, Van Zyl WH, Paulsen IT, Volschenk H. QTL analysis of natural Saccharomyces cerevisiae isolates reveals unique alleles involved in lignocellulosic inhibitor tolerance. FEMS Yeast Res 2020; 19:5528620. [PMID: 31276593 DOI: 10.1093/femsyr/foz047] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 03/02/2019] [Accepted: 07/03/2019] [Indexed: 12/13/2022]
Abstract
Decoding the genetic basis of lignocellulosic inhibitor tolerance in Saccharomyces cerevisiae is crucial for rational engineering of bioethanol strains with enhanced robustness. The genetic diversity of natural strains presents an invaluable resource for the exploration of complex traits of industrial importance from a pan-genomic perspective to complement the limited range of specialised, tolerant industrial strains. Natural S. cerevisiae isolates have lately garnered interest as a promising toolbox for engineering novel, genetically encoded tolerance phenotypes into commercial strains. To this end, we investigated the genetic basis for lignocellulosic inhibitor tolerance of natural S. cerevisiae isolates. A total of 12 quantitative trait loci underpinning tolerance were identified by next-generation sequencing linked bulk-segregant analysis of superior interbred pools. Our findings corroborate the current perspective of lignocellulosic inhibitor tolerance as a multigenic, complex trait. Apart from a core set of genetic variants required for inhibitor tolerance, an additional genetic background-specific response was observed. Functional analyses of the identified genetic loci revealed the uncharacterised ORF YGL176C and the bud-site selection XRN1/BUD13 as potentially beneficial alleles contributing to tolerance to a complex lignocellulosic inhibitor mixture. We present evidence for the consideration of both regulatory and coding sequence variants for strain improvement.
Affiliation(s)
- R N de Witt
  - Department of Microbiology, Stellenbosch University, De Beer Street, Stellenbosch 7600, Western Cape, South Africa
- H Kroukamp
  - Department of Molecular Sciences, Macquarie University, Balaclava Rd, North Ryde, NSW 2109, Australia
- W H Van Zyl
  - Department of Microbiology, Stellenbosch University, De Beer Street, Stellenbosch 7600, Western Cape, South Africa
- I T Paulsen
  - Department of Molecular Sciences, Macquarie University, Balaclava Rd, North Ryde, NSW 2109, Australia
- H Volschenk
  - Department of Microbiology, Stellenbosch University, De Beer Street, Stellenbosch 7600, Western Cape, South Africa
25
You R, Yao S, Xiong Y, Huang X, Sun F, Mamitsuka H, Zhu S. NetGO: improving large-scale protein function prediction with massive network information. Nucleic Acids Res 2020; 47:W379-W387. [PMID: 31106361 PMCID: PMC6602452 DOI: 10.1093/nar/gkz388] [Citation(s) in RCA: 65] [Impact Index Per Article: 16.3] [Received: 03/04/2019] [Revised: 04/24/2019] [Accepted: 05/01/2019] [Indexed: 01/19/2023]
Abstract
Automated function prediction (AFP) of proteins is of great significance in biology. AFP can be regarded as a large-scale multi-label classification problem in which a protein can be associated with multiple Gene Ontology terms as its labels. Building on our GOLabeler, a state-of-the-art method in the third Critical Assessment of Functional Annotation (CAFA3), we propose NetGO, a web server that further improves the performance of large-scale AFP by incorporating massive protein-protein network information. Specifically, NetGO has three advantages in using network information: (i) NetGO relies on a powerful learning-to-rank framework from machine learning to effectively integrate both sequence and network information of proteins; (ii) NetGO uses the massive network information of all species (>2000) in STRING (rather than only some specific species); and (iii) NetGO can still use network information to annotate a protein by homology transfer, even if the protein is not contained in STRING. Separating training and testing data with the same time-delayed settings as CAFA, we comprehensively examined the performance of NetGO. Experimental results demonstrate that NetGO significantly outperforms GOLabeler and other competing methods. The NetGO web server is freely available at http://issubmission.sjtu.edu.cn/netgo/.
Affiliation(s)
- Ronghui You
  - School of Computer Science and Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai 200433, China
  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China
  - Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, China
- Shuwei Yao
  - School of Computer Science and Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai 200433, China
  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China
  - Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, China
- Yi Xiong
  - Department of Bioinformatics and Biostatistics, Shanghai Jiao Tong University
- Xiaodi Huang
  - School of Computing and Mathematics, Charles Sturt University, Albury, NSW 2640, Australia
- Fengzhu Sun
  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China
  - Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, China
  - Quantitative and Computational Biology, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
- Hiroshi Mamitsuka
  - Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji 611-0011, Japan
  - Department of Computer Science, Aalto University, Espoo and Helsinki, Finland
- Shanfeng Zhu
  - School of Computer Science and Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai 200433, China
  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China
  - Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, China
26
NNTox: Gene Ontology-Based Protein Toxicity Prediction Using Neural Network. Sci Rep 2019; 9:17923. [PMID: 31784686 PMCID: PMC6884647 DOI: 10.1038/s41598-019-54405-6] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 07/08/2019] [Accepted: 11/13/2019] [Indexed: 12/23/2022]
Abstract
With advancements in synthetic biology, the cost and the time needed for designing and synthesizing customized gene products have been steadily decreasing. Many research laboratories in academia as well as industry routinely create genetically engineered proteins as a part of their research activities. However, manipulation of protein sequences could result in unintentional production of toxic proteins. Therefore, being able to identify the toxicity of a protein before the synthesis would reduce the risk of potential hazards. Existing methods are too specific, which limits their application. Here, we extended general function prediction methods for predicting the toxicity of proteins. Protein function prediction methods have been actively studied in the bioinformatics community and have shown significant improvement over the last decade. We have previously developed successful function prediction methods, which were shown to be among top-performing methods in the community-wide functional annotation experiment, CAFA. Based on our function prediction method, we developed a neural network model, named NNTox, which uses predicted GO terms for a target protein to further predict the possibility of the protein being toxic. We have also developed a multi-label model, which can predict the specific toxicity type of the query sequence. Together, this work analyses the relationship between GO terms and protein toxicity and builds predictor models of protein toxicity.
27
Hong J, Luo Y, Zhang Y, Ying J, Xue W, Xie T, Tao L, Zhu F. Protein functional annotation of simultaneously improved stability, accuracy and false discovery rate achieved by a sequence-based deep learning. Brief Bioinform 2019; 21:1437-1447. [PMID: 31504150 PMCID: PMC7412958 DOI: 10.1093/bib/bbz081] [Citation(s) in RCA: 90] [Impact Index Per Article: 18.0] [Received: 05/06/2019] [Revised: 05/27/2019] [Accepted: 06/10/2019] [Indexed: 11/12/2022]
Abstract
Functional annotation of protein sequences with high accuracy has become one of the most important issues in modern biomedical studies, and computational approaches that significantly accelerate analysis and enhance accuracy are greatly desired. Although a variety of methods have been developed to improve protein annotation accuracy, their ability to control false annotation rates remains either limited or not systematically evaluated. In this study, a protein encoding strategy, together with a deep learning algorithm, was proposed to control the false discovery rate in protein function annotation, and its performance was systematically compared with that of the traditional similarity-based and de novo approaches. Based on a comprehensive assessment from multiple perspectives, the proposed strategy and algorithm were found to perform better in both prediction stability and annotation accuracy than other de novo methods. Moreover, an in-depth assessment revealed that it had an improved capacity to control the false discovery rate compared with traditional methods. Overall, this study provides both a comprehensive analysis of the performance of the newly proposed strategy and a tool for researchers in the field of protein function annotation.
Affiliation(s)
- Jiajun Hong
  - Key Laboratory of Elemene Class Anti-cancer Chinese Medicine of Zhejiang Province, School of Medicine, Hangzhou Normal University, Hangzhou, China
  - College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
- Yongchao Luo
  - College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
- Yang Zhang
  - College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
  - School of Pharmaceutical Sciences, Chongqing University, Chongqing, China
- Junbiao Ying
  - College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
- Weiwei Xue
  - School of Pharmaceutical Sciences, Chongqing University, Chongqing, China
- Tian Xie
  - Key Laboratory of Elemene Class Anti-cancer Chinese Medicine of Zhejiang Province, School of Medicine, Hangzhou Normal University, Hangzhou, China
- Lin Tao
  - Key Laboratory of Elemene Class Anti-cancer Chinese Medicine of Zhejiang Province, School of Medicine, Hangzhou Normal University, Hangzhou, China
- Feng Zhu
  - Key Laboratory of Elemene Class Anti-cancer Chinese Medicine of Zhejiang Province, School of Medicine, Hangzhou Normal University, Hangzhou, China
  - College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China