1
Fukunaga T, Ogawa T, Iwasaki W, Sonoike K. Phylogenetic Profiling Analysis of the Phycobilisome Revealed a Novel State-Transition Regulator Gene in Synechocystis sp. PCC 6803. Plant & Cell Physiology 2024; 65:1450-1460. [PMID: 39034452; PMCID: PMC11447641; DOI: 10.1093/pcp/pcae083] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/28/2024] [Revised: 07/05/2024] [Accepted: 07/20/2024] [Indexed: 07/23/2024]
Abstract
Phycobilisomes play a crucial role in the light-harvesting mechanisms of cyanobacteria, red algae and glaucophytes, but the molecular mechanism of their regulation is largely unknown. In the cyanobacterium, Synechocystis sp. PCC 6803, we identified slr0244 as a phycobilisome-related gene using phylogenetic profiling analysis, a method used to predict gene function based on comparative genomics. To investigate the physiological function of the slr0244 gene, we characterized slr0244 mutants spectroscopically. Disruption of the slr0244 gene impaired state transition, a process by which the distribution of light energy absorbed by the phycobilisomes between two photosystems is regulated in response to the changes in light conditions. The Slr0244 protein seems to act in the process of state transition, somewhere at or downstream of the sensing step of the redox state of the plastoquinone (PQ) pool. These findings, together with past reports describing the interaction of this gene product with thioredoxin and glutaredoxin, suggest that the slr0244 gene is a novel state-transition regulator that integrates the redox signal of PQ pools with that of the photosystem I-reducing side. The protein has two universal stress protein (USP) motifs in tandem. The second motif has two conserved cysteine residues found in USPs of other cyanobacteria and land plants. These redox-type USPs with conserved cysteines may function as redox regulators in various photosynthetic organisms. Our study also shows the efficacy of phylogenetic profiling analysis in predicting the function of cyanobacterial genes that have not been annotated so far.
Affiliation(s)
- Tsukasa Fukunaga
  - Waseda Institute for Advanced Study, Waseda University, Tokyo 169-0051, Japan
- Takako Ogawa
  - Faculty of Education and Integrated Arts and Sciences, Waseda University, Tokyo 162-8480, Japan
  - Graduate School of Science and Engineering, Saitama University, Saitama 338-8570, Japan
- Wataru Iwasaki
  - Department of Integrated Biosciences, Graduate School of Frontier Sciences, The University of Tokyo, Chiba 277-0882, Japan
- Kintake Sonoike
  - Faculty of Education and Integrated Arts and Sciences, Waseda University, Tokyo 162-8480, Japan
2
Cosentino S, Sriswasdi S, Iwasaki W. SonicParanoid2: fast, accurate, and comprehensive orthology inference with machine learning and language models. Genome Biol 2024; 25:195. [PMID: 39054525; PMCID: PMC11270883; DOI: 10.1186/s13059-024-03298-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/11/2023] [Accepted: 06/04/2024] [Indexed: 07/27/2024] [Open Access]
Abstract
Accurate inference of orthologous genes constitutes a prerequisite for comparative and evolutionary genomics. SonicParanoid is one of the fastest tools for orthology inference; however, its scalability and accuracy have been hampered by time-consuming all-versus-all alignments and the existence of proteins with complex domain architectures. Here, we present a substantial update of SonicParanoid, where a gradient boosting predictor halves the execution time and a language model doubles the recall. Application to empirical large-scale and standardized benchmark datasets shows that SonicParanoid2 is much faster than comparable methods and also the most accurate. SonicParanoid2 is available at https://gitlab.com/salvo981/sonicparanoid2 and https://zenodo.org/doi/10.5281/zenodo.11371108 .
Affiliation(s)
- Salvatore Cosentino
  - Department of Integrated Biosciences, Graduate School of Frontier Sciences, the University of Tokyo, Kashiwa, Japan
- Sira Sriswasdi
  - Center of Excellence in Computational Molecular Biology, Faculty of Medicine, Chulalongkorn University, Bangkok, Thailand
- Wataru Iwasaki
  - Department of Integrated Biosciences, Graduate School of Frontier Sciences, the University of Tokyo, Kashiwa, Japan
  - Department of Biological Sciences, Graduate School of Science, the University of Tokyo, Bunkyo-ku, Japan
  - Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, the University of Tokyo, Kashiwa, Japan
  - Atmosphere and Ocean Research Institute, the University of Tokyo, Kashiwa, Japan
  - Institute for Quantitative Biosciences, the University of Tokyo, Bunkyo-ku, Japan
  - Collaborative Research Institute for Innovative Microbiology, the University of Tokyo, Bunkyo-ku, Japan
3
Cox RM, Papoulas O, Shril S, Lee C, Gardner T, Battenhouse AM, Lee M, Drew K, McWhite CD, Yang D, Leggere JC, Durand D, Hildebrandt F, Wallingford JB, Marcotte EM. Ancient eukaryotic protein interactions illuminate modern genetic traits and disorders. bioRxiv 2024:2024.05.26.595818. [PMID: 38853926; PMCID: PMC11160598; DOI: 10.1101/2024.05.26.595818] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2024]
Abstract
All eukaryotes share a common ancestor from roughly 1.5 - 1.8 billion years ago, a single-celled, swimming microbe known as LECA, the Last Eukaryotic Common Ancestor. Nearly half of the genes in modern eukaryotes were present in LECA, and many current genetic diseases and traits stem from these ancient molecular systems. To better understand these systems, we compared genes across modern organisms and identified a core set of 10,092 shared protein-coding gene families likely present in LECA, a quarter of which are uncharacterized. We then integrated >26,000 mass spectrometry proteomics analyses from 31 species to infer how these proteins interact in higher-order complexes. The resulting interactome describes the biochemical organization of LECA, revealing both known and new assemblies. We analyzed these ancient protein interactions to find new human gene-disease relationships for bone density and congenital birth defects, demonstrating the value of ancestral protein interactions for guiding functional genetics today.
Affiliation(s)
- Rachael M Cox
  - Department of Molecular Biosciences, The University of Texas at Austin, Austin, TX 78712, USA
- Ophelia Papoulas
  - Department of Molecular Biosciences, The University of Texas at Austin, Austin, TX 78712, USA
- Shirlee Shril
  - Division of Nephrology, Department of Pediatrics, Boston Children's Hospital, Harvard Medical School, Boston, MA 02215, USA
- Chanjae Lee
  - Department of Molecular Biosciences, The University of Texas at Austin, Austin, TX 78712, USA
- Tynan Gardner
  - Department of Molecular Biosciences, The University of Texas at Austin, Austin, TX 78712, USA
- Anna M Battenhouse
  - Department of Molecular Biosciences, The University of Texas at Austin, Austin, TX 78712, USA
- Muyoung Lee
  - Department of Molecular Biosciences, The University of Texas at Austin, Austin, TX 78712, USA
- Kevin Drew
  - Department of Biological Sciences, University of Illinois at Chicago, Chicago, IL 60607, USA
- Claire D McWhite
  - Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
- David Yang
  - Department of Molecular Biosciences, The University of Texas at Austin, Austin, TX 78712, USA
- Janelle C Leggere
  - Department of Molecular Biosciences, The University of Texas at Austin, Austin, TX 78712, USA
- Dannie Durand
  - Department of Biological Sciences, Carnegie Mellon University, 4400 5th Avenue Pittsburgh, PA 15213, USA
- Friedhelm Hildebrandt
  - Division of Nephrology, Department of Pediatrics, Boston Children's Hospital, Harvard Medical School, Boston, MA 02215, USA
- John B Wallingford
  - Department of Molecular Biosciences, The University of Texas at Austin, Austin, TX 78712, USA
- Edward M Marcotte
  - Department of Molecular Biosciences, The University of Texas at Austin, Austin, TX 78712, USA
4
Tian X, Teo WFA, Yang Y, Dong L, Wong A, Chen L, Ahmed H, Choo SW, Jakubovics NS, Tan GYA. Genome characterisation and comparative analysis of Schaalia dentiphila sp. nov. and its subspecies, S. dentiphila subsp. denticola subsp. nov., from the human oral cavity. BMC Microbiol 2024; 24:185. [PMID: 38802738; PMCID: PMC11131293; DOI: 10.1186/s12866-024-03346-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/17/2024] [Accepted: 05/21/2024] [Indexed: 05/29/2024] [Open Access]
Abstract
BACKGROUND: Schaalia species are primarily found among the oral microbiota of humans and other animals. They have been associated with various infections through their involvement in biofilm formation, modulation of host responses, and interaction with other microorganisms. In this study, two strains previously indicated as Actinomyces spp. were found to be novel members of the genus Schaalia based on their whole-genome sequences. RESULTS: Whole-genome sequencing showed that both strains have a genome size of 2.3 Mbp and a GC content of 65.5%. Phylogenetic analysis for taxonomic placement revealed strains NCTC 9931 and C24 as distinct species within the genus Schaalia. Overall genome-relatedness indices, including digital DNA-DNA hybridization (dDDH) and average nucleotide/amino acid identity (ANI/AAI), confirmed both strains as distinct species, with values below the species boundary thresholds (dDDH < 70%, and ANI and AAI < 95%) when compared to the nearest type strain, Schaalia odontolytica NCTC 9935T. Pangenome and orthologous analyses highlighted their differences in gene properties and biological functions compared to existing type strains. Additionally, the identification of genomic islands (GIs) and virulence-associated factors indicated their genetic diversity and potential adaptive capabilities, as well as potential implications for human health. Notably, CRISPR-Cas systems in strain NCTC 9931 underscore its adaptive immune mechanisms compared to strain C24. CONCLUSIONS: Based on these findings, strain NCTC 9931T (= ATCC 17982T = DSM 43331T = CIP 104728T = CCUG 18309T = NCTC 14978T = CGMCC 1.90328T) represents a novel species, for which the name Schaalia dentiphila subsp. dentiphila sp. nov. subsp. nov. is proposed, while strain C24T (= NCTC 14980T = CGMCC 1.90329T) represents a distinct novel subspecies, for which the name Schaalia dentiphila subsp. denticola subsp. nov. is proposed. This study enriches our understanding of the genomic diversity of Schaalia species and paves the way for further investigations into their roles in oral health. SIGNIFICANCE: This research reveals two Schaalia strains, NCTC 9931T and C24T, as novel entities with distinct genomic features. Expanding the taxonomic framework of the genus Schaalia, this study offers a critical resource for probing the metabolic intricacies and resistance patterns of these bacteria. This work stands as a cornerstone for microbial taxonomy, paving the way for significant advances in clinical diagnostics.
Affiliation(s)
- Xuechen Tian
  - Institute of Biological Sciences, Faculty of Science, Universiti Malaya, Kuala Lumpur, 50603, Malaysia
  - College of Science, Mathematics and Technology, Wenzhou-Kean University, 88 Daxue Road, Ouhai, Wenzhou, Zhejiang Province, 325060, China
  - Wenzhou Municipal Key Laboratory for Applied Biomedical and Biopharmaceutical Informatics, Wenzhou-Kean University, Ouhai, Wenzhou, Zhejiang Province, 325060, China
  - Zhejiang Bioinformatics International Science and Technology Cooperation Center, Wenzhou-Kean University, Ouhai, Wenzhou, Zhejiang Province, 325060, China
- Wee Fei Aaron Teo
  - Institute of Biological Sciences, Faculty of Science, Universiti Malaya, Kuala Lumpur, 50603, Malaysia
- Yixin Yang
  - College of Science, Mathematics and Technology, Wenzhou-Kean University, 88 Daxue Road, Ouhai, Wenzhou, Zhejiang Province, 325060, China
  - Wenzhou Municipal Key Laboratory for Applied Biomedical and Biopharmaceutical Informatics, Wenzhou-Kean University, Ouhai, Wenzhou, Zhejiang Province, 325060, China
  - Zhejiang Bioinformatics International Science and Technology Cooperation Center, Wenzhou-Kean University, Ouhai, Wenzhou, Zhejiang Province, 325060, China
  - Dorothy and George Hennings College of Science, Mathematics and Technology, Kean University, 1000 Morris Ave, Union, NJ, 07083, USA
- Linyinxue Dong
  - Wenzhou Municipal Key Laboratory for Applied Biomedical and Biopharmaceutical Informatics, Wenzhou-Kean University, Ouhai, Wenzhou, Zhejiang Province, 325060, China
  - Zhejiang Bioinformatics International Science and Technology Cooperation Center, Wenzhou-Kean University, Ouhai, Wenzhou, Zhejiang Province, 325060, China
- Aloysius Wong
  - College of Science, Mathematics and Technology, Wenzhou-Kean University, 88 Daxue Road, Ouhai, Wenzhou, Zhejiang Province, 325060, China
  - Wenzhou Municipal Key Laboratory for Applied Biomedical and Biopharmaceutical Informatics, Wenzhou-Kean University, Ouhai, Wenzhou, Zhejiang Province, 325060, China
  - Zhejiang Bioinformatics International Science and Technology Cooperation Center, Wenzhou-Kean University, Ouhai, Wenzhou, Zhejiang Province, 325060, China
  - Dorothy and George Hennings College of Science, Mathematics and Technology, Kean University, 1000 Morris Ave, Union, NJ, 07083, USA
- Li Chen
  - Institute of Biological Sciences, Faculty of Science, Universiti Malaya, Kuala Lumpur, 50603, Malaysia
- Halah Ahmed
  - School of Dental Sciences, Faculty of Medical Sciences, Newcastle University, Framlington Place, Newcastle Upon Tyne, NE2 4BW, UK
- Siew Woh Choo
  - College of Science, Mathematics and Technology, Wenzhou-Kean University, 88 Daxue Road, Ouhai, Wenzhou, Zhejiang Province, 325060, China
  - Wenzhou Municipal Key Laboratory for Applied Biomedical and Biopharmaceutical Informatics, Wenzhou-Kean University, Ouhai, Wenzhou, Zhejiang Province, 325060, China
  - Zhejiang Bioinformatics International Science and Technology Cooperation Center, Wenzhou-Kean University, Ouhai, Wenzhou, Zhejiang Province, 325060, China
  - Dorothy and George Hennings College of Science, Mathematics and Technology, Kean University, 1000 Morris Ave, Union, NJ, 07083, USA
- Nicholas S Jakubovics
  - School of Dental Sciences, Faculty of Medical Sciences, Newcastle University, Framlington Place, Newcastle Upon Tyne, NE2 4BW, UK
- Geok Yuan Annie Tan
  - Institute of Biological Sciences, Faculty of Science, Universiti Malaya, Kuala Lumpur, 50603, Malaysia
5
Sternberg PW, Van Auken K, Wang Q, Wright A, Yook K, Zarowiecki M, Arnaboldi V, Becerra A, Brown S, Cain S, Chan J, Chen WJ, Cho J, Davis P, Diamantakis S, Dyer S, Grigoriadis D, Grove CA, Harris T, Howe K, Kishore R, Lee R, Longden I, Luypaert M, Müller HM, Nuin P, Quinton-Tulloch M, Raciti D, Schedl T, Schindelman G, Stein L. WormBase 2024: status and transitioning to Alliance infrastructure. Genetics 2024; 227:iyae050. [PMID: 38573366; PMCID: PMC11075546; DOI: 10.1093/genetics/iyae050] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/21/2023] [Revised: 03/19/2024] [Accepted: 03/20/2024] [Indexed: 04/05/2024] [Open Access]
Abstract
WormBase has been the major repository and knowledgebase of information about the genome and genetics of Caenorhabditis elegans and other nematodes of experimental interest for over 2 decades. We have 3 goals: to keep current with the fast-paced C. elegans research, to provide better integration with other resources, and to be sustainable. Here, we discuss the current state of WormBase as well as progress and plans for moving core WormBase infrastructure to the Alliance of Genome Resources (the Alliance). As an Alliance member, WormBase will continue to interact with the C. elegans community, develop new features as needed, and curate key information from the literature and large-scale projects.
Affiliation(s)
- Paul W Sternberg
  - Division of Biology and Biological Engineering 140-18, California Institute of Technology, Pasadena, CA 91125, USA
- Kimberly Van Auken
  - Division of Biology and Biological Engineering 140-18, California Institute of Technology, Pasadena, CA 91125, USA
- Qinghua Wang
  - Division of Biology and Biological Engineering 140-18, California Institute of Technology, Pasadena, CA 91125, USA
- Adam Wright
  - Informatics and Bio-computing Platform, Ontario Institute for Cancer Research, Toronto, ON M5G0A3, Canada
- Karen Yook
  - Division of Biology and Biological Engineering 140-18, California Institute of Technology, Pasadena, CA 91125, USA
- Magdalena Zarowiecki
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Cambridge CB10 1SD, UK
- Valerio Arnaboldi
  - Division of Biology and Biological Engineering 140-18, California Institute of Technology, Pasadena, CA 91125, USA
- Andrés Becerra
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Cambridge CB10 1SD, UK
- Stephanie Brown
  - School of Infection and Immunity, University of Glasgow, Glasgow G12 8TA, UK
- Scott Cain
  - Informatics and Bio-computing Platform, Ontario Institute for Cancer Research, Toronto, ON M5G0A3, Canada
- Juancarlos Chan
  - Division of Biology and Biological Engineering 140-18, California Institute of Technology, Pasadena, CA 91125, USA
- Wen J Chen
  - Division of Biology and Biological Engineering 140-18, California Institute of Technology, Pasadena, CA 91125, USA
- Jaehyoung Cho
  - Division of Biology and Biological Engineering 140-18, California Institute of Technology, Pasadena, CA 91125, USA
- Paul Davis
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Cambridge CB10 1SD, UK
- Stavros Diamantakis
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Cambridge CB10 1SD, UK
- Sarah Dyer
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Cambridge CB10 1SD, UK
- Christian A Grove
  - Division of Biology and Biological Engineering 140-18, California Institute of Technology, Pasadena, CA 91125, USA
- Todd Harris
  - Informatics and Bio-computing Platform, Ontario Institute for Cancer Research, Toronto, ON M5G0A3, Canada
- Kevin Howe
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Cambridge CB10 1SD, UK
- Ranjana Kishore
  - Division of Biology and Biological Engineering 140-18, California Institute of Technology, Pasadena, CA 91125, USA
- Raymond Lee
  - Division of Biology and Biological Engineering 140-18, California Institute of Technology, Pasadena, CA 91125, USA
- Ian Longden
  - Informatics and Bio-computing Platform, Ontario Institute for Cancer Research, Toronto, ON M5G0A3, Canada
- Manuel Luypaert
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Cambridge CB10 1SD, UK
- Hans-Michael Müller
  - Division of Biology and Biological Engineering 140-18, California Institute of Technology, Pasadena, CA 91125, USA
- Paulo Nuin
  - Informatics and Bio-computing Platform, Ontario Institute for Cancer Research, Toronto, ON M5G0A3, Canada
- Mark Quinton-Tulloch
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Cambridge CB10 1SD, UK
- Daniela Raciti
  - Division of Biology and Biological Engineering 140-18, California Institute of Technology, Pasadena, CA 91125, USA
- Tim Schedl
  - Department of Genetics, Washington University School of Medicine, St. Louis, MO 63110, USA
- Gary Schindelman
  - Division of Biology and Biological Engineering 140-18, California Institute of Technology, Pasadena, CA 91125, USA
- Lincoln Stein
  - Informatics and Bio-computing Platform, Ontario Institute for Cancer Research, Toronto, ON M5G0A3, Canada
6
Aleksander SA, Anagnostopoulos AV, Antonazzo G, Arnaboldi V, Attrill H, Becerra A, Bello SM, Blodgett O, Bradford YM, Bult CJ, Cain S, Calvi BR, Carbon S, Chan J, Chen WJ, Cherry JM, Cho J, Crosby MA, De Pons JL, D’Eustachio P, Diamantakis S, Dolan ME, dos Santos G, Dyer S, Ebert D, Engel SR, Fashena D, Fisher M, Foley S, Gibson AC, Gollapally VR, Gramates LS, Grove CA, Hale P, Harris T, Hayman GT, Hu Y, James-Zorn C, Karimi K, Karra K, Kishore R, Kwitek AE, Laulederkind SJF, Lee R, Longden I, Luypaert M, Markarian N, Marygold SJ, Matthews B, McAndrews MS, Millburn G, Miyasato S, Motenko H, Moxon S, Muller HM, Mungall CJ, Muruganujan A, Mushayahama T, Nash RS, Nuin P, Paddock H, Pells T, Perrimon N, Pich C, Quinton-Tulloch M, Raciti D, Ramachandran S, Richardson JE, Gelbart SR, Ruzicka L, Schindelman G, Shaw DR, Sherlock G, Shrivatsav A, Singer A, Smith CM, Smith CL, Smith JR, Stein L, Sternberg PW, Tabone CJ, Thomas PD, Thorat K, Thota J, Tomczuk M, Trovisco V, Tutaj MA, Urbano JM, Van Auken K, Van Slyke CE, Vize PD, Wang Q, Weng S, Westerfield M, Wilming LG, Wong ED, Wright A, Yook K, Zhou P, Zorn A, Zytkovicz M. Updates to the Alliance of Genome Resources central infrastructure. Genetics 2024; 227:iyae049. [PMID: 38552170; PMCID: PMC11075569; DOI: 10.1093/genetics/iyae049] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/20/2023] [Revised: 02/28/2024] [Accepted: 02/29/2024] [Indexed: 04/09/2024] [Open Access]
Abstract
The Alliance of Genome Resources (Alliance) is an extensible coalition of knowledgebases focused on the genetics and genomics of intensively studied model organisms. The Alliance is organized as individual knowledge centers with strong connections to their research communities and a centralized software infrastructure, discussed here. Model organisms currently represented in the Alliance are budding yeast, Caenorhabditis elegans, Drosophila, zebrafish, frog, laboratory mouse, laboratory rat, and the Gene Ontology Consortium. The project is in a rapid development phase to harmonize knowledge, store it, analyze it, and present it to the community through a web portal, direct downloads, and application programming interfaces (APIs). Here, we focus on developments over the last 2 years. Specifically, we added and enhanced tools for browsing the genome (JBrowse), downloading sequences, mining complex data (AllianceMine), visualizing pathways, full-text searching of the literature (Textpresso), and sequence similarity searching (SequenceServer). We enhanced existing interactive data tables and added an interactive table of paralogs to complement our representation of orthology. To support individual model organism communities, we implemented species-specific "landing pages" and will add disease-specific portals soon; in addition, we support a common community forum implemented in Discourse software. We describe our progress toward a central persistent database to support curation, the data modeling that underpins harmonization, and progress toward a state-of-the-art literature curation system with integrated artificial intelligence and machine learning (AI/ML).
Affiliation(s)
- Giulia Antonazzo
  - Department of Physiology, Development and Neuroscience, University of Cambridge, Downing Street, Cambridge CB2 3DY, UK
- Valerio Arnaboldi
  - Division of Biology and Biological Engineering 140-18, California Institute of Technology, Pasadena, CA 91125, USA
- Helen Attrill
  - Department of Physiology, Development and Neuroscience, University of Cambridge, Downing Street, Cambridge CB2 3DY, UK
- Andrés Becerra
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Susan M Bello
  - The Jackson Laboratory for Mammalian Genomics, Bar Harbor, ME 04609, USA
- Olin Blodgett
  - The Jackson Laboratory for Mammalian Genomics, Bar Harbor, ME 04609, USA
- Carol J Bult
  - The Jackson Laboratory for Mammalian Genomics, Bar Harbor, ME 04609, USA
- Scott Cain
  - Informatics and Bio-computing Platform, Ontario Institute for Cancer Research, Toronto, ON M5G0A3, Canada
- Brian R Calvi
  - Department of Biology, Indiana University, Bloomington, IN 47408, USA
- Seth Carbon
  - Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA
- Juancarlos Chan
  - Division of Biology and Biological Engineering 140-18, California Institute of Technology, Pasadena, CA 91125, USA
- Wen J Chen
  - Division of Biology and Biological Engineering 140-18, California Institute of Technology, Pasadena, CA 91125, USA
- J Michael Cherry
  - Department of Genetics, Stanford University, Stanford, CA 94305
- Jaehyoung Cho
  - Division of Biology and Biological Engineering 140-18, California Institute of Technology, Pasadena, CA 91125, USA
- Madeline A Crosby
  - The Biological Laboratories, Harvard University, 16 Divinity Avenue, Cambridge, MA 02138, USA
- Jeffrey L De Pons
  - Medical College of Wisconsin—Rat Genome Database, Departments of Physiology and Biomedical Engineering, Medical College of Wisconsin, Milwaukee, WI 53226, USA
- Stavros Diamantakis
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Mary E Dolan
  - The Jackson Laboratory for Mammalian Genomics, Bar Harbor, ME 04609, USA
- Gilberto dos Santos
  - The Biological Laboratories, Harvard University, 16 Divinity Avenue, Cambridge, MA 02138, USA
- Sarah Dyer
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Dustin Ebert
  - Department of Population and Public Health Sciences, University of Southern California, Los Angeles, CA 90033, USA
- Stacia R Engel
  - Department of Genetics, Stanford University, Stanford, CA 94305
- David Fashena
  - Institute of Neuroscience, University of Oregon, Eugene, OR 97403
- Malcolm Fisher
  - Division of Developmental Biology, Cincinnati Children's Hospital Medical Center, 3333 Burnet Ave, Cincinnati, OH 45229, USA
- Saoirse Foley
  - Department of Biological Sciences, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15203
- Adam C Gibson
  - Medical College of Wisconsin—Rat Genome Database, Departments of Physiology and Biomedical Engineering, Medical College of Wisconsin, Milwaukee, WI 53226, USA
- Varun R Gollapally
  - Medical College of Wisconsin—Rat Genome Database, Departments of Physiology and Biomedical Engineering, Medical College of Wisconsin, Milwaukee, WI 53226, USA
- L Sian Gramates
  - The Biological Laboratories, Harvard University, 16 Divinity Avenue, Cambridge, MA 02138, USA
- Christian A Grove
  - Division of Biology and Biological Engineering 140-18, California Institute of Technology, Pasadena, CA 91125, USA
- Paul Hale
  - The Jackson Laboratory for Mammalian Genomics, Bar Harbor, ME 04609, USA
- Todd Harris
  - Informatics and Bio-computing Platform, Ontario Institute for Cancer Research, Toronto, ON M5G0A3, Canada
- G Thomas Hayman
  - Medical College of Wisconsin—Rat Genome Database, Departments of Physiology and Biomedical Engineering, Medical College of Wisconsin, Milwaukee, WI 53226, USA
- Yanhui Hu
  - Department of Genetics, Howard Hughes Medical Institute, Harvard Medical School, 77 Avenue Louis Pasteur, Boston, MA 02115, USA
- Christina James-Zorn
  - Division of Developmental Biology, Cincinnati Children's Hospital Medical Center, 3333 Burnet Ave, Cincinnati, OH 45229, USA
- Kamran Karimi
  - Department of Biological Sciences, University of Calgary, 507 Campus Dr NW, Calgary, AB T2N 4V8, Canada
- Kalpana Karra
  - Department of Genetics, Stanford University, Stanford, CA 94305
- Ranjana Kishore
  - Division of Biology and Biological Engineering 140-18, California Institute of Technology, Pasadena, CA 91125, USA
- Anne E Kwitek
  - Medical College of Wisconsin—Rat Genome Database, Departments of Physiology and Biomedical Engineering, Medical College of Wisconsin, Milwaukee, WI 53226, USA
- Stanley J F Laulederkind
  - Medical College of Wisconsin—Rat Genome Database, Departments of Physiology and Biomedical Engineering, Medical College of Wisconsin, Milwaukee, WI 53226, USA
- Raymond Lee
  - Division of Biology and Biological Engineering 140-18, California Institute of Technology, Pasadena, CA 91125, USA
- Ian Longden
  - The Biological Laboratories, Harvard University, 16 Divinity Avenue, Cambridge, MA 02138, USA
- Manuel Luypaert
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Nicholas Markarian
  - Division of Biology and Biological Engineering 140-18, California Institute of Technology, Pasadena, CA 91125, USA
- Steven J Marygold
  - Department of Physiology, Development and Neuroscience, University of Cambridge, Downing Street, Cambridge CB2 3DY, UK
- Beverley Matthews
  - The Biological Laboratories, Harvard University, 16 Divinity Avenue, Cambridge, MA 02138, USA
- Monica S McAndrews
  - The Jackson Laboratory for Mammalian Genomics, Bar Harbor, ME 04609, USA
- Gillian Millburn
  - Department of Physiology, Development and Neuroscience, University of Cambridge, Downing Street, Cambridge CB2 3DY, UK
- Stuart Miyasato
  - Department of Genetics, Stanford University, Stanford, CA 94305
- Howie Motenko
  - The Jackson Laboratory for Mammalian Genomics, Bar Harbor, ME 04609, USA
- Sierra Moxon
  - Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA
- Hans-Michael Muller
  - Division of Biology and Biological Engineering 140-18, California Institute of Technology, Pasadena, CA 91125, USA
- Christopher J Mungall
  - Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA
- Anushya Muruganujan
  - Department of Population and Public Health Sciences, University of Southern California, Los Angeles, CA 90033, USA
- Tremayne Mushayahama
  - Department of Population and Public Health Sciences, University of Southern California, Los Angeles, CA 90033, USA
- Robert S Nash
  - Department of Genetics, Stanford University, Stanford, CA 94305
- Paulo Nuin
  - Informatics and Bio-computing Platform, Ontario Institute for Cancer Research, Toronto, ON M5G0A3, Canada
- Holly Paddock
  - Institute of Neuroscience, University of Oregon, Eugene, OR 97403
- Troy Pells
  - Department of Biological Sciences, University of Calgary, 507 Campus Dr NW, Calgary, AB T2N 4V8, Canada
- Norbert Perrimon
  - Department of Genetics, Howard Hughes Medical Institute, Harvard Medical School, 77 Avenue Louis Pasteur, Boston, MA 02115, USA
- Christian Pich
  - Institute of Neuroscience, University of Oregon, Eugene, OR 97403
- Mark Quinton-Tulloch
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Daniela Raciti
  - Division of Biology and Biological Engineering 140-18, California Institute of Technology, Pasadena, CA 91125, USA
- Susan Russo Gelbart
  - The Biological Laboratories, Harvard University, 16 Divinity Avenue, Cambridge, MA 02138, USA
- Leyla Ruzicka
  - Institute of Neuroscience, University of Oregon, Eugene, OR 97403
- Gary Schindelman
  - Division of Biology and Biological Engineering 140-18, California Institute of Technology, Pasadena, CA 91125, USA
- David R Shaw
  - The Jackson Laboratory for Mammalian Genomics, Bar Harbor, ME 04609, USA
- Gavin Sherlock
  - Department of Genetics, Stanford University, Stanford, CA 94305
- Ajay Shrivatsav
  - Department of Genetics, Stanford University, Stanford, CA 94305
- Amy Singer
  - Institute of Neuroscience, University of Oregon, Eugene, OR 97403
- Constance M Smith
  - The Jackson Laboratory for Mammalian Genomics, Bar Harbor, ME 04609, USA
- Cynthia L Smith
  - The Jackson Laboratory for Mammalian Genomics, Bar Harbor, ME 04609, USA
- Jennifer R Smith
  - Medical College of Wisconsin—Rat Genome Database, Departments of Physiology and Biomedical Engineering, Medical College of Wisconsin, Milwaukee, WI 53226, USA
- Lincoln Stein
  - Informatics and Bio-computing Platform, Ontario Institute for Cancer Research, Toronto, ON M5G0A3, Canada
- Paul W Sternberg
  - Division of Biology and Biological Engineering 140-18, California Institute of Technology, Pasadena, CA 91125, USA
- Christopher J Tabone
  - The Biological Laboratories, Harvard University, 16 Divinity Avenue, Cambridge, MA 02138, USA
- Paul D Thomas
  - Department of Population and Public Health Sciences, University of Southern California, Los Angeles, CA 90033, USA
| | - Ketaki Thorat
- Medical College of Wisconsin—Rat Genome Database, Departments of Physiology and Biomedical Engineering , Medical College of Wisconsin, Milwaukee, WI 53226 , USA
| | - Jyothi Thota
- Medical College of Wisconsin—Rat Genome Database, Departments of Physiology and Biomedical Engineering , Medical College of Wisconsin, Milwaukee, WI 53226 , USA
| | - Monika Tomczuk
- The Jackson Laboratory for Mammalian Genomics, Bar Harbor , ME 04609 , USA
| | - Vitor Trovisco
- Department of Physiology, Development and Neuroscience , University of Cambridge, Downing Street, Cambridge CB2 3DY , UK
| | - Marek A Tutaj
- Medical College of Wisconsin—Rat Genome Database, Departments of Physiology and Biomedical Engineering , Medical College of Wisconsin, Milwaukee, WI 53226 , USA
| | - Jose-Maria Urbano
- Department of Physiology, Development and Neuroscience , University of Cambridge, Downing Street, Cambridge CB2 3DY , UK
| | - Kimberly Van Auken
- Division of Biology and Biological Engineering 140-18, California Institute of Technology , Pasadena, CA 91125 , USA
| | - Ceri E Van Slyke
- Institute of Neuroscience, University of Oregon , Eugene, OR 97403
| | - Peter D Vize
- Department of Biological Sciences, University of Calgary , 507 Campus Dr NW, Calgary, AB T2N 4V8 , Canada
| | - Qinghua Wang
- Division of Biology and Biological Engineering 140-18, California Institute of Technology , Pasadena, CA 91125 , USA
| | - Shuai Weng
- Department of Genetics, Stanford University , Stanford, CA 94305
| | | | - Laurens G Wilming
- The Jackson Laboratory for Mammalian Genomics, Bar Harbor , ME 04609 , USA
| | - Edith D Wong
- Department of Genetics, Stanford University , Stanford, CA 94305
| | - Adam Wright
- Informatics and Bio-computing Platform, Ontario Institute for Cancer Research , Toronto, ON M5G0A3 , Canada
| | - Karen Yook
- Division of Biology and Biological Engineering 140-18, California Institute of Technology , Pasadena, CA 91125 , USA
| | - Pinglei Zhou
- The Biological Laboratories, Harvard University , 16 Divinity Avenue, Cambridge, MA 02138 , USA
| | - Aaron Zorn
- Division of Developmental Biology, Cincinnati Children's Hospital Medical Center , 3333 Burnet Ave, Cincinnati, OH 45229 , USA
| | - Mark Zytkovicz
- The Biological Laboratories, Harvard University , 16 Divinity Avenue, Cambridge, MA 02138 , USA
| |
Collapse
|
7
|
Brooks TG, Lahens NF, Mrčela A, Grant GR. Challenges and best practices in omics benchmarking. Nat Rev Genet 2024; 25:326-339. [PMID: 38216661 DOI: 10.1038/s41576-023-00679-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 11/14/2023] [Indexed: 01/14/2024]
Abstract
Technological advances enabling massively parallel measurement of biological features - such as microarrays, high-throughput sequencing and mass spectrometry - have ushered in the omics era, now in its third decade. The resulting complex landscape of analytical methods has naturally fostered the growth of an omics benchmarking industry. Benchmarking refers to the process of objectively comparing and evaluating the performance of different computational or analytical techniques when processing and analysing large-scale biological data sets, such as transcriptomics, proteomics and metabolomics. With thousands of omics benchmarking studies published over the past 25 years, the field has matured to the point where the foundations of benchmarking have been established and well described. However, generating meaningful benchmarking data and properly evaluating performance in this complex domain remains challenging. In this Review, we highlight some common oversights and pitfalls in omics benchmarking. We also establish a methodology to bring the issues that can be addressed into focus and to be transparent about those that cannot: this takes the form of a spreadsheet template of guidelines for comprehensive reporting, intended to accompany publications. In addition, a survey of recent developments in benchmarking is provided as well as specific guidance for commonly encountered difficulties.
Affiliation(s)
- Thomas G Brooks
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
- Nicholas F Lahens
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
- Antonijo Mrčela
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
- Gregory R Grant
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
  - Department of Genetics, University of Pennsylvania, Philadelphia, PA, USA
8
Ludwig J, Mrázek J. OrthoRefine: automated enhancement of prior ortholog identification via synteny. BMC Bioinformatics 2024; 25:163. [PMID: 38664637 PMCID: PMC11044567 DOI: 10.1186/s12859-024-05786-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/27/2023] [Accepted: 04/15/2024] [Indexed: 04/29/2024] [Open Access]
Abstract
BACKGROUND: Identifying orthologs continues to be an early and imperative step in genome analysis but remains a challenging problem. While synteny (conservation of gene order) has previously been used independently and in combination with other methods to identify orthologs, applying synteny in ortholog identification has yet to be automated in a user-friendly manner. This desire for automation and ease-of-use led us to develop OrthoRefine, a standalone program that uses synteny to refine ortholog identification.
RESULTS: We developed OrthoRefine to improve the detection of orthologous genes by implementing a look-around window approach to detect synteny. We tested OrthoRefine in tandem with OrthoFinder, one of the most widely used programs for ortholog identification in recent years. We evaluated improvements provided by OrthoRefine in several bacterial datasets and one eukaryotic dataset. OrthoRefine efficiently eliminates paralogs from orthologous groups detected by OrthoFinder. Using synteny increased specificity and functional ortholog identification; additionally, analysis of BLAST e-value, phylogenetics, and operon occurrence further supported using synteny for ortholog identification. A comparison of several window sizes suggested that smaller window sizes (eight genes) were generally the most suitable for identifying orthologs via synteny. However, larger windows (30 genes) performed better in datasets containing less closely related genomes. A typical run of OrthoRefine with ~10 bacterial genomes can be completed in a few minutes on a regular desktop PC.
CONCLUSION: OrthoRefine is a simple-to-use, standalone tool that automates the application of synteny to improve ortholog detection. OrthoRefine is particularly efficient in eliminating paralogs from orthologous groups delineated by standard methods.
Affiliation(s)
- J Ludwig
  - Institute of Bioinformatics, The University of Georgia, Athens, GA, 30602, USA
- J Mrázek
  - Department of Microbiology and Institute of Bioinformatics, The University of Georgia, Athens, GA, 30602, USA
9
Roder T, Pimentel G, Fuchsmann P, Stern MT, von Ah U, Vergères G, Peischl S, Brynildsrud O, Bruggmann R, Bär C. Scoary2: rapid association of phenotypic multi-omics data with microbial pan-genomes. Genome Biol 2024; 25:93. [PMID: 38605417 PMCID: PMC11007987 DOI: 10.1186/s13059-024-03233-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/27/2023] [Accepted: 03/29/2024] [Indexed: 04/13/2024] [Open Access]
Abstract
Unraveling bacterial gene function drives progress in various areas, such as food production, pharmacology, and ecology. While omics technologies capture high-dimensional phenotypic data, linking them to genomic data is challenging, leaving 40-60% of bacterial genes undescribed. To address this bottleneck, we introduce Scoary2, an ultra-fast microbial genome-wide association studies (mGWAS) software. With its data exploration app and improved performance, Scoary2 is the first tool to enable the study of large phenotypic datasets using mGWAS. As proof of concept, we explore the metabolome of yogurts, each produced with a different Propionibacterium freudenreichii strain, and discover two genes affecting carnitine metabolism.
Affiliation(s)
- Thomas Roder
  - Interfaculty Bioinformatics Unit and Swiss Institute of Bioinformatics, University of Bern, Bern, CH-3012, Switzerland
  - Graduate School for Cellular and Biomedical Sciences, University of Bern, CH-3012, Bern, Switzerland
- Grégory Pimentel
  - Methods development and analytics, Agroscope, Schwarzenburgstrasse 161, Bern, CH-3003, Switzerland
- Pascal Fuchsmann
  - Food microbial systems, Agroscope, Schwarzenburgstrasse 161, Bern, CH-3003, Switzerland
- Mireille Tena Stern
  - Food microbial systems, Agroscope, Schwarzenburgstrasse 161, Bern, CH-3003, Switzerland
- Ueli von Ah
  - Food microbial systems, Agroscope, Schwarzenburgstrasse 161, Bern, CH-3003, Switzerland
- Guy Vergères
  - Food microbial systems, Agroscope, Schwarzenburgstrasse 161, Bern, CH-3003, Switzerland
- Stephan Peischl
  - Interfaculty Bioinformatics Unit and Swiss Institute of Bioinformatics, University of Bern, Bern, CH-3012, Switzerland
- Ola Brynildsrud
  - Norwegian Institute of Public Health, Oslo and Norwegian University of Life Science, Ås, Norway
- Rémy Bruggmann
  - Interfaculty Bioinformatics Unit and Swiss Institute of Bioinformatics, University of Bern, Bern, CH-3012, Switzerland
- Cornelia Bär
  - Methods development and analytics, Agroscope, Schwarzenburgstrasse 161, Bern, CH-3003, Switzerland
10
Thiébaut A, Altenhoff AM, Campli G, Glover N, Dessimoz C, Waterhouse RM. DrosOMA: the Drosophila Orthologous Matrix browser. F1000Res 2024; 12:936. [PMID: 38434623 PMCID: PMC10905159 DOI: 10.12688/f1000research.135250.2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 01/12/2024] [Indexed: 03/05/2024] [Open Access]
Abstract
Background: Comparative genomic analyses to delineate gene evolutionary histories inform the understanding of organismal biology by characterising gene and gene family origins, trajectories, and dynamics, as well as enabling the tracing of speciation, duplication, and loss events, and facilitating the transfer of gene functional information across species. Genomic data are available for an increasing number of species from the genus Drosophila; however, a dedicated resource exploiting these data to provide the research community with browsable results from genus-wide orthology delineation has been lacking.
Methods: Using the OMA Orthologous Matrix orthology inference approach and browser deployment framework, we catalogued orthologues across a selected set of Drosophila species with high-quality annotated genomes. We developed and deployed a dedicated instance of the OMA browser to facilitate intuitive exploration, visualisation, and downloading of the genus-wide orthology delineation results.
Results: DrosOMA - the Drosophila Orthologous Matrix browser, accessible from https://drosoma.dcsr.unil.ch/ - presents the results of orthology delineation for 36 drosophilids from across the genus and four outgroup dipterans. It enables querying and browsing of the orthology data through a feature-rich web interface, with gene-view, orthologous group-view, and genome-view pages, including comprehensive gene name and identifier cross-references together with available functional annotations and protein domain architectures, as well as tools to visualise local and global synteny conservation.
Conclusions: The DrosOMA browser demonstrates the deployability of the OMA browser framework for building user-friendly orthology databases with dense sampling of a selected taxonomic group. It provides the Drosophila research community with a tailored resource of browsable results from genus-wide orthology delineation.
Affiliation(s)
- Antonin Thiébaut
  - Department of Ecology and Evolution, SIB Swiss Institute of Bioinformatics, University of Lausanne, Lausanne, Switzerland
- Adrian M. Altenhoff
  - Department of Computer Science, SIB Swiss Institute of Bioinformatics, ETH Zurich, Zurich, Switzerland
- Giulia Campli
  - Department of Ecology and Evolution, SIB Swiss Institute of Bioinformatics, University of Lausanne, Lausanne, Switzerland
- Natasha Glover
  - Department of Computational Biology, SIB Swiss Institute of Bioinformatics, University of Lausanne, Lausanne, Switzerland
- Christophe Dessimoz
  - Department of Computational Biology, SIB Swiss Institute of Bioinformatics, University of Lausanne, Lausanne, Switzerland
- Robert M. Waterhouse
  - Department of Ecology and Evolution, SIB Swiss Institute of Bioinformatics, University of Lausanne, Lausanne, Switzerland
11
Carhuaricra-Huaman D, Setubal JC. Protein-Coding Gene Families in Prokaryote Genome Comparisons. Methods Mol Biol 2024; 2802:33-55. [PMID: 38819555 DOI: 10.1007/978-1-0716-3838-5_2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/01/2024]
Abstract
The identification of orthologous genes is relevant for comparative genomics, phylogenetic analysis, and functional annotation. There are many computational tools for the prediction of orthologous groups as well as web-based resources that offer orthology datasets for download and online analysis. This chapter presents a simple and practical guide to the process of orthologous group prediction, using a dataset of 10 prokaryotic proteomes as example. The orthology methods covered are OrthoMCL, COGtriangles, OrthoFinder2, and OMA. The authors compare the number of orthologous groups predicted by these various methods, and present a brief workflow for the functional annotation and reconstruction of phylogenies from inferred single-copy orthologous genes. The chapter also demonstrates how to explore two orthology databases: eggNOG6 and OrthoDB.
Affiliation(s)
- Dennis Carhuaricra-Huaman
  - Programa de Pós-Graduação Interunidades em Bioinformática, Instituto de Matemática e Estatística, Universidade de São Paulo, São Paulo, SP, Brazil
  - Research Group in Biotechnology Applied to Animal Health, Production and Conservation (SANIGEN), Laboratory of Biology and Molecular Genetics, Faculty of Veterinary Medicine, Universidad Nacional Mayor de San Marcos, Lima, Peru
- João Carlos Setubal
  - Departamento de Bioquímica, Instituto de Química, Universidade de São Paulo, São Paulo, SP, Brazil
12
Singleton M, Eisen M. Leveraging genomic redundancy to improve inference and alignment of orthologous proteins. G3 (Bethesda, Md.) 2023; 13:jkad222. [PMID: 37770067 PMCID: PMC10700111 DOI: 10.1093/g3journal/jkad222] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/28/2023] [Revised: 09/11/2023] [Accepted: 09/19/2023] [Indexed: 10/03/2023]
Abstract
Identifying protein sequences with common ancestry is a core task in bioinformatics and evolutionary biology. However, methods for inferring and aligning such sequences in annotated genomes have not kept pace with the increasing scale and complexity of the available data. Thus, in this work, we implemented several improvements to the traditional methodology that more fully leverage the redundancy of closely related genomes and the organization of their annotations. Two highlights include the application of the more flexible k-clique percolation algorithm for identifying clusters of orthologous proteins and the development of a novel technique for removing poorly supported regions of alignments with a phylogenetic hidden Markov model (phylo-HMM). In making the latter, we wrote a fully documented Python package Homomorph that implements standard HMM algorithms and created a set of tutorials to promote its use by a wide audience. We applied the resulting pipeline to a set of 33 annotated Drosophila genomes, generating 22,813 orthologous groups and 8,566 high-quality alignments.
Affiliation(s)
- Marc Singleton
  - Howard Hughes Medical Institute, University of California Berkeley, Berkeley, CA 94720, USA
- Michael Eisen
  - Howard Hughes Medical Institute, University of California Berkeley, Berkeley, CA 94720, USA
  - Department of Molecular and Cell Biology, University of California Berkeley, Berkeley, CA 94720, USA
13
Nestor BJ, Bayer PE, Fernandez CGT, Edwards D, Finnegan PM. Approaches to increase the validity of gene family identification using manual homology search tools. Genetica 2023; 151:325-338. [PMID: 37817002 PMCID: PMC10692271 DOI: 10.1007/s10709-023-00196-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2023] [Accepted: 10/01/2023] [Indexed: 10/12/2023]
Abstract
Identifying homologs is an important process in the analysis of genetic patterns underlying traits and evolutionary relationships among species. Analysis of gene families is often used to form and support hypotheses on genetic patterns such as gene presence, absence, or functional divergence which underlie traits examined in functional studies. These analyses often require precise identification of all members in a targeted gene family. Manual pipelines where homology search and orthology assignment tools are used separately are the most common approach for identifying small gene families where accurate identification of all members is important. The ability to curate sequences between steps in manual pipelines allows for simple and precise identification of all possible gene family members. However, the validity of such manual pipeline analyses is often decreased by inappropriate approaches to homology searches including too relaxed or stringent statistical thresholds, inappropriate query sequences, homology classification based on sequence similarity alone, and low-quality proteome or genome sequences. In this article, we propose several approaches to mitigate these issues and allow for precise identification of gene family members and support for hypotheses linking genetic patterns to functional traits.
Affiliation(s)
- Benjamin J Nestor
  - School of Biological Sciences, University of Western Australia, Perth, WA, 6009, Australia
  - Centre for Applied Bioinformatics, University of Western Australia, Perth, WA, 6009, Australia
- Philipp E Bayer
  - School of Biological Sciences, University of Western Australia, Perth, WA, 6009, Australia
  - Centre for Applied Bioinformatics, University of Western Australia, Perth, WA, 6009, Australia
- Cassandria G Tay Fernandez
  - School of Biological Sciences, University of Western Australia, Perth, WA, 6009, Australia
  - Centre for Applied Bioinformatics, University of Western Australia, Perth, WA, 6009, Australia
- David Edwards
  - School of Biological Sciences, University of Western Australia, Perth, WA, 6009, Australia
  - Centre for Applied Bioinformatics, University of Western Australia, Perth, WA, 6009, Australia
- Patrick M Finnegan
  - School of Biological Sciences, University of Western Australia, Perth, WA, 6009, Australia
  - Centre for Applied Bioinformatics, University of Western Australia, Perth, WA, 6009, Australia
14
Jin Z, Sato Y, Kawashima M, Kanehisa M. KEGG tools for classification and analysis of viral proteins. Protein Sci 2023; 32:e4820. [PMID: 37881892 PMCID: PMC10661063 DOI: 10.1002/pro.4820] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/31/2023] [Revised: 10/19/2023] [Accepted: 10/21/2023] [Indexed: 10/27/2023]
Abstract
The KEGG database and analysis tools (https://www.kegg.jp) have been developed mostly for understanding genes and genomes of cellular organisms. The KO (KEGG Orthology) dataset, which is a collection of functional orthologs, plays the role of linking genes in the genome to pathways and other molecular networks, enabling KEGG mapping to uncover hidden features in the genome. Although viruses were part of KEGG for some time, they were not fully integrated in the KEGG analysis tools, because the KO assignment rate is very low for virus genes. To supplement KOs a new dataset named virus ortholog clusters (VOCs) is computationally generated, covering 90% of viral proteins in KEGG. VOCs can be used, in place of KOs, for taxonomy mapping to uncover relationships of sequence similarity groups and taxonomic groups and for identifying conserved gene orders in virus genomes. Furthermore, selected VOCs are used to define tentative KOs for characterizing protein functions. Here an overview of KEGG tools is presented focusing on these extensions for viral protein analysis.
Affiliation(s)
- Zhao Jin
  - Institute for Chemical Research, Kyoto University, Uji, Kyoto, Japan
  - Pathway Solutions Inc., Tokyo, Japan
- Minoru Kanehisa
  - Institute for Chemical Research, Kyoto University, Uji, Kyoto, Japan
15
Bult CJ, Sternberg PW. The alliance of genome resources: transforming comparative genomics. Mamm Genome 2023; 34:531-544. [PMID: 37666946 PMCID: PMC10628019 DOI: 10.1007/s00335-023-10015-2] [Citation(s) in RCA: 17] [Impact Index Per Article: 17.0] [Received: 05/16/2023] [Accepted: 08/11/2023] [Indexed: 09/06/2023]
Abstract
Comparing genomic and biological characteristics across multiple species is essential to using model systems to investigate the molecular and cellular mechanisms underlying human biology and disease and to translate mechanistic insights from studies in model organisms for clinical applications. Building a scalable knowledge commons platform that supports cross-species comparison of rich, expertly curated knowledge regarding gene function, phenotype, and disease associations available for model organisms and humans is the primary mission of the Alliance of Genome Resources (the Alliance). The Alliance is a consortium of seven model organism knowledgebases (mouse, rat, yeast, nematode, zebrafish, frog, fruit fly) and the Gene Ontology resource. The Alliance uses a common set of gene ortholog assertions as the basis for comparing biological annotations across the organisms represented in the Alliance. The major types of knowledge associated with genes that are represented in the Alliance database currently include gene function, phenotypic alleles and variants, human disease associations, pathways, gene expression, and both protein-protein and genetic interactions. The Alliance has enhanced the ability of researchers to easily compare biological annotations for common data types across model organisms and human through the implementation of shared programmatic access mechanisms, data-specific web pages with a unified "look and feel", and interactive user interfaces specifically designed to support comparative biology. The modular infrastructure developed by the Alliance allows the resource to serve as an extensible "knowledge commons" capable of expanding to accommodate additional model organisms.
16
Aleksander SA, Anagnostopoulos AV, Antonazzo G, Arnaboldi V, Attrill H, Becerra A, Bello SM, Blodgett O, Bradford YM, Bult CJ, Cain S, Calvi BR, Carbon S, Chan J, Chen WJ, Michael Cherry J, Cho J, Crosby MA, De Pons JL, D’Eustachio P, Diamantakis S, Dolan ME, Santos GD, Dyer S, Ebert D, Engel SR, Fashena D, Fisher M, Foley S, Gibson AC, Gollapally VR, Sian Gramates L, Grove CA, Hale P, Harris T, Thomas Hayman G, Hu Y, James-Zorn C, Karimi K, Karra K, Kishore R, Kwitek AE, Laulederkind SJF, Lee R, Longden I, Luypaert M, Markarian N, Marygold SJ, Matthews B, McAndrews MS, Millburn G, Miyasato S, Motenko H, Moxon S, Muller HM, Mungall CJ, Muruganujan A, Mushayahama T, Nash RS, Nuin P, Paddock H, Pells T, Perrimon N, Pich C, Quinton-Tulloch M, Raciti D, Ramachandran S, Richardson JE, Gelbart SR, Ruzicka L, Schindelman G, Shaw DR, Sherlock G, Shrivatsav A, Singer A, Smith CM, Smith CL, Smith JR, Stein L, Sternberg PW, Tabone CJ, Thomas PD, Thorat K, Thota J, Tomczuk M, Trovisco V, Tutaj MA, Urbano JM, Auken KV, Van Slyke CE, Vize PD, Wang Q, Weng S, Westerfield M, Wilming LG, Wong ED, Wright A, Yook K, Zhou P, Zorn A, Zytkovicz M. Updates to the Alliance of Genome Resources Central Infrastructure Alliance of Genome Resources Consortium. bioRxiv 2023:2023.11.20.567935. [PMID: 38045425 PMCID: PMC10690154 DOI: 10.1101/2023.11.20.567935] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/05/2023]
Abstract
The Alliance of Genome Resources (Alliance) is an extensible coalition of knowledgebases focused on the genetics and genomics of intensively-studied model organisms. The Alliance is organized as individual knowledge centers with strong connections to their research communities and a centralized software infrastructure, discussed here. Model organisms currently represented in the Alliance are budding yeast, C. elegans, Drosophila, zebrafish, frog, laboratory mouse, laboratory rat, and the Gene Ontology Consortium. The project is in a rapid development phase to harmonize knowledge, store it, analyze it, and present it to the community through a web portal, direct downloads, and APIs. Here we focus on developments over the last two years. Specifically, we added and enhanced tools for browsing the genome (JBrowse), downloading sequences, mining complex data (AllianceMine), visualizing pathways, full-text searching of the literature (Textpresso), and sequence similarity searching (SequenceServer). We enhanced existing interactive data tables and added an interactive table of paralogs to complement our representation of orthology. To support individual model organism communities, we implemented species-specific "landing pages" and will add disease-specific portals soon; in addition, we support a common community forum implemented in Discourse. We describe our progress towards a central persistent database to support curation, the data modeling that underpins harmonization, and progress towards a state-of-the-art literature curation system with integrated Artificial Intelligence and Machine Learning (AI/ML).
17
Contreras-Moreira B, Saraf S, Naamati G, Casas AM, Amberkar SS, Flicek P, Jones AR, Dyer S. GET_PANGENES: calling pangenes from plant genome alignments confirms presence-absence variation. Genome Biol 2023; 24:223. [PMID: 37798615 PMCID: PMC10552430 DOI: 10.1186/s13059-023-03071-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2023] [Accepted: 09/21/2023] [Indexed: 10/07/2023] [Open Access]
Abstract
Crop pangenomes made from individual cultivar assemblies promise easy access to conserved genes, but genome content variability and inconsistent identifiers hamper their exploration. To address this, we define pangenes, which summarize a species' coding potential and link back to original annotations. The protocol get_pangenes performs whole genome alignments (WGA) to call syntenic gene models based on coordinate overlaps. A benchmark with small and large plant genomes shows that pangenes recapitulate phylogeny-based orthologies and produce complete soft-core gene sets. Moreover, WGAs support lift-over and help confirm gene presence-absence variation. Source code and documentation: https://github.com/Ensembl/plant-scripts.
Affiliation(s)
- Bruno Contreras-Moreira
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
  - Estación Experimental Aula Dei-CSIC, 50059, Zaragoza, Spain
- Shradha Saraf
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
- Guy Naamati
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
- Ana M Casas
  - Estación Experimental Aula Dei-CSIC, 50059, Zaragoza, Spain
- Sandeep S Amberkar
  - Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool, UK
- Paul Flicek
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
- Andrew R Jones
  - Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool, UK
- Sarah Dyer
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
18
Pajkos M, Erdős G, Dosztányi Z. The Origin of Discrepancies between Predictions and Annotations in Intrinsically Disordered Proteins. Biomolecules 2023; 13:1442. [PMID: 37892124 PMCID: PMC10604070 DOI: 10.3390/biom13101442] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/08/2023] [Revised: 09/05/2023] [Accepted: 09/20/2023] [Indexed: 10/29/2023] [Open Access]
Abstract
Disorder prediction methods that can discriminate between ordered and disordered regions have contributed fundamentally to our understanding of the properties and prevalence of intrinsically disordered proteins (IDPs) in proteomes as well as their functional roles. However, a recent large-scale assessment of the performance of these methods indicated that there is still room for further improvements, necessitating novel approaches to understand the strengths and weaknesses of individual methods. In this study, we compared two methods: IUPred, and disorder prediction based on the pLDDT scores derived from AlphaFold2 (AF2) models. We evaluated these methods using a dataset from the DisProt database, consisting of experimentally characterized disordered regions and subsets associated with diverse experimental methods and functions. IUPred and AF2 provided consistent predictions in 79% of cases for long disordered regions; however, for 15% of these cases, they both suggested order in disagreement with annotations. These discrepancies arose primarily due to weak experimental support, the presence of intermediate states, or context-dependent behavior, such as binding-induced transitions. Furthermore, AF2 tended to predict helical regions with high pLDDT scores within disordered segments, while IUPred had limitations in identifying linker regions. These results provide valuable insights into the inherent limitations and potential biases of disorder prediction methods.
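The comparison described in this abstract can be sketched as a region-level check of two per-residue predictors against an annotated region. This is a minimal illustration, not the authors' pipeline: the pLDDT cutoff of 50, the IUPred cutoff of 0.5, the majority rule for calling a region, and all function names are assumptions introduced here.

```python
from typing import List


def disorder_from_plddt(plddt: List[float], cutoff: float = 50.0) -> List[bool]:
    """Treat residues with low pLDDT as disordered (cutoff is an assumption)."""
    return [score < cutoff for score in plddt]


def disorder_from_iupred(scores: List[float], cutoff: float = 0.5) -> List[bool]:
    """Treat residues with a high IUPred score as disordered (assumed cutoff)."""
    return [score >= cutoff for score in scores]


def region_call(calls: List[bool], start: int, end: int,
                min_fraction: float = 0.5) -> bool:
    """Call a region (0-based, end exclusive, start < end) disordered when
    at least min_fraction of its residues are predicted disordered."""
    region = calls[start:end]
    return sum(region) / len(region) >= min_fraction


def compare_region(plddt: List[float], iupred: List[float],
                   start: int, end: int) -> str:
    """Classify how the two predictors treat one annotated region."""
    af2 = region_call(disorder_from_plddt(plddt), start, end)
    iup = region_call(disorder_from_iupred(iupred), start, end)
    if af2 and iup:
        return "both_disorder"
    if not af2 and not iup:
        return "both_order"
    return "disagree"
```

Under these assumptions, a DisProt-annotated region returning `both_order` would correspond to the case the abstract highlights, where both methods suggest order in disagreement with the annotation.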
Affiliation(s)
- Zsuzsanna Dosztányi
  - Department of Biochemistry, ELTE Eötvös Loránd University, Pázmány Péter Stny 1/c, H-1117 Budapest, Hungary; (M.P.); (G.E.)
|
19
|
Chodkowski M, Zielezinski A, Anbalagan S. A ligand-receptor interactome atlas of the zebrafish. iScience 2023; 26:107309. [PMID: 37539027 PMCID: PMC10393773 DOI: 10.1016/j.isci.2023.107309] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/24/2023] [Revised: 05/25/2023] [Accepted: 07/04/2023] [Indexed: 08/05/2023]
Abstract
Studies in zebrafish can unravel the functions of cellular communication and thus identify novel bench-to-bedside drugs targeting cellular communication signaling molecules. Due to the incomplete annotation of the zebrafish proteome, knowledge of zebrafish receptors and ligands, and of tools to explore their interactome, is limited. To address this gap, we de novo predicted the cellular localization of the zebrafish reference proteome using a deep learning algorithm. We combined the predicted and existing annotations on cellular localization of zebrafish proteins and created repositories of zebrafish ligands, membrane receptome, and interactome as well as associated diseases and targeting drugs. Unlike other tools, our interactome atlas is based on both the physical interaction data of the zebrafish proteome and existing human ligand-receptor pair databases. The resources are available as R and Python scripts. DanioTalk provides a novel resource for researchers interested in targeting cellular communication in zebrafish, as we demonstrate in applications studying the synapse and axo-glial interactome. The DanioTalk methodology can be applied to build and explore the ligand-receptor atlas of other non-mammalian model organisms.
Affiliation(s)
- Milosz Chodkowski
  - Institute of Molecular Biology and Biotechnology, Faculty of Biology, Adam Mickiewicz University in Poznań, Poznań, Poland
- Andrzej Zielezinski
  - Institute of Molecular Biology and Biotechnology, Faculty of Biology, Adam Mickiewicz University in Poznań, Poznań, Poland
- Savani Anbalagan
  - Institute of Molecular Biology and Biotechnology, Faculty of Biology, Adam Mickiewicz University in Poznań, Poznań, Poland
|
20
|
Lyubetsky VA, Rubanov LI, Tereshina MB, Ivanova AS, Araslanova KR, Uroshlev LA, Goremykina GI, Yang JR, Kanovei VG, Zverkov OA, Shitikov AD, Korotkova DD, Zaraisky AG. Wide-scale identification of novel/eliminated genes responsible for evolutionary transformations. Biol Direct 2023; 18:45. [PMID: 37568147 PMCID: PMC10416458 DOI: 10.1186/s13062-023-00405-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/02/2023] [Accepted: 08/07/2023] [Indexed: 08/13/2023]
Abstract
BACKGROUND It is generally accepted that most evolutionary transformations at the phenotype level are associated either with rearrangements of genomic regulatory elements, which control the activity of gene networks, or with changes in the amino acid contents of proteins. Recently, evidence has accumulated that significant evolutionary transformations could also be associated with the loss/emergence of whole genes. The targeted identification of such genes is a challenging problem for both bioinformatics and evo-devo research. RESULTS To solve this problem we propose the WINEGRET method, named after the first letters of the title. Its main idea is to search for genes that satisfy two requirements: first, the desired genes were lost/emerged at the same evolutionary stage at which the phenotypic trait of interest was lost/emerged, and second, the expression of these genes changes significantly during the development of the trait of interest in the model organism. To verify the first requirement, we do not use existing databases of orthologs, but rely purely on gene homology and local synteny by using some novel quickly computable conditions. Genes satisfying the second requirement are found by deep RNA sequencing. As a proof of principle, we used our method to find genes absent in extant amniotes (reptiles, birds, mammals) but present in anamniotes (fish and amphibians), in which these genes are involved in the regeneration of large body appendages. As a result, 57 genes were identified. For three of them, c-c motif chemokine 4, eotaxin-like, and a previously unknown gene called here sod4, essential roles for tail regeneration were demonstrated. Noteworthy, we established that the latter gene belongs to a novel family of Cu/Zn-superoxide dismutases lost by amniotes, SOD4. CONCLUSIONS We present a method for targeted identification of genes whose loss/emergence in evolution could be associated with the loss/emergence of a phenotypic trait of interest. 
In a proof-of-principle study, we identified genes absent in amniotes that participate in body appendage regeneration in anamniotes. Our method provides a wide range of opportunities for studying the relationship between the loss/emergence of phenotypic traits and the loss/emergence of specific genes in evolution.
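The two requirements stated in this abstract can be sketched as a simple filter. This is an illustration only: the published method establishes gene loss/emergence from homology and local synteny, whereas the sketch assumes presence/absence calls per species are already available. The function name, the data layout, and the log2 fold-change cutoff of 1.0 are hypothetical.

```python
from typing import Dict, List, Set


def candidate_genes(presence: Dict[str, Set[str]],
                    trait_species: Set[str],
                    non_trait_species: Set[str],
                    log2_fold_change: Dict[str, float],
                    min_abs_lfc: float = 1.0) -> List[str]:
    """Return genes that satisfy both requirements.

    Requirement 1: the gene is found in at least one species that has the
    trait and in no species that lacks it.
    Requirement 2: its expression changes markedly while the trait develops
    (absolute log2 fold change at or above an assumed cutoff).
    """
    selected = []
    for gene, species in presence.items():
        in_trait_group = bool(species & trait_species)
        in_non_trait_group = bool(species & non_trait_species)
        changed = abs(log2_fold_change.get(gene, 0.0)) >= min_abs_lfc
        if in_trait_group and not in_non_trait_group and changed:
            selected.append(gene)
    return sorted(selected)
```

In the proof-of-principle setting of the abstract, `trait_species` would hold anamniotes and `non_trait_species` amniotes, with fold changes taken from RNA sequencing of regenerating appendages.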
Affiliation(s)
- Vassily A Lyubetsky
  - Institute for Information Transmission Problems of the Russian Academy of Sciences (Kharkevich Institute), 19 Build. 1, Bolshoy Karetny per., Moscow, Russia, 127051
  - Department of Mechanics and Mathematics, Lomonosov Moscow State University, Kolmogorova Str., 1, Moscow, Russia, 119234
- Lev I Rubanov
  - Institute for Information Transmission Problems of the Russian Academy of Sciences (Kharkevich Institute), 19 Build. 1, Bolshoy Karetny per., Moscow, Russia, 127051
- Maria B Tereshina
  - Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, Russian Academy of Sciences, 16/10, Miklukho-Maklaya Str., Moscow, Russia, 117997
  - Pirogov Russian National Research Medical University, Moscow, Russia
- Anastasiya S Ivanova
  - Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, Russian Academy of Sciences, 16/10, Miklukho-Maklaya Str., Moscow, Russia, 117997
  - Department of Molecular Medicine, The Scripps Research Institute, La Jolla, USA
- Karina R Araslanova
  - Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, Russian Academy of Sciences, 16/10, Miklukho-Maklaya Str., Moscow, Russia, 117997
- Leonid A Uroshlev
  - Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, 32, Vavilova Str., Moscow, Russia, 119991
- Galina I Goremykina
  - Plekhanov Russian University of Economics, Stremyanny Lane 36, Moscow, Russia
- Jian-Rong Yang
  - Advanced Medical Technology Center, The First Affiliated Hospital, Zhongshan School of Medicine, Sun Yat-sen University, Guangzhou, 510080, China
  - Department of Genetics and Biomedical Informatics, Zhongshan School of Medicine, Sun Yat-sen University, Guangzhou, 510080, China
- Vladimir G Kanovei
  - Institute for Information Transmission Problems of the Russian Academy of Sciences (Kharkevich Institute), 19 Build. 1, Bolshoy Karetny per., Moscow, Russia, 127051
- Oleg A Zverkov
  - Institute for Information Transmission Problems of the Russian Academy of Sciences (Kharkevich Institute), 19 Build. 1, Bolshoy Karetny per., Moscow, Russia, 127051
- Alexander D Shitikov
  - Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, Russian Academy of Sciences, 16/10, Miklukho-Maklaya Str., Moscow, Russia, 117997
- Daria D Korotkova
  - Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, Russian Academy of Sciences, 16/10, Miklukho-Maklaya Str., Moscow, Russia, 117997
  - Global Health Institute, School of Life Sciences, EPFL, Lausanne, Switzerland
- Andrey G Zaraisky
  - Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, Russian Academy of Sciences, 16/10, Miklukho-Maklaya Str., Moscow, Russia, 117997
  - Pirogov Russian National Research Medical University, Moscow, Russia
|
21
|
Langschied F, Leisegang MS, Brandes RP, Ebersberger I. ncOrtho: efficient and reliable identification of miRNA orthologs. Nucleic Acids Res 2023; 51:e71. [PMID: 37260093 PMCID: PMC10359484 DOI: 10.1093/nar/gkad467] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 12/06/2022] [Revised: 05/04/2023] [Accepted: 05/30/2023] [Indexed: 06/02/2023]
Abstract
MicroRNAs (miRNAs) are post-transcriptional regulators that fine-tune gene expression via translational repression or degradation of their target mRNAs. Despite their functional relevance, frameworks for the scalable and accurate detection of miRNA orthologs are missing. Consequently, there is still no comprehensive picture of how miRNAs and their associated regulatory networks have evolved. Here we present ncOrtho, a synteny-informed pipeline for the targeted search of miRNA orthologs in unannotated genome sequences. ncOrtho matches miRNA annotations from multi-tissue transcriptomes in precision, while scaling to the analysis of hundreds of custom-selected species. The presence-absence pattern of orthologs to 266 human miRNA families across 402 vertebrate species reveals four bursts of miRNA acquisition, of which the most recent event occurred in the last common ancestor of higher primates. miRNA families are rarely modified or lost, but notable exceptions for both events exist. miRNA co-ortholog numbers faithfully indicate lineage-specific whole genome duplications, and miRNAs are powerful markers for phylogenomic analyses. Their exceptionally low genetic diversity makes them suitable to resolve clades where the phylogenetic signal is blurred by incomplete lineage sorting of ancestral alleles. In summary, ncOrtho makes it possible to routinely consider miRNAs in evolutionary analyses that were thus far reserved for protein-coding genes.
Affiliation(s)
- Felix Langschied
  - Applied Bioinformatics Group, Institute of Cell Biology and Neuroscience, Goethe University, Frankfurt, Germany
- Matthias S Leisegang
  - Institute for Cardiovascular Physiology, Goethe University, Frankfurt, Germany
  - German Center of Cardiovascular Research (DZHK), Partner site RheinMain, Frankfurt, Germany
- Ralf P Brandes
  - Institute for Cardiovascular Physiology, Goethe University, Frankfurt, Germany
  - German Center of Cardiovascular Research (DZHK), Partner site RheinMain, Frankfurt, Germany
- Ingo Ebersberger
  - Applied Bioinformatics Group, Institute of Cell Biology and Neuroscience, Goethe University, Frankfurt, Germany
  - Senckenberg Biodiversity and Climate Research Centre (S-BIK-F), Frankfurt am Main, Germany
  - LOEWE Centre for Translational Biodiversity Genomics (TBG), Frankfurt am Main, Germany
|
22
|
Moi D, Dessimoz C. Phylogenetic profiling in eukaryotes comes of age. Proc Natl Acad Sci U S A 2023; 120:e2305013120. [PMID: 37126713 PMCID: PMC10175774 DOI: 10.1073/pnas.2305013120] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/03/2023]
Affiliation(s)
- David Moi
  - Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland
  - Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Christophe Dessimoz
  - Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland
  - Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
|
23
|
Sun J, Lu F, Luo Y, Bie L, Xu L, Wang Y. OrthoVenn3: an integrated platform for exploring and visualizing orthologous data across genomes. Nucleic Acids Res 2023:7146343. [PMID: 37114999 DOI: 10.1093/nar/gkad313] [Citation(s) in RCA: 81] [Impact Index Per Article: 81.0] [Received: 03/08/2023] [Revised: 04/07/2023] [Accepted: 04/13/2023] [Indexed: 04/29/2023]
Abstract
Advancements in comparative genomics research have led to a growing interest in studying species evolution and genetic diversity. To facilitate this research, OrthoVenn3 has been developed as a powerful, web-based tool that enables users to efficiently identify and annotate orthologous clusters and infer phylogenetic relationships across a range of species. The latest upgrade of OrthoVenn includes several important new features, including enhanced orthologous cluster identification accuracy, improved visualization capabilities for numerous sets of data, and wrapped phylogenetic analysis. Furthermore, OrthoVenn3 now provides gene family contraction and expansion analysis to help researchers better understand the evolutionary history of gene families, as well as collinearity analysis to detect conserved and variable genomic structures. With its intuitive user interface and robust functionality, OrthoVenn3 is a valuable resource for comparative genomics research. The tool is freely accessible at https://orthovenn3.bioinfotoolkits.net.
Affiliation(s)
- Jiahe Sun
  - Integrative Science Center of Germplasm Creation in Western China (CHONGQING) Science City, Biological Science Research Center, Southwest University, Chongqing, China
- Fang Lu
  - Integrative Science Center of Germplasm Creation in Western China (CHONGQING) Science City, Biological Science Research Center, Southwest University, Chongqing, China
- Yongjiang Luo
  - Integrative Science Center of Germplasm Creation in Western China (CHONGQING) Science City, Biological Science Research Center, Southwest University, Chongqing, China
- Lingzi Bie
  - Integrative Science Center of Germplasm Creation in Western China (CHONGQING) Science City, Biological Science Research Center, Southwest University, Chongqing, China
- Ling Xu
  - State Key Laboratory of Plant Environmental Resilience, College of Biological Sciences, China Agricultural University, Beijing, China
- Yi Wang
  - Integrative Science Center of Germplasm Creation in Western China (CHONGQING) Science City, Biological Science Research Center, Southwest University, Chongqing, China
|
24
|
Persson E, Sonnhammer ELL. InParanoiDB 9: Ortholog Groups for Protein Domains and Full-Length Proteins. J Mol Biol 2023:168001. [PMID: 36764355 DOI: 10.1016/j.jmb.2023.168001] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 11/30/2022] [Revised: 01/20/2023] [Accepted: 02/01/2023] [Indexed: 02/11/2023]
Abstract
Prediction of orthologs is an important bioinformatics pursuit that is frequently used for inferring protein function and evolutionary analyses. The InParanoid database is a well known resource of ortholog predictions between a wide variety of organisms. Although orthologs have historically been inferred at the level of full-length protein sequences, many proteins consist of several independent protein domains that may be orthologous to domains in other proteins in a way that differs from the full-length protein case. To be able to capture all types of orthologous relations, conventional full-length protein orthologs can be complemented with orthologs inferred at the domain level. We here present InParanoiDB 9, covering 640 species and providing orthologs for both protein domains and full-length proteins. InParanoiDB 9 was built using the faster InParanoid-DIAMOND algorithm for orthology analysis, as well as Domainoid and Pfam to infer orthologous domains. InParanoiDB 9 is based on proteomes from 447 eukaryotes, 158 bacteria and 35 archaea, and includes over one billion predicted ortholog groups. A new website has been built for the database, providing multiple search options as well as visualization of groups of orthologs and orthologous domains. This release constitutes a major upgrade of the InParanoid database in terms of the number of species as well as the new capability to operate on the domain level. InParanoiDB 9 is available at https://inparanoidb.sbc.su.se/.
Affiliation(s)
- Emma Persson
  - Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
- Erik L L Sonnhammer
  - Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
|
25
|
Kaur H, Lynn AM. Mapping the FtsQBL divisome components in bacterial NTD pathogens as potential drug targets. Front Genet 2023; 13:1010870. [PMID: 36685953 PMCID: PMC9846249 DOI: 10.3389/fgene.2022.1010870] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/03/2022] [Accepted: 12/05/2022] [Indexed: 01/05/2023]
Abstract
Cytokinesis is an essential process in bacterial cell division, and it involves more than 25 essential/non-essential cell division proteins that form a protein complex known as a divisome. Central to the divisome are the proteins FtsB and FtsL binding to FtsQ to form a complex FtsQBL, which helps link the early proteins with late proteins. The FtsQBL complex is highly conserved as a component across bacteria. Pathogens like Vibrio cholerae, Mycobacterium ulcerans, Mycobacterium leprae, and Chlamydia trachomatis are the causative agents of the bacterial Neglected Tropical Diseases Cholera, Buruli ulcer, Leprosy, and Trachoma, respectively, some of which seemingly lack known homologs for some of the FtsQBL complex proteins. In the absence of experimental characterization, either due to insufficient resources or the massive increase in novel sequences generated from genomics, functional annotation is traditionally inferred by sequence similarity to a known homolog. With the advent of accurate protein structure prediction methods, features both at the fold level and at the protein interaction level can be used to identify orthologs that cannot be unambiguously identified using sequence similarity methods. Using the FtsQBL complex proteins as a case study, we report potential remote homologs using Profile Hidden Markov models and structures predicted using AlphaFold. Predicted ortholog structures show conformational similarity with corresponding E. coli proteins irrespective of their level of sequence similarity. AlphaFold-Multimer was used to characterize remote homologs as FtsB or FtsL when they were not sufficiently distinguishable at either the sequence or the structure level, as their interactions with FtsQ and FtsW play a crucial role in their function. The structures were then analyzed to identify functionally critical regions of the proteins consistent with their homologs and delineate regions potentially useful for inhibitor discovery.
|
26
|
Liu X, Shen Q, Zhang S. Cross-species cell-type assignment from single-cell RNA-seq data by a heterogeneous graph neural network. Genome Res 2023; 33:96-111. [PMID: 36526433 PMCID: PMC9977153 DOI: 10.1101/gr.276868.122] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 04/27/2022] [Accepted: 12/09/2022] [Indexed: 12/23/2022]
Abstract
Cross-species comparative analyses of single-cell RNA sequencing (scRNA-seq) data allow us to explore, at single-cell resolution, the origins of the cellular diversity and evolutionary mechanisms that shape cellular form and function. Cell-type assignment is a crucial step to achieve that. However, the poorly annotated genome and limited known biomarkers hinder us from assigning cell identities for nonmodel species. Here, we design a heterogeneous graph neural network model, CAME, to learn aligned and interpretable cell and gene embeddings for cross-species cell-type assignment and gene module extraction from scRNA-seq data. CAME achieves significant improvements in cell-type characterization across distant species owing to the utilization of non-one-to-one homologous gene mapping ignored by early methods. Our large-scale benchmarking study shows that CAME significantly outperforms five classical methods in terms of cell-type assignment and model robustness to insufficiency and inconsistency of sequencing depths. CAME can transfer the major cell types and interneuron subtypes of human brains to mouse and discover shared cell-type-specific functions in homologous gene modules. CAME can align the trajectories of human and macaque spermatogenesis and reveal their conservative expression dynamics. In short, CAME can make accurate cross-species cell-type assignments even for nonmodel species and uncover shared and divergent characteristics between two species from scRNA-seq data.
Affiliation(s)
- Xingyan Liu
  - NCMIS, CEMS, RCSDS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
  - School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
- Qunlun Shen
  - NCMIS, CEMS, RCSDS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
  - School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
- Shihua Zhang
  - NCMIS, CEMS, RCSDS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
  - School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
  - Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming 650223, China
  - Key Laboratory of Systems Health Science of Zhejiang Province, School of Life Science, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Hangzhou 310024, China
|
27
|
Kress A, Poch O, Lecompte O, Thompson JD. Real or fake? Measuring the impact of protein annotation errors on estimates of domain gain and loss events. FRONTIERS IN BIOINFORMATICS 2023; 3:1178926. [PMID: 37151482 PMCID: PMC10158824 DOI: 10.3389/fbinf.2023.1178926] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/03/2023] [Accepted: 04/05/2023] [Indexed: 05/09/2023]
Abstract
Protein annotation errors can have significant consequences in a wide range of fields, ranging from protein structure and function prediction to biomedical research, drug discovery, and biotechnology. By comparing the domains of different proteins, scientists can identify common domains, classify proteins based on their domain architecture, and highlight proteins that have evolved differently in one or more species or clades. However, genome-wide identification of different protein domain architectures involves a complex error-prone pipeline that includes genome sequencing, prediction of gene exon/intron structures, and inference of protein sequences and domain annotations. Here we developed an automated fact-checking approach to distinguish true domain loss/gain events from false events caused by errors that occur during the annotation process. Using genome-wide ortholog sets and taking advantage of the high-quality human and Saccharomyces cerevisiae genome annotations, we analyzed the domain gain and loss events in the predicted proteomes of 9 non-human primates (NHP) and 20 non-S. cerevisiae fungi (NSF) as annotated in the Uniprot and Interpro databases. Our approach allowed us to quantify the impact of errors on estimates of protein domain gains and losses, and we show that domain losses are over-estimated ten-fold and three-fold in the NHP and NSF proteins respectively. This is in line with previous studies of gene-level losses, where issues with genome sequencing or gene annotation led to genes being falsely inferred as absent. In addition, we show that inconsistent protein domain annotations are a major factor contributing to the false events. For the first time, to our knowledge, we show that domain gains are also over-estimated by three-fold and two-fold respectively in NHP and NSF proteins. Based on our more accurate estimates, we infer that true domain losses and gains in NHP with respect to humans are observed at similar rates, while domain gains in the more divergent NSF are observed twice as frequently as domain losses with respect to S. cerevisiae. This study highlights the need to critically examine the scientific validity of protein annotations, and represents a significant step toward scalable computational fact-checking methods that may one day mitigate the propagation of wrong information in protein databases.
|
28
|
Duan G, Wu G, Chen X, Tian D, Li Z, Sun Y, Du Z, Hao L, Song S, Gao Y, Xiao J, Zhang Z, Bao Y, Tang B, Zhao W. HGD: an integrated homologous gene database across multiple species. Nucleic Acids Res 2022; 51:D994-D1002. [PMID: 36318261 PMCID: PMC9825607 DOI: 10.1093/nar/gkac970] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2022] [Revised: 09/28/2022] [Accepted: 10/17/2022] [Indexed: 11/06/2022]
Abstract
Homology is fundamental to infer genes' evolutionary processes and relationships with shared ancestry. Existing homolog gene resources vary in terms of inferring methods, homologous relationship and identifiers, posing inevitable difficulties for choosing and mapping homology results from one to another. Here, we present HGD (Homologous Gene Database, https://ngdc.cncb.ac.cn/hgd), a comprehensive homologs resource integrating multi-species, multi-resources and multi-omics, as a complement to existing resources providing public and one-stop data service. Currently, HGD houses a total of 112 383 644 homologous pairs for 37 species, including 19 animals, 16 plants and 2 microorganisms. Meanwhile, HGD integrates various annotations from public resources, including 16 909 homologs with traits, 276 670 homologs with variants, 398 573 homologs with expression and 536 852 homologs with gene ontology (GO) annotations. HGD provides a wide range of omics gene function annotations to help users gain a deeper understanding of gene function.
Affiliation(s)
- Xiaoning Chen
  - National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
- Dongmei Tian
  - National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
  - CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
- Zhaohua Li
  - National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
- Yanling Sun
  - National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
  - CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
- Zhenglin Du
  - National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
  - CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
- Lili Hao
  - National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
  - CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
- Shuhui Song
  - National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
  - CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
- Yuan Gao
  - National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
  - CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
- Jingfa Xiao
  - National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
  - CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
- Zhang Zhang
  - National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
  - CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
- Yiming Bao
  - National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
  - CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
- Bixia Tang
  - Correspondence may also be addressed to Bixia Tang.
- Wenming Zhao
  - To whom correspondence should be addressed. Tel: +86 1084097636; Fax: +86 1084097720
|