1
Chen J, Wu H, Wang N. KEGG orthology prediction of bacterial proteins using natural language processing. BMC Bioinformatics 2024; 25:146. [PMID: 38600441 PMCID: PMC11007918 DOI: 10.1186/s12859-024-05766-x] [Received: 10/09/2023] [Accepted: 04/03/2024] [Indexed: 04/12/2024]
Abstract
BACKGROUND: The advent of high-throughput technologies has led to an exponential increase in uncharacterized bacterial protein sequences, surpassing the capacity of manual curation. A large number of bacterial protein sequences remain unannotated by Kyoto Encyclopedia of Genes and Genomes (KEGG) orthology, making automated annotation tools necessary. These tools are now indispensable in biological research, bridging the gap between the vast number of unannotated sequences and meaningful biological insights.
RESULTS: In this work, we propose a novel pipeline for KEGG orthology annotation of bacterial protein sequences that uses natural language processing and deep learning. To assess its effectiveness, we evaluated the pipeline on the genomes of two randomly selected species from the KEGG database, obtaining competitive precision, recall, and F1 score values of 0.948, 0.947, and 0.947, respectively.
CONCLUSIONS: Our experimental results suggest that the pipeline performs comparably to traditional methods and excels in identifying distant relatives with low sequence identity. This demonstrates its potential to significantly improve the accuracy and comprehensiveness of KEGG orthology annotation, thereby advancing our understanding of functional relationships within biological systems.
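The three scores reported in the RESULTS can be related through the standard definition of F1 as the harmonic mean of precision and recall. The sketch below is only a consistency check, assuming the reported F1 was derived from the aggregate precision and recall; the abstract does not say how the scores were averaged across classes or species, and the function name `f1_score` is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values quoted in the abstract; the averaging scheme is an assumption.
reported_precision = 0.948
reported_recall = 0.947

f1 = f1_score(reported_precision, reported_recall)
print(f"F1 from reported precision and recall: {f1:.4f}")
```

If the scores were macro-averaged per KEGG orthology class, the reported F1 would not need to equal this value exactly, so a small mismatch would not indicate an error.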
Affiliation(s)
- Jing Chen
  - School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China
  - Jiangsu Provincial Engineering Laboratory of Pattern Recognition and Computing Intelligence, Jiangnan University, Wuxi, China
- Haoyu Wu
  - School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China
- Ning Wang
  - School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China
2
de Jong TV, Pan Y, Rastas P, Munro D, Tutaj M, Akil H, Benner C, Chen D, Chitre AS, Chow W, Colonna V, Dalgard CL, Demos WM, Doris PA, Garrison E, Geurts AM, Gunturkun HM, Guryev V, Hourlier T, Howe K, Huang J, Kalbfleisch T, Kim P, Li L, Mahaffey S, Martin FJ, Mohammadi P, Ozel AB, Polesskaya O, Pravenec M, Prins P, Sebat J, Smith JR, Solberg Woods LC, Tabakoff B, Tracey A, Uliano-Silva M, Villani F, Wang H, Sharp BM, Telese F, Jiang Z, Saba L, Wang X, Murphy TD, Palmer AA, Kwitek AE, Dwinell MR, Williams RW, Li JZ, Chen H. A revamped rat reference genome improves the discovery of genetic diversity in laboratory rats. Cell Genom 2024; 4:100527. [PMID: 38537634 PMCID: PMC11019364 DOI: 10.1016/j.xgen.2024.100527] [Received: 10/02/2023] [Revised: 12/26/2023] [Accepted: 02/29/2024] [Indexed: 04/09/2024]
Abstract
The seventh iteration of the reference genome assembly for Rattus norvegicus, mRatBN7.2, corrects numerous misplaced segments, reduces base-level errors approximately 9-fold, and increases contiguity 290-fold compared with its predecessor. Gene annotations are now more complete, improving the mapping precision of genomic, transcriptomic, and proteomic datasets. We jointly analyzed 163 short-read whole-genome sequencing datasets representing 120 laboratory rat strains and substrains using mRatBN7.2. We defined ∼20.0 million sequence variations, of which 18,700 are predicted to potentially impact the function of 6,677 genes. We also generated a new rat genetic map from 1,893 heterogeneous stock rats and annotated transcription start sites and alternative polyadenylation sites. The mRatBN7.2 assembly, along with the extensive analysis of genomic variations among rat strains, enhances our understanding of the rat genome, providing researchers with an expanded resource for studies involving rats.
Affiliation(s)
- Tristan V de Jong
  - Department of Pharmacology, Addiction Science, and Toxicology, University of Tennessee Health Science Center, Memphis, TN, USA
- Yanchao Pan
  - Department of Human Genetics, University of Michigan, Ann Arbor, MI, USA
- Pasi Rastas
  - Institute of Biotechnology, University of Helsinki, Helsinki, Finland
- Daniel Munro
  - Department of Psychiatry, University of California San Diego, San Diego, CA, USA; Department of Integrative Structural and Computational Biology, Scripps Research, San Diego, CA, USA
- Monika Tutaj
  - Department of Physiology, Medical College of Wisconsin, Milwaukee, WI, USA; Rat Genome Database, Medical College of Wisconsin, Milwaukee, WI, USA
- Huda Akil
  - Michigan Neuroscience Institute, University of Michigan, Ann Arbor, MI, USA
- Chris Benner
  - Department of Medicine, University of California San Diego, San Diego, CA, USA
- Denghui Chen
  - Department of Psychiatry, University of California San Diego, San Diego, CA, USA
- Apurva S Chitre
  - Department of Psychiatry, University of California San Diego, San Diego, CA, USA
- William Chow
  - Tree of Life, Wellcome Sanger Institute, Cambridge, UK
- Vincenza Colonna
  - Institute of Genetics and Biophysics, National Research Council, Naples, Italy; Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Clifton L Dalgard
  - Department of Anatomy, Physiology & Genetics, The American Genome Center, Uniformed Services University of the Health Sciences, Bethesda, MD, USA
- Wendy M Demos
  - Department of Physiology, Medical College of Wisconsin, Milwaukee, WI, USA; Rat Genome Database, Medical College of Wisconsin, Milwaukee, WI, USA
- Peter A Doris
  - The Brown Foundation Institute of Molecular Medicine, Center for Human Genetics, University of Texas Health Science Center, Houston, TX, USA
- Erik Garrison
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Aron M Geurts
  - Department of Physiology, Medical College of Wisconsin, Milwaukee, WI, USA
- Hakan M Gunturkun
  - Department of Pharmacology, Addiction Science, and Toxicology, University of Tennessee Health Science Center, Memphis, TN, USA
- Victor Guryev
  - Genome Structure and Ageing, University of Groningen, UMC, Groningen, the Netherlands
- Thibaut Hourlier
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus in Hinxton, Cambridgeshire, UK
- Kerstin Howe
  - Tree of Life, Wellcome Sanger Institute, Cambridge, UK
- Jun Huang
  - Department of Pharmacology, Addiction Science, and Toxicology, University of Tennessee Health Science Center, Memphis, TN, USA
- Ted Kalbfleisch
  - Gluck Equine Research Center, Department of Veterinary Science, University of Kentucky, Louisville, KY, USA
- Panjun Kim
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Ling Li
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA; Center for Proteomics and Metabolomics, St. Jude Children's Research Hospital, Memphis, TN, USA
- Spencer Mahaffey
  - Department of Pharmaceutical Sciences, Skaggs School of Pharmacy and Pharmaceutical Sciences, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Fergal J Martin
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus in Hinxton, Cambridgeshire, UK
- Pejman Mohammadi
  - Center for Immunity and Immunotherapies, Seattle Children's Research Institute, Seattle, WA, USA; Department of Pediatrics, University of Washington School of Medicine, Seattle, WA, USA
- Ayse Bilge Ozel
  - Department of Human Genetics, University of Michigan, Ann Arbor, MI, USA
- Oksana Polesskaya
  - Department of Psychiatry, University of California San Diego, San Diego, CA, USA
- Michal Pravenec
  - Institute of Physiology, Czech Academy of Sciences, Prague, Czechia
- Pjotr Prins
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Jonathan Sebat
  - Department of Psychiatry, University of California San Diego, San Diego, CA, USA
- Jennifer R Smith
  - Department of Physiology, Medical College of Wisconsin, Milwaukee, WI, USA; Rat Genome Database, Medical College of Wisconsin, Milwaukee, WI, USA
- Leah C Solberg Woods
  - Department of Internal Medicine, Section on Molecular Medicine, Wake Forest University School of Medicine, Winston-Salem, NC, USA
- Boris Tabakoff
  - Department of Pharmaceutical Sciences, Skaggs School of Pharmacy and Pharmaceutical Sciences, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Alan Tracey
  - Tree of Life, Wellcome Sanger Institute, Cambridge, UK
- Flavia Villani
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Hongyang Wang
  - Department of Animal Sciences, Washington State University, Pullman, WA, USA
- Burt M Sharp
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Francesca Telese
  - Department of Psychiatry, University of California San Diego, San Diego, CA, USA
- Zhihua Jiang
  - Department of Animal Sciences, Washington State University, Pullman, WA, USA
- Laura Saba
  - Department of Pharmaceutical Sciences, Skaggs School of Pharmacy and Pharmaceutical Sciences, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Xusheng Wang
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA; Center for Proteomics and Metabolomics, St. Jude Children's Research Hospital, Memphis, TN, USA
- Terence D Murphy
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Abraham A Palmer
  - Department of Psychiatry, University of California San Diego, San Diego, CA, USA; Institute for Genomic Medicine, University of California San Diego, La Jolla, CA, USA
- Anne E Kwitek
  - Department of Physiology, Medical College of Wisconsin, Milwaukee, WI, USA; Rat Genome Database, Medical College of Wisconsin, Milwaukee, WI, USA
- Melinda R Dwinell
  - Department of Physiology, Medical College of Wisconsin, Milwaukee, WI, USA; Rat Genome Database, Medical College of Wisconsin, Milwaukee, WI, USA
- Robert W Williams
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Jun Z Li
  - Department of Human Genetics, University of Michigan, Ann Arbor, MI, USA
- Hao Chen
  - Department of Pharmacology, Addiction Science, and Toxicology, University of Tennessee Health Science Center, Memphis, TN, USA
3
Zou M, Lin A, Wang Y, Yang D, Liu X. The chromosome-level genome assembly of the giant dobsonfly Acanthacorydalis orientalis (McLachlan, 1899). Sci Data 2024; 11:351. [PMID: 38589366 PMCID: PMC11001986 DOI: 10.1038/s41597-024-03194-3] [Received: 12/27/2023] [Accepted: 03/28/2024] [Indexed: 04/10/2024]
Abstract
Acanthacorydalis orientalis (McLachlan, 1899) (Megaloptera: Corydalidae) is an important freshwater benthic invertebrate species from East Asia that serves as an indicator for water-quality biomonitoring and is of conservation value. Here, a high-quality reference genome for A. orientalis was constructed using Oxford Nanopore sequencing and high-throughput chromosome conformation capture (Hi-C) technology. The final genome size is 547.98 Mb, with contig and scaffold N50 values of 7.77 Mb and 50.53 Mb, respectively. The longest contig and scaffold are 20.57 Mb and 62.26 Mb in length, respectively. In total, 99.75% of contigs are anchored onto 13 pseudo-chromosomes. Benchmarking Universal Single-Copy Orthologs (BUSCO) analysis showed that the completeness of the genome assembly is 99.01%. A total of 10,977 protein-coding genes were identified, of which 84.00% are functionally annotated. The genome contains 44.86% repeat sequences. This high-quality genome provides substantial data for future studies on population genetics, aquatic adaptation, and evolution of Megaloptera and other related insect groups.
Affiliation(s)
- Mingming Zou
  - Department of Entomology, China Agricultural University, Beijing, 100193, China
- Aili Lin
  - Department of Entomology, China Agricultural University, Beijing, 100193, China
- Yuyu Wang
  - College of Plant Protection, Hebei Agricultural University, Baoding, 071001, China
- Ding Yang
  - Department of Entomology, China Agricultural University, Beijing, 100193, China
- Xingyue Liu
  - Department of Entomology, China Agricultural University, Beijing, 100193, China
4
Chen L, Yu XY, Zhang F, Zhang HM, Guo LX, Ren L, Hong XY, Sun JT. A chromosome-level genome assembly of the spider mite Tetranychus piercei McGregor. Sci Data 2024; 11:340. [PMID: 38580722 PMCID: PMC10997676 DOI: 10.1038/s41597-024-03189-0] [Received: 12/06/2023] [Accepted: 03/25/2024] [Indexed: 04/07/2024]
Abstract
Despite rapid advances in sequencing technology, limited genomic resources are currently available for phytophagous spider mites, which include many important agricultural pests. One of these pests is Tetranychus piercei (McGregor), a serious banana pest in East Asia exhibiting remarkable tolerance to high temperature. In this study, we assembled a high-quality genome of T. piercei using a combination of PacBio long-read and Illumina short-read sequencing. With the assistance of chromatin conformation capture technology, 99.9% of the contigs were anchored into three pseudochromosomes with a total size of 86.02 Mb. Repetitive elements, accounting for 14.16% of this genome (12.20 Mb), are predominantly composed of long terminal repeats (30.7%). By combining evidence from ab initio prediction, transcripts, and homologous proteins, we annotated 11,881 protein-coding genes. Both the genome and proteins have high BUSCO completeness scores (>94%). This high-quality genome, along with reliable annotation, provides a valuable resource for investigating the high-temperature tolerance of this species and exploring the genomic basis that underlies the host range evolution of spider mites.
Affiliation(s)
- Lei Chen
  - Department of Entomology, Nanjing Agricultural University, Nanjing, Jiangsu, 210095, China
- Xin-Yue Yu
  - Department of Entomology, Nanjing Agricultural University, Nanjing, Jiangsu, 210095, China
- Feng Zhang
  - Department of Entomology, Nanjing Agricultural University, Nanjing, Jiangsu, 210095, China
- Hua-Meng Zhang
  - Department of Entomology, Nanjing Agricultural University, Nanjing, Jiangsu, 210095, China
- Li-Xue Guo
  - Department of Entomology, Nanjing Agricultural University, Nanjing, Jiangsu, 210095, China
- Lu Ren
  - Department of Entomology, Nanjing Agricultural University, Nanjing, Jiangsu, 210095, China
- Xiao-Yue Hong
  - Department of Entomology, Nanjing Agricultural University, Nanjing, Jiangsu, 210095, China
- Jing-Tao Sun
  - Department of Entomology, Nanjing Agricultural University, Nanjing, Jiangsu, 210095, China
5
Jiang H, Chai ZX, Chen XY, Zhang CF, Zhu Y, Ji QM, Xin JW. Yak genome database: a multi-omics analysis platform. BMC Genomics 2024; 25:346. [PMID: 38580907 PMCID: PMC10998334 DOI: 10.1186/s12864-024-10274-6] [Received: 10/30/2023] [Accepted: 03/31/2024] [Indexed: 04/07/2024]
Abstract
BACKGROUND: The yak (Bos grunniens) is a large ruminant species that lives in high-altitude regions and exhibits excellent adaptation to plateau environments. To further understand the genetic characteristics and adaptive mechanisms of yak, we have developed a multi-omics database of yak including genome, transcriptome, proteome, and DNA methylation data.
DESCRIPTION: The Yak Genome Database (http://yakgenomics.com/) integrates research results on the genome, transcriptome, proteome, and DNA methylation, and provides an integrated platform for researchers to share and exchange omics data. The database contains 26,518 genes, 62 transcriptomes, 144,309 proteome spectra, and 22,478 methylation sites of yak. The genome module provides access to yak genome sequences, gene annotations, and variant information. The transcriptome module offers transcriptome data from various tissues of yak and cattle strains at different developmental stages. The proteome module presents protein profiles from diverse yak organs. Additionally, the DNA methylation module shows the DNA methylation information at each base of the whole genome. The database supports data downloading and browsing, functional gene exploration, and experimental practice.
CONCLUSION: This comprehensive database provides a valuable resource for further investigations on development, molecular mechanisms underlying high-altitude adaptation, and molecular breeding of yak.
Affiliation(s)
- Hui Jiang
  - State Key Laboratory of Hulless Barley and Yak Germplasm Resources and Genetic Improvement, 850000, Lhasa, Tibet, China
  - Institute of Animal Science and Veterinary, Tibet Academy of Agricultural and Animal Husbandry Sciences, 850000, Lhasa, Tibet, China
- Zhi-Xin Chai
  - Key Laboratory of Qinghai-Tibetan Plateau Animal Genetic Resource Reservation and Utilization, Sichuan Province and Ministry of Education, Southwest Minzu University, 610041, Chengdu, Sichuan, China
- Xiao-Ying Chen
  - State Key Laboratory of Hulless Barley and Yak Germplasm Resources and Genetic Improvement, 850000, Lhasa, Tibet, China
  - Institute of Animal Science and Veterinary, Tibet Academy of Agricultural and Animal Husbandry Sciences, 850000, Lhasa, Tibet, China
- Cheng-Fu Zhang
  - State Key Laboratory of Hulless Barley and Yak Germplasm Resources and Genetic Improvement, 850000, Lhasa, Tibet, China
  - Institute of Animal Science and Veterinary, Tibet Academy of Agricultural and Animal Husbandry Sciences, 850000, Lhasa, Tibet, China
- Yong Zhu
  - State Key Laboratory of Hulless Barley and Yak Germplasm Resources and Genetic Improvement, 850000, Lhasa, Tibet, China
  - Institute of Animal Science and Veterinary, Tibet Academy of Agricultural and Animal Husbandry Sciences, 850000, Lhasa, Tibet, China
- Qiu-Mei Ji
  - State Key Laboratory of Hulless Barley and Yak Germplasm Resources and Genetic Improvement, 850000, Lhasa, Tibet, China
  - Institute of Animal Science and Veterinary, Tibet Academy of Agricultural and Animal Husbandry Sciences, 850000, Lhasa, Tibet, China
- Jin-Wei Xin
  - State Key Laboratory of Hulless Barley and Yak Germplasm Resources and Genetic Improvement, 850000, Lhasa, Tibet, China
  - Institute of Animal Science and Veterinary, Tibet Academy of Agricultural and Animal Husbandry Sciences, 850000, Lhasa, Tibet, China
6
Bhattarai UR, Poulin R, Gemmell NJ, Dowle E. Genome assembly and annotation of the mermithid nematode Mermis nigrescens. G3 (Bethesda) 2024; 14:jkae023. [PMID: 38301266 PMCID: PMC10989877 DOI: 10.1093/g3journal/jkae023] [Received: 12/29/2023] [Revised: 01/21/2024] [Accepted: 01/22/2024] [Indexed: 02/03/2024]
Abstract
Genetic studies of nematodes have been dominated by Caenorhabditis elegans as a model species. A lack of genomic resources has limited the expansion of genetic research to other groups of nematodes. Here, we report a draft genome assembly of a mermithid nematode, Mermis nigrescens. Mermithidae are insect-parasitic nematodes with hosts including a wide range of terrestrial arthropods. We sequenced, assembled, and annotated the whole genome of M. nigrescens using nanopore long reads and 10X Chromium linked reads. The assembly is 524 Mb in size and consists of 867 scaffolds. The N50 value is 2.42 Mb, and half of the assembly is in the 30 longest scaffolds. The assembly BUSCO score from the eukaryotic database (eukaryota_odb10) indicates that the genome is 86.7% complete and 5.1% partial. The genome has a high level of heterozygosity (6.6%) with a repeat content of 83.98%. mRNA-seq reads from different-sized nematodes (≤2 cm, 3.5-7 cm, and >7 cm body length) representing different developmental stages were also generated and used for the genome annotation. Using ab initio and evidence-based gene model predictions, 12,313 protein-coding genes and 24,186 mRNAs were annotated. These genomic resources will help researchers investigate the various aspects of the biology and host-parasite interactions of mermithid nematodes.
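The N50 value and the "half of the assembly is in the 30 longest scaffolds" figure (often called L50) quoted above come from one simple calculation over sorted sequence lengths. The following is a minimal sketch of that calculation on toy data; the function name `n50_l50` and the example lengths are illustrative assumptions, not values or code from the paper.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig or scaffold lengths.

    N50 is the length of the sequence at which the running total,
    taken from longest to shortest, first reaches half the assembly size.
    L50 is how many sequences were needed to get there.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise ValueError("lengths must be non-empty")


# Toy scaffold lengths (arbitrary units), not data from the assembly.
toy_lengths = [8, 6, 4, 3, 2, 1]
n50, l50 = n50_l50(toy_lengths)
print(f"N50 = {n50}, L50 = {l50}")
```

In these terms, the abstract reports an N50 of 2.42 Mb and an L50 of 30 for an assembly of 867 scaffolds.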
Affiliation(s)
- Upendra R Bhattarai
  - Department of Anatomy, University of Otago, Dunedin 9016, New Zealand
  - Department of Organismic & Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA
- Robert Poulin
  - Department of Zoology, University of Otago, Dunedin 9016, New Zealand
- Neil J Gemmell
  - Department of Anatomy, University of Otago, Dunedin 9016, New Zealand
- Eddy Dowle
  - Department of Anatomy, University of Otago, Dunedin 9016, New Zealand
7
Wang M, Li X, Liu X, Hou X, He Y, Yu JH, Hu S, Yin H, Xie BB. Annotation of 2,507 Saccharomyces cerevisiae genomes. Microbiol Spectr 2024; 12:e0358223. [PMID: 38488392 DOI: 10.1128/spectrum.03582-23] [Received: 10/05/2023] [Accepted: 02/25/2024] [Indexed: 04/06/2024]
Abstract
Saccharomyces cerevisiae (baker's yeast, budding yeast) is one of the most important model organisms for biological research and is a crucial microorganism in industry. Currently, a huge number of Saccharomyces cerevisiae genome sequences are available in the public domain. However, these genomes are distributed across different websites and a large number of them are released without annotation information. To provide one complete annotated genome data resource, we collected 2,507 Saccharomyces cerevisiae genome assemblies and re-annotated 2,506 assemblies using a custom annotation pipeline, producing a total of 15,407,164 protein-coding gene models. With a custom pipeline, all these gene sequences were clustered into families. A total of 1,506 single-copy genes were selected as marker genes, which were then used to evaluate the genome completeness and base qualities of all assemblies. Pangenomic analyses were performed based on a selected subset of 847 medium-high-quality genomes. Statistical comparisons revealed a number of gene families showing copy number variations among different organism sources. To the authors' knowledge, this study represents the largest genome annotation project of S. cerevisiae so far, providing rich genomic resources for future studies of the model organism S. cerevisiae and its relatives.
IMPORTANCE: Saccharomyces cerevisiae (baker's yeast, budding yeast) is one of the most important model organisms for biological research and is a crucial microorganism in industry. Though a huge number of Saccharomyces cerevisiae genome sequences are available in the public domain, these genomes are distributed across different websites and most are released without annotation, hindering the efficient reuse of these genome resources. Here, we collected 2,507 genomes for Saccharomyces cerevisiae, performed genome annotation, and evaluated the genome qualities. All the obtained data have been deposited at public repositories and are freely accessible to the community. This study represents the largest genome annotation project of S. cerevisiae so far, providing one complete annotated genome data set for S. cerevisiae, an important workhorse for fundamental biology, biotechnology, and industry.
Affiliation(s)
- Meng Wang
  - Microbial Technology Institute and State Key Laboratory of Microbial Technology, Shandong University, Qingdao, China
- Xuan Li
  - Microbial Technology Institute and State Key Laboratory of Microbial Technology, Shandong University, Qingdao, China
- Xian Liu
  - Microbial Technology Institute and State Key Laboratory of Microbial Technology, Shandong University, Qingdao, China
- Xiaoping Hou
  - State Key Laboratory of Biological Fermentation Engineering of Beer, Tsingtao Brewery Co., Ltd, Qingdao, China
- Yang He
  - State Key Laboratory of Biological Fermentation Engineering of Beer, Tsingtao Brewery Co., Ltd, Qingdao, China
- Jun-Hong Yu
  - State Key Laboratory of Biological Fermentation Engineering of Beer, Tsingtao Brewery Co., Ltd, Qingdao, China
- Shumin Hu
  - State Key Laboratory of Biological Fermentation Engineering of Beer, Tsingtao Brewery Co., Ltd, Qingdao, China
- Hua Yin
  - State Key Laboratory of Biological Fermentation Engineering of Beer, Tsingtao Brewery Co., Ltd, Qingdao, China
- Bin-Bin Xie
  - Microbial Technology Institute and State Key Laboratory of Microbial Technology, Shandong University, Qingdao, China
8
Halstead-Nussloch G, Signorini SG, Giulio M, Crocetta F, Munari M, Della Torre C, Weber AAT. The genome of the rayed Mediterranean limpet Patella caerulea (Linnaeus, 1758). Genome Biol Evol 2024; 16:evae070. [PMID: 38546725 PMCID: PMC11003540 DOI: 10.1093/gbe/evae070] [Accepted: 03/23/2024] [Indexed: 04/11/2024]
Abstract
Patella caerulea (Linnaeus, 1758) is a limpet species of the class Gastropoda (phylum Mollusca). Endemic to the Mediterranean Sea, it is considered a keystone species due to its primary role in structuring and regulating the ecological balance of tidal and subtidal habitats. It is currently being used as a bioindicator to assess the environmental quality of coastal marine waters and as a model species to understand adaptation to ocean acidification. Here, we provide a high-quality reference genome assembly and annotation for P. caerulea. We generated ∼30 Gb of Pacific Biosciences high-fidelity data from a single individual and provide a final 749.8 Mb assembly containing 62 contigs, including the mitochondrial genome (14,938 bp). With an N50 of 48.8 Mb and 98% of the assembly contained in the 18 largest contigs, this assembly is near chromosome-scale. Benchmarking Universal Single-Copy Orthologs scores were high (Mollusca, 87.8% complete; Metazoa, 97.2% complete) and similar to metrics observed for other chromosome-level Patella genomes, highlighting a possible bias in the Mollusca database for Patellids. We generated transcriptomic Illumina data from a second individual collected at the same locality and used it together with protein evidence to annotate the genome. A total of 23,938 protein-coding gene models were found. By comparing this annotation with other published Patella annotations, we found that the distribution and median values of exon and gene lengths were comparable with those of other Patella species despite different annotation approaches. The present high-quality P. caerulea reference genome, available on GenBank (BioProject: PRJNA1045377; assembly: GCA_036850965.1), is an important resource for future ecological and evolutionary studies.
Affiliation(s)
- Silvia Giorgia Signorini
  - Department of Aquatic Ecology, Swiss Federal Institute of Aquatic Science and Technology (Eawag), Dübendorf, Switzerland
  - Department of Biosciences, University of Milan, Milan, Italy
  - Department of Integrative Marine Ecology, Stazione Zoologica Anton Dohrn, Naples, Italy
- Marco Giulio
  - Department of Aquatic Ecology, Swiss Federal Institute of Aquatic Science and Technology (Eawag), Dübendorf, Switzerland
- Fabio Crocetta
  - Department of Integrative Marine Ecology, Stazione Zoologica Anton Dohrn, Naples, Italy
  - National Biodiversity Future Center (NBFC), Palermo, Italy
- Marco Munari
  - Department of Integrative Marine Ecology, Stazione Zoologica Anton Dohrn, Naples, Italy
  - Department of Biology, Stazione Idrobiologica ‘Umberto d’Ancona’, University of Padova, Chioggia, Italy
- Camilla Della Torre
  - Department of Biosciences, University of Milan, Milan, Italy
  - Department of Integrative Marine Ecology, Stazione Zoologica Anton Dohrn, Naples, Italy
- Alexandra Anh-Thu Weber
  - Department of Aquatic Ecology, Swiss Federal Institute of Aquatic Science and Technology (Eawag), Dübendorf, Switzerland
9
Baril T, Galbraith J, Hayward A. Earl Grey: A Fully Automated User-Friendly Transposable Element Annotation and Analysis Pipeline. Mol Biol Evol 2024; 41:msae068. [PMID: 38577785 PMCID: PMC11003543 DOI: 10.1093/molbev/msae068] [Received: 11/24/2023] [Revised: 02/20/2024] [Accepted: 03/22/2024] [Indexed: 04/06/2024]
Abstract
Transposable elements (TEs) are major components of eukaryotic genomes and are implicated in a range of evolutionary processes. Yet, TE annotation and characterization remain challenging, particularly for nonspecialists, since existing pipelines are typically complicated to install, run, and extract data from. Current methods of automated TE annotation are also subject to issues that reduce overall quality, particularly (i) fragmented and overlapping TE annotations, leading to erroneous estimates of TE count and coverage, and (ii) repeat models represented by short sections of total TE length, with poor capture of 5' and 3' ends. To address these issues, we present Earl Grey, a fully automated TE annotation pipeline designed for user-friendly curation and annotation of TEs in eukaryotic genome assemblies. Using nine simulated genomes and an annotation of Drosophila melanogaster, we show that Earl Grey outperforms current widely used TE annotation methodologies in ameliorating the issues mentioned above while scoring highly in benchmarking for TE annotation and classification and being robust across genomic contexts. Earl Grey provides a comprehensive and fully automated TE annotation toolkit that provides researchers with paper-ready summary figures and outputs in standard formats compatible with other bioinformatics tools. Earl Grey has a modular format, with great scope for the inclusion of additional modules focused on further quality control and tailored analyses in future releases.
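Issue (i) above, that overlapping annotations lead to erroneous coverage estimates, can be illustrated with a generic interval-merging calculation. This is a minimal sketch of the general idea only and is not Earl Grey's implementation; the function name `merged_coverage`, the half-open coordinate convention, and the toy coordinates are all assumptions made for illustration.

```python
def merged_coverage(intervals):
    """Total bases covered by half-open (start, end) intervals,
    counting overlapping or touching stretches only once."""
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Gap before this interval: close the previous merged block.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlaps or touches the current block: extend it.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered


# Toy TE annotations on one sequence; the first two overlap by 100 bp.
annotations = [(100, 500), (400, 900), (1200, 1500)]
naive = sum(end - start for start, end in annotations)
merged = merged_coverage(annotations)
print(f"naive sum: {naive} bp, merged coverage: {merged} bp")
```

Summing annotation lengths without merging counts the overlapping 100 bp twice, which is the kind of inflation the abstract describes.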
Affiliation(s)
- Tobias Baril
  - Centre for Ecology and Conservation, University of Exeter, Penryn Campus, Cornwall TR10 9FE, UK
  - Laboratory of Evolutionary Genetics, Institute of Biology, University of Neuchâtel, 2000 Neuchâtel, Switzerland
- James Galbraith
  - Centre for Ecology and Conservation, University of Exeter, Penryn Campus, Cornwall TR10 9FE, UK
  - Institute of Evolutionary Biology, University of Edinburgh, Edinburgh EH9 3FL, UK
- Alex Hayward
  - Centre for Ecology and Conservation, University of Exeter, Penryn Campus, Cornwall TR10 9FE, UK
10
Zhao Y, Yang Z, Wang L, Zhang Y, Lin H, Wang J. Predicting Protein Functions Based on Heterogeneous Graph Attention Technique. IEEE J Biomed Health Inform 2024; 28:2408-2415. [PMID: 38319781 DOI: 10.1109/jbhi.2024.3357834] [Indexed: 02/08/2024]
Abstract
In bioinformatics, protein function prediction stands as a fundamental area of research and plays a crucial role in addressing various biological challenges, such as the identification of potential targets for drug discovery and the elucidation of disease mechanisms. However, known functional annotation databases usually provide positive experimental annotations that proteins carry out a given function, and rarely record negative experimental annotations that proteins do not carry out a given function. Therefore, existing computational methods based on deep learning models focus on these positive annotations for prediction and ignore these scarce but informative negative annotations, leading to an underestimation of precision. To address this issue, we introduce a deep learning method that utilizes a heterogeneous graph attention technique. The method first constructs a heterogeneous graph that covers the protein-protein interaction network, ontology structure, and positive and negative annotation information. Then, it learns embedding representations of proteins and ontology terms by using the heterogeneous graph attention technique. Finally, it leverages these learned representations to reconstruct the positive protein-term associations and score unobserved functional annotations. It can enhance the predictive performance by incorporating these known limited negative annotations into the constructed heterogeneous graph. Experimental results on three species (i.e., Human, Mouse, and Arabidopsis) demonstrate that our method can achieve better performance in predicting new protein annotations than state-of-the-art methods.
11
Gomes-Dos-Santos A, Domingues M, Ruivo R, Fonseca E, Froufe E, Deyanova D, Franco JN, C Castro LF. An historical "wreck": A transcriptome assembly of the naval shipworm, Teredo navalis Linnaeus, 1978. Mar Genomics 2024; 74:101097. [PMID: 38485291 DOI: 10.1016/j.margen.2024.101097] [Received: 10/12/2023] [Revised: 12/27/2023] [Accepted: 02/20/2024] [Indexed: 03/19/2024]
Abstract
Historically famous for their negative impact on human-built marine wood structures, mollusc shipworms play a central ecological role in marine ecosystems. Their association with bacterial symbionts, providing cellulolytic and nitrogen-fixing activities, underscores their exceptional wood-eating and wood-boring behaviours, improving energy transfer and the recycling of essential nutrients locked in the wood cellulose. Importantly, from a molecular standpoint, only a minute amount of omic resources is available for this lineage of Bivalvia. Here, we produced and assembled a transcriptome from the globally distributed naval shipworm, Teredo navalis (family Teredinidae). The transcriptome was obtained by sequencing the total RNA from five equidistant segments of the whole body of a T. navalis specimen. The quality of the produced assembly was assessed with several statistics, revealing a highly contiguous (N50 of 1194) and complete (over 90% BUSCO scores for Eukaryote and Metazoan databases) transcriptome, with nearly 38,000 predicted ORFs, more than half being functionally annotated. Our findings pave the way to investigate the unique evolutionary biology of these highly modified bivalves and lay the foundation for an adequate gene annotation of a full genome sequence of the species.
Affiliation(s)
- André Gomes-Dos-Santos
  - CIIMAR/CIMAR - Interdisciplinary Centre of Marine and Environmental Research, University of Porto, Terminal de Cruzeiros do Porto de Leixões, Avenida General Norton de Matos, S/N, 4450-208 Matosinhos, Portugal.
- Marcos Domingues
  - CIIMAR/CIMAR - Interdisciplinary Centre of Marine and Environmental Research, University of Porto, Terminal de Cruzeiros do Porto de Leixões, Avenida General Norton de Matos, S/N, 4450-208 Matosinhos, Portugal
- Raquel Ruivo
  - CIIMAR/CIMAR - Interdisciplinary Centre of Marine and Environmental Research, University of Porto, Terminal de Cruzeiros do Porto de Leixões, Avenida General Norton de Matos, S/N, 4450-208 Matosinhos, Portugal
- Elza Fonseca
  - CIIMAR/CIMAR - Interdisciplinary Centre of Marine and Environmental Research, University of Porto, Terminal de Cruzeiros do Porto de Leixões, Avenida General Norton de Matos, S/N, 4450-208 Matosinhos, Portugal
- Elsa Froufe
  - CIIMAR/CIMAR - Interdisciplinary Centre of Marine and Environmental Research, University of Porto, Terminal de Cruzeiros do Porto de Leixões, Avenida General Norton de Matos, S/N, 4450-208 Matosinhos, Portugal
- Diana Deyanova
  - Department of Biological and Environmental Sciences, University of Gothenburg, Kristineberg, Fiskebäckskil, Sweden
- João N Franco
  - CIIMAR/CIMAR - Interdisciplinary Centre of Marine and Environmental Research, University of Porto, Terminal de Cruzeiros do Porto de Leixões, Avenida General Norton de Matos, S/N, 4450-208 Matosinhos, Portugal; MARE - Marine and Environmental Sciences Centre & ARNET - Aquatic Research Network, ESTM, Polytechnic of Leiria, 2520-641 Peniche, Portugal
- L Filipe C Castro
  - CIIMAR/CIMAR - Interdisciplinary Centre of Marine and Environmental Research, University of Porto, Terminal de Cruzeiros do Porto de Leixões, Avenida General Norton de Matos, S/N, 4450-208 Matosinhos, Portugal; Department of Biology, Faculty of Sciences, University of Porto, Rua do Campo Alegre 1021/1055, 4169-007 Porto, Portugal.
12
Liu W, Wang Z, You R, Xie C, Wei H, Xiong Y, Yang J, Zhu S. PLMSearch: Protein language model powers accurate and fast sequence search for remote homology. Nat Commun 2024; 15:2775. [PMID: 38555371 PMCID: PMC10981738 DOI: 10.1038/s41467-024-46808-5] [Received: 05/28/2023] [Accepted: 03/08/2024] [Indexed: 04/02/2024] Open
Abstract
Homologous protein search is one of the most commonly used methods for protein annotation and analysis. Compared to structure search, detecting distant evolutionary relationships from sequences alone remains challenging. Here we propose PLMSearch (Protein Language Model), a homologous protein search method with only sequences as input. PLMSearch uses deep representations from a pre-trained protein language model and trains its similarity prediction model on a large number of real structure similarities. This enables PLMSearch to capture the remote homology information concealed behind the sequences. Extensive experimental results show that PLMSearch can search millions of query-target protein pairs in seconds like MMseqs2 while increasing the sensitivity by more than threefold, and is comparable to state-of-the-art structure search methods. In particular, unlike traditional sequence search methods, PLMSearch can recall most remote homology pairs with dissimilar sequences but similar structures. PLMSearch is freely available at https://dmiip.sjtu.edu.cn/PLMSearch.
Affiliation(s)
- Wei Liu
  - Institute of Science and Technology for Brain-Inspired Intelligence and MOE Frontiers Center for Brain Science, Fudan University, 200433, Shanghai, China
- Ziye Wang
  - Institute of Science and Technology for Brain-Inspired Intelligence and MOE Frontiers Center for Brain Science, Fudan University, 200433, Shanghai, China
- Ronghui You
  - Institute of Science and Technology for Brain-Inspired Intelligence and MOE Frontiers Center for Brain Science, Fudan University, 200433, Shanghai, China
- Chenghan Xie
  - School of Mathematical Sciences, Fudan University, 200433, Shanghai, China
- Hong Wei
  - School of Mathematical Sciences, Nankai University, 300071, Tianjin, China
- Yi Xiong
  - Department of Bioinformatics and Biostatistics, Shanghai Jiao Tong University, 200240, Shanghai, China
- Jianyi Yang
  - Ministry of Education Frontiers Science Center for Nonlinear Expectations, Research Center for Mathematics and Interdisciplinary Science, Shandong University, 266237, Qingdao, China.
- Shanfeng Zhu
  - Institute of Science and Technology for Brain-Inspired Intelligence and MOE Frontiers Center for Brain Science, Fudan University, 200433, Shanghai, China.
  - Shanghai Qi Zhi Institute, Shanghai, China.
  - Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, Shanghai, China.
  - Shanghai Key Lab of Intelligent Information Processing and Shanghai Institute of Artificial Intelligence Algorithm, Fudan University, Shanghai, China.
  - Zhangjiang Fudan International Innovation Center, Shanghai, China.
13
Dong Z, Wang J, Chen G, Guo Y, Zhao N, Wang Z, Zhang B. A high-quality chromosome-level genome assembly of the Chinese medaka Oryzias sinensis. Sci Data 2024; 11:322. [PMID: 38548787 PMCID: PMC10978949 DOI: 10.1038/s41597-024-03173-8] [Received: 06/14/2023] [Accepted: 03/21/2024] [Indexed: 04/01/2024] Open
Abstract
Oryzias sinensis, also known as Chinese medaka or Chinese ricefish, is a commonly used animal model for aquatic environmental assessment in the wild as well as gene function validation or toxicology research in the lab. Here, a high-quality chromosome-level genome assembly of O. sinensis was generated using single-tube long fragment read (stLFR) reads, Nanopore long-reads, and Hi-C sequencing data. The genome is 796.58 Mb, and a total of 712.17 Mb of the assembled sequences were anchored to 23 pseudo-chromosomes. A final set of 22,461 genes were annotated, with 98.67% being functionally annotated. The Benchmarking Universal Single-Copy Orthologs (BUSCO) benchmark of genome assembly and gene annotation reached 95.1% (93.3% single-copy) and 94.6% (91.7% single-copy), respectively. Furthermore, we also use ATAC-seq to uncover chromosome transposase-accessibility as well as related genome area function enrichment for Oryzias sinensis. This study offers a new improved foundation for future genomics research in Chinese medaka.
Affiliation(s)
- Zhongdian Dong
  - Key Laboratory of Aquaculture in the South China Sea for Aquatic Economic Animals of Guangdong Higher Education Institutes, College of Fishery, Guangdong Ocean University, Zhanjiang, 524088, China
  - Guangdong Provincial Key Laboratory of Aquatic Animal Disease Control and Healthy Culture, College of Fishery, Guangdong Ocean University, Zhanjiang, 524088, China
- Jiangman Wang
  - Qingdao Marine Management Support Center, Qingdao, Shandong, China
- Guozhu Chen
  - National Plateau Wetland Research Center, College of Wetlands, Southwest Forestry University, Kunming, 650224, China
- Yusong Guo
  - Key Laboratory of Aquaculture in the South China Sea for Aquatic Economic Animals of Guangdong Higher Education Institutes, College of Fishery, Guangdong Ocean University, Zhanjiang, 524088, China
- Na Zhao
  - Key Laboratory of Aquaculture in the South China Sea for Aquatic Economic Animals of Guangdong Higher Education Institutes, College of Fishery, Guangdong Ocean University, Zhanjiang, 524088, China
  - Southern Marine Science and Engineering Guangdong Laboratory-Zhanjiang, Zhanjiang, 524000, China
- Zhongduo Wang
  - Key Laboratory of Aquaculture in the South China Sea for Aquatic Economic Animals of Guangdong Higher Education Institutes, College of Fishery, Guangdong Ocean University, Zhanjiang, 524088, China.
  - Guangdong Provincial Key Laboratory of Aquatic Animal Disease Control and Healthy Culture, College of Fishery, Guangdong Ocean University, Zhanjiang, 524088, China.
- Bo Zhang
  - Key Laboratory of Aquaculture in the South China Sea for Aquatic Economic Animals of Guangdong Higher Education Institutes, College of Fishery, Guangdong Ocean University, Zhanjiang, 524088, China.
  - Southern Marine Science and Engineering Guangdong Laboratory-Zhanjiang, Zhanjiang, 524000, China.
14
Chen Z, Ain NU, Zhao Q, Zhang X. From tradition to innovation: conventional and deep learning frameworks in genome annotation. Brief Bioinform 2024; 25:bbae138. [PMID: 38581418 PMCID: PMC10998533 DOI: 10.1093/bib/bbae138] [Received: 12/01/2023] [Revised: 03/08/2024] [Accepted: 03/10/2024] [Indexed: 04/08/2024] Open
Abstract
Following the milestone success of the Human Genome Project, the 'Encyclopedia of DNA Elements (ENCODE)' initiative was launched in 2003 to unearth information about the numerous functional elements within the genome. This endeavor coincided with the emergence of numerous novel technologies, accompanied by the provision of vast amounts of whole-genome sequences, high-throughput data such as ChIP-Seq and RNA-Seq. Extracting biologically meaningful information from this massive dataset has become a critical aspect of many recent studies, particularly in annotating and predicting the functions of unknown genes. The core idea behind genome annotation is to identify genes and various functional elements within the genome sequence and infer their biological functions. Traditional wet-lab experimental methods still rely on extensive efforts for functional verification. However, early bioinformatics algorithms and software primarily employed shallow learning techniques; thus, the ability to characterize data and features learning was limited. With the widespread adoption of RNA-Seq technology, scientists from the biological community began to harness the potential of machine learning and deep learning approaches for gene structure prediction and functional annotation. In this context, we reviewed both conventional methods and contemporary deep learning frameworks, and highlighted novel perspectives on the challenges arising during annotation underscoring the dynamic nature of this evolving scientific landscape.
Affiliation(s)
- Zhaojia Chen
  - National Key Laboratory for Tropical Crop Breeding, Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangzhou 518120, China
  - College of Biomedical Engineering, Taiyuan University of Technology, Jinzhong 030600, China
- Noor ul Ain
  - National Key Laboratory for Tropical Crop Breeding, Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangzhou 518120, China
- Qian Zhao
  - State Key Laboratory for Ecological Pest Control of Fujian/Taiwan Crops and College of Life Science, Fujian Agriculture and Forestry University, Fuzhou, 350002, China
- Xingtan Zhang
  - National Key Laboratory for Tropical Crop Breeding, Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangzhou 518120, China
15
Li Y, Liu Y, Zheng J, Wu B, Cui X, Xu W, Zhu C, Qiu Q, Wang K. A chromosome-level genome assembly of the pig-nosed turtle (Carettochelys insculpta). Sci Data 2024; 11:311. [PMID: 38521795 PMCID: PMC10960847 DOI: 10.1038/s41597-024-03157-8] [Received: 11/28/2023] [Accepted: 03/15/2024] [Indexed: 03/25/2024] Open
Abstract
The pig-nosed turtle (Carettochelys insculpta), the only extant species within the family Carettochelyidae, is a unique Trionychia member that is fully adapted to aquatic life and currently endangered. To enhance our understanding of this species and contribute to its conservation efforts, we employed high-fidelity (HiFi) and Hi-C sequencing technology to generate its genome assembly at the chromosome level. The assembly spans 2.18 Gb, with a contig N50 of 126 Mb, encompassing 34 chromosomes that account for 99.6% of the genome. The assembly has a BUSCO score above 95% with different databases and strong collinearity with the Yangtze giant softshell turtle (Rafetus swinhoei), indicating its completeness and continuity. A total of 19,175 genes and 46.86% repetitive sequences were annotated. The availability of this chromosome-scale genome represents a valuable resource for the pig-nosed turtle, providing insights into its aquatic adaptation and serving as a foundation for future turtle research.
Affiliation(s)
- Ye Li
  - Shaanxi Key Laboratory of Qinling Ecological Intelligent Monitoring and Protection, School of Ecology and Environment, Northwestern Polytechnical University, Xi'an, 710072, China
- Yuxuan Liu
  - Shaanxi Key Laboratory of Qinling Ecological Intelligent Monitoring and Protection, School of Ecology and Environment, Northwestern Polytechnical University, Xi'an, 710072, China
- Jiangmin Zheng
  - Shaanxi Key Laboratory of Qinling Ecological Intelligent Monitoring and Protection, School of Ecology and Environment, Northwestern Polytechnical University, Xi'an, 710072, China
- Baosheng Wu
  - Shaanxi Key Laboratory of Qinling Ecological Intelligent Monitoring and Protection, School of Ecology and Environment, Northwestern Polytechnical University, Xi'an, 710072, China
  - Guangdong Key Laboratory of Animal Conservation and Resource Utilization, Institute of Zoology, Guangdong Academy of Sciences, Guangzhou, 510260, China
- Xinxin Cui
  - Shaanxi Key Laboratory of Qinling Ecological Intelligent Monitoring and Protection, School of Ecology and Environment, Northwestern Polytechnical University, Xi'an, 710072, China
- Wenjie Xu
  - Shaanxi Key Laboratory of Qinling Ecological Intelligent Monitoring and Protection, School of Ecology and Environment, Northwestern Polytechnical University, Xi'an, 710072, China
- Chenglong Zhu
  - Shaanxi Key Laboratory of Qinling Ecological Intelligent Monitoring and Protection, School of Ecology and Environment, Northwestern Polytechnical University, Xi'an, 710072, China
- Qiang Qiu
  - Shaanxi Key Laboratory of Qinling Ecological Intelligent Monitoring and Protection, School of Ecology and Environment, Northwestern Polytechnical University, Xi'an, 710072, China.
- Kun Wang
  - Shaanxi Key Laboratory of Qinling Ecological Intelligent Monitoring and Protection, School of Ecology and Environment, Northwestern Polytechnical University, Xi'an, 710072, China.
16
Wu E, Mallawaarachchi V, Zhao J, Yang Y, Liu H, Wang X, Shen C, Lin Y, Qiao L. Contigs directed gene annotation (ConDiGA) for accurate protein sequence database construction in metaproteomics. Microbiome 2024; 12:58. [PMID: 38504332 PMCID: PMC10949615 DOI: 10.1186/s40168-024-01775-3] [Received: 10/11/2023] [Accepted: 02/05/2024] [Indexed: 03/21/2024]
Abstract
BACKGROUND Microbiota are closely associated with human health and disease. Metaproteomics can provide a direct means to identify microbial proteins in microbiota for compositional and functional characterization. However, in-depth and accurate metaproteomics is still limited due to the extreme complexity and high diversity of microbiota samples. It is generally recommended to use metagenomic data from the same samples to construct the protein sequence database for metaproteomic data analysis. Although different metagenomics-based database construction strategies have been developed, an optimization of gene taxonomic annotation, which is extremely important for accurate metaproteomic analysis, has not been reported. RESULTS Herein, we proposed an accurate taxonomic annotation pipeline for genes from metagenomic data, namely contigs directed gene annotation (ConDiGA), and used the method to build a protein sequence database for metaproteomic analysis. We compared our pipeline (ConDiGA or MD3) with two other popular annotation pipelines (MD1 and MD2). In MD1, genes were directly annotated against the whole bacterial genome database; in MD2, contigs were annotated against the whole bacterial genome database and the taxonomic information of contigs was assigned to the genes; in MD3, the most confident species from the contigs annotation results were taken as reference to annotate genes. Annotation tools, including BLAST, Kaiju, and Kraken2, were compared. Based on a synthetic microbial community of 12 species, it was found that Kaiju with the MD3 pipeline outperformed the others in the construction of a protein sequence database from metagenomic data. Similar performance was also observed with a fecal sample, as well as in silico mixed datasets of the simulated microbial community and the fecal sample. CONCLUSIONS Overall, we developed an optimized pipeline for gene taxonomic annotation to construct protein sequence databases. Our study can tackle the current taxonomic annotation reliability problem in metagenomics-derived protein sequence databases and can promote the in-depth metaproteomic analysis of the microbiome. The unique metagenomic and metaproteomic datasets of the 12 bacterial species are publicly available as a standard benchmarking sample for evaluating various analysis pipelines. The code of ConDiGA is open access at GitHub for the analysis of microbiota samples.
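The difference between the three annotation strategies named in this abstract (MD1, MD2, MD3) can be illustrated with a toy sketch. This is not the ConDiGA code: the function names, the data layout (per-gene lists of `(species, score)` hits, a contig-to-species map), and the `min_contigs` rule used to pick "most confident" species are all assumptions made for illustration only.

```python
from collections import Counter


def md1_annotate(gene_hits):
    """MD1-style: each gene takes its best hit against the whole database."""
    return {
        gene: max(hits, key=lambda hit: hit[1])[0]
        for gene, hits in gene_hits.items()
        if hits
    }


def md2_annotate(gene_to_contig, contig_taxon):
    """MD2-style: each gene inherits the taxon assigned to its contig."""
    return {
        gene: contig_taxon[contig]
        for gene, contig in gene_to_contig.items()
        if contig in contig_taxon
    }


def md3_annotate(gene_hits, contig_taxon, min_contigs=2):
    """MD3-style: restrict gene annotation to species well supported by contigs.

    Assumption: a species counts as "confident" when at least `min_contigs`
    contigs were assigned to it. The real criterion may differ.
    """
    support = Counter(contig_taxon.values())
    reference = {species for species, n in support.items() if n >= min_contigs}
    annotated = {}
    for gene, hits in gene_hits.items():
        allowed = [hit for hit in hits if hit[0] in reference]
        if allowed:
            annotated[gene] = max(allowed, key=lambda hit: hit[1])[0]
    return annotated


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the study.
    contig_taxon = {"c1": "A", "c2": "A", "c3": "B", "c4": "B", "c5": "C"}
    gene_to_contig = {"g1": "c1", "g2": "c3", "g3": "c5"}
    gene_hits = {
        "g1": [("A", 90.0), ("C", 95.0)],
        "g2": [("B", 80.0)],
        "g3": [("C", 99.0)],
    }
    print(md1_annotate(gene_hits))
    print(md2_annotate(gene_to_contig, contig_taxon))
    print(md3_annotate(gene_hits, contig_taxon))
```

In this toy setup, restricting the reference set changes which species a gene is assigned to when its top database hit belongs to a species with weak contig support, which is the kind of effect the MD3 strategy is described as targeting.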
Affiliation(s)
- Enhui Wu
  - Department of Chemistry, and Shanghai Stomatological Hospital, Fudan University, Shanghai, 200000, China
- Vijini Mallawaarachchi
  - School of Computing, College of Engineering, Computing and Cybernetics, The Australian National University, Canberra, ACT, 2600, Australia
  - Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Bedford Park, SA, 5042, Australia
- Jinzhi Zhao
  - Department of Chemistry, and Shanghai Stomatological Hospital, Fudan University, Shanghai, 200000, China
- Yi Yang
  - Department of Chemistry, and Shanghai Stomatological Hospital, Fudan University, Shanghai, 200000, China
- Hebin Liu
  - Shanghai Omicsolution Co., Ltd, Shanghai, 200000, China
- Xiaoqing Wang
  - Shanghai Omicsolution Co., Ltd, Shanghai, 200000, China
- Chengpin Shen
  - Shanghai Omicsolution Co., Ltd, Shanghai, 200000, China
- Yu Lin
  - School of Computing, College of Engineering, Computing and Cybernetics, The Australian National University, Canberra, ACT, 2600, Australia
- Liang Qiao
  - Department of Chemistry, and Shanghai Stomatological Hospital, Fudan University, Shanghai, 200000, China.
17
Degalez F, Charles M, Foissac S, Zhou H, Guan D, Fang L, Klopp C, Allain C, Lagoutte L, Lecerf F, Acloque H, Giuffra E, Pitel F, Lagarrigue S. Enriched atlas of lncRNA and protein-coding genes for the GRCg7b chicken assembly and its functional annotation across 47 tissues. Sci Rep 2024; 14:6588. [PMID: 38504112 PMCID: PMC10951430 DOI: 10.1038/s41598-024-56705-y] [Received: 07/31/2023] [Accepted: 03/09/2024] [Indexed: 03/21/2024] Open
Abstract
Gene atlases for livestock are steadily improving thanks to new genome assemblies and new expression data improving the gene annotation. However, gene content varies across databases due to differences in RNA sequencing data and bioinformatics pipelines, especially for long non-coding RNAs (lncRNAs), which have higher tissue and developmental specificity and are harder to consistently identify compared to protein-coding genes (PCGs). As done previously in 2020 for chicken assemblies galgal5 and GRCg6a, we provide a new lncRNA-enriched gene atlas for the latest GRCg7b chicken assembly, integrating "NCBI RefSeq" and "EMBL-EBI Ensembl/GENCODE" reference annotations and other resources such as FAANG and NONCODE. As a result, the number of PCGs increases from 18,022 (RefSeq) and 17,007 (Ensembl) to 24,102, and that of lncRNAs from 5789 (RefSeq) and 11,944 (Ensembl) to 44,428. Using 1400 public RNA-seq transcriptomes representing 47 tissues, we provided expression evidence for 35,257 (79%) lncRNAs and 22,468 (93%) PCGs, supporting the relevance of this atlas. Further characterization, including tissue-specificity, sex-differential expression and gene configurations, is provided. We also identified conserved miRNA-hosting genes with human counterparts, suggesting common function. The annotated atlas is available at gega.sigenae.org.
Affiliation(s)
- Fabien Degalez
  - PEGASE, INRAE, Institut Agro, 35590, Saint Gilles, France
- Mathieu Charles
  - INRAE, BioinfOmics, GenoToul Bioinformatics facility, Sigenae, Université Fédérale de Toulouse, 31326, Castanet-Tolosan, France
  - INRAE, AgroParisTech, GABI, Paris-Saclay University, 78350, Jouy-en-Josas, France
- Sylvain Foissac
  - GenPhySE, Université de Toulouse, INRAE, ENVT, 31326, Castanet-Tolosan, France
- Dailu Guan
  - University of California Davis, Davis, USA
- Christophe Klopp
  - INRAE, BioinfOmics, GenoToul Bioinformatics facility, Sigenae, Université Fédérale de Toulouse, 31326, Castanet-Tolosan, France
- Coralie Allain
  - PEGASE, INRAE, Institut Agro, 35590, Saint Gilles, France
- Hervé Acloque
  - INRAE, AgroParisTech, GABI, Paris-Saclay University, 78350, Jouy-en-Josas, France
- Elisabetta Giuffra
  - INRAE, AgroParisTech, GABI, Paris-Saclay University, 78350, Jouy-en-Josas, France
- Frédérique Pitel
  - GenPhySE, Université de Toulouse, INRAE, ENVT, 31326, Castanet-Tolosan, France
18
Pan A, Shentu J, Zeng Y, Guo R, Yu Y. Identification of homologous protein models via 3D comparisons using predicted structures. STAR Protoc 2024; 5:102814. [PMID: 38183654 PMCID: PMC10789644 DOI: 10.1016/j.xpro.2023.102814] [Received: 08/29/2023] [Revised: 11/14/2023] [Accepted: 12/18/2023] [Indexed: 01/08/2024] Open
Abstract
Recent advances in protein structure prediction enable 3D homology alignment and domain annotation using tertiary structures. Here, we present a protocol to identify homologous structures and annotate protein domains through in silico comparisons using the AlphaFold database. We describe steps for downloading and installing PyMOL software, preparing the query structure, and conducting a 3D homology search. The example provided highlights the application of this protocol in reevaluating an mpox viral protein annotation. For complete details on the use and execution of this protocol, please refer to Pan et al. (2023).
Affiliation(s)
- Anyu Pan
  - State Key Laboratory of Common Mechanism Research for Major Diseases, Department of Molecular Biology and Biochemistry, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100005, China
- Jieyi Shentu
  - State Key Laboratory of Common Mechanism Research for Major Diseases, Department of Molecular Biology and Biochemistry, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100005, China
- Yangfan Zeng
  - State Key Laboratory of Common Mechanism Research for Major Diseases, Department of Molecular Biology and Biochemistry, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100005, China
- Rong Guo
  - State Key Laboratory of Common Mechanism Research for Major Diseases, Department of Molecular Biology and Biochemistry, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100005, China
- Yang Yu
  - State Key Laboratory of Common Mechanism Research for Major Diseases, Department of Molecular Biology and Biochemistry, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100005, China.
19
Sharma B, Sharma S, Medicherla KM, Reddy SM. Genome Sequence Analysis of Calcifying Bacteria Bacillus paranthracis CT5 and Its Biomineralization Efficacy to Improve the Strength and Durability Properties of Civil Structures. Curr Microbiol 2024; 81:109. [PMID: 38466427 DOI: 10.1007/s00284-024-03625-9] [Received: 10/28/2023] [Accepted: 01/24/2024] [Indexed: 03/13/2024]
Abstract
Bacteria producing urea amidohydrolases (UA) and carbonic anhydrases (CA) are of great importance in civil engineering as these enzymes are responsible for microbially induced calcium carbonate precipitation (MICCP). In this investigation, genomic insights into Bacillus paranthracis CT5 and the expression of genes underlying MICCP were studied. B. paranthracis produced a maximum level of UA (669.3 U/ml) and CA (125 U/ml) on the 5th day of incubation and precipitated 197 mg/100 ml CaCO3 after 7 days of incubation. After 28 days of curing, the compressive strength of bacterial admixed and bacterial cured (B-B) specimens was 13.7% higher compared to water-mixed and water-cured (W-W) specimens. A significant decrease in water absorption was observed in bacterial-cured specimens compared to water-cured specimens after 28 days of curing. For genome analysis, reads were assembled de novo, producing a 5,402,771 bp assembly with an N50 of 273,050 bp. RAST annotation detected six amidohydrolase and three carbonic anhydrase genes. Among the 5700 coding sequences found in the genome, COG gene annotation grouped 4360 genes into COG categories, with the highest numbers of genes assigned to transcription (435 genes) and amino acid transport and metabolism (362 genes), along with cell wall/membrane/envelope biogenesis and ion transport and metabolism. KEGG functional classification predicted 223 pathways consisting of 1,960 genes, with the highest numbers of genes belonging to the two-component system (101 genes) and ABC transporter pathways (98 genes), enabling the bacteria to sense and respond to environmental signals and to actively transport the various minerals and organic molecules required for MICCP.
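Several abstracts in this list report an N50 value as a contiguity measure. As a reminder of what that statistic means, here is a minimal sketch of the commonly used definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. The function name `n50` and the example lengths are illustrative assumptions; the cited studies do not state which tool computed their values.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Sequences are visited from longest to shortest; the N50 is the length
    of the sequence at which the running total first reaches at least half
    of the summed length of all sequences.
    """
    if not lengths:
        raise ValueError("at least one sequence length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare doubled running sum to avoid fractional halves.
        if running * 2 >= total:
            return length


if __name__ == "__main__":
    # Hypothetical contig lengths in bp, not taken from any cited assembly.
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

A higher N50 for a similar total length generally indicates that the assembly consists of fewer, longer pieces, which is why it is reported alongside completeness measures such as BUSCO.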
Affiliation(s)
- Bhavdeep Sharma
- Department of Biotechnology, Thapar Institute of Engineering & Technology, Patiala, Punjab, 147004, India
| | - Shruti Sharma
- Department of Civil Engineering, Thapar Institute of Engineering & Technology, Patiala, Punjab, 147004, India
| | | | - Sudhakara M Reddy
- Department of Biotechnology, Thapar Institute of Engineering & Technology, Patiala, Punjab, 147004, India.
|
20
|
Fuchs LIR, Knobloch J, Wiesenthal AA, Fuss J, Franzenburg S, Torres Oliva M, Müller C, Wheat CW, Hildebrandt JP. A draft genome of the neritid snail Theodoxus fluviatilis. G3 (Bethesda) 2024; 14:jkad282. [PMID: 38069680 PMCID: PMC10917513 DOI: 10.1093/g3journal/jkad282] [Received: 07/24/2023] [Accepted: 12/01/2023] [Indexed: 03/08/2024]
Abstract
The neritid snail Theodoxus fluviatilis is found across habitats differing in salinity, from shallow waters along the coast of the Baltic Sea to lakes throughout Europe. Living close to the water surface makes this species vulnerable to changes in salinity in their natural habitat, and the lack of a free-swimming larval stage limits this species' dispersal. Together, these factors have resulted in a patchy distribution of quite isolated populations differing in their salinity tolerances. In preparation for investigating the mechanisms underlying the physiological differences in osmoregulation between populations that cannot be explained solely by phenotypic plasticity, we present here an annotated draft genome assembly for T. fluviatilis, generated using PacBio long reads, Illumina short reads, and transcriptomic data. While the total assembly size (1045 kb) is similar to those of related species, it remains highly fragmented (N scaffolds = 35,695; N50 = 74 kb) though moderately high in complete gene content (BUSCO single copy complete: 74.3%, duplicate: 2.6%, fragmented: 10.6%, missing: 12.5% using metazoa n = 954). Nevertheless, we were able to generate gene annotations of 21,220 protein-coding genes (BUSCO single copy complete: 65.1%, duplicate: 16.7%, fragmented: 9.1%, missing: 9.1% using metazoa n = 954). Not only will this genome facilitate comparative evolutionary studies across Gastropoda, as this is the first genome assembly for the basal snail family Neritidae, it will also greatly facilitate the study of salinity tolerance in this species. Additionally, we discuss the challenges of working with a species where high molecular weight DNA isolation is very difficult.
Affiliation(s)
- Laura Iris Regina Fuchs
- Animal Physiology and Biochemistry, Zoological Institute and Museum, University of Greifswald, Felix Hausdorff-Strasse 1, D - 17489 Greifswald, Germany
| | - Jan Knobloch
- Animal Physiology and Biochemistry, Zoological Institute and Museum, University of Greifswald, Felix Hausdorff-Strasse 1, D - 17489 Greifswald, Germany
| | - Amanda Alice Wiesenthal
- Animal Physiology and Biochemistry, Zoological Institute and Museum, University of Greifswald, Felix Hausdorff-Strasse 1, D - 17489 Greifswald, Germany
- Marine Biology, University of Rostock, Albert-Einstein-Straße 3, D - 18059 Rostock, Germany
| | - Janina Fuss
- Institute of Clinical Molecular Biology, Kiel University (CAU), University Hospital Schleswig Holstein, Rosalind-Franklin-Strasse 12, D - 24105 Kiel, Germany
| | - Soeren Franzenburg
- Institute of Clinical Molecular Biology, Kiel University (CAU), University Hospital Schleswig Holstein, Rosalind-Franklin-Strasse 12, D - 24105 Kiel, Germany
| | - Montserrat Torres Oliva
- Institute of Clinical Molecular Biology, Kiel University (CAU), University Hospital Schleswig Holstein, Rosalind-Franklin-Strasse 12, D - 24105 Kiel, Germany
| | - Christian Müller
- Animal Physiology and Biochemistry, Zoological Institute and Museum, University of Greifswald, Felix Hausdorff-Strasse 1, D - 17489 Greifswald, Germany
| | - Christopher W Wheat
- Department of Zoology, Stockholm University, Svante Arrheniusväg 18 B, S-10691 Stockholm, Sweden
| | - Jan-Peter Hildebrandt
- Animal Physiology and Biochemistry, Zoological Institute and Museum, University of Greifswald, Felix Hausdorff-Strasse 1, D - 17489 Greifswald, Germany
|
21
|
Ostevik KL, Alabady M, Zhang M, Rausher MD. Whole-genome sequence and annotation of Penstemon davidsonii. G3 (Bethesda) 2024; 14:jkad296. [PMID: 38155402 PMCID: PMC10917496 DOI: 10.1093/g3journal/jkad296] [Received: 10/09/2023] [Revised: 11/30/2023] [Accepted: 12/01/2023] [Indexed: 12/30/2023]
Abstract
Penstemon is the most speciose flowering plant genus endemic to North America. Penstemon species' diverse morphology and adaptation to various environments have made them a valuable model system for studying evolution. Here, we report the first full reference genome assembly and annotation for Penstemon davidsonii. Using PacBio long-read sequencing and Hi-C scaffolding technology, we constructed a de novo reference genome of 437,568,744 bases, with a contig N50 of 40 Mb and L50 of 5. The annotation includes 18,199 gene models, and both the genome and transcriptome assembly contain over 95% complete eudicot BUSCOs. This genome assembly will serve as a valuable reference for studying the evolutionary history and genetic diversity of the Penstemon genus.
Affiliation(s)
- Kate L Ostevik
- Department of Evolution, Ecology, and Organismal Biology, University of California Riverside, Riverside, CA 92521, USA
- Department of Biology, Duke University, Durham, NC 27708, USA
| | - Magdy Alabady
- Department of Plant Biology, University of Georgia, Athens, GA 30602, USA
| | - Mengrui Zhang
- Department of Statistics, University of Georgia, Athens, GA 30602, USA
| | - Mark D Rausher
- Department of Biology, Duke University, Durham, NC 27708, USA
|
22
|
Wang J, Zhang Q, Tung J, Zhang X, Liu D, Deng Y, Tian Z, Chen H, Wang T, Yin W, Li B, Lai Z, Dinesh-Kumar SP, Baker B, Li F. High-quality assembled and annotated genomes of Nicotiana tabacum and Nicotiana benthamiana reveal chromosome evolution and changes in defense arsenals. Mol Plant 2024; 17:423-437. [PMID: 38273657 DOI: 10.1016/j.molp.2024.01.008] [Received: 09/14/2023] [Revised: 01/08/2024] [Accepted: 01/21/2024] [Indexed: 01/27/2024]
Abstract
Nicotiana tabacum and Nicotiana benthamiana are widely used models in plant biology research. However, genomic studies of these species have lagged. Here we report the chromosome-level reference genome assemblies for N. benthamiana and N. tabacum with an estimated 99.5% and 99.8% completeness, respectively. Sensitive transcription start and termination site sequencing methods were developed and used for accurate gene annotation in N. tabacum. Comparative analyses revealed evidence for the parental origins and chromosome structural changes, leading to hybrid genome formation of each species. Interestingly, the antiviral silencing genes RDR1, RDR6, DCL2, DCL3, and AGO2 were lost from one or both subgenomes in N. benthamiana, while both homeologs were kept in N. tabacum. Furthermore, the N. benthamiana genome encodes fewer immune receptors and signaling components than that of N. tabacum. These findings uncover possible reasons underlying the hypersusceptible nature of N. benthamiana. We developed the user-friendly Nicomics (http://lifenglab.hzau.edu.cn/Nicomics/) web server to facilitate better use of Nicotiana genomic resources as well as gene structure and expression analyses.
Affiliation(s)
- Jubin Wang
- National Key Laboratory for Germplasm Innovation and Utilization for Fruit and Vegetable Horticultural Crops, College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan, Hubei 430070, China; The Key Laboratory of Horticultural Plant Genetic and Improvement of Jiangxi Province, Institute of Biological Resources, Jiangxi Academy of Sciences, Nanchang 330299, China
| | - Qingling Zhang
- National Key Laboratory for Germplasm Innovation and Utilization for Fruit and Vegetable Horticultural Crops, College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan, Hubei 430070, China; Institute of Vegetables and Flowers, Jiangxi Academy of Agricultural Sciences, Nanchang 330200, China
| | - Jeffrey Tung
- Plant Gene Expression Center, Department of Plant and Microbial Biology, University of California, Berkeley, Berkeley, CA 94706, USA
| | - Xi Zhang
- National Key Laboratory for Germplasm Innovation and Utilization for Fruit and Vegetable Horticultural Crops, College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan, Hubei 430070, China
| | - Dan Liu
- National Key Laboratory for Germplasm Innovation and Utilization for Fruit and Vegetable Horticultural Crops, College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan, Hubei 430070, China
| | - Yingtian Deng
- National Key Laboratory for Germplasm Innovation and Utilization for Fruit and Vegetable Horticultural Crops, College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan, Hubei 430070, China
| | - Zhendong Tian
- National Key Laboratory for Germplasm Innovation and Utilization for Fruit and Vegetable Horticultural Crops, College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan, Hubei 430070, China; Hubei Hongshan Laboratory, Wuhan, Hubei 430070, China
| | - Huilan Chen
- National Key Laboratory for Germplasm Innovation and Utilization for Fruit and Vegetable Horticultural Crops, College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan, Hubei 430070, China
| | - Taotao Wang
- National Key Laboratory for Germplasm Innovation and Utilization for Fruit and Vegetable Horticultural Crops, College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan, Hubei 430070, China
| | - Weixiao Yin
- College of Plant Science and Technology, Huazhong Agricultural University, Wuhan, Hubei 430070, China
| | - Bo Li
- College of Plant Science and Technology, Huazhong Agricultural University, Wuhan, Hubei 430070, China; Hubei Hongshan Laboratory, Wuhan, Hubei 430070, China
| | - Zhibing Lai
- College of Life Science and Technology, Huazhong Agricultural University, Wuhan, Hubei 430070, China; Hubei Hongshan Laboratory, Wuhan, Hubei 430070, China
| | - Savithramma P Dinesh-Kumar
- Department of Plant Biology and The Genome Center, College of Biological Sciences, University of California, Davis, Davis, CA 95616, USA
| | - Barbara Baker
- Plant Gene Expression Center, Department of Plant and Microbial Biology, University of California, Berkeley, Berkeley, CA 94706, USA.
| | - Feng Li
- National Key Laboratory for Germplasm Innovation and Utilization for Fruit and Vegetable Horticultural Crops, College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan, Hubei 430070, China; Hubei Hongshan Laboratory, Wuhan, Hubei 430070, China.
|
23
|
Legeai F, Romain S, Capblancq T, Doniol-Valcroze P, Joron M, Lemaitre C, Després L. Chromosome-Level Assembly and Annotation of the Pearly Heath Coenonympha arcania Butterfly Genome. Genome Biol Evol 2024; 16:evae055. [PMID: 38491969 PMCID: PMC10980516 DOI: 10.1093/gbe/evae055] [Received: 12/21/2023] [Revised: 03/07/2024] [Accepted: 03/13/2024] [Indexed: 03/18/2024] Open
Abstract
We present the first chromosome-level genome assembly and annotation of the pearly heath Coenonympha arcania, generated with a PacBio HiFi sequencing approach and complemented with Hi-C data. We additionally compare synteny, gene, and repeat content between C. arcania and other Lepidopteran genomes. This reference genome will enable future population genomics studies with Coenonympha butterflies, a species-rich genus that encompasses some of the most highly endangered butterfly taxa in Europe.
Affiliation(s)
- Fabrice Legeai
- Inria, CNRS, IRISA, University of Rennes, 35000 Rennes, France
- IGEPP, INRAE, Institut Agro, University of Rennes, 35653 Le Rheu, France
| | - Sandra Romain
- Inria, CNRS, IRISA, University of Rennes, 35000 Rennes, France
| | - Thibaut Capblancq
- LECA, CNRS, Université Grenoble-Alpes, Université Savoie Mont Blanc, Grenoble, France
| | | | - Mathieu Joron
- CEFE, CNRS, EPHE, IRD, Université de Montpellier, Montpellier, France
| | - Claire Lemaitre
- Inria, CNRS, IRISA, University of Rennes, 35000 Rennes, France
| | - Laurence Després
- LECA, CNRS, Université Grenoble-Alpes, Université Savoie Mont Blanc, Grenoble, France
|
24
|
Murugesan SN, Tian S, Monteiro A. Genome Assembly and Annotation of the Dark-Branded Bushbrown Butterfly Mycalesis mineus (Nymphalidae: Satyrinae). Genome Biol Evol 2024; 16:evae051. [PMID: 38505885 PMCID: PMC10972688 DOI: 10.1093/gbe/evae051] [Received: 11/22/2023] [Revised: 02/28/2024] [Accepted: 03/13/2024] [Indexed: 03/21/2024] Open
Abstract
We report a high-quality draft genome assembly of the dark-branded bushbrown, Mycalesis mineus, a member of the Satyrinae subfamily of nymphalid butterflies. This species is emerging as a promising model organism for investigating the evolution and development of phenotypic plasticity. Using 45.99 Gb of long-read data (N50 = 11.11 kb), we assembled a 497.4 Mb genome for M. mineus. The assembly is highly contiguous and nearly complete (96.8% of Benchmarking Universal Single-Copy Orthologs lepidopteran genes were complete and single copy). The genome comprises 38.71% repetitive elements and includes 20,967 predicted protein-coding genes. The assembled genome was super-scaffolded into 28 pseudo-chromosomes using the chromosome-level genome of a closely related species, Bicyclus anynana, as a template. This valuable genomic tool will advance both ongoing and future research focused on this model organism.
Affiliation(s)
| | - Shen Tian
- Department of Biological Sciences, National University of Singapore, Singapore 117558, Singapore
| | - Antónia Monteiro
- Department of Biological Sciences, National University of Singapore, Singapore 117558, Singapore
|
25
|
Wu S, Wang K, Dou T, Yuan S, Wu DD, Ge C, Jia J, Su Z. High-quality genome assembly of a C. crossoptilon and related functional and genetics data resources. Sci Data 2024; 11:247. [PMID: 38413610 PMCID: PMC10899641 DOI: 10.1038/s41597-024-03087-5] [Received: 12/04/2023] [Accepted: 02/21/2024] [Indexed: 02/29/2024] Open
Abstract
There are four species in the Crossoptilon genus, inhabiting altitudes from very low to very high across China, and they face varying levels of extinction risk. To better understand the genetic basis of adaptation to high altitudes and genetic changes due to bottlenecks, we assembled the genome (~1.02 Gb) of a white eared pheasant (WT) (Crossoptilon crossoptilon) inhabiting high altitudes (3,000~7,000 m) in the northwest of Yunnan Province, China, using a combination of Illumina short reads, PacBio long reads and Hi-C reads, with a contig N50 of 19.63 Mb and only six gaps. To further provide resources for gene annotation as well as functional and population genetics analyses, we sequenced transcriptomes of 20 major tissues of the WT individual and re-sequenced another 10 WT individuals and a blue eared pheasant (Crossoptilon auritum) individual inhabiting intermediate altitudes (1,500~3,000 m). Our assembled WT genome, transcriptome data, and DNA sequencing data can be valuable resources for studying the biology and evolution of these endangered species and for developing conservation strategies for them.
Affiliation(s)
- Siwen Wu
- Department of Bioinformatics and Genomics, College of Computing and Informatics, the University of North Carolina at Charlotte, Charlotte, NC, 28223, USA
| | - Kun Wang
- Faculty of Animal Science and Technology, Yunnan Agricultural University, Kunming, 650201, Yunnan, China
- State Key Laboratory of Genetic Resources and Evolution/Key Laboratory of Healthy Aging Research of Yunnan Province, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, Yunnan, China
| | - Tengfei Dou
- Faculty of Animal Science and Technology, Yunnan Agricultural University, Kunming, 650201, Yunnan, China
| | - Sisi Yuan
- Department of Bioinformatics and Genomics, College of Computing and Informatics, the University of North Carolina at Charlotte, Charlotte, NC, 28223, USA
| | - Dong-Dong Wu
- State Key Laboratory of Genetic Resources and Evolution/Key Laboratory of Healthy Aging Research of Yunnan Province, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, Yunnan, China
| | - Changrong Ge
- Faculty of Animal Science and Technology, Yunnan Agricultural University, Kunming, 650201, Yunnan, China.
| | - Junjing Jia
- Faculty of Animal Science and Technology, Yunnan Agricultural University, Kunming, 650201, Yunnan, China.
| | - Zhengchang Su
- Department of Bioinformatics and Genomics, College of Computing and Informatics, the University of North Carolina at Charlotte, Charlotte, NC, 28223, USA.
|
26
|
Wei Z, Zhang L, Gao L, Chen J, Peng L, Yang L. Chromosome-level genome assembly and annotation of the Yunling cattle with PacBio and Hi-C sequencing data. Sci Data 2024; 11:233. [PMID: 38395911 PMCID: PMC10891105 DOI: 10.1038/s41597-024-03066-w] [Received: 06/19/2023] [Accepted: 02/13/2024] [Indexed: 02/25/2024] Open
Abstract
Yunling cattle is a new breed of beef cattle developed in Yunnan Province, China, by crossing Brahman, Murray Grey and Yunnan Yellow cattle. Yunling cattle adapt to tropical and subtropical climates and show good reproductive ability and growth speed under high-temperature, high-humidity conditions; they also have strong resistance to internal and external parasites and good beef performance. In this study, we generated a high-quality chromosome-level genome assembly of a male Yunling cattle using a combination of short-read sequencing, PacBio HiFi sequencing and Hi-C scaffolding technologies. The genome assembly (3.09 Gb) is anchored to 31 chromosomes (29 autosomes plus X and Y), with a contig N50 of 35.97 Mb and a scaffold N50 of 112.01 Mb. It contains 1.62 Gb of repetitive sequences and 20,660 protein-coding genes. This first construction of the Yunling cattle genome provides a valuable genetic resource that will facilitate further study of the genetic diversity of bovine species and accelerate Yunling cattle breeding efforts.
Affiliation(s)
- Zaichao Wei
- College of Food Science and Technology, Yunnan Agricultural University, Kunming, China
- College of Big Data, Baoshan University, Baoshan, China
| | - Lilian Zhang
- College of Big Data, Yunnan Agricultural University, Kunming, China
- Yunnan Engineering Technology Research Center of Agricultural Big Data, Kunming, China
- Yunnan Engineering Research Center for Big Data Intelligent Information Processing of Green Agricultural Products, Kunming, China
| | - Lutao Gao
- College of Big Data, Yunnan Agricultural University, Kunming, China
- Yunnan Engineering Technology Research Center of Agricultural Big Data, Kunming, China
- Yunnan Engineering Research Center for Big Data Intelligent Information Processing of Green Agricultural Products, Kunming, China
| | - Jian Chen
- College of Big Data, Yunnan Agricultural University, Kunming, China
- Yunnan Engineering Technology Research Center of Agricultural Big Data, Kunming, China
- Yunnan Engineering Research Center for Big Data Intelligent Information Processing of Green Agricultural Products, Kunming, China
| | - Lin Peng
- College of Big Data, Yunnan Agricultural University, Kunming, China
- Yunnan Engineering Technology Research Center of Agricultural Big Data, Kunming, China
- Yunnan Engineering Research Center for Big Data Intelligent Information Processing of Green Agricultural Products, Kunming, China
| | - Linnan Yang
- College of Big Data, Yunnan Agricultural University, Kunming, China.
- Yunnan Engineering Technology Research Center of Agricultural Big Data, Kunming, China.
- Yunnan Engineering Research Center for Big Data Intelligent Information Processing of Green Agricultural Products, Kunming, China.
|
27
|
Liang Y, Xian L, Pan J, Zhu K, Guo H, Liu B, Zhang N, Ou-Yang Y, Zhang Q, Zhang D. De Novo Genome Assembly of the Whitespot Parrotfish (Scarus forsteni): A Valuable Scaridae Genomic Resource. Genes (Basel) 2024; 15:249. [PMID: 38397238 PMCID: PMC10888354 DOI: 10.3390/genes15020249] [Received: 12/26/2023] [Revised: 02/01/2024] [Accepted: 02/09/2024] [Indexed: 02/25/2024] Open
Abstract
Scarus forsteni, a whitespot parrotfish from the Scaridae family, is a herbivorous fish inhabiting coral reef ecosystems. The deterioration of coral reefs has severely affected the habitats of the parrotfish. The decline in genetic diversity of parrotfish emphasizes the critical importance of conserving their genetic variability to ensure the resilience and sustainability of marine ecosystems for future generations. In this study, a genome of S. forsteni was assembled de novo using Illumina and Nanopore sequencing. The 1.71-Gb genome of S. forsteni was assembled into 544 contigs (assembly level: contig). It exhibited an N50 length of 17.97 Mb and a GC content of 39.32%. Our BUSCO analysis indicated 98.10% completeness for the S. forsteni genome. Combined with structure annotation data, 34,140 (74.81%) genes were functionally annotated out of 45,638 predicted protein-coding genes. Upon comparing the genome size and TE content of teleost fishes, a roughly linear relationship was observed between these two parameters. However, TE content is not a decisive factor in determining the genome size of S. forsteni. Population history analysis results indicate that S. forsteni experienced two major population expansions, both of which occurred before the last interglacial period. In addition, a comparative genomic analysis of evolutionary relationships with other species found that S. forsteni had the closest relationship with Cheilinus undulatus, another member of the Labridae family. Our expansion and contraction analysis of gene families showed that the expanded genes were mainly associated with immune diseases, organismal systems, and cellular processes. At the same time, cell transcription and translation, sex hormone regulation, and other related pathways were also more prominent among the positively selected genes. The genomic sequence of S. forsteni offers valuable resources for future investigations on the conservation, evolution, and behavior of fish species.
Affiliation(s)
- Yu Liang
- Guangxi Marine Microbial Resources Industrialization Engineering Technology Research Center, Guangxi Key Laboratory for Polysaccharide Materials and Modifications, School of Marine Sciences and Biotechnology, Guangxi Minzu University, 158 University Road, Nanning 530008, China
- Chinese Academy of Fishery Sciences, Key Laboratory of South China Sea Fishery Resources Exploitation and Utilization, Ministry of Agriculture and Rural Affairs, South China Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Guangzhou 510300, China
| | - Lin Xian
- Chinese Academy of Fishery Sciences, Key Laboratory of South China Sea Fishery Resources Exploitation and Utilization, Ministry of Agriculture and Rural Affairs, South China Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Guangzhou 510300, China
- Sanya Tropical Fisheries Research Institute, Sanya 572018, China
- Guangdong Provincial Engineer Technology Research Center of Marine Biological Seed Industry, Guangzhou 510300, China
| | - Jinmin Pan
- Chinese Academy of Fishery Sciences, Key Laboratory of South China Sea Fishery Resources Exploitation and Utilization, Ministry of Agriculture and Rural Affairs, South China Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Guangzhou 510300, China
| | - Kecheng Zhu
- Chinese Academy of Fishery Sciences, Key Laboratory of South China Sea Fishery Resources Exploitation and Utilization, Ministry of Agriculture and Rural Affairs, South China Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Guangzhou 510300, China
- Sanya Tropical Fisheries Research Institute, Sanya 572018, China
- Guangdong Provincial Engineer Technology Research Center of Marine Biological Seed Industry, Guangzhou 510300, China
| | - Huayang Guo
- Chinese Academy of Fishery Sciences, Key Laboratory of South China Sea Fishery Resources Exploitation and Utilization, Ministry of Agriculture and Rural Affairs, South China Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Guangzhou 510300, China
- Sanya Tropical Fisheries Research Institute, Sanya 572018, China
- Guangdong Provincial Engineer Technology Research Center of Marine Biological Seed Industry, Guangzhou 510300, China
| | - Baosuo Liu
- Chinese Academy of Fishery Sciences, Key Laboratory of South China Sea Fishery Resources Exploitation and Utilization, Ministry of Agriculture and Rural Affairs, South China Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Guangzhou 510300, China
- Sanya Tropical Fisheries Research Institute, Sanya 572018, China
- Guangdong Provincial Engineer Technology Research Center of Marine Biological Seed Industry, Guangzhou 510300, China
| | - Nan Zhang
- Chinese Academy of Fishery Sciences, Key Laboratory of South China Sea Fishery Resources Exploitation and Utilization, Ministry of Agriculture and Rural Affairs, South China Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Guangzhou 510300, China
- Sanya Tropical Fisheries Research Institute, Sanya 572018, China
- Guangdong Provincial Engineer Technology Research Center of Marine Biological Seed Industry, Guangzhou 510300, China
| | - Yan Ou-Yang
- Chinese Academy of Fishery Sciences, Key Laboratory of South China Sea Fishery Resources Exploitation and Utilization, Ministry of Agriculture and Rural Affairs, South China Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Guangzhou 510300, China
| | - Qin Zhang
- Guangxi Marine Microbial Resources Industrialization Engineering Technology Research Center, Guangxi Key Laboratory for Polysaccharide Materials and Modifications, School of Marine Sciences and Biotechnology, Guangxi Minzu University, 158 University Road, Nanning 530008, China
| | - Dianchang Zhang
- Chinese Academy of Fishery Sciences, Key Laboratory of South China Sea Fishery Resources Exploitation and Utilization, Ministry of Agriculture and Rural Affairs, South China Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Guangzhou 510300, China
- Sanya Tropical Fisheries Research Institute, Sanya 572018, China
- Guangdong Provincial Engineer Technology Research Center of Marine Biological Seed Industry, Guangzhou 510300, China
|
28
|
Furumizu C, Tanizawa Y, Nakamura Y. Letter to the Editor: Genome Annotation Matters-From Genes to Phylogenetic Inferences. Plant Cell Physiol 2024; 65:181-184. [PMID: 38035794 DOI: 10.1093/pcp/pcad151] [Received: 10/02/2023] [Revised: 11/20/2023] [Accepted: 11/28/2023] [Indexed: 12/02/2023]
Affiliation(s)
- Chihiro Furumizu
- Natural Science Center for Basic Research and Development, Hiroshima University, 1-4-2 Kagamiyama, Higashi-Hiroshima, Hiroshima, 739-8527 Japan
- Graduate School of Integrated Sciences for Life, Hiroshima University, 1-3-1 Kagamiyama, Higashi-Hiroshima, Hiroshima, 739-8530 Japan
| | - Yasuhiro Tanizawa
- National Institute of Genetics, Research Organization of Information and Systems, 1111 Yata, Mishima, Shizuoka, 411-8540 Japan
| | - Yasukazu Nakamura
- National Institute of Genetics, Research Organization of Information and Systems, 1111 Yata, Mishima, Shizuoka, 411-8540 Japan
| |
Collapse
|
29
|
Li J, Ma H, Qin Y, Zhao Z, Niu Y, Lian J, Li J, Noor Z, Guo S, Yu Z, Zhang Y. Chromosome-level genome assembly and annotation of rare and endangered tropical bivalve, Tridacna crocea. Sci Data 2024; 11:186. [PMID: 38341475 PMCID: PMC10858879 DOI: 10.1038/s41597-024-03014-8] [Received: 10/11/2023] [Accepted: 01/24/2024] [Indexed: 02/12/2024] Open
Abstract
Tridacna crocea is an ecologically important marine bivalve inhabiting tropical coral reef waters. High-quality, accessible genomic resources will help us understand the population structure and genetic diversity of giant clams. This study reports a high-quality chromosome-scale T. crocea genome sequence of 1.30 Gb, with a scaffold N50 and contig N50 of 56.38 Mb and 1.29 Mb, respectively, which was assembled by combining PacBio long reads and Hi-C sequencing data. Repetitive sequences cover 71.60% of the total length, and a total of 25,440 protein-coding genes were annotated. A total of 1,963 non-coding RNAs (ncRNAs) were identified in the T. crocea genome, including 62 microRNAs (miRNAs), 58 small nuclear RNAs (snRNAs), 83 ribosomal RNAs (rRNAs), and 1,760 transfer RNAs (tRNAs). Phylogenetic analysis revealed that giant clams diverged from oysters about 505.7 Mya during the evolution of bivalves. The genome assembly presented here provides valuable genomic resources to enhance our understanding of the genetic diversity and population structure of giant clams.
Affiliation(s)
- Jun Li
- Key Laboratory of Tropical Marine Bio-resources and Ecology, Guangdong Provincial Key Laboratory of Applied Marine Biology, Innovation Academy of South China Sea Ecology and Environmental Engineering, South China Sea Institute of Oceanology, Chinese Academy of Science, Guangzhou, 510301, China
- Hainan Key Laboratory of Tropical Marine Biotechnology, Hainan Sanya Marine Ecosystem National Observation and Research Station, Sanya, 572024, China
- Daya Bay Marine Biology Research Station, Chinese Academy of Sciences, Shenzhen, 518124, China
- Southern Marine Science and Engineering Guangdong Laboratory (Zhuhai), Zhuhai, 519015, China
- Haitao Ma
- Key Laboratory of Tropical Marine Bio-resources and Ecology, Guangdong Provincial Key Laboratory of Applied Marine Biology, Innovation Academy of South China Sea Ecology and Environmental Engineering, South China Sea Institute of Oceanology, Chinese Academy of Science, Guangzhou, 510301, China
- Hainan Key Laboratory of Tropical Marine Biotechnology, Hainan Sanya Marine Ecosystem National Observation and Research Station, Sanya, 572024, China
- Daya Bay Marine Biology Research Station, Chinese Academy of Sciences, Shenzhen, 518124, China
- Southern Marine Science and Engineering Guangdong Laboratory (Zhuhai), Zhuhai, 519015, China
- Yanpin Qin
- Key Laboratory of Tropical Marine Bio-resources and Ecology, Guangdong Provincial Key Laboratory of Applied Marine Biology, Innovation Academy of South China Sea Ecology and Environmental Engineering, South China Sea Institute of Oceanology, Chinese Academy of Science, Guangzhou, 510301, China
- Hainan Key Laboratory of Tropical Marine Biotechnology, Hainan Sanya Marine Ecosystem National Observation and Research Station, Sanya, 572024, China
- Daya Bay Marine Biology Research Station, Chinese Academy of Sciences, Shenzhen, 518124, China
- Southern Marine Science and Engineering Guangdong Laboratory (Zhuhai), Zhuhai, 519015, China
- Zhen Zhao
- Key Laboratory of Tropical Marine Bio-resources and Ecology, Guangdong Provincial Key Laboratory of Applied Marine Biology, Innovation Academy of South China Sea Ecology and Environmental Engineering, South China Sea Institute of Oceanology, Chinese Academy of Science, Guangzhou, 510301, China
- Hainan Key Laboratory of Tropical Marine Biotechnology, Hainan Sanya Marine Ecosystem National Observation and Research Station, Sanya, 572024, China
- Jiang Li
- Biozeron Shenzhen, Inc, Shenzhen, 518000, China
- Zohaib Noor
- Key Laboratory of Tropical Marine Bio-resources and Ecology, Guangdong Provincial Key Laboratory of Applied Marine Biology, Innovation Academy of South China Sea Ecology and Environmental Engineering, South China Sea Institute of Oceanology, Chinese Academy of Science, Guangzhou, 510301, China
- Hainan Key Laboratory of Tropical Marine Biotechnology, Hainan Sanya Marine Ecosystem National Observation and Research Station, Sanya, 572024, China
- University of Chinese Academy of Sciences, Beijing, 100049, China
- Shuming Guo
- Key Laboratory of Tropical Marine Bio-resources and Ecology, Guangdong Provincial Key Laboratory of Applied Marine Biology, Innovation Academy of South China Sea Ecology and Environmental Engineering, South China Sea Institute of Oceanology, Chinese Academy of Science, Guangzhou, 510301, China
- University of Chinese Academy of Sciences, Beijing, 100049, China
- Ziniu Yu
- Key Laboratory of Tropical Marine Bio-resources and Ecology, Guangdong Provincial Key Laboratory of Applied Marine Biology, Innovation Academy of South China Sea Ecology and Environmental Engineering, South China Sea Institute of Oceanology, Chinese Academy of Science, Guangzhou, 510301, China.
- Hainan Key Laboratory of Tropical Marine Biotechnology, Hainan Sanya Marine Ecosystem National Observation and Research Station, Sanya, 572024, China.
- Daya Bay Marine Biology Research Station, Chinese Academy of Sciences, Shenzhen, 518124, China.
- Southern Marine Science and Engineering Guangdong Laboratory (Zhuhai), Zhuhai, 519015, China.
- Yuehuan Zhang
- Key Laboratory of Tropical Marine Bio-resources and Ecology, Guangdong Provincial Key Laboratory of Applied Marine Biology, Innovation Academy of South China Sea Ecology and Environmental Engineering, South China Sea Institute of Oceanology, Chinese Academy of Science, Guangzhou, 510301, China.
- Hainan Key Laboratory of Tropical Marine Biotechnology, Hainan Sanya Marine Ecosystem National Observation and Research Station, Sanya, 572024, China.
- Daya Bay Marine Biology Research Station, Chinese Academy of Sciences, Shenzhen, 518124, China.
- Southern Marine Science and Engineering Guangdong Laboratory (Zhuhai), Zhuhai, 519015, China.
30
Feng S, Zhang Y, He Z, Xi E, Ru D, Liang J, Yang Y. Chromosome-scale genome assembly of Lepus oiostolus (Lepus, Leporidae). Sci Data 2024; 11:183. [PMID: 38341484 PMCID: PMC10858874 DOI: 10.1038/s41597-024-03024-6] [Received: 11/14/2023] [Accepted: 01/30/2024] [Indexed: 02/12/2024] Open
Abstract
Lepus oiostolus (L. oiostolus) is a species endemic to the Qinghai-Tibet Plateau. However, the absence of a reference genome limits genetic studies. Here, we report a high-quality L. oiostolus genome assembly, with scaffolds anchored to 24 chromosomes and a total assembled length of 2.80 Gb (contig N50 = 64.25 Mb). Genomic annotation uncovered 22,295 protein-coding genes and identified 49.84% of the sequences as transposable elements. Long interspersed nuclear elements (LINEs) constitute a high proportion of the genome. Our study is the first to report a chromosome-scale genome for L. oiostolus. It provides a valuable genomic resource for future research on the evolution of the Leporidae.
Affiliation(s)
- Shuo Feng
- State Key Laboratory of Plateau Ecology and Agriculture, Qinghai University, Xining, 810016, China.
- Yaying Zhang
- State Key Laboratory of Plateau Ecology and Agriculture, Qinghai University, Xining, 810016, China
- Zhaotong He
- State Key Laboratory of Plateau Ecology and Agriculture, Qinghai University, Xining, 810016, China
- Erning Xi
- State Key Laboratory of Plateau Ecology and Agriculture, Qinghai University, Xining, 810016, China
- Dafu Ru
- State Key Laboratory of Grassland Agro-Ecosystems, and College of Ecology, Lanzhou University, Lanzhou, 730000, China
- Jian Liang
- State Key Laboratory of Plateau Ecology and Agriculture, Qinghai University, Xining, 810016, China
- Yongzhi Yang
- State Key Laboratory of Grassland Agro-Ecosystems, and College of Ecology, Lanzhou University, Lanzhou, 730000, China
31
Pathan N, Deng WQ, Di Scipio M, Khan M, Mao S, Morton RW, Lali R, Pigeyre M, Chong MR, Paré G. A method to estimate the contribution of rare coding variants to complex trait heritability. Nat Commun 2024; 15:1245. [PMID: 38336875 PMCID: PMC10858280 DOI: 10.1038/s41467-024-45407-8] [Received: 10/28/2022] [Accepted: 01/22/2024] [Indexed: 02/12/2024] Open
Abstract
It has been postulated that rare coding variants (RVs; MAF < 0.01) contribute to the "missing" heritability of complex traits. We developed a framework, the Rare variant heritability (RARity) estimator, to assess RV heritability (h2RV) without assuming a particular genetic architecture. We applied RARity to 31 complex traits in the UK Biobank (n = 167,348) and showed that gene-level RV aggregation suffers from 79% (95% CI: 68-93%) loss of h2RV. Using unaggregated variants, 27 traits had h2RV > 5%, with height having the highest h2RV at 21.9% (95% CI: 19.0-24.8%). The total heritability, including common and rare variants, recovered pedigree-based estimates for 11 traits. RARity can estimate gene-level h2RV, enabling the assessment of gene-level characteristics and revealing 11 previously unreported gene-phenotype relationships. Finally, we demonstrated that in silico pathogenicity prediction (variant-level) and gene-level annotations do not generally enrich for RVs that over-contribute to complex trait variance, and thus innovative methods are needed to predict RV functionality.
Affiliation(s)
- Nazia Pathan
- Population Health Research Institute, David Braley Cardiac, Vascular and Stroke Research Institute, Hamilton Health Sciences and McMaster University, Hamilton, Canada
- Department of Pathology and Molecular Medicine, McMaster University, Michael G. DeGroote School of Medicine, Hamilton, Canada
- Wei Q Deng
- Peter Boris Centre for Addictions Research, St. Joseph's Healthcare Hamilton, Hamilton, Canada
- Department of Psychiatry and Behavioural Neurosciences, McMaster University, Hamilton, Canada
- Matteo Di Scipio
- Population Health Research Institute, David Braley Cardiac, Vascular and Stroke Research Institute, Hamilton Health Sciences and McMaster University, Hamilton, Canada
- Department of Medicine, Faculty of Health Sciences, McMaster University, Hamilton, Canada
- Mohammad Khan
- Population Health Research Institute, David Braley Cardiac, Vascular and Stroke Research Institute, Hamilton Health Sciences and McMaster University, Hamilton, Canada
- Department of Medicine, Faculty of Health Sciences, McMaster University, Hamilton, Canada
- Shihong Mao
- Population Health Research Institute, David Braley Cardiac, Vascular and Stroke Research Institute, Hamilton Health Sciences and McMaster University, Hamilton, Canada
- Robert W Morton
- Population Health Research Institute, David Braley Cardiac, Vascular and Stroke Research Institute, Hamilton Health Sciences and McMaster University, Hamilton, Canada
- Department of Pathology and Molecular Medicine, McMaster University, Michael G. DeGroote School of Medicine, Hamilton, Canada
- Ricky Lali
- Population Health Research Institute, David Braley Cardiac, Vascular and Stroke Research Institute, Hamilton Health Sciences and McMaster University, Hamilton, Canada
- Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, Canada
- Marie Pigeyre
- Population Health Research Institute, David Braley Cardiac, Vascular and Stroke Research Institute, Hamilton Health Sciences and McMaster University, Hamilton, Canada
- Department of Medicine, Faculty of Health Sciences, McMaster University, Hamilton, Canada
- Michael R Chong
- Population Health Research Institute, David Braley Cardiac, Vascular and Stroke Research Institute, Hamilton Health Sciences and McMaster University, Hamilton, Canada
- Department of Pathology and Molecular Medicine, McMaster University, Michael G. DeGroote School of Medicine, Hamilton, Canada
- Thrombosis and Atherosclerosis Research Institute, David Braley Cardiac, Vascular and Stroke Research Institute, Hamilton, Canada
- Guillaume Paré
- Population Health Research Institute, David Braley Cardiac, Vascular and Stroke Research Institute, Hamilton Health Sciences and McMaster University, Hamilton, Canada.
- Department of Pathology and Molecular Medicine, McMaster University, Michael G. DeGroote School of Medicine, Hamilton, Canada.
- Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, Canada.
- Thrombosis and Atherosclerosis Research Institute, David Braley Cardiac, Vascular and Stroke Research Institute, Hamilton, Canada.
32
Bukhman YV, Meyer S, Chu LF, Abueg L, Antosiewicz-Bourget J, Balacco J, Brecht M, Dinatale E, Fedrigo O, Formenti G, Fungtammasan A, Giri SJ, Hiller M, Howe K, Kihara D, Mamott D, Mountcastle J, Pelan S, Rabbani K, Sims Y, Tracey A, Wood JMD, Jarvis ED, Thomson JA, Chaisson MJP, Stewart R. Chromosome level genome assembly of the Etruscan shrew Suncus etruscus. Sci Data 2024; 11:176. [PMID: 38326333 PMCID: PMC10850158 DOI: 10.1038/s41597-024-03011-x] [Received: 07/28/2023] [Accepted: 01/26/2024] [Indexed: 02/09/2024] Open
Abstract
Suncus etruscus is one of the world's smallest mammals, with an average body mass of about 2 grams. The Etruscan shrew's small body is accompanied by a very high energy demand and numerous metabolic adaptations. Here we report a chromosome-level genome assembly using PacBio long read sequencing, 10X Genomics linked short reads, optical mapping, and Hi-C linked reads. The assembly is partially phased, with the 2.472 Gbp primary pseudohaplotype and 1.515 Gbp alternate. We manually curated the primary assembly and identified 22 chromosomes, including X and Y sex chromosomes. The NCBI genome annotation pipeline identified 39,091 genes, 19,819 of them protein-coding. We also identified segmental duplications, inferred GO term annotations, and computed orthologs of human and mouse genes. This reference-quality genome will be an important resource for research on mammalian development, metabolism, and body size control.
Affiliation(s)
- Yury V Bukhman
- Regenerative Biology, Morgridge Institute for Research, 330 N. Orchard St., Madison, WI, 53715, USA.
- Susanne Meyer
- Neuroscience Research Institute, University of California - Santa Barbara, 494 UCEN Rd, Isla Vista, CA, 93117, USA
- Li-Fang Chu
- Department of Comparative Biology and Experimental Medicine, University of Calgary, 2500 University Drive NW, Calgary, Alberta, T2N 1N4, Canada
- Linelle Abueg
- Vertebrate Genome Lab, The Rockefeller University, 1230 York Avenue, New York, NY, 10065, USA
- Jennifer Balacco
- Vertebrate Genome Lab, The Rockefeller University, 1230 York Avenue, New York, NY, 10065, USA
- Michael Brecht
- BCCN/Humboldt University Berlin, Philippstr, 13 House 6, 10115, Berlin, Germany
- Erica Dinatale
- Max Planck Institute for Biology Tübingen, Max-Planck-Ring 5, 72076, Tübingen, Germany
- Olivier Fedrigo
- Vertebrate Genome Lab, The Rockefeller University, 1230 York Avenue, New York, NY, 10065, USA
- Giulio Formenti
- Laboratory of Neurogenetics of Language, The Rockefeller University/HHMI, 1230 York Avenue, New York, NY, 10065, USA
- Swagarika Jaharlal Giri
- Department of Computer Science, Purdue University, 249 S. Martin Jischke Dr, West Lafayette, IN, 47907, USA
- Michael Hiller
- LOEWE Centre for Translational Biodiversity Genomics, Senckenberganlage 25, 60325, Frankfurt, Germany
- Senckenberg Research Institute, Senckenberganlage 25, 60325, Frankfurt, Germany
- Institute of Cell Biology and Neuroscience, Faculty of Biosciences, Goethe University Frankfurt, Max-von-Laue-Str. 9, 60438, Frankfurt, Germany
- Kerstin Howe
- Tree of Life, Wellcome Sanger Institute, Cambridge, CB10 1SA, UK
- Daisuke Kihara
- Department of Computer Science, Purdue University, 249 S. Martin Jischke Dr, West Lafayette, IN, 47907, USA
- Department of Biological Sciences, Purdue University, 249 S. Martin Jischke Dr., West Lafayette, IN, 47907, USA
- Daniel Mamott
- Regenerative Biology, Morgridge Institute for Research, 330 N. Orchard St., Madison, WI, 53715, USA
- Jacquelyn Mountcastle
- Vertebrate Genome Lab, The Rockefeller University, 1230 York Avenue, New York, NY, 10065, USA
- Sarah Pelan
- Tree of Life, Wellcome Sanger Institute, Cambridge, CB10 1SA, UK
- Keon Rabbani
- Department of Quantitative and Computational Biology, University of Southern California, 1050 Childs Way RRI 408, Los Angeles, CA, 90089, USA
- Ying Sims
- Tree of Life, Wellcome Sanger Institute, Cambridge, CB10 1SA, UK
- Alan Tracey
- Tree of Life, Wellcome Sanger Institute, Cambridge, CB10 1SA, UK
- Erich D Jarvis
- Vertebrate Genome Lab, The Rockefeller University, 1230 York Avenue, New York, NY, 10065, USA
- Laboratory of Neurogenetics of Language, The Rockefeller University/HHMI, 1230 York Avenue, New York, NY, 10065, USA
- James A Thomson
- Regenerative Biology, Morgridge Institute for Research, 330 N. Orchard St., Madison, WI, 53715, USA
- Department of Molecular, Cellular and Developmental Biology, University of California Santa Barbara, Santa Barbara, CA, 93106, USA
- Department of Cell and Regenerative Biology, University of Wisconsin School of Medicine and Public Health, Madison, WI, 53726, USA
- Mark J P Chaisson
- Department of Quantitative and Computational Biology, University of Southern California, 1050 Childs Way RRI 408, Los Angeles, CA, 90089, USA
- Ron Stewart
- Regenerative Biology, Morgridge Institute for Research, 330 N. Orchard St., Madison, WI, 53715, USA
33
Fu Y, Fang X, Xiao Y, Mao B, Xu Z, Shen M, Wang X. Two chromosome-level genomes of Smittia aterrima and Smittia pratorum (Diptera, Chironomidae). Sci Data 2024; 11:165. [PMID: 38310146 PMCID: PMC10838273 DOI: 10.1038/s41597-024-03010-y] [Received: 12/04/2023] [Accepted: 01/26/2024] [Indexed: 02/05/2024] Open
Abstract
Chironomids are one of the most abundant aquatic insects and are widely distributed in various biological communities. However, the lack of high-quality genomes has hindered our ability to study the evolution and ecology of this group. Here, we used Nanopore long reads and Hi-C data to produce two chromosome-level genomes from mixed genomic data. The genomes of Smittia aterrima (SateA) and Smittia pratorum (SateB) were assembled into three chromosomes, with sizes of 78.45 Mb and 71.56 Mb, scaffold N50 lengths of 25.73 and 23.53 Mb, and BUSCO completeness of 98.5% and 97.8% (n = 1,367), 5.68 Mb (7.24%) and 1.94 Mb (2.72%) of repetitive elements, and predicted 12,330 (97.70% BUSCO completeness) and 11,250 (97.40%) protein-coding genes, respectively. These high-quality genomes will serve as valuable resources for comprehending the evolution and environmental adaptation of chironomids.
Affiliation(s)
- Yue Fu
- Hubei Key Laboratory of Economic Forest Germplasm Improvement and Resources Comprehensive Utilization, Hubei Collaborative Innovation Center for the Characteristic Resources Exploitation of Dabie Mountains, Hubei Zhongke Research Institute of Industrial Technology, College of Biology and Agricultural Resources, Huanggang Normal University, Huanggang City, Hubei, 438000, China.
- Xiangliang Fang
- Hubei Key Laboratory of Economic Forest Germplasm Improvement and Resources Comprehensive Utilization, Hubei Collaborative Innovation Center for the Characteristic Resources Exploitation of Dabie Mountains, Hubei Zhongke Research Institute of Industrial Technology, College of Biology and Agricultural Resources, Huanggang Normal University, Huanggang City, Hubei, 438000, China
- Yunli Xiao
- Hubei Key Laboratory of Economic Forest Germplasm Improvement and Resources Comprehensive Utilization, Hubei Collaborative Innovation Center for the Characteristic Resources Exploitation of Dabie Mountains, Hubei Zhongke Research Institute of Industrial Technology, College of Biology and Agricultural Resources, Huanggang Normal University, Huanggang City, Hubei, 438000, China
- Bin Mao
- Hubei Key Laboratory of Economic Forest Germplasm Improvement and Resources Comprehensive Utilization, Hubei Collaborative Innovation Center for the Characteristic Resources Exploitation of Dabie Mountains, Hubei Zhongke Research Institute of Industrial Technology, College of Biology and Agricultural Resources, Huanggang Normal University, Huanggang City, Hubei, 438000, China
- Zigang Xu
- Hubei Key Laboratory of Economic Forest Germplasm Improvement and Resources Comprehensive Utilization, Hubei Collaborative Innovation Center for the Characteristic Resources Exploitation of Dabie Mountains, Hubei Zhongke Research Institute of Industrial Technology, College of Biology and Agricultural Resources, Huanggang Normal University, Huanggang City, Hubei, 438000, China
- Mi Shen
- Hubei Key Laboratory of Economic Forest Germplasm Improvement and Resources Comprehensive Utilization, Hubei Collaborative Innovation Center for the Characteristic Resources Exploitation of Dabie Mountains, Hubei Zhongke Research Institute of Industrial Technology, College of Biology and Agricultural Resources, Huanggang Normal University, Huanggang City, Hubei, 438000, China
- Xinhua Wang
- College of Life Sciences, Nankai University, Tianjin, 300071, China
34
Zheng J, Jiang J, Rui Q, Li F, Liu S, Cheng S, Chi M, Jiang W. Chromosome-level genome assembly of Acrossocheilus fasciatus using PacBio sequencing and Hi-C technology. Sci Data 2024; 11:166. [PMID: 38310107 PMCID: PMC10838343 DOI: 10.1038/s41597-024-02999-6] [Received: 09/27/2023] [Accepted: 01/25/2024] [Indexed: 02/05/2024] Open
Abstract
Acrossocheilus fasciatus (Cypriniformes, Cyprinidae) has emerged as a new commercial stream fish in the south of China with high economic and ornamental value. In this study, a chromosome-level reference genome of A. fasciatus was assembled using PacBio, Illumina and Hi-C sequencing technologies. As a result, a high-quality genome was generated with a size of 879.52 Mb (accession number: JAVLVS000000000), scaffold N50 of 32.7 Mb, and contig N50 of 32.7 Mb. The largest and smallest scaffolds were 60.57 Mb and 16 kb, respectively. BUSCO analysis showed a completeness score of 98.3%. Meanwhile, the assembled sequences were anchored to 25 pseudo-chromosomes with an integration efficiency of 96.95%. Additionally, we found approximately 390.91 Mb of repetitive sequences, accounting for 44.45% of the assembled genome, and predicted 24,900 protein-coding genes. The genome reported in the present study provides a crucial resource for further investigating the regulatory mechanisms of genetic diversity, sexual dimorphism and evolutionary history.
Affiliation(s)
- Jianbo Zheng
- Key Laboratory of Genetics and Breeding, Zhejiang Institute of Freshwater Fisheries, Huzhou, China
- Jianhu Jiang
- Key Laboratory of Genetics and Breeding, Zhejiang Institute of Freshwater Fisheries, Huzhou, China
- Qianlong Rui
- Key Laboratory of Genetics and Breeding, Zhejiang Institute of Freshwater Fisheries, Huzhou, China
- College of Biological and Environmental Sciences, Zhejiang Wanli University, Ningbo, China
- Fei Li
- Key Laboratory of Genetics and Breeding, Zhejiang Institute of Freshwater Fisheries, Huzhou, China.
- Shili Liu
- Key Laboratory of Genetics and Breeding, Zhejiang Institute of Freshwater Fisheries, Huzhou, China
- Shun Cheng
- Key Laboratory of Genetics and Breeding, Zhejiang Institute of Freshwater Fisheries, Huzhou, China
- Meili Chi
- Key Laboratory of Genetics and Breeding, Zhejiang Institute of Freshwater Fisheries, Huzhou, China
- Wenping Jiang
- Key Laboratory of Genetics and Breeding, Zhejiang Institute of Freshwater Fisheries, Huzhou, China
35
Lü Z, Yu Z, Luo W, Liu T, Wang Y, Liu Y, Liu J, Liu B, Gong L, Liu L, Li Y. Chromosome-level genome assembly and annotation of eel goby (Odontamblyopus rebecca). Sci Data 2024; 11:160. [PMID: 38307872 PMCID: PMC10837429 DOI: 10.1038/s41597-024-02997-8] [Received: 08/25/2023] [Accepted: 01/25/2024] [Indexed: 02/04/2024] Open
Abstract
The eel gobies fascinate researchers with many important features, including their unique body structure, benthic lifestyle, and degenerated eyes. However, genome assembly and exploration of the unique genomic composition of the eel gobies are still in their infancy. This has severely limited research progress on gobies. In this study, multi-platform sequencing data were generated and used to assemble and annotate the genome of O. rebecca at the chromosome level. The assembled genome size of O. rebecca is 918.57 Mbp, which is similar to the genome size estimated from 17-mer analysis (903.03 Mbp). The scaffold N50 is 41.67 Mbp, and 23 chromosomes were assembled using Hi-C technology with a mounting rate of 99.96%. Genome annotation indicates that 53.29% of the genome is repetitive sequences, and 22,999 protein-coding genes are predicted, of which 21,855 have functional annotations. The chromosome-level genome of O. rebecca will not only provide important genomic resources for comparative genomic studies of gobies, but also expand our knowledge of the genetic origin of the unique features that have fascinated researchers for decades.
Affiliation(s)
- Zhenming Lü
- National Engineering Laboratory of Marine Germplasm Resources Exploration and Utilization, College of Marine Sciences and Technology, Zhejiang Ocean University, Zhoushan, 316022, China
- Ziwei Yu
- School of Ecology and Environment, Northwestern Polytechnical University, Xi'an, 710072, China
- Wenkai Luo
- School of Ecology and Environment, Northwestern Polytechnical University, Xi'an, 710072, China
- Tianwei Liu
- National Engineering Laboratory of Marine Germplasm Resources Exploration and Utilization, College of Marine Sciences and Technology, Zhejiang Ocean University, Zhoushan, 316022, China
- Yuzheng Wang
- National Engineering Laboratory of Marine Germplasm Resources Exploration and Utilization, College of Marine Sciences and Technology, Zhejiang Ocean University, Zhoushan, 316022, China
- Yantang Liu
- National Engineering Laboratory of Marine Germplasm Resources Exploration and Utilization, College of Marine Sciences and Technology, Zhejiang Ocean University, Zhoushan, 316022, China
- Jing Liu
- National Engineering Laboratory of Marine Germplasm Resources Exploration and Utilization, College of Marine Sciences and Technology, Zhejiang Ocean University, Zhoushan, 316022, China
- Bingjian Liu
- National Engineering Laboratory of Marine Germplasm Resources Exploration and Utilization, College of Marine Sciences and Technology, Zhejiang Ocean University, Zhoushan, 316022, China
- Li Gong
- National Engineering Laboratory of Marine Germplasm Resources Exploration and Utilization, College of Marine Sciences and Technology, Zhejiang Ocean University, Zhoushan, 316022, China
- Liqin Liu
- National Engineering Laboratory of Marine Germplasm Resources Exploration and Utilization, College of Marine Sciences and Technology, Zhejiang Ocean University, Zhoushan, 316022, China
- Yongxin Li
- School of Ecology and Environment, Northwestern Polytechnical University, Xi'an, 710072, China.
36
Song Y, Huang JP, Wang YJ, Huang SX. Chromosome level genome assembly of endangered medicinal plant Anisodus tanguticus. Sci Data 2024; 11:161. [PMID: 38307894 PMCID: PMC10837431 DOI: 10.1038/s41597-024-03007-7] [Received: 10/05/2023] [Accepted: 01/26/2024] [Indexed: 02/04/2024] Open
Abstract
Anisodus tanguticus is a medicinal herb that belongs to the Anisodus genus of the Solanaceae family. This endangered herb is mainly distributed on the Qinghai-Tibet Plateau. In this study, we combined Illumina short-read, Nanopore long-read and high-throughput chromosome conformation capture (Hi-C) sequencing technologies to assemble the A. tanguticus genome de novo. A high-quality chromosome-level genome assembly was obtained with a genome size of 1.26 Gb and a contig N50 of 25.07 Mb. Of the draft genome sequences, 97.47% were anchored to 24 pseudochromosomes with a scaffold N50 of 51.28 Mb. In addition, 842.14 Mb of transposable elements occupying 66.70% of the genome assembly were identified and 44,252 protein-coding genes were predicted. The genome assembly of A. tanguticus will provide a genetic resource for understanding the adaptation strategy of Anisodus species on the plateau, which will further promote the conservation of endangered A. tanguticus resources.
Affiliation(s)
- Yongli Song
- State Key Laboratory of Southwestern Chinese Medicine Resources, School of Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu, 611137, China
- State Key Laboratory of Phytochemistry and Plant Resources in West China, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming, 650201, China
- Jian-Ping Huang
- State Key Laboratory of Southwestern Chinese Medicine Resources, School of Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu, 611137, China
- State Key Laboratory of Phytochemistry and Plant Resources in West China, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming, 650201, China
- Yong-Jiang Wang
- State Key Laboratory of Phytochemistry and Plant Resources in West China, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming, 650201, China.
- Sheng-Xiong Huang
- State Key Laboratory of Southwestern Chinese Medicine Resources, School of Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu, 611137, China.
- State Key Laboratory of Phytochemistry and Plant Resources in West China, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming, 650201, China.
37
Yao A, Kohtsuka H, Miura T. Reference transcriptome assembly of a protogynous sex change fish, harlequin sandsmelt (Parapercis pulchella). Mar Genomics 2024; 73:101086. [PMID: 38365348 DOI: 10.1016/j.margen.2024.101086] [Received: 12/01/2023] [Revised: 01/22/2024] [Accepted: 01/24/2024] [Indexed: 02/18/2024]
Abstract
The harlequin sandsmelt (Parapercis pulchella) is a female-to-male sex change fish in which functional females possess ovotestes that consist of both ovarian and testicular tissues. These features indicate that this species could be an excellent model for studying the flexibility of sex differentiation in vertebrates. However, genetic resources in this species have so far been limited. Therefore, in this study, the reference transcriptome of this fish was constructed through RNA-sequencing, de novo transcriptome assembly, superTranscripts construction, and functional annotations. To obtain as many genes as possible, RNA was extracted from various tissues (brains, gills, hearts, livers, guts, and gonads) and various sexual stages (females, individuals during sex change, and males) and then subjected to sequencing and downstream analyses. As a result, 91,884 representative transcripts with 32,627 protein-coding sequences were generated. 72.2% of protein-coding sequences (23,566 sequences) were functionally annotated. Also, our analysis shows that the superTranscripts method effectively removes redundant sequences from raw-assembled data compared with other strategies. The resultant dataset is a valuable resource for future molecular developmental studies on sex change in P. pulchella.
Affiliation(s)
- Akifumi Yao
- Misaki Marine Biological Station, School of Science, The University of Tokyo, Misaki, Miura, Kanagawa 238-0225, Japan
| | - Hisanori Kohtsuka
- Misaki Marine Biological Station, School of Science, The University of Tokyo, Misaki, Miura, Kanagawa 238-0225, Japan
| | - Toru Miura
- Misaki Marine Biological Station, School of Science, The University of Tokyo, Misaki, Miura, Kanagawa 238-0225, Japan.
|
38
|
Kim KR, Park SY, Kim H, Kim J, Hong JM, Kim SY, Yu JN. Genome assembly and microsatellite marker development using Illumina and PacBio sequencing in Persicaria maackiana (Polygonaceae) from Korea. Genes Genomics 2024; 46:187-202. [PMID: 38240922 DOI: 10.1007/s13258-023-01479-2] [Received: 06/26/2023] [Accepted: 11/23/2023] [Indexed: 01/30/2024]
Abstract
BACKGROUND Persicaria maackiana (Regel) is a potential medicinal plant that exerts anti-diabetic effects. However, the lack of genomic information on P. maackiana hinders research at the molecular level. OBJECTIVE Herein, we aimed to construct a draft genome assembly and obtain comprehensive genomic information on P. maackiana using high-throughput sequencing tools PacBio Sequel II and Illumina. METHODS Persicaria maackiana samples from three natural populations in Gaecheon, Gichi, and Uiryeong reservoirs in South Korea were used to generate genomic DNA libraries, perform genome de novo assembly, gene ontology analysis, phylogenetic tree analysis, genotyping, and identify microsatellite markers. RESULTS The assembled P. maackiana genome yielded 32,179 contigs. Assessment of assembly integrity revealed 1503 (93.12%) complete Benchmarking Universal Single-Copy Orthologs. A total of 64,712 protein-coding genes were predicted and annotated successfully in the protein database. In the Kyoto Encyclopedia of Genes and Genomes (KEGG) orthologs, 13,778 genes were annotated into 18 categories. Genes that activated AMPK were identified in the KEGG pathway. A total of 316,992 microsatellite loci were identified, and primers targeting the flanking regions were developed for 292,059 microsatellite loci. Of these, 150 primer sets were randomly selected for amplification, and 30 of these primer sets were identified as polymorphic. These primers amplified 3-9 alleles. The mean observed and expected heterozygosity were 0.189 and 0.593, respectively. Polymorphism information content values of the markers were 0.361-0.754. CONCLUSION Collectively, our study provides a valuable resource for future comparative genomics, phylogeny, and population studies of P. maackiana.
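The heterozygosity and polymorphism information content (PIC) figures above are standard functions of the allele frequencies at a locus. The following is a minimal sketch in Python, assuming the simple (uncorrected) expected heterozygosity and the PIC formula of Botstein et al.; the study may have used software that applies small-sample corrections, and the allele frequencies below are invented for illustration.

```python
def expected_heterozygosity(freqs):
    """Expected heterozygosity at one locus: 1 - sum(p_i^2)."""
    if abs(sum(freqs) - 1.0) > 1e-6:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in freqs)


def pic(freqs):
    """Polymorphism information content:
    1 - sum(p_i^2) - sum over i<j of 2 * p_i^2 * p_j^2."""
    if abs(sum(freqs) - 1.0) > 1e-6:
        raise ValueError("allele frequencies must sum to 1")
    squares = [p * p for p in freqs]
    cross = sum(
        2.0 * squares[i] * squares[j]
        for i in range(len(squares))
        for j in range(i + 1, len(squares))
    )
    return 1.0 - sum(squares) - cross


# Hypothetical locus with three alleles (not data from the study).
example_freqs = [0.5, 0.3, 0.2]
print(round(expected_heterozygosity(example_freqs), 4))
print(round(pic(example_freqs), 4))
```

PIC is never larger than the expected heterozygosity for the same frequencies, because it subtracts an additional non-negative term.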
Affiliation(s)
- Kang-Rae Kim
- Animal and Plant Research Department, Nakdonggang National Institute of Biological Resources, Sangju, Republic of Korea
| | - So Young Park
- Animal and Plant Research Department, Nakdonggang National Institute of Biological Resources, Sangju, Republic of Korea
| | - Heesoo Kim
- Animal and Plant Research Department, Nakdonggang National Institute of Biological Resources, Sangju, Republic of Korea
| | - Jiyeon Kim
- Animal and Plant Research Department, Nakdonggang National Institute of Biological Resources, Sangju, Republic of Korea
| | - Jeong Min Hong
- Animal and Plant Research Department, Nakdonggang National Institute of Biological Resources, Sangju, Republic of Korea
| | - Sun-Yu Kim
- Animal and Plant Research Department, Nakdonggang National Institute of Biological Resources, Sangju, Republic of Korea
| | - Jeong-Nam Yu
- Animal and Plant Research Department, Nakdonggang National Institute of Biological Resources, Sangju, Republic of Korea.
|
39
|
Feng X, Liu S, Li K, Bu F, Yuan H. NCAD v1.0: a database for non-coding variant annotation and interpretation. J Genet Genomics 2024; 51:230-242. [PMID: 38142743 DOI: 10.1016/j.jgg.2023.12.005] [Received: 08/30/2023] [Revised: 12/15/2023] [Accepted: 12/18/2023] [Indexed: 12/26/2023]
Abstract
The application of whole genome sequencing is expanding in clinical diagnostics across various genetic disorders, and the significance of non-coding variants in penetrant diseases is increasingly being demonstrated. Therefore, it is urgent to improve the diagnostic yield by exploring the pathogenic mechanisms of variants in non-coding regions. However, the interpretation of non-coding variants remains a significant challenge, due to the complex functional regulatory mechanisms of non-coding regions and the current limitations of available databases and tools. Hence, we develop the non-coding variant annotation database (NCAD, http://www.ncawdb.net/), encompassing comprehensive insights into 665,679,194 variants, regulatory elements, and element interaction details. Integrating data from 96 sources, spanning both GRCh37 and GRCh38 versions, NCAD v1.0 provides vital information to support the genetic diagnosis of non-coding variants, including allele frequencies of 12 diverse populations, with a particular focus on the population frequency information for 230,235,698 variants in 20,964 Chinese individuals. Moreover, it offers prediction scores for variant functionality, five categories of regulatory elements, and four types of non-coding RNAs. With its rich data and comprehensive coverage, NCAD serves as a valuable platform, empowering researchers and clinicians with profound insights into non-coding regulatory mechanisms while facilitating the interpretation of non-coding variants.
Affiliation(s)
- Xiaoshu Feng
- Institute of Rare Diseases, West China Hospital, Sichuan University, Chengdu, Sichuan 610044, China
| | - Sihan Liu
- Institute of Rare Diseases, West China Hospital, Sichuan University, Chengdu, Sichuan 610044, China
| | - Ke Li
- Institute of Rare Diseases, West China Hospital, Sichuan University, Chengdu, Sichuan 610044, China
| | - Fengxiao Bu
- Institute of Rare Diseases, West China Hospital, Sichuan University, Chengdu, Sichuan 610044, China.
| | - Huijun Yuan
- Institute of Rare Diseases, West China Hospital, Sichuan University, Chengdu, Sichuan 610044, China.
|
40
|
Pan C, Yin J, Ma B, Wen J, Luo P. Whole-genome sequence and characterization of a marine red yeast, Rhodosporidium sphaerocarpum GDMCC 60679, featuring the assimilation of ammonia nitrogen. J Biosci Bioeng 2024; 137:85-93. [PMID: 38155026 DOI: 10.1016/j.jbiosc.2023.12.007] [Received: 08/01/2023] [Revised: 12/07/2023] [Accepted: 12/11/2023] [Indexed: 12/30/2023]
Abstract
A marine red yeast, Rhodosporidium sphaerocarpum, is generally used for the production of lipids and carotenoids. In a previous study, we demonstrated that the marine-derived R. sphaerocarpum GDMCC 60679 can efficiently remove ammonia nitrogen and exhibits multiple probiotic functions for the shrimp Litopenaeus vannamei. Here, we performed a genome assembly of strain GDMCC 60679 using a combination of Illumina PE and PacBio CLR reads. The genome has a size of 18.03 Mb and consists of 32 contigs with an N50 length of 1,074,774 bp and a GC content of 63%. The genome was predicted to contain 6092 protein-coding genes, 5962 of which were functionally annotated. Metabolic pathways responsible for ammonia assimilation and the synthesis of lipids and carotenoids were examined in particular to identify and characterize the genes contributing to these functions. The whole-genome sequence and annotation of the strain lay a foundation for revealing the molecular mechanisms of its prominent biological functions and will facilitate new applications of yeasts in Rhodosporidium.
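The N50 quoted above is the contig length at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly size. A minimal sketch in Python, using made-up contig lengths rather than the GDMCC 60679 assembly:

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths in bp (illustrative only).
example_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(example_contigs))  # prints 70
```

The same function applies to scaffold lengths; only the input list changes.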
Affiliation(s)
- Chuanhao Pan
- Fisheries College, Guangdong Ocean University, Zhanjiang 524088, China
| | - Jiayue Yin
- CAS Key Laboratory of Tropical Marine Bio-resources and Ecology (LMB), Guangdong Provincial Key Laboratory of Applied Marine Biology (LAMB), South China Sea Institute of Oceanology, Chinese Academy of Sciences, Guangzhou 510301, China; University of Chinese Academy of Sciences, Beijing 100049, China
| | - Bo Ma
- CAS Key Laboratory of Tropical Marine Bio-resources and Ecology (LMB), Guangdong Provincial Key Laboratory of Applied Marine Biology (LAMB), South China Sea Institute of Oceanology, Chinese Academy of Sciences, Guangzhou 510301, China; University of Chinese Academy of Sciences, Beijing 100049, China
| | - Jing Wen
- Department of Biology, Lingnan Normal University, Zhanjiang 524048, China
| | - Peng Luo
- Fisheries College, Guangdong Ocean University, Zhanjiang 524088, China; CAS Key Laboratory of Tropical Marine Bio-resources and Ecology (LMB), Guangdong Provincial Key Laboratory of Applied Marine Biology (LAMB), South China Sea Institute of Oceanology, Chinese Academy of Sciences, Guangzhou 510301, China.
|
41
|
Höglund J, Dias G, Olsen RA, Soares A, Bunikis I, Talla V, Backström N. A Chromosome-Level Genome Assembly and Annotation for the Clouded Apollo Butterfly (Parnassius mnemosyne): A Species of Global Conservation Concern. Genome Biol Evol 2024; 16:evae031. [PMID: 38368625 PMCID: PMC10901555 DOI: 10.1093/gbe/evae031] [Received: 12/15/2023] [Revised: 02/06/2024] [Accepted: 02/10/2024] [Indexed: 02/20/2024] Open
Abstract
The clouded apollo (Parnassius mnemosyne) is a palearctic butterfly distributed over a large part of western Eurasia, but population declines and fragmentation have been observed in many parts of the range. The development of genomic tools can help to shed light on the genetic consequences of the decline and to make informed decisions about direct conservation actions. Here, we present a high-contiguity, chromosome-level genome assembly of a female clouded apollo butterfly and provide detailed annotations of genes and transposable elements. We find that the large genome (1.5 Gb) of the clouded apollo is extraordinarily repeat rich (73%). Despite that, the combination of sequencing techniques allowed us to assemble all chromosomes (nc = 29) to a high degree of completeness. The annotation resulted in a relatively high number of protein-coding genes (22,854) compared with other Lepidoptera, of which a large proportion (21,635) could be assigned functions based on homology with other species. A comparative analysis indicates that overall genome structure has been largely conserved, both within the genus and compared with the ancestral lepidopteran karyotype. The high-quality genome assembly and detailed annotation presented here will constitute an important tool for forthcoming efforts aimed at understanding the genetic consequences of fragmentation and decline, as well as for assessments of genetic diversity, population structure, inbreeding, and genetic load in the clouded apollo butterfly.
Affiliation(s)
- Jacob Höglund
- Animal Ecology Program, Department of Ecology and Genetics (IEG), Uppsala University, Uppsala SE-752 36, Sweden
| | - Guilherme Dias
- National Bioinformatics Infrastructure Sweden (NBIS), Science for Life Laboratory, Uppsala 752 37, Sweden
| | - Remi-André Olsen
- Science for Life Laboratory, Department of Biochemistry and Biophysics, Stockholm University, Solna 17165, Sweden
| | - André Soares
- National Bioinformatics Infrastructure Sweden (NBIS), Science for Life Laboratory, Uppsala 752 37, Sweden
| | - Ignas Bunikis
- Uppsala Genome Center, Department of Immunology, Genetics and Pathology, Uppsala University, National Genomics Infrastructure hosted by SciLifeLab, Uppsala, Sweden
- Department of Medical Biochemistry and Microbiology, Uppsala University, Uppsala 752 37, Sweden
| | - Venkat Talla
- Evolutionary Biology Program, Department of Ecology and Genetics (IEG), Uppsala University, Uppsala SE-752 36, Sweden
| | - Niclas Backström
- Evolutionary Biology Program, Department of Ecology and Genetics (IEG), Uppsala University, Uppsala SE-752 36, Sweden
|
42
|
Zheng L, Shi S, Lu M, Fang P, Pan Z, Zhang H, Zhou Z, Zhang H, Mou M, Huang S, Tao L, Xia W, Li H, Zeng Z, Zhang S, Chen Y, Li Z, Zhu F. AnnoPRO: a strategy for protein function annotation based on multi-scale protein representation and a hybrid deep learning of dual-path encoding. Genome Biol 2024; 25:41. [PMID: 38303023 PMCID: PMC10832132 DOI: 10.1186/s13059-024-03166-1] [Received: 05/06/2023] [Accepted: 01/05/2024] [Indexed: 02/03/2024] Open
Abstract
Protein function annotation has been one of the longstanding issues in the biological sciences, and various computational methods have been developed. However, the existing methods suffer from a serious long-tail problem, with a large number of GO families containing few annotated proteins. Herein, an innovative strategy named AnnoPRO was constructed by enabling sequence-based multi-scale protein representation, dual-path protein encoding using pre-training, and function annotation by long short-term memory-based decoding. A variety of case studies based on different benchmarks were conducted, which confirmed the superior performance of AnnoPRO among available methods. Source code and models have been made freely available at: https://github.com/idrblab/AnnoPRO and https://zenodo.org/records/10012272.
Affiliation(s)
- Lingyan Zheng
- College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310058, China
- Industry Solutions Research and Development, Alibaba Cloud Computing, Hangzhou, 330110, China
| | - Shuiyang Shi
- College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310058, China
| | - Mingkun Lu
- College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310058, China
| | - Pan Fang
- Industry Solutions Research and Development, Alibaba Cloud Computing, Hangzhou, 330110, China
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou, 330110, China
| | - Ziqi Pan
- College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310058, China
| | - Hongning Zhang
- College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310058, China
| | - Zhimeng Zhou
- College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310058, China
| | - Hanyu Zhang
- College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310058, China
| | - Minjie Mou
- College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310058, China
| | - Shijie Huang
- College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310058, China
| | - Lin Tao
- Key Laboratory of Elemene Class Anti-Cancer Chinese Medicines, Engineering Laboratory of Development and Application of Traditional Chinese Medicines, Collaborative Innovation Center of Traditional Chinese Medicines of Zhejiang Province, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
| | - Weiqi Xia
- Pharmaceutical Department, Zhejiang Provincial People's Hospital, Hangzhou, 310014, China
| | - Honglin Li
- School of Pharmacy, East China University of Science and Technology, Shanghai, 200237, China
| | - Zhenyu Zeng
- Industry Solutions Research and Development, Alibaba Cloud Computing, Hangzhou, 330110, China
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou, 330110, China
| | - Shun Zhang
- Industry Solutions Research and Development, Alibaba Cloud Computing, Hangzhou, 330110, China
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou, 330110, China
| | - Yuzong Chen
- State Key Laboratory of Chemical Oncogenomics, Key Laboratory of Chemical Biology, The Graduate School at Shenzhen, Tsinghua University, Shenzhen, 518055, China
| | - Zhaorong Li
- Industry Solutions Research and Development, Alibaba Cloud Computing, Hangzhou, 330110, China.
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou, 330110, China.
| | - Feng Zhu
- College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310058, China.
- Industry Solutions Research and Development, Alibaba Cloud Computing, Hangzhou, 330110, China.
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou, 330110, China.
|
43
|
McCartney N, Kondakath G, Tai A, Trimmer BA. Functional annotation of insecta transcriptomes: A cautionary tale from Lepidoptera. Insect Biochem Mol Biol 2024; 165:104038. [PMID: 37952902 DOI: 10.1016/j.ibmb.2023.104038] [Received: 05/26/2023] [Revised: 10/30/2023] [Accepted: 11/07/2023] [Indexed: 11/14/2023]
Abstract
Functional annotation is a critical step in the analysis of genomic data, as it provides insight into the function of individual genes and the pathways in which they participate. Currently, there is no consensus on the best computational approach for assigning functional annotation. This study compares three functional annotation methods (BLAST, eggNOG-Mapper, and InterProScan) in their ability to assign Gene Ontology terms in two species of Insecta with differing levels of annotation, Bombyx mori and Manduca sexta. The methods were compared for their annotation coverage, number of term assignments, term agreement, and non-overlapping terms. Here we show that there are large discrepancies in Gene Ontology term assignment among the three computational methods, which could lead to confounding interpretations of data and non-comparable results. This study provides insight into the strengths and weaknesses of each computational method and highlights the need for more standardized methods of functional annotation.
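Term agreement and non-overlapping terms of the kind compared above can be expressed with plain set operations. The sketch below is only an illustration in Python: the function name `compare_term_sets`, the Jaccard-style agreement score, and the GO terms assigned to each method are assumptions, not the procedure or data used in the study.

```python
def compare_term_sets(assignments):
    """Compare GO term sets assigned to one gene by several methods.

    assignments maps a method name to the set of GO terms it assigned.
    Returns the shared terms, a Jaccard-style agreement score, and the
    terms that only a single method assigned.
    """
    sets = list(assignments.values())
    shared = set(sets[0]).intersection(*sets[1:])
    union = set().union(*sets)
    unique = {}
    for method, terms in assignments.items():
        others = set().union(
            *(t for name, t in assignments.items() if name != method)
        )
        unique[method] = terms - others
    agreement = len(shared) / len(union) if union else 0.0
    return shared, agreement, unique


# Hypothetical assignments for a single gene (illustrative only).
example = {
    "BLAST": {"GO:0003677", "GO:0005634", "GO:0006355"},
    "eggNOG-Mapper": {"GO:0003677", "GO:0006355"},
    "InterProScan": {"GO:0003677", "GO:0016020"},
}
shared, agreement, unique = compare_term_sets(example)
print(shared, agreement, unique)
```

A low agreement score combined with many method-specific terms is the kind of discrepancy the abstract warns about.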
Affiliation(s)
- Naya McCartney
- Department of Biology, Tufts University, 200 Boston Ave, Medford, MA, 02155, USA
| | - Gayathri Kondakath
- Department of Biology, Tufts University, 200 Boston Ave, Medford, MA, 02155, USA
| | - Albert Tai
- School of Medicine, Tufts University, 136 Harrison Ave, Boston, MA, 02111, USA
| | - Barry A Trimmer
- Department of Biology, Tufts University, 200 Boston Ave, Medford, MA, 02155, USA.
|
44
|
Bonello J, Orengo C. FunPredCATH: An ensemble method for predicting protein function using CATH. Biochim Biophys Acta Proteins Proteom 2024; 1872:140985. [PMID: 38122964 DOI: 10.1016/j.bbapap.2023.140985] [Received: 08/07/2023] [Revised: 12/05/2023] [Accepted: 12/06/2023] [Indexed: 12/23/2023]
Abstract
MOTIVATION The number of unannotated proteins in UniProt grows at a very high rate every year due to more efficient sequencing methods. However, the experimental annotation of proteins is a lengthy and expensive process. Using computational techniques to narrow the search can speed up the process by providing highly specific Gene Ontology (GO) terms. METHODOLOGY We propose an ensemble approach that combines three generic base predictors that predict Gene Ontology (BP, CC and MF) terms from sequences across different species. We train our models on UniProtGOA annotation data and use the CATH domain resources to identify the protein families. We then calculate a score based on the prevalence of individual GO terms in the functional families, which is then used as an indicator of confidence when assigning the GO term to an uncharacterised protein. METHODS In the ensemble, we use a statistics-based method that scores the occurrence of GO terms in a CATH FunFam against a background set of proteins annotated by the same GO term. We also developed a set-based method that uses Set Intersection and Set Union to score the occurrence of GO terms within the same CATH FunFam. Finally, we also use FunFams-Plus, a predictor method developed by the Orengo Group at UCL to predict GO terms for uncharacterised proteins in the CAFA3 challenge. EVALUATION We evaluated the methods against the CAFA3 benchmark and DomFun. We used the Precision, Recall and Fmax metrics and the benchmark datasets that are used in CAFA3 to evaluate our models and compare them to the CAFA3 results. Our results show that FunPredCATH compares well with top CAFA methods in the different ontologies and benchmarks. CONTRIBUTIONS FunPredCATH compares well with other prediction methods on CAFA3, and the ensemble approach outperforms the base methods. We show that non-IEA models obtain higher Fmax scores than the IEA counterparts, while the models including IEA annotations have higher coverage at the expense of a lower Fmax score.
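The Fmax metric named above is the maximum, over score thresholds, of the harmonic mean of protein-averaged precision and recall. The Python sketch below follows the protein-centric definition used in the CAFA challenges as an assumption about how it is computed here; it omits GO ancestor propagation, and the function name `fmax`, the protein identifiers, and the scores are hypothetical.

```python
def fmax(predictions, truth, thresholds=None):
    """Protein-centric Fmax.

    predictions: {protein: {go_term: score in [0, 1]}}
    truth:       {protein: non-empty set of true go_terms}
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]
    best = 0.0
    for t in thresholds:
        precisions = []
        recall_sum = 0.0
        for protein, true_terms in truth.items():
            scored = predictions.get(protein, {})
            predicted = {term for term, s in scored.items() if s >= t}
            hits = len(predicted & true_terms)
            # Precision is averaged only over proteins with a prediction at t.
            if predicted:
                precisions.append(hits / len(predicted))
            # Recall is averaged over every benchmark protein.
            recall_sum += hits / len(true_terms)
        if not precisions:
            continue
        pr = sum(precisions) / len(precisions)
        rc = recall_sum / len(truth)
        if pr + rc > 0:
            best = max(best, 2 * pr * rc / (pr + rc))
    return best


# Hypothetical benchmark with two proteins (illustrative only).
truth = {"P1": {"GO:A", "GO:B"}, "P2": {"GO:C"}}
predictions = {
    "P1": {"GO:A": 0.9, "GO:B": 0.4, "GO:X": 0.2},
    "P2": {"GO:C": 0.6, "GO:Y": 0.5},
}
print(round(fmax(predictions, truth), 4))
```

Because precision is averaged only over proteins that receive at least one prediction, a method can keep a high precision at strict thresholds while its recall, and therefore its coverage, drops.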
Affiliation(s)
- Joseph Bonello
- Department of Structural and Molecular Biology, University College London, Gower Street, London WC1E 6BT, United Kingdom; Department of Computer Information Systems, University of Malta, Faculty of ICT, Msida, MSD 2080, Malta.
| | - Christine Orengo
- Department of Structural and Molecular Biology, University College London, Gower Street, London WC1E 6BT, United Kingdom
|
45
|
Zhang W, Yang Y, Hua S, Ruan Q, Li D, Wang L, Wang X, Wen X, Liu X, Meng Z. Chromosome-level genome assembly and annotation of the yellow grouper, Epinephelus awoara. Sci Data 2024; 11:151. [PMID: 38296995 PMCID: PMC10830450 DOI: 10.1038/s41597-024-02989-8] [Received: 10/23/2023] [Accepted: 01/18/2024] [Indexed: 02/02/2024] Open
Abstract
Epinephelus awoara, also known as the yellow grouper, is an economically significant marine fish that has been bred artificially in China. However, the genetic structure and evolutionary history of the yellow grouper remain largely unknown. Here, we present a high-quality chromosome-level genome assembly of the yellow grouper generated using PacBio single-molecule real-time (SMRT) sequencing and high-throughput chromosome conformation capture (Hi-C) technologies. The 984.48 Mb chromosome-level genome of the yellow grouper was assembled, with a contig N50 length of 39.77 Mb and a scaffold N50 length of 41.39 Mb. Approximately 99.76% of the assembled sequences were anchored to 24 pseudo-chromosomes with the assistance of Hi-C reads. Furthermore, approximately 41.17% of the genome was composed of repetitive elements. In total, 24,541 protein-coding genes were predicted, of which 22,509 (91.72%) were functionally annotated. The highly accurate, chromosome-level reference genome assembly and annotation are crucial to understanding the population genetic structure, adaptive evolution, and speciation of the yellow grouper.
Affiliation(s)
- Weiwei Zhang
- State Key Laboratory of Biocontrol, Institute of Aquatic Economic Animals and Guangdong Province Key Laboratory of Aquatic Economic Animals, School of Life Sciences, Sun Yat-sen University, Guangzhou, 510275, China
| | - Yang Yang
- State Key Laboratory of Biocontrol, Institute of Aquatic Economic Animals and Guangdong Province Key Laboratory of Aquatic Economic Animals, School of Life Sciences, Sun Yat-sen University, Guangzhou, 510275, China
- Key Laboratory of Tropical Marine Fish Germplasm Innovation and Utilization, Ministry of Agriculture and Rural Affairs, Sanya, 570000, China
- Hainan Engineering Research Center for Germplasm Innovation and Utilization, Sanya, 570000, China
| | - Sijie Hua
- State Key Laboratory of Biocontrol, Institute of Aquatic Economic Animals and Guangdong Province Key Laboratory of Aquatic Economic Animals, School of Life Sciences, Sun Yat-sen University, Guangzhou, 510275, China
| | - Qingxin Ruan
- State Key Laboratory of Biocontrol, Institute of Aquatic Economic Animals and Guangdong Province Key Laboratory of Aquatic Economic Animals, School of Life Sciences, Sun Yat-sen University, Guangzhou, 510275, China
| | - Duo Li
- State Key Laboratory of Biocontrol, Institute of Aquatic Economic Animals and Guangdong Province Key Laboratory of Aquatic Economic Animals, School of Life Sciences, Sun Yat-sen University, Guangzhou, 510275, China
| | - Le Wang
- Molecular Population Genetics Group, Temasek Life Sciences Laboratory, National University of Singapore, Singapore City, 119077, Singapore
| | - Xi Wang
- Area of Ecology and Biodiversity, School of Biological Sciences, University of Hong Kong, Hong Kong SAR, 999077, China
| | - Xin Wen
- School of Marine Biology and Fisheries, Hainan Aquaculture Breeding Engineering Research Center, Hainan Academician Team Innovation Center, Hainan University, Haikou, 570228, China
| | - Xiaochun Liu
- State Key Laboratory of Biocontrol, Institute of Aquatic Economic Animals and Guangdong Province Key Laboratory of Aquatic Economic Animals, School of Life Sciences, Sun Yat-sen University, Guangzhou, 510275, China
- Southern Laboratory of Ocean Science and Engineering (Zhuhai), Zhuhai, 519000, China
| | - Zining Meng
- State Key Laboratory of Biocontrol, Institute of Aquatic Economic Animals and Guangdong Province Key Laboratory of Aquatic Economic Animals, School of Life Sciences, Sun Yat-sen University, Guangzhou, 510275, China.
- Southern Laboratory of Ocean Science and Engineering (Zhuhai), Zhuhai, 519000, China.
|
46
|
Luo H, Lin Q, Fang W, Chen X, Zhou X. Genomic insights into the endangered white-eared night heron (Gorsachius magnificus). BMC Genom Data 2024; 25:11. [PMID: 38291423 PMCID: PMC10826008 DOI: 10.1186/s12863-024-01194-1] [Received: 10/16/2023] [Accepted: 01/18/2024] [Indexed: 02/01/2024] Open
Abstract
OBJECTIVES A genome sequence of a threatened species can provide valuable genetic information that is important for improving conservation strategies. The white-eared night heron (Gorsachius magnificus) is an endangered and poorly known ardeid bird. To support future studies on the conservation genetics and evolutionary adaptation of this species, we report a de novo assembled and annotated whole-genome sequence of G. magnificus. DATA DESCRIPTION The final draft genome assembly of G. magnificus was 1.19 Gb in size, with a contig N50 of 187.69 kb and a scaffold N50 of 7,338.28 kb. According to BUSCO analysis, the genome assembly contained 97.49% of the 8,338 genes in the Aves (odb10) dataset. Approximately 10.52% of the genome assembly was composed of repetitive sequences. A total of 14,613 protein-coding genes were predicted in the genome assembly, with functional annotations available for 14,611 genes. The genome assembly exhibited a heterozygosity rate of 0.49 per kilobase pair. This draft genome of G. magnificus provides valuable genomic resources for future studies on conservation and evolution.
Affiliation(s)
- Haoran Luo
- Key Laboratory of the Ministry of Education for Coastal and Wetland Ecosystems, College of the Environment and Ecology, Xiamen University, 361102, Xiamen, China
| | - Qingxian Lin
- Key Laboratory of the Ministry of Education for Coastal and Wetland Ecosystems, College of the Environment and Ecology, Xiamen University, 361102, Xiamen, China.
| | - Wenzhen Fang
- Key Laboratory of the Ministry of Education for Coastal and Wetland Ecosystems, College of the Environment and Ecology, Xiamen University, 361102, Xiamen, China
| | - Xiaolin Chen
- Key Laboratory of the Ministry of Education for Coastal and Wetland Ecosystems, College of the Environment and Ecology, Xiamen University, 361102, Xiamen, China
| | - Xiaoping Zhou
- Key Laboratory of the Ministry of Education for Coastal and Wetland Ecosystems, College of the Environment and Ecology, Xiamen University, 361102, Xiamen, China.
|
47
|
Chen M, Yang D, Yang S, Yang X, Chen Z, Yang T, Yang Y, Yang Y. Chromosome-level genome assembly of Hippophae gyantsensis. Sci Data 2024; 11:126. [PMID: 38272931 PMCID: PMC10810969 DOI: 10.1038/s41597-024-02909-w] [Received: 08/17/2023] [Accepted: 12/29/2023] [Indexed: 01/27/2024] Open
Abstract
Hippophae gyantsensis, a tree species native to China, is ideal for windbreak and sand-fixing forests. It is an economically and ecologically valuable species distributed exclusively on the Qinghai-Tibet Plateau in China. In this study, we assembled a chromosome-level genome of H. gyantsensis using Illumina sequencing, Nanopore sequencing, and a chromosome structure capture technique. The genome was 716.32 Mb in size, with a scaffold N50 length of 64.84 Mb. A total of 716.25 Mb of genome data was anchored and oriented onto 12 chromosomes, with an anchoring rate of up to 99.99%. Additionally, the genome was found to comprise approximately 56.84% repeat sequences, of which long terminal repeats (LTRs) accounted for 33.19% of the entire genome. A total of 32,316 protein-coding genes were predicted, and 91.07% of these genes were functionally annotated. We also completed a series of comparative genomic analyses to provide researchers with useful reference material for future studies on seabuckthorn.
Affiliation(s)
- Mingyue Chen
- School of Ecology and Environmental Science, Yunnan University, Kunming, China
- Plant Germplasm and Genomics Center, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming, 650201, China
- Institute of Tibetan Plateau Research at Kunming, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming, 650201, China
| | - Danni Yang
- Plant Germplasm and Genomics Center, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming, 650201, China
- Institute of Tibetan Plateau Research at Kunming, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming, 650201, China
| | - Shihai Yang
- Plant Germplasm and Genomics Center, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming, 650201, China
- Tibet Yunwang Industrial Corporation, Ltd., Shigatse, China
| | - Xingyu Yang
- Plant Germplasm and Genomics Center, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming, 650201, China
- Institute of Tibetan Plateau Research at Kunming, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming, 650201, China
- University of Chinese Academy of Sciences, Beijing, 100049, China
| | - Zhiyu Chen
- Plant Germplasm and Genomics Center, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming, 650201, China
- Institute of Tibetan Plateau Research at Kunming, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming, 650201, China
- University of Chinese Academy of Sciences, Beijing, 100049, China
| | - Tianyu Yang
- School of Ecology and Environmental Science, Yunnan University, Kunming, China
- Plant Germplasm and Genomics Center, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming, 650201, China
- University of Chinese Academy of Sciences, Beijing, 100049, China
| | - Yunqiang Yang
- Plant Germplasm and Genomics Center, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming, 650201, China
- Institute of Tibetan Plateau Research at Kunming, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming, 650201, China
- University of Chinese Academy of Sciences, Beijing, 100049, China
| | - Yongping Yang
- Plant Germplasm and Genomics Center, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming, 650201, China
- Institute of Tibetan Plateau Research at Kunming, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming, 650201, China
- University of Chinese Academy of Sciences, Beijing, 100049, China
| |
|
48
|
Wang Y, Wang M, Li J, Zhang J, Zhang L. A chromosome-level genome assembly of a deep-sea symbiotic Aplacophora mollusc Chaetoderma sp. Sci Data 2024; 11:133. [PMID: 38272948 PMCID: PMC10810820 DOI: 10.1038/s41597-024-02940-x] [Received: 09/25/2023] [Accepted: 01/10/2024] [Indexed: 01/27/2024] Open
Abstract
The worm-shaped, shell-less Caudofoveata is one of the least known groups of molluscs. The lack of high-quality genomes for these early-branching molluscs hinders our understanding of their evolution and ecology. Here, we report a high-quality chromosome-scale genome of Chaetoderma sp. combining PacBio, Illumina, and high-resolution chromosome conformation capture sequencing. The final assembly has a size of 2.45 Gb, with a scaffold N50 length of 141.46 Mb, and is anchored to 17 chromosomes. Gene annotations showed a high level of accuracy and completeness, with 23,675 predicted protein-coding genes and 94.44% of the metazoan conserved genes recovered in the BUSCO assessment. We further present 16S rRNA gene amplicon sequencing of the gut microbiota in Chaetoderma sp., which was dominated by chemoautotrophic bacteria (class Gammaproteobacteria). This chromosome-level genome assembly presents the first genome for the Caudofoveata and constitutes an important resource for studies ranging from molluscan evolution and symbiosis to deep-sea adaptation.
Affiliation(s)
- Yue Wang
- CAS and Shandong Province Key Laboratory of Experimental Marine Biology, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, 266071, China
- Laboratory for Marine Biology and Biotechnology, Qingdao National Laboratory for Marine Science and Technology, Qingdao, China
- Center of Deep-Sea Research, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, 266071, China
- Key Laboratory of Breeding Biotechnology and Sustainable Aquaculture, Chinese Academy of Sciences, Wuhan, 430072, China
- College of Marine Science, University of Chinese Academy of Sciences, Beijing, 100049, China
| | - Minxiao Wang
- Laboratory for Marine Biology and Biotechnology, Qingdao National Laboratory for Marine Science and Technology, Qingdao, China
- Center of Deep-Sea Research, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, 266071, China
- College of Marine Science, University of Chinese Academy of Sciences, Beijing, 100049, China
| | - Jie Li
- CAS and Shandong Province Key Laboratory of Experimental Marine Biology, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, 266071, China
- Laboratory for Marine Biology and Biotechnology, Qingdao National Laboratory for Marine Science and Technology, Qingdao, China
- Center of Deep-Sea Research, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, 266071, China
- Key Laboratory of Breeding Biotechnology and Sustainable Aquaculture, Chinese Academy of Sciences, Wuhan, 430072, China
- College of Marine Science, University of Chinese Academy of Sciences, Beijing, 100049, China
| | - Junlong Zhang
- College of Marine Science, University of Chinese Academy of Sciences, Beijing, 100049, China
- Department of Marine Organism Taxonomy & Phylogeny, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, 266071, China
| | - Linlin Zhang
- CAS and Shandong Province Key Laboratory of Experimental Marine Biology, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, 266071, China
- Laboratory for Marine Biology and Biotechnology, Qingdao National Laboratory for Marine Science and Technology, Qingdao, China
- Center of Deep-Sea Research, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, 266071, China
- Key Laboratory of Breeding Biotechnology and Sustainable Aquaculture, Chinese Academy of Sciences, Wuhan, 430072, China
- College of Marine Science, University of Chinese Academy of Sciences, Beijing, 100049, China
| |
|
49
|
Zhou S, Li Y, Wu W, Li L. scMMT: a multi-use deep learning approach for cell annotation, protein prediction and embedding in single-cell RNA-seq data. Brief Bioinform 2024; 25:bbad523. [PMID: 38300515 PMCID: PMC10833085 DOI: 10.1093/bib/bbad523] [Received: 09/04/2023] [Revised: 11/27/2023] [Accepted: 12/19/2023] [Indexed: 02/02/2024] Open
Abstract
Accurate cell type annotation in single-cell RNA-sequencing data is essential for advancing biological and medical research, particularly in understanding disease progression and tumor microenvironments. However, existing methods are constrained by single feature extraction approaches, lack of adaptability to immune cell types with similar molecular profiles but distinct functions and a failure to account for the impact of cell label noise on model accuracy, all of which compromise the precision of annotation. To address these challenges, we developed a supervised approach called scMMT. We proposed a novel feature extraction technique to uncover more valuable information. Additionally, we constructed a multi-task learning framework based on the GradNorm method to enhance the recognition of challenging immune cells and reduce the impact of label noise by facilitating mutual reinforcement between cell type annotation and protein prediction tasks. Furthermore, we introduced logarithmic weighting and label smoothing mechanisms to enhance the recognition ability of rare cell types and prevent model overconfidence. Through comprehensive evaluations on multiple public datasets, scMMT has demonstrated state-of-the-art performance in various aspects including cell type annotation, rare cell identification, dropout and label noise resistance, protein expression prediction and low-dimensional embedding representation.
Affiliation(s)
- Songqi Zhou
- Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences, Chongqing, China
- Chongqing School, University of Chinese Academy of Sciences, Chongqing, China
| | - Yang Li
- Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences, Chongqing, China
- Chongqing School, University of Chinese Academy of Sciences, Chongqing, China
- Chongqing Research Institute of Big Data, Peking University, Chongqing, China
| | - Wenyuan Wu
- Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences, Chongqing, China
- Chongqing School, University of Chinese Academy of Sciences, Chongqing, China
| | - Li Li
- Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences, Chongqing, China
- Chongqing School, University of Chinese Academy of Sciences, Chongqing, China
| |
|
50
|
Li W, Wang B, Dai J, Kou Y, Chen X, Pan Y, Hu S, Xu ZZ. Partial order relation-based gene ontology embedding improves protein function prediction. Brief Bioinform 2024; 25:bbae077. [PMID: 38446740 PMCID: PMC10917077 DOI: 10.1093/bib/bbae077] [Received: 10/20/2023] [Revised: 01/22/2024] [Indexed: 03/08/2024] Open
Abstract
Protein annotation has long been a challenging task in computational biology. Gene Ontology (GO) has become one of the most popular frameworks to describe protein functions and their relationships. Prediction of a protein annotation with proper GO terms demands high-quality GO term representation learning, which aims to learn a low-dimensional dense vector representation with accompanying semantic meaning for each functional label, also known as embedding. However, existing GO term embedding methods, which mainly take into account ancestral co-occurrence information, have yet to capture the full topological information in the GO directed acyclic graph (DAG). In this study, we propose a novel GO term representation learning method, PO2Vec, to utilize the partial order relationships to improve the GO term representations. Extensive evaluations show that PO2Vec achieves better outcomes than existing embedding methods in a variety of downstream biological tasks. Based on PO2Vec, we further developed a new protein function prediction method PO2GO, which demonstrates superior performance measured in multiple metrics and annotation specificity as well as few-shot prediction capability in the benchmarks. These results suggest that the high-quality representation of GO structure is critical for diverse biological tasks including computational protein annotation.
Affiliation(s)
- Wenjing Li
- College of Computer Science and Software, Shenzhen University, Shenzhen, China
| | - Bin Wang
- School of Mathematics and Computer Sciences, Nanchang University, Nanchang, China
| | - Jin Dai
- Center for Quantum Technology Research and School of Physics, Beijing Institute of Technology, Beijing, China
| | - Yan Kou
- Xbiome, Scientific Research Building, Tsinghua High-Tech Park, Shenzhen, China
| | - Xiaojun Chen
- College of Computer Science and Software, Shenzhen University, Shenzhen, China
| | - Yi Pan
- Faculty of Computer Science and Control Engineering, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, 1068 Xueyuan Avenue, Shenzhen University Town, Shenzhen, China
| | - Shuangwei Hu
- Xbiome, Scientific Research Building, Tsinghua High-Tech Park, Shenzhen, China
| | - Zhenjiang Zech Xu
- School of Mathematics and Computer Sciences, Nanchang University, Nanchang, China
- State Key Laboratory of Food Science and Technology, Nanchang University, Nanchang, China
| |
|