1. Moeckel C, Mareboina M, Konnaris MA, Chan CS, Mouratidis I, Montgomery A, Chantzi N, Pavlopoulos GA, Georgakopoulos-Soares I. A survey of k-mer methods and applications in bioinformatics. Comput Struct Biotechnol J 2024; 23:2289-2303. [PMID: 38840832 PMCID: PMC11152613 DOI: 10.1016/j.csbj.2024.05.025] [Received: 03/13/2024] [Revised: 05/14/2024] [Accepted: 05/15/2024] [Indexed: 06/07/2024]
Abstract
The rapid progression of genomics and proteomics has been driven by the advent of advanced sequencing technologies, large, diverse, and readily available omics datasets, and the evolution of computational data processing capabilities. The vast amount of data generated by these advancements necessitates efficient algorithms to extract meaningful information. K-mers serve as a valuable tool when working with large sequencing datasets, offering several advantages in computational speed and memory efficiency and carrying the potential for intrinsic biological functionality. This review provides an overview of the methods, applications, and significance of k-mers in genomic and proteomic data analyses, as well as the utility of absent sequences, including nullomers and nullpeptides, in disease detection, vaccine development, therapeutics, and forensic science. Therefore, the review highlights the pivotal role of k-mers in addressing current genomic and proteomic problems and underscores their potential for future breakthroughs in research.
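As a rough illustration of the k-mer and absent-sequence ideas summarized in this abstract, the following minimal sketch counts overlapping k-mers in a sequence and lists the k-mers that never occur in it. The function names `count_kmers` and `absent_kmers` and the toy sequence are hypothetical, chosen only for illustration; the sketch ignores reverse complements and is not taken from the review itself.

```python
from collections import Counter
from itertools import product


def count_kmers(sequence, k):
    """Count every overlapping substring of length k in the sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def absent_kmers(sequence, k, alphabet="ACGT"):
    """List all k-mers over the alphabet that never occur in the sequence."""
    observed = count_kmers(sequence, k)
    candidates = ("".join(letters) for letters in product(alphabet, repeat=k))
    return [kmer for kmer in candidates if kmer not in observed]


# Toy sequence (an assumption for illustration, not real data).
toy = "ACGTACGTTG"
counts = count_kmers(toy, 2)    # 9 overlapping 2-mers in total
missing = absent_kmers(toy, 2)  # 2-mers over ACGT that are absent from toy
```

Enumerating all candidates grows as 4^k for DNA, so this brute-force approach is only reasonable for small k; dedicated k-mer counters use more compact data structures.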
Affiliation(s)
- Camille Moeckel
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Manvita Mareboina
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Maxwell A. Konnaris
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Candace S.Y. Chan
  - Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
- Ioannis Mouratidis
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
  - Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
- Austin Montgomery
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Nikol Chantzi
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Ilias Georgakopoulos-Soares
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
  - Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
2. Tang T, Liu Y, Zheng B, Li R, Zhang X, Liu Y. Integration of hybrid and self-correction method improves the quality of long-read sequencing data. Brief Funct Genomics 2024; 23:249-255. [PMID: 37340778 DOI: 10.1093/bfgp/elad026] [Received: 02/16/2023] [Revised: 06/04/2023] [Accepted: 06/05/2023] [Indexed: 06/22/2023]
Abstract
Third-generation sequencing (TGS) technologies have revolutionized genome science in the past decade. However, the long-read data produced by TGS platforms suffer from a much higher error rate than that of the previous technologies, thus complicating the downstream analysis. Several error correction tools for long-read data have been developed; these tools can be categorized into hybrid and self-correction tools. So far, these two types of tools are separately investigated, and their interplay remains understudied. Here, we integrate hybrid and self-correction methods for high-quality error correction. Our procedure leverages the inter-similarity between long-read data and high-accuracy information from short reads. We compare the performance of our method and state-of-the-art error correction tools on Escherichia coli and Arabidopsis thaliana datasets. The result shows that the integration approach outperformed the existing error correction methods and holds promise for improving the quality of downstream analyses in genomic research.
Affiliation(s)
- Tao Tang
  - School of Modern Posts, Nanjing University of Posts and Telecommunications, 9 Wenyuan Rd, Qixia District, 210023, Jiangsu, China
- Yiping Liu
  - College of Computer Science and Electronic Engineering, Hunan University, 2 Lushan S Rd, Yuelu District, 410086, Changsha, China
- Binshuang Zheng
  - School of Modern Posts, Nanjing University of Posts and Telecommunications, 9 Wenyuan Rd, Qixia District, 210023, Jiangsu, China
- Rong Li
  - School of Modern Posts, Nanjing University of Posts and Telecommunications, 9 Wenyuan Rd, Qixia District, 210023, Jiangsu, China
- Xiaocai Zhang
  - Institute of High Performance Computing, Agency for Science, Technology and Research (A*STAR), 138632, Singapore, Singapore
- Yuansheng Liu
  - College of Computer Science and Electronic Engineering, Hunan University, 2 Lushan S Rd, Yuelu District, 410086, Changsha, China
3. Grybchuk D, Galan A, Klocek D, Macedo DH, Wolf YI, Votýpka J, Butenko A, Lukeš J, Neri U, Záhonová K, Kostygov AY, Koonin EV, Yurchenko V. Identification of diverse RNA viruses in Obscuromonas flagellates (Euglenozoa: Trypanosomatidae: Blastocrithidiinae). Virus Evol 2024; 10:veae037. [PMID: 38774311 PMCID: PMC11108086 DOI: 10.1093/ve/veae037] [Received: 10/26/2023] [Revised: 04/03/2024] [Accepted: 04/29/2024] [Indexed: 05/24/2024]
Abstract
Trypanosomatids (Euglenozoa) are a diverse group of unicellular flagellates predominately infecting insects (monoxenous species) or circulating between insects and vertebrates or plants (dixenous species). Monoxenous trypanosomatids harbor a wide range of RNA viruses belonging to the families Narnaviridae, Totiviridae, Qinviridae, Leishbuviridae, and a putative group of tombus-like viruses. Here, we focus on the subfamily Blastocrithidiinae, a previously unexplored divergent group of monoxenous trypanosomatids comprising two related genera: Obscuromonas and Blastocrithidia. Members of the genus Blastocrithidia employ a unique genetic code, in which all three stop codons are repurposed to encode amino acids, with TAA also used to terminate translation. Obscuromonas isolates studied here bear viruses of three families: Narnaviridae, Qinviridae, and Mitoviridae. The latter viral group is documented in trypanosomatid flagellates for the first time. While other known mitoviruses replicate in the mitochondria, those of trypanosomatids appear to reside in the cytoplasm. Although no RNA viruses were detected in Blastocrithidia spp., we identified an endogenous viral element in the genome of B. triatomae indicating its past encounter(s) with tombus-like viruses.
Affiliation(s)
- Danyil Grybchuk
  - Life Science Research Centre, Faculty of Science, University of Ostrava, Ostrava 710 00, Czechia
  - Central European Institute of Technology, Masaryk University, Brno 625 00, Czechia
- Arnau Galan
  - Life Science Research Centre, Faculty of Science, University of Ostrava, Ostrava 710 00, Czechia
- Donnamae Klocek
  - Life Science Research Centre, Faculty of Science, University of Ostrava, Ostrava 710 00, Czechia
- Diego H Macedo
  - Life Science Research Centre, Faculty of Science, University of Ostrava, Ostrava 710 00, Czechia
- Yuri I Wolf
  - National Center for Biotechnology Information, NLM, National Institutes of Health, Bethesda 20894, USA
- Jan Votýpka
  - Institute of Parasitology, Biology Centre, Czech Academy of Sciences, České Budějovice 370 05, Czechia
  - Department of Parasitology, Faculty of Science, Charles University, Prague 128 00, Czechia
- Anzhelika Butenko
  - Life Science Research Centre, Faculty of Science, University of Ostrava, Ostrava 710 00, Czechia
  - Institute of Parasitology, Biology Centre, Czech Academy of Sciences, České Budějovice 370 05, Czechia
  - Faculty of Science, University of South Bohemia, České Budějovice 370 05, Czechia
- Julius Lukeš
  - Institute of Parasitology, Biology Centre, Czech Academy of Sciences, České Budějovice 370 05, Czechia
  - Faculty of Science, University of South Bohemia, České Budějovice 370 05, Czechia
- Uri Neri
  - The Shmunis School of Biomedicine and Cancer Research, Tel Aviv University, Tel Aviv 39040, Israel
- Kristína Záhonová
  - Life Science Research Centre, Faculty of Science, University of Ostrava, Ostrava 710 00, Czechia
  - Institute of Parasitology, Biology Centre, Czech Academy of Sciences, České Budějovice 370 05, Czechia
  - Department of Parasitology, Faculty of Science, Charles University, BIOCEV, Vestec 252 50, Czechia
  - Division of Infectious Diseases, Department of Medicine, University of Alberta, Edmonton, Alberta T6G 2G3, Canada
- Alexei Yu Kostygov
  - Life Science Research Centre, Faculty of Science, University of Ostrava, Ostrava 710 00, Czechia
  - Zoological Institute of the Russian Academy of Sciences, St. Petersburg 199034, Russia
- Eugene V Koonin
  - National Center for Biotechnology Information, NLM, National Institutes of Health, Bethesda 20894, USA
- Vyacheslav Yurchenko
  - Life Science Research Centre, Faculty of Science, University of Ostrava, Ostrava 710 00, Czechia
4. Eitel M, Osigus H, Brenzinger B, Wörheide G. Beauty in the beast - Placozoan biodiversity explored through molluscan predator genomics. Ecol Evol 2024; 14:e11220. [PMID: 38606341 PMCID: PMC11007570 DOI: 10.1002/ece3.11220] [Received: 08/04/2023] [Revised: 03/17/2024] [Accepted: 03/20/2024] [Indexed: 04/13/2024]
Abstract
The marine animal phylum Placozoa is characterized by a poorly explored cryptic biodiversity combined with very limited knowledge of their ecology. While placozoans are typically found as part of the epibenthos of coastal waters, known placozoan predators, namely small, shell-less sea slugs belonging to the family Rhodopidae (Mollusca: Gastropoda: Heterobranchia), inhabit the interstitium of seafloor sediment. In order to gain further insights into this predator-prey relationship and to expand our understanding of placozoan ecological niches, we screened publicly available whole-body metagenomic data from two rhodopid specimens collected from coastal sediments. Our analysis not only revealed the signatures of three previously unknown placozoan lineages in these sea slug samples but also enabled the assembly of three complete and two partial mitochondrial chromosomes belonging to four previously described placozoan genera, substantially extending the picture of placozoan biodiversity. Our findings further refine the molecular phylogeny of the Placozoa, corroborate the recently established taxonomic ranks in this phylum, and provide molecular support that known placozoan clades should be referred to as genera. We finally discuss the main finding of our study - the presence of placozoans in the sea floor sediment interstitium - in the context of their ecological, biological, and natural history implications.
Affiliation(s)
- Michael Eitel
  - GeoBio-Center, Ludwig-Maximilians-Universität München, Munich, Germany
  - Department of Earth and Environmental Sciences, Paleontology and Geobiology, Ludwig-Maximilians-Universität München, Munich, Germany
- Hans-Jürgen Osigus
  - Institut für Tierökologie, Stiftung Tierärztliche Hochschule Hannover, Hannover, Germany
  - Present address: Hochschulbibliothek, Stiftung Tierärztliche Hochschule Hannover, Hannover, Germany
- Bastian Brenzinger
  - Staatliche Naturwissenschaftliche Sammlungen Bayerns (SNSB) – Zoologische Staatssammlung, Munich, Germany
- Gert Wörheide
  - GeoBio-Center, Ludwig-Maximilians-Universität München, Munich, Germany
  - Department of Earth and Environmental Sciences, Paleontology and Geobiology, Ludwig-Maximilians-Universität München, Munich, Germany
  - Staatliche Naturwissenschaftliche Sammlungen Bayerns (SNSB) – Bayerische Staatssammlung für Paläontologie und Geologie, Munich, Germany
5. Sami A, El-Metwally S, Rashad MZ. MAC-ErrorReads: machine learning-assisted classifier for filtering erroneous NGS reads. BMC Bioinformatics 2024; 25:61. [PMID: 38321434 PMCID: PMC10848413 DOI: 10.1186/s12859-024-05681-1] [Received: 09/12/2023] [Accepted: 01/29/2024] [Indexed: 02/08/2024]
Abstract
BACKGROUND The rapid advancement of next-generation sequencing (NGS) machines in terms of speed and affordability has led to the generation of a massive amount of biological data at the expense of data quality, as errors become more prevalent. This introduces the need for different approaches to detect and filter errors, moving data quality assurance from the hardware space to the software preprocessing stages. RESULTS We introduce MAC-ErrorReads, a novel Machine learning-Assisted Classifier designed for filtering Erroneous NGS Reads. MAC-ErrorReads transforms the erroneous NGS read filtration process into a robust binary classification task, employing five supervised machine learning algorithms. These models are trained on features extracted through the computation of Term Frequency-Inverse Document Frequency (TF-IDF) values from various datasets such as E. coli, GAGE S. aureus, H. Chr14, Arabidopsis thaliana Chr1 and Metriaclima zebra. Notably, Naive Bayes demonstrated robust performance across various datasets, displaying high accuracy, precision, recall, F1-score, MCC, and ROC values. The MAC-ErrorReads NB model accurately classified S. aureus reads, surpassing most error correction tools with a 38.69% alignment rate. For H. Chr14, tools like Lighter, Karect, CARE, Pollux, and MAC-ErrorReads showed rates above 99%. BFC and RECKONER exceeded 98%, while Fiona had 95.78%. For the Arabidopsis thaliana Chr1, Pollux, Karect, RECKONER, and MAC-ErrorReads demonstrated good alignment rates of 92.62%, 91.80%, 91.78%, and 90.87%, respectively. For the Metriaclima zebra, Pollux achieved a high alignment rate of 91.23%, despite having the lowest number of mapped reads. MAC-ErrorReads, Karect, and RECKONER demonstrated good alignment rates of 83.76%, 83.71%, and 83.67%, respectively, while also producing reasonable numbers of mapped reads to the reference genome.
CONCLUSIONS This study demonstrates that machine learning approaches for filtering NGS reads effectively identify and retain the most accurate reads, significantly enhancing assembly quality and genomic coverage. The integration of genomics and artificial intelligence through machine learning algorithms holds promise for enhancing NGS data quality, advancing downstream data analysis accuracy, and opening new opportunities in genetics, genomics, and personalized medicine research.
Affiliation(s)
- Amira Sami
  - Department of Computer Science, Faculty of Computers and Information, Mansoura University, P.O. Box: 35516, Mansoura, Egypt
- Sara El-Metwally
  - Department of Computer Science, Faculty of Computers and Information, Mansoura University, P.O. Box: 35516, Mansoura, Egypt.
  - Biomedical Informatics Department, Faculty of Computer Science and Engineering, New Mansoura University, Gamasa, 35712, Egypt.
- M Z Rashad
  - Department of Computer Science, Faculty of Computers and Information, Mansoura University, P.O. Box: 35516, Mansoura, Egypt
6. Lee WK, Chan BKK, Kim JY, Ju SJ, Kim SJ. Comparative genomics reveals the dynamic evolutionary history of cement protein genes of barnacles from intertidal to deep-sea hydrothermal vents. Mol Ecol Resour 2024; 24:e13895. [PMID: 37955198 DOI: 10.1111/1755-0998.13895] [Received: 11/06/2022] [Revised: 10/16/2023] [Accepted: 10/30/2023] [Indexed: 11/14/2023]
Abstract
Thoracican barnacles are a diverse group of marine organisms for which the availability of genome assemblies is currently limited. In this study, we sequenced the genomes of two neolepadoid species (Ashinkailepas kermadecensis, Imbricaverruca yamaguchii) from hydrothermal vents, in addition to two intertidal species. Genome sizes ranged from 481 to 1054 Mb, with repetitive sequence contents of 21.2% to 50.7%. Concordance rates of orthologs and heterozygosity rates were between 82.4% and 91.7% and between 1.0% and 2.1%, respectively, indicating high genetic diversity and heterozygosity. Based on phylogenomic analyses, we revised the nomenclature of cement genes encoding cement proteins that are not homologous to any known proteins. The major cement gene, CP100A, was found in all thoracican species, including vent-associated neolepadoids, and was hypothesised to be essential for thoracican settlement. Duplicated genes, CP100B and CP100C, were found only in balanids, suggesting potential functional redundancy or acquisition of new functions associated with the calcareous base. An ancestor of CP52 genes was duplicated dynamically among lepadids, pollicipedids with multiple copies on a single scaffold, and balanids with multiple sequential repeats of the conserved regions, but no CP52 genes were found in neolepadoids, providing insights into cement gene evolution among thoracican lineages. This study enhances our understanding of the adhesion mechanisms of thoracicans in underwater environments. The newly sequenced genomes provide opportunities for studying their evolution and ecology, shedding light on their adaptation to diverse marine environments, and contributing to our knowledge of barnacle biology with valuable genomic resources for further studies in this field.
Affiliation(s)
- Won-Kyung Lee
  - Division of Biomedical Research, Korea Research Institute of Bioscience and Biotechnology, Daejeon, Korea
  - Division of EcoScience, Ewha Womans University, Seoul, Korea
- Benny K K Chan
  - Biodiversity Research Center, Academia Sinica, Taipei, Taiwan
- Jae-Yoon Kim
  - Division of Biomedical Research, Korea Research Institute of Bioscience and Biotechnology, Daejeon, Korea
- Se-Jong Ju
  - Marine Resources & Environment Research Division, Korea Institute of Ocean Science and Technology, Busan, Korea
- Se-Joo Kim
  - Division of Biomedical Research, Korea Research Institute of Bioscience and Biotechnology, Daejeon, Korea
7. Dimens PV, Jones KL, Margulies D, Scholey V, Cusatti S, McPeak B, Hildahl TE, Saillant EAE. Genomic resources for the Yellowfin tuna Thunnus albacares. Mol Biol Rep 2024; 51:232. [PMID: 38281308 DOI: 10.1007/s11033-023-09117-6] [Received: 07/08/2023] [Accepted: 12/06/2023] [Indexed: 01/30/2024]
Abstract
BACKGROUND The Yellowfin tuna (Thunnus albacares) is a large tuna exploited by major fisheries in tropical and subtropical waters of all oceans except the Mediterranean Sea. Genomic studies of population structure, adaptive variation or of the genetic basis of phenotypic traits are needed to inform fisheries management but are currently limited by the lack of a reference genome for this species. Here we report a draft genome assembly and a linkage map for use in genomic studies of T. albacares. METHODS AND RESULTS Illumina and PacBio SMRT sequencing were used in combination to generate a hybrid assembly that comprises 743,073,847 base pairs contained in 2,661 scaffolds. The assembly has an N50 of 351,587 and complete and partial BUSCO scores of 86.47% and 3.63%, respectively. Double-digest restriction associated DNA (ddRAD) was used to genotype the 2 parents and 164 of their F1 offspring resulting from a controlled breeding cross, retaining 19,469 biallelic single nucleotide polymorphism (SNP) loci. The SNP loci were used to construct a linkage map that features 24 linkage groups that represent the 24 chromosomes of yellowfin tuna. The male and female maps span 1,243.8 cM and 1,222.9 cM, respectively. The map was used to anchor the assembly in 24 super-scaffolds that contain 79% of the yellowfin tuna genome. Gene prediction identified 46,992 putative genes, 20,203 of which could be annotated via gene ontology. CONCLUSIONS The draft reference will be valuable to interpret studies of genome-wide variation in T. albacares and other Scombroid species.
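The N50 statistic reported in this abstract can be illustrated with a short sketch. It assumes the common definition: the largest length L such that scaffolds of length at least L together cover at least half of the total assembly length. The function name `n50` and the toy lengths are hypothetical, and this is not the authors' pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold or contig lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


# Toy scaffold lengths (an assumption for illustration, not the tuna assembly).
toy_lengths = [2, 3, 4, 5, 6]
toy_n50 = n50(toy_lengths)  # 6 + 5 = 11 covers at least half of 20, so N50 is 5
```

Because N50 depends only on the length distribution, it describes contiguity rather than correctness, which is why the abstract reports it alongside BUSCO completeness scores.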
Affiliation(s)
- Pavel V Dimens
  - School of Ocean Science and Engineering, The University of Southern Mississippi, Ocean Springs, MS, 39564, USA
- Daniel Margulies
  - Inter-American Tropical Tuna Commission, 8901 La Jolla Shores Drive, La Jolla, CA, 92037, USA
- Vernon Scholey
  - Inter-American Tropical Tuna Commission, 8901 La Jolla Shores Drive, La Jolla, CA, 92037, USA
- Susana Cusatti
  - Inter-American Tropical Tuna Commission, 8901 La Jolla Shores Drive, La Jolla, CA, 92037, USA
- Brooke McPeak
  - School of Ocean Science and Engineering, The University of Southern Mississippi, Ocean Springs, MS, 39564, USA
- Tami E Hildahl
  - School of Ocean Science and Engineering, The University of Southern Mississippi, Ocean Springs, MS, 39564, USA
- Eric A E Saillant
  - School of Ocean Science and Engineering, The University of Southern Mississippi, Ocean Springs, MS, 39564, USA.
8. Długosz M, Deorowicz S. Illumina reads correction: evaluation and improvements. Sci Rep 2024; 14:2232. [PMID: 38278837 DOI: 10.1038/s41598-024-52386-9] [Received: 03/20/2023] [Accepted: 01/18/2024] [Indexed: 01/28/2024]
Abstract
The paper focuses on the correction of Illumina WGS sequencing reads. We provide an extensive evaluation of the existing correctors. To this end, we measure the impact of correction on variant calling (VC) as well as de novo assembly. The evaluation shows that, in selected cases, read correction improves the quality of VC results. We also examine the algorithms' behaviour when processing Illumina NovaSeq reads, whose quality characteristics differ from those of older sequencers. We show that most of the algorithms are ready to cope with such reads. Finally, we introduce a new version of RECKONER, our read corrector, optimized and equipped with a new correction strategy. RECKONER now corrects high-coverage human reads in less than 2.5 h, copes with two types of read errors (indels and substitutions), and uses a new correction verification technique based on two lengths of oligomers.
Affiliation(s)
- Maciej Długosz
  - Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, 44-100, Gliwice, Poland
- Sebastian Deorowicz
  - Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, 44-100, Gliwice, Poland.
9. Albanaz ATS, Carrington M, Frolov AO, Ganyukova AI, Gerasimov ES, Kostygov AY, Lukeš J, Malysheva MN, Votýpka J, Zakharova A, Záhonová K, Zimmer SL, Yurchenko V, Butenko A. Shining the spotlight on the neglected: new high-quality genome assemblies as a gateway to understanding the evolution of Trypanosomatidae. BMC Genomics 2023; 24:471. [PMID: 37605127 PMCID: PMC10441713 DOI: 10.1186/s12864-023-09591-z] [Received: 06/08/2023] [Accepted: 08/15/2023] [Indexed: 08/23/2023]
Abstract
BACKGROUND Protists of the family Trypanosomatidae (phylum Euglenozoa) have gained notoriety as parasites affecting humans, domestic animals, and agricultural plants. However, the true extent of the group's diversity spreads far beyond the medically and veterinary relevant species. We address several knowledge gaps in trypanosomatid research by undertaking sequencing, assembly, and analysis of genomes from previously overlooked representatives of this protistan group. RESULTS We assembled genomes for twenty-one trypanosomatid species, with a primary focus on insect parasites and Trypanosoma spp. parasitizing non-human hosts. The assemblies exhibit sizes consistent with previously sequenced trypanosomatid genomes, ranging from approximately 18 Mb for Obscuromonas modryi to 35 Mb for Crithidia brevicula and Zelonia costaricensis. Despite being the smallest, the genome of O. modryi has the highest content of repetitive elements, contributing nearly half of its total size. Conversely, the highest proportion of unique DNA is found in the genomes of Wallacemonas spp., with repeats accounting for less than 8% of the assembly length. The majority of examined species exhibit varying degrees of aneuploidy, with trisomy being the most frequently observed condition after disomy. CONCLUSIONS The genome of Obscuromonas modryi represents a very unusual, if not unique, example of evolution driven by two antidromous forces: i) increasing dependence on the host leading to genomic shrinkage and ii) expansion of repeats causing genome enlargement. The observed variation in somy within and between trypanosomatid genera suggests that these flagellates are largely predisposed to aneuploidy and, apparently, exploit it to gain a fitness advantage. High heterogeneity in the genome size, repeat content, and variation in chromosome copy numbers in the newly-sequenced species highlight the remarkable genome plasticity exhibited by trypanosomatid flagellates. 
These new genome assemblies are a robust foundation for future research on the genetic basis of life cycle changes and adaptation to different hosts in the family Trypanosomatidae.
Affiliation(s)
- Amanda T S Albanaz
  - Life Science Research Centre, Faculty of Science, University of Ostrava, 710 00, Ostrava, Czech Republic
- Mark Carrington
  - Department of Biochemistry, University of Cambridge, Tennis Court Road, Cambridge, CB2 1QW, UK
- Alexander O Frolov
  - Zoological Institute of the Russian Academy of Sciences, 199034, St. Petersburg, Russia
- Anna I Ganyukova
  - Zoological Institute of the Russian Academy of Sciences, 199034, St. Petersburg, Russia
- Evgeny S Gerasimov
  - Faculty of Biology, M. V. Lomonosov Moscow State University, 119991, Moscow, Russia
  - Martsinovsky Institute of Medical Parasitology, Sechenov University, 119435, Moscow, Russia
- Alexei Y Kostygov
  - Life Science Research Centre, Faculty of Science, University of Ostrava, 710 00, Ostrava, Czech Republic
- Julius Lukeš
  - Institute of Parasitology, Czech Academy of Sciences, 370 05, České Budějovice, Czech Republic
  - Faculty of Sciences, University of South Bohemia, 370 05, České Budějovice, Czech Republic
- Marina N Malysheva
  - Zoological Institute of the Russian Academy of Sciences, 199034, St. Petersburg, Russia
- Jan Votýpka
  - Institute of Parasitology, Czech Academy of Sciences, 370 05, České Budějovice, Czech Republic
  - Department of Parasitology, Faculty of Science, Charles University, 128 44, Prague, Czech Republic
- Alexandra Zakharova
  - Life Science Research Centre, Faculty of Science, University of Ostrava, 710 00, Ostrava, Czech Republic
- Kristína Záhonová
  - Life Science Research Centre, Faculty of Science, University of Ostrava, 710 00, Ostrava, Czech Republic
  - Institute of Parasitology, Czech Academy of Sciences, 370 05, České Budějovice, Czech Republic
  - Department of Parasitology, Faculty of Science, Charles University, BIOCEV, 252 50, Vestec, Czech Republic
  - Division of Infectious Diseases, Department of Medicine, University of Alberta, Edmonton, T6G 2G3, Canada
- Sara L Zimmer
  - Duluth Campus, University of Minnesota Medical School, Duluth, MN, 55812, USA
- Vyacheslav Yurchenko
  - Life Science Research Centre, Faculty of Science, University of Ostrava, 710 00, Ostrava, Czech Republic.
- Anzhelika Butenko
  - Life Science Research Centre, Faculty of Science, University of Ostrava, 710 00, Ostrava, Czech Republic.
  - Institute of Parasitology, Czech Academy of Sciences, 370 05, České Budějovice, Czech Republic.
  - Faculty of Sciences, University of South Bohemia, 370 05, České Budějovice, Czech Republic.
10. Francis WR, Eitel M, Vargas S, Garcia-Escudero CA, Conci N, Deister F, Mah JL, Guiglielmoni N, Krebs S, Blum H, Leys SP, Wörheide G. The genome of the reef-building glass sponge Aphrocallistes vastus provides insights into silica biomineralization. R Soc Open Sci 2023; 10:230423. [PMID: 37351491 PMCID: PMC10282587 DOI: 10.1098/rsos.230423] [Received: 04/04/2023] [Accepted: 05/26/2023] [Indexed: 06/24/2023]
Abstract
Well-annotated and contiguous genomes are an indispensable resource for understanding the evolution, development, and metabolic capacities of organisms. Sponges, an ecologically important non-bilaterian group of primarily filter-feeding sessile aquatic organisms, are underrepresented with respect to available genomic resources. Here we provide a high-quality and well-annotated genome of Aphrocallistes vastus, a glass sponge (Porifera: Hexactinellida) that forms large reef structures off the coast of British Columbia (Canada). We show that its genome is approximately 80 Mb, small compared to most other metazoans, and contains nearly 2500 nested genes, more than other genomes. Hexactinellida is characterized by a unique skeletal architecture made of amorphous silicon dioxide (SiO2), and we identified 419 differentially expressed genes between the osculum, i.e. the vertical growth zone of the sponge, and the main body. Among the upregulated ones, mineralization-related genes such as glassin, as well as collagens and actins, dominate the expression profile during growth. Silicateins, suggested to be involved in silica mineralization, especially in demosponges, were not found at all in the A. vastus genome, which suggests that the underlying mechanisms of SiO2 deposition in the Silicea sensu stricto (Hexactinellida + Demospongiae) may not be homologous.
Collapse
Affiliation(s)
- Warren R. Francis
- Department of Earth and Environmental Sciences, Paleontology and Geobiology, Ludwig-Maximilians-Universität München, Munich, Germany
| | - Michael Eitel
- Department of Earth and Environmental Sciences, Paleontology and Geobiology, Ludwig-Maximilians-Universität München, Munich, Germany
| | - Sergio Vargas
- Department of Earth and Environmental Sciences, Paleontology and Geobiology, Ludwig-Maximilians-Universität München, Munich, Germany
| | - Catalina A. Garcia-Escudero
- Department of Earth and Environmental Sciences, Paleontology and Geobiology, Ludwig-Maximilians-Universität München, Munich, Germany
| | - Nicola Conci
- Department of Earth and Environmental Sciences, Paleontology and Geobiology, Ludwig-Maximilians-Universität München, Munich, Germany
| | - Fabian Deister
- Department of Earth and Environmental Sciences, Paleontology and Geobiology, Ludwig-Maximilians-Universität München, Munich, Germany
| | - Jasmine L. Mah
- Department of Biological Sciences, University of Alberta, Edmonton, Canada T6G 2E9
| | - Nadège Guiglielmoni
- Service Evolution Biologique et Ecologie, Université libre de Bruxelles (ULB), 1050 Brussels, Belgium
| | - Stefan Krebs
- Laboratory for Functional Genome Analysis (LAFUGA), Gene Center, Ludwig-Maximilians-Universität München, Munich, Germany
| | - Helmut Blum
- Laboratory for Functional Genome Analysis (LAFUGA), Gene Center, Ludwig-Maximilians-Universität München, Munich, Germany
| | - Sally P. Leys
- Department of Biological Sciences, University of Alberta, Edmonton, Canada T6G 2E9
| | - Gert Wörheide
- Department of Earth and Environmental Sciences, Paleontology and Geobiology, Ludwig-Maximilians-Universität München, Munich, Germany
- GeoBio-Center, Ludwig-Maximilians-Universität München, Munich, Germany
- Staatliche Naturwissenschaftliche Sammlungen Bayerns (SNSB)–Bayerische Staatssammlung für Paläontologie und Geologie, Munich, Germany
| |
|
11
|
Cai X, Lan T, Ping P, Oliver B, Li J. Intra-Host Co-Existing Strains of SARS-CoV-2 Reference Genome Uncovered by Exhaustive Computational Search. Viruses 2023; 15:v15051065. [PMID: 37243151 DOI: 10.3390/v15051065] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/28/2023] [Revised: 04/24/2023] [Accepted: 04/24/2023] [Indexed: 05/28/2023] Open
Abstract
The COVID-19 pandemic caused by SARS-CoV-2 has had a severe impact on people worldwide. The reference genome of the virus has been widely used as a template for designing mRNA vaccines to combat the disease. In this study, we present a computational method aimed at identifying co-existing intra-host strains of the virus from RNA-sequencing data of short reads that were used to assemble the original reference genome. Our method consisted of five key steps: extraction of relevant reads, error correction for the reads, identification of within-host diversity, phylogenetic study, and protein binding affinity analysis. Our study revealed that multiple strains of SARS-CoV-2 can coexist in both the viral sample used to produce the reference sequence and a wastewater sample from California. Additionally, our workflow demonstrated its capability to identify within-host diversity in foot-and-mouth disease virus (FMDV). Through our research, we were able to shed light on the binding affinity and phylogenetic relationships of these strains with the published SARS-CoV-2 reference genome, SARS-CoV, variants of concern (VOC) of SARS-CoV-2, and some closely related coronaviruses. These insights have important implications for future research efforts aimed at identifying within-host diversity, understanding the evolution and spread of these viruses, as well as the development of effective treatments and vaccines against them.
Affiliation(s)
- Xinhui Cai
- Data Science Institute and School of Computer Science, Faculty of Engineering and IT, University of Technology Sydney, Ultimo, NSW 2007, Australia
| | - Tian Lan
- Data Science Institute and School of Computer Science, Faculty of Engineering and IT, University of Technology Sydney, Ultimo, NSW 2007, Australia
| | - Pengyao Ping
- Data Science Institute and School of Computer Science, Faculty of Engineering and IT, University of Technology Sydney, Ultimo, NSW 2007, Australia
| | - Brian Oliver
- School of Life Sciences, Faculty of Science, University of Technology Sydney, Ultimo, NSW 2007, Australia
| | - Jinyan Li
- Data Science Institute and School of Computer Science, Faculty of Engineering and IT, University of Technology Sydney, Ultimo, NSW 2007, Australia
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, 1068 Xueyuan Avenue, Shenzhen University Town, Shenzhen 518055, China
| |
|
12
|
Beres SB, Olsen RJ, Long SW, Eraso JM, Boukthir S, Faili A, Kayal S, Musser JM. Analysis of the Genomics and Mouse Virulence of an Emergent Clone of Streptococcus dysgalactiae Subspecies equisimilis. Microbiol Spectr 2023; 11:e0455022. [PMID: 36971562 PMCID: PMC10100674 DOI: 10.1128/spectrum.04550-22] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2022] [Accepted: 03/04/2023] [Indexed: 03/29/2023] Open
Abstract
Streptococcus dysgalactiae subsp. equisimilis is a bacterial pathogen that is increasingly recognized as a cause of severe human infections. Much less is known about the genomics and infection pathogenesis of S. dysgalactiae subsp. equisimilis strains compared to the closely related bacterium Streptococcus pyogenes. To address these knowledge deficits, we sequenced to closure the genomes of seven S. dysgalactiae subsp. equisimilis human isolates, including six that were emm type stG62647. Recently, for unknown reasons, strains of this emm type have emerged and caused an increasing number of severe human infections in several countries. The genomes of these seven strains vary between 2.15 and 2.21 Mbp. The core chromosomes of these six S. dysgalactiae subsp. equisimilis stG62647 strains are closely related, differing on average by only 495 single-nucleotide polymorphisms, consistent with a recent descent from a common progenitor. The largest source of genetic diversity among these seven isolates is differences in putative mobile genetic elements, both chromosomal and extrachromosomal. Consistent with the epidemiological observations of increased frequency and severity of infections, both stG62647 strains studied were significantly more virulent than a strain of emm type stC74a in a mouse model of necrotizing myositis, as assessed by bacterial CFU burden, lesion size, and survival curves. Taken together, our genomic and pathogenesis data show the strains of emm type stG62647 we studied are closely genetically related and have enhanced virulence in a mouse model of severe invasive disease. Our findings underscore the need for expanded study of the genomics and molecular pathogenesis of S. dysgalactiae subsp. equisimilis strains causing human infections. IMPORTANCE Our studies addressed a critical knowledge gap in understanding the genomics and virulence of the bacterial pathogen Streptococcus dysgalactiae subsp. equisimilis. S. dysgalactiae subsp. equisimilis strains are responsible for a recent increase in severe human infections in some countries. We determined that certain S. dysgalactiae subsp. equisimilis strains are genetically descended from a common ancestor and that these strains can cause severe infections in a mouse model of necrotizing myositis. Our findings highlight the need for expanded studies on the genomics and pathogenic mechanisms of this understudied subspecies of the Streptococcus family.
Affiliation(s)
- Stephen B. Beres
- Laboratory of Molecular and Translational Human Infectious Disease Research, Center for Infectious Diseases, Department of Pathology and Genomic Medicine, Houston Methodist Research Institute and Houston Methodist Hospital, Houston, Texas, USA
| | - Randall J. Olsen
- Laboratory of Molecular and Translational Human Infectious Disease Research, Center for Infectious Diseases, Department of Pathology and Genomic Medicine, Houston Methodist Research Institute and Houston Methodist Hospital, Houston, Texas, USA
- Department of Pathology and Laboratory Medicine, Weill Cornell Medical College, New York, New York, USA
- Department of Microbiology and Immunology, Weill Cornell Medical College, New York, New York, USA
| | - S. Wesley Long
- Laboratory of Molecular and Translational Human Infectious Disease Research, Center for Infectious Diseases, Department of Pathology and Genomic Medicine, Houston Methodist Research Institute and Houston Methodist Hospital, Houston, Texas, USA
- Department of Pathology and Laboratory Medicine, Weill Cornell Medical College, New York, New York, USA
- Department of Microbiology and Immunology, Weill Cornell Medical College, New York, New York, USA
| | - Jesus M. Eraso
- Laboratory of Molecular and Translational Human Infectious Disease Research, Center for Infectious Diseases, Department of Pathology and Genomic Medicine, Houston Methodist Research Institute and Houston Methodist Hospital, Houston, Texas, USA
- Department of Pathology and Laboratory Medicine, Weill Cornell Medical College, New York, New York, USA
| | - Sarrah Boukthir
- CHU de Rennes, Service de Bacteriologie-Hygiène Hospitalière, Rennes, France
- INSERM, CIC 1414, Rennes, France
- Université Rennes 1, Faculté de Médecine, Rennes, France
| | - Ahmad Faili
- INSERM, CIC 1414, Rennes, France
- Université Rennes 1, Faculté de Pharmacie, Rennes, France
- Chemistry, Oncogenesis, Stress, and Signaling, INSERM 1242, Rennes, France
| | - Samer Kayal
- CHU de Rennes, Service de Bacteriologie-Hygiène Hospitalière, Rennes, France
- INSERM, CIC 1414, Rennes, France
- Université Rennes 1, Faculté de Médecine, Rennes, France
- Chemistry, Oncogenesis, Stress, and Signaling, INSERM 1242, Rennes, France
| | - James M. Musser
- Laboratory of Molecular and Translational Human Infectious Disease Research, Center for Infectious Diseases, Department of Pathology and Genomic Medicine, Houston Methodist Research Institute and Houston Methodist Hospital, Houston, Texas, USA
- Department of Pathology and Laboratory Medicine, Weill Cornell Medical College, New York, New York, USA
- Department of Microbiology and Immunology, Weill Cornell Medical College, New York, New York, USA
| |
|
13
|
Nesterenko M, Miroliubov A. From head to rootlet: comparative transcriptomic analysis of a rhizocephalan barnacle Peltogaster reticulata (Crustacea: Rhizocephala). F1000Res 2023; 11:583. [PMID: 36447930 PMCID: PMC9664023 DOI: 10.12688/f1000research.110492.2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 01/04/2023] [Indexed: 01/11/2023] Open
Abstract
Background: Rhizocephalan barnacles stand out in the diverse world of metazoan parasites. The body of a rhizocephalan female is modified beyond revealing any recognizable morphological features, consisting of the interna, a system of rootlets, and the externa, a sac-like reproductive body. Moreover, rhizocephalans have an outstanding ability to control their hosts, literally turning them into "zombies". Despite all these amazing traits, there are no genomic or transcriptomic data about any Rhizocephala. Methods: We collected transcriptomes from four body parts of an adult female rhizocephalan Peltogaster reticulata: the externa, and the main, growing, and thoracic parts of the interna. We used all prepared data for the de novo assembly of the reference transcriptome. Next, a set of encoded proteins was determined, the expression levels of protein-coding genes in different parts of the parasite's body were calculated and lists of enriched bioprocesses were identified. We also in silico identified and analyzed sets of potential excretory / secretory proteins. Finally, we applied phylostratigraphy and evolutionary transcriptomics approaches to our data. Results: The assembled reference transcriptome included transcripts of 12,620 protein-coding genes and was the first for any rhizocephalan. Based on the results obtained, the spatial heterogeneity of protein-coding gene expression in different regions of the adult female body of P. reticulata was established. The results of both transcriptomic analysis and histological studies indicated the presence of germ-like cells in the lumen of the interna. The potential molecular basis of the interaction between the nervous system of the host and the parasite's interna was also determined. Given the prolonged expression of development-associated genes, we suggest that rhizocephalans "got stuck in their metamorphosis", even at the reproductive stage. 
Conclusions: The results of the first comparative transcriptomic analysis for Rhizocephala not only clarified but also expanded the existing ideas about the biology of these extraordinary parasites.
Affiliation(s)
- Maksim Nesterenko
- Department of Invertebrate Zoology, St Petersburg State University, St Petersburg, 199034, Russian Federation
- Laboratory of parasitic worms and protists, Zoological Institute of Russian Academy of Sciences, St Petersburg, 199034, Russian Federation
| | - Aleksei Miroliubov
- Laboratory of parasitic worms and protists, Zoological Institute of Russian Academy of Sciences, St Petersburg, 199034, Russian Federation
| |
|
14
|
Platova S, Poliushkevich L, Kulakova M, Nesterenko M, Starunov V, Novikova E. Gotta Go Slow: Two Evolutionarily Distinct Annelids Retain a Common Hedgehog Pathway Composition, Outlining Its Pan-Bilaterian Core. Int J Mol Sci 2022; 23:ijms232214312. [PMID: 36430788 PMCID: PMC9695228 DOI: 10.3390/ijms232214312] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/30/2022] [Revised: 11/11/2022] [Accepted: 11/13/2022] [Indexed: 11/19/2022] Open
Abstract
Hedgehog signaling is one of the key regulators of morphogenesis, cell differentiation, and regeneration. While the Hh pathway is present in all bilaterians, it has mainly been studied in model animals such as Drosophila and vertebrates. Despite the conservatism of its core components, mechanisms of signal transduction and additional components vary in Ecdysozoa and Deuterostomia. Vertebrates have multiple copies of the pathway members, which complicates signaling implementation, whereas model ecdysozoans appear to have lost some components due to fast evolution rates. To shed light on the ancestral state of Hh signaling, models from the third clade, Spiralia, are needed. In our research, we analyzed the transcriptomes of two spiralian animals, errantial annelid Platynereis dumerilii (Nereididae) and sedentarian annelid Pygospio elegans (Spionidae). We found that both annelids express almost all Hh pathway components present in Drosophila and mouse. We performed a phylogenetic analysis of the core pathway components and built multiple sequence alignments of the additional key members. Our results imply that the Hh pathway compositions of both annelids share more similarities with vertebrates than with the fruit fly. Possessing an almost complete set of single-copy Hh pathway members, lophotrochozoan signaling composition may reflect the ancestral features of all three bilaterian branches.
Affiliation(s)
- Sofia Platova
- Faculty of Biology, St. Petersburg State University, Saint Petersburg 199034, Russia
- Zoological Institute RAS, Saint Petersburg 199034, Russia
| | | | - Milana Kulakova
- Faculty of Biology, St. Petersburg State University, Saint Petersburg 199034, Russia
- Zoological Institute RAS, Saint Petersburg 199034, Russia
- Correspondence: (M.K.); (E.N.)
| | | | - Viktor Starunov
- Faculty of Biology, St. Petersburg State University, Saint Petersburg 199034, Russia
- Zoological Institute RAS, Saint Petersburg 199034, Russia
| | - Elena Novikova
- Faculty of Biology, St. Petersburg State University, Saint Petersburg 199034, Russia
- Zoological Institute RAS, Saint Petersburg 199034, Russia
- Correspondence: (M.K.); (E.N.)
| |
|
15
|
Kaya Y, Aydın ZU, Cai X, Wang X, Dönmez AA. Genome-wide characterization of two Aubrieta taxa: Aubrieta canescens subsp. canescens and Au. macrostyla (Brassicaceae). AOB PLANTS 2022; 14:plac035. [PMID: 36196394 PMCID: PMC9521481 DOI: 10.1093/aobpla/plac035] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/16/2022] [Accepted: 09/09/2022] [Indexed: 06/16/2023]
Abstract
Aubrieta canescens complex is divided into two subspecies, Au. canescens subsp. canescens, Au. canescens subsp. cilicica and a distinct species, Au. macrostyla, based on molecular phylogeny. We generated a draft assembly of Au. canescens subsp. canescens and Au. macrostyla using paired-end shotgun sequencing. This is the first attempt at genome characterization for the genus. In the presented study, ~165 and ~157 Mbp of the genomes of Au. canescens subsp. canescens and Au. macrostyla were assembled, respectively, and a total of 32 425 and 31 372 gene models were predicted in the genomes of the target taxa, respectively. We corroborated the phylogenomic affinity of taxa with some core Brassicaceae species (Clades A and B) including Arabis alpina. The orthology-based tree suggested that Aubrieta species differentiated from A. alpina 1.3-2.0 mya (million years ago). The genome-wide syntenic comparison of two Aubrieta taxa revealed that Au. canescens subsp. canescens (46 %) and Au. macrostyla (45 %) have an almost identical syntenic gene pair ratio. These novel genome assemblies are the first steps towards the chromosome-level assembly of Au. canescens and understanding the genome diversity within the genus.
Affiliation(s)
| | - Zübeyde Uğurlu Aydın
- Molecular Plant Systematic Laboratory (MOBIS), Department of Biology, Faculty of Science, Hacettepe University, Ankara 06800, Turkey
| | - Xu Cai
- Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Beijing 100081, China
| | - Xiaowu Wang
- Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Beijing 100081, China
| | - Ali A Dönmez
- Molecular Plant Systematic Laboratory (MOBIS), Department of Biology, Faculty of Science, Hacettepe University, Ankara 06800, Turkey
| |
|
16
|
Genome sequence assembly algorithms and misassembly identification methods. Mol Biol Rep 2022; 49:11133-11148. [PMID: 36151399 DOI: 10.1007/s11033-022-07919-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/20/2022] [Accepted: 09/05/2022] [Indexed: 10/14/2022]
Abstract
The sequence assembly algorithms have rapidly evolved with the vigorous growth of genome sequencing technology over the past two decades. Assembly mainly uses the iterative expansion of overlap relationships between sequences to construct the target genome. The assembly algorithms can be typically classified into several categories, such as the Greedy strategy, Overlap-Layout-Consensus (OLC) strategy, and de Bruijn graph (DBG) strategy. In particular, due to the rapid development of third-generation sequencing (TGS) technology, some prevalent assembly algorithms have been proposed to generate high-quality chromosome-level assemblies. However, due to the genome complexity, the length of short reads, and the high error rate of long reads, contigs produced by assembly may contain misassemblies adversely affecting downstream data analysis. Therefore, several read-based and reference-based methods for misassembly identification have been developed to improve assembly quality. This work primarily reviewed the development of DNA sequencing technologies and summarized sequencing data simulation methods, sequencing error correction methods, various mainstream sequence assembly algorithms, and misassembly identification methods. A large amount of computation makes the sequence assembly problem more challenging, and therefore, it is necessary to develop more efficient and accurate assembly algorithms and alternative algorithms.
|
17
|
K-Mer Spectrum-Based Error Correction Algorithm for Next-Generation Sequencing Data. COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE 2022; 2022:8077664. [PMID: 35875730 PMCID: PMC9303089 DOI: 10.1155/2022/8077664] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/18/2022] [Accepted: 06/13/2022] [Indexed: 11/26/2022]
Abstract
In the mid-1970s, the first-generation sequencing technique (Sanger) was created. It used Advanced BioSystems sequencing devices and Beckman's GeXP genetic testing technology. The second-generation sequencing (2GS) technique arrived just several years after the first human genome was published in 2003. 2GS devices are much faster than Sanger sequencing equipment, with considerably lower manufacturing costs and far higher throughput in the form of short reads. The third-generation sequencing (3GS) method, initially introduced in 2005, offers further reduced manufacturing costs and higher throughput. Although sequencing techniques have advanced through several generations, they remain error-prone, in part because of the large number of reads. The study of this massive amount of data will aid in decoding the secrets of life, detecting infections, developing improved crops, and improving quality of life, among other things. This is a challenging task, complicated not just by the large number of reads but also by the occurrence of sequencing errors. As a result, error correction is a crucial step in data processing; it entails identifying and correcting read errors. As demonstrated in this work, the performance of k-spectrum-based error correction algorithms can be influenced by a variety of characteristics such as coverage depth, read length, and genome size. Consequently, time and effort must be put into selecting suitable error correction approaches for a given NGS dataset.
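As a rough illustration of the k-spectrum idea mentioned in this abstract, the following minimal Python sketch counts every k-mer across a set of reads and flags those seen fewer than a threshold number of times as candidates for containing sequencing errors. It is not the algorithm evaluated in the paper; the function names, the toy reads, and the threshold of 2 are all illustrative assumptions.

```python
from collections import Counter


def kmer_spectrum(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def weak_kmers(counts, min_count):
    """Return k-mers seen fewer than min_count times (candidate errors)."""
    return {kmer for kmer, n in counts.items() if n < min_count}


# Toy data (hypothetical): three identical reads and one read with a
# single substitution (T -> A at position 3).
reads = ["ACGTACGT", "ACGTACGT", "ACGTACGT", "ACGAACGT"]
counts = kmer_spectrum(reads, k=4)
suspect = weak_kmers(counts, min_count=2)
print(sorted(suspect))
```

In this toy case only the k-mers overlapping the substituted base fall below the threshold. With real data, a sensible threshold would depend on coverage depth, read length, and genome size, which are the factors the abstract names.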
|
18
|
Tang T, Hutvagner G, Wang W, Li J. Simultaneous compression of multiple error-corrected short-read sets for faster data transmission and better de novo assemblies. Brief Funct Genomics 2022; 21:387-398. [PMID: 35848773 DOI: 10.1093/bfgp/elac016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/24/2022] [Revised: 06/10/2022] [Accepted: 06/14/2022] [Indexed: 11/14/2022] Open
Abstract
Next-Generation Sequencing has produced enormous amounts of short-read sequence data for de novo genome assembly over the last decades. For efficient transmission of these huge datasets, high-performance compression algorithms have been intensively studied. As both de novo assembly and error correction methods utilize the overlaps between reads, a concern is whether the sequencing errors that negatively affect genome assemblies also affect the compression of NGS data. This work addresses two problems: how current error correction algorithms can enable compression algorithms to make the sequence data much more compact, and whether reads modified by error-correction algorithms lead to quality improvements in de novo contig assembly. As multiple sets of short reads are often produced by a single biomedical project in practice, we propose a graph-based method to reorder the files in a collection of multiple sets and then compress them simultaneously for a further compression improvement after error correction. We use examples to illustrate that accurate error correction algorithms can significantly reduce the number of mismatched nucleotides in reference-free compression and hence can greatly improve compression performance. Extensive tests on practical collections of multiple short-read sets confirm that compression performance on the error-corrected data (with unchanged size) is significantly better than on the original data, and that the file reordering idea contributes further. Error correction of the original reads has also resulted in quality improvements of the genome assemblies, sometimes remarkably. However, how to combine appropriate error correction methods with an assembly algorithm so that assembly performance is always significantly improved remains an open question.
Affiliation(s)
- Tao Tang
- Data Science Institute, University of Technology Sydney, 81 Broadway, Ultimo, 2007, NSW, Australia
- School of Modern Posts, Nanjing University of Posts and Telecommunications, 9 Wenyuan Rd, Qixia District, 210003, Jiangsu, China
| | - Gyorgy Hutvagner
- School of Biomedical Engineering, University of Technology Sydney, 81 Broadway, Ultimo, 2007, NSW, Australia
| | - Wenjian Wang
- School of Computer and Information Technology, Shanxi University, Shanxi Road, 030006, Shanxi, China
| | - Jinyan Li
- Data Science Institute, University of Technology Sydney, 81 Broadway, Ultimo, 2007, NSW, Australia
| |
|
19
|
Kallenborn F, Cascitti J, Schmidt B. CARE 2.0: reducing false-positive sequencing error corrections using machine learning. BMC Bioinformatics 2022; 23:227. [PMID: 35698033 PMCID: PMC9195321 DOI: 10.1186/s12859-022-04754-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 02/03/2022] [Accepted: 05/30/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Next-generation sequencing pipelines often perform error correction as a preprocessing step to obtain cleaned input data. State-of-the-art error correction programs are able to reliably detect and correct the majority of sequencing errors. However, they also introduce new errors by making false-positive corrections. These correction mistakes can have negative impact on downstream analysis, such as k-mer statistics, de-novo assembly, and variant calling. This motivates the need for more precise error correction tools. RESULTS We present CARE 2.0, a context-aware read error correction tool based on multiple sequence alignment targeting Illumina datasets. In addition to a number of newly introduced optimizations its most significant change is the replacement of CARE 1.0's hand-crafted correction conditions with a novel classifier based on random decision forests trained on Illumina data. This results in up to two orders-of-magnitude fewer false-positive corrections compared to other state-of-the-art error correction software. At the same time, CARE 2.0 is able to achieve high numbers of true-positive corrections comparable to its competitors. On a simulated full human dataset with 914M reads CARE 2.0 generates only 1.2M false positives (FPs) (and 801.4M true positives (TPs)) at a highly competitive runtime while the best corrections achieved by other state-of-the-art tools contain at least 3.9M FPs and at most 814.5M TPs. Better de-novo assembly and improved k-mer analysis show the applicability of CARE 2.0 to real-world data. CONCLUSION False-positive corrections can negatively influence down-stream analysis. The precision of CARE 2.0 greatly reduces the number of those corrections compared to other state-of-the-art programs including BFC, Karect, Musket, Bcool, SGA, and Lighter. 
Thus, higher-quality datasets are produced which improve k-mer analysis and de-novo assembly in real-world datasets which demonstrates the applicability of machine learning techniques in the context of sequencing read error correction. CARE 2.0 is written in C++/CUDA for Linux systems and can be run on the CPU as well as on CUDA-enabled GPUs. It is available at https://github.com/fkallen/CARE .
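The false-positive and true-positive counts quoted in this abstract can be turned into a precision figure with a short calculation. The sketch below is only a worked example under stated assumptions: the `precision` helper is hypothetical, and because the abstract gives only bounds for the other tools (at least 3.9M FPs, at most 814.5M TPs), the second value is an upper bound on their precision rather than a measurement for any single tool.

```python
def precision(true_positives, false_positives):
    """Fraction of all corrections made that were correct."""
    return true_positives / (true_positives + false_positives)


# Counts reported in the abstract for the simulated human dataset.
care_precision = precision(801.4e6, 1.2e6)

# Upper bound for the other tools: most TPs and fewest FPs reported.
others_upper_bound = precision(814.5e6, 3.9e6)

print(f"CARE 2.0 precision:       {care_precision:.4f}")
print(f"Other tools, upper bound: {others_upper_bound:.4f}")
```

Under these assumptions CARE 2.0 reaches a precision of roughly 0.9985, against at most about 0.9952 for the other tools, which matches the abstract's emphasis on fewer false-positive corrections.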
Affiliation(s)
- Felix Kallenborn
- Department of Computer Science, Johannes Gutenberg University Mainz, Mainz, Germany
| | - Julian Cascitti
- Department of Computer Science, Johannes Gutenberg University Mainz, Mainz, Germany
| | - Bertil Schmidt
- Department of Computer Science, Johannes Gutenberg University Mainz, Mainz, Germany
| |
|
20
|
Nesterenko M, Miroliubov A. From head to rootlet: comparative transcriptomic analysis of a rhizocephalan barnacle Peltogaster reticulata (Crustacea: Rhizocephala). F1000Res 2022; 11:583. [PMID: 36447930 PMCID: PMC9664023 DOI: 10.12688/f1000research.110492.1] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Accepted: 01/04/2023] [Indexed: 09/16/2023] Open
Abstract
Background: Rhizocephalan barnacles stand out in the diverse world of metazoan parasites. The body of a rhizocephalan female is modified beyond revealing any recognizable morphological features, consisting of the interna, a system of rootlets, and the externa, a sac-like reproductive body. Moreover, rhizocephalans have an outstanding ability to control their hosts, literally turning them into "zombies". Despite all these amazing traits, there are no genomic or transcriptomic data about any Rhizocephala. Methods: We collected transcriptomes from four body parts of an adult female rhizocephalan Peltogaster reticulata: the externa, and the main, growing, and thoracic parts of the interna. We used all prepared data for the de novo assembly of the reference transcriptome. Next, a set of encoded proteins was determined, the expression levels of protein-coding genes in different parts of the parasite's body were calculated and lists of enriched bioprocesses were identified. We also in silico identified and analyzed sets of potential excretory / secretory proteins. Finally, we applied phylostratigraphy and evolutionary transcriptomics approaches to our data. Results: The assembled reference transcriptome included transcripts of 12,620 protein-coding genes and was the first for any rhizocephalan. Based on the results obtained, the spatial heterogeneity of protein-coding gene expression in different regions of the adult female body of P. reticulata was established. The results of both transcriptomic analysis and histological studies indicated the presence of germ-like cells in the lumen of the interna. The potential molecular basis of the interaction between the nervous system of the host and the parasite's interna was also determined. Given the prolonged expression of development-associated genes, we suggest that rhizocephalans "got stuck in their metamorphosis", even at the reproductive stage. 
Conclusions: The results of the first comparative transcriptomic analysis for Rhizocephala not only clarified but also expanded the existing ideas about the biology of these extraordinary parasites.
Affiliation(s)
- Maksim Nesterenko
- Department of Invertebrate Zoology, St Petersburg State University, St Petersburg, 199034, Russian Federation
- Laboratory of parasitic worms and protists, Zoological Institute of Russian Academy of Sciences, St Petersburg, 199034, Russian Federation
| | - Aleksei Miroliubov
- Laboratory of parasitic worms and protists, Zoological Institute of Russian Academy of Sciences, St Petersburg, 199034, Russian Federation
| |
|
21
|
Kim MJ, Park JS, Kim H, Kim SR, Kim SW, Kim KY, Kwak W, Kim I. Phylogeographic Relationships among Bombyx mandarina (Lepidoptera: Bombycidae) Populations and Their Relationships to B. mori Inferred from Mitochondrial Genomes. BIOLOGY 2022; 11:68. [PMID: 35053066 PMCID: PMC8773246 DOI: 10.3390/biology11010068] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 11/08/2021] [Revised: 12/19/2021] [Accepted: 12/30/2021] [Indexed: 02/01/2023]
Abstract
We report 37 mitochondrial genome (mitogenome) sequences of Bombyx mori strains (Lepidoptera: Bombycidae) and four of B. mandarina individuals, each preserved and collected, respectively, in South Korea. These mitogenome sequences, combined with 45 publicly available sequences, showed a substantial genetic reduction in B. mori strains compared to the presumed ancestor B. mandarina, with the highest diversity detected in B. mori of Chinese origin. Chinese B. mandarina were divided into northern and southern groups, concordant with the Qinling-Huaihe line, and the northern group was placed as an immediate progenitor of monophyletic B. mori strains in phylogenetic analyses, as has previously been detected. However, one individual that was in close proximity to the south Qinling-Huaihe line was exceptional, belonging to the northern group. The enigmatic South Korean population of B. mandarina, which has often been regarded as genetically closer to Japan, was most similar to the northern Chinese group, evidencing substantial gene flow between the two regions. Although a substantial genetic divergence is present between B. mandarina in southern China and Japan, a highly supported sister relationship between the two regional populations may suggest the potential origin of Japanese B. mandarina from southern China instead of the Korean peninsula.
Affiliation(s)
- Min-Jee Kim
- Experiment and Analysis Division, Honam Regional Office, Animal and Plant Quarantine Agency, Gunsan 54096, Korea;
- Department of Applied Biology, College of Agriculture & Life Sciences, Chonnam National University, Gwangju 61186, Korea; (J.-S.P.); (H.K.)
| | - Jeong-Sun Park
- Department of Applied Biology, College of Agriculture & Life Sciences, Chonnam National University, Gwangju 61186, Korea; (J.-S.P.); (H.K.)
| | - Hyeongmin Kim
- Department of Applied Biology, College of Agriculture & Life Sciences, Chonnam National University, Gwangju 61186, Korea; (J.-S.P.); (H.K.)
| | - Seong-Ryul Kim
- Department of Agricultural Biology, National Academy of Agricultural Science, Rural Development Administration, Wanju Gun 55365, Korea; (S.-R.K.); (S.-W.K.); (K.-Y.K.)
| | - Seong-Wan Kim
- Department of Agricultural Biology, National Academy of Agricultural Science, Rural Development Administration, Wanju Gun 55365, Korea; (S.-R.K.); (S.-W.K.); (K.-Y.K.)
| | - Kee-Young Kim
- Department of Agricultural Biology, National Academy of Agricultural Science, Rural Development Administration, Wanju Gun 55365, Korea; (S.-R.K.); (S.-W.K.); (K.-Y.K.)
| | | | - Iksoo Kim
- Department of Applied Biology, College of Agriculture & Life Sciences, Chonnam National University, Gwangju 61186, Korea; (J.-S.P.); (H.K.)
|
22
|
Complete Mitochondrial Genomes of Metcalfa pruinosa and Salurnis marginella (Hemiptera: Flatidae): Genomic Comparison and Phylogenetic Inference in Fulgoroidea. Curr Issues Mol Biol 2021; 43:1391-1418. [PMID: 34698117 PMCID: PMC8929015 DOI: 10.3390/cimb43030099] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/10/2021] [Revised: 09/25/2021] [Accepted: 09/27/2021] [Indexed: 12/30/2022]
Abstract
The complete mitochondrial genomes (mitogenomes) of two DNA barcode-defined haplotypes of Metcalfa pruinosa and one of Salurnis marginella (Hemiptera: Flatidae) were sequenced and compared to those of other Fulgoroidea species. Furthermore, the mitogenome sequences were used to reconstruct phylogenetic relationships among fulgoroid families. The three mitogenomes, including that of the available species of Flatidae, commonly possessed distinctive structures in the 1702-1836 bp A+T-rich region, such as two repeat regions at each end and a large centered nonrepeat region. All members of the superfamily Fulgoroidea, including the Flatidae, consistently possessed a motif-like sequence (TAGTA) at the ND1 and trnS2 junction. The phylogenetic analyses consistently recovered the familial relationships of (((((Ricaniidae + Issidae) + Flatidae) + Fulgoridae) + Achilidae) + Derbidae) in the amino acid-based analysis, with the placement of Cixiidae and Delphacidae as the earliest-derived lineages of fulgoroid families, whereas the monophyly of Delphacidae was not congruent between tree-constructing algorithms.
|
23
|
Wu J, Zhang S, Zhang T, Liu Y. HD-Code: End-to-End High Density Code for DNA Storage. IEEE Trans Nanobioscience 2021; 20:455-463. [PMID: 34343096 DOI: 10.1109/tnb.2021.3102122] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Abstract
With the rapid development of digital information techniques, the use of DNA media for information storage is considered the future direction of data storage. Existing DNA storage schemes simply map compressed binary multimedia data into DNA base data, which has the disadvantages of data loss, low logical storage density, and high cost of synthesis. This paper presents an end-to-end high density DNA encoding algorithm (referred to as HD-code, where HD stands for high density). The novelty and contributions of this work comprise three parts. First, by taking full advantage of the statistical characteristics of the original multimedia data and considering the biological constraints on the DNA bases, the proposed scheme achieves higher logical storage density and improves the flexibility and consistency in data storage. Second, by performing data conversion, the proposed scheme can effectively encode extreme images with a large proportion of a single color. Third, the proposed method can reconstruct high quality images and reduce synthesis costs by yielding a better rate-PSNR (Peak Signal to Noise Ratio).
|
24
|
Charlesworth D, Graham C, Trivedi U, Gardner J, Bergero R. PromethION sequencing and assembly of the genome of Micropoecilia picta, a fish with a highly degenerated Y chromosome. Genome Biol Evol 2021; 13:6326803. [PMID: 34297069 PMCID: PMC8449826 DOI: 10.1093/gbe/evab171] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Accepted: 07/19/2021] [Indexed: 11/13/2022]
Abstract
We here describe sequencing and assembly of both the autosomes and the sex chromosome in M. picta, the closest related species to the guppy, Poecilia reticulata. Poecilia (Micropoecilia) picta is a close outgroup for studying the guppy, an important organism for studies in evolutionary ecology and in sex chromosome evolution. The guppy XY pair (LG12) has long been studied as a test case for the importance of sexually antagonistic variants in selection for suppressed recombination between Y and X chromosomes. The guppy Y chromosome is not degenerated, but appears to carry functional copies of all genes that are present on its X counterpart. The X chromosomes of M. picta (and its relative M. parae) are homologous to the guppy XY pair, but their Y chromosomes are highly degenerated, and no genes can be identified in the fully Y-linked region. A complete genome sequence of an M. picta male may therefore contribute to understanding how the guppy Y evolved. These fish species' genomes are estimated to be about 750 Mb, with high densities of repetitive sequences, suggesting that long-read sequencing is needed. We evaluated several assembly approaches, and used our results to investigate the extent of Y chromosome degeneration in this species.
Affiliation(s)
- Deborah Charlesworth
- Institute of Evolutionary Biology, School of Biological Sciences, University of Edinburgh, Charlotte Auerbach Road, EH9 3LF, UK
| | - Chay Graham
- Institute of Evolutionary Biology, School of Biological Sciences, University of Edinburgh, Charlotte Auerbach Road, EH9 3LF, UK; University of Cambridge, Department of Biochemistry, Sanger Building, 80 Tennis Ct Rd, Cambridge, CB2 1GA, UK
| | - Urmi Trivedi
- Institute of Evolutionary Biology, School of Biological Sciences, University of Edinburgh, Charlotte Auerbach Road, EH9 3LF, UK
| | - Jim Gardner
- Institute of Evolutionary Biology, School of Biological Sciences, University of Edinburgh, Charlotte Auerbach Road, EH9 3LF, UK
| | - Roberta Bergero
- Institute of Evolutionary Biology, School of Biological Sciences, University of Edinburgh, Charlotte Auerbach Road, EH9 3LF, UK
|
25
|
Zhang X, Ping P, Hutvagner G, Blumenstein M, Li J. Aberration-corrected ultrafine analysis of miRNA reads at single-base resolution: a k-mer lattice approach. Nucleic Acids Res 2021; 49:e106. [PMID: 34291293 PMCID: PMC8631080 DOI: 10.1093/nar/gkab610] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 12/21/2020] [Revised: 07/01/2021] [Accepted: 07/06/2021] [Indexed: 12/21/2022]
Abstract
Raw sequencing reads of miRNAs contain machine-made substitution errors, or even insertions and deletions (indels). Although the error rate can be as low as 0.1%, precise rectification of these errors is critically important because isoform variation analysis at single-base resolution, such as novel isomiR discovery, characterization of editing events, differential expression analysis, or tissue-specific isoform identification, is very sensitive to base positions and copy counts of the reads. Existing error correction methods do not work for miRNA sequencing data because the length and per-read-coverage properties of miRNAs are distinct from those of DNA or mRNA sequencing reads. We present a novel lattice structure combining k-mers, (k – 1)-mers and (k + 1)-mers to address this problem. The method is particularly effective for the correction of indel errors. Extensive tests on datasets with known ground truth of errors demonstrate that the method is able to remove almost all of the errors, without introducing any new error, improving the data quality from one error in every 50 reads to one error in every 1300 reads. Studies on experimental miRNA sequencing datasets show that the errors are often rectified at the 5′ ends and the seed regions of the reads, and that there are remarkable changes after the correction in miRNA isoform abundance, volume of singleton reads, overall entropy, isomiR families, tissue-specific miRNAs, and rare-miRNA quantities.
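As a rough illustration of the raw ingredients such a lattice might draw on, the sketch below counts (k − 1)-mers, k-mers, and (k + 1)-mers across a small set of reads. It is not the authors' method: the function names `count_kmers` and `lattice_counts` and the toy reads are assumptions made purely for illustration, and no error correction is performed.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        # Reads shorter than k contribute nothing (empty range).
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def lattice_counts(reads, k):
    """Return substring counts for sizes k-1, k, and k+1, keyed by size."""
    return {size: count_kmers(reads, size) for size in (k - 1, k, k + 1)}


# Hypothetical toy reads: two identical copies and one single-base variant.
reads = ["UGAGGUAG", "UGAGGUAG", "UGAGGAAG"]
levels = lattice_counts(reads, 4)
```

Substrings covering the variant base occur only once at every level, while those shared by all three reads occur three times; a count contrast of this kind is what k-mer based correctors typically exploit.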
Affiliation(s)
- Xuan Zhang
- Data Science Institute, University of Technology Sydney, PO Box 123, Broadway, NSW 2007, Australia
| | - Pengyao Ping
- Data Science Institute, University of Technology Sydney, PO Box 123, Broadway, NSW 2007, Australia
| | - Gyorgy Hutvagner
- School of Biomedical Engineering, Faculty of Engineering and IT, University of Technology Sydney, PO Box 123, Broadway, NSW 2007, Australia
| | - Michael Blumenstein
- Faculty of Engineering and IT, University of Technology Sydney, PO Box 123, Broadway, NSW 2007, Australia
| | - Jinyan Li
- To whom correspondence should be addressed. Tel: +61 295149264; Fax: +61 295149264;
|
26
|
Muthukumarasamy U, Preusse M, Kordes A, Koska M, Schniederjans M, Khaledi A, Häussler S. Single-Nucleotide Polymorphism-Based Genetic Diversity Analysis of Clinical Pseudomonas aeruginosa Isolates. Genome Biol Evol 2021; 12:396-406. [PMID: 32196089 PMCID: PMC7197496 DOI: 10.1093/gbe/evaa059] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Accepted: 03/19/2020] [Indexed: 01/26/2023]
Abstract
Extensive use of next-generation sequencing has the potential to transform our knowledge of how genomic variation within bacterial species impacts phenotypic versatility. Because different environments have unique selection pressures, they drive divergent evolution. However, there is also parallel or convergent evolution of traits in independent bacterial isolates inhabiting similar environments. The application of tools to describe population-wide genomic diversity provides an opportunity to measure the predictability of genetic changes underlying adaptation. Here, we describe patterns of sequence variation in the core genome among 99 individual Pseudomonas aeruginosa clinical isolates and identify single-nucleotide polymorphisms that are the basis for branching of the phylogenetic tree. We also identify single-nucleotide polymorphisms that were acquired independently, in separate lineages, and not through inheritance from a common ancestor. Although our results demonstrate that the Pseudomonas aeruginosa core genome is highly conserved and, in general, not subject to adaptive evolution, instances of parallel evolution will provide an opportunity to uncover genetic changes that underlie phenotypic diversity.
Affiliation(s)
- Uthayakumar Muthukumarasamy
- Department of Molecular Bacteriology, Helmholtz Centre for Infection Research, Braunschweig, Germany; Institute of Molecular Bacteriology, TWINCORE GmbH, Center for Clinical and Experimental Infection Research, Hannover, Germany
| | - Matthias Preusse
- Department of Molecular Bacteriology, Helmholtz Centre for Infection Research, Braunschweig, Germany; Institute of Molecular Bacteriology, TWINCORE GmbH, Center for Clinical and Experimental Infection Research, Hannover, Germany
| | - Adrian Kordes
- Department of Molecular Bacteriology, Helmholtz Centre for Infection Research, Braunschweig, Germany; Institute of Molecular Bacteriology, TWINCORE GmbH, Center for Clinical and Experimental Infection Research, Hannover, Germany
| | - Michal Koska
- Department of Molecular Bacteriology, Helmholtz Centre for Infection Research, Braunschweig, Germany; Institute of Molecular Bacteriology, TWINCORE GmbH, Center for Clinical and Experimental Infection Research, Hannover, Germany
| | - Monika Schniederjans
- Department of Molecular Bacteriology, Helmholtz Centre for Infection Research, Braunschweig, Germany; Institute of Molecular Bacteriology, TWINCORE GmbH, Center for Clinical and Experimental Infection Research, Hannover, Germany
| | - Ariane Khaledi
- Department of Molecular Bacteriology, Helmholtz Centre for Infection Research, Braunschweig, Germany; Institute of Molecular Bacteriology, TWINCORE GmbH, Center for Clinical and Experimental Infection Research, Hannover, Germany
| | - Susanne Häussler
- Department of Molecular Bacteriology, Helmholtz Centre for Infection Research, Braunschweig, Germany; Institute of Molecular Bacteriology, TWINCORE GmbH, Center for Clinical and Experimental Infection Research, Hannover, Germany
|
27
|
Kallenborn F, Hildebrandt A, Schmidt B. CARE: context-aware sequencing read error correction. Bioinformatics 2021; 37:889-895. [PMID: 32818262 DOI: 10.1093/bioinformatics/btaa738] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 05/08/2020] [Revised: 07/14/2020] [Accepted: 08/14/2020] [Indexed: 11/14/2022]
Abstract
MOTIVATION: Error correction is a fundamental pre-processing step in many Next-Generation Sequencing (NGS) pipelines, in particular for de novo genome assembly. However, existing error correction methods either suffer from high false-positive rates since they break reads into independent k-mers or do not scale efficiently to large amounts of sequencing reads and complex genomes. RESULTS: We present CARE, an alignment-based scalable error correction algorithm for Illumina data using the concept of minhashing. Minhashing allows for efficient similarity search within large sequencing read collections, which enables fast computation of high-quality multiple alignments. Sequencing errors are corrected by detailed inspection of the corresponding alignments. Our performance evaluation shows that CARE generates significantly fewer false-positive corrections than state-of-the-art tools (Musket, SGA, BFC, Lighter, Bcool, Karect) while maintaining a competitive number of true positives. When used prior to assembly it can achieve superior de novo assembly results for a number of real datasets. CARE is also the first multiple sequence alignment-based error corrector that is able to process a human genome Illumina NGS dataset in only 4 h on a single workstation using GPU acceleration. AVAILABILITY AND IMPLEMENTATION: CARE is open-source software written in C++ (CPU version) and in CUDA/C++ (GPU version). It is licensed under GPLv3 and can be downloaded at https://github.com/fkallen/CARE. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
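The general idea of minhashing mentioned in this abstract can be sketched as follows. This is a minimal, generic illustration and not CARE's implementation: the function names, the choice of salted BLAKE2b as the hash family, and the parameters `k=4` and `num_hashes=64` are all assumptions. Each read is reduced to a short signature of per-hash minima over its k-mer set, and the fraction of matching signature positions estimates the Jaccard similarity of the two k-mer sets.

```python
import hashlib


def kmer_set(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def minhash_signature(seq, k=4, num_hashes=64):
    """Build a MinHash signature: one minimum hash value per salted hash function.

    Assumes len(seq) >= k, otherwise the k-mer set is empty and min() fails.
    """
    kmers = kmer_set(seq, k)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # distinct salt = distinct hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(km.encode(), digest_size=8, salt=salt).digest(),
                "big",
            )
            for km in kmers
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree between two signatures."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Hypothetical toy reads: read_b differs from read_a by one substitution.
read_a = "ACGTTGCATGCAGTCAGTCA"
read_b = "ACGTTGCATGCAGTCAGTGA"
similarity = estimated_jaccard(minhash_signature(read_a), minhash_signature(read_b))
```

In a read-similarity search, signatures of this kind would usually be bucketed in hash tables so that candidate overlapping reads can be retrieved without comparing all pairs.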
Affiliation(s)
- Felix Kallenborn
- Department of Computer Science, Johannes Gutenberg University, Mainz 55122, Germany
| | - Andreas Hildebrandt
- Department of Computer Science, Johannes Gutenberg University, Mainz 55122, Germany
| | - Bertil Schmidt
- Department of Computer Science, Johannes Gutenberg University, Mainz 55122, Germany
|
28
|
Zhang X, Liu Y, Yu Z, Blumenstein M, Hutvagner G, Li J. Instance-based error correction for short reads of disease-associated genes. BMC Bioinformatics 2021; 22:142. [PMID: 34078284 PMCID: PMC8170817 DOI: 10.1186/s12859-021-04058-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/16/2021] [Accepted: 03/02/2021] [Indexed: 12/12/2022]
Abstract
BACKGROUND: Genomic reads from sequencing platforms contain random errors. Global correction algorithms have been developed, aiming to rectify all possible errors in the reads using generic genome-wide patterns. However, non-uniform sequencing depths hinder the global approach from removing errors effectively. As some genes may get under-corrected or over-corrected by the global approach, we conduct instance-based error correction for short reads of disease-associated genes or pathways. The paramount requirement is to ensure that the relevant reads, instead of the whole genome, are error-free to provide significant benefits for single-nucleotide polymorphism (SNP) or variant calling studies on the specific genes. RESULTS: To rectify possible errors in the short reads of disease-associated genes, our novel idea is to exploit local sequence features and statistics directly related to these genes. Extensive experiments are conducted in comparison with state-of-the-art methods on both simulated and real datasets of lung cancer associated genes (including single-end and paired-end reads). The results demonstrated the superiority of our method with the best performance on precision, recall and gain rate, as well as on sequence assembly results (e.g., N50, the length of contig and contig quality). CONCLUSION: The instance-based strategy makes it possible to explore fine-grained patterns focusing on specific genes, providing high precision error correction and convincing gene sequence assembly. SNP case studies show that errors occurring at some traditional SNP areas can be accurately corrected, providing high precision and sensitivity for investigations on disease-causing point mutations.
Affiliation(s)
- Xuan Zhang
- Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology Sydney, Ultimo, NSW, 2007, Australia
| | - Yuansheng Liu
- Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology Sydney, Ultimo, NSW, 2007, Australia
| | - Zuguo Yu
- Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education and Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, Xiangtan, 411105, China
| | - Michael Blumenstein
- Faculty of Engineering and IT, University of Technology Sydney, Ultimo, NSW, 2007, Australia
| | - Gyorgy Hutvagner
- Faculty of Engineering and IT, University of Technology Sydney, Ultimo, NSW, 2007, Australia
| | - Jinyan Li
- Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology Sydney, Ultimo, NSW, 2007, Australia.
|
29
|
He K, Eastman TG, Czolacz H, Li S, Shinohara A, Kawada SI, Springer MS, Berenbrink M, Campbell KL. Myoglobin primary structure reveals multiple convergent transitions to semi-aquatic life in the world's smallest mammalian divers. eLife 2021; 10:e66797. [PMID: 33949308 PMCID: PMC8205494 DOI: 10.7554/elife.66797] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 01/22/2021] [Accepted: 05/04/2021] [Indexed: 01/01/2023]
Abstract
The speciose mammalian order Eulipotyphla (moles, shrews, hedgehogs, solenodons) combines an unusual diversity of semi-aquatic, semi-fossorial, and fossorial forms that arose from terrestrial forebears. However, our understanding of the ecomorphological pathways leading to these lifestyles has been confounded by a fragmentary fossil record, unresolved phylogenetic relationships, and potential morphological convergence, calling for novel approaches. The net surface charge of the oxygen-storing muscle protein myoglobin (ZMb), which can be readily determined from its primary structure, provides an objective target to address this question due to mechanistic linkages with myoglobin concentration. Here, we generate a comprehensive 71-species molecular phylogeny that resolves previously intractable intra-family relationships and then ancestrally reconstruct ZMb evolution to identify ancient lifestyle transitions based on protein sequence alone. Our phylogenetically informed analyses confidently resolve fossorial habits having evolved twice in talpid moles and reveal five independent secondary aquatic transitions in the order housing the world's smallest endothermic divers.
Affiliation(s)
- Kai He
- Department of Biological Sciences, University of Manitoba, Winnipeg, Canada
- Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Southern Medical University, Guangzhou, China
- State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, China
- Guangdong Provincial Key Laboratory of Single Cell Technology and Application, Southern Medical University, Guangzhou, China
| | - Triston G Eastman
- Department of Biological Sciences, University of Manitoba, Winnipeg, Canada
| | - Hannah Czolacz
- Department of Evolution, Ecology and Behaviour, University of Liverpool, Liverpool, United Kingdom
| | - Shuhao Li
- Department of Biological Sciences, University of Manitoba, Winnipeg, Canada
| | - Akio Shinohara
- Department of Bio-resources, Division of Biotechnology, Frontier Science Research Center, University of Miyazaki, Miyazaki, Japan
| | - Shin-ichiro Kawada
- Department of Zoology, Division of Vertebrates, National Museum of Nature and Science, Tokyo, Japan
| | - Mark S Springer
- Department of Evolution, Ecology and Organismal Biology, University of California, Riverside, Riverside, United States
| | - Michael Berenbrink
- Department of Evolution, Ecology and Behaviour, University of Liverpool, Liverpool, United Kingdom
| | - Kevin L Campbell
- Department of Biological Sciences, University of Manitoba, Winnipeg, Canada
|
30
|
Hosseini ZZ, Rahimi SK, Forouzan E, Baraani A. RMI-DBG algorithm: A more agile iterative de Bruijn graph algorithm in short read genome assembly. J Bioinform Comput Biol 2021; 19:2150005. [PMID: 33866959 DOI: 10.1142/s0219720021500050] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
The de Bruijn graph (DBG) algorithm, one of the cornerstone algorithms in short read assembly, has been extended with the rapid advancement of Next Generation Sequencing (NGS) technologies and the low-cost production of millions of high-quality short reads. Erroneous reads, non-uniform coverage, and genomic repeats are three major problems that influence the performance of short read assemblers. To counter these problems, the iterative DBG algorithm applies multiple k-mers instead of a single k-mer, by iterating the DBG graph over a range of k-mer sizes from the minimum to the maximum. However, the iteration paradigm of iterative DBG deals with complex graphs from the beginning of the algorithm and therefore incurs more potential errors and more computational time for resolving various spurious branches. In this research, we propose the Reverse Modified Iterative DBG graph (named RMI-DBG) for short read assembly. RMI-DBG utilizes the DBG algorithm and String graph to achieve the advantages of both algorithms. We show that RMI-DBG performs faster with comparable results in comparison to iterative DBG. Additionally, the quality of the proposed algorithm in terms of continuity and accuracy is evaluated against some commonly used assemblers via several real datasets of the GAGE-B benchmark.
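To make the role of the k-mer size concrete, the sketch below builds a plain de Bruijn graph at several values of k and reports which nodes branch. It is a toy illustration only, not RMI-DBG or any published iterative assembler: the helper names `build_dbg` and `branching_nodes` and the two reads are assumptions, and no error handling, coverage filtering, or contig extraction is attempted.

```python
from collections import defaultdict


def build_dbg(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # one edge per distinct k-mer
    return graph


def branching_nodes(graph):
    """Return the nodes with more than one outgoing edge, sorted for stable output."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)


# Hypothetical overlapping reads containing the short repeat "ACG".
reads = ["ACGTACGA", "CGTACGAT"]
branches_by_k = {k: branching_nodes(build_dbg(reads, k)) for k in range(3, 6)}
```

With these toy reads the graph branches at k = 3 and k = 4 because of the repeated "ACG", and the branch disappears at k = 5, which is the usual motivation for combining several k-mer sizes: small k keeps the graph connected, large k resolves short repeats.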
Affiliation(s)
| | | | - Esmaeil Forouzan
- National Institute for Genetic Engineering & Biotechnology (NIGEB), Tehran, Iran; GeneMan Genomics Ltd (www.ggenomics.ir), Shiraz, Iran
| | - Ahmad Baraani
- Department of Software Engineering, University of Isfahan, Iran
|
31
|
Oh HK, Hwang YJ, Hong HW, Myung H. Comparison of Enterococcus faecalis Biofilm Removal Efficiency among Bacteriophage PBEF129, Its Endolysin, and Cefotaxime. Viruses 2021; 13:v13030426. [PMID: 33800040 PMCID: PMC7999683 DOI: 10.3390/v13030426] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/17/2020] [Revised: 03/01/2021] [Accepted: 03/03/2021] [Indexed: 02/07/2023]
Abstract
Enterococcus faecalis is a Gram-positive pathogen that colonizes human intestinal surfaces, forms biofilms, and demonstrates high resistance to many antibiotics. In particular, antibiotics are less effective at eradicating biofilms, and better alternatives are needed. In this study, we have isolated and characterized a bacteriophage, PBEF129, infecting E. faecalis. PBEF129 infected a variety of strains of E. faecalis, including those exhibiting antibiotic resistance. Its genome is a linear double-stranded DNA, 144,230 base pairs in length. Its GC content is 35.9%. The closest genomic DNA sequence was found in Enterococcus phage vB_EfaM_Ef2.3, with a sequence identity of 99.06% over 95% query coverage. Furthermore, 75 open reading frames (ORFs) were functionally annotated and five tRNA-encoding genes were found. ORF 6 was annotated as a phage endolysin having an N-acetylmuramoyl-L-alanine amidase activity. We purified the enzyme as a recombinant protein and confirmed its enzymatic activity. The endolysin's host range was observed to be wider than that of its parent phage PBEF129. When applied to bacterial biofilm on the surface of in vitro cultured human intestinal cells, it demonstrated a removal efficacy of the same degree as cefotaxime, but much lower than that of its parent bacteriophage.
Affiliation(s)
- Hyun Keun Oh
- Department of Bioscience and Biotechnology, Hankuk University of Foreign Studies, Gyung-Gi Do 17035, Korea; (H.K.O.); (Y.J.H.)
| | - Yoon Jung Hwang
- Department of Bioscience and Biotechnology, Hankuk University of Foreign Studies, Gyung-Gi Do 17035, Korea; (H.K.O.); (Y.J.H.)
| | | | - Heejoon Myung
- Department of Bioscience and Biotechnology, Hankuk University of Foreign Studies, Gyung-Gi Do 17035, Korea; (H.K.O.); (Y.J.H.)
- LyseNTech Co. Ltd., Gyung-Gi Do 17035, Korea;
- Bacteriophage Bank of Korea, Yong-In, Mo-Hyun, Gyung-Gi Do 17035, Korea
- Correspondence:
|
32
|
Álvarez-Pérez S, Dhami MK, Pozo MI, Crauwels S, Verstrepen KJ, Herrera CM, Lievens B, Jacquemyn H. Genetic admixture increases phenotypic diversity in the nectar yeast Metschnikowia reukaufii. Fungal Ecol 2021. [DOI: 10.1016/j.funeco.2020.101016] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/23/2022]
|
33
|
Fehrer J, Slavíková R, Paštová L, Josefiová J, Mráz P, Chrtek J, Bertrand YJK. Molecular Evolution and Organization of Ribosomal DNA in the Hawkweed Tribe Hieraciinae (Cichorieae, Asteraceae). Frontiers in Plant Science 2021; 12:647375. [PMID: 33777082 PMCID: PMC7994888 DOI: 10.3389/fpls.2021.647375] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 01/04/2021] [Accepted: 02/19/2021] [Indexed: 05/14/2023]
Abstract
Molecular evolution of ribosomal DNA can be highly dynamic. Hundreds to thousands of copies in the genome are subject to concerted evolution, which homogenizes sequence variants to different degrees. If well homogenized, sequences are suitable for phylogeny reconstruction; if not, sequence polymorphism has to be handled appropriately. Here we investigate non-coding rDNA sequences (ITS/ETS, 5S-NTS) along with the chromosomal organization of their respective loci (45S and 5S rDNA) in diploids of the Hieraciinae. The subtribe consists of genera Hieracium, Pilosella, Andryala, and Hispidella and has a complex evolutionary history characterized by ancient intergeneric hybridization, allele sharing among species, and incomplete lineage sorting. Direct or cloned Sanger sequences and phased alleles derived from Illumina genome sequencing were subjected to phylogenetic analyses. Patterns of homogenization and tree topologies based on the three regions were compared. In contrast to most other plant groups, 5S-NTS sequences were generally better homogenized than ITS and ETS sequences. A novel case of ancient intergeneric hybridization between Hispidella and Hieracium was inferred, and some further incongruences between the trees were found, suggesting independent evolution of these regions. In some species, homogenization of ITS/ETS and 5S-NTS sequences proceeded in different directions although the 5S rDNA locus always occurred on the same chromosome with one 45S rDNA locus. The ancestral rDNA organization in the Hieraciinae comprised 4 loci of 45S rDNA in terminal positions and 2 loci of 5S rDNA in interstitial positions per diploid genome. In Hieracium, some deviations from this general pattern were found (3, 6, or 7 loci of 45S rDNA; three loci of 5S rDNA). Some of these deviations concerned intraspecific variation, and most of them occurred at the tips of the tree or independently in different lineages. 
This indicates that the organization of rDNA loci is more dynamic than the evolution of sequences contained in them and that locus number is therefore largely unsuitable to inform about species relationships in Hieracium. No consistent differences in the degree of sequence homogenization and the number of 45S rDNA loci were found, suggesting interlocus concerted evolution.
Affiliation(s)
- Judith Fehrer
- Institute of Botany, Czech Academy of Sciences, Průhonice, Czechia
- *Correspondence: Judith Fehrer,
| | - Renáta Slavíková
- Institute of Botany, Czech Academy of Sciences, Průhonice, Czechia
| | | | - Jiřina Josefiová
- Institute of Botany, Czech Academy of Sciences, Průhonice, Czechia
| | - Patrik Mráz
- Department of Botany, Charles University, Prague, Czechia
| | - Jindřich Chrtek
- Institute of Botany, Czech Academy of Sciences, Průhonice, Czechia
- Department of Botany, Charles University, Prague, Czechia
|
34
|
Mycoviral diversity and characteristics of a negative-stranded RNA virus LeNSRV1 in the edible mushroom Lentinula edodes. Virology 2020; 555:89-101. [PMID: 33308828 DOI: 10.1016/j.virol.2020.11.008] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/29/2020] [Revised: 11/04/2020] [Accepted: 11/12/2020] [Indexed: 11/23/2022]
Abstract
Bioinformatics and RT-PCR analysis of RNA from four Lentinula edodes samples identified 22 different virus-like contigs comprising 15 novel and 3 previously reported viruses. We further investigated the Lentinula edodes negative-stranded RNA virus 1 (LeNSRV1) isolated from a symptomatic sample, whose virion is a filamentous particle with a diameter of ~15 nm and a length of ~1200 nm. RT-PCR analysis detected LeNSRV1 in 10 of the 56 Chinese L. edodes core collection strains and 6 of the 22 monokaryotic strains from the L. edodes strain HNZMD. Genetic variation analysis showed that the sequences encoding the nucleocapsid protein (ORF2) from all the aforementioned LeNSRV1-positive strains are highly conserved. The results presented here may enrich our understanding of L. edodes virus diversity and the characteristics of LeNSRV1, and will promote further research on virus-host interactions in L. edodes.
|
35
|
Bennett EP, Petersen BL, Johansen IE, Niu Y, Yang Z, Chamberlain CA, Met Ö, Wandall HH, Frödin M. INDEL detection, the 'Achilles heel' of precise genome editing: a survey of methods for accurate profiling of gene editing induced indels. Nucleic Acids Res 2020; 48:11958-11981. [PMID: 33170255 PMCID: PMC7708060 DOI: 10.1093/nar/gkaa975] [Citation(s) in RCA: 43] [Impact Index Per Article: 10.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/20/2019] [Revised: 10/05/2020] [Accepted: 10/15/2020] [Indexed: 12/11/2022] Open
Abstract
Advances in genome editing technologies have enabled manipulation of genomes at the single base level. These technologies are based on programmable nucleases (PNs) that include meganucleases, zinc-finger nucleases (ZFNs), transcription activator-like effector nucleases (TALENs) and Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR)/CRISPR-associated 9 (Cas9) nucleases and have given researchers the ability to delete, insert or replace genomic DNA in cells, tissues and whole organisms. The great flexibility in re-designing the genomic target specificity of PNs has vastly expanded the scope of gene editing applications in life science, and shows great promise for development of the next generation gene therapies. PN technologies share the principle of inducing a DNA double-strand break (DSB) at a user-specified site in the genome, followed by cellular repair of the induced DSB. PN-elicited DSBs are mainly repaired by the non-homologous end joining (NHEJ) and the microhomology-mediated end joining (MMEJ) pathways, which can elicit a variety of small insertion or deletion (indel) mutations. If indels are elicited in a protein coding sequence and shift the reading frame, targeted gene knock out (KO) can readily be achieved using either of the available PNs. Despite the ease by which gene inactivation in principle can be achieved, in practice, successful KO is not only determined by the efficiency of NHEJ and MMEJ repair; it also depends on the design and properties of the PN utilized, delivery format chosen, the preferred indel repair outcomes at the targeted site, the chromatin state of the target site and the relative activities of the repair pathways in the edited cells. These variables preclude accurate prediction of the nature and frequency of PN induced indels. 
A key step of any gene KO experiment therefore becomes the detection, characterization and quantification of the indel(s) induced at the targeted genomic site in cells, tissues or whole organisms. In this survey, we briefly review naturally occurring indels and their detection. Next, we review the methods that have been developed for detection of PN-induced indels. We briefly outline the experimental steps and describe the pros and cons of the various methods to help users decide on a suitable method for their editing application. We highlight recent advances that enable accurate and sensitive quantification of indel events in cells regardless of their genome complexity, turning a complex pool of different indel events into informative indel profiles. Finally, we review what has been learned about PN-elicited indel formation through the use of the new methods and how this insight is helping to further advance the genome editing field.
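The idea of an indel profile and of frame-shifting indels can be illustrated with a minimal Python sketch. It is not taken from the survey: the input format (one signed indel size per sequenced read, with 0 meaning unedited) and the function names `indel_profile` and `frameshift_fraction` are assumptions made for illustration only.

```python
from collections import Counter


def indel_profile(indel_sizes):
    """Return the fraction of reads observed for each indel size.

    indel_sizes: one signed integer per read (hypothetical input format);
    negative = deletion, positive = insertion, 0 = unedited.
    """
    total = len(indel_sizes)
    counts = Counter(indel_sizes)
    return {size: n / total for size, n in sorted(counts.items())}


def frameshift_fraction(indel_sizes):
    """Fraction of edited reads whose indel length is not a multiple of 3.

    Such indels shift the reading frame when they fall in a coding sequence.
    """
    edited = [s for s in indel_sizes if s != 0]
    if not edited:
        return 0.0
    return sum(1 for s in edited if s % 3 != 0) / len(edited)


# Toy pool of eight reads: three unedited, five carrying an indel.
sizes = [0, 0, -1, -1, 1, -3, -7, 0]
profile = indel_profile(sizes)
```

In this toy pool, the -3 deletion keeps the reading frame, while the -1, +1 and -7 events do not, so a knock-out estimate based on the profile would count only the latter.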
Affiliation(s)
- Eric Paul Bennett
- Copenhagen Center for Glycomics, Department of Odontology and Molecular and Cellular Medicine, Faculty of Health Sciences, University of Copenhagen, DK-2200 Copenhagen N, Denmark
| | - Bent Larsen Petersen
- Department of Plant and Environmental Sciences, University of Copenhagen, DK-1871 Frederiksberg C, Denmark
| | - Ida Elisabeth Johansen
- Department of Plant and Environmental Sciences, University of Copenhagen, DK-1871 Frederiksberg C, Denmark
| | - Yiyuan Niu
- Biotech Research and Innovation Centre (BRIC), Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark
- College of Animal Science and Technology, Northwest A&F University, Yangling Shaanxi, China
| | - Zhang Yang
- Copenhagen Center for Glycomics, Department of Odontology and Molecular and Cellular Medicine, Faculty of Health Sciences, University of Copenhagen, DK-2200 Copenhagen N, Denmark
| | | | - Özcan Met
- Center for Cancer Immune Therapy, Department of Oncology, Copenhagen University Hospital, Herlev, Denmark
- Department of Immunology and Microbiology, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
| | - Hans H Wandall
- Copenhagen Center for Glycomics, Department of Odontology and Molecular and Cellular Medicine, Faculty of Health Sciences, University of Copenhagen, DK-2200 Copenhagen N, Denmark
| | - Morten Frödin
- Biotech Research and Innovation Centre (BRIC), Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark
|
36
|
Nesterenko MA, Starunov VV, Shchenkov SV, Maslova AR, Denisova SA, Granovich AI, Dobrovolskij AA, Khalturin KV. Molecular signatures of the rediae, cercariae and adult stages in the complex life cycles of parasitic flatworms (Digenea: Psilostomatidae). Parasit Vectors 2020; 13:559. [PMID: 33168070 PMCID: PMC7653818 DOI: 10.1186/s13071-020-04424-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/18/2019] [Accepted: 10/24/2020] [Indexed: 11/10/2022] Open
Abstract
Background Parasitic flatworms (Trematoda: Digenea) represent one of the most remarkable examples of drastic morphological diversity among the stages within a life cycle. Which genes are responsible for extreme differences in anatomy, physiology, behavior, and ecology among the stages? Here we report a comparative transcriptomic analysis of parthenogenetic and amphimictic generations in two evolutionary informative species of Digenea belonging to the family Psilostomatidae. Methods In this study the transcriptomes of rediae, cercariae and adult worm stages of Psilotrema simillimum and Sphaeridiotrema pseudoglobulus, were sequenced and analyzed. High-quality transcriptomes were generated, and the reference sets of protein-coding genes were used for differential expression analysis in order to identify stage-specific genes. Comparative analysis of gene sets, their expression dynamics and Gene Ontology enrichment analysis were performed for three life stages within each species and between the two species. Results Reference transcriptomes for P. simillimum and S. pseudoglobulus include 21,433 and 46,424 sequences, respectively. Among 14,051 orthologous groups (OGs), 1354 are common and specific for two analyzed psilostomatid species, whereas 13 and 43 OGs were unique for P. simillimum and S. pseudoglobulus, respectively. In contrast to P. simillimum, where more than 60% of analyzed genes were active in the redia, cercaria and adult worm stages, in S. pseudoglobulus less than 40% of genes had such a ubiquitous expression pattern. In general, 7805 (36.41%) and 30,622 (65.96%) of genes were preferentially expressed in one of the analyzed stages of P. simillimum and S. pseudoglobulus, respectively. In both species 12 clusters of co-expressed genes were identified, and more than a half of the genes belonging to the reference sets were included into these clusters. Functional specialization of the life cycle stages was clearly supported by Gene Ontology enrichment analysis. 
Conclusions During the life cycles of the two species studied, most of the genes change their expression levels considerably; consequently, the molecular signature of a stage is not only a unique set of expressed genes but also the specific levels of their expression. Our results indicate an unexpectedly high level of plasticity in gene regulation between closely related species. The transcriptomes of P. simillimum and S. pseudoglobulus provide a high-quality reference resource for future evolutionary studies and comparative analyses.
Affiliation(s)
- Maksim A Nesterenko
- Department of Invertebrate Zoology, St-Petersburg State University, Saint Petersburg, 199034, Russia.
| | - Viktor V Starunov
- Department of Invertebrate Zoology, St-Petersburg State University, Saint Petersburg, 199034, Russia
- Zoological Institute, Russian Academy of Sciences, Saint Petersburg, 199034, Russia
| | - Sergei V Shchenkov
- Department of Invertebrate Zoology, St-Petersburg State University, Saint Petersburg, 199034, Russia
| | - Anna R Maslova
- Department of Invertebrate Zoology, St-Petersburg State University, Saint Petersburg, 199034, Russia
| | - Sofia A Denisova
- Department of Invertebrate Zoology, St-Petersburg State University, Saint Petersburg, 199034, Russia
| | - Andrey I Granovich
- Department of Invertebrate Zoology, St-Petersburg State University, Saint Petersburg, 199034, Russia
| | - Andrey A Dobrovolskij
- Department of Invertebrate Zoology, St-Petersburg State University, Saint Petersburg, 199034, Russia
| | - Konstantin V Khalturin
- Marine Genomics Unit, OIST, 1919-1 Tancha, Onna-son, Kunigami-gun, Okinawa, 904-0495, Japan
|
37
|
Metatranscriptomics by In Situ RNA Stabilization Directly and Comprehensively Revealed Episymbiotic Microbial Communities of Deep-Sea Squat Lobsters. mSystems 2020; 5:5/5/e00551-20. [PMID: 33024051 PMCID: PMC8534475 DOI: 10.1128/msystems.00551-20] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/03/2023] Open
Abstract
Shinkaia crosnieri is an invertebrate that inhabits areas around deep-sea hydrothermal vents in the Okinawa Trough in Japan and relies on episymbiotic microbes as its primary source of nutrition. To reveal the physiology and phylogenetic composition of the active episymbiotic populations, metatranscriptomics is expected to be a powerful approach. However, this has been hindered by substantial perturbation (e.g., RNA degradation) during time-consuming retrieval from the deep sea. Here, we conducted direct metatranscriptomic analysis of S. crosnieri episymbionts by applying in situ RNA stabilization equipment. As expected, we obtained RNA expression profiles that were substantially different from those obtained by conventional metatranscriptomics (i.e., stabilization after retrieval). The episymbiotic community members were dominated by three orders, namely, Thiotrichales, Methylococcales, and Campylobacterales, and the Campylobacterales members were mostly dominated by the Sulfurovum genus. At a finer phylogenetic scale, the episymbiotic communities on different host individuals shared many species, indicating that the episymbionts on each host individual are not descendants of a few founder cells but are horizontally exchanged. Furthermore, our analysis revealed the key metabolisms of the community: two carbon fixation pathways, a formaldehyde assimilation pathway, and utilization of five electron donors (sulfide, thiosulfate, sulfur, methane, and ammonia) and two electron acceptors (oxygen and nitrate/nitrite). Importantly, it was suggested that Thiotrichales episymbionts can utilize intercellular sulfur globules even when sulfur compounds are not usable, possibly also in a detached and free-living state. IMPORTANCE Deep-sea hydrothermal vent ecosystems remain mysterious. To depict in detail the enigmatic life of chemosynthetic microbes, which are key primary producers in these ecosystems, metatranscriptomic analysis is expected to be a promising approach. 
However, this has been hindered by substantial perturbation (e.g., RNA degradation) during time-consuming retrieval from the deep sea. In this study, we conducted direct metatranscriptome analysis of microbial episymbionts of deep-sea squat lobsters (Shinkaia crosnieri) by applying in situ RNA stabilization equipment. Compared to conventional metatranscriptomics (i.e., RNA stabilization after retrieval), our method provided substantially different RNA expression profiles. Moreover, we discovered that S. crosnieri and its episymbiotic microbes constitute complex and resilient ecosystems, where closely related but various episymbionts are stably maintained by horizontal exchange and partly by their sulfur storage ability for survival even when sulfur compounds are not usable, likely also in a detached and free-living state.
|
38
|
Abstract
High-latitude, perennially stratified (meromictic) lakes are likely to be especially vulnerable to climate warming because of the importance of ice in maintaining their water column structure and associated distribution of microbial communities. This study aimed to characterize viral abundance, diversity, and distribution in a meromictic lake of marine origin on the far northern coast of Ellesmere Island, in the Canadian High Arctic. We collected triplicate samples for double-stranded DNA (dsDNA) viromics from five depths that encompassed the major features of the lake, as determined by limnological profiling of the water column. Viral abundance and virus-to-prokaryote ratios were highest at greater depths, while bacterial and cyanobacterial counts were greatest in the surface waters. 
The viral communities from each zone of the lake defined by salinity, temperature, and dissolved oxygen concentrations were markedly distinct, suggesting that there was little exchange of viral types among lake strata. Ten viral assembled genomes were obtained from our libraries, and these also segregated with depth. This well-defined structure of viral communities was consistent with that of potential hosts. Viruses from the monimolimnion, a deep layer of ancient Arctic Ocean seawater, were more diverse and relatively abundant, with few similarities to available viral sequences. The Lake A viral communities also differed from published records from the Arctic Ocean and meromictic Ace Lake in Antarctica. This first characterization of viral diversity from this sentinel environment underscores the microbial richness and complexity of an ecosystem type that is increasingly exposed to major perturbations in the fast-changing Arctic. IMPORTANCE The Arctic is warming at an accelerating pace, and the rise in temperature has increasing impacts on the Arctic biome. Lakes are integrators of their surroundings and thus excellent sentinels of environmental change. Despite their importance in the regulation of key microbial processes, viruses remain largely uncharacterized in Arctic lacustrine environments. We sampled a highly stratified meromictic lake near the northern limit of the Canadian High Arctic, a region in rapid transition due to climate change. We found that the different layers of the lake harbored viral communities that were strikingly dissimilar and highly divergent from known viruses. Viruses were more abundant in the deepest part of the lake containing ancient Arctic Ocean seawater that was trapped during glacial retreat and were genomically unlike any viruses previously described. This research demonstrates the complexity and novelty of viral communities in an environment that is vulnerable to ongoing perturbation.
|
39
|
Liao X, Li M, Luo J, Zou Y, Wu FX, Pan Y, Luo F, Wang J. Improving de novo Assembly Based on Read Classification. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:177-188. [PMID: 30059317 DOI: 10.1109/tcbb.2018.2861380] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
Due to sequencing bias, sequencing errors, and repeats, genome assemblies usually contain misarrangements and gaps. When tackling these problems, current assemblers commonly treat the read libraries as a whole and apply the same strategy to all reads. However, if reads are divided into different categories and a different assembly strategy is used for each category, we expect to reduce the interference between these problems and to facilitate the production of satisfactory assemblies. In this paper, we present a new pipeline for genome assembly based on read classification (ARC). ARC classifies reads into three categories according to the frequencies of the k-mers they contain: (1) low depth reads, which contain a certain number of low-frequency k-mers and are often caused by sequencing errors or bias; (2) high depth reads, which contain a certain number of high-frequency k-mers and usually come from repetitive regions; and (3) normal depth reads, which are the remaining reads. After read classification, an existing assembler is used to assemble the different read categories separately, which helps to resolve problems in the genome assembly. ARC adopts loose assembly parameters for low depth reads, and strict assembly parameters for normal depth and high depth reads. We test ARC using five datasets. The experimental results show that assemblers combined with ARC can generate better assemblies in terms of NA50, NGA50, and genome fraction.
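The three-way read classification described in this abstract can be sketched in Python as follows. This is only an illustration of the general idea, not ARC's actual implementation: the cutoffs `low_cutoff` and `high_cutoff`, the `min_hits` parameter, and the rule that the low-depth test takes priority over the high-depth test are all assumptions introduced here.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def classify_read(read, counts, k, low_cutoff, high_cutoff, min_hits=1):
    """Label a read 'low', 'high' or 'normal' from its k-mer frequencies.

    low_cutoff, high_cutoff and min_hits are hypothetical thresholds;
    the abstract does not state how ARC chooses them.
    """
    kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
    low = sum(1 for km in kmers if counts[km] <= low_cutoff)
    high = sum(1 for km in kmers if counts[km] >= high_cutoff)
    if low >= min_hits:   # assumption: the low-depth test is applied first
        return "low"
    if high >= min_hits:
        return "high"
    return "normal"


# Toy data: a repeat-derived read, a normally covered read, and an erroneous read.
reads = ["ATATATAT"] * 3 + ["GGCCA"] * 3 + ["GGCTA"]
counts = count_kmers(reads, k=3)
labels = [classify_read(r, counts, 3, low_cutoff=1, high_cutoff=8) for r in reads]
```

Each label group could then be passed to an existing assembler with its own parameter set, which is the step the abstract describes after classification.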
|
40
|
Kovaka S, Zimin AV, Pertea GM, Razaghi R, Salzberg SL, Pertea M. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biol 2019; 20:278. [PMID: 31842956 PMCID: PMC6912988 DOI: 10.1186/s13059-019-1910-1] [Citation(s) in RCA: 716] [Impact Index Per Article: 143.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/10/2019] [Accepted: 12/02/2019] [Indexed: 11/13/2022] Open
Abstract
RNA sequencing using the latest single-molecule sequencing instruments produces reads that are thousands of nucleotides long. The ability to assemble these long reads can greatly improve the sensitivity of long-read analyses. Here we present StringTie2, a reference-guided transcriptome assembler that works with both short and long reads. StringTie2 includes new methods to handle the high error rate of long reads and offers the ability to work with full-length super-reads assembled from short reads, which further improves the quality of short-read assemblies. StringTie2 is more accurate and faster and uses less memory than all comparable short-read and long-read analysis tools.
Affiliation(s)
- Sam Kovaka
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218 USA
- Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD 21205 USA
| | - Aleksey V. Zimin
- Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD 21205 USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218 USA
| | - Geo M. Pertea
- Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD 21205 USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218 USA
| | - Roham Razaghi
- Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD 21205 USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218 USA
| | - Steven L. Salzberg
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218 USA
- Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD 21205 USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218 USA
- Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, MD 21205 USA
| | - Mihaela Pertea
- Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD 21205 USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218 USA
|
41
|
Athena: Automated Tuning of k-mer based Genomic Error Correction Algorithms using Language Models. Sci Rep 2019; 9:16157. [PMID: 31695060 PMCID: PMC6834855 DOI: 10.1038/s41598-019-52196-4] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/01/2019] [Accepted: 10/07/2019] [Indexed: 01/30/2023] Open
Abstract
The performance of most error-correction (EC) algorithms that operate on genomics reads is dependent on the proper choice of its configuration parameters, such as the value of k in k-mer based techniques. In this work, we target the problem of finding the best values of these configuration parameters to optimize error correction and consequently improve genome assembly. We perform this in an adaptive manner, adapted to different datasets and to EC tools, due to the observation that different configuration parameters are optimal for different datasets, i.e., from different platforms and species, and vary with the EC algorithm being applied. We use language modeling techniques from the Natural Language Processing (NLP) domain in our algorithmic suite, Athena, to automatically tune the performance-sensitive configuration parameters. Through the use of N-Gram and Recurrent Neural Network (RNN) language modeling, we validate the intuition that the EC performance can be computed quantitatively and efficiently using the “perplexity” metric, repurposed from NLP. After training the language model, we show that the perplexity metric calculated from a sample of the test (or production) data has a strong negative correlation with the quality of error correction of erroneous NGS reads. Therefore, we use the perplexity metric to guide a hill climbing-based search, converging toward the best configuration parameter value. Our approach is suitable for both de novo and comparative sequencing (resequencing), eliminating the need for a reference genome to serve as the ground truth. We find that Athena can automatically find the optimal value of k with a very high accuracy for 7 real datasets and using 3 different k-mer based EC algorithms, Lighter, Blue, and Racer. 
The inverse relation between the perplexity metric and alignment rate exists under all our tested conditions: for real and synthetic datasets, for all kinds of sequencing errors (insertion, deletion, and substitution), and for high and low error rates. The absolute value of that correlation is at least 73%. In our experiments, the best value of k found by Athena achieves an alignment rate within 0.53% of the oracle best value of k found through brute force searching (i.e., scanning through the entire range of k values). Athena’s selected value of k lies within the top-3 best k values using N-Gram models and the top-5 best k values using RNN models. With best parameter selection by Athena, the assembly quality (NG50) is improved by a geometric mean of 4.72X across the 7 real datasets.
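The two ingredients named in this abstract, an N-Gram perplexity score and a hill-climbing search over k, can be sketched in Python as below. This is a simplified illustration under stated assumptions, not Athena's code: the character-level model with add-one smoothing, the step size, and the `score` callback (imagined here as "perplexity of the reads after correction with a given k", lower being better) are all hypothetical choices.

```python
import math
from collections import Counter


def train_ngram(seqs, n):
    """Count n-grams and their (n-1)-length contexts over the sequences."""
    ngram_counts, context_counts = Counter(), Counter()
    for s in seqs:
        for i in range(len(s) - n + 1):
            gram = s[i:i + n]
            ngram_counts[gram] += 1
            context_counts[gram[:-1]] += 1
    return ngram_counts, context_counts


def perplexity(seqs, model, n, alphabet="ACGT"):
    """Perplexity of seqs under the model, with add-one smoothing (an assumption)."""
    ngram_counts, context_counts = model
    log_sum, total = 0.0, 0
    for s in seqs:
        for i in range(len(s) - n + 1):
            gram = s[i:i + n]
            p = (ngram_counts[gram] + 1) / (context_counts[gram[:-1]] + len(alphabet))
            log_sum += math.log(p)
            total += 1
    return math.exp(-log_sum / total) if total else float("inf")


def hill_climb_k(score, k_start, k_min, k_max, step=2):
    """Move to the better neighbouring k until no neighbour lowers the score."""
    k, best = k_start, score(k_start)
    while True:
        neighbours = [c for c in (k - step, k + step) if k_min <= c <= k_max]
        if not neighbours:
            return k
        cand = min(neighbours, key=score)
        cand_score = score(cand)
        if cand_score >= best:
            return k
        k, best = cand, cand_score


model = train_ngram(["ACGTACGTACGT"], n=2)
familiar = perplexity(["ACGTACGT"], model, n=2)
unfamiliar = perplexity(["AAAAAAAA"], model, n=2)
```

In a real setting the `score` function would run an error-correction tool with the candidate k and evaluate the corrected reads, which is far more expensive than the toy scoring used to exercise the search here.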
|
42
|
Chen J, Shang J, Wang J, Sun Y. A binning tool to reconstruct viral haplotypes from assembled contigs. BMC Bioinformatics 2019; 20:544. [PMID: 31684876 PMCID: PMC6829986 DOI: 10.1186/s12859-019-3138-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/01/2019] [Accepted: 10/09/2019] [Indexed: 11/21/2022] Open
Abstract
Background Infections by RNA viruses such as Influenza and HIV still pose a serious threat to human health despite extensive research on viral diseases. One challenge for producing effective prevention and treatment strategies is high intra-species genetic diversity. As different strains may have different biological properties, characterizing the genetic diversity is thus important to vaccine and drug design. Next-generation sequencing technology enables comprehensive characterization of both known and novel strains and has been widely adopted for sequencing viral populations. However, genome-scale reconstruction of haplotypes is still a challenging problem. In particular, haplotype assembly programs often produce contigs rather than full genomes. As a mutation in one gene can mask the phenotypic effects of a mutation at another locus, clustering these contigs into genome-scale haplotypes is still needed. Results We developed a contig binning tool, VirBin, which clusters contigs into different groups so that each group represents a haplotype. Commonly used features based on sequence composition and contig coverage cannot effectively distinguish viral haplotypes because of their high sequence similarity and heterogeneous sequencing coverage for RNA viruses. VirBin applied prototype-based clustering to cluster regions that are more likely to contain mutations specific to a haplotype. The tool was tested on multiple simulated sequencing datasets with different haplotype abundance distributions and contig sizes, and also on mock quasispecies sequencing data. The benchmark results with other contig binning tools demonstrated the superior sensitivity and precision of VirBin in contig binning for viral haplotype reconstruction. Conclusions In this work, we presented VirBin, a new contig binning tool for distinguishing contigs from different viral haplotypes with high sequence similarity. It competes favorably with other tools on viral contig binning. 
The source code is available at: https://github.com/chjiao/VirBin.
Affiliation(s)
- Jiao Chen
- Computer Science and Engineering, Michigan State University, East Lansing, 48824, USA
| | - Jiayu Shang
- Electrical Engineering, City University of Hong Kong, Hong Kong, China
| | - Jianrong Wang
- Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, 48824, USA
| | - Yanni Sun
- Electrical Engineering, City University of Hong Kong, Hong Kong, China.
|
43
|
Morisse P, Lecroq T, Lefebvre A. Hybrid correction of highly noisy long reads using a variable-order de Bruijn graph. Bioinformatics 2019; 34:4213-4222. [PMID: 29955770 DOI: 10.1093/bioinformatics/bty521] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/22/2017] [Accepted: 06/27/2018] [Indexed: 12/31/2022] Open
Abstract
Motivation The recent rise of long-read sequencing technologies such as Pacific Biosciences and Oxford Nanopore makes it possible to solve assembly problems for larger and more complex genomes than short-read technologies allowed. However, these long reads are very noisy, reaching an error rate of around 10-15% for Pacific Biosciences, and up to 30% for Oxford Nanopore. The error correction problem has been tackled by either self-correcting the long reads, or using complementary short reads in a hybrid approach. However, even though sequencing technologies promise to lower the error rate of the long reads below 10%, it is still higher in practice, and correcting such noisy long reads remains an issue. Results We present HG-CoLoR, a hybrid error correction method that focuses on a seed-and-extend approach based on the alignment of the short reads to the long reads, followed by the traversal of a variable-order de Bruijn graph, built from the short reads. Our experiments show that HG-CoLoR manages to efficiently correct highly noisy long reads that display an error rate as high as 44%. When compared to other state-of-the-art long read error correction methods, our experiments also show that HG-CoLoR provides the best trade-off between runtime and quality of the results, and is the only method able to efficiently scale to eukaryotic genomes. Availability and implementation HG-CoLoR is implemented in C++, supported on Linux platforms and freely available at https://github.com/morispi/HG-CoLoR. Supplementary information Supplementary data are available at Bioinformatics online.
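The idea of linking two seeds by traversing a de Bruijn graph built from short reads, and lowering the order when no path exists, can be sketched in Python as below. This is a deliberately simplified illustration, not HG-CoLoR's algorithm: the real tool adjusts the order during traversal, whereas this sketch simply retries the whole search at a smaller k, and all function names and parameters here are hypothetical.

```python
from collections import deque


def build_kmer_set(reads, k):
    """All k-mers present in the short reads (the nodes of the graph)."""
    return {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}


def successors(kmer, kmers):
    """k-mers that overlap the given k-mer by k-1 bases."""
    return [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in kmers]


def link_seeds(source, target, kmers, max_len):
    """Breadth-first search from source to target; returns the spelled sequence or None."""
    queue = deque([(source, source)])
    seen = {source}
    while queue:
        node, seq = queue.popleft()
        if node == target:
            return seq
        if len(seq) >= max_len:
            continue
        for nxt in successors(node, kmers):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, seq + nxt[-1]))
    return None


def link_variable_order(left, right, reads, k_max, k_min, max_len):
    """Try the highest order first and fall back to smaller k if no path is found."""
    for k in range(k_max, k_min - 1, -1):
        kmers = build_kmer_set(reads, k)
        src, dst = left[-k:], right[:k]
        if src in kmers and dst in kmers:
            path = link_seeds(src, dst, kmers, max_len)
            if path is not None:
                return k, path
    return None


# Seeds linked directly at order 4.
direct = link_variable_order("ACGT", "ATCG", ["ACGTTGCA", "TTGCATCG"], 4, 3, 20)
# Reads overlap by only two bases, so the search succeeds only at order 3.
fallback = link_variable_order("ACGT", "TTCA", ["ACGTT", "TTCAG"], 4, 3, 20)
```

A higher order gives a less tangled graph but requires longer overlaps between short reads, which is why falling back to a smaller order can bridge regions that the larger order leaves disconnected.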
|
44
|
Jin AH, Muttenthaler M, Dutertre S, Himaya SWA, Kaas Q, Craik DJ, Lewis RJ, Alewood PF. Conotoxins: Chemistry and Biology. Chem Rev 2019; 119:11510-11549. [PMID: 31633928 DOI: 10.1021/acs.chemrev.9b00207] [Citation(s) in RCA: 161] [Impact Index Per Article: 32.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/06/2023]
Abstract
The venom of the marine predatory cone snails (genus Conus) has evolved for prey capture and defense, providing the basis for survival and rapid diversification of the now estimated 750+ species. A typical Conus venom contains hundreds to thousands of bioactive peptides known as conotoxins. These mostly disulfide-rich and well-structured peptides act on a wide range of targets such as ion channels, G protein-coupled receptors, transporters, and enzymes. Conotoxins are of interest to neuroscientists as well as drug developers due to their exquisite potency and selectivity, not just against prey but also mammalian targets, thereby providing a rich source of molecular probes and therapeutic leads. The rise of integrated venomics has accelerated conotoxin discovery with now well over 10,000 conotoxin sequences published. However, their structural and pharmacological characterization lags considerably behind. In this review, we highlight the diversity of new conotoxins uncovered since 2014, their three-dimensional structures and folds, novel chemical approaches to their syntheses, and their value as pharmacological tools to unravel complex biology. Additionally, we discuss challenges and future directions for the field.
Affiliation(s)
- Ai-Hua Jin
- Institute for Molecular Bioscience, The University of Queensland, Brisbane Queensland 4072, Australia
| | - Markus Muttenthaler
- Institute for Molecular Bioscience, The University of Queensland, Brisbane Queensland 4072, Australia; Institute of Biological Chemistry, Faculty of Chemistry, University of Vienna, 1090 Vienna, Austria
| | - Sebastien Dutertre
- Département des Acides Amines, Peptides et Protéines, Unité Mixte de Recherche 5247, Université Montpellier 2-Centre Nationale de la Recherche Scientifique, Institut des Biomolécules Max Mousseron, Place Eugène Bataillon, 34095 Montpellier Cedex 5, France
| | - S W A Himaya
- Institute for Molecular Bioscience, The University of Queensland, Brisbane Queensland 4072, Australia
| | - Quentin Kaas
- Institute for Molecular Bioscience, The University of Queensland, Brisbane Queensland 4072, Australia
| | - David J Craik
- Institute for Molecular Bioscience, The University of Queensland, Brisbane Queensland 4072, Australia
| | - Richard J Lewis
- Institute for Molecular Bioscience, The University of Queensland, Brisbane Queensland 4072, Australia
| | - Paul F Alewood
- Institute for Molecular Bioscience, The University of Queensland, Brisbane Queensland 4072, Australia
| |
|
45
|
Chen J, Zhao Y, Sun Y. De novo haplotype reconstruction in viral quasispecies using paired-end read guided path finding. Bioinformatics 2019; 34:2927-2935. [PMID: 29617936 DOI: 10.1093/bioinformatics/bty202] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Received: 08/28/2017] [Accepted: 04/02/2018] [Indexed: 12/29/2022] Open
Abstract
Motivation RNA virus populations contain different but genetically related strains, all infecting an individual host. Reconstruction of the viral haplotypes is a fundamental step to characterize the virus population, predict their viral phenotypes and finally provide important information for clinical treatment and prevention. Advances in next-generation sequencing technologies open up new opportunities to assemble full-length haplotypes. However, error-prone short reads, high similarities between related strains, and an unknown number of haplotypes pose computational challenges for reference-free haplotype reconstruction. There is still much room to improve the performance of existing haplotype assembly tools. Results In this work, we developed a de novo haplotype reconstruction tool named PEHaplo, which employs paired-end reads to distinguish highly similar strains for viral quasispecies data. It was applied to both simulated and real quasispecies data, and the results were benchmarked against several recently published de novo haplotype reconstruction tools. The comparison shows that PEHaplo outperforms the benchmarked tools in a comprehensive set of metrics. Availability and implementation The source code and the documentation of PEHaplo are available at https://github.com/chjiao/PEHaplo. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jiao Chen
- Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA
| | - Yingchao Zhao
- School of Computing and Information Sciences, Caritas Institute of Higher Education, Hong Kong, China
| | - Yanni Sun
- Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA
| |
|
46
|
Chen J, Huang J, Sun Y. TAR-VIR: a pipeline for TARgeted VIRal strain reconstruction from metagenomic data. BMC Bioinformatics 2019; 20:305. [PMID: 31164077 PMCID: PMC6549370 DOI: 10.1186/s12859-019-2878-2] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 11/10/2018] [Accepted: 05/07/2019] [Indexed: 12/15/2022] Open
Abstract
Background Strain-level RNA virus characterization is essential for developing prevention and treatment strategies. Viral metagenomic data, which can contain sequences of both known and novel viruses, provide new opportunities for characterizing RNA viruses. Although there are a number of pipelines for analyzing viruses in metagenomic data, they have different limitations. First, viruses that lack closely related reference genomes cannot be detected with high sensitivity. Second, strain-level analysis is usually missing. Results In this study, we developed a hybrid pipeline named TAR-VIR that reconstructs viral strains without relying on complete or high-quality reference genomes. It is optimized for identifying RNA viruses from metagenomic data by combining an effective read classification method and our in-house strain-level de novo assembly tool. TAR-VIR was tested on both simulated and real viral metagenomic data sets. The results demonstrated that TAR-VIR competes favorably with other tested tools. Conclusion TAR-VIR can be used standalone for viral strain reconstruction from metagenomic data. Alternatively, its read recruiting stage can be used with other de novo assembly tools for superior viral functional and taxonomic analyses. The source code and the documentation of TAR-VIR are available at https://github.com/chjiao/TAR-VIR. Electronic supplementary material The online version of this article (10.1186/s12859-019-2878-2) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Jiao Chen
- Computer Science and Engineering, Michigan State University, East Lansing, 48824, USA
| | - Jiating Huang
- Institute of Clinical Pharmacology, Guangzhou University of Chinese Medicine, Guangzhou, 510006, China
| | - Yanni Sun
- Electronic Engineering, City University of Hong Kong, Hong Kong, China.
| |
|
47
|
|
48
|
Heydari M, Miclotte G, Van de Peer Y, Fostier J. Illumina error correction near highly repetitive DNA regions improves de novo genome assembly. BMC Bioinformatics 2019; 20:298. [PMID: 31159722 PMCID: PMC6545690 DOI: 10.1186/s12859-019-2906-2] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Received: 01/15/2019] [Accepted: 05/17/2019] [Indexed: 11/10/2022] Open
Abstract
Background Several standalone error correction tools have been proposed to correct sequencing errors in Illumina data in order to facilitate de novo genome assembly. However, in a recent survey, we showed that state-of-the-art assemblers often did not benefit from this pre-correction step. We found that many error correction tools introduce new errors in reads that overlap highly repetitive DNA regions such as low-complexity patterns or short homopolymers, ultimately leading to a more fragmented assembly. Results We propose BrownieCorrector, an error correction tool for Illumina sequencing data that focuses on the correction of only those reads that overlap short DNA patterns that are highly repetitive in the genome. BrownieCorrector extracts all reads that contain such a pattern and clusters them into different groups using a community detection algorithm that takes into account both the sequence similarity between overlapping reads and their respective paired-end reads. Each cluster holds reads that originate from the same genomic region and hence each cluster can be corrected individually, thus providing a consistent correction for all reads within that cluster. Conclusions BrownieCorrector is benchmarked using six real Illumina datasets for different eukaryotic genomes. The prior use of BrownieCorrector improves assembly results over the use of uncorrected reads in all cases. In comparison with other error correction tools, BrownieCorrector leads to the best assembly results in most cases even though less than 2% of the reads within a dataset are corrected. Additionally, we investigate the impact of error correction on hybrid assembly where the corrected Illumina reads are supplemented with PacBio data. Our results confirm that BrownieCorrector improves the quality of hybrid genome assembly as well. BrownieCorrector is written in standard C++11 and released under GPL license. 
BrownieCorrector relies on multithreading to take advantage of multi-core/multi-CPU systems. The source code is available at https://github.com/biointec/browniecorrector. Electronic supplementary material The online version of this article (10.1186/s12859-019-2906-2) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Mahdi Heydari
- Department of Information Technology, Ghent University-imec, IDLab, Ghent, B-9052, Belgium; Bioinformatics Institute Ghent, Ghent, B-9052, Belgium
| | - Giles Miclotte
- Department of Information Technology, Ghent University-imec, IDLab, Ghent, B-9052, Belgium; Bioinformatics Institute Ghent, Ghent, B-9052, Belgium
| | - Yves Van de Peer
- Bioinformatics Institute Ghent, Ghent, B-9052, Belgium; Center for Plant Systems Biology, VIB, Ghent, B-9052, Belgium; Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, B-9052, Belgium; Department of Genetics, Genome Research Institute, University of Pretoria, Pretoria, South Africa
| | - Jan Fostier
- Department of Information Technology, Ghent University-imec, IDLab, Ghent, B-9052, Belgium; Bioinformatics Institute Ghent, Ghent, B-9052, Belgium
| |
|
49
|
Transcriptomic-Proteomic Correlation in the Predation-Evoked Venom of the Cone Snail, Conus imperialis. Mar Drugs 2019; 17:md17030177. [PMID: 30893765 PMCID: PMC6471084 DOI: 10.3390/md17030177] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 02/05/2019] [Revised: 03/12/2019] [Accepted: 03/14/2019] [Indexed: 12/23/2022] Open
Abstract
Individual variation in animal venom has been linked to geographical location, feeding habit, season, size, and gender. Uniquely, cone snails possess the remarkable ability to change venom composition in response to predatory or defensive stimuli. To date, correlations between the venom gland transcriptome and proteome within and between individual cone snails have not been reported. In this study, we use 454 pyrosequencing and mass spectrometry to decipher the transcriptomes and proteomes of the venom gland and corresponding predation-evoked venom of two specimens of Conus imperialis. Transcriptomic analyses revealed 17 conotoxin gene superfamilies common to both animals, including 5 novel superfamilies and two novel cysteine frameworks. While highly expressed transcripts were common to both specimens, variation of moderately and weakly expressed precursor sequences was surprisingly diverse, with one specimen expressing two unique gene superfamilies and consistently producing more paralogs within each conotoxin gene superfamily. Using a quantitative labelling method, conotoxin variability was compared quantitatively, with highly expressed peptides showing a strong correlation between transcription and translation, whereas peptides expressed at lower levels showed a poor correlation. These results suggest that major transcripts are subject to stabilizing selection, while minor transcripts are subject to diversifying selection.
|
50
|
Wang W, Schalamun M, Morales-Suarez A, Kainer D, Schwessinger B, Lanfear R. Assembly of chloroplast genomes with long- and short-read data: a comparison of approaches using Eucalyptus pauciflora as a test case. BMC Genomics 2018; 19:977. [PMID: 30594129 PMCID: PMC6311037 DOI: 10.1186/s12864-018-5348-8] [Citation(s) in RCA: 34] [Impact Index Per Article: 5.7] [Received: 05/15/2018] [Accepted: 12/03/2018] [Indexed: 12/23/2022] Open
Abstract
BACKGROUND Chloroplasts are organelles that conduct photosynthesis in plant and algal cells. The information contained in the chloroplast genome is widely used in agriculture and studies of evolution and ecology. Correctly assembling chloroplast genomes can be challenging because the chloroplast genome contains a pair of long inverted repeats (10-30 kb). Typically, it is simply assumed that the gross structure of the chloroplast genome matches the most commonly observed structure of two single-copy regions separated by a pair of inverted repeats. The advent of long-read sequencing technologies should remove the need to make this assumption by providing sufficient information to completely span the inverted repeat regions. Yet, long-reads tend to have higher error rates than short-reads, and relatively little is known about the best way to combine long- and short-reads to obtain the most accurate chloroplast genome assemblies. Using Eucalyptus pauciflora, the snow gum, as a test case, we evaluated the effect of multiple parameters, such as different coverage of long- (Oxford Nanopore) and short- (Illumina) reads, different long-read lengths, and different assembly pipelines, with a view to determining the most accurate and efficient approach to chloroplast genome assembly. RESULTS Hybrid assemblies combining at least 20x coverage of both long-reads and short-reads generated a single contig spanning the entire chloroplast genome with few or no detectable errors. Short-read-only assemblies generated three contigs (the long single copy, short single copy and inverted repeat regions) of the chloroplast genome. These contigs contained few single-base errors but tended to exclude several bases at the beginning or end of each contig. Long-read-only assemblies tended to create multiple contigs with a much higher single-base error rate. The chloroplast genome of Eucalyptus pauciflora is 159,942 bp and contains 131 genes of known function.
CONCLUSIONS Our results suggest that very accurate assemblies of chloroplast genomes can be achieved using a combination of at least 20x coverage of long- and short-reads respectively, provided that the long-reads contain at least ~5x coverage of reads longer than the inverted repeat region. We show that further increases in coverage give little or no improvement in accuracy, and that hybrid assemblies are more accurate than long-read-only or short-read-only assemblies.
Affiliation(s)
- Weiwen Wang
- Research School of Biology, Australian National University, Canberra, Australia
| | - Miriam Schalamun
- Research School of Biology, Australian National University, Canberra, Australia; Institute of Applied Genetics and Cell Biology, University of Natural Resources and Life Sciences, Vienna, Austria
| | | | - David Kainer
- Research School of Biology, Australian National University, Canberra, Australia
| | | | - Robert Lanfear
- Research School of Biology, Australian National University, Canberra, Australia
| |
|