1
Chen Y, Huang JH, Sun Y, Zhang Y, Li Y, Xu X. Haplotype-resolved assembly of diploid and polyploid genomes using quantum computing. Cell Reports Methods 2024; 4:100754. [PMID: 38614089 PMCID: PMC11133727 DOI: 10.1016/j.crmeth.2024.100754] [Received: 10/08/2023] [Revised: 01/03/2024] [Accepted: 03/20/2024] [Indexed: 04/15/2024]
Abstract
Precision medicine's emphasis on individual genetic variants highlights the importance of haplotype-resolved assembly, a computational challenge in bioinformatics given its combinatorial nature. While classical algorithms have made strides in addressing this issue, the potential of quantum computing remains largely untapped. Here, we present the vehicle routing problem (VRP) assembler: an approach that transforms this task into a vehicle routing problem, an optimization formulation solvable on a quantum computer. We demonstrate its potential and feasibility through a proof of concept on short synthetic diploid and triploid genomes using a D-Wave quantum annealer. To tackle larger-scale assembly problems, we integrate the VRP assembler with Google's OR-Tools, achieving a haplotype-resolved local assembly across the human major histocompatibility complex (MHC) region. Our results show encouraging performance compared to Hifiasm with phasing accuracy approaching the theoretical limit, underscoring the promising future of quantum computing in bioinformatics.
Affiliation(s)
- Yibo Chen
- BGI Research, Shenzhen 518083, China
- Yuhui Sun
- BGI Research, Shenzhen 518083, China
- Yong Zhang
- BGI Research, Wuhan 430047, China; Guangdong Bigdata Engineering Technology Research Center for Life Sciences, BGI Research, Shenzhen 518083, China
- Yuxiang Li
- BGI Research, Wuhan 430047, China; Guangdong Bigdata Engineering Technology Research Center for Life Sciences, BGI Research, Shenzhen 518083, China
- Xun Xu
- BGI Research, Shenzhen 518083, China; BGI Research, Wuhan 430047, China
2
Yu W, Luo H, Yang J, Zhang S, Jiang H, Zhao X, Hui X, Sun D, Li L, Wei XQ, Lonardi S, Pan W. Comprehensive assessment of 11 de novo HiFi assemblers on complex eukaryotic genomes and metagenomes. Genome Res 2024; 34:326-340. [PMID: 38428994 PMCID: PMC10984382 DOI: 10.1101/gr.278232.123] [Received: 06/29/2023] [Accepted: 01/23/2024] [Indexed: 03/03/2024]
Abstract
Pacific Biosciences (PacBio) HiFi sequencing technology generates long reads (>10 kbp) with very high accuracy (<0.01% sequencing error). Although several de novo assembly tools are available for HiFi reads, there are no comprehensive studies on the evaluation of these assemblers. We evaluated the performance of 11 de novo HiFi assemblers on (1) real data for three eukaryotic genomes; (2) 34 synthetic data sets with different ploidy, sequencing coverage levels, heterozygosity rates, and sequencing error rates; (3) one real metagenomic data set; and (4) five synthetic metagenomic data sets with different composition abundance and heterozygosity rates. The 11 assemblers were evaluated using quality assessment tool (QUAST) and benchmarking universal single-copy ortholog (BUSCO). We also used several additional criteria, namely, completion rate, single-copy completion rate, duplicated completion rate, average proportion of largest category, average distance difference, quality value, run-time, and memory utilization. Results show that hifiasm and hifiasm-meta should be the first choice for assembling eukaryotic genomes and metagenomes with HiFi data. We performed a comprehensive benchmarking study of commonly used assemblers on complex eukaryotic genomes and metagenomes. Our study will help the research community to choose the most appropriate assembler for their data and identify possible improvements in assembly algorithms.
Affiliation(s)
- Wenjuan Yu
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
- Haohui Luo
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
- Jinbao Yang
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
- College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
- Shengchen Zhang
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
- College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
- Heling Jiang
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
- Xianjia Zhao
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
- School of Agricultural Sciences, Zhengzhou University, Zhengzhou, Henan 450001, China
- Xingqi Hui
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
- School of Agricultural Sciences, Zhengzhou University, Zhengzhou, Henan 450001, China
- Da Sun
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
- Liang Li
- Fruit Research Institute, Fujian Academy of Agricultural Sciences, Fuzhou, Fujian 350002, China
- Xiu-Qing Wei
- Fruit Research Institute, Fujian Academy of Agricultural Sciences, Fuzhou, Fujian 350002, China;
- Stefano Lonardi
- Department of Computer Science and Engineering, University of California, Riverside, California 92521, USA;
- Weihua Pan
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China;
3
Qiu Z, Yuan L, Lian CA, Lin B, Chen J, Mu R, Qiao X, Zhang L, Xu Z, Fan L, Zhang Y, Wang S, Li J, Cao H, Li B, Chen B, Song C, Liu Y, Shi L, Tian Y, Ni J, Zhang T, Zhou J, Zhuang WQ, Yu K. BASALT refines binning from metagenomic data and increases resolution of genome-resolved metagenomic analysis. Nat Commun 2024; 15:2179. [PMID: 38467684 PMCID: PMC10928208 DOI: 10.1038/s41467-024-46539-7] [Received: 03/10/2023] [Accepted: 03/01/2024] [Indexed: 03/13/2024]
Abstract
Metagenomic binning is an essential technique for genome-resolved characterization of uncultured microorganisms in various ecosystems but is hampered by the low efficiency of binning tools in adequately recovering metagenome-assembled genomes (MAGs). Here, we introduce BASALT (Binning Across a Series of Assemblies Toolkit) for binning and refinement of short- and long-read sequencing data. BASALT employs multiple binners with multiple thresholds to produce initial bins, then utilizes neural networks to identify core sequences to remove redundant bins and refine non-redundant bins. Using the same assemblies generated from Critical Assessment of Metagenome Interpretation (CAMI) datasets, BASALT produces up to twice as many MAGs as VAMB, DASTool, or metaWRAP. Processing assemblies from a lake sediment dataset, BASALT produces ~30% more MAGs than metaWRAP, including 21 unique class-level prokaryotic lineages. Functional annotations reveal that BASALT can retrieve 47.6% more non-redundant open reading frames than metaWRAP. These results highlight BASALT's robust handling of metagenomic sequencing data.
Affiliation(s)
- Zhiguang Qiu
- Eco-environment and Resource Efficiency Research Laboratory, School of Environment and Energy, Shenzhen Graduate School, Peking University, Shenzhen, China
- AI for Science (AI4S)-Preferred Program, Peking University, Shenzhen, China
- Li Yuan
- AI for Science (AI4S)-Preferred Program, Peking University, Shenzhen, China
- School of Electronic and Computer Engineering, Peking University, Shenzhen, China
- Peng Cheng Laboratory, Shenzhen, China
- Chun-Ang Lian
- Eco-environment and Resource Efficiency Research Laboratory, School of Environment and Energy, Shenzhen Graduate School, Peking University, Shenzhen, China
- AI for Science (AI4S)-Preferred Program, Peking University, Shenzhen, China
- Bin Lin
- School of Electronic and Computer Engineering, Peking University, Shenzhen, China
- Jie Chen
- AI for Science (AI4S)-Preferred Program, Peking University, Shenzhen, China
- School of Electronic and Computer Engineering, Peking University, Shenzhen, China
- Peng Cheng Laboratory, Shenzhen, China
- Rong Mu
- Eco-environment and Resource Efficiency Research Laboratory, School of Environment and Energy, Shenzhen Graduate School, Peking University, Shenzhen, China
- Xuejiao Qiao
- Eco-environment and Resource Efficiency Research Laboratory, School of Environment and Energy, Shenzhen Graduate School, Peking University, Shenzhen, China
- Liyu Zhang
- Eco-environment and Resource Efficiency Research Laboratory, School of Environment and Energy, Shenzhen Graduate School, Peking University, Shenzhen, China
- Zheng Xu
- Southern University of Sciences and Technology Yantian Hospital, Shenzhen, China
- Institute of Biomedicine and Biotechnology, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong, China
- Lu Fan
- Department of Ocean Science and Engineering, Southern University of Science and Technology (SUSTech), Shenzhen, China
- Yunzeng Zhang
- Joint International Research Laboratory of Agriculture and Agri-Product Safety, the Ministry of Education of China, Yangzhou University, Yangzhou, China
- Shanquan Wang
- Environmental Microbiomics Research Center, School of Environmental Science and Engineering, Sun Yat-Sen University, Guangzhou, China
- Junyi Li
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, China
- Huiluo Cao
- Department of Microbiology, University of Hong Kong, Hong Kong, China
- Bing Li
- Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
- Baowei Chen
- Guangdong Provincial Key Laboratory of Marine Resources and Coastal Engineering, School of Marine Sciences, Sun Yat-sen University, Zhuhai, China
- Chi Song
- Institute of Herbgenomics, Chengdu University of Traditional Chinese Medicine, Chengdu, China
- Wuhan Benagen Technology Co., Ltd, Wuhan, China
- Yongxin Liu
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, China
- Lili Shi
- AI for Science (AI4S)-Preferred Program, Peking University, Shenzhen, China
- State Key Laboratory of Chemical Oncogenomics, School of Chemical Biology and Biotechnology, Peking University Shenzhen Graduate School, Shenzhen, China
- Yonghong Tian
- AI for Science (AI4S)-Preferred Program, Peking University, Shenzhen, China
- School of Electronic and Computer Engineering, Peking University, Shenzhen, China
- Peng Cheng Laboratory, Shenzhen, China
- Jinren Ni
- Eco-environment and Resource Efficiency Research Laboratory, School of Environment and Energy, Shenzhen Graduate School, Peking University, Shenzhen, China
- College of Environmental Sciences and Engineering, Key Laboratory of Water and Sediment Sciences, Ministry of Education, Peking University, Beijing, China
- Tong Zhang
- Department of Civil Engineering, University of Hong Kong, Hong Kong, China
- Jizhong Zhou
- Institute for Environmental Genomics, University of Oklahoma, Norman, OK, USA
- Wei-Qin Zhuang
- Department of Civil and Environmental Engineering, Faculty of Engineering, University of Auckland, Auckland, New Zealand
- Ke Yu
- Eco-environment and Resource Efficiency Research Laboratory, School of Environment and Energy, Shenzhen Graduate School, Peking University, Shenzhen, China.
- AI for Science (AI4S)-Preferred Program, Peking University, Shenzhen, China.
4
Pozo G, Albuja-Quintana M, Larreátegui L, Gutiérrez B, Fuentes N, Alfonso-Cortés F, Torres MDL. First whole-genome sequence and assembly of the Ecuadorian brown-headed spider monkey (Ateles fusciceps fusciceps), a critically endangered species, using Oxford Nanopore Technologies. G3 (Bethesda, Md.) 2024; 14:jkae014. [PMID: 38244218 PMCID: PMC10917520 DOI: 10.1093/g3journal/jkae014] [Received: 12/11/2023] [Revised: 12/11/2023] [Accepted: 01/05/2024] [Indexed: 01/22/2024]
Abstract
The Ecuadorian brown-headed spider monkey (Ateles fusciceps fusciceps) is currently considered one of the most endangered primates in the world and is classified as critically endangered by the International Union for Conservation of Nature (IUCN). It faces multiple threats, the most significant one being habitat loss due to deforestation in western Ecuador. Genomic tools are key to the management of endangered species, but this requires a reference genome, which until now was unavailable for A. f. fusciceps. The present study reports the first whole-genome sequence and assembly of A. f. fusciceps generated using Oxford Nanopore long reads. DNA was extracted from a subadult male, and libraries were prepared for sequencing following the Ligation Sequencing Kit SQK-LSK112 workflow. Sequencing was performed using a MinION Mk1C sequencer. The sequencing reads were processed to generate a genome assembly. Two different assemblers were used to obtain draft genomes using raw reads, of which the Flye assembly was found to be superior. The final assembly has a total length of 2.63 Gb and contains 3,861 contigs, with an N50 of 7,560,531 bp. The assembly was analyzed for annotation completeness based on primate ortholog prediction using a high-resolution database, and was found to be 84.3% complete, with a low number of duplicated genes indicating a precise assembly. The annotation of the assembly predicted 31,417 protein-coding genes, comparable with other mammal assemblies. A reference genome for this critically endangered species will allow researchers to gain insight into the genetics of its populations and thus aid conservation and management efforts of this vulnerable species.
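The contig N50 quoted in this abstract is a standard contiguity statistic: the length L such that contigs of length L or longer together cover at least half of the total assembly. A minimal Python sketch of that definition follows; the function name `n50` and the toy contig lengths are illustrative assumptions, not data or code from the study.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths.

    N50 is the length L such that contigs of length >= L together
    account for at least half of the total assembly length.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    # Walk contigs from longest to shortest, accumulating length.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical toy assembly (total length 300), not the spider monkey data.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
toy_n50 = n50(toy_lengths)  # 80 + 70 = 150 reaches half of 300, so N50 is 70
```

Because N50 depends only on the length distribution, it says nothing about correctness; that is why the abstract pairs it with an ortholog-based completeness estimate.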
Affiliation(s)
- Gabriela Pozo
- Laboratorio de Biotecnología Vegetal, Colegio de Ciencias Biológicas y Ambientales, Universidad San Francisco de Quito (USFQ), Quito 170901, Ecuador
- Instituto Nacional de Biodiversidad (INABIO), Quito 170135, Ecuador
- Martina Albuja-Quintana
- Laboratorio de Biotecnología Vegetal, Colegio de Ciencias Biológicas y Ambientales, Universidad San Francisco de Quito (USFQ), Quito 170901, Ecuador
- Lizbeth Larreátegui
- Laboratorio de Biotecnología Vegetal, Colegio de Ciencias Biológicas y Ambientales, Universidad San Francisco de Quito (USFQ), Quito 170901, Ecuador
- Bernardo Gutiérrez
- Laboratorio de Biotecnología Vegetal, Colegio de Ciencias Biológicas y Ambientales, Universidad San Francisco de Quito (USFQ), Quito 170901, Ecuador
- Department of Biology, University of Oxford, Oxford OX1 3SZ, UK
- Nathalia Fuentes
- Proyecto Washu/Fundación Naturaleza y Arte, Quito 170521, Ecuador
- Maria de Lourdes Torres
- Laboratorio de Biotecnología Vegetal, Colegio de Ciencias Biológicas y Ambientales, Universidad San Francisco de Quito (USFQ), Quito 170901, Ecuador
- Instituto Nacional de Biodiversidad (INABIO), Quito 170135, Ecuador
5
Goldberg JK, Allan CW, Copetti D, Matzkin LM, Bronstein J. A pooled-sample draft genome assembly provides insights into host plant-specific transcriptional responses of a Solanaceae-specializing pest, Tupiocoris notatus (Hemiptera: Miridae). Ecol Evol 2024; 14:e10979. [PMID: 38476697 PMCID: PMC10928254 DOI: 10.1002/ece3.10979] [Received: 09/25/2023] [Revised: 12/12/2023] [Accepted: 12/18/2023] [Indexed: 03/14/2024]
Abstract
The assembly of genomes from pooled samples of genetically heterogeneous conspecifics remains challenging. In this study, we show that high-quality genome assemblies can be produced from samples of multiple wild-caught individuals. We sequenced DNA extracted from a pooled sample of conspecific herbivorous insects (Hemiptera: Miridae: Tupiocoris notatus) acquired from a greenhouse infestation in Tucson, Arizona (in the range of 30-100 individuals; 0.5 mL tissue by volume) using PacBio highly accurate long reads (HiFi). The initial assembly contained multiple haplotigs (>85% BUSCOs duplicated), but duplicate contigs could be easily purged to reveal a highly complete assembly (95.6% BUSCO, 4.4% duplicated) that is highly contiguous by short-read assembly standards (N50 = 675 kb; largest contig = 4.3 Mb). We then used our assembly as the basis for a genome-guided differential expression study of host plant-specific transcriptional responses. We found thousands of genes (N = 4982) to be differentially expressed between our new data from individuals feeding on Datura wrightii (Solanaceae) and existing RNA-seq data from Nicotiana attenuata (Solanaceae)-fed individuals. We identified many of these genes as previously documented detoxification genes such as glutathione-S-transferases, cytochrome P450s, and UDP-glucosyltransferases. Together our results show that long-read sequencing of pooled samples can provide a cost-effective genome assembly option for small insects and can provide insights into the genetic mechanisms underlying interactions between plants and herbivorous pests.
Affiliation(s)
- Jay K. Goldberg
- Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, Arizona, USA
- Department of Cellular and Developmental Biology, John Innes Centre, Norwich, Norfolk, UK
- Carson W. Allan
- Department of Entomology, University of Arizona, Tucson, Arizona, USA
- Dario Copetti
- Arizona Genomics Institute, University of Arizona, Tucson, Arizona, USA
- BIO5 Institute, University of Arizona, Tucson, Arizona, USA
- Luciano M. Matzkin
- Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, Arizona, USA
- Department of Entomology, University of Arizona, Tucson, Arizona, USA
- BIO5 Institute, University of Arizona, Tucson, Arizona, USA
- Judith Bronstein
- Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, Arizona, USA
- Department of Entomology, University of Arizona, Tucson, Arizona, USA
- BIO5 Institute, University of Arizona, Tucson, Arizona, USA
6
Zhou Y, Wang Y, Prangishvili D, Krupovic M. Exploring the Archaeal Virosphere by Metagenomics. Methods Mol Biol 2024; 2732:1-22. [PMID: 38060114 DOI: 10.1007/978-1-0716-3515-5_1] [Indexed: 12/08/2023]
Abstract
During the past decade, environmental research has demonstrated that archaea are abundant and widespread in nature and play important ecological roles at a global scale. Currently, however, the majority of archaeal lineages cannot be cultivated under laboratory conditions and are known exclusively or nearly exclusively through metagenomics. A similar trend extends to the archaeal virosphere, where isolated representatives are available for a handful of model archaeal virus-host systems. Viral metagenomics provides an alternative way to circumvent the limitations of culture-based virus discovery and offers insight into the diversity, distribution, and environmental impact of uncultured archaeal viruses. Presently, metagenomics approaches have been successfully applied to explore the viromes associated with various lineages of extremophilic and mesophilic archaea, including Asgard archaea (Asgardarchaeota), ANME-1 archaea (Methanophagales), thaumarchaea (Nitrososphaeria), altiarchaea (Altiarchaeota), and marine group II archaea (Poseidoniales). Here, we provide an overview of methods widely used in archaeal virus metagenomics, covering metavirome preparation, genome annotation, phylogenetic and phylogenomic analyses, and archaeal host assignment. We hope that this summary will contribute to further exploration and characterization of the enigmatic archaeal virome lurking in diverse environments.
Affiliation(s)
- Yifan Zhou
- Institut Pasteur, Université Paris Cité, Archaeal Virology Unit, Paris, France
- Sorbonne Université, Collège Doctoral, Paris, France
- Yongjie Wang
- College of Food Science and Technology, Shanghai Ocean University, Shanghai, China
- Laboratory for Marine Biology and Biotechnology, Qingdao National Laboratory for Marine Science and Technology, Qingdao, China
- Laboratory of Quality and Safety Risk Assessment for Aquatic Products on Storage and Preservation (Shanghai), Ministry of Agriculture, Shanghai, China
- David Prangishvili
- Institut Pasteur, Université Paris Cité, Archaeal Virology Unit, Paris, France
- Ivane Javakhishvili Tbilisi State University, Tbilisi, Georgia
- Mart Krupovic
- Institut Pasteur, Université Paris Cité, Archaeal Virology Unit, Paris, France.
7
Kang X, Xu J, Luo X, Schönhuth A. Hybrid-hybrid correction of errors in long reads with HERO. Genome Biol 2023; 24:275. [PMID: 38041098 PMCID: PMC10690975 DOI: 10.1186/s13059-023-03112-7] [Received: 04/03/2023] [Accepted: 11/16/2023] [Indexed: 12/03/2023]
Abstract
Although generally superior, hybrid approaches for correcting errors in third-generation sequencing (TGS) reads, using next-generation sequencing (NGS) reads, mistake haplotype-specific variants for errors in polyploid and mixed samples. We suggest HERO, as the first "hybrid-hybrid" approach, to make use of both de Bruijn graphs and overlap graphs for optimal catering to the particular strengths of NGS and TGS reads. Extensive benchmarking experiments demonstrate that HERO improves indel and mismatch error rates by on average 65% (27–95%) and 20% (4–61%). Using HERO prior to genome assembly significantly improves the assemblies in the majority of the relevant categories.
Affiliation(s)
- Xiongbin Kang
- College of Biology, Hunan University, Changsha, China
- Genome Data Science, Faculty of Technology, Bielefeld University, Bielefeld, Germany
- Jialu Xu
- College of Biology, Hunan University, Changsha, China
- Xiao Luo
- College of Biology, Hunan University, Changsha, China.
- Alexander Schönhuth
- Genome Data Science, Faculty of Technology, Bielefeld University, Bielefeld, Germany.
8
Zhao J, Chen H, Li G, Jumaturti MA, Yao X, Hu Y. Phylogenetics Study to Compare Chloroplast Genomes in Four Magnoliaceae Species. Curr Issues Mol Biol 2023; 45:9234-9251. [PMID: 37998755 PMCID: PMC10670740 DOI: 10.3390/cimb45110578] [Received: 10/11/2023] [Revised: 11/07/2023] [Accepted: 11/12/2023] [Indexed: 11/25/2023]
Abstract
Magnoliaceae, a family of perennial woody plants, contains several endangered species whose taxonomic status remains ambiguous. The study of chloroplast genome information can help in the protection of Magnoliaceae plants and confirmation of their phylogenetic relationships. In this study, the chloroplast genomes were sequenced, assembled, and annotated in Woonyoungia septentrionalis and three Michelia species (Michelia champaca, Michelia figo, and Michelia macclurei). Comparative analyses of genomic characteristics, repetitive sequences, and sequence differences were performed among the four Magnoliaceae plants, and phylogenetic relationships were constructed with twenty different magnolia species. The length of the chloroplast genomes varied among the four studied species, ranging from 159,838 bp (Woonyoungia septentrionalis) to 160,127 bp (Michelia macclurei). Four distinct hotspot regions were identified based on nucleotide polymorphism analysis: petA-psbJ, psbJ-psbE, ndhD-ndhE, and rps15-ycf1. These gene fragments may be developed and utilized as new molecular marker primers. Using Liriodendron tulipifera and Liriodendron chinense as outgroup references, a phylogenetic tree of the four Magnoliaceae species and eighteen other Magnoliaceae species was constructed from shared coding sequences (CDS). Results showed that the endangered species, W. septentrionalis, is relatively genetically distinct from the other three species, indicating the different phylogenetic processes among Magnoliaceae plants. Therefore, further genetic information is required to determine the relationships within Magnoliaceae. Overall, complete chloroplast genome sequences for four Magnoliaceae species reported in this paper have shed more light on phylogenetic relationships within the botanical group.
Affiliation(s)
- Jianyun Zhao
- Key Laboratory of National Forestry and Grassland Administration on Cultivation of Fast-Growing Timber in Central South China, College of Forestry, Guangxi University, Nanning 530004, China
- Guangxi Key Laboratory of Forest Ecology and Conservation, College of Forestry, Guangxi University, Nanning 530004, China
- Hu Chen
- Guangxi Key Laboratory of Superior Timber Trees Resource Cultivation, Guangxi Forestry Research Institute, Nanning 530002, China
- Gaiping Li
- Key Laboratory of National Forestry and Grassland Administration on Cultivation of Fast-Growing Timber in Central South China, College of Forestry, Guangxi University, Nanning 530004, China
- Guangxi Key Laboratory of Forest Ecology and Conservation, College of Forestry, Guangxi University, Nanning 530004, China
- Maimaiti Aisha Jumaturti
- Key Laboratory of National Forestry and Grassland Administration on Cultivation of Fast-Growing Timber in Central South China, College of Forestry, Guangxi University, Nanning 530004, China
- Guangxi Key Laboratory of Forest Ecology and Conservation, College of Forestry, Guangxi University, Nanning 530004, China
- Xiaomin Yao
- Key Laboratory of National Forestry and Grassland Administration on Cultivation of Fast-Growing Timber in Central South China, College of Forestry, Guangxi University, Nanning 530004, China
- Guangxi Key Laboratory of Forest Ecology and Conservation, College of Forestry, Guangxi University, Nanning 530004, China
- Ying Hu
- Key Laboratory of National Forestry and Grassland Administration on Cultivation of Fast-Growing Timber in Central South China, College of Forestry, Guangxi University, Nanning 530004, China
- Guangxi Key Laboratory of Forest Ecology and Conservation, College of Forestry, Guangxi University, Nanning 530004, China
9
Vigil K, Aw TG. Comparison of de novo assembly using long-read shotgun metagenomic sequencing of viruses in fecal and serum samples from marine mammals. Front Microbiol 2023; 14:1248323. [PMID: 37808316 PMCID: PMC10556685 DOI: 10.3389/fmicb.2023.1248323] [Received: 06/27/2023] [Accepted: 09/04/2023] [Indexed: 10/10/2023]
Abstract
Introduction: Viral diseases of marine mammals are difficult to study, and this has led to a limited knowledge on emerging known and unknown viruses which are ongoing threats to animal health. Viruses are the leading cause of infectious disease-induced mass mortality events among marine mammals. Methods: In this study, we performed viral metagenomics in stool and serum samples from California sea lions (Zalophus californianus) and bottlenose dolphins (Tursiops truncatus) using long-read nanopore sequencing. Two widely used long-read de novo assemblers, Canu and Metaflye, were evaluated to assemble viral metagenomic sequencing reads from marine mammals. Results: Both Metaflye and Canu assembled similar viral contigs of vertebrates, such as Parvoviridae and Poxviridae. Metaflye assembled viral contigs that aligned with one viral family that was not reproduced by Canu, while Canu assembled viral contigs that aligned with seven viral families that were not reproduced by Metaflye. Only Canu assembled viral contigs from dolphin and sea lion fecal samples that matched both protein and nucleotide RefSeq viral databases using BLASTx and BLASTn for Anelloviridae, Parvoviridae and Circoviridae families. Viral contigs assembled with Canu aligned with torque teno viruses and anelloviruses from vertebrate hosts. Viruses associated with invertebrate hosts including densoviruses, Ambidensovirus, and various Circoviridae isolates were also aligned. Some of the invertebrate and vertebrate viruses reported here are known to potentially cause mortality events and/or disease in different seals, sea stars, fish, and bivalve species. Discussion: Canu performed better by producing the most viral contigs as compared to Metaflye with assemblies aligning to both protein and nucleotide databases. This study suggests that marine mammals can be used as important sentinels to surveil marine viruses that can potentially cause diseases in vertebrate and invertebrate hosts.
Affiliation(s)
- Tiong Gim Aw
- Department of Environmental Health Sciences, School of Public Health and Tropical Medicine, Tulane University, New Orleans, LA, United States
10
Liu Y, Shen X, Gong Y, Liu Y, Song B, Zeng X. Sequence Alignment/Map format: a comprehensive review of approaches and applications. Brief Bioinform 2023; 24:bbad320. [PMID: 37668049 DOI: 10.1093/bib/bbad320] [Received: 05/16/2023] [Revised: 08/16/2023] [Accepted: 08/18/2023] [Indexed: 09/06/2023]
Abstract
The Sequence Alignment/Map (SAM) format file is the text file used to record alignment information. Alignment is the core of sequencing analysis, and downstream tasks accept mapping results for further processing. Given the rapid development of the sequencing industry today, a comprehensive understanding of the SAM format and related tools is necessary to meet the challenges of data processing and analysis. This paper is devoted to retrieving knowledge in the broad field of SAM. First, the format of SAM is introduced to understand the overall process of the sequencing analysis. Then, existing work is systematically classified in accordance with generation, compression and application, and the involved SAM tools are specifically mined. Lastly, a summary and some thoughts on future directions are provided.
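As a concrete illustration of the text format this review covers, the following minimal sketch splits one SAM alignment record into its eleven mandatory tab-separated fields and decodes two FLAG bits. The field order and the bit values (0x4 for unmapped, 0x10 for reverse strand) follow the public SAM specification; the names `SamRecord` and `parse_sam_line` and the example read are hypothetical and are not taken from the reviewed paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SamRecord:
    """One alignment line: the 11 mandatory SAM fields plus optional tags."""
    qname: str   # query (read) name
    flag: int    # bitwise FLAG
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string
    rnext: str   # reference name of the mate/next read
    pnext: int   # position of the mate/next read
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # ASCII-encoded base qualities
    tags: List[str]  # optional TAG:TYPE:VALUE fields, left unparsed here

    @property
    def is_unmapped(self) -> bool:
        return bool(self.flag & 0x4)

    @property
    def is_reverse(self) -> bool:
        return bool(self.flag & 0x10)


def parse_sam_line(line: str) -> SamRecord:
    """Parse a single alignment line; header lines (starting with '@') are rejected."""
    if line.startswith("@"):
        raise ValueError("header line, not an alignment record")
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("expected at least 11 tab-separated fields")
    return SamRecord(
        qname=fields[0], flag=int(fields[1]), rname=fields[2],
        pos=int(fields[3]), mapq=int(fields[4]), cigar=fields[5],
        rnext=fields[6], pnext=int(fields[7]), tlen=int(fields[8]),
        seq=fields[9], qual=fields[10], tags=fields[11:],
    )


if __name__ == "__main__":
    # Made-up record for illustration only.
    example = "read1\t16\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII\tNM:i:0"
    record = parse_sam_line(example)
    print(record.rname, record.pos, record.is_reverse)
```

Production pipelines would normally rely on an established library rather than hand-parsing; the sketch only shows how little structure a single record carries.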
Affiliation(s)
- Yuansheng Liu
- College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
| | - Xiangzhen Shen
- College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
| | - Yongshun Gong
- School of Software, Shandong University, 250100, Jinan, China
| | - Yiping Liu
- College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
| | - Bosheng Song
- College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
| | - Xiangxiang Zeng
- College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
| |
|
11
|
Pan R, Hu H, Xiao Y, Xu L, Xu Y, Ouyang K, Li C, He T, Zhang W. High-quality wild barley genome assemblies and annotation with Nanopore long reads and Hi-C sequencing data. Sci Data 2023; 10:535. [PMID: 37563167 PMCID: PMC10415357 DOI: 10.1038/s41597-023-02434-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/21/2023] [Accepted: 07/31/2023] [Indexed: 08/12/2023] Open
Abstract
Wild barley, from "Evolution Canyon (EC)" in Mount Carmel, Israel, are ideal models for cereal chromosome evolution studies. Here, the wild barley EC_S1 is from the south slope with higher daily temperatures and drought, while EC_N1 is from the north slope with a cooler climate and higher relative humidity, which results in a differentiated selection due to contrasting environments. We assembled a 5.03 Gb genome with contig N50 of 3.53 Mb for wild barley EC_S1 and a 5.05 Gb genome with contig N50 of 3.45 Mb for EC_N1 using 145 Gb and 160.0 Gb Illumina sequencing data, 295.6 Gb and 285.35 Gb Nanopore sequencing data and 555.1 Gb and 514.5 Gb Hi-C sequencing data, respectively. BUSCOs and CEGMA evaluation suggested highly complete assemblies. Using full-length transcriptome data, we predicted 39,179 and 38,373 high-confidence genes in EC_S1 and EC_N1, in which 93.6% and 95.2% were functionally annotated, respectively. We annotated repetitive elements and non-coding RNAs. These two wild barley genome assemblies will provide a rich gene pool for domesticated barley.
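The contig N50 values quoted above summarize assembly contiguity. As a hedged sketch of the standard definition (the largest length L such that contigs of length at least L together cover at least half of the total assembly), the function below computes N50 from a list of contig lengths. The function name `n50` and the toy lengths are illustrative assumptions and are unrelated to the barley assemblies.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one contig length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first contig where the cumulative length
        # reaches half of the assembly size.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


if __name__ == "__main__":
    # Toy contig lengths (total 20): 6 + 5 = 11 covers at least half, so N50 is 5.
    print(n50([2, 3, 4, 5, 6]))
```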
Affiliation(s)
- Rui Pan
- Research Center of Crop Stresses Resistance Technologies, Yangtze University, Jingzhou, 434025, China
| | - Haifei Hu
- Western Crop Genetics Alliance, Western Australian State Agricultural Biotechnology Centre, College of Science, Health, Engineering and Education, Murdoch University, Murdoch, WA, 6155, Australia
- Rice Research Institute, Guangdong Academy of Agricultural Sciences & Key Laboratory of Genetics and Breeding of High-Quality Rice in Southern China (Co-construction by Ministry and Province), Ministry of Agriculture and Rural Affairs & Guangdong Key Laboratory of New Technology in Rice Breeding & Guangdong Rice Engineering Laboratory, Guangzhou, 510640, China
| | - Yuhui Xiao
- Grandomics Biotechnology Co., Ltd, Wuhan, 430076, China
| | - Le Xu
- Research Center of Crop Stresses Resistance Technologies, Yangtze University, Jingzhou, 434025, China
- Hubei Collaborative Innovation Centre for Grain Industry, Yangtze University, Jingzhou, 434025, China
| | - Yanhao Xu
- Research Center of Crop Stresses Resistance Technologies, Yangtze University, Jingzhou, 434025, China
- Hubei Collaborative Innovation Centre for Grain Industry, Yangtze University, Jingzhou, 434025, China
| | - Kai Ouyang
- Grandomics Biotechnology Co., Ltd, Wuhan, 430076, China
| | - Chengdao Li
- Western Crop Genetics Alliance, Western Australian State Agricultural Biotechnology Centre, College of Science, Health, Engineering and Education, Murdoch University, Murdoch, WA, 6155, Australia
- Department of Primary Industries and Regional Development, South Perth, WA, 6155, Australia
| | - Tianhua He
- Western Crop Genetics Alliance, Western Australian State Agricultural Biotechnology Centre, College of Science, Health, Engineering and Education, Murdoch University, Murdoch, WA, 6155, Australia.
| | - Wenying Zhang
- Research Center of Crop Stresses Resistance Technologies, Yangtze University, Jingzhou, 434025, China.
- MARA Key Laboratory of Sustainable Crop Production in the Middle Reaches of the Yangtze River (Co-construction by Ministry and Province), Yangtze University, Jingzhou, 434025, China.
| |
|
12
|
Wong J, Coombe L, Nikolić V, Zhang E, Nip KM, Sidhu P, Warren RL, Birol I. Linear time complexity de novo long read genome assembly with GoldRush. Nat Commun 2023; 14:2906. [PMID: 37217507 DOI: 10.1038/s41467-023-38716-x] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 11/10/2022] [Accepted: 05/11/2023] [Indexed: 05/24/2023] Open
Abstract
Current state-of-the-art de novo long read genome assemblers follow the Overlap-Layout-Consensus paradigm. While read-to-read overlap, its most costly step, has been improved in modern long read genome assemblers, these tools still often require excessive RAM when assembling a typical human dataset. Our work departs from this paradigm, forgoing all-vs-all sequence alignments in favor of a dynamic data structure implemented in GoldRush, a de novo long read genome assembly algorithm with linear time complexity. We tested GoldRush on Oxford Nanopore Technologies long sequencing read datasets with different base error profiles sourced from three human cell lines, rice, and tomato. Here, we show that GoldRush achieves assembly scaffold NGA50 lengths of 18.3-22.2, 0.3 and 2.6 Mbp, for the genomes of human, rice, and tomato, respectively, and assembles each genome within a day, using at most 54.5 GB of random-access memory, demonstrating the scalability of our genome assembly paradigm and its implementation.
Affiliation(s)
- Johnathan Wong
- Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada.
| | - Lauren Coombe
- Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
| | - Vladimir Nikolić
- Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
| | - Emily Zhang
- Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
| | - Ka Ming Nip
- Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
| | - Puneet Sidhu
- Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
| | - René L Warren
- Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
| | - Inanç Birol
- Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada.
| |
|
13
|
Qu M, Zhang Y, Gao Z, Zhang Z, Liu Y, Wan S, Wang X, Yu H, Zhang H, Liu Y, Schneider R, Meyer A, Lin Q. The genetic basis of the leafy seadragon's unique camouflage morphology and avenues for its efficient conservation derived from habitat modeling. SCIENCE CHINA. LIFE SCIENCES 2023:10.1007/s11427-022-2317-6. [PMID: 37204606 DOI: 10.1007/s11427-022-2317-6] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 01/11/2023] [Accepted: 03/03/2023] [Indexed: 05/20/2023]
Abstract
The leafy seadragon is certainly among evolution's most "beautiful and wonderful" species, aptly named for its extraordinary camouflage mimicking its coastal seaweed habitat. However, little is known about the genetic basis of its phenotypes and conspicuous camouflage. Here, we revealed genomic signatures of rapid evolution and positive selection in core genes related to its camouflage, which allowed us to predict population dynamics for this species. Comparative genomic analysis revealed that seadragons have the smallest olfactory repertoires among all ray-finned fishes, suggesting adaptations to the highly specialized habitat. Other positively selected and rapidly evolving genes that serve in bone development and coloration are highly expressed in the leaf-like appendages, supporting a recent adaptive shift in camouflage appendage formation. Knock-out of bmp6 results in dysplastic intermuscular bones with a significantly reduced number in zebrafish, implying its important function in bone formation. Global climate change-induced loss of seagrass beds now severely threatens the continued existence of this enigmatic species. The leafy seadragon has a historically small population size, likely due to its specific habitat requirements, which further exacerbates its vulnerability to climate change. Therefore, climate change-induced range shifts should be taken into account when developing future protection strategies.
Affiliation(s)
- Meng Qu
- CAS Key Laboratory of Tropical Marine Bio-Resources and Ecology, South China Sea Institute of Oceanology, Chinese Academy of Sciences, Southern Marine Science and Engineering Guangdong Laboratory (GML, Guangzhou), Guangzhou, 511458, China
- Sanya Institute of Oceanology, SCSIO, Sanya, 572000, China
- University of Chinese Academy of Sciences, Beijing, 100049, China
| | - Yingyi Zhang
- CAS Key Laboratory of Tropical Marine Bio-Resources and Ecology, South China Sea Institute of Oceanology, Chinese Academy of Sciences, Southern Marine Science and Engineering Guangdong Laboratory (GML, Guangzhou), Guangzhou, 511458, China
- Sanya Institute of Oceanology, SCSIO, Sanya, 572000, China
- University of Chinese Academy of Sciences, Beijing, 100049, China
| | - Zexia Gao
- College of Fisheries, Key Lab of Freshwater Animal Breeding, Ministry of Agriculture, Huazhong Agricultural University, Wuhan, 430070, China
| | - Zhixin Zhang
- CAS Key Laboratory of Tropical Marine Bio-Resources and Ecology, South China Sea Institute of Oceanology, Chinese Academy of Sciences, Southern Marine Science and Engineering Guangdong Laboratory (GML, Guangzhou), Guangzhou, 511458, China
- Sanya Institute of Oceanology, SCSIO, Sanya, 572000, China
- Global Ocean and Climate Research Center, South China Sea Institute of Oceanology, Guangzhou, 510301, China
| | - Yali Liu
- CAS Key Laboratory of Tropical Marine Bio-Resources and Ecology, South China Sea Institute of Oceanology, Chinese Academy of Sciences, Southern Marine Science and Engineering Guangdong Laboratory (GML, Guangzhou), Guangzhou, 511458, China
- Sanya Institute of Oceanology, SCSIO, Sanya, 572000, China
- University of Chinese Academy of Sciences, Beijing, 100049, China
| | - Shiming Wan
- College of Fisheries, Key Lab of Freshwater Animal Breeding, Ministry of Agriculture, Huazhong Agricultural University, Wuhan, 430070, China
| | - Xin Wang
- CAS Key Laboratory of Tropical Marine Bio-Resources and Ecology, South China Sea Institute of Oceanology, Chinese Academy of Sciences, Southern Marine Science and Engineering Guangdong Laboratory (GML, Guangzhou), Guangzhou, 511458, China
- Sanya Institute of Oceanology, SCSIO, Sanya, 572000, China
| | - Haiyan Yu
- CAS Key Laboratory of Tropical Marine Bio-Resources and Ecology, South China Sea Institute of Oceanology, Chinese Academy of Sciences, Southern Marine Science and Engineering Guangdong Laboratory (GML, Guangzhou), Guangzhou, 511458, China
- Sanya Institute of Oceanology, SCSIO, Sanya, 572000, China
| | - Huixian Zhang
- CAS Key Laboratory of Tropical Marine Bio-Resources and Ecology, South China Sea Institute of Oceanology, Chinese Academy of Sciences, Southern Marine Science and Engineering Guangdong Laboratory (GML, Guangzhou), Guangzhou, 511458, China
- Sanya Institute of Oceanology, SCSIO, Sanya, 572000, China
| | - Yuhong Liu
- CAS Key Laboratory of Tropical Marine Bio-Resources and Ecology, South China Sea Institute of Oceanology, Chinese Academy of Sciences, Southern Marine Science and Engineering Guangdong Laboratory (GML, Guangzhou), Guangzhou, 511458, China
- Sanya Institute of Oceanology, SCSIO, Sanya, 572000, China
| | - Ralf Schneider
- Marine Evolutionary Ecology, Zoological Institute, Kiel University, 24118, Kiel, Germany
| | - Axel Meyer
- Department of Biology, University of Konstanz, 78464, Konstanz, Germany.
| | - Qiang Lin
- CAS Key Laboratory of Tropical Marine Bio-Resources and Ecology, South China Sea Institute of Oceanology, Chinese Academy of Sciences, Southern Marine Science and Engineering Guangdong Laboratory (GML, Guangzhou), Guangzhou, 511458, China.
- Sanya Institute of Oceanology, SCSIO, Sanya, 572000, China.
- University of Chinese Academy of Sciences, Beijing, 100049, China.
| |
|
14
|
Kariuki EG, Kibet C, Paredes JC, Mboowa G, Mwaura O, Njogu J, Masiga D, Bugg TDH, Tanga CM. Metatranscriptomic analysis of the gut microbiome of black soldier fly larvae reared on lignocellulose-rich fiber diets unveils key lignocellulolytic enzymes. Front Microbiol 2023; 14:1120224. [PMID: 37180276 PMCID: PMC10171111 DOI: 10.3389/fmicb.2023.1120224] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 12/09/2022] [Accepted: 04/03/2023] [Indexed: 05/16/2023] Open
Abstract
Recently, the black soldier fly larvae (BSFL) gut microbiome has received increased attention, primarily due to its role in waste bioconversion. However, there is a lack of information on the positive effect on the activities of the gut microbiomes and enzymes (CAZyme families) acting on lignocellulose. In this study, BSFL were subjected to lignocellulose-rich diets: chicken feed (CF), chicken manure (CM), brewers' spent grain (BSG), and water hyacinth (WH). The mRNA libraries were prepared, and RNA-Sequencing was conducted using the PCR-cDNA approach through the MinION sequencing platform. Our results demonstrated that BSFL reared on BSG and WH had the highest abundance of Bacteroides and Dysgonomonas. The GH51 and GH43_16 enzyme families, with both α-L-arabinofuranosidases and exo-alpha-L-arabinofuranosidase 2, were common in the gut of BSFL reared on the highly lignocellulosic WH and BSG diets. Gene clusters that encode hemicellulolytic arabinofuranosidases in the CAZy family GH51 were also identified. These findings provide novel insight into the shift of gut microbiomes and the potential role of BSFL in the bioconversion of various highly lignocellulosic diets to fermentable sugars for subsequent value-added products (bioethanol). Further research on the role of these enzymes to improve existing technologies and their biotechnological applications is crucial.
Affiliation(s)
- Eric G. Kariuki
- International Centre of Insect Physiology and Ecology (icipe), Nairobi, Kenya
- Department of Immunology and Molecular Biology, Makerere University, Kampala, Uganda
| | - Caleb Kibet
- International Centre of Insect Physiology and Ecology (icipe), Nairobi, Kenya
| | - Juan C. Paredes
- Department of Immunology and Molecular Biology, Makerere University, Kampala, Uganda
| | - Gerald Mboowa
- Department of Immunology and Molecular Biology, Makerere University, Kampala, Uganda
| | - Oscar Mwaura
- International Centre of Insect Physiology and Ecology (icipe), Nairobi, Kenya
| | - John Njogu
- International Centre of Insect Physiology and Ecology (icipe), Nairobi, Kenya
| | - Daniel Masiga
- International Centre of Insect Physiology and Ecology (icipe), Nairobi, Kenya
| | - Timothy D. H. Bugg
- Department of Chemistry, School of Life Sciences, University of Warwick, Coventry, United Kingdom
| | - Chrysantus M. Tanga
- International Centre of Insect Physiology and Ecology (icipe), Nairobi, Kenya
| |
|
15
|
Duitama J. Phased Genome Assemblies. Methods Mol Biol 2023; 2590:273-286. [PMID: 36335504 DOI: 10.1007/978-1-0716-2819-5_16] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/16/2023]
Abstract
The ultimate goal of de novo assembly of reads sequenced from a diploid individual is the separate reconstruction of the sequences corresponding to the two copies of each chromosome. Unfortunately, the allele linkage information needed to perform phased genome assemblies has been difficult to generate. Hence, most current genome assemblies are a haploid mixture of the two underlying chromosome copies present in the sequenced individual. Sequencing technologies providing long (20 kb) and accurate reads are the basis to generate phased genome assemblies. This chapter provides a brief overview of the main milestones in traditional genome assembly, focusing on the bioinformatic techniques developed to generate haplotype information from different specialized protocols. Using these techniques as a knowledge background, the chapter reviews the current algorithms to generate phased assemblies from long reads with low error rates. Current techniques perform haplotype-aware error correction steps to increase the quality of the raw reads. In addition, variations on the traditional overlap-layout-consensus (OLC) graph have been developed in an effort to eliminate edges between reads sequenced from different chromosome copies. This allows for large presence-absence variants between the chromosome copies to be taken into account. The development of these algorithms, along with the improved sequencing technologies has been crucial to finish chromosome-level assemblies of complex genomes.
Affiliation(s)
- Jorge Duitama
- Systems and Computing Engineering Department, Universidad de los Andes, Bogotá, Colombia.
| |
|
16
|
Zuckerman NS, Shulman LM. Next-Generation Sequencing in the Study of Infectious Diseases. Infect Dis (Lond) 2023. [DOI: 10.1007/978-1-0716-2463-0_1090] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/10/2023] Open
|
17
|
Muñoz-Barrera A, Rubio-Rodríguez LA, Díaz-de Usera A, Jáspez D, Lorenzo-Salazar JM, González-Montelongo R, García-Olivares V, Flores C. From Samples to Germline and Somatic Sequence Variation: A Focus on Next-Generation Sequencing in Melanoma Research. Life (Basel) 2022; 12:1939. [PMID: 36431075 PMCID: PMC9695713 DOI: 10.3390/life12111939] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/28/2022] [Revised: 11/12/2022] [Accepted: 11/16/2022] [Indexed: 11/24/2022] Open
Abstract
Next-generation sequencing (NGS) applications have flourished in the last decade, permitting the identification of cancer driver genes and profoundly expanding the possibilities of genomic studies of cancer, including melanoma. Here we aimed to present a technical review across many of the methodological approaches brought by the use of NGS applications with a focus on assessing germline and somatic sequence variation. We provide cautionary notes and discuss key technical details involved in library preparation, the most common problems with the samples, and guidance to circumvent them. We also provide an overview of the sequence-based methods for cancer genomics, exposing the pros and cons of targeted sequencing vs. exome or whole-genome sequencing (WGS), the fundamentals of the most common commercial platforms, and a comparison of throughputs and key applications. Details of the steps and the main software involved in the bioinformatics processing of the sequencing results, from preprocessing to variant prioritization and filtering, are also provided in the context of the full spectrum of genetic variation (SNVs, indels, CNVs, structural variation, and gene fusions). Finally, we put the emphasis on selected bioinformatic pipelines behind (a) short-read WGS identification of small germline and somatic variants, (b) detection of gene fusions from transcriptomes, and (c) de novo assembly of genomes from long-read WGS data. Overall, we provide comprehensive guidance across the main methodological procedures involved in obtaining sequencing results for the most common short- and long-read NGS platforms, highlighting key applications in melanoma research.
Affiliation(s)
- Adrián Muñoz-Barrera
- Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), 38600 Santa Cruz de Tenerife, Spain
| | - Luis A. Rubio-Rodríguez
- Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), 38600 Santa Cruz de Tenerife, Spain
| | - Ana Díaz-de Usera
- Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), 38600 Santa Cruz de Tenerife, Spain
- Research Unit, Hospital Universitario Nuestra Señora de Candelaria, 38010 Santa Cruz de Tenerife, Spain
| | - David Jáspez
- Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), 38600 Santa Cruz de Tenerife, Spain
| | - José M. Lorenzo-Salazar
- Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), 38600 Santa Cruz de Tenerife, Spain
| | - Rafaela González-Montelongo
- Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), 38600 Santa Cruz de Tenerife, Spain
| | - Víctor García-Olivares
- Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), 38600 Santa Cruz de Tenerife, Spain
| | - Carlos Flores
- Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), 38600 Santa Cruz de Tenerife, Spain
- Research Unit, Hospital Universitario Nuestra Señora de Candelaria, 38010 Santa Cruz de Tenerife, Spain
- CIBER de Enfermedades Respiratorias, Instituto de Salud Carlos III, 28029 Madrid, Spain
- Facultad de Ciencias de la Salud, Universidad Fernando de Pessoa Canarias, 35450 Las Palmas de Gran Canaria, Spain
| |
|
18
|
Cordier BA, Sawaya NPD, Guerreschi GG, McWeeney SK. Biology and medicine in the landscape of quantum advantages. J R Soc Interface 2022; 19:20220541. [PMID: 36448288 PMCID: PMC9709576 DOI: 10.1098/rsif.2022.0541] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Indexed: 12/03/2022] Open
Abstract
Quantum computing holds substantial potential for applications in biology and medicine, spanning from the simulation of biomolecules to machine learning methods for subtyping cancers on the basis of clinical features. This potential is encapsulated by the concept of a quantum advantage, which is contingent on a reduction in the consumption of a computational resource, such as time, space or data. Here, we distill the concept of a quantum advantage into a simple framework to aid researchers in biology and medicine pursuing the development of quantum applications. We then apply this framework to a wide variety of computational problems relevant to these domains in an effort to (i) assess the potential of practical advantages in specific application areas and (ii) identify gaps that may be addressed with novel quantum approaches. In doing so, we provide an extensive survey of the intersection of biology and medicine with the current landscape of quantum algorithms and their potential advantages. While we endeavour to identify specific computational problems that may admit practical advantages throughout this work, the rapid pace of change in the fields of quantum computing, classical algorithms and biological research implies that this intersection will remain highly dynamic for the foreseeable future.
Affiliation(s)
- Benjamin A. Cordier
- Department of Medical Informatics and Clinical Epidemiology, Oregon Health and Science University, Portland, OR 97202, USA
| | | | | | - Shannon K. McWeeney
- Department of Medical Informatics and Clinical Epidemiology, Oregon Health and Science University, Portland, OR 97202, USA; Knight Cancer Institute, Oregon Health and Science University, Portland, OR 97202, USA; Oregon Clinical and Translational Research Institute, Oregon Health and Science University, Portland, OR 97202, USA
| |
|
19
|
Freire B, Ladra S, Parama JR. Memory-Efficient Assembly Using Flye. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:3564-3577. [PMID: 34469305 DOI: 10.1109/tcbb.2021.3108843] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Indexed: 06/13/2023]
Abstract
In the past decade, next-generation sequencing (NGS) enabled the generation of genomic data in a cost-effective, high-throughput manner. The most recent third-generation sequencing technologies produce longer reads; however, their error rates are much higher, which complicates the assembly process. As a result, long-read assemblers are demanding in both time and space. Moreover, the advances in these technologies have allowed portable and real-time DNA sequencing, enabling in-field analysis. In these scenarios, it becomes crucial to have more efficient solutions that can be executed on computers or mobile devices with minimal hardware requirements. We re-implemented an existing assembler designed for long reads, namely Flye, using compressed data structures. We then compared our version with the original software using real datasets, and evaluated their performance in terms of memory requirements, execution speed, and energy consumption. The assembly results are not affected, as the core of the algorithm is maintained, but the use of advanced compact data structures reduces memory consumption by 22% to 47% and processing time by amounts ranging from none (on a par with the original) to 25%. These improvements also reduce energy consumption by around 3-8%, with some datasets obtaining decreases of up to 26%.
|
20
|
An X, Ghosh P, Keppler P, Kurt SE, Krishnamoorthy S, Sadayappan P, Rajam AS, Çatalyürek ÜV, Kalyanaraman A. BOA: A partitioned view of genome assembly. iScience 2022; 25:105273. [PMID: 36304115 PMCID: PMC9593263 DOI: 10.1016/j.isci.2022.105273] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/12/2022] [Revised: 09/27/2022] [Accepted: 09/30/2022] [Indexed: 11/16/2022] Open
Abstract
De novo genome assembly is a fundamental problem in computational molecular biology that aims to reconstruct an unknown genome sequence from a set of short DNA sequences (or reads) obtained from the genome. The relative ordering of the reads along the target genome is not known a priori, which is one of the main contributors to the increased complexity of the assembly process. In this article, with the dual objective of improving assembly quality and exposing a high degree of parallelism, we present a partitioning-based approach. Our framework, BOA (bucket-order-assemble), uses bucketing alongside graph- and hypergraph-based partitioning techniques to produce a partial ordering of the reads. This partial ordering enables us to divide the read set into disjoint blocks that can be independently assembled in parallel using any state-of-the-art serial assembler of choice. Experimental results show that BOA improves both the overall assembly quality and performance.
Affiliation(s)
- Xiaojing An
- School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA
| | - Priyanka Ghosh
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - Patrick Keppler
- School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA 99164, USA
| | - Sureyya Emre Kurt
- School of Computing, University of Utah, Salt Lake City, UT 84112, USA
| | | | | | - Aravind Sukumaran Rajam
- School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA 99164, USA
| | - Ümit V. Çatalyürek
- School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA; Amazon Web Services, Seattle, WA 98109, USA
| | - Ananth Kalyanaraman
- School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA 99164, USA; Corresponding author
| |
|
21
|
Faure R, Lavenier D. QuickDeconvolution: fast and scalable deconvolution of linked-read sequencing data. BIOINFORMATICS ADVANCES 2022; 2:vbac068. [PMID: 36699389 PMCID: PMC9710601 DOI: 10.1093/bioadv/vbac068] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2022] [Revised: 08/22/2022] [Accepted: 09/21/2022] [Indexed: 01/28/2023]
Abstract
Motivation: Recently introduced linked-read technologies, such as the 10x Chromium system, use microfluidics to tag multiple short reads from the same long fragment (50-200 kb) with a small sequence, called a barcode. They are inexpensive and easy to prepare, combining the accuracy of short-read sequencing with the long-range information of barcodes. The same barcode can be used for several different fragments, which complicates the analyses. Results: We present QuickDeconvolution (QD), new software for deconvolving a set of reads sharing a barcode, i.e. separating the reads from the different fragments. QD only takes sequencing data as input, without the need for a reference genome. We show that QD outperforms existing software in terms of accuracy, speed and scalability, making it capable of deconvolving previously inaccessible data sets. In particular, we demonstrate here the first example in the literature of a successfully deconvolved animal sequencing dataset, a 33-Gb Drosophila melanogaster dataset. We show that the taxonomic assignment of linked reads can be improved by deconvolving reads with QD before taxonomic classification. Availability and implementation: Code and instructions are available on https://github.com/RolandFaure/QuickDeconvolution. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
|
22
|
Yang S, Lan T, Zhang Y, Wang Q, Li H, Dussex N, Sahu SK, Shi M, Hu M, Zhu Y, Cao J, Liu L, Lin J, Wan QH, Liu H, Fang SG. Genomic investigation of the Chinese alligator reveals wild-extinct genetic diversity and genomic consequences of their continuous decline. Mol Ecol Resour 2022; 23:294-311. [PMID: 35980602 PMCID: PMC10087395 DOI: 10.1111/1755-0998.13702] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 02/08/2022] [Revised: 07/29/2022] [Accepted: 08/15/2022] [Indexed: 11/26/2022]
Abstract
Critically endangered species are usually restricted to small and isolated populations. High inbreeding without gene flow among populations further aggravates their threatened condition and reduces the likelihood of their long-term survival. The Chinese alligator (Alligator sinensis) is one of the most endangered crocodilians in the world and has experienced a continuous decline over the past ca. 1 million years. In order to identify the genetic status of the remaining populations and aid conservation efforts, we assembled the first high-quality chromosome-level genome of the Chinese alligator and explored the genomic characteristics of three extant breeding populations. Our analyses revealed the existence of at least three genetically distinct populations, comprising two breeding populations in China (Changxing and Xuancheng) and one breeding population in an American wildlife refuge. The American population does not belong to either of the last two populations of the species' native range (Xuancheng and Changxing); it thus represents genetic diversity that is extinct in the wild and provides future opportunities for genetic rescue. Moreover, the effective population size of these three populations has been continuously declining over the past 20 ka. Consistent with this decline, the species shows extremely low genetic diversity, a large proportion of long runs of homozygosity, and mutational load across the genome. Finally, to provide genomic insights for future breeding management and conservation, we assessed the feasibility of mixing extant populations based on the likelihood of introducing new deleterious alleles and signatures of local adaptation. Overall, this study provides a valuable genomic resource and important genomic insights into the ecology, evolution, and conservation of critically endangered alligators.
Affiliation(s)
- Shangchen Yang
- MOE Key Laboratory of Biosystems Homeostasis & Protection, State Conservation Centre for Gene Resources of Endangered Wildlife, College of Life Sciences, Zhejiang University, Hangzhou, China
| | - Tianming Lan
- State Key Laboratory of Agricultural Genomics, BGI-Shenzhen, Shenzhen, China.,BGI Life Science Joint Research Center, Northeast Forestry University, China
| | - Yi Zhang
- MOE Key Laboratory of Biosystems Homeostasis & Protection, State Conservation Centre for Gene Resources of Endangered Wildlife, College of Life Sciences, Zhejiang University, Hangzhou, China
| | - Qing Wang
- State Key Laboratory of Agricultural Genomics, BGI-Shenzhen, Shenzhen, China.,College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
| | - Haimeng Li
- State Key Laboratory of Agricultural Genomics, BGI-Shenzhen, Shenzhen, China.,College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
| | - Nicolas Dussex
- Centre for Palaeogenetics, Svante Arrhenius väg 20C, 10691, Stockholm, Sweden.,Department of Zoology, Stockholm University, Stockholm, Sweden.,Department of Bioinformatics and Genetics, Swedish Museum of Natural History, Stockholm, Sweden
| | - Sunil Kumar Sahu
- State Key Laboratory of Agricultural Genomics, BGI-Shenzhen, Shenzhen, China
| | - Minhui Shi
- State Key Laboratory of Agricultural Genomics, BGI-Shenzhen, Shenzhen, China.,College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
| | - Mengyuan Hu
- MOE Key Laboratory of Biosystems Homeostasis & Protection, State Conservation Centre for Gene Resources of Endangered Wildlife, College of Life Sciences, Zhejiang University, Hangzhou, China
| | - Yixin Zhu
- State Key Laboratory of Agricultural Genomics, BGI-Shenzhen, Shenzhen, China.,College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
| | - Jun Cao
- China National GeneBank, BGI-Shenzhen, Shenzhen, China.,Guangdong Provincial Key Laboratory of Genome Read and Write, BGI-Shenzhen, Shenzhen, China
| | - Lirong Liu
- China National GeneBank, BGI-Shenzhen, Shenzhen, China.,Guangdong Provincial Key Laboratory of Genome Read and Write, BGI-Shenzhen, Shenzhen, China
| | - Jianqing Lin
- MOE Key Laboratory of Biosystems Homeostasis & Protection, State Conservation Centre for Gene Resources of Endangered Wildlife, College of Life Sciences, Zhejiang University, Hangzhou, China
| | - Qiu-Hong Wan
- MOE Key Laboratory of Biosystems Homeostasis & Protection, State Conservation Centre for Gene Resources of Endangered Wildlife, College of Life Sciences, Zhejiang University, Hangzhou, China
| | - Huan Liu
- State Key Laboratory of Agricultural Genomics, BGI-Shenzhen, Shenzhen, China.,BGI Life Science Joint Research Center, Northeast Forestry University, China
| | - Sheng-Guo Fang
- MOE Key Laboratory of Biosystems Homeostasis & Protection, State Conservation Centre for Gene Resources of Endangered Wildlife, College of Life Sciences, Zhejiang University, Hangzhou, China
| |
|
23
|
Kang X, Luo X, Schönhuth A. StrainXpress: strain aware metagenome assembly from short reads. Nucleic Acids Res 2022; 50:e101. [PMID: 35776122 PMCID: PMC9508831 DOI: 10.1093/nar/gkac543] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/20/2021] [Revised: 05/27/2022] [Accepted: 06/30/2022] [Indexed: 12/05/2022] Open
Abstract
Next-generation sequencing-based metagenomics has made it possible to identify microorganisms in characteristic habitats without the need for lengthy cultivation. Importantly, clinically relevant phenomena such as resistance to medication, virulence or interactions with the environment can already vary within species. Therefore, a major current challenge is to reconstruct individual genomes from the sequencing reads at the level of strains, and not just the level of species. However, strains of one species can differ by only a small number of variants, which makes them difficult to distinguish. Despite considerable recent progress, related approaches have remained fragmentary so far. Here, we present StrainXpress as a comprehensive solution to the problem of strain-aware metagenome assembly from next-generation sequencing reads. In experiments, StrainXpress reconstructs strain-specific genomes from metagenomes that involve more than 1000 strains and successfully deals with poorly covered strains. The amount of reconstructed strain-specific sequence exceeds that of current state-of-the-art approaches by 26.75% on average across all data sets (first quartile: 18.51%, median: 26.60%, third quartile: 35.05%).
Affiliation(s)
- Xiongbin Kang
- Genome Data Science, Faculty of Technology, Bielefeld University, Bielefeld, 33615, Germany
| | - Xiao Luo
- Genome Data Science, Faculty of Technology, Bielefeld University, Bielefeld, 33615, Germany
| | - Alexander Schönhuth
- Genome Data Science, Faculty of Technology, Bielefeld University, Bielefeld, 33615, Germany
| |
|
24
|
Goussarov G, Mysara M, Vandamme P, Van Houdt R. Introduction to the principles and methods underlying the recovery of metagenome-assembled genomes from metagenomic data. Microbiologyopen 2022; 11:e1298. [PMID: 35765182 PMCID: PMC9179125 DOI: 10.1002/mbo3.1298] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/23/2022] [Revised: 05/19/2022] [Accepted: 05/19/2022] [Indexed: 11/18/2022] Open
Abstract
The rise of metagenomics offers a leap forward for understanding the genetic diversity of microorganisms in many different complex environments by providing a platform that can identify potentially unlimited numbers of known and novel microorganisms. As such, it is impossible to imagine new major initiatives without metagenomics. Nevertheless, it represents a relatively new discipline with various levels of complexity and demands on bioinformatics. The underlying principles and methods used in metagenomics are often treated as common knowledge and are therefore often described only briefly or in fragments. We therefore reviewed them to guide microbiologists in taking their first steps into metagenomics. We specifically focus on a workflow aimed at reconstructing individual genomes, that is, metagenome-assembled genomes, integrating DNA sequencing, assembly, binning, identification and annotation.
Affiliation(s)
- Gleb Goussarov
- Microbiology Unit, Belgian Nuclear Research Centre (SCK CEN), Mol, Belgium.,Laboratory of Microbiology and BCCM/LMG Bacteria Collection, Faculty of Sciences, Ghent University, Ghent, Belgium
| | - Mohamed Mysara
- Microbiology Unit, Belgian Nuclear Research Centre (SCK CEN), Mol, Belgium
| | - Peter Vandamme
- Laboratory of Microbiology and BCCM/LMG Bacteria Collection, Faculty of Sciences, Ghent University, Ghent, Belgium
| | - Rob Van Houdt
- Microbiology Unit, Belgian Nuclear Research Centre (SCK CEN), Mol, Belgium
| |
|
25
|
Dufault‐Thompson K, Jiang X. Applications of de Bruijn graphs in microbiome research. IMETA 2022; 1:e4. [PMID: 38867733 PMCID: PMC10989854 DOI: 10.1002/imt2.4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/20/2021] [Revised: 01/24/2022] [Accepted: 01/24/2022] [Indexed: 06/14/2024]
Abstract
High-throughput sequencing has become an increasingly central component of microbiome research. The development of de Bruijn graph-based methods for assembling high-throughput sequencing data has been an important part of the broader adoption of sequencing as part of biological studies. Recent advances in the construction and representation of de Bruijn graphs have led to new approaches that utilize the de Bruijn graph data structure to aid in different biological analyses. One type of application of these methods has been in alternative approaches to the assembly of sequencing data like gene-targeted assembly, where only gene sequences are assembled out of larger metagenomes, and differential assembly, where sequences that are differentially present between two samples are assembled. de Bruijn graphs have also been applied for comparative genomics where they can be used to represent large sets of multiple genomes or metagenomes where structural features in the graphs can be used to identify variants, indels, and homologous regions in sequences. These de Bruijn graph-based representations of sequencing data have even begun to be applied to whole sequencing databases for large-scale searches and experiment discovery. de Bruijn graphs have played a central role in how high-throughput sequencing data is worked with, and the rapid development of new tools that rely on these data structures suggests that they will continue to play an important role in biology in the future.
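As a rough illustration of the data structure this abstract refers to, the following Python sketch builds a small de Bruijn graph from reads: every k-mer contributes an edge from its (k-1)-mer prefix to its (k-1)-mer suffix. The function name `build_de_bruijn` and the toy reads are assumptions made for this example; the sketch ignores reverse complements, sequencing errors, and the compact representations that production tools rely on.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it in some read."""
    graph = defaultdict(list)
    for read in reads:
        # Slide a window of length k over the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix to its suffix; repeats are kept,
            # so edge multiplicity reflects how often a k-mer was seen.
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Hypothetical toy reads that overlap by four bases.
reads = ["ACGTC", "CGTCA"]
graph = build_de_bruijn(reads, k=3)
for node in sorted(graph):
    print(node, "->", graph[node])
```

Walking the edges from `AC` spells out the merged sequence `ACGTCA`, which is the basic idea behind graph-based assembly.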
Affiliation(s)
- Keith Dufault‐Thompson
- Intramural Research Program, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
| | - Xiaofang Jiang
- Intramural Research Program, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
| |
|
26
|
Song JG, Yu MS, Lee B, Lee J, Hwang SH, Na D, Kim HW. Analysis methods for the gut microbiome in neuropsychiatric and neurodegenerative disorders. Comput Struct Biotechnol J 2022; 20:1097-1110. [PMID: 35317228 PMCID: PMC8902474 DOI: 10.1016/j.csbj.2022.02.024] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 11/06/2021] [Revised: 02/24/2022] [Accepted: 02/24/2022] [Indexed: 12/14/2022] Open
Abstract
For a long time, the central nervous system was believed to be the only regulator of cognitive functions. However, accumulating evidence suggests that the composition of the microbiome is strongly associated with brain functions and diseases. Indeed, the gut microbiome is involved in neuropsychiatric diseases (e.g., depression, autism spectrum disorder, and anxiety) and neurodegenerative diseases (e.g., Parkinson’s disease and Alzheimer’s disease). In this review, we provide an overview of the link between the gut microbiome and neuropsychiatric or neurodegenerative disorders. We also introduce analytical methods used to assess the connection between the gut microbiome and the brain. The limitations of the methods used at present are also discussed. The accurate translation of the microbiome information to brain disorder could promote better understanding of neuronal diseases and aid in finding alternative and novel therapies.
Affiliation(s)
- Jae Gwang Song
- Department of Bio-integrated Science and Technology, College of Life Sciences, Sejong University, Seoul 05006, Republic of Korea
| | - Myeong-Sang Yu
- Department of Biomedical Engineering, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul 06974, Republic of Korea
| | - Bomi Lee
- Department of Bio-integrated Science and Technology, College of Life Sciences, Sejong University, Seoul 05006, Republic of Korea
| | - Jingyu Lee
- Department of Biomedical Engineering, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul 06974, Republic of Korea
| | - Su-Hee Hwang
- Department of Biomedical Engineering, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul 06974, Republic of Korea
| | - Dokyun Na
- Department of Biomedical Engineering, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul 06974, Republic of Korea
- Corresponding authors.
| | - Hyung Wook Kim
- Department of Bio-integrated Science and Technology, College of Life Sciences, Sejong University, Seoul 05006, Republic of Korea
- Corresponding authors.
| |
|
27
|
Chen Y, You D, Zhang T, Wang G. SLDMS: A Tool for Calculating the Overlapping Regions of Sequences. FRONTIERS IN PLANT SCIENCE 2022; 12:813036. [PMID: 35046988 PMCID: PMC8761809 DOI: 10.3389/fpls.2021.813036] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/11/2021] [Accepted: 11/29/2021] [Indexed: 06/14/2023]
Abstract
In genome assembly, contig assembly is one of the most important steps. It requires processing the overlapping regions of a large number of DNA sequences, and this computation usually takes a long time. Running time is therefore an important indicator for evaluating contig assembly algorithms. Existing methods for processing overlapping regions of sequences are too slow. We therefore propose SLDMS, a method for processing sequence overlapping regions based on a suffix array and a monotonic stack, which effectively improves the efficiency of overlap processing. The running time of SLDMS is much less than that of Canu and Flye when computing sequence overlap intervals, and on data where most sequencing errors occur at both ends of the reads, the running time of SLDMS is only about one-tenth that of the other two methods.
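To make the notion of an "overlapping region" concrete, here is a deliberately naive Python sketch that finds the longest suffix of one read that exactly matches a prefix of another. It is not the SLDMS algorithm: the abstract describes a suffix array combined with a monotonic stack, whereas this baseline simply tries every candidate length. The function name `longest_overlap`, the `min_len` parameter, and the example reads are assumptions for illustration only.

```python
def longest_overlap(a, b, min_len=1):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when no exact overlap of at least `min_len` bases exists.
    Naive scan: worst-case time grows quadratically with read length.
    """
    # Try the longest possible overlap first and shrink until one matches.
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


# Hypothetical reads: the suffix "TTG" of the first is a prefix of the second.
left, right = "ACGTTG", "TTGCA"
n = longest_overlap(left, right)
merged = left + right[n:]
print(n, merged)
```

Running this pairwise check over all reads is what becomes expensive at scale, which is the cost that index-based approaches aim to reduce.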
Affiliation(s)
- Yu Chen
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - DongLiang You
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - TianJiao Zhang
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - GuoHua Wang
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
- State Key Laboratory of Tree Genetics and Breeding, Northeast Forestry University, Harbin, China
| |
|
28
|
Genome assembly and annotation. Bioinformatics 2022. [DOI: 10.1016/b978-0-323-89775-4.00013-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022] Open
|
29
|
Shmakov NA. Improving the quality of barley transcriptome de novo assembling by using a hybrid approach for lines with varying spike and stem coloration. Vavilovskii Zhurnal Genet Selektsii 2021; 25:30-38. [PMID: 34901701 PMCID: PMC8627909 DOI: 10.18699/vj21.004] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/24/2020] [Revised: 01/15/2021] [Accepted: 01/15/2021] [Indexed: 11/19/2022] Open
Abstract
De novo transcriptome assembly is an important stage of RNA-seq data computational analysis. It allows researchers to obtain the sequences of transcripts present in the biological sample of interest. The availability of an accurate and complete transcriptome sequence of the organism of interest is, in turn, an indispensable condition for further analysis of RNA-seq data. Through years of transcriptomic research, the bioinformatics community has developed a number of assembler programs for transcriptome reconstruction from short reads of RNA-seq libraries. Different assemblers make it possible to perform either de novo transcriptome reconstruction or genome-guided reconstruction. The majority of the assemblers working with RNA-seq data are based on the de Bruijn graph method of sequence reconstruction. However, the specifics of their procedures can vary drastically, as do their results. A number of authors recommend a hybrid approach to transcriptome reconstruction based on combining the results of several assemblers in order to achieve a better transcriptome assembly. The advantage of this approach has been demonstrated in a number of studies, with RNA-seq experiments conducted on the Illumina platform. In this paper, we propose a hybrid approach for creating a transcriptome assembly of the barley Hordeum vulgare isogenic line Bowman and two nearly isogenic lines contrasting in spike pigmentation, based on the results of sequencing on the IonTorrent platform. This approach combines several de novo assemblers: Trinity, Trans-ABySS and rnaSPAdes. Several assembly metrics were examined: the percentage of reference transcripts observed in the assemblies, the percentage of RNA-seq reads involved, and BUSCO scores. It was shown that, based on the combination of these metrics, the transcriptome meta-assembly surpasses the individual transcriptome assemblies it consists of.
Affiliation(s)
- N A Shmakov
- Institute of Cytology and Genetics of Siberian Branch of the Russian Academy of Sciences, Novosibirsk, Russia; Kurchatov Genomics Center, Institute of Cytology and Genetics of Siberian Branch of the Russian Academy of Sciences, Novosibirsk, Russia
| |
|
30
|
Music of metagenomics-a review of its applications, analysis pipeline, and associated tools. Funct Integr Genomics 2021; 22:3-26. [PMID: 34657989 DOI: 10.1007/s10142-021-00810-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/29/2021] [Revised: 09/25/2021] [Accepted: 10/03/2021] [Indexed: 10/20/2022]
Abstract
This humble effort highlights the intricate details of metagenomics in a simple, poetic, and rhythmic way. The paper enforces the significance of the research area, provides details about major analytical methods, examines the taxonomy and assembly of genomes, emphasizes some tools, and concludes by celebrating the richness of the ecosystem populated by the "metagenome."
|
31
|
NanoHIV: A Bioinformatics Pipeline for Producing Accurate, Near Full-Length HIV Proviral Genomes Sequenced Using the Oxford Nanopore Technology. Cells 2021; 10:2577. [PMID: 34685559 PMCID: PMC8534097 DOI: 10.3390/cells10102577] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 09/02/2021] [Revised: 09/22/2021] [Accepted: 09/24/2021] [Indexed: 12/13/2022] Open
Abstract
HIV-1 proviral single-genome sequencing by limiting-dilution polymerase chain reaction (PCR) amplification is important for differentiating the sequence-intact from defective proviruses that persist during antiretroviral therapy (ART). Intact proviruses may rebound if ART is interrupted and are the barrier to an HIV cure. Oxford Nanopore Technologies (ONT) sequencing offers a promising, cost-effective approach to the sequencing of long amplicons such as near full-length HIV-1 proviruses, but the high diversity of HIV-1 and the ONT sequencing error render analysis of the generated data difficult. NanoHIV is a new tool that uses an iterative consensus generation approach to construct accurate, near full-length HIV-1 proviral single-genome sequences from ONT data. To validate the approach, single-genome sequences generated using NanoHIV consensus building were compared to Illumina® consensus building of the same nine single-genome near full-length amplicons and an average agreement of 99.4% was found between the two sequencing approaches.
|
32
|
Boev AS, Rakitko AS, Usmanov SR, Kobzeva AN, Popov IV, Ilinsky VV, Kiktenko EO, Fedorov AK. Genome assembly using quantum and quantum-inspired annealing. Sci Rep 2021; 11:13183. [PMID: 34162895 PMCID: PMC8222255 DOI: 10.1038/s41598-021-88321-5] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 05/20/2020] [Accepted: 04/09/2021] [Indexed: 02/05/2023] Open
Abstract
Recent advances in DNA sequencing open prospects to make whole-genome analysis rapid and reliable, which is promising for various applications including personalized medicine. However, existing techniques for de novo genome assembly, which is used for the analysis of genomic rearrangements, chromosome phasing, and reconstructing genomes without a reference, require solving tasks of high computational complexity. Here we demonstrate a method for solving genome assembly tasks with the use of quantum and quantum-inspired optimization techniques. Within this method, we present experimental results on genome assembly using quantum annealers both for simulated data and the φX174 bacteriophage. Our results pave the way for a significant increase in the efficiency of solving bioinformatics problems with quantum computing technologies; in particular, quantum annealing might be an effective method. We expect that the new generation of quantum annealing devices will outperform existing techniques for de novo genome assembly. To the best of our knowledge, this is the first experimental study of de novo genome assembly problems both for real and synthetic data on quantum annealing devices and quantum-inspired techniques.
Affiliation(s)
- A S Boev
- Russian Quantum Center, Skolkovo, Moscow, 143025, Russia
| | | | - S R Usmanov
- Russian Quantum Center, Skolkovo, Moscow, 143025, Russia
| | - A N Kobzeva
- Russian Quantum Center, Skolkovo, Moscow, 143025, Russia
| | - I V Popov
- Genotek ltd., Moscow, 105120, Russia
| | | | - E O Kiktenko
- Russian Quantum Center, Skolkovo, Moscow, 143025, Russia
- Moscow Institute of Physics and Technology, Dolgoprudny, 141700, Russia
| | - A K Fedorov
- Russian Quantum Center, Skolkovo, Moscow, 143025, Russia.
- Moscow Institute of Physics and Technology, Dolgoprudny, 141700, Russia.
| |
|
33
|
Zhao P, Xu S, Huang Z, Deng P, Zhang Y. Identify specific gene pairs for subarachnoid hemorrhage based on wavelet analysis and genetic algorithm. PLoS One 2021; 16:e0253219. [PMID: 34138931 PMCID: PMC8211192 DOI: 10.1371/journal.pone.0253219] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/03/2021] [Accepted: 05/29/2021] [Indexed: 11/18/2022] Open
Abstract
Subarachnoid hemorrhage (SAH) is a fatal stroke caused by bleeding in the brain. SAH can be caused by a ruptured aneurysm or head injury. One-third of patients will survive and recover. One-third will survive with disability; one-third will die. The focus of treatment is to stop bleeding, restore normal blood flow, and prevent vasospasm. Treatment for SAH varies, depending on the bleeding’s underlying cause and the extent of damage to the brain. Treatment may include lifesaving measures, symptom relief, repair of the bleeding vessel, and complication prevention. However, the useful diagnostic biomarkers of SAH are still limited due to the instability of gene marker expression. To overcome this limitation, we developed a new protocol pairing genes and screened significant gene pairs based on the feature selection algorithm. A classifier was constructed with the selected gene pairs and achieved a high performance.
Affiliation(s)
- Pengcheng Zhao
- Department of Neurosurgery, Anhui No. 2 Provincial People’s Hospital, Hefei, Anhui, China
| | - Shaonian Xu
- Department of Neurosurgery, Anhui No. 2 Provincial People’s Hospital, Hefei, Anhui, China
| | - Zhenshan Huang
- Department of Neurosurgery, Anhui No. 2 Provincial People’s Hospital, Hefei, Anhui, China
| | - Pengcheng Deng
- Department of Neurosurgery, Anhui No. 2 Provincial People’s Hospital, Hefei, Anhui, China
| | - Yongming Zhang
- Department of Neurosurgery, Anhui No. 2 Provincial People’s Hospital, Hefei, Anhui, China
| |
|
34
|
Wang Y, Xue H, Pourcel C, Du Y, Gautheret D. 2-kupl: mapping-free variant detection from DNA-seq data of matched samples. BMC Bioinformatics 2021; 22:304. [PMID: 34090332 PMCID: PMC8180056 DOI: 10.1186/s12859-021-04185-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/03/2021] [Accepted: 05/11/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The detection of genome variants, including point mutations, indels and structural variants, is a fundamental and challenging computational problem. We address here the problem of variant detection between two deep-sequencing (DNA-seq) samples, such as two human samples from an individual patient, or two samples from distinct bacterial strains. The preferred strategy in such a case is to align each sample to a common reference genome, collect all variants and compare these variants between samples. Such mapping-based protocols have several limitations. DNA sequences with large indels, aggregated mutations and structural variants are hard to map to the reference. Furthermore, DNA sequences cannot be mapped reliably to genomic low complexity regions and repeats. RESULTS We introduce 2-kupl, a k-mer based, mapping-free protocol to detect variants between two DNA-seq samples. On simulated and actual data, 2-kupl achieves higher accuracy than other mapping-free protocols. Applying 2-kupl to prostate cancer whole exome sequencing data, we identify a number of candidate variants in hard-to-map regions and propose potential novel recurrent variants in this disease. CONCLUSIONS We developed a mapping-free protocol for variant calling between matched DNA-seq samples. Our protocol is suitable for variant detection in unmappable genome regions or in the absence of a reference genome.
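The core idea of a k-mer based, mapping-free comparison can be sketched in a few lines of Python: collect the k-mers of each sample and keep those that occur in one sample but not in the other. This is only an illustration of the general principle and not the 2-kupl protocol itself, which the abstract does not describe in detail; the function names `kmers` and `sample_specific_kmers` and the toy sequences are assumptions, and the sketch ignores k-mer counts, reverse complements, and sequencing errors.

```python
def kmers(seqs, k):
    """Return the set of all k-mers found in a collection of sequences."""
    return {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}


def sample_specific_kmers(case_reads, control_reads, k):
    """k-mers present in the case sample and absent from the control sample."""
    return kmers(case_reads, k) - kmers(control_reads, k)


# Hypothetical matched samples that differ by one substitution (A -> T).
control = ["ACGTACGT"]
case = ["ACGTTCGT"]
specific = sample_specific_kmers(case, control, k=4)
print(sorted(specific))
```

Every case-specific k-mer here spans the substituted base, so no alignment to a reference is needed to localize the difference.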
Affiliation(s)
- Yunfeng Wang
- Institute of Integrative Cell Biology (I2BC), Université Paris-Saclay, CNRS, CEA, 1 avenue de la Terrasse, 91190 Gif-sur-Yvette, France
- Annoroad Gene Technology Co., Ltd, Beijing, 100176 China
| | - Haoliang Xue
- Institute of Integrative Cell Biology (I2BC), Université Paris-Saclay, CNRS, CEA, 1 avenue de la Terrasse, 91190 Gif-sur-Yvette, France
| | - Christine Pourcel
- Institute of Integrative Cell Biology (I2BC), Université Paris-Saclay, CNRS, CEA, 1 avenue de la Terrasse, 91190 Gif-sur-Yvette, France
| | - Yang Du
- Annoroad Gene Technology Co., Ltd, Beijing, 100176 China
| | - Daniel Gautheret
- Institute of Integrative Cell Biology (I2BC), Université Paris-Saclay, CNRS, CEA, 1 avenue de la Terrasse, 91190 Gif-sur-Yvette, France
- IHU PRISM, Gustave Roussy, 114 rue Edouard Vaillant, 94800 Villejuif, France
| |
|
35
|
Gatter T, von Löhneysen S, Fallmann J, Drozdova P, Hartmann T, Stadler PF. LazyB: fast and cheap genome assembly. Algorithms Mol Biol 2021; 16:8. [PMID: 34074310 PMCID: PMC8168326 DOI: 10.1186/s13015-021-00186-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/31/2021] [Accepted: 05/06/2021] [Indexed: 12/27/2022] Open
Abstract
Background: Advances in genome sequencing over the last years have led to a fundamental paradigm shift in the field. With steadily decreasing sequencing costs, genome projects are no longer limited by the cost of raw sequencing data, but rather by computational problems associated with genome assembly. There is an urgent demand for more efficient and more accurate methods, in particular with regard to the highly complex and often very large genomes of animals and plants. Most recently, “hybrid” methods that integrate short and long read data have been devised to address this need. Results: LazyB is such a hybrid genome assembler. It has been designed specifically with an emphasis on utilizing low-coverage short and long reads. LazyB starts from a bipartite overlap graph between long reads and restrictively filtered short-read unitigs. This graph is translated into a long-read overlap graph G. Instead of the more conventional approach of removing tips, bubbles, and other local features, LazyB extracts, step by step, subgraphs whose global properties approach a disjoint union of paths. First, a consistently oriented subgraph is extracted, which in a second step is reduced to a directed acyclic graph. In the next step, properties of proper interval graphs are used to extract contigs as maximum weight paths. These paths are translated into genomic sequences only in the final step. A prototype implementation of LazyB, written entirely in Python, not only yields significantly more accurate assemblies of the yeast and fruit fly genomes compared to state-of-the-art pipelines but also requires much less computational effort. Conclusions: LazyB is a new low-cost genome assembler that copes well with large genomes and low coverage. It is based on a novel approach for reducing the overlap graph to a collection of paths, thus opening new avenues for future improvements. Availability: The LazyB prototype is available at https://github.com/TGatter/LazyB.
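One step named in this abstract, extracting a maximum weight path from a directed acyclic graph, can be illustrated with a short dynamic-programming sketch in Python. This is a generic textbook procedure and not LazyB's implementation, which additionally exploits properties of proper interval graphs. The function name `max_weight_path`, the edge-dictionary input format, and the toy graph are assumptions for this example, and edge weights are assumed to be positive.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def max_weight_path(edges):
    """Return (total_weight, path) of a heaviest path in a DAG.

    `edges` maps (u, v) pairs to positive weights.
    """
    # Record the predecessors of every node, as TopologicalSorter expects.
    preds = {}
    for u, v in edges:
        preds.setdefault(v, set()).add(u)
        preds.setdefault(u, set())

    # best[node] = (weight of heaviest path ending at node, previous node)
    best = {}
    for node in TopologicalSorter(preds).static_order():
        best[node] = (0, None)
        for p in preds[node]:
            candidate = best[p][0] + edges[(p, node)]
            if candidate > best[node][0]:
                best[node] = (candidate, p)

    # Backtrack from the node where the heaviest path ends.
    node = max(best, key=lambda n: best[n][0])
    total = best[node][0]
    path = []
    while node is not None:
        path.append(node)
        node = best[node][1]
    return total, path[::-1]


# Hypothetical overlap graph: nodes stand for reads, weights for overlap scores.
edges = {("a", "b"): 2, ("b", "c"): 3, ("a", "c"): 4, ("c", "d"): 1}
total, path = max_weight_path(edges)
print(total, path)
```

Processing nodes in topological order guarantees that every predecessor's best score is final before it is used, so the whole computation is linear in the number of nodes and edges.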
|
36
|
Wu P, Xu C, Chen H, Yang J, Zhang X, Zhou S. NOVOWrap: An automated solution for plastid genome assembly and structure standardization. Mol Ecol Resour 2021; 21:2177-2186. [PMID: 33934526 DOI: 10.1111/1755-0998.13410] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Received: 08/27/2020] [Revised: 04/22/2021] [Accepted: 04/26/2021] [Indexed: 11/28/2022]
Abstract
Plastid genomes play an important role in genomics and evolutionary biology. Next-generation sequencing has revolutionized plastid genomic data acquisition to the point that genome assembly has become a bottleneck for widespread utilization of plastid genome data. To solve this problem, we developed an open-source, cross-platform tool, NOVOWrap, which includes both command-line and graphical interfaces for automatically assembling plastid genomes on personal computers. With minimal inputs, settings, and user intervention, NOVOWrap can automatically assemble plastid genomes, validate results and standardize the structure using affordable computer resources. The performance of this software has been successfully benchmarked against the plastid genomes of 11 species belonging to lycopods, gymnosperms, and angiosperms. By liberating researchers from laborious and cumbersome computer manipulations and creating reliable and standardized genomic data, NOVOWrap is expected to accelerate plastid genome assembly, ease the process of data exchange, and contribute to downstream analysis.
Affiliation(s)
- Ping Wu
- State Key Laboratory of Systematic and Evolutionary Botany, Institute of Botany, Chinese Academy of Sciences, Beijing, China.,University of Chinese Academy of Sciences, Beijing, China
| | - Chao Xu
- State Key Laboratory of Systematic and Evolutionary Botany, Institute of Botany, Chinese Academy of Sciences, Beijing, China
| | - Hao Chen
- Shaanxi University of Science and Technology, Xi'an, China
| | - Jie Yang
- State Key Laboratory of Systematic and Evolutionary Botany, Institute of Botany, Chinese Academy of Sciences, Beijing, China.,University of Chinese Academy of Sciences, Beijing, China
| | - Xianchun Zhang
- State Key Laboratory of Systematic and Evolutionary Botany, Institute of Botany, Chinese Academy of Sciences, Beijing, China.,University of Chinese Academy of Sciences, Beijing, China
| | - Shiliang Zhou
- State Key Laboratory of Systematic and Evolutionary Botany, Institute of Botany, Chinese Academy of Sciences, Beijing, China.,University of Chinese Academy of Sciences, Beijing, China
| |
|
37
|
Lopes M, Louzada S, Gama-Carvalho M, Chaves R. Genomic Tackling of Human Satellite DNA: Breaking Barriers through Time. Int J Mol Sci 2021; 22:4707. [PMID: 33946766 PMCID: PMC8125562 DOI: 10.3390/ijms22094707] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 03/31/2021] [Revised: 04/24/2021] [Accepted: 04/27/2021] [Indexed: 12/12/2022]
Abstract
(Peri)centromeric repetitive sequences and, more specifically, satellite DNA (satDNA) sequences, constitute a major human genomic component. SatDNA sequences can vary on a large number of features, including nucleotide composition, complexity, and abundance. Several satDNA families have been identified and characterized in the human genome through time, albeit at different speeds. Human satDNA families present a high degree of sub-variability, leading to the definition of various subfamilies with different organization and clustered localization. Evolution of satDNA analysis has enabled the progressive characterization of satDNA features. Despite recent advances in the sequencing of centromeric arrays, comprehensive genomic studies to assess their variability are still required to provide accurate and proportional representation of satDNA (peri)centromeric/acrocentric short arm sequences. Approaches combining multiple techniques have been successfully applied and seem to be the path to follow for generating integrated knowledge in the promising field of human satDNA biology.
Affiliation(s)
- Mariana Lopes
- Laboratory of Cytogenomics and Animal Genomics (CAG), Department of Genetics and Biotechnology (DGB), University of Trás-os-Montes and Alto Douro (UTAD), 5000-801 Vila Real, Portugal; (M.L.); (S.L.)
- Biosystems and Integrative Sciences Institute (BioISI), Faculty of Sciences, University of Lisbon, 1749-016 Lisbon, Portugal;
| | - Sandra Louzada
- Laboratory of Cytogenomics and Animal Genomics (CAG), Department of Genetics and Biotechnology (DGB), University of Trás-os-Montes and Alto Douro (UTAD), 5000-801 Vila Real, Portugal; (M.L.); (S.L.)
- Biosystems and Integrative Sciences Institute (BioISI), Faculty of Sciences, University of Lisbon, 1749-016 Lisbon, Portugal;
| | - Margarida Gama-Carvalho
- Biosystems and Integrative Sciences Institute (BioISI), Faculty of Sciences, University of Lisbon, 1749-016 Lisbon, Portugal;
| | - Raquel Chaves
- Laboratory of Cytogenomics and Animal Genomics (CAG), Department of Genetics and Biotechnology (DGB), University of Trás-os-Montes and Alto Douro (UTAD), 5000-801 Vila Real, Portugal; (M.L.); (S.L.)
- Biosystems and Integrative Sciences Institute (BioISI), Faculty of Sciences, University of Lisbon, 1749-016 Lisbon, Portugal;
| |
|
38
|
Alipanahi B, Muggli MD, Jundi M, Noyes NR, Boucher C. Metagenome SNP calling via read-colored de Bruijn graphs. Bioinformatics 2021; 36:5275-5281. [PMID: 32049324 DOI: 10.1093/bioinformatics/btaa081] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 05/16/2018] [Revised: 01/08/2020] [Accepted: 02/03/2020] [Indexed: 11/13/2022]
Abstract
MOTIVATION Metagenomics refers to the study of complex samples containing the genetic content of multiple individual organisms and, thus, has been used to elucidate the microbiome and resistome of a complex sample. The microbiome refers to all microbial organisms in a sample, and the resistome refers to all of the antimicrobial resistance (AMR) genes in pathogenic and non-pathogenic bacteria. Single-nucleotide polymorphisms (SNPs) can be effectively used to 'fingerprint' specific organisms and genes within the microbiome and resistome and trace their movement across various samples. However, to effectively use these SNPs for this traceability, a scalable and accurate metagenomics SNP caller is needed. Moreover, such an SNP caller should not be reliant on reference genomes since 95% of microbial species are unculturable, making the determination of a reference genome extremely challenging. In this article, we address this need. RESULTS We present LueVari, a reference-free SNP caller based on the read-colored de Bruijn graph, an extension of the traditional de Bruijn graph that allows repeated regions longer than the k-mer length and shorter than the read length to be identified unambiguously. LueVari is able to identify SNPs in both AMR genes and chromosomal DNA from shotgun metagenomics data with reliable sensitivity (between 91% and 99%) and precision (between 71% and 99%), while the performance of competing methods varies widely. Furthermore, we show that LueVari constructs sequences containing the variation, which span up to 97.8% of genes in datasets, which can be helpful in detecting distinct AMR genes in large metagenomic datasets. AVAILABILITY AND IMPLEMENTATION Code and datasets are publicly available at https://github.com/baharpan/cosmo/tree/LueVari. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Bahar Alipanahi
- Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL 32611, USA
| | - Martin D Muggli
- Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL 32611, USA
| | - Musa Jundi
- Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL 32611, USA
| | - Noelle R Noyes
- Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL 32611, USA
| | - Christina Boucher
- Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL 32611, USA
| |
|
39
|
Hosseini ZZ, Rahimi SK, Forouzan E, Baraani A. RMI-DBG algorithm: A more agile iterative de Bruijn graph algorithm in short read genome assembly. J Bioinform Comput Biol 2021; 19:2150005. [PMID: 33866959 DOI: 10.1142/s0219720021500050] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
The de Bruijn graph (DBG) algorithm, one of the cornerstone algorithms in short read assembly, has been extended with the rapid advancement of Next Generation Sequencing (NGS) technologies and the low-cost production of millions of high-quality short reads. Erroneous reads, non-uniform coverage, and genomic repeats are three major problems that influence the performance of short read assemblers. To counter these problems, the iterative DBG algorithm applies multiple k-mers instead of a single k-mer, by iterating the DBG graph over a range of k-mer sizes from the minimum to the maximum. However, the iteration paradigm of iterative DBG deals with complex graphs from the beginning of the algorithm and therefore causes more potential errors and computational time for resolving various unreal branches. In this research, we propose the Reverse Modified Iterative DBG graph (named RMI-DBG) for short read assembly. RMI-DBG utilizes the DBG algorithm and String graph to achieve the advantages of both algorithms. We show that RMI-DBG performs faster with comparable results in comparison to iterative DBG. Additionally, the quality of the proposed algorithm in terms of continuity and accuracy is evaluated against some commonly used assemblers via several real datasets of the GAGE-B benchmark.
Affiliation(s)
| | | | - Esmaeil Forouzan
- National Institute for Genetic, Engineering & Biotechnology, (NIGEB), Tehran, Iran.,GeneMan Genomics Ltd, (www.ggenomics.ir), Shiraz, Iran
| | - Ahmad Baraani
- Department of Software Engineering, University of Isfahan, Iran
| |
|
40
|
Lapidus AL, Korobeynikov AI. Metagenomic Data Assembly - The Way of Decoding Unknown Microorganisms. Front Microbiol 2021; 12:613791. [PMID: 33833738 PMCID: PMC8021871 DOI: 10.3389/fmicb.2021.613791] [Citation(s) in RCA: 47] [Impact Index Per Article: 15.7] [Received: 10/03/2020] [Accepted: 03/03/2021] [Indexed: 01/08/2023]
Abstract
Metagenomics is a segment of conventional microbial genomics dedicated to the sequencing and analysis of combined genomic DNA of entire environmental samples. The most critical step of metagenomic data analysis is the reconstruction of individual genes and genomes of the microorganisms in the communities using metagenomic assemblers, computational programs that put together small fragments of sequenced DNA generated by sequencing instruments. Here, we describe the challenges of metagenomic assembly and a wide spectrum of applications in which metagenomic assemblies were used to better understand the ecology and evolution of microbial ecosystems, and present one of the most efficient microbial assemblers, SPAdes, which was upgraded to become applicable for metagenomics.
Affiliation(s)
- Alla L. Lapidus
- Center for Algorithmic Biotechnology, St. Petersburg State University, Saint Petersburg, Russia
| | | |
|
41
|
Hu T, Li J, Zhou H, Li C, Holmes EC, Shi W. Bioinformatics resources for SARS-CoV-2 discovery and surveillance. Brief Bioinform 2021; 22:631-641. [PMID: 33416890 PMCID: PMC7929396 DOI: 10.1093/bib/bbaa386] [Citation(s) in RCA: 28] [Impact Index Per Article: 9.3] [Received: 08/10/2020] [Revised: 11/10/2020] [Accepted: 11/27/2020] [Indexed: 12/22/2022]
Abstract
In early January 2020, the novel coronavirus (SARS-CoV-2) responsible for a pneumonia outbreak in Wuhan, China, was identified using next-generation sequencing (NGS) and readily available bioinformatics pipelines. In addition to virus discovery, these NGS technologies and bioinformatics resources are currently being employed for ongoing genomic surveillance of SARS-CoV-2 worldwide, tracking its spread, evolution and patterns of variation on a global scale. In this review, we summarize the bioinformatics resources used for the discovery and surveillance of SARS-CoV-2. We also discuss the advantages and disadvantages of these bioinformatics resources and highlight areas where additional technical developments are urgently needed. Solutions to these problems will be beneficial not only to the prevention and control of the current COVID-19 pandemic but also to infectious disease outbreaks of the future.
Affiliation(s)
- Tao Hu
- Shandong First Medical University, China
| | - Juan Li
- Shandong First Medical University, China
| | - Hong Zhou
- Shandong First Medical University, China
| | - Cixiu Li
- Shandong First Medical University, China
| | | | | |
|
42
|
Cortese IJ, Castrillo ML, Onetto AL, Bich GÁ, Zapata PD, Laczeski ME. De novo genome assembly of Bacillus altitudinis 19RS3 and Bacillus altitudinis T5S-T4, two plant growth-promoting bacteria isolated from Ilex paraguariensis St. Hil. (yerba mate). PLoS One 2021; 16:e0248274. [PMID: 33705487 PMCID: PMC7954119 DOI: 10.1371/journal.pone.0248274] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 09/14/2020] [Accepted: 02/23/2021] [Indexed: 11/18/2022]
Abstract
Plant growth-promoting bacteria (PGPB) are a heterogeneous group of bacteria that can exert beneficial effects on plant growth directly or indirectly by different mechanisms. PGPB-based inoculant formulation has been used to replace chemical fertilizers and pesticides. In our previous studies, two endophytic endospore-forming bacteria identified as Bacillus altitudinis were isolated from roots of Ilex paraguariensis St. Hil. seedlings and selected for their plant growth-promoting (PGP) properties shown in vitro and in vivo. The purposes of this work were to assemble the genomes of B. altitudinis 19RS3 and T5S-T4 using different assemblers available for Windows and Linux, and to select the best assembly for each strain. Both genomes were also automatically annotated to detect PGP genes and to compare sequences with other reported genomes. Library construction and draft genome sequencing were performed by Macrogen services. Raw reads were filtered using the Trimmomatic tool. Genomes were assembled using the SPAdes, ABySS, Velvet, and SOAPdenovo2 assemblers for Linux, and the Geneious and CLC Genomics Workbench assemblers for Windows. Assembly evaluation was done with the QUAST tool. The parameters evaluated were the number of contigs ≥ 500 bp and ≥ 1000 bp, the length of the longest contig, and the N50 value. For genome annotation, the PROKKA, RAST, and KAAS tools were used. The best assembly for both genomes was obtained using Velvet. The B. altitudinis 19RS3 genome was assembled into 15 contigs with an N50 value of 1,943,801 bp. The B. altitudinis T5S-T4 genome was assembled into 24 contigs with an N50 of 344,151 bp. Both genomes comprise several genes related to PGP mechanisms, such as those for nitrogen fixation, iron metabolism, phosphate metabolism, and auxin biosynthesis. The results obtained offer the basis for a better understanding of B. altitudinis 19RS3 and T5S-T4 and make them promising candidates for bioinoculant development.
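The evaluation parameters listed in the abstract above (contig counts above a length threshold, longest contig, N50) are simple to compute from a list of contig lengths. The sketch below uses the common definition of N50, the length of the contig at which the running total of contigs sorted from longest to shortest first reaches half of the total assembly length. The function names are hypothetical and this is not the QUAST implementation.

```python
def n50(lengths):
    """Return the N50 of a non-empty list of contig lengths (in bp)."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first contig where the cumulative length covers half.
        if 2 * running >= total:
            return length


def count_at_least(lengths, min_length):
    """Count contigs whose length is at least `min_length` bp."""
    return sum(1 for length in lengths if length >= min_length)


def summarize(lengths):
    """Collect the metrics named in the abstract for one assembly."""
    return {
        "contigs_ge_500": count_at_least(lengths, 500),
        "contigs_ge_1000": count_at_least(lengths, 1000),
        "longest": max(lengths),
        "n50": n50(lengths),
    }
```

Calling `summarize` on the contig lengths of each candidate assembly would give one comparable record per assembler, under the definitions assumed here.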
Affiliation(s)
- Iliana Julieta Cortese
- Laboratorio de Biotecnología Molecular, Instituto de Biotecnología Misiones “Dra. María Ebe Reca” (InBioMis), CONICET, Facultad de Ciencias Exactas, Químicas y Naturales/FCEQyN, Universidad Nacional de Misiones/UNaM, Posadas, Misiones, Argentina
| | - María Lorena Castrillo
- Laboratorio de Biotecnología Molecular, Instituto de Biotecnología Misiones “Dra. María Ebe Reca” (InBioMis), CONICET, Facultad de Ciencias Exactas, Químicas y Naturales/FCEQyN, Universidad Nacional de Misiones/UNaM, Posadas, Misiones, Argentina
| | - Andrea Liliana Onetto
- Laboratorio de Biotecnología Molecular, Instituto de Biotecnología Misiones “Dra. María Ebe Reca” (InBioMis), CONICET, Facultad de Ciencias Exactas, Químicas y Naturales/FCEQyN, Universidad Nacional de Misiones/UNaM, Posadas, Misiones, Argentina
| | - Gustavo Ángel Bich
- Laboratorio de Biotecnología Molecular, Instituto de Biotecnología Misiones “Dra. María Ebe Reca” (InBioMis), CONICET, Facultad de Ciencias Exactas, Químicas y Naturales/FCEQyN, Universidad Nacional de Misiones/UNaM, Posadas, Misiones, Argentina
| | - Pedro Darío Zapata
- Laboratorio de Biotecnología Molecular, Instituto de Biotecnología Misiones “Dra. María Ebe Reca” (InBioMis), CONICET, Facultad de Ciencias Exactas, Químicas y Naturales/FCEQyN, Universidad Nacional de Misiones/UNaM, Posadas, Misiones, Argentina
| | - Margarita Ester Laczeski
- Laboratorio de Biotecnología Molecular, Instituto de Biotecnología Misiones “Dra. María Ebe Reca” (InBioMis), CONICET, Facultad de Ciencias Exactas, Químicas y Naturales/FCEQyN, Universidad Nacional de Misiones/UNaM, Posadas, Misiones, Argentina
- Cátedra de Bacteriología, Dpto. de Microbiología, Facultad de Ciencias Exactas, Químicas y Naturales/FCEQyN, Universidad Nacional de Misiones/UNaM, Posadas, Misiones, Argentina
| |
|
43
|
Collins JH, Keating KW, Jones TR, Balaji S, Marsan CB, Çomo M, Newlon ZJ, Mitchell T, Bartley B, Adler A, Roehner N, Young EM. Engineered yeast genomes accurately assembled from pure and mixed samples. Nat Commun 2021; 12:1485. [PMID: 33674578 PMCID: PMC7935868 DOI: 10.1038/s41467-021-21656-9] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 02/21/2020] [Accepted: 02/04/2021] [Indexed: 01/31/2023]
Abstract
Yeast whole genome sequencing (WGS) lacks end-to-end workflows that identify genetic engineering. Here we present Prymetime, a tool that assembles yeast plasmids and chromosomes and annotates genetic engineering sequences. It is a hybrid workflow: it uses short and long reads as inputs to perform separate linear and circular assembly steps. This structure is necessary to accurately resolve genetic engineering sequences in plasmids and the genome. We show this by assembling diverse engineered yeasts, in some cases revealing unintended deletions and integrations. Furthermore, the resulting whole genomes are high quality, although the underlying assembly software does not consistently resolve highly repetitive genome features. Finally, we assemble plasmids and genome integrations from metagenomic sequencing, even with 1 engineered cell in 1000. This work is a blueprint for building WGS workflows and establishes WGS-based identification of yeast genetic engineering.
Affiliation(s)
- Joseph H Collins
- Department of Chemical Engineering, Worcester Polytechnic Institute, Worcester, MA, USA
| | - Kevin W Keating
- Department of Chemical Engineering, Worcester Polytechnic Institute, Worcester, MA, USA
| | - Trent R Jones
- Department of Chemical Engineering, Worcester Polytechnic Institute, Worcester, MA, USA
| | - Shravani Balaji
- Department of Chemical Engineering, Worcester Polytechnic Institute, Worcester, MA, USA
| | - Celeste B Marsan
- Department of Chemical Engineering, Worcester Polytechnic Institute, Worcester, MA, USA
| | - Marina Çomo
- Department of Chemical Engineering, Worcester Polytechnic Institute, Worcester, MA, USA
| | - Zachary J Newlon
- Department of Chemical Engineering, Worcester Polytechnic Institute, Worcester, MA, USA
| | - Tom Mitchell
- Synthetic Biology, Raytheon BBN Technologies, Cambridge, MA, USA
| | - Bryan Bartley
- Synthetic Biology, Raytheon BBN Technologies, Cambridge, MA, USA
| | - Aaron Adler
- Synthetic Biology, Raytheon BBN Technologies, Cambridge, MA, USA
| | - Nicholas Roehner
- Synthetic Biology, Raytheon BBN Technologies, Cambridge, MA, USA
| | - Eric M Young
- Department of Chemical Engineering, Worcester Polytechnic Institute, Worcester, MA, USA.
| |
|
44
|
Oliva M, Milicchio F, King K, Benson G, Boucher C, Prosperi M. Portable nanopore analytics: are we there yet? Bioinformatics 2021; 36:4399-4405. [PMID: 32277811 DOI: 10.1093/bioinformatics/btaa237] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 10/11/2019] [Revised: 02/07/2020] [Accepted: 04/06/2020] [Indexed: 01/23/2023]
Abstract
MOTIVATION Oxford Nanopore technologies (ONT) add miniaturization and real time to high-throughput sequencing. All available software for ONT data analytics run on cloud/clusters or personal computers. Instead, a linchpin to true portability is software that works on mobile devices of internet connections. Smartphones' and tablets' chipset/memory/operating systems differ from desktop computers, but software can be recompiled. We sought to understand how portable current ONT analysis methods are. RESULTS Several tools, from base-calling to genome assembly, were ported and benchmarked on an Android smartphone. Out of 23 programs, 11 succeeded. Recompilation failures included lack of standard headers and unsupported instruction sets. Only DSK, BCALM2 and Kraken were able to process files up to 16 GB, with linearly scaling CPU-times. However, peak CPU temperatures were high. In conclusion, the portability scenario is not favorable. Given the fast market growth, attention of developers to ARM chipsets and Android/iOS is warranted, as well as initiatives to implement mobile-specific libraries. AVAILABILITY AND IMPLEMENTATION The source code is freely available at: https://github.com/marco-oliva/portable-nanopore-analytics.
Affiliation(s)
- Marco Oliva
- Department of Engineering, Roma Tre University, Rome, Italy.,Department of Computer and Information Science and Engineering
| | | | - Kaden King
- Department of Computer and Information Science and Engineering
| | - Grace Benson
- Department of Computer and Information Science and Engineering
| | | | - Mattia Prosperi
- Department of Epidemiology, University of Florida, Gainesville, FL 32610, USA
| |
|
45
|
Tariq MU, Haseeb M, Aledhari M, Razzak R, Parizi RM, Saeed F. Methods for Proteogenomics Data Analysis, Challenges, and Scalability Bottlenecks: A Survey. IEEE ACCESS : PRACTICAL INNOVATIONS, OPEN SOLUTIONS 2020; 9:5497-5516. [PMID: 33537181 PMCID: PMC7853650 DOI: 10.1109/access.2020.3047588] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Indexed: 05/17/2023]
Abstract
Big Data Proteogenomics lies at the intersection of high-throughput Mass Spectrometry (MS) based proteomics and Next Generation Sequencing based genomics. The combined and integrated analysis of these two high-throughput technologies can help discover novel proteins using genomic and transcriptomic data. Due to the biological significance of integrated analysis, the recent past has seen an influx of proteogenomic tools that perform various tasks, including mapping proteins to the genomic data, searching experimental MS spectra against a six-frame translation genome database, and automating the process of annotating genome sequences. To date, most such tools have not focused on the scalability issues that are inherent in proteogenomic data analysis, where the size of the database is much larger than a typical protein database. These state-of-the-art tools can take more than half a month to process a small-scale dataset of one million spectra against a genome of 3 GB. In this article, we provide an up-to-date review of tools that can analyze proteogenomic datasets, providing a critical analysis of the techniques' relative merits and potential pitfalls. We also point out potential bottlenecks and recommendations that can be incorporated in the future design of these workflows to ensure scalability with the increasing size of proteogenomic data. Lastly, we make a case for how high-performance computing (HPC) solutions may be the best bet to ensure the scalability of future big data proteogenomic data analysis.
Affiliation(s)
- Muhammad Usman Tariq
- School of Computing and Information Sciences, Florida International University, Miami, FL 33199, USA
| | - Muhammad Haseeb
- School of Computing and Information Sciences, Florida International University, Miami, FL 33199, USA
| | - Mohammed Aledhari
- College of Computing and Software Engineering, Kennesaw State University, Marietta, GA 30060, USA
| | - Rehma Razzak
- College of Computing and Software Engineering, Kennesaw State University, Marietta, GA 30060, USA
| | - Reza M Parizi
- College of Computing and Software Engineering, Kennesaw State University, Marietta, GA 30060, USA
| | - Fahad Saeed
- School of Computing and Information Sciences, Florida International University, Miami, FL 33199, USA
| |
|
46
|
Yang C, Zheng Y, Tan S, Meng G, Rao W, Yang C, Bourne DG, O'Brien PA, Xu J, Liao S, Chen A, Chen X, Jia X, Zhang AB, Liu S. Efficient COI barcoding using high throughput single-end 400 bp sequencing. BMC Genomics 2020; 21:862. [PMID: 33276723 PMCID: PMC7716423 DOI: 10.1186/s12864-020-07255-w] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 12/23/2019] [Accepted: 11/18/2020] [Indexed: 12/18/2022]
Abstract
BACKGROUND Over the last decade, the rapid development of high-throughput sequencing platforms has accelerated species description and assisted morphological classification through DNA barcoding. However, the current high-throughput DNA barcoding methods cannot obtain full-length barcode sequences due to read length limitations (e.g. a maximum read length of 300 bp for the Illumina MiSeq system), or are hindered by a relatively high cost or low sequencing output (e.g. a maximum number of eight million reads per cell for the PacBio SEQUEL II system). RESULTS Pooled cytochrome c oxidase subunit I (COI) barcodes from individual specimens were sequenced on the MGISEQ-2000 platform using the single-end 400 bp (SE400) module. We present a bioinformatic pipeline, HIFI-SE, that takes reads generated from the 5' and 3' ends of the COI barcode region and assembles them into full-length barcodes. HIFI-SE is written in Python and includes four function modules: filter, assign, assembly and taxonomy. We applied HIFI-SE to a set of 845 samples (30 marine invertebrates, 815 insects) and delivered a total of 747 fully assembled COI barcodes as well as 70 Wolbachia and fungi symbionts. Compared to their corresponding Sanger sequences (72 sequences available), nearly all samples (71/72) were correctly and accurately assembled, including 46 samples that had a similarity score of 100% and 25 of ca. 99%. CONCLUSIONS The HIFI-SE pipeline represents an efficient way to produce standard full-length barcodes, while the reasonable cost and high sensitivity of our method can contribute considerably more DNA barcodes under the same budget. Our method thereby advances DNA-based species identification from diverse ecosystems and increases the number of relevant applications.
Affiliation(s)
| | - Yuxuan Zheng
- College of Life Sciences, Capital Normal University, Beijing, 100048, China
| | | | | | - Wei Rao
- BGI-Shenzhen, Shenzhen, 518083, China
| | - Caiqing Yang
- College of Life Sciences, Capital Normal University, Beijing, 100048, China
| | - David G Bourne
- College of Science and Engineering, James Cook University, Townsville, QLD, Australia
- Australian Institute of Marine Science, Townsville, QLD, Australia
- AIMS@JCU, Townsville, QLD, Australia
| | - Paul A O'Brien
- College of Science and Engineering, James Cook University, Townsville, QLD, Australia
- Australian Institute of Marine Science, Townsville, QLD, Australia
- AIMS@JCU, Townsville, QLD, Australia
| | | | - Sha Liao
- BGI-Shenzhen, Shenzhen, 518083, China
| | - Ao Chen
- BGI-Shenzhen, Shenzhen, 518083, China
| | | | - Xinrui Jia
- College of Life Sciences, Capital Normal University, Beijing, 100048, China
| | - Ai-Bing Zhang
- College of Life Sciences, Capital Normal University, Beijing, 100048, China.
| | - Shanlin Liu
- BGI-Shenzhen, Shenzhen, 518083, China.
- Beijing Advanced Innovation Center for Food Nutrition and Human Health, College of Plant Protection, China Agricultural University, Beijing, 100193, China.
| |
|
47
|
Zhang M, Li Z, Li J, Huang T, Peng G, Tang W, Yi G, Zhang L, Song Y, Liu T, Hu X, Ren L, Liu H, Butler JE, Han H, Zhao Y. Revisiting the Pig IGHC Gene Locus in Different Breeds Uncovers Nine Distinct IGHG Genes. THE JOURNAL OF IMMUNOLOGY 2020; 205:2137-2145. [PMID: 32929042 DOI: 10.4049/jimmunol.1901483] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 12/16/2019] [Accepted: 08/13/2020] [Indexed: 11/19/2022]
Abstract
IgG subclass diversification is common in placental mammals. It has been well documented in humans and mice that different IgG subclasses, with diversified functions, synergistically regulate humoral immunity. However, our knowledge on the genomic and functional diversification of IgG subclasses in the pig, a mammalian species with high agricultural and biomedical importance, is incomplete. Using bacterial artificial chromosome sequencing and newly assembled genomes generated by the PacBio sequencing approach, we characterized and mapped the IgH C region gene locus in three indigenous Chinese breeds (Erhualian, Xiang, and Luchuan) and compared them to that of Duroc. Our data revealed that IGHG genes in Chinese pigs differ from the Duroc, whereas the IGHM, IGHD, IGHA, and IGHE genes were all single copy and highly conserved in the pig breeds examined. Most striking were differences in numbers of IGHG genes: there are seven genes in Erhualian pigs, six in the Duroc, but only five in Xiang pigs. Phylogenetic analysis suggested that all reported porcine IGHG genes could be classified into nine subclasses: IGHG1, IGHG2a, IGHG2b, IGHG2c, IGHG3, IGHG4, IGHG5a, IGHG5b, and IGHG5c. Using sequence information, we developed a mouse mAb specific for IgG3. This study offers a starting point to investigate the structure-function relationship of IgG subclasses in pigs.
Affiliation(s)
- Ming Zhang
- State Key Laboratory of Agrobiotechnology, College of Biological Sciences, National Engineering Laboratory for Animal Breeding, China Agricultural University, Beijing 100193, People's Republic of China
| | - Zhenrong Li
- State Key Laboratory of Agrobiotechnology, College of Biological Sciences, National Engineering Laboratory for Animal Breeding, China Agricultural University, Beijing 100193, People's Republic of China
| | - Jingying Li
- State Key Laboratory of Agrobiotechnology, College of Biological Sciences, National Engineering Laboratory for Animal Breeding, China Agricultural University, Beijing 100193, People's Republic of China
| | - Tian Huang
- School of Life Sciences, Henan University, Kaifeng 475004, People's Republic of China
| | - Gaochuang Peng
- State Key Laboratory of Agrobiotechnology, College of Biological Sciences, National Engineering Laboratory for Animal Breeding, China Agricultural University, Beijing 100193, People's Republic of China
| | - Wenda Tang
- State Key Laboratory of Agrobiotechnology, College of Biological Sciences, National Engineering Laboratory for Animal Breeding, China Agricultural University, Beijing 100193, People's Republic of China
| | - Guoqiang Yi
- Research Centre for Animal Genome, Agricultural Genome Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, People's Republic of China
| | - Lifan Zhang
- College of Animal Science and Technology, Nanjing Agricultural University, Nanjing 210095, People's Republic of China; and
| | - Yu Song
- State Key Laboratory of Agrobiotechnology, College of Biological Sciences, National Engineering Laboratory for Animal Breeding, China Agricultural University, Beijing 100193, People's Republic of China
| | - Tianran Liu
- State Key Laboratory of Agrobiotechnology, College of Biological Sciences, National Engineering Laboratory for Animal Breeding, China Agricultural University, Beijing 100193, People's Republic of China
| | - Xiaoxiang Hu
- State Key Laboratory of Agrobiotechnology, College of Biological Sciences, National Engineering Laboratory for Animal Breeding, China Agricultural University, Beijing 100193, People's Republic of China
| | - Liming Ren
- State Key Laboratory of Agrobiotechnology, College of Biological Sciences, National Engineering Laboratory for Animal Breeding, China Agricultural University, Beijing 100193, People's Republic of China
| | - Honglin Liu
- College of Animal Science and Technology, Nanjing Agricultural University, Nanjing 210095, People's Republic of China
| | - John E Butler
- Department of Microbiology, University of Iowa Carver College of Medicine, Iowa City, IA 52242
| | - Haitang Han
- State Key Laboratory of Agrobiotechnology, College of Biological Sciences, National Engineering Laboratory for Animal Breeding, China Agricultural University, Beijing 100193, People's Republic of China
| | - Yaofeng Zhao
- State Key Laboratory of Agrobiotechnology, College of Biological Sciences, National Engineering Laboratory for Animal Breeding, China Agricultural University, Beijing 100193, People's Republic of China
|
48
|
Holley G, Melsted P. Bifrost: highly parallel construction and indexing of colored and compacted de Bruijn graphs. Genome Biol 2020; 21:249. [PMID: 32943081 PMCID: PMC7499882 DOI: 10.1186/s13059-020-02135-8] [Citation(s) in RCA: 61] [Impact Index Per Article: 15.3] [Received: 08/13/2019] [Accepted: 08/06/2020] [Indexed: 02/07/2023]
Abstract
Memory consumption of de Bruijn graphs is often prohibitive. Most de Bruijn graph-based assemblers reduce the complexity by compacting paths into single vertices, but this is challenging as it requires the uncompacted de Bruijn graph to be available in memory. We present a parallel and memory-efficient algorithm enabling the direct construction of the compacted de Bruijn graph without producing the intermediate uncompacted graph. Bifrost features a broad range of functions, such as indexing, editing, and querying the graph, and includes a graph coloring method that maps each k-mer of the graph to the genomes it occurs in. Availability: https://github.com/pmelsted/bifrost.
Affiliation(s)
- Guillaume Holley
- Faculty of Industrial Engineering, Mechanical Engineering and Computer Science, University of Iceland, Reykjavík, Iceland.
| | - Páll Melsted
- Faculty of Industrial Engineering, Mechanical Engineering and Computer Science, University of Iceland, Reykjavík, Iceland
|
49
|
Sharpe RM, Williamson-Benavides B, Edwards GE, Dhingra A. Methods of analysis of chloroplast genomes of C3, Kranz type C4 and Single Cell C4 photosynthetic members of Chenopodiaceae. PLANT METHODS 2020; 16:119. [PMID: 32874195 PMCID: PMC7457496 DOI: 10.1186/s13007-020-00662-w] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 05/07/2020] [Accepted: 08/20/2020] [Indexed: 06/11/2023]
Abstract
BACKGROUND Chloroplast genome information is critical to understanding forms of photosynthesis in the plant kingdom. During the evolutionary process, plants have developed different photosynthetic strategies that are accompanied by complementary biochemical and anatomical features. Members of family Chenopodiaceae have species with C3 photosynthesis, and variations of C4 photosynthesis in which photorespiration is reduced by concentrating CO2 around Rubisco through dual coordinated functioning of dimorphic chloroplasts. Among dicots, the family has the largest number of C4 species, and greatest structural and biochemical diversity in forms of C4 including the canonical dual-cell Kranz anatomy, and the recently identified single cell C4 with the presence of dimorphic chloroplasts separated by a vacuole. This is the first comparative analysis of chloroplast genomes in species representative of photosynthetic types in the family. RESULTS Methodology with high-throughput sequencing complemented with Sanger sequencing of selected loci provided high-quality and complete chloroplast genomes of seven species in the family and one species in the closely related Amaranthaceae family, representing C3, Kranz type C4 and single cell C4 (SSC4) photosynthesis. Six of the eight chloroplast genomes are new, while two are improved versions of previously published genomes. The depth of coverage obtained using high-throughput sequencing complemented with targeted resequencing of certain loci enabled superior resolution of the border junctions, directionality and repeat region sequences. Comparison of the chloroplast genomes with previously sequenced plastid genomes revealed similar genome organization, gene order and content with a few revisions. High-quality complete chloroplast genome sequences resulted in correcting the orientation of the LSC region of the published Bienertia sinuspersici chloroplast genome, identification of stop codons in the rpl23 gene in B. sinuspersici and B. cycloptera, and identification of an instance of IR expansion in the Haloxylon ammodendron inverted repeat sequence. The rare observation of a mitochondria-to-chloroplast inter-organellar gene transfer event was identified in family Chenopodiaceae. CONCLUSIONS This study reports complete chloroplast genomes from seven Chenopodiaceae and one Amaranthaceae species. The depth of coverage obtained using high-throughput sequencing complemented with targeted resequencing of certain loci enabled superior resolution of the border junctions, directionality, and repeat region sequences. Therefore, the use of high-throughput and Sanger sequencing, in a hybrid method, is reaffirmed to be rapid, efficient, and reliable for chloroplast genome sequencing.
Affiliation(s)
- Richard M. Sharpe
- Department of Horticulture, Washington State University, Pullman, WA 99164 USA
| | - Bruce Williamson-Benavides
- Department of Horticulture, Washington State University, Pullman, WA 99164 USA
- Molecular Plants Sciences, Washington State University, Pullman, WA 99164 USA
| | - Gerald E. Edwards
- Molecular Plants Sciences, Washington State University, Pullman, WA 99164 USA
- School of Biological Sciences, Washington State University, Pullman, WA 99164 USA
| | - Amit Dhingra
- Department of Horticulture, Washington State University, Pullman, WA 99164 USA
- Molecular Plants Sciences, Washington State University, Pullman, WA 99164 USA
|
50
|
Yuan Y, Chung CYL, Chan TF. Advances in optical mapping for genomic research. Comput Struct Biotechnol J 2020; 18:2051-2062. [PMID: 32802277 PMCID: PMC7419273 DOI: 10.1016/j.csbj.2020.07.018] [Citation(s) in RCA: 57] [Impact Index Per Article: 14.3] [Received: 05/10/2020] [Revised: 07/08/2020] [Accepted: 07/24/2020] [Indexed: 12/28/2022]
Abstract
Recent advances in optical mapping have allowed the construction of improved genome assemblies with greater contiguity. Optical mapping also enables genome comparison and identification of large-scale structural variations. Association of these large-scale genomic features with biological functions is an important goal in plant and animal breeding and in medical research. Optical mapping has also been used in microbiology and still plays an important role in strain typing and epidemiological studies. Here, we review the development of optical mapping in recent decades to illustrate its importance in genomic research. We detail its applications and algorithms to show its specific advantages. Finally, we discuss the challenges required to facilitate the optimization of optical mapping and improve its future development and application.
Key Words
- 3D, three-dimensional
- DBG, de Bruijn graph
- DLS, direct label and stain
- DNA, deoxyribonucleic acid
- Genome assembly
- Hi-C, high-throughput chromosome conformation capture
- Mb, million base pair
- Next generation sequencing
- OLC, overlap-layout-consensus
- Optical mapping
- PCR, polymerase chain reaction
- PacBio, Pacific Biosciences
- SRS, short-read sequencing
- SV, structural variation
- Structural variation
- bp, base pair
- kb, kilobase pair
Affiliation(s)
- Yuxuan Yuan
- School of Life Sciences, The Chinese University of Hong Kong, Hong Kong SAR, China
- State Key Laboratory for Agrobiotechnology, The Chinese University of Hong Kong, Hong Kong SAR, China
- AoE Centre for Genomic Studies on Plant-Environment Interaction for Sustainable Agriculture and Food Security, The Chinese University of Hong Kong, Hong Kong SAR, China
| | - Claire Yik-Lok Chung
- School of Life Sciences, The Chinese University of Hong Kong, Hong Kong SAR, China
- State Key Laboratory for Agrobiotechnology, The Chinese University of Hong Kong, Hong Kong SAR, China
| | - Ting-Fung Chan
- School of Life Sciences, The Chinese University of Hong Kong, Hong Kong SAR, China
- State Key Laboratory for Agrobiotechnology, The Chinese University of Hong Kong, Hong Kong SAR, China
- AoE Centre for Genomic Studies on Plant-Environment Interaction for Sustainable Agriculture and Food Security, The Chinese University of Hong Kong, Hong Kong SAR, China
|