1
Sadurski J, Polak-Berecka M, Staniszewski A, Waśko A. Step-by-Step Metagenomics for Food Microbiome Analysis: A Detailed Review. Foods 2024; 13:2216. [PMID: 39063300 PMCID: PMC11276190 DOI: 10.3390/foods13142216] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/11/2024] [Revised: 07/11/2024] [Accepted: 07/12/2024] [Indexed: 07/28/2024] Open
Abstract
This review article offers a comprehensive overview of the current understanding of using metagenomic tools in food microbiome research. It covers the scientific foundation and practical application of genetic analysis techniques for microbial material from food, including bioinformatic analysis and data interpretation. The method discussed in the article for analyzing microorganisms in food without traditional culture methods is known as food metagenomics. This approach, along with other omics technologies such as nutrigenomics, proteomics, metabolomics, and transcriptomics, collectively forms the field of foodomics. Food metagenomics allows swift and thorough examination of bacteria and potential metabolic pathways by utilizing foodomic databases. Despite its established scientific basis and available bioinformatics resources, the research approach of food metagenomics outlined in the article is not yet widely implemented in industry. The authors believe that the integration of next-generation sequencing (NGS) with rapidly advancing digital technologies such as artificial intelligence (AI), the Internet of Things (IoT), and big data will facilitate the widespread adoption of this research strategy in microbial analysis for the food industry. This adoption is expected to enhance food safety and product quality in the near future.
Affiliation(s)
- Jan Sadurski
- Department of Biotechnology, Microbiology and Human Nutrition, Faculty of Food Science and Biotechnology, University of Life Sciences in Lublin, 20-704 Lublin, Poland; (M.P.-B.); (A.S.); (A.W.)
2
Kim N, Ma J, Kim W, Kim J, Belenky P, Lee I. Genome-resolved metagenomics: a game changer for microbiome medicine. Exp Mol Med 2024:10.1038/s12276-024-01262-7. [PMID: 38945961 DOI: 10.1038/s12276-024-01262-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/13/2023] [Revised: 03/06/2024] [Accepted: 03/25/2024] [Indexed: 07/02/2024] Open
Abstract
Recent substantial evidence implicating commensal bacteria in human diseases has given rise to a new domain in biomedical research: microbiome medicine. This emerging field aims to understand and leverage the human microbiota and derivative molecules for disease prevention and treatment. Despite the complex and hierarchical organization of this ecosystem, most research over the years has relied on 16S amplicon sequencing, a legacy of bacterial phylogeny and taxonomy. Although advanced sequencing technologies have enabled cost-effective analysis of entire microbiota, translating the relatively short nucleotide information into the functional and taxonomic organization of the microbiome has posed challenges until recently. In the last decade, genome-resolved metagenomics, which aims to reconstruct microbial genomes directly from whole-metagenome sequencing data, has made significant strides and continues to unveil the mysteries of various human-associated microbial communities. There has been a rapid increase in the volume of whole metagenome sequencing data and in the compilation of novel metagenome-assembled genomes and protein sequences in public depositories. This review provides an overview of the capabilities and methods of genome-resolved metagenomics for studying the human microbiome, with a focus on investigating the prokaryotic microbiota of the human gut. Just as decoding the human genome and its variations marked the beginning of the genomic medicine era, unraveling the genomes of commensal microbes and their sequence variations is ushering us into the era of microbiome medicine. Genome-resolved metagenomics stands as a pivotal tool in this transition and can accelerate our journey toward achieving these scientific and medical milestones.
Affiliation(s)
- Nayeon Kim
- Department of Biotechnology, College of Life Science and Biotechnology, Yonsei University, Seoul, 03722, Republic of Korea
- Junyeong Ma
- Department of Biotechnology, College of Life Science and Biotechnology, Yonsei University, Seoul, 03722, Republic of Korea
- Wonjong Kim
- Department of Biotechnology, College of Life Science and Biotechnology, Yonsei University, Seoul, 03722, Republic of Korea
- Jungyeon Kim
- Department of Biotechnology, College of Life Science and Biotechnology, Yonsei University, Seoul, 03722, Republic of Korea
- Peter Belenky
- Department of Molecular Microbiology and Immunology, Brown University, Providence, RI, 02912, USA
- Insuk Lee
- Department of Biotechnology, College of Life Science and Biotechnology, Yonsei University, Seoul, 03722, Republic of Korea
- POSTECH Biotech Center, Pohang University of Science and Technology (POSTECH), Pohang, 37673, Republic of Korea
3
Manoil D, Parga A, Bostanci N, Belibasakis GN. Microbial diagnostics in periodontal diseases. Periodontol 2000 2024. [PMID: 38797888 DOI: 10.1111/prd.12571] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/06/2024] [Revised: 03/27/2024] [Accepted: 04/15/2024] [Indexed: 05/29/2024]
Abstract
Microbial analytical methods have been instrumental in elucidating the complex microbial etiology of periodontal diseases, by shaping our understanding of subgingival community dynamics. Certain pathobionts can orchestrate the establishment of dysbiotic communities that can subvert the host immune system, triggering inflammation and tissue destruction. Yet, diagnosis and management of periodontal conditions still rely on clinical and radiographic examinations, overlooking the well-established microbial etiology. This review summarizes the chronological emergence of periodontal etiological models and the co-evolution with technological advances in microbial detection. We additionally review the microbial analytical approaches currently accessible to clinicians, highlighting their value in broadening the periodontal assessment. The epidemiological importance of obtaining culture-based antimicrobial susceptibility profiles of periodontal taxa for antibiotic resistance surveillance is also underscored, together with clinically relevant analytical approaches to guide antibiotherapy choices, when necessary. Furthermore, the importance of 16S-based community and shotgun metagenomic profiling is discussed in outlining dysbiotic microbial signatures. Because dysbiosis precedes periodontal damage, biomarker identification offers early diagnostic possibilities to forestall disease relapses during maintenance. Altogether, this review highlights the underutilized potential of clinical microbiology in periodontology, spotlighting the clinical areas most conducive to its diagnostic implementation for enhancing prevention, treatment predictability, and addressing global antibiotic resistance.
Affiliation(s)
- Daniel Manoil
- Division of Cariology and Endodontics, University Clinics of Dental Medicine, Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Division of Oral Health and Periodontology, Department of Dental Medicine, Karolinska Institutet, Huddinge, Stockholm, Sweden
- Ana Parga
- Division of Cariology and Endodontics, University Clinics of Dental Medicine, Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Department of Microbiology and Parasitology, CIBUS-Faculty of Biology, Universidade de Santiago de Compostela, Santiago de Compostela, Spain
- Nagihan Bostanci
- Division of Oral Health and Periodontology, Department of Dental Medicine, Karolinska Institutet, Huddinge, Stockholm, Sweden
- Georgios N Belibasakis
- Division of Oral Health and Periodontology, Department of Dental Medicine, Karolinska Institutet, Huddinge, Stockholm, Sweden
4
Zhou Y, Wang Y, Prangishvili D, Krupovic M. Exploring the Archaeal Virosphere by Metagenomics. Methods Mol Biol 2024; 2732:1-22. [PMID: 38060114 DOI: 10.1007/978-1-0716-3515-5_1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/08/2023]
Abstract
During the past decade, environmental research has demonstrated that archaea are abundant and widespread in nature and play important ecological roles at a global scale. Currently, however, the majority of archaeal lineages cannot be cultivated under laboratory conditions and are known exclusively or nearly exclusively through metagenomics. A similar trend extends to the archaeal virosphere, where isolated representatives are available for a handful of model archaeal virus-host systems. Viral metagenomics provides an alternative way to circumvent the limitations of culture-based virus discovery and offers insight into the diversity, distribution, and environmental impact of uncultured archaeal viruses. Presently, metagenomics approaches have been successfully applied to explore the viromes associated with various lineages of extremophilic and mesophilic archaea, including Asgard archaea (Asgardarchaeota), ANME-1 archaea (Methanophagales), thaumarchaea (Nitrososphaeria), altiarchaea (Altiarchaeota), and marine group II archaea (Poseidoniales). Here, we provide an overview of methods widely used in archaeal virus metagenomics, covering metavirome preparation, genome annotation, phylogenetic and phylogenomic analyses, and archaeal host assignment. We hope that this summary will contribute to further exploration and characterization of the enigmatic archaeal virome lurking in diverse environments.
Affiliation(s)
- Yifan Zhou
- Institut Pasteur, Université Paris Cité, Archaeal Virology Unit, Paris, France
- Sorbonne Université, Collège Doctoral, Paris, France
- Yongjie Wang
- College of Food Science and Technology, Shanghai Ocean University, Shanghai, China
- Laboratory for Marine Biology and Biotechnology, Qingdao National Laboratory for Marine Science and Technology, Qingdao, China
- Laboratory of Quality and Safety Risk Assessment for Aquatic Products on Storage and Preservation (Shanghai), Ministry of Agriculture, Shanghai, China
- David Prangishvili
- Institut Pasteur, Université Paris Cité, Archaeal Virology Unit, Paris, France
- Ivane Javakhishvili Tbilisi State University, Tbilisi, Georgia
- Mart Krupovic
- Institut Pasteur, Université Paris Cité, Archaeal Virology Unit, Paris, France
5
Reinar WB, Tørresen OK, Nederbragt AJ, Matschiner M, Jentoft S, Jakobsen KS. Teleost genomic repeat landscapes in light of diversification rates and ecology. Mob DNA 2023; 14:14. [PMID: 37789366 PMCID: PMC10546739 DOI: 10.1186/s13100-023-00302-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/28/2023] [Accepted: 09/20/2023] [Indexed: 10/05/2023] Open
Abstract
Repetitive DNA makes up a considerable fraction of most eukaryotic genomes. In fish, transposable element (TE) activity has coincided with rapid species diversification. Here, we annotated the repetitive content in 100 genome assemblies, covering the major branches of the diverse lineage of teleost fish. We investigated whether TE content correlates with family-level net diversification rates and found support for a weak negative correlation. Further, we demonstrated that TE proportion correlates with genome size, but not with the proportion of short tandem repeats (STRs), which implies independent evolutionary paths. Marine and freshwater fish had large differences in STR content, with the most extreme propagation detected in the genomes of codfish species and Atlantic herring. Such a high density of STRs is likely to increase the mutational load, which we propose could be counterbalanced by high fecundity as seen in codfishes and herring.
Affiliation(s)
- Ole K Tørresen
- Department of Biosciences, University of Oslo, Oslo, Norway
- Alexander J Nederbragt
- Department of Biosciences, University of Oslo, Oslo, Norway
- Department of Informatics, University of Oslo, Oslo, Norway
- Michael Matschiner
- Department of Biosciences, University of Oslo, Oslo, Norway
- University of Oslo, Natural History Museum, Oslo, Norway
- Sissel Jentoft
- Department of Biosciences, University of Oslo, Oslo, Norway
6
Magdy Mohamed Abdelaziz Barakat S, Sallehuddin R, Yuhaniz SS, R. Khairuddin RF, Mahmood Y. Genome assembly composition of the String "ACGT" array: a review of data structure accuracy and performance challenges. PeerJ Comput Sci 2023; 9:e1180. [PMID: 37547391 PMCID: PMC10403225 DOI: 10.7717/peerj-cs.1180] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/07/2022] [Accepted: 04/27/2023] [Indexed: 08/08/2023]
Abstract
Background: The development of sequencing technology increases the number of genomes being sequenced. However, obtaining a quality genome sequence remains a challenge in genome assembly, which must assemble a massive number of short strings (reads) in the presence of repetitive sequences (repeats). Computer algorithms for genome assembly construct the entire genome from reads using two approaches. The de novo approach concatenates the reads based on the exact match between their suffix and prefix (overlapping). The reference-guided approach orders the reads based on their offsets in a well-known reference genome (read alignment). The presence of repeats extends the technical ambiguity, making the algorithm unable to distinguish the reads, which results in misassembly and affects the accuracy of the assembly approach. On the other hand, the massive number of reads poses a major assembly performance challenge. Method: The repeat identification method was introduced to address misassembly by identifying repetitive sequences beforehand, creating a repeat knowledge base to reduce ambiguity during the assembly process and thus enhancing the accuracy of the assembled genome. Also, hybridization between assembly approaches resulted in a lower misassembly degree with the aid of the reference genome. The assembly performance is optimized through data structure indexing and parallelization. This article's primary aim and contribution are to support researchers through an extensive review that eases their search for genome assembly studies. The study also highlighted the most recent developments and limitations in genome assembly accuracy and performance optimization. Results: Our findings show the limitations of the available repeat identification methods, which can only detect repeats of specific lengths and may not perform well when various types of repeats are present in a genome. We also found that most of the hybrid assembly approaches, whether starting with de novo or reference-guided assembly, have some limitations in handling repetitive sequences, as they are more computationally costly and time intensive. Although the hybrid approach was found to outperform individual assembly approaches, optimizing its performance remains a challenge. Also, the usage of parallelization in overlapping and read alignment for genome assembly is yet to be fully implemented in the hybrid assembly approach. Conclusion: We suggest combining multiple repeat identification methods to enhance the accuracy of identifying the repeats as an initial step to the hybrid assembly approach, and combining genome indexing with parallelization for better optimization of its performance.
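The suffix-prefix overlap step that the abstract attributes to de novo assembly can be sketched as follows. This is a minimal illustration only, not a method taken from the reviewed article: the function names `overlap` and `merge`, the `min_len` threshold, and the toy reads are all assumptions made for the example. It also shows how a repeat makes the choice of the next read ambiguous.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that exactly matches a prefix of `b`.

    Returns 0 when no overlap of at least `min_len` characters exists.
    """
    start = 0
    while True:
        # Look for the first `min_len` characters of b somewhere in a.
        start = a.find(b[:min_len], start)
        if start == -1:
            return 0
        # The candidate is valid only if the rest of a is a prefix of b.
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1


def merge(a, b, min_len=3):
    """Concatenate two reads on their exact overlap, or return None."""
    n = overlap(a, b, min_len)
    return a + b[n:] if n else None


# Hypothetical toy reads: both candidates overlap "ACGTAC" on the repeat "TAC",
# so an overlap-only assembler cannot tell which one follows in the genome.
read = "ACGTAC"
candidates = ["TACGGA", "TACTTT"]
extensions = [merge(read, c) for c in candidates]
```

Both candidates extend the read equally well here, which is the kind of repeat-induced ambiguity the abstract describes.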
Affiliation(s)
- Roselina Sallehuddin
- Computer Science, School of Computing, Faculty of Engineering, Universiti Teknologi Malaysia, Skudai, Johor, Malaysia
- Siti Sophiayati Yuhaniz
- Advanced Informatics Department, Razak Faculty of Technology and Informatics, Universiti Teknologi Malaysia, Kuala Lumpur, Kuala Lumpur, Malaysia
- Yasir Mahmood
- Faculty of Information Technology, The University of Lahore, Lahore, Lahore, Pakistan
7
Medvedev P. Theoretical Analysis of Sequencing Bioinformatics Algorithms and Beyond. Communications of the ACM 2023; 66:118-125. [PMID: 38736702 PMCID: PMC11087067 DOI: 10.1145/3571723] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 05/14/2024]
Abstract
A case study reveals the theoretical analysis of algorithms is not always as helpful as standard dogma might suggest.
Affiliation(s)
- Paul Medvedev
- Department of Computer Science and Engineering and the Department of Biochemistry and Molecular Biology and the Director of the Center for Computational Biology and Bioinformatics at Pennsylvania State University, University Park, PA, USA
8
Zhang A, Ma Y, Deng Y, Zhou Z, Cao Y, Yang B, Bai J, Sun Q. Enhancing Protease and Amylase Activities in Bacillus licheniformis XS-4 for Traditional Soy Sauce Fermentation Using ARTP Mutagenesis. Foods 2023; 12:2381. [PMID: 37372591 DOI: 10.3390/foods12122381] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 04/21/2023] [Revised: 05/22/2023] [Accepted: 05/31/2023] [Indexed: 06/29/2023] Open
Abstract
This study was conducted to increase the enzymatic activity of Bacillus licheniformis XS-4, which was isolated from the traditional fermented mash of Xianshi soy sauce. The mutation was induced by atmospheric and room-temperature plasma (ARTP), and a mutant strain, mut80, was obtained. mut80 exhibited significant increases in protease and amylase activity by 90.54% and 143.10%, respectively, and the enhanced enzymatic activities were stably maintained after 20 consecutive incubations. Re-sequencing analysis of mut80 revealed that the mutation sites were located in 1518447(AT-T) and 4253106(G-A) in its genome, which was involved in the metabolic pathways of amino acids. The expression of the protease synthetic gene (aprX) increased 1.54 times, while that of the amylase gene (amyA) increased 11.26 times, as confirmed via RT-qPCR. Using ARTP mutagenesis, the present study proposes a highly efficient microbial resource with enhanced protease and amylase activity provided by B. licheniformis, which can potentially be used to improve the efficiency of traditional soy sauce fermentation.
Affiliation(s)
- Andong Zhang
- Key Laboratory of Bio-Resources and Eco-Environment of the Ministry of the Education, College of Life Sciences, Sichuan University, Chengdu 610064, China
- Yudong Ma
- Key Laboratory of Bio-Resources and Eco-Environment of the Ministry of the Education, College of Life Sciences, Sichuan University, Chengdu 610064, China
- Yue Deng
- School of China Alcoholic Drinks, Luzhou Vocational and Technical College, Luzhou 646000, China
- Zhiwei Zhou
- Key Laboratory of Bio-Resources and Eco-Environment of the Ministry of the Education, College of Life Sciences, Sichuan University, Chengdu 610064, China
- Yue Cao
- Key Laboratory of Bio-Resources and Eco-Environment of the Ministry of the Education, College of Life Sciences, Sichuan University, Chengdu 610064, China
- Bin Yang
- Key Laboratory of Bio-Resources and Eco-Environment of the Ministry of the Education, College of Life Sciences, Sichuan University, Chengdu 610064, China
- Jing Bai
- Key Laboratory of Bio-Resources and Eco-Environment of the Ministry of the Education, College of Life Sciences, Sichuan University, Chengdu 610064, China
- Qun Sun
- Key Laboratory of Bio-Resources and Eco-Environment of the Ministry of the Education, College of Life Sciences, Sichuan University, Chengdu 610064, China
9
Cristina Diaconu C, Madalina Pitica I, Chivu-Economescu M, Georgiana Necula L, Botezatu A, Virginia Iancu I, Iulia Neagu A, L. Radu E, Matei L, Maria Ruta S, Bleotu C. SARS-CoV-2 Variant Surveillance in Genomic Medicine Era. Infect Dis (Lond) 2023. [DOI: 10.5772/intechopen.107137] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/26/2024] Open
Abstract
In the genomic medicine era, the emergence of SARS-CoV-2 was immediately followed by viral genome sequencing and worldwide sequence sharing. Almost in real time, based on these sequences, resources were developed and applied around the world, such as molecular diagnostic tests, informed public health decisions, and vaccines. Molecular SARS-CoV-2 variant surveillance was a normal approach in this context. Yet, considering that viral genome modification occurs commonly in the viral replication process, the challenge is to identify the modifications that significantly affect virulence or transmissibility, reduce the effectiveness of vaccines and therapeutics, or cause diagnostic tests to fail. However, assessing the importance of newly emerging mutations and linking them to epidemiological trends is still a laborious process, and faster phenotypic evaluation approaches, in conjunction with genomic data, are required in order to release timely and efficient control measures.
10
Naranjo-Ortiz MA, Molina M, Fuentes D, Mixão V, Gabaldón T. Karyon: a computational framework for the diagnosis of hybrids, aneuploids, and other nonstandard architectures in genome assemblies. Gigascience 2022; 11:6751106. [PMID: 36205401 PMCID: PMC9540331 DOI: 10.1093/gigascience/giac088] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/22/2021] [Revised: 11/23/2021] [Accepted: 08/24/2022] [Indexed: 12/22/2022] Open
Abstract
BACKGROUND: Recent technological developments have made genome sequencing and assembly highly accessible and widely used. However, the presence in sequenced organisms of certain genomic features such as high heterozygosity, polyploidy, aneuploidy, heterokaryosis, or extreme compositional biases can challenge current standard assembly procedures and result in highly fragmented assemblies. Hence, we hypothesized that genome databases must contain a nonnegligible fraction of low-quality assemblies that result from such types of intrinsic genomic factors. FINDINGS: Here we present Karyon, a Python-based toolkit that uses raw sequencing data and de novo genome assembly to assess several parameters and generate informative plots to assist in the identification of noncanonical genomic traits. Karyon includes automated de novo genome assembly and variant calling pipelines. We tested Karyon by diagnosing 35 highly fragmented publicly available assemblies from 19 different Mucorales (Fungi) species. CONCLUSIONS: Our results show that 10 (28.57%) of the assemblies presented signs of unusual genomic configurations, suggesting that these are common, at least for some lineages within the Fungi.
Affiliation(s)
- Miguel A Naranjo-Ortiz
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona 08003, Spain; Health and Life Sciences, Universitat Pompeu Fabra (UPF), Barcelona 08003, Spain; Biology Department, Clark University, Worcester, MA 01610, USA; Naturhistoriskmuseum, University of Oslo, Oslo 0562, Norway
- Manu Molina
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona 08003, Spain; Health and Life Sciences, Universitat Pompeu Fabra (UPF), Barcelona 08003, Spain; Life Sciences Department, Barcelona Supercomputing Centre (BSC-CNS), Barcelona 08034, Spain
- Diego Fuentes
- Life Sciences Department, Barcelona Supercomputing Centre (BSC-CNS), Barcelona 08034, Spain; Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Barcelona 08028, Spain
- Verónica Mixão
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona 08003, Spain; Health and Life Sciences, Universitat Pompeu Fabra (UPF), Barcelona 08003, Spain; Life Sciences Department, Barcelona Supercomputing Centre (BSC-CNS), Barcelona 08034, Spain; Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Barcelona 08028, Spain
- Toni Gabaldón
- Correspondence address. Toni Gabaldón, Plaça Eusebi Güell, 1-3, Barcelona 08034, Spain. E-mail:
11
Ko BJ, Lee C, Kim J, Rhie A, Yoo DA, Howe K, Wood J, Cho S, Brown S, Formenti G, Jarvis ED, Kim H. Widespread false gene gains caused by duplication errors in genome assemblies. Genome Biol 2022; 23:205. [PMID: 36167596 PMCID: PMC9516828 DOI: 10.1186/s13059-022-02764-1] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Received: 06/16/2021] [Accepted: 09/02/2022] [Indexed: 12/22/2022] Open
Abstract
Background: False duplications in genome assemblies lead to false biological conclusions. We quantified false duplications in popularly used previous genome assemblies for platypus, zebra finch, and Anna's hummingbird, and in their new counterparts of the same species generated by the Vertebrate Genomes Project, whose pipeline attempted to eliminate false duplications through haplotype phasing and purging. These assemblies are among the first generated by the Vertebrate Genomes Project where there was a prior chromosomal-level reference assembly to compare with. Results: Whole genome alignments revealed that 4 to 16% of the sequences are falsely duplicated in the previous assemblies, impacting hundreds to thousands of genes. These lead to overestimated gene family expansions. The main source of the false duplications is heterotype duplications, where the haplotype sequences were relatively more divergent than other parts of the genome, leading the assembly algorithms to classify them as separate genes or genomic regions. A minor source is sequencing errors. Ancient ATP nucleotide binding gene families have a higher prevalence of false duplications compared to other gene families. Although present in a smaller proportion, we observe false duplications remaining in the Vertebrate Genomes Project assemblies that can be identified and purged. Conclusions: This study highlights the need for more advanced assembly methods that better separate haplotypes and sequence errors, and the need for cautious analyses on gene gains. Supplementary Information: The online version contains supplementary material available at 10.1186/s13059-022-02764-1.
Affiliation(s)
- Byung June Ko
- Department of Agricultural Biotechnology and Research Institute of Agriculture and Life Sciences, Seoul National University, Seoul, Republic of Korea
- Chul Lee
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea
- Juwan Kim
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea
- Arang Rhie
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, USA
- Dong Ahn Yoo
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea
- Seoae Cho
- eGnome, Inc, Seoul, Republic of Korea
- Samara Brown
- Laboratory of the Neurogenetics of Language, The Rockefeller University, New York, NY, USA; Howard Hughes Medical Institute, Chevy Chase, MD, USA
- Giulio Formenti
- Laboratory of the Neurogenetics of Language, The Rockefeller University, New York, NY, USA
- Erich D Jarvis
- Laboratory of the Neurogenetics of Language, The Rockefeller University, New York, NY, USA; Howard Hughes Medical Institute, Chevy Chase, MD, USA
- Heebal Kim
- Department of Agricultural Biotechnology and Research Institute of Agriculture and Life Sciences, Seoul National University, Seoul, Republic of Korea; Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea; eGnome, Inc, Seoul, Republic of Korea
12
Khan J, Kokot M, Deorowicz S, Patro R. Scalable, ultra-fast, and low-memory construction of compacted de Bruijn graphs with Cuttlefish 2. Genome Biol 2022; 23:190. [PMID: 36076275 PMCID: PMC9454175 DOI: 10.1186/s13059-022-02743-6] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 01/17/2022] [Accepted: 08/01/2022] [Indexed: 11/13/2022] Open
Abstract
The de Bruijn graph is a key data structure in modern computational genomics, and construction of its compacted variant resides upstream of many genomic analyses. As the quantity of genomic data grows rapidly, this often forms a computational bottleneck. We present Cuttlefish 2, significantly advancing the state-of-the-art for this problem. On a commodity server, it reduces the graph construction time for 661K bacterial genomes, of size 2.58Tbp, from 4.5 days to 17-23 h; and it constructs the graph for 1.52Tbp white spruce reads in approximately 10 h, while the closest competitor requires 54-58 h, using considerably more memory.
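To make "compacted de Bruijn graph" concrete, the sketch below builds a k-mer graph from reads and collapses every maximal non-branching path into one string. It is a naive in-memory illustration and not the Cuttlefish 2 algorithm: the function names are hypothetical, and it assumes forward-strand k-mers only (no reverse complements), which real tools do not.

```python
def kmers(reads, k):
    """Set of all k-mers occurring in the reads (forward strand only)."""
    return {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}


def compact(reads, k):
    """Return the maximal non-branching paths of the k-mer graph, sorted by start."""
    nodes = kmers(reads, k)
    # Edges are implied: x -> y whenever the (k-1)-suffix of x is the (k-1)-prefix of y.
    succ = {n: [n[1:] + c for c in "ACGT" if n[1:] + c in nodes] for n in nodes}
    pred = {n: [c + n[:-1] for c in "ACGT" if c + n[:-1] in nodes] for n in nodes}
    seen = set()

    def starts_path(n):
        # n continues a path only if it has one predecessor with one successor.
        return not (len(pred[n]) == 1 and len(succ[pred[n][0]]) == 1)

    def walk(start):
        path, cur = start, start
        seen.add(start)
        while len(succ[cur]) == 1:
            nxt = succ[cur][0]
            if len(pred[nxt]) != 1 or nxt in seen:
                break
            path += nxt[-1]  # extend the spelled string by one character
            seen.add(nxt)
            cur = nxt
        return path

    paths = [walk(n) for n in sorted(nodes) if starts_path(n)]
    # Isolated cycles have no natural start; break them at an arbitrary node.
    paths += [walk(n) for n in sorted(nodes) if n not in seen]
    return paths


# Hypothetical toy input: the k-mer CGT branches to GTA and GTT.
result = compact(["ACGTT", "CGTA"], k=3)
```

On this toy input the four k-mers collapse into three paths, because ACG and CGT form a chain that ends at the branch.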
Affiliation(s)
- Jamshed Khan
- Department of Computer Science, University of Maryland, College Park, USA
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, USA
- Marek Kokot
- Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Gliwice, Poland
- Sebastian Deorowicz
- Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Gliwice, Poland
- Rob Patro
- Department of Computer Science, University of Maryland, College Park, USA
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, USA
13
Rahman A, Medvedev P. Assembler artifacts include misassembly because of unsafe unitigs and underassembly because of bidirected graphs. Genome Res 2022; 32:gr.276601.122. [PMID: 35896425 PMCID: PMC9528984 DOI: 10.1101/gr.276601.122] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/17/2022] [Accepted: 07/26/2022] [Indexed: 11/24/2022]
Abstract
Recent assemblies by the T2T and VGP consortia have achieved significant accuracy but required a tremendous amount of effort and resources. More typical assembly efforts, on the other hand, still suffer both from misassemblies (joining sequences that should not be adjacent) and from underassemblies (not joining sequences that should be adjacent). To better understand the common algorithm-driven causes of these limitations, we investigated the unitig algorithm, which is a core algorithm at the heart of most assemblers. We prove that, contrary to popular belief, even when there are no sequencing errors, unitigs are not always safe (i.e., they are not guaranteed to be substrings of the sequenced genome). We also prove that the unitigs of a bidirected de Bruijn graph are different from those of a doubled de Bruijn graph and, contrary to our expectations, result in underassembly. Using experimental simulations, we then confirm that these two artifacts exist not only in theory but also in the output of widely used assemblers. In particular, when coverage is low, then even error-free data result in unsafe unitigs; also, unitigs may unnecessarily split palindromes in half if special care is not taken. To the best of our knowledge, this paper is the first to theoretically predict the existence of these assembler artifacts and confirm and measure the extent of their occurrence in practice.
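The claim that error-free reads can still yield an unsafe unitig at low coverage can be illustrated with a toy case. The sketch below is an assumption-laden illustration and not the construction or proof used in the paper: the genome, the reads, `k = 3`, and the helper `spell_single_path` are all hypothetical, and the helper only handles k-mer sets that form a single non-branching path.

```python
def spell_single_path(kmer_set):
    """Spell the unitig of a k-mer set that forms one non-branching path.

    Assumption for this toy example: every k-mer has at most one successor
    and one predecessor, and exactly one k-mer has no predecessor.
    """
    succ = {n: [m for m in kmer_set if n[1:] == m[:-1]] for n in kmer_set}
    pred = {n: [m for m in kmer_set if m[1:] == n[:-1]] for n in kmer_set}
    assert all(len(v) <= 1 for v in succ.values())
    assert all(len(v) <= 1 for v in pred.values())
    (start,) = [n for n in kmer_set if not pred[n]]
    path, cur = start, start
    while succ[cur]:
        cur = succ[cur][0]
        path += cur[-1]
    return path


k = 3
genome = "AAGTCCGTTA"          # contains the repeat "GT" twice
reads = ["AAGT", "GTTA"]       # error-free, but the middle of the genome is unsampled
assert all(r in genome for r in reads)

observed = {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}
unitig = spell_single_path(observed)
is_safe = unitig in genome     # a safe unitig must be a substring of the genome
```

Because the k-mers GTC and CGT were never sampled, the graph sees no branch at the repeat and joins the two reads into `AAGTTA`, which does not occur in the genome.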
Affiliation(s)
- Amatur Rahman
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Paul Medvedev
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
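Illustrative note on entry 13: the abstract above concerns unitigs, which are usually defined as maximal non-branching paths of a de Bruijn graph, and it mentions palindromes as a special case. The following is a minimal Python sketch of both ideas, not the authors' code. It assumes a single-strand, node-centric graph built from error-free reads; the function names and the example reads are hypothetical, and isolated cycles are ignored for brevity.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq))


def is_rc_palindrome(seq):
    """True if seq equals its own reverse complement."""
    return seq == reverse_complement(seq)


def unitigs(reads, k):
    """Spell the maximal non-branching paths of a node-centric de Bruijn graph.

    Assumptions (illustrative only): nodes are the k-mers seen in the reads,
    an arc joins two nodes that overlap by k-1 characters, only the forward
    strand is considered, and isolated cycles are not reported.
    """
    nodes = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            nodes.add(read[i:i + k])

    succ = {n: [n[1:] + b for b in "ACGT" if n[1:] + b in nodes] for n in nodes}
    pred = {n: [b + n[:-1] for b in "ACGT" if b + n[:-1] in nodes] for n in nodes}

    def starts_unitig(n):
        # A node continues a unitig only if it has exactly one predecessor
        # and that predecessor has exactly one successor.
        return not (len(pred[n]) == 1 and len(succ[pred[n][0]]) == 1)

    result = []
    for start in sorted(nodes):
        if not starts_unitig(start):
            continue
        path, current = start, start
        while len(succ[current]) == 1:
            nxt = succ[current][0]
            if len(pred[nxt]) != 1:
                break  # nxt has several predecessors: the path ends here
            path += nxt[-1]
            current = nxt
        result.append(path)
    return sorted(result)
```

With the two hypothetical reads ATGGCGT and ATGGCTA and k = 3, the sketch yields the unitigs ATGGC, GCGT and GCTA. The string ACGT equals its own reverse complement, which is one common meaning of "palindrome" in this setting (an assumption here, since the abstract does not define the term).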
14
Goel M, Schneeberger K. plotsr: visualizing structural similarities and rearrangements between multiple genomes. Bioinformatics 2022; 38:2922-2926. [PMID: 35561173 PMCID: PMC9113368 DOI: 10.1093/bioinformatics/btac196] [Citation(s) in RCA: 56] [Impact Index Per Article: 28.0] [Received: 01/21/2022] [Revised: 03/15/2022] [Accepted: 04/11/2022] [Indexed: 02/03/2023] Open
Abstract
SUMMARY: Third-generation genome sequencing technologies have led to a sharp increase in the number of high-quality genome assemblies. This allows the comparison of multiple assembled genomes of individual species and demands new tools for visualizing their structural properties. Here, we present plotsr, an efficient tool to visualize structural similarities and rearrangements between genomes. It can be used to compare genomes on chromosome level or to zoom in on any selected region. In addition, plotsr can augment the visualization with regional identifiers (e.g. genes or genomic markers) or histogram tracks for continuous features (e.g. GC content or polymorphism density). AVAILABILITY AND IMPLEMENTATION: plotsr is implemented as a Python package and uses the standard matplotlib library for plotting. It is freely available under the MIT license at GitHub (https://github.com/schneebergerlab/plotsr) and bioconda (https://anaconda.org/bioconda/plotsr). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Manish Goel
- Faculty of Biology, LMU Munich, Planegg-Martinsried 82152, Germany
- Department of Genetics, Faculty of Biology, LMU Munich, Germany
15
Bendall ML, Gibson KM, Steiner MC, Rentia U, Pérez-Losada M, Crandall KA. HAPHPIPE: Haplotype Reconstruction and Phylodynamics for Deep Sequencing of Intrahost Viral Populations. Mol Biol Evol 2021; 38:1677-1690. [PMID: 33367849 PMCID: PMC8042772 DOI: 10.1093/molbev/msaa315] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 01/16/2023] Open
Abstract
Deep sequencing of viral populations using next-generation sequencing (NGS) offers opportunities to understand and investigate evolution, transmission dynamics, and population genetics. Currently, the standard practice for processing NGS data to study viral populations is to summarize all the observed sequences from a sample as a single consensus sequence, thus discarding valuable information about the intrahost viral molecular epidemiology. Furthermore, existing analytical pipelines may only analyze genomic regions involved in drug resistance, thus are not suited for full viral genome analysis. Here, we present HAPHPIPE, a HAplotype and PHylodynamics PIPEline for genome-wide assembly of viral consensus sequences and haplotypes. The HAPHPIPE protocol includes modules for quality trimming, error correction, de novo assembly, alignment, and haplotype reconstruction. The resulting consensus sequences, haplotypes, and alignments can be further analyzed using a variety of phylogenetic and population genetic software. HAPHPIPE is designed to provide users with a single pipeline to rapidly analyze sequences from viral populations generated from NGS platforms and provide quality output properly formatted for downstream evolutionary analyses.
Affiliation(s)
- Matthew L Bendall
- Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, Washington, DC, USA
- Keylie M Gibson
- Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, Washington, DC, USA
- Margaret C Steiner
- Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, Washington, DC, USA
- Uzma Rentia
- Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, Washington, DC, USA
- Marcos Pérez-Losada
- Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, Washington, DC, USA
- Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, The George Washington University, Washington, DC, USA
- CIBIO-InBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, Universidade do Porto, Vairão, Portugal
- Keith A Crandall
- Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, Washington, DC, USA
- Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, The George Washington University, Washington, DC, USA
16
Chiara M, D’Erchia AM, Gissi C, Manzari C, Parisi A, Resta N, Zambelli F, Picardi E, Pavesi G, Horner DS, Pesole G. Next generation sequencing of SARS-CoV-2 genomes: challenges, applications and opportunities. Brief Bioinform 2021; 22:616-630. [PMID: 33279989 PMCID: PMC7799330 DOI: 10.1093/bib/bbaa297] [Citation(s) in RCA: 118] [Impact Index Per Article: 39.3] [Received: 08/08/2020] [Revised: 09/27/2020] [Accepted: 10/07/2020] [Indexed: 12/31/2022] Open
Abstract
Various next generation sequencing (NGS) based strategies have been successfully used in the recent past for tracing origins and understanding the evolution of infectious agents, investigating the spread and transmission chains of outbreaks, as well as facilitating the development of effective and rapid molecular diagnostic tests and contributing to the hunt for treatments and vaccines. The ongoing COVID-19 pandemic poses one of the greatest global threats in modern history and has already caused severe social and economic costs. The development of efficient and rapid sequencing methods to reconstruct the genomic sequence of SARS-CoV-2, the etiological agent of COVID-19, has been fundamental for the design of diagnostic molecular tests and to devise effective measures and strategies to mitigate the diffusion of the pandemic. Diverse approaches and sequencing methods can, as testified by the number of available sequences, be applied to SARS-CoV-2 genomes. However, each technology and sequencing approach has its own advantages and limitations. In the current review, we will provide a brief, but hopefully comprehensive, account of currently available platforms and methodological approaches for the sequencing of SARS-CoV-2 genomes. We also present an outline of current repositories and databases that provide access to SARS-CoV-2 genomic data and associated metadata. Finally, we offer general advice and guidelines for the appropriate sharing and deposition of SARS-CoV-2 data and metadata, and suggest that more efficient and standardized integration of current and future SARS-CoV-2-related data would greatly facilitate the struggle against this new pathogen. We hope that our 'vademecum' for the production and handling of SARS-CoV-2-related sequencing data, will contribute to this objective.
Affiliation(s)
- Matteo Chiara
- molecular biology and bioinformatics at the University of Milan
- Anna Maria D’Erchia
- molecular biology at the University of Bari and research associate at the Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies of the National Research Council in Bari
- Carmela Gissi
- molecular biology at the University of Bari and research associate at the Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies of the National Research Council in Bari
- Caterina Manzari
- Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies of the National Research Council in Bari
- Antonio Parisi
- Genetic and Molecular Epidemiology Laboratory at the Experimental Zooprophylactic Institute of Apulia and Basilicata
- Nicoletta Resta
- Medical Genetics at the University of Bari. She heads the Laboratory Unit of Medical Genetics and the School of Specialization in Medical Genetics
- Ernesto Picardi
- molecular biology and bioinformatics at the University of Bari and research associate at the Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies of the National Research Council in Bari
- Giulio Pavesi
- Associate Professor of bioinformatics at the University of Milan (Italy)
- David S Horner
- molecular biology and bioinformatics at the University of Milan
- Graziano Pesole
- molecular biology at the University of Bari and Research Associate at the Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies of the National Research Council in Bari
17
Luo J, Wei Y, Lyu M, Wu Z, Liu X, Luo H, Yan C. A comprehensive review of scaffolding methods in genome assembly. Brief Bioinform 2021; 22:6149347. [PMID: 33634311 DOI: 10.1093/bib/bbab033] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 09/14/2020] [Revised: 01/21/2021] [Accepted: 01/22/2021] [Indexed: 12/20/2022] Open
Abstract
In the field of genome assembly, scaffolding methods make it possible to obtain a more complete and contiguous reference genome, which is the cornerstone of genomic research. Scaffolding methods typically utilize the alignments between contigs and sequencing data (reads) to determine the orientation and order among contigs and to produce longer scaffolds, which are helpful for genomic downstream analysis. With the rapid development of high-throughput sequencing technologies, diverse types of reads have emerged over the past decade, especially in long-range sequencing, which have greatly enhanced the assembly quality of scaffolding methods. As the number of scaffolding methods increases, biology and bioinformatics researchers need to perform in-depth analyses of state-of-the-art scaffolding methods. In this article, we focus on the difficulties in scaffolding, the differences in characteristics among various kinds of reads, the methods by which current scaffolding methods address these difficulties, and future research opportunities. We hope this work will benefit the design of new scaffolding methods and the selection of appropriate scaffolding methods for specific biological studies.
Affiliation(s)
- Junwei Luo
- College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China
- Yawei Wei
- College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China
- Mengna Lyu
- College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China
- Zhengjiang Wu
- College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China
- Xiaoyan Liu
- College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China
- Huimin Luo
- School of Computer and Information Engineering, Henan University, Kaifeng, China
- Chaokun Yan
- School of Computer and Information Engineering, Henan University, Kaifeng, China
18
19
Steyaert A, Audenaert P, Fostier J. Accurate determination of node and arc multiplicities in de Bruijn graphs using conditional random fields. BMC Bioinformatics 2020; 21:402. [PMID: 32928110 PMCID: PMC7491180 DOI: 10.1186/s12859-020-03740-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/18/2020] [Accepted: 09/04/2020] [Indexed: 12/01/2022] Open
Abstract
Background: De Bruijn graphs are key data structures for the analysis of next-generation sequencing data. They efficiently represent the overlap between reads and hence, also the underlying genome sequence. However, sequencing errors and repeated subsequences render the identification of the true underlying sequence difficult. A key step in this process is the inference of the multiplicities of nodes and arcs in the graph. These multiplicities correspond to the number of times each k-mer (resp. (k+1)-mer) implied by a node (resp. arc) is present in the genomic sequence. Determining multiplicities thus reveals the repeat structure and presence of sequencing errors. Multiplicities of nodes/arcs in the de Bruijn graph are reflected in their coverage; however, coverage variability and coverage biases render their determination ambiguous. Current methods to determine node/arc multiplicities base their decisions solely on the information in nodes and arcs individually, under-utilising the information present in the sequencing data. Results: To improve the accuracy with which node and arc multiplicities in a de Bruijn graph are inferred, we developed a conditional random field (CRF) model to efficiently combine the coverage information within each node/arc individually with the information of surrounding nodes and arcs. Multiplicities are thus collectively assigned in a more consistent manner. Conclusions: We demonstrate that the CRF model yields significant improvements in accuracy and a more robust expectation-maximisation parameter estimation. True k-mers can be distinguished from erroneous k-mers with a higher F1 score than existing methods. A C++11 implementation is available at https://github.com/biointec/detox under the GNU AGPL v3.0 license.
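Illustrative note on entry 19: the abstract above states that node multiplicities are reflected in coverage and that existing methods decide for each node individually. A minimal Python sketch of such a naive per-node estimate follows. It is not the CRF model of the paper and not the Detox implementation; the function names, the example reads and the `single_copy_coverage` parameter are assumptions made for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def naive_multiplicities(counts, single_copy_coverage):
    """Estimate each k-mer's multiplicity from its own coverage alone.

    single_copy_coverage is the assumed average count of a k-mer that
    occurs exactly once in the genome. The estimate is the ratio rounded
    to the nearest integer; an estimate of 0 flags a likely error k-mer.
    """
    return {
        kmer: int(count / single_copy_coverage + 0.5)
        for kmer, count in counts.items()
    }
```

For example, ten copies of the hypothetical read ACGACG plus one erroneous read ACTACG, with k = 3 and an assumed single-copy coverage of 10, give ACG an estimated multiplicity of 2, CGA and GAC a multiplicity of 1, and the error k-mer ACT a multiplicity of 0. Because each k-mer is judged in isolation, noisy coverage can easily flip such a rounding decision, which is the weakness the abstract attributes to per-node methods.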
20
Li Y, Wei H, Yang J, Du K, Li J, Zhang Y, Qiu T, Liu Z, Ren Y, Song L, Kang X. High-quality de novo assembly of the Eucommia ulmoides haploid genome provides new insights into evolution and rubber biosynthesis. Horticulture Research 2020; 7:183. [PMID: 33328448 PMCID: PMC7603500 DOI: 10.1038/s41438-020-00406-w] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 03/25/2020] [Revised: 08/13/2020] [Accepted: 09/04/2020] [Indexed: 05/06/2023]
Abstract
We report the acquisition of a high-quality haploid chromosome-scale genome assembly for the first time in a tree species, Eucommia ulmoides, which is known for its rubber biosynthesis and medicinal applications. The assembly was obtained by applying PacBio and Hi-C technologies to a haploid that we specifically generated. Compared to the initial genome release, this one has significantly improved assembly quality. The scaffold N50 (53.15 Mb) increased 28-fold, and the repetitive sequence content (520 Mb) increased by 158.24 Mb, whereas the number of gaps decreased from 104,772 to 128. A total of 92.87% of the 26,001 predicted protein-coding genes identified with multiple strategies were anchored to the 17 chromosomes. A new whole-genome duplication event was superimposed on the earlier γ paleohexaploidization event, and the expansion of long terminal repeats contributed greatly to the evolution of the genome. The more primitive rubber biosynthesis of this species, as opposed to that in Hevea brasiliensis, relies on the methylerythritol-phosphate pathway rather than the mevalonate pathway to synthesize isoprenyl diphosphate, as the MEP pathway operates predominantly in trans-polyisoprene-containing leaves and central peels. Chlorogenic acid biosynthesis pathway enzymes were preferentially expressed in leaves rather than in bark. This assembly with higher sequence contiguity can foster not only studies on genome structure and evolution, gene mapping, epigenetic analysis and functional genomics but also efforts to improve E. ulmoides for industrial and medical uses through genetic engineering.
Affiliation(s)
- Yun Li
- Beijing Advanced Innovation Center for Tree Breeding by Molecular Design, Beijing Forestry University, 100083, Beijing, People's Republic of China
- National Engineering Laboratory for Tree Breeding, Beijing Forestry University, 100083, Beijing, People's Republic of China
- College of Biological Sciences and Technology, Beijing Forestry University, 100083, Beijing, People's Republic of China
- Hairong Wei
- Beijing Advanced Innovation Center for Tree Breeding by Molecular Design, Beijing Forestry University, 100083, Beijing, People's Republic of China
- School of Forest Resources and Environmental, Science, Michigan Technological University, Houghton, MI, 49931, USA
- Jun Yang
- Beijing Advanced Innovation Center for Tree Breeding by Molecular Design, Beijing Forestry University, 100083, Beijing, People's Republic of China
- National Engineering Laboratory for Tree Breeding, Beijing Forestry University, 100083, Beijing, People's Republic of China
- College of Biological Sciences and Technology, Beijing Forestry University, 100083, Beijing, People's Republic of China
- Kang Du
- Beijing Advanced Innovation Center for Tree Breeding by Molecular Design, Beijing Forestry University, 100083, Beijing, People's Republic of China
- National Engineering Laboratory for Tree Breeding, Beijing Forestry University, 100083, Beijing, People's Republic of China
- College of Biological Sciences and Technology, Beijing Forestry University, 100083, Beijing, People's Republic of China
- Jiang Li
- Beijing Advanced Innovation Center for Tree Breeding by Molecular Design, Beijing Forestry University, 100083, Beijing, People's Republic of China
- National Engineering Laboratory for Tree Breeding, Beijing Forestry University, 100083, Beijing, People's Republic of China
- College of Biological Sciences and Technology, Beijing Forestry University, 100083, Beijing, People's Republic of China
- Ying Zhang
- Beijing Advanced Innovation Center for Tree Breeding by Molecular Design, Beijing Forestry University, 100083, Beijing, People's Republic of China
- National Engineering Laboratory for Tree Breeding, Beijing Forestry University, 100083, Beijing, People's Republic of China
- College of Biological Sciences and Technology, Beijing Forestry University, 100083, Beijing, People's Republic of China
- Tong Qiu
- Beijing Advanced Innovation Center for Tree Breeding by Molecular Design, Beijing Forestry University, 100083, Beijing, People's Republic of China
- National Engineering Laboratory for Tree Breeding, Beijing Forestry University, 100083, Beijing, People's Republic of China
- College of Biological Sciences and Technology, Beijing Forestry University, 100083, Beijing, People's Republic of China
- Zhao Liu
- Beijing Advanced Innovation Center for Tree Breeding by Molecular Design, Beijing Forestry University, 100083, Beijing, People's Republic of China
- National Engineering Laboratory for Tree Breeding, Beijing Forestry University, 100083, Beijing, People's Republic of China
- College of Biological Sciences and Technology, Beijing Forestry University, 100083, Beijing, People's Republic of China
- Yongyu Ren
- Beijing Advanced Innovation Center for Tree Breeding by Molecular Design, Beijing Forestry University, 100083, Beijing, People's Republic of China
- National Engineering Laboratory for Tree Breeding, Beijing Forestry University, 100083, Beijing, People's Republic of China
- College of Biological Sciences and Technology, Beijing Forestry University, 100083, Beijing, People's Republic of China
- Lianjun Song
- Hebei Huayang Fine Seeds and Seedlings Co., Ltd., 054700, Hebei, People's Republic of China
- Xiangyang Kang
- Beijing Advanced Innovation Center for Tree Breeding by Molecular Design, Beijing Forestry University, 100083, Beijing, People's Republic of China.
- National Engineering Laboratory for Tree Breeding, Beijing Forestry University, 100083, Beijing, People's Republic of China.
- College of Biological Sciences and Technology, Beijing Forestry University, 100083, Beijing, People's Republic of China.
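Illustrative note on entry 20: the abstract reports a scaffold N50. Under the usual definition (assumed here, since the abstract does not spell it out), N50 is the largest length L such that scaffolds of length at least L together contain at least half of the assembled bases. The Python sketch below computes it; the function name and the example lengths are hypothetical and are not data from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold or contig lengths.

    Scaffolds are visited from longest to shortest; the N50 is the length
    of the scaffold at which the running total first reaches half of the
    total assembly length.
    """
    if not lengths:
        raise ValueError("n50 requires at least one length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
```

For the hypothetical lengths 2, 3, 4, 5, 6, 7, 8, 9 and 10 (total 54), the three longest scaffolds sum to 27, which is half of the total, so the N50 is 8. A larger N50 with the same total length means the assembly is split into fewer, longer pieces.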
21
Segerman B. The Most Frequently Used Sequencing Technologies and Assembly Methods in Different Time Segments of the Bacterial Surveillance and RefSeq Genome Databases. Front Cell Infect Microbiol 2020; 10:527102. [PMID: 33194784 PMCID: PMC7604302 DOI: 10.3389/fcimb.2020.527102] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Received: 01/15/2020] [Accepted: 09/08/2020] [Indexed: 01/05/2023] Open
Abstract
Whole genome sequencing has become a powerful tool in modern microbiology. Bacterial genomes in particular are sequenced in high numbers. Whole genome sequencing is used not only in research projects but also in surveillance projects and outbreak investigations. Many whole genome analysis workflows begin with the production of a genome assembly. To accomplish this, a number of different sequencing technologies and assembly methods are available. Here, a summary is provided of the most frequently used sequencing technologies and genome assembly approaches reported for the bacterial RefSeq genomes and for the bacterial genomes submitted as belonging to a surveillance project. The data are presented both in total and broken down by year. Information associated with over 400,000 publicly available genomes dated April 2020 and prior was used. The information summarized includes (i) the most frequently used sequencing technologies, (ii) the most common combinations of sequencing technologies, (iii) the most reported sequencing depth, and (iv) the most frequently used assembly software solutions. In all, this mini review provides an overview of the currently most common workflows for producing bacterial whole genome sequence assemblies.
Affiliation(s)
- Bo Segerman
- Department of Microbiology, National Veterinary Institute (SVA), Uppsala, Sweden
- Department of Medical Biochemistry and Microbiology, Uppsala University, Uppsala, Sweden
22
Rastas P. Lep-Anchor: automated construction of linkage map anchored haploid genomes. Bioinformatics 2020; 36:2359-2364. [PMID: 31913460 DOI: 10.1093/bioinformatics/btz978] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 09/10/2019] [Revised: 12/12/2019] [Accepted: 01/02/2020] [Indexed: 12/13/2022] Open
Abstract
MOTIVATION: Linkage mapping provides a practical way to anchor de novo genome assemblies into chromosomes and to detect chimeric or otherwise erroneous contigs. Such anchoring improves with a higher number of markers and individuals, as long as the mapping software can handle all the information. The recent software Lep-MAP3 can robustly construct linkage maps for millions of genotyped markers and thousands of individuals, providing optimal maps for genome anchoring. For such large datasets, an automated and robust genome anchoring tool is especially valuable and can significantly reduce the intensive computational and manual work involved. RESULTS: Here, we present the software Lep-Anchor (LA) to anchor genome assemblies automatically using dense linkage maps. As the main novelty, it takes into account the uncertainty of the linkage map positions caused by low recombination regions, cross type or poor mapping data quality. Furthermore, it can automatically detect and cut chimeric contigs, and use contig-contig, single read or alternative genome assembly alignments as additional information on contig order and orientations and to collapse haplotype contigs. We demonstrate the performance of LA using real data and show that it outperforms ALLMAPS on anchoring completeness and speed. Accuracy-wise, LA and ALLMAPS are about equal, but ALLMAPS achieves this at the expense of lower completeness. The software Chromonomer was faster than the other two methods but has major limitations and is lower in accuracy. We also show that with additional information, such as contig-contig and read alignments, the anchoring completeness can be improved by up to 70% without significant loss in accuracy. Based on simulated data, we conclude that the anchoring accuracy can be improved by utilizing information about map position uncertainty. Accuracy is the rate of contigs in correct orientation and completeness is the number of contigs with inferred orientation.
AVAILABILITY AND IMPLEMENTATION: Lep-Anchor is available with the source code under GNU general public license from http://sourceforge.net/projects/lep-anchor. All the scripts and code used to produce the reported results are included with Lep-Anchor.
Affiliation(s)
- Pasi Rastas
- Institute of Biotechnology, HiLIFE, University of Helsinki, 00014 Helsinki, Finland
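Illustrative note on entry 22: the abstract defines accuracy as the rate of contigs in correct orientation and completeness as the number of contigs with inferred orientation. The Python sketch below computes both for a toy example. It is not Lep-Anchor code; the function name, the "+"/"-" encoding and the choice to measure accuracy only over contigs that received an orientation are assumptions, since the abstract does not state the denominator.

```python
def anchoring_metrics(true_orientation, inferred_orientation):
    """Return (accuracy, completeness) for a set of anchored contigs.

    Both arguments map contig names to "+" or "-"; an inferred value of
    None means the tool left that contig unoriented.
    completeness: number of contigs with an inferred orientation.
    accuracy: fraction of those contigs whose orientation matches the truth
              (assumed denominator, see the note above).
    """
    oriented = {
        contig: strand
        for contig, strand in inferred_orientation.items()
        if strand is not None
    }
    completeness = len(oriented)
    correct = sum(
        1 for contig, strand in oriented.items()
        if true_orientation.get(contig) == strand
    )
    accuracy = correct / completeness if completeness else float("nan")
    return accuracy, completeness
```

With four hypothetical contigs of which three are oriented and two of those are correct, the function returns an accuracy of 2/3 and a completeness of 3. This makes the trade-off described in the abstract concrete: a tool can keep accuracy high by leaving uncertain contigs unoriented, at the cost of lower completeness.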
23
Garg S, Aach J, Li H, Sebenius I, Durbin R, Church G. A haplotype-aware de novo assembly of related individuals using pedigree sequence graph. Bioinformatics 2020; 36:2385-2392. [PMID: 31860070 DOI: 10.1093/bioinformatics/btz942] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 06/19/2019] [Revised: 11/23/2019] [Accepted: 12/18/2019] [Indexed: 01/11/2023] Open
Abstract
MOTIVATION: Reconstructing high-quality haplotype-resolved assemblies for related individuals has important applications in Mendelian diseases and population genomics. Through major genomics sequencing efforts such as the Personal Genome Project, the Vertebrate Genome Project (VGP) and the Genome in a Bottle project (GIAB), a variety of sequencing datasets from trios of diploid genomes are becoming available. Current trio assembly approaches are not designed to incorporate long- and short-read data from mother-father-child trios, and therefore require relatively high coverages of costly long-read data to produce high-quality assemblies. Thus, building a trio-aware assembler capable of producing accurate and chromosomal-scale diploid genomes of all individuals in a pedigree, while being cost-effective in terms of sequencing costs, is a pressing need of the genomics community. RESULTS: We present a novel pedigree sequence graph based approach to diploid assembly using accurate Illumina data and long-read Pacific Biosciences (PacBio) data from all related individuals, thereby generalizing our previous work on single individuals. We demonstrate the effectiveness of our pedigree approach on a simulated trio of pseudo-diploid yeast genomes with different heterozygosity rates, and real data from human chromosome. We show that we require as little as 30× coverage Illumina data and 15× PacBio data from each individual in a trio to generate chromosomal-scale phased assemblies. Additionally, we show that we can detect and phase variants from generated phased assemblies. AVAILABILITY AND IMPLEMENTATION: https://github.com/shilpagarg/WHdenovo.
Affiliation(s)
- Shilpa Garg
- Department of Genetics, Harvard Medical School
- Wyss Institute for Biologically Inspired Engineering, Harvard University
- John Aach
- Department of Genetics, Harvard Medical School
- Heng Li
- Department of Biomedical Informatics, Harvard Medical School, Boston
- Isaac Sebenius
- Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA, USA
- Richard Durbin
- Department of Genetics, University of Cambridge, Cambridge, UK
- George Church
- Department of Genetics, Harvard Medical School
- Wyss Institute for Biologically Inspired Engineering, Harvard University
24
Boudabous A, Tekaia F. Enhancing Bioinformatics and Genomics Courses: Building Capacity and Skills via Lab Meeting Activities: Fostering a Culture of Critical Capacities to Read, Write, Communicate and Engage in Rigorous Scientific Exchanges. Bioessays 2020; 42:e2000134. [PMID: 32830345 DOI: 10.1002/bies.202000134] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/03/2020] [Revised: 07/08/2020] [Indexed: 11/08/2022]
Abstract
Reading, writing, publishing, and publicly presenting scientific works are vital for a young researcher's profile building and career development. Generally, the traditional educational curricula do not offer training possibilities to learn and practice how to prepare, write, and present scientific works. These are rather a part of lab meeting activities in research groups. The lack of such training is more critical in some developing countries because this adds to the rare opportunities to discuss and become involved in the exchanges on state of the art scientific literature. Here the authors relate their experience in introducing a weekly 1-day lab meeting in the framework of two previously organized 3-month courses on "Bioinformatics and Genome Analyses". The main activities which are developed during these lab meetings include scientific literature follow up as well as preparing and presenting oral and written scientific reviews. These activities prove to be useful for a student's self-confidence building, for enhancing their active participation during the lectures and practical sessions, as well as for the positive impact on running the whole course program. Incorporation of such lab meeting activities in the course program significantly improves the capacity building of the participants, their analytical and critical reading of scientific literature, as well as communication skills. In this work it is shown how to proceed with the different steps involved in the implementation of lab meeting activities, and to recommend their regular institution in similar courses.
Affiliation(s)
- Abdellatif Boudabous
- Faculté des Sciences de Tunis, Campus Universitaire El-Manar, El Manar, Tunis, 2092, Tunisia
- Fredj Tekaia
- Institut Pasteur Paris, 28, rue du Dr Roux, 75724, Paris, Cedex, 15, France
25
Gibson KM, Steiner MC, Rentia U, Bendall ML, Pérez-Losada M, Crandall KA. Validation of Variant Assembly Using HAPHPIPE with Next-Generation Sequence Data from Viruses. Viruses 2020; 12:E758. [PMID: 32674515 PMCID: PMC7412389 DOI: 10.3390/v12070758] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 05/28/2020] [Revised: 07/03/2020] [Accepted: 07/06/2020] [Indexed: 01/04/2023] Open
Abstract
Next-generation sequencing (NGS) offers a powerful opportunity to identify low-abundance, intra-host viral sequence variants, yet the focus of many bioinformatic tools on consensus sequence construction has precluded a thorough analysis of intra-host diversity. To take full advantage of the resolution of NGS data, we developed HAplotype PHylodynamics PIPEline (HAPHPIPE), an open-source tool for the de novo and reference-based assembly of viral NGS data, with both consensus sequence assembly and a focus on the quantification of intra-host variation through haplotype reconstruction. We validate and compare the consensus sequence assembly methods of HAPHPIPE to those of two alternative software packages, HyDRA and Geneious, using simulated HIV and empirical HIV, HCV, and SARS-CoV-2 datasets. Our validation methods included read mapping, genetic distance, and genetic diversity metrics. In simulated NGS data, HAPHPIPE generated pol consensus sequences significantly closer to the true consensus sequence than those produced by HyDRA and Geneious and performed comparably to Geneious for HIV gp120 sequences. Furthermore, using empirical data from multiple viruses, we demonstrate that HAPHPIPE can analyze larger sequence datasets due to its greater computational speed. Therefore, we contend that HAPHPIPE provides a more user-friendly platform for users with and without bioinformatics experience to implement current best practices for viral NGS assembly than other currently available options.
Affiliation(s)
- Keylie M. Gibson
  - Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, Washington, DC 20052, USA
- Margaret C. Steiner
  - Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, Washington, DC 20052, USA
- Uzma Rentia
  - Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, Washington, DC 20052, USA
- Matthew L. Bendall
  - Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, Washington, DC 20052, USA
- Marcos Pérez-Losada
  - Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, Washington, DC 20052, USA
  - Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, The George Washington University, Washington, DC 20052, USA
  - CIBIO-InBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, Universidade do Porto, Campus Agrário de Vairão, 4169-007 Vairão, Portugal
- Keith A. Crandall
  - Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, Washington, DC 20052, USA
  - Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, The George Washington University, Washington, DC 20052, USA
|
26
|
Medvedev P. Modeling biological problems in computer science: a case study in genome assembly. Brief Bioinform 2020; 20:1376-1383. [PMID: 29394324 DOI: 10.1093/bib/bby003] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 09/01/2017] [Revised: 12/07/2017] [Indexed: 11/14/2022] Open
Abstract
As computer scientists working in bioinformatics/computational biology, we often face the challenge of coming up with an algorithm to answer a biological question. This occurs in many areas, such as variant calling, alignment and assembly. In this tutorial, we use the example of the genome assembly problem to demonstrate how to go from a question in the biological realm to a solution in the computer science realm. We show the modeling process step-by-step, including all the intermediate failed attempts. Please note this is not an introduction to how genome assembly algorithms work and, if treated as such, would be incomplete and unnecessarily long-winded.
|
27
|
Jo J, Oh J, Park C. Microbial community analysis using high-throughput sequencing technology: a beginner's guide for microbiologists. J Microbiol 2020; 58:176-192. [PMID: 32108314 DOI: 10.1007/s12275-020-9525-5] [Citation(s) in RCA: 29] [Impact Index Per Article: 7.3] [Received: 11/05/2019] [Revised: 12/11/2019] [Accepted: 12/16/2019] [Indexed: 12/19/2022]
Abstract
Microbial communities present in diverse environments from deep seas to human body niches play significant roles in the complex ecosystem and human health. Characterizing their structural and functional diversities is indispensable, and many approaches, such as microscopic observation, DNA fingerprinting, and PCR-based marker gene analysis, have been successfully applied to identify microorganisms. Since the revolutionary improvement of DNA sequencing technologies, direct and high-throughput analysis of genomic DNA from a whole environmental community without prior cultivation has become the mainstream approach, overcoming the constraints of the classical approaches. Here, we first briefly review the history of environmental DNA analysis applications with a focus on profiling the taxonomic composition and functional potentials of microbial communities. To this end, we aim to introduce the shotgun metagenomic sequencing (SMS) approach, which is used for the untargeted ("shotgun") sequencing of all ("meta") microbial genomes ("genomic") present in a sample. SMS data analyses are performed in silico using various software programs; however, in silico analysis is typically regarded as a burden on wet-lab experimental microbiologists. Therefore, in this review, we present microbiologists who are unfamiliar with in silico analyses with a basic and practical SMS data analysis protocol. This protocol covers all the bioinformatics processes of the SMS analysis in terms of data preprocessing, taxonomic profiling, functional annotation, and visualization.
Affiliation(s)
- Jihoon Jo
  - School of Biological Sciences and Technology, Chonnam National University, Gwangju, 61186, Republic of Korea
- Jooseong Oh
  - School of Biological Sciences and Technology, Chonnam National University, Gwangju, 61186, Republic of Korea
- Chungoo Park
  - School of Biological Sciences and Technology, Chonnam National University, Gwangju, 61186, Republic of Korea
|
28
|
Mai D, Nalley MJ, Bachtrog D. Patterns of Genomic Differentiation in the Drosophila nasuta Species Complex. Mol Biol Evol 2020; 37:208-220. [PMID: 31556453 PMCID: PMC6984368 DOI: 10.1093/molbev/msz215] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Indexed: 12/28/2022] Open
Abstract
The Drosophila nasuta species complex contains over a dozen recently diverged species that are distributed widely across South-East Asia and that show varying degrees of pre- and postzygotic isolation. Here, we assemble a high-quality genome for D. albomicans using single-molecule sequencing and chromatin conformation capture, and draft genomes for 11 additional species and 67 individuals across the clade, to infer the species phylogeny and patterns of genetic diversity in this group. Our assembly recovers entire chromosomes, and we date the origin of this radiation ∼2 Ma. Despite low levels of overall differentiation, most species or subspecies show clear clustering into their designated taxonomic groups using population genetics and phylogenetic methods. Local evolutionary history is heterogeneous across the genome, and differs between the autosomes and the X chromosome for species in the sulfurigaster subgroup, likely due to autosomal introgression. Our study establishes the nasuta species complex as a promising model system to further characterize the evolution of pre- and postzygotic isolation in this clade.
Affiliation(s)
- Dat Mai
  - Department of Integrative Biology, University of California Berkeley, Berkeley, CA
- Matthew J Nalley
  - Department of Integrative Biology, University of California Berkeley, Berkeley, CA
- Doris Bachtrog
  - Department of Integrative Biology, University of California Berkeley, Berkeley, CA
|
29
|
Hofreiter M, Hartmann S. Reconstructing protein-coding sequences from ancient DNA. Methods Enzymol 2020; 642:21-33. [DOI: 10.1016/bs.mie.2020.05.008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/24/2022]
|
30
|
Goel M, Sun H, Jiao WB, Schneeberger K. SyRI: finding genomic rearrangements and local sequence differences from whole-genome assemblies. Genome Biol 2019; 20:277. [PMID: 31842948 PMCID: PMC6913012 DOI: 10.1186/s13059-019-1911-0] [Citation(s) in RCA: 265] [Impact Index Per Article: 53.0] [Received: 04/11/2019] [Accepted: 12/02/2019] [Indexed: 01/27/2023] Open
Abstract
Genomic differences range from single nucleotide differences to complex structural variations. Current methods typically annotate sequence differences ranging from SNPs to large indels accurately but do not unravel the full complexity of structural rearrangements, including inversions, translocations, and duplications, where highly similar sequence changes in location, orientation, or copy number. Here, we present SyRI, a pairwise whole-genome comparison tool for chromosome-level assemblies. SyRI starts by finding rearranged regions and then searches for differences in the sequences, which are distinguished for residing in syntenic or rearranged regions. This distinction is important as rearranged regions are inherited differently compared to syntenic regions.
Affiliation(s)
- Manish Goel
  - Max Planck Institute for Plant Breeding Research, 50829 Cologne, Germany
- Hequan Sun
  - Max Planck Institute for Plant Breeding Research, 50829 Cologne, Germany
- Wen-Biao Jiao
  - Max Planck Institute for Plant Breeding Research, 50829 Cologne, Germany
- Korbinian Schneeberger
  - Max Planck Institute for Plant Breeding Research, 50829 Cologne, Germany
  - Faculty of Biology, LMU Munich, 82152 Planegg-Martinsried, Germany
|
31
|
Grigoreva E, Ulianich P, Ben C, Gentzbittel L, Potokina E. First Insights into the Guar (Cyamopsis tetragonoloba (L.) Taub.) Genome of the ‘Vavilovskij 130’ Accession, Using Second and Third-Generation Sequencing Technologies. RUSS J GENET+ 2019. [DOI: 10.1134/s102279541911005x] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Indexed: 12/14/2022]
|
32
|
Guo J, Quensen JF, Sun Y, Wang Q, Brown CT, Cole JR, Tiedje JM. Review, Evaluation, and Directions for Gene-Targeted Assembly for Ecological Analyses of Metagenomes. Front Genet 2019; 10:957. [PMID: 31749830 PMCID: PMC6843070 DOI: 10.3389/fgene.2019.00957] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 04/23/2019] [Accepted: 09/09/2019] [Indexed: 12/28/2022] Open
Abstract
Shotgun metagenomics has greatly advanced our understanding of microbial communities over the last decade. Metagenomic analyses often include assembly and genome binning, computationally daunting tasks especially for big data from complex environments such as soil and sediments. In many studies, however, only a subset of genes and pathways involved in specific functions are of interest; thus, it is not necessary to attempt global assembly. In addition, methods that target genes can be computationally more efficient and produce more accurate assembly by leveraging rich databases, especially for those genes that are of broad interest such as those involved in biogeochemical cycles, biodegradation, and antibiotic resistance or used as phylogenetic markers. Here, we review six gene-targeted assemblers with unique algorithms for extracting and/or assembling targeted genes: Xander, MegaGTA, SAT-Assembler, HMM-GRASPx, GenSeed-HMM, and MEGAN. We tested these tools using two datasets with known genomes (a synthetic community of artificial reads derived from the genomes of 17 bacteria, and shotgun sequence data from a mock community with 48 bacteria and 16 archaea genomes) and a large soil shotgun metagenomic dataset. We compared assemblies of a universal single copy gene (rplB) and two N cycle genes (nifH and nirK). We measured their computational efficiency, sensitivity, specificity, and chimera rate and found that Xander and MegaGTA, which both use a probabilistic graph structure to model the genes, have the best overall performance with all three datasets, although MEGAN, a reference matching assembler, had better sensitivity with synthetic and mock community members chosen from its reference collection. Also, Xander and MegaGTA are the only tools that include post-assembly scripts tuned for common molecular ecology and diversity analyses. Additionally, we provide a mathematical model that estimates the probability of assembling targeted genes in a metagenome, which can be used to estimate the required sequencing depth.
Affiliation(s)
- Jiarong Guo
  - Center for Microbial Ecology, Michigan State University, East Lansing, MI, United States
- John F. Quensen
  - Center for Microbial Ecology, Michigan State University, East Lansing, MI, United States
- Yanni Sun
  - Department of Electronical Engineering, City University of Hong Kong, Kowloon, Hong Kong
- Qiong Wang
  - Center for Microbial Ecology, Michigan State University, East Lansing, MI, United States
- C. Titus Brown
  - Department of Population Health and Reproduction, University of California, Davis, Davis, CA, United States
- James R. Cole
  - Center for Microbial Ecology, Michigan State University, East Lansing, MI, United States
- James M. Tiedje
  - Center for Microbial Ecology, Michigan State University, East Lansing, MI, United States
|
33
|
|
34
|
Klinger CM, Richardson E. Small Genomes and Big Data: Adaptation of Plastid Genomics to the High-Throughput Era. Biomolecules 2019; 9:E299. [PMID: 31344945 PMCID: PMC6723049 DOI: 10.3390/biom9080299] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 06/27/2019] [Revised: 07/15/2019] [Accepted: 07/16/2019] [Indexed: 12/17/2022] Open
Abstract
Plastid genome sequences are becoming more readily available with the increase in high-throughput sequencing, and whole-organelle genetic data is available for algae and plants from across the diversity of photosynthetic eukaryotes. This has provided incredible opportunities for studying species which may not be amenable to in vivo study or genetic manipulation or may not yet have been cultured. Research into plastid genomes has pushed the limits of what can be deduced from genomic information, and in particular genomic information obtained from public databases. In this Review, we discuss how research into plastid genomes has benefitted enormously from the explosion of publicly available genome sequence. We describe two case studies in how using publicly available gene data has supported previously held hypotheses about plastid traits from lineage-restricted experiments across algal and plant diversity. We propose how this approach could be used across disciplines for inferring functional and biological characteristics from genomic approaches, including integration of new computational and bioinformatic approaches such as machine learning. We argue that the techniques developed to gain the maximum possible insight from plastid genomes can be applied across the eukaryotic tree of life.
Affiliation(s)
- Christen M Klinger
  - Division of Infectious Diseases, Department of Medicine, University of Alberta, Edmonton, AB T6G 2R3, Canada
- Elisabeth Richardson
  - Department of Biological Sciences, University of Alberta, Edmonton, AB T6G 2R3, Canada
|
35
|
Abstract
The computational reconstruction of genome sequences from shotgun sequencing data has been greatly simplified by the advent of sequencing technologies that generate long reads. In the case of relatively small genomes (e.g., bacterial or viral), complete genome sequences can frequently be reconstructed computationally without the need for further experiments. However, large and complex genomes, such as those of most animals and plants, continue to pose significant challenges. In such genomes, assembly software produces incomplete and fragmented reconstructions that require additional experimentally derived information and manual intervention in order to reconstruct individual chromosome arms. Recent technologies originally designed to capture chromatin structure have been shown to effectively complement sequencing data, leading to much more contiguous reconstructions of genomes than previously possible. Here, we survey these technologies and the algorithms used to assemble and analyze large eukaryotic genomes, placed within the historical context of genome scaffolding technologies that have been in existence since the dawn of the genomic era.
Affiliation(s)
- Jay Ghurye
  - Department of Computer Science and Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, United States of America
- Mihai Pop
  - Department of Computer Science and Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, United States of America
|
36
|
Jayakumar V, Sakakibara Y. Comprehensive evaluation of non-hybrid genome assembly tools for third-generation PacBio long-read sequence data. Brief Bioinform 2019; 20:866-876. [PMID: 29112696 PMCID: PMC6585154 DOI: 10.1093/bib/bbx147] [Citation(s) in RCA: 51] [Impact Index Per Article: 10.2] [Received: 06/29/2017] [Revised: 09/22/2017] [Indexed: 12/20/2022] Open
Abstract
Long reads obtained from third-generation sequencing platforms can help overcome the long-standing challenge of the de novo assembly of sequences for the genomic analysis of non-model eukaryotic organisms. Numerous long-read-aided de novo assemblies have been published recently, which exhibited superior quality of the assembled genomes in comparison with those achieved using earlier second-generation sequencing technologies. Evaluating assemblies is important in guiding the appropriate choice for specific research needs. In this study, we evaluated 10 long-read assemblers using a variety of metrics on Pacific Biosciences (PacBio) data sets from different taxonomic categories with considerable differences in genome size. The results allowed us to narrow down the list to a few assemblers that can be effectively applied to eukaryotic assembly projects. Moreover, we highlight how best to use limited genomic resources for effectively evaluating the genome assemblies of non-model organisms.
|
37
|
Pucker B, Holtgräwe D, Stadermann KB, Frey K, Huettel B, Reinhardt R, Weisshaar B. A chromosome-level sequence assembly reveals the structure of the Arabidopsis thaliana Nd-1 genome and its gene set. PLoS One 2019; 14:e0216233. [PMID: 31112551 PMCID: PMC6529160 DOI: 10.1371/journal.pone.0216233] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Received: 12/12/2018] [Accepted: 04/16/2019] [Indexed: 01/27/2023] Open
Abstract
In addition to the BAC-based reference sequence of the accession Columbia-0 from the year 2000, several short-read assemblies of the plant model organism Arabidopsis thaliana have been published in recent years. Also, a SMRT-based assembly of Landsberg erecta has been generated that identified translocation and inversion polymorphisms between two genotypes of the species. Here we provide a chromosome-arm level assembly of the A. thaliana accession Niederzenz-1 (AthNd-1_v2c) based on SMRT sequencing data. The best assembly comprises 69 nucleome sequences and displays a contig length of up to 16 Mbp. Compared to an earlier Illumina short read-based NGS assembly (AthNd-1_v1), a 75-fold increase in contiguity was observed for AthNd-1_v2c. To assign contig locations independently of the Col-0 gold standard reference sequence, we used genetic anchoring to generate a de novo assembly. In addition, we assembled the chondrome and plastome sequences. Detailed analyses of AthNd-1_v2c allowed reliable identification of large genomic rearrangements between A. thaliana accessions contributing to differences in the gene sets that distinguish the genotypes. One of the differences detected identified a gene that is lacking from the Col-0 gold standard sequence. This de novo assembly extends the known proportion of the A. thaliana pan-genome.
Affiliation(s)
- Boas Pucker
  - Bielefeld University, Faculty of Biology & Center for Biotechnology, Bielefeld, Germany
- Daniela Holtgräwe
  - Bielefeld University, Faculty of Biology & Center for Biotechnology, Bielefeld, Germany
- Kai Bernd Stadermann
  - Bielefeld University, Faculty of Biology & Center for Biotechnology, Bielefeld, Germany
- Katharina Frey
  - Bielefeld University, Faculty of Biology & Center for Biotechnology, Bielefeld, Germany
- Bruno Huettel
  - Max Planck Genome Centre Cologne, Max Planck Institute for Plant Breeding Research, Cologne, Germany
- Richard Reinhardt
  - Max Planck Genome Centre Cologne, Max Planck Institute for Plant Breeding Research, Cologne, Germany
- Bernd Weisshaar
  - Bielefeld University, Faculty of Biology & Center for Biotechnology, Bielefeld, Germany
|
38
|
A comparative analysis of methods for de novo assembly of hymenopteran genomes using either haploid or diploid samples. Sci Rep 2019; 9:6480. [PMID: 31019201 PMCID: PMC6482151 DOI: 10.1038/s41598-019-42795-6] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 11/14/2018] [Accepted: 04/04/2019] [Indexed: 01/05/2023] Open
Abstract
Diverse invertebrate taxa including all 200,000 species of Hymenoptera (ants, bees, wasps and sawflies) have a haplodiploid sex determination system, where females are diploid and males are haploid. Thus, hymenopteran genome projects can make use of DNA from a single haploid male sample, which is assumed advantageous for genome assembly. For the purpose of gene annotation, transcriptome sequencing is usually conducted using RNA from a pool of individuals. We conducted a comparative analysis of genome and transcriptome assembly and annotation methods, using genetic sources of different ploidy: (1) DNA from a haploid male or a diploid female (2) RNA from the same haploid male or a pool of individuals. We predicted that the use of a haploid male as opposed to a diploid female will simplify the genome assembly and gene annotation thanks to the lack of heterozygosity. Using DNA and RNA from the same haploid individual is expected to provide better confidence in transcript-to-genome alignment, and improve the annotation of gene structure in terms of the exon/intron boundaries. The haploid genome assemblies proved to be more contiguous, with both contig and scaffold N50 size at least threefold greater than their diploid counterparts. Completeness evaluation showed mixed results. The SOAPdenovo2 diploid assembly was missing more genes than the haploid assembly. The SPAdes diploid assembly had more complete genes, but a higher level of duplicates, and a greatly overestimated genome size. When aligning the two transcriptomes against the male genome, the male transcriptome gave 2–3% more complete transcripts than the pool transcriptome for genes with comparable expression levels in both transcriptomes. However, this advantage disappears in the final results of the gene annotation pipeline that incorporates evidence from homologous proteins. The RNA pool is still required to obtain the full transcriptome with genes that are expressed in other life stages and castes. 
In conclusion, the use of a haploid source material for a de novo genome project provides a substantial advantage to the quality of the genome draft and the use of RNA from the same haploid individual for transcriptome to genome alignment provides a minor advantage for genes that are expressed in the adult male.
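The contig and scaffold N50 values compared in this abstract follow the usual definition: the length L such that sequences of length L or longer make up at least half of the total assembly length. A minimal Python sketch of that calculation is given below; the function name `n50` and the example lengths are illustrative assumptions and are not taken from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    N50 is the length of the shortest sequence in the smallest set of
    longest sequences that together cover at least half the assembly.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating covered bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    # Hypothetical contig lengths (in kb), for illustration only.
    fragmented = [2, 3, 4, 5, 6, 7, 8, 9, 10]
    contiguous = [30, 14, 10]
    print("fragmented N50:", n50(fragmented))
    print("contiguous N50:", n50(contiguous))
```

Both example assemblies total 54 kb, so the difference in N50 reflects contiguity alone, which is the kind of comparison the abstract reports between haploid and diploid assemblies.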
|
39
|
Tian S, Yan H, Klee EW, Kalmbach M, Slager SL. Comparative analysis of de novo assemblers for variation discovery in personal genomes. Brief Bioinform 2019; 19:893-904. [PMID: 28407084 PMCID: PMC6169673 DOI: 10.1093/bib/bbx037] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 01/12/2016] [Accepted: 03/08/2017] [Indexed: 12/30/2022] Open
Abstract
Current variant discovery approaches often rely on an initial read mapping to the reference sequence. Their effectiveness is limited by the presence of gaps, potential misassemblies, regions of duplicates with a high-sequence similarity and regions of high-sequence divergence in the reference. Also, mapping-based approaches are less sensitive to large INDELs and complex variations and provide little phase information in personal genomes. A few de novo assemblers have been developed to identify variants through direct variant calling from the assembly graph, micro-assembly and whole-genome assembly, but mainly for whole-genome sequencing (WGS) data. We developed SGVar, a de novo assembly workflow for haplotype-based variant discovery from whole-exome sequencing (WES) data. Using simulated human exome data, we compared SGVar with five variation-aware de novo assemblers and with BWA-MEM together with three haplotype- or local de novo assembly-based callers. SGVar outperforms the other assemblers in sensitivity and tolerance of sequencing errors. We recapitulated the findings on whole-genome and exome data from a Utah residents with Northern and Western European ancestry (CEU) trio, showing that SGVar had high sensitivity both in the highly divergent human leukocyte antigen (HLA) region and in non-HLA regions of chromosome 6. In particular, SGVar is robust to sequencing error, k-mer selection, divergence level and coverage depth. Unlike mapping-based approaches, SGVar is capable of resolving long-range phase and identifying large INDELs from WES, more prominently from WGS. We conclude that SGVar represents an ideal platform for WES-based variant discovery in highly divergent regions and across the whole genome.
Affiliation(s)
- Shulan Tian
  - Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Huihuang Yan
  - Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Eric W Klee
  - Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
  - Center for Individualized Medicine Bioinformatics Program, Mayo Clinic, USA
- Michael Kalmbach
  - Division of Information Management and Analytics, Department of Information Technology, Mayo Clinic, USA
- Susan L Slager
  - Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
|
40
|
Abstract
Affordable, high-throughput DNA sequencing has accelerated the pace of genome assembly over the past decade. Genome assemblies from high-throughput, short-read sequencing, however, are often not as contiguous as the first generation of genome assemblies. Whereas early genome assembly projects were often aided by clone maps or other mapping data, many current assembly projects forego these scaffolding data and only assemble genomes into smaller segments. Recently, new technologies have been invented that allow chromosome-scale assembly at a lower cost and faster speed than traditional methods. Here, we give an overview of the problem of chromosome-scale assembly and traditional methods for tackling this problem. We then review new technologies for chromosome-scale assembly and recent genome projects that used these technologies to create highly contiguous genome assemblies at low cost.
Affiliation(s)
- Edward S. Rice
  - Department of Biomolecular Engineering, University of California, Santa Cruz, California 95064, USA
- Richard E. Green
  - Department of Biomolecular Engineering, University of California, Santa Cruz, California 95064, USA
  - Dovetail Genomics, LLC, Santa Cruz, California 95060, USA
|
41
|
Sohn JI, Nam JW. The present and future of de novo whole-genome assembly. Brief Bioinform 2018; 19:23-40. [PMID: 27742661 DOI: 10.1093/bib/bbw096] [Citation(s) in RCA: 75] [Impact Index Per Article: 12.5] [Received: 06/03/2016] [Indexed: 12/15/2022] Open
Abstract
Since the advent of next-generation sequencing (NGS) technology, various de novo assembly algorithms based on the de Bruijn graph have been developed to construct chromosome-level sequences. However, numerous technical and computational challenges in de novo assembly remain, although many bright ideas and heuristics have been suggested to tackle the challenges in both experimental and computational settings. In this review, we categorize de novo assemblers on the basis of the type of de Bruijn graphs (Hamiltonian and Eulerian) and discuss the challenges of de novo assembly for short NGS reads regarding computational complexity and assembly ambiguity. Then, we discuss how the limitations of the short reads can be overcome by using a single-molecule sequencing platform that generates long reads of up to several kilobases. In fact, long-read assembly has caused a paradigm shift in whole-genome assembly in terms of algorithms and supporting steps. We also summarize (i) hybrid assemblies using both short and long reads and (ii) overlap-based assemblies for long reads and discuss their challenges and future prospects. This review provides guidelines to determine the optimal approach for a given input data type, computational budget or genome.
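The distinction between the two de Bruijn graph types named in this abstract can be made concrete with a toy example. In the Eulerian formulation each k-mer is an edge joining its (k-1)-mer prefix to its (k-1)-mer suffix, and an assembly corresponds to a path that uses every edge once; in the Hamiltonian formulation the k-mers themselves are the nodes and the path must visit every node once. The Python sketch below builds only the Eulerian form, under the simplifying assumptions of error-free reads and a single strand; the function names and reads are hypothetical and do not come from any assembler discussed in the review.

```python
from collections import defaultdict


def kmers(read, k):
    """Return all overlapping k-mers of a read, left to right."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn_edges(reads, k):
    """Build an Eulerian-style de Bruijn graph as an adjacency list.

    Nodes are (k-1)-mers; each distinct k-mer contributes one edge
    from its prefix to its suffix. Duplicate k-mers are collapsed.
    """
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for kmer in kmers(read, k):
            if kmer not in seen:
                seen.add(kmer)
                graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


if __name__ == "__main__":
    # Two hypothetical overlapping reads from the sequence ACGTCA.
    graph = de_bruijn_edges(["ACGTC", "GTCA"], k=3)
    for node, successors in graph.items():
        print(node, "->", successors)
```

Following the edges from `AC` through `CA` spells ACGTCA. Real assemblers add error correction, reverse-complement handling, and repeat resolution on top of this basic structure.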
|
42
|
SCOP: a novel scaffolding algorithm based on contig classification and optimization. Bioinformatics 2018; 35:1142-1150. [DOI: 10.1093/bioinformatics/bty773] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 12/20/2017] [Revised: 08/10/2018] [Accepted: 09/01/2018] [Indexed: 12/20/2022] Open
|
43
|
Duharcourt S, Sperling L. The Challenges of Genome-Wide Studies in a Unicellular Eukaryote With Two Nuclear Genomes. Methods Enzymol 2018; 612:101-126. [PMID: 30502938 DOI: 10.1016/bs.mie.2018.08.012] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 12/01/2022]
Abstract
We present here methods to study a eukaryotic microorganism with two nuclear genomes, both originating from the same zygotic genome. Paramecium, like other ciliates, is characterized by nuclear dimorphism, which is the presence of two types of nuclei with distinct organization and functions in the same cytoplasm. The two diploid germline micronuclei (MIC) undergo meiosis and fertilization to transmit the genetic information across sexual generations. The highly polyploid somatic macronucleus (MAC) contains a reduced version of the genome optimized for gene expression. Reproducible programmed DNA elimination of about 30% of the complexity of the 100Mb MIC genome occurs during development of the MAC along with endoreplication to 800 copies. Large regions that contain transposable elements and other repeats are eliminated, and short single copy remnants of transposable elements, which often interrupt coding sequences, are precisely excised to restore functional open reading frames. Genome-wide studies of this process require access to MIC DNA which has long been impossible. The breakthrough with respect to this technical obstacle came with development of a MIC purification protocol involving a critical step of flow cytometry to sort nuclei representing only 0.5% of total genomic DNA. Here, we provide a step-by-step protocol and important tips for purifying nuclei, and present the methods developed for downstream analysis of NGS data.
Affiliation(s)
- Sandra Duharcourt
- Institut Jacques Monod, CNRS, UMR7592, Sorbonne Paris Cité, Paris, France
- Linda Sperling
- Institute for Integrative Biology of the Cell (I2BC), CNRS, CEA, Univ. Paris-Sud, Université Paris-Saclay, Gif-sur-Yvette CEDEX, France
|
44
|
Li M, Tang L, Liao Z, Luo J, Wu F, Pan Y, Wang J. A novel scaffolding algorithm based on contig error correction and path extension. IEEE/ACM Trans Comput Biol Bioinform 2018; 16:764-773. [PMID: 30040649 DOI: 10.1109/tcbb.2018.2858267] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Indexed: 06/08/2023]
Abstract
The sequence assembly process can be divided into three stages: contig extension, scaffolding, and gap filling. Scaffolding is an essential step that infers the orientation and ordering relationships between contigs. However, scaffolding still faces the challenges of uneven sequencing depth, repetitive genome regions, and sequencing errors, which often lead to many false relationships between contigs. The performance of scaffolding can be improved by removing potential false connections between contigs. In this study, we present iLSLS, a novel scaffolding algorithm based on a Loose-Strict-Loose path extension strategy and contig error correction. iLSLS helps reduce false relationships between contigs and improves the accuracy of subsequent steps. iLSLS uses a scoring function that estimates the correctness of candidate paths from the distribution of paired reads, and it tries to extend along the highest-scoring path. In addition, iLSLS can precisely estimate gap sizes. We conduct experiments on two real datasets, and the results show that the LSLS strategy effectively increases the correctness of scaffolds and that iLSLS performs better than other scaffolding methods.
|
45
|
Loose MW. The potential impact of nanopore sequencing on human genetics. Hum Mol Genet 2018; 26:R202-R207. [PMID: 28977449 DOI: 10.1093/hmg/ddx287] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Received: 07/11/2017] [Accepted: 07/17/2017] [Indexed: 12/21/2022] Open
Abstract
Nanopore sequencing has been available to researchers for a little over 3 years. Recently, the milestone of sequencing and assembling a human genome on this platform was achieved for the first time. Significant improvements to the platform in yield and accuracy, coupled with higher throughput nanopore sequencers, mean that human genome sequencing at scale is now possible. Here, a brief recent history of the nanopore platform is provided, key papers and innovations are highlighted and some of the challenges for the future are discussed.
Affiliation(s)
- Matthew W Loose
- School of Life Sciences, University of Nottingham, Nottingham NG7 2UH, UK
|
46
|
Obscura Acosta N, Mäkinen V, Tomescu AI. A safe and complete algorithm for metagenomic assembly. Algorithms Mol Biol 2018; 13:3. [PMID: 29445416 PMCID: PMC5802251 DOI: 10.1186/s13015-018-0122-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 02/10/2017] [Accepted: 01/20/2018] [Indexed: 11/10/2022] Open
Abstract
Background: Reconstructing the genome of a species from short fragments is one of the oldest bioinformatics problems. Metagenomic assembly is a variant of the problem asking to reconstruct the circular genomes of all bacterial species present in a sequencing sample. This problem can be naturally formulated as finding a collection of circular walks of a directed graph G that together cover all nodes, or edges, of G. Approach: We address this problem with the “safe and complete” framework of Tomescu and Medvedev (Research in Computational Molecular Biology, 20th Annual Conference, RECOMB 9649:152–163, 2016). An algorithm is called safe if it returns only those walks (also called safe) that appear as subwalk in all metagenomic assembly solutions for G. A safe algorithm is called complete if it returns all safe walks of G. Results: We give graph-theoretic characterizations of the safe walks of G, and a safe and complete algorithm finding all safe walks of G. In the node-covering case, our algorithm runs in time $O(m^2 + n^3)$, and in the edge-covering case it runs in time $O(m^2 n)$; n and m denote the number of nodes and edges, respectively, of G. This algorithm constitutes the first theoretical tight upper bound on what can be safely assembled from metagenomic reads using this problem formulation.
|
47
|
Evans T, Johnson AD, Loose M. Virtual Genome Walking across the 32 Gb Ambystoma mexicanum genome; assembling gene models and intronic sequence. Sci Rep 2018; 8:618. [PMID: 29330416 PMCID: PMC5766544 DOI: 10.1038/s41598-017-19128-6] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 10/12/2017] [Accepted: 12/19/2017] [Indexed: 11/09/2022] Open
Abstract
Large repeat rich genomes present challenges for assembly using short read technologies. The 32 Gb axolotl genome is estimated to contain ~19 Gb of repetitive DNA making an assembly from short reads alone effectively impossible. Indeed, this model species has been sequenced to 20× coverage but the reads could not be conventionally assembled. Using an alternative strategy, we have assembled subsets of these reads into scaffolds describing over 19,000 gene models. We call this method Virtual Genome Walking as it locally assembles whole genome reads based on a reference transcriptome, identifying exons and iteratively extending them into surrounding genomic sequence. These assemblies are then linked and refined to generate gene models including upstream and downstream genomic, and intronic, sequence. Our assemblies are validated by comparison with previously published axolotl bacterial artificial chromosome (BAC) sequences. Our analyses of axolotl intron length, intron-exon structure, repeat content and synteny provide novel insights into the genic structure of this model species. This resource will enable new experimental approaches in axolotl, such as ChIP-Seq and CRISPR and aid in future whole genome sequencing efforts. The assembled sequences and annotations presented here are freely available for download from https://tinyurl.com/y8gydc6n . The software pipeline is available from https://github.com/LooseLab/iterassemble .
Affiliation(s)
- Teri Evans
- School of Life Sciences, University of Nottingham, Nottingham, NG7 2UH, UK
- Andrew D Johnson
- School of Life Sciences, University of Nottingham, Nottingham, NG7 2UH, UK
- Matthew Loose
- School of Life Sciences, University of Nottingham, Nottingham, NG7 2UH, UK
|
48
|
Nakato R, Shirahige K. Recent advances in ChIP-seq analysis: from quality management to whole-genome annotation. Brief Bioinform 2017; 18:279-290. [PMID: 26979602 PMCID: PMC5444249 DOI: 10.1093/bib/bbw023] [Citation(s) in RCA: 78] [Impact Index Per Article: 11.1] [Received: 11/24/2015] [Indexed: 02/06/2023] Open
Abstract
Chromatin immunoprecipitation followed by sequencing (ChIP-seq) analysis can detect protein/DNA-binding and histone-modification sites across an entire genome. Recent advances in sequencing technologies and analyses enable us to compare hundreds of samples simultaneously; such large-scale analysis has potential to reveal the high-dimensional interrelationship level for regulatory elements and annotate novel functional genomic regions de novo. Because many experimental considerations are relevant to the choice of a method in a ChIP-seq analysis, the overall design and quality management of the experiment are of critical importance. This review offers guiding principles of computation and sample preparation for ChIP-seq analyses, highlighting the validity and limitations of the state-of-the-art procedures at each step. We also discuss the latest challenges of single-cell analysis that will encourage a new era in this field.
Affiliation(s)
- Ryuichiro Nakato
- Research Center for Epigenetic Disease, Institute of Molecular and Cellular Biosciences, The University of Tokyo, Tokyo, Japan
- Katsuhiko Shirahige
- Research Center for Epigenetic Disease, Institute of Molecular and Cellular Biosciences, The University of Tokyo, Tokyo, Japan; Core Research for Evolutional Science and Technology (CREST), Japan Science and Technology Agency, Kawaguchi, Japan
|
49
|
Quince C, Walker AW, Simpson JT, Loman NJ, Segata N. Shotgun metagenomics, from sampling to analysis. Nat Biotechnol 2017; 35:833-844. [PMID: 28898207 DOI: 10.1038/nbt.3935] [Citation(s) in RCA: 825] [Impact Index Per Article: 117.9] [Received: 08/17/2015] [Accepted: 07/12/2017] [Indexed: 02/06/2023]
Abstract
Diverse microbial communities of bacteria, archaea, viruses and single-celled eukaryotes have crucial roles in the environment and in human health. However, microbes are frequently difficult to culture in the laboratory, which can confound cataloging of members and understanding of how communities function. High-throughput sequencing technologies and a suite of computational pipelines have been combined into shotgun metagenomics methods that have transformed microbiology. Still, computational approaches to overcome the challenges that affect both assembly-based and mapping-based metagenomic profiling, particularly of high-complexity samples or environments containing organisms with limited similarity to sequenced genomes, are needed. Understanding the functions and characterizing specific strains of these communities offers biotechnological promise in therapeutic discovery and innovative ways to synthesize products using microbial factories and can pinpoint the contributions of microorganisms to planetary, animal and human health.
|
50
|
Rastas P. Lep-MAP3: robust linkage mapping even for low-coverage whole genome sequencing data. Bioinformatics 2017; 33:3726-3732. [DOI: 10.1093/bioinformatics/btx494] [Citation(s) in RCA: 208] [Impact Index Per Article: 29.7] [Received: 04/24/2017] [Accepted: 08/01/2017] [Indexed: 11/13/2022] Open
Affiliation(s)
- Pasi Rastas
- Department of Zoology, Butterfly Genetics Group, University of Cambridge, Cambridge, UK
- Department of Biosciences, Ecological Genetics Research Unit, University of Helsinki, Helsinki, Finland
|