1. Betschart RO, Riccio C, Aguilera-Garcia D, Blankenberg S, Guo L, Moch H, Seidl D, Solleder H, Thalén F, Thiéry A, Twerenbold R, Zeller T, Zoche M, Ziegler A. Biostatistical Aspects of Whole Genome Sequencing Studies: Preprocessing and Quality Control. Biom J 2024; 66:e202300278. [PMID: 38988195 DOI: 10.1002/bimj.202300278] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/09/2023] [Revised: 03/21/2024] [Accepted: 05/14/2024] [Indexed: 07/12/2024]
Abstract
Rapid advances in high-throughput DNA sequencing technologies have enabled large-scale whole genome sequencing (WGS) studies. Before performing association analysis between phenotypes and genotypes, preprocessing and quality control (QC) of the raw sequence data need to be performed. Because many biostatisticians have not worked with WGS data so far, we first sketch Illumina's short-read sequencing technology. Second, we explain the general preprocessing pipeline for WGS studies. Third, we provide an overview of important QC metrics that are applied to WGS data at four stages: on the raw data, after mapping and alignment, after variant calling, and after multisample variant calling. Fourth, we illustrate the QC with the data from the GENEtic SequencIng Study Hamburg-Davos (GENESIS-HD), a study involving more than 9000 human whole genomes. All samples were sequenced on an Illumina NovaSeq 6000 with an average coverage of 35× using a PCR-free protocol. For QC, one Genome in a Bottle (GIAB) trio was sequenced in four replicates, and one GIAB sample was successfully sequenced 70 times in different runs. Fifth, we provide empirical data on the compression of raw data using the DRAGEN original read archive (ORA). The most important quality metrics in the application were genetic similarity, sample cross-contamination, deviations from the expected Het/Hom ratio, relatedness, and coverage. The compression ratio of the raw files using DRAGEN ORA was 5.6:1, and compression time scaled linearly with genome coverage. In summary, the preprocessing, joint calling, and QC of large WGS studies are feasible within a reasonable time, and efficient QC procedures are readily available.
Affiliation(s)
- Domingo Aguilera-Garcia
- Institute of Pathology and Molecular Pathology, University Hospital Zurich, Zurich, Switzerland
- Stefan Blankenberg
- Cardio-CARE, Medizincampus Davos, Davos, Switzerland
- Department of Cardiology, University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
- Center for Population Health Innovation (POINT), University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
- Linlin Guo
- Department of Cardiology, University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
- Holger Moch
- Institute of Pathology and Molecular Pathology, University Hospital Zurich, Zurich, Switzerland
- Dagmar Seidl
- Institute of Pathology and Molecular Pathology, University Hospital Zurich, Zurich, Switzerland
- Hugo Solleder
- Cardio-CARE, Medizincampus Davos, Davos, Switzerland
- Felix Thalén
- Cardio-CARE, Medizincampus Davos, Davos, Switzerland
- Raphael Twerenbold
- Department of Cardiology, University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
- Center for Population Health Innovation (POINT), University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
- German Center for Cardiovascular Research (DZHK), partner site Hamburg/Kiel/Lübeck, Hamburg, Germany
- Tanja Zeller
- Department of Cardiology, University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
- Center for Population Health Innovation (POINT), University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
- German Center for Cardiovascular Research (DZHK), partner site Hamburg/Kiel/Lübeck, Hamburg, Germany
- Martin Zoche
- Institute of Pathology and Molecular Pathology, University Hospital Zurich, Zurich, Switzerland
- Andreas Ziegler
- Cardio-CARE, Medizincampus Davos, Davos, Switzerland
- Department of Cardiology, University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
- Center for Population Health Innovation (POINT), University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
- School of Mathematics, Statistics and Computer Science, University of KwaZulu-Natal, Pietermaritzburg, South Africa
2. Hou J, Ji X, Chu X, Shi Z, Wang B, Sun K, Wei H, Song Z, Wen F. Comprehensive lipidomic analysis revealed the effects of fermented Morus alba L. intake on lipid profile in backfat and muscle tissue of Yuxi black pigs. J Anim Physiol Anim Nutr (Berl) 2024; 108:764-777. [PMID: 38305489 DOI: 10.1111/jpn.13932] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/23/2023] [Revised: 11/08/2023] [Accepted: 01/18/2024] [Indexed: 02/03/2024]
Abstract
Mulberry leaf is a widely used protein feed and is often used to reduce feed costs and improve meat quality in the livestock industry. However, to date, there is a lack of research on the improvement of meat quality using mulberry leaves, and the exact mechanisms are not yet known. The results showed that fermented mulberry leaves significantly reduced backfat content but had no significant effect on intramuscular fat (IMF). Lipidomic analysis identified 98 and 303 differential lipid molecules (p < 0.05) in adipose and muscle tissues, respectively, including triglycerides (TG), phosphatidylcholine, phosphatidylethanolamine, and sphingolipids, with TG especially prominent; therefore, we analysed the acyl carbon atom number of TG. In adipose tissue, TG acyl groups containing 13 carbon atoms (C13) were significantly upregulated, whereas C15, C16, C17, and C23 were significantly downregulated; in muscle tissue, C12, C19, C23, C25, and C26 in TG were significantly downregulated. Acyl changes in TG thus differed by carbon atom number and by tissue. We found that the correlations of C (14-18) were higher in adipose tissue, whereas the correlations of C (18-26) were higher in muscle tissue. Through pathway enrichment analysis, we identified six and four metabolic pathways with the highest contributions of differential lipid metabolites in adipose and muscle tissues, respectively. These findings suggest that fermented mulberry leaves improve meat quality mainly by inhibiting TG deposition through downregulation of medium- and short-chain fatty acids in backfat tissue and long-chain fatty acids in muscle tissue.
Affiliation(s)
- Junjie Hou
- College of Animal Science and Technology, Henan University of Science and Technology, Luoyang, China
- Xiang Ji
- College of Animal Science and Technology, Henan University of Science and Technology, Luoyang, China
- Xiaoran Chu
- College of Animal Science and Technology, Henan University of Science and Technology, Luoyang, China
- Zhuoyan Shi
- College of Animal Science and Technology, Henan University of Science and Technology, Luoyang, China
- Binjie Wang
- College of Animal Science and Technology, Henan University of Science and Technology, Luoyang, China
- Kangle Sun
- College of Animal Science and Technology, Henan University of Science and Technology, Luoyang, China
- Haibo Wei
- College of Animal Science and Technology, Henan University of Science and Technology, Luoyang, China
- Zhen Song
- College of Animal Science and Technology, Henan University of Science and Technology, Luoyang, China
- The Kay Laboratory of High Quality Livestock and Poultry Germplasm Resources and Genetic Breeding of Luoyang, Henan University of Science and Technology, Luoyang, China
- Fengyun Wen
- College of Animal Science and Technology, Henan University of Science and Technology, Luoyang, China
- The Kay Laboratory of High Quality Livestock and Poultry Germplasm Resources and Genetic Breeding of Luoyang, Henan University of Science and Technology, Luoyang, China
3. Pérez V, Liu Y, Hengst MB, Weyrich LS. A Case Study for the Recovery of Authentic Microbial Ancient DNA from Soil Samples. Microorganisms 2022; 10:1623. [PMID: 36014039 PMCID: PMC9414430 DOI: 10.3390/microorganisms10081623] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 06/15/2022] [Revised: 08/01/2022] [Accepted: 08/02/2022] [Indexed: 11/16/2022] Open
Abstract
High Throughput DNA Sequencing (HTS) revolutionized the field of paleomicrobiology, leading to an explosive growth of microbial ancient DNA (aDNA) studies, especially from environmental samples. However, aDNA studies that examine environmental microbes routinely fail to authenticate aDNA, examine laboratory and environmental contamination, and control for biases introduced during sample processing. Here, we surveyed the available literature for environmental aDNA projects—from sample collection to data analysis—and assessed previous methodologies and approaches used in the published microbial aDNA studies. We then integrated these concepts into a case study, using shotgun metagenomics to examine methodological, technical, and analytical biases during an environmental aDNA study of soil microbes. Specifically, we compared the impact of five DNA extraction methods and eight bioinformatic pipelines on the recovery of microbial aDNA information in soil cores from extreme environments. Our results show that silica-based methods optimized for aDNA research recovered significantly more damaged and shorter reads (<100 bp) than a commercial kit or a phenol−chloroform method. Additionally, we described a stringent pipeline for data preprocessing, efficiently decreasing the representation of low-complexity and duplicated reads in our datasets and downstream analyses, reducing analytical biases in taxonomic classification.
Affiliation(s)
- Vilma Pérez
- Australian Centre for Ancient DNA (ACAD), School of Biological Sciences, University of Adelaide, Adelaide, SA 5005, Australia
- ARC Centre of Excellence for Australian Biodiversity and Heritage (CABAH), School of Biological Sciences, University of Adelaide, Adelaide, SA 5005, Australia
- Yichen Liu
- Key Laboratory of Vertebrate Evolution and Human Origins, Institute of Vertebrate Paleontology and Paleoanthropology, Center for Excellence in Life and Paleoenvironment, Chinese Academy of Sciences, Beijing 100044, China
- Martha B. Hengst
- Laboratorio de Ecología Molecular y Microbiología Aplicada, Departamento de Ciencias Farmacéuticas, Facultad de Ciencias, Universidad Católica del Norte, Antofagasta 1270300, Chile
- Laura S. Weyrich
- ARC Centre of Excellence for Australian Biodiversity and Heritage (CABAH), School of Biological Sciences, University of Adelaide, Adelaide, SA 5005, Australia
- Department of Anthropology and Huck Institutes of the Life Sciences, The Pennsylvania State University, State College, PA 16802, USA
4. Palevich N, Maclean PH. Sequencing and Reconstructing Helminth Mitochondrial Genomes Directly from Genomic Next-Generation Sequencing Data. Methods Mol Biol 2022; 2369:27-40. [PMID: 34313982 DOI: 10.1007/978-1-0716-1681-9_3] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 02/09/2023]
Abstract
We present a detailed method for extraction of high-molecular weight genomic DNA suitable for numerous DNA sequencing applications, and a straightforward in silico approach for reconstructing novel mitochondrial (mt) genomes directly from total genomic DNA extracts derived from next-generation sequencing (NGS) data sets. The in silico post-sequencing pipeline described is fast, accurate, and highly efficient, and its modest memory requirements allow it to be run on a standard desktop computer. The approach is particularly effective for obtaining mitochondrial genomes for species with little or no mitochondrial sequence information currently available and overcomes many of the limitations of traditional strategies. The described methodologies are also applicable to metagenomic sequencing of mixed or pooled samples containing multiple species and to the subsequent assembly of specific mitochondrial genomes.
Affiliation(s)
- Nikola Palevich
- AgResearch Limited, Grasslands Research Centre, Palmerston North, New Zealand
- Paul Haydon Maclean
- AgResearch Limited, Grasslands Research Centre, Palmerston North, New Zealand
5. Chen Y, Chen Y, Shi C, Huang Z, Zhang Y, Li S, Li Y, Ye J, Yu C, Li Z, Zhang X, Wang J, Yang H, Fang L, Chen Q. SOAPnuke: a MapReduce acceleration-supported software for integrated quality control and preprocessing of high-throughput sequencing data. Gigascience 2018; 7:1-6. [PMID: 29220494 PMCID: PMC5788068 DOI: 10.1093/gigascience/gix120] [Citation(s) in RCA: 900] [Impact Index Per Article: 150.0] [Received: 07/17/2017] [Accepted: 11/22/2017] [Indexed: 12/15/2022] Open
Abstract
Quality control (QC) and preprocessing are essential steps for sequencing data analysis to ensure the accuracy of results. However, existing tools cannot provide a satisfying solution with integrated comprehensive functions, proper architectures, and highly scalable acceleration. In this article, we demonstrate SOAPnuke as a tool with abundant functions for a “QC-Preprocess-QC” workflow and MapReduce acceleration framework. Four modules with different preprocessing functions are designed for processing datasets from genomic, small RNA, Digital Gene Expression, and metagenomic experiments, respectively. As a workflow-like tool, SOAPnuke centralizes processing functions into 1 executable and predefines their order to avoid the necessity of reformatting different files when switching tools. Furthermore, the MapReduce framework enables large scalability by distributing all the processing work across an entire compute cluster. We conducted a benchmark in which SOAPnuke and other tools were used to preprocess a ∼30× NA12878 dataset published by GIAB. The standalone operation of SOAPnuke struck a balance between resource occupancy and performance. When accelerated on 16 working nodes with MapReduce, SOAPnuke achieved ∼5.7 times the speed of the fastest of the other tools.
Affiliation(s)
- Chunmei Shi
- Department of Oncology, Fujian Medical University Union Hospital, Fuzhou 350001; Fujian Key Laboratory of Translational Cancer Medicine, Fuzhou 350014; Department of Stem Cell Research Institute, Fujian Medical University Stem Cell Research Institute, Fuzhou 350000
- Yong Zhang
- BGI-Shenzhen, Shenzhen 518083; Collaborative Innovation Center of High Performance Computing, National University of Defense Technology, Changsha 410073
- Shengkang Li
- BGI-Shenzhen, Shenzhen 518083; Collaborative Innovation Center of High Performance Computing, National University of Defense Technology, Changsha 410073
- Yan Li
- BGI-Shenzhen, Shenzhen 518083
- Jia Ye
- BGI-Shenzhen, Shenzhen 518083
- Chang Yu
- Intel China Ltd., Shanghai 200336
- Zhuo Li
- Guangdong Provincial Hospital of Chinese Medicine, Guangzhou 510120; Department of Surgery, Faculty of Medicine, The Chinese University of Hong Kong, Hong Kong
- Jian Wang
- BGI-Shenzhen, Shenzhen 518083; James D. Watson Institute of Genome Sciences, Hangzhou 310058, China
- Huanming Yang
- BGI-Shenzhen, Shenzhen 518083; James D. Watson Institute of Genome Sciences, Hangzhou 310058, China
- Lin Fang
- BGI-Shenzhen, Shenzhen 518083; Collaborative Innovation Center of High Performance Computing, National University of Defense Technology, Changsha 410073
- Qiang Chen
- Department of Oncology, Fujian Medical University Union Hospital, Fuzhou 350001; Fujian Key Laboratory of Translational Cancer Medicine, Fuzhou 350014; Department of Stem Cell Research Institute, Fujian Medical University Stem Cell Research Institute, Fuzhou 350000
6. Shankar J. Insights into study design and statistical analyses in translational microbiome studies. Ann Transl Med 2017; 5:249. [PMID: 28706917 DOI: 10.21037/atm.2017.01.13] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Indexed: 12/21/2022]
Abstract
Research questions in translational microbiome studies are substantially more complex than their counterparts in basic science. Robust study designs with appropriate statistical analysis frameworks are pivotal to the success of these translational studies. This review considers how study designs can account for heterogeneous phenotypes by adopting representative sampling schemes for recruiting the study population and making careful choices about the control population. Advantages and limitations of 16S profiling and whole-genome sequencing, the two primary techniques for measuring the microbiome, are discussed followed by an overview of bioinformatic processing of high-throughput sequencing data from these measurements. Practical insights into the downstream statistical analyses including data processing and integration, variable transformations, and data exploration are provided. The merits of regularization and ensemble modeling for analyzing microbiome data are discussed along with a recommendation for selecting modeling approaches based on data-driven simulations and objective evaluation. The review builds on several recent discussions of study design issues in microbiome research but with a stronger emphasis on the downstream and often-ignored aspects of statistical analyses that are crucial for bridging the gap between basic science and translation.
7. Manconi A, Moscatelli M, Armano G, Gnocchi M, Orro A, Milanesi L. Removing duplicate reads using graphics processing units. BMC Bioinformatics 2016; 17:346. [PMID: 28185553 PMCID: PMC5123249 DOI: 10.1186/s12859-016-1192-5] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Indexed: 12/30/2022] Open
Abstract
Background: During library construction, polymerase chain reaction is used to enrich the DNA before sequencing. Typically, this process generates duplicate read sequences. Removal of these artifacts is mandatory, as they can affect the correct interpretation of data in several analyses. Ideally, duplicate reads should be characterized by identical nucleotide sequences. However, due to sequencing errors, duplicates may also be nearly identical. Removing nearly identical duplicates can require notable computational effort. To deal with this challenge, we recently proposed a GPU method aimed at removing identical and nearly identical duplicates generated with an Illumina platform. The method implements an approach based on prefix-suffix comparison. Read sequences with identical prefixes are considered potential duplicates. Then, their suffixes are compared to identify and remove those that are actually duplicated. Although the method can be efficiently used to remove duplicates, there are some limitations that need to be overcome. In particular, it cannot detect potential duplicates when prefixes are longer than 27 bases, and it does not support paired-end read libraries. Moreover, large clusters of potential duplicates are split into smaller ones to guarantee a reasonable computing time. This heuristic may affect the accuracy of the analysis. Results: In this work we propose GPU-DupRemoval, a new implementation of our method able to (i) cluster reads without constraints on the maximum length of the prefixes, (ii) support both single- and paired-end read libraries, and (iii) analyze large clusters of potential duplicates. Conclusions: Due to the massive parallelization obtained by exploiting graphics cards, GPU-DupRemoval removes duplicate reads faster than other cutting-edge solutions, while outperforming most of them in terms of the number of duplicate reads removed.
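The prefix-suffix comparison described in this abstract can be illustrated with a minimal CPU-only sketch. This is not the GPU-DupRemoval implementation: the function names, the `prefix_len` and `max_mismatches` parameters, the use of Hamming distance on suffixes, and the choice to keep the first read seen as the cluster representative are all assumptions made for illustration, and only single-end reads are handled.

```python
from collections import defaultdict


def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def remove_duplicates(reads, prefix_len=8, max_mismatches=1):
    """Drop identical and nearly identical reads (illustrative sketch only).

    Step 1: cluster reads that share an identical prefix (potential duplicates).
    Step 2: within each cluster, compare suffixes; a read whose suffix differs
            from an already kept read by at most `max_mismatches` positions
            is treated as a duplicate and discarded.
    """
    clusters = defaultdict(list)
    for read in reads:
        clusters[read[:prefix_len]].append(read)

    kept = []
    for members in clusters.values():
        representatives = []  # reads retained from this cluster
        for read in members:
            suffix = read[prefix_len:]
            is_duplicate = any(
                len(rep) == len(read)
                and hamming(rep[prefix_len:], suffix) <= max_mismatches
                for rep in representatives
            )
            if not is_duplicate:
                representatives.append(read)
        kept.extend(representatives)
    return kept


if __name__ == "__main__":
    example = ["ACGTACGTAAAA", "ACGTACGTAAAT", "ACGTACGTCCCC", "TTTTACGTAAAA"]
    print(remove_duplicates(example))
```

With the example reads, the second read shares its prefix with the first and differs by one suffix base, so under these assumed settings it is treated as a nearly identical duplicate, while the other three reads are kept.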
Affiliation(s)
- Andrea Manconi
- Institute for Biomedical Technologies, National Research Council, Via Fratelli Cervi, 93, Segrate (Mi), 20090, Italy
- Marco Moscatelli
- Institute for Biomedical Technologies, National Research Council, Via Fratelli Cervi, 93, Segrate (Mi), 20090, Italy
- Giuliano Armano
- Department of Electrical and Electronic Engineering, University of Cagliari, P.zza D'Armi, Cagliari (CA), 09123, Italy
- Matteo Gnocchi
- Institute for Biomedical Technologies, National Research Council, Via Fratelli Cervi, 93, Segrate (Mi), 20090, Italy
- Alessandro Orro
- Institute for Biomedical Technologies, National Research Council, Via Fratelli Cervi, 93, Segrate (Mi), 20090, Italy
- Luciano Milanesi
- Institute for Biomedical Technologies, National Research Council, Via Fratelli Cervi, 93, Segrate (Mi), 20090, Italy
8. Zhou X, Peris D, Kominek J, Kurtzman CP, Hittinger CT, Rokas A. In Silico Whole Genome Sequencer and Analyzer (iWGS): a Computational Pipeline to Guide the Design and Analysis of de novo Genome Sequencing Studies. G3 (Bethesda) 2016; 6:3655-3662. [PMID: 27638685 PMCID: PMC5100864 DOI: 10.1534/g3.116.034249] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.1] [Received: 08/04/2016] [Accepted: 09/08/2016] [Indexed: 11/18/2022]
Abstract
The availability of genomes across the tree of life is highly biased toward vertebrates, pathogens, human disease models, and organisms with relatively small and simple genomes. Recent progress in genomics has enabled the de novo decoding of the genome of virtually any organism, greatly expanding its potential for understanding the biology and evolution of the full spectrum of biodiversity. The increasing diversity of sequencing technologies, assays, and de novo assembly algorithms has augmented the complexity of de novo genome sequencing projects in nonmodel organisms. To reduce the costs and challenges in de novo genome sequencing projects and streamline their experimental design and analysis, we developed iWGS (in silico Whole Genome Sequencer and Analyzer), an automated pipeline for guiding the choice of appropriate sequencing strategy and assembly protocols. iWGS seamlessly integrates the four key steps of a de novo genome sequencing project: data generation (through simulation), data quality control, de novo assembly, and assembly evaluation and validation. The last three steps can also be applied to the analysis of real data. iWGS is designed to give the user great flexibility in testing the range of experimental designs available for genome sequencing projects, and supports all major sequencing technologies and popular assembly tools. Three case studies illustrate how iWGS can guide the design of de novo genome sequencing projects, and evaluate the performance of a wide variety of user-specified sequencing strategies and assembly protocols on genomes of differing architectures. iWGS, along with detailed documentation, is freely available at https://github.com/zhouxiaofan1983/iWGS.
Affiliation(s)
- Xiaofan Zhou
- Department of Biological Sciences, Vanderbilt University, Nashville, Tennessee 37235
- David Peris
- Laboratory of Genetics, Genome Center of Wisconsin, Department of Energy Great Lakes Bioenergy Research Center, Wisconsin Energy Institute, J. F. Crow Institute for the Study of Evolution, University of Wisconsin-Madison, Wisconsin 53706
- Jacek Kominek
- Laboratory of Genetics, Genome Center of Wisconsin, Department of Energy Great Lakes Bioenergy Research Center, Wisconsin Energy Institute, J. F. Crow Institute for the Study of Evolution, University of Wisconsin-Madison, Wisconsin 53706
- Cletus P Kurtzman
- Mycotoxin Prevention and Applied Microbiology Research Unit, National Center for Agricultural Utilization Research, Agricultural Research Service, US Department of Agriculture, Peoria, Illinois 61604
- Chris Todd Hittinger
- Laboratory of Genetics, Genome Center of Wisconsin, Department of Energy Great Lakes Bioenergy Research Center, Wisconsin Energy Institute, J. F. Crow Institute for the Study of Evolution, University of Wisconsin-Madison, Wisconsin 53706
- Antonis Rokas
- Department of Biological Sciences, Vanderbilt University, Nashville, Tennessee 37235
9. Vincent AT, Derome N, Boyle B, Culley AI, Charette SJ. Next-generation sequencing (NGS) in the microbiological world: How to make the most of your money. J Microbiol Methods 2016; 138:60-71. [PMID: 26995332 DOI: 10.1016/j.mimet.2016.02.016] [Citation(s) in RCA: 71] [Impact Index Per Article: 8.9] [Received: 09/30/2015] [Revised: 01/26/2016] [Accepted: 02/24/2016] [Indexed: 12/16/2022]
Abstract
The Sanger sequencing method produces relatively long DNA sequences of unmatched quality and has long been considered the gold standard for sequencing DNA. Many improvements to the Sanger method, culminating in fluorescent dyes coupled with automated capillary electrophoresis, enabled the sequencing of the first genomes. Nevertheless, using this technology to sequence whole genomes was costly, laborious and time consuming even for genomes that are relatively small in size. A major technological advance was the introduction of next-generation sequencing (NGS), pioneered by 454 Life Sciences in the early part of the 21st century. NGS allowed scientists to sequence thousands to millions of DNA molecules in a single machine run. Since then, new NGS technologies have emerged and existing NGS platforms have been improved, enabling the production of genome sequences at an unprecedented rate as well as broadening the spectrum of NGS applications. The current affordability of generating genomic information, especially with microbial samples, has resulted in a false sense of simplicity that belies the fact that many researchers still consider these technologies a black box. In this review, our objective is to identify and discuss four steps that we consider crucial to the success of any NGS-related project. These steps are: (1) the definition of the research objectives beyond sequencing and appropriate experimental planning, (2) library preparation, (3) sequencing and (4) data analysis. The goal of this review is to give an overview of the process, from sample to analysis, and discuss how to optimize your resources to achieve the most from your NGS-based research. Regardless of the evolution and improvement of the sequencing technologies, these four steps will remain relevant.
Affiliation(s)
- Antony T Vincent
- Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Quebec City, QC G1V 0A6, Canada; Département de biochimie, de microbiologie et de bio-informatique, Faculté des sciences et de génie, Université Laval, Quebec City, QC G1V 0A6, Canada; Centre de recherche de l'Institut universitaire de cardiologie et de pneumologie de Québec, Quebec City, QC G1V 4G5, Canada
- Nicolas Derome
- Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Quebec City, QC G1V 0A6, Canada; Département de biologie, Faculté des sciences et de génie, Université Laval, Quebec City G1V 0A6, Canada
- Brian Boyle
- Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Quebec City, QC G1V 0A6, Canada
- Alexander I Culley
- Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Quebec City, QC G1V 0A6, Canada; Département de biochimie, de microbiologie et de bio-informatique, Faculté des sciences et de génie, Université Laval, Quebec City, QC G1V 0A6, Canada; Groupe de Recherche en Écologie Buccale (GREB), Faculté de médecine dentaire, Université Laval, Quebec City, QC G1V 0A6, Canada
- Steve J Charette
- Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Quebec City, QC G1V 0A6, Canada; Département de biochimie, de microbiologie et de bio-informatique, Faculté des sciences et de génie, Université Laval, Quebec City, QC G1V 0A6, Canada; Centre de recherche de l'Institut universitaire de cardiologie et de pneumologie de Québec, Quebec City, QC G1V 4G5, Canada
10. Schubert M, Lindgreen S, Orlando L. AdapterRemoval v2: rapid adapter trimming, identification, and read merging. BMC Res Notes 2016; 9:88. [PMID: 26868221 PMCID: PMC4751634 DOI: 10.1186/s13104-016-1900-2] [Citation(s) in RCA: 936] [Impact Index Per Article: 117.0] [Received: 10/06/2015] [Accepted: 02/02/2016] [Indexed: 02/06/2023] Open
Abstract
Background: As high-throughput sequencing platforms produce longer and longer reads, sequences generated from short inserts, such as those obtained from fossil and degraded material, are increasingly expected to contain adapter sequences. Efficient adapter trimming algorithms are also needed to process the growing amount of data generated per sequencing run. Findings: We introduce AdapterRemoval v2, a major revision of AdapterRemoval v1, which introduces (i) striking improvements in throughput, through the use of single instruction, multiple data (SIMD; SSE1 and SSE2) instructions and multi-threading support, (ii) the ability to handle datasets containing reads or read-pairs with different adapters or adapter pairs, (iii) simultaneous demultiplexing and adapter trimming, (iv) the ability to reconstruct adapter sequences from paired-end reads for poorly documented data sets, and (v) native gzip and bzip2 support. Conclusions: We show that AdapterRemoval v2 compares favorably with existing tools, while offering superior throughput to most alternatives examined here, both for single and multi-threaded operations.
Affiliation(s)
- Mikkel Schubert
- Centre for GeoGenetics, Natural History Museum of Denmark, University of Copenhagen, 1350, Copenhagen, Denmark
- Stinus Lindgreen
- Department of Biology, Section for Computational and RNA Biology, University of Copenhagen, Ole Maaloes Vej 5, 2200, Copenhagen, Denmark; Carlsberg Research Laboratory, Gamle Carlsberg Vej 4-10, 1799, Copenhagen, Denmark
- Ludovic Orlando
- Centre for GeoGenetics, Natural History Museum of Denmark, University of Copenhagen, 1350, Copenhagen, Denmark; Laboratoire AMIS, Université de Toulouse, University Paul Sabatier (UPS), CNRS UMR 5288, 37 Allées Jules Guesde, 31000, Toulouse, France
|
11
|
González-Domínguez J, Schmidt B. ParDRe: faster parallel duplicated reads removal tool for sequencing studies. Bioinformatics 2016; 32:1562-4. [DOI: 10.1093/bioinformatics/btw038] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.4] [Received: 11/25/2015] [Accepted: 01/17/2016] [Indexed: 11/14/2022] [Open Access]
|
12
|
Colman RE, Schupp JM, Hicks ND, Smith DE, Buchhagen JL, Valafar F, Crudu V, Romancenco E, Noroc E, Jackson L, Catanzaro DG, Rodwell TC, Catanzaro A, Keim P, Engelthaler DM. Detection of Low-Level Mixed-Population Drug Resistance in Mycobacterium tuberculosis Using High Fidelity Amplicon Sequencing. PLoS One 2015; 10:e0126626. [PMID: 25970423 PMCID: PMC4430321 DOI: 10.1371/journal.pone.0126626] [Citation(s) in RCA: 68] [Impact Index Per Article: 7.6] [Received: 10/24/2014] [Accepted: 04/03/2015] [Indexed: 12/20/2022] [Open Access]
Abstract
Undetected and untreated, low levels of drug-resistant (DR) subpopulations in clinical Mycobacterium tuberculosis (Mtb) infections may lead to development of DR-tuberculosis, potentially resulting in treatment failure. Current phenotypic DR susceptibility testing has a theoretical potential for 1% sensitivity, is not quantitative, and requires several weeks to complete. The use of "single molecule-overlapping reads" (SMOR) analysis with next-generation DNA sequencing for determination of ultra-rare target alleles in complex mixtures provides increased sensitivity over standard DNA sequencing. Ligation-free amplicon sequencing with SMOR analysis enables the detection of resistant allele subpopulations at ≥0.1% of the total Mtb population in near real-time analysis. We describe the method using standardized mixtures of DNA from resistant and susceptible Mtb isolates and the assay's performance for detecting ultra-rare DR subpopulations in DNA extracted directly from clinical sputum samples. SMOR analysis enables rapid, near real-time detection and tracking of previously undetectable DR subpopulations in clinical samples, allowing for the evaluation of the clinical relevance of low-level DR subpopulations. This will provide insights into interventions aimed at suppressing minor DR subpopulations before they become clinically significant.
MESH Headings
- Antitubercular Agents/pharmacology
- Antitubercular Agents/therapeutic use
- DNA, Bacterial/genetics
- DNA, Bacterial/isolation & purification
- Drug Resistance, Multiple, Bacterial/genetics
- Gene Frequency
- Genetic Loci
- High-Throughput Nucleotide Sequencing
- Humans
- Microbial Sensitivity Tests
- Molecular Diagnostic Techniques
- Mycobacterium tuberculosis/genetics
- Polymorphism, Single Nucleotide
- Sequence Analysis, DNA
- Sputum/microbiology
- Tuberculosis, Multidrug-Resistant/diagnosis
- Tuberculosis, Multidrug-Resistant/drug therapy
- Tuberculosis, Multidrug-Resistant/microbiology
- Tuberculosis, Pulmonary/diagnosis
- Tuberculosis, Pulmonary/drug therapy
- Tuberculosis, Pulmonary/microbiology
Affiliation(s)
- Rebecca E. Colman
- Translational Genomics Research Institute, Flagstaff, AZ, United States of America
- James M. Schupp
- Translational Genomics Research Institute, Flagstaff, AZ, United States of America
- Nathan D. Hicks
- Translational Genomics Research Institute, Flagstaff, AZ, United States of America
- David E. Smith
- Translational Genomics Research Institute, Flagstaff, AZ, United States of America
- Jordan L. Buchhagen
- Translational Genomics Research Institute, Flagstaff, AZ, United States of America
- Faramarz Valafar
- San Diego State University, San Diego, CA, United States of America
- Valeriu Crudu
- Phthisiopneumology Institute (PPI), Chisinau, Republic of Moldova
- Elena Romancenco
- University of California San Diego, San Diego, CA, United States of America
- Ecaterina Noroc
- Phthisiopneumology Institute (PPI), Chisinau, Republic of Moldova
- Lynn Jackson
- University of California San Diego, San Diego, CA, United States of America
- Donald G. Catanzaro
- University of Arkansas College of Education and Health Professions, Fayetteville, AR, United States of America
- Timothy C. Rodwell
- University of California San Diego, San Diego, CA, United States of America
- Antonino Catanzaro
- University of California San Diego, San Diego, CA, United States of America
- Paul Keim
- Translational Genomics Research Institute, Flagstaff, AZ, United States of America
- Center for Microbial Genetics & Genomics, Northern Arizona University, Flagstaff, AZ, United States of America
- David M. Engelthaler
- Translational Genomics Research Institute, Flagstaff, AZ, United States of America
|
13
|
Li YL, Weng JC, Hsiao CC, Chou MT, Tseng CW, Hung JH. PEAT: an intelligent and efficient paired-end sequencing adapter trimming algorithm. BMC Bioinformatics 2015; 16 Suppl 1:S2. [PMID: 25707528 PMCID: PMC4331701 DOI: 10.1186/1471-2105-16-s1-s2] [Citation(s) in RCA: 37] [Impact Index Per Article: 4.1] [Indexed: 01/09/2023] [Open Access]
Abstract
Background: In modern paired-end sequencing protocols, short DNA fragments lead to adapter-appended reads. Current paired-end adapter removal approaches trim adapters by scanning for adapter fragments at the 3' end of the reads, which is inadequate in some applications. Results: Here, we propose a fast and highly accurate adapter-trimming algorithm, PEAT, designed specifically for paired-end sequencing. PEAT requires no a priori adapter sequence, which is convenient for large-scale meta-analyses. We assessed the performance of PEAT against many adapter trimmers on both simulated and real-life paired-end sequencing libraries. The importance of adapter trimming was exemplified by its influence on downstream analyses of RNA-seq, ChIP-seq and MNase-seq data. Several useful guidelines for applying adapter trimmers with aligners are suggested. Conclusions: PEAT can be easily included in the routine paired-end sequencing pipeline. The executable binaries and the standalone C++ source code package of PEAT are freely available online.
|
14
|
Rieseberg L, Vines T, Gow J, Geraldes A. Editorial 2015. Mol Ecol 2015; 24:1-17. [DOI: 10.1111/mec.12997] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/10/2014] [Accepted: 11/10/2014] [Indexed: 11/30/2022]
|
15
|
Harris SE, O'Neill RJ, Munshi-South J. Transcriptome resources for the white-footed mouse (Peromyscus leucopus): new genomic tools for investigating ecologically divergent urban and rural populations. Mol Ecol Resour 2014; 15:382-94. [PMID: 24980186 DOI: 10.1111/1755-0998.12301] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.6] [Received: 04/23/2014] [Revised: 06/26/2014] [Accepted: 06/27/2014] [Indexed: 12/30/2022]
Abstract
Genomic resources are important and attainable for examining evolutionary change in divergent natural populations of nonmodel species. We utilized two next-generation sequencing (NGS) platforms, 454 and SOLiD 5500XL, to assemble low-coverage transcriptomes of the white-footed mouse (Peromyscus leucopus), a widespread and abundant native rodent in eastern North America. We sequenced liver mRNA transcripts from multiple individuals collected from urban populations in New York City and rural populations in undisturbed protected areas nearby and assembled a reference transcriptome using 1,080,065,954 SOLiD 5500XL (75 bp) reads and 3,052,640 454 GS FLX+ reads. The reference contained 40,908 contigs with an N50 of 1044 bp and a total content of 30.06 megabases (Mb). Contigs were annotated from Mus musculus (39.96% annotated) Uniprot databases. We identified 104,655 high-quality single nucleotide polymorphisms (SNPs) and 65 simple sequence repeats (SSRs) with flanking primers. We also used normalized read counts to identify putative gene expression differences in 10 genes between populations. There were 19 contigs significantly differentially expressed in urban populations compared to rural populations, with gene function annotations generally related to the translation and modification of proteins and those involved in immune responses. The individual transcriptomes generated in this study will be used to investigate evolutionary responses to urbanization. The reference transcriptome provides a valuable resource for the scientific community using North American Peromyscus species as emerging model systems for ecological genetics and adaptation.
Affiliation(s)
- Stephen E Harris
- Program in Ecology, Evolutionary Biology, & Behavior, The Graduate Center, City University of New York (CUNY), New York, NY, 10016, USA
|
16
|
Ekblom R, Wolf JBW. A field guide to whole-genome sequencing, assembly and annotation. Evol Appl 2014; 7:1026-42. [PMID: 25553065 PMCID: PMC4231593 DOI: 10.1111/eva.12178] [Citation(s) in RCA: 188] [Impact Index Per Article: 18.8] [Received: 02/04/2014] [Accepted: 05/20/2014] [Indexed: 12/12/2022] [Open Access]
Abstract
Genome sequencing projects were long confined to biomedical model organisms and required the concerted effort of large consortia. Rapid progress in high-throughput sequencing technology and the simultaneous development of bioinformatic tools have democratized the field. It is now within reach for individual research groups in the eco-evolutionary and conservation community to generate de novo draft genome sequences for any organism of choice. Because of the cost and considerable effort involved in such an endeavour, the important first step is to thoroughly consider whether a genome sequence is necessary for addressing the biological question at hand. Once this decision is taken, a genome project requires careful planning with respect to the organism involved and the intended quality of the genome draft. Here, we briefly review the state of the art within this field and provide a step-by-step introduction to the workflow involved in genome sequencing, assembly and annotation with particular reference to large and complex genomes. This tutorial is targeted at scientists with a background in conservation genetics, but more generally, provides useful practical guidance for researchers engaging in whole-genome sequencing projects.
Affiliation(s)
- Robert Ekblom
- Department of Evolutionary Biology, Uppsala University Uppsala, Sweden
- Jochen B W Wolf
- Department of Evolutionary Biology, Uppsala University Uppsala, Sweden
|
17
|
Stapleton AE. A biologist, a statistician, and a bioinformatician walk into a conference room… and walk out with a great metagenomics project plan. Front Plant Sci 2014; 5:250. [PMID: 24917875 PMCID: PMC4042100 DOI: 10.3389/fpls.2014.00250] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 04/15/2014] [Accepted: 05/15/2014] [Indexed: 06/03/2023]
|