1
Betschart RO, Riccio C, Aguilera-Garcia D, Blankenberg S, Guo L, Moch H, Seidl D, Solleder H, Thalén F, Thiéry A, Twerenbold R, Zeller T, Zoche M, Ziegler A. Biostatistical Aspects of Whole Genome Sequencing Studies: Preprocessing and Quality Control. Biom J 2024; 66:e202300278. [PMID: 38988195 DOI: 10.1002/bimj.202300278] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/09/2023] [Revised: 03/21/2024] [Accepted: 05/14/2024] [Indexed: 07/12/2024]
Abstract
Rapid advances in high-throughput DNA sequencing technologies have enabled large-scale whole genome sequencing (WGS) studies. Before performing association analysis between phenotypes and genotypes, preprocessing and quality control (QC) of the raw sequence data need to be performed. Because many biostatisticians have not worked with WGS data so far, we first sketch Illumina's short-read sequencing technology. Second, we explain the general preprocessing pipeline for WGS studies. Third, we provide an overview of important QC metrics, which are applied to WGS data: on the raw data, after mapping and alignment, after variant calling, and after multisample variant calling. Fourth, we illustrate the QC with the data from the GENEtic SequencIng Study Hamburg-Davos (GENESIS-HD), a study involving more than 9000 human whole genomes. All samples were sequenced on an Illumina NovaSeq 6000 with an average coverage of 35× using a PCR-free protocol. For QC, one Genome in a Bottle (GIAB) trio was sequenced in four replicates, and one GIAB sample was successfully sequenced 70 times in different runs. Fifth, we provide empirical data on the compression of raw data using the DRAGEN original read archive (ORA). The most important quality metrics in the application were genetic similarity, sample cross-contamination, deviations from the expected Het/Hom ratio, relatedness, and coverage. The compression ratio of the raw files using DRAGEN ORA was 5.6:1, and compression time was linear in genome coverage. In summary, the preprocessing, joint calling, and QC of large WGS studies are feasible within a reasonable time, and efficient QC procedures are readily available.
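One of the QC metrics named in this abstract, deviation from the expected Het/Hom ratio, can be illustrated with a small sketch. The function names, the per-sample counts, and the median-absolute-deviation cutoff below are assumptions made for illustration only; they are not taken from the article or from the GENESIS-HD pipeline.

```python
from statistics import median


def het_hom_ratio(n_het: int, n_hom_alt: int) -> float:
    """Ratio of heterozygous to homozygous-alternate genotype calls for one sample."""
    if n_hom_alt <= 0:
        raise ValueError("need at least one homozygous-alternate call")
    return n_het / n_hom_alt


def flag_outliers(ratios: dict[str, float], n_mads: float = 5.0) -> list[str]:
    """Return sample IDs whose ratio lies more than n_mads MADs from the cohort median.

    The cutoff of 5 MADs is a hypothetical default, not a recommended threshold.
    """
    values = list(ratios.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Degenerate cohort: any sample that differs from the median is flagged.
        return [s for s, v in ratios.items() if v != med]
    return [s for s, v in ratios.items() if abs(v - med) > n_mads * mad]


# Hypothetical per-sample counts: (heterozygous calls, homozygous-alternate calls).
counts = {
    "S1": (2_000_000, 1_250_000),
    "S2": (2_050_000, 1_250_000),
    "S3": (1_950_000, 1_250_000),
    "S4": (3_000_000, 1_000_000),  # unusually high Het/Hom, e.g. possible contamination
}
ratios = {sample: het_hom_ratio(*c) for sample, c in counts.items()}
print(flag_outliers(ratios))
```

A robust centre and spread (median and MAD) are used here so that a few problematic samples do not distort the reference range they are compared against.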
Affiliation(s)
- Domingo Aguilera-Garcia
  - Institute of Pathology and Molecular Pathology, University Hospital Zurich, Zurich, Switzerland
- Stefan Blankenberg
  - Cardio-CARE, Medizincampus Davos, Davos, Switzerland
  - Department of Cardiology, University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
  - Center for Population Health Innovation (POINT), University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
- Linlin Guo
  - Department of Cardiology, University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
- Holger Moch
  - Institute of Pathology and Molecular Pathology, University Hospital Zurich, Zurich, Switzerland
- Dagmar Seidl
  - Institute of Pathology and Molecular Pathology, University Hospital Zurich, Zurich, Switzerland
- Hugo Solleder
  - Cardio-CARE, Medizincampus Davos, Davos, Switzerland
- Felix Thalén
  - Cardio-CARE, Medizincampus Davos, Davos, Switzerland
- Raphael Twerenbold
  - Department of Cardiology, University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
  - Center for Population Health Innovation (POINT), University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
  - German Center for Cardiovascular Research (DZHK), partner site Hamburg/Kiel/Lübeck, Hamburg, Germany
- Tanja Zeller
  - Department of Cardiology, University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
  - Center for Population Health Innovation (POINT), University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
  - German Center for Cardiovascular Research (DZHK), partner site Hamburg/Kiel/Lübeck, Hamburg, Germany
- Martin Zoche
  - Institute of Pathology and Molecular Pathology, University Hospital Zurich, Zurich, Switzerland
- Andreas Ziegler
  - Cardio-CARE, Medizincampus Davos, Davos, Switzerland
  - Department of Cardiology, University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
  - Center for Population Health Innovation (POINT), University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
  - School of Mathematics, Statistics and Computer Science, University of KwaZulu-Natal, Pietermaritzburg, South Africa
2
Li CI, Samuels DC, Zhao YY, Shyr Y, Guo Y. Power and sample size calculations for high-throughput sequencing-based experiments. Brief Bioinform 2019; 19:1247-1255. [PMID: 28605403 DOI: 10.1093/bib/bbx061] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Received: 01/17/2017] [Indexed: 12/22/2022]
Abstract
Power/sample size (power) analysis estimates the likelihood of successfully finding the statistical significance in a data set. There has been a growing recognition of the importance of power analysis in the proper design of experiments. Power analysis is complex, yet necessary for the success of large studies. It is important to design a study that produces statistically accurate and reliable results. Power computation methods have been well established for both microarray-based gene expression studies and genotyping microarray-based genome-wide association studies. High-throughput sequencing (HTS) has greatly enhanced our ability to conduct biomedical studies at the highest possible resolution (per nucleotide). However, the complexity of power computations is much greater for sequencing data than for the simpler genotyping array data. Research on methods of power computations for HTS-based studies has been recently conducted but is not yet well known or widely used. In this article, we describe the power computation methods that are currently available for a range of HTS-based studies, including DNA sequencing, RNA-sequencing, microbiome sequencing and chromatin immunoprecipitation sequencing. Most importantly, we review the methods of power analysis for several types of sequencing data and guide the reader to the relevant methods for each data type.
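To make the notion of a power and sample size calculation concrete, here is a minimal sketch for the simplest textbook case: a two-sided, two-sample z-test with a standardized effect size. This normal approximation is an assumption chosen for illustration; it is not one of the sequencing-specific methods the article reviews, which must account for count data and multiple testing.

```python
from math import ceil, sqrt
from statistics import NormalDist


def power_two_sample(effect_size: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Power of a two-sided two-sample z-test for a standardized mean difference."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect_size * sqrt(n_per_group / 2)
    # Probability that the test statistic lands in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


def sample_size_two_sample(effect_size: float, power: float = 0.8, alpha: float = 0.05) -> int:
    """Smallest per-group n reaching the target power (ignores the negligible far tail)."""
    z = NormalDist()
    n = 2 * ((z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) / effect_size) ** 2
    return ceil(n)


n = sample_size_two_sample(effect_size=0.5)
print(n, round(power_two_sample(0.5, n), 3))
```

When many features are tested at once, as in most high-throughput settings, `alpha` would be replaced by a corrected per-test level, which raises the required sample size.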
Affiliation(s)
- Chung-I Li
  - Department of Statistics, National Cheng Kung University, Taiwan
- David C Samuels
  - Department of Molecular Physiology and Biophysics, Vanderbilt University, USA
- Yu Shyr
  - Department of Biostatistics, Vanderbilt University, USA
- Yan Guo
  - Department of Cancer Biology, Vanderbilt University
3
Dorant Y, Benestan L, Rougemont Q, Normandeau E, Boyle B, Rochette R, Bernatchez L. Comparing Pool-seq, Rapture, and GBS genotyping for inferring weak population structure: The American lobster (Homarus americanus) as a case study. Ecol Evol 2019; 9:6606-6623. [PMID: 31236247 PMCID: PMC6580275 DOI: 10.1002/ece3.5240] [Citation(s) in RCA: 26] [Impact Index Per Article: 5.2] [Received: 02/11/2019] [Revised: 04/10/2019] [Accepted: 04/13/2019] [Indexed: 01/02/2023]
Abstract
Unraveling genetic population structure is challenging in species potentially characterized by large population size and high dispersal rates, often resulting in weak genetic differentiation. Genotyping a large number of samples can improve the detection of subtle genetic structure, but this may substantially increase sequencing cost and downstream bioinformatics computational time. To overcome this challenge, alternative, cost-effective sequencing approaches, namely Pool-seq and Rapture, have been developed. We empirically measured the power of resolution and congruence of these two methods in documenting weak population structure in nonmodel species with high gene flow compared with a conventional genotyping-by-sequencing (GBS) approach. For this, we used the American lobster (Homarus americanus) as a case study. First, we found that GBS, Rapture, and Pool-seq approaches gave similar allele frequency estimates (i.e., correlation coefficient over 0.90) and all three revealed the same weak pattern of population structure. Yet, Pool-seq data showed FST estimates three to five times higher than GBS and Rapture, while the latter two methods returned similar FST estimates, indicating that individual-based approaches provided more congruent results than Pool-seq. We conclude that despite higher costs, GBS and Rapture are more convenient approaches to use in the case of species exhibiting very weak differentiation. While both GBS and Rapture approaches provided similar results with regard to estimates of population genetic parameters, GBS remains more cost-effective in projects involving relatively small numbers of genotyped individuals (e.g., <1,000). Overall, this study illustrates the complexity of estimating genetic differentiation and other summary statistics in complex biological systems characterized by large population size and migration rates.
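The FST statistic compared across methods in this abstract can be illustrated with the classical heterozygosity-based definition for one biallelic locus and two equally weighted populations. This is a sketch of the textbook quantity only; the study itself presumably used dedicated estimators with sample-size corrections, and the allele frequencies below are invented.

```python
def wright_fst(p1: float, p2: float) -> float:
    """Wright's FST = (H_T - H_S) / H_T for one biallelic locus, two equal-sized populations.

    p1 and p2 are the frequencies of the same allele in each population.
    """
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)                      # expected heterozygosity, pooled
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # mean within-population value
    if h_total == 0:
        return 0.0  # locus is monomorphic overall, so no differentiation is measurable
    return (h_total - h_within) / h_total


# Hypothetical allele frequencies: weak versus complete differentiation.
print(wright_fst(0.48, 0.52))
print(wright_fst(0.0, 1.0))
```

With frequencies as close as 0.48 and 0.52 the value is on the order of 0.001, which shows why small biases in allele frequency estimation can inflate FST severalfold when true differentiation is weak.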
Affiliation(s)
- Yann Dorant
  - Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Québec, Canada
- Laura Benestan
  - Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Québec, Canada
  - Pêches et Océans Canada, Institut Maurice-Lamontagne, Mont-Joli, Canada
- Quentin Rougemont
  - Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Québec, Canada
- Eric Normandeau
  - Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Québec, Canada
- Brian Boyle
  - Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Québec, Canada
  - Plateforme d'analyses génomiques, Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Québec, Canada
- Rémy Rochette
  - Department of Biology, University of New Brunswick, Saint John, Canada
- Louis Bernatchez
  - Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Québec, Canada
4
Muyas F, Bosio M, Puig A, Susak H, Domènech L, Escaramis G, Zapata L, Demidov G, Estivill X, Rabionet R, Ossowski S. Allele balance bias identifies systematic genotyping errors and false disease associations. Hum Mutat 2018; 40:115-126. [PMID: 30353964 PMCID: PMC6587442 DOI: 10.1002/humu.23674] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Received: 03/29/2018] [Revised: 09/17/2018] [Accepted: 10/20/2018] [Indexed: 12/13/2022]
Abstract
In recent years, next‐generation sequencing (NGS) has become a cornerstone of clinical genetics and diagnostics. Many clinical applications require high precision, especially if rare events such as somatic mutations in cancer or genetic variants causing rare diseases need to be identified. Although random sequencing errors can be modeled statistically and deep sequencing minimizes their impact, systematic errors remain a problem even at high depth of coverage. Understanding their source is crucial to increase precision of clinical NGS applications. In this work, we studied the relation between recurrent biases in allele balance (AB), systematic errors, and false positive variant calls across a large cohort of human samples analyzed by whole exome sequencing (WES). We have modeled the AB distribution for biallelic genotypes in 987 WES samples in order to identify positions recurrently deviating significantly from the expectation, a phenomenon we termed allele balance bias (ABB). Furthermore, we have developed a genotype callability score based on ABB for all positions of the human exome, which detects false positive variant calls that passed state‐of‐the‐art filters. Finally, we demonstrate the use of ABB for detection of false associations proposed by rare variant association studies. Availability: https://github.com/Francesc-Muyas/ABB.
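As a rough illustration of what allele balance measures, the sketch below computes AB for a single heterozygous call and an exact two-sided binomial p-value against the expected value of 0.5. It is a single-site, single-sample simplification under assumed names; it is not the cohort-level ABB model or the callability score from the linked repository.

```python
from math import comb


def allele_balance(ref_reads: int, alt_reads: int) -> float:
    """Fraction of reads supporting the alternate allele at one site."""
    depth = ref_reads + alt_reads
    if depth == 0:
        raise ValueError("no reads cover this site")
    return alt_reads / depth


def binom_two_sided_p(alt_reads: int, depth: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value: total probability of outcomes no more likely
    than the observed one. Suitable for moderate depth (below roughly 1000 reads)."""
    probs = [comb(depth, k) * p**k * (1 - p) ** (depth - k) for k in range(depth + 1)]
    observed = probs[alt_reads]
    # Small relative tolerance so ties are not lost to floating-point rounding.
    return min(1.0, sum(q for q in probs if q <= observed * (1 + 1e-9)))


# Hypothetical heterozygous call with skewed read support: 34 ref reads, 6 alt reads.
ab = allele_balance(ref_reads=34, alt_reads=6)
print(ab, binom_two_sided_p(alt_reads=6, depth=40))
```

A single skewed site proves little on its own; the point made in the abstract is that positions deviating recurrently across many samples indicate systematic error rather than sampling noise.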
Affiliation(s)
- Francesc Muyas
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain; Institute of Medical Genetics and Applied Genomics, University of Tübingen, Tübingen, Germany
- Mattia Bosio
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain
- Anna Puig
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain
- Hana Susak
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain
- Laura Domènech
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain; CIBER in Epidemiology and Public Health (CIBERESP), Barcelona, Spain
- Georgia Escaramis
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain; CIBER in Epidemiology and Public Health (CIBERESP), Barcelona, Spain
- Luis Zapata
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain
- German Demidov
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain; Institute of Medical Genetics and Applied Genomics, University of Tübingen, Tübingen, Germany
- Xavier Estivill
  - Sidra Medicine, Doha, Qatar; Women's Health Dexeus, Barcelona, Spain
- Raquel Rabionet
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain; CIBER in Epidemiology and Public Health (CIBERESP), Barcelona, Spain; Institut de Recerca Sant Joan de Déu; Institut de Biomedicina de la Universitat de Barcelona (IBUB); and Department of Genetics, Microbiology and Statistics, University of Barcelona, Barcelona, Spain
- Stephan Ossowski
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain; Institute of Medical Genetics and Applied Genomics, University of Tübingen, Tübingen, Germany
5
Sheng Q, Vickers K, Zhao S, Wang J, Samuels DC, Koues O, Shyr Y, Guo Y. Multi-perspective quality control of Illumina RNA sequencing data analysis. Brief Funct Genomics 2018; 16:194-204. [PMID: 27687708 DOI: 10.1093/bfgp/elw035] [Citation(s) in RCA: 49] [Impact Index Per Article: 8.2] [Indexed: 12/26/2022]
Abstract
Quality control (QC) is a critical step in RNA sequencing (RNA-seq). Yet, it is often ignored or conducted on a limited basis. Here, we present a multi-perspective strategy for QC of RNA-seq experiments. The QC of RNA-seq can be divided into four related stages: (1) RNA quality, (2) raw read data (FASTQ), (3) alignment and (4) gene expression. We illustrate the importance of conducting QC at each stage of an RNA-seq experiment and demonstrate our recommended RNA-seq QC strategy. Furthermore, we discuss the major and often neglected quality issues associated with the three major types of RNA-seq: mRNA, total RNA and small RNA. This RNA-seq QC overview provides comprehensive guidance for researchers who conduct RNA-seq experiments.
6
Raskin L, Guo Y, Du L, Clendenning M, Rosty C, Lindor NM, Gruber SB, Buchanan DD. Targeted sequencing of established and candidate colorectal cancer genes in the Colon Cancer Family Registry Cohort. Oncotarget 2017; 8:93450-93463. [PMID: 29212164 PMCID: PMC5706810 DOI: 10.18632/oncotarget.18596] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.1] [Received: 02/16/2017] [Accepted: 04/19/2017] [Indexed: 01/07/2023]
Abstract
The underlying genetic cause of colorectal cancer (CRC) can be identified for 5-10% of all cases, while at least 20% of CRC cases are thought to be due to inherited genetic factors. Screening for highly penetrant mutations in genes associated with Mendelian cancer syndromes using next-generation sequencing (NGS) can be prohibitively expensive for studies requiring large sample sizes. The aim of the study was to identify rare single nucleotide variants and small indels in 40 established or candidate CRC susceptibility genes in 1,046 familial CRC cases (including both MSS and MSI-H tumor subtypes) and 1,006 unrelated controls from the Colon Cancer Family Registry Cohort using a robust and cost-effective DNA pooling NGS strategy. We identified 264 variants in 38 genes that were observed only in cases, comprising either very rare (minor allele frequency <0.001) or not previously reported (n=90, 34%) in reference databases, including six stop-gain, three frameshift, and 255 non-synonymous variants predicted to be damaging. We found novel germline mutations in established CRC genes MLH1, APC, and POLE, and likely pathogenic variants in cancer susceptibility genes BAP1, CDH1, CHEK2, ENG, and MSH3. For the candidate CRC genes, we identified likely pathogenic variants in the helicase domain of POLQ and in the LRIG1, SH2B3, and NOS1 genes and present their clinicopathological characteristics. Using a DNA pooling NGS strategy, we identified novel germline mutations in established CRC susceptibility genes in familial CRC cases. Further studies are required to support the role of POLQ, LRIG1, SH2B3, and NOS1 as CRC susceptibility genes.
Affiliation(s)
- Leon Raskin
  - Division of Epidemiology, School of Medicine, Vanderbilt University Medical Center and Vanderbilt Ingram Comprehensive Cancer Center, Nashville, TN, USA
- Yan Guo
  - Center for Quantitative Sciences, Vanderbilt University Medical Center and Vanderbilt Ingram Comprehensive Cancer Center, Nashville, TN, USA
- Liping Du
  - Center for Quantitative Sciences, Vanderbilt University Medical Center and Vanderbilt Ingram Comprehensive Cancer Center, Nashville, TN, USA
- Mark Clendenning
  - Colorectal Oncogenomics Group, Genetic Epidemiology Laboratory, Department of Pathology, University of Melbourne, Parkville, Victoria, Australia
- Christophe Rosty
  - Colorectal Oncogenomics Group, Genetic Epidemiology Laboratory, Department of Pathology, University of Melbourne, Parkville, Victoria, Australia
  - Envoi Specialist Pathologists, Herston, Queensland, Australia
  - University of Queensland, School of Medicine, Herston, Queensland, Australia
- Colon Cancer Family Registry (CCFR)
  - Division of Epidemiology, School of Medicine, Vanderbilt University Medical Center and Vanderbilt Ingram Comprehensive Cancer Center, Nashville, TN, USA
  - Center for Quantitative Sciences, Vanderbilt University Medical Center and Vanderbilt Ingram Comprehensive Cancer Center, Nashville, TN, USA
  - Colorectal Oncogenomics Group, Genetic Epidemiology Laboratory, Department of Pathology, University of Melbourne, Parkville, Victoria, Australia
  - Envoi Specialist Pathologists, Herston, Queensland, Australia
  - University of Queensland, School of Medicine, Herston, Queensland, Australia
  - Department of Health Sciences Research, Mayo Clinic, Scottsdale, AZ, USA
  - USC Norris Comprehensive Cancer Center, Los Angeles, CA, USA
  - Department of Preventive Medicine, University of Southern California, Los Angeles, CA, USA
  - Genetic Medicine and Familial Cancer Centre, The Royal Melbourne Hospital, Parkville, Victoria, Australia
- Noralane M. Lindor
  - Department of Health Sciences Research, Mayo Clinic, Scottsdale, AZ, USA
- Stephen B. Gruber
  - USC Norris Comprehensive Cancer Center, Los Angeles, CA, USA
  - Department of Preventive Medicine, University of Southern California, Los Angeles, CA, USA
- Daniel D. Buchanan
  - Colorectal Oncogenomics Group, Genetic Epidemiology Laboratory, Department of Pathology, University of Melbourne, Parkville, Victoria, Australia
  - University of Queensland, School of Medicine, Herston, Queensland, Australia
  - Genetic Medicine and Familial Cancer Centre, The Royal Melbourne Hospital, Parkville, Victoria, Australia
7
Guo Y, Zhao S, Sheng Q, Samuels DC, Shyr Y. The discrepancy among single nucleotide variants detected by DNA and RNA high throughput sequencing data. BMC Genomics 2017; 18:690. [PMID: 28984205 PMCID: PMC5629567 DOI: 10.1186/s12864-017-4022-x] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.3] [Indexed: 12/30/2022]
Abstract
BACKGROUND: High throughput sequencing technology enables both the human genome and transcriptome to be screened at single nucleotide resolution. Tools have been developed to infer single nucleotide variants (SNVs) from both DNA and RNA sequencing data. To evaluate how much difference can be expected between DNA and RNA sequencing data, and among tissue sources, we designed a study to examine the single nucleotide difference among five sources of high throughput sequencing data generated from the same individual, including exome sequencing from blood, tumor and adjacent normal tissue, and RNAseq from tumor and adjacent normal tissue. RESULTS: Through careful quality control and analysis of the SNVs, we found little difference between DNA-DNA pairs (1%-2%). However, between DNA-RNA pairs, SNV differences ranged anywhere from 10% to 20%. CONCLUSIONS: Only a small portion of these differences can be explained by RNA editing. Instead, the majority of the DNA-RNA differences should be attributed to technical errors from sequencing and post-processing of RNAseq data. Our analysis results suggest that SNV detection using RNAseq is subject to high false positive rates.
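A pairwise difference between two SNV call sets, such as the DNA-DNA and DNA-RNA comparisons reported here, can be sketched as a set operation. The definition below (symmetric difference over union) is an assumption for illustration; the abstract does not state how the percentages were computed, and the variants are invented.

```python
# A variant is identified by (chromosome, position, reference allele, alternate allele).
Variant = tuple[str, int, str, str]


def discordance(calls_a: set[Variant], calls_b: set[Variant]) -> float:
    """Share of variants found in exactly one of the two call sets."""
    union = calls_a | calls_b
    if not union:
        return 0.0
    return len(calls_a ^ calls_b) / len(union)


# Hypothetical call sets from two data sources for the same individual.
dna_calls = {("1", 100, "A", "G"), ("1", 200, "C", "T"), ("2", 50, "G", "A")}
rna_calls = {("1", 100, "A", "G"), ("1", 200, "C", "T"), ("2", 75, "T", "C")}
print(discordance(dna_calls, rna_calls))
```

In practice the comparison would be restricted to positions with adequate coverage in both data sources, since a site that is unexpressed in RNA cannot yield a call regardless of the genotype.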
Affiliation(s)
- Yan Guo
  - Department of Biomedical Informatics, Vanderbilt University, 2220 Pierce Ave, 571 PRB, Nashville, TN, 37027, USA
- Shilin Zhao
  - Department of Biomedical Informatics, Vanderbilt University, 2220 Pierce Ave, 571 PRB, Nashville, TN, 37027, USA
- Quanhu Sheng
  - Department of Biomedical Informatics, Vanderbilt University, 2220 Pierce Ave, 571 PRB, Nashville, TN, 37027, USA
- David C Samuels
  - Vanderbilt Genetics Institute, Department of Molecular Physiology and Biophysics, Vanderbilt University Medical School, Nashville, TN, USA
- Yu Shyr
  - Department of Biostatistics, Vanderbilt University, 2220 Pierce Ave, 571 PRB, Nashville, TN, 37027, USA
8
Chiara M, Pavesi G. Evaluation of Quality Assessment Protocols for High Throughput Genome Resequencing Data. Front Genet 2017; 8:94. [PMID: 28736571 PMCID: PMC5500642 DOI: 10.3389/fgene.2017.00094] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 03/28/2017] [Accepted: 06/21/2017] [Indexed: 12/14/2022]
Abstract
Large-scale initiatives aiming to recover the complete sequence of thousands of human genomes are currently being undertaken worldwide, contributing to the generation of a comprehensive catalog of human genetic variation. The ultimate and most ambitious goal of human population scale genomics is the characterization of the so-called human “variome,” through the identification of causal mutations or haplotypes. Several research institutions worldwide currently use genotyping assays based on Next-Generation Sequencing (NGS) for diagnostics and clinical screenings, and the widespread application of such technologies promises major revolutions in medical science. Bioinformatic analysis of human resequencing data is one of the main factors limiting the effectiveness and general applicability of NGS for clinical studies. The requirement for multiple tools, to be combined in dedicated protocols in order to accommodate different types of data (gene panels, exomes, or whole genomes), and the high variability of the data make it difficult to establish an ultimate strategy of general use. While there already exist several studies comparing sensitivity and accuracy of bioinformatic pipelines for the identification of single nucleotide variants from resequencing data, little is known about the impact of quality assessment and read pre-processing strategies. In this work we discuss major strengths and limitations of the various genome resequencing protocols that are currently used in molecular diagnostics and for the discovery of novel disease-causing mutations. By taking advantage of publicly available data, we devise and suggest a series of best practices for the pre-processing of the data that consistently improve the outcome of genotyping with minimal impact on computational costs.
Affiliation(s)
- Matteo Chiara
  - Dipartimento di Bioscienze, Università di Milano, Milan, Italy
- Giulio Pavesi
  - Dipartimento di Bioscienze, Università di Milano, Milan, Italy
9
Weldatsadik RG, Wang J, Puhakainen K, Jiao H, Jalava J, Räisänen K, Datta N, Skoog T, Vuopio J, Jokiranta TS, Kere J. Sequence analysis of pooled bacterial samples enables identification of strain variation in group A streptococcus. Sci Rep 2017; 7:45771. [PMID: 28361960 PMCID: PMC5374712 DOI: 10.1038/srep45771] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 11/15/2016] [Accepted: 03/02/2017] [Indexed: 12/30/2022]
Abstract
Knowledge of the genomic variation among different strains of a pathogenic microbial species can help in selecting optimal candidates for diagnostic assays and vaccine development. Pooled sequencing (Pool-seq) is a cost-effective approach for population-level genetic studies that require large numbers of samples, such as various strains of a microbe. To test the use of Pool-seq in identifying variation, we pooled DNA of 100 Streptococcus pyogenes strains of different emm types in two pools, each containing 50 strains. We used four variant calling tools (Freebayes, UnifiedGenotyper, SNVer, and SAMtools) and one emm1 strain, SF370, as a reference genome. In total, 63719 SNPs and 164 INDELs were identified in the two pools concordantly by at least two of the tools. The majority of the variants (93.4%) from six individually sequenced strains used in the pools could be identified from the two pools, and 72.3% and 97.4% of the variants in the pools could be mined from the analysis of the 44 complete S. pyogenes genomes and 3407 sequence runs deposited in the European Nucleotide Archive, respectively. We conclude that DNA sequencing of pooled samples of large numbers of bacterial strains is a robust, rapid and cost-efficient way to discover sequence variation.
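The rule of keeping variants called concordantly by at least two of the four tools can be sketched as a simple vote count. The tool names come from the abstract; the variant tuples, the function name, and the assumption that calls from different tools have already been normalized to a comparable representation are illustrative only.

```python
from collections import Counter


def concordant_variants(calls_by_tool: dict[str, set], min_tools: int = 2) -> set:
    """Variants reported by at least min_tools of the callers."""
    # Each tool's calls form a set, so a tool can vote for a given variant only once.
    support = Counter(v for calls in calls_by_tool.values() for v in calls)
    return {v for v, n in support.items() if n >= min_tools}


# Hypothetical calls as (position, ref, alt) against a single reference genome.
calls = {
    "Freebayes": {(1021, "A", "G"), (5530, "C", "T"), (9104, "G", "A")},
    "UnifiedGenotyper": {(1021, "A", "G"), (5530, "C", "T")},
    "SNVer": {(1021, "A", "G"), (7777, "T", "C")},
    "SAMtools": {(9104, "G", "A")},
}
print(sorted(concordant_variants(calls)))
```

Raising `min_tools` trades sensitivity for specificity, which matters for pooled data where low-frequency variants are hard to separate from sequencing error.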
Affiliation(s)
- Rigbe G Weldatsadik
  - Research Programs Unit, Immunobiology, University of Helsinki, and Helsinki University Central Hospital, Helsinki, Finland
- Jingwen Wang
  - Department of Biosciences and Nutrition, Karolinska Institutet, Huddinge, Sweden
- Kai Puhakainen
  - Bacterial Infections Unit, National Institute for Health and Welfare, Turku, Finland; Department of Medical Microbiology and Immunology, University of Turku, Turku, Finland
- Hong Jiao
  - Department of Biosciences and Nutrition, Karolinska Institutet, Huddinge, Sweden
- Jari Jalava
  - Bacterial Infections Unit, National Institute for Health and Welfare, Turku, Finland
- Kati Räisänen
  - Bacterial Infections Unit, National Institute for Health and Welfare, Turku, Finland
- Neeta Datta
  - Research Programs Unit, Immunobiology, University of Helsinki, and Helsinki University Central Hospital, Helsinki, Finland
- Tiina Skoog
  - Department of Biosciences and Nutrition, Karolinska Institutet, Huddinge, Sweden
- Jaana Vuopio
  - Bacterial Infections Unit, National Institute for Health and Welfare, Turku, Finland; Department of Medical Microbiology and Immunology, University of Turku, Turku, Finland
- T Sakari Jokiranta
  - Research Programs Unit, Immunobiology, University of Helsinki, and Helsinki University Central Hospital, Helsinki, Finland
- Juha Kere
  - Department of Biosciences and Nutrition, Karolinska Institutet, Huddinge, Sweden; Molecular Neurology Research Program, University of Helsinki, and Folkhälsan Institute of Genetics, Biomedicum Helsinki, Helsinki, Finland; Department of Genetics and Molecular Medicine, King's College London, London, UK
10
Just RS, Irwin JA, Parson W. Mitochondrial DNA heteroplasmy in the emerging field of massively parallel sequencing. Forensic Sci Int Genet 2015; 18:131-9. [PMID: 26009256 PMCID: PMC4550493 DOI: 10.1016/j.fsigen.2015.05.003] [Citation(s) in RCA: 94] [Impact Index Per Article: 10.4] [Received: 02/20/2015] [Revised: 04/24/2015] [Accepted: 05/05/2015] [Indexed: 12/12/2022]
Abstract
Long an important and useful tool in forensic genetic investigations, mitochondrial DNA (mtDNA) typing continues to mature. Research in the last few years has demonstrated both that data from the entire molecule will have practical benefits in forensic DNA casework, and that massively parallel sequencing (MPS) methods will make full mitochondrial genome (mtGenome) sequencing of forensic specimens feasible and cost-effective. A spate of recent studies has employed these new technologies to assess intraindividual mtDNA variation. However, in several instances, contamination and other sources of mixed mtDNA data have been erroneously identified as heteroplasmy. Well vetted mtGenome datasets based on both Sanger and MPS sequences have found authentic point heteroplasmy in approximately 25% of individuals when minor component detection thresholds are in the range of 10-20%, along with positional distribution patterns in the coding region that differ from patterns of point heteroplasmy in the well-studied control region. A few recent studies that examined very low-level heteroplasmy are concordant with these observations when the data are examined at a common level of resolution. In this review we provide an overview of considerations related to the use of MPS technologies to detect mtDNA heteroplasmy. In addition, we examine published reports on point heteroplasmy to characterize features of the data that will assist in the evaluation of future mtGenome data developed by any typing method.
Affiliation(s)
- Rebecca S Just
  - Armed Forces DNA Identification Laboratory, Armed Forces Medical Examiner System, Dover, DE, USA; American Registry of Pathology, Rockville, MD, USA
- Walther Parson
  - Institute of Legal Medicine, Medical University of Innsbruck, Innsbruck, Austria; Forensic Science Program, The Pennsylvania State University, University Park, PA, USA
11
Guo Y, Zhao S, Bjoring M, Han L. Advanced Datamining Using RNAseq Data. Big Data Analytics in Bioinformatics and Healthcare 2015. [DOI: 10.4018/978-1-4666-6611-5.ch001] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 12/30/2022]
Abstract
In recent years, RNA sequencing (RNAseq) technology has experienced a rapid rise in popularity. Often seen as a competitor of and the ultimate successor to microarray technology given its more accurate and quantitative gene expression measurement, RNAseq also offers a wealth of additional information that is often overlooked, and given the massive accumulation of RNAseq data available in public data repositories over the past few years, these data are ripe for discovery. Abundant opportunities exist for researchers to conduct in-depth, non-traditional analyses that take advantage of these secondary uses and for bioinformaticians to develop tools to make these data more accessible. This is discussed in this chapter.
|
12
|
Han L, Vickers KC, Samuels DC, Guo Y. Alternative applications for distinct RNA sequencing strategies. Brief Bioinform 2014; 16:629-39. [PMID: 25246237 DOI: 10.1093/bib/bbu032] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.1] [Received: 06/06/2014] [Accepted: 08/19/2014] [Indexed: 12/30/2022]
Abstract
Recent advances in RNA library preparation methods, platform accessibility and cost efficiency have allowed high-throughput RNA sequencing (RNAseq) to replace conventional hybridization microarray platforms as the method of choice for mRNA profiling and transcriptome analyses. RNAseq is a powerful technique to profile both long and short RNA expression, and the depth of information gained from distinct RNAseq methods is striking and facilitates discovery. In addition to expression analysis, distinct RNAseq approaches also allow investigators the ability to assess transcriptional elongation, DNA variance and exogenous RNA content. Here we review the current state of the art in transcriptome sequencing and address epigenetic regulation, quantification of transcription activation, RNAseq output and a diverse set of applications for RNAseq data. We detail how RNAseq can be used to identify allele-specific expression, single-nucleotide polymorphisms and somatic mutations and discuss the benefits and limitations of using RNAseq to monitor DNA characteristics. Moreover, we highlight the power of combining RNA- and DNAseq methods for genomic analysis. In summary, RNAseq provides the opportunity to gain greater insight into transcriptional regulation and output than simply miRNA and mRNA profiling.
|
13
|
Ye F, Samuels DC, Clark T, Guo Y. High-throughput sequencing in mitochondrial DNA research. Mitochondrion 2014; 17:157-63. [PMID: 24859348 PMCID: PMC4149223 DOI: 10.1016/j.mito.2014.05.004] [Citation(s) in RCA: 63] [Impact Index Per Article: 6.3] [Received: 01/04/2014] [Revised: 04/04/2014] [Accepted: 05/13/2014] [Indexed: 12/14/2022]
Abstract
Next-generation sequencing, also known as high-throughput sequencing, has greatly enhanced researchers' ability to conduct biomedical research on all levels. Mitochondrial research has also benefitted greatly from high-throughput sequencing; sequencing technology now allows for screening of all 16,569 base pairs of the mitochondrial genome simultaneously for SNPs and low level heteroplasmy and, in some cases, the estimation of mitochondrial DNA copy number. It is important to realize the full potential of high-throughput sequencing for the advancement of mitochondrial research. To this end, we review how high-throughput sequencing has impacted mitochondrial research in the categories of SNPs, low level heteroplasmy, copy number, and structural variants. We also discuss the different types of mitochondrial DNA sequencing and their pros and cons. Based on previous studies conducted by various groups, we provide strategies for processing mitochondrial DNA sequencing data, including assembly, variant calling, and quality control.
Affiliation(s)
- Fei Ye: Department of Biostatistics, Vanderbilt University, Nashville, TN 37232, USA
- David C Samuels: Center for Human Genetics, Vanderbilt University, Nashville, TN 37232, USA
- Travis Clark: Vanderbilt Technology for Advanced Genomics, Vanderbilt University, Nashville, TN 37232, USA
- Yan Guo: Department of Cancer Biology, Vanderbilt University, Nashville, TN 37232, USA
|
14
|
Konczal M, Koteja P, Stuglik MT, Radwan J, Babik W. Accuracy of allele frequency estimation using pooled RNA-Seq. Mol Ecol Resour 2013; 14:381-92. [PMID: 24119300 DOI: 10.1111/1755-0998.12186] [Citation(s) in RCA: 51] [Impact Index Per Article: 4.6] [Received: 06/21/2013] [Revised: 09/30/2013] [Accepted: 10/06/2013] [Indexed: 11/28/2022]
Abstract
For nonmodel organisms, genome-wide information that describes functionally relevant variation may be obtained by RNA-Seq following de novo transcriptome assembly. While sequencing has become relatively inexpensive, the preparation of a large number of sequencing libraries remains prohibitively expensive for population genetic analyses of nonmodel species. Pooling samples may then be an attractive alternative. To test whether pooled RNA-Seq accurately predicts true allele frequencies, we analysed the liver transcriptomes of 10 bank voles. Each sample was sequenced both as an individually barcoded library and as a part of a pool. Equal amounts of total RNA from each vole were pooled prior to mRNA selection and library construction. Reads were mapped onto the de novo assembled reference transcriptome. High-quality genotypes for individual voles, determined for 23,682 SNPs, provided information on 'true' allele frequencies; allele frequencies estimated from the pool were then compared with these values. 'True' frequencies and those estimated from the pool were highly correlated. Mean relative estimation error was 21% and did not depend on expression level. However, we also observed a minor effect of interindividual variation in gene expression and allele-specific gene expression influencing allele frequency estimation accuracy. Moreover, we observed a strong negative relationship between minor allele frequency and relative estimation error. Our results indicate that pooled RNA-Seq exhibits accuracy comparable with pooled genome resequencing, but variation in expression level between individuals should be assessed and accounted for. This should help in taking into account the difference in accuracy between conservatively expressed transcripts and those that are variable in expression level.
Affiliation(s)
- M Konczal: Institute of Environmental Sciences, Jagiellonian University, Gronostajowa 7, 30-387, Kraków, Poland
|
15
|
Guo Y, Ye F, Sheng Q, Clark T, Samuels DC. Three-stage quality control strategies for DNA re-sequencing data. Brief Bioinform 2013; 15:879-89. [PMID: 24067931 DOI: 10.1093/bib/bbt069] [Citation(s) in RCA: 117] [Impact Index Per Article: 10.6] [Indexed: 01/15/2023]
Abstract
Advances in next-generation sequencing (NGS) technologies have greatly improved our ability to detect genomic variants for biomedical research. In particular, NGS technologies have been recently applied with great success to the discovery of mutations associated with the growth of various tumours and in rare Mendelian diseases. The advance in NGS technologies has also created significant challenges in bioinformatics. One of the major challenges is quality control of the sequencing data. In this review, we discuss the proper quality control procedures and parameters for Illumina technology-based human DNA re-sequencing at three different stages of sequencing: raw data, alignment and variant calling. Monitoring quality control metrics at each of the three stages of NGS data provides unique and independent evaluations of data quality from differing perspectives. Properly conducting quality control protocols at all three stages and correctly interpreting the quality control results are crucial to ensure a successful and meaningful study.
|