1
Efficient Two-Stage Analysis for Complex Trait Association with Arbitrary Depth Sequencing Data. Stats 2023. [DOI: 10.3390/stats6010029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/22/2023] [Open access]
Abstract
Sequencing-based genetic association analysis is typically performed by first generating genotype calls from sequence data and then performing association tests on the called genotypes. Standard approaches require accurate genotype calling (GC), which can be achieved either with high sequencing depth (typically available in a small number of individuals) or via computationally intensive multi-sample linkage disequilibrium (LD)-aware methods. We propose a computationally efficient two-stage combination approach for association analysis, in which single-nucleotide polymorphisms (SNPs) are screened in the first stage via a rapid maximum likelihood (ML)-based method on sequence data directly (without first calling genotypes), and then the selected SNPs are evaluated in the second stage by performing association tests on genotypes from multi-sample LD-aware calling. Extensive simulation- and real data-based studies show that the proposed two-stage approaches can save 80% of the computational costs and still obtain more than 90% of the power of the classical method to genotype all markers at various depths d≥2.
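The screening logic described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the fixed `keep_fraction` rule, and the simple additive cost model are assumptions introduced here, and the p-values are made-up inputs standing in for the stage-1 ML-based test.

```python
def select_for_stage_two(screen_pvalues, keep_fraction):
    """Return the indices of the SNPs with the smallest stage-1 p-values.

    Hypothetical selection rule: keep a fixed fraction of all SNPs.
    """
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    n_keep = max(1, round(keep_fraction * len(screen_pvalues)))
    # Rank SNP indices by screening p-value, smallest first.
    ranked = sorted(range(len(screen_pvalues)), key=lambda i: screen_pvalues[i])
    return sorted(ranked[:n_keep])


def relative_cost(keep_fraction, screen_cost_ratio):
    """Cost of the two-stage run relative to LD-aware calling of every SNP.

    Assumed cost model: every SNP pays the cheap screening cost, and only
    the kept SNPs pay the full LD-aware calling cost (normalised to 1).
    screen_cost_ratio = per-SNP stage-1 cost / per-SNP stage-2 cost.
    """
    return screen_cost_ratio + keep_fraction


if __name__ == "__main__":
    # Made-up stage-1 p-values for ten SNPs.
    pvals = [0.5, 0.001, 0.3, 0.04, 0.9, 0.2, 0.7, 0.6, 0.8, 0.45]
    print(select_for_stage_two(pvals, keep_fraction=0.2))
    print(f"{relative_cost(keep_fraction=0.15, screen_cost_ratio=0.05):.2f}")
```

Under this assumed cost model, a saving of about 80% corresponds to the screening cost plus the kept fraction summing to roughly 0.2; the specific split used above is illustrative only.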
2
Eph and Ephrin Variants in Malaysian Neural Tube Defect Families. Genes (Basel) 2022; 13:952. [PMID: 35741713 PMCID: PMC9222557 DOI: 10.3390/genes13060952] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/07/2022] [Revised: 05/20/2022] [Accepted: 05/23/2022] [Indexed: 02/01/2023] [Open access]
Abstract
Neural tube defects (NTDs) are common birth defects with a complex genetic etiology. Mouse genetic models have indicated a number of candidate genes, of which functional mutations in some have been found in human NTDs, usually in a heterozygous state. This study focuses on Ephs-ephrins as candidate genes of interest owing to growing evidence of the role of this gene family during neural tube closure in mouse models. Eph-ephrin genes were analyzed in 31 Malaysian individuals comprising seven individuals with sporadic spina bifida, 13 parents, one twin-sibling and 10 unrelated controls. Whole exome sequencing analysis and bioinformatic analysis were performed to identify variants in 22 known Eph-ephrin genes. We reported that three out of seven spina bifida probands and three out of thirteen family members carried a variant in either EPHA2 (rs147977279), EPHB6 (rs780569137) or EFNB1 (rs772228172). Analysis of public databases shows that these variants are rare. In exome datasets of the probands and parents of the probands with Eph-ephrin variants, the genotypes of spina bifida-related genes were compared to investigate the probability of the gene–gene interaction in relation to environmental risk factors. We report the presence of Eph-ephrin gene variants that are prevalent in a small cohort of spina bifida patients in Malaysian families.
3
Rashid I, Campos M, Collier T, Crepeau M, Weakley A, Gripkey H, Lee Y, Schmidt H, Lanzaro GC. Spontaneous mutation rate estimates for the principal malaria vectors Anopheles coluzzii and Anopheles stephensi. Sci Rep 2022; 12:226. [PMID: 34996998 PMCID: PMC8742016 DOI: 10.1038/s41598-021-03943-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/16/2021] [Accepted: 12/07/2021] [Indexed: 11/17/2022] [Open access]
Abstract
Using high-depth whole genome sequencing of F0 mating pairs and multiple individual F1 offspring, we estimated the nuclear mutation rate per generation in the malaria vectors Anopheles coluzzii and Anopheles stephensi by detecting de novo genetic mutations. A purpose-built computer program was employed to filter actual mutations from a deep background of superficially similar artifacts resulting from read misalignment. Performance of filtering parameters was determined using software-simulated mutations, and the resulting estimate of false negative rate was used to correct final mutation rate estimates. Spontaneous mutation rates by base substitution were estimated at 1.00 × 10^−9 (95% confidence interval, 2.06 × 10^−10 to 2.91 × 10^−9) and 1.36 × 10^−9 (95% confidence interval, 4.42 × 10^−10 to 3.18 × 10^−9) per site per generation in A. coluzzii and A. stephensi, respectively. Although similar studies have been performed on other insect species including dipterans, this is the first study to empirically measure mutation rates in the important genus Anopheles, and thus provides an estimate of µ that will be of utility for comparative evolutionary genomics, as well as for population genetic analysis of malaria vector mosquito species.
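The false-negative correction mentioned in this abstract can be illustrated with a short sketch. The formula below is a commonly used form for parent-offspring designs and is an assumption here, not taken from the paper; the function name, the diploid factor, and all input numbers are hypothetical.

```python
def mutation_rate(n_mutations, callable_sites, n_offspring,
                  false_negative_rate, ploidy=2):
    """Per-site, per-generation mutation rate with a false-negative correction.

    Assumed formula: mu = m / ((1 - FNR) * ploidy * L * n), where m is the
    number of de novo mutations observed, L the number of callable sites per
    haploid genome, and n the number of F1 offspring screened.
    """
    if not 0 <= false_negative_rate < 1:
        raise ValueError("false_negative_rate must be in [0, 1)")
    # Scale the observed count up to account for mutations the filter missed.
    corrected_count = n_mutations / (1.0 - false_negative_rate)
    return corrected_count / (ploidy * callable_sites * n_offspring)


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to show the arithmetic.
    mu = mutation_rate(n_mutations=4, callable_sites=1e8,
                       n_offspring=10, false_negative_rate=0.2)
    print(f"{mu:.2e} per site per generation")
```

With these made-up inputs, four observed mutations become five after correction, spread over 2 × 10^8 × 10 haploid site-generations.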
Affiliation(s)
- Iliyas Rashid
  - Vector Genetics Laboratory, Department of Pathology, Microbiology and Immunology, UC Davis, 1089 Veterinary Medicine Dr, 4225 VM3B, Davis, CA, 95616, USA; Section of Cell and Developmental Biology, University of California, San Diego, La Jolla, CA, USA; Tata Institute for Genetics and Society, Center at inStem, Bangalore, Karnataka, 560065, India
- Melina Campos
  - Vector Genetics Laboratory, Department of Pathology, Microbiology and Immunology, UC Davis, 1089 Veterinary Medicine Dr, 4225 VM3B, Davis, CA, 95616, USA
- Travis Collier
  - Vector Genetics Laboratory, Department of Pathology, Microbiology and Immunology, UC Davis, 1089 Veterinary Medicine Dr, 4225 VM3B, Davis, CA, 95616, USA
- Marc Crepeau
  - Vector Genetics Laboratory, Department of Pathology, Microbiology and Immunology, UC Davis, 1089 Veterinary Medicine Dr, 4225 VM3B, Davis, CA, 95616, USA
- Allison Weakley
  - Department of ChEM-H Operations, Stanford University, 450 Serra Mall, Stanford, CA, 94305, USA
- Hans Gripkey
  - Vector Genetics Laboratory, Department of Pathology, Microbiology and Immunology, UC Davis, 1089 Veterinary Medicine Dr, 4225 VM3B, Davis, CA, 95616, USA
- Yoosook Lee
  - Florida Medical Entomology Laboratory, University of Florida, 200 9th St SE, Vero Beach, FL, 32962, USA
- Hanno Schmidt
  - Anthropology, Institute of Organismic and Molecular Evolution (iomE), Johannes Gutenberg University of Mainz, Saarstraße 21, 55122, Mainz, Germany
- Gregory C Lanzaro
  - Vector Genetics Laboratory, Department of Pathology, Microbiology and Immunology, UC Davis, 1089 Veterinary Medicine Dr, 4225 VM3B, Davis, CA, 95616, USA
4
Variant Calling Using Whole Genome Resequencing and Sequence Capture for Population and Evolutionary Genomic Inferences in Norway Spruce (Picea Abies). Compendium of Plant Genomes 2020. [DOI: 10.1007/978-3-030-21001-4_2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 12/16/2022]
5
Tieman D, Zhu G, Resende MFR, Lin T, Nguyen C, Bies D, Rambla JL, Beltran KSO, Taylor M, Zhang B, Ikeda H, Liu Z, Fisher J, Zemach I, Monforte A, Zamir D, Granell A, Kirst M, Huang S, Klee H. A chemical genetic roadmap to improved tomato flavor. Science 2017; 355:391-394. [PMID: 28126817 DOI: 10.1126/science.aal1556] [Citation(s) in RCA: 378] [Impact Index Per Article: 47.3] [Received: 10/06/2016] [Accepted: 12/22/2016] [Indexed: 11/02/2022]
Abstract
Modern commercial tomato varieties are substantially less flavorful than heirloom varieties. To understand and ultimately correct this deficiency, we quantified flavor-associated chemicals in 398 modern, heirloom, and wild accessions. A subset of these accessions was evaluated in consumer panels, identifying the chemicals that made the most important contributions to flavor and consumer liking. We found that modern commercial varieties contain significantly lower amounts of many of these important flavor chemicals than older varieties. Whole-genome sequencing and a genome-wide association study permitted identification of genetic loci that affect most of the target flavor chemicals, including sugars, acids, and volatiles. Together, these results provide an understanding of the flavor deficiencies in modern commercial varieties and the information necessary for the recovery of good flavor through molecular breeding.
Affiliation(s)
- Denise Tieman
  - Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, No. 7, Pengfei Road, Dapeng District, Shenzhen 518124, China; Horticultural Sciences, Plant Innovation Center, University of Florida, Post Office Box 110690, Gainesville, FL 32611, USA
- Guangtao Zhu
  - Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, No. 7, Pengfei Road, Dapeng District, Shenzhen 518124, China; Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Key Laboratory of Biology and Genetic Improvement of Horticultural Crops of the Ministry of Agriculture, Sino-Dutch Joint Laboratory of Horticultural Genomics, Beijing 100081, China
- Tao Lin
  - Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, No. 7, Pengfei Road, Dapeng District, Shenzhen 518124, China; Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Key Laboratory of Biology and Genetic Improvement of Horticultural Crops of the Ministry of Agriculture, Sino-Dutch Joint Laboratory of Horticultural Genomics, Beijing 100081, China
- Cuong Nguyen
  - Horticultural Sciences, Plant Innovation Center, University of Florida, Post Office Box 110690, Gainesville, FL 32611, USA
- Dawn Bies
  - Horticultural Sciences, Plant Innovation Center, University of Florida, Post Office Box 110690, Gainesville, FL 32611, USA
- Jose Luis Rambla
  - Instituto de Biología Molecular y Celular de Plantas (Consejo Superior de Investigaciones Científicas-Universitat Politècnica de València), València, Spain
- Kristty Stephanie Ortiz Beltran
  - Instituto de Biología Molecular y Celular de Plantas (Consejo Superior de Investigaciones Científicas-Universitat Politècnica de València), València, Spain
- Mark Taylor
  - Horticultural Sciences, Plant Innovation Center, University of Florida, Post Office Box 110690, Gainesville, FL 32611, USA
- Bo Zhang
  - Horticultural Sciences, Plant Innovation Center, University of Florida, Post Office Box 110690, Gainesville, FL 32611, USA
- Hiroki Ikeda
  - Horticultural Sciences, Plant Innovation Center, University of Florida, Post Office Box 110690, Gainesville, FL 32611, USA
- Zhongyuan Liu
  - Horticultural Sciences, Plant Innovation Center, University of Florida, Post Office Box 110690, Gainesville, FL 32611, USA
- Josef Fisher
  - Faculty of Agriculture, Hebrew University of Jerusalem, Rehovot, Israel
- Itay Zemach
  - Faculty of Agriculture, Hebrew University of Jerusalem, Rehovot, Israel
- Antonio Monforte
  - Instituto de Biología Molecular y Celular de Plantas (Consejo Superior de Investigaciones Científicas-Universitat Politècnica de València), València, Spain
- Dani Zamir
  - Faculty of Agriculture, Hebrew University of Jerusalem, Rehovot, Israel
- Antonio Granell
  - Instituto de Biología Molecular y Celular de Plantas (Consejo Superior de Investigaciones Científicas-Universitat Politècnica de València), València, Spain
- Matias Kirst
  - School of Forest Resources and Conservation, Genetics Institute, University of Florida, Gainesville, FL 32611, USA
- Sanwen Huang
  - Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, No. 7, Pengfei Road, Dapeng District, Shenzhen 518124, China; Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Key Laboratory of Biology and Genetic Improvement of Horticultural Crops of the Ministry of Agriculture, Sino-Dutch Joint Laboratory of Horticultural Genomics, Beijing 100081, China
- Harry Klee
  - Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, No. 7, Pengfei Road, Dapeng District, Shenzhen 518124, China; Horticultural Sciences, Plant Innovation Center, University of Florida, Post Office Box 110690, Gainesville, FL 32611, USA
6
Yan S, Yuan S, Xu Z, Zhang B, Zhang B, Kang G, Byrnes A, Li Y. Likelihood-based complex trait association testing for arbitrary depth sequencing data. Bioinformatics 2015; 31:2955-62. [PMID: 25979475 PMCID: PMC4668777 DOI: 10.1093/bioinformatics/btv307] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 09/27/2014] [Revised: 05/06/2015] [Accepted: 05/11/2015] [Indexed: 11/14/2022] [Open access]
Abstract
In next generation sequencing (NGS)-based genetic studies, researchers typically perform genotype calling first and then apply standard genotype-based methods for association testing. However, such a two-step approach ignores genotype calling uncertainty in the association testing step and may incur power loss and/or inflated type-I error. In the recent literature, a few robust and efficient likelihood-based methods, including both likelihood ratio tests (LRT) and score tests, have been proposed to carry out association testing without intermediate genotype calling. These methods take genotype calling uncertainty into account by directly incorporating the genotype likelihood function (GLF) of NGS data into association analysis. However, existing LRT methods are computationally demanding or do not allow covariate adjustment, while existing score tests are not applicable to markers with low minor allele frequency (MAF). We provide an LRT allowing flexible covariate adjustment, develop a statistically more powerful score test and propose a combination strategy (UNC combo) to leverage the advantages of both tests. We have carried out extensive simulations to evaluate the performance of our proposed LRT and score test. Simulations and real data analysis demonstrate the advantages of our proposed combination strategy: it offers a satisfactory trade-off in terms of computational efficiency, applicability (accommodating both common variants and variants with low MAF) and statistical power, particularly for the analysis of quantitative traits, where the power gain can be up to ~60% when the causal variant is of low frequency (MAF < 0.01).
Availability and implementation: UNC combo and the associated R files, including documentation and examples, are available at http://www.unc.edu/~yunmli/UNCcombo/
Contact: yunli@med.unc.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
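The idea of testing association without calling genotypes can be written compactly. The block below is a generic sketch of how a genotype likelihood function is usually folded into a trait likelihood; the notation and the Hardy-Weinberg genotype prior are assumptions made here for illustration and may differ from the parameterization used in the paper.

```latex
% Assumed notation (not taken from the paper):
%   R_i = sequence reads of individual i
%   G_i = unobserved genotype (0, 1 or 2 copies of the minor allele)
%   p   = minor allele frequency, y_i = trait value, x_i = covariates
%   \beta_G = genotype effect, \gamma = covariate effects, \theta = (p, \beta_G, \gamma)
\[
L(\theta) \;=\; \prod_{i=1}^{n} \sum_{g=0}^{2}
  \underbrace{P(R_i \mid G_i = g)}_{\text{GLF}}\;
  \underbrace{P(G_i = g \mid p)}_{\text{e.g. Hardy--Weinberg}}\;
  f\!\left(y_i \mid g, x_i;\, \beta_G, \gamma\right)
\]
% Likelihood ratio statistic for the null hypothesis of no genotype effect:
\[
\Lambda \;=\; 2\left[\log L(\hat{\theta}) - \log L(\hat{\theta}_0)\right],
\qquad H_0:\ \beta_G = 0
\]
```

Here the sum over g replaces a hard genotype call with a weighted average over all three genotypes, which is how calling uncertainty enters the test. Under the usual regularity conditions, a statistic of this form is compared with a chi-square distribution with one degree of freedom.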
Affiliation(s)
- Song Yan
  - Department of Biostatistics, Department of Genetics, Department of Computer Science, University of North Carolina, Chapel Hill, NC 27599 USA, Merck Research Laboratories, North Wales, PA, USA, School of Statistics, Renmin University of China, Beijing, People's Republic of China, Department of Statistics, North Carolina State University, Raleigh, NC, 27607 USA, Department of Biostatistics, St. Jude Children's Research Hospital, Memphis, TN 38105, USA and Broad Institute of MIT and Harvard, Cambridge, MA 02141, USA
- Shuai Yuan
  - Department of Biostatistics, Department of Genetics, Department of Computer Science, University of North Carolina, Chapel Hill, NC 27599 USA, Merck Research Laboratories, North Wales, PA, USA, School of Statistics, Renmin University of China, Beijing, People's Republic of China, Department of Statistics, North Carolina State University, Raleigh, NC, 27607 USA, Department of Biostatistics, St. Jude Children's Research Hospital, Memphis, TN 38105, USA and Broad Institute of MIT and Harvard, Cambridge, MA 02141, USA
- Zheng Xu
  - Department of Biostatistics, Department of Genetics, Department of Computer Science, University of North Carolina, Chapel Hill, NC 27599 USA, Merck Research Laboratories, North Wales, PA, USA, School of Statistics, Renmin University of China, Beijing, People's Republic of China, Department of Statistics, North Carolina State University, Raleigh, NC, 27607 USA, Department of Biostatistics, St. Jude Children's Research Hospital, Memphis, TN 38105, USA and Broad Institute of MIT and Harvard, Cambridge, MA 02141, USA
- Baqun Zhang
  - Department of Biostatistics, Department of Genetics, Department of Computer Science, University of North Carolina, Chapel Hill, NC 27599 USA, Merck Research Laboratories, North Wales, PA, USA, School of Statistics, Renmin University of China, Beijing, People's Republic of China, Department of Statistics, North Carolina State University, Raleigh, NC, 27607 USA, Department of Biostatistics, St. Jude Children's Research Hospital, Memphis, TN 38105, USA and Broad Institute of MIT and Harvard, Cambridge, MA 02141, USA
- Bo Zhang
  - Department of Biostatistics, Department of Genetics, Department of Computer Science, University of North Carolina, Chapel Hill, NC 27599 USA, Merck Research Laboratories, North Wales, PA, USA, School of Statistics, Renmin University of China, Beijing, People's Republic of China, Department of Statistics, North Carolina State University, Raleigh, NC, 27607 USA, Department of Biostatistics, St. Jude Children's Research Hospital, Memphis, TN 38105, USA and Broad Institute of MIT and Harvard, Cambridge, MA 02141, USA
- Guolian Kang
  - Department of Biostatistics, Department of Genetics, Department of Computer Science, University of North Carolina, Chapel Hill, NC 27599 USA, Merck Research Laboratories, North Wales, PA, USA, School of Statistics, Renmin University of China, Beijing, People's Republic of China, Department of Statistics, North Carolina State University, Raleigh, NC, 27607 USA, Department of Biostatistics, St. Jude Children's Research Hospital, Memphis, TN 38105, USA and Broad Institute of MIT and Harvard, Cambridge, MA 02141, USA
- Andrea Byrnes
  - Department of Biostatistics, Department of Genetics, Department of Computer Science, University of North Carolina, Chapel Hill, NC 27599 USA, Merck Research Laboratories, North Wales, PA, USA, School of Statistics, Renmin University of China, Beijing, People's Republic of China, Department of Statistics, North Carolina State University, Raleigh, NC, 27607 USA, Department of Biostatistics, St. Jude Children's Research Hospital, Memphis, TN 38105, USA and Broad Institute of MIT and Harvard, Cambridge, MA 02141, USA
- Yun Li
  - Department of Biostatistics, Department of Genetics, Department of Computer Science, University of North Carolina, Chapel Hill, NC 27599 USA, Merck Research Laboratories, North Wales, PA, USA, School of Statistics, Renmin University of China, Beijing, People's Republic of China, Department of Statistics, North Carolina State University, Raleigh, NC, 27607 USA, Department of Biostatistics, St. Jude Children's Research Hospital, Memphis, TN 38105, USA and Broad Institute of MIT and Harvard, Cambridge, MA 02141, USA
7
Iyer S, Casey E, Bouzek H, Kim M, Deng W, Larsen BB, Zhao H, Bumgarner RE, Rolland M, Mullins JI. Comparison of Major and Minor Viral SNPs Identified through Single Template Sequencing and Pyrosequencing in Acute HIV-1 Infection. PLoS One 2015; 10:e0135903. [PMID: 26317928 PMCID: PMC4552882 DOI: 10.1371/journal.pone.0135903] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Received: 10/15/2014] [Accepted: 07/27/2015] [Indexed: 01/03/2023] [Open access]
Abstract
Massively parallel sequencing (MPS) technologies, such as 454-pyrosequencing, allow for the identification of variants in sequence populations at lower levels than consensus sequencing and most single-template Sanger sequencing experiments. We sought to determine if the greater depth of population sampling attainable using MPS technology would allow detection of minor variants in HIV founder virus populations very early in infection in instances where Sanger sequencing detects only a single variant. We compared single nucleotide polymorphisms (SNPs) during acute HIV-1 infection from 32 subjects using both single template Sanger and 454-pyrosequencing. Pyrosequences from a median of 2400 viral templates per subject, encompassing 40% of the HIV-1 genome, were compared to a median of five individually amplified near full-length viral genomes sequenced using Sanger technology. There was no difference in the consensus nucleotide sequences over the 3.6 kb compared in 84% of the subjects infected with single founders and 33% of subjects infected with multiple founder variants; among the subjects with disagreements, mismatches were found in less than 1% of the sites evaluated (of a total of nearly 117,000 sites across all subjects). The majority of the SNPs observed only in pyrosequences were present at less than 2% of the subject’s viral sequence population. These results demonstrate the utility of the Sanger approach for study of early HIV infection, provide guidance regarding the design, utility and limitations of population sequencing from variable template sources, and emphasize parameters for improving the interpretation of massively parallel sequencing data to address important questions regarding target sequence evolution.
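The sampling-depth argument in this abstract can be made concrete with a small calculation. This is a simplified sketch, assuming templates are sampled independently and ignoring sequencing and amplification error; the function name is hypothetical, while the template counts (5 and 2400) and the 2% frequency are the figures quoted in the abstract.

```python
def detection_probability(variant_frequency, n_templates):
    """Probability that at least one sampled template carries the variant.

    Assumes independent sampling of templates and no sequencing error:
    P(detect) = 1 - (1 - f) ** n.
    """
    if not 0 <= variant_frequency <= 1:
        raise ValueError("variant_frequency must be in [0, 1]")
    if n_templates < 0:
        raise ValueError("n_templates must be non-negative")
    return 1.0 - (1.0 - variant_frequency) ** n_templates


if __name__ == "__main__":
    for n in (5, 2400):
        p = detection_probability(0.02, n)
        print(f"{n} templates: P(see a 2% variant at least once) = {p:.4f}")
```

Under these assumptions, five templates capture a 2% variant in fewer than one in ten subjects, whereas 2400 templates capture it almost surely, which is consistent with most pyrosequencing-only SNPs sitting below 2% frequency.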
Affiliation(s)
- Shyamala Iyer
  - Department of Microbiology, University of Washington, Seattle, WA, 98195, United States of America
- Eleanor Casey
  - Department of Microbiology, University of Washington, Seattle, WA, 98195, United States of America
- Heather Bouzek
  - Department of Microbiology, University of Washington, Seattle, WA, 98195, United States of America
- Moon Kim
  - Department of Microbiology, University of Washington, Seattle, WA, 98195, United States of America
- Wenjie Deng
  - Department of Microbiology, University of Washington, Seattle, WA, 98195, United States of America
- Brendan B. Larsen
  - Department of Microbiology, University of Washington, Seattle, WA, 98195, United States of America
- Hong Zhao
  - Department of Microbiology, University of Washington, Seattle, WA, 98195, United States of America
- Roger E. Bumgarner
  - Department of Microbiology, University of Washington, Seattle, WA, 98195, United States of America
- Morgane Rolland
  - US Military HIV Research Program, WRAIR, Silver Spring, MD, 20910, United States of America
  - Henry Jackson Foundation for the Advancement of Military Medicine, Inc., Bethesda, MD, 20817, United States of America
- James I. Mullins
  - Department of Microbiology, University of Washington, Seattle, WA, 98195, United States of America
  - Department of Medicine, University of Washington, Seattle, WA, 98195, United States of America
  - Department of Laboratory Medicine, Seattle, WA, 98195, United States of America
8
Yuan S, Johnston HR, Zhang G, Li Y, Hu YJ, Qin ZS. One Size Doesn't Fit All - RefEditor: Building Personalized Diploid Reference Genome to Improve Read Mapping and Genotype Calling in Next Generation Sequencing Studies. PLoS Comput Biol 2015; 11:e1004448. [PMID: 26267278 PMCID: PMC4534450 DOI: 10.1371/journal.pcbi.1004448] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Received: 01/16/2015] [Accepted: 07/13/2015] [Indexed: 12/13/2022] [Open access]
Abstract
With the rapid decline in sequencing cost, researchers today are rushing to embrace whole genome sequencing (WGS) or whole exome sequencing (WES) as the next powerful tool for relating genetic variants to human diseases and phenotypes. A fundamental step in analyzing WGS and WES data is mapping short sequencing reads back to the reference genome. This is an important issue because incorrectly mapped reads affect the downstream variant discovery, genotype calling and association analysis. Although many read mapping algorithms have been developed, the majority of them use the universal reference genome and do not take sequence variants into consideration. Given that genetic variants are ubiquitous, it is highly desirable to factor them into the read mapping procedure. In this work, we developed a novel strategy that utilizes genotypes obtained a priori to customize the universal haploid reference genome into a personalized diploid reference genome. The new strategy is implemented in a program named RefEditor. When applying RefEditor to real data, we achieved encouraging improvements in read mapping, variant discovery and genotype calling. Compared to standard approaches, RefEditor can significantly increase genotype calling consistency (from 43% to 61% at 4X coverage; from 82% to 92% at 20X coverage) and reduce Mendelian inconsistency across various sequencing depths. Because many WGS and WES studies are conducted on cohorts that have been genotyped using array-based genotyping platforms previously or concurrently, we believe the proposed strategy will be of high value in practice; it can also be applied to the scenario where multiple NGS experiments are conducted on the same cohort. The RefEditor sources are available at https://github.com/superyuan/refeditor.
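The core idea of turning a haploid reference into a personalized diploid one can be illustrated with a toy example. This sketch is not RefEditor's implementation or interface: the function name, the 0-based positions, the SNP-only input, and the assumption that the genotypes are already phased are all introduced here for illustration.

```python
def personalize_reference(reference, phased_snps):
    """Build two haplotype sequences from a haploid reference.

    reference:   haploid reference sequence as a string.
    phased_snps: dict mapping a 0-based position to a pair
                 (allele_on_haplotype_1, allele_on_haplotype_2).
    Only single-base substitutions are handled in this sketch.
    """
    hap1 = list(reference)
    hap2 = list(reference)
    for pos, (allele1, allele2) in phased_snps.items():
        if not 0 <= pos < len(reference):
            raise IndexError(f"SNP position {pos} is outside the reference")
        # Overwrite the reference base with each haplotype's known allele.
        hap1[pos] = allele1
        hap2[pos] = allele2
    return "".join(hap1), "".join(hap2)


if __name__ == "__main__":
    ref = "ACGTACGT"
    # Heterozygous C/T at position 1, homozygous alternate G/G at position 4.
    snps = {1: ("C", "T"), 4: ("G", "G")}
    hap1, hap2 = personalize_reference(ref, snps)
    print(hap1)
    print(hap2)
```

Reads would then be mapped against both haplotypes rather than the universal reference, so that a read carrying a known alternate allele is no longer penalized as a mismatch.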
Affiliation(s)
- Shuai Yuan
  - Mathematics & Computer Science Department, Emory University, Atlanta, Georgia, United States of America
- H. Richard Johnston
  - Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, Georgia, United States of America
- Guosheng Zhang
  - Department of Genetics, Department of Biostatistics, Department of Computer Science, University of North Carolina, Chapel Hill, Chapel Hill, North Carolina, United States of America
- Yun Li
  - Department of Genetics, Department of Biostatistics, Department of Computer Science, University of North Carolina, Chapel Hill, Chapel Hill, North Carolina, United States of America
- Yi-Juan Hu
  - Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, Georgia, United States of America
- Zhaohui S. Qin
  - Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, Georgia, United States of America
9
Siretskiy A, Sundqvist T, Voznesenskiy M, Spjuth O. A quantitative assessment of the Hadoop framework for analyzing massively parallel DNA sequencing data. Gigascience 2015; 4:26. [PMID: 26045962 PMCID: PMC4455317 DOI: 10.1186/s13742-015-0058-5] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Received: 09/04/2014] [Accepted: 04/09/2015] [Indexed: 01/09/2023] [Open access]
Abstract
Background: New high-throughput technologies, such as massively parallel sequencing, have transformed the life sciences into a data-intensive field. The most common e-infrastructure for analyzing this data consists of batch systems that are based on high-performance computing resources; however, the bioinformatics software that is built on this platform does not scale well in the general case. Recently, the Hadoop platform has emerged as an interesting option to address the challenges of increasingly large datasets with distributed storage, distributed processing, built-in data locality, fault tolerance, and an appealing programming methodology.
Results: In this work we introduce metrics and report on a quantitative comparison between Hadoop and a single node of conventional high-performance computing resources for the tasks of short read mapping and variant calling. We calculate efficiency as a function of data size and observe that the Hadoop platform is more efficient for biologically relevant data sizes in terms of computing hours for both split and un-split data files. We also quantify the advantages of the data locality provided by Hadoop for NGS problems, and show that a classical architecture with network-attached storage will not scale when computing resources increase in numbers. Measurements were performed using ten datasets of different sizes, up to 100 gigabases, using the pipeline implemented in Crossbow. To make a fair comparison, we implemented an improved preprocessor for Hadoop with better performance for splittable data files. For improved usability, we implemented a graphical user interface for Crossbow in a private cloud environment using the CloudGene platform. All of the code and data in this study are freely available as open source in public repositories.
Conclusions: From our experiments we can conclude that the improved Hadoop pipeline scales better than the same pipeline on high-performance computing resources. We also conclude that Hadoop is an economically viable option for the common data sizes that are currently used in massively parallel sequencing. Given that datasets are expected to increase over time, Hadoop is a framework that we envision will have an increasingly important role in future biological data analysis.
Electronic supplementary material: The online version of this article (doi:10.1186/s13742-015-0058-5) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Alexey Siretskiy
  - Department of Information Technology, Uppsala University, P.O. Box 337, Uppsala, SE-75105 Sweden
- Tore Sundqvist
  - Department of Information Technology, Uppsala University, P.O. Box 337, Uppsala, SE-75105 Sweden
- Mikhail Voznesenskiy
  - Department of Physical Chemistry, Institute of Chemistry, St-Petersburg State University, Saint-Petersburg, Russia
- Ola Spjuth
  - Department of Pharmaceutical Biosciences and Science for Life Laboratory, Uppsala University, P.O. Box 541, Uppsala, SE-75124 Sweden
|
10
|
Ten years of next-generation sequencing technology. Trends Genet 2014; 30:418-26. [PMID: 25108476 DOI: 10.1016/j.tig.2014.07.001] [Citation(s) in RCA: 872] [Impact Index Per Article: 79.3] [Received: 05/28/2014] [Revised: 07/08/2014] [Accepted: 07/09/2014] [Indexed: 02/06/2023]
Abstract
Ten years ago next-generation sequencing (NGS) technologies appeared on the market. During the past decade, tremendous progress has been made in terms of speed, read length, and throughput, along with a sharp reduction in per-base cost. Together, these advances democratized NGS and paved the way for the development of a large number of novel NGS applications in basic science as well as in translational research areas such as clinical diagnostics, agrigenomics, and forensic science. Here we provide an overview of the evolution of NGS and discuss the most significant improvements in sequencing technologies and library preparation protocols. We also explore the current landscape of NGS applications and provide a perspective for future developments.
|
11
|
Abstract
Moving from a traditional medical model of treating pathologies to an individualized, predictive, and preventive model of personalized medicine promises to reduce healthcare costs in an overburdened and overwhelmed system. Next-generation sequencing (NGS) has the potential to accelerate the early detection of disorders and the identification of pharmacogenetic markers for customizing treatments. This review explains the historical facts that led to the development of NGS, along with the strengths and weaknesses of NGS, with special emphasis on the analytical aspects of processing NGS data. Solutions exist for all the steps necessary to perform NGS in the clinical context, and most of them are very efficient, but some crucial steps in the process need immediate attention.
Affiliation(s)
- Manuel L. Gonzalez-Garay
- Center for Molecular Imaging, Division of Genomics & Bioinformatics, The Brown Foundation Institute of Molecular Medicine, University of Texas Health Science Center at Houston, Houston, TX 77030, USA
|
12
|
Yu X, Sun S. Comparing a few SNP calling algorithms using low-coverage sequencing data. BMC Bioinformatics 2013; 14:274. [PMID: 24044377 PMCID: PMC3848615 DOI: 10.1186/1471-2105-14-274] [Citation(s) in RCA: 86] [Impact Index Per Article: 7.2] [Received: 05/10/2013] [Accepted: 09/12/2013] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Many Single Nucleotide Polymorphism (SNP) calling programs have been developed to identify Single Nucleotide Variations (SNVs) in next-generation sequencing (NGS) data. However, low sequencing coverage presents challenges to accurate SNV identification, especially in single-sample data. Moreover, commonly used SNP calling programs usually include several metrics in their output files for each potential SNP. These metrics are highly correlated in complex patterns, making it extremely difficult to select SNPs for further experimental validation. RESULTS: To explore solutions to the above challenges, we compare the performance of four SNP calling algorithms, SOAPsnp, Atlas-SNP2, SAMtools, and GATK, on a low-coverage single-sample sequencing dataset. Without any post-output filtering, SOAPsnp calls more SNVs than the other programs because it has fewer internal filtering criteria. Atlas-SNP2 has stringent internal filtering criteria; thus it reports the fewest SNVs. The numbers of SNVs called by GATK and SAMtools fall between those of SOAPsnp and Atlas-SNP2. Moreover, we explore the values of key metrics related to SNV quality in each algorithm and use them as post-output filtering criteria to filter out low-quality SNVs. Under different coverage cutoff values, we compare the four algorithms and calculate the empirical positive calling rate and sensitivity. Our results show that: 1) the overall agreement of the four calling algorithms is low, especially for non-dbSNPs; 2) the agreement of the four algorithms is similar across coverage cutoffs, except that the non-dbSNP agreement level tends to increase slightly with increasing coverage; 3) SOAPsnp, SAMtools, and GATK have a higher empirical calling rate for dbSNPs than for non-dbSNPs; and 4) overall, GATK and Atlas-SNP2 have a relatively higher positive calling rate and sensitivity, but GATK calls more SNVs.
CONCLUSIONS: Our results show that the agreement between different calling algorithms is relatively low. Thus, more caution should be used in choosing algorithms, setting filtering parameters, and designing validation studies. For reliable SNV calling results, we recommend that users employ more than one algorithm and use metrics related to calling quality and coverage as filtering criteria.
Affiliation(s)
- Xiaoqing Yu
- Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, Ohio 44106, USA.
|