1
Baumann A, Ruckert C, Meier C, Hutschenreiter T, Remy R, Schnur B, Döbel M, Fankep RCN, Skowronek D, Kutz O, Arnold N, Katzke AL, Forster M, Kobiela AL, Thiedig K, Zimmer A, Ritter J, Weber BHF, Honisch E, Hackmann K, Schmidt G, Sturm M, Ernst C. Limitations in next-generation sequencing-based genotyping of breast cancer polygenic risk score loci. Eur J Hum Genet 2024; 32:987-997. [PMID: 38907004 PMCID: PMC11291653 DOI: 10.1038/s41431-024-01647-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/21/2023] [Revised: 05/17/2024] [Accepted: 06/10/2024] [Indexed: 06/23/2024] Open
Abstract
Considering polygenic risk scores (PRSs) in individual risk prediction is increasingly implemented in genetic testing for hereditary breast cancer (BC) based on next-generation sequencing (NGS). To calculate individual BC risks, the Breast and Ovarian Analysis of Disease Incidence and Carrier Estimation Algorithm (BOADICEA) with the inclusion of the BCAC 313 or the BRIDGES 306 BC PRS is commonly used. The PRS calculation depends on accurately reproducing the variant allele frequencies (AFs) and, consequently, the distribution of PRS values anticipated by the algorithm. Here, the 324 loci of the BCAC 313 and the BRIDGES 306 BC PRS were examined in the population-specific database gnomAD and in real-world data sets of five centers of the German Consortium for Hereditary Breast and Ovarian Cancer (GC-HBOC), to determine whether these expected AFs can be reproduced by NGS-based genotyping. Four PRS loci were non-existent in gnomAD v3.1.2 non-Finnish Europeans; a further 24 loci showed noticeably deviating AFs. In real-world data, between 11 and 23 loci were reported with noticeably deviating AFs, and were shown to have effects on final risk prediction. Deviations depended on the sequencing approach, variant caller and calling mode (forced versus unforced) employed. Therefore, this study demonstrates the need to apply quality assurance not only in terms of sequencing coverage but also in terms of observed AFs in a sufficiently large cohort, when implementing PRSs in a routine diagnostic setting. Furthermore, future PRS design should be guided by the technical reproducibility of expected AFs across commonly used genotyping methods, especially NGS, in addition to the observed effect sizes.
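The allele-frequency check described in this abstract can be illustrated with a minimal Python sketch. It computes a raw PRS as a weighted sum of effect-allele dosages and flags loci whose observed cohort AF departs from the AF the model expects. The `Locus` fields, the function names, and the 0.05 absolute-difference threshold are hypothetical choices made for this example; they are not taken from the study or from BOADICEA.

```python
from dataclasses import dataclass


@dataclass
class Locus:
    name: str           # hypothetical locus identifier
    weight: float       # per-allele log odds ratio assumed by the PRS model
    expected_af: float  # effect-allele frequency anticipated by the model


def raw_prs(loci, dosages):
    """Weighted sum of effect-allele dosages (0, 1 or 2) for one sample."""
    return sum(locus.weight * dosages[locus.name] for locus in loci)


def observed_af(cohort, name):
    """Effect-allele frequency of one locus across a cohort of diploid samples."""
    return sum(sample[name] for sample in cohort) / (2 * len(cohort))


def deviating_loci(loci, cohort, max_abs_diff=0.05):
    """Names of loci whose observed AF differs from the expected AF.

    The threshold is an illustrative assumption, not a published criterion.
    """
    return [
        locus.name
        for locus in loci
        if abs(observed_af(cohort, locus.name) - locus.expected_af) > max_abs_diff
    ]
```

With a cohort given as a list of per-sample dictionaries mapping locus names to dosages, `deviating_loci(loci, cohort)` returns the loci that would merit review before the PRS is used for risk prediction.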
Affiliation(s)
- Alexandra Baumann
- Institute for Clinical Genetics, University Hospital Carl Gustav Carus at TUD Dresden University of Technology and Faculty of Medicine of TUD Dresden University of Technology, Dresden, Germany
- ERN GENTURIS, Hereditary Cancer Syndrome Center Dresden, Dresden, Germany
- National Center for Tumor Diseases (NCT), NCT/UCC Dresden, a partnership between German Cancer Research Center (DKFZ), Faculty of Medicine and University Hospital Carl Gustav Carus, TUD Dresden University of Technology and Helmholtz-Zentrum Dresden-Rossendorf (HZDR), Dresden, Germany
- German Cancer Consortium (DKTK), Dresden, Germany
- German Cancer Research Center (DKFZ), Heidelberg, Germany
- Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
- Christian Ruckert
- Department of Medical Genetics, University Hospital Münster, Münster, Germany
- Christoph Meier
- Institute of Human Genetics, University of Regensburg, Regensburg, Germany
- Tim Hutschenreiter
- Institute for Clinical Genetics, University Hospital Carl Gustav Carus at TUD Dresden University of Technology and Faculty of Medicine of TUD Dresden University of Technology, Dresden, Germany
- ERN GENTURIS, Hereditary Cancer Syndrome Center Dresden, Dresden, Germany
- National Center for Tumor Diseases (NCT), NCT/UCC Dresden, a partnership between German Cancer Research Center (DKFZ), Faculty of Medicine and University Hospital Carl Gustav Carus, TUD Dresden University of Technology and Helmholtz-Zentrum Dresden-Rossendorf (HZDR), Dresden, Germany
- German Cancer Consortium (DKTK), Dresden, Germany
- German Cancer Research Center (DKFZ), Heidelberg, Germany
- Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
- Robert Remy
- Center for Familial Breast and Ovarian Cancer, Center for Integrated Oncology (CIO), Medical Faculty, University of Cologne and University Hospital Cologne, Cologne, Germany
- Benedikt Schnur
- Department of Human Genetics, Hannover Medical School (MHH), Hannover, Germany
- Marvin Döbel
- Institute of Medical Genetics and Applied Genomics, University Hospital Tübingen, Tübingen, Germany
- Rudel Christian Nkouamedjo Fankep
- Center for Familial Breast and Ovarian Cancer, Center for Integrated Oncology (CIO), Medical Faculty, University of Cologne and University Hospital Cologne, Cologne, Germany
- Dariush Skowronek
- Department of Human Genetics, University Medicine Greifswald and Interfaculty Institute of Genetics and Functional Genomics, University of Greifswald, Greifswald, Germany
- Oliver Kutz
- Institute for Clinical Genetics, University Hospital Carl Gustav Carus at TUD Dresden University of Technology and Faculty of Medicine of TUD Dresden University of Technology, Dresden, Germany
- ERN GENTURIS, Hereditary Cancer Syndrome Center Dresden, Dresden, Germany
- National Center for Tumor Diseases (NCT), NCT/UCC Dresden, a partnership between German Cancer Research Center (DKFZ), Faculty of Medicine and University Hospital Carl Gustav Carus, TUD Dresden University of Technology and Helmholtz-Zentrum Dresden-Rossendorf (HZDR), Dresden, Germany
- German Cancer Consortium (DKTK), Dresden, Germany
- German Cancer Research Center (DKFZ), Heidelberg, Germany
- Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
- Department of Gynecology and Obstetrics, University Hospital Carl Gustav Carus at TUD Dresden University of Technology and Faculty of Medicine of TUD Dresden University of Technology, Dresden, Germany
- Norbert Arnold
- Department of Gynecology and Obstetrics, Institute of Clinical Chemistry Institute of Clinical Molecular Biology, University Hospital Schleswig-Holstein, Campus Kiel, Kiel, Germany
- Anna-Lena Katzke
- Department of Human Genetics, Hannover Medical School (MHH), Hannover, Germany
- Michael Forster
- Department of Gynecology and Obstetrics, Institute of Clinical Chemistry Institute of Clinical Molecular Biology, University Hospital Schleswig-Holstein, Campus Kiel, Kiel, Germany
- Anna-Lena Kobiela
- Center for Familial Breast and Ovarian Cancer, Center for Integrated Oncology (CIO), Medical Faculty, University of Cologne and University Hospital Cologne, Cologne, Germany
- Katharina Thiedig
- Division of Gynaecology and Obstetrics, Klinikum rechts der Isar der Technischen Universität München, München, Germany
- Andreas Zimmer
- Institute for Human Genetics, Medical Center University of Freiburg, Faculty of Medicine, University of Freiburg, Freiburg, Germany
- Julia Ritter
- Department of Human Genetics, Labor Berlin - Charité Vivantes GmbH, Berlin, Germany
- Bernhard H F Weber
- Institute of Human Genetics, University of Regensburg, Regensburg, Germany
- Institute of Clinical Human Genetics, University Hospital Regensburg, Regensburg, Germany
- Ellen Honisch
- Department of Gynaecology and Obstetrics, University Hospital Düsseldorf, Heinrich-Heine University Düsseldorf, Düsseldorf, Germany
- Karl Hackmann
- Institute for Clinical Genetics, University Hospital Carl Gustav Carus at TUD Dresden University of Technology and Faculty of Medicine of TUD Dresden University of Technology, Dresden, Germany
- ERN GENTURIS, Hereditary Cancer Syndrome Center Dresden, Dresden, Germany
- National Center for Tumor Diseases (NCT), NCT/UCC Dresden, a partnership between German Cancer Research Center (DKFZ), Faculty of Medicine and University Hospital Carl Gustav Carus, TUD Dresden University of Technology and Helmholtz-Zentrum Dresden-Rossendorf (HZDR), Dresden, Germany
- German Cancer Consortium (DKTK), Dresden, Germany
- German Cancer Research Center (DKFZ), Heidelberg, Germany
- Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
- Gunnar Schmidt
- Department of Human Genetics, Hannover Medical School (MHH), Hannover, Germany
- Marc Sturm
- Institute of Medical Genetics and Applied Genomics, University Hospital Tübingen, Tübingen, Germany
- Corinna Ernst
- Center for Familial Breast and Ovarian Cancer, Center for Integrated Oncology (CIO), Medical Faculty, University of Cologne and University Hospital Cologne, Cologne, Germany.
2
Hofmeister RJ, Ribeiro DM, Rubinacci S, Delaneau O. Accurate rare variant phasing of whole-genome and whole-exome sequencing data in the UK Biobank. Nat Genet 2023:10.1038/s41588-023-01415-w. [PMID: 37386248 DOI: 10.1038/s41588-023-01415-w] [Citation(s) in RCA: 22] [Impact Index Per Article: 22.0] [Received: 10/27/2022] [Accepted: 05/04/2023] [Indexed: 07/01/2023]
Abstract
Phasing involves distinguishing the two parentally inherited copies of each chromosome into haplotypes. Here, we introduce SHAPEIT5, a new phasing method that quickly and accurately processes large sequencing datasets, and apply it to UK Biobank (UKB) whole-genome and whole-exome sequencing data. We demonstrate that SHAPEIT5 phases rare variants with switch error rates below 5% for variants present in just 1 sample out of 100,000. Furthermore, we outline a method for phasing singletons, which, although less precise, constitutes an important step towards future developments. We then demonstrate that the use of UKB as a reference panel improves the accuracy of genotype imputation, which is even more pronounced when phased with SHAPEIT5 compared with other methods. Finally, we screen the UKB data for loss-of-function compound heterozygous events and identify 549 genes where both gene copies are knocked out. These genes complement current knowledge of gene essentiality in the human genome.
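The switch error rate quoted in this abstract is a standard phasing metric: among consecutive heterozygous sites, it is the fraction of adjacent pairs at which the inferred phase flips relative to the true haplotypes. The Python sketch below shows one common way to compute it. The function name and the input layout (one `(hap1_allele, hap2_allele)` tuple per heterozygous site) are assumptions made for this example and do not reflect SHAPEIT5's own evaluation code.

```python
def switch_error_rate(truth, inferred):
    """Fraction of adjacent heterozygous site pairs with a phase switch.

    truth, inferred: equally long lists of (hap1_allele, hap2_allele)
    tuples, one per heterozygous site, in genomic order.
    """
    if len(truth) != len(inferred):
        raise ValueError("truth and inferred must cover the same sites")

    # Orientation per site: True if the inferred pair matches the truth
    # as given, False if the two haplotypes are swapped.
    orientation = []
    for t, p in zip(truth, inferred):
        if t[0] == t[1]:
            raise ValueError("only heterozygous sites carry phase information")
        if p == t:
            orientation.append(True)
        elif p == (t[1], t[0]):
            orientation.append(False)
        else:
            raise ValueError("genotypes differ between truth and inferred")

    if len(orientation) < 2:
        return 0.0  # no adjacent pairs, so no switch can be observed

    # A switch occurs wherever the orientation changes between neighbours.
    switches = sum(1 for a, b in zip(orientation, orientation[1:]) if a != b)
    return switches / (len(orientation) - 1)
```

A haplotype pair that is swapped consistently across all sites yields a rate of 0, because only changes in orientation between neighbouring sites are counted.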
Affiliation(s)
- Robin J Hofmeister
- Department of Computational Biology, University of Lausanne, Lausanne, Switzerland
- Diogo M Ribeiro
- Department of Computational Biology, University of Lausanne, Lausanne, Switzerland
- Simone Rubinacci
- Department of Computational Biology, University of Lausanne, Lausanne, Switzerland
- Olivier Delaneau
- Department of Computational Biology, University of Lausanne, Lausanne, Switzerland.
3
Lefouili M, Nam K. The evaluation of Bcftools mpileup and GATK HaplotypeCaller for variant calling in non-human species. Sci Rep 2022; 12:11331. [PMID: 35790846 PMCID: PMC9256665 DOI: 10.1038/s41598-022-15563-2] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 09/22/2021] [Accepted: 06/27/2022] [Indexed: 11/09/2022] Open
Abstract
Identification of genetic variations is a central part of population and quantitative genomics studies based on high-throughput sequencing data. Even though popular variant callers such as Bcftools mpileup and GATK HaplotypeCaller were developed nearly 10 years ago, their performance is still largely unknown for non-human species. Here, we showed by benchmark analyses with a simulated insect population that Bcftools mpileup performs better than GATK HaplotypeCaller in terms of recovery rate and accuracy regardless of mapping software. The vast majority of false positives were observed from repeats, especially for GATK HaplotypeCaller. Variant scores calculated by GATK did not clearly distinguish true positives from false positives in the vast majority of cases, implying that hard-filtering with GATK could be challenging. These results suggest that Bcftools mpileup may be the first choice for non-human studies and that variants within repeats might have to be excluded for downstream analyses.
Affiliation(s)
- Kiwoong Nam
- DGIMI, Univ Montpellier, INRAE, Montpellier, France.
4
Zhang C, Zheng T, Ma Q, Yang L, Zhang M, Wang J, Teng X, Miao Y, Lin HC, Yang Y, Han D. Logical Analysis of Multiple Single-Nucleotide-Polymorphisms with Programmable DNA Molecular Computation for Clinical Diagnostics. Angew Chem Int Ed Engl 2022; 61:e202117658. [PMID: 35137499 DOI: 10.1002/anie.202117658] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 12/26/2021] [Indexed: 11/07/2022]
Abstract
Analyzing complex single-nucleotide-polymorphism (SNP) combinations in the genome is important for research and clinical applications, given that different SNP combinations can generate different phenotypic consequences. Recent works have shown that DNA-based molecular computing is powerful for simultaneously sensing and analyzing complex molecular information. Here, we designed a switching circuit-based DNA computational scheme that can integrate the sensing of multiple SNPs and simultaneously perform logical analysis of the detected SNP information to directly report clinical outcomes. As a demonstration, we successfully achieved automatic and accurate identification of 21 different blood group genotypes from 83 clinical blood samples with 100 % accuracy compared to sequencing data in a more rapid manner (3 hours). Our method enables a new mode of automatic and logical sensing and analyzing subtle molecular information for clinical diagnosis, as well as guiding personalized medication.
Affiliation(s)
- Chao Zhang
- Institute of Molecular Medicine, Shanghai Key Laboratory for Nucleic Acid Chemistry and Nanomedicine, State Key Laboratory of Oncogenes and Related Genes, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, 200127, China
- Tingting Zheng
- Institute of Molecular Medicine, Shanghai Key Laboratory for Nucleic Acid Chemistry and Nanomedicine, State Key Laboratory of Oncogenes and Related Genes, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, 200127, China
- Qian Ma
- Institute of Molecular Medicine, Shanghai Key Laboratory for Nucleic Acid Chemistry and Nanomedicine, State Key Laboratory of Oncogenes and Related Genes, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, 200127, China
- Linlin Yang
- Institute of Molecular Medicine, Shanghai Key Laboratory for Nucleic Acid Chemistry and Nanomedicine, State Key Laboratory of Oncogenes and Related Genes, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, 200127, China
- Mingzhi Zhang
- Institute of Molecular Medicine, Shanghai Key Laboratory for Nucleic Acid Chemistry and Nanomedicine, State Key Laboratory of Oncogenes and Related Genes, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, 200127, China
- Junyan Wang
- Institute of Molecular Medicine, Shanghai Key Laboratory for Nucleic Acid Chemistry and Nanomedicine, State Key Laboratory of Oncogenes and Related Genes, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, 200127, China
- Xiaoyan Teng
- Department of Laboratory Medicine, Shanghai Jiao Tong University Affiliated Sixth People's Hospital, Shanghai, 201306, China
- Yanyan Miao
- Institute of Molecular Medicine, Shanghai Key Laboratory for Nucleic Acid Chemistry and Nanomedicine, State Key Laboratory of Oncogenes and Related Genes, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, 200127, China
- Hsiao-Chu Lin
- Institute of Molecular Medicine, Shanghai Key Laboratory for Nucleic Acid Chemistry and Nanomedicine, State Key Laboratory of Oncogenes and Related Genes, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, 200127, China
- Yang Yang
- Department of Thoracic Surgery, Shanghai Pulmonary Hospital, Tongji University School of Medicine, Shanghai, 200433, China
- Da Han
- Institute of Molecular Medicine, Shanghai Key Laboratory for Nucleic Acid Chemistry and Nanomedicine, State Key Laboratory of Oncogenes and Related Genes, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, 200127, China
5
Zhang C, Zheng T, Ma Q, Yang L, Zhang M, Wang J, Teng X, Miao Y, Lin H, Yang Y, Han D. Logical Analysis of Multiple Single‐Nucleotide‐Polymorphisms with Programmable DNA Molecular Computation for Clinical Diagnostics. Angew Chem Int Ed Engl 2022. [DOI: 10.1002/ange.202117658] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/05/2022]
Affiliation(s)
- Chao Zhang
- Institute of Molecular Medicine, Shanghai Key Laboratory for Nucleic Acid Chemistry and Nanomedicine, State Key Laboratory of Oncogenes and Related Genes, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, 200127, China
- Tingting Zheng
- Institute of Molecular Medicine, Shanghai Key Laboratory for Nucleic Acid Chemistry and Nanomedicine, State Key Laboratory of Oncogenes and Related Genes, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, 200127, China
- Qian Ma
- Institute of Molecular Medicine, Shanghai Key Laboratory for Nucleic Acid Chemistry and Nanomedicine, State Key Laboratory of Oncogenes and Related Genes, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, 200127, China
- Linlin Yang
- Institute of Molecular Medicine, Shanghai Key Laboratory for Nucleic Acid Chemistry and Nanomedicine, State Key Laboratory of Oncogenes and Related Genes, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, 200127, China
- Mingzhi Zhang
- Institute of Molecular Medicine, Shanghai Key Laboratory for Nucleic Acid Chemistry and Nanomedicine, State Key Laboratory of Oncogenes and Related Genes, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, 200127, China
- Junyan Wang
- Institute of Molecular Medicine, Shanghai Key Laboratory for Nucleic Acid Chemistry and Nanomedicine, State Key Laboratory of Oncogenes and Related Genes, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, 200127, China
- Xiaoyan Teng
- Department of Laboratory Medicine, Shanghai Jiao Tong University Affiliated Sixth People's Hospital, Shanghai, 201306, China
- Yanyan Miao
- Institute of Molecular Medicine, Shanghai Key Laboratory for Nucleic Acid Chemistry and Nanomedicine, State Key Laboratory of Oncogenes and Related Genes, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, 200127, China
- Hsiao‐chu Lin
- Institute of Molecular Medicine, Shanghai Key Laboratory for Nucleic Acid Chemistry and Nanomedicine, State Key Laboratory of Oncogenes and Related Genes, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, 200127, China
- Yang Yang
- Department of Thoracic Surgery, Shanghai Pulmonary Hospital, Tongji University School of Medicine, Shanghai, 200433, China
- Da Han
- Institute of Molecular Medicine, Shanghai Key Laboratory for Nucleic Acid Chemistry and Nanomedicine, State Key Laboratory of Oncogenes and Related Genes, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, 200127, China
6
Zanti M, Michailidou K, Loizidou MA, Machattou C, Pirpa P, Christodoulou K, Spyrou GM, Kyriacou K, Hadjisavvas A. Performance evaluation of pipelines for mapping, variant calling and interval padding, for the analysis of NGS germline panels. BMC Bioinformatics 2021; 22:218. [PMID: 33910496 PMCID: PMC8080428 DOI: 10.1186/s12859-021-04144-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/16/2021] [Accepted: 04/15/2021] [Indexed: 11/10/2022] Open
Abstract
Background: Next-generation sequencing (NGS) represents a significant advancement in clinical genetics. However, its use creates several technical, data interpretation and management challenges. It is essential to follow a consistent data analysis pipeline to achieve the highest possible accuracy and avoid false variant calls. Herein, we aimed to compare the performance of twenty-eight combinations of NGS data analysis pipeline compartments, including short-read mapping (BWA-MEM, Bowtie2, Stampy), variant calling (GATK-HaplotypeCaller, GATK-UnifiedGenotyper, SAMtools) and interval padding (null, 50 bp, 100 bp) methods, along with a commercially available pipeline (BWA Enrichment, Illumina®). Fourteen germline DNA samples from breast cancer patients were sequenced using a targeted NGS panel approach and subjected to data analysis. Results: We highlight that interval padding is required for the accurate detection of intronic variants including spliceogenic pathogenic variants (PVs). In addition, using nearly default parameters, the BWA Enrichment algorithm failed to detect these spliceogenic PVs and a missense PV in the TP53 gene. We also recommend the BWA-MEM algorithm for sequence alignment, whereas variant calling should be performed using a combination of variant calling algorithms: GATK-HaplotypeCaller and SAMtools for the accurate detection of insertions/deletions, and GATK-UnifiedGenotyper for the efficient detection of single nucleotide variant calls. Conclusions: These findings have important implications towards the identification of clinically actionable variants through panel testing in a clinical laboratory setting, when dedicated bioinformatics personnel might not always be available. The results also reveal the necessity of improving the existing tools and/or at the same time developing new pipelines to generate more reliable and more consistent data.
Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04144-1.
Affiliation(s)
- Maria Zanti
- Department of Electron Microscopy/Molecular Pathology, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus; Cyprus School of Molecular Medicine, 2371, Nicosia, Cyprus; Bioinformatics Department, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus
- Kyriaki Michailidou
- Cyprus School of Molecular Medicine, 2371, Nicosia, Cyprus; Biostatistics Unit, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus
- Maria A Loizidou
- Department of Electron Microscopy/Molecular Pathology, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus; Cyprus School of Molecular Medicine, 2371, Nicosia, Cyprus
- Christina Machattou
- Department of Electron Microscopy/Molecular Pathology, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus
- Panagiota Pirpa
- Department of Electron Microscopy/Molecular Pathology, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus
- Kyproula Christodoulou
- Cyprus School of Molecular Medicine, 2371, Nicosia, Cyprus; Neurogenetics Department, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus
- George M Spyrou
- Cyprus School of Molecular Medicine, 2371, Nicosia, Cyprus; Bioinformatics Department, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus
- Kyriacos Kyriacou
- Department of Electron Microscopy/Molecular Pathology, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus; Cyprus School of Molecular Medicine, 2371, Nicosia, Cyprus
- Andreas Hadjisavvas
- Department of Electron Microscopy/Molecular Pathology, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus; Cyprus School of Molecular Medicine, 2371, Nicosia, Cyprus.
7
Quinodoz M, Peter VG, Bedoni N, Royer Bertrand B, Cisarova K, Salmaninejad A, Sepahi N, Rodrigues R, Piran M, Mojarrad M, Pasdar A, Ghanbari Asad A, Sousa AB, Coutinho Santos L, Superti-Furga A, Rivolta C. AutoMap is a high performance homozygosity mapping tool using next-generation sequencing data. Nat Commun 2021; 12:518. [PMID: 33483490 PMCID: PMC7822856 DOI: 10.1038/s41467-020-20584-4] [Citation(s) in RCA: 76] [Impact Index Per Article: 25.3] [Received: 03/17/2020] [Accepted: 12/09/2020] [Indexed: 12/11/2022] Open
Abstract
Homozygosity mapping is a powerful method for identifying mutations in patients with recessive conditions, especially in consanguineous families or isolated populations. Historically, it has been used in conjunction with genotypes from highly polymorphic markers, such as DNA microsatellites or common SNPs. Traditional software performs rather poorly with data from Whole Exome Sequencing (WES) and Whole Genome Sequencing (WGS), which are now extensively used in medical genetics. We develop AutoMap, a tool that is both web-based and downloadable, to allow performing homozygosity mapping directly on VCF (Variant Call Format) calls from WES or WGS projects. Following a training step on WES data from 26 consanguineous families and a validation procedure on a matched cohort, our method shows higher overall performance when compared with eight existing tools. Most importantly, when tested on real cases with negative molecular diagnosis from an internal set, AutoMap detects three gene-disease and multiple variant-disease associations that were previously unrecognized, projecting clear benefits for both molecular diagnosis and research activities in medical genetics. Homozygosity mapping is a useful tool for identifying candidate mutations in recessive conditions; however, application to next generation sequencing data has been sub-optimal. Here, the authors present AutoMap, which efficiently identifies runs of homozygosity in whole exome/genome sequencing data.
Affiliation(s)
- Mathieu Quinodoz
- Institute of Molecular and Clinical Ophthalmology Basel (IOB), Basel, Switzerland; Department of Ophthalmology, University of Basel, Basel, Switzerland; Department of Genetics and Genome Biology, University of Leicester, Leicester, UK
- Virginie G Peter
- Institute of Molecular and Clinical Ophthalmology Basel (IOB), Basel, Switzerland; Department of Ophthalmology, University of Basel, Basel, Switzerland; Department of Genetics and Genome Biology, University of Leicester, Leicester, UK; Institute of Experimental Pathology, Lausanne University Hospital (CHUV), Lausanne, Switzerland
- Nicola Bedoni
- Service of Medical Genetics, Lausanne University Hospital (CHUV), Lausanne, Switzerland
- Béryl Royer Bertrand
- Service of Medical Genetics, Lausanne University Hospital (CHUV), Lausanne, Switzerland
- Katarina Cisarova
- Service of Medical Genetics, Lausanne University Hospital (CHUV), Lausanne, Switzerland
- Arash Salmaninejad
- Department of Medical Genetics, Faculty of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran
- Neda Sepahi
- Noncommunicable Diseases Research Center, Fasa University of Sciences, Fasa, Iran
- Raquel Rodrigues
- Department of Medical Genetics, Hospital Santa Maria, Centro Hospitalar Universitário Lisboa Norte (CHULN), Lisbon Academic Medical Center (CAML), Lisbon, Portugal
- Mehran Piran
- Noncommunicable Diseases Research Center, Fasa University of Sciences, Fasa, Iran; Bioinformatics and Computational Biology Research Center, Shiraz University of Medical Sciences, Shiraz, Iran
- Majid Mojarrad
- Department of Medical Genetics, Faculty of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran
- Alireza Pasdar
- Department of Medical Genetics, Faculty of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran; Division of Applied Medicine, Medical School, University of Aberdeen, Aberdeen, UK
- Ali Ghanbari Asad
- Noncommunicable Diseases Research Center, Fasa University of Sciences, Fasa, Iran
- Ana Berta Sousa
- Department of Medical Genetics, Hospital Santa Maria, Centro Hospitalar Universitário Lisboa Norte (CHULN), Lisbon Academic Medical Center (CAML), Lisbon, Portugal; Medical Faculty, Lisbon University, Lisbon, Portugal
- Andrea Superti-Furga
- Service of Medical Genetics, Lausanne University Hospital (CHUV), Lausanne, Switzerland
- Carlo Rivolta
- Institute of Molecular and Clinical Ophthalmology Basel (IOB), Basel, Switzerland; Department of Ophthalmology, University of Basel, Basel, Switzerland; Department of Genetics and Genome Biology, University of Leicester, Leicester, UK.
8
Molina-Mora JA, Solano-Vargas M. Set-theory based benchmarking of three different variant callers for targeted sequencing. BMC Bioinformatics 2021; 22:20. [PMID: 33413082 PMCID: PMC7791862 DOI: 10.1186/s12859-020-03926-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/24/2019] [Accepted: 12/09/2020] [Indexed: 12/05/2022] Open
Abstract
Background: Next generation sequencing (NGS) technologies have improved the study of hereditary diseases. Since the evaluation of bioinformatics pipelines is not straightforward, NGS demands effective strategies to analyze data that is of paramount relevance for decision making under a clinical scenario. According to the benchmarking framework of the Global Alliance for Genomics and Health (GA4GH), we implemented a new simple and user-friendly set-theory based method to assess variant callers using a gold standard variant set and high confidence regions. As a model, we used TruSight Cardio kit sequencing data of the reference genome NA12878. This targeted sequencing kit is used to identify variants in key genes related to Inherited Cardiac Conditions (ICCs), a group of cardiovascular diseases with high rates of morbidity and mortality. Results: We implemented and compared three variant calling pipelines (Isaac, Freebayes, and VarScan). Performance metrics using our set-theory approach showed high-resolution pipelines and revealed: (1) a perfect recall of 1.000 for all three pipelines, (2) very high precision values, i.e. 0.987 for Freebayes, 0.928 for VarScan, and 1.000 for Isaac, when compared with the reference material, and (3) a ROC curve analysis with AUC > 0.94 for all cases. Moreover, significant differences were obtained between the three pipelines. In general, results indicate that the three pipelines were able to recognize the expected variants in the gold standard data set. Conclusions: Our set-theory approach to calculate metrics was able to identify the expected ICCs related variants by the three selected pipelines, but results were completely dependent on the algorithms. We emphasize the importance of assessing pipelines using gold standard materials to achieve the most reliable results for clinical application.
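The set-theory idea behind this kind of benchmarking can be sketched in a few lines of Python: treat the caller's output and the gold standard as sets of variants, then read true positives, false positives and false negatives from set intersection and difference. The variant key `(chrom, pos, ref, alt)` and the function name are assumptions made for this example. The sketch also omits the restriction to high-confidence regions that the abstract mentions, so it is not the authors' implementation.

```python
def benchmark_calls(called, truth):
    """Compare two sets of variants keyed as (chrom, pos, ref, alt) tuples."""
    true_pos = called & truth   # reported and present in the gold standard
    false_pos = called - truth  # reported but absent from the gold standard
    false_neg = truth - called  # present in the gold standard but missed

    precision = len(true_pos) / len(called) if called else 0.0
    recall = len(true_pos) / len(truth) if truth else 0.0
    return {
        "tp": len(true_pos),
        "fp": len(false_pos),
        "fn": len(false_neg),
        "precision": precision,
        "recall": recall,
    }
```

Because the comparison is by exact key, differently normalized representations of the same indel would count as a mismatch; real benchmarking tools normalize variants before comparing them.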
Affiliation(s)
- Jose Arturo Molina-Mora
- Centro de Investigación en Enfermedades Tropicales (CIET) and Facultad de Microbiología, Universidad de Costa Rica (UCR), San José, Costa Rica; Centro de Investigaciones en Hematología y Transtornos Afines (CIHATA), Universidad de Costa Rica (UCR), San José, Costa Rica.
- Mariela Solano-Vargas
- Centro de Investigaciones en Hematología y Transtornos Afines (CIHATA), Universidad de Costa Rica (UCR), San José, Costa Rica
9
Alosaimi S, van Biljon N, Awany D, Thami PK, Defo J, Mugo JW, Bope CD, Mazandu GK, Mulder NJ, Chimusa ER. Simulation of African and non-African low and high coverage whole genome sequence data to assess variant calling approaches. Brief Bioinform 2020; 22:6042242. [PMID: 33341897 DOI: 10.1093/bib/bbaa366] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 08/25/2020] [Revised: 11/14/2020] [Accepted: 01/08/2020] [Indexed: 12/15/2022] Open
Abstract
Current variant calling (VC) approaches have been designed to leverage populations of long-range haplotypes and were benchmarked using populations of European descent, whereas most genetic diversity is found in non-European populations, such as those of Africa. When working with these genetically diverse populations, VC tools may produce false positive and false negative results, which may lead to misleading conclusions in the prioritization of mutations, clinical relevancy and actionability of genes. The most prominent question is which tool or pipeline has a high rate of sensitivity and precision when analysing African data with either low or high sequence coverage, given the high genetic diversity and heterogeneity of this data. Here, a total of 100 synthetic Whole Genome Sequencing (WGS) samples, mimicking the genetic profile of African and European subjects for different specific coverage levels (high/low), have been generated to assess the performance of nine different VC tools on these contrasting datasets. The performances of these tools were assessed in terms of false positive and false negative call rates by comparing the simulated golden variants to the variants identified by each VC tool. Combining our results on sensitivity and positive predictive value (PPV), VarDict [PPV = 0.999 and Matthews correlation coefficient (MCC) = 0.832] and BCFtools (PPV = 0.999 and MCC = 0.813) perform best when using African population data on high and low coverage data. Overall, current VC tools produce high false positive and false negative rates when analysing African compared with European data. This highlights the need for development of VC approaches with high sensitivity and precision tailored for populations characterized by high genetic variations and low linkage disequilibrium.
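For readers unfamiliar with the metrics quoted in this abstract, the following sketch shows how PPV, sensitivity and MCC are conventionally derived from confusion-matrix counts. The function name `call_metrics` and the counts are hypothetical illustrations; they are not taken from the study, and how true negatives are counted for variant calling is an assumption that each benchmark has to define.

```python
import math


def call_metrics(tp, fp, fn, tn):
    """Return (ppv, sensitivity, mcc) from confusion-matrix counts."""
    ppv = tp / (tp + fp) if (tp + fp) else 0.0          # positive predictive value
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # recall / true positive rate
    # The MCC denominator is zero when any marginal total is zero.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return ppv, sensitivity, mcc


# Hypothetical counts for one caller on one simulated sample.
ppv, sensitivity, mcc = call_metrics(tp=90, fp=10, fn=10, tn=890)
print(f"PPV={ppv:.3f} sensitivity={sensitivity:.3f} MCC={mcc:.3f}")
```

Because MCC uses all four cells of the confusion matrix, a caller can reach a PPV near 1 and still have a noticeably lower MCC when it misses many true variants, which is consistent with the PPV and MCC pairs reported above.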
Affiliation(s)
- Shatha Alosaimi
- Faculty of Health Sciences, Division of Human Genetics, Department of Pathology, University of Cape Town, Cape Town, South Africa
| | - Noëlle van Biljon
- Department of Statistical Sciences, University of Cape Town, Cape Town, South Africa
| | - Denis Awany
- Faculty of Health Sciences, Division of Human Genetics, Department of Pathology, University of Cape Town, Cape Town, South Africa
| | - Prisca K Thami
- Faculty of Health Sciences, Division of Human Genetics, Department of Pathology, University of Cape Town, Cape Town, South Africa
| | - Joel Defo
- Faculty of Health Sciences, Division of Human Genetics, Department of Pathology, University of Cape Town, Cape Town, South Africa
| | - Jacquiline W Mugo
- Faculty of Health Sciences, Division of Computational Biology, Department of Biomedical Sciences, University of Cape Town, Cape Town, South Africa
| | - Christian D Bope
- Faculty of Sciences, Department of Mathematics and Computer Science, University of Kinshasa, Kinshasa, DRC
| | - Gaston K Mazandu
- Faculty of Health Sciences, Division of Human Genetics, Department of Pathology, University of Cape Town, Cape Town, South Africa.,Faculty of Health Sciences, Division of Computational Biology, Department of Biomedical Sciences, University of Cape Town, Cape Town, South Africa
| | - Nicola J Mulder
- Faculty of Health Sciences, Division of Computational Biology, Department of Biomedical Sciences, University of Cape Town, Cape Town, South Africa.,Institute of Infectious Disease and Molecular Medicine, University of Cape Town, Anzio Road, Observatory, Cape Town 7925, South Africa
| | - Emile R Chimusa
- Faculty of Health Sciences, Division of Human Genetics, Department of Pathology, University of Cape Town, Cape Town, South Africa.,Institute of Infectious Disease and Molecular Medicine, University of Cape Town, Anzio Road, Observatory, Cape Town 7925, South Africa
|
10
|
DeepVariant-on-Spark: Small-Scale Genome Analysis Using a Cloud-Based Computing Framework. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2020; 2020:7231205. [PMID: 32952600 PMCID: PMC7481958 DOI: 10.1155/2020/7231205] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 04/30/2020] [Revised: 08/15/2020] [Accepted: 08/21/2020] [Indexed: 12/18/2022]
Abstract
Although sequencing a human genome has become affordable, identifying genetic variants from whole-genome sequence data is still a hurdle for researchers without adequate computing equipment or bioinformatics support. GATK is a gold standard method for the identification of genetic variants and has been widely used in genome projects and population genetic studies for many years. This was until the Google Brain team developed a new method, DeepVariant, which utilizes deep neural networks to construct an image classification model to identify genetic variants. However, the superior accuracy of DeepVariant comes at the cost of computational intensity, largely constraining its applications. Accordingly, we present DeepVariant-on-Spark to optimize resource allocation, enable multi-GPU support, and accelerate the processing of the DeepVariant pipeline. To make DeepVariant-on-Spark more accessible to everyone, we have deployed DeepVariant-on-Spark to the Google Cloud Platform (GCP). Users can deploy DeepVariant-on-Spark on the GCP following our instructions within 20 minutes and start to analyze at least ten whole-genome sequencing datasets using free credits provided by the GCP. DeepVariant-on-Spark is freely available for small-scale genome analysis using a cloud-based computing framework, which is suitable for pilot testing or preliminary study, while reserving the flexibility and scalability for large-scale sequencing projects.
|
11
|
Kumaran M, Subramanian U, Devarajan B. Performance assessment of variant calling pipelines using human whole exome sequencing and simulated data. BMC Bioinformatics 2019; 20:342. [PMID: 31208315 PMCID: PMC6580603 DOI: 10.1186/s12859-019-2928-9] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/14/2018] [Accepted: 05/31/2019] [Indexed: 12/30/2022] Open
Abstract
Background Whole exome sequencing (WES) is a cost-effective method that identifies clinical variants but it demands accurate variant caller tools. Currently available tools have variable accuracy in predicting specific clinical variants. But it may be possible to find the best combination of aligner-variant caller tools for detecting accurate single nucleotide variants (SNVs) and small insertion and deletion (InDels) separately. Moreover, many important aspects of InDel detection are overlooked while comparing the performance of tools, particularly its base pair length. Results We assessed the performance of variant calling pipelines using the combinations of four variant callers and five aligners on human NA12878 and simulated exome data. We used high confidence variant calls from Genome in a Bottle (GiaB) consortium for validation, and GRCh37 and GRCh38 as the human reference genome. Based on the performance metrics, both BWA and Novoalign aligners performed better with DeepVariant and SAMtools callers for detecting SNVs, and with DeepVariant and GATK for InDels. Furthermore, we obtained similar results on human NA24385 and NA24631 exome data from GiaB. Conclusion In this study, DeepVariant with BWA and Novoalign performed best for detecting accurate SNVs and InDels. The accuracy of variant calling was improved by merging the top performing pipelines. The results of our study provide useful recommendations for analysis of WES data in clinical genomics. Electronic supplementary material The online version of this article (10.1186/s12859-019-2928-9) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Manojkumar Kumaran
- Department of Bioinformatics, Aravind Medical Research Foundation, Madurai, Tamil Nadu, 625020, India.,School of Chemical and Biotechnology, SASTRA (Deemed to be University), Thanjavur, Tamil Nadu, 613401, India
| | - Umadevi Subramanian
- Department of Bioinformatics, Aravind Medical Research Foundation, Madurai, Tamil Nadu, 625020, India
| | - Bharanidharan Devarajan
- Department of Bioinformatics, Aravind Medical Research Foundation, Madurai, Tamil Nadu, 625020, India.
|
12
|
Gonda I, Ashrafi H, Lyon DA, Strickler SR, Hulse-Kemp AM, Ma Q, Sun H, Stoffel K, Powell AF, Futrell S, Thannhauser TW, Fei Z, Van Deynze AE, Mueller LA, Giovannoni JJ, Foolad MR. Sequencing-Based Bin Map Construction of a Tomato Mapping Population, Facilitating High-Resolution Quantitative Trait Loci Detection. THE PLANT GENOME 2019; 12:180010. [PMID: 30951101 DOI: 10.3835/plantgenome2018.02.0010] [Citation(s) in RCA: 43] [Impact Index Per Article: 8.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/21/2023]
Abstract
Genotyping-by-sequencing (GBS) was employed to construct a highly saturated genetic linkage map of a tomato ( L.) recombinant inbred line (RIL) population, derived from a cross between cultivar NC EBR-1 and the wild tomato L. accession LA2093. A pipeline was developed to convert single nucleotide polymorphism (SNP) data into genomic bins, which could be used for fine mapping of quantitative trait loci (QTL) and identification of candidate genes. The pipeline, implemented in a python script named SNPbinner, adopts a hidden Markov model approach for calculation of recombination breakpoints followed by genomic bins construction. The total length of the newly developed high-resolution genetic map was 1.2-fold larger than previously estimated based on restriction fragment length polymorphism (RFLP) and polymerase chain reaction (PCR)-based markers. The map was used to verify and refine QTL previously identified for two fruit quality traits in the RIL population, fruit weight (FW) and fruit lycopene content (LYC). Two well-described FW QTL ( and ) were localized precisely at their known underlying causative genes, and the QTL intervals were decreased by two- to tenfold. A major QTL for LYC content () was verified at high resolution and its underlying causative gene was determined to be ζ (). The RIL population, the high resolution genetic map, and the easy-to-use genotyping pipeline, SNPbinner, are made publicly available.
|
13
|
Dharanipragada P, Seelam SR, Parekh N. SeqVItA: Sequence Variant Identification and Annotation Platform for Next Generation Sequencing Data. Front Genet 2018; 9:537. [PMID: 30487811 PMCID: PMC6247818 DOI: 10.3389/fgene.2018.00537] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/25/2018] [Accepted: 10/23/2018] [Indexed: 12/20/2022] Open
Abstract
The current trend in clinical data analysis is to understand how individuals respond to therapies and drug interactions based on their genetic makeup. This has led to a paradigm shift in healthcare; caring for patients is now 99% information and 1% intervention. Reducing costs of next generation sequencing (NGS) technologies has made it possible to take genetic profiling to the clinical setting. This requires not just fast and accurate algorithms for variant detection, but also a knowledge-base for variant annotation and prioritization to facilitate tailored therapeutics based on an individual's genetic profile. Here we show that it is possible to provide a fast and easy access to all possible information about a variant and its impact on the gene, its protein product, associated pathways and drug-variant interactions by integrating previously reported knowledge from various databases. With this objective, we have developed a pipeline, Sequence Variants Identification and Annotation (SeqVItA) that provides end-to-end solution for small sequence variants detection, annotation and prioritization on a single platform. Parallelization of the variant detection step and with numerous resources incorporated to infer functional impact, clinical relevance and drug-variant associations, SeqVItA will benefit the clinical and research communities alike. Its open-source platform and modular framework allows for easy customization of the workflow depending on the data type (single, paired, or pooled samples), variant type (germline and somatic), and variant annotation and prioritization. Performance comparison of SeqVItA on simulated data and detection, interpretation and analysis of somatic variants on real data (24 liver cancer patients) is carried out. We demonstrate the efficacy of annotation module in facilitating personalized medicine based on patient's mutational landscape. SeqVItA is freely available at https://bioinf.iiit.ac.in/seqvita.
Affiliation(s)
- Prashanthi Dharanipragada
- Center for Computational Natural Science and Bioinformatics, International Institute of Information Technology, Hyderabad, India
| | - Sampreeth Reddy Seelam
- Center for Computational Natural Science and Bioinformatics, International Institute of Information Technology, Hyderabad, India
| | - Nita Parekh
- Center for Computational Natural Science and Bioinformatics, International Institute of Information Technology, Hyderabad, India
|
14
|
Zhang C, Liu X, Yao Y, Liu K, Hui W, Zhu J, Dou Y, Hua K, Peng M, Wang Z, Vermorken AJM, Cui Y. Genotyping of Multiple Clinical Samples with a Combined Direct PCR and Magnetic Lateral Flow Assay. iScience 2018; 7:170-179. [PMID: 30245369 PMCID: PMC6153416 DOI: 10.1016/j.isci.2018.09.005] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/17/2018] [Revised: 08/19/2018] [Accepted: 09/05/2018] [Indexed: 02/09/2023] Open
Abstract
Developing a sensitive, low-cost, and easy-to-use point-of-care testing system for genotyping is important for informing treatment decisions and predicting the risk of underlying diseases. Conventional methods normally require complex operational procedures as well as expensive and sophisticated instruments. Here, we report a general approach that enables us to detect the genotype of multiple sample types directly without DNA purification. Moreover, the PCR results can be further quantitatively analyzed based on a magnetic lateral flow assay (MLFA) system, which avoids multiple steps needed for conventional nucleic acid biosensors. As a demonstration, we show that three genotypes of aldehyde dehydrogenase 2 (ALDH2) can be identified using a small volume of sample with an accuracy of 100% and a sensitivity of 1.0 × 10² cells/μL, which are better than those of the gold standard methods. We believe that the direct PCR-MLFA system represents a significant advance toward the development of portable, sensitive biomedical platforms.
Affiliation(s)
- Chao Zhang
- College of Life Sciences, Northwest University, Xi'an, China
| | - Xiaonan Liu
- College of Life Sciences, Northwest University, Xi'an, China
| | - Yao Yao
- Shaanxi Provincial Engineering Research Center of Nano-Biomedical Detection, Xi'an, China
| | - Kewu Liu
- College of Life Sciences, Northwest University, Xi'an, China
| | - Wenli Hui
- College of Life Sciences, Northwest University, Xi'an, China
| | - Juanli Zhu
- Shaanxi Provincial Engineering Research Center of Nano-Biomedical Detection, Xi'an, China
| | - Yaling Dou
- Department of Clinical Laboratory, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences, Beijing, China
| | - Kai Hua
- College of Life Sciences, Northwest University, Xi'an, China
| | - Mingli Peng
- Shaanxi Provincial Engineering Research Center of Nano-Biomedical Detection, Xi'an, China
| | - Zuankai Wang
- Department of Mechanical Engineering, City University of Hong Kong, Hong Kong, China.
| | | | - Yali Cui
- College of Life Sciences, Northwest University, Xi'an, China; Shaanxi Provincial Engineering Research Center of Nano-Biomedical Detection, Xi'an, China.
|
15
|
Smith SD, Kawash JK, Grigoriev A. Lightning-fast genome variant detection with GROM. Gigascience 2018; 6:1-7. [PMID: 29048532 PMCID: PMC5737730 DOI: 10.1093/gigascience/gix091] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/07/2017] [Accepted: 09/13/2017] [Indexed: 12/30/2022] Open
Abstract
Current human whole genome sequencing projects produce massive amounts of data, often creating significant computational challenges. Different approaches have been developed for each type of genome variant and method of its detection, necessitating users to run multiple algorithms to find variants. We present Genome Rearrangement OmniMapper (GROM), a novel comprehensive variant detection algorithm accepting aligned read files as input and finding SNVs, indels, structural variants (SVs), and copy number variants (CNVs). We show that GROM outperforms state-of-the-art methods on 7 validated benchmarks using 2 whole genome sequencing (WGS) data sets. Additionally, GROM boasts lightning-fast run times, analyzing a 50× WGS human data set (NA12878) on commonly available computer hardware in 11 minutes, more than an order of magnitude (up to 72 times) faster than tools detecting a similar range of variants. Addressing the needs of big data analysis, GROM combines in 1 algorithm SNV, indel, SV, and CNV detection, providing superior speed, sensitivity, and precision. GROM is also able to detect CNVs, SNVs, and indels in non-paired-read WGS libraries, as well as SNVs and indels in whole exome or RNA sequencing data sets.
Affiliation(s)
- Sean D Smith
- Department of Biology, Center for Computational and Integrative Biology, Rutgers University, 315 Penn St, Camden 08102, NJ, USA
| | - Joseph K Kawash
- Department of Biology, Center for Computational and Integrative Biology, Rutgers University, 315 Penn St, Camden 08102, NJ, USA
| | - Andrey Grigoriev
- Department of Biology, Center for Computational and Integrative Biology, Rutgers University, 315 Penn St, Camden 08102, NJ, USA
|
16
|
Tuzov N. A framework for the estimation of the proportion of true discoveries in single nucleotide variant detection studies for human data. PLoS One 2018; 13:e0196058. [PMID: 29694377 PMCID: PMC5918994 DOI: 10.1371/journal.pone.0196058] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/23/2017] [Accepted: 04/05/2018] [Indexed: 12/30/2022] Open
Abstract
Any single nucleotide variant detection study could benefit from a fast and cheap method of measuring the quality of variant call list. It is advantageous to be able to see how the call list quality is affected by different variant filtering thresholds and other adjustments to the study parameters. Here we look into a possibility of estimating the proportion of true positives in a single nucleotide variant call list for human data. Using whole-exome and whole-genome gold standard data sets for training, we focus on building a generic model that only relies on information available from any variant caller. We assess and compare the performance of different candidate models based on their practical accuracy. We find that the generic model delivers decent accuracy most of the time. Further, we conclude that its performance could be improved substantially by leveraging the variant quality metrics that are specific to each variant calling tool.
Affiliation(s)
- Nik Tuzov
- Partek Incorporated, Saint Louis, Missouri, United States of America
|
17
|
Takamatsu T, Baslam M, Inomata T, Oikawa K, Itoh K, Ohnishi T, Kinoshita T, Mitsui T. Optimized Method of Extracting Rice Chloroplast DNA for High-Quality Plastome Resequencing and de Novo Assembly. FRONTIERS IN PLANT SCIENCE 2018; 9:266. [PMID: 29541088 PMCID: PMC5835797 DOI: 10.3389/fpls.2018.00266] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/16/2023]
Abstract
Chloroplasts, which perform photosynthesis, are one of the most important organelles in green plants and algae. Chloroplasts maintain an independent genome that includes important genes encoding their photosynthetic machinery and various housekeeping functions. Owing to its non-recombinant nature, low mutation rates, and uniparental inheritance, the chloroplast genome (plastome) can give insights into plant evolution and ecology and in the development of biotechnological and breeding applications. However, efficient methods to obtain high-quality chloroplast DNA (cpDNA) are currently not available, impeding powerful sequencing and further functional genomics research. To investigate effects on rice chloroplast genome quality, we compared cpDNA extraction by three extraction protocols: liquid nitrogen coupled with sucrose density gradient centrifugation, high-salt buffer, and Percoll gradient centrifugation. The liquid nitrogen-sucrose gradient method gave a high yield of high-quality cpDNA with reliable purity. The cpDNA isolated by this technique was evaluated, resequenced, and assembled de novo to build a robust framework for genomic and genetic studies. Comparison of this high-purity cpDNA with total DNAs revealed the read coverage of the sequenced regions; next-generation sequencing data showed that the high-quality cpDNA eliminated noise derived from contamination by nuclear and mitochondrial DNA, which frequently occurs in total DNA. The assembly process produced highly accurate, long contigs. We summarize the extent to which this improved method of isolating cpDNA from rice can provide practical progress in overcoming challenges related to chloroplast genomes and in further exploring the development of new sequencing technologies.
Affiliation(s)
- Takeshi Takamatsu
- Department of Life and Food Sciences, Graduate School of Science and Technology, Niigata University, Niigata, Japan
- Laboratory of Biochemistry, Faculty of Agriculture, Niigata University, Niigata, Japan
| | - Marouane Baslam
- Laboratory of Biochemistry, Faculty of Agriculture, Niigata University, Niigata, Japan
| | - Takuya Inomata
- Department of Life and Food Sciences, Graduate School of Science and Technology, Niigata University, Niigata, Japan
| | - Kazusato Oikawa
- Laboratory of Biochemistry, Faculty of Agriculture, Niigata University, Niigata, Japan
| | - Kimiko Itoh
- Department of Life and Food Sciences, Graduate School of Science and Technology, Niigata University, Niigata, Japan
- Laboratory of Biochemistry, Faculty of Agriculture, Niigata University, Niigata, Japan
| | - Takayuki Ohnishi
- Center for Education and Research of Community Collaboration, Utsunomiya University, Utsunomiya, Japan
| | - Tetsu Kinoshita
- Kihara Institute for Biological Research, Yokohama City University, Yokohama, Japan
| | - Toshiaki Mitsui
- Department of Life and Food Sciences, Graduate School of Science and Technology, Niigata University, Niigata, Japan
- Laboratory of Biochemistry, Faculty of Agriculture, Niigata University, Niigata, Japan
- *Correspondence: Toshiaki Mitsui
|
18
|
Ma T, Zhang A. Omics Informatics: From Scattered Individual Software Tools to Integrated Workflow Management Systems. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2017; 14:926-946. [PMID: 26930689 DOI: 10.1109/tcbb.2016.2535251] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
Omic data analyses pose great informatics challenges. As an emerging subfield of bioinformatics, omics informatics focuses on analyzing multi-omic data efficiently and effectively, and is gaining momentum. There are two underlying trends in the expansion of omics informatics landscape: the explosion of scattered individual omics informatics tools with each of which focuses on a specific task in both single- and multi- omic settings, and the fast-evolving integrated software platforms such as workflow management systems that can assemble multiple tools into pipelines and streamline integrative analysis for complicated tasks. In this survey, we give a holistic view of omics informatics, from scattered individual informatics tools to integrated workflow management systems. We not only outline the landscape and challenges of omics informatics, but also sample a number of widely used and cutting-edge algorithms in omics data analysis to give readers a fine-grained view. We survey various workflow management systems (WMSs), classify them into three levels of WMSs from simple software toolkits to integrated multi-omic analytical platforms, and point out the emerging needs for developing intelligent workflow management systems. We also discuss the challenges, strategies and some existing work in systematic evaluation of omics informatics tools. We conclude by providing future perspectives of emerging fields and new frontiers in omics informatics.
|
19
|
|
20
|
Pantazatos SP, Huang YY, Rosoklija GB, Dwork AJ, Arango V, Mann JJ. Whole-transcriptome brain expression and exon-usage profiling in major depression and suicide: evidence for altered glial, endothelial and ATPase activity. Mol Psychiatry 2017; 22:760-773. [PMID: 27528462 PMCID: PMC5313378 DOI: 10.1038/mp.2016.130] [Citation(s) in RCA: 142] [Impact Index Per Article: 20.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 05/11/2015] [Revised: 06/04/2016] [Accepted: 06/07/2016] [Indexed: 12/30/2022]
Abstract
Brain gene expression profiling studies of suicide and depression using oligonucleotide microarrays have often failed to distinguish these two phenotypes. Moreover, next generation sequencing approaches are more accurate in quantifying gene expression and can detect alternative splicing. Using RNA-seq, we examined whole-exome gene and exon expression in non-psychiatric controls (CON, N=29), DSM-IV major depressive disorder suicides (MDD-S, N=21) and MDD non-suicides (MDD, N=9) in the dorsal lateral prefrontal cortex (Brodmann Area 9) of sudden death medication-free individuals post mortem. Using small RNA-seq, we also examined miRNA expression (nine samples per group). DESeq2 identified 35 genes differentially expressed between groups and surviving adjustment for false discovery rate (adjusted P<0.1). In depression, altered genes include humanin-like-8 (MTRNRL8), interleukin-8 (IL8), and serpin peptidase inhibitor, clade H (SERPINH1) and chemokine ligand 4 (CCL4), while exploratory gene ontology (GO) analyses revealed lower expression of immune-related pathways such as chemokine receptor activity, chemotaxis and cytokine biosynthesis, and angiogenesis and vascular development in (adjusted P<0.1). Hypothesis-driven GO analysis suggests lower expression of genes involved in oligodendrocyte differentiation, regulation of glutamatergic neurotransmission, and oxytocin receptor expression in both suicide and depression, and provisional evidence for altered DNA-dependent ATPase expression in suicide only. DEXSeq analysis identified differential exon usage in ATPase, class II, type 9B (adjusted P<0.1) in depression. Differences in miRNA expression or structural gene variants were not detected. Results lend further support for models in which deficits in microglial, endothelial (blood-brain barrier), ATPase activity and astrocytic cell functions contribute to MDD and suicide, and identify putative pathways and mechanisms for further study in these disorders.
Affiliation(s)
- Spiro P. Pantazatos
- Molecular Imaging and Neuropathology Division, New York State Psychiatric Institute, New York, NY,Department of Psychiatry, New York, NY
| | - Yung-yu Huang
- Molecular Imaging and Neuropathology Division, New York State Psychiatric Institute, New York, NY,Department of Psychiatry, New York, NY
| | - Gorazd B. Rosoklija
- Molecular Imaging and Neuropathology Division, New York State Psychiatric Institute, New York, NY,Department of Psychiatry, New York, NY
| | | | - Victoria Arango
- Molecular Imaging and Neuropathology Division, New York State Psychiatric Institute, New York, NY,Department of Psychiatry, New York, NY
| | - J. John Mann
- Molecular Imaging and Neuropathology Division, New York State Psychiatric Institute, New York, NY; Department of Psychiatry, New York, NY (corresponding author)
|
21
|
Levano S, Gonzalez A, Singer M, Demougin P, Rüffert H, Urwyler A, Girard T. Resequencing array for gene variant detection in malignant hyperthermia and butyrylcholinestherase deficiency. Neuromuscul Disord 2017; 27:492-499. [PMID: 28259615 DOI: 10.1016/j.nmd.2017.02.008] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/25/2016] [Revised: 12/20/2016] [Accepted: 02/15/2017] [Indexed: 11/30/2022]
Abstract
Malignant hyperthermia (MH) and butyrylcholinestherase (BCHE) deficiency are two relevant pharmacogenetic disorders in anesthetic practice linked with sequence variants, the former in the RyR1 and CACNA1S genes, the latter in the BCHE gene. Genotyping for known pathogenic variants in these genes is useful to help identify susceptible individuals, and others may exist but remain unknown, because full-length sequence of these genes is, in general, not investigated. To facilitate this task, we developed a resequencing DNA array, the perioperative patient safety (POPS) array, to be able to screen the entire coding sequences of the RyR1, CACNA1S and BCHE genes. MH-susceptible individuals (n = 121) identified with the in vitro contracture test, the standard diagnostic tool for MH susceptibility, were genotyped with the arrays. Compared with capillary sequencing, call rates with the arrays could achieve 100% at maximal sensitivity, although to reduce false positive rates, sensitivity was adjusted to 0.85, 0.87 and 0.66 for RyR1, CACNA1S and BCHE respectively, with overall base call specificity exceeding 99%. Detection of 29 predetermined RyR1 variants in 44 individuals was successful in 97% of the cases, among them all 16 variants of established diagnostic value. In a trial application of the arrays, 21 MH-susceptible subjects with no known RyR1 or CACNA1S variants were screened, resulting in the discovery of new variants, all confirmed by capillary sequencing. In conclusion, arrays offer an efficient high-throughput alternative for diagnostic genotyping of candidate genes affecting MH susceptibility, BCHE deficiency and other neuromuscular disorders, simultaneously enabling a comprehensive search for rare variants in these genes.
Affiliation(s)
- Soledad Levano
- Department of Biomedicine, University Hospital Basel, Switzerland; Department Anesthesiology, University Hospital Basel, Switzerland
| | - Asensio Gonzalez
- Department of Biomedicine, University Hospital Basel, Switzerland; Department Anesthesiology, University Hospital Basel, Switzerland.
| | - Martine Singer
- Department of Biomedicine, University Hospital Basel, Switzerland; Department Anesthesiology, University Hospital Basel, Switzerland
| | - Philippe Demougin
- Biozentrum, Life Sciences Training Facility, University of Basel, Switzerland
| | - Henrik Rüffert
- University of Leipzig, Helios Kliniken Leipziger Land Leipzig, Germany
| | - Albert Urwyler
- Department of Biomedicine, University Hospital Basel, Switzerland; Department Anesthesiology, University Hospital Basel, Switzerland
| | - Thierry Girard
- Department of Biomedicine, University Hospital Basel, Switzerland; Department Anesthesiology, University Hospital Basel, Switzerland
| |
Collapse
|
22
|
PEMapper and PECaller provide a simplified approach to whole-genome sequencing. Proc Natl Acad Sci U S A 2017; 114:E1923-E1932. [PMID: 28223510 DOI: 10.1073/pnas.1618065114] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.7] [Indexed: 12/19/2022] Open
Abstract
The analysis of human whole-genome sequencing data presents significant computational challenges. The sheer size of datasets places an enormous burden on computational, disk array, and network resources. Here, we present an integrated computational package, PEMapper/PECaller, that was designed specifically to minimize the burden on networks and disk arrays, create output files that are minimal in size, and run in a highly computationally efficient way, with the single goal of enabling whole-genome sequencing at scale. In addition to improved computational efficiency, we implement a statistical framework that allows for a base-by-base error model, allowing this package to perform as well as or better than the widely used Genome Analysis Toolkit (GATK) in all key measures of performance on human whole-genome sequences.
|
23
|
Brumme CJ, Poon AFY. Promises and pitfalls of Illumina sequencing for HIV resistance genotyping. Virus Res 2016; 239:97-105. [PMID: 27993623 DOI: 10.1016/j.virusres.2016.12.008] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.9] [Received: 09/20/2016] [Revised: 12/15/2016] [Accepted: 12/15/2016] [Indexed: 12/13/2022]
Abstract
Genetic sequencing ("genotyping") plays a critical role in the modern clinical management of HIV infection. This virus evolves rapidly within patients because of its error-prone reverse transcriptase and short generation time. Consequently, HIV variants with mutations that confer resistance to one or more antiretroviral drugs can emerge during sub-optimal treatment. There are now multiple HIV drug resistance interpretation algorithms that take the region of the HIV genome encoding the major drug targets as inputs; expert use of these algorithms can significantly improve clinical outcomes in HIV treatment. Next-generation sequencing has the potential to revolutionize HIV resistance genotyping by lowering the threshold at which rare but clinically significant HIV variants can be detected reproducibly, and by conferring improved cost-effectiveness in high-throughput scenarios. In this review, we discuss the relative merits and challenges of deploying the Illumina MiSeq instrument for clinical HIV genotyping.
Affiliation(s)
- Chanson J Brumme
- BC Centre for Excellence in HIV/AIDS, Vancouver, British Columbia, Canada
- Art F Y Poon
- Department of Pathology & Laboratory Medicine, Western University, London, Ontario, Canada
|
24
|
Rudewicz J, Soueidan H, Uricaru R, Bonnefoi H, Iggo R, Bergh J, Nikolski M. MICADo - Looking for Mutations in Targeted PacBio Cancer Data: An Alignment-Free Method. Front Genet 2016; 7:214. [PMID: 28008336 PMCID: PMC5143680 DOI: 10.3389/fgene.2016.00214] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 10/07/2016] [Accepted: 11/23/2016] [Indexed: 12/11/2022] Open
Abstract
Targeted sequencing is commonly used in clinical applications of NGS technology since it enables generation of sufficient sequencing depth in the targeted genes of interest and thus ensures the best possible downstream analysis. This notwithstanding, the accurate discovery and annotation of disease-causing mutations remains a challenging problem even in such a favorable context. The difficulty is particularly salient in the case of third-generation sequencing technology, such as PacBio. We present MICADo, a de Bruijn graph-based method, implemented in Python, that makes it possible to distinguish between patient-specific mutations and other alterations for targeted sequencing of a cohort of patients. MICADo analyses NGS reads for each sample within the context of the data of the whole cohort in order to capture the differences between specificities of the sample with respect to the cohort. MICADo is particularly suitable for sequencing data from highly heterogeneous samples, especially when it involves high rates of non-uniform sequencing errors. It was validated on PacBio sequencing datasets from several cohorts of patients. The comparison with two widely used available tools, namely VarScan and GATK, shows that MICADo is more accurate, especially when true mutations have frequencies close to background noise. The source code is available at http://github.com/cbib/MICADo.
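As a rough illustration of the de Bruijn graph idea named in the abstract, the sketch below splits reads into k-mers and records each k-mer as an edge between its (k-1)-mer prefix and suffix. This is not MICADo's implementation: the function name `build_de_bruijn`, the toy reads and the choice of k = 3 are all hypothetical and chosen only for readability.

```python
from collections import Counter

def build_de_bruijn(reads, k):
    """Count k-mers as edges between (k-1)-mer nodes (hypothetical helper)."""
    edges = Counter()
    for read in reads:
        # Slide a window of length k along the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix node to its suffix node.
            edges[(kmer[:-1], kmer[1:])] += 1
    return edges

# Toy reads (invented): the second differs from the first at one position.
reads = ["ACGTAC", "ACGAAC"]
graph = build_de_bruijn(reads, k=3)

for (src, dst), count in sorted(graph.items()):
    print(f"{src} -> {dst}  x{count}")
```

In this toy graph the shared edge AC -> CG has count 2, while the two diverging paths each have count 1. In real data, a low-count branch of this kind could reflect either a true low-frequency variant or a sequencing error, which is the distinction an alignment-free method has to make.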
Affiliation(s)
- Justine Rudewicz
- Centre de BioInformatique de Bordeaux, University of Bordeaux, Bordeaux, France; Laboratoire Bordelais de Recherche en Informatique, Centre National de la Recherche Scientifique, University of Bordeaux, Bordeaux, France; Bergonié Cancer Institute, Institut National de la Santé et de la Recherche Médicale U1218, University of Bordeaux, Bordeaux, France
- Hayssam Soueidan
- Centre de BioInformatique de Bordeaux, University of Bordeaux, Bordeaux, France; Laboratoire Bordelais de Recherche en Informatique, Centre National de la Recherche Scientifique, University of Bordeaux, Bordeaux, France
- Raluca Uricaru
- Centre de BioInformatique de Bordeaux, University of Bordeaux, Bordeaux, France; Laboratoire Bordelais de Recherche en Informatique, Centre National de la Recherche Scientifique, University of Bordeaux, Bordeaux, France
- Hervé Bonnefoi
- Bergonié Cancer Institute, Institut National de la Santé et de la Recherche Médicale U1218, University of Bordeaux, Bordeaux, France
- Richard Iggo
- Bergonié Cancer Institute, Institut National de la Santé et de la Recherche Médicale U1218, University of Bordeaux, Bordeaux, France
- Jonas Bergh
- Karolinska Institute and University Hospital Stockholm, Sweden
- Macha Nikolski
- Centre de BioInformatique de Bordeaux, University of Bordeaux, Bordeaux, France; Laboratoire Bordelais de Recherche en Informatique, Centre National de la Recherche Scientifique, University of Bordeaux, Bordeaux, France
|
25
|
Tian S, Yan H, Neuhauser C, Slager SL. An analytical workflow for accurate variant discovery in highly divergent regions. BMC Genomics 2016; 17:703. [PMID: 27590916 PMCID: PMC5010666 DOI: 10.1186/s12864-016-3045-z] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Received: 05/18/2016] [Accepted: 08/25/2016] [Indexed: 02/07/2023] Open
Abstract
Background: Current variant discovery methods often start with the mapping of short reads to a reference genome; yet, their performance deteriorates in genomic regions where the reads are highly divergent from the reference sequence. This is particularly problematic for the human leukocyte antigen (HLA) region on chromosome 6p21.3. This region is associated with over 100 diseases, but variant calling is hindered by the extreme divergence across different haplotypes. Results: We simulated reads from chromosome 6 exonic regions over a wide range of sequence divergence and coverage depth. We systematically assessed combinations between five mappers and five callers for their performance on simulated data and exome-seq data from NA12878, a well-studied individual in which multiple public call sets have been generated. Among those combinations, the number of known SNPs differed by about 5% in the non-HLA regions of chromosome 6 but over 20% in the HLA region. Notably, GSNAP mapping combined with GATK UnifiedGenotyper calling identified about 20% more known SNPs than most existing methods without a noticeable loss of specificity, with 100% sensitivity in three highly polymorphic HLA genes examined. Much larger differences were observed among these combinations in INDEL calling from both non-HLA and HLA regions. We obtained similar results with our internal exome-seq data from a cohort of chronic lymphocytic leukemia patients. Conclusions: We have established a workflow enabling variant detection, with high sensitivity and specificity, over the full spectrum of divergence seen in the human genome. Comparing to public call sets from NA12878 has highlighted the overall superiority of GATK UnifiedGenotyper, followed by GATK HaplotypeCaller and SAMtools, in SNP calling, and of GATK HaplotypeCaller and Platypus in INDEL calling, particularly in regions of high sequence divergence such as the HLA region. GSNAP and Novoalign are the ideal mappers in combination with the above callers. We expect that the proposed workflow should be applicable to variant discovery in other highly divergent regions. Electronic supplementary material: The online version of this article (doi:10.1186/s12864-016-3045-z) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Shulan Tian
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, 200 1st St SW, Rochester, MN, 55905, USA
- Huihuang Yan
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, 200 1st St SW, Rochester, MN, 55905, USA
- Claudia Neuhauser
- Informatics Institute, University of Minnesota, Minneapolis, MN, 55455, USA
- Susan L Slager
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, 200 1st St SW, Rochester, MN, 55905, USA
|
26
|
Humble E, Thorne MAS, Forcada J, Hoffman JI. Transcriptomic SNP discovery for custom genotyping arrays: impacts of sequence data, SNP calling method and genotyping technology on the probability of validation success. BMC Res Notes 2016; 9:418. [PMID: 27562535 PMCID: PMC5000416 DOI: 10.1186/s13104-016-2209-x] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Received: 03/31/2016] [Accepted: 08/06/2016] [Indexed: 01/26/2023] Open
Abstract
BACKGROUND: Single nucleotide polymorphism (SNP) discovery is an important goal of many studies. However, the number of 'putative' SNPs discovered from a sequence resource may not provide a reliable indication of the number that will successfully validate with a given genotyping technology. For this it may be necessary to account for factors such as the method used for SNP discovery and the type of sequence data from which it originates, suitability of the SNP flanking sequences for probe design, and genomic context. To explore the relative importance of these and other factors, we used Illumina sequencing to augment an existing Roche 454 transcriptome assembly for the Antarctic fur seal (Arctocephalus gazella). We then mapped the raw Illumina reads to the new hybrid transcriptome using BWA and BOWTIE2 before calling SNPs with GATK. The resulting markers were pooled with two existing sets of SNPs called from the original 454 assembly using NEWBLER and SWAP454. Finally, we explored the extent to which SNPs discovered using these four methods overlapped and predicted the corresponding validation outcomes for both Illumina Infinium iSelect HD and Affymetrix Axiom arrays. RESULTS: Collating markers across all discovery methods resulted in a global list of 34,718 SNPs. However, concordance between the methods was surprisingly poor, with only 51.0% of SNPs being discovered by more than one method and 13.5% being called from both the 454 and Illumina datasets. Using a predictive modeling approach, we could also show that SNPs called from the Illumina data were on average more likely to successfully validate, as were SNPs called by more than one method. Above and beyond this pattern, predicted validation outcomes were also consistently better for Affymetrix Axiom arrays. CONCLUSIONS: Our results suggest that focusing on SNPs called by more than one method could potentially improve validation outcomes. They also highlight possible differences between alternative genotyping technologies that could be explored in future studies of non-model organisms.
Affiliation(s)
- Emily Humble
- Department of Animal Behaviour, University of Bielefeld, Postfach 100131, 33501, Bielefeld, Germany; British Antarctic Survey, High Cross, Madingley Road, Cambridge, CB3 0ET, UK
- Michael A S Thorne
- British Antarctic Survey, High Cross, Madingley Road, Cambridge, CB3 0ET, UK
- Jaume Forcada
- British Antarctic Survey, High Cross, Madingley Road, Cambridge, CB3 0ET, UK
- Joseph I Hoffman
- Department of Animal Behaviour, University of Bielefeld, Postfach 100131, 33501, Bielefeld, Germany
|
27
|
Menon R, Patel AB, Joshi C. Comparative analysis of SNP candidates in disparate milk yielding river buffaloes using targeted sequencing. PeerJ 2016; 4:e2147. [PMID: 27441113 PMCID: PMC4941740 DOI: 10.7717/peerj.2147] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 01/30/2016] [Accepted: 05/27/2016] [Indexed: 12/17/2022] Open
Abstract
River buffalo (Bubalus bubalis) milk plays an important role in the economy and in nutritious diets in several developing countries. However, reliable milk-yield genomic markers and their functional significance remain unexplored. Here, we used a target capture sequencing approach in three economically important buffalo breeds, namely Banni, Jafrabadi and Mehsani, belonging to either the high or the low milk-yield group. Blood samples were collected from a milk-yield/breed balanced group of 12 buffaloes, and whole exome sequencing was performed using a Roche 454 GS-FLX Titanium sequencer. Using an innovative approach, namely MultiCom, we identified high-quality SNPs specific for high and low milk-yield buffaloes. Almost 70% of the reported genes in QTL regions of milk-yield and milk-fat in cattle were present among the buffalo milk-yield gene candidates. Functional analysis highlighted the transcriptional regulation category in the low milk-yield group, and several new pathways in the two groups. Further, the discovered SNP candidates may account for more than half of mammary transcriptome changes in high versus low milk-yielding cattle. Thus, starting from the design of a reliable strategy, we identified reliable genomic markers specific for high and low milk-yield buffalo breeds and addressed possible downstream effects.
Affiliation(s)
- Ramesh Menon
- Department of Animal Biotechnology, Anand Agricultural University, Anand, India
- Anand B Patel
- Department of Animal Biotechnology, Anand Agricultural University, Anand, India
- Chaitanya Joshi
- Department of Animal Biotechnology, Anand Agricultural University, Anand, India
|
28
|
Damiati E, Borsani G, Giacopuzzi E. Amplicon-based semiconductor sequencing of human exomes: performance evaluation and optimization strategies. Hum Genet 2016; 135:499-511. [PMID: 27003585 PMCID: PMC4835520 DOI: 10.1007/s00439-016-1656-8] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.9] [Received: 12/10/2015] [Accepted: 03/12/2016] [Indexed: 02/02/2023]
Abstract
The Ion Proton platform allows whole exome sequencing (WES) to be performed at low cost, providing rapid turnaround time and great flexibility. Products for WES on the Ion Proton system include the AmpliSeq Exome kit and the recently introduced HiQ sequencing chemistry. Here, we used gold standard variants from the GIAB consortium to assess performance in variant identification, characterize the erroneous calls and develop a filtering strategy to reduce false positives. The AmpliSeq Exome kit captures a large fraction of bases (>94%) in human CDS, ClinVar genes and ACMG genes, but with 2,041 (7%), 449 (13%) and 11 (19%) genes not fully represented, respectively. Overall, 515 protein coding genes contain hard-to-sequence regions, including 90 genes from ClinVar. Performance in variant detection was maximal at mean coverage >120×, while at 90× and 70× we measured a loss of variants of 3.2 and 4.5%, respectively. WES using HiQ chemistry showed ~71/97.5% sensitivity, ~37/2% FDR and ~0.66/0.98 F1 score for indels and SNPs, respectively. The proposed low, medium or high-stringency filters reduced the amount of false positives by 10.2, 21.2 and 40.4% for indels and 21.2, 41.9 and 68.2% for SNPs, respectively. Amplicon-based WES on the Ion Proton platform using HiQ chemistry emerged as a competitive approach, with improved accuracy in variant identification. False-positive variants remain an issue for the Ion Torrent technology, but our filtering strategy can be applied to reduce erroneous variants.
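The sensitivity, FDR and F1 figures quoted in this abstract are linked by the standard definitions: precision is 1 minus FDR, and F1 is the harmonic mean of precision and sensitivity. A minimal sketch, assuming the rounded values from the abstract as inputs (the function name `f1_from_sensitivity_and_fdr` is hypothetical), recomputes F1 from the other two:

```python
def f1_from_sensitivity_and_fdr(sensitivity, fdr):
    """F1 score from sensitivity (recall) and false discovery rate."""
    precision = 1.0 - fdr  # FDR is the fraction of calls that are false
    return 2 * precision * sensitivity / (precision + sensitivity)

# Rounded inputs taken from the abstract (indels, then SNPs).
indel_f1 = f1_from_sensitivity_and_fdr(sensitivity=0.71, fdr=0.37)
snp_f1 = f1_from_sensitivity_and_fdr(sensitivity=0.975, fdr=0.02)

print(f"indel F1 = {indel_f1:.3f}, SNP F1 = {snp_f1:.3f}")
```

With these rounded inputs the recomputed values come out at roughly 0.67 for indels and 0.98 for SNPs, in line with the reported ~0.66/0.98; the small indel difference is what one would expect from rounding of the published percentages.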
Affiliation(s)
- E Damiati
- Unit of Genetics, Department of Molecular and Translational Medicine, University of Brescia, 25123, Brescia, Italy
- G Borsani
- Unit of Genetics, Department of Molecular and Translational Medicine, University of Brescia, 25123, Brescia, Italy
- Edoardo Giacopuzzi
- Unit of Genetics, Department of Molecular and Translational Medicine, University of Brescia, 25123, Brescia, Italy
|
29
|
Li J, Batcha AMN, Grüning B, Mansmann UR. An NGS Workflow Blueprint for DNA Sequencing Data and Its Application in Individualized Molecular Oncology. Cancer Inform 2016; 14:87-107. [PMID: 27081306 PMCID: PMC4827795 DOI: 10.4137/cin.s30793] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 10/22/2015] [Revised: 03/02/2016] [Accepted: 03/17/2016] [Indexed: 12/23/2022] Open
Abstract
Next-generation sequencing (NGS) technologies that have advanced rapidly in the past few years possess the potential to classify diseases, decipher the molecular code of related cell processes, identify targets for decision-making on targeted therapy or prevention strategies, and predict clinical treatment response. Thus, NGS is on its way to revolutionize oncology. With the help of NGS, we can draw a finer map for the genetic basis of diseases and can improve our understanding of diagnostic and prognostic applications and therapeutic methods. Despite these advantages and its potential, NGS is facing several critical challenges, including reduction of sequencing cost, enhancement of sequencing quality, improvement of technical simplicity and reliability, and development of semiautomated and integrated analysis workflow. In order to address these challenges, we conducted a literature research and summarized a four-stage NGS workflow for providing a systematic review on NGS-based analysis, explaining the strength and weakness of diverse NGS-based software tools, and elucidating its potential connection to individualized medicine. By presenting this four-stage NGS workflow, we try to provide a minimal structural layout required for NGS data storage and reproducibility.
Affiliation(s)
- Jian Li
- Institute for Medical Informatics, Biometry and Epidemiology, Ludwig Maximilian University of Munich, Munich, Germany; German Cancer Consortium (DKTK), Heidelberg, Germany; German Cancer Research Center (DKFZ), Heidelberg, Germany
- Aarif Mohamed Nazeer Batcha
- Institute for Medical Informatics, Biometry and Epidemiology, Ludwig Maximilian University of Munich, Munich, Germany; German Cancer Consortium (DKTK), Heidelberg, Germany; German Cancer Research Center (DKFZ), Heidelberg, Germany
- Björn Grüning
- Bioinformatics Group, Department of Computer Science, Albert-Ludwigs-University, Freiburg, Freiburg, Germany; Center for Biological Systems Analysis (ZBSA), University of Freiburg, Freiburg, Germany
- Ulrich R Mansmann
- Institute for Medical Informatics, Biometry and Epidemiology, Ludwig Maximilian University of Munich, Munich, Germany; German Cancer Consortium (DKTK), Heidelberg, Germany
|
30
|
Chua EW, Cree SL, Ton KNT, Lehnert K, Shepherd P, Helsby N, Kennedy MA. Cross-Comparison of Exome Analysis, Next-Generation Sequencing of Amplicons, and the iPLEX® ADME PGx Panel for Pharmacogenomic Profiling. Front Pharmacol 2016; 7:1. [PMID: 26858644 PMCID: PMC4726781 DOI: 10.3389/fphar.2016.00001] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.6] [Received: 10/14/2015] [Accepted: 01/06/2016] [Indexed: 12/30/2022] Open
Abstract
Whole-exome sequencing (WES) has been widely used for analysis of human genetic diseases, but its value for the pharmacogenomic profiling of individuals is not well studied. Initially, we performed an in-depth evaluation of the accuracy of WES variant calling in the pharmacogenes CYP2D6 and CYP2C19 by comparison with MiSeq® amplicon sequencing data (n = 36). This analysis revealed that the concordance rate between WES and MiSeq® was high, achieving 99.60% for variants that were called without exceeding the truth-sensitivity threshold (99%), defined during variant quality score recalibration (VQSR). Beyond this threshold, the proportion of discordant calls increased markedly. Subsequently, we expanded our findings beyond CYP2D6 and CYP2C19 to include more genes genotyped by the iPLEX® ADME PGx Panel in a subset of twelve samples. WES performed well, agreeing with the genotyping panel in approximately 99% of the selected pass-filter variant calls. Overall, our results have demonstrated WES to be a promising approach for pharmacogenomic profiling, with an estimated error rate of lower than 1%. Quality filters, particularly VQSR, are important for reducing the number of false variants. Future studies may benefit from examining the role of WES in the clinical setting for guiding drug therapy.
Affiliation(s)
- Eng Wee Chua
- Carney Centre for Pharmacogenomics, Department of Pathology, University of Otago, Christchurch, New Zealand
- Faculty of Pharmacy, Universiti Kebangsaan Malaysia, Kuala Lumpur, Malaysia
- Simone L. Cree
- Carney Centre for Pharmacogenomics, Department of Pathology, University of Otago, Christchurch, New Zealand
- Kim N. T. Ton
- Carney Centre for Pharmacogenomics, Department of Pathology, University of Otago, Christchurch, New Zealand
- Klaus Lehnert
- School of Biological Sciences, The University of Auckland, Auckland, New Zealand
- Phillip Shepherd
- Auckland UniServices Sequenom Facility, Liggins Institute, The University of Auckland, Auckland, New Zealand
- Nuala Helsby
- School of Medical Sciences, The University of Auckland, Auckland, New Zealand
- Martin A. Kennedy
- Carney Centre for Pharmacogenomics, Department of Pathology, University of Otago, Christchurch, New Zealand
|
31
|
Field MA, Cho V, Andrews TD, Goodnow CC. Reliably Detecting Clinically Important Variants Requires Both Combined Variant Calls and Optimized Filtering Strategies. PLoS One 2015; 10:e0143199. [PMID: 26600436 PMCID: PMC4658170 DOI: 10.1371/journal.pone.0143199] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.9] [Received: 06/08/2015] [Accepted: 11/02/2015] [Indexed: 12/21/2022] Open
Abstract
A diversity of tools is available for identification of variants from genome sequence data. Given the current complexity of incorporating external software into a genome analysis infrastructure, a tendency exists to rely on the results from a single tool alone. The quality of the output variant calls is highly variable however, depending on factors such as sequence library quality as well as the choice of short-read aligner, variant caller, and variant caller filtering strategy. Here we present a two-part study first using the high quality 'genome in a bottle' reference set to demonstrate the significant impact the choice of aligner, variant caller, and variant caller filtering strategy has on overall variant call quality and further how certain variant callers outperform others with increased sample contamination, an important consideration when analyzing sequenced cancer samples. This analysis confirms previous work showing that combining variant calls of multiple tools results in the best quality resultant variant set, for either specificity or sensitivity, depending on whether the intersection or union, of all variant calls is used respectively. Second, we analyze a melanoma cell line derived from a control lymphocyte sample to determine whether software choices affect the detection of clinically important melanoma risk-factor variants finding that only one of the three such variants is unanimously detected under all conditions. Finally, we describe a cogent strategy for implementing a clinical variant detection pipeline; a strategy that requires careful software selection, variant caller filtering optimizing, and combined variant calls in order to effectively minimize false negative variants. While implementing such features represents an increase in complexity and computation the results offer indisputable improvements in data quality.
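The trade-off described in this abstract, where the intersection of call sets favors specificity and the union favors sensitivity, can be sketched with plain set operations. The example below is illustrative only: the function name `combine_calls`, the caller names and the variant tuples are invented, and real pipelines would also need to normalize variant representation before comparing calls.

```python
def combine_calls(call_sets, mode):
    """Combine variant call sets by 'intersection' or 'union' (hypothetical helper)."""
    sets = [set(calls) for calls in call_sets]
    if mode == "intersection":
        # Keep only variants reported by every caller: fewer false positives.
        return set.intersection(*sets)
    if mode == "union":
        # Keep variants reported by any caller: fewer false negatives.
        return set.union(*sets)
    raise ValueError(f"unknown mode: {mode}")

# Invented toy calls, keyed as (chromosome, position, ref, alt).
caller_a = {("chr1", 1000, "A", "G"), ("chr1", 2000, "C", "T")}
caller_b = {("chr1", 1000, "A", "G"), ("chr2", 500, "G", "A")}

consensus = combine_calls([caller_a, caller_b], "intersection")
permissive = combine_calls([caller_a, caller_b], "union")
print(len(consensus), len(permissive))
```

Here the consensus set contains only the one variant both callers agree on, while the permissive set contains all three. Which combination is preferable depends on whether false positives or false negatives are more costly for the application.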
Affiliation(s)
- Matthew A. Field
- Department of Immunology, John Curtin School of Medical Research, Australian National University, Canberra, ACT, Australia
- National Computational Infrastructure, Australian National University, Canberra, ACT, Australia
- Vicky Cho
- Department of Immunology, John Curtin School of Medical Research, Australian National University, Canberra, ACT, Australia
- Australian Phenomics Facility, Australian National University, Canberra, ACT, Australia
- T. Daniel Andrews
- Department of Immunology, John Curtin School of Medical Research, Australian National University, Canberra, ACT, Australia
- National Computational Infrastructure, Australian National University, Canberra, ACT, Australia
- Chris C. Goodnow
- Department of Immunology, John Curtin School of Medical Research, Australian National University, Canberra, ACT, Australia
- Immunogenomics Group, Immunology Research Program, Garvan Institute of Medical Research, Darlinghurst, NSW, Australia
|
32
|
Abstract
In this review, we describe key components of a computational infrastructure for a precision medicine program that is based on clinical-grade genomic sequencing. Specific aspects covered in this review include software components and hardware infrastructure, reporting, integration into Electronic Health Records for routine clinical use and regulatory aspects. We emphasize informatics components related to reproducibility and reliability in genomic testing, regulatory compliance, traceability and documentation of processes, integration into clinical workflows, privacy requirements, prioritization and interpretation of results to report based on clinical needs, rapidly evolving knowledge base of genomic alterations and clinical treatments and return of results in a timely and predictable fashion. We also seek to differentiate between the use of precision medicine in germline and cancer.
|
33
|
Vandeweyer G, Van Laer L, Loeys B, Van den Bulcke T, Kooy RF. VariantDB: a flexible annotation and filtering portal for next generation sequencing data. Genome Med 2014; 6:74. [PMID: 25352915 PMCID: PMC4210545 DOI: 10.1186/s13073-014-0074-6] [Citation(s) in RCA: 49] [Impact Index Per Article: 4.9] [Received: 05/28/2014] [Accepted: 09/15/2014] [Indexed: 12/30/2022] Open
Abstract
Interpretation of the multitude of variants obtained from next generation sequencing (NGS) is labor intensive and complex. Web-based interfaces such as Galaxy streamline the generation of variant lists but lack flexibility in the downstream annotation and filtering that are necessary to identify causative variants in medical genomics. To this end, we built VariantDB, a web-based interactive annotation and filtering platform that automatically annotates variants with allele frequencies, functional impact, pathogenicity predictions and pathway information. VariantDB allows filtering by all annotations, under dominant, recessive or de novo inheritance models and is freely available at http://www.biomina.be/app/variantdb/.
Affiliation(s)
- Geert Vandeweyer
- Department of Medical Genetics, University of Antwerp, 2650 Edegem, Antwerp, Belgium; Biomedical Informatics Research Center Antwerp, University and University Hospital of Antwerp, 2650 Edegem, Antwerp, Belgium
- Lut Van Laer
- Department of Medical Genetics, University of Antwerp, 2650 Edegem, Antwerp, Belgium; Department of Medical Genetics, University Hospital of Antwerp, 2650 Edegem, Antwerp, Belgium
- Bart Loeys
- Department of Medical Genetics, University of Antwerp, 2650 Edegem, Antwerp, Belgium; Department of Medical Genetics, University Hospital of Antwerp, 2650 Edegem, Antwerp, Belgium
- Tim Van den Bulcke
- Biomedical Informatics Research Center Antwerp, University and University Hospital of Antwerp, 2650 Edegem, Antwerp, Belgium
- R Frank Kooy
- Department of Medical Genetics, University of Antwerp, 2650 Edegem, Antwerp, Belgium
|
34
|
Warden CD, Adamson AW, Neuhausen SL, Wu X. Detailed comparison of two popular variant calling packages for exome and targeted exon studies. PeerJ 2014; 2:e600. [PMID: 25289185 PMCID: PMC4184249 DOI: 10.7717/peerj.600] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.7] [Received: 06/13/2014] [Accepted: 09/09/2014] [Indexed: 12/22/2022] Open
Abstract
The Genome Analysis Toolkit (GATK) is commonly used for variant calling of single nucleotide polymorphisms (SNPs) and small insertions and deletions (indels) from short-read sequencing data aligned against a reference genome. There have been a number of variant calling comparisons against GATK, but an equally comprehensive comparison for VarScan has not yet been performed. More specifically, we compare (1) the effects of different pre-processing steps prior to variant calling with both GATK and VarScan, (2) VarScan variants called with increasingly conservative parameters, and (3) filtered and unfiltered GATK variant calls (for both the UnifiedGenotyper and the HaplotypeCaller). Variant calling was performed on three datasets (1 targeted exon dataset and 2 exome datasets), each with approximately a dozen subjects. In most cases, pre-processing steps (e.g., indel realignment and quality score base recalibration using GATK) had only a modest impact on the variant calls, but the importance of the pre-processing steps varied between datasets and variant callers. Based upon concordance statistics presented in this study, we recommend GATK users focus on “high-quality” GATK variants by filtering out variants flagged as low-quality. We also found that running VarScan with a conservative set of parameters (referred to as “VarScan-Cons”) resulted in a reproducible list of variants, with high concordance (>97%) to high-quality variants called by the GATK UnifiedGenotyper and HaplotypeCaller. These conservative parameters result in decreased sensitivity, but the VarScan-Cons variant list could still recover 84–88% of the high-quality GATK SNPs in the exome datasets. This study also provides limited evidence that VarScan-Cons has a decreased false positive rate among novel variants (relative to high-quality GATK SNPs) and that the GATK HaplotypeCaller has an increased false positive rate for indels (relative to VarScan-Cons and high-quality GATK UnifiedGenotyper indels). More broadly, we believe the metrics used for comparison in this study can be useful in assessing the quality of variant calls in the context of a specific experimental design. As an example, a limited number of variant calling comparisons are also performed on two additional variant callers.
Affiliation(s)
- Charles D Warden
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
- Aaron W Adamson
- Department of Population Sciences, City of Hope National Medical Center, Duarte, CA, USA
- Susan L Neuhausen
- Department of Population Sciences, City of Hope National Medical Center, Duarte, CA, USA
- Xiwei Wu
- Integrative Genomics Core, Department of Molecular and Cellular Biology, City of Hope National Medical Center, Duarte, CA, USA
|