1
Mahmood K, Sarup P, Oertelt L, Jahoor A, Orabi J. Assessing myBaits Target Capture Sequencing Methodology Using Short-Read Sequencing for Variant Detection in Oat Genomics and Breeding. Genes (Basel) 2024; 15:700. [PMID: 38927635 PMCID: PMC11203172 DOI: 10.3390/genes15060700] [Received: 04/29/2024] [Revised: 05/18/2024] [Accepted: 05/22/2024] [Indexed: 06/28/2024] Open
Abstract
The integration of target capture systems with next-generation sequencing has emerged as an efficient tool for exploring specific genetic regions with a high resolution and facilitating the rapid discovery of novel alleles. Despite these advancements, the application of targeted sequencing methodologies, such as the myBaits technology, in polyploid oat species remains relatively unexplored. In this study, we utilized the myBaits target capture method offered by Daicel Arbor Biosciences to detect variants and assess the method's reliability for variant detection in oat genomics and breeding. Ten oat genotypes were carefully chosen for targeted sequencing, focusing on specific regions on chromosome 2A to detect variants. The selected region harbors 98 genes. Precisely designed baits targeting the genes within these regions were employed for the target capture sequencing. We employed various mappers and variant callers to identify variants. After the identification of variants, we focused on the variants identified via all variant callers to assess the applicability of the myBaits sequencing methodology in oat breeding. In our efforts to validate the identified variants, we focused on two SNPs, one deletion and one insertion identified via all variant callers in the genotypes KF-318 and NOS 819111-70 but absent in the remaining eight genotypes. The Sanger sequencing of targeted SNPs failed to reproduce target capture data obtained through the myBaits technology. Similarly, the validation of deletion and insertion variants via high-resolution melting (HRM) curve analysis also failed to reproduce target capture data, again suggesting limitations in the reliability of the myBaits target capture sequencing using short-read sequencing for variant detection in the oat genome. This study sheds light on the importance of exercising caution when employing the myBaits target capture strategy for variant detection in oats. It also provides valuable insights for breeders seeking to advance oat breeding efforts and marker development using myBaits target capture sequencing, emphasizing the significance of methodological sequencing considerations in oat genomics research.
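The abstract's step of keeping only variants reported by every variant caller can be illustrated with a small set intersection. This is a minimal sketch, not the authors' pipeline: the caller names, the `(chrom, pos, ref, alt)` variant key, and the function `consensus_variants` are all hypothetical, and real data would first need normalized variant representations.

```python
from typing import Dict, Set, Tuple

# Hypothetical variant key: (chromosome, position, reference allele, alternate allele).
Variant = Tuple[str, int, str, str]


def consensus_variants(calls_by_caller: Dict[str, Set[Variant]]) -> Set[Variant]:
    """Return only the variants reported by every caller."""
    call_sets = list(calls_by_caller.values())
    if not call_sets:
        return set()
    return set.intersection(*call_sets)


# Made-up calls from three illustrative callers on chromosome 2A.
calls = {
    "caller_a": {("2A", 100, "A", "G"), ("2A", 250, "CT", "C")},
    "caller_b": {("2A", 100, "A", "G"), ("2A", 250, "CT", "C"), ("2A", 300, "G", "T")},
    "caller_c": {("2A", 100, "A", "G")},
}

shared = consensus_variants(calls)
print(sorted(shared))
```

As the study reports, agreement among callers does not by itself guarantee that a variant is real, so a consensus set like this is a candidate list for independent validation rather than a confirmed one.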
Affiliation(s)
- Khalid Mahmood
  - Nordic Seed, Grindsnabevej 25, 8300 Odder, Denmark
- Pernille Sarup
  - Nordic Seed, Grindsnabevej 25, 8300 Odder, Denmark
- Lukas Oertelt
  - Nordic Seed Germany, Kirchhorster Str. 16, 31688 Nienstädt, Germany
- Ahmed Jahoor
  - Nordic Seed, Grindsnabevej 25, 8300 Odder, Denmark
  - Nordic Seed Germany, Kirchhorster Str. 16, 31688 Nienstädt, Germany
- Jihad Orabi
  - Nordic Seed, Grindsnabevej 25, 8300 Odder, Denmark
2
Kalleberg J, Rissman J, Schnabel RD. Overcoming Limitations to Deep Learning in Domesticated Animals with TrioTrain. bioRxiv 2024:2024.04.15.589602. [PMID: 38659907 PMCID: PMC11042298 DOI: 10.1101/2024.04.15.589602] [Indexed: 04/26/2024]
Abstract
Variant calling across diverse species remains challenging as most bioinformatics tools default to assumptions based on human genomes. DeepVariant (DV) excels without joint genotyping while offering fewer implementation barriers. However, the growing appeal of a "universal" algorithm has magnified the unknown impacts when used with non-human genomes. Here, we use bovine genomes to assess the limits of human-genome-trained models in other species. We introduce the first multi-species DV model that achieves a lower Mendelian Inheritance Error (MIE) rate during single-sample genotyping. Our novel approach, TrioTrain, automates extending DV for species without Genome In A Bottle (GIAB) resources and uses region shuffling to mitigate barriers for SLURM-based clusters. To offset imperfect truth labels for animal genomes, we remove Mendelian discordant variants before training, where models are tuned to genotype the offspring correctly. With TrioTrain, we use cattle, yak, and bison trios to build 30 model iterations across five phases. We observe remarkable performance across phases when testing the GIAB human trios with a mean SNP F1 score >0.990. In HG002, our phase 4 bovine model identifies more variants at a lower MIE rate than DeepTrio. In bovine F1-hybrid genomes, our model substantially reduces inheritance errors with a mean MIE rate of 0.03 percent. Although constrained by imperfect labels, we find that multi-species, trio-based training produces a robust variant calling model. Our research demonstrates that exclusively training with human genomes restricts the application of deep-learning approaches for comparative genomics.
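The Mendelian Inheritance Error (MIE) rate used above can be illustrated with a small calculation. This is a minimal sketch under stated assumptions, not TrioTrain's implementation: it assumes diploid, autosomal sites with fully called genotypes, and the type aliases and function names are hypothetical.

```python
from typing import Iterable, Tuple

Genotype = Tuple[int, int]                   # two allele indices, e.g. (0, 1)
Trio = Tuple[Genotype, Genotype, Genotype]   # (child, mother, father)


def is_mendelian_consistent(child: Genotype, mother: Genotype, father: Genotype) -> bool:
    """True if one child allele can come from the mother and the other from the father."""
    a, b = child
    return (a in mother and b in father) or (b in mother and a in father)


def mie_rate(trios: Iterable[Trio]) -> float:
    """Fraction of trio genotype sets that violate Mendelian inheritance."""
    total = 0
    errors = 0
    for child, mother, father in trios:
        total += 1
        if not is_mendelian_consistent(child, mother, father):
            errors += 1
    return errors / total if total else 0.0


# Made-up genotypes at four sites: two consistent, two discordant.
example_trios = [
    ((0, 1), (0, 0), (1, 1)),  # consistent
    ((1, 1), (0, 0), (0, 1)),  # discordant: mother carries no allele 1
    ((0, 0), (0, 1), (0, 1)),  # consistent
    ((0, 1), (0, 0), (0, 0)),  # discordant: neither parent carries allele 1
]

rate = mie_rate(example_trios)
print(f"MIE rate: {rate:.2%}")
```

A discordant site can reflect a genotyping error in any of the three samples or, rarely, a true de novo mutation, which is why the rate serves as a proxy for calling accuracy when no truth set exists.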
Affiliation(s)
- Jenna Kalleberg
  - University of Missouri, Division of Animal Sciences, Columbia, MO, 65201 USA
- Jacob Rissman
  - University of Missouri, Division of Animal Sciences, Columbia, MO, 65201 USA
- Robert D Schnabel
  - University of Missouri, Division of Animal Sciences, Columbia, MO, 65201 USA
  - University of Missouri, Genetics Area Program, Columbia, MO, 65201 USA
3
Kosugi S, Terao C. Comparative evaluation of SNVs, indels, and structural variations detected with short- and long-read sequencing data. Hum Genome Var 2024; 11:18. [PMID: 38632226 PMCID: PMC11024196 DOI: 10.1038/s41439-024-00276-x] [Received: 01/18/2024] [Revised: 03/12/2024] [Accepted: 03/20/2024] [Indexed: 04/19/2024] Open
Abstract
Short- and long-read sequencing technologies are routinely used to detect DNA variants, including SNVs, indels, and structural variations (SVs). However, the differences in the quality and quantity of variants detected between short- and long-read data are not fully understood. In this study, we comprehensively evaluated the variant calling performance of short- and long-read-based SNV, indel, and SV detection algorithms (6 for SNVs, 12 for indels, and 13 for SVs) using a novel evaluation framework incorporating manual visual inspection. The results showed that indel-insertion calls greater than 10 bp were poorly detected by short-read-based detection algorithms compared to long-read-based algorithms; however, the recall and precision of SNV and indel-deletion detection were similar between short- and long-read data. The recall of SV detection with short-read-based algorithms was significantly lower in repetitive regions, especially for small- to intermediate-sized SVs, than that detected with long-read-based algorithms. In contrast, the recall and precision of SV detection in nonrepetitive regions were similar between short- and long-read data. These findings suggest the need for refined strategies, such as incorporating multiple variant detection algorithms, to generate a more complete set of variants using short-read data.
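Recall and precision, the measures compared above, follow directly from the overlap between a call set and a truth set. The sketch below is illustrative only and is not the paper's evaluation framework: the variant key, the `Metrics` type, and `evaluate_calls` are hypothetical names, and exact-match comparison is assumed (real benchmarks usually allow for equivalent variant representations and breakpoint tolerance).

```python
from typing import NamedTuple, Set, Tuple

# Hypothetical variant key: (chromosome, position, reference allele, alternate allele).
Variant = Tuple[str, int, str, str]


class Metrics(NamedTuple):
    precision: float
    recall: float
    f1: float


def evaluate_calls(calls: Set[Variant], truth: Set[Variant]) -> Metrics:
    """Compare a call set against a truth set by exact variant match."""
    tp = len(calls & truth)   # called and true
    fp = len(calls - truth)   # called but not true
    fn = len(truth - calls)   # true but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return Metrics(precision, recall, f1)


# Made-up data: five true variants, four calls, three of them correct.
truth = {("chr1", p, "A", "G") for p in (10, 20, 30, 40, 50)}
calls = {("chr1", p, "A", "G") for p in (10, 20, 30, 60)}

metrics = evaluate_calls(calls, truth)
print(f"precision={metrics.precision:.2f} recall={metrics.recall:.2f} f1={metrics.f1:.3f}")
```

Here precision is 3/4 and recall is 3/5; the F1 score is their harmonic mean.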
Affiliation(s)
- Shunichi Kosugi
  - Center for Genome Informatics, Research Organization of Information and Systems, Joint Support-Center for Data Science Research, Shizuoka, Japan
  - Advanced Genomics Center, National Institute of Genetics, Shizuoka, Japan
  - Laboratory for Statistical and Translational Genetics, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan
  - Clinical Research Center, Shizuoka General Hospital, Shizuoka, Japan
- Chikashi Terao
  - Laboratory for Statistical and Translational Genetics, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan
  - Clinical Research Center, Shizuoka General Hospital, Shizuoka, Japan
  - The Department of Applied Genetics, The School of Pharmaceutical Sciences, University of Shizuoka, Shizuoka, Japan
4
de Jong TV, Pan Y, Rastas P, Munro D, Tutaj M, Akil H, Benner C, Chen D, Chitre AS, Chow W, Colonna V, Dalgard CL, Demos WM, Doris PA, Garrison E, Geurts AM, Gunturkun HM, Guryev V, Hourlier T, Howe K, Huang J, Kalbfleisch T, Kim P, Li L, Mahaffey S, Martin FJ, Mohammadi P, Ozel AB, Polesskaya O, Pravenec M, Prins P, Sebat J, Smith JR, Solberg Woods LC, Tabakoff B, Tracey A, Uliano-Silva M, Villani F, Wang H, Sharp BM, Telese F, Jiang Z, Saba L, Wang X, Murphy TD, Palmer AA, Kwitek AE, Dwinell MR, Williams RW, Li JZ, Chen H. A revamped rat reference genome improves the discovery of genetic diversity in laboratory rats. Cell Genomics 2024; 4:100527. [PMID: 38537634 PMCID: PMC11019364 DOI: 10.1016/j.xgen.2024.100527] [Received: 10/02/2023] [Revised: 12/26/2023] [Accepted: 02/29/2024] [Indexed: 04/09/2024]
Abstract
The seventh iteration of the reference genome assembly for Rattus norvegicus, mRatBN7.2, corrects numerous misplaced segments, reduces base-level errors approximately 9-fold, and increases contiguity 290-fold compared with its predecessor. Gene annotations are now more complete, improving the mapping precision of genomic, transcriptomic, and proteomics datasets. We jointly analyzed 163 short-read whole-genome sequencing datasets representing 120 laboratory rat strains and substrains using mRatBN7.2. We defined ∼20.0 million sequence variations, of which 18,700 are predicted to potentially impact the function of 6,677 genes. We also generated a new rat genetic map from 1,893 heterogeneous stock rats and annotated transcription start sites and alternative polyadenylation sites. The mRatBN7.2 assembly, along with the extensive analysis of genomic variations among rat strains, enhances our understanding of the rat genome, providing researchers with an expanded resource for studies involving rats.
Affiliation(s)
- Tristan V de Jong
  - Department of Pharmacology, Addiction Science, and Toxicology, University of Tennessee Health Science Center, Memphis, TN, USA
- Yanchao Pan
  - Department of Human Genetics, University of Michigan, Ann Arbor, MI, USA
- Pasi Rastas
  - Institute of Biotechnology, University of Helsinki, Helsinki, Finland
- Daniel Munro
  - Department of Psychiatry, University of California San Diego, San Diego, CA, USA; Department of Integrative Structural and Computational Biology, Scripps Research, San Diego, CA, USA
- Monika Tutaj
  - Department of Physiology, Medical College of Wisconsin, Milwaukee, WI, USA; Rat Genome Database, Medical College of Wisconsin, Milwaukee, WI, USA
- Huda Akil
  - Michigan Neuroscience Institute, University of Michigan, Ann Arbor, MI, USA
- Chris Benner
  - Department of Medicine, University of California San Diego, San Diego, CA, USA
- Denghui Chen
  - Department of Psychiatry, University of California San Diego, San Diego, CA, USA
- Apurva S Chitre
  - Department of Psychiatry, University of California San Diego, San Diego, CA, USA
- William Chow
  - Tree of Life, Wellcome Sanger Institute, Cambridge, UK
- Vincenza Colonna
  - Institute of Genetics and Biophysics, National Research Council, Naples, Italy; Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Clifton L Dalgard
  - Department of Anatomy, Physiology & Genetics, The American Genome Center, Uniformed Services University of the Health Sciences, Bethesda, MD, USA
- Wendy M Demos
  - Department of Physiology, Medical College of Wisconsin, Milwaukee, WI, USA; Rat Genome Database, Medical College of Wisconsin, Milwaukee, WI, USA
- Peter A Doris
  - The Brown Foundation Institute of Molecular Medicine, Center for Human Genetics, University of Texas Health Science Center, Houston, TX, USA
- Erik Garrison
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Aron M Geurts
  - Department of Physiology, Medical College of Wisconsin, Milwaukee, WI, USA
- Hakan M Gunturkun
  - Department of Pharmacology, Addiction Science, and Toxicology, University of Tennessee Health Science Center, Memphis, TN, USA
- Victor Guryev
  - Genome Structure and Ageing, University of Groningen, UMC, Groningen, the Netherlands
- Thibaut Hourlier
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus in Hinxton, Cambridgeshire, UK
- Kerstin Howe
  - Tree of Life, Wellcome Sanger Institute, Cambridge, UK
- Jun Huang
  - Department of Pharmacology, Addiction Science, and Toxicology, University of Tennessee Health Science Center, Memphis, TN, USA
- Ted Kalbfleisch
  - Gluck Equine Research Center, Department of Veterinary Science, University of Kentucky, Louisville, KY, USA
- Panjun Kim
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Ling Li
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA; Center for Proteomics and Metabolomics, St. Jude Children's Research Hospital, Memphis, TN, USA
- Spencer Mahaffey
  - Department of Pharmaceutical Sciences, Skaggs School of Pharmacy and Pharmaceutical Sciences, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Fergal J Martin
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus in Hinxton, Cambridgeshire, UK
- Pejman Mohammadi
  - Center for Immunity and Immunotherapies, Seattle Children's Research Institute, Seattle, WA, USA; Department of Pediatrics, University of Washington School of Medicine, Seattle, WA, USA
- Ayse Bilge Ozel
  - Department of Human Genetics, University of Michigan, Ann Arbor, MI, USA
- Oksana Polesskaya
  - Department of Psychiatry, University of California San Diego, San Diego, CA, USA
- Michal Pravenec
  - Institute of Physiology, Czech Academy of Sciences, Prague, Czechia
- Pjotr Prins
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Jonathan Sebat
  - Department of Psychiatry, University of California San Diego, San Diego, CA, USA
- Jennifer R Smith
  - Department of Physiology, Medical College of Wisconsin, Milwaukee, WI, USA; Rat Genome Database, Medical College of Wisconsin, Milwaukee, WI, USA
- Leah C Solberg Woods
  - Department of Internal Medicine, Section on Molecular Medicine, Wake Forest University School of Medicine, Winston-Salem, NC, USA
- Boris Tabakoff
  - Department of Pharmaceutical Sciences, Skaggs School of Pharmacy and Pharmaceutical Sciences, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Alan Tracey
  - Tree of Life, Wellcome Sanger Institute, Cambridge, UK
- Flavia Villani
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Hongyang Wang
  - Department of Animal Sciences, Washington State University, Pullman, WA, USA
- Burt M Sharp
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Francesca Telese
  - Department of Psychiatry, University of California San Diego, San Diego, CA, USA
- Zhihua Jiang
  - Department of Animal Sciences, Washington State University, Pullman, WA, USA
- Laura Saba
  - Department of Pharmaceutical Sciences, Skaggs School of Pharmacy and Pharmaceutical Sciences, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Xusheng Wang
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA; Center for Proteomics and Metabolomics, St. Jude Children's Research Hospital, Memphis, TN, USA
- Terence D Murphy
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Abraham A Palmer
  - Department of Psychiatry, University of California San Diego, San Diego, CA, USA; Institute for Genomic Medicine, University of California San Diego, La Jolla, CA, USA
- Anne E Kwitek
  - Department of Physiology, Medical College of Wisconsin, Milwaukee, WI, USA; Rat Genome Database, Medical College of Wisconsin, Milwaukee, WI, USA
- Melinda R Dwinell
  - Department of Physiology, Medical College of Wisconsin, Milwaukee, WI, USA; Rat Genome Database, Medical College of Wisconsin, Milwaukee, WI, USA
- Robert W Williams
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Jun Z Li
  - Department of Human Genetics, University of Michigan, Ann Arbor, MI, USA
- Hao Chen
  - Department of Pharmacology, Addiction Science, and Toxicology, University of Tennessee Health Science Center, Memphis, TN, USA
5
Olszewska M, Malcher A, Stokowy T, Pollock N, Berman AJ, Budkiewicz S, Kamieniczna M, Jackowiak H, Suszynska-Zajczyk J, Jedrzejczak P, Yatsenko AN, Kurpisz M. Effects of Tcte1 knockout on energy chain transportation and spermatogenesis: implications for male infertility. Hum Reprod Open 2024; 2024:hoae020. [PMID: 38650655 PMCID: PMC11035007 DOI: 10.1093/hropen/hoae020] [Received: 10/11/2022] [Revised: 03/08/2024] [Indexed: 04/25/2024] Open
Abstract
STUDY QUESTION Is the Tcte1 mutation causative for male infertility? SUMMARY ANSWER Our collected data underline the complex and devastating effect of the single-gene mutation on the testicular molecular network, leading to male reproductive failure. WHAT IS KNOWN ALREADY Recent data have revealed mutations in genes related to axonemal dynein arms as causative for morphology and motility abnormalities in spermatozoa of infertile males, including dysplasia of fibrous sheath (DFS) and multiple morphological abnormalities in the sperm flagella (MMAF). The nexin-dynein regulatory complex (N-DRC) coordinates the dynein arm activity and is built from the DRC1-DRC7 proteins. DRC5 (TCTE1), one of the N-DRC elements, has already been reported as a candidate for abnormal sperm flagella beating; however, this was reported only in a restricted manner, with no clear explanation of the respective observations. STUDY DESIGN, SIZE, DURATION Using the CRISPR/Cas9 genome editing technique, a mouse Tcte1 gene knockout line was created on the basis of the C57Bl/6J strain. The mouse reproductive potential, semen characteristics, testicular gene expression levels, sperm ATP, and testis apoptosis level measurements were then assessed, followed by visualization of N-DRC proteins in sperm, and protein modeling in silico. Also, a pilot genomic sequencing study of samples from human infertile males (n = 248) was applied for screening of TCTE1 variants. PARTICIPANTS/MATERIALS, SETTING, METHODS To check the reproductive potential of KO mice, adult animals were crossed for delivery of three litters per caged pair, but for no longer than 6 months, in various combinations of zygosity. All experiments were performed for wild-type (WT, control group), heterozygous Tcte1+/- and homozygous Tcte1-/- male mice. Gross anatomy was performed on testis and epididymis samples, followed by semen analysis. Sequencing of RNA (RNAseq; Illumina) was done for mice testis tissues.
STRING interactions were checked for protein-protein interactions, based on changed expression levels of corresponding genes identified in the mouse testis RNAseq experiments. Immunofluorescence in situ staining was performed to detect the N-DRC complex proteins: Tcte1 (Drc5), Drc7, Fbxl13 (Drc6), and Eps8l1 (Drc3) in mouse spermatozoa. To determine the amount of ATP in spermatozoa, the luminescence level was measured. In addition, immunofluorescence in situ staining was performed to check the level of apoptosis via caspase 3 visualization on mouse testis samples. DNA from whole blood samples of infertile males (n = 137 with non-obstructive azoospermia or cryptozoospermia, n = 111 samples with a spectrum of oligoasthenoteratozoospermia, including n = 47 with asthenozoospermia) was extracted to perform genomic sequencing (WGS, WES, or Sanger). Protein prediction modeling of human-identified variants and the exon 3 structure deleted in the mouse knockout was also performed. MAIN RESULTS AND THE ROLE OF CHANCE No progeny at all was found for the homozygous males which were revealed to have oligoasthenoteratozoospermia, while heterozygous animals were fertile but manifested oligozoospermia, suggesting haploinsufficiency. RNA-sequencing of the testicular tissue showed the influence of Tcte1 mutations on the expression pattern of 21 genes responsible for mitochondrial ATP processing or linked with apoptosis or spermatogenesis. In Tcte1-/- males, the protein was revealed in only residual amounts in the sperm head nucleus and was not transported to the sperm flagella, as were other N-DRC components. Decreased ATP levels (2.4-fold lower) were found in the spermatozoa of homozygous mice, together with disturbed tail:midpiece ratios, leading to abnormal sperm tail beating. Casp3-positive signals (indicating apoptosis) were observed in spermatogonia only, at a similar level in all three mouse genotypes. 
Mutation screening of human infertile males revealed one novel and five ultra-rare heterogeneous variants (predicted as disease-causing) in 6.05% of the patients studied. Protein prediction modeling of identified variants revealed changes in the protein surface charge potential, leading to disruption in helix flexibility or its dynamics, thus suggesting disrupted interactions of TCTE1 with its binding partners located within the axoneme. LARGE SCALE DATA All data generated or analyzed during this study are included in this published article and its supplementary information files. RNAseq data are available in the GEO database (https://www.ncbi.nlm.nih.gov/geo/) under the accession number GSE207805. The results described in the publication are based on whole-genome or exome sequencing data which includes sensitive information in the form of patient-specific germline variants. Information regarding such variants must not be shared publicly under European Union legislation; therefore, the raw data that support the findings of this study are available from the corresponding author upon reasonable request. LIMITATIONS, REASONS FOR CAUTION In the study, the in vitro fertilization performance of sperm from homozygous male mice was not checked. WIDER IMPLICATIONS OF THE FINDINGS This study contains novel and comprehensive data concerning the role of TCTE1 in male infertility. The TCTE1 gene is the next one that should be added to the 'male infertility list' because of its crucial role in spermatogenesis and proper sperm functioning. STUDY FUNDING/COMPETING INTERESTS This work was supported by the National Science Centre in Poland, grants no.: 2015/17/B/NZ2/01157 and 2020/37/B/NZ5/00549 (to M.K.), 2017/26/D/NZ5/00789 (to A.M.), and HD096723, GM127569-03, NIH SAP #4100085736 PA DoH (to A.N.Y.). The authors declare that there is no conflict of interest that could be perceived as prejudicing the impartiality of the research reported.
Affiliation(s)
- Marta Olszewska
  - Institute of Human Genetics, Polish Academy of Sciences, Poznan, Poland
- Agnieszka Malcher
  - Institute of Human Genetics, Polish Academy of Sciences, Poznan, Poland
- Tomasz Stokowy
  - Scientific Computing Group, IT Division, University of Bergen, Bergen, Norway
- Nijole Pollock
  - Department of OB/GYN and Reproductive Sciences, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
- Andrea J Berman
  - Department of Biological Sciences, University of Pittsburgh, Pittsburgh, PA, USA
- Sylwia Budkiewicz
  - Institute of Human Genetics, Polish Academy of Sciences, Poznan, Poland
- Hanna Jackowiak
  - Department of Histology and Embryology, Poznan University of Life Sciences, Poznan, Poland
- Piotr Jedrzejczak
  - Division of Infertility and Reproductive Endocrinology, Department of Gynecology, Obstetrics and Gynecological Oncology, Poznan University of Medical Sciences, Poznan, Poland
- Alexander N Yatsenko
  - Department of OB/GYN and Reproductive Sciences, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
- Maciej Kurpisz
  - Institute of Human Genetics, Polish Academy of Sciences, Poznan, Poland
6
Chafai N, Bonizzi L, Botti S, Badaoui B. Emerging applications of machine learning in genomic medicine and healthcare. Crit Rev Clin Lab Sci 2024; 61:140-163. [PMID: 37815417 DOI: 10.1080/10408363.2023.2259466] [Received: 06/19/2023] [Accepted: 09/12/2023] [Indexed: 10/11/2023]
Abstract
The integration of artificial intelligence technologies has propelled the progress of clinical and genomic medicine in recent years. The significant increase in computing power has facilitated the ability of artificial intelligence models to analyze and extract features from extensive medical data and images, thereby contributing to the advancement of intelligent diagnostic tools. Artificial intelligence (AI) models have been utilized in the field of personalized medicine to integrate clinical data and genomic information of patients. This integration allows for the identification of customized treatment recommendations, ultimately leading to enhanced patient outcomes. Notwithstanding the notable advancements, the application of artificial intelligence (AI) in the field of medicine is impeded by various obstacles such as the limited availability of clinical and genomic data, the diversity of datasets, ethical implications, and the inconclusive interpretation of AI models' results. In this review, a comprehensive evaluation of multiple machine learning algorithms utilized in the fields of clinical and genomic medicine is conducted. Furthermore, we present an overview of the implementation of artificial intelligence (AI) in the fields of clinical medicine, drug discovery, and genomic medicine. Finally, a number of constraints pertaining to the implementation of artificial intelligence within the healthcare industry are examined.
Affiliation(s)
- Narjice Chafai
  - Laboratory of Biodiversity, Ecology, and Genome, Faculty of Sciences, Department of Biology, Mohammed V University in Rabat, Rabat, Morocco
- Luigi Bonizzi
  - Department of Biomedical, Surgical and Dental Science, University of Milan, Milan, Italy
- Sara Botti
  - PTP Science Park, Via Einstein - Loc. Cascina Codazza, Lodi, Italy
- Bouabid Badaoui
  - Laboratory of Biodiversity, Ecology, and Genome, Faculty of Sciences, Department of Biology, Mohammed V University in Rabat, Rabat, Morocco
  - African Sustainable Agriculture Research Institute (ASARI), Mohammed VI Polytechnic University (UM6P), Laâyoune, Morocco
7
Abdelwahab O, Belzile F, Torkamaneh D. Performance analysis of conventional and AI-based variant callers using short and long reads. BMC Bioinformatics 2023; 24:472. [PMID: 38097928 PMCID: PMC10720095 DOI: 10.1186/s12859-023-05596-3] [Received: 08/30/2023] [Accepted: 12/04/2023] [Indexed: 12/18/2023] Open
Abstract
BACKGROUND The accurate detection of variants is essential for genomics-based studies. Currently, there are various tools designed to detect genomic variants; however, it has always been a challenge to decide which tool to use, especially when various major genome projects have chosen to use different tools. Thus far, most of the existing tools were mainly developed to work on short-read data (i.e., Illumina); however, other sequencing technologies (e.g., PacBio and Oxford Nanopore) have recently shown that they can also be used for variant calling. In addition, with the emergence of artificial intelligence (AI)-based variant calling tools, there is a pressing need to compare these tools in terms of efficiency, accuracy, computational power, and ease of use. RESULTS In this study, we evaluated five of the most widely used conventional and AI-based variant calling tools (BCFTools, GATK4, Platypus, DNAscope, and DeepVariant) in terms of accuracy and computational cost using both short-read and long-read data derived from three different sequencing technologies (Illumina, PacBio HiFi, and ONT) for the same set of samples from the Genome In A Bottle project. The analysis showed that AI-based variant calling tools outperform conventional ones for calling SNVs and INDELs using both long and short reads in most aspects. In addition, we demonstrate the advantages and drawbacks of each tool while ranking them in each aspect of these comparisons. CONCLUSION This study provides best practices for variant calling using AI-based and conventional variant callers with different types of sequencing data.
Affiliation(s)
- Omar Abdelwahab
  - Département de Phytologie, Université Laval, Québec, Canada
  - Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Québec, Canada
  - Centre de recherche et d'innovation sur les végétaux (CRIV), Université Laval, Québec, Canada
  - Institut intelligence et données (IID), Université Laval, Québec, Canada
- François Belzile
  - Département de Phytologie, Université Laval, Québec, Canada
  - Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Québec, Canada
  - Centre de recherche et d'innovation sur les végétaux (CRIV), Université Laval, Québec, Canada
- Davoud Torkamaneh
  - Département de Phytologie, Université Laval, Québec, Canada
  - Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Québec, Canada
  - Centre de recherche et d'innovation sur les végétaux (CRIV), Université Laval, Québec, Canada
  - Institut intelligence et données (IID), Université Laval, Québec, Canada
8
Park H, Gim J. A comparative investigation of single nucleotide variant calling for a personal non-Caucasian sequencing sample. Genes Genomics 2023; 45:1527-1536. [PMID: 37651066 DOI: 10.1007/s13258-023-01439-w] [Received: 07/03/2023] [Accepted: 08/04/2023] [Indexed: 09/01/2023]
Abstract
BACKGROUND The falling cost and increasing clinical application of whole genome sequencing (WGS) create a need for efficient (accurate and rapid) variant calling procedures for personal WGS data (n = 1). A number of variant calling pipelines have been introduced that use the human reference genome GRCh38 as a reference and a benchmark dataset called 'NA12878', both of which are 'standard' but of limited ethnic origin. Considering the nature of variant calling algorithms and recent updates in sequencing protocols, however, it is necessary to revisit the efficiency of the current best pipelines for personal WGS data from diverse ethnicities. OBJECTIVE We discuss the most efficient practices for variant calling of personal WGS reads, with a particular emphasis on whether (1) an ethnic match or mismatch between the reference genome and the WGS data produces a distinct result and, more importantly, (2) there is an ethnic-specific optimal workflow. METHODS Here, we generate appropriate WGS data, DNA array data, and a sufficient number of Sanger-validated variants from a single Korean subject to perform such a comprehensive comparison. We applied these WGS reads and the 'NA12878' reads to 8 different variant calling pipelines with 2 different reference genomes (GRCh38 and KOREF, a Korean reference genome) to which the WGS reads from different ethnic origins are aligned. RESULTS We evaluated the performance of the pipelines with the matched array genotype data and Sanger sequencing validation and demonstrated that, regardless of the ethnic match/mismatch, (1) Novoalign-GATK4 showed the most efficient performance with the exceptional calls in the MHC region; (2) the overall performance was better with GRCh38, while a significant difference in recall was observed. In addition, we found that removing the 'markduplication' step with PCR-free WGS data largely reduced computing cost while maintaining performance. CONCLUSION For variant calling of personal PCR-free WGS data, regardless of ethnicity, we recommend the use of Novoalign + GATK4 with GRCh38 and without 'markduplication'.
Affiliation(s)
- HyeonSeul Park
  - BK21 FOUR, Department of Integrative Biological Sciences, Chosun University, Gwangju, Republic of Korea
- JungSoo Gim
  - BK21 FOUR, Department of Integrative Biological Sciences, Chosun University, Gwangju, Republic of Korea
  - Department of Biomedical Science, Chosun University, Gwangju, Republic of Korea
  - Asian Dementia Research Initiative, Chosun University, Gwangju, Republic of Korea
|
9
|
Xiang X, Lu B, Song D, Li J, Shu K, Pu D. Evaluating the performance of low-frequency variant calling tools for the detection of variants from short-read deep sequencing data. Sci Rep 2023; 13:20444. [PMID: 37993475 PMCID: PMC10665316 DOI: 10.1038/s41598-023-47135-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/11/2023] [Accepted: 11/09/2023] [Indexed: 11/24/2023] Open
Abstract
Detection of low-frequency variants with high accuracy plays an important role in biomedical research and clinical practice. However, it is challenging to do so with next-generation sequencing (NGS) approaches due to the high error rates of NGS. To accurately distinguish low-level true variants from these errors, many statistical tools for calling low-frequency variants have been proposed, but a systematic performance comparison of these tools has not yet been performed. Here, we evaluated four raw-reads-based variant callers (SiNVICT, outLyzer, Pisces, and LoFreq) and four UMI-based variant callers (DeepSNVMiner, MAGERI, smCounter2, and UMI-VarCal) considering their capability to call single nucleotide variants (SNVs) with allelic frequencies as low as 0.025% in deep sequencing data. We analyzed a total of 54 simulated datasets with various sequencing depths and variant allele frequencies (VAFs), two reference datasets, and Horizon Tru-Q sample data. The results showed that the UMI-based callers, except smCounter2, outperformed the raw-reads-based callers regarding detection limit. Sequencing depth had almost no effect on the UMI-based callers but significantly influenced the raw-reads-based callers. Regardless of the sequencing depth, MAGERI showed the fastest analysis, while smCounter2 consistently took the longest to finish the variant calling process. Overall, DeepSNVMiner and UMI-VarCal performed the best, with sensitivity and precision of 88% and 100% for DeepSNVMiner and 84% and 100% for UMI-VarCal. In conclusion, the UMI-based callers, except smCounter2, outperformed the raw-reads-based callers in terms of sensitivity and precision. We recommend using DeepSNVMiner and UMI-VarCal for low-frequency variant detection. The results provide important information regarding future directions for reliable low-frequency variant detection and algorithm development, which is critical in genetics-based medical research and clinical applications.
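To give a feel for why an allele frequency of 0.025% calls for deep sequencing, the following is a minimal sketch of the underlying arithmetic. It assumes reads are sampled independently and ignores sequencing error entirely, which is exactly the complication the callers in this study have to handle. The function names and the example depths are hypothetical and are not taken from the paper.

```python
from math import comb


def expected_alt_reads(depth: int, vaf: float) -> float:
    """Expected number of reads carrying the variant allele at a site."""
    return depth * vaf


def prob_at_least_k_alt_reads(depth: int, vaf: float, k: int) -> float:
    """Binomial probability of seeing at least k variant reads.

    Assumes each read independently carries the variant with
    probability `vaf` (an idealisation with no sequencing error).
    """
    # Sum the probabilities of seeing fewer than k variant reads,
    # then take the complement.
    p_fewer = sum(
        comb(depth, i) * vaf**i * (1.0 - vaf) ** (depth - i)
        for i in range(k)
    )
    return 1.0 - p_fewer


# Illustrative depths only (assumed values, not from the study).
vaf = 0.00025  # 0.025%
shallow = prob_at_least_k_alt_reads(1_000, vaf, 1)
deep = prob_at_least_k_alt_reads(20_000, vaf, 1)
```

Under these assumptions a site at depth 20,000 is expected to carry about five variant reads, whereas at depth 1,000 most sites would carry none, so the variant would usually not be sampled at all.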
Affiliation(s)
- Xudong Xiang
  - Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing, 400065, China
- Bowen Lu
  - Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing, 400065, China
- Dongyang Song
  - Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing, 400065, China
- Jie Li
  - Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing, 400065, China
- Kunxian Shu
  - Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing, 400065, China
- Dan Pu
  - Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing, 400065, China
|
10
|
Guhlin J, Le Lec MF, Wold J, Koot E, Winter D, Biggs PJ, Galla SJ, Urban L, Foster Y, Cox MP, Digby A, Uddstrom LR, Eason D, Vercoe D, Davis T, Howard JT, Jarvis ED, Robertson FE, Robertson BC, Gemmell NJ, Steeves TE, Santure AW, Dearden PK. Species-wide genomics of kākāpō provides tools to accelerate recovery. Nat Ecol Evol 2023; 7:1693-1705. [PMID: 37640765 DOI: 10.1038/s41559-023-02165-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 11/26/2022] [Accepted: 07/11/2023] [Indexed: 08/31/2023]
Abstract
The kākāpō is a critically endangered, intensively managed, long-lived nocturnal parrot endemic to Aotearoa New Zealand. We generated and analysed whole-genome sequence data for nearly all individuals living in early 2018 (169 individuals) to generate a high-quality species-wide genetic variant callset. We leverage extensive long-term metadata to quantify genome-wide diversity of the species over time and present new approaches using probabilistic programming, combined with a phenotype dataset spanning five decades, to disentangle phenotypic variance into environmental and genetic effects while quantifying uncertainty in small populations. We find associations for growth, disease susceptibility, clutch size and egg fertility within genic regions previously shown to influence these traits in other species. Finally, we generate breeding values to predict phenotype and illustrate that active management over the past 45 years has maintained both genome-wide diversity and diversity in breeding values and, hence, evolutionary potential. We provide new pathways for informing future conservation management decisions for kākāpō, including prioritizing individuals for translocation and monitoring individuals with poor growth or high disease risk. Overall, by explicitly addressing the challenge of the small sample size, we provide a template for the inclusion of genomic data that will be transformational for species recovery efforts around the globe.
Affiliation(s)
- Joseph Guhlin
  - Genomics Aotearoa, Biochemistry Department, School of Biomedical Sciences, University of Otago, Dunedin, Aotearoa New Zealand
- Marissa F Le Lec
  - Genomics Aotearoa, Biochemistry Department, School of Biomedical Sciences, University of Otago, Dunedin, Aotearoa New Zealand
- Jana Wold
  - School of Biological Sciences, University of Canterbury, Christchurch, Aotearoa New Zealand
- Emily Koot
  - The New Zealand Institute for Plant and Food Research Ltd, Palmerston North, Aotearoa New Zealand
- David Winter
  - School of Natural Sciences, Massey University, Palmerston North, Aotearoa New Zealand
- Patrick J Biggs
  - School of Natural Sciences, Massey University, Palmerston North, Aotearoa New Zealand
  - School of Veterinary Science, Massey University, Palmerston North, Aotearoa New Zealand
- Stephanie J Galla
  - School of Biological Sciences, University of Canterbury, Christchurch, Aotearoa New Zealand
  - Department of Biological Sciences, Boise State University, Boise, ID, USA
- Lara Urban
  - Department of Anatomy, School of Biomedical Sciences, University of Otago, Dunedin, Aotearoa New Zealand
  - Helmholtz Pioneer Campus, Helmholtz Zentrum Muenchen, Neuherberg, Germany
  - Helmholtz AI, Helmholtz Zentrum Muenchen, Neuherberg, Germany
  - School of Life Sciences, Technical University of Munich, Freising, Germany
- Yasmin Foster
  - Department of Zoology, University of Otago, Dunedin, Aotearoa New Zealand
- Murray P Cox
  - School of Natural Sciences, Massey University, Palmerston North, Aotearoa New Zealand
  - Department of Statistics, University of Auckland, Auckland, Aotearoa New Zealand
- Andrew Digby
  - Kākāpō Recovery Programme, Department of Conservation, Invercargill, Aotearoa New Zealand
- Lydia R Uddstrom
  - Kākāpō Recovery Programme, Department of Conservation, Invercargill, Aotearoa New Zealand
- Daryl Eason
  - Kākāpō Recovery Programme, Department of Conservation, Invercargill, Aotearoa New Zealand
- Deidre Vercoe
  - Kākāpō Recovery Programme, Department of Conservation, Invercargill, Aotearoa New Zealand
- Tāne Davis
  - Rakiura Tītī Islands Administering Body, Invercargill, Aotearoa New Zealand
- Jason T Howard
  - Neurogenetics of Language Lab, The Rockefeller University, New York, NY, USA
  - Mirxes, Cambridge, MA, USA
- Erich D Jarvis
  - The Rockefeller University, New York, NY, USA
  - Howard Hughes Medical Institute, Chevy Chase, MD, USA
- Fiona E Robertson
  - Department of Zoology, University of Otago, Dunedin, Aotearoa New Zealand
- Bruce C Robertson
  - Department of Zoology, University of Otago, Dunedin, Aotearoa New Zealand
- Neil J Gemmell
  - Department of Anatomy, School of Biomedical Sciences, University of Otago, Dunedin, Aotearoa New Zealand
- Tammy E Steeves
  - School of Biological Sciences, University of Canterbury, Christchurch, Aotearoa New Zealand
- Anna W Santure
  - School of Biological Sciences, University of Auckland, Auckland, Aotearoa New Zealand
- Peter K Dearden
  - Genomics Aotearoa, Biochemistry Department, School of Biomedical Sciences, University of Otago, Dunedin, Aotearoa New Zealand
|
11
|
de Jong TV, Pan Y, Rastas P, Munro D, Tutaj M, Akil H, Benner C, Chen D, Chitre AS, Chow W, Colonna V, Dalgard CL, Demos WM, Doris PA, Garrison E, Geurts AM, Gunturkun HM, Guryev V, Hourlier T, Howe K, Huang J, Kalbfleisch T, Kim P, Li L, Mahaffey S, Martin FJ, Mohammadi P, Ozel AB, Polesskaya O, Pravenec M, Prins P, Sebat J, Smith JR, Solberg Woods LC, Tabakoff B, Tracey A, Uliano-Silva M, Villani F, Wang H, Sharp BM, Telese F, Jiang Z, Saba L, Wang X, Murphy TD, Palmer AA, Kwitek AE, Dwinell MR, Williams RW, Li JZ, Chen H. A revamped rat reference genome improves the discovery of genetic diversity in laboratory rats. bioRxiv 2023:2023.04.13.536694. [PMID: 37214860 PMCID: PMC10197727 DOI: 10.1101/2023.04.13.536694] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 05/24/2023]
Abstract
The seventh iteration of the reference genome assembly for Rattus norvegicus, mRatBN7.2, corrects numerous misplaced segments, reduces base-level errors approximately 9-fold, and increases contiguity 290-fold compared to its predecessor. Gene annotations are now more complete, significantly improving the mapping precision of genomic, transcriptomic, and proteomic data sets. We jointly analyzed 163 short-read whole genome sequencing datasets representing 120 laboratory rat strains and substrains using mRatBN7.2. We defined ~20.0 million sequence variations, of which 18.7 thousand are predicted to potentially impact the function of 6,677 genes. We also generated a new rat genetic map from 1,893 heterogeneous stock rats and annotated transcription start sites and alternative polyadenylation sites. The mRatBN7.2 assembly, along with the extensive analysis of genomic variations among rat strains, enhances our understanding of the rat genome, providing researchers with an expanded resource for studies involving rats.
Affiliation(s)
- Tristan V de Jong
  - Department of Pharmacology, Addiction Science, and Toxicology, University of Tennessee Health Science Center, Memphis, TN, USA
- Yanchao Pan
  - Department of Human Genetics, University of Michigan, Ann Arbor, MI, USA
- Pasi Rastas
  - Institute of Biotechnology, University of Helsinki, Helsinki, Finland
- Daniel Munro
  - Department of Psychiatry, University of California San Diego, San Diego, CA, USA
  - Department of Integrative Structural and Computational Biology, Scripps Research, San Diego, CA, USA
- Monika Tutaj
  - Department of Physiology, Medical College of Wisconsin, Milwaukee, WI, USA
  - Rat Genome Database, Medical College of Wisconsin, Milwaukee, WI, USA
- Huda Akil
  - Michigan Neuroscience Institute, University of Michigan, Ann Arbor, MI, USA
- Chris Benner
  - Department of Medicine, University of California San Diego, San Diego, CA, USA
- Denghui Chen
  - Department of Psychiatry, University of California San Diego, San Diego, CA, USA
- Apurva S Chitre
  - Department of Psychiatry, University of California San Diego, San Diego, CA, USA
- William Chow
  - Tree of Life, Wellcome Sanger Institute, Cambridge, UK
- Vincenza Colonna
  - Institute of Genetics and Biophysics, National Research Council, Naples, Italy
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Clifton L Dalgard
  - Department of Anatomy, Physiology & Genetics; The American Genome Center, Uniformed Services University of the Health Sciences, Washington DC, USA
- Wendy M Demos
  - Department of Physiology, Medical College of Wisconsin, Milwaukee, WI, USA
  - Rat Genome Database, Medical College of Wisconsin, Milwaukee, WI, USA
- Peter A Doris
  - The Brown Foundation Institute of Molecular Medicine, Center For Human Genetics, University of Texas Health Science Center, Houston, TX, USA
- Erik Garrison
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Aron M Geurts
  - Department of Physiology, Medical College of Wisconsin, Milwaukee, WI, USA
- Hakan M Gunturkun
  - Department of Pharmacology, Addiction Science, and Toxicology, University of Tennessee Health Science Center, Memphis, TN, USA
- Victor Guryev
  - Genome Structure and Ageing, University of Groningen, UMC Groningen, The Netherlands
- Thibaut Hourlier
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus in Hinxton, Cambridgeshire, UK
- Kerstin Howe
  - Tree of Life, Wellcome Sanger Institute, Cambridge, UK
- Jun Huang
  - Department of Pharmacology, Addiction Science, and Toxicology, University of Tennessee Health Science Center, Memphis, TN, USA
- Ted Kalbfleisch
  - Gluck Equine Research Center, Department of Veterinary Science, University of Kentucky, Louisville, KY, USA
- Panjun Kim
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Ling Li
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
  - Center for Proteomics and Metabolomics, St. Jude Children’s Research Hospital, Memphis, TN, USA
- Spencer Mahaffey
  - Department of Pharmaceutical Sciences, Skaggs School of Pharmacy and Pharmaceutical Sciences, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Fergal J Martin
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus in Hinxton, Cambridgeshire, UK
- Pejman Mohammadi
  - Center for Immunity and Immunotherapies, Seattle Children’s Research Institute, Seattle, WA, USA
  - Department of Pediatrics, University of Washington School of Medicine, Seattle, WA, USA
- Ayse Bilge Ozel
  - Department of Human Genetics, University of Michigan, Ann Arbor, MI, USA
- Oksana Polesskaya
  - Department of Psychiatry, University of California San Diego, San Diego, CA, USA
- Michal Pravenec
  - Institute of Physiology, Czech Academy of Sciences, Prague, Czechia
- Pjotr Prins
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Jonathan Sebat
  - Department of Psychiatry, University of California San Diego, San Diego, CA, USA
- Jennifer R Smith
  - Department of Physiology, Medical College of Wisconsin, Milwaukee, WI, USA
  - Rat Genome Database, Medical College of Wisconsin, Milwaukee, WI, USA
- Leah C Solberg Woods
  - Department of Internal Medicine, Section on Molecular Medicine, Wake Forest University School of Medicine, Winston-Salem, NC, USA
- Boris Tabakoff
  - Department of Pharmaceutical Sciences, Skaggs School of Pharmacy and Pharmaceutical Sciences, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Alan Tracey
  - Tree of Life, Wellcome Sanger Institute, Cambridge, UK
- Flavia Villani
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Hongyang Wang
  - Department of Animal Sciences, Washington State University, Pullman, WA, USA
- Burt M Sharp
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Francesca Telese
  - Department of Psychiatry, University of California San Diego, San Diego, CA, USA
- Zhihua Jiang
  - Department of Animal Sciences, Washington State University, Pullman, WA, USA
- Laura Saba
  - Department of Pharmaceutical Sciences, Skaggs School of Pharmacy and Pharmaceutical Sciences, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Xusheng Wang
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
  - Center for Proteomics and Metabolomics, St. Jude Children’s Research Hospital, Memphis, TN, USA
- Terence D Murphy
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Abraham A Palmer
  - Department of Psychiatry, University of California San Diego, San Diego, CA, USA
  - Institute for Genomic Medicine, University of California San Diego, La Jolla, CA, USA
- Anne E Kwitek
  - Department of Physiology, Medical College of Wisconsin, Milwaukee, WI, USA
  - Rat Genome Database, Medical College of Wisconsin, Milwaukee, WI, USA
- Melinda R Dwinell
  - Department of Physiology, Medical College of Wisconsin, Milwaukee, WI, USA
  - Rat Genome Database, Medical College of Wisconsin, Milwaukee, WI, USA
- Robert W Williams
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Jun Z Li
  - Department of Human Genetics, University of Michigan, Ann Arbor, MI, USA
- Hao Chen
  - Department of Pharmacology, Addiction Science, and Toxicology, University of Tennessee Health Science Center, Memphis, TN, USA
|
12
|
Chen NC, Kolesnikov A, Goel S, Yun T, Chang PC, Carroll A. Improving variant calling using population data and deep learning. BMC Bioinformatics 2023; 24:197. [PMID: 37173615 PMCID: PMC10182612 DOI: 10.1186/s12859-023-05294-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/30/2022] [Accepted: 04/17/2023] [Indexed: 05/15/2023] Open
Abstract
Large-scale population variant data is often used to filter and aid interpretation of variant calls in a single sample. These approaches do not incorporate population information directly into the process of variant calling, and are often limited to filtering, which trades recall for precision. In this study, we develop population-aware DeepVariant models with a new channel encoding allele frequencies from the 1000 Genomes Project. This model reduces variant calling errors, improving both precision and recall in single samples, and reduces rare homozygous and pathogenic ClinVar calls cohort-wide. We assess the use of population-specific or diverse reference panels, finding the greatest accuracy with diverse panels, suggesting that large, diverse panels are preferable to individual populations, even when the population matches sample ancestry. Finally, we show that this benefit generalizes to samples with different ancestry from the training data even when the ancestry is also excluded from the reference panel.
Affiliation(s)
- Nae-Chyun Chen
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD, 21218, USA
|
13
|
Park H, Gim J. A comparative investigation of variant calling and genotyping for a single non-Caucasian whole genome. Research Square 2023:rs.3.rs-2580940. [PMID: 36945432 PMCID: PMC10029055 DOI: 10.21203/rs.3.rs-2580940/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/08/2023]
Abstract
Most genome benchmark studies use hg38 as the reference genome (based on Caucasian and African samples) and 'NA12878' (sequencing reads from a Caucasian individual) for comparison. Here, we aimed to elucidate whether 1) an ethnic match or mismatch between the reference genome and sequencing reads produces a distinct result and 2) there is an optimal workflow for single-genome data. We assessed the performance of variant calling pipelines using two reference genomes (hg38 and a Korean genome) and two whole-genome sequencing (WGS) read sets from different ethnic origins: Caucasian (NA12878) and Korean. The pipelines used BWA-mem and Novoalign as mapping tools and GATK4, Strelka2, DeepVariant, and Samtools as variant callers. Using hg38 led to better performance (based on precision and recall), regardless of the ethnic origin of the WGS reads. Novoalign + GATK4 demonstrated the best performance with both WGS datasets. We assessed pipeline efficiency by removing the markduplicate process, and all pipelines, except Novoalign + DeepVariant, maintained their performance. Novoalign identified more variants overall and in the MHC region of chr6 when combined with GATK4. No evidence suggested that variant calling from single WGS reads improves with a different ethnic reference, re-validating the utility of hg38. We recommend using Novoalign + GATK4 without markduplication for single PCR-free WGS data.
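The comparison described above is a grid of mappers, callers, reference genomes, and read sets. A minimal sketch of how such a grid could be enumerated follows; it assumes a full factorial design (every mapper paired with every caller, run on both references and both read sets), which the abstract implies but does not state outright. The variable names and the label "Korean reference" are illustrative placeholders.

```python
from itertools import product

# Tools and inputs named in the abstract; the labels are plain strings,
# not executable commands.
MAPPERS = ["BWA-mem", "Novoalign"]
CALLERS = ["GATK4", "Strelka2", "DeepVariant", "Samtools"]
REFERENCES = ["hg38", "Korean reference"]
READ_SETS = ["NA12878", "Korean"]

# Each pipeline is one mapper paired with one variant caller.
pipelines = [f"{mapper} + {caller}" for mapper, caller in product(MAPPERS, CALLERS)]

# Each run applies one pipeline to one read set aligned to one reference.
runs = [
    {"pipeline": pipeline, "reference": reference, "reads": reads}
    for pipeline, reference, reads in product(pipelines, REFERENCES, READ_SETS)
]

# The reference/read-set pairs that count as an ethnic "match" in this sketch.
matched = [
    run for run in runs
    if (run["reference"], run["reads"]) in {("hg38", "NA12878"), ("Korean reference", "Korean")}
]
```

Under the full factorial assumption this gives 8 pipelines and 32 runs, half of them with a matched reference and read set.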
|
14
|
Abstract
Advancements in high-throughput sequencing have yielded vast amounts of genomic data, which are studied using genome-wide association study (GWAS)/phenome-wide association study (PheWAS) methods to identify associations between the genotype and phenotype. The associated findings have contributed to pharmacogenomics and improved clinical decision support at the point of care in many healthcare systems. However, the accumulation of genomic data from sequencing and clinical data from electronic health records (EHRs) poses significant challenges for data scientists. Following the rise of artificial intelligence (AI) technology such as machine learning and deep learning, an increasing number of GWAS/PheWAS studies have successfully leveraged this technology to overcome the aforementioned challenges. In this review, we focus on the application of data science and AI technology in three areas, including risk prediction and identification of causal single-nucleotide polymorphisms, EHR-based phenotyping and CRISPR guide RNA design. Additionally, we highlight a few emerging AI technologies, such as transfer learning and multi-view learning, which will or have started to benefit genomic studies.
Affiliation(s)
- Jing Lin
  - NUHS Corporate Office, National University Health System, Singapore
- Kee Yuan Ngiam
  - NUHS Corporate Office, National University Health System, Singapore
  - Department of Surgery, National University of Singapore, Singapore
  - Correspondence: A/Prof Kee Yuan Ngiam, Group Chief Technology Officer, NUHS Corporate Office, National University Health System, 1E Kent Ridge Road, 119228, Singapore
|
15
|
Cai Y, Chen R, Gao S, Li W, Liu Y, Su G, Song M, Jiang M, Jiang C, Zhang X. Artificial intelligence applied in neoantigen identification facilitates personalized cancer immunotherapy. Front Oncol 2023; 12:1054231. [PMID: 36698417 PMCID: PMC9868469 DOI: 10.3389/fonc.2022.1054231] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 09/26/2022] [Accepted: 12/16/2022] [Indexed: 01/10/2023] Open
Abstract
The field of cancer neoantigen investigation has developed swiftly in the past decade. Predicting novel and true neoantigens from large multi-omics data has become a difficult but critical challenge. The rise of Artificial Intelligence (AI) and Machine Learning (ML) in biomedical applications has strengthened the current computational pipelines for neoantigen prediction. ML algorithms offer powerful tools to recognize the multidimensional nature of omics data and thereby extract the key neoantigen features that enable the successful discovery of new neoantigens. The present review outlines the significant technological progress of machine learning approaches, especially the newly developed deep learning tools and pipelines, that were recently applied in neoantigen prediction, and summarizes the current state-of-the-art tools developed to predict neoantigens. The standard workflow includes calling genetic variants in paired tumor and blood samples and rating the binding affinity between mutated peptide, MHC (I and II), and T cell receptor (TCR), followed by characterizing the immunogenicity of tumor epitopes. More specifically, we highlight the outstanding feature extraction tools and multi-layer neural network architectures in typical ML models. Notably, more integrated neoantigen-predicting pipelines are now constructed with hybrid or combined ML algorithms instead of conventional machine learning models. In addition, the trends and challenges in further optimizing and integrating the existing pipelines are discussed.
Affiliation(s)
- Yu Cai
  - School of Medicine, Northwest University, Xi’an, Shaanxi, China
- Rui Chen
  - School of Medicine, Northwest University, Xi’an, Shaanxi, China
- Shenghan Gao
  - School of Medicine, Northwest University, Xi’an, Shaanxi, China
- Wenqing Li
  - School of Medicine, Northwest University, Xi’an, Shaanxi, China
- Yuru Liu
  - School of Medicine, Northwest University, Xi’an, Shaanxi, China
- Guodong Su
  - School of Medicine, Northwest University, Xi’an, Shaanxi, China
- Mingming Song
  - School of Medicine, Northwest University, Xi’an, Shaanxi, China
- Mengju Jiang
  - School of Medicine, Northwest University, Xi’an, Shaanxi, China
- Chao Jiang
  - Department of Neurology, The Second Affiliated Hospital of Xi’an Medical University, Xi’an, Shaanxi, China
- Xi Zhang
  - School of Medicine, Northwest University, Xi’an, Shaanxi, China
- Correspondence: Chao Jiang; Xi Zhang
|
16
|
Betschart RO, Thiéry A, Aguilera-Garcia D, Zoche M, Moch H, Twerenbold R, Zeller T, Blankenberg S, Ziegler A. Comparison of calling pipelines for whole genome sequencing: an empirical study demonstrating the importance of mapping and alignment. Sci Rep 2022; 12:21502. [PMID: 36513709 PMCID: PMC9748128 DOI: 10.1038/s41598-022-26181-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/06/2022] [Accepted: 12/12/2022] [Indexed: 12/14/2022] Open
Abstract
Rapid advances in high-throughput DNA sequencing technologies have enabled whole genome sequencing (WGS) studies, and several bioinformatics pipelines have become available. The aim of this study was to compare 6 WGS data pre-processing pipelines, involving two mapping and alignment approaches (GATK utilizing BWA-MEM2 2.2.1, and DRAGEN 3.8.4) and three variant calling pipelines (GATK 4.2.4.1, DRAGEN 3.8.4 and DeepVariant 1.1.0). We sequenced one Genome in a Bottle (GIAB) sample 70 times in different runs, and one GIAB trio in triplicate. The GIAB truth sets were used for comparison, and performance was assessed by computation time, F1 score, precision, and recall. In the mapping and alignment step, the DRAGEN pipeline was faster than the GATK with BWA-MEM2 pipeline. DRAGEN showed systematically higher F1 score, precision, and recall values than GATK for single nucleotide variations (SNVs) and Indels in simple-to-map, complex-to-map, coding and non-coding regions. In the variant calling step, DRAGEN was fastest. In terms of accuracy, DRAGEN and DeepVariant performed similarly and both were superior to GATK, with slight advantages for DRAGEN for Indels and for DeepVariant for SNVs. The DRAGEN pipeline showed the lowest Mendelian inheritance error fraction for the GIAB trios. Mapping and alignment played a key role in variant calling of WGS, with DRAGEN outperforming GATK.
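Precision, recall, and F1 score against a truth set are the accuracy measures used in this comparison. The following is a minimal sketch of how the three relate, assuming variants are compared by exact match on (chromosome, position, reference allele, alternate allele). The function name `score_calls` and the toy variants are hypothetical; dedicated benchmarking tools also normalize variant representation and compare genotypes, which this sketch does not attempt.

```python
from typing import NamedTuple, Set, Tuple

# A variant is identified here by (chrom, pos, ref, alt); this exact-match
# key is a simplifying assumption.
Variant = Tuple[str, int, str, str]


class Scores(NamedTuple):
    precision: float
    recall: float
    f1: float


def score_calls(calls: Set[Variant], truth: Set[Variant]) -> Scores:
    """Compare a call set with a truth set and return precision, recall, F1."""
    true_pos = len(calls & truth)   # called and present in the truth set
    false_pos = len(calls - truth)  # called but absent from the truth set
    false_neg = len(truth - calls)  # in the truth set but not called

    precision = true_pos / (true_pos + false_pos) if calls else 0.0
    recall = true_pos / (true_pos + false_neg) if truth else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return Scores(precision, recall, f1)


# Toy data, made up for illustration only.
truth = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
         ("chr2", 50, "G", "GA"), ("chr2", 75, "T", "C")}
calls = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
         ("chr2", 50, "G", "GA"), ("chr3", 10, "A", "T")}
scores = score_calls(calls, truth)
```

In the toy data three of four calls are correct and three of four true variants are recovered, so precision and recall coincide; in practice the two usually differ, which is why F1 is reported alongside them.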
Affiliation(s)
- Raphael O. Betschart
  - Cardio-CARE, Medizincampus Davos, Herman-Burchard-Str. 1, 7265 Davos Wolfgang, Switzerland
- Alexandre Thiéry
  - Cardio-CARE, Medizincampus Davos, Herman-Burchard-Str. 1, 7265 Davos Wolfgang, Switzerland
- Domingo Aguilera-Garcia
  - Institute of Pathology and Molecular Pathology, University Hospital Zurich, Schmelzbergstrasse 12, 8091 Zurich, Switzerland
- Martin Zoche
  - Institute of Pathology and Molecular Pathology, University Hospital Zurich, Schmelzbergstrasse 12, 8091 Zurich, Switzerland
- Holger Moch
  - Institute of Pathology and Molecular Pathology, University Hospital Zurich, Schmelzbergstrasse 12, 8091 Zurich, Switzerland
- Raphael Twerenbold
  - Department of Cardiology, University Heart & Vascular Center, University Medical Center Hamburg Eppendorf, Martinistr. 52, 20251 Hamburg, Germany
  - University Center of Cardiovascular Research Hamburg, University Medical Center Hamburg Eppendorf, Martinistr. 52, 20251 Hamburg, Germany
  - German Center for Cardiovascular Research (DZHK), Partner Site Hamburg/Kiel/Lübeck, Hamburg, Germany
- Tanja Zeller
  - Department of Cardiology, University Heart & Vascular Center, University Medical Center Hamburg Eppendorf, Martinistr. 52, 20251 Hamburg, Germany
  - University Center of Cardiovascular Research Hamburg, University Medical Center Hamburg Eppendorf, Martinistr. 52, 20251 Hamburg, Germany
  - German Center for Cardiovascular Research (DZHK), Partner Site Hamburg/Kiel/Lübeck, Hamburg, Germany
- Stefan Blankenberg
  - Cardio-CARE, Medizincampus Davos, Herman-Burchard-Str. 1, 7265 Davos Wolfgang, Switzerland
  - Department of Cardiology, University Heart & Vascular Center, University Medical Center Hamburg Eppendorf, Martinistr. 52, 20251 Hamburg, Germany
  - University Center of Cardiovascular Research Hamburg, University Medical Center Hamburg Eppendorf, Martinistr. 52, 20251 Hamburg, Germany
  - German Center for Cardiovascular Research (DZHK), Partner Site Hamburg/Kiel/Lübeck, Hamburg, Germany
- Andreas Ziegler
  - Cardio-CARE, Medizincampus Davos, Herman-Burchard-Str. 1, 7265 Davos Wolfgang, Switzerland
  - Department of Cardiology, University Heart & Vascular Center, University Medical Center Hamburg Eppendorf, Martinistr. 52, 20251 Hamburg, Germany
  - School Mathematics, Statistics and Computer Science, Scottsville, Private Bag X01, Pietermaritzburg, 3209 South Africa
|
17
|
Woerner AE, Mandape S, Kapema KB, Duque TM, Smuts A, King JL, Crysup B, Wang X, Huang M, Ge J, Budowle B. Optimized variant calling for estimating kinship. Forensic Sci Int Genet 2022; 61:102785. [DOI: 10.1016/j.fsigen.2022.102785] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/14/2022] [Revised: 08/07/2022] [Accepted: 09/29/2022] [Indexed: 11/16/2022]
|
18
|
Malcher A, Stokowy T, Berman A, Olszewska M, Jedrzejczak P, Sielski D, Nowakowski A, Rozwadowska N, Yatsenko AN, Kurpisz MK. Whole-genome sequencing identifies new candidate genes for nonobstructive azoospermia. Andrology 2022; 10:1605-1624. [PMID: 36017582 PMCID: PMC9826517 DOI: 10.1111/andr.13269] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 02/04/2022] [Revised: 06/21/2022] [Accepted: 08/17/2022] [Indexed: 01/11/2023]
Abstract
BACKGROUND Genetic causes that lead to spermatogenetic failure in patients with nonobstructive azoospermia (NOA) have not yet been completely established. OBJECTIVE To identify low-frequency NOA-associated single nucleotide variants (SNVs) using whole-genome sequencing (WGS). MATERIALS AND METHODS Men with various types of NOA (n = 39) were studied, including samples that had previously been tested with whole-exome sequencing (WES; n = 6) without reaching diagnostic conclusions. Variants were annotated using the Ensembl Variant Effect Predictor, utilizing frequencies from GnomAD and other databases to provide clinically relevant information (ClinVar), conservation scores (phyloP), and effect predictions (i.e., MutationTaster). Structural protein modeling was also performed. RESULTS Using WGS, we revealed potential NOA-associated SNVs in genes such as: TKTL1, IGSF1, ZFPM2, VCX3A (novel disease causing variants), ESX1, TEX13A, TEX14, DNAH1, FANCM, QRICH2, FSIP2, USP9Y, PMFBP1, MEI1, PIWIL1, WDR66, ZFX, KCND1, KIAA1210, DHRSX, ZMYM3, FAM47C, FANCB, FAM50B (genes previously known to be associated with infertility) and ALG13, BEND2, BRWD3, DDX53, TAF4, FAM47B, FAM9B, FAM9C, MAGEB6, MAP3K15, RBMXL3, SSX3 and FMR1NB genes, which may be involved in spermatogenesis. DISCUSSION AND CONCLUSION In this study, we identified novel potential candidate NOA-associated genes in 29 individuals out of 39 azoospermic males. Note that in 5 out of 6 patients previously subjected to WES analysis, which did not disclose potentially causative variants, the WGS analysis was successful with NOA-associated gene findings.
Affiliation(s)
| | - Tomasz Stokowy
- Scientific Computing Group, IT Division, University of Bergen, Norway
| | - Andrea Berman
- Department of Biological Sciences, University of Pittsburgh, Pittsburgh, Pennsylvania, USA
| | - Marta Olszewska
- Institute of Human Genetics, Polish Academy of Sciences, Poznan, Poland
| | - Piotr Jedrzejczak
- Division of Infertility and Reproductive Endocrinology, Department of Gynecology, Obstetrics and Gynecological Oncology, Poznan University of Medical Sciences, Poznan, Poland
| | | | - Adam Nowakowski
- Department of Urology and Urologic Oncology in St. Families Hospital, Poznan, Poland
| | | | - Alexander N. Yatsenko
- Department of OB/GYN and Reproductive Sciences, School of Medicine, University of Pittsburgh, Pittsburgh, Pennsylvania, USA
|
19
|
Evaluation of the Available Variant Calling Tools for Oxford Nanopore Sequencing in Breast Cancer. Genes (Basel) 2022; 13:genes13091583. [PMID: 36140751 PMCID: PMC9498802 DOI: 10.3390/genes13091583] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/10/2022] [Revised: 08/30/2022] [Accepted: 08/31/2022] [Indexed: 11/23/2022] Open
Abstract
The goal of biomarker testing, in the field of personalized medicine, is to guide treatments to achieve the best possible results for each patient. The accurate and reliable identification of each individual's genome variants is essential for the success of clinical genomics employing third-generation sequencing. Different variant calling techniques have been used and recommended by both Oxford Nanopore Technologies (ONT) and Nanopore communities. A thorough examination of the variant callers might give critical guidance for third-generation sequencing-based clinical genomics. In this study, two reference genome sample datasets (NA12878 and NA24385) and the set of high-confidence variant calls provided by the Genome in a Bottle (GIAB) consortium were used to evaluate the performance of six variant calling tools, including Human-SNP-wf, Clair3, Clair, NanoCaller, Longshot, and Medaka, as an integral step in the in-house variant detection workflow. Of the six variant callers under study, Clair3 and Human-SNP-wf, which incorporates Clair3, achieved the highest performance rates in comparison to the other variant callers. The results for each tool were expressed in terms of Precision, Recall, and F1-score, using the hap.py tools for the comparison. In conclusion, our findings give important insights for identifying accurate variants from third-generation sequencing of personal genomes using different variant detection tools available for long-read sequencing.
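The three metrics named in this abstract can be illustrated with a minimal Python sketch. It assumes that variants are compared by exact match on a hypothetical `(chrom, pos, ref, alt)` tuple; hap.py itself performs a more careful haplotype-aware comparison, so this only shows how the reported numbers are defined, not how the authors computed them.

```python
from typing import Set, Tuple

# Hypothetical variant key: (chromosome, 1-based position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


def precision_recall_f1(truth: Set[Variant], calls: Set[Variant]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for a call set against a truth set.

    Exact-match comparison is a simplifying assumption; real benchmarking
    tools also normalise variant representation before comparing.
    """
    tp = len(truth & calls)   # called and present in the truth set
    fp = len(calls - truth)   # called but absent from the truth set
    fn = len(truth - calls)   # in the truth set but not called
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    truth = {("1", 100, "A", "G"), ("1", 200, "C", "T"),
             ("2", 50, "G", "A"), ("2", 75, "T", "C")}
    calls = {("1", 100, "A", "G"), ("1", 200, "C", "T"),
             ("2", 50, "G", "A"), ("3", 10, "A", "T")}
    print(precision_recall_f1(truth, calls))
```

With three shared variants, one false positive and one false negative, all three metrics in the toy example come out at 0.75.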
|
20
|
Borden ES, Buetow KH, Wilson MA, Hastings KT. Cancer Neoantigens: Challenges and Future Directions for Prediction, Prioritization, and Validation. Front Oncol 2022; 12:836821. [PMID: 35311072 PMCID: PMC8929516 DOI: 10.3389/fonc.2022.836821] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 12/15/2021] [Accepted: 02/07/2022] [Indexed: 12/16/2022] Open
Abstract
Prioritization of immunogenic neoantigens is key to enhancing cancer immunotherapy through the development of personalized vaccines, adoptive T cell therapy, and the prediction of response to immune checkpoint inhibition. Neoantigens are tumor-specific proteins that allow the immune system to recognize and destroy a tumor. Cancer immunotherapies, such as personalized cancer vaccines, adoptive T cell therapy, and immune checkpoint inhibition, rely on an understanding of the patient-specific neoantigen profile in order to guide personalized therapeutic strategies. Genomic approaches to predicting and prioritizing immunogenic neoantigens are rapidly expanding, raising new opportunities to advance these tools and enhance their clinical relevance. Predicting neoantigens requires acquisition of high-quality samples and sequencing data, followed by variant calling and variant annotation. Subsequently, prioritizing which of these neoantigens may elicit a tumor-specific immune response requires application and integration of tools to predict the expression, processing, binding, and recognition potentials of the neoantigen. Finally, improvement of the computational tools is held in constant tension with the availability of datasets with validated immunogenic neoantigens. The goal of this review article is to summarize the current knowledge and limitations in neoantigen prediction, prioritization, and validation and propose future directions that will improve personalized cancer treatment.
Affiliation(s)
- Elizabeth S Borden
- Department of Basic Medical Sciences, College of Medicine-Phoenix, University of Arizona, Phoenix, AZ, United States; Department of Research and Internal Medicine (Dermatology), Phoenix Veterans Affairs Health Care System, Phoenix, AZ, United States
| | - Kenneth H Buetow
- School of Life Sciences, Arizona State University, Tempe, AZ, United States; Center for Evolution and Medicine, Arizona State University, Tempe, AZ, United States
| | - Melissa A Wilson
- School of Life Sciences, Arizona State University, Tempe, AZ, United States; Center for Evolution and Medicine, Arizona State University, Tempe, AZ, United States
| | - Karen Taraszka Hastings
- Department of Basic Medical Sciences, College of Medicine-Phoenix, University of Arizona, Phoenix, AZ, United States; Department of Research and Internal Medicine (Dermatology), Phoenix Veterans Affairs Health Care System, Phoenix, AZ, United States
|
21
|
Barbitoff YA, Abasov R, Tvorogova VE, Glotov AS, Predeus AV. Systematic benchmark of state-of-the-art variant calling pipelines identifies major factors affecting accuracy of coding sequence variant discovery. BMC Genomics 2022; 23:155. [PMID: 35193511 PMCID: PMC8862519 DOI: 10.1186/s12864-022-08365-3] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Received: 07/29/2021] [Accepted: 02/03/2022] [Indexed: 12/30/2022] Open
Abstract
BACKGROUND Accurate variant detection in the coding regions of the human genome is a key requirement for molecular diagnostics of Mendelian disorders. Efficiency of variant discovery from next-generation sequencing (NGS) data depends on multiple factors, including reproducible coverage biases of NGS methods and the performance of read alignment and variant calling software. Although variant caller benchmarks are published constantly, no previous publications have leveraged the full extent of available gold standard whole-genome (WGS) and whole-exome (WES) sequencing datasets. RESULTS In this work, we systematically evaluated the performance of 4 popular short read aligners (Bowtie2, BWA, Isaac, and Novoalign) and 9 novel and well-established variant calling and filtering methods (Clair3, DeepVariant, Octopus, GATK, FreeBayes, and Strelka2) using a set of 14 "gold standard" WES and WGS datasets available from Genome In A Bottle (GIAB) consortium. Additionally, we have indirectly evaluated each pipeline's performance using a set of 6 non-GIAB samples of African and Russian ethnicity. In our benchmark, Bowtie2 performed significantly worse than other aligners, suggesting it should not be used for medical variant calling. When other aligners were considered, the accuracy of variant discovery mostly depended on the variant caller and not the read aligner. Among the tested variant callers, DeepVariant consistently showed the best performance and the highest robustness. Other actively developed tools, such as Clair3, Octopus, and Strelka2, also performed well, although their efficiency had greater dependence on the quality and type of the input data. We have also compared the consistency of variant calls in GIAB and non-GIAB samples. With few important caveats, best-performing tools have shown little evidence of overfitting. 
CONCLUSIONS The results show surprisingly large differences in the performance of cutting-edge tools even in high confidence regions of the coding genome. This highlights the importance of regular benchmarking of quickly evolving tools and pipelines. We also discuss the need for a more diverse set of gold standard genomes that would include samples of African, Hispanic, or mixed ancestry. Additionally, there is also a need for better variant caller assessment in the repetitive regions of the coding genome.
Affiliation(s)
- Yury A Barbitoff
- Bioinformatics Institute, St. Petersburg, Russia; Department of Genomic Medicine, D.O. Ott Research Institute of Obstetrics, Gynaecology and Reproductology, St. Petersburg, Russia; Department of Genetics and Biotechnology, St. Petersburg State University, St. Petersburg, Russia
| | - Ruslan Abasov
- Bioinformatics Institute, St. Petersburg, Russia; Dmitry Rogachev National Research Center of Pediatric Hematology-Oncology and Immunology, Moscow, Russia
| | - Varvara E Tvorogova
- Bioinformatics Institute, St. Petersburg, Russia; Department of Genetics and Biotechnology, St. Petersburg State University, St. Petersburg, Russia
| | - Andrey S Glotov
- Department of Genomic Medicine, D.O. Ott Research Institute of Obstetrics, Gynaecology and Reproductology, St. Petersburg, Russia
|
22
|
Abstract
While next-generation sequencing (NGS) has transformed genetic testing, it generates large quantities of noisy data that require a significant amount of bioinformatics to generate useful interpretation. The accuracy of variant calling is therefore critical. Although GATK HaplotypeCaller is a widely used tool for this purpose, newer methods such as DeepVariant have shown higher accuracy in assessments of gold-standard samples for whole-genome sequencing (WGS) and whole-exome sequencing (WES), but a side-by-side comparison on clinical samples has not been performed. Trio WES was used to compare GATK (4.1.2.0) HaplotypeCaller and DeepVariant (v0.8.0). The performance of the two pipelines was evaluated according to the Mendelian error rate, transition-to-transversion (Ti/Tv) ratio, concordance rate, and pathological variant detection rate. Data from 80 trios were analyzed. The Mendelian error rate of the 77 biological trios calculated from the data by DeepVariant (3.09 ± 0.83%) was lower than that calculated from the data by GATK (5.25 ± 0.91%) (p < 0.001). DeepVariant also yielded a higher Ti/Tv ratio (2.38 ± 0.02) than GATK (2.04 ± 0.07) (p < 0.001), suggesting that DeepVariant proportionally called more true positives. The concordance rate between the 2 pipelines was 88.73%. Sixty-three disease-causing variants were detected in the 80 trios. Among them, DeepVariant detected 62 variants, and GATK detected 61 variants. The one variant called by DeepVariant but not GATK HaplotypeCaller might have been missed by GATK HaplotypeCaller due to low coverage. OTC exon 2 (139 bp) deletion was not detected by either method. Mendelian error rate calculation is an effective way to evaluate variant callers. By this method, DeepVariant outperformed GATK, while the two pipelines performed equally in other parameters.
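Two of the evaluation metrics mentioned above, the Ti/Tv ratio and the Mendelian error rate, can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes biallelic SNVs given as `(ref, alt)` base pairs, autosomal diploid genotypes given as allele-index pairs, and no missing data. The function names are hypothetical and do not come from GATK or DeepVariant.

```python
from typing import Iterable, Sequence, Tuple

Genotype = Tuple[int, int]  # allele indices, e.g. (0, 1) for a heterozygote


def is_transition(ref: str, alt: str) -> bool:
    """A transition swaps purine for purine (A<->G) or pyrimidine for pyrimidine (C<->T)."""
    pair = {ref.upper(), alt.upper()}
    return pair == {"A", "G"} or pair == {"C", "T"}


def titv_ratio(snvs: Sequence[Tuple[str, str]]) -> float:
    """Ratio of transitions to transversions over a list of (ref, alt) SNVs."""
    ti = sum(1 for ref, alt in snvs if is_transition(ref, alt))
    tv = len(snvs) - ti
    return ti / tv if tv else float("inf")


def is_mendelian_consistent(child: Genotype, father: Genotype, mother: Genotype) -> bool:
    """True if one child allele can come from the father and the other from the mother."""
    a, b = child
    return (a in father and b in mother) or (a in mother and b in father)


def mendelian_error_rate(trio_genotypes: Iterable[Tuple[Genotype, Genotype, Genotype]]) -> float:
    """Fraction of (child, father, mother) genotype triples that violate Mendelian inheritance."""
    triples = list(trio_genotypes)
    if not triples:
        return 0.0
    errors = sum(1 for c, f, m in triples if not is_mendelian_consistent(c, f, m))
    return errors / len(triples)


if __name__ == "__main__":
    # Toy inputs, invented for illustration only.
    print(titv_ratio([("A", "G"), ("C", "T"), ("A", "C")]))
    print(mendelian_error_rate([((0, 1), (0, 0), (0, 0)),
                                ((0, 1), (0, 1), (0, 0))]))
```

In the toy trio data, the first site is an error because the child carries an allele that neither parent has, while the second is consistent, giving an error rate of 0.5. A real analysis would also have to handle sex chromosomes, multiallelic sites and missing genotypes.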
|
23
|
Brady SW, Gout AM, Zhang J. Therapeutic and prognostic insights from the analysis of cancer mutational signatures. Trends Genet 2022; 38:194-208. [PMID: 34483003 PMCID: PMC8752466 DOI: 10.1016/j.tig.2021.08.007] [Citation(s) in RCA: 36] [Impact Index Per Article: 18.0] [Received: 06/15/2021] [Revised: 08/06/2021] [Accepted: 08/11/2021] [Indexed: 02/08/2023]
Abstract
The somatic mutations in each cancer genome are caused by multiple mutational processes, each of which leaves a characteristic imprint (or 'signature'), potentially caused by specific etiologies or exposures. Deconvolution of these signatures offers a glimpse into the evolutionary history of individual tumors. Recent work has shown that mutational signatures may also yield therapeutic and prognostic insights, including the identification of cell-intrinsic signatures as biomarkers of drug response and prognosis. For example, mutational signatures indicating homologous recombination deficiency are associated with poly(ADP)-ribose polymerase (PARP) inhibitor sensitivity, whereas APOBEC-associated signatures are associated with ataxia telangiectasia and Rad3-related kinase (ATR) inhibitor sensitivity. Furthermore, therapy-induced mutational signatures implicated in cancer progression have also been uncovered, including the identification of thiopurine-induced TP53 mutations in leukemia. In this review, we explore the various ways mutational signatures can reveal new therapeutic and prognostic insights, thus extending their traditional role in identifying disease etiology.
Affiliation(s)
- Samuel W Brady
- Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN 38105, USA.
| | - Alexander M Gout
- Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN 38105, USA
| | - Jinghui Zhang
- Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN 38105, USA.
|
24
|
Kelly CJ, Brown APY, Taylor JA. Artificial Intelligence in Pediatrics. Artif Intell Med 2022. [DOI: 10.1007/978-3-030-64573-1_316] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
|
25
|
Bathke J, Lühken G. OVarFlow: a resource optimized GATK 4 based Open source Variant calling workFlow. BMC Bioinformatics 2021; 22:402. [PMID: 34388963 PMCID: PMC8361789 DOI: 10.1186/s12859-021-04317-y] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 06/10/2021] [Accepted: 08/04/2021] [Indexed: 12/30/2022] Open
Abstract
Background The advent of next generation sequencing has opened new avenues for basic and applied research. One application is the discovery of sequence variants causative of a phenotypic trait or a disease pathology. The computational task of detecting and annotating sequence differences between a target dataset and a reference genome is known as "variant calling". Typically, this task is computationally involved, often combining a complex chain of linked software tools. A major player in this field is the Genome Analysis Toolkit (GATK). The "GATK Best Practices" is a commonly referenced recipe for variant calling. However, current computational recommendations on variant calling predominantly focus on human sequencing data and ignore ever-changing demands of high-throughput sequencing developments. Furthermore, frequent updates to such recommendations run counter to the goal of offering a standard workflow and hamper reproducibility over time. Results A workflow for automated detection of single nucleotide polymorphisms and insertion-deletions offers a wide range of applications in sequence annotation of model and non-model organisms. The introduced workflow builds on the GATK Best Practices, while enabling reproducibility over time and offering an open, generalized computational architecture. The workflow achieves parallelized data evaluation and maximizes performance of individual computational tasks. Optimized Java garbage collection and heap size settings for the GATK applications SortSam, MarkDuplicates, HaplotypeCaller, and GatherVcfs effectively cut the overall analysis time in half. Conclusions The demand for variant calling, efficient computational processing, and standardized workflows is growing. The Open source Variant calling workFlow (OVarFlow) offers automation and reproducibility for a computationally optimized variant calling task.
By reducing usage of computational resources, the workflow removes prior existing entry barriers to the variant calling field and enables standardized variant calling.
Affiliation(s)
- Jochen Bathke
- Institute of Animal Breeding and Genetics, Justus Liebig University Gießen, Ludwigstraße 21, 35390, Gießen, Germany.
| | - Gesine Lühken
- Institute of Animal Breeding and Genetics, Justus Liebig University Gießen, Ludwigstraße 21, 35390, Gießen, Germany
|
26
|
de Jong TV, Kim P, Guryev V, Mulligan MK, Williams RW, Redei EE, Chen H. Whole genome sequencing of nearly isogenic WMI and WLI inbred rats identifies genes potentially involved in depression and stress reactivity. Sci Rep 2021; 11:14774. [PMID: 34285244 PMCID: PMC8292482 DOI: 10.1038/s41598-021-92993-4] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 02/22/2021] [Accepted: 06/17/2021] [Indexed: 02/06/2023] Open
Abstract
The WMI and WLI inbred rats were generated from the stress-prone, and not yet fully inbred, Wistar Kyoto (WKY) strain. These were selected using bi-directional selection for immobility in the forced swim test and were then sib-mated for over 38 generations. Despite the low level of genetic diversity among WKY progenitors, the WMI substrain is significantly more vulnerable to stress relative to the counter-selected WLI strain. Here we quantify numbers and classes of genomic sequence variants distinguishing these substrains with the long-term goal of uncovering functional and behavioral polymorphisms that modulate sensitivity to stress and depression-like phenotypes. DNA from WLI and WMI was sequenced using Illumina xTen, IonTorrent, and 10X Chromium linked-read platforms to obtain a combined coverage of ~100X for each strain. We identified 4,296 high quality homozygous SNPs and indels between the WMI and WLI. We detected high impact variants in genes previously implicated in depression (e.g. Gnat2), depression-like behavior (e.g. Prlr, Nlrp1a), other psychiatric disease (e.g. Pou6f2, Kdm5a, Reep3, Wdfy3), and responses to psychological stressors (e.g. Pigr). High coverage sequencing data confirm that the two substrains are nearly coisogenic. Nonetheless, the small number of sequence variants contributes to numerous well characterized differences including depression-like behavior, stress reactivity, and addiction related phenotypes. These selected substrains are an ideal resource for forward and reverse genetic studies using a reduced complexity cross.
Affiliation(s)
| | - Panjun Kim
- University of Tennessee Health Science Center, Memphis, TN, USA
| | - Victor Guryev
- European Research Institute for the Biology of Ageing, University of Groningen, Groningen, The Netherlands
| | | | | | - Eva E Redei
- Northwestern University - Chicago, Chicago, IL, USA
| | - Hao Chen
- University of Tennessee Health Science Center, Memphis, TN, USA.
|
27
|
Li H, Dawood M, Khayat MM, Farek JR, Jhangiani SN, Khan ZM, Mitani T, Coban-Akdemir Z, Lupski JR, Venner E, Posey JE, Sabo A, Gibbs RA. Exome variant discrepancies due to reference-genome differences. Am J Hum Genet 2021; 108:1239-1250. [PMID: 34129815 PMCID: PMC8322936 DOI: 10.1016/j.ajhg.2021.05.011] [Citation(s) in RCA: 29] [Impact Index Per Article: 9.7] [Received: 01/29/2021] [Accepted: 05/19/2021] [Indexed: 12/15/2022] Open
Abstract
Despite release of the GRCh38 human reference genome more than seven years ago, GRCh37 remains more widely used by most research and clinical laboratories. To date, no study has quantified the impact of utilizing different reference assemblies for the identification of variants associated with rare and common diseases from large-scale exome-sequencing data. By calling variants on both the GRCh37 and GRCh38 references, we identified single-nucleotide variants (SNVs) and insertion-deletions (indels) in 1,572 exomes from participants with Mendelian diseases and their family members. We found that a total of 1.5% of SNVs and 2.0% of indels were discordant when different references were used. Notably, 76.6% of the discordant variants were clustered within discrete discordant reference patches (DISCREPs) comprising only 0.9% of loci targeted by exome sequencing. These DISCREPs were enriched for genomic elements including segmental duplications, fix patch sequences, and loci known to contain alternate haplotypes. We identified 206 genes significantly enriched for discordant variants, most of which were in DISCREPs and caused by multi-mapped reads on the reference assembly that lacked the variant call. Among these 206 genes, eight are implicated in known Mendelian diseases and 53 are associated with common phenotypes from genome-wide association studies. In addition, variant interpretations could also be influenced by the reference after lifting-over variant loci to another assembly. Overall, we identified genes and genomic loci affected by reference assembly choice, including genes associated with Mendelian disorders and complex human diseases that require careful evaluation in both research and clinical applications.
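The two kinds of summary reported in this abstract, a discordance rate between two call sets and the share of discordant variants that fall inside particular regions, can be sketched as follows. The sketch assumes both call sets have already been placed in one shared coordinate system (for example after lift-over) and that discordance is defined as symmetric difference over union. Both are assumptions made for illustration and not necessarily the definitions used by the authors.

```python
from typing import Iterable, Sequence, Set, Tuple

Variant = Tuple[str, int, str, str]   # (chrom, 1-based pos, ref, alt), hypothetical key
Region = Tuple[str, int, int]         # (chrom, start, end), 1-based inclusive


def discordance_rate(calls_a: Set[Variant], calls_b: Set[Variant]) -> float:
    """Fraction of all observed variants that appear in only one of the two call sets."""
    union = calls_a | calls_b
    if not union:
        return 0.0
    return len(calls_a ^ calls_b) / len(union)


def fraction_in_regions(variants: Iterable[Variant], regions: Sequence[Region]) -> float:
    """Fraction of variants whose position lies inside at least one region."""
    variants = list(variants)
    if not variants:
        return 0.0
    inside = sum(
        1
        for chrom, pos, _ref, _alt in variants
        if any(chrom == r_chrom and start <= pos <= end for r_chrom, start, end in regions)
    )
    return inside / len(variants)


if __name__ == "__main__":
    # Toy call sets, invented for illustration only.
    on_ref_a = {("1", 100, "A", "G"), ("1", 200, "C", "T"), ("2", 50, "G", "A")}
    on_ref_b = {("1", 100, "A", "G"), ("1", 200, "C", "T"), ("2", 70, "T", "C")}
    discordant = on_ref_a ^ on_ref_b
    print(discordance_rate(on_ref_a, on_ref_b))
    print(fraction_in_regions(discordant, [("2", 40, 60)]))
```

The linear scan over regions is adequate for a toy example; an interval tree or sorted-interval search would be the usual choice at exome scale.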
Affiliation(s)
- He Li
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
| | - Moez Dawood
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA; Medical Scientist Training Program, Baylor College of Medicine, Houston, TX 77030, USA
| | - Michael M Khayat
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
| | - Jesse R Farek
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
| | - Shalini N Jhangiani
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
| | - Ziad M Khan
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
| | - Tadahiro Mitani
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
| | - Zeynep Coban-Akdemir
- Human Genetics Center, Department of Epidemiology, Human Genetics, and Environmental Sciences, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
| | - James R Lupski
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA; Department of Pediatrics, Texas Children's Hospital, Houston, TX 77030, USA
| | - Eric Venner
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
| | - Jennifer E Posey
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
| | - Aniko Sabo
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
| | - Richard A Gibbs
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA.
|
28
|
Zhu N, Swietlik EM, Welch CL, Pauciulo MW, Hagen JJ, Zhou X, Guo Y, Karten J, Pandya D, Tilly T, Lutz KA, Martin JM, Treacy CM, Rosenzweig EB, Krishnan U, Coleman AW, Gonzaga-Jauregui C, Lawrie A, Trembath RC, Wilkins MR, Morrell NW, Shen Y, Gräf S, Nichols WC, Chung WK. Rare variant analysis of 4241 pulmonary arterial hypertension cases from an international consortium implicates FBLN2, PDGFD, and rare de novo variants in PAH. Genome Med 2021; 13:80. [PMID: 33971972 PMCID: PMC8112021 DOI: 10.1186/s13073-021-00891-1] [Citation(s) in RCA: 28] [Impact Index Per Article: 9.3] [Received: 06/18/2020] [Accepted: 04/19/2021] [Indexed: 12/30/2022] Open
Abstract
BACKGROUND Pulmonary arterial hypertension (PAH) is a lethal vasculopathy characterized by pathogenic remodeling of pulmonary arterioles leading to increased pulmonary pressures, right ventricular hypertrophy, and heart failure. PAH can be associated with other diseases (APAH: connective tissue diseases, congenital heart disease, and others) but often the etiology is idiopathic (IPAH). Mutations in bone morphogenetic protein receptor 2 (BMPR2) are the cause of most heritable cases but the vast majority of other cases are genetically undefined. METHODS To identify new risk genes, we utilized an international consortium of 4241 PAH cases with exome or genome sequencing data from the National Biological Sample and Data Repository for PAH, Columbia University Irving Medical Center, and the UK NIHR BioResource - Rare Diseases Study. The strength of this combined cohort is a doubling of the number of IPAH cases compared to either national cohort alone. We identified protein-coding variants and performed rare variant association analyses in unrelated participants of European ancestry, including 1647 IPAH cases and 18,819 controls. We also analyzed de novo variants in 124 pediatric trios enriched for IPAH and APAH-CHD. RESULTS Seven genes with rare deleterious variants were associated with IPAH with false discovery rate smaller than 0.1: three known genes (BMPR2, GDF2, and TBX4), two recently identified candidate genes (SOX17, KDR), and two new candidate genes (fibulin 2, FBLN2; platelet-derived growth factor D, PDGFD). The new genes were identified based solely on rare deleterious missense variants, a variant type that could not be adequately assessed in either cohort alone. The candidate genes exhibit expression patterns in lung and heart similar to that of known PAH risk genes, and most variants occur in conserved protein domains. 
For pediatric PAH, predicted deleterious de novo variants exhibited a significant burden compared to the background mutation rate (2.45×, p = 2.5e-5). At least eight novel pediatric candidate genes carrying de novo variants have plausible roles in lung/heart development. CONCLUSIONS Rare variant analysis of a large international consortium identified two new candidate genes-FBLN2 and PDGFD. The new genes have known functions in vasculogenesis and remodeling. Trio analysis predicted that ~ 15% of pediatric IPAH may be explained by de novo variants.
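A burden statement of the form "N× the background rate, p = ..." is commonly obtained by comparing an observed count of de novo variants with the count expected under a background mutation model. The Python sketch below shows one simple way to do this with a one-sided Poisson test. The counts are invented, and the choice of a Poisson model is an assumption for illustration; the abstract does not state which test the authors used.

```python
import math


def enrichment(observed: int, expected: float) -> float:
    """Fold enrichment of the observed count over the expected count."""
    return observed / expected


def poisson_upper_tail(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam), computed as 1 minus the sum of P(X = i) for i < k."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)      # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i        # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


if __name__ == "__main__":
    # Hypothetical numbers, not taken from the study.
    observed, expected = 30, 12.0
    print(enrichment(observed, expected))
    print(poisson_upper_tail(observed, expected))
```

The expected count would normally come from per-gene mutation rates summed over the genes and trios analysed, which is outside the scope of this sketch.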
Affiliation(s)
- Na Zhu
- Department of Pediatrics, Columbia University Irving Medical Center, 1150 St. Nicholas Avenue, Room 620, New York, NY, 10032, USA
- Department of Systems Biology, Columbia University, New York, NY, USA
| | - Emilia M Swietlik
- Department of Medicine, University of Cambridge, Cambridge Biomedical Campus, Cambridge, UK
| | - Carrie L Welch
- Department of Pediatrics, Columbia University Irving Medical Center, 1150 St. Nicholas Avenue, Room 620, New York, NY, 10032, USA
| | - Michael W Pauciulo
- Division of Human Genetics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, USA
- Department of Pediatrics, University of Cincinnati College of Medicine, Cincinnati, OH, USA
| | - Jacob J Hagen
- Department of Pediatrics, Columbia University Irving Medical Center, 1150 St. Nicholas Avenue, Room 620, New York, NY, 10032, USA
- Department of Systems Biology, Columbia University, New York, NY, USA
| | - Xueya Zhou
- Department of Pediatrics, Columbia University Irving Medical Center, 1150 St. Nicholas Avenue, Room 620, New York, NY, 10032, USA
- Department of Systems Biology, Columbia University, New York, NY, USA
| | - Yicheng Guo
- Department of Systems Biology, Columbia University, New York, NY, USA
| | | | - Divya Pandya
- Department of Medicine, University of Cambridge, Cambridge Biomedical Campus, Cambridge, UK
| | - Tobias Tilly
- Department of Medicine, University of Cambridge, Cambridge Biomedical Campus, Cambridge, UK
| | - Katie A Lutz
- Division of Human Genetics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, USA
| | - Jennifer M Martin
- Department of Medicine, University of Cambridge, Cambridge Biomedical Campus, Cambridge, UK
- NIHR BioResource for Translational Research, Cambridge Biomedical Campus, Cambridge, UK
| | - Carmen M Treacy
- Department of Medicine, University of Cambridge, Cambridge Biomedical Campus, Cambridge, UK
| | - Erika B Rosenzweig
- Department of Pediatrics, Columbia University Irving Medical Center, 1150 St. Nicholas Avenue, Room 620, New York, NY, 10032, USA
| | - Usha Krishnan
- Department of Pediatrics, Columbia University Irving Medical Center, 1150 St. Nicholas Avenue, Room 620, New York, NY, 10032, USA
| | - Anna W Coleman
- Division of Human Genetics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, USA
| | | | - Allan Lawrie
- Department of Infection, Immunity and Cardiovascular Disease, University of Sheffield, Sheffield, UK
| | - Richard C Trembath
- Department of Medical and Molecular Genetics, King's College London, London, UK
| | - Martin R Wilkins
- National Heart & Lung Institute, Imperial College London, London, UK
| | | | | | | | | | - Nicholas W Morrell
- Department of Medicine, University of Cambridge, Cambridge Biomedical Campus, Cambridge, UK
- NIHR BioResource for Translational Research, Cambridge Biomedical Campus, Cambridge, UK
- Addenbrooke's Hospital NHS Foundation Trust, Cambridge Biomedical Campus, Cambridge, UK
- Royal Papworth Hospital NHS Foundation Trust, Cambridge Biomedical Campus, Cambridge, UK
| | - Yufeng Shen
- Department of Systems Biology, Columbia University, New York, NY, USA
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
| | - Stefan Gräf
- Department of Medicine, University of Cambridge, Cambridge Biomedical Campus, Cambridge, UK
- NIHR BioResource for Translational Research, Cambridge Biomedical Campus, Cambridge, UK
- Department of Haematology, University of Cambridge, Cambridge Biomedical Campus, Cambridge, UK
| | - William C Nichols
- Division of Human Genetics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, USA
- Department of Pediatrics, University of Cincinnati College of Medicine, Cincinnati, OH, USA
| | - Wendy K Chung
- Department of Pediatrics, Columbia University Irving Medical Center, 1150 St. Nicholas Avenue, Room 620, New York, NY, 10032, USA.
- Herbert Irving Comprehensive Cancer Center, Columbia University Irving Medical Center, New York, NY, USA.
- Department of Medicine, Columbia University Irving Medical Center, New York, NY, USA.
| |
|
29
|
Next Generation Sequencing Technology in the Clinic and Its Challenges. Cancers (Basel) 2021; 13:1751. [PMID: 33916923 PMCID: PMC8067551 DOI: 10.3390/cancers13081751] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 03/09/2021] [Revised: 03/30/2021] [Accepted: 04/05/2021] [Indexed: 12/12/2022] Open
Abstract
Simple Summary
Precise identification and annotation of mutations are of utmost importance in clinical oncology. Insights of the DNA sequence can provide meaningful knowledge to unravel the underlying genetics of disease. Hence, tailoring of personalized medicine often relies on specific genomic alteration for treatment efficacy. The aim of this review is to highlight that sequencing harbors much more than just four nucleotides. Moreover, the gradual transition from first to second generation sequencing technologies has led to awareness for choosing the most appropriate bioinformatic analytic tools based on the aim, quality and demand for a specific purpose. Thus, the same raw data can lead to various results reflecting the intrinsic features of different datamining pipelines.

Abstract
Data analysis has become a crucial aspect in clinical oncology to interpret output from next-generation sequencing-based testing. NGS being able to resolve billions of sequencing reactions in a few days has consequently increased the demand for tools to handle and analyze such large data sets. Many tools have been developed since the advent of NGS, featuring their own peculiarities. Increased awareness when interpreting alterations in the genome is therefore of utmost importance, as the same data using different tools can provide diverse outcomes. Hence, it is crucial to evaluate and validate bioinformatic pipelines in clinical settings. Moreover, personalized medicine implies treatment targeting efficacy of biological drugs for specific genomic alterations. Here, we focused on different sequencing technologies, features underlying the genome complexity, and bioinformatic tools that can impact the final annotation. Additionally, we discuss the clinical demand and design for implementing NGS.
|
30
|
Fischer C, Koblmüller S, Börger C, Michelitsch G, Trajanoski S, Schlötterer C, Guelly C, Thallinger GG, Sturmbauer C. Genome sequences of Tropheus moorii and Petrochromis trewavasae, two eco-morphologically divergent cichlid fishes endemic to Lake Tanganyika. Sci Rep 2021; 11:4309. [PMID: 33619328 PMCID: PMC7900123 DOI: 10.1038/s41598-021-81030-z] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 05/05/2020] [Accepted: 12/28/2020] [Indexed: 01/01/2023] Open
Abstract
With more than 1000 species, East African cichlid fishes represent the fastest and most species-rich vertebrate radiation known, providing an ideal model to tackle molecular mechanisms underlying recurrent adaptive diversification. We add high-quality genome reconstructions for two phylogenetic key species of a lineage that diverged about 3-9 million years ago (mya), representing the earliest split of the so-called modern haplochromines that seeded additional radiations such as those in Lake Malawi and Victoria. Along with the annotated genomes we analysed discriminating genomic features of the study species, each representing an extreme trophic morphology, one being an algae browser and the other an algae grazer. The genomes of Tropheus moorii (TM) and Petrochromis trewavasae (PT) comprise 911 and 918 Mbp with 40,300 and 39,600 predicted genes, respectively. Our DNA sequence data are based on 5 and 6 individuals of TM and PT, and the transcriptomic sequences of one individual per species and sex, respectively. Concerning variation, on average we observed 1 variant per 220 bp (interspecific), and 1 variant per 2540 bp (PT vs PT)/1561 bp (TM vs TM) (intraspecific). GO enrichment analysis of gene regions affected by variants revealed several candidates which may influence phenotype modifications related to facial and jaw morphology, such as genes belonging to the Hedgehog pathway (SHH, SMO, WNT9A) and the BMP and GLI families.
Affiliation(s)
- C Fischer
- Institute of Biology, University of Graz, Graz, Austria
- Institute of Biomedical Informatics, Graz University of Technology, Graz, Austria
| | - S Koblmüller
- Institute of Biology, University of Graz, Graz, Austria
| | - C Börger
- Institute of Biology, University of Graz, Graz, Austria
| | - G Michelitsch
- Center for Medical Research, Medical University of Graz, Graz, Austria
| | - S Trajanoski
- Center for Medical Research, Medical University of Graz, Graz, Austria
| | - C Schlötterer
- Institut für Populationsgenetik, Vetmeduni Vienna, Vienna, Austria
| | - C Guelly
- Center for Medical Research, Medical University of Graz, Graz, Austria
| | - G G Thallinger
- Institute of Biomedical Informatics, Graz University of Technology, Graz, Austria.
- BioTechMed-Graz, Graz, Austria.
| | - C Sturmbauer
- Institute of Biology, University of Graz, Graz, Austria.
- BioTechMed-Graz, Graz, Austria.
| |
|
31
|
Zhou X, Zhang L, Weng Z, Dill DL, Sidow A. Aquila enables reference-assisted diploid personal genome assembly and comprehensive variant detection based on linked reads. Nat Commun 2021; 12:1077. [PMID: 33597536 PMCID: PMC7889865 DOI: 10.1038/s41467-021-21395-x] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 03/09/2020] [Accepted: 01/20/2021] [Indexed: 01/19/2023] Open
Abstract
We introduce Aquila, a new approach to variant discovery in personal genomes, which is critical for uncovering the genetic contributions to health and disease. Aquila uses a reference sequence and linked-read data to generate a high quality diploid genome assembly, from which it then comprehensively detects and phases personal genetic variation. The contigs of the assemblies from our libraries cover >95% of the human reference genome, with over 98% of that in a diploid state. Thus, the assemblies support detection and accurate genotyping of the most prevalent types of human genetic variation, including single nucleotide polymorphisms (SNPs), small insertions and deletions (small indels), and structural variants (SVs), in all but the most difficult regions. All heterozygous variants are phased in blocks that can approach arm-level length. The final output of Aquila is a diploid and phased personal genome sequence, and a phased Variant Call Format (VCF) file that also contains homozygous and a few unphased heterozygous variants. Aquila represents a cost-effective approach that can be applied to cohorts for variation discovery or association studies, or to single individuals with rare phenotypes that could be caused by SVs or compound heterozygosity.
Affiliation(s)
- Xin Zhou
- Department of Computer Science, Stanford University, Stanford, CA, USA.
- Department of Biomedical Engineering, Vanderbilt University, Nashville, TN, USA.
| | - Lu Zhang
- Department of Pathology, Stanford University, Stanford, CA, USA
- Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong
| | - Ziming Weng
- Department of Pathology, Stanford University, Stanford, CA, USA
| | - David L Dill
- Department of Computer Science, Stanford University, Stanford, CA, USA
| | - Arend Sidow
- Department of Pathology, Stanford University, Stanford, CA, USA.
- Department of Genetics, Stanford University, Stanford, CA, USA.
| |
|
32
|
Gorla A, Jew B, Zhang L, Sul JH. xGAP: A python based efficient, modular, extensible and fault tolerant genomic analysis pipeline for variant discovery. Bioinformatics 2021; 37:9-16. [PMID: 33416856 PMCID: PMC8034531 DOI: 10.1093/bioinformatics/btaa1097] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/25/2020] [Revised: 12/22/2020] [Accepted: 01/04/2021] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Since the first human genome was sequenced in 2001, there has been a rapid growth in the number of bioinformatic methods to process and analyze next generation sequencing (NGS) data for research and clinical studies that aim to identify genetic variants influencing diseases and traits. To achieve this goal, one first needs to call genetic variants from NGS data which requires multiple computationally intensive analysis steps. Unfortunately, there is a lack of an open source pipeline that can perform all these steps on NGS data in a manner which is fully automated, efficient, rapid, scalable, modular, user-friendly and fault tolerant. To address this, we introduce xGAP, an extensible Genome Analysis Pipeline, which implements modified GATK best practice to analyze DNA-seq data with aforementioned functionalities. RESULTS xGAP implements massive parallelization of the modified GATK best practice pipeline by splitting a genome into many smaller regions with efficient load-balancing to achieve high scalability. It can process 30x coverage whole-genome sequencing (WGS) data in approximately 90 minutes. In terms of accuracy of discovered variants, xGAP achieves average F1 scores of 99.37% for SNVs and 99.20% for Indels across seven benchmark WGS datasets. We achieve highly consistent results across multiple on-premises (SGE & SLURM) high performance clusters. Compared to the Churchill pipeline, with similar parallelization, xGAP is 20% faster when analyzing 50X coverage WGS in AWS. Finally, xGAP is user-friendly and fault tolerant where it can automatically re-initiate failed processes to minimize required user intervention. AVAILABILITY xGAP is available at https://github.com/Adigorla/xgap. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
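Note: the abstract above describes splitting a genome into many smaller regions and load-balancing them across workers. The sketch below illustrates that general idea only; it is not xGAP's code. The fixed window size, the greedy largest-first assignment, and the names `split_regions` and `balance` are all assumptions made for illustration, and the chromosome lengths are toy values.

```python
import heapq


def split_regions(chrom_lengths, window):
    """Cut each chromosome into half-open windows (chrom, start, end)."""
    regions = []
    for chrom, length in chrom_lengths.items():
        for start in range(0, length, window):
            regions.append((chrom, start, min(start + window, length)))
    return regions


def balance(regions, n_workers):
    """Greedy load balancing: give the largest remaining region to the
    worker that currently holds the fewest base pairs."""
    heap = [(0, worker) for worker in range(n_workers)]  # (load in bp, worker id)
    heapq.heapify(heap)
    assignment = {worker: [] for worker in range(n_workers)}
    by_size = sorted(regions, key=lambda r: r[2] - r[1], reverse=True)
    for region in by_size:
        load, worker = heapq.heappop(heap)
        assignment[worker].append(region)
        heapq.heappush(heap, (load + region[2] - region[1], worker))
    return assignment


# Toy input (hypothetical lengths, not a real genome).
chrom_lengths = {"chrA": 250, "chrB": 100}
regions = split_regions(chrom_lengths, window=100)
assignment = balance(regions, n_workers=2)
loads = {w: sum(end - start for _, start, end in rs) for w, rs in assignment.items()}
print(regions)
print(loads)
```

In a real pipeline each worker would run the mapping and calling steps on its own regions, and the per-region calls would be merged afterwards; region size in base pairs is only a rough proxy for compute cost.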
Affiliation(s)
- Aditya Gorla
- Department of Bioengineering, University of California, Los Angeles, CA 90095, U.S.A.
| | - Brandon Jew
- Bioinformatics Interdepartmental Program, University of California, Los Angeles, CA 90095, U.S.A.
| | - Luke Zhang
- Undergraduate Neuroscience Interdepartmental Program, University of California, Los Angeles, CA 90095, U.S.A.
| | - Jae Hoon Sul
- Department of Psychiatry and Biobehavioral Sciences, University of California, Los Angeles, CA 90095, U.S.A.
| |
|
33
|
Molina-Mora JA, Solano-Vargas M. Set-theory based benchmarking of three different variant callers for targeted sequencing. BMC Bioinformatics 2021; 22:20. [PMID: 33413082 PMCID: PMC7791862 DOI: 10.1186/s12859-020-03926-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/24/2019] [Accepted: 12/09/2020] [Indexed: 12/05/2022] Open
Abstract
Background Next generation sequencing (NGS) technologies have improved the study of hereditary diseases. Since the evaluation of bioinformatics pipelines is not straightforward, NGS demands effective strategies to analyze data that are of paramount relevance for decision making in a clinical scenario. Following the benchmarking framework of the Global Alliance for Genomics and Health (GA4GH), we implemented a new, simple and user-friendly set-theory based method to assess variant callers using a gold standard variant set and high confidence regions. As a model, we used TruSight Cardio kit sequencing data of the reference genome NA12878. This targeted sequencing kit is used to identify variants in key genes related to Inherited Cardiac Conditions (ICCs), a group of cardiovascular diseases with high rates of morbidity and mortality. Results We implemented and compared three variant calling pipelines (Isaac, Freebayes, and VarScan). Performance metrics using our set-theory approach showed high-resolution pipelines and revealed: (1) a perfect recall of 1.000 for all three pipelines, (2) very high precision values, i.e. 0.987 for Freebayes, 0.928 for VarScan, and 1.000 for Isaac, when compared with the reference material, and (3) a ROC curve analysis with AUC > 0.94 for all cases. Moreover, significant differences were obtained between the three pipelines. In general, results indicate that the three pipelines were able to recognize the expected variants in the gold standard data set. Conclusions Our set-theory approach to calculate metrics was able to identify the expected ICCs related variants by the three selected pipelines, but results were completely dependent on the algorithms. We emphasize the importance of assessing pipelines using gold standard materials to achieve the most reliable results for clinical application.
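Note: the set-theory view of benchmarking described above can be illustrated with plain set operations. This is a minimal sketch, not the authors' implementation: the variant key `(chrom, pos, ref, alt)`, the function name `set_metrics`, and the toy variants are assumptions, and the restriction to high confidence regions mentioned in the abstract is not modeled.

```python
def set_metrics(called, truth):
    """Precision, recall and F1 from two sets of variant keys."""
    true_pos = called & truth    # called and present in the gold standard
    false_pos = called - truth   # called but absent from the gold standard
    false_neg = truth - called   # in the gold standard but not called
    precision = len(true_pos) / len(called) if called else 0.0
    recall = len(true_pos) / len(truth) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "tp": len(true_pos),
        "fp": len(false_pos),
        "fn": len(false_neg),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy data (hypothetical variants, not taken from the study).
truth = {
    ("chr1", 100, "A", "G"),
    ("chr1", 200, "C", "T"),
    ("chr2", 50, "G", "GA"),
}
called = truth | {("chr2", 75, "T", "C")}  # one extra, unsupported call

metrics = set_metrics(called, truth)
print(metrics)
```

In this toy case every gold standard variant is recovered, so recall is perfect, while the single extra call lowers precision; the same pattern of perfect recall with pipeline-dependent precision is what the abstract reports.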
Affiliation(s)
- Jose Arturo Molina-Mora
- Centro de Investigación en Enfermedades Tropicales (CIET) and Facultad de Microbiología, Universidad de Costa Rica (UCR), San José, Costa Rica; Centro de Investigaciones en Hematología y Transtornos Afines (CIHATA), Universidad de Costa Rica (UCR), San José, Costa Rica.
| | - Mariela Solano-Vargas
- Centro de Investigaciones en Hematología y Transtornos Afines (CIHATA), Universidad de Costa Rica (UCR), San José, Costa Rica
| |
|
34
|
Artificial Intelligence in Pediatrics. Artif Intell Med 2021. [DOI: 10.1007/978-3-030-58080-3_316-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
|
35
|
Padmavathi P, Setlur AS, Chandrashekar K, Niranjan V. A comprehensive in-silico computational analysis of twenty cancer exome datasets and identification of associated somatic variants reveals potential molecular markers for detection of varied cancer types. INFORMATICS IN MEDICINE UNLOCKED 2021. [DOI: 10.1016/j.imu.2021.100762] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 01/25/2023] Open
|
37
|
Zhao S, Agafonov O, Azab A, Stokowy T, Hovig E. Accuracy and efficiency of germline variant calling pipelines for human genome data. Sci Rep 2020; 10:20222. [PMID: 33214604 PMCID: PMC7678823 DOI: 10.1038/s41598-020-77218-4] [Citation(s) in RCA: 42] [Impact Index Per Article: 10.5] [Received: 03/30/2020] [Accepted: 11/02/2020] [Indexed: 12/30/2022] Open
Abstract
Advances in next-generation sequencing technology have enabled whole genome sequencing (WGS) to be widely used for identification of causal variants in a spectrum of genetic-related disorders, and provided new insight into how genetic polymorphisms affect disease phenotypes. The development of different bioinformatics pipelines has continuously improved the variant analysis of WGS data. However, there is a necessity for a systematic performance comparison of these pipelines to provide guidance on the application of WGS-based scientific and clinical genomics. In this study, we evaluated the performance of three variant calling pipelines (GATK, DRAGEN and DeepVariant) using the Genome in a Bottle Consortium, "synthetic-diploid" and simulated WGS datasets. DRAGEN and DeepVariant show better accuracy in SNP and indel calling, with no significant differences in their F1-score. DRAGEN platform offers accuracy, flexibility and a highly-efficient execution speed, and therefore superior performance in the analysis of WGS data on a large scale. The combination of DRAGEN and DeepVariant also suggests a good balance of accuracy and efficiency as an alternative solution for germline variant detection in further applications. Our results facilitate the standardization of benchmarking analysis of bioinformatics pipelines for reliable variant detection, which is critical in genetics-based medical research and clinical applications.
Affiliation(s)
- Sen Zhao
- Department of Tumor Biology, Institute of Cancer Research, The Norwegian Radium Hospital, Oslo University Hospital, 0310, Oslo, Norway
| | | | - Abdulrahman Azab
- Center for Bioinformatics, Department of Informatics, University of Oslo, 0316, Oslo, Norway
- Division of Research Computing, University Center for Information Technology (USIT), University of Oslo, 0316, Oslo, Norway
| | - Tomasz Stokowy
- Computational Biology Unit, Institute of Informatics, University of Bergen, 5008, Bergen, Norway
- Department of Clinical Science, University of Bergen, 5021, Bergen, Norway
| | - Eivind Hovig
- Department of Tumor Biology, Institute of Cancer Research, The Norwegian Radium Hospital, Oslo University Hospital, 0310, Oslo, Norway.
- Center for Bioinformatics, Department of Informatics, University of Oslo, 0316, Oslo, Norway.
| |
|
38
|
DeepVariant-on-Spark: Small-Scale Genome Analysis Using a Cloud-Based Computing Framework. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2020; 2020:7231205. [PMID: 32952600 PMCID: PMC7481958 DOI: 10.1155/2020/7231205] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 04/30/2020] [Revised: 08/15/2020] [Accepted: 08/21/2020] [Indexed: 12/18/2022]
Abstract
Although sequencing a human genome has become affordable, identifying genetic variants from whole-genome sequence data is still a hurdle for researchers without adequate computing equipment or bioinformatics support. GATK is a gold standard method for the identification of genetic variants and has been widely used in genome projects and population genetic studies for many years. This held until the Google Brain team developed a new method, DeepVariant, which utilizes deep neural networks to construct an image classification model to identify genetic variants. However, the superior accuracy of DeepVariant comes at the cost of computational intensity, largely constraining its applications. Accordingly, we present DeepVariant-on-Spark to optimize resource allocation, enable multi-GPU support, and accelerate the processing of the DeepVariant pipeline. To make DeepVariant-on-Spark more accessible to everyone, we have deployed DeepVariant-on-Spark to the Google Cloud Platform (GCP). Users can deploy DeepVariant-on-Spark on the GCP within 20 minutes by following our instructions and start to analyze at least ten whole-genome sequencing datasets using free credits provided by the GCP. DeepVariant-on-Spark is freely available for small-scale genome analysis using a cloud-based computing framework, which is suitable for pilot testing or preliminary study, while reserving the flexibility and scalability for large-scale sequencing projects.
|
39
|
Comparison of commercially available whole-genome sequencing kits for variant detection in circulating cell-free DNA. Sci Rep 2020; 10:6190. [PMID: 32277101 PMCID: PMC7148341 DOI: 10.1038/s41598-020-63102-8] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 10/18/2019] [Accepted: 03/19/2020] [Indexed: 12/13/2022] Open
Abstract
Circulating cell-free DNA (ccfDNA) has great potential for non-invasive diagnosis, prognosis and monitoring treatment of disease. However, a sensitive and specific whole-genome sequencing (WGS) method is required to identify novel genetic variations (i.e., SNVs, CNVs and INDELs) on ccfDNA that can be used as clinical biomarkers. In this article, five WGS methods were compared: ThruPLEX Plasma-seq, QIAseq cfDNA All-in-One, NEXTFLEX Cell Free DNA-seq, Accel-NGS 2S PCR-Free DNA and Accel-NGS 2S Plus DNA. The Accel PCR-free kit did not produce enough material for sequencing. The other kits shared a significant number of SNVs, INDELs and CNVs and showed similar results for SNVs and CNVs. The detection of variants and genomic signatures depends more on the type of plasma sample than on the WGS method used. Accel detected several variants not observed by the other kits. ThruPLEX seemed to identify more low-abundance SNVs, and its SNV signatures were similar to those observed with the QIAseq kit. Accel and NEXTFLEX had similar CNV and SNV signatures. These results demonstrate the importance of establishing a standardized workflow for identifying non-invasive candidate biomarkers. Moreover, the combination of variants discovered in ccfDNA using WGS has the potential to identify enrichment pathways, while the analysis of signatures could identify new subgroups of patients.
|
40
|
Fjeld K, Masson E, Lin JH, Michl P, Stokowy T, Gravdal A, El Jellas K, Steine SJ, Hoem D, Johansson BB, Dalva M, Ruffert C, Zou WB, Li ZS, Njølstad PR, Chen JM, Liao Z, Johansson S, Rosendahl J, Férec C, Molven A. Characterization of CEL-DUP2: Complete duplication of the carboxyl ester lipase gene is unlikely to influence risk of chronic pancreatitis. Pancreatology 2020; 20:377-384. [PMID: 32007358 DOI: 10.1016/j.pan.2020.01.011] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 01/06/2020] [Revised: 01/17/2020] [Accepted: 01/18/2020] [Indexed: 12/11/2022]
Abstract
BACKGROUND/OBJECTIVES Carboxyl ester lipase is a pancreatic enzyme encoded by CEL, an extremely polymorphic human gene. Pathogenic variants of CEL either increase the risk for chronic pancreatitis (CP) or cause MODY8, a syndrome of pancreatic exocrine and endocrine dysfunction. Here, we aimed to characterize a novel duplication allele of CEL (CEL-DUP2) and to investigate whether it associates with CP or pancreatic cancer. METHODS The structure of CEL-DUP2 was determined by a combination of Sanger sequencing, DNA fragment analysis, multiplex ligation-dependent probe amplification and whole-genome sequencing. We developed assays for screening of CEL-DUP2 and analyzed cohorts of idiopathic CP, alcoholic CP and pancreatic cancer. CEL protein expression was analyzed by immunohistochemistry. RESULTS CEL-DUP2 consists of an extra copy of the complete CEL gene. The allele has probably arisen from non-allelic, homologous recombination involving the adjacent pseudogene of CEL. We found no association between CEL-DUP2 carrier frequency and CP in cohorts from France (cases/controls: 2.5%/2.4%; P = 1.0), China (10.3%/8.1%; P = 0.08) or Germany (1.6%/2.3%; P = 0.62). Similarly, no association with disease was observed in alcohol-induced pancreatitis (Germany: 3.2%/2.3%; P = 0.51) or pancreatic cancer (Norway; 2.5%/3.2%; P = 0.77). Notably, the carrier frequency of CEL-DUP2 was more than three-fold higher in Chinese compared with Europeans. CEL protein expression was similar in tissues from CEL-DUP2 carriers and controls. CONCLUSIONS Our results support the contention that the number of CEL alleles does not influence the risk of pancreatic exocrine disease. Rather, the pathogenic CEL variants identified so far involve exon 11 sequence changes that substantially alter the protein's tail region.
Affiliation(s)
- Karianne Fjeld
- The Gade Laboratory for Pathology, Department of Clinical Medicine, University of Bergen, Bergen, Norway; Department of Medical Genetics, Haukeland University Hospital, Bergen, Norway; Center for Diabetes Research, Department of Clinical Science, University of Bergen, Bergen, Norway.
| | - Emmanuelle Masson
- Univ Brest, Inserm, EFS, UMR 1078, GGB, F-29200, Brest, France; CHRU Brest, Service de Génétique, Brest, France
| | - Jin-Huan Lin
- Department of Gastroenterology, Changhai Hospital, Second Military Medical University, Shanghai, China; Shanghai Institute of Pancreatic Diseases, Shanghai, China
| | - Patrick Michl
- Department of Internal Medicine I, Martin Luther University, Halle, Germany
| | - Tomasz Stokowy
- Genomics Core Facility, Department of Clinical Science, University of Bergen, Bergen, Norway
| | - Anny Gravdal
- The Gade Laboratory for Pathology, Department of Clinical Medicine, University of Bergen, Bergen, Norway; Department of Medical Genetics, Haukeland University Hospital, Bergen, Norway; Center for Diabetes Research, Department of Clinical Science, University of Bergen, Bergen, Norway
| | - Khadija El Jellas
- The Gade Laboratory for Pathology, Department of Clinical Medicine, University of Bergen, Bergen, Norway; Center for Diabetes Research, Department of Clinical Science, University of Bergen, Bergen, Norway
| | - Solrun J Steine
- The Gade Laboratory for Pathology, Department of Clinical Medicine, University of Bergen, Bergen, Norway
| | - Dag Hoem
- Department of Gastrointestinal Surgery, Haukeland University Hospital, Bergen, Norway
| | - Bente B Johansson
- Center for Diabetes Research, Department of Clinical Science, University of Bergen, Bergen, Norway
| | - Monica Dalva
- The Gade Laboratory for Pathology, Department of Clinical Medicine, University of Bergen, Bergen, Norway; Department of Medical Genetics, Haukeland University Hospital, Bergen, Norway
| | - Claudia Ruffert
- Department of Internal Medicine I, Martin Luther University, Halle, Germany
| | - Wen-Bin Zou
- Department of Gastroenterology, Changhai Hospital, Second Military Medical University, Shanghai, China; Shanghai Institute of Pancreatic Diseases, Shanghai, China
| | - Zhao-Shen Li
- Department of Gastroenterology, Changhai Hospital, Second Military Medical University, Shanghai, China; Shanghai Institute of Pancreatic Diseases, Shanghai, China
| | - Pål R Njølstad
- Center for Diabetes Research, Department of Clinical Science, University of Bergen, Bergen, Norway; Department of Pediatrics and Adolescent Medicine, Haukeland University Hospital, Bergen, Norway
| | - Jian-Min Chen
- Univ Brest, Inserm, EFS, UMR 1078, GGB, F-29200, Brest, France
| | - Zhuan Liao
- Department of Gastroenterology, Changhai Hospital, Second Military Medical University, Shanghai, China; Shanghai Institute of Pancreatic Diseases, Shanghai, China
| | - Stefan Johansson
- Department of Medical Genetics, Haukeland University Hospital, Bergen, Norway; Center for Diabetes Research, Department of Clinical Science, University of Bergen, Bergen, Norway
| | - Jonas Rosendahl
- Department of Internal Medicine I, Martin Luther University, Halle, Germany
| | - Claude Férec
- Univ Brest, Inserm, EFS, UMR 1078, GGB, F-29200, Brest, France; CHRU Brest, Service de Génétique, Brest, France
| | - Anders Molven
- The Gade Laboratory for Pathology, Department of Clinical Medicine, University of Bergen, Bergen, Norway; Center for Diabetes Research, Department of Clinical Science, University of Bergen, Bergen, Norway; Department of Pathology, Haukeland University Hospital, Bergen, Norway
| |
|
41
|
Stenton SL, Kremer LS, Kopajtich R, Ludwig C, Prokisch H. The diagnosis of inborn errors of metabolism by an integrative "multi-omics" approach: A perspective encompassing genomics, transcriptomics, and proteomics. J Inherit Metab Dis 2020; 43:25-35. [PMID: 31119744 DOI: 10.1002/jimd.12130] [Citation(s) in RCA: 34] [Impact Index Per Article: 8.5] [Received: 01/31/2019] [Revised: 05/21/2019] [Accepted: 05/21/2019] [Indexed: 12/12/2022]
Abstract
Given the rapidly decreasing cost and increasing speed and accessibility of massively parallel technologies, the integration of comprehensive genomic, transcriptomic, and proteomic data into a "multi-omics" diagnostic pipeline is within reach. Even though genomic analysis has the capability to reveal all possible perturbations in our genetic code, analysis typically reaches a diagnosis in just 35% of cases, with a diagnostic gap arising due to limitations in prioritization and interpretation of detected variants. Here we review the utility of complementing genetic data with transcriptomic data and give a perspective for the introduction of proteomics into the diagnostic pipeline. Together these methodologies enable comprehensive capture of the functional consequence of variants, unobtainable by the analysis of each methodology in isolation. This facilitates functional annotation and reprioritization of candidate genes and variants, a promising approach to shed light on the underlying molecular cause of a patient's disease, increasing diagnostic rate, and allowing actionability in clinical practice.
Affiliation(s)
- Sarah L Stenton
  - Institute of Human Genetics, Technische Universität München, München, Germany
  - Institute of Human Genetics, Helmholtz Zentrum München, München, Germany
- Laura S Kremer
  - Institute of Human Genetics, Technische Universität München, München, Germany
  - Institute of Human Genetics, Helmholtz Zentrum München, München, Germany
- Robert Kopajtich
  - Institute of Human Genetics, Technische Universität München, München, Germany
  - Institute of Human Genetics, Helmholtz Zentrum München, München, Germany
- Christina Ludwig
  - Bavarian Center for Biomolecular Mass Spectrometry (BayBioMS), Technische Universität München, München, Germany
- Holger Prokisch
  - Institute of Human Genetics, Technische Universität München, München, Germany
  - Institute of Human Genetics, Helmholtz Zentrum München, München, Germany
|
42
|
Loka TP, Tausch SH, Renard BY. Reliable variant calling during runtime of Illumina sequencing. Sci Rep 2019; 9:16502. [PMID: 31712740 PMCID: PMC6848508 DOI: 10.1038/s41598-019-52991-z] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 12/28/2018] [Accepted: 10/16/2019] [Indexed: 02/03/2023] Open
Abstract
The sequential paradigm of data acquisition and analysis in next-generation sequencing leads to high turnaround times for the generation of interpretable results. We combined a novel real-time read mapping algorithm with fast variant calling to obtain reliable variant calls while sequencing is still in progress. Our new algorithm allows for accurate read mapping results for intermediate cycles and supports large reference genomes such as the complete human reference. This enables the combination of real-time read mapping results with complex follow-up analysis. In this study, we showed the accuracy and scalability of our approach by applying real-time read mapping and variant calling to seven publicly available human whole exome sequencing datasets. Up to 89% of all detected SNPs were already identified after 40 sequencing cycles, with precision similar to that at the end of sequencing. Final results showed similar accuracy to those of conventional post-hoc analysis methods. When compared to standard routines, our live approach enables considerably faster interventions in clinical applications and infectious disease outbreaks. Besides variant calling, our approach can be adapted for a plethora of other mapping-based analyses.
Affiliation(s)
- Tobias P Loka
  - Bioinformatics Division (MF 1), Department for Methodology and Research Infrastructure, Robert Koch Institute, Berlin, Germany
- Simon H Tausch
  - Bioinformatics Division (MF 1), Department for Methodology and Research Infrastructure, Robert Koch Institute, Berlin, Germany
  - Centre for Biological Threats and Special Pathogens: Highly Pathogenic Viruses (ZBS 1), Robert Koch Institute, Berlin, Germany
  - German Federal Institute for Risk Assessment (BfR), Department of Biological Safety, Berlin, Germany
- Bernhard Y Renard
  - Bioinformatics Division (MF 1), Department for Methodology and Research Infrastructure, Robert Koch Institute, Berlin, Germany
|
43
|
Abstract
Tumor cells acquire distinct genetic characteristics as a means to survive and proliferate indefinitely. Changes in the genetic code can also translate into changes at the protein level, creating a distinguishable signature that is unique to tumor cells and absent in normal tissues. The presence of discernible moieties in tumors is particularly attractive because it represents a therapeutic opportunity to target tumor cells with specificity while sparing non-transformed cells. In this sense, neoantigens, short peptides containing a mutated sequence, are seen as attractive therapeutic targets because of their confinement within tumor cells. Neoantigens can be recognized with high affinity and specificity by tumor-targeting T cells, which can consequently initiate a potent anti-tumor immune response. While this is feasible and has been tested in numerous cancer types, including melanoma, colon and lung cancer, to mention a few, there are technical challenges in identifying immunogenic neoantigens. In this manuscript we address the topic of neoantigen identification from tumor samples, offering a technical overview of the bioinformatic methods used to profile the neoantigenic load of tumor samples obtained from clinical specimens. This is meant to guide readers through the steps of neoantigen identification using genomic data by suggesting tools and methods that can provide, with a high degree of confidence, reliable results for downstream in vitro and in vivo applications.
Affiliation(s)
- Sebastiano Battaglia
  - Center For Immunotherapy, Department of Genetics and Genomics, Roswell Park Comprehensive Cancer Center, Buffalo, NY, United States
|
44
|
Svensson D, Sjögren R, Sundell D, Sjödin A, Trygg J. doepipeline: a systematic approach to optimizing multi-level and multi-step data processing workflows. BMC Bioinformatics 2019; 20:498. [PMID: 31615395 PMCID: PMC6794737 DOI: 10.1186/s12859-019-3091-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 12/21/2018] [Accepted: 09/10/2019] [Indexed: 12/30/2022] Open
Abstract
BACKGROUND Selecting the proper parameter settings for bioinformatic software tools is challenging. Not only will each parameter have an individual effect on the outcome, but there are also potential interaction effects between parameters. Both of these effects may be difficult to predict. To make the situation even more complex, multiple tools may be run in a sequential pipeline where the final output depends on the parameter configuration for each tool in the pipeline. Because of the complexity and difficulty of predicting outcomes, in practice parameters are often left at default settings or set based on personal or peer experience obtained in a trial-and-error fashion. To allow for the reliable and efficient selection of parameters for bioinformatic pipelines, a systematic approach is needed. RESULTS We present doepipeline, a novel approach to optimizing bioinformatic software parameters, based on core concepts of the Design of Experiments methodology and recent advances in subset designs. Optimal parameter settings are first approximated in a screening phase using a subset design that efficiently spans the entire search space, then optimized in the subsequent phase using response surface designs and OLS modeling. Doepipeline was used to optimize parameters in four use cases: 1) de novo assembly, 2) scaffolding of a fragmented genome assembly, 3) k-mer taxonomic classification of Oxford Nanopore Technologies MinION reads, and 4) genetic variant calling. In all four cases, doepipeline found parameter settings that produced a better outcome with respect to the characteristic measured when compared to using default values. Our approach is implemented and available in the Python package doepipeline. CONCLUSIONS Our proposed methodology provides a systematic and robust framework for optimizing software parameter settings, in contrast to labor- and time-intensive manual parameter tweaking. Implementation in doepipeline makes our methodology accessible and user-friendly, and allows for automatic optimization of tools in a wide range of cases. The source code of doepipeline is available at https://github.com/clicumu/doepipeline and it can be installed through conda-forge.
Affiliation(s)
- Daniel Svensson
  - Department of Chemistry, Computational Life Science Cluster (CLiC), Umeå University, Umeå, Sweden
- Rickard Sjögren
  - Department of Chemistry, Computational Life Science Cluster (CLiC), Umeå University, Umeå, Sweden
  - Corporate Research, Sartorius AG, Umeå, Sweden
- David Sundell
  - Division of CBRN Security and Defence, FOI - Swedish Defence Research Agency, Umeå, Sweden
- Andreas Sjödin
  - Division of CBRN Security and Defence, FOI - Swedish Defence Research Agency, Umeå, Sweden
- Johan Trygg
  - Department of Chemistry, Computational Life Science Cluster (CLiC), Umeå University, Umeå, Sweden
  - Corporate Research, Sartorius AG, Umeå, Sweden
|
45
|
Variant calling and quality control of large-scale human genome sequencing data. Emerg Top Life Sci 2019; 3:399-409. [DOI: 10.1042/etls20190007] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 04/24/2019] [Revised: 06/28/2019] [Accepted: 07/16/2019] [Indexed: 12/12/2022]
Abstract
Next-generation sequencing has allowed genetic studies to collect genome sequencing data from a large number of individuals. However, raw sequencing data are not usually interpretable due to fragmentation of the genome and technical biases; therefore, analysis of these data requires many computational approaches. First, for each sequenced individual, sequencing data are aligned and further processed to account for technical biases. Then, variant calling is performed to obtain information on the positions of genetic variants and their corresponding genotypes. Quality control (QC) is applied to identify individuals and genetic variants with sequencing errors. These procedures are necessary to generate accurate variant calls from sequencing data, and many computational approaches have been developed for these tasks. This review will focus on current widely used approaches for variant calling and QC.
|