51
Tubeuf H, Charbonnier C, Soukarieh O, Blavier A, Lefebvre A, Dauchel H, Frebourg T, Gaildrat P, Martins A. Large-scale comparative evaluation of user-friendly tools for predicting variant-induced alterations of splicing regulatory elements. Hum Mutat 2020; 41:1811-1829. [PMID: 32741062 DOI: 10.1002/humu.24091] [Citation(s) in RCA: 25] [Impact Index Per Article: 6.3] [Received: 02/05/2020] [Revised: 07/11/2020] [Accepted: 07/26/2020] [Indexed: 12/20/2022]
Abstract
Discriminating which nucleotide variants cause disease or contribute to phenotypic traits remains a major challenge in human genetics. In theory, any intragenic variant can potentially affect RNA splicing by altering splicing regulatory elements (SREs). However, these alterations are often ignored mainly because pioneer SRE predictors have proved inefficient. Here, we report the first large-scale comparative evaluation of four user-friendly SRE-dedicated algorithms (QUEPASA, HEXplorer, SPANR, and HAL) tested both as standalone tools and in multiple combined ways based on two independent benchmark datasets adding up to >1,300 exonic variants studied at the messenger RNA level and mapping to 89 different disease-causing genes. These methods display good predictive power, based on decision thresholds derived from the receiver operating characteristics curve analyses, with QUEPASA and HAL having the best accuracies either as standalone or in combination. Still, overall there was a tight race between the four predictors, suggesting that all methods may be of use. Additionally, QUEPASA and HEXplorer may be beneficial as well for predicting variant-induced creation of pseudoexons deep within introns. Our study highlights the potential of SRE predictors as filtering tools for identifying disease-causing candidates among the plethora of variants detected by high-throughput DNA sequencing and provides guidance for their use in genomic medicine settings.
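The abstract mentions decision thresholds derived from ROC curve analyses but does not say which criterion was used. As a minimal sketch, assuming Youden's J statistic (sensitivity + specificity - 1) as the criterion and assuming that higher scores indicate a splicing-altering variant, a threshold could be picked as follows. The function name and the toy scores are hypothetical and are not taken from the study.

```python
from typing import List, Tuple


def youden_threshold(scores: List[float], labels: List[int]) -> Tuple[float, float, float]:
    """Return (threshold, sensitivity, specificity) maximizing Youden's J.

    Assumption: a variant is predicted positive when score >= threshold,
    and labels are 1 (splicing altered) or 0 (no effect).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")

    best_j = float("-inf")
    best = (0.0, 0.0, 0.0)
    # Every observed score is a candidate cut-off on the ROC curve.
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        sens = tp / positives
        spec = tn / negatives
        j = sens + spec - 1
        if j > best_j:  # ties keep the lowest threshold
            best_j = j
            best = (t, sens, spec)
    return best


if __name__ == "__main__":
    # Hypothetical scores for six variants with known splicing outcome.
    print(youden_threshold([0.1, 0.2, 0.35, 0.4, 0.8, 0.9], [0, 0, 0, 1, 1, 1]))
```

Other criteria, such as the point closest to the top-left corner of the ROC plot, would fit the same loop by changing only the quantity being maximized.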
Affiliation(s)
- Hélène Tubeuf
- Inserm U1245, UNIROUEN, Normandie University, Normandy Centre for Genomic and Personalized Medicine, Rouen, France; Interactive Biosoftware, Rouen, France
- Camille Charbonnier
- Inserm U1245, UNIROUEN, Normandie University, Normandy Centre for Genomic and Personalized Medicine, Rouen, France
- Omar Soukarieh
- Inserm U1245, UNIROUEN, Normandie University, Normandy Centre for Genomic and Personalized Medicine, Rouen, France
- Arnaud Lefebvre
- Computer Science, Information Processing and Systems Laboratory, UNIROUEN, Normandie University, Mont-Saint-Aignan, France
- Hélène Dauchel
- Computer Science, Information Processing and Systems Laboratory, UNIROUEN, Normandie University, Mont-Saint-Aignan, France
- Thierry Frebourg
- Inserm U1245, UNIROUEN, Normandie University, Normandy Centre for Genomic and Personalized Medicine, Rouen, France; Department of Genetics, University Hospital, Normandy Centre for Genomic and Personalized Medicine, Rouen, France
- Pascaline Gaildrat
- Inserm U1245, UNIROUEN, Normandie University, Normandy Centre for Genomic and Personalized Medicine, Rouen, France
- Alexandra Martins
- Inserm U1245, UNIROUEN, Normandie University, Normandy Centre for Genomic and Personalized Medicine, Rouen, France
52
Lu IN, Muller CP, He FQ. Applying next-generation sequencing to unravel the mutational landscape in viral quasispecies. Virus Res 2020; 283:197963. [PMID: 32278821 PMCID: PMC7144618 DOI: 10.1016/j.virusres.2020.197963] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Received: 12/18/2019] [Revised: 04/03/2020] [Accepted: 04/04/2020] [Indexed: 02/07/2023]
Abstract
Next-generation sequencing (NGS) has revolutionized the scale and depth of biomedical sciences. Because of its unique ability for the detection of sub-clonal variants within genetically diverse populations, NGS has been successfully applied to analyze and quantify the exceptionally-high diversity within viral quasispecies, and many low-frequency drug- or vaccine-resistant mutations of therapeutic importance have been discovered. Although many works have intensively discussed the latest NGS approaches and applications in general, none of them has focused on applying NGS in viral quasispecies studies, mostly due to the limited ability of current NGS technologies to accurately detect and quantify rare viral variants. Here, we summarize several error-correction strategies that have been developed to enhance the detection accuracy of minority variants. We also discuss critical considerations for preparing a sequencing library from viral RNAs and for analyzing NGS data to unravel the mutational landscape.
Affiliation(s)
- I-Na Lu
- DKFZ-Division Translational Neurooncology at the WTZ, DKTK partner site, University Hospital Essen, D-45147 Essen, Germany; Department of Infectious Diseases, Aarhus University Hospital, DK-8200 Aarhus N, Denmark
- Claude P Muller
- Department of Infection and Immunity, Luxembourg Institute of Health, L-4354 Esch-Sur-Alzette, Luxembourg; Laboratoire National de Santé, L-3583 Dudelange, Luxembourg
- Feng Q He
- Department of Infection and Immunity, Luxembourg Institute of Health, L-4354 Esch-Sur-Alzette, Luxembourg; Institute of Medical Microbiology, University Hospital Essen, University Duisburg-Essen, Essen, Germany
53
Özdemir Özdoğan G, Kaya H. Next-Generation Sequencing Data Analysis on Pool-Seq and Low-Coverage Retinoblastoma Data. Interdiscip Sci 2020; 12:302-310. [PMID: 32519123 DOI: 10.1007/s12539-020-00374-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/07/2019] [Revised: 04/26/2020] [Accepted: 05/22/2020] [Indexed: 12/31/2022]
Abstract
Next-generation sequencing (NGS) refers to massively parallel, or deep, deoxyribonucleic acid (DNA) sequencing technology, which has revolutionized genomic research in recent years. Although the cost of generating NGS data has decreased since the technology emerged, it can still be a problem. Hence, new strategies such as pool-seq and low-coverage sequencing have been developed to reduce cost. Despite the lower cost, it is important to establish whether these strategies are effective in NGS studies. We applied a bioinformatics pipeline to pool-seq and low-coverage retinoblastoma data obtained from tumor samples only. Retinoblastoma is a childhood eye malignancy that is initiated by RB1 mutation or MYCN amplification and can lead to loss of vision in one or both eyes, and sometimes to death. We applied our pipeline to the retinoblastoma data and to two other datasets to test its validity and to compare performance. High-confidence variant calls from the Genome in a Bottle Consortium were used for these purposes. We observed that our pipeline called a higher number of variants than a standard pipeline for all three datasets. In addition, the recall and F-score values were notably better with our pipeline. We further present our results on the disease data in terms of variants, variant types and disease-related genes. This study provides a guideline for running an NGS data analysis pipeline on combined pool-seq and low-coverage sequencing data. To obtain more conclusive results for these two strategies, we recommend using cancer data with higher mutation rates and larger pools.
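The recall and F-score comparison against a high-confidence truth set can be illustrated with a small sketch. It assumes that variants are compared by exact match on chromosome, position, reference allele and alternate allele, which is a simplification that dedicated benchmarking tools refine. The `benchmark` function and the toy variants are hypothetical and do not come from the study.

```python
from typing import Dict, Set, Tuple

# A variant is identified by (chromosome, position, reference allele, alternate allele).
Variant = Tuple[str, int, str, str]


def benchmark(called: Set[Variant], truth: Set[Variant]) -> Dict[str, float]:
    """Compare a call set with a truth set using exact-match comparison."""
    tp = len(called & truth)   # called and present in the truth set
    fp = len(called - truth)   # called but absent from the truth set
    fn = len(truth - called)   # in the truth set but missed by the caller
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f_score": f_score}


if __name__ == "__main__":
    truth = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
             ("chr2", 40, "G", "A"), ("chr2", 900, "T", "C")}
    called = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
              ("chr2", 40, "G", "A"), ("chr3", 7, "A", "C")}
    print(benchmark(called, truth))
```

Calling more variants raises recall only when the extra calls are in the truth set; otherwise they count as false positives and lower precision, which is why the F-score is reported alongside recall.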
Affiliation(s)
- Hilal Kaya
- Department of Computer Engineering, Ankara Yildirim Beyazit University, 06010, Ankara, Turkey
54
Heller D, Vingron M. SVIM: structural variant identification using mapped long reads. Bioinformatics 2019; 35:2907-2915. [PMID: 30668829 PMCID: PMC6735718 DOI: 10.1093/bioinformatics/btz041] [Citation(s) in RCA: 152] [Impact Index Per Article: 38.0] [Received: 07/26/2018] [Revised: 01/04/2019] [Accepted: 01/22/2019] [Indexed: 02/07/2023]
Abstract
Motivation: Structural variants are defined as genomic variants larger than 50 bp. They have been shown to affect more bases in any given genome than single-nucleotide polymorphisms or small insertions and deletions. Additionally, they have great impact on human phenotype and diversity and have been linked to numerous diseases. Due to their size and association with repeats, they are difficult to detect by shotgun sequencing, especially when based on short reads. Long read, single-molecule sequencing technologies like those offered by Pacific Biosciences or Oxford Nanopore Technologies produce reads with a length of several thousand base pairs. Despite the higher error rate and sequencing cost, long-read sequencing offers many advantages for the detection of structural variants. Yet, available software tools still do not fully exploit the possibilities.
Results: We present SVIM, a tool for the sensitive detection and precise characterization of structural variants from long-read data. SVIM consists of three components for the collection, clustering and combination of structural variant signatures from read alignments. It discriminates five different variant classes including similar types, such as tandem and interspersed duplications and novel element insertions. SVIM is unique in its capability of extracting both the genomic origin and destination of duplications. It compares favorably with existing tools in evaluations on simulated data and real datasets from Pacific Biosciences and Nanopore sequencing machines.
Availability and implementation: The source code and executables of SVIM are available on GitHub: github.com/eldariont/svim. SVIM has been implemented in Python 3 and published on bioconda and the Python Package Index.
Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- David Heller
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany
- Martin Vingron
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany
55
Dohál M, Porvazník I, Pršo K, Rasmussen EM, Solovič I, Mokrý J. Whole-genome sequencing and Mycobacterium tuberculosis: Challenges in sample preparation and sequencing data analysis. Tuberculosis (Edinb) 2020; 123:101946. [PMID: 32741530 DOI: 10.1016/j.tube.2020.101946] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 03/16/2020] [Revised: 04/29/2020] [Accepted: 04/30/2020] [Indexed: 12/26/2022]
Abstract
The number of patients with tuberculosis (TB) caused by resistant strains is still alarming. It is therefore necessary to determine resistance more quickly and precisely than is possible with the currently used phenotypic and genotypic methods. In recent years, technological advances have been made and whole-genome sequencing (WGS) has been introduced as a part of routine diagnostics in clinical laboratories. Comparing a wide range of mycobacterial genomic variations with a reference genome leads to a consistent evaluation of the molecular epidemiology of Mycobacterium tuberculosis (M. tuberculosis) and of its resistance to a wide range of anti-TB drugs. The quality of the obtained sequencing data is closely related to the type of sample and to the methods used for DNA extraction and sequencing library preparation. Moreover, the correct interpretation of results is also influenced by bioinformatic data processing. A large number of bioinformatics pipelines are currently available, and their sensitivity varies with the size of the underlying databases of relevant mutations. This review focuses on the individual steps of the sequencing workflow and on factors that may affect the interpretation of the final results.
Affiliation(s)
- Matúš Dohál
- Department of Pharmacology and Biomedical Center Martin, Jessenius Faculty of Medicine, Comenius University, Martin, Slovakia
- Igor Porvazník
- National Institute of Tuberculosis, Lung Diseases and Thoracic Surgery, Vyšné Hágy, Slovakia; Faculty of Health, Catholic University, Ružomberok, Slovakia
- Kristián Pršo
- Department of Pharmacology and Biomedical Center Martin, Jessenius Faculty of Medicine, Comenius University, Martin, Slovakia
- Erik Michael Rasmussen
- International Reference Laboratory of Mycobacteriology, Statens Serum Institut, Copenhagen, Denmark
- Ivan Solovič
- National Institute of Tuberculosis, Lung Diseases and Thoracic Surgery, Vyšné Hágy, Slovakia
- Juraj Mokrý
- Department of Pharmacology and Biomedical Center Martin, Jessenius Faculty of Medicine, Comenius University, Martin, Slovakia
56
Abstract
Pheochromocytoma (PCC) is a rare, mostly benign tumour of the adrenal medulla. Hereditary PCC accounts for ~35% of cases and has been associated with germline mutations in several cancer susceptibility genes (e.g., KIF1B, SDHB, VHL, SDHD, RET). We performed whole-exome sequencing in a family with four PCC-affected patients in two consecutive generations and identified a potential novel candidate pathogenic variant in the REXO2 gene that affects splicing (NM_015523.3: c.531-1G>T) and co-segregated with the phenotype in the family. REXO2 encodes the RNA exonuclease 2 protein and localizes to 11q23, a chromosomal region displaying allelic imbalance in PCC. The REXO2 protein has been associated with DNA repair, replication and recombination processes, and thus its inactivation may contribute to tumorigenesis. While the study suggests that this novel REXO2 gene variant underlies PCC in this family, additional functional studies are required to establish the putative role of the REXO2 gene in PCC predisposition.
57
Zhang J, Zeng Y, Liu B, Deng X. MerP/MerT-mediated mechanism: A different approach to mercury resistance and bioaccumulation by marine bacteria. J Hazard Mater 2020; 388:122062. [PMID: 31955028 DOI: 10.1016/j.jhazmat.2020.122062] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 09/25/2019] [Revised: 01/08/2020] [Accepted: 01/08/2020] [Indexed: 06/10/2023]
Abstract
The mechanism underlying mercury resistance and bioaccumulation in marine bacteria remains poorly understood. The marine bacterium Pseudomonas pseudoalcaligenes S1 is resistant to 120 mg/L Hg²⁺ and has a bioaccumulation capacity of 133.33 mg/g. Accordingly, the Hg²⁺ resistance and bioaccumulation mechanism of S1 was investigated at the molecular and cellular levels. Annotation of the S1 transcriptome reveals 772 differentially expressed genes, including the Hg²⁺-relevant genes merT, merP and merA. The merT and merP genes each have three complete copies in the S1 genome, whereas merA has only one. To evaluate the function of these Hg²⁺-relevant genes, three recombinant strains were constructed to express MerA (named A), MerT/MerP (TP) and MerT/MerP/MerA (TPA), respectively. The results show that the Hg²⁺ resistance of strains TP, TPA, and A is improved, with minimum inhibitory concentrations (MICs) of 60 mg/L, 40 mg/L, and 20 mg/L, respectively, compared with 2 mg/L for the host strain. Strains TP and TPA exhibit enhanced Hg²⁺ bioaccumulation capacity, while strain A does not differ from the control. Their equilibrium Hg²⁺ bioaccumulation capacities are 110.48 mg/g, 94.49 mg/g, 83.76 mg/g and 82.29 mg/g, respectively. In summary, unlike most microorganisms, which achieve Hg²⁺ resistance through a MerA-mediated mechanism, marine bacterium S1 achieves Hg²⁺ resistance and bioaccumulation capability via a MerT/MerP-mediated strategy.
Affiliation(s)
- Jinlong Zhang
- Shenzhen Key Laboratory of Marine Bioresource and Eco-environmental Science, College of Life Sciences and Oceanography, Shenzhen University, Shenzhen 518060, China
- Yiting Zeng
- Shenzhen Key Laboratory of Marine Bioresource and Eco-environmental Science, College of Life Sciences and Oceanography, Shenzhen University, Shenzhen 518060, China
- Bing Liu
- School of Traffic and Environment, Shenzhen Institute of Information Technology, Shenzhen 518172, China
- Xu Deng
- Shenzhen Key Laboratory of Marine Bioresource and Eco-environmental Science, College of Life Sciences and Oceanography, Shenzhen University, Shenzhen 518060, China
58
Nyangiri OA, Noyes H, Mulindwa J, Ilboudo H, Kabore JW, Ahouty B, Koffi M, Asina OF, Mumba D, Ofon E, Simo G, Kimuda MP, Enyaru J, Alibu VP, Kamoto K, Chisi J, Simuunza M, Camara M, Sidibe I, MacLeod A, Bucheton B, Hall N, Hertz-Fowler C, Matovu E. Copy number variation in human genomes from three major ethno-linguistic groups in Africa. BMC Genomics 2020; 21:289. [PMID: 32272904 PMCID: PMC7147055 DOI: 10.1186/s12864-020-6669-y] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 08/01/2019] [Accepted: 03/12/2020] [Indexed: 01/02/2023]
Abstract
Background: Copy number variation is an important class of genomic variation that has been reported in 75% of the human genome. However, it is underreported in African populations. Copy number variants (CNVs) could have important impacts on disease susceptibility and environmental adaptation. To describe CNVs and their possible impacts in Africans, we sequenced genomes of 232 individuals from three major African ethno-linguistic groups: (1) Niger Congo A from Guinea and Côte d’Ivoire, (2) Niger Congo B from Uganda and the Democratic Republic of Congo and (3) Nilo-Saharans from Uganda. We used GenomeSTRiP and cn.MOPS to identify copy number variant regions (CNVRs).
Results: We detected 7608 CNVRs, of which 2172 were only deletions, 2384 were only insertions and 3052 had both. We detected 224 previously un-described CNVRs. The majority of novel CNVRs were present at low frequency and were not shared between populations. We tested for evidence of selection associated with CNVs and also for population structure. Signatures of selection identified previously, using SNPs from the same populations, were overrepresented in CNVRs. When CNVs were tagged with SNP haplotypes to identify SNPs that could predict the presence of CNVs, we identified haplotypes tagging 3096 CNVRs, 372 CNVRs had SNPs with evidence of selection (iHS > 3) and 222 CNVRs had both. This was more than expected (p < 0.0001) and included loci where CNVs have previously been associated with HIV, Rhesus D and preeclampsia. When integrated with 1000 Genomes CNV data, we replicated their observation of population stratification by continent but no clustering by populations within Africa, despite inclusion of Nilo-Saharans and Niger-Congo populations within our dataset.
Conclusions: Novel CNVRs in the current study increase representation of African diversity in the database of genomic variants. Over-representation of CNVRs in SNP signatures of selection and an excess of SNPs that both tag CNVs and are subject to selection show that CNVs may be the actual targets of selection at some loci. However, unlike SNPs, CNVs alone do not resolve African ethno-linguistic groups. Tag haplotypes for CNVs identified may be useful in predicting African CNVs in future studies where only SNP data is available.
Affiliation(s)
- Oscar A Nyangiri
- College of Veterinary Medicine, Animal Resources and Biosecurity, Makerere University, P. O. Box 7062, Kampala, Uganda; Epidemiology and Demography Department, Kenya Medical Research Institute (KEMRI)/Wellcome Trust Research Programme, P.O. Box 230, Kilifi, Kenya
- Harry Noyes
- Centre for Genomic Research, University of Liverpool, Liverpool, L69 7ZB, UK
- Julius Mulindwa
- College of Veterinary Medicine, Animal Resources and Biosecurity, Makerere University, P. O. Box 7062, Kampala, Uganda
- Hamidou Ilboudo
- Institut de Recherche en Sciences de la Santé (IRSS) - Unité de Recherche Clinique de Nanoro (URCN), Nanoro, Burkina Faso
- Justin Windingoudi Kabore
- Centre International de Recherche-Développement sur l'Elevage en zones Subhumides (CIRDES), Unité des Maladies à Vecteurs et Biodiversité (UMaVeB), 01 BP 454, Bobo-Dioulasso, 01, Burkina Faso
- Bernardin Ahouty
- Felix Houphouet Boigny University (UFHB), Cocody, Abidjan, Côte d'Ivoire
- Mathurin Koffi
- Université Jean Lorougnon Guédé (UJLoG) de Daloa, Daloa, Côte d'Ivoire
- Olivier Fataki Asina
- Institut National de Recherche Biomedicale, Avenue de la Democratie, Kinshasa Gombe, P. O. Box 1197, Kinshasa, Democratic Republic of Congo
- Dieudonne Mumba
- Institut National de Recherche Biomedicale, Avenue de la Democratie, Kinshasa Gombe, P. O. Box 1197, Kinshasa, Democratic Republic of Congo
- Elvis Ofon
- Faculty of Science, University of Dschang, P. O. Box 67, Dschang, Cameroon
- Gustave Simo
- Faculty of Science, University of Dschang, P. O. Box 67, Dschang, Cameroon
- Magambo Phillip Kimuda
- College of Veterinary Medicine, Animal Resources and Biosecurity, Makerere University, P. O. Box 7062, Kampala, Uganda
- John Enyaru
- College of Natural Sciences, Makerere University, P. O. Box 7062, Kampala, Uganda
- Vincent Pius Alibu
- College of Natural Sciences, Makerere University, P. O. Box 7062, Kampala, Uganda
- Kelita Kamoto
- College of Medicine, Department of Basic Medical Sciences, University of Malawi, Private Bag 360, Chichiri, Blantyre, 3, Malawi
- John Chisi
- College of Medicine, Department of Basic Medical Sciences, University of Malawi, Private Bag 360, Chichiri, Blantyre, 3, Malawi
- Martin Simuunza
- Department of Disease Control, School of Veterinary Medicine, University of Zambia, P. O. Box 32379, Lusaka, Zambia
- Mamadou Camara
- Programme National de Lutte contre la Trypanosomose Humaine Africaine, BP 851, Conakry, Guinea
- Issa Sidibe
- Centre International de Recherche-Développement sur l'Elevage en zones Subhumides (CIRDES), Unité des Maladies à Vecteurs et Biodiversité (UMaVeB), 01 BP 454, Bobo-Dioulasso, 01, Burkina Faso
- Annette MacLeod
- Wellcome Centre for Molecular Parasitology, Institute of Biodiversity, Animal Health and Comparative Medicine, Garscube Estate, Glasgow, G61 1QH, UK
- Bruno Bucheton
- Programme National de Lutte contre la Trypanosomose Humaine Africaine, BP 851, Conakry, Guinea; Institut de Recherche pour le Développement (IRD), IRD-CIRAD 177, TA A-17/G, Campus International de Baillarguet, F-34398, Montpellier, France
- Neil Hall
- Centre for Genomic Research, University of Liverpool, Liverpool, L69 7ZB, UK; Present address: Earlham Institute Norwich Research Park Innovation Centre, Colney Ln, Norwich, NR4 7UZ, UK
- Enock Matovu
- College of Veterinary Medicine, Animal Resources and Biosecurity, Makerere University, P. O. Box 7062, Kampala, Uganda
59
Challenges and opportunities for strain verification by whole-genome sequencing. Sci Rep 2020; 10:5873. [PMID: 32245992 PMCID: PMC7125075 DOI: 10.1038/s41598-020-62364-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 05/14/2019] [Accepted: 03/11/2020] [Indexed: 11/28/2022]
Abstract
Laboratory strains, cell lines, and other genetic materials change hands frequently in the life sciences. Despite evidence that such materials are subject to mix-ups, contamination, and accumulation of secondary mutations, verification of strains and samples is not an established part of many experimental workflows. With the plummeting cost of next generation technologies, it is conceivable that whole genome sequencing (WGS) could be applied to routine strain and sample verification in the future. To demonstrate the need for strain validation by WGS, we sequenced haploid yeast segregants derived from a popular commercial mutant collection and identified several unexpected mutations. We determined that available bioinformatics tools may be ill-suited for verification and highlight the importance of finishing reference genomes for commonly used laboratory strains.
60
Schilbert HM, Rempel A, Pucker B. Comparison of Read Mapping and Variant Calling Tools for the Analysis of Plant NGS Data. Plants (Basel) 2020; 9:E439. [PMID: 32252268 PMCID: PMC7238416 DOI: 10.3390/plants9040439] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 03/15/2020] [Revised: 03/28/2020] [Accepted: 03/30/2020] [Indexed: 12/30/2022]
Abstract
High-throughput sequencing technologies have rapidly developed during the past years and have become an essential tool in plant sciences. However, the analysis of genomic data remains challenging and relies mostly on the performance of automatic pipelines. Frequently applied pipelines involve the alignment of sequence reads against a reference sequence and the identification of sequence variants. Since most benchmarking studies of bioinformatics tools for this purpose have been conducted on human datasets, there is a lack of benchmarking studies in plant sciences. In this study, we evaluated the performance of 50 different variant calling pipelines, including five read mappers and ten variant callers, on six real plant datasets of the model organism Arabidopsis thaliana. Sets of variants were evaluated based on various parameters including sensitivity and specificity. We found that all investigated tools are suitable for analysis of NGS data in plant research. When looking at different performance metrics, BWA-MEM and Novoalign were the best mappers and GATK returned the best results in the variant calling step.
Affiliation(s)
- Hanna Marie Schilbert
- Genetics and Genomics of Plants, CeBiTec and Faculty of Biology, Bielefeld University, 33615 Bielefeld, Germany
- Andreas Rempel
- Genetics and Genomics of Plants, CeBiTec and Faculty of Biology, Bielefeld University, 33615 Bielefeld, Germany
- Graduate School DILS, Bielefeld Institute for Bioinformatics Infrastructure (BIBI), Faculty of Technology, Bielefeld University, 33615 Bielefeld, Germany
- Boas Pucker
- Genetics and Genomics of Plants, CeBiTec and Faculty of Biology, Bielefeld University, 33615 Bielefeld, Germany
- Molecular Genetics and Physiology of Plants, Faculty of Biology and Biotechnology, Ruhr-University Bochum, 44801 Bochum, Germany
61
Itoh T, Onuki R, Tsuda M, Oshima M, Endo M, Sakai H, Tanaka T, Ohsawa R, Tabei Y. Foreign DNA detection by high-throughput sequencing to regulate genome-edited agricultural products. Sci Rep 2020; 10:4914. [PMID: 32188926 PMCID: PMC7080720 DOI: 10.1038/s41598-020-61949-5] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 09/29/2019] [Accepted: 03/05/2020] [Indexed: 12/11/2022]
Abstract
Although the advent of several new breeding techniques (NBTs) is revolutionizing agricultural production processes, technical information necessary for their regulation is yet to be provided. Here, we show that high-throughput DNA sequencing is effective for the detection of unintended remaining foreign DNA segments in genome-edited rice. A simple k-mer detection method is presented and validated through a series of computer simulations and real data analyses. The data show that a short foreign DNA segment of 20 nucleotides can be detected and the probability that the segment is overlooked is 10^-3 or less if the average sequencing depth is 30 or more, while the number of false hits is less than 1 on average. This method was applied to real sequencing data, and the presence and absence of an external DNA segment were successfully proven. Additionally, our in-depth analyses also identified some weaknesses in current DNA sequencing technologies. Hence, for a rigorous safety assessment, the combination of k-mer detection and another method, such as Southern blot assay, is recommended. The results presented in this study will lay the foundation for the regulation of NBT products, where foreign DNA is utilized during their generation.
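The abstract does not describe the k-mer method in detail, so the following is only a minimal sketch of the general idea: collect every k-mer of a known foreign sequence (for example a vector) on both strands, then scan the sequencing reads for exact matches. The function names, the default k = 20 (chosen to mirror the 20-nucleotide segment mentioned above) and the toy sequences are assumptions for illustration and are not the authors' implementation.

```python
from typing import Dict, Iterable, Set

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of distinct k-mers in seq (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def foreign_kmer_hits(reads: Iterable[str], foreign: str, k: int = 20) -> Dict[str, int]:
    """Count, for each foreign k-mer, how many reads contain it exactly.

    Both strands of the foreign sequence are indexed, because a read may
    come from either strand.
    """
    targets = kmers(foreign, k) | kmers(reverse_complement(foreign), k)
    hits: Dict[str, int] = {}
    for read in reads:
        # kmers() returns a set, so each read counts at most once per k-mer.
        for kmer in kmers(read, k) & targets:
            hits[kmer] = hits.get(kmer, 0) + 1
    return hits


if __name__ == "__main__":
    # Toy example with k = 4 so the sequences stay readable.
    reads = ["CCGATTCC", "GGTGTAGG", "CCCCCCCC"]
    print(foreign_kmer_hits(reads, "GATTACA", k=4))
```

A practical screen would presumably also discard foreign k-mers that occur in the unedited host genome, since those would produce false hits; that filtering step is omitted here.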
Affiliation(s)
- Takeshi Itoh
- Bioinformatics Team, Advanced Analysis Center, National Agriculture and Food Research Organization, Tsukuba, Ibaraki, 305-8602, Japan
- National Institute of Agrobiological Sciences, Tsukuba, Ibaraki, 305-8602, Japan
- Ritsuko Onuki
- Bioinformatics Team, Advanced Analysis Center, National Agriculture and Food Research Organization, Tsukuba, Ibaraki, 305-8602, Japan
- National Institute of Agrobiological Sciences, Tsukuba, Ibaraki, 305-8602, Japan
- Research Institute, National Cancer Center Japan, Chuo-ku, Tokyo, 104-0045, Japan
- Mai Tsuda
- Tsukuba Plant Innovation Research Center, University of Tsukuba, Tsukuba, Ibaraki, 305-8572, Japan
- Masao Oshima
- National Institute of Agrobiological Sciences, Tsukuba, Ibaraki, 305-8602, Japan
- Tsukuba Plant Innovation Research Center, University of Tsukuba, Tsukuba, Ibaraki, 305-8572, Japan
- Masaki Endo
- National Institute of Agrobiological Sciences, Tsukuba, Ibaraki, 305-8602, Japan
- Institute of Agrobiological Sciences, National Agriculture and Food Research Organization, Tsukuba, Ibaraki, 305-8634, Japan
- Hiroaki Sakai
- Bioinformatics Team, Advanced Analysis Center, National Agriculture and Food Research Organization, Tsukuba, Ibaraki, 305-8602, Japan
- National Institute of Agrobiological Sciences, Tsukuba, Ibaraki, 305-8602, Japan
- Tsuyoshi Tanaka
- Bioinformatics Team, Advanced Analysis Center, National Agriculture and Food Research Organization, Tsukuba, Ibaraki, 305-8602, Japan
- Institute of Crop Science, National Agriculture and Food Research Organization, Tsukuba, Ibaraki, 305-8518, Japan
- Ryo Ohsawa
- Tsukuba Plant Innovation Research Center, University of Tsukuba, Tsukuba, Ibaraki, 305-8572, Japan
- Yutaka Tabei
- National Institute of Agrobiological Sciences, Tsukuba, Ibaraki, 305-8602, Japan
- Institute of Agrobiological Sciences, National Agriculture and Food Research Organization, Tsukuba, Ibaraki, 305-8634, Japan
62
Laissue P, Vaiman D. Exploring the Molecular Aetiology of Preeclampsia by Massive Parallel Sequencing of DNA. Curr Hypertens Rep 2020; 22:31. [PMID: 32172383 DOI: 10.1007/s11906-020-01039-z] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 12/12/2022]
Abstract
Purpose of review: This manuscript reviews, for the first time, studies that applied next-generation sequencing (NGS) to DNA from women with preeclampsia (PE).
Recent findings: Describing markers for the early detection of PE is an essential task because, although the associated molecular dysfunction begins early in pregnancy, the disease's clinical signs usually appear late in pregnancy. Although several biochemical biomarkers have been proposed, their use in clinical environments is still limited, which encourages research into the genetic origin of PE. Hundreds of genes involved in numerous implantation- and placentation-related biological processes may be coherent candidates for PE aetiology. NGS offers new technical possibilities for studying PE, as it enables large genomic regions to be analysed at affordable cost. This technique has facilitated the description of genes contributing to the molecular origin of a significant number of monogenic and complex diseases. Regarding PE, NGS of DNA has been used in familial and isolated cases, enabling new genes potentially related to the phenotype to be proposed. For a better understanding of NGS, its technical aspects, applications and limitations are presented first. Thereafter, NGS studies of DNA in familial and non-familial cases are described, including pitfalls and positive findings. The information given here should enable scientists and clinicians to analyse and design new studies permitting the identification of novel, clinically useful molecular PE markers.
Affiliation(s)
- Paul Laissue
- Biopas Laboratoires, Biopas Group, Bogotá, Colombia
- Inserm U1016, CNRS UMR8104, Institut Cochin, équipe FGTB, 24, rue du faubourg Saint-Jacques, 75014, Paris, France
- CIGGUR Genetics Group, School of Medicine and Health Sciences, El Rosario University, Bogotá, Colombia
- Daniel Vaiman
- Inserm U1016, CNRS UMR8104, Institut Cochin, équipe FGTB, 24, rue du faubourg Saint-Jacques, 75014, Paris, France

63
CNV Radar: an improved method for somatic copy number alteration characterization in oncology. BMC Bioinformatics 2020; 21:98. [PMID: 32143562 PMCID: PMC7060549 DOI: 10.1186/s12859-020-3397-x] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 06/28/2018] [Accepted: 02/07/2020] [Indexed: 12/15/2022] Open
Abstract
Background: Cancer-associated copy number variation (CNV) events provide important information for identifying patient subgroups and suggesting treatment strategies. Technical and logistical issues, however, make it challenging to accurately detect abnormal copy number events in a cost-effective manner in clinical studies. Results: Here we present CNV Radar, a software tool that utilizes next-generation sequencing read depth information and variant allele frequency patterns to infer the true copy number status of genes and genomic regions from whole exome sequencing data. Evaluation of CNV Radar in a public multiple myeloma dataset demonstrated that CNV Radar was able to detect a variety of CNVs associated with risk of progression, and we observed > 70% concordance with fluorescence in situ hybridization (FISH) results. Compared to other CNV callers, CNV Radar showed high sensitivity and specificity. Similar results were observed when comparing CNV Radar calls to single nucleotide polymorphism array results from acute myeloid leukemia and prostate cancer datasets available on TCGA. More importantly, CNV Radar demonstrated its utility in the clinical trial setting: in POLLUX and CASTOR, two phase 3 studies in patients with relapsed or refractory multiple myeloma, we observed a high concordance rate with FISH for del17p, a risk-defining CNV event (88% in POLLUX and 90% in CASTOR), therefore allowing for efficacy assessments in clinically relevant disease subgroups. Our case studies also showed that CNV Radar is capable of detecting abnormalities such as copy-neutral loss of heterozygosity that elude other approaches. Conclusions: We demonstrated that CNV Radar is more sensitive than other CNV detection methods, accurately detects clinically important cytogenetic events, and allows for further interrogation of novel disease biology. Overall, CNV Radar exhibited high concordance with standard methods such as FISH, and its success in the POLLUX and CASTOR clinical trials demonstrated its potential utility for informing clinical and therapeutic decisions.

64
Dornbos P, Arkatkar AA, LaPres JJ. An Automated Method To Predict Mouse Gene and Protein Sequences Using Variant Data. G3 (Bethesda) 2020; 10:925-932. [PMID: 31911484 PMCID: PMC7056971 DOI: 10.1534/g3.119.400983] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/02/2019] [Accepted: 12/24/2019] [Indexed: 01/29/2023]
Abstract
With recent advances in sequencing technologies, the scientific community has begun to probe the potential genetic bases behind complex phenotypes in humans and model organisms. In many cases, the genomes of genetically distinct strains of model organisms, such as the mouse (Mus musculus), have not been fully sequenced. Here, we report on a tool designed to use single-nucleotide polymorphism (SNP) and insertion-deletion (indel) data to predict gene, mRNA, and protein sequences for up to 36 genetically distinct mouse strains. By automated querying of freely accessible databases through a graphical interface, the software requires no data and little computational experience. As a proof of concept, we predicted the gene and amino acid sequence of the aryl hydrocarbon receptor (Ahr) for all inbred mouse strains of which variant data were currently available through Mouse Genome Project. Predicted sequences were compared with fully sequenced genomes to show that the tool is effective in predicting gene and protein sequences.
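The core idea of predicting a strain's sequence from variant data can be illustrated with a minimal sketch. This is not the authors' tool: the `Variant` type, the `apply_variants` function, and the VCF-style convention (1-based position, reference allele, alternate allele) are assumptions made only for illustration.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """Hypothetical VCF-style record (illustrative, not the tool's format)."""
    pos: int  # 1-based position of the first reference base affected
    ref: str  # reference allele at that position
    alt: str  # alternate allele observed in the strain


def apply_variants(reference: str, variants: list[Variant]) -> str:
    """Return the strain sequence obtained by applying SNPs and indels.

    Assumes the variants do not overlap one another.
    """
    seq = reference
    # Apply from the highest position downwards so that indels do not
    # shift the coordinates of variants that are still to be applied.
    for v in sorted(variants, key=lambda x: x.pos, reverse=True):
        start = v.pos - 1
        end = start + len(v.ref)
        if seq[start:end] != v.ref:
            raise ValueError(f"reference mismatch at position {v.pos}")
        seq = seq[:start] + v.alt + seq[end:]
    return seq


if __name__ == "__main__":
    ref = "ATGGCCTTTAAA"
    strain_variants = [
        Variant(5, "C", "T"),    # SNP
        Variant(7, "TTT", "T"),  # 2-bp deletion
    ]
    print(apply_variants(ref, strain_variants))
```

Applying variants from the 3' end backwards is a design choice that avoids tracking a running coordinate offset; a predicted mRNA or protein sequence would then require exon coordinates and a translation step, which are omitted here.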
Affiliation(s)
- Peter Dornbos
- Department of Biochemistry and Molecular Biology and
- Institute for Integrative Toxicology, Michigan State University, East Lansing, Michigan
- John J LaPres
- Department of Biochemistry and Molecular Biology and
- Institute for Integrative Toxicology, Michigan State University, East Lansing, Michigan

65
Tattini L, Tellini N, Mozzachiodi S, D'Angiolo M, Loeillet S, Nicolas A, Liti G. Accurate Tracking of the Mutational Landscape of Diploid Hybrid Genomes. Mol Biol Evol 2020; 36:2861-2877. [PMID: 31397846 PMCID: PMC6878955 DOI: 10.1093/molbev/msz177] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Indexed: 12/22/2022] Open
Abstract
Mutations, recombinations, and genome duplications may promote genetic diversity and trigger evolutionary processes. However, quantifying these events in diploid hybrid genomes is challenging. Here, we present an integrated experimental and computational workflow to accurately track the mutational landscape of yeast diploid hybrids (MuLoYDH) in terms of single-nucleotide variants, small insertions/deletions, copy-number variants, aneuploidies, and loss-of-heterozygosity. Pairs of haploid Saccharomyces parents were combined to generate ancestor hybrids with phased genomes and varying levels of heterozygosity. These diploids were evolved under different laboratory protocols, in particular mutation accumulation experiments. Variant simulations enabled the efficient integration of competitive and standard mapping of short reads, depending on local levels of heterozygosity. Experimental validations proved the high accuracy and resolution of our computational approach. Finally, applying MuLoYDH to four different diploids revealed striking genetic background effects. Homozygous Saccharomyces cerevisiae showed a ∼4-fold higher mutation rate compared with its closely related species S. paradoxus. Intraspecies hybrids unveiled that a substantial fraction of the genome (∼250 bp per generation) was shaped by loss-of-heterozygosity, a process strongly inhibited in interspecies hybrids by high levels of sequence divergence between homologous chromosomes. In contrast, interspecies hybrids exhibited higher single-nucleotide mutation rates compared with intraspecies hybrids. MuLoYDH provided an unprecedented quantitative insight into the evolutionary processes that mold diploid yeast genomes and can be generalized to other genetic systems.
Affiliation(s)
- Lorenzo Tattini
- CNRS UMR7284, INSERM, IRCAN, Université Côte d'Azur, Nice, France
- Nicolò Tellini
- CNRS UMR7284, INSERM, IRCAN, Université Côte d'Azur, Nice, France
- Sophie Loeillet
- CNRS UMR3244, Institut Curie, PSL Research University, Paris, France
- Alain Nicolas
- CNRS UMR3244, Institut Curie, PSL Research University, Paris, France
- Gianni Liti
- CNRS UMR7284, INSERM, IRCAN, Université Côte d'Azur, Nice, France

66
Yokoyama TT, Kasahara M. Visualization tools for human structural variations identified by whole-genome sequencing. J Hum Genet 2020; 65:49-60. [PMID: 31666648 PMCID: PMC8075883 DOI: 10.1038/s10038-019-0687-0] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 08/05/2019] [Revised: 09/27/2019] [Accepted: 10/02/2019] [Indexed: 01/02/2023]
Abstract
Visualizing structural variations (SVs) is a critical step for finding associations between SVs and human traits or diseases. Given that there are many sequencing platforms used for SV identification and given that how best to visualize SVs together with other data, such as read alignments and annotations, depends on research goals, there are dozens of SV visualization tools designed for different research goals and sequencing platforms. Here, we provide a comprehensive survey of over 30 SV visualization tools to help users choose which tools to use. This review targets users who wish to visualize a set of SVs identified from the massively parallel sequencing reads of an individual human genome. We first categorize the ways in which SV visualization tools display SVs into ten major categories, which we denote as view modules. View modules allow readers to understand the features of each SV visualization tool quickly. Next, we introduce the features of individual SV visualization tools from several aspects, including whether SV views are integrated with annotations, whether long-read alignment is displayed, whether underlying data structures are graph-based, the type of SVs shown, whether auditing is possible, whether bird's eye view is available, sequencing platforms, and the number of samples. We hope that this review will serve as a guide for readers on the currently available SV visualization tools and lead to the development of new SV visualization tools in the near future.
Affiliation(s)
- Toshiyuki T Yokoyama
- Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Chiba, Japan
- Masahiro Kasahara
- Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Chiba, Japan.

67
Calarco L, Barratt J, Ellis J. Detecting sequence variants in clinically important protozoan parasites. Int J Parasitol 2019; 50:1-18. [PMID: 31857072 DOI: 10.1016/j.ijpara.2019.10.004] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 06/19/2019] [Revised: 09/29/2019] [Accepted: 10/01/2019] [Indexed: 02/06/2023]
Abstract
Second and third generation sequencing methods are crucial for population genetic studies, and variant detection is a popular approach for exploiting this sequence data. While mini- and microsatellites are historically useful markers for studying important Protozoa such as Toxoplasma and Plasmodium spp., detecting non-repetitive variants such as those found in genes can be fundamental to investigating a pathogen's biology. These variants, namely single nucleotide polymorphisms and insertions and deletions, can help elucidate the genetic basis of an organism's pathogenicity, identify selective pressures, and resolve phylogenetic relationships. They also have the added benefit of possessing a comparatively low mutation rate, which contributes to their stability. However, there is a plethora of variant analysis tools with nuanced pipelines and conflicting recommendations for best practise, which can be confounding. This lack of standardisation means that variant analysis requires careful parameter optimisation, an understanding of its limitations, and the availability of high quality data. This review explores the value of variant detection when applied to non-model organisms such as clinically important protozoan pathogens. The limitations of current methods are discussed, including special considerations that require the end-users' attention to ensure that the results generated are reproducible, and the biological conclusions drawn are valid.
Affiliation(s)
- Larissa Calarco
- School of Life Sciences, University of Technology Sydney, PO Box 123, Broadway, NSW 2007, Australia.
- Joel Barratt
- School of Life Sciences, University of Technology Sydney, PO Box 123, Broadway, NSW 2007, Australia
- John Ellis
- School of Life Sciences, University of Technology Sydney, PO Box 123, Broadway, NSW 2007, Australia

68
Palacios J, de la Hoya M, Bellosillo B, de Juan I, Matías-Guiu X, Lázaro C, Palanca S, Osorio A, Rojo F, Rosa-Rosa JM, Cigudosa JC. Mutational Screening of BRCA1/2 Genes as a Predictive Factor for Therapeutic Response in Epithelial Ovarian Cancer: A Consensus Guide from the Spanish Society of Pathology (SEAP-IAP) and the Spanish Society of Human Genetics (AEGH). Virchows Arch 2019; 476:195-207. [PMID: 31797087 PMCID: PMC7028830 DOI: 10.1007/s00428-019-02709-3] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 08/30/2019] [Revised: 10/11/2019] [Accepted: 10/25/2019] [Indexed: 12/21/2022]
Abstract
Germline/somatic BRCA-mutated ovarian carcinomas (OC) are associated with a better response to platinum-based chemotherapy and a better long-term prognosis than non-BRCA-associated OCs. In addition, these mutations are predictive factors for response to poly(ADP-ribose) polymerase (PARP) inhibitors. Different position papers have addressed the clinical recommendations for BRCA testing in OC. This consensus guide represents a collection of technical recommendations to address the detection of BRCA1/2 mutations in the molecular diagnostic testing strategy for OC. Under the coordination of the Spanish Society of Pathology (SEAP-IAP) and the Spanish Society of Human Genetics (AEGH), these recommendations have been developed by pathologists and geneticists taking into account previously published recommendations and their experience in the molecular characterization of these genes. Since the implementation of BRCA testing as a predictive factor can initiate the workflow by testing germline mutations in the blood or by testing both germline and somatic mutations in tumor tissue, distinctive features of both strategies are discussed. Additionally, the recommendations included in this paper provide some references, quality parameters, and genomic tools aimed at standardizing and facilitating the clinical genomic diagnosis of OC.
Affiliation(s)
- J Palacios
- Servicio de Anatomía Patológica, Hospital Universitario Ramón y Cajal, 28034, Madrid, Spain.
- Instituto Ramón y Cajal de Investigación Sanitaria, 28034, Madrid, Spain.
- Universidad de Alcalá, 28801, Alcalá de Henares, Spain.
- CIBER-ONC, Instituto de Salud Carlos III, 28029, Madrid, Spain.
- M de la Hoya
- CIBER-ONC, Instituto de Salud Carlos III, 28029, Madrid, Spain
- Molecular Oncology Laboratory, Hospital Clinico San Carlos, IdISSC (Instituto de Investigación Sanitaria del Hospital Clínico San Carlos), Madrid, Spain
- B Bellosillo
- CIBER-ONC, Instituto de Salud Carlos III, 28029, Madrid, Spain
- Laboratorio de Diagnóstico Molecular, Servicio de Patología, Hospital del Mar, 08003, Barcelona, Spain
- I de Juan
- Unidad de Biología Molecular, Servicio de Análisis Clínicos, Hospital Universitario y Politécnico La Fe, 46026, Valencia, Spain
- X Matías-Guiu
- CIBER-ONC, Instituto de Salud Carlos III, 28029, Madrid, Spain
- Servicio de Anatomía Patológica, Hospital Universitario de Bellvitge, 08908, L'Hospitalet, Spain
- C Lázaro
- CIBER-ONC, Instituto de Salud Carlos III, 28029, Madrid, Spain
- Unidad de Diagnóstico Molecular, Institut Català d'Oncologia, (ICO-IDIBELL-ONCOBELL), 08908, L'Hospitalet, Spain
- S Palanca
- Unidad de Biología Molecular, Servicio de Análisis Clínicos, Hospital Universitario y Politécnico La Fe, 46026, Valencia, Spain
- A Osorio
- Human Cancer Genetics Programme, Spanish National Cancer Centre (CNIO), 28029, Madrid, Spain
- CIBER-ER, Instituto de Salud Carlos III, 28029, Madrid, Spain
- F Rojo
- CIBER-ONC, Instituto de Salud Carlos III, 28029, Madrid, Spain
- Departamento de Patología, Fundación Jímenez-Díaz, 28040, Madrid, Spain
- J M Rosa-Rosa
- Instituto Ramón y Cajal de Investigación Sanitaria, 28034, Madrid, Spain
- CIBER-ONC, Instituto de Salud Carlos III, 28029, Madrid, Spain
- J C Cigudosa
- NIMGenetics, Parque Científico de Madrid, Campus Cantoblanco, 28049, Madrid, Spain

69
Strain-Specific Metabolic Requirements Revealed by a Defined Minimal Medium for Systems Analyses of Staphylococcus aureus. Appl Environ Microbiol 2019; 85:AEM.01773-19. [PMID: 31471305 DOI: 10.1128/aem.01773-19] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 08/01/2019] [Accepted: 08/26/2019] [Indexed: 01/08/2023] Open
Abstract
Staphylococcus aureus is a Gram-positive pathogenic bacterium that colonizes an estimated one-third of the human population and can cause a wide spectrum of disease, ranging from superficial skin infections to life-threatening sepsis. The adaptive mechanisms that contribute to the success of this pathogen remain obscure partially due to a lack of knowledge of its metabolic requirements. Systems biology approaches can be extremely useful in predicting and interpreting metabolic phenotypes; however, such approaches rely on a chemically defined minimal medium as a basis to investigate the requirements of the cell. In this study, a chemically defined minimal medium formulation, termed synthetic minimal medium (SMM), was investigated and validated to support growth of three S. aureus strains: LAC and TCH1516 (USA300 lineage), as well as D592 (USA100 lineage). The formulated SMM was used in an adaptive laboratory evolution experiment to probe the various mutational trajectories of all three strains leading to optimized growth capabilities. The evolved strains were phenotypically characterized for their growth rate and antimicrobial susceptibility. Strains were also resequenced to examine the genetic basis for observed changes in phenotype and to design follow-up metabolite supplementation assays. Our results reveal evolutionary trajectories that arose from strain-specific metabolic requirements. SMM and the evolved strains can also serve as important tools to study antibiotic resistance phenotypes of S. aureus. IMPORTANCE: As researchers try to understand and combat the development of antibiotic resistance in pathogens, there is a growing need to thoroughly understand the physiology and metabolism of the microbes. Staphylococcus aureus is a threatening pathogen with increased antibiotic resistance and well-studied virulence mechanisms. However, the adaptive mechanisms used by this pathogen to survive environmental stresses remain unclear, mostly due to the lack of information about its metabolic requirements. Defining the minimal metabolic requirements for S. aureus growth is a first step toward unraveling the mechanisms by which it adapts to metabolic stresses. Here, we present the development of a chemically defined minimal medium supporting growth of three S. aureus strains, and we reveal key genetic mutations contributing to improved growth in minimal medium.

70
Martínez-Jaramillo C, Gutierrez-Hincapie S, Arango JCO, Vásquez-Duque GM, Erazo-Garnica RM, Franco JL, Trujillo-Vargas CM. Clinical, immunological and genetic characteristic of patients with clinical phenotype associated to LRBA-deficiency in Colombia. Colomb Med (Cali) 2019; 50:176-191. [PMID: 32284663 PMCID: PMC7141146 DOI: 10.25100/cm.v50i3.3969] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 01/26/2023] Open
Abstract
Background: LPS-responsive beige-like anchor protein (LRBA) deficiency is a primary immunodeficiency disease caused by loss of LRBA protein expression due to biallelic mutations in the LRBA gene. Patients with LRBA deficiency exhibit a clinically heterogeneous syndrome. The main clinical complication of LRBA deficiency is immune dysregulation. Furthermore, hypogammaglobulinemia is found in more than half of patients with LRBA deficiency. To date, no patients with this condition have been reported in Colombia. Objective: To evaluate the expression of the LRBA protein in patients from Colombia with a clinical phenotype associated with LRBA deficiency. Methods: LRBA expression was evaluated in patients from Colombia with a clinical phenotype associated with LRBA deficiency. The clinical and immunological characteristics, and possible genetic variants in LRBA or other genes associated with the immune system, were then evaluated in patients exhibiting decreased protein expression. Results: In total, 112 patients with different clinical manifestations associated with the clinical LRBA phenotype were evaluated. LRBA expression varied greatly among healthy donors and patients. Despite this variability, six patients with decreased LRBA protein expression were observed. However, no pathogenic or possibly pathogenic biallelic variants in LRBA, or in genes related to the immune system, were found. Conclusion: LRBA expression varies greatly among healthy donors and patients. Reduced LRBA expression was observed in six patients without homozygous mutations in LRBA or in genes associated with the immune system. These results suggest that other genetic, epigenetic or environmental mechanisms might regulate LRBA expression.
Affiliation(s)
- Catalina Martínez-Jaramillo
- Universidad de Antioquia UdeA, Facultad de Medicina, Grupo de Inmunodeficiencias Primarias, Medellin, Colombia
- Jose Luis Franco
- Universidad de Antioquia UdeA, Facultad de Medicina, Grupo de Inmunodeficiencias Primarias, Medellin, Colombia

71
Rojano E, Seoane P, Ranea JAG, Perkins JR. Regulatory variants: from detection to predicting impact. Brief Bioinform 2019; 20:1639-1654. [PMID: 29893792 PMCID: PMC6917219 DOI: 10.1093/bib/bby039] [Citation(s) in RCA: 65] [Impact Index Per Article: 13.0] [Received: 03/13/2018] [Revised: 04/18/2018] [Indexed: 02/01/2023] Open
Abstract
Variants within non-coding genomic regions can greatly affect disease. In recent years, increasing focus has been given to these variants, and how they can alter regulatory elements, such as enhancers, transcription factor binding sites and DNA methylation regions. Such variants can be considered regulatory variants. Concurrently, much effort has been put into establishing international consortia to undertake large projects aimed at discovering regulatory elements in different tissues, cell lines and organisms, and probing the effects of genetic variants on regulation by measuring gene expression. Here, we describe methods and techniques for discovering disease-associated non-coding variants using sequencing technologies. We then explain the computational procedures that can be used for annotating these variants using the information from the aforementioned projects, and for predicting their putative effects, including potential pathogenicity, based on rule-based and machine learning approaches. We provide the details of techniques to validate these predictions, by mapping chromatin-chromatin and chromatin-protein interactions, and introduce Clustered Regularly Interspaced Short Palindromic Repeats-Associated Protein 9 (CRISPR-Cas9) technology, which has already been used in this field and is likely to have a big impact on its future evolution. We also give examples of regulatory variants associated with multiple complex diseases. This review is aimed at bioinformaticians interested in the characterization of regulatory variants, molecular biologists and geneticists interested in understanding more about the nature and potential role of such variants from a functional point of view, and clinicians who may wish to learn about variants in non-coding genomic regions associated with a given disease and find out what to do next to uncover how they impact the underlying mechanisms.
Affiliation(s)
- Elena Rojano
- Department of Molecular Biology and Biochemistry, University of Malaga (UMA), 29010 Malaga, Spain
- Pedro Seoane
- Department of Molecular Biology and Biochemistry, University of Malaga (UMA), 29010 Malaga, Spain
- Juan A G Ranea
- CIBER de Enfermedades Raras, ISCIII, Madrid, Spain and Department of Molecular Biology and Biochemistry, University of Malaga (UMA), 29010 Malaga, Spain
- James R Perkins
- Research laboratory, IBIMA-Regional University Hospital of Malaga, UMA, Malaga 29009, Spain

72
Bhardwaj A, Bag SK. PLANET-SNP pipeline: PLants based ANnotation and Establishment of True SNP pipeline. Genomics 2019; 111:1066-1077. [PMID: 31533899 DOI: 10.1016/j.ygeno.2018.07.001] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 10/02/2017] [Revised: 06/10/2018] [Accepted: 07/02/2018] [Indexed: 12/30/2022]
Abstract
Accurate prediction of SNPs (Single Nucleotide Polymorphisms) from high-throughput sequencing data is a challenging problem with the potential to reveal variation within plant species. To extract useful information from bulk data, machine learning (ML) can support the development of accurate models that learn from prior information. We performed state-of-the-art, in-depth learning on six different plant species. A comparative evaluation of five different algorithms showed that Random Forest substantially outperformed the others in selecting potential SNPs, with markedly improved prediction accuracy under 10-fold cross-validation, and it was integrated into a system known as PLANET-SNP. We present an accurate method to extract potential SNPs with user-specific customizable parameters. It will facilitate the identification of efficient and functional SNPs in an easy and intuitive way. The PLANET-SNP pipeline is very flexible in terms of data input and output formats. The PLANET-SNP pipeline is available at http://www.ncgd.nbri.res.in/PLANET-SNP-Pipeline.aspx.
Affiliation(s)
- Archana Bhardwaj
- Academy of Scientific and Innovative Research (AcSIR), CSIR-NBRI Campus, Lucknow, India; Computational Biology Lab, Council of Scientific and Industrial Research - National Botanical Research Institute (CSIR-NBRI), Rana Pratap Marg, Lucknow, Uttar Pradesh 226001, India
- Sumit K Bag
- Academy of Scientific and Innovative Research (AcSIR), CSIR-NBRI Campus, Lucknow, India; Computational Biology Lab, Council of Scientific and Industrial Research - National Botanical Research Institute (CSIR-NBRI), Rana Pratap Marg, Lucknow, Uttar Pradesh 226001, India.

73
Caspar SM, Dubacher N, Kopps AM, Meienberg J, Henggeler C, Matyas G. Clinical sequencing: From raw data to diagnosis with lifetime value. Clin Genet 2019; 93:508-519. [PMID: 29206278 DOI: 10.1111/cge.13190] [Citation(s) in RCA: 60] [Impact Index Per Article: 12.0] [Received: 10/06/2017] [Revised: 11/28/2017] [Accepted: 11/30/2017] [Indexed: 12/22/2022]
Abstract
High-throughput sequencing (HTS) has revolutionized genetics by enabling the detection of sequence variants at hitherto unprecedented large scale. Despite these advances, however, there are still remaining challenges in the complete coverage of targeted regions (genes, exome or genome) as well as in HTS data analysis and interpretation. Moreover, it is easy to get overwhelmed by the plethora of available methods and tools for HTS. Here, we review the step-by-step process from the generation of sequence data to molecular diagnosis of Mendelian diseases. Highlighting advantages and limitations, this review addresses the current state of (1) HTS technologies, considering targeted, whole-exome, and whole-genome sequencing on short- and long-read platforms; (2) read alignment, variant calling and interpretation; as well as (3) regulatory issues related to genetic counseling, reimbursement, and data storage.
Affiliation(s)
- S M Caspar
- Center for Cardiovascular Genetics and Gene Diagnostics, Foundation for People with Rare Diseases, Schlieren-Zurich, Switzerland
- N Dubacher
- Center for Cardiovascular Genetics and Gene Diagnostics, Foundation for People with Rare Diseases, Schlieren-Zurich, Switzerland
- A M Kopps
- Center for Cardiovascular Genetics and Gene Diagnostics, Foundation for People with Rare Diseases, Schlieren-Zurich, Switzerland
- J Meienberg
- Center for Cardiovascular Genetics and Gene Diagnostics, Foundation for People with Rare Diseases, Schlieren-Zurich, Switzerland
- C Henggeler
- Center for Cardiovascular Genetics and Gene Diagnostics, Foundation for People with Rare Diseases, Schlieren-Zurich, Switzerland
- G Matyas
- Center for Cardiovascular Genetics and Gene Diagnostics, Foundation for People with Rare Diseases, Schlieren-Zurich, Switzerland
- Zurich Center for Integrative Human Physiology, University of Zurich, Zurich, Switzerland

74
Hu Z, Yu C, Furutsuki M, Andreoletti G, Ly M, Hoskins R, Adhikari AN, Brenner SE. VIPdb, a genetic Variant Impact Predictor Database. Hum Mutat 2019; 40:1202-1214. [PMID: 31283070 PMCID: PMC7288905 DOI: 10.1002/humu.23858] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 06/17/2019] [Accepted: 06/27/2019] [Indexed: 12/30/2022]
Abstract
Genome sequencing identifies a vast number of genetic variants. Predicting these variants' molecular and clinical effects is one of the preeminent challenges in human genetics. Accurate prediction of the impact of genetic variants improves our understanding of how genetic information is conveyed to molecular and cellular functions, and is an essential step towards precision medicine. Over one hundred tools and resources have been developed specifically for this purpose. We summarize these tools, as well as their characteristics, in the genetic Variant Impact Predictor Database (VIPdb). This database will help researchers and clinicians explore appropriate tools, and inform the development of improved methods. VIPdb can be browsed and downloaded at https://genomeinterpretation.org/vipdb.
Affiliation(s)
- Zhiqiang Hu
- Department of Plant and Microbial Biology, University of California, Berkeley, California 94720, USA
- Changhua Yu
- Department of Plant and Microbial Biology, University of California, Berkeley, California 94720, USA
- Department of Bioengineering, University of California, Berkeley, California 94720, USA
- Mabel Furutsuki
- Department of Plant and Microbial Biology, University of California, Berkeley, California 94720, USA
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, California 94720, USA
- Gaia Andreoletti
- Department of Plant and Microbial Biology, University of California, Berkeley, California 94720, USA
- Melissa Ly
- Department of Plant and Microbial Biology, University of California, Berkeley, California 94720, USA
- Division of Data Sciences, University of California, Berkeley, California 94720, USA
- Roger Hoskins
- Department of Plant and Microbial Biology, University of California, Berkeley, California 94720, USA
- Aashish N. Adhikari
- Department of Plant and Microbial Biology, University of California, Berkeley, California 94720, USA
- Steven E. Brenner
- Department of Plant and Microbial Biology, University of California, Berkeley, California 94720, USA
|
75
|
Fasterius E, Al-Khalili Szigyarto C. seqCAT: a Bioconductor R-package for variant analysis of high throughput sequencing data. F1000Res 2019. [DOI: 10.12688/f1000research.16083.2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 02/06/2023] Open
Abstract
High throughput sequencing technologies are flourishing in the biological sciences, enabling unprecedented insights into e.g. genetic variation, but require extensive bioinformatic expertise for the analysis. There is thus a need for simple yet effective software that can analyse both existing and novel data, providing interpretable biological results with little bioinformatic prowess. We present seqCAT, a Bioconductor toolkit for analysing genetic variation in high throughput sequencing data. It is a highly accessible, easy-to-use and well-documented R-package that enables a wide range of researchers to analyse their own and publicly available data, providing biologically relevant conclusions and publication-ready figures. SeqCAT can provide information regarding genetic similarities between an arbitrary number of samples, validate specific variants, and define functionally similar variant groups for further downstream analyses. Its ease of use, installation, complete data-to-conclusions functionality and the inherent flexibility of the R programming language make seqCAT a powerful tool for variant analyses compared to already existing solutions. A publicly available dataset of liver cancer-derived organoids is analysed herein using the seqCAT package, corroborating the original authors' conclusions that the organoids are genetically stable. A previously known liver cancer-related mutation is additionally shown to be present in a sample, though it was not listed in the original publication. Differences between DNA- and RNA-based variant calls in this dataset are also analysed, revealing a high median concordance of 97.5%. SeqCAT is open source software under an MIT licence, available at https://bioconductor.org/packages/release/bioc/html/seqCAT.html.
|
76
|
Comprehensive evaluation and characterisation of short read general-purpose structural variant calling software. Nat Commun 2019; 10:3240. [PMID: 31324872 PMCID: PMC6642177 DOI: 10.1038/s41467-019-11146-4] [Citation(s) in RCA: 137] [Impact Index Per Article: 27.4] [Received: 09/14/2018] [Accepted: 06/26/2019] [Indexed: 01/12/2023] Open
Abstract
In recent years, many software packages for identifying structural variants (SVs) using whole-genome sequencing data have been released. When published, a new method is commonly compared with those already available, but this tends to be selective and incomplete. The lack of comprehensive benchmarking of methods presents challenges for users in selecting methods and for developers in understanding algorithm behaviours and limitations. Here we report the comprehensive evaluation of 10 SV callers, selected following a rigorous process and spanning the breadth of detection approaches, using high-quality reference cell lines, as well as simulations. Due to the nature of available truth sets, our focus is on general-purpose rather than somatic callers. We characterise the impact on performance of event size and type, sequencing characteristics, and genomic context, and analyse the efficacy of ensemble calling and calibration of variant quality scores. Finally, we provide recommendations for both users and methods developers. A number of computational methods have been developed for calling structural variants (SVs) using short read sequencing data. Here, the authors perform a comprehensive benchmarking analysis comparing 10 general-purpose callers and provide recommendations for both users and methods developers.
|
77
|
Chen J, Li X, Zhong H, Meng Y, Du H. Systematic comparison of germline variant calling pipelines cross multiple next-generation sequencers. Sci Rep 2019; 9:9345. [PMID: 31249349 PMCID: PMC6597787 DOI: 10.1038/s41598-019-45835-3] [Citation(s) in RCA: 53] [Impact Index Per Article: 10.6] [Received: 01/18/2019] [Accepted: 06/12/2019] [Indexed: 12/17/2022] Open
Abstract
The development of next-generation sequencing (NGS) and its downstream analysis tools has gained popularity in scientific research and clinical diagnostic applications. Hence, a systematic comparison of the sequencing platforms and variant calling pipelines could provide significant guidance to NGS-based scientific and clinical genomics. In this study, we compared the performance, concordance and operating efficiency of 27 combinations of sequencing platforms and variant calling pipelines, testing three variant calling pipelines (Genome Analysis Tool Kit HaplotypeCaller, Strelka2 and Samtools-Varscan2) on nine data sets for the NA12878 genome sequenced on different platforms including BGISEQ500, MGISEQ2000, HiSeq4000, NovaSeq and HiSeq Xten. Among the 12 combinations in WES datasets, all displayed good performance in calling SNPs, with F-scores all higher than 0.96, while their performance in calling INDELs varied from 0.75 to 0.91. All 15 combinations in WGS datasets also performed well, with F-scores for calling SNPs all higher than 0.975 and performance in calling INDELs varying from 0.71 to 0.93. All of these combinations showed high concordance in variant identification, while the divergence in variant identification was larger in WGS datasets than in WES datasets. We also down-sampled the original WES and WGS datasets to a series of coverage levels across multiple platforms, and then measured the variant calling time consumed by each of the three pipelines at each coverage level. For the GIAB datasets on both BGI and Illumina platforms, Strelka2 outperformed the other two pipelines in detection accuracy and processing efficiency on each sequencing platform, and is therefore recommended for the further promotion and application of next-generation sequencing technology. The results of our research provide useful and comprehensive guidelines for individual or institutional researchers seeking reliable and consistent variant identification.
Affiliation(s)
- Jiayun Chen
- School of Biology and Biological Engineering & Department of Biomedical Engineering, South China University of Technology, Guangzhou, China
| | - Xingsong Li
- School of Biology and Biological Engineering & Department of Biomedical Engineering, South China University of Technology, Guangzhou, China
| | - Hongbin Zhong
- School of Biology and Biological Engineering & Department of Biomedical Engineering, South China University of Technology, Guangzhou, China
| | - Yuhuan Meng
- School of Biology and Biological Engineering & Department of Biomedical Engineering, South China University of Technology, Guangzhou, China.
| | - Hongli Du
- School of Biology and Biological Engineering & Department of Biomedical Engineering, South China University of Technology, Guangzhou, China.
|
78
|
Bope CD, Chimusa ER, Nembaware V, Mazandu GK, de Vries J, Wonkam A. Dissecting in silico Mutation Prediction of Variants in African Genomes: Challenges and Perspectives. Front Genet 2019; 10:601. [PMID: 31293624 PMCID: PMC6603221 DOI: 10.3389/fgene.2019.00601] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Received: 02/28/2019] [Accepted: 06/05/2019] [Indexed: 12/20/2022] Open
Abstract
Genomic medicine is set to drastically improve clinical care globally due to high throughput technologies which enable speedy in silico detection and analysis of clinically relevant mutations. However, the variability in the in silico prediction methods and categorization of functionally relevant genetic variants can pose specific challenges in some populations. In silico mutation prediction tools could lead to high rates of false positive/negative results, particularly in African genomes that harbor the highest genetic diversity and that are disproportionately underrepresented in public databases and reference panels. These issues are particularly relevant with the recent increase in initiatives, such as the Human Heredity and Health (H3Africa), that are generating huge amounts of genomic sequence data in the absence of policies to guide genomic researchers to return results of variants in so-called actionable genes to research participants. This report (i) provides an inventory of publicly available Whole Exome/Genome data from Africa which could help improve reference panels and explore the frequency of pathogenic variants in actionable genes and related challenges, (ii) reviews available in silico prediction mutation tools and the criteria for categorization of pathogenicity of novel variants, and (iii) proposes recommendations for analyzing pathogenic variants in African genomes for their use in research and clinical practice. In conclusion, this work proposes criteria to define mutation pathogenicity and actionability in human genetic research and clinical practice in Africa and recommends setting up an African expert panel to oversee the proposed criteria.
Affiliation(s)
- Christian Domilongo Bope
- Department of Pathology, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa
- Departments of Mathematics and Computer Sciences, Faculty of Sciences, University of Kinshasa, Kinshasa, Democratic Republic of Congo
| | - Emile R. Chimusa
- Department of Pathology, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa
| | - Victoria Nembaware
- Department of Pathology, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa
| | - Gaston K. Mazandu
- Department of Pathology, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa
| | - Jantina de Vries
- Department of Medicine, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa
| | - Ambroise Wonkam
- Department of Pathology, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa
- Department of Medicine, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa
- Institute of Infectious Diseases and Molecular Medicine, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa
|
79
|
Kumaran M, Subramanian U, Devarajan B. Performance assessment of variant calling pipelines using human whole exome sequencing and simulated data. BMC Bioinformatics 2019; 20:342. [PMID: 31208315 PMCID: PMC6580603 DOI: 10.1186/s12859-019-2928-9] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Received: 07/14/2018] [Accepted: 05/31/2019] [Indexed: 12/30/2022] Open
Abstract
Background: Whole exome sequencing (WES) is a cost-effective method that identifies clinical variants, but it demands accurate variant caller tools. Currently available tools have variable accuracy in predicting specific clinical variants. However, it may be possible to find the best combination of aligner and variant caller tools for accurately detecting single nucleotide variants (SNVs) and small insertions and deletions (InDels) separately. Moreover, many important aspects of InDel detection are overlooked when comparing the performance of tools, particularly InDel base pair length. Results: We assessed the performance of variant calling pipelines using the combinations of four variant callers and five aligners on human NA12878 and simulated exome data. We used high confidence variant calls from the Genome in a Bottle (GiaB) consortium for validation, and GRCh37 and GRCh38 as the human reference genome. Based on the performance metrics, both BWA and Novoalign aligners performed better with DeepVariant and SAMtools callers for detecting SNVs, and with DeepVariant and GATK for InDels. Furthermore, we obtained similar results on human NA24385 and NA24631 exome data from GiaB. Conclusion: In this study, DeepVariant with BWA and Novoalign performed best for accurately detecting SNVs and InDels. The accuracy of variant calling was improved by merging the top performing pipelines. The results of our study provide useful recommendations for analysis of WES data in clinical genomics.
Affiliation(s)
- Manojkumar Kumaran
- Department of Bioinformatics, Aravind Medical Research Foundation, Madurai, Tamil Nadu, 625020, India; School of Chemical and Biotechnology, SASTRA (Deemed to be University), Thanjavur, Tamil Nadu, 613401, India
| | - Umadevi Subramanian
- Department of Bioinformatics, Aravind Medical Research Foundation, Madurai, Tamil Nadu, 625020, India
| | - Bharanidharan Devarajan
- Department of Bioinformatics, Aravind Medical Research Foundation, Madurai, Tamil Nadu, 625020, India.
|
80
|
Raimondi D, Tanyalcin I, Ferté J, Gazzo A, Orlando G, Lenaerts T, Rooman M, Vranken W. DEOGEN2: prediction and interactive visualization of single amino acid variant deleteriousness in human proteins. Nucleic Acids Res 2017; 45:W201-W206. [PMID: 28498993 PMCID: PMC5570203 DOI: 10.1093/nar/gkx390] [Citation(s) in RCA: 90] [Impact Index Per Article: 18.0] [Received: 01/31/2017] [Accepted: 04/26/2017] [Indexed: 12/22/2022] Open
Abstract
High-throughput sequencing methods are generating enormous amounts of genomic data, giving unprecedented insights into human genetic variation and its relation to disease. An individual human genome contains millions of Single Nucleotide Variants: to discriminate the deleterious from the benign ones, a variety of methods have been developed that predict whether a protein-coding variant likely affects the carrier individual's health. We present such a method, DEOGEN2, which incorporates heterogeneous information about the molecular effects of the variants, the domains involved, the relevance of the gene and the interactions in which it participates. This extensive contextual information is non-linearly mapped into one single deleteriousness score for each variant. Since for the non-expert user it is sometimes still difficult to assess what this score means, how it relates to the encoded protein, and where it originates from, we developed an interactive online framework (http://deogen2.mutaframe.com/) to better present the DEOGEN2 deleteriousness predictions of all possible variants in all human proteins. The prediction is visualized so both expert and non-expert users can gain insights into the meaning, protein context and origins of each prediction.
Affiliation(s)
- Daniele Raimondi
- Interuniversity Institute of Bioinformatics in Brussels, ULB/VUB, Triomflaan, BC building, 6th floor, CP 263, 1050 Brussels, Belgium; Machine Learning Group, Université Libre de Bruxelles, Boulevard du Triomphe, CP 212, 1050 Brussels, Belgium; Structural Biology Brussels, Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussels, Belgium
| | - Ibrahim Tanyalcin
- Interuniversity Institute of Bioinformatics in Brussels, ULB/VUB, Triomflaan, BC building, 6th floor, CP 263, 1050 Brussels, Belgium; Structural Biology Brussels, Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussels, Belgium
| | - Julien Ferté
- Interuniversity Institute of Bioinformatics in Brussels, ULB/VUB, Triomflaan, BC building, 6th floor, CP 263, 1050 Brussels, Belgium; 3BIO-BioInfo Group, Université Libre De Bruxelles, AV Fr. Roosevelt 50, CP 165/61, Brussels 1050, Belgium
| | - Andrea Gazzo
- Interuniversity Institute of Bioinformatics in Brussels, ULB/VUB, Triomflaan, BC building, 6th floor, CP 263, 1050 Brussels, Belgium; Machine Learning Group, Université Libre de Bruxelles, Boulevard du Triomphe, CP 212, 1050 Brussels, Belgium
| | - Gabriele Orlando
- Interuniversity Institute of Bioinformatics in Brussels, ULB/VUB, Triomflaan, BC building, 6th floor, CP 263, 1050 Brussels, Belgium; Machine Learning Group, Université Libre de Bruxelles, Boulevard du Triomphe, CP 212, 1050 Brussels, Belgium; Structural Biology Brussels, Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussels, Belgium
| | - Tom Lenaerts
- Interuniversity Institute of Bioinformatics in Brussels, ULB/VUB, Triomflaan, BC building, 6th floor, CP 263, 1050 Brussels, Belgium; Machine Learning Group, Université Libre de Bruxelles, Boulevard du Triomphe, CP 212, 1050 Brussels, Belgium; Artificial Intelligence Lab, Vrije Universiteit Brussel, Pleinlaan 2, Brussels 1050, Belgium
| | - Marianne Rooman
- Interuniversity Institute of Bioinformatics in Brussels, ULB/VUB, Triomflaan, BC building, 6th floor, CP 263, 1050 Brussels, Belgium; 3BIO-BioInfo Group, Université Libre De Bruxelles, AV Fr. Roosevelt 50, CP 165/61, Brussels 1050, Belgium
| | - Wim Vranken
- Interuniversity Institute of Bioinformatics in Brussels, ULB/VUB, Triomflaan, BC building, 6th floor, CP 263, 1050 Brussels, Belgium; Structural Biology Brussels, Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussels, Belgium; Artificial Intelligence Lab, Vrije Universiteit Brussel, Pleinlaan 2, Brussels 1050, Belgium
|
81
|
Kosugi S, Momozawa Y, Liu X, Terao C, Kubo M, Kamatani Y. Comprehensive evaluation of structural variation detection algorithms for whole genome sequencing. Genome Biol 2019; 20:117. [PMID: 31159850 PMCID: PMC6547561 DOI: 10.1186/s13059-019-1720-5] [Citation(s) in RCA: 229] [Impact Index Per Article: 45.8] [Received: 10/15/2018] [Accepted: 05/20/2019] [Indexed: 01/01/2023] Open
Abstract
BACKGROUND: Structural variations (SVs) or copy number variations (CNVs) greatly impact the functions of the genes encoded in the genome and are responsible for diverse human diseases. Although a number of existing SV detection algorithms can detect many types of SVs using whole genome sequencing (WGS) data, no single algorithm can call every type of SV with high precision and high recall. RESULTS: We comprehensively evaluate the performance of 69 existing SV detection algorithms using multiple simulated and real WGS datasets. The results highlight a subset of algorithms that accurately call SVs depending on specific types and size ranges of the SVs and that accurately determine breakpoints, sizes, and genotypes of the SVs. We enumerate potentially good algorithms for each SV category, among which GRIDSS, Lumpy, SVseq2, SoftSV, Manta, and Wham perform better in the deletion or duplication categories. To improve the accuracy of SV calling, we systematically evaluate the accuracy of overlapping calls between possible combinations of algorithms for every type and size range of SVs. The results demonstrate that both the precision and recall for overlapping calls vary depending on the combinations of specific algorithms rather than the combinations of methods used in the algorithms. CONCLUSION: These results suggest that careful selection of the algorithms for each type and size range of SVs is required for accurate calling of SVs. The selection of specific pairs of algorithms for overlapping calls promises to effectively improve the SV detection accuracy.
Affiliation(s)
- Shunichi Kosugi
- Laboratory for Statistical Analysis, RIKEN Center for Integrative Medical Sciences, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, 230-0045 Japan
- Laboratory for Statistical and Translational Genetics, RIKEN Center for Integrative Medical Sciences, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, 230-0045 Japan
| | - Yukihide Momozawa
- Laboratory for Genotyping Development, RIKEN Center for Integrative Medical Sciences, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, 230-0045 Japan
| | - Xiaoxi Liu
- Laboratory for Genotyping Development, RIKEN Center for Integrative Medical Sciences, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, 230-0045 Japan
| | - Chikashi Terao
- Laboratory for Statistical Analysis, RIKEN Center for Integrative Medical Sciences, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, 230-0045 Japan
- Laboratory for Statistical and Translational Genetics, RIKEN Center for Integrative Medical Sciences, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, 230-0045 Japan
| | - Michiaki Kubo
- RIKEN Center for Integrative Medical Sciences, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, 230-0045 Japan
| | - Yoichiro Kamatani
- Laboratory for Statistical Analysis, RIKEN Center for Integrative Medical Sciences, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, 230-0045 Japan
- Laboratory for Statistical and Translational Genetics, RIKEN Center for Integrative Medical Sciences, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, 230-0045 Japan
|
82
|
Mangul S, Mosqueiro T, Abdill RJ, Duong D, Mitchell K, Sarwal V, Hill B, Brito J, Littman RJ, Statz B, Lam AKM, Dayama G, Grieneisen L, Martin LS, Flint J, Eskin E, Blekhman R. Challenges and recommendations to improve the installability and archival stability of omics computational tools. PLoS Biol 2019; 17:e3000333. [PMID: 31220077 PMCID: PMC6605654 DOI: 10.1371/journal.pbio.3000333] [Citation(s) in RCA: 33] [Impact Index Per Article: 6.6] [Revised: 07/02/2019] [Indexed: 01/07/2023] Open
Abstract
Developing new software tools for analysis of large-scale biological data is a key component of advancing modern biomedical research. Scientific reproduction of published findings requires running computational tools on data generated by such studies, yet little attention is presently allocated to the installability and archival stability of computational software tools. Scientific journals require data and code sharing, but none currently require authors to guarantee the continuing functionality of newly published tools. We have estimated the archival stability of computational biology software tools by performing an empirical analysis of the internet presence for 36,702 omics software resources published from 2005 to 2017. We found that almost 28% of all resources are currently not accessible through uniform resource locators (URLs) published in the paper they first appeared in. Among the 98 software tools selected for our installability test, 51% were deemed "easy to install," and 28% of the tools failed to be installed at all because of problems in the implementation. Moreover, for papers introducing new software, we found that the number of citations significantly increased when authors provided an easy installation process. We propose for incorporation into journal policy several practical solutions for increasing the widespread installability and archival stability of published bioinformatics software.
Affiliation(s)
- Serghei Mangul
- Department of Computer Science, University of California Los Angeles, Los Angeles, California, United States of America
- Institute for Quantitative and Computational Biosciences, University of California Los Angeles, Los Angeles, California, United States of America
| | - Thiago Mosqueiro
- Institute for Quantitative and Computational Biosciences, University of California Los Angeles, Los Angeles, California, United States of America
| | - Richard J. Abdill
- Department of Genetics, Cell Biology, and Development, University of Minnesota, Minneapolis, Minnesota, United States of America
| | - Dat Duong
- Department of Computer Science, University of California Los Angeles, Los Angeles, California, United States of America
| | - Keith Mitchell
- Department of Computer Science, University of California Los Angeles, Los Angeles, California, United States of America
| | - Varuni Sarwal
- Indian Institute of Technology Delhi, Hauz Khas, New Delhi, India
| | - Brian Hill
- Department of Computer Science, University of California Los Angeles, Los Angeles, California, United States of America
| | - Jaqueline Brito
- Institute of Mathematics and Computer Science, University of São Paulo, São Paulo, Brazil
| | - Russell Jared Littman
- Department of Computer Science, University of California Los Angeles, Los Angeles, California, United States of America
| | - Benjamin Statz
- Department of Computer Science, University of California Los Angeles, Los Angeles, California, United States of America
| | - Angela Ka-Mei Lam
- Department of Computer Science, University of California Los Angeles, Los Angeles, California, United States of America
| | - Gargi Dayama
- Department of Genetics, Cell Biology, and Development, University of Minnesota, Minneapolis, Minnesota, United States of America
| | - Laura Grieneisen
- Department of Genetics, Cell Biology, and Development, University of Minnesota, Minneapolis, Minnesota, United States of America
| | - Lana S. Martin
- Institute for Quantitative and Computational Biosciences, University of California Los Angeles, Los Angeles, California, United States of America
| | - Jonathan Flint
- Center for Neurobehavioral Genetics, Semel Institute for Neuroscience and Human Behavior, University of California Los Angeles, Los Angeles, California, United States of America
| | - Eleazar Eskin
- Department of Computer Science, University of California Los Angeles, Los Angeles, California, United States of America
- Department of Human Genetics, University of California Los Angeles, Los Angeles, California, United States of America
| | - Ran Blekhman
- Department of Genetics, Cell Biology, and Development, University of Minnesota, Minneapolis, Minnesota, United States of America
- Department of Ecology, Evolution, and Behavior, University of Minnesota, Minnesota, United States of America
|
83
|
Thormann A, Halachev M, McLaren W, Moore DJ, Svinti V, Campbell A, Kerr SM, Tischkowitz M, Hunt SE, Dunlop MG, Hurles ME, Wright CF, Firth HV, Cunningham F, FitzPatrick DR. Flexible and scalable diagnostic filtering of genomic variants using G2P with Ensembl VEP. Nat Commun 2019; 10:2373. [PMID: 31147538 PMCID: PMC6542828 DOI: 10.1038/s41467-019-10016-3] [Citation(s) in RCA: 53] [Impact Index Per Article: 10.6] [Received: 08/15/2018] [Accepted: 04/15/2019] [Indexed: 12/31/2022] Open
Abstract
We aimed to develop an efficient, flexible and scalable approach to diagnostic genome-wide sequence analysis of genetically heterogeneous clinical presentations. Here we present G2P (www.ebi.ac.uk/gene2phenotype) as an online system to establish, curate and distribute datasets for diagnostic variant filtering via association of allelic requirement and mutational consequence at a defined locus with phenotypic terms, confidence level and evidence links. An extension to Ensembl Variant Effect Predictor (VEP), VEP-G2P, was used to filter both disease-associated and control whole exome sequence (WES) with Developmental Disorders G2P (G2PDD; 2044 entries). VEP-G2PDD shows a sensitivity/precision of 97.3%/33% for de novo and 81.6%/22.7% for inherited pathogenic genotypes, respectively. Many of the missing genotypes are likely false-positive pathogenic assignments. The expected number and discriminative features of background genotypes are defined using control WES. Using only human genetic data, VEP-G2P performs well compared to other freely-available diagnostic systems, and future phenotypic matching capabilities should further enhance performance.
Affiliation(s)
- Anja Thormann
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
| | - Mihail Halachev
- MRC Institute of Genetics and Molecular Medicine at the University of Edinburgh, Edinburgh, EH4 2XU, UK
- South East Scotland Regional Genetics Services, Western General Hospital, Edinburgh, EH4 2XU, UK
| | - William McLaren
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
| | - David J Moore
- South East Scotland Regional Genetics Services, Western General Hospital, Edinburgh, EH4 2XU, UK
| | - Victoria Svinti
- MRC Institute of Genetics and Molecular Medicine at the University of Edinburgh, Edinburgh, EH4 2XU, UK
| | - Archie Campbell
- Centre for Genomic and Experimental Medicine, Institute of Genetics & Molecular Medicine, Western General Hospital, University of Edinburgh, Edinburgh, EH4 2XU, UK
- Usher Institute for Population Health Sciences and Informatics, The University of Edinburgh, Nine Edinburgh BioQuarter, 9 Little France Road, Edinburgh, EH16 4UX, UK
| | - Shona M Kerr
- Centre for Genomic and Experimental Medicine, Institute of Genetics & Molecular Medicine, Western General Hospital, University of Edinburgh, Edinburgh, EH4 2XU, UK
| | - Marc Tischkowitz
- Clinical Genetic Department, Addenbrooke's Hospital Cambridge University Hospitals, Cambridge, CB2 0QQ, UK
| | - Sarah E Hunt
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
| | - Malcolm G Dunlop
- MRC Institute of Genetics and Molecular Medicine at the University of Edinburgh, Edinburgh, EH4 2XU, UK
- Edinburgh Cancer Research Centre, Institute of Genetics & Molecular Medicine, Western General Hospital, University of Edinburgh, Edinburgh, EH4 2XU, UK
| | - Matthew E Hurles
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Caroline F Wright
- University of Exeter Medical School, RILD Level 4, Royal Devon & Exeter Hospital, Barrack Road, Exeter, UK
| | - Helen V Firth
- Clinical Genetic Department, Addenbrooke's Hospital Cambridge University Hospitals, Cambridge, CB2 0QQ, UK
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Fiona Cunningham
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.
| | - David R FitzPatrick
- MRC Human Genetics Unit, MRC Institute of Genetics and Molecular Medicine at the University of Edinburgh, Edinburgh, EH4 2XU, UK.
|
84
|
Singer J, Irmisch A, Ruscheweyh HJ, Singer F, Toussaint NC, Levesque MP, Stekhoven DJ, Beerenwinkel N. Bioinformatics for precision oncology. Brief Bioinform 2019; 20:778-788. [PMID: 29272324 PMCID: PMC6585151 DOI: 10.1093/bib/bbx143] [Citation(s) in RCA: 36] [Impact Index Per Article: 7.2] [Received: 06/06/2017] [Revised: 09/29/2017] [Indexed: 12/13/2022] Open
Abstract
Molecular profiling of tumor biopsies plays an increasingly important role not only in cancer research, but also in the clinical management of cancer patients. Multi-omics approaches hold the promise of improving diagnostics, prognostics and personalized treatment. To deliver on this promise of precision oncology, appropriate bioinformatics methods for managing, integrating and analyzing large and complex data are necessary. Here, we discuss the specific requirements of bioinformatics methods and software that arise in the setting of clinical oncology, owing to a stricter regulatory environment and the need for rapid, highly reproducible and robust procedures. We describe the workflow of a molecular tumor board and the specific bioinformatics support that it requires, from the primary analysis of raw molecular profiling data to the automatic generation of a clinical report and its delivery to decision-making clinical oncologists. Such workflows have to various degrees been implemented in many clinical trials, as well as in molecular tumor boards at specialized cancer centers and university hospitals worldwide. We review these and more recent efforts to include other high-dimensional multi-omics patient profiles into the tumor board, as well as the state of clinical decision support software to translate molecular findings into treatment recommendations.
Affiliation(s)
- Jochen Singer
- Department of Biosystems Science and Engineering of ETH Zurich in Basel, Switzerland
| | - Anja Irmisch
- Department of Dermatology at the University of Zurich Hospital in Zurich, Switzerland
| | - Niko Beerenwinkel
- Department of Biosystems Science and Engineering of ETH Zurich in Basel, Switzerland
| |
|
85
|
Alzu'bi AA, Zhou L, Watzlaf VJM. Genetic Variations and Precision Medicine. PERSPECTIVES IN HEALTH INFORMATION MANAGEMENT 2019; 16:1a. [PMID: 31019429 PMCID: PMC6462879] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Subscribe] [Scholar Register] [Indexed: 06/09/2023]
Abstract
The time and costs associated with the sequencing of a human genome have decreased significantly in recent years. Many people have chosen to have their genomes sequenced to receive genomics-based personalized healthcare services. To reach the goal of genomics-based precision medicine, health information management (HIM) professionals need to manage and analyze patients' genomic data. Two important pieces of information from the genome sequence are the risk of genetic diseases and the specific medication or pharmacogenomic results for the individual patient, both of which are linked to a patient's genetic variations. In this review article, we introduce genetic variations, including their data types, relevant databases, and some currently available analysis methods and systems. HIM professionals can choose to use these databases, methods, and systems in the management and analysis of patients' genomic data.
Affiliation(s)
- Amal Adel Alzu'bi
- The Department of Computer Information Systems at Jordan University of Science and Technology in Irbid, Jordan
| | - Leming Zhou
- The Department of Health Information Management at the University of Pittsburgh in Pittsburgh, PA
| | - Valerie J M Watzlaf
- The Department of Health Information Management at the University of Pittsburgh in Pittsburgh, PA
| |
|
86
|
Mangul S, Martin LS, Hill BL, Lam AKM, Distler MG, Zelikovsky A, Eskin E, Flint J. Systematic benchmarking of omics computational tools. Nat Commun 2019; 10:1393. [PMID: 30918265 PMCID: PMC6437167 DOI: 10.1038/s41467-019-09406-4] [Citation(s) in RCA: 82] [Impact Index Per Article: 16.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/23/2018] [Accepted: 03/06/2019] [Indexed: 01/11/2023] Open
Abstract
Computational omics methods packaged as software have become essential to modern biological research. The increasing dependence of scientists on these powerful software tools creates a need for systematic assessment of these methods, known as benchmarking. Adopting a standardized benchmarking practice could help researchers who use omics data to better leverage recent technological innovations. Our review summarizes benchmarking practices from 25 recent studies and discusses the challenges, advantages, and limitations of benchmarking across various domains of biology. We also propose principles that can make computational biology benchmarking studies more sustainable and reproducible, ultimately increasing the transparency of biomedical data and results. Benchmarking studies are important for comprehensively understanding and evaluating different computational omics methods. Here, the authors review practices from 25 recent studies and propose principles to improve the quality of benchmarking studies.
Affiliation(s)
- Serghei Mangul
- Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA, 90095, USA.,Institute for Quantitative and Computational Biosciences, University of California Los Angeles, 611 Charles E Young Drive East, Los Angeles, CA, 90095, USA
| | - Lana S Martin
- Institute for Quantitative and Computational Biosciences, University of California Los Angeles, 611 Charles E Young Drive East, Los Angeles, CA, 90095, USA
| | - Brian L Hill
- Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA, 90095, USA
| | - Angela Ka-Mei Lam
- Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA, 90095, USA
| | - Margaret G Distler
- Department of Psychiatry and Biobehavioral Sciences, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, CA, 90095, USA
| | - Alex Zelikovsky
- Department of Computer Science, Georgia State University, Atlanta, GA, 30303, USA.,The Laboratory of Bioinformatics, I.M. Sechenov First Moscow State Medical University, Moscow, 119991, Russia
| | - Eleazar Eskin
- Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA, 90095, USA.,Department of Human Genetics, University of California Los Angeles, 695 Charles E. Young, Los Angeles, CA, USA
| | - Jonathan Flint
- Department of Psychiatry and Biobehavioral Sciences, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, CA, 90095, USA
| |
|
87
|
Hwang KB, Lee IH, Li H, Won DG, Hernandez-Ferrer C, Negron JA, Kong SW. Comparative analysis of whole-genome sequencing pipelines to minimize false negative findings. Sci Rep 2019; 9:3219. [PMID: 30824715 PMCID: PMC6397176 DOI: 10.1038/s41598-019-39108-2] [Citation(s) in RCA: 43] [Impact Index Per Article: 8.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/10/2018] [Accepted: 01/16/2019] [Indexed: 12/30/2022] Open
Abstract
Comprehensive and accurate detection of variants from whole-genome sequencing (WGS) is a strong prerequisite for translational genomic medicine; however, low concordance between analytic pipelines is an outstanding challenge. We processed one European and one African WGS sample with 70 analytic pipelines, each combining one of 7 short-read aligners with one of 10 variant calling algorithms (VCAs), and observed remarkable differences in the number of variants called by different pipelines (max/min ratio: 1.3~3.4). The similarity between variant call sets was determined more by VCAs than by short-read aligners. Remarkably, reported minor allele frequency had a substantial effect on concordance between pipelines (concordance rate ratio: 0.11~0.92; Wald tests, P < 0.001), entailing more discordant results for rare and novel variants. We compared the performance of analytic pipelines and pipeline ensembles using gold-standard variant call sets and the catalog of variants from the 1000 Genomes Project. Notably, a single pipeline using BWA-MEM and GATK-HaplotypeCaller performed comparably to the pipeline ensembles for ‘callable’ regions (~97%) of the human reference genome. While a single pipeline is capable of analyzing common variants in most genomic regions, our findings demonstrated the limitations and challenges in analyzing rare or novel variants, especially for non-European genomes.
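The pipeline grid and the idea of concordance between call sets described in this abstract can be illustrated with a minimal Python sketch. It assumes Jaccard similarity as the concordance measure, which is one common choice and not necessarily the metric used in the study. The tool names, pipeline labels, and variants below are placeholders invented for illustration.

```python
from itertools import combinations, product

# Placeholder tool names: the abstract gives only the counts (7 aligners, 10 VCAs).
aligners = [f"aligner_{i}" for i in range(1, 8)]
callers = [f"vca_{j}" for j in range(1, 11)]
pipelines = list(product(aligners, callers))  # 7 x 10 = 70 combinations


def jaccard(a, b):
    """Shared variants divided by variants present in either call set."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pairwise_concordance(call_sets):
    """Jaccard concordance for every unordered pair of pipelines."""
    return {
        (p, q): jaccard(call_sets[p], call_sets[q])
        for p, q in combinations(sorted(call_sets), 2)
    }


# Hypothetical call sets; each variant is keyed by (chrom, pos, ref, alt).
toy_calls = {
    "pipeline_A": {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr2", 75, "G", "GA")},
    "pipeline_B": {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr3", 40, "T", "C")},
    "pipeline_C": {("chr1", 100, "A", "G")},
}
concordance = pairwise_concordance(toy_calls)
```

Keying each variant by position and alleles means two pipelines agree only when they report the same allele change at the same site. Real comparisons usually also need variant normalization, which this sketch omits.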
Affiliation(s)
- Kyu-Baek Hwang
- School of Computer Science and Engineering, Soongsil University, Seoul, 06978, Korea
| | - In-Hee Lee
- Computational Health Informatics Program, Boston Children's Hospital, Boston, MA, 02115, USA
| | - Honglan Li
- School of Computer Science and Engineering, Soongsil University, Seoul, 06978, Korea
| | - Dhong-Geon Won
- School of Computer Science and Engineering, Soongsil University, Seoul, 06978, Korea
| | - Carles Hernandez-Ferrer
- Computational Health Informatics Program, Boston Children's Hospital, Boston, MA, 02115, USA
| | - Jose Alberto Negron
- Computational Health Informatics Program, Boston Children's Hospital, Boston, MA, 02115, USA
| | - Sek Won Kong
- Computational Health Informatics Program, Boston Children's Hospital, Boston, MA, 02115, USA.,Department of Pediatrics, Harvard Medical School, Boston, MA, 02115, USA
| |
|
88
|
Salmela L, Tomescu AI. Safely Filling Gaps with Partial Solutions Common to All Solutions. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2019; 16:617-626. [PMID: 29994355 DOI: 10.1109/tcbb.2017.2785831] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
Gap filling has emerged as a natural sub-problem of many de novo genome assembly projects. The gap filling problem generally asks for an $s$-$t$ path in an assembly graph whose length matches the gap length estimate. Several methods have addressed it, but only a few have focused on strategies for dealing with multiple gap filling solutions and for guaranteeing reliable results. Such strategies include reporting only unique solutions, or exhaustively enumerating all filling solutions and heuristically creating their consensus. Our main contribution is a new method for reliable gap filling: filling gaps with those sub-paths common to all gap filling solutions. We call these partial solutions safe, following the framework of (Tomescu and Medvedev, RECOMB 2016). We give an efficient safe algorithm running in $O(dm)$ time and space, where $d$ is the gap length estimate and $m$ is the number of edges of the assembly graph. To show the benefits of this method, we implemented this algorithm for the problem of filling gaps in scaffolds. Our experimental results on bacterial and on conservative human assemblies show that, on average, our method can retrieve over 73 percent more safe and correct bases as compared to previous methods, with a similar precision.
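The notion of content shared by all gap filling solutions can be sketched in a much simplified form. The Python sketch below is not the authors' algorithm: it assumes every edge has unit length, treats a solution as any walk of exactly d edges from s to t, and reports only the positions at which all such walks pass through the same vertex, rather than maximal safe sub-paths. The function name and toy graph are invented for illustration. Each of the d layers scans at most every edge once, so the work stays within O(dm).

```python
def safe_positions(edges, s, t, d):
    """Return {position: vertex} for positions shared by all s-t walks of d edges.

    Returns None when no walk of exactly d edges exists.
    """
    succ, pred = {}, {}
    for u, v in edges:
        succ.setdefault(u, set()).add(v)
        pred.setdefault(v, set()).add(u)

    # fwd[i]: vertices reachable from s in exactly i steps.
    fwd = [{s}]
    for _ in range(d):
        fwd.append({v for u in fwd[-1] for v in succ.get(u, ())})

    # bwd[i] (after reversal): vertices that reach t in exactly d - i steps.
    bwd = [{t}]
    for _ in range(d):
        bwd.append({u for v in bwd[-1] for u in pred.get(v, ())})
    bwd.reverse()

    # A vertex lies on some solution at position i iff it is in both sets.
    layers = [fwd[i] & bwd[i] for i in range(d + 1)]
    if not layers[0]:
        return None
    return {i: next(iter(layer)) for i, layer in enumerate(layers) if len(layer) == 1}


# Toy graph: two alternative routes (via a or b) that rejoin at c.
toy_edges = [("s", "a"), ("s", "b"), ("a", "c"), ("b", "c"), ("c", "t")]
shared = safe_positions(toy_edges, "s", "t", 3)
```

In the toy graph the two solutions differ only at position 1, so positions 0, 2 and 3 are reported as shared.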
|
89
|
SM-RCNV: a statistical method to detect recurrent copy number variations in sequenced samples. Genes Genomics 2019; 41:529-536. [PMID: 30779024 DOI: 10.1007/s13258-019-00788-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/19/2018] [Accepted: 01/21/2019] [Indexed: 12/13/2022]
Abstract
BACKGROUND: Copy number variation (CNV) is an important form of genomic structural variation and is linked to dozens of human diseases. Using next-generation sequencing (NGS) data and developing computational methods to characterize such structural variants is significant for understanding the mechanisms of diseases. OBJECTIVE: The objective of this study is to develop a new statistical method for detecting recurrent CNVs across multiple samples from genomic sequences. METHODS: A statistical method, referred to as SM-RCNV, is developed to detect recurrent CNVs. This method uses a statistic associated with each location that combines the frequency of variation at that location across all samples with the correlation among consecutive locations. The weights of the frequency and correlation are trained using real datasets with known CNVs. A P-value is assessed for each location on the genome by permutation testing. RESULTS: Compared with six peer methods, SM-RCNV outperforms them under receiver operating characteristic curves. SM-RCNV successfully identifies many consistent recurrent CNVs, most of which are known to be of biological significance and associated with disease genes. The validation rates of SM-RCNV in the CEU call set and YRI call set against the Database of Genomic Variants are 258/328 (79%) and 157/309 (51%), respectively. CONCLUSION: SM-RCNV is a well-grounded statistical framework for detecting recurrent CNVs from multiple genomic sequences, providing valuable information to study genomes in human diseases. The source code is freely available at https://sourceforge.net/projects/sm-rcnv/.
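The permutation step mentioned in this abstract can be sketched in Python under strong simplifying assumptions. This is not the SM-RCNV code: the statistic here is only the per-location count of samples carrying a variation, the trained weights and the correlation term are left out, and the null distribution is built by shuffling locations independently within each sample. The function name and toy matrix are invented for illustration.

```python
import random


def permutation_pvalues(matrix, n_perm=999, seed=0):
    """Per-location permutation P-values for recurrence across samples.

    matrix[i][j] is 1 if sample i shows a variation at location j, else 0.
    """
    rng = random.Random(seed)
    n_loc = len(matrix[0])
    observed = [sum(row[j] for row in matrix) for j in range(n_loc)]
    exceed = [0] * n_loc
    for _ in range(n_perm):
        # Shuffle locations within each sample to break cross-sample alignment.
        shuffled = [rng.sample(row, n_loc) for row in matrix]
        for j in range(n_loc):
            if sum(row[j] for row in shuffled) >= observed[j]:
                exceed[j] += 1
    # Add-one correction keeps every P-value strictly positive.
    return [(1 + e) / (1 + n_perm) for e in exceed]


# Toy data: 8 samples x 20 locations. Location 5 varies in every sample;
# each sample also has one private variation elsewhere.
toy_matrix = []
for i in range(8):
    row = [0] * 20
    row[5] = 1
    row[(2 * i + 7) % 20] = 1
    toy_matrix.append(row)

toy_pvalues = permutation_pvalues(toy_matrix)
```

The location shared by all samples should receive a small P-value, while locations seen in a single sample should not stand out against the shuffled background.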
|
90
|
VanRaden PM, Bickhart DM, O'Connell JR. Calling known variants and identifying new variants while rapidly aligning sequence data. J Dairy Sci 2019; 102:3216-3229. [PMID: 30772032 DOI: 10.3168/jds.2018-15172] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/07/2018] [Accepted: 12/10/2018] [Indexed: 12/30/2022]
Abstract
Whole-genome sequencing studies can identify causative mutations for subsequent use in genomic evaluations. Speed and accuracy of sequence alignment can be improved by accounting for known variant locations during alignment instead of calling the variants after alignment as in previous programs. The new programs Findmap and Findvar were compared with alignment using Burrows-Wheeler alignment (BWA) or SNAP and variant identification using Genome Analysis ToolKit (GATK) or SAMtools. Findmap stores the reference map and any known variant locations while aligning reads and counting reference and alternate alleles for each DNA source. Findmap also outputs potential new single nucleotide variant, insertion, and deletion alleles. Findvar separates likely true variants from read errors and outputs genotype probabilities. Strategies were tested using cattle, human, and a completely random reference map and simulated or actual data. Most tests simulated 10 bulls, each with 10× simulated sequence reads containing 39 million variants from the 1000 Bull Genomes Project. With 10 processors, clock times for processing 100× data were 105 h for BWA, 25 h for GATK, and 11 h for SAMtools but only about 4 h for SNAP, 3 h for Findmap, and 1 h for Findvar. Alignment programs required about the same total memory; BWA used 46 GB (4.6 GB/processor), whereas >10 processors can share the same memory in SNAP and Findmap, which used 40 and 46 GB, respectively. Findmap correctly mapped 92.9% of reads (compared with 92.6% from SNAP and 90.5% from BWA) and had high accuracy of calling alleles for known variants. For new variants, Findvar found 99.8% of single nucleotide variants, 79% of insertions, and 67% of deletions; GATK found 99.4, 95, and 90%, respectively; and SAMtools found 99.8, 12, and 16%, respectively. 
False positives (as percentages of true variants) were 11% of single nucleotide variants, 0.4% of insertions, and 0.3% of deletions from Findvar; 12, 8.4, and 2.9%, respectively, from GATK; and 37, 1.3, and 0.4%, respectively, from SAMtools. Advantages of Findmap and Findvar are fast processing, precise alignment, more useful data summaries, more compact output, and fewer steps. Calling known variants during alignment allows more efficient and accurate sequence-based genotyping.
Affiliation(s)
- P M VanRaden
- USDA, Agricultural Research Service, Animal Genomics and Improvement Laboratory, Beltsville, MD 20705-2350.
| | - D M Bickhart
- USDA, Agricultural Research Service, Animal Genomics and Improvement Laboratory, Beltsville, MD 20705-2350
| | - J R O'Connell
- University of Maryland School of Medicine, Baltimore 21201
| |
|
91
|
Tang M, Hasan MS, Zhu H, Zhang L, Wu X. vi-HMM: a novel HMM-based method for sequence variant identification in short-read data. Hum Genomics 2019; 13:9. [PMID: 30795817 PMCID: PMC6387560 DOI: 10.1186/s40246-019-0194-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/02/2018] [Accepted: 01/29/2019] [Indexed: 12/30/2022] Open
Abstract
Background: Accurate and reliable identification of sequence variants, including single nucleotide polymorphisms (SNPs) and insertion-deletion polymorphisms (INDELs), plays a fundamental role in next-generation sequencing (NGS) applications. Existing methods for calling these variants often make simplified assumptions of positional independence and fail to leverage the dependence between genotypes at nearby loci that is caused by linkage disequilibrium (LD). Results and conclusion: We propose vi-HMM, a hidden Markov model (HMM)-based method for calling SNPs and INDELs in mapped short-read data. This method allows transitions between hidden states (defined as “SNP,” “Ins,” “Del,” and “Match”) of adjacent genomic bases and determines an optimal hidden state path by using the Viterbi algorithm. The inferred hidden state path provides a direct solution to the identification of SNPs and INDELs. Simulation studies show that, under various sequencing depths, vi-HMM outperforms commonly used variant calling methods in terms of sensitivity and F1 score. When applied to the real data, vi-HMM demonstrates higher accuracy in calling SNPs and INDELs. Electronic supplementary material: The online version of this article (10.1186/s40246-019-0194-6) contains supplementary material, which is available to authorized users.
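The Viterbi decoding step named in this abstract can be illustrated with a generic Python sketch over the four hidden states it lists. This is a textbook Viterbi implementation, not the vi-HMM code. The observation symbols ("ref", "alt", "extra", "gap") and all probabilities are hypothetical values chosen for illustration, since the abstract does not describe the model's emissions or parameters.

```python
import math


def viterbi(obs, states, start, trans, emit):
    """Most probable hidden state path for obs, computed in log space."""
    score = [{s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}]
    back = []
    for o in obs[1:]:
        prev, col, ptr = score[-1], {}, {}
        for s in states:
            # Best predecessor state for landing in s at this position.
            best = max(states, key=lambda p: prev[p] + math.log(trans[p][s]))
            col[s] = prev[best] + math.log(trans[best][s]) + math.log(emit[s][o])
            ptr[s] = best
        score.append(col)
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


STATES = ["Match", "SNP", "Ins", "Del"]
# Hypothetical parameters: every row sums to 1.
START = {"Match": 0.97, "SNP": 0.01, "Ins": 0.01, "Del": 0.01}
TRANS = {
    "Match": {"Match": 0.94, "SNP": 0.02, "Ins": 0.02, "Del": 0.02},
    "SNP": {"Match": 0.94, "SNP": 0.02, "Ins": 0.02, "Del": 0.02},
    "Ins": {"Match": 0.70, "SNP": 0.02, "Ins": 0.26, "Del": 0.02},
    "Del": {"Match": 0.70, "SNP": 0.02, "Ins": 0.02, "Del": 0.26},
}
SYMBOLS = {"Match": "ref", "SNP": "alt", "Ins": "extra", "Del": "gap"}
EMIT = {
    s: {sym: (0.97 if sym == SYMBOLS[s] else 0.01) for sym in SYMBOLS.values()}
    for s in STATES
}

toy_obs = ["ref", "ref", "alt", "ref", "gap", "gap", "ref"]
toy_path = viterbi(toy_obs, STATES, START, TRANS, EMIT)
# toy_path == ["Match", "Match", "SNP", "Match", "Del", "Del", "Match"]
```

The higher self-transition probability for "Ins" and "Del" lets adjacent positions support one another, which is the kind of dependence between neighboring bases that a positionally independent caller would ignore.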
Affiliation(s)
- Man Tang
- Department of Statistics, Virginia Tech, 250 Drillfield Drive, Blacksburg, 24061, VA, USA
| | - Mohammad Shabbir Hasan
- Department of Computer Science, Virginia Tech, 225 Stanger Street, Blacksburg, 24060, VA, USA
| | - Hongxiao Zhu
- Department of Statistics, Virginia Tech, 250 Drillfield Drive, Blacksburg, 24061, VA, USA
| | - Liqing Zhang
- Department of Computer Science, Virginia Tech, 225 Stanger Street, Blacksburg, 24060, VA, USA
| | - Xiaowei Wu
- Department of Statistics, Virginia Tech, 250 Drillfield Drive, Blacksburg, 24061, VA, USA.
| |
|
92
|
Zhou B, Ho SS, Greer SU, Zhu X, Bell JM, Arthur JG, Spies N, Zhang X, Byeon S, Pattni R, Ben-Efraim N, Haney MS, Haraksingh RR, Song G, Ji HP, Perrin D, Wong WH, Abyzov A, Urban AE. Comprehensive, integrated, and phased whole-genome analysis of the primary ENCODE cell line K562. Genome Res 2019; 29:472-484. [PMID: 30737237 PMCID: PMC6396411 DOI: 10.1101/gr.234948.118] [Citation(s) in RCA: 58] [Impact Index Per Article: 11.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/22/2018] [Accepted: 12/28/2018] [Indexed: 11/24/2022]
Abstract
K562 is widely used in biomedical research. It is one of three tier-one cell lines of ENCODE and also most commonly used for large-scale CRISPR/Cas9 screens. Although its functional genomic and epigenomic characteristics have been extensively studied, its genome sequence and genomic structural features have never been comprehensively analyzed. Such information is essential for the correct interpretation and understanding of the vast troves of existing functional genomics and epigenomics data for K562. We performed and integrated deep-coverage whole-genome (short-insert), mate-pair, and linked-read sequencing as well as karyotyping and array CGH analysis to identify a wide spectrum of genome characteristics in K562: copy numbers (CN) of aneuploid chromosome segments at high-resolution, SNVs and indels (both corrected for CN in aneuploid regions), loss of heterozygosity, megabase-scale phased haplotypes often spanning entire chromosome arms, structural variants (SVs), including small and large-scale complex SVs and nonreference retrotransposon insertions. Many SVs were phased, assembled, and experimentally validated. We identified multiple allele-specific deletions and duplications within the tumor suppressor gene FHIT. Taking aneuploidy into account, we reanalyzed K562 RNA-seq and whole-genome bisulfite sequencing data for allele-specific expression and allele-specific DNA methylation. We also show examples of how deeper insights into regulatory complexity are gained by integrating genomic variant information and structural context with functional genomics and epigenomics data. Furthermore, using K562 haplotype information, we produced an allele-specific CRISPR targeting map. This comprehensive whole-genome analysis serves as a resource for future studies that utilize K562 as well as a framework for the analysis of other cancer genomes.
Affiliation(s)
- Bo Zhou
- Department of Psychiatry and Behavioral Sciences, Stanford University School of Medicine, Stanford, California 94305, USA.,Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA
| | - Steve S Ho
- Department of Psychiatry and Behavioral Sciences, Stanford University School of Medicine, Stanford, California 94305, USA.,Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA
| | - Stephanie U Greer
- Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, California 94305, USA
| | - Xiaowei Zhu
- Department of Psychiatry and Behavioral Sciences, Stanford University School of Medicine, Stanford, California 94305, USA.,Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA
| | - John M Bell
- Stanford Genome Technology Center, Stanford University, Palo Alto, California 94304, USA
| | - Joseph G Arthur
- Department of Statistics, Stanford University, Stanford, California 94305, USA
| | - Noah Spies
- Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA.,Department of Pathology, Stanford University School of Medicine, Stanford, California 94305, USA.,Genome-Scale Measurements Group, National Institute of Standards and Technology, Gaithersburg, Maryland 20899, USA
| | - Xianglong Zhang
- Department of Psychiatry and Behavioral Sciences, Stanford University School of Medicine, Stanford, California 94305, USA.,Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA
| | - Seunggyu Byeon
- School of Computer Science and Engineering, College of Engineering, Pusan National University, Busan 46241, South Korea
| | - Reenal Pattni
- Department of Psychiatry and Behavioral Sciences, Stanford University School of Medicine, Stanford, California 94305, USA.,Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA
| | - Noa Ben-Efraim
- Department of Psychiatry and Behavioral Sciences, Stanford University School of Medicine, Stanford, California 94305, USA.,Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA
| | - Michael S Haney
- Department of Psychiatry and Behavioral Sciences, Stanford University School of Medicine, Stanford, California 94305, USA.,Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA
| | - Rajini R Haraksingh
- Department of Psychiatry and Behavioral Sciences, Stanford University School of Medicine, Stanford, California 94305, USA.,Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA
| | - Giltae Song
- School of Computer Science and Engineering, College of Engineering, Pusan National University, Busan 46241, South Korea
| | - Hanlee P Ji
- Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, California 94305, USA.,Stanford Genome Technology Center, Stanford University, Palo Alto, California 94304, USA
| | - Dimitri Perrin
- Science and Engineering Faculty, Queensland University of Technology, Brisbane, QLD 4001, Australia
| | - Wing H Wong
- Department of Statistics, Stanford University, Stanford, California 94305, USA.,Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, California 94305, USA
| | - Alexej Abyzov
- Department of Health Sciences Research, Center for Individualized Medicine, Mayo Clinic, Rochester, Minnesota 55905, USA
| | - Alexander E Urban
- Department of Psychiatry and Behavioral Sciences, Stanford University School of Medicine, Stanford, California 94305, USA.,Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA.,Tashia and John Morgridge Faculty Scholar, Stanford Child Health Research Institute, Stanford, California 94305, USA
| |
|
93
|
González-Gomariz J, Guruceaga E, López-Sánchez M, Segura V. Proteogenomics in the context of the Human Proteome Project (HPP). Expert Rev Proteomics 2019; 16:267-275. [PMID: 30654666 DOI: 10.1080/14789450.2019.1571916] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/15/2022]
Abstract
INTRODUCTION: The technological and scientific progress made in the Human Proteome Project (HPP) has provided the scientific community with a new set of experimental and bioinformatic methods in the challenging field of shotgun and SRM/MRM-based proteomics. The requirements for a protein to be considered experimentally validated are now well-established, and the information about the human proteome is available in the neXtProt database, while targeted proteomic assays are stored in SRMAtlas. However, the study of the missing proteins remains an outstanding issue. Areas covered: This review focuses on the implementation of proteogenomic methods designed to improve the detection and validation of the missing proteins. The evolution of the methodological strategies based on the combination of different omic technologies and the use of huge publicly available datasets is shown, taking the Chromosome 16 Consortium as reference. Expert commentary: Proteogenomics and other strategies of data analysis implemented within the C-HPP initiative could be used as guidance to complete the catalog of human proteins in the near future. Besides, in the coming years, we will probably witness their use in the B/D-HPP initiative to take a step forward in understanding the implications of proteins in human biology and disease.
Affiliation(s)
- José González-Gomariz
- Bioinformatics Platform, Center for Applied Medical Research, University of Navarra, Pamplona, Spain.,IdiSNA, Navarra Institute for Health Research, Pamplona, Spain
| | - Elizabeth Guruceaga
- Bioinformatics Platform, Center for Applied Medical Research, University of Navarra, Pamplona, Spain.,IdiSNA, Navarra Institute for Health Research, Pamplona, Spain
| | - Macarena López-Sánchez
- Bioinformatics Platform, Center for Applied Medical Research, University of Navarra, Pamplona, Spain
| | - Victor Segura
- Bioinformatics Platform, Center for Applied Medical Research, University of Navarra, Pamplona, Spain.,IdiSNA, Navarra Institute for Health Research, Pamplona, Spain
| |
|
94
|
Cox KH, Oliveira LMB, Plummer L, Corbin B, Gardella T, Balasubramanian R, Crowley WF. Modeling mutant/wild-type interactions to ascertain pathogenicity of PROKR2 missense variants in patients with isolated GnRH deficiency. Hum Mol Genet 2019; 27:338-350. [PMID: 29161432 DOI: 10.1093/hmg/ddx404] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/14/2017] [Accepted: 11/10/2017] [Indexed: 12/30/2022] Open
Abstract
A major challenge in human genetics is the validation of pathogenicity of heterozygous missense variants. This problem is well-illustrated by PROKR2 variants associated with Isolated GnRH Deficiency (IGD). Homozygous loss-of-function variants in PROKR2 were initially implicated in autosomal recessive IGD; however, most IGD-associated PROKR2 variants are heterozygous. Moreover, while IGD patient cohorts are enriched for PROKR2 missense variants, similar rare variants are also found in normal individuals. To elucidate the pathogenic mechanisms distinguishing IGD-associated PROKR2 variants from rare variants in controls, we assessed 59 variants using three approaches: (i) in silico prediction, (ii) traditional in vitro functional assays across three signaling pathways with mutant-alone transfections, and (iii) modified in vitro assays with mutant and wild-type expression constructs co-transfected to model in vivo heterozygosity. We found that neither in silico analyses nor traditional in vitro assessments of mutants transfected alone could distinguish IGD variants from control variants. However, in vitro co-transfections revealed that 15/34 IGD variants caused loss-of-function (LoF), including 3 novel dominant-negatives, while only 4/25 control variants caused LoF. Surprisingly, 19 IGD-associated variants were benign or exhibited LoF that could be rescued by WT co-transfection. Overall, variants that were LoF in ≥ 2 signaling assays under co-transfection conditions were more likely to be disease-associated than benign or 'rescuable' variants. Our findings suggest that in vitro modeling of WT/Mutant interactions increases the resolution for identifying causal variants, uncovers novel dominant negative mutations, and provides new insights into the pathogenic mechanisms underlying heterozygous PROKR2 variants.
Affiliation(s)
- Kimberly H Cox
- Harvard Reproductive Sciences Center and The Reproductive Endocrine Unit of the Department of Medicine, Massachusetts General Hospital, Boston, MA 02114, USA
| | - Luciana M B Oliveira
- Department of Bioregulation, Institute of Health Sciences, Federal University of Bahia, Salvador, Brazil
| | - Lacey Plummer
- Harvard Reproductive Sciences Center and The Reproductive Endocrine Unit of the Department of Medicine, Massachusetts General Hospital, Boston, MA 02114, USA
| | - Braden Corbin
- Endocrine Unit, Department of Medicine, Massachusetts General Hospital, Boston, MA 02114, USA
| | - Thomas Gardella
- Endocrine Unit, Department of Medicine, Massachusetts General Hospital, Boston, MA 02114, USA
| | - Ravikumar Balasubramanian
- Harvard Reproductive Sciences Center and The Reproductive Endocrine Unit of the Department of Medicine, Massachusetts General Hospital, Boston, MA 02114, USA
| | - William F Crowley
- Harvard Reproductive Sciences Center and The Reproductive Endocrine Unit of the Department of Medicine, Massachusetts General Hospital, Boston, MA 02114, USA
| |
|
95
|
A Systematic Review of Open Source Clinical Software on GitHub for Improving Software Reuse in Smart Healthcare. APPLIED SCIENCES-BASEL 2019. [DOI: 10.3390/app9010150] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/24/2022]
Abstract
The plethora of open source clinical software offers great reuse opportunities for developers to build clinical tools at lower cost and at a faster pace. However, the lack of research on open source clinical software poses a challenge for software reuse in clinical software development. This paper aims to help clinical developers better understand open source clinical software by conducting a thorough investigation of open source clinical software hosted on GitHub. We first developed a data pipeline that automatically collected and preprocessed GitHub data. Then, a deep analysis with several methods, such as statistical analysis, hypothesis testing, and topic modeling, was conducted to reveal the overall status and various characteristics of open source clinical software. There were 14,971 clinical-related GitHub repositories created during the last 10 years, with an average annual growth rate of 55%. Among them, 12,919 are open source clinical software. Our analysis unveiled a number of interesting findings: Popular open source clinical software in terms of the number of stars, most productive countries that contribute to the community, important factors that make an open source clinical software popular, and 10 main groups of open source clinical software. The results can assist both researchers and practitioners, especially newcomers, in understanding open source clinical software.
|
96
|
Plekhanova E, Nuzhdin SV, Utkin LV, Samsonova MG. Prediction of deleterious mutations in coding regions of mammals with transfer learning. Evol Appl 2019; 12:18-28. [PMID: 30622632 PMCID: PMC6304693 DOI: 10.1111/eva.12607] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 11/02/2017] [Accepted: 01/16/2018] [Indexed: 12/31/2022] Open
Abstract
The genomes of mammals contain thousands of deleterious mutations. It is important to be able to recognize them with high precision. In conservation biology, the small size of fragmented populations results in accumulation of damaging variants. Preserving animals with less damaged genomes could optimize conservation efforts. In breeding of farm animals, trade-offs between farm performance and general fitness might be better avoided if deleterious mutations are well classified. In humans, the problem of such a precise classification has been successfully solved, in large part due to large databases of disease-causing mutations. However, this kind of information is very limited for other mammals. Here, we propose to better use information available on human mutations to enable classification of damaging mutations in other mammalian species. Specifically, we apply transfer learning: machine learning methods that improve performance on a small dataset for a focal problem (recognizing damaging mutations in our companion and farm animals) by using the much larger datasets available for a related problem (recognizing damaging mutations in humans). We validate our tools using mouse and dog annotated datasets and obtain significantly better results in comparison to the SIFT classifier. Then, we apply them to predict deleterious mutations in a cattle genome-wide dataset.
Affiliation(s)
- Elena Plekhanova
- Peter the Great St. Petersburg Polytechnic University, St. Petersburg, Russia
| | - Sergey V. Nuzhdin
- Peter the Great St. Petersburg Polytechnic University, St. Petersburg, Russia
- Program Molecular and Computation Biology, Dornsife College of Letters, Arts, and Sciences, University of Southern California, Los Angeles, CA, USA
| | - Lev V. Utkin
- Peter the Great St. Petersburg Polytechnic University, St. Petersburg, Russia
| | - Maria G. Samsonova
- Peter the Great St. Petersburg Polytechnic University, St. Petersburg, Russia
| |
|
97
|
Shashi V, Schoch K, Spillmann R, Cope H, Tan QKG, Walley N, Pena L, McConkie-Rosell A, Jiang YH, Stong N, Need AC, Goldstein DB. A comprehensive iterative approach is highly effective in diagnosing individuals who are exome negative. Genet Med 2019; 21:161-172. [PMID: 29907797 PMCID: PMC6295275 DOI: 10.1038/s41436-018-0044-2] [Citation(s) in RCA: 51] [Impact Index Per Article: 10.2] [Received: 01/26/2018] [Accepted: 04/09/2018] [Indexed: 01/01/2023] Open
Abstract
PURPOSE: Sixty to seventy-five percent of individuals with rare and undiagnosed phenotypes remain undiagnosed after exome sequencing (ES). With standard ES reanalysis resolving 10-15% of the ES negatives, further approaches are necessary to maximize diagnoses in these individuals. METHODS: In 38 ES negative patients an individualized genomic-phenotypic approach was employed utilizing (1) phenotyping; (2) reanalyses of FASTQ files, with innovative bioinformatics; (3) targeted molecular testing; (4) genome sequencing (GS); and (5) conferring of clinical diagnoses when pathognomonic clinical findings occurred. RESULTS: Certain and highly likely diagnoses were made in 18/38 (47%) individuals, including identifying two new developmental disorders. The majority of diagnoses (>70%) were due to our bioinformatics, phenotyping, and targeted testing identifying variants that were undetected or not prioritized on prior ES. GS diagnosed 3/18 individuals with structural variants not amenable to ES. Additionally, tentative diagnoses were made in 3 (8%), and in 5 individuals (13%) candidate genes were identified. Overall, diagnoses/potential leads were identified in 26/38 (68%). CONCLUSIONS: Our comprehensive approach to ES negatives maximizes the ES and clinical data for both diagnoses and candidate gene identification, without GS in the majority. This iterative approach is cost-effective and is pertinent to the current conundrum of ES negatives.
Affiliation(s)
- Vandana Shashi
- Department of Pediatrics, Division of Medical Genetics, Duke University School of Medicine, Durham, North Carolina, USA
| | - Kelly Schoch
- Department of Pediatrics, Division of Medical Genetics, Duke University School of Medicine, Durham, North Carolina, USA
| | - Rebecca Spillmann
- Department of Pediatrics, Division of Medical Genetics, Duke University School of Medicine, Durham, North Carolina, USA
| | - Heidi Cope
- Department of Pediatrics, Division of Medical Genetics, Duke University School of Medicine, Durham, North Carolina, USA
| | - Queenie K-G Tan
- Department of Pediatrics, Division of Medical Genetics, Duke University School of Medicine, Durham, North Carolina, USA
| | - Nicole Walley
- Department of Pediatrics, Division of Medical Genetics, Duke University School of Medicine, Durham, North Carolina, USA
| | - Loren Pena
- Department of Pediatrics, Division of Medical Genetics, Duke University School of Medicine, Durham, North Carolina, USA
| | - Allyn McConkie-Rosell
- Department of Pediatrics, Division of Medical Genetics, Duke University School of Medicine, Durham, North Carolina, USA
| | - Yong-Hui Jiang
- Department of Pediatrics, Division of Medical Genetics, Duke University School of Medicine, Durham, North Carolina, USA
| | - Nicholas Stong
- Institute for Genomic Medicine, Columbia University, New York, New York, USA
| | - Anna C Need
- Division of Brain Sciences, Department of Medicine, Imperial College London, London, UK
| | - David B Goldstein
- Institute for Genomic Medicine, Columbia University, New York, New York, USA
| |
|
98
|
Zhou B, Arthur JG, Ho SS, Pattni R, Huang Y, Wong WH, Urban AE. Extensive and deep sequencing of the Venter/HuRef genome for developing and benchmarking genome analysis tools. Sci Data 2018; 5:180261. [PMID: 30561434 PMCID: PMC6298255 DOI: 10.1038/sdata.2018.261] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 05/23/2018] [Accepted: 10/04/2018] [Indexed: 12/30/2022] Open
Abstract
We produced an extensive collection of deep re-sequencing datasets for the Venter/HuRef genome using the Illumina massively parallel DNA sequencing platform. The original Venter genome sequence is a very high-quality phased assembly based on Sanger sequencing. Therefore, researchers developing novel computational tools for the analysis of human genome sequence variation for the dominant Illumina sequencing technology can test and hone their algorithms by making variant calls from these Venter/HuRef datasets and then immediately confirm the detected variants in the Sanger assembly, freeing them of the need for further experimental validation. This process also applies to implementing and benchmarking existing genome analysis pipelines. We prepared and sequenced 200 bp and 350 bp short-insert whole-genome sequencing libraries (sequenced to 100x and 40x genomic coverages, respectively) as well as 2 kb, 5 kb, and 12 kb mate-pair libraries (49x, 122x, and 145x physical coverages, respectively). Lastly, we produced a linked-read library (128x physical coverage) from which we also performed haplotype phasing.
Affiliation(s)
- Bo Zhou
- Department of Psychiatry and Behavioral Sciences, Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA
| | - Joseph G. Arthur
- Department of Statistics, Department of Biomedical Data Science, Bio-X Program, Stanford University, Stanford, California 94305, USA
| | - Steve S. Ho
- Department of Psychiatry and Behavioral Sciences, Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA
| | - Reenal Pattni
- Department of Psychiatry and Behavioral Sciences, Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA
| | - Yiling Huang
- Department of Psychiatry and Behavioral Sciences, Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA
| | - Wing H. Wong
- Department of Statistics, Department of Biomedical Data Science, Bio-X Program, Stanford University, Stanford, California 94305, USA
| | - Alexander E. Urban
- Department of Psychiatry and Behavioral Sciences, Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA
- Tashia and John Morgridge Faculty Scholar, Stanford Child Health Research Institute, Palo Alto, California 94305, USA
| |
|
99
|
Muyas F, Bosio M, Puig A, Susak H, Domènech L, Escaramis G, Zapata L, Demidov G, Estivill X, Rabionet R, Ossowski S. Allele balance bias identifies systematic genotyping errors and false disease associations. Hum Mutat 2018; 40:115-126. [PMID: 30353964 PMCID: PMC6587442 DOI: 10.1002/humu.23674] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Received: 03/29/2018] [Revised: 09/17/2018] [Accepted: 10/20/2018] [Indexed: 12/13/2022]
Abstract
In recent years, next‐generation sequencing (NGS) has become a cornerstone of clinical genetics and diagnostics. Many clinical applications require high precision, especially if rare events such as somatic mutations in cancer or genetic variants causing rare diseases need to be identified. Although random sequencing errors can be modeled statistically and deep sequencing minimizes their impact, systematic errors remain a problem even at high depth of coverage. Understanding their source is crucial to increase precision of clinical NGS applications. In this work, we studied the relation between recurrent biases in allele balance (AB), systematic errors, and false positive variant calls across a large cohort of human samples analyzed by whole exome sequencing (WES). We have modeled the AB distribution for biallelic genotypes in 987 WES samples in order to identify positions recurrently deviating significantly from the expectation, a phenomenon we termed allele balance bias (ABB). Furthermore, we have developed a genotype callability score based on ABB for all positions of the human exome, which detects false positive variant calls that passed state‐of‐the‐art filters. Finally, we demonstrate the use of ABB for detection of false associations proposed by rare variant association studies. Availability: https://github.com/Francesc-Muyas/ABB.
Affiliation(s)
- Francesc Muyas
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain; Institute of Medical Genetics and Applied Genomics, University of Tübingen, Tübingen, Germany
| | - Mattia Bosio
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain
| | - Anna Puig
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain
| | - Hana Susak
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain
| | - Laura Domènech
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain; CIBER in Epidemiology and Public Health (CIBERESP), Barcelona, Spain
| | - Georgia Escaramis
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain; CIBER in Epidemiology and Public Health (CIBERESP), Barcelona, Spain
| | - Luis Zapata
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain
| | - German Demidov
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain; Institute of Medical Genetics and Applied Genomics, University of Tübingen, Tübingen, Germany
| | - Xavier Estivill
- Sidra Medicine, Doha, Qatar; Women's Health Dexeus, Barcelona, Spain
| | - Raquel Rabionet
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain; CIBER in Epidemiology and Public Health (CIBERESP), Barcelona, Spain; Institut de Recerca Sant Joan de Déu; Institut de Biomedicina de la Universitat de Barcelona (IBUB); Department of Genetics, Microbiology and Statistics, University of Barcelona, Barcelona, Spain
| | - Stephan Ossowski
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain; Institute of Medical Genetics and Applied Genomics, University of Tübingen, Tübingen, Germany
| |
|
100
|
Raimondi D, Orlando G, Tabaro F, Lenaerts T, Rooman M, Moreau Y, Vranken WF. Large-scale in-silico statistical mutagenesis analysis sheds light on the deleteriousness landscape of the human proteome. Sci Rep 2018; 8:16980. [PMID: 30451933 PMCID: PMC6242909 DOI: 10.1038/s41598-018-34959-7] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 06/22/2018] [Accepted: 10/26/2018] [Indexed: 12/18/2022] Open
Abstract
Next generation sequencing technologies are providing increasing amounts of sequencing data, paving the way for improvements in clinical genetics and precision medicine. The interpretation of the observed genomic variants in the light of their phenotypic effects is thus emerging as a crucial task to solve in order to advance our understanding of how exomic variants affect proteins and how the proteins' functional changes affect human health. Since the experimental evaluation of the effects of every observed variant is unfeasible, bioinformatics methods are being developed to address this challenge in-silico, by predicting the impact of millions of variants, thus providing insight into the deleteriousness landscape of entire proteomes. Here we show the feasibility of this approach by using the recently developed DEOGEN2 variant-effect predictor to perform the largest in-silico mutagenesis scan to date. We computed the deleteriousness score of 170 million variants over 15,000 human proteins and analysed the results, investigating how the predicted deleteriousness landscape of the proteins relates to known functionally and structurally relevant protein regions and biophysical properties. Moreover, we qualitatively validated our results by comparing them with two mutagenesis studies targeting two specific proteins, showing the consistency of DEOGEN2 predictions with respect to experimental data.
Affiliation(s)
- Daniele Raimondi
- Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, La Plaine Campus, Triomflaan, 1050, Brussels, Belgium
- ESAT-STADIUS, KU Leuven, Kasteelpark Arenberg 10, 3001, Leuven, Belgium
- Structural Biology Brussels, Vrije Universiteit Brussel, Pleinlaan 2, 1050, Brussels, Belgium
| | - Gabriele Orlando
- Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, La Plaine Campus, Triomflaan, 1050, Brussels, Belgium
- Structural Biology Brussels, Vrije Universiteit Brussel, Pleinlaan 2, 1050, Brussels, Belgium
| | - Francesco Tabaro
- Institute of Biosciences and Medical Technology, Arvo Ylpön katu 34, 33520, Tampere, Finland
| | - Tom Lenaerts
- Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, La Plaine Campus, Triomflaan, 1050, Brussels, Belgium
- Machine Learning Group, ULB, La Plaine Campus, 1050, Brussels, Belgium
| | - Marianne Rooman
- Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, La Plaine Campus, Triomflaan, 1050, Brussels, Belgium
- Department of BioModeling, BioInformatics & BioProcesses, Université Libre de Bruxelles, 1050, Brussels, Belgium
| | - Yves Moreau
- ESAT-STADIUS, KU Leuven, Kasteelpark Arenberg 10, 3001, Leuven, Belgium
- Imec, 3001, Leuven, Belgium
| | - Wim F Vranken
- Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, La Plaine Campus, Triomflaan, 1050, Brussels, Belgium
- Structural Biology Brussels, Vrije Universiteit Brussel, Pleinlaan 2, 1050, Brussels, Belgium
| |
|