1
Li X, Cooper NGF, O'Toole TE, Rouchka EC. Choice of library size normalization and statistical methods for differential gene expression analysis in balanced two-group comparisons for RNA-seq studies. BMC Genomics 2020; 21:75. [PMID: 31992223 PMCID: PMC6986029 DOI: 10.1186/s12864-020-6502-7] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 06/14/2019] [Accepted: 01/16/2020] [Indexed: 12/20/2022]
Abstract
Background: High-throughput RNA sequencing (RNA-seq) has evolved into an important analytical tool in molecular biology. Although the utility and importance of this technique have grown, uncertainties regarding the proper analysis of RNA-seq data remain. Of primary concern, there is no consensus regarding which normalization and statistical methods are the most appropriate for analyzing these data. The lack of standardized analytical methods leads to uncertainties in data interpretation and study reproducibility, especially in studies reporting high false discovery rates. In this study, we compared a recently developed normalization method, UQ-pgQ2, with three of the most frequently used alternatives, RLE (relative log expression), TMM (trimmed mean of M-values) and UQ (upper quartile normalization), in the analysis of RNA-seq data. We evaluated the performance of these methods for gene-level differential expression analysis by considering the following factors: 1) normalization combined with the choice of a Wald test from DESeq2 or an exact test/QL (quasi-likelihood) F-test from edgeR; 2) sample sizes in two balanced two-group comparisons; and 3) sequencing read depths.
Results: Using the MAQC RNA-seq datasets with small sample replicates, we found that UQ-pgQ2 normalization combined with an exact test can achieve better performance in terms of power and specificity in differential gene expression analysis. However, using an intra-group analysis of false positives from real and simulated data, we found that a Wald test performs better than an exact test when the number of sample replicates is large and that a QL F-test performs the best given sample sizes of 5, 10 and 15 for any normalization. The RLE, TMM and UQ methods performed similarly given a desired sample size.
Conclusion: We found that the UQ-pgQ2 method combined with an exact test/QL F-test is the best choice for controlling false positives when the sample size is small. When the sample size is large, UQ-pgQ2 with a QL F-test is a better choice for type I error control in an intra-group analysis. We observed that read depths have a minimal impact on differential gene expression analysis based on the simulated data.
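To make the idea of library size normalization concrete, the following is a minimal sketch of plain upper-quartile (UQ) scaling, one of the baseline methods the abstract compares. It is an illustration only: the function names `upper_quartile` and `uq_normalize` and the toy counts are hypothetical, it does not reproduce the UQ-pgQ2 method or the exact procedures used in edgeR or DESeq2, and it assumes genes with zero counts in every sample are excluded before the quartile is taken.

```python
from statistics import quantiles


def upper_quartile(values):
    """Return the 75th percentile of a list of numbers."""
    if len(values) < 2:
        raise ValueError("need at least two values to estimate a quartile")
    # quantiles(..., n=4) returns the three quartile cut points.
    return quantiles(values, n=4, method="inclusive")[2]


def uq_normalize(counts):
    """Scale each sample so that all samples share the same upper quartile.

    counts: dict mapping sample name -> list of raw gene counts
            (same gene order in every sample).
    Returns a dict with the same shape holding scaled counts.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    # Assumption: drop genes that are zero in every sample before
    # computing the quartile, so uninformative genes do not dominate.
    keep = [g for g in range(n_genes)
            if any(counts[s][g] > 0 for s in samples)]
    uq = {s: upper_quartile([counts[s][g] for g in keep]) for s in samples}
    if any(v == 0 for v in uq.values()):
        raise ValueError("a sample has an upper quartile of zero")
    # Rescale to the mean upper quartile to keep values on a count-like scale.
    mean_uq = sum(uq.values()) / len(uq)
    return {s: [c * mean_uq / uq[s] for c in counts[s]] for s in samples}


# Toy example: sample B was sequenced twice as deeply as sample A.
raw = {"A": [0, 10, 20, 30, 40], "B": [0, 20, 40, 60, 80]}
normalized = uq_normalize(raw)
print(normalized["A"])
print(normalized["B"])
```

In this toy case the two samples differ only by sequencing depth, so after scaling they coincide; real data would of course retain biological differences between samples.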
Affiliation(s)
- Xiaohong Li
  - Department of Anatomical Sciences and Neurobiology, University of Louisville, Louisville, KY, USA
- Nigel G F Cooper
  - Department of Anatomical Sciences and Neurobiology, University of Louisville, Louisville, KY, USA
- Eric C Rouchka
  - Department of Computer Science and Engineering, University of Louisville, Louisville, KY, USA
2
How does normalization impact RNA-seq disease diagnosis? J Biomed Inform 2018; 85:80-92. [PMID: 30041017 DOI: 10.1016/j.jbi.2018.07.016] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 11/10/2017] [Revised: 07/07/2018] [Accepted: 07/14/2018] [Indexed: 12/18/2022]
Abstract
With the surge of next-generation high-throughput technologies, RNA-seq data are playing an increasingly important role in disease diagnosis, in which normalization is assumed to be an essential procedure for producing comparable samples. Recent studies have proposed different normalization methods to remove various technical biases in RNA sequencing. However, no previous studies have evaluated the impact of normalization on RNA-seq disease diagnosis. In this study, we investigate this problem by analyzing structured big data: RNA-seq data acquired from the TCGA portal, chosen for its popularity in RNA-seq disease diagnosis. We propose a novel normalization effect test algorithm, a diagnostic index (d-index), and data entropy to analyze and evaluate the impact of normalization on RNA-seq disease diagnosis using state-of-the-art machine learning models. Furthermore, we present an original visualization analysis to compare the performance of normalized data with that of raw data. We found that normalized data generally yield an equivalent or even lower level of diagnosis than the corresponding raw data. Moreover, some normalization approaches (e.g. RPKM) even have negative effects on disease diagnosis. On the other hand, raw data seem to have the potential to decipher pathological status at least as well as, or better than, normalized data. Our visualization analysis also shows that some normalization methods even introduce 'outliers', which unavoidably decrease sample detectability in diagnosis. More importantly, our data entropy analysis shows that normalized data usually demonstrate equivalent or lower entropy values than raw data. Data with high entropy values tend to achieve better diagnosis than data with low entropy values. In addition, we found that high-dimensional imbalance (HDI) data are unaffected by any normalization procedure in diagnosis, and cause almost all machine learning models to fail by recognizing only the majority types, whether the data are raw or normalized.
Our results suggest that normalized data may not demonstrate statistically significant advantages over raw data in disease diagnosis. They further imply that normalization, or at least some normalization processes, may not be an indispensable procedure in RNA-seq disease diagnosis. Instead, raw data may perform better by capturing more of the original transcriptome patterns in different pathological conditions.
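For readers unfamiliar with RPKM, which the abstract singles out, the sketch below shows the standard definition (reads per kilobase of transcript per million mapped reads). The function name `rpkm` and the example numbers are hypothetical, and the snippet does not reproduce the study's d-index, data entropy, or machine learning pipeline.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    RPKM = reads * 1e9 / (gene length in bp * total mapped reads),
    which divides by both gene length (in kb) and library size (in millions).
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and library size must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads on a 2,000 bp gene in a library of 10 million reads.
print(rpkm(500, 2_000, 10_000_000))
```

Because the value is divided by each sample's own library size, the same raw count maps to different RPKM values in different samples, which is one way a normalization step can reshape the data a classifier sees.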
3
Löhr JM, Kordes M, Rutkowski W, Heuchel R, Gustafsson-Liljefors M, Russom A, Nilsson M. Overcoming diagnostic issues in precision treatment of pancreatic cancer. Expert Review of Precision Medicine and Drug Development 2018. [DOI: 10.1080/23808993.2018.1476061] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 12/18/2022]
Affiliation(s)
- J.-Matthias Löhr
  - Department of Cancer Medicine, Division for Upper GI, Karolinska University Hospital, Stockholm, Sweden
  - CLINTEC, Karolinska Institutet, Science for Life Laboratory, Stockholm, Sweden
- Maximilian Kordes
  - Department of Cancer Medicine, Division for Upper GI, Karolinska University Hospital, Stockholm, Sweden
  - CLINTEC, Karolinska Institutet, Science for Life Laboratory, Stockholm, Sweden
- Wiktor Rutkowski
  - CLINTEC, Karolinska Institutet, Science for Life Laboratory, Stockholm, Sweden
- Rainer Heuchel
  - CLINTEC, Karolinska Institutet, Science for Life Laboratory, Stockholm, Sweden
4
Malgerud L, Lindberg J, Wirta V, Gustafsson-Liljefors M, Karimi M, Moro CF, Stecker K, Picker A, Huelsewig C, Stein M, Bohnert R, Del Chiaro M, Haas SL, Heuchel RL, Permert J, Maeurer MJ, Brock S, Verbeke CS, Engstrand L, Jackson DB, Grönberg H, Löhr JM. Bioinformatory-assisted analysis of next-generation sequencing data for precision medicine in pancreatic cancer. Mol Oncol 2017; 11:1413-1429. [PMID: 28675654 PMCID: PMC5623817 DOI: 10.1002/1878-0261.12108] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Received: 02/03/2017] [Revised: 05/30/2017] [Accepted: 06/10/2017] [Indexed: 12/20/2022]
Abstract
Pancreatic ductal adenocarcinoma (PDAC) is a tumor with an extremely poor prognosis, predominantly as a result of chemotherapy resistance and numerous somatic mutations. Consequently, PDAC is a prime candidate for the use of sequencing to identify causative mutations, facilitating subsequent administration of targeted therapy. In a feasibility study, we retrospectively assessed the therapeutic recommendations of a novel, evidence-based software that analyzes next-generation sequencing (NGS) data using a large panel of pharmacogenomic biomarkers for efficacy and toxicity. Tissue from 14 patients with PDAC was sequenced using NGS with a 620 gene panel. FASTQ files were fed into treatmentmap. The results were compared with chemotherapy in the patients, including all side effects. No changes in therapy were made. Known driver mutations for PDAC were confirmed (e.g. KRAS, TP53). Software analysis revealed positive biomarkers for predicted effective and ineffective treatments in all patients. At least one biomarker associated with increased toxicity could be detected in all patients. Patients had been receiving one of the currently approved chemotherapy agents. In two patients, toxicity could have been correctly predicted by the software analysis. The results suggest that NGS, in combination with an evidence-based software, could be conducted within a 2-week period, thus being feasible for clinical routine. Therapy recommendations were principally off-label use. Based on the predominant KRAS mutations, other drugs were predicted to be ineffective. The pharmacogenomic biomarkers indicative of increased toxicity could be retrospectively linked to reported negative side effects in the respective patients. Finally, the occurrence of somatic and germline mutations in cancer syndrome-associated genes is noteworthy, despite a high frequency of these particular variants in the background population. 
These results suggest that software analysis of NGS data provides evidence-based information on effective, ineffective and toxic drugs, potentially forming the basis for precision cancer medicine in PDAC.
Affiliation(s)
- Linnéa Malgerud
  - Center for Digestive Diseases, Karolinska University Hospital, Stockholm, Sweden; Department of Clinical Sciences, Intervention and Technology (CLINTEC), Karolinska Institutet, Stockholm, Sweden
- Johan Lindberg
  - Department of Medical Epidemiology & Biostatistics (MEB), Karolinska Institutet, Stockholm, Sweden
- Valtteri Wirta
  - Science for Life Laboratory, Department of Microbiology, Tumor and Cell Biology (MTC), Karolinska Institutet, Stockholm, Sweden
- Masoud Karimi
  - Department of Oncology at Radiumhemmet, Karolinska University Hospital, Stockholm, Sweden
- Marco Del Chiaro
  - Center for Digestive Diseases, Karolinska University Hospital, Stockholm, Sweden; Department of Clinical Sciences, Intervention and Technology (CLINTEC), Karolinska Institutet, Stockholm, Sweden
- Stephan L Haas
  - Center for Digestive Diseases, Karolinska University Hospital, Stockholm, Sweden
- Rainer L Heuchel
  - Department of Clinical Sciences, Intervention and Technology (CLINTEC), Karolinska Institutet, Stockholm, Sweden
- Johan Permert
  - Innovation Office, Karolinska University Hospital, Stockholm, Sweden
- Markus J Maeurer
  - Department of Laboratory Medicine (LABMED), Karolinska Institutet, Stockholm, Sweden
- Caroline S Verbeke
  - Department of Pathology, Karolinska University Hospital, Stockholm, Sweden
- Lars Engstrand
  - Science for Life Laboratory, Department of Microbiology, Tumor and Cell Biology (MTC), Karolinska Institutet, Stockholm, Sweden
- Henrik Grönberg
  - Department of Medical Epidemiology & Biostatistics (MEB), Karolinska Institutet, Stockholm, Sweden
- Johannes Matthias Löhr
  - Center for Digestive Diseases, Karolinska University Hospital, Stockholm, Sweden; Department of Clinical Sciences, Intervention and Technology (CLINTEC), Karolinska Institutet, Stockholm, Sweden
5
Abstract
Big omics data are challenging translational bioinformatics in an unprecedented way because of their complexity and volume. How to employ big omics data to achieve a reproducible disease diagnosis that rivals clinical diagnosis, using a systems approach, is an urgent problem in translational bioinformatics and machine learning. In this study, the authors propose a novel transcriptome marker diagnosis to tackle this problem using big RNA-seq data, systematically viewing the whole transcriptome as a profile marker. The systems diagnosis not only avoids the reproducibility issue of existing gene-/network-marker-based diagnostic methods, but also achieves diagnostic results rivalling clinical diagnosis by extracting true signals from big RNA-seq data. By using systems information, their method attains exceptional diagnostic performance compared with competing methods, demonstrates a better fit for personalised diagnostics, and is a good candidate for clinical usage. To the best of their knowledge, it is the first study on this topic and will inspire more investigations in big omics data diagnostics.
Affiliation(s)
- Henry Han
  - Division of Computer Science, Mathematics and Science, St. John's University, Queens, NY 11349, USA
- Ying Liu
  - Division of Computer Science, Mathematics and Science, St. John's University, Queens, NY 11349, USA
6
Lambertson KF, Damiani SA, Might M, Shelton R, Terry SF. Participant-driven matchmaking in the genomic era. Hum Mutat 2015; 36:965-73. [PMID: 26252162 DOI: 10.1002/humu.22852] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.2] [Received: 05/04/2015] [Accepted: 07/15/2015] [Indexed: 01/16/2023]
Abstract
Whole-genome and whole-exome sequencing are increasingly useful diagnostic tools for novel monogenic conditions. In order to confirm diagnoses made using these technologies, genomic matchmaking (the matching of cases with similar phenotypic and/or genotypic profiles, to narrow the number of candidate genes or ascertain a condition's etiology with greater certainty) is essential. Yet, due to current limitations on the size of matchmaking networks and the data sets available to support them, identifying a match can be difficult. We argue that matchmaking efforts led by affected individuals and their families (participant-led efforts) offer a twofold solution to this need, in that participants have the capacity both to access larger networks and to provide more detailed sets of phenotypic and genotypic data. These features of participant-led efforts have the potential to increase the value of matchmaking networks, both in terms of number of matches and in terms of the overall energy of the network. We provide two examples of participant-led matchmaking, and propose a model for scaling these efforts.
Affiliation(s)
- Stephen A Damiani
  - Mission Massimo Foundation, Inc., Elsternwick, Victoria, Australia; Mission Massimo Foundation, Inc., Westlake Village, California
- Matthew Might
  - NGLY1.org, Salt Lake City, Utah; University of Utah, Salt Lake City, Utah, United States
- Sharon F Terry
  - Genetic Alliance, Washington, District of Columbia; PXE International, Inc, Washington, District of Columbia
7
Sayson B, Popurs MAM, Lafek M, Berkow R, Stockler-Ipsiroglu S, van Karnebeek CDM. Retrospective analysis supports algorithm as efficient diagnostic approach to treatable intellectual developmental disabilities. Mol Genet Metab 2015; 115:1-9. [PMID: 25801009 DOI: 10.1016/j.ymgme.2015.03.001] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 01/19/2015] [Revised: 03/02/2015] [Accepted: 03/02/2015] [Indexed: 10/23/2022]
Abstract
BACKGROUND Intellectual developmental disorders (IDD), characterized by a significant impairment in cognitive function and behavior, affect 2.5% of the population and are associated with considerable morbidity and healthcare costs. Inborn errors of metabolism (IEM) currently constitute the largest group of genetic defects presenting with IDD that are amenable to causal therapy. Recently, we created an evidence-based 2-tiered diagnostic protocol (TIDE protocol); the first tier is a 'screening step' applied in all patients, comprising routinely performed, widely available metabolic tests in blood and urine, while second-tier tests are more specific and based on the patient's phenotype. The protocol is supported by an app (www.treatable-ID.org).
OBJECTIVE To retrospectively examine the cost- and time-effectiveness of the TIDE protocol in patients identified with a treatable IEM at the British Columbia Children's Hospital.
METHODS We searched the database for all IDD patients diagnosed with a treatable IEM during the period 2000-2009 in our academic institution. Data regarding the patient's clinical phenotype, IEM, diagnostic tests and interval were collected. Total costs and time intervals associated with all testing and physician consultations actually performed were calculated and compared to the model of the TIDE protocol.
RESULTS Thirty-one patients (16 males) were diagnosed with treatable IDD during the period 2000-2009. For those identifiable via the first tier (n=20), the average cost savings would have been $311.17 CAD, and for those diagnosed via a second-tier test (n=11), $340.14 CAD. Significant diagnostic delay (mean 9 months; range 1-29 months) could have been avoided in 9 patients with first-tier diagnoses, had the TIDE protocol been used. For those with second-tier treatable IDD, diagnoses could have been achieved more rapidly with the use of the Treatable IDD app, which allows for specific searches based on signs and symptoms.
CONCLUSION The TIDE protocol for treatable forms of IDD appears effective in reducing diagnostic delay and unnecessary costs. Larger prospective studies, currently underway, are needed to prove that standard screening for treatable conditions in patients with IDD is time- and cost-effective and, most importantly, will preserve brain function through timely diagnosis enabling initiation of causal therapy.
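As a small worked example of the arithmetic behind the reported savings, the sketch below pools the two tier-specific averages into a single per-patient figure. This pooled value is not reported in the abstract; it is derived here on the assumption that $311.17 and $340.14 are per-patient means for the 20 first-tier and 11 second-tier patients. The name `pooled_mean` is hypothetical.

```python
def pooled_mean(groups):
    """Weighted mean of group means.

    groups: list of (n_patients, mean_saving_cad) pairs.
    """
    total_n = sum(n for n, _ in groups)
    if total_n == 0:
        raise ValueError("no patients in any group")
    return sum(n * mean for n, mean in groups) / total_n


# Figures taken from the RESULTS section; their interpretation as
# per-patient means is an assumption.
tiers = [(20, 311.17), (11, 340.14)]

# The two groups together should account for all 31 patients.
assert sum(n for n, _ in tiers) == 31

print(f"Pooled average saving: {pooled_mean(tiers):.2f} CAD per patient")
```

Under that assumption the pooled figure sits between the two tier averages, closer to the first-tier value because that group is larger.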
Affiliation(s)
- Bryan Sayson
  - Division of Pediatric Neurology, BC Children's Hospital, Vancouver, Canada; Department of Pediatrics, BC Children's Hospital, Vancouver, Canada; Treatable Intellectual Disability Endeavour in British Columbia, Vancouver, Canada; University of British Columbia, Vancouver, Canada
- Marioara Angela Moisa Popurs
  - Department of Pediatrics, BC Children's Hospital, Vancouver, Canada; Treatable Intellectual Disability Endeavour in British Columbia, Vancouver, Canada; Division of Biochemical Diseases, BC Children's Hospital, Vancouver, Canada
- Mirafe Lafek
  - Department of Pediatrics, BC Children's Hospital, Vancouver, Canada; Treatable Intellectual Disability Endeavour in British Columbia, Vancouver, Canada; Division of Biochemical Diseases, BC Children's Hospital, Vancouver, Canada
- Ruth Berkow
  - Treatable Intellectual Disability Endeavour in British Columbia, Vancouver, Canada
- Sylvia Stockler-Ipsiroglu
  - Department of Pediatrics, BC Children's Hospital, Vancouver, Canada; Treatable Intellectual Disability Endeavour in British Columbia, Vancouver, Canada; Division of Biochemical Diseases, BC Children's Hospital, Vancouver, Canada; Child and Family Research Institute, Vancouver, Canada; University of British Columbia, Vancouver, Canada
- Clara D M van Karnebeek
  - Department of Pediatrics, BC Children's Hospital, Vancouver, Canada; Treatable Intellectual Disability Endeavour in British Columbia, Vancouver, Canada; Division of Biochemical Diseases, BC Children's Hospital, Vancouver, Canada; Child and Family Research Institute, Vancouver, Canada; University of British Columbia, Vancouver, Canada; Centre for Molecular Medicine and Therapeutics, Vancouver, Canada
8
de Koning TJ, Jongbloed JDH, Sikkema-Raddatz B, Sinke RJ. Targeted next-generation sequencing panels for monogenetic disorders in clinical diagnostics: the opportunities and challenges. Expert Rev Mol Diagn 2014; 15:61-70. [PMID: 25367078 DOI: 10.1586/14737159.2015.976555] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.5] [Indexed: 12/21/2022]
Abstract
Next-generation sequencing (NGS) will soon be used for clinically heterogeneous, inherited disorders, given the increasing number of disease-causing genes reported. Diagnostic laboratories therefore need to decide which NGS methods they are going to invest in and how to implement them. We discuss here the challenges and opportunities of using targeted resequencing (TRS) panels for diagnosing monogenetic disorders. Of the different NGS approaches available, TRS panels offer the opportunity to sequence and analyze a limited set of predetermined target genes. At present, TRS panels offer better base-pair coverage, running times, costs and dataset handling than other NGS applications such as whole genome sequencing and whole exome sequencing. However, working with TRS panels also poses new challenges in variant interpretation, data handling and bioinformatic analyses. To optimize the analyses, TRS panel testing should be performed by bioinformaticians, clinicians and laboratory staff in close collaboration.
Affiliation(s)
- Tom J de Koning
  - University of Groningen, University Medical Center Groningen, Department of Genetics, CB 50, PO Box 30.001, 9700 RB Groningen, The Netherlands