151
Parker HS, Corrada Bravo H, Leek JT. Removing batch effects for prediction problems with frozen surrogate variable analysis. PeerJ 2014; 2:e561. [PMID: 25332844 PMCID: PMC4179553 DOI: 10.7717/peerj.561] [Citation(s) in RCA: 48] [Impact Index Per Article: 4.4] [Received: 06/20/2014] [Accepted: 08/15/2014] [Indexed: 01/06/2023]
Abstract
Batch effects are responsible for the failure of promising genomic prognostic signatures, major ambiguities in published genomic results, and retractions of widely-publicized findings. Batch effect corrections have been developed to remove these artifacts, but they are designed to be used in population studies. However, genomic technologies are beginning to be used in clinical applications where samples are analyzed one at a time for diagnostic, prognostic, and predictive applications. There are currently no batch correction methods that have been developed specifically for prediction. In this paper, we propose a new method called frozen surrogate variable analysis (fSVA) that borrows strength from a training set for individual sample batch correction. We show that fSVA improves prediction accuracy in simulations and in public genomic studies. fSVA is available as part of the sva Bioconductor package.
Affiliation(s)
- Hilary S. Parker
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
- Héctor Corrada Bravo
- Center for Bioinformatics and Computational Biology, Department of Computer Science, University of Maryland, College Park, MD, USA
- Jeffrey T. Leek
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
152
Richard AC, Lyons PA, Peters JE, Biasci D, Flint SM, Lee JC, McKinney EF, Siegel RM, Smith KGC. Comparison of gene expression microarray data with count-based RNA measurements informs microarray interpretation. BMC Genomics 2014; 15:649. [PMID: 25091430 PMCID: PMC4143561 DOI: 10.1186/1471-2164-15-649] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.5] [Received: 04/04/2014] [Accepted: 07/17/2014] [Indexed: 01/02/2023]
Abstract
BACKGROUND Although numerous investigations have compared gene expression microarray platforms, preprocessing methods and batch correction algorithms using constructed spike-in or dilution datasets, there remains a paucity of studies examining the properties of microarray data using diverse biological samples. Most microarray experiments seek to identify subtle differences between samples with variable background noise, a scenario poorly represented by constructed datasets. Thus, microarray users lack important information regarding the complexities introduced in real-world experimental settings. The recent development of a multiplexed, digital technology for nucleic acid measurement enables counting of individual RNA molecules without amplification and, for the first time, permits such a study. RESULTS Using a set of human leukocyte subset RNA samples, we compared previously acquired microarray expression values with RNA molecule counts determined by the nCounter Analysis System (NanoString Technologies) in selected genes. We found that gene measurements across samples correlated well between the two platforms, particularly for high-variance genes, while genes deemed unexpressed by the nCounter generally had both low expression and low variance on the microarray. Confirming previous findings from spike-in and dilution datasets, this "gold-standard" comparison demonstrated signal compression that varied dramatically by expression level and, to a lesser extent, by dataset. Most importantly, examination of three different cell types revealed that noise levels differed across tissues. CONCLUSIONS Microarray measurements generally correlate with relative RNA molecule counts within optimal ranges but suffer from expression-dependent accuracy bias and precision that varies across datasets. We urge microarray users to consider expression-level effects in signal interpretation and to evaluate noise properties in each dataset independently.
Affiliation(s)
- Arianne C Richard
- Cambridge Institute for Medical Research and Department of Medicine, University of Cambridge, Cambridge, UK
- Immunoregulation Section, Autoimmunity Branch, National Institute of Arthritis and Musculoskeletal and Skin Diseases, Bethesda, MD, USA
- Paul A Lyons
- Cambridge Institute for Medical Research and Department of Medicine, University of Cambridge, Cambridge, UK
- James E Peters
- Cambridge Institute for Medical Research and Department of Medicine, University of Cambridge, Cambridge, UK
- Daniele Biasci
- Cambridge Institute for Medical Research and Department of Medicine, University of Cambridge, Cambridge, UK
- Shaun M Flint
- Cambridge Institute for Medical Research and Department of Medicine, University of Cambridge, Cambridge, UK
- James C Lee
- Cambridge Institute for Medical Research and Department of Medicine, University of Cambridge, Cambridge, UK
- Eoin F McKinney
- Cambridge Institute for Medical Research and Department of Medicine, University of Cambridge, Cambridge, UK
- Richard M Siegel
- Immunoregulation Section, Autoimmunity Branch, National Institute of Arthritis and Musculoskeletal and Skin Diseases, Bethesda, MD, USA
- Kenneth GC Smith
- Cambridge Institute for Medical Research and Department of Medicine, University of Cambridge, Cambridge, UK
153
Lee JA, Dobbin KK, Ahn J. Covariance adjustment for batch effect in gene expression data. Stat Med 2014; 33:2681-95. [PMID: 24687561 PMCID: PMC4065794 DOI: 10.1002/sim.6157] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.0] [Received: 07/13/2013] [Revised: 11/26/2013] [Accepted: 02/28/2014] [Indexed: 02/01/2023]
Abstract
Batch bias has been found in many microarray gene expression studies that involve multiple batches of samples. A serious batch effect can alter not only the distribution of individual genes but also the inter-gene relationships. Even though some efforts have been made to remove such bias, there has been relatively less development on a multivariate approach, mainly because of the analytical difficulty due to the high-dimensional nature of gene expression data. We propose a multivariate batch adjustment method that effectively eliminates inter-gene batch effects. The proposed method utilizes high-dimensional sparse covariance estimation based on a factor model and a hard thresholding. Another important aspect of the proposed method is that if it is known that one of the batches is produced in a superior condition, the other batches can be adjusted so that they resemble the target batch. We study high-dimensional asymptotic properties of the proposed estimator and compare the performance of the proposed method with some popular existing methods with simulated data and gene expression data sets.
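The hard-thresholding step mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' estimator: the factor-model step is omitted, the threshold `tau` and the toy data are hypothetical, and plain sample covariance is assumed as the starting point. It only shows how small off-diagonal covariances are set to zero to obtain a sparse estimate.

```python
from statistics import fmean


def sample_covariance(rows):
    """Sample covariance of samples-by-genes data given as a list of rows."""
    n = len(rows)
    p = len(rows[0])
    means = [fmean(row[j] for row in rows) for j in range(p)]
    cov = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i, p):
            s = sum((row[i] - means[i]) * (row[j] - means[j]) for row in rows)
            cov[i][j] = cov[j][i] = s / (n - 1)
    return cov


def hard_threshold(cov, tau):
    """Keep the diagonal; zero off-diagonal entries with absolute value below tau."""
    p = len(cov)
    return [
        [cov[i][j] if i == j or abs(cov[i][j]) >= tau else 0.0 for j in range(p)]
        for i in range(p)
    ]


# Toy data (hypothetical): 4 samples, 3 genes. Gene 1 tracks gene 0 closely,
# while gene 2 is only weakly related to gene 0.
rows = [
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
    [3.0, 6.0, 0.0],
    [4.0, 8.0, 1.0],
]
cov = sample_covariance(rows)
sparse = hard_threshold(cov, tau=0.5)  # tau is an illustrative choice
```

In practice the threshold would be chosen from the data (for example by cross-validation) rather than fixed by hand.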
Affiliation(s)
- Jung Ae Lee
- Division of Public Health Sciences, Washington University in St. Louis, St. Louis, MO 63110, U.S.A
154
Larsen MJ, Thomassen M, Tan Q, Sørensen KP, Kruse TA. Microarray-based RNA profiling of breast cancer: batch effect removal improves cross-platform consistency. BioMed Research International 2014; 2014:651751. [PMID: 25101291 PMCID: PMC4101981 DOI: 10.1155/2014/651751] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Received: 01/22/2014] [Revised: 04/17/2014] [Accepted: 06/09/2014] [Indexed: 12/13/2022]
Abstract
Microarray is a powerful technique used extensively for gene expression analysis. Different technologies are available, but lack of standardization makes it challenging to compare and integrate data. Furthermore, batch-related biases within datasets are common but often not tackled. We have analyzed the same 234 breast cancers on two different microarray platforms. One dataset contained known batch-effects associated with the fabrication procedure used. The aim was to assess the significance of correcting for systematic batch-effects when integrating data from different platforms. We here demonstrate the importance of detecting batch-effects and how tools, such as ComBat, can be used to successfully overcome such systematic variations in order to unmask essential biological signals. Batch adjustment was found to be particularly valuable in the detection of more delicate differences in gene expression. Furthermore, our results show that proper adjustment is essential for integration of gene expression data obtained from multiple sources. We show that high-variance genes are highly reproducibly expressed across platforms making them particularly well suited as biomarkers and for building gene signatures, exemplified by prediction of estrogen-receptor status and molecular subtypes. In conclusion, the study emphasizes the importance of utilizing proper batch adjustment methods when integrating data across different batches and platforms.
Affiliation(s)
- Martin J. Larsen
- Department of Clinical Genetics, Odense University Hospital, Sdr. Boulevard 29, 5000 Odense C, Denmark
- Human Genetics, Institute of Clinical Research, University of Southern Denmark, Winsløwvej 19, 5000 Odense C, Denmark
- Mads Thomassen
- Department of Clinical Genetics, Odense University Hospital, Sdr. Boulevard 29, 5000 Odense C, Denmark
- Human Genetics, Institute of Clinical Research, University of Southern Denmark, Winsløwvej 19, 5000 Odense C, Denmark
- Qihua Tan
- Human Genetics, Institute of Clinical Research, University of Southern Denmark, Winsløwvej 19, 5000 Odense C, Denmark
- Epidemiology, Biostatistics and Biodemography, Institute of Public Health, University of Southern Denmark, J.B. Winsløws Vej 9B, 5000 Odense C, Denmark
- Kristina P. Sørensen
- Department of Clinical Genetics, Odense University Hospital, Sdr. Boulevard 29, 5000 Odense C, Denmark
- Human Genetics, Institute of Clinical Research, University of Southern Denmark, Winsløwvej 19, 5000 Odense C, Denmark
- Torben A. Kruse
- Department of Clinical Genetics, Odense University Hospital, Sdr. Boulevard 29, 5000 Odense C, Denmark
- Human Genetics, Institute of Clinical Research, University of Southern Denmark, Winsløwvej 19, 5000 Odense C, Denmark
155
Soneson C, Gerster S, Delorenzi M. Batch effect confounding leads to strong bias in performance estimates obtained by cross-validation. PLoS One 2014; 9:e100335. [PMID: 24967636 PMCID: PMC4072626 DOI: 10.1371/journal.pone.0100335] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.9] [Received: 01/22/2014] [Accepted: 05/26/2014] [Indexed: 01/05/2023]
Abstract
Background: With the large amount of biological data that is currently publicly available, many investigators combine multiple data sets to increase the sample size and potentially also the power of their analyses. However, technical differences (“batch effects”) as well as differences in sample composition between the data sets may significantly affect the ability to draw generalizable conclusions from such studies. Focus: The current study focuses on the construction of classifiers, and the use of cross-validation to estimate their performance. In particular, we investigate the impact of batch effects and differences in sample composition between batches on the accuracy of the classification performance estimate obtained via cross-validation. The focus on estimation bias is a main difference compared to previous studies, which have mostly focused on the predictive performance and how it relates to the presence of batch effects. Data: We work on simulated data sets. To have realistic intensity distributions, we use real gene expression data as the basis for our simulation. Random samples from this expression matrix are selected and assigned to group 1 (e.g., ‘control’) or group 2 (e.g., ‘treated’). We introduce batch effects and select some features to be differentially expressed between the two groups. We consider several scenarios for our study, most importantly different levels of confounding between groups and batch effects. Methods: We focus on well-known classifiers: logistic regression, Support Vector Machines (SVM), k-nearest neighbors (kNN) and Random Forests (RF). Feature selection is performed with the Wilcoxon test or the lasso. Parameter tuning and feature selection, as well as the estimation of the prediction performance of each classifier, are performed within a nested cross-validation scheme. The estimated classification performance is then compared to what is obtained when applying the classifier to independent data.
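The nested cross-validation scheme described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the study's code: a nearest-centroid classifier stands in for the classifiers named above, ranking features by the absolute difference in class means stands in for the Wilcoxon test or the lasso, and the simulated data, fold counts, and candidate feature counts are all hypothetical. The point it shows is that feature selection and tuning use only the training part of each fold.

```python
import random
from statistics import fmean


def folds(n, k):
    """Split indices 0..n-1 into k interleaved folds."""
    return [list(range(i, n, k)) for i in range(k)]


def select_features(X, y, m):
    """Rank features by absolute difference in class means (illustrative stand-in).

    Assumes both classes (0 and 1) are present in y.
    """
    a = [x for x, lab in zip(X, y) if lab == 0]
    b = [x for x, lab in zip(X, y) if lab == 1]

    def score(f):
        return abs(fmean(r[f] for r in a) - fmean(r[f] for r in b))

    return sorted(range(len(X[0])), key=score, reverse=True)[:m]


def fit_predict(Xtr, ytr, Xte, m):
    """Select m features on the training data only, then classify by nearest centroid."""
    feats = select_features(Xtr, ytr, m)
    cents = {}
    for label in set(ytr):
        rows = [x for x, lab in zip(Xtr, ytr) if lab == label]
        cents[label] = [fmean(r[f] for r in rows) for f in feats]
    preds = []
    for x in Xte:
        def dist(lab):
            return sum((x[f] - c) ** 2 for f, c in zip(feats, cents[lab]))
        preds.append(min(cents, key=dist))
    return preds


def accuracy(pred, truth):
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


def cv_accuracy(X, y, m, k):
    """Plain k-fold cross-validated accuracy for a fixed number of features m."""
    accs = []
    for test in folds(len(X), k):
        held = set(test)
        train = [i for i in range(len(X)) if i not in held]
        pred = fit_predict([X[i] for i in train], [y[i] for i in train],
                           [X[i] for i in test], m)
        accs.append(accuracy(pred, [y[i] for i in test]))
    return fmean(accs)


def nested_cv(X, y, candidates, outer_k=4, inner_k=3):
    """Outer loop estimates performance; inner loop tunes m on outer-training data only."""
    outer = []
    for test in folds(len(X), outer_k):
        held = set(test)
        train = [i for i in range(len(X)) if i not in held]
        Xtr, ytr = [X[i] for i in train], [y[i] for i in train]
        best_m = max(candidates, key=lambda m: cv_accuracy(Xtr, ytr, m, inner_k))
        pred = fit_predict(Xtr, ytr, [X[i] for i in test], best_m)
        outer.append(accuracy(pred, [y[i] for i in test]))
    return fmean(outer)


# Hypothetical simulated data: 40 samples, 6 features, only feature 0 informative.
rng = random.Random(0)
y = [0] * 20 + [1] * 20
X = [[rng.gauss(6.0 * lab, 1.0)] + [rng.gauss(0.0, 1.0) for _ in range(5)] for lab in y]
estimate = nested_cv(X, y, candidates=(1, 2, 4))
```

The study's central comparison, between such a cross-validated estimate and performance on independent data from a different batch, would require a second simulated data set with its own batch effect; that step is not shown here.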
Affiliation(s)
- Sarah Gerster
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Mauro Delorenzi
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland; Ludwig Center for Cancer Research, University of Lausanne, Lausanne, Switzerland; Oncology Department, University of Lausanne, Lausanne, Switzerland
156
Parker HS, Leek JT, Favorov AV, Considine M, Xia X, Chavan S, Chung CH, Fertig EJ. Preserving biological heterogeneity with a permuted surrogate variable analysis for genomics batch correction. Bioinformatics 2014; 30:2757-63. [PMID: 24907368 DOI: 10.1093/bioinformatics/btu375] [Citation(s) in RCA: 97] [Impact Index Per Article: 8.8] [Indexed: 12/22/2022]
Abstract
MOTIVATION Sample source, procurement process and other technical variations introduce batch effects into genomics data. Algorithms to remove these artifacts enhance differences between known biological covariates, but also carry potential concern of removing intragroup biological heterogeneity and thus any personalized genomic signatures. As a result, accurate identification of novel subtypes from batch-corrected genomics data is challenging using standard algorithms designed to remove batch effects for class comparison analyses. Nor can batch effects be corrected reliably in future applications of genomics-based clinical tests, in which the biological groups are by definition unknown a priori. RESULTS Therefore, we assess the extent to which various batch correction algorithms remove true biological heterogeneity. We also introduce an algorithm, permuted-SVA (pSVA), using a new statistical model that is blind to biological covariates to correct for technical artifacts while retaining biological heterogeneity in genomic data. This algorithm facilitated accurate subtype identification in head and neck cancer from gene expression data in both formalin-fixed and frozen samples. When applied to predict Human Papillomavirus (HPV) status, pSVA improved cross-study validation even if the sample batches were highly confounded with HPV status in the training set. AVAILABILITY AND IMPLEMENTATION All analyses were performed using R version 2.15.0. The code and data used to generate the results of this manuscript are available from https://sourceforge.net/projects/psva.
Affiliation(s)
- Hilary S Parker
- Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD 21205, USA, Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow 119333, Russia, Research Institute for Genetics and Selection of Industrial Microorganisms "GosNIIGenetika", Moscow 117545, Russia, Department of Statistics and Biostatistics, Rutgers University, NJ 08854, USA and Division of Allergy & Clinical Immunology, Department of Medicine, Johns Hopkins University, Baltimore, MD 21224, USA
- Jeffrey T Leek
- Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD 21205, USA, Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow 119333, Russia, Research Institute for Genetics and Selection of Industrial Microorganisms "GosNIIGenetika", Moscow 117545, Russia, Department of Statistics and Biostatistics, Rutgers University, NJ 08854, USA and Division of Allergy & Clinical Immunology, Department of Medicine, Johns Hopkins University, Baltimore, MD 21224, USA
- Alexander V Favorov
- Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD 21205, USA, Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow 119333, Russia, Research Institute for Genetics and Selection of Industrial Microorganisms "GosNIIGenetika", Moscow 117545, Russia, Department of Statistics and Biostatistics, Rutgers University, NJ 08854, USA and Division of Allergy & Clinical Immunology, Department of Medicine, Johns Hopkins University, Baltimore, MD 21224, USA
- Michael Considine
- Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD 21205, USA, Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow 119333, Russia, Research Institute for Genetics and Selection of Industrial Microorganisms "GosNIIGenetika", Moscow 117545, Russia, Department of Statistics and Biostatistics, Rutgers University, NJ 08854, USA and Division of Allergy & Clinical Immunology, Department of Medicine, Johns Hopkins University, Baltimore, MD 21224, USA
- Xiaoxin Xia
- Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD 21205, USA, Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow 119333, Russia, Research Institute for Genetics and Selection of Industrial Microorganisms "GosNIIGenetika", Moscow 117545, Russia, Department of Statistics and Biostatistics, Rutgers University, NJ 08854, USA and Division of Allergy & Clinical Immunology, Department of Medicine, Johns Hopkins University, Baltimore, MD 21224, USA
- Sameer Chavan
- Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD 21205, USA, Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow 119333, Russia, Research Institute for Genetics and Selection of Industrial Microorganisms "GosNIIGenetika", Moscow 117545, Russia, Department of Statistics and Biostatistics, Rutgers University, NJ 08854, USA and Division of Allergy & Clinical Immunology, Department of Medicine, Johns Hopkins University, Baltimore, MD 21224, USA
- Christine H Chung
- Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD 21205, USA, Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow 119333, Russia, Research Institute for Genetics and Selection of Industrial Microorganisms "GosNIIGenetika", Moscow 117545, Russia, Department of Statistics and Biostatistics, Rutgers University, NJ 08854, USA and Division of Allergy & Clinical Immunology, Department of Medicine, Johns Hopkins University, Baltimore, MD 21224, USA
- Elana J Fertig
- Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD 21205, USA, Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow 119333, Russia, Research Institute for Genetics and Selection of Industrial Microorganisms "GosNIIGenetika", Moscow 117545, Russia, Department of Statistics and Biostatistics, Rutgers University, NJ 08854, USA and Division of Allergy & Clinical Immunology, Department of Medicine, Johns Hopkins University, Baltimore, MD 21224, USA
157
Sordillo J, Raby BA. Gene expression profiling in asthma. Advances in Experimental Medicine and Biology 2014; 795:157-81. [PMID: 24162908 DOI: 10.1007/978-1-4614-8603-9_10] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Indexed: 02/01/2023]
Affiliation(s)
- Joanne Sordillo
- Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, 181 Longwood Avenue, Boston, MA, 02115, USA
158
Cangelosi D, Muselli M, Parodi S, Blengio F, Becherini P, Versteeg R, Conte M, Varesio L. Use of Attribute Driven Incremental Discretization and Logic Learning Machine to build a prognostic classifier for neuroblastoma patients. BMC Bioinformatics 2014; 15 Suppl 5:S4. [PMID: 25078098 PMCID: PMC4095004 DOI: 10.1186/1471-2105-15-s5-s4] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Indexed: 02/07/2023]
Abstract
BACKGROUND A cancer patient's outcome is written, in part, in the gene expression profile of the tumor. We previously identified a 62-probe-set signature (NB-hypo) to identify tissue hypoxia in neuroblastoma tumors and showed that NB-hypo stratified neuroblastoma patients into good and poor outcome groups [1]. It was important to develop a prognostic classifier to cluster patients into risk groups benefiting from defined therapeutic approaches. Novel classification and data discretization approaches can be instrumental for the generation of accurate predictors and robust tools for clinical decision support. We explored the application to gene expression data of Rulex, a novel software suite including the Attribute Driven Incremental Discretization technique for transforming continuous variables into simplified discrete ones and the Logic Learning Machine (LLM) model for intelligible rule generation. RESULTS We applied Rulex components to the problem of predicting the outcome of neuroblastoma patients on the basis of the 62-probe-set NB-hypo gene expression signature. The resulting classifier consisted of 9 rules utilizing mainly two conditions of the relative expression of 11 probe sets. These rules were very effective predictors, as shown in an independent validation set, demonstrating the validity of the LLM algorithm applied to microarray data and patients' classification. The LLM performed as efficiently as Prediction Analysis of Microarray and Support Vector Machine, and outperformed other learning algorithms such as C4.5. Rulex carried out a feature selection by selecting a new signature (NB-hypo-II) of 11 probe sets that turned out to be the most relevant in predicting outcome among the 62 of the NB-hypo signature. Rules are easily interpretable as they involve only a few conditions.
CONCLUSIONS Our findings provided evidence that the application of Rulex to the expression values of the NB-hypo signature created a set of accurate, high quality, consistent and interpretable rules for the prediction of neuroblastoma patients' outcome. We identified the Rulex weighted classification as a flexible tool that can support clinical decisions. For these reasons, we consider Rulex to be a useful tool for cancer classification from microarray gene expression data.
159
Kothari S, Phan JH, Stokes TH, Osunkoya AO, Young AN, Wang MD. Removing batch effects from histopathological images for enhanced cancer diagnosis. IEEE J Biomed Health Inform 2014; 18:765-72. [PMID: 24808220 PMCID: PMC5003052 DOI: 10.1109/jbhi.2013.2276766] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.5] [Indexed: 11/06/2022]
Abstract
Researchers have developed computer-aided decision support systems for translational medicine that aim to objectively and efficiently diagnose cancer using histopathological images. However, the performance of such systems is confounded by nonbiological experimental variations or "batch effects" that can commonly occur in histopathological data, especially when images are acquired using different imaging devices and patient samples. This is even more problematic in large-scale studies in which cross-laboratory sharing of large volumes of data is necessary. Batch effects can change quantitative morphological image features and decrease the prediction performance. Using four batches of renal tumor images, we compare one image-level and five feature-level batch effect removal methods. Principal component variation analysis shows that batch is a large source of variance in image features. Results show that feature-level normalization methods reduce batch-contributed variance to almost zero. Moreover, feature-level normalization, especially ComBatN, improves cross-batch and combined-batch prediction performance. Compared to no normalization, ComBatN improves performance in 83% and 90% of cross-batch and combined-batch prediction models, respectively.
Affiliation(s)
- Sonal Kothari
- School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332 USA
- John H. Phan
- Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, GA 30332 USA and also with Emory University, Atlanta, GA 30332, USA
- Todd H. Stokes
- Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, GA 30332 USA and also with Emory University, Atlanta, GA 30332, USA
- Adeboye O. Osunkoya
- Department of Pathology, Emory University School of Medicine, Atlanta, GA 30322 USA
- Andrew N. Young
- Pathology and Laboratory Medicine, Emory University and Grady Health System, Atlanta, GA 30322 USA
- May D. Wang
- Department of Biomedical Engineering, School of Electrical and Computer Engineering, Winship Cancer Institute, Parker H. Petit Institute of Bioengineering and Biosciences, Institute of People and Technology, Georgia Institute of Technology Atlanta, GA 30322 USA and also with Emory University, Atlanta, GA 30332 USA
160
Lê Cao KA, Rohart F, McHugh L, Korn O, Wells CA. YuGene: A simple approach to scale gene expression data derived from different platforms for integrated analyses. Genomics 2014; 103:239-51. [DOI: 10.1016/j.ygeno.2014.03.001] [Citation(s) in RCA: 37] [Impact Index Per Article: 3.4] [Received: 01/09/2014] [Revised: 03/14/2014] [Accepted: 03/16/2014] [Indexed: 01/09/2023]
161
Chen JJ, Lin WJ, Chen HC. Pharmacogenomic biomarkers for personalized medicine. Pharmacogenomics 2014; 14:969-80. [PMID: 23746190 DOI: 10.2217/pgs.13.75] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Indexed: 11/21/2022]
Abstract
Pharmacogenomics examines how the benefits and adverse effects of a drug vary among patients in a target population by analyzing genomic profiles of individual patients. Personalized medicine prescribes specific therapeutics that best suit an individual patient. Much current research focuses on developing genomic biomarkers to identify, prior to treatment, which patients would benefit from a treatment, have an adverse response, or show no response at all, according to relevant differences in risk factors, disease types and/or responses to therapy. This review describes the use of the two personalized medicine biomarkers, prognostic and predictive, to classify patients into subgroups for treatment recommendation.
Affiliation(s)
- James J Chen
- Division of Bioinformatics & Biostatistics, National Center for Toxicological Research, US FDA, 3900 NCTR Road, HFT-20, Jefferson, AR 72079, USA
162
Ford NA, Devlin KL, Lashinger LM, Hursting SD. Deconvoluting the obesity and breast cancer link: secretome, soil and seed interactions. J Mammary Gland Biol Neoplasia 2013; 18:267-75. [PMID: 24091864 PMCID: PMC3874287 DOI: 10.1007/s10911-013-9301-9] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.2] [Received: 08/05/2013] [Accepted: 09/24/2013] [Indexed: 12/20/2022]
Abstract
Obesity is associated with increased risk of breast cancer in postmenopausal women and is linked with poor prognosis in pre- and postmenopausal breast cancer patients. The mechanisms underlying the obesity-breast cancer connection are becoming increasingly clear and provide multiple opportunities for primary to tertiary prevention. Several obesity-related host factors can influence breast tumor initiation, progression and/or response to therapy, and these have been implicated as key contributors to the complex effects of obesity on cancer incidence and outcomes. These host factors include components of the secretome, including insulin, insulin-like growth factor-1, leptin, adiponectin, steroid hormones, cytokines, vascular regulators, and inflammation-related molecules, as well as the cellular and structural components of the tumor microenvironment. These secreted and structural host factors are extrinsic to, and interact with, the intrinsic molecular characteristics of breast cancer cells (including breast cancer stem cells), and each will be considered in the context of energy balance and as potential targets for cancer prevention.
Affiliation(s)
- Nikki A. Ford
- Department of Nutritional Sciences, University of Texas at Austin, Austin, Texas 78722, USA
- Kaylyn L. Devlin
- Institute for Cellular and Molecular Biology, University of Texas at Austin, Austin, Texas 78722, USA
- Laura M. Lashinger
- Department of Nutritional Sciences, University of Texas at Austin, Austin, Texas 78722, USA
- Stephen D. Hursting
- Department of Nutritional Sciences, University of Texas at Austin, Austin, Texas 78722, USA
- Department of Molecular Carcinogenesis, University of Texas MD Anderson Cancer Center, Science Park, Smithville, TX 78957, USA
|
163
|
Slee RB, Grimes BR, Bansal R, Gore J, Blackburn C, Brown L, Gasaway R, Jeong J, Victorino J, March KL, Colombo R, Herbert BS, Korc M. Selective inhibition of pancreatic ductal adenocarcinoma cell growth by the mitotic MPS1 kinase inhibitor NMS-P715. Mol Cancer Ther 2013; 13:307-315. [PMID: 24282275 DOI: 10.1158/1535-7163.mct-13-0324] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.3] [Indexed: 11/16/2022]
Abstract
Most solid tumors, including pancreatic ductal adenocarcinoma (PDAC), exhibit structural and numerical chromosome instability (CIN). Although often implicated as a driver of tumor progression and drug resistance, CIN also reduces cell fitness and poses a vulnerability that can be exploited therapeutically. The spindle assembly checkpoint (SAC) ensures correct chromosome-microtubule attachment, thereby minimizing chromosome segregation errors. Many tumors exhibit upregulation of SAC components such as MPS1, which may help contain CIN within survivable limits. Prior studies showed that MPS1 inhibition with the small molecule NMS-P715 limits tumor growth in xenograft models. In cancer cell lines, NMS-P715 causes cell death associated with impaired SAC function and increased chromosome missegregation. Although normal cells appeared more resistant, effects on stem cells, which are the dose-limiting toxicity of most chemotherapeutics, were not examined. Elevated expression of 70 genes (CIN70), including MPS1, provides a surrogate measure of CIN and predicts poor patient survival in multiple tumor types. Our new findings show that the degree of CIN70 upregulation varies considerably among PDAC tumors, with higher CIN70 gene expression predictive of poor outcome. We identified a 25 gene subset (PDAC CIN25) whose overexpression was most strongly correlated with poor survival and included MPS1. In vitro, growth of human and murine PDAC cells is inhibited by NMS-P715 treatment, whereas adipose-derived human mesenchymal stem cells are relatively resistant and maintain chromosome stability upon exposure to NMS-P715. These studies suggest that NMS-P715 could have a favorable therapeutic index and warrant further investigation of MPS1 inhibition as a new PDAC treatment strategy.
Affiliation(s)
- Roger B Slee
- Indiana University School of Medicine (IUSM) Department of Medical and Molecular Genetics, Indiana. Nerviano Medical Sciences, Nerviano, Italy; IU Melvin and Bren Simon Cancer Center (IUSCC), Nerviano Medical Sciences, Nerviano, Italy
- Brenda R Grimes
- Indiana University School of Medicine (IUSM) Department of Medical and Molecular Genetics, Indiana. Nerviano Medical Sciences, Nerviano, Italy; IU Melvin and Bren Simon Cancer Center (IUSCC), Nerviano Medical Sciences, Nerviano, Italy; IUSCC Center for Pancreatic Cancer Research, Nerviano Medical Sciences, Nerviano, Italy
- Ruchi Bansal
- Indiana University School of Medicine (IUSM) Department of Medical and Molecular Genetics, Indiana. Nerviano Medical Sciences, Nerviano, Italy
- Jesse Gore
- IUSM Department of Medicine, Nerviano Medical Sciences, Nerviano, Italy
- Corinne Blackburn
- Indiana University School of Medicine (IUSM) Department of Medical and Molecular Genetics, Indiana. Nerviano Medical Sciences, Nerviano, Italy
- Lyndsey Brown
- Indiana University School of Medicine (IUSM) Department of Medical and Molecular Genetics, Indiana. Nerviano Medical Sciences, Nerviano, Italy
- Rachel Gasaway
- Indiana University School of Medicine (IUSM) Department of Medical and Molecular Genetics, Indiana. Nerviano Medical Sciences, Nerviano, Italy
- Jaesik Jeong
- IUSM Department of Biostatistics, Nerviano Medical Sciences, Nerviano, Italy
- Jose Victorino
- Indiana University School of Medicine (IUSM) Department of Medical and Molecular Genetics, Indiana. Nerviano Medical Sciences, Nerviano, Italy; California State University Dominguez Hills, Nerviano Medical Sciences, Nerviano, Italy
- Keith L March
- IUSM Department of Medicine, Nerviano Medical Sciences, Nerviano, Italy; IUSM Department of Biochemistry and Molecular Biology, Nerviano Medical Sciences, Nerviano, Italy; Krannert Institute of Cardiology, Nerviano Medical Sciences, Nerviano, Italy; Indiana Center for Vascular Biology, Nerviano Medical Sciences, Nerviano, Italy; R.L. Roudebush Veterans Affairs Medical Center, Nerviano Medical Sciences, Nerviano, Italy
- Riccardo Colombo
- Indianapolis, Indiana. Nerviano Medical Sciences, Nerviano, Italy
- Brittney-Shea Herbert
- Indiana University School of Medicine (IUSM) Department of Medical and Molecular Genetics, Indiana. Nerviano Medical Sciences, Nerviano, Italy; IU Melvin and Bren Simon Cancer Center (IUSCC), Nerviano Medical Sciences, Nerviano, Italy
- Murray Korc
- IU Melvin and Bren Simon Cancer Center (IUSCC), Nerviano Medical Sciences, Nerviano, Italy; IUSCC Center for Pancreatic Cancer Research, Nerviano Medical Sciences, Nerviano, Italy; IUSM Department of Medicine, Nerviano Medical Sciences, Nerviano, Italy; IUSM Department of Biochemistry and Molecular Biology, Nerviano Medical Sciences, Nerviano, Italy
|
164
|
Reese SE, Archer KJ, Therneau TM, Atkinson EJ, Vachon CM, de Andrade M, Kocher JPA, Eckel-Passow JE. A new statistic for identifying batch effects in high-throughput genomic data that uses guided principal component analysis. Bioinformatics 2013; 29:2877-83. [PMID: 23958724 PMCID: PMC3810845 DOI: 10.1093/bioinformatics/btt480] [Citation(s) in RCA: 92] [Impact Index Per Article: 7.7] [Received: 11/16/2012] [Revised: 07/03/2013] [Accepted: 08/14/2013] [Indexed: 12/31/2022] Open
Abstract
MOTIVATION: Batch effects are due to probe-specific systematic variation between groups of samples (batches) resulting from experimental features that are not of biological interest. Principal component analysis (PCA) is commonly used as a visual tool to determine whether batch effects exist after applying a global normalization method. However, PCA yields linear combinations of the variables that contribute maximum variance and thus will not necessarily detect batch effects if they are not the largest source of variability in the data. RESULTS: We present an extension of PCA to quantify the existence of batch effects, called guided PCA (gPCA). We describe a test statistic that uses gPCA to test whether a batch effect exists. We apply our proposed test statistic derived using gPCA to simulated data and to two copy number variation case studies: the first study consisted of 614 samples from a breast cancer family study using Illumina Human 660 bead-chip arrays, whereas the second case study consisted of 703 samples from a family blood pressure study that used Affymetrix SNP Array 6.0. We demonstrate that our statistic has good statistical properties and is able to identify significant batch effects in two copy number variation case studies. CONCLUSION: We developed a new statistic that uses gPCA to identify whether batch effects exist in high-throughput genomic data. Although our examples pertain to copy number data, gPCA is general and can be used on other data types as well. AVAILABILITY AND IMPLEMENTATION: The gPCA R package (available via CRAN) provides functionality and data to perform the methods in this article. CONTACT: reesese@vcu.edu
Affiliation(s)
- Sarah E Reese
- Department of Biostatistics, Biostatistics Shared Resource Core, VCU Massey Cancer Center, Virginia Commonwealth University, Richmond, VA 23284, USA, Division of Biomedical Statistics and Informatics and Division of Epidemiology, Department of Health Sciences Research, Mayo Clinic, Rochester, MN 55905, USA
|
165
|
Halloran PF, Pereira AB, Chang J, Matas A, Picton M, De Freitas D, Bromberg J, Serón D, Sellarés J, Einecke G, Reeve J. Microarray diagnosis of antibody-mediated rejection in kidney transplant biopsies: an international prospective study (INTERCOM). Am J Transplant 2013; 13:2865-74. [PMID: 24119109 DOI: 10.1111/ajt.12465] [Citation(s) in RCA: 134] [Impact Index Per Article: 11.2] [Received: 06/17/2013] [Revised: 07/22/2013] [Accepted: 08/01/2013] [Indexed: 01/25/2023]
Abstract
In a reference set of 403 kidney transplant biopsies, we recently developed a microarray-based test that diagnoses antibody-mediated rejection (ABMR) by assigning an ABMR score. To validate the ABMR score and assess its potential impact on practice, we performed the present prospective INTERCOM study (clinicaltrials.gov NCT01299168) in 300 new biopsies (264 patients) from six centers: Baltimore, Barcelona, Edmonton, Hannover, Manchester and Minneapolis. We assigned ABMR scores using the classifier created in the reference set and compared it to conventional assessment as documented in the pathology reports. INTERCOM documented uncertainty in conventional assessment: In 41% of biopsies where ABMR features were noted, the recorded diagnoses did not mention ABMR. The ABMR score correlated with ABMR histologic lesions and donor-specific antibodies, but not with T cell-mediated rejection lesions. The agreement between ABMR scores and conventional assessment was identical to that in the reference set (accuracy 85%). The ABMR score was more strongly associated with failure than conventional assessment, and when the ABMR score and conventional assessment disagreed, only the ABMR score was associated with early progression to failure. INTERCOM confirms the need to reduce uncertainty in the diagnosis of ABMR, and demonstrates the potential of the ABMR score to impact practice.
Affiliation(s)
- P F Halloran
- Alberta Transplant Applied Genomics Center, University of Alberta, Edmonton, AB, Canada; Department of Medicine, Division of Nephrology and Transplant Immunology, University of Alberta, Edmonton, AB, Canada
|
166
|
Wang X, Markowetz F, De Sousa E Melo F, Medema JP, Vermeulen L. Dissecting cancer heterogeneity--an unsupervised classification approach. Int J Biochem Cell Biol 2013; 45:2574-9. [PMID: 24004832 DOI: 10.1016/j.biocel.2013.08.014] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.2] [Received: 07/28/2013] [Revised: 08/20/2013] [Accepted: 08/22/2013] [Indexed: 02/04/2023]
Abstract
Gene-expression-based classification studies have changed the way cancer is traditionally perceived. It is becoming increasingly clear that many cancer types are in fact not single diseases but rather consist of multiple molecularly distinct subtypes. In this review, we discuss unsupervised classification studies of common malignancies from recent years. We found that the bioinformatic workflow of many of these studies follows a common mainstream approach, although different statistical tools may be preferred from case to case. Here we summarize the employed methods, with a special focus on consensus clustering and classification. For each critical step of the bioinformatic analysis, we explain the biological relevance and implications of the technical principles. We think that a better understanding by the biomedical community of these ever more frequently used methods to study cancer heterogeneity is relevant, as these types of studies will have an important impact on patient stratification and cancer subtype-specific drug development in the future.
Affiliation(s)
- Xin Wang
- Cancer Research UK Cambridge Institute, University of Cambridge, Cambridge, UK.
|
167
|
|
168
|
Rao SSS, Shepherd LA, Bruno AE, Liu S, Miecznikowski JC. Comparing Imputation Procedures for Affymetrix Gene Expression Datasets Using MAQC Datasets. Adv Bioinformatics 2013; 2013:790567. [PMID: 24223587 PMCID: PMC3809938 DOI: 10.1155/2013/790567] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 06/26/2013] [Accepted: 08/28/2013] [Indexed: 01/13/2023] Open
Abstract
Introduction. The microarray datasets from the MicroArray Quality Control (MAQC) project have enabled the assessment of the precision, comparability of microarrays, and other various microarray analysis methods. However, to date no studies that we are aware of have reported the performance of missing value imputation schemes on the MAQC datasets. In this study, we use the MAQC Affymetrix datasets to evaluate several imputation procedures in Affymetrix microarrays. Results. We evaluated several cutting edge imputation procedures and compared them using different error measures. We randomly deleted 5% and 10% of the data and imputed the missing values using imputation tests. We performed 1000 simulations and averaged the results. The results for both 5% and 10% deletion are similar. Among the imputation methods, we observe the local least squares method with k = 4 is most accurate under the error measures considered. The k-nearest neighbor method with k = 1 has the highest error rate among imputation methods and error measures. Conclusions. We conclude for imputing missing values in Affymetrix microarray datasets, using the MAS 5.0 preprocessing scheme, the local least squares method with k = 4 has the best overall performance and k-nearest neighbor method with k = 1 has the worst overall performance. These results hold true for both 5% and 10% missing values.
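The evaluation protocol summarized in this abstract (hide a fixed fraction of known values, impute them, score the error, repeat and average) can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic, the row-mean imputer is a deliberately simple stand-in rather than the local least squares or k-nearest neighbor methods the study compared, RMSE is one possible error measure, and all function names are hypothetical.

```python
import math
import random


def mask_entries(matrix, fraction, rng):
    """Return a copy with `fraction` of entries set to None, plus the hidden positions."""
    cells = [(i, j) for i, row in enumerate(matrix) for j in range(len(row))]
    hidden = rng.sample(cells, max(1, round(fraction * len(cells))))
    masked = [list(row) for row in matrix]
    for i, j in hidden:
        masked[i][j] = None
    return masked, hidden


def impute_row_mean(masked):
    """Stand-in imputer (an assumption): fill each gap with the mean of its row."""
    filled = []
    for row in masked:
        observed = [v for v in row if v is not None]
        mean = sum(observed) / len(observed) if observed else 0.0
        filled.append([mean if v is None else v for v in row])
    return filled


def rmse(truth, imputed, hidden):
    """Root mean squared error over the deliberately hidden cells only."""
    total = sum((truth[i][j] - imputed[i][j]) ** 2 for i, j in hidden)
    return math.sqrt(total / len(hidden))


rng = random.Random(0)
# Synthetic expression-like matrix: 50 probes x 20 arrays (made-up values).
data = [[rng.gauss(8.0, 1.0) for _ in range(20)] for _ in range(50)]

errors = []
for _ in range(100):  # the abstract reports 1000 simulations; fewer here for speed
    masked, hidden = mask_entries(data, 0.05, rng)
    errors.append(rmse(data, impute_row_mean(masked), hidden))

mean_error = sum(errors) / len(errors)
print(round(mean_error, 3))
```

Swapping `impute_row_mean` for another imputer while keeping the masking and scoring fixed is what makes the error figures comparable across methods.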
Affiliation(s)
- Lori A. Shepherd
- Department of Biostatistics, Roswell Park Cancer Institute, Buffalo, NY 14263, USA
- Andrew E. Bruno
- Center for Computational Research, University at Buffalo, NYS Center of Excellence in Bioinformatics and Life Sciences, Buffalo, NY 14203, USA
- Song Liu
- Department of Biostatistics, Roswell Park Cancer Institute, Buffalo, NY 14263, USA
- Jeffrey C. Miecznikowski
- Department of Biostatistics, Roswell Park Cancer Institute, Buffalo, NY 14263, USA
- Department of Biostatistics, SUNY University at Buffalo, Buffalo, NY 14214, USA
|
169
|
Genetic and nongenetic variation revealed for the principal components of human gene expression. Genetics 2013; 195:1117-28. [PMID: 24026092 DOI: 10.1534/genetics.113.153221] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.3] [Indexed: 11/18/2022] Open
Abstract
Principal components analysis has been employed in gene expression studies to correct for population substructure and batch and environmental effects. This method typically involves the removal of variation contained in as many as 50 principal components (PCs), which can constitute a large proportion of total variation present in the data. Each PC, however, can detect many sources of variation, including gene expression networks and genetic variation influencing transcript levels. We demonstrate that PCs generated from gene expression data can simultaneously contain both genetic and nongenetic factors. From heritability estimates we show that all PCs contain a considerable portion of genetic variation while nongenetic artifacts such as batch effects were associated to varying degrees with the first 60 PCs. These PCs demonstrate an enrichment of biological pathways, including core immune function and metabolic pathways. The use of PC correction in two independent data sets resulted in a reduction in the number of cis- and trans-expression QTL detected. Comparisons of PC and linear model correction revealed that PC correction was not as efficient at removing known batch effects and had a higher penalty on genetic variation. Therefore, this study highlights the danger of eliminating biologically relevant data when employing PC correction in gene expression data.
|
170
|
Halloran PF, Pereira AB, Chang J, Matas A, Picton M, De Freitas D, Bromberg J, Serón D, Sellarés J, Einecke G, Reeve J. Potential impact of microarray diagnosis of T cell-mediated rejection in kidney transplants: The INTERCOM study. Am J Transplant 2013; 13:2352-63. [PMID: 23915426 DOI: 10.1111/ajt.12387] [Citation(s) in RCA: 105] [Impact Index Per Article: 8.8] [Received: 04/02/2013] [Revised: 05/30/2013] [Accepted: 06/14/2013] [Indexed: 01/25/2023]
Abstract
We previously developed a microarray-based test for T cell-mediated rejection (TCMR) in a reference set of 403 biopsies. To determine the potential impact of this test in clinical practice, we undertook INTERCOM, a prospective international study of 300 indication biopsies from 264 patients (ClinicalTrials.gov NCT01299168). Biopsies from six centers (Baltimore, Barcelona, Edmonton, Hannover, Manchester and Minneapolis) were analyzed by microarrays, assigning TCMR scores by an algorithm developed in the reference set and comparing TCMR scores to local histology assessment. The TCMR score correlated with histologic TCMR lesions (tubulitis and interstitial infiltration). The accuracy for primary histologic diagnoses (0.87) was similar to the reference set (0.89). The TCMR scores reclassified 77/300 biopsies (26%): 16 histologic TCMR were molecularly non-TCMR; 15 histologic non-TCMR were molecularly TCMR, including 6 with polyoma virus nephropathy; and all 46 "borderline" biopsies were reclassified as TCMR (8) or non-TCMR (38). Like the reference set, discrepancies were primarily in situations where histology has known limitations, for example, in biopsies with scarring and inflammation/tubulitis potentially from other diseases. Neither the TCMR score nor histologic TCMR was associated with graft loss. Thus the molecular TCMR score has potential to add new insight, particularly in situations where histology is ambiguous or potentially misleading.
Affiliation(s)
- P F Halloran
- Alberta Transplant Applied Genomics Centre, University of Alberta, Edmonton, AB, Canada; Department of Medicine, Division of Nephrology and Transplant Immunology, University of Alberta, Edmonton, AB, Canada
|
171
|
Tarca AL, Lauria M, Unger M, Bilal E, Boue S, Kumar Dey K, Hoeng J, Koeppl H, Martin F, Meyer P, Nandy P, Norel R, Peitsch M, Rice JJ, Romero R, Stolovitzky G, Talikka M, Xiang Y, Zechner C. Strengths and limitations of microarray-based phenotype prediction: lessons learned from the IMPROVER Diagnostic Signature Challenge. Bioinformatics 2013; 29:2892-9. [PMID: 23966112 DOI: 10.1093/bioinformatics/btt492] [Citation(s) in RCA: 101] [Impact Index Per Article: 8.4] [Indexed: 11/14/2022]
Abstract
MOTIVATION: After more than a decade since microarrays were used to predict phenotype of biological samples, real-life applications for disease screening and identification of patients who would best benefit from treatment are still emerging. The interest of the scientific community in identifying best approaches to develop such prediction models was reaffirmed in a competition style international collaboration called IMPROVER Diagnostic Signature Challenge whose results we describe herein. RESULTS: Fifty-four teams used public data to develop prediction models in four disease areas including multiple sclerosis, lung cancer, psoriasis and chronic obstructive pulmonary disease, and made predictions on blinded new data that we generated. Teams were scored using three metrics that captured various aspects of the quality of predictions, and best performers were awarded. This article presents the challenge results and introduces to the community the approaches of the best overall three performers, as well as an R package that implements the approach of the best overall team. The analyses of model performance data submitted in the challenge as well as additional simulations that we have performed revealed that (i) the quality of predictions depends more on the disease endpoint than on the particular approaches used in the challenge; (ii) the most important modeling factor (e.g. data preprocessing, feature selection and classifier type) is problem dependent; and (iii) for optimal results datasets and methods have to be carefully matched. Biomedical factors such as the disease severity and confidence in diagnostic were found to be associated with the misclassification rates across the different teams. AVAILABILITY: The lung cancer dataset is available from Gene Expression Omnibus (accession, GSE43580). The maPredictDSC R package implementing the approach of the best overall team is available at www.bioconductor.org or http://bioinformaticsprb.med.wayne.edu/.
Affiliation(s)
- Adi L Tarca
- Department of Computer Science, Wayne State University, Perinatology Research Branch, NICHD/NIH, Detroit, MI 48201, USA, The Microsoft Research - University of Trento Centre for Computational and Systems Biology, Rovereto 38068, Italy, ETH Zurich, Zurich 8092, Switzerland, IBM Thomas J. Watson Research Center, Yorktown Heights, NY 10598, USA and Philip Morris International, Research & Development, Neuchâtel CH-2000, Switzerland
|
172
|
Kothari S, Phan JH, Stokes TH, Wang MD. Pathology imaging informatics for quantitative analysis of whole-slide images. J Am Med Inform Assoc 2013; 20:1099-108. [PMID: 23959844 PMCID: PMC3822114 DOI: 10.1136/amiajnl-2012-001540] [Citation(s) in RCA: 150] [Impact Index Per Article: 12.5] [Indexed: 02/06/2023] Open
Abstract
Objectives: With the objective of bringing clinical decision support systems to reality, this article reviews histopathological whole-slide imaging informatics methods, associated challenges, and future research opportunities. Target audience: This review targets pathologists and informaticians who have a limited understanding of the key aspects of whole-slide image (WSI) analysis and/or a limited knowledge of state-of-the-art technologies and analysis methods. Scope: First, we discuss the importance of imaging informatics in pathology and highlight the challenges posed by histopathological WSI. Next, we provide a thorough review of current methods for: quality control of histopathological images; feature extraction that captures image properties at the pixel, object, and semantic levels; predictive modeling that utilizes image features for diagnostic or prognostic applications; and data and information visualization that explores WSI for de novo discovery. In addition, we highlight future research directions and discuss the impact of large public repositories of histopathological data, such as the Cancer Genome Atlas, on the field of pathology informatics. Following the review, we present a case study to illustrate a clinical decision support system that begins with quality control and ends with predictive modeling for several cancer endpoints. Currently, state-of-the-art software tools only provide limited image processing capabilities instead of complete data analysis for clinical decision-making. We aim to inspire researchers to conduct more research in pathology imaging informatics so that clinical decision support can become a reality.
Affiliation(s)
- Sonal Kothari
- School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, Georgia, USA
|
173
|
Gregori J, Villarreal L, Sánchez A, Baselga J, Villanueva J. An effect size filter improves the reproducibility in spectral counting-based comparative proteomics. J Proteomics 2013; 95:55-65. [PMID: 23770383 DOI: 10.1016/j.jprot.2013.05.030] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Received: 12/28/2012] [Revised: 05/06/2013] [Accepted: 05/22/2013] [Indexed: 11/17/2022]
Abstract
The microarray community has shown that the low reproducibility observed in gene expression-based biomarker discovery studies is partially due to relying solely on p-values to get the lists of differentially expressed genes. Their conclusions recommended complementing the p-value cutoff with the use of effect-size criteria. The aim of this work was to evaluate the influence of such an effect-size filter on spectral counting-based comparative proteomic analysis. The results proved that the filter increased the number of true positives and decreased the number of false positives and the false discovery rate of the dataset. These results were confirmed by simulation experiments where the effect size filter was used to evaluate systematically variable fractions of differentially expressed proteins. Our results suggest that relaxing the p-value cut-off followed by a post-test filter based on effect size and signal level thresholds can increase the reproducibility of statistical results obtained in comparative proteomic analysis. Based on our work, we recommend using a filter consisting of a minimum absolute log2 fold change of 0.8 and a minimum signal of 2-4 SpC on the most abundant condition for the general practice of comparative proteomics. The implementation of feature filtering approaches could improve proteomic biomarker discovery initiatives by increasing the reproducibility of the results obtained among independent laboratories and MS platforms. BIOLOGICAL SIGNIFICANCE: Quality control analysis of microarray-based gene expression studies pointed out that the low reproducibility observed in the lists of differentially expressed genes could be partially attributed to the fact that these lists are generated relying solely on p-values. Our study has established that the implementation of an effect size post-test filter improves the statistical results of spectral count-based quantitative proteomics. The results proved that the filter increased the number of true positives while decreasing the false positives and the false discovery rate of the datasets. The results presented here prove that a post-test filter applying reasonable effect size and signal level thresholds helps to increase the reproducibility of statistical results in comparative proteomic analysis. Furthermore, the implementation of feature filtering approaches could improve proteomic biomarker discovery initiatives by increasing the reproducibility of results obtained among independent laboratories and MS platforms. This article is part of a Special Issue entitled: Standardization and Quality Control in Proteomics.
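The recommended filter (a minimum absolute log2 fold change of 0.8 and a minimum signal of 2-4 SpC in the most abundant condition, applied after a p-value cutoff) can be written as a short predicate. The sketch below is a hedged illustration, not the authors' implementation: the record type, field names, the example values, and the choice of `alpha = 0.05` and the lower signal bound of 2 SpC are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ProteinTest:
    """One protein's test result (all field names are hypothetical)."""
    name: str
    p_value: float
    log2_fc: float     # log2 fold change between the two conditions
    mean_spc_a: float  # mean spectral counts (SpC) in condition A
    mean_spc_b: float  # mean spectral counts (SpC) in condition B


def passes_filter(t, alpha=0.05, min_abs_log2_fc=0.8, min_spc=2.0):
    """Return True if the protein survives the p-value cutoff and the post-test filter."""
    signal = max(t.mean_spc_a, t.mean_spc_b)  # SpC in the most abundant condition
    return (
        t.p_value <= alpha
        and abs(t.log2_fc) >= min_abs_log2_fc
        and signal >= min_spc
    )


# Made-up results chosen to exercise each branch of the filter.
results = [
    ProteinTest("P1", 0.01, 3.0, 1.0, 8.0),    # significant, large effect, adequate signal
    ProteinTest("P2", 0.01, 0.3, 20.0, 24.6),  # significant, but effect size below 0.8
    ProteinTest("P3", 0.02, -1.9, 1.5, 0.4),   # large effect, but signal below 2 SpC
]
kept = [t.name for t in results if passes_filter(t)]
print(kept)  # ['P1']
```

Only `P1` survives: `P2` is rejected on effect size and `P3` on signal level, even though both would pass a p-value cutoff alone.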
Affiliation(s)
- Josep Gregori
- Vall d'Hebron Institute of Oncology (VHIO), Universitat Autònoma de Barcelona (UAB), Barcelona, Spain; Statistics Department, University of Barcelona (UB), Barcelona, Spain
|
174
|
Liu HC, Peng PC, Hsieh TC, Yeh TC, Lin CJ, Chen CY, Hou JY, Shih LY, Liang DC. Comparison of feature selection methods for cross-laboratory microarray analysis. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2013; 10:593-604. [PMID: 24091394 DOI: 10.1109/tcbb.2013.70] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 05/25/2023]
Abstract
The amount of gene expression data of microarray has grown exponentially. To apply them for extensive studies, integrated analysis of cross-laboratory (cross-lab) data becomes a trend, and thus, choosing an appropriate feature selection method is an essential issue. This paper focuses on feature selection for Affymetrix (Affy) microarray studies across different labs. We investigate four feature selection methods: t-test, significance analysis of microarrays (SAM), rank products (RP), and random forest (RF). The four methods are applied to acute lymphoblastic leukemia, acute myeloid leukemia, breast cancer, and lung cancer Affy data which consist of three cross-lab data sets each. We utilize a rank-based normalization method to reduce the bias from cross-lab data sets. Training on one data set or two combined data sets to test the remaining data set(s) are both considered. Balanced accuracy is used for prediction evaluation. This study provides comprehensive comparisons of the four feature selection methods in cross-lab microarray analysis. Results show that SAM has the best classification performance. RF also gets high classification accuracy, but it is not as stable as SAM. The most naive method is t-test, but its performance is the worst among the four methods. In this study, we further discuss the influence from the number of training samples, the number of selected genes, and the issue of unbalanced data sets.
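Two ingredients named in this abstract, balanced accuracy and rank-based normalization, are easy to illustrate. The sketch below is an assumption-laden example rather than the paper's procedure: balanced accuracy is computed in its usual form as the mean of per-class recall, and `rank_normalize` is one simple rank transform (ranks scaled to [0, 1] within a sample), since the abstract does not specify which rank-based method was used. Function names are hypothetical.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally regardless of its size."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


def rank_normalize(values):
    """Replace one sample's values by their ranks scaled to [0, 1] (ties broken by position)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    for r, i in enumerate(order):
        ranks[i] = r / (n - 1) if n > 1 else 0.0
    return ranks


# Unbalanced labels: 8 samples of class 1, 2 samples of class 0.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain)                              # 0.9
print(balanced_accuracy(y_true, y_pred))  # 0.75

# The same ordering on a different intensity scale maps to the same ranks.
print(rank_normalize([5.0, 100.0, 7.5]))  # [0.0, 1.0, 0.5]
print(rank_normalize([0.5, 10.0, 0.75]))  # [0.0, 1.0, 0.5]
```

The gap between 0.9 and 0.75 shows why plain accuracy can flatter a classifier on unbalanced data sets, and the identical rank vectors show how a rank transform removes lab-specific scale differences.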
Affiliation(s)
- Hsi-Che Liu
- Mackay Medical College and Division of Pediatric Hematology-Oncology, Mackay Memorial Hospital, New Taipei
|
175
|
Calciano M, Lemarié JC, Blondiaux E, Einstein R, Fehlbaum-Beurdeley P. A predictive microarray-based biomarker for early detection of Alzheimer’s disease intended for clinical diagnostic application. Biomarkers 2013; 18:264-72. [DOI: 10.3109/1354750x.2013.773083] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/13/2022]
|
176
|
Tsuyuzaki K, Tominaga D, Kwon Y, Miyazaki S. Two-way AIC: detection of differentially expressed genes from large scale microarray meta-dataset. BMC Genomics 2013; 14 Suppl 2:S9. [PMID: 23445621 PMCID: PMC3582450 DOI: 10.1186/1471-2164-14-s2-s9] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/29/2022] Open
Abstract
Background: Detection of significant differentially expressed genes (DEGs) from DNA microarray datasets is a common routine task in biomedical research. Numerous methods have been proposed for detecting DEGs. Conventional methods generally detect DEGs from a single dataset consisting of a control group and a treatment group. However, some DEGs are easily detected under any experimental condition. To detect DEGs that are more specific to an experimental condition, each measured gene expression level should be compared in two dimensions, that is, against both other genes and other datasets simultaneously. For this purpose, we retrieve as much gene expression data as possible from public databases and construct a "meta-dataset" that summarizes the expression changes of all genes under various experimental conditions. Herein, we propose "two-way AIC" (Akaike Information Criterion), a method for the simultaneous detection of significant genes and experiments in a meta-dataset. Results: As a case study on Pseudomonas aeruginosa, we evaluate whether the two-way AIC method can detect test data consisting of experiment-condition-specific DEGs. Operon genes are used as test data. Compared with other commonly used statistical methods (t-rank/F-test, RankProducts and SAM), two-way AIC shows the highest specificity in detecting operon genes. Conclusions: The two-way AIC method achieves high specificity for operon gene detection on the microarray meta-dataset. This method can also be applied to the estimation of mutual gene interactions.
Affiliation(s)
- Koki Tsuyuzaki
- Department of Medical and Life Science, Faculty of Pharmaceutical Science, Tokyo University of Science, 2641 Yamazaki, Noda, 278-8510, Japan.
|
177
|
Heider A, Alt R. virtualArray: a R/bioconductor package to merge raw data from different microarray platforms. BMC Bioinformatics 2013; 14:75. [PMID: 23452776 PMCID: PMC3599117 DOI: 10.1186/1471-2105-14-75] [Citation(s) in RCA: 41] [Impact Index Per Article: 3.4] [Received: 06/14/2012] [Accepted: 02/22/2013] [Indexed: 11/10/2022] Open
Abstract
Background Microarrays have become a routine tool to address diverse biological questions. Therefore, different types and generations of microarrays have been produced by several manufacturers over time. Likewise, the diversity of raw data deposited in public databases such as NCBI GEO or EBI ArrayExpress has grown enormously. This has resulted in databases currently containing several hundred thousand microarray samples clustered by different species, manufacturers and chip generations. While one of the original goals of these databases was to make the data available to other researchers for independent analysis and, where appropriate, integration with their own data, current software implementations could not provide that feature. Only those data sets generated on the same chip platform can be readily combined and even here there are batch effects to be taken care of. A straightforward approach to deal with multiple chip types and batch effects has been missing. The software presented here was designed to solve both of these problems in a convenient and user friendly way. Results The virtualArray software package can combine raw data sets using almost any chip types based on current annotations from NCBI GEO or Bioconductor. After establishing congruent annotations for the raw data, virtualArray can then directly employ one of seven implemented methods to adjust for batch effects in the data resulting from differences between the chip types used. Both steps can be tuned to the preferences of the user. When the run is finished, the whole dataset is presented as a conventional Bioconductor “ExpressionSet” object, which can be used as input to other Bioconductor packages. Conclusions Using this software package, researchers can easily integrate their own microarray data with data from public repositories or other sources that are based on different microarray chip types. 
Using the default approach a robust and up-to-date batch effect correction technique is applied to the data.
Affiliation(s)
- Andreas Heider
- Translational Centre for Regenerative Medicine Leipzig, University of Leipzig, Semmelweisstr. 14, Leipzig 04103, Germany.
|
178
|
Lazar C, Taminau J, Meganck S, Steenhoff D, Coletta A, Solís DYW, Molter C, Duque R, Bersini H, Nowé A. GENESHIFT: a nonparametric approach for integrating microarray gene expression data based on the inner product as a distance measure between the distributions of genes. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2013; 10:383-392. [PMID: 23929862 DOI: 10.1109/tcbb.2013.12] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 06/02/2023]
Abstract
The potential of microarray gene expression (MAGE) data is only partially explored due to the limited number of samples in individual studies. This limitation can be surmounted by merging or integrating data sets originating from independent MAGE experiments, which are designed to study the same biological problem. However, this process is hindered by batch effects that are study-dependent and result in random data distortion; therefore numerical transformations are needed to render the integration of different data sets accurate and meaningful. Our contribution in this paper is two-fold. First we propose GENESHIFT, a new nonparametric batch effect removal method based on two key elements from statistics: empirical density estimation and the inner product as a distance measure between two probability density functions; second we introduce a new validation index of batch effect removal methods based on the observation that samples from two independent studies drawn from a same population should exhibit similar probability density functions. We evaluated and compared the GENESHIFT method with four other state-of-the-art methods for batch effect removal: Batch-mean centering, empirical Bayes or COMBAT, distance-weighted discrimination, and cross-platform normalization. Several validation indices providing complementary information about the efficiency of batch effect removal methods have been employed in our validation framework. The results show that none of the methods clearly outperforms the others. More than that, most of the methods used for comparison perform very well with respect to some validation indices while performing very poor with respect to others. GENESHIFT exhibits robust performances and its average rank is the highest among the average ranks of all methods used for comparison.
|
179
|
Giordan M. A Two-Stage Procedure for the Removal of Batch Effects in Microarray Studies. STATISTICS IN BIOSCIENCES 2013. [DOI: 10.1007/s12561-013-9081-1] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 01/12/2023]
|
180
|
Abstract
One important application of microarray in clinical settings is for constructing a diagnosis or prognosis model. Batch effects are a well-known obstacle in this type of applications. Recently, a prominent study was published on how batch effects removal techniques could potentially improve microarray prediction performance. However, the results were not very encouraging, as prediction performance did not always improve. In fact, in up to 20% of the cases, prediction accuracy was reduced. Furthermore, it was stated in the paper that the techniques studied require sufficiently large sample sizes in both batches (train and test) to be effective, which is not a realistic situation especially in clinical settings. In this paper, we propose a different approach, which is able to overcome limitations faced by conventional methods. Our approach uses ranking value of microarray data and a bagging ensemble classifier with sequential hypothesis testing to dynamically determine the number of classifiers required in the ensemble. Using similar datasets to those in the original study, we showed that in only one case (<2%) is our performance reduced (by more than -0.05 AUC) and, in >60% of cases, it is improved (by more than 0.05 AUC). In addition, our approach works even on much smaller training data sets and is independent of the sample size of the test data, making it feasible to be applied on clinical studies.
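As a rough illustration of the rank-based idea in the abstract above, the sketch below replaces each sample's expression values with within-sample ranks. Under the simplifying assumption that a batch effect acts as a monotone distortion of a whole sample, the ranked features are unchanged by it. The function name `rank_transform` and the toy values are hypothetical; the bagging ensemble and sequential hypothesis testing described by the authors are not reproduced here.

```python
def rank_transform(sample):
    """Replace each value with its 1-based rank within the sample.

    Tied values receive the average of the ranks they span.
    """
    order = sorted(range(len(sample)), key=lambda i: sample[i])
    ranks = [0.0] * len(sample)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and sample[order[j + 1]] == sample[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


# Hypothetical expression values for one sample (illustration only).
train_sample = [2.0, 7.5, 4.1, 9.3]
# A monotone distortion of the same sample, standing in for a batch effect.
shifted_sample = [2 * x + 3 for x in train_sample]

print(rank_transform(train_sample))
print(rank_transform(shifted_sample))
```

Both calls yield the same ranks, which is why a classifier trained on ranks does not need the test batch to be large enough to estimate a correction.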
Affiliation(s)
- Chuan Hock Koh
- NUS Graduate School for Integrative Sciences and Engineering, Singapore.
|
181
|
Abstract
Gene expression patterns change dramatically in aging and age-related events. The DNA microarray is now recognized as a useful tool in molecular biology and is widely used to identify the molecular mechanisms of aging and the biological effects of drugs for therapeutic purposes in age-related diseases. Recently, numerous technological advances have led to the evolution of DNA microarrays and microarray-based techniques, revealing genomic modifications and all transcriptional activity. Here, we show the step-by-step methods currently used in our lab for handling oligonucleotide and miRNA microarrays. Moreover, we introduce the protocols for ribonucleoprotein [RNP] immunoprecipitation followed by microarray analysis (RIP-chip), which reveals the target mRNAs of age-related RNA-binding proteins.
|
182
|
Wang SY, Kuo CH, Tseng YJ. Batch Normalizer: A Fast Total Abundance Regression Calibration Method to Simultaneously Adjust Batch and Injection Order Effects in Liquid Chromatography/Time-of-Flight Mass Spectrometry-Based Metabolomics Data and Comparison with Current Calibration Methods. Anal Chem 2012; 85:1037-46. [DOI: 10.1021/ac302877x] [Citation(s) in RCA: 81] [Impact Index Per Article: 6.2] [Indexed: 01/31/2023]
Affiliation(s)
- San-Yuan Wang
- Department of Computer Science and Information Engineering, ‡The Metabolomics Core Laboratory, Center of Genomic Medicine, §School of Pharmacy, College of Medicine, ∥Department of Pharmacy, National Taiwan University Hospital, ⊥Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei, Taiwan
- Ching-Hua Kuo
- Department of Computer Science and Information Engineering, ‡The Metabolomics Core Laboratory, Center of Genomic Medicine, §School of Pharmacy, College of Medicine, ∥Department of Pharmacy, National Taiwan University Hospital, ⊥Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei, Taiwan
- Yufeng J. Tseng
- Department of Computer Science and Information Engineering, ‡The Metabolomics Core Laboratory, Center of Genomic Medicine, §School of Pharmacy, College of Medicine, ∥Department of Pharmacy, National Taiwan University Hospital, ⊥Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei, Taiwan
|
183
|
Xu L, Cheng C, George EO, Homayouni R. Literature aided determination of data quality and statistical significance threshold for gene expression studies. BMC Genomics 2012; 13 Suppl 8:S23. [PMID: 23282414 PMCID: PMC3535704 DOI: 10.1186/1471-2164-13-s8-s23] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 11/11/2022] Open
Abstract
Background Gene expression data are noisy due to technical and biological variability. Consequently, analysis of gene expression data is complex. Different statistical methods produce distinct sets of genes. In addition, selection of expression p-value (EPv) threshold is somewhat arbitrary. In this study, we aimed to develop novel literature based approaches to integrate functional information in analysis of gene expression data. Methods Functional relationships between genes were derived by Latent Semantic Indexing (LSI) of Medline abstracts and used to calculate the function cohesion of gene sets. In this study, literature cohesion was applied in two ways. First, Literature-Based Functional Significance (LBFS) method was developed to calculate a p-value for the cohesion of differentially expressed genes (DEGs) in order to objectively evaluate the overall biological significance of the gene expression experiments. Second, Literature Aided Statistical Significance Threshold (LASST) was developed to determine the appropriate expression p-value threshold for a given experiment. Results We tested our methods on three different publicly available datasets. LBFS analysis demonstrated that only two experiments were significantly cohesive. For each experiment, we also compared the LBFS values of DEGs generated by four different statistical methods. We found that some statistical tests produced more functionally cohesive gene sets than others. However, no statistical test was consistently better for all experiments. This reemphasizes that a statistical test must be carefully selected for each expression study. Moreover, LASST analysis demonstrated that the expression p-value thresholds for some experiments were considerably lower (p < 0.02 and 0.01), suggesting that the arbitrary p-values and false discovery rate thresholds that are commonly used in expression studies may not be biologically sound. 
Conclusions We have developed robust and objective literature-based methods to evaluate the biological support for gene expression experiments and to determine the appropriate statistical significance threshold. These methods will assist investigators to more efficiently extract biologically meaningful insights from high throughput gene expression experiments.
Affiliation(s)
- Lijing Xu
- Bioinformatics Program, Memphis, TN 38152, USA
|
184
|
Quo CF, Kaddi C, Phan JH, Zollanvari A, Xu M, Wang MD, Alterovitz G. Reverse engineering biomolecular systems using -omic data: challenges, progress and opportunities. Brief Bioinform 2012; 13:430-45. [PMID: 22833495 DOI: 10.1093/bib/bbs026] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.3] [Indexed: 12/20/2022] Open
Abstract
Recent advances in high-throughput biotechnologies have led to the rapid growing research interest in reverse engineering of biomolecular systems (REBMS). 'Data-driven' approaches, i.e. data mining, can be used to extract patterns from large volumes of biochemical data at molecular-level resolution while 'design-driven' approaches, i.e. systems modeling, can be used to simulate emergent system properties. Consequently, both data- and design-driven approaches applied to -omic data may lead to novel insights in reverse engineering biological systems that could not be expected before using low-throughput platforms. However, there exist several challenges in this fast growing field of reverse engineering biomolecular systems: (i) to integrate heterogeneous biochemical data for data mining, (ii) to combine top-down and bottom-up approaches for systems modeling and (iii) to validate system models experimentally. In addition to reviewing progress made by the community and opportunities encountered in addressing these challenges, we explore the emerging field of synthetic biology, which is an exciting approach to validate and analyze theoretical system models directly through experimental synthesis, i.e. analysis-by-synthesis. The ultimate goal is to address the present and future challenges in reverse engineering biomolecular systems (REBMS) using integrated workflow of data mining, systems modeling and synthetic biology.
Affiliation(s)
- Chang F Quo
- Georgia Institute of Technology, Atlanta, GA 30332, USA
|
185
|
G-cimp status prediction of glioblastoma samples using mRNA expression data. PLoS One 2012; 7:e47839. [PMID: 23139755 PMCID: PMC3490960 DOI: 10.1371/journal.pone.0047839] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.7] [Received: 07/31/2012] [Accepted: 09/21/2012] [Indexed: 11/19/2022] Open
Abstract
Glioblastoma Multiforme (GBM) is a tumor with high mortality and no known cure. The dramatic molecular and clinical heterogeneity seen in this tumor has led to attempts to define genetically similar subgroups of GBM with the hope of developing tumor specific therapies targeted to the unique biology within each of these subgroups. Recently, a subset of relatively favorable prognosis GBMs has been identified. These glioma CpG island methylator phenotype, or G-CIMP tumors, have distinct genomic copy number aberrations, DNA methylation patterns, and (mRNA) expression profiles compared to other GBMs. While the standard method for identifying G-CIMP tumors is based on genome-wide DNA methylation data, such data is often not available compared to the more widely available gene expression data. In this study, we have developed and evaluated a method to predict the G-CIMP status of GBM samples based solely on gene expression data.
|
186
|
Parker HS, Leek JT. The practical effect of batch on genomic prediction. Stat Appl Genet Mol Biol 2012; 11:Article 10. [PMID: 22611599 DOI: 10.1515/1544-6115.1766] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.4] [Indexed: 11/15/2022]
Abstract
Measurements from microarrays and other high-throughput technologies are susceptible to non-biological artifacts like batch effects. It is known that batch effects can alter or obscure the set of significant results and biological conclusions in high-throughput studies. Here we examine the impact of batch effects on predictors built from genomic technologies. To investigate batch effects, we collected publicly available gene expression measurements with known outcomes, and estimated batches using date. Using these data we show (1) the impact of batch effects on prediction depends on the correlation between outcome and batch in the training data, and (2) removing expression measurements most affected by batch before building predictors may improve the accuracy of those predictors. These results suggest that (1) training sets should be designed to minimize correlation between batches and outcome, and (2) methods for identifying batch-affected probes should be developed to improve prediction results for studies with high correlation between batches and outcome.
|
187
|
Lazar C, Meganck S, Taminau J, Steenhoff D, Coletta A, Molter C, Weiss-Solís DY, Duque R, Bersini H, Nowé A. Batch effect removal methods for microarray gene expression data integration: a survey. Brief Bioinform 2012; 14:469-90. [PMID: 22851511 DOI: 10.1093/bib/bbs037] [Citation(s) in RCA: 216] [Impact Index Per Article: 16.6] [Indexed: 12/31/2022] Open
Abstract
Genomic data integration is a key goal to be achieved towards large-scale genomic data analysis. This process is very challenging due to the diverse sources of information resulting from genomics experiments. In this work, we review methods designed to combine genomic data recorded from microarray gene expression (MAGE) experiments. It has been acknowledged that the main source of variation between different MAGE datasets is due to the so-called 'batch effects'. The methods reviewed here perform data integration by removing (or more precisely attempting to remove) the unwanted variation associated with batch effects. They are presented in a unified framework together with a wide range of evaluation tools, which are mandatory in assessing the efficiency and the quality of the data integration process. We provide a systematic description of the MAGE data integration methodology together with some basic recommendation to help the users in choosing the appropriate tools to integrate MAGE data for large-scale analysis; and also how to evaluate them from different perspectives in order to quantify their efficiency. All genomic data used in this study for illustration purposes were retrieved from InSilicoDB http://insilico.ulb.ac.be.
Affiliation(s)
- Cosmin Lazar
- Como, Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussels, Belgium.
|
188
|
Identification of a radiosensitivity signature using integrative metaanalysis of published microarray data for NCI-60 cancer cells. BMC Genomics 2012; 13:348. [PMID: 22846430 PMCID: PMC3472294 DOI: 10.1186/1471-2164-13-348] [Citation(s) in RCA: 114] [Impact Index Per Article: 8.8] [Received: 09/21/2011] [Accepted: 07/18/2012] [Indexed: 11/21/2022] Open
Abstract
Background In the postgenome era, a prediction of response to treatment could lead to better dose selection for patients in radiotherapy. To identify a radiosensitive gene signature and elucidate related signaling pathways, four different microarray experiments were reanalyzed before radiotherapy. Results Radiosensitivity profiling data using clonogenic assay and gene expression profiling data from four published microarray platforms applied to NCI-60 cancer cell panel were used. The survival fraction at 2 Gy (SF2, range from 0 to 1) was calculated as a measure of radiosensitivity and a linear regression model was applied to identify genes or a gene set with a correlation between expression and radiosensitivity (SF2). Radiosensitivity signature genes were identified using significant analysis of microarrays (SAM) and gene set analysis was performed using a global test using linear regression model. Using the radiation-related signaling pathway and identified genes, a genetic network was generated. According to SAM, 31 genes were identified as common to all the microarray platforms and therefore a common radiosensitivity signature. In gene set analysis, functions in the cell cycle, DNA replication, and cell junction, including adherence and gap junctions were related to radiosensitivity. The integrin, VEGF, MAPK, p53, JAK-STAT and Wnt signaling pathways were overrepresented in radiosensitivity. Significant genes including ACTN1, CCND1, HCLS1, ITGB5, PFN2, PTPRC, RAB13, and WAS, which are adhesion-related molecules that were identified by both SAM and gene set analysis, and showed interaction in the genetic network with the integrin signaling pathway. Conclusions Integration of four different microarray experiments and gene selection using gene set analysis discovered possible target genes and pathways relevant to radiosensitivity. 
Our results suggested that the identified genes are candidates for radiosensitivity biomarkers and that integrin signaling via adhesion molecules could be a target for radiosensitization.
|
189
|
Troendle JF, Yu KF, Westfall PH, Pennello G, Schisterman EF. Comparing the Expected Misclassification Cost for Two Classifiers Based on Estimates From the Same Sample. Stat Biopharm Res 2012. [DOI: 10.1080/19466315.2012.695263] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/28/2022]
|
190
|
Kupfer P, Guthke R, Pohlers D, Huber R, Koczan D, Kinne RW. Batch correction of microarray data substantially improves the identification of genes differentially expressed in rheumatoid arthritis and osteoarthritis. BMC Med Genomics 2012; 5:23. [PMID: 22682473 PMCID: PMC3528008 DOI: 10.1186/1755-8794-5-23] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.7] [Received: 12/19/2011] [Accepted: 05/21/2012] [Indexed: 11/10/2022] Open
Abstract
Background Batch effects due to sample preparation or array variation (type, charge, and/or platform) may influence the results of microarray experiments and thus mask and/or confound true biological differences. Of the published approaches for batch correction, the algorithm “Combating Batch Effects When Combining Batches of Gene Expression Microarray Data” (ComBat) appears to be most suitable for small sample sizes and multiple batches. Methods Synovial fibroblasts (SFB; purity > 98%) were obtained from rheumatoid arthritis (RA) and osteoarthritis (OA) patients (n = 6 each) and stimulated with TNF-α or TGF-β1 for 0, 1, 2, 4, or 12 hours. Gene expression was analyzed using Affymetrix Human Genome U133 Plus 2.0 chips, an alternative chip definition file, and normalization by Robust Multi-Array Analysis (RMA). Data were batch-corrected for different acquiry dates using ComBat and the efficacy of the correction was validated using hierarchical clustering. Results In contrast to the hierarchical clustering dendrogram before batch correction, in which RA and OA patients clustered randomly, batch correction led to a clear separation of RA and OA. Strikingly, this applied not only to the 0 hour time point (i.e., before stimulation with TNF-α/TGF-β1), but also to all time points following stimulation except for the late 12 hour time point. Batch-corrected data then allowed the identification of differentially expressed genes discriminating between RA and OA. Batch correction only marginally modified the original data, as demonstrated by preservation of the main Gene Ontology (GO) categories of interest, and by minimally changed mean expression levels (maximal change 4.087%) or variances for all genes of interest. 
Eight genes from the GO category “extracellular matrix structural constituent” (5 different collagens, biglycan, and tubulointerstitial nephritis antigen-like 1) were differentially expressed between RA and OA (RA > OA), both constitutively at time point 0, and at all time points following stimulation with either TNF-α or TGF-β1. Conclusion Batch correction appears to be an extremely valuable tool to eliminate non-biological batch effects, and allows the identification of genes discriminating between different joint diseases. RA-SFB show an upregulated expression of extracellular matrix components, both constitutively following isolation from the synovial membrane and upon stimulation with disease-relevant cytokines or growth factors, suggesting an “imprinted” alteration of their phenotype.
Affiliation(s)
- Peter Kupfer
- Experimental Rheumatology Unit, Department of Orthopedics, University Hospital Jena, Friedrich Schiller University, Jena, Germany
|
191
|
Verderio P. Assessing the Clinical Relevance of Oncogenic Pathways in Neoadjuvant Breast Cancer. J Clin Oncol 2012; 30:1912-5. [DOI: 10.1200/jco.2012.41.7386] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 01/15/2023] Open
Affiliation(s)
- Paolo Verderio
- Fondazione Istituto di Ricovero e Cura a Carattere Scientifico, Istituto Nazionale dei Tumori, Milan, Italy
|
192
|
Gregori J, Villarreal L, Méndez O, Sánchez A, Baselga J, Villanueva J. Batch effects correction improves the sensitivity of significance tests in spectral counting-based comparative discovery proteomics. J Proteomics 2012; 75:3938-51. [PMID: 22588121 DOI: 10.1016/j.jprot.2012.05.005] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.8] [Received: 03/06/2012] [Revised: 04/27/2012] [Accepted: 05/02/2012] [Indexed: 02/04/2023]
Abstract
Shotgun proteomics has become the standard proteomics technique for the large-scale measurement of protein abundances in biological samples. Although quantitative proteomics has usually been performed using label-based approaches, label-free quantitation offers several advantages: it avoids labeling steps, places no limit on the number of samples to be compared, and gains protein detection sensitivity. However, since samples are analyzed separately, experimental design becomes critical. The exploration of spectral counting quantitation based on LC-MS presented here gathers experimental evidence of the influence of batch effects on comparative proteomics. The batch effects shown with spiking experiments clearly interfere with the biological signal. In order to minimize the interference from batch effects, a statistical correction is proposed and implemented. Our results show that batch effects can be attenuated statistically when proper experimental design is used. Furthermore, the batch effect correction implemented leads to a substantial increase in the sensitivity of statistical tests. Finally, the applicability of our batch effects correction is shown on two different biomarker discovery projects involving cancer secretomes. We think that our findings will allow better comparative proteomics projects to be designed and executed and will help to avoid false conclusions in the field of proteomics biomarker discovery.
Affiliation(s)
- Josep Gregori
- Vall d'Hebron Institut of Oncology, Barcelona, Spain
|
193
|
De Serres SA, Mfarrej BG, Grafals M, Riella LV, Magee CN, Yeung MY, Dyer C, Ahmad U, Chandraker A, Najafian N. Derivation and validation of a cytokine-based assay to screen for acute rejection in renal transplant recipients. Clin J Am Soc Nephrol 2012; 7:1018-25. [PMID: 22498498 DOI: 10.2215/cjn.11051011] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.9] [Indexed: 11/23/2022]
Abstract
BACKGROUND AND OBJECTIVES Acute rejection remains a problem in renal transplantation. This study sought to determine the utility of a noninvasive cytokine assay in screening of acute rejection. DESIGN, SETTING, PARTICIPANTS, & MEASUREMENTS In this observational cross-sectional study, 64 patients from two centers were recruited upon admission for allograft biopsy to investigate acute graft dysfunction. Blood was collected before biopsy and assayed for a panel of 21 cytokines secreted by PBMCs. Patients were classified as acute rejectors or nonrejectors according to a classification rule derived from an initial set of 32 patients (training cohort) and subsequently validated in the remaining patients (validation cohort). RESULTS Although six cytokines (IL-1β, IL-6, TNF-α, IL-4, GM-CSF, and monocyte chemoattractant protein-1) distinguished acute rejectors in the training cohort, logistic regression modeling identified a single cytokine, IL-6, as the best predictor. In the validation cohort, IL-6 was consistently the most accurate cytokine (area under the receiver-operating characteristic curve, 0.85; P=0.006), whereas the application of a prespecified cutoff level, as determined from the training cohort, resulted in a sensitivity and specificity of 92% and 63%, respectively. Secondary analyses revealed a strong association between IL-6 levels and acute rejection after multivariate adjustment for clinical characteristics (P<0.001). CONCLUSIONS In this pilot study, the measurement of a single cytokine can exclude acute rejection with a sensitivity of 92% in renal transplant recipients presenting with acute graft dysfunction. Prospective studies are needed to determine the utility of this simple assay, particularly for low-risk or remote patients.
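To make the reported sensitivity and specificity concrete, the following minimal sketch shows how both are computed once a cutoff has been fixed in a training cohort and applied to a validation cohort. The function name, the cutoff, and every number below are hypothetical and are not taken from the study; the sketch also assumes that values at or above the cutoff are called positive.

```python
def sensitivity_specificity(values, is_rejection, cutoff):
    """Return (sensitivity, specificity) for a 'value >= cutoff' rule."""
    pairs = list(zip(values, is_rejection))
    true_pos = sum(1 for v, r in pairs if r and v >= cutoff)
    false_neg = sum(1 for v, r in pairs if r and v < cutoff)
    true_neg = sum(1 for v, r in pairs if not r and v < cutoff)
    false_pos = sum(1 for v, r in pairs if not r and v >= cutoff)
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity


# Hypothetical validation-cohort measurements and biopsy outcomes.
marker_levels = [12.0, 3.0, 8.5, 1.2, 9.9, 4.4]
biopsy_rejection = [True, False, True, False, False, True]
prespecified_cutoff = 5.0  # assumed to have been fixed on the training cohort

sens, spec = sensitivity_specificity(marker_levels, biopsy_rejection, prespecified_cutoff)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Fixing the cutoff before looking at the validation cohort is what keeps the validation estimates from being optimistically biased.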
Affiliation(s)
- Sacha A De Serres
- Schuster Family Transplantation Research Center, Renal Division, Brigham and Women's Hospital & Children's Hospital Boston, Harvard Medical School, Boston, Massachusetts, USA
|
194
|
Sun Z, Chai HS, Wu Y, White WM, Donkena KV, Klein CJ, Garovic VD, Therneau TM, Kocher JPA. Batch effect correction for genome-wide methylation data with Illumina Infinium platform. BMC Med Genomics 2011; 4:84. [PMID: 22171553 PMCID: PMC3265417 DOI: 10.1186/1755-8794-4-84] [Citation(s) in RCA: 86] [Impact Index Per Article: 6.1] [Received: 07/29/2011] [Accepted: 12/16/2011] [Indexed: 01/12/2023] Open
Abstract
Background: Genome-wide methylation profiling has led to more comprehensive insights into gene regulation mechanisms and potential therapeutic targets. The Illumina Human Methylation BeadChip is one of the most commonly used genome-wide methylation platforms. As in other microarray experiments, methylation data are susceptible to various technical artifacts, particularly batch effects. To date, little attention has been given to normalization and batch effect correction for this kind of data. Methods: We evaluated three common normalization approaches and investigated their performance in batch effect removal using three datasets with different degrees of batch effects generated from the HumanMethylation27 platform: quantile normalization at average β value (QNβ); two-step quantile normalization at probe signals implemented in the "lumi" package of R (lumi); and quantile normalization of A and B signals separately (ABnorm). Subsequent Empirical Bayes (EB) batch adjustment was also evaluated. Results: Each normalization could remove a portion of batch effects, and their effectiveness differed depending on the severity of batch effects in a dataset. For the dataset with minor batch effects (Dataset 1), normalization alone appeared adequate and "lumi" showed the best performance. However, all methods left substantial batch effects intact in the datasets with obvious batch effects, and further correction was necessary. Without any correction, 50 and 66 percent of CpGs were associated with batch effects in Datasets 2 and 3, respectively. After QNβ, lumi or ABnorm, the proportion of CpGs associated with batch effects was reduced to 24, 32, and 26 percent for Dataset 2, and to 37, 46, and 35 percent for Dataset 3, respectively. Additional EB correction effectively removed the remaining non-biological effects. More importantly, the two-step procedure almost tripled the number of CpGs associated with the outcome of interest for the two datasets.
Conclusion: Genome-wide methylation data from the Infinium Methylation BeadChip can be susceptible to batch effects with profound impacts on downstream analyses and conclusions. Normalization can reduce some but not all batch effects. EB correction along with normalization is recommended for effective batch effect removal.
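As a rough illustration of the normalization step named in the abstract above, the following is a minimal sketch of generic quantile normalization across samples. It is not the QNβ, lumi, or ABnorm procedure from the paper; the function name `quantile_normalize` and the toy β values are assumptions made for this example, and ties are broken arbitrarily rather than averaged.

```python
def quantile_normalize(columns):
    """Quantile-normalize samples so they share one value distribution.

    columns: list of equal-length lists, one list of probe values per sample.
    Returns a new list of lists in the same layout.
    """
    n = len(columns[0])
    # For each sample, the probe indices ordered from smallest to largest value.
    orders = [sorted(range(n), key=col.__getitem__) for col in columns]
    # Reference distribution: mean across samples of the value at each rank.
    reference = [
        sum(col[order[rank]] for col, order in zip(columns, orders)) / len(columns)
        for rank in range(n)
    ]
    normalized = []
    for order in orders:
        out = [0.0] * n
        # Each probe receives the reference value for its rank in its sample.
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized


# Hypothetical beta values for three probes in two samples (illustration only).
samples = [[0.1, 0.5, 0.3], [0.4, 0.9, 0.6]]
normalized = quantile_normalize(samples)
```

After this step every sample has the same sorted values, so only the within-sample ranking of probes is retained. The abstract's finding is that such normalization alone can leave batch effects in place, which is why a further batch adjustment was evaluated.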
Affiliation(s)
- Zhifu Sun
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic College of Medicine, 200 First Street, Rochester, MN 55905, USA
|
195
|
Nueda MJ, Ferrer A, Conesa A. ARSyN: a method for the identification and removal of systematic noise in multifactorial time course microarray experiments. Biostatistics 2011; 13:553-66. [PMID: 22085896 DOI: 10.1093/biostatistics/kxr042] [Citation(s) in RCA: 50] [Impact Index Per Article: 3.6] [Indexed: 11/14/2022] Open
Abstract
Transcriptomic profiling experiments that aim to identify responsive genes in specific biological conditions are commonly set up under defined experimental designs that try to assess the effects of factors and their interactions on gene expression. Data from these controlled experiments, however, may also contain sources of unwanted noise that can distort the signal under study, affect the residuals of applied statistical models, and hamper data analysis. Commonly, normalization methods are applied to transcriptomics data to remove technical artifacts, but these are normally based on general assumptions about transcript distribution and largely ignore both the characteristics of the experiment under consideration and the coordinated nature of gene expression. In this paper, we propose a novel methodology, ARSyN, for the preprocessing of microarray data that takes these last 2 aspects into account. By combining analysis of variance (ANOVA) modeling of gene expression values and multivariate analysis of estimated effects, the method identifies the nonstructured part of the signal associated with the experimental factors (the noise within the signal) and the structured variation of the ANOVA errors (the signal of the noise). By removing these noise fractions from the original data, we create a filtered data set that is rich in the information of interest and includes only the random noise required for inferential analysis. In this work, we focus on multifactorial time course microarray (MTCM) experiments with 2 factors: one quantitative, such as time or dosage, and the other qualitative, such as tissue, strain, or treatment. However, the method can be used in other situations, such as experiments with only one factor or more complex designs with more than 2 factors. The filtered data obtained after applying ARSyN can be further analyzed with the appropriate statistical technique to obtain the biological information required.
To evaluate the performance of the filtering strategy, we applied different statistical approaches for MTCM analysis to several real and simulated data sets, also studying the efficiency of these techniques. By comparing the results obtained with the original and ARSyN-filtered data, and also with other filtering techniques, we conclude that the proposed method increases the statistical power to detect biological signals, especially in cases where there are high levels of structural noise. Software for ARSyN is freely available at http://www.ua.es/personal/mj.nueda.
Affiliation(s)
- Maria J Nueda
- Departamento de Estadística e Investigación Operativa, Universidad de Alicante, Apartado 03080, Alicante, Spain.
|
196
|
Barbau-Piednoir E, Lievens A, Vandermassen E, Mbongolo-Mbella EG, Leunda-Casi A, Roosens N, Sneyers M, Van den Bulcke M. Four new SYBR®Green qPCR screening methods for the detection of Roundup Ready®, LibertyLink®, and CryIAb traits in genetically modified products. Eur Food Res Technol 2011. [DOI: 10.1007/s00217-011-1605-7] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Indexed: 01/18/2023]
|
197
|
Williams R, Schuldt B, Müller FJ. A guide to stem cell identification: progress and challenges in system-wide predictive testing with complex biomarkers. Bioessays 2011; 33:880-90. [PMID: 21901750 DOI: 10.1002/bies.201100073] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.1] [Indexed: 12/13/2022]
Abstract
We have developed a first generation tool for the unbiased identification and characterization of human pluripotent stem cells, termed PluriTest. This assay utilizes all the information contained on a microarray and abandons the conventional stem cell marker concept. Stem cells are defined by the ability to replenish themselves and to differentiate into more mature cell types. As differentiation potential is a property that cannot be directly proven in the stem cell state, biologists have to rely on correlative measurements in stem cells associated with differentiation potential. Unfortunately, most, if not all, of those markers are only valid within narrow limits of specific experimental systems. Microarray technologies and recently next-generation sequencing have revolutionized how cellular phenotypes can be characterized on a systems-wide level. Here we discuss the challenges PluriTest and similar global assays need to address to fulfill their enormous potential for industrial, diagnostic and therapeutic applications.
Affiliation(s)
- Roy Williams
- Bioinformatics Shared Resource, Sanford Burnham Medical Research Institute, La Jolla, CA, USA
|
198
|
Mendrick DL. Transcriptional profiling to identify biomarkers of disease and drug response. Pharmacogenomics 2011; 12:235-49. [PMID: 21332316 DOI: 10.2217/pgs.10.184] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.1] [Indexed: 02/08/2023] Open
Abstract
The discovery, biological qualification and analytical validation of genomic biomarkers require extensive collaboration between individuals with expertise in biology, statistics, bioinformatics, chemistry, clinical medicine, regulatory science and related fields. For clinical utility, blood-borne biomarkers (e.g., mRNA and miRNA) of organ damage, drug toxicity and/or response would be preferred to those that are tissue based. Currently used biomarkers such as serum creatinine (indicating renal dysfunction) denote organ damage whether caused by disease, physical injury or drugs. Therefore, it is anticipated that studies of disease will discover biomarkers that can also be used to identify drug-induced injury and vice versa. This article describes transcriptomic blood-borne biomarkers that have been reported to be connected with disease and drug toxicity. Much more qualification and validation need to be carried out before many of these biomarkers can prove useful. Discussed here are some of the lessons learned and roadblocks to success.
Affiliation(s)
- Donna L Mendrick
- Division of Systems Biology, HFT-230, National Center for Toxicological Research, US FDA, 3900 NCTR Rd, Jefferson, AR 72079-4502, USA.
|
199
|
Malone JH, Oliver B. Microarrays, deep sequencing and the true measure of the transcriptome. BMC Biol 2011; 9:34. [PMID: 21627854 PMCID: PMC3104486 DOI: 10.1186/1741-7007-9-34] [Citation(s) in RCA: 347] [Impact Index Per Article: 24.8] [Received: 10/12/2010] [Accepted: 05/31/2011] [Indexed: 12/11/2022] Open
Abstract
Microarrays first made the analysis of the transcriptome possible, and have produced much important information. Today, however, researchers are increasingly turning to direct high-throughput sequencing (RNA-Seq), which has considerable advantages for examining transcriptome fine structure, for example in the detection of allele-specific expression and splice junctions. In this article, we discuss the relative merits of the two techniques, the inherent biases in each, and whether all of the vast body of array work needs to be revisited using the newer technology. We conclude that microarrays remain useful and accurate tools for measuring expression levels, and RNA-Seq complements and extends microarray measurements.
Affiliation(s)
- John H Malone
- Laboratory of Cellular and Developmental Biology, National Institute of Digestive, Diabetes, and Kidney Diseases, National Institutes of Health, Bethesda, MD 20892, USA
- Brian Oliver
- Laboratory of Cellular and Developmental Biology, National Institute of Digestive, Diabetes, and Kidney Diseases, National Institutes of Health, Bethesda, MD 20892, USA
|
200
|
Chen C, Grennan K, Badner J, Zhang D, Gershon E, Jin L, Liu C. Removing batch effects in analysis of expression microarray data: an evaluation of six batch adjustment methods. PLoS One 2011; 6:e17238. [PMID: 21386892 PMCID: PMC3046121 DOI: 10.1371/journal.pone.0017238] [Citation(s) in RCA: 341] [Impact Index Per Article: 24.4] [Received: 08/26/2010] [Accepted: 01/24/2011] [Indexed: 01/07/2023] Open
Abstract
The expression microarray is a frequently used approach to study gene expression on a genome-wide scale. However, the data produced by the thousands of microarray studies published annually are confounded by "batch effects," the systematic error introduced when samples are processed in multiple batches. Although batch effects can be reduced by careful experimental design, they cannot be eliminated unless the whole study is done in a single batch. A number of programs are now available to adjust microarray data for batch effects prior to analysis. We systematically evaluated six of these programs using multiple measures of precision, accuracy and overall performance. ComBat, an Empirical Bayes method, outperformed the other five programs by most metrics. We also showed that it is essential to standardize expression data at the probe level when testing for correlation of expression profiles, due to a sizeable probe effect in microarray data that can inflate the correlation among replicates and unrelated samples.
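The abstract's last point, that an unstandardized probe effect can inflate correlations between samples, can be seen in a small simulation. The sketch below is an assumption-laden illustration and not the authors' analysis: the simulated probe baselines, noise levels, and the helper names `pearson` and `standardize_probes` are all invented for this example.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def standardize_probes(samples):
    """Z-score each probe across samples (samples: one list of probe values per sample)."""
    standardized = [[] for _ in samples]
    for probe_values in zip(*samples):
        mean = sum(probe_values) / len(probe_values)
        sd = (sum((v - mean) ** 2 for v in probe_values) / len(probe_values)) ** 0.5
        for out, v in zip(standardized, probe_values):
            out.append((v - mean) / sd if sd > 0 else 0.0)
    return standardized


random.seed(0)
n_probes, n_samples = 500, 6
# Each probe has its own baseline intensity (the "probe effect").
probe_effect = [random.gauss(8.0, 2.0) for _ in range(n_probes)]
# Samples are otherwise unrelated: independent noise around the shared baseline.
samples = [[p + random.gauss(0.0, 0.5) for p in probe_effect] for _ in range(n_samples)]

raw_r = pearson(samples[0], samples[1])
standardized = standardize_probes(samples)
standardized_r = pearson(standardized[0], standardized[1])
print(f"raw r = {raw_r:.2f}, probe-standardized r = {standardized_r:.2f}")
```

In this toy setup the raw correlation between two unrelated samples is high only because both share the same probe baselines; once each probe is standardized across samples, that shared component is removed and the correlation no longer suggests similarity.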
Affiliation(s)
- Chao Chen
- National Ministry of Education Key Laboratory of Contemporary Anthropology, Fudan University, Shanghai, People's Republic of China
- Department of Psychiatry, University of Chicago, Chicago, Illinois, United States of America
- Kay Grennan
- Department of Psychiatry, University of Chicago, Chicago, Illinois, United States of America
- Judith Badner
- Department of Psychiatry, University of Chicago, Chicago, Illinois, United States of America
- Dandan Zhang
- Department of Pathology, Zhejiang University, Hangzhou, People's Republic of China
- Elliot Gershon
- Department of Psychiatry, University of Chicago, Chicago, Illinois, United States of America
- Li Jin
- National Ministry of Education Key Laboratory of Contemporary Anthropology, Fudan University, Shanghai, People's Republic of China
- Chunyu Liu
- Department of Psychiatry, University of Chicago, Chicago, Illinois, United States of America
|