1
Hui HWH, Kong W, Goh WWB. Thinking points for effective batch correction on biomedical data. Brief Bioinform 2024; 25:bbae515. [PMID: 39397427 PMCID: PMC11471903 DOI: 10.1093/bib/bbae515] [Received: 06/19/2024] [Revised: 09/11/2024] [Accepted: 10/01/2024] [Indexed: 10/15/2024]
Abstract
Batch effects introduce significant variability into high-dimensional data, complicating accurate analysis and leading to potentially misleading conclusions if not adequately addressed. Despite technological and algorithmic advancements in biomedical research, effectively managing batch effects remains a complex challenge requiring comprehensive considerations. This paper underscores the necessity of a flexible and holistic approach for selecting batch effect correction algorithms (BECAs), advocating for proper BECA evaluations and consideration of artificial intelligence-based strategies. We also discuss key challenges in batch effect correction, including the importance of uncovering hidden batch factors and understanding the impact of design imbalance, missing values, and aggressive correction. Our aim is to provide researchers with a robust framework for effective batch effects management and enhancing the reliability of high-dimensional data analyses.
Affiliation(s)
- Harvard Wai Hann Hui
  - Lee Kong Chian School of Medicine, Nanyang Technological University, 59 Nanyang Drive, Singapore 636921, Singapore
- Weijia Kong
  - Lee Kong Chian School of Medicine, Nanyang Technological University, 59 Nanyang Drive, Singapore 636921, Singapore
  - School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore 637551, Singapore
- Wilson Wen Bin Goh
  - Lee Kong Chian School of Medicine, Nanyang Technological University, 59 Nanyang Drive, Singapore 636921, Singapore
  - School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore 637551, Singapore
  - Center for Biomedical Informatics, Nanyang Technological University, 59 Nanyang Dr, Singapore 636921, Singapore
  - Center of AI in Medicine, Nanyang Technological University, 59 Nanyang Dr, Singapore 636921, Singapore
  - Division of Neurology, Department of Brain Sciences, Faculty of Medicine, Imperial College London, Burlington Danes, The Hammersmith Hospital, Du Cane Road, London W12 0NN, United Kingdom
2
Yu Y, Zhang N, Mai Y, Ren L, Chen Q, Cao Z, Chen Q, Liu Y, Hou W, Yang J, Hong H, Xu J, Tong W, Dong L, Shi L, Fang X, Zheng Y. Correcting batch effects in large-scale multiomics studies using a reference-material-based ratio method. Genome Biol 2023; 24:201. [PMID: 37674217 PMCID: PMC10483871 DOI: 10.1186/s13059-023-03047-z] [Received: 11/08/2022] [Accepted: 05/18/2023] [Indexed: 09/08/2023]
Abstract
BACKGROUND: Batch effects are notoriously common technical variations in multiomics data and may result in misleading outcomes if uncorrected or over-corrected. A plethora of batch-effect correction algorithms have been proposed to facilitate data integration. However, their respective advantages and limitations have not been adequately assessed in terms of omics types, performance metrics, and application scenarios. RESULTS: As part of the Quartet Project for quality control and data integration of multiomics profiling, we comprehensively assess the performance of seven batch-effect correction algorithms based on different performance metrics of clinical relevance, i.e., the accuracy of identifying differentially expressed features, the robustness of predictive models, and the ability to accurately cluster cross-batch samples into their own donors. The ratio-based method, i.e., scaling absolute feature values of study samples relative to those of concurrently profiled reference material(s), is found to be much more effective and broadly applicable than others, especially when batch effects are completely confounded with biological factors of study interest. We further provide practical guidelines for implementing the ratio-based approach in increasingly large-scale multiomics studies. CONCLUSIONS: Multiomics measurements are prone to batch effects, which can be effectively corrected using ratio-based scaling of the multiomics data. Our study lays the foundation for eliminating batch effects at a ratio scale.
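The ratio-based scaling described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `ratio_scale`, the use of the per-feature mean of the reference replicates as the denominator, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def ratio_scale(values, batches, is_reference):
    """Express each sample relative to the reference replicates of its own batch.

    values       : (n_samples, n_features) array of absolute feature values
    batches      : length n_samples array of batch labels
    is_reference : length n_samples boolean array marking reference-material runs
    """
    values = np.asarray(values, dtype=float)
    batches = np.asarray(batches)
    is_reference = np.asarray(is_reference, dtype=bool)

    scaled = np.empty_like(values)
    for batch in np.unique(batches):
        in_batch = batches == batch
        reference = values[in_batch & is_reference]
        if reference.shape[0] == 0:
            raise ValueError(f"batch {batch!r} has no reference-material run")
        # Per-feature mean of the reference replicates profiled in this batch
        # (an assumed choice of summary; a median would work the same way).
        reference_mean = reference.mean(axis=0)
        scaled[in_batch] = values[in_batch] / reference_mean
    return scaled


# Toy data (hypothetical): batch "B" reads everything at twice the intensity of "A".
values = np.array([
    [10.0, 20.0],   # batch A, reference material
    [20.0, 10.0],   # batch A, study sample
    [20.0, 40.0],   # batch B, reference material
    [40.0, 20.0],   # batch B, study sample with the same biology as the A sample
])
batches = np.array(["A", "A", "B", "B"])
is_reference = np.array([True, False, True, False])

scaled = ratio_scale(values, batches, is_reference)
```

Because each batch is divided by its own reference profile, a purely multiplicative batch factor cancels out, so the two study samples in the toy data end up with identical ratio profiles. The correction uses only the reference runs, not the study-group labels, which is consistent with the abstract's claim that the approach remains applicable when batch and biology are completely confounded.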
Affiliation(s)
- Ying Yu
  - State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Shanghai Cancer Center, Fudan University, Shanghai, China
- Naixin Zhang
  - State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Shanghai Cancer Center, Fudan University, Shanghai, China
- Yuanbang Mai
  - State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Shanghai Cancer Center, Fudan University, Shanghai, China
- Luyao Ren
  - State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Shanghai Cancer Center, Fudan University, Shanghai, China
- Qiaochu Chen
  - State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Shanghai Cancer Center, Fudan University, Shanghai, China
- Zehui Cao
  - State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Shanghai Cancer Center, Fudan University, Shanghai, China
- Qingwang Chen
  - State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Shanghai Cancer Center, Fudan University, Shanghai, China
- Yaqing Liu
  - State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Shanghai Cancer Center, Fudan University, Shanghai, China
- Wanwan Hou
  - State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Shanghai Cancer Center, Fudan University, Shanghai, China
- Jingcheng Yang
  - State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Shanghai Cancer Center, Fudan University, Shanghai, China
  - Greater Bay Area Institute of Precision Medicine, Guangzhou, Guangdong, China
- Huixiao Hong
  - Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR, USA
- Joshua Xu
  - Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR, USA
- Weida Tong
  - Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR, USA
- Leming Shi
  - State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Shanghai Cancer Center, Fudan University, Shanghai, China
  - International Human Phenome Institutes, Shanghai, China
- Xiang Fang
  - National Institute of Metrology, Beijing, China
- Yuanting Zheng
  - State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Shanghai Cancer Center, Fudan University, Shanghai, China
3
Yosef A, Shnaider E, Schneider M, Gurevich M. Heuristic normalization procedure for batch effect correction. Soft Comput 2023. [DOI: 10.1007/s00500-023-08049-4] [Indexed: 03/31/2023]
4
Yosef A, Shnaider E, Schneider M, Gurevich M. Normalization of Large-Scale Transcriptome Data Using Heuristic Methods. Bioinform Biol Insights 2023; 17:11779322231160397. [PMID: 37020503 PMCID: PMC10068970 DOI: 10.1177/11779322231160397] [Received: 06/22/2022] [Accepted: 02/09/2023] [Indexed: 04/03/2023]
Abstract
In this study, we introduce an artificial intelligence method for addressing the batch effect in transcriptome data. The method has several clear advantages in comparison with the alternative methods presently in use. Batch effect refers to the discrepancy in gene expression data series measured under different conditions. While the data from the same batch (measurements performed under the same conditions) are compatible, combining various batches into one data set is problematic because of incompatible measurements. Therefore, it is necessary to correct the combined data (normalization) before performing biological analysis. There are numerous methods attempting to correct data sets for batch effect. These methods rely on various assumptions regarding the distribution of the measurements. Forcing the data elements into a pre-supposed distribution can severely distort biological signals, thus leading to incorrect results and conclusions. The wider the discrepancy between the assumptions regarding the data distribution and the actual distribution, the greater the biases introduced by such “correction methods”. We introduce a heuristic method to reduce batch effect. The method does not rely on any assumptions regarding the distribution and the behavior of data elements. Hence, it does not introduce any new biases in the process of correcting the batch effect. It strictly maintains the integrity of measurements within the original batches.
5
Iravani S, Conrad TOF. An Interpretable Deep Learning Approach for Biomarker Detection in LC-MS Proteomics Data. IEEE/ACM Trans Comput Biol Bioinform 2023; 20:151-161. [PMID: 35007196 DOI: 10.1109/tcbb.2022.3141656] [Indexed: 05/04/2023]
Abstract
Analyzing mass spectrometry-based proteomics data with deep learning (DL) approaches poses several challenges due to the high dimensionality, low sample size, and high level of noise. Additionally, the integration of DL-based workflows into medical settings is often hindered by the lack of interpretable explanations. We present DLearnMS, a DL biomarker detection framework, to address these challenges on proteomics instances of liquid chromatography-mass spectrometry (LC-MS), a well-established tool for quantifying complex protein mixtures. Our DLearnMS framework learns the clinical state of LC-MS data instances using convolutional neural networks. Based on the trained neural networks, we show how biomarkers can be identified using layer-wise relevance propagation. This enables the detection of discriminating regions of the data and the design of more robust networks. One of the main advantages over other established methods is that no explicit preprocessing step is needed in our DLearnMS framework. Our evaluation shows that DLearnMS outperforms conventional LC-MS biomarker detection approaches in identifying fewer false positive peaks while maintaining a comparable number of true positive peaks. Code availability: The code is available from the following Git repository: https://github.com/SaharIravani/DlearnMS.
6
Wang W, Yuan H, Han J, Liu W. PCLassoLog: A protein complex-based, group Lasso-logistic model for cancer classification and risk protein complex discovery. Comput Struct Biotechnol J 2022; 21:365-377. [PMID: 36582441 PMCID: PMC9791601 DOI: 10.1016/j.csbj.2022.12.005] [Received: 09/10/2022] [Revised: 12/02/2022] [Accepted: 12/03/2022] [Indexed: 12/12/2022]
Abstract
Risk gene identification has attracted much attention in the past two decades. Since most genes need to be translated into proteins and cooperate with other proteins to form protein complexes to carry out cellular functions, which significantly extends the functional diversity of individual proteins, revealing the molecular mechanism of cancer from a comprehensive perspective needs to shift from identifying individual risk genes toward identifying risk protein complexes. Here, we embed protein complexes into the regularized learning framework and propose a protein complex-based, group Lasso-logistic model (PCLassoLog) to discover risk protein complexes. Experiments on deep proteomic data of two cancer types show that PCLassoLog yields superior predictive performance on independent datasets. More importantly, PCLassoLog identifies risk protein complexes that not only contain individual risk proteins but also incorporate close partners that synergize with them. Furthermore, selection probabilities are calculated and two other protein complex-based models are proposed to complement PCLassoLog in identifying reliable risk protein complexes. Based on PCLassoLog, a pan-cancer analysis is performed to identify risk protein complexes in 12 cancer types. Finally, PCLassoLog is used to discover risk protein complexes associated with gene mutation. We implement all protein complex-based models as an R package PCLassoReg, which may serve as an effective tool to discover risk protein complexes in various contexts.
Affiliation(s)
- Wei Wang
  - College of Science, Heilongjiang Institute of Technology, Harbin 150050, China
- Haiyan Yuan
  - College of Science, Heilongjiang Institute of Technology, Harbin 150050, China
- Junwei Han (corresponding author)
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
- Wei Liu (corresponding author)
  - College of Science, Heilongjiang Institute of Technology, Harbin 150050, China
7
Phua SX, Lim KP, Goh WWB. Perspectives for better batch effect correction in mass-spectrometry-based proteomics. Comput Struct Biotechnol J 2022; 20:4369-4375. [PMID: 36051874 PMCID: PMC9411064 DOI: 10.1016/j.csbj.2022.08.022] [Received: 04/14/2022] [Revised: 08/09/2022] [Accepted: 08/09/2022] [Indexed: 11/08/2022]
Abstract
Mass-spectrometry-based proteomics presents some unique challenges for batch effect correction. Batch effects are technical sources of variation that can confound analysis and are usually non-biological in nature. As proteomic analysis involves several stages of data transformation from spectra to protein, the decision on when and to what batch correction should be applied is often unclear. Here, we explore several issues pertinent to batch effect correction. The first involves applications of batch effect correction that require prior knowledge of batch factors, and exploring data to uncover new or unknown batch factors. The second considers recent literature suggesting that there is no single best batch effect correction algorithm; i.e., instead of asking for a best approach, one may ask what a suitable approach is. The third considers issues of batch effect detection. Finally, we look at potential developments for proteomic-specific batch effect correction methods and how to do better functional evaluations on batch-corrected data.
Affiliation(s)
- Ser-Xian Phua
  - Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore
  - School of Biological Sciences, Nanyang Technological University, Singapore
- Kai-Peng Lim
  - Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore
  - School of Biological Sciences, Nanyang Technological University, Singapore
- Wilson Wen-Bin Goh
  - Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore
  - School of Biological Sciences, Nanyang Technological University, Singapore
  - Center for Biomedical Informatics, Nanyang Technological University, Singapore
8
Narayana JK, Mac Aogáin M, Goh WWB, Xia K, Tsaneva-Atanasova K, Chotirmall SH. Mathematical-based microbiome analytics for clinical translation. Comput Struct Biotechnol J 2021; 19:6272-6281. [PMID: 34900137 PMCID: PMC8637001 DOI: 10.1016/j.csbj.2021.11.029] [Received: 09/25/2021] [Revised: 11/17/2021] [Accepted: 11/17/2021] [Indexed: 12/20/2022]
Abstract
Traditionally, human microbiology has been strongly built on the laboratory-focused culture of microbes isolated from human specimens in patients with acute or chronic infection. These approaches primarily view human disease through the lens of a single species and its relevant clinical setting; however, such approaches fail to account for the surrounding environment and the wide microbial diversity that exists in vivo. Given the emergence of next-generation sequencing technologies and advancing bioinformatic pipelines, researchers now have unprecedented capabilities to characterise the human microbiome in terms of its taxonomy, function, antibiotic resistance and even bacteriophages. Despite this, the analysis of microbial communities has largely been restricted to ordination, ecological measures, and discriminant taxa analysis. This is predominantly due to a lack of suitable computational tools to facilitate microbiome analytics. In this review, we first evaluate the key concerns related to the inherent structure of microbiome datasets, which include compositionality and batch effects. We describe the available and emerging analytical techniques including integrative analysis, machine learning, microbial association networks, topological data analysis (TDA) and mathematical modelling. We also present how these methods may translate to clinical settings, including tools for implementation. Mathematical-based analytics for microbiome analysis represents a promising avenue for clinical translation across a range of acute and chronic disease states.
Affiliation(s)
- Jayanth Kumar Narayana
  - Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore
- Micheál Mac Aogáin
  - Biochemical Genetics Laboratory, Department of Biochemistry, St. James’s Hospital, Dublin, Ireland
  - Clinical Biochemistry Unit, School of Medicine, Trinity College Dublin, Dublin, Ireland
- Wilson Wen Bin Goh
  - Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore
  - School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
- Kelin Xia
  - Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore, Singapore
- Krasimira Tsaneva-Atanasova
  - Department of Mathematics & Living Systems Institute, College of Engineering, Mathematics and Physical Sciences, University of Exeter, Exeter EX4 4QF, UK
- Sanjay H. Chotirmall
  - Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore
  - Department of Respiratory and Critical Care Medicine, Tan Tock Seng Hospital, Singapore
9
Park SY, Egan S, Cura AJ, Aron KL, Xu X, Zheng M, Borys M, Ghose S, Li Z, Lee K. Untargeted proteomics reveals upregulation of stress response pathways during CHO-based monoclonal antibody manufacturing process leading to disulfide bond reduction. MAbs 2021; 13:1963094. [PMID: 34424810 PMCID: PMC8386704 DOI: 10.1080/19420862.2021.1963094] [Indexed: 11/30/2022]
Abstract
Monoclonal antibody (mAb) interchain disulfide bond reduction can cause a loss of function and negatively impact the therapeutic’s efficacy and safety. Disulfide bond reduction has been observed at various stages during the manufacturing process, including processing of the harvested material. The factors and mechanisms driving this phenomenon are not fully understood. In this study, we examined the host cell proteome as a potential factor affecting the susceptibility of a mAb to disulfide bond reduction in the harvested cell culture fluid (HCCF). We used untargeted liquid-chromatography-mass spectrometry-based proteomics experiments in conjunction with a semi-automated protein identification workflow to systematically compare Chinese hamster ovary (CHO) cell protein abundances between bioreactor conditions that result in reduction-susceptible and reduction-free HCCF. Although the growth profiles and antibody titers of these two bioreactor conditions were indistinguishable, we observed broad differences in host cell protein (HCP) expression. We found significant differences in the abundance of glycolytic enzymes, key protein reductases, and antioxidant defense enzymes. Multivariate analysis of the proteomics data determined that upregulation of stress-inducible endoplasmic reticulum (ER) and other chaperone proteins is a discriminatory characteristic of reduction-susceptible HCP profiles. Overall, these results suggest that stress response pathways activated during bioreactor culture increase the reduction-susceptibility of HCCF. Consequently, these pathways could be valuable targets for optimizing culture conditions to improve protein quality.
Affiliation(s)
- Seo-Young Park
  - Department of Chemical and Biological Engineering, Tufts University, Medford, MA, USA
  - School of Chemical Engineering, Sungkyunkwan University, Suwon, Republic of Korea
- Susan Egan
  - Biologics Development, Global Product Development and Supply, Bristol-Myers Squibb, Devens, USA
- Anthony J Cura
  - Biologics Development, Global Product Development and Supply, Bristol-Myers Squibb, Devens, USA
- Kathryn L Aron
  - Biologics Development, Global Product Development and Supply, Bristol-Myers Squibb, Devens, USA
- Xuankuo Xu
  - Biologics Development, Global Product Development and Supply, Bristol-Myers Squibb, Devens, USA
- Mengyuan Zheng
  - Biologics Development, Global Product Development and Supply, Bristol-Myers Squibb, Devens, USA
- Michael Borys
  - Biologics Development, Global Product Development and Supply, Bristol-Myers Squibb, Devens, USA
- Sanchayita Ghose
  - Biologics Development, Global Product Development and Supply, Bristol-Myers Squibb, Devens, USA
- Zhengjian Li
  - Biologics Development, Global Product Development and Supply, Bristol-Myers Squibb, Devens, USA
- Kyongbum Lee
  - Department of Chemical and Biological Engineering, Tufts University, Medford, MA, USA
10
Wang W, Liu W. PCLasso: a protein complex-based, group lasso-Cox model for accurate prognosis and risk protein complex discovery. Brief Bioinform 2021; 22:6291946. [PMID: 34086850 DOI: 10.1093/bib/bbab212] [Received: 02/12/2021] [Revised: 05/08/2021] [Accepted: 05/15/2021] [Indexed: 12/12/2022]
Abstract
For high-dimensional expression data, most prognostic models perform feature selection based on individual genes, which usually leads to unstable prognosis, and the identified risk genes are inherently insufficient for revealing complex molecular mechanisms. Since most genes carry out cellular functions by forming protein complexes, the basic representatives of functional modules, identifying risk protein complexes may greatly improve our understanding of disease biology. Coupled with the fact that protein complexes have been shown to have innate resistance to batch effects and are effective predictors of disease phenotypes, constructing prognostic models and selecting features with protein complexes as the basic unit should improve the robustness and biological interpretability of the model. Here, we propose a protein complex-based, group lasso-Cox model (PCLasso) to predict patient prognosis and identify risk protein complexes. Experiments on three cancer types show that PCLasso has better prognostic performance than prognostic models based on individual genes. The resulting risk protein complexes not only contain individual risk genes but also incorporate close partners that synergize with them, which may help reveal molecular mechanisms related to cancer progression from a comprehensive perspective. Furthermore, a pan-cancer prognostic analysis was performed to identify risk protein complexes of 19 cancer types, which may provide novel potential targets for cancer research.
Affiliation(s)
- Wei Wang
  - Heilongjiang Institute of Technology, Harbin 150050, China
- Wei Liu
  - School of Science at Heilongjiang Institute of Technology, Harbin 150050, China
11
Zindler T, Frieling H, Neyazi A, Bleich S, Friedel E. Simulating ComBat: how batch correction can lead to the systematic introduction of false positive results in DNA methylation microarray studies. BMC Bioinformatics 2020; 21:271. [PMID: 32605541 PMCID: PMC7328269 DOI: 10.1186/s12859-020-03559-6] [Received: 01/24/2020] [Accepted: 05/26/2020] [Indexed: 12/04/2022]
Abstract
Background: Systematic technical effects (also called batch effects) are a considerable challenge when analyzing DNA methylation (DNAm) microarray data, because they can lead to false results when confounded with the variable of interest. Methods to correct these batch effects are error-prone, as previous findings have shown. Results: Here, we demonstrate how using the R function ComBat to correct simulated Infinium HumanMethylation450 BeadChip (450K) and Infinium MethylationEPIC BeadChip Kit (EPIC) DNAm data can lead to a large number of false positive results under certain conditions. We further provide a detailed assessment of the consequences for the highly relevant problem of p-value inflation with subsequent false positive findings after application of the frequently used ComBat method. Using ComBat to correct for batch effects in randomly generated samples produced alarming numbers of false discovery rate (FDR) and Bonferroni-corrected (BF) false positive results in unbalanced as well as in balanced sample distributions, in terms of the relation between the outcome of interest and the technical position of the sample during the probe measurement. Both sample size and number of batch factors (e.g. number of chips) were systematically simulated to assess the probability of false positive findings. The effect of sample size was simulated using n = 48 up to n = 768 randomly generated samples. Increasing the number of corrected factors led to an exponential increase in the number of false positive signals. Increasing the number of samples reduced, but did not completely prevent, this effect. Conclusions: Using the approach described, we demonstrate that using ComBat for batch correction in DNAm data can lead to false positive results under certain conditions and sample distributions. Our results are thus contrary to previous publications, which consider a balanced sample distribution unproblematic when using ComBat.
We do not claim completeness in terms of reporting all technical conditions and possible solutions to the occurring problems, as we approach the problem from a clinician’s perspective and not from that of a computer scientist. With our approach of simulating data, we provide readers with a simple method to assess the probability of false positive findings in DNAm microarray data analysis pipelines.
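The simulation idea in this abstract (generate data with no true signal, run the correction, and count how many features are still called significant) can be sketched as follows. This is only an illustration under stated assumptions: the per-batch mean-centering step is a simple stand-in and is not ComBat, the p-values use a normal approximation rather than an exact test, and the function names, sample size, feature count, and number of batches are all hypothetical choices.

```python
import math

import numpy as np


def center_by_batch(data, batches):
    """Stand-in correction: subtract each batch's per-feature mean (not ComBat)."""
    corrected = data.copy()
    for batch in np.unique(batches):
        in_batch = batches == batch
        corrected[in_batch] -= data[in_batch].mean(axis=0)
    return corrected


def null_false_positive_rate(n_samples=192, n_features=2000, n_batches=4,
                             alpha=0.05, correct=center_by_batch, seed=0):
    """Fraction of pure-noise features called significant after correction."""
    rng = np.random.default_rng(seed)
    data = rng.normal(size=(n_samples, n_features))   # no true signal anywhere
    groups = rng.permutation(n_samples) % 2           # random case/control labels
    batches = np.arange(n_samples) % n_batches        # e.g. the chip of each sample

    corrected = correct(data, batches)
    a, b = corrected[groups == 0], corrected[groups == 1]

    # Welch-type statistic per feature; the two-sided p-value comes from a
    # normal approximation, which is reasonable only because both groups are large.
    standard_error = np.sqrt(a.var(axis=0, ddof=1) / len(a) +
                             b.var(axis=0, ddof=1) / len(b))
    z = (a.mean(axis=0) - b.mean(axis=0)) / standard_error
    p_values = np.array([math.erfc(abs(value) / math.sqrt(2)) for value in z])
    return float((p_values < alpha).mean())


rate = null_false_positive_rate()
```

With no signal in the data, the returned rate should sit near `alpha` for a well-behaved pipeline; a rate far above `alpha` would point to p-value inflation introduced by the correction step. To examine a real pipeline, one would pass its own correction routine as the `correct` argument in place of the stand-in.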
Affiliation(s)
- Tristan Zindler
  - Department of Psychiatry, Social Psychiatry and Psychotherapy, Hannover Medical School, Hannover, Germany
- Helge Frieling
  - Department of Psychiatry, Social Psychiatry and Psychotherapy, Hannover Medical School, Hannover, Germany
- Alexandra Neyazi
  - Department of Psychiatry, Social Psychiatry and Psychotherapy, Hannover Medical School, Hannover, Germany
- Stefan Bleich
  - Department of Psychiatry, Social Psychiatry and Psychotherapy, Hannover Medical School, Hannover, Germany
- Eva Friedel
  - Department of Psychiatry and Psychotherapy, Charité Campus Mitte (CCM), Charité-Universitätsmedizin Berlin, Berlin, Germany
  - Berlin Institute of Health (BIH), 10178, Berlin, Germany
12
Goh WWB, Wong L. The Birth of Bio-data Science: Trends, Expectations, and Applications. Genomics Proteomics Bioinformatics 2020; 18:5-15. [PMID: 32428604 PMCID: PMC7393550 DOI: 10.1016/j.gpb.2020.01.002] [Received: 07/31/2019] [Revised: 12/02/2019] [Accepted: 02/26/2020] [Indexed: 12/23/2022]
Affiliation(s)
- Wilson Wen Bin Goh
  - School of Biological Sciences, Nanyang Technological University, Singapore 637551, Singapore
- Limsoon Wong
  - Department of Computer Science, National University of Singapore, Singapore 117417, Singapore
13
Zhou L, Chi-Hau Sue A, Bin Goh WW. Examining the practical limits of batch effect-correction algorithms: When should you care about batch effects? J Genet Genomics 2019; 46:433-443. [PMID: 31611172 DOI: 10.1016/j.jgg.2019.08.002] [Received: 05/11/2019] [Revised: 08/02/2019] [Accepted: 08/04/2019] [Indexed: 12/20/2022]
Abstract
Batch effects are technical sources of variation and can confound analysis. While many performance ranking exercises have been conducted to establish the best batch effect-correction algorithm (BECA), we hold the viewpoint that the notion of best is context-dependent. Moreover, alternative questions beyond the simplistic notion of "best" are also interesting: are BECAs robust against various degrees of confounding and if so, what is the limit? Using two different methods for simulating class (phenotype) and batch effects and taking various representative datasets across both genomics (RNA-Seq) and proteomics platforms, we demonstrate that under situations where sample classes and batch factors are moderately confounded, most BECAs are remarkably robust and only weakly affected by upstream normalization procedures. This observation is consistently supported across the multitude of test datasets. BECAs do have limits: When sample classes and batch factors are strongly confounded, BECA performance declines, with variable performance in precision, recall and also batch correction. We also report that while conventional normalization methods have minimal impact on batch effect correction, they do not affect downstream statistical feature selection, and in strongly confounded scenarios, may even outperform BECAs. In other words, removing batch effects is no guarantee of optimal functional analysis. Overall, this study suggests that simplistic performance ranking exercises are quite trivial, and all BECAs are compromises in some context or another.
Affiliation(s)
- Longjian Zhou
  - School of Pharmaceutical Science and Technology, Tianjin University, Tianjin 300072, China
- Andrew Chi-Hau Sue
  - School of Pharmaceutical Science and Technology, Tianjin University, Tianjin 300072, China
- Wilson Wen Bin Goh
  - School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore 637551, Singapore
|
14
|
Goh WWB, Wong L. Advanced bioinformatics methods for practical applications in proteomics. Brief Bioinform 2019; 20:347-355. [PMID: 30657890 DOI: 10.1093/bib/bbx128] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Received: 07/14/2017] [Indexed: 12/22/2022] Open
Abstract
Mass spectrometry (MS)-based proteomics has undergone rapid advancements in recent years, creating challenging problems for bioinformatics. We focus on four aspects where bioinformatics plays a crucial role (and proteomics is needed for clinical application): peptide-spectra matching (PSM) based on the new data-independent acquisition (DIA) paradigm, resolving missing proteins (MPs), dealing with biological and technical heterogeneity in data and statistical feature selection (SFS). DIA is a brute-force strategy that provides greater width and depth but, because it indiscriminately captures spectra such that signal from multiple peptides is mixed, getting good PSMs is difficult. We consider two strategies: simplification of DIA spectra to pseudo-data-dependent acquisition spectra or, alternatively, brute-force search of each DIA spectrum against known reference libraries. The MP problem arises when proteins are never (or inconsistently) detected by MS. When observed in at least one sample, imputation methods can be used to guess the approximate protein expression level. If never observed at all, network/protein complex-based contextualization provides an independent prediction platform. Data heterogeneity is a difficult problem with two dimensions: technical (batch effects), which should be removed, and biological (including demography and disease subpopulations), which should be retained. Simple normalization is seldom sufficient, while batch effect-correction algorithms may create errors. Batch effect-resistant normalization methods are a viable alternative. Finally, SFS is vital for practical applications. While many methods exist, there is no best method, and both upstream (e.g. normalization) and downstream processing (e.g. multiple-testing correction) are performance confounders. We also discuss signal detection when class effects are weak.
|
15
|
Goh WWB, Sng JCG, Yee JY, See YM, Lee TS, Wong L, Lee J. Can Peripheral Blood-Derived Gene Expressions Characterize Individuals at Ultra-high Risk for Psychosis? Computational Psychiatry (Cambridge, Mass.) 2017; 1:168-183. [PMID: 30090857 PMCID: PMC6067827 DOI: 10.1162/cpsy_a_00007] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Received: 03/20/2017] [Accepted: 06/07/2017] [Indexed: 12/17/2022]
Abstract
The ultra-high risk (UHR) state was originally conceived to identify individuals at imminent risk of developing psychosis. Although recent studies have suggested that most individuals designated UHR do not, they constitute a distinctive group, exhibiting cognitive and functional impairments alongside multiple psychiatric morbidities. UHR characterization using molecular markers may improve understanding, provide novel insight into pathophysiology, and perhaps improve psychosis prediction reliability. Whole-blood gene expressions from 56 UHR subjects and 28 healthy controls are checked for existence of a consistent gene expression profile (signature) underlying UHR, across a variety of normalization and heterogeneity-removal techniques, including simple log-conversion, quantile normalization, gene fuzzy scoring (GFS), and surrogate variable analysis. During functional analysis, consistent and reproducible identification of important genes depends largely on how data are normalized. Normalization techniques that address sample heterogeneity are superior. The best performer, the unsupervised GFS, produced a strong and concise 12-gene signature, enriched for psychosis-associated genes. Importantly, when applied on random subsets of data, classifiers built with GFS are "meaningful" in the sense that the classifier models built using genes selected after other forms of normalization do not outperform random ones, but GFS-derived classifiers do. Data normalization can present highly disparate interpretations on biological data. Comparative analysis has shown that GFS is efficient at preserving signals while eliminating noise. Using this, we demonstrate confidently that the UHR designation is well correlated with a distinct blood-based gene signature.
Affiliation(s)
- Wilson Wen Bin Goh
  - School of Biological Sciences, Nanyang Technological University, Singapore
  - Department of Computer Science, National University of Singapore, Singapore
- Judy Chia-Ghee Sng
  - Department of Pharmacology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Jie Yin Yee
  - Research Division, Institute of Mental Health, Singapore
- Yuen Mei See
  - Research Division, Institute of Mental Health, Singapore
- Tih-Shih Lee
  - Neuroscience and Behavioral Disorders Program, Duke–National University of Singapore Medical School, Singapore
- Limsoon Wong
  - Department of Computer Science, National University of Singapore, Singapore
  - Department of Pathology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Jimmy Lee
  - Research Division, Institute of Mental Health, Singapore
  - Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore
|
16
|
Abstract
Protein complex-based feature selection (PCBFS) provides unparalleled reproducibility with high phenotypic relevance on proteomics data. Currently, there are five PCBFS paradigms, but not all representative methods have been implemented or made readily available. To allow general users to take advantage of these methods, we developed the R-package NetProt, which provides implementations of representative feature-selection methods. NetProt also provides methods for generating simulated differential data and generating pseudocomplexes for complex-based performance benchmarking. The NetProt open source R package is available for download from https://github.com/gohwils/NetProt/releases/ , and online documentation is available at http://rpubs.com/gohwils/204259 .
Affiliation(s)
- Wilson Wen Bin Goh
  - School of Pharmaceutical Science and Technology, Tianjin University, 92 Weijin Road, Tianjin 300072, China
  - School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore 637551
  - Department of Computer Science, National University of Singapore, 13 Computing Drive, Singapore 117417
- Limsoon Wong
  - Department of Computer Science, National University of Singapore, 13 Computing Drive, Singapore 117417
  - Department of Pathology, National University of Singapore, 5 Lower Kent Ridge Road, Singapore 119074
|
17
|
Goh WWB, Wong L. Class-paired Fuzzy SubNETs: A paired variant of the rank-based network analysis family for feature selection based on protein complexes. Proteomics 2017; 17:e1700093. [PMID: 28390171 DOI: 10.1002/pmic.201700093] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 03/09/2017] [Accepted: 04/05/2017] [Indexed: 01/12/2023]
Abstract
Identifying reproducible yet relevant protein features in proteomics data is a major challenge. Analysis at the level of protein complexes can resolve this issue and we have developed a suite of feature-selection methods collectively referred to as Rank-Based Network Analysis (RBNA). RBNAs differ in their individual statistical test setup but are similar in the sense that they deploy rank-defined weights among proteins per sample. This procedure is known as gene fuzzy scoring. Currently, no RBNA exists for paired-sample scenarios where both control and test tissues originate from the same source (e.g. same patient). It is expected that paired tests, when used appropriately, are more powerful than approaches intended for unpaired samples. We report that the class-paired RBNA, PPFSNET, dominates in both simulated and real data scenarios. Moreover, for the first time, we explicitly incorporate batch-effect resistance as an additional evaluation criterion for feature-selection approaches. Batch effects are class irrelevant variations arising from different handlers or processing times, and can obfuscate analysis. We demonstrate that PPFSNET and an earlier RBNA, PFSNET, are particularly resistant against batch effects, and only select features strongly correlated with class but not batch.
Affiliation(s)
- Wilson Wen Bin Goh
  - School of Pharmaceutical Science and Technology, Tianjin University, P. R. China
  - Department of Computer Science, National University of Singapore, Singapore
- Limsoon Wong
  - Department of Computer Science, National University of Singapore, Singapore
  - Department of Pathology, National University of Singapore, Singapore
|
18
|
Goh WWB, Wang W, Wong L. Why Batch Effects Matter in Omics Data, and How to Avoid Them. Trends Biotechnol 2017; 35:498-507. [PMID: 28351613 DOI: 10.1016/j.tibtech.2017.02.012] [Citation(s) in RCA: 230] [Impact Index Per Article: 28.8] [Received: 12/18/2016] [Revised: 02/16/2017] [Accepted: 02/28/2017] [Indexed: 12/23/2022]
Abstract
Effective integration and analysis of new high-throughput data, especially gene-expression and proteomic-profiling data, are expected to deliver novel clinical insights and therapeutic options. Unfortunately, technical heterogeneity or batch effects (different experiment times, handlers, reagent lots, etc.) have proven challenging. Although batch effect-correction algorithms (BECAs) exist, we know little about effective batch-effect mitigation: even now, new batch effect-associated problems are emerging. These include false effects due to misapplying BECAs and positive bias during model evaluations. Depending on the choice of algorithm and experimental set-up, biological heterogeneity can be mistaken for batch effects and wrongfully removed. Here, we examine these emerging batch effect-associated problems, propose a series of best practices, and discuss some of the challenges that lie ahead.
Affiliation(s)
- Wilson Wen Bin Goh
  - School of Pharmaceutical Science and Technology, Tianjin University, Tianjin 300072, P.R. China
  - Department of Computer Science, National University of Singapore, Singapore 117417, Republic of Singapore
- Wei Wang
  - School of Pharmaceutical Science and Technology, Tianjin University, Tianjin 300072, P.R. China
- Limsoon Wong
  - Department of Computer Science, National University of Singapore, Singapore 117417, Republic of Singapore
  - Department of Pathology, National University of Singapore, Singapore 119074, Republic of Singapore
|