1
Van R, Alvarez D, Mize T, Gannavarapu S, Chintham Reddy L, Nasoz F, Han MV. A comparison of RNA-Seq data preprocessing pipelines for transcriptomic predictions across independent studies. BMC Bioinformatics 2024; 25:181. [PMID: 38720247 PMCID: PMC11080237 DOI: 10.1186/s12859-024-05801-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2024] [Accepted: 05/02/2024] [Indexed: 05/12/2024] Open
Abstract
BACKGROUND: RNA sequencing combined with machine learning techniques has provided a modern approach to the molecular classification of cancer. Class predictors, reflecting the disease class, can be constructed for known tissue types using the gene expression measurements extracted from cancer patients. One challenge of current cancer predictors is that they often have suboptimal performance estimates when integrating molecular datasets generated from different labs. Often, the data are of variable quality, procured differently, and contain unwanted noise that hampers the ability of a predictive model to extract useful information. Data preprocessing methods can be applied in attempts to reduce these systematic variations and harmonize the datasets before they are used to build a machine learning model for resolving tissue of origin.
RESULTS: We aimed to investigate the impact of data preprocessing steps (normalization, batch effect correction, and data scaling) through trial and comparison. Our goal was to improve the cross-study predictions of tissue of origin for common cancers on large-scale RNA-Seq datasets derived from thousands of patients and over a dozen tumor types. The results showed that the choice of data preprocessing operations affected the performance of the associated classifier models constructed for tissue of origin predictions in cancer.
CONCLUSION: By using TCGA as a training set and applying data preprocessing methods, we demonstrated that batch effect correction improved performance measured by weighted F1-score in resolving tissue of origin against an independent GTEx test dataset. On the other hand, the use of data preprocessing operations worsened classification performance when the independent test dataset was aggregated from separate studies in ICGC and GEO. Therefore, based on our findings with these publicly available large-scale RNA-Seq datasets, the application of data preprocessing techniques to a machine learning pipeline is not always appropriate.
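The weighted F1-score named in this abstract is commonly computed as the mean of per-class F1 scores, each weighted by the number of true samples in that class. The sketch below illustrates that definition only; the tissue labels are invented for illustration and are not data from the study, and the helper name `weighted_f1` is hypothetical.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (hypothetical helper)."""
    support = Counter(y_true)          # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1      # weight each class by its support
    return score

# Invented labels, for illustration only.
y_true = ["lung", "lung", "lung", "liver", "liver", "breast"]
y_pred = ["lung", "lung", "liver", "liver", "liver", "breast"]
print(round(weighted_f1(y_true, y_pred), 3))
```

Weighting by support keeps rare tumor types from dominating the score, at the cost of hiding poor performance on small classes.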
Affiliation(s)
- Richard Van
  - School of Life Sciences, University of Nevada Las Vegas, Las Vegas, NV, USA
  - Nevada Institute of Personalized Medicine, Las Vegas, NV, USA
- Daniel Alvarez
  - Department of Computer Science, University of Nevada Las Vegas, Las Vegas, NV, USA
  - Nevada Institute of Personalized Medicine, Las Vegas, NV, USA
- Travis Mize
  - Icahn School of Medicine at Mount Sinai, Institute for Genomic Health, New York City, NY, USA
- Sravani Gannavarapu
  - Department of Computer Science, University of Nevada Las Vegas, Las Vegas, NV, USA
  - Nevada Institute of Personalized Medicine, Las Vegas, NV, USA
- Lohitha Chintham Reddy
  - Department of Computer Science, University of Nevada Las Vegas, Las Vegas, NV, USA
  - Nevada Institute of Personalized Medicine, Las Vegas, NV, USA
- Fatma Nasoz
  - Department of Computer Science, University of Nevada Las Vegas, Las Vegas, NV, USA
  - Nevada Institute of Personalized Medicine, Las Vegas, NV, USA
- Mira V Han
  - School of Life Sciences, University of Nevada Las Vegas, Las Vegas, NV, USA
  - Nevada Institute of Personalized Medicine, Las Vegas, NV, USA
2
Rabaglino MB, Sánchez JM, McDonald M, O’Callaghan E, Lonergan P. Maternal blood transcriptome as a sensor of fetal organ maturation at the end of organogenesis in cattle. Biol Reprod 2023; 109:749-758. [PMID: 37658765 PMCID: PMC10651065 DOI: 10.1093/biolre/ioad103] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/18/2023] [Revised: 07/25/2023] [Accepted: 08/31/2023] [Indexed: 09/05/2023] Open
Abstract
Harnessing information from the maternal blood to predict fetal growth is attractive yet scarcely explored in livestock. The objectives were to determine the transcriptomic modifications in maternal blood and fetal liver, gonads, and heart according to fetal weight and to model a molecular signature based on the fetal organs allowing the prediction of fetal weight from the maternal blood transcriptome in cattle. In addition to a contemporaneous maternal blood sample, organ samples were collected from 10 male fetuses at 42 days of gestation for RNA-sequencing. Fetal weight ranged from 1.25 to 1.69 g (mean = 1.44 ± 0.15 g). Clustering data analysis revealed clusters of co-expressed genes positively correlated with fetal weight and enriching ontological terms biologically relevant for the organ. For the heart, the 1346 co-expressed genes were involved in energy generation and protein synthesis. For the gonads, the 1042 co-expressed genes enriched seminiferous tubule development. The 459 co-expressed genes identified in the liver were associated with lipid synthesis and metabolism. Finally, the cluster of 571 co-expressed genes determined in maternal blood enriched oxidative phosphorylation and thermogenesis. Next, data from the fetal organs were used to train a regression model of fetal weight, which was predicted with the maternal blood data. The best prediction was achieved when the model was trained with 35 co-expressed genes overlapping between heart and maternal blood (root-mean-square error = 0.04, R² = 0.93). In conclusion, linking transcriptomic information from maternal blood with that from the fetal heart unveiled maternal blood as a predictor of fetal development.
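The two regression metrics reported in this abstract can be illustrated with a short sketch. It assumes that the R² value refers to the coefficient of determination, which the abstract does not state explicitly (a squared correlation is another common reading). The numbers are invented for illustration and are not the study's fetal weights, and the function names are hypothetical.

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Invented values, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 5.0]
print(rmse(observed, predicted), r_squared(observed, predicted))
```

RMSE is expressed in the units of the outcome, whereas R² is unitless, so the two are usually read together.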
Affiliation(s)
- Maria Belen Rabaglino
  - School of Agriculture and Food Science, University College Dublin, Belfield, Dublin 4, Ireland
- José María Sánchez
  - Departamento de Reproducción Animal, Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria, Madrid, Spain
- Michael McDonald
  - School of Agriculture and Food Science, University College Dublin, Belfield, Dublin 4, Ireland
- Elena O’Callaghan
  - School of Agriculture and Food Science, University College Dublin, Belfield, Dublin 4, Ireland
- Pat Lonergan
  - School of Agriculture and Food Science, University College Dublin, Belfield, Dublin 4, Ireland
3
Pogosova-Agadjanyan EL, Hua X, Othus M, Appelbaum FR, Chauncey TR, Erba HP, Fitzgibbon MP, Jenkins IC, Fang M, Lee SC, Moseley A, Naru J, Radich JP, Smith JL, Willborg BE, Willman CL, Wu F, Meshinchi S, Stirewalt DL. Verification of prognostic expression biomarkers is improved by examining enriched leukemic blasts rather than mononuclear cells from acute myeloid leukemia patients. Biomark Res 2023; 11:31. [PMID: 36927800 PMCID: PMC10022072 DOI: 10.1186/s40364-023-00461-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2022] [Accepted: 01/30/2023] [Indexed: 03/18/2023] Open
Abstract
BACKGROUND: Studies have not systematically compared the ability to verify performance of prognostic transcripts in paired bulk mononuclear cells versus viable CD34-expressing leukemic blasts from patients with acute myeloid leukemia. We hypothesized that examining the homogenous leukemic blasts will yield different biological information and may improve prognostic performance of expression biomarkers.
METHODS: To assess the impact of cellular heterogeneity on expression biomarkers in acute myeloid leukemia, we systematically examined paired mononuclear cells and viable CD34-expressing leukemic blasts from SWOG diagnostic specimens. After enrichment, patients were assigned into discovery and validation cohorts based on availability of extracted RNA. Analyses of RNA sequencing data examined how enrichment impacted differentially expressed genes associated with pre-analytic variables, patient characteristics, and clinical outcomes.
RESULTS: Blast enrichment yielded significantly different expression profiles and biological pathways associated with clinical characteristics (e.g., cytogenetics). Although numerous differentially expressed genes were associated with clinical outcomes, most lost their prognostic significance in the mononuclear cells and blasts after adjusting for age and ELN risk, with only 11 genes remaining significant for overall survival in both cell populations (CEP70, COMMD7, DNMT3B, ECE1, LNX2, NEGR1, PIK3C2B, SEMA4D, SMAD2, TAF8, ZNF444). To examine the impact of enrichment on biomarker verification, these 11 candidate biomarkers were examined by quantitative RT/PCR in the validation cohort. After adjusting for ELN risk and age, expression of 4 genes (CEP70, DNMT3B, ECE1, and PIK3CB) remained significantly associated with overall survival in the blasts, while none met statistical significance in mononuclear cells.
CONCLUSIONS: This study provides insights into biological information gained/lost by examining viable CD34-expressing leukemic blasts versus mononuclear cells from the same patient and shows an improved verification rate for expression biomarkers in blasts.
Affiliation(s)
- Era L Pogosova-Agadjanyan
  - Clinical Research Division, Fred Hutchinson Cancer Center, 1100 Fairview Ave N, D5-112, Seattle, WA, 98109, USA
- Xing Hua
  - SWOG Statistical Center, Fred Hutchinson Cancer Center, Seattle, WA, USA
- Megan Othus
  - SWOG Statistical Center, Fred Hutchinson Cancer Center, Seattle, WA, USA
- Frederick R Appelbaum
  - Clinical Research Division, Fred Hutchinson Cancer Center, 1100 Fairview Ave N, D5-112, Seattle, WA, 98109, USA
  - Departments of Oncology and Hematology, University of Washington, Seattle, WA, USA
- Thomas R Chauncey
  - Departments of Oncology and Hematology, University of Washington, Seattle, WA, USA
  - VA Puget Sound Health Care System, Seattle, WA, USA
- Isaac C Jenkins
  - Clinical Research Division, Fred Hutchinson Cancer Center, 1100 Fairview Ave N, D5-112, Seattle, WA, 98109, USA
  - Clinical Biostatistics, Fred Hutchinson Cancer Center, Seattle, WA, USA
- Min Fang
  - Clinical Research Division, Fred Hutchinson Cancer Center, 1100 Fairview Ave N, D5-112, Seattle, WA, 98109, USA
- Stanley C Lee
  - Clinical Research Division, Fred Hutchinson Cancer Center, 1100 Fairview Ave N, D5-112, Seattle, WA, 98109, USA
- Anna Moseley
  - SWOG Statistical Center, Fred Hutchinson Cancer Center, Seattle, WA, USA
- Jasmine Naru
  - Clinical Research Division, Fred Hutchinson Cancer Center, 1100 Fairview Ave N, D5-112, Seattle, WA, 98109, USA
- Jerald P Radich
  - Clinical Research Division, Fred Hutchinson Cancer Center, 1100 Fairview Ave N, D5-112, Seattle, WA, 98109, USA
  - Departments of Oncology and Hematology, University of Washington, Seattle, WA, USA
- Jenny L Smith
  - Clinical Research Division, Fred Hutchinson Cancer Center, 1100 Fairview Ave N, D5-112, Seattle, WA, 98109, USA
  - Department of Pediatrics, University of Washington, Seattle, WA, USA
- Brooke E Willborg
  - Clinical Research Division, Fred Hutchinson Cancer Center, 1100 Fairview Ave N, D5-112, Seattle, WA, 98109, USA
- Cheryl L Willman
  - Department of Laboratory Medicine and Pathology, Mayo Clinic Comprehensive Cancer Center, Rochester, MN, USA
- Feinan Wu
  - Bioinformatics Shared Resource, Fred Hutchinson Cancer Center, Seattle, WA, USA
- Soheil Meshinchi
  - Clinical Research Division, Fred Hutchinson Cancer Center, 1100 Fairview Ave N, D5-112, Seattle, WA, 98109, USA
  - Department of Pediatrics, University of Washington, Seattle, WA, USA
- Derek L Stirewalt
  - Clinical Research Division, Fred Hutchinson Cancer Center, 1100 Fairview Ave N, D5-112, Seattle, WA, 98109, USA
  - Departments of Oncology and Hematology, University of Washington, Seattle, WA, USA
4
Rabaglino MB, Salilew-Wondim D, Zolini A, Tesfaye D, Hoelker M, Lonergan P, Hansen PJ. Machine-learning methods applied to integrated transcriptomic data from bovine blastocysts and elongating conceptuses to identify genes predictive of embryonic competence. FASEB J 2023; 37:e22809. [PMID: 36753406 DOI: 10.1096/fj.202201977r] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Received: 11/27/2022] [Revised: 01/13/2023] [Accepted: 01/26/2023] [Indexed: 02/09/2023]
Abstract
Early pregnancy loss markedly impacts reproductive efficiency in cattle. The objectives were to model a biologically relevant gene signature predicting embryonic competence for survival after integrating transcriptomic data from blastocysts and elongating conceptuses with different developmental capacities and to validate the potential biomarkers with independent embryonic data sets through the application of machine-learning algorithms. First, two data sets from in vivo-produced blastocysts competent or not to sustain a pregnancy were integrated with a data set from long and short day-15 conceptuses. A statistical contrast determined differentially expressed genes (DEG) increasing in expression from a competent blastocyst to a long conceptus and vice versa; these were enriched for KEGG pathways related to glycolysis/gluconeogenesis and RNA processing, respectively. Next, the most discriminative DEG between blastocysts that resulted or did not in pregnancy were selected by linear discriminant analysis. These eight putative biomarker genes were validated by modeling their expression in competent or noncompetent blastocysts through Bayesian logistic regression or neural networks and predicting embryo developmental fate in four external data sets consisting of in vitro-produced blastocysts (i) competent or not, or (ii) exposed or not to detrimental conditions during culture, and elongated conceptuses (iii) of different length, or (iv) developed in the uteri of high- or subfertile heifers. Predictions for each data set were more than 85% accurate, suggesting that these genes play a key role in embryo development and pregnancy establishment. In conclusion, this study integrated transcriptomic data from seven independent experiments to identify a small set of genes capable of predicting embryonic competence for survival.
Affiliation(s)
- Maria Belen Rabaglino
  - School of Agriculture and Food Science, University College Dublin, Dublin 4, Ireland
- Dessie Salilew-Wondim
  - Institute of Animal Sciences, Animal Breeding, University of Bonn, Bonn, Germany
  - Department of Animal Science, Biotechnology & Reproduction in Farm Animals, University of Goettingen, Goettingen, Germany
- Adriana Zolini
  - Department of Animal Sciences, D.H. Barron Reproductive and Perinatal Biology Research Program, and Genetics Institute, University of Florida, Gainesville, Florida, USA
- Dawit Tesfaye
  - Animal Reproduction and Biotechnology Laboratory, Department of Biomedical Sciences, Colorado State University, Fort Collins, Colorado, USA
- Michael Hoelker
  - Department of Animal Science, Biotechnology & Reproduction in Farm Animals, University of Goettingen, Goettingen, Germany
- Pat Lonergan
  - School of Agriculture and Food Science, University College Dublin, Dublin 4, Ireland
- Peter J Hansen
  - Department of Animal Sciences, D.H. Barron Reproductive and Perinatal Biology Research Program, and Genetics Institute, University of Florida, Gainesville, Florida, USA
5
Rabaglino MB, O’Doherty A, Bojsen-Møller Secher J, Lonergan P, Hyttel P, Fair T, Kadarmideen HN. Application of multi-omics data integration and machine learning approaches to identify epigenetic and transcriptomic differences between in vitro and in vivo produced bovine embryos. PLoS One 2021; 16:e0252096. [PMID: 34029343 PMCID: PMC8143403 DOI: 10.1371/journal.pone.0252096] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 01/26/2021] [Accepted: 05/09/2021] [Indexed: 01/16/2023] Open
Abstract
Pregnancy rates for in vitro produced (IVP) embryos are usually lower than for embryos produced in vivo after ovarian superovulation (MOET). This is potentially due to alterations in their trophectoderm (TE), the outermost layer in physical contact with the maternal endometrium. The main objective was to apply a multi-omics data integration approach to identify both temporally differentially expressed and differentially methylated genes (DEG and DMG), between IVP and MOET embryos, that could impact TE function. To start, four and five published transcriptomic and epigenomic datasets, respectively, were processed for data integration. Second, DEG from day 7 to days 13 and 16 and DMG from day 7 to day 17 were determined in the TE from IVP vs. MOET embryos. Third, genes that were both DE and DM were subjected to hierarchical clustering and functional enrichment analysis. Finally, findings were validated through a machine learning approach with two additional datasets from day 15 embryos. There were 1535 DEG and 6360 DMG, with 490 overlapped genes, whose expression profiles at days 13 and 16 resulted in three main clusters. Cluster 1 (188) and Cluster 2 (191) genes were down-regulated at day 13 or day 16, respectively, while Cluster 3 genes (111) were up-regulated at both days, in IVP embryos compared to MOET embryos. The top enriched terms were the KEGG pathway "focal adhesion" in Cluster 1 (FDR = 0.003), and the cellular component: "extracellular exosome" in Cluster 2 (FDR<0.0001), also enriched in Cluster 1 (FDR = 0.04). According to the machine learning approach, genes in Cluster 1 showed a similar expression pattern between IVP and less developed (short) MOET conceptuses; and between MOET and DKK1-treated (advanced) IVP conceptuses. In conclusion, these results suggest that early conceptuses derived from IVP embryos exhibit epigenomic and transcriptomic changes that later affect its elongation and focal adhesion, impairing post-transfer survival.
Affiliation(s)
- Maria B. Rabaglino
  - Quantitative Genetics, Bioinformatics and Computational Biology Group, Department of Applied Mathematics and Computer Science, Technical University of Denmark, Lyngby, Denmark
- Alan O’Doherty
  - School of Agriculture and Food Science, University College Dublin, Dublin, Ireland
- Jan Bojsen-Møller Secher
  - Department of Veterinary and Animal Sciences, University of Copenhagen, Frederiksberg C, Denmark
- Patrick Lonergan
  - School of Agriculture and Food Science, University College Dublin, Dublin, Ireland
- Poul Hyttel
  - Department of Veterinary and Animal Sciences, University of Copenhagen, Frederiksberg C, Denmark
- Trudee Fair
  - School of Agriculture and Food Science, University College Dublin, Dublin, Ireland
- Haja N. Kadarmideen
  - Quantitative Genetics, Bioinformatics and Computational Biology Group, Department of Applied Mathematics and Computer Science, Technical University of Denmark, Lyngby, Denmark
6
Machine learning approach to integrated endometrial transcriptomic datasets reveals biomarkers predicting uterine receptivity in cattle at seven days after estrous. Sci Rep 2020; 10:16981. [PMID: 33046742 PMCID: PMC7550564 DOI: 10.1038/s41598-020-72988-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 04/17/2020] [Accepted: 09/07/2020] [Indexed: 12/12/2022] Open
Abstract
The main goal was to apply machine learning (ML) methods on integrated multi-transcriptomic data, to identify endometrial genes capable of predicting uterine receptivity according to their expression patterns in the cow. Public data from five studies were re-analyzed. In all of them, endometrial samples were obtained at day 6–7 of the estrous cycle, from cows or heifers of four different European breeds, classified as pregnant (n = 26) or not (n = 26). First, gene selection was performed through supervised and unsupervised ML algorithms. Then, the predictive ability of potential key genes was evaluated through support vector machine as classifier, using the expression levels of the samples from all the breeds but one, to train the model, and the samples from that one breed, to test it. Finally, the biological meaning of the key genes was explored. Fifty genes were identified, and they could predict uterine receptivity with an overall 96.1% accuracy, despite the animal’s breed and category. Genes with higher expression in the pregnant cows were related to circadian rhythm, Wnt receptor signaling pathway, and embryonic development. This novel and robust combination of computational tools allowed the identification of a group of biologically relevant endometrial genes that could support pregnancy in the cattle.
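The evaluation scheme described here, training on samples from all breeds but one and testing on the held-out breed, is a leave-one-group-out split. The sketch below shows only the index bookkeeping for such a split, not the study's classifier or data; the breed labels are placeholders and the function name is hypothetical.

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices) for each group."""
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, test

# Placeholder breed labels, one per sample (illustrative only).
breeds = ["A", "A", "B", "C", "C", "B"]
for held_out, train, test in leave_one_group_out(breeds):
    print(held_out, train, test)
```

Because no sample from the held-out breed is seen during training, the resulting accuracy estimates how well a signature transfers to an unseen breed rather than to unseen animals of a known breed.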
7
Mazzoni G, Pedersen HS, Rabaglino MB, Hyttel P, Callesen H, Kadarmideen HN. Characterization of the endometrial transcriptome in early diestrus influencing pregnancy status in dairy cattle after transfer of in vitro-produced embryos. Physiol Genomics 2020; 52:269-279. [PMID: 32508252 DOI: 10.1152/physiolgenomics.00027.2020] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Indexed: 12/21/2022] Open
Abstract
Modifications of the endometrial transcriptome at day 7 of the estrus cycle are crucial to maintain gestation after transfer of in vitro-produced (IVP) embryos, although these changes are still largely unknown. The aim of this study was to identify genes, and their related biological mechanisms, important for pregnancy establishment based on the endometrial transcriptome of recipient lactating dairy cows that become pregnant in the subsequent estrus cycle, upon transfer of IVP embryos. Endometrial biopsies were taken from Holstein Friesian cows on day 6-8 of the estrus cycle followed by embryo transfer in the following cycle. Animals were classified retrospectively as pregnant (PR, n = 8) or nonpregnant (non-PR, n = 11) cows, according to pregnancy status at 26-47 days. Extracted mRNAs from endometrial samples were sequenced with an Illumina platform to determine differentially expressed genes (DEG) between the endometrial transcriptome from PR and non-PR cows. There were 111 DEG (false discovery rate < 0.05), which were mainly related to extracellular matrix interaction, histotroph metabolic composition, prostaglandin synthesis, transforming growth factor-β signaling as well as inflammation and leukocyte activation. Comparison of these DEG with DEG identified in two public external data sets confirmed the more fertile endometrial molecular profile of PR cows. In conclusion, this study provides insights into the key early endometrial mechanisms for pregnancy establishment, after IVP embryo transfer in dairy cows.
Affiliation(s)
- Gianluca Mazzoni
  - Department of Veterinary and Animal Sciences, Faculty of Health and Medical Sciences, University of Copenhagen, Frederiksberg, Denmark
  - Department of Health Technology, Technical University of Denmark, Kongens Lyngby, Denmark
- Maria B Rabaglino
  - Quantitative Genetics, Bioinformatics and Computational Biology Group, Department of Applied Mathematics and Computer Science, Technical University of Denmark, Kongens Lyngby, Denmark
- Poul Hyttel
  - Department of Veterinary and Animal Sciences, Faculty of Health and Medical Sciences, University of Copenhagen, Frederiksberg, Denmark
- Henrik Callesen
  - Department of Animal Science, Aarhus University, Tjele, Denmark
- Haja N Kadarmideen
  - Department of Veterinary and Animal Sciences, Faculty of Health and Medical Sciences, University of Copenhagen, Frederiksberg, Denmark
  - Quantitative Genetics, Bioinformatics and Computational Biology Group, Department of Applied Mathematics and Computer Science, Technical University of Denmark, Kongens Lyngby, Denmark
8
Samaga D, Hornung R, Braselmann H, Hess J, Zitzelsberger H, Belka C, Boulesteix AL, Unger K. Single-center versus multi-center data sets for molecular prognostic modeling: a simulation study. Radiat Oncol 2020; 15:109. [PMID: 32410693 PMCID: PMC7227093 DOI: 10.1186/s13014-020-01543-1] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 11/19/2019] [Accepted: 04/22/2020] [Indexed: 02/07/2023] Open
Abstract
Background: Prognostic models based on high-dimensional omics data generated from clinical patient samples, such as tumor tissues or biopsies, are increasingly used for prognosis of radio-therapeutic success. The model development process requires two independent discovery and validation data sets. Each of them may contain samples collected in a single center or a collection of samples from multiple centers. Multi-center data tend to be more heterogeneous than single-center data but are less affected by potential site-specific biases. Optimal use of limited data resources for discovery and validation with respect to the expected success of a study requires dispassionate, objective decision-making. In this work, we addressed the impact of the choice of single-center and multi-center data as discovery and validation data sets, and assessed how this impact depends on three data characteristics: signal strength, number of informative features and sample size.
Methods: We set up a simulation study to quantify the predictive performance of a model trained and validated on different combinations of in silico single-center and multi-center data. The standard bioinformatical analysis workflow of batch correction, feature selection and parameter estimation was emulated. For the determination of model quality, four measures were used: false discovery rate, prediction error, chance of successful validation (significant correlation of predicted and true validation data outcome) and model calibration.
Results: In agreement with literature about generalizability of signatures, prognostic models fitted to multi-center data consistently outperformed their single-center counterparts when the prediction error was the quality criterion of interest. However, for low signal strengths and small sample sizes, single-center discovery sets showed superior performance with respect to false discovery rate and chance of successful validation.
Conclusions: With regard to decision making, this simulation study underlines the importance of study aims being defined precisely a priori. Minimization of the prediction error requires multi-center discovery data, whereas single-center data are preferable with respect to false discovery rate and chance of successful validation when the expected signal or sample size is low. In contrast, the choice of validation data solely affects the quality of the estimator of the prediction error, which was more precise on multi-center validation data.
Affiliation(s)
- Daniel Samaga
  - Helmholtz Zentrum, München, Ingolstädter Landstr. 1, Neuherberg, 85764, Germany
- Roman Hornung
  - Department of Medical Information Processing, Biometry and Epidemiology, University of Munich, Marchioninistr. 15, Munich, 81377, Germany
- Herbert Braselmann
  - Helmholtz Zentrum, München, Ingolstädter Landstr. 1, Neuherberg, 85764, Germany
- Julia Hess
  - Helmholtz Zentrum, München, Ingolstädter Landstr. 1, Neuherberg, 85764, Germany
  - Clinical Cooperation Group Personalized Radiotherapy in Head and Neck Cancer, Helmholtz Zentrum München, Research Center for Environmental Health (GmbH), Munich, Ingolstädter Landstr. 1, Munich, 85764, Germany
  - Department of Radiation Oncology, University Hospital, LMU Munich, Marchioninistr. 15, Munich, 81377, Germany
- Horst Zitzelsberger
  - Helmholtz Zentrum, München, Ingolstädter Landstr. 1, Neuherberg, 85764, Germany
  - Clinical Cooperation Group Personalized Radiotherapy in Head and Neck Cancer, Helmholtz Zentrum München, Research Center for Environmental Health (GmbH), Munich, Ingolstädter Landstr. 1, Munich, 85764, Germany
  - Department of Radiation Oncology, University Hospital, LMU Munich, Marchioninistr. 15, Munich, 81377, Germany
- Claus Belka
  - Clinical Cooperation Group Personalized Radiotherapy in Head and Neck Cancer, Helmholtz Zentrum München, Research Center for Environmental Health (GmbH), Munich, Ingolstädter Landstr. 1, Munich, 85764, Germany
  - Department of Radiation Oncology, University Hospital, LMU Munich, Marchioninistr. 15, Munich, 81377, Germany
- Anne-Laure Boulesteix
  - Department of Medical Information Processing, Biometry and Epidemiology, University of Munich, Marchioninistr. 15, Munich, 81377, Germany
- Kristian Unger
  - Helmholtz Zentrum, München, Ingolstädter Landstr. 1, Neuherberg, 85764, Germany
  - Clinical Cooperation Group Personalized Radiotherapy in Head and Neck Cancer, Helmholtz Zentrum München, Research Center for Environmental Health (GmbH), Munich, Ingolstädter Landstr. 1, Munich, 85764, Germany
  - Department of Radiation Oncology, University Hospital, LMU Munich, Marchioninistr. 15, Munich, 81377, Germany
9
Scalable Prediction of Acute Myeloid Leukemia Using High-Dimensional Machine Learning and Blood Transcriptomics. iScience 2019; 23:100780. [PMID: 31918046 PMCID: PMC6992905 DOI: 10.1016/j.isci.2019.100780] [Citation(s) in RCA: 44] [Impact Index Per Article: 8.8] [Received: 08/26/2019] [Revised: 12/03/2019] [Accepted: 12/12/2019] [Indexed: 01/16/2023] Open
Abstract
Acute myeloid leukemia (AML) is a severe, mostly fatal hematopoietic malignancy. We were interested in whether transcriptomic-based machine learning could predict AML status without requiring expert input. Using 12,029 samples from 105 different studies, we present a large-scale study of machine learning-based prediction of AML in which we address key questions relating to the combination of machine learning and transcriptomics and their practical use. We find data-driven, high-dimensional approaches, in which multivariate signatures are learned directly from genome-wide data with no prior knowledge, to be accurate and robust. Importantly, these approaches are highly scalable with low marginal cost, essentially matching human expert annotation in a near-automated workflow. Our results support the notion that transcriptomics combined with machine learning could be used as part of an integrated -omics approach wherein risk prediction, differential diagnosis, and subclassification of AML are achieved by genomics while diagnosis could be assisted by transcriptomic-based machine learning.
- Study presents one of the largest transcriptomics datasets to date for AML prediction
- Effective classifiers can be obtained by high-dimensional machine learning
- Accuracy increases with dataset size
- Includes challenging scenarios such as cross-study and cross-technology
10
Gradin R, Lindstedt M, Johansson H. Batch adjustment by reference alignment (BARA): Improved prediction performance in biological test sets with batch effects. PLoS One 2019; 14:e0212669. [PMID: 30794641 PMCID: PMC6386283 DOI: 10.1371/journal.pone.0212669] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 07/05/2018] [Accepted: 02/07/2019] [Indexed: 12/15/2022] Open
Abstract
Many biological data acquisition platforms suffer from the inadvertent inclusion of biologically irrelevant variance in the analyzed data, collectively termed batch effects. Batch effects can hamper downstream analysis by lowering the power to detect biologically interesting differences and can in certain instances lead to false discoveries. They are especially troublesome in predictive modelling, where samples in training sets and test sets are often completely correlated with batches. In this article, we present BARA, a normalization method for adjusting batch effects in predictive modelling. BARA utilizes a few reference samples to adjust for batch effects in a compressed data space spanned by the training set. We evaluate BARA using a collection of publicly available datasets and three different prediction models, and compare its performance to existing methods developed for similar purposes. The results show that data normalized with BARA yield high and consistent prediction performance. Further, they suggest that BARA performs reliably regardless of which of the examined classifiers is used. We therefore conclude that BARA has great potential to facilitate the development of predictive assays where test sets and training sets are correlated with batch.
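The idea of adjusting a test batch through reference samples in a compressed space spanned by the training set can be sketched roughly as follows. This is an illustrative sketch only, not the published BARA implementation: the function name `reference_align`, its parameters, the use of a plain SVD for the compression, and the mean-shift adjustment are all assumptions made for this example.

```python
import numpy as np

def reference_align(train, test, train_ref_idx, test_ref_idx, n_components=2):
    """Shift a test batch toward the training batch using shared reference samples.

    train, test: arrays of shape (samples, features).
    train_ref_idx, test_ref_idx: row indices of the reference samples in each batch.
    All names here are hypothetical; this is not the BARA package API.
    """
    # Compressed space: top right-singular vectors of the centered training set.
    center = train.mean(axis=0)
    _, _, vt = np.linalg.svd(train - center, full_matrices=False)
    basis = vt[:n_components].T  # shape (features, n_components)

    # Project both batches into the compressed space.
    train_scores = (train - center) @ basis
    test_scores = (test - center) @ basis

    # Estimated batch shift: difference between the mean reference positions.
    shift = (train_scores[train_ref_idx].mean(axis=0)
             - test_scores[test_ref_idx].mean(axis=0))

    # Apply the shift, then map back to the original feature space.
    return (test_scores + shift) @ basis.T + center
```

Note that mapping back from fewer components than features discards any variation outside the retained subspace, so the choice of `n_components` trades noise removal against information loss.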
Affiliation(s)
- Malin Lindstedt
- Department of Immunotechnology, Lund University, Lund, Sweden
11
Yi H, Raman AT, Zhang H, Allen GI, Liu Z. Detecting hidden batch factors through data-adaptive adjustment for biological effects. Bioinformatics 2018; 34:1141-1147. [PMID: 29617963 PMCID: PMC6454417 DOI: 10.1093/bioinformatics/btx635] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 03/05/2017] [Revised: 09/05/2017] [Accepted: 10/06/2017] [Indexed: 11/13/2022]
Abstract
Motivation: Batch effects are one of the major sources of technical variation affecting measurements in high-throughput studies such as RNA sequencing. It is well established that batch effects can be caused by different experimental platforms, laboratory conditions, sample sources and personnel. These differences can confound the outcomes of interest and lead to spurious results. A critical input for batch correction algorithms is knowledge of the batch factors, which in many cases are unknown or inaccurate. Hence, the primary motivation of our paper is to detect hidden batch factors that can be used in standard techniques to accurately capture the relationship between gene expression and other modeled variables of interest.
Results: We introduce a new algorithm based on data-adaptive shrinkage and semi-Non-negative Matrix Factorization for the detection of unknown batch effects. We test our algorithm on three different datasets: (i) Sequencing Quality Control, (ii) Topotecan RNA-Seq and (iii) Single-cell RNA sequencing (scRNA-Seq) on Glioblastoma Multiforme. In all three datasets, our algorithm identified hidden batch effects better than existing batch detection algorithms. In the Topotecan study, we identified a new batch factor that was missed by the original study, leading to under-representation of differentially expressed genes. For scRNA-Seq, we demonstrated the power of our method in detecting subtle batch effects.
Availability and implementation: The DASC R package is available via Bioconductor or at https://github.com/zhanglabNKU/DASC.
Contact: zhanghan@nankai.edu.cn or zhandonl@bcm.edu.
Supplementary information: Supplementary data are available at Bioinformatics online.
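For readers unfamiliar with semi-Non-negative Matrix Factorization, a minimal sketch of the general technique is given here. It factors a matrix X as F times the transpose of G, with G constrained to be non-negative and F unconstrained, using the commonly cited multiplicative update of Ding, Li and Jordan. It is not the DASC code: the function name `semi_nmf`, its parameters, and the choice of update rule are assumptions for illustration, and the data-adaptive shrinkage step mentioned in the abstract is not shown.

```python
import numpy as np

def semi_nmf(X, k, n_iter=200, seed=0, eps=1e-9):
    """Factor X (features x samples) as F @ G.T with G >= 0 and F unconstrained.

    Hypothetical helper for illustration; not part of the DASC package.
    """
    rng = np.random.default_rng(seed)
    G = rng.random((X.shape[1], k)) + 0.1  # strictly positive start

    def pos(A):
        # Positive part of a matrix, elementwise.
        return (np.abs(A) + A) / 2

    def neg(A):
        # Magnitude of the negative part, elementwise.
        return (np.abs(A) - A) / 2

    for _ in range(n_iter):
        # Least-squares update of F for the current G.
        F = X @ G @ np.linalg.pinv(G.T @ G)
        XtF = X.T @ F
        FtF = F.T @ F
        # Multiplicative update keeps every entry of G non-negative.
        num = pos(XtF) + G @ neg(FtF)
        den = neg(XtF) + G @ pos(FtF)
        G = G * np.sqrt((num + eps) / (den + eps))

    # Final F consistent with the last G.
    F = X @ G @ np.linalg.pinv(G.T @ G)
    return F, G
```

In a batch-detection setting, the rows of G could be read as soft memberships of each sample in k latent groups, but how DASC interprets the factors should be checked against the package documentation.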
Affiliation(s)
- Haidong Yi
- College of Computer and Control Engineering, Nankai University, Tianjin, China
- Ayush T Raman
- Graduate Program in Structural and Computational Biology and Molecular Biophysics, Baylor College of Medicine, Houston, TX, USA
- Department of Pediatrics, Neurological Research Institute, Baylor College of Medicine, Houston, TX, USA
- Han Zhang
- College of Computer and Control Engineering, Nankai University, Tianjin, China
- Tianjin Key Laboratory of Intelligent Robotics, Nankai University, Tianjin, China
- Zhandong Liu
- Graduate Program in Structural and Computational Biology and Molecular Biophysics, Baylor College of Medicine, Houston, TX, USA
- Department of Pediatrics, Neurological Research Institute, Baylor College of Medicine, Houston, TX, USA