1
Zhao Y, Peng F, Wang C, Murano T, Baba H, Ikematsu H, Li W, Goel A. A DNA Methylation-based Epigenetic Signature for the Identification of Lymph Node Metastasis in T1 Colorectal Cancer. Ann Surg 2023; 277:655-663. [PMID: 35837968 PMCID: PMC9840712 DOI: 10.1097/sla.0000000000005564] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/18/2023]
Abstract
OBJECTIVE This study aimed to unravel the lymph node metastasis (LNM)-related methylated DNA (mDNA) landscape and develop an mDNA signature to identify LNM in patients with T1 colorectal cancers (T1 CRC). BACKGROUND Considering the invasiveness of T1 CRC, current guidelines recommend endoscopic resection for LNM-negative patients and radical surgical resection only for high-risk LNM-positive patients. Unfortunately, the clinicopathological criteria for LNM risk stratification are imperfect, resulting in frequent misdiagnosis leading to unnecessary radical surgeries and postsurgical complications. METHODS We conducted genome-wide methylation profiling of 39 T1 CRC specimens to identify differentially methylated CpGs between LNM-positive and LNM-negative specimens, and performed quantitative pyrosequencing analysis in 235 specimens from 3 independent patient cohorts, including 195 resected tissues (training cohort: n=128, validation cohort: n=67) and 40 pretreatment biopsies. RESULTS Using logistic regression analysis, we developed a 9-CpG signature to distinguish LNM-positive versus LNM-negative surgical specimens in the training cohort [area under the curve (AUC)=0.831, 95% confidence interval (CI)=0.755-0.892; P <0.0001], which was subsequently validated in additional surgical specimens (AUC=0.825; 95% CI=0.696-0.955; P =0.003) and pretreatment biopsies (AUC=0.836; 95% CI=0.640-1.000, P =0.0036). This diagnostic power was further improved by combining the signature with conventional clinicopathological features. CONCLUSIONS We established a novel epigenetic signature that can robustly identify LNM in surgical specimens and even pretreatment biopsies from patients with T1 CRC. Our signature has strong translational potential to improve the selection of high-risk patients who require radical surgery while sparing others from its complications and expense.
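The AUC values reported in this abstract summarize how well a risk score separates LNM-positive from LNM-negative specimens. As a minimal sketch of what that statistic measures, the following computes AUC as the fraction of positive/negative pairs in which the positive case receives the higher score. The function name `auc` and the example scores are illustrative assumptions; they are not taken from the paper, whose model coefficients and data are not given here.

```python
def auc(scores, labels):
    """Rank-based AUC: probability that a positive case outscores a negative one.

    scores: model outputs (hypothetical values, e.g. logistic-regression risk scores)
    labels: 1 for LNM-positive, 0 for LNM-negative
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Illustrative scores only: three of the four positive/negative pairs are ordered correctly.
print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 0.75
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which is the scale on which the reported values of roughly 0.83 should be read.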
Affiliation(s)
- Yinghui Zhao
- Department of Molecular Diagnostics and Experimental Therapeutics, Beckman Research Institute of City of Hope, Monrovia, CA, USA
- Department of Clinical Laboratory, The Second Hospital, Cheeloo College of Medicine, Shandong University, Jinan, China
- Fuduan Peng
- Division of Computational Biomedicine, Department of Biological Chemistry, School of Medicine, University of California, Irvine, CA, USA
- Chuanxin Wang
- Shandong Engineering & Technology Research Center for Tumor Marker Detection, Jinan, China
- Shandong Provincial Clinical Medicine Research Center for Clinical Laboratory, Jinan, China
- Department of Clinical Laboratory, The Second Hospital, Cheeloo College of Medicine, Shandong University, Jinan, China
- Tatsuro Murano
- Department of Gastroenterology and Endoscopy, National Cancer Center Hospital East, Chiba, Japan
- Hideo Baba
- Department of Gastroenterological Surgery, Graduate School of Medical Sciences, Kumamoto University, Kumamoto, Japan
- Hiroaki Ikematsu
- Department of Gastroenterology and Endoscopy, National Cancer Center Hospital East, Chiba, Japan
- Wei Li
- Division of Computational Biomedicine, Department of Biological Chemistry, School of Medicine, University of California, Irvine, CA, USA
- Ajay Goel
- Department of Molecular Diagnostics and Experimental Therapeutics, Beckman Research Institute of City of Hope, Monrovia, CA, USA
- City of Hope Comprehensive Cancer Center, Duarte, CA, USA
2
Tanvir Ahmed K, Cheng S, Li Q, Yong J, Zhang W. Incomplete time-series gene expression in integrative study for islet autoimmunity prediction. Brief Bioinform 2022; 24:6895461. [PMID: 36513375 PMCID: PMC9851333 DOI: 10.1093/bib/bbac537] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/05/2022] [Revised: 10/27/2022] [Accepted: 11/08/2022] [Indexed: 12/15/2022] Open
Abstract
Type 1 diabetes (T1D) outcome prediction plays a vital role in identifying novel risk factors, ensuring early patient care and designing cohort studies. TEDDY is a longitudinal cohort study that collects a vast amount of multi-omics and clinical data from its participants to explore the progression and markers of T1D. However, missing data in the omics profiles make outcome prediction a difficult task. TEDDY collected time series gene expression for less than 6% of enrolled participants. Additionally, for the participants whose gene expression was collected, 79% of the time steps are missing. This study introduces an advanced bioinformatics framework for gene expression imputation and islet autoimmunity (IA) prediction. The imputation model generates synthetic data for participants with partially or entirely missing gene expression. The prediction model integrates the synthetic gene expression with other risk factors to achieve better predictive performance. Comprehensive experiments on TEDDY datasets show that: (1) Our pipeline can effectively integrate synthetic gene expression with family history, HLA genotype and SNPs to better predict IA status at 2 years (sensitivity 0.622, AUC 0.715) compared with the individual datasets and state-of-the-art results in the literature (AUC 0.682). (2) The synthetic gene expression contains predictive signals as strong as the true gene expression, reducing reliance on expensive and long-term longitudinal data collection. (3) Time series gene expression is crucial to the proposed improvement and shows significantly better predictive ability than cross-sectional gene expression. (4) Our pipeline is robust to limited data availability. Availability: Code is available at https://github.com/compbiolabucf/TEDDY.
Affiliation(s)
- Sze Cheng
- Department of Biochemistry, Molecular Biology and Biophysics, University of Minnesota Twin Cities, Minneapolis, MN 55455, USA
- Qian Li
- Department of Biostatistics, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA
- Jeongsik Yong
- Department of Biochemistry, Molecular Biology and Biophysics, University of Minnesota Twin Cities, Minneapolis, MN 55455, USA
- Wei Zhang
- Corresponding author. Wei Zhang, Computer Science Department, University of Central Florida. Tel.: 407-823-2763;
3
Hu Z, Hu M, Yuan X, Yu H, Zou J, Zhang Y, Lu Z. Verbal learning, working memory, and attention/vigilance may be candidate phenotypes of bipolar II depression in Chinese Han nationality. Acta Psychol (Amst) 2022; 226:103563. [PMID: 35313178 DOI: 10.1016/j.actpsy.2022.103563] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/11/2021] [Revised: 03/04/2022] [Accepted: 03/14/2022] [Indexed: 11/01/2022] Open
Abstract
OBJECTIVES Bipolar II depression (BD-II) is a subtype of bipolar disorder with recurrent depressive, manic, and frequent depressive episodes as the main clinical manifestations. This study aimed to compare the cognitive function of patients with BD-II with that of healthy siblings and controls to explore the endophenotype of BD-II in the field of cognitive function. METHODS Sixty-six BD-II patients, 58 healthy siblings, and 55 healthy controls were assessed with the Trail Making Test (TMT), Digit Symbol Coding Test (DSCT), Category Fluency, Hopkins Verbal Learning Test-Revised (HVLT-R), Brief Visuospatial Memory Test-Revised (BVMT-R), Wechsler Memory Scale 3rd ed. Spatial Span Subtest (WMS-III SS), Neuropsychological Assessment Battery Mazes (NABM), and Continuous Performance Test-Identical Pairs (CPT-IP). RESULTS Patients with BD-II showed cognitive deficits in visual learning, reasoning and problem solving, verbal learning, attention/vigilance, working memory, and speed of processing. Healthy siblings showed cognitive deficits in reasoning and problem solving, verbal learning, attention/vigilance, working memory, and speed of processing. Substantial differences were observed among the three groups in reasoning and problem solving. CONCLUSIONS Verbal learning, working memory, and attention/vigilance may be potential endophenotypes that can be used to identify BD-II among Han Chinese in the early stage.
4
Wartmann H, Heins S, Kloiber K, Bonn S. Bias-invariant RNA-sequencing metadata annotation. Gigascience 2021; 10:giab064. [PMID: 34553213 PMCID: PMC8559615 DOI: 10.1093/gigascience/giab064] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 12/16/2020] [Revised: 06/11/2021] [Accepted: 09/01/2021] [Indexed: 01/14/2023] Open
Abstract
BACKGROUND Recent technological advances have resulted in an unprecedented increase in publicly available biomedical data, yet the reuse of the data is often precluded by experimental bias and a lack of annotation depth and consistency. Missing annotations make it impossible for researchers to find datasets specific to their needs. FINDINGS Here, we investigate RNA-sequencing metadata prediction based on gene expression values. We present a deep-learning-based domain adaptation algorithm for the automatic annotation of RNA-sequencing metadata. We show, in multiple experiments, that our model is better at integrating heterogeneous training data compared with existing linear regression-based approaches, resulting in improved tissue type classification. By using a model architecture similar to Siamese networks, the algorithm can learn biases from datasets with few samples. CONCLUSION Using our novel domain adaptation approach, we achieved metadata annotation accuracies up to 15.7% better than a previously published method. Using the best model, we provide a list of >10,000 novel tissue and sex label annotations for 8,495 unique SRA samples. Our approach has the potential to revive idle datasets through automated annotation, making them more searchable.
Affiliation(s)
- Hannes Wartmann
- Institute of Medical Systems Biology, Center for Biomedical AI, University Medical Center Hamburg-Eppendorf, 20251 Hamburg, Germany
- Sven Heins
- Institute of Medical Systems Biology, Center for Biomedical AI, University Medical Center Hamburg-Eppendorf, 20251 Hamburg, Germany
- Karin Kloiber
- Institute of Medical Systems Biology, Center for Biomedical AI, University Medical Center Hamburg-Eppendorf, 20251 Hamburg, Germany
- Stefan Bonn
- Institute of Medical Systems Biology, Center for Biomedical AI, University Medical Center Hamburg-Eppendorf, 20251 Hamburg, Germany
5
Oliver J, Nair N, Orozco G, Smith S, Hyrich KL, Morgan A, Isaacs J, Wilson AG, Barton A, Plant D. Transcriptome-wide study of TNF-inhibitor therapy in rheumatoid arthritis reveals early signature of successful treatment. Arthritis Res Ther 2021; 23:80. [PMID: 33691749 PMCID: PMC7948368 DOI: 10.1186/s13075-021-02451-9] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 10/30/2020] [Accepted: 02/11/2021] [Indexed: 12/11/2022] Open
Abstract
BACKGROUND Despite the success of TNF-inhibitor therapy in rheumatoid arthritis treatment, up to 40% of patients fail to respond adequately. This study aimed to identify transcriptome-based biomarkers of adalimumab response in rheumatoid arthritis (RA) to aid timely switching in non-responder patients and provide a better mechanistic understanding of the pathways involved in response/non-response. METHODS The Affymetrix Human Transcriptome Array 2.0 (HTA) was used to measure the transcriptome in whole blood at pre-treatment and at 3 months in EULAR good- and non-responders to adalimumab therapy. Differential expression of transcripts was analysed at the transcript level using multiple linear regression. Differentially expressed genes were validated in independent samples using OpenArray™ RT-qPCR. RESULTS In total, 813 transcripts were differentially expressed between pre-treatment and 3 months in adalimumab good-responders. No significant differential expression was observed between good- and non-responders at either time-point and no significant changes were observed in non-responders between time-points. OpenArray™ RT-qPCR was performed for 104 differentially expressed transcripts in good-responders, selected based on magnitude of effect or p value or based on prior association with RA or the immune system, validating differential expression for 17 transcripts. CONCLUSIONS An early transcriptome signature of DAS28 response to adalimumab has been identified and replicated in independent datasets. Whilst treat-to-target approaches encourage early switching in non-responsive patients, registry evidence suggests that this does not always occur. The results herein could guide the development of a blood test to distinguish responders from non-responders at 3 months and support clinical decisions to switch non-responsive patients to an alternative therapy.
Affiliation(s)
- James Oliver
- Versus Arthritis Centre for Genetics and Genomics, Centre for Musculoskeletal Research, Manchester Academic Health Sciences Centre, The University of Manchester, Manchester, UK
- Nisha Nair
- Versus Arthritis Centre for Genetics and Genomics, Centre for Musculoskeletal Research, Manchester Academic Health Sciences Centre, The University of Manchester, Manchester, UK
- Gisela Orozco
- Versus Arthritis Centre for Genetics and Genomics, Centre for Musculoskeletal Research, Manchester Academic Health Sciences Centre, The University of Manchester, Manchester, UK
- Samantha Smith
- Versus Arthritis Centre for Genetics and Genomics, Centre for Musculoskeletal Research, Manchester Academic Health Sciences Centre, The University of Manchester, Manchester, UK
- Kimme L Hyrich
- NIHR Manchester BRC, Manchester University Foundation Trust, Manchester, UK
- Versus Arthritis Centre for Epidemiology, Centre for Musculoskeletal Research, Manchester Academic Health Sciences Centre, The University of Manchester, Manchester, UK
- Ann Morgan
- Leeds Institute of Rheumatic and Musculoskeletal Medicine, University of Leeds and NIHR Leeds Musculoskeletal Biomedical Research Unit, Leeds Teaching Hospitals NHS Trust, Leeds, UK
- John Isaacs
- Institute of Cellular Medicine, Newcastle University, Newcastle upon Tyne, UK
- National Institute for Health Research Newcastle Biomedical Research Centre at Newcastle upon Tyne Hospitals NHS Foundation Trust and Newcastle University, Newcastle upon Tyne, UK
- Anthony G Wilson
- UCD School of Medicine and Medical Science, Conway Institute, University College Dublin, Dublin, Ireland
- Anne Barton
- Versus Arthritis Centre for Genetics and Genomics, Centre for Musculoskeletal Research, Manchester Academic Health Sciences Centre, The University of Manchester, Manchester, UK
- NIHR Manchester BRC, Manchester University Foundation Trust, Manchester, UK
- Darren Plant
- Versus Arthritis Centre for Genetics and Genomics, Centre for Musculoskeletal Research, Manchester Academic Health Sciences Centre, The University of Manchester, Manchester, UK
- NIHR Manchester BRC, Manchester University Foundation Trust, Manchester, UK
6
Elbadawi M, Gaisford S, Basit AW. Advanced machine-learning techniques in drug discovery. Drug Discov Today 2020; 26:769-777. [PMID: 33290820 DOI: 10.1016/j.drudis.2020.12.003] [Citation(s) in RCA: 50] [Impact Index Per Article: 12.5] [Received: 08/24/2020] [Revised: 11/16/2020] [Accepted: 12/02/2020] [Indexed: 01/20/2023]
Abstract
The popularity of machine learning (ML) across drug discovery continues to grow, yielding impressive results. As the use of ML techniques increases, their limitations become more apparent. Such limitations include their need for big data, sparsity in data, and their lack of interpretability. It has also become apparent that the techniques are not truly autonomous, requiring retraining even post deployment. In this review, we detail the use of advanced techniques to circumvent these challenges, with examples drawn from drug discovery and allied disciplines. In addition, we present emerging techniques and their potential role in drug discovery. The techniques presented herein are anticipated to expand the applicability of ML in drug discovery.
Affiliation(s)
- Moe Elbadawi
- Department of Pharmaceutics, UCL School of Pharmacy, University College London, 29-39 Brunswick Square, London, WC1N 1AX, UK
- Simon Gaisford
- Department of Pharmaceutics, UCL School of Pharmacy, University College London, 29-39 Brunswick Square, London, WC1N 1AX, UK; FabRx Ltd, 3 Romney Road, Ashford, TN24 0RW, UK
- Abdul W Basit
- Department of Pharmaceutics, UCL School of Pharmacy, University College London, 29-39 Brunswick Square, London, WC1N 1AX, UK; FabRx Ltd, 3 Romney Road, Ashford, TN24 0RW, UK
7
Pheno-RNA, a method to associate genes with a specific phenotype, identifies genes linked to cellular transformation. Proc Natl Acad Sci U S A 2020; 117:28925-28929. [PMID: 33144504 DOI: 10.1073/pnas.2014165117] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 02/06/2023] Open
Abstract
Cellular transformation is associated with dramatic changes in gene expression, but it is difficult to determine which regulated genes are oncogenically relevant. Here we describe Pheno-RNA, a general approach to identifying candidate genes associated with a specific phenotype. Specifically, we generate a "phenotypic series" by treating a nontransformed breast cell line with a wide variety of molecules that induce cellular transformation to various extents. By performing transcriptional profiling across this phenotypic series, the expression profile of every gene can be correlated with the strength of the transformed phenotype. We identify ∼200 genes whose expression profiles are very highly correlated with the transformation phenotype, strongly suggesting their importance in transformation. Within biological categories linked to cancer, some genes show high correlations with the transformed phenotype, but others do not. Many genes whose expression profiles are highly correlated with transformation have never been associated with cancer, suggesting the involvement of heretofore unknown genes in cancer.
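The core idea in this abstract, correlating each gene's expression profile across a phenotypic series with the strength of the phenotype, can be sketched in a few lines. This is an illustration under stated assumptions and not the authors' pipeline: the abstract does not say which correlation measure was used, so Pearson correlation is assumed here, and the names `pearson`, `rank_genes`, and the toy gene labels are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def rank_genes(expression, phenotype):
    """Rank genes by absolute correlation with phenotype strength.

    expression: dict mapping gene name -> expression values, one per treatment
    phenotype:  phenotype strength for the same treatments, in the same order
    """
    scores = {gene: pearson(values, phenotype) for gene, values in expression.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Toy data (hypothetical): four treatments with increasing transformation strength.
phenotype = [0.0, 1.0, 2.0, 3.0]
expression = {
    "gene_a": [1.0, 2.0, 3.0, 4.0],  # rises with the phenotype
    "gene_b": [4.0, 3.0, 2.0, 1.0],  # falls with the phenotype
    "gene_c": [1.0, 1.0, 1.0, 1.0],  # unrelated to the phenotype
}
for gene, r in rank_genes(expression, phenotype):
    print(gene, round(r, 3))
```

Ranking by absolute correlation keeps both positively and negatively associated genes near the top; whether the original study treated the two directions alike is not stated in the abstract.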
8
Lung PY, Zhong D, Pang X, Li Y, Zhang J. Maximizing the reusability of gene expression data by predicting missing metadata. PLoS Comput Biol 2020; 16:e1007450. [PMID: 33156882 PMCID: PMC7673503 DOI: 10.1371/journal.pcbi.1007450] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/28/2019] [Revised: 11/18/2020] [Accepted: 10/09/2020] [Indexed: 11/18/2022] Open
Abstract
Reusability is part of the FAIR data principles, which aim to make data Findable, Accessible, Interoperable, and Reusable. One of the current efforts to increase the reusability of public genomics data has been to focus on the inclusion of quality metadata associated with the data. When necessary metadata are missing, most researchers will consider the data useless. In this study, we developed a framework to predict the missing metadata of gene expression datasets to maximize their reusability. We found that when using predicted data to conduct other analyses, it is not optimal to use all the predicted data. Instead, one should use only the subset of data that can be predicted accurately. We proposed a new metric called Proportion of Cases Accurately Predicted (PCAP), which is optimized in our specifically-designed machine learning pipeline. The new approach performed better than pipelines using commonly used metrics such as F1-score in terms of maximizing the reusability of data with missing values. We also found that different variables might need to be predicted using different machine learning methods and/or different data processing protocols. Using differential gene expression analysis as an example, we showed that when missing variables are accurately predicted, the corresponding gene expression data can be reliably used in downstream analyses.
Affiliation(s)
- Pei-Yau Lung
- Department of Statistics, Florida State University, Tallahassee, United States of America
- Dongrui Zhong
- Department of Statistics, Florida State University, Tallahassee, United States of America
- Xiaodong Pang
- Insilicom LLC, Tallahassee, United States of America
- Yan Li
- Department of Breast Surgery, Peking Union Medical College Hospital, Peking Union Medical College, Chinese Academy of Medical Sciences, Beijing, China
- Jinfeng Zhang
- Department of Statistics, Florida State University, Tallahassee, United States of America
- * E-mail:
9
A novel computational approach for predicting complex phenotypes in Drosophila (starvation-sensitive and sterile) by deriving their gene expression signatures from public data. PLoS One 2020; 15:e0240824. [PMID: 33104720 PMCID: PMC7588067 DOI: 10.1371/journal.pone.0240824] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/06/2020] [Accepted: 10/05/2020] [Indexed: 11/19/2022] Open
Abstract
Many research teams perform numerous genetic, transcriptomic, proteomic and other types of omic experiments to understand molecular, cellular and physiological mechanisms of disease and health. Often (but not always), the results of these experiments are deposited in publicly available repository databases. These data records often include phenotypic characteristics following genetic and environmental perturbations, with the aim of discovering underlying molecular mechanisms leading to the phenotypic responses. A constrained set of phenotypic characteristics is usually recorded and these are mostly hypothesis driven or possible to record within financial or practical constraints. We present a novel proof-of-principle computational approach for combining publicly available gene-expression data from control/mutant animal experiments that exhibit a particular phenotype, and we use this approach to predict unobserved phenotypic characteristics in new experiments (data derived from EBI’s ArrayExpress and ExpressionAtlas, respectively). We utilised available microarray gene-expression data for two phenotypes (starvation-sensitive and sterile) in Drosophila. The data were combined using a linear-mixed effects model with the inclusion of consecutive principal components to account for variability between experiments in conjunction with Gene Ontology enrichment analysis. We present how available data can be ranked according to the likelihood of exhibiting these two phenotypes using random forest. The results from our study show that it is possible to integrate seemingly different gene-expression microarray data and predict a potential phenotypic manifestation with a relatively high degree of confidence (>80% AUC). This provides thus far unexplored opportunities for inferring unknown and unbiased phenotypic characteristics from already performed experiments, in order to identify studies for future analyses.
Molecular mechanisms associated with gene and environment perturbations are intrinsically linked and give rise to a variety of phenotypic manifestations. Therefore, unravelling the phenotypic spectrum can help to gain insights into disease mechanisms associated with gene and environmental perturbations. Our approach uses public data that are set to increase in volume, thus providing value for money.
10
Crawford J, Greene CS. Incorporating biological structure into machine learning models in biomedicine. Curr Opin Biotechnol 2020; 63:126-134. [PMID: 31962244 PMCID: PMC7308204 DOI: 10.1016/j.copbio.2019.12.021] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 10/17/2019] [Revised: 12/17/2019] [Accepted: 12/19/2019] [Indexed: 12/19/2022]
Abstract
In biomedical applications of machine learning, relevant information often has a rich structure that is not easily encoded as real-valued predictors. Examples of such data include DNA or RNA sequences, gene sets or pathways, gene interaction or coexpression networks, ontologies, and phylogenetic trees. We highlight recent examples of machine learning models that use structure to constrain model architecture or incorporate structured data into model training. For machine learning in biomedicine, where sample size is limited and model interpretability is crucial, incorporating prior knowledge in the form of structured data can be particularly useful. The area of research would benefit from performant open source implementations and independent benchmarking efforts.
Affiliation(s)
- Jake Crawford
- Graduate Group in Genomics and Computational Biology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States; Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
- Casey S Greene
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States; Childhood Cancer Data Lab, Alex's Lemonade Stand Foundation, Philadelphia, PA, United States
11
Smith AM, Walsh JR, Long J, Davis CB, Henstock P, Hodge MR, Maciejewski M, Mu XJ, Ra S, Zhao S, Ziemek D, Fisher CK. Standard machine learning approaches outperform deep representation learning on phenotype prediction from transcriptomics data. BMC Bioinformatics 2020; 21:119. [PMID: 32197580 PMCID: PMC7085143 DOI: 10.1186/s12859-020-3427-8] [Citation(s) in RCA: 32] [Impact Index Per Article: 8.0] [Received: 01/30/2020] [Accepted: 02/21/2020] [Indexed: 12/30/2022] Open
Abstract
BACKGROUND The ability to confidently predict health outcomes from gene expression would catalyze a revolution in molecular diagnostics. Yet, the goal of developing actionable, robust, and reproducible predictive signatures of phenotypes such as clinical outcome has not been attained in almost any disease area. Here, we report a comprehensive analysis spanning prediction tasks from ulcerative colitis, atopic dermatitis, and diabetes to many cancer subtypes, for a total of 24 binary and multiclass prediction problems and 26 survival analysis tasks. We systematically investigate the influence of gene subsets, normalization methods and prediction algorithms. Crucially, we also explore the novel use of deep representation learning methods on large transcriptomics compendia, such as GTEx and TCGA, to boost the performance of state-of-the-art methods. The resources and findings in this work should serve as both an up-to-date reference on attainable performance, and as a benchmarking resource for further research. RESULTS Approaches that combine large numbers of genes outperformed single gene methods consistently and with a significant margin, but neither unsupervised nor semi-supervised representation learning techniques yielded consistent improvements in out-of-sample performance across datasets. Our findings suggest that using l2-regularized regression methods applied to centered log-ratio transformed transcript abundances provides the best predictive analyses overall. CONCLUSIONS Transcriptomics-based phenotype prediction benefits from proper normalization techniques and state-of-the-art regularized regression approaches. In our view, breakthrough performance is likely contingent on factors which are independent of normalization and general modeling techniques; these factors might include reduction of systematic errors in sequencing data, incorporation of other data types such as single-cell sequencing and proteomics, and improved use of prior knowledge.
Affiliation(s)
- John Long
- Computational Sciences, Worldwide Research & Development, Pfizer Inc., Cambridge, MA, USA
- Craig B Davis
- Oncology Global Product Development, Pfizer Inc., San Diego, CA, USA
- Martin R Hodge
- Inflammation and Immunology, Worldwide Research & Development, Pfizer Inc., Cambridge, MA, USA
- Mateusz Maciejewski
- Inflammation and Immunology, Worldwide Research & Development, Pfizer Inc., Cambridge, MA, USA
- Xinmeng Jasmine Mu
- Oncology Research & Development, Worldwide Research & Development, Pfizer Inc., San Diego, CA, USA
- Stephen Ra
- Computational Sciences, Worldwide Research & Development, Pfizer Inc., Cambridge, MA, USA
- Shanrong Zhao
- Computational Sciences, Worldwide Research & Development, Pfizer Inc., Cambridge, MA, USA
- Daniel Ziemek
- Inflammation and Immunology, Worldwide Research & Development, Pfizer Pharma GmbH., Berlin, Germany
12
Azodi CB, Pardo J, VanBuren R, de Los Campos G, Shiu SH. Transcriptome-Based Prediction of Complex Traits in Maize. Plant Cell 2020; 32:139-151. [PMID: 31641024 PMCID: PMC6961623 DOI: 10.1105/tpc.19.00332] [Citation(s) in RCA: 47] [Impact Index Per Article: 11.8] [Received: 05/06/2019] [Revised: 09/24/2019] [Accepted: 10/21/2019] [Indexed: 05/11/2023]
Abstract
The ability to predict traits from genome-wide sequence information (i.e., genomic prediction) has improved our understanding of the genetic basis of complex traits and transformed breeding practices. Transcriptome data may also be useful for genomic prediction. However, it remains unclear how well transcript levels can predict traits, particularly when traits are scored at different development stages. Using maize (Zea mays) genetic markers and transcript levels from seedlings to predict mature plant traits, we found that transcript and genetic marker models have similar performance. When the transcripts and genetic markers with the greatest weights (i.e., the most important) in those models were used in one joint model, performance increased. Furthermore, genetic markers important for predictions were not close to or identified as regulatory variants for important transcripts. These findings demonstrate that transcript levels are useful for predicting traits and that their predictive power is not simply due to genetic variation in the transcribed genomic regions. Finally, genetic marker models identified only 1 of 14 benchmark flowering-time genes, while transcript models identified 5. These data highlight that, in addition to being useful for genomic prediction, transcriptome data can provide a link between traits and variation that cannot be readily captured at the sequence level.
Affiliation(s)
- Christina B Azodi
- Department of Plant Biology, Michigan State University, East Lansing, Michigan 48824
- The DOE Great Lakes Bioenergy Research Center, Michigan State University, East Lansing, Michigan 48824
- Jeremy Pardo
- Department of Plant Biology, Michigan State University, East Lansing, Michigan 48824
- Plant Resilience Institute, Michigan State University, East Lansing, Michigan 48824
- Robert VanBuren
- Plant Resilience Institute, Michigan State University, East Lansing, Michigan 48824
- Department of Horticulture, Michigan State University, East Lansing, Michigan 48824
- Gustavo de Los Campos
- Epidemiology and Biostatistics and Statistics and Probability Departments, Michigan State University, East Lansing, Michigan 48824
- Shin-Han Shiu
- Department of Plant Biology, Michigan State University, East Lansing, Michigan 48824
- The DOE Great Lakes Bioenergy Research Center, Michigan State University, East Lansing, Michigan 48824
- Department of Computational Mathematics, Science, and Engineering, Michigan State University, East Lansing, Michigan 48824
13
|
Zhou H, Xi L, Ziemek D, O’Neil S, Lee J, Stewart Z, Zhan Y, Zhao S, Zhang Y, Page K, Huang A, Maciejewski M, Zhang B, Gorelick KJ, Fitz L, Pradhan V, Cataldi F, Vincent M, Von Schack D, Hung K, Hassan-Zahraee M. Molecular Profiling of Ulcerative Colitis Subjects from the TURANDOT Trial Reveals Novel Pharmacodynamic/Efficacy Biomarkers. J Crohns Colitis 2019; 13:702-713. [PMID: 30901380 PMCID: PMC6535501 DOI: 10.1093/ecco-jcc/jjy217] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 05/31/2018] [Revised: 11/30/2018] [Accepted: 01/10/2019] [Indexed: 12/13/2022]
Abstract
BACKGROUND AND AIMS To define pharmacodynamic and efficacy biomarkers in ulcerative colitis [UC] patients treated with PF-00547659, an anti-human mucosal addressin cell adhesion molecule-1 [MAdCAM-1] monoclonal antibody, in the TURANDOT study. METHODS Transcriptome, proteome and immunohistochemistry data were generated in peripheral blood and intestinal biopsies from 357 subjects in the TURANDOT study. RESULTS In peripheral blood, C-C motif chemokine receptor 9 [CCR9] gene expression demonstrated a dose-dependent increase relative to placebo, but in inflamed intestinal biopsies CCR9 gene expression decreased with increasing PF-00547659 dose. Statistical models incorporating the full RNA transcriptome in inflamed intestinal biopsies showed significant ability to assess response and remission status. Oncostatin M [OSM] gene expression in inflamed intestinal biopsies demonstrated significant associations with, and good accuracy for, efficacy, and this observation was confirmed in independent published studies in which UC patients were treated with infliximab or vedolizumab. Compared with the placebo group, intestinal T-regulatory cells demonstrated a significant increase in the intermediate 22.5-mg dose cohort, but not in the 225-mg cohort. CONCLUSIONS CCR9 and OSM are implicated as novel pharmacodynamic and efficacy biomarkers. These findings occur amid coordinated transcriptional changes that enable the definition of surrogate efficacy biomarkers based on inflamed biopsy or blood transcriptomics data. ClinicalTrials.gov identifier: NCT01620255.
Affiliation(s)
- Li Xi
- Pfizer, Cambridge, MA, USA
- Mina Hassan-Zahraee
- Pfizer, Cambridge, MA, USA. Corresponding author: Mina Hassan-Zahraee, PhD, Early Clinical R&D, Pfizer Worldwide Research & Development, Pfizer, Inc., 1 Portland Street, 3rd floor, Cambridge, MA 02139, USA. Tel: 1-617-674-6338; fax: 1-973-660-8096.
14
|
Plant D, Maciejewski M, Smith S, Nair N, Hyrich K, Ziemek D, Barton A, Verstappen S. Profiling of Gene Expression Biomarkers as a Classifier of Methotrexate Nonresponse in Patients With Rheumatoid Arthritis. Arthritis Rheumatol 2019; 71:678-684. [PMID: 30615300 PMCID: PMC9328381 DOI: 10.1002/art.40810] [Citation(s) in RCA: 43] [Impact Index Per Article: 8.6] [Received: 07/11/2018] [Accepted: 12/04/2018] [Indexed: 12/16/2022]
Abstract
Objective Approximately 30–40% of rheumatoid arthritis (RA) patients who are initially started on low‐dose methotrexate (MTX) will not benefit from the treatment. To date, no reliable biomarkers of MTX inefficacy in RA have been identified. The aim of this study was to analyze whole blood samples from RA patients at 2 time points (pretreatment and 4 weeks following initiation of MTX), to identify gene expression biomarkers of the MTX response. Methods RA patients who were about to commence treatment with MTX were selected from the Rheumatoid Arthritis Medication Study. Using European League Against Rheumatism (EULAR) response criteria, 42 patients were categorized as good responders and 43 as nonresponders at 6 months following the initiation of MTX treatment. Data on whole blood transcript expression were generated, and supervised machine learning methods were used to predict a EULAR nonresponse. Models in which transcript levels were included were compared to models in which clinical covariates alone (e.g., baseline disease activity, sex) were included. Gene network and ontology analysis was also performed. Results Based on the ratio of transcript values (i.e., the difference in log2‐transformed expression values between 4 weeks of treatment and pretreatment), a highly predictive classifier of MTX nonresponse was developed using L2‐regularized logistic regression (mean ± SEM area under the receiver operating characteristic [ROC] curve [AUC] 0.78 ± 0.11). This classifier was superior to models that included clinical covariates (ROC AUC 0.63 ± 0.06). Pathway analysis of gene networks revealed significant overrepresentation of type I interferon signaling pathway genes in nonresponders at pretreatment (P = 2.8 × 10⁻²⁵) and at 4 weeks after treatment initiation (P = 4.9 × 10⁻²⁸).
Conclusion Testing for changes in gene expression between pretreatment and 4 weeks post–treatment initiation may provide an early classifier of the MTX treatment response in RA patients who are unlikely to benefit from MTX over 6 months. Such patients should, therefore, have their treatment escalated more rapidly, which would thus potentially impact treatment pathways. These findings emphasize the importance of a role for early treatment biomarker monitoring in RA patients started on MTX.
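The feature construction and model type named in this abstract (log2 expression differences between two time points, fed to an L2-regularized logistic regression) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the data are synthetic, the function names `log2_ratio`, `fit_l2_logistic`, and `predict_proba` are hypothetical, and the plain gradient-descent fit stands in for whatever software and tuning the authors actually used.

```python
import numpy as np

def log2_ratio(pre, post):
    """Per-transcript change: log2(post) - log2(pre), i.e. log2 of the expression ratio."""
    return np.log2(post) - np.log2(pre)

def fit_l2_logistic(X, y, lam=1.0, lr=0.1, n_iter=500):
    """Fit logistic regression with an L2 penalty on the weights by gradient descent."""
    n, p = X.shape
    w = np.zeros(p)
    b = 0.0
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        # Gradient of the mean log-loss plus (lam / (2n)) * ||w||^2; intercept is not penalized.
        w -= lr * (X.T @ (prob - y) / n + lam * w / n)
        b -= lr * np.mean(prob - y)
    return w, b

def predict_proba(X, w, b):
    """Predicted probability of the class coded 1."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

# Synthetic data only (assumption): 40 patients, 5 transcripts, none of it from the study.
rng = np.random.default_rng(0)
n_patients, n_transcripts = 40, 5
y = rng.integers(0, 2, size=n_patients).astype(float)  # 1 = nonresponder (hypothetical coding)
pre = 2.0 ** rng.normal(8.0, 1.0, size=(n_patients, n_transcripts))
shift = rng.normal(0.0, 0.3, size=(n_patients, n_transcripts))
shift[:, 0] += np.where(y == 1, 1.0, -1.0)  # transcript 0 carries the simulated signal
post = pre * 2.0 ** shift

X = log2_ratio(pre, post)
w, b = fit_l2_logistic(X, y)
prob = predict_proba(X, w, b)
accuracy = np.mean((prob > 0.5) == (y == 1))  # in-sample only, not a performance estimate
```

The in-sample accuracy here only confirms that the sketch runs on a deliberately easy simulated signal; a real evaluation would need held-out data and an AUC-style metric such as the one reported in the abstract.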
Affiliation(s)
- Darren Plant
- Manchester University NHS Foundation Trust, Manchester, UK
- Nisha Nair
- University of Manchester, Manchester, UK
- Kimme Hyrich
- Manchester University NHS Foundation Trust, Manchester, UK
- Anne Barton
- Manchester University NHS Foundation Trust, Manchester, UK
15
|
Understanding the hidden relations between pro- and anti-inflammatory cytokine genes in bovine oviduct epithelium using a multilayer response surface method. Sci Rep 2019; 9:3189. [PMID: 30816156 PMCID: PMC6395797 DOI: 10.1038/s41598-019-39081-w] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 05/30/2018] [Accepted: 01/18/2019] [Indexed: 02/06/2023]
Abstract
An understanding of gene-gene interactions helps users to design the next experiments efficiently and (if applicable) to make better decisions about drug application based on the different biological conditions of the patients. This study aimed to identify changes in the hidden relationships between pro- and anti-inflammatory cytokine genes in the bovine oviduct epithelial cells (BOECs) under various experimental conditions using a multilayer response surface method. It was noted that under physiological conditions (BOECs with sperm or sex hormones, such as ovarian sex steroids and LH), the mRNA expressions of IL10, IL1B, TNFA, TLR4, and TNFA were associated with IL1B, TNFA, TLR4, IL4, and IL10, respectively. Under pathophysiological + physiological conditions (BOECs with lipopolysaccharide + hormones, alpha-1-acid glycoprotein + hormones, zearalenone + hormones, or urea + hormones), the relationship among genes was changed. For example, the expression of IL10 and TNFA was associated with (IL1B, TNFA, or IL4) and TLR4 expression, respectively. Furthermore, under physiological conditions, the co-expression of IL10 + TNFA, TLR4 + IL4, TNFA + IL4, TNFA + IL4, or IL10 + IL1B and under pathophysiological + physiological conditions, the co-expression of IL10 + IL4, IL4 + IL10, TNFA + IL10, TNFA + TLR4, or IL10 + IL1B were associated with IL1B, TNFA, TLR4, IL10, or IL4 expression, respectively. Collectively, the relationships between pro- and anti-inflammatory cytokine genes can be changed with respect to the presence/absence of toxins, sex hormones, sperm, and co-expression of other gene pairs in BOECs, suggesting that considerable caution is needed in interpreting the results obtained from such narrowly focused in vitro studies.
16
|
Li Z, Gao N, Martini JWR, Simianer H. Integrating Gene Expression Data Into Genomic Prediction. Front Genet 2019; 10:126. [PMID: 30858865 PMCID: PMC6397893 DOI: 10.3389/fgene.2019.00126] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Received: 10/14/2018] [Accepted: 02/04/2019] [Indexed: 01/14/2023]
Abstract
Gene expression profiles potentially hold valuable information for the prediction of breeding values and phenotypes. In this study, the utility of transcriptome data for phenotype prediction was tested with 185 inbred lines of Drosophila melanogaster for nine traits in two sexes. We incorporated the transcriptome data into genomic prediction via two methods: GTBLUP and GRBLUP, both combining single nucleotide polymorphisms (SNPs) and transcriptome data. The genotypic data was used to construct the common additive genomic relationship, which was used in genomic best linear unbiased prediction (GBLUP) or jointly in a linear mixed model with a transcriptome-based linear kernel (GTBLUP), or with a transcriptome-based Gaussian kernel (GRBLUP). We studied the predictive ability of the models and discuss a concept of "omics-augmented broad sense heritability" for the multi-omics era. For most traits, GRBLUP and GBLUP provided similar predictive abilities, but GRBLUP explained more of the phenotypic variance. There was only one trait (olfactory perception to Ethyl Butyrate in females) in which the predictive ability of GRBLUP (0.23) was significantly higher than the predictive ability of GBLUP (0.21). Our results suggest that accounting for transcriptome data has the potential to improve genomic predictions if transcriptome data can be included on a larger scale.
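As a rough illustration of the three similarity matrices this abstract refers to (an additive genomic relationship from SNPs, a transcriptome-based linear kernel, and a transcriptome-based Gaussian kernel), the sketch below builds each one with NumPy. The specific formulas are assumptions, since the abstract does not give them: the genomic matrix uses a common allele-frequency-centered form, the Gaussian kernel uses a hypothetical `bandwidth` parameter on squared distances averaged over transcripts, and all data are random. Fitting the mixed models themselves (GBLUP, GTBLUP, GRBLUP) is not shown.

```python
import numpy as np

def genomic_relationship(snps):
    """Additive relationship matrix from a lines x markers matrix coded 0/1/2 (assumed form)."""
    p = snps.mean(axis=0) / 2.0              # allele frequency per marker
    centered = snps - 2.0 * p                # center each marker column
    return centered @ centered.T / (2.0 * np.sum(p * (1.0 - p)))

def standardize(expr):
    """Scale each transcript (column) to mean 0 and unit variance."""
    return (expr - expr.mean(axis=0)) / expr.std(axis=0)

def linear_kernel(expr):
    """Transcriptome-based linear kernel: Z Z^T / number of transcripts."""
    z = standardize(expr)
    return z @ z.T / expr.shape[1]

def gaussian_kernel(expr, bandwidth=1.0):
    """Transcriptome-based Gaussian kernel on mean squared Euclidean distance."""
    z = standardize(expr)
    sq = np.sum(z ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (z @ z.T), 0.0)
    return np.exp(-bandwidth * d2 / expr.shape[1])

# Random toy data (assumption): 6 lines, 50 markers, 20 transcripts.
rng = np.random.default_rng(0)
snps = rng.integers(0, 3, size=(6, 50)).astype(float)
expr = rng.normal(size=(6, 20))

G = genomic_relationship(snps)   # would enter a GBLUP-style model
T = linear_kernel(expr)          # linear transcriptome kernel (GTBLUP-style)
K = gaussian_kernel(expr)        # Gaussian transcriptome kernel (GRBLUP-style)
```

The linear kernel captures only additive effects of transcript levels, whereas the Gaussian kernel lets similarity decay nonlinearly with distance, which is one plausible reason the two transcriptome models can differ in the variance they explain.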
Affiliation(s)
- Zhengcao Li
- Animal Breeding and Genetics Group, Department of Animal Sciences, Center for Integrated Breeding Research, University of Göttingen, Göttingen, Germany
- Ning Gao
- State Key Laboratory of Biocontrol, Guangzhou Higher Education Mega Center, School of Life Science, Sun Yat-sen University, Guangzhou, China
- Henner Simianer
- Animal Breeding and Genetics Group, Department of Animal Sciences, Center for Integrated Breeding Research, University of Göttingen, Göttingen, Germany