1
|
Huang W, Liu H. Predicting single-cell cellular responses to perturbations using cycle consistency learning. Bioinformatics 2024; 40:i462-i470. [PMID: 38940153 DOI: 10.1093/bioinformatics/btae248] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/29/2024] Open
Abstract
SUMMARY Phenotype-based drug screening emerges as a powerful approach for identifying compounds that actively interact with cells. Transcriptional and proteomic profiling of cell lines and individual cells provide insights into the cellular state alterations that occur at the molecular level in response to external perturbations, such as drugs or genetic manipulations. In this paper, we propose cycleCDR, a novel deep learning framework to predict cellular response to external perturbations. We leverage the autoencoder to map the unperturbed cellular states to a latent space, in which we postulate the effects of drug perturbations on cellular states follow a linear additive model. Next, we introduce the cycle consistency constraints to ensure that unperturbed cellular state subjected to drug perturbation in the latent space would produces the perturbed cellular state through the decoder. Conversely, removal of perturbations from the perturbed cellular states can restore the unperturbed cellular state. The cycle consistency constraints and linear modeling in the latent space enable to learn transferable representations of external perturbations, so that our model can generalize well to unseen drugs during training stage. We validate our model on four different types of datasets, including bulk transcriptional responses, bulk proteomic responses, and single-cell transcriptional responses to drug/gene perturbations. The experimental results demonstrate that our model consistently outperforms existing state-of-the-art methods, indicating our method is highly versatile and applicable to a wide range of scenarios. AVAILABILITY AND IMPLEMENTATION The source code is available at: https://github.com/hliulab/cycleCDR.
Collapse
Affiliation(s)
- Wei Huang
- College of Computer and Information Engineering, Nanjing Tech University, Nanjing, Jiangsu 211816, China
| | - Hui Liu
- College of Computer and Information Engineering, Nanjing Tech University, Nanjing, Jiangsu 211816, China
| |
Collapse
|
2
|
Zhang S, Ma B, Liu Y, Shen Y, Li D, Liu S, Song F. Predicting locus-specific DNA methylation levels in cancer and paracancer tissues. Epigenomics 2024; 16:549-570. [PMID: 38477028 PMCID: PMC11158003 DOI: 10.2217/epi-2023-0114] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/03/2023] [Accepted: 02/20/2024] [Indexed: 03/14/2024] Open
Abstract
Aim: To predict base-resolution DNA methylation in cancerous and paracancerous tissues. Material & methods: We collected six cancer DNA methylation datasets from The Cancer Genome Atlas and five cancer datasets from Gene Expression Omnibus and established machine learning models using paired cancerous and paracancerous tissues. Tenfold cross-validation and independent validation were performed to demonstrate the effectiveness of the proposed method. Results: The developed cross-tissue prediction models can substantially increase the accuracy at more than 68% of CpG sites and contribute to enhancing the statistical power of differential methylation analyses. An XGBoost model leveraging multiple correlating CpGs may elevate the prediction accuracy. Conclusion: This study provides a powerful tool for DNA methylation analysis and has the potential to gain new insights into cancer research from epigenetics.
Collapse
Affiliation(s)
- Shuzheng Zhang
- School of Information Science & Technology, Dalian Maritime University, Dalian, 116026, China
| | - Baoshan Ma
- School of Information Science & Technology, Dalian Maritime University, Dalian, 116026, China
| | - Yu Liu
- School of Information Science & Technology, Dalian Maritime University, Dalian, 116026, China
| | - Yiwen Shen
- School of Information Science & Technology, Dalian Maritime University, Dalian, 116026, China
| | - Di Li
- Department of Neuro Intervention, Dalian Medical University affiliated Dalian Municipal Central Hospital, Dalian, 116033, China
| | - Shuxin Liu
- Department of Nephrology, Dalian Medical University affiliated Dalian Municipal Central Hospital, Dalian, 116033, China
| | - Fengju Song
- Department of Epidemiology & Biostatistics, Key Laboratory of Molecular Cancer Epidemiology, Tianjin, National Clinical Research Center of Cancer, Tianjin Medical University Cancer Institute & Hospital, Tianjin, 300060, China
| |
Collapse
|
3
|
Ferreira MADM, Silveira WBD, Nikoloski Z. Protein constraints in genome-scale metabolic models: Data integration, parameter estimation, and prediction of metabolic phenotypes. Biotechnol Bioeng 2024; 121:915-930. [PMID: 38178617 DOI: 10.1002/bit.28650] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/17/2022] [Revised: 10/24/2023] [Accepted: 12/18/2023] [Indexed: 01/06/2024]
Abstract
Genome-scale metabolic models provide a valuable resource to study metabolism and cell physiology. These models are employed with approaches from the constraint-based modeling framework to predict metabolic and physiological phenotypes. The prediction performance of genome-scale metabolic models can be improved by including protein constraints. The resulting protein-constrained models consider data on turnover numbers (kcat ) and facilitate the integration of protein abundances. In this systematic review, we present and discuss the current state-of-the-art regarding the estimation of kinetic parameters used in protein-constrained models. We also highlight how data-driven and constraint-based approaches can aid the estimation of turnover numbers and their usage in improving predictions of cellular phenotypes. Finally, we identify standing challenges in protein-constrained metabolic models and provide a perspective regarding future approaches to improve the predictive performance.
Collapse
Affiliation(s)
| | | | - Zoran Nikoloski
- Bioinformatics, Institute of Biochemistry and Biology, University of Potsdam, Potsdam, Germany
- Systems Biology and Mathematical Modeling, Max Planck Institute of Molecular Plant Physiology, Potsdam, Germany
| |
Collapse
|
4
|
Moura Ferreira MAD, Wendering P, Arend M, Batista da Silveira W, Nikoloski Z. Accurate prediction of in vivo protein abundances by coupling constraint-based modelling and machine learning. Metab Eng 2023; 80:184-192. [PMID: 37802292 DOI: 10.1016/j.ymben.2023.09.014] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/28/2023] [Revised: 09/10/2023] [Accepted: 09/25/2023] [Indexed: 10/08/2023]
Abstract
Quantification of how different environmental cues affect protein allocation can provide important insights for understanding cell physiology. While absolute quantification of proteins can be obtained by resource-intensive mass-spectrometry-based technologies, prediction of protein abundances offers another way to obtain insights into protein allocation. Here we present CAMEL, a framework that couples constraint-based modelling with machine learning to predict protein abundance for any environmental condition. This is achieved by building machine learning models that leverage static features, derived from protein sequences, and condition-dependent features predicted from protein-constrained metabolic models. Our findings demonstrate that CAMEL results in excellent prediction of protein allocation in E. coli (average Pearson correlation of at least 0.9), and moderate performance in S. cerevisiae (average Pearson correlation of at least 0.5). Therefore, CAMEL outperformed contending approaches without using molecular read-outs from unseen conditions and provides a valuable tool for using protein allocation in biotechnological applications.
Collapse
Affiliation(s)
| | - Philipp Wendering
- Bioinformatics, Institute of Biochemistry and Biology, University of Potsdam, Potsdam, 14476, Germany; Systems Biology and Mathematical Modelling, Max Planck Institute of Molecular Plant Physiology, Potsdam, 14476, Germany
| | - Marius Arend
- Bioinformatics, Institute of Biochemistry and Biology, University of Potsdam, Potsdam, 14476, Germany; Systems Biology and Mathematical Modelling, Max Planck Institute of Molecular Plant Physiology, Potsdam, 14476, Germany
| | | | - Zoran Nikoloski
- Bioinformatics, Institute of Biochemistry and Biology, University of Potsdam, Potsdam, 14476, Germany; Systems Biology and Mathematical Modelling, Max Planck Institute of Molecular Plant Physiology, Potsdam, 14476, Germany.
| |
Collapse
|
5
|
Wang XY, Xu YM, Lau ATY. Proteogenomics in Cancer: Then and Now. J Proteome Res 2023; 22:3103-3122. [PMID: 37725793 DOI: 10.1021/acs.jproteome.3c00196] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/21/2023]
Abstract
For years, the paths of sequencing technologies and mass spectrometry have occurred in isolation, with each developing its own unique culture and expertise. These two technologies are crucial for inspecting complementary aspects of the molecular phenotype across the central dogma. Integrative multiomics strives to bridge the analysis gap among different fields to complete more comprehensive mechanisms of life events and diseases. Proteogenomics is one integrated multiomics field. Here in this review, we mainly summarize and discuss three aspects: workflow of proteogenomics, proteogenomics applications in cancer research, and the SWOT (Strengths, Weaknesses, Opportunities, Threats) analysis of proteogenomics in cancer research. In conclusion, proteogenomics has a promising future as it clarifies the functional consequences of many unannotated genomic abnormalities or noncanonical variants and identifies driver genes and novel therapeutic targets across cancers, which would substantially accelerate the development of precision oncology.
Collapse
Affiliation(s)
- Xiu-Yun Wang
- Laboratory of Cancer Biology and Epigenetics, Department of Cell Biology and Genetics, Shantou University Medical College, Shantou, Guangdong 515041, People's Republic of China
| | - Yan-Ming Xu
- Laboratory of Cancer Biology and Epigenetics, Department of Cell Biology and Genetics, Shantou University Medical College, Shantou, Guangdong 515041, People's Republic of China
| | - Andy T Y Lau
- Laboratory of Cancer Biology and Epigenetics, Department of Cell Biology and Genetics, Shantou University Medical College, Shantou, Guangdong 515041, People's Republic of China
| |
Collapse
|
6
|
Varshney N, Mishra AK. Deep Learning in Phosphoproteomics: Methods and Application in Cancer Drug Discovery. Proteomes 2023; 11:proteomes11020016. [PMID: 37218921 DOI: 10.3390/proteomes11020016] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/13/2023] [Revised: 04/24/2023] [Accepted: 04/25/2023] [Indexed: 05/24/2023] Open
Abstract
Protein phosphorylation is a key post-translational modification (PTM) that is a central regulatory mechanism of many cellular signaling pathways. Several protein kinases and phosphatases precisely control this biochemical process. Defects in the functions of these proteins have been implicated in many diseases, including cancer. Mass spectrometry (MS)-based analysis of biological samples provides in-depth coverage of phosphoproteome. A large amount of MS data available in public repositories has unveiled big data in the field of phosphoproteomics. To address the challenges associated with handling large data and expanding confidence in phosphorylation site prediction, the development of many computational algorithms and machine learning-based approaches have gained momentum in recent years. Together, the emergence of experimental methods with high resolution and sensitivity and data mining algorithms has provided robust analytical platforms for quantitative proteomics. In this review, we compile a comprehensive collection of bioinformatic resources used for the prediction of phosphorylation sites, and their potential therapeutic applications in the context of cancer.
Collapse
Affiliation(s)
- Neha Varshney
- Division of Biological Sciences, Department of Cellular and Molecular Medicine, University of California, San Diego, CA 93093, USA
- Ludwig Institute for Cancer Research, La Jolla, CA 92093, USA
| | - Abhinava K Mishra
- Molecular, Cellular and Developmental Biology Department, University of California, Santa Barbara, CA 93106, USA
| |
Collapse
|
7
|
Srivastava H, Lippincott MJ, Currie J, Canfield R, Lam MPY, Lau E. Protein prediction models support widespread post-transcriptional regulation of protein abundance by interacting partners. PLoS Comput Biol 2022; 18:e1010702. [PMID: 36356032 PMCID: PMC9681107 DOI: 10.1371/journal.pcbi.1010702] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/11/2022] [Revised: 11/22/2022] [Accepted: 11/01/2022] [Indexed: 11/12/2022] Open
Abstract
Protein and mRNA levels correlate only moderately. The availability of proteogenomics data sets with protein and transcript measurements from matching samples is providing new opportunities to assess the degree to which protein levels in a system can be predicted from mRNA information. Here we examined the contributions of input features in protein abundance prediction models. Using large proteogenomics data from 8 cancer types within the Clinical Proteomic Tumor Analysis Consortium (CPTAC) data set, we trained models to predict the abundance of over 13,000 proteins using matching transcriptome data from up to 958 tumor or normal adjacent tissue samples each, and compared predictive performances across algorithms, data set sizes, and input features. Over one-third of proteins (4,648) showed relatively poor predictability (elastic net r ≤ 0.3) from their cognate transcripts. Moreover, we found widespread occurrences where the abundance of a protein is considerably less well explained by its own cognate transcript level than that of one or more trans locus transcripts. The incorporation of additional trans-locus transcript abundance data as input features increasingly improved the ability to predict sample protein abundance. Transcripts that contribute to non-cognate protein abundance primarily involve those encoding known or predicted interaction partners of the protein of interest, including not only large multi-protein complexes as previously shown, but also small stable complexes in the proteome with only one or few stable interacting partners. Network analysis further shows a complex proteome-wide interdependency of protein abundance on the transcript levels of multiple interacting partners. The predictive model analysis here therefore supports that protein-protein interaction including in small protein complexes exert post-transcriptional influence on proteome compositions more broadly than previously recognized. Moreover, the results suggest mRNA and protein co-expression analysis may have utility for finding gene interactions and predicting expression changes in biological systems.
Collapse
Affiliation(s)
- Himangi Srivastava
- Department of Medicine/Cardiology, University of Colorado School of Medicine, Aurora, Colorado, United States of America
- Consortium for Fibrosis Research and Translation, University of Colorado School of Medicine, Aurora, Colorado, United States of America
| | - Michael J. Lippincott
- Department of Medicine/Cardiology, University of Colorado School of Medicine, Aurora, Colorado, United States of America
| | - Jordan Currie
- Department of Medicine/Cardiology, University of Colorado School of Medicine, Aurora, Colorado, United States of America
| | - Robert Canfield
- Department of Medicine/Cardiology, University of Colorado School of Medicine, Aurora, Colorado, United States of America
| | - Maggie P. Y. Lam
- Department of Medicine/Cardiology, University of Colorado School of Medicine, Aurora, Colorado, United States of America
- Consortium for Fibrosis Research and Translation, University of Colorado School of Medicine, Aurora, Colorado, United States of America
- Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Aurora, Colorado, United States of America
| | - Edward Lau
- Department of Medicine/Cardiology, University of Colorado School of Medicine, Aurora, Colorado, United States of America
- Consortium for Fibrosis Research and Translation, University of Colorado School of Medicine, Aurora, Colorado, United States of America
| |
Collapse
|
8
|
Upadhya SR, Ryan CJ. Experimental reproducibility limits the correlation between mRNA and protein abundances in tumor proteomic profiles. CELL REPORTS METHODS 2022; 2:100288. [PMID: 36160043 PMCID: PMC9499981 DOI: 10.1016/j.crmeth.2022.100288] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 02/12/2022] [Revised: 07/14/2022] [Accepted: 08/16/2022] [Indexed: 11/21/2022]
Abstract
Large-scale studies of human proteomes have revealed only a moderate correlation between mRNA and protein abundances. It is unclear to what extent this moderate correlation reflects post-transcriptional regulation and to what extent it reflects measurement error. Here, by analyzing replicate profiles of tumors and cell lines, we show that there is considerable variation in the reproducibility of measurements of transcripts and proteins from individual genes. Proteins with more reproducible measurements tend to have a higher mRNA-protein correlation, suggesting that measurement reproducibility accounts for a substantial fraction of the unexplained variation between mRNA and protein abundances. The reproducibility of individual proteins is somewhat consistent across studies, and we exploit this to develop an aggregate reproducibility score that explains a substantial amount of the variation in mRNA-protein correlations across multiple studies. Finally, we show that pathways previously reported to have a higher-than-average mRNA-protein correlation may simply contain members that can be more reproducibly quantified.
Collapse
Affiliation(s)
- Swathi Ramachandra Upadhya
- School of Computer Science, University College Dublin, Dublin, Ireland
- Systems Biology Ireland, University College Dublin, Dublin, Ireland
| | - Colm J. Ryan
- School of Computer Science, University College Dublin, Dublin, Ireland
- Systems Biology Ireland, University College Dublin, Dublin, Ireland
| |
Collapse
|
9
|
Abstract
The advancement of precision medicine in medical care has led behind the conventional symptom-driven treatment process by allowing early risk prediction of disease through improved diagnostics and customization of more effective treatments. It is necessary to scrutinize overall patient data alongside broad factors to observe and differentiate between ill and relatively healthy people to take the most appropriate path toward precision medicine, resulting in an improved vision of biological indicators that can signal health changes. Precision and genomic medicine combined with artificial intelligence have the potential to improve patient healthcare. Patients with less common therapeutic responses or unique healthcare demands are using genomic medicine technologies. AI provides insights through advanced computation and inference, enabling the system to reason and learn while enhancing physician decision making. Many cell characteristics, including gene up-regulation, proteins binding to nucleic acids, and splicing, can be measured at high throughput and used as training objectives for predictive models. Researchers can create a new era of effective genomic medicine with the improved availability of a broad range of datasets and modern computer techniques such as machine learning. This review article has elucidated the contributions of ML algorithms in precision and genome medicine.
Collapse
Affiliation(s)
- Sameer Quazi
- GenLab Biosolutions Private Limited, Bangalore, Karnataka, 560043, India.
- Department of Biomedical Sciences, School of Life Sciences, Anglia Ruskin University, Cambridge, UK.
| |
Collapse
|
10
|
Xu W, He H, Guo Z, Li W. Evaluation of machine learning models on protein level inference from prioritized RNA features. Brief Bioinform 2022; 23:6555405. [PMID: 35352096 DOI: 10.1093/bib/bbac091] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/01/2021] [Revised: 02/16/2022] [Accepted: 02/23/2022] [Indexed: 11/12/2022] Open
Abstract
The parallel measurement of transcriptome and proteome revealed unmatched profiles. Since proteomic analysis is more expensive and challenging than transcriptomic analysis, the question of how to use messenger RNA (mRNA) expression data to predict protein level is extremely important. Here, we comprehensively evaluated 13 machine learning models on inferring protein expression levels using RNA expression profile. A total of 20 proteogenomic datasets from three mainstream proteomic platforms with >2500 samples of 13 human tissues were collected for model evaluation. Our results highlighted that the appropriate feature selection methods combined with classical machine learning models could achieve excellent predictive performance. The voting ensemble model outperformed other candidate models across datasets. Adding the mRNA proxy model to the regression model further improved the prediction performance. The dataset and gene characteristics could affect the prediction performance. Finally, we applied the model to the brain transcriptome of cerebral cortex regions to infer the protein profile for better understanding the functional characteristics of the brain regions. This benchmarking work not only provides useful hints on the inherent correlation between transcriptome and proteome, but also has practical value of the transcriptome-based prediction of protein expression levels.
Collapse
Affiliation(s)
- Wenjian Xu
- Beijing Key Laboratory for Genetics of Birth Defects, Beijing Pediatric Research Institute; MOE Key Laboratory of Major Diseases in Children; Rare Disease Center, Beijing Children's Hospital, Capital Medical University, National Center for Children's Health, Beijing 100045, China
| | - Haochen He
- Department of Radiation Protection and Health Physics, Beijing Institute of Radiation Medicine, Beijing 100850, China
| | - Zhengguang Guo
- Core Facility of Instruments, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences, School of Basic Medicine, Peking Union Medical College, 5 Dong Dan San Tiao, Beijing 100005, China
| | - Wei Li
- Beijing Key Laboratory for Genetics of Birth Defects, Beijing Pediatric Research Institute; MOE Key Laboratory of Major Diseases in Children; Rare Disease Center, Beijing Children's Hospital, Capital Medical University, National Center for Children's Health, Beijing 100045, China
| |
Collapse
|
11
|
Han Y, Li LZ, Kastury NL, Thomas CT, Lam MPY, Lau E. Transcriptome features of striated muscle aging and predictability of protein level changes. Mol Omics 2021; 17:796-808. [PMID: 34328155 PMCID: PMC8511094 DOI: 10.1039/d1mo00178g] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/23/2022]
Abstract
We performed total RNA sequencing and multi-omics analysis comparing skeletal muscle and cardiac muscle in young adult (4 months) vs. early aging (20 months) mice to examine the molecular mechanisms of striated muscle aging. We observed that aging cardiac and skeletal muscles both invoke transcriptomic changes in innate immune system and mitochondria pathways but diverge in extracellular matrix processes. On an individual gene level, we identified 611 age-associated signatures in skeletal and cardiac muscles, including a number of myokine and cardiokine encoding genes. Because RNA and protein levels correlate only partially, we reason that differentially expressed transcripts that accurately reflect their protein counterparts will be more valuable proxies for proteomic changes and by extension physiological states. We applied a computational data analysis workflow to estimate which transcriptomic changes are more likely relevant to protein-level regulation using large proteogenomics data sets. We estimate about 48% of the aging-associated transcripts predict protein levels well (r ≥ 0.5). In parallel, a comparison of the identified aging-regulated genes with public human transcriptomics data showed that only 35-45% of the identified genes show an age-dependent expression in corresponding human tissues. Thus, integrating both RNA-protein correlation and human conservation across data sources, we nominate 134 prioritized aging striated muscle signatures that are predicted to correlate strongly with protein levels and that show age-dependent expression in humans. The results here reveal new details into how aging reshapes gene expression in striated muscles at the transcript and protein levels.
Collapse
Affiliation(s)
- Yu Han
- Department of Medicine, Consortium for Fibrosis Research & Translation, University of Colorado School of Medicine, Aurora, CO 80045, USA.
| | - Lauren Z Li
- Department of Medicine, Consortium for Fibrosis Research & Translation, University of Colorado School of Medicine, Aurora, CO 80045, USA.
| | - Nikhitha L Kastury
- Department of Medicine, Consortium for Fibrosis Research & Translation, University of Colorado School of Medicine, Aurora, CO 80045, USA.
| | - Cody T Thomas
- Department of Medicine, Consortium for Fibrosis Research & Translation, University of Colorado School of Medicine, Aurora, CO 80045, USA.
| | - Maggie P Y Lam
- Department of Medicine, Consortium for Fibrosis Research & Translation, University of Colorado School of Medicine, Aurora, CO 80045, USA.
- Department of Biochemistry and Molecular Genetics, Consortium for Fibrosis Research & Translation, University of Colorado School of Medicine, Aurora, CO 80045, USA
| | - Edward Lau
- Department of Medicine, Consortium for Fibrosis Research & Translation, University of Colorado School of Medicine, Aurora, CO 80045, USA.
| |
Collapse
|
12
|
Dai X, Xu F, Wang S, Mundra PA, Zheng J. PIKE-R2P: Protein-protein interaction network-based knowledge embedding with graph neural network for single-cell RNA to protein prediction. BMC Bioinformatics 2021; 22:139. [PMID: 34078261 PMCID: PMC8170782 DOI: 10.1186/s12859-021-04022-w] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/08/2021] [Accepted: 02/11/2021] [Indexed: 12/05/2022] Open
Abstract
Background Recent advances in simultaneous measurement of RNA and protein abundances at single-cell level provide a unique opportunity to predict protein abundance from scRNA-seq data using machine learning models. However, existing machine learning methods have not considered relationship among the proteins sufficiently. Results We formulate this task in a multi-label prediction framework where multiple proteins are linked to each other at the single-cell level. Then, we propose a novel method for single-cell RNA to protein prediction named PIKE-R2P, which incorporates protein–protein interactions (PPI) and prior knowledge embedding into a graph neural network. Compared with existing methods, PIKE-R2P could significantly improve prediction performance in terms of smaller errors and higher correlations with the gold standard measurements. Conclusion The superior performance of PIKE-R2P indicates that adding the prior knowledge of PPI to graph neural networks can be a powerful strategy for cross-modality prediction of protein abundances at the single-cell level.
Collapse
Affiliation(s)
- Xinnan Dai
- School of Information Science and Technology, ShanghaiTech University, 393 Middle Huaxia Road, Pudong District, Shanghai, 201210, China
| | - Fan Xu
- School of Information Science and Technology, ShanghaiTech University, 393 Middle Huaxia Road, Pudong District, Shanghai, 201210, China
| | - Shike Wang
- School of Information Science and Technology, ShanghaiTech University, 393 Middle Huaxia Road, Pudong District, Shanghai, 201210, China
| | - Piyushkumar A Mundra
- Molecular Oncology Group, Cancer Research UK Manchester Institute, The University of Manchester, Alderley Park, Manchester, UK
| | - Jie Zheng
- School of Information Science and Technology, ShanghaiTech University, 393 Middle Huaxia Road, Pudong District, Shanghai, 201210, China.
| |
Collapse
|
13
|
Patel SK, George B, Rai V. Artificial Intelligence to Decode Cancer Mechanism: Beyond Patient Stratification for Precision Oncology. Front Pharmacol 2020; 11:1177. [PMID: 32903628 PMCID: PMC7438594 DOI: 10.3389/fphar.2020.01177] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/25/2020] [Accepted: 07/20/2020] [Indexed: 12/13/2022] Open
Abstract
The multitude of multi-omics data generated cost-effectively using advanced high-throughput technologies has imposed challenging domain for research in Artificial Intelligence (AI). Data curation poses a significant challenge as different parameters, instruments, and sample preparations approaches are employed for generating these big data sets. AI could reduce the fuzziness and randomness in data handling and build a platform for the data ecosystem, and thus serve as the primary choice for data mining and big data analysis to make informed decisions. However, AI implication remains intricate for researchers/clinicians lacking specific training in computational tools and informatics. Cancer is a major cause of death worldwide, accounting for an estimated 9.6 million deaths in 2018. Certain cancers, such as pancreatic and gastric cancers, are detected only after they have reached their advanced stages with frequent relapses. Cancer is one of the most complex diseases affecting a range of organs with diverse disease progression mechanisms and the effectors ranging from gene-epigenetics to a wide array of metabolites. Hence a comprehensive study, including genomics, epi-genomics, transcriptomics, proteomics, and metabolomics, along with the medical/mass-spectrometry imaging, patient clinical history, treatments provided, genetics, and disease endemicity, is essential. Cancer Moonshot℠ Research Initiatives by NIH National Cancer Institute aims to collect as much information as possible from different regions of the world and make a cancer data repository. AI could play an immense role in (a) analysis of complex and heterogeneous data sets (multi-omics and/or inter-omics), (b) data integration to provide a holistic disease molecular mechanism, (c) identification of diagnostic and prognostic markers, and (d) monitor patient's response to drugs/treatments and recovery. AI enables precision disease management well beyond the prevalent disease stratification patterns, such as differential expression and supervised classification. This review highlights critical advances and challenges in omics data analysis, dealing with data variability from lab-to-lab, and data integration. We also describe methods used in data mining and AI methods to obtain robust results for precision medicine from "big" data. In the future, AI could be expanded to achieve ground-breaking progress in disease management.
Collapse
Affiliation(s)
- Sandip Kumar Patel
- Department of Biosciences and Bioengineering, Indian Institute of Technology Bombay, Mumbai, India
- Buck Institute for Research on Aging, Novato, CA, United States
| | - Bhawana George
- Department of Hematopathology, The University of Texas MD Anderson Cancer Center, Houston, TX, United States
| | - Vineeta Rai
- Department of Entomology & Plant Pathology, North Carolina State University, Raleigh, NC, United States
| |
Collapse
|