1
Paul D, Sinnarasan VSP, Das R, Sheikh MMR, Venkatesan A. Machine learning approach to predict blood-secretory proteins and potential biomarkers for liver cancer using omics data. J Proteomics 2024; 309:105298. [PMID: 39216516 DOI: 10.1016/j.jprot.2024.105298] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/12/2024] [Revised: 08/22/2024] [Accepted: 08/29/2024] [Indexed: 09/04/2024]
Abstract
Identifying non-invasive blood-based biomarkers is crucial for early detection and monitoring of liver cancer (LC), thereby improving patient outcomes. This study leveraged computational approaches to predict potential blood-based biomarkers for LC. Machine learning (ML) models were developed using selected features from blood-secretory proteins collected from the curated databases. The logistic regression (LR) model demonstrated the optimal performance. Transcriptome analysis across 7 LC cohorts revealed 231 common differentially expressed genes (DEGs). The encoded proteins of these DEGs were compared with the ML dataset, revealing 29 proteins overlapping with the blood-secretory dataset. The LR model also predicted 29 additional proteins as blood-secretory among the remaining protein-coding genes. As a result, 58 potential blood-secretory proteins were obtained. Among the top 20 genes, 13 common hub genes were identified. Further, area under the receiver operating characteristic curve (ROC AUC) analysis was performed to assess the genes as potential diagnostic blood biomarkers. Six genes, ESM1, FCN2, MDK, GPC3, CTHRC1 and COL6A6, exhibited an AUC value higher than 0.85 and were predicted as blood-secretory. This study highlights the potential of an integrative computational approach for discovering non-invasive blood-based biomarkers in LC, facilitating further validation and clinical translation.
SIGNIFICANCE: Liver cancer is one of the leading causes of premature death worldwide, with its prevalence and mortality rates projected to increase. Although current diagnostic methods are highly sensitive, they are invasive and unsuitable for repeated testing. Blood biomarkers offer a promising non-invasive alternative, but their wide dynamic range of protein concentration poses experimental challenges. Therefore, utilizing available omics data to develop a diagnostic model could provide a potential solution for accurate diagnosis. This study developed a computational method integrating machine learning and bioinformatics analysis to identify potential blood biomarkers. As a result, ESM1, FCN2, MDK, GPC3, CTHRC1 and COL6A6 biomarkers were identified, holding significant promise for improving diagnosis and understanding of liver cancer. The integrated method can be applied to other cancers, offering a possible solution for early detection and improved patient outcomes.
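The final filtering step described in this abstract (keeping genes whose ROC AUC exceeds 0.85 and that are also predicted to be blood-secretory) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the helper names `roc_auc` and `select_candidates`, the input layout, and the choice to treat a gene that is lower in tumours as equally discriminative are all assumptions made for illustration.

```python
def roc_auc(labels, scores):
    """ROC AUC computed as the probability that a random positive sample
    scores higher than a random negative sample (ties count as 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def select_candidates(expression, labels, predicted_secretory, threshold=0.85):
    """Return {gene: auc} for genes passing both filters.

    expression          -- hypothetical dict: gene -> per-sample expression values
    labels              -- 1 for tumour samples, 0 for controls
    predicted_secretory -- hypothetical set of genes flagged by a secretion classifier
    threshold           -- AUC cut-off (0.85 in the abstract)
    """
    selected = {}
    for gene, values in expression.items():
        auc = roc_auc(labels, values)
        # Assumption: direction of change is ignored, so a down-regulated
        # gene with AUC 0.1 is scored as 0.9.
        auc = max(auc, 1.0 - auc)
        if auc > threshold and gene in predicted_secretory:
            selected[gene] = auc
    return selected
```

With toy inputs, a call such as `select_candidates({"GENE_A": [5, 6, 7, 1, 2, 3]}, [1, 1, 1, 0, 0, 0], {"GENE_A"})` keeps `GENE_A`, because its expression separates the two groups perfectly.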
Affiliation(s)
- Dahrii Paul
  - Department of Bioinformatics, Pondicherry University, Puducherry 605014, India
- Rajesh Das
  - Department of Bioinformatics, Pondicherry University, Puducherry 605014, India
- Amouda Venkatesan
  - Department of Bioinformatics, Pondicherry University, Puducherry 605014, India
2
Zhou S, Zhou Y, Liu T, Zheng J, Jia C. PredLLPS_PSSM: a novel predictor for liquid-liquid protein separation identification based on evolutionary information and a deep neural network. Brief Bioinform 2023; 24:bbad299. [PMID: 37609923 DOI: 10.1093/bib/bbad299] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/22/2023] [Revised: 08/01/2023] [Accepted: 08/02/2023] [Indexed: 08/24/2023] Open
Abstract
The formation of biomolecular condensates by liquid-liquid phase separation (LLPS) has become a universal mechanism for spatiotemporal coordination of biological activities in cells and has been widely observed to directly regulate the key cellular processes involved in cancer cell pathology. However, the complexity of protein sequences and the diversity of conformations are inherently disordered, which poses great challenges for LLPS protein calculations and experimental research. Herein, we proposed a novel predictor named PredLLPS_PSSM for LLPS protein identification based only on sequence evolution information. Because finding real and reliable samples is the cornerstone of building predictors, we collected anew and collated the LLPS proteins from the latest versions of three databases. By comparing the performance of the position-specific score matrix (PSSM) and word embedding, PredLLPS_PSSM combined PSSM-based information and two deep learning frameworks. Independent tests using three existing independent test datasets and two newly constructed independent test datasets demonstrated the superiority of PredLLPS_PSSM compared with state-of-the-art methods. Furthermore, we tested PredLLPS_PSSM on nine experimentally identified LLPS proteins from three insects that were not included in any of the databases. In addition, the powerful Shapley Additive exPlanation algorithm and heatmap were applied to find the most critical amino acids relevant to LLPS.
Affiliation(s)
- Shengming Zhou
  - School of Science, Dalian Maritime University, Dalian 116026, China
- Yetong Zhou
  - School of Science, Dalian Maritime University, Dalian 116026, China
- Tian Liu
  - School of Bioengineering, Dalian University of Technology, Dalian 116024, China
- Jia Zheng
  - School of Science, Dalian Maritime University, Dalian 116026, China
- Cangzhi Jia
  - School of Science, Dalian Maritime University, Dalian 116026, China
3
Waury K, de Wit R, Verberk IMW, Teunissen CE, Abeln S. Deciphering Protein Secretion from the Brain to Cerebrospinal Fluid for Biomarker Discovery. J Proteome Res 2023; 22:3068-3080. [PMID: 37606934 PMCID: PMC10476268 DOI: 10.1021/acs.jproteome.3c00366] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/19/2023] [Indexed: 08/23/2023]
Abstract
Cerebrospinal fluid (CSF) is an essential matrix for the discovery of neurological disease biomarkers. However, the high dynamic range of protein concentrations in CSF hinders the detection of the least abundant protein biomarkers by untargeted mass spectrometry. It is thus beneficial to gain a deeper understanding of the secretion processes within the brain. Here, we aim to explore if and how the secretion of brain proteins to the CSF can be predicted. By combining a curated CSF proteome and the brain elevated proteome of the Human Protein Atlas, brain proteins were classified as CSF or non-CSF secreted. A machine learning model was trained on a range of sequence-based features to differentiate between CSF and non-CSF groups and effectively predict the brain origin of proteins. The classification model achieves an area under the curve of 0.89 if using high confidence CSF proteins. The most important prediction features include the subcellular localization, signal peptides, and transmembrane regions. The classifier generalized well to the larger brain detected proteome and is able to correctly predict novel CSF proteins identified by affinity proteomics. In addition to elucidating the underlying mechanisms of protein secretion, the trained classification model can support biomarker candidate selection.
Affiliation(s)
- Katharina Waury
  - Department of Computer Science, Vrije Universiteit Amsterdam, 1081 HV Amsterdam, The Netherlands
- Renske de Wit
  - Department of Computer Science, Vrije Universiteit Amsterdam, 1081 HV Amsterdam, The Netherlands
- Inge M. W. Verberk
  - Neurochemistry Laboratory, Department of Clinical Chemistry, Amsterdam Neuroscience, VU University Medical Center, Amsterdam UMC, 1081 HV Amsterdam, The Netherlands
- Charlotte E. Teunissen
  - Neurochemistry Laboratory, Department of Clinical Chemistry, Amsterdam Neuroscience, VU University Medical Center, Amsterdam UMC, 1081 HV Amsterdam, The Netherlands
- Sanne Abeln
  - Department of Computer Science, Vrije Universiteit Amsterdam, 1081 HV Amsterdam, The Netherlands
4
He K, Wang Y, Xie X, Shao D. Prediction of Proteins in Cerebrospinal Fluid and Application to Glioma Biomarker Identification. Molecules 2023; 28:molecules28083617. [PMID: 37110850 PMCID: PMC10144833 DOI: 10.3390/molecules28083617] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/20/2023] [Revised: 04/18/2023] [Accepted: 04/19/2023] [Indexed: 04/29/2023] Open
Abstract
Cerebrospinal fluid (CSF) proteins are very important because they can serve as biomarkers for central nervous system diseases. Although many CSF proteins have been identified with wet experiments, the identification of CSF proteins is still a challenge. In this paper, we propose a novel method to predict proteins in CSF based on protein features. A two-stage feature-selection method is employed to remove irrelevant features and redundant features. The deep neural network and bagging method are used to construct the model for the prediction of CSF proteins. The experiment results on the independent testing dataset demonstrate that our method performs better than other methods in the prediction of CSF proteins. Furthermore, our method is also applied to the identification of glioma biomarkers. A differentially expressed gene analysis is performed on the glioma data. After combining the analysis results with the prediction results of our model, the biomarkers of glioma are identified successfully.
Affiliation(s)
- Kai He
  - Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
- Yan Wang
  - Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
  - School of Artificial Intelligence, Jilin University, Changchun 130012, China
- Xuping Xie
  - Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
- Dan Shao
  - College of Computer Science and Technology, Changchun University, Changchun 130022, China
5
MultiSec: Multi-Task Deep Learning Improves Secreted Protein Discovery in Human Body Fluids. MATHEMATICS 2022. [DOI: 10.3390/math10152562] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/04/2023]
Abstract
Prediction of secreted proteins in human body fluids is essential since secreted proteins hold promise as disease biomarkers. Various approaches have been proposed to predict whether a protein is secreted into a specific fluid by its sequence. However, there may be relationships between different human body fluids when proteins are secreted into these fluids. Current approaches ignore these relationships directly, and therefore their performances are limited. Here, we present MultiSec, an improved approach for secreted protein discovery to exploit relationships between fluids via multi-task learning. Specifically, a sampling-based balance strategy is proposed to solve imbalance problems in all fluids, an effective network is presented to extract features for all fluids, and multi-objective gradient descent is employed to prevent fluids from hurting each other. MultiSec was trained and tested in 17 human body fluids. The comparison benchmarks on the independent testing datasets demonstrate that our approach outperforms other available approaches in all compared fluids.
6
DenSec: Secreted Protein Prediction in Cerebrospinal Fluid Based on DenseNet and Transformer. MATHEMATICS 2022. [DOI: 10.3390/math10142490] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/04/2023]
Abstract
Cerebrospinal fluid (CSF) exists in the surrounding spaces of mammalian central nervous systems (CNS); therefore, there are numerous potential protein biomarkers associated with CNS disease in CSF. Currently, approximately 4300 proteins have been identified in CSF by protein profiling. However, due to the diverse modifications, as well as the existing technical limits, large-scale protein identification in CSF is still considered a challenge. Inspired by computational methods, this paper proposes a deep learning framework, named DenSec, for secreted protein prediction in CSF. In the first phase of DenSec, all input proteins are encoded as a matrix with a fixed size of 1000 × 20 by calculating a position-specific score matrix (PSSM) of protein sequences. In the second phase, a dense convolutional network (DenseNet) is adopted to extract the feature from these PSSMs automatically. After that, Transformer with a fully connected dense layer acts as classifier to perform a binary classification in terms of secretion into CSF or not. According to the experiment results, DenSec achieves a mean accuracy of 86.00% in the test dataset and outperforms the state-of-the-art methods.
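The first phase described in this abstract, encoding every protein as a fixed 1000 × 20 matrix, can be sketched in Python as below. The abstract does not say how sequences shorter or longer than 1000 residues are handled, so zero-padding and truncation are assumptions here, as is the helper name `to_fixed_size`; the input is taken to be an already computed per-residue PSSM given as a list of 20-value rows.

```python
def to_fixed_size(pssm, max_len=1000, n_cols=20):
    """Pad or truncate a per-residue PSSM to exactly max_len x n_cols.

    pssm -- list of rows, one per residue, each holding n_cols scores.
    Assumption: sequences longer than max_len are cut at max_len and
    shorter ones are padded at the end with all-zero rows.
    """
    for row in pssm:
        if len(row) != n_cols:
            raise ValueError("every PSSM row must have %d columns" % n_cols)

    # Keep at most max_len residues, copying so the input is not modified.
    rows = [[float(v) for v in row] for row in pssm[:max_len]]

    # Append zero rows until the matrix has max_len rows.
    padding = max_len - len(rows)
    rows.extend([0.0] * n_cols for _ in range(padding))
    return rows
```

A fixed shape lets a convolutional network such as the DenseNet mentioned in the abstract consume every protein as a same-sized input, at the cost of discarding residues beyond the cut-off.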
7
Shao D, Dai Y, Li N, Cao X, Zhao W, Cheng L, Rong Z, Huang L, Wang Y, Zhao J. Artificial intelligence in clinical research of cancers. Brief Bioinform 2021; 23:6470966. [PMID: 34929741 PMCID: PMC8769909 DOI: 10.1093/bib/bbab523] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 08/11/2021] [Revised: 11/06/2021] [Accepted: 11/13/2021] [Indexed: 12/16/2022] Open
Abstract
Several factors, including advances in computational algorithms, the availability of high-performance computing hardware, and the assembly of large community-based databases, have led to the extensive application of Artificial Intelligence (AI) in the biomedical domain for nearly 20 years. AI algorithms have attained expert-level performance in cancer research. However, only a few AI-based applications have been approved for use in the real world. Whether AI will eventually be capable of replacing medical experts has been a hot topic. In this article, we first summarize the status of AI-based cancer research over the past two decades, including the consensus on the procedure of AI based on an ideal paradigm and current efforts of the expertise and domain knowledge. Next, the available data of AI process in the biomedical domain are surveyed. Then, we review the methods and applications of AI in cancer clinical research, categorized by data type, including radiographic imaging, cancer genome, medical records, drug information and biomedical literature. Finally, we discuss challenges in moving AI from theoretical research to real-world cancer research applications and the perspectives toward the future realization of AI participating in cancer treatment.
Affiliation(s)
- Dan Shao
  - College of Computer Science and Technology, Key Laboratory of Human Health Status Identification and Function Enhancement of Jilin Province, Changchun University, Changchun 130022, China
- Yinfei Dai
  - College of Computer Science and Technology, Key Laboratory of Human Health Status Identification and Function Enhancement of Jilin Province, Changchun University, Changchun 130022, China
- Nianfeng Li
  - College of Computer Science and Technology, Key Laboratory of Human Health Status Identification and Function Enhancement of Jilin Province, Changchun University, Changchun 130022, China
- Xuqing Cao
  - Department of Neurology, People's Hospital of Ningxia Hui Autonomous Region (The Affiliated people's Hospital of Ningxia Medical University and The First Affiliated Hospital of Northwest Minzu University), Yinchuan 750002, China
- Wei Zhao
  - Department of Biochemistry and Molecular Biology, Ningxia Medical University, Yinchuan 750002, China
- Li Cheng
  - Department of Electrical Diagnosis, Affiliated Hospital of Changchun University of Traditional Chinese Medicine, Changchun, 130021, China
- Zhuqing Rong
  - School of Science, Key Laboratory of Human Health Status Identification and Function Enhancement of Jilin Province, Changchun University, Changchun 130022, China
- Lan Huang
  - Key laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
- Yan Wang
  - Key laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
- Jing Zhao
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, 43210, USA