1
|
Yu JC, Ni K, Chen CT. ENCAP: Computational prediction of tumor T cell antigens with ensemble classifiers and diverse sequence features. PLoS One 2024; 19:e0307176. [PMID: 39024250 PMCID: PMC11257298 DOI: 10.1371/journal.pone.0307176] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/22/2024] [Accepted: 07/01/2024] [Indexed: 07/20/2024] Open
Abstract
Cancer immunotherapy enhances the body's natural immune system to combat cancer, offering the advantage of lowered side effects compared to traditional treatments because of its high selectivity and efficacy. Utilizing computational methods to identify tumor T cell antigens (TTCAs) is valuable in unraveling the biological mechanisms and enhancing the effectiveness of immunotherapy. In this study, we present ENCAP, a predictor for TTCA based on ensemble classifiers and diverse sequence features. Sequences were encoded as a feature vector of 4349 entries based on 57 different feature types, followed by feature engineering and hyperparameter optimization for machine learning models, respectively. The selected feature subsets of ENCAP are primarily composed of physicochemical properties, with several features specifically related to hydrophobicity and amphiphilicity. Two publicly available datasets were used for performance evaluation. ENCAP yields an AUC (Area Under the ROC Curve) of 0.768 and an MCC (Matthew's Correlation Coefficient) of 0.522 on the first independent test set. On the second test set, it achieves an AUC of 0.960 and an MCC of 0.789. Performance evaluations show that ENCAP generates 4.8% and 13.5% improvements in MCC over the state-of-the-art methods on two popular TTCA datasets, respectively. For the third test dataset of 71 experimentally validated TTCAs from the literature, ENCAP yields prediction accuracy of 0.873, achieving improvements ranging from 12% to 25.7% compared to three state-of-the-art methods. In general, the prediction accuracy is higher for sequences of fewer hydrophobic residues, and more hydrophilic and charged residues. The source code of ENCAP is freely available at https://github.com/YnnJ456/ENCAP.
Collapse
Affiliation(s)
- Jen-Chieh Yu
- Department of Bioinformatics and Medical Engineering, Asia University, Taichung, Taiwan
| | - Kuan Ni
- Graduate Institute of Genomics and Bioinformatics, National Chung Hsing University, Taichung, Taiwan
| | - Ching-Tai Chen
- Department of Bioinformatics and Medical Engineering, Asia University, Taichung, Taiwan
- Center for Precision Health Research, Asia University, Taichung, Taiwan
| |
Collapse
|
2
|
Zhang L, Hu X, Xiao K, Kong L. Effective identification and differential analysis of anticancer peptides. Biosystems 2024; 241:105246. [PMID: 38848816 DOI: 10.1016/j.biosystems.2024.105246] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/09/2024] [Revised: 05/27/2024] [Accepted: 06/04/2024] [Indexed: 06/09/2024]
Abstract
Anticancer peptides (ACPs) have recently emerged as promising cancer therapeutics due to their selectivity and lower toxicity. However, the number of experimentally validated ACPs is limited, and identifying ACPs from large-scale sequence data is time-consuming and expensive. Therefore, it is critical to develop and improve upon existing computational models for identifying ACPs. In this study, a computational method named ACP_DA was proposed based on peptide residue composition and physiochemical properties information. To curtail overfitting and reduce computational costs, a sequential forward selection method was utilized to construct the optimal feature groups. Subsequently, the feature vectors were fed into light gradient boosting machine classifier for model construction. It was observed by an independent set test that ACP_DA achieved the highest Matthew's correlation coefficient of 0.63 and accuracy of 0.8129, displaying at least a 2% enhancement compared to state-of-the-art methods. The satisfactory results demonstrate the effectiveness of ACP_DA as a powerful tool for identifying ACPs, with the potential to significantly contribute to the development and optimization of promising therapies. The data and resource codes are available at https://github.com/Zlclab/ACP_DA.
Collapse
Affiliation(s)
- Lichao Zhang
- School of Mathematics and Statistics, Northeastern University at Qinhuangdao, Qinhuangdao, PR China; Hebei Innovation Center for Smart Perception and Applied Technology of Agricultural Data, Qinhuangdao, PR China
| | - Xueli Hu
- School of Mathematics and Statistics, Northeastern University at Qinhuangdao, Qinhuangdao, PR China
| | - Kang Xiao
- School of Mathematics and Statistics, Northeastern University at Qinhuangdao, Qinhuangdao, PR China
| | - Liang Kong
- Hebei Innovation Center for Smart Perception and Applied Technology of Agricultural Data, Qinhuangdao, PR China; School of Mathematics and Information Science & Technology, Hebei Normal University of Science & Technology, Qinhuangdao, PR China.
| |
Collapse
|
3
|
Harun-Or-Roshid M, Pham NT, Manavalan B, Kurata H. Meta-2OM: A multi-classifier meta-model for the accurate prediction of RNA 2'-O-methylation sites in human RNA. PLoS One 2024; 19:e0305406. [PMID: 38924058 PMCID: PMC11207182 DOI: 10.1371/journal.pone.0305406] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/17/2024] [Accepted: 05/29/2024] [Indexed: 06/28/2024] Open
Abstract
2'-O-methylation (2-OM or Nm) is a widespread RNA modification observed in various RNA types like tRNA, mRNA, rRNA, miRNA, piRNA, and snRNA, which plays a crucial role in several biological functional mechanisms and innate immunity. To comprehend its modification mechanisms and potential epigenetic regulation, it is necessary to accurately identify 2-OM sites. However, biological experiments can be tedious, time-consuming, and expensive. Furthermore, currently available computational methods face challenges due to inadequate datasets and limited classification capabilities. To address these challenges, we proposed Meta-2OM, a cutting-edge predictor that can accurately identify 2-OM sites in human RNA. In brief, we applied a meta-learning approach that considered eight conventional machine learning algorithms, including tree-based classifiers and decision boundary-based classifiers, and eighteen different feature encoding algorithms that cover physicochemical, compositional, position-specific and natural language processing information. The predicted probabilities of 2-OM sites from the baseline models are then combined and trained using logistic regression to generate the final prediction. Consequently, Meta-2OM achieved excellent performance in both 5-fold cross-validation training and independent testing, outperforming all existing state-of-the-art methods. Specifically, on the independent test set, Meta-2OM achieved an overall accuracy of 0.870, sensitivity of 0.836, specificity of 0.904, and Matthew's correlation coefficient of 0.743. To facilitate its use, a user-friendly web server and standalone program have been developed and freely available at http://kurata35.bio.kyutech.ac.jp/Meta-2OM and https://github.com/kuratahiroyuki/Meta-2OM.
Collapse
Affiliation(s)
- Md. Harun-Or-Roshid
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, Iizuka, Fukuoka, Japan
| | - Nhat Truong Pham
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Republic of Korea
| | - Balachandran Manavalan
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Republic of Korea
| | - Hiroyuki Kurata
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, Iizuka, Fukuoka, Japan
| |
Collapse
|
4
|
Ipkovich Á, Czvetkó T, A. Acosta L, Lee S, Nzimenyera I, Sebestyén V, Abonyi J. Network science and explainable AI-based life cycle management of sustainability models. PLoS One 2024; 19:e0300531. [PMID: 38870225 PMCID: PMC11175538 DOI: 10.1371/journal.pone.0300531] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/21/2023] [Accepted: 02/29/2024] [Indexed: 06/15/2024] Open
Abstract
Model-based assessment of the potential impacts of variables on the Sustainable Development Goals (SDGs) can bring great additional information about possible policy intervention points. In the context of sustainability planning, machine learning techniques can provide data-driven solutions throughout the modeling life cycle. In a changing environment, existing models must be continuously reviewed and developed for effective decision support. Thus, we propose to use the Machine Learning Operations (MLOps) life cycle framework. A novel approach for model identification and development is introduced, which involves utilizing the Shapley value to determine the individual direct and indirect contributions of each variable towards the output, as well as network analysis to identify key drivers and support the identification and validation of possible policy intervention points. The applicability of the methods is demonstrated through a case study of the Hungarian water model developed by the Global Green Growth Institute. Based on the model exploration of the case of water efficiency and water stress (in the examined period for the SDG 6.4.1 & 6.4.2) SDG indicators, water reuse and water circularity offer a more effective intervention option than pricing and the use of internal or external renewable water resources.
Collapse
Affiliation(s)
- Ádám Ipkovich
- HUN-REN-PE Complex Systems Monitoring Research Group, University of Pannonia, Veszprém, Hungary
| | - Tímea Czvetkó
- HUN-REN-PE Complex Systems Monitoring Research Group, University of Pannonia, Veszprém, Hungary
| | - Lilibeth A. Acosta
- Climate Action and Inclusive Development (CAID) Unit, Global Green Growth Institute, Jung-gu, Seoul, Republic of Korea
| | - Sanga Lee
- Climate Action and Inclusive Development (CAID) Unit, Global Green Growth Institute, Jung-gu, Seoul, Republic of Korea
| | - Innocent Nzimenyera
- Climate Action and Inclusive Development (CAID) Unit, Global Green Growth Institute, Jung-gu, Seoul, Republic of Korea
| | - Viktor Sebestyén
- HUN-REN-PE Complex Systems Monitoring Research Group, University of Pannonia, Veszprém, Hungary
- Sustainability Solutions Research Lab, Faculty of Engineering, University of Pannonia, Veszprém, Hungary
| | - János Abonyi
- HUN-REN-PE Complex Systems Monitoring Research Group, University of Pannonia, Veszprém, Hungary
| |
Collapse
|
5
|
Jia Y, Yu Z, Hong Z. Semantic aware-based instruction embedding for binary code similarity detection. PLoS One 2024; 19:e0305299. [PMID: 38861533 PMCID: PMC11166306 DOI: 10.1371/journal.pone.0305299] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/02/2024] [Accepted: 05/27/2024] [Indexed: 06/13/2024] Open
Abstract
Binary code similarity detection plays a crucial role in various applications within binary security, including vulnerability detection, malicious software analysis, etc. However, existing methods suffer from limited differentiation in binary embedding representations across different compilation environments, lacking dynamic high-level semantics. Moreover, current approaches often neglect multi-level semantic feature extraction, thereby failing to acquire precise semantic information about the binary code. To address these limitations, this paper introduces a novel detection solution called BinBcla. This method employs an enhanced pre-training model to generate instruction embeddings with dynamic semantics for binary functions. Subsequently, multi-feature fusion technique is utilized to extract local semantic information and long-distance global features from the code, respectively, employing self-attention to comprehend the structure information of the code. Finally, an improved cosine similarity method is employed to learn relationships among all elements of the distance vectors, thereby enhancing the model's robustness to new sample functions. Experiments are conducted across different architectures, compilers, and optimization levels. The results indicate that BinBcla achieves higher accuracy, precision and F1 score compared to existing methods.
Collapse
Affiliation(s)
- Yuhao Jia
- College of Information Engineering, Zhejiang University of Technology, Hangzhou, Zhejiang, China
| | - Zhicheng Yu
- College of Information Engineering, Zhejiang University of Technology, Hangzhou, Zhejiang, China
| | - Zhen Hong
- College of Information Engineering, Zhejiang University of Technology, Hangzhou, Zhejiang, China
| |
Collapse
|
6
|
Chen T, Kabir MF. Explainable machine learning approach for cancer prediction through binarilization of RNA sequencing data. PLoS One 2024; 19:e0302947. [PMID: 38728288 PMCID: PMC11086842 DOI: 10.1371/journal.pone.0302947] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/15/2023] [Accepted: 04/15/2024] [Indexed: 05/12/2024] Open
Abstract
In recent years, researchers have proven the effectiveness and speediness of machine learning-based cancer diagnosis models. However, it is difficult to explain the results generated by machine learning models, especially ones that utilized complex high-dimensional data like RNA sequencing data. In this study, we propose the binarilization technique as a novel way to treat RNA sequencing data and used it to construct explainable cancer prediction models. We tested our proposed data processing technique on five different models, namely neural network, random forest, xgboost, support vector machine, and decision tree, using four cancer datasets collected from the National Cancer Institute Genomic Data Commons. Since our datasets are imbalanced, we evaluated the performance of all models using metrics designed for imbalance performance like geometric mean, Matthews correlation coefficient, F-Measure, and area under the receiver operating characteristic curve. Our approach showed comparative performance while relying on less features. Additionally, we demonstrated that data binarilization offers higher explainability by revealing how each feature affects the prediction. These results demonstrate the potential of data binarilization technique in improving the performance and explainability of RNA sequencing based cancer prediction models.
Collapse
Affiliation(s)
- Tianjie Chen
- Department of Computer Science, Pennsylvania State University Harrisburg, Middletown, Pennsylvania, United States of America
| | - Md Faisal Kabir
- Department of Computer Science, Pennsylvania State University Harrisburg, Middletown, Pennsylvania, United States of America
| |
Collapse
|
7
|
Akbar S, Zou Q, Raza A, Alarfaj FK. iAFPs-Mv-BiTCN: Predicting antifungal peptides using self-attention transformer embedding and transform evolutionary based multi-view features with bidirectional temporal convolutional networks. Artif Intell Med 2024; 151:102860. [PMID: 38552379 DOI: 10.1016/j.artmed.2024.102860] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/27/2023] [Revised: 02/21/2024] [Accepted: 03/25/2024] [Indexed: 04/26/2024]
Abstract
Globally, fungal infections have become a major health concern in humans. Fungal diseases generally occur due to the invading fungus appearing on a specific portion of the body and becoming hard for the human immune system to resist. The recent emergence of COVID-19 has intensely increased different nosocomial fungal infections. The existing wet-laboratory-based medications are expensive, time-consuming, and may have adverse side effects on normal cells. In the last decade, peptide therapeutics have gained significant attention due to their high specificity in targeting affected cells without affecting healthy cells. Motivated by the significance of peptide-based therapies, we developed a highly discriminative prediction scheme called iAFPs-Mv-BiTCN to predict antifungal peptides correctly. The training peptides are encoded using word embedding methods such as skip-gram and attention mechanism-based bidirectional encoder representation using transformer. Additionally, transform-based evolutionary features are generated using the Pseduo position-specific scoring matrix using discrete wavelet transform (PsePSSM-DWT). The fused vector of word embedding and evolutionary descriptors is formed to compensate for the limitations of single encoding methods. A Shapley Additive exPlanations (SHAP) based global interpolation approach is applied to reduce training costs by choosing the optimal feature set. The selected feature set is trained using a bi-directional temporal convolutional network (BiTCN). The proposed iAFPs-Mv-BiTCN model achieved a predictive accuracy of 98.15 % and an AUC of 0.99 using training samples. In the case of the independent samples, our model obtained an accuracy of 94.11 % and an AUC of 0.98. Our iAFPs-Mv-BiTCN model outperformed existing models with a ~4 % and ~5 % higher accuracy using training and independent samples, respectively. The reliability and efficacy of the proposed iAFPs-Mv-BiTCN model make it a valuable tool for scientists and may perform a beneficial role in pharmaceutical design and research academia.
Collapse
Affiliation(s)
- Shahid Akbar
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China; Department of Computer Science, Abdul Wali Khan University Mardan, KP 23200, Pakistan
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China; Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, PR China.
| | - Ali Raza
- Department of Physical and Numerical Sciences, Qurtuba University of Science and Information Technology, Peshawar, KP 25124, Pakistan
| | - Fawaz Khaled Alarfaj
- Department of Management Information Systems (MIS), School of Business, King Faisal University (KFU), Al-Ahsa 31982, Saudi Arabia
| |
Collapse
|
8
|
Sargsyan K, Lim C. Using protein language models for protein interaction hot spot prediction with limited data. BMC Bioinformatics 2024; 25:115. [PMID: 38493120 PMCID: PMC10943781 DOI: 10.1186/s12859-024-05737-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/03/2024] [Accepted: 03/11/2024] [Indexed: 03/18/2024] Open
Abstract
BACKGROUND Protein language models, inspired by the success of large language models in deciphering human language, have emerged as powerful tools for unraveling the intricate code of life inscribed within protein sequences. They have gained significant attention for their promising applications across various areas, including the sequence-based prediction of secondary and tertiary protein structure, the discovery of new functional protein sequences/folds, and the assessment of mutational impact on protein fitness. However, their utility in learning to predict protein residue properties based on scant datasets, such as protein-protein interaction (PPI)-hotspots whose mutations significantly impair PPIs, remained unclear. Here, we explore the feasibility of using protein language-learned representations as features for machine learning to predict PPI-hotspots using a dataset containing 414 experimentally confirmed PPI-hotspots and 504 PPI-nonhot spots. RESULTS Our findings showcase the capacity of unsupervised learning with protein language models in capturing critical functional attributes of protein residues derived from the evolutionary information encoded within amino acid sequences. We show that methods relying on protein language models can compete with methods employing sequence and structure-based features to predict PPI-hotspots from the free protein structure. We observed an optimal number of features for model precision, suggesting a balance between information and overfitting. CONCLUSIONS This study underscores the potential of transformer-based protein language models to extract critical knowledge from sparse datasets, exemplified here by the challenging realm of predicting PPI-hotspots. These models offer a cost-effective and time-efficient alternative to traditional experimental methods for predicting certain residue properties. However, the challenge of explaining why specific features are important for determining certain residue properties remains.
Collapse
Affiliation(s)
- Karen Sargsyan
- Institute of Biomedical Sciences, Academia Sinica, Taipei, 115, Taiwan.
| | - Carmay Lim
- Institute of Biomedical Sciences, Academia Sinica, Taipei, 115, Taiwan.
| |
Collapse
|
9
|
Liang Y, Yin X, Zhang Y, Guo Y, Wang Y. Predicting lncRNA-protein interactions through deep learning framework employing multiple features and random forest algorithm. BMC Bioinformatics 2024; 25:108. [PMID: 38475723 DOI: 10.1186/s12859-024-05727-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/25/2023] [Accepted: 03/01/2024] [Indexed: 03/14/2024] Open
Abstract
RNA-protein interaction (RPI) is crucial to the life processes of diverse organisms. Various researchers have identified RPI through long-term and high-cost biological experiments. Although numerous machine learning and deep learning-based methods for predicting RPI currently exist, their robustness and generalizability have significant room for improvement. This study proposes LPI-MFF, an RPI prediction model based on multi-source information fusion, to address these issues. The LPI-MFF employed protein-protein interactions features, sequence features, secondary structure features, and physical and chemical properties as the information sources with the corresponding coding scheme, followed by the random forest algorithm for feature screening. Finally, all information was combined and a classification method based on convolutional neural networks is used. The experimental results of fivefold cross-validation demonstrated that the accuracy of LPI-MFF on RPI1807 and NPInter was 97.60% and 97.67%, respectively. In addition, the accuracy rate on the independent test set RPI1168 was 84.9%, and the accuracy rate on the Mus musculus dataset was 90.91%. Accordingly, LPI-MFF demonstrated greater robustness and generalization than other prevalent RPI prediction methods.
Collapse
Affiliation(s)
- Ying Liang
- College of Computer and Information Engineering, Jiangxi Agricultural University, Zhimin Avenue, Nanchang, China
| | - XingRui Yin
- College of Computer and Information Engineering, Jiangxi Agricultural University, Zhimin Avenue, Nanchang, China
| | - YangSen Zhang
- College of Computer and Information Engineering, Jiangxi Agricultural University, Zhimin Avenue, Nanchang, China
| | - You Guo
- First Affiliated Hospital, Gannan Medical University, Medical College Road, Ganzhou, China.
| | - YingLong Wang
- College of Computer and Information Engineering, Jiangxi Agricultural University, Zhimin Avenue, Nanchang, China.
| |
Collapse
|
10
|
Akbar S, Raza A, Zou Q. Deepstacked-AVPs: predicting antiviral peptides using tri-segment evolutionary profile and word embedding based multi-perspective features with deep stacking model. BMC Bioinformatics 2024; 25:102. [PMID: 38454333 PMCID: PMC10921744 DOI: 10.1186/s12859-024-05726-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/01/2023] [Accepted: 03/01/2024] [Indexed: 03/09/2024] Open
Abstract
BACKGROUND Viral infections have been the main health issue in the last decade. Antiviral peptides (AVPs) are a subclass of antimicrobial peptides (AMPs) with substantial potential to protect the human body against various viral diseases. However, there has been significant production of antiviral vaccines and medications. Recently, the development of AVPs as an antiviral agent suggests an effective way to treat virus-affected cells. Recently, the involvement of intelligent machine learning techniques for developing peptide-based therapeutic agents is becoming an increasing interest due to its significant outcomes. The existing wet-laboratory-based drugs are expensive, time-consuming, and cannot effectively perform in screening and predicting the targeted motif of antiviral peptides. METHODS In this paper, we proposed a novel computational model called Deepstacked-AVPs to discriminate AVPs accurately. The training sequences are numerically encoded using a novel Tri-segmentation-based position-specific scoring matrix (PSSM-TS) and word2vec-based semantic features. Composition/Transition/Distribution-Transition (CTDT) is also employed to represent the physiochemical properties based on structural features. Apart from these, the fused vector is formed using PSSM-TS features, semantic information, and CTDT descriptors to compensate for the limitations of single encoding methods. Information gain (IG) is applied to choose the optimal feature set. The selected features are trained using a stacked-ensemble classifier. RESULTS The proposed Deepstacked-AVPs model achieved a predictive accuracy of 96.60%%, an area under the curve (AUC) of 0.98, and a precision-recall (PR) value of 0.97 using training samples. In the case of the independent samples, our model obtained an accuracy of 95.15%, an AUC of 0.97, and a PR value of 0.97. CONCLUSION Our Deepstacked-AVPs model outperformed existing models with a ~ 4% and ~ 2% higher accuracy using training and independent samples, respectively. The reliability and efficacy of the proposed Deepstacked-AVPs model make it a valuable tool for scientists and may perform a beneficial role in pharmaceutical design and research academia.
Collapse
Affiliation(s)
- Shahid Akbar
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, 610054, People's Republic of China
- Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, 23200, KP, Pakistan
| | - Ali Raza
- Department of Physical and Numerical Sciences, Qurtuba University of Science and Information Technology, Peshawar, 25124, KP, Pakistan
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, 610054, People's Republic of China.
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, 324000, People's Republic of China.
| |
Collapse
|
11
|
Wu H, Liu X, Fang Y, Yang Y, Huang Y, Pan X, Shen HB. Decoding protein binding landscape on circular RNAs with base-resolution transformer models. Comput Biol Med 2024; 171:108175. [PMID: 38402841 DOI: 10.1016/j.compbiomed.2024.108175] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/02/2023] [Revised: 01/16/2024] [Accepted: 02/18/2024] [Indexed: 02/27/2024]
Abstract
Circular RNAs (circRNAs), a class of endogenous RNA with a covalent loop structure, can regulate gene expression by serving as sponges for microRNAs and RNA-binding proteins (RBPs). To date, most computational methods for predicting RBP binding sites on circRNAs focus on circRNA fragments instead of circRNAs. These methods detect whether a circRNA fragment contains binding sites, but cannot determine where are the binding sites and how many binding sites are on the circRNA transcript. We report a hybrid deep learning-based tool, CircSite, to predict RBP binding sites at single-nucleotide resolution and detect key contributed nucleotides on circRNA transcripts. CircSite takes advantage of convolutional neural networks (CNNs) and Transformer for learning local and global representations of circRNAs binding to RBPs, respectively. We construct 37 datasets of circRNAs interacting with proteins for benchmarking and the experimental results show that CircSite offers accurate predictions of RBP binding nucleotides and detects key subsequences aligning well with known binding motifs. CircSite is an easy-to-use online webserver for predicting RBP binding sites on circRNA transcripts and freely available at http://www.csbio.sjtu.edu.cn/bioinf/CircSite/.
Collapse
Affiliation(s)
- Hehe Wu
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, And Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China
| | - Xiaojian Liu
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, And Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China
| | - Yi Fang
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, And Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China
| | - Yang Yang
- Center for Brain-Like Computing and Machine Intelligence, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Yan Huang
- State Key Laboratory of Infrared Physics, Shanghai Institute of Technical Physics Chinese Academy of Sciences, 500 Yutian Road, Shanghai, 200083, China
| | - Xiaoyong Pan
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, And Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China.
| | - Hong-Bin Shen
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, And Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China.
| |
Collapse
|
12
|
Alruwaili O, Yousef A, Jumani TA, Armghan A. Response score-based protein structure analysis for cancer prediction aided by the Internet of Things. Sci Rep 2024; 14:2324. [PMID: 38282060 PMCID: PMC10822874 DOI: 10.1038/s41598-024-52634-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/28/2023] [Accepted: 01/22/2024] [Indexed: 01/30/2024] Open
Abstract
Medical diagnosis through prediction and analysis is par excellence in integrating modern technologies such as the Internet of Things (IoT). With the aid of such technologies, clinical assessments are eased with protracted computing. Specifically, cancer research through structure prediction and analysis is improved through human and machine interventions sustaining precision improvements. This article, therefore, introduces a Protein Structure Prediction Technique based on Three-Dimensional Sequence. This sequence is modeled using amino acids and their folds observed during the pre-initial cancer stages. The observed sequences and the inflammatory response score of the structure are used to predict the impact of cancer. In this process, ensemble learning is used to identify sequence and folding responses to improve inflammations. This score is correlated with the clinical data for structures and their folds independently for determining the structure changes. Such changes through different sequences are handled using repeated ensemble learning for matching and unmatching response scores. The introduced idea integrated with deep ensemble learning and IoT combination, notably employing stacking method for enhanced cancer prediction precision and interdisciplinary collaboration. The proposed technique improves prediction precision, data correlation, and change detection by 11.83%, 8.48%, and 13.23%, respectively. This technique reduces correlation time and complexity by 10.43% and 12.33%, respectively.
Collapse
Affiliation(s)
- Omar Alruwaili
- Department of Computer Engineering and Networks, College of Computer and Information Science, Jouf University, 72388, Sakaka, Saudi Arabia
| | - Amr Yousef
- Electrical Engineering Department, University of Business and Technology, 23435, Ar Rawdah, Jeddah, Saudi Arabia
- Engineering Mathematics Department, Alexandria University, Lotfy El-Sied St. Off Gamal Abd El-Naser, Alexandria, 11432, Egypt
| | - Touqeer A Jumani
- Department of Electrical Engineering, Mehran University of Engineering and Technology, SZAB Campus, Khairpur Mirs, 66020, Pakistan
| | - Ammar Armghan
- Department of Electrical Engineering. College of Engineering, Jouf University, 72388, Sakaka, Saudi Arabia.
| |
Collapse
|
13
|
Hassan E, Abd El-Hafeez T, Shams MY. Optimizing classification of diseases through language model analysis of symptoms. Sci Rep 2024; 14:1507. [PMID: 38233458 PMCID: PMC10794698 DOI: 10.1038/s41598-024-51615-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/07/2023] [Accepted: 01/07/2024] [Indexed: 01/19/2024] Open
Abstract
This paper investigated the use of language models and deep learning techniques for automating disease prediction from symptoms. Specifically, we explored the use of two Medical Concept Normalization-Bidirectional Encoder Representations from Transformers (MCN-BERT) models and a Bidirectional Long Short-Term Memory (BiLSTM) model, each optimized with a different hyperparameter optimization method, to predict diseases from symptom descriptions. In this paper, we utilized two distinct dataset called Dataset-1, and Dataset-2. Dataset-1 consists of 1,200 data points, with each point representing a unique combination of disease labels and symptom descriptions. While, Dataset-2 is designed to identify Adverse Drug Reactions (ADRs) from Twitter data, comprising 23,516 rows categorized as ADR (1) or Non-ADR (0) tweets. The results indicate that the MCN-BERT model optimized with AdamP achieved 99.58% accuracy for Dataset-1 and 96.15% accuracy for Dataset-2. The MCN-BERT model optimized with AdamW performed well with 98.33% accuracy for Dataset-1 and 95.15% for Dataset-2, while the BiLSTM model optimized with Hyperopt achieved 97.08% accuracy for Dataset-1 and 94.15% for Dataset-2. Our findings suggest that language models and deep learning techniques have promise for supporting earlier detection and more prompt treatment of diseases, as well as expanding remote diagnostic capabilities. The MCN-BERT and BiLSTM models demonstrated robust performance in accurately predicting diseases from symptoms, indicating the potential for further related research.
Collapse
Affiliation(s)
- Esraa Hassan
- Faculty of Artificial Intelligence, Kafrelsheikh University, Kafrelsheikh, 33516, Egypt.
| | - Tarek Abd El-Hafeez
- Department of Computer Science, Faculty of Science, Minia University, Minia, 61519, Egypt.
- Computer Science Unit, Deraya University, Minia University, Minia, 61765, Egypt.
| | - Mahmoud Y Shams
- Faculty of Artificial Intelligence, Kafrelsheikh University, Kafrelsheikh, 33516, Egypt.
| |
Collapse
|