1
|
Carpenter KA, Altman RB. Databases of ligand-binding pockets and protein-ligand interactions. Comput Struct Biotechnol J 2024; 23:1320-1338. [PMID: 38585646 PMCID: PMC10997877 DOI: 10.1016/j.csbj.2024.03.015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/06/2024] [Revised: 03/16/2024] [Accepted: 03/17/2024] [Indexed: 04/09/2024] Open
Abstract
Many research groups and institutions have created a variety of databases curating experimental and predicted data related to protein-ligand binding. The landscape of available databases is dynamic, with new databases emerging and established databases becoming defunct. Here, we review the current state of databases that contain binding pockets and protein-ligand binding interactions. We have compiled a list of such databases, fifty-three of which are currently available for use. We discuss variation in how binding pockets are defined and summarize pocket-finding methods. We organize the fifty-three databases into subgroups based on goals and contents, and describe standard use cases. We also illustrate that pockets within the same protein are characterized differently across different databases. Finally, we assess critical issues of sustainability, accessibility and redundancy.
Collapse
Affiliation(s)
- Kristy A. Carpenter
- Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA
| | - Russ B. Altman
- Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA
- Department of Bioengineering, Stanford University, Stanford, CA 94305, USA
- Department of Genetics, Stanford University, Stanford, CA 94305, USA
- Department of Medicine, Stanford University, Stanford, CA 94305, USA
| |
Collapse
|
2
|
Kang Y, Wang H, Qin Y, Liu G, Yu Y, Zhang Y. PSATF-6mA: an integrated learning fusion feature-encoded DNA-6 mA methylcytosine modification site recognition model based on attentional mechanisms. Front Genet 2024; 15:1498884. [PMID: 39600317 PMCID: PMC11588721 DOI: 10.3389/fgene.2024.1498884] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/21/2024] [Accepted: 10/30/2024] [Indexed: 11/29/2024] Open
Abstract
DNA methylation is of crucial importance for biological genetic expression, such as biological cell differentiation and cellular tumours. The identification of DNA-6mA sites using traditional biological experimental methods requires more cumbersome steps and a large amount of time. The advent of neural network technology has facilitated the identification of 6 mA sites on cross-species DNA with enhanced efficacy. Nevertheless, the majority of contemporary neural network models for identifying 6 mA sites prioritize the design of the identification model, with comparatively limited research conducted on the statistically significant DNA sequence itself. Consequently, this paper will focus on the statistical strategy of DNA double-stranded features, utilising the multi-head self-attention mechanism in neural networks applied to DNA position probabilistic relationships. Furthermore, a new recognition model, PSATF-6 mA, will be constructed by continually adjusting the attentional tendency of feature fusion through an integrated learning framework. The experimental results, obtained through cross-validation with cross-species data, demonstrate that the PSATF-6 mA model outperforms the baseline model. The in-Matthews correlation coefficient (MCC) for the cross-species dataset of rice and m. musus genomes can reach a score of 0.982. The present model is expected to assist biologists in more accurately identifying 6 mA locus and in formulating new testable biological hypotheses.
Collapse
Affiliation(s)
- Yanmei Kang
- School of Cyber Science and Engineering, University of International Relations, Beijing, China
| | - Hongyuan Wang
- School of Cyber Science and Engineering, University of International Relations, Beijing, China
| | - Yubo Qin
- School of Cyber Science and Engineering, University of International Relations, Beijing, China
| | - Guanlin Liu
- School of Cyber Science and Engineering, University of International Relations, Beijing, China
| | - Yi Yu
- College of Computer Science and Technology, Guangdong University of Technology, Guangzhou, China
| | - Yongjian Zhang
- School of Cyber Science and Engineering, University of International Relations, Beijing, China
| |
Collapse
|
3
|
Li Y, Nan X, Zhang S, Zhou Q, Lu S, Tian Z. PMSFF: Improved Protein Binding Residues Prediction through Multi-Scale Sequence-Based Feature Fusion Strategy. Biomolecules 2024; 14:1220. [PMID: 39456153 PMCID: PMC11506650 DOI: 10.3390/biom14101220] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/15/2024] [Revised: 09/22/2024] [Accepted: 09/24/2024] [Indexed: 10/28/2024] Open
Abstract
Proteins perform different biological functions through binding with various molecules which are mediated by a few key residues and accurate prediction of such protein binding residues (PBRs) is crucial for understanding cellular processes and for designing new drugs. Many computational prediction approaches have been proposed to identify PBRs with sequence-based features. However, these approaches face two main challenges: (1) these methods only concatenate residue feature vectors with a simple sliding window strategy, and (2) it is challenging to find a uniform sliding window size suitable for learning embeddings across different types of PBRs. In this study, we propose one novel framework that could apply multiple types of PBRs Prediciton task through Multi-scale Sequence-based Feature Fusion (PMSFF) strategy. Firstly, PMSFF employs a pre-trained language model named ProtT5, to encode amino acid residues in protein sequences. Then, it generates multi-scale residue embeddings by applying multi-size windows to capture effective neighboring residues and multi-size kernels to learn information across different scales. Additionally, the proposed model treats protein sequences as sentences, employing a bidirectional GRU to learn global context. We also collect benchmark datasets encompassing various PBRs types and evaluate our PMSFF approach to these datasets. Compared with state-of-the-art methods, PMSFF demonstrates superior performance on most PBRs prediction tasks.
Collapse
Affiliation(s)
- Yuguang Li
- School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, China; (Y.L.); (X.N.); (Q.Z.)
| | - Xiaofei Nan
- School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, China; (Y.L.); (X.N.); (Q.Z.)
| | - Shoutao Zhang
- School of Life Sciences, Zhengzhou University, Zhengzhou 450001, China;
- Longhu Laboratory of Advanced Immunology, Zhengzhou 450001, China
| | - Qinglei Zhou
- School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, China; (Y.L.); (X.N.); (Q.Z.)
| | - Shuai Lu
- School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, China; (Y.L.); (X.N.); (Q.Z.)
- National Supercomputing Center in Zhengzhou, Zhengzhou University, Zhengzhou 450001, China
| | - Zhen Tian
- School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, China; (Y.L.); (X.N.); (Q.Z.)
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003, China
| |
Collapse
|
4
|
Zhou L, Wang X, Peng L, Chen M, Wen H. SEnSCA: Identifying possible ligand-receptor interactions and its application in cell-cell communication inference. J Cell Mol Med 2024; 28:e18372. [PMID: 38747737 PMCID: PMC11095317 DOI: 10.1111/jcmm.18372] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/06/2024] [Revised: 04/10/2024] [Accepted: 04/18/2024] [Indexed: 05/18/2024] Open
Abstract
Multicellular organisms have dense affinity with the coordination of cellular activities, which severely depend on communication across diverse cell types. Cell-cell communication (CCC) is often mediated via ligand-receptor interactions (LRIs). Existing CCC inference methods are limited to known LRIs. To address this problem, we developed a comprehensive CCC analysis tool SEnSCA by integrating single cell RNA sequencing and proteome data. SEnSCA mainly contains potential LRI acquisition and CCC strength evaluation. For acquiring potential LRIs, it first extracts LRI features and reduces the feature dimension, subsequently constructs negative LRI samples through K-means clustering, finally acquires potential LRIs based on Stacking ensemble comprising support vector machine, 1D-convolutional neural networks and multi-head attention mechanism. During CCC strength evaluation, SEnSCA conducts LRI filtering and then infers CCC by combining the three-point estimation approach and single cell RNA sequencing data. SEnSCA computed better precision, recall, accuracy, F1 score, AUC and AUPR under most of conditions when predicting possible LRIs. To better illustrate the inferred CCC network, SEnSCA provided three visualization options: heatmap, bubble diagram and network diagram. Its application on human melanoma tissue demonstrated its reliability in CCC detection. In summary, SEnSCA offers a useful CCC inference tool and is freely available at https://github.com/plhhnu/SEnSCA.
Collapse
Affiliation(s)
- Liqian Zhou
- School of Life Sciences and ChemistryHunan University of TechnologyHunanChina
| | - Xiwen Wang
- School of Life Sciences and ChemistryHunan University of TechnologyHunanChina
| | - Lihong Peng
- School of Life Sciences and ChemistryHunan University of TechnologyHunanChina
| | - Min Chen
- School of Computer ScienceHunan Institute of TechnologyHengyangChina
| | - Hong Wen
- School of Computer ScienceHunan University of TechnologyHunanChina
| |
Collapse
|
5
|
Qi X, Zhao Y, Qi Z, Hou S, Chen J. Machine Learning Empowering Drug Discovery: Applications, Opportunities and Challenges. Molecules 2024; 29:903. [PMID: 38398653 PMCID: PMC10892089 DOI: 10.3390/molecules29040903] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/15/2024] [Revised: 02/08/2024] [Accepted: 02/14/2024] [Indexed: 02/25/2024] Open
Abstract
Drug discovery plays a critical role in advancing human health by developing new medications and treatments to combat diseases. How to accelerate the pace and reduce the costs of new drug discovery has long been a key concern for the pharmaceutical industry. Fortunately, by leveraging advanced algorithms, computational power and biological big data, artificial intelligence (AI) technology, especially machine learning (ML), holds the promise of making the hunt for new drugs more efficient. Recently, the Transformer-based models that have achieved revolutionary breakthroughs in natural language processing have sparked a new era of their applications in drug discovery. Herein, we introduce the latest applications of ML in drug discovery, highlight the potential of advanced Transformer-based ML models, and discuss the future prospects and challenges in the field.
Collapse
Affiliation(s)
- Xin Qi
- School of Chemistry and Life Sciences, Suzhou University of Science and Technology, Suzhou 215011, China; (Y.Z.); (S.H.); (J.C.)
| | - Yuanchun Zhao
- School of Chemistry and Life Sciences, Suzhou University of Science and Technology, Suzhou 215011, China; (Y.Z.); (S.H.); (J.C.)
| | - Zhuang Qi
- School of Software, Shandong University, Jinan 250101, China;
| | - Siyu Hou
- School of Chemistry and Life Sciences, Suzhou University of Science and Technology, Suzhou 215011, China; (Y.Z.); (S.H.); (J.C.)
| | - Jiajia Chen
- School of Chemistry and Life Sciences, Suzhou University of Science and Technology, Suzhou 215011, China; (Y.Z.); (S.H.); (J.C.)
| |
Collapse
|
6
|
Qi Y, Liu H, Zhao J, Zhang S, Zhang X, Zhang W, Wang Y, Xu J, Li J, Ding Y. Trends and driving forces of agricultural carbon emissions: A case study of Anhui, China. PLoS One 2024; 19:e0292523. [PMID: 38346018 PMCID: PMC10861070 DOI: 10.1371/journal.pone.0292523] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/24/2023] [Accepted: 09/23/2023] [Indexed: 02/15/2024] Open
Abstract
To facilitate accurate prediction and empirical research on regional agricultural carbon emissions, this paper uses the LLE-PSO-XGBoost carbon emission model, which combines the Local Linear Embedding (LLE), Particle Swarm Algorithm (PSO) and Extreme Gradient Boosting Algorithm (XGBoost), to forecast regional agricultural carbon emissions in Anhui Province under different scenarios. The results show that the regional agricultural carbon emissions in Anhui Province generally show an upward and then downward trend during 2000-2021, and the regional agricultural carbon emissions in Anhui Province in 2030 are expected to fluctuate between 11,342,100 tones and 14,445,700 tones under five different set scenarios. The projections of regional agricultural carbon emissions can play an important role in supporting the development of local regional agriculture, helping to guide the input and policy guidance of local rural low-carbon agriculture and promoting the development of rural areas towards a resource-saving and environment-friendly society.
Collapse
Affiliation(s)
- Yanwei Qi
- School of Economics & Management, Xidian University, Xi’an, China
| | - Huailiang Liu
- School of Economics & Management, Xidian University, Xi’an, China
| | - Jianbo Zhao
- School of Economics & Management, Xidian University, Xi’an, China
| | - Shanzhuang Zhang
- School of Economics & Management, Xidian University, Xi’an, China
| | - Xiaojin Zhang
- School of Economics & Management, Xidian University, Xi’an, China
| | - Weili Zhang
- School of Economics & Management, Xidian University, Xi’an, China
| | - Yakai Wang
- School of Economics & Management, Xidian University, Xi’an, China
| | - Jiajun Xu
- School of Economics & Management, Xidian University, Xi’an, China
| | - Jie Li
- School of Economics & Management, Xidian University, Xi’an, China
| | - Yulan Ding
- School of Economics & Management, Xidian University, Xi’an, China
| |
Collapse
|
7
|
Yu W, Wang X, Jiang X, Zhao R, Zhao S. A novel hybrid model based on two-stage data processing and machine learning for forecasting chlorophyll-a concentration in reservoirs. ENVIRONMENTAL SCIENCE AND POLLUTION RESEARCH INTERNATIONAL 2024; 31:262-279. [PMID: 38015396 DOI: 10.1007/s11356-023-31148-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/05/2023] [Accepted: 11/17/2023] [Indexed: 11/29/2023]
Abstract
The accurate and efficient prediction of chlorophyll-a (Chl-a) concentration is crucial for the early detection of algal blooms in reservoirs. Nevertheless, predicting Chl-a concentration in multivariate time series poses a significant challenge due to the complex interrelationships within the aquatic environment and the discrete and non-stationary nature of online monitoring of water quality data. To address the aforementioned issue, this paper proposes a novel prediction model named SGMD-KPCA-BiLSTM (SKB) for predicting Chl-a concentration. The model combines two-stage data processing and machine learning (ML). To capture nonlinear relationships in multivariate time series data, the optimal data subset is determined by combining symplectic geometry mode decomposition (SGMD) and kernel principal component analysis (KPCA). This subset is then input into a bidirectional long short-term memory (BiLSTM) model, and the model's hyperparameters are optimized using the sparrow search algorithm (SSA) to improve the accuracy of predictions. The performance of the model was evaluated at Qiaodian Reservoir in Shandong, China. To assess its superiority, the evaluation criteria included the root mean square error (RMSE), mean absolute percentage error (MAPE), mean absolute error (MAE), coefficient of determination (R2), frequency histograms of the prediction error, and the Taylor diagram. The prediction performance of five single models, namely the back-propagation (BP) neural network, support vector regression (SVR), long short-term memory (LSTM), convolutional neural network with long short-term memory (CNN-LSTM), and BiLSTM, as well as three hybrid models, namely SGMD-LSTM, SGMD-KPCA-LSTM, and SGMD-BiLSTM, were compared against the SKB model. The results demonstrated that the SKB model performs best in predicting Chl-a concentration (R2 = 96.19%, RMSE = 1.05, MAE = 0.65, MAPE = 0.08). It significantly reduced the prediction error compared to other models for comparison. Furthermore, the multi-step predictive capabilities of the SKB model are also discussed. The analysis shows a decline in predictive performance with larger prediction time steps, and the SKB model exhibits slightly superior performance compared to the other model at corresponding prediction intervals. The model has significant advantages in terms of its ability to accurately predict the non-smooth and nonlinear Chl-a sequences observed by the online monitoring system. This study presents a potential solution for controlling and preventing reservoir eutrophication, as well as an innovative approach for predicting water quality.
Collapse
Affiliation(s)
- Wenqing Yu
- Department of Civil Engineering and Water Conservancy, Shandong University, Jinan, 250061, China
| | - Xingju Wang
- Department of Civil Engineering and Water Conservancy, Shandong University, Jinan, 250061, China
| | - Xin Jiang
- Water Resources Research Institute of Shandong Province, Jinan, 250014, China
| | - Ranhang Zhao
- Department of Civil Engineering and Water Conservancy, Shandong University, Jinan, 250061, China.
- Qianfoshan Campus of Shandong University, No. 17923, Jingshi Road, Lixia District, Jinan City, 250014, Shandong Province, China.
| | - Shen Zhao
- Water Resources Research Institute of Shandong Province, Jinan, 250014, China
- School of Water Conservancy and Environment, University of Jinan, Jinan, 250022, China
| |
Collapse
|
8
|
Cong H, Liu H, Cao Y, Liang C, Chen Y. Protein-protein interaction site prediction by model ensembling with hybrid feature and self-attention. BMC Bioinformatics 2023; 24:456. [PMID: 38053020 DOI: 10.1186/s12859-023-05592-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/18/2022] [Accepted: 11/30/2023] [Indexed: 12/07/2023] Open
Abstract
BACKGROUND Protein-protein interactions (PPIs) are crucial in various biological functions and cellular processes. Thus, many computational approaches have been proposed to predict PPI sites. Although significant progress has been made, these methods still have limitations in encoding the characteristics of each amino acid in sequences. Many feature extraction methods rely on the sliding window technique, which simply merges all the features of residues into a vector. The importance of some key residues may be weakened in the feature vector, leading to poor performance. RESULTS We propose a novel sequence-based method for PPI sites prediction. The new network model, PPINet, contains multiple feature processing paths. For a residue, the PPINet extracts the features of the targeted residue and its context separately. These two types of features are processed by two paths in the network and combined to form a protein representation, where the two types of features are of relatively equal importance. The model ensembling technique is applied to make use of more features. The base models are trained with different features and then ensembled via stacking. In addition, a data balancing strategy is presented, by which our model can get significant improvement on highly unbalanced data. CONCLUSION The proposed method is evaluated on a fused dataset constructed from Dset186, Dset_72, and PDBset_164, as well as the public Dset_448 dataset. Compared with current state-of-the-art methods, the performance of our method is better than the others. In the most important metrics, such as AUPRC and recall, it surpasses the second-best programmer on the latter dataset by 6.9% and 4.7%, respectively. We also demonstrated that the improvement is essentially due to using the ensemble model, especially, the hybrid feature. We share our code for reproducibility and future research at https://github.com/CandiceCong/StackingPPINet .
Collapse
Affiliation(s)
- Hanhan Cong
- School of Information Science and Engineering, Shandong Normal University, Jinan, China
- Shandong Provincial Key Laboratory for Novel Distributed Computer Software Technology, Jinan, China
| | - Hong Liu
- School of Information Science and Engineering, Shandong Normal University, Jinan, China.
- Shandong Provincial Key Laboratory for Novel Distributed Computer Software Technology, Jinan, China.
| | - Yi Cao
- School of Information Science and Engineering, University of Jinan, Jinan, China
- Shandong Provincial Key Laboratory of Network Based Intelligent Computing, Jinan, China
| | - Cheng Liang
- School of Information Science and Engineering, Shandong Normal University, Jinan, China
| | - Yuehui Chen
- School of Information Science and Engineering, University of Jinan, Jinan, China
- Shandong Provincial Key Laboratory of Network Based Intelligent Computing, Jinan, China
| |
Collapse
|
9
|
Zandi F, Mansouri P, Goodarzi M. Global protein-protein interaction networks in yeast saccharomyces cerevisiae and helicobacter pylori. Talanta 2023; 265:124836. [PMID: 37393709 DOI: 10.1016/j.talanta.2023.124836] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/29/2023] [Revised: 06/04/2023] [Accepted: 06/17/2023] [Indexed: 07/04/2023]
Abstract
Understanding many biological processes relies heavily on accurately predicting protein-protein interactions (PPIs). In this study, we propose a novel method for predicting PPIs that is based on LogitBoost with a binary bat feature selection algorithm. Our approach involves the extraction of an initial feature vector by combining pseudo amino acid composition (PseAAC), pseudo-position-specific scoring matrix (PsePSSM), reduced sequence and index-vectors (RSIV), and autocorrelation descriptor (AD). Subsequently, a binary bat algorithm is applied to eliminate redundant features, and the resulting optimal features are fed into the LogitBoost classifier for the identification of PPIs. To evaluate the proposed method, we test it on two databases, Saccharomyces cerevisiae and Helicobacter pylori, using 10-fold cross-validation, and achieve accuracies of 94.39% and 97.89%, respectively. Our results showcase the significant potential of our pipeline in accurately predicting protein-protein interactions (PPIs), thereby offering a valuable resource to the scientific research community.
Collapse
Affiliation(s)
- Farzad Zandi
- Faculty of Sciences, Islamic Azad University, Arak Branch, Arak, Markazi, Iran
| | | | - Mohammad Goodarzi
- Department of Immunology, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA.
| |
Collapse
|
10
|
Peng L, Huang L, Su Q, Tian G, Chen M, Han G. LDA-VGHB: identifying potential lncRNA-disease associations with singular value decomposition, variational graph auto-encoder and heterogeneous Newton boosting machine. Brief Bioinform 2023; 25:bbad466. [PMID: 38127089 PMCID: PMC10734633 DOI: 10.1093/bib/bbad466] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/05/2023] [Revised: 10/05/2023] [Accepted: 11/25/2023] [Indexed: 12/23/2023] Open
Abstract
Long noncoding RNAs (lncRNAs) participate in various biological processes and have close linkages with diseases. In vivo and in vitro experiments have validated many associations between lncRNAs and diseases. However, biological experiments are time-consuming and expensive. Here, we introduce LDA-VGHB, an lncRNA-disease association (LDA) identification framework, by incorporating feature extraction based on singular value decomposition and variational graph autoencoder and LDA classification based on heterogeneous Newton boosting machine. LDA-VGHB was compared with four classical LDA prediction methods (i.e. SDLDA, LDNFSGB, IPCARF and LDASR) and four popular boosting models (XGBoost, AdaBoost, CatBoost and LightGBM) under 5-fold cross-validations on lncRNAs, diseases, lncRNA-disease pairs and independent lncRNAs and independent diseases, respectively. It greatly outperformed the other methods with its prominent performance under four different cross-validations on the lncRNADisease and MNDR databases. We further investigated potential lncRNAs for lung cancer, breast cancer, colorectal cancer and kidney neoplasms and inferred the top 20 lncRNAs associated with them among all their unobserved lncRNAs. The results showed that most of the predicted top 20 lncRNAs have been verified by biomedical experiments provided by the Lnc2Cancer 3.0, lncRNADisease v2.0 and RNADisease databases as well as publications. We found that HAR1A, KCNQ1DN, ZFAT-AS1 and HAR1B could associate with lung cancer, breast cancer, colorectal cancer and kidney neoplasms, respectively. The results need further biological experimental validation. We foresee that LDA-VGHB was capable of identifying possible lncRNAs for complex diseases. LDA-VGHB is publicly available at https://github.com/plhhnu/LDA-VGHB.
Collapse
Affiliation(s)
- Lihong Peng
- School of Computer Science, Hunan University of Technology, 412007, Hunan, China
- College of Life Sciences and Chemistry, Hunan University of Technology, 412007, Hunan, China
| | - Liangliang Huang
- School of Computer Science, Hunan University of Technology, 412007, Hunan, China
| | - Qiongli Su
- Department of Pharmacy, the Affiliated Zhuzhou Hospital Xiangya Medical College CSU, 412007, Hunan, China
| | - Geng Tian
- Geneis (Beijing) Co. Ltd, China, 100102, Beijing, China
| | - Min Chen
- School of Computer Science, Hunan Institute of Technology, 421002, No. 18 Henghua Road, Zhuhui District, Hengyang, Hunan, China
| | - Guosheng Han
- School of Mathematics and Computational Science, Xiangtan University, 411105, Yuhu District, Xiangtan, Hunan, China
- Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, 411105, Yuhu District, Xiangtan, Hunan, China
| |
Collapse
|
11
|
Alghushairy O, Ali F, Alghamdi W, Khalid M, Alsini R, Asiry O. Machine learning-based model for accurate identification of druggable proteins using light extreme gradient boosting. J Biomol Struct Dyn 2023; 42:12330-12341. [PMID: 37850427 DOI: 10.1080/07391102.2023.2269280] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/09/2023] [Accepted: 10/04/2023] [Indexed: 10/19/2023]
Abstract
The identification of druggable proteins (DPs) is significant for the development of new drugs, personalized medicine, understanding of disease mechanisms, drug repurposing, and economic benefits. By identifying new druggable targets, researchers can develop new therapies for a range of diseases, leading to better patient outcomes. Identification of DPs by machine learning strategies is more efficient and cost-effective than conventional methods. In this study, a computational predictor, namely Drug-LXGB, is introduced to enhance the identification of DPs. Features are discovered by composition, transition, and distribution (CTD), composition of K-spaced amino acid pair (CKSAAP), pseudo-position-specific scoring matrix (PsePSSM), and a novel descriptor, called multi-block pseudo amino acid composition (MB-PseAAC). The dimensions of CTD, CKSAAP, PsePSSM, and MB-PseAAC are integrated and utilized the sequential forward selection as feature selection algorithm. The best characteristics are provided by random forest, extreme gradient boosting, and light eXtreme gradient boosting (LXGB). The predictive analysis of these learning methods is measured via 10-fold cross-validation. The LXGB-based model secures the highest results than other existing predictors. Our novel protocol will perform an active role in designing novel drugs and would be fruitful to explore the potential target. This study will help better to capture a more universal view of a potential target.Communicated by Ramaswamy H. Sarma.
Collapse
Affiliation(s)
- Omar Alghushairy
- Department of Information Systems and Technology, College of Computer Science and Engineering, University of Jeddah, Jeddah, Saudi Arabia
| | - Farman Ali
- Department of Software Engineering, Sarhad University of Science and Information Technology Peshawar Mardan Campus, Peshawar, Pakistan
| | - Wajdi Alghamdi
- Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Majdi Khalid
- Department of Computer Science, College of Computers and Information Systems, Umm Al-Qura University, Makkah, Saudi Arabia
| | - Raed Alsini
- Department of Information Systems, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Othman Asiry
- Department of Information Technology, College of Computing and Information Technology at Khulais, University of Jeddah, Jeddah, Saudi Arabia
| |
Collapse
|
12
|
Peng L, Yuan R, Han C, Han G, Tan J, Wang Z, Chen M, Chen X. CellEnBoost: A Boosting-Based Ligand-Receptor Interaction Identification Model for Cell-to-Cell Communication Inference. IEEE Trans Nanobioscience 2023; 22:705-715. [PMID: 37216267 DOI: 10.1109/tnb.2023.3278685] [Citation(s) in RCA: 19] [Impact Index Per Article: 9.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/24/2023]
Abstract
Cell-to-cell communication (CCC) plays important roles in multicellular organisms. The identification of communication between cancer cells themselves and one between cancer cells and normal cells in tumor microenvironment helps understand cancer genesis, development and metastasis. CCC is usually mediated by Ligand-Receptor Interactions (LRIs). In this manuscript, we developed a Boosting-based LRI identification model (CellEnBoost) for CCC inference. First, potential LRIs are predicted by data collection, feature extraction, dimensional reduction, and classification based on an ensemble of Light gradient boosting machine and AdaBoost combining convolutional neural network. Next, the predicted LRIs and known LRIs are filtered. Third, the filtered LRIs are applied to CCC elucidation by combining CCC strength measurement and single-cell RNA sequencing data. Finally, CCC inference results are visualized using heatmap view, Circos plot view, and network view. The experimental results show that CellEnBoost obtained the best AUCs and AUPRs on the collected four LRI datasets. Case study in human head and neck squamous cell carcinoma (HNSCC) tissues demonstrates that fibroblasts were more likely to communicate with HNSCC cells, which is in accord with the results from iTALK. We anticipate that this work can contribute to the diagnosis and treatment of cancers.
Collapse
|
13
|
Peng L, Tan J, Xiong W, Zhang L, Wang Z, Yuan R, Li Z, Chen X. Deciphering ligand-receptor-mediated intercellular communication based on ensemble deep learning and the joint scoring strategy from single-cell transcriptomic data. Comput Biol Med 2023; 163:107137. [PMID: 37364528 DOI: 10.1016/j.compbiomed.2023.107137] [Citation(s) in RCA: 17] [Impact Index Per Article: 8.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/26/2023] [Revised: 05/18/2023] [Accepted: 06/04/2023] [Indexed: 06/28/2023]
Abstract
BACKGROUND Cell-cell communication in a tumor microenvironment is vital to tumorigenesis, tumor progression and therapy. Intercellular communication inference helps understand molecular mechanisms of tumor growth, progression and metastasis. METHODS Focusing on ligand-receptor co-expressions, in this study, we developed an ensemble deep learning framework, CellComNet, to decipher ligand-receptor-mediated cell-cell communication from single-cell transcriptomic data. First, credible LRIs are captured by integrating data arrangement, feature extraction, dimension reduction, and LRI classification based on an ensemble of heterogeneous Newton boosting machine and deep neural network. Next, known and identified LRIs are screened based on single-cell RNA sequencing (scRNA-seq) data in certain tissues. Finally, cell-cell communication is inferred by incorporating scRNA-seq data, the screened LRIs, a joint scoring strategy that combines expression thresholding and expression product of ligands and receptors. RESULTS The proposed CellComNet framework was compared with four competing protein-protein interaction prediction models (PIPR, XGBoost, DNNXGB, and OR-RCNN) and obtained the best AUCs and AUPRs on four LRI datasets, elucidating the optimal LRI classification ability. CellComNet was further applied to analyze intercellular communication in human melanoma and head and neck squamous cell carcinoma (HNSCC) tissues. The results demonstrate that cancer-associated fibroblasts highly communicate with melanoma cells and endothelial cells strong communicate with HNSCC cells. CONCLUSIONS The proposed CellComNet framework efficiently identified credible LRIs and significantly improved cell-cell communication inference performance. We anticipate that CellComNet can contribute to anticancer drug design and tumor-targeted therapy.
Collapse
Affiliation(s)
- Lihong Peng
- School of Computer Science, Hunan University of Technology, Zhuzhou, 412007, Hunan, China; College of Life Sciences and Chemistry, Hunan University of Technology, Zhuzhou, 412007, Hunan, China
| | - Jingwei Tan
- School of Computer Science, Hunan University of Technology, Zhuzhou, 412007, Hunan, China
| | - Wei Xiong
- School of Computer Science, Hunan University of Technology, Zhuzhou, 412007, Hunan, China
| | - Li Zhang
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, 221116, Jiangsu, China
| | - Zhao Wang
- School of Computer Science, Hunan University of Technology, Zhuzhou, 412007, Hunan, China
| | - Ruya Yuan
- School of Computer Science, Hunan University of Technology, Zhuzhou, 412007, Hunan, China
| | - Zejun Li
- School of Computer Science, Hunan Institute of Technology, Hengyang, 421002, Hunan, China.
| | - Xing Chen
- School of Science, Jiangnan University, Wuxi, 214122, Jiangsu, China.
| |
Collapse
|
14
|
Wang M, Yan L, Jia J, Lai J, Zhou H, Yu B. DE-MHAIPs: Identification of SARS-CoV-2 phosphorylation sites based on differential evolution multi-feature learning and multi-head attention mechanism. Comput Biol Med 2023; 160:106935. [PMID: 37120990 PMCID: PMC10140648 DOI: 10.1016/j.compbiomed.2023.106935] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/20/2023] [Revised: 03/12/2023] [Accepted: 04/13/2023] [Indexed: 05/02/2023]
Abstract
The rapid spread of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) around the world affects the normal lives of people all over the world. The computational methods can be used to accurately identify SARS-CoV-2 phosphorylation sites. In this paper, a new prediction model of SARS-CoV-2 phosphorylation sites, called DE-MHAIPs, is proposed. First, we use six feature extraction methods to extract protein sequence information from different perspectives. For the first time, we use a differential evolution (DE) algorithm to learn individual feature weights and fuse multi-information in a weighted combination. Next, Group LASSO is used to select a subset of good features. Then, the important protein information is given higher weight through multi-head attention. After that, the processed data is fed into long short-term memory network (LSTM) to further enhance model's ability to learn features. Finally, the data from LSTM are input into fully connected neural network (FCN) to predict SARS-CoV-2 phosphorylation sites. The AUC values of the S/T and Y datasets under 5-fold cross-validation reach 91.98% and 98.32%, respectively. The AUC values of the two datasets on the independent test set reach 91.72% and 97.78%, respectively. The experimental results show that the DE-MHAIPs method exhibits excellent predictive ability compared with other methods.
Collapse
Affiliation(s)
- Minghui Wang
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Lu Yan
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Jihua Jia
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Jiali Lai
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Hongyan Zhou
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China.
| | - Bin Yu
- College of Information Science and Technology, School of Data Science, Qingdao University of Science and Technology, Qingdao, 266061, China; School of Data Science, University of Science and Technology of China, Hefei, 230027, China.
| |
Collapse
|
15
|
Ding Y, He W, Tang J, Zou Q, Guo F. Laplacian Regularized Sparse Representation Based Classifier for Identifying DNA N4-Methylcytosine Sites via L 2,1/2-Matrix Norm. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:500-511. [PMID: 34882559 DOI: 10.1109/tcbb.2021.3133309] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/04/2023]
Abstract
N4-methylcytosine (4mC) is one of important epigenetic modifications in DNA sequences. Detecting 4mC sites is time-consuming. The computational method based on machine learning has provided effective help for identifying 4mC. To further improve the performance of prediction, we propose a Laplacian Regularized Sparse Representation based Classifier with L2,1/2-matrix norm (LapRSRC). We also utilize kernel trick to derive the kernel LapRSRC for nonlinear modeling. Matrix factorization technology is employed to solve the sparse representation coefficients of all test samples in the training set. And an efficient iterative algorithm is proposed to solve the objective function. We implement our model on six benchmark datasets of 4mC and eight UCI datasets to evaluate performance. The results show that the performance of our method is better or comparable.
Collapse
|
16
|
Singh A, Tiwari AK. Machine learning-based approach for prediction of ion channels and their subclasses. J Cell Biochem 2023; 124:72-88. [PMID: 36271914 DOI: 10.1002/jcb.30343] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/23/2022] [Revised: 10/10/2022] [Accepted: 10/12/2022] [Indexed: 01/25/2023]
Abstract
Ion channels are ion-permeable protein pores that are found in all cell lipid membranes. Distinct ion channels play multiple roles in biological processes. Proteomic data is fast accumulating as a result of the fast growth of mass spectrometry and giving us the chance to comprehensively explore ion channel classes along with their subclasses. This paper proposes an eXtreme Gradient Boosting (XGBoost)-based method to estimate the ion channel classes and their subclasses. Here, 12 feature vectors are applied to better characterize protein sequences like amino acid composition, pseudo-amino acid composition, normalized moreau-broto autocorrelation, amphiphilic pseudo-amino acid composition, dipeptide composition, Geary autocorrelation, tripeptide composition, sequence-order-coupling number, composition/transition/distribution, conjoint triad, moran autocorrelation, quasi-sequence-order descriptors. Here, a total of 9920 features are extracted from the protein sequence. The principal component analysis is applied to determine the optimal number of features to optimize the performance. In 10-fold cross-validation the proposed XGBoost based approach with optimal 50 features achieved accuracy of 100%, 98.70%, 98.77%, 97.26%, 87.40%, 97.39%, 98.03%, 96.42%, and F1-Score of 100%, 99%, 99%, 97%, 87%, 97%, 98%, 97%, for prediction of ion channel and nonion channel, voltage-gated and ligand-gated ion channels, subclasses of voltage-gated ion channels (VGICs), subclasses of ligand-gated ion channels (LGICs), subclasses of voltage-gated calcium channels (VGCCs), subclasses of voltage-gated potassium channels (VGKCs), subclasses of voltage-gated sodium channels (VGSCs), and subclasses of voltage-gated chloride channels, respectively. Here the proposed approach also compares with the other approaches such as support vector machine, k-nearest neighbor, Gaussian Naïve Bayes, and random forest and also compares with existing methods such as support vector machine (SVM) with maximum relevance maximum distance with an accuracy of 86.6%, 83.7%, and 85.1%, for ion channels, non-ion channels and overall respectively and SVM with radial basis function kernel-based method with an accuracy of 100%, 97% and 99.9% for ion channels, nonion channels, and overall accuracy, respectively.
Collapse
Affiliation(s)
- Anuj Singh
- Department of Computer Science and Engineering, Kamla Nehru Institute of Technology, Sultanpur, Uttar Pradesh, India
| | - Arvind Kumar Tiwari
- Department of Computer Science and Engineering, Kamla Nehru Institute of Technology, Sultanpur, Uttar Pradesh, India
| |
Collapse
|
17
|
Identification of adaptor proteins using the ANOVA feature selection technique. Methods 2022; 208:42-47. [DOI: 10.1016/j.ymeth.2022.10.008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/28/2022] [Revised: 10/01/2022] [Accepted: 10/24/2022] [Indexed: 11/06/2022] Open
|
18
|
Gao H, Chen C, Li S, Wang C, Zhou W, Yu B. Prediction of protein-protein interactions based on ensemble residual conventional neural network. Comput Biol Med 2022. [DOI: 10.1016/j.compbiomed.2022.106471] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/24/2022]
|
19
|
Wei Q, Zhang Q, Gao H, Song T, Salhi A, Yu B. DEEPStack-RBP: Accurate identification of RNA-binding proteins based on autoencoder feature selection and deep stacking ensemble classifier. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.109875] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/31/2022]
|
20
|
PITHIA: Protein Interaction Site Prediction Using Multiple Sequence Alignments and Attention. Int J Mol Sci 2022; 23:ijms232112814. [PMID: 36361606 PMCID: PMC9657891 DOI: 10.3390/ijms232112814] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/22/2022] [Revised: 10/06/2022] [Accepted: 10/07/2022] [Indexed: 11/22/2022] Open
Abstract
Cellular functions are governed by proteins, and, while some proteins work independently, most work by interacting with other proteins. As a result it is crucially important to know the interaction sites that facilitate the interactions between the proteins. Since the experimental methods are costly and time consuming, it is essential to develop effective computational methods. We present PITHIA, a sequence-based deep learning model for protein interaction site prediction that exploits the combination of multiple sequence alignments and learning attention. We demonstrate that our new model clearly outperforms the state-of-the-art models on a wide range of metrics. In order to provide meaningful comparison, we update existing test datasets with new information regarding interaction site, as well as introduce an additional new testing dataset which resolves the shortcomings of the existing ones.
Collapse
|
21
|
DBP-iDWT: Improving DNA-Binding Proteins Prediction Using Multi-Perspective Evolutionary Profile and Discrete Wavelet Transform. COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE 2022; 2022:2987407. [PMID: 36211019 PMCID: PMC9534628 DOI: 10.1155/2022/2987407] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 06/29/2022] [Revised: 08/19/2022] [Accepted: 09/09/2022] [Indexed: 11/17/2022]
Abstract
DNA-binding proteins (DBPs) have crucial biotic activities including DNA replication, recombination, and transcription. DBPs are highly concerned with chronic diseases and are used in the manufacturing of antibiotics and steroids. A series of predictors were established to identify DBPs. However, researchers are still working to further enhance the identification of DBPs. This research designed a novel predictor to identify DBPs more accurately. The features from the sequences are transformed by F-PSSM (Filtered position-specific scoring matrix), PSSM-DPC (Position specific scoring matrix-dipeptide composition), and R-PSSM (Reduced position-specific scoring matrix). To eliminate the noisy attributes, we extended DWT (discrete wavelet transform) to F-PSSM, PSSM-DPC, and R-PSSM and introduced three novel descriptors, namely, F-PSSM-DWT, PSSM-DPC-DWT, and R-PSSM-DWT. Onward, the training of the four models were performed using LiXGB (Light eXtreme gradient boosting), XGB (eXtreme gradient boosting, ERT (extremely randomized trees), and Adaboost. LiXGB with R-PSSM-DWT has attained 6.55% higher accuracy on training and 5.93% on testing dataset than the best existing predictors. The results reveal the excellent performance of our novel predictor over the past studies. DBP-iDWT would be fruitful for establishing more operative therapeutic strategies for fatal disease treatment.
Collapse
|
22
|
Shi Y, Li M, Wen J, Yang Y, Zeng J. Deep Learning-Based Approach for Heat Transfer Efficiency Prediction with Deep Feature Extraction. ACS OMEGA 2022; 7:31013-31035. [PMID: 36092576 PMCID: PMC9453825 DOI: 10.1021/acsomega.2c03052] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 05/16/2022] [Accepted: 08/16/2022] [Indexed: 06/15/2023]
Abstract
Failure to blow ash on the heated surface of the boiler will cause a drop in heat transfer rate and even industrial safety accidents. Nowadays, the shortcomings of the fixed soot blowing operation every hour and every shift are significant, which can be improved by high-precision ash accumulation prediction. Therefore, this paper proposes a deep learning model fused with deep feature extraction. First, a dynamic fouling model and a health index-clearness factor (CF) of the heated surface are established. The data preprocessing method reduces unnecessary forecasting difficulty and makes the degradation trend of the CF time series more obvious. In addition, deep feature extraction is composed of complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN) and kernel principal component analysis (KPCA), which completes the multiscale analysis of time series and reduces the training time of deep learning models, and has significant contributions to improving prediction accuracy and reducing time consumption. The adaptive sliding window and the encoder-decoder based on the attention mechanism (EDA) can better mine the internal information of the time series. Compared with long short-term memory (LSTM), taking the 300 MW boiler's various heated surface data sets as an example, multistep forward prediction and different starting point prediction experiments have verified the superiority and effectiveness of the model. Finally, under the variable working condition economizer datasets, the proposed method better completes the predictive maintenance task of the heated surface. The research results provide operational guidance for improving heat transfer rate, energy saving, and reducing consumption.
Collapse
|
23
|
Shi G, Shen X, Ren H, Rao Y, Weng S, Tang X. Kernel principal component analysis and differential non-linear feature extraction of pesticide residues on fruit surface based on surface-enhanced Raman spectroscopy. FRONTIERS IN PLANT SCIENCE 2022; 13:956778. [PMID: 35928706 PMCID: PMC9344007 DOI: 10.3389/fpls.2022.956778] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 05/30/2022] [Accepted: 06/29/2022] [Indexed: 06/15/2023]
Abstract
Surface-enhanced Raman spectroscopy (SERS) has attracted much attention because of its high sensitivity, high speed, and simple sample processing, and has great potential for application in the field of pesticide residue detection. However, SERS is susceptible to the influence of a complex detection environment in the detection of pesticide residues on the surface of fruits, facing problems such as interference from the spectral peaks of detected impurities, unclear dimension of effective correlation data, and poor linearity of sensing signals. In this work, the enhanced raw data of the pesticide thiram residues on the fruit surface using gold nanoparticle (Au-NPs) solution are formed into the raw data set of Raman signal in the IoT environment of Raman spectroscopy principal component detection. Considering the non-linear characteristics of sensing data, this work adopts kernel principal component analysis (KPCA) including radial basis function (RBF) to extract the main features for the spectra in the ranges of 653∼683 cm-1, 705∼728 cm-1, and 847∼872 cm-1, and discusses the effects of different kernel function widths (σ) to construct a qualitative analysis of pesticide residues based on SERS spectral data model, so that the SERS spectral data produce more useful dimensionality reduction with minimal loss, higher mean squared error for cross-validation in non-linear scenarios, and effectively weaken the interference features of detecting impurity spectral peaks, unclear dimensionality of effective correlation data, and poor linearity of sensing signals, reflecting better extraction effects than conventional principal component analysis (PCA) models.
Collapse
Affiliation(s)
- Guolong Shi
- School of Information and Computer, Anhui Agricultural University, Hefei, China
- School of Electrical Engineering and Automation, Wuhan University, Wuhan, China
| | - Xinyi Shen
- School of Information and Computer, Anhui Agricultural University, Hefei, China
| | - Huan Ren
- School of Information and Computer, Anhui Agricultural University, Hefei, China
| | - Yuan Rao
- School of Information and Computer, Anhui Agricultural University, Hefei, China
- Key Laboratory of Agricultural Sensors, Ministry of Agriculture and Rural Affairs, Hefei, China
| | - Shizhuang Weng
- National Engineering Research Center for Agro-Ecological Big Data Analysis and Application, Anhui University, Hefei, China
| | - Xianghu Tang
- School of Information and Computer, Anhui Agricultural University, Hefei, China
- Institute of Solid State Physics, Hefei Institutes of Physical Science (HFIPS), Chinese Academy of Sciences, Hefei, China
| |
Collapse
|
24
|
Yu B, Zhang Y, Wang X, Gao H, Sun J, Gao X. Identification of DNA modification sites based on elastic net and bidirectional gated recurrent unit with convolutional neural network. Biomed Signal Process Control 2022. [DOI: 10.1016/j.bspc.2022.103566] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/06/2023]
|
25
|
Yu B, Wang X, Zhang Y, Gao H, Wang Y, Liu Y, Gao X. RPI-MDLStack: Predicting RNA-protein interactions through deep learning with stacking strategy and LASSO. Appl Soft Comput 2022. [DOI: 10.1016/j.asoc.2022.108676] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/05/2023]
|
26
|
Li S, Wu S, Wang L, Li F, Jiang H, Bai F. Recent advances in predicting protein-protein interactions with the aid of artificial intelligence algorithms. Curr Opin Struct Biol 2022; 73:102344. [PMID: 35219216 DOI: 10.1016/j.sbi.2022.102344] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/20/2021] [Revised: 01/02/2022] [Accepted: 01/17/2022] [Indexed: 12/15/2022]
Abstract
Protein-protein interactions (PPIs) are essential in the regulation of biological functions and cell events, therefore understanding PPIs have become a key issue to understanding the molecular mechanism and investigating the design of drugs. Here we highlight the major developments in computational methods developed for predicting PPIs by using types of artificial intelligence algorithms. The first part introduces the source of experimental PPI data. The second part is devoted to the PPI prediction methods based on sequential information. The third part covers representative methods using structural information as the input feature. The last part is methods designed by combining different types of features. For each part, the state-of-the-art computational PPI prediction methods are reviewed in an inclusive view. Finally, we discuss the flaws existing in this area and future directions of next-generation algorithms.
Collapse
Affiliation(s)
- Shiwei Li
- Shanghai Institute for Advanced Immunochemical Studies and School of Life Science and Technology, ShanghaiTech University, Shanghai, China
| | - Sanan Wu
- Shanghai Institute for Advanced Immunochemical Studies and School of Life Science and Technology, ShanghaiTech University, Shanghai, China
| | - Lin Wang
- Shanghai Institute for Advanced Immunochemical Studies and School of Life Science and Technology, ShanghaiTech University, Shanghai, China
| | - Fenglei Li
- Shanghai Institute for Advanced Immunochemical Studies and School of Life Science and Technology, ShanghaiTech University, Shanghai, China; School of Information Science and Technology, ShanghaiTech University, Shanghai, China
| | - Hualiang Jiang
- Shanghai Institute for Advanced Immunochemical Studies and School of Life Science and Technology, ShanghaiTech University, Shanghai, China; Drug Discovery and Design Center, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Pudong, Shanghai, 201203, China
| | - Fang Bai
- Shanghai Institute for Advanced Immunochemical Studies and School of Life Science and Technology, ShanghaiTech University, Shanghai, China; School of Information Science and Technology, ShanghaiTech University, Shanghai, China.
| |
Collapse
|
27
|
Zhang Y, Jiang Z, Chen C, Wei Q, Gu H, Yu B. DeepStack-DTIs: Predicting Drug-Target Interactions Using LightGBM Feature Selection and Deep-Stacked Ensemble Classifier. Interdiscip Sci 2021; 14:311-330. [PMID: 34731411 DOI: 10.1007/s12539-021-00488-7] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/26/2021] [Revised: 10/19/2021] [Accepted: 10/21/2021] [Indexed: 12/12/2022]
Abstract
Accurate prediction of drug-target interactions (DTIs), which is often used in the fields of drug discovery and drug repositioning, is regarded a key challenge in the study of drug science. In this paper, a new method called DeepStack-DTIs is proposed to predict DTIs. First, for the target protein, pseudo-position specific score matrix, pseudo amino acid composition and SPIDER3 are used to extract the different feature information of the target protein. Meanwhile, the path-based fingerprint features of each drug are extracted. Then, the synthetic minority oversampling technique (SMOTE) and light gradient boosting machine (LightGBM) are used for data balancing and feature selection, respectively. Finally, the processed features are input to the deep-stacked ensemble classifier composed of gated recurrent unit (GRU), deep neural network (DNN), support vector machine (SVM), eXtreme gradient boosting (XGBoost) and logistic regression (LR) to predict DTIs. Under the five-fold cross-validation and compared with existing methods, the proposed method achieves higher prediction accuracy on the gold standard dataset. To evaluate the predictive power of DeepStack-DTIs, we validate the method on another dataset and predict the drug-target interaction network. The results indicate that DeepStack-DTIs has excellent predictive ability than the other methods, and provides novel insights for the prediction of DTIs. A novel method DeepStack-DTIs for drug-target interactions prediction. PsePSSM, PseAAC, SPIDER3 and FP2 are fused to convert protein sequence and drug molecule information into digital information, respectively. The SMOTE algorithm is used to balance the dataset and LightGBM feature selection algorithm is employed to remove redundant and irrelevant features to select the optimal feature subset. This optimal feature subset is inputted into the deep-stacked ensemble classifier to predict drug-target interactions. The experimental results show DeepStack-DTIs method can significantly improve the prediction accuracy of drug-target interactions.
Collapse
Affiliation(s)
- Yan Zhang
- College of Mechanical and Electrical Engineering, Qingdao University of Science and Technology, Qingdao, 266061, China.,College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China.,Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Zhiwen Jiang
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China.,Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Cheng Chen
- School of Computer Science and Technology, Shandong University, Qingdao, 266237, China
| | - Qinqin Wei
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China.,Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Haiming Gu
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China.,Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Bin Yu
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China. .,Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China. .,Key Laboratory of Computational Science and Application of Hainan Province, Haikou, 571158, China.
| |
Collapse
|
28
|
Wang P, Zhang G, Yu ZG, Huang G. A Deep Learning and XGBoost-Based Method for Predicting Protein-Protein Interaction Sites. Front Genet 2021; 12:752732. [PMID: 34764983 PMCID: PMC8576272 DOI: 10.3389/fgene.2021.752732] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/03/2021] [Accepted: 09/20/2021] [Indexed: 11/29/2022] Open
Abstract
Knowledge about protein-protein interactions is beneficial in understanding cellular mechanisms. Protein-protein interactions are usually determined according to their protein-protein interaction sites. Due to the limitations of current techniques, it is still a challenging task to detect protein-protein interaction sites. In this article, we presented a method based on deep learning and XGBoost (called DeepPPISP-XGB) for predicting protein-protein interaction sites. The deep learning model served as a feature extractor to remove redundant information from protein sequences. The Extreme Gradient Boosting algorithm was used to construct a classifier for predicting protein-protein interaction sites. The DeepPPISP-XGB achieved the following results: area under the receiver operating characteristic curve of 0.681, a recall of 0.624, and area under the precision-recall curve of 0.339, being competitive with the state-of-the-art methods. We also validated the positive role of global features in predicting protein-protein interaction sites.
Collapse
Affiliation(s)
- Pan Wang
- School of Electrical Engineering, Shaoyang University, Shaoyang, China
| | - Guiyang Zhang
- School of Electrical Engineering, Shaoyang University, Shaoyang, China
| | - Zu-Guo Yu
- Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education and Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, Xiangtan, China
| | - Guohua Huang
- School of Electrical Engineering, Shaoyang University, Shaoyang, China
| |
Collapse
|