51
|
Predicting Protein–Protein Interactions Based on Ensemble Learning-Based Model from Protein Sequence. Biology 2022; 11:995. [PMID: 36101379 PMCID: PMC9311754 DOI: 10.3390/biology11070995] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/24/2022] [Revised: 05/27/2022] [Accepted: 06/29/2022] [Indexed: 11/17/2022]
Abstract
Simple Summary: Most traditional high-throughput experiments for identifying potential protein–protein interactions are tedious and laborious. To improve the accuracy of protein–protein interaction prediction, we propose a novel computational method that can identify unknown protein–protein interactions efficiently, and we hope this method can provide a helpful idea and tool for proteomics research.
Abstract: Protein–protein interactions (PPIs) play an essential role in many cellular functions. However, identifying PPIs through traditional experimental methods is still tedious and time-consuming. For this reason, it is necessary to develop a computational method for predicting PPIs efficiently. This paper explores a novel computational method for detecting PPIs from protein sequences, which combines a feature extraction method, Locality Preserving Projections (LPP), with a classifier, Rotation Forest (RF). Specifically, we first employ the Position Specific Scoring Matrix (PSSM), which retains biological evolutionary information, to represent protein sequences efficiently. Then, the LPP descriptor is applied to extract feature vectors from the PSSM. The feature vectors are fed into the RF to obtain the final results. The proposed method was applied to two datasets, Yeast and H. pylori, and obtained average accuracies of 92.81% and 92.56%, respectively. We also compare it with K nearest neighbors (KNN) and support vector machine (SVM) classifiers to better evaluate its performance. In summary, the experimental results indicate that the proposed approach is stable and robust for predicting PPIs and is a promising tool for proteomics research.
|
52
|
Li X, Han P, Wang G, Chen W, Wang S, Song T. SDNN-PPI: self-attention with deep neural network effect on protein-protein interaction prediction. BMC Genomics 2022; 23:474. [PMID: 35761175 PMCID: PMC9235110 DOI: 10.1186/s12864-022-08687-2] [Citation(s) in RCA: 24] [Impact Index Per Article: 12.0] [Received: 05/07/2022] [Accepted: 06/10/2022] [Indexed: 12/20/2022] Open
Abstract
BACKGROUND: Protein-protein interactions (PPIs) dominate intracellular molecules to perform a series of tasks such as transcriptional regulation, information transduction, and drug signalling. The traditional wet-lab experimental approach to obtaining PPI information is costly and time-consuming. RESULTS: In this paper, SDNN-PPI, a PPI prediction method based on self-attention and deep learning, is proposed. The method adopts amino acid composition (AAC), conjoint triad (CT), and auto covariance (AC) to extract global and local features of protein sequences, and leverages self-attention to enhance DNN feature extraction so as to predict PPIs more effectively. To verify the generalization ability of SDNN-PPI, 5-fold cross-validation is performed on the intraspecific interaction datasets of Saccharomyces cerevisiae (core subset) and human, on which the accuracy reaches 95.48% and 98.94%, respectively. Accuracies of 93.15% and 88.33% are obtained on the interspecific interaction datasets of human-Bacillus anthracis and human-Yersinia pestis, respectively. On the independent datasets of Caenorhabditis elegans, Escherichia coli, Homo sapiens, and Mus musculus, the prediction accuracy is 100% in every case, which is higher than that of previous PPI prediction methods. To further evaluate the advantages and disadvantages of the model, PPIs in the one-core and crossover networks are predicted, and the data show that the model correctly predicts the interaction pairs in these networks. CONCLUSION: In this paper, the AAC, CT, and AC methods are used to encode the sequences, and the SDNN-PPI method is proposed to predict PPIs with a self-attention deep learning neural network. Satisfactory results are obtained on interspecific and intraspecific datasets, and good performance is also achieved in cross-species prediction. The model can also correctly predict the protein interactions of the cell and tumor information contained in the one-core and crossover networks. The SDNN-PPI proposed in this paper not only explores the mechanism of protein-protein interaction, but also provides new ideas for drug design and disease prevention.
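As a rough illustration of the amino acid composition (AAC) encoding named in this abstract, the following Python sketch computes the fraction of each of the 20 standard amino acids in a sequence. The function name, the alphabet ordering, and the choice to ignore non-standard residues are assumptions made for this example; the authors' exact preprocessing is not described here.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return a 20-element list with the fraction of each standard residue.

    Characters outside the standard alphabet (e.g. 'X') are ignored,
    which is an illustrative choice rather than a documented one.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


if __name__ == "__main__":
    vector = amino_acid_composition("MKTAYIAKQR")
    for aa, fraction in zip(AMINO_ACIDS, vector):
        if fraction > 0:
            print(aa, round(fraction, 3))
```

A pair of proteins would typically be represented by combining the two per-protein vectors; how SDNN-PPI combines AAC with CT and AC features is not specified in the abstract.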
Affiliation(s)
- Xue Li
- College of Computer Science and Technology, China University of Petroleum (East China), Qingdao, China
- Peifu Han
- College of Computer Science and Technology, China University of Petroleum (East China), Qingdao, China
- Gan Wang
- College of Computer Science and Technology, China University of Petroleum (East China), Qingdao, China
- Wenqi Chen
- College of Computer Science and Technology, China University of Petroleum (East China), Qingdao, China
- Shuang Wang
- College of Computer Science and Technology, China University of Petroleum (East China), Qingdao, China
- Tao Song
- College of Computer Science and Technology, China University of Petroleum (East China), Qingdao, China
|
53
|
Hesami M, Alizadeh M, Jones AMP, Torkamaneh D. Machine learning: its challenges and opportunities in plant system biology. Appl Microbiol Biotechnol 2022; 106:3507-3530. [PMID: 35575915 DOI: 10.1007/s00253-022-11963-6] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Received: 03/14/2022] [Revised: 03/14/2022] [Accepted: 05/07/2022] [Indexed: 12/25/2022]
Abstract
Sequencing technologies are evolving at a rapid pace, enabling the generation of massive amounts of data in multiple dimensions (e.g., genomics, epigenomics, transcriptomics, metabolomics, proteomics, and single-cell omics) in plants. To provide comprehensive insights into the complexity of plant biological systems, it is important to integrate different omics datasets. Although recent advances in computational analytical pipelines have enabled efficient and high-quality exploration and exploitation of single omics data, the integration of multidimensional, heterogeneous, and large datasets (i.e., multi-omics) remains a challenge. In this regard, machine learning (ML) offers promising approaches to integrate large datasets and to recognize fine-grained patterns and relationships. Nevertheless, ML approaches require rigorous optimization to process multi-omics-derived datasets. In this review, we discuss the main concepts of machine learning as well as the key challenges and solutions related to the big data derived from plant system biology. We also provide in-depth insight into the principles of data integration using ML, as well as challenges and opportunities in different contexts including multi-omics, single-cell omics, protein function, and protein-protein interaction. KEY POINTS: • The key challenges and solutions related to the big data derived from plant system biology have been highlighted. • Different methods of data integration have been discussed. • Challenges and opportunities of the application of machine learning in plant system biology have been highlighted and discussed.
Affiliation(s)
- Mohsen Hesami
- Department of Plant Agriculture, University of Guelph, Guelph, ON, N1G 2W1, Canada
- Milad Alizadeh
- Department of Botany, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
- Davoud Torkamaneh
- Département de Phytologie, Université Laval, Québec City, QC, G1V 0A6, Canada; Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Québec City, QC, G1V 0A6, Canada
|
54
|
Dhal SB, Jungbluth K, Lin R, Sabahi SP, Bagavathiannan M, Braga-Neto U, Kalafatis S. A Machine-Learning-Based IoT System for Optimizing Nutrient Supply in Commercial Aquaponic Operations. Sensors (Basel, Switzerland) 2022; 22:3510. [PMID: 35591199 PMCID: PMC9104751 DOI: 10.3390/s22093510] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 03/09/2022] [Revised: 05/01/2022] [Accepted: 05/03/2022] [Indexed: 11/16/2022]
Abstract
Nutrient regulation in aquaponic environments has been a topic of research for many years. Most studies have focused on appropriate control of nutrients in an aquaponic set-up, but very little research has been conducted on commercial-scale applications. In our model, the input data were sourced on a weekly basis from three commercial aquaponic farms in Southeast Texas over the course of a year. Due to the limited number of data points, dimensionality reduction techniques such as pairwise correlation matrix were used to remove the highly correlated predictors. Feature selection techniques such as the XGBoost classifier and Recursive Feature Elimination with ExtraTreesClassifier were used to rank the features in order of their relative importance. Ammonium and calcium were found to be the top two nutrient predictors, and based on the months in which lettuce was cultivated, the median of these nutrient values from the historical dataset served as the optimal concentration to be maintained in the aquaponic solution to sustain healthy growth of tilapia fish and lettuce plants in a coupled set-up. To accomplish this, Vernier sensors were used to measure the nutrient values and actuator systems were built to dispense the appropriate nutrient into the ecosystem via a closed loop.
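The pairwise-correlation filtering step mentioned in this abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the 0.9 threshold, the greedy keep-first rule, and the column names are hypothetical, and the paper's actual procedure may differ.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column is treated as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.9):
    """Greedily keep predictors whose |r| with every kept one is below threshold.

    `columns` maps a predictor name to its list of observations.
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


if __name__ == "__main__":
    # Hypothetical weekly readings, invented purely for illustration.
    data = {
        "nutrient_a": [1.0, 2.0, 3.0, 4.0],
        "nutrient_b": [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with nutrient_a
        "nutrient_c": [4.0, 1.0, 3.0, 2.0],
    }
    print(drop_correlated(data))
```

The greedy rule depends on column order, so in practice the predictors would usually be ordered by domain priority before filtering.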
Affiliation(s)
- Sambandh Bhusan Dhal
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 79016, USA
- Kyle Jungbluth
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 79016, USA
- Raymond Lin
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 79016, USA
- Seyed Pouyan Sabahi
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 79016, USA
- Ulisses Braga-Neto
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 79016, USA
- Stavros Kalafatis
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 79016, USA
|
55
|
Bhagat SK, Tiyasha T, Kumar A, Malik T, Jawad AH, Khedher KM, Deo RC, Yaseen ZM. Integrative artificial intelligence models for Australian coastal sediment lead prediction: An investigation of in-situ measurements and meteorological parameters effects. Journal of Environmental Management 2022; 309:114711. [PMID: 35182982 DOI: 10.1016/j.jenvman.2022.114711] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 09/13/2021] [Revised: 01/17/2022] [Accepted: 02/09/2022] [Indexed: 06/14/2023]
Abstract
Heavy metals (HMs) such as lead (Pb) have played a vital role in increasing the sediments of the Australian bay's ecosystem. Several meteorological parameters, i.e., minimum, maximum and average temperature (Tmin, Tmax and Tavg, °C) and rainfall (Rn, mm), and their interactions with the other batch HMs are hypothesized to have a high impact on decision-making strategies to minimize the impacts of Pb. Three feature selection (FS) algorithms, namely the Boruta method, genetic algorithm (GA) and extreme gradient boosting (XGBoost), were investigated to select the highly important predictors for Pb concentration in the coastal bay sediments of Australia. These FS algorithms were statistically evaluated using a principal component analysis (PCA) Biplot along with the correlation metrics describing the statistical characteristics that exist in the input and output parameter space of the models. To ensure high accuracy of the applied predictive artificial intelligence (AI) models, i.e., XGBoost, support vector machine (SVM) and random forest (RF), an automatic hyperparameter tuning process using a grid-search approach was also implemented. Cu, Ni, Ce, and Fe were selected by all three applied FS algorithms, whereas the Tavg and Rn inputs remained the essential parameters identified by GA and Boruta. The order of the FS outcome was XGBoost > GA > Boruta based on the applied statistical examination and the PCA Biplot results, and the order of the applied AI predictive models was XGBoost-SVM > GA-SVM > Boruta-SVM, where the SVM model remained at the top performance among the other statistical metrics. Based on the Taylor diagram for model evaluation, the RF model was only marginally different; overall, the proposed integrative AI model provided evidence of a robust and reliable predictive technique for coastal sediment Pb prediction.
Affiliation(s)
- Suraj Kumar Bhagat
- Faculty of Civil Engineering, Ton Duc Thang University, Ho Chi Minh City, Viet Nam
- Tiyasha Tiyasha
- Faculty of Civil Engineering, Ton Duc Thang University, Ho Chi Minh City, Viet Nam
- Adarsh Kumar
- Institute of Natural Sciences and Mathematics, Ural Federal University, Ekaterinburg, 620002, Russia
- Tabarak Malik
- Department of Biochemistry, College of Medicine & Health Sciences, School of Medicine, University of Gondar, Ethiopia
- Ali H Jawad
- Faculty of Applied Sciences, Universiti Teknologi MARA, 40450, Shah Alam, Selangor, Malaysia
- Khaled Mohamed Khedher
- Department of Civil Engineering, College of Engineering, King Khalid University, Abha 61421, Saudi Arabia; Department of Civil Engineering, High Institute of Technological Studies, Mrezgua University Campus, Nabeul, 8000, Tunisia
- Ravinesh C Deo
- School of Mathematics, Physics and Computing, University of Southern Queensland, Springfield, QLD, 4300, Australia
- Zaher Mundher Yaseen
- Adjunct Research Fellow, USQ's Advanced Data Analytics Research Group, School of Mathematics Physics and Computing, University of Southern Queensland, QLD 4350, Australia; Department of Urban Planning, Engineering Networks and Systems, Institute of Architecture and Construction, South Ural State University, 76, Lenin Prospect, 454080 Chelyabinsk, Russia; College of Creative Design, Asia University, Taichung City, Taiwan; New Era and Development in Civil Engineering Research Group, Scientific Research Center, Al-Ayen University, Thi-Qar, 64001, Iraq; Institute for Big Data Analytics and Artificial Intelligence (IBDAAI), Kompleks Al-Khawarizmi, Universiti Teknologi MARA, Shah Alam, 40450 Selangor, Malaysia
|
56
|
Xu Z, York LM, Seethepalli A, Bucciarelli B, Cheng H, Samac DA. Objective Phenotyping of Root System Architecture Using Image Augmentation and Machine Learning in Alfalfa (Medicago sativa L.). Plant Phenomics (Washington, D.C.) 2022; 2022:9879610. [PMID: 35479182 PMCID: PMC9012978 DOI: 10.34133/2022/9879610] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 10/13/2021] [Accepted: 03/03/2022] [Indexed: 12/28/2022]
Abstract
Active breeding programs specifically for root system architecture (RSA) phenotypes remain rare; however, breeding for branch and taproot types in the perennial crop alfalfa is ongoing. Phenotyping in this and other crops for active RSA breeding has mostly used visual scoring of specific traits or subjective classification into different root types. While image-based methods have been developed, translation to applied breeding is limited. This research is aimed at developing and comparing image-based RSA phenotyping methods using machine and deep learning algorithms for objective classification of 617 root images from mature alfalfa plants collected from the field to support the ongoing breeding efforts. Our results show that unsupervised machine learning tends to incorrectly classify roots into a normal distribution with most lines predicted as the intermediate root type. Encouragingly, random forest and TensorFlow-based neural networks can classify the root types into branch-type, taproot-type, and an intermediate taproot-branch type with 86% accuracy. With image augmentation, the prediction accuracy was improved to 97%. Coupling the predicted root type with its prediction probability will give breeders a confidence level for better decisions to advance the best and exclude the worst lines from their breeding program. This machine and deep learning approach enables accurate classification of the RSA phenotypes for genomic breeding of climate-resilient alfalfa.
Affiliation(s)
- Zhanyou Xu
- USDA-ARS, Plant Science Research Unit, 1991 Upper Buford Circle, St. Paul, MN 55108, USA
- Larry M. York
- Biosciences Division and Center for Bioenergy Innovation, Oak Ridge National Laboratory, Oak Ridge, TN 37830, USA
- Bruna Bucciarelli
- Department of Agronomy and Plant Genetics, University of Minnesota, 1991 Upper Buford Circle, St. Paul, MN 55108, USA
- Hao Cheng
- Department of Animal Science, University of California, 2251 Meyer Hall, One Shields Ave., Davis, CA 95616, USA
- Deborah A. Samac
- USDA-ARS, Plant Science Research Unit, 1991 Upper Buford Circle, St. Paul, MN 55108, USA
|
57
|
Sahni G, Mewara B, Lalwani S, Kumar R. CF-PPI: Centroid based new feature extraction approach for Protein-Protein Interaction Prediction. J EXP THEOR ARTIF IN 2022. [DOI: 10.1080/0952813x.2022.2052189] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]
Affiliation(s)
- Gunjan Sahni
- Department of Computer Science and Engineering, Career Point University, Kota, India
- Bhawna Mewara
- Department of Computer Science and Engineering, Career Point University, Kota, India
- Soniya Lalwani
- Department of Mathematics, Career Point University, Kota, India
- Rajesh Kumar
- Department of Electrical Engineering, Malaviya National Institute of Technology, Jaipur, India
|
58
|
Pan J, You ZH, Li LP, Huang WZ, Guo JX, Yu CQ, Wang LP, Zhao ZY. DWPPI: A Deep Learning Approach for Predicting Protein–Protein Interactions in Plants Based on Multi-Source Information With a Large-Scale Biological Network. Front Bioeng Biotechnol 2022; 10:807522. [PMID: 35387292 PMCID: PMC8978800 DOI: 10.3389/fbioe.2022.807522] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 11/02/2021] [Accepted: 02/25/2022] [Indexed: 12/30/2022] Open
Abstract
The prediction of protein–protein interactions (PPIs) in plants is vital for probing cell function. Although multiple high-throughput approaches in the biological domain have been developed to identify PPIs, these methods become laborious and time-consuming as the complexity of PPI networks increases. Thus, it is essential to develop an effective and feasible computational method for the prediction of PPIs in plants. In this study, we present a network embedding-based method, called DWPPI, for predicting the interactions between different plant proteins based on multi-source information combined with deep neural networks (DNN). The DWPPI model fuses protein natural language sequence information (attribute information) and protein behavior information to represent plant proteins as feature vectors, and finally sends these features to a deep learning–based classifier for prediction. To validate the prediction performance of DWPPI, we evaluated it on three model plant datasets: Arabidopsis thaliana (A. thaliana), maize (Zea mays), and rice (Oryza sativa). The experimental results with fivefold cross-validation demonstrated that DWPPI performs well, with AUC (area under the ROC curve) values of 0.9548, 0.9867, and 0.9213, respectively. To further verify the predictive capacity of DWPPI, we compared it with several state-of-the-art machine learning classifiers. Moreover, case studies were performed with the AC149810.2_FGP003 protein. As a result, 14 of the top 20 PPI pairs identified by DWPPI with the highest scores were confirmed by the literature. These results suggest that the DWPPI model can act as a promising tool for related plant molecular biology.
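For readers unfamiliar with the AUC metric reported in this abstract, the short Python sketch below computes it directly from its pairwise definition: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name and the toy labels and scores are invented for illustration and are not taken from the paper.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the pairwise (Mann-Whitney) definition.

    `labels` holds 1 for interacting pairs and 0 for non-interacting pairs;
    `scores` holds the classifier's predicted scores in the same order.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Toy example: three of the four positive/negative pairs are ranked correctly.
    print(roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))
```

The double loop is quadratic in the number of examples, which is fine for a demonstration; rank-based implementations are preferable for large datasets.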
Affiliation(s)
- Jie Pan
- School of Information Engineering, Xijing University, Xi’an, China
- Zhu-Hong You
- School of Information Engineering, Xijing University, Xi’an, China
- Li-Ping Li
- School of Information Engineering, Xijing University, Xi’an, China
- College of Grassland and Environment Science, Xinjiang Agricultural University, Urumqi, China
- *Correspondence: Li-Ping Li; Chang-Qing Yu
- Wen-Zhun Huang
- School of Information Engineering, Xijing University, Xi’an, China
- Jian-Xin Guo
- School of Information Engineering, Xijing University, Xi’an, China
- Chang-Qing Yu
- School of Information Engineering, Xijing University, Xi’an, China
- *Correspondence: Li-Ping Li; Chang-Qing Yu
- Li-Ping Wang
- School of Information Engineering, Xijing University, Xi’an, China
- Zheng-Yang Zhao
- School of Information Engineering, Xijing University, Xi’an, China
|
59
|
Yu B, Wang X, Zhang Y, Gao H, Wang Y, Liu Y, Gao X. RPI-MDLStack: Predicting RNA-protein interactions through deep learning with stacking strategy and LASSO. Appl Soft Comput 2022. [DOI: 10.1016/j.asoc.2022.108676] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 01/05/2023]
|
60
|
Wang M, Song L, Zhang Y, Gao H, Yan L, Yu B. Malsite-Deep: Prediction of protein malonylation sites through deep learning and multi-information fusion based on NearMiss-2 strategy. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.108191] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 01/04/2023]
|
61
|
Wu Y, Sun L, Sun X, Wang B. A hybrid XGBoost-ISSA-LSTM model for accurate short-term and long-term dissolved oxygen prediction in ponds. Environmental Science and Pollution Research International 2022; 29:18142-18159. [PMID: 34686955 DOI: 10.1007/s11356-021-17020-5] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 07/19/2021] [Accepted: 10/09/2021] [Indexed: 06/13/2023]
Abstract
Dissolved oxygen (DO) is one of the most critical indicators of water quality in ponds and greatly affects the healthy growth of aquatic organisms. To improve the prediction accuracy of DO and grasp its changing trends, a novel hybrid DO prediction model based on the long short-term memory network (LSTM) optimized by an improved sparrow search algorithm (ISSA) is proposed. Firstly, to discard redundant information and improve the calculation speed of the model, the key factors that have a greater correlation with DO are selected as the input parameters by extreme gradient boosting (XGBoost). Secondly, to expand the search range of the sparrows and balance global and local search, we introduce an adaptive-factor exponential declining strategy for producers and an arcsine decreasing strategy for scouters, which decreases nonlinearly as the number of iterations increases. In addition, we improve the position updating of scouters so that the sparrows gradually move to the best position. Finally, LSTM is optimized by ISSA to obtain the best initial weights and thresholds and construct an XGBoost-ISSA-LSTM DO prediction model. Specifically, we first analyze the method for water quality prediction, which can make short-term predictions (about 1 h and 2 h) and long-term predictions (about 12 h and 24 h) of DO. In the 1-h prediction, the root mean square error (RMSE) of the model is 0.5571, the mean absolute error (MAE) is 0.2572, and the R² is 0.9276. In the 24-h prediction, the RMSE is 0.6310, the MAE is 0.4562, and the R² is 0.9082. The experimental results show that the proposed model has better generalization performance and higher prediction accuracy than other common models. Therefore, the presented XGBoost-ISSA-LSTM model is more effective and could meet the practical demand for accurate DO prediction.
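The three error measures quoted in this abstract (RMSE, MAE, and the coefficient of determination R²) follow standard definitions, which the Python sketch below spells out. The function name and the sample values are illustrative assumptions and do not reproduce the paper's data.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (RMSE, MAE, R^2) for paired observed and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)           # residual sum of squares
    rmse = math.sqrt(ss_res / n)                  # root mean square error
    mae = sum(abs(e) for e in errors) / n         # mean absolute error
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                    # undefined if y_true is constant
    return rmse, mae, r2


if __name__ == "__main__":
    # Hypothetical observed and predicted dissolved-oxygen readings.
    observed = [6.1, 5.8, 6.4, 7.0]
    predicted = [6.0, 6.0, 6.3, 6.8]
    rmse, mae, r2 = regression_metrics(observed, predicted)
    print(f"RMSE={rmse:.4f} MAE={mae:.4f} R2={r2:.4f}")
```

Because RMSE squares each error before averaging, it is never smaller than MAE on the same data, which is consistent with the figures reported above.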
Affiliation(s)
- Yuhan Wu
- National Innovation Center for Digital Fishery, China Agricultural University, 17 Tsinghua East Road, P. O. Box 121, Beijing, 100083, People's Republic of China
- Precision Agricultural Technology Integration Research Base (Fishery), Ministry of Agriculture and Rural Affairs, Beijing, 100083, China
- College of Information and Electrical Engineering, China Agricultural University, Beijing, 100083, China
- Longqing Sun
- National Innovation Center for Digital Fishery, China Agricultural University, 17 Tsinghua East Road, P. O. Box 121, Beijing, 100083, People's Republic of China
- Precision Agricultural Technology Integration Research Base (Fishery), Ministry of Agriculture and Rural Affairs, Beijing, 100083, China
- College of Information and Electrical Engineering, China Agricultural University, Beijing, 100083, China
- Xibei Sun
- National Innovation Center for Digital Fishery, China Agricultural University, 17 Tsinghua East Road, P. O. Box 121, Beijing, 100083, People's Republic of China
- Precision Agricultural Technology Integration Research Base (Fishery), Ministry of Agriculture and Rural Affairs, Beijing, 100083, China
- College of Information and Electrical Engineering, China Agricultural University, Beijing, 100083, China
- Boning Wang
- National Innovation Center for Digital Fishery, China Agricultural University, 17 Tsinghua East Road, P. O. Box 121, Beijing, 100083, People's Republic of China
- Precision Agricultural Technology Integration Research Base (Fishery), Ministry of Agriculture and Rural Affairs, Beijing, 100083, China
- College of Information and Electrical Engineering, China Agricultural University, Beijing, 100083, China
|
62
|
Industrial Internet of Things for Condition Monitoring and Diagnosis of Dry Vacuum Pumps in Atomic Layer Deposition Equipment. Electronics 2022. [DOI: 10.3390/electronics11030375] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 11/16/2022]
Abstract
In the modern semiconductor industry, process miniaturization means that small, unexpected variations can cause defective products. Managing the condition of each part is an effective way of preventing unexpected errors. The industrial internet of things (IIoT) environment, which can monitor and analyze the performance degradation of parts that affect process results, enables advanced process yield management. This paper presents the results of constructing a data monitoring and diagnostic system based on the IIoT concept. The process of pump vibration data acquisition is explained to evaluate the effectiveness of this system. The target process is deposition. The purpose of the system is to detect degradation of pumps caused by by-products of the atomic layer deposition (ALD) process. The system consists of three areas: a data acquisition unit using six vibration sensors, a Web access-based monitoring unit that can monitor vibration data, and an Azure platform that searches for outliers in vibration data.
|
63
|
Guo Y, Ju Y, Chen D, Wang L. Research on the Computational Prediction of Essential Genes. Front Cell Dev Biol 2021; 9:803608. [PMID: 34938741 PMCID: PMC8685449 DOI: 10.3389/fcell.2021.803608] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/28/2021] [Accepted: 11/22/2021] [Indexed: 11/19/2022] Open
Abstract
Genes, the nucleotide sequences that encode a polypeptide chain or functional RNA, are the basic genetic unit controlling biological traits. They are the guarantee of the basic structures and functions in organisms, and they store information related to biological factors and processes such as blood type, gestation, growth, and apoptosis. The environment and genetics jointly affect important physiological processes such as reproduction, cell division, and protein synthesis. Genes are related to a wide range of phenomena including growth, decline, illness, aging, and death. During the evolution of organisms, there is a class of genes that exist in a conserved form in multiple species. These genes are often located on the dominant strand of DNA and tend to have higher expression levels. The proteins they encode usually either perform very important functions or are responsible for maintaining and repairing these essential functions. Such genes are called persistent genes. Among them, those that are irreplaceable for the organism’s life activities are the essential genes. For example, when starch is the only source of energy, the genes related to starch digestion are essential genes. Without them, the organism will die because it cannot obtain enough energy to maintain basic functions. The function of the proteins encoded by these genes is thought to be fundamental to life. Nowadays, DNA can be extracted from blood, saliva, or tissue cells for genetic testing, and detailed genetic information can be obtained using the most advanced scientific instruments and technologies. The information gained from genetic testing is useful for assessing the potential risks of disease and for helping determine the prognosis and development of diseases. Such information is also useful for developing personalized medication and providing targeted health guidance to improve the quality of life. Therefore, it is of great theoretical and practical significance to identify important and essential genes.
In this paper, the research status of essential genes and the essential genome database of bacteria are reviewed, the computational prediction method of essential genes based on communication coding theory is expounded, and the significance and practical application value of essential genes are discussed.
Affiliation(s)
- Yuxin Guo
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China.,Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China.,Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China.,School of Mathematics and Statistics, Hainan Normal University, Haikou, China
| | - Ying Ju
- School of Informatics, Xiamen University, Xiamen, China
| | - Dong Chen
- College of Electrical and Information Engineering, Quzhou University, Quzhou, China
| | - Lihong Wang
- Beidahuang Industry Group General Hospital, Harbin, China
| |
|
64
|
Maruf FA, Pratama R, Song G. DNN-Boost: Somatic mutation identification of tumor-only whole-exome sequencing data using deep neural network and XGBoost. J Bioinform Comput Biol 2021; 19:2140017. [PMID: 34895111 DOI: 10.1142/s0219720021400175] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 01/09/2023]
Abstract
Detection of somatic mutation in whole-exome sequencing data can help elucidate the mechanism of tumor progression. Most computational approaches require exome sequencing for both tumor and normal samples. However, it is more common to sequence exomes for tumor samples only without the paired normal samples. To include these types of data for extensive studies on the process of tumorigenesis, it is necessary to develop an approach for identifying somatic mutations using tumor exome sequencing data only. In this study, we designed a machine learning approach using Deep Neural Network (DNN) and XGBoost to identify somatic mutations in tumor-only exome sequencing data and we integrated this into a pipeline called DNN-Boost. The XGBoost algorithm is used to extract the features from the results of variant callers and these features are then fed into the DNN model as input. The XGBoost algorithm resolves issues of missing values and overfitting. We evaluated our proposed model and compared its performance with other existing benchmark methods. We noted that the DNN-Boost classification model outperformed the benchmark method in classifying somatic mutations from paired tumor-normal exome data and tumor-only exome data.
Affiliation(s)
- Firda Aminy Maruf
- School of Computer Science and Engineering, Pusan National University, 63 Busandaehak-Ro, Busan 46241, Republic of Korea
| | - Rian Pratama
- School of Computer Science and Engineering, Pusan National University, 63 Busandaehak-Ro, Busan 46241, Republic of Korea
| | - Giltae Song
- School of Computer Science and Engineering, Pusan National University, 63 Busandaehak-Ro, Busan 46241, Republic of Korea
| |
|
65
|
O'Neil LJ, Hu P, Liu Q, Islam MM, Spicer V, Rech J, Hueber A, Anaparti V, Smolik I, El-Gabalawy HS, Schett G, Wilkins JA. Proteomic Approaches to Defining Remission and the Risk of Relapse in Rheumatoid Arthritis. Front Immunol 2021; 12:729681. [PMID: 34867950 PMCID: PMC8636686 DOI: 10.3389/fimmu.2021.729681] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 06/23/2021] [Accepted: 10/20/2021] [Indexed: 12/29/2022] Open
Abstract
Objectives: Patients with Rheumatoid Arthritis (RA) are increasingly achieving stable disease remission, yet the mechanisms that govern ongoing clinical disease and subsequent risk of future flare are not well understood. We sought to identify serum proteomic alterations that dictate clinically important features of stable RA, and to couple broad-based proteomics with machine learning to predict future flare. Methods: We studied baseline serum samples from a cohort of stable RA patients (RETRO, n = 130) in clinical remission (DAS28<2.6) and quantified 1307 serum proteins using the SOMAscan platform. Unsupervised hierarchical clustering and supervised classification were applied to identify proteomic-driven clusters and model biomarkers that were associated with future disease flare after 12 months of follow-up and RA medication withdrawal. Network analysis was used to define pathways that were enriched in proteomic datasets. Results: We defined 4 proteomic clusters, with one cluster (Cluster 4) displaying a lower mean DAS28 score (p = 0.03), and with DAS28 associating with humoral immune responses and complement activation. Clustering did not clearly predict future risk of flare; however, an XGBoost machine learning algorithm classified patients who relapsed with an AUC (area under the receiver operating characteristic curve) of 0.80 using only baseline serum proteomics. Conclusions: The serum proteome provides a rich dataset to understand stable RA and its clinical heterogeneity. Combining proteomics and machine learning may enable prediction of future RA disease flare in patients with RA who aim to withdraw therapy.
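To make the AUC figure quoted in this abstract concrete, the following is a minimal sketch of how an area under the ROC curve can be computed from predicted scores. It is not the authors' code: the function name `roc_auc` and the toy labels and scores are assumptions for illustration, and the pairwise definition used here (the probability that a randomly chosen positive case scores higher than a randomly chosen negative one) is the standard rank-based reading of AUC.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical, not from the study): 1 = flare, 0 = no flare.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(toy_labels, toy_scores))  # 0.75
```

Under this reading, an AUC of 0.80 means that the model ranks a relapsing patient above a non-relapsing one in about four out of five such pairs.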
Affiliation(s)
- Liam J O'Neil
- Section of Rheumatology, Department of Internal Medicine, University of Manitoba, Winnipeg, MB, Canada.,Manitoba Centre for Proteomics and Systems Biology, University of Manitoba and Health Sciences Centre, Winnipeg, MB, Canada
| | - Pingzhao Hu
- Department of Biochemistry and Medical Genetics, University of Manitoba, Winnipeg, MB, Canada.,Department of Computer Science, University of Manitoba, Winnipeg, MB, Canada
| | - Qian Liu
- Department of Biochemistry and Medical Genetics, University of Manitoba, Winnipeg, MB, Canada.,Department of Computer Science, University of Manitoba, Winnipeg, MB, Canada
| | - Md Mohaiminul Islam
- Department of Biochemistry and Medical Genetics, University of Manitoba, Winnipeg, MB, Canada.,Department of Computer Science, University of Manitoba, Winnipeg, MB, Canada
| | - Victor Spicer
- Manitoba Centre for Proteomics and Systems Biology, University of Manitoba and Health Sciences Centre, Winnipeg, MB, Canada
| | - Juergen Rech
- Department of Medicine, Friedrich-Alexander University Erlangen-Nuernberg and Universitaetsklinikum Erlangen, Erlangen, Germany
| | - Axel Hueber
- Department of Medicine, Friedrich-Alexander University Erlangen-Nuernberg and Universitaetsklinikum Erlangen, Erlangen, Germany
| | - Vidyanand Anaparti
- Manitoba Centre for Proteomics and Systems Biology, University of Manitoba and Health Sciences Centre, Winnipeg, MB, Canada
| | - Irene Smolik
- Section of Rheumatology, Department of Internal Medicine, University of Manitoba, Winnipeg, MB, Canada
| | - Hani S El-Gabalawy
- Section of Rheumatology, Department of Internal Medicine, University of Manitoba, Winnipeg, MB, Canada.,Manitoba Centre for Proteomics and Systems Biology, University of Manitoba and Health Sciences Centre, Winnipeg, MB, Canada
| | - Georg Schett
- Department of Medicine, Friedrich-Alexander University Erlangen-Nuernberg and Universitaetsklinikum Erlangen, Erlangen, Germany
| | - John A Wilkins
- Section of Rheumatology, Department of Internal Medicine, University of Manitoba, Winnipeg, MB, Canada.,Manitoba Centre for Proteomics and Systems Biology, University of Manitoba and Health Sciences Centre, Winnipeg, MB, Canada
| |
|
66
|
Jiang F, Ma J. A comprehensive study of macro factors related to traffic fatality rates by XGBoost-based model and GIS techniques. ACCIDENT; ANALYSIS AND PREVENTION 2021; 163:106431. [PMID: 34758411 DOI: 10.1016/j.aap.2021.106431] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/13/2020] [Revised: 07/09/2021] [Accepted: 09/30/2021] [Indexed: 06/13/2023]
Abstract
With rapid economic development, road safety is becoming a serious problem. Exploring macro factors is an effective way to improve road safety. However, existing studies have some limitations: (1) they considered only one aspect of macro factors and constructed models based on a few data samples; (2) the methods commonly used cannot address non-linear relationships or calculate feature importance. The findings obtained from such models may be limited and biased. To address these limitations, this study proposes a BO-CV-XGBoost framework to explore the macro factors related to traffic fatality rate classes based on a high-dimensional dataset that fully considers the impact of multi-factor interaction with adequate data samples. The proposed framework is applied to a dataset from the US. A total of 453 county-level macro factors are collected from various data sources, covering ten macro aspects, including topography, transportation, etc. The optimized BO-CV-XGBoost model obtains the best classification performance, with an AUC of 0.8977 and an accuracy of 85.02%. Compared with other methods, the proposed model performs better at fatality rate classification. Ten macro factors are identified, including 'Current-dollar GDP', 'highway miles per person', etc. The ten factors cover four aspects of information: economics, transportation, education, and medical condition. Geographic information system (GIS) techniques are further used for spatial analysis of the identified macro factors. Targeted and effective measures are accordingly proposed to prevent traffic fatalities and improve road safety.
Affiliation(s)
- Feifeng Jiang
- Faculty of Architecture, The University of Hong Kong, Hong Kong, China
| | - Jun Ma
- Department of Urban Planning and Design, The University of Hong Kong, Hong Kong, China.
| |
|
67
|
Zhang Y, Jiang Z, Chen C, Wei Q, Gu H, Yu B. DeepStack-DTIs: Predicting Drug-Target Interactions Using LightGBM Feature Selection and Deep-Stacked Ensemble Classifier. Interdiscip Sci 2021; 14:311-330. [PMID: 34731411 DOI: 10.1007/s12539-021-00488-7] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 06/26/2021] [Revised: 10/19/2021] [Accepted: 10/21/2021] [Indexed: 12/12/2022]
Abstract
Accurate prediction of drug-target interactions (DTIs), which is often used in the fields of drug discovery and drug repositioning, is regarded as a key challenge in the study of drug science. In this paper, a new method called DeepStack-DTIs is proposed to predict DTIs. First, for the target protein, pseudo-position specific score matrix, pseudo amino acid composition and SPIDER3 are used to extract the different feature information of the target protein. Meanwhile, the path-based fingerprint features of each drug are extracted. Then, the synthetic minority oversampling technique (SMOTE) and light gradient boosting machine (LightGBM) are used for data balancing and feature selection, respectively. Finally, the processed features are input to the deep-stacked ensemble classifier composed of gated recurrent unit (GRU), deep neural network (DNN), support vector machine (SVM), eXtreme gradient boosting (XGBoost) and logistic regression (LR) to predict DTIs. Under five-fold cross-validation and compared with existing methods, the proposed method achieves higher prediction accuracy on the gold standard dataset. To evaluate the predictive power of DeepStack-DTIs, we validate the method on another dataset and predict the drug-target interaction network. The results indicate that DeepStack-DTIs has better predictive ability than the other methods, and provides novel insights for the prediction of DTIs. A novel method, DeepStack-DTIs, for drug-target interaction prediction. PsePSSM, PseAAC, SPIDER3 and FP2 are fused to convert protein sequence and drug molecule information into digital information, respectively. The SMOTE algorithm is used to balance the dataset and the LightGBM feature selection algorithm is employed to remove redundant and irrelevant features to select the optimal feature subset. This optimal feature subset is input into the deep-stacked ensemble classifier to predict drug-target interactions. The experimental results show that the DeepStack-DTIs method can significantly improve the prediction accuracy of drug-target interactions.
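The data-balancing step named in this abstract, SMOTE, can be sketched in a few lines. What follows is a simplified illustration under stated assumptions, not the authors' pipeline and not the API of any particular library: the function name `smote`, its parameters `n_new`, `k` and `seed`, and the toy points are all hypothetical. The idea it shows is the core of the technique: each synthetic sample is placed at a random position on the line segment between a minority-class sample and one of its nearest minority-class neighbours.

```python
import math
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Nearest minority neighbours of the chosen sample, excluding itself.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical 2-D minority-class feature vectors.
minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
for point in smote(minority_points, n_new=3):
    print(point)
```

Because every synthetic point is an interpolation of two real minority samples, oversampling of this kind should be applied to the training folds only; otherwise near-copies of test samples can leak into training.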
Affiliation(s)
- Yan Zhang
- College of Mechanical and Electrical Engineering, Qingdao University of Science and Technology, Qingdao, 266061, China.,College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China.,Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Zhiwen Jiang
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China.,Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Cheng Chen
- School of Computer Science and Technology, Shandong University, Qingdao, 266237, China
| | - Qinqin Wei
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China.,Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Haiming Gu
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China.,Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Bin Yu
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China.,Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China.,Key Laboratory of Computational Science and Application of Hainan Province, Haikou, 571158, China.
| |
|
68
|
Noh B, Yoon H, Youm C, Kim S, Lee M, Park H, Kim B, Choi H, Noh Y. Prediction of Decline in Global Cognitive Function Using Machine Learning with Feature Ranking of Gait and Physical Fitness Outcomes in Older Adults. INTERNATIONAL JOURNAL OF ENVIRONMENTAL RESEARCH AND PUBLIC HEALTH 2021; 18:ijerph182111347. [PMID: 34769864 PMCID: PMC8582857 DOI: 10.3390/ijerph182111347] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 09/27/2021] [Revised: 10/26/2021] [Accepted: 10/27/2021] [Indexed: 11/30/2022]
Abstract
Gait and physical fitness are related to cognitive function. A decrease in motor function and physical fitness can serve as an indicator of declining global cognitive function in older adults. This study aims to use machine learning (ML) to identify important features of gait and physical fitness to predict a decline in global cognitive function in older adults. A total of three hundred and six participants aged seventy-five years or older were included in the study, and their gait performance at various speeds and physical fitness were evaluated. Eight ML models were applied to data ranked by the p-value (LP) of linear regression and the importance gain (XI) of XGBoost. Five optimal features were selected using elastic net on the LP data for men, and twenty optimal features were selected using support vector machine on the XI data for women. Thus, the important features for predicting a potential decline in global cognitive function in older adults were successfully identified. The proposed ML approach could inspire future studies on the early detection and prevention of cognitive function decline in older adults.
Affiliation(s)
- Byungjoo Noh
- Department of Kinesiology, Jeju National University, Jeju 63243, Korea;
| | - Hyemin Yoon
- Department of Management Information Systems, Dong-A University, Busan 49315, Korea; (H.Y.); (Y.N.)
| | - Changhong Youm
- Department of Health Sciences, The Graduate School of Dong-A University, Busan 49315, Korea; (H.P.); (B.K.); (H.C.)
- Correspondence: (C.Y.); (S.K.); Tel.: +82-51-200-7830 (C.Y.); +82-05-200-7484 (S.K.); Fax: +82-51-200-7505 (C.Y.)
| | - Sangjin Kim
- Department of Management Information Systems, Dong-A University, Busan 49315, Korea; (H.Y.); (Y.N.)
- Correspondence: (C.Y.); (S.K.); Tel.: +82-51-200-7830 (C.Y.); +82-05-200-7484 (S.K.); Fax: +82-51-200-7505 (C.Y.)
| | - Myeounggon Lee
- Department of Health and Human Performance, University of Houston, Houston, TX 77004, USA;
| | - Hwayoung Park
- Department of Health Sciences, The Graduate School of Dong-A University, Busan 49315, Korea; (H.P.); (B.K.); (H.C.)
| | - Bohyun Kim
- Department of Health Sciences, The Graduate School of Dong-A University, Busan 49315, Korea; (H.P.); (B.K.); (H.C.)
| | - Hyejin Choi
- Department of Health Sciences, The Graduate School of Dong-A University, Busan 49315, Korea; (H.P.); (B.K.); (H.C.)
| | - Yoonjae Noh
- Department of Management Information Systems, Dong-A University, Busan 49315, Korea; (H.Y.); (Y.N.)
| |
|
69
|
Pan J, Li LP, Yu CQ, You ZH, Guan YJ, Ren ZH. Sequence-Based Prediction of Plant Protein-Protein Interactions by Combining Discrete Sine Transformation With Rotation Forest. Evol Bioinform Online 2021; 17:11769343211050067. [PMID: 34671178 PMCID: PMC8521741 DOI: 10.1177/11769343211050067] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 06/26/2021] [Accepted: 09/13/2021] [Indexed: 11/24/2022] Open
Abstract
Protein-protein interactions (PPIs) in plants are essential for understanding the regulation of biological processes. Although high-throughput technologies have been widely used to identify PPIs, they are usually laborious, expensive, and suffer from high false-positive rates. Therefore, it is imperative to develop novel computational approaches as a supplementary tool to detect PPIs in plants. In this work, we present a method, namely DST-RoF, to identify PPIs in plants by combining an ensemble learning classifier, Rotation Forest (RoF), with discrete sine transformation (DST). Specifically, each plant protein sequence is first converted into a Position-Specific Scoring Matrix (PSSM). Then, the discrete sine transformation is employed to extract effective features that capture the evolutionary information of proteins. Finally, these optimal features are fed into the RoF classifier for training and prediction. When applied to the plant datasets Arabidopsis, Rice, and Maize, DST-RoF yielded high prediction accuracies of 82.95%, 88.82%, and 93.70%, respectively. To further evaluate the prediction ability of our approach, we compared it with 4 state-of-the-art classifiers and 3 different feature extraction methods. Comprehensive experimental results indicated that our method is feasible and robust for predicting potential interacting plant protein pairs.
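The feature-extraction idea in this abstract, applying a discrete sine transformation to a PSSM, can be sketched as follows. This is an illustration under assumptions rather than the DST-RoF implementation: the abstract does not say which DST variant is used or how coefficients are selected, so the type-I transform, the function names `dst1` and `dst_features`, the `keep` parameter, and the tiny 3 x 2 matrix (a real PSSM has 20 columns) are all hypothetical choices. The point of the sketch is that keeping a fixed number of low-frequency coefficients per column gives a fixed-length vector regardless of sequence length.

```python
import math


def dst1(x):
    """Type-I discrete sine transform of a 1-D sequence (unnormalised)."""
    n = len(x)
    return [
        sum(x[j] * math.sin(math.pi * (j + 1) * (k + 1) / (n + 1)) for j in range(n))
        for k in range(n)
    ]


def dst_features(pssm, keep=2):
    """Transform each PSSM column along the sequence axis and keep the first
    `keep` coefficients, so the output length is keep * number_of_columns."""
    if len(pssm) < keep:
        raise ValueError("sequence is shorter than the number of coefficients kept")
    n_cols = len(pssm[0])
    features = []
    for c in range(n_cols):
        column = [row[c] for row in pssm]
        features.extend(dst1(column)[:keep])
    return features


# Hypothetical miniature "PSSM": 3 residues x 2 columns.
toy_pssm = [[1.0, -2.0], [0.0, 3.0], [2.0, 1.0]]
print(dst_features(toy_pssm, keep=2))
```

The resulting vectors for the two proteins of a candidate pair would then typically be concatenated before being passed to a classifier such as Rotation Forest.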
Affiliation(s)
- Jie Pan
- College of Information Engineering, Xijing University, Xi'an, China
| | - Li-Ping Li
- College of Information Engineering, Xijing University, Xi'an, China
| | - Chang-Qing Yu
- College of Information Engineering, Xijing University, Xi'an, China
| | - Zhu-Hong You
- College of Information Engineering, Xijing University, Xi'an, China
| | - Yong-Jian Guan
- College of Information Engineering, Xijing University, Xi'an, China
| | - Zhong-Hao Ren
- College of Information Engineering, Xijing University, Xi'an, China
| |
|
70
|
Prediction for understanding the effectiveness of antiviral peptides. Comput Biol Chem 2021; 95:107588. [PMID: 34655913 DOI: 10.1016/j.compbiolchem.2021.107588] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 07/12/2021] [Revised: 10/01/2021] [Accepted: 10/02/2021] [Indexed: 11/20/2022]
Abstract
The low efficacy of current antivirals, in conjunction with the resistance of viruses to existing antiviral drugs, has resulted in demand for the development of novel antiviral agents. Antiviral peptides (AVPs) are bioactive peptides with virucidal activity, and they can be developed into promising antiviral drugs. They are short peptides that can halt the progression of viral infections. The use of antiviral peptides in therapeutics has recently attracted the attention of the research community. The development and identification of AVPs is imperative for the discovery of novel therapeutics for viral infections. In the present work, a meta-classifier (stacking) based approach is implemented for the prediction of IC50 (half maximal inhibitory concentration) and pIC50 (negative log of half maximal inhibitory concentration) values. The best prediction model, with evolutionary information and local alignment scores as features, achieved correlation coefficients of 0.670 and 0.753 on the training and testing sets, respectively, for IC50. Further, the prediction of pIC50 reached correlation coefficients of 0.797 and 0.789 for the training and testing sets, respectively. For the development of machine learning models involved in the prediction of IC50, the use of pIC50 over IC50 is recommended as the target variable. Furthermore, a systematic comparison of AVPs with high IC50 values and low IC50 values revealed that higher mean charge and tiny amino acids are preferred, and that higher length and consecutive hydrophilic amino acids are avoided, in the former.
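The pIC50 target recommended in this abstract is a simple transformation of IC50, and a short sketch makes the relationship explicit. The function name `pic50_from_ic50` and the unit table are assumptions for illustration; the abstract does not state which concentration unit the authors used, so the unit is left as an explicit argument. The conversion itself is the standard definition, pIC50 = -log10(IC50 expressed in mol/L).

```python
import math

# Conversion factors from common concentration units to mol/L.
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9}


def pic50_from_ic50(ic50, unit="uM"):
    """Return pIC50 = -log10(IC50 in mol/L) for a positive IC50 value."""
    if ic50 <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50 * UNIT_TO_MOLAR[unit])


# Hypothetical IC50 values spanning three orders of magnitude.
for value in (0.1, 1.0, 10.0, 100.0):
    print(value, "uM ->", round(pic50_from_ic50(value, "uM"), 2))
```

Since IC50 values often span several orders of magnitude, the log scale compresses the range of the regression target, which is one plausible reason a model fitted to pIC50 correlates better than one fitted to raw IC50.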
|
71
|
Mahapatra S, Sahu SS. ANOVA-particle swarm optimization-based feature selection and gradient boosting machine classifier for improved protein-protein interaction prediction. Proteins 2021; 90:443-454. [PMID: 34528291 DOI: 10.1002/prot.26236] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/25/2020] [Revised: 08/09/2021] [Accepted: 09/03/2021] [Indexed: 01/22/2023]
Abstract
Feature fusion and selection strategies have been applied to improve accuracy in the prediction of protein-protein interaction (PPI). In this paper, an embedded feature selection framework, termed AVPSO, is developed by integrating a cost function based on analysis of variance (ANOVA) with particle swarm optimization (PSO). Initially, the features of the protein sequences extracted using pseudo-amino acid composition (PseAAC), conjoint triad composition, and local descriptor are fused. Then, AVPSO is employed to select the optimal set of features. The light gradient boosting machine (LGBM) classifier is used to predict the PPIs using the optimal feature subset. In five-fold cross-validation analysis, the proposed model (AVPSO-LGBM) achieved average accuracies of 97.12% and 95.09%, respectively, on the intraspecies PPI datasets Saccharomyces cerevisiae and Helicobacter pylori. On the interspecies PPI datasets Human-Bacillus and Human-Yersinia, average accuracies of 95.20% and 93.44% are achieved. Results obtained on independent test datasets and network datasets show that the prediction accuracy of AVPSO-LGBM is better than that of the existing methods, demonstrating its generalization ability. The improved prediction performance obtained by the proposed model makes it a reliable and effective PPI prediction model.
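The ANOVA ingredient of the feature selection described here can be illustrated on its own. The sketch below covers only a per-feature one-way ANOVA F score and a ranking built from it; the abstract does not specify how the authors combine this score with particle swarm optimization into a cost function, so that part is deliberately omitted. The function names `anova_f` and `rank_features` and the toy matrix are hypothetical.

```python
def anova_f(values, labels):
    """One-way ANOVA F statistic of a single feature across class labels."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two classes and more samples than classes")
    grand_mean = sum(values) / n
    means = {y: sum(g) / len(g) for y, g in groups.items()}
    ss_between = sum(len(g) * (means[y] - grand_mean) ** 2 for y, g in groups.items())
    ss_within = sum((v - means[y]) ** 2 for y, g in groups.items() for v in g)
    if ss_within == 0:
        # No within-class spread: perfectly separating, or a constant feature.
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def rank_features(X, y):
    """Feature indices sorted from highest to lowest F score."""
    n_features = len(X[0])
    scores = [anova_f([row[j] for row in X], y) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


# Hypothetical data: feature 0 separates the classes, feature 1 does not.
X_toy = [[1, 5], [2, 1], [3, 4], [4, 2], [5, 6], [6, 3]]
y_toy = [0, 0, 0, 1, 1, 1]
print(rank_features(X_toy, y_toy))  # [0, 1]
```

A high F score means the class means of a feature differ by much more than its within-class scatter, which is why such a score is a natural term in a selection cost function.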
Affiliation(s)
- Satyajit Mahapatra
- Department of Electronics and Communication Engineering, Birla Institute of Technology, Ranchi, India
| | - Sitanshu Sekhar Sahu
- Department of Electronics and Communication Engineering, Birla Institute of Technology, Ranchi, India
| |
|
72
|
BERT-m7G: A Transformer Architecture Based on BERT and Stacking Ensemble to Identify RNA N7-Methylguanosine Sites from Sequence Information. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2021; 2021:7764764. [PMID: 34484416 PMCID: PMC8413034 DOI: 10.1155/2021/7764764] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 07/16/2021] [Accepted: 08/13/2021] [Indexed: 01/19/2023]
Abstract
As one of the most prevalent posttranscriptional modifications of RNA, N7-methylguanosine (m7G) plays an essential role in the regulation of gene expression. Accurate identification of m7G sites in the transcriptome is invaluable for better revealing their potential functional mechanisms. Although high-throughput experimental methods can locate m7G sites precisely, they are expensive and time-consuming. Hence, it is imperative to design an efficient computational method that can accurately identify m7G sites. In this study, we propose a novel method that incorporates a BERT-based multilingual model in bioinformatics to represent the information of RNA sequences. Firstly, we treat RNA sequences as natural sentences and then employ the bidirectional encoder representations from transformers (BERT) model to transform them into fixed-length numerical matrices. Secondly, a feature selection scheme based on the elastic net method is constructed to eliminate redundant features and retain important features. Finally, the selected feature subset is input into a stacking ensemble classifier to predict m7G sites, and the hyperparameters of the classifier are tuned with the tree-structured Parzen estimator (TPE) approach. By 10-fold cross-validation, the performance of BERT-m7G is measured with an ACC of 95.48% and an MCC of 0.9100. The experimental results indicate that the proposed method significantly outperforms state-of-the-art prediction methods in the identification of m7G modifications.
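The two metrics reported in this abstract, ACC and MCC, are both functions of the binary confusion matrix, and a short sketch shows how they relate. The function names and the counts used below are hypothetical and are not taken from the paper; the formulas are the standard definitions of accuracy and the Matthews correlation coefficient.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, ranging from -1 to 1."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


# Hypothetical confusion-matrix counts for a balanced binary task.
counts = dict(tp=90, tn=85, fp=15, fn=10)
print(accuracy(**counts))       # 0.875
print(round(mcc(**counts), 4))  # 0.7509
```

Unlike accuracy, MCC uses all four cells of the confusion matrix symmetrically, so it stays informative when the two classes are imbalanced; this is why the two numbers are usually reported together.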
|
73
|
Shi R, Xu X, Li J, Li Y. Prediction and analysis of train arrival delay based on XGBoost and Bayesian optimization. Appl Soft Comput 2021. [DOI: 10.1016/j.asoc.2021.107538] [Citation(s) in RCA: 27] [Impact Index Per Article: 9.0] [Indexed: 11/16/2022]
|
74
|
Kang EM, Ryu IH, Lee G, Kim JK, Lee IS, Jeon GH, Song H, Kamiya K, Yoo TK. Development of a Web-Based Ensemble Machine Learning Application to Select the Optimal Size of Posterior Chamber Phakic Intraocular Lens. Transl Vis Sci Technol 2021; 10:5. [PMID: 34111253 PMCID: PMC8107636 DOI: 10.1167/tvst.10.6.5] [Citation(s) in RCA: 22] [Impact Index Per Article: 7.3] [Indexed: 02/01/2023] Open
Abstract
Purpose: Selecting the optimal lens size by predicting the postoperative vault can reduce complications after implantation of an implantable collamer lens with a central hole (ICL with KS-aquaport). We built a web-based machine learning application that incorporated clinical measurements to predict the postoperative ICL vault and select the optimal ICL size. Methods: We applied the stacking ensemble technique based on eXtreme Gradient Boosting (XGBoost) and a light gradient boosting machine to pre-operative ocular data from two eye centers to predict the postoperative vault. We assigned the Korean patient data to training (N = 2756 eyes) and internal validation (N = 693 eyes) datasets (prospective validation). Japanese patient data (N = 290 eyes) were used as an independent external dataset from different centers to validate the model. Results: We developed an ensemble model that showed statistically better performance, with a lower mean absolute error for ICL vault prediction (106.88 µm and 143.69 µm in the internal and external validation, respectively), than the other machine learning techniques and the classic ICL sizing methods when applied to both validation datasets. Considering the lens size selection accuracy, our proposed method showed the best performance for both reference datasets (75.9% and 67.4% in the internal and external validation, respectively). Conclusions: Applying the ensemble approach to a large dataset of patients who underwent ICL implantation resulted in a more accurate prediction of vault size and selection of the optimal ICL size. Translational Relevance: We developed a web-based application for ICL sizing to facilitate the use of machine learning calculators by clinicians.
Affiliation(s)
| | - Ik Hee Ryu
- B&VIIT Eye Center, Seoul, South Korea.,VISUWORKS, Seoul, South Korea
| | | | - Jin Kuk Kim
- B&VIIT Eye Center, Seoul, South Korea.,VISUWORKS, Seoul, South Korea
| | | | - Ga Hee Jeon
- B&VIIT Eye Center, Seoul, South Korea.,VISUWORKS, Seoul, South Korea
| | - Hojin Song
- B&VIIT Eye Center, Seoul, South Korea.,VISUWORKS, Seoul, South Korea
| | - Kazutaka Kamiya
- Visual Physiology, School of Allied Health Sciences, Kitasato University, Kanagawa, Japan
| | - Tae Keun Yoo
- B&VIIT Eye Center, Seoul, South Korea.,Department of Ophthalmology, Aerospace Medical Center, Republic of Korea Air Force, Cheongju, South Korea
| |
|
75
|
Liu Y, Jin S, Song L, Han Y, Yu B. Prediction of protein ubiquitination sites via multi-view features based on eXtreme gradient boosting classifier. J Mol Graph Model 2021; 107:107962. [PMID: 34198216 DOI: 10.1016/j.jmgm.2021.107962] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 12/04/2020] [Revised: 05/03/2021] [Accepted: 06/02/2021] [Indexed: 01/29/2023]
Abstract
Ubiquitination is a common and reversible post-translational protein modification that regulates apoptosis and plays an important role in protein degradation and cell diseases. However, experimental identification of protein ubiquitination sites is usually time-consuming and labor-intensive, so it is necessary to establish effective predictors. In this study, we propose a ubiquitination site prediction method based on multi-view features, namely UbiSite-XGBoost. Firstly, we use seven single-view feature encoding methods to convert protein sequence fragments into digital information. Secondly, the least absolute shrinkage and selection operator (LASSO) is applied to remove redundant information and obtain the optimal feature subsets. Finally, these features are input into the eXtreme gradient boosting (XGBoost) classifier to predict ubiquitination sites. Five-fold cross-validation shows that the AUC values on the Set1-Set6 datasets are 0.8258, 0.7592, 0.7853, 0.8345, 0.8979 and 0.8901, respectively. The synthetic minority oversampling technique (SMOTE) is employed on the unbalanced Set4-Set6 datasets, and the AUC values are 0.9777, 0.9782 and 0.9860, respectively. In addition, we have constructed three independent test datasets, for which the AUC values are 0.8007, 0.6897 and 0.7280, respectively. The results show that the proposed method UbiSite-XGBoost is superior to other ubiquitination prediction methods and that it provides new guidance for the identification of ubiquitination sites. The source code and all datasets are available at https://github.com/QUST-AIBBDRC/UbiSite-XGBoost/.
Affiliation(s)
- Yushuang Liu
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Shuping Jin
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Lili Song
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Yu Han
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Bin Yu
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China; Key Laboratory of Computational Science and Application of Hainan Province, Haikou, 571158, China.
| |
Collapse
|
76
Kaushik M, Chandra Joshi R, Kushwah AS, Gupta MK, Banerjee M, Burget R, Dutta MK. Cytokine gene variants and socio-demographic characteristics as predictors of cervical cancer: A machine learning approach. Comput Biol Med 2021; 134:104559. [PMID: 34147008 DOI: 10.1016/j.compbiomed.2021.104559] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 03/14/2021] [Revised: 05/30/2021] [Accepted: 06/04/2021] [Indexed: 01/03/2023]
Abstract
Cervical cancer is still one of the most prevalent cancers in women and a significant cause of mortality. Cytokine gene variants and socio-demographic characteristics have been reported as biomarkers for determining cervical cancer risk in the Indian population. This study was designed to apply a machine learning-based model using these risk factors for better prognosis and prediction of cervical cancer. The dataset includes cytokine gene variants and clinical and socio-demographic characteristics of normal healthy control subjects and cervical cancer cases. Different risk factors, including demographic details and cytokine gene variants, were analysed using different machine learning approaches. Various statistical parameters were used for evaluating the proposed method. After multi-step data processing and random splitting of the dataset, machine learning methods were applied and evaluated with 5-fold cross-validation and also tested on the unseen data records of a collected dataset for proper evaluation and analysis. The proposed approaches were verified after analysing various performance metrics. The logistic regression technique achieved the highest average accuracy of 82.25% and the highest average F1-score of 82.58% among all the methods. Ridge classifiers and the Gaussian Naïve Bayes classifier achieved the highest sensitivity (85%). The ridge classifier surpasses most of the machine learning classifiers with 84.78% accuracy and 97.83% sensitivity. The risk factors analysed in this study can be taken as biomarkers in developing a cervical cancer diagnosis system. The outcomes demonstrate that the machine learning assisted analysis of cytokine gene variants and socio-demographic characteristics can be utilised effectively for predicting the risk of developing cervical cancer.
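The accuracy, sensitivity and F1-score figures quoted above all derive from the counts in a binary confusion matrix. The following sketch shows the standard definitions; the function name `classification_metrics` and the counts used in the example are hypothetical, since the abstract does not report the underlying confusion matrices.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)   # recall: share of true cases detected
    specificity = tn / (tn + fp)   # share of controls correctly rejected
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Made-up counts for illustration only (not taken from the study).
metrics = classification_metrics(tp=40, fp=5, tn=45, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

A classifier can rank first on sensitivity without ranking first on accuracy or F1, which is why the abstract names different best-performing methods for different metrics.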
Affiliation(s)
- Manoj Kaushik
  - Centre for Advanced Studies, Dr. A. P. J. Abdul Kalam Technical University, Lucknow, Uttar Pradesh, India
- Rakesh Chandra Joshi
  - Centre for Advanced Studies, Dr. A. P. J. Abdul Kalam Technical University, Lucknow, Uttar Pradesh, India
- Atar Singh Kushwah
  - Molecular & Human Genetics Laboratory, Department of Zoology, University of Lucknow, Lucknow, Uttar Pradesh, India; Department of Zoology, Institute of Science, Banaras Hindu University, Varanasi, Uttar Pradesh, India
- Maneesh Kumar Gupta
  - Molecular & Human Genetics Laboratory, Department of Zoology, University of Lucknow, Lucknow, Uttar Pradesh, India
- Monisha Banerjee
  - Molecular & Human Genetics Laboratory, Department of Zoology, University of Lucknow, Lucknow, Uttar Pradesh, India
- Radim Burget
  - Brno University of Technology, Faculty of Electrical Engineering, Brno, Czech Republic
- Malay Kishore Dutta
  - Centre for Advanced Studies, Dr. A. P. J. Abdul Kalam Technical University, Lucknow, Uttar Pradesh, India.
77
Wang X, Zhang Y, Yu B, Salhi A, Chen R, Wang L, Liu Z. Prediction of protein-protein interaction sites through eXtreme gradient boosting with kernel principal component analysis. Comput Biol Med 2021; 134:104516. [PMID: 34119922 DOI: 10.1016/j.compbiomed.2021.104516] [Citation(s) in RCA: 26] [Impact Index Per Article: 8.7] [Received: 12/29/2020] [Revised: 05/24/2021] [Accepted: 05/24/2021] [Indexed: 12/22/2022]
Abstract
Predicting protein-protein interaction sites (PPI sites) can provide important clues for understanding biological activity. Using machine learning to predict PPI sites can mitigate the cost of running expensive and time-consuming biological experiments. Here we propose PPISP-XGBoost, a novel PPI site prediction method based on eXtreme gradient boosting (XGBoost). First, the feature information of each protein is extracted through the pseudo-position specific scoring matrix (PsePSSM), pseudo-amino acid composition (PseAAC), hydropathy index and solvent accessible surface area (ASA) under a sliding window. Next, these raw features are preprocessed to obtain better representations and thus better predictions. In particular, the synthetic minority oversampling technique (SMOTE) is used to circumvent class imbalance, and kernel principal component analysis (KPCA) is applied to remove redundant characteristics. Finally, these optimal features are fed to the XGBoost classifier to identify PPI sites. Using PPISP-XGBoost, the prediction accuracy on the training dataset Dset186 reaches 85.4%, and the accuracy on the independent validation datasets Dtestset72, PDBtestset164, Dset_448 and Dset_355 reaches 85.3%, 83.9%, 85.8% and 85.4%, respectively, which all show an increase in accuracy against existing PPI site prediction methods. These results demonstrate that the PPISP-XGBoost method can further enhance the prediction of PPI sites.
Affiliation(s)
- Xue Wang
  - College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
- Yaqun Zhang
  - College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
- Bin Yu
  - College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China; Key Laboratory of Computational Science and Application of Hainan Province, Haikou, 571158, China.
- Adil Salhi
  - Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955, Saudi Arabia
- Ruixin Chen
  - College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
- Lin Wang
  - College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
- Zengfeng Liu
  - College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
78
Shen Z, Wu Q, Wang Z, Chen G, Lin B. Diabetic Retinopathy Prediction by Ensemble Learning Based on Biochemical and Physical Data. SENSORS 2021; 21:s21113663. [PMID: 34070287 PMCID: PMC8197325 DOI: 10.3390/s21113663] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 04/19/2021] [Revised: 05/15/2021] [Accepted: 05/20/2021] [Indexed: 11/16/2022]
Abstract
(1) Background: Diabetic retinopathy, one of the most serious complications of diabetes, is the primary cause of blindness in developed countries. Therefore, the prediction of diabetic retinopathy has a positive impact on its early detection and treatment. The problem addressed in this study was the prediction of diabetic retinopathy from high-dimensional, small-sample structured datasets (such as biochemical data and physical data). (2) Methods: This study proposed the XGB-Stacking model, built on XGBoost and stacking. First, a wrapper feature selection algorithm, XGBIBS (Improved Backward Search Based on XGBoost), was used to reduce data feature redundancy and improve the effect of a single ensemble learning classifier. Second, in view of the limitations of a single classifier, a stacking model fusion method, Sel-Stacking (Select-Stacking), which keeps Label-Proba as the input matrix of the meta-classifier and determines the optimal combination of learners by a global search, was used in the XGB-Stacking model. (3) Results: XGBIBS greatly improved the prediction accuracy and the feature reduction rate of a single classifier. Compared to a single classifier, the accuracy of the Sel-Stacking model was improved to varying degrees. Experiments showed that the XGB-Stacking prediction model, based on the XGBIBS algorithm and the Sel-Stacking method, made effective predictions of diabetic retinopathy. (4) Conclusion: The XGB-Stacking prediction model of diabetic retinopathy based on biochemical and physical data had outstanding performance. This is highly significant for improving the screening efficiency of diabetic retinopathy and reducing the cost of diagnosis.
Affiliation(s)
- Zun Shen
  - School of Informatics, Xiamen University, Xiamen 361005, China
- Qingfeng Wu
  - School of Informatics, Xiamen University, Xiamen 361005, China
  - Corresponding author
- Zhi Wang
  - Department of Microelectronics and Nanoelectronics, Tsinghua University, Beijing 100876, China
- Guoyi Chen
  - School of Informatics, Xiamen University, Xiamen 361005, China
- Bin Lin
  - School of Informatics, Xiamen University, Xiamen 361005, China
79
Wang CY, Lee SJ. Regional Population Forecast and Analysis Based on Machine Learning Strategy. ENTROPY 2021; 23:e23060656. [PMID: 34073825 PMCID: PMC8225119 DOI: 10.3390/e23060656] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 04/13/2021] [Revised: 05/14/2021] [Accepted: 05/18/2021] [Indexed: 01/29/2023]
Abstract
Regional population forecasting and analysis are essential to urban and regional planning, and a well-designed plan can effectively construct a sound national infrastructure and stabilize positive population growth. Traditionally, both urban and regional planning rely on the opinions of demographers about how the population of a city or a region will grow. Multi-regional population forecasting is currently possible, carried out mainly on the basis of the Interregional Cohort-Component model. While this model has its unique advantages, several demographic rates are determined by decisions made by primary planners. Hence, the only drawback of cohort-component population forecasting is that it allows the analyst to specify the future demographic rates, which tends to bias the forecast. To avoid this problem, this work proposes a machine learning-based method to forecast multi-regional population growth objectively. Drawing upon newly developed machine learning technology, it attempts to analyze and forecast the population growth of major cities in Taiwan. By exploiting the advantages of the XGBoost algorithm, feature importance can be evaluated and multi-regional population growth between the present and the near future can be forecast objectively, which can further provide an objective reference for urban planning with respect to regional population.
Affiliation(s)
- Chian-Yue Wang
  - Graduate Institute of Urban Planning, National Taipei University, Taipei 237, Taiwan
- Shin-Jye Lee
  - Institute of Management of Technology, National Chiao Tung University, Hsinchu 300, Taiwan
  - Corresponding author
80
Chen YZ, Wang ZZ, Wang Y, Ying G, Chen Z, Song J. nhKcr: a new bioinformatics tool for predicting crotonylation sites on human nonhistone proteins based on deep learning. Brief Bioinform 2021; 22:6277413. [PMID: 34002774 DOI: 10.1093/bib/bbab146] [Citation(s) in RCA: 25] [Impact Index Per Article: 8.3] [Received: 02/22/2021] [Revised: 03/18/2021] [Accepted: 03/25/2021] [Indexed: 12/20/2022] Open
Abstract
Lysine crotonylation (Kcr) is a newly discovered type of protein post-translational modification and has been reported to be involved in various pathophysiological processes. High-resolution mass spectrometry is the primary approach for identification of Kcr sites. However, experimental approaches for identifying Kcr sites are often time-consuming and expensive when compared with computational approaches. To date, several predictors for Kcr site prediction have been developed, most of which are capable of predicting crotonylation sites on either histones alone or mixed histone and nonhistone proteins together. These methods exhibit high diversity in their algorithms, encoding schemes, feature selection techniques and performance assessment strategies. However, none of them was designed for predicting Kcr sites on nonhistone proteins. Therefore, it is desirable to develop an effective predictor for identifying Kcr sites from the large amount of nonhistone sequence data. For this purpose, we first provide a comprehensive review of six methods for predicting crotonylation sites. Second, we develop a novel deep learning-based computational framework termed CNNrgb for Kcr site prediction on nonhistone proteins by integrating different types of features. We benchmark its performance against multiple commonly used machine learning classifiers (including random forest, logitboost, naïve Bayes and logistic regression) by performing both 10-fold cross-validation and an independent test. The results show that the proposed CNNrgb framework achieves the best performance with high computational efficiency on large datasets. Moreover, to facilitate users' efforts to investigate Kcr sites on human nonhistone proteins, we implement an online server called nhKcr and compare it with other existing tools to illustrate the utility and robustness of our method. The nhKcr web server and all the datasets utilized in this study are freely accessible at http://nhKcr.erc.monash.edu/.
Affiliation(s)
- Yong-Zi Chen
  - Laboratory of Tumor Cell Biology, Tianjin Medical University Cancer Institute and Hospital, Tianjin 300060, China
- Guoguang Ying
  - Laboratory of Tumor Cell Biology in Tianjin Medical University Cancer Institute and Hospital, Tianjin 300060, China
- Zhen Chen
  - Collaborative Innovation Center of Henan Grain Crops, Henan Agricultural University, China
- Jiangning Song
  - Monash Biomedicine Discovery Institute, Monash University, Australia
81
Karabulut OC, Karpuzcu BA, Türk E, Ibrahim AH, Süzek BE. ML-AdVInfect: A Machine-Learning Based Adenoviral Infection Predictor. Front Mol Biosci 2021; 8:647424. [PMID: 34026828 PMCID: PMC8139618 DOI: 10.3389/fmolb.2021.647424] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 12/29/2020] [Accepted: 04/22/2021] [Indexed: 01/08/2023] Open
Abstract
Adenoviruses (AdVs) constitute a diverse family with many pathogenic types that infect a broad range of hosts. Understanding the pathogenesis of adenoviral infections is not only clinically relevant but also important to elucidate the potential use of AdVs as vectors in therapeutic applications. For an adenoviral infection to occur, attachment of the viral ligand to a cellular receptor on the host organism is a prerequisite and, in this sense, it is a criterion to decide whether an adenoviral infection can potentially happen. The interaction between any virus and its corresponding host organism is a specific kind of protein-protein interaction (PPI), and several experimental techniques, including high-throughput methods, are being used to explore such interactions. As a result, data on virus-host interactions have been accumulating, a significant portion of which is reported in publicly available bioinformatics resources. There is not, however, a computational model that integrates and interprets the existing data to reach concise decisions, such as whether an infection happens or not. In this study, taking the cellular entry of AdV as the decisive parameter for infectivity, we have developed a machine learning, more precisely support vector machine (SVM), based methodology to predict whether adenoviral infection can take place in a given host. For this purpose, we used the sequence data of the known receptors of AdVs, identified sets of adenoviral ligands and their respective host species, and eventually constructed a comprehensive adenovirus–host interaction dataset. Then, we carried out interaction predictions with publicly available virus-host PPI tools and constructed an AdV infection predictor model using an SVM with an RBF kernel, with overall sensitivity, specificity, and AUC of 0.88 ± 0.011, 0.83 ± 0.064, and 0.86 ± 0.030, respectively. ML-AdVInfect is the first of its kind as an effective predictor to screen infection capacity and to anticipate any cross-species shifts. We anticipate that the approach that led to ML-AdVInfect can be adapted to make predictions for other viral infections.
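For readers unfamiliar with the RBF kernel mentioned in the abstract, its conventional definition is given below. The symbols are standard notation rather than the authors' own: the vectors x and x' are two feature vectors, and γ is a positive hyperparameter whose value the abstract does not state.

```latex
K(\mathbf{x}, \mathbf{x}') = \exp\left(-\gamma \,\lVert \mathbf{x} - \mathbf{x}' \rVert^{2}\right), \qquad \gamma > 0
```

The kernel equals 1 for identical inputs and decays toward 0 as the inputs move apart, so larger values of γ make the decision boundary more local.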
Affiliation(s)
- Onur Can Karabulut
  - Bioinformatics Graduate Program, Graduate School of Natural and Applied Sciences, Muğla Sıtkı Koçman University, Muğla, Turkey
- Betül Asiye Karpuzcu
  - Bioinformatics Graduate Program, Graduate School of Natural and Applied Sciences, Muğla Sıtkı Koçman University, Muğla, Turkey
- Erdem Türk
  - Department of Computer Engineering, Faculty of Engineering, Muğla Sıtkı Koçman University, Muğla, Turkey
- Ahmad Hassan Ibrahim
  - Department of Computer Engineering, Faculty of Engineering, Muğla Sıtkı Koçman University, Muğla, Turkey
- Barış Ethem Süzek
  - Department of Computer Engineering, Faculty of Engineering, Muğla Sıtkı Koçman University, Muğla, Turkey; Georgetown University Medical Center, Biochemistry and Molecular and Cellular Biology, Washington, DC, United States
82
Prasasty VD, Hutagalung RA, Gunadi R, Sofia DY, Rosmalena R, Yazid F, Sinaga E. Prediction of human-Streptococcus pneumoniae protein-protein interactions using logistic regression. Comput Biol Chem 2021; 92:107492. [PMID: 33964803 DOI: 10.1016/j.compbiolchem.2021.107492] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 03/02/2021] [Accepted: 04/21/2021] [Indexed: 02/07/2023]
Abstract
Streptococcus pneumoniae is a major cause of mortality in children under five years old. In recent years, the emergence of antibiotic-resistant strains of S. pneumoniae has increased the threat level of this pathogen. For that reason, the exploration of S. pneumoniae protein virulence factors should be considered in developing new drugs or vaccines, for instance through the analysis of host-pathogen protein-protein interactions (HP-PPIs). In this research, protein-protein interactions were predicted with a logistic regression model that uses the number of protein domain occurrences as features. By utilizing HP-PPIs of three different pathogens as training data, the model achieved 57-77% precision, 64-75% recall, and 96-98% specificity. Prediction of human-S. pneumoniae protein-protein interactions using the model yielded 5823 interactions involving thirty S. pneumoniae proteins and 324 human proteins. Pathway enrichment analysis showed that most of the pathways involved in the predicted interactions are immune system pathways. Network topology analysis revealed β-galactosidase (BgaA) as the most central among the S. pneumoniae proteins in the predicted HP-PPI networks, with a degree centrality of 1.0 and a betweenness centrality of 0.451853. Further experimental studies are required to validate the predicted interactions and examine their roles in S. pneumoniae infection.
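As a reading aid for the network topology result, the sketch below computes degree centrality under its most common definition: a node's degree divided by the number of other nodes in the graph. The abstract does not say which software or normalisation the authors used, so this definition is an assumption, and the function name `degree_centrality`, the node names and the edges are all hypothetical. Under this definition a value of 1.0 means the node is directly connected to every other node.

```python
def degree_centrality(edges):
    """Degree of each node divided by (number of nodes - 1)."""
    neighbours = {}
    for u, v in edges:
        # Treat the interaction network as undirected.
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    n = len(neighbours)
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}


# Toy undirected graph: "hub" touches every other node (made-up data).
edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("a", "b")]
centrality = degree_centrality(edges)
print(centrality)
```

Bipartite host-pathogen networks are sometimes normalised by the size of the opposite node set instead, which would change the values but not the ranking within one side of the network.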
Affiliation(s)
- Vivitri Dewi Prasasty
  - Faculty of Biotechnology, Atma Jaya Catholic University of Indonesia, Jakarta, 12930, Indonesia.
- Rory Anthony Hutagalung
  - Faculty of Biotechnology, Atma Jaya Catholic University of Indonesia, Jakarta, 12930, Indonesia
- Reinhart Gunadi
  - Department of Biology, Faculty of Life Sciences, Universitas Surya, Tangerang, Banten, 15143, Indonesia
- Dewi Yustika Sofia
  - Department of Biology, Faculty of Life Sciences, Universitas Surya, Tangerang, Banten, 15143, Indonesia
- Rosmalena Rosmalena
  - Department of Medical Chemistry, Faculty of Medicine, Universitas Indonesia, Jakarta, 10430, Indonesia
- Fatmawaty Yazid
  - Department of Medical Chemistry, Faculty of Medicine, Universitas Indonesia, Jakarta, 10430, Indonesia
- Ernawati Sinaga
  - Faculty of Biology, Universitas Nasional, Jakarta, 12520, Indonesia.
83
Zhang Q, Liu P, Wang X, Zhang Y, Han Y, Yu B. StackPDB: Predicting DNA-binding proteins based on XGB-RFE feature optimization and stacked ensemble classifier. Appl Soft Comput 2021. [DOI: 10.1016/j.asoc.2020.106921] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Indexed: 02/08/2023]
84
Novaes MT, Ferreira de Carvalho OL, Guimarães Ferreira PH, Nunes Tiraboschi TL, Silva CS, Zambrano JC, Gomes CM, de Paula Miranda E, Abílio de Carvalho Júnior O, de Bessa Júnior J. Prediction of secondary testosterone deficiency using machine learning: A comparative analysis of ensemble and base classifiers, probability calibration, and sampling strategies in a slightly imbalanced dataset. INFORMATICS IN MEDICINE UNLOCKED 2021. [DOI: 10.1016/j.imu.2021.100538] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/13/2022] Open
85
Wei L, He W, Malik A, Su R, Cui L, Manavalan B. Computational prediction and interpretation of cell-specific replication origin sites from multiple eukaryotes by exploiting stacking framework. Brief Bioinform 2020; 22:5956930. [PMID: 33152766 DOI: 10.1093/bib/bbaa275] [Citation(s) in RCA: 76] [Impact Index Per Article: 19.0] [Received: 08/26/2020] [Revised: 09/14/2020] [Accepted: 09/21/2020] [Indexed: 12/13/2022] Open
Abstract
Origins of replication sites (ORIs), which refer to the initiation locations of genomic DNA replication, play essential roles in the DNA replication process. Detecting the distribution of ORIs at genome scale is one of the key steps toward an in-depth understanding of their regulation mechanisms. In this study, we present a novel machine learning-based approach called Stack-ORI, encompassing 10 cell-specific prediction models for identifying ORIs from four different eukaryotic species (Homo sapiens, Mus musculus, Drosophila melanogaster and Arabidopsis thaliana). For each cell-specific model, we employed 12 feature encoding schemes that cover nucleic acid composition, position-specific and physicochemical property information. The optimal feature set was identified for each encoding individually, and the respective baseline models were developed using the eXtreme Gradient Boosting (XGBoost) classifier. Subsequently, the predicted scores of the 12 baseline models were integrated as a novel feature vector to train XGBoost and develop the final model. Extensive experimental results show that Stack-ORI achieves significantly better performance than its baseline models on both training and independent datasets. Interestingly, Stack-ORI consistently outperforms existing predictors in all cell-specific models, not only on the training datasets but also on the independent tests. Moreover, our novel approach provides the necessary interpretations that help in understanding model success by leveraging the powerful SHapley Additive exPlanation algorithm, thus highlighting the feature encoding schemes that are most important for predicting cell-specific ORIs.
Affiliation(s)
- Leyi Wei
  - computer science from Xiamen University, China
- Wenjia He
  - School of Software at Shandong University, China
- Adeel Malik
  - Institute of Intelligence Informatics Technology, Sangmyung University, Seoul, Republic of Korea
- Ran Su
  - College of Intelligence and Computing, Tianjin University, Tianjin, China
- Lizhen Cui
  - School of Software, Shandong University, the Deputy Director of the E-Commerce Research Center and the Director of the Research Center of Software and Data Engineering, Jinan