1
Fu L, Shi S, Yi J, Wang N, He Y, Wu Z, Peng J, Deng Y, Wang W, Wu C, Lyu A, Zeng X, Zhao W, Hou T, Cao D. ADMETlab 3.0: an updated comprehensive online ADMET prediction platform enhanced with broader coverage, improved performance, API functionality and decision support. Nucleic Acids Res 2024; 52:W422-W431. [PMID: 38572755 PMCID: PMC11223840 DOI: 10.1093/nar/gkae236] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/29/2024] [Revised: 03/10/2024] [Accepted: 03/21/2024] [Indexed: 04/05/2024] Open
Abstract
ADMETlab 3.0 is the second updated version of the web server that provides a comprehensive and efficient platform for evaluating ADMET-related parameters as well as physicochemical properties and medicinal chemistry characteristics involved in the drug discovery process. This new release addresses the limitations of the previous version and offers broader coverage, improved performance, API functionality, and decision support. For supporting data and endpoints, this version includes 119 features, an increase of 31 compared to the previous version. The number of entries now exceeds 400 000, 1.5 times larger than in the previous version. ADMETlab 3.0 incorporates a multi-task DMPNN architecture coupled with molecular descriptors, an approach that not only ensures fast, simultaneous calculation of every endpoint but also achieves superior accuracy and robustness. In addition, an API has been introduced to meet the growing demand for programmatic access to large amounts of data in ADMETlab 3.0. Moreover, this version includes uncertainty estimates in the prediction results, aiding in the confident selection of candidate compounds for further studies and experiments. ADMETlab 3.0 is publicly accessible without registration at: https://admetlab3.scbdd.com.
Affiliation(s)
- Li Fu
  - Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, Hunan 410013, P.R. China
- Shaohua Shi
  - School of Chinese Medicine, Hong Kong Baptist University, Kowloon, Hong Kong SAR, 999077, P.R. China
- Jiacai Yi
  - School of Computer Science, National University of Defense Technology, Changsha, Hunan 410073, P.R. China
- Ningning Wang
  - Xiangya Hospital of Central South University, Changsha, Hunan 410008, P.R. China
- Yuanhang He
  - Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, Hunan 410013, P.R. China
- Zhenxing Wu
  - College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, P.R. China
- Jinfu Peng
  - Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, Hunan 410013, P.R. China
- Youchao Deng
  - Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, Hunan 410013, P.R. China
- Wenxuan Wang
  - Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, Hunan 410013, P.R. China
- Chengkun Wu
  - School of Computer Science, National University of Defense Technology, Changsha, Hunan 410073, P.R. China
- Aiping Lyu
  - School of Chinese Medicine, Hong Kong Baptist University, Kowloon, Hong Kong SAR, 999077, P.R. China
- Xiangxiang Zeng
  - Department of Computer Science, Hunan University, Changsha, Hunan 410082, P.R. China
- Wentao Zhao
  - School of Computer Science, National University of Defense Technology, Changsha, Hunan 410073, P.R. China
- Tingjun Hou
  - College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, P.R. China
- Dongsheng Cao
  - Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, Hunan 410013, P.R. China

2
Tran HN, Nguyen PXQ, Guo F, Wang J. Prediction of Protein-Protein Interactions Based on Integrating Deep Learning and Feature Fusion. Int J Mol Sci 2024; 25:5820. [PMID: 38892007 PMCID: PMC11172432 DOI: 10.3390/ijms25115820] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/29/2024] [Revised: 04/27/2024] [Accepted: 04/29/2024] [Indexed: 06/21/2024] Open
Abstract
Understanding protein-protein interactions (PPIs) helps to identify protein functions and develop other important applications such as drug preparation and protein-disease relationship identification. Deep-learning-based approaches are being intensely researched for PPI determination to reduce the cost and time of previous testing methods. In this work, we integrate deep learning with feature fusion, harnessing the strengths of both handcrafted features and protein sequence embeddings. The accuracies of the proposed model using five-fold cross-validation on Yeast core and Human datasets are 96.34% and 99.30%, respectively. In the task of predicting interactions in important PPI networks, our model correctly predicted all interactions in one-core, Wnt-related, and cancer-specific networks. The experimental results on cross-species datasets, including Caenorhabditis elegans, Helicobacter pylori, Homo sapiens, Mus musculus, and Escherichia coli, also show that our feature fusion method helps increase the generalization capability of the PPI prediction model.
Affiliation(s)
- Jianxin Wang
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, China (F.G.)

3
Cao MY, Zainudin S, Daud KM. Protein features fusion using attributed network embedding for predicting protein-protein interaction. BMC Genomics 2024; 25:466. [PMID: 38741045 DOI: 10.1186/s12864-024-10361-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2024] [Accepted: 04/29/2024] [Indexed: 05/16/2024] Open
Abstract
BACKGROUND Protein-protein interactions (PPIs) hold significant importance in biology, with precise PPI prediction as a pivotal factor in comprehending cellular processes and facilitating drug design. However, experimental determination of PPIs is laborious, time-consuming, and often constrained by technical limitations. METHODS We introduce a new node representation method based on initial information fusion, called FFANE, which amalgamates PPI networks and protein sequence data to enhance the precision of PPIs' prediction. A Gaussian kernel similarity matrix is initially established by leveraging protein structural resemblances. Concurrently, protein sequence similarities are gauged using the Levenshtein distance, enabling the capture of diverse protein attributes. Subsequently, to construct an initial information matrix, these two feature matrices are merged by employing weighted fusion to achieve an organic amalgamation of structural and sequence details. To gain a more profound understanding of the amalgamated features, a Stacked Autoencoder (SAE) is employed for encoding learning, thereby yielding more representative feature representations. Ultimately, classification models are trained to predict PPIs by using the well-learned fusion feature. RESULTS When employing 5-fold cross-validation experiments on SVM, our proposed method achieved average accuracies of 94.28%, 97.69%, and 84.05% in terms of Saccharomyces cerevisiae, Homo sapiens, and Helicobacter pylori datasets, respectively. CONCLUSION Experimental findings across various authentic datasets validate the efficacy and superiority of this fusion feature representation approach, underscoring its potential value in bioinformatics.
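The fusion step described in the abstract above can be illustrated with a minimal sketch. It is not the FFANE implementation: the normalisation of the Levenshtein distance, the use of toy adjacency rows as "structural" profiles, and the parameters `gamma` and `alpha` are all assumptions made for illustration, and the function names are hypothetical.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def sequence_similarity(seqs):
    """Assumed normalisation: 1 - distance / length of the longer sequence."""
    n = len(seqs)
    sim = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            longest = max(len(seqs[i]), len(seqs[j]), 1)
            s = 1.0 - levenshtein(seqs[i], seqs[j]) / longest
            sim[i][j] = sim[j][i] = s
    return sim


def gaussian_kernel_similarity(profiles, gamma=1.0):
    """exp(-gamma * squared Euclidean distance) between profile rows."""
    n = len(profiles)
    return [
        [math.exp(-gamma * sum((x - y) ** 2
                               for x, y in zip(profiles[i], profiles[j])))
         for j in range(n)]
        for i in range(n)
    ]


def weighted_fusion(a, b, alpha=0.5):
    """Element-wise alpha * a + (1 - alpha) * b for two square matrices."""
    return [[alpha * x + (1.0 - alpha) * y for x, y in zip(ra, rb)]
            for ra, rb in zip(a, b)]


# Toy data (hypothetical): three short sequences and rows of a small adjacency matrix.
seqs = ["MKTAYIA", "MKTAYLA", "GGSGG"]
profiles = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
fused = weighted_fusion(gaussian_kernel_similarity(profiles),
                        sequence_similarity(seqs), alpha=0.5)
```

In the method as described, a fused matrix of this kind would then be passed to a stacked autoencoder and a classifier; those stages are omitted here.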
Affiliation(s)
- Mei-Yuan Cao
  - Center for Artificial Intelligence Technology (CAIT), Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Bangi, 43600, Selangor, Malaysia.
- Suhaila Zainudin
  - Center for Artificial Intelligence Technology (CAIT), Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Bangi, 43600, Selangor, Malaysia
- Kauthar Mohd Daud
  - Center for Artificial Intelligence Technology (CAIT), Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Bangi, 43600, Selangor, Malaysia

4
Dang TH, Vu TA. xCAPT5: protein-protein interaction prediction using deep and wide multi-kernel pooling convolutional neural networks with protein language model. BMC Bioinformatics 2024; 25:106. [PMID: 38461247 PMCID: PMC10924985 DOI: 10.1186/s12859-024-05725-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/05/2024] [Accepted: 02/28/2024] [Indexed: 03/11/2024] Open
Abstract
BACKGROUND Predicting protein-protein interactions (PPIs) from sequence data is a key challenge in computational biology. While various computational methods have been proposed, the utilization of sequence embeddings from protein language models, which contain diverse information, including structural, evolutionary, and functional aspects, has not been fully exploited. Additionally, there is a significant need for a comprehensive neural network capable of efficiently extracting these multifaceted representations. RESULTS Addressing this gap, we propose xCAPT5, a novel hybrid classifier that uniquely leverages the T5-XL-UniRef50 protein large language model for generating rich amino acid embeddings from protein sequences. The core of xCAPT5 is a multi-kernel deep convolutional siamese neural network, which effectively captures intricate interaction features at both micro and macro levels, integrated with the XGBoost algorithm, enhancing PPIs classification performance. By concatenating max and average pooling features in a depth-wise manner, xCAPT5 effectively learns crucial features with low computational cost. CONCLUSION This study represents one of the initial efforts to extract informative amino acid embeddings from a large protein language model using a deep and wide convolutional network. Experimental results show that xCAPT5 outperforms recent state-of-the-art methods in binary PPI prediction, excelling in cross-validation on several benchmark datasets and demonstrating robust generalization across intra-species, cross-species, inter-species, and stringent similarity contexts.
Affiliation(s)
- Thanh Hai Dang
  - Faculty of Information Technology, VNU University of Engineering and Technology, 144 Xuan Thuy, Hanoi, 10000, Vietnam.
- Tien Anh Vu
  - Faculty of Biology, VNU University of Science, 334 Nguyen Trai, Hanoi, 10000, Vietnam

5
Jadhav S, Vyavahare AJ, Sharma M. Salp-J Colony Optimization-based advanced hybrid ensemble deep predictor with LSTM for protein structure prediction. J Biomol Struct Dyn 2024:1-16. [PMID: 38444340 DOI: 10.1080/07391102.2023.2294386] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/18/2023] [Accepted: 12/04/2023] [Indexed: 03/07/2024]
Abstract
Protein structure prediction (PSP) is a key concern in computational biology and a challenging task that is vital for determining the structure and function of a protein, since each protein possesses a definite shape; protein secondary structure prediction (PSSP) is the foundation for three-dimensional PSP. An advanced hybrid ensemble deep predictor based on Long Short-Term Memory (LSTM) is used to predict the structure of a protein. The performance of the predictor is improved by obtaining the features through Salp-J Colony Optimization, which is developed by integrating the features of three optimizations for the solution update: the exploration behavior of Ulmaris, the immune system of the virus colony, and the teamwork of salps. This helps to predict the protein structure accurately. The proposed method achieved 99.1% accuracy, 99.5% sensitivity, 98.85% specificity, and 0.9% error at the 80% of training percentage 90 using CullPDB. Similarly, on Protein Net, the accuracy is 97.27%, sensitivity is 98.13%, specificity is 97%, and error is 2.7% at a training percentage of 90%. Communicated by Ramaswamy H. Sarma.
Affiliation(s)
- Swati Jadhav
  - Electronics and Telecommunication Department, D. Y. Patil College of Engineering, Akurdi, Pune, Maharashtra, India
- Arati J Vyavahare
  - Electronics and Telecommunication Department, PES's Modern College of Engineering, Pune, Maharashtra, India
- Manish Sharma
  - Electronics and Telecommunication Department, D. Y. Patil College of Engineering, Akurdi, Pune, Maharashtra, India

6
Wang C, Wang Y, Ding P, Li S, Yu X, Yu B. ML-FGAT: Identification of multi-label protein subcellular localization by interpretable graph attention networks and feature-generative adversarial networks. Comput Biol Med 2024; 170:107944. [PMID: 38215617 DOI: 10.1016/j.compbiomed.2024.107944] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2023] [Revised: 12/08/2023] [Accepted: 01/01/2024] [Indexed: 01/14/2024]
Abstract
The prediction of multi-label protein subcellular localization (SCL) is a pivotal area in bioinformatics research. Recent advancements in protein structure research have facilitated the application of graph neural networks. This paper introduces a novel approach termed ML-FGAT. The approach begins by extracting node information of proteins from sequence data, physical-chemical properties, evolutionary insights, and structural details. Subsequently, various evolutionary techniques are integrated to consolidate multi-view information. A linear discriminant analysis framework, grounded on entropy weight, is then employed to reduce the dimensionality of the merged features. To enhance the robustness of the model, the training dataset is augmented using feature-generative adversarial networks. For the primary prediction step, graph attention networks are employed to determine multi-label protein SCL, leveraging both node and neighboring information. The interpretability is enhanced by analyzing the attention weight parameters. The training is based on the Gram-positive bacteria dataset, while validation employs newly constructed datasets: human, virus, Gram-negative bacteria, plant, and SARS-CoV-2. Following a leave-one-out cross-validation procedure, ML-FGAT demonstrates noteworthy superiority in this domain.
Affiliation(s)
- Congjing Wang
  - College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; School of Data Science, Qingdao University of Science and Technology, Qingdao, 266061, China
- Yifei Wang
  - College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; School of Data Science, Qingdao University of Science and Technology, Qingdao, 266061, China
- Pengju Ding
  - College of Information Science and Technology, Qingdao University of Science and Technology, Qingdao, 266061, China
- Shan Li
  - School of Mathematics and Statistics, Central South University, Changsha, 410083, China
- Xu Yu
  - Qingdao Institute of Software, College of Computer Science and Technology, China University of Petroleum, Qingdao, 266580, China
- Bin Yu
  - School of Data Science, Qingdao University of Science and Technology, Qingdao, 266061, China; School of Data Science, University of Science and Technology of China, Hefei, 230027, China.

7
Yin H, Sharma B, Hu H, Liu F, Kaur M, Cohen G, McConnell R, Eckel SP. Predicting the Climate Impact of Healthcare Facilities Using Gradient Boosting Machines. Cleaner Environmental Systems 2024; 12:100155. [PMID: 38444563 PMCID: PMC10909736 DOI: 10.1016/j.cesys.2023.100155] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/07/2024]
Abstract
Health care accounts for 9-10% of greenhouse gas (GHG) emissions in the United States. Strategies for monitoring these emissions at the hospital level are needed to decarbonize the sector. However, data collection to estimate emissions is challenging, especially for smaller hospitals. We explored the potential of gradient boosting machines (GBM) to impute missing data on resource consumption in the 2020 survey of a consortium of 283 hospitals participating in Practice Greenhealth. GBM imputed missing values for selected variables in order to predict electricity use and beef consumption (R² = 0.82) and anesthetic gas desflurane use (R² = 0.51), using administrative data readily available for most hospitals. After imputing missing consumption data, estimated GHG emissions associated with these three examples totaled over 3 million metric tons of CO2 equivalent emissions (MTCO2e). Specifically, electricity consumption had the largest total carbon footprint (2.4 million MTCO2e), followed by beef (0.6 million MTCO2e) and desflurane consumption (0.03 million MTCO2e) across the 283 hospitals. The approach should be applicable to other sources of hospital GHGs in order to estimate total emissions of individual hospitals and to refine survey questions to help develop better intervention strategies.
Affiliation(s)
- Hao Yin
  - Department of Economics, University of Southern California, Los Angeles, California, USA, 90089
- Bhavna Sharma
  - School of Architecture, University of Southern California, Los Angeles, California, USA, 90089
- Howard Hu
  - Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, California, USA, 90033
- Fei Liu
  - Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, California, USA, 90033
- Mehak Kaur
  - Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, California, USA, 90033
- Gary Cohen
  - Health Care Without Harm, Boston, Massachusetts, USA, 20190
- Rob McConnell
  - Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, California, USA, 90033
- Sandrah P. Eckel
  - Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, California, USA, 90033

8
Teimouri H, Medvedeva A, Kolomeisky AB. Physical-Chemical Features Selection Reveals That Differences in Dipeptide Compositions Correlate Most with Protein-Protein Interactions. bioRxiv: the preprint server for biology 2024:2024.02.27.582345. [PMID: 38464064 PMCID: PMC10925282 DOI: 10.1101/2024.02.27.582345] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/12/2024]
Abstract
The ability to accurately predict protein-protein interactions is critically important for our understanding of major cellular processes. However, current experimental and computational approaches for identifying them are technically very challenging and still have limited success. We propose a new computational method for predicting protein-protein interactions using only primary sequence information. It utilizes a concept of physical-chemical similarity to determine which interactions will most probably occur. In our approach, the physical-chemical features of protein are extracted using bioinformatics tools for different organisms, and then they are utilized in a machine-learning method to identify successful protein-protein interactions via correlation analysis. It is found that the most important property that correlates most with the protein-protein interactions for all studied organisms is dipeptide amino acid compositions. The analysis is specifically applied to the bacterial two-component system that includes histidine kinase and transcriptional response regulators. Our theoretical approach provides a simple and robust method for quantifying the important details of complex mechanisms of biological processes.
Affiliation(s)
- Hamid Teimouri
  - Department of Chemistry, Rice University, Houston, Texas, United States
  - Center for Theoretical Biological Physics, Rice University, Houston, Texas, United States
- Angela Medvedeva
  - Department of Chemistry, Rice University, Houston, Texas, United States
  - Center for Theoretical Biological Physics, Rice University, Houston, Texas, United States
- Anatoly B. Kolomeisky
  - Department of Chemistry, Rice University, Houston, Texas, United States
  - Center for Theoretical Biological Physics, Rice University, Houston, Texas, United States
  - Department of Chemical and Biomolecular Engineering, Rice University, Houston, Texas, United States
  - Department of Physics and Astronomy, Rice University, Houston, TX, United States

9
Fu X, Yuan Y, Qiu H, Suo H, Song Y, Li A, Zhang Y, Xiao C, Li Y, Dou L, Zhang Z, Cui F. AGF-PPIS: A protein-protein interaction site predictor based on an attention mechanism and graph convolutional networks. Methods 2024; 222:142-151. [PMID: 38242383 DOI: 10.1016/j.ymeth.2024.01.006] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 09/21/2023] [Revised: 01/04/2024] [Accepted: 01/13/2024] [Indexed: 01/21/2024] Open
Abstract
Protein-protein interactions play an important role in various biological processes. Interaction among proteins has a wide range of applications. Therefore, the correct identification of protein-protein interaction sites is crucial. In this paper, we propose a novel predictor for protein-protein interaction sites, AGF-PPIS, where we utilize a multi-head self-attention mechanism (introducing a graph structure), graph convolutional network, and feed-forward neural network. We use the Euclidean distance between each pair of protein residues to generate the corresponding protein graph as the input of AGF-PPIS. On the independent test dataset Test_60, AGF-PPIS achieves superior performance over comparative methods in terms of seven different evaluation metrics (ACC, precision, recall, F1-score, MCC, AUROC, AUPRC), which fully demonstrates the validity and superiority of the proposed AGF-PPIS model. The source codes and the steps for usage of AGF-PPIS are available at https://github.com/fxh1001/AGF-PPIS.
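As a rough illustration of building a protein graph from Euclidean distances between residues, the sketch below connects two residues when their coordinates lie within a cutoff. The abstract does not state the cutoff or which atom represents a residue, so the `cutoff` value, the one-point-per-residue input, and the function name `residue_graph` are assumptions rather than the AGF-PPIS procedure.

```python
import math


def residue_graph(coords, cutoff=10.0):
    """Return a symmetric 0/1 adjacency matrix over residues.

    coords: one (x, y, z) point per residue (for example a C-alpha position).
    cutoff: assumed distance threshold in the same units as coords.
    """
    n = len(coords)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Connect residues whose Euclidean distance is within the cutoff.
            if math.dist(coords[i], coords[j]) <= cutoff:
                adj[i][j] = adj[j][i] = 1
    return adj


# Toy coordinates (hypothetical): residues 0 and 1 are close, residue 2 is far away.
toy_coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (30.0, 0.0, 0.0)]
toy_adj = residue_graph(toy_coords, cutoff=10.0)
```

An adjacency matrix like `toy_adj`, together with per-residue feature vectors, is the usual input shape for a graph convolutional layer.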
Affiliation(s)
- Xiuhao Fu
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Ye Yuan
  - Beidahuang Industry Group General Hospital, Harbin 150001, China
- Haoye Qiu
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Haodong Suo
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Yingying Song
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Anqi Li
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Yupeng Zhang
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Cuilin Xiao
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Yazi Li
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Lijun Dou
  - Genomic Medicine Institute, Lerner Research Institute, Cleveland, OH 44106, USA
- Zilong Zhang
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China.
- Feifei Cui
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China.

10
Ji S. SSC: The novel self-stack ensemble model for thyroid disease prediction. PLoS One 2024; 19:e0295501. [PMID: 38170718 PMCID: PMC10763970 DOI: 10.1371/journal.pone.0295501] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/30/2022] [Accepted: 11/22/2023] [Indexed: 01/05/2024] Open
Abstract
Thyroid disease presents a significant health risk, lowering the quality of life and increasing treatment costs. The diagnosis of thyroid disease can be challenging, especially for inexperienced practitioners. Machine learning has been established as one of the methods for disease diagnosis based on previous studies. This research introduces a novel and more effective technique for predicting thyroid disease by utilizing machine learning methodologies, surpassing the performance of previous studies in this field. This study utilizes the UCI thyroid disease dataset, which consists of 9172 samples and 30 features, and exhibits a highly imbalanced target class distribution. However, machine learning algorithms trained on imbalanced thyroid disease data face challenges in reliably detecting minority data and disease. To address this issue, re-sampling is employed, which modifies the ratio between target classes to balance the data. In this study, the down-sampling approach is utilized to achieve a balanced distribution of target classes. A novel RF-based self-stacking classifier is presented in this research for efficient thyroid disease detection. The proposed approach demonstrates the ability to diagnose primary hypothyroidism, increased binding protein, compensated hypothyroidism, and concurrent non-thyroidal illness with an accuracy of 99.5%. The recommended model exhibits state-of-the-art performance, achieving 100% macro precision, 100% macro recall, and 100% macro F1-score. A thorough comparative assessment is conducted to demonstrate the viability of the proposed approach, including several machine learning classifiers, deep neural networks, and ensemble voting classifiers. The results of K-fold cross-validation provide further support for the efficacy of the proposed self-stacking classifier.
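The down-sampling step mentioned in the abstract above can be sketched as follows. This is a generic random under-sampling routine, not the paper's code: every class is reduced to the size of the smallest class, and the function name `downsample` and the `seed` parameter are hypothetical.

```python
import random
from collections import defaultdict


def downsample(samples, labels, seed=0):
    """Randomly under-sample every class to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    target = min(len(items) for items in by_class.values())
    balanced = []
    for y, items in by_class.items():
        # Draw `target` items without replacement from each class.
        for x in rng.sample(items, target):
            balanced.append((x, y))
    rng.shuffle(balanced)  # avoid leaving the result grouped by class
    return balanced


# Toy data (hypothetical): eight negatives and two positives.
toy_samples = list(range(10))
toy_labels = ["neg"] * 8 + ["pos"] * 2
toy_balanced = downsample(toy_samples, toy_labels, seed=0)
```

Under-sampling discards majority-class data, so in practice it is normally applied to the training split only, leaving the evaluation data at its original class ratio.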
Affiliation(s)
- Shengjun Ji
  - School of information, Xi’an University of Finance and Economics, Xi’an, China

11
Zandi F, Mansouri P, Goodarzi M. Global protein-protein interaction networks in yeast saccharomyces cerevisiae and helicobacter pylori. Talanta 2023; 265:124836. [PMID: 37393709 DOI: 10.1016/j.talanta.2023.124836] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/29/2023] [Revised: 06/04/2023] [Accepted: 06/17/2023] [Indexed: 07/04/2023]
Abstract
Understanding many biological processes relies heavily on accurately predicting protein-protein interactions (PPIs). In this study, we propose a novel method for predicting PPIs that is based on LogitBoost with a binary bat feature selection algorithm. Our approach involves the extraction of an initial feature vector by combining pseudo amino acid composition (PseAAC), pseudo-position-specific scoring matrix (PsePSSM), reduced sequence and index-vectors (RSIV), and autocorrelation descriptor (AD). Subsequently, a binary bat algorithm is applied to eliminate redundant features, and the resulting optimal features are fed into the LogitBoost classifier for the identification of PPIs. To evaluate the proposed method, we test it on two databases, Saccharomyces cerevisiae and Helicobacter pylori, using 10-fold cross-validation, and achieve accuracies of 94.39% and 97.89%, respectively. Our results showcase the significant potential of our pipeline in accurately predicting protein-protein interactions (PPIs), thereby offering a valuable resource to the scientific research community.
Affiliation(s)
- Farzad Zandi
  - Faculty of Sciences, Islamic Azad University, Arak Branch, Arak, Markazi, Iran
- Mohammad Goodarzi
  - Department of Immunology, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA.

12
Wang J, Chen C, Yao G, Ding J, Wang L, Jiang H. Intelligent Protein Design and Molecular Characterization Techniques: A Comprehensive Review. Molecules 2023; 28:7865. [PMID: 38067593 PMCID: PMC10707872 DOI: 10.3390/molecules28237865] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/21/2023] [Revised: 11/13/2023] [Accepted: 11/23/2023] [Indexed: 12/18/2023] Open
Abstract
In recent years, the widespread application of artificial intelligence algorithms in protein structure, function prediction, and de novo protein design has significantly accelerated the process of intelligent protein design and led to many noteworthy achievements. This advancement in protein intelligent design holds great potential to accelerate the development of new drugs, enhance the efficiency of biocatalysts, and even create entirely new biomaterials. Protein characterization is the key to the performance of intelligent protein design. However, there is no consensus on the most suitable characterization method for intelligent protein design tasks. This review describes the methods, characteristics, and representative applications of traditional descriptors, sequence-based and structure-based protein characterization. It discusses their advantages, disadvantages, and scope of application. It is hoped that this could help researchers to better understand the limitations and application scenarios of these methods, and provide valuable references for choosing appropriate protein characterization techniques for related research in the field, so as to better carry out protein research.
Affiliation(s)
- Junjie Ding
  - State Key Laboratory of NBC Protection for Civilian, Beijing 102205, China; (J.W.); (C.C.); (G.Y.)
- Liangliang Wang
  - State Key Laboratory of NBC Protection for Civilian, Beijing 102205, China; (J.W.); (C.C.); (G.Y.)
- Hui Jiang
  - State Key Laboratory of NBC Protection for Civilian, Beijing 102205, China; (J.W.); (C.C.); (G.Y.)

13
Wang Y, Xie Y, Luo Y, Jia P, Wei J, Zhang J, Yan W, Huang J. iASMP: An interpretable in-silico predictive tool focusing on species-specific antimicrobial peptides. J Pept Sci 2023; 29:e3490. [PMID: 36994602 DOI: 10.1002/psc.3490] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 10/03/2022] [Revised: 03/02/2023] [Accepted: 03/25/2023] [Indexed: 03/31/2023]
Abstract
Antimicrobial peptides (AMPs), a crucial part of the innate immune system, have been exploited as promising candidates for antibacterial agents. Many researchers have devoted their efforts to developing novel AMPs in recent decades. In this context, many computational approaches have been developed to identify potential AMPs accurately. However, finding peptides specific to a particular bacterial species is challenging. Streptococcus mutans is a pathogen with an apparent cariogenic effect, and it is of great significance to study AMPs that inhibit S. mutans for the prevention and treatment of caries. In this study, we proposed a sequence-based machine learning model, namely iASMP, to accurately identify potential anti-S. mutans peptides (ASMPs). After collecting ASMPs, the performances of models were compared by utilizing multiple feature descriptors and different classification algorithms. Among the baseline predictors, the model integrating the extra trees (ET) algorithm and the hybrid features exhibited optimal results. The feature selection method was utilized to remove redundant feature information to improve the model performance further. Finally, the proposed model achieved the maximum accuracy (ACC) of 0.962 on the training dataset and performed on the testing dataset with an ACC of 0.750. The results demonstrated that iASMP had an excellent predictive performance and was suitable for identifying potential ASMPs. Furthermore, we also visualized the selected features and rationally explained the impact of individual features on the model output.
Affiliation(s)
- Yuqiang Wang
- Key Laboratory of Dental Maxillofacial Reconstruction and Biological Intelligence Manufacturing of Gansu Province, School of Stomatology, Lanzhou University, Lanzhou, Gansu, China
- Yihao Xie
- The Institute of Pharmacology, Key Laboratory of Preclinical Study for New Drugs of Gansu Province, School of Basic Medical Sciences, Lanzhou University, Lanzhou, Gansu, China
- Yang Luo
- Key Laboratory of Dental Maxillofacial Reconstruction and Biological Intelligence Manufacturing of Gansu Province, School of Stomatology, Lanzhou University, Lanzhou, Gansu, China
- Pengfei Jia
- The Institute of Pharmacology, Key Laboratory of Preclinical Study for New Drugs of Gansu Province, School of Basic Medical Sciences, Lanzhou University, Lanzhou, Gansu, China
- Jiaqi Wei
- Key Laboratory of Dental Maxillofacial Reconstruction and Biological Intelligence Manufacturing of Gansu Province, School of Stomatology, Lanzhou University, Lanzhou, Gansu, China
- Jie Zhang
- Key Laboratory of Dental Maxillofacial Reconstruction and Biological Intelligence Manufacturing of Gansu Province, School of Stomatology, Lanzhou University, Lanzhou, Gansu, China
- Wenjin Yan
- The Institute of Pharmacology, Key Laboratory of Preclinical Study for New Drugs of Gansu Province, School of Basic Medical Sciences, Lanzhou University, Lanzhou, Gansu, China
- Jinqi Huang
- The Affiliated Hospital of Guangdong Medical University, Zhanjiang, Guangdong, China
14
Yang X, Qiu H, Zhang Y, Zhang P. Quantitative structure-activity relationship study of amide derivatives as xanthine oxidase inhibitors using machine learning. Front Pharmacol 2023; 14:1227536. [PMID: 37456753 PMCID: PMC10339742 DOI: 10.3389/fphar.2023.1227536] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/23/2023] [Accepted: 06/16/2023] [Indexed: 07/18/2023] Open
Abstract
The aim of this study is to predict the inhibitory effect of amide derivatives on xanthine oxidase (XO) by building several models based on the theory of the quantitative structure-activity relationship (QSAR). The heuristic method (HM) was used to linearly select descriptors and build a linear model. XGBoost was used to non-linearly select descriptors, and radial basis kernel function support vector regression (RBF SVR), polynomial kernel function SVR (poly SVR), linear kernel function SVR (linear SVR), mix-kernel function SVR (MIX SVR), and random forest (RF) were adopted to establish non-linear models, among which the MIX SVR method gave the best result. Because the kernel function of MIX SVR combines the linear, radial basis, and polynomial kernel functions, it offers strong learning and generalization abilities at the same time. To test the robustness of the models, leave-one-out cross validation (LOOCV) was adopted. In the training set, R² = 0.97 and RMSE = 0.01; in the test set, R² = 0.95, RMSE = 0.01, and R²cv = 0.96. This result is in line with the experimental expectations, which indicates that the MIX SVR modeling approach has good applications in the study of amide derivatives.
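The mixed-kernel idea in this abstract can be illustrated with a small sketch. It builds one kernel as a weighted sum of a linear, an RBF, and a polynomial kernel. The weights, the `gamma`, `degree`, and `coef0` values, and all function names are assumptions chosen for illustration; the abstract does not say how the authors combined or tuned the kernels.

```python
import math


def linear_kernel(x, y):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma=0.5):
    """Radial basis function kernel; gamma is an assumed value."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def poly_kernel(x, y, degree=2, coef0=1.0):
    """Polynomial kernel; degree and coef0 are assumed values."""
    return (linear_kernel(x, y) + coef0) ** degree


def mixed_kernel(x, y, weights=(0.2, 0.5, 0.3)):
    """Weighted sum of the three kernels (weights are hypothetical)."""
    w_lin, w_rbf, w_poly = weights
    return (
        w_lin * linear_kernel(x, y)
        + w_rbf * rbf_kernel(x, y)
        + w_poly * poly_kernel(x, y)
    )


def gram_matrix(samples, kernel=mixed_kernel):
    """Pairwise kernel values for a list of descriptor vectors."""
    return [[kernel(a, b) for b in samples] for a in samples]
```

A weighted sum of valid kernels with non-negative weights is itself a valid kernel, so a Gram matrix built this way could in principle be passed to any SVR implementation that accepts a precomputed kernel.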
15
Xiang K, Yu H, Du H, Hasan MH, Wei S, Xiang X. Exploring influential factors of CO2 emissions in China's cities using machine learning techniques. Environmental Science and Pollution Research International 2023:10.1007/s11356-023-28285-3. [PMID: 37347332 DOI: 10.1007/s11356-023-28285-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/22/2023] [Accepted: 06/12/2023] [Indexed: 06/23/2023]
Abstract
The precise and exhaustive discernment of factors influencing CO2 emissions underpins the advancement toward sustainable, low-carbon development. Although numerous studies have probed the correlation between predetermined proxy variables and carbon emissions, methodological constraints have often led to an inability to effectively discern carbon emission determinants among numerous potential variables or unravel complex, non-linear relationships, and interaction effects. To redress these research gaps, this research utilized machine learning models to correlate urban CO2 emissions with socioeconomic indicators. The model outputs were then visualized and interpreted using explainable methods. The findings indicated that the model successfully identified a comprehensive array of dominant influences on urban CO2 emissions, principally associated with local fiscal policies, land use, energy consumption, industrial development, and urban transportation. The findings further revealed a complex non-linear association between these factors and urban CO2 emissions; however, the majority of these variables displayed a prevalent propensity to intensify carbon emissions in correspondence with an increase in sample value. Additionally, these factors exhibited a complex interactive influence on urban CO2 emissions, with distinct pairings producing a suppressive effect exclusively at specific combination of sample values. Consequently, this research posited that a robust correlation between urban socioeconomic development and CO2 emissions in China remains to be established. Given the varied impacts of these influencing factors across different cities, a differentiated approach to development should be adopted when charting low-carbon trajectories.
Affiliation(s)
- Kun Xiang
- Department of Civil, Environmental, and Construction Engineering, University of Central Florida, Orlando, FL, 32816, USA.
- Research Center of Machine Learning and Environment Science, China Three Gorges University, Yichang, 443002, China.
- Haofei Yu
- Department of Civil, Environmental, and Construction Engineering, University of Central Florida, Orlando, FL, 32816, USA
- Hao Du
- Research Center of Machine Learning and Environment Science, China Three Gorges University, Yichang, 443002, China
- Md Hasibul Hasan
- Department of Civil, Environmental, and Construction Engineering, University of Central Florida, Orlando, FL, 32816, USA
- Siyi Wei
- Research Center of Machine Learning and Environment Science, China Three Gorges University, Yichang, 443002, China
- Xiangyun Xiang
- Research Center of Machine Learning and Environment Science, China Three Gorges University, Yichang, 443002, China
16
Xiang T, Li T, Li J, Li X, Wang J. Using machine learning to realize genetic site screening and genomic prediction of productive traits in pigs. FASEB J 2023; 37:e22961. [PMID: 37178007 DOI: 10.1096/fj.202300245r] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/11/2023] [Revised: 03/30/2023] [Accepted: 04/25/2023] [Indexed: 05/15/2023]
Abstract
Genomic prediction, which is based on solving linear mixed-model (LMM) equations, is the most popular method for predicting breeding values or phenotypic performance for economic traits in livestock. With the need to further improve the performance of genomic prediction, nonlinear methods have been considered as an alternative and promising approach. The excellent ability to predict phenotypes in animal husbandry has been demonstrated by machine learning (ML) approaches, which have been rapidly developed. To investigate the feasibility and reliability of implementing genomic prediction using nonlinear models, the performances of genomic predictions for pig productive traits using the linear genomic selection model and nonlinear machine learning models were compared. Then, to reduce the high-dimensional features of genome sequence data, different machine learning algorithms, including the random forest (RF), support vector machine (SVM), extreme gradient boosting (XGBoost) and convolutional neural network (CNN) algorithms, were used to perform genomic feature selection as well as genomic prediction on reduced feature genome data. All of the analyses were processed on two real pig datasets: the published PIC pig dataset and a dataset comprising data from a national pig nucleus herd in Chifeng, North China. Overall, the accuracies of predicted phenotypic performance for traits T1, T2, T3 and T5 in the PIC dataset and average daily gain (ADG) in the Chifeng dataset were higher using the ML methods than the LMM method, while those for trait T4 in the PIC dataset and total number of piglets born (TNB) in the Chifeng dataset were slightly lower using the ML methods than the LMM method. Among all the different ML algorithms, SVM was the most appropriate for genomic prediction. For the genomic feature selection experiment, the most stable and most accurate results across different algorithms were achieved using XGBoost in combination with the SVM algorithm. 
Through feature selection, the number of genomic markers can be reduced to 1 in 20, while the predictive performance on some traits can even be improved compared to using the full genome data. Finally, we developed a new tool that can be used to execute combined XGBoost and SVM algorithms to realize genomic feature selection and phenotypic prediction.
Affiliation(s)
- Tao Xiang
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education & Key Laboratory of Swine Genetics and Breeding of Ministry of Agriculture, Huazhong Agricultural University, Wuhan, China
- Tao Li
- College of Informatics, Huazhong Agricultural University, Wuhan, China
- Key Laboratory of Smart Farming for Agricultural Animals, Huazhong Agricultural University, Wuhan, China
- Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, China
- Jielin Li
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education & Key Laboratory of Swine Genetics and Breeding of Ministry of Agriculture, Huazhong Agricultural University, Wuhan, China
- Xin Li
- College of Informatics, Huazhong Agricultural University, Wuhan, China
- Key Laboratory of Smart Farming for Agricultural Animals, Huazhong Agricultural University, Wuhan, China
- Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, China
- Jia Wang
- College of Informatics, Huazhong Agricultural University, Wuhan, China
- Key Laboratory of Smart Farming for Agricultural Animals, Huazhong Agricultural University, Wuhan, China
- Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, China
17
Zhou H, Xin Y, Li S. A diabetes prediction model based on Boruta feature selection and ensemble learning. BMC Bioinformatics 2023; 24:224. [PMID: 37264332 DOI: 10.1186/s12859-023-05300-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/15/2022] [Accepted: 04/21/2023] [Indexed: 06/03/2023] Open
Abstract
BACKGROUND AND OBJECTIVE: As a common chronic disease, diabetes is called the "second killer" among modern diseases. Currently, there is no medical cure for diabetes, and medication can only serve as auxiliary treatment. However, many diabetic patients still die each year. In addition, a considerable number of people do not pay attention to their physical health or opt out of treatment due to lack of money, which eventually leads to various complications. Therefore, diagnosing diabetes at an early stage and intervening early is necessary, and developing an early detection method for diabetes is essential. METHODS: In this study, a diabetes prediction model based on Boruta feature selection and ensemble learning is proposed. The model uses Boruta feature selection to extract salient features from the dataset, the K-Means++ algorithm for unsupervised clustering of the data, and a stacking ensemble learning method for classification. It has been validated on a diabetes dataset. RESULTS: The experiments were performed on the PIMA Indian diabetes dataset. The model was evaluated by accuracy, precision and F1 index. The obtained results show that the accuracy rate of the model reaches 98%. CONCLUSION: Compared with other diabetes prediction models, this model achieved better results, indicating that it has better performance in diabetes prediction.
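For readers unfamiliar with the three evaluation measures named in the abstract, the following sketch shows how accuracy, precision, and F1 can be computed for a binary classifier. The function names and the toy labels are hypothetical and are not taken from the paper or from the PIMA dataset.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 as a dictionary."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

On a class-imbalanced medical dataset, accuracy alone can look high even when many positive cases are missed, which is why precision and F1 are usually reported alongside it.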
Affiliation(s)
- Hongfang Zhou
- School of Computer Science and Engineering, Xi'an University of Technology, Xi'an, 710048, China.
- Shaanxi Key Laboratory of Network Computing and Security Technology, Xi'an, 710048, China.
- Yinbo Xin
- School of Computer Science and Engineering, Xi'an University of Technology, Xi'an, 710048, China
- Suli Li
- School of Computer Science and Engineering, Xi'an University of Technology, Xi'an, 710048, China
18
Mazumdar B, Deva Sarma PK, Mahanta HJ, Sastry GN. Machine learning based dynamic consensus model for predicting blood-brain barrier permeability. Comput Biol Med 2023; 160:106984. [PMID: 37137267 DOI: 10.1016/j.compbiomed.2023.106984] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/08/2022] [Revised: 03/27/2023] [Accepted: 04/27/2023] [Indexed: 05/05/2023]
Abstract
The blood-brain barrier (BBB) is an important defence mechanism that prevents disease-causing pathogens and toxins from entering the brain from the bloodstream. In recent years, many in silico methods have been proposed for predicting BBB permeability; however, the reliability of these models is questionable because of small, class-imbalanced datasets, which lead to a very high false positive rate. In this study, machine learning and deep learning-based predictive models were built using XGBoost, Random Forest, Extra-tree classifiers and a deep neural network. A dataset of 8153 compounds comprising both BBB permeable and BBB non-permeable compounds was curated and subjected to calculations of molecular descriptors and fingerprints to generate the features for the machine learning and deep learning models. Three balancing techniques were then applied to the dataset to address the class-imbalance issue. A comprehensive comparison among the models showed that the deep neural network model generated on the balanced MACCS fingerprint dataset outperformed all the other models, with an accuracy of 97.8% and a ROC-AUC score of 0.98. Additionally, a dynamic consensus model was prepared with the machine learning models and validated with a benchmark dataset for predicting BBB permeability with higher confidence scores.
Affiliation(s)
- Bitopan Mazumdar
- Department of Computer Science, Assam University, Silchar, 788011, Assam, India; Advanced Computation and Data Sciences Division, CSIR- North East Institute of Science and Technology, Jorhat, 785006, Assam, India
- Hridoy Jyoti Mahanta
- Advanced Computation and Data Sciences Division, CSIR- North East Institute of Science and Technology, Jorhat, 785006, Assam, India; Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, 201002, Uttar Pradesh, India.
- G Narahari Sastry
- Advanced Computation and Data Sciences Division, CSIR- North East Institute of Science and Technology, Jorhat, 785006, Assam, India; Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, 201002, Uttar Pradesh, India
19
Wang M, Yan L, Jia J, Lai J, Zhou H, Yu B. DE-MHAIPs: Identification of SARS-CoV-2 phosphorylation sites based on differential evolution multi-feature learning and multi-head attention mechanism. Comput Biol Med 2023; 160:106935. [PMID: 37120990 PMCID: PMC10140648 DOI: 10.1016/j.compbiomed.2023.106935] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/20/2023] [Revised: 03/12/2023] [Accepted: 04/13/2023] [Indexed: 05/02/2023]
Abstract
The rapid spread of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) affects the normal lives of people all over the world. Computational methods can be used to accurately identify SARS-CoV-2 phosphorylation sites. In this paper, a new prediction model of SARS-CoV-2 phosphorylation sites, called DE-MHAIPs, is proposed. First, we use six feature extraction methods to extract protein sequence information from different perspectives. For the first time, we use a differential evolution (DE) algorithm to learn individual feature weights and fuse multi-information in a weighted combination. Next, Group LASSO is used to select a subset of good features. Then, the important protein information is given higher weight through multi-head attention. After that, the processed data are fed into a long short-term memory (LSTM) network to further enhance the model's ability to learn features. Finally, the output of the LSTM is fed into a fully connected neural network (FCN) to predict SARS-CoV-2 phosphorylation sites. The AUC values of the S/T and Y datasets under 5-fold cross-validation reach 91.98% and 98.32%, respectively. The AUC values of the two datasets on the independent test set reach 91.72% and 97.78%, respectively. The experimental results show that the DE-MHAIPs method exhibits excellent predictive ability compared with other methods.
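The weighted-fusion step described in this abstract can be sketched as follows. This is a generic DE/rand/1/bin loop that searches for one weight per feature group and then scales and concatenates the groups. The objective `toy_loss`, the target weights inside it, and every parameter value are assumptions standing in for whatever validation loss the authors actually optimized; the abstract does not give those details.

```python
import random


def differential_evolution(objective, dim, pop_size=20, generations=200,
                           f=0.5, cr=0.9, seed=0):
    """Minimize `objective` over [0, 1]^dim with a basic DE/rand/1/bin loop."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(dim)] for _ in range(pop_size)]
    scores = [objective(ind) for ind in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct individuals other than the current one.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            forced = rng.randrange(dim)  # at least one mutated coordinate
            trial = []
            for j in range(dim):
                if rng.random() < cr or j == forced:
                    value = pop[a][j] + f * (pop[b][j] - pop[c][j])
                    value = min(1.0, max(0.0, value))  # keep weight in bounds
                else:
                    value = pop[i][j]
                trial.append(value)
            trial_score = objective(trial)
            if trial_score <= scores[i]:  # greedy selection
                pop[i], scores[i] = trial, trial_score
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]


def fuse(feature_groups, weights):
    """Scale each feature group by its weight and concatenate the results."""
    fused = []
    for group, weight in zip(feature_groups, weights):
        fused.extend(weight * value for value in group)
    return fused


def toy_loss(weights):
    """Hypothetical stand-in for a validation loss (not from the paper)."""
    target = (0.2, 0.5, 0.3)
    return sum((w - t) ** 2 for w, t in zip(weights, target))
```

In a real pipeline the objective would train or score a downstream model on the fused features, which makes each evaluation far more expensive than this toy function.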
Affiliation(s)
- Minghui Wang
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China
- Lu Yan
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China
- Jihua Jia
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China
- Jiali Lai
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China
- Hongyan Zhou
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China.
- Bin Yu
- College of Information Science and Technology, School of Data Science, Qingdao University of Science and Technology, Qingdao, 266061, China; School of Data Science, University of Science and Technology of China, Hefei, 230027, China.
20
MSINGB: A Novel Computational Method Based on NGBoost for Identifying Microsatellite Instability Status from Tumor Mutation Annotation Data. Interdiscip Sci 2023; 15:100-110. [PMID: 36350503 DOI: 10.1007/s12539-022-00544-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 04/05/2022] [Revised: 10/19/2022] [Accepted: 10/22/2022] [Indexed: 11/11/2022]
Abstract
Microsatellite instability (MSI), a vital mutator phenotype caused by DNA mismatch repair deficiency, is frequently observed in several tumors. MSI is recognized as a critical molecular biomarker for diagnosis, prognosis, and therapeutic selection in several cancers. Identifying MSI status for current gold standard methods based on experimental analysis is laborious, time-consuming, and costly. Although several computational methods based on machine learning have been proposed to identify MSI status, we need to further understand which machine learning model would favor identification for MSI and which feature subset is strongly related to MSI. On this basis, more effective machine learning-based methods can be developed to improve the performance of MSI status identification. In this work, we present MSINGB, an NGBoost-based method for identifying MSI status from tumor somatic mutation annotation data. MSINGB first evaluates the prediction performance of 11 popular machine learning algorithms and 9 deep learning models to identify MSI. Among 20 models, NGBoost, a novel natural gradient boosting method, achieves the overall best performance. MSINGB then introduces two feature selection strategies to find the compact feature subset, which is strongly related to MSI, and employs the SHAP approach to interpreting how selected features impact the model prediction. MSINGB achieves a better prediction performance on both the tenfold cross-validation test and independent test compared with state-of-the-art methods.
21
Hamdy W, Ismail A, Awad WA, Ibrahim AH, Hassanien AE. An Optimized Ensemble Deep Learning Model for Predicting Plant miRNA-IncRNA Based on Artificial Gorilla Troops Algorithm. Sensors (Basel, Switzerland) 2023; 23:2219. [PMID: 36850816 PMCID: PMC9964106 DOI: 10.3390/s23042219] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/21/2023] [Revised: 02/11/2023] [Accepted: 02/14/2023] [Indexed: 06/18/2023]
Abstract
MicroRNAs (miRNAs) are small, non-coding regulatory molecules whose effective alteration might result in abnormal gene manifestation in the downstream pathway of their target. miRNA gene variants can impact miRNA transcription, maturation, or target selectivity, impairing their usefulness in plant growth and stress responses. Simple Sequence Repeat (SSR) based on miRNA is a newly introduced functional marker that has recently been used in plant breeding. MicroRNA and long non-coding RNA (lncRNA) are two examples of non-coding RNA (ncRNA) that play a vital role in controlling the biological processes of animals and plants. According to recent studies, the major objective for decoding their functional activities is predicting the relationship between lncRNA and miRNA. The prediction accuracy and reliability of traditional feature-based classification systems are frequently harmed by small data size, the limits of human factors, and a large amount of noise. This paper proposes an optimized deep learning model built with Independently Recurrent Neural Networks (IndRNNs) and Convolutional Neural Networks (CNNs) to predict the interaction in plants between lncRNA and miRNA. The deep learning ensemble model automatically investigates the function characteristics of genetic sequences. The proposed model's main advantage is the enhanced accuracy in plant miRNA-lncRNA prediction due to optimal hyperparameter tuning, which is performed by the artificial Gorilla Troops Algorithm and the proposed intelligent preying algorithm. IndRNN is adapted to derive the representation of learned sequence dependencies and sequence features by overcoming the inaccuracies of natural factors in traditional feature architecture. According to the experimental findings, the suggested model outperforms the current deep learning model and shallow machine learning on large-scale data, notably for extended sequences, with the proposed method reaching an accuracy of 97.7%.
Affiliation(s)
- Walid Hamdy
- Faculty of Science, Port Said University, Port Said 42511, Egypt
- Amr Ismail
- Faculty of Science, Port Said University, Port Said 42511, Egypt
- Wael A. Awad
- Faculty of Computers and Artificial Intelligence, Damietta University, El-Gadeeda 34519, Egypt
- Ali H. Ibrahim
- Faculty of Science, Port Said University, Port Said 42511, Egypt
- Aboul Ella Hassanien
- Faculty of Computers and Artificial Intelligence, Cairo University, Giza 12613, Egypt
22
Albu AI, Bocicor MI, Czibula G. MM-StackEns: A new deep multimodal stacked generalization approach for protein-protein interaction prediction. Comput Biol Med 2023; 153:106526. [PMID: 36623437 DOI: 10.1016/j.compbiomed.2022.106526] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 09/27/2022] [Revised: 12/13/2022] [Accepted: 12/31/2022] [Indexed: 01/05/2023]
Abstract
Accurate in-silico identification of protein-protein interactions (PPIs) is a long-standing problem in biology, with important implications in protein function prediction and drug design. Current computational approaches predominantly use a single data modality for describing protein pairs, which may not fully capture the characteristics relevant for identifying PPIs. Another limitation of existing methods is their poor generalization to proteins outside the training graph. In this paper, we aim to address these shortcomings by proposing a new ensemble approach for PPI prediction, which learns information from two modalities, corresponding to pairs of sequences and to the graph formed by the training proteins and their interactions. Our approach uses a siamese neural network to process sequence information, while graph attention networks are employed for the network view. For capturing the relationships between the proteins in a pair, we design a new feature fusion module, based on computing the distance between the distributions corresponding to the two proteins. The prediction is made using a stacked generalization procedure, in which the final classifier is represented by a Logistic Regression model trained on the scores predicted by the sequence and graph models. Additionally, we show that protein sequence embeddings obtained using pretrained language models can significantly improve the generalization of PPI methods. The experimental results demonstrate the good performance of our approach, which surpasses all the related work on two Yeast data sets, while outperforming the majority of literature approaches on two Human data sets and on independent multi-species data sets.
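The stacked generalization step in this abstract, where a logistic regression meta-classifier is trained on the scores of a sequence model and a graph model, can be sketched in a few lines. The training loop below is a plain gradient-descent logistic regression, and the score pairs in the usage example are invented for illustration; neither the hyperparameters nor the data come from the paper.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(rows, labels, lr=1.0, epochs=5000):
    """Fit logistic regression by batch gradient descent on log loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = sigmoid(bias + sum(w * v for w, v in zip(weights, x))) - y
            for j, v in enumerate(x):
                grad_w[j] += error * v
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict_proba(weights, bias, x):
    """Probability that a protein pair interacts, given its base scores."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


if __name__ == "__main__":
    # Hypothetical (sequence_score, graph_score) pairs from two base models.
    base_scores = [(0.9, 0.8), (0.8, 0.6), (0.7, 0.9),
                   (0.2, 0.3), (0.3, 0.1), (0.1, 0.4)]
    interacts = [1, 1, 1, 0, 0, 0]
    w, b = fit_logistic(base_scores, interacts)
    print(round(predict_proba(w, b, (0.85, 0.75)), 3))
```

In practice the meta-classifier is usually fitted on out-of-fold base-model scores rather than on scores for data the base models were trained on, so that it does not learn from overfitted inputs.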
Affiliation(s)
- Alexandra-Ioana Albu
- Department of Computer Science, Babeş-Bolyai University, 1 Mihail Kogalniceanu Street, Cluj-Napoca, 400084, Romania.
- Maria-Iuliana Bocicor
- Department of Computer Science, Babeş-Bolyai University, 1 Mihail Kogalniceanu Street, Cluj-Napoca, 400084, Romania.
- Gabriela Czibula
- Department of Computer Science, Babeş-Bolyai University, 1 Mihail Kogalniceanu Street, Cluj-Napoca, 400084, Romania.
23
Li X, Han P, Chen W, Gao C, Wang S, Song T, Niu M, Rodriguez-Patón A. MARPPI: boosting prediction of protein-protein interactions with multi-scale architecture residual network. Brief Bioinform 2023; 24:6887309. [PMID: 36502435 DOI: 10.1093/bib/bbac524] [Citation(s) in RCA: 14] [Impact Index Per Article: 14.0] [Received: 07/04/2022] [Revised: 09/29/2022] [Accepted: 11/04/2022] [Indexed: 12/14/2022] Open
Abstract
Protein-protein interactions (PPIs) are a major component of the cellular biochemical reaction network. Rich sequence information and machine learning techniques reduce the dependence of exploring PPIs on wet experiments, which are costly and time-consuming. This paper proposes a PPI prediction model, multi-scale architecture residual network for PPIs (MARPPI), based on dual-channel and multi-feature. Multi-feature leverages Res2vec to obtain the association information between residues, and utilizes pseudo amino acid composition, autocorrelation descriptors and multivariate mutual information to achieve the amino acid composition and order information, physicochemical properties and information entropy, respectively. Dual channel utilizes multi-scale architecture improved ResNet network which extracts protein sequence features to reduce protein feature loss. Compared with other advanced methods, MARPPI achieves 96.03%, 99.01% and 91.80% accuracy in the intraspecific datasets of Saccharomyces cerevisiae, Human and Helicobacter pylori, respectively. The accuracy on the two interspecific datasets of Human-Bacillus anthracis and Human-Yersinia pestis is 97.29%, and 95.30%, respectively. In addition, results on specific datasets of disease (neurodegenerative and metabolic disorders) demonstrate the ability to detect hidden interactions. To better illustrate the performance of MARPPI, evaluations on independent datasets and PPIs network suggest that MARPPI can be used to predict cross-species interactions. The above shows that MARPPI can be regarded as a concise, efficient and accurate tool for PPI datasets.
Affiliation(s)
- Xue Li
- School of Computer Science and Technology, China University of Petroleum, Qingdao 266580, China
- Peifu Han
- School of Computer Science and Technology, China University of Petroleum, Qingdao 266580, China
- Wenqi Chen
- School of Computer Science and Technology, China University of Petroleum, Qingdao 266580, China
- Changnan Gao
- School of Computer Science and Technology, China University of Petroleum, Qingdao 266580, China
- Shuang Wang
- School of Computer Science and Technology, China University of Petroleum, Qingdao 266580, China
- Tao Song
- School of Computer Science and Technology, China University of Petroleum, Qingdao 266580, China
- Muyuan Niu
- School of Computer Science and Technology, China University of Petroleum, Qingdao 266580, China
- Alfonso Rodriguez-Patón
- School of Computer Science and Technology, China University of Petroleum, Qingdao 266580, China
24
Choi H, Kim T, Kim SJ, Sa BG, Ryu IH, Lee IS, Kim JK, Han E, Kim HK, Yoo TK. Predicting Postoperative Anterior Chamber Angle for Phakic Intraocular Lens Implantation Using Preoperative Anterior Segment Metrics. Transl Vis Sci Technol 2023; 12:10. [PMID: 36607625 PMCID: PMC9836008 DOI: 10.1167/tvst.12.1.10] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Indexed: 01/07/2023] Open
Abstract
Purpose: The anterior chamber angle (ACA) is a critical factor in posterior chamber phakic intraocular lens (EVO Implantable Collamer Lens [ICL]) implantation. Herein, we predicted postoperative ACAs to select the optimal ICL size to reduce narrow ACA-related complications. Methods: Regression models were constructed using pre-operative anterior segment optical coherence tomography metrics to predict postoperative ACAs, including trabecular-iris angles (TIAs) and scleral-spur angles (SSAs) at 500 µm and 750 µm from the scleral spur (TIA500, TIA750, SSA500, and SSA750). Data from three expert surgeons were assigned to the development (N = 430 eyes) and internal validation (N = 108 eyes) datasets. Additionally, data from a novice surgeon (N = 42 eyes) were used for external validation. Results: Postoperative ACAs were highly predictable using the machine-learning (ML) technique (extreme gradient boosting regression [XGBoost]), with mean absolute errors (MAEs) of 4.42 degrees, 3.77 degrees, 5.25 degrees, and 4.30 degrees for TIA500, TIA750, SSA500, and SSA750, respectively, in internal validation. External validation also showed MAEs of 3.93 degrees, 3.86 degrees, 5.02 degrees, and 4.74 degrees for TIA500, TIA750, SSA500, and SSA750, respectively. Linear regression using the pre-operative anterior chamber depth, anterior chamber width, crystalline lens rise, TIA, and ICL size also exhibited good performance, with no significant difference compared with XGBoost in the validation sets. Conclusions: We developed linear regression and ML models to predict postoperative ACAs for ICL surgery using pre-operative anterior segment metrics. These will prevent surgeons from overlooking the risks associated with the narrowing of the ACA. Translational Relevance: Using the proposed algorithms, surgeons can consider the postoperative ACAs to increase surgical accuracy and safety.
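The mean absolute error (MAE) reported above is the average of the absolute differences between predicted and measured angles. A minimal sketch follows; the function name and the example angle values are hypothetical and are not data from the study.

```python
def mean_absolute_error(measured, predicted):
    """Average absolute difference between paired values, in input units."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(m - p) for m, p in zip(measured, predicted)) / len(measured)


if __name__ == "__main__":
    # Hypothetical postoperative angles in degrees (not data from the study).
    measured_angles = [35.0, 40.0, 28.0, 45.0]
    predicted_angles = [38.0, 36.0, 30.0, 44.0]
    print(mean_absolute_error(measured_angles, predicted_angles))
```

Because MAE is expressed in the same unit as the target, an MAE of about 4 degrees can be read directly as the typical size of the prediction error for an angle.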
Affiliation(s)
- Hannuy Choi
  - Department of Refractive Surgery, B&VIIT Eye Center, Seoul, South Korea
- Taein Kim
  - Research and Development Department, VISUWORKS, Seoul, South Korea
- Su Jeong Kim
  - Research and Development Department, VISUWORKS, Seoul, South Korea
- Beom Gi Sa
  - Research and Development Department, VISUWORKS, Seoul, South Korea
- Ik Hee Ryu
  - Department of Refractive Surgery, B&VIIT Eye Center, Seoul, South Korea
  - Research and Development Department, VISUWORKS, Seoul, South Korea
- In Sik Lee
  - Department of Refractive Surgery, B&VIIT Eye Center, Seoul, South Korea
- Jin Kuk Kim
  - Department of Refractive Surgery, B&VIIT Eye Center, Seoul, South Korea
- Eoksoo Han
  - Electronics and Telecommunications Research Institute (ETRI), Daejeon, South Korea
- Hong Kyu Kim
  - Department of Ophthalmology, Dankook University Hospital, Dankook University College of Medicine, Cheonan, South Korea
- Tae Keun Yoo
  - Department of Refractive Surgery, B&VIIT Eye Center, Seoul, South Korea
  - Research and Development Department, VISUWORKS, Seoul, South Korea
|
25
|
Zou Z, Wu Q, Wang J, Xu L, Zhou M, Lu Z, He Y, Wang Y, Liu B, Zhao Y. Research on non-destructive testing of hotpot oil quality by fluorescence hyperspectral technology combined with machine learning. Spectrochimica Acta. Part A, Molecular and Biomolecular Spectroscopy 2023; 284:121785. [PMID: 36058172 DOI: 10.1016/j.saa.2022.121785] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 05/04/2022] [Revised: 08/21/2022] [Accepted: 08/23/2022] [Indexed: 06/15/2023]
Abstract
Eating repeatedly used hotpot oil can seriously harm human health. To enable rapid non-destructive testing of hotpot oil quality, a modeling method combining fluorescence hyperspectral technology with machine learning algorithms was proposed. Five preprocessing algorithms were applied to the original spectral data to denoise it and reduce the influence of baseline drift and tilt. The feature bands extracted from the spectral data showed that the best feature bands for the two-classification model and the six-classification model were concentrated between 469 and 962 nm and between 534 and 809 nm, respectively. Visualizing the spectral data with the PCA algorithm showed the distribution of the six types of samples intuitively and indicated that the data could be classified. Modeling on the feature bands showed that the best two-classification and six-classification models were the MF-RF-RF and MF-XGBoost-LGB models, respectively, and the classification accuracy reached 100%. Compared with the traditional model, the error was greatly reduced and calculation time was saved. This study confirmed that fluorescence hyperspectral technology combined with machine learning algorithms can effectively detect reused hotpot oil.
Affiliation(s)
- Zhiyong Zou
  - College of Mechanical and Electrical Engineering, Sichuan Agricultural University, Xin Kang Road, Yucheng District, Ya'an 625014, PR China
- Qingsong Wu
  - College of Mechanical and Electrical Engineering, Sichuan Agricultural University, Xin Kang Road, Yucheng District, Ya'an 625014, PR China
- Jian Wang
  - College of Mechanical and Electrical Engineering, Sichuan Agricultural University, Xin Kang Road, Yucheng District, Ya'an 625014, PR China
- Lijia Xu
  - College of Mechanical and Electrical Engineering, Sichuan Agricultural University, Xin Kang Road, Yucheng District, Ya'an 625014, PR China
- Man Zhou
  - College of Food Sciences, Sichuan Agricultural University, Xin Kang Road, Yucheng District, Ya'an 625014, PR China
- Zhiwei Lu
  - College of Science, Sichuan Agricultural University, Xin Kang Road, Yucheng District, Ya'an 625014, PR China
- Yong He
  - College of Biosystems Engineering and Food Science, Zhejiang University, 866, Yuhangtang Road, Hangzhou 310058, PR China
- Yuchao Wang
  - College of Mechanical and Electrical Engineering, Sichuan Agricultural University, Xin Kang Road, Yucheng District, Ya'an 625014, PR China
- Bi Liu
  - College of Mechanical and Electrical Engineering, Sichuan Agricultural University, Xin Kang Road, Yucheng District, Ya'an 625014, PR China
- Yongpeng Zhao
  - College of Mechanical and Electrical Engineering, Sichuan Agricultural University, Xin Kang Road, Yucheng District, Ya'an 625014, PR China
|
26
|
Ensemble Learning of Multiple Models Using Deep Learning for Multiclass Classification of Ultrasound Images of Hepatic Masses. Bioengineering (Basel, Switzerland) 2023; 10:69. [PMID: 36671641 PMCID: PMC9854883 DOI: 10.3390/bioengineering10010069] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/24/2022] [Revised: 12/29/2022] [Accepted: 01/03/2023] [Indexed: 01/06/2023]
Abstract
Ultrasound (US) is often used to diagnose liver masses. Ensemble learning has recently become common in image classification, but its detailed methods are not fully optimized. The purpose of this study was to investigate and compare several ensemble learning and ensemble pruning techniques using multiple trained convolutional neural network (CNN) models for image classification of liver masses in US images. The dataset of US images was classified into four categories: benign liver tumor (BLT), 6320 images; liver cyst (LCY), 2320 images; metastatic liver cancer (MLC), 9720 images; and primary liver cancer (PLC), 7840 images. In this study, 250 test images were randomly selected for each class, for a total of 1000 images, and the remaining images were used for training. Sixteen different CNNs were trained and tested on the ultrasound images. The ensemble learning methods were soft voting (SV), weighted average voting (WAV), weighted hard voting (WHV), and stacking (ST). All four types of ensemble learning (SV, ST, WAV, and WHV) showed higher accuracy than a single CNN. All four types also showed significantly higher deep learning (DL) performance than ResNeXt101 alone. For image classification of liver masses using US images, ensemble learning improved the performance of DL over a single CNN.
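As an illustration of two of the voting schemes named in this abstract, the following is a minimal sketch of soft voting and weighted hard voting over per-class probabilities. The probability values, the three-model setup, and the weights (imagined here as validation accuracies) are hypothetical and are not taken from the study; the function names are likewise illustrative.

```python
from collections import defaultdict

def soft_vote(prob_lists):
    """Average per-class probabilities across models; return the winning class index."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    mean_probs = [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: mean_probs[c])

def weighted_hard_vote(prob_lists, weights):
    """Each model votes for its top class; votes are summed using per-model weights."""
    tally = defaultdict(float)
    for probs, weight in zip(prob_lists, weights):
        top_class = max(range(len(probs)), key=lambda c: probs[c])
        tally[top_class] += weight
    return max(tally, key=tally.get)

CLASSES = ["BLT", "LCY", "MLC", "PLC"]

# Hypothetical softmax outputs of three CNNs for a single image (assumed values).
probs = [
    [0.40, 0.10, 0.35, 0.15],
    [0.20, 0.05, 0.60, 0.15],
    [0.45, 0.05, 0.30, 0.20],
]
# Hypothetical per-model weights, e.g. validation accuracies (assumed values).
weights = [0.80, 0.90, 0.85]

soft_label = CLASSES[soft_vote(probs)]
hard_label = CLASSES[weighted_hard_vote(probs, weights)]
```

With these made-up numbers the two schemes disagree: soft voting follows the one highly confident model, while weighted hard voting follows the two models that agree on their top class.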
|
27
|
Ibrahim AH, Karabulut OC, Karpuzcu BA, Türk E, Süzek BE. A correlation coefficient-based feature selection approach for virus-host protein-protein interaction prediction. PLoS One 2023; 18:e0285168. [PMID: 37130110 PMCID: PMC10153705 DOI: 10.1371/journal.pone.0285168] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/12/2022] [Accepted: 04/17/2023] [Indexed: 05/03/2023] Open
Abstract
Prediction of virus-host protein-protein interactions (PPI) is a broad research area where various machine-learning-based classifiers are developed. Transforming biological data into machine-usable features is a preliminary step in constructing these virus-host PPI prediction tools. In this study, we have adopted a virus-host PPI dataset and a reduced amino acid alphabet to create tripeptide features and introduced a correlation coefficient-based feature selection. We applied feature selection across several correlation coefficient metrics and statistically tested their relevance in a structural context. We compared the performance of feature-selection models against that of the baseline virus-host PPI prediction models created using different classification algorithms without the feature selection. We also tested the performance of these baseline models against the previously available tools to ensure their predictive power is acceptable. Here, the Pearson coefficient provides the best performance with respect to the baseline model as measured by AUPR, with a drop of 0.003 in AUPR while achieving a 73.3% (from 686 to 183) reduction in the number of tripeptide features for random forest. The results suggest our correlation coefficient-based feature selection approach, while decreasing the computation time and space complexity, has a limited impact on the prediction performance of virus-host PPI prediction tools.
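The abstract does not spell out how the correlation coefficient is applied, so the following is only a minimal sketch under one common assumption: each feature column is scored by the absolute Pearson correlation with the binary interaction label, and columns below a threshold are dropped. The threshold, the toy data, and the function names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return cov / (norm_x * norm_y)

def select_features(X, y, threshold):
    """Return indices of feature columns whose |r| with the label is >= threshold."""
    keep = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        if abs(pearson(column, y)) >= threshold:
            keep.append(j)
    return keep

# Toy data (assumed): 6 samples, 3 features, binary labels.
X = [
    [0.9, 0.5, 0.1],
    [0.8, 0.5, 0.6],
    [0.7, 0.5, 0.3],
    [0.2, 0.5, 0.2],
    [0.1, 0.5, 0.5],
    [0.3, 0.5, 0.4],
]
y = [1, 1, 1, 0, 0, 0]

selected = select_features(X, y, threshold=0.3)
```

In this toy example only the first column survives: the second is constant and the third is nearly uncorrelated with the label.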
Affiliation(s)
- Ahmed Hassan Ibrahim
  - Bioinformatics Graduate Program, Graduate School of Natural and Applied Sciences, Muğla Sıtkı Koçman University, Muğla, Turkey
- Onur Can Karabulut
  - Bioinformatics Graduate Program, Graduate School of Natural and Applied Sciences, Muğla Sıtkı Koçman University, Muğla, Turkey
- Betül Asiye Karpuzcu
  - Bioinformatics Graduate Program, Graduate School of Natural and Applied Sciences, Muğla Sıtkı Koçman University, Muğla, Turkey
- Erdem Türk
  - Bioinformatics Graduate Program, Graduate School of Natural and Applied Sciences, Muğla Sıtkı Koçman University, Muğla, Turkey
  - Department of Computer Engineering, Faculty of Engineering, Muğla Sıtkı Koçman University, Muğla, Turkey
- Barış Ethem Süzek
  - Bioinformatics Graduate Program, Graduate School of Natural and Applied Sciences, Muğla Sıtkı Koçman University, Muğla, Turkey
  - Department of Computer Engineering, Faculty of Engineering, Muğla Sıtkı Koçman University, Muğla, Turkey
  - Georgetown University Medical Center, Biochemistry and Molecular & Cellular Biology, Washington DC, United States of America
|
28
|
Yue ZX, Yan TC, Xu HQ, Liu YH, Hong YF, Chen GX, Xie T, Tao L. A systematic review on the state-of-the-art strategies for protein representation. Comput Biol Med 2023; 152:106440. [PMID: 36543002 DOI: 10.1016/j.compbiomed.2022.106440] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 11/15/2022] [Revised: 12/08/2022] [Accepted: 12/15/2022] [Indexed: 12/23/2022]
Abstract
The study of drug-target protein interaction is a key step in drug research. In recent years, machine learning techniques have become attractive for research, including drug research, due to their automated nature, predictive power, and expected efficiency. Protein representation is a key step in the study of drug-target protein interaction by machine learning and plays a fundamental role in achieving accurate results. With the progress of machine learning, protein representation methods have gradually attracted attention and have consequently developed rapidly. Therefore, in this review, we systematically classify current protein representation methods, comprehensively review them, and discuss the latest advances of interest. According to the information extraction methods and information sources, these representation methods are generally divided into structure-based and sequence-based representation methods. Each primary class can be further divided into specific subcategories. The particular representation methods covered include both traditional and the latest approaches. This review contains a comprehensive assessment of the various methods, which researchers can use as a reference for their specific protein-related research requirements, including drug research.
Affiliation(s)
- Zi-Xuan Yue
  - Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
- Tian-Ci Yan
  - Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
- Hong-Quan Xu
  - Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
- Yu-Hong Liu
  - Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
- Yan-Feng Hong
  - Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
- Gong-Xing Chen
  - Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
- Tian Xie
  - Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
- Lin Tao
  - Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
|
29
|
Karpuzcu BA, Türk E, Ibrahim AH, Karabulut OC, Süzek BE. Machine Learning Methods for Virus-Host Protein-Protein Interaction Prediction. Methods Mol Biol 2023; 2690:401-417. [PMID: 37450162 DOI: 10.1007/978-1-0716-3327-4_31] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/18/2023]
Abstract
The attachment of a virion to a respective cellular receptor on the host organism, occurring through virus-host protein-protein interactions (PPIs), is a decisive step for viral pathogenicity and infectivity. Therefore, a vast number of wet-lab experimental techniques are used to study virus-host PPIs. Given the great number and enormous variety of virus-host PPIs and the cost and labor of laboratory work, however, computational approaches for analyzing the available interaction data and predicting previously unidentified interactions have been on the rise. Among them, machine-learning-based models are receiving increasing attention, with a great body of resources and tools proposed recently. In this chapter, we first provide the methodology with major steps toward the development of a virus-host PPI prediction tool. Next, we discuss the challenges involved and evaluate several existing machine-learning-based virus-host PPI prediction tools. Finally, we describe our experience with several ensemble techniques as utilized on available prediction results retrieved from individual PPI prediction tools. Overall, based on our experience, we recognize there is still room for the development of new individual and/or ensemble virus-host PPI prediction tools that leverage existing tools.
Affiliation(s)
- Betül Asiye Karpuzcu
  - Bioinformatics Graduate Program, Graduate School of Natural and Applied Sciences, Muğla Sıtkı Koçman University, Muğla, Turkey
- Erdem Türk
  - Bioinformatics Graduate Program, Graduate School of Natural and Applied Sciences, Muğla Sıtkı Koçman University, Muğla, Turkey
  - Department of Computer Engineering, Faculty of Engineering, Muğla Sıtkı Koçman University, Muğla, Turkey
- Ahmad Hassan Ibrahim
  - Bioinformatics Graduate Program, Graduate School of Natural and Applied Sciences, Muğla Sıtkı Koçman University, Muğla, Turkey
- Onur Can Karabulut
  - Bioinformatics Graduate Program, Graduate School of Natural and Applied Sciences, Muğla Sıtkı Koçman University, Muğla, Turkey
- Barış Ethem Süzek
  - Bioinformatics Graduate Program, Graduate School of Natural and Applied Sciences, Muğla Sıtkı Koçman University, Muğla, Turkey
  - Department of Computer Engineering, Faculty of Engineering, Muğla Sıtkı Koçman University, Muğla, Turkey
|
30
|
Sun Y. A systematic pan-cancer analysis reveals the clinical prognosis and immunotherapy value of C-X3-C motif ligand 1 (CX3CL1). Front Genet 2023; 14:1183795. [PMID: 37153002 PMCID: PMC10157490 DOI: 10.3389/fgene.2023.1183795] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/10/2023] [Accepted: 04/10/2023] [Indexed: 05/09/2023] Open
Abstract
It is now widely known that C-X3-C motif ligand 1 (CX3CL1) plays an essential part in regulating the migration of pro-inflammatory cells across a wide range of inflammatory disorders, including a number of malignancies. However, there has been no comprehensive study on the correlation between CX3CL1 and cancers on the basis of clinical features. In order to investigate the potential function of CX3CL1 in clinical prognosis and immunotherapy, I evaluated the expression of CX3CL1 in numerous cancer types, along with methylation levels and genetic alterations. I found that CX3CL1 was differentially expressed in numerous cancer types, which indicates that CX3CL1 may play a role in tumor progression. Furthermore, CX3CL1 showed variable methylation levels and gene alterations in most cancers according to The Cancer Genome Atlas (TCGA). CX3CL1 was robustly associated with clinical characteristics and pathological stages, suggesting that it is related to the degree of tumor malignancy and the physical function of patients. As determined by the Kaplan-Meier method of estimating survival, high CX3CL1 expression was associated with either favorable or unfavorable outcomes depending on the type of cancer, which suggests a correlation between CX3CL1 and tumor prognosis. Significant positive correlations of CX3CL1 expression with CD4+ T cells, M1 macrophages, and activated mast cells were established in the majority of TCGA malignancies, which indicates that CX3CL1 plays an important role in the tumor immune microenvironment. Gene Ontology (GO) term and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses suggested that the chemokine signaling pathway may shed light on how CX3CL1 exerts its function. In conclusion, this study comprehensively summarizes the potential role of CX3CL1 in clinical prognosis and immunotherapy, suggesting that CX3CL1 may represent a promising pharmacological treatment target for tumors.
|
31
|
Accurate Prediction of Anti-hypertensive Peptides Based on Convolutional Neural Network and Gated Recurrent unit. Interdiscip Sci 2022; 14:879-894. [PMID: 35474167 DOI: 10.1007/s12539-022-00521-3] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 12/05/2021] [Revised: 03/30/2022] [Accepted: 04/06/2022] [Indexed: 12/30/2022]
Abstract
Hypertension (HT) is a common disease and one of the most frequent and major causes of cardiovascular disease. Several conditions are caused by high blood pressure, including impairment of heart and kidney function, cerebral hemorrhage, and myocardial infarction. Due to the limitations of laboratory methods, bioactive peptides for the treatment of HT take a long time to identify. The identification of anti-hypertensive peptides (AHTPs) is therefore of immediate significance. With the prevalence of machine learning, it has been suggested as a supplementary method for AHTP classification. We therefore developed a new model to identify AHTPs based on multiple features and deep learning. The deep model combines a convolutional neural network (CNN) and a gated recurrent unit (GRU). The convolution structure is used to reduce the feature dimension and running time. The data processed by the CNN are input into the recurrent GRU structure, where important information is filtered through the reset gate and update gate. Finally, the output layer adopts a sigmoid activation function. For feature extraction, we use Kmer, the deviation between the dipeptide frequency and the expected mean (DDE), encoding based on grouped weight (EBGW), enhanced grouped amino acid composition (EGAAC), and dipeptide binary profile and frequency (DBPF). Kmer, DDE, EBGW, and EGAAC are widely used in the field of protein research. DBPF is a new feature representation method that we designed. It maps dipeptides to binary numbers and produces a binary coding file and a frequency file. These features are then concatenated and input into our proposed model for prediction and analysis. In a tenfold cross-validation test, this model compared favorably with previous methods, with accuracies of 96.23% and 99.10%, respectively, a substantial improvement over those methods. This shows that the combination of convolution calculation and recurrent structure has a positive impact on the classification of AHTPs. The results show that this method is a feasible, efficient, and competitive sequence analysis tool for AHTPs. Meanwhile, we designed a user-friendly online prediction tool, which is freely accessible at http://ahtps.zhanglab.site/.
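Of the feature encodings listed in this abstract, Kmer is the simplest to illustrate. The following is a minimal sketch of a k-mer frequency vector over the 20 standard amino acids; the choice of k = 2, the normalization by the number of valid windows, and the example peptide are assumptions for illustration, not details confirmed by the paper. The author-designed DBPF encoding is not reproduced here because the abstract does not define it fully.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues

def kmer_frequencies(sequence, k=2):
    """Return a fixed-length vector (20**k entries) of k-mer frequencies."""
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    total = 0
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        if window in index:  # windows with non-standard residues are skipped
            counts[index[window]] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [c / total for c in counts]

# Hypothetical short peptide used only to exercise the function.
vector = kmer_frequencies("IPPIPP", k=2)
```

Vectors produced this way have the same length for every peptide, which is what allows them to be concatenated with other encodings before being fed to a model.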
|
32
|
Robust and accurate prediction of self-interacting proteins from protein sequence information by exploiting weighted sparse representation based classifier. BMC Bioinformatics 2022; 23:518. [PMID: 36457083 PMCID: PMC9713954 DOI: 10.1186/s12859-022-04880-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2022] [Accepted: 08/03/2022] [Indexed: 12/04/2022] Open
Abstract
BACKGROUND: Self-interacting proteins (SIPs), in which two or more copies of a protein expressed by one gene can interact with each other, play a central role in the regulation of most living cells and cellular functions. Although high-throughput experimental techniques can provide large amounts of SIP data, experimental identification of SIPs still has several shortcomings: it is time-consuming, costly, inefficient, and inherently prone to high false-positive rates. It is therefore increasingly important to develop efficient and accurate automatic approaches that supplement experimental methods and accelerate the prediction of SIPs from protein sequence information. RESULTS: In this paper, we present a novel framework, termed GLCM-WSRC (gray level co-occurrence matrix-weighted sparse representation based classification), for predicting SIPs automatically based on evolutionary information from protein primary sequences. More specifically, we first convert the protein sequence into a Position Specific Scoring Matrix (PSSM) containing protein sequence evolutionary information, using the Position Specific Iterated BLAST (PSI-BLAST) tool. Second, using an efficient feature extraction approach, GLCM, we extract salient and invariant feature vectors from the PSSM, and then apply a pre-processing operation, the adaptive synthetic (ADASYN) technique, to balance the SIPs dataset and generate new feature vectors for classification. Finally, we employ an efficient and reliable WSRC model to identify SIPs according to the known information of self-interacting and non-interacting proteins. CONCLUSIONS: Extensive experimental results show that the proposed approach exhibits high prediction performance, with 98.10% accuracy on the yeast dataset and 91.51% accuracy on the human dataset, which suggests that the proposed model could be a useful tool for large-scale self-interacting protein prediction and other bioinformatics detection tasks in the future.
|
33
|
Tian H, Ketkar R, Tao P. ADMETboost: a web server for accurate ADMET prediction. J Mol Model 2022; 28:408. [PMID: 36454321 PMCID: PMC9903341 DOI: 10.1007/s00894-022-05373-8] [Citation(s) in RCA: 17] [Impact Index Per Article: 8.5] [Received: 09/13/2022] [Accepted: 10/31/2022] [Indexed: 12/03/2022]
Abstract
The absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties are important in drug discovery as they define efficacy and safety. In this work, we applied an ensemble of features, including fingerprints and descriptors, and a tree-based machine learning model, extreme gradient boosting, for accurate ADMET prediction. Our model performs well in the Therapeutics Data Commons ADMET benchmark group. For 22 tasks, our model is ranked first in 18 tasks and top 3 in 21 tasks. The trained machine learning models are integrated in ADMETboost, a web server that is publicly available at https://ai-druglab.smu.edu/admet.
Affiliation(s)
- Hao Tian
  - Department of Chemistry, Center for Research Computing, Center for Drug Discovery, Design, and Delivery (CD4), Southern Methodist University, Dallas, 75205, TX, USA
- Peng Tao
  - Department of Chemistry, Center for Research Computing, Center for Drug Discovery, Design, and Delivery (CD4), Southern Methodist University, Dallas, 75205, TX, USA
|
34
|
Gao H, Chen C, Li S, Wang C, Zhou W, Yu B. Prediction of protein-protein interactions based on ensemble residual convolutional neural network. Comput Biol Med 2022. [DOI: 10.1016/j.compbiomed.2022.106471] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/24/2022]
|
35
|
Wu Q, Xu L, Zou Z, Wang J, Zeng Q, Wang Q, Zhen J, Wang Y, Zhao Y, Zhou M. Rapid nondestructive detection of peanut varieties and peanut mildew based on hyperspectral imaging and stacked machine learning models. Frontiers in Plant Science 2022; 13:1047479. [PMID: 36438117 PMCID: PMC9685660 DOI: 10.3389/fpls.2022.1047479] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/18/2022] [Accepted: 10/12/2022] [Indexed: 06/16/2023]
Abstract
Mold damages peanut seeds and seriously reduces their germination rate. At the same time, the quality and variety purity of peanut seeds profoundly affect the final yield of peanuts and the economic benefits of farmers. In this study, hyperspectral imaging technology was used to achieve variety classification and mold detection of peanut seeds. This paper proposed using median filtering (MF) to preprocess hyperspectral data, four variable selection methods to obtain characteristic wavelengths, and a stacked ensemble learning model (SEL) as a stable classification model. The model performance of SEL was compared with that of the extreme gradient boosting algorithm (XGBoost), the light gradient boosting algorithm (LightGBM), and the categorical boosting algorithm (CatBoost). The results showed that the MF-LightGBM-SEL model based on hyperspectral data achieved the best performance. Its prediction accuracy on the training and test data reached 98.63% and 98.03%, respectively, and the modeling time was only 0.37 s, which demonstrated the potential of the model for practical use. The approach of SEL combined with hyperspectral imaging techniques facilitates the development of a real-time detection system. It could perform fast, non-destructive, high-precision classification of peanut seed varieties and moldy peanuts, which is of great significance for improving crop yields.
Affiliation(s)
- Qingsong Wu
  - College of Mechanical and Electrical Engineering, Sichuan Agricultural University, Yaan, China
- Lijia Xu
  - College of Mechanical and Electrical Engineering, Sichuan Agricultural University, Yaan, China
- Zhiyong Zou
  - College of Mechanical and Electrical Engineering, Sichuan Agricultural University, Yaan, China
- Jian Wang
  - College of Mechanical and Electrical Engineering, Sichuan Agricultural University, Yaan, China
- Qifeng Zeng
  - College of Mechanical and Electrical Engineering, Sichuan Agricultural University, Yaan, China
- Qianlong Wang
  - College of Mechanical and Electrical Engineering, Sichuan Agricultural University, Yaan, China
- Jiangbo Zhen
  - College of Mechanical and Electrical Engineering, Sichuan Agricultural University, Yaan, China
- Yuchao Wang
  - College of Mechanical and Electrical Engineering, Sichuan Agricultural University, Yaan, China
- Yongpeng Zhao
  - College of Mechanical and Electrical Engineering, Sichuan Agricultural University, Yaan, China
- Man Zhou
  - College of Food Sciences, Sichuan Agricultural University, Yaan, China
|
36
|
Kline A, Wang H, Li Y, Dennis S, Hutch M, Xu Z, Wang F, Cheng F, Luo Y. Multimodal machine learning in precision health: A scoping review. NPJ Digit Med 2022; 5:171. [PMID: 36344814 PMCID: PMC9640667 DOI: 10.1038/s41746-022-00712-8] [Citation(s) in RCA: 65] [Impact Index Per Article: 32.5] [Received: 06/20/2022] [Accepted: 10/14/2022] [Indexed: 11/09/2022] Open
Abstract
Machine learning is frequently leveraged to tackle problems in the health sector, including clinical decision support. Its use has historically been focused on single-modal data. Attempts to improve prediction and mimic the multimodal nature of clinical expert decision-making have been met in the biomedical field of machine learning by fusing disparate data. This review was conducted to summarize the current studies in this field and identify topics ripe for future research. We conducted this review in accordance with the PRISMA extension for Scoping Reviews to characterize multi-modal data fusion in health. Search strings were established and used in the databases PubMed, Google Scholar, and IEEEXplore from 2011 to 2021. A final set of 128 articles was included in the analysis. The most common health areas utilizing multi-modal methods were neurology and oncology. Early fusion was the most common data merging strategy. Notably, there was an improvement in predictive performance when using data fusion. Lacking from the papers were clear clinical deployment strategies, FDA approval, and analysis of how using multimodal approaches from diverse sub-populations may improve biases and healthcare disparities. These findings provide a summary of multimodal data fusion as applied to health diagnosis/prognosis problems. Few papers compared the outputs of a multimodal approach with a unimodal prediction. However, those that did achieved an average increase of 6.4% in predictive accuracy. Multi-modal machine learning, while more robust in its estimations than unimodal methods, has drawbacks in its scalability and the time-consuming nature of information concatenation.
Affiliation(s)
- Adrienne Kline
  - Department of Preventive Medicine, Northwestern University, Chicago, 60201, IL, USA
- Hanyin Wang
  - Department of Preventive Medicine, Northwestern University, Chicago, 60201, IL, USA
- Yikuan Li
  - Department of Preventive Medicine, Northwestern University, Chicago, 60201, IL, USA
- Saya Dennis
  - Department of Preventive Medicine, Northwestern University, Chicago, 60201, IL, USA
- Meghan Hutch
  - Department of Preventive Medicine, Northwestern University, Chicago, 60201, IL, USA
- Zhenxing Xu
  - Department of Population Health Sciences, Cornell University, New York, 10065, NY, USA
- Fei Wang
  - Department of Population Health Sciences, Cornell University, New York, 10065, NY, USA
- Feixiong Cheng
  - Cleveland Clinic Lerner College of Medicine, Case Western Reserve University, Cleveland, 44195, OH, USA
- Yuan Luo
  - Department of Preventive Medicine, Northwestern University, Chicago, 60201, IL, USA
|
37
|
Zhu X, Zhang M, Wen Y, Shang D. Machine learning advances the integration of covariates in population pharmacokinetic models: Valproic acid as an example. Front Pharmacol 2022; 13:994665. [PMID: 36324679 PMCID: PMC9621318 DOI: 10.3389/fphar.2022.994665] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/15/2022] [Accepted: 10/03/2022] [Indexed: 11/24/2022] Open
Abstract
Background and Aim: Many studies associated with the combination of machine learning (ML) and pharmacometrics have appeared in recent years. ML can be used as an initial step for fast screening of covariates in population pharmacokinetic (popPK) models. The present study aimed to integrate covariates derived from different popPK models using ML. Methods: Two published popPK models of valproic acid (VPA) in Chinese epileptic patients were used, where the population parameters were influenced by some covariates. Based on the covariates and a one-compartment model that describes the pharmacokinetics of VPA, a dataset was constructed using Monte Carlo simulation, to develop an XGBoost model to estimate the steady-state concentrations (Css) of VPA. We utilized SHapley Additive exPlanation (SHAP) values to interpret the prediction model, and calculated estimates of VPA exposure in four assumed scenarios involving different combinations of CYP2C19 genotypes and co-administered antiepileptic drugs. To develop an easy-to-use model in the clinic, we built a simplified model by using CYP2C19 genotypes and some noninvasive clinical parameters, and omitting several features that were infrequently measured or whose clinically available values were inaccurate, and verified it on our independent external dataset. Results: After data preprocessing, the finally generated combined dataset was divided into a derivation cohort and a validation cohort (8:2). The XGBoost model was developed in the derivation cohort and yielded excellent performance in the validation cohort with a mean absolute error of 2.4 mg/L, root-mean-squared error of 3.3 mg/L, mean relative error of 0%, and percentages within ±20% of actual values of 98.85%. The SHAP analysis revealed that daily dose, time, CYP2C19*2 and/or *3 variants, albumin, body weight, single dose, and CYP2C19*1*1 genotype were the top seven confounding factors influencing the Css of VPA. 
Under the simulated dosage regimen of 500 mg/bid, the VPA exposure in patients who had CYP2C19*2 and/or *3 variants and no carbamazepine, phenytoin, or phenobarbital treatment, was approximately 1.74-fold compared to those with CYP2C19*1/*1 genotype and co-administered carbamazepine + phenytoin + phenobarbital. The feasibility of the simplified model was fully illustrated by its performance in our external dataset. Conclusion: This study highlighted the bridging role of ML in big data and pharmacometrics, by integrating covariates derived from different popPK models.
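The validation metrics quoted in the abstract (mean absolute error, root-mean-squared error, mean relative error, and the percentage of predictions within ±20% of the actual value) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `regression_report` is hypothetical, and the abstract does not define "mean relative error", so the signed mean of (predicted - actual) / actual is assumed here.

```python
import math

def regression_report(actual, predicted, tolerance=0.20):
    """Summarise prediction error for concentrations (e.g. Css in mg/L).

    Assumption: mean relative error is the signed mean of
    (predicted - actual) / actual, expressed as a percentage.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mre_pct = 100.0 * sum(e / a for e, a in zip(errors, actual)) / n
    # Share of predictions whose absolute error is within the tolerance band.
    within_pct = 100.0 * sum(
        abs(e) <= tolerance * abs(a) for e, a in zip(errors, actual)
    ) / n
    return {"mae": mae, "rmse": rmse, "mre_pct": mre_pct, "within_pct": within_pct}

# Hypothetical values, for illustration only.
report = regression_report([50.0, 60.0, 80.0, 100.0], [55.0, 57.0, 80.0, 130.0])
```

A signed relative error near 0% indicates little systematic bias, which is why it is usually reported alongside the absolute measures rather than in place of them.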
Affiliation(s)
- Xiuqing Zhu
  - Department of Pharmacy, The Affiliated Brain Hospital of Guangzhou Medical University, Guangzhou, China
  - Guangdong Engineering Technology Research Center for Translational Medicine of Mental Disorders, Guangzhou, China
- Ming Zhang
  - Department of Pharmacy, The Affiliated Brain Hospital of Guangzhou Medical University, Guangzhou, China
  - Guangdong Engineering Technology Research Center for Translational Medicine of Mental Disorders, Guangzhou, China
- Yuguan Wen
  - Department of Pharmacy, The Affiliated Brain Hospital of Guangzhou Medical University, Guangzhou, China
  - Guangdong Engineering Technology Research Center for Translational Medicine of Mental Disorders, Guangzhou, China
  - *Correspondence: Yuguan Wen; Dewei Shang
- Dewei Shang
  - Department of Pharmacy, The Affiliated Brain Hospital of Guangzhou Medical University, Guangzhou, China
  - Guangdong Engineering Technology Research Center for Translational Medicine of Mental Disorders, Guangzhou, China
  - *Correspondence: Yuguan Wen; Dewei Shang
|
38
|
Dutta A, Hasan MK, Ahmad M, Awal MA, Islam MA, Masud M, Meshref H. Early Prediction of Diabetes Using an Ensemble of Machine Learning Models. International Journal of Environmental Research and Public Health 2022; 19:ijerph191912378. [PMID: 36231678 PMCID: PMC9566114 DOI: 10.3390/ijerph191912378] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 07/28/2022] [Revised: 09/20/2022] [Accepted: 09/24/2022] [Indexed: 05/15/2023]
Abstract
Diabetes is one of the most rapidly spreading diseases in the world, resulting in an array of significant complications, including cardiovascular disease, kidney failure, diabetic retinopathy, and neuropathy, among others, which contribute to an increase in morbidity and mortality rate. If diabetes is diagnosed at an early stage, its severity and underlying risk factors can be significantly reduced. However, there is a shortage of labeled data and the occurrence of outliers or data missingness in clinical datasets that are reliable and effective for diabetes prediction, making it a challenging endeavor. Therefore, we introduce a newly labeled diabetes dataset from a South Asian nation (Bangladesh). In addition, we suggest an automated classification pipeline that includes a weighted ensemble of machine learning (ML) classifiers: Naive Bayes (NB), Random Forest (RF), Decision Tree (DT), XGBoost (XGB), and LightGBM (LGB). Grid search hyperparameter optimization is employed to tune the critical hyperparameters of these ML models. Furthermore, missing value imputation, feature selection, and K-fold cross-validation are included in the framework design. A statistical analysis of variance (ANOVA) test reveals that the performance of diabetes prediction significantly improves when the proposed weighted ensemble (DT + RF + XGB + LGB) is executed with the introduced preprocessing, with the highest accuracy of 0.735 and an area under the ROC curve (AUC) of 0.832. In conjunction with the suggested ensemble model, our statistical imputation and RF-based feature selection techniques produced the best results for early diabetes prediction. Moreover, the presented new dataset will contribute to developing and implementing robust ML models for diabetes prediction utilizing population-level data.
Affiliation(s)
- Aishwariya Dutta
  - Department of Biomedical Engineering (BME), Khulna University of Engineering & Technology (KUET), Khulna 9203, Bangladesh
  - Department of Biomedical Engineering (BME), Military Institute of Science and Technology (MIST), Mirpur Cantonment, Dhaka 1216, Bangladesh
- Md. Kamrul Hasan
  - Department of Electrical and Electronic Engineering (EEE), Khulna University of Engineering & Technology (KUET), Khulna 9203, Bangladesh
- Mohiuddin Ahmad
  - Department of Electrical and Electronic Engineering (EEE), Khulna University of Engineering & Technology (KUET), Khulna 9203, Bangladesh
- Md. Abdul Awal
  - School of Information Technology and Electrical Engineering, The University of Queensland, Brisbane, QLD 4072, Australia
  - Electronics and Communication Engineering (ECE) Discipline, Khulna University (KU), Khulna 9208, Bangladesh
  - Corresponding author
- Mehedi Masud
  - Department of Computer Science, College of Computers and Information Technology, Taif University, P.O. Box 11099, Taif 21944, Saudi Arabia
- Hossam Meshref
  - Department of Computer Science, College of Computers and Information Technology, Taif University, P.O. Box 11099, Taif 21944, Saudi Arabia
|
39
|
Ramshankar N, Joe Prathap P. Reviewer reliability and XGboost whale optimized sentiment analysis for online product recommendation. Journal of Intelligent & Fuzzy Systems 2022. [DOI: 10.3233/jifs-221633] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
Nowadays, people routinely use online promotions to find the best shops for the best products. The shopping experience and shoppers' opinions about a shop can be observed through the customer experience shared on social media. A new customer searching for a shop needs information about manufacturing date (MRD) and manufacturing price (MRP), offers, quality, and suggestions, which is provided only by previous customers' experience. Several approaches have been used for predicting product details, but none provides accurate information. To overcome these issues, Reviewer Reliability and XGboost Whale Optimized Sentiment Analysis for Online Product Recommendation is proposed in this manuscript. Initially, the Amazon product recommendation data are preprocessed and given to the XGboost classifier, which classifies the product recommendation result as good, bad, or average. The XGboost classifier does not by itself adopt any optimization technique for computing the optimal parameters that assure accurate classification of product recommendations. Therefore, in this work the whale optimization algorithm is used to optimize the weight parameters of XGboost. The proposed model is implemented in MATLAB. The proposed method attains 18.31%, 12.81%, 45.75%, 26.97% and 25.55% lower Mean Absolute error, 18.31%, 12.81%, 27.97%, 25.97%, and 25.55% higher Mean absolute percentage error and 15.31%, 10.33%, 25.86%, 22.86% and 15.22% lower Mean Square Error than the existing methods.
Affiliation(s)
- N. Ramshankar
  - Department of Computer Science and Engineering, Jagannath Institute of Engineering and Technology, Jagatpur Industrial Estate, Jagatpur, Odisha, India
- P.M. Joe Prathap
  - Department of Computer Science and Engineering, R.M.D. Engineering College, Kavaraipettai, Tamil Nadu, India
|
40
|
Li X, Zhang S, Shi H. An improved residual network using deep fusion for identifying RNA 5-methylcytosine sites. Bioinformatics 2022; 38:4271-4277. [PMID: 35866985 DOI: 10.1093/bioinformatics/btac532] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/11/2022] [Revised: 06/30/2022] [Accepted: 07/21/2022] [Indexed: 12/24/2022]
Abstract
MOTIVATION 5-Methylcytosine (m5C) is a crucial post-transcriptional modification. With the development of technology, it is widely found in various RNAs. Numerous studies have indicated that m5C plays an essential role in various activities of organisms, such as tRNA recognition, stabilization of RNA structure, RNA metabolism and so on. Traditional identification is costly and time-consuming by wet biological experiments. Therefore, computational models are commonly used to identify the m5C sites. Due to the vast computing advantages of deep learning, it is feasible to construct the predictive model through deep learning algorithms. RESULTS In this study, we construct a model to identify m5C based on a deep fusion approach with an improved residual network. First, sequence features are extracted from the RNA sequences using Kmer, K-tuple nucleotide frequency component (KNFC), Pseudo dinucleotide composition (PseDNC) and Physical and chemical property (PCP). Kmer and KNFC extract information from a statistical point of view. PseDNC and PCP extract information from the physicochemical properties of RNA sequences. Then, two parts of information are fused with new features using bidirectional long- and short-term memory and attention mechanisms, respectively. Immediately after, the fused features are fed into the improved residual network for classification. Finally, 10-fold cross-validation and independent set testing are used to verify the credibility of the model. The results show that the accuracy reaches 91.87%, 95.55%, 92.27% and 95.60% on the training sets and independent test sets of Arabidopsis thaliana and M.musculus, respectively. This is a considerable improvement compared to previous studies and demonstrates the robust performance of our model. AVAILABILITY AND IMPLEMENTATION The data and code related to the study are available at https://github.com/alivelxj/m5c-DFRESG.
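Of the four encodings named in the abstract, Kmer is the simplest to illustrate: it counts every length-k substring and normalises by the number of windows. The sketch below is a minimal version under stated assumptions, not the code from the linked repository; the function name `kmer_frequencies`, the RNA alphabet `ACGU`, and the choice to convert T to U and skip windows containing other symbols are all illustrative.

```python
from itertools import product

def kmer_frequencies(seq, k=2, alphabet="ACGU"):
    """Return the normalised k-mer frequency vector of an RNA sequence.

    The vector has len(alphabet) ** k entries, ordered lexicographically
    by the order of letters in `alphabet`.
    """
    seq = seq.upper().replace("T", "U")  # assumption: accept DNA-style input
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # windows with ambiguous bases are skipped
            counts[window] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]

features = kmer_frequencies("ACGUACGU", k=2)  # 16-dimensional vector
```

For k = 2 the vector has 16 entries; larger k captures longer motifs at the cost of a dimension that grows as 4^k.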
Affiliation(s)
- Xinjie Li
  - School of Mathematics and Statistics, Xidian University, Xi'an 710071, P. R. China
- Shengli Zhang
  - School of Mathematics and Statistics, Xidian University, Xi'an 710071, P. R. China
- Hongyan Shi
  - School of Mathematics and Statistics, Xidian University, Xi'an 710071, P. R. China
|
41
|
Can Machine Learning classifiers be used to regulate nutrients using small training datasets for aquaponic irrigation?: A comparative analysis. PLoS One 2022; 17:e0269401. [PMID: 35972941 PMCID: PMC9380945 DOI: 10.1371/journal.pone.0269401] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2021] [Accepted: 05/20/2022] [Indexed: 11/19/2022]
Abstract
With the recent advances in the field of alternate agriculture, there has been an ever-growing demand for aquaponics as a potential substitute for traditional agricultural techniques for improving sustainable food production. However, the lack of data-driven methods and approaches for aquaponic cultivation remains a challenge. The objective of this research is to investigate statistical methods to make inferences using small datasets for nutrient control in aquaponics to optimize yield. In this work, we employed the Density-Based Synthetic Minority Over-sampling TEchnique (DB-SMOTE) to address dataset imbalance, and ExtraTreesClassifier and Recursive Feature Elimination (RFE) to choose the relevant features. Synthetic data generation techniques such as Monte-Carlo (MC) sampling were used to generate enough data points, and different feature engineering techniques were applied to the predictors before evaluating the performance of kernel-based classifiers, with the goal of controlling nutrients in the aquaponic solution for optimal growth. [27–35]
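The oversampling idea behind the SMOTE family can be sketched as follows. Note that this is plain SMOTE-style interpolation, not the density-based DB-SMOTE variant the abstract names: each synthetic point is placed at a random position on the segment between a minority sample and one of its nearest minority neighbours. The function name `smote_like` and its parameters are hypothetical.

```python
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples by neighbour interpolation.

    Simplified illustration of basic SMOTE; DB-SMOTE additionally uses
    density-based clustering to decide where to interpolate.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (squared Euclidean).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic

# Hypothetical two-feature minority class, for illustration only.
new_points = smote_like([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], n_new=5)
```

Because every synthetic point is a convex combination of two real samples, the new data stay inside the region already occupied by the minority class.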
|
42
|
Ullah M, Hadi F, Song J, Yu DJ. PScL-DDCFPred: an ensemble deep learning-based approach for characterizing multiclass subcellular localization of human proteins from bioimage data. Bioinformatics 2022; 38:4019-4026. [PMID: 35771606 PMCID: PMC9890309 DOI: 10.1093/bioinformatics/btac432] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 03/21/2022] [Revised: 06/03/2022] [Accepted: 06/28/2022] [Indexed: 02/04/2023]
Abstract
MOTIVATION Characterization of protein subcellular localization has become an important and long-standing task in bioinformatics and computational biology, which provides valuable information for elucidating various cellular functions of proteins and guiding drug design. RESULTS Here, we develop a novel bioimage-based computational approach, termed PScL-DDCFPred, to accurately predict protein subcellular localizations in human tissues. PScL-DDCFPred first extracts multiview image features, including global and local features, as base or pure features; next, it applies a new integrative feature selection method based on stepwise discriminant analysis and generalized discriminant analysis to identify the optimal feature sets from the extracted pure features; Finally, a classifier based on deep neural network (DNN) and deep-cascade forest (DCF) is established. Stringent 10-fold cross-validation tests on the new protein subcellular localization training dataset, constructed from the human protein atlas databank, illustrates that PScL-DDCFPred achieves a better performance than several existing state-of-the-art methods. Moreover, the independent test set further illustrates the generalization capability and superiority of PScL-DDCFPred over existing predictors. In-depth analysis shows that the excellent performance of PScL-DDCFPred can be attributed to three critical factors, namely the effective combination of the DNN and DCF models, complementarity of global and local features, and use of the optimal feature sets selected by the integrative feature selection algorithm. AVAILABILITY AND IMPLEMENTATION https://github.com/csbio-njust-edu/PScL-DDCFPred. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Matee Ullah
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
- Fazal Hadi
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
- Jiangning Song
  - Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
  - Monash Data Futures Institute, Monash University, Melbourne, VIC 3800, Australia
- Dong-Jun Yu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
|
43
|
Shi H, Zhang S, Li X. R5hmCFDV: computational identification of RNA 5-hydroxymethylcytosine based on deep feature fusion and deep voting. Brief Bioinform 2022; 23:6658858. [PMID: 35945157 DOI: 10.1093/bib/bbac341] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/15/2022] [Revised: 07/17/2022] [Accepted: 07/25/2022] [Indexed: 11/13/2022]
Abstract
RNA 5-hydroxymethylcytosine (5hmC) is a kind of RNA modification, which is related to the life activities of many organisms. Studying its distribution is very important to reveal its biological function. Previously, high-throughput sequencing was used to identify 5hmC, but it is expensive and inefficient. Therefore, machine learning is used to identify 5hmC sites. Here, we design a model called R5hmCFDV, which is mainly divided into feature representation, feature fusion and classification. (i) Pseudo dinucleotide composition, dinucleotide binary profile and frequency, natural vector and physicochemical property are used to extract features from four aspects: nucleotide composition, coding, natural language and physical and chemical properties. (ii) To strengthen the relevance of features, we construct a novel feature fusion method. Firstly, the attention mechanism is employed to process four single features, stitch them together and feed them to the convolution layer. After that, the output data are processed by BiGRU and BiLSTM, respectively. Finally, the features of these two parts are fused by the multiply function. (iii) We design the deep voting algorithm for classification by imitating the soft voting mechanism in the Python package. The base classifiers contain deep neural network (DNN), convolutional neural network (CNN) and improved gated recurrent unit (GRU). And then using the principle of soft voting, the corresponding weights are assigned to the predicted probabilities of the three classifiers. The predicted probability values are multiplied by the corresponding weights and then summed to obtain the final prediction results. We use 10-fold cross-validation to evaluate the model, and the evaluation indicators are significantly improved. The prediction accuracy of the two datasets is as high as 95.41% and 93.50%, respectively. It demonstrates the stronger competitiveness and generalization performance of our model. 
In addition, all datasets and source codes can be found at https://github.com/HongyanShi026/R5hmCFDV.
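The voting step described in point (iii) of the abstract, in which each classifier's predicted probabilities are multiplied by a weight and then summed, can be sketched in a few lines. This is an illustration only, not the code from the linked repository: the function name `weighted_soft_vote`, the example weights, and the division by the total weight (which changes nothing when the weights already sum to 1) are assumptions.

```python
def weighted_soft_vote(probabilities, weights):
    """Combine per-classifier class probabilities by weighted soft voting.

    probabilities: one list of class probabilities per classifier.
    weights: one non-negative weight per classifier.
    Returns (combined probabilities, index of the predicted class).
    """
    if len(probabilities) != len(weights) or not probabilities:
        raise ValueError("need one weight per classifier")
    total_weight = sum(weights)
    n_classes = len(probabilities[0])
    combined = [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total_weight
        for c in range(n_classes)
    ]
    label = max(range(n_classes), key=combined.__getitem__)
    return combined, label

# Hypothetical outputs of three base classifiers (DNN, CNN, GRU) for one site.
combined, label = weighted_soft_vote(
    [[0.40, 0.60], [0.70, 0.30], [0.55, 0.45]],
    weights=[0.5, 0.3, 0.2],
)
```

In this example the most heavily weighted classifier favours class 1, yet the combined vote selects class 0, which shows how soft voting lets confident minority classifiers overturn a weakly confident majority weight.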
Affiliation(s)
- Hongyan Shi
  - School of Mathematics and Statistics, Xidian University, Xi'an 710071, P. R. China
- Shengli Zhang
  - School of Mathematics and Statistics, Xidian University, Xi'an 710071, P. R. China
- Xinjie Li
  - School of Mathematics and Statistics, Xidian University, Xi'an 710071, P. R. China
|
44
|
Sun CK, Tang YX, Liu TC, Lu CJ. An Integrated Machine Learning Scheme for Predicting Mammographic Anomalies in High-Risk Individuals Using Questionnaire-Based Predictors. International Journal of Environmental Research and Public Health 2022; 19:ijerph19159756. [PMID: 35955112 PMCID: PMC9368335 DOI: 10.3390/ijerph19159756] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 06/30/2022] [Revised: 08/02/2022] [Accepted: 08/06/2022] [Indexed: 05/09/2023]
Abstract
This study aimed to investigate the important predictors related to predicting positive mammographic findings based on questionnaire-based demographic and obstetric/gynecological parameters using the proposed integrated machine learning (ML) scheme. The scheme combines the benefits of two well-known ML algorithms, namely, least absolute shrinkage and selection operator (Lasso) logistic regression and extreme gradient boosting (XGB), to provide adequate prediction for mammographic anomalies in high-risk individuals and the identification of significant risk factors. We collected questionnaire data on 18 breast-cancer-related risk factors from women who participated in a national mammographic screening program between January 2017 and December 2020 at a single tertiary referral hospital to correlate with their mammographic findings. The acquired data were retrospectively analyzed using the proposed integrated ML scheme. Based on the data from 21,107 valid questionnaires, the results showed that the Lasso logistic regression models with variable combinations generated by XGB could provide more effective prediction results. The top five significant predictors for positive mammography results were younger age, breast self-examination, older age at first childbirth, nulliparity, and history of mammography within 2 years, suggesting a need for timely mammographic screening for women with these risk factors.
Affiliation(s)
- Cheuk-Kay Sun
  - Division of Hepatology and Gastroenterology, Department of Internal Medicine, Shin Kong Wu Ho-Su Memorial Hospital, Taipei 11101, Taiwan
  - Graduate Institute of Business Administration, Fu Jen Catholic University, New Taipei City 24205, Taiwan
  - School of Medicine, Fu Jen Catholic University, New Taipei City 24205, Taiwan
  - School of Medicine, Taipei Medical University, Taipei 11031, Taiwan
- Yun-Xuan Tang
  - Department of Radiology, Shin Kong Wu Ho-Su Memorial Hospital, Taipei 11101, Taiwan
  - Department of Medical Imaging and Radiological Technology, Yuanpei University of Medical Technology, Hsinchu 30015, Taiwan
- Tzu-Chi Liu
  - Graduate Institute of Business Administration, Fu Jen Catholic University, New Taipei City 24205, Taiwan
- Chi-Jie Lu
  - Graduate Institute of Business Administration, Fu Jen Catholic University, New Taipei City 24205, Taiwan
  - Artificial Intelligence Development Center, Fu Jen Catholic University, New Taipei City 24205, Taiwan
  - Department of Information Management, Fu Jen Catholic University, New Taipei City 24205, Taiwan
  - Corresponding author
|
45
|
Torkey H, Belal NA. An Enhanced Multiple Sclerosis Disease Diagnosis via an Ensemble Approach. Diagnostics (Basel) 2022; 12:diagnostics12071771. [PMID: 35885672 PMCID: PMC9316893 DOI: 10.3390/diagnostics12071771] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/27/2022] [Revised: 06/25/2022] [Accepted: 07/18/2022] [Indexed: 11/30/2022]
Abstract
Multiple Sclerosis (MS) is a disease attacking the central nervous system. According to MS Atlas’s most recent statistics, there are more than 2.8 million people worldwide diagnosed with MS. Recently, studies started to explore machine learning techniques to predict MS using various data. The objective of this paper is to develop an ensemble approach for diagnosis of MS using gene expression profiles, while handling the class imbalance problem associated with the data. A hierarchical ensemble approach employing voting and boosting techniques is proposed. This approach adopts a heterogeneous voting approach using two base learners, random forest and support vector machine. Experiments show that our approach outperforms state-of-the-art methods, with the highest recorded accuracy being 92.81% and 93.5% with BoostFS and DEGs for feature selection, respectively. Conclusively, the proposed approach is able to efficiently diagnose MS using the gene expression profiles that are more relevant to the disease. The approach is not merely an ensemble classifier outperforming previous work; it also identifies differentially expressed genes between normal samples and patients with multiple sclerosis using a genome-wide expression microarray. The results obtained show that the proposed approach is an efficient diagnostic tool for MS.
Affiliation(s)
- Hanaa Torkey
  - Computer Science and Engineering Department, Faculty of Electronic Engineering, Menoufia University, Menouf 32952, Egypt
- Nahla A. Belal
  - College of Computing and Information Technology, Arab Academy for Science, Technology, and Maritime Transport, Smart Village 12577, Egypt
  - Corresponding author
|
46
|
Protein-protein interaction and non-interaction predictions using gene sequence natural vector. Commun Biol 2022; 5:652. [PMID: 35780196 PMCID: PMC9250521 DOI: 10.1038/s42003-022-03617-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 02/04/2022] [Accepted: 06/21/2022] [Indexed: 12/02/2022]
Abstract
Predicting protein–protein interaction and non-interaction are two important different aspects of multi-body structure predictions, which provide vital information about protein function. Some computational methods have recently been developed to complement experimental methods, but still cannot effectively detect real non-interacting protein pairs. We proposed a gene sequence-based method, named NVDT (Natural Vector combine with Dinucleotide and Triplet nucleotide), for the prediction of interaction and non-interaction. For protein–protein non-interactions (PPNIs), the proposed method obtained accuracies of 86.23% for Homo sapiens and 85.34% for Mus musculus, and it performed well on three types of non-interaction networks. For protein-protein interactions (PPIs), we obtained accuracies of 99.20, 94.94, 98.56, 95.41, and 94.83% for Saccharomyces cerevisiae, Drosophila melanogaster, Helicobacter pylori, Homo sapiens, and Mus musculus, respectively. Furthermore, NVDT outperformed established sequence-based methods and demonstrated high prediction results for cross-species interactions. NVDT is expected to be an effective approach for predicting PPIs and PPNIs. Protein-protein non-interactions and interactions are distinguished and predicted by gene sequence using single nucleotide and contiguous nucleotides combined with machine learning models.
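The single-nucleotide part of a natural vector can be sketched as below, assuming the commonly used 12-dimensional definition: for each nucleotide, its count, its mean (1-based) position, and its normalised second central moment of position. This is an assumption about the encoding, not the NVDT implementation, which also incorporates dinucleotide and triplet information; the function name `natural_vector` is hypothetical.

```python
def natural_vector(seq, alphabet="ACGT"):
    """Return [counts..., mean positions..., second moments...] for a sequence.

    Assumed definition for nucleotide k in a sequence of length n:
      n_k  = number of occurrences,
      mu_k = mean 1-based position of the occurrences,
      D2_k = sum((position - mu_k) ** 2) / (n_k * n).
    A nucleotide that does not occur contributes zeros.
    """
    seq = seq.upper()
    n = len(seq)
    counts, means, moments = [], [], []
    for letter in alphabet:
        positions = [i + 1 for i, ch in enumerate(seq) if ch == letter]
        n_k = len(positions)
        if n_k == 0:
            counts.append(0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu = sum(positions) / n_k
        counts.append(n_k)
        means.append(mu)
        moments.append(sum((p - mu) ** 2 for p in positions) / (n_k * n))
    return counts + means + moments

vector = natural_vector("ACGTA")  # 12 values: 4 counts, 4 means, 4 moments
```

Unlike plain composition counts, the mean and moment terms record where each nucleotide sits in the sequence, so two sequences with the same composition but different ordering generally map to different vectors.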
|
47
|
Predicting Protein–Protein Interactions Based on Ensemble Learning-Based Model from Protein Sequence. Biology 2022; 11:biology11070995. [PMID: 36101379 PMCID: PMC9311754 DOI: 10.3390/biology11070995] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/24/2022] [Revised: 05/27/2022] [Accepted: 06/29/2022] [Indexed: 11/17/2022]
Abstract
Simple Summary: Most traditional high-throughput experiments for identifying potential protein–protein interactions are tedious and laborious. To improve prediction accuracy for protein–protein interactions, we propose a novel computational method that can identify unknown protein–protein interactions efficiently, and we hope this method can provide a helpful idea and tool for proteomics research. Abstract: Protein–protein interactions (PPIs) play an essential role in many biological cellular functions. However, it is still tedious and time-consuming to identify protein–protein interactions through traditional experimental methods. For this reason, it is necessary to develop a computational method for predicting PPIs efficiently. This paper explores a novel computational method for detecting PPIs from protein sequence, which mainly adopts Locality Preserving Projections (LPP) for feature extraction and Rotation Forest (RF) as the classifier. Specifically, we first employ the Position Specific Scoring Matrix (PSSM), which retains biological evolutionary information, to represent protein sequences efficiently. Then, the LPP descriptor is applied to extract feature vectors from the PSSM. The feature vectors are fed into the RF to obtain the final results. The proposed method is applied to two datasets, Yeast and H. pylori, and obtains an average accuracy of 92.81% and 92.56%, respectively. We also compare it with K nearest neighbors (KNN) and support vector machine (SVM) to better evaluate the performance of the proposed method. In summary, all experimental results indicate that the proposed approach is stable and robust for predicting PPIs and promises to be a useful tool for proteomics research.
|
48
|
Li X, Han P, Wang G, Chen W, Wang S, Song T. SDNN-PPI: self-attention with deep neural network effect on protein-protein interaction prediction. BMC Genomics 2022; 23:474. [PMID: 35761175 PMCID: PMC9235110 DOI: 10.1186/s12864-022-08687-2] [Citation(s) in RCA: 24] [Impact Index Per Article: 12.0] [Received: 05/07/2022] [Accepted: 06/10/2022] [Indexed: 12/20/2022]
Abstract
Background Protein-protein interactions (PPIs) dominate intracellular molecules to perform a series of tasks such as transcriptional regulation, information transduction, and drug signalling. The traditional wet experiment method to obtain PPIs information is costly and time-consuming. Result In this paper, SDNN-PPI, a PPI prediction method based on self-attention and deep learning is proposed. The method adopts amino acid composition (AAC), conjoint triad (CT), and auto covariance (AC) to extract global and local features of protein sequences, and leverages self-attention to enhance DNN feature extraction to more effectively accomplish the prediction of PPIs. In order to verify the generalization ability of SDNN-PPI, a 5-fold cross-validation on the intraspecific interactions dataset of Saccharomyces cerevisiae (core subset) and human is used to measure our model in which the accuracy reaches 95.48% and 98.94% respectively. The accuracy of 93.15% and 88.33% are obtained in the interspecific interactions dataset of human-Bacillus Anthracis and Human-Yersinia pestis, respectively. In the independent data set Caenorhabditis elegans, Escherichia coli, Homo sapiens, and Mus musculus, all prediction accuracy is 100%, which is higher than the previous PPIs prediction methods. To further evaluate the advantages and disadvantages of the model, the one-core and crossover network are conducted to predict PPIs, and the data show that the model correctly predicts the interaction pairs in the network. Conclusion In this paper, AAC, CT and AC methods are used to encode the sequence, and SDNN-PPI method is proposed to predict PPIs based on self-attention deep learning neural network. Satisfactory results are obtained on interspecific and intraspecific data sets, and good performance is also achieved in cross-species prediction. 
It can also correctly predict the protein interaction of cell and tumor information contained in one-core network and crossover network. The SDNN-PPI proposed in this paper not only explores the mechanism of protein-protein interaction, but also provides new ideas for drug design and disease prevention.
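Two of the sequence encodings named in the abstract, amino acid composition (AAC) and conjoint triad (CT), can be sketched as follows. This is an illustration under stated assumptions rather than the SDNN-PPI code: the function names are hypothetical, the seven-class residue grouping is the one commonly used for conjoint triad features, raw triad counts are returned without the normalisation a real pipeline would likely apply, and non-standard residues are simply dropped.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Seven residue classes commonly used for conjoint triad encoding (assumed here).
CT_GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
CLASS_OF = {aa: idx for idx, group in enumerate(CT_GROUPS) for aa in group}

def amino_acid_composition(seq):
    """Return the 20-dimensional fraction of each standard amino acid."""
    residues = [aa for aa in seq.upper() if aa in CLASS_OF]
    n = len(residues)
    return [residues.count(aa) / n if n else 0.0 for aa in AMINO_ACIDS]

def conjoint_triad_counts(seq):
    """Return 343 counts, one per ordered triple of residue classes.

    Non-standard residues are dropped before triads are formed, so the
    residues on either side of a dropped one are treated as adjacent.
    """
    classes = [CLASS_OF[aa] for aa in seq.upper() if aa in CLASS_OF]
    counts = [0] * (7 ** 3)
    for a, b, c in zip(classes, classes[1:], classes[2:]):
        counts[a * 49 + b * 7 + c] += 1
    return counts

# Hypothetical peptide, for illustration only.
aac = amino_acid_composition("AGVAGV")
triads = conjoint_triad_counts("AGVAGV")
```

AAC describes the sequence globally, while the triad counts capture local context from three consecutive residues, which matches the abstract's distinction between global and local features.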
|
49
|
Hesami M, Alizadeh M, Jones AMP, Torkamaneh D. Machine learning: its challenges and opportunities in plant system biology. Appl Microbiol Biotechnol 2022; 106:3507-3530. [PMID: 35575915 DOI: 10.1007/s00253-022-11963-6] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Received: 03/14/2022] [Revised: 03/14/2022] [Accepted: 05/07/2022] [Indexed: 12/25/2022]
Abstract
Sequencing technologies are evolving at a rapid pace, enabling the generation of massive amounts of data in multiple dimensions (e.g., genomics, epigenomics, transcriptomics, metabolomics, proteomics, and single-cell omics) in plants. To provide comprehensive insights into the complexity of plant biological systems, it is important to integrate different omics datasets. Although recent advances in computational analytical pipelines have enabled efficient and high-quality exploration and exploitation of single omics data, the integration of multidimensional, heterogeneous, and large datasets (i.e., multi-omics) remains a challenge. In this regard, machine learning (ML) offers promising approaches to integrate large datasets and to recognize fine-grained patterns and relationships. Nevertheless, ML approaches require rigorous optimization to process multi-omics-derived datasets. In this review, we discuss the main concepts of machine learning as well as the key challenges and solutions related to the big data derived from plant system biology. We also provide in-depth insight into the principles of data integration using ML, as well as challenges and opportunities in different contexts including multi-omics, single-cell omics, protein function, and protein-protein interaction. KEY POINTS: • The key challenges and solutions related to the big data derived from plant system biology have been highlighted. • Different methods of data integration have been discussed. • Challenges and opportunities of the application of machine learning in plant system biology have been highlighted and discussed.
Affiliation(s)
- Mohsen Hesami
- Department of Plant Agriculture, University of Guelph, Guelph, ON, N1G 2W1, Canada
- Milad Alizadeh
- Department of Botany, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
- Davoud Torkamaneh
- Département de Phytologie, Université Laval, Québec City, QC, G1V 0A6, Canada; Institut de Biologie Intégrative Et Des Systèmes (IBIS), Université Laval, Québec City, QC, G1V 0A6, Canada
|
50
|
Dhal SB, Jungbluth K, Lin R, Sabahi SP, Bagavathiannan M, Braga-Neto U, Kalafatis S. A Machine-Learning-Based IoT System for Optimizing Nutrient Supply in Commercial Aquaponic Operations. SENSORS (BASEL, SWITZERLAND) 2022; 22:3510. [PMID: 35591199 PMCID: PMC9104751 DOI: 10.3390/s22093510] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 03/09/2022] [Revised: 05/01/2022] [Accepted: 05/03/2022] [Indexed: 11/16/2022]
Abstract
Nutrient regulation in aquaponic environments has been a topic of research for many years. Most studies have focused on appropriate control of nutrients in an aquaponic set-up, but very little research has been conducted on commercial-scale applications. In our model, the input data were sourced weekly from three commercial aquaponic farms in Southeast Texas over the course of a year. Because of the limited number of data points, dimensionality reduction techniques such as a pairwise correlation matrix were used to remove highly correlated predictors. Feature selection techniques such as the XGBoost classifier and Recursive Feature Elimination with ExtraTreesClassifier were used to rank the features in order of their relative importance. Ammonium and calcium were found to be the top two nutrient predictors. Based on the months in which lettuce was cultivated, the median of these nutrient values from the historical dataset served as the optimal concentration to be maintained in the aquaponic solution to sustain healthy growth of tilapia fish and lettuce plants in a coupled set-up. To accomplish this, Vernier sensors were used to measure the nutrient values, and actuator systems were built to dispense the appropriate nutrient into the ecosystem via a closed loop.
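Two steps from the abstract, removing highly correlated predictors and using the historical median as the target concentration, can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the 0.9 threshold, the function names, and the sample readings are hypothetical. The XGBoost and Recursive Feature Elimination ranking step and the restriction to lettuce-growing months are omitted.

```python
from math import sqrt
from statistics import mean, median


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no correlation information
    return cov / (sx * sy)


def drop_correlated(columns, threshold=0.9):
    """Keep a predictor only if |r| with every already kept one is <= threshold."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


def median_setpoints(columns, names):
    """Use the historical median of each selected nutrient as its setpoint."""
    return {name: median(columns[name]) for name in names}


# Hypothetical weekly readings; nitrate is deliberately a multiple of ammonium.
readings = {
    "ammonium": [0.5, 0.8, 0.6, 0.9, 0.7],
    "nitrate": [1.0, 1.6, 1.2, 1.8, 1.4],
    "calcium": [40.0, 35.0, 42.0, 38.0, 36.0],
}

selected = drop_correlated(readings)
setpoints = median_setpoints(readings, selected)
```

With these sample values, the nitrate column is dropped because it is perfectly correlated with ammonium, and the setpoints are the medians of the remaining two columns. A closed-loop controller would then compare live sensor readings with these setpoints.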
Affiliation(s)
- Sambandh Bhusan Dhal
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 79016, USA
- Kyle Jungbluth
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 79016, USA
- Raymond Lin
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 79016, USA
- Seyed Pouyan Sabahi
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 79016, USA
- Ulisses Braga-Neto
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 79016, USA
- Stavros Kalafatis
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 79016, USA
|