1
Wei W, Wang Y, Ouyang R, Wang T, Chen R, Yuan X, Wang F, Wu S, Hou H. Machine Learning for Early Discrimination Between Lung Cancer and Benign Nodules Using Routine Clinical and Laboratory Data. Ann Surg Oncol 2024:10.1245/s10434-024-15762-3. [PMID: 39014163 DOI: 10.1245/s10434-024-15762-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/19/2024] [Accepted: 06/24/2024] [Indexed: 07/18/2024]
Abstract
BACKGROUND: Lung cancer poses a global health threat, necessitating early detection and precise staging for improved patient outcomes. This study focuses on developing and validating a machine learning-based risk model for early lung cancer screening and staging, using routine clinical data.
METHODS: Observational, retrospective studies were conducted at two medical centers, involving 2312 lung cancer patients and 653 patients with benign nodules. Machine learning techniques, including differential analysis and feature selection, were employed to identify key factors for modeling. The study focused on variables such as nodule density, carcinoembryonic antigen (CEA), age, and lifestyle habits. A logistic regression model was used for early diagnosis, and an XGBoost model was used for staging based on the selected features.
RESULTS: For early diagnosis, the logistic regression model achieved an area under the curve (AUC) of 0.716 (95% confidence interval [CI] 0.607-0.826), with 0.703 sensitivity and 0.654 specificity. The XGBoost model excelled in distinguishing late-stage from early-stage lung cancer, exhibiting an AUC of 0.913 (95% CI 0.862-0.963), with 0.909 sensitivity and 0.814 specificity. These findings highlight the model's potential for enhancing diagnostic accuracy and staging in lung cancer.
CONCLUSION: This study introduces a novel machine learning-based risk model for early lung cancer screening and staging, leveraging routine clinical information and laboratory data. The model shows promise in enhancing accuracy, mitigating overdiagnosis, and improving patient outcomes.
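The AUC, sensitivity, and specificity reported in this abstract can be illustrated with a minimal Python sketch. The toy labels and scores, the 0.5 threshold, and the function names are assumptions made for illustration only; they are not the study's data or code.

```python
def sensitivity_specificity(y_true, y_score, threshold=0.5):
    """Return (sensitivity, specificity) when scores >= threshold are called positive."""
    tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = lung cancer, 0 = benign nodule.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.7, 0.3, 0.2, 0.1]

sens, spec = sensitivity_specificity(labels, scores, threshold=0.5)
print(sens, spec, auc(labels, scores))
```

Sensitivity and specificity depend on the chosen threshold, whereas the AUC summarizes ranking quality across all thresholds, which is why abstracts like this one report all three.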
Affiliation(s)
- Wei Wei
- Department of Laboratory Medicine, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
- Yun Wang
- Department of Laboratory Medicine, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
- Renren Ouyang
- Department of Laboratory Medicine, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
- Ting Wang
- Department of Laboratory Medicine, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
- Rujia Chen
- Department of Laboratory Medicine, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
- Xu Yuan
- Department of Laboratory Medicine, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
- Feng Wang
- Department of Laboratory Medicine, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China.
- Shiji Wu
- Department of Laboratory Medicine, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China.
- Hongyan Hou
- Department of Laboratory Medicine, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China.

2
Beltrán JF, Herrera-Belén L, Parraguez-Contreras F, Farías JG, Machuca-Sepúlveda J, Short S. MultiToxPred 1.0: a novel comprehensive tool for predicting 27 classes of protein toxins using an ensemble machine learning approach. BMC Bioinformatics 2024; 25:148. [PMID: 38609877 PMCID: PMC11010298 DOI: 10.1186/s12859-024-05748-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/07/2023] [Accepted: 03/14/2024] [Indexed: 04/14/2024] Open
Abstract
Protein toxins are defense mechanisms and adaptations found in various organisms and microorganisms, and their use in scientific research as therapeutic candidates is gaining relevance due to their effectiveness and specificity against cellular targets. However, discovering these toxins is time-consuming and expensive. In silico tools, particularly those based on machine learning and deep learning, have emerged as valuable resources to address this challenge. Existing tools primarily focus on binary classification, determining whether a protein is a toxin or not, and occasionally identifying specific types of toxins. For the first time, we propose a novel approach capable of classifying protein toxins into 27 distinct categories based on their mode of action within cells. To accomplish this, we assessed multiple machine learning techniques and found that an ensemble model incorporating the Light Gradient Boosting Machine and Quadratic Discriminant Analysis algorithms exhibited the best performance. During the tenfold cross-validation on the training dataset, our model exhibited notable metrics: 0.840 accuracy, 0.827 F1 score, 0.836 precision, 0.840 sensitivity, and 0.989 AUC. In the testing stage, using an independent dataset, the model achieved 0.846 accuracy, 0.838 F1 score, 0.847 precision, 0.849 sensitivity, and 0.991 AUC. These results present a powerful next-generation tool called MultiToxPred 1.0, accessible through a web application. We believe that MultiToxPred 1.0 has the potential to become an indispensable resource for researchers, facilitating the efficient identification of protein toxins. By leveraging this tool, scientists can accelerate their search for these toxins and advance their understanding of their therapeutic potential.
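The abstract does not say how the Light Gradient Boosting Machine and Quadratic Discriminant Analysis outputs are combined. One common option is soft voting, sketched below in Python as an assumption rather than as the authors' method; the function name, weights, and class probabilities are hypothetical.

```python
def soft_vote(prob_a, prob_b, weight_a=0.5):
    """Blend two per-class probability vectors and return (predicted class index, blended vector)."""
    blended = [weight_a * a + (1.0 - weight_a) * b for a, b in zip(prob_a, prob_b)]
    best = max(range(len(blended)), key=blended.__getitem__)
    return best, blended


# Hypothetical class probabilities for one protein over three toxin classes
# (the published tool distinguishes 27 classes; three are used here for brevity).
gbm_probs = [0.6, 0.3, 0.1]  # stand-in for a gradient boosting model's output
qda_probs = [0.2, 0.7, 0.1]  # stand-in for a discriminant analysis model's output

pred, blended = soft_vote(gbm_probs, qda_probs)
print(pred, blended)
```

Averaging probabilities lets a confident model outweigh an uncertain one, which a simple majority vote between two models cannot do.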
Affiliation(s)
- Jorge F Beltrán
- Department of Chemical Engineering, Faculty of Engineering and Science, Universidad de La Frontera, Ave. Francisco Salazar, 01145, Temuco, Chile.
- Lisandra Herrera-Belén
- Departamento de Ciencias Básicas, Facultad de Ciencias, Universidad Santo Tomas, Temuco, Chile
- Fernanda Parraguez-Contreras
- Department of Chemical Engineering, Faculty of Engineering and Science, Universidad de La Frontera, Ave. Francisco Salazar, 01145, Temuco, Chile
- Jorge G Farías
- Department of Chemical Engineering, Faculty of Engineering and Science, Universidad de La Frontera, Ave. Francisco Salazar, 01145, Temuco, Chile
- Jorge Machuca-Sepúlveda
- Department of Chemical Engineering, Faculty of Engineering and Science, Universidad de La Frontera, Ave. Francisco Salazar, 01145, Temuco, Chile
- Stefania Short
- Department of Chemical Engineering, Faculty of Engineering and Science, Universidad de La Frontera, Ave. Francisco Salazar, 01145, Temuco, Chile

3
Wu JS, Liu Y, Ge F, Yu DJ. Prediction of protein-ATP binding residues using multi-view feature learning via contextual-based co-attention network. Comput Biol Med 2024; 172:108227. [PMID: 38460308 DOI: 10.1016/j.compbiomed.2024.108227] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2023] [Revised: 01/17/2024] [Accepted: 02/25/2024] [Indexed: 03/11/2024]
Abstract
Accurately predicting protein-ATP binding residues is critical for protein function annotation and drug discovery. Computational methods that predict binding residues from protein sequence information have become notably more accurate. Nevertheless, these methods still face several challenges, including limited means of extracting more discriminative features and inadequate algorithms for integrating protein and residue information. To address these problems, we propose ATP-Deep, a novel protein-ATP binding residue predictor. ATP-Deep harnesses the capabilities of unsupervised pre-trained language models and incorporates domain-specific evolutionary context information from homologous sequences. It further refines the embedding at the residue level through integration with corresponding protein-level information and employs a contextual-based co-attention mechanism to fuse multiple sources of features. The performance evaluation results on the benchmark datasets reveal that ATP-Deep achieves AUCs of 0.954 and 0.951, surpassing the performance of the state-of-the-art model. These findings underscore the effectiveness of assimilating protein-level information and deploying a contextual-based co-attention mechanism to bolster the prediction performance of protein-ATP binding residues.
Affiliation(s)
- Jia-Shun Wu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, China
- Yan Liu
- School of Information Engineering, Yangzhou University, 196 West Huayang, Yangzhou, 225100, China
- Fang Ge
- State Key Laboratory of Organic Electronics and Information Displays & Institute of Advanced Materials (IAM), Nanjing University of Posts & Telecommunications, 9 Wenyuan Road, Nanjing 210023, China
- Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, China.

4
Jia P, Zhang F, Wu C, Li M. A comprehensive review of protein-centric predictors for biomolecular interactions: from proteins to nucleic acids and beyond. Brief Bioinform 2024; 25:bbae162. [PMID: 38739759 PMCID: PMC11089422 DOI: 10.1093/bib/bbae162] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/01/2024] [Revised: 02/17/2024] [Accepted: 03/31/2024] [Indexed: 05/16/2024] Open
Abstract
Proteins interact with diverse ligands to perform a large number of biological functions, such as gene expression and signal transduction. Accurate identification of these protein-ligand interactions is crucial to the understanding of molecular mechanisms and the development of new drugs. However, traditional biological experiments are time-consuming and expensive. With the development of high-throughput technologies, an increasing amount of protein data is available. In the past decades, many computational methods have been developed to predict protein-ligand interactions. Here, we review a comprehensive set of over 160 protein-ligand interaction predictors, which cover protein-protein, protein-nucleic acid, protein-peptide and protein-other ligands (nucleotide, heme, ion) interactions. We have carried out a comprehensive analysis of the above four types of predictors from several significant perspectives, including their inputs, feature profiles, models, availability, etc. The current methods primarily rely on protein sequences, especially utilizing evolutionary information. The significant improvement in predictions is attributed to deep learning methods. Additionally, sequence-based pretrained models and structure-based approaches are emerging as new trends.
Affiliation(s)
- Pengzhen Jia
- School of Computer Science and Engineering, Central South University, 932 Lushan Road(S), Changsha 410083, China
- Fuhao Zhang
- School of Computer Science and Engineering, Central South University, 932 Lushan Road(S), Changsha 410083, China
- College of Information Engineering, Northwest A&F University, No. 3 Taicheng Road, Yangling, Shaanxi 712100, China
- Chaojin Wu
- School of Computer Science and Engineering, Central South University, 932 Lushan Road(S), Changsha 410083, China
- Min Li
- School of Computer Science and Engineering, Central South University, 932 Lushan Road(S), Changsha 410083, China

5
Zhao T, Zeng J, Zhang R, Pu L, Wang H, Pan L, Jiang Y, Dai X, Sha Y, Han L. Proteomic advance of ischemic stroke: preclinical, clinical, and intervention. Metab Brain Dis 2023; 38:2521-2546. [PMID: 37440002 DOI: 10.1007/s11011-023-01262-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/23/2023] [Accepted: 07/01/2023] [Indexed: 07/14/2023]
Abstract
Ischemic stroke (IS) is the most common type of stroke and is characterized by high rates of mortality and long-term injury. The prediction and early diagnosis of IS are therefore crucial for optimal clinical intervention. Proteomics has provided important techniques for exploring protein markers associated with IS, but there has been no systematic evaluation and review of research that has used these techniques. Here, we review the differential proteins that have been found in cell- and animal-based studies and clinical trials of IS in the past 10 years; determine the key pathological proteins that have been identified in clinical trials; and summarize the target proteins affected by interventions aimed at treating IS, with a focus on traditional Chinese medicine treatments. Overall, we clarify findings and problems that have been identified in recent proteomics research on IS and provide suggestions for improvements in this area. We also suggest areas that could be explored for determining the pathogenesis of IS and developing interventions for it.
Affiliation(s)
- Tian Zhao
- Key Laboratory of Diagnosis and Treatment of Digestive System Tumors of Zhejiang Province, Ningbo No.2 Hospital, 41 Northwest Street, Ningbo, 315000, Zhejiang, China
- Center for Cardiovascular and Cerebrovascular Epidemiology and Translational Medicine, Ningbo Institute of Life and Health Industry, University of Chinese Academy of Sciences, Ningbo, 315000, China
- Jingjing Zeng
- Key Laboratory of Diagnosis and Treatment of Digestive System Tumors of Zhejiang Province, Ningbo No.2 Hospital, 41 Northwest Street, Ningbo, 315000, Zhejiang, China
- Center for Cardiovascular and Cerebrovascular Epidemiology and Translational Medicine, Ningbo Institute of Life and Health Industry, University of Chinese Academy of Sciences, Ningbo, 315000, China
- Ruijie Zhang
- Key Laboratory of Diagnosis and Treatment of Digestive System Tumors of Zhejiang Province, Ningbo No.2 Hospital, 41 Northwest Street, Ningbo, 315000, Zhejiang, China
- Center for Cardiovascular and Cerebrovascular Epidemiology and Translational Medicine, Ningbo Institute of Life and Health Industry, University of Chinese Academy of Sciences, Ningbo, 315000, China
- Liyuan Pu
- Key Laboratory of Diagnosis and Treatment of Digestive System Tumors of Zhejiang Province, Ningbo No.2 Hospital, 41 Northwest Street, Ningbo, 315000, Zhejiang, China
- Center for Cardiovascular and Cerebrovascular Epidemiology and Translational Medicine, Ningbo Institute of Life and Health Industry, University of Chinese Academy of Sciences, Ningbo, 315000, China
- Han Wang
- Key Laboratory of Diagnosis and Treatment of Digestive System Tumors of Zhejiang Province, Ningbo No.2 Hospital, 41 Northwest Street, Ningbo, 315000, Zhejiang, China
- Center for Cardiovascular and Cerebrovascular Epidemiology and Translational Medicine, Ningbo Institute of Life and Health Industry, University of Chinese Academy of Sciences, Ningbo, 315000, China
- Lifang Pan
- Key Laboratory of Diagnosis and Treatment of Digestive System Tumors of Zhejiang Province, Ningbo No.2 Hospital, 41 Northwest Street, Ningbo, 315000, Zhejiang, China
- Center for Cardiovascular and Cerebrovascular Epidemiology and Translational Medicine, Ningbo Institute of Life and Health Industry, University of Chinese Academy of Sciences, Ningbo, 315000, China
- Yannan Jiang
- Key Laboratory of Diagnosis and Treatment of Digestive System Tumors of Zhejiang Province, Ningbo No.2 Hospital, 41 Northwest Street, Ningbo, 315000, Zhejiang, China
- Center for Cardiovascular and Cerebrovascular Epidemiology and Translational Medicine, Ningbo Institute of Life and Health Industry, University of Chinese Academy of Sciences, Ningbo, 315000, China
- Xiaoyu Dai
- Department of Anus & Intestine Surgery, Ningbo No.2 Hospital, Ningbo, 315000, China
- Yuyi Sha
- Department of Intensive Care Medicine, Ningbo No.2 Hospital, Ningbo, 315000, China.
- Liyuan Han
- Key Laboratory of Diagnosis and Treatment of Digestive System Tumors of Zhejiang Province, Ningbo No.2 Hospital, 41 Northwest Street, Ningbo, 315000, Zhejiang, China.
- Center for Cardiovascular and Cerebrovascular Epidemiology and Translational Medicine, Ningbo Institute of Life and Health Industry, University of Chinese Academy of Sciences, Ningbo, 315000, China.

6
Pradhan UK, Meher PK, Naha S, Pal S, Gupta S, Gupta A, Parsad R. RBPLight: a computational tool for discovery of plant-specific RNA-binding proteins using light gradient boosting machine and ensemble of evolutionary features. Brief Funct Genomics 2023; 22:401-410. [PMID: 37158175 DOI: 10.1093/bfgp/elad016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/05/2022] [Revised: 04/12/2023] [Accepted: 04/21/2023] [Indexed: 05/10/2023] Open
Abstract
RNA-binding proteins (RBPs) are essential for post-transcriptional gene regulation in eukaryotes, including splicing control, mRNA transport and decay. Thus, accurate identification of RBPs is important to understand gene expression and regulation of cell state. A number of computational models have been developed to detect RBPs. These methods made use of datasets from several eukaryotic species, specifically from mice and humans. Although some models have been tested on Arabidopsis, these techniques fall short of correctly identifying RBPs for other plant species. Therefore, a powerful computational model for identifying plant-specific RBPs is needed. In this study, we present a novel computational model for locating RBPs in plants. Five deep learning models and ten shallow learning algorithms were utilized for prediction with 20 sequence-derived and 20 evolutionary feature sets. The highest repeated five-fold cross-validation accuracy, 91.24% AU-ROC and 91.91% AU-PRC, was achieved by the light gradient boosting machine. When evaluated using an independent dataset, the developed approach achieved 94.00% AU-ROC and 94.50% AU-PRC. The proposed model achieved significantly higher accuracy for predicting plant-specific RBPs than the currently available state-of-the-art RBP prediction models. Although certain models have already been trained and assessed on the model organism Arabidopsis, this is the first comprehensive computational model for the discovery of plant-specific RBPs. The web server RBPLight was also developed, which is publicly accessible at https://iasri-sg.icar.gov.in/rbplight/, for the convenience of researchers to identify RBPs in plants.
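Repeated five-fold cross-validation, as mentioned in this abstract, reshuffles the data and re-partitions it into five folds several times. The Python sketch below shows only the index bookkeeping; the number of repeats, the seed, and the function name are assumptions, since the abstract does not give them.

```python
import random


def repeated_kfold(n_samples, n_splits=5, n_repeats=3, seed=0):
    """Yield (train_idx, test_idx) pairs for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh shuffle for every repeat
        for k in range(n_splits):
            test_idx = indices[k::n_splits]  # every n_splits-th shuffled index
            held_out = set(test_idx)
            train_idx = [i for i in indices if i not in held_out]
            yield train_idx, test_idx


# Hypothetical example: 10 samples, 5 folds, 2 repeats, giving 10 train/test splits.
splits = list(repeated_kfold(n_samples=10, n_splits=5, n_repeats=2))
print(len(splits))
```

Within each repeat every sample is held out exactly once, and averaging the metric over all repeats reduces the variance introduced by any single random partition.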
Affiliation(s)
- Upendra K Pradhan
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
- Prabina K Meher
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
- Sanchita Naha
- Division of Computer Applications, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
- Soumen Pal
- Division of Computer Applications, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
- Sagar Gupta
- CSIR-Institute of Himalayan Bioresource Technology (CSIR-IHBT), Palampur (HP) 176061, India
- Ajit Gupta
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
- Rajender Parsad
- ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India

7
Chiu CC, Wu CM, Chien TN, Kao LJ, Li C, Chu CM. Integrating Structured and Unstructured EHR Data for Predicting Mortality by Machine Learning and Latent Dirichlet Allocation Method. Int J Environ Res Public Health 2023; 20:4340. [PMID: 36901354 PMCID: PMC10001457 DOI: 10.3390/ijerph20054340] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 01/16/2023] [Revised: 02/22/2023] [Accepted: 02/24/2023] [Indexed: 06/18/2023]
Abstract
An ICU is a critical care unit that provides advanced medical support and continuous monitoring for patients with severe illnesses or injuries. Predicting the mortality rate of ICU patients can not only improve patient outcomes, but also optimize resource allocation. Many studies have attempted to create scoring systems and models that predict the mortality of ICU patients using large amounts of structured clinical data. However, unstructured clinical data recorded during patient admission, such as notes made by physicians, is often overlooked. This study used the MIMIC-III database to predict mortality in ICU patients. In the first part of the study, only eight structured variables were used, including the six basic vital signs, the GCS, and the patient's age at admission. In the second part, unstructured predictor variables were extracted from the initial diagnosis made by physicians when the patients were admitted to the hospital and analyzed using Latent Dirichlet Allocation techniques. The structured and unstructured data were combined using machine learning methods to create a mortality risk prediction model for ICU patients. The results showed that combining structured and unstructured data improved the accuracy of the prediction of clinical outcomes in ICU patients over time. The model achieved an AUROC of 0.88, indicating accurate prediction of patient vital status. Additionally, the model was able to predict patient clinical outcomes over time, successfully identifying important variables. This study demonstrated that a small number of easily collectible structured variables, combined with unstructured data and analyzed using LDA topic modeling, can significantly improve the predictive performance of a mortality risk prediction model for ICU patients. These results suggest that initial clinical observations and diagnoses of ICU patients contain valuable information that can aid ICU medical and nursing staff in making important clinical decisions.
Affiliation(s)
- Chih-Chou Chiu
- Department of Business Management, National Taipei University of Technology, Taipei 106, Taiwan
- Chung-Min Wu
- Department of Business Management, National Taipei University of Technology, Taipei 106, Taiwan
- Te-Nien Chien
- College of Management, National Taipei University of Technology, Taipei 106, Taiwan
- Ling-Jing Kao
- Department of Business Management, National Taipei University of Technology, Taipei 106, Taiwan
- Chengcheng Li
- College of Management, National Taipei University of Technology, Taipei 106, Taiwan
- Chuan-Mei Chu
- College of Management, National Taipei University of Technology, Taipei 106, Taiwan

8
Chiu CC, Wu CM, Chien TN, Kao LJ, Li C, Jiang HL. Applying an Improved Stacking Ensemble Model to Predict the Mortality of ICU Patients with Heart Failure. J Clin Med 2022; 11:6460. [PMID: 36362686 PMCID: PMC9659015 DOI: 10.3390/jcm11216460] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 09/17/2022] [Revised: 10/21/2022] [Accepted: 10/26/2022] [Indexed: 08/31/2023] Open
Abstract
Cardiovascular diseases have been identified as one of the top three causes of death worldwide, with onset and deaths mostly due to heart failure (HF). In the ICU, where patients with HF are at increased risk of death and consume significant medical resources, early and accurate prediction of the time of death for patients at high risk of death would enable them to receive appropriate and timely medical care. The data for this study were obtained from the MIMIC-III database, where we collected vital signs and tests for 6699 HF patients during the first 24 h of their first ICU admission. In order to predict the mortality of HF patients in ICUs more precisely, an integrated stacking model is proposed and applied in this paper. In the first stage of dataset classification, the datasets were subjected to first-level classifiers using RF, SVC, KNN, LGBM, Bagging, and Adaboost. Then, the fusion of these six classifier decisions was used to construct and optimize the stacked set of second-level classifiers. The results indicate that our model obtained an accuracy of 95.25% and an AUROC of 82.55% in predicting the mortality of HF patients, which demonstrates the outstanding capability and efficiency of our method. In addition, the results of this study also revealed that platelets, glucose, and blood urea nitrogen were the clinical features that had the greatest impact on model prediction. The results of this analysis not only improve the understanding of patients' conditions by healthcare professionals but also allow for a more optimal use of healthcare resources.
Affiliation(s)
- Chih-Chou Chiu
- Department of Business Management, National Taipei University of Technology, Taipei 106, Taiwan
- Chung-Min Wu
- Department of Business Management, National Taipei University of Technology, Taipei 106, Taiwan
- Te-Nien Chien
- College of Management, National Taipei University of Technology, Taipei 106, Taiwan
- Ling-Jing Kao
- Department of Business Management, National Taipei University of Technology, Taipei 106, Taiwan
- Chengcheng Li
- College of Management, National Taipei University of Technology, Taipei 106, Taiwan
- Han-Ling Jiang
- Alliance Manchester Business School, University of Manchester, Manchester M15 6PB, UK

9
Villalobos-Alva J, Ochoa-Toledo L, Villalobos-Alva MJ, Aliseda A, Pérez-Escamirosa F, Altamirano-Bustamante NF, Ochoa-Fernández F, Zamora-Solís R, Villalobos-Alva S, Revilla-Monsalve C, Kemper-Valverde N, Altamirano-Bustamante MM. Protein Science Meets Artificial Intelligence: A Systematic Review and a Biochemical Meta-Analysis of an Inter-Field. Front Bioeng Biotechnol 2022; 10:788300. [PMID: 35875501 PMCID: PMC9301016 DOI: 10.3389/fbioe.2022.788300] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/02/2021] [Accepted: 05/25/2022] [Indexed: 11/23/2022] Open
Abstract
Proteins are some of the most fascinating and challenging molecules in the universe, and they pose a big challenge for artificial intelligence. The implementation of machine learning/AI in protein science gives rise to a world of knowledge adventures in the workhorse of the cell and proteome homeostasis, which are essential for making life possible. This opens up epistemic horizons thanks to a coupling of human tacit-explicit knowledge with machine learning power, the benefits of which are already tangible, such as important advances in protein structure prediction. Moreover, the driving force behind the protein processes of self-organization, adjustment, and fitness requires a space corresponding to gigabytes of life data in its order of magnitude. There are many tasks such as novel protein design, protein folding pathways, and synthetic metabolic routes, as well as protein-aggregation mechanisms, pathogenesis of protein misfolding and disease, and proteostasis networks that are currently unexplored or unrevealed. In this systematic review and biochemical meta-analysis, we aim to contribute to bridging the gap between what we call binomial artificial intelligence (AI) and protein science (PS), a growing research enterprise with exciting and promising biotechnological and biomedical applications. We undertake our task by exploring "the state of the art" in AI and machine learning (ML) applications to protein science in the scientific literature to address some critical research questions in this domain, including What kind of tasks are already explored by ML approaches to protein sciences? What are the most common ML algorithms and databases used? What is the situational diagnostic of the AI-PS inter-field? What do ML processing steps have in common? We also formulate novel questions such as Is it possible to discover what the rules of protein evolution are with the binomial AI-PS? How do protein folding pathways evolve? What are the rules that dictate the folds? 
What are the minimal nuclear protein structures? How do protein aggregates form and why do they exhibit different toxicities? What are the structural properties of amyloid proteins? How can we design an effective proteostasis network to deal with misfolded proteins? We are a cross-functional group of scientists from several academic disciplines, and we have conducted the systematic review using a variant of the PICO and PRISMA approaches. The search was carried out in four databases (PubMed, Bireme, OVID, and EBSCO Web of Science), resulting in 144 research articles. After three rounds of quality screening, 93 articles were finally selected for further analysis. A summary of our findings is as follows: regarding AI applications, there are mainly four types: 1) genomics, 2) protein structure and function, 3) protein design and evolution, and 4) drug design. In terms of the ML algorithms and databases used, supervised learning was the most common approach (85%). As for the databases used for the ML models, PDB and UniprotKB/Swissprot were the most common ones (21 and 8%, respectively). Moreover, we identified that approximately 63% of the articles organized their results into three steps, which we labeled pre-process, process, and post-process. A few studies combined data from several databases or created their own databases after the pre-process. Our main finding is that, as of today, there are no research road maps serving as guides to address gaps in our knowledge of the AI-PS binomial. All research efforts to collect, integrate multidimensional data features, and then analyze and validate them are, so far, uncoordinated and scattered throughout the scientific literature without a clear epistemic goal or connection between the studies. 
Therefore, our main contribution to the scientific literature is to offer a road map to help solve problems in drug design, protein structures, design, and function prediction while also presenting the "state of the art" on research in the AI-PS binomial until February 2021. Thus, we pave the way toward future advances in the synthetic redesign of novel proteins and protein networks and artificial metabolic pathways, learning lessons from nature for the welfare of humankind. Many of the novel proteins and metabolic pathways are currently non-existent in nature, nor are they used in the chemical industry or biomedical field.
Affiliation(s)
- Jalil Villalobos-Alva
- Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
- Luis Ochoa-Toledo
- Instituto de Ciencias Aplicadas y Tecnología (ICAT), Universidad Nacional Autónoma de México (UNAM), Mexico City, Mexico
- Mario Javier Villalobos-Alva
- Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
| | - Atocha Aliseda
- Instituto de Investigaciones Filosóficas, Universidad Nacional Autónoma de México (UNAM), Mexico City, Mexico
| | - Fernando Pérez-Escamirosa
- Instituto de Ciencias Aplicadas y Tecnología (ICAT), Universidad Nacional Autónoma de México (UNAM), Mexico City, Mexico
| | | | - Francine Ochoa-Fernández
- Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
| | - Ricardo Zamora-Solís
- Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
| | - Sebastián Villalobos-Alva
- Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
| | - Cristina Revilla-Monsalve
- Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
| | - Nicolás Kemper-Valverde
- Instituto de Ciencias Aplicadas y Tecnología (ICAT), Universidad Nacional Autónoma de México (UNAM), Mexico City, Mexico
| | - Myriam M. Altamirano-Bustamante
- Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
|
10
|
Nguyen TTD, Chen S, Ho QT, Ou YY. Using multiple convolutional window scanning of convolutional neural network for an efficient prediction of ATP-binding sites in transport proteins. Proteins 2022; 90:1486-1492. [PMID: 35246878 DOI: 10.1002/prot.26329] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/04/2021] [Revised: 02/23/2022] [Accepted: 02/25/2022] [Indexed: 12/31/2022]
Abstract
Information from protein multiple sequence alignments has long been an important feature for inferring the functions of proteins from related sequences with known functions. It is therefore one of the underlying ideas of AlphaFold2, a breakthrough model for predicting the three-dimensional structures of proteins from their primary sequences. Our study used protein multiple sequence alignment information in the form of position-specific scoring matrices as input. We also refined the use of the convolutional neural network, a well-known deep-learning architecture with impressive achievements on image and image-like data. Specifically, we revisited the prediction of adenosine triphosphate (ATP)-binding sites with more efficient convolutional neural networks. We applied convolutional filters with multiple scanning window sizes to the position-specific scoring matrices to capture as much useful information as possible. Furthermore, only the most specific motifs are retained from each feature map through a one-max pooling layer before passing to the next layer. We assumed that this approach would help retain the most conserved motifs, which carry discriminative information for prediction. Our experimental results show that a convolutional neural network with relatively few convolutional layers can be enough to extract the conserved information of proteins, which leads to higher performance. Our best prediction models were obtained after examining different hyper-parameters. The results showed that our models were superior to the traditional use of convolutional neural networks on the same datasets, as well as to other machine-learning classification algorithms.
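As a rough illustration of the idea described in this abstract, the sketch below slides filters of several window widths along a position-specific scoring matrix and keeps only the strongest response of each filter (one-max pooling). It is not the authors' implementation: the window widths (3, 5, 7), the number of filters, the 17-residue input, and the random weights are all assumptions chosen for illustration, and no training step is shown.

```python
import random


def conv1d_one_max(pssm, kernel):
    """Slide `kernel` (width x 20) along the sequence axis of `pssm` (L x 20),
    apply ReLU to each window response, and return the single largest
    activation (one-max pooling)."""
    width = len(kernel)
    activations = []
    for start in range(len(pssm) - width + 1):
        total = 0.0
        for offset in range(width):
            row = pssm[start + offset]
            weights = kernel[offset]
            total += sum(r * w for r, w in zip(row, weights))
        activations.append(max(0.0, total))  # ReLU
    return max(activations)


def multi_window_features(pssm, kernels_by_width):
    """Concatenate the one-max-pooled outputs of filters of every window width."""
    features = []
    for width in sorted(kernels_by_width):
        for kernel in kernels_by_width[width]:
            features.append(conv1d_one_max(pssm, kernel))
    return features


rng = random.Random(0)
SEQ_LEN, N_COLUMNS = 17, 20  # hypothetical residue window; 20 PSSM columns

# Toy PSSM with random scores, standing in for a real alignment profile.
pssm = [[rng.uniform(-1, 1) for _ in range(N_COLUMNS)] for _ in range(SEQ_LEN)]

# Four untrained (random) filters for each assumed window width.
kernels_by_width = {
    width: [
        [[rng.uniform(-0.1, 0.1) for _ in range(N_COLUMNS)] for _ in range(width)]
        for _ in range(4)
    ]
    for width in (3, 5, 7)
}

features = multi_window_features(pssm, kernels_by_width)
print(len(features), "pooled features")
```

Each filter contributes exactly one number regardless of its window width, so the feature vector passed to later layers has a fixed length even though the filters scan at different scales.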
Affiliation(s)
- Syun Chen
- Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, Taiwan
- Quang-Thai Ho
- Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, Taiwan
- Yu-Yen Ou
- Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, Taiwan
|
11
|
Yamaguchi S, Nakashima H, Moriwaki Y, Terada T, Shimizu K. Prediction of protein mononucleotide binding sites using AlphaFold2 and machine learning. Comput Biol Chem 2022; 100:107744. [DOI: 10.1016/j.compbiolchem.2022.107744] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2022] [Revised: 07/12/2022] [Accepted: 07/22/2022] [Indexed: 11/26/2022]
|
12
|
Li J, Zhu W, Zhou J, Yun W, Li X, Guan Q, Lv W, Cheng Y, Ni H, Xie Z, Li M, Zhang L, Xu Y, Zhang Q. A Presurgical Unfavorable Prediction Scale of Endovascular Treatment for Acute Ischemic Stroke. Front Aging Neurosci 2022; 14:942285. [PMID: 35847671 PMCID: PMC9284674 DOI: 10.3389/fnagi.2022.942285] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/12/2022] [Accepted: 06/02/2022] [Indexed: 11/13/2022]
Abstract
Objective: To develop a prognostic prediction model of endovascular treatment (EVT) for acute ischemic stroke (AIS) induced by large-vessel occlusion (LVO), this study applied the light gradient boosting machine (LightGBM), a machine learning classification model, to construct a unique prediction model. Methods: A total of 973 patients were enrolled. The primary outcome was assessed with the modified Rankin scale (mRS) at 90 days, and a favorable outcome was defined as an mRS score of 0–2. The LightGBM algorithm and logistic regression (LR) were used to construct a prediction model. A prediction scale was then established and verified with both internal data and external data. Results: A total of 20 presurgical variables were analyzed using LR and LightGBM. With the LightGBM algorithm, the accuracy and precision of the prediction model were 73.77% and 73.16%, respectively, and the area under the curve (AUC) was 0.824. The top 5 variables suggesting unfavorable outcomes were admission blood glucose level, age, onset-to-EVT time, onset-to-hospital time, and National Institutes of Health Stroke Scale (NIHSS) score (importance = 130.9, 102.6, 96.5, 89.5, and 84.4, respectively). According to the AUC, we established the key cutoff points and constructed a prediction scale based on their respective weightings. The established prediction scale was then verified in raw and external data, with sensitivities of 80.4% and 83.5%, respectively. Finally, scores >3 demonstrated better accuracy in predicting unfavorable outcomes. Conclusion: The presurgical prediction scale is feasible and accurate in identifying unfavorable outcomes of AIS after EVT.
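For readers unfamiliar with the reported metrics, the following minimal sketch shows how accuracy, precision, and sensitivity follow from a confusion matrix once a score cutoff is applied. The cutoff of 3 mirrors the "scores >3" rule mentioned in the abstract, but the patient scores and outcomes below are invented toy values, and the function names are hypothetical rather than taken from the study.

```python
def predict_unfavorable(scores, cutoff=3):
    """Flag a patient as predicted-unfavorable (1) when the scale score exceeds the cutoff."""
    return [1 if score > cutoff else 0 for score in scores]


def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, and sensitivity from binary labels (1 = unfavorable)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy data only: eight hypothetical patients, not values from the study.
scale_scores = [5, 4, 2, 1, 4, 3, 6, 0]
observed_unfavorable = [1, 1, 1, 0, 0, 0, 1, 0]

predicted = predict_unfavorable(scale_scores, cutoff=3)
metrics = classification_metrics(observed_unfavorable, predicted)
print(metrics)
```

Sensitivity here is the fraction of truly unfavorable outcomes that the cutoff catches, which is the quantity the abstract reports for the internal and external validation sets.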
Affiliation(s)
- Jingwei Li
- Department of Neurology of Drum Tower Hospital, Medical School and the State Key Laboratory of Pharmaceutical Biotechnology, Nanjing University, Nanjing, China
- Institute of Brain Sciences, Nanjing University, Nanjing, China
- Jiangsu Key Laboratory for Molecular Medicine, Medical School of Nanjing University, Nanjing, China
- Jiangsu Province Stroke Center for Diagnosis and Therapy, Nanjing, China
- Nanjing Neurology Clinic Medical Center, Nanjing, China
- Wencheng Zhu
- The Institute of Software, Chinese Academy of Sciences, Beijing, China
- Junshan Zhou
- Department of Neurology, Nanjing First Hospital, Nanjing Medical University, Nanjing, China
- Wenwei Yun
- Department of Neurology, Changzhou No.2 People's Hospital Affiliated to Nanjing Medical University, Changzhou, China
- Xiaobo Li
- Department of Neurology, Northern Jiangsu People's Hospital, Clinical Medical School of Yangzhou University, Yangzhou, China
- Qiaochu Guan
- Department of Neurology of Drum Tower Hospital, Medical School and the State Key Laboratory of Pharmaceutical Biotechnology, Nanjing University, Nanjing, China
- Weiping Lv
- Department of Neurology of Drum Tower Hospital, Medical School and the State Key Laboratory of Pharmaceutical Biotechnology, Nanjing University, Nanjing, China
- Yue Cheng
- Department of Neurology of Drum Tower Hospital, Medical School and the State Key Laboratory of Pharmaceutical Biotechnology, Nanjing University, Nanjing, China
- Huanyu Ni
- Department of Pharmacy of Drum Tower Hospital, Medical School, Nanjing University, Nanjing, China
- Ziyi Xie
- Department of Neurology of Drum Tower Hospital, Medical School and the State Key Laboratory of Pharmaceutical Biotechnology, Nanjing University, Nanjing, China
- Mengyun Li
- Department of Neurology of Drum Tower Hospital, Medical School and the State Key Laboratory of Pharmaceutical Biotechnology, Nanjing University, Nanjing, China
- Lu Zhang
- Department of Neurology of Drum Tower Hospital, Medical School and the State Key Laboratory of Pharmaceutical Biotechnology, Nanjing University, Nanjing, China
- Yun Xu
- Department of Neurology of Drum Tower Hospital, Medical School and the State Key Laboratory of Pharmaceutical Biotechnology, Nanjing University, Nanjing, China
- Institute of Brain Sciences, Nanjing University, Nanjing, China
- Jiangsu Key Laboratory for Molecular Medicine, Medical School of Nanjing University, Nanjing, China
- Jiangsu Province Stroke Center for Diagnosis and Therapy, Nanjing, China
- Nanjing Neurology Clinic Medical Center, Nanjing, China
- Qingxiu Zhang
- Department of Neurology of Drum Tower Hospital, Medical School and the State Key Laboratory of Pharmaceutical Biotechnology, Nanjing University, Nanjing, China
- Institute of Brain Sciences, Nanjing University, Nanjing, China
- Jiangsu Key Laboratory for Molecular Medicine, Medical School of Nanjing University, Nanjing, China
- Jiangsu Province Stroke Center for Diagnosis and Therapy, Nanjing, China
- Nanjing Neurology Clinic Medical Center, Nanjing, China
- *Correspondence: Qingxiu Zhang
|
13
|
You X, Hu X, Feng Z, Wang Z, Hao S, Yang C. Recognizing Protein-metal Ion Ligands Binding Residues by Random Forest Algorithm with Adding Orthogonal Properties. Comput Biol Chem 2022; 98:107693. [DOI: 10.1016/j.compbiolchem.2022.107693] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/15/2022] [Revised: 05/02/2022] [Accepted: 05/03/2022] [Indexed: 11/16/2022]
|
14
|
An Interpretable Machine Learning Model for Daily Global Solar Radiation Prediction. Energies 2021. [DOI: 10.3390/en14217367] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 01/18/2023]
Abstract
Machine learning (ML) models are commonly used in solar modeling due to their high predictive accuracy. However, the predictions of these models are difficult to explain and trust. This paper aims to demonstrate the utility of two interpretation techniques to explain and improve the predictions of ML models. We first compared the predictive performance of Light Gradient Boosting (LightGBM) with three benchmark models, namely multilayer perceptron (MLP), multiple linear regression (MLR), and support-vector regression (SVR), for estimating the global solar radiation (H) in the city of Fez, Morocco. Then, the predictions of the most accurate model were explained by two model-agnostic explanation techniques: permutation feature importance (PFI) and Shapley additive explanations (SHAP). The results indicated that LightGBM (R² = 0.9377, RMSE = 0.4827 kWh/m², MAE = 0.3614 kWh/m²) provided predictive accuracy similar to that of SVR and outperformed MLP and MLR in the testing stage. Both PFI and SHAP showed that extraterrestrial solar radiation (H0) and sunshine duration fraction (SF) are the two most important parameters affecting H estimation. Moreover, the SHAP method established how each feature influences the LightGBM estimations. The predictive accuracy of the LightGBM model improved slightly after re-examination of the features: the model combining H0, SF, and RH was better than the model with all features.
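The sketch below illustrates, under simplifying assumptions, how the error metrics named in this abstract (R², RMSE, MAE) and permutation feature importance can be computed. It does not use LightGBM or the study's data: `toy_model` is a hand-written stand-in predictor, the three feature columns are random numbers, and the third column is deliberately ignored by the model so that its importance comes out as zero.

```python
import math
import random


def rmse(y, y_hat):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


def mae(y, y_hat):
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def r2(y, y_hat):
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in RMSE when one feature column is shuffled at a time."""
    rng = random.Random(seed)
    baseline = rmse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            permuted_error = rmse(y, [predict(row) for row in X_perm])
            increases.append(permuted_error - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_model(row):
    # Hypothetical predictor: depends on features 0 and 1, ignores feature 2.
    return 0.5 * row[0] + 3.0 * row[1]


rng = random.Random(1)
X = [[rng.uniform(0, 10), rng.uniform(0, 1), rng.uniform(0, 100)] for _ in range(30)]
y = [toy_model(row) for row in X]  # targets generated by the toy model itself

predictions = [toy_model(row) for row in X]
importances = permutation_importance(toy_model, X, y)
print("R2:", r2(y, predictions), "RMSE:", rmse(y, predictions), "MAE:", mae(y, predictions))
print("importances:", importances)
```

Because the procedure only needs a `predict` callable, the same function could wrap any fitted regressor, which is what makes the technique model-agnostic.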
|
15
|
Ding Y, Yang C, Tang J, Guo F. Identification of protein-nucleotide binding residues via graph regularized k-local hyperplane distance nearest neighbor model. Appl Intell 2021. [DOI: 10.1007/s10489-021-02737-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 10/20/2022]
|
16
|
Hybrid Deep Learning Models with Sparse Enhancement Technique for Detection of Newly Grown Tree Leaves. Sensors 2021; 21:s21062077. [PMID: 33809537 PMCID: PMC8001602 DOI: 10.3390/s21062077] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 01/21/2021] [Revised: 03/04/2021] [Accepted: 03/12/2021] [Indexed: 12/21/2022]
Abstract
The life cycle of leaves, from sprout to senescence, consists of regular changes such as budding, branching, leaf spreading, flowering, fruiting, leaf fall, and dormancy that follow seasonal climate changes. Temperature and moisture drive the physiological changes in this life cycle, so the detection of newly grown leaves (NGL) is helpful for estimating tree growth and even climate change. This study focused on the detection of NGL based on deep learning convolutional neural network (CNN) models with sparse enhancement (SE). As the NGL areas found in forest images have similar sparse characteristics, we used a sparse image to enhance the signal of the NGL, which further increases the difference between the NGL and the background. We then proposed hybrid CNN models that combine U-net and SegNet features to perform image segmentation. Because the NGL in the images were relatively small targets, the task was also an imbalanced-data problem. Therefore, this paper further proposed 3-Layer SegNet, 3-Layer U-SegNet, 2-Layer U-SegNet, and 2-Layer Conv-U-SegNet architectures to reduce the pooling degree of traditional semantic segmentation models, and used a loss function that increases the weight of the NGL. According to the experimental results, the proposed algorithms were helpful for the image segmentation of NGL and achieved better kappa results, reaching 0.743.
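Two ideas in this abstract can be made concrete with a small sketch: weighting the rare class more heavily in the loss, and scoring a segmentation with Cohen's kappa. The abstract does not specify the loss function used, so the class-weighted binary cross-entropy below is an assumed stand-in, and the pixel labels, probabilities, and `pos_weight` value are toy values rather than data from the study.

```python
import math


def cohens_kappa(y_true, y_pred):
    """Agreement between two labelings, corrected for chance agreement."""
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    observed = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    expected = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in labels)
    if expected == 1.0:
        return 1.0  # both labelings use a single identical class
    return (observed - expected) / (1.0 - expected)


def weighted_bce(y_true, probs, pos_weight):
    """Binary cross-entropy in which positive (rare-class) pixels count `pos_weight` times."""
    eps = 1e-12
    total = 0.0
    for t, p in zip(y_true, probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(pos_weight * t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


# Toy flattened mask: 1 marks a newly-grown-leaf pixel, 0 marks background.
true_mask = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
pred_mask = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
pred_probs = [0.1] * 8 + [0.6, 0.4]

kappa = cohens_kappa(true_mask, pred_mask)
plain_loss = weighted_bce(true_mask, pred_probs, pos_weight=1.0)
weighted_loss = weighted_bce(true_mask, pred_probs, pos_weight=5.0)  # assumed weight
print(kappa, plain_loss, weighted_loss)
```

With heavily imbalanced masks, plain accuracy can look high even when the rare class is missed, which is one reason kappa is a common choice for reporting segmentation quality in this setting.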
|