1. Teng X, Han K, Jin W, Ma L, Wei L, Min D, Chen L, Du Y. Development and validation of an early diagnosis model for bone metastasis in non-small cell lung cancer based on serological characteristics of the bone metastasis mechanism. EClinicalMedicine 2024; 72:102617. [PMID: 38707910 PMCID: PMC11066529 DOI: 10.1016/j.eclinm.2024.102617] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/04/2024] [Revised: 04/10/2024] [Accepted: 04/11/2024] [Indexed: 05/07/2024]
Abstract
Background: Bone metastasis significantly impacts the prognosis of non-small cell lung cancer (NSCLC) patients, reducing their quality of life and shortening their survival. Currently, there are no effective tools for the diagnosis and risk assessment of early bone metastasis in NSCLC patients. This study employed machine learning to analyze serum indicators that are closely associated with bone metastasis, aiming to construct a model for the timely detection and prognostic evaluation of bone metastasis in NSCLC patients.
Methods: The derivation cohort consisted of 664 individuals with stage IV NSCLC, diagnosed between 2015 and 2018. The variables considered in this study included age, sex, and 18 specific serum indicators that have been linked to the occurrence of bone metastasis in NSCLC. Variable selection used multivariate logistic regression analysis and Lasso regression analysis. Six machine learning methods were used to develop a bone metastasis diagnostic model, assessed with Area Under the Curve (AUC), Decision Curve Analysis (DCA), sensitivity, specificity, and validation cohorts. External validation used 113 NSCLC patients from the Medical Alliance (2019-2020). Furthermore, a prospective validation study was conducted on a cohort of 316 patients (2019-2020) who had no bone metastasis at enrollment and were followed up for at least two years to assess the predictive capabilities of this model. The model's prognostic value was evaluated using Kaplan-Meier survival curves.
Findings: Through variable selection, 11 serum indicators were identified as independent predictive factors for NSCLC bone metastasis. Six machine learning models were developed using age, sex, and these serum indicators. A random forest (RF) model demonstrated strong performance in the training and internal validation cohorts, achieving an AUC of 0.98 (95% CI 0.95-0.99) for internal validation. External validation further confirmed the RF model's effectiveness, yielding an AUC of 0.97 (95% CI 0.94-0.99). The calibration curves demonstrated a high level of concordance between the predicted risk and the observed risk of the RF model. Prospective validation revealed that the RF model could predict the occurrence of bone metastasis approximately 10.27 ± 3.58 months in advance, according to SPECT results. An online computing platform (https://bonemetastasis.shinyapps.io/shiny_cls_1model/) for this RF model is publicly available and free to use by doctors and patients.
Interpretation: This study innovatively employs age, sex, and 11 serological markers closely related to the mechanism of bone metastasis to construct an RF model, providing a reliable tool for the early screening and prognostic assessment of bone metastasis in NSCLC patients. However, as an exploratory study, the findings require further validation through large-scale, multicenter prospective studies.
Funding: This work is supported by the National Natural Science Foundation of China (No. 81974315); Shanghai Municipal Science and Technology Commission Medical Innovation Research Project (No. 20Y11903300); Shanghai Municipal Health Commission Health Industry Clinical Research Youth Program (No. 20204Y034).
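The AUC values quoted in this abstract summarize how well predicted risks rank patients with bone metastasis above those without. The following is a minimal sketch of that idea, not the authors' pipeline: the function name `roc_auc` and the toy labels and scores are illustrative assumptions. It computes the AUC as the fraction of positive/negative pairs ranked correctly, counting ties as half.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (not from the study): 1 = bone metastasis, 0 = none.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = roc_auc(labels, scores)
```

The pairwise form is quadratic in the number of samples, which is fine for illustration; rank-based formulas give the same value more efficiently on large cohorts.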
Affiliation(s)
- Xiaoyan Teng: Department of Laboratory Medicine, Shanghai Sixth People’s Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai 200233, China
- Kun Han: Department of Oncology, Shanghai Sixth People’s Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai 200233, China
- Wei Jin: Department of Laboratory Medicine, Shanghai Sixth People’s Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai 200233, China
- Liru Ma: Department of Laboratory Medicine, Shanghai Sixth People’s Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai 200233, China
- Lirong Wei: Department of Laboratory Medicine, Shanghai Sixth People’s Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai 200233, China
- Daliu Min: Department of Oncology, Shanghai Sixth People’s Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai 200233, China
- Libo Chen: Department of Nuclear Medicine, Shanghai Sixth People’s Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai 200233, China
- Yuzhen Du: Department of Laboratory Medicine, Shanghai Sixth People’s Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai 200233, China

2. Chen Y, Du Z, Ren X, Pan C, Zhu Y, Li Z, Meng T, Yao X. mRNA-CLA: An interpretable deep learning approach for predicting mRNA subcellular localization. Methods 2024; 227:17-26. [PMID: 38705502 DOI: 10.1016/j.ymeth.2024.04.018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/22/2023] [Revised: 03/30/2024] [Accepted: 04/28/2024] [Indexed: 05/07/2024]
Abstract
Messenger RNA (mRNA) is vital for post-transcriptional gene regulation, acting as the direct template for protein synthesis. However, the methods available for predicting mRNA subcellular localization still need improvement. Notably, few existing algorithms can annotate mRNA sequences with multiple localizations. In this work, we propose mRNA-CLA, an innovative multi-label subcellular localization prediction framework for mRNA, leveraging a deep learning approach with a multi-head self-attention mechanism. The framework employs a multi-scale convolutional layer to extract sequence features across different regions and uses a self-attention mechanism explicitly designed for each sequence. Paired with Position Weight Matrices (PWMs) derived from the convolutional neural network layers, our model offers interpretability in the analysis. In particular, we perform a base-level analysis of mRNA sequences from diverse subcellular localizations to determine the nucleotide specificity corresponding to each site. Our evaluations demonstrate that the mRNA-CLA model substantially outperforms existing methods and tools.
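The abstract mentions Position Weight Matrices derived from convolutional layers. As a rough sketch of what a PWM is, the code below builds one from a handful of aligned, equal-length sequence fragments. How mRNA-CLA selects such fragments is not stated here, so the fragments, the pseudocount, the uniform background of 0.25, and the function name `position_weight_matrix` are all assumptions made for illustration.

```python
import math

ALPHABET = "ACGU"


def position_weight_matrix(fragments, pseudocount=1.0):
    """Return per-position log2-odds scores against a uniform background."""
    length = len(fragments[0])
    if any(len(f) != length for f in fragments):
        raise ValueError("all fragments must have the same length")
    pwm = []
    for i in range(length):
        # Start every base at the pseudocount so unseen bases get a finite score.
        counts = {base: pseudocount for base in ALPHABET}
        for fragment in fragments:
            counts[fragment[i]] += 1.0
        total = sum(counts.values())
        pwm.append({b: math.log2((counts[b] / total) / 0.25) for b in ALPHABET})
    return pwm


# Hypothetical fragments, e.g. subsequences that strongly activate one filter.
fragments = ["ACGU", "ACGA", "ACGU", "UCGU"]
pwm = position_weight_matrix(fragments)
consensus = "".join(max(column, key=column.get) for column in pwm)
```

Positive scores mark bases that are enriched at a position relative to the background, which is what makes a PWM readable as a motif.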
Affiliation(s)
- Yifan Chen: Institute of Artificial Intelligence Application, College of Computer and Information Engineering, Central South University of Forestry and Technology, Changsha, Hunan 410004, China
- Zhenya Du: Guangzhou Xinhua University, 510520, Guangzhou, China
- Xuanbai Ren: College of Information Science and Engineering, Hunan University, Changsha, Hunan, China
- Chu Pan: College of Information Science and Engineering, Hunan University, Changsha, Hunan, China
- Yangbin Zhu: Manufacturing and Electronic Engineering, Wenzhou University of Technology, 325027, Wenzhou, China
- Zhen Li: Institute of Computational Science and Technology, Guangzhou University, Guangzhou, 510006, China
- Tao Meng: Institute of Artificial Intelligence Application, College of Computer and Information Engineering, Central South University of Forestry and Technology, Changsha, Hunan 410004, China
- Xiaojun Yao: Faculty of Applied Sciences, Macao Polytechnic University, 999078, Macao

3. Tian Y, Yang X, Chen N, Li C, Yang W. Data-driven interpretable analysis for polysaccharide yield prediction. Environ Sci Ecotechnol 2024; 19:100321. [PMID: 38021368 PMCID: PMC10661693 DOI: 10.1016/j.ese.2023.100321] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 02/09/2023] [Revised: 09/17/2023] [Accepted: 09/17/2023] [Indexed: 12/01/2023]
Abstract
Cornstalks show promise as a raw material for polysaccharide production through xylanase. Rapid and accurate prediction of polysaccharide yield can facilitate process optimization, eliminating the need for extensive experimentation in actual production to refine reaction conditions, thereby saving time and costs. However, the intricate interplay of enzymatic factors poses challenges in predicting and optimizing polysaccharide yield accurately. Here, we introduce an innovative data-driven approach leveraging multiple artificial intelligence techniques to enhance polysaccharide production. We propose a machine learning framework to identify highly accurate polysaccharide yield prediction modeling methods and uncover optimal enzymatic parameter combinations. Notably, Random Forest (RF) and eXtreme Gradient Boost (XGB) demonstrate robust performance, achieving prediction accuracies of 93.0% and 95.6%, respectively, while an independently developed deep neural network (DNN) model achieves 91.1% accuracy. A feature importance analysis of XGB reveals the enzyme solution volume's dominant role (43.7%), followed by time (20.7%), substrate concentration (15%), temperature (15%), and pH (5.6%). Further interpretability analysis unveils complex parameter interactions and potential optimization strategies. This data-driven approach, incorporating machine learning, deep learning, and interpretable analysis, offers a viable pathway for polysaccharide yield prediction and the potential recovery of various agricultural residues.
Affiliation(s)
- Yushi Tian: School of Resource and Environment, Northeast Agriculture University, Harbin, 150030, PR China
- Xu Yang: School of Resource and Environment, Northeast Agriculture University, Harbin, 150030, PR China
- Nianhua Chen: School of Resource and Environment, Northeast Agriculture University, Harbin, 150030, PR China
- Chunyan Li: School of Resource and Environment, Northeast Agriculture University, Harbin, 150030, PR China
- Wulin Yang: College of Environmental Sciences and Engineering, Peking University, Beijing, 100871, PR China

4. Kang T, Ding W, Chen P. CRESPR: Modular sparsification of DNNs to improve pruning performance and model interpretability. Neural Netw 2024; 172:106067. [PMID: 38199151 DOI: 10.1016/j.neunet.2023.12.021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/08/2023] [Revised: 11/27/2023] [Accepted: 12/12/2023] [Indexed: 01/12/2024]
Abstract
Modern DNNs often include a huge number of parameters that are expensive in both computation and memory. Pruning can significantly reduce model complexity and lessen resource demands, and less complex models can also be easier to explain and interpret. In this paper, we propose a novel pruning algorithm, Cluster-Restricted Extreme Sparsity Pruning of Redundancy (CRESPR), to prune a neural network into modular units and achieve better pruning efficiency. With the Hessian matrix, we provide an analytic explanation of why modular structures in a sparse DNN can better maintain performance, especially at an extremely high pruning ratio. In CRESPR, each modular unit contains mostly internal connections, which clearly shows how subgroups of input features are processed through a DNN and eventually contribute to classification decisions. Revealing the internal working mechanisms at this process level leads to better interpretability of a black-box DNN model. Extensive experiments were conducted with multiple DNN architectures and datasets, and CRESPR achieves higher pruning performance than current state-of-the-art methods at high and extremely high pruning ratios. Additionally, we show how CRESPR improves model interpretability through a concrete example.
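To make the idea of modular units with mostly internal connections concrete, here is a small sketch of cluster-restricted magnitude pruning on one dense layer. This is not the CRESPR algorithm itself: the cluster assignments are assumed to be given rather than learned, and the function name `modular_prune` and the parameter `keep_per_unit` are hypothetical.

```python
def modular_prune(weights, in_cluster, out_cluster, keep_per_unit=1):
    """Zero cross-cluster weights, then keep the largest-magnitude inputs per output unit."""
    pruned = []
    for j, row in enumerate(weights):  # row j holds the input weights of output unit j
        # Only inputs in the same cluster as the output unit may stay connected.
        allowed = [i for i in range(len(row)) if in_cluster[i] == out_cluster[j]]
        allowed.sort(key=lambda i: abs(row[i]), reverse=True)
        kept = set(allowed[:keep_per_unit])
        pruned.append([w if i in kept else 0.0 for i, w in enumerate(row)])
    return pruned


# Hypothetical layer: 4 inputs, 3 outputs, two clusters (0 and 1).
weights = [
    [0.9, -0.2, 0.7, 0.1],
    [0.3, 0.8, -0.6, 0.05],
    [-0.5, 0.4, 0.2, -1.1],
]
in_cluster = [0, 0, 1, 1]
out_cluster = [0, 0, 1]
pruned = modular_prune(weights, in_cluster, out_cluster)
```

After pruning, each output unit depends only on inputs from its own cluster, so the surviving weights form a block structure that can be read as separate sub-networks.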
Affiliation(s)
- Tianyu Kang: University of Massachusetts Boston, United States of America
- Wei Ding: University of Massachusetts Boston, United States of America
- Ping Chen: University of Massachusetts Boston, United States of America

5. Zhai S, Chen K, Yang L, Li Z, Yu T, Chen L, Zhu H. Applying machine learning to anaerobic fermentation of waste sludge using two targeted modeling strategies. Sci Total Environ 2024; 916:170232. [PMID: 38278257 DOI: 10.1016/j.scitotenv.2024.170232] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/15/2023] [Revised: 01/13/2024] [Accepted: 01/15/2024] [Indexed: 01/28/2024]
Abstract
Anaerobic fermentation is an effective method to harvest volatile fatty acids (VFAs) from waste activated sludge (WAS). Accurately predicting and optimizing VFAs production is crucial for anaerobic fermentation engineering. In this study, we developed machine learning models using two innovative strategies to precisely predict the daily yield of VFAs in a laboratory anaerobic fermenter. Strategy-1 focuses on model interpretability to understand how the variables of interest influence VFAs production, while Strategy-2 takes into account the cost of variable acquisition, making it more suitable for practical applications in prediction and optimization. Support Vector Regression emerged as the most effective model in this study, with testing R² values of 0.949 and 0.939 for the two strategies, respectively. We conducted feature importance analysis to identify the critical factors that influence VFAs production. Detailed explanations were provided using partial dependence plots and SHapley Additive exPlanations (SHAP) analyses. To optimize VFAs production, we integrated the developed model with optimization algorithms, resulting in a maximum yield of 2997.282 mg/L. This value was 45.2% higher than the average VFAs level in the operated fermenter. Our study offers valuable insights for predicting and optimizing VFAs production in sludge anaerobic fermentation, and it facilitates engineering practice in VFAs harvesting from WAS.
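The abstract gives the optimized maximum yield and its relative gain but not the baseline itself. The short calculation below back-computes the implied average VFAs level from those two reported figures; the result is an inference from the stated numbers, not a value reported by the authors, and the variable names are illustrative.

```python
max_yield = 2997.282   # mg/L, reported optimized maximum
relative_gain = 0.452  # reported: 45.2% above the fermenter average

# If max = average * (1 + gain), then average = max / (1 + gain).
implied_average = max_yield / (1.0 + relative_gain)
print(f"implied average VFAs level: {implied_average:.0f} mg/L")
```

Under this reading the fermenter averaged a little over 2000 mg/L, so the optimization added roughly 900 mg/L.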
Affiliation(s)
- Shixin Zhai: Beijing Key Lab for Source Control Technology of Water Pollution, Beijing Forestry University, Beijing 100083, China
- Kai Chen: Beijing Key Lab for Source Control Technology of Water Pollution, Beijing Forestry University, Beijing 100083, China
- Lisha Yang: Beijing Key Lab for Source Control Technology of Water Pollution, Beijing Forestry University, Beijing 100083, China
- Zhuo Li: Beijing Key Lab for Source Control Technology of Water Pollution, Beijing Forestry University, Beijing 100083, China
- Tong Yu: Beijing Key Lab for Source Control Technology of Water Pollution, Beijing Forestry University, Beijing 100083, China
- Long Chen: Beijing Key Lab for Source Control Technology of Water Pollution, Beijing Forestry University, Beijing 100083, China
- Hongtao Zhu: Beijing Key Lab for Source Control Technology of Water Pollution, Beijing Forestry University, Beijing 100083, China

6. Bhati A, Gour N, Khanna P, Ojha A, Werghi N. An interpretable dual attention network for diabetic retinopathy grading: IDANet. Artif Intell Med 2024; 149:102782. [PMID: 38462283 DOI: 10.1016/j.artmed.2024.102782] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/05/2023] [Revised: 01/05/2024] [Accepted: 01/15/2024] [Indexed: 03/12/2024]
Abstract
Diabetic retinopathy (DR) is the most prevalent cause of visual impairment in adults worldwide. Typically, patients with DR do not show symptoms until later stages, by which time it may be too late to receive effective treatment. DR grading is challenging because of the small size and variation in lesion patterns. The key to fine-grained DR grading is to discover more discriminating elements such as cotton wool spots, hard exudates, hemorrhages, and microaneurysms. Although deep learning models like convolutional neural networks (CNN) seem ideal for the automated detection of abnormalities in advanced clinical imaging, small lesions are very hard to distinguish using traditional networks. This work proposes a bi-directional spatial and channel-wise parallel attention based network to learn discriminative features for diabetic retinopathy grading. The proposed attention block, plugged into a backbone network, helps to extract features specific to fine-grained DR grading. This scheme boosts classification performance along with the detection of small lesion parts. Extensive experiments are performed on four widely used benchmark datasets for DR grading, and performance is evaluated on different quality metrics. Also, for model interpretability, activation maps are generated using the LIME method to visualize the predicted lesion parts. In comparison with state-of-the-art methods, the proposed IDANet exhibits better performance for DR grading and lesion detection.
Affiliation(s)
- Amit Bhati: PDPM Indian Institute of Information Technology, Design and Manufacturing, Jabalpur 482005, India
- Neha Gour: Department of Electrical Engineering and Computer Science, Khalifa University, Abu Dhabi, United Arab Emirates
- Pritee Khanna: PDPM Indian Institute of Information Technology, Design and Manufacturing, Jabalpur 482005, India
- Aparajita Ojha: PDPM Indian Institute of Information Technology, Design and Manufacturing, Jabalpur 482005, India
- Naoufel Werghi: Department of Electrical Engineering and Computer Science, Khalifa University, Abu Dhabi, United Arab Emirates

7. Wang T, Li YY. Predictive modeling based on artificial neural networks for membrane fouling in a large pilot-scale anaerobic membrane bioreactor for treating real municipal wastewater. Sci Total Environ 2024; 912:169164. [PMID: 38081428 DOI: 10.1016/j.scitotenv.2023.169164] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/01/2023] [Revised: 11/25/2023] [Accepted: 12/05/2023] [Indexed: 12/17/2023]
Abstract
Membrane fouling is the primary obstacle to applying anaerobic membrane bioreactors (AnMBRs) in municipal wastewater treatment. This issue is critical because efficient wastewater treatment is a cornerstone of environmental sustainability. This study uses machine learning to predict membrane fouling, taking advantage of rapid computational and algorithmic advances. Based on 525 days of operation data from a large pilot-scale AnMBR treating real municipal wastewater, regression prediction was realized using multilayer perceptron (MLP) and long short-term memory (LSTM) artificial neural networks under substantial variations in operating conditions. The models used the organic loading rate, suspended solids concentration, protein concentration in extracellular polymeric substances (EPSp), polysaccharide concentration in EPS (EPSc), reactor temperature, and flux as input features, and transmembrane pressure as the prediction target. Hyperparameter optimization enhanced the regression prediction accuracies, and the rationality and utility of the MLP model for predicting large-scale AnMBR membrane fouling were confirmed by interpretability analysis at global and local levels. This work not only provides a methodological advance but also underscores the importance of merging environmental engineering with computational advancements to address pressing environmental challenges.
Affiliation(s)
- Tianjie Wang: Laboratory of Environmental Protection Engineering, Department of Civil and Environmental Engineering, Graduate School of Engineering, Tohoku University, 6-6-06 Aza-Aoba, Aramaki, Aoba Ward, Sendai, Miyagi 980-8579, Japan
- Yu-You Li: Laboratory of Environmental Protection Engineering, Department of Civil and Environmental Engineering, Graduate School of Engineering, Tohoku University, 6-6-06 Aza-Aoba, Aramaki, Aoba Ward, Sendai, Miyagi 980-8579, Japan

8. Fan L, Gong X, Zheng C, Li J. Data pyramid structure for optimizing EUS-based GISTs diagnosis in multi-center analysis with missing label. Comput Biol Med 2024; 169:107897. [PMID: 38171262 DOI: 10.1016/j.compbiomed.2023.107897] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/01/2023] [Revised: 12/04/2023] [Accepted: 12/23/2023] [Indexed: 01/05/2024]
Abstract
This study introduces the Data Pyramid Structure (DPS) to address data sparsity and missing labels in medical image analysis. The DPS optimizes multi-task learning and enables sustainable expansion of multi-center data analysis. Specifically, it facilitates attribute prediction and malignant tumor diagnosis tasks by implementing a segmentation and aggregation strategy on data with absent attribute labels. To leverage multi-center data, we propose the Unified Ensemble Learning Framework (UELF) and the Unified Federated Learning Framework (UFLF), which incorporate strategies for data transfer and incremental learning in scenarios with missing labels. The proposed method was evaluated on a challenging EUS patient dataset from five centers, achieving promising diagnostic performance. The average accuracy was 0.984 with an AUC of 0.927 for multi-center analysis, surpassing state-of-the-art approaches. The interpretability of the predictions further highlights the potential clinical relevance of our method.
Affiliation(s)
- Lin Fan: School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu, Sichuan 611756, China; Manufacturing Industry Chains Collaboration and Information Support Technology Key Laboratory of Sichuan Province, China; Engineering Research Center of Sustainable Urban Intelligent Transportation, Ministry of Education, China; National Engineering Laboratory of Integrated Transportation Big Data Application Technology, China
- Xun Gong: School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu, Sichuan 611756, China; Manufacturing Industry Chains Collaboration and Information Support Technology Key Laboratory of Sichuan Province, China; Engineering Research Center of Sustainable Urban Intelligent Transportation, Ministry of Education, China; National Engineering Laboratory of Integrated Transportation Big Data Application Technology, China
- Cenyang Zheng: School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu, Sichuan 611756, China; Manufacturing Industry Chains Collaboration and Information Support Technology Key Laboratory of Sichuan Province, China; Engineering Research Center of Sustainable Urban Intelligent Transportation, Ministry of Education, China; National Engineering Laboratory of Integrated Transportation Big Data Application Technology, China
- Jiao Li: Department of Gastroenterology, The Third People's Hospital of Chengdu, Affiliated Hospital of Southwest Jiaotong University, Chengdu 610031, China

9. Hou Z, Leng J, Yu J, Xia Z, Wu LY. PathExpSurv: pathway expansion for explainable survival analysis and disease gene discovery. BMC Bioinformatics 2023; 24:434. [PMID: 37968615 PMCID: PMC10648621 DOI: 10.1186/s12859-023-05535-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/17/2023] [Accepted: 10/16/2023] [Indexed: 11/17/2023]
Abstract
BACKGROUND: In biology and medicine, interpretability and accuracy are both important when designing predictive models. The interpretability of many machine learning models, such as neural networks, is still a challenge. Recently, many researchers have used prior information such as biological pathways to develop neural network-based methods that provide some insight and interpretability. However, prior biological knowledge may be incomplete, and some information remains unknown and unexplored.
RESULTS: We proposed a novel method, named PathExpSurv, to gain insight into the black-box neural network model for cancer survival analysis. We demonstrated that PathExpSurv could not only incorporate known prior information into the model, but also explore possible unknown expansions of the existing pathways. We performed downstream analyses based on the expanded pathways and successfully identified some key genes associated with the diseases and original pathways.
CONCLUSIONS: Our proposed PathExpSurv is a novel, effective and interpretable method for survival analysis. It has great utility and value in medical diagnosis and offers a promising framework for biological research.
Affiliation(s)
- Zhichao Hou: IAM, MADIS, NCMIS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China; School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, China
- Jiacheng Leng: IAM, MADIS, NCMIS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China; School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, China
- Jiating Yu: IAM, MADIS, NCMIS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China; School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, China
- Zheng Xia: Computational Biology Program, Oregon Health & Science University, Portland, USA; Department of Biomedical Engineering, Oregon Health & Science University, Portland, USA
- Ling-Yun Wu: IAM, MADIS, NCMIS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China; School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, China

10. Miao Z, Zhao M, Zhang X, Ming D. LMDA-Net: A lightweight multi-dimensional attention network for general EEG-based brain-computer interfaces and interpretability. Neuroimage 2023; 276:120209. [PMID: 37269957 DOI: 10.1016/j.neuroimage.2023.120209] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 02/23/2023] [Revised: 05/05/2023] [Accepted: 05/30/2023] [Indexed: 06/05/2023]
Abstract
Electroencephalography (EEG)-based brain-computer interfaces (BCIs) pose a challenge for decoding due to their low spatial resolution and signal-to-noise ratio. Typically, EEG-based recognition of activities and states involves the use of prior neuroscience knowledge to generate quantitative EEG features, which may limit BCI performance. Although neural network-based methods can effectively extract features, they often encounter issues such as poor generalization across datasets, high prediction volatility, and low model interpretability. To address these limitations, we propose a novel lightweight multi-dimensional attention network, called LMDA-Net. By incorporating two novel attention modules designed specifically for EEG signals, the channel attention module and the depth attention module, LMDA-Net is able to effectively integrate features from multiple dimensions, resulting in improved classification performance across various BCI tasks. LMDA-Net was evaluated on four high-impact public datasets, including motor imagery (MI) and P300-Speller, and was compared with other representative models. The experimental results demonstrate that LMDA-Net outperforms other representative methods in terms of classification accuracy and prediction volatility, achieving the highest accuracy in all datasets within 300 training epochs. Ablation experiments further confirm the effectiveness of the channel attention module and the depth attention module. To facilitate an in-depth understanding of the features extracted by LMDA-Net, we propose class-specific neural network feature interpretability algorithms that are suitable for evoked responses and endogenous activities. By mapping the output of a specific layer of LMDA-Net to the time or spatial domain through class activation maps, the resulting feature visualizations can provide interpretable analysis and establish connections with EEG time-spatial analysis in neuroscience. In summary, LMDA-Net shows great potential as a general decoding model for various EEG tasks.
Affiliation(s)
- Zhengqing Miao: State Key Laboratory of Precision Measuring Technology and Instruments, School of Precision Instrument and Opto-electronics Engineering, Tianjin University, Tianjin 300072, China
- Meirong Zhao: State Key Laboratory of Precision Measuring Technology and Instruments, School of Precision Instrument and Opto-electronics Engineering, Tianjin University, Tianjin 300072, China
- Xin Zhang: Laboratory of Neural Engineering and Rehabilitation, Department of Biomedical Engineering, School of Precision Instruments and Optoelectronics Engineering, Tianjin University, China; Tianjin International Joint Research Center for Neural Engineering, Academy of Medical Engineering and Translational Medicine, Tianjin University, Tianjin 300072, China
- Dong Ming: Laboratory of Neural Engineering and Rehabilitation, Department of Biomedical Engineering, School of Precision Instruments and Optoelectronics Engineering, Tianjin University, China; Tianjin International Joint Research Center for Neural Engineering, Academy of Medical Engineering and Translational Medicine, Tianjin University, Tianjin 300072, China

11. Bao X, Sun J, Yi M, Qiu J, Chen X, Shuai SC, Zhao Q. MPFFPSDC: A multi-pooling feature fusion model for predicting synergistic drug combinations. Methods 2023:S1046-2023(23)00098-1. [PMID: 37321525 DOI: 10.1016/j.ymeth.2023.06.006] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 05/16/2023] [Revised: 06/11/2023] [Accepted: 06/12/2023] [Indexed: 06/17/2023]
Abstract
Drug combination therapies are common practice in the treatment of cancer, but not all combinations result in synergy. As traditional screening approaches are restricted in their ability to uncover synergistic drug combinations, computer-aided medicine is becoming increasingly prevalent in this field. In this work, we present MPFFPSDC, a predictive model of potential interactions between drugs that maintains the symmetry of drug inputs and eliminates inconsistencies in predictive results caused by different drug input orders or positions. The experimental results show that MPFFPSDC outperforms comparative models in major performance indicators and exhibits better generalization for independent data. Furthermore, the case study demonstrates that our model can capture molecular substructures that contribute to the synergistic effect of two drugs. These results indicate that MPFFPSDC not only offers strong predictive performance, but also has good model interpretability that may provide new insights for the study of drug interaction mechanisms and the development of new drugs.
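One common way to make a pairwise model independent of input order is to fuse the two drug feature vectors with symmetric pooling operations. The sketch below illustrates that property only; the abstract does not say which pooling operations MPFFPSDC uses, so the choice of sum, max, and min, the function name `fuse_pair`, and the toy vectors are assumptions.

```python
def fuse_pair(drug_a, drug_b):
    """Order-invariant fusion: concatenate element-wise sum, max, and min pooling."""
    if len(drug_a) != len(drug_b):
        raise ValueError("feature vectors must have the same length")
    summed = [a + b for a, b in zip(drug_a, drug_b)]
    maxed = [max(a, b) for a, b in zip(drug_a, drug_b)]
    minned = [min(a, b) for a, b in zip(drug_a, drug_b)]
    return summed + maxed + minned


# Hypothetical drug embeddings (not real molecular features).
drug_a = [0.2, 1.5, -0.3]
drug_b = [0.7, 0.1, 0.4]
fused_ab = fuse_pair(drug_a, drug_b)
fused_ba = fuse_pair(drug_b, drug_a)
```

Because each pooling operation is commutative, `fused_ab` and `fused_ba` are identical, so any predictor applied to the fused vector gives the same score for (A, B) and (B, A).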
Affiliation(s)
- Xin Bao
- School of Automation and Electrical Engineering, Linyi University, Linyi 276000, China
| | - Jianqiang Sun
- School of Automation and Electrical Engineering, Linyi University, Linyi 276000, China.
| | - Ming Yi
- School of Mathematics and Physics, China University of Geosciences, Wuhan 430000, China
| | - Jianlong Qiu
- School of Automation and Electrical Engineering, Linyi University, Linyi 276000, China
| | - Xiangyong Chen
- School of Automation and Electrical Engineering, Linyi University, Linyi 276000, China
| | - Stella C Shuai
- Biological Science, Northwestern University, Evanston, IL 60208, USA
| | - Qi Zhao
- School of Computer Science and Software Engineering, University of Science and Technology Liaoning, Anshan 114051, China.
| |
|
12
|
Zhang Z, Lin J, Chen Z. Predicting the effect of silver nanoparticles on soil enzyme activity using the machine learning method: type, size, dose and exposure time. J Hazard Mater 2023; 457:131789. [PMID: 37301072 DOI: 10.1016/j.jhazmat.2023.131789] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/27/2023] [Revised: 06/03/2023] [Accepted: 06/04/2023] [Indexed: 06/12/2023]
Abstract
In this study, machine learning models predicted the impact of silver nanoparticles (AgNPs) on soil enzymes. Artificial neural network (ANN) optimized with genetic algorithm (GA) (MAE = 0.1174) was more suitable for simulating overall trends, while the gradient boosting machine (GBM) and random forest (RF) were ideal for small-scale analysis. According to partial dependency profile (PDP) analysis, polyvinylpyrrolidone coated AgNPs (PVP-AgNPs) had the most inhibitory effect (average of 49.5%) on soil enzyme activity among the three types of AgNPs at the same dose (0.02-50 mg/kg). The ANN model predicted that enzyme activity first declined and then rose when AgNPs increased in size. Based on predictions from the ANN and RF models, when exposed to uncoated AgNPs, soil enzyme activities continued to decrease before 30 d, but gradually rose from 30 to 90 d, and fell slightly after 90 d. The ANN model indicated the importance order of four factors: dose > type > size > exposure time. The RF model suggested the enzyme was more sensitive when experiments were conducted at doses, sizes, and exposure times of 0.01-1 mg/kg, 50-100 nm, and 30-90 d, respectively. This study presents new insights on the regularity of soil enzyme responses to AgNPs.
Affiliation(s)
- Zhenjun Zhang
- Fujian Key Laboratory of Pollution Control and Resource Reuse, College of Environmental and Resource Sciences, Fujian Normal University, Fuzhou 350117, Fujian Province, China
| | - Jiajiang Lin
- Fujian Key Laboratory of Pollution Control and Resource Reuse, College of Environmental and Resource Sciences, Fujian Normal University, Fuzhou 350117, Fujian Province, China.
| | - Zuliang Chen
- Fujian Key Laboratory of Pollution Control and Resource Reuse, College of Environmental and Resource Sciences, Fujian Normal University, Fuzhou 350117, Fujian Province, China.
| |
|
13
|
Lu X, Du J, Zheng L, Wang G, Li X, Sun L, Huang X. Feature fusion improves performance and interpretability of machine learning models in identifying soil pollution of potentially contaminated sites. Ecotoxicol Environ Saf 2023; 259:115052. [PMID: 37224784 DOI: 10.1016/j.ecoenv.2023.115052] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2022] [Revised: 05/17/2023] [Accepted: 05/19/2023] [Indexed: 05/26/2023]
Abstract
Owing to the rapid development of big data technology, the use of machine learning methods to identify soil pollution of potentially contaminated sites (PCS) at regional scales and in different industries has become a research hot spot. However, because key indexes of site pollution sources and pathways are difficult to obtain, current methods suffer from low prediction accuracy and an insufficient scientific basis. In this study, we collected the environmental data of 199 PCS in 6 typical industries involving heavy metal and organic pollution. Then, 21 indexes based on basic information, potential for pollution from product and raw material, pollution control level, and migration capacity of soil pollutants were used to establish the soil pollution identification index system. We fused the original indexes into a new feature subset with 11 indexes through the method of consolidation calculation. The new feature subset was then used to train machine learning models of random forest (RF), support vector machine (SVM), and multilayer perceptron (MLP), and tested to determine whether it improved the accuracy and precision of soil pollution identification models. The results of correlation analysis showed that the four new indexes created by feature fusion have a correlation with soil pollution similar to that of the original indexes. The accuracies and precisions of three machine learning models trained on the new feature subset were 67.4%-72.9% and 72.0%-74.7%, which were 2.1%-2.5% and 0.3%-5.7% higher than those of the models trained on original indexes, respectively. When the PCS were divided into typical heavy metal and organic pollution sites according to the enterprise industries, the accuracy of the models trained on the two datasets for identifying soil heavy metal and organic pollution was significantly improved to approximately 80%.
Owing to the imbalance in positive and negative samples in the prediction of soil organic pollution, the precisions of soil organic pollution identification models were 58%-72.5%, which were significantly lower than their accuracies. According to the factor analysis based on the model interpretability of SHAP, most of the indexes of basic information, potential for pollution from product and raw material, and pollution control level had different degrees of impact on soil pollution. However, the indexes of migration capacity of soil pollutants had the least effect in the classification task of soil pollution identification of PCS. Among the indexes, traces of soil pollution, industrial utilization years/start-up time, pollution control risk scores and enterprise scale had the greatest effects on soil pollution, with mean SHAP values of 0.17-0.36, which reflected their contribution rate to soil pollution and could help to optimize the current index scoring of the technical regulation for identifying site soil pollution. This study provides a new technical method to identify soil pollution based on big data and machine learning methods, in addition to providing a reference and scientific basis for environmental management and soil pollution control of PCS.
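The gap between accuracy and precision under class imbalance, noted in this abstract, can be reproduced with a small worked example. The counts below are invented for illustration and are not taken from the study; `accuracy_and_precision` is a hypothetical helper.

```python
def accuracy_and_precision(y_true, y_pred):
    """Return (accuracy, precision) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    accuracy = (tp + tn) / len(pairs)
    # Precision is undefined when nothing is predicted positive; report 0.0 then.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return accuracy, precision


# Toy imbalanced set: 10 polluted sites (1) and 90 clean sites (0).
y_true = [1] * 10 + [0] * 90
# Hypothetical predictions: 7 true positives, 3 misses, 5 false alarms.
y_pred = [1] * 7 + [0] * 3 + [1] * 5 + [0] * 85

accuracy, precision = accuracy_and_precision(y_true, y_pred)
print(f"accuracy={accuracy:.3f} precision={precision:.3f}")
```

With few positive samples, the many correctly classified negatives keep accuracy high, while a handful of false alarms is enough to pull precision well below it.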
Affiliation(s)
- Xiaosong Lu
- State Environmental Protection Key Laboratory of Soil Environmental Management and Pollution Control, Nanjing Institute of Environmental Sciences, Ministry of Ecology and Environment, Nanjing 210042, China
| | - Junyang Du
- State Environmental Protection Key Laboratory of Soil Environmental Management and Pollution Control, Nanjing Institute of Environmental Sciences, Ministry of Ecology and Environment, Nanjing 210042, China
| | - Liping Zheng
- State Environmental Protection Key Laboratory of Soil Environmental Management and Pollution Control, Nanjing Institute of Environmental Sciences, Ministry of Ecology and Environment, Nanjing 210042, China
| | - Guoqing Wang
- State Environmental Protection Key Laboratory of Soil Environmental Management and Pollution Control, Nanjing Institute of Environmental Sciences, Ministry of Ecology and Environment, Nanjing 210042, China.
| | - Xuzhi Li
- State Environmental Protection Key Laboratory of Soil Environmental Management and Pollution Control, Nanjing Institute of Environmental Sciences, Ministry of Ecology and Environment, Nanjing 210042, China
| | - Li Sun
- State Environmental Protection Key Laboratory of Soil Environmental Management and Pollution Control, Nanjing Institute of Environmental Sciences, Ministry of Ecology and Environment, Nanjing 210042, China
| | - Xinghua Huang
- College of Environmental Science and Engineering, Yangzhou University, Yangzhou 225127, China
| |
|
14
|
Majdandzic A, Rajesh C, Koo PK. Correcting gradient-based interpretations of deep neural networks for genomics. Genome Biol 2023; 24:109. [PMID: 37161475 PMCID: PMC10169356 DOI: 10.1186/s13059-023-02956-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 11/04/2022] [Accepted: 04/28/2023] [Indexed: 05/11/2023] Open
Abstract
Post hoc attribution methods can provide insights into the learned patterns from deep neural networks (DNNs) trained on high-throughput functional genomics data. However, in practice, their resultant attribution maps can be challenging to interpret due to spurious importance scores for seemingly arbitrary nucleotides. Here, we identify a previously overlooked attribution noise source that arises from how DNNs handle one-hot encoded DNA. We demonstrate this noise is pervasive across various genomic DNNs and introduce a statistical correction that effectively reduces it, leading to more reliable attribution maps. Our approach represents a promising step towards gaining meaningful insights from DNNs in regulatory genomics.
Affiliation(s)
- Antonio Majdandzic
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY, USA
| | - Chandana Rajesh
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY, USA
| | - Peter K Koo
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY, USA.
| |
|
15
|
Lee NK, Tang Z, Toneyan S, Koo PK. EvoAug: improving generalization and interpretability of genomic deep neural networks with evolution-inspired data augmentations. Genome Biol 2023; 24:105. [PMID: 37143118 PMCID: PMC10161416 DOI: 10.1186/s13059-023-02941-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/12/2022] [Accepted: 04/17/2023] [Indexed: 05/06/2023] Open
Abstract
Deep neural networks (DNNs) hold promise for functional genomics prediction, but their generalization capability may be limited by the amount of available data. To address this, we propose EvoAug, a suite of evolution-inspired augmentations that enhance the training of genomic DNNs by increasing genetic variation. Random transformation of DNA sequences can potentially alter their function in unknown ways, so we employ a fine-tuning procedure using the original non-transformed data to preserve functional integrity. Our results demonstrate that EvoAug substantially improves the generalization and interpretability of established DNNs across prominent regulatory genomics prediction tasks, offering a robust solution for genomic DNNs.
Affiliation(s)
- Nicholas Keone Lee
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY, USA
| | - Ziqi Tang
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY, USA
| | - Shushan Toneyan
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY, USA
| | - Peter K Koo
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY, USA.
| |
|
16
|
Liu M, Huang Y, Hu J, He J, Xiao X. Algal community structure prediction by machine learning. Environ Sci Ecotechnol 2023; 14:100233. [PMID: 36793396 PMCID: PMC9923192 DOI: 10.1016/j.ese.2022.100233] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 10/16/2022] [Revised: 12/21/2022] [Accepted: 12/21/2022] [Indexed: 06/18/2023]
Abstract
The algal community structure is vital for aquatic management. However, the complicated environmental and biological processes make modeling challenging. To cope with this difficulty, we investigated using random forests (RF) to predict phytoplankton community shifts based on multi-source environmental factors (including physicochemical, hydrological, and meteorological variables). The RF models robustly predicted the algal communities composed of 13 major classes (Bray-Curtis dissimilarity = 9.2 ± 7.0%, validation NRMSE mostly <10%), with accurate simulations of the total biomass (validation R2 > 0.74) in Norway's largest lake, Lake Mjosa. The importance analysis showed that the hydro-meteorological variables (Standardized MSE and Node Purity mostly >0.5) were the most influential factors in regulating the phytoplankton. Furthermore, an in-depth ecological interpretation uncovered the interactive stress-response effect on the algal community learned by the RF models. The interpretation results disclosed that the environmental drivers (i.e., temperature, lake inflow, and nutrients) can jointly exert a strong influence on the algal community shifts. This study highlighted the power of machine learning in predicting complex algal community structures and provided insights into the model interpretability.
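For readers unfamiliar with the Bray-Curtis dissimilarity reported in this abstract, the standard definition for two abundance vectors is the sum of absolute differences divided by the sum of all abundances. A minimal sketch follows; the class biomasses are invented for illustration and `bray_curtis` is a hypothetical helper, not code from the study.

```python
def bray_curtis(observed, predicted):
    """Bray-Curtis dissimilarity between two non-negative abundance vectors."""
    if len(observed) != len(predicted):
        raise ValueError("vectors must have the same length")
    total = sum(observed) + sum(predicted)
    if total == 0:
        raise ValueError("at least one abundance must be non-zero")
    # 0 means identical communities; 1 means no shared abundance at all.
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / total


# Toy biomass values for three algal classes (illustrative only).
observed_biomass = [10.0, 20.0, 30.0]
predicted_biomass = [12.0, 18.0, 30.0]
dissimilarity = bray_curtis(observed_biomass, predicted_biomass)
print(f"Bray-Curtis dissimilarity = {dissimilarity:.1%}")
```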
Affiliation(s)
- Muyuan Liu
- Ocean College, Zhejiang University, #1 Zheda Road, Zhoushan, Zhejiang, 316021, China
| | - Yuzhou Huang
- Ocean College, Zhejiang University, #1 Zheda Road, Zhoushan, Zhejiang, 316021, China
| | - Jing Hu
- Ocean College, Zhejiang University, #1 Zheda Road, Zhoushan, Zhejiang, 316021, China
| | - Junyu He
- Ocean College, Zhejiang University, #1 Zheda Road, Zhoushan, Zhejiang, 316021, China
- Ocean Academy, Zhejiang University, #1 Zheda Road, Zhoushan, Zhejiang, 316021, China
| | - Xi Xiao
- Ocean College, Zhejiang University, #1 Zheda Road, Zhoushan, Zhejiang, 316021, China
- Key Laboratory of Marine Ecological Monitoring and Restoration Technologies, Ministry of Natural Resources, Shanghai, 201206, China
- Donghai Laboratory, Zhoushan, Zhejiang, 316021, China
- Key Laboratory of Watershed Non-point Source Pollution Control and Water Eco-security of Ministry of Water Resources, College of Environmental and Resources Sciences, Zhejiang University, Hangzhou, Zhejiang, 310058, China
| |
|
17
|
Koo PK, Ploenzke M, Anand P, Paul S, Majdandzic A. ResidualBind: Uncovering Sequence-Structure Preferences of RNA-Binding Proteins with Deep Neural Networks. Methods Mol Biol 2023; 2586:197-215. [PMID: 36705906 DOI: 10.1007/978-1-0716-2768-6_12] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/28/2023]
Abstract
Deep neural networks have demonstrated improved performance at predicting sequence specificities of DNA- and RNA-binding proteins. However, it remains unclear why they perform better than previous methods that rely on k-mers and position weight matrices. Here, we highlight a recent deep learning-based software package, called ResidualBind, that analyzes RNA-protein interactions using only RNA sequence as an input feature and performs global importance analysis for model interpretability. We discuss practical considerations for model interpretability to uncover learned sequence motifs and their secondary structure preferences.
Affiliation(s)
- Peter K Koo
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA.
| | - Matt Ploenzke
- Department of Biostatistics, Harvard University, Cambridge, MA, USA
| | | | - Steffan Paul
- Bioinformatics Program, Harvard Medical School, Boston, MA, USA
| | - Antonio Majdandzic
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
| |
|
18
|
Chen Q, Li R, Lin C, Lai C, Chen D, Qu H, Huang Y, Lu W, Tang Y, Li L. Transferability and interpretability of the sepsis prediction models in the intensive care unit. BMC Med Inform Decis Mak 2022; 22:343. [PMID: 36581881 PMCID: PMC9798724 DOI: 10.1186/s12911-022-02090-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 07/27/2022] [Accepted: 12/16/2022] [Indexed: 12/31/2022] Open
Abstract
BACKGROUND We aimed to develop an early warning system for real-time sepsis prediction in the ICU by machine learning methods, with tools for interpretative analysis of the predictions. In particular, we focus on the deployment of the system in a target medical center with small historical samples. METHODS Light Gradient Boosting Machine (LightGBM) and multilayer perceptron (MLP) were trained on the Medical Information Mart for Intensive Care (MIMIC-III) dataset and then fine-tuned on the private Historical Database of local Ruijin Hospital (HDRJH) using a transfer learning technique. The Shapley Additive Explanations (SHAP) analysis was employed to characterize the feature importance in the prediction inference. Ultimately, the performance of the sepsis prediction system was further evaluated in the real-world study in the ICU of the target Ruijin Hospital. RESULTS The datasets comprised 6891 patients from MIMIC-III, 453 from HDRJH, and 67 from Ruijin real-world data. The areas under the receiver operating characteristic curves (AUCs) for LightGBM and MLP models derived from MIMIC-III were 0.98 - 0.98 and 0.95 - 0.96 respectively on the MIMIC-III dataset, and, in comparison, 0.82 - 0.86 and 0.84 - 0.87 respectively on HDRJH, from 1 to 5 h before onset. After transfer learning and ensemble learning, the AUCs of the final ensemble model were enhanced to 0.94 - 0.94 on HDRJH and to 0.86 - 0.9 in the real-world study in the ICU of the target Ruijin Hospital. In addition, the SHAP analysis illustrated the importance of age, antibiotics, net balance, and ventilation for sepsis prediction, making the model interpretable. CONCLUSIONS Our machine learning model allows accurate real-time prediction of sepsis up to 5 h before onset. Transfer learning can effectively make it more feasible to deploy the prediction model in the target cohort, and improve the model performance in external validation.
SHAP analysis indicates that the role of antibiotic usage and fluid management needs further investigation. We argue that our system and methodology have the potential to improve ICU management by helping medical practitioners identify at-sepsis-risk patients and prepare for timely diagnosis and intervention. TRIAL REGISTRATION NCT05088850 (retrospectively registered).
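The AUC values quoted in this abstract can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. A minimal sketch of that pairwise definition follows; `roc_auc` is a hypothetical helper and the labels and scores are toy values, not data from the study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy example: two negative and two positive patients with model scores.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
print(toy_auc)
```

The quadratic pairwise loop is kept for clarity; for large cohorts a rank-based computation gives the same value more efficiently.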
Affiliation(s)
- Qiyu Chen
- Department of Applied Mathematics, School of Mathematical Sciences, Fudan University, Shanghai, 200433 China
| | - Ranran Li
- Department of Critical Care Medicine, Ruijin Hospital, Shanghai Jiaotong University School of Medicine, Shanghai, 200025 China
| | - ChihChe Lin
- Shanghai Electric Group Co., Ltd., Central Academe, Shanghai, China
| | - Chiming Lai
- Shanghai Electric Group Co., Ltd., Central Academe, Shanghai, China
| | - Dechang Chen
- Department of Critical Care Medicine, Ruijin Hospital, Shanghai Jiaotong University School of Medicine, Shanghai, 200025 China
| | - Hongping Qu
- Department of Critical Care Medicine, Ruijin Hospital, Shanghai Jiaotong University School of Medicine, Shanghai, 200025 China
| | - Yaling Huang
- Shanghai Electric Group Co., Ltd., Central Academe, Shanghai, China
| | - Wenlian Lu
- Department of Applied Mathematics, School of Mathematical Sciences, Fudan University, Shanghai, 200433 China
| | - Yaoqing Tang
- Department of Critical Care Medicine, Ruijin Hospital, Shanghai Jiaotong University School of Medicine, Shanghai, 200025 China
| | - Lei Li
- Department of Critical Care Medicine, Ruijin Hospital, Shanghai Jiaotong University School of Medicine, Shanghai, 200025 China
| |
|
19
|
Zhang X, Gavaldà R, Baixeries J. Interpretable prediction of mortality in liver transplant recipients based on machine learning. Comput Biol Med 2022; 151:106188. [PMID: 36306583 DOI: 10.1016/j.compbiomed.2022.106188] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/08/2022] [Revised: 09/24/2022] [Accepted: 10/08/2022] [Indexed: 12/27/2022]
Abstract
BACKGROUND Accurate prediction of mortality after liver transplantation is an important but challenging task. It relates to optimizing organ allocation and estimating the risk of possible dysfunction. Existing risk scoring models, such as the Balance of Risk (BAR) score and the Survival Outcomes Following Liver Transplantation (SOFT) score, do not predict mortality after liver transplantation with sufficient accuracy. In this study, we evaluate the performance of machine learning models and establish an explainable machine learning model for predicting mortality in liver transplant recipients. METHOD The optimal feature set for the prediction of mortality was selected by a wrapper method based on binary particle swarm optimization (BPSO). With the selected optimal feature set, seven machine learning models were applied to predict mortality over different time windows. The best-performing model, identified through a comprehensive comparison and evaluation, was used to predict mortality. An interpretable approach based on machine learning and SHapley Additive exPlanations (SHAP) is used to explicitly explain the model's decisions and make new discoveries. RESULTS With regard to predictive power, our results demonstrated that the feature set selected by BPSO outperformed both the feature sets in the existing risk score models (BAR score, SOFT score) and the feature set processed by principal component analysis (PCA). The best-performing model, extreme gradient boosting (XGBoost), was found to improve the area under the curve (AUC) values for mortality prediction by 6.7%, 11.6%, and 17.4% at 3 months, 3 years, and 10 years, respectively, compared to the SOFT score. The main predictors of mortality and their impact were discussed for different age groups and different follow-up periods. CONCLUSIONS Our analysis demonstrates that XGBoost can be an ideal method to assess the mortality risk in liver transplantation.
In combination with the SHAP approach, the proposed framework provides a more intuitive and comprehensive interpretation of the predictive model, thereby allowing the clinician to better understand the decision-making process of the model and the impact of factors associated with mortality risk in liver transplantation.
Affiliation(s)
- Xiao Zhang
- Department of Computer Science, Universitat Politècnica de Catalunya, Barcelona, 08034, Spain.
| | | | - Jaume Baixeries
- Department of Computer Science, Universitat Politècnica de Catalunya, Barcelona, 08034, Spain
| |
|
20
|
Zhao Y, Shao J, Asmann YW. Assessment and Optimization of Explainable Machine Learning Models Applied to Transcriptomic Data. Genomics Proteomics Bioinformatics 2022; 20:899-911. [PMID: 35931322 PMCID: PMC10025763 DOI: 10.1016/j.gpb.2022.07.003] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 11/10/2021] [Revised: 06/05/2022] [Accepted: 07/25/2022] [Indexed: 01/12/2023]
Abstract
Explainable artificial intelligence aims to interpret how machine learning models make decisions, and many model explainers have been developed in the computer vision field. However, understanding of the applicability of these model explainers to biological data is still lacking. In this study, we comprehensively evaluated multiple explainers by interpreting pre-trained models for predicting tissue types from transcriptomic data and by identifying the top contributing genes from each sample with the greatest impacts on model prediction. To improve the reproducibility and interpretability of results generated by model explainers, we proposed a series of optimization strategies for each explainer on two different model architectures of multilayer perceptron (MLP) and convolutional neural network (CNN). We observed three groups of explainer and model architecture combinations with high reproducibility. Group II, which contains three model explainers on aggregated MLP models, identified top contributing genes in different tissues that exhibited tissue-specific manifestation and were potential cancer biomarkers. In summary, our work provides novel insights and guidance for exploring biological mechanisms using explainable machine learning models.
Affiliation(s)
- Yongbing Zhao
- Department of Quantitative Health Sciences, Mayo Clinic, Jacksonville, FL 32224, USA.
| | - Jinfeng Shao
- The Laboratory of Malaria and Vector Research, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Rockville, MD 20852, USA
| | - Yan W Asmann
- Department of Quantitative Health Sciences, Mayo Clinic, Jacksonville, FL 32224, USA.
| |
|
21
|
Zhang Y, Zhang X, Razbek J, Li D, Xia W, Bao L, Mao H, Daken M, Cao M. Opening the black box: interpretable machine learning for predictor finding of metabolic syndrome. BMC Endocr Disord 2022; 22:214. [PMID: 36028865 PMCID: PMC9419421 DOI: 10.1186/s12902-022-01121-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 03/22/2022] [Accepted: 07/31/2022] [Indexed: 11/10/2022] Open
Abstract
OBJECTIVE The internal workings of machine learning algorithms are complex, and the resulting models are regarded as low-interpretability "black boxes", making it difficult for domain experts to understand and trust them. The study uses metabolic syndrome (MetS) as the entry point to analyze and evaluate the application value of model interpretability methods for predictive models that are difficult to interpret. METHODS The study collects data from a chain of health examination institutions in Urumqi from 2017 to 2019; 39,134 records remain after preprocessing such as deletion and filling. Recursive feature elimination (RFE) is used for feature selection to reduce redundancy; MetS risk prediction models (logistic, random forest, XGBoost) are built based on a feature subset, and accuracy, sensitivity, specificity, Youden index, and AUROC value are used to evaluate the model classification performance; post-hoc model-agnostic interpretation methods (variable importance, LIME) are used to interpret the results of the predictive model. RESULTS Eighteen physical examination indicators are selected by RFE, which effectively reduces redundancy in the physical examination data. Random forest and XGBoost models have higher accuracy, sensitivity, specificity, Youden index, and AUROC values compared with logistic regression. XGBoost models have higher sensitivity, Youden index, and AUROC values compared with random forest. The study uses variable importance, LIME and PDP for global and local interpretation of the optimal MetS risk prediction model (XGBoost); the different interpretation methods offer different insights into the model results, are flexible with respect to model choice, and can visualize how and why the model makes its decisions.
The interpretable risk prediction model in this study can help to identify risk factors associated with MetS, and the results showed that in addition to the traditional risk factors such as overweight and obesity, hyperglycemia, hypertension, and dyslipidemia, MetS was also associated with other factors, including age, creatinine, uric acid, and alkaline phosphatase. CONCLUSION The model interpretability methods are applied to the black box model, which can not only realize the flexibility of model application, but also make up for the uninterpretable defects of the model. Model interpretability methods can be used as a novel means of identifying variables that are more likely to be good predictors.
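Among the evaluation metrics listed in this abstract, the Youden index is the least self-explanatory: it is defined as sensitivity + specificity - 1. A minimal sketch with invented confusion-matrix counts (not the study's data) shows the calculation; `youden_index` is a hypothetical helper.

```python
def youden_index(tp, fn, tn, fp):
    """Return (sensitivity, specificity, Youden's J) from confusion-matrix counts."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return sensitivity, specificity, sensitivity + specificity - 1.0


# Toy counts: 100 people with MetS and 100 without (illustrative only).
sens, spec, j_stat = youden_index(tp=80, fn=20, tn=90, fp=10)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} J={j_stat:.2f}")
```

A value of 0 corresponds to a classifier no better than chance, and 1 to perfect separation, which is why the index is convenient for comparing models at a single operating point.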
Affiliation(s)
- Yan Zhang
- Department of Epidemiology and Health Statistics, College of Public Health, Xinjiang Medical University, Urumqi, Xinjiang, China
| | - Xiaoxu Zhang
- Department of Epidemiology and Health Statistics, College of Public Health, Xinjiang Medical University, Urumqi, Xinjiang, China
| | - Jaina Razbek
- Department of Epidemiology and Health Statistics, College of Public Health, Xinjiang Medical University, Urumqi, Xinjiang, China
| | - Deyang Li
- Department of Epidemiology and Health Statistics, College of Public Health, Xinjiang Medical University, Urumqi, Xinjiang, China
| | - Wenjun Xia
- Department of Epidemiology and Health Statistics, College of Public Health, Xinjiang Medical University, Urumqi, Xinjiang, China
| | - Liangliang Bao
- Department of Epidemiology and Health Statistics, College of Public Health, Xinjiang Medical University, Urumqi, Xinjiang, China
| | - Hongkai Mao
- Department of Epidemiology and Health Statistics, College of Public Health, Xinjiang Medical University, Urumqi, Xinjiang, China
| | - Mayisha Daken
- Department of Epidemiology and Health Statistics, College of Public Health, Xinjiang Medical University, Urumqi, Xinjiang, China
| | - Mingqin Cao
- Department of Epidemiology and Health Statistics, College of Public Health, Xinjiang Medical University, Urumqi, Xinjiang, China.
| |
|
22
|
Coupet M, Urruty T, Leelanupab T, Naudin M, Bourdon P, Maloigne CF, Guillevin R. A multi-sequences MRI deep framework study applied to glioma classfication. Multimed Tools Appl 2022; 81:13563-13591. [PMID: 35250358 PMCID: PMC8882719 DOI: 10.1007/s11042-022-12316-1] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 07/31/2020] [Revised: 09/02/2021] [Accepted: 01/17/2022] [Indexed: 06/14/2023]
Abstract
Glioma is one of the most important central nervous system tumors, ranked 15th among the most common cancers in men and women. Magnetic Resonance Imaging (MRI) is a common tool used by medical experts for the diagnosis of glioma. A set of multi-sequences from an MRI is selected according to the severity of the pathology. Our proposed approach aims to create a computer-aided system that is capable of helping the expert diagnose brain gliomas. We propose a supervised learning regime based on a convolutional neural network framework and transfer learning techniques. Our research focuses on the performance of different pre-trained deep learning models with respect to different MRI sequences. We highlight the best combinations of model and MRI sequence for our specific task of classifying healthy brains against brains with glioma. We also propose to visually analyze the extracted deep features to study the relation between the MRI sequences and the models. This interpretability analysis gives medical experts some hints for understanding the diagnosis made by the models. Our study is based on the well-known BraTS datasets, which include multi-sequence images and expert diagnoses.
Collapse
Affiliation(s)
- Matthieu Coupet
- XLIM Laboratory, University of Poitiers, UMR CNRS 7252, Poitiers, France
- I3M, Common Laboratory CNRS-Siemens, University and Hospital of Poitiers, Poitiers, France
| | - Thierry Urruty
- XLIM Laboratory, University of Poitiers, UMR CNRS 7252, Poitiers, France
- I3M, Common Laboratory CNRS-Siemens, University and Hospital of Poitiers, Poitiers, France
| | - Teerapong Leelanupab
- Faculty of Information Technology, King Mongkut’s Institute of Technology Ladkrabang (KMITL), Bangkok, 10520 Thailand
| | - Mathieu Naudin
- I3M, Common Laboratory CNRS-Siemens, University and Hospital of Poitiers, Poitiers, France
- Poitiers University Hospital, CHU, Poitiers, France
| | - Pascal Bourdon
- I3M, Common Laboratory CNRS-Siemens, University and Hospital of Poitiers, Poitiers, France
- Poitiers University Hospital, CHU, Poitiers, France
| | - Christine Fernandez Maloigne
- I3M, Common Laboratory CNRS-Siemens, University and Hospital of Poitiers, Poitiers, France
- Poitiers University Hospital, CHU, Poitiers, France
| | - Rémy Guillevin
- I3M, Common Laboratory CNRS-Siemens, University and Hospital of Poitiers, Poitiers, France
- Poitiers University Hospital, CHU, Poitiers, France
- DACTIM-MIS/LMA Laboratory University of Poitiers, UMR CNRS 7348, Poitiers, France
| |
|
23
|
Gao F, Shen Y, Brett Sallach J, Li H, Zhang W, Li Y, Liu C. Predicting crop root concentration factors of organic contaminants with machine learning models. J Hazard Mater 2022; 424:127437. [PMID: 34678561 DOI: 10.1016/j.jhazmat.2021.127437] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/22/2021] [Revised: 09/15/2021] [Accepted: 10/03/2021] [Indexed: 06/13/2023]
Abstract
Accurate prediction of uptake and accumulation of organic contaminants by crops from soils is essential to assessing human exposure via the food chain. However, traditional empirical or mechanistic models frequently show variable performance due to complex interactions among contaminants, soils, and plants. Thus, in this study different machine learning algorithms were compared and applied to predict root concentration factors (RCFs) based on a dataset comprising 57 chemicals and 11 crops, followed by comparison with a traditional linear regression model as the benchmark. The RCF patterns and predictions were investigated by unsupervised t-distributed stochastic neighbor embedding and four supervised machine learning models including Random Forest, Gradient Boosting Regression Tree, Fully Connected Neural Network, and Support Vector Regression based on 15 property descriptors. The Fully Connected Neural Network demonstrated superior prediction performance for RCFs (R² = 0.79, mean absolute error [MAE] = 0.22) over other machine learning models (R² = 0.68-0.76, MAE = 0.23-0.26). All four machine learning models performed better than the traditional linear regression model (R² = 0.62, MAE = 0.29). Four key property descriptors were identified in predicting RCFs. Specifically, increasing root lipid content and decreasing soil organic matter content increased RCFs, while increasing excess molar refractivity and molecular volume of contaminants decreased RCFs. These results show that machine learning models can improve prediction accuracy by learning nonlinear relationships between RCFs and properties of contaminants, soils, and plants.
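The two metrics quoted in this abstract, the coefficient of determination and the mean absolute error, can be illustrated with a minimal sketch. The function names and the numbers below are hypothetical and are not taken from the study; the abstract does not state how the RCF values were scaled, so no transformation is assumed here.

```python
from statistics import mean


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


# Hypothetical observed and predicted values, for illustration only.
observed = [0.10, 0.45, 0.80, 1.20, 1.65]
predicted = [0.20, 0.40, 0.95, 1.10, 1.50]

r2 = r2_score(observed, predicted)
mae = mean_absolute_error(observed, predicted)
print(f"R2 = {r2:.3f}, MAE = {mae:.3f}")
```

A higher R² together with a lower MAE is the pattern the abstract reports when comparing the neural network with the linear regression benchmark.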
Affiliation(s)
- Feng Gao
- Department of Genetics, School of Medicine, Yale University, New Haven, CT 06510, United States
| | - Yike Shen
- Department of Environmental Health Sciences, Mailman School of Public Health, Columbia University, New York, NY 10032, United States
| | - J Brett Sallach
- Department of Environment and Geography, University of York, Heslington, York YO10 5NG, United Kingdom
| | - Hui Li
- Department of Plant, Soil and Microbial Sciences, Michigan State University, East Lansing, MI, 48823, United States
| | - Wei Zhang
- Department of Plant, Soil and Microbial Sciences, Michigan State University, East Lansing, MI, 48823, United States
| | - Yuanbo Li
- State Key Laboratory for Biology of Plant Diseases and Insect Pests, Institute of Plant Protection, Chinese Academy of Agricultural Sciences, Beijing 100193, PR China.
| | - Cun Liu
- Key Laboratory of Soil Environment and Pollution Remediation, Institute of Soil Science, Chinese Academy of Sciences, Nanjing 210008, PR China.
| |
|
24
|
Susnjak T, Ramaswami GS, Mathrani A. Learning analytics dashboard: a tool for providing actionable insights to learners. Int J Educ Technol High Educ 2022; 19:12. [PMID: 35194560 PMCID: PMC8853217 DOI: 10.1186/s41239-021-00313-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/23/2021] [Accepted: 12/14/2021] [Indexed: 06/14/2023]
Abstract
This study investigates current approaches to learning analytics (LA) dashboarding while highlighting challenges faced by education providers in their operationalization. We analyze recent dashboards for their ability to provide actionable insights which promote informed responses by learners in making adjustments to their learning habits. Our study finds that most LA dashboards merely employ surface-level descriptive analytics, while only few go beyond and use predictive analytics. In response to the identified gaps in recently published dashboards, we propose a state-of-the-art dashboard that not only leverages descriptive analytics components, but also integrates machine learning in a way that enables both predictive and prescriptive analytics. We demonstrate how emerging analytics tools can be used in order to enable learners to adequately interpret the predictive model behavior, and more specifically to understand how a predictive model arrives at a given prediction. We highlight how these capabilities build trust and satisfy emerging regulatory requirements surrounding predictive analytics. Additionally, we show how data-driven prescriptive analytics can be deployed within dashboards in order to provide concrete advice to the learners, and thereby increase the likelihood of triggering behavioral changes. Our proposed dashboard is the first of its kind in terms of breadth of analytics that it integrates, and is currently deployed for trials at a higher education institution.
Affiliation(s)
- Teo Susnjak
- School of Mathematical and Computational Sciences, Massey University, Auckland, New Zealand
| | | | - Anuradha Mathrani
- School of Mathematical and Computational Sciences, Massey University, Auckland, New Zealand
| |
|
25
|
Abbas A, O'Byrne C, Fu DJ, Moraes G, Balaskas K, Struyven R, Beqiri S, Wagner SK, Korot E, Keane PA. Evaluating an automated machine learning model that predicts visual acuity outcomes in patients with neovascular age-related macular degeneration. Graefes Arch Clin Exp Ophthalmol 2022. [PMID: 35122132 DOI: 10.1007/s00417-021-05544-y] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 06/28/2021] [Revised: 11/10/2021] [Accepted: 12/27/2021] [Indexed: 01/01/2023]
Abstract
Purpose Neovascular age-related macular degeneration (nAMD) is a major global cause of blindness. Whilst anti-vascular endothelial growth factor (anti-VEGF) treatment is effective, response varies considerably between individuals. Thus, patients face substantial uncertainty regarding their future ability to perform daily tasks. In this study, we evaluate the performance of an automated machine learning (AutoML) model which predicts visual acuity (VA) outcomes in patients receiving treatment for nAMD, in comparison to a manually coded model built using the same dataset. Furthermore, we evaluate model performance across ethnic groups and analyse how the models reach their predictions. Methods Binary classification models were trained to predict whether patients’ VA would be ‘Above’ or ‘Below’ a score of 70 one year after initiating treatment, measured using the Early Treatment Diabetic Retinopathy Study (ETDRS) chart. The AutoML model was built using the Google Cloud Platform, whilst the bespoke model was trained using an XGBoost framework. Models were compared and analysed using the What-if Tool (WIT), a novel model-agnostic interpretability tool. Results Our study included 1631 eyes from patients attending Moorfields Eye Hospital. The AutoML model (area under the curve [AUC], 0.849) achieved a highly similar performance to the XGBoost model (AUC, 0.847). Using the WIT, we found that the models over-predicted negative outcomes in Asian patients and performed worse in those with an ethnic category of Other. Baseline VA, age and ethnicity were the most important determinants of model predictions. Partial dependence plot analysis revealed a sigmoidal relationship between baseline VA and the probability of an outcome of ‘Above’. Conclusion We have described and validated an AutoML-WIT pipeline which enables clinicians with minimal coding skills to match the performance of a state-of-the-art algorithm and obtain explainable predictions. 
Supplementary Information The online version contains supplementary material available at 10.1007/s00417-021-05544-y.
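The area under the curve (AUC) used to compare the two classifiers can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it that way in plain Python; the function name, labels and scores are hypothetical and unrelated to the Moorfields dataset.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = 'Above', 0 = 'Below') and predicted probabilities.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

The pairwise loop is quadratic in the number of cases, which is fine for illustration; rank-based implementations are preferable for larger cohorts.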
|
26
|
Romeo L, Frontoni E. A Unified Hierarchical XGBoost model for classifying priorities for COVID-19 vaccination campaign. Pattern Recognit 2022; 121:108197. [PMID: 34312570 PMCID: PMC8295058 DOI: 10.1016/j.patcog.2021.108197] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Received: 03/31/2021] [Revised: 06/21/2021] [Accepted: 07/20/2021] [Indexed: 05/03/2023]
Abstract
Current ML approaches do not fully address a still unresolved and topical challenge, namely the prediction of priorities for COVID-19 vaccine administration. Thus, our task includes some additional methodological challenges mainly related to avoiding unwanted bias while handling categorical and ordinal data with a highly imbalanced nature. Hence, the main contribution of this study is to propose a machine learning algorithm, namely Hierarchical Priority Classification eXtreme Gradient Boosting, for priority classification for COVID-19 vaccine administration using the Italian Federation of General Practitioners dataset that contains Electronic Health Record data of 17k patients. We measured the effectiveness of the proposed methodology for classifying all the priority classes while demonstrating a significant improvement with respect to the state of the art. The proposed ML approach, which is integrated into a clinical decision support system, is currently supporting General Practitioners in assigning COVID-19 vaccine administration priorities to their assistants.
Affiliation(s)
- Luca Romeo
- Department of Information Engineering (DII), Università Politecnica delle Marche, Ancona, Italy
- Computational Statistics and Machine Learning, Istituto Italiano di Tecnologia, Genova, Italy
| | - Emanuele Frontoni
- Department of Information Engineering (DII), Università Politecnica delle Marche, Ancona, Italy
| |
|
27
|
Pucci F, Schwersensky M, Rooman M. Artificial intelligence challenges for predicting the impact of mutations on protein stability. Curr Opin Struct Biol 2021; 72:161-8. [PMID: 34922207 DOI: 10.1016/j.sbi.2021.11.001] [Citation(s) in RCA: 33] [Impact Index Per Article: 11.0] [Received: 06/30/2021] [Revised: 09/15/2021] [Accepted: 11/08/2021] [Indexed: 01/17/2023]
Abstract
Stability is a key ingredient of protein fitness, and its modification through targeted mutations has applications in various fields, such as protein engineering, drug design, and deleterious variant interpretation. Many studies have been devoted over the past decades to building new, more effective methods for predicting the impact of mutations on protein stability based on the latest developments in artificial intelligence. We discuss their features, algorithms, computational efficiency, and accuracy estimated on an independent test set. We focus on a critical analysis of their limitations, the recurrent biases toward the training set, their generalizability, and interpretability. We found that the accuracy of the predictors has stagnated at around 1 kcal/mol for over 15 years. We conclude by discussing the challenges that need to be addressed to reach improved performance.
|
28
|
Thomas M, Boardman A, Garcia-Ortegon M, Yang H, de Graaf C, Bender A. Applications of Artificial Intelligence in Drug Design: Opportunities and Challenges. Methods Mol Biol 2021; 2390:1-59. [PMID: 34731463 DOI: 10.1007/978-1-0716-1787-8_1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
Abstract
Artificial intelligence (AI) has undergone rapid development in recent years and has been successfully applied to real-world problems such as drug design. In this chapter, we review recent applications of AI to problems in drug design including virtual screening, computer-aided synthesis planning, and de novo molecule generation, with a focus on the limitations of the application of AI therein and opportunities for improvement. Furthermore, we discuss the broader challenges imposed by AI in translating theoretical practice to real-world drug design, including quantifying prediction uncertainty and explaining model behavior.
Affiliation(s)
- Morgan Thomas
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Cambridge, UK
| | - Andrew Boardman
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Cambridge, UK
| | - Miguel Garcia-Ortegon
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Cambridge, UK.,Department of Pure Mathematics and Mathematical Statistics, University of Cambridge, Cambridge, UK
| | - Hongbin Yang
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Cambridge, UK
| | | | - Andreas Bender
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Cambridge, UK.
| |
|
29
|
Ferré Q, Chèneby J, Puthier D, Capponi C, Ballester B. Anomaly detection in genomic catalogues using unsupervised multi-view autoencoders. BMC Bioinformatics 2021; 22:460. [PMID: 34563116 PMCID: PMC8467021 DOI: 10.1186/s12859-021-04359-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/15/2020] [Revised: 06/04/2021] [Accepted: 08/09/2021] [Indexed: 11/13/2022]
Abstract
Background Accurate identification of Transcriptional Regulator binding locations is essential for analysis of genomic regions, including Cis Regulatory Elements. The customary NGS approaches, predominantly ChIP-Seq, can be obscured by data anomalies and biases which are difficult to detect without supervision. Results Here, we develop a method to leverage the usual combinations between many experimental series to mark such atypical peaks. We use deep learning to perform a lossy compression of the genomic regions’ representations with multiview convolutions. Using artificial data, we show that our method correctly identifies groups of correlating series and evaluates CRE according to group completeness. It is then applied to the ReMap database’s large volume of curated ChIP-seq data. We show that peaks lacking known biological correlators are singled out and less confirmed in real data. We propose normalization approaches useful in interpreting black-box models. Conclusion Our approach detects peaks that are less corroborated than average. It can be extended to other similar problems, and can be interpreted to identify correlation groups. It is implemented in an open-source tool called atyPeak. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04359-2.
Affiliation(s)
- Quentin Ferré
- INSERM, TAGC, Aix Marseille University, Marseille, France.,Université de Toulon, CNRS, LIS, Aix Marseille University, Marseille, France
| | - Jeanne Chèneby
- INSERM, TAGC, Aix Marseille University, Marseille, France
| | - Denis Puthier
- INSERM, TAGC, Aix Marseille University, Marseille, France
| | - Cécile Capponi
- Université de Toulon, CNRS, LIS, Aix Marseille University, Marseille, France.
| | | |
|
30
|
Tideman LEM, Migas LG, Djambazova KV, Patterson NH, Caprioli RM, Spraggins JM, Van de Plas R. Automated biomarker candidate discovery in imaging mass spectrometry data through spatially localized Shapley additive explanations. Anal Chim Acta 2021; 1177:338522. [PMID: 34482894 PMCID: PMC10124144 DOI: 10.1016/j.aca.2021.338522] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 01/12/2021] [Revised: 04/04/2021] [Accepted: 04/11/2021] [Indexed: 01/09/2023]
Abstract
The search for molecular species that are differentially expressed between biological states is an important step towards discovering promising biomarker candidates. In imaging mass spectrometry (IMS), performing this search manually is often impractical due to the large size and high-dimensionality of IMS datasets. Instead, we propose an interpretable machine learning workflow that automatically identifies biomarker candidates by their mass-to-charge ratios, and that quantitatively estimates their relevance to recognizing a given biological class using Shapley additive explanations (SHAP). The task of biomarker candidate discovery is translated into a feature ranking problem: given a classification model that assigns pixels to different biological classes on the basis of their mass spectra, the molecular species that the model uses as features are ranked in descending order of relative predictive importance such that the top-ranking features have a higher likelihood of being useful biomarkers. Besides providing the user with an experiment-wide measure of a molecular species' biomarker potential, our workflow delivers spatially localized explanations of the classification model's decision-making process in the form of a novel representation called SHAP maps. SHAP maps deliver insight into the spatial specificity of biomarker candidates by highlighting in which regions of the tissue sample each feature provides discriminative information and in which regions it does not. SHAP maps also enable one to determine whether the relationship between a biomarker candidate and a biological state of interest is correlative or anticorrelative. 
Our automated approach to estimating a molecular species' potential for characterizing a user-provided biological class, combined with the untargeted and multiplexed nature of IMS, allows for the rapid screening of thousands of molecular species and the obtention of a broader biomarker candidate shortlist than would be possible through targeted manual assessment. Our biomarker candidate discovery workflow is demonstrated on mouse-pup and rat kidney case studies.
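The feature ranking described in this abstract rests on Shapley values, which attribute a model's output to its input features. The sketch below computes exact Shapley values for a tiny hypothetical model by averaging marginal contributions over all feature orderings. It illustrates the underlying definition only: the `toy_model`, the single zero baseline and all numbers are assumptions made for the example, and this is not the authors' workflow or the approximation schemes used by SHAP tooling.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one input vector.

    Features not yet 'revealed' in an ordering are held at their baseline
    value; each feature's marginal contribution is averaged over orderings.
    """
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        prev = predict(current)
        for i in order:
            current[i] = x[i]
            new = predict(current)
            phi[i] += new - prev
            prev = new
    return [v / len(orderings) for v in phi]


def toy_model(features):
    """Hypothetical model: one additive term and one interaction term."""
    a, b, c = features
    return 2.0 * a + b * c


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)

# Rank features by absolute contribution, largest first.
ranking = sorted(range(len(phi)), key=lambda i: abs(phi[i]), reverse=True)
print(phi, ranking)
```

The contributions sum to the difference between the prediction for `x` and the prediction for the baseline, and the sign of each value indicates whether the feature pushed the output up or down, which corresponds to the correlative versus anticorrelative distinction mentioned in the abstract. Exact enumeration grows factorially with the number of features, so it is only practical for very small examples.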
Affiliation(s)
- Leonoor E M Tideman
- Delft Center for Systems and Control, Delft University of Technology, Delft, Netherlands
| | - Lukasz G Migas
- Delft Center for Systems and Control, Delft University of Technology, Delft, Netherlands
| | - Katerina V Djambazova
- Mass Spectrometry Research Center, Vanderbilt University, Nashville, TN, USA; Department of Chemistry, Vanderbilt University, Nashville, TN, USA
| | - Nathan Heath Patterson
- Mass Spectrometry Research Center, Vanderbilt University, Nashville, TN, USA; Department of Biochemistry, Vanderbilt University, Nashville, TN, USA
| | - Richard M Caprioli
- Mass Spectrometry Research Center, Vanderbilt University, Nashville, TN, USA; Department of Biochemistry, Vanderbilt University, Nashville, TN, USA; Department of Chemistry, Vanderbilt University, Nashville, TN, USA; Department of Pharmacology, Vanderbilt University, Nashville, TN, USA; Department of Medicine, Vanderbilt University, Nashville, TN, USA
| | - Jeffrey M Spraggins
- Mass Spectrometry Research Center, Vanderbilt University, Nashville, TN, USA; Department of Biochemistry, Vanderbilt University, Nashville, TN, USA; Department of Chemistry, Vanderbilt University, Nashville, TN, USA
| | - Raf Van de Plas
- Delft Center for Systems and Control, Delft University of Technology, Delft, Netherlands; Mass Spectrometry Research Center, Vanderbilt University, Nashville, TN, USA; Department of Biochemistry, Vanderbilt University, Nashville, TN, USA.
| |
|
31
|
Walakira A, Rozman D, Režen T, Mraz M, Moškon M. Guided extraction of genome-scale metabolic models for the integration and analysis of omics data. Comput Struct Biotechnol J 2021; 19:3521-3530. [PMID: 34194675 PMCID: PMC8225705 DOI: 10.1016/j.csbj.2021.06.009] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 04/08/2021] [Revised: 06/04/2021] [Accepted: 06/04/2021] [Indexed: 02/05/2023]
Abstract
Omics data can be integrated into a reference model using various model extraction methods (MEMs) to yield context-specific genome-scale metabolic models (GEMs). How to choose the appropriate MEM, thresholding rule and threshold remains a challenge. We integrated mouse transcriptomic data from a Cyp51 knockout mice diet experiment (GSE58271) using five MEMs (GIMME, iMAT, FASTCORE, INIT and tINIT) in combination with a recently published mouse GEM iMM1865. Except for INIT and tINIT, the size of extracted models varied with the MEM used (t-test: p-value < 0.001). The Jaccard index of iMAT models ranged from 0.27 to 1.0. Out of the three factors under study in the experiment (diet, gender and genotype), gender explained most of the variability (> 90%) in PC1 for FASTCORE. In iMAT, each of the three factors explained less than 40% of the variability within PC1, PC2 and PC3. Among all the MEMs, FASTCORE captured most of the true variability in the data by clustering samples by gender. Our results show that for the efficient use of MEMs in the context of omics data integration and analysis, one should apply various MEMs, thresholding rules, and thresholding values to select the MEM and its configuration that best captures the true variability in the data. This selection can be guided by the methodology as proposed and used in this paper. Moreover, we describe certain approaches that can be used to analyse the results obtained with the selected MEM and to put these results in a biological context.
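The Jaccard index mentioned in this abstract measures the overlap between two sets, for example the sets of reactions retained in two extracted models. A minimal sketch follows, assuming each model is represented simply as a set of reaction identifiers; the identifiers are invented for illustration and the abstract does not say which model components were compared.

```python
def jaccard_index(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty sets are identical.
        return 1.0
    return len(a & b) / len(union)


# Hypothetical reaction identifiers for two extracted models.
model_1 = {"R1", "R2", "R3", "R4"}
model_2 = {"R2", "R3", "R5"}
similarity = jaccard_index(model_1, model_2)
print(similarity)
```

A value of 1.0 means the two models contain exactly the same elements, and values near 0 mean they share almost none, which is how the reported range of 0.27 to 1.0 can be read.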
Affiliation(s)
- Andrew Walakira
- Centre for Functional Genomics and Bio-Chips, Institute for Biochemistry and Molecular Genetics, Faculty of Medicine, University of Ljubljana, Ljubljana, Slovenia
| | - Damjana Rozman
- Centre for Functional Genomics and Bio-Chips, Institute for Biochemistry and Molecular Genetics, Faculty of Medicine, University of Ljubljana, Ljubljana, Slovenia
| | - Tadeja Režen
- Centre for Functional Genomics and Bio-Chips, Institute for Biochemistry and Molecular Genetics, Faculty of Medicine, University of Ljubljana, Ljubljana, Slovenia
| | - Miha Mraz
- Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia
| | - Miha Moškon
- Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia
| |
|
32
|
Abstract
Although an increasing number of ethical data science and AI courses is available, with many focusing specifically on technology and computer ethics, pedagogical approaches employed in these courses rely exclusively on texts rather than on algorithmic development or data analysis. In this paper we recount a recent experience in developing and teaching a technical course focused on responsible data science, which tackles the issues of ethics in AI, legal compliance, data quality, algorithmic fairness and diversity, transparency of data and algorithms, privacy, and data protection. Interpretability of machine-assisted decision-making is an important component of responsible data science that gives a good lens through which to see other responsible data science topics, including privacy and fairness. We provide emerging pedagogical best practices for teaching technical data science and AI courses that focus on interpretability, and tie responsible data science to current learning science and learning analytics research. We focus on a novel methodological notion of the object-to-interpret-with, a representation that helps students target metacognition involving interpretation and representation. In the context of interpreting machine learning models, we highlight the suitability of “nutritional labels”—a family of interpretability tools that are gaining popularity in responsible data science research and practice.
|
33
|
Sushil M, Šuster S, Luyckx K, Daelemans W. Patient representation learning and interpretable evaluation using clinical notes. J Biomed Inform 2018; 84:103-13. [PMID: 29966746 DOI: 10.1016/j.jbi.2018.06.016] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Received: 02/16/2018] [Revised: 06/07/2018] [Accepted: 06/28/2018] [Indexed: 11/22/2022]
Abstract
We have three contributions in this work: 1. We explore the utility of a stacked denoising autoencoder and a paragraph vector model to learn task-independent dense patient representations directly from clinical notes. To analyze if these representations are transferable across tasks, we evaluate them in multiple supervised setups to predict patient mortality, primary diagnostic and procedural category, and gender. We compare their performance with sparse representations obtained from a bag-of-words model. We observe that the learned generalized representations significantly outperform the sparse representations when we have few positive instances to learn from, and there is an absence of strong lexical features. 2. We compare the model performance of the feature set constructed from a bag of words to that obtained from medical concepts. In the latter case, concepts represent problems, treatments, and tests. We find that concept identification does not improve the classification performance. 3. We propose novel techniques to facilitate model interpretability. To understand and interpret the representations, we explore the best encoded features within the patient representations obtained from the autoencoder model. Further, we calculate feature sensitivity across two networks to identify the most significant input features for different classification tasks when we use these pretrained representations as the supervised input. We successfully extract the most influential features for the pipeline using this technique.
|
34
|
Rios A, Kavuluru R. Ordinal convolutional neural networks for predicting RDoC positive valence psychiatric symptom severity scores. J Biomed Inform 2017; 75S:S85-S93. [PMID: 28506904 PMCID: PMC5682241 DOI: 10.1016/j.jbi.2017.05.008] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 02/01/2017] [Revised: 04/04/2017] [Accepted: 05/10/2017] [Indexed: 10/19/2022]
Abstract
BACKGROUND The CEGS N-GRID 2016 Shared Task in Clinical Natural Language Processing (NLP) provided a set of 1000 neuropsychiatric notes to participants as part of a competition to predict psychiatric symptom severity scores. This paper summarizes our methods, results, and experiences based on our participation in the second track of the shared task. OBJECTIVE Classical methods of text classification usually fall into one of three problem types: binary, multi-class, and multi-label classification. In this effort, we study ordinal regression problems with text data where misclassifications are penalized differently based on how far apart the ground truth and model predictions are on the ordinal scale. Specifically, we present our entries (methods and results) in the N-GRID shared task in predicting research domain criteria (RDoC) positive valence ordinal symptom severity scores (absent, mild, moderate, and severe) from psychiatric notes. METHODS We propose a novel convolutional neural network (CNN) model designed to handle ordinal regression tasks on psychiatric notes. Broadly speaking, our model combines an ordinal loss function, a CNN, and conventional feature engineering (wide features) into a single model which is learned end-to-end. Given interpretability is an important concern with nonlinear models, we apply a recent approach called locally interpretable model-agnostic explanation (LIME) to identify important words that lead to instance specific predictions. RESULTS Our best model entered into the shared task placed third among 24 teams and scored a macro mean absolute error (MMAE) based normalized score (100·(1-MMAE)) of 83.86. Since the competition, we improved our score (using basic ensembling) to 85.55, comparable with the winning shared task entry. Applying LIME to model predictions, we demonstrate the feasibility of instance specific prediction interpretation by identifying words that led to a particular decision. 
CONCLUSION In this paper, we present a method that successfully uses wide features and an ordinal loss function applied to convolutional neural networks for ordinal text classification specifically in predicting psychiatric symptom severity scores. Our approach leads to excellent performance on the N-GRID shared task and is also amenable to interpretability using existing model-agnostic approaches.
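The normalized score 100·(1-MMAE) quoted in this abstract can be sketched as follows. The abstract does not spell out how the macro mean absolute error is normalized, so this example assumes one plausible scheme: the mean absolute error is computed separately for each gold severity level, divided by the largest error possible for that level, and then averaged over levels. The function name, that normalization and the sample labels are assumptions for illustration, not the official shared-task scorer.

```python
def normalized_mmae_score(y_true, y_pred, num_levels=4):
    """100 * (1 - MMAE) on an ordinal scale 0 .. num_levels - 1.

    Assumed normalization: each level's MAE is divided by the largest
    possible error for that level before macro-averaging.
    """
    per_level = []
    for level in range(num_levels):
        errors = [abs(t - p) for t, p in zip(y_true, y_pred) if t == level]
        if not errors:
            continue  # skip levels that never occur in the gold labels
        max_error = max(level, num_levels - 1 - level)
        per_level.append(sum(errors) / len(errors) / max_error)
    if not per_level:
        raise ValueError("no gold labels within the expected range")
    mmae = sum(per_level) / len(per_level)
    return 100.0 * (1.0 - mmae)


# Hypothetical gold and predicted severities (0 = absent ... 3 = severe).
gold = [0, 0, 1, 2, 3, 3]
pred = [0, 1, 1, 3, 3, 1]
score = normalized_mmae_score(gold, pred)
print(score)
```

Because the average is taken per level rather than per note, rare severity levels weigh as much as common ones, and predictions that land far from the gold level on the ordinal scale are penalized more than near misses.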
Affiliation(s)
- Anthony Rios
- Department of Computer Science, University of Kentucky, 329 Rose Street, Lexington, KY 40506, USA.
| | - Ramakanth Kavuluru
- Department of Computer Science, University of Kentucky, 329 Rose Street, Lexington, KY 40506, USA; Division of Biomedical Informatics, Department of Internal Medicine, University Kentucky, 725 Rose Street, Lexington, KY 40536, USA.
| |
|
35
|
Jovanovic M, Radovanovic S, Vukicevic M, Van Poucke S, Delibasic B. Building interpretable predictive models for pediatric hospital readmission using Tree-Lasso logistic regression. Artif Intell Med 2016; 72:12-21. [PMID: 27664505 DOI: 10.1016/j.artmed.2016.07.003] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.6] [Received: 11/06/2015] [Revised: 07/23/2016] [Accepted: 07/25/2016] [Indexed: 11/18/2022]
Abstract
OBJECTIVES Quantification and early identification of unplanned readmission risk have the potential to improve the quality of care during hospitalization and after discharge. However, the high dimensionality, sparsity, and class imbalance of electronic health data, together with the complexity of risk quantification, challenge the development of accurate predictive models. Predictive models require a certain level of interpretability in order to be applicable in real settings and create actionable insights. This paper aims to develop accurate and interpretable predictive models for readmission in a general pediatric patient population by integrating a data-driven model (sparse logistic regression) and domain knowledge based on the International Classification of Diseases, 9th Revision, Clinical Modification (ICD-9-CM) hierarchy of diseases. Additionally, we propose a way to quantify the interpretability of a model and inspect the stability of alternative solutions. MATERIALS AND METHODS The analysis was conducted on >66,000 pediatric hospital discharge records from California, State Inpatient Databases, Healthcare Cost and Utilization Project between 2009 and 2011. We incorporated domain knowledge based on the ICD-9-CM hierarchy in a data-driven, Tree-Lasso regularized logistic regression model, providing the framework for model interpretation. This approach was compared with traditional Lasso logistic regression; it yielded models that are easier to interpret through fewer high-level diagnoses, with comparable prediction accuracy. RESULTS The results revealed that the Tree-Lasso model was competitive with traditional Lasso logistic regression in terms of accuracy (measured by the area under the receiver operating characteristic curve, AUC), while integration with the ICD-9-CM hierarchy of diseases provided more interpretable models in terms of high-level diagnoses.
Additionally, the interpretations of the models are in accordance with existing medical understanding of pediatric readmission. The best-performing models perform similarly, reaching AUC values of 0.783 and 0.779 for traditional Lasso and Tree-Lasso, respectively. However, the information loss of the Lasso models is 0.35 bits higher than that of the Tree-Lasso model. CONCLUSIONS We propose a method for building predictive models applicable to the detection of readmission risk based on electronic health records. Integration of domain knowledge (in the form of the ICD-9-CM taxonomy) and a data-driven, sparse predictive algorithm (Tree-Lasso logistic regression) resulted in an increase in the interpretability of the resulting model. The models are interpreted for the readmission prediction problem in the general pediatric population in California, as well as in several important subpopulations, and the interpretations of the models comply with existing medical understanding of pediatric readmission. Finally, a quantitative assessment of the interpretability of the models is given that goes beyond simple counts of selected low-level features.
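To make the role of the disease hierarchy concrete, the following minimal sketch computes a tree-structured group lasso penalty as it is commonly defined: a weighted sum of L2 norms of the coefficients grouped under each node of the tree. The abstract does not give the exact formulation used by the authors, so this is an illustration under that assumption; the node names, the unit weights, and the coefficient values are hypothetical and are not real ICD-9-CM codes.

```python
import math


def tree_lasso_penalty(beta, groups, weights=None):
    """Sum over tree nodes of weight * L2 norm of the node's coefficients.

    beta:    dict mapping leaf feature name -> coefficient
    groups:  dict mapping tree node name -> list of leaf features under it
    weights: optional dict mapping node name -> weight (assumed 1.0 if absent)
    """
    total = 0.0
    for node, members in groups.items():
        weight = 1.0 if weights is None else weights[node]
        total += weight * math.sqrt(sum(beta[m] ** 2 for m in members))
    return total


# Hypothetical two-level hierarchy: chapters "A" and "B" under a root,
# with leaf diagnoses a1 and a2 under "A" and b1 under "B".
groups = {
    "root": ["a1", "a2", "b1"],
    "A": ["a1", "a2"],
    "B": ["b1"],
    "a1": ["a1"],
    "a2": ["a2"],
    "b1": ["b1"],
}
beta = {"a1": 3.0, "a2": 4.0, "b1": 0.0}

penalty = tree_lasso_penalty(beta, groups)
```

Since every internal node contributes its own norm, setting all coefficients under a chapter to zero removes that chapter's term entirely, which is how a penalty of this shape can favor selecting or discarding whole high-level diagnosis groups rather than isolated low-level codes.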
Affiliation(s)
- Milos Jovanovic
- University of Belgrade, Faculty of Organizational Sciences, Jove Ilica 154, 11010 Vozdovac, Belgrade, Serbia.
- Sandro Radovanovic
- University of Belgrade, Faculty of Organizational Sciences, Jove Ilica 154, 11010 Vozdovac, Belgrade, Serbia.
- Milan Vukicevic
- University of Belgrade, Faculty of Organizational Sciences, Jove Ilica 154, 11010 Vozdovac, Belgrade, Serbia.
- Sven Van Poucke
- Department of Anesthesiology, Critical Care, Emergency Medicine and Pain Therapy, Ziekenhuis Oost-Limburg, Schiepse Bos 6, B-3600 Genk, Belgium.
- Boris Delibasic
- University of Belgrade, Faculty of Organizational Sciences, Jove Ilica 154, 11010 Vozdovac, Belgrade, Serbia.
|