1
|
Qi D, Song C, Liu T. PreDBP-PLMs: Prediction of DNA-binding proteins based on pre-trained protein language models and convolutional neural networks. Anal Biochem 2024; 694:115603. [PMID: 38986796 DOI: 10.1016/j.ab.2024.115603] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/25/2024] [Revised: 06/15/2024] [Accepted: 07/06/2024] [Indexed: 07/12/2024]
Abstract
The recognition of DNA-binding proteins (DBPs) is a crucial step toward understanding their roles in various biological processes such as genetic regulation, gene expression, cell cycle control, DNA repair, and replication within cells. However, conventional experimental methods for identifying DBPs are usually time-consuming and expensive. Therefore, there is an urgent need to develop rapid and efficient computational methods for the prediction of DBPs. In this study, we proposed a novel predictor named PreDBP-PLMs to further improve the identification accuracy of DBPs by fusing the pre-trained protein language model (PLM) ProtT5 embedding with evolutionary features as input to a classic convolutional neural network (CNN) model. First, the ProtT5 embedding was combined with different evolutionary features derived from the position-specific scoring matrix (PSSM) to represent protein sequences. Then, the optimal feature combination was selected and input to the CNN classifier for the prediction of DBPs. Finally, 5-fold cross-validation (CV), leave-one-out CV (LOOCV), and an independent set test were adopted to examine the performance of PreDBP-PLMs on the benchmark datasets. Compared to the existing state-of-the-art predictors, PreDBP-PLMs exhibits an accuracy improvement of 0.5% and 5.2% on the PDB186 and PDB2272 datasets, respectively. These results demonstrate that the proposed method could serve as a useful tool for the recognition of DBPs.
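The evaluation protocol named in this abstract (5-fold CV and LOOCV) can be sketched in plain Python. This is a minimal illustration rather than the authors' code: the helper name `k_fold_indices` and the unshuffled, strided fold assignment are assumptions made here for brevity.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k-fold cross-validation.

    Hypothetical helper: folds are assigned by striding through the
    indices without shuffling, which real pipelines usually avoid.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    for fold in range(k):
        test = list(range(fold, n_samples, k))
        held_out = set(test)
        train = [i for i in range(n_samples) if i not in held_out]
        yield train, test


# 5-fold CV on 10 toy samples: every sample is held out exactly once.
five_fold = list(k_fold_indices(10, 5))

# LOOCV is the special case k == n_samples.
loocv = list(k_fold_indices(4, 4))
```

An independent set test differs from both: the held-out proteins never take part in any training fold.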
Affiliation(s)
- Dawei Qi
  - College of Information Technology, Shanghai Ocean University, Shanghai, 201306, China
- Chen Song
  - College of Information Technology, Shanghai Ocean University, Shanghai, 201306, China
- Taigang Liu
  - College of Information Technology, Shanghai Ocean University, Shanghai, 201306, China
|
2
|
Arshad F, Ahmed S, Amjad A, Kabir M. An explainable stacking-based approach for accelerating the prediction of antidiabetic peptides. Anal Biochem 2024; 691:115546. [PMID: 38670418 DOI: 10.1016/j.ab.2024.115546] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/13/2024] [Revised: 04/20/2024] [Accepted: 04/24/2024] [Indexed: 04/28/2024]
Abstract
Diabetes is a chronic disease that is characterized by high blood sugar levels and can have several harmful outcomes. Hyperglycemia, which is defined by persistently elevated blood sugar, is one of the primary concerns. People can improve their overall well-being and achieve optimal health outcomes by prioritizing diabetes control. Although the use of experimental approaches in diabetes treatment is cost-effective, it necessitates the development of many strategies for evaluating the efficacy of therapies. By enabling virtual screening with computational tools and procedures, researchers can quickly create new strategies for managing diabetes and gain vital insights. In this study, we propose a predictor named STADIP (STacking-based predictor for AntiDiabetic Peptides), a new method to predict antidiabetic peptides (ADPs) using a stacking-based ensemble approach. It uses 12 different feature encodings and seven machine-learning techniques to construct 84 baseline models. The impacts of various baseline models on ADP prediction were then thoroughly examined. A two-step feature selection method, eXtreme Gradient Boosting with Sequential Forward Selection (XGB-SFS), was employed to determine the optimal number of PFs, out of 84, to enhance predictive performance. Subsequently, using the meta-predictor approach, the 45 selected PFs were integrated into an XGB classifier to formulate the final hybrid model. The proposed method demonstrated superior predictive capabilities compared to its constituent baseline models, as evidenced by evaluations on both cross-validation and independent tests. During extensive independent testing, STADIP achieved promising performance, with an accuracy and Matthews correlation coefficient of 0.954 and 0.877, respectively. It is anticipated that it will be a useful tool in helping the scientific community to identify new antidiabetic proteins.
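The forward-selection half of the XGB-SFS step can be illustrated with a small greedy loop. This is only a sketch under stated assumptions: the function name, the feature names `pf_a`, `pf_b`, `pf_c`, and the toy additive scoring function are hypothetical, and the XGBoost-based ranking that precedes SFS in the abstract is not reproduced here.

```python
def sequential_forward_selection(candidates, evaluate):
    """Greedily add the feature that most improves evaluate(subset).

    Stops as soon as no remaining candidate raises the score.
    """
    selected = []
    remaining = list(candidates)
    best_score = float("-inf")
    while remaining:
        # Score every one-feature extension of the current subset.
        scored = [(evaluate(selected + [c]), c) for c in remaining]
        score, choice = max(scored, key=lambda pair: pair[0])
        if score <= best_score:
            break
        selected.append(choice)
        remaining.remove(choice)
        best_score = score
    return selected, best_score


# Toy stand-in for a cross-validated score (an assumption for illustration).
toy_gain = {"pf_a": 0.3, "pf_b": 0.5, "pf_c": -0.1}


def toy_score(subset):
    return sum(toy_gain[name] for name in subset)


chosen, chosen_score = sequential_forward_selection(list(toy_gain), toy_score)
```

In practice `evaluate` would train and cross-validate a classifier on the candidate subset, which makes each step far more expensive than in this toy run.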
Affiliation(s)
- Farwa Arshad
  - School of Systems and Technology, University of Management and Technology, Lahore, 54770, Pakistan
- Saeed Ahmed
  - School of Systems and Technology, University of Management and Technology, Lahore, 54770, Pakistan
- Aqsa Amjad
  - School of Systems and Technology, University of Management and Technology, Lahore, 54770, Pakistan
- Muhammad Kabir
  - School of Systems and Technology, University of Management and Technology, Lahore, 54770, Pakistan
|
3
|
Mahmud SMH, Goh KOM, Hosen MF, Nandi D, Shoombuatong W. Deep-WET: a deep learning-based approach for predicting DNA-binding proteins using word embedding techniques with weighted features. Sci Rep 2024; 14:2961. [PMID: 38316843 PMCID: PMC10844231 DOI: 10.1038/s41598-024-52653-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/25/2023] [Accepted: 01/22/2024] [Indexed: 02/07/2024] Open
Abstract
DNA-binding proteins (DBPs) play a significant role in all phases of genetic processes, including DNA recombination, repair, and modification. They are often utilized in drug discovery as fundamental elements of steroids, antibiotics, and anticancer drugs. Predicting them poses the most challenging task in proteomics research. Conventional experimental methods for DBP identification are costly and sometimes biased toward prediction. Therefore, developing powerful computational methods that can accurately and rapidly identify DBPs from sequence information is an urgent need. In this study, we propose a novel deep learning-based method called Deep-WET to accurately identify DBPs from primary sequence information. In Deep-WET, we employed three powerful feature encoding schemes, Global Vectors, Word2Vec, and fastText, to encode the protein sequence. Subsequently, these three features were sequentially combined and weighted using the weights obtained from the elements learned through the differential evolution (DE) algorithm. To enhance the predictive performance of Deep-WET, we applied the SHapley Additive exPlanations approach to remove irrelevant features. Finally, the optimal feature subset was input into convolutional neural networks to construct the Deep-WET predictor. Both cross-validation and independent tests indicated that Deep-WET achieved superior predictive performance compared to conventional machine learning classifiers. In addition, in an extensive independent test, Deep-WET was effective and outperformed several state-of-the-art methods for DBP prediction, with an accuracy of 78.08%, MCC of 0.559, and AUC of 0.805. This superior performance shows that Deep-WET has a tremendous capacity to predict DBPs. The web server of Deep-WET and the curated datasets used in this study are available at https://deepwet-dna.monarcatechnical.com/. The proposed Deep-WET is anticipated to serve the community-wide effort for large-scale identification of potential DBPs.
Affiliation(s)
- S M Hasan Mahmud
  - Department of Computer Science, American International University-Bangladesh (AIUB), Kuratoli, Dhaka, 1229, Bangladesh
  - Centre for Advanced Machine Learning and Applications (CAMLAs), Dhaka, 1229, Bangladesh
- Kah Ong Michael Goh
  - Faculty of Information Science & Technology (FIST), Multimedia University, Jalan Ayer Keroh Lama, 75450, Melaka, Malaysia
- Md Faruk Hosen
  - Department of Information and Communication Technology, Mawlana Bhashani Science and Technology University, Santosh, Tangail, 1902, Bangladesh
- Dip Nandi
  - Department of Computer Science, American International University-Bangladesh (AIUB), Kuratoli, Dhaka, 1229, Bangladesh
- Watshara Shoombuatong
  - Center for Research Innovation and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, 10700, Thailand
|
4
|
Manouchehri N, Bouguila N. Human Activity Recognition with an HMM-Based Generative Model. SENSORS (BASEL, SWITZERLAND) 2023; 23:1390. [PMID: 36772428 PMCID: PMC9920173 DOI: 10.3390/s23031390] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/24/2022] [Revised: 01/11/2023] [Accepted: 01/20/2023] [Indexed: 06/18/2023]
Abstract
Human activity recognition (HAR) has become an interesting topic in healthcare. This application is important in various domains, such as health monitoring, supporting elders, and disease diagnosis. Considering the increasing improvements in smart devices, large amounts of data are generated in our daily lives. In this work, we propose unsupervised, scaled, Dirichlet-based hidden Markov models to analyze human activities. Our motivation is that human activities have sequential patterns and hidden Markov models (HMMs) are some of the strongest statistical models used for modeling data with continuous flow. In this paper, we assume that emission probabilities in HMM follow a bounded-scaled Dirichlet distribution, which is a proper choice in modeling proportional data. To learn our model, we applied the variational inference approach. We used a publicly available dataset to evaluate the performance of our proposed model.
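To make the role of emission probabilities in an HMM concrete, the forward recursion for sequence likelihood can be sketched as below. Note the swapped-in simplification: this uses a small discrete emission table and exact recursion, whereas the abstract describes bounded-scaled Dirichlet emissions learned by variational inference. All parameter values are invented for illustration.

```python
from itertools import product


def forward_likelihood(start, trans, emit, observations):
    """Return P(observations) under a discrete-emission HMM.

    start[s]    : probability of starting in state s
    trans[p][s] : probability of moving from state p to state s
    emit[s][o]  : probability of state s emitting symbol o
    """
    if not observations:
        raise ValueError("observations must be non-empty")
    n_states = len(start)
    alpha = [start[s] * emit[s][observations[0]] for s in range(n_states)]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][obs]
            for s in range(n_states)
        ]
    return sum(alpha)


# Toy two-state, two-symbol model (values are assumptions, not from the paper).
start = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.5, 0.5], [0.1, 0.9]]

likelihood = forward_likelihood(start, trans, emit, [0, 1])

# Sanity check: likelihoods of all length-2 sequences sum to one.
total = sum(
    forward_likelihood(start, trans, emit, list(seq))
    for seq in product([0, 1], repeat=2)
)
```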
Affiliation(s)
- Narges Manouchehri
  - Algorithmic Dynamics Lab, Unit of Computational Medicine, Karolinska Institute, 171 77 Stockholm, Sweden
  - Concordia Institute for Information Systems Engineering, Concordia University, Montreal, QC H3G1T7, Canada
- Nizar Bouguila
  - Concordia Institute for Information Systems Engineering, Concordia University, Montreal, QC H3G1T7, Canada
|
5
|
Irshad MT, Nisar MA, Huang X, Hartz J, Flak O, Li F, Gouverneur P, Piet A, Oltmanns KM, Grzegorzek M. SenseHunger: Machine Learning Approach to Hunger Detection Using Wearable Sensors. SENSORS (BASEL, SWITZERLAND) 2022; 22:s22207711. [PMID: 36298061 PMCID: PMC9609214 DOI: 10.3390/s22207711] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 08/17/2022] [Revised: 09/26/2022] [Accepted: 10/06/2022] [Indexed: 05/23/2023]
Abstract
The perception of hunger and satiety is of great importance to maintaining a healthy body weight and avoiding chronic diseases such as obesity, underweight, or deficiency syndromes due to malnutrition. There are a number of disease patterns characterized by a chronic loss of this perception. To the best of our knowledge, hunger and satiety cannot be classified using non-invasive measurements. Aiming to develop an objective classification system, this paper presents a multimodal sensory system using associated signal processing and pattern recognition methods for hunger and satiety detection based on non-invasive monitoring. We used an Empatica E4 smartwatch, a RespiBan wearable device, and JINS MEME smart glasses to capture physiological signals from five healthy normal-weight subjects inactively sitting on a chair in a state of hunger and satiety. After pre-processing the signals, we compared different feature extraction approaches, either based on manual feature engineering or deep feature learning. Comparative experiments were carried out to determine the most appropriate sensor channel, device, and classifier to reliably discriminate between hunger and satiety states. Our experiments showed that the most discriminative features come from three specific sensor modalities: Electrodermal Activity (EDA), infrared Thermopile (Tmp), and Blood Volume Pulse (BVP).
Affiliation(s)
- Muhammad Tausif Irshad
  - Institute of Medical Informatics, University of Lübeck, Ratzeburger Allee 160, 23562 Lübeck, Germany
  - Department of IT, University of the Punjab, Katchery Road, Lahore 54000, Pakistan
- Muhammad Adeel Nisar
  - Institute of Medical Informatics, University of Lübeck, Ratzeburger Allee 160, 23562 Lübeck, Germany
  - Department of IT, University of the Punjab, Katchery Road, Lahore 54000, Pakistan
- Xinyu Huang
  - Institute of Medical Informatics, University of Lübeck, Ratzeburger Allee 160, 23562 Lübeck, Germany
- Jana Hartz
  - Institute of Medical Informatics, University of Lübeck, Ratzeburger Allee 160, 23562 Lübeck, Germany
- Olaf Flak
  - Department of Management, Faculty of Law and Social Sciences, Jan Kochanowski University of Kielce, ul. Żeromskiego 5, 25-369 Kielce, Poland
- Frédéric Li
  - Institute of Medical Informatics, University of Lübeck, Ratzeburger Allee 160, 23562 Lübeck, Germany
- Philip Gouverneur
  - Institute of Medical Informatics, University of Lübeck, Ratzeburger Allee 160, 23562 Lübeck, Germany
- Artur Piet
  - Institute of Medical Informatics, University of Lübeck, Ratzeburger Allee 160, 23562 Lübeck, Germany
- Kerstin M. Oltmanns
  - Section of Psychoneurobiology, Center of Brain, Behavior and Metabolism, University of Lübeck, Ratzeburger Allee 160, 23562 Lübeck, Germany
- Marcin Grzegorzek
  - Institute of Medical Informatics, University of Lübeck, Ratzeburger Allee 160, 23562 Lübeck, Germany
  - Department of Knowledge Engineering, University of Economics in Katowice, Bogucicka 3, 40-287 Katowice, Poland
|
6
|
Identification of Human Cell Cycle Phase Markers Based on Single-Cell RNA-Seq Data by Using Machine Learning Methods. BIOMED RESEARCH INTERNATIONAL 2022; 2022:2516653. [PMID: 36004205 PMCID: PMC9393965 DOI: 10.1155/2022/2516653] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/29/2022] [Revised: 07/25/2022] [Accepted: 07/29/2022] [Indexed: 12/17/2022]
Abstract
The cell cycle is composed of a series of ordered, highly regulated processes through which a cell grows and duplicates its genome and eventually divides into two daughter cells. According to the complex changes in cell structure and biosynthesis, the cell cycle is divided into four phases: gap 1 (G1), DNA synthesis (S), gap 2 (G2), and mitosis (M). Determining which cell cycle phase a cell is in is critical to the research of cancer development and pharmacy for targeting the cell cycle. However, current detection methods have the following problems: (1) they are complicated and time-consuming to perform, and (2) they cannot detect the cell cycle on a large scale. Rapid developments in single-cell technology have made dissecting cells on a large scale possible with unprecedented resolution. In the present research, we construct efficient classifiers and identify essential gene biomarkers based on single-cell RNA sequencing data through Boruta and three feature ranking algorithms (mRMR, MCFS, and SHAP by LightGBM) by utilizing four advanced classification algorithms. Meanwhile, we mine a series of classification rules that can distinguish different cell cycle phases. Collectively, we have provided a novel method for determining the cell cycle and identified new potential cell cycle-related genes, thereby contributing to the understanding of the processes that regulate the cell cycle.
|
7
|
Huang F, Chen L, Guo W, Zhou X, Feng K, Huang T, Cai Y. Identifying COVID-19 Severity-Related SARS-CoV-2 Mutation Using a Machine Learning Method. Life (Basel) 2022; 12:806. [PMID: 35743837 PMCID: PMC9225528 DOI: 10.3390/life12060806] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 04/12/2022] [Revised: 05/22/2022] [Accepted: 05/25/2022] [Indexed: 12/22/2022] Open
Abstract
SARS-CoV-2 shows great evolutionary capacity through a high frequency of genomic variation during transmission. Evolved SARS-CoV-2 often demonstrates resistance to previous vaccines and can cause poor clinical status in patients. Mutations in the SARS-CoV-2 genome involve mutations in structural and nonstructural proteins, and some of these proteins such as spike proteins have been shown to be directly associated with the clinical status of patients with severe COVID-19 pneumonia. In this study, we collected genome-wide mutation information of virulent strains and the severity of COVID-19 pneumonia in patients varying depending on their clinical status. Important protein mutations and untranslated region mutations were extracted using machine learning methods. First, through Boruta and four ranking algorithms (least absolute shrinkage and selection operator, light gradient boosting machine, max-relevance and min-redundancy, and Monte Carlo feature selection), mutations that were highly correlated with the clinical status of the patients were screened out and sorted in four feature lists. Some mutations such as D614G and V1176F were shown to be associated with viral infectivity. Moreover, previously unreported mutations such as A320V of nsp14 and I164ILV of nsp14 were also identified, which suggests their potential roles. We then applied the incremental feature selection method to each feature list to construct efficient classifiers, which can be directly used to distinguish the clinical status of COVID-19 patients. Meanwhile, four sets of quantitative rules were set up, which can help us to more intuitively understand the role of each mutation in differentiating the clinical status of COVID-19 patients. Identified key mutations linked to virologic properties will help better understand the mechanisms of infection and will aid in the development of antiviral treatments.
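The incremental feature selection (IFS) step mentioned in this abstract can be sketched as follows: given a ranked feature list, score a classifier on the top-k features for growing k and keep the best prefix. The function name, the placeholder feature names `m1`..`m4`, and the toy scoring function are assumptions for illustration only.

```python
def incremental_feature_selection(ranked_features, evaluate, step=1):
    """Return the best-scoring prefix of a ranked feature list.

    evaluate(subset) is assumed to return a performance score
    (for example, a cross-validated metric) for that subset.
    """
    best_subset, best_score = [], float("-inf")
    for k in range(step, len(ranked_features) + 1, step):
        subset = ranked_features[:k]
        score = evaluate(subset)
        if score > best_score:
            best_subset, best_score = list(subset), score
    return best_subset, best_score


# Toy setup: two informative features followed by two noisy ones.
ranked = ["m1", "m2", "m3", "m4"]
informative = {"m1", "m2"}


def toy_score(subset):
    # Reward informative features, penalise subset size (illustrative only).
    return sum(1 for f in subset if f in informative) - 0.1 * len(subset)


ifs_subset, ifs_score = incremental_feature_selection(ranked, toy_score)
```

Running this once per ranking algorithm would yield one optimal subset per feature list, which matches the "each feature list" wording of the abstract.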
Affiliation(s)
- Feiming Huang
  - School of Life Sciences, Shanghai University, Shanghai 200444, China
- Lei Chen
  - College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
- Wei Guo
  - Key Laboratory of Stem Cell Biology, Shanghai Jiao Tong University School of Medicine (SJTUSM) and Shanghai Institutes for Biological Sciences (SIBS), Chinese Academy of Sciences (CAS), Shanghai 200025, China
- Xianchao Zhou
  - Center for Single-Cell Omics, School of Public Health, Shanghai Jiao Tong University School of Medicine (SJTUSM), Shanghai 200025, China
- Kaiyan Feng
  - Department of Computer Science, Guangdong AIB Polytechnic College, Guangzhou 510060, China
- Tao Huang
  - Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
  - CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
- Yudong Cai
  - School of Life Sciences, Shanghai University, Shanghai 200444, China
|
8
|
Maruf FA, Pratama R, Song G. DNN-Boost: Somatic mutation identification of tumor-only whole-exome sequencing data using deep neural network and XGBoost. J Bioinform Comput Biol 2021; 19:2140017. [PMID: 34895111 DOI: 10.1142/s0219720021400175] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 01/09/2023]
Abstract
Detection of somatic mutation in whole-exome sequencing data can help elucidate the mechanism of tumor progression. Most computational approaches require exome sequencing for both tumor and normal samples. However, it is more common to sequence exomes for tumor samples only without the paired normal samples. To include these types of data for extensive studies on the process of tumorigenesis, it is necessary to develop an approach for identifying somatic mutations using tumor exome sequencing data only. In this study, we designed a machine learning approach using Deep Neural Network (DNN) and XGBoost to identify somatic mutations in tumor-only exome sequencing data and we integrated this into a pipeline called DNN-Boost. The XGBoost algorithm is used to extract the features from the results of variant callers and these features are then fed into the DNN model as input. The XGBoost algorithm resolves issues of missing values and overfitting. We evaluated our proposed model and compared its performance with other existing benchmark methods. We noted that the DNN-Boost classification model outperformed the benchmark method in classifying somatic mutations from paired tumor-normal exome data and tumor-only exome data.
Affiliation(s)
- Firda Aminy Maruf
  - School of Computer Science and Engineering, Pusan National University, 63 Busandaehak-Ro, Busan 46241, Republic of Korea
- Rian Pratama
  - School of Computer Science and Engineering, Pusan National University, 63 Busandaehak-Ro, Busan 46241, Republic of Korea
- Giltae Song
  - School of Computer Science and Engineering, Pusan National University, 63 Busandaehak-Ro, Busan 46241, Republic of Korea
|
9
|
Shen Z, Liu T, Xu T. Accurate Identification of Antioxidant Proteins Based on a Combination of Machine Learning Techniques and Hidden Markov Model Profiles. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2021; 2021:5770981. [PMID: 34413898 PMCID: PMC8369162 DOI: 10.1155/2021/5770981] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/21/2021] [Revised: 07/15/2021] [Accepted: 07/26/2021] [Indexed: 01/19/2023]
Abstract
Antioxidant proteins (AOPs) play important roles in the management and prevention of several human diseases due to their ability to neutralize excess free radicals. However, the identification of AOPs by using wet-lab experimental techniques is often time-consuming and expensive. In this study, we proposed an accurate computational model, called AOP-HMM, to predict AOPs by extracting discriminatory evolutionary features from hidden Markov model (HMM) profiles. First, auto cross-covariance (ACC) variables were applied to transform the HMM profiles into fixed-length feature vectors. Then, we performed the analysis of variance (ANOVA) method to reduce the dimensionality of the raw feature space. Finally, a support vector machine (SVM) classifier was adopted to conduct the prediction of AOPs. To comprehensively evaluate the performance of the proposed AOP-HMM model, the 10-fold cross-validation (CV), the jackknife CV, and the independent test were carried out on two widely used benchmark datasets. The experimental results demonstrated that AOP-HMM outperformed most of the existing methods and could be used to quickly annotate AOPs and guide the experimental process.
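The auto cross-covariance (ACC) transform is what turns a variable-length profile into a fixed-length vector. The sketch below follows the commonly used ACC definition (mean-centred column products at a given lag, averaged over positions), which is assumed here rather than taken from the paper. The function name and the three-column toy profile are hypothetical; a real profile would have one column per amino acid type.

```python
def acc_transform(profile, max_lag):
    """Map an L x n profile to a vector of length n * n * max_lag.

    profile[j][a] is the score of column a at sequence position j.
    Pairs with a == b give auto-covariance terms; a != b gives
    cross-covariance terms.
    """
    length = len(profile)
    n_cols = len(profile[0])
    if not 1 <= max_lag < length:
        raise ValueError("max_lag must be in [1, sequence length)")
    means = [sum(row[c] for row in profile) / length for c in range(n_cols)]
    features = []
    for lag in range(1, max_lag + 1):
        for a in range(n_cols):
            for b in range(n_cols):
                total = sum(
                    (profile[j][a] - means[a]) * (profile[j + lag][b] - means[b])
                    for j in range(length - lag)
                )
                features.append(total / (length - lag))
    return features


# Two toy profiles of different lengths give vectors of equal length.
short_profile = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [2.0, 2.0, 0.0], [1.0, 3.0, 1.0]]
long_profile = short_profile * 3
short_vec = acc_transform(short_profile, 2)
long_vec = acc_transform(long_profile, 2)
```

Because the output length depends only on the number of columns and the maximum lag, proteins of any length can be fed to the same ANOVA filter and SVM classifier.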
Affiliation(s)
- Zhehan Shen
  - College of Information Technology, Shanghai Ocean University, Shanghai 201306, China
- Taigang Liu
  - College of Information Technology, Shanghai Ocean University, Shanghai 201306, China
- Ting Xu
  - College of Information Technology, Shanghai Ocean University, Shanghai 201306, China
|
10
|
Shen Z, Wu Q, Wang Z, Chen G, Lin B. Diabetic Retinopathy Prediction by Ensemble Learning Based on Biochemical and Physical Data. SENSORS 2021; 21:s21113663. [PMID: 34070287 PMCID: PMC8197325 DOI: 10.3390/s21113663] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 04/19/2021] [Revised: 05/15/2021] [Accepted: 05/20/2021] [Indexed: 11/16/2022]
Abstract
(1) Background: Diabetic retinopathy, one of the most serious complications of diabetes, is the primary cause of blindness in developed countries. Therefore, the prediction of diabetic retinopathy has a positive impact on its early detection and treatment. The prediction of diabetic retinopathy based on high-dimensional and small-sample-structured datasets (such as biochemical data and physical data) was the problem to be solved in this study. (2) Methods: This study proposed the XGB-Stacking model with the foundation of XGBoost and stacking. First, a wrapped feature selection algorithm, XGBIBS (Improved Backward Search Based on XGBoost), was used to reduce data feature redundancy and improve the effect of a single ensemble learning classifier. Second, in view of the slight limitation of a single classifier, a stacking model fusion method, Sel-Stacking (Select-Stacking), which keeps Label-Proba as the input matrix of the meta-classifier and determines the optimal combination of learners by a global search, was used in the XGB-Stacking model. (3) Results: XGBIBS greatly improved the prediction accuracy and the feature reduction rate of a single classifier. Compared to a single classifier, the accuracy of the Sel-Stacking model was improved to varying degrees. Experiments proved that the prediction model of XGB-Stacking based on the XGBIBS algorithm and the Sel-Stacking method made effective predictions on diabetic retinopathy. (4) Conclusion: The XGB-Stacking prediction model of diabetic retinopathy based on biochemical and physical data had outstanding performance. This is highly significant for improving the screening efficiency of diabetic retinopathy and reducing the cost of diagnosis.
Affiliation(s)
- Zun Shen
  - School of Informatics, Xiamen University, Xiamen 361005, China
- Qingfeng Wu
  - School of Informatics, Xiamen University, Xiamen 361005, China
- Zhi Wang
  - Department of Microelectronics and Nanoelectronics, Tsinghua University, Beijing 100876, China
- Guoyi Chen
  - School of Informatics, Xiamen University, Xiamen 361005, China
- Bin Lin
  - School of Informatics, Xiamen University, Xiamen 361005, China
|
11
|
Li G, Du X, Li X, Zou L, Zhang G, Wu Z. Prediction of DNA binding proteins using local features and long-term dependencies with primary sequences based on deep learning. PeerJ 2021; 9:e11262. [PMID: 33986992 PMCID: PMC8101451 DOI: 10.7717/peerj.11262] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 08/31/2020] [Accepted: 03/22/2021] [Indexed: 12/12/2022] Open
Abstract
DNA-binding proteins (DBPs) play pivotal roles in many biological functions such as alternative splicing, RNA editing, and methylation. Many traditional machine learning (ML) methods and deep learning (DL) methods have been proposed to predict DBPs. However, these methods either rely on manual feature extraction or fail to capture long-term dependencies in the DNA sequence. In this paper, we propose a method, called PDBP-Fusion, to identify DBPs based on the fusion of local features and long-term dependencies only from primary sequences. We utilize convolutional neural network (CNN) to learn local features and use bi-directional long-short term memory network (Bi-LSTM) to capture critical long-term dependencies in context. Besides, we perform feature extraction, model training, and model prediction simultaneously. The PDBP-Fusion approach can predict DBPs with 86.45% sensitivity, 79.13% specificity, 82.81% accuracy, and 0.661 MCC on the PDB14189 benchmark dataset. The MCC of our proposed methods has been increased by at least 9.1% compared to other advanced prediction models. Moreover, the PDBP-Fusion also gets superior performance and model robustness on the PDB2272 independent dataset. It demonstrates that the PDBP-Fusion can be used to predict DBPs from sequences accurately and effectively; the online server is at http://119.45.144.26:8080/PDBP-Fusion/.
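The four figures reported here (sensitivity, specificity, accuracy, MCC) all derive from the binary confusion matrix. The sketch below uses the standard definitions; the function name and the toy counts are assumptions and do not reproduce the PDB14189 results.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Compute common binary classification metrics from confusion counts.

    Assumes at least one positive and one negative sample, so the
    sensitivity and specificity denominators are non-zero.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Toy confusion matrix: 10 true DBPs and 10 true non-DBPs.
toy_metrics = binary_metrics(tp=8, fn=2, tn=7, fp=3)
```

MCC uses all four cells of the confusion matrix, which is why it is often reported alongside accuracy.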
Affiliation(s)
- Guobin Li
  - School of Artificial Intelligence and Big Data, Hefei University, Hefei, China
- Xiuquan Du
  - School of Computer Science and Technology, Anhui University, Hefei, China
- Xinlu Li
  - School of Artificial Intelligence and Big Data, Hefei University, Hefei, China
- Le Zou
  - School of Artificial Intelligence and Big Data, Hefei University, Hefei, China
- Guanhong Zhang
  - School of Artificial Intelligence and Big Data, Hefei University, Hefei, China
- Zhize Wu
  - School of Artificial Intelligence and Big Data, Hefei University, Hefei, China
|
12
|
Soft sensor based on eXtreme gradient boosting and bidirectional converted gates long short-term memory self-attention network. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2020.12.028] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 11/23/2022]
|
13
|
Zhang Q, Liu P, Wang X, Zhang Y, Han Y, Yu B. StackPDB: Predicting DNA-binding proteins based on XGB-RFE feature optimization and stacked ensemble classifier. Appl Soft Comput 2021. [DOI: 10.1016/j.asoc.2020.106921] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Indexed: 02/08/2023]
|
14
|
Zhang YH, Li Z, Zeng T, Chen L, Li H, Huang T, Cai YD. Detecting the Multiomics Signatures of Factor-Specific Inflammatory Effects on Airway Smooth Muscles. Front Genet 2021; 11:599970. [PMID: 33519902 PMCID: PMC7838645 DOI: 10.3389/fgene.2020.599970] [Citation(s) in RCA: 27] [Impact Index Per Article: 9.0] [Received: 08/28/2020] [Accepted: 12/14/2020] [Indexed: 12/19/2022] Open
Abstract
Smooth muscles are a specific muscle subtype that is widely identified in the tissues of internal passageways. This muscle subtype has the capacity for controlled or regulated contraction and relaxation. Airway smooth muscles are a unique type of smooth muscles that constitute the effective, adjustable, and reactive wall that covers most areas of the entire airway from the trachea to lung tissues. Infection with SARS-CoV-2, which caused the world-wide COVID-19 pandemic, involves airway smooth muscles and their surrounding inflammatory environment. Therefore, airway smooth muscles and related inflammatory factors may play an irreplaceable role in the initiation and progression of several severe diseases. Many previous studies have attempted to reveal the potential relationships between interleukins and airway smooth muscle cells only on the omics level, and the continued existence of numerous false-positive optimal genes/transcripts cannot reflect the actual effective biological mechanisms underlying interleukin-based activation effects on airway smooth muscles. Here, on the basis of newly presented machine learning-based computational approaches, we identified specific regulatory factors and a series of rules that contribute to the activation and stimulation of airway smooth muscles by IL-13, IL-17, or the combination of both interleukins on the epigenetic and/or transcriptional levels. The detected discriminative factors (genes) and rules can contribute to the identification of potential regulatory mechanisms linking airway smooth muscle tissues and inflammatory factors and help reveal specific pathological factors for diseases associated with airway smooth muscle inflammation on multiomics levels.
Affiliation(s)
- Yu-Hang Zhang
- School of Life Sciences, Shanghai University, Shanghai, China
- Channing Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, United States
- Zhandong Li
- College of Food Engineering, Jilin Engineering Normal University, Changchun, China
- Tao Zeng
- Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, CAS-MPG Partner Institute for Computational Biology, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, Shanghai, China
- Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai, China
- Hao Li
- College of Food Engineering, Jilin Engineering Normal University, Changchun, China
- Tao Huang
- Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, Shanghai, China
- Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, China
|
15
|
Zhang Y, Chen P, Gao Y, Ni J, Wang X. DBP-PSSM: Combination of Evolutionary Profiles with the XGBoost Algorithm to Improve the Identification of DNA-binding Proteins. Comb Chem High Throughput Screen 2020; 25:3-12. [PMID: 33238837 DOI: 10.2174/1386207323999201124203531] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 05/27/2020] [Revised: 10/16/2020] [Accepted: 10/29/2020] [Indexed: 11/22/2022]
Abstract
BACKGROUND AND OBJECTIVE: DNA-binding proteins play important roles in a variety of biological processes, such as gene transcription and regulation, DNA replication and repair, DNA recombination and packaging, and the formation of chromatin and ribosomes. Therefore, it is urgent to develop a computational method to improve the recognition efficiency of DNA-binding proteins. METHODS: We proposed a novel method, DBP-PSSM, which constructed the features from amino acid composition and evolutionary information of protein sequences. The maximum relevance, minimum redundancy (mRMR) was employed to select the optimal features for establishing the XGBoost classifier, therefore, the novel model of prediction DNA-binding proteins, DBP-PSSM, was established with 5-fold cross-validation on the training dataset. RESULTS: DBP-PSSM achieved an accuracy of 81.18% and MCC of 0.657 in a test dataset, which outperformed the many existing methods. These results demonstrated that our method can effectively predict DNA-binding proteins. CONCLUSION: The data and source code are provided at https://github.com/784221489/DNA-binding.
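The mRMR selection step mentioned in this abstract can be illustrated with a small sketch. It is not the authors' code (their implementation is at the repository linked above). The function names `mutual_information` and `mrmr` are hypothetical, the features are assumed to be already discretized, and the greedy "relevance minus mean redundancy" score is one common mRMR variant. The XGBoost classifier and the 5-fold cross-validation are left out.

```python
from collections import Counter
from math import log


def mutual_information(a, b):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        c / n * log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in count_ab.items()
    )


def mrmr(columns, labels, k):
    """Greedily pick k feature indices: high relevance, low redundancy.

    columns: list of feature columns (each a list of discrete values).
    labels:  list of class labels, same length as each column.
    """
    relevance = [mutual_information(col, labels) for col in columns]
    # Start with the single most relevant feature.
    selected = [max(range(len(columns)), key=lambda i: relevance[i])]
    while len(selected) < min(k, len(columns)):
        def score(i):
            # Mean mutual information with the features already chosen.
            redundancy = sum(
                mutual_information(columns[i], columns[j]) for j in selected
            ) / len(selected)
            return relevance[i] - redundancy

        remaining = [i for i in range(len(columns)) if i not in selected]
        selected.append(max(remaining, key=score))
    return selected


# Toy data (made up): features 0 and 1 are duplicates, feature 2 is complementary.
labels = [0, 0, 1, 1, 2, 2, 3, 3]
columns = [
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 1, 1, 0, 0, 1, 1],
]
print(mrmr(columns, labels, 2))
```

In the toy data the duplicated feature is penalized for redundancy, so the second pick is the complementary feature rather than the copy.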
Affiliation(s)
- Yanping Zhang
- School of Mathematics and Physics Science and Engineering, Hebei University of Engineering, Handan 056038, China
- Pengcheng Chen
- School of Mathematics and Physics Science and Engineering, Hebei University of Engineering, Handan 056038, China
- Ya Gao
- School of Mathematics and Physics Science and Engineering, Hebei University of Engineering, Handan 056038, China
- Jianwei Ni
- School of Mathematics and Physics Science and Engineering, Hebei University of Engineering, Handan 056038, China
- Xiaosheng Wang
- School of Mathematics and Physics Science and Engineering, Hebei University of Engineering, Handan 056038, China
|
16
|
Zhang YH, Li Z, Zeng T, Pan X, Chen L, Liu D, Li H, Huang T, Cai YD. Distinguishing Glioblastoma Subtypes by Methylation Signatures. Front Genet 2020; 11:604336. [PMID: 33329750 PMCID: PMC7732602 DOI: 10.3389/fgene.2020.604336] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 09/09/2020] [Accepted: 11/02/2020] [Indexed: 11/13/2022] Open
Abstract
Glioblastoma, also called glioblastoma multiform (GBM), is the most aggressive cancer that initiates within the brain. GBM is produced in the central nervous system. Cancer cells in GBM are similar to stem cells. Several different schemes for GBM stratification exist. These schemes are based on intertumoral molecular heterogeneity, preoperative images, and integrated tumor characteristics. Although the formation of glioblastoma is remarkably related to gene methylation, GBM has been poorly classified by epigenetics. To classify glioblastoma subtypes on the basis of different degrees of genes' methylation, we adopted several powerful machine learning algorithms to identify numerous methylation features (sites) associated with the classification of GBM. The features were first analyzed by an excellent feature selection method, Monte Carlo feature selection (MCFS), resulting in a feature list. Then, such list was fed into the incremental feature selection (IFS), incorporating one classification algorithm, to extract essential sites. These sites can be annotated onto coding genes, such as CXCR4, TBX18, SP5, and TMEM22, and enriched in relevant biological functions related to GBM classification (e.g., subtype-specific functions). Representative functions, such as nervous system development, intrinsic plasma membrane component, calcium ion binding, systemic lupus erythematosus, and alcoholism, are potential pathogenic functions that participate in the initiation and progression of glioblastoma and its subtypes. With these sites, an efficient model can be built to classify the subtypes of glioblastoma.
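The incremental feature selection (IFS) step described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the ranked feature list is taken as given (standing in for the MCFS output), the classifier is a nearest-centroid rule scored by leave-one-out accuracy (the abstract does not name the classifier), and both function names are hypothetical.

```python
def nearest_centroid_loo_accuracy(X, y, feats):
    """Leave-one-out accuracy of a nearest-centroid rule on the given features."""
    correct = 0
    for i in range(len(X)):
        sums, counts = {}, {}
        # Build class centroids from every sample except the held-out one.
        for j, (row, label) in enumerate(zip(X, y)):
            if j == i:
                continue
            acc = sums.setdefault(label, [0.0] * len(feats))
            for k, f in enumerate(feats):
                acc[k] += row[f]
            counts[label] = counts.get(label, 0) + 1
        # Predict the class whose centroid is closest (squared Euclidean).
        predicted = min(
            sums,
            key=lambda c: sum(
                (X[i][f] - sums[c][k] / counts[c]) ** 2
                for k, f in enumerate(feats)
            ),
        )
        correct += predicted == y[i]
    return correct / len(X)


def incremental_feature_selection(X, y, ranked):
    """Evaluate the top-1, top-2, ... feature subsets and keep the best one.

    ranked: feature indices ordered from most to least important
            (assumed to come from an upstream ranking such as MCFS).
    Ties are resolved in favor of the smaller subset.
    """
    best_k, best_score = 0, -1.0
    for k in range(1, len(ranked) + 1):
        score = nearest_centroid_loo_accuracy(X, y, ranked[:k])
        if score > best_score:
            best_k, best_score = k, score
    return ranked[:best_k], best_score


# Toy data (made up): feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 5.0], [0.1, 1.0], [0.2, 9.0], [1.0, 4.0], [1.1, 8.0], [0.9, 2.0]]
y = [0, 0, 0, 1, 1, 1]
selected, score = incremental_feature_selection(X, y, ranked=[0, 1])
print(selected, score)
```

The loop keeps the smallest prefix of the ranking that reaches the best score, which is how IFS reduces a long ranked list to a compact set of features.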
Affiliation(s)
- Yu-Hang Zhang
- School of Life Sciences, Shanghai University, Shanghai, China
- Channing Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, United States
- Zhandong Li
- College of Food Engineering, Jilin Engineering Normal University, Changchun, China
- Tao Zeng
- Shanghai Research Center for Brain Science and Brain-Inspired Intelligence, Shanghai, China
- Xiaoyong Pan
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China
- Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai, China
- Dejing Liu
- Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
- Hao Li
- College of Food Engineering, Jilin Engineering Normal University, Changchun, China
- Tao Huang
- Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
- Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, China
|
17
|
Chen L, Li Z, Zeng T, Zhang YH, Liu D, Li H, Huang T, Cai YD. Identifying Robust Microbiota Signatures and Interpretable Rules to Distinguish Cancer Subtypes. Front Mol Biosci 2020; 7:604794. [PMID: 33330634 PMCID: PMC7672214 DOI: 10.3389/fmolb.2020.604794] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 09/10/2020] [Accepted: 10/15/2020] [Indexed: 12/11/2022] Open
Abstract
Cancer can be generally defined as a cluster of systematic diseases triggered by abnormal cell proliferation and growth. With the development of biological sciences and biotechnologies, the etiology of cancer is partially revealed, including some of the most substantial pathogenic factors [either endogenous (genetics) or exogenous (environmental)]. However, some remaining factors that contribute to the tumorigenesis but have not been analyzed and discussed in detail remain. For instance, some typical correlations between microorganisms and tumorigenesis have been reported already, but previous studies are just sporadic studies on single microorganism–cancer subtype pairs and do not explain and validate the specific contribution of microbiome on tumorigenesis. On the basis of the systematic microbiome analyses of blood and cancer-associated tissues in cancer patients/controls in public domain, we performed interpretable analyses. We identified several core regulatory microorganisms that contribute to the classification of multiple tumor subtypes and established quantitative predictive models for interpretable prediction by using multiple machine learning methods. We also compared the optimal features (microorganisms) and rules identified from microbiome profiles processed using the Kraken and the SHOGUN. Collectively, our study identified new microbiome signatures and their interpretable classification rules for cancer discrimination and carried out reliable methodological comparison for robust cancer microbiome analyses, thereby promoting the development of tumor etiology at the microbiome level.
Affiliation(s)
- Lei Chen
- School of Life Sciences, Shanghai University, Shanghai, China
- College of Information Engineering, Shanghai Maritime University, Shanghai, China
- Zhandong Li
- College of Food Engineering, Jilin Engineering Normal University, Changchun, China
- Tao Zeng
- Zhangjiang Laboratory, Institute of Brain-Intelligence Technology, Shanghai, China
- Yu-Hang Zhang
- Channing Division of Network Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, United States
- Dejing Liu
- Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, Shanghai, China
- Hao Li
- College of Food Engineering, Jilin Engineering Normal University, Changchun, China
- Tao Huang
- Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, Shanghai, China
- Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, China
|
18
|
Zhang X, Chen L. Prediction of membrane protein types by fusing protein-protein interaction and protein sequence information. Biochim Biophys Acta Proteins Proteom 2020; 1868:140524. [PMID: 32858174 DOI: 10.1016/j.bbapap.2020.140524] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 05/26/2020] [Revised: 07/17/2020] [Accepted: 07/30/2020] [Indexed: 11/30/2022]
Abstract
Membrane proteins are gatekeepers to the cell and essential for determination of the function of cells. Identification of the types of membrane proteins is an essential problem in cell biology. It is time-consuming and expensive to identify the type of membrane proteins with traditional experimental methods. The alternative way is to design effective computational methods, which can provide quick and reliable predictions. To date, several computational methods have been proposed in this regard. Several of them used the features extracted from the sequence information of individual proteins. Recently, networks are more and more popular to tackle different protein-related problems, which can organize proteins in a system level and give an overview of all proteins. However, such form weakens the essential properties of proteins, such as their sequence information. In this study, a novel feature fusion scheme was proposed, which integrated the information of protein sequences and protein-protein interaction network. The fused features of a protein were defined as the linear combination of sequence features of all proteins in the network, where the combination coefficients were the probabilities yielded by the random walk with restart algorithm with the protein as the seed node. Several models with such fused features and different classification algorithms were built and evaluated. Their performance for predicting the type of membrane proteins was improved compared with the models only with the sequence features or network information.
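The fusion scheme described in this abstract can be sketched as follows: for each protein, run a random walk with restart (RWR) seeded at that protein, then take the resulting probabilities as coefficients in a linear combination of all proteins' sequence features. This is a minimal illustration, not the authors' implementation. The function names, the restart probability of 0.8, the column normalization of the adjacency matrix, and the convergence tolerance are all assumptions.

```python
def random_walk_with_restart(adj, seed, restart=0.8, tol=1e-10, max_iter=1000):
    """Stationary visiting probabilities of a walk that restarts at `seed`.

    adj: symmetric adjacency matrix as a list of lists (weights >= 0).
    Iterates p <- (1 - restart) * W p + restart * e_seed, where W is the
    column-normalized adjacency matrix. Nodes without edges are assumed absent.
    """
    n = len(adj)
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    start = [1.0 if i == seed else 0.0 for i in range(n)]
    p = start[:]
    for _ in range(max_iter):
        nxt = []
        for i in range(n):
            walk = sum(
                adj[i][j] / col_sums[j] * p[j]
                for j in range(n)
                if col_sums[j] > 0
            )
            nxt.append((1 - restart) * walk + restart * start[i])
        change = sum(abs(a - b) for a, b in zip(nxt, p))
        p = nxt
        if change < tol:
            break
    return p


def fuse_features(adj, seq_features, restart=0.8):
    """Fused feature of protein i = sum_j p_i[j] * seq_features[j]."""
    dim = len(seq_features[0])
    fused = []
    for seed in range(len(adj)):
        probs = random_walk_with_restart(adj, seed, restart)
        fused.append(
            [
                sum(probs[j] * seq_features[j][d] for j in range(len(adj)))
                for d in range(dim)
            ]
        )
    return fused


# Toy network (made up): a three-protein path 0 - 1 - 2 with 2-D sequence features.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
seq_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in fuse_features(adj, seq_features):
    print(row)
```

With a high restart probability each fused vector stays close to the protein's own sequence features; lowering it mixes in more information from network neighbors.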
Affiliation(s)
- Xiaolin Zhang
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, People's Republic of China
- Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, People's Republic of China
|