1
Zhelyazkova M, Yordanova R, Mihaylov I, Tsonev S, Vassilev D. In silico discovering relationship between bacteriophages and antimicrobial resistance. Biotechnol Biotec Eq 2023. [DOI: 10.1080/13102818.2022.2151378] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/05/2023] Open
Affiliation(s)
- Maya Zhelyazkova
  - Faculty of Mathematics and Informatics, Department of Probability, Operations Research and Statistics, Sofia University St. Kliment Ohridski, Sofia, Bulgaria
- Roumyana Yordanova
  - Faculty of Science, Department of Mathematics, Hokkaido University, Sapporo, Japan
  - Department of Informatics modeling, Bulgarian Academy of Sciences, Institute of Mathematics and Informatics, Sofia, Bulgaria
- Iliyan Mihaylov
  - Faculty of Mathematics and Informatics, Department of Information Technologies, Sofia University St. Kliment Ohridski, Sofia, Bulgaria
- Stefan Tsonev
  - Department of Functional Genetics, Abiotic and Biotic Stress, AgroBioInstitute, Agricultural Academy, Sofia, Bulgaria
- Dimitar Vassilev
  - Faculty of Mathematics and Informatics, Department of Computational Informatics, Sofia University St. Kliment Ohridski, Sofia, Bulgaria
2
He S, Gao B, Sabnis R, Sun Q. Nucleic Transformer: Classifying DNA Sequences with Self-Attention and Convolutions. ACS Synth Biol 2023; 12:3205-3214. [PMID: 37916871 PMCID: PMC10863451 DOI: 10.1021/acssynbio.3c00154] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2023] [Revised: 10/04/2023] [Accepted: 10/06/2023] [Indexed: 11/03/2023]
Abstract
Much work has been done to apply machine learning and deep learning to genomics tasks, but these applications usually require extensive domain knowledge, and the resulting models provide very limited interpretability. Here, we present the Nucleic Transformer, a conceptually simple but effective and interpretable model architecture that excels in the classification of DNA sequences. The Nucleic Transformer employs self-attention and convolutions on nucleic acid sequences, leveraging two prominent deep learning strategies commonly used in computer vision and natural language analysis. We demonstrate that the Nucleic Transformer can be trained without much domain knowledge to achieve high performance in Escherichia coli promoter classification, viral genome identification, enhancer classification, and chromatin profile predictions.
Affiliation(s)
- Shujun He
  - Department of Chemical Engineering, Texas A&M University, College Station, Texas 77840, United States
- Baizhen Gao
  - Department of Chemical Engineering, Texas A&M University, College Station, Texas 77840, United States
- Rushant Sabnis
  - Department of Chemical Engineering, Texas A&M University, College Station, Texas 77840, United States
- Qing Sun
  - Department of Chemical Engineering, Texas A&M University, College Station, Texas 77840, United States
3
Bischoff E, Lang L, Zimmermann J, Luczak M, Kiefer AM, Niedner-Schatteburg G, Manolikakes G, Morgan B, Deponte M. Glutathione kinetically outcompetes reactions between dimedone and a cyclic sulfenamide or physiological sulfenic acids. Free Radic Biol Med 2023; 208:165-177. [PMID: 37541455 DOI: 10.1016/j.freeradbiomed.2023.08.005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/07/2023] [Revised: 07/31/2023] [Accepted: 08/01/2023] [Indexed: 08/06/2023]
Abstract
Dimedone and its derivatives are used as selective probes for the nucleophilic detection of sulfenic acids in biological samples. Qualitative analyses suggested that dimedone also reacts with cyclic sulfenamides. Furthermore, under physiological conditions, dimedone must compete with the highly concentrated nucleophile glutathione. We therefore quantified the reaction kinetics for a cyclic sulfenamide model peptide and the sulfenic acids of glutathione and a model peroxiredoxin in the presence or absence of dimedone and glutathione. We show that the cyclic sulfenamide is stabilized at lower pH and that it reacts with dimedone. While reactions between dimedone and sulfenic acids or the cyclic sulfenamide have similar rate constants, glutathione kinetically outcompetes dimedone as a nucleophile by several orders of magnitude. Our comparative in vitro and intracellular analyses challenge the selectivity of dimedone. Consequently, the dimedone labeling of cysteinyl residues inside living cells points towards unidentified reaction pathways or unknown, kinetically competitive redox species.
Affiliation(s)
- Eileen Bischoff
  - Fachbereich Chemie & Landesforschungszentrum OPTIMAS, RPTU Kaiserslautern, Erwin-Schrödinger Straße 54, D-67663, Kaiserslautern, Germany
- Lukas Lang
  - Fachbereich Chemie & Landesforschungszentrum OPTIMAS, RPTU Kaiserslautern, Erwin-Schrödinger Straße 54, D-67663, Kaiserslautern, Germany
- Jannik Zimmermann
  - Zentrum für Human- und Molekularbiologie (ZHMB), Universität des Saarlandes, Biochemie Campus, Geb. B2.2, D-66123, Saarbrücken, Germany
- Maximilian Luczak
  - Fachbereich Chemie & Landesforschungszentrum OPTIMAS, RPTU Kaiserslautern, Erwin-Schrödinger Straße 54, D-67663, Kaiserslautern, Germany
- Anna Maria Kiefer
  - Fachbereich Biologie, RPTU Kaiserslautern, Paul-Ehrlich Straße 23, D-67663, Kaiserslautern, Germany
- Gereon Niedner-Schatteburg
  - Fachbereich Chemie & Landesforschungszentrum OPTIMAS, RPTU Kaiserslautern, Erwin-Schrödinger Straße 54, D-67663, Kaiserslautern, Germany
- Georg Manolikakes
  - Fachbereich Chemie & Landesforschungszentrum OPTIMAS, RPTU Kaiserslautern, Erwin-Schrödinger Straße 54, D-67663, Kaiserslautern, Germany
- Bruce Morgan
  - Zentrum für Human- und Molekularbiologie (ZHMB), Universität des Saarlandes, Biochemie Campus, Geb. B2.2, D-66123, Saarbrücken, Germany
- Marcel Deponte
  - Fachbereich Chemie & Landesforschungszentrum OPTIMAS, RPTU Kaiserslautern, Erwin-Schrödinger Straße 54, D-67663, Kaiserslautern, Germany
4
Le NQK, Li W, Cao Y. Sequence-based prediction model of protein crystallization propensity using machine learning and two-level feature selection. Brief Bioinform 2023; 24:bbad319. [PMID: 37649385 DOI: 10.1093/bib/bbad319] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/09/2023] [Revised: 07/09/2023] [Accepted: 08/16/2023] [Indexed: 09/01/2023] Open
Abstract
Protein crystallization is crucial for biology, but the steps involved are complex and demanding in terms of external factors and internal structure. To save on experimental costs and time, the tendency of proteins to crystallize can be initially determined and screened by modeling. As a result, this study created a new pipeline aimed at using protein sequence to predict protein crystallization propensity in the protein material production stage, purification stage and production of crystal stage. The newly created pipeline proposed a new feature selection method, which involves combining Chi-square ($\chi^{2}$) and recursive feature elimination together with the 12 selected features, followed by a linear discriminant analysis for dimensionality reduction. Finally, a support vector machine algorithm with hyperparameter tuning and 10-fold cross-validation is used to train the model and test the results. This new pipeline has been tested on three different datasets, and the accuracy rates are higher than the existing pipelines. In conclusion, our model provides a new solution to predict multistage protein crystallization propensity, which is a big challenge in computational biology.
Affiliation(s)
- Nguyen Quoc Khanh Le
  - Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, 250 Wuxing Street, 110, Taipei, Taiwan
  - AIBioMed Research Group, Taipei Medical University, 250 Wuxing Street, 110, Taipei, Taiwan
  - Research Center for Artificial Intelligence in Medicine, Taipei Medical University, 250 Wuxing Street, 110, Taipei, Taiwan
  - Translational Imaging Research Center, Taipei Medical University Hospital, 252 Wuxing Street, 110, Taipei, Taiwan
- Wanru Li
  - NUS-ISS, National University of Singapore, 25 Heng Mui Keng Terrace, 119615, Singapore, Singapore
- Yanshuang Cao
  - NUS-ISS, National University of Singapore, 25 Heng Mui Keng Terrace, 119615, Singapore, Singapore
5
Zhang T, Jia J, Chen C, Zhang Y, Yu B. BiGRUD-SA: Protein S-sulfenylation sites prediction based on BiGRU and self-attention. Comput Biol Med 2023; 163:107145. [PMID: 37336062 DOI: 10.1016/j.compbiomed.2023.107145] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 03/13/2023] [Revised: 05/18/2023] [Accepted: 06/06/2023] [Indexed: 06/21/2023]
Abstract
S-sulfenylation is a vital post-translational modification (PTM) of proteins, which is an intermediate in other redox reactions and has implications for signal transduction and protein function regulation. However, there are many restrictions on the experimental identification of S-sulfenylation sites. Therefore, predicting S-sulfenylation sites by computational methods is fundamental to studying protein function and related biological mechanisms. In this paper, we propose a method named BiGRUD-SA based on bi-directional gated recurrent unit (BiGRU) and self-attention mechanism to predict protein S-sulfenylation sites. We first use AAC, BLOSUM62, AAindex, EAAC and GAAC to extract features, and fuse them to obtain the original feature space. Next, we use the SMOTE-Tomek method to handle data imbalance. Then, we input the processed data to the BiGRU and use a self-attention mechanism for further feature extraction. Finally, we input the resulting data to a deep neural network (DNN) to identify S-sulfenylation sites. The accuracies on the training set and independent test set are 96.66% and 95.91%, respectively, which indicates that our method is conducive to identifying S-sulfenylation sites. Furthermore, we use a data set of S-sulfenylation sites in Arabidopsis thaliana to verify the generalization ability of the BiGRUD-SA method, and obtain better prediction results.
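Of the encodings listed in this abstract, amino acid composition (AAC) is the simplest to illustrate. The sketch below is not the authors' code; it assumes the common definition of AAC as the relative frequency of each of the 20 standard residues in a peptide window, and it assumes that any other character (for example a padding symbol) is ignored. The function name `aac` and the example peptide are hypothetical.

```python
# Minimal AAC sketch (assumed definition: per-residue relative frequency).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, fixed order


def aac(peptide: str) -> list[float]:
    """Return a 20-dimensional amino acid composition vector."""
    # Assumption: characters outside the 20 standard residues are skipped.
    residues = [r for r in peptide.upper() if r in AMINO_ACIDS]
    if not residues:
        return [0.0] * len(AMINO_ACIDS)
    total = len(residues)
    return [residues.count(a) / total for a in AMINO_ACIDS]


features = aac("ACCK")  # hypothetical 4-residue window
print(features[AMINO_ACIDS.index("C")])  # 0.5
```

Each position in the vector corresponds to one residue in `AMINO_ACIDS`, so vectors from different windows can be concatenated with other encodings during feature fusion.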
Affiliation(s)
- Tingting Zhang
  - College of Computer Science and Technology, Shandong University, Qingdao, 266237, China; College of Information Science and Technology, School of Data Science, Qingdao University of Science and Technology, Qingdao, 266061, China
- Jihua Jia
  - College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China
- Cheng Chen
  - College of Computer Science and Technology, Shandong University, Qingdao, 266237, China
- Yaqun Zhang
  - College of Mathematics and Big Data, Dezhou University, Dezhou, 253023, China
- Bin Yu
  - College of Information Science and Technology, School of Data Science, Qingdao University of Science and Technology, Qingdao, 266061, China; School of Data Science, University of Science and Technology of China, Hefei, 230027, China
6
Palangi V. Identification of Ruminal Fermentation Curves of Some Legume Forages Using Particle Swarm Optimization. Animals (Basel) 2023; 13:ani13081339. [PMID: 37106901 PMCID: PMC10135319 DOI: 10.3390/ani13081339] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/03/2023] [Revised: 04/11/2023] [Accepted: 04/12/2023] [Indexed: 04/29/2023] Open
Abstract
The modeling process has a wide range of applications in animal nutrition. The purpose of this work is to determine whether particle swarm optimization (PSO) could be used to explain the fermentation curves of some legume forages. The model suited the fermentation data with minor statistical differences (R² > 0.98). In addition, reducing the number of iterations enhanced this method's benefits. Only Models I and II could successfully fit the fermentability data (R² > 0.98) in the vetch and white clover fermentation curve because the negative parameters (calculated in Models III and IV) were not biologically acceptable. Model IV could only fit the alfalfa fermentation curve, which had higher R values and demonstrated the model's dependability. In conclusion, it is advised to use PSO to match the fermentation curves. By examining the fermentation curves of feed materials, animal nutritionists can obtain a broader view of what ruminants require in terms of nutrition.
Affiliation(s)
- Valiollah Palangi
  - Department of Animal Science, Faculty of Agriculture, Ege University, Bornova, Izmir 35100, Türkiye
7
Luo H, Shan W, Chen C, Ding P, Luo L. Improving language model of human genome for DNA-protein binding prediction based on task-specific pre-training. Interdiscip Sci 2023; 15:32-43. [PMID: 36136096 DOI: 10.1007/s12539-022-00537-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2022] [Revised: 08/30/2022] [Accepted: 09/07/2022] [Indexed: 11/27/2022]
Abstract
DNA-protein binding plays a pivotal role in regulating gene expression and evolution, and computational identification of DNA-protein binding has drawn more and more attention in bioinformatics. Recently, variants of BERT have also been used to capture the semantic information of DNA sequences for predicting DNA-protein bindings. In this study, we leverage a task-specific pre-training strategy on BERT using large-scale multi-source DNA-protein binding data and present TFBert. TFBert treats DNA sequences as natural sentences and k-mer nucleotides as words. It can effectively extract upstream and downstream nucleotide context information by pre-training on the 690 unlabeled ChIP-seq datasets. Experiments show that the pre-trained model can achieve promising performance on every single dataset in the 690 ChIP-seq datasets after simple fine-tuning, especially on small datasets. The average AUC is 94.7%, outperforming existing popular methods. In conclusion, this study provides a variant of BERT based on pre-training and achieves state-of-the-art results in predicting DNA-protein bindings. We believe that TFBert can provide insights into other biological sequence classification problems.
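The idea of treating a DNA sequence as a sentence of k-mer words can be sketched as follows. This is an illustration only, not TFBert's tokenizer: the abstract does not state the value of k or the step between consecutive k-mers, so `k=3` and `stride=1` (overlapping k-mers) are assumptions, and `kmer_tokens` is a hypothetical helper name.

```python
def kmer_tokens(sequence: str, k: int = 3, stride: int = 1) -> list[str]:
    """Split a DNA sequence into k-mer 'words' (overlapping when stride < k)."""
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    seq = sequence.upper()
    # A sequence shorter than k yields no tokens.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


# Join the words with spaces to obtain a 'sentence' for a BERT-style tokenizer.
sentence = " ".join(kmer_tokens("ATGCGTAC", k=3))
print(sentence)  # ATG TGC GCG CGT GTA TAC
```

With a stride of 1, each nucleotide appears in up to k consecutive words, which is one way the upstream and downstream context of a position ends up encoded in neighbouring tokens.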
Affiliation(s)
- Hanyu Luo
  - School of Computer Science, University of South China, Hengyang, Hunan, 421001, People's Republic of China
- Wenyu Shan
  - School of Computer Science, University of South China, Hengyang, Hunan, 421001, People's Republic of China
- Cheng Chen
  - School of Computer Science, University of South China, Hengyang, Hunan, 421001, People's Republic of China
- Pingjian Ding
  - School of Computer Science, University of South China, Hengyang, Hunan, 421001, People's Republic of China
- Lingyun Luo
  - School of Computer Science, University of South China, Hengyang, Hunan, 421001, People's Republic of China
  - Hunan Medical Big Data International Science and Technology Innovation Cooperation Base, Hengyang, Hunan, 421001, People's Republic of China
8
Watanabe N, Yamamoto M, Murata M, Vavricka CJ, Ogino C, Kondo A, Araki M. Comprehensive Machine Learning Prediction of Extensive Enzymatic Reactions. J Phys Chem B 2022; 126:6762-6770. [PMID: 36053051 DOI: 10.1021/acs.jpcb.2c03287] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 01/04/2023]
Abstract
New enzyme functions exist within the increasing number of unannotated protein sequences. Novel enzyme discovery is necessary to expand the pathways that can be accessed by metabolic engineering for the biosynthesis of functional compounds. Accordingly, various machine learning models have been developed to predict enzymatic reactions. However, the ability to predict unknown reactions that are not included in the training data has not been clarified. In order to cover uncertain and unknown reactions, a wider range of reaction types must be demonstrated by the models. Here, we establish 16 expanded enzymatic reaction prediction models developed using various machine learning algorithms, including deep neural networks. Improvements in prediction performance over that of our previous study indicate that the updated methods are more effective for the prediction of enzymatic reactions. Overall, the deep neural network model trained with combined substrate-enzyme-product information exhibits the highest prediction accuracy, with Macro F1 scores up to 0.966 and with robust prediction of unknown enzymatic reactions that are not included in the training data. This model can predict more extensive enzymatic reactions in comparison to previously reported models. This study will facilitate the discovery of new enzymes for the production of useful substances.
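For readers unfamiliar with the Macro F1 score reported here, the following sketch shows how it is usually computed: one F1 score per class, then an unweighted mean over classes. The label values are hypothetical and unrelated to the study's data, and taking the class set as the union of true and predicted labels is an assumption about the convention used.

```python
def macro_f1(y_true: list, y_pred: list) -> float:
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        raise ValueError("no labels to score")
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical reaction-class labels, for illustration only.
y_true = ["a", "a", "b", "b", "c"]
y_pred = ["a", "b", "b", "b", "c"]
print(round(macro_f1(y_true, y_pred), 3))  # 0.822
```

Because every class contributes equally, a rare reaction class that is predicted poorly lowers the score as much as a common one would.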
Affiliation(s)
- Naoki Watanabe
  - Department of Chemical Science and Engineering Graduate School of Engineering, Kobe University, 1-1 Rokkodai-cho, Nada, Kobe, Hyogo 657-8501, Japan
- Masaki Yamamoto
  - Graduate School of Medicine, Kyoto University, 54 Kawahara-cho, Shogoin Sakyo-ku, Kyoto 606-8507, Japan
- Masahiro Murata
  - Graduate School of Medicine, Kyoto University, 54 Kawahara-cho, Shogoin Sakyo-ku, Kyoto 606-8507, Japan
- Christopher J Vavricka
  - Graduate School of Science, Technology and Innovation, Kobe University, 1-1 Rokkodai-cho, Nada-ku, Kobe 657-8501, Japan
- Chiaki Ogino
  - Department of Chemical Science and Engineering Graduate School of Engineering, Kobe University, 1-1 Rokkodai-cho, Nada, Kobe, Hyogo 657-8501, Japan
- Akihiko Kondo
  - Department of Chemical Science and Engineering Graduate School of Engineering, Kobe University, 1-1 Rokkodai-cho, Nada, Kobe, Hyogo 657-8501, Japan
  - Graduate School of Science, Technology and Innovation, Kobe University, 1-1 Rokkodai-cho, Nada-ku, Kobe 657-8501, Japan
- Michihiro Araki
  - Graduate School of Medicine, Kyoto University, 54 Kawahara-cho, Shogoin Sakyo-ku, Kyoto 606-8507, Japan
  - Graduate School of Science, Technology and Innovation, Kobe University, 1-1 Rokkodai-cho, Nada-ku, Kobe 657-8501, Japan
  - National Institutes of Biomedical Innovation, Health and Nutrition, National Institute of Health and Nutrition, 1-23-1 Toyama, Shinjuku-ku, Tokyo 162-8638, Japan
  - National Cerebral and Cardiovascular Center, 6-1 Kishibe-Shinmachi, Suita, Osaka 564-8565, Japan
9
Jiang Z, Lu Y, Liu Z, Wu W, Xu X, Dinnyés A, Yu Z, Chen L, Sun Q. Drug resistance prediction and resistance genes identification in Mycobacterium tuberculosis based on a hierarchical attentive neural network utilizing genome-wide variants. Brief Bioinform 2022; 23:6553603. [PMID: 35325021 DOI: 10.1093/bib/bbac041] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/01/2021] [Revised: 01/18/2022] [Accepted: 01/27/2022] [Indexed: 01/25/2023] Open
Abstract
Prediction of antimicrobial resistance based on whole-genome sequencing data has attracted greater attention due to its rapidity and convenience. Numerous machine learning-based studies have used genetic variants to predict drug resistance in Mycobacterium tuberculosis (MTB), assuming that variants are homogeneous. Most of these studies, however, have ignored the essential correlation between variants and corresponding genes when encoding variants, and have used a limited number of variants as prediction input. In this study, taking advantage of genome-wide variants for drug-resistance prediction and inspired by natural language processing, we frame drug resistance prediction as document classification, in which variants are considered as words, mutated genes in an isolate as sentences, and an isolate as a document. We propose a novel hierarchical attentive neural network model (HANN) that helps discover drug resistance-related genes and variants and acquire more interpretable biological results. It captures the interaction among variants in a mutated gene as well as among mutated genes in an isolate. Our results show that for the four first-line drugs of isoniazid (INH), rifampicin (RIF), ethambutol (EMB) and pyrazinamide (PZA), the HANN achieves the optimal area under the ROC curve of 97.90, 99.05, 96.44 and 95.14% and the optimal sensitivity of 94.63, 96.31, 92.56 and 87.05%, respectively. In addition, without any domain knowledge, the model identifies drug resistance-related genes and variants consistent with those confirmed by previous studies, and more importantly, it discovers one more potential drug-resistance-related gene.
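The document analogy in this abstract (variants as words, mutated genes as sentences, an isolate as a document) maps naturally onto a nested data structure. The sketch below only illustrates that grouping step under assumed inputs; it is not the HANN preprocessing code, and the gene and variant names are placeholders rather than data from the study.

```python
def isolate_to_document(variants: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group an isolate's (gene, variant) pairs into gene 'sentences'."""
    document: dict[str, list[str]] = {}
    for gene, variant in variants:
        # Each mutated gene is one sentence; its variants are the words.
        document.setdefault(gene, []).append(variant)
    return document


# Placeholder calls for one hypothetical isolate.
calls = [("geneA", "v1"), ("geneB", "v2"), ("geneA", "v3")]
doc = isolate_to_document(calls)
print(doc)  # {'geneA': ['v1', 'v3'], 'geneB': ['v2']}
```

A hierarchical model can then attend first over the words inside each sentence and afterwards over the sentences of the document, which matches the two levels of interaction the abstract describes.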
Affiliation(s)
- Zhonghua Jiang
  - Key Laboratory of Bio-resources and Eco-environment of the Ministry of Education, College of Life Sciences, Sichuan University, Chengdu, Sichuan 610064, China
- Yongmei Lu
  - College of Computer Science, Sichuan University, Chengdu, Sichuan 610065, China
- Zhuochong Liu
  - Key Laboratory of Bio-resources and Eco-environment of the Ministry of Education, College of Life Sciences, Sichuan University, Chengdu, Sichuan 610064, China
- Wei Wu
  - Key Laboratory of Bio-resources and Eco-environment of the Ministry of Education, College of Life Sciences, Sichuan University, Chengdu, Sichuan 610064, China
- Xinyi Xu
  - Key Laboratory of Bio-resources and Eco-environment of the Ministry of Education, College of Life Sciences, Sichuan University, Chengdu, Sichuan 610064, China
- András Dinnyés
  - BioTalentum Ltd., Aulich Lajos str. 26, 2100 Gödöllő, Hungary
- Zhonghua Yu
  - College of Computer Science, Sichuan University, Chengdu, Sichuan 610065, China
- Li Chen
  - College of Computer Science, Sichuan University, Chengdu, Sichuan 610065, China
- Qun Sun
  - Key Laboratory of Bio-resources and Eco-environment of the Ministry of Education, College of Life Sciences, Sichuan University, Chengdu, Sichuan 610064, China
10
Guo H, Song Y, Tang H, Zhao J. An ensemble deep neural network approach for predicting TOC concentration in lakes along the middle-lower reaches of Yangtze River. Journal of Intelligent & Fuzzy Systems 2022. [DOI: 10.3233/jifs-210708] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
In recent years, lake pollution has become increasingly serious, so water quality monitoring is becoming increasingly important. The concentration of total organic carbon (TOC) in lakes is an important indicator for monitoring the emission of organic pollutants. Therefore, it is of great significance to determine the TOC concentration in lakes. In this paper, the water quality dataset of the middle and lower reaches of the Yangtze River is obtained, and then the temperature, transparency, pH value, dissolved oxygen, conductivity, chlorophyll and ammonia nitrogen content are taken as the impact factors, and the stacking of different epochs’ deep neural networks (SDE-DNN) model is constructed to predict the TOC concentration in water. Five deep neural networks and linear regression are integrated into a strong prediction model by the stacking ensemble method. The experimental results show the prediction performance: the Nash-Sutcliffe efficiency coefficient (NSE) is 0.5312, the mean absolute error (MAE) is 0.2108 mg/L, the symmetric mean absolute percentage error (SMAPE) is 43.92%, and the root mean squared error (RMSE) is 0.3064 mg/L. The model has good prediction performance for the TOC concentration in water. Compared with the common machine learning models, traditional ensemble learning models and existing TOC prediction methods, the prediction error of this model is lower, and it is more suitable for predicting the TOC concentration. The model can use a wireless sensor network to obtain water quality data, thus predicting the TOC concentration of lakes in real time, reducing the cost of manual testing, and improving the detection efficiency.
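The four error measures quoted in this abstract can be sketched in a few lines. The definitions below are the standard textbook ones, not taken from the paper. SMAPE in particular has several variants, so the form used here (mean of 2|p - o| / (|o| + |p|), expressed in percent) is an assumption, and the sample values are made up for illustration.

```python
import math


def nse(obs: list[float], pred: list[float]) -> float:
    """Nash-Sutcliffe efficiency: 1 - SSE / variance of observations."""
    mean_obs = sum(obs) / len(obs)
    sse = sum((o - p) ** 2 for o, p in zip(obs, pred))
    sst = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - sse / sst  # undefined (division by zero) if obs are constant


def mae(obs: list[float], pred: list[float]) -> float:
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)


def rmse(obs: list[float], pred: list[float]) -> float:
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def smape(obs: list[float], pred: list[float]) -> float:
    """Symmetric MAPE in percent (assumed variant, bounded by 200)."""
    terms = []
    for o, p in zip(obs, pred):
        denom = abs(o) + abs(p)
        # Assumption: a pair of exact zeros contributes no error.
        terms.append(2 * abs(p - o) / denom if denom else 0.0)
    return 100.0 * sum(terms) / len(terms)


# Made-up TOC values in mg/L, for illustration only.
obs, pred = [1.0, 2.0, 3.0], [1.0, 2.0, 4.0]
print(nse(obs, pred))              # 0.5
print(round(mae(obs, pred), 4))    # 0.3333
print(round(rmse(obs, pred), 4))   # 0.5774
print(round(smape(obs, pred), 2))  # 9.52
```

An NSE of 1 means a perfect fit and an NSE of 0 means the model is no better than predicting the mean of the observations, which helps put the reported value of 0.5312 in context.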
Affiliation(s)
- Hai Guo
  - College of Computer Science and Technology, Dalian Minzu University, Dalian, China
- Yifan Song
  - College of Computer Science and Technology, Dalian Minzu University, Dalian, China
- Haoran Tang
  - College of Computer Science and Technology, Dalian Minzu University, Dalian, China
- Jingying Zhao
  - College of Computer Science and Technology, Dalian Minzu University, Dalian, China
  - Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian, China
11
Yao M, Fu L, Liu X, Zheng D. In-Silico Multi-Omics Analysis of the Functional Significance of Calmodulin 1 in Multiple Cancers. Front Genet 2022; 12:793508. [PMID: 35096010 PMCID: PMC8790318 DOI: 10.3389/fgene.2021.793508] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 10/12/2021] [Accepted: 12/23/2021] [Indexed: 01/14/2023] Open
Abstract
Aberrant activation of calmodulin 1 (CALM1) has been reported in human cancers. However, comprehensive understanding of the role of CALM1 in most cancer types has remained unclear. We systematically analyzed the expression landscape, DNA methylation, gene alteration, immune infiltration, clinical relevance, and molecular pathway of CALM1 in multiple cancers using various online tools, including The Cancer Genome Atlas, cBioPortal and the Human Protein Atlas databases. Kaplan–Meier and receiver operating characteristic (ROC) curves were plotted to explore the prognostic and diagnostic potential of CALM1 expression. Multivariate analyses were used to evaluate whether the CALM1 expression could be an independent risk factor. A nomogram predicting the overall survival (OS) of patients was developed, evaluated, and compared with the traditional Tumor-Node-Metastasis (TNM) model using decision curve analysis. R language was employed as the main tool for analysis and visualization. Results revealed CALM1 to be highly expressed in most cancers, its expression being regulated by DNA methylation in multiple cancers. CALM1 had a low mutation frequency (within 3%) and was associated with immune infiltration. We observed a substantial positive correlation between CALM1 expression and macrophage and neutrophil infiltration levels in multiple cancers. Different mutational forms of CALM1 hampered immune cell infiltration. Additionally, CALM1 expression had high diagnostic and prognostic potential. Multivariate analyses revealed CALM1 expression to be an independent risk factor for OS. Therefore, our newly developed nomogram had a higher clinical value than the TNM model. The concordance index, calibration curve, and time-dependent ROC curves of the nomogram exhibited excellent performance in terms of predicting the survival rate of patients. Moreover, elevated CALM1 expression contributes to the activation of cancer-related pathways, such as the WNT and MAPK pathways. 
Overall, our findings improved our understanding of the function of CALM1 in human cancers.
Affiliation(s)
- Maolin Yao
  - Laboratory of Genetics and Molecular Biology, College of Wildlife and Protected Area, Northeast Forestry University, Harbin, China
- Lanyi Fu
  - Laboratory of Genetics and Molecular Biology, College of Wildlife and Protected Area, Northeast Forestry University, Harbin, China
- Xuedong Liu
  - Laboratory of Genetics and Molecular Biology, College of Wildlife and Protected Area, Northeast Forestry University, Harbin, China
- Dong Zheng
  - Laboratory of Genetics and Molecular Biology, College of Wildlife and Protected Area, Northeast Forestry University, Harbin, China
12
Recognition of mRNA N4 Acetylcytidine (ac4C) by Using Non-Deep vs. Deep Learning. Applied Sciences-Basel 2022. [DOI: 10.3390/app12031344] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 01/04/2023]
Abstract
Deep learning models have been successfully applied in a wide range of fields. The creation of a deep learning framework for analyzing high-performance sequence data has piqued the research community’s interest. N4 acetylcytidine (ac4C), a post-transcriptional modification in mRNA, plays an important role in mRNA stability control and translation. Identifying ac4C modifications in mRNA through conventional laboratory experiments remains difficult, time-consuming, and costly. As a result, we developed DL-ac4C, a CNN-based deep learning model for ac4C recognition. In the alternative scenario, the model families are well-suited to working in large datasets with a large number of available samples, especially in biological domains. In this study, the DL-ac4C method (deep learning) is compared to non-deep learning (machine learning) methods, regression, and support vector machine. The results show that DL-ac4C is more advanced than previously used approaches. The proposed model improves the accuracy recall area by 9.6 percent and 9.8 percent, respectively, for cross-validation and independent tests. More nuanced methods of incorporating prior biological knowledge into the estimation procedure of deep learning models are required to achieve better results in terms of predictive efficiency and cost-effectiveness. Based on an experiment’s acetylated dataset, the DL-ac4C sequence-based predictor for acetylation sites in mRNA can predict whether query sequences have potential acetylation motifs.
13
Yang Y, Lin L, Qiao L. Deep learning approaches for data-independent acquisition proteomics. Expert Rev Proteomics 2021; 18:1031-1043. [PMID: 34918987 DOI: 10.1080/14789450.2021.2020654] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Indexed: 01/30/2023]
Abstract
INTRODUCTION: Data-independent acquisition (DIA) is an emerging technology for large-scale proteomic studies. DIA data analysis methods are evolving rapidly, and deep learning has cut a conspicuous figure in this field. AREAS COVERED: This review discusses and provides an overview of the deep learning methods that are used for DIA data analysis, including spectral library prediction, feature scoring, and statistical control in peptide-centric analysis, as well as de novo peptide sequencing. Literature searches were performed for articles, including preprints, up to December 2021 from PubMed, Scopus, and Web of Science databases. EXPERT OPINION: While spectral library prediction has broken through the limitation on proteome coverage of experimental libraries, the statistical burden due to the large query space is the remaining challenge of utilizing proteome-wide predicted libraries. Analysis of post-translational modifications is another promising direction of deep learning-based DIA methods.
Affiliation(s)
- Yi Yang
  - Department of Chemistry, Shanghai Stomatological Hospital, and Minhang Hospital, Fudan University, Shanghai, China
- Ling Lin
  - Department of Chemistry, Shanghai Stomatological Hospital, and Minhang Hospital, Fudan University, Shanghai, China
- Liang Qiao
  - Department of Chemistry, Shanghai Stomatological Hospital, and Minhang Hospital, Fudan University, Shanghai, China
14
|
Le NQK, Ho QT. Deep transformers and convolutional neural network in identifying DNA N6-methyladenine sites in cross-species genomes. Methods 2021; 204:199-206. [PMID: 34915158 DOI: 10.1016/j.ymeth.2021.12.004] [Citation(s) in RCA: 28] [Impact Index Per Article: 9.3] [Received: 11/13/2021] [Revised: 11/30/2021] [Accepted: 12/09/2021] [Indexed: 12/19/2022] Open
Abstract
As one of the most common post-transcriptional epigenetic modifications, N6-methyladenine (6mA) plays an essential role in various cellular processes and disease pathogenesis. Therefore, accurately identifying 6mA modifications is necessary for a deep understanding of cellular processes and other possible functional mechanisms. Although a few computational methods have been proposed, their respective models were developed with small training datasets. Hence, their practical application in genome-wide detection is quite limited. To overcome these limitations, we present a novel model based on transformer architecture and deep learning to identify DNA 6mA sites in cross-species genomes. The model is constructed on a benchmark dataset and uses features derived from pre-trained transformer word embedding approaches. Subsequently, a convolutional neural network is employed to learn from the generated features and produce the prediction outcomes. As a result, our predictor achieved excellent performance in an independent test, with an accuracy of 79.3% and a Matthews correlation coefficient (MCC) of 0.58. Overall, it achieved better accuracy than the baseline models and significantly outperformed the existing predictors, demonstrating the effectiveness of our proposed hybrid framework. Furthermore, our model is expected to assist biologists in accurately identifying 6mA sites and formulating novel testable biological hypotheses. We also release source code and datasets freely at https://github.com/khanhlee/bert-dna for front-end users.
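For readers unfamiliar with the two metrics quoted above, the following minimal Python sketch shows how accuracy and the MCC are computed from a binary confusion matrix. The function name and the counts are invented for illustration; they are not taken from the paper or its repository.

```python
import math


def accuracy_and_mcc(tp: int, tn: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (accuracy, MCC) for a binary confusion matrix."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the MCC is reported as 0 when a marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, mcc


# Hypothetical counts for a balanced test set of 200 sequences.
acc, mcc = accuracy_and_mcc(tp=80, tn=78, fp=22, fn=20)
print(f"accuracy={acc:.3f}, MCC={mcc:.3f}")
```

Unlike accuracy, the MCC uses all four cells of the confusion matrix, which is one reason it is often reported alongside accuracy for site-prediction tasks.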
Affiliation(s)
- Nguyen Quoc Khanh Le: Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei 106, Taiwan; Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei 106, Taiwan; Translational Imaging Research Center, Taipei Medical University Hospital, Taipei 110, Taiwan
- Quang-Thai Ho: College of Information & Communication Technology, Can Tho University, Viet Nam; Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, 32003, Taiwan
|
15
|
Pakhrin SC, Aoki-Kinoshita KF, Caragea D, KC DB. DeepNGlyPred: A Deep Neural Network-Based Approach for Human N-Linked Glycosylation Site Prediction. Molecules 2021; 26:7314. [PMID: 34885895 PMCID: PMC8658957 DOI: 10.3390/molecules26237314] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 11/01/2021] [Revised: 11/22/2021] [Accepted: 11/26/2021] [Indexed: 12/21/2022] Open
Abstract
Protein N-linked glycosylation is a post-translational modification that plays an important role in a myriad of biological processes. Computational prediction approaches serve as complementary methods for the characterization of glycosylation sites. Most of the existing predictors for N-linked glycosylation rely on the fact that the glycosylation site occurs at the N-X-[S/T] sequon, where X is any amino acid except proline. Not all N-X-[S/T] sequons are glycosylated; thus, the N-X-[S/T] sequon is a necessary but not sufficient determinant for protein glycosylation. In that regard, computational prediction of N-linked glycosylation sites confined to N-X-[S/T] sequons is an important problem. Here, we report DeepNGlyPred, a deep learning-based approach that encodes the positive and negative sequences in the human proteome dataset (extracted from N-GlycositeAtlas) using sequence-based features (gapped dipeptide), predicted structural features, and evolutionary information. DeepNGlyPred achieves SN, SP, MCC, and ACC of 88.62%, 73.92%, 0.60, and 79.41%, respectively, on the N-GlyDE independent test set, which is better than the compared approaches. These results demonstrate that DeepNGlyPred is a robust computational technique for predicting N-linked glycosylation sites confined to the N-X-[S/T] sequon. DeepNGlyPred will be a useful resource for the glycobiology community.
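The N-X-[S/T] sequon rule described above can be expressed as a short pattern match. The sketch below is an illustration only, not part of DeepNGlyPred; the function name and the example peptide are made up. It finds candidate sites, which a predictor would then still have to classify as glycosylated or not.

```python
import re

# N-X-[S/T], where X is any residue except proline (P).
# The lookahead makes matches zero-width, so overlapping sequons are all found.
SEQUON = re.compile(r"(?=N[^P][ST])")


def find_sequons(protein: str) -> list[int]:
    """Return the 1-based position of the Asn in every N-X-[S/T] sequon."""
    return [m.start() + 1 for m in SEQUON.finditer(protein.upper())]


# Hypothetical peptide: "NGS" and the overlapping "NNT"/"NTS" match; "NPT" does not.
print(find_sequons("MNGSANPTQNNTS"))  # [2, 10, 11]
```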
Affiliation(s)
- Subash C. Pakhrin: School of Computing, Wichita State University, 1845 Fairmount St., Wichita, KS 67260, USA
- Doina Caragea: Department of Computer Science, Kansas State University, Manhattan, KS 66506, USA
- Dukka B. KC: Department of Computer Science, Michigan Technological University, Houghton, MI 49931, USA (correspondence; Tel.: +1-906-487-1657)
|
16
|
Predicting Three-Dimensional Dose Distribution of Prostate Volumetric Modulated Arc Therapy Using Deep Learning. Life (Basel) 2021; 11:1305. [PMID: 34947836 PMCID: PMC8706736 DOI: 10.3390/life11121305] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/22/2021] [Revised: 11/19/2021] [Accepted: 11/23/2021] [Indexed: 11/21/2022] Open
Abstract
Background: Volumetric modulated arc therapy (VMAT) planning is a time-consuming process of radiation therapy. With a deep learning approach, 3D dose distribution can be predicted without the need for an actual dose calculation. This approach can accelerate the process by guiding and confirming the achievable dose distribution in order to reduce the replanning iterations while maintaining the plan quality. Methods: In this study, three dose distribution predictive models of VMAT for prostate cancer were developed, evaluated, and compared. Each model was designed with a different input data structure to train and test the model: (1) patient CT alone (PCT alone), (2) patient CT and generalized organ structure (PCTGOS), and (3) patient CT and specific organ structure (PCTSOS). The generative adversarial network (GAN) model was used as a core learning algorithm. The models were trained slice-by-slice using 46 VMAT plans for prostate cancer, and then used to predict and evaluate the dose distribution from 8 independent plans. Results: VMAT dose distribution was generated with a mean prediction time of approximately 3.5 s per patient, whereas the PCTSOS model was excluded due to a mean prediction time of approximately 17.5 s per patient. The highest average 3D gamma passing rate was 80.51 ± 5.94, while the lowest overall percentage difference of dose-volume histogram (DVH) parameters was 6.01 ± 5.44% for the prescription dose from the PCTGOS model. However, the PCTSOS model was the most reliable for the evaluation of multiple parameters. Conclusions: This dose prediction model could accelerate the iterative optimization process for the planning of VMAT treatment by guiding the planner with the desired dose distribution.
|
17
|
Huang S, Liu Y, Sun X, Li J. Application of Artificial Neural Network Based on Traditional Detection and GC-MS in Prediction of Free Radicals in Thermal Oxidation of Vegetable Oil. Molecules 2021; 26:6717. [PMID: 34771126 PMCID: PMC8586939 DOI: 10.3390/molecules26216717] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/02/2021] [Revised: 11/01/2021] [Accepted: 11/02/2021] [Indexed: 11/30/2022] Open
Abstract
In this study, electron paramagnetic resonance (EPR) and gas chromatography-mass spectrometry (GC-MS) techniques were applied to reveal the variation of lipid free radicals and oxidized volatile products of four oils during thermal processing. The EPR results showed that the signal intensities of linseed oil (LO) were the highest, followed by sunflower oil (SO), rapeseed oil (RO), and palm oil (PO). Moreover, the signal intensities of the four oils increased with heating time. GC-MS results showed that (E)-2-decenal, (E,E)-2,4-decadienal, and 2-undecenal were the main volatile compounds of oxidized oil. In addition, oxidized PO and LO contained the highest and lowest contents of volatiles, respectively. Based on the oil characteristics, an artificial neural network (ANN) intelligent evaluation model of free radicals was established. The coefficients of determination (R²) of the ANN models were above 0.97, and the difference between the true and predicted values was small, which indicated that oil profiles combined with chemometrics can accurately predict the free radicals of thermally oxidized oil.
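As a reminder of what the reported coefficient of determination measures, here is a minimal Python sketch of R² for measured versus predicted values. The numbers are invented for illustration and are unrelated to the EPR data in the study.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical measured vs. model-predicted signal intensities.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.9]
print(round(r_squared(measured, predicted), 3))  # 0.986
```

An R² close to 1 means the residual error is small relative to the spread of the measured values, which is the sense in which the abstract calls the predictions accurate.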
Affiliation(s)
- Shengquan Huang: Nuspower Greatsun (Guangdong) Biotechnology Co., Ltd., Guangzhou 510931, China
- Ying Liu: School of Food Science and Technology, Jiangnan University, Wuxi 214122, China
- Xuyuan Sun: School of Food Science and Technology, Jiangnan University, Wuxi 214122, China
- Jinwei Li: School of Food Science and Technology, Jiangnan University, Wuxi 214122, China
|
18
|
Xu Z, Luo M, Lin W, Xue G, Wang P, Jin X, Xu C, Zhou W, Cai Y, Yang W, Nie H, Jiang Q. DLpTCR: an ensemble deep learning framework for predicting immunogenic peptide recognized by T cell receptor. Brief Bioinform 2021; 22:6355415. [PMID: 34415016 DOI: 10.1093/bib/bbab335] [Citation(s) in RCA: 36] [Impact Index Per Article: 12.0] [Received: 05/20/2021] [Revised: 07/25/2021] [Accepted: 07/28/2021] [Indexed: 12/30/2022] Open
Abstract
Accurate prediction of immunogenic peptides recognized by the T cell receptor (TCR) can greatly benefit vaccine development and cancer immunotherapy. However, identifying immunogenic peptides accurately is still a huge challenge. Most antigen peptides predicted in silico without considering the TCR as a key factor fail to elicit immune responses in vivo. This inevitably leads to costly and time-consuming experimental validation of predicted antigens. Therefore, it is necessary to develop novel computational methods for precisely and effectively predicting immunogenic peptides recognized by the TCR. Here, we describe DLpTCR, a multimodal ensemble deep learning framework for predicting the likelihood of interaction between single/paired chain(s) of the TCR and peptides presented by major histocompatibility complex molecules. To investigate the generality and robustness of the proposed model, COVID-19 data and IEDB data were constructed for independent evaluation. The DLpTCR model exhibited high predictive power, with an area under the curve of up to 0.91 on COVID-19 data when predicting the interaction between a peptide and a single TCR chain. Additionally, the DLpTCR model achieved an overall accuracy of 81.03% on IEDB data when predicting the interaction between a peptide and paired TCR chains. The results demonstrate that DLpTCR is able to learn general interaction rules and generalize to antigen peptide recognition by the TCR. A user-friendly webserver is available at http://jianglab.org.cn/DLpTCR/. Additionally, a stand-alone software package can be downloaded from https://github.com/jiangBiolab/DLpTCR.
Affiliation(s)
- Zhaochun Xu: Center for Bioinformatics, School of Life Science and Technology, Harbin Institute of Technology, Harbin 150000, China
- Meng Luo: Center for Bioinformatics, School of Life Science and Technology, Harbin Institute of Technology, Harbin 150000, China
- Weizhong Lin: Center for Bioinformatics, Computer Department, Jingdezhen Ceramic Institute, Jingdezhen 333403, China
- Guangfu Xue: Center for Bioinformatics, School of Life Science and Technology, Harbin Institute of Technology, Harbin 150000, China
- Pingping Wang: Center for Bioinformatics, School of Life Science and Technology, Harbin Institute of Technology, Harbin 150000, China
- Xiyun Jin: Center for Bioinformatics, School of Life Science and Technology, Harbin Institute of Technology, Harbin 150000, China
- Chang Xu: Center for Bioinformatics, School of Life Science and Technology, Harbin Institute of Technology, Harbin 150000, China
- Wenyang Zhou: Center for Bioinformatics, School of Life Science and Technology, Harbin Institute of Technology, Harbin 150000, China
- Yideng Cai: Center for Bioinformatics, School of Life Science and Technology, Harbin Institute of Technology, Harbin 150000, China
- Wenyi Yang: Center for Bioinformatics, School of Life Science and Technology, Harbin Institute of Technology, Harbin 150000, China
- Huan Nie: Center for Bioinformatics, School of Life Science and Technology, Harbin Institute of Technology, Harbin 150000, China
- Qinghua Jiang: Center for Bioinformatics, School of Life Science and Technology, Harbin Institute of Technology, Harbin 150000, China; Key Laboratory of Biological Data (Harbin Institute of Technology), Ministry of Education, China
|
19
|
Ali H, Iqbal K, Mujtaba G, Fayyaz A, Bulbul MF, Karam FW, Zahir A. Urdu text in natural scene images: a new dataset and preliminary text detection. PeerJ Comput Sci 2021; 7:e717. [PMID: 34616893 PMCID: PMC8459794 DOI: 10.7717/peerj-cs.717] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/16/2020] [Accepted: 08/25/2021] [Indexed: 06/13/2023]
Abstract
Text detection in natural scene images for content analysis is an interesting task. The research community has seen some great developments in English/Mandarin text detection. However, Urdu text extraction in natural scene images has not been well addressed. In this work, firstly, a new dataset is introduced for Urdu text in natural scene images. The dataset comprises 500 standalone images acquired from real scenes. Secondly, the channel-enhanced Maximally Stable Extremal Region (MSER) method is applied to extract Urdu text regions as candidates in an image. A two-stage filtering mechanism is applied to eliminate non-candidate regions. In the first stage, text and noise are classified based on their geometric properties. In the second stage, a support vector machine classifier is trained to discard non-text candidate regions. After this, text candidate regions are linked using centroid-based vertical and horizontal distances. Text lines are further analyzed by a different classifier based on HOG features to remove non-text regions. Extensive experimentation is performed on the locally developed dataset to evaluate the performance. The experimental results show good performance on test set images. The dataset will be made available for research use. To the best of our knowledge, this work is the first of its kind for the Urdu language; it provides a dataset for free research use and serves as a baseline for the task of Urdu text extraction.
Affiliation(s)
- Hazrat Ali: Department of Electrical and Computer Engineering, COMSATS University Islamabad, Abbottabad Campus, Abbottabad, Pakistan
- Khalid Iqbal: Department of Computer Science, COMSATS University Islamabad, Attock Campus, Attock, Pakistan
- Ghulam Mujtaba: Department of Electrical and Computer Engineering, COMSATS University Islamabad, Abbottabad Campus, Abbottabad, Pakistan
- Ahmad Fayyaz: Department of Electrical and Computer Engineering, COMSATS University Islamabad, Abbottabad Campus, Abbottabad, Pakistan
- Mohammad Farhad Bulbul: Department of Mathematics, Jashore University of Science and Technology, Jashore, Bangladesh
- Fazal Wahab Karam: Department of Electrical and Computer Engineering, COMSATS University Islamabad, Abbottabad Campus, Abbottabad, Pakistan
- Ali Zahir: Department of Electrical and Computer Engineering, COMSATS University Islamabad, Abbottabad Campus, Abbottabad, Pakistan
|
20
|
Jia Y, Liu Y, Han Z, Tian R. Identification of potential gene signatures associated with osteosarcoma by integrated bioinformatics analysis. PeerJ 2021; 9:e11496. [PMID: 34123594 PMCID: PMC8164836 DOI: 10.7717/peerj.11496] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 07/01/2020] [Accepted: 04/30/2021] [Indexed: 12/21/2022] Open
Abstract
Background: Osteosarcoma (OS) is the most common primary malignant bone cancer in children and adolescents and has a high mortality rate. This work aims to screen novel potential gene signatures associated with OS by integrated microarray analysis of the Gene Expression Omnibus (GEO) database. Material and Methods: The OS microarray datasets were searched and downloaded from the GEO database to identify differentially expressed genes (DEGs) between OS and normal samples. Afterwards, functional enrichment analysis, protein–protein interaction (PPI) network analysis and transcription factor (TF)-target gene regulatory network analysis were applied to uncover the biological function of the DEGs. Finally, two published OS datasets (GSE39262 and GSE126209) were obtained from the GEO database to evaluate the expression level and diagnostic value of key genes. Results: In total, 1,059 DEGs (569 up-regulated and 490 down-regulated) between OS and normal samples were screened. Functional analysis showed that these DEGs were markedly enriched in 214 GO terms and 54 KEGG pathways, such as pathways in cancer. Five genes (CAMP, METTL7A, TCN1, LTF and CXCL12) acted as hub genes in the PPI network. In addition, METTL7A, CYP4F3, TCN1, LTF and NETO2 were key genes in the TF-gene network. Moreover, Pax-6 regulated four key genes (TCN1, CYP4F3, NETO2 and CXCL12). The expression levels of four genes (METTL7A, TCN1, CXCL12 and NETO2) in the GSE39262 set were consistent with our integration analysis. The expression levels of two genes (CXCL12 and NETO2) in the GSE126209 set were consistent with our integration analysis. ROC analysis of the GSE39262 set revealed that CYP4F3, CXCL12, METTL7A, TCN1 and NETO2 had good diagnostic value for OS patients. ROC analysis of the GSE126209 set revealed that CXCL12, METTL7A, TCN1 and NETO2 had good diagnostic value for OS patients.
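The ROC analysis mentioned above summarises how well a single gene's expression separates tumour from normal samples. A minimal sketch of the underlying quantity, the area under the ROC curve computed as a pairwise ranking probability, is given below. The expression values are invented, and the function is an illustration rather than the authors' pipeline.

```python
def roc_auc(scores_pos, scores_neg):
    """AUC as the probability that a positive sample outscores a negative one.

    Ties count as half a win (the Mann-Whitney U formulation of the AUC).
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical expression of one up-regulated gene in tumour vs. normal samples.
tumour = [8.1, 7.4, 6.9, 5.0]
normal = [5.5, 4.8, 4.2, 6.9]
print(roc_auc(tumour, normal))  # 0.84375
```

For a down-regulated gene the scores would be negated (or the two groups swapped) so that an AUC above 0.5 still indicates good separation.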
Affiliation(s)
- Yutao Jia: Department of Spine Surgery, Tianjin Union Medical Center, Tianjin, China
- Yang Liu: Department of Spine Surgery, Tianjin Union Medical Center, Tianjin, China
- Zhihua Han: Department of Anesthesiology, Tianjin Union Medical Center, Tianjin, China
- Rong Tian: Department of Spine Surgery, Tianjin Union Medical Center, Tianjin, China
|
21
|
Mosquera Navarro R, Castrillón OD, Parra Osorio L, Oliveira T, Novais P, Valencia JF. Improving classification based on physical surface tension-neural net for the prediction of psychosocial-risk level in public school teachers. PeerJ Comput Sci 2021; 7:e511. [PMID: 34141875 PMCID: PMC8176537 DOI: 10.7717/peerj-cs.511] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/29/2020] [Accepted: 04/06/2021] [Indexed: 06/12/2023]
Abstract
Background: Psychosocial risks, also present in educational processes, are stress factors that are particularly critical in state schools, affecting the efficacy, stress, and job satisfaction of teachers. This study proposes an intelligent algorithm to improve the prediction of psychosocial risk, as a tool for the generation of health and risk prevention assistance programs. Methods: The proposed approach, Physical Surface Tension-Neural Net (PST-NN), applies the theory of surface tension in liquids to an artificial neural network (ANN) in order to model four risk levels (low, medium, high and very high psychosocial risk). The model was trained and tested using the results of tests measuring the psychosocial risk levels of 5,443 teachers. Psychosocial factors, as well as physiological and musculoskeletal symptoms, were included as inputs of the model. The classification efficiency of the PST-NN approach was evaluated using sensitivity, specificity, accuracy and ROC curve metrics, and compared against other techniques such as the Decision Tree model, Naïve Bayes, ANN, Support Vector Machines, Robust Linear Regression and the Logistic Regression Model. Results: The modification of the ANN model, by the adaptation of a layer that includes concepts related to the theory of physical surface tension, improved the separation of the subjects according to risk level group, as a function of the mass and perimeter outputs. Indeed, the PST-NN model classified the psychosocial risk level of state-school teachers better than the linear, probabilistic and logistic models included in this study, obtaining an average accuracy of 97.31%. Conclusions: The introduction of physical models, such as physical surface tension, can improve the classification performance of ANNs. In particular, the PST-NN model can be used to predict and classify psychosocial risk levels among state-school teachers at work. This model could help with the early identification of psychosocial risk and the development of programs to prevent it.
Affiliation(s)
- Rodolfo Mosquera Navarro: Departamento de Ingeniería Industrial, Universidad Nacional de Colombia, Manizales, Caldas, Colombia; Grupo Nuevas tecnologías trabajo y gestión, Universidad de San Buenaventura - Cali, Cali, Valle del Cauca, Colombia
- Omar Danilo Castrillón: Departamento de Ingeniería Industrial, Universidad Nacional de Colombia, Manizales, Caldas, Colombia
- Liliana Parra Osorio: Centro de Investigaciones Socio jurídicas, Facultad de Derecho, Universidad Libre, Bogotá, Cundinamarca, Colombia
- Tiago Oliveira: Algoritmi Center, Universidade do Minho, Minho, Braga, Portugal
- Paulo Novais: Department of Informatics/Algoritmi Center, Universidade do Minho, Minho, Braga, Portugal
- José Fernando Valencia: Department of Ciencias y Tecnologías de la Información, Universidad de San Buenaventura - Cali, Cali, Valle del Cauca, Colombia
|
22
|
Makarov I, Makarov M, Kiselev D. Fusion of text and graph information for machine learning problems on networks. PeerJ Comput Sci 2021; 7:e526. [PMID: 34084929 PMCID: PMC8157042 DOI: 10.7717/peerj-cs.526] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/12/2020] [Accepted: 04/14/2021] [Indexed: 06/12/2023]
Abstract
Today, increased attention is drawn towards network representation learning, a technique that maps the nodes of a network into vectors of a low-dimensional embedding space. A network embedding constructed this way aims to preserve node similarity and other specific network properties. Embedding vectors can later be used for downstream machine learning problems, such as node classification, link prediction and network visualization. Naturally, some networks have text information associated with them. For instance, in a citation network, each node is a scientific paper associated with its abstract or title; in a social network, users may be viewed as nodes of a network and the posts of each user as textual attributes. In this work, we explore how combining existing methods of text and network embeddings can increase accuracy for downstream tasks and propose modifications to popular architectures to better capture textual information in network embedding and fusion frameworks.
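One simple way to combine a text embedding with a network embedding for the same node is to normalise each vector and concatenate them. The sketch below illustrates only this naive baseline, under the assumption that both embeddings are already computed; the function names and vectors are hypothetical, and the fusion architectures studied in the paper may differ.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def fuse(text_vec, graph_vec):
    """Concatenate the normalised text and graph views of one node."""
    return l2_normalize(text_vec) + l2_normalize(graph_vec)


# Hypothetical 2-d text embedding and 3-d graph embedding of a single paper.
fused = fuse([3.0, 4.0], [0.0, 2.0, 0.0])
print(fused)
```

Normalising first keeps one view from dominating the other purely because of its scale; the fused vector can then be fed to any downstream classifier.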
Affiliation(s)
- Ilya Makarov: HSE University, Moscow, Russia; University of Ljubljana, Ljubljana, Slovenia
|
23
|
Shafiq S, Azim T. Introspective analysis of convolutional neural networks for improving discrimination performance and feature visualisation. PeerJ Comput Sci 2021; 7:e497. [PMID: 34013030 PMCID: PMC8114803 DOI: 10.7717/peerj-cs.497] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/28/2020] [Accepted: 03/30/2021] [Indexed: 06/12/2023]
Abstract
Deep neural networks have been widely explored and utilised as a useful tool for feature extraction in computer vision and machine learning. It is often observed that the last fully connected (FC) layers of a convolutional neural network possess higher discrimination power than the convolutional and maxpooling layers, whose goal is to preserve local and low-level information of the input image and downsample it to avoid overfitting. Inspired by the functionality of the local binary pattern (LBP) operator, this paper proposes to induce discrimination into the mid layers of a convolutional neural network by introducing a discriminatively boosted alternative to pooling (DBAP) layer, which is shown to serve as a favourable replacement for the early maxpooling layer in a convolutional neural network (CNN). A thorough review of related work shows that the proposed change in the neural architecture is novel and has not been proposed before as a way to enhance the discrimination and feature visualisation power of mid-layer features. The empirical results reveal that the introduction of the DBAP layer in popular neural architectures such as AlexNet and LeNet produces competitive classification results in comparison to their baseline models as well as other ultra-deep models on several benchmark data sets. In addition, better visualisation of intermediate features can help in understanding and interpreting the black-box behaviour of convolutional neural networks, which are used widely by the research community.
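For context, the classic LBP operator that inspired the DBAP layer encodes each pixel by comparing it with its eight neighbours. The sketch below implements that basic operator for a single 3x3 patch. It is not the DBAP layer itself, and the neighbour ordering and the `>=` comparison are common conventions assumed here rather than details taken from the paper.

```python
def lbp_code(patch):
    """8-bit local binary pattern of the centre pixel of a 3x3 patch."""
    center = patch[1][1]
    # Neighbours visited clockwise, starting at the top-left pixel (assumed order).
    offsets = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for bit, (r, c) in enumerate(offsets):
        # A neighbour at least as bright as the centre sets its bit.
        if patch[r][c] >= center:
            code |= 1 << bit
    return code


# Hypothetical grey-level patch.
patch = [[6, 5, 2],
         [7, 6, 1],
         [9, 8, 7]]
print(lbp_code(patch))  # 241
```

Because the code depends only on the sign of local intensity differences, it is invariant to monotonic brightness changes, which is the property that makes LBP-style features discriminative.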
Affiliation(s)
- Shakeel Shafiq: Center of Excellence in IT, Institute of Management Sciences (IMSciences), Peshawar, KPK, Pakistan
- Tayyaba Azim: Center of Excellence in IT, Institute of Management Sciences (IMSciences), Peshawar, KPK, Pakistan
|
24
|
Zhelyazkova M, Yordanova R, Mihaylov I, Kirov S, Tsonev S, Danko D, Mason C, Vassilev D. Origin Sample Prediction and Spatial Modeling of Antimicrobial Resistance in Metagenomic Sequencing Data. Front Genet 2021; 12:642991. [PMID: 33763122 PMCID: PMC7983949 DOI: 10.3389/fgene.2021.642991] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/17/2020] [Accepted: 02/02/2021] [Indexed: 12/18/2022] Open
Abstract
The steady elaboration of the Metagenomic and Metadesign of Subways and Urban Biomes (MetaSUB) international consortium project raises important new questions about the origin, variation, and antimicrobial resistance of the collected samples. The CAMDA (Critical Assessment of Massive Data Analysis, http://camda.info/) forum organizes annual challenges in which different bioinformatics and statistical approaches are tested on samples collected around the world for bacterial classification and prediction of geographical origin. This work proposes a method which not only predicts the locations of unknown samples, but also estimates the relative risk of antimicrobial resistance through spatial modeling. We introduce a new component into the standard analysis by applying a Bayesian spatial convolution model, which accounts for the spatial structure of the data as defined by the longitude and latitude of the samples, and we assess the relative risk of antimicrobial resistance taxa across regions, which is relevant to public health. The estimated relative risk can then be used as a new measure of antimicrobial resistance. We also compare the performance of several machine learning methods, such as Gradient Boosting Machine, Random Forest, and Neural Network, in predicting the geographical origin of the mystery samples. All three methods show consistent results, with the Random Forest classifier performing somewhat better. In future work we can consider a broader class of spatial models and incorporate covariates related to the environment and climate profiles of the samples to achieve more reliable estimation of the relative risk related to antimicrobial resistance.
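For orientation, a common textbook formulation of a Bayesian spatial convolution model (the Besag-York-Mollié model) is sketched below. This is an assumption about the general model class, not the paper's exact specification: the symbols $O_i$ (observed count in region $i$), $E_i$ (expected count), $\theta_i$ (relative risk), $n_i$ (number of neighbours of region $i$) and the priors are introduced here purely for illustration.

```latex
\begin{aligned}
O_i &\sim \mathrm{Poisson}(E_i\,\theta_i), \qquad
\log \theta_i = \alpha + u_i + v_i, \\
u_i \mid u_{j \neq i} &\sim \mathcal{N}\!\left(\frac{1}{n_i}\sum_{j \sim i} u_j,\ \frac{\sigma_u^2}{n_i}\right), \qquad
v_i \sim \mathcal{N}(0,\ \sigma_v^2).
\end{aligned}
```

Here $u_i$ is a spatially structured effect that borrows strength from neighbouring regions ($j \sim i$), and $v_i$ is unstructured noise; the "convolution" is the sum of the two. A posterior $\theta_i > 1$ indicates a region with higher-than-expected risk.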
Affiliation(s)
- Maya Zhelyazkova: Faculty of Mathematics and Informatics, Sofia University St. Kliment Ohridski, Sofia, Bulgaria
- Roumyana Yordanova: Department of Mathematics, Hokkaido University, Sapporo, Japan; Bulgarian Academy of Sciences, Institute of Mathematics and Informatics, Sofia, Bulgaria
- Iliyan Mihaylov: Faculty of Mathematics and Informatics, Sofia University St. Kliment Ohridski, Sofia, Bulgaria
- Stefan Kirov: Bristol-Myers Squibb, Pennington, NJ, United States
- Stefan Tsonev: Department of Molecular Genetics, AgroBioInstitute, Sofia, Bulgaria
- David Danko: Department of Computational Informatics, Weill Cornell Medical College, New York, NY, United States
- Dimitar Vassilev: Faculty of Mathematics and Informatics, Sofia University St. Kliment Ohridski, Sofia, Bulgaria
|
25
|
Xiao J, Wang R, Cai X, Ye Z. Coupling of Co-expression Network Analysis and Machine Learning Validation Unearthed Potential Key Genes Involved in Rheumatoid Arthritis. Front Genet 2021; 12:604714. [PMID: 33643380 PMCID: PMC7905311 DOI: 10.3389/fgene.2021.604714] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 09/10/2020] [Accepted: 01/04/2021] [Indexed: 12/21/2022] Open
Abstract
Rheumatoid arthritis (RA) is an incurable disease that afflicts 0.5–1.0% of the global population, though it is less threatening at its early stage. Therefore, improved diagnostic efficiency and prognostic outcome are critical for confronting RA. Although machine learning is considered a promising technique in clinical research, its potential for verifying the biological significance of genes has not been fully exploited. The performance of a machine learning model depends greatly on the features used for model training; therefore, the effectiveness of prediction might reflect the quality of the input features. In the present study, we used weighted gene co-expression network analysis (WGCNA) in conjunction with differentially expressed gene (DEG) analysis to select the key genes that were highly associated with RA phenotypes based on multiple microarray datasets of RA blood samples, after which they were used as features in machine learning model validation. A total of six machine learning models were used to validate the biological significance of the key genes based on gene expression, among which five models achieved good performance [area under curve (AUC) >0.85], suggesting that the identified key genes are biologically significant and highly representative of genes involved in RA. Combined with other biological interpretations, including Gene Ontology (GO) analysis, protein–protein interaction (PPI) network analysis, and inference of immune cell composition, our study might shed light on the in-depth study of RA diagnosis and prognosis.
Affiliation(s)
- Jianwei Xiao
- Department of Rheumatology and Immunology, Shenzhen Futian Hospital for Rheumatic Diseases, Shenzhen, China
| | - Rongsheng Wang
- Department of Rheumatology, Shanghai Guanghua Hospital of Integrated Traditional Chinese and Western Medicine, Shanghai, China
| | - Xu Cai
- Department of Rheumatology and Immunology, Shenzhen Futian Hospital for Rheumatic Diseases, Shenzhen, China
| | - Zhizhong Ye
- Department of Rheumatology and Immunology, Shenzhen Futian Hospital for Rheumatic Diseases, Shenzhen, China
|
26
|
Wang X, Li BB. Deep Learning in Head and Neck Tumor Multiomics Diagnosis and Analysis: Review of the Literature. Front Genet 2021; 12:624820. [PMID: 33643386 PMCID: PMC7902873 DOI: 10.3389/fgene.2021.624820] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 11/01/2020] [Accepted: 01/07/2021] [Indexed: 12/24/2022] Open
Abstract
Head and neck tumors are the sixth most common neoplasms. Multiomics integrates multiple dimensions of clinical, pathologic, radiological, and biological data and has potential for tumor diagnosis and analysis. Deep learning (DL), a type of artificial intelligence (AI), is applied in medical image analysis. Among the DL techniques, the convolutional neural network (CNN) is used for image segmentation, detection, and classification and in computer-aided diagnosis. Here, we reviewed multiomics image analysis of head and neck tumors using CNNs and other DL neural networks. We also evaluated its application in early tumor detection, classification, prognosis/metastasis prediction, and the signing out of reports. Finally, we highlighted the challenges and potential of these techniques.
Affiliation(s)
- Xi Wang
- Department of Oral Pathology, Peking University School and Hospital of Stomatology & National Clinical Research Center for Oral Diseases & National Engineering Laboratory for Digital and Material Technology of Stomatology & Beijing Key Laboratory of Digital Stomatology, Beijing, China.,Research Unit of Precision Pathologic Diagnosis in Tumors of the Oral and Maxillofacial Regions, Chinese Academy of Medical Sciences, Beijing, China
| | - Bin-Bin Li
- Department of Oral Pathology, Peking University School and Hospital of Stomatology & National Clinical Research Center for Oral Diseases & National Engineering Laboratory for Digital and Material Technology of Stomatology & Beijing Key Laboratory of Digital Stomatology, Beijing, China.,Research Unit of Precision Pathologic Diagnosis in Tumors of the Oral and Maxillofacial Regions, Chinese Academy of Medical Sciences, Beijing, China
|
27
|
Bai R, Jiang S, Sun H, Yang Y, Li G. Deep Neural Network-Based Semantic Segmentation of Microvascular Decompression Images. SENSORS 2021; 21:s21041167. [PMID: 33562275 PMCID: PMC7915571 DOI: 10.3390/s21041167] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 12/31/2020] [Revised: 01/26/2021] [Accepted: 02/02/2021] [Indexed: 11/30/2022]
Abstract
Image semantic segmentation is increasingly applied in satellite remote sensing, medical treatment, intelligent transportation, and virtual reality. In the medical field, however, the study of cerebral vessel and cranial nerve segmentation based on true-color medical images is urgently needed and has good research and development prospects. We extended the current state-of-the-art semantic-segmentation network DeepLabv3+ and used it as the basic framework. First, the feature distillation block (FDB) was introduced into the encoder structure to refine the extracted features. In addition, the atrous spatial pyramid pooling (ASPP) module was added to the decoder structure to enhance the retention of feature and boundary information. The proposed model was trained by fine-tuning and optimizing the relevant parameters. Experimental results show that the encoder structure performs better in feature refinement, improving target boundary segmentation precision and retaining more feature information. Our method has a segmentation accuracy of 75.73%, which is 3% better than DeepLabv3+.
Affiliation(s)
- Ruifeng Bai
- Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences, Changchun 130033, China; (R.B.); (H.S.); (Y.Y.); (G.L.)
- University of Chinese Academy of Sciences, Beijing 100049, China
| | - Shan Jiang
- Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences, Changchun 130033, China; (R.B.); (H.S.); (Y.Y.); (G.L.)
- Correspondence: ; Tel.: +86-187-4401-2663
| | - Haijiang Sun
- Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences, Changchun 130033, China; (R.B.); (H.S.); (Y.Y.); (G.L.)
| | - Yifan Yang
- Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences, Changchun 130033, China; (R.B.); (H.S.); (Y.Y.); (G.L.)
- University of Chinese Academy of Sciences, Beijing 100049, China
| | - Guiju Li
- Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences, Changchun 130033, China; (R.B.); (H.S.); (Y.Y.); (G.L.)
|
28
|
Le NQK, Ho QT, Nguyen TTD, Ou YY. A transformer architecture based on BERT and 2D convolutional neural network to identify DNA enhancers from sequence information. Brief Bioinform 2021; 22:6128847. [PMID: 33539511 DOI: 10.1093/bib/bbab005] [Citation(s) in RCA: 75] [Impact Index Per Article: 25.0] [Received: 12/18/2020] [Revised: 01/01/2021] [Accepted: 01/03/2021] [Indexed: 01/11/2023] Open
Abstract
Recently, language representation models have drawn a lot of attention in the natural language processing field due to their remarkable results. Among them, bidirectional encoder representations from transformers (BERT) has proven to be a simple yet powerful language model that achieved state-of-the-art performance. BERT adopts contextualized word embeddings to capture the semantics of words together with the context in which they appear. In this study, we present a novel technique that incorporates a BERT-based multilingual model into bioinformatics to represent the information of DNA sequences. We treated DNA sequences as natural sentences and then used BERT models to transform them into fixed-length numerical matrices. As a case study, we applied our method to DNA enhancer prediction, which is a well-known and challenging problem in this field. We observed that our BERT-based features yielded improvements of more than 5-10% in sensitivity, specificity, accuracy and Matthews correlation coefficient compared with the current state-of-the-art features in bioinformatics. Moreover, further experiments show that deep learning (as represented by 2D convolutional neural networks; CNN) holds potential for learning BERT features better than traditional machine learning techniques. In conclusion, we suggest that BERT and 2D CNNs could open a new avenue in biological modeling using sequence information.
Affiliation(s)
- Nguyen Quoc Khanh Le
- Professional Master Program in Artificial Intelligence in Medicine, Taipei Medical University, Taipei, Taiwan
| | - Quang-Thai Ho
- College of Information and Communication Technology, Can Tho University, Vietnam
| | | | - Yu-Yen Ou
- Department of Computer Science and Engineering, Yuan Ze University, Taiwan
|
29
|
Explainable AI Framework for Multivariate Hydrochemical Time Series. MACHINE LEARNING AND KNOWLEDGE EXTRACTION 2021. [DOI: 10.3390/make3010009] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Indexed: 01/09/2023]
Abstract
The understanding of water quality and its underlying processes is important for the protection of aquatic environments. With the rare opportunity of access to a domain expert, an explainable AI (XAI) framework is proposed that is applicable to multivariate time series. The XAI provides explanations that are interpretable by domain experts. In three steps, it combines a data-driven choice of a distance measure with supervised decision trees guided by projection-based clustering. The multivariate time series consists of water quality measurements, including nitrate, electrical conductivity, and twelve other environmental parameters. The relationships between water quality and the environmental parameters are investigated by identifying similar days within a cluster and dissimilar days between clusters. The framework, called DDS-XAI, does not depend on prior knowledge about data structure, and its explanations tend to be contrastive. The relationships in the data can be visualized by a topographic map representing high-dimensional structures. Two state-of-the-art XAIs, called eUD3.5 and iterative mistake minimization (IMM), were unable to provide meaningful and relevant explanations from the three multivariate time series data. The DDS-XAI framework can be swiftly applied to new data. Open-source code in R is provided for all steps of the XAI framework, and the steps are structured in an application-oriented way.
|
30
|
Makarov I, Kiselev D, Nikitinsky N, Subelj L. Survey on graph embeddings and their applications to machine learning problems on graphs. PeerJ Comput Sci 2021; 7:e357. [PMID: 33817007 PMCID: PMC7959646 DOI: 10.7717/peerj-cs.357] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 07/21/2020] [Accepted: 12/18/2020] [Indexed: 05/13/2023]
Abstract
Dealing with relational data has always required significant computational resources, domain expertise and task-dependent feature engineering to incorporate structural information into a predictive model. Nowadays, a family of automated graph feature engineering techniques has been proposed in different streams of literature. So-called graph embeddings provide a powerful tool to construct vectorized feature spaces for graphs and their components, such as nodes, edges and subgraphs, while preserving inner graph properties. Using the constructed feature spaces, many machine learning problems on graphs can be solved via standard frameworks suitable for vectorized feature representation. Our survey aims to describe the core concepts of graph embeddings and provide several taxonomies for their description. First, we start with the methodological approach and extract three types of graph embedding models based on matrix factorization, random walks and deep learning approaches. Next, we describe how different types of networks impact the ability of models to incorporate structural and attributed data into a unified embedding. Going further, we perform a thorough evaluation of graph embedding applications to machine learning problems on graphs, among which are node classification, link prediction, clustering, visualization, compression, and a family of whole-graph embedding algorithms suitable for graph classification, similarity and alignment problems. Finally, we give an overview of the existing applications of graph embeddings to computer science domains, formulate open problems and provide experimental results explaining how different network properties affect graph embedding quality in four classic machine learning problems on graphs: node classification, link prediction, clustering and graph visualization. As a result, our survey covers a new, rapidly growing field of network feature engineering, presents an in-depth analysis of models based on network types, and gives an overview of a wide range of applications to machine learning problems on graphs.
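As an illustration of the random-walk family of embedding models named in this abstract, the sketch below generates uniform random walks over a small graph. In DeepWalk-style pipelines such walks are treated as "sentences" of node identifiers and passed to a skip-gram model; that second stage is omitted here. The adjacency-list format, function name, and toy graph are assumptions made for the example and are not taken from the survey.

```python
import random

def random_walks(adjacency, walk_length, walks_per_node, seed=0):
    """Generate uniform random walks starting from every node of a graph."""
    rng = random.Random(seed)
    walks = []
    for start in sorted(adjacency):
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbours = adjacency[walk[-1]]
                if not neighbours:  # isolated node or dead end: stop early
                    break
                walk.append(rng.choice(neighbours))
            walks.append(walk)
    return walks

# Toy undirected graph (hypothetical): triangle a-b-c plus a pendant node d.
graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
walks = random_walks(graph, walk_length=5, walks_per_node=2)
```

Nodes that co-occur within a short window of the same walk end up with similar vectors once an embedding model is trained on the walks, which is how neighbourhood structure is carried into the feature space.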
Affiliation(s)
- Ilya Makarov
- HSE University, Moscow, Russia
- Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia
| | | | - Nikita Nikitinsky
- Big Data Research Center, National University of Science and Technology MISIS, Moscow, Russia
| | - Lovro Subelj
- Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia
|
31
|
SSnet: A Deep Learning Approach for Protein-Ligand Interaction Prediction. Int J Mol Sci 2021; 22:ijms22031392. [PMID: 33573266 PMCID: PMC7869013 DOI: 10.3390/ijms22031392] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Received: 12/28/2020] [Revised: 01/24/2021] [Accepted: 01/27/2021] [Indexed: 12/15/2022] Open
Abstract
Computational prediction of Protein-Ligand Interaction (PLI) is an important step in the modern drug discovery pipeline as it mitigates the cost, time, and resources required to screen novel therapeutics. Deep Neural Networks (DNN) have recently shown excellent performance in PLI prediction. However, the performance is highly dependent on the protein and ligand features utilized for the DNN model. Moreover, in current models, deciphering how protein features determine the underlying principles that govern PLI is not trivial. In this work, we developed a DNN framework named SSnet that utilizes secondary structure information of proteins, extracted as the curvature and torsion of the protein backbone, to predict PLI. We demonstrate the performance of SSnet by comparing it against a variety of currently popular machine learning (ML) and non-ML models using various metrics. We visualize the intermediate layers of SSnet to show a potential latent space for proteins, in particular to extract structural elements in a protein that the model finds influential for ligand binding, which is one of the key features of SSnet. We observed in our study that SSnet learns information about locations in a protein where a ligand can bind, including binding sites, allosteric sites and cryptic sites, regardless of the conformation used. We further observed that SSnet is not biased toward any specific molecular interaction and extracts the protein fold information critical for PLI prediction. Our work forms an important gateway to the general exploration of secondary structure-based Deep Learning (DL), which is not confined to protein-ligand interactions, and as such will have a large impact on protein research, while being readily accessible to de novo drug designers as a standalone package.
|
32
|
Wu F, Yang R, Zhang C, Zhang L. A deep learning framework combined with word embedding to identify DNA replication origins. Sci Rep 2021; 11:844. [PMID: 33436981 PMCID: PMC7804333 DOI: 10.1038/s41598-020-80670-x] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 10/23/2020] [Accepted: 12/24/2020] [Indexed: 01/29/2023] Open
Abstract
DNA replication influences the inheritance of genetic information in the DNA life cycle. As the distribution of replication origins (ORIs) is the major determinant in precisely regulating the replication process, the correct identification of ORIs is important for understanding DNA replication mechanisms and the regulatory mechanisms of gene expression. For eukaryotes in particular, multiple ORIs exist in each of their gene sequences to complete replication in a reasonable period of time. To simplify the identification of eukaryotic ORIs, most existing methods are built on traditional machine learning algorithms and target gene sequences of a fixed length. Consequently, the identification results are not satisfactory, and there is still great room for improvement. To overcome the limitations of previous studies, this paper develops sequence segmentation methods and employs the word embedding technique 'Word2vec' to convert gene sequences into word vectors, thereby capturing the inner correlations of gene sequences with different lengths. Then, a deep learning framework to perform the ORI identification task is constructed from a convolutional neural network with an embedding layer. On the basis of an analysis of the dimensionality-reduction similarity diagram, Word2vec can effectively transform the inner relationships among words into numerical features. For the four species in this study, the best models are obtained with overall accuracies of 0.975, 0.765, 0.885, 0.967, Matthews correlation coefficients of 0.940, 0.530, 0.771, 0.934, and AUCs of 0.975, 0.800, 0.888, 0.981, which indicate that the proposed predictor is stable and classifies both ORIs and non-ORIs with high confidence. Compared with state-of-the-art methods, the proposed predictor achieves ORI identification with significant improvement. It is therefore reasonable to anticipate that the proposed method will make a useful high-throughput tool for genome analysis.
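The idea of segmenting a sequence into "words" before embedding can be sketched as follows. The abstract does not state the segmentation scheme, so overlapping k-mers with k = 3 and stride 1 are an assumption chosen for illustration, as are the function names. The integer indices produced at the end are the kind of input an embedding layer or a Word2vec trainer would consume.

```python
def kmer_words(sequence, k=3, stride=1):
    """Split a DNA sequence into k-mer 'words' (overlapping when stride < k)."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]

def build_vocabulary(sentences):
    """Map each distinct k-mer to an integer index; 0 is left free for padding."""
    vocab = {}
    for sentence in sentences:
        for word in sentence:
            vocab.setdefault(word, len(vocab) + 1)
    return vocab

# Two short toy sequences of equal length (real inputs may differ in length).
sequences = ["ATGCGT", "GCGTAA"]
sentences = [kmer_words(s, k=3) for s in sequences]
vocab = build_vocabulary(sentences)
encoded = [[vocab[w] for w in sentence] for sentence in sentences]
```

Because every sequence becomes a list of tokens, sequences of different lengths simply yield sentences of different lengths, which can then be padded with index 0 before being batched.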
Affiliation(s)
- Feng Wu
- School of Mechanical, Electrical and Information Engineering, Shandong University at Weihai, Weihai, 264200, China
| | - Runtao Yang
- School of Mechanical, Electrical and Information Engineering, Shandong University at Weihai, Weihai, 264200, China.
| | - Chengjin Zhang
- School of Mechanical, Electrical and Information Engineering, Shandong University at Weihai, Weihai, 264200, China
| | - Lina Zhang
- School of Mechanical, Electrical and Information Engineering, Shandong University at Weihai, Weihai, 264200, China
|
33
|
Wang P, Zhang Q, Li S, Cheng B, Xue H, Wei Z, Shao T, Liu ZX, Cheng H, Wang Z. iCysMod: an integrative database for protein cysteine modifications in eukaryotes. Brief Bioinform 2021; 22:6066620. [PMID: 33406221 DOI: 10.1093/bib/bbaa400] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 09/25/2020] [Revised: 11/23/2020] [Accepted: 12/07/2020] [Indexed: 01/06/2023] Open
Abstract
As important post-translational modifications, protein cysteine modifications (PCMs) occurring at the cysteine thiol group play critical roles in the regulation of various biological processes in eukaryotes. Due to the rapid advancement of high-throughput proteomics technologies, a large number of PCM events have been identified but remain to be curated. Thus, an integrated resource of eukaryotic PCMs will be useful for the research community. In this work, we developed an integrative database for protein cysteine modifications in eukaryotes (iCysMod), which curated and hosted 108 030 PCM events for 85 747 experimentally identified sites on 31 483 proteins from 48 eukaryotes for 8 types of PCMs, including oxidation, S-nitrosylation (-SNO), S-glutathionylation (-SSG), disulfide formation (-SSR), S-sulfhydration (-SSH), S-sulfenylation (-SOH), S-sulfinylation (-SO2H) and S-palmitoylation (-S-palm). Browse and search options were provided for accessing the dataset, and detailed information about the PCM events was organized for visualization. With the human dataset in iCysMod, the sequence features around the cysteine modification sites for each PCM type were analyzed, and the results indicated that the various types of PCMs presented distinct sequence recognition preferences. Moreover, different PCMs can crosstalk with each other to synergistically orchestrate specific biological processes, and 37 841 PCM events involved in 119 types of PCM co-occurrences at the same cysteine residues were finally obtained. Taken together, we anticipate that the iCysMod database will provide a useful resource for eukaryotic PCMs to facilitate related research. The online service is freely available at http://icysmod.omicsbio.info.
Affiliation(s)
- Panqin Wang
- School of Life Sciences, Zhengzhou University, Zhengzhou, Henan, China
| | - Qingfeng Zhang
- State Key Laboratory of Oncology in South China, Collaborative Innovation Center for Cancer Medicine, Sun Yat-sen University Cancer Center, Guangzhou, China
| | - Shihua Li
- School of Life Sciences, Zhengzhou University, Zhengzhou, Henan, China
| | - Ben Cheng
- School of Life Sciences, Zhengzhou University, Zhengzhou, Henan, China
| | - Han Xue
- School of Life Sciences, Zhengzhou University, Zhengzhou, Henan, China
| | - Zhen Wei
- School of Life Sciences, Zhengzhou University, Zhengzhou, Henan, China
| | - Tian Shao
- School of Life Sciences, Zhengzhou University, Zhengzhou, Henan, China
| | - Ze-Xian Liu
- State Key Laboratory of Oncology in South China, Collaborative Innovation Center for Cancer Medicine, Sun Yat-sen University Cancer Center, Guangzhou, China
| | - Han Cheng
- School of Life Sciences, Zhengzhou University, Zhengzhou, Henan, China
| | - Zhenlong Wang
- School of Life Sciences, Zhengzhou University, Zhengzhou, Henan, China
|
34
|
Zhao X, He M. Comprehensive pathway-related genes signature for prognosis and recurrence of ovarian cancer. PeerJ 2020; 8:e10437. [PMID: 33344083 PMCID: PMC7718801 DOI: 10.7717/peerj.10437] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/13/2020] [Accepted: 11/06/2020] [Indexed: 12/14/2022] Open
Abstract
Background: Ovarian cancer (OC) is a highly malignant disease with a poor prognosis and high recurrence rate. At present, there is no accurate strategy to predict the prognosis and recurrence of OC. The aim of this study was to identify gene-based signatures to predict OC prognosis and recurrence. Methods: mRNA expression profiles and corresponding clinical information regarding OC were collected from The Cancer Genome Atlas (TCGA) database. Gene set enrichment analysis (GSEA) and LASSO analysis were performed, and Kaplan–Meier curves, time-dependent ROC curves, and nomograms were constructed using R software and GraphPad Prism 7. Results: We first identified several key signalling pathways that affected ovarian tumorigenesis by GSEA. We then established a nine-gene-based signature for overall survival (OS) and a five-gene-based signature for relapse-free survival (RFS) using LASSO Cox regression analysis of the TCGA dataset and validated the prognostic value of these signatures in independent GEO datasets. We also confirmed that these signatures were independent risk factors for OS and RFS by multivariate Cox analysis. Time-dependent ROC analysis showed that the AUC values at 1, 3, 5, and 10 years were 0.640, 0.663, 0.758, and 0.891 for OS and 0.638, 0.722, 0.813, and 0.972 for RFS, respectively. The results of the nomogram analysis demonstrated that combining the two signatures with the TNM staging system and tumour status yielded better predictive ability. Conclusion: The two gene-based signatures established in this study may serve as novel and independent prognostic indicators for OS and RFS.
Affiliation(s)
- Xinnan Zhao
- Department of Rheumatology and Immunology, The First Affiliated Hospital of China Medical University, Shenyang, China
| | - Miao He
- Department of Pharmacology, China Medical University, Shenyang, China
|
35
|
Le NQK, Do DT, Hung TNK, Lam LHT, Huynh TT, Nguyen NTK. A Computational Framework Based on Ensemble Deep Neural Networks for Essential Genes Identification. Int J Mol Sci 2020; 21:E9070. [PMID: 33260643 PMCID: PMC7730808 DOI: 10.3390/ijms21239070] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Received: 10/04/2020] [Revised: 11/25/2020] [Accepted: 11/26/2020] [Indexed: 01/13/2023] Open
Abstract
Essential genes contain key information of genomes that could be the key to a comprehensive understanding of life and evolution. Because of their importance, the study of essential genes is considered a crucial problem in computational biology. Computational methods for identifying essential genes have become increasingly popular as a way to reduce the cost and time of traditional experiments. A few models have addressed this problem, but performance is still not satisfactory because of high-dimensional features and the use of traditional machine learning algorithms. Thus, there is a need for a novel model that improves predictive performance on this problem using DNA sequence features. This study took advantage of a natural language processing (NLP) model to learn biological sequences by treating them as natural language words. To learn from the NLP features, a supervised learning model based on an ensemble deep neural network was subsequently employed. Our proposed method could identify essential genes with sensitivity, specificity, accuracy, Matthews correlation coefficient (MCC), and area under the receiver operating characteristic curve (AUC) values of 60.2%, 84.6%, 76.3%, 0.449, and 0.814, respectively. The overall performance exceeded that of the single models without ensemble, as well as that of the state-of-the-art predictors on the same benchmark dataset. This indicates the effectiveness of the proposed method in determining essential genes in particular, and in other sequencing problems in general.
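The threshold-based metrics reported in this abstract all derive from the four cells of a binary confusion matrix. The sketch below shows the standard definitions of sensitivity, specificity, accuracy, and MCC; the counts passed in at the end are hypothetical and were not taken from the paper, and the function assumes both classes are present so the first three ratios are defined.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix metrics for a binary classifier."""
    sensitivity = tp / (tp + fn)                 # true-positive rate
    specificity = tn / (tn + fp)                 # true-negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }

# Hypothetical counts: 100 essential and 100 non-essential genes.
m = binary_metrics(tp=60, fp=15, tn=85, fn=40)
```

MCC uses all four cells at once, which is why it can stay modest even when specificity and accuracy look high, as in the pattern of values the abstract reports.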
Affiliation(s)
- Nguyen Quoc Khanh Le
- Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei 106, Taiwan
- Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei 106, Taiwan
- Translational Imaging Research Center, Taipei Medical University Hospital, Taipei 110, Taiwan
| | - Duyen Thi Do
- Graduate Institute of Biomedical Informatics, Taipei Medical University, Taipei 106, Taiwan;
| | - Truong Nguyen Khanh Hung
- International Master/Ph.D. Program in Medicine, College of Medicine, Taipei Medical University, Taipei 110, Taiwan; (T.N.K.H.); (L.H.T.L.)
- Department of Orthopedic and Trauma, Cho Ray Hospital, Ho Chi Minh 70000, Vietnam
| | - Luu Ho Thanh Lam
- International Master/Ph.D. Program in Medicine, College of Medicine, Taipei Medical University, Taipei 110, Taiwan; (T.N.K.H.); (L.H.T.L.)
- Intensive Care Unit, Children’s Hospital 2, Ho Chi Minh 70000, Vietnam
| | - Tuan-Tu Huynh
- Department of Electrical Engineering, Yuan Ze University, Taoyuan 320, Taiwan;
- Department of Electrical Electronic and Mechanical Engineering, Lac Hong University, Dong Nai 76120, Vietnam
| | - Ngan Thi Kim Nguyen
- School of Nutrition and Health Sciences, Taipei Medical University, Taipei 110, Taiwan;
|
36
|
Liu W, Juhas M, Zhang Y. Fine-Grained Breast Cancer Classification With Bilinear Convolutional Neural Networks (BCNNs). Front Genet 2020; 11:547327. [PMID: 33101377 PMCID: PMC7500315 DOI: 10.3389/fgene.2020.547327] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 04/02/2020] [Accepted: 08/17/2020] [Indexed: 12/24/2022] Open
Abstract
Classification of histopathological images of cancer is challenging even for well-trained professionals, due to the fine-grained variability of the disease. Deep Convolutional Neural Networks (CNNs) showed great potential for classification of a number of the highly variable fine-grained objects. In this study, we introduce a Bilinear Convolutional Neural Networks (BCNNs) based deep learning method for fine-grained classification of breast cancer histopathological images. We evaluated our model by comparison with several deep learning algorithms for fine-grained classification. We used bilinear pooling to aggregate a large number of orderless features without taking into consideration the disease location. The experimental results on BreaKHis, a publicly available breast cancer dataset, showed that our method is highly accurate with 99.24% and 95.95% accuracy in binary and in fine-grained classification, respectively.
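The bilinear pooling step mentioned in this abstract can be sketched in plain Python. The version below follows the commonly described recipe for bilinear CNNs: take the outer product of two feature vectors at each spatial location, sum over locations, then apply a signed square root and L2 normalisation. Whether the authors used exactly these normalisation steps is not stated in the abstract, so treat them as assumptions; the feature values are toy numbers.

```python
import math

def bilinear_pool(feats_a, feats_b):
    """Pool two per-location feature lists into one orderless descriptor."""
    if len(feats_a) != len(feats_b):
        raise ValueError("both streams must cover the same locations")
    m, n = len(feats_a[0]), len(feats_b[0])
    pooled = [0.0] * (m * n)
    for a, b in zip(feats_a, feats_b):       # one pair of vectors per location
        for i in range(m):
            for j in range(n):
                pooled[i * n + j] += a[i] * b[j]   # accumulate outer product
    # Signed square root followed by L2 normalisation.
    pooled = [math.copysign(math.sqrt(abs(v)), v) for v in pooled]
    norm = math.sqrt(sum(v * v for v in pooled))
    return [v / norm for v in pooled] if norm else pooled

# Toy features at three locations: stream A has 2 channels, stream B has 3.
feats_a = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
feats_b = [[1.0, 0.0, 2.0], [2.0, 1.0, 0.0], [0.0, 1.0, 1.0]]
descriptor = bilinear_pool(feats_a, feats_b)
shuffled = bilinear_pool(feats_a[::-1], feats_b[::-1])
```

Because the outer products are summed over locations, reordering the locations leaves the descriptor unchanged (`descriptor` and `shuffled` agree up to rounding), which is the sense in which the pooled features are orderless and ignore where in the image the disease appears.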
Affiliation(s)
- Weihuang Liu
- College of Science, Harbin Institute of Technology, Shenzhen, China
| | - Mario Juhas
- Faculty of Science and Medicine, University of Fribourg, Fribourg, Switzerland
| | - Yang Zhang
- College of Science, Harbin Institute of Technology, Shenzhen, China
|
37
|
Machine Learning Model for Identifying Antioxidant Proteins Using Features Calculated from Primary Sequences. BIOLOGY 2020; 9:biology9100325. [PMID: 33036150 PMCID: PMC7599600 DOI: 10.3390/biology9100325] [Citation(s) in RCA: 39] [Impact Index Per Article: 9.8] [Received: 08/27/2020] [Revised: 10/03/2020] [Accepted: 10/04/2020] [Indexed: 12/15/2022]
Abstract
Antioxidant proteins play important roles in many aspects of cellular life. They protect the cell and DNA from oxidative substances (such as peroxide, nitric oxide, and oxygen-free radicals), which are known as reactive oxygen species (ROS). Free radical generation and antioxidant defenses are opposing factors in the human body, and the balance between them is necessary to maintain a healthy body. An unhealthy routine or age-related degeneration can break the balance, leading to more ROS than antioxidants and causing damage to health. In general, the antioxidant mechanism is the combination of antioxidant molecules and ROS in a one-electron reaction. Creating computational models to promptly identify antioxidant candidates is essential in supporting antioxidant detection experiments in the laboratory. In this study, we proposed a machine learning-based model for this prediction purpose from a benchmark set of sequencing data. The experiments were conducted using 10-fold cross-validation in the training process and validated on three different independent datasets. Different machine learning and deep learning algorithms were evaluated on an optimal set of sequence features. Among them, Random Forest was identified as the best-performing model for identifying antioxidant proteins. Our optimal model achieved a high accuracy of 84.6%, as well as a balance between sensitivity (81.5%) and specificity (85.1%), for antioxidant protein identification on the training dataset. The performance results on the independent datasets also showed the significance of our model compared to previously published works on antioxidant protein identification.
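The 10-fold cross-validation protocol mentioned in this abstract can be sketched with the standard library alone. The generator below shuffles sample indices once, deals them into k folds, and yields each fold in turn as the held-out set. The function name, the seed, and the sample count are illustrative assumptions; the study's actual splitting procedure (for example, whether folds were stratified by class) is not described in the abstract.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # deal indices round-robin
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i
                     for idx in fold]
        yield train_idx, test_idx

# 25 hypothetical samples split into 10 folds of size 2 or 3.
splits = list(k_fold_indices(25, k=10))
```

Each sample is held out exactly once, so averaging a metric over the ten test folds gives one estimate per model that uses all the training data without ever scoring a sample the model was fitted on.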
|
38
|
Le NQK, Do DT, Chiu FY, Yapp EKY, Yeh HY, Chen CY. XGBoost Improves Classification of MGMT Promoter Methylation Status in IDH1 Wildtype Glioblastoma. J Pers Med 2020; 10:jpm10030128. [PMID: 32942564 PMCID: PMC7563334 DOI: 10.3390/jpm10030128] [Citation(s) in RCA: 57] [Impact Index Per Article: 14.3] [Received: 08/22/2020] [Revised: 09/03/2020] [Accepted: 09/09/2020] [Indexed: 02/07/2023] Open
Abstract
Approximately 96% of patients with glioblastomas (GBM) have IDH1 wildtype GBMs, characterized by extremely poor prognosis, partly due to resistance to standard temozolomide treatment. O6-Methylguanine-DNA methyltransferase (MGMT) promoter methylation status is a crucial prognostic biomarker for alkylating chemotherapy resistance in patients with GBM. However, MGMT methylation status identification methods, where the tumor tissue is often undersampled, are time consuming and expensive. Currently, presurgical noninvasive imaging methods are used to identify biomarkers to predict MGMT methylation status. We evaluated a novel radiomics-based eXtreme Gradient Boosting (XGBoost) model to identify MGMT promoter methylation status in patients with IDH1 wildtype GBM. This retrospective study enrolled 53 patients with pathologically proven GBM and tested MGMT methylation and IDH1 status. Radiomics features were extracted from multimodality MRI and tested by F-score analysis to identify important features to improve our model. We identified nine radiomics features that reached an area under the curve of 0.896, which outperformed other classifiers reported previously. These features could be important biomarkers for identifying MGMT methylation status in IDH1 wildtype GBM. The combination of radiomics feature extraction and F-score feature selection significantly improved the performance of the XGBoost model, which may have implications for patient stratification and therapeutic strategy in GBM.
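F-score feature ranking, as referred to in this abstract, scores each feature by how well it separates the two classes. The abstract does not give the formula, so the sketch below assumes one widely used definition: the squared distances of the two class means from the overall mean, divided by the sum of the two within-class sample variances. The feature values are invented for illustration.

```python
def f_score(pos_values, neg_values):
    """Two-class F-score of one feature (larger = more discriminative)."""
    n_pos, n_neg = len(pos_values), len(neg_values)
    if n_pos < 2 or n_neg < 2:
        raise ValueError("need at least two samples per class")
    mean_pos = sum(pos_values) / n_pos
    mean_neg = sum(neg_values) / n_neg
    mean_all = (sum(pos_values) + sum(neg_values)) / (n_pos + n_neg)
    between = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
    var_pos = sum((x - mean_pos) ** 2 for x in pos_values) / (n_pos - 1)
    var_neg = sum((x - mean_neg) ** 2 for x in neg_values) / (n_neg - 1)
    within = var_pos + var_neg
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return between / within

# Hypothetical values of two radiomics features in each class.
strong = f_score([5.0, 6.0, 7.0], [1.0, 2.0, 3.0])   # well separated
weak = f_score([3.0, 5.0, 4.0], [4.0, 3.0, 5.0])     # identical class means
```

Features would then be sorted by score and the top-ranked ones passed to the classifier; the score treats each feature independently, so it does not account for redundancy between correlated features.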
Affiliation(s)
- Nguyen Quoc Khanh Le
- Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei City 106, Taiwan
- Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei City 106, Taiwan;
- Correspondence: (N.Q.K.L.); (C.-Y.C.); Tel.: +886-266-382-736 (ext. 1992) (N.Q.K.L.); Fax: +886-2-2732-1956 (N.Q.K.L.)
| | - Duyen Thi Do
- Faculty of Applied Sciences, Ton Duc Thang University, Ho Chi Minh City 70000, Vietnam;
| | - Fang-Ying Chiu
- Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei City 106, Taiwan;
| | - Edward Kien Yee Yapp
- Singapore Institute of Manufacturing Technology, 2 Fusionopolis Way, #08-04, Innovis, Singapore 138634, Singapore;
| | - Hui-Yuan Yeh
- Medical Humanities Research Cluster, School of Humanities, Nanyang Technological University, 48 Nanyang Ave, Singapore 639798, Singapore;
| | - Cheng-Yu Chen
- Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei City 106, Taiwan
- Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei City 106, Taiwan;
- Department of Radiology, School of Medicine, College of Medicine, Taipei Medical University, Taipei 11031, Taiwan
- Department of Medical Imaging, Taipei Medical University Hospital, Taipei 11031, Taiwan
- Correspondence: (N.Q.K.L.); (C.-Y.C.); Tel.: +886-266-382-736 (ext. 1992) (N.Q.K.L.); Fax: +886-2-2732-1956 (N.Q.K.L.)
|