1
Miyake H, Kawaguchi RK, Kiryu H. RNAelem: an algorithm for discovering sequence-structure motifs in RNA bound by RNA-binding proteins. Bioinformatics Advances 2024; 4:vbae144. [PMID: 39399375 PMCID: PMC11471262 DOI: 10.1093/bioadv/vbae144] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/25/2024] [Revised: 09/08/2024] [Accepted: 09/26/2024] [Indexed: 10/15/2024]
Abstract
Motivation: RNA-binding proteins (RBPs) play a crucial role in the post-transcriptional regulation of RNA. Given their importance, analyzing the specific RNA patterns recognized by RBPs has become a significant research focus in bioinformatics. Deep neural networks have enhanced the accuracy of prediction for RBP-binding sites, yet understanding the structural basis of RBP-binding specificity from these models is challenging due to their limited interpretability. To address this, we developed RNAelem, which combines profile context-free grammar and the Turner energy model for RNA secondary structure to predict sequence-structure motifs in RBP-binding regions. Results: RNAelem exhibited superior detection accuracy compared to existing tools for RNA sequences with structural motifs. Upon applying RNAelem to the eCLIP database, we were not only able to reproduce many known primary sequence motifs in the absence of secondary structures, but also discovered many secondary structural motifs that contained sequence-nonspecific insertion regions. Furthermore, the high interpretability of RNAelem yielded insightful findings such as long-range base-pairing interactions in the binding region of the U2AF protein. Availability and implementation: The code is available at https://github.com/iyak/RNAelem.
Affiliation(s)
- Hiroshi Miyake
  - Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, University of Tokyo, Chiba 277-8561, Japan
- Risa Karakida Kawaguchi
  - Department of Life Science Frontiers, Center for iPS Cell Research and Application (CiRA), Kyoto University, Sakyo-ku 606-8507, Japan
- Hisanori Kiryu
  - Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, University of Tokyo, Chiba 277-8561, Japan
2
Todhunter ME, Jubair S, Verma R, Saqe R, Shen K, Duffy B. Artificial intelligence and machine learning applications for cultured meat. Front Artif Intell 2024; 7:1424012. [PMID: 39381621 PMCID: PMC11460582 DOI: 10.3389/frai.2024.1424012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/26/2024] [Accepted: 08/21/2024] [Indexed: 10/10/2024] Open
Abstract
Cultured meat has the potential to provide a complementary meat industry with reduced environmental, ethical, and health impacts. However, major technological challenges remain which require time- and resource-intensive research and development efforts. Machine learning has the potential to accelerate cultured meat technology by streamlining experiments, predicting optimal results, and reducing experimentation time and resources. However, the use of machine learning in cultured meat is in its infancy. This review covers the work available to date on the use of machine learning in cultured meat and explores future possibilities. We address four major areas of cultured meat research and development: establishing cell lines, cell culture media design, microscopy and image analysis, and bioprocessing and food processing optimization. In addition, we have included a survey of datasets relevant to cultured meat research. This review aims to provide the foundation necessary for both cultured meat and machine learning scientists to identify research opportunities at the intersection between cultured meat and machine learning.
Affiliation(s)
- Sheikh Jubair
  - Alberta Machine Intelligence Institute, Edmonton, AB, Canada
- Ruchika Verma
  - Alberta Machine Intelligence Institute, Edmonton, AB, Canada
- Rikard Saqe
  - Department of Biology, University of Waterloo, Waterloo, ON, Canada
- Kevin Shen
  - Department of Mathematics, University of Waterloo, Waterloo, ON, Canada
3
Fu C, Yang T, Liao H, Huang Y, Wang H, Long W, Jiang N, Yang Y. Genome-wide identification and molecular evolution of elongation family of very long chain fatty acids proteins in Cyrtotrachelus buqueti. BMC Genomics 2024; 25:758. [PMID: 39095734 PMCID: PMC11297609 DOI: 10.1186/s12864-024-10658-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/03/2024] [Accepted: 07/24/2024] [Indexed: 08/04/2024] Open
Abstract
To reveal the molecular function of the elongation of very long chain fatty acids (ELO) protein family in Cyrtotrachelus buqueti, we identified 15 ELO proteins in the C. buqueti genome. The 15 CbuELO proteins were located on four chromosomes. Their isoelectric points ranged from 9.22 to 9.68, indicating that they are alkaline. These CbuELO proteins were stable and hydrophobic. CbuELO proteins had transmembrane movement and multiple phosphorylation sites. The secondary structure of CbuELO proteins was mainly α-helix. A total of 10 conserved motifs were identified in the CbuELO protein family. Phylogenetic analysis showed that the ELO protein family of C. buqueti was most closely related to that of Tribolium castaneum. Developmental transcriptome analysis indicated that CbuELO10, CbuELO13 and CbuELO02 were key enzyme genes determining the synthesis of very long chain fatty acids in pupae and eggs, CbuELO6 and CbuELO7 in males, and CbuELO8 and CbuELO11 in larvae. Transcriptome analysis under different temperature conditions indicated that CbuELO1, CbuELO5, CbuELO12 and CbuELO14 participated in regulating temperature stress responses. Transcriptome analysis at different feeding times showed that CbuELO12 expression was significantly downregulated in all feeding time periods. The qRT-PCR experiment verified the expression level changes of the CbuELO gene family under different temperature and feeding time conditions. Protein-protein interaction analysis showed that 9 CbuELO proteins were related to each other, and that CbuELO1, CbuELO4 and CbuELO12 each had more than one interaction. These results lay a theoretical foundation for further study of the molecular function of the ELO family during growth and development of C. buqueti.
Affiliation(s)
- Chun Fu
  - Key Laboratory of Sichuan Province for Bamboo Pests Control and Resource Development, Leshan Normal University, No. 778 Binhe Road, Shizhong District, Leshan, Sichuan, 614000, China
  - College of Life Science, Leshan Normal University, No. 778 Binhe Road, Shizhong District, Leshan, Sichuan, 614000, China
- Ting Yang
  - Key Laboratory of Sichuan Province for Bamboo Pests Control and Resource Development, Leshan Normal University, No. 778 Binhe Road, Shizhong District, Leshan, Sichuan, 614000, China
  - College of Life Science, Leshan Normal University, No. 778 Binhe Road, Shizhong District, Leshan, Sichuan, 614000, China
- Hong Liao
  - Key Laboratory of Sichuan Province for Bamboo Pests Control and Resource Development, Leshan Normal University, No. 778 Binhe Road, Shizhong District, Leshan, Sichuan, 614000, China
  - College of Life Science, Leshan Normal University, No. 778 Binhe Road, Shizhong District, Leshan, Sichuan, 614000, China
- YuLing Huang
  - Key Laboratory of Sichuan Province for Bamboo Pests Control and Resource Development, Leshan Normal University, No. 778 Binhe Road, Shizhong District, Leshan, Sichuan, 614000, China
  - College of Life Science, Leshan Normal University, No. 778 Binhe Road, Shizhong District, Leshan, Sichuan, 614000, China
- HanYu Wang
  - Key Laboratory of Sichuan Province for Bamboo Pests Control and Resource Development, Leshan Normal University, No. 778 Binhe Road, Shizhong District, Leshan, Sichuan, 614000, China
  - College of Life Science, Leshan Normal University, No. 778 Binhe Road, Shizhong District, Leshan, Sichuan, 614000, China
- WenCong Long
  - Key Laboratory of Sichuan Province for Bamboo Pests Control and Resource Development, Leshan Normal University, No. 778 Binhe Road, Shizhong District, Leshan, Sichuan, 614000, China
  - College of Life Science, Leshan Normal University, No. 778 Binhe Road, Shizhong District, Leshan, Sichuan, 614000, China
- Na Jiang
  - College of Tourism and Geographical Science, Leshan Normal University, No. 778 Binhe Road, Shizhong District, Leshan, Sichuan, 614000, China
- YaoJun Yang
  - Key Laboratory of Sichuan Province for Bamboo Pests Control and Resource Development, Leshan Normal University, No. 778 Binhe Road, Shizhong District, Leshan, Sichuan, 614000, China
  - College of Life Science, Leshan Normal University, No. 778 Binhe Road, Shizhong District, Leshan, Sichuan, 614000, China
4
Lefin N, Herrera-Belén L, Farias JG, Beltrán JF. Review and perspective on bioinformatics tools using machine learning and deep learning for predicting antiviral peptides. Mol Divers 2024; 28:2365-2374. [PMID: 37626205 DOI: 10.1007/s11030-023-10718-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/02/2023] [Accepted: 08/15/2023] [Indexed: 08/27/2023]
Abstract
Viruses constitute a constant threat to global health and have caused millions of human and animal deaths throughout human history. Despite advances in the discovery of antiviral compounds that help fight these pathogens, finding a solution to this problem continues to be a task that consumes time and financial resources. Artificial intelligence (AI) has revolutionized many areas of the biological sciences, making it possible to decipher patterns in amino acid sequences that encode different functions and activities. Within the field of AI, machine learning and deep learning algorithms have been used to discover antimicrobial peptides (AMPs). Due to their effectiveness and specificity, AMPs hold excellent promise for treating various infections caused by pathogens. Antiviral peptides (AVPs) are a specific type of AMP with activity against certain viruses. In contrast to the many tools and methods developed for predicting AMPs, those for predicting AVPs are still scarce. Given the significance of AVPs as potential pharmaceutical options for human and animal health and the ongoing AI revolution, we have reviewed and summarized the current machine learning and deep learning-based tools and methods available for predicting these types of peptides.
Affiliation(s)
- Nicolás Lefin
  - Department of Chemical Engineering, Faculty of Engineering and Science, University of La Frontera, Ave. Francisco Salazar, 01145, Temuco, Chile
- Lisandra Herrera-Belén
  - Departamento de Ciencias Básicas, Facultad de Ciencias, Universidad Santo Tomás, Temuco, Chile
- Jorge G Farias
  - Department of Chemical Engineering, Faculty of Engineering and Science, University of La Frontera, Ave. Francisco Salazar, 01145, Temuco, Chile
- Jorge F Beltrán
  - Department of Chemical Engineering, Faculty of Engineering and Science, University of La Frontera, Ave. Francisco Salazar, 01145, Temuco, Chile
5
Selvam PK, Elavarasu SM, Dhanushkumar T, Vasudevan K, George Priya Doss C. Exploring the role of estrogen and progestins in breast cancer: A genomic approach to diagnosis. Advances in Protein Chemistry and Structural Biology 2024; 142:25-43. [PMID: 39059987 DOI: 10.1016/bs.apcsb.2023.12.023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/28/2024]
Abstract
Breast cancer (BC) is the most common cancer among women and a major cause of death from cancer. The role of estrogen and progestins, including synthetic hormones like R5020, in the development of BC has been highlighted in numerous studies. In our study, we employed machine learning and advanced bioinformatics to identify genes that could serve as diagnostic markers for BC. We thoroughly analyzed the transcriptomic data of two BC cell lines, T47D and UDC4, and performed differential gene expression analysis. We also conducted functional enrichment analysis to understand the biological functions influenced by these genes. Our study identified several diagnostic genes strongly associated with BC, including MIR6728, ENO1-IT1, ENO1-AS1, RNU6-304P, HMGN2P17, RP3-477M7.5, RP3-477M7.6, and CA6. The genes MIR6728, ENO1-IT1, ENO1-AS1, and HMGN2P17 are involved in cancer control, glycolysis, and DNA-related processes, while CA6 is associated with apoptosis and cancer development. These genes could potentially serve as predictors for BC, paving the way for more precise diagnostic methods and personalized treatment plans. This research enhances our understanding of BC and offers promising avenues for improving patient care in the future.
Affiliation(s)
- Prasanna Kumar Selvam
  - Department of Biotechnology, School of Applied Sciences, REVA University, Bengaluru, India; Institute of Bioinformatics, International Technology Park, Bangalore, India
- T Dhanushkumar
  - Department of Biotechnology, School of Applied Sciences, REVA University, Bengaluru, India
- Karthick Vasudevan
  - Institute of Bioinformatics, International Technology Park, Bangalore, India; Manipal Academy of Higher Education (MAHE), Manipal, India
- C George Priya Doss
  - Laboratory of Integrative Genomics, Department of Integrative Biology, School of BioSciences and Technology, Vellore Institute of Technology (VIT), Vellore, Tamil Nadu, India
6
Darmofal M, Suman S, Atwal G, Toomey M, Chen JF, Chang JC, Vakiani E, Varghese AM, Balakrishnan Rema A, Syed A, Schultz N, Berger MF, Morris Q. Deep-Learning Model for Tumor-Type Prediction Using Targeted Clinical Genomic Sequencing Data. Cancer Discov 2024; 14:1064-1081. [PMID: 38416134 PMCID: PMC11145170 DOI: 10.1158/2159-8290.cd-23-0996] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/30/2023] [Revised: 12/07/2023] [Accepted: 02/23/2024] [Indexed: 02/29/2024]
Abstract
Tumor type guides clinical treatment decisions in cancer, but histology-based diagnosis remains challenging. Genomic alterations are highly diagnostic of tumor type, and tumor-type classifiers trained on genomic features have been explored, but the most accurate methods are not clinically feasible, relying on features derived from whole-genome sequencing (WGS), or predicting across limited cancer types. We use genomic features from a data set of 39,787 solid tumors sequenced using a clinically targeted cancer gene panel to develop Genome-Derived-Diagnosis Ensemble (GDD-ENS): a hyperparameter ensemble for classifying tumor type using deep neural networks. GDD-ENS achieves 93% accuracy for high-confidence predictions across 38 cancer types, rivaling the performance of WGS-based methods. GDD-ENS can also guide diagnoses of rare type and cancers of unknown primary and incorporate patient-specific clinical information for improved predictions. Overall, integrating GDD-ENS into prospective clinical sequencing workflows could provide clinically relevant tumor-type predictions to guide treatment decisions in real time. Significance: We describe a highly accurate tumor-type prediction model, designed specifically for clinical implementation. Our model relies only on widely used cancer gene panel sequencing data, predicts across 38 distinct cancer types, and supports integration of patient-specific nongenomic information for enhanced decision support in challenging diagnostic situations. See related commentary by Garg, p. 906. This article is featured in Selected Articles from This Issue, p. 897.
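The abstract does not say how ensemble outputs are combined or how a prediction is labelled high-confidence. As an illustration only, the sketch below averages the per-class probabilities of several models and flags the result when the top averaged probability clears a cutoff. The function name, the averaging rule, and the 0.75 threshold are assumptions made for this example, not details taken from GDD-ENS.

```python
from statistics import fmean


def ensemble_predict(member_probs, labels, threshold=0.75):
    """Combine per-model class probabilities by simple averaging.

    member_probs: list of probability vectors, one per ensemble member.
    labels: class names, in the same order as the probability vectors.
    threshold: hypothetical cutoff for calling a prediction high-confidence.
    Returns (predicted_label, confidence, is_high_confidence).
    """
    n_classes = len(labels)
    # Mean probability of each class across all ensemble members.
    mean_probs = [fmean(p[i] for p in member_probs) for i in range(n_classes)]
    # Index of the class with the highest averaged probability.
    best = max(range(n_classes), key=mean_probs.__getitem__)
    confidence = mean_probs[best]
    return labels[best], confidence, confidence >= threshold


if __name__ == "__main__":
    probs = [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1], [0.7, 0.2, 0.1]]
    print(ensemble_predict(probs, ["colorectal", "lung", "breast"]))
```

Reporting accuracy only on predictions above such a cutoff trades coverage for reliability, which is one common reading of "accuracy for high-confidence predictions".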
Affiliation(s)
- Madison Darmofal
  - Computational and Systems Biology Program, Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center, New York, New York
  - Tri-Institutional Training Program in Computational Biology and Medicine, Weill Cornell Medicine, New York, New York
- Shalabh Suman
  - Department of Pathology, Memorial Sloan Kettering Cancer Center, New York, New York
- Gurnit Atwal
  - Computational Biology Program, Ontario Institute for Cancer Research, Toronto, Ontario, Canada
  - Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
  - Vector Institute, Toronto, Ontario, Canada
- Michael Toomey
  - Computational and Systems Biology Program, Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center, New York, New York
  - Tri-Institutional Training Program in Computational Biology and Medicine, Weill Cornell Medicine, New York, New York
- Jie-Fu Chen
  - Department of Pathology, Memorial Sloan Kettering Cancer Center, New York, New York
- Jason C. Chang
  - Department of Pathology, Memorial Sloan Kettering Cancer Center, New York, New York
- Efsevia Vakiani
  - Department of Pathology, Memorial Sloan Kettering Cancer Center, New York, New York
- Anna M. Varghese
  - Department of Medicine, Memorial Sloan Kettering Cancer Center, New York, New York
- Aijazuddin Syed
  - Department of Pathology, Memorial Sloan Kettering Cancer Center, New York, New York
- Nikolaus Schultz
  - Marie-Josée and Henry R. Kravis Center for Molecular Oncology, Memorial Sloan Kettering Cancer Center, New York, New York
  - Human Oncology and Pathogenesis Program, Memorial Sloan Kettering Cancer Center, New York, New York
  - Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, New York
- Michael F. Berger
  - Department of Pathology, Memorial Sloan Kettering Cancer Center, New York, New York
  - Marie-Josée and Henry R. Kravis Center for Molecular Oncology, Memorial Sloan Kettering Cancer Center, New York, New York
  - Human Oncology and Pathogenesis Program, Memorial Sloan Kettering Cancer Center, New York, New York
- Quaid Morris
  - Computational and Systems Biology Program, Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center, New York, New York
7
Li Z, Jin B, Fang J. MetaAc4C: A multi-module deep learning framework for accurate prediction of N4-acetylcytidine sites based on pre-trained bidirectional encoder representation and generative adversarial networks. Genomics 2024; 116:110749. [PMID: 38008265 DOI: 10.1016/j.ygeno.2023.110749] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2023] [Revised: 11/05/2023] [Accepted: 11/21/2023] [Indexed: 11/28/2023]
Abstract
Motivation: N4-acetylcytidine (ac4C) is a highly conserved RNA modification that plays a crucial role in various biological processes. Accurately identifying ac4C sites is of paramount importance for gaining a deeper understanding of their regulatory mechanisms. Nevertheless, existing experimental techniques for ac4C site identification are limited in cost-effectiveness, while the accuracy of current computational methods for identifying ac4C sites needs further improvement. Results: In this paper, we present MetaAc4C, an advanced deep learning model that leverages pre-trained bidirectional encoder representations from transformers (BERT). The model is based on a bi-directional long short-term memory network (BLSTM) architecture, incorporating an attention mechanism and residual connections. To address the issue of data imbalance, we adapt generative adversarial networks to generate synthetic feature samples. On the independent test set, MetaAc4C surpasses the current state-of-the-art ac4C prediction model, improving ACC, MCC, and AUROC by 2.36%, 4.76%, and 3.11%, respectively, on the unbalanced dataset. When evaluated on the balanced dataset, MetaAc4C improves ACC, MCC, and AUROC by 2.6%, 5.11%, and 1.01%, respectively. Notably, training on RNA samples augmented with WGAN-GP outperforms the SMOTE oversampling method.
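For readers unfamiliar with the metrics reported above, the following minimal sketch computes accuracy (ACC) and the Matthews correlation coefficient (MCC) from the four cells of a binary confusion matrix using their standard definitions. The counts in the usage example are invented for illustration and are not taken from the MetaAc4C evaluation.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, in the range [-1, 1].

    Returns 0.0 when a row or column of the confusion matrix is empty,
    a common convention for the otherwise undefined 0/0 case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Invented counts: 40 true positives, 45 true negatives,
    # 5 false positives, 10 false negatives.
    print(accuracy(40, 45, 5, 10))
    print(round(mcc(40, 45, 5, 10), 4))
```

MCC uses all four cells of the confusion matrix, so it is less easily inflated by class imbalance than accuracy, which is presumably why both are reported for the unbalanced and balanced datasets.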
Affiliation(s)
- Zutan Li
  - College of Engineering, Westlake University, Hangzhou, China; College of Sciences, Nanjing Agricultural University, Nanjing, China
- Bingbing Jin
  - College of Sciences, Nanjing Agricultural University, Nanjing, China
- Jingya Fang
  - College of Science, China Pharmaceutical University, Nanjing, China
8
Mao J, Cao Y, Zhang Y, Huang B, Zhao Y. A novel method for identifying key genes in macroevolution based on deep learning with attention mechanism. Sci Rep 2023; 13:19727. [PMID: 37957311 PMCID: PMC10643560 DOI: 10.1038/s41598-023-47113-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/27/2023] [Accepted: 11/09/2023] [Indexed: 11/15/2023] Open
Abstract
Macroevolution can be regarded as the result of evolutionary changes of synergistically acting genes. Unfortunately, the importance of these genes in macroevolution is difficult to assess, and hence the identification of macroevolutionary key genes is a major challenge in evolutionary biology. In this study, we designed various word embedding libraries of natural language processing (NLP) considering the multiple mechanisms of evolutionary genomics. A novel method (IKGM) based on three types of attention mechanisms (domain attention, kmer attention and fused attention) was proposed to calculate the weights of different genes in macroevolution. Taking 34 species of diurnal butterflies and nocturnal moths in Lepidoptera as an example, we identified a few key genes with high weights, which were annotated with functions related to circadian rhythms, sensory organs, and behavioral habits. This study not only provides a novel method to identify the key genes of macroevolution at the genomic level, but also helps us to understand the microevolution mechanisms of diurnal butterflies and nocturnal moths in Lepidoptera.
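The abstract mentions NLP-style word embeddings and "kmer attention" without describing the tokenization. One common way to treat a biological sequence as text is to split it into overlapping k-mers and map each distinct k-mer to an integer index that an embedding layer can consume. The sketch below shows that generic idea; the function names, the choice of k = 3, and the example sequence are all hypothetical and are not taken from IKGM.

```python
def kmer_tokens(sequence, k=3):
    """Split a sequence into overlapping k-mers ("words"), stride 1."""
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_vocabulary(token_lists):
    """Assign each distinct k-mer an integer id in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab)
    return vocab


if __name__ == "__main__":
    tokens = kmer_tokens("ATGCATG", k=3)
    vocab = build_vocabulary([tokens])
    # Integer ids that could be fed to an embedding layer.
    print(tokens, [vocab[t] for t in tokens])
```

An attention layer over such tokens assigns each k-mer a weight, which is one way per-gene importance scores of the kind described above could be derived.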
Affiliation(s)
- Jiawei Mao
  - College of Big Data and Intelligent Engineering, Southwest Forestry University, Kunming, 650224, China
- Yong Cao
  - College of Big Data and Intelligent Engineering, Southwest Forestry University, Kunming, 650224, China
- Yan Zhang
  - College of Mathematics and Physics, Southwest Forestry University, Kunming, 650224, China
- Biaosheng Huang
  - College of Big Data and Intelligent Engineering, Southwest Forestry University, Kunming, 650224, China
- Youjie Zhao
  - College of Big Data and Intelligent Engineering, Southwest Forestry University, Kunming, 650224, China
9
Bhonde SB, Wagh SK, Prasad JR. Identification of cancer types from gene expressions using learning techniques. Comput Methods Biomech Biomed Engin 2023; 26:1951-1965. [PMID: 36562388 DOI: 10.1080/10255842.2022.2160243] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/11/2022] [Revised: 10/15/2022] [Accepted: 11/15/2022] [Indexed: 12/24/2022]
Abstract
Tumors are a major cause of death around the world. Early detection and prediction of cancer type are important for a patient's well-being. Functional genomic data has recently been used for the effective and early detection of cancer. According to previous research, the use of microarray data in cancer prediction has revealed two main problems: high dimensionality and limited sample size. Researchers have used numerous statistical and machine learning-based methods to classify cancer types, but limitations remain that make cancer classification a difficult job. Deep Learning (DL) and Convolutional Neural Networks (CNN) have proven effective for analyzing unstructured data, including gene expression data. In the proposed method, gene expression data for five types of cancer is collected from The Cancer Genome Atlas (TCGA). Prominent features are selected using a hybrid Particle Swarm Optimization (PSO) and Random Forest (RF) algorithm, followed by Principal Component Analysis (PCA) for dimensionality reduction. Finally, for classification, a blend of a CNN and Bi-directional Long Short Term Memory (Bi-LSTM) is used to predict the target type of cancer. Experimental results demonstrate that the accuracy of the proposed method is 96.89%, which is better than that of existing work.
Affiliation(s)
- Swati B Bhonde
  - Smt. Kashibai Navale College of Engineering, Pune, India
10
Darmofal M, Suman S, Atwal G, Chen JF, Chang JC, Toomey M, Vakiani E, Varghese AM, Rema AB, Syed A, Schultz N, Berger M, Morris Q. Deep Learning Model for Tumor Type Prediction using Targeted Clinical Genomic Sequencing Data. medRxiv: The Preprint Server for Health Sciences 2023:2023.09.08.23295131. [PMID: 37732244 PMCID: PMC10508812 DOI: 10.1101/2023.09.08.23295131] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/22/2023]
Abstract
Tumor type guides clinical treatment decisions in cancer, but histology-based diagnosis remains challenging. Genomic alterations are highly diagnostic of tumor type, and tumor type classifiers trained on genomic features have been explored, but the most accurate methods are not clinically feasible, relying on features derived from whole genome sequencing (WGS), or predicting across limited cancer types. We use genomic features from a dataset of 39,787 solid tumors sequenced using a clinical targeted cancer gene panel to develop Genome-Derived-Diagnosis Ensemble (GDD-ENS): a hyperparameter ensemble for classifying tumor type using deep neural networks. GDD-ENS achieves 93% accuracy for high-confidence predictions across 38 cancer types, rivalling performance of WGS-based methods. GDD-ENS can also guide diagnoses on rare type and cancers of unknown primary, and incorporate patient-specific clinical information for improved predictions. Overall, integrating GDD-ENS into prospective clinical sequencing workflows has enabled clinically-relevant tumor type predictions to guide treatment decisions in real time.
Affiliation(s)
- Madison Darmofal
  - Computational and Systems Biology Program, Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center; New York, NY 10065, USA
  - Tri-Institutional Training Program in Computational Biology and Medicine, Weill Cornell Medicine; New York, NY 10065, USA
- Shalabh Suman
  - Department of Pathology, Memorial Sloan Kettering Cancer Center; New York, NY 10065, USA
- Gurnit Atwal
  - Computational Biology Program, Ontario Institute for Cancer Research; Toronto, ON M5G 0A3, Canada
  - Department of Molecular Genetics, University of Toronto; Toronto, ON M5S 1A8, Canada
  - Vector Institute; Toronto, ON M5G 1M1, Canada
- Jie-Fu Chen
  - Department of Pathology, Memorial Sloan Kettering Cancer Center; New York, NY 10065, USA
- Jason C. Chang
  - Department of Pathology, Memorial Sloan Kettering Cancer Center; New York, NY 10065, USA
- Michael Toomey
  - Computational and Systems Biology Program, Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center; New York, NY 10065, USA
  - Tri-Institutional Training Program in Computational Biology and Medicine, Weill Cornell Medicine; New York, NY 10065, USA
- Efsevia Vakiani
  - Department of Pathology, Memorial Sloan Kettering Cancer Center; New York, NY 10065, USA
- Anna M Varghese
  - Department of Medicine, Memorial Sloan Kettering Cancer Center; New York, NY 10065, USA
- Aijazuddin Syed
  - Department of Pathology, Memorial Sloan Kettering Cancer Center; New York, NY 10065, USA
- Nikolaus Schultz
  - Marie-Josée and Henry R. Kravis Center for Molecular Oncology, Memorial Sloan Kettering Cancer Center; New York, NY 10065, USA
  - Human Oncology and Pathogenesis Program, Memorial Sloan Kettering Cancer Center; New York, NY 10065, USA
  - Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center; New York, NY 10065, USA
- Michael Berger
  - Department of Pathology, Memorial Sloan Kettering Cancer Center; New York, NY 10065, USA
  - Marie-Josée and Henry R. Kravis Center for Molecular Oncology, Memorial Sloan Kettering Cancer Center; New York, NY 10065, USA
  - Human Oncology and Pathogenesis Program, Memorial Sloan Kettering Cancer Center; New York, NY 10065, USA
- Quaid Morris
  - Computational and Systems Biology Program, Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center; New York, NY 10065, USA
11
Szulc NA, Mackiewicz Z, Bujnicki JM, Stefaniak F. Structural interaction fingerprints and machine learning for predicting and explaining binding of small molecule ligands to RNA. Brief Bioinform 2023; 24:bbad187. [PMID: 37204195 DOI: 10.1093/bib/bbad187] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 12/27/2022] [Revised: 04/07/2023] [Accepted: 04/25/2023] [Indexed: 05/20/2023] Open
Abstract
Ribonucleic acids (RNAs) play crucial roles in living organisms. Some of them, such as bacterial ribosomes and precursor messenger RNA, are targets of small molecule drugs, whereas others, e.g. bacterial riboswitches or viral RNA motifs, are considered potential therapeutic targets. Thus, the continuous discovery of new functional RNA increases the demand for developing compounds targeting them and for methods for analyzing RNA-small molecule interactions. We recently developed fingeRNAt, a software tool for detecting non-covalent bonds formed within complexes of nucleic acids with different types of ligands. The program detects several non-covalent interactions and encodes them as a structural interaction fingerprint (SIFt). Here, we present the application of SIFts accompanied by machine learning methods for binding prediction of small molecules to RNA. We show that SIFt-based models outperform the classic, general-purpose scoring functions in virtual screening. We also employed Explainable Artificial Intelligence (XAI) methods, including SHapley Additive exPlanations and Local Interpretable Model-agnostic Explanations, to help understand the decision-making process behind the predictive models. We conducted a case study in which we applied XAI on a predictive model of ligand binding to human immunodeficiency virus type 1 trans-activation response element RNA to distinguish between residues and interaction types important for binding. We also used XAI to indicate whether an interaction has a positive or negative effect on binding prediction and to quantify its impact. Our results obtained using all XAI methods were consistent with the literature data, demonstrating the utility and importance of XAI in medicinal chemistry and bioinformatics.
Affiliation(s)
- Natalia A Szulc
- Laboratory of Bioinformatics and Protein Engineering, International Institute of Molecular and Cell Biology in Warsaw, 4 Ks. Trojdena Str, 02-109 Warsaw, Poland
- Laboratory of Protein Metabolism, International Institute of Molecular and Cell Biology in Warsaw, 4 Ks. Trojdena Str, 02-109 Warsaw, Poland
| | - Zuzanna Mackiewicz
- Laboratory of Bioinformatics and Protein Engineering, International Institute of Molecular and Cell Biology in Warsaw, 4 Ks. Trojdena Str, 02-109 Warsaw, Poland
- Laboratory of RNA Biology - ERA Chairs Group, International Institute of Molecular and Cell Biology in Warsaw, 4 Ks. Trojdena Str, 02-109 Warsaw, Poland
| | - Janusz M Bujnicki
- Laboratory of Bioinformatics and Protein Engineering, International Institute of Molecular and Cell Biology in Warsaw, 4 Ks. Trojdena Str, 02-109 Warsaw, Poland
| | - Filip Stefaniak
- Laboratory of Bioinformatics and Protein Engineering, International Institute of Molecular and Cell Biology in Warsaw, 4 Ks. Trojdena Str, 02-109 Warsaw, Poland
|
12
|
Wang LS, Sun ZL. iDHS-FFLG: Identifying DNase I Hypersensitive Sites by Feature Fusion and Local-Global Feature Extraction Network. Interdiscip Sci 2023; 15:155-170. [PMID: 36166165 DOI: 10.1007/s12539-022-00538-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2022] [Revised: 09/12/2022] [Accepted: 09/12/2022] [Indexed: 05/01/2023]
Abstract
The DNase I hypersensitive sites (DHSs) are active regions on chromatin that have been found to be highly sensitive to DNase I. These regions contain various cis-regulatory elements, including promoters, enhancers and silencers. Accurate identification of DHSs helps researchers better understand the transcriptional machinery of DNA and deepen the knowledge of functional DNA elements in non-coding sequences. Researchers have developed many methods based on traditional experiments and machine learning to identify DHSs. However, low prediction accuracy and robustness limit their application in genetics research. In this paper, a novel computational approach based on deep learning is proposed by feature fusion and local-global feature extraction network to identify DHSs in mouse, named iDHS-FFLG. First of all, multiple binary features of nucleotides are fused to better express sequence information. Then, a network consisting of the convolutional neural network (CNN), bidirectional long short-term memory (BiLSTM) and self-attention mechanism is designed to extract local features and global contextual associations. In the end, the prediction module is applied to distinguish between DHSs and non-DHSs. The results of several experiments demonstrate the superior performances of iDHS-FFLG compared to the latest methods.
Affiliation(s)
- Lei-Shan Wang
- Information Materials and Intelligent Sensing Laboratory of Anhui Province, Anhui University, Hefei, 230601, Anhui, China
- School of Electrical Engineering and Automation, Anhui University, Hefei, 230601, Anhui, China
| | - Zhan-Li Sun
- Information Materials and Intelligent Sensing Laboratory of Anhui Province, Anhui University, Hefei, 230601, Anhui, China.
- School of Electrical Engineering and Automation, Anhui University, Hefei, 230601, Anhui, China.
|
13
|
Shi Z, Deng R, Yuan Q, Mao Z, Wang R, Li H, Liao X, Ma H. Enzyme Commission Number Prediction and Benchmarking with Hierarchical Dual-core Multitask Learning Framework. RESEARCH (WASHINGTON, D.C.) 2023; 6:0153. [PMID: 37275124 PMCID: PMC10232324 DOI: 10.34133/research.0153] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 03/07/2023] [Accepted: 04/28/2023] [Indexed: 06/07/2023]
Abstract
Enzyme commission (EC) numbers, which associate a protein sequence with the biochemical reactions it catalyzes, are essential for the accurate understanding of enzyme functions and cellular metabolism. Many ab initio computational approaches have been proposed to predict EC numbers for given input protein sequences. However, the prediction performance (accuracy, recall, and precision), usability, and efficiency of existing methods decrease seriously when dealing with recently discovered proteins, leaving much room for improvement. Here, we report HDMLF, a hierarchical dual-core multitask learning framework for accurately predicting EC numbers based on novel deep learning techniques. HDMLF is composed of an embedding core and a learning core; the embedding core adopts the latest protein language model for protein sequence embedding, and the learning core conducts the EC number prediction. Specifically, HDMLF is designed on the basis of a gated recurrent unit framework to perform EC number prediction in a multi-objective, hierarchical, multitasking manner. Additionally, we introduced an attention layer to optimize the EC prediction and employed a greedy strategy to integrate and fine-tune the final model. Comparative analyses against 4 representative methods demonstrate that HDMLF stably delivers the highest performance, improving accuracy and F1 score by 60% and 40% over the state of the art, respectively. An additional case study of tyrB, predicted to compensate for the loss of aspartate aminotransferase aspC as reported in a previous experimental study, shows that our model can also be used to uncover enzyme promiscuity. Finally, we established a web platform, namely, ECRECer (https://ecrecer.biodesign.ac.cn), using an entirely cloud-based serverless architecture and provided an offline bundle to improve usability.
Affiliation(s)
- Zhenkun Shi
- Biodesign Center, Key Laboratory of Engineering Biology for Low-carbon Manufacturing, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, 300308, Tianjin, China
- National Center of Technology Innovation for Synthetic Biology, 300308, Tianjin, China
| | - Rui Deng
- Biodesign Center, Key Laboratory of Engineering Biology for Low-carbon Manufacturing, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, 300308, Tianjin, China
- National Center of Technology Innovation for Synthetic Biology, 300308, Tianjin, China
- College of Biotechnology, Tianjin University of Science & Technology, Tianjin, China
| | - Qianqian Yuan
- Biodesign Center, Key Laboratory of Engineering Biology for Low-carbon Manufacturing, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, 300308, Tianjin, China
- National Center of Technology Innovation for Synthetic Biology, 300308, Tianjin, China
| | - Zhitao Mao
- Biodesign Center, Key Laboratory of Engineering Biology for Low-carbon Manufacturing, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, 300308, Tianjin, China
- National Center of Technology Innovation for Synthetic Biology, 300308, Tianjin, China
| | - Ruoyu Wang
- Biodesign Center, Key Laboratory of Engineering Biology for Low-carbon Manufacturing, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, 300308, Tianjin, China
- National Center of Technology Innovation for Synthetic Biology, 300308, Tianjin, China
| | - Haoran Li
- Biodesign Center, Key Laboratory of Engineering Biology for Low-carbon Manufacturing, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, 300308, Tianjin, China
- National Center of Technology Innovation for Synthetic Biology, 300308, Tianjin, China
| | - Xiaoping Liao
- Biodesign Center, Key Laboratory of Engineering Biology for Low-carbon Manufacturing, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, 300308, Tianjin, China
- National Center of Technology Innovation for Synthetic Biology, 300308, Tianjin, China
- Haihe Laboratory of Synthetic Biology, 300308, Tianjin, China
| | - Hongwu Ma
- Biodesign Center, Key Laboratory of Engineering Biology for Low-carbon Manufacturing, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, 300308, Tianjin, China
- National Center of Technology Innovation for Synthetic Biology, 300308, Tianjin, China
|
14
|
Yu Y, Ding P, Gao H, Liu G, Zhang F, Yu B. Cooperation of local features and global representations by a dual-branch network for transcription factor binding sites prediction. Brief Bioinform 2023; 24:7030619. [PMID: 36748992 DOI: 10.1093/bib/bbad036] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/28/2022] [Revised: 01/03/2023] [Accepted: 01/18/2023] [Indexed: 02/08/2023]
Abstract
Interactions between DNA and transcription factors (TFs) play an essential role in understanding transcriptional regulation mechanisms and gene expression. Due to the large accumulation of training data and low expense, deep learning methods have shown huge potential in determining the specificity of TFs-DNA interactions. Convolutional network-based and self-attention network-based methods have been proposed for transcription factor binding sites (TFBSs) prediction. Convolutional operations extract local features efficiently but tend to ignore global information, while self-attention mechanisms excel at capturing long-distance dependencies but struggle to attend to local feature details. To discover comprehensive features for a given sequence as far as possible, we propose a Dual-branch model combining Self-Attention and Convolution, dubbed DSAC, which fuses local features and global representations in an interactive way. In terms of features, convolution and self-attention contribute to feature extraction collaboratively, enhancing the representation learning. In terms of structure, a lightweight but efficient network architecture is designed for the prediction; in particular, the dual-branch structure allows the convolution and the self-attention mechanism to be fully utilized to improve the predictive ability of our model. The experimental results on 165 ChIP-seq datasets show that DSAC clearly outperforms five other deep learning-based methods and demonstrate that our model can effectively predict TFBSs based on sequence features alone. The source code of DSAC is available at https://github.com/YuBinLab-QUST/DSAC/.
Affiliation(s)
- Yutong Yu
- College of Information Science and Technology, Qingdao University of Science and Technology, China
| | - Pengju Ding
- College of Information Science and Technology, Qingdao University of Science and Technology, China
| | - Hongli Gao
- College of Mathematics and Physics, Qingdao University of Science and Technology, China
| | - Guozhu Liu
- College of Information Science and Technology, Qingdao University of Science and Technology, China
| | - Fa Zhang
- School of Medical Technology, Beijing Institute of Technology, China
| | - Bin Yu
- College of Information Science and Technology, School of Data Science, Qingdao University of Science and Technology, China
|
15
|
Zhu Y, Zhang F, Zhang S, Yi M. Predicting latent lncRNA and cancer metastatic event associations via variational graph auto-encoder. Methods 2023; 211:1-9. [PMID: 36709790 DOI: 10.1016/j.ymeth.2023.01.006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/20/2022] [Revised: 12/05/2022] [Accepted: 01/20/2023] [Indexed: 01/27/2023]
Abstract
Long non-coding RNAs (lncRNAs) have been shown to be closely associated with cancer metastatic events (CMEs, e.g., cancer cell invasion, intravasation, extravasation, proliferation) that collaboratively accelerate malignant cancer spread and cause high mortality rates in patients. Clinical trials may accurately uncover the relationships between lncRNAs and CMEs; however, they are time-consuming and expensive. With the accumulation of data, there is an urgent need to find efficient ways to identify these relationships. Herein, a graph embedding representation-based predictor (VGEA-LCME) for exploring latent lncRNA-CME associations is introduced. In VGEA-LCME, a heterogeneous combined network is constructed by integrating similarity and linkage matrices that can maintain internal and external characteristics of networks, and a variational graph auto-encoder serves as a feature generator to represent an arbitrary lncRNA and CME pair. The final robust prediction is obtained by an ensemble classifier strategy via cross-validation. Experimental comparisons and literature verification show the remarkable performance of VGEA-LCME, although the similarities between CMEs are challenging to calculate. In addition, VGEA-LCME can further identify organ-specific CMEs. To the best of our knowledge, this is the first computational attempt to discover the potential relationships between lncRNAs and CMEs. It may provide support and new insight for guiding experimental research of metastatic cancers. The source code and data are available at https://github.com/zhuyuan-cug/VGAE-LCME.
Affiliation(s)
- Yuan Zhu
- School of Automation, China University of Geosciences, 388 Lumo Road, Hongshan District, 430074, Wuhan, Hubei, China; Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, 388 Lumo Road, Hongshan District, 430074, Wuhan, Hubei, China; Engineering Research Center of Intelligent Technology for Geo-Exploration, 388 Lumo Road, Hongshan District, 430074, Wuhan, Hubei, China
| | - Feng Zhang
- School of Mathematics and Physics, China University of Geosciences, 388 Lumo Road, Hongshan District, 430074, Wuhan, Hubei, China
| | - Shihua Zhang
- College of Life Science and Health, Wuhan University of Science and Technology, 974 Heping Avenue, Qingshan District, 430081, Wuhan, Hubei, China.
| | - Ming Yi
- School of Mathematics and Physics, China University of Geosciences, 388 Lumo Road, Hongshan District, 430074, Wuhan, Hubei, China.
|
16
|
Jubair S, Domaratzki M. Crop genomic selection with deep learning and environmental data: A survey. Front Artif Intell 2023; 5:1040295. [PMID: 36703955 PMCID: PMC9871498 DOI: 10.3389/frai.2022.1040295] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/09/2022] [Accepted: 12/22/2022] [Indexed: 01/12/2023]
Abstract
Machine learning techniques for crop genomic selections, especially for single-environment plants, are well-developed. These machine learning models, which use dense genome-wide markers to predict phenotype, routinely perform well on single-environment datasets, especially for complex traits affected by multiple markers. On the other hand, machine learning models for predicting crop phenotype, especially deep learning models, using datasets that span different environmental conditions, have only recently emerged. Models that can accept heterogeneous data sources, such as temperature, soil conditions and precipitation, are natural choices for modeling GxE in multi-environment prediction. Here, we review emerging deep learning techniques that incorporate environmental data directly into genomic selection models.
Affiliation(s)
- Sheikh Jubair
- Department of Computer Science, University of Manitoba, Winnipeg, MB, Canada
| | - Mike Domaratzki
- Department of Computer Science, University of Western Ontario, London, ON, Canada
|
17
|
Yu Q, Zhang X, Hu Y, Chen S, Yang L. A Method for Predicting DNA Motif Length Based On Deep Learning. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:61-73. [PMID: 35275822 DOI: 10.1109/tcbb.2022.3158471] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
A DNA motif is a sequence pattern shared by the DNA sequence segments that bind to a specific protein. Discovering motifs in a given DNA sequence dataset plays a vital role in studying gene expression regulation. As an important attribute of the DNA motif, the motif length directly affects the quality of the discovered motifs. How to determine the motif length more accurately remains a difficult challenge to be solved. We propose a new motif length prediction scheme named MotifLen by using supervised machine learning. First, a method of constructing sample data for predicting the motif length is proposed. Secondly, a deep learning model for motif length prediction is constructed based on the convolutional neural network. Then, the methods of applying the proposed prediction model based on a motif found by an existing motif discovery algorithm are given. The experimental results show that i) the prediction accuracy of MotifLen is more than 90% on the validation set and is significantly higher than that of the compared methods on real datasets, ii) MotifLen can successfully optimize the motifs found by the existing motif discovery algorithms, and iii) it can effectively improve the time performance of some existing motif discovery algorithms.
|
18
|
Thakur N, Alam MR, Abdul-Ghafar J, Chong Y. Recent Application of Artificial Intelligence in Non-Gynecological Cancer Cytopathology: A Systematic Review. Cancers (Basel) 2022; 14:cancers14143529. [PMID: 35884593 PMCID: PMC9316753 DOI: 10.3390/cancers14143529] [Citation(s) in RCA: 19] [Impact Index Per Article: 9.5] [Received: 05/24/2022] [Revised: 07/12/2022] [Accepted: 07/15/2022] [Indexed: 11/27/2022]
Abstract
Simple Summary Artificial intelligence (AI) has attracted significant interest in the healthcare sector due to its promising results. Cytological examination is a critical step in the initial diagnosis of cancer. Here, we conducted a systematic review with quantitative analysis to understand the current status of AI applications in non-gynecological (non-GYN) cancer cytology. In our analysis, we found that most of the studies focused on classification and segmentation tasks. Overall, AI showed promising results for non-GYN cancer cytopathology analysis. However, the lack of well-annotated, large-scale datasets with Z-stacking and external cross-validation was the major limitation across all studies. Abstract State-of-the-art artificial intelligence (AI) has recently gained considerable interest in the healthcare sector and has provided solutions to problems through automated diagnosis. Cytological examination is a crucial step in the initial diagnosis of cancer, although it shows limited diagnostic efficacy. Recently, AI applications in the processing of cytopathological images have shown promising results despite the elementary level of the technology. Here, we performed a systematic review with a quantitative analysis of recent AI applications in non-gynecological (non-GYN) cancer cytology to understand the current technical status. We searched the major online databases, including MEDLINE, Cochrane Library, and EMBASE, for relevant English articles published from January 2010 to January 2021. The searched query terms were: “artificial intelligence”, “image processing”, “deep learning”, “cytopathology”, and “fine-needle aspiration cytology.” Out of 17,000 studies, only 26 studies (26 models) were included in the full-text review, whereas 13 studies were included for quantitative analysis. 
There were eight classes of AI models, categorized according to target organ: thyroid (n = 11, 39%), urinary bladder (n = 6, 21%), lung (n = 4, 14%), breast (n = 2, 7%), pleural effusion (n = 2, 7%), ovary (n = 1, 4%), pancreas (n = 1, 4%), and prostate (n = 1, 4%). Most of the studies focused on classification and segmentation tasks. Although most of the studies showed impressive results, the sizes of the training and validation datasets were limited. Overall, AI is promising for non-GYN cancer cytopathology analysis, as it is for pathology or gynecological cytology. However, the lack of well-annotated, large-scale datasets with Z-stacking and external cross-validation was the major limitation found across all studies. Future studies with larger datasets with high-quality annotations and external validation are required.
|
19
|
In Silico Investigation of Some Compounds from the N-Butanol Extract of Centaurea tougourensis Boiss. & Reut. CRYSTALS 2022. [DOI: 10.3390/cryst12030355] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/01/2023]
Abstract
Bioinformatics, as a newly emerging discipline, is nowadays considered a reference for characterizing the physicochemical and pharmacological properties of the biocompounds contained in plants, which has greatly helped the pharmaceutical industry in the drug development process. In this study, an in silico bioinformatics approach was used to predict, for the first time, the physicochemical properties, ADMET profile, pharmacological capacities, cytotoxicity, and nervous system macromolecular targets, as well as the gene expression profiles, of four compounds recently identified from Centaurea tougourensis via the gas chromatography–mass spectrometry (GC–MS) approach. Thus, four compounds were tested from the n-butanol (n-BuOH) extract of this plant, named, respectively, Acridin-9-amine, 1,2,3,4-tetrahydro-5,7-dimethyl- (compound 1), 3-[2,3-Dihydro-2,2-dimethylbenzofuran-7-yl]-5-methoxy-1,3,4-oxadiazol-2(3H)-one (compound 2), 9,9-Dimethoxybicyclo[3.3.1]nona-2,4-dione (compound 3), and 3-[3-Bromophenyl]-7-chloro-3,4-dihydro-10-hydroxy-1,9(2H,10H)-acridinedione (compound 4). The in silico investigation revealed that the four tested compounds could be good candidates to regulate the expression of key genes and may also exert significant cytotoxic effects against several tumor cell lines. In addition, these compounds could also be effective in the treatment of some diseases related to diabetes, skin pathologies, cardiovascular, and central nervous system disorders. Bioactive plant compounds remain the best alternative in the context of the drug discovery and development process.
|
20
|
ACPNet: A Deep Learning Network to Identify Anticancer Peptides by Hybrid Sequence Information. Molecules 2022; 27:molecules27051544. [PMID: 35268644 PMCID: PMC8912097 DOI: 10.3390/molecules27051544] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/03/2022] [Revised: 02/20/2022] [Accepted: 02/23/2022] [Indexed: 12/18/2022]
Abstract
Cancer is one of the most dangerous threats to human health. One of the issues is drug resistance action, which leads to side effects after drug treatment. Numerous therapies have endeavored to relieve the drug resistance action. Recently, anticancer peptides could be a novel and promising anticancer candidate, which can inhibit tumor cell proliferation, migration, and suppress the formation of tumor blood vessels, with fewer side effects. However, it is costly, laborious and time consuming to identify anticancer peptides by biological experiments with a high throughput. Therefore, accurately identifying anti-cancer peptides becomes a key and indispensable step for anticancer peptides therapy. Although some existing computer methods have been developed to predict anticancer peptides, the accuracy still needs to be improved. Thus, in this study, we propose a deep learning-based model, called ACPNet, to distinguish anticancer peptides from non-anticancer peptides (non-ACPs). ACPNet employs three different types of peptide sequence information, peptide physicochemical properties and auto-encoding features linking the training process. ACPNet is a hybrid deep learning network, which fuses fully connected networks and recurrent neural networks. The comparison with other existing methods on ACPs82 datasets shows that ACPNet not only achieves the improvement of 1.2% Accuracy, 2.0% F1-score, and 7.2% Recall, but also gets balanced performance on the Matthews correlation coefficient. Meanwhile, ACPNet is verified on an independent dataset, with 20 proven anticancer peptides, and only one anticancer peptide is predicted as non-ACPs. The comparison and independent validation experiment indicate that ACPNet can accurately distinguish anticancer peptides from non-ACPs.
|
21
|
Li Z, Fang J, Wang S, Zhang L, Chen Y, Pian C. Adapt-Kcr: a novel deep learning framework for accurate prediction of lysine crotonylation sites based on learning embedding features and attention architecture. Brief Bioinform 2022; 23:6533505. [PMID: 35189635 DOI: 10.1093/bib/bbac037] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Received: 01/01/2022] [Revised: 01/18/2022] [Accepted: 01/25/2022] [Indexed: 01/20/2023]
Abstract
Protein lysine crotonylation (Kcr) is an important type of posttranslational modification that is associated with a wide range of biological processes. The identification of Kcr sites is critical to better understanding their functional mechanisms. However, the existing experimental techniques for detecting Kcr sites are cost-ineffective, leading to a great need for new computational methods to address this problem. We here describe Adapt-Kcr, an advanced deep learning model that utilizes adaptive embedding and is based on a convolutional neural network together with a bidirectional long short-term memory network and attention architecture. On the independent testing set, Adapt-Kcr outperformed the current state-of-the-art Kcr prediction model, with an improvement of 3.2% in accuracy and 1.9% in the area under the receiver operating characteristic curve. Compared to other Kcr models, Adapt-Kcr additionally had a more robust ability to distinguish between crotonylation and other lysine modifications. Another model (Adapt-ST) was trained to predict phosphorylation sites in SARS-CoV-2, and outperformed the equivalent state-of-the-art phosphorylation site prediction model. These results indicate that self-adaptive embedding features perform better than handcrafted features in capturing discriminative information; when used in attention architecture, this could be an effective way of identifying protein Kcr sites. Together, our Adapt framework (including learning embedding features and attention architecture) has a strong potential for prediction of other protein posttranslational modification sites.
Affiliation(s)
- Zutan Li
- College of Agriculture, Nanjing Agricultural University, Nanjing, Jiangsu, China
| | - Jingya Fang
- College of Agriculture, Nanjing Agricultural University, Nanjing, Jiangsu, China
| | - Shining Wang
- Department of Mathematics, College of Science, Nanjing Agricultural University, China
| | - Liangyun Zhang
- Department of Mathematics, College of Science, Nanjing Agricultural University, China
| | - Yuanyuan Chen
- Department of Mathematics, College of Science, Nanjing Agricultural University, China
| | - Cong Pian
- Department of Mathematics, College of Science, Nanjing Agricultural University, China; The State Key Laboratory of Translational Medicine and Innovative Drug Development, Jiangsu Simcere Diagnostics Co., Ltd., Nanjing, China
|
22
|
Chen X, Du Z, Guo T, Wu J, Wang B, Wei Z, Jia L, Kang K. Effects of heavy metals stress on chicken manures composting via the perspective of microbial community feedback. ENVIRONMENTAL POLLUTION (BARKING, ESSEX : 1987) 2022; 294:118624. [PMID: 34864104 DOI: 10.1016/j.envpol.2021.118624] [Citation(s) in RCA: 29] [Impact Index Per Article: 14.5] [Received: 08/02/2021] [Revised: 11/10/2021] [Accepted: 12/01/2021] [Indexed: 06/13/2023]
Abstract
Heavy metal pollution is the main risk during livestock manure composting, in which microorganisms play a vital role. However, the response strategies of the microbial community to heavy metals stress (HMS) remain largely unclear. Therefore, the objective of this study was to reveal the ecological adaptation and counter-effect of the bacterial community under HMS during chicken manure composting, and to evaluate the environmental implications of HMS on composting. The degradation of organic matter (more than 6.4%) and carbohydrate (more than 19.8%) was enhanced under intense HMS, suggesting that microorganisms could quickly adapt to the HMS to ensure smooth composting. Meanwhile, HMS increased keystone nodes and strengthened significant positive correlations between genera (p < 0.05), indicating that bacteria resisted HMS through cooperation during composting. In addition, different bacterial groups performed various functions to cope with HMS. Specific bacterial groups responded to HMS, and certain groups regulated bacterial networks. Therefore, the bacterial community had extraordinary potential to deal with HMS and guarantee chicken manure composting even in the presence of high concentrations of heavy metals.
Affiliation(s)
- Xiaomeng Chen
- College of Life Science, Northeast Agricultural University, Harbin, 150030, China
| | - Zhuang Du
- College of Life Science, Northeast Agricultural University, Harbin, 150030, China
| | - Tong Guo
- College of Life Science, Northeast Agricultural University, Harbin, 150030, China
| | - Junqiu Wu
- College of Life Science, Northeast Agricultural University, Harbin, 150030, China
| | - Bo Wang
- College of Life Science, Northeast Agricultural University, Harbin, 150030, China
| | - Zimin Wei
- College of Life Science, Northeast Agricultural University, Harbin, 150030, China.
| | - Liming Jia
- Heilongjiang Province Environmental Monitoring Centre, Harbin, 150056, China
| | - Kejia Kang
- Heilongjiang Province Environmental Science Research Institute, Harbin, 150056, China
|
23
|
Pan G, Sun C, Liao Z, Tang J. Machine and Deep Learning for Prediction of Subcellular Localization. Methods Mol Biol 2022; 2361:249-261. [PMID: 34236666 DOI: 10.1007/978-1-0716-1641-3_15] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 02/07/2023]
Abstract
Protein subcellular localization prediction (PSLP), which plays an important role in the field of computational biology, identifies the position and function of proteins in cells without expensive cost and laborious effort. In the past few decades, various methods with different algorithms have been proposed for solving the problem of subcellular localization prediction; machine learning and deep learning constitute a large portion of those proposed methods. In order to provide an overview of those methods, the first part of this article is a brief review of several state-of-the-art machine learning methods for subcellular localization prediction; then the materials used for subcellular localization prediction are described and a simple prediction method, which takes protein sequences as input and utilizes a convolutional neural network as the classifier, is introduced. Finally, a list of notes is provided to indicate the major problems that may occur with this method.
Affiliation(s)
- Gaofeng Pan
- Department of Computer Science and Engineering, University of South Carolina, Columbia, SC, USA
| | - Chao Sun
- Department of Computer Science and Engineering, University of South Carolina, Columbia, SC, USA
| | - Zijun Liao
- Department of Computer Science and Engineering, University of South Carolina, Columbia, SC, USA
- Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Fujian Medical University, Fujian, China
| | - Jijun Tang
- Department of Computer Science and Engineering, University of South Carolina, Columbia, SC, USA
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
| |
|
24
|
Liu Z, Ren Z, Yan L, Li F. DeepLRR: An Online Webserver for Leucine-Rich-Repeat Containing Protein Characterization Based on Deep Learning. PLANTS (BASEL, SWITZERLAND) 2022; 11:plants11010136. [PMID: 35009139 PMCID: PMC8796025 DOI: 10.3390/plants11010136] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 11/29/2021] [Revised: 12/31/2021] [Accepted: 01/01/2022] [Indexed: 05/26/2023]
Abstract
Members of the leucine-rich repeat (LRR) superfamily play critical roles in multiple biological processes. As the LRR unit sequence is highly variable, accurately predicting the number and location of LRR units in proteins is a highly challenging task in bioinformatics. Existing methods, especially similarity-based ones, still need improvement. We introduce DeepLRR, a method based on a convolutional neural network (CNN) model and LRR features, to predict the number and location of LRR units in proteins. We compared DeepLRR with six existing methods using a dataset containing 572 LRR proteins, and it outperformed all of them in overall F1 score. In addition, DeepLRR integrates the identification of plant disease-resistance proteins (NLR, LRR-RLK, LRR-RLP) and non-canonical domains. With DeepLRR, 223, 191 and 183 LRR-RLK genes in the Arabidopsis (Arabidopsis thaliana), rice (Oryza sativa ssp. Japonica) and tomato (Solanum lycopersicum) genomes were re-annotated, respectively. Chromosome mapping and gene cluster analysis revealed that 24.2% (54/223), 29.8% (57/191) and 16.9% (31/183) of LRR-RLK genes formed gene cluster structures in Arabidopsis, rice and tomato, respectively. Finally, we explored the evolutionary relationship and domain composition of LRR-RLK genes in each plant and the distributions of known receptor and co-receptor pairs. This provides a new perspective for the identification of potential receptors and co-receptors.
Affiliation(s)
- Zhenya Liu
- Key Lab of Horticultural Plant Biology (MOE), College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan 430070, China
| | - Zirui Ren
- College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
| | - Lunyi Yan
- College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
| | - Feng Li
- Key Lab of Horticultural Plant Biology (MOE), College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan 430070, China
| |
|
25
|
Cadet XF, Gelly JC, van Noord A, Cadet F, Acevedo-Rocha CG. Learning Strategies in Protein Directed Evolution. Methods Mol Biol 2022; 2461:225-275. [PMID: 35727454 DOI: 10.1007/978-1-0716-2152-3_15] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 12/13/2022]
Abstract
Synthetic biology is a fast-evolving research field that combines biology and engineering principles to develop new biological systems for medical, pharmacological, and industrial applications. Synthetic biologists use iterative "design, build, test, and learn" cycles to efficiently engineer genetic systems that are reliable, reproducible, and predictable. Protein engineering by directed evolution can benefit from such a systematic engineering approach for various reasons. Learning can be carried out before starting, throughout or after finalizing a directed evolution project. Computational tools, bioinformatics, and scanning mutagenesis methods can be excellent starting points, while molecular dynamics simulations and other strategies can guide engineering efforts. Similarly, studying protein intermediates along evolutionary pathways offers fascinating insights into the molecular mechanisms shaped by evolution. The learning step of the cycle is not only crucial for proteins or enzymes that are not suitable for high-throughput screening or selection systems, but it is also valuable for any platform that can generate a large amount of data that can be aided by machine learning algorithms. The main challenge in protein engineering is to predict the effect of a single mutation on one functional parameter, to say nothing of several mutations on multiple parameters. This is largely due to nonadditive mutational interactions, known as epistatic effects: beneficial mutations present in one genetic background may not be beneficial in another genetic background. In this work, we provide an overview of experimental and computational strategies that can guide the user to learn protein function at different stages in a directed evolution project. We also discuss how epistatic effects can influence the success of directed evolution projects.
Since machine learning is gaining momentum in protein engineering and the field is becoming more interdisciplinary thanks to collaboration between mathematicians, computational scientists, engineers, molecular biologists, and chemists, we provide a general workflow that familiarizes nonexperts with the basic concepts, dataset requirements, learning approaches, model capabilities and performance metrics of this intriguing area. Finally, we also provide some practical recommendations on how machine learning can harness epistatic effects for engineering proteins in an "outside-the-box" way.
Affiliation(s)
- Xavier F Cadet
- PEACCEL, Artificial Intelligence Department, Paris, France
| | - Jean Christophe Gelly
- Laboratoire d'Excellence GR-Ex, Paris, France
- BIGR, DSIMB, UMR_S1134, INSERM, University of Paris & University of Reunion, Paris, France
| | | | - Frédéric Cadet
- Laboratoire d'Excellence GR-Ex, Paris, France
- BIGR, DSIMB, UMR_S1134, INSERM, University of Paris & University of Reunion, Paris, France
| | | |
|
26
|
Mavaie P, Holder L, Beck D, Skinner MK. Predicting environmentally responsive transgenerational differential DNA methylated regions (epimutations) in the genome using a hybrid deep-machine learning approach. BMC Bioinformatics 2021; 22:575. [PMID: 34847877 PMCID: PMC8630850 DOI: 10.1186/s12859-021-04491-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 08/23/2021] [Accepted: 11/18/2021] [Indexed: 11/24/2022]
Abstract
BACKGROUND Deep learning is an active bioinformatics artificial intelligence field that is useful in solving many biological problems, including predicting altered epigenetics such as DNA methylation regions. Deep learning (DL) can learn an informative representation that addresses the need for defining relevant features. However, deep learning models are computationally expensive, and they require large training datasets to achieve good classification performance. RESULTS One approach to addressing these challenges is to use a less complex deep learning network for feature selection and Machine Learning (ML) for classification. In the current study, we introduce a hybrid DL-ML approach that uses a deep neural network for extracting molecular features and a non-DL classifier to predict environmentally responsive transgenerational differential DNA methylated regions (DMRs), termed epimutations, based on the extracted DL-based features. Various environmental toxicant-induced epigenetic transgenerational inheritance sperm epimutations were used to train the model on the rat genome DNA sequence, and the trained model was then used to predict transgenerational DMRs (epimutations) across the entire genome. CONCLUSION The approach was also used to predict potential DMRs in the human genome. Experimental results show that the hybrid DL-ML approach outperforms deep learning and traditional machine learning methods.
Affiliation(s)
- Pegah Mavaie
- School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA, 99164-2752, USA
| | - Lawrence Holder
- School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA, 99164-2752, USA.
| | - Daniel Beck
- Center for Reproductive Biology, School of Biological Sciences, Washington State University, Pullman, WA, 99164-4236, USA
| | - Michael K Skinner
- Center for Reproductive Biology, School of Biological Sciences, Washington State University, Pullman, WA, 99164-4236, USA.
| |
|
27
|
Scalzitti N, Kress A, Orhand R, Weber T, Moulinier L, Jeannin-Girardon A, Collet P, Poch O, Thompson JD. Spliceator: multi-species splice site prediction using convolutional neural networks. BMC Bioinformatics 2021; 22:561. [PMID: 34814826 PMCID: PMC8609763 DOI: 10.1186/s12859-021-04471-3] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Received: 07/04/2021] [Accepted: 11/09/2021] [Indexed: 12/14/2022]
Abstract
Background Ab initio prediction of splice sites is an essential step in eukaryotic genome annotation. Recent predictors have exploited Deep Learning algorithms and reliable gene structures from model organisms. However, Deep Learning methods for non-model organisms are lacking. Results We developed Spliceator to predict splice sites in a wide range of species, including model and non-model organisms. Spliceator uses a convolutional neural network and is trained on carefully validated data from over 100 organisms. We show that Spliceator achieves consistently high accuracy (89–92%) compared to existing methods on independent benchmarks from human, fish, fly, worm, plant and protist organisms. Conclusions Spliceator is a new Deep Learning method trained on high-quality data, which can be used to predict splice sites in diverse organisms, ranging from human to protists, with consistently high accuracy. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04471-3.
Affiliation(s)
- Nicolas Scalzitti
- Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France
| | - Arnaud Kress
- Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France
- BiGEst-ICube Platform, ICube Laboratory, UMR7357, 1 rue Eugène Boeckel, 67000, Strasbourg, France
| | - Romain Orhand
- Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France
| | - Thomas Weber
- Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France
| | - Luc Moulinier
- Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France
- BiGEst-ICube Platform, ICube Laboratory, UMR7357, 1 rue Eugène Boeckel, 67000, Strasbourg, France
| | - Anne Jeannin-Girardon
- Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France
| | - Pierre Collet
- Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France
| | - Olivier Poch
- Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France
| | - Julie D Thompson
- Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France.
| |
|
28
|
Zhang Y, Liu Y, Xu J, Wang X, Peng X, Song J, Yu DJ. Leveraging the attention mechanism to improve the identification of DNA N6-methyladenine sites. Brief Bioinform 2021; 22:bbab351. [PMID: 34459479 PMCID: PMC8575024 DOI: 10.1093/bib/bbab351] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Received: 06/07/2021] [Revised: 08/02/2021] [Accepted: 08/09/2021] [Indexed: 11/12/2022]
Abstract
DNA N6-methyladenine is an important type of DNA modification that plays important roles in multiple biological processes. Despite the recent progress in developing DNA 6mA site prediction methods, several challenges remain to be addressed. For example, although the hand-crafted features are interpretable, they contain redundant information that may bias the model training and have a negative impact on the trained model. Furthermore, although deep learning (DL)-based models can perform feature extraction and classification automatically, they lack the interpretability of the crucial features learned by those models. As such, considerable research efforts have been focused on achieving the trade-off between the interpretability and straightforwardness of DL neural networks. In this study, we develop two new DL-based models for improving the prediction of N6-methyladenine sites, termed LA6mA and AL6mA, which use bidirectional long short-term memory to respectively capture the long-range information and self-attention mechanism to extract the key position information from DNA sequences. The performance of the two proposed methods is benchmarked and evaluated on the two model organisms Arabidopsis thaliana and Drosophila melanogaster. On the two benchmark datasets, LA6mA achieves an area under the receiver operating characteristic curve (AUROC) value of 0.962 and 0.966, whereas AL6mA achieves an AUROC value of 0.945 and 0.941, respectively. Moreover, an in-depth analysis of the attention matrix is conducted to interpret the important information, which is hidden in the sequence and relevant for 6mA site prediction. The two novel pipelines developed for DNA 6mA site prediction in this work will facilitate a better understanding of the underlying principle of DL-based DNA methylation site prediction and its future applications.
Affiliation(s)
- Ying Zhang
- School of Computer Science and Engineering at Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China
| | - Yan Liu
- School of Computer Science and Engineering at Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China
| | - Jian Xu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China
| | - Xiaoyu Wang
- Monash Biomedicine Discovery Institute and the Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Australia
| | - Xinxin Peng
- Monash Biomedicine Discovery Institute and the Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Australia
| | - Jiangning Song
- Monash Biomedicine Discovery Institute and the Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Australia
| | - Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China
| |
|
29
|
Li F, Dong S, Leier A, Han M, Guo X, Xu J, Wang X, Pan S, Jia C, Zhang Y, Webb GI, Coin LJM, Li C, Song J. Positive-unlabeled learning in bioinformatics and computational biology: a brief review. Brief Bioinform 2021; 23:6415313. [PMID: 34729589 DOI: 10.1093/bib/bbab461] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Received: 08/30/2021] [Revised: 09/27/2021] [Accepted: 10/07/2021] [Indexed: 12/14/2022]
Abstract
Conventional supervised binary classification algorithms have been widely applied to address significant research questions using biological and biomedical data. This classification scheme requires two fully labeled classes of data (e.g. positive and negative samples) to train a classification model. However, in many bioinformatics applications, labeling data is laborious, and the negative samples might be mislabeled due to the limited sensitivity of the experimental equipment. The positive unlabeled (PU) learning scheme was therefore proposed to enable the classifier to learn directly from limited positive samples and a large number of unlabeled samples (i.e. a mixture of positive and negative samples). To date, several PU learning algorithms have been developed to address various biological questions, such as sequence identification, functional site characterization and interaction prediction. In this paper, we revisit a collection of 29 state-of-the-art PU learning bioinformatic applications to address various biological questions. Various important aspects are extensively discussed, including PU learning methodology, biological application, classifier design and evaluation strategy. We also comment on the existing issues of PU learning and offer our perspectives for the future development of PU learning applications. We anticipate that our work serves as an instrumental guideline for a better understanding of the PU learning framework in bioinformatics and further developing next-generation PU learning frameworks for critical biological applications.
Affiliation(s)
- Fuyi Li
- Monash University, Australia
| | | | - André Leier
- Department of Genetics, UAB School of Medicine, USA
| | - Meiya Han
- Department of Biochemistry and Molecular Biology, Monash University, Australia
| | | | - Jing Xu
- Computer Science and Technology from Nankai University, China
| | - Xiaoyu Wang
- Department of Biochemistry and Molecular Biology and Biomedicine Discovery Institute, Monash University, Australia
| | - Shirui Pan
- University of Technology Sydney (UTS), Ultimo, NSW, Australia
| | - Cangzhi Jia
- College of Science, Dalian Maritime University, China
| | - Yang Zhang
- Northwestern Polytechnical University, China
| | - Geoffrey I Webb
- Faculty of Information Technology at Monash University, Australia
| | - Lachlan J M Coin
- Department of Clinical Pathology, University of Melbourne, Australia
| | - Chen Li
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Australia
| | - Jiangning Song
- Monash Biomedicine Discovery Institute, Monash University, Melbourne, Australia
| |
|
30
|
Liao Z, Pan G, Sun C, Tang J. Predicting subcellular location of protein with evolution information and sequence-based deep learning. BMC Bioinformatics 2021; 22:515. [PMID: 34686152 PMCID: PMC8539821 DOI: 10.1186/s12859-021-04404-0] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 09/20/2021] [Accepted: 09/24/2021] [Indexed: 12/31/2022]
Abstract
BACKGROUND Protein subcellular localization prediction plays an important role in biology research. Since traditional methods are laborious and time-consuming, many machine learning-based prediction methods have been proposed. However, most of the proposed methods ignore the evolution information of proteins. To improve prediction accuracy, we present a deep learning-based method to predict protein subcellular locations. RESULTS Our method uses not only amino acid sequence composition but also evolution matrices of proteins. It combines a bidirectional long short-term memory network that processes the entire protein sequence with a convolutional neural network that extracts features from protein sequences. The position-specific scoring matrix is used as a supplement to protein sequences. Our method was trained and tested on two benchmark datasets. The experimental results show that our method yields accurate results on the two datasets, with an average precision of 0.7901, ranking loss of 0.0758 and coverage of 1.2848. CONCLUSION The experimental results show that our method outperforms five currently available methods. According to those experiments, our method is an acceptable alternative for predicting protein subcellular location.
Affiliation(s)
- Zhijun Liao
- Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Fujian Medical University, 1 Xuefu North Road, University Town, Fuzhou, 350122 FJ China
- Department of Computer Science and Engineering, University of South Carolina, 550 Assembly St, Columbia, SC 29208 USA
| | - Gaofeng Pan
- Department of Computer Science and Engineering, University of South Carolina, 550 Assembly St, Columbia, SC 29208 USA
| | - Chao Sun
- Department of Computer Science and Engineering, University of South Carolina, 550 Assembly St, Columbia, SC 29208 USA
| | - Jijun Tang
- Department of Computer Science and Engineering, University of South Carolina, 550 Assembly St, Columbia, SC 29208 USA
- College of Electrical and Power Engineering, Taiyuan University of Technology, No. 79 Yinze West Street, Taiyuan, 030024 SX China
| |
|
31
|
AoP-LSE: Antioxidant Proteins Classification Using Deep Latent Space Encoding of Sequence Features. Curr Issues Mol Biol 2021; 43:1489-1501. [PMID: 34698113 PMCID: PMC8928959 DOI: 10.3390/cimb43030105] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/09/2021] [Revised: 09/28/2021] [Accepted: 09/29/2021] [Indexed: 11/16/2022]
Abstract
It is of utmost importance to develop a computational method for accurate prediction of antioxidants, as they play a vital role in the prevention of several diseases caused by oxidative stress. In this correspondence, we present an effective computational methodology based on the notion of deep latent space encoding. A deep neural network classifier fused with an auto-encoder learns class labels in a pruned latent space. This strategy eliminates the need to develop the classifier and the feature selection model separately, allowing the standalone model to effectively harness the discriminating feature space and make improved predictions. A thorough analytical study is presented along with PCA/tSNE visualization and PCA-GCNR scores to show the discriminating power of the proposed method. The proposed method showed a high MCC value of 0.43 and a balanced accuracy of 76.2%, which is superior to the existing models. The model was evaluated on an independent dataset, on which it outperformed the contemporary methods by correctly identifying the novel proteins with an accuracy of 95%.
|
32
|
Thafar MA, Olayan RS, Albaradei S, Bajic VB, Gojobori T, Essack M, Gao X. DTi2Vec: Drug-target interaction prediction using network embedding and ensemble learning. J Cheminform 2021; 13:71. [PMID: 34551818 PMCID: PMC8459562 DOI: 10.1186/s13321-021-00552-w] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Received: 12/24/2020] [Accepted: 09/05/2021] [Indexed: 11/21/2022]
Abstract
Drug-target interaction (DTI) prediction is a crucial step in drug discovery and repositioning as it reduces experimental validation costs if done right. Thus, developing in-silico methods to predict potential DTI has become a competitive research niche, with one of its main focuses being improving the prediction accuracy. Using machine learning (ML) models for this task, specifically network-based approaches, is effective and has shown great advantages over the other computational methods. However, ML model development involves upstream hand-crafted feature extraction and other processes that impact prediction accuracy. Thus, network-based representation learning techniques that provide automated feature extraction combined with traditional ML classifiers dealing with downstream link prediction tasks may be better-suited paradigms. Here, we present such a method, DTi2Vec, which identifies DTIs using network representation learning and ensemble learning techniques. DTi2Vec constructs the heterogeneous network, and then it automatically generates features for each drug and target using the nodes embedding technique. DTi2Vec demonstrated its ability in drug-target link prediction compared to several state-of-the-art network-based methods, using four benchmark datasets and large-scale data compiled from DrugBank. DTi2Vec showed a statistically significant increase in the prediction performances in terms of AUPR. We verified the "novel" predicted DTIs using several databases and scientific literature. DTi2Vec is a simple yet effective method that provides high DTI prediction performance while being scalable and efficient in computation, translating into a powerful drug repositioning tool.
Affiliation(s)
- Maha A Thafar
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Kingdom of Saudi Arabia
- College of Computers and Information Technology, Computer Science Department, Taif University, Taif, Kingdom of Saudi Arabia
| | - Rawan S Olayan
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
| | - Somayah Albaradei
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Kingdom of Saudi Arabia
- Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Kingdom of Saudi Arabia
| | - Vladimir B Bajic
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Kingdom of Saudi Arabia
| | - Takashi Gojobori
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Kingdom of Saudi Arabia
| | - Magbubah Essack
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Kingdom of Saudi Arabia
| | - Xin Gao
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Kingdom of Saudi Arabia
| |
|
33
|
Li H, Zhou J, Zhou Y, Chen Q, She Y, Gao F, Xu Y, Chen J, Gao X. An Interpretable Computer-Aided Diagnosis Method for Periodontitis From Panoramic Radiographs. Front Physiol 2021; 12:655556. [PMID: 34239448 PMCID: PMC8258157 DOI: 10.3389/fphys.2021.655556] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 01/19/2021] [Accepted: 05/31/2021] [Indexed: 12/02/2022]
Abstract
Periodontitis is a prevalent and irreversible chronic inflammatory disease in both developed and developing countries, affecting about 20–50% of the global population. A tool for automatically diagnosing periodontitis is in high demand for screening at-risk people, and early detection could prevent the onset of tooth loss, especially in local communities and health care settings with limited dental professionals. In the medical field, doctors need to understand and trust the decisions made by computational models, so developing interpretable models is crucial for disease diagnosis. Based on these considerations, we propose an interpretable method called Deetal-Perio to predict the severity degree of periodontitis in dental panoramic radiographs. In our method, alveolar bone loss (ABL), the clinical hallmark for periodontitis diagnosis, can be interpreted as the key feature. To calculate ABL, we also propose a method for teeth numbering and segmentation. First, Deetal-Perio segments and indexes each individual tooth via Mask R-CNN combined with a novel calibration method. Next, Deetal-Perio segments the contour of the alveolar bone and calculates a ratio for each individual tooth to represent ABL. Finally, Deetal-Perio predicts the severity degree of periodontitis given the ratios of all the teeth. The Macro F1-score and accuracy of the periodontitis prediction task in our method reach 0.894 and 0.896, respectively, on the Suzhou data set, and 0.820 and 0.824, respectively, on the Zhongshan data set. The entire architecture not only outperforms state-of-the-art methods and shows robustness on two data sets in both the periodontitis prediction and the teeth numbering and segmentation tasks, but is also interpretable, allowing doctors to understand why Deetal-Perio works so well.
Affiliation(s)
- Haoyang Li
- Cancer Systems Biology Center, The China-Japan Union Hospital, Jilin University, Changchun, China
- Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
- College of Computer Science and Technology, Jilin University, Changchun, China
| | - Juexiao Zhou
- Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
- Department of Biology, Southern University of Science and Technology, Shenzhen, China
| | - Yi Zhou
- Department of Biochemistry and Molecular Biology and Institute of Bioinformatics, University of Georgia, Athens, GA, United States
| | - Qiang Chen
- The Affiliated Stomatological Hospital of Soochow University, Soochow, China
| | - Yangyang She
- Department of Stomatology, The Sixth Affiliated Hospital, Sun Yat-sen University, Guangzhou, China
| | - Feng Gao
- Department of Colorectal Surgery, The Sixth Affiliated Hospital, Sun Yat-sen University, Guangzhou, China
| | - Ying Xu
- Cancer Systems Biology Center, The China-Japan Union Hospital, Jilin University, Changchun, China
- Department of Biochemistry and Molecular Biology and Institute of Bioinformatics, University of Georgia, Athens, GA, United States
| | - Jieyu Chen
- Department of Stomatology, The Sixth Affiliated Hospital, Sun Yat-sen University, Guangzhou, China
| | - Xin Gao
- Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
| |
|
34
|
Orozco-Arias S, Candamil-Cortés MS, Jaimes PA, Piña JS, Tabares-Soto R, Guyot R, Isaza G. K-mer-based machine learning method to classify LTR-retrotransposons in plant genomes. PeerJ 2021; 9:e11456. [PMID: 34055489 PMCID: PMC8140598 DOI: 10.7717/peerj.11456] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 02/17/2021] [Accepted: 04/24/2021] [Indexed: 12/15/2022] Open
Abstract
Every day, more plant genomes become available in public databases, and additional massive sequencing projects (i.e., projects that aim to sequence thousands of individuals) are formulated and released. Nevertheless, there are not enough automatic tools to analyze this large amount of genomic information. LTR retrotransposons are the most frequent repetitive sequences in plant genomes; however, their detection and classification are commonly performed using semi-automatic and time-consuming programs. Despite the availability of several bioinformatic tools that follow different approaches to detect and classify them, none of these tools can individually obtain accurate results. Here, we used Machine Learning algorithms based on k-mer counts to distinguish LTR retrotransposons from other genomic sequences and to classify them into lineages/families with an F1-Score of 95%, contributing to the development of an alignment-free and automatic method to analyze these sequences.
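To make the idea of k-mer count features concrete, here is a minimal sketch in plain Python that turns a DNA sequence into a fixed-length count vector suitable as input to a classifier. The function name, the choice of the ACGT alphabet, and the decision to skip k-mers containing other symbols (such as N) are assumptions for illustration; the abstract does not specify how the authors built their features.

```python
from collections import Counter
from itertools import product


def kmer_count_vector(sequence, k):
    """Count overlapping k-mers and return the counts in a fixed order.

    The vector has 4**k entries, one per k-mer over A, C, G, T in
    lexicographic order. K-mers containing any other symbol are ignored
    (an assumption made for this sketch).
    """
    sequence = sequence.upper()
    counts = Counter(
        sequence[i:i + k] for i in range(len(sequence) - k + 1)
    )
    return [counts.get("".join(kmer), 0) for kmer in product("ACGT", repeat=k)]


# Toy sequence: the overlapping 2-mers are AC, CG, GT, TA, AC.
vector = kmer_count_vector("ACGTAC", 2)
print(len(vector), sum(vector))
```

Because every sequence maps to a vector of the same length regardless of its own length, no alignment step is needed before classification, which is what makes the approach alignment-free.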
Affiliation(s)
- Simon Orozco-Arias
- Department of Computer Science, Universidad Autónoma de Manizales, Manizales, Caldas, Colombia.,Department of Systems and Informatics, Universidad de Caldas, Manizales, Caldas, Colombia
| | | | - Paula A Jaimes
- Department of Computer Science, Universidad Autónoma de Manizales, Manizales, Caldas, Colombia
| | - Johan S Piña
- Department of Computer Science, Universidad Autónoma de Manizales, Manizales, Caldas, Colombia
| | - Reinel Tabares-Soto
- Department of Electronics and Automation, Universidad Autónoma de Manizales, Manizales, Caldas, Colombia
| | - Romain Guyot
- Department of Electronics and Automation, Universidad Autónoma de Manizales, Manizales, Caldas, Colombia.,Institut de Recherche pour le Développement, CIRAD, Univ. Montpellier, Montpellier, France
| | - Gustavo Isaza
- Department of Systems and Informatics, Universidad de Caldas, Manizales, Caldas, Colombia
| |
|
35
|
Zhong Q, Zhu Y, Cai D, Xiao L, Zhang H. Electroencephalogram Access for Emotion Recognition Based on a Deep Hybrid Network. Front Hum Neurosci 2021; 14:589001. [PMID: 33390918 PMCID: PMC7772146 DOI: 10.3389/fnhum.2020.589001] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 08/06/2020] [Accepted: 11/26/2020] [Indexed: 11/13/2022] Open
Abstract
In human-computer interaction (HCI), electroencephalogram (EEG) access for automatic emotion recognition is an effective way for robot brains to perceive human behavior. To improve the accuracy of emotion recognition, this paper proposes a method of EEG access for emotion recognition based on a deep hybrid network. First, the collected EEG was decomposed into four frequency band signals, and the multiscale sample entropy (MSE) features of each frequency band were extracted. Second, the constructed 3D MSE feature matrices were fed into a deep hybrid network for autonomous learning. The deep hybrid network was composed of a continuous convolutional neural network (CNN) and hidden Markov models (HMMs). Last, HMMs trained with multiple observation sequences were used to replace the artificial neural network classifier in the CNN, and the emotion recognition task was completed by the HMM classifiers. The proposed method was applied to the DEAP dataset for emotion recognition experiments, and the average accuracy reached 79.77% on arousal, 83.09% on valence, and 81.83% on dominance. Compared with the latest related methods, the accuracy was improved by 0.99% on valence and 14.58% on dominance, which verified the effectiveness of the proposed method.
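Multiscale sample entropy is commonly computed by coarse-graining a signal at each scale and then taking the sample entropy of the coarse-grained series. The sketch below follows that textbook definition in plain Python. The function names, the defaults m = 2 and r = 0.2 (treated here as an absolute tolerance), and the toy signal are assumptions; the abstract does not state the parameters or implementation the authors used.

```python
import math


def coarse_grain(signal, scale):
    """Average consecutive, non-overlapping windows of length `scale`."""
    n_windows = len(signal) // scale
    return [
        sum(signal[i * scale:(i + 1) * scale]) / scale
        for i in range(n_windows)
    ]


def sample_entropy(signal, m=2, r=0.2):
    """SampEn = -ln(A / B), where B and A count template pairs of length
    m and m + 1 whose Chebyshev distance is at most r (self-matches excluded)."""
    n = len(signal)

    def count_matching_pairs(length):
        # Use the same n - m starting points for both template lengths.
        templates = [signal[i:i + length] for i in range(n - m)]
        matches = 0
        for i in range(len(templates)):
            for j in range(i + 1, len(templates)):
                distance = max(
                    abs(a - b) for a, b in zip(templates[i], templates[j])
                )
                if distance <= r:
                    matches += 1
        return matches

    b = count_matching_pairs(m)
    a = count_matching_pairs(m + 1)
    if a == 0 or b == 0:
        return float("inf")  # entropy is undefined without matches
    return -math.log(a / b)


def multiscale_sample_entropy(signal, scales, m=2, r=0.2):
    """Sample entropy of the coarse-grained signal at each requested scale."""
    return [sample_entropy(coarse_grain(signal, s), m, r) for s in scales]


# Toy signal: a strictly periodic series is perfectly regular.
toy = [0.0, 1.0] * 10
print(multiscale_sample_entropy(toy, scales=[1]))
```

In practice r is often set relative to the standard deviation of the signal, and the pairwise loop above is quadratic in the series length, so a vectorized implementation would be preferable for real EEG recordings.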
Affiliation(s)
- Qinghua Zhong
- School of Physics and Telecommunication Engineering, South China Normal University, Guangzhou, China.,South China Academy of Advanced Optoelectronics, South China Normal University, Guangzhou, China
| | - Yongsheng Zhu
- School of Physics and Telecommunication Engineering, South China Normal University, Guangzhou, China
| | - Dongli Cai
- School of Physics and Telecommunication Engineering, South China Normal University, Guangzhou, China
| | - Luwei Xiao
- School of Physics and Telecommunication Engineering, South China Normal University, Guangzhou, China
| | - Han Zhang
- School of Physics and Telecommunication Engineering, South China Normal University, Guangzhou, China
| |
|
36
|
Al-Azzawi A, Ouadou A, Max H, Duan Y, Tanner JJ, Cheng J. DeepCryoPicker: fully automated deep neural network for single protein particle picking in cryo-EM. BMC Bioinformatics 2020; 21:509. [PMID: 33167860 PMCID: PMC7653784 DOI: 10.1186/s12859-020-03809-7] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 05/25/2020] [Accepted: 10/13/2020] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Cryo-electron microscopy (Cryo-EM) is widely used in the determination of the three-dimensional (3D) structures of macromolecules. Particle picking from 2D micrographs remains a challenging early step in the Cryo-EM pipeline due to the diversity of particle shapes and the extremely low signal-to-noise ratio of micrographs. Because of these issues, significant human intervention is often required to generate a high-quality set of particles for input to the downstream structure determination steps. RESULTS Here we propose a fully automated approach (DeepCryoPicker) for single particle picking based on deep learning. It first uses automated unsupervised learning to generate particle training datasets. Then it trains a deep neural network to classify particles automatically. Results indicate that DeepCryoPicker compares favorably with semi-automated methods such as DeepEM, DeepPicker, and RELION, with the significant advantage of not requiring human intervention. CONCLUSIONS Our framework, which combines supervised deep learning classification with automated unsupervised clustering for generating training data, provides an effective approach to picking particles in cryo-EM images automatically and accurately.
Affiliation(s)
- Adil Al-Azzawi
- Electrical Engineering and Computer Science Department, University of Missouri, Columbia, MO 65211 USA
| | - Anes Ouadou
- Electrical Engineering and Computer Science Department, University of Missouri, Columbia, MO 65211 USA
| | - Highsmith Max
- Electrical Engineering and Computer Science Department, University of Missouri, Columbia, MO 65211 USA
| | - Ye Duan
- Electrical Engineering and Computer Science Department, University of Missouri, Columbia, MO 65211 USA
| | - John J. Tanner
- Departments of Biochemistry and Chemistry, University of Missouri, Columbia, MO 65211-2060 USA
| | - Jianlin Cheng
- Electrical Engineering and Computer Science Department, University of Missouri, Columbia, MO 65211 USA
- Informatics Institute, University of Missouri, Columbia, MO 65211 USA
| |
|