1
|
Capitanchik C, Wilkins OG, Wagner N, Gagneur J, Ule J. From computational models of the splicing code to regulatory mechanisms and therapeutic implications. Nat Rev Genet 2024:10.1038/s41576-024-00774-2. [PMID: 39358547 DOI: 10.1038/s41576-024-00774-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 08/27/2024] [Indexed: 10/04/2024]
Abstract
Since the discovery of RNA splicing and its role in gene expression, researchers have sought a set of rules, an algorithm or a computational model that could predict the splice isoforms, and their frequencies, produced from any transcribed gene in a specific cellular context. Over the past 30 years, these models have evolved from simple position weight matrices to deep-learning models capable of integrating sequence data across vast genomic distances. Most recently, new model architectures are moving the field closer to context-specific alternative splicing predictions, and advances in sequencing technologies are expanding the type of data that can be used to inform and interpret such models. Together, these developments are driving improved understanding of splicing regulatory mechanisms and emerging applications of the splicing code to the rational design of RNA- and splicing-based therapeutics.
Affiliation(s)
- Charlotte Capitanchik
- The Francis Crick Institute, London, UK
- UK Dementia Research Institute at King's College London, London, UK
- Department of Basic and Clinical Neuroscience, Institute of Psychiatry Psychology & Neuroscience, King's College London, London, UK
| | - Oscar G Wilkins
- The Francis Crick Institute, London, UK
- UCL Queen Square Motor Neuron Disease Centre, Department of Neuromuscular Diseases, UCL Queen Square Institute of Neurology, UCL, London, UK
| | - Nils Wagner
- School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
- Helmholtz Association - Munich School for Data Science (MUDS), Munich, Germany
| | - Julien Gagneur
- School of Computation, Information and Technology, Technical University of Munich, Garching, Germany.
- Institute of Human Genetics, School of Medicine, Technical University of Munich, Munich, Germany.
- Computational Health Center, Helmholtz Center Munich, Neuherberg, Germany.
| | - Jernej Ule
- The Francis Crick Institute, London, UK.
- UK Dementia Research Institute at King's College London, London, UK.
- Department of Basic and Clinical Neuroscience, Institute of Psychiatry Psychology & Neuroscience, King's College London, London, UK.
- National Institute of Chemistry, Ljubljana, Slovenia.
| |
|
2
|
Xu J, Wang Q, Tang X, Feng X, Zhang X, Liu T, Wu F, Wang Q, Feng X, Tang Q, Lisch D, Lu Y. Drought-induced circular RNAs in maize roots: Separating signal from noise. PLANT PHYSIOLOGY 2024; 196:352-367. [PMID: 38669308 DOI: 10.1093/plphys/kiae229] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/08/2023] [Revised: 03/08/2024] [Accepted: 03/08/2024] [Indexed: 04/28/2024]
Abstract
Circular RNAs (circRNAs) play an important role in diverse biological processes; however, their origin and functions, especially in plants, remain largely unclear. Here, we used 2 maize (Zea mays) inbred lines, as well as 14 of their derivative recombinant inbred lines with different drought sensitivity, to systematically characterize 8,790 circRNAs in maize roots under well-watered (WW) and water-stress (WS) conditions. We found that a diverse set of circRNAs was expressed at significantly higher levels under WS. Enhanced expression of circRNAs was associated with longer flanking introns and an enrichment of long interspersed nuclear element retrotransposable elements. The epigenetic marks found at the back-splicing junctions of circRNA-producing genes were markedly different from those at canonical splicing junctions, characterized by increased levels of H3K36me3/H3K4me1, as well as decreased levels of H3K9Ac/H3K27Ac. We found that genes expressing circRNAs are subject to relaxed selection. The significant enrichment of trait-associated sites along their genic regions suggested that genes giving rise to circRNAs were associated with plant survival rate under drought stress, implying that circRNAs play roles in plant drought responses. Furthermore, we found that overexpression of circMED16, one of the drought-responsive circRNAs, enhances drought tolerance in Arabidopsis (Arabidopsis thaliana). Our results provide a framework for understanding the intricate interplay of epigenetic modifications and how they contribute to the fine-tuning of circRNA expression under drought stress.
Affiliation(s)
- Jie Xu
- State Key Laboratory of Crop Gene Exploration and Utilization in Southwest China, Sichuan Agricultural University, Sichuan 611130, China
- Maize Research Institute, Sichuan Agricultural University, Sichuan 611130, China
- Key Laboratory of Biology and Genetic Improvement of Maize in Southwest Region, Ministry of Agriculture, Sichuan 611130, China
- Key Laboratory of Agricultural Bioinformatics, Ministry of Education, Sichuan Agricultural University, Sichuan 611130, China
| | - Qi Wang
- State Key Laboratory of Crop Gene Exploration and Utilization in Southwest China, Sichuan Agricultural University, Sichuan 611130, China
- Maize Research Institute, Sichuan Agricultural University, Sichuan 611130, China
- Key Laboratory of Biology and Genetic Improvement of Maize in Southwest Region, Ministry of Agriculture, Sichuan 611130, China
| | - Xin Tang
- State Key Laboratory of Crop Gene Exploration and Utilization in Southwest China, Sichuan Agricultural University, Sichuan 611130, China
- Maize Research Institute, Sichuan Agricultural University, Sichuan 611130, China
- Key Laboratory of Biology and Genetic Improvement of Maize in Southwest Region, Ministry of Agriculture, Sichuan 611130, China
| | - Xiaoju Feng
- State Key Laboratory of Crop Gene Exploration and Utilization in Southwest China, Sichuan Agricultural University, Sichuan 611130, China
- Maize Research Institute, Sichuan Agricultural University, Sichuan 611130, China
- Key Laboratory of Biology and Genetic Improvement of Maize in Southwest Region, Ministry of Agriculture, Sichuan 611130, China
| | - Xiaoyue Zhang
- State Key Laboratory of Crop Gene Exploration and Utilization in Southwest China, Sichuan Agricultural University, Sichuan 611130, China
- Maize Research Institute, Sichuan Agricultural University, Sichuan 611130, China
- Key Laboratory of Biology and Genetic Improvement of Maize in Southwest Region, Ministry of Agriculture, Sichuan 611130, China
| | - Tianhong Liu
- State Key Laboratory of Crop Gene Exploration and Utilization in Southwest China, Sichuan Agricultural University, Sichuan 611130, China
- Maize Research Institute, Sichuan Agricultural University, Sichuan 611130, China
- Key Laboratory of Biology and Genetic Improvement of Maize in Southwest Region, Ministry of Agriculture, Sichuan 611130, China
| | - Fengkai Wu
- State Key Laboratory of Crop Gene Exploration and Utilization in Southwest China, Sichuan Agricultural University, Sichuan 611130, China
- Maize Research Institute, Sichuan Agricultural University, Sichuan 611130, China
- Key Laboratory of Biology and Genetic Improvement of Maize in Southwest Region, Ministry of Agriculture, Sichuan 611130, China
| | - Qingjun Wang
- State Key Laboratory of Crop Gene Exploration and Utilization in Southwest China, Sichuan Agricultural University, Sichuan 611130, China
- Maize Research Institute, Sichuan Agricultural University, Sichuan 611130, China
- Key Laboratory of Biology and Genetic Improvement of Maize in Southwest Region, Ministry of Agriculture, Sichuan 611130, China
| | - Xuanjun Feng
- State Key Laboratory of Crop Gene Exploration and Utilization in Southwest China, Sichuan Agricultural University, Sichuan 611130, China
- Maize Research Institute, Sichuan Agricultural University, Sichuan 611130, China
- Key Laboratory of Biology and Genetic Improvement of Maize in Southwest Region, Ministry of Agriculture, Sichuan 611130, China
| | - Qi Tang
- State Key Laboratory of Crop Gene Exploration and Utilization in Southwest China, Sichuan Agricultural University, Sichuan 611130, China
- Maize Research Institute, Sichuan Agricultural University, Sichuan 611130, China
- Key Laboratory of Biology and Genetic Improvement of Maize in Southwest Region, Ministry of Agriculture, Sichuan 611130, China
| | - Damon Lisch
- Department of Botany and Plant Pathology, Purdue University, West Lafayette, IN 47907, USA
| | - Yanli Lu
- State Key Laboratory of Crop Gene Exploration and Utilization in Southwest China, Sichuan Agricultural University, Sichuan 611130, China
- Maize Research Institute, Sichuan Agricultural University, Sichuan 611130, China
- Key Laboratory of Biology and Genetic Improvement of Maize in Southwest Region, Ministry of Agriculture, Sichuan 611130, China
| |
|
3
|
Qian Y, Zou Q, Zhao M, Liu Y, Guo F, Ding Y. scRNMF: An imputation method for single-cell RNA-seq data by robust and non-negative matrix factorization. PLoS Comput Biol 2024; 20:e1012339. [PMID: 39116191 PMCID: PMC11338450 DOI: 10.1371/journal.pcbi.1012339] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/05/2024] [Revised: 08/21/2024] [Accepted: 07/19/2024] [Indexed: 08/10/2024] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) has emerged as a powerful tool in genomics research, enabling the analysis of gene expression at the individual cell level. However, scRNA-seq data often suffer from a high rate of dropouts, where certain genes fail to be detected in specific cells due to technical limitations. These missing data can introduce biases and hinder downstream analysis. To overcome this challenge, the development of effective imputation methods has become crucial in the field of scRNA-seq data analysis. Here, we propose an imputation method based on robust and non-negative matrix factorization (scRNMF). Unlike other matrix factorization algorithms, scRNMF integrates two loss functions: L2 loss and C-loss. The L2 loss function is highly sensitive to outliers, which can introduce substantial errors. We utilize the C-loss function when dealing with zero values in the raw data. The primary advantage of the C-loss function is that it imposes a smaller punishment for larger errors, which results in more robust factorization when handling outliers. Various datasets of different sizes and zero rates are used to evaluate the performance of scRNMF against other state-of-the-art methods. Our method demonstrates its power and stability as a tool for imputation of scRNA-seq data.
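The contrast between the two losses can be illustrated with a minimal sketch. It assumes that "C-loss" denotes the correntropy-induced loss with a Gaussian kernel, which the abstract does not spell out; the function names `l2_loss` and `c_loss` and the bandwidth `sigma` are hypothetical and chosen only for illustration, not taken from scRNMF.

```python
import math


def l2_loss(error: float) -> float:
    """Squared-error (L2) loss: grows without bound as |error| grows."""
    return error ** 2


def c_loss(error: float, sigma: float = 1.0) -> float:
    """Correntropy-induced loss (assumed form, not confirmed by the abstract).

    The value saturates at 1, so a very large residual (an outlier)
    is penalized far less than under the L2 loss.
    """
    return 1.0 - math.exp(-(error ** 2) / (2.0 * sigma ** 2))


if __name__ == "__main__":
    # Compare how each loss scales as the residual grows.
    for e in (0.1, 1.0, 10.0):
        print(f"error={e:5.1f}  L2={l2_loss(e):8.3f}  C-loss={c_loss(e):.3f}")
```

Under this assumed form, the L2 penalty grows quadratically with the residual while the C-loss penalty levels off, which matches the abstract's description of a smaller punishment for larger errors.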
Affiliation(s)
- Yuqing Qian
- Institute Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
| | - Quan Zou
- Institute Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
| | - Mengyuan Zhao
- Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
| | - Yi Liu
- Institute Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
| | - Fei Guo
- School of Computer Science and Engineering, Central South University, Changsha, China
| | - Yijie Ding
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
| |
|
4
|
Shen F, Hu C, Huang X, He H, Yang D, Zhao J, Yang X. Advances in alternative splicing identification: deep learning and pantranscriptome. FRONTIERS IN PLANT SCIENCE 2023; 14:1232466. [PMID: 37790793 PMCID: PMC10544900 DOI: 10.3389/fpls.2023.1232466] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/31/2023] [Accepted: 08/28/2023] [Indexed: 10/05/2023]
Abstract
In plants, alternative splicing is a crucial mechanism for regulating gene expression at the post-transcriptional level: by generating multiple mature mRNA isoforms, it produces diverse proteins and diversifies gene regulation. Due to the complexity and variability of this process, accurate identification of splicing events is a vital step in studying alternative splicing. This article presents the application of alternative splicing algorithms with or without reference genomes in plants, as well as the integration of advanced deep learning techniques for improved detection accuracy. In addition, we discuss alternative splicing studies in the pan-genomic background and the usefulness of integrated strategies for fully profiling alternative splicing.
Affiliation(s)
- Fei Shen
- Institute of Biotechnology, Beijing Academy of Agriculture and Forestry Sciences, Beijing, China
| | - Chenyang Hu
- Institute of Biotechnology, Beijing Academy of Agriculture and Forestry Sciences, Beijing, China
- Shanxi Key Lab of Chinese Jujube, College of Life Science, Yan’an University, Yan’an, Shanxi, China
| | - Xin Huang
- Institute of Biotechnology, Beijing Academy of Agriculture and Forestry Sciences, Beijing, China
| | - Hao He
- Institute of Biotechnology, Beijing Academy of Agriculture and Forestry Sciences, Beijing, China
| | - Deng Yang
- Institute of Biotechnology, Beijing Academy of Agriculture and Forestry Sciences, Beijing, China
| | - Jirong Zhao
- Shanxi Key Lab of Chinese Jujube, College of Life Science, Yan’an University, Yan’an, Shanxi, China
| | - Xiaozeng Yang
- Institute of Biotechnology, Beijing Academy of Agriculture and Forestry Sciences, Beijing, China
| |
|
5
|
Hawsawi YM, Shams A, Theyab A, Abdali WA, Hussien NA, Alatwi HE, Alzahrani OR, Oyouni AAA, Babalghith AO, Alreshidi M. BARD1 mystery: tumor suppressors are cancer susceptibility genes. BMC Cancer 2022; 22:599. [PMID: 35650591 PMCID: PMC9161512 DOI: 10.1186/s12885-022-09567-4] [Citation(s) in RCA: 19] [Impact Index Per Article: 9.5] [Received: 02/04/2022] [Accepted: 04/14/2022] [Indexed: 12/24/2022] Open
Abstract
The full-length BRCA1-associated RING domain 1 (BARD1) gene encodes a 777-aa protein. BARD1 displays a dual role in cancer development and progression, as it acts as both a tumor suppressor and an oncogene. Structurally, BARD1 has domains homologous to those of BRCA1 that aid their heterodimer interaction to inhibit the progression of different cancers, such as breast and ovarian cancers, following the BRCA1-dependent pathway. In addition, BARD1 was shown to be involved in other tumor-suppression pathways (BRCA1-independent pathway), such as the TP53-dependent apoptotic signaling pathway. However, abundant BARD1 isoforms exist that differ from full-length BARD1 owing to nonsense and frameshift mutations or deletions, and these were found to be associated with susceptibility to various cancers including neuroblastoma, lung, breast, and cervical cancers. This article reviews the spectrum of the full-length BARD1 gene and its different isoforms and their anticipated associated risk. Additionally, the study highlights the role of BARD1 as an oncogene in breast cancer patients and its potential uses as a prognostic/diagnostic biomarker and as a therapeutic target for cancer susceptibility testing and treatment.
Affiliation(s)
- Yousef M Hawsawi
- King Faisal Specialist Hospital and Research Center- Research Center, KFSH&RC, MBC-J04, P.O. Box 40047, Jeddah, 21499, Saudi Arabia.,College of Medicine, Al-Faisal University, P.O. Box 50927, Riyadh, 11533, Saudi Arabia.
| | - Anwar Shams
- Department of Pharmacology, College of Medicine, Taif University, P.O. Box 11099, Taif, 21944, Saudi Arabia
| | - Abdulrahman Theyab
- College of Medicine, Al-Faisal University, P.O. Box 50927, Riyadh, 11533, Saudi Arabia.,Department of Pharmacology, College of Medicine, Taif University, P.O. Box 11099, Taif, 21944, Saudi Arabia.,Department of Laboratory Medicine, Security Forces Hospital, Mecca, Kingdom of Saudi Arabia
| | - Wed A Abdali
- King Faisal Specialist Hospital and Research Center- Research Center, KFSH&RC, MBC-J04, P.O. Box 40047, Jeddah, 21499, Saudi Arabia
| | - Nahed A Hussien
- Department of Zoology, Faculty of Science, Cairo University, Giza, 12613, Egypt.,Department of Biology, College of Science, Taif University, P.O Box 11099, Taif, 21944, Saudi Arabia
| | - Hanan E Alatwi
- Department of Biology, Faculty of Sciences, University of Tabuk, Tabuk, Kingdom of Saudi Arabia.,Genome and Biotechnology Unit, Faculty of Science, University of Tabuk, Tabuk, Saudi Arabia
| | - Othman R Alzahrani
- Department of Biology, Faculty of Sciences, University of Tabuk, Tabuk, Kingdom of Saudi Arabia.,Genome and Biotechnology Unit, Faculty of Science, University of Tabuk, Tabuk, Saudi Arabia
| | - Atif Abdulwahab A Oyouni
- Department of Biology, Faculty of Sciences, University of Tabuk, Tabuk, Kingdom of Saudi Arabia.,Genome and Biotechnology Unit, Faculty of Science, University of Tabuk, Tabuk, Saudi Arabia
| | - Ahmad O Babalghith
- Medical genetics Department, College of Medicine, Umm Alqura University, Makkah, Saudi Arabia
| | - Mousa Alreshidi
- Department of Biology, College of Science, University of Hail, Hail, Saudi Arabia.,Molecular Diagnostic and Personalized Therapeutic Unit, University of Hail, Hail, Saudi Arabia
| |
|
6
|
Li R, Li L, Xu Y, Yang J. Machine learning meets omics: applications and perspectives. Brief Bioinform 2021; 23:6425809. [PMID: 34791021 DOI: 10.1093/bib/bbab460] [Citation(s) in RCA: 51] [Impact Index Per Article: 17.0] [Received: 08/23/2021] [Revised: 09/29/2021] [Accepted: 10/07/2021] [Indexed: 02/07/2023] Open
Abstract
The innovation of biotechnologies has allowed the accumulation of omics data at an alarming rate, thus introducing the era of 'big data'. Extracting inherent valuable knowledge from various omics data remains a daunting problem in bioinformatics. Better solutions often require more innovative methods for efficient handling and effective results. Recent advancements in integrated analysis and computational modeling of multi-omics data have helped address such needs in an increasingly harmonious manner. The development and application of machine learning have largely advanced our insights into biology and biomedicine and greatly promoted the development of therapeutic strategies, especially for precision medicine. Here, we present a comprehensive survey and discussion of what happened, is happening and will happen when machine learning meets omics. Specifically, we describe how artificial intelligence can be applied to omics studies and review recent advancements at the interface between machine learning and an ever-widening range of omics including genomics, transcriptomics, proteomics, metabolomics, radiomics, as well as those at the single-cell resolution. We also discuss and provide a synthesis of ideas, new insights, current challenges and perspectives of machine learning in omics.
Affiliation(s)
- Rufeng Li
- Department of Cell Biology and Genetics, School of Basic Medical Sciences, Xi'an Jiaotong University Health Science Center, Xi'an 710061, P. R. China
| | - Lixin Li
- Department of Cell Biology and Genetics, School of Basic Medical Sciences, Xi'an Jiaotong University Health Science Center, Xi'an 710061, P. R. China
| | - Yungang Xu
- School of Electronics and Information, Northwestern Polytechnical University, Xi'an, 710129, China
| | - Juan Yang
- Department of Cell Biology and Genetics, School of Basic Medical Sciences, Xi'an Jiaotong University Health Science Center, Xi'an 710061, P. R. China.,Key Laboratory of Environment and Genes Related to Diseases (Xi'an Jiaotong University), Ministry of Education of China, Xi'an 710061, P. R. China
| |
|
7
|
Wang Y, Zhang P, Guo W, Liu H, Li X, Zhang Q, Du Z, Hu G, Han X, Pu L, Tian J, Gu X. A deep learning approach to automate whole-genome prediction of diverse epigenomic modifications in plants. THE NEW PHYTOLOGIST 2021; 232:880-897. [PMID: 34287908 DOI: 10.1111/nph.17630] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 03/26/2021] [Accepted: 07/09/2021] [Indexed: 06/13/2023]
Abstract
Epigenetic modifications function in gene transcription, RNA metabolism, and other biological processes. However, multiple factors currently limit the scientific utility of epigenomic datasets generated for plants. Here, using deep-learning approaches, we developed a Smart Model for Epigenetics in Plants (SMEP) to predict six types of epigenomic modifications: DNA 5-methylcytosine (5mC) and N6-methyladenosine (6mA) methylation, RNA N6-methyladenosine (m6A) methylation, and three types of histone modification. Using the datasets from the japonica rice Nipponbare, SMEP achieved 95% prediction accuracy for 6mA, and also achieved around 80% for 5mC, m6A, and the three types of histone modification based on the 10-fold cross-validation. Additionally, > 95% of the 6mA peaks detected after a heat-shock treatment were predicted. We also successfully applied the SMEP for examining epigenomic modifications in indica rice 93-11 and even the B73 maize line. Taken together, we show that the deep-learning-enabled SMEP can reliably mine epigenomic datasets from diverse plants to yield actionable insights about epigenomic sites. Thus, our work opens new avenues for the application of predictive tools to facilitate functional research, and will almost certainly increase the efficiency of genome engineering efforts.
Affiliation(s)
- Yifan Wang
- Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
| | - Pingxian Zhang
- Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
| | - Weijun Guo
- Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
| | - Hanqing Liu
- Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
| | - Xiulan Li
- Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
| | - Qian Zhang
- Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
| | - Zhuoying Du
- Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
| | - Guihua Hu
- Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
| | - Xiao Han
- College of Biological Science and Engineering, Fuzhou University, Fuzhou, 350108, China
| | - Li Pu
- Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
| | - Jian Tian
- Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
| | - Xiaofeng Gu
- Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
| |
|
8
|
Yu K, Zhang Q, Liu Z, Du Y, Gao X, Zhao Q, Cheng H, Li X, Liu ZX. Deep learning based prediction of reversible HAT/HDAC-specific lysine acetylation. Brief Bioinform 2021; 21:1798-1805. [PMID: 32978618 DOI: 10.1093/bib/bbz107] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Received: 06/03/2019] [Revised: 07/18/2019] [Accepted: 07/30/2019] [Indexed: 11/14/2022] Open
Abstract
Protein lysine acetylation is an important molecular mechanism for regulating cellular processes and plays critical physiological and pathological roles in cancers and other diseases. Although numerous acetylation sites have been identified through experimental identification and high-throughput proteomics techniques, their enzyme-specific regulation remains largely unknown. Here, we developed the deep learning-based protein lysine acetylation modification prediction (Deep-PLA) software for histone acetyltransferase (HAT)/histone deacetylase (HDAC)-specific acetylation prediction. Experimentally identified substrates and sites of several HATs and HDACs were curated from the literature to generate enzyme-specific data sets. We integrated various protein sequence features with a deep neural network and optimized the hyperparameters with particle swarm optimization, which achieved satisfactory performance. In comparisons based on cross-validations and testing data sets, the model outperformed previous studies. Meanwhile, we found that protein-protein interactions could enrich enzyme-specific acetylation regulatory relations and visualized this information in the Deep-PLA web server. Furthermore, a cross-cancer analysis of acetylation-associated mutations revealed that acetylation regulation was intensively disrupted by mutations in cancers and heavily implicated in the regulation of cancer signaling. These prediction and analysis results might provide helpful information to reveal the regulatory mechanism of protein acetylation in various biological processes and to promote research on the prognosis and treatment of cancers. Therefore, the Deep-PLA predictor and protein acetylation interaction networks could provide helpful information for studying the regulation of protein acetylation. The web server of Deep-PLA can be accessed at http://deeppla.cancerbio.info.
Affiliation(s)
- Kai Yu
- State Key Laboratory of Oncology in South China, Collaborative Innovation Center for Cancer Medicine, Sun Yat-sen University Cancer Center, Guangzhou 510060, China
| | - Qingfeng Zhang
- State Key Laboratory of Oncology in South China, Collaborative Innovation Center for Cancer Medicine, Sun Yat-sen University Cancer Center, Guangzhou 510060, China
| | - Zekun Liu
- State Key Laboratory of Oncology in South China, Collaborative Innovation Center for Cancer Medicine, Sun Yat-sen University Cancer Center, Guangzhou 510060, China
| | - Yimeng Du
- School of Life Sciences, Zhengzhou University, Zhengzhou 450001, China
| | - Xinjiao Gao
- Division of Molecular and Cell Biophysics, Hefei National Science Center for Physical Sciences at the Microscale, Anhui Key Laboratory of Cellular Dynamics and Chemical Biology, School of Life Sciences, University of Science and Technology of China, Hefei 230027, China
| | - Qi Zhao
- State Key Laboratory of Oncology in South China, Collaborative Innovation Center for Cancer Medicine, Sun Yat-sen University Cancer Center, Guangzhou 510060, China
| | - Han Cheng
- School of Life Sciences, Zhengzhou University, Zhengzhou 450001, China
| | - Xiaoxing Li
- State Key Laboratory of Oncology in South China, Collaborative Innovation Center for Cancer Medicine, Sun Yat-sen University Cancer Center, Guangzhou 510060, China
| | - Ze-Xian Liu
- State Key Laboratory of Oncology in South China, Collaborative Innovation Center for Cancer Medicine, Sun Yat-sen University Cancer Center, Guangzhou 510060, China
| |
|
9
|
Xu SJ, Lombroso SI, Fischer DK, Carpenter MD, Marchione DM, Hamilton PJ, Lim CJ, Neve RL, Garcia BA, Wimmer ME, Pierce RC, Heller EA. Chromatin-mediated alternative splicing regulates cocaine-reward behavior. Neuron 2021; 109:2943-2966.e8. [PMID: 34480866 PMCID: PMC8454057 DOI: 10.1016/j.neuron.2021.08.008] [Citation(s) in RCA: 38] [Impact Index Per Article: 12.7] [Received: 02/07/2020] [Revised: 06/14/2021] [Accepted: 08/10/2021] [Indexed: 10/20/2022]
Abstract
Neuronal alternative splicing is a key gene regulatory mechanism in the brain. However, the spliceosome machinery is insufficient to fully specify splicing complexity. In considering the role of the epigenome in activity-dependent alternative splicing, we and others find the histone modification H3K36me3 to be a putative splicing regulator. In this study, we found that mouse cocaine self-administration caused widespread differential alternative splicing, concomitant with the enrichment of H3K36me3 at differentially spliced junctions. Importantly, only targeted epigenetic editing can distinguish between a direct role of H3K36me3 in splicing and an indirect role via regulation of splice factor expression elsewhere on the genome. We targeted Srsf11, which was both alternatively spliced and H3K36me3 enriched in the brain following cocaine self-administration. Epigenetic editing of H3K36me3 at Srsf11 was sufficient to drive its alternative splicing and enhanced cocaine self-administration, establishing the direct causal relevance of H3K36me3 to alternative splicing of Srsf11 and to reward behavior.
Affiliation(s)
- Song-Jun Xu
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Sonia I Lombroso
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Delaney K Fischer
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Marco D Carpenter
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Dylan M Marchione
- Department of Biochemistry and Biophysics, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Peter J Hamilton
- Department of Brain and Cognitive Sciences, Virginia Commonwealth University, Richmond, VA 23298, USA
| | - Carissa J Lim
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Rachel L Neve
- Gene Delivery Technology Core, Massachusetts General Hospital, Cambridge, MA 02139, USA
| | - Benjamin A Garcia
- Department of Biochemistry and Biophysics, University of Pennsylvania, Philadelphia, PA 19104, USA; Penn Epigenetics Institute, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Mathieu E Wimmer
- Department of Psychology, Temple University, Philadelphia, PA 19121, USA
| | - R Christopher Pierce
- Department of Psychiatry, Robert Wood Johnson Medical School, Rutgers University, Piscataway, NJ 08854, USA
| | - Elizabeth A Heller
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA 19104, USA; Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA 19104, USA; Penn Epigenetics Institute, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA.
| |
|
10
|
Devailly G, Joshi A. Comprehensive analysis of epigenetic signatures of human transcription control. Mol Omics 2021; 17:692-705. [PMID: 34291238 DOI: 10.1039/d0mo00130a] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/21/2022]
Abstract
Advances in sequencing technologies have enabled exploration of epigenetic and transcriptional profiles at a genome-wide level. The epigenetic and transcriptional landscapes are now available in hundreds of mammalian cell and tissue contexts. Many studies have performed multi-omics analyses using these datasets to enhance our understanding of relationships between epigenetic modifications and transcription regulation. Nevertheless, most studies so far have focused on the promoters/enhancers and transcription start sites, and other features of transcription control including exons, introns and transcription termination remain underexplored. We investigated the interplay between epigenetic modifications and diverse transcription features using the data generated by the Roadmap Epigenomics project. A comprehensive analysis of histone modifications, DNA methylation, and RNA-seq data of thirty-three human cell lines and tissue types allowed us to confirm the generality of previously described relationships, as well as to generate new hypotheses about the interplay between epigenetic modifications and transcription features. Importantly, our analysis included previously under-explored features of transcription control, namely, transcription termination sites, exon-intron boundaries, and the exon inclusion ratio. We have made the analyses freely available to the scientific community at joshiapps.cbu.uib.no/perepigenomics_app/ for easy exploration, validation and hypothesis generation.
Affiliation(s)
- Guillaume Devailly
- GenPhySE, Université de Toulouse, INRAE, ENVT, 31326, Castanet Tolosan, France.
| | - Anagha Joshi
- Computational Biology Unit, Department of Clinical Science, University of Bergen, 5021, Bergen, Norway.
| |
|
11
|
Zrimec J, Buric F, Kokina M, Garcia V, Zelezniak A. Learning the Regulatory Code of Gene Expression. Front Mol Biosci 2021; 8:673363. [PMID: 34179082 PMCID: PMC8223075 DOI: 10.3389/fmolb.2021.673363] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 02/27/2021] [Accepted: 05/24/2021] [Indexed: 11/13/2022] Open
Abstract
Data-driven machine learning is the method of choice for predicting molecular phenotypes from nucleotide sequence, modeling gene expression events including protein-DNA binding, chromatin states as well as mRNA and protein levels. Deep neural networks automatically learn informative sequence representations and interpreting them enables us to improve our understanding of the regulatory code governing gene expression. Here, we review the latest developments that apply shallow or deep learning to quantify molecular phenotypes and decode the cis-regulatory grammar from prokaryotic and eukaryotic sequencing data. Our approach is to build from the ground up, first focusing on the initiating protein-DNA interactions, then specific coding and non-coding regions, and finally on advances that combine multiple parts of the gene and mRNA regulatory structures, achieving unprecedented performance. We thus provide a quantitative view of gene expression regulation from nucleotide sequence, concluding with an information-centric overview of the central dogma of molecular biology.
Affiliation(s)
- Jan Zrimec
- Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
| | - Filip Buric
- Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
| | - Mariia Kokina
- Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
- Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Kongens Lyngby, Denmark
| | - Victor Garcia
- School of Life Sciences and Facility Management, Zurich University of Applied Sciences, Wädenswil, Switzerland
| | - Aleksej Zelezniak
- Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
- Science for Life Laboratory, Stockholm, Sweden
| |
|
12
|
Wang Y, Yang Y, Chen S, Wang J. DeepDRK: a deep learning framework for drug repurposing through kernel-based multi-omics integration. Brief Bioinform 2021; 22:6210072. [PMID: 33822890 DOI: 10.1093/bib/bbab048] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 08/28/2020] [Revised: 01/16/2021] [Accepted: 01/30/2021] [Indexed: 12/11/2022] Open
Abstract
Recent pharmacogenomic studies that generate sequencing data coupled with pharmacological characteristics for patient-derived cancer cell lines have produced large amounts of multi-omics data for precision cancer medicine. Among the various obstacles hindering clinical translation, the lack of effective methods for multimodal and multisource data integration is becoming a bottleneck. Here we propose DeepDRK, a machine learning framework for deciphering drug response through kernel-based data integration. To transfer information among different drugs and cancer types, we trained deep neural networks on more than 20 000 pan-cancer cell line-anticancer drug pairs. These pairs were characterized by kernel-based similarity matrices integrating multisource and multi-omics data including genomics, transcriptomics, epigenomics, chemical properties of compounds and known drug-target interactions. Applied to benchmark cancer cell line datasets, our model surpassed previous approaches with higher accuracy and better robustness. We then applied our model to newly established patient-derived cancer cell lines and achieved satisfactory performance, with an AUC of 0.84 and an AUPRC of 0.77. Moreover, DeepDRK was used to predict the clinical response of cancer patients. Notably, the predictions of DeepDRK correlated well with the clinical outcome of patients and revealed multiple drug repurposing candidates. In sum, DeepDRK provides a computational method to predict the drug response of cancer cells by integrating pharmacogenomic datasets, offering an alternative way to prioritize repurposing drugs in precision cancer treatment. DeepDRK is freely available via https://github.com/wangyc82/DeepDRK.
Affiliation(s)
- Yongcui Wang
- Key Laboratory of Adaptation and Evolution of Plateau Biota at Northwest Institute of Plateau Biology, Chinese Academy of Sciences, China
| | - Yingxi Yang
- Department of Chemical and Biological Engineering at The Hong Kong University of Science and Technology, China
| | - Shilong Chen
- Key Laboratory of Adaptation and Evolution of Plateau Biota at Institute of Sanjiangyuan National Park, Chinese Academy of Sciences, China
| | - Jiguang Wang
- Division of Life Science, Department of Chemical and Biological Engineering, and State Key Laboratory of Molecular Neuroscience at The Hong Kong University of Science and Technology, China
| |
|
13
|
Chao Y, Jiang Y, Zhong M, Wei K, Hu C, Qin Y, Zuo Y, Yang L, Shen Z, Zou C. Regulatory roles and mechanisms of alternative RNA splicing in adipogenesis and human metabolic health. Cell Biosci 2021; 11:66. [PMID: 33795017 PMCID: PMC8017860 DOI: 10.1186/s13578-021-00581-w] [Citation(s) in RCA: 26] [Impact Index Per Article: 8.7] [Received: 01/06/2021] [Accepted: 03/24/2021] [Indexed: 12/15/2022] Open
Abstract
Alternative splicing (AS) regulates gene expression patterns at the post-transcriptional level and greatly expands the coding capacity of genomes and cellular protein diversity. RNA splicing can be modulated by, and interacts closely with, the genetic and epigenetic machinery. Notably, during the adipogenesis of white, brown and beige adipocytes, AS interplays tightly with the gene program networks of differentiation. Here, we integrate the available findings on specific splicing events and the distinct functions of different splicing regulators as examples to highlight the biological contribution of AS mechanisms to adipogenesis and adipocyte biology. Furthermore, accumulating evidence suggests that mutations and/or altered expression of splicing regulators, and aberrant splicing of obesity-associated genes, are often linked to diet-induced obesity and metabolic dysregulation phenotypes in humans. We therefore also discuss the prospects of targeting the splicing machinery in obesity and metabolic disorders, with a view to potential clinical management strategies for obesity treatment.
Affiliation(s)
- Yunqi Chao
- Department of Endocrinology, The Children's Hospital, School of Medicine, Zhejiang University, Hangzhou, 310052, Zhejiang, China
| | - Yonghui Jiang
- Department of Genetics, Yale University School of Medicine, New Haven, CT, 06520, USA
| | - Mianling Zhong
- Department of Endocrinology, The Children's Hospital, School of Medicine, Zhejiang University, Hangzhou, 310052, Zhejiang, China
| | - Kaiyan Wei
- Department of Endocrinology, The Children's Hospital, School of Medicine, Zhejiang University, Hangzhou, 310052, Zhejiang, China
| | - Chenxi Hu
- Department of Endocrinology, The Children's Hospital, School of Medicine, Zhejiang University, Hangzhou, 310052, Zhejiang, China
| | - Yifang Qin
- Department of Endocrinology, The Children's Hospital, School of Medicine, Zhejiang University, Hangzhou, 310052, Zhejiang, China
| | - Yiming Zuo
- Department of Endocrinology, The Children's Hospital, School of Medicine, Zhejiang University, Hangzhou, 310052, Zhejiang, China
| | - Lili Yang
- Department of Endocrinology, The Children's Hospital, School of Medicine, Zhejiang University, Hangzhou, 310052, Zhejiang, China
| | - Zheng Shen
- Department of Endocrinology, The Children's Hospital, School of Medicine, Zhejiang University, Hangzhou, 310052, Zhejiang, China
| | - Chaochun Zou
- Department of Endocrinology, The Children's Hospital, School of Medicine, Zhejiang University, Hangzhou, 310052, Zhejiang, China.
| |
|
14
|
Li S, Yu K, Wu G, Zhang Q, Wang P, Zheng J, Liu ZX, Wang J, Gao X, Cheng H. pCysMod: Prediction of Multiple Cysteine Modifications Based on Deep Learning Framework. Front Cell Dev Biol 2021; 9:617366. [PMID: 33732693 PMCID: PMC7959776 DOI: 10.3389/fcell.2021.617366] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Received: 10/14/2020] [Accepted: 01/12/2021] [Indexed: 12/18/2022] Open
Abstract
Thiol groups on cysteines can undergo multiple post-translational modifications (PTMs), acting as a molecular switch to maintain redox homeostasis and regulating a series of cell signaling transductions. Identification of sophisticated protein cysteine modifications is crucial for dissecting their underlying regulatory mechanisms. As alternatives to time-consuming and labor-intensive experimental methods, various computational methods have attracted intense research interest due to their convenience and low cost. Here, we developed the first comprehensive deep learning-based tool, pCysMod, for the prediction of multiple protein cysteine modifications, including S-nitrosylation, S-palmitoylation, S-sulfenylation, S-sulfhydration, and S-sulfinylation. Experimentally verified cysteine sites curated from the literature and sites collected from other databases and prediction tools were integrated as a benchmark dataset. Several protein sequence features were extracted and combined in a deep learning model, and the hyperparameters were optimized by particle swarm optimization algorithms. Cross-validation indicated that our model showed excellent robustness and outperformed existing tools, achieving average AUCs of 0.793, 0.807, 0.796, 0.793, and 0.876 for S-nitrosylation, S-palmitoylation, S-sulfenylation, S-sulfhydration, and S-sulfinylation, respectively, demonstrating that pCysMod is stable and suitable for protein cysteine modification prediction. In addition, we constructed a comprehensive protein cysteine modification prediction web server based on this model to help researchers find potential modification sites in their proteins of interest, which can be accessed at http://pcysmod.omicsbio.info. This work will undoubtedly greatly promote the study of protein cysteine modification and contribute to clarifying the biological regulation mechanisms of cysteine modification within and among cells.
Affiliation(s)
- Shihua Li
- State Key Laboratory of Oncology in South China, Collaborative Innovation Center for Cancer Medicine, Sun Yat-sen University Cancer Center, Guangzhou, China; School of Life Sciences, Zhengzhou University, Zhengzhou, China
| | - Kai Yu
- State Key Laboratory of Oncology in South China, Collaborative Innovation Center for Cancer Medicine, Sun Yat-sen University Cancer Center, Guangzhou, China
| | - Guandi Wu
- State Key Laboratory of Oncology in South China, Collaborative Innovation Center for Cancer Medicine, Sun Yat-sen University Cancer Center, Guangzhou, China
| | - Qingfeng Zhang
- State Key Laboratory of Oncology in South China, Collaborative Innovation Center for Cancer Medicine, Sun Yat-sen University Cancer Center, Guangzhou, China
| | - Panqin Wang
- School of Life Sciences, Zhengzhou University, Zhengzhou, China
| | - Jian Zheng
- State Key Laboratory of Oncology in South China, Collaborative Innovation Center for Cancer Medicine, Sun Yat-sen University Cancer Center, Guangzhou, China
| | - Ze-Xian Liu
- State Key Laboratory of Oncology in South China, Collaborative Innovation Center for Cancer Medicine, Sun Yat-sen University Cancer Center, Guangzhou, China
| | - Jichao Wang
- CAS Key Lab of Biobased Materials, Qingdao Institute of Bioenergy and Bioprocess Technology, Chinese Academy of Sciences, Qingdao, China
| | - Xinjiao Gao
- MOE Key Laboratory for Membraneless Organelles and Cellular Dynamics, Hefei National Laboratory for Physical Sciences at the Microscale, University of Science and Technology of China, Hefei, China
| | - Han Cheng
- School of Life Sciences, Zhengzhou University, Zhengzhou, China
| |
|
15
|
Lv Z, Cui F, Zou Q, Zhang L, Xu L. Anticancer peptides prediction with deep representation learning features. Brief Bioinform 2021; 22:6126754. [PMID: 33529337 DOI: 10.1093/bib/bbab008] [Citation(s) in RCA: 78] [Impact Index Per Article: 26.0] [Received: 10/06/2020] [Revised: 12/20/2020] [Accepted: 01/05/2021] [Indexed: 12/13/2022] Open
Abstract
Anticancer peptides constitute one of the most promising therapeutic agents for combating common human cancers. Using wet-lab experiments to verify whether a peptide displays anticancer characteristics is time-consuming and costly. Hence, in this study, we propose a computational method named identify anticancer peptides via deep representation learning features (iACP-DRLF), which uses the light gradient boosting machine algorithm and deep representation learning features. Two kinds of sequence embedding technologies were used, namely soft symmetric alignment embedding and unified representation (UniRep) embedding, both of which involve deep neural network models based on long short-term memory networks and their derived networks. The results showed that the use of deep representation learning features greatly improved the capability of the models to discriminate anticancer peptides from other peptides. Also, UMAP (uniform manifold approximation and projection for dimension reduction) and SHAP (Shapley additive explanations) analysis indicated that UniRep features have an advantage over other features for anticancer peptide identification. The Python script and pretrained models can be downloaded from https://github.com/zhibinlv/iACP-DRLF or from http://public.aibiochem.net/iACP-DRLF/.
Affiliation(s)
- Zhibin Lv
- University of Electronic Science and Technology of China
| | - Feifei Cui
- University of Electronic Science and Technology of China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences at University of Electronic Science and Technology of China
| | - Lichao Zhang
- School of Intelligent Manufacturing and Equipment, Shenzhen Institute of Information Technology
| | - Lei Xu
- School of Electronic and Communication Engineering, Shenzhen Polytechnic
| |
|
16
|
Prihoda D, Maritz JM, Klempir O, Dzamba D, Woelk CH, Hazuda DJ, Bitton DA, Hannigan GD. The application potential of machine learning and genomics for understanding natural product diversity, chemistry, and therapeutic translatability. Nat Prod Rep 2021; 38:1100-1108. [PMID: 33245088 DOI: 10.1039/d0np00055h] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Indexed: 12/28/2022]
Abstract
Covering: up to the end of 2020. The machine learning field can be defined as the study and application of algorithms that perform classification and prediction tasks through pattern recognition instead of explicitly defined rules. Among other areas, machine learning has excelled in natural language processing. As such methods have excelled at understanding written languages (e.g. English), they are also being applied to biological problems to better understand the "genomic language". In this review we focus on recent advances in applying machine learning to natural products and genomics, and how those advances are improving our understanding of natural product biology, chemistry, and drug discovery. We discuss machine learning applications in genome mining (identifying biosynthetic signatures in genomic data), predictions of what structures will be created from those genomic signatures, and the types of activity we might expect from those molecules. We further explore the application of these approaches to data derived from complex microbiomes, with a focus on the human microbiome. We also review challenges in leveraging machine learning approaches in the field, and how the availability of other "omics" data layers provides value. Finally, we provide insights into the challenges associated with interpreting machine learning models and the underlying biology and promises of applying machine learning to natural product drug discovery. We believe that the application of machine learning methods to natural product research is poised to accelerate the identification of new molecular entities that may be used to treat a variety of disease indications.
Affiliation(s)
- David Prihoda
- R&D Informatics Solutions, MSD Czech Republic s.r.o., Prague, Czech Republic and Department of Informatics and Chemistry, Faculty of Chemical Technology, University of Chemistry and Technology, Prague, Czech Republic
| | - Julia M Maritz
- Exploratory Science Center, Merck & Co., Inc., Cambridge, MA, USA.
| | - Ondrej Klempir
- R&D Informatics Solutions, MSD Czech Republic s.r.o., Prague, Czech Republic
| | - David Dzamba
- R&D Informatics Solutions, MSD Czech Republic s.r.o., Prague, Czech Republic
| | | | - Daria J Hazuda
- Exploratory Science Center, Merck & Co., Inc., Cambridge, MA, USA.
| | - Danny A Bitton
- R&D Informatics Solutions, MSD Czech Republic s.r.o., Prague, Czech Republic
| | | |
|
17
|
Li Y, Zhang Z, Teng Z, Liu X. PredAmyl-MLP: Prediction of Amyloid Proteins Using Multilayer Perceptron. Comput Math Methods Med 2020; 2020:8845133. [PMID: 33294004 PMCID: PMC7700051 DOI: 10.1155/2020/8845133] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 09/14/2020] [Revised: 10/06/2020] [Accepted: 10/31/2020] [Indexed: 01/20/2023]
Abstract
Amyloid is generally an aggregate of insoluble fibrin; its abnormal deposition is the pathogenic mechanism of various diseases, such as Alzheimer's disease and type II diabetes. Therefore, accurately identifying amyloid is necessary to understand its role in pathology. We proposed a machine learning-based prediction model called PredAmyl-MLP, which consists of the following three steps: feature extraction, feature selection, and classification. In the step of feature extraction, seven feature extraction algorithms and different combinations of them are investigated, and the combination of SVMProt-188D and tripeptide composition (TPC) is selected according to the experimental results. In the step of feature selection, maximum relevant maximum distance (MRMD) and binomial distribution (BD) are, respectively, used to remove the redundant or noise features, and the appropriate features are selected according to the experimental results. In the step of classification, we employed multilayer perceptron (MLP) to train the prediction model. The 10-fold cross-validation results show that the overall accuracy of PredAmyl-MLP reached 91.59%, and the performance was better than the existing methods.
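The tripeptide composition (TPC) feature named in this abstract can be illustrated with a minimal sketch. It assumes the usual definition, the frequency of each of the 20^3 = 8000 possible tripeptides among the overlapping length-3 windows of a sequence; the function name `tripeptide_composition` and the toy sequence are hypothetical and are not taken from PredAmyl-MLP.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def tripeptide_composition(seq):
    """Return a dict mapping each of the 8000 tripeptides to its frequency.

    Windows containing a non-standard residue are skipped (an assumption
    made for this sketch, not a detail stated in the abstract).
    """
    counts = {"".join(t): 0 for t in product(AMINO_ACIDS, repeat=3)}
    windows = 0
    for i in range(len(seq) - 2):
        tri = seq[i:i + 3]
        if tri in counts:
            counts[tri] += 1
            windows += 1
    if windows == 0:
        return {k: 0.0 for k in counts}
    return {k: c / windows for k, c in counts.items()}


# Toy sequence (hypothetical): 8 overlapping windows, "MKV" occurs twice.
features = tripeptide_composition("MKVLAAGMKV")
print(len(features), features["MKV"])
```

A vector this sparse is one reason a feature selection step such as the MRMD stage described above is typically applied before classification.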
Affiliation(s)
- Yanjuan Li
- College of Information and Computer Engineering, Northeast Forestry University, Harbin 150040, China
| | - Zitong Zhang
- College of Information and Computer Engineering, Northeast Forestry University, Harbin 150040, China
| | - Zhixia Teng
- College of Information and Computer Engineering, Northeast Forestry University, Harbin 150040, China
| | - Xiaoyan Liu
- College of Computer Science and Technology, Harbin Institute of Technology, Harbin 150040, China
| |
|
18
|
Zhai Y, Chen Y, Teng Z, Zhao Y. Identifying Antioxidant Proteins by Using Amino Acid Composition and Protein-Protein Interactions. Front Cell Dev Biol 2020; 8:591487. [PMID: 33195258 PMCID: PMC7658297 DOI: 10.3389/fcell.2020.591487] [Citation(s) in RCA: 29] [Impact Index Per Article: 7.3] [Received: 08/04/2020] [Accepted: 09/18/2020] [Indexed: 12/13/2022] Open
Abstract
Excessive oxidative stress responses can threaten our health, and thus it is essential to produce antioxidant proteins to regulate the body’s oxidative responses. The low number of antioxidant proteins makes it difficult to extract their representative features. Our method did not use structural information but instead studied antioxidant proteins from a sequence perspective while focusing on the impact of data imbalance on sensitivity, thus greatly improving the model’s sensitivity for antioxidant protein recognition. We developed a method based on the Composition of k-spaced Amino Acid Pairs (CKSAAP) and the Conjoint Triad (CT) features derived from the amino acid composition and protein-protein interactions. SMOTE and the Max-Relevance-Max-Distance algorithm (MRMD) were used to balance the training data and select the optimal feature subset, respectively. Classification used a random forest algorithm on the selected feature subset and was evaluated by 10-fold cross-validation. The sensitivity was 0.792, the specificity was 0.808, and the average accuracy was 0.8.
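As an illustration of the CKSAAP descriptor mentioned above, the following sketch assumes the common definition: for a gap k, count every residue pair separated by exactly k positions and divide by the number of such pairs in the sequence, giving 400 values per k. The function name `cksaap` and the toy peptide are hypothetical; the authors' own implementation may differ in details such as which values of k are combined.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def cksaap(seq, k):
    """Frequencies of the 400 residue pairs separated by k positions.

    k = 0 gives ordinary dipeptide composition. Pairs that include a
    non-standard residue are ignored (an assumption of this sketch).
    """
    counts = {a + b: 0 for a in AMINO_ACIDS for b in AMINO_ACIDS}
    total = 0
    for i in range(len(seq) - k - 1):
        pair = seq[i] + seq[i + k + 1]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        return {p: 0.0 for p in counts}
    return {p: c / total for p, c in counts.items()}


# Toy peptide (hypothetical). With k = 1 the pairs are A.A, K.K, A.A, K.K.
gap1 = cksaap("AKAKAK", 1)
print(gap1["AA"], gap1["KK"])
```

Concatenating the vectors for several gaps (for example k = 0 to 5) is a typical way to build the final feature vector, which is then passed to feature selection.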
Affiliation(s)
- Yixiao Zhai
- Information and Computer Engineering College, Northeast Forestry University, Harbin, China
| | - Yu Chen
- Information and Computer Engineering College, Northeast Forestry University, Harbin, China
| | - Zhixia Teng
- Information and Computer Engineering College, Northeast Forestry University, Harbin, China
| | - Yuming Zhao
- Information and Computer Engineering College, Northeast Forestry University, Harbin, China
| |
|
19
|
Hou R, Wu J, Xu L, Zou Q, Wu YJ. Computational Prediction of Protein Arginine Methylation Based on Composition-Transition-Distribution Features. ACS Omega 2020; 5:27470-27479. [PMID: 33134710 PMCID: PMC7594152 DOI: 10.1021/acsomega.0c03972] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 08/18/2020] [Accepted: 10/06/2020] [Indexed: 06/11/2023]
Abstract
Arginine methylation is one of the most essential protein post-translational modifications. Identifying the sites of arginine methylation is a critical problem in biology research. Unfortunately, biological experiments such as mass spectrometry are expensive and time-consuming. Hence, predicting arginine methylation by machine learning is a fast and efficient alternative. In this paper, we focus on the systematic characterization of arginine methylation with composition-transition-distribution (CTD) features. The presented framework consists of three stages. In the first stage, we extract CTD features from 1750 samples and use a decision tree to generate accurate predictions. The prediction accuracy can reach 96%. In the second stage, a support vector machine predicts the number of arginine methylation sites with an R-squared of 0.36. In the third stage, experiments carried out with the updated arginine methylation site data set show that using CTD features with random forest as the classifier outperforms previous methods. The identification accuracy can reach 82.1% and 82.5% in the single methylarginine and double methylarginine data sets, respectively. The findings presented in this paper can be helpful for future research on arginine methylation.
Affiliation(s)
- Ruiyan Hou
- Laboratory of Molecular Toxicology, State Key Laboratory of Integrated Management of Pest Insects and Rodents, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China
- College of Life Science, University of Chinese Academy of Sciences, Beijing 100049, China
| | - Jin Wu
- School of Management, Shenzhen Polytechnic, Shenzhen 518055, China
| | - Lei Xu
- School of Electronic and Engineering, Shenzhen Polytechnic, Shenzhen 518055, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Yi-Jun Wu
- Laboratory of Molecular Toxicology, State Key Laboratory of Integrated Management of Pest Insects and Rodents, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China
| |
|
20
|
Guo Z, Wang P, Liu Z, Zhao Y. Discrimination of Thermophilic Proteins and Non-thermophilic Proteins Using Feature Dimension Reduction. Front Bioeng Biotechnol 2020; 8:584807. [PMID: 33195148 PMCID: PMC7642589 DOI: 10.3389/fbioe.2020.584807] [Citation(s) in RCA: 31] [Impact Index Per Article: 7.8] [Received: 07/18/2020] [Accepted: 09/11/2020] [Indexed: 01/19/2023] Open
Abstract
Thermophilicity is a very important property of proteins, as it sometimes determines denaturation and cell death. Thus, methods for predicting thermophilic proteins and non-thermophilic proteins are of interest and can contribute to the design and engineering of proteins. In this article, we describe the use of feature dimension reduction technology and LIBSVM to identify thermophilic proteins. The highest accuracy obtained by cross-validation was 96.02% with 119 parameters. When using only 16 features, we obtained an accuracy of 93.33%. We discuss the importance of the different characteristics in identification and report a comparison of the performance of support vector machine to that of other methods.
Affiliation(s)
- Zifan Guo
- School of Aeronautics and Astronautic, Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| | - Pingping Wang
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, China
| | - Zhendong Liu
- School of Computer Science and Technology, Shandong Jianzhu University, Jinan, China
| | - Yuming Zhao
- Information and Computer Engineering College, Northeast Forestry University, Harbin, China
| |
|
21
|
Dou L, Li X, Zhang L, Xiang H, Xu L. iGlu_AdaBoost: Identification of Lysine Glutarylation Using the AdaBoost Classifier. J Proteome Res 2020; 20:191-201. [PMID: 33090794 DOI: 10.1021/acs.jproteome.0c00314] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Indexed: 02/01/2023]
Abstract
Lysine glutarylation is a newly reported post-translational modification (PTM) that plays significant roles in regulating metabolic and mitochondrial processes. Accurate identification of protein glutarylation is the primary task for better investigating its molecular functions and various applications. Because traditional biological sequencing techniques are time-consuming and expensive, and protein data are growing explosively, building precise computational models to rapidly identify glutarylation is a popular and feasible solution. In this work, we propose a novel AdaBoost-based predictor called iGlu_AdaBoost to distinguish glutarylation and non-glutarylation sequences. Here, the top 37 features were chosen from a total of 1768 combined features using Chi2 followed by incremental feature selection (IFS) to build the model; the combined features include 188D, the composition of k-spaced amino acid pairs (CKSAAP), and enhanced amino acid composition (EAAC). With the help of the hybrid-sampling method SMOTE-Tomek, the AdaBoost algorithm achieved satisfactory recall, specificity, and AUC values of 87.48%, 72.49%, and 0.89 over 10-fold cross-validation as well as 72.73%, 71.92%, and 0.63 on an independent test, respectively. Further feature analysis suggested that the positively charged amino acids R and K play critical roles in glutarylation recognition. Our model showed good generalization ability and consistent prediction results for positive and negative samples, comparable to four published tools. The proposed predictor is an efficient tool for finding potential glutarylation sites and provides helpful suggestions for further research on glutarylation mechanisms and related disease treatments.
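The recall and specificity figures reported in this abstract follow the standard confusion-matrix definitions: recall = TP / (TP + FN) and specificity = TN / (TN + FP). A minimal sketch follows; the function name and the label vectors are hypothetical toy data, not results from iGlu_AdaBoost.

```python
def recall_and_specificity(y_true, y_pred):
    """Compute recall (sensitivity) and specificity for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return recall, specificity


# Hypothetical toy labels: 4 positives (3 found), 5 negatives (3 rejected).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1, 1]
print(recall_and_specificity(y_true, y_pred))
```

Reporting both values matters on imbalanced data, since accuracy alone can look high when a classifier mostly predicts the majority class.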
Affiliation(s)
- Lijun Dou
- School of Automotive and Transportation Engineering, Shenzhen Polytechnic, Shenzhen 518055, China; Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Xiaoling Li
- Department of Oncology, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin 150000, China
| | - Lichao Zhang
- School of Intelligent Manufacturing and Equipment, Shenzhen Institute of Information Technology, Shenzhen 518172, China
| | - Huaikun Xiang
- School of Automotive and Transportation Engineering, Shenzhen Polytechnic, Shenzhen 518055, China
| | - Lei Xu
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen 518055, China
| |
|
22
|
A Method for Identifying Vesicle Transport Proteins Based on LibSVM and MRMD. Comput Math Methods Med 2020; 2020:8926750. [PMID: 33133228 PMCID: PMC7591939 DOI: 10.1155/2020/8926750] [Citation(s) in RCA: 45] [Impact Index Per Article: 11.3] [Received: 08/04/2020] [Revised: 08/14/2020] [Accepted: 09/16/2020] [Indexed: 12/14/2022]
Abstract
With the development of computer technology, many machine learning algorithms have been applied to the field of biology, forming the discipline of bioinformatics. Protein function prediction is a classic research topic in this subject area. Though many scholars have made progress in identifying proteins with different algorithms, they often extract a large number of feature types and use very complex classification methods to obtain little improvement in classification performance, and this process is very time-consuming. In this research, we attempt to use as few features as possible to classify vesicular transport proteins while still obtaining a comparatively satisfactory classification result. We adopt CTDC, a submethod of the composition, transition, and distribution (CTD) method, to extract only 39 features from each sequence, and LibSVM is used as the classification method. We use the SMOTE method to deal with the problem of dataset imbalance. There are 11619 protein sequences in our dataset. We selected 4428 sequences to train our classification model and another 1832 sequences from our dataset to test the classification performance, and finally achieved an accuracy of 71.77%. After dimension reduction by MRMD, the accuracy is 72.16%.
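The CTDC descriptor mentioned above can be sketched for a single physicochemical property. This assumes the 39 features are three group fractions for each of 13 properties, which is how CTDC is commonly implemented but is not spelled out in the abstract. The hydrophobicity grouping below (polar RKEDQN, neutral GASTPHY, hydrophobic CLVIMFW) is one widely used CTD grouping and is included as an illustrative assumption; the function name `ctdc_one_property` is hypothetical.

```python
# One commonly used three-class hydrophobicity grouping for CTD descriptors.
# Treated here as an illustrative assumption, not the authors' stated choice.
HYDROPHOBICITY_GROUPS = {
    "polar": "RKEDQN",
    "neutral": "GASTPHY",
    "hydrophobic": "CLVIMFW",
}


def ctdc_one_property(seq, groups):
    """Return the fraction of residues falling in each group, in dict order."""
    n = len(seq)
    if n == 0:
        return [0.0] * len(groups)
    return [sum(seq.count(aa) for aa in members) / n
            for members in groups.values()]


# Toy sequence (hypothetical): two residues from each of the three groups.
print(ctdc_one_property("RKGACL", HYDROPHOBICITY_GROUPS))
```

Repeating this for 13 property groupings and concatenating the results would give 13 x 3 = 39 values per sequence under the assumption stated above.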
|
23
|
Li Q, Xu L, Li Q, Zhang L. Identification and Classification of Enhancers Using Dimension Reduction Technique and Recurrent Neural Network. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2020; 2020:8852258. [PMID: 33133227 PMCID: PMC7591959 DOI: 10.1155/2020/8852258] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 08/25/2020] [Revised: 09/16/2020] [Accepted: 09/30/2020] [Indexed: 12/21/2022]
Abstract
Enhancers are noncoding fragments in DNA sequences, which play an important role in gene transcription and translation. However, due to their high free scattering and positional variability, the identification and classification of enhancers have a higher level of complexity than those of coding genes. In order to solve this problem, many computational studies have been carried out in this field, but there are still some deficiencies in these prediction models. In this paper, we use various feature extraction strategies, dimension reduction technology, and a comprehensive application of machine model and recurrent neural network model to achieve accurate enhancer identification and classification, with accuracies of 76.7% and 84.9%, respectively. The model proposed in this paper is superior to the previous methods in performance index or feature dimension, which provides inspiration for the prediction of enhancers by computer technology in the future.
Affiliation(s)
- Qingwen Li
- College of Animal Science and Technology, Northeast Agricultural University, Harbin, China
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| | - Lei Xu
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, China
| | - Qingyuan Li
- Forestry and Fruit Tree Research Institute, Wuhan Academy of Agricultural Sciences, Wuhan, China
| | - Lichao Zhang
- School of Intelligent Manufacturing and Equipment, Shenzhen Institute of Information Technology, Shenzhen, China
|
24
|
Hu Q, Greene CS, Heller EA. Specific histone modifications associate with alternative exon selection during mammalian development. Nucleic Acids Res 2020; 48:4709-4724. [PMID: 32319526 PMCID: PMC7229819 DOI: 10.1093/nar/gkaa248] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 12/03/2019] [Revised: 03/23/2020] [Accepted: 04/02/2020] [Indexed: 12/29/2022]
Abstract
Alternative splicing (AS) is frequent during early mouse embryonic development. Specific histone post-translational modifications (hPTMs) have been shown to regulate exon splicing by either directly recruiting splice machinery or indirectly modulating transcriptional elongation. In this study, we hypothesized that hPTMs regulate expression of alternatively spliced genes for specific processes during differentiation. To address this notion, we applied an innovative machine learning approach to relate global hPTM enrichment to AS regulation during mammalian tissue development. We found that specific hPTMs, H3K36me3 and H3K4me1, play a role in skipped exon selection among all the tissues and developmental time points examined. In addition, we used an iterative random forest model and found that interactions of multiple hPTMs most strongly predicted splicing when they included H3K36me3 and H3K4me1. Collectively, our data demonstrated a link between hPTMs and alternative splicing, which will drive further experimental studies on the functional relevance of these modifications to alternative splicing.
Affiliation(s)
- Qiwen Hu
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
| | - Casey S Greene
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
| | - Elizabeth A Heller
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
|
25
|
Li Q, Zhou W, Wang D, Wang S, Li Q. Prediction of Anticancer Peptides Using a Low-Dimensional Feature Model. Front Bioeng Biotechnol 2020; 8:892. [PMID: 32903381 PMCID: PMC7434836 DOI: 10.3389/fbioe.2020.00892] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Received: 05/25/2020] [Accepted: 07/10/2020] [Indexed: 01/09/2023]
Abstract
Cancer is still a severe health problem globally. Cancer therapy traditionally involves radiotherapy or anticancer drugs to kill cancer cells, but these methods are quite expensive and have side effects that can cause great harm to patients. With the discovery of anticancer peptides (ACPs), significant progress has been achieved in the therapy of tumors. Therefore, it is invaluable to accurately identify anticancer peptides. Although biochemical experiments can accomplish this, they are expensive and time-consuming. To promote the application of anticancer peptides in cancer therapy, machine learning can be used to recognize anticancer peptides by extracting their feature vectors. Nevertheless, in practice, machine learning models trained on high-dimensional features usually perform poorly. To address this problem, this paper puts forward a 19-dimensional feature model based on anticancer peptide sequences, which has lower dimensionality and better performance than some existing methods. In addition, this paper also separates out a model with a low number of dimensions and acceptable performance. The few features identified in this study may represent the important features of anticancer peptides.
Affiliation(s)
- Qingwen Li
- College of Animal Science and Technology, Northeast Agricultural University, Harbin, China
| | - Wenyang Zhou
- Center for Bioinformatics, School of Life Sciences and Technology, Harbin Institute of Technology, Harbin, China
| | - Donghua Wang
- Department of General Surgery, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin, China
| | - Sui Wang
- Key Laboratory of Soybean Biology in Chinese Ministry of Education, Northeast Agricultural University, Harbin, China
- State Key Laboratory of Tree Genetics and Breeding, Northeast Forestry University, Harbin, China
| | - Qingyuan Li
- Forestry and Fruit Tree Research Institute, Wuhan Academy of Agricultural Sciences, Wuhan, China
|
26
|
Li S, Yu K, Wang D, Zhang Q, Liu ZX, Zhao L, Cheng H. Deep learning based prediction of species-specific protein S-glutathionylation sites. BIOCHIMICA ET BIOPHYSICA ACTA-PROTEINS AND PROTEOMICS 2020; 1868:140422. [DOI: 10.1016/j.bbapap.2020.140422] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 01/16/2020] [Revised: 03/12/2020] [Accepted: 03/26/2020] [Indexed: 02/08/2023]
|
27
|
Zhang S, Li X, Lin Q, Lin J, Wong KC. Uncovering the key dimensions of high-throughput biomolecular data using deep learning. Nucleic Acids Res 2020; 48:e56. [PMID: 32232416 PMCID: PMC7261195 DOI: 10.1093/nar/gkaa191] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 07/16/2019] [Revised: 03/06/2020] [Accepted: 03/16/2020] [Indexed: 01/09/2023]
Abstract
Recent advances in high-throughput single-cell RNA-seq have enabled us to measure thousands of gene expression levels at single-cell resolution. However, the transcriptomic profiles are high-dimensional and sparse in nature. To address it, a deep learning framework based on auto-encoder, termed DeepAE, is proposed to elucidate high-dimensional transcriptomic profiling data in an encode-decode manner. Comparative experiments were conducted on nine transcriptomic profiling datasets to compare DeepAE with four benchmark methods. The results demonstrate that the proposed DeepAE outperforms the benchmark methods with robust performance on uncovering the key dimensions of single-cell RNA-seq data. In addition, we also investigate the performance of DeepAE in other contexts and platforms such as mass cytometry and metabolic profiling in a comprehensive manner. Gene ontology enrichment and pathology analysis are conducted to reveal the mechanisms behind the robust performance of DeepAE by uncovering its key dimensions.
Affiliation(s)
- Shixiong Zhang
- Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
| | - Xiangtao Li
- School of Artificial Intelligence, Jilin University, Jilin 132000, China
| | - Qiuzhen Lin
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
| | - Jiecong Lin
- Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
| | - Ka-Chun Wong
- Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
|
28
|
Wei L, Luan S, Nagai LAE, Su R, Zou Q. Exploring sequence-based features for the improved prediction of DNA N4-methylcytosine sites in multiple species. Bioinformatics 2020; 35:1326-1333. [PMID: 30239627 DOI: 10.1093/bioinformatics/bty824] [Citation(s) in RCA: 126] [Impact Index Per Article: 31.5] [Received: 08/13/2018] [Revised: 09/12/2018] [Accepted: 09/18/2018] [Indexed: 12/20/2022]
Abstract
MOTIVATION As one of the important epigenetic modifications, DNA N4-methylcytosine (4mC) has recently been shown to play crucial roles in restriction-modification systems. For better understanding of its functional mechanisms, it is fundamentally important to identify 4mC modifications. Machine learning methods have recently emerged as an effective and efficient approach for the high-throughput identification of 4mC sites, although high predictive error rates are still challenging for existing methods. Therefore, it is highly desirable to develop a computational method to more accurately identify 4mC sites. RESULTS In this study, we propose a machine learning based predictor, namely 4mcPred-SVM, for the genome-wide detection of DNA 4mC sites. In this predictor, we present a new feature representation algorithm that sufficiently exploits sequence-based information. To improve the feature representation ability, we use a two-step feature optimization strategy, thereby obtaining the most representative features. Using the resulting features and Support Vector Machine (SVM), we adaptively train the optimal models for different species. Comparative results on benchmark datasets from six species indicate that our predictor achieves generally better performance in predicting 4mC sites than the state-of-the-art predictors. Importantly, the sequence-based features can reliably and robustly predict 4mC sites, facilitating the discovery of potentially important sequence characteristics for the prediction of 4mC sites. AVAILABILITY AND IMPLEMENTATION The user-friendly webserver that implements the proposed 4mcPred-SVM is well established, and is freely accessible at http://server.malab.cn/4mcPred-SVM. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Leyi Wei
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Shasha Luan
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Luis Augusto Eijy Nagai
- Lab of Functional Analysis In Silico, Institute of Medical Science, University of Tokyo, Tokyo, Japan
| | - Ran Su
- School of Computer Software, College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Quan Zou
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China; Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
|
29
|
Huang Q, Zhang J, Wei L, Guo F, Zou Q. 6mA-RicePred: A Method for Identifying DNA N6-Methyladenine Sites in the Rice Genome Based on Feature Fusion. FRONTIERS IN PLANT SCIENCE 2020; 11:4. [PMID: 32076430 PMCID: PMC7006724 DOI: 10.3389/fpls.2020.00004] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Received: 12/04/2019] [Accepted: 01/06/2020] [Indexed: 06/01/2023]
Abstract
MOTIVATION The biological function of N6-methyladenine DNA (6mA) in plants is largely unknown. Rice is one of the most important crops worldwide and is a model species for molecular and genetic studies. There are few methods for 6mA site recognition in the rice genome, and an effective computational method is needed. RESULTS In this paper, we propose a new computational method called 6mA-Pred to identify 6mA sites in the rice genome. 6mA-Pred employs a feature fusion method to combine advantageous features from other methods and thus obtain a new feature to identify 6mA sites. This method achieved an accuracy of 87.27% in the identification of 6mA sites with 10-fold cross-validation and achieved an accuracy of 85.6% in independent test sets.
Affiliation(s)
- Qianfei Huang
- College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Jun Zhang
- Rehabilitation Department, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin, China
| | - Leyi Wei
- College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Fei Guo
- College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
|
30
|
Wang C, Zhao N, Yuan L, Liu X. Computational Detection of Breast Cancer Invasiveness with DNA Methylation Biomarkers. Cells 2020; 9:E326. [PMID: 32019269 PMCID: PMC7072524 DOI: 10.3390/cells9020326] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 01/02/2020] [Revised: 01/28/2020] [Accepted: 01/28/2020] [Indexed: 12/14/2022]
Abstract
Breast cancer is the most common female malignancy. It has high mortality, primarily due to metastasis and recurrence. Patients with invasive and noninvasive breast cancer require different treatments, so there is an urgent need for predictive tools to guide clinical decision making and avoid overtreatment of noninvasive breast cancer and undertreatment of invasive cases. Here, we divided the sample set based on the genome-wide methylation distance to make full use of metastatic cancer data. Specifically, we implemented two differential methylation analysis methods to identify specific CpG sites. After effective dimensionality reduction, we constructed a methylation-based classifier using the Random Forest algorithm to categorize the primary breast cancer. We took advantage of breast cancer (BRCA) HM450 DNA methylation data and accompanying clinical data from The Cancer Genome Atlas (TCGA) database to validate the performance of the classifier. Overall, this study demonstrates DNA methylation as a potential biomarker to predict breast tumor invasiveness and as a possible parameter that could be included in the studies aiming to predict breast cancer aggressiveness. However, more comparative studies are needed to assess its usability in the clinic. Towards this, we developed a website based on these algorithms to facilitate its use in studies and predictions of breast cancer invasiveness.
Affiliation(s)
- Chunyu Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150080, China
| | - Ning Zhao
- School of Life Science and Technology, Harbin Institute of Technology, Harbin 150080, China
| | - Linlin Yuan
- College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
| | - Xiaoyan Liu
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150080, China
|
31
|
Ao C, Zhang Y, Li D, Zhao Y, Zou Q. Progress in the development of antimicrobial peptide prediction tools. Curr Protein Pept Sci 2020; 22:CPPS-EPUB-103746. [PMID: 31957609 DOI: 10.2174/1389203721666200117163802] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/19/2019] [Revised: 06/12/2019] [Accepted: 07/15/2019] [Indexed: 11/22/2022]
Abstract
Antimicrobial peptides (AMPs) are natural polypeptides with antimicrobial activities and are found in most organisms. AMPs are evolutionarily conserved components of the innate immune system and show potent activity against bacteria, fungi, and viruses, and in some cases display antitumor activity. Thus, AMPs are major candidates in the development of new antibacterial reagents. In the last few decades, AMPs have attracted significant attention from the research community. During the early stages of this research field, AMPs were identified experimentally, which is an expensive and time-consuming procedure. Research and development (R&D) of fast, highly efficient computational tools for predicting AMPs has since enabled the rapid identification and analysis of new AMPs from a wide range of organisms. Moreover, these computational tools have allowed researchers to better understand the activities of AMPs, which has promoted R&D of antibacterial drugs. In this review, we systematically summarize AMP prediction tools and the algorithms they use.
Affiliation(s)
- Chunyan Ao
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| | - Yu Zhang
- Department of Neurosurgery, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin, China
| | - Dapeng Li
- Department of Internal Medicine-Oncology, The Fourth Hospital in Qinhuangdao, Hebei, China
| | - Yuming Zhao
- Information and Computer Engineering College, Northeast Forestry University, Harbin, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
|
32
|
Prospect of using deep learning for predicting differentiation of myeloid progenitor cells after sepsis. Chin Med J (Engl) 2020; 132:1862-1864. [PMID: 31306223 PMCID: PMC6759120 DOI: 10.1097/cm9.0000000000000349] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 12/29/2022]
|
33
|
Zhang Z, Zhao Y, Liao X, Shi W, Li K, Zou Q, Peng S. Deep learning in omics: a survey and guideline. Brief Funct Genomics 2020; 18:41-57. [PMID: 30265280 DOI: 10.1093/bfgp/ely030] [Citation(s) in RCA: 80] [Impact Index Per Article: 20.0] [Received: 03/04/2018] [Revised: 07/31/2018] [Accepted: 08/30/2018] [Indexed: 01/17/2023]
Abstract
Omics, such as genomics, transcriptomics and proteomics, has been affected by the era of big data. The huge amount of high-dimensional and complexly structured data means that conventional machine learning algorithms are no longer applicable. Fortunately, deep learning technology can contribute toward resolving these challenges. There is evidence that deep learning can handle omics data well and resolve omics problems. This survey aims to provide an entry-level guideline for researchers to understand and use deep learning in order to solve omics problems. We first introduce several deep learning models and then discuss several research areas which have combined omics and deep learning in recent years. In addition, we summarize the general steps involved in using deep learning, which have not yet been systematically discussed in the existing literature on this topic. Finally, we compare the features and performance of current mainstream open source deep learning frameworks and present the opportunities and challenges involved in deep learning. This survey will be a good starting point and guideline for omics researchers to understand deep learning.
Affiliation(s)
- Zhiqiang Zhang
- School of Computer Science, National University of Defense Technology, Changsha, China
| | - Yi Zhao
- Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
| | - Xiangke Liao
- School of Computer Science, National University of Defense Technology, Changsha, China
| | - Wenqiang Shi
- School of Computer Science, National University of Defense Technology, Changsha, China
| | - Kenli Li
- College of Computer Science and Electronic Engineering & National Supercomputer Centre in Changsha, Hunan University, Changsha, China
| | - Quan Zou
- School of Computer Science and Technology, Tianjin University, Tianjin, China
| | - Shaoliang Peng
- School of Computer Science, National University of Defense Technology, Changsha, China; College of Computer Science and Electronic Engineering & National Supercomputer Centre in Changsha, Hunan University, Changsha, China
|
34
|
Sun S, Wang C, Ding H, Zou Q. Machine learning and its applications in plant molecular studies. Brief Funct Genomics 2019; 19:40-48. [DOI: 10.1093/bfgp/elz036] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Received: 06/12/2019] [Revised: 09/06/2019] [Accepted: 09/15/2019] [Indexed: 01/16/2023]
Abstract
The advent of high-throughput genomic technologies has resulted in the accumulation of massive amounts of genomic information. However, biologists are challenged with how to effectively analyze these data. Machine learning can provide tools for better and more efficient data analysis. Unfortunately, because many plant biologists are unfamiliar with machine learning, its application in plant molecular studies has been restricted to a few species and a limited set of algorithms. Thus, in this study, we provide the basic steps for developing machine learning frameworks and present a comprehensive overview of machine learning algorithms and various evaluation metrics. Furthermore, we introduce sources of important curated plant genomic data and R packages to enable plant biologists to easily and quickly apply appropriate machine learning algorithms in their research. Finally, we discuss current applications of machine learning algorithms for identifying various genes related to resistance to biotic and abiotic stress. Broad application of machine learning and the accumulation of plant sequencing data will advance plant molecular studies.
Affiliation(s)
- Shanwen Sun
- University of Bayreuth in Germany. He is now a postdoctoral fellow at the Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China
| | - Chunyu Wang
- Harbin Institute of Technology in China. He is an associate professor in the School of Computer Science and Technology, Harbin Institute of Technology
| | - Hui Ding
- Inner Mongolia University in China. She is an associate professor in the Center for Informational Biology, University of Electronic Science and Technology of China
| | - Quan Zou
- Harbin Institute of Technology in China. He is a professor in the Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China
|
35
|
Zhong W, Zhong B, Zhang H, Chen Z, Chen Y. Identification of Anti-cancer Peptides Based on Multi-classifier System. Comb Chem High Throughput Screen 2019; 22:694-704. [PMID: 31793417 DOI: 10.2174/1386207322666191203141102] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 06/20/2019] [Revised: 07/18/2019] [Accepted: 07/30/2019] [Indexed: 01/01/2023]
Abstract
AIMS AND OBJECTIVE Cancer is one of the deadliest diseases, taking the lives of millions every year. Traditional methods of treating cancer are expensive and toxic to normal cells. Fortunately, anti-cancer peptides (ACPs) can eliminate this side effect. However, identifying and developing new anti-cancer peptides through experiments takes a lot of time and money. It is therefore necessary to develop a fast and accurate computational model to identify anti-cancer peptides, and machine learning algorithms are a good choice. MATERIALS AND METHODS In our study, a multi-classifier system combining multiple machine learning models was used to predict anti-cancer peptides. The individual learners are built from different feature information and algorithms, and form a multi-classifier system by voting. RESULTS AND CONCLUSION The experiments show that the overall prediction rate of each individual learner is above 80% and that the overall accuracy of the multi-classifier system for anti-cancer peptide prediction can reach 95.93%, which is better than the existing prediction model.
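The voting step mentioned in this abstract can be sketched as plain majority (hard) voting over the labels produced by the individual learners. The abstract does not say which voting rule the authors used, so the rule, the tie-breaking choice, and the function name `majority_vote` are all assumptions made for illustration.

```python
from collections import Counter


def majority_vote(predictions: list[list[int]]) -> list[int]:
    """Combine per-learner label lists into one label per sample.

    predictions[i][j] is the label that learner i assigns to sample j.
    Ties are broken in favour of the smallest label (an arbitrary,
    illustrative choice).
    """
    if not predictions:
        return []
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("every learner must label the same number of samples")

    voted = []
    for j in range(n_samples):
        counts = Counter(p[j] for p in predictions)
        top = max(counts.values())
        voted.append(min(label for label, c in counts.items() if c == top))
    return voted
```

With three learners, a sample is labelled 1 whenever at least two of them predict 1.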
Affiliation(s)
- Wanben Zhong
- School of Computer Science and Technology, Huaqiao University, Xiamen, Fujian, 361021, China
| | - Bineng Zhong
- School of Computer Science and Technology, Huaqiao University, Xiamen, Fujian, 361021, China; Key Laboratory of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, Nanjing University of Science and Technology, Nanjing, 210094, China
| | - Hongbo Zhang
- School of Computer Science and Technology, Huaqiao University, Xiamen, Fujian, 361021, China
| | - Ziyi Chen
- School of Computer Science and Technology, Huaqiao University, Xiamen, Fujian, 361021, China
| | - Yan Chen
- School of Computer Science and Technology, Huaqiao University, Xiamen, Fujian, 361021, China
|
36
|
Wang J, Cui B, Zhao Y, Guo M. A New Algorithm for Identifying Genome Rearrangements in the Mammalian Evolution. Front Genet 2019; 10:1020. [PMID: 31737036 PMCID: PMC6828935 DOI: 10.3389/fgene.2019.01020] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/02/2019] [Accepted: 09/24/2019] [Indexed: 11/13/2022]
Abstract
Genome rearrangements are evolutionary events at the genome level, and analyzing them gives a global view of the evolution of species. We introduce a new method called RGRPT (recovering the genome rearrangements based on phylogenetic tree) to identify genome rearrangements. We test RGRPT using simulated data. The experimental results show that RGRPT has high sensitivity and specificity compared with other tools when predicting rearrangement events. We use RGRPT to predict the rearrangement events of six mammalian genomes (human, chimpanzee, rhesus macaque, mouse, rat, and dog). RGRPT recognized a total of 1,157 rearrangement events for them at 10 kb resolution, including 858 reversals, 16 translocations, 249 transpositions, and 34 fusions/fissions. RGRPT recognized 475 rearrangement events for them at 50 kb resolution, including 332 reversals, 13 translocations, 94 transpositions, and 36 fusions/fissions. The source code of RGRPT is available from https://github.com/wangjuanimu/data-of-genome-rearrangement.
Affiliation(s)
- Juan Wang
- School of Computer Science, Inner Mongolia University, Hohhot, China
| | - Bo Cui
- School of Computer Science, Inner Mongolia University, Hohhot, China
| | - Yulan Zhao
- School of Computer Science, Inner Mongolia University, Hohhot, China
| | - Maozu Guo
- School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, China; Beijing University of Civil Engineering and Architecture, Beijing Key Laboratory of Intelligent Processing for Building Big Data, Beijing, China
|
37
|
Xuan P, Ye Y, Zhang T, Zhao L, Sun C. Convolutional Neural Network and Bidirectional Long Short-Term Memory-Based Method for Predicting Drug-Disease Associations. Cells 2019; 8:E705. [PMID: 31336774 PMCID: PMC6679344 DOI: 10.3390/cells8070705] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Received: 06/09/2019] [Revised: 07/08/2019] [Accepted: 07/09/2019] [Indexed: 12/16/2022]
Abstract
Identifying novel indications for approved drugs can accelerate drug development and reduce research costs. Most previous studies used shallow models for prioritizing the potential drug-related diseases and failed to deeply integrate the paths between drugs and diseases, which may contain additional association information. A deep-learning-based method for predicting drug-disease associations by integrating useful information is needed. We proposed a novel method based on a convolutional neural network (CNN) and bidirectional long short-term memory (BiLSTM), called CBPred, for predicting drug-related diseases. Our method deeply integrates similarities and associations between drugs and diseases, and paths among drug-disease pairs. The CNN-based framework focuses on learning the original representation of a drug-disease pair from their similarities and associations. As the drug-disease association possibility also depends on the multiple paths between them, the BiLSTM-based framework mainly learns the path representation of the drug-disease pair. In addition, considering that different paths contribute differently to the association prediction, an attention mechanism at the path level is constructed. Our method, CBPred, showed better performance and retrieved more real associations at the top of the ranked results, which is more important for biologists. Case studies further confirmed that CBPred can discover potential drug-disease associations.
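The idea of a path-level attention mechanism, as described in this abstract, can be illustrated with a small sketch: each path between a drug and a disease receives a score, the scores are normalised with a softmax, and the path vectors are combined as a weighted sum. In the paper the scores would be learned by the network; here they are passed in directly, and the function name `attention_pool` and the list-based vectors are assumptions made only for illustration.

```python
import math


def attention_pool(
    path_vectors: list[list[float]], scores: list[float]
) -> tuple[list[float], list[float]]:
    """Softmax-normalise path scores and return (weights, pooled vector).

    path_vectors[i] is the representation of path i; scores[i] is its
    unnormalised importance (assumed given here, learned in practice).
    """
    if not path_vectors or len(path_vectors) != len(scores):
        raise ValueError("need one score per path and at least one path")

    # Subtract the maximum score for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the path vectors, one dimension at a time.
    dim = len(path_vectors[0])
    pooled = [
        sum(w * vec[d] for w, vec in zip(weights, path_vectors))
        for d in range(dim)
    ]
    return weights, pooled
```

Paths with higher scores contribute more to the pooled representation, and equal scores reduce the result to a plain average.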
Affiliation(s)
- Ping Xuan
- School of Computer Science and Technology, Heilongjiang University, Harbin 150080, China
| | - Yilin Ye
- School of Computer Science and Technology, Heilongjiang University, Harbin 150080, China
| | - Tiangang Zhang
- School of Mathematical Science, Heilongjiang University, Harbin 150080, China
| | - Lianfeng Zhao
- School of Computer Science and Technology, Heilongjiang University, Harbin 150080, China
| | - Chang Sun
- School of Computer Science and Technology, Heilongjiang University, Harbin 150080, China
|
38
|
Xuan P, Cao Y, Zhang T, Kong R, Zhang Z. Dual Convolutional Neural Networks With Attention Mechanisms Based Method for Predicting Disease-Related lncRNA Genes. Front Genet 2019; 10:416. [PMID: 31130990 PMCID: PMC6509943 DOI: 10.3389/fgene.2019.00416] [Citation(s) in RCA: 45] [Impact Index Per Article: 9.0] [Received: 02/15/2019] [Accepted: 04/16/2019] [Indexed: 12/30/2022]
Abstract
Many studies have indicated that aberrant expression of long non-coding RNA genes (lncRNAs) is closely related to human diseases. Identifying disease-related lncRNAs (disease lncRNAs) is critical for understanding the pathogenesis and etiology of diseases. Most previous methods prioritize potential disease lncRNAs with shallow learning methods, which fail to extract the deep and complex feature representations of lncRNA-disease associations. Furthermore, nearly all of these methods ignore the different contributions that the similarity, association, and interaction relationships among lncRNAs, diseases, and miRNAs make to the association prediction. A method based on dual convolutional neural networks with attention mechanisms, referred to as CNNLDA, is presented for predicting candidate disease lncRNAs. CNNLDA deeply integrates multiple data sources, including the lncRNA similarities, the disease similarities, the lncRNA-disease associations, the lncRNA-miRNA interactions, and the miRNA-disease associations. Diverse biological premises about lncRNAs, miRNAs, and diseases are combined to construct the feature matrix from a biological perspective. A novel framework based on the dual convolutional neural networks is developed to learn the global and attention representations of the lncRNA-disease associations. The left part of the framework exploits the information contained in the feature matrix to learn the global representation of lncRNA-disease associations. The different connection relationships among the lncRNA, miRNA, and disease nodes, and the different features of these nodes, contribute differently to the association prediction. Hence, we present attention mechanisms at the relationship level and the feature level, respectively, and the right part of the framework learns the attention representation of associations. Cross-validation results indicate that CNNLDA outperforms several state-of-the-art methods. Case studies on stomach cancer, lung cancer, and colon cancer further demonstrate CNNLDA's ability to discover potential disease lncRNAs.
Affiliation(s)
- Ping Xuan
- School of Computer Science and Technology, Heilongjiang University, Harbin, China
| | - Yangkun Cao
- School of Computer Science and Technology, Heilongjiang University, Harbin, China
| | - Tiangang Zhang
- School of Mathematical Science, Heilongjiang University, Harbin, China
| | - Rui Kong
- Department of Pancreatic and Biliary Surgery, The First Affiliated Hospital of Harbin Medical University, Harbin, China
| | - Zhaogong Zhang
- School of Computer Science and Technology, Heilongjiang University, Harbin, China
| |
|
39
|
Han K, Wang M, Zhang L, Wang Y, Guo M, Zhao M, Zhao Q, Zhang Y, Zeng N, Wang C. Predicting Ion Channels Genes and Their Types With Machine Learning Techniques. Front Genet 2019; 10:399. [PMID: 31130983 PMCID: PMC6510169 DOI: 10.3389/fgene.2019.00399] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 02/14/2019] [Accepted: 04/12/2019] [Indexed: 02/01/2023] Open
Abstract
Motivation: The number of ion channels is increasing rapidly. As many of them are associated with diseases, they are the targets of more than 700 drugs. The discovery of new ion channels is facilitated by computational methods that predict ion channels and their types from protein sequences. Methods: We used the SVMProt and the k-skip-n-gram methods to extract the feature vectors of ion channels, and obtained 188- and 400-dimensional features, respectively. The 188- and 400-dimensional features were combined to obtain 588-dimensional features. We then employed the maximum-relevance-maximum-distance method to reduce the dimensions of the 588-dimensional features. Finally, the support vector machine and random forest methods were used to build the prediction models to evaluate the classification effect. Results: Different methods were employed to extract various feature vectors, and after effective dimensionality reduction, different classifiers were used to classify the ion channels. We extracted the ion channel data from the Universal Protein Resource (UniProt, http://www.uniprot.org/) and Ligand-Gated Ion Channel databases (http://www.ebi.ac.uk/compneur-srv/LGICdb/LGICdb.php), and then verified the performance of the classifiers after screening. The findings of this study could inform the research and development of drugs.
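The 400-dimensional k-skip-n-gram feature vector mentioned in this abstract is consistent with residue pairs (n = 2) over the 20 standard amino acids (20 × 20 = 400). A minimal sketch of one common formulation follows; the function name `k_skip_bigram_features`, the default `k`, and the normalization by total pair count are assumptions for illustration, not details taken from the paper.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

# Map every ordered residue pair to a fixed position in the 400-dim vector.
PAIR_INDEX = {a + b: i for i, (a, b) in enumerate(product(AMINO_ACIDS, repeat=2))}


def k_skip_bigram_features(sequence, k=1):
    """Return normalized counts of residue pairs separated by 0..k skipped residues.

    Hypothetical helper: pairs containing non-standard residues are ignored.
    """
    counts = [0.0] * len(PAIR_INDEX)
    total = 0
    for skip in range(k + 1):
        gap = skip + 1  # distance between the two residues of the pair
        for i in range(len(sequence) - gap):
            pair = sequence[i] + sequence[i + gap]
            if pair in PAIR_INDEX:
                counts[PAIR_INDEX[pair]] += 1
                total += 1
    return [c / total for c in counts] if total else counts


features = k_skip_bigram_features("ACAC", k=1)
```

In a pipeline like the one described, such a vector would be concatenated with the other descriptor set before dimensionality reduction and classification.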
Affiliation(s)
- Ke Han
- School of Computer and Information Engineering, Harbin University of Commerce, Harbin, China
- Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, China
| | - Miao Wang
- Life Sciences and Environmental Sciences Development Center, Harbin University of Commerce, Harbin, China
| | - Lei Zhang
- Life Sciences and Environmental Sciences Development Center, Harbin University of Commerce, Harbin, China
| | - Ying Wang
- School of Computer and Information Engineering, Harbin University of Commerce, Harbin, China
| | - Mian Guo
- Department of Neurosurgery, The Second Affiliated Hospital of Harbin Medical University, Harbin, China
| | - Ming Zhao
- School of Computer and Information Engineering, Harbin University of Commerce, Harbin, China
- Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, China
| | - Qian Zhao
- School of Computer and Information Engineering, Harbin University of Commerce, Harbin, China
- Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, China
| | - Yu Zhang
- School of Computer and Information Engineering, Harbin University of Commerce, Harbin, China
- Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, China
| | - Nianyin Zeng
- Department of Instrumental and Electrical Engineering, Xiamen University, Xiamen, China
| | - Chunyu Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| |
|
40
|
Manavalan B, Basith S, Shin TH, Wei L, Lee G. Meta-4mCpred: A Sequence-Based Meta-Predictor for Accurate DNA 4mC Site Prediction Using Effective Feature Representation. MOLECULAR THERAPY. NUCLEIC ACIDS 2019; 16:733-744. [PMID: 31146255 PMCID: PMC6540332 DOI: 10.1016/j.omtn.2019.04.019] [Citation(s) in RCA: 164] [Impact Index Per Article: 32.8] [Received: 12/10/2018] [Revised: 04/16/2019] [Accepted: 04/22/2019] [Indexed: 11/19/2022]
Abstract
DNA N4-methylcytosine (4mC) is an important genetic modification and plays crucial roles in differentiation between self and non-self DNA and in controlling DNA replication, cell cycle, and gene-expression levels. Accurate 4mC site identification is fundamental to improving the understanding of 4mC biological functions and mechanisms. Hence, it is necessary to develop in silico approaches for efficient and high-throughput 4mC site identification. Although some bioinformatic tools have been developed in this regard, their prediction accuracy and generalizability require improvement to optimize their usability in practical applications. For this purpose, we propose Meta-4mCpred, a meta-predictor for 4mC site prediction. In Meta-4mCpred, we employed a feature representation learning scheme and generated 56 probabilistic features based on four different machine-learning algorithms and seven feature encodings covering diverse sequence information, including compositional, physicochemical, and position-specific information. Subsequently, the probabilistic features were used as input to a support vector machine to develop the final meta-predictor. To the best of our knowledge, this is the first meta-predictor for 4mC site prediction. Cross-validation results show that Meta-4mCpred achieved an overall average accuracy of 84.2% across six different species, which is ∼2%–4% higher than that attainable using the state-of-the-art predictors. Furthermore, Meta-4mCpred achieved an overall average accuracy of 86% on independent dataset evaluation, which is over 4% higher than that yielded by the state-of-the-art predictors. The user-friendly webserver implementing Meta-4mCpred is freely accessible at http://thegleelab.org/Meta-4mCpred.
Affiliation(s)
| | - Shaherin Basith
- Department of Physiology, Ajou University School of Medicine, Suwon, Republic of Korea
| | - Tae Hwan Shin
- Department of Physiology, Ajou University School of Medicine, Suwon, Republic of Korea; Institute of Molecular Science and Technology, Ajou University, Suwon, Republic of Korea
| | - Leyi Wei
- School of Computer Science and Technology, Tianjin University, China.
| | - Gwang Lee
- Department of Physiology, Ajou University School of Medicine, Suwon, Republic of Korea; Institute of Molecular Science and Technology, Ajou University, Suwon, Republic of Korea.
| |
|
41
|
Xu SJ, Heller EA. Recent advances in neuroepigenetic editing. Curr Opin Neurobiol 2019; 59:26-33. [PMID: 31015104 DOI: 10.1016/j.conb.2019.03.010] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 12/12/2018] [Revised: 02/28/2019] [Accepted: 03/18/2019] [Indexed: 02/09/2023]
Abstract
A wealth of studies in the mammalian nervous system indicate the role of epigenetic gene regulation in both basic neurobiological function and disease. However, the relationship between epigenetic regulation and neuropathology is largely correlational due to the presence of mixed cell populations within brain regions and the genome-wide effects of classical approaches to manipulate the epigenome. Locus-specific epigenetic editing allows direct epigenetic regulation of specific genes to elucidate the direct causal relationship between epigenetic modifications and transcription. This review discusses some of the latest innovations in the efficacy and flexibility in this approach that hold promise for neurobiological application.
Affiliation(s)
- Song-Jun Xu
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, The University of Pennsylvania, Philadelphia, PA, USA
| | - Elizabeth A Heller
- Department of Systems Pharmacology and Translational Therapeutics and Penn Epigenetics Institute, Perelman School of Medicine, The University of Pennsylvania, Philadelphia, PA, USA.
| |
|
42
|
Su R, Liu X, Wei L. MinE-RFE: determine the optimal subset from RFE by minimizing the subset-accuracy–defined energy. Brief Bioinform 2019; 21:687-698. [DOI: 10.1093/bib/bbz021] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Received: 11/24/2018] [Revised: 01/24/2019] [Accepted: 02/02/2019] [Indexed: 01/18/2023] Open
Abstract
Recursive feature elimination (RFE), one of the most popular feature selection algorithms, has been extensively applied in bioinformatics. During training, a group of candidate subsets is generated by iteratively eliminating the least important features from the original features. However, how to determine the optimal subset among them remains ambiguous. In most current studies, either overall accuracy or subset size (SS) is used to select the most predictive features. Whether to use one or both, and how they affect prediction performance, are still open questions. In this study, we propose MinE-RFE, a novel RFE-based feature selection approach that considers the effect of both factors. The subset decision problem is mapped into subset-accuracy space and becomes an energy-minimization problem. We also provide a mathematical description of the relationship between overall accuracy and SS using Gaussian Mixture Models together with spline fitting. In addition, we comprehensively review a variety of state-of-the-art applications of RFE in bioinformatics. We compare their approaches to deciding the final subset from all the candidate subsets with MinE-RFE on diverse bioinformatics data sets, and we also compare MinE-RFE with some widely used feature selection algorithms. The comparative results demonstrate that the proposed approach exhibits the best performance among all the approaches. To facilitate the use of MinE-RFE, we further established a user-friendly web server implementing the proposed approach, which is accessible at http://qgking.wicp.net/MinE/. We expect this web server to be a useful tool for the research community.
Affiliation(s)
- Ran Su
- School of Computer Software, College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Xinyi Liu
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Leyi Wei
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
| |
|
43
|
Zou Q, Xing P, Wei L, Liu B. Gene2vec: gene subsequence embedding for prediction of mammalian N6-methyladenosine sites from mRNA. RNA (NEW YORK, N.Y.) 2019; 25:205-218. [PMID: 30425123 PMCID: PMC6348985 DOI: 10.1261/rna.069112.118] [Citation(s) in RCA: 311] [Impact Index Per Article: 62.2] [Received: 10/03/2018] [Accepted: 11/01/2018] [Indexed: 05/20/2023]
Abstract
N6-Methyladenosine (m6A) refers to methylation modification of the adenosine nucleotide acid at the nitrogen-6 position. Many conventional computational methods for identifying N6-methyladenosine sites are limited by the small amount of data available. Taking advantage of the thousands of m6A sites detected by high-throughput sequencing, it is now possible to discover the characteristics of m6A sequences using deep learning techniques. To the best of our knowledge, our work is the first attempt to use word embedding and deep neural networks for m6A prediction from mRNA sequences. Using four deep neural networks, we developed a model inferred from a larger sequence shifting window that can predict m6A accurately and robustly. Four prediction schemes were built with various RNA sequence representations and optimized convolutional neural networks. The soft voting results from the four deep networks were shown to outperform all of the state-of-the-art methods. We evaluated these predictors mentioned above on a rigorous independent test data set and proved that our proposed method outperforms the state-of-the-art predictors. The training, independent, and cross-species testing data sets are much larger than in previous studies, which could help to avoid the problem of overfitting. Furthermore, an online prediction web server implementing the four proposed predictors has been built and is available at http://server.malab.cn/Gene2vec/.
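Soft voting, as mentioned in this abstract, is generally understood as averaging the class probabilities produced by several models and predicting the class with the highest average. The sketch below illustrates that general idea only; the function name `soft_vote` and the example probabilities are hypothetical and do not come from the paper.

```python
def soft_vote(probabilities_per_model):
    """Average per-class probabilities across models and pick the top class.

    probabilities_per_model: one list of class probabilities per model,
    all with the same class order. Returns (averaged, predicted_index).
    """
    n_models = len(probabilities_per_model)
    n_classes = len(probabilities_per_model[0])
    averaged = [
        sum(model[c] for model in probabilities_per_model) / n_models
        for c in range(n_classes)
    ]
    predicted = max(range(n_classes), key=averaged.__getitem__)
    return averaged, predicted


# Illustrative outputs of four networks for classes [non-m6A, m6A].
averaged, predicted = soft_vote([[0.4, 0.6], [0.7, 0.3], [0.2, 0.8], [0.45, 0.55]])
```

Here one model favors the negative class, but the averaged probabilities still select the positive class, which is the behavior that distinguishes soft voting from relying on any single network.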
Affiliation(s)
- Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, 610051 Chengdu, China
- School of Computer Science and Technology, Tianjin University, 300350 Tianjin, China
| | - Pengwei Xing
- School of Computer Science and Technology, Tianjin University, 300350 Tianjin, China
| | - Leyi Wei
- School of Computer Science and Technology, Tianjin University, 300350 Tianjin, China
| | - Bin Liu
- School of Computer Science and Technology, Harbin Institute of Technology, 150001 Shenzhen, China
| |
|
44
|
Li Y, Niu M, Zou Q. ELM-MHC: An Improved MHC Identification Method with Extreme Learning Machine Algorithm. J Proteome Res 2019; 18:1392-1401. [DOI: 10.1021/acs.jproteome.9b00012] [Citation(s) in RCA: 42] [Impact Index Per Article: 8.4] [Indexed: 01/28/2023]
Affiliation(s)
- Yanjuan Li
- School of Information and Computer Engineering, Northeast Forestry University, Harbin 150040, China
| | - Mengting Niu
- School of Information and Computer Engineering, Northeast Forestry University, Harbin 150040, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| |
|
45
|
Pang L, Wang J, Zhao L, Wang C, Zhan H. A Novel Protein Subcellular Localization Method With CNN-XGBoost Model for Alzheimer's Disease. Front Genet 2019; 9:751. [PMID: 30713552 PMCID: PMC6345701 DOI: 10.3389/fgene.2018.00751] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Received: 09/15/2018] [Accepted: 12/31/2018] [Indexed: 12/26/2022] Open
Abstract
The disordered distribution of proteins among cellular compartments or organelles leads to many human diseases, including neurodegenerative diseases such as Alzheimer's disease. The prediction of protein subcellular localization plays an important role in understanding the mechanisms of protein function, pathogenesis, and disease therapy. This paper proposes a novel subcellular localization method that integrates a Convolutional Neural Network (CNN) and eXtreme Gradient Boosting (XGBoost), where the CNN acts as a feature extractor that automatically obtains features from the original sequence information and an XGBoost classifier acts as a recognizer that identifies the protein subcellular localization based on the output of the CNN. Experiments are implemented on three protein datasets. The results show that the CNN-XGBoost method performs better than general protein subcellular localization methods.
Affiliation(s)
- Long Pang
- Harbin Nebula Bioinformatics Technology Development Co., Ltd., Harbin, China
| | - Junjie Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| | - Lingling Zhao
- School of Electronic Engineering, Heilongjiang University, Harbin, China
| | - Chunyu Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| | - Hui Zhan
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| |
|
46
|
Pai AA, Luca F. Environmental influences on RNA processing: Biochemical, molecular and genetic regulators of cellular response. WILEY INTERDISCIPLINARY REVIEWS. RNA 2019; 10:e1503. [PMID: 30216698 PMCID: PMC6294667 DOI: 10.1002/wrna.1503] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 05/30/2018] [Revised: 07/19/2018] [Accepted: 08/01/2018] [Indexed: 12/16/2022]
Abstract
RNA processing has emerged as a key mechanistic step in the regulation of the cellular response to environmental perturbation. Recent work has uncovered extensive remodeling of transcriptome composition upon environmental perturbation and linked the impacts of this molecular plasticity to health and disease outcomes. These isoform changes and their underlying mechanisms are varied, involving alternative sites of transcription initiation, alternative splicing, and alternative cleavage at the 3' end of the mRNA. The mechanisms and consequences of differential RNA processing have been characterized across a range of common environmental insults, including chemical stimuli, immune stimuli, heat stress, and cancer pathogenesis. In each case, there are perturbation-specific contributions of local (cis) regulatory elements or global (trans) factors and downstream consequences. Overall, it is clear that choices in isoform usage involve a balance between the usage of specific genetic elements (i.e., splice sites, polyadenylation sites) and the timing at which certain decisions are made (i.e., transcription elongation rate). Fine-tuned cellular responses to environmental perturbation are often dependent on the genetic makeup of the cell. Genetic analyses of interindividual variation in splicing have identified genetic effects on splicing that contribute to variation in complex traits. Finally, the increase in the number of tissue types and environmental conditions analyzed for RNA processing is paralleled by the need to develop appropriate analytical tools. The combination of large datasets, novel methods and conditions explored promises to provide a much greater understanding of the role of RNA processing response in human phenotypic variation. This article is categorized under: RNA Processing > RNA Editing and Modification; RNA Evolution and Genomics > Computational Analyses of RNA; RNA Processing > Splicing Mechanisms; RNA Processing > Splicing Regulation/Alternative Splicing.
Affiliation(s)
- Athma A Pai
- RNA Therapeutics Institute, University of Massachusetts Medical School, Worcester, Massachusetts
| | - Francesca Luca
- Center for Molecular Medicine and Genetics, and Department of Obstetrics and Gynecology, Wayne State University, Detroit, Michigan
| |
|
47
|
Gene-Based Nonparametric Testing of Interactions Using Distance Correlation Coefficient in Case-Control Association Studies. Genes (Basel) 2018; 9:genes9120608. [PMID: 30563156 PMCID: PMC6316506 DOI: 10.3390/genes9120608] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 11/06/2018] [Revised: 11/24/2018] [Accepted: 11/27/2018] [Indexed: 12/12/2022] Open
Abstract
Among the various statistical methods for identifying gene-gene interactions in qualitative genome-wide association studies (GWAS), gene-based methods have recently grown in popularity because they confer advantages in both statistical power and biological interpretability. However, most of these methods make strong assumptions about the form of the relationship between traits and single-nucleotide polymorphisms, which result in limited statistical power. In this paper, we propose a gene-based method based on the distance correlation coefficient called gene-based gene-gene interaction via distance correlation coefficient (GBDcor). The distance correlation (dCor) is a measurement of the dependency between two random vectors with arbitrary, and not necessarily equal, dimensions. We used the difference in dCor in case and control datasets as an indicator of gene-gene interaction, which was based on the assumption that the joint distribution of two genes in case subjects and in control subjects should not be significantly different if the two genes do not interact. We designed a permutation-based statistical test to evaluate the difference between dCor in cases and controls for a pair of genes, and we provided the p-value for the statistic to represent the significance of the interaction between the two genes. In experiments with both simulated and real-world data, our method outperformed previous approaches in detecting interactions accurately.
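The idea in this abstract can be sketched with the standard sample distance correlation and a label-shuffling permutation test. This is a minimal illustration under stated assumptions: the absolute difference as the statistic, the shuffling of case/control labels as the null, and all function names (`distance_correlation`, `dcor_difference`, `permutation_p_value`) are choices made here, not details confirmed by the paper.

```python
import math
import random


def _double_centered(rows):
    """Pairwise Euclidean distance matrix of the rows, double-centered."""
    n = len(rows)
    dist = [[math.dist(rows[i], rows[j]) for j in range(n)] for i in range(n)]
    means = [sum(r) / n for r in dist]  # row means equal column means (symmetry)
    grand = sum(means) / n
    return [[dist[i][j] - means[i] - means[j] + grand for j in range(n)]
            for i in range(n)]


def distance_correlation(x, y):
    """Sample dCor between two equally long lists of numeric vectors."""
    a, b = _double_centered(x), _double_centered(y)
    n2 = len(x) ** 2
    dcov2 = sum(u * v for ra, rb in zip(a, b) for u, v in zip(ra, rb)) / n2
    dvar2_x = sum(u * u for ra in a for u in ra) / n2
    dvar2_y = sum(v * v for rb in b for v in rb) / n2
    denom = math.sqrt(dvar2_x * dvar2_y)
    return math.sqrt(max(dcov2, 0.0) / denom) if denom > 0 else 0.0


def dcor_difference(gene1, gene2, is_case):
    """|dCor in cases - dCor in controls| for one pair of genes (assumed statistic)."""
    def dcor(keep):
        idx = [i for i, c in enumerate(is_case) if bool(c) == keep]
        return distance_correlation([gene1[i] for i in idx],
                                    [gene2[i] for i in idx])
    return abs(dcor(True) - dcor(False))


def permutation_p_value(gene1, gene2, is_case, n_perm=200, seed=0):
    """Shuffle case/control labels to build a null distribution (assumed design)."""
    rng = random.Random(seed)
    observed = dcor_difference(gene1, gene2, is_case)
    labels = list(is_case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if dcor_difference(gene1, gene2, labels) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)
```

Each gene is represented as one genotype vector per subject (for example, 0/1/2 codes for its SNPs), and the two genes may contain different numbers of SNPs, which matches the abstract's point that dCor handles vectors of unequal dimension.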
|
48
|
Xuan P, Dong Y, Guo Y, Zhang T, Liu Y. Dual Convolutional Neural Network Based Method for Predicting Disease-Related miRNAs. Int J Mol Sci 2018; 19:ijms19123732. [PMID: 30477152 PMCID: PMC6321160 DOI: 10.3390/ijms19123732] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Received: 10/29/2018] [Revised: 11/15/2018] [Accepted: 11/19/2018] [Indexed: 02/07/2023] Open
Abstract
Identification of disease-related microRNAs (disease miRNAs) is helpful for understanding and exploring the etiology and pathogenesis of diseases. Most recent methods predict disease miRNAs by integrating the similarities and associations of miRNAs and diseases. However, these methods fail to learn the deep features of the miRNA similarities, the disease similarities, and the miRNA–disease associations. We propose a dual convolutional neural network-based method for predicting candidate disease miRNAs and refer to it as CNNDMP. CNNDMP not only exploits the similarities and associations of miRNAs and diseases, but also captures the topological structures of the miRNA and disease networks. An embedding layer is constructed by combining the biological premises about the miRNA–disease associations. A new framework based on the dual convolutional neural network is presented for extracting the deep feature representation of associations. The left part of the framework focuses on integrating the original similarities and associations of miRNAs and diseases. Novel miRNA and disease similarities that contain the topological structures are obtained by random walks on the miRNA and disease networks, and their deep features are learned by the right part of the framework. CNNDMP achieves better prediction performance than several state-of-the-art methods in cross-validation. Case studies on breast cancer, colorectal cancer and lung cancer further demonstrate CNNDMP's ability to discover potential disease miRNAs.
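The abstract states only that topology-aware similarities are obtained by random walks on the miRNA and disease networks. One widely used variant is the random walk with restart, sketched below; the restart mechanism, the restart probability of 0.3, the column normalization, and the name `random_walk_with_restart` are assumptions for illustration rather than the paper's stated procedure.

```python
def random_walk_with_restart(adjacency, start, restart=0.3, tol=1e-10, max_iter=1000):
    """Steady-state visiting probabilities of a walker that restarts at `start`.

    adjacency: square list-of-lists of non-negative edge weights (e.g. similarities).
    """
    n = len(adjacency)
    col_sums = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    p0 = [1.0 if i == start else 0.0 for i in range(n)]
    p = p0[:]
    for _ in range(max_iter):
        nxt = []
        for i in range(n):
            # Probability mass flowing into node i from its neighbours.
            inflow = sum(adjacency[i][j] / col_sums[j] * p[j]
                         for j in range(n) if col_sums[j] > 0)
            nxt.append((1 - restart) * inflow + restart * p0[i])
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


# Toy 3-node chain network 0 - 1 - 2, walking from node 0.
chain = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
scores = random_walk_with_restart(chain, start=0)
```

Running the walk from every node in turn gives one score vector per node, and those vectors can serve as a similarity profile that reflects network topology rather than only direct edges.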
Affiliation(s)
- Ping Xuan
- School of Computer Science and Technology, Heilongjiang University, Harbin 150080, China.
| | - Yihua Dong
- School of Computer Science and Technology, Heilongjiang University, Harbin 150080, China.
| | - Yahong Guo
- School of Information Science and Technology, Heilongjiang University, Harbin 150080, China.
| | - Tiangang Zhang
- School of Mathematical Science, Heilongjiang University, Harbin 150080, China.
| | - Yong Liu
- School of Computer Science and Technology, Heilongjiang University, Harbin 150080, China.
| |
|
49
|
Wei L, Hu J, Li F, Song J, Su R, Zou Q. Comparative analysis and prediction of quorum-sensing peptides using feature representation learning and machine learning algorithms. Brief Bioinform 2018; 21:106-119. [PMID: 30383239 DOI: 10.1093/bib/bby107] [Citation(s) in RCA: 53] [Impact Index Per Article: 8.8] [Received: 08/21/2018] [Revised: 09/18/2018] [Accepted: 10/05/2018] [Indexed: 12/11/2022] Open
Abstract
Quorum-sensing peptides (QSPs) are signal molecules that are closely associated with diverse cellular processes, such as cell-cell communication and gene expression regulation in Gram-positive bacteria. Identifying QSPs is therefore of great importance for understanding their functional mechanisms in physiological processes. Machine learning algorithms have been developed for this purpose and show great potential for the reliable prediction of QSPs. In this study, several sequence-based feature descriptors for peptide representation and machine learning algorithms are comprehensively reviewed, evaluated and compared. To use existing feature descriptors effectively, we applied a feature representation learning strategy that automatically learns the most discriminative features from existing feature descriptors in a supervised way. Our results demonstrate that this strategy effectively captures the sequence determinants that characterize QSPs, thereby contributing to improved predictive performance. Furthermore, building on this feature representation learning strategy, we developed a predictor named QSPred-FL for the detection of QSPs in large-scale proteomic data. Benchmarking results with 10-fold cross validation showed that QSPred-FL achieves better performance than the state-of-the-art predictors. In addition, we have established a user-friendly webserver that implements QSPred-FL, which is currently available at http://server.malab.cn/QSPred-FL. We expect that this tool will be useful for the high-throughput prediction of QSPs and the discovery of important functional mechanisms of QSPs.
Affiliation(s)
- Leyi Wei
- School of Computer Science and Technology, Tianjin University, Tianjin, China
| | - Jie Hu
- School of Computer Science and Technology, Tianjin University, Tianjin, China
| | - Fuyi Li
- Infection and Immunity Program, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Clayton, VIC, Australia.,Monash Centre for Data Science, Faculty of Information Technology, Monash University, Clayton, VIC, Australia
| | - Jiangning Song
- Infection and Immunity Program, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Clayton, VIC, Australia.,Monash Centre for Data Science, Faculty of Information Technology, Monash University, Clayton, VIC, Australia
| | - Ran Su
- School of Computer Software, Tianjin University, Tianjin, China
| | - Quan Zou
- School of Computer Science and Technology, Tianjin University, Tianjin, China.,Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| |
|
50
|
Chen W, Feng P, Ding H, Lin H. Classifying Included and Excluded Exons in Exon Skipping Event Using Histone Modifications. Front Genet 2018; 9:433. [PMID: 30327665 PMCID: PMC6174203 DOI: 10.3389/fgene.2018.00433] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Received: 07/30/2018] [Accepted: 09/12/2018] [Indexed: 12/15/2022] Open
Abstract
Alternative splicing (AS) not only ensures the diversity of gene expression products, but is also closely correlated with genetic diseases. Therefore, knowledge about the regulatory mechanisms of AS will provide useful clues for understanding its biological functions. In the current study, a random forest based method was developed to classify included and excluded exons in exon skipping events. In this method, the samples in the dataset were encoded using histone modification features that were optimized with the Maximum Relevance Maximum Distance (MRMD) feature selection technique. The proposed method obtained an accuracy of 72.91% in a 10-fold cross validation test and outperformed existing methods. We also systematically analyzed the distribution of histone modifications between included and excluded exons and discovered their preference for each kind of exon, which might provide insights for research on the regulatory mechanisms of alternative splicing.
Affiliation(s)
- Wei Chen
- Center for Genomics and Computational Biology, School of Life Science, North China University of Science and Technology, Tangshan, China.,Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu, China
| | - Pengmian Feng
- School of Public Health, North China University of Science and Technology, Tangshan, China
| | - Hui Ding
- Key Laboratory for Neuro-Information of Ministry of Education, Center of Bioinformatics and Center for Information in Biomedicine, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
| | - Hao Lin
- Key Laboratory for Neuro-Information of Ministry of Education, Center of Bioinformatics and Center for Information in Biomedicine, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
| |
|