1
Tenekeci S, Tekir S. Identifying promoter and enhancer sequences by graph convolutional networks. Comput Biol Chem 2024; 110:108040. [PMID: 38430611 DOI: 10.1016/j.compbiolchem.2024.108040] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/13/2023] [Revised: 01/09/2024] [Accepted: 02/27/2024] [Indexed: 03/05/2024]
Abstract
Identification of promoters, enhancers, and their interactions helps understand genetic regulation. This study proposes a graph-based semi-supervised learning model (GCN4EPI) for the enhancer-promoter classification problem. We adopt a graph convolutional network (GCN) architecture to integrate interaction information with sequence features. Nodes of the constructed graph hold word embeddings of DNA sequences while edges hold the Enhancer-Promoter Interaction (EPI) information. By means of semi-supervised learning, much less data (16%) and time are needed in model training. Comparisons on a benchmark dataset of six human cell lines show that the proposed approach outperforms the state-of-the-art methods by a large margin (10% higher F1 score) and trains the fastest (up to 3 times faster). Moreover, GCN4EPI's performance on cross-cell line data is also better than the baselines (3% higher F1 score). Our qualitative analyses with graph explainability models prove that GCN4EPI learns from both text and graph structure. The results suggest that integrating interaction information with sequence features improves predictive performance and compensates for the number of training instances.
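The idea of combining node features (sequence embeddings) with edges (interaction information) can be illustrated with a single graph-convolution step. The sketch below is not the GCN4EPI implementation; it assumes the widely used symmetric-normalized propagation rule ReLU(D^-1/2 (A + I) D^-1/2 X W), and the function name `gcn_layer`, the toy graph, and the weights are hypothetical.

```python
import math


def gcn_layer(adj, feats, weight):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 X W).

    adj    : n x n adjacency matrix (hypothetical enhancer-promoter links)
    feats  : n x d node features (stand-ins for sequence embeddings)
    weight : d x o weight matrix (learned in a real model; fixed here)
    """
    n = len(adj)
    # Add self-loops so every node keeps part of its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    # Symmetric degree normalization of the adjacency matrix.
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    d_in, d_out = len(weight), len(weight[0])
    # Aggregate features from each node's neighborhood.
    agg = [[sum(norm[i][k] * feats[k][d] for k in range(n))
            for d in range(d_in)] for i in range(n)]
    # Linear transform followed by ReLU.
    return [[max(0.0, sum(agg[i][d] * weight[d][o] for d in range(d_in)))
             for o in range(d_out)] for i in range(n)]


# Toy graph: node 0 (enhancer) interacts with node 1 (promoter); node 2 is isolated.
adj = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
feats = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
hidden = gcn_layer(adj, feats, identity)
```

In this toy case the two linked nodes end up with the same averaged representation, while the isolated node keeps its own features, which is the sense in which edges inject interaction information into the sequence features.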
Affiliation(s)
- Samet Tenekeci
- Department of Computer Engineering, Izmir Institute of Technology, Izmir, 35430, Turkiye
- Selma Tekir
- Department of Computer Engineering, Izmir Institute of Technology, Izmir, 35430, Turkiye
2
Ramakrishnan A, Wangensteen G, Kim S, Nestler EJ, Shen L. DeepRegFinder: deep learning-based regulatory elements finder. Bioinformatics Advances 2024; 4:vbae007. [PMID: 38343388 PMCID: PMC10858349 DOI: 10.1093/bioadv/vbae007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/14/2023] [Revised: 12/06/2023] [Accepted: 01/12/2024] [Indexed: 06/15/2024]
Abstract
Summary: Enhancers and promoters are important classes of DNA regulatory elements (DREs) that govern gene expression. Identifying them at a genomic scale is a critical task in bioinformatics. The DREs often exhibit unique histone mark binding patterns, which can be captured by high-throughput ChIP-seq experiments. To account for the variation and noise among the binding sites, machine learning models are trained on known enhancer/promoter sites using histone mark ChIP-seq data and then used to predict enhancers/promoters in other genomic regions. To this end, we have developed a highly customizable program named DeepRegFinder, which automates the entire process of data processing, model training, and prediction. We have employed convolutional and recurrent neural networks for model training and prediction. DeepRegFinder further categorizes enhancers and promoters into active and poised states, a unique and valuable feature for researchers. Our method demonstrates improved precision and recall in comparison to existing algorithms for enhancer prediction across multiple cell types. Moreover, our pipeline is modular and eliminates the tedious steps involved in preprocessing, making it easier for users to apply it to their data quickly. Availability and implementation: https://github.com/shenlab-sinai/DeepRegFinder.
Affiliation(s)
- Aarthi Ramakrishnan
- Friedman Brain Institute and Nash Family Department of Neuroscience, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- George Wangensteen
- Department of Computer Science, Brown University, Providence, RI 02912, United States
- Sarah Kim
- Cancer Program, Broad Institute, Cambridge, MA 02142, United States
- Eric J Nestler
- Friedman Brain Institute and Nash Family Department of Neuroscience, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- Li Shen
- Friedman Brain Institute and Nash Family Department of Neuroscience, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
3
Wang Q, Zhang J, Liu Z, Duan Y, Li C. Integrative approaches based on genomic techniques in the functional studies on enhancers. Brief Bioinform 2023; 25:bbad442. [PMID: 38048082 PMCID: PMC10694556 DOI: 10.1093/bib/bbad442] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/28/2023] [Revised: 10/22/2023] [Accepted: 11/08/2023] [Indexed: 12/05/2023]
Abstract
With the development of sequencing technology and the dramatic drop in sequencing cost, the functions of noncoding genes are being characterized in a wide variety of fields (e.g. biomedicine). Enhancers are noncoding DNA elements with vital transcription regulation functions. Tens of thousands of enhancers have been identified in the human genome; however, the location, function, target genes and regulatory mechanisms of most enhancers have not been elucidated thus far. As high-throughput sequencing techniques have leapt forwards, omics approaches have been extensively employed in enhancer research. Multidimensional genomic data integration enables the full exploration of the data and provides novel perspectives for screening, identification and characterization of the function and regulatory mechanisms of unknown enhancers. However, multidimensional genomic data are still difficult to integrate genome wide due to complex varieties, massive amounts, high rarity, etc. To facilitate the appropriate methods for studying enhancers with high efficacy, we delineate the principles, data processing modes and progress of various omics approaches to study enhancers and summarize the applications of traditional machine learning and deep learning in multi-omics integration in the enhancer field. In addition, the challenges encountered during the integration of multiple omics data are addressed. Overall, this review provides a comprehensive foundation for enhancer analysis.
Affiliation(s)
- Qilin Wang
- School of Engineering Medicine, Beihang University, Beijing 100191, China
- School of Biological Science and Medical Engineering, Beihang University, Beijing 100191, China
- Junyou Zhang
- School of Engineering Medicine, Beihang University, Beijing 100191, China
- School of Biological Science and Medical Engineering, Beihang University, Beijing 100191, China
- Zhaoshuo Liu
- School of Engineering Medicine, Beihang University, Beijing 100191, China
- School of Biological Science and Medical Engineering, Beihang University, Beijing 100191, China
- Yingying Duan
- School of Engineering Medicine, Beihang University, Beijing 100191, China
- School of Biological Science and Medical Engineering, Beihang University, Beijing 100191, China
- Chunyan Li
- School of Engineering Medicine, Beihang University, Beijing 100191, China
- School of Biological Science and Medical Engineering, Beihang University, Beijing 100191, China
- Key Laboratory of Big Data-Based Precision Medicine (Ministry of Industry and Information Technology), Beihang University, Beijing 100191, China
- Beijing Advanced Innovation Center for Big Data-Based Precision Medicine, Beihang University, Beijing 100191, China
4
Wang J, Zhang H, Chen N, Zeng T, Ai X, Wu K. PorcineAI-Enhancer: Prediction of Pig Enhancer Sequences Using Convolutional Neural Networks. Animals (Basel) 2023; 13:2935. [PMID: 37760334 PMCID: PMC10526013 DOI: 10.3390/ani13182935] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2023] [Revised: 08/21/2023] [Accepted: 09/05/2023] [Indexed: 09/29/2023]
Abstract
Understanding the mechanisms of gene expression regulation is crucial in animal breeding. Cis-regulatory DNA sequences, such as enhancers, play a key role in regulating gene expression. Identifying enhancers is challenging, despite the use of experimental techniques and computational methods. Enhancer prediction in the pig genome is particularly significant due to the costliness of high-throughput experimental techniques. The study constructed a high-quality database of pig enhancers by integrating information from multiple sources. A deep learning prediction framework called PorcineAI-enhancer was developed for the prediction of pig enhancers. This framework employs convolutional neural networks for feature extraction and classification. PorcineAI-enhancer showed excellent performance in predicting pig enhancers, validated on an independent test dataset. The model demonstrated reliable prediction capability for unknown enhancer sequences and performed remarkably well on tissue-specific enhancer sequences. The study developed a deep learning prediction framework, PorcineAI-enhancer, for predicting pig enhancers. The model demonstrated significant predictive performance and potential for tissue-specific enhancers. This research provides valuable resources for future studies on gene expression regulation in pigs.
Affiliation(s)
- Ji Wang
- College of Animal Science and Technology, China Agricultural University, Beijing 100193, China
- Han Zhang
- College of Animal Science and Technology, China Agricultural University, Beijing 100193, China
- Nanzhu Chen
- Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing 100193, China
- Tong Zeng
- College of Animal Science and Technology, China Agricultural University, Beijing 100193, China
- Xiaohua Ai
- College of Animal Science and Technology, China Agricultural University, Beijing 100193, China
- Keliang Wu
- College of Animal Science and Technology, China Agricultural University, Beijing 100193, China
5
Zhang Z, Feng F, Qiu Y, Liu J. A generalizable framework to comprehensively predict epigenome, chromatin organization, and transcriptome. Nucleic Acids Res 2023; 51:5931-5947. [PMID: 37224527 PMCID: PMC10325920 DOI: 10.1093/nar/gkad436] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 10/20/2022] [Revised: 03/31/2023] [Accepted: 05/09/2023] [Indexed: 05/26/2023]
Abstract
Many deep learning approaches have been proposed to predict epigenetic profiles, chromatin organization, and transcription activity. While these approaches achieve satisfactory performance in predicting one modality from another, the learned representations are not generalizable across predictive tasks or across cell types. In this paper, we propose a deep learning approach named EPCOT which employs a pre-training and fine-tuning framework, and is able to accurately and comprehensively predict multiple modalities including epigenome, chromatin organization, transcriptome, and enhancer activity for new cell types, by only requiring cell-type specific chromatin accessibility profiles. Many of these predicted modalities, such as Micro-C and ChIA-PET, are quite expensive to get in practice, and the in silico prediction from EPCOT should be quite helpful. Furthermore, this pre-training and fine-tuning framework allows EPCOT to identify generic representations generalizable across different predictive tasks. Interpreting EPCOT models also provides biological insights including mapping between different genomic modalities, identifying TF sequence binding patterns, and analyzing cell-type specific TF impacts on enhancer activity.
Affiliation(s)
- Zhenhao Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, 500 S. State St, Ann Arbor, MI 48109, USA
- Fan Feng
- Department of Computational Medicine and Bioinformatics, University of Michigan, 500 S. State St, Ann Arbor, MI 48109, USA
- Yiyang Qiu
- Department of Computer Science and Engineering, University of Michigan, 500 S. State St, Ann Arbor, MI 48109, USA
- Jie Liu
- Department of Computational Medicine and Bioinformatics, University of Michigan, 500 S. State St, Ann Arbor, MI 48109, USA
- Department of Computer Science and Engineering, University of Michigan, 500 S. State St, Ann Arbor, MI 48109, USA
6
Phan LT, Oh C, He T, Manavalan B. A comprehensive revisit of the machine-learning tools developed for the identification of enhancers in the human genome. Proteomics 2023; 23:e2200409. [PMID: 37021401 DOI: 10.1002/pmic.202200409] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 02/10/2023] [Revised: 03/18/2023] [Accepted: 03/27/2023] [Indexed: 04/07/2023]
Abstract
Enhancers are non-coding DNA elements that play a crucial role in enhancing the transcription rate of a specific gene in the genome. Experiments for identifying enhancers can be restricted by their conditions and involve complicated, time-consuming, laborious, and costly steps. To overcome these challenges, computational platforms have been developed to complement experimental methods that enable high-throughput identification of enhancers. Over the last few years, the development of various enhancer computational tools has resulted in significant progress in predicting putative enhancers. Thus, researchers are now able to use a variety of strategies to enhance and advance enhancer study. In this review, an overview of machine learning (ML)-based prediction methods for enhancer identification and related databases has been provided. The existing enhancer-prediction methods have also been reviewed regarding their algorithms, feature selection processes, validation techniques, and software utility. In addition, the advantages and drawbacks of these ML approaches and guidelines for developing bioinformatic tools have been highlighted for a more efficient enhancer prediction. This review will serve as a useful resource for experimentalists in selecting the appropriate ML tool for their study, and for bioinformaticians in developing more accurate and advanced ML-based predictors.
Affiliation(s)
- Le Thi Phan
- Computational Biology and Bioinformatics Laboratory, Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Gyeonggi-do, South Korea
- Changmin Oh
- Computational Biology and Bioinformatics Laboratory, Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Gyeonggi-do, South Korea
- Tao He
- Beidahuang Industry Group General Hospital, Harbin, China
- Balachandran Manavalan
- Computational Biology and Bioinformatics Laboratory, Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Gyeonggi-do, South Korea
7
Zhang X, Misra SK, Moitra P, Zhang X, Jeong SJ, Stitham J, Rodriguez-Velez A, Park A, Yeh YS, Gillanders WE, Fan D, Diwan A, Cho J, Epelman S, Lodhi IJ, Pan D, Razani B. Use of acidic nanoparticles to rescue macrophage lysosomal dysfunction in atherosclerosis. Autophagy 2023; 19:886-903. [PMID: 35982578 PMCID: PMC9980706 DOI: 10.1080/15548627.2022.2108252] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 10/20/2021] [Revised: 07/23/2022] [Accepted: 07/25/2022] [Indexed: 12/19/2022]
Abstract
Dysfunction in the macrophage lysosomal system including reduced acidity and diminished degradative capacity is a hallmark of atherosclerosis, leading to blunted clearance of excess cellular debris and lipids in plaques and contributing to lesion progression. Devising strategies to rescue this macrophage lysosomal dysfunction is a novel therapeutic measure. Nanoparticles have emerged as an effective platform to both target specific tissues and serve as drug delivery vehicles. In most cases, administered nanoparticles are taken up non-selectively by the mononuclear phagocyte system including monocytes/macrophages leading to the undesirable degradation of cargo in lysosomes. We took advantage of this default route to target macrophage lysosomes to rectify their acidity in disease states such as atherosclerosis. Herein, we develop and test two commonly used acidic nanoparticles, poly-lactide-co-glycolic acid (PLGA) and polylactic acid (PLA), both in vitro and in vivo. Our results in cultured macrophages indicate that the PLGA-based nanoparticles are the most effective at trafficking to and enhancing acidification of lysosomes. PLGA nanoparticles also provide functional benefits including enhanced lysosomal degradation, promotion of macroautophagy/autophagy and protein aggregate removal, and reduced apoptosis and inflammasome activation. We demonstrate the utility of this system in vivo, showing nanoparticle accumulation in, and lysosomal acidification of, macrophages in atherosclerotic plaques. Long-term administration of PLGA nanoparticles results in significant reductions in surrogates of plaque complexity with reduced apoptosis, necrotic core formation, and cytotoxic protein aggregates and increased fibrous cap formation. 
Taken together, our data support the use of acidic nanoparticles to rescue macrophage lysosomal dysfunction in the treatment of atherosclerosis.
Abbreviations: BCA: brachiocephalic arteries; FACS: fluorescence activated cell sorting; FITC: fluorescein-5-isothiocyanate; IL1B: interleukin 1 beta; LAMP: lysosomal associated membrane protein; LIPA/LAL: lipase A, lysosomal acid type; LSDs: lysosomal storage disorders; MAP1LC3/LC3: microtubule associated protein 1 light chain 3; MFI: mean fluorescence intensity; MPS: mononuclear phagocyte system; PEGHDE: polyethylene glycol hexadecyl ether; PLA: polylactic acid; PLGA: poly-lactide-co-glycolic acid; SQSTM1/p62: sequestosome 1.
Affiliation(s)
- Xiangyu Zhang
- Cardiovascular Division, Washington University, St. Louis, MO, USA
- Santosh Kumar Misra
- Department of Bioengineering, University of Illinois at Urbana Champaign, IL, USA
- Parikshit Moitra
- Departments of Diagnostic Radiology and Nuclear Medicine and Pediatrics, Baltimore, Maryland, USA
- Department of Nuclear Engineering, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Xiuli Zhang
- Department of Surgery, Washington University, St. Louis, MO, USA
- Se-Jin Jeong
- Cardiovascular Division, Washington University, St. Louis, MO, USA
- Jeremiah Stitham
- Cardiovascular Division, Washington University, St. Louis, MO, USA
- Division of Endocrinology, Metabolism, and Lipid Research, St. Louis, MO, USA
- Arick Park
- Cardiovascular Division, Washington University, St. Louis, MO, USA
- Yu-Sheng Yeh
- Cardiovascular Division, Washington University, St. Louis, MO, USA
- Daping Fan
- Department of Cell Biology and Anatomy, University of South Carolina School of Medicine, Columbia, SC, USA
- Abhinav Diwan
- Cardiovascular Division, Washington University, St. Louis, MO, USA
- John Cochran Division, VA Medical Center, St. Louis, MO, USA
- Jaehyung Cho
- Division of Hematology, Department of Medicine, Washington University School of Medicine, St. Louis, MO, USA
- Department of Pathology & Immunology, Washington University School of Medicine, St. Louis, MO, USA
- Slava Epelman
- Peter Munk Cardiac Center, Toronto General Hospital Research Institute, University Health Network, Ted Rogers Centre for Heart Research, University of Toronto, Toronto, Ontario, Canada
- Irfan J. Lodhi
- Division of Endocrinology, Metabolism, and Lipid Research, St. Louis, MO, USA
- Dipanjan Pan
- Department of Bioengineering, University of Illinois at Urbana Champaign, IL, USA
- Departments of Diagnostic Radiology and Nuclear Medicine and Pediatrics, Baltimore, Maryland, USA
- Department of Nuclear Engineering, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Babak Razani
- Cardiovascular Division, Washington University, St. Louis, MO, USA
- John Cochran Division, VA Medical Center, St. Louis, MO, USA
- Department of Pathology & Immunology, Washington University School of Medicine, St. Louis, MO, USA
8
Li Y, Kong F, Cui H, Wang F, Li C, Ma J. SENIES: DNA Shape Enhanced Two-Layer Deep Learning Predictor for the Identification of Enhancers and Their Strength. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:637-645. [PMID: 35015646 DOI: 10.1109/tcbb.2022.3142019] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
Identifying enhancers is a critical task in bioinformatics due to their primary role in regulating gene expression. For this reason, various computational algorithms devoted to enhancer identification have been put forward over the years. More features are extracted from the single DNA sequences to boost the performance. Nevertheless, DNA structural information is neglected, which is an essential factor affecting the binding preferences of transcription factors to regulatory elements like enhancers. Here, we propose SENIES, a DNA shape enhanced deep learning predictor, to identify enhancers and their strength. The predictor consists of two layers where the first layer is for enhancer and non-enhancer identification, and the second layer is for predicting the strength of enhancers. Apart from two common sequence-derived features (i.e., one-hot and k-mer), DNA shape is introduced to describe the 3D structures of DNA sequences. Performance comparison with state-of-the-art methods conducted on public datasets demonstrates the effectiveness and robustness of our predictor. The code implementation of SENIES is publicly available at https://github.com/hlju-liye/SENIES.
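The two sequence-derived features named in the abstract, one-hot and k-mer, can be sketched as follows. This is an illustration under stated assumptions, not the SENIES code: the function names `one_hot` and `kmer_frequencies` are hypothetical, the A/C/G/T column order and the normalization are arbitrary choices, and the DNA shape features are not reproduced here.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"


def one_hot(seq):
    """Encode each base as a 4-element indicator vector in A, C, G, T order.

    Bases outside ACGT (for example N) become an all-zero row.
    """
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]


def kmer_frequencies(seq, k=2):
    """Return normalized counts of all 4**k k-mers in lexicographic order.

    Windows containing non-ACGT characters are counted in the total but
    have no slot in the output, so the vector may then sum to less than 1.
    """
    seq = seq.upper()
    windows = max(len(seq) - k + 1, 0)
    counts = Counter(seq[i:i + k] for i in range(windows))
    total = max(windows, 1)  # avoid division by zero for very short input
    return [counts["".join(p)] / total for p in product(BASES, repeat=k)]


encoded = one_hot("ACGT")               # 4 rows, one per base
profile = kmer_frequencies("ACGT", 2)   # 16 dinucleotide frequencies
```

One-hot encoding preserves base order and suits convolutional layers, while the k-mer profile is order-free and fixed-length regardless of sequence length; a predictor can use both views side by side.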
9
Investigation and Prediction of ECMM characteristics of Hardened Die Steel with Nanoparticle Added Electrolytes Using Hybrid Deep Neural Network. Polish Journal of Chemical Technology 2022. [DOI: 10.2478/pjct-2022-0024] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/28/2022]
Abstract
This work aims to improve the process efficiency of ECMM by using different combinations of nanoparticle-added electrolytes. The primary aim is to improve and predict the ECMM machining characteristics of hardened die steel, namely material removal rate (MRR), tool wear rate (TWR), and surface roughness (Ra). The machining conditions are optimized using Response Surface Methodology (RSM) based on a Box-Behnken design. The nano-electrolyte is optimized using Deer Hunting Optimization (DHO) based on the machining outcomes, and performance is predicted using a hybrid Deep Neural Network (DNN) and DHO model. The outcomes predicted by the hybrid DNN-DHO model are an MRR of 0.361 mg/min, a TWR of 0.272 mg/min, and an Ra of 2.511 μm. The validation results show that the proposed DNN-DHO model performed well, with a regression value above 0.99 for both training and validation and a root mean square error between 0.018 and 0.024.
10
Alotaibi BS, Buabeid M, Ibrahim NA, Kharaba ZJ, Ijaz M, Murtaza G. Recent strategies driving oral biologic administration. Expert Rev Vaccines 2021; 20:1587-1601. [PMID: 34612121 DOI: 10.1080/14760584.2021.1990044] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 10/20/2022]
Abstract
INTRODUCTION: High patient compliance, noninvasiveness, and self-administration are the leading features of vaccine delivery through the oral route. The need for swift mass vaccination campaigns in pandemic outbreaks favors the use of oral vaccination. This approach can elicit both mucosal and systemic immune responses to protect against infection at the surface of the mucosa. AREAS COVERED: As pathogen entry and spread mainly occur through the gastrointestinal tract (GIT) mucosal surfaces, oral vaccination may protect and limit disease spread. Oral vaccines target various potential mucosal inductive sites in the GIT, such as the oral cavity, gastric area, and small intestine. Orally delivered subunit and nucleic acid vaccines face various GIT-associated risks, such as the biodegradation of biologics and their reduced absorption. This article presents a summarized review of the existing technologies and prospects for oral vaccination. EXPERT OPINION: Current approaches focus on the intestinal mucosa, while future strategies target new mucosal sites, i.e. the oral cavity and stomach. Recent developments in biologic delivery through the oral route and their potential use in future oral vaccination are mainly considered.
Affiliation(s)
- Badriyah Shadid Alotaibi
- Department of Pharmaceutical Sciences, College of Pharmacy, Princess Nourah Bint Abdulrahman University, Riyadh, Saudi Arabia
- Manal Buabeid
- Department of Clinical Sciences, Ajman University, Ajman, 346, UAE
- Medical and Bio-allied Health Sciences Research Centre, Ajman University, Ajman, United Arab Emirates
- Nihal Abdalla Ibrahim
- Department of Clinical Sciences, Ajman University, Ajman, 346, UAE
- Medical and Bio-allied Health Sciences Research Centre, Ajman University, Ajman, United Arab Emirates
- Zelal Jaber Kharaba
- Department of Clinical Sciences, College of Pharmacy, Al-Ain University of Science and Technology, Abu Dhabi, United Arab Emirates
- Munazza Ijaz
- Institute of Molecular Biology and Biotechnology, The University of Lahore, Lahore, Pakistan
- Ghulam Murtaza
- Department of Pharmacy, COMSATS University Islamabad, Lahore, 54000, Pakistan
11
Umarov R, Li Y, Arakawa T, Takizawa S, Gao X, Arner E. ReFeaFi: Genome-wide prediction of regulatory elements driving transcription initiation. PLoS Comput Biol 2021; 17:e1009376. [PMID: 34491989 PMCID: PMC8448322 DOI: 10.1371/journal.pcbi.1009376] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 06/22/2021] [Revised: 09/17/2021] [Accepted: 08/23/2021] [Indexed: 11/19/2022]
Abstract
Regulatory elements control gene expression through transcription initiation (promoters) and by enhancing transcription at distant regions (enhancers). Accurate identification of regulatory elements is fundamental for annotating genomes and understanding gene expression patterns. While there are many attempts to develop computational promoter and enhancer identification methods, reliable tools to analyze long genomic sequences are still lacking. Prediction methods often perform poorly on the genome-wide scale because the number of negatives is much higher than that in the training sets. To address this issue, we propose a dynamic negative set updating scheme with a two-model approach, using one model for scanning the genome and the other one for testing candidate positions. The developed method achieves good genome-level performance and maintains robust performance when applied to other vertebrate species, without re-training. Moreover, the unannotated predicted regulatory regions made on the human genome are enriched for disease-associated variants, suggesting them to be potentially true regulatory elements rather than false positives. We validated high scoring "false positive" predictions using reporter assay and all tested candidates were successfully validated, demonstrating the ability of our method to discover novel human regulatory regions.
Affiliation(s)
- Ramzan Umarov
- Graduate School of Integrated Sciences for Life, Hiroshima University, Higashi-Hiroshima, Japan
- Yu Li
- Department of Computer Science and Engineering (CSE), The Chinese University of Hong Kong (CUHK), Hong Kong, People’s Republic of China
- Takahiro Arakawa
- Laboratory for Applied Regulatory Genomics Network Analysis, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan
- Satoshi Takizawa
- Laboratory for Applied Regulatory Genomics Network Analysis, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan
- Xin Gao
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences and Engineering Division, Thuwal, Saudi Arabia
- Erik Arner
- Graduate School of Integrated Sciences for Life, Hiroshima University, Higashi-Hiroshima, Japan
- Laboratory for Applied Regulatory Genomics Network Analysis, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan
12
Roth M, Jain P, Koo J, Chaterji S. Simultaneous learning of individual microRNA-gene interactions and regulatory comodules. BMC Bioinformatics 2021; 22:237. [PMID: 33971820 PMCID: PMC8111732 DOI: 10.1186/s12859-021-04151-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/06/2020] [Accepted: 04/23/2021] [Indexed: 11/10/2022]
Abstract
BACKGROUND: MicroRNAs (miRNAs) function in post-transcriptional regulation of gene expression by binding to target messenger RNAs (mRNAs). Because of the key part that miRNAs play, understanding the correct regulatory role of miRNAs in diverse patho-physiological conditions is of great interest. Although it is known that miRNAs act combinatorially to regulate genes, precise identification of miRNA-gene interactions and their specific functional roles in regulatory comodules remains a challenge. We developed THEIA, an effective method for simultaneously predicting miRNA-gene interactions and regulatory comodules, which group functionally related miRNAs and genes via non-negative matrix factorization (NMF). RESULTS: We apply THEIA to RNA sequencing data from breast invasive carcinoma samples and demonstrate its effectiveness in discovering biologically significant regulatory comodules that are significantly enriched in spatial miRNA clusters, biological pathways, and various cancers. CONCLUSIONS: THEIA is a theoretically rigorous optimization algorithm that simultaneously predicts the strength and direction (i.e., up-regulation or down-regulation) of the effect of modules of miRNAs on a gene. We posit that if THEIA is capable of recovering known clusters of genes and miRNA, then the clusters found by our method not previously identified by literature are also likely to have biological significance. We believe that these novel regulatory comodules found by our method will be a springboard for further research into the specific functional roles of these new functional ensembles of miRNAs and genes, especially those related to diseases like breast cancer.
Affiliation(s)
- Pranjal Jain
- Electrical Engineering, Indian Institute of Technology Bombay, Mumbai, India
- Somali Chaterji
- Agricultural and Biological Engineering, Purdue University, West Lafayette, IN, USA.
13
Mantsoki A, Parussel K, Joshi A. Identification and Characterisation of Putative Enhancer Elements in Mouse Embryonic Stem Cells. Bioinform Biol Insights 2021; 15:1177932220974623. [PMID: 33623376 PMCID: PMC7876754 DOI: 10.1177/1177932220974623] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/04/2019] [Accepted: 10/26/2020] [Indexed: 11/16/2022] Open
Abstract
Enhancer elements control mammalian transcription largely in a cell-type-specific manner. The genome-wide identification of enhancer elements and their activity status in a cellular context is therefore fundamental to understanding cell identity and function. We determined enhancer activity in mouse embryonic stem (ES) cells using chromatin modifications and characterised their global properties. Specifically, we first grouped enhancers into 5 groups using multiple H3K4me1, H3K27ac, and H3K27me3 modification data sets. Active enhancers (simultaneous presence of H3K4me1 and H3K27ac) were enriched for binding of pluripotency factors and were found near pluripotency-related genes. Although both H3K4me1-only and active enhancers were enriched for super-enhancers and a TATA box like motif, active enhancers were preferentially bound by RNA polII (s2) and were enriched for bidirectional transcription, while H3K4me1-only enhancers were enriched for RNA polII (8WG16) suggesting they were likely poised. Bivalent enhancers (simultaneous presence of H3K4me1 and H3K27me3) were preferentially in the vicinity of bivalent genes. They were enriched for binding of components of polycomb complex as well as Tcf3 and Oct4. Moreover, a ‘CTTTCTC’ de-novo motif was enriched at bivalent enhancers, previously identified at bivalent promoters in ES cells. Taken together, 3 histone modifications successfully demarcated active, bivalent, and poised enhancers with distinct sequence and binding features.
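The mark combinations named in this abstract can be illustrated with a minimal Python sketch. It assumes each candidate region has already been reduced to three presence/absence calls (H3K4me1, H3K27ac, H3K27me3). The function name `classify_enhancer`, the boolean inputs, and the catch-all "other" label are illustrative assumptions rather than the authors' pipeline; the abstract names only three of its five groups, so every other combination is lumped together here.

```python
def classify_enhancer(h3k4me1: bool, h3k27ac: bool, h3k27me3: bool) -> str:
    """Map three histone-mark calls to an enhancer class (illustrative only)."""
    # Active: H3K4me1 together with H3K27ac (no H3K27me3 in this sketch).
    if h3k4me1 and h3k27ac and not h3k27me3:
        return "active"
    # Bivalent: H3K4me1 together with H3K27me3 (no H3K27ac in this sketch).
    if h3k4me1 and h3k27me3 and not h3k27ac:
        return "bivalent"
    # H3K4me1 alone: described in the abstract as likely poised.
    if h3k4me1 and not h3k27ac and not h3k27me3:
        return "H3K4me1-only"
    # Combinations not defined in the abstract (assumption: grouped as "other").
    return "other"


# Hypothetical example calls for three regions.
for marks in [(True, True, False), (True, False, True), (True, False, False)]:
    print(marks, classify_enhancer(*marks))
```

Requiring the third mark to be absent in each rule is a simplification made for this sketch; the abstract does not say how regions carrying all three marks were handled.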
Affiliation(s)
- Anna Mantsoki
- Division of Developmental Biology, The Roslin Institute, The University of Edinburgh, Midlothian, UK
- Karla Parussel
- Division of Developmental Biology, The Roslin Institute, The University of Edinburgh, Midlothian, UK
- Anagha Joshi
- Computational Biology Unit, Department of Clinical Science, University of Bergen, Bergen, Norway
14
Schreiber J, Singh R, Bilmes J, Noble WS. A pitfall for machine learning methods aiming to predict across cell types. Genome Biol 2020; 21:282. [PMID: 33213499 PMCID: PMC7678316 DOI: 10.1186/s13059-020-02177-y] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 04/18/2020] [Accepted: 10/07/2020] [Indexed: 01/19/2023] Open
Abstract
Machine learning models that predict genomic activity are most useful when they make accurate predictions across cell types. Here, we show that when the training and test sets contain the same genomic loci, the resulting model may falsely appear to perform well by effectively memorizing the average activity associated with each locus across the training cell types. We demonstrate this phenomenon in the context of predicting gene expression and chromatin domain boundaries, and we suggest methods to diagnose and avoid the pitfall. We anticipate that, as more data becomes available, future projects will increasingly risk suffering from this issue.
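The "average activity per locus" effect described in this abstract can be made concrete with a small Python sketch. It builds a baseline that predicts, for each locus, the mean activity observed across the training cell types. The function name `locus_average_baseline`, the row layout `(locus, cell_type, activity)`, and the toy values are assumptions for illustration only. Comparing a model against such a baseline is one plausible diagnostic, not necessarily the procedure the authors recommend.

```python
from collections import defaultdict


def locus_average_baseline(train_rows):
    """Return {locus: mean activity across training cell types}.

    train_rows: iterable of (locus, cell_type, activity) tuples (assumed layout).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for locus, _cell_type, activity in train_rows:
        sums[locus] += activity
        counts[locus] += 1
    return {locus: sums[locus] / counts[locus] for locus in sums}


# Toy training data: two loci measured in two training cell types.
train_rows = [
    ("chr1:1000", "A", 1.0),
    ("chr1:1000", "B", 3.0),
    ("chr2:5000", "A", 0.0),
    ("chr2:5000", "B", 0.5),
]
baseline = locus_average_baseline(train_rows)

# The baseline "prediction" for a held-out cell type is just the per-locus mean.
print(baseline)
```

If a trained model does not clearly beat this lookup table on a held-out cell type, its apparent accuracy may reflect memorized loci rather than learned cell-type-specific signal.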
Affiliation(s)
- Jacob Schreiber
- Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, USA
- Ritambhara Singh
- Department of Genome Science, University of Washington, Seattle, USA; Current Affiliation: Department of Computer Science, and Center for Computational Molecular Biology, Brown University, Providence, 02906, RI, United States
- Jeffrey Bilmes
- Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, USA; Department of Electrical & Computer Engineering, University of Washington, Seattle, USA
- William Stafford Noble
- Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, USA; Department of Genome Science, University of Washington, Seattle, USA.
15
Sun C, Zhang N, Yu P, Wu X, Li Q, Li T, Li H, Xiao X, Shalmani A, Li L, Che D, Wang X, Zhang P, Chen Z, Liu T, Zhao J, Hua J, Liao M. Enhancer recognition and prediction during spermatogenesis based on deep convolutional neural networks. Mol Omics 2020; 16:455-464. [PMID: 32568326 DOI: 10.1039/d0mo00031k] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/21/2022]
Abstract
MOTIVATION Enhancers play an important role in the regulation of gene expression during spermatogenesis. The development of ChIP-Chip and ChIP-Seq sequencing technology has enabled researchers to focus on the relationship between enhancers and DNA sequences and histone protein modifications. However, the prediction of enhancers based on the locally conserved DNA sequence and similar histone modification features is still unknown. Here, the present study proposed a convolutional neural network (CNN) model to predict enhancers that can regulate gene expression during spermatogenesis. RESULTS We have obtained a positive set of enhancers using the P300 locus, verified by experiments, while a negative set was constructed using the promoter as a non-enhancer locus. The model was trained on all types of specific cells during spermatogenesis independently, and the transfer learning strategy was used to fine-tune the model based on which the model can be trained and adapted to other cells quickly. We visualized the convolution layer of the trained model and aligned the predicted enhancer with the JASPAR database. The results showed that the model was highly matched with some important transcription factors during spermatogenesis, signifying the reliability of the model. Finally, we compared the CNN algorithm with the gkmSVM algorithm (Support Vector Machine). It is well known that CNN has better performance than the gkmSVM algorithm, especially in the generalization ability. Our work demonstrated their strong learning ability and the low CPU requirements for the experiment, with a small number of convolution layers and simple network structure, while avoiding overfitting the training data. At the end of the experiment, we used the trained model to build an enhancer recognition website for further research and communication.
Affiliation(s)
- Chengzhang Sun
- College of Life Sciences, Northwest A&F University, Yangling, Shaanxi 712100, China.
16
Chen T, Tyagi S. Integrative computational epigenomics to build data-driven gene regulation hypotheses. Gigascience 2020; 9:giaa064. [PMID: 32543653 PMCID: PMC7297091 DOI: 10.1093/gigascience/giaa064] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 03/25/2020] [Revised: 05/25/2020] [Accepted: 05/26/2020] [Indexed: 12/20/2022] Open
Abstract
BACKGROUND Diseases are complex phenotypes often arising as an emergent property of a non-linear network of genetic and epigenetic interactions. To translate this resulting state into a causal relationship with a subset of regulatory features, many experiments deploy an array of laboratory assays from multiple modalities. Often, each of these resulting datasets is large, heterogeneous, and noisy. Thus, it is non-trivial to unify these complex datasets into an interpretable phenotype. Although recent methods address this problem with varying degrees of success, they are constrained by their scopes or limitations. Therefore, an important gap in the field is the lack of a universal data harmonizer with the capability to arbitrarily integrate multi-modal datasets. RESULTS In this review, we perform a critical analysis of methods with the explicit aim of harmonizing data, as opposed to case-specific integration. This revealed that matrix factorization, latent variable analysis, and deep learning are potent strategies. Finally, we describe the properties of an ideal universal data harmonization framework. CONCLUSIONS A sufficiently advanced universal harmonizer has major medical implications, such as (1) identifying dysregulated biological pathways responsible for a disease is a powerful diagnostic tool; (2) investigating these pathways further allows the biological community to better understand a disease's mechanisms; and (3) precision medicine also benefits from developments in this area, particularly in the context of the growing field of selective epigenome editing, which can suppress or induce a desired phenotype.
Affiliation(s)
- Tyrone Chen
- 25 Rainforest Walk, School of Biological Sciences, Monash University, Clayton, VIC 3800, Australia
- Sonika Tyagi
- 25 Rainforest Walk, School of Biological Sciences, Monash University, Clayton, VIC 3800, Australia
17
The Road Not Taken with Pyrrole-Imidazole Polyamides: Off-Target Effects and Genomic Binding. Biomolecules 2020; 10:biom10040544. [PMID: 32260120 PMCID: PMC7226143 DOI: 10.3390/biom10040544] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 02/21/2020] [Revised: 03/16/2020] [Accepted: 03/19/2020] [Indexed: 12/20/2022] Open
Abstract
The high sequence specificity of minor groove-binding N-methylpyrrole-N-methylimidazole polyamides have made significant advances in cancer and disease biology, yet there have been few comprehensive reports on their off-target effects, most likely as a consequence of the lack of available tools in evaluating genomic binding, an essential aspect that has gone seriously underexplored. Compared to other N-heterocycles, the off-target effects of these polyamides and their specificity for the DNA minor groove and primary base pair recognition require the development of new analytical methods, which are missing in the field today. This review aims to highlight the current progress in deciphering the off-target effects of these N-heterocyclic molecules and suggests new ways that next-generation sequencing can be used in addressing off-target effects.
18
Orozco-Arias S, Isaza G, Guyot R, Tabares-Soto R. A systematic review of the application of machine learning in the detection and classification of transposable elements. PeerJ 2019; 7:e8311. [PMID: 31976169 PMCID: PMC6967008 DOI: 10.7717/peerj.8311] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 08/05/2019] [Accepted: 11/28/2019] [Indexed: 12/16/2022] Open
Abstract
Background Transposable elements (TEs) constitute the most common repeated sequences in eukaryotic genomes. Recent studies demonstrated their deep impact on species diversity, adaptation to the environment and diseases. Although there are many conventional bioinformatics algorithms for detecting and classifying TEs, none have achieved reliable results on different types of TEs. Machine learning (ML) techniques can automatically extract hidden patterns and novel information from labeled or non-labeled data and have been applied to solving several scientific problems. Methodology We followed the Systematic Literature Review (SLR) process, applying the six stages of the review protocol from it, but added a previous stage, which aims to detect the need for a review. Then search equations were formulated and executed in several literature databases. Relevant publications were scanned and used to extract evidence to answer research questions. Results Several ML approaches have already been tested on other bioinformatics problems with promising results, yet there are few algorithms and architectures available in literature focused specifically on TEs, despite representing the majority of the nuclear DNA of many organisms. Only 35 articles were found and categorized as relevant in TE or related fields. Conclusions ML is a powerful tool that can be used to address many problems. Although ML techniques have been used widely in other biological tasks, their utilization in TE analyses is still limited. Following the SLR, it was possible to notice that the use of ML for TE analyses (detection and classification) is an open problem, and this new field of research is growing in interest.
Affiliation(s)
- Simon Orozco-Arias
- Department of Computer Science, Universidad Autónoma de Manizales, Manizales, Caldas, Colombia; Department of Systems and Informatics, Universidad de Caldas, Manizales, Caldas, Colombia
- Gustavo Isaza
- Department of Systems and Informatics, Universidad de Caldas, Manizales, Caldas, Colombia
- Romain Guyot
- Institut de Recherche pour le Développement, CIRAD, University of Montpellier, Montpellier, France; Department of Electronics and Automation, Universidad Autónoma de Manizales, Manizales, Caldas, Colombia
- Reinel Tabares-Soto
- Department of Electronics and Automation, Universidad Autónoma de Manizales, Manizales, Caldas, Colombia
19
Fang CH, Theera-Ampornpunt N, Roth MA, Grama A, Chaterji S. AIKYATAN: mapping distal regulatory elements using convolutional learning on GPU. BMC Bioinformatics 2019; 20:488. [PMID: 31590652 PMCID: PMC6781298 DOI: 10.1186/s12859-019-3049-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/02/2019] [Accepted: 08/22/2019] [Indexed: 12/02/2022] Open
Abstract
Background The data deluge can leverage sophisticated ML techniques for functionally annotating the regulatory non-coding genome. The challenge lies in selecting the appropriate classifier for the specific functional annotation problem, within the bounds of the hardware constraints and the model’s complexity. In our system Aikyatan, we annotate distal epigenomic regulatory sites, e.g., enhancers. Specifically, we develop a binary classifier that classifies genome sequences as distal regulatory regions or not, given their histone modifications’ combinatorial signatures. This problem is challenging because the regulatory regions are distal to the genes, with diverse signatures across classes (e.g., enhancers and insulators) and even within each class (e.g., different enhancer sub-classes). Results We develop a suite of ML models, under the banner Aikyatan, including SVM models, random forest variants, and deep learning architectures, for distal regulatory element (DRE) detection. We demonstrate, with strong empirical evidence, deep learning approaches have a computational advantage. Plus, convolutional neural networks (CNN) provide the best-in-class accuracy, superior to the vanilla variant. With the human embryonic cell line H1, CNN achieves an accuracy of 97.9% and an order of magnitude lower runtime than the kernel SVM. Running on a GPU, the training time is sped up 21x and 30x (over CPU) for DNN and CNN, respectively. Finally, our CNN model enjoys superior prediction performance vis-à-vis the competition. Specifically, Aikyatan-CNN achieved 40% higher validation rate versus CSIANN and the same accuracy as RFECS. Conclusions Our exhaustive experiments using an array of ML tools validate the need for a model that is not only expressive but can scale with increasing data volumes and diversity. In addition, a subset of these datasets have image-like properties and benefit from spatial pooling of features. Our Aikyatan suite leverages diverse epigenomic datasets that can then be modeled using CNNs with optimized activation and pooling functions. The goal is to capture the salient features of the integrated epigenomic datasets for deciphering the distal (non-coding) regulatory elements, which have been found to be associated with functional variants. Our source code will be made publicly available at: https://bitbucket.org/cellsandmachines/aikyatan.
Affiliation(s)
- Chih-Hao Fang
- Department of Ag. and Biological Engineering, Purdue University, West Lafayette, IN, USA
- Ananth Grama
- Department of Ag. and Biological Engineering, Purdue University, West Lafayette, IN, USA
- Somali Chaterji
- Department of Ag. and Biological Engineering, Purdue University, Purdue University, IN, USA.
20
Shi C, Chen J, Kang X, Zhao G, Lao X, Zheng H. Deep Learning in the Study of Protein-Related Interactions. Protein Pept Lett 2019; 27:359-369. [PMID: 31538879 DOI: 10.2174/0929866526666190723114142] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 01/05/2019] [Revised: 03/13/2019] [Accepted: 04/05/2019] [Indexed: 11/22/2022]
Abstract
Protein-related interaction prediction is critical to understanding life processes, biological functions, and mechanisms of drug action. Experimental methods used to determine protein-related interactions have always been costly and inefficient. In recent years, advances in biological and medical technology have provided us with explosive biological and physiological data, and deep learning-based algorithms have shown great promise in extracting features and learning patterns from complex data. At present, deep learning in protein research has emerged. In this review, we provide an introductory overview of the deep neural network theory and its unique properties. We focus mainly on the application of this technology to protein-related interaction prediction over the past five years, including protein-protein, protein-RNA/DNA, and protein-drug interaction prediction, among others. Finally, we discuss some of the challenges that deep learning currently faces.
Affiliation(s)
- Cheng Shi
- School of Life Science and Technology, China Pharmaceutical University, Nanjing 210009, China
- Jiaxing Chen
- School of Life Science and Technology, China Pharmaceutical University, Nanjing 210009, China
- Xinyue Kang
- School of Life Science and Technology, China Pharmaceutical University, Nanjing 210009, China
- Guiling Zhao
- School of Life Science and Technology, China Pharmaceutical University, Nanjing 210009, China
- Xingzhen Lao
- School of Life Science and Technology, China Pharmaceutical University, Nanjing 210009, China
- Heng Zheng
- School of Life Science and Technology, China Pharmaceutical University, Nanjing 210009, China
21
Albalawi F, Chahid A, Guo X, Albaradei S, Magana-Mora A, Jankovic BR, Uludag M, Van Neste C, Essack M, Laleg-Kirati TM, Bajic VB. Hybrid model for efficient prediction of poly(A) signals in human genomic DNA. Methods 2019; 166:31-39. [PMID: 30991099 DOI: 10.1016/j.ymeth.2019.04.001] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 11/15/2018] [Revised: 03/12/2019] [Accepted: 04/01/2019] [Indexed: 12/15/2022] Open
Abstract
Polyadenylation signals (PAS) are found in most protein-coding and some non-coding genes in eukaryotes. Their accurate recognition improves understanding of gene regulation mechanisms and recognition of the 3'-end of transcribed gene regions where premature or alternate transcription ends may lead to various diseases. Although different methods and tools for in-silico prediction of genomic signals have been proposed, the correct identification of PAS in genomic DNA remains challenging due to a vast number of non-relevant hexamers identical to PAS hexamers. In this study, we developed a novel method for PAS recognition. The method is implemented in a hybrid PAS recognition model (HybPAS), which is based on deep neural networks (DNNs) and logistic regression models (LRMs). One such model is developed for each of the 12 most frequent human PAS hexamers. DNN models appeared the best for eight PAS types (including the two most frequent PAS hexamers), while LRM appeared best for the remaining four PAS types. The new models use different combinations of signal processing-based, statistical, and sequence-based features as input. The results obtained on human genomic data show that HybPAS outperforms the well-tuned state-of-the-art Omni-PolyA models, reducing the classification error for different PAS hexamers by up to 57.35% for 10 out of 12 PAS types, with Omni-PolyA models being better for two PAS types. For the most frequent PAS types, 'AATAAA' and 'ATTAAA', HybPAS reduced the error rate by 35.14% and 34.48%, respectively. On average, HybPAS reduces the error by 30.29%. HybPAS is implemented partly in Python and partly in MATLAB and is available at https://github.com/EMANG-KAUST/PolyA_Prediction_LRM_DNN.
Affiliation(s)
- Fahad Albalawi
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal 23955-6900, Saudi Arabia; Taif University, Electrical Engineering, Taif 21944, Saudi Arabia
- Abderrazak Chahid
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal 23955-6900, Saudi Arabia
- Xingang Guo
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal 23955-6900, Saudi Arabia
- Somayah Albaradei
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal 23955-6900, Saudi Arabia
- Arturo Magana-Mora
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal 23955-6900, Saudi Arabia; Saudi Aramco, EXPEC-ARC, Drilling Technology Team, Dhahran 31311, Saudi Arabia
- Boris R Jankovic
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal 23955-6900, Saudi Arabia
- Mahmut Uludag
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal 23955-6900, Saudi Arabia
- Christophe Van Neste
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal 23955-6900, Saudi Arabia; Ghent University, Center for Medical Genetics Ghent (CMGG), B-9000 Ghent, Belgium
- Magbubah Essack
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal 23955-6900, Saudi Arabia
- Taous-Meriem Laleg-Kirati
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal 23955-6900, Saudi Arabia.
- Vladimir B Bajic
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal 23955-6900, Saudi Arabia.
22
Gudenas BL, Wang L. Prediction of LncRNA Subcellular Localization with Deep Learning from Sequence Features. Sci Rep 2018; 8:16385. [PMID: 30401954 PMCID: PMC6219567 DOI: 10.1038/s41598-018-34708-w] [Citation(s) in RCA: 86] [Impact Index Per Article: 14.3] [Received: 03/26/2018] [Accepted: 10/19/2018] [Indexed: 12/20/2022] Open
Abstract
Long non-coding RNAs are involved in biological processes throughout the cell including the nucleus, chromatin and cytosol. However, most lncRNAs remain unannotated and functional annotation of lncRNAs is difficult due to their low conservation and their tissue and developmentally specific expression. LncRNA subcellular localization is highly informative regarding its biological function, although it is difficult to discover because few prediction methods currently exist. While protein subcellular localization prediction is a well-established research field, lncRNA localization prediction is a novel research problem. We developed DeepLncRNA, a deep learning algorithm which predicts lncRNA subcellular localization directly from lncRNA transcript sequences. We analyzed 93 strand-specific RNA-seq samples of nuclear and cytosolic fractions from multiple cell types to identify differentially localized lncRNAs. We then extracted sequence-based features from the lncRNAs to construct our DeepLncRNA model, which achieved an accuracy of 72.4%, sensitivity of 83%, specificity of 62.4% and area under the receiver operating characteristic curve of 0.787. Our results suggest that primary sequence motifs are a major driving force in the subcellular localization of lncRNAs.
Affiliation(s)
- Brian L Gudenas
- Department of Genetics and Biochemistry, Clemson University, Clemson, SC, USA
- Liangjiang Wang
- Department of Genetics and Biochemistry, Clemson University, Clemson, SC, USA.
23
Ma W, Qiu Z, Song J, Li J, Cheng Q, Zhai J, Ma C. A deep convolutional neural network approach for predicting phenotypes from genotypes. PLANTA 2018; 248:1307-1318. [PMID: 30101399 DOI: 10.1007/s00425-018-2976-9] [Citation(s) in RCA: 92] [Impact Index Per Article: 15.3] [Received: 04/09/2018] [Accepted: 07/11/2018] [Indexed: 05/21/2023]
Abstract
Deep learning is a promising technology to accurately select individuals with high phenotypic values based on genotypic data. Genomic selection (GS) is a promising breeding strategy by which the phenotypes of plant individuals are usually predicted based on genome-wide markers of genotypes. In this study, we present a deep learning method, named DeepGS, to predict phenotypes from genotypes. Using a deep convolutional neural network, DeepGS uses hidden variables that jointly represent features in genotypes when making predictions; it also employs convolution, sampling and dropout strategies to reduce the complexity of high-dimensional genotypic data. We used a large GS dataset to train DeepGS and compared its performance with other methods. The experimental results indicate that DeepGS can be used as a complement to the commonly used RR-BLUP in the prediction of phenotypes from genotypes. The complementarity between DeepGS and RR-BLUP can be utilized using an ensemble learning approach for more accurately selecting individuals with high phenotypic values, even for the absence of outlier individuals and subsets of genotypic markers. The source codes of DeepGS and the ensemble learning approach have been packaged into Docker images for facilitating their applications in different GS programs.
Affiliation(s)
- Wenlong Ma
- State Key Laboratory of Crop Stress Biology for Arid Areas, Center of Bioinformatics, College of Life Sciences, Northwest A&F University, Yangling, 712100, Shaanxi, China
- Key Laboratory of Biology and Genetics Improvement of Maize in Arid Area of Northwest Region, Ministry of Agriculture, Northwest A&F University, Yangling, 712100, Shaanxi, China
- Zhixu Qiu
- State Key Laboratory of Crop Stress Biology for Arid Areas, Center of Bioinformatics, College of Life Sciences, Northwest A&F University, Yangling, 712100, Shaanxi, China
- Biomass Energy Center for Arid and Semi-arid Lands, Northwest A&F University, Shaanxi, 712100, Yangling, China
- Jie Song
- State Key Laboratory of Crop Stress Biology for Arid Areas, Center of Bioinformatics, College of Life Sciences, Northwest A&F University, Yangling, 712100, Shaanxi, China
- Key Laboratory of Biology and Genetics Improvement of Maize in Arid Area of Northwest Region, Ministry of Agriculture, Northwest A&F University, Yangling, 712100, Shaanxi, China
- Jiajia Li
- State Key Laboratory of Crop Stress Biology for Arid Areas, Center of Bioinformatics, College of Life Sciences, Northwest A&F University, Yangling, 712100, Shaanxi, China
- Biomass Energy Center for Arid and Semi-arid Lands, Northwest A&F University, Shaanxi, 712100, Yangling, China
- Qian Cheng
- State Key Laboratory of Crop Stress Biology for Arid Areas, Center of Bioinformatics, College of Life Sciences, Northwest A&F University, Yangling, 712100, Shaanxi, China
- Biomass Energy Center for Arid and Semi-arid Lands, Northwest A&F University, Shaanxi, 712100, Yangling, China
- Jingjing Zhai
- State Key Laboratory of Crop Stress Biology for Arid Areas, Center of Bioinformatics, College of Life Sciences, Northwest A&F University, Yangling, 712100, Shaanxi, China
- Key Laboratory of Biology and Genetics Improvement of Maize in Arid Area of Northwest Region, Ministry of Agriculture, Northwest A&F University, Yangling, 712100, Shaanxi, China
- Chuang Ma
- State Key Laboratory of Crop Stress Biology for Arid Areas, Center of Bioinformatics, College of Life Sciences, Northwest A&F University, Yangling, 712100, Shaanxi, China.
- Key Laboratory of Biology and Genetics Improvement of Maize in Arid Area of Northwest Region, Ministry of Agriculture, Northwest A&F University, Yangling, 712100, Shaanxi, China.
24
Thibodeau A, Uyar A, Khetan S, Stitzel ML, Ucar D. A neural network based model effectively predicts enhancers from clinical ATAC-seq samples. Sci Rep 2018; 8:16048. [PMID: 30375457 PMCID: PMC6207744 DOI: 10.1038/s41598-018-34420-9] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Received: 07/13/2018] [Accepted: 10/16/2018] [Indexed: 01/06/2023] Open
Abstract
Enhancers are cis-acting sequences that regulate transcription rates of their target genes in a cell-specific manner and harbor disease-associated sequence variants in cognate cell types. Many complex diseases are associated with enhancer malfunction, necessitating the discovery and study of enhancers from clinical samples. Assay for Transposase Accessible Chromatin (ATAC-seq) technology can interrogate chromatin accessibility from small cell numbers and facilitate studying enhancers in pathologies. However, on average, ~35% of open chromatin regions (OCRs) from ATAC-seq samples map to enhancers. We developed a neural network-based model, Predicting Enhancers from ATAC-Seq data (PEAS), to effectively infer enhancers from clinical ATAC-seq samples by extracting ATAC-seq data features and integrating these with sequence-related features (e.g., GC ratio). PEAS recapitulated ChromHMM-defined enhancers in CD14+ monocytes, CD4+ T cells, GM12878, peripheral blood mononuclear cells, and pancreatic islets. PEAS models trained on these 5 cell types effectively predicted enhancers in four cell types that are not used in model training (EndoC-βH1, naïve CD8+ T, MCF7, and K562 cells). Finally, PEAS inferred individual-specific enhancers from 19 islet ATAC-seq samples and revealed variability in enhancer activity across individuals, including those driven by genetic differences. PEAS is an easy-to-use tool developed to study enhancers in pathologies by taking advantage of the increasing number of clinical epigenomes.
Affiliation(s)
- Asa Thibodeau
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT, 06032, USA
- Asli Uyar
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT, 06032, USA
- Shubham Khetan
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT, 06032, USA
  - Department of Genetics and Genome Sciences, University of Connecticut Health Center, Farmington, CT, 06030, USA
- Michael L Stitzel
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT, 06032, USA
  - Institute for Systems Genomics, University of Connecticut Health Center, Farmington, CT, 06030, USA
- Duygu Ucar
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT, 06032, USA
  - Institute for Systems Genomics, University of Connecticut Health Center, Farmington, CT, 06030, USA
|
25
|
Sequence based prediction of enhancer regions from DNA random walk. Sci Rep 2018; 8:15912. [PMID: 30374023 PMCID: PMC6206163 DOI: 10.1038/s41598-018-33413-y] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Received: 02/16/2018] [Accepted: 09/28/2018] [Indexed: 12/17/2022]
Abstract
Regulatory elements play a critical role in the development of eukaryotic organisms by controlling the spatio-temporal pattern of gene expression. Enhancers are one class of these elements; they contribute to the regulation of gene expression through chromatin looping or eRNA expression. Experimental identification of a novel enhancer is costly, which has driven interest in computational approaches to predict enhancer regions in a genome. Existing computational approaches have primarily been based on training on high-throughput data such as transcription factor binding sites (TFBS), DNA methylation, and histone modification marks. On the other hand, purely sequence-based approaches to predicting enhancer regions are promising, as they are not biased by the complexity or context specificity of such datasets. In sequence-based approaches, machine learning models are trained either directly on sequences or on sequence features to classify sequences as enhancers or non-enhancers. In this paper, we derived statistical and nonlinear dynamic features, along with k-mer features, from experimentally validated sequences taken from the VISTA Enhancer Browser through a random walk model, and applied different machine learning methods to predict whether an input test sequence is an enhancer. Experimental results demonstrate the success of the proposed ensemble-based model, with area under the curve (AUC) of 0.86, 0.89, and 0.87 in B cells, T cells, and natural killer cells, respectively, for the histone marks dataset.
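To make the idea of random-walk and k-mer features concrete, here is a minimal sketch. It assumes the common purine/pyrimidine DNA walk (A or G steps up by one, C or T steps down by one); the abstract does not specify the mapping the authors used, and the function names and the particular summary statistics below are hypothetical.

```python
from collections import Counter
from itertools import accumulate
from typing import Dict, List

# Assumed step rule: purines (A, G) go up, pyrimidines (C, T) go down.
STEP = {"A": 1, "G": 1, "C": -1, "T": -1}


def dna_walk(seq: str) -> List[int]:
    """Cumulative sum of +1/-1 steps along the sequence."""
    return list(accumulate(STEP[base] for base in seq.upper()))


def walk_features(seq: str) -> Dict[str, float]:
    """Simple statistical summaries of the walk (illustrative only)."""
    walk = dna_walk(seq)
    if not walk:
        raise ValueError("sequence must not be empty")
    mean = sum(walk) / len(walk)
    variance = sum((x - mean) ** 2 for x in walk) / len(walk)
    return {"mean": mean, "variance": variance,
            "min": min(walk), "max": max(walk)}


def kmer_counts(seq: str, k: int = 3) -> Counter:
    """Counts of every overlapping k-mer in the sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


if __name__ == "__main__":
    print(dna_walk("AGCT"))
    print(walk_features("AGCT"))
    print(kmer_counts("ACGACG", k=3))
```

Features of this kind would be concatenated per sequence and handed to a classifier. The nonlinear dynamic features mentioned in the abstract are not covered here.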
|
26
|
Lim LWK, Chung HH, Chong YL, Lee NK. A survey of recently emerged genome-wide computational enhancer predictor tools. Comput Biol Chem 2018; 74:132-141. [DOI: 10.1016/j.compbiolchem.2018.03.019] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.7] [Received: 12/13/2017] [Revised: 03/13/2018] [Accepted: 03/13/2018] [Indexed: 12/19/2022]
|
27
|
Kalinin AA, Higgins GA, Reamaroon N, Soroushmehr S, Allyn-Feuer A, Dinov ID, Najarian K, Athey BD. Deep learning in pharmacogenomics: from gene regulation to patient stratification. Pharmacogenomics 2018; 19:629-650. [PMID: 29697304 PMCID: PMC6022084 DOI: 10.2217/pgs-2018-0008] [Citation(s) in RCA: 74] [Impact Index Per Article: 12.3] [Received: 01/24/2018] [Accepted: 03/09/2018] [Indexed: 01/02/2023]
Abstract
This Perspective provides examples of current and future applications of deep learning in pharmacogenomics, including: identification of novel regulatory variants located in noncoding domains of the genome and their function as applied to pharmacoepigenomics; patient stratification from medical records; and the mechanistic prediction of drug response, targets and their interactions. Deep learning encapsulates a family of machine learning algorithms that has transformed many important subfields of artificial intelligence over the last decade, and has demonstrated breakthrough performance improvements on a wide range of tasks in biomedicine. We anticipate that in the future, deep learning will be widely used to predict personalized drug response and optimize medication selection and dosing, using knowledge extracted from large and complex molecular, epidemiological, clinical and demographic datasets.
Affiliation(s)
- Alexandr A Kalinin
  - Department of Computational Medicine & Bioinformatics, University of Michigan Medical School, Ann Arbor, MI 48109, USA
  - Statistics Online Computational Resource (SOCR), University of Michigan School of Nursing, Ann Arbor, MI 48109, USA
- Gerald A Higgins
  - Department of Computational Medicine & Bioinformatics, University of Michigan Medical School, Ann Arbor, MI 48109, USA
- Narathip Reamaroon
  - Department of Computational Medicine & Bioinformatics, University of Michigan Medical School, Ann Arbor, MI 48109, USA
- Sayedmohammadreza Soroushmehr
  - Department of Computational Medicine & Bioinformatics, University of Michigan Medical School, Ann Arbor, MI 48109, USA
- Ari Allyn-Feuer
  - Department of Computational Medicine & Bioinformatics, University of Michigan Medical School, Ann Arbor, MI 48109, USA
- Ivo D Dinov
  - Department of Computational Medicine & Bioinformatics, University of Michigan Medical School, Ann Arbor, MI 48109, USA
  - Statistics Online Computational Resource (SOCR), University of Michigan School of Nursing, Ann Arbor, MI 48109, USA
  - Michigan Institute for Data Science (MIDAS), University of Michigan, Ann Arbor, MI 48109, USA
- Kayvan Najarian
  - Department of Computational Medicine & Bioinformatics, University of Michigan Medical School, Ann Arbor, MI 48109, USA
  - Department of Emergency Medicine, University of Michigan Medical School, Ann Arbor, MI 48109, USA
- Brian D Athey
  - Department of Computational Medicine & Bioinformatics, University of Michigan Medical School, Ann Arbor, MI 48109, USA
  - Michigan Institute for Data Science (MIDAS), University of Michigan, Ann Arbor, MI 48109, USA
  - Department of Internal Medicine, University of Michigan Health System, Ann Arbor, MI 48109, USA
  - Department of Psychiatry, University of Michigan Medical School, Ann Arbor, MI 48109, USA
|
28
|
Mourad R, Ginalski K, Legube G, Cuvier O. Predicting double-strand DNA breaks using epigenome marks or DNA at kilobase resolution. Genome Biol 2018; 19:34. [PMID: 29544533 PMCID: PMC5856001 DOI: 10.1186/s13059-018-1411-7] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Received: 10/30/2017] [Accepted: 02/22/2018] [Indexed: 12/18/2022]
Abstract
Double-strand breaks (DSBs) result from the attack of both DNA strands by multiple sources, including radiation and chemicals. DSBs can cause the abnormal chromosomal rearrangements associated with cancer. Recent techniques allow the genome-wide mapping of DSBs at high resolution, enabling the comprehensive study of their origins. However, these techniques are costly and challenging. Hence, we devise a computational approach to predict DSBs using the epigenomic and chromatin context, for which public data are readily available from the ENCODE project. We achieve excellent prediction accuracy at high resolution. We identify chromatin accessibility, activity, and long-range contacts as the best predictors.
Affiliation(s)
- Raphaël Mourad
  - LBME, Centre de Biologie Intégrative (CBI), Université de Toulouse, CNRS, UPS, 118, route de Narbonne, Toulouse, 31062, France
- Krzysztof Ginalski
  - Laboratory of Bioinformatics and Systems Biology, Centre of New Technologies, University of Warsaw, Zwirki i Wigury 93, Warsaw, 02-089, Poland
- Gaëlle Legube
  - LBCMCP, Centre de Biologie Intégrative (CBI), Université de Toulouse, CNRS, UPS, 118, route de Narbonne, Toulouse, 31062, France
- Olivier Cuvier
  - LBME, Centre de Biologie Intégrative (CBI), Université de Toulouse, CNRS, UPS, 118, route de Narbonne, Toulouse, 31062, France
|
29
|
Chaterji S, Ahn EH, Kim DH. CRISPR Genome Engineering for Human Pluripotent Stem Cell Research. Theranostics 2017; 7:4445-4469. [PMID: 29158838 PMCID: PMC5695142 DOI: 10.7150/thno.18456] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Received: 11/22/2016] [Accepted: 08/24/2017] [Indexed: 12/13/2022]
Abstract
The emergence of targeted and efficient genome editing technologies, such as repurposed bacterial programmable nucleases (e.g., CRISPR-Cas systems), has abetted the development of cell engineering approaches. Lessons learned from the development of RNA-interference (RNA-i) therapies can spur the translation of genome editing, such as those enabling the translation of human pluripotent stem cell engineering. In this review, we discuss the opportunities and the challenges of repurposing bacterial nucleases for genome editing, while appreciating their roles, primarily at the epigenomic granularity. First, we discuss the evolution of high-precision, genome editing technologies, highlighting CRISPR-Cas9. They exist in the form of programmable nucleases, engineered with sequence-specific localizing domains, and with the ability to revolutionize human stem cell technologies through precision targeting with greater on-target activities. Next, we highlight the major challenges that need to be met prior to bench-to-bedside translation, often learning from the path-to-clinic of complementary technologies, such as RNA-i. Finally, we suggest potential bioinformatics developments and CRISPR delivery vehicles that can be deployed to circumvent some of the challenges confronting genome editing technologies en route to the clinic.
|