1
|
Wall BPG, Nguyen M, Harrell JC, Dozmorov MG. Machine and Deep Learning Methods for Predicting 3D Genome Organization. Methods Mol Biol 2025; 2856:357-400. [PMID: 39283464 DOI: 10.1007/978-1-0716-4136-1_22] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/25/2024]
Abstract
Three-dimensional (3D) chromatin interactions, such as enhancer-promoter interactions (EPIs), loops, topologically associating domains (TADs), and A/B compartments, play critical roles in a wide range of cellular processes by regulating gene expression. Recent development of chromatin conformation capture technologies has enabled genome-wide profiling of various 3D structures, even with single cells. However, current catalogs of 3D structures remain incomplete and unreliable due to differences in technology, tools, and low data resolution. Machine learning methods have emerged as an alternative to obtain missing 3D interactions and/or improve resolution. Such methods frequently use genome annotation data (ChIP-seq, DNAse-seq, etc.), DNA sequencing information (k-mers and transcription factor binding site (TFBS) motifs), and other genomic properties to learn the associations between genomic features and chromatin interactions. In this review, we discuss computational tools for predicting three types of 3D interactions (EPIs, chromatin interactions, and TAD boundaries) and analyze their pros and cons. We also point out obstacles to the computational prediction of 3D interactions and suggest future research directions.
Collapse
Affiliation(s)
- Brydon P G Wall
- Center for Biological Data Science, Virginia Commonwealth University, Richmond, VA, USA
| | - My Nguyen
- Department of Biostatistics, Virginia Commonwealth University, Richmond, VA, USA
| | - J Chuck Harrell
- Department of Pathology, Virginia Commonwealth University, Richmond, VA, USA
- Massey Comprehensive Cancer Center, Virginia Commonwealth University, Richmond, VA, USA
- Center for Pharmaceutical Engineering, Virginia Commonwealth University, Richmond, VA, USA
| | - Mikhail G Dozmorov
- Department of Biostatistics, Virginia Commonwealth University, Richmond, VA, USA.
- Department of Pathology, Virginia Commonwealth University, Richmond, VA, USA.
| |
Collapse
|
2
|
Lai CHL, Kwok APK, Wong KC. Cheminformatic Identification of Tyrosyl-DNA Phosphodiesterase 1 (Tdp1) Inhibitors: A Comparative Study of SMILES-Based Supervised Machine Learning Models. J Pers Med 2024; 14:981. [PMID: 39338235 PMCID: PMC11433629 DOI: 10.3390/jpm14090981] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/27/2024] [Revised: 09/13/2024] [Accepted: 09/14/2024] [Indexed: 09/30/2024] Open
Abstract
BACKGROUND Tyrosyl-DNA phosphodiesterase 1 (Tdp1) repairs damages in DNA induced by abortive topoisomerase 1 activity; however, maintenance of genetic integrity may sustain cellular division of neoplastic cells. It follows that Tdp1-targeting chemical inhibitors could synergize well with existing chemotherapy drugs to deny cancer growth; therefore, identification of Tdp1 inhibitors may advance precision medicine in oncology. OBJECTIVE Current computational research efforts focus primarily on molecular docking simulations, though datasets involving three-dimensional molecular structures are often hard to curate and computationally expensive to store and process. We propose the use of simplified molecular input line entry system (SMILES) chemical representations to train supervised machine learning (ML) models, aiming to predict potential Tdp1 inhibitors. METHODS An open-sourced consensus dataset containing the inhibitory activity of numerous chemicals against Tdp1 was obtained from Kaggle. Various ML algorithms were trained, ranging from simple algorithms to ensemble methods and deep neural networks. For algorithms requiring numerical data, SMILES were converted to chemical descriptors using RDKit, an open-sourced Python cheminformatics library. RESULTS Out of 13 optimized ML models with rigorously tuned hyperparameters, the random forest model gave the best results, yielding a receiver operating characteristics-area under curve of 0.7421, testing accuracy of 0.6815, sensitivity of 0.6444, specificity of 0.7156, precision of 0.6753, and F1 score of 0.6595. CONCLUSIONS Ensemble methods, especially the bootstrap aggregation mechanism adopted by random forest, outperformed other ML algorithms in classifying Tdp1 inhibitors from non-inhibitors using SMILES. The discovery of Tdp1 inhibitors could unlock more treatment regimens for cancer patients, allowing for therapies tailored to the patient's condition.
Collapse
Affiliation(s)
- Conan Hong-Lun Lai
- Faculty of Medicine, The Chinese University of Hong Kong, Hong Kong 999077, China
- Data Science and Policy Studies Programme, School of Governance and Policy Science, Faculty of Social Science, The Chinese University of Hong Kong, Hong Kong 999077, China
| | - Alex Pak Ki Kwok
- Data Science and Policy Studies Programme, School of Governance and Policy Science, Faculty of Social Science, The Chinese University of Hong Kong, Hong Kong 999077, China
| | - Kwong-Cheong Wong
- Data Science and Policy Studies Programme, School of Governance and Policy Science, Faculty of Social Science, The Chinese University of Hong Kong, Hong Kong 999077, China
| |
Collapse
|
3
|
Tenekeci S, Tekir S. Identifying promoter and enhancer sequences by graph convolutional networks. Comput Biol Chem 2024; 110:108040. [PMID: 38430611 DOI: 10.1016/j.compbiolchem.2024.108040] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/13/2023] [Revised: 01/09/2024] [Accepted: 02/27/2024] [Indexed: 03/05/2024]
Abstract
Identification of promoters, enhancers, and their interactions helps understand genetic regulation. This study proposes a graph-based semi-supervised learning model (GCN4EPI) for the enhancer-promoter classification problem. We adopt a graph convolutional network (GCN) architecture to integrate interaction information with sequence features. Nodes of the constructed graph hold word embeddings of DNA sequences while edges hold the Enhancer-Promoter Interaction (EPI) information. By means of semi-supervised learning, much less data (16%) and time are needed in model training. Comparisons on a benchmark dataset of six human cell lines show that the proposed approach outperforms the state-of-the-art methods by a large margin (10% higher F1 score) and has the fastest training time (up to 3 times). Moreover, GCN4EPI's performance on cross-cell line data is also better than the baselines (3% higher F1 score). Our qualitative analyses with graph explainability models prove that GCN4EPI learns from both text and graph structure. The results suggest that integrating interaction information with sequence features improves predictive performance and compensates for the number of training instances.
Collapse
Affiliation(s)
- Samet Tenekeci
- Department of Computer Engineering, Izmir Institute of Technology, Izmir, 35430, Turkiye
| | - Selma Tekir
- Department of Computer Engineering, Izmir Institute of Technology, Izmir, 35430, Turkiye.
| |
Collapse
|
4
|
Wall BPG, Nguyen M, Harrell JC, Dozmorov MG. Machine and deep learning methods for predicting 3D genome organization. ARXIV 2024:arXiv:2403.03231v1. [PMID: 38495565 PMCID: PMC10942493] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Subscribe] [Scholar Register] [Indexed: 03/19/2024]
Abstract
Three-Dimensional (3D) chromatin interactions, such as enhancer-promoter interactions (EPIs), loops, Topologically Associating Domains (TADs), and A/B compartments play critical roles in a wide range of cellular processes by regulating gene expression. Recent development of chromatin conformation capture technologies has enabled genome-wide profiling of various 3D structures, even with single cells. However, current catalogs of 3D structures remain incomplete and unreliable due to differences in technology, tools, and low data resolution. Machine learning methods have emerged as an alternative to obtain missing 3D interactions and/or improve resolution. Such methods frequently use genome annotation data (ChIP-seq, DNAse-seq, etc.), DNA sequencing information (k-mers, Transcription Factor Binding Site (TFBS) motifs), and other genomic properties to learn the associations between genomic features and chromatin interactions. In this review, we discuss computational tools for predicting three types of 3D interactions (EPIs, chromatin interactions, TAD boundaries) and analyze their pros and cons. We also point out obstacles of computational prediction of 3D interactions and suggest future research directions.
Collapse
Affiliation(s)
- Brydon P. G. Wall
- Center for Biological Data Science, Virginia Commonwealth University, Richmond, VA, 23284, USA
| | - My Nguyen
- Department of Biostatistics, Virginia Commonwealth University, Richmond, VA, 23298, USA
| | - J. Chuck Harrell
- Department of Pathology, Virginia Commonwealth University, Richmond, VA, 23284, USA
- Massey Comprehensive Cancer Center, Virginia Commonwealth University, Richmond, VA 23298, USA
- Center for Pharmaceutical Engineering, Virginia Commonwealth University, Richmond, VA 23298, USA
| | - Mikhail G. Dozmorov
- Department of Biostatistics, Virginia Commonwealth University, Richmond, VA, 23298, USA
- Department of Pathology, Virginia Commonwealth University, Richmond, VA, 23284, USA
| |
Collapse
|
5
|
Hassan J, Saeed SM, Deka L, Uddin MJ, Das DB. Applications of Machine Learning (ML) and Mathematical Modeling (MM) in Healthcare with Special Focus on Cancer Prognosis and Anticancer Therapy: Current Status and Challenges. Pharmaceutics 2024; 16:260. [PMID: 38399314 PMCID: PMC10892549 DOI: 10.3390/pharmaceutics16020260] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/08/2023] [Revised: 01/29/2024] [Accepted: 02/07/2024] [Indexed: 02/25/2024] Open
Abstract
The use of data-driven high-throughput analytical techniques, which has given rise to computational oncology, is undisputed. The widespread use of machine learning (ML) and mathematical modeling (MM)-based techniques is widely acknowledged. These two approaches have fueled the advancement in cancer research and eventually led to the uptake of telemedicine in cancer care. For diagnostic, prognostic, and treatment purposes concerning different types of cancer research, vast databases of varied information with manifold dimensions are required, and indeed, all this information can only be managed by an automated system developed utilizing ML and MM. In addition, MM is being used to probe the relationship between the pharmacokinetics and pharmacodynamics (PK/PD interactions) of anti-cancer substances to improve cancer treatment, and also to refine the quality of existing treatment models by being incorporated at all steps of research and development related to cancer and in routine patient care. This review will serve as a consolidation of the advancement and benefits of ML and MM techniques with a special focus on the area of cancer prognosis and anticancer therapy, leading to the identification of challenges (data quantity, ethical consideration, and data privacy) which are yet to be fully addressed in current studies.
Collapse
Affiliation(s)
- Jasmin Hassan
- Drug Delivery & Therapeutics Lab, Dhaka 1212, Bangladesh; (J.H.); (S.M.S.)
| | | | - Lipika Deka
- Faculty of Computing, Engineering and Media, De Montfort University, Leicester LE1 9BH, UK;
| | - Md Jasim Uddin
- Department of Pharmaceutical Technology, Faculty of Pharmacy, Universiti Malaya, Kuala Lumpur 50603, Malaysia
| | - Diganta B. Das
- Department of Chemical Engineering, Loughborough University, Loughborough LE11 3TU, UK
| |
Collapse
|
6
|
Gao Z, Liu Q, Zeng W, Jiang R, Wong WH. EpiGePT: a Pretrained Transformer model for epigenomics. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2023.07.15.549134. [PMID: 37502861 PMCID: PMC10370089 DOI: 10.1101/2023.07.15.549134] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 07/29/2023]
Abstract
The inherent similarities between natural language and biological sequences have given rise to great interest in adapting the transformer-based large language models (LLMs) underlying recent breakthroughs in natural language processing (references), for applications in genomics. However, current LLMs for genomics suffer from several limitations such as the inability to include chromatin interactions in the training data, and the inability to make prediction in new cellular contexts not represented in the training data. To mitigate these problems, we propose EpiGePT, a transformer-based pretrained language model for predicting context-specific epigenomic signals and chromatin contacts. By taking the context-specific activities of transcription factors (TFs) and 3D genome interactions into consideration, EpiGePT offers wider applicability and deeper biological insights than models trained on DNA sequence only. In a series of experiments, EpiGePT demonstrates superior performance in a diverse set of epigenomic signals prediction tasks when compared to existing methods. In particular, our model enables cross-cell-type prediction of long-range interactions and offers insight on the functional impact of genetic variants under different cellular contexts. These new capabilities will enhance the usefulness of LLM in the study of gene regulatory mechanisms. We provide free online prediction service of EpiGePT through http://health.tsinghua.edu.cn/epigept/.
Collapse
Affiliation(s)
- Zijing Gao
- Ministry of Education Key Laboratory of Bioinformatics, Bioinformatics Division at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China
| | - Qiao Liu
- Department of Statistics, Stanford University, Stanford, CA 94305, USA
| | - Wanwen Zeng
- Department of Statistics, Stanford University, Stanford, CA 94305, USA
| | - Rui Jiang
- Ministry of Education Key Laboratory of Bioinformatics, Bioinformatics Division at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China
| | - Wing Hung Wong
- Department of Statistics, Stanford University, Stanford, CA 94305, USA
- Department of Biomedical Data Science, Bio-X Program, Center for Personal Dynamic Regulomes, Stanford University, Stanford, CA 94305, USA
| |
Collapse
|
7
|
Zhang Y, Boninsegna L, Yang M, Misteli T, Alber F, Ma J. Computational methods for analysing multiscale 3D genome organization. Nat Rev Genet 2024; 25:123-141. [PMID: 37673975 PMCID: PMC11127719 DOI: 10.1038/s41576-023-00638-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 07/12/2023] [Indexed: 09/08/2023]
Abstract
Recent progress in whole-genome mapping and imaging technologies has enabled the characterization of the spatial organization and folding of the genome in the nucleus. In parallel, advanced computational methods have been developed to leverage these mapping data to reveal multiscale three-dimensional (3D) genome features and to provide a more complete view of genome structure and its connections to genome functions such as transcription. Here, we discuss how recently developed computational tools, including machine-learning-based methods and integrative structure-modelling frameworks, have led to a systematic, multiscale delineation of the connections among different scales of 3D genome organization, genomic and epigenomic features, functional nuclear components and genome function. However, approaches that more comprehensively integrate a wide variety of genomic and imaging datasets are still needed to uncover the functional role of 3D genome structure in defining cellular phenotypes in health and disease.
Collapse
Affiliation(s)
- Yang Zhang
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
| | - Lorenzo Boninsegna
- Department of Microbiology, Immunology and Molecular Genetics and Institute for Quantitative and Computational Biosciences, University of California Los Angeles, Los Angeles, CA, USA
| | - Muyu Yang
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
| | - Tom Misteli
- Center for Cancer Research, National Cancer Institute, Bethesda, MD, USA.
| | - Frank Alber
- Department of Microbiology, Immunology and Molecular Genetics and Institute for Quantitative and Computational Biosciences, University of California Los Angeles, Los Angeles, CA, USA.
| | - Jian Ma
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA.
| |
Collapse
|
8
|
Zhang P, Wu H. IChrom-Deep: An Attention-Based Deep Learning Model for Identifying Chromatin Interactions. IEEE J Biomed Health Inform 2023; 27:4559-4568. [PMID: 37402191 DOI: 10.1109/jbhi.2023.3292299] [Citation(s) in RCA: 14] [Impact Index Per Article: 14.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/06/2023]
Abstract
Identification of chromatin interactions is crucial for advancing our knowledge of gene regulation. However, due to the limitations of high-throughput experimental techniques, there is an urgent need to develop computational methods for predicting chromatin interactions. In this study, we propose a novel attention-based deep learning model, termed IChrom-Deep, to identify chromatin interactions using sequence features and genomic features. The experimental results based on the datasets of three cell lines demonstrate that the IChrom-Deep achieves satisfactory performance and is superior to the previous methods. We also investigate the effect of DNA sequence and associated features and genomic features on chromatin interactions, and highlight the applicable scenarios of some features, such as sequence conservation and distance. Moreover, we identify a few genomic features that are extremely important across different cell lines, and IChrom-Deep achieves comparable performance with only these significant genomic features versus using all genomic features. It is believed that IChrom-Deep can serve as a useful tool for future studies that seek to identify chromatin interactions.
Collapse
|
9
|
Tognon M, Giugno R, Pinello L. A survey on algorithms to characterize transcription factor binding sites. Brief Bioinform 2023; 24:bbad156. [PMID: 37099664 PMCID: PMC10422928 DOI: 10.1093/bib/bbad156] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/13/2023] [Revised: 03/27/2023] [Accepted: 04/01/2023] [Indexed: 04/28/2023] Open
Abstract
Transcription factors (TFs) are key regulatory proteins that control the transcriptional rate of cells by binding short DNA sequences called transcription factor binding sites (TFBS) or motifs. Identifying and characterizing TFBS is fundamental to understanding the regulatory mechanisms governing the transcriptional state of cells. During the last decades, several experimental methods have been developed to recover DNA sequences containing TFBS. In parallel, computational methods have been proposed to discover and identify TFBS motifs based on these DNA sequences. This is one of the most widely investigated problems in bioinformatics and is referred to as the motif discovery problem. In this manuscript, we review classical and novel experimental and computational methods developed to discover and characterize TFBS motifs in DNA sequences, highlighting their advantages and drawbacks. We also discuss open challenges and future perspectives that could fill the remaining gaps in the field.
Collapse
Affiliation(s)
- Manuel Tognon
- Computer Science Department, University of Verona, Verona, Italy
- Molecular Pathology Unit, Center for Computational and Integrative Biology and Center for Cancer Research, Massachusetts General Hospital, Charlestown, Massachusetts, United States of America
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
| | - Rosalba Giugno
- Computer Science Department, University of Verona, Verona, Italy
| | - Luca Pinello
- Molecular Pathology Unit, Center for Computational and Integrative Biology and Center for Cancer Research, Massachusetts General Hospital, Charlestown, Massachusetts, United States of America
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
- Department of Pathology, Harvard Medical School, Boston, Massachusetts, United States of America
| |
Collapse
|
10
|
Huang J, Zhou Y, Zhang H, Wu Y. A neural network model to screen feature genes for pancreatic cancer. BMC Bioinformatics 2023; 24:193. [PMID: 37170188 PMCID: PMC10176951 DOI: 10.1186/s12859-023-05322-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/09/2022] [Accepted: 05/05/2023] [Indexed: 05/13/2023] Open
Abstract
All the time, pancreatic cancer is a problem worldwide because of its high degree of malignancy and increased mortality. Neural network model analysis is an efficient and accurate machine learning method that can quickly and accurately predict disease feature genes. The aim of our research was to build a neural network model that would help screen out feature genes for pancreatic cancer diagnosis and prediction of prognosis. Our study confirmed that the neural network model is a reliable way to predict feature genes of pancreatic cancer, and immune cells infiltrating play an essential role in the development of pancreatic cancer, especially neutrophils. ANO1, AHNAK2, and ADAM9 were eventually identified as feature genes of pancreatic cancer, helping to diagnose and predict prognosis. Neural network model analysis provides us with a new idea for finding new intervention targets for pancreatic cancer.
Collapse
Affiliation(s)
- Jing Huang
- Department of Gastroenterology, First Hospital of Jiaxing, Jiaxing, 314001, Zhejiang, China
| | - Yuting Zhou
- Department of Respiratory, The 904Th Hospital of Joint Logistic Support Force of PLA, Affiliated Hospital of Jiangnan University, Wuxi, 214000, Jiangsu, China
| | - Haoran Zhang
- Department of Gastroenterology, First Hospital of Jiaxing, Jiaxing, 314001, Zhejiang, China
| | - Yiming Wu
- Department of Gastroenterology, First Hospital of Jiaxing, Jiaxing, 314001, Zhejiang, China.
| |
Collapse
|
11
|
Xu J, Zhang A, Liu F, Zhang X. STGRNS: an interpretable transformer-based method for inferring gene regulatory networks from single-cell transcriptomic data. Bioinformatics 2023; 39:btad165. [PMID: 37004161 PMCID: PMC10085635 DOI: 10.1093/bioinformatics/btad165] [Citation(s) in RCA: 10] [Impact Index Per Article: 10.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/25/2022] [Revised: 02/28/2023] [Accepted: 03/25/2023] [Indexed: 04/03/2023] Open
Abstract
MOTIVATION Single-cell RNA-sequencing (scRNA-seq) technologies provide an opportunity to infer cell-specific gene regulatory networks (GRNs), which is an important challenge in systems biology. Although numerous methods have been developed for inferring GRNs from scRNA-seq data, it is still a challenge to deal with cellular heterogeneity. RESULTS To address this challenge, we developed an interpretable transformer-based method namely STGRNS for inferring GRNs from scRNA-seq data. In this algorithm, gene expression motif technique was proposed to convert gene pairs into contiguous sub-vectors, which can be used as input for the transformer encoder. By avoiding missing phase-specific regulations in a network, gene expression motif can improve the accuracy of GRN inference for different types of scRNA-seq data. To assess the performance of STGRNS, we implemented the comparative experiments with some popular methods on extensive benchmark datasets including 21 static and 27 time-series scRNA-seq dataset. All the results show that STGRNS is superior to other comparative methods. In addition, STGRNS was also proved to be more interpretable than "black box" deep learning methods, which are well-known for the difficulty to explain the predictions clearly. AVAILABILITY AND IMPLEMENTATION The source code and data are available at https://github.com/zhanglab-wbgcas/STGRNS.
Collapse
Affiliation(s)
- Jing Xu
- Key Laboratory of Plant Germplasm Enhancement and Specialty Agriculture, Wuhan Botanical Garden, Chinese Academy of Sciences, Wuhan 430074, China
- University of Chinese Academy of Sciences, Beijing 100049, China
| | - Aidi Zhang
- Key Laboratory of Plant Germplasm Enhancement and Specialty Agriculture, Wuhan Botanical Garden, Chinese Academy of Sciences, Wuhan 430074, China
| | - Fang Liu
- Key Laboratory of Plant Germplasm Enhancement and Specialty Agriculture, Wuhan Botanical Garden, Chinese Academy of Sciences, Wuhan 430074, China
| | - Xiujun Zhang
- Key Laboratory of Plant Germplasm Enhancement and Specialty Agriculture, Wuhan Botanical Garden, Chinese Academy of Sciences, Wuhan 430074, China
- Center of Economic Botany, Core Botanical Gardens, Chinese Academy of Sciences, Wuhan, 430074 China
| |
Collapse
|
12
|
Linder J, Koplik SE, Kundaje A, Seelig G. Deciphering the impact of genetic variation on human polyadenylation using APARENT2. Genome Biol 2022; 23:232. [PMID: 36335397 PMCID: PMC9636789 DOI: 10.1186/s13059-022-02799-4] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/17/2022] [Accepted: 10/19/2022] [Indexed: 11/08/2022] Open
Abstract
BACKGROUND 3'-end processing by cleavage and polyadenylation is an important and finely tuned regulatory process during mRNA maturation. Numerous genetic variants are known to cause or contribute to human disorders by disrupting the cis-regulatory code of polyadenylation signals. Yet, due to the complexity of this code, variant interpretation remains challenging. RESULTS We introduce a residual neural network model, APARENT2, that can infer 3'-cleavage and polyadenylation from DNA sequence more accurately than any previous model. This model generalizes to the case of alternative polyadenylation (APA) for a variable number of polyadenylation signals. We demonstrate APARENT2's performance on several variant datasets, including functional reporter data and human 3' aQTLs from GTEx. We apply neural network interpretation methods to gain insights into disrupted or protective higher-order features of polyadenylation. We fine-tune APARENT2 on human tissue-resolved transcriptomic data to elucidate tissue-specific variant effects. By combining APARENT2 with models of mRNA stability, we extend aQTL effect size predictions to the entire 3' untranslated region. Finally, we perform in silico saturation mutagenesis of all human polyadenylation signals and compare the predicted effects of [Formula: see text] million variants against gnomAD. While loss-of-function variants were generally selected against, we also find specific clinical conditions linked to gain-of-function mutations. For example, we detect an association between gain-of-function mutations in the 3'-end and autism spectrum disorder. To experimentally validate APARENT2's predictions, we assayed clinically relevant variants in multiple cell lines, including microglia-derived cells. CONCLUSIONS A sequence-to-function model based on deep residual learning enables accurate functional interpretation of genetic variants in polyadenylation signals and, when coupled with large human variation databases, elucidates the link between functional 3'-end mutations and human health.
Collapse
Affiliation(s)
| | | | - Anshul Kundaje
- Department of Genetics, Stanford University, Stanford, USA
- Department of Computer Science, Stanford University, Stanford, USA
| | - Georg Seelig
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, USA
- Department of Electrical and Computer Engineering, University of Washington, Seattle, USA
| |
Collapse
|
13
|
Liu S, Xu X, Yang Z, Zhao X, Liu S, Zhang W. EPIHC: Improving Enhancer-Promoter Interaction Prediction by Using Hybrid Features and Communicative Learning. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:3435-3443. [PMID: 34473626 DOI: 10.1109/tcbb.2021.3109488] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/13/2023]
Abstract
Enhancer-promoter interactions (EPIs) regulate the expression of specific genes in cells, which help facilitate understanding of gene regulation, cell differentiation and disease mechanisms. EPI identification approaches through wet experiments are often costly and time-consuming, leading to the design of high-efficiency computational methods is in demand. In this paper, we propose a deep neural network-based method named EPIHC to predict Enhancer-Promoter Interactions with Hybrid features and Communicative learning. EPIHC extracts enhancer and promoter sequence-derived features using convolutional neural networks (CNN), and then we design a communicative learning module to capture the communicative information between enhancer and promoter sequences. Besides, EPIHC takes the genomic features of enhancers and promoters into account, incorporating with the sequence-derived features to predict EPIs. The computational experiments show that EPIHC outperforms the existing state-of-the-art EPI prediction methods on the benchmark datasets and chromosome-split datasets, and the study reveals that the communicative learning module can bring explicit information about EPIs, which is ignored by CNN, and provide explainability about EPIs to some degree. Moreover, we consider two strategies to improve the performances of EPIHC in the cross-cell line prediction, and experimental results show that EPIHC constructed on some cell lines can exhibit good performances for other cell lines. The codes and data are available at https://github.com/BioMedicalBigDataMiningLab/EPIHC.
Collapse
|
14
|
Giacoman-Lozano M, Meléndez-Ramírez C, Martinez-Ledesma E, Cuevas-Diaz Duran R, Velasco I. Epigenetics of neural differentiation: Spotlight on enhancers. Front Cell Dev Biol 2022; 10:1001701. [PMID: 36313573 PMCID: PMC9606577 DOI: 10.3389/fcell.2022.1001701] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/23/2022] [Accepted: 10/03/2022] [Indexed: 11/28/2022] Open
Abstract
Neural induction, both in vivo and in vitro, includes cellular and molecular changes that result in phenotypic specialization related to specific transcriptional patterns. These changes are achieved through the implementation of complex gene regulatory networks. Furthermore, these regulatory networks are influenced by epigenetic mechanisms that drive cell heterogeneity and cell-type specificity, in a controlled and complex manner. Epigenetic marks, such as DNA methylation and histone residue modifications, are highly dynamic and stage-specific during neurogenesis. Genome-wide assessment of these modifications has allowed the identification of distinct non-coding regulatory regions involved in neural cell differentiation, maturation, and plasticity. Enhancers are short DNA regulatory regions that bind transcription factors (TFs) and interact with gene promoters to increase transcriptional activity. They are of special interest in neuroscience because they are enriched in neurons and underlie the cell-type-specificity and dynamic gene expression profiles. Classification of the full epigenomic landscape of neural subtypes is important to better understand gene regulation in brain health and during diseases. Advances in novel next-generation high-throughput sequencing technologies, genome editing, Genome-wide association studies (GWAS), stem cell differentiation, and brain organoids are allowing researchers to study brain development and neurodegenerative diseases with an unprecedented resolution. Herein, we describe important epigenetic mechanisms related to neurogenesis in mammals. We focus on the potential roles of neural enhancers in neurogenesis, cell-fate commitment, and neuronal plasticity. We review recent findings on epigenetic regulatory mechanisms involved in neurogenesis and discuss how sequence variations within enhancers may be associated with genetic risk for neurological and psychiatric disorders.
Collapse
Affiliation(s)
- Mayela Giacoman-Lozano
- Tecnologico de Monterrey, Escuela de Medicina y Ciencias de la Salud, Monterrey, NL, Mexico
| | - César Meléndez-Ramírez
- Instituto de Fisiología Celular—Neurociencias, Universidad Nacional Autónoma de Mexico, Mexico City, Mexico
- Laboratorio de Reprogramación Celular, Instituto Nacional de Neurología y Neurocirugía “Manuel Velasco Suárez”, Mexico City, Mexico
| | - Emmanuel Martinez-Ledesma
- Tecnologico de Monterrey, Escuela de Medicina y Ciencias de la Salud, Monterrey, NL, Mexico
- Tecnologico de Monterrey, The Institute for Obesity Research, Monterrey, NL, Mexico
| | - Raquel Cuevas-Diaz Duran
- Tecnologico de Monterrey, Escuela de Medicina y Ciencias de la Salud, Monterrey, NL, Mexico
- *Correspondence: Raquel Cuevas-Diaz Duran, ; Iván Velasco,
| | - Iván Velasco
- Instituto de Fisiología Celular—Neurociencias, Universidad Nacional Autónoma de Mexico, Mexico City, Mexico
- Laboratorio de Reprogramación Celular, Instituto Nacional de Neurología y Neurocirugía “Manuel Velasco Suárez”, Mexico City, Mexico
- *Correspondence: Raquel Cuevas-Diaz Duran, ; Iván Velasco,
| |
Collapse
|
15
|
Zeng W, Liu Q, Yin Q, Jiang R, Wong WH. HiChIPdb: a comprehensive database of HiChIP regulatory interactions. Nucleic Acids Res 2022; 51:D159-D166. [PMID: 36215037 PMCID: PMC9825415 DOI: 10.1093/nar/gkac859] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/27/2022] [Revised: 09/19/2022] [Accepted: 09/27/2022] [Indexed: 01/29/2023] Open
Abstract
Elucidating the role of 3D architecture of DNA in gene regulation is crucial for understanding cell differentiation, tissue homeostasis and disease development. Among various chromatin conformation capture methods, HiChIP has received increasing attention for its significant improvement over other methods in profiling of regulatory (e.g. H3K27ac) and structural (e.g. cohesin) interactions. To facilitate the studies of 3D regulatory interactions, we developed a HiChIP interactions database, HiChIPdb (http://health.tsinghua.edu.cn/hichipdb/). The current version of HiChIPdb contains ∼262M annotated HiChIP interactions from 200 high-throughput HiChIP samples across 108 cell types. The functionalities of HiChIPdb include: (i) standardized categorization of HiChIP interactions in a hierarchical structure based on organ, tissue and cell line and (ii) comprehensive annotations of HiChIP interactions with regulatory genes and GWAS Catalog SNPs. To the best of our knowledge, HiChIPdb is the first comprehensive database that utilizes a unified pipeline to map the functional interactions across diverse cell types and tissues in different resolutions. We believe this database has the potential to advance cutting-edge research in regulatory mechanisms in development and disease by removing the barrier in data aggregation, preprocessing, and analysis.
Collapse
Affiliation(s)
| | | | | | - Rui Jiang
- Correspondence may also be addressed to Rui Jiang. Tel: +86 10 6279 5578;
| | | |
Collapse
|
16
|
Miller D, Stern A, Burstein D. Deciphering microbial gene function using natural language processing. Nat Commun 2022; 13:5731. [PMID: 36175448 PMCID: PMC9523054 DOI: 10.1038/s41467-022-33397-4] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/18/2022] [Accepted: 09/16/2022] [Indexed: 11/08/2022] Open
Abstract
Revealing the function of uncharacterized genes is a fundamental challenge in an era of ever-increasing volumes of sequencing data. Here, we present a concept for tackling this challenge using deep learning methodologies adopted from natural language processing (NLP). We repurpose NLP algorithms to model "gene semantics" based on a biological corpus of more than 360 million microbial genes within their genomic context. We use the language models to predict functional categories for 56,617 genes and find that out of 1369 genes associated with recently discovered defense systems, 98% are inferred correctly. We then systematically evaluate the "discovery potential" of different functional categories, pinpointing those with the most genes yet to be characterized. Finally, we demonstrate our method's ability to discover systems associated with microbial interaction and defense. Our results highlight that combining microbial genomics and language models is a promising avenue for revealing gene functions in microbes.
Collapse
Affiliation(s)
- Danielle Miller
- The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel-Aviv University, Tel-Aviv, 6997801, Israel
| | - Adi Stern
- The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel-Aviv University, Tel-Aviv, 6997801, Israel
| | - David Burstein
- The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel-Aviv University, Tel-Aviv, 6997801, Israel.
| |
Collapse
|
17
|
Alharbi WS, Rashid M. A review of deep learning applications in human genomics using next-generation sequencing data. Hum Genomics 2022; 16:26. [PMID: 35879805 PMCID: PMC9317091 DOI: 10.1186/s40246-022-00396-x] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/24/2021] [Accepted: 07/12/2022] [Indexed: 12/02/2022] Open
Abstract
Genomics is advancing towards data-driven science. Through the advent of high-throughput data generating technologies in human genomics, we are overwhelmed with the heap of genomic data. To extract knowledge and pattern out of this genomic data, artificial intelligence especially deep learning methods has been instrumental. In the current review, we address development and application of deep learning methods/models in different subarea of human genomics. We assessed over- and under-charted area of genomics by deep learning techniques. Deep learning algorithms underlying the genomic tools have been discussed briefly in later part of this review. Finally, we discussed briefly about the late application of deep learning tools in genomic. Conclusively, this review is timely for biotechnology or genomic scientists in order to guide them why, when and how to use deep learning methods to analyse human genomic data.
Collapse
Affiliation(s)
- Wardah S Alharbi
- Department of AI and Bioinformatics, King Abdullah International Medical Research Center (KAIMRC), King Saud Bin Abdulaziz University for Health Sciences (KSAU-HS), King Abdulaziz Medical City, Ministry of National Guard Health Affairs, P.O. Box 22490, Riyadh, 11426, Saudi Arabia
| | - Mamoon Rashid
- Department of AI and Bioinformatics, King Abdullah International Medical Research Center (KAIMRC), King Saud Bin Abdulaziz University for Health Sciences (KSAU-HS), King Abdulaziz Medical City, Ministry of National Guard Health Affairs, P.O. Box 22490, Riyadh, 11426, Saudi Arabia.
| |
Collapse
|
18
|
DNA Computing: Concepts for Medical Applications. APPLIED SCIENCES-BASEL 2022. [DOI: 10.3390/app12146928] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/17/2022]
Abstract
The branch of informatics that deals with construction and operation of computers built of DNA, is one of the research directions which investigates issues related to the use of DNA as hardware and software. This concept assumes the use of DNA computers due to their biological origin mainly for intelligent, personalized and targeted diagnostics frequently related to therapy. Important elements of this concept are (1) the retrieval of unique DNA sequences using machine learning methods and, based on the results of this process, (2) the construction/design of smart diagnostic biochip projects. The authors of this paper propose a new concept of designing diagnostic biochips, the key elements of which are machine-learning methods and the concept of biomolecular queue automata. This approach enables the scheduling of computational tasks at the molecular level by sequential events of cutting and ligating DNA molecules. We also summarize current challenges and perspectives of biomolecular computer application and machine-learning approaches using DNA sequence data mining.
Collapse
|
19
|
Mora A, Huang X, Jauhari S, Jiang Q, Li X. Chromatin Hubs: A biological and computational outlook. Comput Struct Biotechnol J 2022; 20:3796-3813. [PMID: 35891791 PMCID: PMC9304431 DOI: 10.1016/j.csbj.2022.07.002] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/14/2022] [Revised: 07/02/2022] [Accepted: 07/02/2022] [Indexed: 11/20/2022] Open
Abstract
This review discusses our current understanding of chromatin biology and bioinformatics under the unifying concept of “chromatin hubs.” The first part reviews the biology of chromatin hubs, including chromatin–chromatin interaction hubs, chromatin hubs at the nuclear periphery, hubs around macromolecules such as RNA polymerase or lncRNAs, and hubs around nuclear bodies such as the nucleolus or nuclear speckles. The second part reviews existing computational methods, including enhancer–promoter interaction prediction, network analysis, chromatin domain callers, transcription factory predictors, and multi-way interaction analysis. We introduce an integrated model that makes sense of the existing evidence. Understanding chromatin hubs may allow us (i) to explain long-unsolved biological questions such as interaction specificity and redundancy of mechanisms, (ii) to develop more realistic kinetic and functional predictions, and (iii) to explain the etiology of genomic disease.
Collapse
Affiliation(s)
- Antonio Mora
- Joint School of Life Sciences, Guangzhou Medical University and Guangzhou Institutes of Biomedicine and Health (Chinese Academy of Sciences), Guangzhou 511436, PR China
- Corresponding authors.
| | - Xiaowei Huang
- Joint School of Life Sciences, Guangzhou Medical University and Guangzhou Institutes of Biomedicine and Health (Chinese Academy of Sciences), Guangzhou 511436, PR China
| | - Shaurya Jauhari
- Joint School of Life Sciences, Guangzhou Medical University and Guangzhou Institutes of Biomedicine and Health (Chinese Academy of Sciences), Guangzhou 511436, PR China
| | - Qin Jiang
- Affiliated Eye Hospital of Nanjing Medical University, Nanjing 210000, PR China
| | - Xuri Li
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-Sen University, and Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangzhou 510060, PR China
- Corresponding authors.
| |
Collapse
|
20
|
An Effective Deep Learning-Based Architecture for Prediction of N7-Methylguanosine Sites in Health Systems. ELECTRONICS 2022. [DOI: 10.3390/electronics11121917] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/05/2023]
Abstract
N7-methylguanosine (m7G) is one of the most important epigenetic modifications found in rRNA, mRNA, and tRNA, and performs a promising role in gene expression regulation. Owing to its significance, well-equipped traditional laboratory-based techniques have been performed for the identification of N7-methylguanosine (m7G). Consequently, these approaches were found to be time-consuming and cost-ineffective. To move on from these traditional approaches to predict N7-methylguanosine sites with high precision, the concept of artificial intelligence has been adopted. In this study, an intelligent computational model called N7-methylguanosine-Long short-term memory (m7G-LSTM) is introduced for the prediction of N7-methylguanosine sites. One-hot encoding and word2vec feature schemes are used to express the biological sequences while the LSTM and CNN algorithms have been employed for classification. The proposed “m7G-LSTM” model obtained an accuracy value of 95.95%, a specificity value of 95.94%, a sensitivity value of 95.97%, and Matthew’s correlation coefficient (MCC) value of 0.919. The proposed predictive m7G-LSTM model has significantly achieved better outcomes than previous models in terms of all evaluation parameters. The proposed m7G-LSTM computational system aims to support the drug industry and help researchers in the fields of bioinformatics to enhance innovation for the prediction of the behavior of N7-methylguanosine sites.
Collapse
|
21
|
Lu Y, Feng Z, Zhang S, Wang Y. Annotating regulatory elements by heterogeneous network embedding. Bioinformatics 2022; 38:2899-2911. [PMID: 35561169 PMCID: PMC9326849 DOI: 10.1093/bioinformatics/btac185] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/05/2021] [Revised: 03/05/2022] [Accepted: 03/24/2022] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Regulatory elements (REs), such as enhancers and promoters, are known as regulatory sequences functional in a heterogeneous regulatory network to control gene expression by recruiting transcription regulators and carrying genetic variants in a context specific way. Annotating those REs relies on costly and labor-intensive next-generation sequencing and RNA-guided editing technologies in many cellular contexts. RESULTS We propose a systematic Gene Ontology Annotation method for Regulatory Elements (RE-GOA) by leveraging the powerful word embedding in natural language processing. We first assemble a heterogeneous network by integrating context specific regulations, protein-protein interactions and gene ontology (GO) terms. Then we perform network embedding and associate regulatory elements with GO terms by assessing their similarity in a low dimensional vector space. With three applications, we show that RE-GOA outperforms existing methods in annotating TFs' binding sites from ChIP-seq data, in functional enrichment analysis of differentially accessible peaks from ATAC-seq data, and in revealing genetic correlation among phenotypes from their GWAS summary statistics data. AVAILABILITY AND IMPLEMENTATION The source code and the systematic RE annotation for human and mouse are available at https://github.com/AMSSwanglab/RE-GOA. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Yurun Lu
- CEMS, NCMIS, HCMS, MADIS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
- School of Mathematics, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Beijing 100049, China
| | - Zhanying Feng
- CEMS, NCMIS, HCMS, MADIS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
- School of Mathematics, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Beijing 100049, China
| | - Songmao Zhang
- CEMS, NCMIS, HCMS, MADIS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
| | - Yong Wang
- To whom correspondence should be addressed. or
| |
Collapse
|
22
|
Geng Q, Yang R, Zhang L. A deep learning framework for enhancer prediction using word embedding and sequence generation. Biophys Chem 2022; 286:106822. [DOI: 10.1016/j.bpc.2022.106822] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/07/2022] [Revised: 04/21/2022] [Accepted: 04/29/2022] [Indexed: 11/28/2022]
|
23
|
Yin Q, Liu Q, Fu Z, Zeng W, Zhang B, Zhang X, Jiang R, Lv H. scGraph: a graph neural network-based approach to automatically identify cell types. Bioinformatics 2022; 38:2996-3003. [PMID: 35394015 DOI: 10.1093/bioinformatics/btac199] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/22/2021] [Revised: 12/13/2021] [Accepted: 04/07/2020] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Single cell technologies play a crucial role in revolutionizing biological research over the past decade, which strengthens our understanding in cell differentiation, development, and regulation from a single-cell level perspective. Single-cell RNA sequencing (scRNA-seq) is one of the most common single cell technologies, which enables probing transcriptional states in thousands of cells in one experiment. Identification of cell types from scRNA-seq measurements is a fundamental and crucial question to answer. Most previous studies directly take gene expression as input while ignoring the comprehensive gene-gene interactions. RESULTS We propose scGraph, an automatic cell identification algorithm leveraging gene interaction relationships to enhance the performance of the cell type identification. ScGraph is based on a graph neural network to aggregate the information of interacting genes. In a series of experiments, we demonstrate that scGraph is accurate and outperforms eight comparison methods in the task of cell type identification. Moreover, scGraph automatically learns the gene interaction relationships from biological data and the pathway enrichment analysis shows consistent findings with previous analysis, providing insights on the analysis of regulatory mechanism. AVAILABILITY scGraph is freely available at https://github.com/QijinYin/scGraph and https://figshare.com/articles/software/scGraph/17157743. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Qijin Yin
- Ministry of Education Key Laboratory of Bioinformatics, Research Department of Bioinformatics at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China
| | - Qiao Liu
- Department of Statistics, Stanford University Stanford, CA 94305
| | - Zhuoran Fu
- Ministry of Education Key Laboratory of Bioinformatics, Research Department of Bioinformatics at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China
| | - Wanwen Zeng
- Department of Statistics, Stanford University Stanford, CA 94305.,College of Software, Nankai University, Tianjin, 300350, China
| | - Boheng Zhang
- Ministry of Education Key Laboratory of Bioinformatics, Research Department of Bioinformatics at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China
| | - Xuegong Zhang
- Ministry of Education Key Laboratory of Bioinformatics, Research Department of Bioinformatics at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China
| | - Rui Jiang
- Ministry of Education Key Laboratory of Bioinformatics, Research Department of Bioinformatics at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China
| | - Hairong Lv
- Ministry of Education Key Laboratory of Bioinformatics, Research Department of Bioinformatics at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China.,Fuzhou Institute of Data Technology, Changle, Fuzhou, 350200, China
| |
Collapse
|
24
|
Wang S, Hu H, Li X. A systematic study of motif pairs that may facilitate enhancer-promoter interactions. J Integr Bioinform 2022; 19:jib-2021-0038. [PMID: 35130376 PMCID: PMC9069648 DOI: 10.1515/jib-2021-0038] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/15/2021] [Accepted: 01/20/2022] [Indexed: 01/06/2023] Open
Abstract
Pairs of interacting transcription factors (TFs) have previously been shown to bind to enhancers and promoters and contribute to their physical interactions. However, to date, we have limited knowledge about such TF pairs. To fill this void, we systematically studied the co-occurrence of TF-binding motifs in interacting enhancer-promoter (EP) pairs in seven human cell lines. We discovered 423 motif pairs that significantly co-occur in enhancers and promoters of interacting EP pairs. We demonstrated that these motif pairs are biologically meaningful and significantly enriched with motif pairs of known interacting TF pairs. We also showed that the identified motif pairs facilitated the discovery of the interacting EP pairs. The developed pipeline, EPmotifPair, together with the predicted motifs and motif pairs, is available at https://doi.org/10.6084/m9.figshare.14192000. Our study provides a comprehensive list of motif pairs that may contribute to EP physical interactions, which facilitate generating meaningful hypotheses for experimental validation.
Collapse
Affiliation(s)
- Saidi Wang
- Department of Computer Science, University of Central Florida, Orlando, FL, 32816, USA
| | - Haiyan Hu
- Department of Computer Science, University of Central Florida, Orlando, FL, 32816, USA
| | - Xiaoman Li
- Burnett school of Biomedical Science, College of Medicine, University of Central Florida, Orlando, FL, 32816, USA
| |
Collapse
|
25
|
Greener JG, Kandathil SM, Moffat L, Jones DT. A guide to machine learning for biologists. Nat Rev Mol Cell Biol 2022; 23:40-55. [PMID: 34518686 DOI: 10.1038/s41580-021-00407-0] [Citation(s) in RCA: 579] [Impact Index Per Article: 289.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 07/23/2021] [Indexed: 02/08/2023]
Abstract
The expanding scale and inherent complexity of biological data have encouraged a growing use of machine learning in biology to build informative and predictive models of the underlying biological processes. All machine learning techniques fit models to data; however, the specific methods are quite varied and can at first glance seem bewildering. In this Review, we aim to provide readers with a gentle introduction to a few key machine learning techniques, including the most recently developed and widely used techniques involving deep neural networks. We describe how different techniques may be suited to specific types of biological data, and also discuss some best practices and points to consider when one is embarking on experiments involving machine learning. Some emerging directions in machine learning methodology are also discussed.
Collapse
Affiliation(s)
- Joe G Greener
- Department of Computer Science, University College London, London, UK
| | - Shaun M Kandathil
- Department of Computer Science, University College London, London, UK
| | - Lewis Moffat
- Department of Computer Science, University College London, London, UK
| | - David T Jones
- Department of Computer Science, University College London, London, UK.
| |
Collapse
|
26
|
Montesinos-López OA, Montesinos-López A, Hernandez-Suarez CM, Barrón-López JA, Crossa J. Deep-learning power and perspectives for genomic selection. THE PLANT GENOME 2021; 14:e20122. [PMID: 34309215 DOI: 10.1002/tpg2.20122] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 04/11/2021] [Accepted: 05/24/2021] [Indexed: 06/13/2023]
Abstract
Deep learning (DL) is revolutionizing the development of artificial intelligence systems. For example, before 2015, humans were better than artificial machines at classifying images and solving many problems of computer vision (related to object localization and detection using images), but nowadays, artificial machines have surpassed the ability of humans in this specific task. This is just one example of how the application of these models has surpassed human abilities and the performance of other machine-learning algorithms. For this reason, DL models have been adopted for genomic selection (GS). In this article we provide insight about the power of DL in solving complex prediction tasks and how combining GS and DL models can accelerate the revolution provoked by GS methodology in plant breeding. Furthermore, we will mention some trends of DL methods, emphasizing some areas of opportunity to really exploit the DL methodology in GS; however, we are aware that considerable research is required to be able not only to use the existing DL in conjunction with GS, but to adapt and develop DL methods that take the peculiarities of breeding inputs and GS into consideration.
Collapse
Affiliation(s)
| | - Abelardo Montesinos-López
- Departamento de Matemáticas, Centro Universitario de Ciencias Exactas e Ingenierías (CUCEI), Universidad de Guadalajara, Guadalajara, Jalisco, 44430, México
| | | | - José Alberto Barrón-López
- Department of Animal Production (DPA), Universidad Nacional Agraria La Molina, Av. La Molina s/n La Molina, Lima, 15024, Perú
| | - José Crossa
- Colegio de Postgraduados, Montecillos, Edo, de México, 56230, México
- Biometrics and Statistics Unit, Genetic Resources Program, International Maize and Wheat Improvement Center (CIMMYT), Km 45, Carretera Mexico-Veracruz, Edo. De, Mexico DF, 52640, Mexico
| |
Collapse
|
27
|
Zhang M, Hu Y, Zhu M. EPIsHilbert: Prediction of Enhancer-Promoter Interactions via Hilbert Curve Encoding and Transfer Learning. Genes (Basel) 2021; 12:genes12091385. [PMID: 34573367 PMCID: PMC8472018 DOI: 10.3390/genes12091385] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/19/2021] [Revised: 08/31/2021] [Accepted: 09/01/2021] [Indexed: 12/19/2022] Open
Abstract
Enhancer-promoter interactions (EPIs) play a significant role in the regulation of gene transcription. However, enhancers may not necessarily interact with the closest promoters, but with distant promoters via chromatin looping. Considering the spatial position relationship between enhancers and their target promoters is important for predicting EPIs. Most existing methods only consider sequence information regardless of spatial information. On the other hand, recent computational methods lack generalization capability across different cell line datasets. In this paper, we propose EPIsHilbert, which uses Hilbert curve encoding and two transfer learning approaches. Hilbert curve encoding can preserve the spatial position information between enhancers and promoters. Additionally, we use visualization techniques to explore important sequence fragments that have a high impact on EPIs and the spatial relationships between them. Transfer learning can improve prediction performance across cell lines. In order to further prove the effectiveness of transfer learning, we analyze the sequence coincidence of different cell lines. Experimental results demonstrate that EPIsHilbert is a state-of-the-art model that is superior to most of the existing methods both in specific cell lines and cross cell lines.
Collapse
|
28
|
Liu N, Low WY, Alinejad-Rokny H, Pederson S, Sadlon T, Barry S, Breen J. Seeing the forest through the trees: prioritising potentially functional interactions from Hi-C. Epigenetics Chromatin 2021; 14:41. [PMID: 34454581 PMCID: PMC8399707 DOI: 10.1186/s13072-021-00417-4] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/29/2021] [Accepted: 08/19/2021] [Indexed: 11/30/2022] Open
Abstract
Eukaryotic genomes are highly organised within the nucleus of a cell, allowing widely dispersed regulatory elements such as enhancers to interact with gene promoters through physical contacts in three-dimensional space. Recent chromosome conformation capture methodologies such as Hi-C have enabled the analysis of interacting regions of the genome providing a valuable insight into the three-dimensional organisation of the chromatin in the nucleus, including chromosome compartmentalisation and gene expression. Complicating the analysis of Hi-C data, however, is the massive amount of identified interactions, many of which do not directly drive gene function, thus hindering the identification of potentially biologically functional 3D interactions. In this review, we collate and examine the downstream analysis of Hi-C data with particular focus on methods that prioritise potentially functional interactions. We classify three groups of approaches: structural-based discovery methods, e.g. A/B compartments and topologically associated domains, detection of statistically significant chromatin interactions, and the use of epigenomic data integration to narrow down useful interaction information. Careful use of these three approaches is crucial to successfully identifying potentially functional interactions within the genome.
Collapse
Affiliation(s)
- Ning Liu
- Computational & Systems Biology, Precision Medicine Theme, South Australian Health & Medical Research Institute, SA, 5000, Adelaide, Australia
- Robinson Research Institute, University of Adelaide, SA, 5005, Adelaide, Australia
- Adelaide Medical School, University of Adelaide, SA, 5005, Adelaide, Australia
| | - Wai Yee Low
- The Davies Research Centre, School of Animal and Veterinary Sciences, University of Adelaide, Roseworthy, SA, 5371, Australia
| | - Hamid Alinejad-Rokny
- BioMedical Machine Learning Lab, The Graduate School of Biomedical Engineering, The University of New South Wales, NSW, 2052, Sydney, Australia
- Core Member of UNSW Data Science Hub, The University of New South Wales, 2052, Sydney, Australia
| | - Stephen Pederson
- Adelaide Medical School, University of Adelaide, SA, 5005, Adelaide, Australia
- Dame Roma Mitchell Cancer Research Laboratories (DRMCRL), Adelaide Medical School, University of Adelaide, SA, 5005, Adelaide, Australia
| | - Timothy Sadlon
- Robinson Research Institute, University of Adelaide, SA, 5005, Adelaide, Australia
- Women's & Children's Health Network, SA, 5006, North Adelaide, Australia
| | - Simon Barry
- Robinson Research Institute, University of Adelaide, SA, 5005, Adelaide, Australia
- Core Member of UNSW Data Science Hub, The University of New South Wales, 2052, Sydney, Australia
- Women's & Children's Health Network, SA, 5006, North Adelaide, Australia
| | - James Breen
- Computational & Systems Biology, Precision Medicine Theme, South Australian Health & Medical Research Institute, SA, 5000, Adelaide, Australia.
- Robinson Research Institute, University of Adelaide, SA, 5005, Adelaide, Australia.
- Adelaide Medical School, University of Adelaide, SA, 5005, Adelaide, Australia.
- South Australian Genomics Centre (SAGC), South Australian Health & Medical Research Institute (SAHMRI), SA, 5000, Adelaide, Australia.
| |
Collapse
|
29
|
Szyman K, Wilczyński B, Dąbrowski M. K-mer Content Changes with Node Degree in Promoter-Enhancer Network of Mouse ES Cells. Int J Mol Sci 2021; 22:ijms22158067. [PMID: 34360860 PMCID: PMC8347099 DOI: 10.3390/ijms22158067] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/29/2021] [Revised: 07/16/2021] [Accepted: 07/24/2021] [Indexed: 11/16/2022] Open
Abstract
Maps of Hi-C contacts between promoters and enhancers can be analyzed as networks, with cis-regulatory regions as nodes and their interactions as edges. We checked if in the published promoter-enhancer network of mouse embryonic stem (ES) cells the differences in the node type (promoter or enhancer) and the node degree (number of regions interacting with a given promoter or enhancer) are reflected by sequence composition or sequence similarity of the interacting nodes. We used counts of all k-mers (k = 4) to analyze the sequence composition and the Euclidean distance between the k-mer count vectors (k-mer distance) as the measure of sequence (dis)similarity. The results we obtained with 4-mers are interpretable in terms of dinucleotides. Promoters are GC-rich as compared to enhancers, which is known. Enhancers are enriched in scaffold/matrix attachment regions (S/MARs) patterns and depleted of CpGs. Furthermore, we show that promoters are more similar to their interacting enhancers than vice-versa. Most notably, in both promoters and enhancers, the GC content and the CpG count increase with the node degree. As a consequence, enhancers of higher node degree become more similar to promoters, whereas higher degree promoters become less similar to enhancers. We confirmed the key results also for human keratinocytes.
Collapse
Affiliation(s)
- Kinga Szyman
- Laboratory of Bioinformatics, Nencki Institute of Experimental Biology, 02-093 Warsaw, Poland;
- Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, 02-097 Warsaw, Poland;
| | - Bartek Wilczyński
- Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, 02-097 Warsaw, Poland;
| | - Michał Dąbrowski
- Laboratory of Bioinformatics, Nencki Institute of Experimental Biology, 02-093 Warsaw, Poland;
- Correspondence:
| |
Collapse
|
30
|
Min X, Lu F, Li C. Sequence-Based Deep Learning Frameworks on Enhancer-Promoter Interactions Prediction. Curr Pharm Des 2021; 27:1847-1855. [PMID: 33234095 DOI: 10.2174/1381612826666201124112710] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/15/2020] [Revised: 07/29/2020] [Accepted: 08/06/2020] [Indexed: 11/22/2022]
Abstract
Enhancer-promoter interactions (EPIs) in the human genome are of great significance to transcriptional regulation, which tightly controls gene expression. Identification of EPIs can help us better decipher gene regulation and understand disease mechanisms. However, experimental methods to identify EPIs are constrained by funds, time, and manpower, while computational methods using DNA sequences and genomic features are viable alternatives. Deep learning methods have shown promising prospects in classification and efforts that have been utilized to identify EPIs. In this survey, we specifically focus on sequence-based deep learning methods and conduct a comprehensive review of the literature. First, we briefly introduce existing sequence- based frameworks on EPIs prediction and their technique details. After that, we elaborate on the dataset, pre-processing means, and evaluation strategies. Finally, we concluded with the challenges these methods are confronted with and suggest several future opportunities. We hope this review will provide a useful reference for further studies on enhancer-promoter interactions.
Collapse
Affiliation(s)
- Xiaoping Min
- School of Informatics, Xiamen University, Xiamen 361005, China
| | - Fengqing Lu
- School of Informatics, Xiamen University, Xiamen 361005, China
| | - Chunyan Li
- Graduate School, Yunnan Minzu University, Kunming 650504, China
| |
Collapse
|
31
|
Iuchi H, Matsutani T, Yamada K, Iwano N, Sumi S, Hosoda S, Zhao S, Fukunaga T, Hamada M. Representation learning applications in biological sequence analysis. Comput Struct Biotechnol J 2021; 19:3198-3208. [PMID: 34141139 PMCID: PMC8190442 DOI: 10.1016/j.csbj.2021.05.039] [Citation(s) in RCA: 32] [Impact Index Per Article: 10.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/26/2021] [Revised: 05/10/2021] [Accepted: 05/20/2021] [Indexed: 12/16/2022] Open
Abstract
Although remarkable advances have been reported in high-throughput sequencing, the ability to aptly analyze a substantial amount of rapidly generated biological (DNA/RNA/protein) sequencing data remains a critical hurdle. To tackle this issue, the application of natural language processing (NLP) to biological sequence analysis has received increased attention. In this method, biological sequences are regarded as sentences while the single nucleic acids/amino acids or k-mers in these sequences represent the words. Embedding is an essential step in NLP, which performs the conversion of these words into vectors. Specifically, representation learning is an approach used for this transformation process, which can be applied to biological sequences. Vectorized biological sequences can then be applied for function and structure estimation, or as input for other probabilistic models. Considering the importance and growing trend for the application of representation learning to biological research, in the present study, we have reviewed the existing knowledge in representation learning for biological sequence analysis.
Collapse
Affiliation(s)
- Hitoshi Iuchi
- Waseda Research Institute for Science and Engineering, Waseda University, Tokyo 169-8555, Japan
- Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 169-8555, Japan
| | - Taro Matsutani
- Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 169-8555, Japan
- Graduate School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
| | - Keisuke Yamada
- School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
| | - Natsuki Iwano
- Graduate School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
| | - Shunsuke Sumi
- Graduate School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
- Department of Life Science Frontiers, Center for iPS Cell Research and Application, Kyoto University, Kyoto 606-8507, Japan
| | - Shion Hosoda
- Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 169-8555, Japan
- Graduate School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
| | - Shitao Zhao
- Waseda Research Institute for Science and Engineering, Waseda University, Tokyo 169-8555, Japan
| | - Tsukasa Fukunaga
- Waseda Institute for Advanced Study, Waseda University, Tokyo 169-0051, Japan
- Department of Computer Science, Graduate School of Information Science and Technology, The University of Tokyo, Tokyo 113-0032, Japan
| | - Michiaki Hamada
- Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 169-8555, Japan
- Graduate School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
- School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
- Graduate School of Medicine, Nippon Medical School, Tokyo 113-8602, Japan
| |
Collapse
|
32
|
Talukder A, Barham C, Li X, Hu H. Interpretation of deep learning in genomics and epigenomics. Brief Bioinform 2021; 22:bbaa177. [PMID: 34020542 PMCID: PMC8138893 DOI: 10.1093/bib/bbaa177] [Citation(s) in RCA: 49] [Impact Index Per Article: 16.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/25/2020] [Revised: 06/26/2020] [Accepted: 07/10/2020] [Indexed: 12/17/2022] Open
Abstract
Machine learning methods have been widely applied to big data analysis in genomics and epigenomics research. Although accuracy and efficiency are common goals in many modeling tasks, model interpretability is especially important to these studies towards understanding the underlying molecular and cellular mechanisms. Deep neural networks (DNNs) have recently gained popularity in various types of genomic and epigenomic studies due to their capabilities in utilizing large-scale high-throughput bioinformatics data and achieving high accuracy in predictions and classifications. However, DNNs are often challenged by their potential to explain the predictions due to their black-box nature. In this review, we present current development in the model interpretation of DNNs, focusing on their applications in genomics and epigenomics. We first describe state-of-the-art DNN interpretation methods in representative machine learning fields. We then summarize the DNN interpretation methods in recent studies on genomics and epigenomics, focusing on current data- and computing-intensive topics such as sequence motif identification, genetic variations, gene expression, chromatin interactions and non-coding RNAs. We also present the biological discoveries that resulted from these interpretation methods. We finally discuss the advantages and limitations of current interpretation approaches in the context of genomic and epigenomic studies. Contact:xiaoman@mail.ucf.edu, haihu@cs.ucf.edu.
Collapse
Affiliation(s)
- Amlan Talukder
- Computer Science, University of Central Florida, Orlando, FL 32816, USA
| | - Clayton Barham
- Computer Science, University of Central Florida, Orlando, FL 32816, USA
| | - Xiaoman Li
- Burnett School of Biomedical Science, University of Central Florida, Orlando, FL 32816, USA
| | - Haiyan Hu
- Computer Science, University of Central Florida, Orlando, FL 32816, USA
| |
Collapse
|
33
|
Asada K, Kaneko S, Takasawa K, Machino H, Takahashi S, Shinkai N, Shimoyama R, Komatsu M, Hamamoto R. Integrated Analysis of Whole Genome and Epigenome Data Using Machine Learning Technology: Toward the Establishment of Precision Oncology. Front Oncol 2021; 11:666937. [PMID: 34055633 PMCID: PMC8149908 DOI: 10.3389/fonc.2021.666937] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/11/2021] [Accepted: 04/26/2021] [Indexed: 12/17/2022] Open
Abstract
With the completion of the International Human Genome Project, we have entered what is known as the post-genome era, and efforts to apply genomic information to medicine have become more active. In particular, with the announcement of the Precision Medicine Initiative by U.S. President Barack Obama in his State of the Union address at the beginning of 2015, "precision medicine," which aims to divide patients and potential patients into subgroups with respect to disease susceptibility, has become the focus of worldwide attention. The field of oncology is also actively adopting the precision oncology approach, which is based on molecular profiling, such as genomic information, to select the appropriate treatment. However, the current precision oncology is dominated by a method called targeted-gene panel (TGP), which uses next-generation sequencing (NGS) to analyze a limited number of specific cancer-related genes and suggest optimal treatments, but this method causes the problem that the number of patients who benefit from it is limited. In order to steadily develop precision oncology, it is necessary to integrate and analyze more detailed omics data, such as whole genome data and epigenome data. On the other hand, with the advancement of analysis technologies such as NGS, the amount of data obtained by omics analysis has become enormous, and artificial intelligence (AI) technologies, mainly machine learning (ML) technologies, are being actively used to make more efficient and accurate predictions. In this review, we will focus on whole genome sequencing (WGS) analysis and epigenome analysis, introduce the latest results of omics analysis using ML technologies for the development of precision oncology, and discuss the future prospects.
Collapse
Affiliation(s)
- Ken Asada
- Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project, Tokyo, Japan
- Division of Medical AI Research and Development, National Cancer Center Research Institute, Tokyo, Japan
| | - Syuzo Kaneko
- Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project, Tokyo, Japan
- Division of Medical AI Research and Development, National Cancer Center Research Institute, Tokyo, Japan
| | - Ken Takasawa
- Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project, Tokyo, Japan
- Division of Medical AI Research and Development, National Cancer Center Research Institute, Tokyo, Japan
| | - Hidenori Machino
- Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project, Tokyo, Japan
- Division of Medical AI Research and Development, National Cancer Center Research Institute, Tokyo, Japan
| | - Satoshi Takahashi
- Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project, Tokyo, Japan
- Division of Medical AI Research and Development, National Cancer Center Research Institute, Tokyo, Japan
| | - Norio Shinkai
- Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project, Tokyo, Japan
- Division of Medical AI Research and Development, National Cancer Center Research Institute, Tokyo, Japan
- Department of NCC Cancer Science, Graduate School of Medical and Dental Sciences, Tokyo Medical and Dental University, Tokyo, Japan
| | - Ryo Shimoyama
- Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project, Tokyo, Japan
- Division of Medical AI Research and Development, National Cancer Center Research Institute, Tokyo, Japan
| | - Masaaki Komatsu
- Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project, Tokyo, Japan
- Division of Medical AI Research and Development, National Cancer Center Research Institute, Tokyo, Japan
| | - Ryuji Hamamoto
- Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project, Tokyo, Japan
- Division of Medical AI Research and Development, National Cancer Center Research Institute, Tokyo, Japan
- Department of NCC Cancer Science, Graduate School of Medical and Dental Sciences, Tokyo Medical and Dental University, Tokyo, Japan
| |
Collapse
|
34
|
Khanal J, Tayara H, Zou Q, Chong KT. Identifying DNA N4-methylcytosine sites in the rosaceae genome with a deep learning model relying on distributed feature representation. Comput Struct Biotechnol J 2021; 19:1612-1619. [PMID: 33868598 PMCID: PMC8042287 DOI: 10.1016/j.csbj.2021.03.015] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/09/2021] [Revised: 03/12/2021] [Accepted: 03/13/2021] [Indexed: 12/11/2022] Open
Abstract
DNA N4-methylcytosine (4mC), an epigenetic modification found in prokaryotic and eukaryotic species, is involved in numerous biological functions, including host defense, transcription regulation, gene expression, and DNA replication. To identify 4mC sites, previous computational studies mostly focused on finding hand-crafted features. This area of research, therefore, would benefit from the development of a computational approach that relies on automatic feature selection to identify relevant sites. We here report 4mC-w2vec, a computational method that learned automatic feature discrimination in the Rosaceae genomes, especially in Rosa chinensis (R. chinensis) and Fragaria vesca (F. vesca), based on distributed feature representation and through the word embedding technique ‘word2vec’. While a few bioinformatics tools are currently employed to identify 4mC sites in these genomes, their prediction performance is inadequate. Our system processed 4mC and non-4mC sites through a word embedding process, including sub-word information of its biological words through k-mer, which then served as features that were fed into a double layer of convolutional neural network (CNN) to classify whether the sample sequences contained 4mCs or non-4mCs sites. Our tool demonstrated performance superior to current tools that use the same genomic datasets. Additionally, 4mC-w2vec is effective for balanced and imbalanced class datasets alike, and the online web-server is currently available at: http://nsclbio.jbnu.ac.kr/tools/4mC-w2vec/.
Collapse
Affiliation(s)
- Jhabindra Khanal
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea
| | - Hilal Tayara
- School of international Engineering and Science, Jeonbuk National University, Jeonju 54896, South Korea
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Kil To Chong
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea.,Advanced Electronics and Information Research Center, Jeonbuk National University, Jeonju 54896, South Korea
| |
Collapse
|
35
|
Yu X, Zhou J, Zhao M, Yi C, Duan Q, Zhou W, Li J. Exploiting XG Boost for Predicting Enhancer-promoter Interactions. Curr Bioinform 2021. [DOI: 10.2174/1574893615666200120103948] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]
Abstract
Background:
Gene expression and disease control are regulated by the interaction
between distal enhancers and proximal promoters, and the study of enhancer promoter interactions
(EPIs) provides insight into the genetic basis of diseases.
Objective:
Although the recent emergence of high-throughput sequencing methods have a
deepened understanding of EPIs, accurate prediction of EPIs still limitations.
Methods:
We have implemented a XGBoost-based approach and introduced two sets of features
(epigenomic and sequence) to predict the interactions between enhancers and promoters in
different cell lines.
Results:
Extensive experimental results show that XGBoost effectively predicts EPIs across three
cell lines, especially when using epigenomic and sequence features.
Conclusion:
XGBoost outperforms other methods, such as random forest, Adadboost, GBDT, and
TargetFinder.
Collapse
Affiliation(s)
- Xiaojuan Yu
- Software School of Yunnan University, Kunming, China
| | - Jianguo Zhou
- Software School of Yunnan University, Kunming, China
| | - Mingming Zhao
- Software School of Yunnan University, Kunming, China
| | - Chao Yi
- Software School of Yunnan University, Kunming, China
| | - Qing Duan
- Software School of Yunnan University, Kunming, China
| | - Wei Zhou
- Software School of Yunnan University, Kunming, China
| | - Jin Li
- Software School of Yunnan University, Kunming, China
| |
Collapse
|
36
|
Zeng W, Chen S, Cui X, Chen X, Gao Z, Jiang R. SilencerDB: a comprehensive database of silencers. Nucleic Acids Res 2021; 49:D221-D228. [PMID: 33045745 PMCID: PMC7778955 DOI: 10.1093/nar/gkaa839] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/14/2020] [Revised: 09/14/2020] [Accepted: 09/18/2020] [Indexed: 12/20/2022] Open
Abstract
Gene regulatory elements, including promoters, enhancers, silencers, etc., control transcriptional programs in a spatiotemporal manner. Though these elements are known to be able to induce either positive or negative transcriptional control, the community has been mostly studying enhancers which amplify transcription initiation, with less emphasis given to silencers which repress gene expression. To facilitate the study of silencers and the investigation of their potential roles in transcriptional control, we developed SilencerDB (http://health.tsinghua.edu.cn/silencerdb/), a comprehensive database of silencers by manually curating silencers from 2300 published articles. The current version, SilencerDB 1.0, contains (1) 33 060 validated silencers from experimental methods, and (ii) 5 045 547 predicted silencers from state-of-the-art machine learning methods. The functionality of SilencerDB includes (a) standardized categorization of silencers in a tree-structured class hierarchy based on species, organ, tissue and cell line and (b) comprehensive annotations of silencers with the nearest gene and potential regulatory genes. SilencerDB, to the best of our knowledge, is the first comprehensive database at this scale dedicated to silencers, with reliable annotations and user-friendly interactive database features. We believe this database has the potential to enable advanced understanding of silencers in regulatory mechanisms and to empower researchers to devise diverse applications of silencers in disease development.
Collapse
Affiliation(s)
- Wanwen Zeng
- Ministry of Education Key Laboratory of Bioinformatics, Research Department of Bioinformatics at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China.,College of Software, Nankai University, Tianjin 300071, China
| | - Shengquan Chen
- Ministry of Education Key Laboratory of Bioinformatics, Research Department of Bioinformatics at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China
| | - Xuejian Cui
- Ministry of Education Key Laboratory of Bioinformatics, Research Department of Bioinformatics at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China
| | - Xiaoyang Chen
- Ministry of Education Key Laboratory of Bioinformatics, Research Department of Bioinformatics at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China
| | - Zijing Gao
- Ministry of Education Key Laboratory of Bioinformatics, Research Department of Bioinformatics at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China
| | - Rui Jiang
- Ministry of Education Key Laboratory of Bioinformatics, Research Department of Bioinformatics at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China
| |
Collapse
|
37
|
Tao H, Li H, Xu K, Hong H, Jiang S, Du G, Wang J, Sun Y, Huang X, Ding Y, Li F, Zheng X, Chen H, Bo X. Computational methods for the prediction of chromatin interaction and organization using sequence and epigenomic profiles. Brief Bioinform 2021; 22:6102668. [PMID: 33454752 PMCID: PMC8424394 DOI: 10.1093/bib/bbaa405] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/21/2020] [Revised: 11/26/2020] [Accepted: 12/10/2020] [Indexed: 12/14/2022] Open
Abstract
The exploration of three-dimensional chromatin interaction and organization provides insight into mechanisms underlying gene regulation, cell differentiation and disease development. Advances in chromosome conformation capture technologies, such as high-throughput chromosome conformation capture (Hi-C) and chromatin interaction analysis by paired-end tag (ChIA-PET), have enabled the exploration of chromatin interaction and organization. However, high-resolution Hi-C and ChIA-PET data are only available for a limited number of cell lines, and their acquisition is costly, time consuming, laborious and affected by theoretical limitations. Increasing evidence shows that DNA sequence and epigenomic features are informative predictors of regulatory interaction and chromatin architecture. Based on these features, numerous computational methods have been developed for the prediction of chromatin interaction and organization, whereas they are not extensively applied in biomedical study. A systematical study to summarize and evaluate such methods is still needed to facilitate their application. Here, we summarize 48 computational methods for the prediction of chromatin interaction and organization using sequence and epigenomic profiles, categorize them and compare their performance. Besides, we provide a comprehensive guideline for the selection of suitable methods to predict chromatin interaction and organization based on available data and biological question of interest.
Collapse
Affiliation(s)
- Huan Tao
- Beijing Institute of Radiation Medicine
| | - Hao Li
- Beijing Institute of Radiation Medicine
| | - Kang Xu
- Beijing Institute of Radiation Medicine
| | - Hao Hong
- Beijing Institute of Radiation Medicine, Department of Biotechnology
| | - Shuai Jiang
- Beijing Institute of Radiation Medicine, Department of Biotechnology
| | - Guifang Du
- Beijing Institute of Radiation Medicine, Department of Biotechnology
| | | | - Yu Sun
- Beijing Institute of Radiation Medicine, Department of Biotechnology
| | - Xin Huang
- Beijing Institute of Radiation Medicine, Department of Biotechnology
| | - Yang Ding
- Beijing Institute of Radiation Medicine
| | - Fei Li
- Chinese Academy of Sciences, Department of Computer Network Information Center
| | | | | | | |
Collapse
|
38
|
Rozenwald MB, Galitsyna AA, Sapunov GV, Khrameeva EE, Gelfand MS. A machine learning framework for the prediction of chromatin folding in Drosophila using epigenetic features. PeerJ Comput Sci 2020; 6:e307. [PMID: 33816958 PMCID: PMC7924456 DOI: 10.7717/peerj-cs.307] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/28/2020] [Accepted: 09/30/2020] [Indexed: 05/03/2023]
Abstract
Technological advances have lead to the creation of large epigenetic datasets, including information about DNA binding proteins and DNA spatial structure. Hi-C experiments have revealed that chromosomes are subdivided into sets of self-interacting domains called Topologically Associating Domains (TADs). TADs are involved in the regulation of gene expression activity, but the mechanisms of their formation are not yet fully understood. Here, we focus on machine learning methods to characterize DNA folding patterns in Drosophila based on chromatin marks across three cell lines. We present linear regression models with four types of regularization, gradient boosting, and recurrent neural networks (RNN) as tools to study chromatin folding characteristics associated with TADs given epigenetic chromatin immunoprecipitation data. The bidirectional long short-term memory RNN architecture produced the best prediction scores and identified biologically relevant features. Distribution of protein Chriz (Chromator) and histone modification H3K4me3 were selected as the most informative features for the prediction of TADs characteristics. This approach may be adapted to any similar biological dataset of chromatin features across various cell lines and species. The code for the implemented pipeline, Hi-ChiP-ML, is publicly available: https://github.com/MichalRozenwald/Hi-ChIP-ML.
Collapse
Affiliation(s)
- Michal B. Rozenwald
- Faculty of Computer Science, National Research University Higher School of Economics, Moscow, Russia
| | | | - Grigory V. Sapunov
- Faculty of Computer Science, National Research University Higher School of Economics, Moscow, Russia
- Intento, Inc., Berkeley, CA, USA
| | | | - Mikhail S. Gelfand
- Skolkovo Institute of Science and Technology, Moscow, Russia
- A.A. Kharkevich Institute for Information Transmission Problems, RAS, Moscow, Russia
| |
Collapse
|
39
|
Jing F, Zhang SW, Zhang S. Prediction of enhancer-promoter interactions using the cross-cell type information and domain adversarial neural network. BMC Bioinformatics 2020; 21:507. [PMID: 33160328 PMCID: PMC7648314 DOI: 10.1186/s12859-020-03844-4] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/31/2020] [Accepted: 10/27/2020] [Indexed: 12/27/2022] Open
Abstract
BACKGROUND Enhancer-promoter interactions (EPIs) play key roles in transcriptional regulation and disease progression. Although several computational methods have been developed to predict such interactions, their performances are not satisfactory when training and testing data from different cell lines. Currently, it is still unclear what extent a across cell line prediction can be made based on sequence-level information. RESULTS In this work, we present a novel Sequence-based method (called SEPT) to predict the enhancer-promoter interactions in new cell line by using the cross-cell information and Transfer learning. SEPT first learns the features of enhancer and promoter from DNA sequences with convolutional neural network (CNN), then designing the gradient reversal layer of transfer learning to reduce the cell line specific features meanwhile retaining the features associated with EPIs. When the locations of enhancers and promoters are provided in new cell line, SEPT can successfully recognize EPIs in this new cell line based on labeled data of other cell lines. The experiment results show that SEPT can effectively learn the latent import EPIs-related features between cell lines and achieves the best prediction performance in terms of AUC (the area under the receiver operating curves). CONCLUSIONS SEPT is an effective method for predicting the EPIs in new cell line. Domain adversarial architecture of transfer learning used in SEPT can learn the latent EPIs shared features among cell lines from all other existing labeled data. It can be expected that SEPT will be of interest to researchers concerned with biological interaction prediction.
Collapse
Affiliation(s)
- Fang Jing
- Key Laboratory of Information Fusion Technology of Ministry of Education, School of Automation, Northwestern Polytechnical University, 127 West Youyi Road, Xi’an, 710072 Shaanxi China
| | - Shao-Wu Zhang
- Key Laboratory of Information Fusion Technology of Ministry of Education, School of Automation, Northwestern Polytechnical University, 127 West Youyi Road, Xi’an, 710072 Shaanxi China
| | - Shihua Zhang
- NCMIS, CEMS, RCSDS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, 55 Zhongguancun East Road, Beijing, 10090 China
- School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, 100049 China
- Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming, 650223 China
| |
Collapse
|
40
|
|
41
|
Gasperini M, Tome JM, Shendure J. Towards a comprehensive catalogue of validated and target-linked human enhancers. Nat Rev Genet 2020; 21:292-310. [PMID: 31988385 PMCID: PMC7845138 DOI: 10.1038/s41576-019-0209-0] [Citation(s) in RCA: 159] [Impact Index Per Article: 39.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 12/13/2019] [Indexed: 12/14/2022]
Abstract
The human gene catalogue is essentially complete, but we lack an equivalently vetted inventory of bona fide human enhancers. Hundreds of thousands of candidate enhancers have been nominated via biochemical annotations; however, only a handful of these have been validated and confidently linked to their target genes. Here we review emerging technologies for discovering, characterizing and validating human enhancers at scale. We furthermore propose a new framework for operationally defining enhancers that accommodates the heterogeneous and complementary results that are emerging from reporter assays, biochemical measurements and CRISPR screens.
Collapse
Affiliation(s)
- Molly Gasperini
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
| | - Jacob M Tome
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
| | - Jay Shendure
- Department of Genome Sciences, University of Washington, Seattle, WA, USA.
- Brotman Baty Institute for Precision Medicine, Seattle, WA, USA.
- Allen Discovery Center for Cell Lineage, Seattle, WA, USA.
- Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA.
| |
Collapse
|
42
|
Nicholls HL, John CR, Watson DS, Munroe PB, Barnes MR, Cabrera CP. Reaching the End-Game for GWAS: Machine Learning Approaches for the Prioritization of Complex Disease Loci. Front Genet 2020; 11:350. [PMID: 32351543 PMCID: PMC7174742 DOI: 10.3389/fgene.2020.00350] [Citation(s) in RCA: 71] [Impact Index Per Article: 17.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/19/2019] [Accepted: 03/23/2020] [Indexed: 12/21/2022] Open
Abstract
Genome-wide association studies (GWAS) have revealed thousands of genetic loci that underpin the complex biology of many human traits. However, the strength of GWAS - the ability to detect genetic association by linkage disequilibrium (LD) - is also its limitation. Whilst the ever-increasing study size and improved design have augmented the power of GWAS to detect effects, differentiation of causal variants or genes from other highly correlated genes associated by LD remains the real challenge. This has severely hindered the biological insights and clinical translation of GWAS findings. Although thousands of disease susceptibility loci have been reported, causal genes at these loci remain elusive. Machine learning (ML) techniques offer an opportunity to dissect the heterogeneity of variant and gene signals in the post-GWAS analysis phase. ML models for GWAS prioritization vary greatly in their complexity, ranging from relatively simple logistic regression approaches to more complex ensemble models such as random forests and gradient boosting, as well as deep learning models, i.e., neural networks. Paired with functional validation, these methods show important promise for clinical translation, providing a strong evidence-based approach to direct post-GWAS research. However, as ML approaches continue to evolve to meet the challenge of causal gene identification, a critical assessment of the underlying methodologies and their applicability to the GWAS prioritization problem is needed. This review investigates the landscape of ML applications in three parts: selected models, input features, and output model performance, with a focus on prioritizations of complex disease associated loci. Overall, we explore the contributions ML has made towards reaching the GWAS end-game with consequent wide-ranging translational impact.
Collapse
Affiliation(s)
- Hannah L. Nicholls
- Clinical Pharmacology, William Harvey Research Institute, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
- Centre for Translational Bioinformatics, William Harvey Research Institute, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
| | - Christopher R. John
- Centre for Translational Bioinformatics, William Harvey Research Institute, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
- Centre for Experimental Medicine and Rheumatology, William Harvey Research Institute, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
| | - David S. Watson
- Centre for Translational Bioinformatics, William Harvey Research Institute, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
- Oxford Internet Institute, University of Oxford, Oxford, United Kingdom
| | - Patricia B. Munroe
- Clinical Pharmacology, William Harvey Research Institute, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
- NIHR Barts Biomedical Research Centre, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
| | - Michael R. Barnes
- Clinical Pharmacology, William Harvey Research Institute, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
- Centre for Translational Bioinformatics, William Harvey Research Institute, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
- NIHR Barts Biomedical Research Centre, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
- The Alan Turing Institute, British Library, London, United Kingdom
| | - Claudia P. Cabrera
- Clinical Pharmacology, William Harvey Research Institute, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
- Centre for Translational Bioinformatics, William Harvey Research Institute, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
- NIHR Barts Biomedical Research Centre, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
| |
Collapse
|
43
|
Xu H, Zhang S, Yi X, Plewczynski D, Li MJ. Exploring 3D chromatin contacts in gene regulation: The evolution of approaches for the identification of functional enhancer-promoter interaction. Comput Struct Biotechnol J 2020; 18:558-570. [PMID: 32226593 PMCID: PMC7090358 DOI: 10.1016/j.csbj.2020.02.013] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/31/2019] [Revised: 02/21/2020] [Accepted: 02/22/2020] [Indexed: 12/12/2022] Open
Abstract
Mechanisms underlying gene regulation are key to understand how multicellular organisms with various cell types develop from the same genetic blueprint. Dynamic interactions between enhancers and genes are revealed to play central roles in controlling gene transcription, but the determinants to link functional enhancer-promoter pairs remain elusive. A major challenge is the lack of reliable approach to detect and verify functional enhancer-promoter interactions (EPIs). In this review, we summarized the current methods for detecting EPIs and described how developing techniques facilitate the identification of EPI through assessing the merits and drawbacks of these methods. We also reviewed recent state-of-art EPI prediction methods in terms of their rationale, data usage and characterization. Furthermore, we briefly discussed the evolved strategies for validating functional EPIs.
Collapse
Affiliation(s)
- Hang Xu
- 2011 Collaborative Innovation Center of Tianjin for Medical Epigenetics, Tianjin Key Laboratory of Medical Epigenetics, Tianjin Medical University, Tianjin, China
| | - Shijie Zhang
- 2011 Collaborative Innovation Center of Tianjin for Medical Epigenetics, Tianjin Key Laboratory of Medical Epigenetics, Tianjin Medical University, Tianjin, China
- Department of Pharmacology, Tianjin Key Laboratory of Inflammation Biology, School of Basic Medical Sciences, Tianjin Medical University, Tianjin, China
| | - Xianfu Yi
- School of Biomedical Engineering, Tianjin Medical University, Tianjin, China
| | - Dariusz Plewczynski
- Centre of New Technologies, University of Warsaw, Banacha 2c, 02-097 Warsaw, Poland
- Faculty of Mathematics and Information Science, Warsaw University of Technology, Koszykowa 75, 00-662 Warsaw, Poland
| | - Mulin Jun Li
- 2011 Collaborative Innovation Center of Tianjin for Medical Epigenetics, Tianjin Key Laboratory of Medical Epigenetics, Tianjin Medical University, Tianjin, China
- Department of Pharmacology, Tianjin Key Laboratory of Inflammation Biology, School of Basic Medical Sciences, Tianjin Medical University, Tianjin, China
| |
Collapse
|
44
|
Belokopytova PS, Nuriddinov MA, Mozheiko EA, Fishman D, Fishman V. Quantitative prediction of enhancer-promoter interactions. Genome Res 2019; 30:72-84. [PMID: 31804952 PMCID: PMC6961579 DOI: 10.1101/gr.249367.119] [Citation(s) in RCA: 41] [Impact Index Per Article: 8.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/13/2019] [Accepted: 11/25/2019] [Indexed: 11/24/2022]
Abstract
Recent experimental and computational efforts have provided large data sets describing three-dimensional organization of mouse and human genomes and showed the interconnection between the expression profile, epigenetic state, and spatial interactions of loci. These interconnections were utilized to infer the spatial organization of chromatin, including enhancer–promoter contacts, from one-dimensional epigenetic marks. Here, we show that the predictive power of some of these algorithms is overestimated due to peculiar properties of the biological data. We propose an alternative approach, which provides high-quality predictions of chromatin interactions using information on gene expression and CTCF-binding alone. Using multiple metrics, we confirmed that our algorithm could efficiently predict the three-dimensional architecture of both normal and rearranged genomes.
Collapse
Affiliation(s)
- Polina S Belokopytova
- Institute of Cytology and Genetics SB RAS 630090, Novosibirsk, Russia.,Novosibirsk State University, Novosibirsk, Russia 630090
| | | | | | - Daniil Fishman
- Novosibirsk State University, Novosibirsk, Russia 630090
| | - Veniamin Fishman
- Institute of Cytology and Genetics SB RAS 630090, Novosibirsk, Russia.,Novosibirsk State University, Novosibirsk, Russia 630090
| |
Collapse
|
45
|
Le NQK, Yapp EKY, Ho QT, Nagasundaram N, Ou YY, Yeh HY. iEnhancer-5Step: Identifying enhancers using hidden information of DNA sequences via Chou's 5-step rule and word embedding. Anal Biochem 2019; 571:53-61. [PMID: 30822398 DOI: 10.1016/j.ab.2019.02.017] [Citation(s) in RCA: 77] [Impact Index Per Article: 15.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/22/2019] [Revised: 02/17/2019] [Accepted: 02/19/2019] [Indexed: 12/22/2022]
Abstract
An enhancer is a short (50-1500bp) region of DNA that plays an important role in gene expression and the production of RNA and proteins. Genetic variation in enhancers has been linked to many human diseases, such as cancer, disorder or inflammatory bowel disease. Due to the importance of enhancers in genomics, the classification of enhancers has become a popular area of research in computational biology. Despite the few computational tools employed to address this problem, their resulting performance still requires improvements. In this study, we treat enhancers by the word embeddings, including sub-word information of its biological words, which then serve as features to be fed into a support vector machine algorithm to classify them. We present iEnhancer-5Step, a web server containing two-layer classifiers to identify enhancers and their strength. We are able to attain an independent test accuracy of 79% and 63.5% in the two layers, respectively. Compared to current predictors on the same dataset, our proposed method is able to yield superior performance as compared to the other methods. Moreover, this study provides a basis for further research that can enrich the field of applying natural language processing techniques in biological sequences. iEnhancer-5Step is freely accessible via http://biologydeep.com/fastenc/.
Collapse
Affiliation(s)
- Nguyen Quoc Khanh Le
- Medical Humanities Research Cluster, School of Humanities, Nanyang Technological University, 48 Nanyang Ave, 639798, Singapore.
| | - Edward Kien Yee Yapp
- Singapore Institute of Manufacturing Technology, 2 Fusionopolis Way, #08-04, Innovis, 138634, Singapore
| | - Quang-Thai Ho
- Department of Computer Science and Engineering, Yuan Ze University, 32003, Taiwan
| | - N Nagasundaram
- Medical Humanities Research Cluster, School of Humanities, Nanyang Technological University, 48 Nanyang Ave, 639798, Singapore
| | - Yu-Yen Ou
- Department of Computer Science and Engineering, Yuan Ze University, 32003, Taiwan
| | - Hui-Yuan Yeh
- Medical Humanities Research Cluster, School of Humanities, Nanyang Technological University, 48 Nanyang Ave, 639798, Singapore.
| |
Collapse
|
46
|
Zou Q, Xing P, Wei L, Liu B. Gene2vec: gene subsequence embedding for prediction of mammalian N6-methyladenosine sites from mRNA. RNA (NEW YORK, N.Y.) 2019; 25:205-218. [PMID: 30425123 PMCID: PMC6348985 DOI: 10.1261/rna.069112.118] [Citation(s) in RCA: 311] [Impact Index Per Article: 62.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 10/03/2018] [Accepted: 11/01/2018] [Indexed: 05/20/2023]
Abstract
N6-Methyladenosine (m6A) refers to methylation modification of the adenosine nucleotide acid at the nitrogen-6 position. Many conventional computational methods for identifying N6-methyladenosine sites are limited by the small amount of data available. Taking advantage of the thousands of m6A sites detected by high-throughput sequencing, it is now possible to discover the characteristics of m6A sequences using deep learning techniques. To the best of our knowledge, our work is the first attempt to use word embedding and deep neural networks for m6A prediction from mRNA sequences. Using four deep neural networks, we developed a model inferred from a larger sequence shifting window that can predict m6A accurately and robustly. Four prediction schemes were built with various RNA sequence representations and optimized convolutional neural networks. The soft voting results from the four deep networks were shown to outperform all of the state-of-the-art methods. We evaluated these predictors mentioned above on a rigorous independent test data set and proved that our proposed method outperforms the state-of-the-art predictors. The training, independent, and cross-species testing data sets are much larger than in previous studies, which could help to avoid the problem of overfitting. Furthermore, an online prediction web server implementing the four proposed predictors has been built and is available at http://server.malab.cn/Gene2vec/.
Collapse
Affiliation(s)
- Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, 610051 Chengdu, China
- School of Computer Science and Technology, Tianjin University, 300350 Tianjin, China
| | - Pengwei Xing
- School of Computer Science and Technology, Tianjin University, 300350 Tianjin, China
| | - Leyi Wei
- School of Computer Science and Technology, Tianjin University, 300350 Tianjin, China
| | - Bin Liu
- School of Computer Science and Technology, Harbin Institute of Technology, 150001 Shenzhen, China
| |
Collapse
|
47
|
Zou J, Huss M, Abid A, Mohammadi P, Torkamani A, Telenti A. A primer on deep learning in genomics. Nat Genet 2019; 51:12-18. [PMID: 30478442 PMCID: PMC11180539 DOI: 10.1038/s41588-018-0295-5] [Citation(s) in RCA: 402] [Impact Index Per Article: 80.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/14/2018] [Accepted: 09/26/2018] [Indexed: 12/13/2022]
Abstract
Deep learning methods are a class of machine learning techniques capable of identifying highly complex patterns in large datasets. Here, we provide a perspective and primer on deep learning applications for genome analysis. We discuss successful applications in the fields of regulatory genomics, variant calling and pathogenicity scores. We include general guidance for how to effectively use deep learning methods as well as a practical guide to tools and resources. This primer is accompanied by an interactive online tutorial.
Collapse
Affiliation(s)
- James Zou
- Department of Biomedical Data Science, Stanford University, Palo Alto, CA, USA.
- Chan-Zuckerberg Biohub, San Francisco, CA, USA.
- Department of Electrical Engineering, Stanford University, Palo Alto, CA, USA.
| | - Mikael Huss
- Peltarion, Stockholm, Sweden
- Department of Learning, Informatics, Management and Ethics, Karolinska Institutet, Stockholm, Sweden
| | - Abubakar Abid
- Department of Electrical Engineering, Stanford University, Palo Alto, CA, USA
| | - Pejman Mohammadi
- Scripps Research Translational Institute, La Jolla, CA, USA
- Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, USA
| | - Ali Torkamani
- Scripps Research Translational Institute, La Jolla, CA, USA
- Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, USA
| | - Amalio Telenti
- Scripps Research Translational Institute, La Jolla, CA, USA.
- Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, USA.
| |
Collapse
|