1
|
Wall BPG, Nguyen M, Harrell JC, Dozmorov MG. Machine and deep learning methods for predicting 3D genome organization. ARXIV 2024:arXiv:2403.03231v1. [PMID: 38495565 PMCID: PMC10942493] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/19/2024]
Abstract
Three-dimensional (3D) chromatin interactions, such as enhancer-promoter interactions (EPIs), loops, Topologically Associating Domains (TADs), and A/B compartments, play critical roles in a wide range of cellular processes by regulating gene expression. Recent development of chromatin conformation capture technologies has enabled genome-wide profiling of various 3D structures, even in single cells. However, current catalogs of 3D structures remain incomplete and unreliable due to differences in technology, tools, and low data resolution. Machine learning methods have emerged as an alternative to obtain missing 3D interactions and/or improve resolution. Such methods frequently use genome annotation data (ChIP-seq, DNase-seq, etc.), DNA sequence information (k-mers, Transcription Factor Binding Site (TFBS) motifs), and other genomic properties to learn the associations between genomic features and chromatin interactions. In this review, we discuss computational tools for predicting three types of 3D interactions (EPIs, chromatin interactions, TAD boundaries) and analyze their pros and cons. We also point out obstacles to computational prediction of 3D interactions and suggest future research directions.
Affiliation(s)
- Brydon P. G. Wall
  - Center for Biological Data Science, Virginia Commonwealth University, Richmond, VA, 23284, USA
- My Nguyen
  - Department of Biostatistics, Virginia Commonwealth University, Richmond, VA, 23298, USA
- J. Chuck Harrell
  - Department of Pathology, Virginia Commonwealth University, Richmond, VA, 23284, USA
  - Massey Comprehensive Cancer Center, Virginia Commonwealth University, Richmond, VA 23298, USA
  - Center for Pharmaceutical Engineering, Virginia Commonwealth University, Richmond, VA 23298, USA
- Mikhail G. Dozmorov
  - Department of Biostatistics, Virginia Commonwealth University, Richmond, VA, 23298, USA
  - Department of Pathology, Virginia Commonwealth University, Richmond, VA, 23284, USA
|
2
|
Wang Q, Zhang J, Liu Z, Duan Y, Li C. Integrative approaches based on genomic techniques in the functional studies on enhancers. Brief Bioinform 2023; 25:bbad442. [PMID: 38048082 PMCID: PMC10694556 DOI: 10.1093/bib/bbad442] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/28/2023] [Revised: 10/22/2023] [Accepted: 11/08/2023] [Indexed: 12/05/2023]
Abstract
With the development of sequencing technology and the dramatic drop in sequencing cost, the functions of noncoding genes are being characterized in a wide variety of fields (e.g. biomedicine). Enhancers are noncoding DNA elements with vital transcription regulation functions. Tens of thousands of enhancers have been identified in the human genome; however, the location, function, target genes and regulatory mechanisms of most enhancers have not been elucidated thus far. As high-throughput sequencing techniques have advanced, omics approaches have been extensively employed in enhancer research. Multidimensional genomic data integration enables the full exploration of the data and provides novel perspectives for screening, identification and characterization of the function and regulatory mechanisms of unknown enhancers. However, multidimensional genomic data are still difficult to integrate genome wide due to complex varieties, massive amounts, high rarity, etc. To facilitate the choice of appropriate methods for studying enhancers efficiently, we delineate the principles, data processing modes and progress of various omics approaches to study enhancers and summarize the applications of traditional machine learning and deep learning in multi-omics integration in the enhancer field. In addition, the challenges encountered during the integration of multiple omics data are addressed. Overall, this review provides a comprehensive foundation for enhancer analysis.
Affiliation(s)
- Qilin Wang
  - School of Engineering Medicine, Beihang University, Beijing 100191, China
  - School of Biological Science and Medical Engineering, Beihang University, Beijing 100191, China
- Junyou Zhang
  - School of Engineering Medicine, Beihang University, Beijing 100191, China
  - School of Biological Science and Medical Engineering, Beihang University, Beijing 100191, China
- Zhaoshuo Liu
  - School of Engineering Medicine, Beihang University, Beijing 100191, China
  - School of Biological Science and Medical Engineering, Beihang University, Beijing 100191, China
- Yingying Duan
  - School of Engineering Medicine, Beihang University, Beijing 100191, China
  - School of Biological Science and Medical Engineering, Beihang University, Beijing 100191, China
- Chunyan Li
  - School of Engineering Medicine, Beihang University, Beijing 100191, China
  - School of Biological Science and Medical Engineering, Beihang University, Beijing 100191, China
  - Key Laboratory of Big Data-Based Precision Medicine (Ministry of Industry and Information Technology), Beihang University, Beijing 100191, China
  - Beijing Advanced Innovation Center for Big Data-Based Precision Medicine, Beihang University, Beijing 100191, China
|
3
|
Chen M, Liu X, Liu Q, Shi D, Li H. 3D genomics and its applications in precision medicine. Cell Mol Biol Lett 2023; 28:19. [PMID: 36879202 PMCID: PMC9987123 DOI: 10.1186/s11658-023-00428-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/16/2022] [Accepted: 02/06/2023] [Indexed: 03/08/2023]
Abstract
Three-dimensional (3D) genomics is an emerging discipline that studies the three-dimensional structure of chromatin and the three-dimensional structure and functions of genomes. It mainly studies the three-dimensional conformation and functional regulation of intranuclear genomes, such as DNA replication, DNA recombination, genome folding, gene expression regulation, transcription factor regulation mechanisms, and the maintenance of the three-dimensional conformation of genomes. Since chromosome conformation capture (3C) technology was developed, 3D genomics and related fields have advanced rapidly. In addition, techniques derived from 3C, such as chromatin interaction analysis by paired-end tag sequencing (ChIA-PET) and genome-wide chromosome conformation capture (Hi-C), enable scientists to further study the relationship between chromatin conformation and gene regulation in different species. Thus, the spatial conformation of plant, animal, and microbial genomes, transcriptional regulation mechanisms, interaction patterns of chromosomes, and the formation mechanism of spatiotemporal specificity of genomes are revealed. With the help of new experimental technologies, the identification of key genes and signal pathways related to life activities and diseases is sustaining the rapid development of life science, agriculture, and medicine. In this paper, the concept and development of 3D genomics and its application in agricultural science, life science, and medicine are introduced, which provides a theoretical basis for the study of biological life processes.
Affiliation(s)
- Mengjie Chen
  - State Key Laboratory for Conservation and Utilization of Subtropical Agro-Bioresources, College of Animal Science and Technology, Guangxi University, Nanning, 530004, Guangxi Province, China
- Xingyu Liu
  - State Key Laboratory for Conservation and Utilization of Subtropical Agro-Bioresources, College of Animal Science and Technology, Guangxi University, Nanning, 530004, Guangxi Province, China
- Qingyou Liu
  - State Key Laboratory for Conservation and Utilization of Subtropical Agro-Bioresources, College of Animal Science and Technology, Guangxi University, Nanning, 530004, Guangxi Province, China
  - Guangdong Provincial Key Laboratory of Animal Molecular Design and Precise Breeding, School of Life Science and Engineering, Foshan University, Foshan, 528225, China
- Deshun Shi
  - State Key Laboratory for Conservation and Utilization of Subtropical Agro-Bioresources, College of Animal Science and Technology, Guangxi University, Nanning, 530004, Guangxi Province, China
- Hui Li
  - State Key Laboratory for Conservation and Utilization of Subtropical Agro-Bioresources, College of Animal Science and Technology, Guangxi University, Nanning, 530004, Guangxi Province, China
|
4
|
Subramanian S, George TP, George J, Thomas T. Ensemble learning based assessment of the role of transcription factors in gene expression. Comput Biol Med 2023; 152:106455. [PMID: 36566628 DOI: 10.1016/j.compbiomed.2022.106455] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/12/2022] [Revised: 11/29/2022] [Accepted: 12/19/2022] [Indexed: 12/24/2022]
Abstract
Cancer cells are formed when the associated, active genes fail to function the way they are meant to function. Multiple genes collectively control cell growth by activating a proper set of genes. Regulation of gene expression is controlled through the combined effort of multiple regulatory elements. Transcription of each gene is affected differently according to the combinatorial patterns of regulatory elements bound in the nearby regions. Identifying and analysing such patterns will give a better insight into the cell function. The main focus of this study is on developing a computational model to predict the functional role of transcription factors (TFs) residing between divergent gene pairs. Acute Myeloid Leukaemia (AML) gene expression data from GEO and binding data for the two TFs EP300 and CTCF, calibrated in the K562 cell line from the ENCODE consortium, are taken as a case study.
Affiliation(s)
- Jeslin George
  - Department of Statistical Sciences, Kannur University, India
|
5
|
Integrating extrusion complex-associated pattern to predict cell type-specific long-range chromatin loops. iScience 2022; 25:105687. [PMID: 36567710 PMCID: PMC9768375 DOI: 10.1016/j.isci.2022.105687] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/11/2022] [Revised: 11/10/2022] [Accepted: 11/25/2022] [Indexed: 12/07/2022]
Abstract
The chromatin loop plays a critical role in the study of gene expression and disease. Supervised learning-based algorithms for predicting chromatin loops require a large amount of prior information for model construction, while the prediction sensitivity of unsupervised learning-based algorithms is still unsatisfactory. Therefore, we propose an unsupervised algorithm, Ecomap-loop. It takes advantage of extrusion complex-associated patterns, including CTCF, RAD21, and SMC enrichments, as well as the orientation distribution of CTCF motifs at loops, to build feature matrices; an eigen decomposition model is then employed to obtain the cell type-specific loops. We compare the performance of Ecomap-loop with the state-of-the-art unsupervised algorithm using Hi-C, ChIA-PET, expression quantitative trait locus (eQTL), and CRISPR interference (CRISPRi) screen data; the results show that Ecomap-loop achieves the best performance in four cell types. In addition, the functional analysis reveals the ability of Ecomap-loop to predict active functionality-related and cell type-specific loops.
|
6
|
DLoopCaller: A deep learning approach for predicting genome-wide chromatin loops by integrating accessible chromatin landscapes. PLoS Comput Biol 2022; 18:e1010572. [PMID: 36206320 PMCID: PMC9581407 DOI: 10.1371/journal.pcbi.1010572] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/04/2022] [Revised: 10/19/2022] [Accepted: 09/14/2022] [Indexed: 11/20/2022]
Abstract
In recent years, major advances have been made in various chromosome conformation capture technologies to further satisfy the needs of researchers for high-quality, high-resolution contact interactions. Discriminating the loops from genome-wide contact interactions is crucial for dissecting three-dimensional (3D) genome structure and function. Here, we present a deep learning method to predict genome-wide chromatin loops, called DLoopCaller, by combining accessible chromatin landscapes and raw Hi-C contact maps. Available orthogonal data from ChIA-PET/HiChIP and Capture Hi-C were used to generate positive samples with a wider contact matrix, which makes it possible to find more potential genome-wide chromatin loops. The experimental results demonstrate that DLoopCaller effectively improves the accuracy of predicting genome-wide chromatin loops compared to the state-of-the-art method Peakachu. Moreover, compared to two of the most popular loop callers, HiCCUPS and Fit-Hi-C, DLoopCaller identifies some unique interactions. We conclude that a combination of chromatin landscapes on the one-dimensional genome contributes to understanding the 3D genome organization, and the identified chromatin loops reveal cell-type specificity and transcription factor motif co-enrichment across different cell lines and species.
|
7
|
Zhang L, Zhang J, Nie Q. DIRECT-NET: An efficient method to discover cis-regulatory elements and construct regulatory networks from single-cell multiomics data. SCIENCE ADVANCES 2022; 8:eabl7393. [PMID: 35648859 PMCID: PMC9159696 DOI: 10.1126/sciadv.abl7393] [Citation(s) in RCA: 17] [Impact Index Per Article: 8.5] [Indexed: 05/13/2023]
Abstract
The emergence of single-cell multiomics data provides unprecedented opportunities to scrutinize the transcriptional regulatory mechanisms controlling cell identity. However, how to use those datasets to dissect the cis-regulatory element (CRE)–to–gene relationships at a single-cell level remains a major challenge. Here, we present DIRECT-NET, a machine-learning method based on gradient boosting, to identify genome-wide CREs and their relationship to target genes, either from parallel single-cell gene expression and chromatin accessibility data or from single-cell chromatin accessibility data alone. By extensively evaluating and characterizing DIRECT-NET’s predicted CREs using independent functional genomics data, we find that DIRECT-NET substantially improves the accuracy of inferring CRE-to-gene relationships in comparison to existing methods. DIRECT-NET is also capable of revealing cell subpopulation–specific and dynamic regulatory linkages. Overall, DIRECT-NET provides an efficient tool for predicting transcriptional regulation codes from single-cell multiomics data.
Affiliation(s)
- Lihua Zhang
  - School of Computer Science, Wuhan University, Wuhan 430072, China
  - Department of Mathematics, University of California, Irvine, Irvine, CA 92697, USA
  - NSF-Simons Center for Multiscale Cell Fate Research, University of California, Irvine, Irvine, CA 92697, USA
- Jing Zhang
  - Department of Computer Science, University of California, Irvine, Irvine, CA 92697, USA
  - Corresponding author.
- Qing Nie
  - Department of Mathematics, University of California, Irvine, Irvine, CA 92697, USA
  - NSF-Simons Center for Multiscale Cell Fate Research, University of California, Irvine, Irvine, CA 92697, USA
  - Department of Developmental and Cell Biology, University of California, Irvine, Irvine, CA 92697, USA
  - Corresponding author.
|
8
|
Tang L, Zhong Z, Lin Y, Yang Y, Wang J, Martin JF, Li M. EPIXplorer: A web server for prediction, analysis and visualization of enhancer-promoter interactions. Nucleic Acids Res 2022; 50:W290-W297. [PMID: 35639508 PMCID: PMC9252822 DOI: 10.1093/nar/gkac397] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/03/2022] [Revised: 05/01/2022] [Accepted: 05/05/2022] [Indexed: 11/13/2022]
Abstract
Long-distance enhancers can physically interact with promoters to regulate gene expression through formation of enhancer-promoter (E-P) interactions. Identification of E-P interactions is also important for a thorough understanding of normal development and disease-associated risk variants. Although state-of-the-art predictive computational methods facilitate the identification of E-P interactions to a certain extent, currently there is no efficient method that can meet various requirements of usage. Here we developed EPIXplorer, a user-friendly web server for efficient prediction, analysis and visualization of E-P interactions. EPIXplorer integrates 9 robust predictive algorithms and supports multiple types of 3D contact data and multi-omics data as input. The output from EPIXplorer is scored and fully annotated with regulatory elements and risk single-nucleotide polymorphisms (SNPs). In addition, the Visualization and Downstream modules provide further functional analysis, and all the output files and high-quality images are available for download. Together, EPIXplorer provides a user-friendly interface to predict E-P interactions in an acceptable time, as well as to understand how genome-wide association study (GWAS) variants influence disease pathology by altering DNA looping between enhancers and the target gene promoters. EPIXplorer is available at https://www.csuligroup.com/EPIXplorer.
Affiliation(s)
- Li Tang
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, China
- Zhizhou Zhong
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, China
- Yisheng Lin
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, China
- Yifei Yang
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, China
- Jun Wang
  - Department of Pediatrics, McGovern Medical School, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
- James F Martin
  - Department of Molecular Physiology and Biophysics, Baylor College of Medicine, Houston, TX 77030, USA
  - Cardiovascular Research Institute, Baylor College of Medicine, Houston, TX 77030, USA
  - Texas Heart Institute, Houston, TX 77030, USA
- Min Li
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, China
|
9
|
Chen Z, Zhang J, Liu J, Dai Y, Lee D, Min MR, Xu M, Gerstein M. DECODE: a Deep-learning framework for Condensing enhancers and refining boundaries with large-scale functional assays. Bioinformatics 2021; 37:i280-i288. [PMID: 34252960 PMCID: PMC8275369 DOI: 10.1093/bioinformatics/btab283] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Accepted: 04/26/2021] [Indexed: 11/13/2022]
Abstract
Motivation: Mapping distal regulatory elements, such as enhancers, is a cornerstone for elucidating how genetic variations may influence diseases. Previous enhancer-prediction methods have used either unsupervised approaches or supervised methods with limited training data. Moreover, past approaches have implemented enhancer discovery as a binary classification problem without accurate boundary detection, producing low-resolution annotations with superfluous regions and reducing the statistical power for downstream analyses (e.g. causal variant mapping and functional validations). Here, we addressed these challenges via a two-step model called Deep-learning framework for Condensing enhancers and refining boundaries with large-scale functional assays (DECODE). First, we employed direct enhancer-activity readouts from novel functional characterization assays, such as STARR-seq, to train a deep neural network for accurate cell-type-specific enhancer prediction. Second, to improve the annotation resolution, we implemented a weakly supervised object detection framework for enhancer localization with precise boundary detection (to a 10 bp resolution) using Gradient-weighted Class Activation Mapping. Results: Our DECODE binary classifier outperformed a state-of-the-art enhancer prediction method by 24% in transgenic mouse validation. Furthermore, the object detection framework can condense enhancer annotations to only 13% of their original size, and these compact annotations have significantly higher conservation scores and genome-wide association study variant enrichments than the original predictions. Overall, DECODE is an effective tool for enhancer classification and precise localization. Availability and implementation: DECODE source code and pre-processing scripts are available at decode.gersteinlab.org. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Zhanlin Chen
  - Department of Statistics & Data Science, Yale University, New Haven, CT 06520, USA
- Jing Zhang
  - Department of Computer Science, University of California, Irvine, CA 92617, USA
- Jason Liu
  - Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA
- Yi Dai
  - Department of Computer Science, University of California, Irvine, CA 92617, USA
- Donghoon Lee
  - Genetics and Genomic Sciences, The Icahn School of Medicine at Mount Sinai, New York, NY 10029-6574, USA
- Min Xu
  - Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Mark Gerstein
  - Department of Statistics & Data Science, Yale University, New Haven, CT 06520, USA
  - Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA
  - Department of Computer Science, Yale University, New Haven, CT 06520, USA
|
10
|
Zrimec J, Buric F, Kokina M, Garcia V, Zelezniak A. Learning the Regulatory Code of Gene Expression. Front Mol Biosci 2021; 8:673363. [PMID: 34179082 PMCID: PMC8223075 DOI: 10.3389/fmolb.2021.673363] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 02/27/2021] [Accepted: 05/24/2021] [Indexed: 11/13/2022]
Abstract
Data-driven machine learning is the method of choice for predicting molecular phenotypes from nucleotide sequence, modeling gene expression events including protein-DNA binding, chromatin states as well as mRNA and protein levels. Deep neural networks automatically learn informative sequence representations and interpreting them enables us to improve our understanding of the regulatory code governing gene expression. Here, we review the latest developments that apply shallow or deep learning to quantify molecular phenotypes and decode the cis-regulatory grammar from prokaryotic and eukaryotic sequencing data. Our approach is to build from the ground up, first focusing on the initiating protein-DNA interactions, then specific coding and non-coding regions, and finally on advances that combine multiple parts of the gene and mRNA regulatory structures, achieving unprecedented performance. We thus provide a quantitative view of gene expression regulation from nucleotide sequence, concluding with an information-centric overview of the central dogma of molecular biology.
Affiliation(s)
- Jan Zrimec
  - Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
- Filip Buric
  - Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
- Mariia Kokina
  - Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
  - Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Kongens Lyngby, Denmark
- Victor Garcia
  - School of Life Sciences and Facility Management, Zurich University of Applied Sciences, Wädenswil, Switzerland
- Aleksej Zelezniak
  - Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
  - Science for Life Laboratory, Stockholm, Sweden
|