1
Wall BPG, Nguyen M, Harrell JC, Dozmorov MG. Machine and Deep Learning Methods for Predicting 3D Genome Organization. Methods Mol Biol 2025; 2856:357-400. [PMID: 39283464 DOI: 10.1007/978-1-0716-4136-1_22] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/25/2024]
Abstract
Three-dimensional (3D) chromatin interactions, such as enhancer-promoter interactions (EPIs), loops, topologically associating domains (TADs), and A/B compartments, play critical roles in a wide range of cellular processes by regulating gene expression. Recent development of chromatin conformation capture technologies has enabled genome-wide profiling of various 3D structures, even in single cells. However, current catalogs of 3D structures remain incomplete and unreliable due to differences in technology and tools and to low data resolution. Machine learning methods have emerged as an alternative to obtain missing 3D interactions and/or improve resolution. Such methods frequently use genome annotation data (ChIP-seq, DNase-seq, etc.), DNA sequence information (k-mers and transcription factor binding site (TFBS) motifs), and other genomic properties to learn the associations between genomic features and chromatin interactions. In this review, we discuss computational tools for predicting three types of 3D interactions (EPIs, chromatin interactions, and TAD boundaries) and analyze their pros and cons. We also point out obstacles to the computational prediction of 3D interactions and suggest future research directions.
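As a rough illustration of the k-mer features mentioned in the abstract, the sketch below counts overlapping k-mers in a DNA string so that the counts can serve as a feature vector. The function name `kmer_counts` and the choice to skip windows containing ambiguous bases are assumptions made for this example and are not taken from any tool covered by the review.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count overlapping k-mers, skipping windows with non-ACGT characters."""
    sequence = sequence.upper()
    counts = Counter()
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        # Ambiguous bases such as N carry no sequence signal, so skip them.
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


# Toy sequence; real inputs would be enhancer, promoter, or anchor regions.
features = kmer_counts("ACGTACGTNACG", 3)
```

In practice such counts are usually normalized by region length before being passed to a classifier.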
Affiliation(s)
- Brydon P G Wall
  - Center for Biological Data Science, Virginia Commonwealth University, Richmond, VA, USA
- My Nguyen
  - Department of Biostatistics, Virginia Commonwealth University, Richmond, VA, USA
- J Chuck Harrell
  - Department of Pathology, Virginia Commonwealth University, Richmond, VA, USA
  - Massey Comprehensive Cancer Center, Virginia Commonwealth University, Richmond, VA, USA
  - Center for Pharmaceutical Engineering, Virginia Commonwealth University, Richmond, VA, USA
- Mikhail G Dozmorov
  - Department of Biostatistics, Virginia Commonwealth University, Richmond, VA, USA
  - Department of Pathology, Virginia Commonwealth University, Richmond, VA, USA
2
Wall BPG, Nguyen M, Harrell JC, Dozmorov MG. Machine and deep learning methods for predicting 3D genome organization. arXiv 2024:arXiv:2403.03231v1. [PMID: 38495565 PMCID: PMC10942493] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/19/2024]
Abstract
Three-dimensional (3D) chromatin interactions, such as enhancer-promoter interactions (EPIs), loops, topologically associating domains (TADs), and A/B compartments, play critical roles in a wide range of cellular processes by regulating gene expression. Recent development of chromatin conformation capture technologies has enabled genome-wide profiling of various 3D structures, even in single cells. However, current catalogs of 3D structures remain incomplete and unreliable due to differences in technology and tools and to low data resolution. Machine learning methods have emerged as an alternative to obtain missing 3D interactions and/or improve resolution. Such methods frequently use genome annotation data (ChIP-seq, DNase-seq, etc.), DNA sequence information (k-mers, transcription factor binding site (TFBS) motifs), and other genomic properties to learn the associations between genomic features and chromatin interactions. In this review, we discuss computational tools for predicting three types of 3D interactions (EPIs, chromatin interactions, TAD boundaries) and analyze their pros and cons. We also point out obstacles to the computational prediction of 3D interactions and suggest future research directions.
Affiliation(s)
- Brydon P. G. Wall
  - Center for Biological Data Science, Virginia Commonwealth University, Richmond, VA 23284, USA
- My Nguyen
  - Department of Biostatistics, Virginia Commonwealth University, Richmond, VA 23298, USA
- J. Chuck Harrell
  - Department of Pathology, Virginia Commonwealth University, Richmond, VA 23284, USA
  - Massey Comprehensive Cancer Center, Virginia Commonwealth University, Richmond, VA 23298, USA
  - Center for Pharmaceutical Engineering, Virginia Commonwealth University, Richmond, VA 23298, USA
- Mikhail G. Dozmorov
  - Department of Biostatistics, Virginia Commonwealth University, Richmond, VA 23298, USA
  - Department of Pathology, Virginia Commonwealth University, Richmond, VA 23284, USA
3
Chen W, Miao C, Zhang Z, Fung CSH, Wang R, Chen Y, Qian Y, Cheng L, Yip KY, Tsui SKW, Cao Q. Commonly used software tools produce conflicting and overly-optimistic AUPRC values. bioRxiv 2024:2024.02.02.578654. [PMID: 38370825 PMCID: PMC10871236 DOI: 10.1101/2024.02.02.578654] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/20/2024]
Abstract
The precision-recall curve (PRC) and the area under it (AUPRC) are useful for quantifying classification performance. They are commonly used in situations with imbalanced classes, such as cancer diagnosis and cell type annotation. We evaluated 10 popular tools for plotting PRC and computing AUPRC, which were collectively used in >3,000 published studies. We found the AUPRC values computed by the tools rank classifiers differently and some tools produce overly-optimistic results.
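To make the source of such disagreement concrete, the sketch below computes AUPRC for the same toy predictions under two common conventions: a step-wise sum (often called average precision) and trapezoidal integration with linear interpolation anchored at recall 0, precision 1. The data, function names, and anchoring choice are assumptions for illustration; the example does not reproduce any of the ten tools evaluated in the study.

```python
def pr_points(labels, scores):
    """Return (recall, precision) pairs, one per distinct score threshold."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    total_pos = sum(labels)
    points = []
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every example tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((tp / total_pos, tp / (tp + fp)))
    return points


def auprc_average_precision(points):
    """Step-wise sum: precision at each threshold weighted by the recall gain."""
    area, prev_recall = 0.0, 0.0
    for recall, precision in points:
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area


def auprc_trapezoid(points):
    """Linear interpolation between points, anchored at recall 0, precision 1."""
    area, prev_recall, prev_precision = 0.0, 0.0, 1.0
    for recall, precision in points:
        area += (recall - prev_recall) * (precision + prev_precision) / 2
        prev_recall, prev_precision = recall, precision
    return area


# Hypothetical labels (1 = positive) and classifier scores.
labels = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
points = pr_points(labels, scores)
ap = auprc_average_precision(points)
trap = auprc_trapezoid(points)
print(f"average precision: {ap:.4f}, trapezoid: {trap:.4f}")
```

The two functions return different areas for identical predictions, so AUPRC values are only comparable when the same estimator, tie handling, and anchoring are used throughout.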
Affiliation(s)
- Wenyu Chen
  - School of Biomedical Sciences, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, China
- Chen Miao
  - School of Biomedical Sciences, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, China
- Zhenghao Zhang
  - Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, China
- Cathy Sin-Hang Fung
  - School of Biomedical Sciences, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, China
- Ran Wang
  - School of Biomedical Sciences, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, China
- Yizhen Chen
  - Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, China
- Yan Qian
  - The First Affiliated Hospital, Sun Yat-sen University, Guangzhou, China
- Lixin Cheng
  - Shenzhen People’s Hospital, First Affiliated Hospital of Southern University of Science and Technology, Second Clinical Medicine College of Jinan University, Shenzhen, China
- Kevin Y. Yip
  - Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, China
  - Sanford Burnham Prebys Medical Discovery Institute, La Jolla, CA, USA
- Stephen Kwok-Wing Tsui
  - School of Biomedical Sciences, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, China
  - Hong Kong Bioinformatics Centre, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, China
- Qin Cao
  - School of Biomedical Sciences, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, China
  - Hong Kong Bioinformatics Centre, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, China
  - Shenzhen Research Institute, The Chinese University of Hong Kong, Shenzhen, China
4
Soibam B. ChromNetMotif: a Python tool to extract chromatin-sate marked motifs in a chromatin interaction network. Bioinformatics Advances 2023; 3:vbad126. [PMID: 37745003 PMCID: PMC10517636 DOI: 10.1093/bioadv/vbad126] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/01/2023] [Revised: 08/08/2023] [Accepted: 09/12/2023] [Indexed: 09/26/2023]
Abstract
Motivation: Analysis of network motifs is crucial to studying the robustness, stability, and functions of complex networks. Genome organization can be viewed as a biological network that consists of interactions between different chromatin regions. These interacting regions are also marked by epigenetic or chromatin states, which can contribute to the overall organization of the chromatin and proper genome function. Therefore, it is crucial to integrate the chromatin states of the nodes when performing motif analysis in chromatin interaction networks. Even though there has been increasing production of chromatin interaction and genome-wide epigenetic modification data, there is a lack of publicly available tools to extract chromatin-state-marked motifs from genome organization data. Results: We develop a Python tool, ChromNetMotif, offering an easy-to-use command line interface to extract chromatin-state-marked motifs from a chromatin interaction network. The tool can extract occurrences, frequencies, and statistical enrichment of the chromatin-state-marked motifs. Visualization files are also generated, which allow the user to interpret the motifs easily. ChromNetMotif also allows the user to leverage the features of a multicore processor environment to reduce computation time for larger networks. The output files generated can be used to perform further downstream analysis. ChromNetMotif aims to serve as an important tool to comprehend the interplay between epigenetics and genome organization. Availability and implementation: ChromNetMotif is available at https://github.com/lncRNAAddict/ChromNetworkMotif.
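The idea of a chromatin-state-marked motif can be illustrated with a small sketch that counts triangles (fully connected 3-node motifs) in an interaction network and groups them by the chromatin states of their nodes. This is not ChromNetMotif's interface or algorithm: the function name, the toy network, and the state labels are hypothetical, and the statistical enrichment step described in the abstract is omitted.

```python
from collections import Counter
from itertools import combinations


def count_state_marked_triangles(edges, node_state):
    """Count triangles keyed by the sorted chromatin states of their nodes."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    counts = Counter()
    # Brute-force over node triples; adequate only for small toy networks.
    for a, b, c in combinations(sorted(neighbors), 3):
        if b in neighbors[a] and c in neighbors[a] and c in neighbors[b]:
            key = tuple(sorted(node_state[n] for n in (a, b, c)))
            counts[key] += 1
    return counts


# Hypothetical chromatin regions (nodes), interactions (edges), and states.
edges = [("r1", "r2"), ("r2", "r3"), ("r1", "r3"), ("r3", "r4"), ("r2", "r4")]
node_state = {"r1": "enhancer", "r2": "promoter", "r3": "enhancer", "r4": "repressed"}
triangle_counts = count_state_marked_triangles(edges, node_state)
```

Testing enrichment would additionally require comparing these counts against counts from randomized networks, which this sketch does not attempt.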
Affiliation(s)
- Benjamin Soibam
  - Department of Computer Science and Engineering Technology, University of Houston-Downtown, Houston, TX 77002, United States
5
Xu H, Yi X, Fan X, Wu C, Wang W, Chu X, Zhang S, Dong X, Wang Z, Wang J, Zhou Y, Zhao K, Yao H, Zheng N, Wang J, Chen Y, Plewczynski D, Sham PC, Chen K, Huang D, Li MJ. Inferring CTCF-binding patterns and anchored loops across human tissues and cell types. Patterns (New York, N.Y.) 2023; 4:100798. [PMID: 37602215 PMCID: PMC10436006 DOI: 10.1016/j.patter.2023.100798] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2022] [Revised: 01/25/2023] [Accepted: 06/20/2023] [Indexed: 08/22/2023]
Abstract
CCCTC-binding factor (CTCF) is a transcription regulator with a complex role in gene regulation. The recognition and effects of CTCF on DNA sequences, chromosome barriers, and enhancer blocking are not well understood. Existing computational tools struggle to assess the regulatory potential of CTCF-binding sites and their impact on chromatin loop formation. Here we have developed a deep-learning model, DeepAnchor, to accurately characterize CTCF binding using high-resolution genomic/epigenomic features. This has revealed distinct chromatin and sequence patterns for CTCF-mediated insulation and looping. An optimized implementation of a previous loop model based on DeepAnchor score excels in predicting CTCF-anchored loops. We have established a compendium of CTCF-anchored loops across 52 human tissue/cell types, and this suggests that genomic disruption of these loops could be a general mechanism of disease pathogenesis. These computational models and resources can help investigate how CTCF-mediated cis-regulatory elements shape context-specific gene regulation in cell development and disease progression.
Affiliation(s)
- Hang Xu
  - Department of Epidemiology and Biostatistics, Key Laboratory of Prevention and Control of Human Major Diseases (Ministry of Education), National Clinical Research Center for Cancer, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, Tianjin 300070, China
  - Singapore Immunology Network (SIgN), Agency for Science, Technology and Research (A*STAR), Singapore 138648, Singapore
- Xianfu Yi
  - Department of Bioinformatics, The Province and Ministry Co-sponsored Collaborative Innovation Center for Medical Epigenetics, School of Basic Medical Sciences, Tianjin Medical University, Tianjin 300070, China
- Xutong Fan
  - Department of Bioinformatics, The Province and Ministry Co-sponsored Collaborative Innovation Center for Medical Epigenetics, School of Basic Medical Sciences, Tianjin Medical University, Tianjin 300070, China
- Chengyue Wu
  - Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Tianjin Medical University, Tianjin 300070, China
- Wei Wang
  - Department of Epidemiology and Biostatistics, Key Laboratory of Prevention and Control of Human Major Diseases (Ministry of Education), National Clinical Research Center for Cancer, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, Tianjin 300070, China
- Xinlei Chu
  - Department of Epidemiology and Biostatistics, Key Laboratory of Prevention and Control of Human Major Diseases (Ministry of Education), National Clinical Research Center for Cancer, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, Tianjin 300070, China
- Shijie Zhang
  - Department of Pharmacology, Tianjin Key Laboratory of Inflammation Biology, School of Basic Medical Sciences, Tianjin Medical University, Tianjin 300070, China
- Xiaobao Dong
  - Department of Genetics, School of Basic Medical Sciences, Tianjin Medical University, Tianjin 300070, China
- Zhao Wang
  - Department of Pharmacology, Tianjin Key Laboratory of Inflammation Biology, School of Basic Medical Sciences, Tianjin Medical University, Tianjin 300070, China
- Jianhua Wang
  - Department of Bioinformatics, The Province and Ministry Co-sponsored Collaborative Innovation Center for Medical Epigenetics, School of Basic Medical Sciences, Tianjin Medical University, Tianjin 300070, China
- Yao Zhou
  - Department of Bioinformatics, The Province and Ministry Co-sponsored Collaborative Innovation Center for Medical Epigenetics, School of Basic Medical Sciences, Tianjin Medical University, Tianjin 300070, China
- Ke Zhao
  - Department of Pharmacology, Tianjin Key Laboratory of Inflammation Biology, School of Basic Medical Sciences, Tianjin Medical University, Tianjin 300070, China
- Hongcheng Yao
  - Centre for PanorOmic Sciences-Genomics and Bioinformatics Cores, The University of Hong Kong, Hong Kong 999077, China
- Nan Zheng
  - Department of Network Security and Informatization, Tianjin Medical University, Tianjin 300070, China
- Junwen Wang
  - Department of Health Sciences Research and Center for Individualized Medicine, Mayo Clinic, Scottsdale, AZ 85259, USA
- Yupeng Chen
  - Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Tianjin Medical University, Tianjin 300070, China
- Dariusz Plewczynski
  - Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland
- Pak Chung Sham
  - Centre for PanorOmic Sciences-Genomics and Bioinformatics Cores, The University of Hong Kong, Hong Kong 999077, China
- Kexin Chen
  - Department of Epidemiology and Biostatistics, Key Laboratory of Prevention and Control of Human Major Diseases (Ministry of Education), National Clinical Research Center for Cancer, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, Tianjin 300070, China
- Dandan Huang
  - Wuxi School of Medicine, Jiangnan University, Wuxi 214122, China
- Mulin Jun Li
  - Department of Epidemiology and Biostatistics, Key Laboratory of Prevention and Control of Human Major Diseases (Ministry of Education), National Clinical Research Center for Cancer, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, Tianjin 300070, China
  - Department of Bioinformatics, The Province and Ministry Co-sponsored Collaborative Innovation Center for Medical Epigenetics, School of Basic Medical Sciences, Tianjin Medical University, Tianjin 300070, China
6
Villaman C, Pollastri G, Saez M, Martin AJ. Benefiting from the intrinsic role of epigenetics to predict patterns of CTCF binding. Comput Struct Biotechnol J 2023; 21:3024-3031. [PMID: 37266407 PMCID: PMC10229758 DOI: 10.1016/j.csbj.2023.05.012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/19/2022] [Revised: 05/11/2023] [Accepted: 05/11/2023] [Indexed: 06/03/2023]
Abstract
Motivation: One of the most relevant mechanisms involved in the determination of chromatin structure is the formation of structural loops, which are also related to the conservation of chromatin states. Many of these loops are stabilized by CCCTC-binding factor (CTCF) proteins at their base. Despite the relevance of chromatin structure and the key role of CTCF, the role of the epigenetic factors that are involved in the regulation of CTCF binding, and thus in the formation of structural loops in the chromatin, is not thoroughly understood. Results: Here we describe a CTCF binding predictor based on Random Forest that employs different epigenetic data and genomic features. Importantly, given the ability of Random Forests to determine the relevance of features for the prediction, our approach also shows how the different types of descriptors impact the binding of CTCF, confirming previous knowledge on the relevance of chromatin accessibility and DNA methylation, and demonstrating the effect of epigenetic modifications on the activity of CTCF. We compared our approach against other predictors and found improved performance in terms of areas under the PR and ROC curves (PRAUC-ROCAUC), outperforming current state-of-the-art methods.
Affiliation(s)
- Camilo Villaman
  - Programa de Doctorado en Genómica Integrativa, Vicerrectoría de Investigación, Universidad Mayor, Santiago, Chile
  - Laboratorio de Redes Biológicas, Centro Científico y Tecnológico de Excelencia Ciencia & Vida, Fundación Ciencia & Vida, Escuela de Ingeniería, Facultad de Ingeniería, Arquitectura y Diseño, Universidad San Sebastián, Santiago, Chile
- Mauricio Saez
  - Centro de Oncología de Precisión, Facultad de Medicina y Ciencias de la Salud, Universidad Mayor, Santiago, Chile
  - Laboratorio de Investigación en Salud de Precisión, Departamento de Procesos Diagnósticos y Evaluación, Facultad de Ciencias de la Salud, Universidad Católica de Temuco, Chile
- Alberto J.M. Martin
  - Laboratorio de Redes Biológicas, Centro Científico y Tecnológico de Excelencia Ciencia & Vida, Fundación Ciencia & Vida, Escuela de Ingeniería, Facultad de Ingeniería, Arquitectura y Diseño, Universidad San Sebastián, Santiago, Chile
7
Shen Y, Zhong Q, Liu T, Wen Z, Shen W, Li L. CharID: a two-step model for universal prediction of interactions between chromatin accessible regions. Brief Bioinform 2022; 23:6514800. [PMID: 35077535 DOI: 10.1093/bib/bbab602] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/25/2021] [Revised: 12/23/2021] [Accepted: 12/24/2021] [Indexed: 11/14/2022]
Abstract
Open chromatin regions (OCRs) allow direct interaction between cis-regulatory elements and trans-acting factors. Therefore, predicting all potential OCR-mediated loops is essential for deciphering the regulatory mechanisms of gene expression. However, existing loop prediction tools are restricted to specific anchor types. Here, we present CharID (Chromatin Accessible Region Interaction Detector), a two-step model that combines neural networks and ensemble learning to predict OCR-mediated loops. In the first step, CharID-Anchor, an attention-based hybrid CNN-BiGRU network, is constructed to discriminate between anchor and nonanchor OCRs. In the second step, CharID-Loop uses a gradient boosting decision tree with a chromosome-split strategy to predict the interactions between anchor OCRs. The performance was assessed in three human cell lines, and CharID showed superior prediction performance compared with other algorithms. In contrast to methods designed to predict a particular type of loop, CharID can detect a variety of chromatin loops, not limited to enhancer-promoter loops or architectural protein-mediated loops. We constructed the OCR-mediated interaction network using the predicted loops and identified hub anchors, which are highlighted by their proximity to housekeeping genes. By analyzing loops containing SNPs associated with cardiovascular disease, we identified an SNP-gene loop indicating the regulatory mechanism of GFOD1. Taken together, CharID universally predicts diverse chromatin loops beyond other state-of-the-art methods, which are limited by anchor types, and beyond experimental techniques, whose sensitivity decays drastically with the genomic distance between anchors. Finally, we hosted Peaksniffer, a user-friendly web server that provides online prediction, query, and visualization of OCRs and associated loops.
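The abstract does not spell out the chromosome-split strategy. One common reading, assumed here, is that whole chromosomes are held out for testing so that candidate pairs from the same chromosome never appear in both training and test sets. The sketch below shows that reading only; the function name, record fields, and held-out chromosome are hypothetical and are not taken from CharID.

```python
def chromosome_split(candidate_pairs, held_out_chromosomes):
    """Split candidate anchor pairs so held-out chromosomes form the test set."""
    held_out = set(held_out_chromosomes)
    train, test = [], []
    for pair in candidate_pairs:
        target = test if pair["chrom"] in held_out else train
        target.append(pair)
    return train, test


# Hypothetical candidate pairs; label 1 marks an observed loop, 0 a negative.
candidate_pairs = [
    {"chrom": "chr1", "anchor1": 10_000, "anchor2": 85_000, "label": 1},
    {"chrom": "chr1", "anchor1": 40_000, "anchor2": 300_000, "label": 0},
    {"chrom": "chr2", "anchor1": 5_000, "anchor2": 60_000, "label": 1},
    {"chrom": "chr8", "anchor1": 120_000, "anchor2": 410_000, "label": 0},
]
train_pairs, test_pairs = chromosome_split(candidate_pairs, ["chr8"])
```

Splitting by chromosome rather than at random is a typical way to reduce leakage between overlapping or neighboring anchors when evaluating a loop classifier.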
Affiliation(s)
- Yin Shen
  - Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, P. R. China
  - 3D Genomics Research Center, Huazhong Agricultural University, Wuhan, 430070, P. R. China
- Quan Zhong
  - Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, P. R. China
  - 3D Genomics Research Center, Huazhong Agricultural University, Wuhan, 430070, P. R. China
- Tian Liu
  - Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, P. R. China
- Zi Wen
  - Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, P. R. China
  - 3D Genomics Research Center, Huazhong Agricultural University, Wuhan, 430070, P. R. China
- Wei Shen
  - Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, P. R. China
  - 3D Genomics Research Center, Huazhong Agricultural University, Wuhan, 430070, P. R. China
- Li Li
  - Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, P. R. China
  - 3D Genomics Research Center, Huazhong Agricultural University, Wuhan, 430070, P. R. China