1
Li Q, Li KY, Nicoletti C, Puri PL, Cao Q, Yip KY. Overcoming artificial structures in resolution-enhanced Hi-C data by signal decomposition and multi-scale attention. bioRxiv 2024:2024.10.21.619560. [PMID: 39484541 PMCID: PMC11526948 DOI: 10.1101/2024.10.21.619560] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2024]
Abstract
Computational enhancement is an important strategy for inferring high-resolution features from genome-wide chromosome conformation capture (Hi-C) data, which typically have limited resolution. Deep learning has been highly successful in this task but we show that it creates prevalent artificial structures in the enhanced data due to the need to divide the large contact matrix into small patches. In addition, previous deep learning methods largely focus on local patterns, which cannot fully capture the complexity of Hi-C data. Here we propose Smooth, High-resolution, and Accurate Reconstruction of Patterns (SHARP) for enhancing Hi-C data. It uses the novel approach of decomposing the data into three types of signals, due to one-dimensional proximity, contiguous domains, and other fine structures, and applies deep learning only to the third type of signals, such that enhancement of the first two is unaffected by the patches. For the deep learning part, SHARP uses both local and global attention mechanisms to capture multi-scale contextual information. We compare SHARP with state-of-the-art methods extensively, including application to data from new samples and another species, and show that SHARP has superior performance in terms of resolution enhancement accuracy, avoiding creation of artificial structures, identifying significant interactions, and enrichment in chromatin states.
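The abstract does not say how SHARP computes its three signal types, so the following is only a minimal sketch of the first idea: separating the signal due to one-dimensional proximity from everything else. It assumes the proximity signal can be approximated by the mean contact count at each genomic distance (each diagonal of the matrix); the function names `diagonal_means` and `split_proximity_signal` are hypothetical and are not taken from SHARP.

```python
def diagonal_means(matrix):
    """Mean contact count at each genomic distance d = |i - j|."""
    n = len(matrix)
    means = []
    for d in range(n):
        values = [matrix[i][i + d] for i in range(n - d)]
        means.append(sum(values) / len(values))
    return means


def split_proximity_signal(matrix):
    """Split a symmetric contact matrix into a distance-decay part and a residual.

    Assumption: the 1D-proximity signal is the per-diagonal mean. This is a
    common 'expected' model for Hi-C data, not necessarily SHARP's.
    """
    n = len(matrix)
    decay = diagonal_means(matrix)
    expected = [[decay[abs(i - j)] for j in range(n)] for i in range(n)]
    residual = [[matrix[i][j] - expected[i][j] for j in range(n)] for i in range(n)]
    return expected, residual


# Toy 4-bin symmetric contact matrix (made-up numbers).
contacts = [
    [10, 6, 2, 1],
    [6, 12, 4, 3],
    [2, 4, 8, 5],
    [1, 3, 5, 10],
]
expected, residual = split_proximity_signal(contacts)
```

Because the expected part depends only on the distance between bins, it can be computed on the whole matrix at once and is therefore independent of how the matrix is later cut into patches; only the residual would need to go through a patch-based model.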
2
Bera P, Mondal J. Machine learning unravels inherent structural patterns in Escherichia coli Hi-C matrices and predicts chromosome dynamics. Nucleic Acids Res 2024; 52:10836-10849. [PMID: 39217471 PMCID: PMC11472170 DOI: 10.1093/nar/gkae749] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/08/2024] [Accepted: 08/19/2024] [Indexed: 09/04/2024]
Abstract
The high-dimensional nature of the chromosomal conformation contact map ('Hi-C map'), even for a microscopically small bacterial cell, poses challenges for extracting meaningful information related to its complex organization. Here we first demonstrate that an artificial deep neural network-based machine-learnt (ML) low-dimensional representation of a recently reported Hi-C interaction map of the archetypal bacterium Escherichia coli can decode crucial underlying structural patterns. The ML-derived representation of the Hi-C map can automatically detect a set of spatially distinct domains across the E. coli genome, reminiscent of the six putative macro-domains previously posited via recombination assay. Subsequently, an ML-generated model assimilates the intricate relationship between the large array of Hi-C-derived chromosomal contact probabilities and the respective diffusive dynamics of each individual chromosomal gene, and identifies an optimal number of functionally important chromosomal contact pairs that are chiefly responsible for the heterogeneous, coordinate-dependent sub-diffusive motions of chromosomal loci. Finally, the ML models, trained on wild-type E. coli, showcased their predictive capabilities on mutant bacterial strains, shedding light on the structural and dynamic nuances of ΔMatP30MM and ΔMukBEF22MM chromosomes. Overall, our results illuminate the power of ML techniques in unraveling the complex relationship between structure and dynamics of bacterial chromosomal loci, promising meaningful connections between ML-derived insights and biological phenomena.
Affiliation(s)
- Palash Bera
  - Tata Institute of Fundamental Research Hyderabad, Telangana 500046, India
- Jagannath Mondal
  - Tata Institute of Fundamental Research Hyderabad, Telangana 500046, India
3
Lu W, Tang Y, Liu Y, Lin S, Shuai Q, Liang B, Zhang R, Cheng Y, Fang D. CatLearning: highly accurate gene expression prediction from histone mark. Brief Bioinform 2024; 25:bbae373. [PMID: 39073831 DOI: 10.1093/bib/bbae373] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/25/2024] [Revised: 06/14/2024] [Accepted: 07/16/2024] [Indexed: 07/30/2024]
Abstract
Histone modifications, known as histone marks, are pivotal in regulating gene expression within cells. The vast array of potential combinations of histone marks presents a considerable challenge in decoding the regulatory mechanisms solely through biological experimental approaches. To overcome this challenge, we have developed a method called CatLearning. It utilizes a modified convolutional neural network architecture with a specially adapted residual network to quantitatively interpret histone marks and predict gene expression. This architecture integrates long-range histone information up to 500 kb and learns chromatin interaction features without 3D information. By using only one histone mark, CatLearning achieves a high level of accuracy. Furthermore, CatLearning predicts gene expression by simulating changes in histone modifications at enhancers and throughout the genome. These findings help comprehend the architecture of histone marks and develop diagnostic and therapeutic targets for diseases with epigenetic changes.
Affiliation(s)
- Weining Lu
  - Beijing National Research Center for Information Science and Technology, Tsinghua University, FIT Building, Haidian District, Beijing 100084, China
- Yin Tang
  - Liangzhu Laboratory, Zhejiang University, 1369 Wenyixi Road, Yuhang District, Hangzhou, Zhejiang, 311121, China
- Yu Liu
  - Life Sciences Institute, Zhejiang University, 866 Yuhangtang Road, Xihu District, Hangzhou, Zhejiang, 310058, China
- Shiyi Lin
  - Life Sciences Institute, Zhejiang University, 866 Yuhangtang Road, Xihu District, Hangzhou, Zhejiang, 310058, China
- Qifan Shuai
  - School of Electron and Computer, Southeast University Chengxian College, 371 Heyan Road, Qixia District, Nanjing, Jiangsu 210088, China
- Bin Liang
  - Department of Automation, Tsinghua University, 1 Tsinghua Garden, Haidian District, Beijing, 100084, China
- Rongqing Zhang
  - Zhejiang Provincial Key Laboratory of Applied Enzymology, Yangtze Delta Region Institute of Tsinghua University, 705 Yatai Road, Jiaxing 314006, China
- Yu Cheng
  - The Chinese University of Hong Kong, Shatin, NT, Hong Kong, 999077, China
- Dong Fang
  - Life Sciences Institute, Zhejiang University, 866 Yuhangtang Road, Xihu District, Hangzhou, Zhejiang, 310058, China
  - Department of Medical Oncology, The Second Affiliated Hospital, Zhejiang University School of Medicine, Key Laboratory of Cancer Prevention and Intervention, China National Ministry of Education, 88 Jiefang Road, Shangcheng District, Hangzhou, Zhejiang, 310009, China
4
Fang T, Liu Y, Woicik A, Lu M, Jha A, Wang X, Li G, Hristov B, Liu Z, Xu H, Noble WS, Wang S. Enhancing Hi-C contact matrices for loop detection with Capricorn: a multiview diffusion model. Bioinformatics 2024; 40:i471-i480. [PMID: 38940142 PMCID: PMC11211821 DOI: 10.1093/bioinformatics/btae211] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/29/2024]
Abstract
MOTIVATION: High-resolution Hi-C contact matrices reveal the detailed three-dimensional architecture of the genome, but high-coverage experimental Hi-C data are expensive to generate. At the same time, chromatin structure analyses struggle with extremely sparse contact matrices. To address this problem, computational methods to enhance low-coverage contact matrices have been developed, but existing methods are largely based on resolution enhancement methods for natural images and hence often employ models that do not distinguish between biologically meaningful contacts, such as loops, and other stochastic contacts.
RESULTS: We present Capricorn, a machine learning model for Hi-C resolution enhancement that incorporates small-scale chromatin features as additional views of the input Hi-C contact matrix and leverages a diffusion probability model backbone to generate a high-coverage matrix. We show that Capricorn outperforms the state of the art in a cross-cell-line setting, improving on existing methods by 17% in mean squared error and 26% in F1 score for chromatin loop identification from the generated high-coverage data. We also demonstrate that Capricorn performs well in the cross-chromosome setting and the cross-chromosome, cross-cell-line setting, improving the downstream loop F1 score by 14% relative to existing methods. We further show that our multiview idea can also be used to improve several existing methods, HiCARN and HiCNN, indicating the wide applicability of this approach. Finally, we use DNA sequence to validate discovered loops and find that the fraction of CTCF-supported loops from Capricorn is similar to that identified from the high-coverage data. Capricorn is a powerful Hi-C resolution enhancement method that enables scientists to find chromatin features that cannot be identified in the low-coverage contact matrix.
AVAILABILITY AND IMPLEMENTATION: Implementation of Capricorn and source code for reproducing all figures in this paper are available at https://github.com/CHNFTQ/Capricorn.
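The two metrics quoted in the abstract, mean squared error and loop F1 score, can be illustrated with a short sketch. The abstract does not state how predicted loops are matched to reference loops, so the matching rule here (a predicted loop counts as correct if both anchors fall within `tolerance` bins of a not-yet-matched reference loop) is an assumption, and the function names are hypothetical rather than taken from the Capricorn code base.

```python
def mean_squared_error(enhanced, target):
    """Average squared difference over all matrix entries."""
    diffs = [
        (e - t) ** 2
        for row_e, row_t in zip(enhanced, target)
        for e, t in zip(row_e, row_t)
    ]
    return sum(diffs) / len(diffs)


def loop_f1(predicted, reference, tolerance=1):
    """F1 score for loop calls given as (bin_i, bin_j) anchor pairs.

    Assumed matching rule: greedy one-to-one matching within `tolerance`
    bins on both anchors. Real benchmarks may use a different rule.
    """
    unmatched = list(reference)
    true_positives = 0
    for pi, pj in predicted:
        for ref in unmatched:
            if abs(pi - ref[0]) <= tolerance and abs(pj - ref[1]) <= tolerance:
                unmatched.remove(ref)  # each reference loop is matched once
                true_positives += 1
                break
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Made-up loop calls: two of three predictions land near a reference loop.
predicted_loops = [(10, 50), (20, 80), (30, 35)]
reference_loops = [(10, 51), (21, 80), (60, 90), (70, 75)]
score = loop_f1(predicted_loops, reference_loops)
```

In this toy example precision is 2/3 and recall is 2/4, which gives an F1 score of 4/7.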
Affiliation(s)
- Tangqi Fang
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA 98195, United States
- Yifeng Liu
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA 98195, United States
- Addie Woicik
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA 98195, United States
- Minsi Lu
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA 98195, United States
- Anupama Jha
  - Department of Genome Sciences, University of Washington, Seattle, WA 98195, United States
- Xiao Wang
  - Department of Computer Science, Purdue University, West Lafayette, IN 47907, United States
- Gang Li
  - Department of Genome Sciences, University of Washington, Seattle, WA 98195, United States
  - eScience Institute, University of Washington, Seattle, WA 98195, United States
- Borislav Hristov
  - Department of Genome Sciences, University of Washington, Seattle, WA 98195, United States
- Zixuan Liu
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA 98195, United States
- Hanwen Xu
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA 98195, United States
- William S Noble
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA 98195, United States
  - Department of Genome Sciences, University of Washington, Seattle, WA 98195, United States
- Sheng Wang
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA 98195, United States
5
Wang Y, Cheng J. HiCDiff: single-cell Hi-C data denoising with diffusion models. Brief Bioinform 2024; 25:bbae279. [PMID: 38856167 PMCID: PMC11163381 DOI: 10.1093/bib/bbae279] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/08/2024] [Revised: 05/21/2024] [Accepted: 05/29/2024] [Indexed: 06/11/2024]
Abstract
The genome-wide single-cell chromosome conformation capture technique, i.e. single-cell Hi-C (ScHi-C), was recently developed to interrogate the conformation of the genome of individual cells. However, single-cell Hi-C data are much sparser than bulk Hi-C data of a population of cells, and noise in single-cell Hi-C makes it difficult to apply and analyze them in biological research. Here, we developed the first generative diffusion models (HiCDiff) to denoise single-cell Hi-C data in the form of chromosomal contact matrices. HiCDiff uses a deep residual network to remove the noise in the reverse process of diffusion and can be trained in both unsupervised and supervised learning modes. Benchmarked on several single-cell Hi-C test datasets, the diffusion models substantially remove the noise in single-cell Hi-C data. The unsupervised HiCDiff outperforms most supervised non-diffusion deep learning methods and achieves performance comparable to that of the state-of-the-art supervised deep learning method in terms of multiple metrics, demonstrating that diffusion models are a useful approach to denoising single-cell Hi-C data. Moreover, its good performance holds when denoising bulk Hi-C data.
Affiliation(s)
- Yanli Wang
  - Department of Electrical Engineering and Computer Science, NextGen Precision Health Institute, University of Missouri, Columbia, MO 65211, United States
- Jianlin Cheng
  - Department of Electrical Engineering and Computer Science, NextGen Precision Health Institute, University of Missouri, Columbia, MO 65211, United States
6
Xu J, Xu X, Huang D, Luo Y, Lin L, Bai X, Zheng Y, Yang Q, Cheng Y, Huang A, Shi J, Bo X, Gu J, Chen H. A comprehensive benchmarking with interpretation and operational guidance for the hierarchy of topologically associating domains. Nat Commun 2024; 15:4376. [PMID: 38782890 PMCID: PMC11116433 DOI: 10.1038/s41467-024-48593-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/18/2023] [Accepted: 05/03/2024] [Indexed: 05/25/2024]
Abstract
Topologically associating domains (TADs), megabase-scale features of chromatin spatial architecture, are organized in a domain-within-domain TAD hierarchy. Within TADs, the inner and smaller subTADs not only manifest cell-to-cell variability, but also precisely regulate transcription and differentiation. Although over 20 TAD callers are able to detect TADs, their usability in biomedicine is limited by disagreement among their outputs and by an incomplete understanding of TAD hierarchy. We compare 13 computational tools across various conditions and develop a metric to evaluate the similarity of TAD hierarchy. Although outputs of TAD hierarchy at each level vary among callers, data resolutions, sequencing depths, and matrix normalization methods, they are more consistent when they have a higher similarity of larger TADs. We present comprehensive benchmarking of TAD hierarchy callers and operational guidance for life science researchers. Moreover, by simulating the mixing of different types of cells, we confirm that TAD hierarchy is generated not simply from stacking Hi-C heatmaps of heterogeneous cells. Finally, we propose an air conditioner model to decipher the role of TAD hierarchy in transcription.
Affiliation(s)
- Jingxuan Xu
  - Key Laboratory of Carcinogenesis and Translational Research (Ministry of Education/Beijing), Department of Gastrointestinal Surgery, Peking University Cancer Hospital & Institute, Beijing, 100142, China
- Xiang Xu
  - Academy of Military Medical Science, Beijing, 100850, China
- Dandan Huang
  - Department of Oncology, Peking University Shougang Hospital, Beijing, China
  - Center for Precision Diagnosis and Treatment of Colorectal Cancer and Inflammatory Diseases, Peking University Health Science Center, Beijing, China
- Yawen Luo
  - Academy of Military Medical Science, Beijing, 100850, China
- Lin Lin
  - Academy of Military Medical Science, Beijing, 100850, China
  - School of Computer Science and Information Technology & KLAS, Northeast Normal University, Changchun, China
- Xuemei Bai
  - Academy of Military Medical Science, Beijing, 100850, China
- Yang Zheng
  - Academy of Military Medical Science, Beijing, 100850, China
- Qian Yang
  - Academy of Military Medical Science, Beijing, 100850, China
- Yu Cheng
  - Key Laboratory of Carcinogenesis and Translational Research (Ministry of Education/Beijing), Department of Gastrointestinal Surgery, Peking University Cancer Hospital & Institute, Beijing, 100142, China
- An Huang
  - Key Laboratory of Carcinogenesis and Translational Research (Ministry of Education/Beijing), Department of Gastrointestinal Surgery, Peking University Cancer Hospital & Institute, Beijing, 100142, China
- Jingyi Shi
  - Key Laboratory of Carcinogenesis and Translational Research (Ministry of Education/Beijing), Department of Gastrointestinal Surgery, Peking University Cancer Hospital & Institute, Beijing, 100142, China
- Xiaochen Bo
  - Academy of Military Medical Science, Beijing, 100850, China
- Jin Gu
  - Key Laboratory of Carcinogenesis and Translational Research (Ministry of Education/Beijing), Department of Gastrointestinal Surgery, Peking University Cancer Hospital & Institute, Beijing, 100142, China
  - Department of Oncology, Peking University Shougang Hospital, Beijing, China
  - Center for Precision Diagnosis and Treatment of Colorectal Cancer and Inflammatory Diseases, Peking University Health Science Center, Beijing, China
  - Peking-Tsinghua Center for Life Sciences, Peking University, Beijing, China
  - Peking University International Cancer Institute, Beijing, China
- Hebing Chen
  - Academy of Military Medical Science, Beijing, 100850, China
7
Liu R, Xu R, Yan S, Li P, Jia C, Sun H, Sheng K, Wang Y, Zhang Q, Guo J, Xin X, Li X, Guo D. Hi-C, a chromatin 3D structure technique advancing the functional genomics of immune cells. Front Genet 2024; 15:1377238. [PMID: 38586584 PMCID: PMC10995239 DOI: 10.3389/fgene.2024.1377238] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/07/2024] [Accepted: 03/13/2024] [Indexed: 04/09/2024]
Abstract
The functional performance of immune cells relies on a complex transcriptional regulatory network. The three-dimensional structure of chromatin can affect chromatin status and gene expression patterns, and plays an important regulatory role in gene transcription. Currently available techniques for studying chromatin spatial structure include chromatin conformation capture techniques and their derivatives, chromatin accessibility sequencing techniques, and others. Additionally, the recently emerged deep learning technology can be utilized as a tool to enhance the analysis of data. In this review, we elucidate the definition and significance of the three-dimensional chromatin structure, summarize the technologies available for studying it, and describe the research progress on the chromatin spatial structure of dendritic cells, macrophages, T cells, B cells, and neutrophils.
Affiliation(s)
- Dianhao Guo
  - School of Clinical and Basic Medical Sciences, Shandong First Medical University and Shandong Academy of Medical Sciences, Jinan, Shandong, China
8
Murtaza G, Jain A, Hughes M, Wagner J, Singh R. A Comprehensive Evaluation of Generalizability of Deep Learning-Based Hi-C Resolution Improvement Methods. Genes (Basel) 2023; 15:54. [PMID: 38254945 PMCID: PMC10815746 DOI: 10.3390/genes15010054] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/04/2023] [Revised: 12/24/2023] [Accepted: 12/26/2023] [Indexed: 01/24/2024]
Abstract
Hi-C is a widely used technique to study the 3D organization of the genome. Due to its high sequencing cost, most of the generated datasets are of a coarse resolution, which makes it impractical to study finer chromatin features such as Topologically Associating Domains (TADs) and chromatin loops. Multiple deep learning-based methods have recently been proposed to increase the resolution of these datasets by imputing Hi-C reads (typically called upscaling). However, the existing works evaluate these methods on either synthetically downsampled datasets or a small subset of experimentally generated sparse Hi-C datasets, making it hard to establish their generalizability in the real-world use case. We present our framework, Hi-CY, which compares existing Hi-C resolution upscaling methods on seven experimentally generated low-resolution Hi-C datasets, spanning various levels of read sparsity and originating from three cell lines, on a comprehensive set of evaluation metrics. Hi-CY also includes four downstream analysis tasks, such as TAD and chromatin loop recall, to provide a thorough report on the generalizability of these methods. We observe that existing deep learning methods fail to generalize to experimentally generated sparse Hi-C datasets, showing a performance reduction of up to 57%. As a potential solution, we find that retraining deep learning-based methods with experimentally generated Hi-C datasets improves performance by up to 31%. More importantly, Hi-CY shows that even with retraining, the existing deep learning-based methods struggle to recover biological features such as chromatin loops and TADs when provided with sparse Hi-C datasets. Our study, through the Hi-CY framework, highlights the need for rigorous evaluation in the future. We identify specific avenues for improvements in the current deep learning-based Hi-C upscaling methods, including but not limited to using experimentally generated datasets for training.
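For readers unfamiliar with the "synthetically downsampled datasets" the abstract contrasts with experimentally sparse data, here is a minimal sketch of one common way to produce them: keeping each read independently with a fixed probability (binomial thinning). The abstract does not specify the downsampling procedure used in the evaluated works, so the function name `downsample_contacts`, the `keep_fraction` parameter, and the 1/16 ratio below are illustrative assumptions.

```python
import random


def downsample_contacts(matrix, keep_fraction, seed=0):
    """Thin a symmetric integer contact matrix by keeping each read with
    probability `keep_fraction` (assumed procedure, for illustration only)."""
    rng = random.Random(seed)
    n = len(matrix)
    sparse = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # Each of the matrix[i][j] reads survives independently.
            kept = sum(1 for _ in range(matrix[i][j]) if rng.random() < keep_fraction)
            sparse[i][j] = kept
            sparse[j][i] = kept  # mirror to keep the matrix symmetric
    return sparse


# Toy 3-bin contact matrix (made-up counts), thinned to roughly 1/16 depth.
dense = [
    [40, 12, 3],
    [12, 36, 9],
    [3, 9, 44],
]
sparse = downsample_contacts(dense, keep_fraction=1 / 16)
```

Thinning of this kind changes only the read depth; it does not reproduce other differences an experimentally sparse library may have, which is consistent with the generalization gap the abstract reports.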
Affiliation(s)
- Ghulam Murtaza
  - Department of Computer Science, Brown University, Providence, RI 02912, USA
- Atishay Jain
  - Department of Computer Science, Brown University, Providence, RI 02912, USA
- Madeline Hughes
  - Department of Computer Science, Brown University, Providence, RI 02912, USA
- Justin Wagner
  - Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD 20899, USA
- Ritambhara Singh
  - Department of Computer Science, Brown University, Providence, RI 02912, USA
  - Center for Computational Molecular Biology, Brown University, Providence, RI 02912, USA
9
Race AM, Fuchs A, Chung HR. Visualization and data exploration of chromosome conformation capture data using Voronoi diagrams with v3c-viz. Sci Rep 2023; 13:22020. [PMID: 38086827 PMCID: PMC10716258 DOI: 10.1038/s41598-023-49179-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/13/2023] [Accepted: 12/05/2023] [Indexed: 12/18/2023]
Abstract
Chromosome conformation capture (3C) sequencing approaches, like Hi-C or micro-C, allow for an unbiased view of chromatin interactions. Most analysis methods rely on so-called interaction matrices, which are derived from counting read pairs in bins of fixed size. Here, we propose the Voronoi diagram, as implemented in Voronoi for chromosome conformation capture data visualization (v3c-viz), to visualize 3C data. The Voronoi diagram corresponds to an adaptive-binning strategy that adapts to the local densities of points. In this way, visualization of data obtained at moderate sequencing depth pinpoints many, if not most, interesting features such as high-frequency contacts. The favorable visualization properties of the Voronoi diagram indicate that the Voronoi diagram as a density estimator can be used to identify high-frequency contacts at a resolution approaching the typical size of enhancers and promoters. v3c-viz is available at https://github.com/imbbLab/v3c-viz.
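The idea of a Voronoi diagram as an adaptive-binning density estimator can be shown in a deliberately simplified setting. v3c-viz works on read pairs, which are points in two dimensions; the sketch below is a one-dimensional analogue only, where each point's Voronoi cell is the interval between the midpoints to its neighbours and the density estimate is the reciprocal of the cell length. The function name `voronoi_density_1d` is hypothetical and this is not the v3c-viz implementation.

```python
def voronoi_density_1d(points, lower, upper):
    """Estimate local density at each point as 1 / (length of its Voronoi cell).

    Assumptions: points are distinct, lie within [lower, upper], and
    lower < upper. Cells at the ends are clipped to the domain boundaries.
    """
    xs = sorted(points)
    densities = []
    for k, x in enumerate(xs):
        left = lower if k == 0 else (xs[k - 1] + x) / 2
        right = upper if k == len(xs) - 1 else (x + xs[k + 1]) / 2
        densities.append((x, 1.0 / (right - left)))
    return densities


# Made-up positions: three points clustered near 0-2 and one isolated point at 10.
positions = [0, 1, 2, 10]
estimate = voronoi_density_1d(positions, lower=0, upper=10)
```

Points in the crowded region get short cells and therefore high density estimates, while the isolated point gets a long cell and a low estimate. That is the sense in which the bin size adapts to the data instead of being fixed in advance.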
Affiliation(s)
- Alan M Race
  - Philipps University Marburg, Institute for Medical Bioinformatics and Biostatistics, Marburg, 35043, Germany
- Alisa Fuchs
  - Max Planck Institute for Molecular Genetics, Epigenomics, Berlin, 14195, Germany
  - Berlin Institute for Medical Systems Biology, Max Delbrück Center, Berlin, 10115, Germany
- Ho-Ryun Chung
  - Philipps University Marburg, Institute for Medical Bioinformatics and Biostatistics, Marburg, 35043, Germany
  - Max Planck Institute for Molecular Genetics, Epigenomics, Berlin, 14195, Germany
10
Huang L, Song M, Shen H, Hong H, Gong P, Deng HW, Zhang C. Deep Learning Methods for Omics Data Imputation. Biology (Basel) 2023; 12:1313. [PMID: 37887023 PMCID: PMC10604785 DOI: 10.3390/biology12101313] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/08/2023] [Revised: 09/28/2023] [Accepted: 10/02/2023] [Indexed: 10/28/2023]
Abstract
One common problem in omics data analysis is missing values, which can arise due to various reasons, such as poor tissue quality and insufficient sample volumes. Instead of discarding missing values and related data, imputation approaches offer an alternative means of handling missing data. However, the imputation of missing omics data is a non-trivial task. Difficulties mainly come from high dimensionality, non-linear or non-monotonic relationships within features, technical variations introduced by sampling methods, sample heterogeneity, and the non-random missingness mechanism. Several advanced imputation methods, including deep learning-based methods, have been proposed to address these challenges. Due to its capability of modeling complex patterns and relationships in large and high-dimensional datasets, many researchers have adopted deep learning models to impute missing omics data. This review provides a comprehensive overview of the currently available deep learning-based methods for omics imputation from the perspective of deep generative model architectures such as autoencoder, variational autoencoder, generative adversarial networks, and Transformer, with an emphasis on multi-omics data imputation. In addition, this review also discusses the opportunities that deep learning brings and the challenges that it might face in this field.
Affiliation(s)
- Lei Huang
  - School of Computing Sciences and Computer Engineering, University of Southern Mississippi, Hattiesburg, MS 39406, USA
- Meng Song
  - School of Computing Sciences and Computer Engineering, University of Southern Mississippi, Hattiesburg, MS 39406, USA
- Hui Shen
  - Center for Biomedical Informatics and Genomics, School of Medicine, Tulane University, New Orleans, LA 70112, USA
- Huixiao Hong
  - Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, AR 72079, USA
- Ping Gong
  - Environmental Laboratory, U.S. Army Engineer Research and Development Center, Vicksburg, MS 39180, USA
- Hong-Wen Deng
  - Center for Biomedical Informatics and Genomics, School of Medicine, Tulane University, New Orleans, LA 70112, USA
- Chaoyang Zhang
  - School of Computing Sciences and Computer Engineering, University of Southern Mississippi, Hattiesburg, MS 39406, USA
11
Baur B, Roy S. Predicting patient-specific enhancer-promoter interactions. Cell Rep Methods 2023; 3:100594. [PMID: 37751694 PMCID: PMC10545932 DOI: 10.1016/j.crmeth.2023.100594] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/25/2023] [Revised: 08/30/2023] [Accepted: 08/30/2023] [Indexed: 09/28/2023]
Abstract
Computational methods that can predict hard-to-measure modalities from those that are easier to measure, in a patient-specific manner, play a critical role in personalized medicine. In this issue of Cell Reports Methods, Khurana et al. present differential gene targets of accessible chromatin (DGTAC), an approach which predicts patient-specific enhancer-promoter interactions.
Affiliation(s)
- Brittany Baur
  - Wisconsin Institute for Discovery, 330 N. Orchard Street, Madison, WI 53715, USA; The Max Harry Weil Institute of Critical Care Research & Innovation, University of Michigan, Ann Arbor, MI, USA; Department of Emergency Medicine, University of Michigan Medical School, Ann Arbor, MI, USA
- Sushmita Roy
  - Wisconsin Institute for Discovery, 330 N. Orchard Street, Madison, WI 53715, USA; Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 53715, USA
12
Raffo A, Paulsen J. The shape of chromatin: insights from computational recognition of geometric patterns in Hi-C data. Brief Bioinform 2023; 24:bbad302. [PMID: 37646128 PMCID: PMC10516369 DOI: 10.1093/bib/bbad302] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 05/18/2023] [Revised: 07/05/2023] [Accepted: 08/03/2023] [Indexed: 09/01/2023]
Abstract
The three-dimensional organization of chromatin plays a crucial role in gene regulation and cellular processes like deoxyribonucleic acid (DNA) transcription, replication and repair. Hi-C and related techniques provide detailed views of spatial proximities within the nucleus. However, data analysis is challenging partially due to a lack of well-defined, underpinning mathematical frameworks. Recently, recognizing and analyzing geometric patterns in Hi-C data has emerged as a powerful approach. This review provides a summary of algorithms for automatic recognition and analysis of geometric patterns in Hi-C data and their correspondence with chromatin structure. We classify existing algorithms on the basis of the data representation and pattern recognition paradigm they make use of. Finally, we outline some of the challenges ahead and promising future directions.
Affiliation(s)
- Andrea Raffo
  - Department of Biosciences, University of Oslo, 0316 Oslo, Norway
- Jonas Paulsen
  - Department of Biosciences, University of Oslo, 0316 Oslo, Norway
  - Centre for Bioinformatics, Department of Informatics, University of Oslo, 0316 Oslo, Norway
13
Jahanyar B, Tabatabaee H, Rowhanimanesh A. MS-ACGAN: A modified auxiliary classifier generative adversarial network for schizophrenia's samples augmentation based on microarray gene expression data. Comput Biol Med 2023; 162:107024. [PMID: 37263150 DOI: 10.1016/j.compbiomed.2023.107024] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/17/2022] [Revised: 05/01/2023] [Accepted: 05/09/2023] [Indexed: 06/03/2023]
Abstract
Artificial intelligence-based models and robust computational methods have expedited the data-to-knowledge trajectory in precision medicine. Although machine learning models have been widely applied in medical data analysis, some barriers remain challenging, such as the shortage of available biosamples, prohibitive costs, rare diseases, and ethical considerations. Transcriptomics, an omics approach that studies gene activities and provides gene expression data such as microarray and RNA-sequencing data, faces the difficulties of biospecimen collection, particularly for mental disorders, as some psychiatric patients avoid medical care. Microarray data suffer from the low number of available samples, making it challenging to apply machine learning models. However, the generative adversarial network (GAN), the hottest paradigm in deep learning, has created unprecedented momentum in data augmentation and efficiently expands datasets. This paper proposes a novel model termed MS-ACGAN, where the generator feeds on a bordered Gaussian distribution. In machine learning, calibration is of utmost importance: it gives insight into model uncertainty and is considered a crucial step toward improving the robustness and reliability of models. Therefore, we apply calibration techniques to classifiers and focus on estimating their probabilities as accurately as possible. Additionally, we present our trustworthy outputs by harnessing confidence intervals that confine the point estimate limitations and report a range of expected values for performance metrics. Both concepts statistically describe the implemented model's reliability in this study. Furthermore, we employ two quantitative measures, GAN-train and GAN-test, to demonstrate that the artificial data generated by our robust approach remarkably resembles the original data characteristics.
Affiliation(s)
- Bahareh Jahanyar
  - Department of Computer Engineering, Mashhad Branch, Islamic Azad University, Mashhad, Iran
- Hamid Tabatabaee
  - Department of Computer Engineering, Mashhad Branch, Islamic Azad University, Mashhad, Iran
|
14
|
Wang Y, Guo Z, Cheng J. Single-cell Hi-C data enhancement with deep residual and generative adversarial networks. Bioinformatics 2023; 39:btad458. [PMID: 37498561 PMCID: PMC10403428 DOI: 10.1093/bioinformatics/btad458] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/19/2023] [Revised: 07/19/2023] [Accepted: 07/25/2023] [Indexed: 07/28/2023] Open
Abstract
MOTIVATION: The spatial genome organization of a eukaryotic cell is important for its function. The development of single-cell technologies for probing the 3D genome conformation, especially single-cell chromosome conformation capture techniques, has enabled us to understand genome function better than before. However, due to the extreme sparsity and high noise associated with single-cell Hi-C data, it is still difficult to study genome structure and function using the Hi-C data of a single cell. RESULTS: In this work, we developed a deep learning method, ScHiCEDRN, based on deep residual networks and generative adversarial networks for the imputation and enhancement of Hi-C data of a single cell. In terms of both image evaluation and Hi-C reproducibility metrics, ScHiCEDRN outperforms four deep learning methods (DeepHiC, HiCPlus, HiCSR, and Loopenhance) in enhancing the raw single-cell Hi-C data of human and Drosophila. The experiments also show that it can generate single-cell Hi-C data more suitable for identifying topologically associating domain boundaries and reconstructing 3D chromosome structures than the existing methods. Moreover, ScHiCEDRN's performance generalizes well across different single cells and cell types, and it can be applied to improving population Hi-C data. AVAILABILITY AND IMPLEMENTATION: The source code of ScHiCEDRN is available at the GitHub repository: https://github.com/BioinfoMachineLearning/ScHiCEDRN.
Affiliation(s)
- Yanli Wang
  - Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211, United States
  - NextGen Precision Health Institute, University of Missouri, Columbia, MO 65211, United States
- Zhiye Guo
  - Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211, United States
  - NextGen Precision Health Institute, University of Missouri, Columbia, MO 65211, United States
- Jianlin Cheng
  - Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211, United States
  - NextGen Precision Health Institute, University of Missouri, Columbia, MO 65211, United States
|
15
|
Li K, Zhang P, Wang Z, Shen W, Sun W, Xu J, Wen Z, Li L. iEnhance: a multi-scale spatial projection encoding network for enhancing chromatin interaction data resolution. Brief Bioinform 2023; 24:bbad245. [PMID: 37381618 DOI: 10.1093/bib/bbad245] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2023] [Revised: 06/06/2023] [Accepted: 06/12/2023] [Indexed: 06/30/2023] Open
Abstract
Although sequencing-based high-throughput chromatin interaction data are widely used to uncover genome-wide three-dimensional chromatin architecture, their sparseness and low signal-to-noise ratio greatly restrict the precision of the obtained structural elements. To improve data quality, we here present iEnhance (chromatin interaction data resolution enhancement), a multi-scale spatial projection and encoding network, to predict high-resolution chromatin interaction matrices from low-resolution and noisy input data. Specifically, iEnhance projects the input data into matrix spaces to extract multi-scale global and local feature sets, then hierarchically fuses these features using an attention mechanism. After that, dense channel encoding and residual channel decoding are used to infer robust chromatin interaction maps. iEnhance outperforms state-of-the-art Hi-C resolution enhancement tools in both visual and quantitative evaluation. Comprehensive analysis shows that, unlike other tools, iEnhance can precisely recover both short-range structural elements and long-range interaction patterns. More importantly, iEnhance can be transferred to data enhancement of other tissues or cell lines of unknown resolution. Furthermore, iEnhance performs robustly in the enhancement of diverse chromatin interaction data, including those from single-cell Hi-C and Micro-C experiments.
Affiliation(s)
- Kai Li
  - Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
- Ping Zhang
  - Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
- Zilin Wang
  - Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
- Wei Shen
  - Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
- Weicheng Sun
  - Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
- Jinsheng Xu
  - Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
- Zi Wen
  - Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
- Li Li
  - Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
  - Hubei Hongshan Laboratory, Huazhong Agricultural University, Wuhan 430070, China
|
16
|
Zhang Y, Blanchette M. Reference panel-guided super-resolution inference of Hi-C data. Bioinformatics 2023; 39:i386-i393. [PMID: 37387127 DOI: 10.1093/bioinformatics/btad266] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/01/2023] Open
Abstract
MOTIVATION: Accurately assessing contacts between DNA fragments inside the nucleus with Hi-C experiments is crucial for understanding the role of 3D genome organization in gene regulation. The task is challenging in part because high-resolution analyses require Hi-C libraries with high sequencing depth. Most existing Hi-C data are collected with limited sequencing coverage, leading to poor chromatin interaction frequency estimation. Current computational approaches to enhance Hi-C signals focus on the analysis of individual Hi-C datasets of interest, without taking advantage of the facts that (i) several hundred Hi-C contact maps are publicly available and (ii) the vast majority of local spatial organizations are conserved across multiple cell types. RESULTS: Here, we present RefHiC-SR, an attention-based deep learning framework that uses a reference panel of Hi-C datasets to facilitate the enhancement of Hi-C data resolution of a given study sample. We compare RefHiC-SR against tools that do not use reference samples and find that RefHiC-SR outperforms other programs across different cell types and sequencing depths. It also enables high-accuracy mapping of structures such as loops and topologically associating domains. AVAILABILITY AND IMPLEMENTATION: https://github.com/BlanchetteLab/RefHiC.
Affiliation(s)
- Yanlin Zhang
  - School of Computer Science, McGill University, Montréal, Québec H3A 0E9, Canada
- Mathieu Blanchette
  - School of Computer Science, McGill University, Montréal, Québec H3A 0E9, Canada
|
17
|
Wang B, Liu K, Li Y, Wang J. DFHiC: a dilated full convolution model to enhance the resolution of Hi-C data. Bioinformatics 2023; 39:btad211. [PMID: 37084258 PMCID: PMC10166584 DOI: 10.1093/bioinformatics/btad211] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/04/2022] [Revised: 02/13/2023] [Accepted: 04/12/2023] [Indexed: 04/22/2023] Open
Abstract
MOTIVATION: Hi-C is the most widely used chromosome conformation capture (3C) experiment. It measures the frequency of all pairwise interactions across the entire genome, making it a powerful tool for studying the 3D structure of the genome. The fineness of the constructed genome structure depends on the resolution of the Hi-C data. However, because high-resolution Hi-C data require deep sequencing and thus high experimental cost, most available Hi-C data are low-resolution. Hence, it is essential to enhance the quality of Hi-C data by developing effective computational methods. RESULTS: In this work, we propose a novel method called DFHiC, which generates a high-resolution Hi-C matrix from a low-resolution Hi-C matrix using a dilated convolutional neural network. The dilated convolution can effectively explore global patterns in the overall Hi-C matrix by drawing on information from longer genomic distances. Consequently, DFHiC can improve the resolution of the Hi-C matrix reliably and accurately. More importantly, the super-resolution Hi-C data enhanced by DFHiC are more in line with real high-resolution Hi-C data than those produced by other existing methods, in terms of both significant chromatin interactions and identified topologically associating domains. AVAILABILITY AND IMPLEMENTATION: https://github.com/BinWangCSU/DFHiC.
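To make concrete why a dilated convolution reaches longer genomic distances, the sketch below implements a plain 2D dilated cross-correlation in NumPy. It is an illustration of the general operation only, not the DFHiC architecture; the function name `dilated_conv2d`, the square-kernel assumption, and the toy input are all hypothetical.

```python
import numpy as np

def dilated_conv2d(x, kernel, dilation=1):
    """Valid 2D cross-correlation with a dilated square kernel (no padding, stride 1)."""
    k = kernel.shape[0]
    span = dilation * (k - 1) + 1  # side length covered by the dilated kernel
    out_h = x.shape[0] - span + 1
    out_w = x.shape[1] - span + 1
    out = np.zeros((out_h, out_w))
    for i in range(k):
        for j in range(k):
            # Each kernel tap reads the input at an offset scaled by the dilation.
            di, dj = i * dilation, j * dilation
            out += kernel[i, j] * x[di:di + out_h, dj:dj + out_w]
    return out

x = np.arange(81, dtype=float).reshape(9, 9)     # toy stand-in for a contact-map patch
kernel = np.ones((3, 3))
dense = dilated_conv2d(x, kernel, dilation=1)    # each output covers 3 x 3 bins
dilated = dilated_conv2d(x, kernel, dilation=3)  # each output covers 7 x 7 bins
print(dense.shape, dilated.shape)
```

Both calls use the same nine weights; only the spacing between taps changes, so the dilated version sees a wider window of bins at no extra parameter cost.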
Affiliation(s)
- Bin Wang
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China
  - Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha 410083, China
- Kun Liu
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China
  - Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha 410083, China
- Yaohang Li
  - Department of Computer Science, Old Dominion University, Norfolk, VA 23529, United States
- Jianxin Wang
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China
  - Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha 410083, China
|
18
|
Liu K, Li HD, Li Y, Wang J, Wang J. A Comparison of Topologically Associating Domain Callers Based on Hi-C Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:15-29. [PMID: 35104223 DOI: 10.1109/tcbb.2022.3147805] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Indexed: 05/04/2023]
Abstract
Topologically associating domains (TADs) are local chromatin interaction domains, which have been shown to play an important role in gene expression regulation. TADs were originally discovered in the investigation of 3D genome organization based on High-throughput Chromosome Conformation Capture (Hi-C) data. Considerable and continuing effort has been dedicated to developing methods for detecting TADs from Hi-C data. Different computational methods for TAD identification vary in their assumptions and criteria for calling TADs. As a consequence, the TADs called by these methods differ in their similarities and in the biological features they are enriched in. In this work, we performed a systematic comparison of twenty-six TAD callers. We first compared the TADs and gaps between adjacent TADs across different methods, resolutions, and sequencing depths. We then assessed the quality of TADs and TAD boundaries according to three criteria: the decay of contact frequencies over genomic distance, enrichment and depletion of regulatory elements around TAD boundaries, and reproducibility of TADs and TAD boundaries in replicate samples. Last, due to the lack of a gold standard for TADs, we also evaluated the performance of the methods on synthetic datasets. We discussed the key principles of TAD callers and pinpointed the current situation in the detection of TADs. We provide a concise, comprehensive, and systematic framework for evaluating the performance of TAD callers, and expect our work to provide useful guidance in choosing suitable approaches for the detection and evaluation of TADs.
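The first quality criterion, the decay of contact frequency with genomic distance, is commonly summarized as the mean of each diagonal of the contact matrix. The sketch below shows that summary on a toy matrix; the function name `distance_decay` and the synthetic 1 / (1 + separation) signal are assumptions for illustration and do not reproduce the evaluation in the paper.

```python
import numpy as np

def distance_decay(contact_matrix):
    """Mean contact frequency at each genomic separation (in bins)."""
    m = np.asarray(contact_matrix, dtype=float)
    n = m.shape[0]
    # The d-th diagonal holds all bin pairs separated by d bins.
    return np.array([np.diagonal(m, offset=d).mean() for d in range(n)])

# Toy symmetric matrix whose contacts fall off as 1 / (1 + separation).
n = 6
i, j = np.indices((n, n))
toy = 1.0 / (1.0 + np.abs(i - j))
decay = distance_decay(toy)
print(np.round(decay, 3))
```

Restricting the same computation to bin pairs inside versus across called domains is one way to compare how well a caller separates high-contact regions.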
|
19
|
DLoopCaller: A deep learning approach for predicting genome-wide chromatin loops by integrating accessible chromatin landscapes. PLoS Comput Biol 2022; 18:e1010572. [PMID: 36206320 PMCID: PMC9581407 DOI: 10.1371/journal.pcbi.1010572] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/04/2022] [Revised: 10/19/2022] [Accepted: 09/14/2022] [Indexed: 11/20/2022] Open
Abstract
In recent years, major advances have been made in various chromosome conformation capture technologies to further satisfy the needs of researchers for high-quality, high-resolution contact interactions. Discriminating loops from genome-wide contact interactions is crucial for dissecting three-dimensional (3D) genome structure and function. Here, we present a deep learning method to predict genome-wide chromatin loops, called DLoopCaller, which combines accessible chromatin landscapes and raw Hi-C contact maps. Available orthogonal data from ChIA-PET/HiChIP and Capture Hi-C were used to generate positive samples with a wider contact matrix, which makes it possible to find more potential genome-wide chromatin loops. The experimental results demonstrate that DLoopCaller improves the accuracy of predicting genome-wide chromatin loops compared to the state-of-the-art method Peakachu. Moreover, compared to two of the most popular loop callers, HiCCUPS and Fit-Hi-C, DLoopCaller identifies some unique interactions. We conclude that incorporating chromatin landscapes on the one-dimensional genome contributes to understanding 3D genome organization, and that the identified chromatin loops reveal cell-type specificity and transcription factor motif co-enrichment across different cell lines and species.
|
20
|
Zhang S, Plummer D, Lu L, Cui J, Xu W, Wang M, Liu X, Prabhakar N, Shrinet J, Srinivasan D, Fraser P, Li Y, Li J, Jin F. DeepLoop robustly maps chromatin interactions from sparse allele-resolved or single-cell Hi-C data at kilobase resolution. Nat Genet 2022; 54:1013-1025. [PMID: 35817982 PMCID: PMC10082397 DOI: 10.1038/s41588-022-01116-w] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 04/15/2021] [Accepted: 05/30/2022] [Indexed: 11/09/2022]
Abstract
Mapping chromatin loops from noisy Hi-C heatmaps remains a major challenge. Here we present DeepLoop, which performs rigorous bias correction followed by deep-learning-based signal enhancement for robust chromatin interaction mapping from low-depth Hi-C data. DeepLoop enables loop-resolution, single-cell Hi-C analysis. It also achieves a cross-platform convergence between different Hi-C protocols and micrococcal nuclease (micro-C). DeepLoop allowed us to map the genetic and epigenetic determinants of allele-specific chromatin interactions in the human genome. We nominate new loci with allele-specific interactions governed by imprinting or allelic DNA methylation. We also discovered that, in the inactivated X chromosome (Xi), local loops at the DXZ4 'megadomain' boundary escape X-inactivation but the FIRRE 'superloop' locus does not. Importantly, DeepLoop can pinpoint heterozygous single-nucleotide polymorphisms and large structure variants that cause allelic chromatin loops, many of which rewire enhancers with transcription consequences. Taken together, DeepLoop expands the use of Hi-C to provide loop-resolution insights into the genetics of the three-dimensional genome.
Affiliation(s)
- Shanshan Zhang
  - Department of Genetics and Genome Sciences, School of Medicine, Case Western Reserve University, Cleveland, OH, USA
  - The Biomedical Sciences Training Program, School of Medicine, Case Western Reserve University, Cleveland, OH, USA
- Dylan Plummer
  - Department of Computer and Data Sciences, Case Western Reserve University, Cleveland, OH, USA
- Leina Lu
  - Department of Genetics and Genome Sciences, School of Medicine, Case Western Reserve University, Cleveland, OH, USA
- Jian Cui
  - Department of Genetics and Genome Sciences, School of Medicine, Case Western Reserve University, Cleveland, OH, USA
- Wanying Xu
  - Department of Genetics and Genome Sciences, School of Medicine, Case Western Reserve University, Cleveland, OH, USA
  - The Biomedical Sciences Training Program, School of Medicine, Case Western Reserve University, Cleveland, OH, USA
- Miao Wang
  - Department of Biological Science, Florida State University, Tallahassee, FL, USA
- Xiaoxiao Liu
  - Department of Genetics and Genome Sciences, School of Medicine, Case Western Reserve University, Cleveland, OH, USA
- Nachiketh Prabhakar
  - Department of Computer and Data Sciences, Case Western Reserve University, Cleveland, OH, USA
- Jatin Shrinet
  - Department of Biological Science, Florida State University, Tallahassee, FL, USA
- Divyaa Srinivasan
  - Department of Biological Science, Florida State University, Tallahassee, FL, USA
- Peter Fraser
  - Department of Biological Science, Florida State University, Tallahassee, FL, USA
- Yan Li
  - Department of Genetics and Genome Sciences, School of Medicine, Case Western Reserve University, Cleveland, OH, USA
- Jing Li
  - Department of Computer and Data Sciences, Case Western Reserve University, Cleveland, OH, USA
  - Department of Population and Quantitative Health Sciences, Case Comprehensive Cancer Center, Case Western Reserve University, Cleveland, OH, USA
- Fulai Jin
  - Department of Genetics and Genome Sciences, School of Medicine, Case Western Reserve University, Cleveland, OH, USA
  - Department of Computer and Data Sciences, Case Western Reserve University, Cleveland, OH, USA
  - Department of Population and Quantitative Health Sciences, Case Comprehensive Cancer Center, Case Western Reserve University, Cleveland, OH, USA
|
21
|
Dsouza KB, Maslova A, Al-Jibury E, Merkenschlager M, Bhargava VK, Libbrecht MW. Learning representations of chromatin contacts using a recurrent neural network identifies genomic drivers of conformation. Nat Commun 2022; 13:3704. [PMID: 35764630 PMCID: PMC9240038 DOI: 10.1038/s41467-022-31337-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/05/2021] [Accepted: 06/15/2022] [Indexed: 11/28/2022] Open
Abstract
Despite the availability of chromatin conformation capture experiments, discerning the relationship between the 1D genome and 3D conformation remains a challenge, which limits our understanding of their effect on gene expression and disease. We propose Hi-C-LSTM, a method that produces low-dimensional latent representations that summarize intra-chromosomal Hi-C contacts via a recurrent long short-term memory neural network model. We find that these representations contain all the information needed to recreate the observed Hi-C matrix with high accuracy, outperforming existing methods. These representations enable the identification of a variety of conformation-defining genomic elements, including nuclear compartments and conformation-related transcription factors. They furthermore enable in-silico perturbation experiments that measure the influence of cis-regulatory elements on conformation.
Affiliation(s)
- Kevin B Dsouza
  - Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, Canada
- Alexandra Maslova
  - School of Computing Science, Simon Fraser University, Burnaby, Canada
- Ediem Al-Jibury
  - MRC, London Institute of Medical Sciences, Institute of Clinical Sciences, Faculty of Medicine, Imperial College London, London, UK
  - Department of Computing, Imperial College London, London, UK
- Matthias Merkenschlager
  - MRC, London Institute of Medical Sciences, Institute of Clinical Sciences, Faculty of Medicine, Imperial College London, London, UK
- Vijay K Bhargava
  - Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, Canada
|
22
|
Xie Q, Han C, Jin V, Lin S. HiCImpute: A Bayesian hierarchical model for identifying structural zeros and enhancing single cell Hi-C data. PLoS Comput Biol 2022; 18:e1010129. [PMID: 35696429 PMCID: PMC9232133 DOI: 10.1371/journal.pcbi.1010129] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/02/2021] [Revised: 06/24/2022] [Accepted: 04/21/2022] [Indexed: 11/19/2022] Open
Abstract
Single cell Hi-C techniques enable one to study cell-to-cell variability in chromatin interactions. However, single cell Hi-C (scHi-C) data suffer severely from sparsity, that is, the existence of excess zeros due to insufficient sequencing depth. Complicating the matter further is the fact that not all zeros are created equal: some are due to loci truly not interacting because of the underlying biological mechanism (structural zeros); others are indeed due to insufficient sequencing depth (sampling zeros or dropouts), especially for loci that interact infrequently. Differentiating between structural zeros and dropouts is important since correct inference would improve downstream analyses such as clustering and discovery of subtypes. Nevertheless, distinguishing between these two types of zeros has received little attention in the single cell Hi-C literature, where the issue of sparsity has been addressed mainly as a data quality improvement problem. To fill this gap, in this paper, we propose HiCImpute, a Bayesian hierarchical model that goes beyond data quality improvement by also identifying observed zeros that are in fact structural zeros. HiCImpute takes spatial dependencies of the scHi-C 2D data structure into account while also borrowing information from similar single cells and bulk data, when available. Through an extensive set of analyses of synthetic and real data, we demonstrate the ability of HiCImpute to identify structural zeros with high sensitivity and to impute dropout values accurately. Downstream analyses using data improved by HiCImpute yielded much more accurate clustering of cell types compared to using observed data or data improved by several comparison methods. Most significantly, HiCImpute-improved data have led to the identification of subtypes within each of the excitatory neuronal cells of L4 and L5 in the prefrontal cortex.
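The distinction between structural zeros and sampling zeros can be seen in a small simulation. The sketch below is not the HiCImpute model; it only assumes, for illustration, that 30% of locus pairs have a true interaction rate of zero, that the remaining rates are gamma-distributed, and that observed counts are Poisson with a mean scaled by a hypothetical sequencing `depth`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs = 10_000

# Hypothetical ground truth: 30% of locus pairs never interact (structural zeros).
is_structural = rng.random(n_pairs) < 0.3
true_rate = np.where(is_structural, 0.0,
                     rng.gamma(shape=2.0, scale=1.0, size=n_pairs))

def observe(depth):
    """Draw observed counts; lower depth produces more sampling zeros (dropouts)."""
    return rng.poisson(depth * true_rate)

for depth in (1.0, 0.1):
    counts = observe(depth)
    zeros = counts == 0
    dropouts = zeros & ~is_structural
    print(f"depth={depth}: zeros={zeros.mean():.2f}, "
          f"of which dropouts={dropouts.sum() / zeros.sum():.2f}")
```

In the observed counts the two kinds of zero look identical, which is why a model has to bring in extra information such as neighboring bins, similar cells, or bulk data to tell them apart.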
Affiliation(s)
- Qing Xie
  - Interdisciplinary Ph.D. Program in Biostatistics, Ohio State University, Columbus, Ohio, United States of America
- Chenggong Han
  - Interdisciplinary Ph.D. Program in Biostatistics, Ohio State University, Columbus, Ohio, United States of America
- Victor Jin
  - Department of Molecular Medicine, University of Texas Health Science Center, San Antonio, Texas, United States of America
- Shili Lin
  - Interdisciplinary Ph.D. Program in Biostatistics, Ohio State University, Columbus, Ohio, United States of America
  - Department of Statistics, Ohio State University, Columbus, Ohio, United States of America
  - Translational Data Analytics Institute, Ohio State University, Columbus, Ohio, United States of America
|
23
|
Huang L, Yang Y, Li G, Jiang M, Wen J, Abnousi A, Rosen JD, Hu M, Li Y. A systematic evaluation of Hi-C data enhancement methods for enhancing PLAC-seq and HiChIP data. Brief Bioinform 2022; 23:bbac145. [PMID: 35488276 PMCID: PMC9116213 DOI: 10.1093/bib/bbac145] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 11/04/2021] [Revised: 03/30/2022] [Accepted: 03/31/2022] [Indexed: 11/12/2022] Open
Abstract
The three-dimensional organization of chromatin plays a critical role in gene regulation. Recently developed technologies, such as HiChIP and proximity ligation-assisted ChIP-Seq (PLAC-seq) (hereafter referred to as HP for brevity), can measure chromosome spatial organization by interrogating chromatin interactions mediated by a protein of interest. While offering cost-efficiency over genome-wide unbiased high-throughput chromosome conformation capture (Hi-C) data, HP data remain sparse at kilobase (Kb) resolution with the current sequencing depth on the order of 10^8 reads per sample. Deep learning models, including HiCPlus, HiCNN, HiCNN2, DeepHiC and Variationally Encoded Hi-C Loss Enhancer (VEHiCLE), have been developed to enhance the sequencing depth of Hi-C data, but their performance on HP data has not been benchmarked. Here, we performed a comprehensive evaluation of HP data sequencing depth enhancement using models developed for Hi-C data. Specifically, we analyzed various HP data, including Smc1a HiChIP data of the human lymphoblastoid cell line GM12878, H3K4me3 PLAC-seq data of four human neural cell types as well as of mouse embryonic stem cells (mESC), and mESC CCCTC-binding factor (CTCF) PLAC-seq data. Our evaluations lead to the following three findings: (i) most models developed for Hi-C data achieve reasonable performance when applied to HP data (e.g. with Pearson correlation ranging from 0.76 to 0.95 for pairs of loci within 300 Kb), and the enhanced datasets lead to improved statistical power for detecting long-range chromatin interactions, (ii) models trained on HP data outperform those trained on Hi-C data and (iii) most models are transferable across cell types. Our results provide a general guideline for HP data enhancement using existing methods designed for Hi-C data.
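A Pearson correlation restricted to pairs of loci within a fixed genomic distance, as quoted in finding (i), can be computed along the following lines. This is a sketch under stated assumptions rather than the benchmark's actual code: the function name `pearson_within_distance`, the 10 Kb bin size, and the synthetic matrices are hypothetical.

```python
import numpy as np

def pearson_within_distance(enhanced, reference, max_bins):
    """Pearson correlation over upper-triangle bin pairs at most max_bins apart."""
    a = np.asarray(enhanced, dtype=float)
    b = np.asarray(reference, dtype=float)
    i, j = np.triu_indices(a.shape[0])
    keep = (j - i) <= max_bins
    return np.corrcoef(a[i[keep], j[keep]], b[i[keep], j[keep]])[0, 1]

rng = np.random.default_rng(0)
n, bin_size_kb = 50, 10                       # assumed 10 Kb bins
idx = np.arange(n)
reference = 100.0 / (1.0 + np.abs(idx[:, None] - idx[None, :]))
noise = rng.normal(scale=1.0, size=(n, n))
enhanced = reference + (noise + noise.T) / 2  # symmetric noisy copy of the reference

r = pearson_within_distance(enhanced, reference, max_bins=300 // bin_size_kb)
print(f"Pearson r within 300 Kb: {r:.3f}")
```

Because contact frequency drops steeply with distance, a correlation pooled over all distances is dominated by that decay; stratifying by distance is a common refinement when a stricter comparison is needed.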
Affiliation(s)
- Le Huang
  - Curriculum in Bioinformatics and Computational Biology, University of North Carolina at Chapel Hill, North Carolina 27599, USA
- Yuchen Yang
  - State Key Laboratory of Biocontrol, School of Ecology, Sun Yat-sen University, 510275 Guangzhou, China
- Gang Li
  - Department of Statistics and Operations Research, University of North Carolina at Chapel Hill, NC 27599, USA
- Minzhi Jiang
  - Department of Applied Physical Sciences, University of North Carolina at Chapel Hill, NC 27599, USA
- Jia Wen
  - Department of Genetics, University of North Carolina at Chapel Hill, North Carolina 27599, USA
- Armen Abnousi
  - Department of Quantitative Health Sciences, Lerner Research Institute, Cleveland Clinic Foundation, Cleveland, Ohio 44195
- Jonathan D Rosen
  - Department of Biostatistics, University of North Carolina at Chapel Hill, North Carolina 27599, USA
- Ming Hu
  - Department of Quantitative Health Sciences, Lerner Research Institute, Cleveland Clinic Foundation, Cleveland, Ohio 44195
- Yun Li
  - Department of Genetics, University of North Carolina at Chapel Hill, North Carolina 27599, USA
  - Department of Biostatistics, University of North Carolina at Chapel Hill, North Carolina 27599, USA
  - Department of Computer Science, University of North Carolina at Chapel Hill, North Carolina 27599, USA
|
24
|
Feng F, Yao Y, Wang XQD, Zhang X, Liu J. Connecting high-resolution 3D chromatin organization with epigenomics. Nat Commun 2022; 13:2054. [PMID: 35440119 PMCID: PMC9018831 DOI: 10.1038/s41467-022-29695-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 10/21/2020] [Accepted: 03/28/2022] [Indexed: 11/09/2022] Open
Abstract
The resolution of chromatin conformation capture technologies keeps increasing, and the recent nucleosome resolution chromatin contact maps allow us to explore how fine-scale 3D chromatin organization is related to epigenomic states in human cells. Using publicly available Micro-C datasets, we develop a deep learning model, CAESAR, to learn a mapping function from epigenomic features to 3D chromatin organization. The model accurately predicts fine-scale structures, such as short-range chromatin loops and stripes, that Hi-C fails to detect. With existing epigenomic datasets from ENCODE and Roadmap Epigenomics Project, we successfully impute high-resolution 3D chromatin contact maps for 91 human tissues and cell lines. In the imputed high-resolution contact maps, we identify the spatial interactions between genes and their experimentally validated regulatory elements, demonstrating CAESAR's potential in coupling transcriptional regulation with 3D chromatin organization at high resolution.
Affiliation(s)
- Fan Feng
  - Department of Computational Medicine & Bioinformatics, University of Michigan, Ann Arbor, MI, USA
- Yuan Yao
  - Department of Computer Science & Engineering, University of Michigan, Ann Arbor, MI, USA
- Xue Qing David Wang
  - Division of Hematology, Department of Medicine, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
- Xiaotian Zhang
  - Department of Pathology, University of Michigan, Ann Arbor, MI, USA
- Jie Liu
  - Department of Computational Medicine & Bioinformatics, University of Michigan, Ann Arbor, MI, USA
  - Department of Computer Science & Engineering, University of Michigan, Ann Arbor, MI, USA
|
25
|
Sefer E. A comparison of topologically associating domain callers over mammals at high resolution. BMC Bioinformatics 2022; 23:127. [PMID: 35413815 PMCID: PMC9006547 DOI: 10.1186/s12859-022-04674-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/30/2021] [Accepted: 04/07/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Topologically associating domains (TADs) are locally highly interacting genome regions, which also play a critical role in regulating gene expression in the cell. TADs were first identified while investigating the 3D genome structure in High-throughput Chromosome Conformation Capture (Hi-C) interaction datasets. Substantial effort has been devoted to developing techniques for inferring TADs from Hi-C interaction datasets. Many TAD-calling methods have been developed, which differ in their criteria and assumptions for TAD inference. Correspondingly, TADs inferred by these callers vary in terms of both their similarities and the biological features they are enriched in. RESULT: We have carried out a systematic comparison of 27 TAD-calling methods over mammals. We use Micro-C, a recent high-resolution variant of Hi-C, to compare TADs at a very high resolution, and classify the methods into 3 categories: feature-based methods, clustering methods, and graph-partitioning methods. We have evaluated TAD boundaries, gaps between adjacent TADs, and the quality of TADs across various criteria. We also found CTCF and cohesin proteins in particular to be effective in the formation of TADs with corner dots. We have also assessed the callers' performance on simulated datasets, since a gold standard for TADs is missing. TAD sizes and numbers change remarkably between TAD callers and dataset resolutions, indicating that TADs are hierarchically organized domains rather than disjoint regions. A core subset of feature-based TAD callers regularly performs best in inferring reproducible domains, which are also enriched for TAD-related biological properties. CONCLUSION: We have analyzed the fundamental principles of TAD-calling methods and identified the current situation in TAD inference across high-resolution Micro-C interaction datasets over mammals. We provide a systematic, comprehensive, and concise framework to evaluate the performance of TAD-calling methods across Micro-C datasets. Our research will be useful in selecting appropriate methods for TAD inference and evaluation based on available data, experimental design, and the biological question of interest. We also introduce our analysis as a benchmarking tool with publicly available source code.
Affiliation(s)
- Emre Sefer
- Department of Computer Science, Ozyegin University, Istanbul, Turkey.
26
Hicks P, Oluwadare O. HiCARN: Resolution Enhancement of Hi-C Data Using Cascading Residual Networks. Bioinformatics 2022; 38:2414-2421. [PMID: 35274679 PMCID: PMC9048669 DOI: 10.1093/bioinformatics/btac156] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/01/2021] [Revised: 02/15/2022] [Accepted: 03/10/2022] [Indexed: 11/29/2022] Open
Abstract
Motivation: High throughput chromosome conformation capture (Hi-C) contact matrices are used to predict 3D chromatin structures in eukaryotic cells. High-resolution Hi-C data are less available than low-resolution Hi-C data due to sequencing costs but provide greater insight into the intricate details of 3D chromatin structures such as enhancer–promoter interactions and sub-domains. To provide a cost-effective solution to high-resolution Hi-C data collection, deep learning models are used to predict high-resolution Hi-C matrices from existing low-resolution matrices across multiple cell types. Results: Here, we present two Cascading Residual Networks called HiCARN-1 and HiCARN-2, a convolutional neural network and a generative adversarial network, that use a novel framework of cascading connections throughout the network for Hi-C contact matrix prediction from low-resolution data. Shown by image evaluation and Hi-C reproducibility metrics, both HiCARN models, overall, outperform state-of-the-art Hi-C resolution enhancement algorithms in predictive accuracy for both human and mouse 1/16, 1/32, 1/64 and 1/100 downsampled high-resolution Hi-C data. Also, validation by extracting topologically associating domains, chromosome 3D structure and chromatin loop predictions from the enhanced data shows that HiCARN can proficiently reconstruct biologically significant regions. Availability and implementation: HiCARN can be accessed and utilized as an open-sourced software at: https://github.com/OluwadareLab/HiCARN and is also available as a containerized application that can be run on any platform. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Parker Hicks
- Concordia University Irvine, Irvine, CA 92612, USA
- Oluwatosin Oluwadare
- Department of Computer Science, University of Colorado, Colorado Springs, CO 80918, USA
27
Tran A, Yang P, Yang JYH, Ormerod JT. scREMOTE: Using multimodal single cell data to predict regulatory gene relationships and to build a computational cell reprogramming model. NAR Genom Bioinform 2022; 4:lqac023. [PMID: 35300460 PMCID: PMC8923006 DOI: 10.1093/nargab/lqac023] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 06/25/2021] [Revised: 02/22/2022] [Accepted: 03/10/2022] [Indexed: 11/12/2022] Open
Abstract
Cell reprogramming offers a potential treatment for many diseases by regenerating specialized somatic cells. Despite decades of research, discovering the transcription factors that promote cell reprogramming has largely been accomplished through trial and error, a time-consuming and costly method. A computational model for cell reprogramming, however, could guide hypothesis formulation and experimental validation, making efficient use of time and resources. Current methods often cannot account for the heterogeneity observed in cell reprogramming, or they only make short-term predictions, without modelling the entire reprogramming process. Here, we present scREMOTE, a novel computational model for cell reprogramming that leverages single cell multiomics data, enabling a more holistic view of the regulatory mechanisms at cellular resolution. This is achieved by first identifying the regulatory potential of each transcription factor and gene to uncover regulatory relationships, and then building a regression model to estimate the effect of transcription factor perturbations. We show that scREMOTE successfully predicts the long-term effect of overexpressing two key transcription factors in hair follicle development by capturing higher-order gene regulations. Together, this demonstrates that integrating the multimodal processes governing gene regulation creates a more accurate model for cell reprogramming with significant potential to accelerate research in regenerative medicine.
Affiliation(s)
- Andy Tran
- School of Mathematics and Statistics, The University of Sydney, Camperdown NSW 2006, Australia
- Pengyi Yang
- School of Mathematics and Statistics, The University of Sydney, Camperdown NSW 2006, Australia
- Jean Y H Yang
- School of Mathematics and Statistics, The University of Sydney, Camperdown NSW 2006, Australia
- John T Ormerod
- School of Mathematics and Statistics, The University of Sydney, Camperdown NSW 2006, Australia
28
Pratt BM, Won H. Advances in profiling chromatin architecture shed light on the regulatory dynamics underlying brain disorders. Semin Cell Dev Biol 2022; 121:153-160. [PMID: 34483043 PMCID: PMC8761161 DOI: 10.1016/j.semcdb.2021.08.013] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 01/29/2021] [Revised: 08/18/2021] [Accepted: 08/23/2021] [Indexed: 01/03/2023]
Abstract
Understanding the exquisitely complex nature of the three-dimensional organization of the genome and how it affects gene regulation remains a central question in biology. Recent advances in sequencing- and imaging-based approaches in decoding the three-dimensional chromatin landscape have enabled a systematic characterization of gene regulatory architecture. In this review, we outline how chromatin architecture provides a reference atlas to predict the functional consequences of non-coding variants associated with human traits and disease. High-throughput perturbation assays such as massively parallel reporter assays (MPRA) and CRISPR-based genome engineering in combination with a reference atlas opened an avenue for going beyond observational studies to experimentally validating the regulatory principles of the genome. We conclude by providing a suggested path forward by calling attention to barriers that can be addressed for a more complete understanding of the regulatory landscape of the human brain.
Affiliation(s)
- Brandon M Pratt
- Department of Pharmacology, University of North Carolina, Chapel Hill, NC 27599, USA
- Hyejung Won
- Department of Genetics, University of North Carolina, Chapel Hill, NC 27599, USA; UNC Neuroscience Center, University of North Carolina, Chapel Hill, NC 27599, USA.
29
Montesinos-López OA, Montesinos-López A, Hernandez-Suarez CM, Barrón-López JA, Crossa J. Deep-learning power and perspectives for genomic selection. The Plant Genome 2021; 14:e20122. [PMID: 34309215 DOI: 10.1002/tpg2.20122] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 04/11/2021] [Accepted: 05/24/2021] [Indexed: 06/13/2023]
Abstract
Deep learning (DL) is revolutionizing the development of artificial intelligence systems. For example, before 2015, humans were better than artificial machines at classifying images and solving many problems of computer vision (related to object localization and detection using images), but nowadays, artificial machines have surpassed the ability of humans in this specific task. This is just one example of how the application of these models has surpassed human abilities and the performance of other machine-learning algorithms. For this reason, DL models have been adopted for genomic selection (GS). In this article we provide insight about the power of DL in solving complex prediction tasks and how combining GS and DL models can accelerate the revolution provoked by GS methodology in plant breeding. Furthermore, we will mention some trends of DL methods, emphasizing some areas of opportunity to really exploit the DL methodology in GS; however, we are aware that considerable research is required to be able not only to use the existing DL in conjunction with GS, but to adapt and develop DL methods that take the peculiarities of breeding inputs and GS into consideration.
Affiliation(s)
- Abelardo Montesinos-López
- Departamento de Matemáticas, Centro Universitario de Ciencias Exactas e Ingenierías (CUCEI), Universidad de Guadalajara, Guadalajara, Jalisco, 44430, México
- José Alberto Barrón-López
- Department of Animal Production (DPA), Universidad Nacional Agraria La Molina, Av. La Molina s/n La Molina, Lima, 15024, Perú
- José Crossa
- Colegio de Postgraduados, Montecillos, Edo, de México, 56230, México
- Biometrics and Statistics Unit, Genetic Resources Program, International Maize and Wheat Improvement Center (CIMMYT), Km 45, Carretera Mexico-Veracruz, Edo. De, Mexico DF, 52640, Mexico
30
Liu N, Low WY, Alinejad-Rokny H, Pederson S, Sadlon T, Barry S, Breen J. Seeing the forest through the trees: prioritising potentially functional interactions from Hi-C. Epigenetics Chromatin 2021; 14:41. [PMID: 34454581 PMCID: PMC8399707 DOI: 10.1186/s13072-021-00417-4] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/29/2021] [Accepted: 08/19/2021] [Indexed: 11/30/2022] Open
Abstract
Eukaryotic genomes are highly organised within the nucleus of a cell, allowing widely dispersed regulatory elements such as enhancers to interact with gene promoters through physical contacts in three-dimensional space. Recent chromosome conformation capture methodologies such as Hi-C have enabled the analysis of interacting regions of the genome providing a valuable insight into the three-dimensional organisation of the chromatin in the nucleus, including chromosome compartmentalisation and gene expression. Complicating the analysis of Hi-C data, however, is the massive amount of identified interactions, many of which do not directly drive gene function, thus hindering the identification of potentially biologically functional 3D interactions. In this review, we collate and examine the downstream analysis of Hi-C data with particular focus on methods that prioritise potentially functional interactions. We classify three groups of approaches: structural-based discovery methods, e.g. A/B compartments and topologically associated domains, detection of statistically significant chromatin interactions, and the use of epigenomic data integration to narrow down useful interaction information. Careful use of these three approaches is crucial to successfully identifying potentially functional interactions within the genome.
Affiliation(s)
- Ning Liu
- Computational & Systems Biology, Precision Medicine Theme, South Australian Health & Medical Research Institute, SA, 5000, Adelaide, Australia
- Robinson Research Institute, University of Adelaide, SA, 5005, Adelaide, Australia
- Adelaide Medical School, University of Adelaide, SA, 5005, Adelaide, Australia
- Wai Yee Low
- The Davies Research Centre, School of Animal and Veterinary Sciences, University of Adelaide, Roseworthy, SA, 5371, Australia
- Hamid Alinejad-Rokny
- BioMedical Machine Learning Lab, The Graduate School of Biomedical Engineering, The University of New South Wales, NSW, 2052, Sydney, Australia
- Core Member of UNSW Data Science Hub, The University of New South Wales, 2052, Sydney, Australia
- Stephen Pederson
- Adelaide Medical School, University of Adelaide, SA, 5005, Adelaide, Australia
- Dame Roma Mitchell Cancer Research Laboratories (DRMCRL), Adelaide Medical School, University of Adelaide, SA, 5005, Adelaide, Australia
- Timothy Sadlon
- Robinson Research Institute, University of Adelaide, SA, 5005, Adelaide, Australia
- Women's & Children's Health Network, SA, 5006, North Adelaide, Australia
- Simon Barry
- Robinson Research Institute, University of Adelaide, SA, 5005, Adelaide, Australia
- Core Member of UNSW Data Science Hub, The University of New South Wales, 2052, Sydney, Australia
- Women's & Children's Health Network, SA, 5006, North Adelaide, Australia
- James Breen
- Computational & Systems Biology, Precision Medicine Theme, South Australian Health & Medical Research Institute, SA, 5000, Adelaide, Australia.
- Robinson Research Institute, University of Adelaide, SA, 5005, Adelaide, Australia.
- Adelaide Medical School, University of Adelaide, SA, 5005, Adelaide, Australia.
- South Australian Genomics Centre (SAGC), South Australian Health & Medical Research Institute (SAHMRI), SA, 5000, Adelaide, Australia.
31
Hu Y, Ma W. EnHiC: learning fine-resolution Hi-C contact maps using a generative adversarial framework. Bioinformatics 2021; 37:i272-i279. [PMID: 34252966 PMCID: PMC8382278 DOI: 10.1093/bioinformatics/btab272] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/14/2022] Open
Abstract
Motivation: The high-throughput chromosome conformation capture (Hi-C) technique has enabled genome-wide mapping of chromatin interactions. However, high-resolution Hi-C data requires costly, deep sequencing; therefore, it has only been achieved for a limited number of cell types. Machine learning models based on neural networks have been developed as a remedy to this problem. Results: In this work, we propose a novel method, EnHiC, for predicting high-resolution Hi-C matrices from low-resolution input data based on a generative adversarial network (GAN) framework. Inspired by non-negative matrix factorization, our model fully exploits the unique properties of Hi-C matrices and extracts rank-1 features from multi-scale low-resolution matrices to enhance the resolution. Using three human Hi-C datasets, we demonstrated that EnHiC accurately and reliably enhanced the resolution of Hi-C matrices and outperformed other GAN-based models. Moreover, EnHiC-predicted high-resolution matrices facilitated the accurate detection of topologically associated domains and fine-scale chromatin interactions. Availability and implementation: EnHiC is publicly available at https://github.com/wmalab/EnHiC. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yangyang Hu
- Department of Computer Science and Engineering
- Wenxiu Ma
- Department of Statistics, University of California Riverside, Riverside, CA 92521, USA
32
VEHiCLE: a Variationally Encoded Hi-C Loss Enhancement algorithm for improving and generating Hi-C data. Sci Rep 2021; 11:8880. [PMID: 33893353 PMCID: PMC8065109 DOI: 10.1038/s41598-021-88115-9] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 12/14/2020] [Accepted: 03/10/2021] [Indexed: 11/23/2022] Open
Abstract
Chromatin conformation plays an important role in a variety of genomic processes. Hi-C is one of the most popular assays for inspecting chromatin conformation. However, the utility of Hi-C contact maps is bottlenecked by resolution. Here we present VEHiCLE, a deep learning algorithm for resolution enhancement of Hi-C contact data. VEHiCLE utilises a variational autoencoder and adversarial training strategy equipped with four loss functions (adversarial loss, variational loss, chromosome topology-inspired insulation loss, and mean square error loss) to enhance contact maps, making them more viable for downstream analysis. VEHiCLE expands previous efforts at Hi-C super resolution by providing novel insight into the biologically meaningful and human interpretable feature extraction. Using a deep variational autoencoder, VEHiCLE provides a user tunable, full generative model for generating synthetic Hi-C data while also providing state-of-the-art results in enhancement of Hi-C data across multiple metrics.
33
Gong H, Yang Y, Zhang S, Li M, Zhang X. Application of Hi-C and other omics data analysis in human cancer and cell differentiation research. Comput Struct Biotechnol J 2021; 19:2070-2083. [PMID: 33995903 PMCID: PMC8086027 DOI: 10.1016/j.csbj.2021.04.016] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 12/10/2020] [Revised: 04/04/2021] [Accepted: 04/04/2021] [Indexed: 02/07/2023] Open
Abstract
With the development of 3C (chromosome conformation capture) and its derivative technology Hi-C (High-throughput chromosome conformation capture) research, the study of the spatial structure of the genomic sequence in the nucleus helps researchers understand the functions of biological processes such as gene transcription, replication, repair, and regulation. In this paper, we first introduce the research background and purpose of Hi-C data visualization analysis. After that, we discuss the Hi-C data analysis methods from genome 3D structure, A/B compartment, TADs (topologically associated domain), and loop detection. We also discuss how to apply genome visualization technologies to the identification of chromosome feature structures. We continue with a review of correlation analysis differences among multi-omics data, and how to apply Hi-C and other omics data analysis into cancer and cell differentiation research. Finally, we summarize the various problems in joint analyses based on Hi-C and other multi-omics data. We believe this review can help researchers better understand the progress and applications of 3D genome technology.
Affiliation(s)
- Haiyan Gong
- Department of Computer Science and Technology, University of Science and Technology Beijing, Beijing 100083, China
- Beijing Advanced Innovation Center for Materials Genome Engineering, University of Science and Technology Beijing, Beijing 100083, China
- Beijing Key Laboratory of Knowledge Engineering for Materials Science, Beijing 100083, China
- Shunde Graduate School of University of Science and Technology Beijing, Foshan 528000, China
- Yi Yang
- Department of Computer Science and Technology, University of Science and Technology Beijing, Beijing 100083, China
- Sichen Zhang
- Department of Computer Science and Technology, University of Science and Technology Beijing, Beijing 100083, China
- Minghong Li
- Department of Computer Science and Technology, University of Science and Technology Beijing, Beijing 100083, China
- Xiaotong Zhang
- Department of Computer Science and Technology, University of Science and Technology Beijing, Beijing 100083, China
- Beijing Advanced Innovation Center for Materials Genome Engineering, University of Science and Technology Beijing, Beijing 100083, China
- Beijing Key Laboratory of Knowledge Engineering for Materials Science, Beijing 100083, China
- Shunde Graduate School of University of Science and Technology Beijing, Foshan 528000, China
34
Tao H, Li H, Xu K, Hong H, Jiang S, Du G, Wang J, Sun Y, Huang X, Ding Y, Li F, Zheng X, Chen H, Bo X. Computational methods for the prediction of chromatin interaction and organization using sequence and epigenomic profiles. Brief Bioinform 2021; 22:6102668. [PMID: 33454752 PMCID: PMC8424394 DOI: 10.1093/bib/bbaa405] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 08/21/2020] [Revised: 11/26/2020] [Accepted: 12/10/2020] [Indexed: 12/14/2022] Open
Abstract
The exploration of three-dimensional chromatin interaction and organization provides insight into mechanisms underlying gene regulation, cell differentiation and disease development. Advances in chromosome conformation capture technologies, such as high-throughput chromosome conformation capture (Hi-C) and chromatin interaction analysis by paired-end tag (ChIA-PET), have enabled the exploration of chromatin interaction and organization. However, high-resolution Hi-C and ChIA-PET data are only available for a limited number of cell lines, and their acquisition is costly, time consuming, laborious and affected by theoretical limitations. Increasing evidence shows that DNA sequence and epigenomic features are informative predictors of regulatory interaction and chromatin architecture. Based on these features, numerous computational methods have been developed for the prediction of chromatin interaction and organization, but they are not yet extensively applied in biomedical studies. A systematic study to summarize and evaluate such methods is still needed to facilitate their application. Here, we summarize 48 computational methods for the prediction of chromatin interaction and organization using sequence and epigenomic profiles, categorize them and compare their performance. In addition, we provide comprehensive guidelines for selecting suitable methods to predict chromatin interaction and organization based on the available data and the biological question of interest.
Affiliation(s)
- Huan Tao
- Beijing Institute of Radiation Medicine
- Hao Li
- Beijing Institute of Radiation Medicine
- Kang Xu
- Beijing Institute of Radiation Medicine
- Hao Hong
- Beijing Institute of Radiation Medicine, Department of Biotechnology
- Shuai Jiang
- Beijing Institute of Radiation Medicine, Department of Biotechnology
- Guifang Du
- Beijing Institute of Radiation Medicine, Department of Biotechnology
- Yu Sun
- Beijing Institute of Radiation Medicine, Department of Biotechnology
- Xin Huang
- Beijing Institute of Radiation Medicine, Department of Biotechnology
- Yang Ding
- Beijing Institute of Radiation Medicine
- Fei Li
- Chinese Academy of Sciences, Department of Computer Network Information Center
35
Vijay Kumar J, Harshavardhan A, Bhukya H, Krishna Prasad AV. Advanced Machine Learning-Based Analytics on COVID-19 Data Using Generative Adversarial Networks. Materials Today: Proceedings 2020:S2214-7853(20)37620-3. [PMID: 33078094 PMCID: PMC7556782 DOI: 10.1016/j.matpr.2020.10.053] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2020] [Accepted: 10/03/2020] [Indexed: 11/01/2022]
Abstract
Medical diagnosis and predictive analytics form a key research domain in which diseases of many types can be predicted. At present there is widespread alarm over the impact and rapid mutation of the COVID-19 virus. The world is heavily affected by this virus, and no vaccine has been developed so far. India has more than 10,000 patients, with more than 300 deceased. Globally, there are around 20 lakh (2 million) coronavirus patients. The Generative Adversarial Network (GAN) is a contemporary high-performance approach in which advanced neural networks are used for in-depth analytics of images and multimedia data. In this research work, the analytics of key points from medical images of the COVID-19 dataset is presented, from which diagnosis and predictions can be made for patients. GANs are used for the generation, transformation, and presentation of the dataset and key points, using advanced deep learning models that can analyze patterns in medical images including X-ray, CT scan, and many others. With the integration of GANs, the overall predictive analytics can achieve higher performance than classical multi-layer neural networks. In this manuscript, the work is applied to benchmark datasets with advanced scripting so that predictive mining and knowledge discovery can be done effectively and with greater accuracy.
Affiliation(s)
- Hanumanthu Bhukya
- Department of CSE, Kakatiya Institute of Technology & Science, Warangal, Telangana, India
- A V Krishna Prasad
- Department of Computer Science and Engineering, MVSR Engineering College, Hyderabad, India
36
Application of deep learning in genomics. Science China Life Sciences 2020; 63:1860-1878. [PMID: 33051704 DOI: 10.1007/s11427-020-1804-5] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 06/13/2020] [Accepted: 08/15/2020] [Indexed: 12/19/2022]
Abstract
In recent years, deep learning has been widely used in diverse fields of research, such as speech recognition, image classification, autonomous driving and natural language processing. Deep learning has showcased dramatically improved performance in complex classification and regression problems, where the intricate structure in the high-dimensional data is difficult to discover using conventional machine learning algorithms. In biology, applications of deep learning are gaining increasing popularity in predicting the structure and function of genomic elements, such as promoters, enhancers, or gene expression levels. In this review paper, we described the basic concepts in machine learning and artificial neural network, followed by elaboration on the workflow of using convolutional neural network in genomics. Then we provided a concise introduction of deep learning applications in genomics and synthetic biology at the levels of DNA, RNA and protein. Finally, we discussed the current challenges and future perspectives of deep learning in genomics.
37
Jiang S, Li H, Hong H, Du G, Huang X, Sun Y, Wang J, Tao H, Xu K, Li C, Chen Y, Chen H, Bo X. Spatial density of open chromatin: an effective metric for the functional characterization of topologically associated domains. Brief Bioinform 2020; 22:5912562. [PMID: 32987404 PMCID: PMC8138881 DOI: 10.1093/bib/bbaa210] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/27/2020] [Revised: 08/10/2020] [Accepted: 08/12/2020] [Indexed: 11/13/2022] Open
Abstract
Topologically associated domains (TADs) are spatial and functional units of metazoan chromatin structure. Interpretation of the interplay between regulatory factors and chromatin structure within TADs is crucial to understand the spatial and temporal regulation of gene expression. However, a computational metric for the sensitive characterization of TAD regulatory landscape is lacking. Here, we present the spatial density of open chromatin (SDOC) metric as a quantitative measurement of intra-TAD chromatin state and structure. SDOC sensitively reflects epigenetic properties and gene transcriptional activity in TADs. During mouse T-cell development, we found that TADs with decreased SDOC are enriched in repressed developmental genes, and the joint effect of SDOC-decreasing and TAD clustering corresponds to the highest level of gene repression. In addition, we revealed a pervasive preference for TADs with similar SDOC to interact with each other, which may reflect the principle of chromatin organization.
Affiliation(s)
- Shuai Jiang
- Beijing Institute of Radiation Medicine, Department of Biotechnology
- Hao Li
- Beijing Institute of Radiation Medicine
- Hao Hong
- Beijing Institute of Radiation Medicine, Department of Biotechnology
- Guifang Du
- Beijing Institute of Radiation Medicine, Department of Biotechnology
- Xin Huang
- Beijing Institute of Radiation Medicine, Department of Biotechnology
- Yu Sun
- Beijing Institute of Radiation Medicine, Department of Biotechnology
- Huan Tao
- Beijing Institute of Radiation Medicine
- Kang Xu
- Beijing Institute of Radiation Medicine
- Cheng Li
- Peking-Tsinghua Center for Life Sciences, School of Life Sciences, Peking University. He is also a Professor at the Center for Statistical Science and the Center for Bioinformatics in Peking University
- Yang Chen
- MOE Key Laboratory of Bioinformatics, Bioinformatics Division and Center for Synthetic & Systems Biology, TNLIST, School of Medicine and Department of Automation in Tsinghua University
38
Lan L, You L, Zhang Z, Fan Z, Zhao W, Zeng N, Chen Y, Zhou X. Generative Adversarial Networks and Its Applications in Biomedical Informatics. Front Public Health 2020; 8:164. [PMID: 32478029 PMCID: PMC7235323 DOI: 10.3389/fpubh.2020.00164] [Citation(s) in RCA: 60] [Impact Index Per Article: 15.0] [Received: 12/30/2019] [Accepted: 04/17/2020] [Indexed: 02/05/2023] Open
Abstract
The basic Generative Adversarial Networks (GAN) model is composed of the input vector, generator, and discriminator. Among them, the generator and discriminator are implicit function expressions, usually implemented by deep neural networks. GAN can learn the generative model of any data distribution through adversarial methods with excellent performance. It has been widely applied to different areas since it was proposed in 2014. In this review, we introduced the origin, specific working principle, and development history of GAN, various applications of GAN in digital image processing, Cycle-GAN, and its application in medical imaging analysis, as well as the latest applications of GAN in medical informatics and bioinformatics.
Affiliation(s)
- Lan Lan
- West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, China
- Lei You
- Center for Computational Systems Medicine, School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, United States
- Zeyang Zhang
- Department of Computer Science and Technology, College of Electronics and Information Engineering, Tongji University, Shanghai, China
- Zhiwei Fan
- Department of Epidemiology and Health Statistics, West China School of Public Health and West China Fourth Hospital, Sichuan University, Chengdu, China
- Weiling Zhao
- Center for Computational Systems Medicine, School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, United States
- Nianyin Zeng
- Department of Instrumental and Electrical Engineering, Xiamen University, Fujian, China
- Yidong Chen
- Department of Computer Science and Technology, College of Computer Science, Sichuan University, Chengdu, China
- Xiaobo Zhou
- Center for Computational Systems Medicine, School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, United States