1
Wall BPG, Nguyen M, Harrell JC, Dozmorov MG. Machine and Deep Learning Methods for Predicting 3D Genome Organization. Methods Mol Biol 2025; 2856:357-400. [PMID: 39283464 DOI: 10.1007/978-1-0716-4136-1_22] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/25/2024]
Abstract
Three-dimensional (3D) chromatin interactions, such as enhancer-promoter interactions (EPIs), loops, topologically associating domains (TADs), and A/B compartments, play critical roles in a wide range of cellular processes by regulating gene expression. Recent development of chromatin conformation capture technologies has enabled genome-wide profiling of various 3D structures, even in single cells. However, current catalogs of 3D structures remain incomplete and unreliable because of differences in technology and tools and because of low data resolution. Machine learning methods have emerged as an alternative way to obtain missing 3D interactions and/or improve resolution. Such methods frequently use genome annotation data (ChIP-seq, DNase-seq, etc.), DNA sequence information (k-mers and transcription factor binding site (TFBS) motifs), and other genomic properties to learn the associations between genomic features and chromatin interactions. In this review, we discuss computational tools for predicting three types of 3D interactions (EPIs, chromatin interactions, and TAD boundaries) and analyze their pros and cons. We also point out obstacles to the computational prediction of 3D interactions and suggest future research directions.
Affiliation(s)
- Brydon P G Wall
  - Center for Biological Data Science, Virginia Commonwealth University, Richmond, VA, USA
- My Nguyen
  - Department of Biostatistics, Virginia Commonwealth University, Richmond, VA, USA
- J Chuck Harrell
  - Department of Pathology, Virginia Commonwealth University, Richmond, VA, USA
  - Massey Comprehensive Cancer Center, Virginia Commonwealth University, Richmond, VA, USA
  - Center for Pharmaceutical Engineering, Virginia Commonwealth University, Richmond, VA, USA
- Mikhail G Dozmorov
  - Department of Biostatistics, Virginia Commonwealth University, Richmond, VA, USA
  - Department of Pathology, Virginia Commonwealth University, Richmond, VA, USA
2
Gnocis: An integrated system for interactive and reproducible analysis and modelling of cis-regulatory elements in Python 3. PLoS One 2022; 17:e0274338. [PMID: 36084008 PMCID: PMC9462789 DOI: 10.1371/journal.pone.0274338] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 08/07/2021] [Accepted: 08/25/2022] [Indexed: 11/23/2022] [Open Access]
Abstract
Gene expression is regulated through cis-regulatory elements (CREs), among which are promoters, enhancers, Polycomb/Trithorax Response Elements (PREs), silencers and insulators. Computational prediction of CREs can be achieved using a variety of statistical and machine learning methods combined with different feature space formulations. Although Python packages for DNA sequence feature sets and for machine learning are available, no existing package facilitates the combination of DNA sequence feature sets with machine learning methods for the genome-wide prediction of candidate CREs. We here present Gnocis, a Python package that streamlines the analysis and the modelling of CRE sequences by providing extensible APIs and implementing the glue required for combining feature sets and models for genome-wide prediction. Gnocis implements a variety of base feature sets, including motif pair occurrence frequencies and the k-spectrum mismatch kernel. It integrates with Scikit-learn and TensorFlow for state-of-the-art machine learning. Gnocis additionally implements a broad suite of tools for the handling and preparation of sequence, region and curve data, which can be useful for general DNA bioinformatics in Python. We also present Deep-MOCCA, a neural network architecture inspired by SVM-MOCCA that achieves moderate to high generalization without prior motif knowledge. To demonstrate the use of Gnocis, we applied multiple machine learning methods to the modelling of D. melanogaster PREs, including a Convolutional Neural Network (CNN), making this the first study to model PREs with CNNs. The models are readily adapted to new CRE modelling problems and to other organisms. In order to produce a high-performance, compiled package for Python 3, we implemented Gnocis in Cython. Gnocis can be installed from PyPI by running ‘pip install gnocis’. The source code is available on GitHub, at https://github.com/bjornbredesen/gnocis.
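To make the notion of a k-spectrum feature set concrete, here is a minimal sketch in plain Python. It does not use the Gnocis API; the function name `k_spectrum` and the example sequence are hypothetical, and it computes the plain k-spectrum (exact k-mer frequencies) rather than the mismatch variant named in the abstract.

```python
from collections import Counter
from itertools import product


def k_spectrum(sequence, k=2, alphabet="ACGT"):
    """Return k-mer frequencies of `sequence` in a fixed lexicographic order."""
    sequence = sequence.upper()
    # Count every overlapping window of length k.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    total = sum(counts[kmer] for kmer in kmers)
    # Normalise to frequencies so sequences of different lengths are comparable.
    return [counts[kmer] / total if total else 0.0 for kmer in kmers]


# Hypothetical example sequence; yields a vector of length 4**k = 16.
features = k_spectrum("ACGTACGT", k=2)
```

A vector like `features` is the kind of fixed-length input that a general-purpose classifier (for example, one from Scikit-learn) can consume, which is the combination of feature set and model that the abstract describes.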
3
Kurbidaeva A, Purugganan M. Insulators in Plants: Progress and Open Questions. Genes (Basel) 2021; 12:genes12091422. [PMID: 34573404 PMCID: PMC8470105 DOI: 10.3390/genes12091422] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 08/06/2021] [Revised: 09/08/2021] [Accepted: 09/14/2021] [Indexed: 11/16/2022] [Open Access]
Abstract
The genomes of higher eukaryotes are partitioned into topologically associated domains or TADs, and insulators (also known as boundary elements) are the key elements responsible for their formation and maintenance. Insulators were first identified and extensively studied in Drosophila as well as mammalian genomes, and have also been described in yeast and plants. In addition, many insulator proteins are known in Drosophila, and some have been investigated in mammals. However, much less is known about this important class of non-coding DNA elements in plant genomes. In this review, we take a detailed look at known plant insulators across different species and provide an overview of potential determinants of plant insulator functions, including cis-elements and boundary proteins. We also discuss methods previously used in attempts to identify plant insulators, provide a perspective on their importance for research and biotechnology, and discuss areas of potential future research.
4
Bredesen BA, Rehmsmeier M. MOCCA: a flexible suite for modelling DNA sequence motif occurrence combinatorics. BMC Bioinformatics 2021; 22:234. [PMID: 33962556 PMCID: PMC8105988 DOI: 10.1186/s12859-021-04143-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/01/2020] [Accepted: 04/21/2021] [Indexed: 11/10/2022] [Open Access]
Abstract
BACKGROUND: Cis-regulatory elements (CREs) are DNA sequence segments that regulate gene expression. Among CREs are promoters, enhancers, Boundary Elements (BEs) and Polycomb Response Elements (PREs), all of which are enriched in specific sequence motifs that form particular occurrence landscapes. We have recently introduced a hierarchical machine learning approach (SVM-MOCCA) in which Support Vector Machines (SVMs) are applied on the level of individual motif occurrences, modelling local sequence composition, and then combined for the prediction of whole regulatory elements. We used SVM-MOCCA to predict PREs in Drosophila and found that it was superior to other methods. However, we did not publish a polished implementation of SVM-MOCCA, which can be useful for other researchers, and we only tested SVM-MOCCA with IUPAC motifs and PREs.
RESULTS: We here present an expanded suite for modelling CRE sequences in terms of motif occurrence combinatorics: Motif Occurrence Combinatorics Classification Algorithms (MOCCA). MOCCA contains efficient implementations of several modelling methods, including SVM-MOCCA, and a new method, RF-MOCCA, a Random Forest derivative of SVM-MOCCA. We used SVM-MOCCA and RF-MOCCA to model Drosophila PREs and BEs in cross-validation experiments, making this the first study to model PREs with Random Forests and the first study that applies the hierarchical MOCCA approach to the prediction of BEs. Both models significantly improve generalization to PREs and boundary elements beyond that of previous methods (including 4-spectrum and motif occurrence frequency Support Vector Machines and Random Forests), with RF-MOCCA yielding the best results.
CONCLUSION: MOCCA is a flexible and powerful suite of tools for the motif-based modelling of CRE sequences in terms of motif composition. MOCCA can be applied to any new CRE modelling problems where motifs have been identified. MOCCA supports IUPAC and Position Weight Matrix (PWM) motifs. For ease of use, MOCCA implements generation of negative training data, and additionally a mode that requires only that the user specifies positives, motifs and a genome. MOCCA is licensed under the MIT license and is available on GitHub at https://github.com/bjornbredesen/MOCCA.
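As an illustration of what finding IUPAC motif occurrences involves, the following plain-Python sketch locates occurrences of a degenerate motif in a sequence. It is not MOCCA code: the function names, the example motifs, and the forward-strand-only search are assumptions made for brevity.

```python
import re

# Standard IUPAC nucleotide codes expressed as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def motif_occurrences(sequence, motif):
    """Return 0-based start positions of all (possibly overlapping) matches."""
    pattern = "".join(IUPAC[symbol] for symbol in motif.upper())
    # A lookahead keeps each match zero-width, so overlapping hits are reported.
    return [m.start() for m in re.finditer(f"(?=({pattern}))", sequence.upper())]


def occurrence_counts(sequence, motifs):
    """Count occurrences of each named motif (forward strand only)."""
    return {name: len(motif_occurrences(sequence, motif))
            for name, motif in motifs.items()}


# Hypothetical motifs, chosen only to exercise the code.
counts = occurrence_counts("GAGAGAG", {"exact": "GAGAG", "degenerate": "RAG"})
```

Per-motif counts of this kind correspond to the motif occurrence frequency features that the abstract lists among the baseline methods; the hierarchical MOCCA models go further by modelling the sequence around each individual occurrence.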
Affiliation(s)
- Bjørn André Bredesen
  - Computational Biology Unit, Department of Informatics, University of Bergen, P.O. Box 7803, 5020, Bergen, Norway
- Marc Rehmsmeier
  - Department of Biology, Humboldt-Universität zu Berlin, Unter den Linden 6, 10099, Berlin, Germany
5
Gerassis S, Abad A, Taboada J, Saavedra Á, Giráldez E. A comparative analysis of health surveillance strategies for administrative video display terminal employees. Biomed Eng Online 2019; 18:118. [PMID: 31829225 PMCID: PMC6907276 DOI: 10.1186/s12938-019-0737-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 07/07/2019] [Accepted: 11/29/2019] [Indexed: 11/11/2022] [Open Access]
Abstract
Background: The objective of this study was to develop a strategy to optimize medical health surveillance protocols for administrative employees using video display terminals (VDTs). A total of 2453 medical examinations were analysed for VDT users in various sectors. From these data, using Bayesian statistics we inferred which factors were most relevant to medical diagnosis of the main disorders affecting VDT users. This information was used to build an influence diagram to evaluate the time and monetary costs associated with each diagnostic test and define an optimal protocol strategy based on occupational risks.
Results: Musculoskeletal and ophthalmological diseases were identified as the most frequent disorders among VDT users. The Bayesian network inferred age, sleep quality, activity level, smoking and the consumption of alcohol as risk factors. The blood count was the most costly test (5.23 USD/employee) and the second most costly test in time terms (4 min/employee), yet is a diagnostic test that has little influence on the medical decision regarding an employee’s capacity to perform their job.
Conclusions: Current occupational health surveillance protocols for VDT users may lead to expenditure that is 54% greater than necessary. For many employees and employers, failure to perform a wide range of medical tests for occupational health surveillance purposes is subjectively perceived as a threat to health. Awareness needs to be raised of the appropriate role of different health areas, so as to optimize diagnostic efficiency on the basis of greater flexibility.
Affiliation(s)
- Saki Gerassis
  - Department of Natural Resources and Environmental Engineering, University of Vigo, Vigo, Spain
- Alberto Abad
  - Department of Natural Resources and Environmental Engineering, University of Vigo, Vigo, Spain
- Javier Taboada
  - Department of Natural Resources and Environmental Engineering, University of Vigo, Vigo, Spain
- Ángeles Saavedra
  - Department of Statistics and Operational Research, University of Vigo, Vigo, Spain
- Eduardo Giráldez
  - Department of Natural Resources and Environmental Engineering, University of Vigo, Vigo, Spain
6
Sauerwald N, Kingsford C. Quantifying the similarity of topological domains across normal and cancer human cell types. Bioinformatics 2019; 34:i475-i483. [PMID: 29949963 PMCID: PMC6022623 DOI: 10.1093/bioinformatics/bty265] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Indexed: 02/06/2023] [Open Access]
Abstract
Motivation: Three-dimensional chromosome structure has been increasingly shown to influence various levels of cellular and genomic functions. Through Hi-C data, which maps contact frequency on chromosomes, it has been found that structural elements termed topologically associating domains (TADs) are involved in many regulatory mechanisms. However, we have little understanding of the level of similarity or variability of chromosome structure across cell types and disease states. In this study, we present a method to quantify resemblance and identify structurally similar regions between any two sets of TADs.
Results: We present an analysis of 23 human Hi-C samples representing various tissue types in normal and cancer cell lines. We quantify global and chromosome-level structural similarity, and compare the relative similarity between cancer and non-cancer cells. We find that cancer cells show higher structural variability around commonly mutated pan-cancer genes than normal cells at these same locations.
Availability and implementation: Software for the methods and analysis can be found at https://github.com/Kingsford-Group/localtadsim
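One way to picture comparing two sets of TADs is to treat each set as a partition of genomic bins and measure how far apart the partitions are. The sketch below uses the variation of information, a standard distance between partitions; the abstract does not name the measure, so this choice, the function names, and the toy boundaries are all assumptions rather than a description of the localtadsim software.

```python
from collections import Counter
from math import log


def bin_labels(domain_starts, n_bins):
    """Label each of `n_bins` bins with the index of the domain containing it.

    `domain_starts` holds the bin indices at which a new domain begins.
    """
    starts = set(domain_starts)
    labels, domain = [], 0
    for b in range(n_bins):
        if b in starts and b != 0:
            domain += 1
        labels.append(domain)
    return labels


def variation_of_information(labels_a, labels_b):
    """VI = H(A|B) + H(B|A); 0 means the two partitions are identical."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    vi = 0.0
    for (a, b), c in joint.items():
        p_ab = c / n
        vi -= p_ab * (log(p_ab / (count_a[a] / n)) + log(p_ab / (count_b[b] / n)))
    return vi


# Toy example: two domains versus four domains over the same 8 bins.
coarse = bin_labels([0, 4], 8)
fine = bin_labels([0, 2, 4, 6], 8)
distance = variation_of_information(coarse, fine)
```

Applying such a distance to sub-intervals of a chromosome, rather than to the whole chromosome at once, is one plausible route to the "structurally similar regions" the abstract mentions, though the sketch does not attempt that search.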
Affiliation(s)
- Natalie Sauerwald
  - Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA
- Carl Kingsford
  - Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA
  - To whom correspondence should be addressed.
7
Sefer E, Kingsford C. Semi-nonparametric modeling of topological domain formation from epigenetic data. Algorithms Mol Biol 2019; 14:4. [PMID: 30867673 PMCID: PMC6399866 DOI: 10.1186/s13015-019-0142-y] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 08/03/2018] [Accepted: 02/26/2019] [Indexed: 01/01/2023] [Open Access]
Abstract
Background: Hi-C experiments capturing the 3D genome architecture have led to the discovery of topologically-associated domains (TADs) that form an important part of the 3D genome organization and appear to play a role in gene regulation and other functions. Several histone modifications have been independently associated with TAD formation, but their combinatorial effects on domain formation remain poorly understood at a global scale.
Results: We propose a convex semi-nonparametric approach called nTDP based on Bernstein polynomials to explore the joint effects of histone markers on TAD formation as well as predict TADs solely from the histone data. We find a small subset of modifications to be predictive of TADs across species. By inferring TADs using our trained model, we are able to predict TADs across different species and cell types, without the use of Hi-C data, suggesting their effect is conserved. This work provides the first comprehensive joint model of the effect of histone markers on domain formation.
Conclusions: Our approach, nTDP, can form the basis of a unified, explanatory model of the relationship between epigenetic marks and topological domain structures. It can be used to predict domain boundaries for cell types, species, and conditions for which no Hi-C data is available. The model may also be of use for improving Hi-C-based domain finders.
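For readers unfamiliar with Bernstein polynomials, the sketch below evaluates the Bernstein basis and a curve built from it. It only illustrates the building block named in the abstract; how nTDP maps histone signals onto such curves is not stated there, so the function names and the idea of a rescaled input in [0, 1] are assumptions. It requires Python 3.8 or later for `math.comb`.

```python
from math import comb


def bernstein_basis(n, x):
    """Evaluate the n + 1 Bernstein basis polynomials of degree n at x in [0, 1]."""
    return [comb(n, k) * x**k * (1 - x)**(n - k) for k in range(n + 1)]


def bernstein_curve(coefficients, x):
    """Weighted sum of basis polynomials.

    Because every basis polynomial is non-negative on [0, 1] and the basis sums
    to 1, non-negative coefficients give a non-negative, smoothly varying curve.
    """
    n = len(coefficients) - 1
    return sum(c * b for c, b in zip(coefficients, bernstein_basis(n, x)))


# Hypothetical coefficients: a curve that peaks in the middle of the range.
value = bernstein_curve([0.0, 1.0, 0.0], 0.5)
```

Since the curve is linear in its coefficients, fitting those coefficients under simple sign constraints is the kind of problem that can stay convex, which is consistent with the "convex semi-nonparametric" description in the abstract.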
8
Herman-Izycka J, Wlasnowolski M, Wilczynski B. Taking promoters out of enhancers in sequence based predictions of tissue-specific mammalian enhancers. BMC Med Genomics 2017; 10:34. [PMID: 28589862 PMCID: PMC5461523 DOI: 10.1186/s12920-017-0264-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Indexed: 12/24/2022] [Open Access]
Abstract
BACKGROUND: Many genetic diseases are caused by mutations in non-coding regions of the genome. These mutations are frequently found in enhancer sequences, where they disrupt the regulatory program of the cell. Enhancers are short regulatory sequences in the non-coding part of the genome that are essential for the proper regulation of transcription. While the experimental methods for identifying such sequences improve every year, our understanding of the rules behind enhancer activity has not progressed much in the last decade. This is especially true for tissue-specific enhancers, where predicting the specificity of enhancer activity remains a clear problem.
RESULTS: We present a random-forest-based machine learning approach that matches the performance of current state-of-the-art methods for enhancer prediction. Like other published methods, it frequently cross-predicts enhancers as active in different tissues, which makes it less useful for predicting tissue-specific activity. We show that this problem is related to a bias in enhancer-prediction models towards predicting gene promoters as active enhancers, and that a two-step classifier can lower cross-prediction between tissues.
CONCLUSIONS: We provide whole-genome predictions of human heart and brain enhancers obtained with the two-step classifier.
Affiliation(s)
- Julia Herman-Izycka
  - Institute of Informatics, University of Warsaw, Banacha 2, Warsaw, 02-097, Poland
- Michal Wlasnowolski
  - Institute of Informatics, University of Warsaw, Banacha 2, Warsaw, 02-097, Poland
- Bartek Wilczynski
  - Institute of Informatics, University of Warsaw, Banacha 2, Warsaw, 02-097, Poland