1
Liu L, Tan Z, Wei Y, Sun Q. A multi-perspective deep learning framework for enhancer characterization and identification. Comput Biol Chem 2025; 114:108284. [PMID: 39577030 DOI: 10.1016/j.compbiolchem.2024.108284] [Received: 09/02/2024] [Revised: 11/02/2024] [Accepted: 11/13/2024] [Indexed: 11/24/2024]
Abstract
Enhancers are vital elements in the genome that boost the transcriptional activity of neighboring genes and are essential in regulating cell-specific gene expression. Therefore, accurately identifying and characterizing enhancers is essential for comprehending gene regulatory networks and the development of related diseases. This study introduces MPDL-Enhancer, a novel multi-perspective deep learning framework aimed at enhancer characterization and identification. In this study, enhancer sequences are encoded using the dna2vec model along with features derived from the structural properties of DNA sequences. Subsequently, these representations are processed through a novel dual-scale deep neural network designed to discern subtle correlations and extended interactions embedded within the semantic content of DNA. The predictive phase of our methodology employs a Support Vector Machine classifier to render the final classification. To rigorously assess the efficacy of our approach, a comprehensive evaluation was executed utilizing an independent test dataset, thereby substantiating the robustness and accuracy of our model. Our methodology demonstrated superior performance over existing computational techniques, with an accuracy (ACC) of 81.00 %, a sensitivity (SN) of 79.00 %, and specificity (SP) of 83.00 %. The innovative dual-scale deep neural network and the unique feature representation strategy contributed to this performance improvement. MPDL-Enhancer has effectively characterized enhancer sequences and achieved excellent predictive performance. Building upon this foundation, we conducted an interpretability analysis of the model, which can assist researchers in identifying key features and patterns that affect the functionality of enhancers, thereby promoting a deeper understanding of gene regulatory networks.
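The three reported metrics follow the usual confusion-matrix definitions. The sketch below shows how ACC, SN, and SP relate to one another. The counts are hypothetical (a balanced test set of 200 enhancers and 200 non-enhancers is assumed purely for illustration, and is not stated in the abstract), and the function name is ours.

```python
def classification_metrics(tp: int, fn: int, tn: int, fp: int) -> dict[str, float]:
    """Return accuracy, sensitivity, and specificity as percentages."""
    total = tp + fn + tn + fp
    return {
        "ACC": 100.0 * (tp + tn) / total,  # fraction of all calls that are correct
        "SN": 100.0 * tp / (tp + fn),      # fraction of true enhancers recovered
        "SP": 100.0 * tn / (tn + fp),      # fraction of non-enhancers rejected
    }


# Hypothetical counts (assumed, not from the abstract): 200 positives, 200 negatives.
metrics = classification_metrics(tp=158, fn=42, tn=166, fp=34)
print(metrics)
```

On a balanced test set, ACC is the mean of SN and SP, which is consistent with the reported 81.00 %, 79.00 %, and 83.00 %.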
Affiliation(s)
- Liwei Liu
  - College of Science, Dalian Jiaotong University, Dalian 116028, China
- Zhebin Tan
  - College of Software, Dalian Jiaotong University, Dalian 116028, China
- Yuxiao Wei
  - College of Software, Dalian Jiaotong University, Dalian 116028, China
- Qianhui Sun
  - College of Software, Dalian Jiaotong University, Dalian 116028, China
2
Wall BPG, Nguyen M, Harrell JC, Dozmorov MG. Machine and Deep Learning Methods for Predicting 3D Genome Organization. Methods Mol Biol 2025; 2856:357-400. [PMID: 39283464 DOI: 10.1007/978-1-0716-4136-1_22] [Indexed: 09/25/2024]
Abstract
Three-dimensional (3D) chromatin interactions, such as enhancer-promoter interactions (EPIs), loops, topologically associating domains (TADs), and A/B compartments, play critical roles in a wide range of cellular processes by regulating gene expression. Recent development of chromatin conformation capture technologies has enabled genome-wide profiling of various 3D structures, even with single cells. However, current catalogs of 3D structures remain incomplete and unreliable due to differences in technology, tools, and low data resolution. Machine learning methods have emerged as an alternative to obtain missing 3D interactions and/or improve resolution. Such methods frequently use genome annotation data (ChIP-seq, DNAse-seq, etc.), DNA sequencing information (k-mers and transcription factor binding site (TFBS) motifs), and other genomic properties to learn the associations between genomic features and chromatin interactions. In this review, we discuss computational tools for predicting three types of 3D interactions (EPIs, chromatin interactions, and TAD boundaries) and analyze their pros and cons. We also point out obstacles to the computational prediction of 3D interactions and suggest future research directions.
Affiliation(s)
- Brydon P G Wall
  - Center for Biological Data Science, Virginia Commonwealth University, Richmond, VA, USA
- My Nguyen
  - Department of Biostatistics, Virginia Commonwealth University, Richmond, VA, USA
- J Chuck Harrell
  - Department of Pathology, Virginia Commonwealth University, Richmond, VA, USA
  - Massey Comprehensive Cancer Center, Virginia Commonwealth University, Richmond, VA, USA
  - Center for Pharmaceutical Engineering, Virginia Commonwealth University, Richmond, VA, USA
- Mikhail G Dozmorov
  - Department of Biostatistics, Virginia Commonwealth University, Richmond, VA, USA
  - Department of Pathology, Virginia Commonwealth University, Richmond, VA, USA
3
Bréhélin L. Advancing Regulatory Genomics With Machine Learning. Bioinform Biol Insights 2024; 18:11779322241249562. [PMID: 39735654 PMCID: PMC11672376 DOI: 10.1177/11779322241249562] [Received: 08/28/2023] [Accepted: 04/09/2024] [Indexed: 12/31/2024] Open
Abstract
In recent years, several machine learning (ML) approaches have been proposed to predict gene expression signal and chromatin features from the DNA sequence alone. These models are often used to deduce and, to some extent, assess putative new biological insights about gene regulation, and they have led to very interesting advances in regulatory genomics. This article reviews a selection of these methods, ranging from linear models to random forests, kernel methods, and more advanced deep learning models. Specifically, we detail the different techniques and strategies that can be used to extract new gene-regulation hypotheses from these models. Furthermore, because these putative insights need to be validated with wet-lab experiments, we emphasize that it is important to have a measure of confidence associated with the extracted hypotheses. We review the procedures that have been proposed to measure this confidence for the different types of ML models, and we discuss the fact that they do not provide the same kind of information.
4
Wang Z, Yuan H, Yan J, Liu J. Identification, characterization, and design of plant genome sequences using deep learning. Plant J 2024. [PMID: 39666835 DOI: 10.1111/tpj.17190] [Received: 09/03/2024] [Revised: 11/11/2024] [Accepted: 11/23/2024] [Indexed: 12/14/2024]
Abstract
Due to its excellent performance in processing large amounts of data and capturing complex non-linear relationships, deep learning has been widely applied in many fields of plant biology. Here we first review the application of deep learning in analyzing genome sequences to predict gene expression, chromatin interactions, and epigenetic features (open chromatin, transcription factor binding sites, and methylation sites) in plants. Then, current motif mining and functional component design and synthesis based on generative adversarial networks, large models, and attention mechanisms are elaborated in detail. The progress of protein structure and function prediction, genomic prediction, and large model applications based on deep learning is also discussed. Finally, this work provides prospects for the future development of deep learning in plants with regard to multiple omics data, algorithm optimization, large language models, sequence design, and intelligent breeding.
Affiliation(s)
- Zhenye Wang
  - National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan, 430070, China
  - Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, 430070, China
  - College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China
- Hao Yuan
  - National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan, 430070, China
  - Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, 430070, China
  - College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China
- Jianbing Yan
  - National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan, 430070, China
  - Hubei Hongshan Laboratory, Wuhan, 430070, China
- Jianxiao Liu
  - National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan, 430070, China
  - Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, 430070, China
  - College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China
  - Hubei Hongshan Laboratory, Wuhan, 430070, China
5
Moeckel C, Mareboina M, Konnaris MA, Chan CS, Mouratidis I, Montgomery A, Chantzi N, Pavlopoulos GA, Georgakopoulos-Soares I. A survey of k-mer methods and applications in bioinformatics. Comput Struct Biotechnol J 2024; 23:2289-2303. [PMID: 38840832 PMCID: PMC11152613 DOI: 10.1016/j.csbj.2024.05.025] [Received: 03/13/2024] [Revised: 05/14/2024] [Accepted: 05/15/2024] [Indexed: 06/07/2024] Open
Abstract
The rapid progression of genomics and proteomics has been driven by the advent of advanced sequencing technologies, large, diverse, and readily available omics datasets, and the evolution of computational data processing capabilities. The vast amount of data generated by these advancements necessitates efficient algorithms to extract meaningful information. K-mers serve as a valuable tool when working with large sequencing datasets, offering several advantages in computational speed and memory efficiency and carrying the potential for intrinsic biological functionality. This review provides an overview of the methods, applications, and significance of k-mers in genomic and proteomic data analyses, as well as the utility of absent sequences, including nullomers and nullpeptides, in disease detection, vaccine development, therapeutics, and forensic science. Therefore, the review highlights the pivotal role of k-mers in addressing current genomic and proteomic problems and underscores their potential for future breakthroughs in research.
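As a minimal illustration of the two ideas named in this abstract, the sketch below counts k-mers in a sequence and lists the k-mers that never occur in it. The toy sequence and the function names are assumptions made for the example. Nullomers in the literature are defined against whole genomes, so applying the same logic to a single short string only demonstrates the mechanics.

```python
from collections import Counter
from itertools import product


def count_kmers(sequence: str, k: int) -> Counter:
    """Count every overlapping substring of length k."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def absent_kmers(sequence: str, k: int, alphabet: str = "ACGT") -> list[str]:
    """List all possible k-mers over the alphabet that do not occur in the sequence."""
    observed = count_kmers(sequence, k)
    candidates = ("".join(letters) for letters in product(alphabet, repeat=k))
    return [kmer for kmer in candidates if kmer not in observed]


toy = "ACGTACGTTAGC"  # hypothetical sequence, for illustration only
counts = count_kmers(toy, 3)
missing = absent_kmers(toy, 2)
print(counts.most_common(2))
print(missing)
```

Enumerating all 4^k candidates is only practical for small k; genome-scale tools rely on compact encodings and dedicated k-mer counters instead.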
Affiliation(s)
- Camille Moeckel
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Manvita Mareboina
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Maxwell A. Konnaris
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Candace S.Y. Chan
  - Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
- Ioannis Mouratidis
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
  - Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
- Austin Montgomery
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Nikol Chantzi
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Ilias Georgakopoulos-Soares
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
  - Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
6
Mouratidis I, Baltoumas FA, Chantzi N, Patsakis M, Chan CS, Montgomery A, Konnaris MA, Aplakidou E, Georgakopoulos GC, Das A, Chartoumpekis DV, Kovac J, Pavlopoulos GA, Georgakopoulos-Soares I. kmerDB: A database encompassing the set of genomic and proteomic sequence information for each species. Comput Struct Biotechnol J 2024; 23:1919-1928. [PMID: 38711760 PMCID: PMC11070822 DOI: 10.1016/j.csbj.2024.04.050] [Received: 12/12/2023] [Revised: 04/17/2024] [Accepted: 04/18/2024] [Indexed: 05/08/2024] Open
Abstract
The decrease in sequencing expenses has facilitated the creation of reference genomes and proteomes for an expanding array of organisms. Nevertheless, no established repository that details organism-specific genomic and proteomic sequences of specific lengths, referred to as kmers, exists to our knowledge. In this article, we present kmerDB, a database accessible through an interactive web interface that provides kmer-based information from genomic and proteomic sequences in a systematic way. kmerDB currently contains 202,340,859,107 base pairs and 19,304,903,356 amino acids, spanning 54,039 and 21,865 reference genomes and proteomes, respectively, as well as 6,905,362 and 149,305,183 genomic and proteomic species-specific sequences, termed quasi-primes. Additionally, we provide access to 5,186,757 nucleic and 214,904,089 peptide sequences absent from every genome and proteome, termed primes. kmerDB features a user-friendly interface offering various search options and filters for easy parsing and searching. The service is available at: www.kmerdb.com.
Affiliation(s)
- Ioannis Mouratidis
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
  - Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA
- Fotis A. Baltoumas
  - Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming", Vari, 16672, Greece
- Nikol Chantzi
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Michail Patsakis
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Candace S.Y. Chan
  - Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
- Austin Montgomery
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Maxwell A. Konnaris
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
  - Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA
  - Department of Statistics, The Pennsylvania State University, University Park, PA, USA
- Eleni Aplakidou
  - Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming", Vari, 16672, Greece
  - Department of Basic Sciences, School of Medicine, University of Crete, Heraklion, Greece
- George C. Georgakopoulos
  - National Technical University of Athens, School of Electrical and Computer Engineering, Athens, Greece
- Anshuman Das
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Dionysios V. Chartoumpekis
  - Service of Endocrinology, Diabetology and Metabolism, Lausanne University Hospital, Lausanne, Switzerland
- Jasna Kovac
  - Department of Food Science, The Pennsylvania State University, University Park, PA 16802, USA
- Georgios A. Pavlopoulos
  - Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming", Vari, 16672, Greece
  - Center for New Biotechnologies and Precision Medicine, School of Medicine, National and Kapodistrian University of Athens, Athens, 11527, Greece
- Ilias Georgakopoulos-Soares
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
7
Morgan D, DeMeo DL, Glass K. Using methylation data to improve transcription factor binding prediction. Epigenetics 2024; 19:2309826. [PMID: 38300850 PMCID: PMC10841018 DOI: 10.1080/15592294.2024.2309826] [Received: 08/07/2023] [Accepted: 01/01/2024] [Indexed: 02/03/2024] Open
Abstract
Modelling the regulatory mechanisms that determine cell fate, response to external perturbation, and disease state depends on measuring many factors, a task made more difficult by the plasticity of the epigenome. Scanning the genome for the sequence patterns defined by Position Weight Matrices (PWM) can be used to estimate transcription factor (TF) binding locations. However, this approach does not incorporate information regarding the epigenetic context necessary for TF binding. CpG methylation is an epigenetic mark influenced by environmental factors that is commonly assayed in human cohort studies. We developed a framework to score inferred TF binding locations using methylation data. We intersected motif locations identified using PWMs with methylation information captured in both whole-genome bisulfite sequencing and Illumina EPIC array data for six cell lines, scored motif locations based on these data, and compared with experimental data characterizing TF binding (ChIP-seq). We found that for most TFs, binding prediction improves using methylation-based scoring compared to standard PWM-scores. We also illustrate that our approach can be generalized to infer TF binding when methylation information is only proximally available, i.e. measured for nearby CpGs that do not directly overlap with a motif location. Overall, our approach provides a framework for inferring context-specific TF binding using methylation data. Importantly, the availability of DNA methylation data in existing patient populations provides an opportunity to use our approach to understand the impact of methylation on gene regulatory processes in the context of human disease.
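The baseline that this abstract compares against, scanning a sequence with a PWM, can be sketched as follows. The 4-bp probability matrix, the uniform background, and the toy sequence are all hypothetical, and only the forward strand is scanned. The methylation-based scoring developed by the authors is not reproduced here, since the abstract does not specify it.

```python
import math

# Hypothetical position probability matrix for a 4-bp motif with consensus ACGT.
PPM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BACKGROUND = 0.25  # assumed uniform base frequencies


def pwm_from_ppm(ppm, background=BACKGROUND):
    """Convert probabilities to log2-odds weights against the background."""
    return [{base: math.log2(p / background) for base, p in col.items()} for col in ppm]


def scan(sequence, pwm):
    """Score every forward-strand window; return (start, score) pairs."""
    width = len(pwm)
    return [
        (start, sum(pwm[j][sequence[start + j]] for j in range(width)))
        for start in range(len(sequence) - width + 1)
    ]


pwm = pwm_from_ppm(PPM)
hits = scan("TTACGTAA", pwm)
best_pos, best_score = max(hits, key=lambda hit: hit[1])
print(best_pos, round(best_score, 2))
```

A sequence-only score like this is identical in every cell type, which is the limitation that adding epigenetic context such as CpG methylation is meant to address.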
Affiliation(s)
- Daniel Morgan
  - Channing Division of Network Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA, USA
- Dawn L. DeMeo
  - Channing Division of Network Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA, USA
- Kimberly Glass
  - Channing Division of Network Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA, USA
  - Department of Biostatistics, Harvard Chan School of Public Health, Boston, MA, USA
8
Morino K, Miyake M, Nagasaki M, Kawaguchi T, Numa S, Mori Y, Yasukura S, Akada M, Nakao SY, Nakata A, Hashimoto H, Otokozawa R, Kamoi K, Takahashi H, Tabara Y, Matsuda F, Ohno-Matsui K, Tsujikawa A. Genome-wide Meta-analysis for Myopic Macular Neovascularization Identified a Novel Susceptibility Locus and Revealed a Shared Genetic Susceptibility with Age-Related Macular Degeneration. Ophthalmol Retina 2024:S2468-6530(24)00472-X. [PMID: 39489378 DOI: 10.1016/j.oret.2024.09.016] [Received: 04/16/2024] [Revised: 09/26/2024] [Accepted: 09/26/2024] [Indexed: 11/05/2024]
Abstract
PURPOSE: To identify the susceptibility loci for myopic macular neovascularization (mMNV) in patients with high myopia.
DESIGN: A genome-wide association study (GWAS) meta-analysis (meta-GWAS).
PARTICIPANTS: We included 2783 highly myopic individuals, including 608 patients with mMNV and 2175 control participants without mMNV.
METHODS: We performed a meta-analysis of 3 independent GWASs conducted according to the genotyping platform (Illumina Asian Screening Array [ASA] data set, Illumina Human610 BeadChip [610K] data set, and whole genome sequencing [WGS] data set), adjusted for age, sex, axial length, and the first to third principal components. We used DeltaSVM to evaluate the binding affinity of transcription factors (TFs) to DNA sequences around the susceptibility single nucleotide polymorphisms (SNPs). In addition, we evaluated the contribution of previously reported age-related macular degeneration (AMD) susceptibility loci.
MAIN OUTCOME MEASURES: The association between SNPs and mMNV in patients with high myopia.
RESULTS: The meta-GWAS identified rs56257842 at TEX29-LINC02337 as a novel susceptibility SNP for mMNV (odds ratio [OR_meta] = 0.62, P_meta = 4.63 × 10^-8, I^2 = 0.00), which was consistently associated with mMNV in all data sets (OR_ASA = 0.59, P_ASA = 1.71 × 10^-4; OR_610K = 0.63, P_610K = 5.53 × 10^-4; OR_WGS = 0.66, P_WGS = 4.38 × 10^-2). Transcription factor-wide analysis showed that the TFs ZNF740 and EGR1 lost their binding affinity to this locus when rs56257842 had the C allele (alternative allele), and the WNT signaling-related TF ZBTB33 gained binding affinity when rs56257842 had the C allele. When we examined the associations of AMD susceptibility loci, rs12720922 at CETP showed a statistically significant association with mMNV (OR_meta = 0.52, P_meta = 1.55 × 10^-5), whereas rs61871745 near ARMS2 showed a marginal association (OR_meta = 1.25, P_meta = 7.79 × 10^-3).
CONCLUSIONS: Our study identified a novel locus associated with mMNV in high myopia. Subsequent analyses offered important insights into the molecular biology of mMNV, providing potential therapeutic targets for mMNV. Furthermore, our findings imply shared genetic susceptibility between mMNV and AMD.
FINANCIAL DISCLOSURE(S): Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.
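The way three per-platform estimates combine into one pooled estimate can be sketched with a fixed-effect, inverse-variance meta-analysis. This is an assumption: the abstract does not state which meta-analysis model was used. Standard errors are also not reported, so the sketch back-calculates them from each odds ratio and two-sided P value under a normal approximation. The input numbers are the three per-data-set values quoted in the abstract; the function names are ours.

```python
import math
from statistics import NormalDist

# (data set, odds ratio, two-sided P value) as quoted in the abstract.
REPORTED = [("ASA", 0.59, 1.71e-4), ("610K", 0.63, 5.53e-4), ("WGS", 0.66, 4.38e-2)]


def standard_error(odds_ratio: float, p_value: float) -> float:
    """Back-calculate the SE of log(OR) from a two-sided P value (normal approximation)."""
    z = NormalDist().inv_cdf(1.0 - p_value / 2.0)
    return abs(math.log(odds_ratio)) / z


def fixed_effect_meta(studies) -> tuple[float, float]:
    """Inverse-variance pooling of log odds ratios; returns (pooled OR, two-sided P)."""
    weights, weighted_betas = [], []
    for _, odds_ratio, p_value in studies:
        weight = 1.0 / standard_error(odds_ratio, p_value) ** 2
        weights.append(weight)
        weighted_betas.append(weight * math.log(odds_ratio))
    beta = sum(weighted_betas) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    z = beta / pooled_se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return math.exp(beta), p_value


pooled_or, pooled_p = fixed_effect_meta(REPORTED)
print(f"pooled OR = {pooled_or:.2f}, P = {pooled_p:.2e}")
```

Because the pooled log odds ratio is a weighted average of the three inputs, the pooled OR must fall between 0.59 and 0.66, while the pooled P value can be much smaller than any single data set's P value.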
Affiliation(s)
- Kazuya Morino
  - Department of Ophthalmology, Kyoto University Graduate School of Medicine, Kyoto, Japan
  - Center for Genomic Medicine, Kyoto University Graduate School of Medicine, Kyoto, Japan
- Masahiro Miyake
  - Department of Ophthalmology, Kyoto University Graduate School of Medicine, Kyoto, Japan
  - Center for Genomic Medicine, Kyoto University Graduate School of Medicine, Kyoto, Japan
- Masao Nagasaki
  - Center for Genomic Medicine, Kyoto University Graduate School of Medicine, Kyoto, Japan
  - Division of Biomedical Information Analysis, Medical Research Center for High Depth Omics, Medical Institute of Bioregulation, Kyushu University, Fukuoka, Japan
- Takahisa Kawaguchi
  - Center for Genomic Medicine, Kyoto University Graduate School of Medicine, Kyoto, Japan
- Shogo Numa
  - Department of Ophthalmology, Kyoto University Graduate School of Medicine, Kyoto, Japan
  - Center for Genomic Medicine, Kyoto University Graduate School of Medicine, Kyoto, Japan
- Yuki Mori
  - Department of Ophthalmology, Kyoto University Graduate School of Medicine, Kyoto, Japan
  - Center for Genomic Medicine, Kyoto University Graduate School of Medicine, Kyoto, Japan
- Shota Yasukura
  - Department of Ophthalmology, Kyoto University Graduate School of Medicine, Kyoto, Japan
  - Center for Genomic Medicine, Kyoto University Graduate School of Medicine, Kyoto, Japan
- Masahiro Akada
  - Department of Ophthalmology, Kyoto University Graduate School of Medicine, Kyoto, Japan
  - Center for Genomic Medicine, Kyoto University Graduate School of Medicine, Kyoto, Japan
- Shin-Ya Nakao
  - Department of Ophthalmology, Kyoto University Graduate School of Medicine, Kyoto, Japan
  - Center for Genomic Medicine, Kyoto University Graduate School of Medicine, Kyoto, Japan
- Ai Nakata
  - Department of Ophthalmology, Kyoto University Graduate School of Medicine, Kyoto, Japan
  - Center for Genomic Medicine, Kyoto University Graduate School of Medicine, Kyoto, Japan
- Hiroki Hashimoto
  - Center for Genomic Medicine, Kyoto University Graduate School of Medicine, Kyoto, Japan
  - Division of Biomedical Information Analysis, Medical Research Center for High Depth Omics, Medical Institute of Bioregulation, Kyushu University, Fukuoka, Japan
- Ryoko Otokozawa
  - Division of Biomedical Information Analysis, Medical Research Center for High Depth Omics, Medical Institute of Bioregulation, Kyushu University, Fukuoka, Japan
- Koju Kamoi
  - Department of Ophthalmology and Visual Science, Tokyo Medical and Dental University, Tokyo, Japan
- Hiroyuki Takahashi
  - Department of Ophthalmology and Visual Science, Tokyo Medical and Dental University, Tokyo, Japan
- Yasuharu Tabara
  - Graduate School of Public Health, Shizuoka Graduate University of Public Health, Shizuoka, Japan
- Fumihiko Matsuda
  - Center for Genomic Medicine, Kyoto University Graduate School of Medicine, Kyoto, Japan
- Kyoko Ohno-Matsui
  - Department of Ophthalmology and Visual Science, Tokyo Medical and Dental University, Tokyo, Japan
- Akitaka Tsujikawa
  - Department of Ophthalmology, Kyoto University Graduate School of Medicine, Kyoto, Japan
9
Jyoti, Ritu, Gupta S, Shankar R. Comprehensive analysis of computational approaches in plant transcription factors binding regions discovery. Heliyon 2024; 10:e39140. [PMID: 39640721 PMCID: PMC11620080 DOI: 10.1016/j.heliyon.2024.e39140] [Received: 06/11/2024] [Revised: 08/23/2024] [Accepted: 10/08/2024] [Indexed: 12/07/2024] Open
Abstract
Transcription factors (TFs) are regulatory proteins that bind to specific DNA regions, known as transcription factor binding regions (TFBRs), to regulate the rate of transcription. The identification of TFBRs has been made possible by a number of experimental and computational techniques established during the past few years. TFBR identification involves peak identification in the binding data, followed by the identification of motif characteristics. Using the same binding data, attempts have been made to build computational models that identify such binding regions, which could save the time and resources spent on binding experiments. The behavior of these computational approaches depends heavily on what they learn and how they learn it. Existing approaches are heavily skewed toward human TFBR discovery, while plants have a drastically different genomic setup for regulation, which these approaches have largely ignored. Here, we provide a comprehensive study of the current state of plant-specific TFBR discovery algorithms. While doing so, we encountered several issues that rendered some software tools unusable to researchers. We fixed these issues and provide the corrected scripts for those tools. We expect this study to serve as a guide for better understanding the approaches of software tools for plant-specific TFBR discovery and the care to be taken while applying them, especially in cross-species applications. The corrected scripts are available at https://github.com/SCBB-LAB/Comparative-analysis-of-plant-TFBS-software.
Affiliation(s)
- Jyoti
  - Studio of Computational Biology & Bioinformatics, The Himalayan Centre for High-throughput Computational Biology, (HiCHiCoB, A BIC Supported by DBT, India), Biotechnology Division, CSIR-Institute of Himalayan Bioresource Technology (CSIR-IHBT), Palampur, (HP), 176061, India
  - Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, Uttar Pradesh, 201002, India
- Ritu
  - Studio of Computational Biology & Bioinformatics, The Himalayan Centre for High-throughput Computational Biology, (HiCHiCoB, A BIC Supported by DBT, India), Biotechnology Division, CSIR-Institute of Himalayan Bioresource Technology (CSIR-IHBT), Palampur, (HP), 176061, India
  - Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, Uttar Pradesh, 201002, India
- Sagar Gupta
  - Studio of Computational Biology & Bioinformatics, The Himalayan Centre for High-throughput Computational Biology, (HiCHiCoB, A BIC Supported by DBT, India), Biotechnology Division, CSIR-Institute of Himalayan Bioresource Technology (CSIR-IHBT), Palampur, (HP), 176061, India
  - Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, Uttar Pradesh, 201002, India
- Ravi Shankar
  - Studio of Computational Biology & Bioinformatics, The Himalayan Centre for High-throughput Computational Biology, (HiCHiCoB, A BIC Supported by DBT, India), Biotechnology Division, CSIR-Institute of Himalayan Bioresource Technology (CSIR-IHBT), Palampur, (HP), 176061, India
  - Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, Uttar Pradesh, 201002, India
10
Sajek MP, Bilodeau DY, Beer MA, Horton E, Miyamoto Y, Velle KB, Eckmann L, Fritz-Laylin L, Rissland OS, Mukherjee N. Evolutionary dynamics of polyadenylation signals and their recognition strategies in protists. Genome Res 2024; 34:1570-1581. [PMID: 39327029 PMCID: PMC11529991 DOI: 10.1101/gr.279526.124] [Received: 04/29/2024] [Accepted: 09/11/2024] [Indexed: 09/28/2024]
Abstract
The poly(A) signal, together with auxiliary elements, directs cleavage of a pre-mRNA and thus determines the 3' end of the mature transcript. In many species, including humans, the poly(A) signal is an AAUAAA hexamer, but we recently found that the deeply branching eukaryote Giardia lamblia uses a distinct hexamer (AGURAA) and lacks any known auxiliary elements. Our discovery prompted us to explore the evolutionary dynamics of poly(A) signals and auxiliary elements in the eukaryotic kingdom. We use direct RNA sequencing to determine poly(A) signals for four protists within the Metamonada clade (which also contains G. lamblia) and two outgroup protists. These experiments reveal that the AAUAAA hexamer serves as the poly(A) signal in at least four different eukaryotic clades, indicating that it is likely the ancestral signal, whereas the unusual Giardia version is derived. We find that the use and relative strengths of auxiliary elements are also plastic; in fact, within Metamonada, species like G. lamblia make use of a previously unrecognized auxiliary element where nucleotides flanking the poly(A) signal itself specify genuine cleavage sites. Thus, despite the fundamental nature of pre-mRNA cleavage for the expression of all protein-coding genes, the motifs controlling this process are dynamic on evolutionary timescales, providing motivation for future biochemical and structural studies as well as new therapeutic angles to target eukaryotic pathogens.
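The two hexamers named in this abstract can be located in a transcript with a simple pattern search, where the IUPAC code R in AGURAA stands for A or G. The sketch below is only an illustration: the toy RNA string and the function name are hypothetical, and real poly(A) site calling depends on read evidence and sequence context rather than on motif matching alone.

```python
import re

CANONICAL_SIGNAL = "AAUAAA"    # hexamer used in many species, including humans
GIARDIA_SIGNAL = "AGU[AG]AA"   # AGURAA, with IUPAC R expanded to A or G


def find_signals(rna: str, motif: str) -> list[int]:
    """Return 0-based start positions of all (possibly overlapping) motif matches."""
    sequence = rna.upper().replace("T", "U")  # accept DNA-style input as well
    return [match.start() for match in re.finditer(f"(?={motif})", sequence)]


toy_rna = "CCAGUAAACCAAUAAAGGAGUGAA"  # hypothetical sequence, for illustration only
print(find_signals(toy_rna, CANONICAL_SIGNAL))  # [10]
print(find_signals(toy_rna, GIARDIA_SIGNAL))    # [2, 18]
```

The lookahead wrapper `(?=...)` makes the search report overlapping occurrences, which matters for A-rich 3' ends where candidate hexamers can overlap.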
Affiliation(s)
- Marcin P Sajek
  - Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Aurora, Colorado 80045, USA
  - RNA Bioscience Initiative, University of Colorado School of Medicine, Aurora, Colorado 80045, USA
  - Institute of Human Genetics, Polish Academy of Sciences, 60-479 Poznan, Poland
- Danielle Y Bilodeau
  - Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Aurora, Colorado 80045, USA
  - RNA Bioscience Initiative, University of Colorado School of Medicine, Aurora, Colorado 80045, USA
- Michael A Beer
  - Department of Biomedical Engineering, Johns Hopkins University School of Medicine, Baltimore, Maryland 21205, USA
  - McKusick-Nathans Department of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, Maryland 21205, USA
- Emma Horton
  - Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Aurora, Colorado 80045, USA
  - RNA Bioscience Initiative, University of Colorado School of Medicine, Aurora, Colorado 80045, USA
- Yukiko Miyamoto
  - Department of Medicine, University of California San Diego, La Jolla, California 92093, USA
- Katrina B Velle
  - Department of Biology, University of Massachusetts, Amherst, Massachusetts 01003, USA
- Lars Eckmann
  - Department of Medicine, University of California San Diego, La Jolla, California 92093, USA
- Lillian Fritz-Laylin
  - Department of Biology, University of Massachusetts, Amherst, Massachusetts 01003, USA
- Olivia S Rissland
  - Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Aurora, Colorado 80045, USA
  - RNA Bioscience Initiative, University of Colorado School of Medicine, Aurora, Colorado 80045, USA
- Neelanjan Mukherjee
  - Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Aurora, Colorado 80045, USA
  - RNA Bioscience Initiative, University of Colorado School of Medicine, Aurora, Colorado 80045, USA
11
Koido M, Tomizuka K, Terao C. Fundamentals for predicting transcriptional regulations from DNA sequence patterns. J Hum Genet 2024; 69:499-504. [PMID: 38730006 PMCID: PMC11422166 DOI: 10.1038/s10038-024-01256-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/18/2024] [Revised: 04/10/2024] [Accepted: 04/25/2024] [Indexed: 05/12/2024]
Abstract
Cell-type-specific regulatory elements, cataloged through extensive experiments and bioinformatics in large-scale consortiums, have enabled enrichment analyses of genetic associations that primarily utilize positional information of the regulatory elements. These analyses have identified cell types and pathways genetically associated with human complex traits. However, our understanding of detailed allelic effects on these elements' activities and on-off states remains incomplete, hampering the interpretation of human genetic study results. This review introduces machine learning methods to learn sequence-dependent transcriptional regulation mechanisms from DNA sequences for predicting such allelic effects (not associations). We provide a concise history of machine-learning-based approaches, the requirements, and the key computational processes, focusing on primers in machine learning. Convolution and self-attention, pivotal in modern deep-learning models, are explained through geometrical interpretations using dot products. This facilitates understanding of these concepts and of why they are used in machine learning on DNA sequences. These explanations are intended to inspire further research in genetics and genomics.
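The dot-product view of convolution and self-attention can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not code from the review: the helper names are hypothetical, the convolution kernel is simply the one-hot encoding of a motif, and the attention uses identity projections instead of learned query, key, and value matrices.

```python
import math

def one_hot(seq):
    """Encode a DNA string as a list of 4-dimensional one-hot vectors (A, C, G, T)."""
    index = {"A": 0, "C": 1, "G": 2, "T": 3}
    return [[1.0 if index[base] == i else 0.0 for i in range(4)] for base in seq]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conv1d(x, kernel):
    """Slide the kernel along x. Each output is the dot product of the kernel
    with one flattened window, so it is largest where the window matches."""
    width = len(kernel)
    flat_kernel = [w for row in kernel for w in row]
    return [
        dot([v for row in x[i:i + width] for v in row], flat_kernel)
        for i in range(len(x) - width + 1)
    ]

def self_attention(x):
    """Single-head attention with identity projections (a simplification):
    weights are softmax-normalized scaled dot products between positions."""
    d = len(x[0])
    out = []
    for q in x:
        scores = [dot(q, k) / math.sqrt(d) for k in x]
        peak = max(scores)  # subtract the maximum for numerical stability
        exp = [math.exp(s - peak) for s in scores]
        total = sum(exp)
        weights = [e / total for e in exp]
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out

x = one_hot("ACGTACGA")
motif_scores = conv1d(x, one_hot("CGT"))  # counts matching bases per window
attended = self_attention(x)
```

With a one-hot kernel, each convolution output is just the number of positions at which the window agrees with the motif, which is the geometric intuition: the dot product is large when the two vectors point in the same direction.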
Affiliation(s)
- Masaru Koido
  - Laboratory of Complex Trait Genomics, Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Tokyo, Japan
  - Laboratory for Statistical and Translational Genetics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
- Kohei Tomizuka
  - Laboratory for Statistical and Translational Genetics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
- Chikashi Terao
  - Laboratory for Statistical and Translational Genetics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
  - Clinical Research Center, Shizuoka General Hospital, Shizuoka, Japan
  - The Department of Applied Genetics, The School of Pharmaceutical Sciences, University of Shizuoka, Shizuoka, Japan
12
Kumar Halder A, Agarwal A, Jodkowska K, Plewczynski D. A systematic analyses of different bioinformatics pipelines for genomic data and its impact on deep learning models for chromatin loop prediction. Brief Funct Genomics 2024; 23:538-548. [PMID: 38555493 DOI: 10.1093/bfgp/elae009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/28/2023] [Revised: 02/07/2024] [Accepted: 03/04/2024] [Indexed: 04/02/2024] Open
Abstract
Genomic data analysis has witnessed a surge in complexity and volume, primarily driven by the advent of high-throughput technologies. In particular, studying chromatin loops and structures has become pivotal in understanding gene regulation and genome organization. This systematic investigation explores the realm of specialized bioinformatics pipelines designed specifically for the analysis of chromatin loops and structures. Our investigation incorporates two protein (CTCF and Cohesin) factor-specific loop interaction datasets from six distinct pipelines, amassing a comprehensive collection of 36 diverse datasets. Through a meticulous review of existing literature, we offer a holistic perspective on the methodologies, tools and algorithms underpinning the analysis of this multifaceted genomic feature. We illuminate the vast array of approaches deployed, encompassing pivotal aspects such as data preparation pipeline, preprocessing, statistical features and modelling techniques. Beyond this, we rigorously assess the strengths and limitations inherent in these bioinformatics pipelines, shedding light on the interplay between data quality and the performance of deep learning models, ultimately advancing our comprehension of genomic intricacies.
Affiliation(s)
- Anup Kumar Halder
  - Laboratory of Bioinformatics and Computational Genomics, Faculty of Mathematics and Information Science, Warsaw University of Technology, Koszykowa 75, 00-662 Warsaw, Poland
  - Laboratory of Functional and Structural Genomics, Centre of New Technologies, University of Warsaw, Banacha 2c, 02-097 Warsaw, Poland
- Abhishek Agarwal
  - Laboratory of Functional and Structural Genomics, Centre of New Technologies, University of Warsaw, Banacha 2c, 02-097 Warsaw, Poland
- Karolina Jodkowska
  - Laboratory of Functional and Structural Genomics, Centre of New Technologies, University of Warsaw, Banacha 2c, 02-097 Warsaw, Poland
- Dariusz Plewczynski
  - Laboratory of Bioinformatics and Computational Genomics, Faculty of Mathematics and Information Science, Warsaw University of Technology, Koszykowa 75, 00-662 Warsaw, Poland
  - Laboratory of Functional and Structural Genomics, Centre of New Technologies, University of Warsaw, Banacha 2c, 02-097 Warsaw, Poland
13
Xie W, Yao Z, Yuan Y, Too J, Li F, Wang H, Zhan Y, Wu X, Wang Z, Zhang G. W2V-repeated index: Prediction of enhancers and their strength based on repeated fragments. Genomics 2024; 116:110906. [PMID: 39084477 DOI: 10.1016/j.ygeno.2024.110906] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/11/2024] [Revised: 07/10/2024] [Accepted: 07/24/2024] [Indexed: 08/02/2024]
Abstract
Enhancers are crucial in gene expression regulation, dictating the specificity and timing of transcriptional activity, which highlights the importance of their identification for unravelling the intricacies of genetic regulation. Therefore, it is critical to identify enhancers and their strengths. Repeated sequences in the genome are repeats of the same or symmetrical fragments. There is a great deal of evidence that repetitive sequences contain enormous amounts of genetic information. Thus, we introduce the W2V-Repeated Index, which identifies enhancer sequence fragments and evaluates their strength through the analysis of repeated K-mer sequences in enhancer regions. Utilizing the word2vector algorithm for numerical conversion and Manta Ray Foraging Optimization for feature selection, this method effectively captures the frequency and distribution of K-mer sequences. By concentrating on repeated K-mer sequences, it minimizes computational complexity and facilitates the analysis of larger K values. Experiments indicate that our method performs better than all other advanced methods on almost all indicators.
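The idea of restricting attention to repeated K-mers can be illustrated with a short counting sketch. The function name `repeated_kmers` and the toy sequence are hypothetical, and the sketch covers only exact repeats; it does not handle symmetrical fragments and does not reproduce the word2vec embedding or the Manta Ray Foraging Optimization step.

```python
from collections import Counter

def repeated_kmers(seq, k):
    """Return {k-mer: count} for every k-mer that occurs at least twice in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {kmer: n for kmer, n in counts.items() if n >= 2}

# Hypothetical toy sequence; only the repeated 3-mers are retained.
repeats = repeated_kmers("ACGTACGTTTACG", 3)
```

Because K-mers that occur only once are discarded, the retained vocabulary stays small even as K grows, which is consistent with the abstract's point about reduced computational cost at larger K.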
Affiliation(s)
- Weiming Xie
  - Department of Nuclear Medicine, General Hospital of Northern Theater Command, Shenyang, Liaoning 110016, China; College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, Liaoning 110167, China
- Zhaomin Yao
  - Department of Nuclear Medicine, General Hospital of Northern Theater Command, Shenyang, Liaoning 110016, China; College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, Liaoning 110167, China
- Yizhe Yuan
  - China Institute of Medical Robotics, Shanghai Jiao Tong University, Shanghai 200240, China
- Jingwei Too
  - Faculty of Electrical Engineering, Universiti Teknikal Malaysia Melaka, Hang Tuah Jaya, Durian Tunggal, 76100 Melaka, Malaysia
- Fei Li
  - College of Computer Science and Technology, Jilin University, Changchun, Jilin 130012, China
- Hongyu Wang
  - Department of Nuclear Medicine, General Hospital of Northern Theater Command, Shenyang, Liaoning 110016, China; College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, Liaoning 110167, China
- Ying Zhan
  - Department of Nuclear Medicine, General Hospital of Northern Theater Command, Shenyang, Liaoning 110016, China; College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, Liaoning 110167, China
- Xiaodan Wu
  - Department of Nuclear Medicine, General Hospital of Northern Theater Command, Shenyang, Liaoning 110016, China
- Zhiguo Wang
  - Department of Nuclear Medicine, General Hospital of Northern Theater Command, Shenyang, Liaoning 110016, China; College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, Liaoning 110167, China
- Guoxu Zhang
  - Department of Nuclear Medicine, General Hospital of Northern Theater Command, Shenyang, Liaoning 110016, China; College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, Liaoning 110167, China
14
Oh JW, Beer MA. Gapped-kmer sequence modeling robustly identifies regulatory vocabularies and distal enhancers conserved between evolutionarily distant mammals. Nat Commun 2024; 15:6464. [PMID: 39085231 PMCID: PMC11291912 DOI: 10.1038/s41467-024-50708-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2023] [Accepted: 07/17/2024] [Indexed: 08/02/2024] Open
Abstract
Gene regulatory elements drive complex biological phenomena and their mutations are associated with common human diseases. The impacts of human regulatory variants are often tested using model organisms such as mice. However, mapping human enhancers to conserved elements in mice remains a challenge, due to both rapid enhancer evolution and limitations of current computational methods. We analyze distal enhancers across 45 matched human/mouse cell/tissue pairs from a comprehensive dataset of DNase-seq experiments, and show that while cell-specific regulatory vocabulary is conserved, enhancers evolve more rapidly than promoters and CTCF binding sites. Enhancer conservation rates vary across cell types, in part explainable by tissue specific transposable element activity. We present an improved genome alignment algorithm using gapped-kmer features, called gkm-align, and make genome wide predictions for 1,401,803 orthologous regulatory elements. We show that gkm-align discovers 23,660 novel human/mouse conserved enhancers missed by previous algorithms, with strong evidence of conserved functional activity.
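For readers unfamiliar with gapped k-mer features, the sketch below enumerates them for a short sequence: every window of length `l` contributes one feature for each choice of `k` informative positions, with the remaining positions masked. The function name, the `-` gap symbol, and the toy parameters are assumptions for illustration; the gkm-align alignment algorithm itself is not reproduced here.

```python
from collections import Counter
from itertools import combinations

def gapped_kmer_counts(seq, l, k):
    """Count gapped k-mers: each length-l window with all but k positions masked by '-'."""
    counts = Counter()
    for i in range(len(seq) - l + 1):
        window = seq[i:i + l]
        for keep in combinations(range(l), k):
            counts["".join(window[j] if j in keep else "-" for j in range(l))] += 1
    return counts

# Hypothetical toy input: 3 windows of length 4, each yielding C(4, 3) = 4 features.
features = gapped_kmer_counts("ACGTAC", l=4, k=3)
```

Two sequences that differ at a single base still share most of their gapped k-mers, which is why this representation tolerates the mismatches that accumulate between distantly related genomes.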
Affiliation(s)
- Jin Woo Oh
  - Department of Biomedical Engineering and McKusick-Nathans Department of Genetic Medicine, Johns Hopkins University, Baltimore, MD, USA
- Michael A Beer
  - Department of Biomedical Engineering and McKusick-Nathans Department of Genetic Medicine, Johns Hopkins University, Baltimore, MD, USA
15
Yaacov O, Mathiyalagan P, Berk-Rauch HE, Ganesh SK, Zhu L, Hoffmann TJ, Iribarren C, Risch N, Lee D, Chakravarti A. Identification of the Molecular Components of Enhancer-Mediated Gene Expression Variation in Multiple Tissues Regulating Blood Pressure. Hypertension 2024; 81:1500-1510. [PMID: 38747164 PMCID: PMC11168860 DOI: 10.1161/hypertensionaha.123.22538] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/13/2023] [Accepted: 04/24/2024] [Indexed: 06/14/2024]
Abstract
BACKGROUND Inter-individual variation in blood pressure (BP) arises in part from sequence variants within enhancers modulating the expression of causal genes. We propose that these genes, active in tissues relevant to BP physiology, can be identified from tissue-level epigenomic data and genotypes of BP-phenotyped individuals. METHODS We used chromatin accessibility data from the heart, adrenal, kidney, and artery to identify cis-regulatory elements (CREs) in these tissues and estimate the impact of common human single-nucleotide variants within these CREs on gene expression, using machine learning methods. To identify causal genes, we performed a gene-wise association test. We conducted analyses in 2 separate large-scale cohorts: 77 822 individuals from the Genetic Epidemiology Research on Adult Health and Aging and 315 270 individuals from the UK Biobank. RESULTS We identified 309, 259, 331, and 367 genes (false discovery rate <0.05) for diastolic BP and 191, 184, 204, and 204 genes for systolic BP in the artery, kidney, heart, and adrenal, respectively, in Genetic Epidemiology Research on Adult Health and Aging; 50% to 70% of these genes were replicated in the UK Biobank, significantly higher than the 12% to 15% expected by chance (P<0.0001). These results enabled tissue expression prediction of these 988 to 2875 putative BP genes in individuals of both cohorts to construct an expression polygenic score. This score explained ≈27% of the reported single-nucleotide variant heritability, substantially higher than expected from prior studies. CONCLUSIONS Our work demonstrates the power of tissue-restricted comprehensive CRE analysis, followed by CRE-based expression prediction, for understanding BP regulation in relevant tissues and provides dual-modality supporting evidence, CRE and expression, for the causality genes.
Affiliation(s)
- Or Yaacov
  - Center for Human Genetics and Genomics, NYU Grossman School of Medicine, New York, NY, USA
- Prabhu Mathiyalagan
  - Center for Human Genetics and Genomics, NYU Grossman School of Medicine, New York, NY, USA
  - Benthos Prime Central, Houston, TX, USA
- Hanna E. Berk-Rauch
  - Center for Human Genetics and Genomics, NYU Grossman School of Medicine, New York, NY, USA
- Santhi K. Ganesh
  - Department of Internal Medicine & Department of Human Genetics, University of Michigan, Ann Arbor, MI, USA
- Luke Zhu
  - Center for Human Genetics and Genomics, NYU Grossman School of Medicine, New York, NY, USA
- Thomas J. Hoffmann
  - Department of Epidemiology and Biostatistics, University of California San Francisco, San Francisco, CA, USA
  - Institute for Human Genetics, University of California San Francisco, San Francisco, CA, USA
- Carlos Iribarren
  - Kaiser Permanente Northern California Division of Research, Oakland, CA, USA
- Neil Risch
  - Department of Epidemiology and Biostatistics, University of California San Francisco, San Francisco, CA, USA
  - Institute for Human Genetics, University of California San Francisco, San Francisco, CA, USA
  - Kaiser Permanente Northern California Division of Research, Oakland, CA, USA
- Dongwon Lee
  - Department of Pediatrics, Division of Nephrology, Boston Children’s Hospital, Boston & Harvard Medical School, Boston, MA, USA
- Aravinda Chakravarti
  - Center for Human Genetics and Genomics, NYU Grossman School of Medicine, New York, NY, USA
16
Ghoreishifar M, Chamberlain AJ, Xiang R, Prowse-Wilkins CP, Lopdell TJ, Littlejohn MD, Pryce JE, Goddard ME. Allele-specific binding variants causing ChIP-seq peak height of histone modification are not enriched in expression QTL annotations. Genet Sel Evol 2024; 56:50. [PMID: 38937662 PMCID: PMC11212393 DOI: 10.1186/s12711-024-00916-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2023] [Accepted: 06/04/2024] [Indexed: 06/29/2024] Open
Abstract
BACKGROUND Genome sequence variants affecting complex traits (quantitative trait loci, QTL) are enriched in functional regions of the genome, such as those marked by certain histone modifications. These variants are believed to influence gene expression. However, due to the linkage disequilibrium among nearby variants, pinpointing the precise location of QTL is challenging. We aimed to identify allele-specific binding (ASB) QTL (asbQTL) that cause variation in the level of histone modification, as measured by the height of peaks assayed by ChIP-seq (chromatin immunoprecipitation sequencing). We identified DNA sequences that predict the difference between alleles in ChIP-seq peak height in H3K4me3 and H3K27ac histone modifications in the mammary glands of cows. RESULTS We used a gapped k-mer support vector machine, a novel best linear unbiased prediction model, and a multiple linear regression model that combines the other two approaches to predict variant impacts on peak height. For each method, a subset of 1000 sites with the highest magnitude of predicted ASB was considered as candidate asbQTL. The accuracy of this prediction was measured by the proportion where the predicted direction matched the observed direction. Prediction accuracy ranged between 0.59 and 0.74, suggesting that these 1000 sites are enriched for asbQTL. Using independent data, we investigated functional enrichment in the candidate asbQTL set and three control groups, including non-causal ASB sites, non-ASB variants under a peak, and SNPs (single nucleotide polymorphisms) not under a peak. For H3K4me3, a higher proportion of the candidate asbQTL were confirmed as ASB when compared to the non-causal ASB sites (P < 0.01). However, these candidate asbQTL did not enrich for the other annotations, including expression QTL (eQTL), allele-specific expression QTL (aseQTL) and sites conserved across mammals (P > 0.05). 
CONCLUSIONS We identified putatively causal sites for asbQTL using the DNA sequence surrounding these sites. Our results suggest that many sites influencing histone modifications may not directly affect gene expression. However, it is important to acknowledge that distinguishing between putative causal ASB sites and other non-causal ASB sites in high linkage disequilibrium with the causal sites regarding their impact on gene expression may be challenging due to limitations in statistical power.
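The accuracy measure described in the RESULTS, the proportion of top-ranked sites whose predicted direction matches the observed direction, can be sketched as follows. The function names and the numbers are made up for illustration, and skipping sites with a zero effect is an assumption of this sketch, not something the abstract specifies.

```python
def top_sites(predicted, n):
    """Indices of the n sites with the largest absolute predicted effect."""
    return sorted(range(len(predicted)), key=lambda i: abs(predicted[i]), reverse=True)[:n]

def direction_accuracy(predicted, observed):
    """Fraction of sites where predicted and observed allelic effects share a sign.
    Sites with a zero predicted or observed effect are skipped (an assumption)."""
    pairs = [(p, o) for p, o in zip(predicted, observed) if p != 0 and o != 0]
    if not pairs:
        raise ValueError("no sites with a non-zero predicted and observed effect")
    matches = sum((p > 0) == (o > 0) for p, o in pairs)
    return matches / len(pairs)

# Made-up toy effects for five sites; the study used the top 1000 sites per method.
predicted = [0.9, -0.7, 0.1, -0.05, 0.6]
observed = [0.4, -0.2, -0.3, 0.1, -0.5]
candidates = top_sites(predicted, 3)
accuracy = direction_accuracy(
    [predicted[i] for i in candidates], [observed[i] for i in candidates]
)
```

A value of 0.5 is what random sign guesses would give on average, which is the baseline against which the reported range of 0.59 to 0.74 should be read.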
Affiliation(s)
- Mohammad Ghoreishifar
  - Agriculture Victoria Research, AgriBio Centre for AgriBioscience, Bundoora, VIC, 3083, Australia
  - School of Applied Systems Biology, La Trobe University, Bundoora, VIC, 3083, Australia
- Amanda J Chamberlain
  - Agriculture Victoria Research, AgriBio Centre for AgriBioscience, Bundoora, VIC, 3083, Australia
  - School of Applied Systems Biology, La Trobe University, Bundoora, VIC, 3083, Australia
- Ruidong Xiang
  - Agriculture Victoria Research, AgriBio Centre for AgriBioscience, Bundoora, VIC, 3083, Australia
  - Faculty of Veterinary & Agricultural Science, University of Melbourne, Parkville, VIC, 3010, Australia
- Claire P Prowse-Wilkins
  - Agriculture Victoria Research, AgriBio Centre for AgriBioscience, Bundoora, VIC, 3083, Australia
  - Faculty of Veterinary & Agricultural Science, University of Melbourne, Parkville, VIC, 3010, Australia
- Thomas J Lopdell
  - Research and Development, Livestock Improvement Corporation, Private Bag 3016, Hamilton, 3240, New Zealand
- Mathew D Littlejohn
  - Research and Development, Livestock Improvement Corporation, Private Bag 3016, Hamilton, 3240, New Zealand
- Jennie E Pryce
  - Agriculture Victoria Research, AgriBio Centre for AgriBioscience, Bundoora, VIC, 3083, Australia
  - School of Applied Systems Biology, La Trobe University, Bundoora, VIC, 3083, Australia
- Michael E Goddard
  - Agriculture Victoria Research, AgriBio Centre for AgriBioscience, Bundoora, VIC, 3083, Australia
  - Faculty of Veterinary & Agricultural Science, University of Melbourne, Parkville, VIC, 3010, Australia
17
Razavi-Mohseni M, Huang W, Guo YA, Shigaki D, Ho SWT, Tan P, Skanderup AJ, Beer MA. Machine learning identifies activation of RUNX/AP-1 as drivers of mesenchymal and fibrotic regulatory programs in gastric cancer. Genome Res 2024; 34:680-695. [PMID: 38777607 PMCID: PMC11216402 DOI: 10.1101/gr.278565.123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/26/2023] [Accepted: 05/13/2024] [Indexed: 05/25/2024]
Abstract
Gastric cancer (GC) is the fifth most common cancer worldwide and is a heterogeneous disease. Among GC subtypes, the mesenchymal phenotype (Mes-like) is more invasive than the epithelial phenotype (Epi-like). Although gene expression of the epithelial-to-mesenchymal transition (EMT) has been studied, the regulatory landscape shaping this process is not fully understood. Here we use ATAC-seq and RNA-seq data from a compendium of GC cell lines and primary tumors to detect drivers of regulatory state changes and their transcriptional responses. Using the ATAC-seq data, we developed a machine learning approach to determine the transcription factors (TFs) regulating the subtypes of GC. We identified TFs driving the mesenchymal (RUNX2, ZEB1, SNAI2, AP-1 dimer) and the epithelial (GATA4, GATA6, KLF5, HNF4A, FOXA2, GRHL2) states in GC. We identified DNA copy number alterations associated with dysregulation of these TFs, specifically deletion of GATA4 and amplification of MAPK9. Comparisons with bulk and single-cell RNA-seq data sets identified activation toward fibroblast-like epigenomic and expression signatures in Mes-like GC. The activation of this mesenchymal fibrotic program is associated with differentially accessible DNA cis-regulatory elements flanking upregulated mesenchymal genes. These findings establish a map of TF activity in GC and highlight the role of copy number driven alterations in shaping epigenomic regulatory programs as potential drivers of GC heterogeneity and progression.
Affiliation(s)
- Milad Razavi-Mohseni
  - Department of Biomedical Engineering and McKusick-Nathans Department of Genetic Medicine, Johns Hopkins University, Baltimore, Maryland 21205, USA
- Weitai Huang
  - Laboratory of Computational Cancer Genomics, Genome Institute of Singapore (GIS), Agency for Science, Technology and Research (A*STAR), Singapore 138672
- Yu A Guo
  - Laboratory of Computational Cancer Genomics, Genome Institute of Singapore (GIS), Agency for Science, Technology and Research (A*STAR), Singapore 138672
- Dustin Shigaki
  - Department of Biomedical Engineering and McKusick-Nathans Department of Genetic Medicine, Johns Hopkins University, Baltimore, Maryland 21205, USA
- Shamaine Wei Ting Ho
  - Laboratory of Cancer Epigenetic Regulation, Genome Institute of Singapore (GIS), Agency for Science, Technology and Research (A*STAR), Singapore 138672
- Patrick Tan
  - Laboratory of Cancer Epigenetic Regulation, Genome Institute of Singapore (GIS), Agency for Science, Technology and Research (A*STAR), Singapore 138672
  - Cancer and Stem Cell Biology Program, Duke-NUS Medical School, Singapore 169857
  - Cancer Science Institute of Singapore, National University of Singapore, Singapore 117599
  - Department of Physiology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore 117593
- Anders J Skanderup
  - Laboratory of Computational Cancer Genomics, Genome Institute of Singapore (GIS), Agency for Science, Technology and Research (A*STAR), Singapore 138672
- Michael A Beer
  - Department of Biomedical Engineering and McKusick-Nathans Department of Genetic Medicine, Johns Hopkins University, Baltimore, Maryland 21205, USA
18
Tabe-Bordbar S, Song YJ, Lunt BJ, Alavi Z, Prasanth KV, Sinha S. Mechanistic analysis of enhancer sequences in the estrogen receptor transcriptional program. Commun Biol 2024; 7:719. [PMID: 38862711 PMCID: PMC11167054 DOI: 10.1038/s42003-024-06400-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/21/2022] [Accepted: 05/30/2024] [Indexed: 06/13/2024] Open
Abstract
Estrogen Receptor α (ERα) is a major lineage-determining transcription factor (TF) in mammary gland development. Dysregulation of the ERα-mediated transcriptional program results in cancer. Transcriptomic and epigenomic profiling of breast cancer cell lines has revealed large numbers of enhancers involved in this regulatory program, but how these enhancers encode function in their sequence remains poorly understood. A subset of ERα-bound enhancers are transcribed into short bidirectional RNA (enhancer RNA or eRNA), and this property is believed to be a reliable marker of active enhancers. We therefore analyze thousands of ERα-bound enhancers and build quantitative, mechanism-aware models to discriminate eRNAs from non-transcribing enhancers based on their sequence. Our thermodynamics-based models provide insights into the roles of specific TFs in the ERα-mediated transcriptional program, many of which are supported by the literature. We use in silico perturbations to predict TF-enhancer regulatory relationships and integrate these findings with experimentally determined enhancer-promoter interactions to construct a gene regulatory network. We also demonstrate that the model can prioritize breast cancer-related sequence variants while providing mechanistic explanations for their function. Finally, we experimentally validate the model-proposed mechanisms underlying three such variants.
Affiliation(s)
- Shayan Tabe-Bordbar
  - Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- You Jin Song
  - Department of Cell and Developmental Biology, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Bryan J Lunt
  - Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Zahra Alavi
  - Department of Physics, Loyola Marymount University, Los Angeles, CA, USA
- Kannanganattu V Prasanth
  - Department of Cell and Developmental Biology, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Saurabh Sinha
  - Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, GA, USA
19
Guo Y, Zhou D, Li P, Li C, Cao J. Context-Aware Poly(A) Signal Prediction Model via Deep Spatial-Temporal Neural Networks. IEEE Trans Neural Netw Learn Syst 2024; 35:8241-8253. [PMID: 37015693 DOI: 10.1109/tnnls.2022.3226301] [Citation(s) in RCA: 10] [Impact Index Per Article: 10.0] [Indexed: 06/19/2023]
Abstract
Polyadenylation [Poly(A)] is an essential process during messenger RNA (mRNA) maturation in eukaryotes. Identifying Poly(A) signals (PASs) at the genome level is key to understanding the mechanisms of translation regulation and mRNA metabolism. In this work, we propose a deep dual-dynamic context-aware Poly(A) signal prediction model, called multiscale convolution with self-attention networks (MCANet), to adaptively uncover spatial-temporal contextual dependence information. Specifically, the model automatically learns and strengthens informative features along the temporal and spatial dimensions. Identity connections pass contextual feature maps of Poly(A) data directly from previous layers to subsequent layers. Then, a fully parametric rectified linear unit (FP-RELU) with dual-dynamic coefficients is devised to make the training of the model easier and enhance its generalization ability. A cross-entropy loss (CL) function is designed to make the model focus on samples that are easy to misclassify. Experiments on different Poly(A) signals demonstrate the superior performance of the proposed MCANet, and an ablation study shows the effectiveness of the network design for the feature learning and prediction of Poly(A) signals.
20
Zhao L, Hao R, Chai Z, Fu W, Yang W, Li C, Liu Q, Jiang Y. DeepOCR: A multi-species deep-learning framework for accurate identification of open chromatin regions in livestock. Comput Biol Chem 2024; 110:108077. [PMID: 38691895 DOI: 10.1016/j.compbiolchem.2024.108077] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/11/2024] [Revised: 03/27/2024] [Accepted: 04/16/2024] [Indexed: 05/03/2024]
Abstract
A wealth of experimental evidence has suggested that open chromatin regions (OCRs) are involved in many critical biological activities, such as DNA replication, enhancer activity, and gene transcription. Accurately identifying OCRs in livestock species can provide critical insights into the distribution and characteristics of OCRs for disease treatment in livestock, thereby improving animal welfare. However, most current machine-learning methods for OCR prediction were originally designed for humans and a limited number of model organisms, and thus their performance on non-model organisms, specifically livestock, is often unsatisfactory. To bridge this gap, we propose DeepOCR, a lightweight depth-separable residual network model for predicting OCRs in livestock, including chicken, cattle, and sheep. DeepOCR integrates a single convolution layer and two improved residual structure blocks to extract and learn important features from the input DNA sequences. A fully connected layer was also employed to further process the extracted features and improve the robustness of the entire network. Our benchmarking experiments demonstrated superior prediction performance of DeepOCR compared to state-of-the-art approaches on testing datasets of the three species. The source code of DeepOCR is freely available for academic purposes at https://github.com/jasonzhao371/DeepOCR/. We anticipate that DeepOCR will serve as a practical and reliable computational tool for OCR-related studies in livestock species.
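Why a depth-separable design counts as lightweight can be seen from a simple parameter count. The sketch below compares a standard 1D convolution with a depthwise-separable one (a per-channel depthwise convolution followed by a 1x1 pointwise convolution). The function names and the layer sizes are hypothetical and are not taken from the DeepOCR source code.

```python
def standard_conv1d_params(c_in, c_out, kernel):
    """Weights plus biases of an ordinary 1D convolution."""
    return c_in * c_out * kernel + c_out

def separable_conv1d_params(c_in, c_out, kernel):
    """Depthwise convolution (one filter per input channel) followed by a
    pointwise 1x1 convolution that mixes channels; both include biases."""
    depthwise = c_in * kernel + c_in
    pointwise = c_in * c_out + c_out
    return depthwise + pointwise

# Hypothetical layer: 64 input channels, 64 output channels, kernel width 9.
standard = standard_conv1d_params(64, 64, 9)
separable = separable_conv1d_params(64, 64, 9)
```

For these illustrative sizes the separable layer needs 4,800 parameters against 36,928 for the standard layer, roughly a 7.7-fold reduction.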
Affiliation(s)
- Liangwei Zhao
- College of Information Engineering, Northwest A&F University, Yangling 712100, China
| | - Ran Hao
- College of Information Engineering, Northwest A&F University, Yangling 712100, China
| | - Ziyi Chai
- College of Information Engineering, Northwest A&F University, Yangling 712100, China
| | - Weiwei Fu
- College of Pastoral Agriculture Science and Technology, Lanzhou University, Lanzhou, Gansu 730020, China
| | - Wei Yang
- National Clinical Research Center for Infectious Diseases, Shenzhen Third People's Hospital, Shenzhen 518112, China
| | - Chen Li
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia.
| | - Quanzhong Liu
- College of Information Engineering, Northwest A&F University, Yangling 712100, China.
| | - Yu Jiang
- Key Laboratory of Animal Genetics, Breeding and Reproduction of Shaanxi Province, College of Animal Science and Technology, Northwest A&F University, Yangling 712100, China; Key Laboratory of Livestock Biology, Northwest A&F University, Yangling, Shaanxi 712100, China.
| |
|
21
|
Zhuang J, Huang X, Liu S, Gao W, Su R, Feng K. MulTFBS: A Spatial-Temporal Network with Multichannels for Predicting Transcription Factor Binding Sites. J Chem Inf Model 2024; 64:4322-4333. [PMID: 38733561 DOI: 10.1021/acs.jcim.3c02088] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/13/2024]
Abstract
Revealing the mechanisms that influence transcription factor binding specificity is the key to understanding gene regulation. In previous studies, DNA double helix structure and one-hot embedding have been used successfully to design computational methods for predicting transcription factor binding sites (TFBSs). However, although DNA sequence can be viewed as a kind of biological language, word embedding representations from natural language processing have not been properly considered in TFBS prediction models. In our work, we integrate different types of DNA sequence features to design a multichannel deep learning framework, MulTFBS, in which three feature types are trained in separate channels: one-hot encoding; word embedding encoding, which can incorporate contextual information and extract the global features of the sequences; and double helix three-dimensional structural features. To extract high-level sequence information effectively, our framework uses a spatial-temporal network that combines convolutional neural networks and bidirectional long short-term memory networks with an attention mechanism. Compared with six state-of-the-art methods on 66 universal protein-binding microarray data sets of different transcription factors, MulTFBS performs best on all data sets in the regression tasks, with an average R² of 0.698 and an average PCC of 0.833, which are 5.4% and 3.2% higher, respectively, than those of the second-best method, CRPTS. In addition, we evaluate the classification performance of MulTFBS for distinguishing bound from unbound regions on TF ChIP-seq data. The results show that our framework also performs well in the TFBS classification tasks.
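Two of the sequence representations named in this abstract, one-hot encoding and k-mer "words" for an embedding lookup, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the choice of k = 3, and the handling of unknown bases are hypothetical and are not taken from MulTFBS.

```python
from itertools import product

ALPHABET = "ACGT"


def one_hot(seq: str) -> list[list[int]]:
    """Encode each base as a 4-element indicator vector (A, C, G, T).
    Bases outside the alphabet, such as N, become all zeros."""
    return [[int(base == ref) for ref in ALPHABET] for base in seq.upper()]


def kmer_tokens(seq: str, k: int = 3) -> list[str]:
    """Split a sequence into overlapping k-mers, the 'words' that a
    word-embedding model would be trained on."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_vocabulary(k: int = 3) -> dict[str, int]:
    """Map every possible k-mer to an integer index for an embedding lookup."""
    return {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}


seq = "ACGTAC"
vocab = kmer_vocabulary(3)
token_ids = [vocab[t] for t in kmer_tokens(seq, 3)]
print(one_hot(seq)[:2], kmer_tokens(seq, 3), token_ids)
```

In a multichannel model, the one-hot matrix and the token indices would each feed a separate input branch; the structural-feature channel is omitted here because the abstract does not specify how those features are computed.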
Affiliation(s)
- Jujuan Zhuang
- The School of Science, Dalian Maritime University, Dalian 116026, China
| | - Xinru Huang
- The School of Science, Dalian Maritime University, Dalian 116026, China
| | - Shuhan Liu
- The School of Science, Dalian Maritime University, Dalian 116026, China
| | - Wanquan Gao
- The School of Science, Dalian Maritime University, Dalian 116026, China
| | - Rui Su
- The School of Science, Dalian Maritime University, Dalian 116026, China
| | - Kexin Feng
- The School of Science, Dalian Maritime University, Dalian 116026, China
| |
|
22
|
Gupta S, Kesarwani V, Bhati U, Jyoti, Shankar R. PTFSpot: deep co-learning on transcription factors and their binding regions attains impeccable universality in plants. Brief Bioinform 2024; 25:bbae324. [PMID: 39013383 PMCID: PMC11250369 DOI: 10.1093/bib/bbae324] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/19/2024] [Revised: 06/07/2024] [Accepted: 06/19/2024] [Indexed: 07/18/2024] Open
Abstract
Unlike in animals, variability in transcription factors (TFs) and their binding regions (TFBRs) across plant species is a major problem that most existing TFBR-finding software fails to tackle, rendering it of hardly any use. This limitation has resulted in the underdevelopment of plant regulatory research and the rampant use of Arabidopsis-like model species, generating misleading results. Here, we report a revolutionary transformers-based deep-learning approach, PTFSpot, which learns from the co-variability of TF structures and their binding regions to provide a universal TF-DNA interaction model that detects TFBRs with complete freedom from the limitations of TF- and species-specific models. In a series of extensive benchmarking studies over multiple experimentally validated data sets, it not only outperformed the existing software by a lead of >30% but also consistently delivered >90% accuracy, even for species and TF families that were never encountered during the model-building process. PTFSpot now makes it possible to accurately annotate TFBRs across any plant genome even in the total absence of TF information, completely free from the bottlenecks of species- and TF-specific models.
Affiliation(s)
- Sagar Gupta
- Studio of Computational Biology & Bioinformatics, The Himalayan Centre for High-throughput Computational Biology, (HiCHiCoB, A BIC supported by DBT, India), Biotechnology Division, CSIR-Institute of Himalayan Bioresource Technology (CSIR-IHBT), Palampur, Himachal Pradesh 176061, India
- Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, Uttar Pradesh 201002, India
| | - Veerbhan Kesarwani
- Studio of Computational Biology & Bioinformatics, The Himalayan Centre for High-throughput Computational Biology, (HiCHiCoB, A BIC supported by DBT, India), Biotechnology Division, CSIR-Institute of Himalayan Bioresource Technology (CSIR-IHBT), Palampur, Himachal Pradesh 176061, India
- Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, Uttar Pradesh 201002, India
| | - Umesh Bhati
- Studio of Computational Biology & Bioinformatics, The Himalayan Centre for High-throughput Computational Biology, (HiCHiCoB, A BIC supported by DBT, India), Biotechnology Division, CSIR-Institute of Himalayan Bioresource Technology (CSIR-IHBT), Palampur, Himachal Pradesh 176061, India
- Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, Uttar Pradesh 201002, India
| | - Jyoti
- Studio of Computational Biology & Bioinformatics, The Himalayan Centre for High-throughput Computational Biology, (HiCHiCoB, A BIC supported by DBT, India), Biotechnology Division, CSIR-Institute of Himalayan Bioresource Technology (CSIR-IHBT), Palampur, Himachal Pradesh 176061, India
- Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, Uttar Pradesh 201002, India
| | - Ravi Shankar
- Studio of Computational Biology & Bioinformatics, The Himalayan Centre for High-throughput Computational Biology, (HiCHiCoB, A BIC supported by DBT, India), Biotechnology Division, CSIR-Institute of Himalayan Bioresource Technology (CSIR-IHBT), Palampur, Himachal Pradesh 176061, India
- Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, Uttar Pradesh 201002, India
| |
|
23
|
Ghosh N, Santoni D, Saha I, Felici G. Predicting Transcription Factor Binding Sites with Deep Learning. Int J Mol Sci 2024; 25:4990. [PMID: 38732207 PMCID: PMC11084193 DOI: 10.3390/ijms25094990] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/20/2024] [Accepted: 04/28/2024] [Indexed: 05/13/2024] Open
Abstract
Prediction of binding sites for transcription factors is important to understand how the latter regulate gene expression and how this regulation can be modulated for therapeutic purposes. A considerable number of studies address this issue with different approaches, Machine Learning being one of the most successful. Nevertheless, we note that many such approaches fail to propose a robust and meaningful method to embed the genetic data under analysis. We try to overcome this problem by proposing a bidirectional transformer-based encoder, empowered by bidirectional long short-term memory layers and with a capsule layer responsible for the final prediction. To evaluate the efficiency of the proposed approach, we use benchmark ChIP-seq datasets of five cell lines available in the ENCODE repository (A549, GM12878, Hep-G2, H1-hESC, and HeLa). The results show that the proposed method can predict TFBS within the five different cell lines very well; moreover, cross-cell predictions provide satisfactory results as well. Experiments conducted across cell lines are reinforced by the analysis of five additional lines used only to test the model trained using the others. The results confirm that prediction performance across cell lines remains very high, allowing an extensive cross-transcription factor analysis to be performed from which several indications of interest for molecular biology may be drawn.
Affiliation(s)
- Nimisha Ghosh
- Department of Computer Science and Information Technology, Institute of Technical Education and Research, Siksha ’O’ Anusandhan (Deemed to be University), Bhubaneswar 751030, India
| | - Daniele Santoni
- Institute for System Analysis and Computer Science “Antonio Ruberti”, National Research Council of Italy, 00185 Rome, Italy; (D.S.); (G.F.)
| | - Indrajit Saha
- Department of Computer Science and Engineering, National Institute of Technical Teachers’ Training and Research, Kolkata 700106, India;
| | - Giovanni Felici
- Institute for System Analysis and Computer Science “Antonio Ruberti”, National Research Council of Italy, 00185 Rome, Italy; (D.S.); (G.F.)
| |
|
24
|
Datta S, Nabeel Asim M, Dengel A, Ahmed S. NTpred: a robust and precise machine learning framework for in silico identification of Tyrosine nitration sites in protein sequences. Brief Funct Genomics 2024; 23:163-179. [PMID: 37248673 DOI: 10.1093/bfgp/elad018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/14/2023] [Revised: 04/12/2023] [Accepted: 05/02/2023] [Indexed: 05/31/2023] Open
Abstract
Post-translational modifications (PTMs) either enhance a protein's activity in various sub-cellular processes or degrade its activity, which leads to failure of intracellular processes. Tyrosine nitration (NT) modification degrades a protein's activity, which initiates and propagates various diseases, including neurodegenerative, cardiovascular and autoimmune diseases and carcinogenesis. Identification of NT modification supports the development of novel therapies and drug discovery for associated diseases. Identification of NT modification in biochemical labs is expensive, time-consuming and error-prone. To supplement this process, several computational approaches have been proposed. However, these approaches fail to precisely identify NT modification because they extract irrelevant, redundant and less discriminative features from protein sequences. This paper presents the NTpred framework, which extracts comprehensive features from raw protein sequences using four different sequence encoders. To reap the benefits of different encoders, it generates four additional feature spaces by fusing different combinations of individual encodings. Furthermore, it eliminates irrelevant and redundant features from the eight feature spaces through a Recursive Feature Elimination process. Selected features of the four individual encodings and four feature fusion vectors are used to train eight different Gradient Boosted Tree classifiers. The probability scores from the trained classifiers are used to generate a new probabilistic feature space, which is used to train a Logistic Regression classifier. On the BD1 benchmark dataset, the proposed framework outperforms the existing best-performing predictor in 5-fold cross validation and independent test evaluation with a combined improvement of 13.7% in MCC and 20.1% in AUC. Similarly, on the BD2 benchmark dataset, the proposed framework outperforms the existing best-performing predictor with a combined improvement of 5.3% in MCC and 1.0% in AUC. NTpred is publicly available for further experimentation and predictive use at: https://sds_genetic_analysis.opendfki.de/PredNTS/.
Affiliation(s)
- Sourajyoti Datta
- Department of Computer Science, Rheinland Pfälzische Technische Universität, Kaiserslautern, 67663, Germany
| | - Muhammad Nabeel Asim
- German Research Center for Artificial Intelligence, Kaiserslautern, 67663, Germany
| | - Andreas Dengel
- Department of Computer Science, Rheinland Pfälzische Technische Universität, Kaiserslautern, 67663, Germany
- German Research Center for Artificial Intelligence, Kaiserslautern, 67663, Germany
| | - Sheraz Ahmed
- German Research Center for Artificial Intelligence, Kaiserslautern, 67663, Germany
| |
|
25
|
Robson ES, Ioannidis NM. GUANinE v1.0: Benchmark Datasets for Genomic AI Sequence-to-Function Models. bioRxiv 2024:2023.10.12.562113. [PMID: 37904945 PMCID: PMC10614795 DOI: 10.1101/2023.10.12.562113] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/01/2023]
Abstract
Computational genomics increasingly relies on machine learning methods for genome interpretation, and the recent adoption of neural sequence-to-function models highlights the need for rigorous model specification and controlled evaluation, problems familiar to other fields of AI. Research strategies that have greatly benefited other fields - including benchmarking, auditing, and algorithmic fairness - are also needed to advance the field of genomic AI and to facilitate model development. Here we propose a genomic AI benchmark, GUANinE, for evaluating model generalization across a number of distinct genomic tasks. Compared to existing task formulations in computational genomics, GUANinE is large-scale, de-noised, and suitable for evaluating pretrained models. GUANinE v1.0 primarily focuses on functional genomics tasks such as functional element annotation and gene expression prediction, and it also draws upon connections to evolutionary biology through sequence conservation tasks. The current GUANinE tasks provide insight into the performance of existing genomic AI models and non-neural baselines, with opportunities to be refined, revisited, and broadened as the field matures. Finally, the GUANinE benchmark allows us to evaluate new self-supervised T5 models and explore the tradeoffs between tokenization and model performance, while showcasing the potential for self-supervision to complement existing pretraining procedures.
Affiliation(s)
- Eyes S Robson
- Center for Computational Biology, UC Berkeley, Berkeley, CA 94720
| | - Nilah M Ioannidis
- Department of Electrical Engineering and Computer Sciences, UC Berkeley, Berkeley, CA 94720
| |
|
26
|
Wall BPG, Nguyen M, Harrell JC, Dozmorov MG. Machine and deep learning methods for predicting 3D genome organization. arXiv 2024:arXiv:2403.03231v1. [PMID: 38495565 PMCID: PMC10942493] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/19/2024]
Abstract
Three-dimensional (3D) chromatin interactions, such as enhancer-promoter interactions (EPIs), loops, Topologically Associating Domains (TADs), and A/B compartments, play critical roles in a wide range of cellular processes by regulating gene expression. Recent development of chromatin conformation capture technologies has enabled genome-wide profiling of various 3D structures, even in single cells. However, current catalogs of 3D structures remain incomplete and unreliable due to differences in technology and tools and to low data resolution. Machine learning methods have emerged as an alternative to obtain missing 3D interactions and/or improve resolution. Such methods frequently use genome annotation data (ChIP-seq, DNase-seq, etc.), DNA sequence information (k-mers, Transcription Factor Binding Site (TFBS) motifs), and other genomic properties to learn the associations between genomic features and chromatin interactions. In this review, we discuss computational tools for predicting three types of 3D interactions (EPIs, chromatin interactions, TAD boundaries) and analyze their pros and cons. We also point out obstacles to computational prediction of 3D interactions and suggest future research directions.
Affiliation(s)
- Brydon P. G. Wall
- Center for Biological Data Science, Virginia Commonwealth University, Richmond, VA, 23284, USA
| | - My Nguyen
- Department of Biostatistics, Virginia Commonwealth University, Richmond, VA, 23298, USA
| | - J. Chuck Harrell
- Department of Pathology, Virginia Commonwealth University, Richmond, VA, 23284, USA
- Massey Comprehensive Cancer Center, Virginia Commonwealth University, Richmond, VA 23298, USA
- Center for Pharmaceutical Engineering, Virginia Commonwealth University, Richmond, VA 23298, USA
| | - Mikhail G. Dozmorov
- Department of Biostatistics, Virginia Commonwealth University, Richmond, VA, 23298, USA
- Department of Pathology, Virginia Commonwealth University, Richmond, VA, 23284, USA
| |
|
27
|
Jiang F, Hu SY, Tian W, Wang NN, Yang N, Dong SS, Song HM, Zhang DJ, Gao HW, Wang C, Wu H, He CY, Zhu DL, Chen XF, Guo Y, Yang Z, Yang TL. A landscape of gene expression regulation for synovium in arthritis. Nat Commun 2024; 15:1409. [PMID: 38360850 PMCID: PMC10869817 DOI: 10.1038/s41467-024-45652-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/09/2023] [Accepted: 01/29/2024] [Indexed: 02/17/2024] Open
Abstract
The synovium is an important component of any synovial joint and is the major target tissue of inflammatory arthritis. However, the multi-omics landscape of synovium required for functional inference is absent from large-scale resources. Here we integrate genomics with transcriptomics and chromatin accessibility features of human synovium in up to 245 arthritic patients to characterize the landscape of genetic regulation of gene expression and the regulatory mechanisms mediating predisposition to arthritic diseases. We identify 4765 independent primary and 616 secondary cis-expression quantitative trait loci (cis-eQTLs) in the synovium and find that eQTLs with multiple independent signals have stronger effects and heritability than single independent eQTLs. Integration of genome-wide association studies (GWASs) and eQTLs identifies 84 arthritis-related genes, revealing 38 novel genes that have not been reported by previous studies using eQTL data from the GTEx project or immune cells. We further develop a method called eQTac to identify variants that could affect gene expression by affecting chromatin accessibility, and identify 1517 regions with potential regulatory function of chromatin accessibility. Altogether, our study provides a comprehensive synovium multi-omics resource for arthritic diseases and offers new insights into the regulation of gene expression.
Affiliation(s)
- Feng Jiang
- Key Laboratory of Biomedical Information Engineering of Ministry of Education, Biomedical Informatics & Genomics Center, School of Life Science and Technology, Xi'an Jiaotong University, Xi'an, 710049, P.R. China
| | - Shou-Ye Hu
- Department of Joint Surgery, Honghui Hospital, Xi'an Jiaotong University, Xi'an, Shaanxi, 710054, P.R. China
| | - Wen Tian
- Key Laboratory of Biomedical Information Engineering of Ministry of Education, Biomedical Informatics & Genomics Center, School of Life Science and Technology, Xi'an Jiaotong University, Xi'an, 710049, P.R. China
| | - Nai-Ning Wang
- Key Laboratory of Biomedical Information Engineering of Ministry of Education, Biomedical Informatics & Genomics Center, School of Life Science and Technology, Xi'an Jiaotong University, Xi'an, 710049, P.R. China
| | - Ning Yang
- Key Laboratory of Biomedical Information Engineering of Ministry of Education, Biomedical Informatics & Genomics Center, School of Life Science and Technology, Xi'an Jiaotong University, Xi'an, 710049, P.R. China
| | - Shan-Shan Dong
- Key Laboratory of Biomedical Information Engineering of Ministry of Education, Biomedical Informatics & Genomics Center, School of Life Science and Technology, Xi'an Jiaotong University, Xi'an, 710049, P.R. China
| | - Hui-Miao Song
- Key Laboratory of Biomedical Information Engineering of Ministry of Education, Biomedical Informatics & Genomics Center, School of Life Science and Technology, Xi'an Jiaotong University, Xi'an, 710049, P.R. China
| | - Da-Jin Zhang
- Key Laboratory of Biomedical Information Engineering of Ministry of Education, Biomedical Informatics & Genomics Center, School of Life Science and Technology, Xi'an Jiaotong University, Xi'an, 710049, P.R. China
| | - Hui-Wu Gao
- Key Laboratory of Biomedical Information Engineering of Ministry of Education, Biomedical Informatics & Genomics Center, School of Life Science and Technology, Xi'an Jiaotong University, Xi'an, 710049, P.R. China
| | - Chen Wang
- Key Laboratory of Biomedical Information Engineering of Ministry of Education, Biomedical Informatics & Genomics Center, School of Life Science and Technology, Xi'an Jiaotong University, Xi'an, 710049, P.R. China
| | - Hao Wu
- Key Laboratory of Biomedical Information Engineering of Ministry of Education, Biomedical Informatics & Genomics Center, School of Life Science and Technology, Xi'an Jiaotong University, Xi'an, 710049, P.R. China
| | - Chang-Yi He
- Key Laboratory of Biomedical Information Engineering of Ministry of Education, Biomedical Informatics & Genomics Center, School of Life Science and Technology, Xi'an Jiaotong University, Xi'an, 710049, P.R. China
| | - Dong-Li Zhu
- Key Laboratory of Biomedical Information Engineering of Ministry of Education, Biomedical Informatics & Genomics Center, School of Life Science and Technology, Xi'an Jiaotong University, Xi'an, 710049, P.R. China
| | - Xiao-Feng Chen
- Key Laboratory of Biomedical Information Engineering of Ministry of Education, Biomedical Informatics & Genomics Center, School of Life Science and Technology, Xi'an Jiaotong University, Xi'an, 710049, P.R. China
| | - Yan Guo
- Key Laboratory of Biomedical Information Engineering of Ministry of Education, Biomedical Informatics & Genomics Center, School of Life Science and Technology, Xi'an Jiaotong University, Xi'an, 710049, P.R. China
| | - Zhi Yang
- Department of Joint Surgery, Honghui Hospital, Xi'an Jiaotong University, Xi'an, Shaanxi, 710054, P.R. China.
| | - Tie-Lin Yang
- Key Laboratory of Biomedical Information Engineering of Ministry of Education, Biomedical Informatics & Genomics Center, School of Life Science and Technology, Xi'an Jiaotong University, Xi'an, 710049, P.R. China.
| |
|
28
|
Taskiran II, Spanier KI, Dickmänken H, Kempynck N, Pančíková A, Ekşi EC, Hulselmans G, Ismail JN, Theunis K, Vandepoel R, Christiaens V, Mauduit D, Aerts S. Cell-type-directed design of synthetic enhancers. Nature 2024; 626:212-220. [PMID: 38086419 PMCID: PMC10830415 DOI: 10.1038/s41586-023-06936-2] [Citation(s) in RCA: 18] [Impact Index Per Article: 18.0] [Received: 07/06/2022] [Accepted: 12/05/2023] [Indexed: 01/19/2024]
Abstract
Transcriptional enhancers act as docking stations for combinations of transcription factors and thereby regulate spatiotemporal activation of their target genes [1]. It has been a long-standing goal in the field to decode the regulatory logic of an enhancer and to understand the details of how spatiotemporal gene expression is encoded in an enhancer sequence. Here we show that deep learning models [2-6] can be used to efficiently design synthetic, cell-type-specific enhancers, starting from random sequences, and that this optimization process allows detailed tracing of enhancer features at single-nucleotide resolution. We evaluate the function of fully synthetic enhancers to specifically target Kenyon cells or glial cells in the fruit fly brain using transgenic animals. We further exploit enhancer design to create 'dual-code' enhancers that target two cell types and minimal enhancers smaller than 50 base pairs that are fully functional. By examining the state space searches towards local optima, we characterize enhancer codes through the strength, combination and arrangement of transcription factor activator and transcription factor repressor motifs. Finally, we apply the same strategies to successfully design human enhancers, which adhere to enhancer rules similar to those of Drosophila enhancers. Enhancer design guided by deep learning leads to better understanding of how enhancers work and shows that their code can be exploited to manipulate cell states.
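The idea of optimizing a random sequence towards a local optimum, one nucleotide at a time, can be sketched as a greedy search. Everything specific here is an assumption made for illustration: `toy_score` is a stand-in for a trained deep learning model, `MOTIF` is a hypothetical activator motif, and the greedy acceptance rule is one simple search strategy, not necessarily the one used in the paper.

```python
import random

BASES = "ACGT"
MOTIF = "GATAA"  # hypothetical activator motif, for illustration only


def toy_score(seq: str) -> int:
    """Stand-in for a model prediction: the best partial match to MOTIF
    over all windows, counted as the number of matching positions."""
    w = len(MOTIF)
    return max(
        sum(a == b for a, b in zip(seq[i:i + w], MOTIF))
        for i in range(len(seq) - w + 1)
    )


def greedy_design(seq: str, max_steps: int = 20) -> tuple[str, list[int]]:
    """At each step, score every single-nucleotide substitution and keep the
    best improving one; stop at a local optimum. Returns the final sequence
    and the score after each accepted mutation."""
    trace = [toy_score(seq)]
    for _ in range(max_steps):
        best_seq, best = seq, trace[-1]
        for pos in range(len(seq)):
            for base in BASES:
                if base == seq[pos]:
                    continue
                candidate = seq[:pos] + base + seq[pos + 1:]
                score = toy_score(candidate)
                if score > best:
                    best_seq, best = candidate, score
        if best_seq == seq:  # no improving mutation: local optimum reached
            break
        seq = best_seq
        trace.append(best)
    return seq, trace


rng = random.Random(0)
start = "".join(rng.choice(BASES) for _ in range(30))
designed, trace = greedy_design(start)
print(start, designed, trace)
```

Because each accepted step changes exactly one base, the score trace records which nucleotide changes drove the improvement, which is the sense in which such a search can be followed at single-nucleotide resolution.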
Affiliation(s)
- Ibrahim I Taskiran
- Laboratory of Computational Biology, VIB Center for AI & Computational Biology (VIB.AI), Leuven, Belgium
- VIB-KULeuven Center for Brain & Disease Research, Leuven, Belgium
- Department of Human Genetics, KU Leuven, Leuven, Belgium
| | - Katina I Spanier
- Laboratory of Computational Biology, VIB Center for AI & Computational Biology (VIB.AI), Leuven, Belgium
- VIB-KULeuven Center for Brain & Disease Research, Leuven, Belgium
- Department of Human Genetics, KU Leuven, Leuven, Belgium
| | - Hannah Dickmänken
- Laboratory of Computational Biology, VIB Center for AI & Computational Biology (VIB.AI), Leuven, Belgium
- VIB-KULeuven Center for Brain & Disease Research, Leuven, Belgium
- Department of Human Genetics, KU Leuven, Leuven, Belgium
| | - Niklas Kempynck
- Laboratory of Computational Biology, VIB Center for AI & Computational Biology (VIB.AI), Leuven, Belgium
- VIB-KULeuven Center for Brain & Disease Research, Leuven, Belgium
- Department of Human Genetics, KU Leuven, Leuven, Belgium
| | - Alexandra Pančíková
- Laboratory of Computational Biology, VIB Center for AI & Computational Biology (VIB.AI), Leuven, Belgium
- VIB-KULeuven Center for Brain & Disease Research, Leuven, Belgium
- Department of Human Genetics, KU Leuven, Leuven, Belgium
- VIB-KULeuven Center for Cancer Biology, Leuven, Belgium
| | - Eren Can Ekşi
- Laboratory of Computational Biology, VIB Center for AI & Computational Biology (VIB.AI), Leuven, Belgium
- VIB-KULeuven Center for Brain & Disease Research, Leuven, Belgium
- Department of Human Genetics, KU Leuven, Leuven, Belgium
| | - Gert Hulselmans
- Laboratory of Computational Biology, VIB Center for AI & Computational Biology (VIB.AI), Leuven, Belgium
- VIB-KULeuven Center for Brain & Disease Research, Leuven, Belgium
- Department of Human Genetics, KU Leuven, Leuven, Belgium
| | - Joy N Ismail
- Laboratory of Computational Biology, VIB Center for AI & Computational Biology (VIB.AI), Leuven, Belgium
- Department of Human Genetics, KU Leuven, Leuven, Belgium
- UK Dementia Research Institute at Imperial College London, London, UK
| | - Koen Theunis
- Laboratory of Computational Biology, VIB Center for AI & Computational Biology (VIB.AI), Leuven, Belgium
- VIB-KULeuven Center for Brain & Disease Research, Leuven, Belgium
- Department of Human Genetics, KU Leuven, Leuven, Belgium
| | - Roel Vandepoel
- Laboratory of Computational Biology, VIB Center for AI & Computational Biology (VIB.AI), Leuven, Belgium
- VIB-KULeuven Center for Brain & Disease Research, Leuven, Belgium
- Department of Human Genetics, KU Leuven, Leuven, Belgium
| | - Valerie Christiaens
- Laboratory of Computational Biology, VIB Center for AI & Computational Biology (VIB.AI), Leuven, Belgium
- VIB-KULeuven Center for Brain & Disease Research, Leuven, Belgium
- Department of Human Genetics, KU Leuven, Leuven, Belgium
| | - David Mauduit
- Laboratory of Computational Biology, VIB Center for AI & Computational Biology (VIB.AI), Leuven, Belgium
- VIB-KULeuven Center for Brain & Disease Research, Leuven, Belgium
- Department of Human Genetics, KU Leuven, Leuven, Belgium
| | - Stein Aerts
- Laboratory of Computational Biology, VIB Center for AI & Computational Biology (VIB.AI), Leuven, Belgium.
- VIB-KULeuven Center for Brain & Disease Research, Leuven, Belgium.
- Department of Human Genetics, KU Leuven, Leuven, Belgium.
| |
|
29
|
de Almeida BP, Schaub C, Pagani M, Secchia S, Furlong EEM, Stark A. Targeted design of synthetic enhancers for selected tissues in the Drosophila embryo. Nature 2024; 626:207-211. [PMID: 38086418 PMCID: PMC10830412 DOI: 10.1038/s41586-023-06905-9] [Citation(s) in RCA: 23] [Impact Index Per Article: 23.0] [Received: 06/19/2023] [Accepted: 11/28/2023] [Indexed: 01/19/2024]
Abstract
Enhancers control gene expression and have crucial roles in development and homeostasis [1-3]. However, the targeted de novo design of enhancers with tissue-specific activities has remained challenging. Here we combine deep learning and transfer learning to design tissue-specific enhancers for five tissues in the Drosophila melanogaster embryo: the central nervous system, epidermis, gut, muscle and brain. We first train convolutional neural networks using genome-wide single-cell assay for transposase-accessible chromatin with sequencing (ATAC-seq) datasets and then fine-tune the convolutional neural networks with smaller-scale data from in vivo enhancer activity assays, yielding models with 13% to 76% positive predictive value according to cross-validation. We designed and experimentally assessed 40 synthetic enhancers (8 per tissue) in vivo, of which 31 (78%) were active and 27 (68%) functioned in the target tissue (100% for central nervous system and muscle). The strategy of combining genome-wide and small-scale functional datasets by transfer learning is generally applicable and should enable the design of tissue-, cell type- and cell state-specific enhancers in any system.
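The success rates quoted in this abstract follow directly from the reported counts. The short check below reproduces them; the helper name `percent` is hypothetical, and the last figure (on-target enhancers as a share of active ones) is derived here from the reported counts and is not stated in the abstract.

```python
def percent(part: int, whole: int) -> int:
    """Percentage rounded half up, using integer arithmetic only."""
    return (200 * part + whole) // (2 * whole)


designed, active, on_target = 40, 31, 27  # counts reported in the abstract

active_pct = percent(active, designed)        # share of designs that were active
on_target_pct = percent(on_target, designed)  # share active in the target tissue
# Derived here, not reported in the abstract: on-target share among active designs.
on_target_among_active = percent(on_target, active)
print(active_pct, on_target_pct, on_target_among_active)
```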
Affiliation(s)
- Bernardo P de Almeida
- Research Institute of Molecular Pathology (IMP), Vienna BioCenter (VBC), Vienna, Austria
- Vienna BioCenter PhD Program, Doctoral School of the University of Vienna and Medical University of Vienna, Vienna, Austria
- InstaDeep, Paris, France
| | - Christoph Schaub
- European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Heidelberg, Germany
| | - Michaela Pagani
- Research Institute of Molecular Pathology (IMP), Vienna BioCenter (VBC), Vienna, Austria
| | - Stefano Secchia
- European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Heidelberg, Germany
| | - Eileen E M Furlong
- European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Heidelberg, Germany
| | - Alexander Stark
- Research Institute of Molecular Pathology (IMP), Vienna BioCenter (VBC), Vienna, Austria.
- Medical University of Vienna, Vienna BioCenter (VBC), Vienna, Austria.
|
30
|
Jiang D, Zhang J. Ascertainment Bias in the Genomic Test of Positive Selection on Regulatory Sequences. Mol Biol Evol 2024; 41:msad284. [PMID: 38149460 PMCID: PMC10766478 DOI: 10.1093/molbev/msad284] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/06/2023] [Revised: 11/12/2023] [Accepted: 12/22/2023] [Indexed: 12/28/2023] Open
Abstract
Evolution of gene expression mediated by cis-regulatory changes is thought to be an important contributor to organismal adaptation, but identifying adaptive cis-regulatory changes is challenging due to the difficulty in knowing the expectation under no positive selection. A new approach for detecting positive selection on transcription factor binding sites (TFBSs) was recently developed, thanks to the application of machine learning in predicting transcription factor (TF) binding affinities of DNA sequences. Given a TFBS sequence from a focal species and the corresponding inferred ancestral sequence that differs from the former at n sites, one can predict the TF-binding affinities of many n-step mutational neighbors of the ancestral sequence and obtain a null distribution of the derived binding affinity, which allows testing whether the binding affinity of the real derived sequence deviates significantly from the null distribution. Applying this test genomically to all experimentally identified binding sites of 3 TFs in humans, a recent study reported positive selection for elevated binding affinities of TFBSs. Here, we show that this genomic test suffers from an ascertainment bias because, even in the absence of positive selection for strengthened binding, the binding affinities of known human TFBSs are more likely to have increased than decreased in evolution. We demonstrate by computer simulation that this bias inflates the false positive rate of the selection test. We propose several methods to mitigate the ascertainment bias and show that almost all previously reported positive selection signals disappear when these methods are applied.
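The test described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `toy_affinity` is a hypothetical stand-in for a machine-learning affinity predictor, the motif is arbitrary, and the one-sided empirical p-value (with a +1 correction) is an assumed choice. The sketch also does nothing to correct the ascertainment bias the abstract reports.

```python
import random

BASES = "ACGT"

def toy_affinity(seq, motif="TGACTCA"):
    """Hypothetical scorer: best number of matches to `motif` over all windows."""
    best = 0
    for i in range(len(seq) - len(motif) + 1):
        window = seq[i:i + len(motif)]
        best = max(best, sum(a == b for a, b in zip(window, motif)))
    return best

def random_neighbor(ancestral, n, rng):
    """Return a sequence that differs from `ancestral` at exactly n sites."""
    seq = list(ancestral)
    for site in rng.sample(range(len(ancestral)), n):
        seq[site] = rng.choice([b for b in BASES if b != ancestral[site]])
    return "".join(seq)

def selection_test(ancestral, derived, affinity=toy_affinity,
                   n_samples=1000, seed=0):
    """Compare the derived affinity with a null built from n-step neighbors."""
    if len(ancestral) != len(derived):
        raise ValueError("sequences must have equal length")
    # n = number of sites at which the derived sequence differs from the ancestor
    n = sum(a != d for a, d in zip(ancestral, derived))
    rng = random.Random(seed)
    observed = affinity(derived)
    null = [affinity(random_neighbor(ancestral, n, rng))
            for _ in range(n_samples)]
    # One-sided empirical p-value for elevated affinity (assumed convention)
    p_value = (1 + sum(x >= observed for x in null)) / (n_samples + 1)
    return n, observed, p_value

n, observed, p_value = selection_test("GGTGACTGAGG", "GGTGACTCAGG")
```

Because only binding sites that are detectable today enter such a test, a small p-value here does not by itself indicate positive selection, which is the point the abstract makes.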
Affiliation(s)
- Daohan Jiang
- Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI 48109, USA
- Present address: Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
| | - Jianzhi Zhang
- Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI 48109, USA
|
31
|
Lee D, Han SK, Yaacov O, Berk-Rauch H, Mathiyalagan P, Ganesh SK, Chakravarti A. Tissue-specific and tissue-agnostic effects of genome sequence variation modulating blood pressure. Cell Rep 2023; 42:113351. [PMID: 37910504 PMCID: PMC10726310 DOI: 10.1016/j.celrep.2023.113351] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 10/25/2022] [Revised: 09/21/2023] [Accepted: 10/11/2023] [Indexed: 11/03/2023] Open
Abstract
Genome-wide association studies (GWASs) have identified numerous variants associated with polygenic traits and diseases. However, with few exceptions, a mechanistic understanding of which variants affect which genes in which tissues to modulate trait variation is lacking. Here, we present genomic analyses to explain trait heritability of blood pressure (BP) through the genetics of transcriptional regulation using GWASs, multiomics data from different tissues, and machine learning approaches. Approximately 500,000 predicted regulatory variants across four tissues explain 33.4% of variant heritability: 2.5%, 5.3%, 7.7%, and 11.8% for kidney-, adrenal-, heart-, and artery-specific variants, respectively. Variation in the enhancers involved shows greater tissue specificity than in the genes they regulate, suggesting that gene regulatory networks perturbed by enhancer variants in a tissue relevant to a phenotype are the major source of interindividual variation in BP. Thus, our study provides an approach to scan human tissue and cell types for their physiological contribution to any trait.
Affiliation(s)
- Dongwon Lee
- Department of Pediatrics, Division of Nephrology, Boston Children's Hospital, Boston & Harvard Medical School, Boston, MA, USA.
| | - Seong Kyu Han
- Department of Pediatrics, Division of Nephrology, Boston Children's Hospital, Boston & Harvard Medical School, Boston, MA, USA
| | - Or Yaacov
- Center for Human Genetics and Genomics, New York University Grossman School of Medicine, New York, NY, USA
| | - Hanna Berk-Rauch
- Center for Human Genetics and Genomics, New York University Grossman School of Medicine, New York, NY, USA
| | - Prabhu Mathiyalagan
- Center for Human Genetics and Genomics, New York University Grossman School of Medicine, New York, NY, USA
| | - Santhi K Ganesh
- Department of Internal Medicine & Department of Human Genetics, University of Michigan, Ann Arbor, MI, USA
| | - Aravinda Chakravarti
- Center for Human Genetics and Genomics, New York University Grossman School of Medicine, New York, NY, USA.
|
32
|
He S, Gao B, Sabnis R, Sun Q. Nucleic Transformer: Classifying DNA Sequences with Self-Attention and Convolutions. ACS Synth Biol 2023; 12:3205-3214. [PMID: 37916871 PMCID: PMC10863451 DOI: 10.1021/acssynbio.3c00154] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2023] [Revised: 10/04/2023] [Accepted: 10/06/2023] [Indexed: 11/03/2023]
Abstract
Much work has been done to apply machine learning and deep learning to genomics tasks, but these applications usually require extensive domain knowledge, and the resulting models provide very limited interpretability. Here, we present the Nucleic Transformer, a conceptually simple but effective and interpretable model architecture that excels in the classification of DNA sequences. The Nucleic Transformer employs self-attention and convolutions on nucleic acid sequences, leveraging two prominent deep learning strategies commonly used in computer vision and natural language analysis. We demonstrate that the Nucleic Transformer can be trained without much domain knowledge to achieve high performance in Escherichia coli promoter classification, viral genome identification, enhancer classification, and chromatin profile predictions.
Affiliation(s)
- Shujun He
- Department of Chemical Engineering, Texas A&M University, College Station, Texas 77840, United States
| | - Baizhen Gao
- Department of Chemical Engineering, Texas A&M University, College Station, Texas 77840, United States
| | - Rushant Sabnis
- Department of Chemical Engineering, Texas A&M University, College Station, Texas 77840, United States
| | - Qing Sun
- Department of Chemical Engineering, Texas A&M University, College Station, Texas 77840, United States
|
33
|
Bhogale S, Seward C, Stubbs L, Sinha S. SEAMoD: A fully interpretable neural network for cis-regulatory analysis of differentially expressed genes. bioRxiv 2023:2023.11.09.565900. [PMID: 38014229 PMCID: PMC10680628 DOI: 10.1101/2023.11.09.565900] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2023]
Abstract
A common way to investigate gene regulatory mechanisms is to identify differentially expressed genes using transcriptomics, find their candidate enhancers using epigenomics, and search for over-represented transcription factor (TF) motifs in these enhancers using bioinformatics tools. A related follow-up task is to model gene expression as a function of enhancer sequences and rank TF motifs by their contribution to such models, thus prioritizing among regulators. We present a new computational tool called SEAMoD that performs the above tasks of motif finding and sequence-to-expression modeling simultaneously. It trains a convolutional neural network model to relate enhancer sequences to differential expression in one or more biological conditions. The model uses TF motifs to interpret the sequences, learning these motifs and their relative importance to each biological condition from data. It also utilizes epigenomic information in the form of activity scores of putative enhancers and automatically searches for the most promising enhancer for each gene. Compared to existing neural network models of non-coding sequences, SEAMoD uses far fewer parameters, requires far less training data, and emphasizes biological interpretability. We used SEAMoD to understand regulatory mechanisms underlying the differentiation of neural stem cells (NSCs) derived from mouse forebrain. We profiled gene expression and histone modifications in NSCs and three differentiated cell types and used SEAMoD to model differential expression of nearly 12,000 genes with an accuracy of 81%, in the process identifying Olig2, E2f family TFs, Foxo3, and Tcf4 as key transcriptional regulators of the differentiation process.
|
34
|
Lagunas T, Plassmeyer SP, Fischer AD, Friedman RZ, Rieger MA, Selmanovic D, Sarafinovska S, Sol YK, Kasper MJ, Fass SB, Aguilar Lucero AF, An JY, Sanders SJ, Cohen BA, Dougherty JD. A Cre-dependent massively parallel reporter assay allows for cell-type specific assessment of the functional effects of non-coding elements in vivo. Commun Biol 2023; 6:1151. [PMID: 37953348 PMCID: PMC10641075 DOI: 10.1038/s42003-023-05483-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/15/2023] [Accepted: 10/18/2023] [Indexed: 11/14/2023] Open
Abstract
The function of regulatory elements is highly dependent on the cellular context, and thus for understanding the function of elements associated with psychiatric diseases these would ideally be studied in neurons in a living brain. Massively Parallel Reporter Assays (MPRAs) are molecular genetic tools that enable functional screening of hundreds of predefined sequences in a single experiment. These assays have not yet been adapted to query specific cell types in vivo in a complex tissue like the mouse brain. Here, using a test-case 3'UTR MPRA library with genomic elements containing variants from autism patients, we developed a method to achieve reproducible measurements of element effects in vivo in a cell type-specific manner, using excitatory cortical neurons and striatal medium spiny neurons as test cases. This targeted technique should enable robust, functional annotation of genetic elements in the cellular contexts most relevant to psychiatric disease.
Affiliation(s)
- Tomas Lagunas
- Department of Genetics, Washington University School of Medicine, 660 S. Euclid Ave, Saint Louis, MO, 63108, USA
- Department of Psychiatry, Washington University School of Medicine, 660 S. Euclid Ave, Saint Louis, MO, 63108, USA
- Division of Biology and Biomedical Sciences, Washington University School of Medicine, 660 S. Euclid Ave, Saint Louis, MO, 63108, USA
| | - Stephen P Plassmeyer
- Department of Genetics, Washington University School of Medicine, 660 S. Euclid Ave, Saint Louis, MO, 63108, USA
- Department of Psychiatry, Washington University School of Medicine, 660 S. Euclid Ave, Saint Louis, MO, 63108, USA
| | - Anthony D Fischer
- Department of Genetics, Washington University School of Medicine, 660 S. Euclid Ave, Saint Louis, MO, 63108, USA
- Department of Psychiatry, Washington University School of Medicine, 660 S. Euclid Ave, Saint Louis, MO, 63108, USA
| | - Ryan Z Friedman
- Department of Genetics, Washington University School of Medicine, 660 S. Euclid Ave, Saint Louis, MO, 63108, USA
- Division of Biology and Biomedical Sciences, Washington University School of Medicine, 660 S. Euclid Ave, Saint Louis, MO, 63108, USA
| | - Michael A Rieger
- Department of Genetics, Washington University School of Medicine, 660 S. Euclid Ave, Saint Louis, MO, 63108, USA
- Department of Psychiatry, Washington University School of Medicine, 660 S. Euclid Ave, Saint Louis, MO, 63108, USA
- Division of Biology and Biomedical Sciences, Washington University School of Medicine, 660 S. Euclid Ave, Saint Louis, MO, 63108, USA
| | - Din Selmanovic
- Department of Genetics, Washington University School of Medicine, 660 S. Euclid Ave, Saint Louis, MO, 63108, USA
- Department of Psychiatry, Washington University School of Medicine, 660 S. Euclid Ave, Saint Louis, MO, 63108, USA
- Division of Biology and Biomedical Sciences, Washington University School of Medicine, 660 S. Euclid Ave, Saint Louis, MO, 63108, USA
| | - Simona Sarafinovska
- Department of Genetics, Washington University School of Medicine, 660 S. Euclid Ave, Saint Louis, MO, 63108, USA
- Department of Psychiatry, Washington University School of Medicine, 660 S. Euclid Ave, Saint Louis, MO, 63108, USA
| | - Yvette K Sol
- Department of Genetics, Washington University School of Medicine, 660 S. Euclid Ave, Saint Louis, MO, 63108, USA
- Department of Psychiatry, Washington University School of Medicine, 660 S. Euclid Ave, Saint Louis, MO, 63108, USA
| | - Michael J Kasper
- Department of Genetics, Washington University School of Medicine, 660 S. Euclid Ave, Saint Louis, MO, 63108, USA
- Department of Psychiatry, Washington University School of Medicine, 660 S. Euclid Ave, Saint Louis, MO, 63108, USA
| | - Stuart B Fass
- Department of Genetics, Washington University School of Medicine, 660 S. Euclid Ave, Saint Louis, MO, 63108, USA
- Department of Psychiatry, Washington University School of Medicine, 660 S. Euclid Ave, Saint Louis, MO, 63108, USA
| | - Alessandra F Aguilar Lucero
- Department of Psychiatry and Behavioral Sciences, UCSF Weill Institute for Neuroscience, University of California San Francisco, San Francisco, CA, 94518, USA
| | - Joon-Yong An
- Department of Integrated Biomedical and Life Science, Korea University, Seoul, 02841, Republic of Korea
- School of Biosystem and Biomedical Science, College of Health Science, Korea University, Seoul, 02841, Republic of Korea
| | - Stephan J Sanders
- Department of Psychiatry and Behavioral Sciences, UCSF Weill Institute for Neuroscience, University of California San Francisco, San Francisco, CA, 94518, USA
| | - Barak A Cohen
- Department of Genetics, Washington University School of Medicine, 660 S. Euclid Ave, Saint Louis, MO, 63108, USA
| | - Joseph D Dougherty
- Department of Genetics, Washington University School of Medicine, 660 S. Euclid Ave, Saint Louis, MO, 63108, USA.
- Department of Psychiatry, Washington University School of Medicine, 660 S. Euclid Ave, Saint Louis, MO, 63108, USA.
|
35
|
Yue T, Wang Y, Zhang L, Gu C, Xue H, Wang W, Lyu Q, Dun Y. Deep Learning for Genomics: From Early Neural Nets to Modern Large Language Models. Int J Mol Sci 2023; 24:15858. [PMID: 37958843 PMCID: PMC10649223 DOI: 10.3390/ijms242115858] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/30/2023] [Revised: 10/24/2023] [Accepted: 10/30/2023] [Indexed: 11/15/2023] Open
Abstract
The data explosion driven by advancements in genomic research, such as high-throughput sequencing techniques, is constantly challenging conventional methods used in genomics. In parallel with the urgent demand for robust algorithms, deep learning has succeeded in various fields such as vision, speech, and text processing. Yet genomics poses unique challenges to deep learning, since we expect from deep learning a superhuman intelligence that explores beyond our knowledge to interpret the genome. A powerful deep learning model should rely on the insightful utilization of task-specific knowledge. In this paper, we briefly discuss the strengths of different deep learning models from a genomic perspective so as to fit each particular task with a proper deep learning-based architecture, and we remark on practical considerations of developing deep learning architectures for genomics. We also provide a concise review of deep learning applications in various aspects of genomic research and point out current challenges and potential research directions for future genomics applications. We believe the collaborative use of ever-growing diverse data and the fast iteration of deep learning models will continue to contribute to the future of genomics.
Affiliation(s)
- Tianwei Yue
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Yuanxin Wang
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Longxiang Zhang
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Chunming Gu
- Department of Biomedical Engineering, School of Medicine, Johns Hopkins University, Baltimore, MD 21218, USA
| | - Haoru Xue
- The Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Wenping Wang
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Qi Lyu
- Department of Computational Mathematics, Science, and Engineering, Michigan State University, East Lansing, MI 48824, USA
| | - Yujie Dun
- School of Information and Communications Engineering, Xi’an Jiaotong University, Xi’an 710049, China
|
36
|
Guo MG, Reynolds DL, Ang CE, Liu Y, Zhao Y, Donohue LKH, Siprashvili Z, Yang X, Yoo Y, Mondal S, Hong A, Kain J, Meservey L, Fabo T, Elfaki I, Kellman LN, Abell NS, Pershad Y, Bayat V, Etminani P, Holodniy M, Geschwind DH, Montgomery SB, Duncan LE, Urban AE, Altman RB, Wernig M, Khavari PA. Integrative analyses highlight functional regulatory variants associated with neuropsychiatric diseases. Nat Genet 2023; 55:1876-1891. [PMID: 37857935 PMCID: PMC10859123 DOI: 10.1038/s41588-023-01533-5] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 06/24/2021] [Accepted: 09/15/2023] [Indexed: 10/21/2023]
Abstract
Noncoding variants of presumed regulatory function contribute to the heritability of neuropsychiatric disease. A total of 2,221 noncoding variants connected to risk for ten neuropsychiatric disorders, including autism spectrum disorder, attention deficit hyperactivity disorder, bipolar disorder, borderline personality disorder, major depression, generalized anxiety disorder, panic disorder, post-traumatic stress disorder, obsessive-compulsive disorder and schizophrenia, were studied in developing human neural cells. Integrating epigenomic and transcriptomic data with massively parallel reporter assays identified differentially-active single-nucleotide variants (daSNVs) in specific neural cell types. Expression-gene mapping, network analyses and chromatin looping nominated candidate disease-relevant target genes modulated by these daSNVs. Follow-up integration of daSNV gene editing with clinical cohort analyses suggested that magnesium transport dysfunction may increase neuropsychiatric disease risk and indicated that common genetic pathomechanisms may mediate specific symptoms that are shared across multiple neuropsychiatric diseases.
Affiliation(s)
- Margaret G Guo
- Stanford Program in Biomedical Informatics, Stanford University, Stanford, CA, USA
- Program in Epithelial Biology, Stanford University, Stanford, CA, USA
| | - David L Reynolds
- Program in Epithelial Biology, Stanford University, Stanford, CA, USA
| | - Cheen E Ang
- Department of Pathology, Stanford University, Stanford, CA, USA
- Department of Bioengineering, Stanford University, Stanford, CA, USA
- Institute for Stem Cell Biology & Regenerative Medicine, Stanford University, Stanford, CA, USA
| | - Yingfei Liu
- Institute for Stem Cell Biology & Regenerative Medicine, Stanford University, Stanford, CA, USA
- Institute of Neurobiology, Xi'an Jiaotong University Health Science Center, Xi'an, China
| | - Yang Zhao
- Program in Epithelial Biology, Stanford University, Stanford, CA, USA
| | - Laura K H Donohue
- Program in Epithelial Biology, Stanford University, Stanford, CA, USA
- Department of Genetics, Stanford University, Stanford, CA, USA
| | - Zurab Siprashvili
- Program in Epithelial Biology, Stanford University, Stanford, CA, USA
| | - Xue Yang
- Program in Epithelial Biology, Stanford University, Stanford, CA, USA
- Stanford Program in Cancer Biology, Stanford University, Stanford, CA, USA
| | - Yongjin Yoo
- Institute for Stem Cell Biology & Regenerative Medicine, Stanford University, Stanford, CA, USA
| | - Smarajit Mondal
- Program in Epithelial Biology, Stanford University, Stanford, CA, USA
| | - Audrey Hong
- Program in Epithelial Biology, Stanford University, Stanford, CA, USA
| | - Jessica Kain
- Department of Genetics, Stanford University, Stanford, CA, USA
| | | | - Tania Fabo
- Program in Epithelial Biology, Stanford University, Stanford, CA, USA
- Department of Genetics, Stanford University, Stanford, CA, USA
| | - Ibtihal Elfaki
- Program in Epithelial Biology, Stanford University, Stanford, CA, USA
- Department of Genetics, Stanford University, Stanford, CA, USA
| | - Laura N Kellman
- Program in Epithelial Biology, Stanford University, Stanford, CA, USA
- Stanford Program in Cancer Biology, Stanford University, Stanford, CA, USA
| | - Nathan S Abell
- Department of Genetics, Stanford University, Stanford, CA, USA
| | - Yash Pershad
- Department of Bioengineering, Stanford University, Stanford, CA, USA
| | | | | | - Mark Holodniy
- Public Health Surveillance and Research, Department of Veterans Affairs, Washington, DC, USA
- Division of Infectious Disease & Geographic Medicine, Stanford University School of Medicine, Stanford, CA, USA
| | - Daniel H Geschwind
- Program in Neurobehavioral Genetics, Semel Institute, UCLA, Los Angeles, CA, USA
| | - Stephen B Montgomery
- Department of Pathology, Stanford University, Stanford, CA, USA
- Department of Genetics, Stanford University, Stanford, CA, USA
| | - Laramie E Duncan
- Department of Psychiatry and Behavioral Sciences, Stanford University, Stanford, CA, USA
| | - Alexander E Urban
- Department of Genetics, Stanford University, Stanford, CA, USA
- Department of Psychiatry and Behavioral Sciences, Stanford University, Stanford, CA, USA
| | - Russ B Altman
- Stanford Program in Biomedical Informatics, Stanford University, Stanford, CA, USA
- Department of Bioengineering, Stanford University, Stanford, CA, USA
- Department of Genetics, Stanford University, Stanford, CA, USA
| | - Marius Wernig
- Department of Pathology, Stanford University, Stanford, CA, USA
- Institute for Stem Cell Biology & Regenerative Medicine, Stanford University, Stanford, CA, USA
| | - Paul A Khavari
- Program in Epithelial Biology, Stanford University, Stanford, CA, USA.
- Stanford Program in Cancer Biology, Stanford University, Stanford, CA, USA.
- Veterans Affairs Palo Alto Healthcare System, Palo Alto, CA, USA.
|
37
|
Tahara S, Tsuchiya T, Matsumoto H, Ozaki H. Transcription factor-binding k-mer analysis clarifies the cell type dependency of binding specificities and cis-regulatory SNPs in humans. BMC Genomics 2023; 24:597. [PMID: 37805453 PMCID: PMC10560430 DOI: 10.1186/s12864-023-09692-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/21/2023] [Accepted: 09/21/2023] [Indexed: 10/09/2023] Open
Abstract
BACKGROUND: Transcription factors (TFs) exhibit heterogeneous DNA-binding specificities in individual cells and whole organisms under natural conditions, and de novo motif discovery usually provides multiple motifs, even from a single chromatin immunoprecipitation-sequencing (ChIP-seq) sample. Despite the accumulation of ChIP-seq data and ChIP-seq-derived motifs, the diversity of DNA-binding specificities across different TFs and cell types remains largely unexplored. RESULTS: Here, we applied MOCCS2, our k-mer-based motif discovery method, to a collection of human TF ChIP-seq samples across diverse TFs and cell types, and systematically computed profiles of TF-binding specificity scores for all k-mers. After quality control, we compiled a set of TF-binding specificity score profiles for 2,976 high-quality ChIP-seq samples, comprising 473 TFs and 398 cell types. Using these high-quality samples, we confirmed that the k-mer-based TF-binding specificity profiles reflected TF- or TF-family dependent DNA-binding specificities. We then compared the binding specificity scores of ChIP-seq samples with the same TFs but with different cell type classes and found that half of the analyzed TFs exhibited differences in DNA-binding specificities across cell type classes. Additionally, we devised a method to detect differentially bound k-mers between two ChIP-seq samples and detected k-mers exhibiting statistically significant differences in binding specificity scores. Moreover, we demonstrated that differences in the binding specificity scores between k-mers on the reference and alternative alleles could be used to predict the effect of variants on TF binding, as validated by in vitro and in vivo assay datasets. Finally, we demonstrated that binding specificity score differences can be used to interpret disease-associated non-coding single-nucleotide polymorphisms (SNPs) as TF-affecting SNPs and to nominate the candidate TFs and cell types responsible.
CONCLUSIONS: Our study provides a basis for investigating the regulation of gene expression in a TF-, TF family-, or cell-type-dependent manner. Furthermore, our differential analysis of binding-specificity scores highlights noncoding disease-associated variants in humans.
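The idea of scoring a variant by the difference in k-mer binding specificity scores between the reference and alternative alleles might look like the sketch below. It is illustrative only: the score table is invented, and summing over every k-mer that overlaps the variant is an assumed aggregation, since the abstract does not say how MOCCS2-derived scores are combined.

```python
def overlapping_kmers(seq, pos, k):
    """Return every k-mer of `seq` that covers the 0-based position `pos`."""
    start = max(0, pos - k + 1)
    end = min(pos, len(seq) - k)
    return [seq[i:i + k] for i in range(start, end + 1)]

def delta_score(ref_seq, pos, alt_base, scores, k):
    """Summed k-mer score on the alternative allele minus the reference allele."""
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    ref_total = sum(scores.get(km, 0.0) for km in overlapping_kmers(ref_seq, pos, k))
    alt_total = sum(scores.get(km, 0.0) for km in overlapping_kmers(alt_seq, pos, k))
    return alt_total - ref_total

# Hypothetical score table: k-mers absent from it count as 0.0.
toy_scores = {"GATA": 2.0, "GATC": 0.5}
delta = delta_score("AAGATAAA", pos=5, alt_base="C", scores=toy_scores, k=4)
```

A negative value suggests that the alternative allele weakens predicted binding under this toy table, and a positive value suggests that it strengthens it.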
Affiliation(s)
- Saeko Tahara
- Bioinformatics Laboratory, Institute of Medicine, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8577, Japan
- School of Medicine, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8577, Japan
| | - Takaho Tsuchiya
- Bioinformatics Laboratory, Institute of Medicine, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8577, Japan
- Center for Artificial Intelligence Research, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8577, Japan
| | - Hirotaka Matsumoto
- School of Information and Data Sciences, Nagasaki University, 1-14, Bunkyo-Machi, Nagasaki City, Nagasaki, 852-8521, Japan
- Laboratory for Bioinformatics Research, RIKEN Center for Biosystems Dynamics, Wako, Saitama, 351-0198, Japan
| | - Haruka Ozaki
- Bioinformatics Laboratory, Institute of Medicine, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8577, Japan.
- Center for Artificial Intelligence Research, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8577, Japan.
- Laboratory for Bioinformatics Research, RIKEN Center for Biosystems Dynamics, Wako, Saitama, 351-0198, Japan.
|
38
|
Koido M. Polygenic modelling and machine learning approaches in pharmacogenomics: Importance in downstream analysis of genome-wide association study data. Br J Clin Pharmacol 2023. [PMID: 37743713 DOI: 10.1111/bcp.15913] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/31/2023] [Revised: 09/05/2023] [Accepted: 09/11/2023] [Indexed: 09/26/2023] Open
Abstract
Genome-wide association studies (GWAS) have identified genetic variations associated with adverse drug effects in pharmacogenomics (PGx) research. However, interpreting the biological implications of these associations remains a challenge. This review highlights 2 promising post-GWAS methods for PGx. First, we discuss the polygenic architecture of the PGx traits, especially for drug-induced liver injury. Experimental modelling using multiple donors' human primary hepatocytes and human liver organoids demonstrated the polygenic architecture of drug-induced liver injury susceptibility and found biological vulnerability in genetically high-risk tissue donors. Second, we discuss the challenges of interpreting the roles of variants in noncoding regions. Beyond methods involving expression quantitative trait locus analysis and massively parallel reporter assays, we suggest the use of in silico mutagenesis through machine learning methods to understand the roles of variants in transcriptional regulation. This review underscores the importance of these post-GWAS methods in providing critical insights into PGx, potentially facilitating drug development and personalized treatment.
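In silico mutagenesis, as mentioned in this abstract, amounts to scoring every possible single-nucleotide substitution with a trained model and recording the change from the reference prediction. The sketch below assumes a generic `predict` callable; `toy_predict` (a GC count) is a hypothetical placeholder, not a model from the review.

```python
def in_silico_mutagenesis(seq, predict):
    """Map (position, ref_base, alt_base) to the change in predicted activity."""
    ref_score = predict(seq)
    effects = {}
    for i, ref_base in enumerate(seq):
        for alt_base in "ACGT":
            if alt_base == ref_base:
                continue
            mutant = seq[:i] + alt_base + seq[i + 1:]
            effects[(i, ref_base, alt_base)] = predict(mutant) - ref_score
    return effects

def toy_predict(seq):
    """Hypothetical placeholder model: counts G and C bases."""
    return seq.count("G") + seq.count("C")

effects = in_silico_mutagenesis("ACGT", toy_predict)
```

In practice `predict` would wrap a sequence-to-activity model, and the largest absolute effects would point to candidate regulatory positions.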
Affiliation(s)
- Masaru Koido
- Laboratory of Complex Trait Genomics, Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Tokyo, Japan
|
39
|
Zhang J, Liu B, Wu J, Wang Z, Li J. DeepCAC: a deep learning approach on DNA transcription factors classification based on multi-head self-attention and concatenate convolutional neural network. BMC Bioinformatics 2023; 24:345. [PMID: 37723425 PMCID: PMC10506269 DOI: 10.1186/s12859-023-05469-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2023] [Accepted: 09/06/2023] [Indexed: 09/20/2023] Open
Abstract
Understanding gene expression processes necessitates the accurate classification and identification of transcription factors, which is supported by high-throughput sequencing technologies. However, these techniques suffer from inherent limitations such as time consumption and high costs. To address these challenges, the field of bioinformatics has increasingly turned to deep learning technologies for analyzing gene sequences. Nevertheless, the pursuit of improved experimental results has led to the inclusion of numerous complex analysis function modules, resulting in models with a growing number of parameters. To overcome these limitations, a novel approach for analyzing DNA transcription factor sequences, named DeepCAC, is proposed. This method leverages deep convolutional neural networks with a multi-head self-attention mechanism. By employing convolutional neural networks, it can effectively capture local hidden features in the sequences. Simultaneously, the multi-head self-attention mechanism enhances the identification of hidden features with long-distance dependencies. This approach reduces the overall number of parameters in the model while harnessing the computational power of sequence data from multi-head self-attention. Through training with labeled data, experiments demonstrate that this approach significantly improves performance while requiring fewer parameters compared to existing methods. Additionally, the effectiveness of our approach is validated in accurately predicting DNA transcription factor sequences.
Affiliation(s)
- Jidong Zhang
- Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China
| | - Bo Liu
- School of Mathematical and Computational Sciences, Massey University, Auckland, 0745, New Zealand.
| | - Jiahui Wu
- Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China
| | - Zhihan Wang
- Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China
| | - Jianqiang Li
- Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China
|
40
|
Jiang D, Zhang J. Ascertainment bias in the genomic test of positive selection on regulatory sequences. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.08.20.554030. [PMID: 37662307 PMCID: PMC10473660 DOI: 10.1101/2023.08.20.554030] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/05/2023]
Abstract
Evolution of gene expression mediated by cis-regulatory changes is thought to be an important contributor to organismal adaptation, but identifying adaptive cis-regulatory changes is challenging due to the difficulty in knowing the expectation under no positive selection. A new approach for detecting positive selection on transcription factor binding sites (TFBSs) was recently developed, thanks to the application of machine learning in predicting transcription factor (TF) binding affinities of DNA sequences. Given a TFBS sequence from a focal species and the corresponding inferred ancestral sequence that differs from the former at n sites, one can predict the TF binding affinities of many n-step mutational neighbors of the ancestral sequence and obtain a null distribution of the derived binding affinity, which allows testing whether the binding affinity of the real derived sequence deviates significantly from the null distribution. Applying this test genomically to all experimentally identified binding sites of three TFs in humans, a recent study reported positive selection for elevated binding affinities of TFBSs. Here we show that this genomic test suffers from an ascertainment bias because, even in the absence of positive selection for strengthened binding, the binding affinities of known human TFBSs are more likely to have increased than decreased in evolution. We demonstrate by computer simulation that this bias inflates the false positive rate of the selection test. We propose several methods to mitigate the ascertainment bias and show that almost all previously reported positive selection signals disappear when these methods are applied.
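The selection test described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: `toy_affinity` is a hypothetical stand-in for a trained binding-affinity predictor, the example motif is arbitrary, and mutations are drawn uniformly. The sketch reproduces the uncorrected test, so it does not address the ascertainment bias the abstract reports.

```python
import random

BASES = "ACGT"

def toy_affinity(seq, motif="TGACTCA"):
    """Hypothetical affinity predictor: best fraction of motif matches over all windows."""
    best = 0.0
    for start in range(len(seq) - len(motif) + 1):
        window = seq[start:start + len(motif)]
        matches = sum(a == b for a, b in zip(window, motif))
        best = max(best, matches / len(motif))
    return best

def random_n_step_neighbor(ancestral, n, rng):
    """Return a copy of the ancestral sequence mutated at exactly n distinct sites."""
    seq = list(ancestral)
    for pos in rng.sample(range(len(seq)), n):
        seq[pos] = rng.choice([b for b in BASES if b != seq[pos]])
    return "".join(seq)

def selection_test(ancestral, derived, n_samples=2000, seed=0):
    """One-sided empirical p-value for elevated affinity of the derived sequence.

    The null distribution is built from random n-step mutational neighbors of the
    ancestral sequence, where n is the number of sites at which the two differ.
    """
    assert len(ancestral) == len(derived)
    n = sum(a != d for a, d in zip(ancestral, derived))
    rng = random.Random(seed)
    observed = toy_affinity(derived)
    null = [
        toy_affinity(random_n_step_neighbor(ancestral, n, rng))
        for _ in range(n_samples)
    ]
    exceed = sum(score >= observed for score in null)
    return n, observed, (exceed + 1) / (n_samples + 1)

n_diff, observed_affinity, p_value = selection_test("TGACGGAACC", "TGACTCAACC")
```

A small p-value here only says that the derived affinity is unusually high relative to random neighbors of the ancestor; as the abstract argues, that alone can arise without positive selection when the tested sites were ascertained because they bind in the focal species.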
Affiliation(s)
- Daohan Jiang
- Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, Michigan 48109, USA
| | - Jianzhi Zhang
- Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, Michigan 48109, USA
|
41
|
Duan YY, Chen XF, Zhu RJ, Jia YY, Huang XT, Zhang M, Yang N, Dong SS, Zeng M, Feng Z, Zhu DL, Wu H, Jiang F, Shi W, Hu WX, Ke X, Chen H, Liu Y, Jing RH, Guo Y, Li M, Yang TL. High-throughput functional dissection of noncoding SNPs with biased allelic enhancer activity for insulin resistance-relevant phenotypes. Am J Hum Genet 2023; 110:1266-1288. [PMID: 37506691 PMCID: PMC10432149 DOI: 10.1016/j.ajhg.2023.07.002] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 04/25/2023] [Revised: 07/04/2023] [Accepted: 07/05/2023] [Indexed: 07/30/2023] Open
Abstract
Most of the single-nucleotide polymorphisms (SNPs) associated with insulin resistance (IR)-relevant phenotypes by genome-wide association studies (GWASs) are located in noncoding regions, complicating their functional interpretation. Here, we utilized an adapted STARR-seq to evaluate the regulatory activities of 5,987 noncoding SNPs associated with IR-relevant phenotypes. We identified 876 SNPs with biased allelic enhancer activity effects (baaSNPs) across 133 loci in three IR-relevant cell lines (HepG2, preadipocyte, and A673), which showed pervasive cell specificity and significant enrichment for cell-specific open chromatin regions or enhancer-indicative markers (H3K4me1, H3K27ac). Further functional characterization suggested several transcription factors (TFs) with preferential allelic binding to baaSNPs. We also incorporated multi-omics data to prioritize 102 candidate regulatory target genes for baaSNPs and revealed prevalent long-range regulatory effects and cell-specific IR-relevant biological functional enrichment on them. Specifically, we experimentally verified the distal regulatory mechanism at the IRS1 locus, in which rs952227-A reinforces IRS1 expression by long-range chromatin interaction and preferential binding to the transcription factor HOXC6 to augment the enhancer activity. Finally, based on our STARR-seq screening data, we predicted the enhancer activity of 227,343 noncoding SNPs associated with IR-relevant phenotypes (fasting insulin adjusted for BMI, HDL cholesterol, and triglycerides) from the largest available GWAS summary statistics. We further provided an open resource (http://www.bigc.online/fnSNP-IR) for better understanding genetic regulatory mechanisms of IR-relevant phenotypes.
Affiliation(s)
- Yuan-Yuan Duan
- Key Laboratory of Biomedical Information Engineering of Ministry of Education, Biomedical Informatics & Genomics Center, School of Life Science and Technology, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China
| | - Xiao-Feng Chen
- Key Laboratory of Biomedical Information Engineering of Ministry of Education, Biomedical Informatics & Genomics Center, School of Life Science and Technology, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China
| | - Ren-Jie Zhu
- Key Laboratory of Biomedical Information Engineering of Ministry of Education, Biomedical Informatics & Genomics Center, School of Life Science and Technology, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China
| | - Ying-Ying Jia
- Key Laboratory of Biomedical Information Engineering of Ministry of Education, Biomedical Informatics & Genomics Center, School of Life Science and Technology, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China
| | - Xiao-Ting Huang
- Key Laboratory of Biomedical Information Engineering of Ministry of Education, Biomedical Informatics & Genomics Center, School of Life Science and Technology, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China
| | - Meng Zhang
- Key Laboratory of Biomedical Information Engineering of Ministry of Education, Biomedical Informatics & Genomics Center, School of Life Science and Technology, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China
| | - Ning Yang
- Key Laboratory of Biomedical Information Engineering of Ministry of Education, Biomedical Informatics & Genomics Center, School of Life Science and Technology, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China
| | - Shan-Shan Dong
- Key Laboratory of Biomedical Information Engineering of Ministry of Education, Biomedical Informatics & Genomics Center, School of Life Science and Technology, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China
| | - Mengqi Zeng
- Frontier Institute of Science and Technology, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China
| | - Zhihui Feng
- Frontier Institute of Science and Technology, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China
| | - Dong-Li Zhu
- Key Laboratory of Biomedical Information Engineering of Ministry of Education, Biomedical Informatics & Genomics Center, School of Life Science and Technology, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China
| | - Hao Wu
- Key Laboratory of Biomedical Information Engineering of Ministry of Education, Biomedical Informatics & Genomics Center, School of Life Science and Technology, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China
| | - Feng Jiang
- Key Laboratory of Biomedical Information Engineering of Ministry of Education, Biomedical Informatics & Genomics Center, School of Life Science and Technology, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China
| | - Wei Shi
- Key Laboratory of Biomedical Information Engineering of Ministry of Education, Biomedical Informatics & Genomics Center, School of Life Science and Technology, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China
| | - Wei-Xin Hu
- Key Laboratory of Biomedical Information Engineering of Ministry of Education, Biomedical Informatics & Genomics Center, School of Life Science and Technology, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China
| | - Xin Ke
- Key Laboratory of Biomedical Information Engineering of Ministry of Education, Biomedical Informatics & Genomics Center, School of Life Science and Technology, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China
| | - Hao Chen
- Key Laboratory of Biomedical Information Engineering of Ministry of Education, Biomedical Informatics & Genomics Center, School of Life Science and Technology, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China
| | - Yunlong Liu
- Department of Medical and Molecular Genetics, School of Medicine, Indiana University, Indianapolis, IN 46202, USA
| | - Rui-Hua Jing
- Department of Ophthalmology, The Second Affiliated Hospital of Xi'an Jiaotong University, Xi'an, Shaanxi 710000, China
| | - Yan Guo
- Key Laboratory of Biomedical Information Engineering of Ministry of Education, Biomedical Informatics & Genomics Center, School of Life Science and Technology, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China
| | - Meng Li
- Department of Orthopedics, The First Affiliated Hospital of Xi'an Jiaotong University, Xi'an, Shaanxi 710061, China.
| | - Tie-Lin Yang
- Key Laboratory of Biomedical Information Engineering of Ministry of Education, Biomedical Informatics & Genomics Center, School of Life Science and Technology, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China; Department of Orthopedics, The First Affiliated Hospital of Xi'an Jiaotong University, Xi'an, Shaanxi 710061, China.
|
42
|
Nowling RJ, Njoya K, Peters JG, Riehle MM. Prediction accuracy of regulatory elements from sequence varies by functional sequencing technique. Front Cell Infect Microbiol 2023; 13:1182567. [PMID: 37600946 PMCID: PMC10433755 DOI: 10.3389/fcimb.2023.1182567] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/08/2023] [Accepted: 07/10/2023] [Indexed: 08/22/2023] Open
Abstract
Introduction: Various sequencing-based approaches are used to identify and characterize the activities of cis-regulatory elements in a genome-wide fashion. Some of these techniques rely on indirect markers such as histone modifications (ChIP-seq with histone antibodies) or chromatin accessibility (ATAC-seq, DNase-seq, FAIRE-seq), while other techniques use direct measures such as episomal assays measuring the enhancer properties of DNA sequences (STARR-seq) and direct measurement of the binding of transcription factors (ChIP-seq with transcription factor-specific antibodies). The activities of cis-regulatory elements such as enhancers, promoters, and repressors are determined by their sequence and secondary processes such as chromatin accessibility, DNA methylation, and bound histone markers. Methods: Here, machine learning models are employed to evaluate the accuracy with which cis-regulatory elements identified by various commonly used sequencing techniques can be predicted by their underlying sequence alone to distinguish between cis-regulatory activity that is reflective of sequence content versus secondary processes. Results and discussion: Models trained and evaluated on D. melanogaster sequences identified through DNase-seq and STARR-seq are significantly more accurate than models trained on sequences identified by H3K4me1, H3K4me3, and H3K27ac ChIP-seq, FAIRE-seq, and ATAC-seq. These results suggest that the activity detected by DNase-seq and STARR-seq can be largely explained by underlying DNA sequence, independent of secondary processes. Experimentally, a subset of DNase-seq and H3K4me1 ChIP-seq sequences were tested for enhancer activity using luciferase assays and compared with previous tests performed on STARR-seq sequences. The experimental data indicated that STARR-seq sequences are substantially enriched for enhancer-specific activity, while the DNase-seq and H3K4me1 ChIP-seq sequences are not.
Taken together, these results indicate that the DNase-seq approach identifies a broad class of regulatory elements of which enhancers are a subset and the associated data are appropriate for training models for detecting regulatory activity from sequence alone, STARR-seq data are best for training enhancer-specific sequence models, and H3K4me1 ChIP-seq data are not well suited for training and evaluating sequence-based models for cis-regulatory element prediction.
Affiliation(s)
- Ronald J. Nowling
- Electrical Engineering and Computer Science, Milwaukee School of Engineering, Milwaukee, WI, United States
| | - Kimani Njoya
- Department of Microbiology and Immunology, Medical College of Wisconsin, Milwaukee, WI, United States
| | - John G. Peters
- Electrical Engineering and Computer Science, Milwaukee School of Engineering, Milwaukee, WI, United States
| | - Michelle M. Riehle
- Department of Microbiology and Immunology, Medical College of Wisconsin, Milwaukee, WI, United States
|
43
|
Luo R, Yan J, Oh JW, Xi W, Shigaki D, Wong W, Cho HS, Murphy D, Cutler R, Rosen BP, Pulecio J, Yang D, Glenn RA, Chen T, Li QV, Vierbuchen T, Sidoli S, Apostolou E, Huangfu D, Beer MA. Dynamic network-guided CRISPRi screen identifies CTCF-loop-constrained nonlinear enhancer gene regulatory activity during cell state transitions. Nat Genet 2023; 55:1336-1346. [PMID: 37488417 PMCID: PMC11012226 DOI: 10.1038/s41588-023-01450-7] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Received: 10/15/2022] [Accepted: 06/20/2023] [Indexed: 07/26/2023]
Abstract
Comprehensive enhancer discovery is challenging because most enhancers, especially those contributing to complex diseases, have weak effects on gene expression. Our gene regulatory network modeling identified that nonlinear enhancer gene regulation during cell state transitions can be leveraged to improve the sensitivity of enhancer discovery. Using human embryonic stem cell definitive endoderm differentiation as a dynamic transition system, we conducted a mid-transition CRISPRi-based enhancer screen. We discovered a comprehensive set of enhancers for each of the core endoderm-specifying transcription factors. Many enhancers had strong effects mid-transition but weak effects post-transition, consistent with the nonlinear temporal responses to enhancer perturbation predicted by the modeling. Integrating three-dimensional genomic information, we were able to develop a CTCF-loop-constrained Interaction Activity model that can better predict functional enhancers compared to models that rely on Hi-C-based enhancer-promoter contact frequency. Our study provides generalizable strategies for sensitive and systematic enhancer discovery in both normal and pathological cell state transitions.
Affiliation(s)
- Renhe Luo
- Developmental Biology Program, Sloan Kettering Institute, New York City, NY, USA
- Louis V. Gerstner Jr. Graduate School of Biomedical Sciences, Memorial Sloan Kettering Cancer Center, New York City, NY, USA
| | - Jielin Yan
- Developmental Biology Program, Sloan Kettering Institute, New York City, NY, USA
- Louis V. Gerstner Jr. Graduate School of Biomedical Sciences, Memorial Sloan Kettering Cancer Center, New York City, NY, USA
| | - Jin Woo Oh
- Department of Biomedical Engineering and McKusick-Nathans Department of Genetic Medicine, Johns Hopkins University, Baltimore, MD, USA
| | - Wang Xi
- Department of Biomedical Engineering and McKusick-Nathans Department of Genetic Medicine, Johns Hopkins University, Baltimore, MD, USA
| | - Dustin Shigaki
- Department of Biomedical Engineering and McKusick-Nathans Department of Genetic Medicine, Johns Hopkins University, Baltimore, MD, USA
| | - Wilfred Wong
- Computational & Systems Biology Program, Sloan Kettering Institute, New York City, NY, USA
- Weill Cornell Graduate School of Medical Sciences, Weill Cornell Medicine, New York City, NY, USA
| | - Hyein S Cho
- Developmental Biology Program, Sloan Kettering Institute, New York City, NY, USA
| | - Dylan Murphy
- Weill Cornell Graduate School of Medical Sciences, Weill Cornell Medicine, New York City, NY, USA
- Department of Medicine, Weill Cornell Medicine, New York City, NY, USA
| | - Ronald Cutler
- Department of Biochemistry, Albert Einstein College of Medicine, Bronx, NY, USA
| | - Bess P Rosen
- Developmental Biology Program, Sloan Kettering Institute, New York City, NY, USA
- Weill Cornell Graduate School of Medical Sciences, Weill Cornell Medicine, New York City, NY, USA
| | - Julian Pulecio
- Developmental Biology Program, Sloan Kettering Institute, New York City, NY, USA
| | - Dapeng Yang
- Developmental Biology Program, Sloan Kettering Institute, New York City, NY, USA
| | - Rachel A Glenn
- Developmental Biology Program, Sloan Kettering Institute, New York City, NY, USA
- Weill Cornell Graduate School of Medical Sciences, Weill Cornell Medicine, New York City, NY, USA
| | - Tingxu Chen
- Developmental Biology Program, Sloan Kettering Institute, New York City, NY, USA
- Louis V. Gerstner Jr. Graduate School of Biomedical Sciences, Memorial Sloan Kettering Cancer Center, New York City, NY, USA
| | - Qing V Li
- Developmental Biology Program, Sloan Kettering Institute, New York City, NY, USA
- Louis V. Gerstner Jr. Graduate School of Biomedical Sciences, Memorial Sloan Kettering Cancer Center, New York City, NY, USA
| | - Thomas Vierbuchen
- Developmental Biology Program, Sloan Kettering Institute, New York City, NY, USA
| | - Simone Sidoli
- Department of Biochemistry, Albert Einstein College of Medicine, Bronx, NY, USA
| | - Effie Apostolou
- Department of Medicine, Weill Cornell Medicine, New York City, NY, USA
| | - Danwei Huangfu
- Developmental Biology Program, Sloan Kettering Institute, New York City, NY, USA.
| | - Michael A Beer
- Department of Biomedical Engineering and McKusick-Nathans Department of Genetic Medicine, Johns Hopkins University, Baltimore, MD, USA.
|
44
|
Ober-Reynolds B, Wang C, Ko JM, Rios EJ, Aasi SZ, Davis MM, Oro AE, Greenleaf WJ. Integrated single-cell chromatin and transcriptomic analyses of human scalp identify gene-regulatory programs and critical cell types for hair and skin diseases. Nat Genet 2023; 55:1288-1300. [PMID: 37500727 PMCID: PMC11190942 DOI: 10.1038/s41588-023-01445-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/01/2022] [Accepted: 06/17/2023] [Indexed: 07/29/2023]
Abstract
Genome-wide association studies have identified many loci associated with hair and skin disease, but identification of causal variants requires deciphering of gene-regulatory networks in relevant cell types. We generated matched single-cell chromatin profiles and transcriptomes from scalp tissue from healthy controls and patients with alopecia areata, identifying diverse cell types of the hair follicle niche. By interrogating these datasets at multiple levels of cellular resolution, we infer 50-100% more enhancer-gene links than previous approaches and show that aggregate enhancer accessibility for highly regulated genes predicts expression. We use these gene-regulatory maps to prioritize cell types, genes and causal variants implicated in the pathobiology of androgenetic alopecia (AGA), eczema and other complex traits. AGA genome-wide association study signals are enriched in dermal papilla regulatory regions, supporting the role of these cells as drivers of AGA pathogenesis. Finally, we train machine learning models to nominate single-nucleotide polymorphisms that affect gene expression through disruption of transcription factor binding, predicting candidate functional single-nucleotide polymorphisms for AGA and eczema.
Affiliation(s)
| | - Chen Wang
- Department of Dermatology, School of Medicine, Stanford University, Stanford, CA, USA
- Division of Dermatology, Department of Medicine, Santa Clara Valley Medical Center, San Jose, CA, USA
- Institute of Immunity, Transplantation and Infection, School of Medicine, Stanford University, Stanford, CA, USA
| | - Justin M Ko
- Department of Dermatology, School of Medicine, Stanford University, Stanford, CA, USA
| | - Eon J Rios
- Department of Dermatology, School of Medicine, Stanford University, Stanford, CA, USA
- Division of Dermatology, Department of Medicine, Santa Clara Valley Medical Center, San Jose, CA, USA
| | - Sumaira Z Aasi
- Department of Dermatology, School of Medicine, Stanford University, Stanford, CA, USA
| | - Mark M Davis
- Institute of Immunity, Transplantation and Infection, School of Medicine, Stanford University, Stanford, CA, USA
- Department of Microbiology and Immunology, School of Medicine, Stanford University, Stanford, CA, USA
- Howard Hughes Medical Institute, School of Medicine, Stanford University, Stanford, CA, USA
| | - Anthony E Oro
- Department of Dermatology, School of Medicine, Stanford University, Stanford, CA, USA
- Program in Epithelial Biology, Stanford University School of Medicine, Stanford, CA, USA
| | - William J Greenleaf
- Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA.
- Department of Applied Physics, Stanford University, Stanford, CA, USA.
- Chan Zuckerberg Biohub, San Francisco, CA, USA.
|
45
|
Zhuang J, Feng K, Teng X, Jia C. GNet: An integrated context-aware neural framework for transcription factor binding signal at single nucleotide resolution prediction. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2023; 20:15809-15829. [PMID: 37919990 DOI: 10.3934/mbe.2023704] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/04/2023]
Abstract
Transcription factors (TFs) are important factors that regulate gene expression. Revealing the mechanism affecting the binding specificity of TFs is the key to understanding gene regulation. Most of the previous studies focus on TF-DNA binding sites at the sequence level, and they seldom utilize the contextual features of DNA sequences. In this paper, we develop an integrated spatiotemporal context-aware neural network framework, named GNet, for predicting TF-DNA binding signal at single nucleotide resolution by achieving three tasks: single nucleotide resolution signal prediction, identification of binding regions at the sequence level, and TF-DNA binding motif prediction. GNet extracts implicit spatial contextual information with a gated highway neural mechanism, which captures large context multi-level patterns using linear shortcut connections, an idea that permeates the encoder and decoder parts of GNet. The improved dual external attention mechanism learns implicit relationships both within and among samples and improves the performance of the model. Experimental results on 53 human TF ChIP-seq datasets and 6 chromatin accessibility ATAC-seq datasets show that GNet outperforms the state-of-the-art methods in the three tasks, and the results of cross-species studies on 15 human and 18 mouse TF datasets of the corresponding TF families indicate that GNet also shows the best performance in cross-species prediction over the competitive methods.
Affiliation(s)
- Jujuan Zhuang
- School of Science, Dalian Maritime University, Dalian, Liaoning 116026, China
| | - Kexin Feng
- School of Science, Dalian Maritime University, Dalian, Liaoning 116026, China
| | - Xinyang Teng
- School of Science, Dalian Maritime University, Dalian, Liaoning 116026, China
| | - Cangzhi Jia
- School of Science, Dalian Maritime University, Dalian, Liaoning 116026, China
|
46
|
Jeong R, Bulyk ML. Blood cell traits' GWAS loci colocalization with variation in PU.1 genomic occupancy prioritizes causal noncoding regulatory variants. CELL GENOMICS 2023; 3:100327. [PMID: 37492098 PMCID: PMC10363807 DOI: 10.1016/j.xgen.2023.100327] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 10/19/2022] [Revised: 02/10/2023] [Accepted: 04/25/2023] [Indexed: 07/27/2023]
Abstract
Genome-wide association studies (GWASs) have uncovered numerous trait-associated loci across the human genome, most of which are located in noncoding regions, making interpretation difficult. Moreover, causal variants are hard to statistically fine-map at many loci because of widespread linkage disequilibrium. To address this challenge, we present a strategy utilizing transcription factor (TF) binding quantitative trait loci (bQTLs) for colocalization analysis to identify trait associations likely mediated by TF occupancy variation and to pinpoint likely causal variants using motif scores. We applied this approach to PU.1 bQTLs in lymphoblastoid cell lines and blood cell trait GWAS data. Colocalization analysis revealed 69 blood cell trait GWAS loci putatively driven by PU.1 occupancy variation. We nominate PU.1 motif-altering variants as the likely shared causal variants at 51 loci. Such integration of TF bQTL data with other GWAS data may reveal transcriptional regulatory mechanisms and causal noncoding variants underlying additional complex traits.
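The step of pinpointing likely causal variants "using motif scores" can be illustrated with a position weight matrix (PWM) comparison between the reference and alternate alleles. The sketch below is an assumption-laden illustration, not the authors' pipeline: the count matrix is a made-up four-position toy with a GGAA-like consensus rather than a real PU.1 matrix, only the forward strand is scanned, and the function names are hypothetical.

```python
import math

# Hypothetical count matrix (one dict per motif position); NOT a real PU.1 model.
TOY_COUNTS = [
    {"A": 1, "C": 1, "G": 17, "T": 1},
    {"A": 1, "C": 1, "G": 17, "T": 1},
    {"A": 17, "C": 1, "G": 1, "T": 1},
    {"A": 17, "C": 1, "G": 1, "T": 1},
]

def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Convert counts to log2 odds against a uniform background."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2(((column[base] + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm

def best_motif_score(seq, pwm):
    """Highest PWM score over all forward-strand windows of the sequence."""
    width = len(pwm)
    return max(
        sum(pwm[i][seq[start + i]] for i in range(width))
        for start in range(len(seq) - width + 1)
    )

def motif_score_change(ref_seq, alt_seq, pwm):
    """Alternate minus reference score; negative values suggest motif disruption."""
    return best_motif_score(alt_seq, pwm) - best_motif_score(ref_seq, pwm)

pwm = log_odds_matrix(TOY_COUNTS)
delta = motif_score_change("TTGGAATT", "TTGCAATT", pwm)  # G>C inside the toy motif
```

A variant that lowers the best motif score in a colocalized region would be the kind of motif-altering candidate the abstract describes; a real analysis would also scan the reverse strand and use a curated matrix.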
Affiliation(s)
- Raehoon Jeong
- Division of Genetics, Department of Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA 02115, USA
- Bioinformatics and Integrative Genomics Graduate Program, Harvard University, Cambridge, MA 02138, USA
| | - Martha L. Bulyk
- Division of Genetics, Department of Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA 02115, USA
- Bioinformatics and Integrative Genomics Graduate Program, Harvard University, Cambridge, MA 02138, USA
- Department of Pathology, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA 02115, USA
| |
Collapse
|
47
|
Phan LT, Oh C, He T, Manavalan B. A comprehensive revisit of the machine-learning tools developed for the identification of enhancers in the human genome. Proteomics 2023; 23:e2200409. [PMID: 37021401 DOI: 10.1002/pmic.202200409] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 02/10/2023] [Revised: 03/18/2023] [Accepted: 03/27/2023] [Indexed: 04/07/2023]
Abstract
Enhancers are non-coding DNA elements that play a crucial role in enhancing the transcription rate of a specific gene in the genome. Experiments for identifying enhancers can be restricted by their conditions and involve complicated, time-consuming, laborious, and costly steps. To overcome these challenges, computational platforms have been developed to complement experimental methods that enable high-throughput identification of enhancers. Over the last few years, the development of various enhancer computational tools has resulted in significant progress in predicting putative enhancers. Thus, researchers are now able to use a variety of strategies to enhance and advance enhancer study. In this review, an overview of machine learning (ML)-based prediction methods for enhancer identification and related databases has been provided. The existing enhancer-prediction methods have also been reviewed regarding their algorithms, feature selection processes, validation techniques, and software utility. In addition, the advantages and drawbacks of these ML approaches and guidelines for developing bioinformatic tools have been highlighted for a more efficient enhancer prediction. This review will serve as a useful resource for experimentalists in selecting the appropriate ML tool for their study, and for bioinformaticians in developing more accurate and advanced ML-based predictors.
Affiliation(s)
- Le Thi Phan
- Computational Biology and Bioinformatics Laboratory, Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Gyeonggi-do, South Korea
| | - Changmin Oh
- Computational Biology and Bioinformatics Laboratory, Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Gyeonggi-do, South Korea
| | - Tao He
- Beidahuang Industry Group General Hospital, Harbin, China
| | - Balachandran Manavalan
- Computational Biology and Bioinformatics Laboratory, Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Gyeonggi-do, South Korea
|
48
|
Murad T, Ali S, Patterson M. Exploring the Potential of GANs in Biological Sequence Analysis. BIOLOGY 2023; 12:854. [PMID: 37372139 DOI: 10.3390/biology12060854] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/29/2023] [Revised: 06/03/2023] [Accepted: 06/12/2023] [Indexed: 06/29/2023]
Abstract
Biological sequence analysis is an essential step toward building a deeper understanding of the underlying functions, structures, and behaviors of the sequences. It can help in identifying the characteristics of the associated organisms, such as viruses, and in building prevention mechanisms to eradicate their spread and impact, as viruses are known to cause epidemics that can become global pandemics. Machine learning (ML) technologies provide new tools for effectively analyzing the functions and structures of the sequences. However, these ML-based methods face challenges with data imbalance, which is generally associated with biological sequence datasets and hinders their performance. Although various strategies exist to address this issue, such as the SMOTE algorithm, which creates synthetic data, they focus on local information rather than the overall class distribution. In this work, we explore a novel approach to handle the data imbalance issue based on generative adversarial networks (GANs), which use the overall data distribution. GANs are utilized to generate synthetic data that closely resembles real data; these generated data can thus be employed to enhance the ML models' performance by eradicating the class imbalance problem for biological sequence analysis. We perform four distinct classification tasks by using four different sequence datasets (Influenza A Virus, PALMdb, VDjDB, Host) and our results illustrate that GANs can improve the overall classification performance.
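To make the abstract's point about SMOTE relying on local information concrete, here is a simplified SMOTE-style oversampler. It illustrates the baseline the abstract contrasts with, not the GAN approach of the paper, and it is not the reference SMOTE implementation: the function name `smote_like`, the choice of `k`, and the toy feature vectors are assumptions for this sketch.

```python
import math
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority samples by local interpolation.

    Each synthetic point lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority neighbors, so only the
    local neighborhood is used, never the overall class distribution.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (q - b) for b, q in zip(base, neighbor)])
    return synthetic

# Toy numeric embeddings of minority-class sequences (illustrative values only).
minority_samples = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
new_samples = smote_like(minority_samples, n_new=6)
```

Because every new point is a convex combination of two existing neighbors, the synthetic data cannot leave the region already covered by the minority class, which is the limitation the abstract uses to motivate a generator trained on the full data distribution.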
Affiliation(s)
- Taslim Murad
- Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
| | - Sarwan Ali
- Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
| | - Murray Patterson
- Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
|
49
|
Chadha A, Dara R, Pearl DL, Gillis D, Rosendal T, Poljak Z. Classification of porcine reproductive and respiratory syndrome clinical impact in Ontario sow herds using machine learning approaches. Front Vet Sci 2023; 10:1175569. [PMID: 37351555 PMCID: PMC10284593 DOI: 10.3389/fvets.2023.1175569] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/28/2023] [Accepted: 04/28/2023] [Indexed: 06/24/2023] Open
Abstract
Since the early 1990s, porcine reproductive and respiratory syndrome (PRRS) virus outbreaks have been reported across various parts of North America, Europe, and Asia. The incursion of PRRS virus (PRRSV) into swine herds can produce various clinical manifestations, with a substantial impact on the incidence of respiratory morbidity, reproductive loss, and mortality. Veterinary experts, among others, regularly analyze the PRRSV open reading frame-5 (ORF-5) for prognostic purposes to assess the risk of severe clinical outcomes. In this study, we explored whether predictive modeling techniques could be used to identify the severity of typical clinical signs observed during PRRS outbreaks in sow herds. Our study aimed to evaluate four baseline machine learning (ML) algorithms, namely logistic regression (LR) with ridge and lasso regularization, random forest (RF), k-nearest neighbor (KNN), and support vector machine (SVM), for classifying ORF-5 sequences and demographic data into high-impact and low-impact clinical categories. First, baseline classifiers were evaluated on different input representations (ORF-5 nucleotide sequences, amino acid sequences, and demographic data) using 10-fold cross-validation. Then, we designed a consensus voting ensemble approach to aggregate the different types of input representations of genetic and demographic data for classifying clinical impact.
In this study, we observed that: (a) for abortion and pre-weaning mortality (PWM), different classifiers improved on baseline accuracy, which showed the plausible presence of both genotypic-phenotypic and demographic-phenotypic relationships; (b) for sow mortality (SM), no baseline classifier successfully established such linkages using either genetic or demographic input data; (c) baseline classifiers showed good performance with a moderate variance of the performance metrics, due to high class overlap and the small dataset size used for training; and (d) consensus voting ensemble techniques made the predictions more robust and stabilized the performance evaluation metrics, but did not substantially improve overall accuracy or the other diagnostic metrics over the baseline classifiers.
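The consensus voting step can be illustrated with a simple hard-majority vote over the labels produced by classifiers trained on different input representations. This is only a sketch under assumptions: the abstract does not state the exact voting rule, so the majority rule, the `tie_label` fallback, the function name `consensus_vote`, and the example predictions below are all hypothetical.

```python
from collections import Counter


def consensus_vote(predictions, tie_label="low"):
    """Combine per-classifier label lists into one label per sample.

    predictions: list of equal-length label lists, one per classifier.
    tie_label:   label used when the top two vote counts are equal
                 (an assumed tie-break, not taken from the paper).
    """
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all classifiers must predict the same samples")
    consensus = []
    for labels in zip(*predictions):
        counts = Counter(labels).most_common()
        if len(counts) > 1 and counts[0][1] == counts[1][1]:
            consensus.append(tie_label)
        else:
            consensus.append(counts[0][0])
    return consensus


# Made-up predictions from three input representations (illustrative only).
preds_nucleotide = ["high", "low", "high", "low"]
preds_amino_acid = ["high", "high", "low", "low"]
preds_demographic = ["low", "high", "high", "low"]

final = consensus_vote([preds_nucleotide, preds_amino_acid, preds_demographic])
print(final)  # ['high', 'high', 'high', 'low']
```

A vote of this kind smooths out disagreements between individual classifiers, which fits the reported stabilizing effect, but it cannot create signal that none of the base classifiers captured.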
Affiliation(s)
- Akshay Chadha
  - School of Computer Science, University of Guelph, Guelph, ON, Canada
- Rozita Dara
  - School of Computer Science, University of Guelph, Guelph, ON, Canada
- David L. Pearl
  - Department of Population Medicine, Ontario Veterinary College, University of Guelph, Guelph, ON, Canada
- Daniel Gillis
  - School of Computer Science, University of Guelph, Guelph, ON, Canada
- Zvonimir Poljak
  - Department of Population Medicine, Ontario Veterinary College, University of Guelph, Guelph, ON, Canada
50
Tognon M, Giugno R, Pinello L. A survey on algorithms to characterize transcription factor binding sites. Brief Bioinform 2023; 24:bbad156. [PMID: 37099664 PMCID: PMC10422928 DOI: 10.1093/bib/bbad156] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 01/13/2023] [Revised: 03/27/2023] [Accepted: 04/01/2023] [Indexed: 04/28/2023] Open
Abstract
Transcription factors (TFs) are key regulatory proteins that control the transcriptional rate of cells by binding short DNA sequences called transcription factor binding sites (TFBS) or motifs. Identifying and characterizing TFBS is fundamental to understanding the regulatory mechanisms governing the transcriptional state of cells. During the last decades, several experimental methods have been developed to recover DNA sequences containing TFBS. In parallel, computational methods have been proposed to discover and identify TFBS motifs based on these DNA sequences. This is one of the most widely investigated problems in bioinformatics and is referred to as the motif discovery problem. In this manuscript, we review classical and novel experimental and computational methods developed to discover and characterize TFBS motifs in DNA sequences, highlighting their advantages and drawbacks. We also discuss open challenges and future perspectives that could fill the remaining gaps in the field.
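As a concrete illustration of what "characterizing a motif" can mean computationally, the sketch below builds a log-odds position weight matrix (PWM) from a few aligned binding sites and scans a sequence for the best-scoring window. The PWM is one classical motif representation and is not necessarily the one the survey emphasizes. The function names, the pseudocount, the uniform background of 0.25, and the toy sites are assumptions made for this example.

```python
import math


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM (one dict per position) from aligned sites."""
    width = len(sites[0])
    total = len(sites) + 4 * pseudocount
    pwm = []
    for i in range(width):
        column = [site[i] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def best_match(pwm, sequence):
    """Return (position, score) of the highest-scoring window in sequence."""
    width = len(pwm)
    scored = [
        (pos, sum(pwm[i][sequence[pos + i]] for i in range(width)))
        for pos in range(len(sequence) - width + 1)
    ]
    return max(scored, key=lambda item: item[1])


# Toy aligned binding sites and a sequence to scan (illustrative only).
sites = ["TATAAT", "TATAAT", "TACAAT", "TATACT"]
pwm = build_pwm(sites)
position, score = best_match(pwm, "GGCGTATAATGCGC")
print(position)  # 4
```

The pseudocount keeps unseen bases from producing a score of negative infinity, a common practical choice when only a handful of sites are available.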
Affiliation(s)
- Manuel Tognon
  - Computer Science Department, University of Verona, Verona, Italy
  - Molecular Pathology Unit, Center for Computational and Integrative Biology and Center for Cancer Research, Massachusetts General Hospital, Charlestown, Massachusetts, United States of America
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
- Rosalba Giugno
  - Computer Science Department, University of Verona, Verona, Italy
- Luca Pinello
  - Molecular Pathology Unit, Center for Computational and Integrative Biology and Center for Cancer Research, Massachusetts General Hospital, Charlestown, Massachusetts, United States of America
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
  - Department of Pathology, Harvard Medical School, Boston, Massachusetts, United States of America