1
|
Kuhl H, Tan WH, Klopp C, Kleiner W, Koyun B, Ciorpac M, Feron R, Knytl M, Kloas W, Schartl M, Winkler C, Stöck M. A candidate sex determination locus in amphibians which evolved by structural variation between X- and Y-chromosomes. Nat Commun 2024; 15:4781. [PMID: 38839766 PMCID: PMC11153619 DOI: 10.1038/s41467-024-49025-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/20/2023] [Accepted: 05/17/2024] [Indexed: 06/07/2024] Open
Abstract
Most vertebrates develop distinct females and males, where sex is determined by repeatedly evolved environmental or genetic triggers. Undifferentiated sex chromosomes and large genomes have caused major knowledge gaps in amphibians. Only a single master sex-determining gene, the dmrt1-paralogue (dm-w) of female-heterogametic clawed frogs (Xenopus; ZW♀/ZZ♂), is known across >8740 species of amphibians. In this study, by combining chromosome-scale female and male genomes of a non-model amphibian, the European green toad, Bufo(tes) viridis, with ddRAD- and whole genome pool-sequencing, we reveal a candidate master locus, governing a male-heterogametic system (XX♀/XY♂). Targeted sequencing across multiple taxa uncovered structural X/Y-variation in the 5'-regulatory region of the gene bod1l, where a Y-specific non-coding RNA (ncRNA-Y), only expressed in males, suggests that this locus initiates sex-specific differentiation. Developmental transcriptomes and RNA in-situ hybridization show timely and spatially relevant sex-specific ncRNA-Y and bod1l-gene expression in primordial gonads. This coincided with differential H3K4me-methylation in pre-granulosa/pre-Sertoli cells, pointing to a specific mechanism of amphibian sex determination.
Collapse
Affiliation(s)
- Heiner Kuhl
- Leibniz-Institute of Freshwater Ecology and Inland Fisheries, IGB, Müggelseedamm 301 & 310, 12587, Berlin, Germany
| | - Wen Hui Tan
- Department of Biological Sciences and Centre for Bioimaging Sciences, National University of Singapore, 14 Science Drive 4, Block S1A, Level 6, Singapore, 117543, Singapore
| | - Christophe Klopp
- SIGENAE, Plate-forme Bio-informatique Genotoul, Mathématiques et Informatique Appliquées de Toulouse, INRAe, 31326, Castanet-Tolosan, France
| | - Wibke Kleiner
- Leibniz-Institute of Freshwater Ecology and Inland Fisheries, IGB, Müggelseedamm 301 & 310, 12587, Berlin, Germany
| | - Baturalp Koyun
- Leibniz-Institute of Freshwater Ecology and Inland Fisheries, IGB, Müggelseedamm 301 & 310, 12587, Berlin, Germany
- Department of Molecular Biology and Genetics, Genetics, Faculty of Science, Bilkent University, SB Building, Ankara, 06800, Turkey
| | - Mitica Ciorpac
- Danube Delta National Institute for Research and Development, Tulcea, 820112, Romania
- Advanced Research and Development Center for Experimental Medicine-CEMEX, "Grigore T. Popa", University of Medicine and Pharmacy, Mihail Kogălniceanu Street 9-13, Iasi, 700259, Romania
| | - Romain Feron
- Department of Ecology and Evolution, University of Lausanne, Lausanne, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Switzerland
| | - Martin Knytl
- Department of Cell Biology, Faculty of Science, Charles University, Viničná 7, Prague, 12843, Czech Republic
- Department of Biology, McMaster University, 1280 Main Street West, Hamilton, L8S 4K1, Ontario, ON, Canada
| | - Werner Kloas
- Leibniz-Institute of Freshwater Ecology and Inland Fisheries, IGB, Müggelseedamm 301 & 310, 12587, Berlin, Germany
| | - Manfred Schartl
- Developmental Biochemistry, Biocenter, University of Wuerzburg, Am Hubland, 97074, Wuerzburg, Germany
- The Xiphophorus Genetic Stock Center, Department of Chemistry and Biochemistry, Texas State University, San Marcos, TX, 78666, USA
| | - Christoph Winkler
- Department of Biological Sciences and Centre for Bioimaging Sciences, National University of Singapore, 14 Science Drive 4, Block S1A, Level 6, Singapore, 117543, Singapore.
| | - Matthias Stöck
- Leibniz-Institute of Freshwater Ecology and Inland Fisheries, IGB, Müggelseedamm 301 & 310, 12587, Berlin, Germany.
| |
Collapse
|
2
|
Mehmood F, Arshad S, Shoaib M. ADH-Enhancer: an attention-based deep hybrid framework for enhancer identification and strength prediction. Brief Bioinform 2024; 25:bbae030. [PMID: 38385876 PMCID: PMC10885011 DOI: 10.1093/bib/bbae030] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/20/2023] [Revised: 12/30/2023] [Accepted: 01/11/2024] [Indexed: 02/23/2024] Open
Abstract
Enhancers play an important role in the process of gene expression regulation. In DNA sequence abundance or absence of enhancers and irregularities in the strength of enhancers affects gene expression process that leads to the initiation and propagation of diverse types of genetic diseases such as hemophilia, bladder cancer, diabetes and congenital disorders. Enhancer identification and strength prediction through experimental approaches is expensive, time-consuming and error-prone. To accelerate and expedite the research related to enhancers identification and strength prediction, around 19 computational frameworks have been proposed. These frameworks used machine and deep learning methods that take raw DNA sequences and predict enhancer's presence and strength. However, these frameworks still lack in performance and are not useful in real time analysis. This paper presents a novel deep learning framework that uses language modeling strategies for transforming DNA sequences into statistical feature space. It applies transfer learning by training a language model in an unsupervised fashion by predicting a group of nucleotides also known as k-mers based on the context of existing k-mers in a sequence. At the classification stage, it presents a novel classifier that reaps the benefits of two different architectures: convolutional neural network and attention mechanism. The proposed framework is evaluated over the enhancer identification benchmark dataset where it outperforms the existing best-performing framework by 5%, and 9% in terms of accuracy and MCC. Similarly, when evaluated over the enhancer strength prediction benchmark dataset, it outperforms the existing best-performing framework by 4%, and 7% in terms of accuracy and MCC.
Collapse
Affiliation(s)
- Faiza Mehmood
- Department of Computer Science, University of Engineering and Technology Lahore, (Faisalabad Campus) Pakistan
| | - Shazia Arshad
- Department of Computer Science, University of Engineering and Technology Lahore, 54890, Pakistan
| | - Muhammad Shoaib
- Department of Computer Science, University of Engineering and Technology Lahore, 54890, Pakistan
| |
Collapse
|
3
|
Liu Y, Wang Z, Yuan H, Zhu G, Zhang Y. HEAP: a task adaptive-based explainable deep learning framework for enhancer activity prediction. Brief Bioinform 2023; 24:bbad286. [PMID: 37539835 DOI: 10.1093/bib/bbad286] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/15/2023] [Revised: 07/05/2023] [Accepted: 07/21/2023] [Indexed: 08/05/2023] Open
Abstract
Enhancers are crucial cis-regulatory elements that control gene expression in a cell-type-specific manner. Despite extensive genetic and computational studies, accurately predicting enhancer activity in different cell types remains a challenge, and the grammar of enhancers is still poorly understood. Here, we present HEAP (high-resolution enhancer activity prediction), an explainable deep learning framework for predicting enhancers and exploring enhancer grammar. The framework includes three modules that use grammar-based reasoning for enhancer prediction. The algorithm can incorporate DNA sequences and epigenetic modifications to obtain better accuracy. We use a novel two-step multi-task learning method, task adaptive parameter sharing (TAPS), to efficiently predict enhancers in different cell types. We first train a shared model with all cell-type datasets. Then we adapt to specific tasks by adding several task-specific subset layers. Experiments demonstrate that HEAP outperforms published methods and showcases the effectiveness of the TAPS, especially for those with limited training samples. Notably, the explainable framework HEAP utilizes post-hoc interpretation to provide insights into the prediction mechanisms from three perspectives: data, model architecture and algorithm, leading to a better understanding of model decisions and enhancer grammar. To the best of our knowledge, HEAP will be a valuable tool for insight into the complex mechanisms of enhancer activity.
Collapse
Affiliation(s)
- Yuhang Liu
- School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
| | - Zixuan Wang
- College of Electronics and Information Engieering, Sichuan University, 610065, Chengdu, China
| | - Hao Yuan
- School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
| | - Guiquan Zhu
- West China Hospital of Stomatology, Sichuan University, 610041, Chengdu, China
| | - Yongqing Zhang
- School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
| |
Collapse
|
4
|
Wang J, Zhang H, Chen N, Zeng T, Ai X, Wu K. PorcineAI-Enhancer: Prediction of Pig Enhancer Sequences Using Convolutional Neural Networks. Animals (Basel) 2023; 13:2935. [PMID: 37760334 PMCID: PMC10526013 DOI: 10.3390/ani13182935] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/20/2023] [Revised: 08/21/2023] [Accepted: 09/05/2023] [Indexed: 09/29/2023] Open
Abstract
Understanding the mechanisms of gene expression regulation is crucial in animal breeding. Cis-regulatory DNA sequences, such as enhancers, play a key role in regulating gene expression. Identifying enhancers is challenging, despite the use of experimental techniques and computational methods. Enhancer prediction in the pig genome is particularly significant due to the costliness of high-throughput experimental techniques. The study constructed a high-quality database of pig enhancers by integrating information from multiple sources. A deep learning prediction framework called PorcineAI-enhancer was developed for the prediction of pig enhancers. This framework employs convolutional neural networks for feature extraction and classification. PorcineAI-enhancer showed excellent performance in predicting pig enhancers, validated on an independent test dataset. The model demonstrated reliable prediction capability for unknown enhancer sequences and performed remarkably well on tissue-specific enhancer sequences.The study developed a deep learning prediction framework, PorcineAI-enhancer, for predicting pig enhancers. The model demonstrated significant predictive performance and potential for tissue-specific enhancers. This research provides valuable resources for future studies on gene expression regulation in pigs.
Collapse
Affiliation(s)
- Ji Wang
- College of Animal Science and Technology, China Agricultural University, Beijing 100193, China; (J.W.); (H.Z.); (T.Z.); (X.A.)
| | - Han Zhang
- College of Animal Science and Technology, China Agricultural University, Beijing 100193, China; (J.W.); (H.Z.); (T.Z.); (X.A.)
| | - Nanzhu Chen
- Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing 100193, China;
| | - Tong Zeng
- College of Animal Science and Technology, China Agricultural University, Beijing 100193, China; (J.W.); (H.Z.); (T.Z.); (X.A.)
| | - Xiaohua Ai
- College of Animal Science and Technology, China Agricultural University, Beijing 100193, China; (J.W.); (H.Z.); (T.Z.); (X.A.)
| | - Keliang Wu
- College of Animal Science and Technology, China Agricultural University, Beijing 100193, China; (J.W.); (H.Z.); (T.Z.); (X.A.)
| |
Collapse
|
5
|
Phan LT, Oh C, He T, Manavalan B. A comprehensive revisit of the machine-learning tools developed for the identification of enhancers in the human genome. Proteomics 2023; 23:e2200409. [PMID: 37021401 DOI: 10.1002/pmic.202200409] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/10/2023] [Revised: 03/18/2023] [Accepted: 03/27/2023] [Indexed: 04/07/2023]
Abstract
Enhancers are non-coding DNA elements that play a crucial role in enhancing the transcription rate of a specific gene in the genome. Experiments for identifying enhancers can be restricted by their conditions and involve complicated, time-consuming, laborious, and costly steps. To overcome these challenges, computational platforms have been developed to complement experimental methods that enable high-throughput identification of enhancers. Over the last few years, the development of various enhancer computational tools has resulted in significant progress in predicting putative enhancers. Thus, researchers are now able to use a variety of strategies to enhance and advance enhancer study. In this review, an overview of machine learning (ML)-based prediction methods for enhancer identification and related databases has been provided. The existing enhancer-prediction methods have also been reviewed regarding their algorithms, feature selection processes, validation techniques, and software utility. In addition, the advantages and drawbacks of these ML approaches and guidelines for developing bioinformatic tools have been highlighted for a more efficient enhancer prediction. This review will serve as a useful resource for experimentalists in selecting the appropriate ML tool for their study, and for bioinformaticians in developing more accurate and advanced ML-based predictors.
Collapse
Affiliation(s)
- Le Thi Phan
- Computational Biology and Bioinformatics Laboratory, Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Gyeonggi-do, South Korea
| | - Changmin Oh
- Computational Biology and Bioinformatics Laboratory, Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Gyeonggi-do, South Korea
| | - Tao He
- Beidahuang Industry Group General Hospital, Harbin, China
| | - Balachandran Manavalan
- Computational Biology and Bioinformatics Laboratory, Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Gyeonggi-do, South Korea
| |
Collapse
|
6
|
Li J, Wu Z, Lin W, Luo J, Zhang J, Chen Q, Chen J. iEnhancer-ELM: improve enhancer identification by extracting position-related multiscale contextual information based on enhancer language models. BIOINFORMATICS ADVANCES 2023; 3:vbad043. [PMID: 37113248 PMCID: PMC10125906 DOI: 10.1093/bioadv/vbad043] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 12/05/2022] [Revised: 02/04/2023] [Accepted: 03/24/2023] [Indexed: 04/29/2023]
Abstract
Motivation Enhancers are important cis-regulatory elements that regulate a wide range of biological functions and enhance the transcription of target genes. Although many feature extraction methods have been proposed to improve the performance of enhancer identification, they cannot learn position-related multiscale contextual information from raw DNA sequences. Results In this article, we propose a novel enhancer identification method (iEnhancer-ELM) based on BERT-like enhancer language models. iEnhancer-ELM tokenizes DNA sequences with multi-scale k-mers and extracts contextual information of different scale k-mers related with their positions via an multi-head attention mechanism. We first evaluate the performance of different scale k-mers, then ensemble them to improve the performance of enhancer identification. The experimental results on two popular benchmark datasets show that our model outperforms state-of-the-art methods. We further illustrate the interpretability of iEnhancer-ELM. For a case study, we discover 30 enhancer motifs via a 3-mer-based model, where 12 of motifs are verified by STREME and JASPAR, demonstrating our model has a potential ability to unveil the biological mechanism of enhancer. Availability and implementation The models and associated code are available at https://github.com/chen-bioinfo/iEnhancer-ELM. Supplementary information Supplementary data are available at Bioinformatics Advances online.
Collapse
Affiliation(s)
| | | | - Wenhao Lin
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
| | - Jiawei Luo
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
| | - Jun Zhang
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
| | - Qingcai Chen
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
- Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
| | | |
Collapse
|
7
|
Thakare A, Bhende M, Tesema M, Dighriri M, Bhavani R, Mahmoud A. An Intelligent Classification System for Cancer Detection Based on DNA Methylation Using ML and Semantic Knowledge in Healthcare. COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE 2022; 2022:4334852. [PMID: 38501034 PMCID: PMC10948228 DOI: 10.1155/2022/4334852] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 07/24/2022] [Revised: 09/01/2022] [Accepted: 09/10/2022] [Indexed: 03/20/2024]
Abstract
To consistently assess a patient's internal and external wellness and diagnose chronic conditions like cancer, Alzheimer's disease, and cardiovascular disease, wearable sensing devices are being used. Wearable technologies and networking websites have become incredibly common in the medical sector in recent times. The condition of a patient's health can be influenced by a number of factors, including psychological response, emotional stability, and anxiety levels, which can be evaluated using social network analysis based on graph theory-based techniques and these ideas, known as "social network analysis" (SNA) are used to study relationship phenomena. Therefore, numerous uses for SNA in health research are possible, ranging from social science to exact science. For example, it can be used to research cooperative networks of healthcare providers and hazard-prone behaviors, infectious disease transmission, and the spread of initiatives for health promotion and prevention. Recently, a number of machine learning-based healthcare solutions have been proposed to track chronic illnesses utilizing data from social networks and wearable monitoring devices. In our suggested approach, we are using an intelligent system with the assistance of wearable sensors for the classification of cancer based on DNA methylation, an important epigenetic process in the human genome that controls gene expression and has been connected to a number of health issues. A mixed-sampling imbalanced data ensemble classification technique is created with the help of biomedical sensors to address the problem of class imbalance and high dimensionality in the Cancer Genome Atlas (TCGA) massive data. This technique is based on the Intelligent Synthetic Minority Oversampling (SMOTE) algorithm. The false-negative rate significantly rises as a result of this, to give a larger data set, a new minority class sample will be first obtained. The noise created during the sample expansion process is actually any data that has been acquired, preserved, or altered in a way that prevents the system that initially conceived it from accessing or utilizing it. Noisy data boosts the amount of space needed excessively and can also drastically influence the findings of any data collection investigation and therefore can also affect the sample sets of one or the other class, resulting in the class imbalance which acts as a common problem in ML datasets. The Tomek Link method is then used to eliminate this noise, producing a reasonably balanced data set. Each layer selects two random forest structures using the cascading forest structure of the deep forest (GC-Forest) algorithm to increase the generalization ability of the model and create the final classification model. Experiments using DNA methylation data collected by employing biosensors from six tumor patients reveal that the mixed-sampling unbalanced data ensemble classification technique may increase the sensitivity to the minority class while maintaining the majority class's classification accuracy.
Collapse
Affiliation(s)
- Anuradha Thakare
- Department of Computer Engineering, Pimpri Chinchwad College of Engineering, Pune, India
| | - Manisha Bhende
- Marathwada Mitra Mandal's Institute of Technology, Pune, India
| | - Mulugeta Tesema
- Department of Chemistry (Analytical), College of Natural and Computational Sciences, Dambi Dollo University, Dambi Dollo, Oromia Region, Ethiopia
| | - Mohammed Dighriri
- Department of Basic Sciences and General Requirements -IT skills, Fakeeh College for Medical Sciences (FCMS), Jeddah, Saudi Arabia
| | - R. Bhavani
- Department of CSE, Saveetha School of Engineering, Saveetha Institute of Medical and Technical Sciences, Chennai 602105, India
| | - Amena Mahmoud
- Computer Science Department, Faculty of Computers and Information, Kafrelsheikh University, Kafr El Sheikh, Egypt
| |
Collapse
|
8
|
Cross-species enhancer prediction using machine learning. Genomics 2022; 114:110454. [PMID: 36030022 DOI: 10.1016/j.ygeno.2022.110454] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/30/2022] [Revised: 07/28/2022] [Accepted: 08/16/2022] [Indexed: 11/21/2022]
Abstract
Cis-regulatory elements (CREs) are non-coding parts of the genome that play a critical role in gene expression regulation. Enhancers, as an important example of CREs, interact with genes to influence complex traits like disease, heat tolerance and growth rate. Much of what is known about enhancers come from studies of humans and a few model organisms like mouse, with little known about other mammalian species. Previous studies have attempted to identify enhancers in less studied mammals using comparative genomics but with limited success. Recently, Machine Learning (ML) techniques have shown promising results to predict enhancer regions. Here, we investigated the ability of ML methods to identify enhancers in three non-model mammalian species (cattle, pig and dog) using human and mouse enhancer data from VISTA and publicly available ChIP-seq. We tested nine models, using four different representations of the DNA sequences in cross-species prediction using both the VISTA dataset and species-specific ChIP-seq data. We identified between 809,399 and 877,278 enhancer-like regions (ELRs) in the study species (11.6-13.7% of each genome). These predictions were close to the ~8% proportion of ELRs that covered the human genome. We propose that our ML methods have predictive ability for identifying enhancers in non-model mammalian species. We have provided a list of high confidence enhancers at https://github.com/DaviesCentreInformatics/Cross-species-enhancer-prediction and believe these enhancers will be of great use to the community.
Collapse
|
9
|
Mulero Hernández J, Fernández-Breis JT. Analysis of the landscape of human enhancer sequences in biological databases. Comput Struct Biotechnol J 2022; 20:2728-2744. [PMID: 35685360 PMCID: PMC9168495 DOI: 10.1016/j.csbj.2022.05.045] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/28/2022] [Revised: 05/20/2022] [Accepted: 05/21/2022] [Indexed: 12/01/2022] Open
Abstract
The process of gene regulation extends as a network in which both genetic sequences and proteins are involved. The levels of regulation and the mechanisms involved are multiple. Transcription is the main control mechanism for most genes, being the downstream steps responsible for refining the transcription patterns. In turn, gene transcription is mainly controlled by regulatory events that occur at promoters and enhancers. Several studies are focused on analyzing the contribution of enhancers in the development of diseases and their possible use as therapeutic targets. The study of regulatory elements has advanced rapidly in recent years with the development and use of next generation sequencing techniques. All this information has generated a large volume of information that has been transferred to a growing number of public repositories that store this information. In this article, we analyze the content of those public repositories that contain information about human enhancers with the aim of detecting whether the knowledge generated by scientific research is contained in those databases in a way that could be computationally exploited. The analysis will be based on three main aspects identified in the literature: types of enhancers, type of evidence about the enhancers, and methods for detecting enhancer-promoter interactions. Our results show that no single database facilitates the optimal exploitation of enhancer data, most types of enhancers are not represented in the databases and there is need for a standardized model for enhancers. We have identified major gaps and challenges for the computational exploitation of enhancer data.
Collapse
Affiliation(s)
- Juan Mulero Hernández
- Dept. Informática y Sistemas, Universidad de Murcia, CEIR Campus Mare Nostrum, IMIB-Arrixaca, Spain
| | | |
Collapse
|