1
Yan W, Tan L, Mengshan L, Weihong Z, Sheng S, Jun W, Fu-An W. Time series-based hybrid ensemble learning model with multivariate multidimensional feature coding for DNA methylation prediction. BMC Genomics 2023; 24:758. [PMID: 38082253 PMCID: PMC10712061 DOI: 10.1186/s12864-023-09866-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/06/2023] [Accepted: 12/02/2023] [Indexed: 12/18/2023]
Abstract
BACKGROUND: DNA methylation is a form of epigenetic modification that affects gene expression without modifying the DNA sequence, thereby exerting control over gene function and cellular development. The prediction of DNA methylation is vital for understanding and exploring gene regulatory mechanisms. Currently, machine learning algorithms are primarily used for model construction. However, several challenges remain to be addressed, including limited prediction accuracy, constrained generalization capability, and insufficient learning capacity.
RESULTS: In response to these challenges, this paper leverages the similarities between DNA sequences and time series to introduce a time series-based hybrid ensemble learning model, called Multi2-Con-CAPSO-LSTM. The model uses a multivariate and multidimensional encoding approach that combines three types of time series encodings with three kinds of genetic feature encodings, resulting in a total of nine types of feature encoding matrices. Convolutional Neural Networks are used to extract features from DNA sequences, including temporal, positional, physicochemical, and genetic information, thereby creating a comprehensive feature matrix. The Long Short-Term Memory model is then optimized using the Chaotic Accelerated Particle Swarm Optimization algorithm for predicting DNA methylation.
CONCLUSIONS: Cross-validation experiments conducted on 17 species involving three types of DNA methylation (6mA, 5hmC, and 4mC) demonstrate the robust predictive capabilities of the Multi2-Con-CAPSO-LSTM model across methylation types and species. Compared with other benchmark models, the Multi2-Con-CAPSO-LSTM model demonstrates significant advantages in sensitivity, specificity, accuracy, and correlation. The model proposed in this paper provides valuable insights and inspiration across various disciplines, including sequence alignment, genetic evolution, time series analysis, and structure-activity relationships.
Affiliation(s)
- Wu Yan
  - School of Biotechnology, Jiangsu University of Science and Technology, Zhenjiang, Jiangsu, 212018, China
  - School of Mathematics and Computer Science, Gannan Normal University, Ganzhou, Jiangxi, 341000, China
  - Sericultural Research Institute, Chinese Academy of Agricultural Sciences, Zhenjiang, Jiangsu, 212018, China
- Li Tan
  - College of Physics and Electronic Information, Gannan Normal University, Ganzhou, Jiangxi, 341000, China
- Li Mengshan
  - College of Physics and Electronic Information, Gannan Normal University, Ganzhou, Jiangxi, 341000, China
- Zhou Weihong
  - School of Biotechnology, Jiangsu University of Science and Technology, Zhenjiang, Jiangsu, 212018, China
  - Sericultural Research Institute, Chinese Academy of Agricultural Sciences, Zhenjiang, Jiangsu, 212018, China
- Sheng Sheng
  - School of Biotechnology, Jiangsu University of Science and Technology, Zhenjiang, Jiangsu, 212018, China
  - Sericultural Research Institute, Chinese Academy of Agricultural Sciences, Zhenjiang, Jiangsu, 212018, China
- Wang Jun
  - School of Biotechnology, Jiangsu University of Science and Technology, Zhenjiang, Jiangsu, 212018, China
  - Sericultural Research Institute, Chinese Academy of Agricultural Sciences, Zhenjiang, Jiangsu, 212018, China
- Wu Fu-An
  - School of Biotechnology, Jiangsu University of Science and Technology, Zhenjiang, Jiangsu, 212018, China
  - Sericultural Research Institute, Chinese Academy of Agricultural Sciences, Zhenjiang, Jiangsu, 212018, China

2
Yang M, Chen S, Huang Z, Gao S, Yu T, Du T, Zhang H, Li X, Liu CM, Chen S, Li H. Deep learning-enabled discovery and characterization of HKT genes in Spartina alterniflora. THE PLANT JOURNAL : FOR CELL AND MOLECULAR BIOLOGY 2023; 116:690-705. [PMID: 37494542 DOI: 10.1111/tpj.16397] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/10/2023] [Revised: 07/03/2023] [Accepted: 07/11/2023] [Indexed: 07/28/2023]
Abstract
Spartina alterniflora is a halophyte that can survive in high-salinity environments, and it is phylogenetically close to important cereal crops, such as maize and rice. It is of scientific interest to understand why S. alterniflora can live under such extremely stressful conditions. The molecular mechanism underlying its high-saline tolerance is still largely unknown. Here we investigated the possibility that high-affinity K+ transporters (HKTs), which function in salt tolerance and maintenance of ion homeostasis in plants, are responsible for salt tolerance in S. alterniflora. To overcome the imprecision and instability of gene screening based on conventional sequence alignment, we used a deep learning method, DeepGOPlus, to automatically extract sequence and protein characteristics from our newly assembled S. alterniflora genome to identify SaHKTs. A total of 16 HKT genes were identified. The number of S. alterniflora HKTs (SaHKTs) is larger than that in all other investigated plant species except wheat. Phylogenetically related SaHKT members had similar gene structures, conserved protein domains and cis-elements. Expression profiling showed that most SaHKT genes are expressed in specific tissues and are differentially expressed under salt stress. Yeast complementation expression analysis showed that type I members SaHKT1;2, SaHKT1;3 and SaHKT1;8 and type II members SaHKT2;1, SaHKT2;3 and SaHKT2;4 had low-affinity K+ uptake ability, that type II members showed stronger K+ affinity than rice and Arabidopsis HKTs, and that most SaHKTs showed a preference for Na+ transport. We believe that deep learning-based methods are powerful approaches to uncovering new functional genes, and that the SaHKT genes identified are important resources for breeding new varieties of salt-tolerant crops.
Affiliation(s)
- Maogeng Yang
  - State Key Laboratory of Crop Gene Resources and Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences (CAAS), Beijing, China
  - Nanfan Research Institute, CAAS, Sanya, Hainan, China
  - Key Laboratory of Plant Molecular & Developmental Biology, College of Life Sciences, Yantai University, Yantai, Shandong, China
- Shoukun Chen
  - State Key Laboratory of Crop Gene Resources and Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences (CAAS), Beijing, China
  - Nanfan Research Institute, CAAS, Sanya, Hainan, China
  - Hainan Yazhou Bay Seed Laboratory, Sanya, Hainan, China
- Zhangping Huang
  - State Key Laboratory of Crop Gene Resources and Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences (CAAS), Beijing, China
  - Nanfan Research Institute, CAAS, Sanya, Hainan, China
- Shang Gao
  - State Key Laboratory of Crop Gene Resources and Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences (CAAS), Beijing, China
  - Nanfan Research Institute, CAAS, Sanya, Hainan, China
- Tingxi Yu
  - State Key Laboratory of Crop Gene Resources and Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences (CAAS), Beijing, China
  - Nanfan Research Institute, CAAS, Sanya, Hainan, China
- Tingting Du
  - State Key Laboratory of Crop Gene Resources and Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences (CAAS), Beijing, China
  - Nanfan Research Institute, CAAS, Sanya, Hainan, China
- Hao Zhang
  - State Key Laboratory of Crop Gene Resources and Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences (CAAS), Beijing, China
  - Nanfan Research Institute, CAAS, Sanya, Hainan, China
- Xiang Li
  - State Key Laboratory of Plant Genomics and National Center for Plant Gene Research, Institute of Genetics and Developmental Biology, Innovation Academy for Seed Design, Chinese Academy of Sciences, Beijing, China
- Chun-Ming Liu
  - State Key Laboratory of Crop Gene Resources and Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences (CAAS), Beijing, China
  - Key Laboratory of Plant Molecular Physiology, Institute of Botany, Chinese Academy of Sciences, Beijing, China
  - College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
  - School of Advanced Agricultural Sciences, Peking University, Beijing, China
- Shihua Chen
  - Key Laboratory of Plant Molecular & Developmental Biology, College of Life Sciences, Yantai University, Yantai, Shandong, China
- Huihui Li
  - State Key Laboratory of Crop Gene Resources and Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences (CAAS), Beijing, China
  - Nanfan Research Institute, CAAS, Sanya, Hainan, China

3
Liu Y, Wang Z, Yuan H, Zhu G, Zhang Y. HEAP: a task adaptive-based explainable deep learning framework for enhancer activity prediction. Brief Bioinform 2023; 24:bbad286. [PMID: 37539835 DOI: 10.1093/bib/bbad286] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 05/15/2023] [Revised: 07/05/2023] [Accepted: 07/21/2023] [Indexed: 08/05/2023]
Abstract
Enhancers are crucial cis-regulatory elements that control gene expression in a cell-type-specific manner. Despite extensive genetic and computational studies, accurately predicting enhancer activity in different cell types remains a challenge, and the grammar of enhancers is still poorly understood. Here, we present HEAP (high-resolution enhancer activity prediction), an explainable deep learning framework for predicting enhancers and exploring enhancer grammar. The framework includes three modules that use grammar-based reasoning for enhancer prediction. The algorithm can incorporate DNA sequences and epigenetic modifications to obtain better accuracy. We use a novel two-step multi-task learning method, task adaptive parameter sharing (TAPS), to efficiently predict enhancers in different cell types. We first train a shared model with all cell-type datasets. Then we adapt to specific tasks by adding several task-specific subset layers. Experiments demonstrate that HEAP outperforms published methods and showcases the effectiveness of the TAPS, especially for those with limited training samples. Notably, the explainable framework HEAP utilizes post-hoc interpretation to provide insights into the prediction mechanisms from three perspectives: data, model architecture and algorithm, leading to a better understanding of model decisions and enhancer grammar. To the best of our knowledge, HEAP will be a valuable tool for insight into the complex mechanisms of enhancer activity.
Affiliation(s)
- Yuhang Liu
  - School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
- Zixuan Wang
  - College of Electronics and Information Engineering, Sichuan University, 610065, Chengdu, China
- Hao Yuan
  - School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
- Guiquan Zhu
  - West China Hospital of Stomatology, Sichuan University, 610041, Chengdu, China
- Yongqing Zhang
  - School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China

4
Zhang Q, Xu Y, Wang S, Wu Y, Ye Y, Yuan CA, Gribova V, Filaretov VF, Huang DS. Using Fully Convolutional Network to Locate Transcription Factor Binding Sites Based on DNA Sequence and Conservation Information. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:2690-2699. [PMID: 36374878 DOI: 10.1109/tcbb.2022.3219831] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/16/2023]
Abstract
Transcription factors (TFs) play a part in gene expression. TFs can form a complex gene expression regulation system by binding to DNA. Therefore, identifying the binding regions has become an indispensable step for understanding the regulatory mechanism of gene expression. Inspired by the great achievements of deep learning (DL) in computer vision and language processing in recent years, many scholars have used these methods to predict TF binding sites (TFBSs), achieving extraordinary results. However, these methods mainly focus on whether DNA sequences include TFBSs. In this paper, we propose a fully convolutional network (FCN) coupled with a refinement residual block (RRB) and a global average pooling layer (GAPL), namely FCNARRB. Our model can classify binding sequences at the nucleotide level by outputting dense labels for the input data. Experimental results on human ChIP-seq datasets show that the RRB and GAPL structures are very useful for improving model performance. Adding GAPL improves the performance by 9.32% and 7.61% in terms of IoU (Intersection over Union) and PRAUC (area under the precision-recall curve), and adding RRB improves the performance by 7.40% and 4.64%, respectively. In addition, we find that conservation information can help locate TFBSs.
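The IoU figure quoted in this abstract can be illustrated with a minimal sketch, assuming per-nucleotide binary labels (1 = inside a binding site). The function name `nucleotide_iou`, the example tracks, and the convention for two empty tracks are illustrative assumptions, not details taken from the paper.

```python
from typing import Sequence


def nucleotide_iou(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Intersection over union of two per-nucleotide binary label tracks."""
    if len(predicted) != len(actual):
        raise ValueError("label tracks must have the same length")
    # Positions labeled as binding site in both tracks.
    intersection = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    # Positions labeled as binding site in at least one track.
    union = sum(1 for p, a in zip(predicted, actual) if p == 1 or a == 1)
    # Assumed convention: two empty tracks count as a perfect match.
    return intersection / union if union else 1.0


# Hypothetical 10-nt window: the true site covers positions 3-6 (0-based),
# the prediction covers positions 4-7.
actual = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
predicted = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]
iou = nucleotide_iou(predicted, actual)  # 3 shared positions / 5 covered positions
```

A dense, nucleotide-level score of this kind rewards a model for placing the site boundaries correctly, which a sequence-level "contains a TFBS or not" accuracy cannot capture.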

5
Choi SR, Lee M. Transformer Architecture and Attention Mechanisms in Genome Data Analysis: A Comprehensive Review. BIOLOGY 2023; 12:1033. [PMID: 37508462 PMCID: PMC10376273 DOI: 10.3390/biology12071033] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 06/20/2023] [Revised: 07/18/2023] [Accepted: 07/21/2023] [Indexed: 07/30/2023]
Abstract
The emergence and rapid development of deep learning, specifically transformer-based architectures and attention mechanisms, have had transformative implications across several domains, including bioinformatics and genome data analysis. The analogous nature of genome sequences to language texts has enabled the application of techniques that have exhibited success in fields ranging from natural language processing to genomic data. This review provides a comprehensive analysis of the most recent advancements in the application of transformer architectures and attention mechanisms to genome and transcriptome data. The focus of this review is on the critical evaluation of these techniques, discussing their advantages and limitations in the context of genome data analysis. With the swift pace of development in deep learning methodologies, it becomes vital to continually assess and reflect on the current standing and future direction of the research. Therefore, this review aims to serve as a timely resource for both seasoned researchers and newcomers, offering a panoramic view of the recent advancements and elucidating the state-of-the-art applications in the field. Furthermore, this review paper serves to highlight potential areas of future investigation by critically evaluating studies from 2019 to 2023, thereby acting as a stepping-stone for further research endeavors.
Affiliation(s)
- Sanghyuk Roy Choi
  - School of Electrical and Electronics Engineering, Chung-Ang University, Seoul 06974, Republic of Korea
- Minhyeok Lee
  - School of Electrical and Electronics Engineering, Chung-Ang University, Seoul 06974, Republic of Korea

6
Jing Y, Zhang S, Wang H. DapNet-HLA: Adaptive dual-attention mechanism network based on deep learning to predict non-classical HLA binding sites. Anal Biochem 2023; 666:115075. [PMID: 36740003 DOI: 10.1016/j.ab.2023.115075] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2022] [Revised: 01/30/2023] [Accepted: 02/02/2023] [Indexed: 02/05/2023]
Abstract
Human leukocyte antigen (HLA) plays a vital role in immunomodulatory function. Studies have shown that immunotherapy based on non-classical HLA has essential applications in cancer, COVID-19, and allergic diseases. However, there are few deep learning methods to predict non-classical HLA alleles. In this work, an adaptive dual-attention network named DapNet-HLA is established based on existing datasets. First, amino acid sequences are converted into numeric vectors using a lookup table. To overcome the feature sparsity problem caused by one-hot encoding, a fused word embedding method is used to map each amino acid to a low-dimensional word vector that is optimized during training of the classifier. Then, we use the GCB (group convolution block), SENet attention (squeeze-and-excitation networks), BiLSTM (bidirectional long short-term memory network), and Bahdanau attention mechanism to construct the classifier. SENet assigns higher weights to the effective feature maps, so that the model can be trained to achieve better results. The attention mechanism, introduced for encoder-decoder models, is used to improve the effectiveness of RNN, LSTM or GRU (gated recurrent unit) networks. The ablation experiment shows that DapNet-HLA has the best adaptability for five datasets. On the five test datasets, the ACC index and MCC index of DapNet-HLA are 4.89% and 0.0933 higher than the comparison method, respectively. According to the ROC and PR curves obtained by 5-fold cross-validation, the AUC value fluctuates only slightly from fold to fold, which supports the robustness of DapNet-HLA. The codes and datasets are accessible at https://github.com/JYY625/DapNet-HLA.
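The lookup-table step described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions: the ordering of the 20 standard residues, the reserved index 0 for padding and unknown residues, and the name `encode_peptide` are all hypothetical and are not taken from the DapNet-HLA repository.

```python
from typing import List

# 20 standard amino acids, ordered alphabetically by one-letter code (assumed ordering).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Index 0 is reserved for padding and unknown residues (an assumption of this sketch).
LOOKUP = {aa: i for i, aa in enumerate(AMINO_ACIDS, start=1)}


def encode_peptide(sequence: str, length: int) -> List[int]:
    """Map a peptide to a fixed-length list of integer indices."""
    # Truncate to the target length, then look up each residue.
    indices = [LOOKUP.get(aa, 0) for aa in sequence.upper()[:length]]
    # Right-pad short peptides with the reserved index 0.
    return indices + [0] * (length - len(indices))


# "X" is not a standard residue, so it maps to the reserved index.
encoded = encode_peptide("ACDXY", 8)
```

Integer indices of this kind are the usual input to a trainable embedding layer, which is how a dense low-dimensional vector per residue avoids the sparsity of a 20-dimensional one-hot vector.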
Affiliation(s)
- Yuanyuan Jing
  - School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
- Shengli Zhang
  - School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
- Houqiang Wang
  - School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China

7
Bang I, Lee SM, Park S, Park JY, Nong LK, Gao Y, Palsson BO, Kim D. Deep-learning optimized DEOCSU suite provides an iterable pipeline for accurate ChIP-exo peak calling. Brief Bioinform 2023; 24:7005164. [PMID: 36702751 DOI: 10.1093/bib/bbad024] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 08/23/2022] [Revised: 01/02/2023] [Accepted: 01/08/2023] [Indexed: 01/28/2023]
Abstract
Recognizing binding sites of DNA-binding proteins is a key factor for elucidating transcriptional regulation in organisms. ChIP-exo enables researchers to delineate genome-wide binding landscapes of DNA-binding proteins with near single base-pair resolution. However, the peak calling step hinders ChIP-exo application since the published algorithms tend to generate false-positive and false-negative predictions. Here, we report the development of DEOCSU (DEep-learning Optimized ChIP-exo peak calling SUite), a novel machine learning-based ChIP-exo peak calling suite. DEOCSU includes a deep convolutional neural network model that was trained with curated ChIP-exo peak data to distinguish the visualized data of bona fide peaks from false ones. Performance validation of the trained deep-learning model indicated accuracy, precision and recall all above 95%. Applying the new suite to both in-house and publicly available ChIP-exo datasets obtained from bacteria, eukaryotes and archaea revealed an accurate prediction of peaks containing canonical motifs, highlighting the versatility and efficiency of DEOCSU. Furthermore, DEOCSU can be executed on a cloud computing platform or in a local environment. With visualization software included in the suite, adjustable options such as the threshold of peak probability, and iterable updating of the pre-trained model, DEOCSU can be optimized for users' specific needs.
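How an adjustable peak-probability threshold trades precision against recall can be shown with a small sketch. It assumes each candidate peak has a model probability and a curated label (1 = bona fide peak); the function `precision_recall` and the example values are hypothetical and do not reproduce DEOCSU's interface or data.

```python
from typing import Sequence, Tuple


def precision_recall(
    probabilities: Sequence[float], labels: Sequence[int], threshold: float
) -> Tuple[float, float]:
    """Precision and recall when peaks with probability >= threshold are called."""
    tp = fp = fn = 0
    for prob, label in zip(probabilities, labels):
        called = prob >= threshold
        if called and label == 1:
            tp += 1  # true peak that was called
        elif called and label == 0:
            fp += 1  # false peak that was called
        elif not called and label == 1:
            fn += 1  # true peak that was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical candidate peaks with curated labels.
probs = [0.95, 0.80, 0.60, 0.40, 0.10]
labels = [1, 1, 0, 1, 0]
strict = precision_recall(probs, labels, 0.7)   # fewer calls, no false positives
lenient = precision_recall(probs, labels, 0.3)  # more calls, no missed peaks
```

In this toy example the strict threshold gives perfect precision but misses one true peak, while the lenient threshold recovers every true peak at the cost of one false call.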
Affiliation(s)
- Ina Bang
  - School of Energy and Chemical Engineering, Ulsan National Institute of Science and Technology, Ulsan 44919, Republic of Korea
- Sang-Mok Lee
  - School of Energy and Chemical Engineering, Ulsan National Institute of Science and Technology, Ulsan 44919, Republic of Korea
- Seojoung Park
  - School of Energy and Chemical Engineering, Ulsan National Institute of Science and Technology, Ulsan 44919, Republic of Korea
- Joon Young Park
  - School of Energy and Chemical Engineering, Ulsan National Institute of Science and Technology, Ulsan 44919, Republic of Korea
- Linh Khanh Nong
  - School of Energy and Chemical Engineering, Ulsan National Institute of Science and Technology, Ulsan 44919, Republic of Korea
- Ye Gao
  - Department of Bioengineering, University of California San Diego, La Jolla CA 92093, USA
- Bernhard O Palsson
  - Department of Bioengineering, University of California San Diego, La Jolla CA 92093, USA
  - Department of Pediatrics, University of California San Diego, La Jolla CA 92093, USA
  - Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Building 220, Kemitorvet, 2800 Kgs. Lyngby, Denmark
- Donghyuk Kim
  - School of Energy and Chemical Engineering, Ulsan National Institute of Science and Technology, Ulsan 44919, Republic of Korea

8
Wang Z, Zhang Y, Yu Y, Zhang J, Liu Y, Zou Q. A Unified Deep Learning Framework for Single-Cell ATAC-Seq Analysis Based on ProdDep Transformer Encoder. Int J Mol Sci 2023; 24:ijms24054784. [PMID: 36902216 PMCID: PMC10003007 DOI: 10.3390/ijms24054784] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/11/2022] [Revised: 01/02/2023] [Accepted: 02/22/2023] [Indexed: 03/06/2023]
Abstract
Recent advances in the single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) have provided cell-specific chromatin accessibility landscapes of cis-regulatory elements, providing deeper insights into cellular states and dynamics. However, few research efforts have been dedicated to modeling the relationship between regulatory grammars and single-cell chromatin accessibility and to incorporating different analysis scenarios of scATAC-seq data into a general framework. To this end, we propose a unified deep learning framework based on the ProdDep Transformer Encoder, dubbed PROTRAIT, for scATAC-seq data analysis. Motivated by deep language models, PROTRAIT leverages the ProdDep Transformer Encoder to capture the syntax of transcription factor (TF)-DNA binding motifs from scATAC-seq peaks for predicting single-cell chromatin accessibility and learning single-cell embeddings. Based on the cell embeddings, PROTRAIT annotates cell types using the Louvain algorithm. Furthermore, PROTRAIT identifies likely noise in raw scATAC-seq data and denoises these values based on the predicted chromatin accessibility. In addition, PROTRAIT employs differential accessibility analysis to infer TF activity at single-cell and single-nucleotide resolution. Extensive experiments based on the Buenrostro2018 dataset validate the effectiveness of PROTRAIT for chromatin accessibility prediction, cell type annotation, and scATAC-seq data denoising, outperforming current approaches on different evaluation metrics. We also confirm the consistency between the inferred TF activity and the literature, and we demonstrate the scalability of PROTRAIT to datasets containing over one million cells.
Affiliation(s)
- Zixuan Wang
  - School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China
- Yongqing Zhang
  - School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China
- Yun Yu
  - School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China
- Junming Zhang
  - School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China
- Yuhang Liu
  - School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China

9
Zhang Y, Wang M, Wang Z, Liu Y, Xiong S, Zou Q. MetaSEM: Gene Regulatory Network Inference from Single-Cell RNA Data by Meta-Learning. Int J Mol Sci 2023; 24:ijms24032595. [PMID: 36768917 PMCID: PMC9916710 DOI: 10.3390/ijms24032595] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 11/11/2022] [Revised: 01/23/2023] [Accepted: 01/26/2023] [Indexed: 01/31/2023]
Abstract
Regulators in gene regulatory networks (GRNs) are crucial for identifying cell states. However, GRN inference based on scRNA-seq data has several problems, including high dimensionality and sparsity, and requires more label data. Therefore, we propose a meta-learning GRN inference framework to identify regulatory factors. Specifically, meta-learning solves the parameter optimization problem caused by high-dimensional sparse data features. In addition, a few-shot solution was used to solve the problem of lack of label data. A structural equation model (SEM) was embedded in the model to identify important regulators. We integrated the parameter optimization strategy into the bi-level optimization to extract the feature consistent with GRN reasoning. This unique design makes our model robust to small-scale data. By studying the GRN inference task, we confirmed that the selected regulators were closely related to gene expression specificity. We further analyzed the GRN inferred to find the important regulators in cell type identification. Extensive experimental results showed that our model effectively captured the regulator in single-cell GRN inference. Finally, the visualization results verified the importance of the selected regulators for cell type recognition.
Affiliation(s)
- Yongqing Zhang
  - School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China
- Maocheng Wang
  - School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China
- Zixuan Wang
  - School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China
- Yuhang Liu
  - School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China
- Shuwen Xiong
  - School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610051, China

10
Tang X, Zheng P, Liu Y, Yao Y, Huang G. LangMoDHS: A deep learning language model for predicting DNase I hypersensitive sites in mouse genome. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2023; 20:1037-1057. [PMID: 36650801 DOI: 10.3934/mbe.2023048] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/17/2023]
Abstract
DNase I hypersensitive sites (DHSs) are specific genomic regions that are critical for detecting and understanding cis-regulatory elements. Although many methods have been developed to detect DHSs, a large gap remains in practice. We present a deep learning-based language model for predicting DHSs, named LangMoDHS. LangMoDHS mainly comprises a convolutional neural network (CNN), a bi-directional long short-term memory (Bi-LSTM) network and feed-forward attention. The CNN and the Bi-LSTM were stacked in a parallel manner, which helps accumulate multiple-view representations from primary DNA sequences. We conducted 5-fold cross-validations and independent tests over 14 tissues and 4 developmental stages. The experiments showed that LangMoDHS is competitive with or slightly better than iDHS-Deep, the latest method for predicting DHSs. The experiments also indicated a substantial contribution of the CNN, the Bi-LSTM, and attention to DHS prediction. We implemented LangMoDHS as a user-friendly web server, which is accessible at http://www.biolscience.cn/LangMoDHS/. We used indices related to information entropy to explore the sequence motifs of DHSs. The analysis provided some insight into DHSs.
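The abstract does not say which entropy-related indices were used, so the following is only a generic sketch of one common choice: the Shannon entropy of the nucleotide distribution at each position of a set of equal-length sequences. The function name `positional_entropy` and the toy sequences are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def positional_entropy(sequences: Sequence[str]) -> List[float]:
    """Shannon entropy in bits of the nucleotide distribution at each position."""
    if not sequences:
        return []
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")
    entropies = []
    # zip(*sequences) yields one tuple of nucleotides per alignment column.
    for column in zip(*sequences):
        total = len(column)
        counts = Counter(column)
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
        entropies.append(entropy)
    return entropies


# Hypothetical aligned sequences: columns 0 and 2 are fully conserved,
# column 3 is uniformly distributed over the four nucleotides.
seqs = ["ACGT", "ACGA", "ATGC", "AAGG"]
entropy = positional_entropy(seqs)
```

Low-entropy positions are conserved and are candidates for motif positions, while an entropy close to 2 bits indicates a position with no nucleotide preference.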
Affiliation(s)
- Xingyu Tang
  - School of Electrical Engineering, Shaoyang University, Shaoyang 422000, China
- Peijie Zheng
  - School of Electrical Engineering, Shaoyang University, Shaoyang 422000, China
- Yuewu Liu
  - College of Information and Intelligence, Hunan Agricultural University, Changsha 410128, China
- Yuhua Yao
  - School of Mathematics and Statistics, Hainan Normal University, Haikou 571158, China
- Guohua Huang
  - School of Electrical Engineering, Shaoyang University, Shaoyang 422000, China

11
Yan W, Li Z, Pian C, Wu Y. PlantBind: an attention-based multi-label neural network for predicting plant transcription factor binding sites. Brief Bioinform 2022; 23:6713513. [PMID: 36155619 DOI: 10.1093/bib/bbac425] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 06/18/2022] [Revised: 08/29/2022] [Accepted: 08/31/2022] [Indexed: 12/14/2022]
Abstract
Identification of transcription factor binding sites (TFBSs) is essential to understanding of gene regulation. Designing computational models for accurate prediction of TFBSs is crucial because it is not feasible to experimentally assay all transcription factors (TFs) in all sequenced eukaryotic genomes. Although many methods have been proposed for the identification of TFBSs in humans, methods designed for plants are comparatively underdeveloped. Here, we present PlantBind, a method for integrated prediction and interpretation of TFBSs based on DNA sequences and DNA shape profiles. Built on an attention-based multi-label deep learning framework, PlantBind not only simultaneously predicts the potential binding sites of 315 TFs, but also identifies the motifs bound by transcription factors. During the training process, this model revealed a strong similarity among TF family members with respect to target binding sequences. Trans-species prediction performance using four Zea mays TFs demonstrated the suitability of this model for transfer learning. Overall, this study provides an effective solution for identifying plant TFBSs, which will promote greater understanding of transcriptional regulatory mechanisms in plants.
Collapse
Affiliation(s)
- Zutan Li
- Nanjing Agricultural University
- Cong Pian
- College of Sciences at Nanjing Agricultural University
- Yufeng Wu
- State Key Laboratory for Crop Genetics and Germplasm Enhancement, Bioinformatics Center, College of Agriculture, Academy for Advanced Interdisciplinary Studies at Nanjing Agricultural University
|
12
|
Towards a better understanding of TF-DNA binding prediction from genomic features. Comput Biol Med 2022; 149:105993. [DOI: 10.1016/j.compbiomed.2022.105993] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2022] [Revised: 07/12/2022] [Accepted: 08/14/2022] [Indexed: 11/17/2022]
|
13
|
Zhang Y, Bao W, Cao Y, Cong H, Chen B, Chen Y. A survey on protein–DNA-binding sites in computational biology. Brief Funct Genomics 2022; 21:357-375. [DOI: 10.1093/bfgp/elac009] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 02/02/2022] [Revised: 04/07/2022] [Accepted: 04/22/2022] [Indexed: 01/08/2023]
Abstract
Transcription factors are important cellular components of the process of gene expression control. Transcription factor binding sites are locations where transcription factors specifically recognize DNA sequences, targeting gene-specific regions and recruiting transcription factors or chromatin regulators to fine-tune spatiotemporal gene regulation. As the common proteins, transcription factors play a meaningful role in life-related activities. In the face of the increase in the protein sequence, it is urgent how to predict the structure and function of the protein effectively. At present, protein–DNA-binding site prediction methods are based on traditional machine learning algorithms and deep learning algorithms. In the early stage, we usually used the development method based on traditional machine learning algorithm to predict protein–DNA-binding sites. In recent years, methods based on deep learning to predict protein–DNA-binding sites from sequence data have achieved remarkable success. Various statistical and machine learning methods used to predict the function of DNA-binding proteins have been proposed and continuously improved. Existing deep learning methods for predicting protein–DNA-binding sites can be roughly divided into three categories: convolutional neural network (CNN), recursive neural network (RNN) and hybrid neural network based on CNN–RNN. The purpose of this review is to provide an overview of the computational and experimental methods applied in the field of protein–DNA-binding site prediction today. This paper introduces the methods of traditional machine learning and deep learning in protein–DNA-binding site prediction from the aspects of data processing characteristics of existing learning frameworks and differences between basic learning model frameworks. Our existing methods are relatively simple compared with natural language processing, computational vision, computer graphics and other fields. Therefore, the summary of existing protein–DNA-binding site prediction methods will help researchers better understand this field.
|
14
|
Base-resolution prediction of transcription factor binding signals by a deep learning framework. PLoS Comput Biol 2022; 18:e1009941. [PMID: 35263332 PMCID: PMC8982852 DOI: 10.1371/journal.pcbi.1009941] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 11/05/2021] [Revised: 04/05/2022] [Accepted: 02/19/2022] [Indexed: 01/13/2023]
Abstract
Transcription factors (TFs) play an important role in regulating gene expression, thus the identification of the sites bound by them has become a fundamental step for molecular and cellular biology. In this paper, we developed a deep learning framework leveraging existing fully convolutional neural networks (FCN) to predict TF-DNA binding signals at the base-resolution level (named as FCNsignal). The proposed FCNsignal can simultaneously achieve the following tasks: (i) modeling the base-resolution signals of binding regions; (ii) discriminating binding or non-binding regions; (iii) locating TF-DNA binding regions; (iv) predicting binding motifs. Besides, FCNsignal can also be used to predict opening regions across the whole genome. The experimental results on 53 TF ChIP-seq datasets and 6 chromatin accessibility ATAC-seq datasets show that our proposed framework outperforms some existing state-of-the-art methods. In addition, we explored to use the trained FCNsignal to locate all potential TF-DNA binding regions on a whole chromosome and predict DNA sequences of arbitrary length, and the results show that our framework can find most of the known binding regions and accept sequences of arbitrary length. Furthermore, we demonstrated the potential ability of our framework in discovering causal disease-associated single-nucleotide polymorphisms (SNPs) through a series of experiments.
|
15
|
Zhang L, Yang Y, Chai L, Li Q, Liu J, Lin H, Liu L. A deep learning model to identify gene expression level using cobinding transcription factor signals. Brief Bioinform 2021; 23:6447678. [PMID: 34864886 DOI: 10.1093/bib/bbab501] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 08/23/2021] [Revised: 10/13/2021] [Accepted: 11/01/2021] [Indexed: 01/02/2023]
Abstract
Gene expression is directly controlled by transcription factors (TFs) in a complex combination manner. It remains a challenging task to systematically infer how the cooperative binding of TFs drives gene activity. Here, we quantitatively analyzed the correlation between TFs and surveyed the TF interaction networks associated with gene expression in GM12878 and K562 cell lines. We identified six TF modules associated with gene expression in each cell line. Furthermore, according to the enrichment characteristics of TFs in these TF modules around a target gene, a convolutional neural network model, called TFCNN, was constructed to identify gene expression level. Results showed that the TFCNN model achieved a good prediction performance for gene expression. The average of the area under receiver operating characteristics curve (AUC) can reach up to 0.975 and 0.976, respectively in GM12878 and K562 cell lines. By comparison, we found that the TFCNN model outperformed the prediction models based on SVM and LDA. This is due to the TFCNN model could better extract the combinatorial interaction among TFs. Further analysis indicated that the abundant binding of regulatory TFs dominates expression of target genes, while the cooperative interaction between TFs has a subtle regulatory effects. And gene expression could be regulated by different TF combinations in a nonlinear way. These results are helpful for deciphering the mechanism of TF combination regulating gene expression.
Affiliation(s)
- Lirong Zhang
- School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China
- Yanchao Yang
- School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China
- Lu Chai
- School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China
- Qianzhong Li
- School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China
- Junjie Liu
- School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China
- Hao Lin
- School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Li Liu
- School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China
|
16
|
Jiang Z, Xiao SR, Liu R. Dissecting and predicting different types of binding sites in nucleic acids based on structural information. Brief Bioinform 2021; 23:6384399. [PMID: 34624074 PMCID: PMC8769709 DOI: 10.1093/bib/bbab411] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 07/07/2021] [Revised: 08/26/2021] [Accepted: 09/07/2021] [Indexed: 12/16/2022]
Abstract
The biological functions of DNA and RNA generally depend on their interactions with other molecules, such as small ligands, proteins and nucleic acids. However, our knowledge of the nucleic acid binding sites for different interaction partners is very limited, and identification of these critical binding regions is not a trivial work. Herein, we performed a comprehensive comparison between binding and nonbinding sites and among different categories of binding sites in these two nucleic acid classes. From the structural perspective, RNA may interact with ligands through forming binding pockets and contact proteins and nucleic acids using protruding surfaces, while DNA may adopt regions closer to the middle of the chain to make contacts with other molecules. Based on structural information, we established a feature-based ensemble learning classifier to identify the binding sites by fully using the interplay among different machine learning algorithms, feature spaces and sample spaces. Meanwhile, we designed a template-based classifier by exploiting structural conservation. The complementarity between the two classifiers motivated us to build an integrative framework for improving prediction performance. Moreover, we utilized a post-processing procedure based on the random walk algorithm to further correct the integrative predictions. Our unified prediction framework yielded promising results for different binding sites and outperformed existing methods.
Affiliation(s)
- Zheng Jiang
- College of Informatics, Huazhong Agricultural University, Wuhan, P. R. China
- Si-Rui Xiao
- College of Informatics, Huazhong Agricultural University, Wuhan, P. R. China
- Rong Liu
- College of Informatics, Huazhong Agricultural University, Wuhan, P. R. China
|