1
|
Wall BPG, Nguyen M, Harrell JC, Dozmorov MG. Machine and Deep Learning Methods for Predicting 3D Genome Organization. Methods Mol Biol 2025; 2856:357-400. [PMID: 39283464 DOI: 10.1007/978-1-0716-4136-1_22] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/25/2024]
Abstract
Three-dimensional (3D) chromatin interactions, such as enhancer-promoter interactions (EPIs), loops, topologically associating domains (TADs), and A/B compartments, play critical roles in a wide range of cellular processes by regulating gene expression. Recent development of chromatin conformation capture technologies has enabled genome-wide profiling of various 3D structures, even with single cells. However, current catalogs of 3D structures remain incomplete and unreliable due to differences in technology, tools, and low data resolution. Machine learning methods have emerged as an alternative to obtain missing 3D interactions and/or improve resolution. Such methods frequently use genome annotation data (ChIP-seq, DNAse-seq, etc.), DNA sequencing information (k-mers and transcription factor binding site (TFBS) motifs), and other genomic properties to learn the associations between genomic features and chromatin interactions. In this review, we discuss computational tools for predicting three types of 3D interactions (EPIs, chromatin interactions, and TAD boundaries) and analyze their pros and cons. We also point out obstacles to the computational prediction of 3D interactions and suggest future research directions.
Collapse
Affiliation(s)
- Brydon P G Wall
- Center for Biological Data Science, Virginia Commonwealth University, Richmond, VA, USA
| | - My Nguyen
- Department of Biostatistics, Virginia Commonwealth University, Richmond, VA, USA
| | - J Chuck Harrell
- Department of Pathology, Virginia Commonwealth University, Richmond, VA, USA
- Massey Comprehensive Cancer Center, Virginia Commonwealth University, Richmond, VA, USA
- Center for Pharmaceutical Engineering, Virginia Commonwealth University, Richmond, VA, USA
| | - Mikhail G Dozmorov
- Department of Biostatistics, Virginia Commonwealth University, Richmond, VA, USA.
- Department of Pathology, Virginia Commonwealth University, Richmond, VA, USA.
| |
Collapse
|
2
|
Şapcı AOB, Mirarab S. Memory-bound k-mer selection for large and evolutionarily diverse reference libraries. Genome Res 2024; 34:1455-1467. [PMID: 39209553 DOI: 10.1101/gr.279339.124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/15/2024] [Accepted: 08/06/2024] [Indexed: 09/04/2024]
Abstract
Using k-mers to find sequence matches is increasingly used in many bioinformatic applications, including metagenomic sequence classification. The accuracy of these downstream applications relies on the density of the reference databases, which are rapidly growing. Although the increased density provides hope for improvements in accuracy, scalability is a concern. Reference k-mers are kept in the memory during the query time, and saving all k-mers of these ever-expanding databases is fast becoming impractical. Several strategies for subsampling have been proposed, including minimizers and finding taxon-specific k-mers. However, we contend that these strategies are inadequate, especially when reference sets are taxonomically imbalanced, as are most microbial libraries. In this paper, we explore approaches for selecting a fixed-size subset of k-mers present in an ultra-large data set to include in a library such that the classification of reads suffers the least. Our experiments demonstrate the limitations of existing approaches, especially for novel and poorly sampled groups. We propose a library construction algorithm called k-mer RANKer (KRANK) that combines several components, including a hierarchical selection strategy with adaptive size restrictions and an equitable coverage strategy. We implement KRANK in highly optimized code and combine it with the locality-sensitive hashing classifier CONSULT-II to build a taxonomic classification and profiling method. On several benchmarks, KRANK k-mer selection significantly reduces memory consumption with minimal loss in classification accuracy. We show in extensive analyses based on CAMI benchmarks that KRANK outperforms k-mer-based alternatives in terms of taxonomic profiling and comes close to the best marker-based methods in terms of accuracy.
Collapse
Affiliation(s)
- Ali Osman Berk Şapcı
- Bioinformatics and Systems Biology Graduate Program, University of California, San Diego, California 92093, USA
| | - Siavash Mirarab
- Bioinformatics and Systems Biology Graduate Program, University of California, San Diego, California 92093, USA;
- Department of Electrical and Computer Engineering, University of California, San Diego, California 92093, USA
| |
Collapse
|
3
|
Peng S. HIV-1 M group subtype classification using deep learning approach. Comput Biol Med 2024; 183:109218. [PMID: 39369547 DOI: 10.1016/j.compbiomed.2024.109218] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/30/2024] [Revised: 09/24/2024] [Accepted: 09/25/2024] [Indexed: 10/08/2024]
Abstract
Traditionally, the classification of HIV-1 M group subtypes has depended on statistical methods constrained by sample sizes. Here HIV-1-M-SPBEnv was proposed as the first deep learning-based method for classifying HIV-1 M group subtypes via env gene sequences. This approach overcomes sample size challenges by utilizing artificial molecular evolution techniques to generate a synthetic dataset suitable for machine learning. Employing a convolutional Autoencoder embedded with two residual blocks and two transpose residual blocks, followed by a full connected neural network block, HIV-1-M-SPBEnv simplifies complex, high-dimensional DNA sequence data into concise, information-rich, low-dimensional representations, achieving exceptional classification accuracy. Through independent data set validation, the precision, accuracy, recall and F1 score of the HIV-1-M-SPBEnv model predictions were all 100 %, confirming its capability to accurately identify all 12 subtypes of the HIV-1 M group. Deployed through a web server, it provides seamless HIV-1 M group subtype prediction capabilities for researchers and clinicians. HIV-1-M-SPBEnv web server is accessible at http://www.hivsubclass.com and all the code is available at https://github.com/pengsihua2023/HIV-1-M-SPBEnv.
Collapse
Affiliation(s)
- Sihua Peng
- Department of Epidemiology and Biostatistics, College of Public Health, University of Georgia, Athens, GA, 30602, United States.
| |
Collapse
|
4
|
Shireen H, Batool F, Khatoon H, Parveen N, Sehar NU, Hussain I, Ali S, Abbasi AA. Predicting genome-wide tissue-specific enhancers via combinatorial transcription factor genomic occupancy analysis. FEBS Lett 2024. [PMID: 39367524 DOI: 10.1002/1873-3468.15030] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/14/2024] [Revised: 08/27/2024] [Accepted: 09/13/2024] [Indexed: 10/06/2024]
Abstract
Enhancers are non-coding cis-regulatory elements crucial for transcriptional regulation. Mutations in enhancers can disrupt gene regulation, leading to disease phenotypes. Identifying enhancers and their tissue-specific activity is challenging due to their lack of stereotyped sequences. This study presents a sequence-based computational model that uses combinatorial transcription factor (TF) genomic occupancy to predict tissue-specific enhancers. Trained on diverse datasets, including ENCODE and Vista enhancer browser data, the model predicted 25 000 forebrain-specific cis-regulatory modules (CRMs) in the human genome. Validation using biochemical features, disease-associated SNPs, and in vivo zebrafish analysis confirmed its effectiveness. This model aids in predicting enhancers lacking well-characterized chromatin features, complementing experimental approaches in tissue-specific enhancer discovery.
Collapse
Affiliation(s)
- Huma Shireen
- National Center for Bioinformatics, Program of Comparative and Evolutionary Genomics, Faculty of Biological Sciences, Quaid-i-Azam University, Islamabad, Pakistan
| | - Fatima Batool
- National Center for Bioinformatics, Program of Comparative and Evolutionary Genomics, Faculty of Biological Sciences, Quaid-i-Azam University, Islamabad, Pakistan
| | - Hizran Khatoon
- National Center for Bioinformatics, Program of Comparative and Evolutionary Genomics, Faculty of Biological Sciences, Quaid-i-Azam University, Islamabad, Pakistan
| | - Nazia Parveen
- National Center for Bioinformatics, Program of Comparative and Evolutionary Genomics, Faculty of Biological Sciences, Quaid-i-Azam University, Islamabad, Pakistan
| | - Noor Us Sehar
- National Center for Bioinformatics, Program of Comparative and Evolutionary Genomics, Faculty of Biological Sciences, Quaid-i-Azam University, Islamabad, Pakistan
| | - Irfan Hussain
- Centre for Regenerative Medicine and Stem Cells Research, Agha Khan University hospital, Karachi, Pakistan
| | - Shahid Ali
- Department of Organismal Biology and Anatomy, The University of Chicago, Chicago, IL, USA
| | - Amir Ali Abbasi
- National Center for Bioinformatics, Program of Comparative and Evolutionary Genomics, Faculty of Biological Sciences, Quaid-i-Azam University, Islamabad, Pakistan
| |
Collapse
|
5
|
Koido M, Tomizuka K, Terao C. Fundamentals for predicting transcriptional regulations from DNA sequence patterns. J Hum Genet 2024; 69:499-504. [PMID: 38730006 PMCID: PMC11422166 DOI: 10.1038/s10038-024-01256-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/18/2024] [Revised: 04/10/2024] [Accepted: 04/25/2024] [Indexed: 05/12/2024]
Abstract
Cell-type-specific regulatory elements, cataloged through extensive experiments and bioinformatics in large-scale consortiums, have enabled enrichment analyses of genetic associations that primarily utilize positional information of the regulatory elements. These analyses have identified cell types and pathways genetically associated with human complex traits. However, our understanding of detailed allelic effects on these elements' activities and on-off states remains incomplete, hampering the interpretation of human genetic study results. This review introduces machine learning methods to learn sequence-dependent transcriptional regulation mechanisms from DNA sequences for predicting such allelic effects (not associations). We provide a concise history of machine-learning-based approaches, the requirements, and the key computational processes, focusing on primers in machine learning. Convolution and self-attention, pivotal in modern deep-learning models, are explained through geometrical interpretations using dot products. This facilitates understanding of the concept and why these have been used for machine learning for DNA sequences. These will inspire further research in this genetics and genomics field.
Collapse
Affiliation(s)
- Masaru Koido
- Laboratory of Complex Trait Genomics, Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Tokyo, Japan.
- Laboratory for Statistical and Translational Genetics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan.
| | - Kohei Tomizuka
- Laboratory for Statistical and Translational Genetics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
| | - Chikashi Terao
- Laboratory for Statistical and Translational Genetics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan.
- Clinical Research Center, Shizuoka General Hospital, Shizuoka, Japan.
- The Department of Applied Genetics, The School of Pharmaceutical Sciences, University of Shizuoka, Shizuoka, Japan.
| |
Collapse
|
6
|
Li X, Li H, Yang Z, Wang L. Distribution rules of 8-mer spectra and characterization of evolution state in animal genome sequences. BMC Genomics 2024; 25:855. [PMID: 39266973 PMCID: PMC11391722 DOI: 10.1186/s12864-024-10786-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/17/2024] [Accepted: 09/09/2024] [Indexed: 09/14/2024] Open
Abstract
BACKGROUND Studying the composition rules and evolution mechanisms of genome sequences are core issues in the post-genomic era, and k-mer spectrum analysis of genome sequences is an effective means to solve this problem. RESULT We divided total 8-mers of genome sequences into 16 kinds of XY-type due to XY dinucleotides number in 8-mers. Previous works explored that the independent unimodal distributions observed only in three CG-type 8-mer spectra, while non-CG type 8-mer spectra have not the universal phenomenon from prokaryotes to eukaryotes. On this basis, we analyzed the distribution variation of non-CG type 8-mer spectra across 889 animal genome sequences. Following the evolutionary order of animals from primitive to more complex, we found that the spectrum distributions gradually transition from unimodal to tri-modal. The relative distance from the average frequency of each non-CG type 8-mers to the center frequency is different within a species and among different species. For the 8-mers contain CG dinucleotides, we further divided these into 16 subsets, where each 8-mer contains both CG and XY dinucleotides, called XY1_CG1 subsets. We found that the separability values of XY1_CG1 spectra are closely related to the evolution and specificity of animals. Considering the constraint of Chargaff's second parity rule, we finally obtained 10 separability values as the feature set to characterize the evolution state of genome sequences. In order to verify the rationality of the feature set, we used 14 common classification algorithms to perform binary classification tests. The results showed that the accuracy (Acc) ranged between 98.70% and 83.88% among birds, other vertebrates and mammals. CONCLUSION We proposed a credible feature set to characterizes the evolution state of genomes and obtained satisfied results by the feature set on large scale classification of animals.
Collapse
Affiliation(s)
- Xiaolong Li
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot, 010021, China
| | - Hong Li
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot, 010021, China.
| | - Zhenhua Yang
- School of Economics and Management, Inner Mongolia University of Science and Technology, Baotou, 014010, China
| | - Lu Wang
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot, 010021, China
| |
Collapse
|
7
|
Pradhan UK, Behera P, Das R, Naha S, Gupta A, Parsad R, Pradhan SK, Meher PK. AScirRNA: A novel computational approach to discover abiotic stress-responsive circular RNAs in plant genome. Comput Biol Chem 2024; 113:108205. [PMID: 39265460 DOI: 10.1016/j.compbiolchem.2024.108205] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/19/2024] [Revised: 07/12/2024] [Accepted: 09/04/2024] [Indexed: 09/14/2024]
Abstract
In the realm of plant biology, understanding the intricate regulatory mechanisms governing stress responses stands as a pivotal pursuit. Circular RNAs (circRNAs), emerging as critical players in gene regulation, have garnered attention in recent days for their potential roles in abiotic stress adaptation. A comprehensive grasp of circRNAs' functions in stress response offers avenues for breeders to manipulating plants to develop abiotic stress resistant crop cultivars to thrive in challenging climates. This study pioneers a machine learning-based model for predicting abiotic stress-responsive circRNAs. The K-tuple nucleotide composition (KNC) and Pseudo KNC (PKNC) features were utilized to numerically represent circRNAs. Three different feature selection strategies were employed to select relevant and non-redundant features. Eight shallow and four deep learning algorithms were evaluated to build the final predictive model. Following five-fold cross-validation process, XGBoost learning algorithm demonstrated superior performance with LightGBM-chosen 260 KNC features (Accuracy: 74.55 %, auROC: 81.23 %, auPRC: 76.52 %) and 160 PKNC features (Accuracy: 74.32 %, auROC: 81.04 %, auPRC: 76.43 %), over other combinations of learning algorithms and feature selection techniques. Further, the robustness of the developed models were evaluated using an independent test dataset, where the overall accuracy, auROC and auPRC were found to be 73.13 %, 72.34 % and 72.68 % for KNC feature set and 73.52 %, 79.53 % and 73.09 % for PKNC feature set, respectively. This computational approach was also integrated into an online prediction tool, AScirRNA (https://iasri-sg.icar.gov.in/ascirna/) for easy prediction by the users. Both the proposed model and the developed tool are poised to augment ongoing efforts in identifying stress-responsive circRNAs in plants.
Collapse
Affiliation(s)
- Upendra Kumar Pradhan
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India.
| | - Prasanjit Behera
- Department of Bioinformatics, Odisha University of Agriculture & Technology, Bhubaneswar, Odisha 751003, India.
| | - Ritwika Das
- Division of Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India.
| | - Sanchita Naha
- Division of Computer Applications, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India.
| | - Ajit Gupta
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India.
| | - Rajender Parsad
- ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India.
| | - Sukanta Kumar Pradhan
- Department of Bioinformatics, Odisha University of Agriculture & Technology, Bhubaneswar, Odisha 751003, India.
| | - Prabina Kumar Meher
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India.
| |
Collapse
|
8
|
Li F, Bi Y, Guo X, Tan X, Wang C, Pan S. Advancing mRNA subcellular localization prediction with graph neural network and RNA structure. Bioinformatics 2024; 40:btae504. [PMID: 39133151 PMCID: PMC11361792 DOI: 10.1093/bioinformatics/btae504] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/15/2024] [Revised: 08/06/2024] [Accepted: 08/09/2024] [Indexed: 08/13/2024] Open
Abstract
MOTIVATION The asymmetrical distribution of expressed mRNAs tightly controls the precise synthesis of proteins within human cells. This non-uniform distribution, a cornerstone of developmental biology, plays a pivotal role in numerous cellular processes. To advance our comprehension of gene regulatory networks, it is essential to develop computational tools for accurately identifying the subcellular localizations of mRNAs. However, considering multi-localization phenomena remains limited in existing approaches, with none considering the influence of RNA's secondary structure. RESULTS In this study, we propose Allocator, a multi-view parallel deep learning framework that seamlessly integrates the RNA sequence-level and structure-level information, enhancing the prediction of mRNA multi-localization. The Allocator models equip four efficient feature extractors, each designed to handle different inputs. Two are tailored for sequence-based inputs, incorporating multilayer perceptron and multi-head self-attention mechanisms. The other two are specialized in processing structure-based inputs, employing graph neural networks. Benchmarking results underscore Allocator's superiority over state-of-the-art methods, showcasing its strength in revealing intricate localization associations. AVAILABILITY AND IMPLEMENTATION The webserver of Allocator is available at http://Allocator.unimelb-biotools.cloud.edu.au; the source code and datasets are available on GitHub (https://github.com/lifuyi774/Allocator) and Zenodo (https://doi.org/10.5281/zenodo.13235798).
Collapse
Affiliation(s)
- Fuyi Li
- College of Information Engineering, Northwest A&F University, Yangling 712100, China
- South Australian immunoGENomics Cancer Institute (SAiGENCI), The University of Adelaide, Adelaide, SA 5005, Australia
| | - Yue Bi
- Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
| | - Xudong Guo
- College of Information Engineering, Northwest A&F University, Yangling 712100, China
| | - Xiaolan Tan
- Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
| | - Cong Wang
- College of Information Engineering, Northwest A&F University, Yangling 712100, China
| | - Shirui Pan
- Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
- School of Information and Communication Technology, Griffith University, Gold Coast, QLD 4222, Australia
| |
Collapse
|
9
|
Zhao L, Hao R, Chai Z, Fu W, Yang W, Li C, Liu Q, Jiang Y. DeepOCR: A multi-species deep-learning framework for accurate identification of open chromatin regions in livestock. Comput Biol Chem 2024; 110:108077. [PMID: 38691895 DOI: 10.1016/j.compbiolchem.2024.108077] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/11/2024] [Revised: 03/27/2024] [Accepted: 04/16/2024] [Indexed: 05/03/2024]
Abstract
A wealth of experimental evidence has suggested that open chromatin regions (OCRs) are involved in many critical biological activities, such as DNA replication, enhancer activity, and gene transcription. Accurately identifying OCRs in livestock species can provide critical insights into the distribution and characteristics of OCRs for disease treatment in livestock, thereby improving animal welfare. However, most current machine-learning methods for OCR prediction were originally designed for a limited number of model organisms, such as humans and some model organisms, and thus their performance on non-model organisms, specifically livestock, is often unsatisfactory. To bridge this gap, we propose DeepOCR, a lightweight depth-separable residual network model for predicting OCRs in livestock, including chicken, cattle, and sheep. DeepOCR integrates a single convolution layer and two improved residue structure blocks to extract and learn important features from the input DNA sequences. A fully connected layer was also employed to further process the extracted features and improve the robustness of the entire network. Our benchmarking experiments demonstrated superior prediction performance of DeepOCR compared to state-of-the-art approaches on testing datasets of the three species. The source code of DeepOCR is freely available for academic purposes at https://github.com/jasonzhao371/DeepOCR/. We anticipate DeepOCR servers as a practical and reliable computational tool for OCR-related studies in livestock species.
Collapse
Affiliation(s)
- Liangwei Zhao
- College of Information Engineering, Northwest A&F University, Yangling 712100, China
| | - Ran Hao
- College of Information Engineering, Northwest A&F University, Yangling 712100, China
| | - Ziyi Chai
- College of Information Engineering, Northwest A&F University, Yangling 712100, China
| | - Weiwei Fu
- College of Pastoral Agriculture Science and Technology, Lanzhou University, Lanzhou, Gansu 730020, China
| | - Wei Yang
- National Clinical Research Center for Infectious Diseases, Shenzhen Third People's Hospital, Shenzhen 518112, China
| | - Chen Li
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia.
| | - Quanzhong Liu
- College of Information Engineering, Northwest A&F University, Yangling 712100, China.
| | - Yu Jiang
- Key Laboratory of Animal Genetics, Breeding and Reproduction of Shaanxi Province, College of Animal Science and Technology, Northwest A&F University, Yangling 712100, China; Key Laboratory of Livestock Biology, Northwest A&F University, Yangling, Shaanxi 712100, China.
| |
Collapse
|
10
|
Wall BPG, Nguyen M, Harrell JC, Dozmorov MG. Machine and deep learning methods for predicting 3D genome organization. ARXIV 2024:arXiv:2403.03231v1. [PMID: 38495565 PMCID: PMC10942493] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Subscribe] [Scholar Register] [Indexed: 03/19/2024]
Abstract
Three-Dimensional (3D) chromatin interactions, such as enhancer-promoter interactions (EPIs), loops, Topologically Associating Domains (TADs), and A/B compartments play critical roles in a wide range of cellular processes by regulating gene expression. Recent development of chromatin conformation capture technologies has enabled genome-wide profiling of various 3D structures, even with single cells. However, current catalogs of 3D structures remain incomplete and unreliable due to differences in technology, tools, and low data resolution. Machine learning methods have emerged as an alternative to obtain missing 3D interactions and/or improve resolution. Such methods frequently use genome annotation data (ChIP-seq, DNAse-seq, etc.), DNA sequencing information (k-mers, Transcription Factor Binding Site (TFBS) motifs), and other genomic properties to learn the associations between genomic features and chromatin interactions. In this review, we discuss computational tools for predicting three types of 3D interactions (EPIs, chromatin interactions, TAD boundaries) and analyze their pros and cons. We also point out obstacles of computational prediction of 3D interactions and suggest future research directions.
Collapse
Affiliation(s)
- Brydon P. G. Wall
- Center for Biological Data Science, Virginia Commonwealth University, Richmond, VA, 23284, USA
| | - My Nguyen
- Department of Biostatistics, Virginia Commonwealth University, Richmond, VA, 23298, USA
| | - J. Chuck Harrell
- Department of Pathology, Virginia Commonwealth University, Richmond, VA, 23284, USA
- Massey Comprehensive Cancer Center, Virginia Commonwealth University, Richmond, VA 23298, USA
- Center for Pharmaceutical Engineering, Virginia Commonwealth University, Richmond, VA 23298, USA
| | - Mikhail G. Dozmorov
- Department of Biostatistics, Virginia Commonwealth University, Richmond, VA, 23298, USA
- Department of Pathology, Virginia Commonwealth University, Richmond, VA, 23284, USA
| |
Collapse
|
11
|
Yao Z, Li F, Xie W, Chen J, Wu J, Zhan Y, Wu X, Wang Z, Zhang G. DeepSF-4mC: A deep learning model for predicting DNA cytosine 4mC methylation sites leveraging sequence features. Comput Biol Med 2024; 171:108166. [PMID: 38382385 DOI: 10.1016/j.compbiomed.2024.108166] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/06/2023] [Revised: 02/15/2024] [Accepted: 02/15/2024] [Indexed: 02/23/2024]
Abstract
N4-methylcytosine (4mC) is a DNA modification involving the addition of a methyl group to the fourth nitrogen atom of the cytosine base. This modification may influence gene regulation, providing potential insights into gene control mechanisms. Traditional laboratory methods for detecting 4mC DNA methylation have limitations, but the rise of artificial intelligence has introduced efficient computational strategies for 4mC site prediction. Despite this progress, challenges persist in terms of model performance and interpretability. To tackle these challenges, we propose DeepSF-4mC, a deep learning model specifically designed for predicting DNA cytosine 4mC methylation sites by leveraging sequence features. Our approach incorporates multiple encoding techniques to enhance prediction accuracy, increase model stability, and reduce the computational resources needed. Leveraging transfer learning, we harness existing models to enhance performance through learned representations or fine-tuning. Ensemble learning techniques combine predictions from multiple models, boosting robustness and accuracy. This research contributes to DNA methylation analysis and lays the groundwork for understanding 4mC's multifaceted role in biological processes. The web server for DeepSF-4mC is accessible at: http://deepsf-4mc.top/and the original code can be found at: https://github.com/754131799/DeepSF-4mC.
Collapse
Affiliation(s)
- Zhaomin Yao
- Department of Nuclear Medicine, General Hospital of Northern Theater Command, Shenyang, Liaoning, 110016, China; College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, Liaoning, 110167, China
| | - Fei Li
- College of Computer Science and Technology, Jilin University, Changchun, Jilin, 130012, China
| | - Weiming Xie
- Department of Nuclear Medicine, General Hospital of Northern Theater Command, Shenyang, Liaoning, 110016, China; College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, Liaoning, 110167, China
| | - Jiaming Chen
- Department of Nuclear Medicine, General Hospital of Northern Theater Command, Shenyang, Liaoning, 110016, China; College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, Liaoning, 110167, China
| | - Jiezhang Wu
- Department of Nuclear Medicine, General Hospital of Northern Theater Command, Shenyang, Liaoning, 110016, China; College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, Liaoning, 110167, China
| | - Ying Zhan
- Department of Nuclear Medicine, General Hospital of Northern Theater Command, Shenyang, Liaoning, 110016, China; College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, Liaoning, 110167, China
| | - Xiaodan Wu
- Department of Nuclear Medicine, General Hospital of Northern Theater Command, Shenyang, Liaoning, 110016, China
| | - Zhiguo Wang
- Department of Nuclear Medicine, General Hospital of Northern Theater Command, Shenyang, Liaoning, 110016, China; College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, Liaoning, 110167, China.
| | - Guoxu Zhang
- Department of Nuclear Medicine, General Hospital of Northern Theater Command, Shenyang, Liaoning, 110016, China; College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, Liaoning, 110167, China.
| |
Collapse
|
12
|
Wang L, Zhou Y. MRM-BERT: a novel deep neural network predictor of multiple RNA modifications by fusing BERT representation and sequence features. RNA Biol 2024; 21:1-10. [PMID: 38357904 PMCID: PMC10877979 DOI: 10.1080/15476286.2024.2315384] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 02/02/2024] [Indexed: 02/16/2024] Open
Abstract
RNA modifications play crucial roles in various biological processes and diseases. Accurate prediction of RNA modification sites is essential for understanding their functions. In this study, we propose a hybrid approach that fuses a pre-trained sequence representation with various sequence features to predict multiple types of RNA modifications in one combined prediction framework. We developed MRM-BERT, a deep learning method that combined the pre-trained DNABERT deep sequence representation module and the convolutional neural network (CNN) exploiting four traditional sequence feature encodings to improve the prediction performance. MRM-BERT was evaluated on multiple datasets of 12 commonly occurring RNA modifications, including m6A, m5C, m1A and so on. The results demonstrate that our hybrid model outperforms other models in terms of area under receiver operating characteristic curve (AUC) for all 12 types of RNA modifications. MRM-BERT is available as an online tool (http://117.122.208.21:8501) or source code (https://github.com/abhhba999/MRM-BERT), which allows users to predict RNA modification sites and visualize the results. Overall, our study provides an effective and efficient approach to predict multiple RNA modifications, contributing to the understanding of RNA biology and the development of therapeutic strategies.
Collapse
Affiliation(s)
- Linshu Wang
- Department of Biomedical Informatics, School of Basic Medical Sciences, Peking University, Beijing, China
| | - Yuan Zhou
- Department of Biomedical Informatics, School of Basic Medical Sciences, Peking University, Beijing, China
- Department of Biomedical Informatics, School of Basic Medical Sciences, State Key Laboratory of Vascular Homeostasis and Remodeling, Peking University, Beijing, China
| |
Collapse
|
13
|
Dhakal P, Tayara H, Chong KT. An ensemble of stacking classifiers for improved prediction of miRNA-mRNA interactions. Comput Biol Med 2023; 164:107242. [PMID: 37473564 DOI: 10.1016/j.compbiomed.2023.107242] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/01/2023] [Revised: 06/21/2023] [Accepted: 07/07/2023] [Indexed: 07/22/2023]
Abstract
MicroRNAs (miRNAs) are small non-coding RNA molecules that play a crucial role in regulating gene expression at the post-transcriptional level by binding to potential target sites of messenger RNAs (mRNAs), facilitated by the Argonaute family of proteins. Selecting the conservative candidate target sites (CTS) is a challenging step, considering that most of the existing computational algorithms primarily focus on canonical site types, which is a time-consuming and inefficient utilization of miRNA target site interactions. We developed a stacking classifier algorithm that addresses the CTS selection criteria using feature-encoding techniques that generates feature vectors, including k-mer nucleotide composition, dinucleotide composition, pseudo-nucleotide composition, and sequence order coupling. This innovative stacking classifier algorithm surpassed previous state-of-the-art algorithms in predicting functional miRNA targets. We evaluated the performance of the proposed model on 10 independent test datasets and obtained an average accuracy of 79.77%, which is a significant improvement of 7.26 % over previous models. This improvement shows that the proposed method has great potential for distinguishing highly functional miRNA targets and can serve as a valuable tool in biomedical and drug development research.
Collapse
Affiliation(s)
- Priyash Dhakal
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju-si, 54896, Jeollabuk-do, Republic of Korea
| | - Hilal Tayara
- School of International Engineering and Science, Jeonbuk National University, Jeonju-si, 54896, Jeollabuk-do, Republic of Korea.
| | - Kil To Chong
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju-si, 54896, Jeollabuk-do, Republic of Korea; Advanced Electronics and Information Research Center, Jeonbuk National University, Jeonju-si, 54896, Jeollabuk-do, Republic of Korea.
| |
Collapse
|
14
|
Friedman RZ, Ramu A, Lichtarge S, Myers CA, Granas DM, Gause M, Corbo JC, Cohen BA, White MA. Active learning of enhancer and silencer regulatory grammar in photoreceptors. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.08.21.554146. [PMID: 37662358 PMCID: PMC10473580 DOI: 10.1101/2023.08.21.554146] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 09/05/2023]
Abstract
Cis-regulatory elements (CREs) direct gene expression in health and disease, and models that can accurately predict their activities from DNA sequences are crucial for biomedicine. Deep learning represents one emerging strategy to model the regulatory grammar that relates CRE sequence to function. However, these models require training data on a scale that exceeds the number of CREs in the genome. We address this problem using active machine learning to iteratively train models on multiple rounds of synthetic DNA sequences assayed in live mammalian retinas. During each round of training the model actively selects sequence perturbations to assay, thereby efficiently generating informative training data. We iteratively trained a model that predicts the activities of sequences containing binding motifs for the photoreceptor transcription factor Cone-rod homeobox (CRX) using an order of magnitude less training data than current approaches. The model's internal confidence estimates of its predictions are reliable guides for designing sequences with high activity. The model correctly identified critical sequence differences between active and inactive sequences with nearly identical transcription factor binding sites, and revealed order and spacing preferences for combinations of motifs. Our results establish active learning as an effective method to train accurate deep learning models of cis-regulatory function after exhausting naturally occurring training examples in the genome.
Collapse
Affiliation(s)
- Ryan Z. Friedman
- The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine, Saint Louis, MO, 63110
- Department of Genetics, Washington University School of Medicine, Saint Louis, MO, 63110
| | - Avinash Ramu
- The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine, Saint Louis, MO, 63110
- Department of Genetics, Washington University School of Medicine, Saint Louis, MO, 63110
| | - Sara Lichtarge
- The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine, Saint Louis, MO, 63110
- Department of Genetics, Washington University School of Medicine, Saint Louis, MO, 63110
| | - Connie A. Myers
- Department of Pathology and Immunology, Washington University School of Medicine, Saint Louis, MO, 63110
| | - David M. Granas
- The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine, Saint Louis, MO, 63110
- Department of Genetics, Washington University School of Medicine, Saint Louis, MO, 63110
| | - Maria Gause
- Department of Pathology and Immunology, Washington University School of Medicine, Saint Louis, MO, 63110
| | - Joseph C. Corbo
- Department of Pathology and Immunology, Washington University School of Medicine, Saint Louis, MO, 63110
| | - Barak A. Cohen
- The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine, Saint Louis, MO, 63110
- Department of Genetics, Washington University School of Medicine, Saint Louis, MO, 63110
| | - Michael A. White
- The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine, Saint Louis, MO, 63110
- Department of Genetics, Washington University School of Medicine, Saint Louis, MO, 63110
| |
Collapse
|
15
|
Nowling RJ, Njoya K, Peters JG, Riehle MM. Prediction accuracy of regulatory elements from sequence varies by functional sequencing technique. Front Cell Infect Microbiol 2023; 13:1182567. [PMID: 37600946 PMCID: PMC10433755 DOI: 10.3389/fcimb.2023.1182567] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/08/2023] [Accepted: 07/10/2023] [Indexed: 08/22/2023] Open
Abstract
Introduction Various sequencing based approaches are used to identify and characterize the activities of cis-regulatory elements in a genome-wide fashion. Some of these techniques rely on indirect markers such as histone modifications (ChIP-seq with histone antibodies) or chromatin accessibility (ATAC-seq, DNase-seq, FAIRE-seq), while other techniques use direct measures such as episomal assays measuring the enhancer properties of DNA sequences (STARR-seq) and direct measurement of the binding of transcription factors (ChIP-seq with transcription factor-specific antibodies). The activities of cis-regulatory elements such as enhancers, promoters, and repressors are determined by their sequence and secondary processes such as chromatin accessibility, DNA methylation, and bound histone markers. Methods Here, machine learning models are employed to evaluate the accuracy with which cis-regulatory elements identified by various commonly used sequencing techniques can be predicted by their underlying sequence alone to distinguish between cis-regulatory activity that is reflective of sequence content versus secondary processes. Results and discussion Models trained and evaluated on D. melanogaster sequences identified through DNase-seq and STARR-seq are significantly more accurate than models trained on sequences identified by H3K4me1, H3K4me3, and H3K27ac ChIP-seq, FAIRE-seq, and ATAC-seq. These results suggest that the activity detected by DNase-seq and STARR-seq can be largely explained by underlying DNA sequence, independent of secondary processes. Experimentally, a subset of DNase-seq and H3K4me1 ChIP-seq sequences were tested for enhancer activity using luciferase assays and compared with previous tests performed on STARR-seq sequences. The experimental data indicated that STARR-seq sequences are substantially enriched for enhancer-specific activity, while the DNase-seq and H3K4me1 ChIP-seq sequences are not. Taken together, these results indicate that the DNase-seq approach identifies a broad class of regulatory elements of which enhancers are a subset and the associated data are appropriate for training models for detecting regulatory activity from sequence alone, STARR-seq data are best for training enhancer-specific sequence models, and H3K4me1 ChIP-seq data are not well suited for training and evaluating sequence-based models for cis-regulatory element prediction.
Collapse
Affiliation(s)
- Ronald J. Nowling
- Electrical Engineering and Computer Science, Milwaukee School of Engineering, Milwaukee, WI, United States
| | - Kimani Njoya
- Department of Microbiology and Immunology, Medical College of Wisconsin, Milwaukee, WI, United States
| | - John G. Peters
- Electrical Engineering and Computer Science, Milwaukee School of Engineering, Milwaukee, WI, United States
| | - Michelle M. Riehle
- Department of Microbiology and Immunology, Medical College of Wisconsin, Milwaukee, WI, United States
| |
Collapse
|
16
|
Ye J, Li A, Zheng H, Yang B, Lu Y. Machine Learning Advances in Predicting Peptide/Protein-Protein Interactions Based on Sequence Information for Lead Peptides Discovery. Adv Biol (Weinh) 2023; 7:e2200232. [PMID: 36775876 DOI: 10.1002/adbi.202200232] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/24/2022] [Revised: 12/30/2022] [Indexed: 02/14/2023]
Abstract
Peptides have shown increasing advantages and significant clinical value in drug discovery and development. With the development of high-throughput technologies and artificial intelligence (AI), machine learning (ML) methods for discovering new lead peptides have been expanded and incorporated into rational drug design. Predictions of peptide-protein interactions (PepPIs) and protein-protein interactions (PPIs) are both opportunities and challenges in computational biology, which will help to better understand the mechanisms of disease and provide the impetus for the discovery of lead peptides. This paper comprehensively reviews computational models for PepPI and PPI predictions. It begins with an introduction of various databases of peptide ligands and target proteins. Then it discusses data formats and feature representations for proteins and peptides. Furthermore, classical ML methods and emerging deep learning (DL) methods that can be used to train prediction models of PepPI and PPI are classified into four categories, and their advantages and disadvantages are analyzed. To assess the relative performance of different models, different validation protocols and evaluation indexes are discussed. The goal of this review is to help researchers quickly get started to develop computational frameworks using these integrated resources and eventually promote the discovery of lead peptides.
Collapse
Affiliation(s)
- Jiahao Ye
- School of Medicine, Shanghai University, Shanghai, 200444, China
| | - An Li
- Department of Critical Care Medicine, Shanghai Tenth People's Hospital, School of Medicine, Tongji University, Shanghai, 200072, China
- Department of Biochemical Pharmacy, School of Pharmacy, Second Military Medical University, Shanghai, 200433, China
| | - Hao Zheng
- School of Medicine, Shanghai University, Shanghai, 200444, China
| | - Banghua Yang
- School of Medicine, Shanghai University, Shanghai, 200444, China
| | - Yiming Lu
- School of Medicine, Shanghai University, Shanghai, 200444, China
- Department of Critical Care Medicine, Shanghai Tenth People's Hospital, School of Medicine, Tongji University, Shanghai, 200072, China
- Department of Biochemical Pharmacy, School of Pharmacy, Second Military Medical University, Shanghai, 200433, China
| |
Collapse
|
17
|
Chen R, Li F, Guo X, Bi Y, Li C, Pan S, Coin LJM, Song J. ATTIC is an integrated approach for predicting A-to-I RNA editing sites in three species. Brief Bioinform 2023; 24:bbad170. [PMID: 37150785 PMCID: PMC10565902 DOI: 10.1093/bib/bbad170] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/31/2023] [Revised: 04/12/2023] [Accepted: 04/14/2023] [Indexed: 05/09/2023] Open
Abstract
A-to-I editing is the most prevalent RNA editing event, which refers to the change of adenosine (A) bases to inosine (I) bases in double-stranded RNAs. Several studies have revealed that A-to-I editing can regulate cellular processes and is associated with various human diseases. Therefore, accurate identification of A-to-I editing sites is crucial for understanding RNA-level (i.e. transcriptional) modifications and their potential roles in molecular functions. To date, various computational approaches for A-to-I editing site identification have been developed; however, their performance is still unsatisfactory and needs further improvement. In this study, we developed a novel stacked-ensemble learning model, ATTIC (A-To-I ediTing predICtor), to accurately identify A-to-I editing sites across three species, including Homo sapiens, Mus musculus and Drosophila melanogaster. We first comprehensively evaluated 37 RNA sequence-derived features combined with 14 popular machine learning algorithms. Then, we selected the optimal base models to build a series of stacked ensemble models. The final ATTIC framework was developed based on the optimal models improved by the feature selection strategy for specific species. Extensive cross-validation and independent tests illustrate that ATTIC outperforms state-of-the-art tools for predicting A-to-I editing sites. We also developed a web server for ATTIC, which is publicly available at http://web.unimelb-bioinfortools.cloud.edu.au/ATTIC/. We anticipate that ATTIC can be utilized as a useful tool to accelerate the identification of A-to-I RNA editing events and help characterize their roles in post-transcriptional regulation.
Collapse
Affiliation(s)
- Ruyi Chen
- College of Information Engineering, Northwest A&F University, Shaanxi 712100, China
- The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, VIC 3000, Australia
| | - Fuyi Li
- College of Information Engineering, Northwest A&F University, Shaanxi 712100, China
- The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, VIC 3000, Australia
| | - Xudong Guo
- College of Information Engineering, Northwest A&F University, Shaanxi 712100, China
| | - Yue Bi
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, VIC 3800, Australia
| | - Chen Li
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, VIC 3800, Australia
| | - Shirui Pan
- School of Information and Communication Technology, Griffith University, QLD 4222, Australia
| | - Lachlan J M Coin
- The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, VIC 3000, Australia
| | - Jiangning Song
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, VIC 3800, Australia
- Monash Data Futures Institute, Monash University, VIC 3800, Australia
| |
Collapse
|
18
|
Tognon M, Giugno R, Pinello L. A survey on algorithms to characterize transcription factor binding sites. Brief Bioinform 2023; 24:bbad156. [PMID: 37099664 PMCID: PMC10422928 DOI: 10.1093/bib/bbad156] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/13/2023] [Revised: 03/27/2023] [Accepted: 04/01/2023] [Indexed: 04/28/2023] Open
Abstract
Transcription factors (TFs) are key regulatory proteins that control the transcriptional rate of cells by binding short DNA sequences called transcription factor binding sites (TFBS) or motifs. Identifying and characterizing TFBS is fundamental to understanding the regulatory mechanisms governing the transcriptional state of cells. During the last decades, several experimental methods have been developed to recover DNA sequences containing TFBS. In parallel, computational methods have been proposed to discover and identify TFBS motifs based on these DNA sequences. This is one of the most widely investigated problems in bioinformatics and is referred to as the motif discovery problem. In this manuscript, we review classical and novel experimental and computational methods developed to discover and characterize TFBS motifs in DNA sequences, highlighting their advantages and drawbacks. We also discuss open challenges and future perspectives that could fill the remaining gaps in the field.
Collapse
Affiliation(s)
- Manuel Tognon
- Computer Science Department, University of Verona, Verona, Italy
- Molecular Pathology Unit, Center for Computational and Integrative Biology and Center for Cancer Research, Massachusetts General Hospital, Charlestown, Massachusetts, United States of America
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
| | - Rosalba Giugno
- Computer Science Department, University of Verona, Verona, Italy
| | - Luca Pinello
- Molecular Pathology Unit, Center for Computational and Integrative Biology and Center for Cancer Research, Massachusetts General Hospital, Charlestown, Massachusetts, United States of America
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
- Department of Pathology, Harvard Medical School, Boston, Massachusetts, United States of America
| |
Collapse
|
19
|
Smith GD, Ching WH, Cornejo-Páramo P, Wong ES. Decoding enhancer complexity with machine learning and high-throughput discovery. Genome Biol 2023; 24:116. [PMID: 37173718 PMCID: PMC10176946 DOI: 10.1186/s13059-023-02955-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/18/2022] [Accepted: 04/28/2023] [Indexed: 05/15/2023] Open
Abstract
Enhancers are genomic DNA elements controlling spatiotemporal gene expression. Their flexible organization and functional redundancies make deciphering their sequence-function relationships challenging. This article provides an overview of the current understanding of enhancer organization and evolution, with an emphasis on factors that influence these relationships. Technological advancements, particularly in machine learning and synthetic biology, are discussed in light of how they provide new ways to understand this complexity. Exciting opportunities lie ahead as we continue to unravel the intricacies of enhancer function.
Collapse
Affiliation(s)
- Gabrielle D Smith
- Victor Chang Cardiac Research Institute, 405 Liverpool Street, Darlinghurst, NSW, Australia
- School of Biotechnology and Biomolecular Sciences, UNSW Sydney, Kensington, NSW, Australia
| | - Wan Hern Ching
- Victor Chang Cardiac Research Institute, 405 Liverpool Street, Darlinghurst, NSW, Australia
| | - Paola Cornejo-Páramo
- Victor Chang Cardiac Research Institute, 405 Liverpool Street, Darlinghurst, NSW, Australia
- School of Biotechnology and Biomolecular Sciences, UNSW Sydney, Kensington, NSW, Australia
| | - Emily S Wong
- Victor Chang Cardiac Research Institute, 405 Liverpool Street, Darlinghurst, NSW, Australia.
- School of Biotechnology and Biomolecular Sciences, UNSW Sydney, Kensington, NSW, Australia.
| |
Collapse
|
20
|
Li H, Wu B, Sun M, Ye Y, Zhu Z, Chen K. Multi-view graph neural network with cascaded attention for lncRNA-miRNA interaction prediction. Knowl Based Syst 2023. [DOI: 10.1016/j.knosys.2023.110492] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/30/2023]
|
21
|
Ao C, Ye X, Sakurai T, Zou Q, Yu L. m5U-SVM: identification of RNA 5-methyluridine modification sites based on multi-view features of physicochemical features and distributed representation. BMC Biol 2023; 21:93. [PMID: 37095510 PMCID: PMC10127088 DOI: 10.1186/s12915-023-01596-0] [Citation(s) in RCA: 29] [Impact Index Per Article: 29.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/20/2022] [Accepted: 04/12/2023] [Indexed: 04/26/2023] Open
Abstract
BACKGROUND RNA 5-methyluridine (m5U) modifications are obtained by methylation at the C5 position of uridine catalyzed by pyrimidine methylation transferase, which is related to the development of human diseases. Accurate identification of m5U modification sites from RNA sequences can contribute to the understanding of their biological functions and the pathogenesis of related diseases. Compared to traditional experimental methods, computational methods developed based on machine learning with ease of use can identify modification sites from RNA sequences in an efficient and time-saving manner. Despite the good performance of these computational methods, there are some drawbacks and limitations. RESULTS In this study, we have developed a novel predictor, m5U-SVM, based on multi-view features and machine learning algorithms to construct predictive models for identifying m5U modification sites from RNA sequences. In this method, we used four traditional physicochemical features and distributed representation features. The optimized multi-view features were obtained from the four fused traditional physicochemical features by using the two-step LightGBM and IFS methods, and then the distributed representation features were fused with the optimized physicochemical features to obtain the new multi-view features. The best performing classifier, support vector machine, was identified by screening different machine learning algorithms. Compared with the results, the performance of the proposed model is better than that of the existing state-of-the-art tool. CONCLUSIONS m5U-SVM provides an effective tool that successfully captures sequence-related attributes of modifications and can accurately predict m5U modification sites from RNA sequences. The identification of m5U modification sites helps to understand and delve into the related biological processes and functions.
Collapse
Affiliation(s)
- Chunyan Ao
- School of Computer Science and Technology, Xidian University, Xi'an, China
- Department of Computer Science, University of Tsukuba, Tsukuba, Japan
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| | - Xiucai Ye
- Department of Computer Science, University of Tsukuba, Tsukuba, Japan
| | - Tetsuya Sakurai
- Department of Computer Science, University of Tsukuba, Tsukuba, Japan
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China.
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China.
| | - Liang Yu
- School of Computer Science and Technology, Xidian University, Xi'an, China.
| |
Collapse
|
22
|
Pradhan UK, Meher PK, Naha S, Rao AR, Gupta A. ASLncR: a novel computational tool for prediction of abiotic stress-responsive long non-coding RNAs in plants. Funct Integr Genomics 2023; 23:113. [PMID: 37000299 DOI: 10.1007/s10142-023-01040-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/30/2022] [Revised: 03/23/2023] [Accepted: 03/24/2023] [Indexed: 04/01/2023]
Abstract
Abiotic stresses are detrimental to plant growth and development and have a major negative impact on crop yields. A growing body of evidence indicates that a large number of long non-coding RNAs (lncRNAs) are key to many abiotic stress responses. Thus, identifying abiotic stress-responsive lncRNAs is essential in crop breeding programs in order to develop crop cultivars resistant to abiotic stresses. In this study, we have developed the first machine learning-based computational model for predicting abiotic stress-responsive lncRNAs. The lncRNA sequences which were responsive and non-responsive to abiotic stresses served as the two classes of the dataset for binary classification using the machine learning algorithms. The training dataset was created using 263 stress-responsive and 263 non-stress-responsive sequences, whereas the independent test set consists of 101 sequences from both classes. As the machine learning model can adopt only the numeric data, the Kmer features ranging from sizes 1 to 6 were utilized to represent lncRNAs in numeric form. To select important features, four different feature selection strategies were utilized. Among the seven learning algorithms, the support vector machine (SVM) achieved the highest cross-validation accuracy with the selected feature sets. The observed 5-fold cross-validation accuracy, AU-ROC, and AU-PRC were found to be 68.84, 72.78, and 75.86%, respectively. Furthermore, the robustness of the developed model (SVM with the selected feature) was evaluated using an independent test dataset, where the overall accuracy, AU-ROC, and AU-PRC were found to be 76.23, 87.71, and 88.49%, respectively. The developed computational approach was also implemented in an online prediction tool ASLncR accessible at https://iasri-sg.icar.gov.in/aslncr/ . The proposed computational model and the developed prediction tool are believed to supplement the existing effort for the identification of abiotic stress-responsive lncRNAs in plants.
Collapse
Affiliation(s)
- Upendra Kumar Pradhan
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi, 110012, India
| | - Prabina Kumar Meher
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi, 110012, India.
| | - Sanchita Naha
- Division of Computer Applications, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi, 110012, India
| | | | - Ajit Gupta
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi, 110012, India
| |
Collapse
|
23
|
DeepmRNALoc: A Novel Predictor of Eukaryotic mRNA Subcellular Localization Based on Deep Learning. Molecules 2023; 28:molecules28052284. [PMID: 36903531 PMCID: PMC10005629 DOI: 10.3390/molecules28052284] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/06/2022] [Revised: 02/02/2023] [Accepted: 02/10/2023] [Indexed: 03/06/2023] Open
Abstract
The subcellular localization of messenger RNA (mRNA) precisely controls where protein products are synthesized and where they function. However, obtaining an mRNA's subcellular localization through wet-lab experiments is time-consuming and expensive, and many existing mRNA subcellular localization prediction algorithms need to be improved. In this study, a deep neural network-based eukaryotic mRNA subcellular location prediction method, DeepmRNALoc, was proposed, utilizing a two-stage feature extraction strategy that featured bimodal information splitting and fusing for the first stage and a VGGNet-like CNN module for the second stage. The five-fold cross-validation accuracies of DeepmRNALoc in the cytoplasm, endoplasmic reticulum, extracellular region, mitochondria, and nucleus were 0.895, 0.594, 0.308, 0.944, and 0.865, respectively, demonstrating that it outperforms existing models and techniques.
Collapse
|
24
|
A radiomics feature-based machine learning models to detect brainstem infarction (RMEBI) may enable early diagnosis in non-contrast enhanced CT. Eur Radiol 2023; 33:1004-1014. [PMID: 36169689 DOI: 10.1007/s00330-022-09130-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/30/2022] [Revised: 07/21/2022] [Accepted: 08/24/2022] [Indexed: 02/03/2023]
Abstract
OBJECTIVES Magnetic resonance imaging has high sensitivity in detecting early brainstem infarction (EBI). However, MRI is not practical for all patients who present with possible stroke and would lead to delayed treatment. The detection rate of EBI on non-contrast computed tomography (NCCT) is currently very low. Thus, we aimed to develop and validate the radiomics feature-based machine learning models to detect EBI (RMEBIs) on NCCT. METHODS In this retrospective observational study, 355 participants from a multicentre multimodal database established by Huashan Hospital were randomly divided into two data sets: a training cohort (70%) and an internal validation cohort (30%). Fifty-seven participants from the Second Affiliated Hospital of Xuzhou Medical University were included as the external validation cohort. Brainstems were segmented by a radiologist committee on NCCT and 1781 radiomics features were automatically computed. After selecting the relevant features, 7 machine learning models were assessed in the training cohort to predict early brainstem infarction. Accuracy, sensitivity, specificity, positive predictive value, negative predictive value, F1-score, and the area under the receiver operating characteristic curve (AUC) were used to evaluate the performance of the prediction models. RESULTS The multilayer perceptron (MLP) RMEBI showed the best performance (AUC: 0.99 [95% CI: 0.96-1.00]) in the internal validation cohort. The AUC value in external validation cohort was 0.91 (95% CI: 0.82-0.98). CONCLUSIONS RMEBIs have the potential in routine clinical practice to enable accurate computer-assisted diagnoses of early brainstem infarction in patients with NCCT, which may have important clinical value in reducing therapeutic decision-making time. KEY POINTS • RMEBIs have the potential to enable accurate diagnoses of early brainstem infarction in patients with NCCT. • RMEBIs are suitable for various multidetector CT scanners. • The patient treatment decision-making time is shortened.
Collapse
|
25
|
Li J, Wu Z, Lin W, Luo J, Zhang J, Chen Q, Chen J. iEnhancer-ELM: improve enhancer identification by extracting position-related multiscale contextual information based on enhancer language models. BIOINFORMATICS ADVANCES 2023; 3:vbad043. [PMID: 37113248 PMCID: PMC10125906 DOI: 10.1093/bioadv/vbad043] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 12/05/2022] [Revised: 02/04/2023] [Accepted: 03/24/2023] [Indexed: 04/29/2023]
Abstract
Motivation Enhancers are important cis-regulatory elements that regulate a wide range of biological functions and enhance the transcription of target genes. Although many feature extraction methods have been proposed to improve the performance of enhancer identification, they cannot learn position-related multiscale contextual information from raw DNA sequences. Results In this article, we propose a novel enhancer identification method (iEnhancer-ELM) based on BERT-like enhancer language models. iEnhancer-ELM tokenizes DNA sequences with multi-scale k-mers and extracts contextual information of different scale k-mers related with their positions via an multi-head attention mechanism. We first evaluate the performance of different scale k-mers, then ensemble them to improve the performance of enhancer identification. The experimental results on two popular benchmark datasets show that our model outperforms state-of-the-art methods. We further illustrate the interpretability of iEnhancer-ELM. For a case study, we discover 30 enhancer motifs via a 3-mer-based model, where 12 of motifs are verified by STREME and JASPAR, demonstrating our model has a potential ability to unveil the biological mechanism of enhancer. Availability and implementation The models and associated code are available at https://github.com/chen-bioinfo/iEnhancer-ELM. Supplementary information Supplementary data are available at Bioinformatics Advances online.
Collapse
Affiliation(s)
| | | | - Wenhao Lin
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
| | - Jiawei Luo
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
| | - Jun Zhang
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
| | - Qingcai Chen
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
- Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
| | | |
Collapse
|
26
|
Li Y, Kong F, Cui H, Wang F, Li C, Ma J. SENIES: DNA Shape Enhanced Two-Layer Deep Learning Predictor for the Identification of Enhancers and Their Strength. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:637-645. [PMID: 35015646 DOI: 10.1109/tcbb.2022.3142019] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/04/2023]
Abstract
Identifying enhancers is a critical task in bioinformatics due to their primary role in regulating gene expression. For this reason, various computational algorithms devoted to enhancer identification have been put forward over the years. More features are extracted from the single DNA sequences to boost the performance. Nevertheless, DNA structural information is neglected, which is an essential factor affecting the binding preferences of transcription factors to regulatory elements like enhancers. Here, we propose SENIES, a DNA shape enhanced deep learning predictor, to identify enhancers and their strength. The predictor consists of two layers where the first layer is for enhancer and non-enhancer identification, and the second layer is for predicting the strength of enhancers. Apart from two common sequence-derived features (i.e., one-hot and k-mer), DNA shape is introduced to describe the 3D structures of DNA sequences. Performance comparison with state-of-the-art methods conducted on public datasets demonstrates the effectiveness and robustness of our predictor. The code implementation of SENIES is publicly available at https://github.com/hlju-liye/SENIES.
Collapse
|
27
|
Han SK, Muto Y, Wilson PC, Humphreys BD, Sampson MG, Chakravarti A, Lee D. Quality assessment and refinement of chromatin accessibility data using a sequence-based predictive model. Proc Natl Acad Sci U S A 2022; 119:e2212810119. [PMID: 36508674 PMCID: PMC9907136 DOI: 10.1073/pnas.2212810119] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/27/2022] [Accepted: 10/28/2022] [Indexed: 12/15/2022] Open
Abstract
Chromatin accessibility assays are central to the genome-wide identification of gene regulatory elements associated with transcriptional regulation. However, the data have highly variable quality arising from several biological and technical factors. To surmount this problem, we developed a sequence-based machine learning method to evaluate and refine chromatin accessibility data. Our framework, gapped k-mer SVM quality check (gkmQC), provides the quality metrics for a sample based on the prediction accuracy of the trained models. We tested 886 DNase-seq samples from the ENCODE/Roadmap projects to demonstrate that gkmQC can effectively identify "high-quality" (HQ) samples with low conventional quality scores owing to marginal read depths. Peaks identified in HQ samples are more accurately aligned at functional regulatory elements, show greater enrichment of regulatory elements harboring functional variants, and explain greater heritability of phenotypes from their relevant tissues. Moreover, gkmQC can optimize the peak-calling threshold to identify additional peaks, especially for rare cell types in single-cell chromatin accessibility data.
Collapse
Affiliation(s)
- Seong Kyu Han
- Department of Pediatrics, Division of Nephrology, Boston Children’s Hospital, Boston & Harvard Medical School, Boston, MA02115
- Kidney Disease Initiative, Broad Institute of MIT and Harvard, Cambridge, MA02142
| | - Yoshiharu Muto
- Division of Nephrology, Department of Medicine, Washington University in St. Louis, St. Louis, MO63130
| | - Parker C. Wilson
- Department of Pathology and Immunology, Washington University in St. Louis, St. Louis, MO63130
| | - Benjamin D. Humphreys
- Division of Nephrology, Department of Medicine, Washington University in St. Louis, St. Louis, MO63130
- Department of Developmental Biology, Washington University in St. Louis, St. Louis, MO63130
| | - Matthew G. Sampson
- Department of Pediatrics, Division of Nephrology, Boston Children’s Hospital, Boston & Harvard Medical School, Boston, MA02115
- Kidney Disease Initiative, Broad Institute of MIT and Harvard, Cambridge, MA02142
| | - Aravinda Chakravarti
- Center for Human Genetics and Genomics, New York University Grossman School of Medicine, New York, NY10016
| | - Dongwon Lee
- Department of Pediatrics, Division of Nephrology, Boston Children’s Hospital, Boston & Harvard Medical School, Boston, MA02115
- Manton Center for Orphan Disease Research, Boston Children’s Hospital, Boston, MA02115
| |
Collapse
|
28
|
Identification and functional validation of super-enhancers in Arabidopsis thaliana. Proc Natl Acad Sci U S A 2022; 119:e2215328119. [PMID: 36409894 PMCID: PMC9860255 DOI: 10.1073/pnas.2215328119] [Citation(s) in RCA: 17] [Impact Index Per Article: 8.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022] Open
Abstract
Super-enhancers (SEs) are exceptionally large enhancers and are recognized to play prominent roles in cell identity in mammalian species. We surveyed the genomic regions containing large clusters of accessible chromatin regions (ACRs) marked by deoxyribonuclease (DNase) I hypersensitivity in Arabidopsis thaliana. We identified a set of 749 putative SEs, which have a minimum length of 1.5 kilobases and represent the top 2.5% of the largest ACR clusters. We demonstrate that the genomic regions associating with these SEs were more sensitive to DNase I than other nonpromoter ACRs. The SEs were preferentially associated with topologically associating domains. Furthermore, the SEs and their predicted cognate genes were frequently associated with organ development and tissue identity in A. thaliana. Therefore, the A. thaliana SEs and their cognate genes mirror the functional characteristics of those reported in mammalian species. We developed CRISPR/Cas-mediated deletion lines of a 3,578-bp SE associated with the thalianol biosynthetic gene cluster (BGC). Small deletions (131-157 bp) within the SE resulted in distinct phenotypic changes and transcriptional repression of all five thalianol genes. In addition, T-DNA insertions in the SE region resulted in transcriptional alteration of all five thalianol genes. Thus, this SE appears to play a central role in coordinating the operon-like expression pattern of the thalianol BGC.
Collapse
|
29
|
Yang Y, Lv H, Chen N. A Survey on ensemble learning under the era of deep learning. Artif Intell Rev 2022. [DOI: 10.1007/s10462-022-10283-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/06/2022]
|
30
|
Gnocis: An integrated system for interactive and reproducible analysis and modelling of cis-regulatory elements in Python 3. PLoS One 2022; 17:e0274338. [PMID: 36084008 PMCID: PMC9462789 DOI: 10.1371/journal.pone.0274338] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/07/2021] [Accepted: 08/25/2022] [Indexed: 11/23/2022] Open
Abstract
Gene expression is regulated through cis-regulatory elements (CREs), among which are promoters, enhancers, Polycomb/Trithorax Response Elements (PREs), silencers and insulators. Computational prediction of CREs can be achieved using a variety of statistical and machine learning methods combined with different feature space formulations. Although Python packages for DNA sequence feature sets and for machine learning are available, no existing package facilitates the combination of DNA sequence feature sets with machine learning methods for the genome-wide prediction of candidate CREs. We here present Gnocis, a Python package that streamlines the analysis and the modelling of CRE sequences by providing extensible APIs and implementing the glue required for combining feature sets and models for genome-wide prediction. Gnocis implements a variety of base feature sets, including motif pair occurrence frequencies and the k-spectrum mismatch kernel. It integrates with Scikit-learn and TensorFlow for state-of-the-art machine learning. Gnocis additionally implements a broad suite of tools for the handling and preparation of sequence, region and curve data, which can be useful for general DNA bioinformatics in Python. We also present Deep-MOCCA, a neural network architecture inspired by SVM-MOCCA that achieves moderate to high generalization without prior motif knowledge. To demonstrate the use of Gnocis, we applied multiple machine learning methods to the modelling of D. melanogaster PREs, including a Convolutional Neural Network (CNN), making this the first study to model PREs with CNNs. The models are readily adapted to new CRE modelling problems and to other organisms. In order to produce a high-performance, compiled package for Python 3, we implemented Gnocis in Cython. Gnocis can be installed using the PyPI package manager by running ‘pip install gnocis’. The source code is available on GitHub, at https://github.com/bjornbredesen/gnocis.
Collapse
|
31
|
Oya S, Takahashi M, Takashima K, Kakutani T, Inagaki S. Transcription-coupled and epigenome-encoded mechanisms direct H3K4 methylation. Nat Commun 2022; 13:4521. [PMID: 35953471 PMCID: PMC9372134 DOI: 10.1038/s41467-022-32165-8] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/01/2021] [Accepted: 07/19/2022] [Indexed: 11/13/2022] Open
Abstract
Mono-, di-, and trimethylation of histone H3 lysine 4 (H3K4me1/2/3) are associated with transcription, yet it remains controversial whether H3K4me1/2/3 promote or result from transcription. Our previous characterizations of Arabidopsis H3K4 demethylases suggest roles for H3K4me1 in transcription. However, the control of H3K4me1 remains unexplored in Arabidopsis, in which no methyltransferase for H3K4me1 has been identified. Here, we identify three Arabidopsis methyltransferases that direct H3K4me1. Analyses of their genome-wide localization using ChIP-seq and machine learning reveal that one of the enzymes cooperates with the transcription machinery, while the other two are associated with specific histone modifications and DNA sequences. Importantly, these two types of localization patterns are also found for the other H3K4 methyltransferases in Arabidopsis and mice. These results suggest that H3K4me1/2/3 are established and maintained via interplay with transcription as well as inputs from other chromatin features, presumably enabling elaborate gene control.
Collapse
Affiliation(s)
- Satoyo Oya
- Department of Biological Sciences, Graduate School of Science, The University of Tokyo, Tokyo, Japan.
| | | | | | - Tetsuji Kakutani
- Department of Biological Sciences, Graduate School of Science, The University of Tokyo, Tokyo, Japan.
- National Institute of Genetics, Mishima, Japan.
| | - Soichi Inagaki
- Department of Biological Sciences, Graduate School of Science, The University of Tokyo, Tokyo, Japan.
- PRESTO, Japan Science and Technology Agency, Kawaguchi, Japan.
| |
Collapse
|
32
|
Asim MN, Ibrahim MA, Imran Malik M, Dengel A, Ahmed S. Circ-LocNet: A Computational Framework for Circular RNA Sub-Cellular Localization Prediction. Int J Mol Sci 2022; 23:ijms23158221. [PMID: 35897818 PMCID: PMC9329987 DOI: 10.3390/ijms23158221] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/16/2022] [Revised: 07/15/2022] [Accepted: 07/20/2022] [Indexed: 02/04/2023] Open
Abstract
Circular ribonucleic acids (circRNAs) are novel non-coding RNAs that emanate from alternative splicing of precursor mRNA in reversed order across exons. Despite the abundant presence of circRNAs in human genes and their involvement in diverse physiological processes, the functionality of most circRNAs remains a mystery. Like other non-coding RNAs, sub-cellular localization knowledge of circRNAs has the aptitude to demystify the influence of circRNAs on protein synthesis, degradation, destination, their association with different diseases, and potential for drug development. To date, wet experimental approaches are being used to detect sub-cellular locations of circular RNAs. These approaches help to elucidate the role of circRNAs as protein scaffolds, RNA-binding protein (RBP) sponges, micro-RNA (miRNA) sponges, parental gene expression modifiers, alternative splicing regulators, and transcription regulators. To complement wet-lab experiments, considering the progress made by machine learning approaches for the determination of sub-cellular localization of other non-coding RNAs, the paper in hand develops a computational framework, Circ-LocNet, to precisely detect circRNA sub-cellular localization. Circ-LocNet performs comprehensive extrinsic evaluation of 7 residue frequency-based, residue order and frequency-based, and physio-chemical property-based sequence descriptors using the five most widely used machine learning classifiers. Further, it explores the performance impact of K-order sequence descriptor fusion where it ensembles similar as well dissimilar genres of statistical representation learning approaches to reap the combined benefits. Considering the diversity of statistical representation learning schemes, it assesses the performance of second-order, third-order, and going all the way up to seventh-order sequence descriptor fusion. A comprehensive empirical evaluation of Circ-LocNet over a newly developed benchmark dataset using different settings reveals that standalone residue frequency-based sequence descriptors and tree-based classifiers are more suitable to predict sub-cellular localization of circular RNAs. Further, K-order heterogeneous sequence descriptors fusion in combination with tree-based classifiers most accurately predict sub-cellular localization of circular RNAs. We anticipate this study will act as a rich baseline and push the development of robust computational methodologies for the accurate sub-cellular localization determination of novel circRNAs.
Collapse
Affiliation(s)
- Muhammad Nabeel Asim
- German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany; (M.A.I.); (A.D.); (S.A.)
- Department of Computer Science, Technical University of Kaiserslautern, 67663 Kaiserslautern, Germany
- Correspondence:
| | - Muhammad Ali Ibrahim
- German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany; (M.A.I.); (A.D.); (S.A.)
- Department of Computer Science, Technical University of Kaiserslautern, 67663 Kaiserslautern, Germany
| | - Muhammad Imran Malik
- School of Computer Science & Electrical Engineering, National University of Sciences and Technology, Islamabad 44000, Pakistan;
| | - Andreas Dengel
- German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany; (M.A.I.); (A.D.); (S.A.)
- Department of Computer Science, Technical University of Kaiserslautern, 67663 Kaiserslautern, Germany
| | - Sheraz Ahmed
- German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany; (M.A.I.); (A.D.); (S.A.)
- DeepReader GmbH, Trippstadter Str. 122, 67663 Kaiserslautern, Germany
| |
Collapse
|
33
|
Gao Y, Chen Y, Feng H, Zhang Y, Yue Z. RicENN: Prediction of Rice Enhancers with Neural Network Based on DNA Sequences. Interdiscip Sci 2022; 14:555-565. [PMID: 35190950 DOI: 10.1007/s12539-022-00503-5] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/01/2021] [Revised: 01/07/2022] [Accepted: 01/18/2022] [Indexed: 01/22/2023]
Abstract
Enhancers are the primary cis-elements of transcriptional regulation and play a vital role in gene expression at different stages of plant growth and development. Having high locational variation and free scattering in non-encoding genomes, identification of enhancers is a crucial, but challenging work in understanding the biological mechanism of model plants. Recently, applications of neural network models are gaining increasing popularity in predicting the function of genomic elements. Although several computational models have shown great advantages to tackle this challenge, a further study of the identification of rice enhancers from DNA sequences is still lacking. We present RicENN, a novel deep learning framework capable of accurately identifying enhancers of rice, integrating convolution neural networks (CNNs), bi-directional recurrent neural networks (RNNs), and attention mechanisms. A combined-feature representation method was designed to extract the sequence features from original DNA sequences using six types of autocorrelation encodings. Moreover, we verified that the integrated model achieves the best performance by an ablation study. Finally, our deep learning framework realized a reliable prediction of the rice enhancers. The results show RicENN outperforms available alternative approaches in rice species, achieving the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC) of 0.960 and 0.960 on cross-validation, and 0.879 and 0.877 during independent tests, respectively. This study develops a hybrid model to combine the merits of different neural network architectures, which shows the potential ability to apply deep learning in bioinformatic sequences and contributes to the acceleration of functional genomic studies of rice. RicENN and its code are freely accessible at http://bioinfor.aielab.cc/RicENN/ .
Collapse
Affiliation(s)
- Yujia Gao
- School of Information and Computer, Anhui Provincial Engineering Laboratory for Beidou Precision Agriculture Information, Anhui Agricultural University, Hefei, 230036, Anhui, China
| | - Yiqiong Chen
- School of Information and Computer, Anhui Provincial Engineering Laboratory for Beidou Precision Agriculture Information, Anhui Agricultural University, Hefei, 230036, Anhui, China
| | - Haisong Feng
- School of Information and Computer, Anhui Provincial Engineering Laboratory for Beidou Precision Agriculture Information, Anhui Agricultural University, Hefei, 230036, Anhui, China
| | - Youhua Zhang
- School of Information and Computer, Anhui Provincial Engineering Laboratory for Beidou Precision Agriculture Information, Anhui Agricultural University, Hefei, 230036, Anhui, China.
| | - Zhenyu Yue
- School of Information and Computer, Anhui Provincial Engineering Laboratory for Beidou Precision Agriculture Information, Anhui Agricultural University, Hefei, 230036, Anhui, China.
| |
Collapse
|
34
|
SemanticCAP: Chromatin Accessibility Prediction Enhanced by Features Learning from a Language Model. Genes (Basel) 2022; 13:genes13040568. [PMID: 35456374 PMCID: PMC9028922 DOI: 10.3390/genes13040568] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/17/2022] [Revised: 03/22/2022] [Accepted: 03/22/2022] [Indexed: 11/16/2022] Open
Abstract
A large number of inorganic and organic compounds are able to bind DNA and form complexes, among which drug-related molecules are important. Chromatin accessibility changes not only directly affect drug–DNA interactions, but they can promote or inhibit the expression of the critical genes associated with drug resistance by affecting the DNA binding capacity of TFs and transcriptional regulators. However, the biological experimental techniques for measuring it are expensive and time-consuming. In recent years, several kinds of computational methods have been proposed to identify accessible regions of the genome. Existing computational models mostly ignore the contextual information provided by the bases in gene sequences. To address these issues, we proposed a new solution called SemanticCAP. It introduces a gene language model that models the context of gene sequences and is thus able to provide an effective representation of a certain site in a gene sequence. Basically, we merged the features provided by the gene language model into our chromatin accessibility model. During the process, we designed methods called SFA and SFC to make feature fusion smoother. Compared to DeepSEA, gkm-SVM, and k-mer using public benchmarks, our model proved to have better performance, showing a 1.25% maximum improvement in auROC and a 2.41% maximum improvement in auPRC.
Collapse
|
35
|
Yocca AE, Edger PP. Machine learning approaches to identify core and dispensable genes in pangenomes. THE PLANT GENOME 2022; 15:e20135. [PMID: 34533282 DOI: 10.1002/tpg2.20135] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/18/2021] [Accepted: 06/16/2021] [Indexed: 05/25/2023]
Abstract
A gene in a given taxonomic group is either present in every individual (core) or absent in at least a single individual (dispensable). Previous pangenomic studies have identified certain functional differences between core and dispensable genes. However, identifying if a gene belongs to the core or dispensable portion of the genome requires the construction of a pangenome, which involves sequencing the genomes of many individuals. Here we aim to leverage the previously characterized core and dispensable gene content for two grass species [Brachypodium distachyon (L.) P. Beauv. and Oryza sativa L.] to construct a machine learning model capable of accurately classifying genes as core or dispensable using only a single annotated reference genome. Such a model may mitigate the need for pangenome construction, an expensive hurdle especially in orphan crops, which often lack the adequate genomic resources.
Collapse
Affiliation(s)
- Alan E Yocca
- Dep. of Plant Biology, Michigan State Univ., East Lansing, MI, 48824, USA
- Dep. of Horticulture, Michigan State Univ., East Lansing, MI, 48824, USA
| | - Patrick P Edger
- Dep. of Horticulture, Michigan State Univ., East Lansing, MI, 48824, USA
- Genetics and Genome Sciences Program, Michigan State Univ., East Lansing, MI, 48824, USA
| |
Collapse
|
36
|
Chien CH, Huang LY, Lo SF, Chen LJ, Liao CC, Chen JJ, Chu YW. Using Machine Learning Approaches to Predict Target Gene Expression in Rice T-DNA Insertional Mutants. Front Genet 2021; 12:798107. [PMID: 34976025 PMCID: PMC8718795 DOI: 10.3389/fgene.2021.798107] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/19/2021] [Accepted: 11/15/2021] [Indexed: 11/13/2022] Open
Abstract
To change the expression of the flanking genes by inserting T-DNA into the genome is commonly used in rice functional gene research. However, whether the expression of a gene of interest is enhanced must be validated experimentally. Consequently, to improve the efficiency of screening activated genes, we established a model to predict gene expression in T-DNA mutants through machine learning methods. We gathered experimental datasets consisting of gene expression data in T-DNA mutants and captured the PROMOTER and MIDDLE sequences for encoding. In first-layer models, support vector machine (SVM) models were constructed with nine features consisting of information about biological function and local and global sequences. Feature encoding based on the PROMOTER sequence was weighted by logistic regression. The second-layer models integrated 16 first-layer models with minimum redundancy maximum relevance (mRMR) feature selection and the LADTree algorithm, which were selected from nine feature selection methods and 65 classified methods, respectively. The accuracy of the final two-layer machine learning model, referred to as TIMgo, was 99.3% based on fivefold cross-validation, and 85.6% based on independent testing. We discovered that the information within the local sequence had a greater contribution than the global sequence with respect to classification. TIMgo had a good predictive ability for target genes within 20 kb from the 35S enhancer. Based on the analysis of significant sequences, the G-box regulatory sequence may also play an important role in the activation mechanism of the 35S enhancer.
Collapse
Affiliation(s)
- Ching-Hsuan Chien
- Ph.D. Program in Medical Biotechnology, National Chung Hsing University, Taichung, Taiwan
| | - Lan-Ying Huang
- Ph.D. Program in Medical Biotechnology, National Chung Hsing University, Taichung, Taiwan
| | - Shuen-Fang Lo
- Biotechnology Center, National Chung Hsing University, Taichung, Taiwan
| | - Liang-Jwu Chen
- Institute of Molecular Biology, National Chung Hsing University, Taichung, Taiwan
- Advanced Plant Biotechnology Center National Chung Hsing University, Taichung, Taiwan
| | - Chi-Chou Liao
- Institute of Molecular Biology, National Chung Hsing University, Taichung, Taiwan
| | - Jia-Jyun Chen
- Institute of Genomics and Bioinformatics, National Chung Hsing University, Taichung, Taiwan
| | - Yen-Wei Chu
- Ph.D. Program in Medical Biotechnology, National Chung Hsing University, Taichung, Taiwan
- Biotechnology Center, National Chung Hsing University, Taichung, Taiwan
- Institute of Molecular Biology, National Chung Hsing University, Taichung, Taiwan
- Institute of Genomics and Bioinformatics, National Chung Hsing University, Taichung, Taiwan
- Agricultural Biotechnology Center, National Chung Hsing University, Taichung, Taiwan
- Ph.D. Program in Translational Medicine, National Chung Hsing University, Taichung, Taiwan
- Rong Hsing Research Center for Translational Medicine, National Chung Hsing University, Taichung, Taiwan
- *Correspondence: Yen-Wei Chu,
| |
Collapse
|
37
|
Benner P, Vingron M. Quantifying the tissue-specific regulatory information within enhancer DNA sequences. NAR Genom Bioinform 2021; 3:lqab095. [PMID: 34729474 PMCID: PMC8557370 DOI: 10.1093/nargab/lqab095] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/07/2021] [Revised: 09/23/2021] [Accepted: 09/28/2021] [Indexed: 12/04/2022] Open
Abstract
Recent efforts to measure epigenetic marks across a wide variety of different cell types and tissues provide insights into the cell type-specific regulatory landscape. We use these data to study whether there exists a correlate of epigenetic signals in the DNA sequence of enhancers and explore with computational methods to what degree such sequence patterns can be used to predict cell type-specific regulatory activity. By constructing classifiers that predict in which tissues enhancers are active, we are able to identify sequence features that might be recognized by the cell in order to regulate gene expression. While classification performances vary greatly between tissues, we show examples where our classifiers correctly predict tissue-specific regulation from sequence alone. We also show that many of the informative patterns indeed harbor transcription factor footprints.
Collapse
Affiliation(s)
- Philipp Benner
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Ihnestraße 73, 14195 Berlin, Germany
| | - Martin Vingron
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Ihnestraße 73, 14195 Berlin, Germany
| |
Collapse
|
38
|
Lyu Y, Zhang Z, Li J, He W, Ding Y, Guo F. iEnhancer-KL: A Novel Two-Layer Predictor for Identifying Enhancers by Position Specific of Nucleotide Composition. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:2809-2815. [PMID: 33481715 DOI: 10.1109/tcbb.2021.3053608] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/12/2023]
Abstract
An enhancer is a short region of DNA with the ability to recruit transcription factors and their complexes, increasing the likelihood of the transcription of a particular gene. Considering the importance of enhancers, enhancer identification is a prevailing problem in computational biology. In this paper, we propose a novel two-layer enhancer predictor called iEnhancer-KL, using computational biology algorithms to identify enhancers and then classify these enhancers into strong or weak types. Kullback-Leibler (KL) divergence is creatively taken into consideration to improve the feature extraction method PSTNPss. Then, LASSO is used to reduce the dimension of features and finally helps to get better prediction performance. Furthermore, the selected features are tested on several machine learning models, and the SVM algorithm achieves the best performance. The rigorous cross-validation indicates that our predictor is remarkably superior to the existing state-of-the-art methods with an Acc of 84.23 percent and the MCC of 0.6849 for identifying enhancers. Our code and results can be freely downloaded from https://github.com/Not-so-middle/iEnhancer-KL.git.
Collapse
|
39
|
LGFC-CNN: Prediction of lncRNA-Protein Interactions by Using Multiple Types of Features through Deep Learning. Genes (Basel) 2021; 12:genes12111689. [PMID: 34828296 PMCID: PMC8621699 DOI: 10.3390/genes12111689] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/25/2021] [Revised: 10/11/2021] [Accepted: 10/22/2021] [Indexed: 12/12/2022] Open
Abstract
Long noncoding RNA (lncRNA) plays a crucial role in many critical biological processes and participates in complex human diseases through interaction with proteins. Considering that identifying lncRNA–protein interactions through experimental methods is expensive and time-consuming, we propose a novel method based on deep learning that combines raw sequence composition features, hand-designed features and structure features, called LGFC-CNN, to predict lncRNA–protein interactions. The two sequence preprocessing methods and CNN modules (GloCNN and LocCNN) are utilized to extract the raw sequence global and local features. Meanwhile, we select hand-designed features by comparing the predictive effect of different lncRNA and protein features combinations. Furthermore, we obtain the structure features and unifying the dimensions through Fourier transform. In the end, the four types of features are integrated to comprehensively predict the lncRNA–protein interactions. Compared with other state-of-the-art methods on three lncRNA–protein interaction datasets, LGFC-CNN achieves the best performance with an accuracy of 94.14%, on RPI21850; an accuracy of 92.94%, on RPI7317; and an accuracy of 98.19% on RPI1847. The results show that our LGFC-CNN can effectively predict the lncRNA–protein interactions by combining raw sequence composition features, hand-designed features and structure features.
Collapse
|
40
|
Umarov R, Li Y, Arakawa T, Takizawa S, Gao X, Arner E. ReFeaFi: Genome-wide prediction of regulatory elements driving transcription initiation. PLoS Comput Biol 2021; 17:e1009376. [PMID: 34491989 PMCID: PMC8448322 DOI: 10.1371/journal.pcbi.1009376] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/22/2021] [Revised: 09/17/2021] [Accepted: 08/23/2021] [Indexed: 11/19/2022] Open
Abstract
Regulatory elements control gene expression through transcription initiation (promoters) and by enhancing transcription at distant regions (enhancers). Accurate identification of regulatory elements is fundamental for annotating genomes and understanding gene expression patterns. While there are many attempts to develop computational promoter and enhancer identification methods, reliable tools to analyze long genomic sequences are still lacking. Prediction methods often perform poorly on the genome-wide scale because the number of negatives is much higher than that in the training sets. To address this issue, we propose a dynamic negative set updating scheme with a two-model approach, using one model for scanning the genome and the other one for testing candidate positions. The developed method achieves good genome-level performance and maintains robust performance when applied to other vertebrate species, without re-training. Moreover, the unannotated predicted regulatory regions made on the human genome are enriched for disease-associated variants, suggesting them to be potentially true regulatory elements rather than false positives. We validated high scoring "false positive" predictions using reporter assay and all tested candidates were successfully validated, demonstrating the ability of our method to discover novel human regulatory regions.
Collapse
Affiliation(s)
- Ramzan Umarov
- Graduate School of Integrated Sciences for Life, Hiroshima University, Higashi-Hiroshima, Japan
- * E-mail: (RU); (XG); (EA)
| | - Yu Li
- Department of Computer Science and Engineering (CSE), The Chinese University of Hong Kong (CUHK), Hong Kong, People’s Republic of China
| | - Takahiro Arakawa
- Laboratory for Applied Regulatory Genomics Network Analysis, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan
| | - Satoshi Takizawa
- Laboratory for Applied Regulatory Genomics Network Analysis, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan
| | - Xin Gao
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences and Engineering Division, Thuwal, Saudi Arabia
- * E-mail: (RU); (XG); (EA)
| | - Erik Arner
- Graduate School of Integrated Sciences for Life, Hiroshima University, Higashi-Hiroshima, Japan
- Laboratory for Applied Regulatory Genomics Network Analysis, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan
- * E-mail: (RU); (XG); (EA)
| |
Collapse
|
41
|
Friedman RZ, Granas DM, Myers CA, Corbo JC, Cohen BA, White MA. Information content differentiates enhancers from silencers in mouse photoreceptors. eLife 2021; 10:67403. [PMID: 34486522 PMCID: PMC8492058 DOI: 10.7554/elife.67403] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/09/2021] [Accepted: 09/03/2021] [Indexed: 12/12/2022] Open
Abstract
Enhancers and silencers often depend on the same transcription factors (TFs) and are conflated in genomic assays of TF binding or chromatin state. To identify sequence features that distinguish enhancers and silencers, we assayed massively parallel reporter libraries of genomic sequences targeted by the photoreceptor TF cone-rod homeobox (CRX) in mouse retinas. Both enhancers and silencers contain more TF motifs than inactive sequences, but relative to silencers, enhancers contain motifs from a more diverse collection of TFs. We developed a measure of information content that describes the number and diversity of motifs in a sequence and found that, while both enhancers and silencers depend on CRX motifs, enhancers have higher information content. The ability of information content to distinguish enhancers and silencers targeted by the same TF illustrates how motif context determines the activity of cis-regulatory sequences. Different cell types are established by activating and repressing the activity of specific sets of genes, a process controlled by proteins called transcription factors. Transcription factors work by recognizing and binding short stretches of DNA in parts of the genome called cis-regulatory sequences. A cis-regulatory sequence that increases the activity of a gene when bound by transcription factors is called an enhancer, while a sequence that causes a decrease in gene activity is called a silencer. To establish a cell type, a particular transcription factor will act on both enhancers and silencers that control the activity of different genes. For example, the transcription factor cone-rod homeobox (CRX) is critical for specifying different types of cells in the retina, and it acts on both enhancers and silencers. In rod photoreceptors, CRX activates rod genes by binding their enhancers, while repressing cone photoreceptor genes by binding their silencers. However, CRX always recognizes and binds to the same DNA sequence, known as its binding site, making it unclear why some cis-regulatory sequences bound to CRX act as silencers, while others act as enhancers. Friedman et al. sought to understand how enhancers and silencers, both bound by CRX, can have different effects on the genes they control. Since both enhancers and silencers contain CRX binding sites, the difference between the two must lie in the sequence of the DNA surrounding these binding sites. Using retinas that have been explanted from mice and kept alive in the laboratory, Friedman et al. tested the activity of thousands of CRX-binding sequences from the mouse genome. This showed that both enhancers and silencers have more copies of CRX-binding sites than sequences of the genome that are inactive. Additionally, the results revealed that enhancers have a diverse collection of binding sites for other transcription factors, while silencers do not. Friedman et al. developed a new metric they called information content, which captures the diverse combinations of different transcription binding sites that cis-regulatory sequences can have. Using this metric, Friedman et al. showed that it is possible to distinguish enhancers from silencers based on their information content. It is critical to understand how the DNA sequences of cis-regulatory regions determine their activity, because mutations in these regions of the genome can cause disease. However, since every person has thousands of benign mutations in cis-regulatory sequences, it is a challenge to identify specific disease-causing mutations, which are relatively rare. One long-term goal of models of enhancers and silencers, such as Friedman et al.’s information content model, is to understand how mutations can affect cis-regulatory sequences, and, in some cases, lead to disease.
Collapse
Affiliation(s)
- Ryan Z Friedman
- Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, United States.,Department of Genetics, Washington University School of Medicine, St. Louis, United States
| | - David M Granas
- Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, United States.,Department of Genetics, Washington University School of Medicine, St. Louis, United States
| | - Connie A Myers
- Department of Pathology and Immunology, Washington University School of Medicine, St Louis, United States
| | - Joseph C Corbo
- Department of Pathology and Immunology, Washington University School of Medicine, St Louis, United States
| | - Barak A Cohen
- Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, United States.,Department of Genetics, Washington University School of Medicine, St. Louis, United States
| | - Michael A White
- Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, United States.,Department of Genetics, Washington University School of Medicine, St. Louis, United States
| |
Collapse
|
42
|
iEnhancer-RD: Identification of enhancers and their strength using RKPK features and deep neural networks. Anal Biochem 2021; 630:114318. [PMID: 34364858 DOI: 10.1016/j.ab.2021.114318] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/20/2021] [Revised: 07/02/2021] [Accepted: 07/27/2021] [Indexed: 11/20/2022]
Abstract
Enhancers are regulatory elements involved in gene expression.It is a part of DNA, which can enhance the transcription rate of gene. However, the identification of enhancer by biological experimental methods is time-consuming and expensive. Therefore, there is an urgent need for more efficient methods to identify them.In this study, we propose a new feature extraction method RKPK, which combines three feature methods and uses the recursive feature elimination algorithm for feature selection, and apply deep neural network as classifier to construct the iEnhancer-RD calculation method for enhancer identification. It is a two-layer classification architecture in which the first layer(layer I) identifies enhancers from a set of DNA sequences, and the second layer(layer II) divides the identified enhancers into two subgroups, namely strong and weak enhancers. Independent dataset test indicates that the proposed method is significantly better than most existing methods, and attains the accuracy of 78.8% and 70.5% in the two layers, respectively. Our iEnhancer-RD architecture is implemented in Python and is available at https://github.com/YangHuan639/iEnhancer-RD.
Collapse
|
43
|
Szyman K, Wilczyński B, Dąbrowski M. K-mer Content Changes with Node Degree in Promoter-Enhancer Network of Mouse ES Cells. Int J Mol Sci 2021; 22:ijms22158067. [PMID: 34360860 PMCID: PMC8347099 DOI: 10.3390/ijms22158067] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/29/2021] [Revised: 07/16/2021] [Accepted: 07/24/2021] [Indexed: 11/16/2022] Open
Abstract
Maps of Hi-C contacts between promoters and enhancers can be analyzed as networks, with cis-regulatory regions as nodes and their interactions as edges. We checked if in the published promoter-enhancer network of mouse embryonic stem (ES) cells the differences in the node type (promoter or enhancer) and the node degree (number of regions interacting with a given promoter or enhancer) are reflected by sequence composition or sequence similarity of the interacting nodes. We used counts of all k-mers (k = 4) to analyze the sequence composition and the Euclidean distance between the k-mer count vectors (k-mer distance) as the measure of sequence (dis)similarity. The results we obtained with 4-mers are interpretable in terms of dinucleotides. Promoters are GC-rich as compared to enhancers, which is known. Enhancers are enriched in scaffold/matrix attachment regions (S/MARs) patterns and depleted of CpGs. Furthermore, we show that promoters are more similar to their interacting enhancers than vice-versa. Most notably, in both promoters and enhancers, the GC content and the CpG count increase with the node degree. As a consequence, enhancers of higher node degree become more similar to promoters, whereas higher degree promoters become less similar to enhancers. We confirmed the key results also for human keratinocytes.
Collapse
Affiliation(s)
- Kinga Szyman
- Laboratory of Bioinformatics, Nencki Institute of Experimental Biology, 02-093 Warsaw, Poland;
- Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, 02-097 Warsaw, Poland;
| | - Bartek Wilczyński
- Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, 02-097 Warsaw, Poland;
| | - Michał Dąbrowski
- Laboratory of Bioinformatics, Nencki Institute of Experimental Biology, 02-093 Warsaw, Poland;
- Correspondence:
| |
Collapse
|
44
|
Lee D, Kapoor A, Lee C, Mudgett M, Beer MA, Chakravarti A. Sequence-based correction of barcode bias in massively parallel reporter assays. Genome Res 2021; 31:1638-1645. [PMID: 34285053 PMCID: PMC8415370 DOI: 10.1101/gr.268599.120] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/18/2020] [Accepted: 07/07/2021] [Indexed: 11/24/2022]
Abstract
Massively parallel reporter assays (MPRAs) are a high-throughput method for evaluating in vitro activities of thousands of candidate cis-regulatory elements (CREs). In these assays, candidate sequences are cloned upstream or downstream from a reporter gene tagged by unique DNA sequences. However, tag sequences may themselves affect reporter gene expression and lead to major potential biases in the measured cis-regulatory activity. Here, we present a sequence-based method for correcting tag-sequence-specific effects and show that our method can significantly reduce this source of variation and improve the identification of functional regulatory variants by MPRAs. We also show that our model captures sequence features associated with post-transcriptional regulation of mRNA. Thus, this new method helps not only to improve detection of regulatory signals in MPRA experiments but also to design better MPRA protocols.
Collapse
Affiliation(s)
| | - Ashish Kapoor
- University of Texas Health Science Center at Houston
| | | | | | | | | |
Collapse
|
45
|
Raden M, Wallach T, Miladi M, Zhai Y, Krüger C, Mossmann ZJ, Dembny P, Backofen R, Lehnardt S. Structure-aware machine learning identifies microRNAs operating as Toll-like receptor 7/8 ligands. RNA Biol 2021; 18:268-277. [PMID: 34241565 PMCID: PMC8677043 DOI: 10.1080/15476286.2021.1940697] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
MicroRNAs (miRNAs) can serve as activation signals for membrane receptors, a recently discovered function that is independent of the miRNAs’ conventional role in post-transcriptional gene regulation. Here, we introduce a machine learning approach, BrainDead, to identify oligonucleotides that act as ligands for single-stranded RNA-detecting Toll-like receptors (TLR)7/8, thereby triggering an immune response. BrainDead was trained on activation data obtained from in vitro experiments on murine microglia, incorporating sequence and intra-molecular structure, as well as inter-molecular homo-dimerization potential of candidate RNAs. The method was applied to analyse all known human miRNAs regarding their potential to induce TLR7/8 signalling and microglia activation. We validated the predicted functional activity of subsets of high- and low-scoring miRNAs experimentally, of which a selection has been linked to Alzheimer’s disease. High agreement between predictions and experiments confirms the robustness and power of BrainDead. The results provide new insight into the mechanisms of how miRNAs act as TLR ligands. Eventually, BrainDead implements a generic machine learning methodology for learning and predicting the functions of short RNAs in any context.
Collapse
Affiliation(s)
- Martin Raden
- Bioinformatics Group, Department of Computer Science, University of Freiburg, Freiburg, Germany
| | - Thomas Wallach
- Institute of Cell Biology and Neurobiology, Charité - Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin, Humboldt-Universität Zu Berlin, and Berlin Institute of Health, Berlin, Germany
| | - Milad Miladi
- Bioinformatics Group, Department of Computer Science, University of Freiburg, Freiburg, Germany
| | - Yuanyuan Zhai
- Institute of Cell Biology and Neurobiology, Charité - Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin, Humboldt-Universität Zu Berlin, and Berlin Institute of Health, Berlin, Germany
| | - Christina Krüger
- Institute of Cell Biology and Neurobiology, Charité - Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin, Humboldt-Universität Zu Berlin, and Berlin Institute of Health, Berlin, Germany
| | - Zoé J Mossmann
- Institute of Cell Biology and Neurobiology, Charité - Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin, Humboldt-Universität Zu Berlin, and Berlin Institute of Health, Berlin, Germany
| | - Paul Dembny
- Institute of Cell Biology and Neurobiology, Charité - Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin, Humboldt-Universität Zu Berlin, and Berlin Institute of Health, Berlin, Germany
| | - Rolf Backofen
- Bioinformatics Group, Department of Computer Science, University of Freiburg, Freiburg, Germany.,Signalling Research Centre CIBSS, University of Freiburg, Freiburg, Germany
| | - Seija Lehnardt
- Department of Neurology, Charité - Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin, Humboldt-Universität Zu Berlin, and Berlin Institute of Health, Berlin, Germany
| |
Collapse
|
46
|
Chen Z, Zhao P, Li C, Li F, Xiang D, Chen YZ, Akutsu T, Daly RJ, Webb GI, Zhao Q, Kurgan L, Song J. iLearnPlus: a comprehensive and automated machine-learning platform for nucleic acid and protein sequence analysis, prediction and visualization. Nucleic Acids Res 2021; 49:e60. [PMID: 33660783 PMCID: PMC8191785 DOI: 10.1093/nar/gkab122] [Citation(s) in RCA: 107] [Impact Index Per Article: 35.7] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/26/2020] [Revised: 02/05/2021] [Accepted: 02/25/2021] [Indexed: 12/14/2022] Open
Abstract
Sequence-based analysis and prediction are fundamental bioinformatic tasks that facilitate understanding of the sequence(-structure)-function paradigm for DNAs, RNAs and proteins. Rapid accumulation of sequences requires equally pervasive development of new predictive models, which depends on the availability of effective tools that support these efforts. We introduce iLearnPlus, the first machine-learning platform with graphical- and web-based interfaces for the construction of machine-learning pipelines for analysis and predictions using nucleic acid and protein sequences. iLearnPlus provides a comprehensive set of algorithms and automates sequence-based feature extraction and analysis, construction and deployment of models, assessment of predictive performance, statistical analysis, and data visualization; all without programming. iLearnPlus includes a wide range of feature sets which encode information from the input sequences and over twenty machine-learning algorithms that cover several deep-learning approaches, outnumbering the current solutions by a wide margin. Our solution caters to experienced bioinformaticians, given the broad range of options, and biologists with no programming background, given the point-and-click interface and easy-to-follow design process. We showcase iLearnPlus with two case studies concerning prediction of long noncoding RNAs (lncRNAs) from RNA transcripts and prediction of crotonylation sites in protein chains. iLearnPlus is an open-source platform available at https://github.com/Superzchen/iLearnPlus/ with the webserver at http://ilearnplus.erc.monash.edu/.
Collapse
Affiliation(s)
- Zhen Chen
- Collaborative Innovation Center of Henan Grain Crops, Henan Agricultural University, Zhengzhou 450046, China
| | - Pei Zhao
- State Key Laboratory of Cotton Biology, Institute of Cotton Research of Chinese Academy of Agricultural Sciences (CAAS), Anyang 455000, China
| | - Chen Li
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
| | - Fuyi Li
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia.,Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia.,Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Melbourne, Victoria 3000, Australia
| | - Dongxu Xiang
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia.,Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
| | - Yong-Zi Chen
- Laboratory of Tumor Cell Biology, Key Laboratory of Cancer Prevention and Therapy, National Clinical Research Center for Cancer, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, Tianjin 300060, China
| | - Tatsuya Akutsu
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Kyoto 611-0011, Japan
| | - Roger J Daly
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
| | - Geoffrey I Webb
- Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
| | - Quanzhi Zhao
- Collaborative Innovation Center of Henan Grain Crops, Henan Agricultural University, Zhengzhou 450046, China.,Key Laboratory of Rice Biology in Henan Province, Henan Agricultural University, Zhengzhou 450046, China
| | - Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA
| | - Jiangning Song
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia.,Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
| |
Collapse
|
47
|
Asma H, Halfon MS. Annotating the Insect Regulatory Genome. INSECTS 2021; 12:591. [PMID: 34209769 PMCID: PMC8305585 DOI: 10.3390/insects12070591] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 05/28/2021] [Revised: 06/23/2021] [Accepted: 06/25/2021] [Indexed: 11/17/2022]
Abstract
An ever-growing number of insect genomes is being sequenced across the evolutionary spectrum. Comprehensive annotation of not only genes but also regulatory regions is critical for reaping the full benefits of this sequencing. Driven by developments in sequencing technologies and in both empirical and computational discovery strategies, the past few decades have witnessed dramatic progress in our ability to identify cis-regulatory modules (CRMs), sequences such as enhancers that play a major role in regulating transcription. Nevertheless, providing a timely and comprehensive regulatory annotation of newly sequenced insect genomes is an ongoing challenge. We review here the methods being used to identify CRMs in both model and non-model insect species, and focus on two tools that we have developed, REDfly and SCRMshaw. These resources can be paired together in a powerful combination to facilitate insect regulatory annotation over a broad range of species, with an accuracy equal to or better than that of other state-of-the-art methods.
Collapse
Affiliation(s)
- Hasiba Asma
- Program in Genetics, Genomics, and Bioinformatics, University at Buffalo-State University of New York, Buffalo, NY 14203, USA;
| | - Marc S. Halfon
- Program in Genetics, Genomics, and Bioinformatics, University at Buffalo-State University of New York, Buffalo, NY 14203, USA;
- Department of Biochemistry, University at Buffalo-State University of New York, Buffalo, NY 14203, USA
- Department of Biomedical Informatics, University at Buffalo-State University of New York, Buffalo, NY 14203, USA
- Department of Biological Sciences, University at Buffalo-State University of New York, Buffalo, NY 14203, USA
- NY State Center of Excellence in Bioinformatics & Life Sciences, Buffalo, NY 14203, USA
| |
Collapse
|
48
|
Zrimec J, Buric F, Kokina M, Garcia V, Zelezniak A. Learning the Regulatory Code of Gene Expression. Front Mol Biosci 2021; 8:673363. [PMID: 34179082 PMCID: PMC8223075 DOI: 10.3389/fmolb.2021.673363] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/27/2021] [Accepted: 05/24/2021] [Indexed: 11/13/2022] Open
Abstract
Data-driven machine learning is the method of choice for predicting molecular phenotypes from nucleotide sequence, modeling gene expression events including protein-DNA binding, chromatin states as well as mRNA and protein levels. Deep neural networks automatically learn informative sequence representations and interpreting them enables us to improve our understanding of the regulatory code governing gene expression. Here, we review the latest developments that apply shallow or deep learning to quantify molecular phenotypes and decode the cis-regulatory grammar from prokaryotic and eukaryotic sequencing data. Our approach is to build from the ground up, first focusing on the initiating protein-DNA interactions, then specific coding and non-coding regions, and finally on advances that combine multiple parts of the gene and mRNA regulatory structures, achieving unprecedented performance. We thus provide a quantitative view of gene expression regulation from nucleotide sequence, concluding with an information-centric overview of the central dogma of molecular biology.
Collapse
Affiliation(s)
- Jan Zrimec
- Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
| | - Filip Buric
- Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
| | - Mariia Kokina
- Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
- Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Kongens Lyngby, Denmark
| | - Victor Garcia
- School of Life Sciences and Facility Management, Zurich University of Applied Sciences, Wädenswil, Switzerland
| | - Aleksej Zelezniak
- Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
- Science for Life Laboratory, Stockholm, Sweden
| |
Collapse
|
49
|
Niu X, Deng K, Liu L, Yang K, Hu X. A statistical framework for predicting critical regions of p53-dependent enhancers. Brief Bioinform 2021; 22:bbaa053. [PMID: 32392580 PMCID: PMC8138796 DOI: 10.1093/bib/bbaa053] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/07/2020] [Revised: 02/26/2020] [Indexed: 12/13/2022] Open
Abstract
P53 is the 'guardian of the genome' and is responsible for regulating cell cycle and apoptosis. The genomic p53 binding regions, where activating transcriptional factors and cofactors like p300 simultaneously bind, are called 'p53-dependent enhancers', which play an important role in tumorigenesis. Current experimental assays generally provide a broad peak of each enhancer element, leaving our knowledge about critical enhancer regions (CERs) limited. Under the inspiration of enhancer dissection by CRISPR-Cas9 screen library on genome-wide p53 binding sites, here we introduce a statistical framework called 'Computational CRISPR Strategy' (CCS), to predict whether a given DNA fragment will be a p53-dependent CER by employing 7-mer as feature extractions along with random forest as the regressor. When training on a p53 CRISPR enhancer dataset, CCS not only accurately fitted the top-ranked enriched single guide RNAs (sgRNAs) but also successfully reproduced two known CERs that were validated by experiments. When applying it to an independent testing dataset on a tilling of a 2K-b genomic region of CRISPR-deCDKN1A-Lib, the trained model shows great generalizability by identifying a CER containing five top-ranked sgRNAs. A feature importance analysis further indicates that top-ranked 7-mers are mapped onto informative TF motifs including POU5F1 and SOX5, which are differentially enriched in p53-dependent CERs and are potential factors to make a general p53 binding site to form a p53-dependent CER, providing the interpretability of the trained model. Our results demonstrate that CCS is an alternative way of the CRISPR experiment to screen the genome for mapping p53-dependent CERs.
Collapse
Affiliation(s)
| | | | | | | | - Xuehai Hu
- Corresponding author: Xuehai Hu, College of Informatics, Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, Hubei, 430070, P.R. China. Tel.: +86-18171282783; Fax: +86-27-87288509; E-mail:
| |
Collapse
|
50
|
Hong J, Gao R, Yang Y. CrepHAN: Cross-species prediction of enhancers by using hierarchical attention networks. Bioinformatics 2021; 37:3436-3443. [PMID: 33978703 DOI: 10.1093/bioinformatics/btab349] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/21/2020] [Revised: 04/21/2021] [Accepted: 05/06/2021] [Indexed: 01/17/2023] Open
Abstract
MOTIVATION Enhancers are important functional elements in genome sequences. The identification of enhancers is a very challenging task due to the great diversity of enhancer sequences and the flexible localization on genomes. Till now, the interactions between enhancers and genes have not been fully understood yet. To speed up the studies of the regulatory roles of enhancers, computational tools for the prediction of enhancers have emerged in recent years. Especially, thanks to the ENCODE project and the advances of high-throughput experimental techniques, a large amount of experimentally verified enhancers have been annotated on the human genome, which allows large-scale predictions of unknown enhancers using data-driven methods. However, except for human and some model organisms, the validated enhancer annotations are scarce for most species, leading to more difficulties in the computational identification of enhancers for their genomes. RESULTS In this study, we propose a deep learning-based predictor for enhancers, named CrepHAN, which is featured by a hierarchical attention neural network and word embedding-based representations for DNA sequences. We use the experimentally-supported data of the human genome to train the model, and perform experiments on human and other mammals, including mouse, cow, and dog. The experimental results show that CrepHAN has more advantages on cross-species predictions, and outperforms the existing models by a large margin. Especially, for human-mouse cross-predictions, the AUC score of ROC curve is increased by 0.033∼0.145 on the combined tissue dataset and 0.032∼0.109 on tissue-specific datasets. AVAILABILITY bcmi.sjtu.edu.cn/~yangyang/CrepHAN.html. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Jianwei Hong
- Department of Computer Science and Engineering, Shanghai Jiao Tong University, 800 Dong Chuan Rd., Shanghai 200240, China.,School of Agriculture and Biology, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Ruitian Gao
- Department of Bioinformatics and Biostatistics, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Yang Yang
- Department of Computer Science and Engineering, Shanghai Jiao Tong University, 800 Dong Chuan Rd., Shanghai 200240, China.,Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai, 200240, China
| |
Collapse
|