Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

Cited by Other Article(s)

Boretti A. The transformative potential of AI-driven CRISPR-Cas9 genome editing to enhance CAR T-cell therapy. Comput Biol Med 2024;182:109137. [PMID: 39260044 DOI: 10.1016/j.compbiomed.2024.109137] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/16/2024] [Revised: 08/31/2024] [Accepted: 09/08/2024] [Indexed: 09/13/2024]

Zhang Q, Wei Y, Liu L. GraphPro: An interpretable graph neural network-based model for identifying promoters in multiple species. Comput Biol Med 2024;180:108974. [PMID: 39096613 DOI: 10.1016/j.compbiomed.2024.108974] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/24/2024] [Revised: 07/29/2024] [Accepted: 07/30/2024] [Indexed: 08/05/2024]

Abstract

Promoters are DNA sequences that bind with RNA polymerase to initiate transcription, regulating this process through interactions with transcription factors. Accurate identification of promoters is crucial for understanding gene expression regulation mechanisms and developing therapeutic approaches for various diseases. However, experimental techniques for promoter identification are often expensive, time-consuming, and inefficient, necessitating the development of accurate and efficient computational models for this task. Enhancing the model's ability to recognize promoters across multiple species and improving its interpretability pose significant challenges. In this study, we introduce a novel interpretable model based on graph neural networks, named GraphPro, for multi-species promoter identification. Initially, we encode the sequences using the k-tuple nucleotide frequency pattern, dinucleotide physicochemical properties, and dna2vec. Subsequently, we construct two feature extraction modules based on convolutional neural networks and graph neural networks. These modules aim to extract specific motifs from the promoters, learn their dependencies, and capture the underlying structural features of the promoters, providing a more comprehensive representation. Finally, a fully connected neural network predicts whether the input sequence is a promoter. We conducted extensive experiments on promoter datasets from eight species, including Human, Mouse, and Escherichia coli. The experimental results show that the average Sn, Sp, Acc and MCC values of GraphPro are 0.9123, 0.9482, 0.8840 and 0.7984, respectively. Compared with previous promoter identification methods, GraphPro not only achieves better recognition accuracy on multiple species, but also outperforms all previous methods in cross-species prediction ability. Furthermore, by visualizing GraphPro's decision process and analyzing the sequences matching the transcription factor binding motifs captured by the model, we validate its significant advantages in biological interpretability. The source code for GraphPro is available at https://github.com/liuliwei1980/GraphPro.
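
The first encoding named in the abstract above, the k-tuple nucleotide frequency pattern, can be illustrated with a short sketch. This is not taken from the GraphPro repository: the function name `ktuple_frequencies`, the default `k = 2`, and the choice to skip windows containing ambiguous bases are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def ktuple_frequencies(seq: str, k: int = 2) -> dict[str, float]:
    """Return the relative frequency of every k-tuple over A, C, G, T.

    Hypothetical helper: windows containing other symbols (e.g. N) are skipped.
    """
    seq = seq.upper()
    # Slide a window of width k across the sequence, one base at a time.
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(w for w in windows if set(w) <= set("ACGT"))
    total = sum(counts.values())
    # A fixed lexicographic order gives a feature vector of length 4**k.
    return {
        "".join(t): (counts["".join(t)] / total if total else 0.0)
        for t in product("ACGT", repeat=k)
    }


print(ktuple_frequencies("ACGTAC", 2)["AC"])  # 0.4 (2 of the 5 dinucleotide windows)
```

Fixing the key order means every sequence maps to a vector of the same length, which is what a downstream classifier would need regardless of sequence length.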

Nagda BM, Nguyen VM, White RT. promSEMBLE: Hard Pattern Mining and Ensemble Learning for Detecting DNA Promoter Sequences. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2024;21:208-214. [PMID: 38051616 DOI: 10.1109/tcbb.2023.3339597] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/07/2023]

Liu X, Teng L, Luo Y, Xu Y. Prediction of prokaryotic and eukaryotic promoters based on information-theoretic features. Biosystems 2023;231:104979. [PMID: 37423595 DOI: 10.1016/j.biosystems.2023.104979] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/08/2022] [Revised: 07/06/2023] [Accepted: 07/07/2023] [Indexed: 07/11/2023]

Zaytsev K, Fedorov A, Korotkov E. Classification of Promoter Sequences from Human Genome. Int J Mol Sci 2023;24:12561. [PMID: 37628742 PMCID: PMC10454140 DOI: 10.3390/ijms241612561] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/02/2023] [Revised: 07/28/2023] [Accepted: 08/03/2023] [Indexed: 08/27/2023]

Milito A, Aschern M, McQuillan JL, Yang JS. Challenges and advances towards the rational design of microalgal synthetic promoters in Chlamydomonas reinhardtii. Journal of Experimental Botany 2023;74:3833-3850. [PMID: 37025006 DOI: 10.1093/jxb/erad100] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 12/09/2022] [Accepted: 03/24/2023] [Indexed: 06/19/2023]

Pan C, Qi Y. CRISPR-Combo-mediated orthogonal genome editing and transcriptional activation for plant breeding. Nat Protoc 2023:10.1038/s41596-023-00823-w. [PMID: 37085666 DOI: 10.1038/s41596-023-00823-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/07/2022] [Accepted: 02/09/2023] [Indexed: 04/23/2023]

Barbero-Aparicio JA, Olivares-Gil A, Díez-Pastor JF, García-Osorio C. Deep learning and support vector machines for transcription start site identification. PeerJ Comput Sci 2023;9:e1340. [PMID: 37346545 PMCID: PMC10280436 DOI: 10.7717/peerj-cs.1340] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2022] [Accepted: 03/21/2023] [Indexed: 06/23/2023]

Abstract

Recognizing transcription start sites is key to gene identification. Several approaches have been employed in related problems such as detecting translation initiation sites or promoters, many of the most recent ones being based on machine learning. Deep learning methods have been proven to be exceptionally effective for this task, but their use in transcription start site identification has not yet been explored in depth. Also, the very few existing works do not compare their methods to support vector machines (SVMs), the most established technique in this area of study, nor provide the curated dataset used in the study. The small number of published papers on this specific problem could be explained by this lack of datasets. Given that both support vector machines and deep neural networks have been applied in related problems with remarkable results, we compared their performance in transcription start site predictions, concluding that SVMs are computationally much slower, and deep learning methods, especially long short-term memory neural networks (LSTMs), are better suited to work with sequences than SVMs. For such a purpose, we used the reference human genome GRCh38. Additionally, we studied two different aspects related to data processing: the proper way to generate training examples and the imbalanced nature of the data. Furthermore, the generalization performance of the models studied was also tested using the mouse genome, where the LSTM neural network stood out from the rest of the algorithms. To sum up, this article provides an analysis of the best architecture choices in transcription start site identification, as well as a method to generate transcription start site datasets including negative instances on any species available in Ensembl. We found that deep learning methods are better suited than SVMs to solve this problem, being more efficient and better adapted to long sequences and large amounts of data. We also create a transcription start site (TSS) dataset large enough to be used in deep learning experiments.
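
The abstract above mentions generating training examples that include negative instances, without giving the procedure. One common way to do this is sketched below under stated assumptions: positives are fixed-width windows centred on an annotated TSS, and negatives are windows sampled at positions at least `min_distance` bases from every TSS. The function `make_examples` and all of its parameters are hypothetical and are not the authors' published method.

```python
import random


def make_examples(genome: str, tss_positions: list[int], flank: int = 3,
                  min_distance: int = 5, n_negatives: int = 2, seed: int = 0):
    """Return (window, label) pairs: label 1 centred on a TSS, label 0 elsewhere.

    Illustrative assumption: negatives are drawn uniformly from positions that
    lie at least `min_distance` bases away from every annotated TSS.
    """
    rng = random.Random(seed)
    examples = []
    # Positive windows: `flank` bases on each side of every in-range TSS.
    for pos in tss_positions:
        if flank <= pos < len(genome) - flank:
            examples.append((genome[pos - flank:pos + flank + 1], 1))
    # Candidate negative centres: in range and far from every annotated TSS.
    candidates = [
        c for c in range(flank, len(genome) - flank)
        if all(abs(c - p) >= min_distance for p in tss_positions)
    ]
    for c in rng.sample(candidates, min(n_negatives, len(candidates))):
        examples.append((genome[c - flank:c + flank + 1], 0))
    return examples


toy = make_examples("ACGT" * 10, [10, 30])
print(toy[0])  # ('TACGTAC', 1)
```

In this sketch `n_negatives` sets the class ratio directly, which is one simple lever for the data imbalance the abstract refers to.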

Bharti S, Ploch S, Thines M. High-throughput time series expression profiling of Plasmopara halstedii infecting Helianthus annuus reveals conserved sequence motifs upstream of co-expressed genes. BMC Genomics 2023;24:140. [PMID: 36944935 PMCID: PMC10031896 DOI: 10.1186/s12864-023-09214-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/12/2021] [Accepted: 02/27/2023] [Indexed: 03/23/2023]

Li Z, Gao E, Zhou J, Han W, Xu X, Gao X. Applications of deep learning in understanding gene regulation. Cell Reports Methods 2023;3:100384. [PMID: 36814848 PMCID: PMC9939384 DOI: 10.1016/j.crmeth.2022.100384] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Indexed: 05/22/2023]

Affiliation(s)

Zhongxiao Li Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia; KAUST Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
Elva Gao The KAUST School, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
Juexiao Zhou Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia; KAUST Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
Wenkai Han Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia; KAUST Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
Xiaopeng Xu Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia; KAUST Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
Xin Gao Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia; KAUST Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia

Kari H, Bandi SMS, Kumar A, Yella VR. DeePromClass: Delineator for Eukaryotic Core Promoters Employing Deep Neural Networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023;20:802-807. [PMID: 35353704 DOI: 10.1109/tcbb.2022.3163418] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]

Zhou J, Zhang B, Li H, Zhou L, Li Z, Long Y, Han W, Wang M, Cui H, Li J, Chen W, Gao X. Annotating TSSs in Multiple Cell Types Based on DNA Sequence and RNA-seq Data via DeeReCT-TSS. Genomics, Proteomics & Bioinformatics 2022;20:959-973. [PMID: 36528241 PMCID: PMC10025762 DOI: 10.1016/j.gpb.2022.11.010] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/16/2022] [Revised: 10/21/2022] [Accepted: 11/24/2022] [Indexed: 12/23/2022]

Affiliation(s)

Juexiao Zhou Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal 23955-6900, Saudi Arabia; Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal 23955-6900, Saudi Arabia; Department of Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
Bin Zhang Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal 23955-6900, Saudi Arabia; Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal 23955-6900, Saudi Arabia
Haoyang Li Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal 23955-6900, Saudi Arabia; Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal 23955-6900, Saudi Arabia
Longxi Zhou Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal 23955-6900, Saudi Arabia; Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal 23955-6900, Saudi Arabia
Zhongxiao Li Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal 23955-6900, Saudi Arabia; Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal 23955-6900, Saudi Arabia
Yongkang Long Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal 23955-6900, Saudi Arabia; Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal 23955-6900, Saudi Arabia
Wenkai Han Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal 23955-6900, Saudi Arabia; Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal 23955-6900, Saudi Arabia
Mengran Wang Department of Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
Huanhuan Cui Department of Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China; Shenzhen Key Laboratory of Gene Regulation and Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China; Academy for Advanced Interdisciplinary Studies, Southern University of Science and Technology, Shenzhen 518055, China
Jingjing Li Department of Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
Wei Chen Department of Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China; Shenzhen Key Laboratory of Gene Regulation and Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China; Academy for Advanced Interdisciplinary Studies, Southern University of Science and Technology, Shenzhen 518055, China.
Xin Gao Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal 23955-6900, Saudi Arabia; Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal 23955-6900, Saudi Arabia.

Liu Q, Fang H, Wang X, Wang M, Li S, Coin LJM, Li F, Song J. DeepGenGrep: a general deep learning-based predictor for multiple genomic signals and regions. Bioinformatics 2022;38:4053-4061. [PMID: 35799358 DOI: 10.1093/bioinformatics/btac454] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/12/2022] [Revised: 04/11/2022] [Accepted: 07/06/2022] [Indexed: 12/24/2022]

Guo AJX, Qi H. Using Artificial Neural Networks to Model Errors in Biochemical Manipulation of DNA Molecules. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022;19:3060-3067. [PMID: 34115591 DOI: 10.1109/tcbb.2021.3088525] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/12/2023]

CapsProm: a capsule network for promoter prediction. Comput Biol Med 2022;147:105627. [DOI: 10.1016/j.compbiomed.2022.105627] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/10/2021] [Revised: 04/05/2022] [Accepted: 04/11/2022] [Indexed: 11/21/2022]

Database of Potential Promoter Sequences in the Capsicum annuum Genome. Biology 2022;11:biology11081117. [PMID: 35892972 PMCID: PMC9332048 DOI: 10.3390/biology11081117] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/01/2022] [Revised: 07/19/2022] [Accepted: 07/23/2022] [Indexed: 11/16/2022]

Routhier E, Mozziconacci J. Genomics enters the deep learning era. PeerJ 2022;10:e13613. [PMID: 35769139 PMCID: PMC9235815 DOI: 10.7717/peerj.13613] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/24/2022] [Accepted: 05/30/2022] [Indexed: 01/17/2023]

Shim H. Investigating the Genomic Background of CRISPR-Cas Genomes for CRISPR-Based Antimicrobials. Evol Bioinform Online 2022;18:11769343221103887. [PMID: 35692726 PMCID: PMC9185011 DOI: 10.1177/11769343221103887] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/08/2022] [Accepted: 05/05/2022] [Indexed: 12/01/2022]

Abstract

CRISPR-Cas systems provide adaptive immunity that protects prokaryotes against foreign genetic elements. Genetic templates acquired during past infection events enable DNA-interacting enzymes to recognize foreign DNA for destruction. Due to the programmability and specificity of these genetic templates, CRISPR-Cas systems are potential alternative antibiotics that can be engineered to self-target antimicrobial resistance genes on the chromosome or plasmid. However, several fundamental questions remain before these tools can be repurposed against drug-resistant bacteria. For endogenous CRISPR-Cas self-targeting, antimicrobial resistance genes and functional CRISPR-Cas systems have to co-occur in the target cell. Furthermore, these tools have to outplay DNA repair pathways that respond to the nuclease activities of Cas proteins, even for exogenous CRISPR-Cas delivery. Here, we conduct a comprehensive survey of CRISPR-Cas genomes. First, we address the co-occurrence of CRISPR-Cas systems and antimicrobial resistance genes in the CRISPR-Cas genomes. We show that the average number of these genes varies greatly by the CRISPR-Cas type, and some CRISPR-Cas types (IE and IIIA) have over 20 genes per genome. Next, we investigate the DNA repair pathways of these CRISPR-Cas genomes, revealing that the diversity and frequency of these pathways differ by the CRISPR-Cas type. The interplay between CRISPR-Cas systems and DNA repair pathways is essential for the acquisition of new spacers in CRISPR arrays. We conduct simulation studies to demonstrate that the efficiency of these DNA repair pathways may be inferred from the time-series patterns in the RNA structure of CRISPR repeats. This bioinformatic survey of CRISPR-Cas genomes highlights the need to consider multifaceted interactions between different genes and systems in order to design effective CRISPR-based antimicrobials that can specifically target drug-resistant bacteria in natural microbial communities.

Li Z, Li Y, Zhang B, Li Y, Long Y, Zhou J, Zou X, Zhang M, Hu Y, Chen W, Gao X. DeeReCT-APA: Prediction of Alternative Polyadenylation Site Usage Through Deep Learning. Genomics, Proteomics & Bioinformatics 2022;20:483-495. [PMID: 33662629 PMCID: PMC9801043 DOI: 10.1016/j.gpb.2020.05.004] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Received: 01/13/2020] [Revised: 03/28/2020] [Accepted: 06/12/2020] [Indexed: 01/26/2023]

Affiliation(s)

Zhongxiao Li King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal 23955-6900, Saudi Arabia
Yisheng Li Department of Biology, Southern University of Science and Technology (SUSTech), Shenzhen 518055, China
Bin Zhang Cancer Science Institute of Singapore, Singapore 117599, Singapore
Yu Li King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal 23955-6900, Saudi Arabia
Yongkang Long King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal 23955-6900, Saudi Arabia; Department of Biology, Southern University of Science and Technology (SUSTech), Shenzhen 518055, China
Juexiao Zhou Department of Biology, Southern University of Science and Technology (SUSTech), Shenzhen 518055, China
Xudong Zou Department of Biology, Southern University of Science and Technology (SUSTech), Shenzhen 518055, China
Min Zhang Department of Biology, Southern University of Science and Technology (SUSTech), Shenzhen 518055, China
Yuhui Hu Department of Biology, Southern University of Science and Technology (SUSTech), Shenzhen 518055, China (corresponding author)
Wei Chen Department of Biology, Southern University of Science and Technology (SUSTech), Shenzhen 518055, China (corresponding author)
Xin Gao King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal 23955-6900, Saudi Arabia (corresponding author)

Yang M, Huang L, Huang H, Tang H, Zhang N, Yang H, Wu J, Mu F. Integrating convolution and self-attention improves language model of human genome for interpreting non-coding regions at base-resolution. Nucleic Acids Res 2022;50:e81. [PMID: 35536244 PMCID: PMC9371931 DOI: 10.1093/nar/gkac326] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 09/22/2021] [Revised: 02/22/2022] [Accepted: 05/09/2022] [Indexed: 12/12/2022]

Abstract

Interpretation of the non-coding genome remains an unsolved challenge in human genetics due to the impracticality of exhaustively annotating biochemically active elements in all conditions. Deep learning-based computational approaches have recently emerged to help interpret non-coding regions. Here, we present LOGO (Language of Genome), a self-attention based contextualized pre-trained language model containing only two self-attention layers with 1 million parameters as a substantially light architecture that applies self-supervision techniques to learn bidirectional representations of the unlabelled human reference genome. LOGO is then fine-tuned for a sequence labelling task, and further extended to a variant prioritization task via a special input encoding scheme of alternative alleles followed by adding a convolutional module. Experiments show that LOGO achieves 15% absolute improvement for promoter identification and up to 4.5% absolute improvement for enhancer-promoter interaction prediction. LOGO exhibits state-of-the-art multi-task predictive power on thousands of chromatin features with only 3% parameterization when benchmarked against the fully supervised model DeepSEA, and 1% parameterization against a recent BERT-based DNA language model. For allelic-effect prediction, locality introduced by one-dimensional convolution shows improved sensitivity and specificity for prioritizing non-coding variants associated with human diseases. In addition, we apply LOGO to interpret type 2 diabetes (T2D) GWAS signals and infer underlying regulatory mechanisms. We make a conceptual analogy between natural language and the human genome and demonstrate that LOGO is an accurate, fast, scalable, and robust framework to interpret non-coding regions for global sequence labeling as well as for variant prioritization at base-resolution.

iProm-Zea: A two-layer model to identify plant promoters and their types using convolutional neural network. Genomics 2022;114:110384. [PMID: 35533969 DOI: 10.1016/j.ygeno.2022.110384] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 01/14/2022] [Revised: 04/18/2022] [Accepted: 05/02/2022] [Indexed: 01/14/2023]

Abstract

A promoter is a short DNA sequence near the start codon, responsible for initiating the transcription of a specific gene in the genome. The accurate recognition of promoters is important for achieving a better understanding of transcriptional regulation. Because of their importance in the process of biological transcriptional regulation, there is an urgent need to develop in silico tools to identify promoters and their types in a timely and accurate manner. A number of prediction methods have been developed in this regard; however, almost all of them are merely used for identifying promoters and their strength or sigma types. The TATA box region in the TATA promoter influences the post-transcriptional processes; therefore, in the current study, we developed a two-layer predictor called "iProm-Zea" using the convolutional neural network (CNN) to identify TATA and TATA-less promoters. The first layer can be used to identify a given DNA sequence as a promoter or non-promoter. The second layer can be used to identify whether the recognized promoter is the TATA promoter. To find an optimal feature encoding scheme and model, we employed four feature encoding schemes on different machine learning and CNN algorithms, and based on the evaluation results, we selected a one-hot encoding scheme and a CNN model for iProm-Zea. The 5-fold cross validation testing results demonstrated that the constructed predictor showed great potential for identifying promoters and classifying them as TATA and TATA-less promoters. Furthermore, we performed cross-species analysis of iProm-Zea to evaluate its performance in other species. Moreover, to make it easier for other experimental scientists to obtain the results they need, we established a freely accessible and user-friendly web server at http://nsclbio.jbnu.ac.kr/tools/iProm-Zea/.
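
The one-hot encoding and the two-layer decision flow described in the abstract above can be sketched as follows. This is a minimal illustration, not the iProm-Zea implementation: `one_hot`, `two_layer_predict`, and the two classifier callables are hypothetical names, and the callables stand in for trained CNNs that are not reproduced here. Mapping unknown bases to an all-zero row is also an assumption.

```python
from typing import Callable

BASES = "ACGT"
Encoded = list[list[int]]


def one_hot(seq: str) -> Encoded:
    """Encode each base as a 4-element indicator row (order A, C, G, T).

    Assumption: any other symbol, such as N, becomes an all-zero row.
    """
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]


def two_layer_predict(seq: str,
                      is_promoter: Callable[[Encoded], bool],
                      is_tata: Callable[[Encoded], bool]) -> str:
    """Layer 1 separates promoters from non-promoters; layer 2 runs only on promoters."""
    x = one_hot(seq)
    if not is_promoter(x):
        return "non-promoter"
    return "TATA promoter" if is_tata(x) else "TATA-less promoter"


print(one_hot("AC"))  # [[1, 0, 0, 0], [0, 1, 0, 0]]
```

Passing the classifiers in as callables keeps the cascade logic separate from whichever model is trained for each layer.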

Perez Martell RI, Ziesel A, Jabbari H, Stege U. Supervised promoter recognition: a benchmark framework. BMC Bioinformatics 2022;23:118. [PMID: 35366794 PMCID: PMC8976979 DOI: 10.1186/s12859-022-04647-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2022] [Accepted: 03/16/2022] [Indexed: 11/10/2022]

Prokaryotic and eukaryotic promoters identification based on residual network transfer learning. Bioprocess Biosyst Eng 2022;45:955-967. [DOI: 10.1007/s00449-022-02716-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/07/2022] [Accepted: 02/27/2022] [Indexed: 11/26/2022]

Cheng R, Xu Z, Luo M, Wang P, Cao H, Jin X, Zhou W, Xiao L, Jiang Q. Identification of alternative splicing-derived cancer neoantigens for mRNA vaccine development. Brief Bioinform 2022;23:bbab553. [PMID: 35279714 DOI: 10.1093/bib/bbab553] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 10/05/2021] [Revised: 11/15/2021] [Accepted: 12/02/2021] [Indexed: 12/17/2023]

Yuan Q, Chen S, Rao J, Zheng S, Zhao H, Yang Y. AlphaFold2-aware protein-DNA binding site prediction using graph transformer. Brief Bioinform 2022;23:6509729. [PMID: 35039821 DOI: 10.1093/bib/bbab564] [Citation(s) in RCA: 34] [Impact Index Per Article: 17.0] [Received: 10/09/2021] [Revised: 11/24/2021] [Accepted: 12/09/2021] [Indexed: 12/13/2022]

Wei PJ, Pang ZZ, Jiang LJ, Tan D, Su Y, Zheng CH. Promoter Prediction in Nannochloropsis Based on Densely Connected Convolutional Neural Networks. Methods 2022;204:38-46. [DOI: 10.1016/j.ymeth.2022.03.017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/12/2022] [Revised: 03/03/2022] [Accepted: 03/28/2022] [Indexed: 10/18/2022]

Wei J, Chen S, Zong L, Gao X, Li Y. Protein-RNA interaction prediction with deep learning: structure matters. Brief Bioinform 2022;23:bbab540. [PMID: 34929730 PMCID: PMC8790951 DOI: 10.1093/bib/bbab540] [Citation(s) in RCA: 28] [Impact Index Per Article: 14.0] [Received: 09/29/2021] [Revised: 11/14/2021] [Accepted: 11/22/2021] [Indexed: 12/11/2022]

Zhang M, Jia C, Li F, Li C, Zhu Y, Akutsu T, Webb GI, Zou Q, Coin LJM, Song J. Critical assessment of computational tools for prokaryotic and eukaryotic promoter prediction. Brief Bioinform 2022;23:6502561. [PMID: 35021193 PMCID: PMC8921625 DOI: 10.1093/bib/bbab551] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 08/09/2021] [Revised: 11/12/2021] [Accepted: 11/30/2021] [Indexed: 01/13/2023]

Affiliation(s)

Meng Zhang
Cangzhi Jia Corresponding authors: Jiangning Song, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia; Lachlan J.M. Coin, Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, Victoria 3000, Australia; Quan Zou, Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China; Cangzhi Jia, School of Science, Dalian Maritime University, Dalian 116026, China.
Fuyi Li
Chen Li
Yan Zhu
Tatsuya Akutsu
Geoffrey I Webb Department of Data Science and Artificial Intelligence, Monash University, Melbourne, VIC 3800, Australia,Monash Data Futures Institute, Monash University, Melbourne, VIC 3800, Australia
Quan Zou Corresponding authors: Jiangning Song, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia. E-mail: ; Lachlan J.M. Coin, Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, Victoria 3000, Australia. E-mail: ; Quan Zou, Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China. E-mail: ; Cangzhi Jia, School of Science, Dalian Maritime University, Dalian 116026, China. E-mail:
Lachlan J M Coin Corresponding authors: Jiangning Song, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia. E-mail: ; Lachlan J.M. Coin, Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, Victoria 3000, Australia. E-mail: ; Quan Zou, Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China. E-mail: ; Cangzhi Jia, School of Science, Dalian Maritime University, Dalian 116026, China. E-mail:
Jiangning Song Corresponding authors: Jiangning Song, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia. E-mail: ; Lachlan J.M. Coin, Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, Victoria 3000, Australia. E-mail: ; Quan Zou, Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China. E-mail: ; Cangzhi Jia, School of Science, Dalian Maritime University, Dalian 116026, China. E-mail:

Mavaie P, Holder L, Beck D, Skinner MK. Predicting environmentally responsive transgenerational differential DNA methylated regions (epimutations) in the genome using a hybrid deep-machine learning approach. BMC Bioinformatics 2021;22:575. [PMID: 34847877 PMCID: PMC8630850 DOI: 10.1186/s12859-021-04491-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/23/2021] [Accepted: 11/18/2021] [Indexed: 11/24/2022] Open

Zhu Y, Yin S, Zheng J, Shi Y, Jia C. O-glycosylation site prediction for Homo sapiens by combining properties and sequence features with support vector machine. J Bioinform Comput Biol 2021;20:2150029. [PMID: 34806952 DOI: 10.1142/s0219720021500293] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]

Li F, Dong S, Leier A, Han M, Guo X, Xu J, Wang X, Pan S, Jia C, Zhang Y, Webb GI, Coin LJM, Li C, Song J. Positive-unlabeled learning in bioinformatics and computational biology: a brief review. Brief Bioinform 2021;23:6415313. [PMID: 34729589 DOI: 10.1093/bib/bbab461] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/30/2021] [Revised: 09/27/2021] [Accepted: 10/07/2021] [Indexed: 12/14/2022] Open

Lin JL, Kuo WL, Huang YH, Jong TL, Hsu AL, Hsu WH. Using Convolutional Neural Networks to Measure the Physiological Age of Caenorhabditis elegans. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021;18:2724-2732. [PMID: 32031946 DOI: 10.1109/tcbb.2020.2971992] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]

Abstract

Caenorhabditis elegans (C. elegans) is a popular and excellent model for studies of aging due to its short lifespan. Methods for precisely measuring the physiological age of C. elegans are critically needed, especially for antiaging drug screening and genetic screening studies. The effects of various antiaging interventions on the rate of aging in the early stage of the aging process can be determined based on the quantification of physiological age. However, in general, the age of C. elegans is evaluated via human visual inspection of morphological changes based on personal experience and subjective judgment. For example, the rate of motor activity decay has been used to predict lifespan in early- to mid-stage aging. Using image processing, the physiological age of C. elegans can be measured and then classified into periods or classes from childhood to elderhood (e.g., 3 periods comprising days 0-2, 4-6 and 10-12) by using texture entropy (Shamir, L. et al., 2009). Our dataset consists of 913 microscopic images of C. elegans, with approximately 60 images per day from day 1 to day 14 of adulthood. We present quantitative methods to measure the physiological age of C. elegans with convolutional neural networks (CNNs), which can measure age with a granularity of days rather than periods. The methods achieved a mean absolute error (MAE) of less than 1 day for the measured age of C. elegans. In our experiments, we found that after training and testing on our dataset, 5 popular CNN models, namely the 50-layer residual network (ResNet50), InceptionV3, InceptionResNetV2, the 16-layer Visual Geometry Group network (VGG16) and MobileNet, measured the physiological age of C. elegans with an average testing MAE of 1.58 days. Furthermore, based on the results, we propose two models, one for linear regression analysis and the other for logistic regression, that combine a CNN model and a new attribute: curved_or_straight. The linear regression analysis model achieved a test MAE of 0.94 days; the logistic regression model achieved an accuracy of 84.78 percent with an error tolerance of 1 day.
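
The two metrics quoted in this abstract, MAE in days and accuracy within a 1-day error tolerance, can be illustrated with a minimal sketch. The function names and the sample ages below are hypothetical, and the tolerance rule (a prediction counts as correct when it is within 1 day of the true age) is one plausible reading of the abstract rather than the authors' stated definition.

```python
def mean_absolute_error(true_ages, predicted_ages):
    """Average absolute difference, in days, between true and predicted ages."""
    if not true_ages or len(true_ages) != len(predicted_ages):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(t - p) for t, p in zip(true_ages, predicted_ages))
    return total / len(true_ages)


def accuracy_within_tolerance(true_ages, predicted_ages, tolerance_days=1):
    """Fraction of predictions that fall within tolerance_days of the true age.

    Assumption: this is how "accuracy with an error tolerance of 1 day" is read here.
    """
    if not true_ages or len(true_ages) != len(predicted_ages):
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(
        1 for t, p in zip(true_ages, predicted_ages) if abs(t - p) <= tolerance_days
    )
    return hits / len(true_ages)


# Hypothetical ages in days of adulthood; not data from the study.
true_ages = [1, 3, 5, 8, 12]
predicted_ages = [1.5, 2.0, 7.5, 8.0, 11.0]

mae = mean_absolute_error(true_ages, predicted_ages)
accuracy = accuracy_within_tolerance(true_ages, predicted_ages)
print(f"MAE: {mae:.2f} days, accuracy within 1 day: {accuracy:.0%}")
```

Reporting both numbers is useful because a single large miss raises the MAE while barely changing the tolerance-based accuracy.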

Umarov R, Li Y, Arakawa T, Takizawa S, Gao X, Arner E. ReFeaFi: Genome-wide prediction of regulatory elements driving transcription initiation. PLoS Comput Biol 2021;17:e1009376. [PMID: 34491989 PMCID: PMC8448322 DOI: 10.1371/journal.pcbi.1009376] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/22/2021] [Revised: 09/17/2021] [Accepted: 08/23/2021] [Indexed: 11/19/2022] Open

Ji Y, Zhou Z, Liu H, Davuluri RV. DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics 2021;37:2112-2120. [PMID: 33538820 PMCID: PMC11025658 DOI: 10.1093/bioinformatics/btab083] [Citation(s) in RCA: 190] [Impact Index Per Article: 63.3] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/10/2020] [Revised: 12/31/2020] [Accepted: 02/01/2021] [Indexed: 12/19/2022] Open

Vaz JM, Balaji S. Convolutional neural networks (CNNs): concepts and applications in pharmacogenomics. Mol Divers 2021;25:1569-1584. [PMID: 34031788 PMCID: PMC8342355 DOI: 10.1007/s11030-021-10225-3] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/06/2021] [Accepted: 04/21/2021] [Indexed: 12/17/2022]

Wang Z, Sun X, Zhang X, Dong B, Yu H. Development of a miRNA Sensor by an Inducible CRISPR-Cas9 Construct in Ciona Embryogenesis. Mol Biotechnol 2021;63:613-620. [PMID: 33880702 DOI: 10.1007/s12033-021-00324-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/27/2021] [Accepted: 03/29/2021] [Indexed: 11/28/2022]

Mathematical Algorithm for Identification of Eukaryotic Promoter Sequences. Symmetry (Basel) 2021. [DOI: 10.3390/sym13060917] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/20/2022] Open

Mainali S, Colorado FA, Garzon MH. Foretelling the Phenotype of a Genomic Sequence. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021;18:777-783. [PMID: 32287003 DOI: 10.1109/tcbb.2020.2985349] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]

Zohra Smaili F, Tian S, Roy A, Alazmi M, Arold ST, Mukherjee S, Scott Hefty P, Chen W, Gao X. QAUST: Protein Function Prediction Using Structure Similarity, Protein Interaction, and Functional Motifs. GENOMICS PROTEOMICS & BIOINFORMATICS 2021;19:998-1011. [PMID: 33631427 PMCID: PMC9403031 DOI: 10.1016/j.gpb.2021.02.001] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 07/11/2018] [Revised: 04/03/2019] [Accepted: 05/17/2019] [Indexed: 11/25/2022]

Kong L, Chen Y, Xu F, Xu M, Li Z, Fang J, Zhang L, Pian C. Mining influential genes based on deep learning. BMC Bioinformatics 2021;22:27. [PMID: 33482718 PMCID: PMC7821411 DOI: 10.1186/s12859-021-03972-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/29/2020] [Accepted: 01/15/2021] [Indexed: 11/17/2022] Open

iPTT(2 L)-CNN: A Two-Layer Predictor for Identifying Promoters and Their Types in Plant Genomes by Convolutional Neural Network. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2021;2021:6636350. [PMID: 33488763 PMCID: PMC7803414 DOI: 10.1155/2021/6636350] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 11/24/2020] [Revised: 12/13/2020] [Accepted: 12/16/2020] [Indexed: 11/18/2022]

Zheng D, Pang G, Liu B, Chen L, Yang J. Learning transferable deep convolutional neural networks for the classification of bacterial virulence factors. Bioinformatics 2020;36:3693-3702. [PMID: 32251507 DOI: 10.1093/bioinformatics/btaa230] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/02/2019] [Revised: 03/25/2020] [Accepted: 04/01/2020] [Indexed: 12/23/2022] Open

Zhu Y, Li F, Xiang D, Akutsu T, Song J, Jia C. Computational identification of eukaryotic promoters based on cascaded deep capsule neural networks. Brief Bioinform 2020;22:5998831. [PMID: 33227813 DOI: 10.1093/bib/bbaa299] [Citation(s) in RCA: 40] [Impact Index Per Article: 10.0] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/24/2020] [Revised: 10/01/2020] [Accepted: 10/07/2020] [Indexed: 12/26/2022] Open

Xiao M, Yang X, Yu J, Zhang L. CGIDLA:Developing the Web Server for CpG Island Related Density and LAUPs (Lineage-Associated Underrepresented Permutations) Study. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020;17:2148-2154. [PMID: 31443042 DOI: 10.1109/tcbb.2019.2935971] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]

Amin R, Rahman CR, Ahmed S, Sifat MHR, Liton MNK, Rahman MM, Khan MZH, Shatabda S. iPromoter-BnCNN: a novel branched CNN-based predictor for identifying and classifying sigma promoters. Bioinformatics 2020;36:4869-4875. [DOI: 10.1093/bioinformatics/btaa609] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/25/2019] [Revised: 05/19/2020] [Accepted: 06/24/2020] [Indexed: 11/14/2022] Open

Identification of Regulatory SNPs Associated with Vicine and Convicine Content of Vicia faba Based on Genotyping by Sequencing Data Using Deep Learning. Genes (Basel) 2020;11:genes11060614. [PMID: 32516876 PMCID: PMC7349281 DOI: 10.3390/genes11060614] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/29/2020] [Revised: 05/26/2020] [Accepted: 05/28/2020] [Indexed: 12/15/2022] Open

Zhang S, Li X, Lin Q, Lin J, Wong KC. Uncovering the key dimensions of high-throughput biomolecular data using deep learning. Nucleic Acids Res 2020;48:e56. [PMID: 32232416 PMCID: PMC7261195 DOI: 10.1093/nar/gkaa191] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/16/2019] [Revised: 03/06/2020] [Accepted: 03/16/2020] [Indexed: 01/09/2023] Open

Model-driven generation of artificial yeast promoters. Nat Commun 2020;11:2113. [PMID: 32355169 PMCID: PMC7192914 DOI: 10.1038/s41467-020-15977-4] [Citation(s) in RCA: 80] [Impact Index Per Article: 20.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/08/2020] [Accepted: 04/02/2020] [Indexed: 01/04/2023] Open

Cui ZJ, Zhang WT, Zhu Q, Zhang QY, Zhang HY. Using a Heat Diffusion Model to Detect Potential Drug Resistance Genes of Mycobacterium tuberculosis. Protein Pept Lett 2020;27:711-717. [PMID: 32167422 DOI: 10.2174/0929866527666200313113157] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/15/2019] [Revised: 12/01/2019] [Accepted: 12/21/2019] [Indexed: 01/01/2023]

Abstract

BACKGROUND

Tuberculosis (TB), caused by Mycobacterium tuberculosis (Mtb), is one of the oldest known and most dangerous diseases. Although the spread of TB was controlled in the early 20th century using antibiotics and vaccines, TB has again become a threat because of increased drug resistance. There is still a lack of effective treatment regimens for a person who is already infected with multidrug-resistant Mtb (MDR-Mtb) or extensively drug-resistant Mtb (XDR-Mtb). In the past decades, many research groups have explored the drug resistance profiles of Mtb based on sequence data by GWAS, which identified some mutations that were significantly linked with drug resistance, and attempted to explain the resistance mechanisms. However, they mainly focused on several significant mutations in drug targets (e.g. rpoB, katG). Some genes that are potentially associated with drug resistance may be overlooked by the GWAS analysis.

OBJECTIVE

In this article, our motivation is to detect potential drug resistance genes of Mtb using a heat diffusion model.

METHODS

The sequencing data comprised 127 samples of Mtb, covering 34 ethambutol-, 65 isoniazid-, 53 rifampicin- and 45 streptomycin-resistant strains. The raw sequence data were preprocessed using Trimmomatic software and aligned to the Mtb H37Rv reference genome using Bowtie2. From the resulting alignments, SAMtools and VarScan were used to filter sequences and call SNPs. The GWAS was performed with the PLINK package to obtain the significant SNPs, which were mapped to genes. The P-values of genes calculated by GWAS were converted into a heat vector. The heat vector and the Mtb protein-protein interactions (PPI) derived from the STRING database were input into the heat diffusion model to obtain significant subnetworks with HotNet2. Finally, the most significant (P < 0.05) subnetworks associated with different phenotypes were obtained. To verify the change of binding energy between the drug and target before and after mutation, molecular dynamics simulations were performed using the AMBER software.

RESULTS

We identified significant subnetworks in rifampicin-resistant samples. Excitingly, we found rpoB and rpoC, which are drug targets of rifampicin. From the protein structure of rpoB, the mutation location was extremely close to the drug binding site, with a distance of only 3.97 Å. Molecular dynamics simulation revealed that the binding energy of rpoB and rifampicin decreased after D435V mutation. To a large extent, this mutation can influence the affinity of drug-target binding. In addition, topA and pyrG were reported to be linked with drug resistance, and might be new TB drug targets. Other genes that have not yet been reported are worth further study.

CONCLUSION

Using a heat diffusion model in combination with GWAS results and protein-protein interactions, the significantly mutated subnetworks in rifampicin-resistant samples were found. The subnetwork not only contained the known targets of rifampicin (rpoB, rpoC), but also included topA and pyrG, which are potentially associated with drug resistance. Together, these results offer deeper insights into drug resistance of Mtb and provide potential drug targets for finding new antituberculosis drugs.
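
The step in which gene-level P-values become a heat vector that is diffused over a protein-protein interaction network can be sketched as follows. This is a simplified illustration, not the HotNet2 implementation: the conversion heat = -log10(P), the restart fraction, the gene names and the edges are all assumptions made for the example, and the diffusion is a plain iterative random walk with restart.

```python
import math


def heat_from_p_values(p_values):
    """Convert gene-level P-values to heat scores (assumed here: -log10(P))."""
    return {gene: -math.log10(p) for gene, p in p_values.items()}


def diffuse_heat(edges, initial_heat, restart=0.4, iterations=200):
    """Spread heat over an undirected network by random walk with restart.

    At each step a gene is reset to `restart` times its initial heat and
    receives equal shares of the remaining (1 - restart) fraction of the
    current heat of each of its neighbours.
    """
    neighbours = {gene: set() for gene in initial_heat}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    start = {gene: initial_heat.get(gene, 0.0) for gene in neighbours}
    heat = dict(start)
    for _ in range(iterations):
        updated = {gene: restart * start[gene] for gene in neighbours}
        for gene, linked in neighbours.items():
            if not linked:
                # An isolated gene has nowhere to send heat, so it keeps it.
                updated[gene] += (1 - restart) * heat[gene]
                continue
            share = (1 - restart) * heat[gene] / len(linked)
            for other in linked:
                updated[other] += share
        heat = updated
    return heat


# Hypothetical toy inputs; not values from the study.
p_values = {"geneA": 1e-8, "geneB": 1e-6, "geneC": 0.5, "geneD": 0.9}
edges = [("geneA", "geneB"), ("geneA", "geneC"), ("geneC", "geneD")]

heat = heat_from_p_values(p_values)
diffused = diffuse_heat(edges, heat)
for gene in sorted(diffused, key=diffused.get, reverse=True):
    print(f"{gene}: initial {heat[gene]:.2f}, diffused {diffused[gene]:.2f}")
```

In this toy network geneC has a weak P-value of its own but gains heat from its strongly associated neighbour, which is the intuition behind recovering genes that a GWAS alone would overlook.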

Wang R, Wang Z, Wang J, Li S. SpliceFinder: ab initio prediction of splice sites using convolutional neural network. BMC Bioinformatics 2019;20:652. [PMID: 31881982 PMCID: PMC6933889 DOI: 10.1186/s12859-019-3306-3] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/11/2022] Open

Abstract

Background

Identifying splice sites is a necessary step in analyzing the location and structure of genes. Two dinucleotides, GT and AG, are highly frequent at splice sites, and many other patterns with important biological functions also occur at splice sites. Meanwhile, these dinucleotides occur frequently in sequences without splice sites, which makes the prediction prone to generating false positives. Most existing tools select all the sequences with the two dimers and then focus on distinguishing the true splice sites from the pseudo ones. Such an approach decreases false positives; however, it causes non-canonical splice sites to be missed.

Result

We have designed SpliceFinder based on a convolutional neural network (CNN) to predict splice sites. To achieve ab initio prediction, we used human genomic data to train our neural network. An iterative approach is adopted to reconstruct the dataset, which tackles the data imbalance problem and forces the model to learn more features of splice sites. The proposed CNN achieves a classification accuracy of 90.25%, which is 10% higher than existing algorithms. The method outperforms other existing methods in terms of area under the receiver operating characteristic curve (AUC), recall, precision, and F1 score. Furthermore, SpliceFinder can find the exact position of splice sites on long genomic sequences with a sliding window. Compared with other state-of-the-art splice site prediction tools, SpliceFinder produces about half as many false positives while keeping recall higher than 0.8. SpliceFinder also captures non-canonical splice sites. In addition, SpliceFinder performs well on the genomic sequences of Drosophila melanogaster, Mus musculus, Rattus, and Danio rerio without retraining.

Conclusion

Based on a CNN, we have proposed a new ab initio splice site prediction tool, SpliceFinder, which generates fewer false positives and can detect non-canonical splice sites. Additionally, SpliceFinder is transferable to other species without retraining. The source code and additional materials are available at https://gitlab.deepomics.org/wangruohan/SpliceFinder.
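
The sliding-window scan mentioned in this abstract can be sketched in a few lines. This is an illustration under stated assumptions and not the SpliceFinder code: the window length, the threshold, the function names and the example sequence are hypothetical, and `toy_classifier` is only a stand-in for a trained CNN so that the sketch can be executed. The stand-in simply looks for a central GT dinucleotide, which is precisely the canonical-only shortcut a trained model is meant to go beyond.

```python
# One-hot codes for the four bases; any other symbol maps to all zeros.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def one_hot_encode(sequence):
    """Encode a DNA string as a list of 4-element tuples, one per base."""
    return [ONE_HOT.get(base, (0, 0, 0, 0)) for base in sequence.upper()]


def scan_sequence(sequence, classifier, window=8, threshold=0.5):
    """Slide a fixed-length window along the sequence and score each position.

    Returns (centre_position, score) pairs for windows scoring at or above
    the threshold. `classifier` is any callable that maps an encoded window
    to a score between 0 and 1.
    """
    hits = []
    for start in range(len(sequence) - window + 1):
        encoded = one_hot_encode(sequence[start:start + window])
        score = classifier(encoded)
        if score >= threshold:
            hits.append((start + window // 2, score))
    return hits


def toy_classifier(encoded):
    """Stand-in scorer (not a CNN): 1.0 if the window centre starts a GT pair."""
    mid = len(encoded) // 2
    is_gt = encoded[mid] == ONE_HOT["G"] and encoded[mid + 1] == ONE_HOT["T"]
    return 1.0 if is_gt else 0.0


# Hypothetical sequence with GT dinucleotides at positions 6 and 10.
sequence = "AACCAGGTAAGTCC"
hits = scan_sequence(sequence, toy_classifier)
print(hits)
```

Because the scorer is passed in as a callable, a real model's prediction function could replace `toy_classifier` without changing the scanning loop.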
