1. Wang G, Zhang H, Shao M, Feng Y, Cao C, Hu X. DeepTGIN: a novel hybrid multimodal approach using transformers and graph isomorphism networks for protein-ligand binding affinity prediction. J Cheminform 2024; 16:147. [PMID: 39734235 DOI: 10.1186/s13321-024-00938-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/09/2024] [Accepted: 11/25/2024] [Indexed: 12/31/2024] [Open access]
Abstract
Predicting protein-ligand binding affinity is essential for understanding protein-ligand interactions and advancing drug discovery. Recent research has demonstrated the advantages of sequence-based models and graph-based models. In this study, we present a novel hybrid multimodal approach, DeepTGIN, which integrates transformers and graph isomorphism networks to predict protein-ligand binding affinity. DeepTGIN is designed to learn sequence and graph features efficiently. The DeepTGIN model comprises three modules: the data representation module, the encoder module, and the prediction module. The transformer encoder learns sequential features from proteins and protein pockets separately, while the graph isomorphism network extracts graph features from the ligands. To evaluate the performance of DeepTGIN, we compared it with state-of-the-art models using the PDBbind 2016 core set and PDBbind 2013 core set. DeepTGIN outperforms these models in terms of R, RMSE, MAE, SD, and CI metrics. Ablation studies further demonstrate the effectiveness of the ligand features and the encoder module. The code is available at: https://github.com/zhc-moushang/DeepTGIN. SCIENTIFIC CONTRIBUTION: DeepTGIN is a novel hybrid multimodal deep learning model for predicting protein-ligand binding affinity. The model uses the Transformer encoder to extract sequence features from the protein and the protein pocket, and integrates graph isomorphism networks to capture features from the ligand. This model addresses the limitations of existing methods in exploring protein pocket and ligand features.
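The abstract above reports R, RMSE, MAE, SD, and CI without defining them. The following is a minimal sketch of four of these metrics as they are commonly defined for regression benchmarks (Pearson R, RMSE, MAE, and the concordance index). It is not taken from the DeepTGIN repository; the function names are illustrative assumptions, and SD is omitted because its exact definition is not given in the abstract.

```python
import math
from itertools import combinations


def rmse(y_true, y_pred):
    """Root mean square error between measured and predicted affinities."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient (the 'R' metric)."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def concordance_index(y_true, y_pred):
    """Fraction of comparable pairs whose predicted order matches the true order.

    Pairs with equal true values are skipped; tied predictions count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    for (t1, p1), (t2, p2) in combinations(zip(y_true, y_pred), 2):
        if t1 == t2:
            continue
        comparable += 1
        if p1 == p2:
            concordant += 0.5
        elif (p1 - p2) * (t1 - t2) > 0:
            concordant += 1.0
    return concordant / comparable
```

Lower RMSE and MAE and higher R and CI indicate better agreement between predicted and measured affinities.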
Affiliation(s)
- Guishen Wang
  - College of Computer Science and Engineering, Changchun University of Technology, North Yunda Street No. 3000, Changchun, 130012, Jilin, China
  - School of Life Sciences, Jilin University, Qianjin Street No. 2055, Changchun, 130000, Jilin, China
- Hangchen Zhang
  - College of Computer Science and Engineering, Changchun University of Technology, North Yunda Street No. 3000, Changchun, 130012, Jilin, China
- Mengting Shao
  - Key Laboratory for Bio-Electromagnetic Environment and Advanced Medical Theranostics, School of Biomedical Engineering and Informatics, Nanjing Medical University, Longmian Avenue No. 101, Nanjing, 211166, Jiangsu, China
- Yuncong Feng
  - College of Computer Science and Engineering, Changchun University of Technology, North Yunda Street No. 3000, Changchun, 130012, Jilin, China
- Chen Cao
  - Key Laboratory for Bio-Electromagnetic Environment and Advanced Medical Theranostics, School of Biomedical Engineering and Informatics, Nanjing Medical University, Longmian Avenue No. 101, Nanjing, 211166, Jiangsu, China
- Xiaowen Hu
  - School of Biomedical Engineering and Informatics, Nanjing Medical University, Longmian Avenue No. 101, Nanjing, 211166, Jiangsu, China
2. Tahir M, Norouzi M, Khan SS, Davie JR, Yamanaka S, Ashraf A. Artificial intelligence and deep learning algorithms for epigenetic sequence analysis: A review for epigeneticists and AI experts. Comput Biol Med 2024; 183:109302. [PMID: 39500240 DOI: 10.1016/j.compbiomed.2024.109302] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/20/2024] [Revised: 09/22/2024] [Accepted: 10/17/2024] [Indexed: 11/20/2024]
Abstract
Epigenetics encompasses mechanisms that can alter the expression of genes without changing the underlying genetic sequence. The epigenetic regulation of gene expression is initiated and sustained by several mechanisms such as DNA methylation, histone modifications, chromatin conformation, and non-coding RNA. The changes in gene regulation and expression can manifest in the form of various diseases and disorders such as cancer and congenital deformities. Over the last few decades, high-throughput experimental approaches have been used to identify and understand epigenetic changes, but these laboratory experimental approaches and biochemical processes are time-consuming and expensive. To overcome these challenges, machine learning and artificial intelligence (AI) approaches have been extensively used for mapping epigenetic modifications to their phenotypic manifestations. In this paper we provide a narrative review of published research on AI models trained on epigenomic data to address a variety of problems such as prediction of disease markers, gene expression, enhancer-promoter interaction, and chromatin states. The purpose of this review is twofold, as it is addressed to both AI experts and epigeneticists. For AI researchers, we provide a taxonomy of epigenetics research problems that can benefit from an AI-based approach. For epigeneticists, we provide a list of candidate AI solutions from the literature for each of these problems. We also identify several gaps in the literature, research challenges, and recommendations to address these challenges.
Affiliation(s)
- Muhammad Tahir
  - Department of Electrical and Computer Engineering, University of Manitoba, Winnipeg, R3T 5V6, MB, Canada
- Mahboobeh Norouzi
  - Department of Electrical and Computer Engineering, University of Manitoba, Winnipeg, R3T 5V6, MB, Canada
- Shehroz S Khan
  - College of Engineering and Technology, American University of the Middle East, Kuwait
- James R Davie
  - Department of Biochemistry and Medical Genetics, Max Rady College of Medicine, Rady Faculty of Health Sciences, University of Manitoba, Winnipeg, MB, Canada
- Soichiro Yamanaka
  - Graduate School of Science, Department of Biophysics and Biochemistry, University of Tokyo, Japan
- Ahmed Ashraf
  - Department of Electrical and Computer Engineering, University of Manitoba, Winnipeg, R3T 5V6, MB, Canada
3. Hou A, Luo H, Liu H, Luo L, Ding P. Multi-scale DNA language model improves 6 mA binding sites prediction. Comput Biol Chem 2024; 112:108129. [PMID: 39067351 DOI: 10.1016/j.compbiolchem.2024.108129] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/04/2024] [Revised: 06/05/2024] [Accepted: 06/10/2024] [Indexed: 07/30/2024]
Abstract
DNA methylation at the N6 position of adenine (N6-methyladenine, 6 mA), which refers to the attachment of a methyl group to the N6 site of the adenine (A) of DNA, is an important epigenetic modification in prokaryotic and eukaryotic genomes. Accurately predicting 6 mA binding sites can provide crucial insights into gene regulation, DNA repair, disease development, and related processes. Wet-lab experiments are commonly used for analyzing 6 mA binding sites, but they are costly and time-consuming. Therefore, various deep learning methods have recently been used to predict 6 mA binding sites. In this study, we develop a framework based on a multi-scale DNA language model, named "iDNA6mA-MDL". "iDNA6mA-MDL" integrates multiple k-mers and the nucleotide property and frequency method for feature embedding, which can capture a full range of DNA sequence context information. At the prediction stage, it also leverages DNABERT to compensate for the incomplete capture of global DNA information. Experiments show that our framework obtains an average AUC of 0.981 on a classic 6 mA rice gene dataset, surpassing all existing advanced models under fivefold cross-validation. Moreover, "iDNA6mA-MDL" outperforms most of the popular state-of-the-art methods on 11 other 6 mA datasets, demonstrating its effectiveness in 6 mA binding site prediction.
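To make the "multiple k-mers" idea concrete, here is a minimal sketch of multi-scale k-mer tokenization of a DNA sequence. It only illustrates the general technique and is not the iDNA6mA-MDL implementation; the function names and the default k values (3 to 6) are assumptions, since the abstract does not state which scales are used.

```python
def kmers(seq, k):
    """Return all overlapping k-mers of seq (sliding window, stride 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def multi_scale_kmers(seq, ks=(3, 4, 5, 6)):
    """Tokenize seq at several k-mer scales (the k values are illustrative)."""
    return {k: kmers(seq, k) for k in ks}


# Example: each scale yields a different token sequence for the same DNA.
tokens = multi_scale_kmers("ACGTAC", ks=(3, 4))
```

Short k-mers capture local composition while longer ones capture more context, which is the usual motivation for combining several scales before embedding.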
Affiliation(s)
- Anlin Hou
  - School of Computer Science, University of South China, Hengyang 421001, China
- Hanyu Luo
  - School of Computer Science, University of South China, Hengyang 421001, China
- Huan Liu
  - School of Computer Science, University of South China, Hengyang 421001, China
- Lingyun Luo
  - School of Computer Science, University of South China, Hengyang 421001, China
- Pingjian Ding
  - School of Computer Science, University of South China, Hengyang 421001, China
4. Sun S, Ren Y, Zhou Y, Guo F, Choi J, Cui M, Khim J. Prediction of micropollutant degradation kinetic constant by ultrasonic using machine learning. Chemosphere 2024; 363:142701. [PMID: 38925516 DOI: 10.1016/j.chemosphere.2024.142701] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/24/2024] [Revised: 06/20/2024] [Accepted: 06/23/2024] [Indexed: 06/28/2024]
Abstract
A prediction model based on XGBoost is proposed for the kinetic constants of ultrasonic micropollutant degradation. After parameter optimization through iteration, the model achieves an R² of 0.99 and a SMAPE of 2.06%. The impact of design parameters on predicting kinetic constants for ultrasound degradation of trace pollutants was assessed using Shapley additive explanations (SHAP). Results indicate that power density and frequency significantly impact the predictive performance. The database was sorted based on power density and frequency values. Subsequently, the 800 raw data points were split into small databases of 200 each. After confirming that reducing the database size does not affect prediction accuracy, ultrasound degradation experiments were conducted for five pollutants, yielding experimental data. A small database with experimental conditions within the numerical range was selected. Data meeting both feature conditions were filtered, resulting in an optimized 60-data group. After incorporating experimental data, a model was trained for prediction. Degradation kinetic constants for experiments (kE) were compared with predicted constants (for the 800 data-based model: kP-800, and for the 60 data-based model: kP-60). Results showed that ibuprofen, bisphenol A, carbamazepine, and 17β-estradiol performed better on the 60-data group (kP-60/kE: 1.00, 0.99, 1.00, 1.00), while caffeine suited the model trained on the 800-data group (kP-800/kE: 1.02).
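The two evaluation metrics named in the abstract can be sketched as follows. SMAPE has several definitions in the literature; the version below (absolute error divided by the mean of the absolute true and predicted values, averaged and expressed in percent) is one common choice and is an assumption here, since the abstract does not say which form the authors used. The function names are illustrative.

```python
def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in percent (one common form)."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        denom = (abs(t) + abs(p)) / 2.0
        # Treat a pair of exact zeros as a perfect prediction.
        total += 0.0 if denom == 0 else abs(p - t) / denom
    return 100.0 * total / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

Under this definition, a ratio such as kP-60/kE close to 1.00 corresponds to a near-zero contribution of that pollutant to SMAPE.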
Affiliation(s)
- Shiyu Sun
  - School of Civil, Environmental, and Architectural Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841, Republic of Korea
- Yangmin Ren
  - School of Civil, Environmental, and Architectural Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841, Republic of Korea
- Yongyue Zhou
  - School of Civil, Environmental, and Architectural Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841, Republic of Korea
- Fengshi Guo
  - School of Civil, Environmental, and Architectural Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841, Republic of Korea
- Jongbok Choi
  - Department of Environmental Engineering, Kumoh National Institute of Technology, Gumi, 39177, Republic of Korea
- Mingcan Cui
  - School of Civil, Environmental, and Architectural Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841, Republic of Korea
- Jeehyeong Khim
  - School of Civil, Environmental, and Architectural Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841, Republic of Korea
5. Lilhore UK, Simiaya S, Alhussein M, Faujdar N, Dalal S, Aurangzeb K. Optimizing protein sequence classification: integrating deep learning models with Bayesian optimization for enhanced biological analysis. BMC Med Inform Decis Mak 2024; 24:236. [PMID: 39192227 DOI: 10.1186/s12911-024-02631-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/04/2024] [Accepted: 08/07/2024] [Indexed: 08/29/2024] [Open access]
Abstract
Efforts to enhance the accuracy of protein sequence classification are of utmost importance in driving forward biological analyses and facilitating significant medical advancements. This study presents a cutting-edge model called ProtICNN-BiLSTM, which combines attention-based Improved Convolutional Neural Networks (ICNN) and Bidirectional Long Short-Term Memory (BiLSTM) units seamlessly. Our main goal is to improve the accuracy of protein sequence classification by carefully optimizing performance through Bayesian optimization. ProtICNN-BiLSTM combines the power of CNN and BiLSTM architectures to effectively capture local and global protein sequence dependencies. In the proposed model, the ICNN component uses convolutional operations to identify local patterns, while the BiLSTM component captures long-range associations by analyzing sequence data forward and backward. Bayesian optimization tunes the model hyperparameters for efficiency and robustness. The model was extensively validated on PDB-14,189 and other protein data. We found that ProtICNN-BiLSTM outperforms traditional categorization models. Bayesian optimization's fine-tuning and the seamless integration of local and global sequence information make it effective. The precision of ProtICNN-BiLSTM improves comparative protein sequence categorization. The study improves computational bioinformatics for complex biological analysis. Good results from the ProtICNN-BiLSTM model improve protein sequence categorization. This powerful tool could improve medical and biological research. The breakthrough protein sequence classification model is ProtICNN-BiLSTM. Bayesian optimization, ICNN, and BiLSTM analyze biological data accurately.
Affiliation(s)
- Umesh Kumar Lilhore
  - School of Computing Science and Engineering, Galgotias University, Greater Noida, UP, India
- Sarita Simiaya
  - School of Computing Science and Engineering, Galgotias University, Greater Noida, UP, India
- Musaed Alhussein
  - Department of Computer Engineering, College of Computer and Information Sciences, King Saud University, P. O. Box 51178, Riyadh, 11543, Saudi Arabia
- Neetu Faujdar
  - Department of Computer Engineering and Applications, GLA University, 281406, UP, Mathura, India
- Khursheed Aurangzeb
  - Department of Computer Engineering, College of Computer and Information Sciences, King Saud University, P. O. Box 51178, Riyadh, 11543, Saudi Arabia
6. Akay A, Reddy HN, Galloway R, Kozyra J, Jackson AW. Predicting DNA toehold-mediated strand displacement rate constants using a DNA-BERT transformer deep learning model. Heliyon 2024; 10:e28443. [PMID: 38560216 PMCID: PMC10981123 DOI: 10.1016/j.heliyon.2024.e28443] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/10/2023] [Revised: 03/15/2024] [Accepted: 03/19/2024] [Indexed: 04/04/2024] [Open access]
Abstract
Dynamic DNA nanotechnology is driving exciting developments in molecular computing, cargo delivery, sensing and detection. Combining this innovative area of research with the progress made in machine learning will aid in the design of sophisticated DNA machinery. Herein, we present a novel framework based on a transformer architecture and a deep learning model which can predict the rate constant of toehold-mediated strand displacement, the underlying process in dynamic DNA nanotechnology. Initially, a dataset of 4450 DNA sequences and corresponding rate constants was generated in silico using KinDA. Subsequently, a 1D convolutional neural network was trained using specific local features and DNA-BERT sequence embeddings to produce predicted rate constants. As a result, the newly trained deep learning model predicted toehold-mediated strand displacement rate constants with a root mean square error of 0.76 during testing. These findings demonstrate that DNA-BERT can improve prediction accuracy, negating the need for extensive computational simulations or experimentation. Finally, the impact of various local features during model training is discussed, and a detailed comparison between the one-hot encoding and DNA-BERT sequence representation methods is presented.
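For readers unfamiliar with the baseline representation that the abstract compares against DNA-BERT, the following is a minimal sketch of one-hot encoding for DNA. It is a generic illustration, not the authors' code; the column order A, C, G, T and the all-zero row for unknown bases are assumptions.

```python
ALPHABET = "ACGT"  # assumed column order for the one-hot vectors


def one_hot(seq):
    """Encode a DNA string as a list of 4-element 0/1 rows, one row per base.

    Bases outside A/C/G/T (for example N) become an all-zero row.
    """
    index = {base: i for i, base in enumerate(ALPHABET)}
    rows = []
    for base in seq.upper():
        row = [0] * len(ALPHABET)
        if base in index:
            row[index[base]] = 1
        rows.append(row)
    return rows
```

A one-hot matrix treats every position independently, whereas a learned embedding such as DNA-BERT assigns each token a context-dependent vector, which is the contrast the abstract draws.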
Affiliation(s)
- Ali Akay
  - Nanovery Limited, United Kingdom
  - Universita Degli Studi di Trento, Italy
7. Wang H, Huang T, Wang D, Zeng W, Sun Y, Zhang L. MSCAN: multi-scale self- and cross-attention network for RNA methylation site prediction. BMC Bioinformatics 2024; 25:32. [PMID: 38233745 PMCID: PMC10795237 DOI: 10.1186/s12859-024-05649-1] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 04/24/2023] [Accepted: 01/11/2024] [Indexed: 01/19/2024] [Open access]
Abstract
BACKGROUND: Epi-transcriptome regulation through post-transcriptional RNA modifications is essential for all RNA types. Precise recognition of RNA modifications is critical for understanding their functions and regulatory mechanisms. However, wet-lab experimental methods are often costly and time-consuming, limiting their range of applications. Therefore, recent research has focused on developing computational methods, particularly deep learning (DL). Bidirectional long short-term memory (BiLSTM), convolutional neural network (CNN), and transformer models have all achieved good results in modification site prediction. However, BiLSTM cannot be computed in parallel, leading to long training times; CNN cannot learn long-distance dependencies in the sequence; and the Transformer lacks information interaction between sequences at different scales. These limitations underscore the need for continued research in natural language processing (NLP) and DL to devise an enhanced prediction framework that addresses them.
RESULTS: This study presents a multi-scale self- and cross-attention network (MSCAN) to identify RNA methylation sites using NLP and DL techniques. Experimental results on twelve RNA modification sites (m6A, m1A, m5C, m5U, m6Am, m7G, Ψ, I, Am, Cm, Gm, and Um) show that MSCAN obtains areas under the receiver operating characteristic curve of 98.34%, 85.41%, 97.29%, 96.74%, 99.04%, 79.94%, 76.22%, 65.69%, 92.92%, 92.03%, 95.77%, and 89.66%, respectively, which is better than the state-of-the-art prediction model. This indicates that the model has strong generalization capabilities. Furthermore, MSCAN reveals a strong association among different types of RNA modifications from an experimental perspective. A user-friendly web server for predicting twelve widely occurring human RNA modification sites (m6A, m1A, m5C, m5U, m6Am, m7G, Ψ, I, Am, Cm, Gm, and Um) is available at http://47.242.23.141/MSCAN/index.php.
CONCLUSIONS: A predictor framework has been developed through binary classification to predict RNA methylation sites.
Affiliation(s)
- Honglei Wang
  - School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, 221116, China
  - School of Information Engineering, Xuzhou College of Industrial Technology, Xuzhou, 221400, China
- Tao Huang
  - School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, 221116, China
- Dong Wang
  - School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, 221116, China
- Wenliang Zeng
  - School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, 221116, China
- Yanjing Sun
  - School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, 221116, China
- Lin Zhang
  - School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, 221116, China
8. Gao H, Gao P, Ye N. A method for evaluating of RNA's coding potential using the interaction effects of open reading frames and high-energy scalograms. Comput Biol Med 2024; 168:107752. [PMID: 38007977 DOI: 10.1016/j.compbiomed.2023.107752] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2023] [Revised: 10/19/2023] [Accepted: 11/20/2023] [Indexed: 11/28/2023]
Abstract
The identification and function determination of long non-coding RNAs (lncRNAs) can help to better understand transcriptional regulation in both normal development and disease pathology, thereby demanding methods to distinguish them from protein-coding RNAs (pcRNAs) after obtaining sequencing data. Many algorithms based on the statistical, structural, physical, and chemical properties of the sequences have been developed for evaluating the coding potential of RNA to distinguish them. To obtain common features that do not rely on hyperparameter tuning and optimization and that can be evaluated accurately, we designed a series of features from the effects of open reading frames (ORFs) on their mutual interactions and with the electrical intensity of sequence sites, to further improve the screening accuracy. The single model constructed from our designed features meets the strong classifier criteria, with accuracy between 82% and 89%, and the prediction accuracy of the model constructed after combining the auxiliary features equals or exceeds that of some of the best classification tools. Moreover, our method does not require special hyperparameter tuning and is species-insensitive compared to other methods, which means it can be easily applied to a wide range of species. We also find some correlations between the features, which provide a reference for follow-up studies.
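Since the features described above are built on open reading frames, a minimal sketch of ORF extraction may help. It scans the three forward reading frames for the longest ATG-to-stop stretch. This is a generic illustration and not the authors' feature pipeline; the function name, the restriction to the forward strand, and the standard stop codon set are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code stop codons


def longest_orf(seq):
    """Return the longest ORF (ATG ... stop, inclusive) in the three forward frames.

    RNA input is accepted (U is mapped to T). Returns "" if no complete ORF exists.
    """
    seq = seq.upper().replace("U", "T")
    best = ""
    for frame in range(3):
        start = None
        # Step through complete codons only in this reading frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                orf = seq[start:i + 3]
                if len(orf) > len(best):
                    best = orf
                start = None
    return best
```

ORF length and coverage are classic coding-potential signals: protein-coding transcripts tend to contain a long ORF, whereas lncRNAs usually contain only short ones.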
Affiliation(s)
- Hua Gao
  - College of Forestry, Nanjing Forestry University, Longpan, Nanjing, 210037, Jiangsu, China
  - College of Information Science and Technology, Nanjing Forestry University, Longpan, Nanjing, 210037, Jiangsu, China
- Peng Gao
  - The First Affiliated Hospital of Xi'an Jiaotong University, 277 West Yanta Road, Xi'an, 710061, Shaanxi, China
- Ning Ye
  - College of Forestry, Nanjing Forestry University, Longpan, Nanjing, 210037, Jiangsu, China
  - College of Information Science and Technology, Nanjing Forestry University, Longpan, Nanjing, 210037, Jiangsu, China
9. Sun Y, Zhao Z, Tong H, Sun B, Liu Y, Ren N, You S. Machine Learning Models for Inverse Design of the Electrochemical Oxidation Process for Water Purification. Environmental Science & Technology 2023; 57:17990-18000. [PMID: 37189261 DOI: 10.1021/acs.est.2c08771] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 05/17/2023]
Abstract
In this study, a machine learning (ML) framework is developed toward target-oriented inverse design of the electrochemical oxidation (EO) process for water purification. The XGBoost model exhibited the best performance for prediction of reaction rate (k) based on training the data set relevant to pollutant characteristics and reaction conditions, indicated by R²ext of 0.84 and RMSEext of 0.79. Based on 315 data points collected from the literature, the current density, pollutant concentration, and gap energy (Egap) were identified to be the most impactful parameters available for the inverse design of the EO process. In particular, adding reaction conditions as model input features provided more information and increased the sample size of the data set, which improved the model accuracy. Feature importance analysis was performed to reveal the data pattern and interpret the features by using Shapley additive explanations (SHAP). The ML-based inverse design for the EO process was generalized to a random case for tailoring the optimum conditions with phenol and 2,4-dichlorophenol (2,4-DCP) serving as model pollutants. In experimental verification, the predicted k values were close to the experimental k values, with a relative error lower than 5%. This study provides a paradigm shift from conventional trial-and-error mode to data-driven mode for advancing research and development of the EO process by a time-saving, labor-effective, and environmentally friendly target-oriented strategy, which makes electrochemical water purification more efficient, more economic, and more sustainable in the context of global carbon peaking and carbon neutrality.
Affiliation(s)
- Ye Sun
  - State Key Laboratory of Urban Water Resource and Environment, School of Environment, Harbin Institute of Technology, Harbin 150090, P. R. China
- Zhiyuan Zhao
  - State Key Laboratory of Urban Water Resource and Environment, School of Environment, Harbin Institute of Technology, Harbin 150090, P. R. China
- Hailong Tong
  - State Key Laboratory of Urban Water Resource and Environment, School of Environment, Harbin Institute of Technology, Harbin 150090, P. R. China
  - State Key Laboratory of Veterinary Biotechnology, Harbin Veterinary Research Institute, Chinese Academy of Agricultural Sciences, Harbin 150069, P. R. China
- Baiming Sun
  - State Key Laboratory of Veterinary Biotechnology, Harbin Veterinary Research Institute, Chinese Academy of Agricultural Sciences, Harbin 150069, P. R. China
- Yanbiao Liu
  - College of Environmental Science and Engineering, Textile Pollution Controlling Engineering Center of the Ministry of Ecology and Environment, Donghua University, Shanghai 201620, China
- Nanqi Ren
  - State Key Laboratory of Urban Water Resource and Environment, School of Environment, Harbin Institute of Technology, Harbin 150090, P. R. China
- Shijie You
  - State Key Laboratory of Urban Water Resource and Environment, School of Environment, Harbin Institute of Technology, Harbin 150090, P. R. China
10. Soylu NN, Sefer E. BERT2OME: Prediction of 2'-O-Methylation Modifications From RNA Sequence by Transformer Architecture Based on BERT. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:2177-2189. [PMID: 37819796 DOI: 10.1109/tcbb.2023.3237769] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 10/13/2023]
Abstract
Recent work on language models has resulted in state-of-the-art performance on various language tasks. Among these, Bidirectional Encoder Representations from Transformers (BERT) has focused on contextualizing word embeddings to extract context and semantics of the words. On the other hand, post-transcriptional 2'-O-methylation (Nm) RNA modification is important in various cellular tasks and related to a number of diseases. The existing high-throughput experimental techniques are time-consuming in detecting these modifications and costly in exploring these functional processes. Here, to understand the associated biological processes more deeply and quickly, we propose an efficient method, Bert2Ome, to infer 2'-O-methylation RNA modification sites from RNA sequences. Bert2Ome combines a BERT-based model with convolutional neural networks (CNN) to infer the relationship between the modification sites and RNA sequence content. Unlike the methods proposed so far, Bert2Ome treats each given RNA sequence as text and focuses on improving the modification prediction performance by integrating the pretrained deep learning-based language model BERT. Additionally, our transformer-based approach could infer modification sites across multiple species. According to 5-fold cross-validation, human and mouse accuracies were 99.15% and 94.35%, respectively. Similarly, ROC AUC scores were 0.99 and 0.94 for the same species. Detailed results show that Bert2Ome reduces the time consumed in biological experiments and outperforms the existing approaches across different datasets and species over multiple metrics. Additionally, deep learning approaches such as 2D CNNs are more promising in learning BERT attributes than more conventional machine learning methods.
11. Fan Y, Sun G, Pan X. ELMo4m6A: A Contextual Language Embedding-Based Predictor for Detecting RNA N6-Methyladenosine Sites. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:944-954. [PMID: 35536814 DOI: 10.1109/tcbb.2022.3173323] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 05/04/2023]
Abstract
N6-methyladenosine (m6A) is a universal post-transcriptional modification of RNAs, and it is widely involved in various biological processes. Identifying m6A modification sites accurately is indispensable to further investigate m6A-mediated biological functions. How to better represent RNA sequences is crucial for building effective computational methods for detecting m6A modification sites. However, traditional encoding methods require complex biological prior knowledge and are time-consuming. Furthermore, most of the existing m6A site prediction methods are limited to a single species, and few methods are able to predict m6A sites across different species and tissues. Thus, it is necessary to design a more efficient computational method to predict m6A sites across multiple species and tissues. In this paper, we propose ELMo4m6A, a contextual language embedding-based method for predicting m6A sites from RNA sequences without any prior knowledge. ELMo4m6A first learns embeddings of RNA sequences using the language model ELMo, then uses a hybrid convolutional neural network (CNN) and long short-term memory (LSTM) to identify m6A sites. The results of 5-fold cross-validation and independent testing demonstrate that ELMo4m6A is superior to state-of-the-art methods. Moreover, we applied integrated gradients to find potential sequence patterns contributing to m6A sites.
12. Fan XQ, Lin B, Hu J, Guo ZY. I-DNAN6mA: Accurate Identification of DNA N6-Methyladenine Sites Using the Base-Pairing Map and Deep Learning. J Chem Inf Model 2023; 63:1076-1086. [PMID: 36722621 DOI: 10.1021/acs.jcim.2c01465] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/02/2023]
Abstract
The recent discovery of numerous DNA N6-methyladenine (6mA) sites has transformed our perception about the roles of 6mA in living organisms. However, our ability to understand them is hampered by our inability to identify 6mA sites rapidly and cost-efficiently by existing experimental methods. Developing a novel method to quickly and accurately identify 6mA sites is critical for speeding up the progress of its function detection and understanding. In this study, we propose a novel computational method, called I-DNAN6mA, to identify 6mA sites and complement experimental methods well, by leveraging the base-pairing rules and a well-designed three-stage deep learning model with pairwise inputs. The performance of our proposed method is benchmarked and evaluated on four species, i.e., Arabidopsis thaliana, Drosophila melanogaster, Rice, and Rosaceae. The experimental results demonstrate that I-DNAN6mA achieves area under the receiver operating characteristic curve values of 0.967, 0.963, 0.947, 0.976, and 0.990, accuracies of 91.5, 92.7, 88.2, 93.8, and 96.2%, and Matthews correlation coefficient values of 0.855, 0.831, 0.763, 0.877, and 0.924 on five benchmark data sets, respectively, and outperforms several existing state-of-the-art methods. To our knowledge, I-DNAN6mA is the first approach to identify 6mA sites using a novel image-like representation of DNA sequences and a deep learning model with pairwise inputs. I-DNAN6mA is expected to be useful for locating functional regions of DNA.
Affiliation(s)
- Xue-Qiang Fan
  - School of Computer and Information, Hefei University of Technology, Hefei 230009, China
- Bing Lin
  - School of Computer and Information, Hefei University of Technology, Hefei 230009, China
- Jun Hu
  - College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
- Zhong-Yi Guo
  - School of Computer and Information, Hefei University of Technology, Hefei 230009, China
13
Ao C, Jiao S, Wang Y, Yu L, Zou Q. Biological Sequence Classification: A Review on Data and General Methods. Research (Washington, D.C.) 2022; 2022:0011. [PMID: 39285948 PMCID: PMC11404319 DOI: 10.34133/research.0011] [Citation(s) in RCA: 32] [Impact Index Per Article: 10.7] [Received: 07/25/2022] [Accepted: 10/25/2022] [Indexed: 09/19/2024]
Abstract
With the rapid development of biotechnology, the number of biological sequences has grown exponentially. The continuous expansion of biological sequence data promotes the application of machine learning in biological sequences to construct predictive models for mining biological sequence information. There are many branches of biological sequence classification research. In this review, we mainly focus on the function and modification classification of biological sequences based on machine learning. Sequence-based prediction and analysis are the basic tasks to understand the biological functions of DNA, RNA, proteins, and peptides. However, there are hundreds of classification models developed for biological sequences, and the quite varied specific methods seem dizzying at first glance. Here, we aim to establish a long-term support website (http://lab.malab.cn/~acy/BioseqData/home.html), which provides readers with detailed information on the classification method and download links to relevant datasets. We briefly introduce the steps to build an effective model framework for biological sequence data. In addition, a brief introduction to single-cell sequencing data analysis methods and applications in biology is also included. Finally, we discuss the current challenges and future perspectives of biological sequence classification research.
Affiliation(s)
- Chunyan Ao
  - School of Computer Science and Technology, Xidian University, Xi'an, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Shihu Jiao
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
- Yansu Wang
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Liang Yu
  - School of Computer Science and Technology, Xidian University, Xi'an, China
- Quan Zou
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
14
Identify Bitter Peptides by Using Deep Representation Learning Features. Int J Mol Sci 2022; 23:ijms23147877. [PMID: 35887225 PMCID: PMC9315524 DOI: 10.3390/ijms23147877] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 05/31/2022] [Revised: 07/01/2022] [Accepted: 07/14/2022] [Indexed: 02/04/2023]
Abstract
A bitter taste often identifies hazardous compounds and it is generally avoided by most animals and humans. Bitterness of hydrolyzed proteins is caused by the presence of bitter peptides. To improve palatability, bitter peptides need to be identified experimentally in a time-consuming and expensive process, before they can be removed or degraded. Here, we report the development of a machine learning prediction method, iBitter-DRLF, which is based on a deep learning pre-trained neural network feature extraction method. It uses three sequence embedding techniques, soft symmetric alignment (SSA), unified representation (UniRep), and bidirectional long short-term memory (BiLSTM). These were initially combined into various machine learning algorithms to build several models. After optimization, the combined features of UniRep and BiLSTM were finally selected, and the model was built in combination with a light gradient boosting machine (LGBM). The results showed that the use of deep representation learning greatly improves the ability of the model to identify bitter peptides, achieving accurate prediction based on peptide sequence data alone. By helping to identify bitter peptides, iBitter-DRLF can help research into improving the palatability of peptide therapeutics and dietary supplements in the future. A webserver is available, too.
15
An Effective Deep Learning-Based Architecture for Prediction of N7-Methylguanosine Sites in Health Systems. Electronics 2022. [DOI: 10.3390/electronics11121917] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/05/2023]
Abstract
N7-methylguanosine (m7G) is one of the most important epigenetic modifications found in rRNA, mRNA, and tRNA, and plays an important role in gene expression regulation. Owing to its significance, traditional laboratory-based techniques have been used to identify m7G sites. However, these approaches were found to be time-consuming and costly. To move on from these traditional approaches and predict m7G sites with high precision, artificial intelligence has been adopted. In this study, an intelligent computational model called N7-methylguanosine-Long short-term memory (m7G-LSTM) is introduced for the prediction of m7G sites. One-hot encoding and word2vec feature schemes are used to represent the biological sequences, while LSTM and CNN algorithms are employed for classification. The proposed m7G-LSTM model obtained an accuracy of 95.95%, a specificity of 95.94%, a sensitivity of 95.97%, and a Matthews correlation coefficient (MCC) of 0.919. The proposed model achieved significantly better outcomes than previous models on all evaluation parameters. The m7G-LSTM computational system aims to support the drug industry and to help bioinformatics researchers predict the behavior of m7G sites.
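The one-hot scheme mentioned in this abstract can be illustrated with a minimal sketch. The alphabet order (A, C, G, U) and the all-zero row for unknown symbols are assumptions made here for illustration; the abstract does not specify how m7G-LSTM handles either.

```python
# Illustrative one-hot encoder for RNA sequences.
# Assumption: alphabet order A, C, G, U; unknown symbols become all-zero rows.
ALPHABET = "ACGU"


def one_hot(seq):
    """Return one 4-element row per nucleotide in `seq`."""
    rows = []
    for base in seq.upper():
        row = [0] * len(ALPHABET)
        idx = ALPHABET.find(base)  # -1 when the symbol is not in the alphabet
        if idx >= 0:
            row[idx] = 1
        rows.append(row)
    return rows
```

Each sequence of length n becomes an n x 4 matrix, which is the kind of input a CNN or LSTM layer can consume directly.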
16
Chen D, Li Y. PredMHC: An Effective Predictor of Major Histocompatibility Complex Using Mixed Features. Front Genet 2022; 13:875112. [PMID: 35547252 PMCID: PMC9081368 DOI: 10.3389/fgene.2022.875112] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/13/2022] [Accepted: 03/07/2022] [Indexed: 12/03/2022]
Abstract
The major histocompatibility complex (MHC) is a large locus on vertebrate DNA that contains a tightly linked set of polymorphic genes encoding cell surface proteins essential for the adaptive immune system. The groups of proteins encoded in the MHC play an important role in the adaptive immune system. Therefore, the accurate identification of the MHC is necessary to understand its role in the adaptive immune system. An effective predictor called PredMHC is established in this study to identify the MHC from protein sequences. Firstly, PredMHC encoded a protein sequence with mixed features including 188D, APAAC, KSCTriad, CKSAAGP, and PAAC. Secondly, three classifiers including SGD, SMO, and random forest were trained on the mixed features of the protein sequence. Finally, the prediction result was obtained by the voting of the three classifiers. The experimental results of the 10-fold cross-validation test in the training dataset showed that PredMHC can obtain 91.69% accuracy. Experimental results on comparison with other features, classifiers, and existing methods showed the effectiveness of PredMHC in predicting the MHC.
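The final voting step can be sketched as follows. This assumes hard majority voting over the predicted labels; the abstract says only that the result is obtained "by the voting of the three classifiers", so the exact rule is an assumption, and the function name is hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists into one label per sample.

    `predictions` holds one list of labels per classifier, all of equal length.
    Assumption: hard majority voting; with three binary classifiers no ties occur.
    """
    voted = []
    for labels in zip(*predictions):  # labels assigned to one sample
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted
```

For example, three classifiers predicting `[1, 0, 0]`, `[1, 1, 0]`, and `[0, 0, 1]` over three samples yield a combined prediction of `[1, 0, 0]`.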
Affiliation(s)
- Dong Chen
  - College of Electrical and Information Engineering, Quzhou University, Quzhou, China
- Yanjuan Li
  - College of Electrical and Information Engineering, Quzhou University, Quzhou, China
17
Soares IM, Camargo FHF, Marques A, Crook OM. Improving lab-of-origin prediction of genetically engineered plasmids via deep metric learning. Nature Computational Science 2022; 2:253-264. [PMID: 38177551 DOI: 10.1038/s43588-022-00234-z] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 11/11/2021] [Accepted: 03/22/2022] [Indexed: 01/06/2024]
Abstract
Genome engineering is undergoing unprecedented development and is now becoming widely available. Genetic engineering attribution can make sequence-lab associations and assist forensic experts in ensuring responsible biotechnology innovation and reducing misuse of engineered DNA sequences. Here we propose a method based on metric learning to rank the most likely labs of origin while simultaneously generating embeddings for plasmid sequences and labs. These embeddings can be used to perform various downstream tasks, such as clustering DNA sequences and labs, as well as using them as features in machine learning models. Our approach employs a circular shift augmentation method and can correctly rank the lab of origin 90% of the time within its top-10 predictions. We also demonstrate that we can perform few-shot learning and obtain 76% top-10 accuracy using only 10% of the sequences. Finally, our approach can also extract key signatures in plasmid sequences for particular labs, allowing for an interpretable examination of the model's outputs.
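Circular shift augmentation exploits the fact that a plasmid is circular, so any rotation of its sequence describes the same molecule. The sketch below is an assumed minimal form of that idea; the function names and the uniform choice of offsets are illustrative and are not taken from the paper.

```python
import random


def circular_shift(seq, offset):
    """Rotate a circular sequence so that it starts at position `offset`."""
    if not seq:
        return seq
    offset %= len(seq)
    return seq[offset:] + seq[:offset]


def augment(seq, n_copies, rng=None):
    """Return `n_copies` random rotations of `seq` (assumption: uniform offsets)."""
    if not seq:
        return [seq] * n_copies
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible
    return [circular_shift(seq, rng.randrange(len(seq))) for _ in range(n_copies)]
```

Every augmented copy keeps the same composition and cyclic order as the original, so the lab label remains valid for each rotation.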
Affiliation(s)
- Oliver M Crook
  - Oxford Protein Informatics Group, University of Oxford, Oxford, UK
18
Li H, Pang Y, Liu B, Yu L. MoRF-FUNCpred: Molecular Recognition Feature Function Prediction Based on Multi-Label Learning and Ensemble Learning. Front Pharmacol 2022; 13:856417. [PMID: 35350759 PMCID: PMC8957949 DOI: 10.3389/fphar.2022.856417] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 01/17/2022] [Accepted: 02/14/2022] [Indexed: 01/13/2023]
Abstract
Intrinsically disordered regions (IDRs) without stable structure are important for protein structures and functions. Some IDRs undergo a transition from disordered to ordered upon binding to molecular fragments; these are called molecular recognition features (MoRFs). There are five main functions of MoRFs: molecular recognition assembler (MoR_assembler), molecular recognition chaperone (MoR_chaperone), molecular recognition display sites (MoR_display_sites), molecular recognition effector (MoR_effector), and molecular recognition scavenger (MoR_scavenger). Research on the functions of molecular recognition features is important for pharmaceutical research and for the study of disease pathogenesis. However, the existing computational methods can only predict the MoRFs in proteins, failing to distinguish their different functions. In this paper, we treat MoRF function prediction as a multi-label learning task and solve it with the Binary Relevance (BR) strategy. We then use Support Vector Machine (SVM), Logistic Regression (LR), Decision Tree (DT), and Random Forest (RF) as basic models to construct MoRF-FUNCpred through ensemble learning. Experimental results show that MoRF-FUNCpred performs well for MoRF function prediction. To the best of our knowledge, MoRF-FUNCpred is the first predictor of the functions of MoRFs. Availability and Implementation: The standalone package of MoRF-FUNCpred can be accessed from https://github.com/LiangYu-Xidian/MoRF-FUNCpred.
Affiliation(s)
- Haozheng Li
  - School of Computer Science and Technology, Xidian University, Xi'an, China
- Yihe Pang
  - School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
- Bin Liu
  - School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
  - Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, China
- Liang Yu
  - School of Computer Science and Technology, Xidian University, Xi'an, China
19
Liu M, Sun ZL, Zeng Z, Lam KM. MGF6mARice: prediction of DNA N6-methyladenine sites in rice by exploiting molecular graph feature and residual block. Brief Bioinform 2022; 23:6553606. [PMID: 35325050 DOI: 10.1093/bib/bbac082] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 11/22/2021] [Revised: 02/13/2022] [Accepted: 02/16/2022] [Indexed: 11/12/2022]
Abstract
DNA N6-methyladenine (6mA) is produced by the N6 position of the adenine being methylated, which occurs at the molecular level, and is involved in numerous vital biological processes in the rice genome. Given the shortcomings of biological experiments, researchers have developed many computational methods to predict 6mA sites and achieved good performance. However, the existing methods do not consider the occurrence mechanism of 6mA to extract features from the molecular structure. In this paper, a novel deep learning method is proposed by devising DNA molecular graph feature and residual block structure for 6mA sites prediction in rice, named MGF6mARice. Firstly, the DNA sequence is changed into a simplified molecular input line entry system (SMILES) format, which reflects chemical molecular structure. Secondly, for the molecular structure data, we construct the DNA molecular graph feature based on the principle of graph convolutional network. Then, the residual block is designed to extract higher level, distinguishable features from molecular graph features. Finally, the prediction module is used to obtain the result of whether it is a 6mA site. By means of 10-fold cross-validation, MGF6mARice outperforms the state-of-the-art approaches. Multiple experiments have shown that the molecular graph feature and residual block can promote the performance of MGF6mARice in 6mA prediction. To the best of our knowledge, it is the first time to derive a feature of DNA sequence by considering the chemical molecular structure. We hope that MGF6mARice will be helpful for researchers to analyze 6mA sites in rice.
Affiliation(s)
- Mengya Liu
  - School of Computer Science and Technology, Anhui University, Hefei, 230601, China
- Zhan-Li Sun
  - School of Artificial Intelligence, Anhui University, Hefei, 230601, China
- Zhigang Zeng
  - School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan, 430074, China
- Kin-Man Lam
  - Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong, China
20
Cai J, Xiao G, Su R. GC6mA-Pred: A deep learning approach to identify DNA N6-methyladenine sites in the rice genome. Methods 2022; 204:14-21. [PMID: 35149214 DOI: 10.1016/j.ymeth.2022.02.001] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 01/09/2022] [Revised: 01/31/2022] [Accepted: 02/05/2022] [Indexed: 12/11/2022]
Abstract
MOTIVATION: DNA N6-methyladenine (6mA) is a pivotal DNA modification for various biological processes. More accurate prediction of 6mA methylation sites plays an irreplaceable part in understanding the mechanisms behind related biological activities. However, the existing prediction methods extract information from only a single dimension, which limits them. Therefore, it is necessary to obtain information about 6mA sites from different dimensions in order to establish a reliable prediction method. RESULTS: In this study, a neural network based bioinformatics model named GC6mA-Pred is proposed to predict N6-methyladenine modifications in DNA sequences. GC6mA-Pred extracts significant information at both the sequence level and the graph level. At the sequence level, GC6mA-Pred uses a three-layer convolutional neural network (CNN) model to represent the sequence. At the graph level, GC6mA-Pred employs a graph neural network (GNN) method to integrate the information contained in the chemical molecular formula corresponding to the DNA sequence. On our newly built dataset, GC6mA-Pred shows better performance than other existing models. The results of comparative experiments show that GC6mA-Pred accurately identifies DNA 6mA modifications.
Affiliation(s)
- Jianhua Cai
  - Fujian Provincial Key Laboratory of Information Processing and Intelligent Control, College of Computer and Control Engineering, Minjiang University, Fuzhou, China
  - College of Mathematics and Computer Science, Fuzhou University, Fuzhou, PR China
- Guobao Xiao
  - Fujian Provincial Key Laboratory of Information Processing and Intelligent Control, College of Computer and Control Engineering, Minjiang University, Fuzhou, China
- Ran Su
  - College of Intelligence and Computing, Tianjin University, Tianjin, China
21
Dou L, Zhang Z, Xu L, Zou Q. iKcr_CNN: A novel computational tool for imbalance classification of human nonhistone crotonylation sites based on convolutional neural networks with focal loss. Comput Struct Biotechnol J 2022; 20:3268-3279. [PMID: 35832615 PMCID: PMC9251780 DOI: 10.1016/j.csbj.2022.06.032] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 03/20/2022] [Revised: 06/13/2022] [Accepted: 06/13/2022] [Indexed: 11/26/2022]
Abstract
Lysine crotonylation (Kcr) is a newly discovered protein post-translational modification and has been shown to be widely involved in various biological processes and human diseases. Thus, the accurate and fast identification of this modification has become a preliminary task in investigating the related biological functions. Because traditional high-throughput experimental techniques are slow, costly, and labor-intensive, constructing bioinformatics predictors based on machine learning algorithms has become a popular solution. Although dozens of predictors have been reported to identify Kcr sites, only two, nhKcr and DeepKcrot, focus on human nonhistone protein sequences. Moreover, due to the imbalanced nature of the data distribution, detection performance is severely biased towards the majority negative samples and leaves much room for improvement. In this research, we developed a convolutional neural network framework, dubbed iKcr_CNN, to identify human nonhistone Kcr modifications. To overcome the imbalance issue (Kcr: 15,274; non-Kcr: 74,018 with imbalance ratio: 1:4), we applied the focal loss function instead of the standard cross-entropy as the objective to optimize the model, which not only assigns different weights to samples belonging to different categories but also distinguishes easy- and hard-to-classify samples. Ultimately, the obtained model presents more balanced prediction scores between real-world positive and negative samples than existing tools. The user-friendly web server is accessible at ikcrcnn.webmalab.cn/, and the Python scripts can be downloaded at github.com/lijundou/iKcr_CNN/. The proposed model may serve as an efficient tool to assist researchers with their experimental work.
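The two effects attributed to the focal loss here, class-dependent weighting and down-weighting of easy samples, can be seen in a minimal sketch of the commonly used binary form, FL = -alpha_t * (1 - p_t)^gamma * log(p_t). The defaults `gamma=2.0` and `alpha=0.25` are illustrative assumptions and are not the settings reported for iKcr_CNN.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Binary focal loss for a single sample.

    p: predicted probability of the positive class; y: true label (0 or 1).
    gamma and alpha are illustrative defaults, not values from the paper.
    """
    p_t = p if y == 1 else 1.0 - p              # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-dependent weight
    p_t = min(max(p_t, eps), 1.0 - eps)         # clip to avoid log(0)
    # (1 - p_t) ** gamma shrinks the loss of well-classified (easy) samples
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With `gamma=0` the modulating factor disappears and the expression reduces to class-weighted cross-entropy, which makes the relationship to the standard loss easy to check.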
22
Desai J, Francis C, Longo K, Hoss A. OUP accepted manuscript. Nucleic Acids Res 2022; 50:3128-3141. [PMID: 35286381 PMCID: PMC8989546 DOI: 10.1093/nar/gkac155] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/02/2021] [Revised: 02/14/2022] [Accepted: 02/26/2022] [Indexed: 11/13/2022]
Abstract
Alternative splicing is frequently involved in the diversification of protein function and can also be modulated for therapeutic purposes. Here we develop a predictive model, called Exon ByPASS (predicting Exon skipping Based on Protein amino acid SequenceS), to assess the criticality of exon inclusion based solely on information contained in the amino acid sequence upstream and downstream of the exon junctions. By focusing on protein sequence, Exon ByPASS predicts exon skipping independent of tissue and species in the absence of any intronic information. We validate model predictions using transcriptomic and proteomic data and show that the model can capture exon skipping in different tissues and species. Additionally, we reveal potential therapeutic opportunities by predicting synthetically skippable exons and neo-junctions arising in cancer cells.
Affiliation(s)
- Jigar Desai
  - To whom correspondence should be addressed. Tel: +1 704 214 7914;
- Andrew Hoss
  - Wave Life Sciences, Cambridge, MA 02138, USA
23
Le NQK, Ho QT. Deep transformers and convolutional neural network in identifying DNA N6-methyladenine sites in cross-species genomes. Methods 2021; 204:199-206. [PMID: 34915158 DOI: 10.1016/j.ymeth.2021.12.004] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Received: 11/13/2021] [Revised: 11/30/2021] [Accepted: 12/09/2021] [Indexed: 12/19/2022]
Abstract
As one of the most common post-transcriptional epigenetic modifications, N6-methyladenine (6mA) plays an essential role in various cellular processes and disease pathogenesis. Therefore, accurately identifying 6mA modifications is necessary for a deep understanding of cellular processes and other possible functional mechanisms. Although a few computational methods have been proposed, their respective models were developed with small training datasets. Hence, their practical application in genome-wide detection is quite limited. To overcome the existing limitations, we present a novel model based on transformer architecture and deep learning to identify DNA 6mA sites in cross-species genomes. The model is constructed on a benchmark dataset and uses features derived from pre-trained transformer word embedding approaches. Subsequently, a convolutional neural network is employed to learn from the generated features and produce the prediction outcomes. Our predictor achieved an accuracy of 79.3% and a Matthews correlation coefficient (MCC) of 0.58 in an independent test. Overall, it achieved better accuracy than the baseline models and significantly outperformed the existing predictors, demonstrating the effectiveness of our proposed hybrid framework. Furthermore, our model is expected to assist biologists in accurately identifying 6mA sites and formulating novel testable biological hypotheses. We also release source codes and datasets freely at https://github.com/khanhlee/bert-dna for front-end users.
Affiliation(s)
- Nguyen Quoc Khanh Le
  - Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei 106, Taiwan
  - Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei 106, Taiwan
  - Translational Imaging Research Center, Taipei Medical University Hospital, Taipei 110, Taiwan
- Quang-Thai Ho
  - College of Information & Communication Technology, Can Tho University, Viet Nam
  - Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, 32003, Taiwan
24
Lv H, Zhang Y, Wang JS, Yuan SS, Sun ZJ, Dao FY, Guan ZX, Lin H, Deng KJ. iRice-MS: An integrated XGBoost model for detecting multitype post-translational modification sites in rice. Brief Bioinform 2021; 23:6447435. [PMID: 34864888 DOI: 10.1093/bib/bbab486] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 08/12/2021] [Revised: 10/05/2021] [Accepted: 10/23/2021] [Indexed: 12/13/2022]
Abstract
Post-translational modification (PTM) refers to the covalent and enzymatic modification of proteins after protein biosynthesis, which orchestrates a variety of biological processes. Detecting PTM sites at proteome scale is one of the key steps toward an in-depth understanding of their regulation mechanisms. In this study, we present an integrated method based on eXtreme Gradient Boosting (XGBoost), called iRice-MS, to identify 2-hydroxyisobutyrylation, crotonylation, malonylation, ubiquitination, succinylation and acetylation in rice. For each PTM-specific model, we adopted eight feature encoding schemes, including sequence-based features, physicochemical property-based features and spatial mapping information-based features. The optimal feature set was identified from each encoding, and their respective models were established. Extensive experimental results show that iRice-MS consistently displays excellent performance in 5-fold cross-validation and on an independent test dataset. In addition, our approach outperforms other existing tools in terms of AUC. Based on the proposed model, a web server named iRice-MS was established and is freely accessible at http://lin-group.cn/server/iRice-MS.
Affiliation(s)
- Hao Lv
  - Center for Informational Biology at University of Electronic Science and Technology of China, China
- Yang Zhang
  - Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, China
- Jia-Shu Wang
  - Center for Informational Biology at University of Electronic Science and Technology of China, China
- Shi-Shi Yuan
  - Center for Informational Biology at University of Electronic Science and Technology of China, China
- Zi-Jie Sun
  - Center for Informational Biology at University of Electronic Science and Technology of China, China
- Fu-Ying Dao
  - Center for Informational Biology at University of Electronic Science and Technology of China, China
- Zheng-Xing Guan
  - Center for Informational Biology at University of Electronic Science and Technology of China, China
- Hao Lin
  - Center for Informational Biology at University of Electronic Science and Technology of China, China
- Ke-Jun Deng
  - Center for Informational Biology at University of Electronic Science and Technology of China, China
25
Wu Y, Sa Y, Guo Y, Li Q, Zhang N. Identification of WHO II/III gliomas by 16 prognostic-related gene signatures using machine learning methods. Curr Med Chem 2021; 29:1622-1639. [PMID: 34455959 DOI: 10.2174/0929867328666210827103049] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 01/08/2021] [Revised: 05/27/2021] [Accepted: 05/28/2021] [Indexed: 11/22/2022]
Abstract
BACKGROUND: Clinical observation shows that prognosis differs widely among gliomas of the same grade within World Health Organization (WHO) grades II and III. Therefore, a better understanding of the genetics and molecular mechanisms underlying WHO grade II and III gliomas is required, with the aim of developing a classification scheme at the molecular level rather than the conventional pathological morphology level. METHOD: We performed survival analysis combined with the Least Absolute Shrinkage and Selection Operator (LASSO) machine learning method using expression datasets downloaded from the Chinese Glioma Genome Atlas as well as The Cancer Genome Atlas. Risk scores were calculated from the product of the expression levels of overall survival-related genes and their multivariate Cox proportional hazards regression coefficients. WHO grade II and III gliomas were categorized into low-risk, medium-risk, and high-risk subgroups. We used the 16 prognostic-related genes as input features to build a prognosis-based classification model using a fully connected neural network. Gene function annotations were also performed. RESULTS: The 16 genes (AKNAD1, C7orf13, CDK20, CHRFAM7A, CHRNA1, EFNB1, GAS1, HIST2H2BE, KCNK3, KLHL4, LRRK2, NXPH3, PIGZ, SAMD5, ERINC2, and SIX6) related to glioma prognosis were screened. The 16 selected genes were associated with the development of gliomas and carcinogenesis. The accuracy of the fully connected neural network model on an external validation data set from the two cohorts reached 95.5%. Our method shows good potential for classifying WHO grade II and III gliomas into low-risk, medium-risk, and high-risk subgroups. The subgroups showed significant (P<0.01) differences in overall survival. CONCLUSION: This resulted in the identification of 16 genes that were related to the prognosis of gliomas. Here we developed a computational method to discriminate WHO grade II and III gliomas into three subgroups with distinct prognoses. The gene expression-based method provides a reliable alternative to determine the prognosis of gliomas.
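The risk score described in the METHOD section can be sketched as below, assuming the usual convention that the per-gene products of expression level and Cox coefficient are summed into one score per patient; the abstract does not spell out the aggregation. The gene names come from the abstract, while the numeric values in the usage note are hypothetical.

```python
def risk_score(expression, coefficients):
    """Sum of (Cox coefficient * expression level) over the signature genes.

    expression: mapping gene -> expression level for one patient.
    coefficients: mapping gene -> multivariate Cox regression coefficient.
    Assumption: per-gene products are added; every signature gene is measured.
    """
    return sum(coefficients[gene] * expression[gene] for gene in coefficients)
```

For instance, with hypothetical coefficients `{"GAS1": 0.5, "SIX6": -1.0}` and expression levels `{"GAS1": 2.0, "SIX6": 1.0}`, the score is 0.5 * 2.0 + (-1.0) * 1.0 = 0.0. Patients would then be assigned to low-, medium-, or high-risk subgroups by cutoffs on this score, which the abstract does not specify.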
Affiliation(s)
- YaMeng Wu
  - Department of Biomedical Engineering, Tianjin Key Lab of BME Measurement, Tianjin University, Tianjin, China
- Yu Sa
  - Department of Biomedical Engineering, Tianjin Key Lab of BME Measurement, Tianjin University, Tianjin, China
- Yu Guo
  - Department of Biomedical Engineering, Tianjin Key Lab of BME Measurement, Tianjin University, Tianjin, China
- QiFeng Li
  - Department of Biomedical Engineering, Tianjin Key Lab of BME Measurement, Tianjin University, Tianjin, China
- Ning Zhang
  - Department of Biomedical Engineering, Tianjin Key Lab of BME Measurement, Tianjin University, Tianjin, China
26
Xiaoru L, Ling G. Combinatorial constraint coding based on the EORS algorithm in DNA storage. PLoS One 2021; 16:e0255376. [PMID: 34324571 PMCID: PMC8320985 DOI: 10.1371/journal.pone.0255376] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/10/2021] [Accepted: 07/15/2021] [Indexed: 11/19/2022]
Abstract
The development of information technology has produced massive amounts of data, which has brought severe challenges to information storage. Traditional electronic storage media cannot keep up with the ever-increasing demand for data storage, but in its place DNA has emerged as a feasible storage medium with high density, large storage capacity and strong durability. In DNA data storage, many different approaches can be used to encode data into codewords. DNA coding is a key step in DNA storage and can directly affect storage performance and data integrity. However, since errors are prone to occur in DNA synthesis and sequencing, and non-specific hybridization is prone to occur in the solution, how to effectively encode DNA has become an urgent problem to be solved. In this article, we propose a DNA storage coding method based on the equilibrium optimization random search (EORS) algorithm, which meets the Hamming distance, GC content and no-runlength constraints and can reduce the error rate in storage. Simulation experiments have shown that the size of the DNA storage code set constructed by the EORS algorithm that meets the combination constraints has increased by an average of 11% compared with previous work. The increase in the code set means that shorter DNA chains can be used to store more data.
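The three combinatorial constraints named in this abstract can be checked with a few lines of code. The sketch below assumes common definitions from the DNA coding literature: GC content equal to half the codeword length (rounded down), no two identical adjacent bases, and a minimum pairwise Hamming distance `d`. These definitions and the helper names are assumptions; the EORS search itself is not reproduced.

```python
def hamming(a, b):
    """Number of positions at which two equal-length codewords differ."""
    if len(a) != len(b):
        raise ValueError("codewords must have equal length")
    return sum(x != y for x, y in zip(a, b))


def gc_balanced(word):
    """GC-content constraint (assumed form: exactly floor(n/2) G or C bases)."""
    return sum(ch in "GC" for ch in word) == len(word) // 2


def no_runlength(word):
    """No-runlength constraint: adjacent bases must differ."""
    return all(a != b for a, b in zip(word, word[1:]))


def can_add(word, code_set, d):
    """True if `word` meets all three constraints relative to `code_set`."""
    return (gc_balanced(word) and no_runlength(word)
            and all(hamming(word, c) >= d for c in code_set))
```

A search procedure such as the one described would repeatedly propose candidate codewords and keep those for which `can_add` holds, so a larger surviving code set means more data per fixed codeword length.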
Affiliation(s)
- Li Xiaoru
- Hulunbeier Vocational and Technical College, Hulunbeier, Inner Mongolia, China
- Guo Ling
- Baidu Co., Ltd., Shanghai, China
|
27
|
Dong GF, Zheng L, Huang SH, Gao J, Zuo YC. Amino Acid Reduction Can Help to Improve the Identification of Antimicrobial Peptides and Their Functional Activities. Front Genet 2021; 12:669328. [PMID: 33959153 PMCID: PMC8093877 DOI: 10.3389/fgene.2021.669328] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 02/18/2021] [Accepted: 03/23/2021] [Indexed: 02/03/2023]
Abstract
Antimicrobial peptides (AMPs) are considered potential substitutes for antibiotics in the field of new anti-infective drug design. Several machine learning algorithms and web servers exist for identifying AMPs and their functional activities. However, there is still room for improvement in prediction algorithms and feature extraction methods. The reduced amino acid (RAA) alphabet simplifies protein complexity and helps recognize structurally conserved regions. This article evaluates the performance of more than 5,000 reduced amino acid descriptors generated from 74 types of reduced amino acid alphabet in the first stage and the second stage in order to construct a two-stage classifier, Identification of Antimicrobial Peptides by Reduced Amino Acid Cluster (iAMP-RAAC), for identifying AMPs and their functional activities, respectively. The results show that the first-stage AMP classifier achieves accuracies of 97.21% and 97.11% on the training dataset and the independent test dataset, respectively. In the second stage, the classifier still shows good performance: at least three of the four metrics, sensitivity (SN), specificity (SP), accuracy (ACC), and Matthews correlation coefficient (MCC), exceed the results reported in the literature. Further, ANOVA with incremental feature selection (IFS) is used for feature selection, which further improves prediction performance at each stage. Finally, a user-friendly web server, iAMP-RAAC, is established at http://bioinfor.imu.edu.cn/iampraac.
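As a rough illustration of what a reduced amino acid descriptor looks like, the sketch below maps the 20 residues onto six clusters and computes k-mer frequencies over the reduced sequence. The clustering shown is a hypothetical grouping by broad physicochemical class, not one of the 74 alphabets evaluated in the paper, and the function names are assumptions.

```python
from collections import Counter

# Hypothetical six-cluster reduction (illustrative only): each cluster is
# represented by its first letter in the reduced alphabet.
CLUSTERS = ["AVLIMC", "FWYH", "STNQ", "KR", "DE", "GP"]
REDUCE = {aa: cluster[0] for cluster in CLUSTERS for aa in cluster}


def reduce_sequence(seq: str) -> str:
    """Rewrite a peptide in the reduced alphabet.

    Raises KeyError for residues outside the 20 standard amino acids.
    """
    return "".join(REDUCE[aa] for aa in seq.upper())


def kmer_composition(seq: str, k: int = 2) -> dict[str, float]:
    """Relative frequency of each k-mer in the reduced sequence."""
    reduced = reduce_sequence(seq)
    kmers = [reduced[i:i + k] for i in range(len(reduced) - k + 1)]
    counts = Counter(kmers)
    total = len(kmers)
    return {kmer: n / total for kmer, n in counts.items()}


if __name__ == "__main__":
    peptide = "GLFKKLLKSL"  # arbitrary example peptide
    print(reduce_sequence(peptide))
    print(kmer_composition(peptide, k=2))
```

With six clusters, the dipeptide feature space shrinks from 400 to 36 dimensions, which is the kind of simplification that makes comparing many alphabets and descriptor types tractable.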
Affiliation(s)
- Gai-Fang Dong
- Inner Mongolia Autonomous Region Key Laboratory of Big Data Research and Application of Agriculture and Animal Husbandry, College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot, China
- Lei Zheng
- The State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
- Sheng-Hui Huang
- The State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
- Jing Gao
- Inner Mongolia Autonomous Region Key Laboratory of Big Data Research and Application of Agriculture and Animal Husbandry, College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot, China
- Yong-Chun Zuo
- The State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
|
28
|
Dou L, Yang F, Xu L, Zou Q. A comprehensive review of the imbalance classification of protein post-translational modifications. Brief Bioinform 2021; 22:6217722. [PMID: 33834199 DOI: 10.1093/bib/bbab089] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Received: 01/10/2021] [Revised: 02/17/2021] [Accepted: 02/24/2021] [Indexed: 12/13/2022]
Abstract
Post-translational modifications (PTMs) play significant roles in regulating protein structure, activity and function, and they are closely involved in various pathologies. Therefore, the identification of associated PTMs is the foundation of in-depth research on related biological mechanisms, disease treatments and drug design. Due to the high cost and time consumption of high-throughput sequencing techniques, developing machine learning-based predictors has been considered an effective approach to rapidly recognize potential modified sites. However, the imbalanced distribution of true and false PTM sites, namely, the data imbalance problem, largely affects the reliability and application of prediction tools. In this article, we conduct a systematic survey of the research progress in imbalanced PTM classification. First, we describe the modeling process in detail and outline useful data imbalance solutions. Then, we summarize the recently proposed bioinformatics tools based on imbalanced PTM data and simultaneously build a convenient website, ImClassi_PTMs (available at lab.malab.cn/~dlj/ImbClassi_PTMs/), for researchers to browse. Moreover, we analyze the challenges of current computational predictors and propose some suggestions to improve the efficiency of imbalance learning. We hope that this work will provide comprehensive knowledge of imbalanced PTM recognition and contribute to advanced predictors in the future.
Affiliation(s)
- Lijun Dou
- University of Electronic Science and Technology of China and the Shenzhen Polytechnic, China
- Fenglong Yang
- University of Electronic Science and Technology of China and the Shenzhen Polytechnic, China
- Lei Xu
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, China
- Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
|
29
|
Recent Advances in Predicting Protein S-Nitrosylation Sites. Biomed Res Int 2021; 2021:5542224. [PMID: 33628788 PMCID: PMC7892234 DOI: 10.1155/2021/5542224] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 01/07/2021] [Revised: 01/24/2021] [Accepted: 01/25/2021] [Indexed: 01/09/2023]
Abstract
Protein S-nitrosylation (SNO) is the covalent modification of cysteine residues by nitric oxide (NO) and its derivatives. SNO plays an essential role in reversible posttranslational modifications of proteins. The accurate prediction of SNO sites is crucial for revealing the biological mechanisms of NO regulation and for related drug development. Identification of SNO sites in proteins is currently a very active research topic. In this review, we briefly summarize recent advances in computationally identifying SNO sites. The challenges and future perspectives for identifying SNO sites are also discussed. We anticipate that this review will provide insights into research on SNO site prediction.
|
30
|
Le NQK, Ho QT, Nguyen TTD, Ou YY. A transformer architecture based on BERT and 2D convolutional neural network to identify DNA enhancers from sequence information. Brief Bioinform 2021; 22:6128847. [PMID: 33539511 DOI: 10.1093/bib/bbab005] [Citation(s) in RCA: 92] [Impact Index Per Article: 23.0] [Received: 12/18/2020] [Revised: 01/01/2021] [Accepted: 01/03/2021] [Indexed: 01/11/2023]
Abstract
Recently, language representation models have drawn a lot of attention in the natural language processing field due to their remarkable results. Among them, bidirectional encoder representations from transformers (BERT) has proven to be a simple, yet powerful language model that achieved novel state-of-the-art performance. BERT adopted the concept of contextualized word embedding to capture the semantics and context of the words in which they appeared. In this study, we present a novel technique by incorporating BERT-based multilingual model in bioinformatics to represent the information of DNA sequences. We treated DNA sequences as natural sentences and then used BERT models to transform them into fixed-length numerical matrices. As a case study, we applied our method to DNA enhancer prediction, which is a well-known and challenging problem in this field. We then observed that our BERT-based features improved more than 5-10% in terms of sensitivity, specificity, accuracy and Matthews correlation coefficient compared to the current state-of-the-art features in bioinformatics. Moreover, advanced experiments show that deep learning (as represented by 2D convolutional neural networks; CNN) holds potential in learning BERT features better than other traditional machine learning techniques. In conclusion, we suggest that BERT and 2D CNNs could open a new avenue in biological modeling using sequence information.
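The four evaluation metrics named in the abstract (sensitivity, specificity, accuracy and Matthews correlation coefficient) follow directly from the entries of a binary confusion matrix. The sketch below uses the standard definitions; the function name and the example counts are hypothetical and are not taken from the paper.

```python
import math


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Sensitivity, specificity, accuracy and MCC from confusion counts.

    Assumes both classes are present (tp + fn > 0 and tn + fp > 0).
    MCC is reported as 0.0 when its denominator is zero.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Hypothetical counts for an enhancer / non-enhancer classifier.
    for name, value in binary_metrics(tp=90, fp=10, tn=80, fn=20).items():
        print(f"{name}: {value:.3f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when the positive and negative classes differ in size.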
Affiliation(s)
- Nguyen Quoc Khanh Le
- Professional Master Program in Artificial Intelligence in Medicine, Taipei Medical University, Taipei, Taiwan
- Quang-Thai Ho
- College of Information and Communication Technology, Can Tho University, Vietnam
- Yu-Yen Ou
- Department of Computer Science and Engineering, Yuan Ze University, Taiwan
|
31
|
Lv Z, Cui F, Zou Q, Zhang L, Xu L. Anticancer peptides prediction with deep representation learning features. Brief Bioinform 2021; 22:6126754. [PMID: 33529337 DOI: 10.1093/bib/bbab008] [Citation(s) in RCA: 87] [Impact Index Per Article: 21.8] [Received: 10/06/2020] [Revised: 12/20/2020] [Accepted: 01/05/2021] [Indexed: 12/13/2022]
Abstract
Anticancer peptides constitute one of the most promising therapeutic agents for combating common human cancers. Using wet-lab experiments to verify whether a peptide displays anticancer characteristics is time-consuming and costly. Hence, in this study, we proposed a computational method named identify anticancer peptides via deep representation learning features (iACP-DRLF), which uses the light gradient boosting machine algorithm and deep representation learning features. Two kinds of sequence embedding technologies were used, namely soft symmetric alignment embedding and unified representation (UniRep) embedding, both of which involve deep neural network models based on long short-term memory networks and their derived networks. The results showed that the use of deep representation learning features greatly improved the capability of the models to discriminate anticancer peptides from other peptides. Also, UMAP (uniform manifold approximation and projection for dimension reduction) and SHAP (Shapley additive explanations) analyses showed that UniRep has an advantage over other features for anticancer peptide identification. The Python script and pretrained models can be downloaded from https://github.com/zhibinlv/iACP-DRLF or from http://public.aibiochem.net/iACP-DRLF/.
Affiliation(s)
- Zhibin Lv
- University of Electronic Science and Technology of China
- Feifei Cui
- University of Electronic Science and Technology of China
- Quan Zou
- Institute of Fundamental and Frontier Sciences at University of Electronic Science and Technology of China
- Lichao Zhang
- School of Intelligent Manufacturing and Equipment, Shenzhen Institute of Information Technology
- Lei Xu
- School of Electronic and Communication Engineering, Shenzhen Polytechnic
|