1. Liu S, Shi T, Yu J, Li R, Lin H, Deng K. Research on Bitter Peptides in the Field of Bioinformatics: A Comprehensive Review. Int J Mol Sci 2024; 25:9844. [PMID: 39337334 PMCID: PMC11432553 DOI: 10.3390/ijms25189844] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/16/2024] [Revised: 09/06/2024] [Accepted: 09/09/2024] [Indexed: 09/30/2024] [Open access]
Abstract
Bitter peptides are small peptides produced by the hydrolysis of proteins under acidic, alkaline, or enzymatic conditions. These peptides can enhance food flavor and offer various health benefits, including antihypertensive, antidiabetic, antioxidant, antibacterial, and immune-regulating properties. They show significant potential in the development of functional foods and in the prevention and treatment of diseases. This review introduces the diverse sources of bitter peptides and discusses the mechanisms of bitterness generation and their physiological functions in the taste system. It also emphasizes the application of bioinformatics in bitter peptide research, including the establishment and improvement of bitter peptide databases, the use of quantitative structure-activity relationship (QSAR) models to predict bitterness thresholds, and the latest advances in classification models built with machine learning and deep learning algorithms for bitter peptide identification. Future research directions include enhancing databases, diversifying models, and applying generative models to deepen bitter peptide research and uncover more practical applications.
Affiliation(s)
- Hao Lin
  - School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China; (S.L.); (T.S.); (J.Y.); (R.L.)
- Kejun Deng
  - School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China; (S.L.); (T.S.); (J.Y.); (R.L.)
2. Ge R, Xia Y, Jiang M, Jia G, Jing X, Li Y, Cai Y. HybAVPnet: A Novel Hybrid Network Architecture for Antiviral Peptides Prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2024; 21:1358-1365. [PMID: 38587961 DOI: 10.1109/tcbb.2024.3385635] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/10/2024]
Abstract
Viruses pose a great threat to human life and productivity, so the research and development of antiviral drugs is urgently needed. Antiviral peptides play an important role in drug design and development. Because wet-lab experimental methods are time-consuming and laborious, it is critical to use computational methods to predict antiviral peptides accurately and rapidly. However, because data are limited, accurate prediction of antiviral peptides remains challenging, and extracting effective feature representations from sequences is crucial for creating accurate models. This study introduces a novel two-step approach, named HybAVPnet, to predict antiviral peptides with a hybrid network architecture based on neural networks and traditional machine learning methods. We adopted a stacking-like structure to capture both long-term dependencies and local evolution information, achieving a comprehensive and diverse prediction using the predicted labels and probabilities. Using an ensemble technique with different kinds of features can reduce the variance without increasing the bias. Experimental results show that HybAVPnet achieves better and more robust performance than state-of-the-art methods, which makes it useful for the research and development of antiviral drugs. Because of its generalization ability, it can also be extended to other peptide recognition problems.
3. Meng C, Pei Y, Bu Y, Liu Q, Li Q, Zou Q, Zhang Y. IIFS2.0: An Improved Incremental Feature Selection Method for Protein Sequence Processing Based on a Caching Strategy. J Mol Biol 2024:168741. [PMID: 39122168 DOI: 10.1016/j.jmb.2024.168741] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/05/2024] [Revised: 07/08/2024] [Accepted: 08/05/2024] [Indexed: 08/12/2024]
Abstract
The purpose of feature selection in protein sequence recognition problems is to select the optimal feature set and use it as training input for classifiers and discover key sequence features of specific proteins. In the feature selection process, relevant features associated with the target task will be retained, and irrelevant and redundant features will be removed. Therefore, in an ideal state, a feature combination with smaller feature dimensions and higher performance indicators is desired. This paper proposes an algorithm called IIFS2.0 based on the cache elimination strategy, which takes the local optimal combination of cached feature subsets as a breakthrough point. It searches for a new feature combination method through the cache elimination strategy to avoid the drawbacks of human factors and excessive reliance on feature sorting results. We validated and analyzed its effectiveness on the protein dataset, demonstrating that IIFS2.0 significantly reduces the dimensionality of feature combinations while also improving various evaluation indicators. In addition, we provide IIFS2.0 on https://112.124.26.17:8006/ for researchers to use.
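For orientation, the classical incremental feature selection loop that methods of this kind build on can be sketched as follows. This is a minimal illustration under stated assumptions, not the IIFS2.0 algorithm: the cache elimination strategy is not reproduced, and the names `incremental_feature_selection`, `toy_score`, and the feature labels are hypothetical. It also shows the reliance on the ranking order that the abstract mentions, since a weak feature ranked early stays in the selected subset.

```python
def incremental_feature_selection(ranked_features, evaluate):
    """Classical IFS: grow the subset in ranked order and keep the best prefix.

    ranked_features: feature names, already sorted from most to least relevant.
    evaluate: callable that scores a feature subset (higher is better).
    """
    best_subset, best_score = [], float("-inf")
    current = []
    for feature in ranked_features:
        current.append(feature)
        score = evaluate(current)
        if score > best_score:
            best_score = score
            best_subset = list(current)  # copy, because `current` keeps growing
    return best_subset, best_score


# Hypothetical scoring function standing in for a cross-validated classifier:
# it rewards informative features and penalizes subset size.
INFORMATIVE = {"f1", "f3"}


def toy_score(subset):
    hits = sum(1 for name in subset if name in INFORMATIVE)
    return hits - 0.1 * len(subset)


subset, score = incremental_feature_selection(["f1", "f2", "f3", "f4"], toy_score)
```

In this toy run the uninformative feature `f2` is retained only because it precedes `f3` in the ranking, which is the kind of order dependence a subset search is meant to reduce.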
Affiliation(s)
- Chaolu Meng
  - College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot, China; Inner Mongolia Autonomous Region Key Laboratory of Big Data Research and Application of Agriculture and Animal Husbandry, China
- Yue Pei
  - Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China
- Yongbo Bu
  - College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot, China; Inner Mongolia Autonomous Region Key Laboratory of Big Data Research and Application of Agriculture and Animal Husbandry, China
- Qing Liu
  - Department of Pain, The Affiliated Traditional Chinese Medicine Hospital of Southwest Medical University, Luzhou 646000, Sichuan, China
- Qun Li
  - Department of Pain, The Affiliated Traditional Chinese Medicine Hospital of Southwest Medical University, Luzhou 646000, Sichuan, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, China; Department of Anesthesiology, The Affiliated Traditional Chinese Medicine Hospital of Southwest Medical University, Luzhou 646000, Sichuan, China.
- Ying Zhang
  - Department of Pain, The Affiliated Traditional Chinese Medicine Hospital of Southwest Medical University, Luzhou 646000, Sichuan, China; Department of Anesthesiology, The Affiliated Traditional Chinese Medicine Hospital of Southwest Medical University, Luzhou 646000, Sichuan, China.
4. Zhang M, Zhang L, Liu T, Feng H, He Z, Li F, Zhao J, Liu H. CBIL-VHPLI: a model for predicting viral-host protein-lncRNA interactions based on machine learning and transfer learning. Sci Rep 2024; 14:17549. [PMID: 39080344 PMCID: PMC11289117 DOI: 10.1038/s41598-024-68750-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/13/2024] [Accepted: 07/26/2024] [Indexed: 08/02/2024] [Open access]
Abstract
Virus‒host protein‒lncRNA interaction (VHPLI) predictions are critical for decoding the molecular mechanisms of viral pathogens and host immune processes. Although protein‒lncRNA interactions have been predicted in both plants and animals, they have not been extensively studied in viruses. For the first time, we propose a deep learning-based approach, named CBIL‒VHPLI, that consists mainly of convolutional neural network and bidirectional long short-term memory (BiLSTM) network modules combined with transfer learning to predict viral-host protein‒lncRNA interactions. The models were first trained on large and diverse datasets (including plants, animals, etc.). Protein sequence features were extracted using a k-mer method combined with the one-hot encoding and composition-transition-distribution (CTD) methods, and lncRNA sequence features were extracted using a k-mer method combined with the one-hot encoding and Z curve methods. The results obtained on three independent external validation datasets showed that the pre-trained CBIL‒VHPLI model performed the best, with an accuracy of approximately 0.9. Pretraining was followed by transfer learning on a viral protein-human lncRNA dataset, and the fine-tuning results showed that the accuracy of CBIL‒VHPLI was 0.946, which was significantly greater than that of the previous models. The final case study results showed that CBIL‒VHPLI achieved a prediction reproducibility rate of 91.6% for the RIP-Seq experimental screening results. The model was then used to predict the interactions between human lncRNA PIK3CD-AS2 and the nonstructural protein 1 (NS1) of the H5N1 virus, and RNA pull-down experiments were used to verify the model's prediction. The source code of CBIL‒VHPLI and the datasets used in this work are available at https://github.com/Liu-Lab-Lnu/CBIL-VHPLI for academic usage.
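As a rough illustration of the one-hot encoding step named in this abstract, the sketch below turns a protein sequence into a fixed-size binary matrix. It is an assumption-laden example, not the CBIL‒VHPLI code: the 20-letter alphabet, the zero-padding to `max_len`, and the function name `one_hot_encode` are all hypothetical choices.

```python
# Assumed alphabet: the 20 standard amino acids in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot_encode(seq, max_len, alphabet=AMINO_ACIDS):
    """Encode a sequence as a max_len x len(alphabet) matrix of 0/1 values.

    Sequences longer than max_len are truncated; shorter ones are zero-padded.
    Symbols outside the alphabet (for example X) are left as all-zero rows.
    """
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    matrix = [[0] * len(alphabet) for _ in range(max_len)]
    for position, symbol in enumerate(seq.upper()[:max_len]):
        column = index.get(symbol)
        if column is not None:
            matrix[position][column] = 1
    return matrix


encoded = one_hot_encode("MKV", max_len=5)
```

A fixed `max_len` keeps every example the same shape, which is what convolutional and recurrent layers typically expect as input.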
Affiliation(s)
- Man Zhang
  - School of Life Science, Liaoning University, Shenyang, 110036, China
- Li Zhang
  - School of Life Science, Liaoning University, Shenyang, 110036, China
  - Technology Innovation Center for Computer Simulating and Information Processing of Bio-Macromolecules of Liaoning Province, Shenyang, 110036, China
  - Engineering Laboratory for Molecular Simulation and Designing of Drug Molecules of Liaoning, Shenyang, 110036, China
- Ting Liu
  - School of Life Science, Liaoning University, Shenyang, 110036, China
  - China Medical University-Queen's University Belfast Joint College, China Medical University, Shenyang, 110036, China
- Huawei Feng
  - Technology Innovation Center for Computer Simulating and Information Processing of Bio-Macromolecules of Liaoning Province, Shenyang, 110036, China
  - Engineering Laboratory for Molecular Simulation and Designing of Drug Molecules of Liaoning, Shenyang, 110036, China
  - School of Pharmacy, Liaoning University, No. 66, Chongshan Zhonglu, Shenyang, 110036, Liaoning, China
- Zhe He
  - School of Life Science, Liaoning University, Shenyang, 110036, China
- Feng Li
  - School of Life Science, Liaoning University, Shenyang, 110036, China
- Jian Zhao
  - School of Life Science, Liaoning University, Shenyang, 110036, China
- Hongsheng Liu
  - Technology Innovation Center for Computer Simulating and Information Processing of Bio-Macromolecules of Liaoning Province, Shenyang, 110036, China.
  - Engineering Laboratory for Molecular Simulation and Designing of Drug Molecules of Liaoning, Shenyang, 110036, China.
  - School of Pharmacy, Liaoning University, No. 66, Chongshan Zhonglu, Shenyang, 110036, Liaoning, China.
5. Zhang B, Hou Z, Yang Y, Wong KC, Zhu H, Li X. SOFB is a comprehensive ensemble deep learning approach for elucidating and characterizing protein-nucleic-acid-binding residues. Commun Biol 2024; 7:679. [PMID: 38830995 PMCID: PMC11148103 DOI: 10.1038/s42003-024-06332-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/23/2023] [Accepted: 05/15/2024] [Indexed: 06/05/2024] [Open access]
Abstract
Proteins and nucleic acids are essential components of living organisms that interact in critical cellular processes. Accurate prediction of nucleic-acid-binding residues in proteins can contribute to a better understanding of protein function. However, the discrepancy between protein sequence information and the available structural and functional data renders most current computational models ineffective. Therefore, it is vital to design computational models based on protein sequence information to identify nucleic-acid-binding sites in proteins. Here, we implement an ensemble deep learning method, called SOFB, for identifying nucleic-acid-binding residues in proteins. SOFB characterizes protein sequences by learning the semantics of biological dynamics contexts, and then uses an ensemble deep learning-based sequence network to learn feature representation and classification by explicitly modeling dynamic semantic information. The language learning model, adapted from natural language to biological language, captures the underlying relationships of protein sequences, and the ensemble sequence network, which consists of different convolutional layers together with Bi-LSTM, refines various features for optimal performance. To address the data imbalance issue, we adopt ensemble learning to train multiple models and then combine them. Our experimental results on several DNA- and RNA-binding residue datasets demonstrate that the proposed model outperforms other state-of-the-art methods. In addition, we conduct an interpretability analysis of the identified nucleic-acid-binding residue sequences based on the attention weights of the language learning model, revealing novel insights into the dynamic semantic information that supports the identified residues. SOFB is available at https://github.com/Encryptional/SOFB and https://figshare.com/articles/online_resource/SOFB_figshare_rar/25499452 .
Affiliation(s)
- Bin Zhang
  - School of Artificial Intelligence, Jilin University, Changchun, China
- Zilong Hou
  - School of Artificial Intelligence, Jilin University, Changchun, China
- Yuning Yang
  - Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Canada
- Ka-Chun Wong
  - Department of Computer Science, City University of Hong Kong, Hong Kong, Hong Kong SAR
- Haoran Zhu
  - School of Artificial Intelligence, Jilin University, Changchun, China.
- Xiangtao Li
  - School of Artificial Intelligence, Jilin University, Changchun, China.
6. Ju H, Cui Y, Su Q, Juan L, Manavalan B. CODENET: A deep learning model for COVID-19 detection. Comput Biol Med 2024; 171:108229. [PMID: 38447500 DOI: 10.1016/j.compbiomed.2024.108229] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/11/2023] [Revised: 02/20/2024] [Accepted: 02/25/2024] [Indexed: 03/08/2024]
Abstract
Conventional COVID-19 testing methods have some flaws: they are expensive and time-consuming. Chest X-ray (CXR) diagnostic approaches can alleviate these flaws to some extent. However, there is no accurate and practical automatic diagnostic framework with good interpretability. The application of artificial intelligence (AI) technology to medical radiography can help to accurately detect the disease, reduce the burden on healthcare organizations, and provide good interpretability. Therefore, this study proposes CodeNet, a new convolutional neural network (CNN) for CXR-based COVID-19 diagnosis. The method uses contrastive learning to make full use of latent image data, enhancing the model's ability to extract features and generalize across different data domains. On the evaluation dataset, the proposed method achieves an accuracy of 94.20%, outperforming several existing methods used for comparison. Ablation studies validate the efficacy of the proposed method, while interpretability analysis shows that the method can effectively guide clinical professionals. This work demonstrates the superior detection performance of a CNN using contrastive learning techniques on CXR images, paving the way for computer vision and artificial intelligence technologies to leverage massive medical data for disease diagnosis.
Affiliation(s)
- Hong Ju
  - Heilongjiang Agricultural Engineering Vocational College, China
- Yanyan Cui
  - Beidahuang Industry Group General Hospital, Harbin, China
- Qiaosen Su
  - Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, 16419, Gyeonggi-do, Republic of Korea
- Liran Juan
  - School of Life Science and Technology, Harbin Institute of Technology, Harbin, 150001, China.
- Balachandran Manavalan
  - Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, 16419, Gyeonggi-do, Republic of Korea.
7. Niu M, Wang C, Zhang Z, Zou Q. A computational model of circRNA-associated diseases based on a graph neural network: prediction and case studies for follow-up experimental validation. BMC Biol 2024; 22:24. [PMID: 38281919 PMCID: PMC10823650 DOI: 10.1186/s12915-024-01826-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2023] [Accepted: 01/11/2024] [Indexed: 01/30/2024] [Open access]
Abstract
BACKGROUND: Circular RNAs (circRNAs) have been confirmed to play a vital role in the occurrence and development of diseases. Exploring the relationship between circRNAs and diseases is of far-reaching significance for studying etiopathogenesis and treating diseases. To this end, building on the graph Markov neural network (GMNN) algorithm constructed in our previous work, GMNN2CD, we further considered the multisource biological data that affect the association between circRNAs and diseases, developed an updated web server, CircDA, and used human hepatocellular carcinoma (HCC) tissue data to verify its prediction results. RESULTS: CircDA is built on a Tumarkov-based deep learning framework. The algorithm regards biomolecules as nodes and the interactions between molecules as edges, abstracts multiomics data, and models them as a heterogeneous biomolecular association network that reflects the complex relationships between different biomolecules. Case studies using literature data from HCC, cervical, and gastric cancers demonstrate that the CircDA predictor can identify missing associations between known circRNAs and diseases. In addition, quantitative real-time PCR (RT-qPCR) experiments on human HCC tissue samples found that five circRNAs were significantly differentially expressed, which showed that CircDA can predict diseases related to new circRNAs. CONCLUSIONS: This efficient computational prediction and case analysis with sufficient feedback allows us to identify circRNA-associated diseases and disease-associated circRNAs. Our work provides a method to predict circRNA-associated diseases and can provide guidance for the association of diseases with certain circRNAs. For ease of use, an online prediction server ( http://server.malab.cn/CircDA ) is provided, and the code is open-sourced ( https://github.com/nmt315320/CircDA.git ) for the convenience of algorithm improvement.
Affiliation(s)
- Mengting Niu
  - School of Electronic and Communication Engineering, Shenzhen Polytechnic University, Shenzhen, 518055, China
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
- Chunyu Wang
  - Faculty of Computing, Harbin Institute of Technology, Harbin, 150000, Heilongjiang, China
- Zhanguo Zhang
  - Hepatic Surgery Center, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, 1095 Jiefang Avenue, Wuhan, 430030, China.
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, No. 4 Block 2 North Jianshe Road, Chengdu, 610054, China.
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China.
8. Ren J, Chen X, Zhang Z, Shi H, Wu S. DPred_3S: identifying dihydrouridine (D) modification on three species epitranscriptome based on multiple sequence-derived features. Front Genet 2023; 14:1334132. [PMID: 38169665 PMCID: PMC10758487 DOI: 10.3389/fgene.2023.1334132] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/06/2023] [Accepted: 11/29/2023] [Indexed: 01/05/2024] [Open access]
Abstract
Introduction: Dihydrouridine (D) is a conserved modification of tRNA across all three domains of life. D modification enhances the flexibility of a single nucleotide base in the spatial structure and is disease- and evolution-associated. Recent studies have also suggested the presence of dihydrouridine on mRNA. Methods: To identify D in the epitranscriptome, we provide a machine learning-based prediction framework named "DPred_3S" for the D epitranscriptome of three species, which uses epitranscriptome sequencing data as training data for the first time. Results: The optimal features were evaluated by the F-score and by integrating different features; our model achieved area under the receiver operating characteristic curve (AUROC) scores of 0.955, 0.946, and 0.905 for Saccharomyces cerevisiae, Escherichia coli, and Schizosaccharomyces pombe, respectively. The performances of different machine learning algorithms were also compared in this study. Discussion: The high performance of our model suggests that D sites can be distinguished based on their surrounding sequence, but the lower performance of cross-species prediction may be limited by technique preferences.
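The abstract does not define the F-score it uses. Assuming it refers to the Fisher-style score widely used to rank features in sequence-based predictors, the score of feature $i$ compares between-class separation with within-class variance: $F(i) = \frac{(\bar{x}_i^{(+)} - \bar{x}_i)^2 + (\bar{x}_i^{(-)} - \bar{x}_i)^2}{s_{i,+}^2 + s_{i,-}^2}$. The sketch below implements that assumed definition; the function names and toy data are hypothetical and are not taken from DPred_3S.

```python
def f_score(pos, neg):
    """Fisher-style F-score of one feature from its positive and negative values.

    Assumes at least two samples per class so the sample variances are defined.
    """
    n_pos, n_neg = len(pos), len(neg)
    mean_pos = sum(pos) / n_pos
    mean_neg = sum(neg) / n_neg
    mean_all = (sum(pos) + sum(neg)) / (n_pos + n_neg)
    between = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
    var_pos = sum((x - mean_pos) ** 2 for x in pos) / (n_pos - 1)
    var_neg = sum((x - mean_neg) ** 2 for x in neg) / (n_neg - 1)
    within = var_pos + var_neg
    if within == 0:
        # Constant within both classes: perfectly separating or uninformative.
        return float("inf") if between > 0 else 0.0
    return between / within


def rank_features(X, y):
    """Return feature indices sorted by descending F-score (labels are 0 or 1)."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(f_score(pos, neg))
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


# Toy data: feature 0 separates the classes, feature 1 does not.
X = [[1.0, 5.0], [2.0, 4.0], [8.0, 5.0], [9.0, 4.0]]
y = [1, 1, 0, 0]
ranking = rank_features(X, y)
```

A ranking of this kind is usually the input to an incremental selection step that decides how many top-ranked features to keep.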
Affiliation(s)
- Jinjin Ren
  - Key Laboratory of Ministry of Education for Gastrointestinal Cancer, School of Basic Medical Sciences, Fujian Medical University, Fuzhou, Fujian, China
  - Fujian Key Laboratory of Tumor Microbiology, Department of Medical Microbiology, Fujian Medical University, Fuzhou, Fujian, China
- Xiaozhen Chen
  - Key Laboratory of Ministry of Education for Gastrointestinal Cancer, School of Basic Medical Sciences, Fujian Medical University, Fuzhou, Fujian, China
- Zhengqian Zhang
  - Key Laboratory of Ministry of Education for Gastrointestinal Cancer, School of Basic Medical Sciences, Fujian Medical University, Fuzhou, Fujian, China
- Haoran Shi
  - Institute of Applied Microbiology, Research Center for BioSystems, Land Use, and Nutrition (IFZ), Justus-Liebig-University Giessen, Giessen, Germany
- Shuxiang Wu
  - Key Laboratory of Ministry of Education for Gastrointestinal Cancer, School of Basic Medical Sciences, Fujian Medical University, Fuzhou, Fujian, China
  - Fujian Key Laboratory of Tumor Microbiology, Department of Medical Microbiology, Fujian Medical University, Fuzhou, Fujian, China
9. Yang Y, Liu Z, Lu J, Sun Y, Fu Y, Pan M, Xie X, Ge Q. Analysis approaches for the identification and prediction of N6-methyladenosine sites. Epigenetics 2023; 18:2158284. [PMID: 36562485 PMCID: PMC9980620 DOI: 10.1080/15592294.2022.2158284] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 12/24/2022] [Open access]
Abstract
Mapping transcriptional m6A sites, in particular across the full transcriptome, can reveal the global dynamics of a variety of biological processes. Individual m6A sites also contribute to biological function, which can be evaluated using stoichiometric information obtained at single-nucleotide resolution. Currently, m6A sites are identified mainly by experimental and prediction methods, based on high-throughput sequencing and machine learning models, respectively. This review summarizes recent topics and progress in bioinformatics methods for deciphering m6A methylation, including the experimental detection of m6A methylation sites, data analysis techniques, approaches to predicting m6A methylation sites, m6A methylation databases, and the detection of m6A modification in circRNA. It closes with a brief discussion of development prospects in this area.
Affiliation(s)
- Yuwei Yang
  - State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, People's Republic of China
- Zhiyu Liu
  - State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, People's Republic of China
- Junru Lu
  - State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, People's Republic of China
- Yuqing Sun
  - State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, People's Republic of China
- Yue Fu
  - State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, People's Republic of China
- Min Pan
  - Department of Pathology and Pathophysiology School of Medicine, Southeast University, Nanjing, China
- Xueying Xie
  - State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, People's Republic of China
- Qinyu Ge
  - State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, People's Republic of China
10. Jiao S, Ye X, Ao C, Sakurai T, Zou Q, Xu L. Adaptive learning embedding features to improve the predictive performance of SARS-CoV-2 phosphorylation sites. Bioinformatics 2023; 39:btad627. [PMID: 37847658 PMCID: PMC10628388 DOI: 10.1093/bioinformatics/btad627] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/05/2023] [Revised: 09/11/2023] [Accepted: 10/16/2023] [Indexed: 10/19/2023] [Open access]
Abstract
MOTIVATION: The rapid and extensive transmission of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has led to an unprecedented global health emergency, affecting millions of people and causing an immense socioeconomic impact. The identification of SARS-CoV-2 phosphorylation sites plays an important role in unraveling the complex molecular mechanisms behind infection and the resulting alterations in host cell pathways. However, currently available prediction tools for identifying these sites lack accuracy and efficiency. RESULTS: In this study, we present a comprehensive biological function analysis of SARS-CoV-2 infection in a clonal human lung epithelial A549 cell line, revealing dramatic changes in protein phosphorylation pathways in host cells. Moreover, a novel deep learning predictor called PSPred-ALE is specifically designed to identify phosphorylation sites in human host cells infected with SARS-CoV-2. The key idea of PSPred-ALE lies in the use of a self-adaptive learning embedding algorithm, which enables the automatic extraction of contextual sequential features from protein sequences. In addition, the tool uses a multi-head attention module to capture global information, further improving the accuracy of predictions. Comparative analysis of features demonstrated that the self-adaptive learning embedding features are superior to hand-crafted statistical features in capturing discriminative sequence information. Benchmarking comparison shows that PSPred-ALE outperforms state-of-the-art prediction tools and achieves robust performance. Therefore, the proposed model can effectively identify phosphorylation sites and assist biomedical scientists in understanding the mechanism of phosphorylation in SARS-CoV-2 infection. AVAILABILITY AND IMPLEMENTATION: PSPred-ALE is available at https://github.com/jiaoshihu/PSPred-ALE and Zenodo (https://doi.org/10.5281/zenodo.8330277).
Affiliation(s)
- Shihu Jiao
  - Department of Computer Science, University of Tsukuba, Tsukuba 3058577, Japan
- Xiucai Ye
  - Department of Computer Science, University of Tsukuba, Tsukuba 3058577, Japan
- Chunyan Ao
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Tetsuya Sakurai
  - Department of Computer Science, University of Tsukuba, Tsukuba 3058577, Japan
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
- Lei Xu
  - School of Electronic and Communication Engineering, Shenzhen Polytechnic, No. 4089 Shahexi Road, Shenzhen 518000, China
11. Gubeljak P, Xu T, Pedrazzetti L, Burton OJ, Magagnin L, Hofmann S, Malliaras GG, Lombardo A. Electrochemically-gated graphene broadband microwave waveguides for ultrasensitive biosensing. Nanoscale 2023; 15:15304-15317. [PMID: 37682040 DOI: 10.1039/d3nr01239e] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/09/2023]
Abstract
Identification of non-amplified DNA sequences and single-base mutations is essential for molecular biology and genetic diagnostics. This paper reports a novel sensor consisting of electrochemically-gated graphene coplanar waveguides coupled with a microfluidic channel. Upon exposure to analytes, propagation of electromagnetic waves in the waveguides is modified as a result of interactions with the fringing field and modulation of graphene dynamic conductivity resulting from electrostatic gating. Probe DNA sequences are immobilised on the graphene surface, and the sensor is exposed to DNA sequences which either perfectly match the probe, contain a single-base mismatch or are unrelated. By monitoring the scattering parameters at frequencies between 50 MHz and 50 GHz, unambiguous and reproducible discrimination of the different strands is achieved at concentrations as low as one attomole per litre (1 aM). By controlling and synchronising frequency sweeps, electrochemical gating, and liquid flow in the microfluidic channel, the sensor generates multidimensional datasets. Advanced data analysis techniques are utilised to take full advantage of the richness of the dataset. A classification accuracy >97% between all three sequences is achieved using different Machine Learning models, even in the presence of simulated noise and low signal-to-noise ratios. The sensor exceeds state-of-the-art sensitivity of field-effect transistors and microwave sensors for the identification of single-base mismatches.
Collapse
Affiliation(s)
- Patrik Gubeljak
- Cambridge Graphene Centre, Department of Engineering, University of Cambridge, UK
- Department of Engineering, University of Cambridge, UK
| | - Tianhui Xu
- Department of Engineering, University of Cambridge, UK
- Department of Electronic and Electrical Engineering, University College London, London, UK
| | - Lorenzo Pedrazzetti
- Department of Engineering, University of Cambridge, UK
- Dipartimento di Chimica, Materiali e Ingegneria Chimica "Giulio Natta", Politecnico di Milano, Italy
- Luca Magagnin
- Dipartimento di Chimica, Materiali e Ingegneria Chimica "Giulio Natta", Politecnico di Milano, Italy
- Antonio Lombardo
- Department of Engineering, University of Cambridge, UK
- Department of Electronic and Electrical Engineering, University College London, London, UK
- London Centre for Nanotechnology, University College London, UK.
12
|
Dhakal P, Tayara H, Chong KT. An ensemble of stacking classifiers for improved prediction of miRNA-mRNA interactions. Comput Biol Med 2023; 164:107242. [PMID: 37473564 DOI: 10.1016/j.compbiomed.2023.107242] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/01/2023] [Revised: 06/21/2023] [Accepted: 07/07/2023] [Indexed: 07/22/2023]
Abstract
MicroRNAs (miRNAs) are small non-coding RNA molecules that play a crucial role in regulating gene expression at the post-transcriptional level by binding to potential target sites of messenger RNAs (mRNAs), facilitated by the Argonaute family of proteins. Selecting the conservative candidate target sites (CTS) is a challenging step, considering that most of the existing computational algorithms primarily focus on canonical site types, which is a time-consuming and inefficient utilization of miRNA target site interactions. We developed a stacking classifier algorithm that addresses the CTS selection criteria using feature-encoding techniques that generate feature vectors, including k-mer nucleotide composition, dinucleotide composition, pseudo-nucleotide composition, and sequence order coupling. This innovative stacking classifier algorithm surpassed previous state-of-the-art algorithms in predicting functional miRNA targets. We evaluated the performance of the proposed model on 10 independent test datasets and obtained an average accuracy of 79.77%, which is a significant improvement of 7.26% over previous models. This improvement shows that the proposed method has great potential for distinguishing highly functional miRNA targets and can serve as a valuable tool in biomedical and drug development research.
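The k-mer nucleotide composition named in this abstract can be illustrated with a minimal sketch. The function name `kmer_composition`, the RNA alphabet, and the example sequence are assumptions made for illustration; this is not the authors' implementation, and the other encodings (pseudo-nucleotide composition, sequence order coupling) are not covered.

```python
from collections import Counter
from itertools import product


def kmer_composition(seq: str, k: int = 2, alphabet: str = "ACGU") -> dict[str, float]:
    """Return the normalized frequency of every possible k-mer in seq."""
    seq = seq.upper().replace("T", "U")  # treat DNA-style input as RNA
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    # Skip windows that contain symbols outside the alphabet (e.g. N).
    counts = Counter(w for w in windows if set(w) <= set(alphabet))
    total = sum(counts.values())
    # A fixed key order yields a feature vector of length len(alphabet) ** k.
    return {
        "".join(p): (counts["".join(p)] / total if total else 0.0)
        for p in product(alphabet, repeat=k)
    }


# Hypothetical 10-nt sequence; k=2 gives the dinucleotide composition.
features = kmer_composition("AUGGCUAUGG", k=2)
```

With k=2 the vector has 16 entries; larger k grows the vector as 4^k, which is one reason such encodings are usually paired with feature selection.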
Affiliation(s)
- Priyash Dhakal
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju-si, 54896, Jeollabuk-do, Republic of Korea
- Hilal Tayara
- School of International Engineering and Science, Jeonbuk National University, Jeonju-si, 54896, Jeollabuk-do, Republic of Korea.
- Kil To Chong
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju-si, 54896, Jeollabuk-do, Republic of Korea; Advanced Electronics and Information Research Center, Jeonbuk National University, Jeonju-si, 54896, Jeollabuk-do, Republic of Korea.
13
|
Lee M. Machine learning for small interfering RNAs: a concise review of recent developments. Front Genet 2023; 14:1226336. [PMID: 37519887 PMCID: PMC10372481 DOI: 10.3389/fgene.2023.1226336] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/24/2023] [Accepted: 07/04/2023] [Indexed: 08/01/2023] Open
Abstract
The advent of machine learning and its subsequent integration into small interfering RNA (siRNA) research heralds a new epoch in the field of RNA interference (RNAi). This review emphasizes the urgency and relevance of assimilating the plethora of contributions and advancements in this domain, particularly focusing on the period of 2019-2023. Given the rapid progression of deep learning technologies, our synthesis of recent research is paramount to staying apprised of the state-of-the-art methods being utilized. It not only offers a comprehensive insight into the confluence of machine learning and siRNA but also serves as a beacon, guiding future explorations in this intersectional research field. Our rigorous examination of studies promises a discerning perspective on the contemporary landscape of machine learning applications in siRNA design and function. This review is an effort to foster further discourse and propel academic inquiry in this multifaceted domain.
14
|
Valeri JA, Soenksen LR, Collins KM, Ramesh P, Cai G, Powers R, Angenent-Mari NM, Camacho DM, Wong F, Lu TK, Collins JJ. BioAutoMATED: An end-to-end automated machine learning tool for explanation and design of biological sequences. Cell Syst 2023; 14:525-542.e9. [PMID: 37348466 PMCID: PMC10700034 DOI: 10.1016/j.cels.2023.05.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/30/2022] [Revised: 02/17/2023] [Accepted: 05/22/2023] [Indexed: 06/24/2023]
Abstract
The design choices underlying machine-learning (ML) models present important barriers to entry for many biologists who aim to incorporate ML in their research. Automated machine-learning (AutoML) algorithms can address many challenges that come with applying ML to the life sciences. However, these algorithms are rarely used in systems and synthetic biology studies because they typically do not explicitly handle biological sequences (e.g., nucleotide, amino acid, or glycan sequences) and cannot be easily compared with other AutoML algorithms. Here, we present BioAutoMATED, an AutoML platform for biological sequence analysis that integrates multiple AutoML methods into a unified framework. Users are automatically provided with relevant techniques for analyzing, interpreting, and designing biological sequences. BioAutoMATED predicts gene regulation, peptide-drug interactions, and glycan annotation, and designs optimized synthetic biology components, revealing salient sequence characteristics. By automating sequence modeling, BioAutoMATED allows life scientists to incorporate ML more readily into their work.
Affiliation(s)
- Jacqueline A Valeri
- Department of Biological Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Institute for Medical Engineering and Science, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Luis R Soenksen
- Institute for Medical Engineering and Science, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA; Department of Mechanical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA
- Katherine M Collins
- Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA; Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Department of Engineering, University of Cambridge, Trumpington St, Cambridge CB2 1PZ, UK
- Pradeep Ramesh
- Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA
- George Cai
- Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA
- Rani Powers
- Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA; Pluto Biosciences, Golden, CO 80402, USA
- Nicolaas M Angenent-Mari
- Department of Biological Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Institute for Medical Engineering and Science, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA
- Diogo M Camacho
- Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA
- Felix Wong
- Department of Biological Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Institute for Medical Engineering and Science, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Timothy K Lu
- Department of Biological Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Institute for Medical Engineering and Science, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Synthetic Biology Group, Research Laboratory of Electronics, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- James J Collins
- Department of Biological Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Institute for Medical Engineering and Science, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Harvard-MIT Program in Health Sciences and Technology, Cambridge, MA 02139, USA; Abdul Latif Jameel Clinic for Machine Learning in Health, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
15
|
Butt AH, Alkhalifah T, Alturise F, Khan YD. Ensemble Learning for Hormone Binding Protein Prediction: A Promising Approach for Early Diagnosis of Thyroid Hormone Disorders in Serum. Diagnostics (Basel) 2023; 13:diagnostics13111940. [PMID: 37296792 DOI: 10.3390/diagnostics13111940] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2023] [Revised: 05/20/2023] [Accepted: 05/22/2023] [Indexed: 06/12/2023] Open
Abstract
Hormone-binding proteins (HBPs) are specific carrier proteins that bind to a given hormone. A soluble carrier hormone binding protein (HBP), which can interact non-covalently and specifically with growth hormone, modulates or inhibits hormone signaling. HBP is essential for the growth of life, despite still being poorly understood. According to some data, several diseases are caused by abnormally expressed HBPs. Accurate identification of these molecules is the first step in investigating the roles of HBPs and understanding their biological mechanisms. For a better understanding of cell development and cellular mechanisms, accurate HBP determination from a given protein sequence is essential. Using traditional biochemical experiments, it is difficult to correctly separate HBPs from an increasing number of proteins because of the high experimental costs and lengthy experiment periods. The abundance of protein sequence data that has been gathered in the post-genomic era necessitates a computational method that is automated and enables quick and accurate identification of putative HBPs within a large number of candidate proteins. A new machine-learning-based predictor is proposed for HBP identification. To produce the feature set for the proposed method, statistical moment-based features and amino acids were combined, and a random forest was trained on this feature set. In 5-fold cross-validation experiments, the proposed method achieved 94.37% accuracy and an F1-score of 0.9438, demonstrating the importance of the Hahn moment-based features.
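The two metrics reported in this abstract, accuracy and F1-score under 5-fold cross-validation, can be sketched as follows. The helper names, the unshuffled fold split, and the toy labels are assumptions for illustration only; the random forest itself is omitted, and the labels are not data from the paper.

```python
def k_fold_indices(n: int, k: int = 5) -> list[list[int]]:
    """Split range(n) into k disjoint test folds (no shuffling, for clarity)."""
    return [list(range(i, n, k)) for i in range(k)]


def accuracy_and_f1(y_true, y_pred, positive=1):
    """Compute accuracy and F1 for one fold; assumes non-empty inputs."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    # F1 = 2TP / (2TP + FP + FN); reported as 0.0 when there are no true positives.
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return accuracy, f1


folds = k_fold_indices(10, k=5)

# Toy labels (1 = HBP, 0 = non-HBP), purely illustrative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
acc, f1 = accuracy_and_f1(y_true, y_pred)
```

In a full cross-validation run, each fold in turn serves as the test set while the model is trained on the remaining folds, and the per-fold metrics are averaged.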
Affiliation(s)
- Ahmad Hassan Butt
- Department of Computer Science, Faculty of Computing & Information Technology, University of the Punjab, Lahore 54000, Pakistan
- Tamim Alkhalifah
- Department of Computer, College of Science and Arts in Ar Rass, Qassim University, Ar Rass 51921, Qassim, Saudi Arabia
- Fahad Alturise
- Department of Computer, College of Science and Arts in Ar Rass, Qassim University, Ar Rass 51921, Qassim, Saudi Arabia
- Yaser Daanial Khan
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, Lahore 54770, Pakistan
16
|
Wang C, Yang Q. ScerePhoSite: An interpretable method for identifying fungal phosphorylation sites in proteins using sequence-based features. Comput Biol Med 2023; 158:106798. [PMID: 36966555 DOI: 10.1016/j.compbiomed.2023.106798] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/03/2023] [Revised: 03/03/2023] [Accepted: 03/20/2023] [Indexed: 03/31/2023]
Abstract
Protein phosphorylation plays a vital role in signal transduction pathways and diverse cellular processes. To date, a tremendous number of in silico tools have been designed for phosphorylation site identification, but few of them are suitable for the identification of fungal phosphorylation sites. This largely hampers the functional investigation of fungal phosphorylation. In this paper, we present ScerePhoSite, a machine learning method for fungal phosphorylation site identification. The sequence fragments are represented by hybrid physicochemical features, and then LGB-based feature importance combined with the sequential forward search method is used to choose the optimal feature subset. As a result, ScerePhoSite surpasses currently available tools and shows more robust and balanced performance. Furthermore, the impact and contribution of specific features on the model performance were investigated by SHAP values. We expect ScerePhoSite to be a useful bioinformatics tool that complements hands-on experiments for the pre-screening of possible phosphorylation sites and facilitates our functional understanding of phosphorylation modification in fungi. The source code and datasets are accessible at https://github.com/wangchao-malab/ScerePhoSite/.
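One plausible reading of "feature importance combined with sequential forward search" is a greedy pass over features in decreasing importance, keeping a feature only when it improves a validation score. The sketch below follows that reading under stated assumptions: the function `forward_search`, the feature names, and the toy `score` function are all hypothetical, and the abstract does not confirm that ScerePhoSite works exactly this way.

```python
from typing import Callable, Sequence


def forward_search(
    ranked: Sequence[str],
    score: Callable[[list[str]], float],
) -> tuple[list[str], float]:
    """Greedily grow a feature subset, visiting features in ranked order."""
    selected: list[str] = []
    best = float("-inf")
    for feature in ranked:
        trial = selected + [feature]
        trial_score = score(trial)
        if trial_score > best:  # keep the feature only if the score improves
            selected, best = trial, trial_score
    return selected, best


# Hypothetical per-feature gains standing in for a cross-validated metric.
gains = {"hydrophobicity": 0.3, "charge": 0.2, "noise": -0.05, "polarity": 0.1}
ranked = ["hydrophobicity", "charge", "noise", "polarity"]  # assumed importance order

chosen, best = forward_search(ranked, lambda subset: sum(gains[f] for f in subset))
```

In practice `score` would train and validate a model on the candidate subset, so each call is expensive; ranking by importance first keeps the number of evaluations linear in the number of features.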
17
|
Ao C, Ye X, Sakurai T, Zou Q, Yu L. m5U-SVM: identification of RNA 5-methyluridine modification sites based on multi-view features of physicochemical features and distributed representation. BMC Biol 2023; 21:93. [PMID: 37095510 PMCID: PMC10127088 DOI: 10.1186/s12915-023-01596-0] [Citation(s) in RCA: 30] [Impact Index Per Article: 30.0] [Received: 12/20/2022] [Accepted: 04/12/2023] [Indexed: 04/26/2023] Open
Abstract
BACKGROUND: RNA 5-methyluridine (m5U) modifications are obtained by methylation at the C5 position of uridine catalyzed by pyrimidine methylation transferase, which is related to the development of human diseases. Accurate identification of m5U modification sites from RNA sequences can contribute to the understanding of their biological functions and the pathogenesis of related diseases. Compared to traditional experimental methods, easy-to-use computational methods based on machine learning can identify modification sites from RNA sequences in an efficient and time-saving manner. Despite the good performance of these computational methods, there are some drawbacks and limitations. RESULTS: In this study, we have developed a novel predictor, m5U-SVM, based on multi-view features and machine learning algorithms to construct predictive models for identifying m5U modification sites from RNA sequences. In this method, we used four traditional physicochemical features and distributed representation features. The optimized multi-view features were obtained from the four fused traditional physicochemical features by using the two-step LightGBM and IFS methods, and then the distributed representation features were fused with the optimized physicochemical features to obtain the new multi-view features. The best performing classifier, support vector machine, was identified by screening different machine learning algorithms. In comparison, the proposed model performs better than the existing state-of-the-art tool. CONCLUSIONS: m5U-SVM provides an effective tool that successfully captures sequence-related attributes of modifications and can accurately predict m5U modification sites from RNA sequences. The identification of m5U modification sites helps to understand and delve into the related biological processes and functions.
Affiliation(s)
- Chunyan Ao
- School of Computer Science and Technology, Xidian University, Xi'an, China
- Department of Computer Science, University of Tsukuba, Tsukuba, Japan
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Xiucai Ye
- Department of Computer Science, University of Tsukuba, Tsukuba, Japan
- Tetsuya Sakurai
- Department of Computer Science, University of Tsukuba, Tsukuba, Japan
- Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China.
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China.
- Liang Yu
- School of Computer Science and Technology, Xidian University, Xi'an, China.
18
|
Wang R, Chung CR, Huang HD, Lee TY. Identification of species-specific RNA N6-methyladinosine modification sites from RNA sequences. Brief Bioinform 2023; 24:7008797. [PMID: 36715277 DOI: 10.1093/bib/bbac573] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2022] [Revised: 11/11/2022] [Accepted: 11/24/2022] [Indexed: 01/31/2023] Open
Abstract
N6-methyladenosine (m6A) modification is the most abundant co-transcriptional modification in eukaryotic RNA and plays important roles in cellular regulation. Traditional high-throughput sequencing experiments used to explore functional mechanisms are time-consuming and labor-intensive, and most of the proposed methods focused on limited species types. To further understand the relevant biological mechanisms among different species with the same RNA modification, it is necessary to develop a computational scheme that can be applied to different species. To achieve this, we proposed an attention-based deep learning method, adaptive-m6A, which consists of a convolutional neural network, a bi-directional long short-term memory network and an attention mechanism, to identify m6A sites in multiple species. In addition, three conventional machine learning (ML) methods, including support vector machine, random forest and logistic regression classifiers, were considered in this work. Beyond the ML methods evaluated for multi-species prediction, the best adaptive-m6A model yielded an accuracy of 0.9832 and an area under the receiver operating characteristic curve of 0.98. Moreover, motif analysis and cross-validation among different species were conducted to test the robustness of one model towards multiple species, which helped improve our understanding of the sequence characteristics and biological functions of RNA modifications in different species.
Affiliation(s)
- Rulan Wang
- School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, Longgang District, 51872, Shenzhen, P.R. China
- Chia-Ru Chung
- Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, Longgang District, 51872, Shenzhen, P.R. China
- School of Life Sciences, University of Science and Technology of China, 230026, Hefei, Anhui, P.R. China
- Hsien-Da Huang
- Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, Longgang District, 51872, Shenzhen, P.R. China
- School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, Longgang District, 51872, Shenzhen, P.R. China
- Tzong-Yi Lee
- Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, Longgang District, 51872, Shenzhen, P.R. China
- School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, Longgang District, 51872, Shenzhen, P.R. China
19
|
Wang C, Zou Q, Ju Y, Shi H. Enhancer-FRL: Improved and Robust Identification of Enhancers and Their Activities Using Feature Representation Learning. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:967-975. [PMID: 36063523 DOI: 10.1109/tcbb.2022.3204365] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Indexed: 05/04/2023]
Abstract
Enhancers are crucial for precise regulation of gene expression, while enhancer identification and strength prediction are challenging because of their free distribution and tremendous number of similar fractions in the genome. Although several bioinformatics tools have been developed, shortfalls in these models remain, and their performances need further improvement. In the present study, a two-layer predictor called Enhancer-FRL was proposed for identifying enhancers (enhancers or nonenhancers) and their activities (strong and weak). More specifically, to build an efficient model, the feature representation learning scheme was applied to generate a 50D probabilistic vector based on 10 feature encodings and five machine learning algorithms. Subsequently, the multiview probabilistic features were integrated to construct the final prediction model. Compared with the single feature-based model, Enhancer-FRL showed significant performance improvement and model robustness. Performance assessment on the independent test dataset indicated that the proposed model outperformed state-of-the-art available toolkits. The webserver Enhancer-FRL is freely accessible at http://lab.malab.cn/~wangchao/softwares/Enhancer-FRL/. The code and datasets can be downloaded at the webserver page or on GitHub at https://github.com/wangchao-malab/Enhancer-FRL/.
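The "50D probabilistic vector" arises from pairing each of the 10 feature encodings with each of the five algorithms, so that every base model contributes one predicted probability. The sketch below shows only that shape; the encoders and models are placeholder callables with hypothetical names, not the encodings or classifiers used by Enhancer-FRL.

```python
from typing import Callable

Encoder = Callable[[str], list[float]]
Model = Callable[[list[float]], float]  # returns a probability in [0, 1]


def probabilistic_vector(
    seq: str,
    encoders: dict[str, Encoder],
    models: dict[tuple[str, str], Model],
    algorithms: list[str],
) -> list[float]:
    """Concatenate one predicted probability per (encoding, algorithm) pair."""
    vector = []
    for enc_name, encode in encoders.items():
        x = encode(seq)  # encode once, reuse across algorithms
        for algo in algorithms:
            vector.append(models[(enc_name, algo)](x))
    return vector


# Placeholder components, present only to make the shapes concrete.
encoders = {
    f"encoding_{i}": (lambda s, i=i: [float(s.count("G") + s.count("C") + i)])
    for i in range(10)
}
algorithms = [f"algorithm_{j}" for j in range(5)]
models = {(e, a): (lambda x: 0.5) for e in encoders for a in algorithms}

vec = probabilistic_vector("ACGTGC", encoders, models, algorithms)
```

The resulting vector would then be the input to a second-stage model, which is what the abstract describes as integrating the multiview probabilistic features.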
20
|
Malik A, Shoombuatong W, Kim CB, Manavalan B. GPApred: The first computational predictor for identifying proteins with LPXTG-like motif using sequence-based optimal features. Int J Biol Macromol 2023; 229:529-538. [PMID: 36596370 DOI: 10.1016/j.ijbiomac.2022.12.315] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/19/2022] [Revised: 12/19/2022] [Accepted: 12/28/2022] [Indexed: 01/02/2023]
Abstract
The cell surface proteins of gram-positive bacteria are involved in many important biological functions, including the infection of host cells. Owing to their virulent nature, these proteins are also considered strong candidates for potential drug or vaccine targets. Among the various cell surface proteins of gram-positive bacteria, LPXTG-like proteins form a major class. These proteins have a highly conserved C-terminal cell wall sorting signal, which consists of an LPXTG sequence motif, a hydrophobic domain, and a positively charged tail. These surface proteins are targeted to the cell envelope by a sortase enzyme via transpeptidation. A variety of LPXTG-like proteins have been experimentally characterized; however, their number in public databases has increased owing to extensive bacterial genome sequencing without proper annotation. In the absence of experimental characterization, identifying and annotating these sequences is extremely challenging. Therefore, in this study, we developed the first machine learning-based predictor called GPApred, which can identify LPXTG-like proteins from their primary sequences. Using a newly constructed benchmark dataset, we explored different classifiers and five feature encodings and their hybrids. Optimal features were derived using the recursive feature elimination method, and these features were then trained using a support vector machine algorithm. The performance of different models was evaluated using independent datasets, and a final model (GPApred) was selected based on consistency during cross-validation and independent assessment. GPApred can be an effective tool for predicting LPXTG-like sequences and can be further employed for functional characterization or drug targeting. Availability: https://procarb.org/gpapred/.
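For readers unfamiliar with the motif itself, LPXTG denotes Leu-Pro-any residue-Thr-Gly. A plain pattern scan, sketched below, shows what the motif looks like in a sequence; it is not GPApred's method, which is a trained classifier, and it ignores the hydrophobic domain and charged tail of the sorting signal. The function name and the example sequence are invented for illustration.

```python
import re

# L, P, any single residue letter, T, G
LPXTG = re.compile(r"LP[A-Z]TG")


def find_lpxtg(seq: str) -> list[tuple[int, str]]:
    """Return (0-based start, matched text) for each LPXTG occurrence."""
    return [(m.start(), m.group()) for m in LPXTG.finditer(seq.upper())]


# Made-up sequence with one motif near the C-terminus.
hits = find_lpxtg("MKKTAILPETGSEESNLLAAGLLLRRKR")
```

A scan like this misses LPXTG-like variants that deviate from the exact pattern, which is part of why a learned predictor is useful for this protein class.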
Affiliation(s)
- Adeel Malik
- Institute of Intelligence Informatics Technology, Sangmyung University, Seoul 03016, Republic of Korea
- Watshara Shoombuatong
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand
- Chang-Bae Kim
- Department of Biotechnology, Sangmyung University, Seoul 03016, Republic of Korea.
- Balachandran Manavalan
- Computational Biology and Bioinformatics Laboratory, Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea.
21
|
Pande A, Patiyal S, Lathwal A, Arora C, Kaur D, Dhall A, Mishra G, Kaur H, Sharma N, Jain S, Usmani SS, Agrawal P, Kumar R, Kumar V, Raghava GPS. Pfeature: A Tool for Computing Wide Range of Protein Features and Building Prediction Models. J Comput Biol 2023; 30:204-222. [PMID: 36251780 DOI: 10.1089/cmb.2022.0241] [Citation(s) in RCA: 16] [Impact Index Per Article: 16.0] [Indexed: 01/24/2023] Open
Abstract
In the last three decades, a wide range of protein features have been discovered to annotate a protein. Numerous attempts have been made to integrate these features in a software package/platform so that the user may compute a wide range of features from a single source. To complement the existing methods, we developed a method, Pfeature, for computing a wide range of protein features. Pfeature can compute more than 200,000 features required for predicting the overall function of a protein, residue-level annotation of a protein, and function of chemically modified peptides. It has six major modules, namely, composition, binary profiles, evolutionary information, structural features, patterns, and model building. The composition module computes most of the existing compositional features, plus novel features. The binary profile of amino acid sequences captures the fraction of each type of residue as well as its position. The evolutionary information module computes evolutionary information of a protein in the form of a position-specific scoring matrix profile generated using Position-Specific Iterative Basic Local Alignment Search Tool (PSI-BLAST), which is fit for annotation of a protein and its residues. A structural module was developed for computing structural features/descriptors from the tertiary structure of a protein. These features are suitable to predict the therapeutic potential of a protein containing non-natural or chemically modified residues. The model-building module implements various machine learning techniques for developing classification and regression models as well as feature selection. Pfeature also allows the generation of overlapping patterns and features from a protein. A user-friendly Pfeature is available as a web server, Python library, and stand-alone package.
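Two of the feature families mentioned here, residue composition and binary profiles, are simple enough to sketch directly. The functions below are generic illustrations with assumed names and an assumed 20-letter alphabet order; they are not the Pfeature API, and Pfeature's own output conventions (for example percentages versus fractions) may differ.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, assumed column order


def amino_acid_composition(seq: str) -> dict[str, float]:
    """Fraction of each standard residue among the standard residues in seq."""
    seq = seq.upper()
    n = sum(seq.count(a) for a in AMINO_ACIDS)
    return {a: (seq.count(a) / n if n else 0.0) for a in AMINO_ACIDS}


def binary_profile(seq: str) -> list[list[int]]:
    """One-hot row per position, preserving residue order along the sequence."""
    return [[1 if res == a else 0 for a in AMINO_ACIDS] for res in seq.upper()]


aac = amino_acid_composition("ACDA")
profile = binary_profile("AC")
```

Composition discards position and always yields 20 values, whereas the binary profile keeps position and grows with sequence length, so it is typically applied to fixed-length windows or terminal segments.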
Affiliation(s)
- Akshara Pande
- Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
- Sumeet Patiyal
- Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
- Anjali Lathwal
- Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
- Chakit Arora
- Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
- Dilraj Kaur
- Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
- Anjali Dhall
- Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
- Gaurav Mishra
- Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India; Department of Electrical Engineering, Shiv Nadar University, Greater Noida, India
- Harpreet Kaur
- Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India; Bioinformatics Centre, CSIR-Institute of Microbial Technology, Chandigarh, India
- Neelam Sharma
- Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
- Shipra Jain
- Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
- Salman Sadullah Usmani
- Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India; Bioinformatics Centre, CSIR-Institute of Microbial Technology, Chandigarh, India
- Piyush Agrawal
- Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India; Bioinformatics Centre, CSIR-Institute of Microbial Technology, Chandigarh, India
- Rajesh Kumar
- Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India; Bioinformatics Centre, CSIR-Institute of Microbial Technology, Chandigarh, India
- Vinod Kumar
- Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India; Bioinformatics Centre, CSIR-Institute of Microbial Technology, Chandigarh, India
- Gajendra P S Raghava
- Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
22
|
Wang C, Zou Q. Prediction of protein solubility based on sequence physicochemical patterns and distributed representation information with DeepSoluE. BMC Biol 2023; 21:12. [PMID: 36694239 PMCID: PMC9875434 DOI: 10.1186/s12915-023-01510-8] [Citation(s) in RCA: 22] [Impact Index Per Article: 22.0] [Received: 09/08/2022] [Accepted: 01/05/2023] [Indexed: 01/25/2023] Open
Abstract
BACKGROUND: Protein solubility is a precondition for efficient heterologous protein expression at the basis of most industrial applications and for functional interpretation in basic research. However, recurrent formation of inclusion bodies is still an inevitable roadblock in protein science and industry, where only about a quarter of proteins can be successfully expressed in soluble form. Despite numerous solubility prediction models having been developed over time, their performance remains unsatisfactory in the context of the current strong increase in available protein sequences. Hence, it is imperative to develop novel and highly accurate predictors that enable the prioritization of highly soluble proteins to reduce the cost of actual experimental work. RESULTS: In this study, we developed a novel tool, DeepSoluE, which predicts protein solubility using a long short-term memory (LSTM) network with hybrid features composed of physicochemical patterns and distributed representation of amino acids. Comparison results showed that the proposed model achieved more accurate and balanced performance than existing tools. Furthermore, we explored specific features that have a dominant impact on the model performance as well as their interaction effects. CONCLUSIONS: DeepSoluE is suitable for the prediction of protein solubility in E. coli; it serves as a bioinformatics tool for prescreening of potentially soluble targets to reduce the cost of wet-experimental studies. The publicly available webserver is freely accessible at http://lab.malab.cn/~wangchao/softs/DeepSoluE/.
Affiliation(s)
- Chao Wang
- School of Software Engineering, Chengdu University of Information Technology, Chengdu, China (GRID grid.411307.0; ISNI 0000 0004 1790 5236)
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China (GRID grid.54549.39; ISNI 0000 0004 0369 4060)
|
23
|
Bupi N, Sangaraju VK, Phan LT, Lal A, Vo TTB, Ho PT, Qureshi MA, Tabassum M, Lee S, Manavalan B. An Effective Integrated Machine Learning Framework for Identifying Severity of Tomato Yellow Leaf Curl Virus and Their Experimental Validation. RESEARCH (WASHINGTON, D.C.) 2023; 6:0016. [PMID: 36930763 PMCID: PMC10013792 DOI: 10.34133/research.0016] [Citation(s) in RCA: 17] [Impact Index Per Article: 17.0] [Received: 09/03/2022] [Accepted: 11/07/2022] [Indexed: 01/13/2023]
Abstract
Tomato yellow leaf curl virus (TYLCV) has dispersed across different countries, particularly into subtropical regions, where it is associated with more severe symptoms. Since TYLCV was first isolated in 1931, it has been a menace to industrial tomato production worldwide. Three groups were newly isolated from TYLCV-resistant tomatoes in 2022; however, their functions are unknown. Developing machine learning (ML)-based models from characterized sequences and evaluating their blind predictions is one of the major challenges in interdisciplinary research. The purpose of this study was to develop an integrated computational framework for the accurate identification of symptoms (mild or severe) based on TYLCV sequences (isolated in Korea). To develop the framework, we first extracted 11 different feature encodings and hybrid features from the training data and then explored 8 different classifiers, developing their respective prediction models with randomized 10-fold cross-validation. Subsequently, we carried out a systematic evaluation of these 96 developed models and selected the top 90 models, whose predicted class labels were combined and treated as reduced features. On the basis of these features, a multilayer perceptron was applied to develop the final prediction model (IML-TYLCVs). We conducted blind prediction on the 3 groups using IML-TYLCVs, and the results indicated that 2 groups were severe and 1 group was mild. Furthermore, we confirmed the prediction with virus-challenge experiments on tomato plant phenotypes using infectious clones from the 3 groups. Plant virologists and plant breeding professionals can access the user-friendly online IML-TYLCVs web server at https://balalab-skku.org/IML-TYLCVs, which can guide them in developing new protection strategies for newly emerging viruses.
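The model-building step in this abstract (feature encodings crossed with classifiers, each evaluated by randomized 10-fold cross-validation) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `randomized_kfold` and the placeholder encoding and classifier names are hypothetical, and counting the hybrid features as a single twelfth feature set is an assumption made only because 12 × 8 matches the 96 models the abstract reports.

```python
import random
from itertools import product


def randomized_kfold(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for randomized k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # randomize the sample order once
    folds = [indices[i::k] for i in range(k)]  # k disjoint, near-equal folds
    for i, test_idx in enumerate(folds):
        # Training set = every fold except the held-out one.
        train_idx = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train_idx, test_idx


# Hypothetical placeholder names; the abstract does not list the actual
# encodings or classifiers.
encodings = [f"encoding_{i}" for i in range(1, 12)] + ["hybrid"]  # 11 + 1 (assumed)
classifiers = [f"classifier_{i}" for i in range(1, 9)]            # 8 classifiers
model_grid = list(product(encodings, classifiers))                 # 12 * 8 combinations
```

Each `(encoding, classifier)` pair in `model_grid` would be trained and scored on every split from `randomized_kfold`; fixing `seed` keeps the folds identical across models so their scores are comparable.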
Affiliation(s)
- Nattanong Bupi
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea
| | - Vinoth Kumar Sangaraju
- Computational Biology and Bioinformatics Laboratory, Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea
| | - Le Thi Phan
- Computational Biology and Bioinformatics Laboratory, Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea
| | - Aamir Lal
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea
| | - Thuy Thi Bich Vo
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea
| | - Phuong Thi Ho
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea
| | - Muhammad Amir Qureshi
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea
| | - Marjia Tabassum
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea
| | - Sukchan Lee
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea
| | - Balachandran Manavalan
- Computational Biology and Bioinformatics Laboratory, Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea
|
24
|
Ullah M, Hadi F, Song J, Yu DJ. PScL-2LSAESM: bioimage-based prediction of protein subcellular localization by integrating heterogeneous features with the two-level SAE-SM and mean ensemble method. Bioinformatics 2023; 39:6839969. [PMID: 36413068 PMCID: PMC9947927 DOI: 10.1093/bioinformatics/btac727] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/13/2022] [Revised: 11/02/2022] [Accepted: 11/21/2022] [Indexed: 11/23/2022] Open
Abstract
MOTIVATION Over the past decades, a variety of in silico methods have been developed to predict protein subcellular localization within cells. However, a common and major challenge in the design and development of such methods is how to effectively utilize the heterogeneous feature sets extracted from bioimages. In this regard, limited efforts have been undertaken.
RESULTS We propose a new two-level stacked autoencoder network (termed 2L-SAE-SM) that improves prediction performance by integrating the heterogeneous feature sets. In the first level of 2L-SAE-SM, each optimal heterogeneous feature set is used to train our designed stacked autoencoder network (SAE-SM). Each trained first-level SAE-SM outputs a decision set based on its respective optimal heterogeneous feature set, known as an 'intermediate decision' set. These intermediate decision sets are then combined using the mean ensemble method to generate the 'intermediate feature' set for the second-level SAE-SM. Using the proposed framework, we further develop a novel predictor, referred to as PScL-2LSAESM, to characterize image-based protein subcellular localization. Extensive benchmarking experiments on the latest benchmark training and independent test datasets collected from the human protein atlas databank demonstrate the effectiveness of the proposed 2L-SAE-SM framework for the integration of heterogeneous feature sets. Moreover, performance comparison with current state-of-the-art methods further illustrates that PScL-2LSAESM clearly outperforms them on the task of protein subcellular localization.
AVAILABILITY AND IMPLEMENTATION https://github.com/csbio-njust-edu/PScL-2LSAESM.
SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
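The mean ensemble step described above, in which the first-level decision sets are averaged into one intermediate feature set, can be sketched as below. This is a minimal illustration under stated assumptions: the function name `mean_ensemble` and the nested-list layout (models, then samples, then per-class scores) are hypothetical and are not taken from the PScL-2LSAESM code.

```python
def mean_ensemble(decision_sets):
    """Average per-class scores across first-level models.

    decision_sets[m][s][c] is the score that model m assigns to class c
    for sample s. Returns one averaged score vector per sample.
    """
    if not decision_sets or not decision_sets[0]:
        raise ValueError("need at least one model and one sample")
    n_models = len(decision_sets)
    n_samples = len(decision_sets[0])
    n_classes = len(decision_sets[0][0])
    averaged = []
    for s in range(n_samples):
        # Element-wise mean over models for this sample.
        averaged.append(
            [sum(model[s][c] for model in decision_sets) / n_models
             for c in range(n_classes)]
        )
    return averaged


# Two hypothetical first-level models, two samples, two classes.
intermediate_features = mean_ensemble([
    [[0.25, 0.75], [1.0, 0.0]],
    [[0.75, 0.25], [0.5, 0.5]],
])
```

The averaged vectors keep the same dimensionality as each model's output, so the second-level network sees one fixed-size input per sample regardless of how many first-level models are combined.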
Affiliation(s)
- Matee Ullah
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
| | - Fazal Hadi
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
| | | | - Dong-Jun Yu
- To whom correspondence should be addressed.
|
25
|
Li Y, Kong F, Cui H, Wang F, Li C, Ma J. SENIES: DNA Shape Enhanced Two-Layer Deep Learning Predictor for the Identification of Enhancers and Their Strength. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:637-645. [PMID: 35015646 DOI: 10.1109/tcbb.2022.3142019] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
Identifying enhancers is a critical task in bioinformatics due to their primary role in regulating gene expression. For this reason, various computational algorithms devoted to enhancer identification have been put forward over the years. More features are extracted from the single DNA sequences to boost the performance. Nevertheless, DNA structural information is neglected, which is an essential factor affecting the binding preferences of transcription factors to regulatory elements like enhancers. Here, we propose SENIES, a DNA shape enhanced deep learning predictor, to identify enhancers and their strength. The predictor consists of two layers where the first layer is for enhancer and non-enhancer identification, and the second layer is for predicting the strength of enhancers. Apart from two common sequence-derived features (i.e., one-hot and k-mer), DNA shape is introduced to describe the 3D structures of DNA sequences. Performance comparison with state-of-the-art methods conducted on public datasets demonstrates the effectiveness and robustness of our predictor. The code implementation of SENIES is publicly available at https://github.com/hlju-liye/SENIES.
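The two sequence-derived features named in this abstract, one-hot and k-mer, can be sketched as follows. This is a minimal illustration rather than the SENIES implementation: the function names, the fixed `ACGT` alphabet, and the choice to encode unknown bases as all-zero rows are assumptions, and the DNA shape features are omitted because they depend on lookup tables that the abstract does not describe.

```python
from itertools import product

ALPHABET = "ACGT"


def one_hot(seq):
    """Encode each base as a 4-dimensional indicator row (A, C, G, T).

    Bases outside the alphabet (e.g. N) become all-zero rows.
    """
    return [[1 if base == a else 0 for a in ALPHABET] for base in seq.upper()]


def kmer_frequencies(seq, k=2):
    """Return normalized k-mer frequencies in lexicographic k-mer order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:        # skip windows containing unknown bases
            counts[kmer] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]
```

One-hot encoding preserves position and yields a length-by-4 matrix, whereas the k-mer vector has a fixed size of 4^k regardless of sequence length, which is why the two are commonly used together.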
|
26
|
Wang F, Feng X, Kong R, Chang S. Generating new protein sequences by using dense network and attention mechanism. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2023; 20:4178-4197. [PMID: 36899622 DOI: 10.3934/mbe.2023195] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/18/2023]
Abstract
Protein engineering uses de novo protein design technology to change a protein's gene sequence and thereby improve its physical and chemical properties. These newly generated proteins better meet research needs in terms of properties and functions. The Dense-AutoGAN model is based on a GAN combined with an attention mechanism to generate protein sequences. In this GAN architecture, the attention mechanism and the encoder-decoder improve the similarity of generated sequences and keep variations within a smaller range of the original. Meanwhile, a new convolutional neural network is constructed using the Dense network. The dense network transmits across multiple layers of the generator network of the GAN architecture, which expands the training space and improves the effectiveness of sequence generation. Finally, complex protein sequences are generated based on the mapping of protein functions. Comparisons with other models, based on the sequences generated by Dense-AutoGAN, verify the model's performance. The newly generated proteins are highly accurate and effective in their chemical and physical properties.
Affiliation(s)
- Feng Wang
- School of Computer Engineering, Suzhou Vocational University, Suzhou, China
- Information Engineering Department, Changzhou University Huaide College, Taizhou, China
| | - Xiaochen Feng
- Information Engineering Department, Changzhou University Huaide College, Taizhou, China
| | - Ren Kong
- Institute of Bioinformatics and Medical Engineering, Jiangsu University of Technology, Changzhou, China
| | - Shan Chang
- Institute of Bioinformatics and Medical Engineering, Jiangsu University of Technology, Changzhou, China
|
27
|
Xia K, Liu X, Wee J. Persistent Homology for RNA Data Analysis. Methods Mol Biol 2023; 2627:211-229. [PMID: 36959450 DOI: 10.1007/978-1-0716-2974-1_12] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/25/2023]
Abstract
Molecular representations are of great importance for machine learning models in RNA data analysis. Efficient molecular descriptors or fingerprints that characterize the intrinsic structural and interactional information of RNAs can significantly boost the performance of learning models. In this paper, we introduce two persistent models, persistent homology and persistent spectral models, for RNA structure and interaction representations and their applications in RNA data analysis. Unlike traditional geometric and graph representations, persistent homology is built on simplicial complexes, which generalize graph models to higher dimensions. Hypergraphs are a further generalization of simplicial complexes, and hypergraph-based embedded persistent homology has been proposed recently. Moreover, persistent spectral models, which combine a filtration process with spectral models, including spectral graph, spectral simplicial complex, and spectral hypergraph models, are proposed for molecular representation. The persistent attributes of RNAs can be obtained from these two persistent models and further combined with machine learning models for RNA structure, flexibility, dynamics, and function analysis.
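To make the filtration idea concrete, the sketch below computes only the simplest persistent attribute, the 0-dimensional barcode (connected components) of a point cloud under a Vietoris-Rips filtration. It is an illustrative sketch, not the chapter's method: the function name `h0_barcode` is hypothetical, the input points stand in for coordinates such as atom positions (an assumption), and the filtration value is taken to be the pairwise distance itself rather than half of it.

```python
import math


def h0_barcode(points):
    """Return 0-dimensional persistence bars (birth, death) for a point cloud.

    Every point is born as its own component at filtration value 0. A
    component dies at the length of the edge that first merges it into
    another component; one component never dies (death = infinity).
    """
    n = len(points)
    if n == 0:
        return []
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Edges enter the filtration in order of increasing length.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i in range(n) for j in range(i + 1, n)
    )
    bars = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:          # this edge merges two components
            parent[root_i] = root_j
            bars.append((0.0, length))
    bars.append((0.0, math.inf))      # the component that survives
    return bars
```

The finite death values are exactly the edge lengths of a minimum spanning tree, so long bars indicate well-separated clusters. Higher-dimensional features such as loops and cavities require boundary-matrix reduction, which is beyond this sketch.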
Affiliation(s)
- Kelin Xia
- Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore, Singapore.
| | - Xiang Liu
- Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore, Singapore
- Chern Institute of Mathematics and LPMC, Nankai University, Tianjin, China
| | - JunJie Wee
- Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore, Singapore
|
28
|
Chen M, Zhang X, Ju Y, Liu Q, Ding Y. iPseU-TWSVM: Identification of RNA pseudouridine sites based on TWSVM. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2022; 19:13829-13850. [PMID: 36654069 DOI: 10.3934/mbe.2022644] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/17/2023]
Abstract
Biological sequence analysis is an important basic research work in the field of bioinformatics. With the explosive growth of data, machine learning methods play an increasingly important role in biological sequence analysis. By constructing a classifier for prediction, the input sequence feature vector is predicted and evaluated, and the knowledge of gene structure, function and evolution is obtained from a large amount of sequence information, which lays a foundation for researchers to carry out in-depth research. At present, many machine learning methods have been applied to biological sequence analysis such as RNA gene recognition and protein secondary structure prediction. As a biological sequence, RNA plays an important biological role in the encoding, decoding, regulation and expression of genes. The analysis of RNA data is currently carried out from the aspects of structure and function, including secondary structure prediction, non-coding RNA identification and functional site prediction. Pseudouridine (Ψ) is the most widespread and rich RNA modification and has been discovered in a variety of RNAs. It is highly essential for the study of related functional mechanisms and disease diagnosis to accurately identify Ψ sites in RNA sequences. At present, several computational approaches have been suggested as an alternative to experimental methods to detect Ψ sites, but there is still potential for improvement in their performance. In this study, we present a model based on twin support vector machine (TWSVM) for Ψ site identification. The model combines a variety of feature representation techniques and uses the max-relevance and min-redundancy methods to obtain the optimum feature subset for training. The independent testing accuracy is improved by 3.4% in comparison to current advanced Ψ site predictors. The outcomes demonstrate that our model has better generalization performance and improves the accuracy of Ψ site identification. iPseU-TWSVM can be a helpful tool to identify Ψ sites.
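The max-relevance and min-redundancy selection mentioned in this abstract can be sketched as a greedy procedure. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the mutual-information difference criterion (relevance minus mean redundancy) on discrete features, and the function names and the dict-of-lists input layout are hypothetical.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log2(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )


def mrmr(features, labels, n_select):
    """Greedily pick features with high relevance and low redundancy.

    features maps a feature name to its list of discrete values.
    Score = I(feature; labels) - mean I(feature; already selected).
    """
    relevance = {name: mutual_information(values, labels)
                 for name, values in features.items()}
    selected, remaining = [], list(features)

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(
            mutual_information(features[name], features[s]) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while remaining and len(selected) < n_select:
        best = max(remaining, key=score)   # ties go to the earliest feature
        selected.append(best)
        remaining.remove(best)
    return selected
```

In a toy case where one feature is an exact copy of an already selected feature, the copy's redundancy cancels its relevance, so a less correlated feature is preferred; ranking by relevance alone would select both copies.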
Affiliation(s)
- Mingshuai Chen
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China
| | - Xin Zhang
- Beidahuang Industry Group General Hospital, Harbin, China
| | - Ying Ju
- School of Informatics, Xiamen University, Xiamen, China
| | - Qing Liu
- Department of Anesthesiology, Hospital (T.C.M) Affiliated to Southwest Medical University, Luzhou, China
| | - Yijie Ding
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China
|
29
|
Design and Optimization of Aesthetic Education Teaching Information Platform Based on Big Data Analysis. COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE 2022; 2022:5109638. [PMID: 35990160 PMCID: PMC9388248 DOI: 10.1155/2022/5109638] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/13/2022] [Accepted: 07/17/2022] [Indexed: 11/17/2022]
Abstract
In the process of promoting aesthetic education, some schools face problems such as insufficient development of the campus aesthetic education environment and a lack of aesthetic thinking across disciplines. To address these problems, this paper combines the concept of the flipped classroom with the characteristics of task-driven teaching in artificial intelligence. Using PHP, HTML + CSS + JS, and other technologies as the main development stack, and relying on a flipped classroom teaching mode within a network learning space, it constructs an artificial intelligence core course website as a teaching platform for graduate teaching and undergraduate extended learning. The platform, which uses a genetic algorithm to seek the optimal solution to multiple combinatorial optimization problems, effectively improves the teaching quality of artificial intelligence courses and students' learning efficiency.
|
30
|
Thi Phan L, Woo Park H, Pitti T, Madhavan T, Jeon YJ, Manavalan B. MLACP 2.0: An updated machine learning tool for anticancer peptide prediction. Comput Struct Biotechnol J 2022; 20:4473-4480. [PMID: 36051870 PMCID: PMC9421197 DOI: 10.1016/j.csbj.2022.07.043] [Citation(s) in RCA: 31] [Impact Index Per Article: 15.5] [Received: 04/18/2022] [Revised: 07/25/2022] [Accepted: 07/25/2022] [Indexed: 12/24/2022] Open
Abstract
We present a novel meta-approach, MLACP 2.0, and implement it as a user-friendly webserver for the accurate identification of anticancer peptides (ACPs). MLACP 2.0 employed 11 different encoding schemes and eight different classifiers, including convolutional neural networks, to create a stable meta-model. A benchmarking study has demonstrated that MLACP 2.0 achieves superior performance in ACP prediction compared to publicly available state-of-the-art predictors.
Anticancer peptides are emerging anticancer drugs that offer fewer side effects and are more effective than chemotherapy and targeted therapy. Predicting anticancer peptides from sequence information is one of the most challenging tasks in immunoinformatics. In the past ten years, machine learning-based approaches have been proposed for identifying ACP activity from peptide sequences. These methods include our previous method MLACP (developed in 2017), which made a significant impact on anticancer research. The MLACP tool has been widely used by the research community; however, its robustness must be improved significantly for its continued practical application. In this study, the first large non-redundant training and independent datasets were constructed for ACP research. Using the training dataset, the study explored a wide range of feature encodings and developed their respective models using seven different conventional classifiers. Subsequently, a subset of encoding-based models was selected for each classifier based on performance; their predicted scores were concatenated and used to train a convolutional neural network (CNN), and the resulting predictor is named MLACP 2.0. The evaluation of MLACP 2.0 on a very diverse independent dataset showed excellent performance, significantly outperforming recent ACP prediction tools. Additionally, MLACP 2.0 exhibits superior performance during cross-validation and independent assessment when compared to CNN-based embedding models and conventional single models. Consequently, we anticipate that MLACP 2.0 will facilitate the design of hypothesis-driven experiments by making it easier to discover novel ACPs. MLACP 2.0 is freely available at https://balalab-skku.org/mlacp2.
|
31
|
Chen B, Shi Y, Li J, Zhai J, Liu L, Liu W, Hu L, Zhao Y. Tissue Recognition Based on Electrical Impedance Classified by Support Vector Machine in Spinal Operation Area. Orthop Surg 2022; 14:2276-2285. [PMID: 35913262 PMCID: PMC9483044 DOI: 10.1111/os.13406] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/25/2022] [Revised: 06/24/2022] [Accepted: 06/25/2022] [Indexed: 11/26/2022] Open
Abstract
OBJECTIVE One of the major difficulties in spinal surgery is injury to important tissues caused by tissue misclassification, which is a source of surgical complications. Accurate recognition of tissues is key to improving the safety and effectiveness of spinal surgery and to reducing its complications. The study aimed at tissue recognition in the spinal operation area based on electrical impedance and at determining the boundaries of electrical impedance between cortical bone, cancellous bone, spinal cord, muscle, and nucleus pulposus.
METHODS Two female white swine with a body weight of 40 kg were used to expose cortical bone, cancellous bone, spinal cord, muscle, and nucleus pulposus under general anesthesia and aseptic conditions. The electrical impedance of these tissues at 12 frequencies (in the range of 10-100 kHz) was measured with an electrochemical analyzer and a specially designed probe, at 22.0-25.0°C and 50%-60% humidity. Two types of tissue recognition models were constructed, one combining principal component analysis (PCA) and support vector machine (SVM) and the other combining SVM and ensemble learning, and the boundaries of electrical impedance of the five tissues at the 12 frequencies were determined. Linear correlation, two-way ANOVA, and paired t-tests were conducted to analyze the relationship between the electrical impedance of different tissues at different frequencies.
RESULTS The results suggest that the differences in electrical impedance mainly came from tissue type (p < 0.0001), and the electrical impedance of the five kinds of tissue differed statistically from each other (p < 0.0001). The tissue recognition accuracy of the algorithm based on principal component analysis and support vector machine ranged from 83% to 100%, and the overall accuracy was 95.83%. The classification accuracy of the algorithm based on support vector machine and ensemble learning was 100%, and the boundaries of electrical impedance of the five tissues at various frequencies were calculated.
CONCLUSION The electrical impedance of cortical bone, cancellous bone, spinal cord, muscle, and nucleus pulposus differed significantly in the 10-100 kHz frequency range. The application of support vector machine achieved accurate tissue recognition in the spinal operation area based on electrical impedance, which is expected to be translated and applied to tissue recognition during spinal surgery.
Affiliation(s)
- Bingrong Chen
- Department of Orthopaedic Surgery, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
| | - Yongwang Shi
- MD Program, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
| | - Jiahao Li
- Department of Orthopaedic Surgery, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
| | - Jiliang Zhai
- Department of Orthopaedic Surgery, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
| | - Liang Liu
- China Astronaut Research and Training Center, Beijing, China
| | - Wenyong Liu
- School of Biological Science and Medical Engineering, Beihang University, Beijing, China
| | - Lei Hu
- School of Mechanical Engineering and Automation, Beihang University, Beijing, China
| | - Yu Zhao
- Department of Orthopaedic Surgery, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
|
32
|
Câmara GBM, Coutinho MGF, da Silva LMD, Gadelha WVDN, Torquato MF, Barbosa RDM, Fernandes MAC. Convolutional Neural Network Applied to SARS-CoV-2 Sequence Classification. SENSORS (BASEL, SWITZERLAND) 2022; 22:5730. [PMID: 35957287 PMCID: PMC9371030 DOI: 10.3390/s22155730] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/29/2022] [Revised: 07/28/2022] [Accepted: 07/28/2022] [Indexed: 06/15/2023]
Abstract
COVID-19, the illness caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), a virus of the Coronaviridae family with a single-stranded positive-sense RNA genome, has been spreading around the world and has been declared a pandemic by the World Health Organization. On 17 January 2022, there were more than 329 million cases, with more than 5.5 million deaths. Although COVID-19 has a low mortality rate, its high capacities for contamination, spread, and mutation worry the authorities, especially after the emergence of the Omicron variant, which has a high transmission capacity and can more easily infect even vaccinated people. Such outbreaks require elucidation of the taxonomic classification and origin of the virus (SARS-CoV-2) from the genomic sequence for strategic planning, containment, and treatment of the disease. Thus, this work proposes a high-accuracy technique to classify viruses and other organisms from a genome sequence using a deep learning convolutional neural network (CNN). Unlike other approaches in the literature, the proposed approach does not limit the length of the genome sequence. The results show that the proposal accurately distinguishes SARS-CoV-2 from the sequences of other viruses. The results were obtained from 1557 instances of SARS-CoV-2 from the National Center for Biotechnology Information (NCBI) and 14,684 different viruses from the Virus-Host DB. As a CNN has several changeable parameters, the tests were performed with forty-eight different architectures; the best of these had an accuracy of 91.94 ± 2.62% in classifying viruses into their realms correctly, in addition to 100% accuracy in classifying SARS-CoV-2 into its respective realm, Riboviria. For the subsequent classifications (family, genus, and subgenus), this accuracy increased, which shows that the proposed architecture may be viable for classifying the virus that causes COVID-19.
Affiliation(s)
- Gabriel B. M. Câmara
- Bioinformatics Multidisciplinary Environment (BioME), Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil;
- Laboratory of Machine Learning and Intelligent Instrumentation, Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil; (M.G.F.C.); (L.M.D.d.S.); (W.V.d.N.G.); (M.F.T.)
| | - Maria G. F. Coutinho
- Laboratory of Machine Learning and Intelligent Instrumentation, Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil; (M.G.F.C.); (L.M.D.d.S.); (W.V.d.N.G.); (M.F.T.)
| | - Lucileide M. D. da Silva
- Laboratory of Machine Learning and Intelligent Instrumentation, Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil; (M.G.F.C.); (L.M.D.d.S.); (W.V.d.N.G.); (M.F.T.)
- Federal Institute of Education, Science and Technology of Rio Grande do Norte, Paraiso, Santa Cruz 59200-000, RN, Brazil
| | - Walter V. do N. Gadelha
- Laboratory of Machine Learning and Intelligent Instrumentation, Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil; (M.G.F.C.); (L.M.D.d.S.); (W.V.d.N.G.); (M.F.T.)
| | - Matheus F. Torquato
- Laboratory of Machine Learning and Intelligent Instrumentation, Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil; (M.G.F.C.); (L.M.D.d.S.); (W.V.d.N.G.); (M.F.T.)
| | - Raquel de M. Barbosa
- Laboratory of Machine Learning and Intelligent Instrumentation, Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil; (M.G.F.C.); (L.M.D.d.S.); (W.V.d.N.G.); (M.F.T.)
- Department of Pharmacy and Pharmaceutical Technology, University of Granada, 18071 Granada, Spain
| | - Marcelo A. C. Fernandes
- Bioinformatics Multidisciplinary Environment (BioME), Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil;
- Laboratory of Machine Learning and Intelligent Instrumentation, Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil; (M.G.F.C.); (L.M.D.d.S.); (W.V.d.N.G.); (M.F.T.)
- Department of Computer Engineering and Automation, Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil
|
33
|
Jeon YJ, Hasan MM, Park HW, Lee KW, Manavalan B. TACOS: a novel approach for accurate prediction of cell-specific long noncoding RNAs subcellular localization. Brief Bioinform 2022; 23:6618237. [PMID: 35753698 PMCID: PMC9294414 DOI: 10.1093/bib/bbac243] [Citation(s) in RCA: 24] [Impact Index Per Article: 12.0] [Received: 03/01/2022] [Revised: 05/23/2022] [Accepted: 05/24/2022] [Indexed: 11/14/2022] Open
Abstract
Long noncoding RNAs (lncRNAs) are primarily regulated by their cellular localization, which is responsible for their molecular functions, including cell cycle regulation and genome rearrangements. Accurately identifying the subcellular location of lncRNAs from sequence information is crucial for a better understanding of their biological functions and mechanisms. In contrast to traditional experimental methods, bioinformatics or computational methods can be applied for the annotation of lncRNA subcellular locations in humans more effectively. In the past, several machine learning-based methods have been developed to identify lncRNA subcellular localization, but relevant work for identifying cell-specific localization of human lncRNA remains limited. In this study, we present the first application of the tree-based stacking approach, TACOS, which allows users to identify the subcellular localization of human lncRNA in 10 different cell types. Specifically, we conducted comprehensive evaluations of six tree-based classifiers with 10 different feature descriptors, using a newly constructed balanced training dataset for each cell type. Subsequently, the strengths of the AdaBoost baseline models were integrated via a stacking approach, with an appropriate tree-based classifier for the final prediction. TACOS displayed consistent performance in both the cross-validation and independent assessments compared with the other two approaches employed in this study. The user-friendly online TACOS web server can be accessed at https://balalab-skku.org/TACOS.
Affiliation(s)
- Young-Jun Jeon
- Department of Integrative Biotechnology, College of Bioengineering and Biotechnology, Sungkyunkwan University, Suwon 16419, Korea
| | - Md Mehedi Hasan
- Tulane Center for Biomedical Informatics and Genomics, Division of Biomedical Informatics and Genomics, John W. Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA 70112, USA
| | - Hyun Woo Park
- Department of Integrative Biotechnology, College of Bioengineering and Biotechnology, Sungkyunkwan University, Suwon 16419, Korea
| | - Ki Wook Lee
- Department of Integrative Biotechnology, College of Bioengineering and Biotechnology, Sungkyunkwan University, Suwon 16419, Korea
| | - Balachandran Manavalan
- Computational Biology and Bioinformatics laboratory, Department of Integrative Biotechnology, College of Bioengineering and Biotechnology, Sungkyunkwan University, Suwon 16419, Korea
| |
Collapse
|
34
|
Bonidia RP, Santos APA, de Almeida BLS, Stadler PF, da Rocha UN, Sanches DS, de Carvalho ACPLF. BioAutoML: automated feature engineering and metalearning to predict noncoding RNAs in bacteria. Brief Bioinform 2022; 23:6618238. [PMID: 35753697 PMCID: PMC9294424 DOI: 10.1093/bib/bbac218] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 03/02/2022] [Revised: 05/06/2022] [Accepted: 05/09/2022] [Indexed: 01/19/2023] Open
Abstract
Recent technological advances have led to an exponential expansion of biological sequence data and extraction of meaningful information through Machine Learning (ML) algorithms. This knowledge has improved the understanding of mechanisms related to several fatal diseases, e.g. Cancer and coronavirus disease 2019, helping to develop innovative solutions, such as CRISPR-based gene editing, coronavirus vaccine and precision medicine. These advances benefit our society and economy, directly impacting people’s lives in various areas, such as health care, drug discovery, forensic analysis and food processing. Nevertheless, ML-based approaches to biological data require representative, quantitative and informative features. Many ML algorithms can handle only numerical data, and therefore sequences need to be translated into a numerical feature vector. This process, known as feature extraction, is a fundamental step for developing high-quality ML-based models in bioinformatics, by allowing the feature engineering stage, with design and selection of suitable features. Feature engineering, ML algorithm selection and hyperparameter tuning are often manual and time-consuming processes, requiring extensive domain knowledge. To deal with this problem, we present a new package: BioAutoML. BioAutoML automatically runs an end-to-end ML pipeline, extracting numerical and informative features from biological sequence databases, using the MathFeature package, and automating the feature selection, ML algorithm(s) recommendation and tuning of the selected algorithm(s) hyperparameters, using Automated ML (AutoML). BioAutoML has two components, divided into four modules: (1) automated feature engineering (feature extraction and selection modules) and (2) Metalearning (algorithm recommendation and hyper-parameter tuning modules). 
We experimentally evaluate BioAutoML in two different scenarios: (i) prediction of the three main classes of noncoding RNAs (ncRNAs) and (ii) prediction of the eight categories of ncRNAs in bacteria, including housekeeping and regulatory types. To assess BioAutoML predictive performance, it is experimentally compared with two other AutoML tools (RECIPE and TPOT). According to the experimental results, BioAutoML can accelerate new studies, reducing the cost of feature engineering processing and either keeping or improving predictive performance. BioAutoML is freely available at https://github.com/Bonidia/BioAutoML.
Affiliation(s)
- Robson P Bonidia
  - Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos 13566-590, Brazil
- Anderson P Avila Santos
  - Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos 13566-590, Brazil; Department of Environmental Microbiology, Helmholtz Centre for Environmental Research-UFZ GmbH, Leipzig, Saxony, Germany
- Breno L S de Almeida
  - Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos 13566-590, Brazil
- Peter F Stadler
  - Department of Computer Science and Interdisciplinary Center of Bioinformatics, University of Leipzig, Leipzig, Saxony, Germany
- Ulisses N da Rocha
  - Department of Environmental Microbiology, Helmholtz Centre for Environmental Research-UFZ GmbH, Leipzig, Saxony, Germany
- Danilo S Sanches
  - Department of Computer Science, Federal University of Technology - Paraná, UTFPR, Cornélio Procópio 86300-000, Brazil
- André C P L F de Carvalho
  - Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos 13566-590, Brazil
|
35
|
Mathur G, Pandey A, Goyal S. A comprehensive tool for rapid and accurate prediction of disease using DNA sequence classifier. Journal of Ambient Intelligence and Humanized Computing 2022; 14:1-17. [PMID: 35789598 PMCID: PMC9243743 DOI: 10.1007/s12652-022-04099-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/10/2021] [Accepted: 06/06/2022] [Indexed: 06/15/2023]
Abstract
In the current pandemic, the coronavirus is spreading very fast and can jump from one human to another. In addition, there are millions of viruses, for example Ebola and SARS, that are equally deadly and can spread as fast as the coronavirus due to the mobilization and globalization of the population. Earlier identification of these viruses can prevent outbreaks like the one we are currently facing and can help in the earlier design of drugs. Identification of disease at an early stage can be achieved through DNA sequence classification, as DNA carries most of the genetic information about organisms. This is why the classification of DNA sequences plays an important role in computational biology. This paper presents a solution in which samples collected from NCBI are used for the classification of DNA sequences. DNA sequence classification in turn gives the patterns of various diseases; these patterns are then compared with the samples of a newly infected person and can help in the earlier identification of disease. However, feature extraction remains a major issue. In this paper, a machine learning-based classifier and a new technique for extracting features from DNA sequences based on a hot vector matrix are proposed. In the hot vector representation of the DNA sequence, each pair of the word is represented using a binary matrix that represents the position of each nucleotide in the DNA sequence. The resultant matrix is then given as input to a traditional CNN for feature extraction. The proposed method was compared with five well-known classifiers, namely convolutional neural network (CNN), support vector machine (SVM), K-nearest neighbor (KNN), decision trees, and recurrent neural network (RNN), on several parameters including precision and accuracy; the results show that the proposed method gives an accuracy of 93.9%, the highest among the compared classifiers.
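The binary position matrix mentioned in this abstract can be illustrated with standard one-hot encoding. The following is a minimal sketch only: the abstract does not specify the authors' exact "hot vector" scheme, so the function name `one_hot_encode`, the A/C/G/T column order, and the all-zero row for unknown symbols are assumptions made for illustration.

```python
# Standard one-hot encoding of a DNA sequence.
# Illustrative only; this is not confirmed to be the paper's exact scheme.
NUCLEOTIDES = "ACGT"  # assumed column order


def one_hot_encode(sequence):
    """Return an L x 4 binary matrix; row i marks the nucleotide at position i."""
    matrix = []
    for base in sequence.upper():
        # Symbols outside A/C/G/T (e.g. 'N') become an all-zero row (an assumption).
        matrix.append([1 if base == symbol else 0 for symbol in NUCLEOTIDES])
    return matrix


if __name__ == "__main__":
    for row in one_hot_encode("ACGTN"):
        print(row)
```

A matrix of this shape is the kind of input a convolutional network could consume, with one row per sequence position and one column per nucleotide.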
Affiliation(s)
- Garima Mathur
  - Department of Computer Science and Engineering, UIT, RGPV, Bhopal, India
- Anjana Pandey
  - Department of Information Technology, UIT, RGPV, Bhopal, India
- Sachin Goyal
  - Department of Information Technology, UIT, RGPV, Bhopal, India
|
36
|
Xia Y, Jiang M, Luo Y, Feng G, Jia G, Zhang H, Wang P, Ge R. SuccSPred2.0: A Two-Step Model to Predict Succinylation Sites Based on Multifeature Fusion and Selection Algorithm. J Comput Biol 2022; 29:1085-1094. [PMID: 35714347 DOI: 10.1089/cmb.2022.0109] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022] Open
Abstract
Protein succinylation is a novel type of post-translational modification reported in the recent decade. Experiments have verified that it plays an important role in biological structure and function. However, wet-lab experimental identification of succinylation sites is time-consuming and laborious, and traditional technology cannot keep up with the rapid growth of biological sequence data sets. In this study, a new computational method named SuccSPred2.0 was proposed to identify succinylation sites in protein sequences based on multifeature fusion and the maximal information coefficient (MIC) method. SuccSPred2.0 was implemented with a two-step strategy. First, high-dimensional features were reduced by linear discriminant analysis to prevent overfitting. Subsequently, the MIC method was employed to select the important features, which were combined with classifiers to predict succinylation sites. In comparative experiments on 10-fold cross-validation and independent test data sets, SuccSPred2.0 obtained promising improvements and was superior to previous tools in identifying succinylation sites in the given proteins.
Affiliation(s)
- Yixiao Xia
  - School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China
- Minchao Jiang
  - School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China
- Yizhang Luo
  - School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China
- Guanwen Feng
  - Xi'an Key Laboratory of Big Data and Intelligent Vision, School of Computer Science and Technology, Xidian University, Xi'an, China
- Gangyong Jia
  - School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China
- Hua Zhang
  - School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China
- Pu Wang
  - Computer School, Hubei University of Arts and Science, Xiangyang, China
- Ruiquan Ge
  - School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China
|
37
|
Feng G, Yao H, Li C, Liu R, Huang R, Fan X, Ge R, Miao Q. ME-ACP: Multi-view neural networks with ensemble model for identification of anticancer peptides. Comput Biol Med 2022; 145:105459. [DOI: 10.1016/j.compbiomed.2022.105459] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/15/2022] [Revised: 03/22/2022] [Accepted: 03/24/2022] [Indexed: 12/26/2022]
|
38
|
Liu P, Ding Y, Rong Y, Chen D. Prediction of cell penetrating peptides and their uptake efficiency using random forest-based feature selections. AIChE J 2022. [DOI: 10.1002/aic.17781] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/07/2022]
Affiliation(s)
- Peng Liu
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
  - Institute of Yangtze Delta Region (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
- Yijie Ding
  - Institute of Yangtze Delta Region (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
- Ying Rong
  - Beidahuang Industry Group General Hospital, Harbin, China
- Dong Chen
  - College of Electrical and Information Engineering, Quzhou University, Quzhou, China
|
39
|
Hasan MM, Tsukiyama S, Cho JY, Kurata H, Alam MA, Liu X, Manavalan B, Deng HW. Deepm5C: A deep learning-based hybrid framework for identifying human RNA N5-methylcytosine sites using a stacking strategy. Mol Ther 2022; 30:2856-2867. [PMID: 35526094 PMCID: PMC9372321 DOI: 10.1016/j.ymthe.2022.05.001] [Citation(s) in RCA: 46] [Impact Index Per Article: 23.0] [Received: 10/09/2021] [Revised: 04/25/2022] [Accepted: 05/03/2022] [Indexed: 11/30/2022] Open
Abstract
As one of the most prevalent post-transcriptional epigenetic modifications, N5-methylcytosine (m5C) plays an essential role in various cellular processes and disease pathogenesis. Therefore, it is important to accurately identify m5C modifications in order to gain a deeper understanding of cellular processes and other possible functional mechanisms. Although a few computational methods have been proposed, their respective models have been developed using small training datasets. Hence, their practical application in genome-wide detection is quite limited. To overcome the existing limitations, we propose Deepm5C, a bioinformatics method to identify RNA m5C sites throughout the human genome. To develop Deepm5C, we constructed a novel benchmarking dataset and investigated a mixture of three conventional feature encoding algorithms and a feature derived from word embedding approaches. Afterwards, four variants of deep learning classifiers and four commonly used conventional classifiers were employed and trained with the four encodings, ultimately obtaining 32 baseline models. A stacking strategy is then applied by integrating the predicted output of the optimal baseline models and training a 1-D convolutional neural network on it. As a result, the Deepm5C predictor achieved excellent performance during cross-validation, with a Matthews correlation coefficient and accuracy of 0.697 and 0.855, respectively. The corresponding metrics during the independent test were 0.691 and 0.852, respectively. Overall, Deepm5C achieved a more accurate and stable performance than the baseline models and significantly outperformed the existing predictors, demonstrating the effectiveness of our proposed hybrid framework. Furthermore, Deepm5C is expected to assist community-wide efforts in identifying putative m5Cs and formulating novel testable biological hypotheses.
Affiliation(s)
- Md Mehedi Hasan
  - Tulane Center for Biomedical Informatics and Genomics, Division of Biomedical Informatics and Genomics, John W. Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA, 70112 USA.
- Sho Tsukiyama
  - Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan
- Jae Youl Cho
  - Molecular Immunology Laboratory, Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Korea
- Hiroyuki Kurata
  - Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan
- Md Ashad Alam
  - Tulane Center for Biomedical Informatics and Genomics, Division of Biomedical Informatics and Genomics, John W. Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA, 70112 USA
- Xiaowen Liu
  - Tulane Center for Biomedical Informatics and Genomics, Division of Biomedical Informatics and Genomics, John W. Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA, 70112 USA
- Balachandran Manavalan
  - Computational Biology and Bioinformatics Laboratory, Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Korea.
- Hong-Wen Deng
  - Tulane Center for Biomedical Informatics and Genomics, Division of Biomedical Informatics and Genomics, John W. Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA, 70112 USA.
|
40
|
Sequeira AM, Lousa D, Rocha M. ProPythia: A Python package for protein classification based on machine and deep learning. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2021.07.102] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022]
|
41
|
He Z, Xu J, Shi H, Wu S. m5CRegpred: Epitranscriptome Target Prediction of 5-Methylcytosine (m5C) Regulators Based on Sequencing Features. Genes (Basel) 2022; 13:genes13040677. [PMID: 35456483 PMCID: PMC9025882 DOI: 10.3390/genes13040677] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 03/02/2022] [Revised: 04/02/2022] [Accepted: 04/05/2022] [Indexed: 02/04/2023] Open
Abstract
5-methylcytosine (m5C) is a common post-transcriptional modification observed in a variety of RNAs. m5C has been demonstrated to be important in a variety of biological processes, including RNA structural stability and metabolism. Driven by the importance of m5C modification, many projects focused on m5C site prediction have been reported. To better understand the upstream and downstream regulation of m5C, we present a bioinformatics framework, m5CRegpred, to predict the substrates of the m5C writer NSUN2 and the m5C readers YBX1 and ALYREF for the first time. After feature comparison, window length selection and algorithm comparison on the mature mRNA model, our model achieved AUROC scores of 0.869, 0.724 and 0.889 for NSUN2, YBX1 and ALYREF, respectively, in an independent test. Our work suggests that the substrates of m5C regulators can be distinguished, and it may help research on m5C regulators under special conditions, such as substrate prediction for hyper- or hypo-expressed m5C regulators in human disease.
Affiliation(s)
- Zhizhou He
  - Key Laboratory of Ministry of Education for Gastrointestinal Cancer, School of Basic Medical Sciences, Fujian Medical University, Fuzhou 350004, China; (Z.H.); (J.X.)
  - Department of Molecular, Cell, and Developmental Biology, University of California, Santa Cruz, Santa Cruz, CA 95064, USA
- Jing Xu
  - Key Laboratory of Ministry of Education for Gastrointestinal Cancer, School of Basic Medical Sciences, Fujian Medical University, Fuzhou 350004, China; (Z.H.); (J.X.)
- Haoran Shi
  - Research Center for BioSystems, Land Use, and Nutrition (IFZ), Institute of Applied Microbiology, Justus-Liebig-University Giessen, Heinrich-Buff-Ring 26-32, 35392 Giessen, Germany
  - Correspondence: (H.S.); (S.W.)
- Shuxiang Wu
  - Key Laboratory of Ministry of Education for Gastrointestinal Cancer, School of Basic Medical Sciences, Fujian Medical University, Fuzhou 350004, China; (Z.H.); (J.X.)
  - Fujian Key Laboratory of Tumor Microbiology, Department of Medical Microbiology, School of Basic Medical Sciences, Fujian Medical University, Fuzhou 350004, China
  - Correspondence: (H.S.); (S.W.)
|
42
|
Meng C, Ju Y, Shi H. TMPpred: A support vector machine-based thermophilic protein identifier. Anal Biochem 2022; 645:114625. [PMID: 35218736 DOI: 10.1016/j.ab.2022.114625] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/26/2021] [Revised: 02/18/2022] [Accepted: 02/21/2022] [Indexed: 11/13/2022]
Abstract
MOTIVATION: The thermostability of proteins allows them to overcome temperature constraints and perform more functions. Using machine learning, we explored the mechanism of and reasons for protein thermostability. RESULTS: Unlike other methods that pursue only model performance, we aim to find important features so as to provide a strong reference for in vitro experiments. We transformed this problem into a binary classification problem, that is, distinguishing thermophilic proteins from nonthermophilic proteins. Using support vector machine-based model construction and analysis, we inferred that Gly, Ala, Ser and Thr may be the most important components at the residue level that determine the thermal stability of proteins. It is also noteworthy that our proposed model obtains an Sn of 0.892, an Sp of 0.857, an ACC of 0.87566 and an AUC of 0.874. To facilitate the work of other researchers, we wrapped our model and deployed it as a web server, which is accessible at http://112.124.26.17:7000/TMPpred/index.html.
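The Sn, Sp and ACC values reported in this abstract are conventionally read as sensitivity, specificity and accuracy of a binary classifier. The sketch below shows the usual confusion-matrix definitions; the function name `classification_metrics` and the example counts are hypothetical and are not taken from the paper.

```python
def classification_metrics(tp, fn, tn, fp):
    """Compute sensitivity, specificity and accuracy from confusion-matrix counts.

    tp/fn: positives (e.g. thermophilic proteins) predicted correctly/incorrectly.
    tn/fp: negatives predicted correctly/incorrectly.
    """
    sensitivity = tp / (tp + fn)            # Sn: fraction of positives recovered
    specificity = tn / (tn + fp)            # Sp: fraction of negatives recovered
    accuracy = (tp + tn) / (tp + fn + tn + fp)  # ACC: fraction of all cases correct
    return sensitivity, specificity, accuracy


if __name__ == "__main__":
    # Hypothetical counts chosen only to illustrate the arithmetic.
    sn, sp, acc = classification_metrics(tp=90, fn=10, tn=80, fp=20)
    print(f"Sn={sn:.3f} Sp={sp:.3f} ACC={acc:.3f}")
```

When the two classes are of equal size, ACC is the mean of Sn and Sp; with unequal class sizes it is weighted toward the larger class.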
Affiliation(s)
- Chaolu Meng
  - College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot, China; Inner Mongolia Autonomous Region Key Laboratory of Big Data Research and Application for Agriculture and Animal Husbandry, Hohhot, China
- Ying Ju
  - School of Informatics, Xiamen University, Xiamen, China.
- Hua Shi
  - School of Opto-electronic and Communication Engineering, Xiamen University of Technology, Xiamen, China.
|
43
|
Jin Y, Yang Y. ProtPlat: an efficient pre-training platform for protein classification based on FastText. BMC Bioinformatics 2022; 23:66. [PMID: 35148686 PMCID: PMC8832758 DOI: 10.1186/s12859-022-04604-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 04/13/2021] [Accepted: 02/02/2022] [Indexed: 11/24/2022] Open
Abstract
Background: Over the past decades, benefitting from the rapid growth of protein sequence data in public databases, many machine learning methods have been developed to predict physicochemical properties or functions of proteins using amino acid sequence features. However, prediction performance often suffers from the lack of labeled data. In recent years, pre-training methods have been widely studied to address the small-sample issue in computer vision and natural language processing, while specific pre-training techniques for protein sequences are few. Results: In this paper, we propose a pre-training platform for representing protein sequences, called ProtPlat, which uses the Pfam database to train a three-layer neural network and then uses specific training data from downstream tasks to fine-tune the model. ProtPlat can learn good representations for amino acids and at the same time achieve efficient classification. We conduct experiments on three protein classification tasks: the identification of type III secreted effectors, the prediction of subcellular localization, and the recognition of signal peptides. The experimental results show that pre-training can enhance model performance effectively and that ProtPlat is competitive with state-of-the-art predictors, especially for small datasets. We implement the ProtPlat platform as a web service (https://compbio.sjtu.edu.cn/protplat) that is accessible to the public. Conclusions: To enhance the feature representation of protein amino acid sequences and improve the performance of sequence-based classification tasks, we develop ProtPlat, a general platform for the pre-training of protein sequences, which features large-scale supervised training based on the Pfam database and an efficient learning model, FastText. The experimental results of three downstream classification tasks demonstrate the efficacy of ProtPlat.
Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04604-2.
Affiliation(s)
- Yuan Jin
  - Department of Computer Science and Engineering, Shanghai Jiao Tong University, and Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai, 200240, China
- Yang Yang
  - Department of Computer Science and Engineering, Shanghai Jiao Tong University, and Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai, 200240, China.
|
44
|
Samami E, Pourali G, Arabpour M, Fanipakdel A, Shahidsales S, Javadinia SA, Hassanian SM, Mohammadparast S, Avan A. The Potential Diagnostic and Prognostic Value of Circulating MicroRNAs in the Assessment of Patients With Prostate Cancer: Rational and Progress. Front Oncol 2022; 11:716831. [PMID: 35186706 PMCID: PMC8855122 DOI: 10.3389/fonc.2021.716831] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 05/29/2021] [Accepted: 12/31/2021] [Indexed: 12/20/2022] Open
Abstract
Prostate cancer (P.C.) is one of the most frequently diagnosed cancers among men and the leading cause of death, with an annual incidence of 1.4 million worldwide. Prostate-specific antigen is used for screening and diagnosis of prostate disease, although it is associated with several limitations. Thus, identification of novel biomarkers is warranted for diagnosis of patients at earlier stages. MicroRNAs (miRNAs) have recently emerged as potential biomarkers. It has been shown that these small molecules can circulate in body fluids and prognosticate the risk of developing P.C. Several miRNAs, including miR-20a, miR-21, miR-375, miR-378, and miR-141, have been proposed to be expressed in prostate cancer. This review summarizes the current knowledge about possible molecular mechanisms and the potential application of tissue-specific and circulating microRNAs as diagnostic, prognostic, and therapeutic targets in prostate cancer.
Affiliation(s)
- Elham Samami
  - Network of Immunity in Infection, Malignancy and Autoimmunity (NIIMA), Universal Scientific Education and Research Network (USERN), Tehran University of Medical Sciences, Tehran, Iran
- Ghazaleh Pourali
  - Cancer Research Center, Mashhad University of Medical Sciences, Mashhad, Iran
  - Metabolic Syndrome Research Center, Mashhad University of Medical Sciences, Mashhad, Iran
- Mahla Arabpour
  - Metabolic Syndrome Research Center, Mashhad University of Medical Sciences, Mashhad, Iran
- Azar Fanipakdel
  - Cancer Research Center, Mashhad University of Medical Sciences, Mashhad, Iran
- Seyed Alireza Javadinia
  - Vasei Clinical Research Development Unit, Sabzevar University of Medical Sciences, Sabzevar, Iran
- Seyed Mahdi Hassanian
  - Metabolic Syndrome Research Center, Mashhad University of Medical Sciences, Mashhad, Iran
- Saeid Mohammadparast
  - Department of Cell, Developmental and Integrative Biology, University of Alabama at Birmingham, Birmingham, AL, United States
- Amir Avan
  - Metabolic Syndrome Research Center, Mashhad University of Medical Sciences, Mashhad, Iran
  - Basic Medical Sciences Institute, Mashhad University of Medical Sciences, Mashhad, Iran
  - *Correspondence: Amir Avan,
|
45
|
DNA Methylation Biomarkers-Based Human Age Prediction Using Machine Learning. Computational Intelligence and Neuroscience 2022; 2022:8393498. [PMID: 35111213 PMCID: PMC8803417 DOI: 10.1155/2022/8393498] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 10/30/2021] [Revised: 11/20/2021] [Accepted: 12/22/2021] [Indexed: 12/28/2022]
Abstract
Purpose. Age can be an important clue in uncovering the identity of persons who left biological evidence at crime scenes. With the availability of DNA methylation data, several age prediction models have been developed using statistical and machine learning methods. Epigenetic studies have demonstrated a close association between aging and DNA methylation. Most existing studies focused on healthy samples, whereas diseases may have a significant impact on human age. Therefore, in this article, an age prediction model is proposed using DNA methylation biomarkers for healthy and diseased samples. Methods. The dataset contains 454 healthy samples and 400 diseased samples from publicly available sources, with ages from 1 to 89 years. Six CpG sites that are highly correlated with age are identified from these data using Pearson's correlation coefficient. In this work, the age prediction model is developed using four different machine learning techniques, namely, Multiple Linear Regression, Support Vector Regression, Gradient Boosting Regression, and Random Forest Regression. Separate models are designed for healthy and diseased data. The data are split randomly in an 80:20 ratio for training and testing, respectively. Results. Among all the techniques, the model designed using Random Forest Regression shows the best performance, and Gradient Boosting Regression is the second-best model. For healthy samples, the model achieved a MAD of 2.51 years for training data and 4.85 years for testing data. For diseased samples, a MAD of 3.83 years is obtained for training and 9.53 years for testing. Conclusion. These results showed that the proposed model can predict age for healthy and diseased samples.
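The two quantities this abstract relies on, Pearson's correlation coefficient for ranking CpG sites and MAD for scoring predictions, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the abstract does not expand "MAD", so it is assumed here to mean the mean absolute deviation between predicted and chronological age, and the function names and toy values are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mean_absolute_deviation(actual, predicted):
    """Average absolute gap between actual and predicted values (assumed meaning of MAD)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


if __name__ == "__main__":
    # Toy values for illustration only; not data from the study.
    ages = [20, 35, 50, 65]
    methylation = [0.30, 0.42, 0.55, 0.71]   # hypothetical CpG methylation levels
    predicted_ages = [23, 33, 54, 61]
    print("r =", round(pearson_r(methylation, ages), 3))
    print("MAD =", mean_absolute_deviation(ages, predicted_ages), "years")
```

In a workflow like the one described, sites with a large absolute correlation with age would be kept as features, and MAD would be computed separately on the training and testing splits.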
|
46
|
Wang R, Wang Z, Li Z, Lee TY. Residue-Residue Contact Can Be a Potential Feature for the Prediction of Lysine Crotonylation Sites. Front Genet 2022; 12:788467. [PMID: 35058968 PMCID: PMC8764140 DOI: 10.3389/fgene.2021.788467] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2021] [Accepted: 11/23/2021] [Indexed: 11/13/2022] Open
Abstract
Lysine crotonylation (Kcr) is involved in many activities in the human body. Various technologies have been developed for Kcr prediction. Existing methods typically adopt sequence-based features, which consider only the composition of linearly neighboring amino acids. However, modified Kcr sites are neighbored not only by linearly adjacent amino acids but also by residues that spatially surround the target site. In this paper, we used residue-residue contact as a new feature for Kcr prediction, encoding not only linearly surrounding residues but also those spatially near the target site. The spatially surrounding residues were used as a new feature encoding scheme for the first time, named residue-residue composition (RRC) and residue-residue pair composition (RRPC), which were used in supervised learning classification for Kcr prediction. The results show that RRC achieved its best performance at an accuracy of 0.77 and an area under the curve (AUC) of 0.78, and RRPC at an accuracy of 0.74 and an AUC of 0.80. To show that the spatial feature is as significant as other sequence-based features, feature selection was carried out on those sequence-based features together with the RRPC feature. In addition, different radii for the surrounding amino acid compositions were used to compare performance. The assessment showed that the RRC and RRPC features performed competitively with other features and, in some cases, were around 0.20 higher in accuracy or 0.3 higher in AUC than sequence-based features.
Affiliation(s)
- Rulan Wang
  - School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, China
- Zhuo Wang
  - Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, China
- Zhongyan Li
  - Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, China; School of Life and Health Sciences, The Chinese University of Hong Kong, Shenzhen, China
- Tzong-Yi Lee
  - Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, China; School of Life and Health Sciences, The Chinese University of Hong Kong, Shenzhen, China
47
|
Li F, Guo X, Xiang D, Pitt ME, Bainomugisa A, Coin LJ. Computational analysis and prediction of PE_PGRS proteins using machine learning. Comput Struct Biotechnol J 2022; 20:662-674. [PMID: 35140886 PMCID: PMC8804200 DOI: 10.1016/j.csbj.2022.01.019] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 10/21/2021] [Revised: 01/09/2022] [Accepted: 01/18/2022] [Indexed: 12/18/2022] Open
Abstract
Approximately 10% of the Mycobacterium tuberculosis genome comprises two families of genes that remain poorly characterised due to their high GC content and highly repetitive nature. The largest sub-group, the proline-glutamic acid polymorphic guanine-cytosine-rich sequence (PE_PGRS) family, is thought to be involved in host response and disease pathogenicity. Due to their high genetic variability and the complexity of their analysis, these genes are typically disregarded in genomic studies. There are currently limited online resources and homology-based computational tools that can identify and analyse PE_PGRS proteins. In addition, these tools are computationally intensive and time-consuming, and they lack sensitivity. Therefore, computational methods that can rapidly and accurately identify PE_PGRS proteins are valuable for facilitating the functional elucidation of the PE_PGRS family. In this study, we developed the first machine learning-based bioinformatics approach, termed PEPPER, to allow users to identify PE_PGRS proteins rapidly and accurately. PEPPER was built upon a comprehensive evaluation of 13 popular machine learning algorithms with various sequence and physicochemical features. Empirical studies demonstrated that PEPPER achieved significantly better performance than the alignment-based approaches BLASTP and PHMMER in both prediction accuracy and speed. PEPPER is anticipated to facilitate community-wide efforts to conduct high-throughput identification and analysis of PE_PGRS proteins.
Affiliation(s)
- Fuyi Li
- Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, VIC 3000, Australia
| | - Xudong Guo
- School of Information Engineering, Ningxia University, Yinchuan, Ningxia 750021, China
| | - Dongxu Xiang
- Faculty of Engineering and Information Technology, The University of Melbourne, VIC 3000, Australia
| | - Miranda E. Pitt
- Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, VIC 3000, Australia
| | | | - Lachlan J.M. Coin
- Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, VIC 3000, Australia
| |
|
48
|
Chen J, Li F, Wang M, Li J, Marquez-Lago TT, Leier A, Revote J, Li S, Liu Q, Song J. BigFiRSt: A Software Program Using Big Data Technique for Mining Simple Sequence Repeats From Large-Scale Sequencing Data. Front Big Data 2022; 4:727216. [PMID: 35118375 PMCID: PMC8805145 DOI: 10.3389/fdata.2021.727216] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/18/2021] [Accepted: 12/13/2021] [Indexed: 11/22/2022] Open
Abstract
Background: Simple sequence repeats (SSRs) are short tandem repeats of nucleotide sequences. SSRs have been shown to be associated with human diseases and are of medical relevance. Accordingly, a variety of computational methods have been proposed to mine SSRs from genomes. Conventional methods rely on a high-quality complete genome to identify SSRs. However, sequenced genomes often miss several highly repetitive regions, and many non-model species have no complete genome. With recent advances in next-generation sequencing (NGS) techniques, large-scale sequence reads can be rapidly generated for any species. In this context, a number of methods have been proposed to identify thousands of SSR loci within large amounts of reads for non-model species. Because the most commonly used NGS platforms (e.g., the Illumina platform) generally provide short paired-end reads, merging overlapping paired-end reads has become a common step prior to the identification of SSR loci. Merging short read pairs and identifying SSRs from large-scale data poses a big data analysis challenge for traditional stand-alone tools. Results: In this study, we present a new Hadoop-based software program, termed BigFiRSt, to address this problem using cutting-edge big data technology. BigFiRSt consists of two major modules, BigFLASH and BigPERF, implemented based on two state-of-the-art stand-alone tools, FLASH and PERF, respectively. BigFLASH and BigPERF address the problems of merging short read pairs and mining SSRs, respectively, in a big data manner. Comprehensive benchmarking experiments show that BigFiRSt can dramatically reduce the execution times of read pair merging and SSR mining on very large-scale DNA sequence data. Conclusions: The excellent performance of BigFiRSt is mainly attributable to the Hadoop big data technology, which merges read pairs and mines SSRs in parallel through distributed computing on clusters. We anticipate BigFiRSt will be a valuable tool in the coming biological Big Data era.
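To make the SSR mining step concrete, the following is a minimal single-machine sketch that finds perfect tandem repeats in one read with a regular expression. It is not the PERF or BigPERF algorithm and does not cover read merging or Hadoop distribution; the function name `find_ssrs` and the default thresholds (motifs of up to 6 bases, at least 3 repeats) are assumptions chosen for illustration.

```python
import re


def find_ssrs(seq, max_motif=6, min_repeats=3):
    """Return (start, motif, repeat_count) for each perfect SSR in `seq`.

    A motif of 1..max_motif bases is captured lazily, so the shortest
    repeating unit is preferred, and must be followed by at least
    (min_repeats - 1) further copies of itself.
    """
    pattern = re.compile(
        r"([ACGT]{1,%d}?)\1{%d,}" % (max_motif, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        motif = match.group(1)
        repeats = len(match.group(0)) // len(motif)
        hits.append((match.start(), motif, repeats))
    return hits


# Toy read containing a dinucleotide repeat (AC)x4 starting at position 2.
ssrs = find_ssrs("TTACACACACGG")
```

Because each read is scanned independently, a function of this shape could in principle be applied to partitions of reads in parallel, which is the kind of workload a distributed framework is suited to.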
Affiliation(s)
- Jinxiang Chen
- Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
| | - Fuyi Li
- Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC, Australia
- Monash Centre for Data Science, Monash University, Melbourne, VIC, Australia
- Department of Microbiology and Immunity, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Melbourne, VIC, Australia
| | - Miao Wang
- Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
| | - Junlong Li
- Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
| | - Tatiana T. Marquez-Lago
- Department of Genetics, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
- Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
| | - André Leier
- Department of Genetics, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
- Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
| | - Jerico Revote
- Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC, Australia
| | - Shuqin Li
- Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
| | - Quanzhong Liu
- Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
| | - Jiangning Song
- Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC, Australia
- Monash Centre for Data Science, Monash University, Melbourne, VIC, Australia
- *Correspondence: Jiangning Song
| |
|
49
|
Bonidia RP, Domingues DS, Sanches DS, de Carvalho ACPLF. MathFeature: feature extraction package for DNA, RNA and protein sequences based on mathematical descriptors. Brief Bioinform 2022; 23:bbab434. [PMID: 34750626 PMCID: PMC8769707 DOI: 10.1093/bib/bbab434] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 06/29/2021] [Revised: 09/18/2021] [Accepted: 09/20/2021] [Indexed: 12/24/2022] Open
Abstract
One of the main challenges in applying machine learning algorithms to biological sequence data is how to represent a sequence as a numeric input vector. Feature extraction techniques capable of extracting numerical information from biological sequences have been reported in the literature. However, many of these techniques, such as mathematical descriptors, are not available in existing packages. This paper presents a new package, MathFeature, which implements mathematical descriptors able to extract relevant numerical information from biological sequences, i.e. DNA, RNA and proteins (prediction of structural features along the primary sequence of amino acids). MathFeature makes available 20 numerical feature extraction descriptors based on approaches found in the literature, e.g. multiple numeric mappings, genomic signal processing, chaos game theory, entropy and complex networks. MathFeature also allows the extraction of alternative features, complementing the existing packages. To ensure that our descriptors are robust and to assess their relevance, experimental results are presented in nine case studies. According to these results, the features extracted by MathFeature showed high performance (accuracy of 0.6350-0.9897), both when applying only mathematical descriptors and when hybridizing them with well-known descriptors from the literature. Finally, using MathFeature, we outperformed several studies on eight benchmark datasets, exemplifying the robustness and viability of the proposed package. MathFeature advances the area by bringing descriptors not available in other packages, as well as by allowing non-experts to use feature extraction techniques.
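As an illustration of what an entropy-based descriptor computes, the sketch below derives the Shannon entropy of the k-mer distribution of a sequence. This is a generic formulation written for this note; the function name `kmer_entropy` is hypothetical and the code does not use or reproduce the MathFeature API.

```python
import math
from collections import Counter


def kmer_entropy(seq, k=2):
    """Shannon entropy (in bits) of the overlapping k-mer distribution."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0  # sequence shorter than k: no k-mers to count
    counts = Counter(kmers)
    total = len(kmers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Four equally frequent 1-mers give the maximum of log2(4) = 2 bits.
entropy_uniform = kmer_entropy("ACGT", k=1)
# A homopolymer has a single 1-mer and therefore zero entropy.
entropy_constant = kmer_entropy("AAAA", k=1)
```

A single number like this turns a variable-length sequence into a fixed-size feature, and computing it for several values of k gives a short numeric vector suitable as classifier input.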
Affiliation(s)
- Robson P Bonidia
- Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos 13566-590, Brazil
| | - Douglas S Domingues
- Group of Genomics and Transcriptomes in Plants, Institute of Biosciences, São Paulo State University (UNESP), Rio Claro 13506-900, Brazil
| | - Danilo S Sanches
- Department of Computer Science, Federal University of Technology - Paraná, UTFPR, Cornélio Procópio 86300-000, Brazil
| | - André C P L F de Carvalho
- Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos 13566-590, Brazil
| |
|
50
|
Lin C, Wang L, Shi L. AAPred-CNN: accurate predictor based on deep convolution neural network for identification of anti-angiogenic peptides. Methods 2022; 204:442-448. [PMID: 35031486 DOI: 10.1016/j.ymeth.2022.01.004] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 11/24/2021] [Revised: 12/28/2021] [Accepted: 01/09/2022] [Indexed: 12/13/2022] Open
Abstract
Recently, deep learning techniques have been developed for various bioactive peptide prediction tasks. However, only conventional machine learning-based methods exist for the prediction of anti-angiogenic peptides (AAPs), which play an important role in cancer treatment. The main reason no deep learning method has been applied in this field is that there are too few experimentally validated AAPs to support the training of deep models, while researchers have believed that deep learning depends heavily on the amount of labeled data. In this paper, as a tentative work, we try to predict AAPs by constructing different classical deep learning models and propose the first deep convolution neural network-based predictor (AAPred-CNN) for AAPs. Contrary to intuition, the experimental results show that deep learning models can achieve superior or comparable performance to the state-of-the-art model, even though they are given only a few labeled sequences for training. We also examine the influence of hyper-parameters and training samples on the performance of deep learning models to help understand how the model works. Furthermore, we visualize the learned embeddings by dimension reduction to increase model interpretability and reveal the residue propensity of AAPs through the statistics of convolutional features for different residues. In summary, this work demonstrates the powerful representation ability of AAPred-CNN for AAP prediction, further improving the prediction accuracy of AAPs.
Affiliation(s)
- Changhang Lin
- School of Big Data and Artificial Intelligence, Fujian Polytechnic Normal University, Fuzhou, China
| | - Lei Wang
- Beidahuang Industry Group General Hospital, Harbin, China
| | - Lei Shi
- Department of Spine Surgery, Changzheng Hospital, Naval Medical University, Shanghai, China
| |
|