Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Han LY, Cai CZ, Lo SL, Chung MCM, Chen YZ. Prediction of RNA-binding proteins from primary sequence by a support vector machine approach. RNA 2004;10:355-68. [PMID: 14970381 PMCID: PMC1370931 DOI: 10.1261/rna.5890304] [Citation(s) in RCA: 87] [Impact Index Per Article: 4.4] [Received: 05/19/2003] [Accepted: 10/06/2003] [Indexed: 05/20/2023]

Cited by Other Article(s)

Yadav AK, Gupta PK, Singh TR. PMTPred: machine-learning-based prediction of protein methyltransferases using the composition of k-spaced amino acid pairs. Mol Divers 2024:10.1007/s11030-024-10937-2. [PMID: 39033257 DOI: 10.1007/s11030-024-10937-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/06/2024] [Accepted: 07/10/2024] [Indexed: 07/23/2024]

Madugula SS, Pujar P, Nammi B, Wang S, Jayasinghe-Arachchige VM, Pham T, Mashburn D, Artiles M, Liu J. Identification of Family-Specific Features in Cas9 and Cas12 Proteins: A Machine Learning Approach Using Complete Protein Feature Spectrum. J Chem Inf Model 2024;64:4897-4911. [PMID: 38838358 DOI: 10.1021/acs.jcim.4c00625] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/07/2024]

Abstract

The recent development of CRISPR-Cas technology holds promise to correct gene-level defects for genetic diseases. The key element of the CRISPR-Cas system is the Cas protein, a nuclease that can edit the gene of interest assisted by guide RNA. However, these Cas proteins suffer from inherent limitations such as large size, low cleavage efficiency, and off-target effects, hindering their widespread application as a gene editing tool. Therefore, there is a need to identify novel Cas proteins with improved editing properties, for which it is necessary to understand the underlying features governing the Cas families. In this study, we aim to elucidate the unique protein features associated with Cas9 and Cas12 families and identify the features distinguishing each family from non-Cas proteins. Here, we built Random Forest (RF) binary classifiers to distinguish Cas12 and Cas9 proteins from non-Cas proteins, respectively, using the complete protein feature spectrum (13,494 features) encoding various physicochemical, topological, constitutional, and coevolutionary information on Cas proteins. Furthermore, we built multiclass RF classifiers differentiating Cas9, Cas12, and non-Cas proteins. All the models were evaluated rigorously on the test and independent data sets. The Cas12 and Cas9 binary models achieved a high overall accuracy of 92% and 95% on their respective independent data sets, while the multiclass classifier achieved an F1 score of close to 0.98. We observed that Quasi-Sequence-Order (QSO) descriptors like Schneider.lag and Composition descriptors like charge, volume, and polarizability are predominant in the Cas12 family. Conversely, Amino Acid Composition descriptors, especially Tripeptide Composition (TPC), predominate in the Cas9 family. Four of the top 10 descriptors identified in Cas9 classification are tripeptides PWN, PYY, HHA, and DHI, which are seen to be conserved across all Cas9 proteins and located within different catalytically important domains of the Streptococcus pyogenes Cas9 (SpCas9) structure. Among these, DHI and HHA are well-known to be involved in the DNA cleavage activity of the SpCas9 protein. Mutation studies have highlighted the significance of the PWN tripeptide in PAM recognition and DNA cleavage activity of SpCas9, while Y450 from the PYY tripeptide plays a crucial role in reducing off-target effects and improving the specificity in SpCas9. Leveraging our machine learning (ML) pipeline, we identified numerous Cas9 and Cas12 family-specific features. These features offer valuable insights for future experimental and computational studies aiming at designing Cas systems with enhanced gene-editing properties. These features suggest plausible structural modifications that can effectively guide the development of Cas proteins with improved editing capabilities.
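
To make the Tripeptide Composition (TPC) descriptor mentioned in this abstract concrete, the following is a minimal sketch of how such a feature vector could be computed from a protein sequence. It is an illustration only and not the authors' pipeline: the function name, the choice to skip windows containing non-standard residues, and the toy sequence are all assumptions.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def tripeptide_composition(sequence: str) -> dict[str, float]:
    """Return the relative frequency of each of the 20**3 = 8000 tripeptides."""
    seq = sequence.upper()
    # Slide a window of length 3 over the sequence, one residue at a time.
    windows = [seq[i:i + 3] for i in range(len(seq) - 2)]
    # Assumption of this sketch: windows with non-standard residues are skipped.
    valid = [w for w in windows if all(c in AMINO_ACIDS for c in w)]
    counts = Counter(valid)
    total = len(valid)
    return {
        "".join(t): (counts["".join(t)] / total if total else 0.0)
        for t in product(AMINO_ACIDS, repeat=3)
    }


# Toy sequence (hypothetical, not a real Cas9 fragment).
tpc = tripeptide_composition("PWNPWN")
print(len(tpc), tpc["PWN"])
```

Each sequence yields a fixed-length vector of 8000 frequencies, so individual tripeptides such as PWN or DHI can be ranked by a classifier's feature importance.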

Affiliation(s)

Sita Sirisha Madugula: Department of Pharmaceutical Sciences, University of North Texas System College of Pharmacy, University of North Texas Health Science Center, 3500 Camp Bowie Blvd, Fort Worth, Texas 76107, United States
Pranav Pujar: Department of Industrial, Manufacturing and Systems Engineering, University of Texas at Arlington, 701 South Nedderman Drive, Arlington, Texas 76019, United States
Bharani Nammi: Department of Industrial, Manufacturing and Systems Engineering, University of Texas at Arlington, 701 South Nedderman Drive, Arlington, Texas 76019, United States
Shouyi Wang: Department of Industrial, Manufacturing and Systems Engineering, University of Texas at Arlington, 701 South Nedderman Drive, Arlington, Texas 76019, United States
Vindi M Jayasinghe-Arachchige: Department of Pharmaceutical Sciences, University of North Texas System College of Pharmacy, University of North Texas Health Science Center, 3500 Camp Bowie Blvd, Fort Worth, Texas 76107, United States
Tyler Pham: School of Biomedical Sciences, University of North Texas Health Science Center, 3500 Camp Bowie Blvd, Fort Worth, Texas 76107, United States
Dominic Mashburn: Department of Pharmaceutical Sciences, University of North Texas System College of Pharmacy, University of North Texas Health Science Center, 3500 Camp Bowie Blvd, Fort Worth, Texas 76107, United States
Maria Artiles: School of Biomedical Sciences, University of North Texas Health Science Center, 3500 Camp Bowie Blvd, Fort Worth, Texas 76107, United States
Jin Liu: Department of Pharmaceutical Sciences, University of North Texas System College of Pharmacy, University of North Texas Health Science Center, 3500 Camp Bowie Blvd, Fort Worth, Texas 76107, United States; School of Biomedical Sciences, University of North Texas Health Science Center, 3500 Camp Bowie Blvd, Fort Worth, Texas 76107, United States

Ghafoor H, Asim MN, Ibrahim MA, Ahmed S, Dengel A. CAPTURE: Comprehensive anti-cancer peptide predictor with a unique amino acid sequence encoder. Comput Biol Med 2024;176:108538. [PMID: 38759585 DOI: 10.1016/j.compbiomed.2024.108538] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/08/2024] [Revised: 04/26/2024] [Accepted: 04/28/2024] [Indexed: 05/19/2024]

Abstract

Key properties of anticancer peptides (ACPs), including bioactivity, high efficacy, low toxicity, and lack of drug resistance, make them ideal candidates for cancer therapies. To explore the potential of ACPs and accelerate the development of cancer therapies, 53 Artificial Intelligence supported computational predictors have been developed for ACP versus non-ACP classification, but only one predictor has been developed for annotating ACP functional types. Moreover, these predictors extract amino acid distribution patterns to transform peptide sequences into statistical vectors that are further fed to classifiers for discriminating peptide sequences and annotating peptide functional classes. Overall, these predictors fail to extract diverse types of amino acid distribution patterns from peptide sequences. This paper presents a unique CARE encoder that transforms peptide sequences into statistical vectors by extracting 4 different types of distribution patterns: correlation, distribution, composition, and transition. On public benchmark datasets, the potential of the proposed encoder is explored under two evaluation settings, intrinsic and extrinsic. Extrinsic evaluation indicates that 12 different machine learning classifiers achieve superior performance with the proposed encoder as compared to 55 existing encoders. Furthermore, intrinsic evaluation reveals that, unlike existing encoders, the proposed encoder generates more discriminative clusters for the ACP and non-ACP classes. Across 8 public benchmark ACP versus non-ACP classification datasets, the CAPTURE predictor, based on the proposed encoder and an AdaBoost classifier, outperforms existing predictors by an average of 1%, 4%, and 2% in accuracy, recall, and MCC, respectively. In a generalizability case study across 7 benchmark anti-microbial peptide classification datasets, CAPTURE surpasses existing predictors by an average AU-ROC of 2%. The CAPTURE predictive pipeline, combined with the label powerset method, outperforms the state-of-the-art ACP functional type predictor by 5%, 5%, 5%, 6%, and 3% in terms of average accuracy, subset accuracy, precision, recall, and F1, respectively. The CAPTURE web application is available at https://sds_genetic_analysis.opendfki.de/CAPTURE.
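
The label powerset method named in this abstract turns a multi-label problem into a single multiclass problem by treating every distinct combination of labels as its own class. A minimal sketch of that transformation follows; the helper names and the example labels ("breast", "lung", "colon") are hypothetical and are not taken from the CAPTURE paper.

```python
def fit_label_powerset(label_sets):
    """Assign one integer class id to each distinct label combination."""
    combo_to_class = {}
    for labels in label_sets:
        key = frozenset(labels)  # order of labels within a set is irrelevant
        if key not in combo_to_class:
            combo_to_class[key] = len(combo_to_class)
    return combo_to_class


def encode(label_sets, combo_to_class):
    """Multi-label targets -> multiclass targets."""
    return [combo_to_class[frozenset(labels)] for labels in label_sets]


def decode(class_ids, combo_to_class):
    """Multiclass predictions -> multi-label predictions."""
    class_to_combo = {cid: combo for combo, cid in combo_to_class.items()}
    return [set(class_to_combo[cid]) for cid in class_ids]


# Hypothetical functional-type annotations for four peptides.
y = [{"breast", "lung"}, {"colon"}, {"lung", "breast"}, set()]
mapping = fit_label_powerset(y)
classes = encode(y, mapping)
print(classes)
```

Any ordinary multiclass classifier can then be trained on `classes`, and `decode` maps its predictions back to label sets. A known trade-off is that label combinations absent from the training data can never be predicted.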

Liao YH, Chen SZ, Bin YN, Zhao JP, Feng XL, Zheng CH. UsIL-6: An unbalanced learning strategy for identifying IL-6 inducing peptides by undersampling technique. Computer Methods and Programs in Biomedicine 2024;250:108176. [PMID: 38677081 DOI: 10.1016/j.cmpb.2024.108176] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/18/2022] [Revised: 03/26/2024] [Accepted: 04/11/2024] [Indexed: 04/29/2024]

Madugula SS, Pujar P, Bharani N, Wang S, Jayasinghe-Arachchige VM, Pham T, Mashburn D, Artilis M, Liu J. Identification of Family-Specific Features in Cas9 and Cas12 Proteins: A Machine Learning Approach Using Complete Protein Feature Spectrum. bioRxiv: The Preprint Server for Biology 2024:2024.01.22.576286. [PMID: 38328240 PMCID: PMC10849529 DOI: 10.1101/2024.01.22.576286] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/09/2024]

Abstract

The recent development of CRISPR-Cas technology holds promise to correct gene-level defects for genetic diseases. The key element of the CRISPR-Cas system is the Cas protein, a nuclease that can edit the gene of interest assisted by guide RNA. However, these Cas proteins suffer from inherent limitations like large size, low cleavage efficiency, and off-target effects, hindering their widespread application as a gene editing tool. Therefore, there is a need to identify novel Cas proteins with improved editing properties, for which it is necessary to understand the underlying features governing the Cas families. In the current study, we aim to elucidate the unique protein attributes associated with Cas9 and Cas12 families and identify the features that distinguish each family from the other. Here, we built Random Forest (RF) binary classifiers to distinguish Cas12 and Cas9 proteins from non-Cas proteins, respectively, using the complete protein feature spectrum (13,495 features) encoding various physicochemical, topological, constitutional, and coevolutionary information of Cas proteins. Furthermore, we built multiclass RF classifiers differentiating Cas9, Cas12, and non-Cas proteins. All the models were evaluated rigorously on the test and independent datasets. The Cas12 and Cas9 binary models achieved a high overall accuracy of 95% and 97% on their respective independent datasets, while the multiclass classifier achieved a high F1 score of 0.97. We observed that Quasi-sequence-order descriptors like Schneider-lag descriptors and Composition descriptors like charge, volume, and polarizability are essential for the Cas12 family. More interestingly, we discovered that Amino Acid Composition descriptors, especially the Tripeptide Composition (TPC) descriptors, are important for the Cas9 family. Four of the identified important descriptors of Cas9 classification are tripeptides PWN, PYY, HHA, and DHI, which are seen to be conserved across all the Cas9 proteins and are located within different catalytically important domains of the Cas9 protein structure. Among these four tripeptides, DHI and HHA are well-known to be involved in the DNA cleavage activity of the Cas9 protein. We therefore propose that the other two tripeptides, PWN and PYY, may also be essential for the Cas9 family. Our identified important descriptors enhance the understanding of the catalytic mechanisms of Cas9 and Cas12 proteins and provide valuable insights into the design of novel Cas systems to achieve enhanced gene-editing properties.

Chung CR, Liou JT, Wu LC, Horng JT, Lee TY. Multi-label classification and features investigation of antimicrobial peptides with various functional classes. iScience 2023;26:108250. [PMID: 38025779 PMCID: PMC10679894 DOI: 10.1016/j.isci.2023.108250] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/04/2023] [Revised: 07/15/2023] [Accepted: 10/16/2023] [Indexed: 12/01/2023]

Chen J, Gu Z, Lai L, Pei J. In silico protein function prediction: the rise of machine learning-based approaches. Medical Review (2021) 2023;3:487-510. [PMID: 38282798 PMCID: PMC10808870 DOI: 10.1515/mr-2023-0038] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/14/2023] [Accepted: 10/11/2023] [Indexed: 01/30/2024]

Deng H, Ding M, Wang Y, Li W, Liu G, Tang Y. ACP-MLC: A two-level prediction engine for identification of anticancer peptides and multi-label classification of their functional types. Comput Biol Med 2023;158:106844. [PMID: 37058760 DOI: 10.1016/j.compbiomed.2023.106844] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 01/07/2023] [Revised: 03/09/2023] [Accepted: 03/30/2023] [Indexed: 04/07/2023]

Peng X, Wang X, Guo Y, Ge Z, Li F, Gao X, Song J. RBP-TSTL is a two-stage transfer learning framework for genome-scale prediction of RNA-binding proteins. Brief Bioinform 2022;23:6596984. [PMID: 35649392 PMCID: PMC9294422 DOI: 10.1093/bib/bbac215] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 03/17/2022] [Revised: 04/25/2022] [Accepted: 05/06/2022] [Indexed: 11/27/2022]

Singh O, Hsu WL, Su ECY. ILeukin10Pred: A Computational Approach for Predicting IL-10-Inducing Immunosuppressive Peptides Using Combinations of Amino Acid Global Features. Biology 2021;11:biology11010005. [PMID: 35053004 PMCID: PMC8773200 DOI: 10.3390/biology11010005] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 10/21/2021] [Revised: 11/25/2021] [Accepted: 12/15/2021] [Indexed: 01/03/2023]

Abstract

Simple Summary

Interleukin-10 is a cytokine that exhibits potent anti-inflammatory characteristics that play an essential role in limiting the host’s immune response to pathogens and regulating the growth or differentiation of various immune cells. Moreover, interleukin-10 prediction via conventional approaches is time-consuming and labor-intensive. Hence, researchers are inclined towards an alternative approach to predict interleukin-10-inducing peptides. Additionally, numerous in silico tools are available to predict T cell epitopes. These methods generally follow a direct or indirect approach where they directly predict cytotoxic T-lymphocyte epitopes rather than major histocompatibility complex binders or indirectly predict single components of the T cell recognition pathway. However, very few studies are available that address cytokine-specific predictions. Our research utilized a computer-aided approach to develop a model to predict IL-10-inducing peptides. The model achieved an accuracy of 87.5% and a Matthews correlation coefficient (MCC) of 0.755 on the hybrid feature types, outperforming an existing state-of-the-art method based on dipeptide compositions that achieved an accuracy of 81.24% and an MCC value of 0.59. Therefore, our model is promising to assist in predicting immunosuppressive peptides that induce interleukin-10 cytokines.

Abstract

Interleukin (IL)-10 is a homodimer cytokine that plays a crucial role in suppressing inflammatory responses and regulating the growth or differentiation of various immune cells. However, the molecular mechanism of IL-10 regulation is only partially understood because its regulation is environment or cell type-specific. In this study, we developed a computational approach, ILeukin10Pred (interleukin-10 prediction), by employing amino acid sequence-based features to predict and identify potential immunosuppressive IL-10-inducing peptides. The dataset comprises 394 experimentally validated IL-10-inducing and 848 non-inducing peptides. Furthermore, we split the dataset into a training set (80%) and a test set (20%). To train and validate the model, we applied a stratified five-fold cross-validation method. The final model was later evaluated using the holdout set. An extra tree classifier (ETC)-based model achieved an accuracy of 87.5% and a Matthews correlation coefficient (MCC) of 0.755 on the hybrid feature types. It outperformed an existing state-of-the-art method based on dipeptide compositions that achieved an accuracy of 81.24% and an MCC value of 0.59. Our experimental results showed that the combination of various features achieved better predictive performance.
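
The evaluation protocol described in this abstract (an 80%/20% split followed by stratified five-fold cross-validation on the training portion) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the authors' code: the abstract does not say whether the 80/20 split itself was stratified, and the function names, seeds, and rounding rule are choices made here. Only the class sizes (394 inducing, 848 non-inducing) come from the text.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split indices so each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return sorted(train_idx), sorted(test_idx)


def stratified_folds(indices, labels, k=5, seed=0):
    """Deal the members of each class round-robin into k folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels[i] for i in indices)):
        idx = [i for i in indices if labels[i] == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


# 1 = IL-10-inducing peptide, 0 = non-inducing peptide (sizes from the abstract).
labels = [1] * 394 + [0] * 848
train, test = stratified_split(labels)
folds = stratified_folds(train, labels)
print(len(train), len(test), [len(f) for f in folds])
```

In each cross-validation round, one fold serves as the validation set and the remaining four as the training set; the held-out 20% is touched only once, for the final evaluation.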

Agrawal S, Sisodia DS, Nagwani NK. Augmented sequence features and subcellular localization for functional characterization of unknown protein sequences. Med Biol Eng Comput 2021;59:2297-2310. [PMID: 34545514 DOI: 10.1007/s11517-021-02436-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/08/2020] [Accepted: 08/29/2021] [Indexed: 11/24/2022]

Chen Z, Zhao P, Li C, Li F, Xiang D, Chen YZ, Akutsu T, Daly RJ, Webb GI, Zhao Q, Kurgan L, Song J. iLearnPlus: a comprehensive and automated machine-learning platform for nucleic acid and protein sequence analysis, prediction and visualization. Nucleic Acids Res 2021;49:e60. [PMID: 33660783 PMCID: PMC8191785 DOI: 10.1093/nar/gkab122] [Citation(s) in RCA: 107] [Impact Index Per Article: 35.7] [Received: 10/26/2020] [Revised: 02/05/2021] [Accepted: 02/25/2021] [Indexed: 12/14/2022]

Affiliation(s)

Zhen Chen: Collaborative Innovation Center of Henan Grain Crops, Henan Agricultural University, Zhengzhou 450046, China
Pei Zhao: State Key Laboratory of Cotton Biology, Institute of Cotton Research of Chinese Academy of Agricultural Sciences (CAAS), Anyang 455000, China
Chen Li: Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
Fuyi Li: Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia; Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia; Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Melbourne, Victoria 3000, Australia
Dongxu Xiang: Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia; Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
Yong-Zi Chen: Laboratory of Tumor Cell Biology, Key Laboratory of Cancer Prevention and Therapy, National Clinical Research Center for Cancer, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, Tianjin 300060, China
Tatsuya Akutsu: Bioinformatics Center, Institute for Chemical Research, Kyoto University, Kyoto 611-0011, Japan
Roger J Daly: Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
Geoffrey I Webb: Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
Quanzhi Zhao: Collaborative Innovation Center of Henan Grain Crops, Henan Agricultural University, Zhengzhou 450046, China; Key Laboratory of Rice Biology in Henan Province, Henan Agricultural University, Zhengzhou 450046, China
Lukasz Kurgan: Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA
Jiangning Song: Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia; Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia

Mishra A, Khanal R, Kabir WU, Hoque T. AIRBP: Accurate identification of RNA-binding proteins using machine learning techniques. Artif Intell Med 2021;113:102034. [PMID: 33685590 DOI: 10.1016/j.artmed.2021.102034] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 04/11/2020] [Revised: 01/19/2021] [Accepted: 02/09/2021] [Indexed: 12/25/2022]

Abstract

Identification of RNA-binding proteins (RBPs) that bind to ribonucleic acid molecules is an important problem in Computational Biology and Bioinformatics. Identifying RBPs is indispensable because they play crucial roles in post-transcriptional control of RNAs and RNA metabolism and have diverse roles in biological processes such as splicing, mRNA stabilization, mRNA localization, translation, RNA synthesis, folding-unfolding, modification, processing, and degradation. The existing experimental techniques for identifying RBPs are time-consuming and expensive. Therefore, identifying RBPs directly from the sequence using computational methods can be useful to annotate RBPs and assist the experimental design efficiently. In this work, we present a method called AIRBP, which is designed using an advanced machine learning technique, called stacking, to effectively predict RBPs by utilizing features extracted from evolutionary information, physicochemical properties, and disordered properties. Moreover, our method, AIRBP, uses the majority vote from RBPPred, DeepRBPPred, and the stacking model for the prediction of RBPs. The results show that AIRBP attains Accuracy (ACC), Balanced Accuracy (BACC), F1-score, and Matthews Correlation Coefficient (MCC) of 95.84%, 94.71%, 0.928, and 0.899, respectively, based on the training dataset, using 10-fold cross-validation (CV). Further evaluation of AIRBP on independent test sets reveals that it achieves ACC, BACC, F1-score, and MCC of 94.36%, 94.28%, 0.897, and 0.860 for the Human test set; 91.25%, 93.00%, 0.896, and 0.835 for the S. cerevisiae test set; and 90.60%, 90.41%, 0.934, and 0.775 for the A. thaliana test set, respectively. These results indicate that AIRBP outperforms the existing Deep- and TriPepSVM methods. Therefore, the proposed better-performing AIRBP can be useful for accurate identification and annotation of RBPs directly from the sequence and help gain valuable insight to treat critical diseases. Availability: Code and data are available at http://cs.uno.edu/~tamjid/Software/AIRBP/code_data.zip.
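
The abstract describes combining three predictors by majority vote and reporting ACC, BACC, F1-score, and MCC. The sketch below shows how such a vote and these four metrics can be computed from binary predictions. The prediction lists are made-up toy data, the variable names are hypothetical, and nothing here reproduces the AIRBP implementation or its results.

```python
import math


def majority_vote(*prediction_lists):
    """Combine 0/1 predictions from an odd number of models, sample by sample."""
    return [1 if sum(votes) * 2 > len(votes) else 0
            for votes in zip(*prediction_lists)]


def binary_metrics(y_true, y_pred):
    """Accuracy, balanced accuracy, F1, and MCC from the confusion matrix."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "acc": (tp + tn) / len(pairs),
        "bacc": (sensitivity + specificity) / 2,
        "f1": 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0,
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Toy data: 1 = RNA-binding, 0 = non-binding (invented for illustration).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
model_a = [1, 1, 0, 0, 0, 0, 1, 0]
model_b = [1, 0, 1, 0, 0, 0, 1, 0]
model_c = [1, 1, 1, 1, 0, 0, 0, 0]

voted = majority_vote(model_a, model_b, model_c)
metrics = binary_metrics(y_true, voted)
print(voted, metrics)
```

Balanced accuracy and MCC are reported alongside plain accuracy because RBP datasets are usually imbalanced, and accuracy alone can look high for a classifier that favors the majority class.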

The search for RNA-binding proteins: a technical and interdisciplinary challenge. Biochem Soc Trans 2021;49:393-403. [PMID: 33492363 PMCID: PMC7925008 DOI: 10.1042/bst20200688] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 10/12/2020] [Revised: 12/21/2020] [Accepted: 12/21/2020] [Indexed: 12/13/2022]

The evolutionary relationship of S15/NS1RNA binding domains with a similar protein domain pattern - A computational approach. Informatics in Medicine Unlocked 2021. [DOI: 10.1016/j.imu.2021.100611] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/20/2022]

Enhanced prediction of anti-tubercular peptides from sequence information using divergence measure-based intuitionistic fuzzy-rough feature selection. Soft Comput 2020. [DOI: 10.1007/s00500-020-05363-z] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 10/23/2022]

Yang H, Deng Z, Pan X, Shen HB, Choi KS, Wang L, Wang S, Wu J. RNA-binding protein recognition based on multi-view deep feature and multi-label learning. Brief Bioinform 2020;22:5893431. [PMID: 32808039 DOI: 10.1093/bib/bbaa174] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 05/06/2020] [Revised: 06/17/2020] [Accepted: 07/09/2020] [Indexed: 12/28/2022]

Abstract

RNA-binding proteins (RBPs) are a class of proteins that bind to and accompany RNAs in regulating biological processes. An RBP may have multiple target RNAs, and its aberrant expression can cause multiple diseases. Methods have been designed to predict whether a specific RBP can bind to an RNA and the position of the binding site using a binary classification model. However, most of the existing methods do not take into account the binding similarity and correlation between different RBPs. While methods employing multiple labels and Long Short Term Memory Network (LSTM) are proposed to consider binding similarity between different RBPs, the accuracy remains low due to insufficient feature learning and multi-label learning on RNA sequences. In response to this challenge, the concept of RNA-RBP Binding Network (RRBN) is proposed in this paper to provide theoretical support for multi-label learning to identify RBPs that can bind to RNAs. It is experimentally shown that the RRBN information can significantly improve the prediction of unknown RNA-RBP interactions. To further improve the prediction accuracy, we present the novel computational method iDeepMV, which integrates multi-view deep learning technology under the multi-label learning framework. iDeepMV first extracts data from the views of amino acid sequence and dipeptide component based on the RNA sequences as the original view. Deep neural network models are then designed for the respective views to perform deep feature learning. The extracted deep features are fed into multi-label classifiers which are trained with the RNA-RBP interaction information for the three views. Finally, a voting mechanism is designed to make a comprehensive decision on the results of the multi-label classifiers. Our experimental results show that the prediction performance of iDeepMV, which combines multi-view deep feature learning models with RNA-RBP interaction information, is significantly better than that of the state-of-the-art methods. iDeepMV is freely available at http://www.csbio.sjtu.edu.cn/bioinf/iDeepMV for academic use. The code is freely available at http://github.com/uchihayht/iDeepMV.

Identification of common and dissimilar biomarkers for different cancer types from gene expressions of RNA-sequencing data. Gene Reports 2020. [DOI: 10.1016/j.genrep.2020.100654] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022]

Sagar A, Xue B. Recent Advances in Machine Learning Based Prediction of RNA-protein Interactions. Protein Pept Lett 2019;26:601-619. [PMID: 31215361 DOI: 10.2174/0929866526666190619103853] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 10/16/2018] [Revised: 04/04/2019] [Accepted: 06/01/2019] [Indexed: 12/18/2022]

Zuo Y, Chang Y, Huang S, Zheng L, Yang L, Cao G. iDEF-PseRAAC: Identifying the Defensin Peptide by Using Reduced Amino Acid Composition Descriptor. Evol Bioinform Online 2019;15:1176934319867088. [PMID: 31391777 PMCID: PMC6669840 DOI: 10.1177/1176934319867088] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 06/18/2019] [Accepted: 07/08/2019] [Indexed: 11/18/2022]

Poursheikhali Asghari M, Abdolmaleki P. Prediction of RNA- and DNA-Binding Proteins Using Various Machine Learning Classifiers. Avicenna J Med Biotechnol 2019;11:104-111. [PMID: 30800250 PMCID: PMC6359699] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/10/2022]

Shen WJ, Cui W, Chen D, Zhang J, Xu J. RPiRLS: Quantitative Predictions of RNA Interacting with Any Protein of Known Sequence. Molecules 2018;23:molecules23030540. [PMID: 29495575 PMCID: PMC6017498 DOI: 10.3390/molecules23030540] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 02/06/2018] [Revised: 02/24/2018] [Accepted: 02/25/2018] [Indexed: 02/05/2023]

Tripathi P, Pandey PN. A novel alignment-free method to classify protein folding types by combining spectral graph clustering with Chou's pseudo amino acid composition. J Theor Biol 2017;424:49-54. [DOI: 10.1016/j.jtbi.2017.04.027] [Citation(s) in RCA: 35] [Impact Index Per Article: 5.0] [Received: 07/08/2015] [Revised: 04/24/2017] [Accepted: 04/27/2017] [Indexed: 10/19/2022]

Zhang X, Liu S. RBPPred: predicting RNA-binding proteins from sequence using SVM. Bioinformatics 2016;33:854-862. [DOI: 10.1093/bioinformatics/btw730] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Received: 06/25/2016] [Accepted: 11/16/2016] [Indexed: 11/13/2022]

DNABP: Identification of DNA-Binding Proteins Based on Feature Selection Using a Random Forest and Predicting Binding Residues. PLoS One 2016;11:e0167345. [PMID: 27907159 PMCID: PMC5132331 DOI: 10.1371/journal.pone.0167345] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.8] [Received: 06/28/2016] [Accepted: 11/12/2016] [Indexed: 12/24/2022]

Chrysostomou C, Seker H. Prediction of protein allergenicity based on signal-processing bioinformatics approach. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2016;2014:808-11. [PMID: 25570082 DOI: 10.1109/embc.2014.6943714] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]

Ghosh P, Mathew OK, Sowdhamini R. RStrucFam: a web server to associate structure and cognate RNA for RNA-binding proteins from sequence information. BMC Bioinformatics 2016;17:411. [PMID: 27717309 PMCID: PMC5054549 DOI: 10.1186/s12859-016-1289-x] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/05/2016] [Accepted: 09/29/2016] [Indexed: 11/25/2022] Open

Abstract

Background

RNA-binding proteins (RBPs) interact with their cognate RNA(s) to form large biomolecular assemblies. They are versatile in their functionality and are involved in a myriad of processes inside the cell. RBPs with similar structural features and common biological functions are grouped together into families and superfamilies. It would be useful to obtain an early indication of the RNA-binding properties of gene product sequences. Here, we report a web server, RStrucFam, to predict the structure, type of cognate RNA(s) and function(s) of proteins, where possible, from mere sequence information.

Results

The web server employs Hidden Markov Model scan (hmmscan) to associate a query with a back-end database of structural and sequence families. The database (HMMRBP) comprises 437 HMMs of RBP families of known structure, generated using structure-based sequence alignments, and 746 sequence-centric RBP family HMMs. The input protein sequence is associated with structural or sequence domain families if structure or sequence signatures exist. When the protein is associated with a family of known structures, the output includes a multiple structure-based sequence alignment (MSSA) of the query with all other members of that family. Further, cognate RNA partner(s) for that protein, Gene Ontology (GO) annotations, if any, and a homology model of the protein can be obtained. Users can also browse the database for details pertaining to each family, protein or RNA and their related information, based on keyword search or RNA motif search.

Conclusions

RStrucFam is a web server that exploits structurally conserved features of RBPs, derived from known family members and imprinted in mathematical profiles, to predict putative RBPs from sequence information. Proteins that fail to associate with such structure-centric families are further queried against the sequence-centric RBP family HMMs in the HMMRBP database. Further, all other essential information pertaining to an RBP, such as overall function annotations, is provided. The web server can be accessed at the following link: http://caps.ncbs.res.in/rstrucfam.

Electronic supplementary material

The online version of this article (doi:10.1186/s12859-016-1289-x) contains supplementary material, which is available to authorized users.

Prediction of protein-RNA interactions using sequence and structure descriptors. Neurocomputing 2016. [DOI: 10.1016/j.neucom.2015.11.105] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/29/2023]

Li YH, Xu JY, Tao L, Li XF, Li S, Zeng X, Chen SY, Zhang P, Qin C, Zhang C, Chen Z, Zhu F, Chen YZ. SVM-Prot 2016: A Web-Server for Machine Learning Prediction of Protein Functional Families from Sequence Irrespective of Similarity. PLoS One 2016;11:e0155290. [PMID: 27525735 PMCID: PMC4985167 DOI: 10.1371/journal.pone.0155290] [Citation(s) in RCA: 80] [Impact Index Per Article: 10.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/21/2016] [Accepted: 04/27/2016] [Indexed: 12/20/2022] Open

Affiliation(s)

Ying Hong Li Innovative Drug Research and Bioinformatics Group, Innovative Drug Research Centre and School of Pharmaceutical Sciences, Chongqing University, Chongqing, 401331, China
Jing Yu Xu Innovative Drug Research and Bioinformatics Group, Innovative Drug Research Centre and School of Pharmaceutical Sciences, Chongqing University, Chongqing, 401331, China School of Mathematics and Statistics, Beijing Institute of Technology, Beijing, China
Lin Tao Innovative Drug Research and Bioinformatics Group, Innovative Drug Research Centre and School of Pharmaceutical Sciences, Chongqing University, Chongqing, 401331, China Bioinformatics and Drug Discovery group, Department of Pharmacy, National University of Singapore, Singapore, 117543, Singapore
Xiao Feng Li Innovative Drug Research and Bioinformatics Group, Innovative Drug Research Centre and School of Pharmaceutical Sciences, Chongqing University, Chongqing, 401331, China
Shuang Li Innovative Drug Research and Bioinformatics Group, Innovative Drug Research Centre and School of Pharmaceutical Sciences, Chongqing University, Chongqing, 401331, China
Xian Zeng Bioinformatics and Drug Discovery group, Department of Pharmacy, National University of Singapore, Singapore, 117543, Singapore
Shang Ying Chen Bioinformatics and Drug Discovery group, Department of Pharmacy, National University of Singapore, Singapore, 117543, Singapore
Peng Zhang Bioinformatics and Drug Discovery group, Department of Pharmacy, National University of Singapore, Singapore, 117543, Singapore
Chu Qin Bioinformatics and Drug Discovery group, Department of Pharmacy, National University of Singapore, Singapore, 117543, Singapore
Cheng Zhang Bioinformatics and Drug Discovery group, Department of Pharmacy, National University of Singapore, Singapore, 117543, Singapore
Zhe Chen Zhejiang Key Laboratory of Gastro-intestinal Pathophysiology, Zhejiang Hospital of Traditional Chinese Medicine, Zhejiang Chinese Medical University, Hangzhou, P. R. China
Feng Zhu Innovative Drug Research and Bioinformatics Group, Innovative Drug Research Centre and School of Pharmaceutical Sciences, Chongqing University, Chongqing, 401331, China
Yu Zong Chen Bioinformatics and Drug Discovery group, Department of Pharmacy, National University of Singapore, Singapore, 117543, Singapore

Computational Prediction of RNA-Binding Proteins and Binding Sites. Int J Mol Sci 2015;16:26303-17. [PMID: 26540053 PMCID: PMC4661811 DOI: 10.3390/ijms161125952] [Citation(s) in RCA: 54] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/01/2015] [Revised: 10/20/2015] [Accepted: 10/23/2015] [Indexed: 11/19/2022] Open

Ma X, Guo J, Xiao K, Sun X. PRBP: Prediction of RNA-Binding Proteins Using a Random Forest Algorithm Combined with an RNA-Binding Residue Predictor. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2015;12:1385-1393. [PMID: 26671809 DOI: 10.1109/tcbb.2015.2418773] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]

Sequence-Based Prediction of RNA-Binding Proteins Using Random Forest with Minimum Redundancy Maximum Relevance Feature Selection. BIOMED RESEARCH INTERNATIONAL 2015;2015:425810. [PMID: 26543860 PMCID: PMC4620426 DOI: 10.1155/2015/425810] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 06/24/2015] [Accepted: 09/21/2015] [Indexed: 11/17/2022]

Yousef A, Moghadam Charkari N. SFM: A novel sequence-based fusion method for disease genes identification and prioritization. J Theor Biol 2015. [DOI: 10.1016/j.jtbi.2015.07.010] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/23/2022]

Ren H, Shen Y. RNA-binding residues prediction using structural features. BMC Bioinformatics 2015;16:249. [PMID: 26254826 PMCID: PMC4529986 DOI: 10.1186/s12859-015-0691-0] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/03/2015] [Accepted: 07/31/2015] [Indexed: 01/25/2023] Open

Yousef A, Charkari NM. A novel method based on physicochemical properties of amino acids and one class classification algorithm for disease gene identification. J Biomed Inform 2015;56:300-6. [DOI: 10.1016/j.jbi.2015.06.018] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/13/2015] [Revised: 06/04/2015] [Accepted: 06/26/2015] [Indexed: 10/23/2022]

Pérez-Cano L, Fernández-Recio J. Dissection and prediction of RNA-binding sites on proteins. Biomol Concepts 2015;1:345-55. [PMID: 25962008 DOI: 10.1515/bmc.2010.037] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022] Open

CISAPS: Complex Informational Spectrum for the Analysis of Protein Sequences. Adv Bioinformatics 2015;2015:909765. [PMID: 25632276 PMCID: PMC4302972 DOI: 10.1155/2015/909765] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/28/2014] [Revised: 11/27/2014] [Accepted: 12/04/2014] [Indexed: 11/23/2022] Open

Prediction of Protein-RNA Interactions Using Sequence and Structure Descriptors (funding note: this work was partially supported by the National Natural Science Foundation of China (NSFC) Grant No. 31100949, the Scientific Research Foundation for the Returned Overseas Chinese Scholars, Ministry of Education of China, the Fundamental Research Funds of Shandong University Grant No. 2014TB006, University of Rochester Center for AIDS Research Grant P30 AI078498 (NIH/NIAID) and NIH R01 Grant GM100788-01). ACTA ACUST UNITED AC 2015. [DOI: 10.1016/j.ifacol.2015.12.090] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/19/2022]

Gupta Y, Witte M, Möller S, Ludwig RJ, Restle T, Zillikens D, Ibrahim SM. ptRNApred: computational identification and classification of post-transcriptional RNA. Nucleic Acids Res 2014;42:e167. [PMID: 25303994 PMCID: PMC4267668 DOI: 10.1093/nar/gku918] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open

Zhao H, Yang Y, Janga SC, Kao CC, Zhou Y. Prediction and validation of the unexplored RNA-binding protein atlas of the human proteome. Proteins 2014;82:640-7. [PMID: 24123256 PMCID: PMC3949140 DOI: 10.1002/prot.24441] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/21/2013] [Revised: 09/13/2013] [Accepted: 09/26/2013] [Indexed: 12/13/2022]

Das Roy R, Dash D. Selection of relevant features from amino acids enables development of robust classifiers. Amino Acids 2014;46:1343-51. [PMID: 24604165 DOI: 10.1007/s00726-014-1697-z] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/03/2013] [Accepted: 02/14/2014] [Indexed: 12/30/2022]

Incorporating significant amino acid pairs and protein domains to predict RNA splicing-related proteins with functional roles. J Comput Aided Mol Des 2014;28:49-60. [PMID: 24442949 DOI: 10.1007/s10822-014-9706-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/12/2013] [Accepted: 01/07/2014] [Indexed: 12/20/2022]

Zhao H, Yang Y, Zhou Y. Prediction of RNA binding proteins comes of age from low resolution to high resolution. MOLECULAR BIOSYSTEMS 2013;9:2417-25. [PMID: 23872922 PMCID: PMC3870025 DOI: 10.1039/c3mb70167k] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/21/2022]

Hosseinzadeh F, Kayvanjoo AH, Ebrahimi M, Goliaei B. Prediction of lung tumor types based on protein attributes by machine learning algorithms. SPRINGERPLUS 2013;2:238. [PMID: 23888262 PMCID: PMC3710575 DOI: 10.1186/2193-1801-2-238] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 01/16/2013] [Accepted: 03/21/2013] [Indexed: 01/15/2023]

Wang Y, Chen X, Liu ZP, Huang Q, Wang Y, Xu D, Zhang XS, Chen R, Chen L. De novo prediction of RNA–protein interactions from sequence information. MOLECULAR BIOSYSTEMS 2013;9:133-42. [DOI: 10.1039/c2mb25292a] [Citation(s) in RCA: 86] [Impact Index Per Article: 7.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/12/2022]

Jahandideh S, Srinivasasainagendra V, Zhi D. Comprehensive comparative analysis and identification of RNA-binding protein domains: multi-class classification and feature selection. J Theor Biol 2012;312:65-75. [PMID: 22884576 DOI: 10.1016/j.jtbi.2012.07.013] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/29/2012] [Revised: 07/09/2012] [Accepted: 07/13/2012] [Indexed: 01/11/2023]

Batuwita R, Palade V. Adjusted geometric-mean: a novel performance measure for imbalanced bioinformatics datasets learning. J Bioinform Comput Biol 2012;10:1250003. [DOI: 10.1142/s0219720012500035] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/28/2022]

Hosseinzadeh F, Ebrahimi M, Goliaei B, Shamabadi N. Classification of lung cancer tumors based on structural and physicochemical properties of proteins by bioinformatics models. PLoS One 2012;7:e40017. [PMID: 22829872 PMCID: PMC3400626 DOI: 10.1371/journal.pone.0040017] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/27/2012] [Accepted: 05/30/2012] [Indexed: 12/03/2022] Open

Abstract

Rapid distinction between small cell lung cancer (SCLC) and non-small cell lung cancer (NSCLC) tumors is very important in the diagnosis of this disease. Furthermore, sequence-derived structural and physicochemical descriptors are very useful for machine learning prediction of protein structural and functional classes, classifying proteins and the prediction performance. In this study, the classification of lung tumors based on 1497 attributes derived from structural and physicochemical properties of protein sequences (based on genes defined by microarray analysis) is investigated through a combination of attribute weighting, supervised and unsupervised clustering algorithms. Eighty percent of the weighting methods selected features such as autocorrelation, dipeptide composition and distribution of hydrophobicity as the most important protein attributes in the classification of SCLC, NSCLC and COMMON classes of lung tumors. The same results were observed with most tree induction algorithms, while descriptors of hydrophobicity distribution were high in protein sequences COMMON to both groups and the distribution of charge in these proteins was very low, showing that COMMON proteins were very hydrophobic. Furthermore, the composition of polar dipeptides in SCLC proteins was higher than in NSCLC proteins. Some clustering models (alone or in combination with attribute weighting algorithms) were able to nearly classify SCLC and NSCLC proteins. The Random Forest tree induction algorithm, evaluated with leave-one-out and 10-fold cross validation, shows more than 86% accuracy in clustering and predicting three different lung cancer tumors. Here, for the first time, the application of data mining tools to effectively classify three classes of lung cancer tumors, regarding the importance of dipeptide composition, autocorrelation and distribution descriptors, is reported.

Fernandez M, Kumagai Y, Standley DM, Sarai A, Mizuguchi K, Ahmad S. Prediction of dinucleotide-specific RNA-binding sites in proteins. BMC Bioinformatics 2011;12 Suppl 13:S5. [PMID: 22373260 PMCID: PMC3278845 DOI: 10.1186/1471-2105-12-s13-s5] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/02/2023] Open

Hsu JBK, Bretaña NA, Lee TY, Huang HD. Incorporating evolutionary information and functional domains for identifying RNA splicing factors in humans. PLoS One 2011;6:e27567. [PMID: 22110674 PMCID: PMC3217973 DOI: 10.1371/journal.pone.0027567] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/24/2011] [Accepted: 10/19/2011] [Indexed: 11/19/2022] Open

Abstract

Regulation of pre-mRNA splicing is achieved through the interaction of RNA sequence elements and a variety of RNA-splicing related proteins (splicing factors). The splicing machinery in humans is not yet fully elucidated, partly because splicing factors in humans have not been exhaustively identified. Furthermore, experimental methods for splicing factor identification are time-consuming and lab-intensive. Although many computational methods have been proposed for the identification of RNA-binding proteins, there exists no development that focuses on the identification of RNA-splicing related proteins so far. Therefore, we are motivated to design a method that focuses on the identification of human splicing factors using experimentally verified splicing factors. The investigation of amino acid composition reveals that there are remarkable differences between splicing factors and non-splicing proteins. A support vector machine (SVM) is utilized to construct a predictive model, and the five-fold cross-validation evaluation indicates that the SVM model trained with amino acid composition could provide a promising accuracy (80.22%). Another basic feature, amino acid dipeptide composition, is also examined to yield a similar predictive performance to amino acid composition. In addition, this work presents that the incorporation of evolutionary information and domain information could improve the predictive performance. The constructed models have been demonstrated to effectively classify (73.65% accuracy) an independent data set of human splicing factors. The result of independent testing indicates that in silico identification could be a feasible means of conducting preliminary analyses of splicing factors and significantly reducing the number of potential targets that require further in vivo or in vitro confirmation.
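The two basic sequence features named in the abstract above, amino acid composition and dipeptide composition, can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function names, the 20-letter alphabet ordering, and the choice to skip non-standard residues are assumptions made here. The resulting vectors are the kind of input that would then be passed to an SVM and evaluated by cross-validation.

```python
from collections import Counter
from itertools import product

# Assumption: the 20 standard amino acids in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(seq):
    """Return a 20-dimensional vector of residue frequencies.

    Non-standard characters (e.g. X, B, Z) are ignored (an assumption).
    """
    seq = seq.upper()
    counts = Counter(c for c in seq if c in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Return a 400-dimensional vector of adjacent-pair frequencies.

    Pairs containing a non-standard character are ignored (an assumption).
    """
    seq = seq.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    valid = [p for p in pairs if p[0] in AMINO_ACIDS and p[1] in AMINO_ACIDS]
    counts = Counter(valid)
    keys = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    if not valid:
        return [0.0] * len(keys)
    return [counts[k] / len(valid) for k in keys]


if __name__ == "__main__":
    # Hypothetical toy sequence, used only to show the call pattern.
    toy = "AACD"
    print(amino_acid_composition(toy)[:4])
    print(dipeptide_composition(toy)[:3])
```

Both vectors are normalized to sum to 1 for any sequence containing at least one valid residue or pair, so sequences of different lengths remain comparable.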
