1
Zhu L, Wang X, Li F, Song J. PreAcrs: a machine learning framework for identifying anti-CRISPR proteins. BMC Bioinformatics 2022; 23:444. [PMID: 36284264 PMCID: PMC9597991 DOI: 10.1186/s12859-022-04986-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/26/2022] [Accepted: 10/14/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Anti-CRISPR proteins are potent modulators that inhibit the CRISPR-Cas immunity system and have huge potential in gene editing and gene therapy as a genome-editing tool. Extensive studies have shown that anti-CRISPR proteins are essential for modifying endogenous genes, promoting the RNA-guided binding and cleavage of DNA or RNA substrates. In recent years, identifying and characterizing anti-CRISPR proteins has become a hot and significant research topic in bioinformatics. However, as most anti-CRISPR proteins share little similarity with those currently known, traditional screening methods are time-consuming and inefficient. Machine learning methods could fill this gap with powerful predictive capability and provide a new perspective for anti-CRISPR protein identification.
RESULTS: Here, we present a novel machine learning ensemble predictor, called PreAcrs, to identify anti-CRISPR proteins directly from protein sequences. Three features and eight different machine learning algorithms were used to train PreAcrs. PreAcrs outperformed other existing methods and significantly improved the prediction accuracy for identifying anti-CRISPR proteins.
CONCLUSIONS: In summary, the PreAcrs predictor achieved a competitive performance for predicting new anti-CRISPR proteins in terms of accuracy and robustness. We anticipate PreAcrs will be a valuable tool for researchers to speed up the research process. The source code is available at: https://github.com/Lyn-666/anti_CRISPR.git.
Affiliation(s)
- Lin Zhu
- Institute for Advanced Study, Shenzhen University, Shenzhen, China
- Xiaoyu Wang
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800 Australia
- Fuyi Li
- Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Melbourne, VIC Australia
- Jiangning Song
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800 Australia
- Monash Data Futures Institute, Monash University, Melbourne, VIC 3800 Australia
2
Zhu L, Li W. Roles of Physicochemical and Structural Properties of RNA-Binding Proteins in Predicting the Activities of Trans-Acting Splicing Factors with Machine Learning. Int J Mol Sci 2022; 23:ijms23084426. [PMID: 35457243 PMCID: PMC9030803 DOI: 10.3390/ijms23084426] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/04/2022] [Revised: 04/13/2022] [Accepted: 04/14/2022] [Indexed: 02/06/2023] Open
Abstract
Trans-acting splicing factors play a pivotal role in modulating alternative splicing by specifically binding to cis-elements in pre-mRNAs. There are approximately 1500 RNA-binding proteins (RBPs) in the human genome, but the activities of these RBPs in alternative splicing are unknown. Since determining RBP activities through experimental methods is expensive and time consuming, the development of an efficient computational method for predicting the activities of RBPs in alternative splicing from their sequences is of great practical importance. Recently, a machine learning model for predicting the activities of splicing factors was built based on features of single and dual amino acid compositions. Here, we explored the role of physicochemical and structural properties in predicting their activities in alternative splicing using machine learning approaches and found that the prediction performance is significantly improved by including these properties. By combining the minimum redundancy–maximum relevance (mRMR) method and forward feature searching strategy, a promising feature subset with 24 features was obtained to predict the activities of RBPs. The feature subset consists of 16 dual amino acid compositions, 5 physicochemical features, and 3 structural features. The physicochemical and structural properties were as important as the sequence composition features for an accurate prediction of the activities of splicing factors. The hydrophobicity and distribution of coil are suggested to be the key physicochemical and structural features, respectively.
Affiliation(s)
- Wenjin Li
- Correspondence: Tel.: +86-0755-26942336
3
Recent Advances in the Prediction of Protein Structural Classes: Feature Descriptors and Machine Learning Algorithms. Crystals 2021. [DOI: 10.3390/cryst11040324] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 01/12/2023]
Abstract
In the postgenomic age, rapid growth in the number of sequence-known proteins has been accompanied by much slower growth in the number of structure-known proteins (as a result of experimental limitations), and a widening gap between the two is evident. Because protein function is linked to protein structure, successful prediction of protein structure is of significant importance in protein function identification. Foreknowledge of protein structural class can help improve protein structure prediction with significant medical and pharmaceutical implications. Thus, a fast, suitable, reliable, and reasonable computational method for protein structural class prediction has become pivotal in bioinformatics. Here, we review recent efforts in protein structural class prediction from protein sequence, with particular attention paid to new feature descriptors, which extract information from protein sequence, and the use of machine learning algorithms in both feature selection and the construction of new classification models. These new feature descriptors include amino acid composition, sequence order, physicochemical properties, multiprofile Bayes, and secondary structure-based features. Machine learning methods, such as artificial neural networks (ANNs), support vector machine (SVM), K-nearest neighbor (KNN), random forest, deep learning, and examples of their application are discussed in detail. We also present our view on possible future directions, challenges, and opportunities for the applications of machine learning algorithms for prediction of protein structural classes.
4
Liu L, Chen L, Zhang YH, Wei L, Cheng S, Kong X, Zheng M, Huang T, Cai YD. Analysis and prediction of drug-drug interaction by minimum redundancy maximum relevance and incremental feature selection. J Biomol Struct Dyn 2016; 35:312-329. [PMID: 26750516 DOI: 10.1080/07391102.2016.1138142] [Citation(s) in RCA: 55] [Impact Index Per Article: 6.9] [Indexed: 01/15/2023]
Abstract
Drug-drug interaction (DDI) defines a situation in which one drug affects the activity of another when both are administered together. DDI is a common cause of adverse drug reactions and sometimes also leads to improved therapeutic effects. Therefore, it is of great interest to discover novel DDIs according to their molecular properties and mechanisms in a robust and rigorous way. This paper attempts to predict effective DDIs using the following properties: (1) chemical interaction between drugs; (2) protein interactions between the targets of drugs; and (3) target enrichment of KEGG pathways. The data consisted of 7323 pairs of DDIs collected from the DrugBank and 36,615 pairs of drugs constructed by randomly combining two drugs. Each drug pair was represented by 465 features derived from the aforementioned three categories of properties. The random forest algorithm was adopted to train the prediction model. Some feature selection techniques, including minimum redundancy maximum relevance and incremental feature selection, were used to extract key features as the optimal input for the prediction model. The extracted key features may help to gain insights into the mechanisms of DDIs and provide some guidelines for the relevant clinical medication developments, and the prediction model can give new clues for identification of novel DDIs.
Affiliation(s)
- Lili Liu
- Intelligence Research Department, Information Center, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, P. R. China
- Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, P. R. China
- Yu-Hang Zhang
- Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, P. R. China
- Lai Wei
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, P. R. China
- Shiwen Cheng
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, P. R. China
- Xiangyin Kong
- Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, P. R. China
- Mingyue Zheng
- State Key Laboratory of Drug Research, Drug Discovery and Design Center, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, P. R. China
- Tao Huang
- Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, P. R. China
- Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai 200444, P. R. China
5
Abstract
The small ubiquitin-like modifier (SUMO) proteins are a family of proteins that can be attached to a range of other proteins. Sumoylation is an important posttranslational modification, so predicting the sumoylation sites of a given protein is significant. Here we employed a combined, two-stage procedure to perform this task. In the first stage, we predicted whether a protein would be sumoylated; in the second stage, we predicted the sumoylation sites of each protein that was determined to be modified by SUMO in the first stage. In the first stage, we encoded a protein with protein families (PFAM) and trained the predictor with the nearest neighbor algorithm (NNA); in the second stage, we encoded the nonapeptides (peptides that contain nine residues) of the protein that contain lysine residues with the Amino Acid Index, and again trained the predictor with NNA. The predictor was tested by k-fold cross-validation. The highest accuracy of the second-stage predictor was 99.55% when 12 features were incorporated in the predictor. The corresponding Matthews correlation coefficient was 0.7952. These results indicate that the method is a promising tool to predict the sumoylation sites of a protein. Finally, the features used in the predictor are discussed. The software is available on request.
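The nonapeptide encoding mentioned in this abstract can be illustrated with a minimal sketch. It assumes each window is centered on a lysine (K) with four flanking residues on each side and that windows running past either end of the sequence are padded with "X". Neither detail is stated in the abstract, and the function name `lysine_nonapeptides` is hypothetical.

```python
def lysine_nonapeptides(sequence, flank=4, pad="X"):
    """Return (1-based position, nine-residue window) for every lysine.

    Assumption: the lysine sits at the center of the window and
    windows that overrun the sequence ends are padded with `pad`.
    """
    padded = pad * flank + sequence + pad * flank
    width = 2 * flank + 1
    return [
        (i + 1, padded[i:i + width])
        for i, residue in enumerate(sequence)
        if residue == "K"
    ]


# Toy sequence with lysines at positions 3 and 6.
windows = lysine_nonapeptides("MAKLVKE")
```

Each window would then be converted into numeric features (the abstract uses Amino Acid Index values) before being passed to the classifier.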
Affiliation(s)
- YuDong Cai
- Institute of System Biology, Shanghai University, 99 Shangda Road, Shanghai, 200244, China.
6
Prediction of RNA-binding proteins by voting systems. J Biomed Biotechnol 2011; 2011:506205. [PMID: 21826121 PMCID: PMC3149752 DOI: 10.1155/2011/506205] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.6] [Received: 03/09/2011] [Revised: 05/12/2011] [Accepted: 05/26/2011] [Indexed: 11/29/2022] Open
Abstract
It is important to identify which proteins can interact with RNA for the purpose of protein annotation, since interactions between RNA and proteins influence the structure of the ribosome and play important roles in gene expression. This paper tries to identify proteins that can interact with RNA using voting systems. First, 34 learning algorithms are chosen through Weka for investigation. Then a simple majority voting system (SMVS) is used for the prediction of RNA-binding proteins, achieving an average ACC (overall prediction accuracy) of 79.72% and an MCC (Matthews correlation coefficient) of 59.77% on the independent testing dataset. Next, the mRMR (minimum redundancy maximum relevance) strategy is applied, adapted here to algorithm selection. In addition, the MCC value of each classifier is assigned as the weight of that classifier's vote. As a result, the best average MCC, 64.70% on the independent testing dataset, is attained when 22 algorithms are selected and integrated through weighted votes; the corresponding ACC is 82.04%.
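The MCC-weighted vote described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' Weka pipeline: the helper names `mcc` and `weighted_vote`, the binary 0/1 labels, and the toy predictions and weights are all hypothetical.

```python
from math import sqrt


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0.0 when the denominator vanishes.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def weighted_vote(predictions, weights):
    """Combine per-classifier 0/1 predictions using one weight per classifier."""
    combined = []
    for i in range(len(predictions[0])):
        pos = sum(w for preds, w in zip(predictions, weights) if preds[i] == 1)
        neg = sum(w for preds, w in zip(predictions, weights) if preds[i] == 0)
        combined.append(1 if pos > neg else 0)
    return combined


# Three hypothetical classifiers voting on two samples; the weights stand in
# for MCC values measured beforehand on held-out data.
votes = weighted_vote([[1, 0], [0, 0], [0, 1]], [0.9, 0.3, 0.4])
```

On the first sample a simple majority would predict 0 (two votes against one), whereas the weighted vote predicts 1 because the single positive vote comes from the classifier with the largest weight.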
7
Analysis of protein pathway networks using hybrid properties. Molecules 2010; 15:8177-92. [PMID: 21076385 PMCID: PMC6259184 DOI: 10.3390/molecules15118177] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.6] [Received: 08/05/2010] [Revised: 11/11/2010] [Accepted: 11/12/2010] [Indexed: 12/20/2022] Open
Abstract
Given a protein-forming system, i.e., a system consisting of a certain number of different proteins, can it form a biologically meaningful pathway? This is a fundamental problem in systems biology and proteomics. During the past decade, a vast amount of information on different organisms, at both the genetic and metabolic levels, has been accumulated and systematically stored in various specific databases, such as KEGG, ENZYME, BRENDA, EcoCyc and MetaCyc. These data have made it feasible to address such an essential problem. In this paper, we have analyzed known regulatory pathways in humans by extracting different (biological and graphic) features from each of 17,069 protein-forming systems, of which 169 are positive pathways, i.e., known regulatory pathways taken from KEGG, while 16,900 are negative, i.e., they do not form a biologically meaningful pathway. Each of these protein-forming systems was represented by 352 features, of which 88 are graph features and 264 are biological features. To analyze these features, the "Minimum Redundancy Maximum Relevance" and "Incremental Feature Selection" techniques were utilized to select a set of 22 optimal features to query whether a protein-forming system is able to form a biologically meaningful pathway or not. It was found through cross-validation that the overall success rate thus obtained in identifying the positive pathways was 79.88%. It is anticipated that this novel approach and encouraging result, although still preliminary, may stimulate extensive investigations into this important topic.
8
Chen L, Qian Z, Fen K, Cai Y. Prediction of interactiveness between small molecules and enzymes by combining gene ontology and compound similarity. J Comput Chem 2010; 31:1766-76. [PMID: 20033913 DOI: 10.1002/jcc.21467] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 11/11/2022]
Abstract
Determining whether a small organic molecule interacts with an enzyme can help to understand the molecular and cellular functions of organisms and their metabolic pathways. In this research, we present a prediction model that combines compound similarity and enzyme similarity to predict the interactiveness between small molecules and enzymes. A dataset consisting of 2859 positive couples of small molecule and enzyme and 286,056 negative couples was employed. Compound similarity is a measure of how similar two small molecules are, proposed by Hattori et al., J Am Chem Soc 2003, 125, 11853, and available at http://www.genome.jp/ligand-bin/search_compound, while enzyme similarity was obtained in three ways: the BLAST method, gene ontology items, and functional domain composition. Then a new distance between a pair of couples was established, and the nearest neighbor algorithm (NNA) was employed to predict the interactiveness of enzymes and small molecules. A data distribution strategy was adopted to achieve a better balance between positive and negative samples when training the prediction model: one-fourth of the couples were singled out as testing samples and the remaining data were divided into seven training datasets, with all remaining positive samples added to each training dataset and only the negative samples divided among them. In this way, seven NNAs were built. Finally, a simple majority voting system was applied to integrate these seven models to predict the testing dataset, which was demonstrated to give better prediction results than any single prediction model. As a result, the highest overall prediction accuracy reached 97.30%.
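The data distribution strategy and majority vote described in this abstract can be sketched roughly as below. The sketch assumes samples are `(feature, label)` tuples with a single numeric feature and uses absolute difference as a placeholder distance; the paper's actual distance between couples is not given here, and all function names are hypothetical.

```python
def split_negatives(positives, negatives, k=7):
    """Build k training sets: every set gets all positives plus one slice of negatives."""
    return [positives + negatives[i::k] for i in range(k)]


def nearest_neighbor_label(train, x, dist):
    """1-nearest-neighbor prediction: label of the closest training sample."""
    return min(train, key=lambda sample: dist(sample[0], x))[1]


def ensemble_predict(train_sets, x, dist):
    """Simple majority vote over one nearest-neighbor model per training set."""
    votes = [nearest_neighbor_label(ts, x, dist) for ts in train_sets]
    return 1 if 2 * sum(votes) > len(votes) else 0


# Toy data: two positives near 0, fourteen negatives between 1.0 and 2.3.
positives = [(0.0, 1), (0.1, 1)]
negatives = [(1.0 + 0.1 * i, 0) for i in range(14)]
train_sets = split_negatives(positives, negatives)


def placeholder_distance(a, b):
    # Assumption: stands in for the couple-to-couple distance of the paper.
    return abs(a - b)


near_positive = ensemble_predict(train_sets, 0.2, placeholder_distance)
near_negative = ensemble_predict(train_sets, 1.5, placeholder_distance)
```

Repeating the positives in every training set keeps each individual model from being swamped by negatives, which is the balancing effect the abstract attributes to this strategy.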
Affiliation(s)
- Lei Chen
- Shanghai Key Laboratory of Trustworthy Computing, East China Normal University, Shanghai 200062, People's Republic of China
9
Prediction of pharmacological and xenobiotic responses to drugs based on time course gene expression profiles. PLoS One 2009; 4:e8126. [PMID: 19956587 PMCID: PMC2780314 DOI: 10.1371/journal.pone.0008126] [Citation(s) in RCA: 64] [Impact Index Per Article: 4.3] [Received: 09/07/2009] [Accepted: 11/10/2009] [Indexed: 12/19/2022] Open
Abstract
There is growing concern about the risk of unexpected side effects observed in the later steps of the development of new drugs, either in late clinical development or after marketing approval. To reduce the risk of such side effects, it is important to look for possible xenobiotic responses at an early stage. We attempt this through prediction, assuming that similarities in microarray profiles indicate shared mechanisms of action and/or toxicological responses among the chemicals being compared. A large time course microarray database derived from livers of compound-treated rats with thirty-four distinct pharmacological and toxicological responses was studied. The mRMR (Minimum-Redundancy-Maximum-Relevance) method and IFS (Incremental Feature Selection) were used to select a compact feature set (141 features) to reduce the feature dimension and improve prediction performance. With these 141 features, the leave-one-out cross-validation prediction accuracy of first order response using NNA (Nearest Neighbor Algorithm) was 63.9%. Our method can be used to predict pharmacological and xenobiotic responses of new compounds and to accelerate drug development.
10
Yuan Y, Shi X, Li X, Lu W, Cai Y, Gu L, Liu L, Li M, Kong X, Xing M. Prediction of interactiveness of proteins and nucleic acids based on feature selections. Mol Divers 2009; 14:627-33. [PMID: 19816781 DOI: 10.1007/s11030-009-9198-9] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Received: 05/18/2009] [Accepted: 09/07/2009] [Indexed: 11/29/2022]
Abstract
It is important to identify which proteins can interact with nucleic acids for the purpose of protein annotation, since interactions between nucleic acids and proteins are involved in numerous cellular processes such as replication, transcription, splicing, and DNA repair. This research tries to identify proteins that can interact with DNA, RNA, and rRNA, respectively. mRMR (minimum redundancy and maximum relevance), with its elegant mathematical formulation, has been applied widely in processing biological data and feature analysis since its introduction in 2005. mRMR plus incremental feature selection (IFS) is known to be very efficient in feature selection and analysis, and able to improve both the effectiveness and efficiency of a prediction model. IFS is applied to decide how many features should be selected from the feature list provided by mRMR. Finally, the features selected by mRMR and IFS are further refined by a conventional feature selection method, the forward feature wrapper (FFW), which reorders the features. Each protein is coded by 132 features including amino acid compositions and physicochemical properties. After feature selection, the k-nearest neighbor algorithm, the adopted prediction model, is trained and tested. As a result, the optimized prediction accuracies for DNA, RNA, and rRNA are 82.0%, 83.4%, and 92.3%, respectively. Furthermore, the most important features that contribute to the prediction are identified and analyzed biologically. The predictor developed for this research is available for public access at http://chemdata.shu.edu.cn/protein_na_mrmr/.
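The incremental feature selection step described in this abstract can be sketched as follows. The sketch assumes the feature ranking has already been produced (for instance by mRMR, which is not implemented here) and scores each prefix of the ranking with leave-one-out 1-nearest-neighbor accuracy. The choice of k = 1, the squared Euclidean distance, and the function names are assumptions for illustration only.

```python
def loo_nn_accuracy(X, y, feature_idx):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier on chosen features."""
    correct = 0
    for i in range(len(X)):
        best_j, best_d = None, float("inf")
        for j in range(len(X)):
            if j == i:
                continue
            d = sum((X[i][f] - X[j][f]) ** 2 for f in feature_idx)
            if d < best_d:
                best_j, best_d = j, d
        correct += int(y[best_j] == y[i])
    return correct / len(X)


def incremental_feature_selection(X, y, ranked):
    """Score ranked[:1], ranked[:2], ... and keep the best-scoring prefix."""
    best_k, best_acc = 0, -1.0
    for k in range(1, len(ranked) + 1):
        acc = loo_nn_accuracy(X, y, ranked[:k])
        if acc > best_acc:
            best_k, best_acc = k, acc
    return ranked[:best_k], best_acc


# Toy data: feature 0 separates the classes, feature 1 is misleading noise.
X = [[0.0, 5.0], [0.1, 1.0], [1.0, 5.1], [1.1, 0.9]]
y = [0, 0, 1, 1]
selected, accuracy = incremental_feature_selection(X, y, ranked=[0, 1])
```

In this toy case adding the second feature lowers the accuracy, so only the top-ranked feature is kept, which mirrors how IFS decides how many features to take from the ranked list.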
Affiliation(s)
- YouLang Yuan
- Chemical Data mining Laboratory, Department of Chemistry, College of Sciences, Shanghai University, 99 Shang-Da Road, Shanghai, 200444, People's Republic of China
11
González-Díaz H, Dea-Ayuela MA, Pérez-Montoto LG, Prado-Prado FJ, Agüero-Chapín G, Bolas-Fernández F, Vazquez-Padrón RI, Ubeira FM. QSAR for RNases and theoretic-experimental study of molecular diversity on peptide mass fingerprints of a new Leishmania infantum protein. Mol Divers 2009; 14:349-69. [PMID: 19578942 PMCID: PMC7088557 DOI: 10.1007/s11030-009-9178-0] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Received: 03/03/2009] [Accepted: 06/13/2009] [Indexed: 11/29/2022]
Abstract
The toxicity and low success of current treatments for Leishmaniosis motivate the search for new peptide drugs and/or molecular targets in Leishmania pathogen species (L. infantum and L. major). For example, ribonucleases (RNases) are enzymes relevant to several biological processes; thus, the theoretical and experimental study of the molecular diversity of Peptide Mass Fingerprints (PMFs) of RNases is useful for drug design. This study introduces a methodology that combines QSAR models, 2D-Electrophoresis (2D-E), MALDI-TOF Mass Spectroscopy (MS), BLAST alignment, and Molecular Dynamics (MD) to explore PMFs of RNases. We illustrate this approach by investigating for the first time the PMFs of a new protein of L. infantum. Here we report and compare new versus old predictive models for RNases based on Topological Indices (TIs) of Markov Pseudo-Folding Lattices. This group of indices, called Pseudo-folding Lattice 2D-TIs, includes spectral moments π_k(x,y), mean electrostatic potentials ξ_k(x,y), and entropy measures θ_k(x,y). The accuracy of the models (training/cross-validation) was as follows: ξ_k(x,y) model (96.0%/91.7%) > π_k(x,y) model (84.7%/83.3%) > θ_k(x,y) model (66.0%/66.7%). We also carried out a 2D-E analysis of biological samples of L. infantum promastigotes, focusing on a 2D-E gel spot of one unknown protein with M < 20,100 and pI < 7. A MASCOT search identified 20 proteins with a Mowse score > 30, but none > 52 (the threshold value); the highest value, 42, was for a probable DNA-directed RNA polymerase. However, we determined experimentally the sequence of more than 140 peptides. We used QSAR models to predict RNase scores for these peptides and BLAST alignment to confirm some results. We also calculated 3D-folding TIs based on MD experiments and compared 2D versus 3D-TIs in a molecular phylogenetic analysis of the molecular diversity of these peptides. This combined strategy may be of interest in drug development or target identification.
Affiliation(s)
- Humberto González-Díaz
- Department of Microbiology and Parasitology, and Department of Organic Chemistry, Faculty of Pharmacy, USC, 15782, Santiago de Compostela, Spain.