1
Protein-Specific Prediction of RNA-Binding Sites Based on Information Entropy. Comput Intell Neurosci 2022; 2022:8626628. [PMID: 36225547 PMCID: PMC9550406 DOI: 10.1155/2022/8626628] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/31/2022] [Revised: 09/15/2022] [Accepted: 09/20/2022] [Indexed: 11/25/2022]
Abstract
Understanding the mechanism of protein-RNA interaction can help us further explore various biological processes. Experimental techniques still have limitations, such as high cost in money and time, so predicting protein-RNA-binding sites by computational methods is a valuable research tool. Here, we developed a universal method for predicting protein-specific RNA-binding sites: one general model for a given protein was constructed on a fixed dataset by fusing data from different experimental techniques. Information theory was employed to characterize the sequence conservation of RNA-binding segments. Conservation difference profiles between binding and nonbinding segments were constructed using information entropy (IE) and indicate a significant difference. Finally, 19 protein-specific models based on random forest (RF) were built on the IE encoding. The performance on independent datasets demonstrates that our method obtains competitive results compared with the current best prediction model.
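To make the idea of an entropy-based conservation profile concrete, the following is a minimal sketch, not the authors' implementation. It assumes equal-length RNA segments over the alphabet A, C, G, U and computes the Shannon entropy at each position for a binding set and a nonbinding set, then takes the per-position difference. The function names and the toy segments are hypothetical; the abstract does not specify the exact IE encoding.

```python
import math
from collections import Counter

ALPHABET = "ACGU"  # assumed RNA alphabet; the paper's encoding may differ


def position_entropy(segments):
    """Shannon entropy (in bits) of the nucleotide distribution at each position."""
    length = len(segments[0])
    profile = []
    for i in range(length):
        counts = Counter(seg[i] for seg in segments)
        total = sum(counts.values())
        h = 0.0
        for base in ALPHABET:
            p = counts.get(base, 0) / total
            if p > 0:
                h -= p * math.log2(p)
        profile.append(h)
    return profile


def entropy_difference(binding, nonbinding):
    """Per-position entropy of nonbinding minus binding segments.

    Larger values mark positions that are more conserved in the binding set.
    """
    hb = position_entropy(binding)
    hn = position_entropy(nonbinding)
    return [n - b for b, n in zip(hb, hn)]


# Toy data (hypothetical): binding segments share a conserved UGCA prefix.
binding = ["UGCAU", "UGCAU", "UGCAC", "UGCAG"]
nonbinding = ["ACGUA", "CGUAC", "GUACG", "UACGU"]

print(entropy_difference(binding, nonbinding))
```

In this toy case the first four positions are fully conserved in the binding set (entropy 0) and uniformly distributed in the nonbinding set (entropy 2 bits), so the difference profile is largest there. A profile like this could then serve as input features for a classifier such as random forest.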
2
Proteome-wide Prediction of Lysine Methylation Leads to Identification of H2BK43 Methylation and Outlines the Potential Methyllysine Proteome. Cell Rep 2021; 32:107896. [PMID: 32668242 DOI: 10.1016/j.celrep.2020.107896] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 01/30/2020] [Revised: 04/29/2020] [Accepted: 06/22/2020] [Indexed: 12/15/2022]
Abstract
Protein Lys methylation plays a critical role in numerous cellular processes, but it is challenging to identify Lys methylation in a systematic manner. Here we present an approach combining in silico prediction with targeted mass spectrometry (MS) to identify Lys methylation (Kme) sites at the proteome level. We develop MethylSight, a program that predicts Kme events solely on the physicochemical properties of residues surrounding the putative methylation sites, which then requires validation by targeted MS. Using this approach, we identify 70 new histone Kme marks with a 90% validation rate. H2BK43me2, which undergoes dynamic changes during stem cell differentiation, is found to be a substrate of KDM5b. Furthermore, MethylSight predicts that Lys methylation is a prevalent post-translational modification in the human proteome. Our work provides a useful resource for guiding systematic exploration of the role of Lys methylation in human health and disease.
3
Huang G, Zheng Y, Wu YQ, Han GS, Yu ZG. An Information Entropy-Based Approach for Computationally Identifying Histone Lysine Butyrylation. Front Genet 2020; 10:1325. [PMID: 32117407 PMCID: PMC7033570 DOI: 10.3389/fgene.2019.01325] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 09/12/2019] [Accepted: 12/05/2019] [Indexed: 12/14/2022]
Abstract
Butyrylation plays a crucial role in cellular processes. Owing to technical limitations, identifying histone butyrylation sites on a large scale is a challenging task. To fill this gap, we propose an approach based on information entropy and machine learning for computationally identifying histone butyrylation sites. The proposed method achieves an area under the receiver operating characteristic (ROC) curve of 0.92 on the training set by 3-fold cross-validation and 0.80 on the testing set by independent test. Feature analysis implies that amino acid residues upstream and downstream of butyrylation sites exhibit a specific sequence motif to a certain extent. Functional analysis suggests that histone butyrylation is most likely associated with four pathways (systemic lupus erythematosus, alcoholism, viral carcinogenesis, and transcriptional misregulation in cancer), is involved in binding with other molecules and in processes of biosynthesis, assembly, arrangement, or disassembly, and is located in complexes consisting of DNA, RNA, protein, etc. The proposed method is useful for predicting histone butyrylation sites. The feature and functional analyses improve understanding of histone butyrylation and increase knowledge of the functions of butyrylated histones.
Affiliation(s)
- Guohua Huang: Provincial Key Laboratory of Informational Service for Rural Area of Southwestern Hunan, Shaoyang University, Shaoyang, China
- Yang Zheng: Provincial Key Laboratory of Informational Service for Rural Area of Southwestern Hunan, Shaoyang University, Shaoyang, China
- Yao-Qun Wu: Provincial Key Laboratory of Informational Service for Rural Area of Southwestern Hunan, Shaoyang University, Shaoyang, China; Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education and Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, Xiangtan, China
- Guo-Sheng Han: Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education and Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, Xiangtan, China
- Zu-Guo Yu: Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education and Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, Xiangtan, China; School of Electrical Engineering and Computer Science, Queensland University of Technology, Brisbane, QLD, Australia
4
Abstract
Protein methylation is an important and reversible post-translational modification
that regulates many biological processes in cells. It occurs mainly on lysine and arginine
residues and involves many important biological processes, including transcriptional
activity, signal transduction, and the regulation of gene expression. Protein methylation
and its regulatory enzymes are related to a variety of human diseases, so improved identification
of methylation sites is useful for designing drugs for a variety of related diseases.
In this review, we systematically summarize and analyze the tools used for the prediction
of protein methylation sites on arginine and lysine residues over the last decade.
Affiliation(s)
- Chunyan Ao: Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Shunshan Jin: Department of Neurology, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin, China
- Yuan Lin: Department of System Integration, Sparebanken Vest, Bergen, Norway
- Quan Zou: Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
5
Liu Y, Guo Y, Wu W, Xiong Y, Sun C, Yuan L, Li M. A Machine Learning-Based QSAR Model for Benzimidazole Derivatives as Corrosion Inhibitors by Incorporating Comprehensive Feature Selection. Interdiscip Sci 2019; 11:738-747. [PMID: 31486019 DOI: 10.1007/s12539-019-00346-7] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Received: 02/16/2019] [Revised: 07/23/2019] [Accepted: 07/25/2019] [Indexed: 01/28/2023]
Abstract
BACKGROUND Computational prediction of inhibition efficiency (IE) for inhibitor molecules is a crucial supplementary way to design novel molecules that can efficiently inhibit corrosion on metallic surfaces. PURPOSE Here we develop a new machine learning-based predictor for the IE of benzimidazole derivatives. METHODS First, a comprehensive numerical representation of the inhibitor molecules was built from energy, electronic, topological, physicochemical, and spatial properties based on 3-D structures, yielding 150 valid structural descriptors. Then, these structural descriptors were investigated thoroughly. Multicollinearity-based clustering analysis was performed to remove linearly correlated feature variables, producing 47 feature clusters. Gini importance from random forest (RF) was used to measure the contribution of the descriptors in each cluster, and the descriptor with the highest Gini importance score in each cluster was selected, giving 47 descriptors. Further, considering the limited number of available inhibitors, different feature subsets were constructed according to the Gini importance ranking of the 47 descriptors. RESULTS Finally, support vector machine (SVM) models based on the different feature subsets were tested by leave-one-out cross-validation. By comparison, the optimal SVM model used the top 11 descriptors with a polynomial (Poly) kernel. This model yields a correlation coefficient (R) of 0.9589 and a root-mean-square error (RMSE) of 4.45, which indicates that the proposed method gives the best performance on the current data. CONCLUSION Based on our model, 6 new benzimidazole molecules were designed, and their predicted IE values indicate that two of them have high potential as outstanding corrosion inhibitors.
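As a rough illustration of two steps mentioned in this abstract, the sketch below groups linearly correlated descriptors and computes the two reported evaluation metrics, R and RMSE. It is an assumption-laden toy, not the published pipeline: the greedy grouping rule, the 0.9 threshold, and the descriptor names (`homo`, `homo_scaled`, `dipole`) are all hypothetical, and the Gini-importance and SVM steps are not reproduced.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient (undefined for constant inputs)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rmse(observed, predicted):
    """Root-mean-square error between observed and predicted values."""
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


def cluster_correlated(descriptors, threshold=0.9):
    """Greedy grouping (assumed rule): a descriptor joins the first cluster
    whose first member it correlates with at |r| >= threshold."""
    clusters = []
    for name, values in descriptors.items():
        for cluster in clusters:
            representative = descriptors[cluster[0]]
            if abs(pearson(values, representative)) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


# Hypothetical descriptor values for five molecules.
descriptors = {
    "homo": [1.0, 2.0, 3.0, 4.0, 5.0],
    "homo_scaled": [2.0, 4.0, 6.0, 8.0, 10.0],  # perfectly collinear with "homo"
    "dipole": [5.0, 1.0, 4.0, 2.0, 3.0],
}

print(cluster_correlated(descriptors))
print(rmse([10.0, 20.0], [13.0, 16.0]))
```

After grouping, one representative per cluster would be kept (the abstract uses the highest Gini importance for that choice), and R and RMSE would be computed on leave-one-out predictions rather than on toy numbers.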
Affiliation(s)
- Youquan Liu: Research Institute of Natural Gas Technology, PetroChina Southwest Oil and Gas Field Company, Chengdu, 610213, China
- Yanzhi Guo: College of Chemistry, Sichuan University, Chengdu, Sichuan, 610064, People's Republic of China
- Wengang Wu: Research Institute of Natural Gas Technology, PetroChina Southwest Oil and Gas Field Company, Chengdu, 610213, China
- Ying Xiong: Research Institute of Natural Gas Technology, PetroChina Southwest Oil and Gas Field Company, Chengdu, 610213, China
- Chuan Sun: Research Institute of Natural Gas Technology, PetroChina Southwest Oil and Gas Field Company, Chengdu, 610213, China
- Li Yuan: Research Institute of Natural Gas Technology, PetroChina Southwest Oil and Gas Field Company, Chengdu, 610213, China
- Menglong Li: College of Chemistry, Sichuan University, Chengdu, Sichuan, 610064, People's Republic of China
6
Ma B, Allard C, Bouchard L, Perron P, Mittleman MA, Hivert MF, Liang L. Locus-specific DNA methylation prediction in cord blood and placenta. Epigenetics 2019; 14:405-420. [PMID: 30885044 DOI: 10.1080/15592294.2019.1588685] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 12/27/2022]
Abstract
DNA methylation is known to be responsive to prenatal exposures, which may be part of the mechanism linking early developmental exposures to future chronic diseases. Many studies use blood to measure DNA methylation, yet DNA methylation is known to be tissue specific. The placenta is central to fetal growth and development, but it is rarely feasible to collect this tissue in large epidemiological studies; cord blood samples, on the other hand, are more accessible. In this study, based on paired samples of placenta and cord blood from 169 individuals, we investigated the methylation concordance between placenta and cord blood. We then employed a machine-learning-based model to predict locus-specific DNA methylation levels in placenta from DNA methylation levels in cord blood. We found that the methylation correlation between placenta and cord blood is lower than that of other tissue pairs, consistent with existing observations that placenta methylation has a distinct pattern. Nonetheless, a number of CpG sites still show a robust association between the two tissues. We built prediction models for placenta methylation based on cord blood data and documented a subset of 1,012 CpG sites with high correlation between measured and predicted placenta methylation levels. The resulting list of CpG sites and prediction models could help reveal the loci where internal or external influences may affect DNA methylation in both placenta and cord blood, and provide reference data for predicting effects on the placenta in future studies even when the tissue is not available.
Affiliation(s)
- Baoshan Ma: College of Information Science and Technology, Dalian Maritime University, Dalian, Liaoning Province, China
- Catherine Allard: Centre de Recherche du Centre Hospitalier Universitaire de Sherbrooke, Sherbrooke, Quebec, Canada
- Luigi Bouchard: Centre de Recherche du Centre Hospitalier Universitaire de Sherbrooke, Sherbrooke, Quebec, Canada; Department of Biochemistry, Faculty of Medicine and Health Sciences, Université de Sherbrooke, Sherbrooke, Quebec, Canada; ECOGENE-21 Biocluster, CSSS de Chicoutimi, Chicoutimi, Quebec, Canada
- Patrice Perron: Centre de Recherche du Centre Hospitalier Universitaire de Sherbrooke, Sherbrooke, Quebec, Canada; Department of Medicine, Faculty of Medicine and Life Sciences, Université de Sherbrooke, Sherbrooke, Quebec, Canada
- Murray A Mittleman: Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA; Cardiovascular Epidemiology Research Unit, Beth Israel Deaconess Medical Center, Boston, MA, USA
- Marie-France Hivert: Centre de Recherche du Centre Hospitalier Universitaire de Sherbrooke, Sherbrooke, Quebec, Canada; Department of Medicine, Faculty of Medicine and Life Sciences, Université de Sherbrooke, Sherbrooke, Quebec, Canada; Department of Population Medicine, Harvard Pilgrim Health Care Institute, Harvard Medical School, Boston, MA, USA; Diabetes Unit, Massachusetts General Hospital, Boston, MA, USA
- Liming Liang: Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA; Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
7
Li W, Li M, Pu X, Guo Y. Distinguishing the disease-associated SNPs based on composition frequency analysis. Interdiscip Sci 2017; 9:459-467. [PMID: 29143920 DOI: 10.1007/s12539-017-0248-1] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 02/21/2017] [Revised: 06/03/2017] [Accepted: 06/26/2017] [Indexed: 12/22/2022]
Abstract
Single-nucleotide polymorphism (SNP) is a basic form of variation in the genome. When SNPs occur at microRNA binding sites, they can influence binding efficiency, cause fluctuation of mRNA levels in vivo, and thus give rise to abnormalities at the posttranscriptional level. SNPs are therefore strongly correlated with disease. Although an enormous number of SNPs have been experimentally identified, only a tiny proportion are truly disease-associated SNPs (dSNPs) that relate to microRNA modification and are involved in the disease-causing process. It is therefore important to distinguish dSNPs from ordinary SNPs. The analysis here shows that composition differs between sequence segments centered on dSNPs and those centered on ordinary SNPs. Inspired by the composition, transition, and distribution features, which are meaningful and effective in characterizing protein sequence information, we adapted them to represent the frequency and physicochemical properties of a gene sequence. A binary encoding scheme was also used to label the four nucleotides (A, T, C, and G). First, clustering analysis was performed to obtain reasonable negative samples. Then, optimization tests were carried out on different ratios of positive to negative samples and on different feature subsets ranked by F score. The optimal model, constructed with random forest, achieves an accuracy of more than 90% on the testing data set. Moreover, the promising results of the external validation also demonstrate the practical applicability of our method. Finally, principal component analysis of the features indicates that all features in our method contribute to the prediction model.
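The binary encoding and composition features described above can be sketched as follows. This is an illustrative assumption rather than the paper's exact scheme: the bit assignment for each nucleotide and the function names are hypothetical, and the transition and distribution features are not implemented.

```python
from collections import Counter

# Assumed one-hot assignment; the abstract does not give the actual bit patterns.
BINARY_CODE = {
    "A": (1, 0, 0, 0),
    "T": (0, 1, 0, 0),
    "C": (0, 0, 1, 0),
    "G": (0, 0, 0, 1),
}


def binary_encode(segment):
    """Concatenate the 4-bit code of every nucleotide in the segment."""
    vector = []
    for base in segment:
        vector.extend(BINARY_CODE[base])
    return vector


def composition(segment):
    """Fraction of A, T, C, and G in the segment, in that order."""
    counts = Counter(segment)
    return [counts.get(base, 0) / len(segment) for base in "ATCG"]


def encode(segment):
    """Feature vector: 4 composition values followed by the binary encoding."""
    return composition(segment) + binary_encode(segment)


# Hypothetical 5-nucleotide segment centered on a SNP position.
print(encode("ATCGA"))
```

For a segment of length n this gives 4 + 4n features, which could then be ranked (for example by F score) before training a classifier.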
Affiliation(s)
- Wenling Li: College of Chemistry, Sichuan University, Chengdu, 610064, People's Republic of China
- Menglong Li: College of Chemistry, Sichuan University, Chengdu, 610064, People's Republic of China
- Xuemei Pu: College of Chemistry, Sichuan University, Chengdu, 610064, People's Republic of China
- Yanzhi Guo: College of Chemistry, Sichuan University, Chengdu, 610064, People's Republic of China
8
Silva JCF, Carvalho TFM, Fontes EPB, Cerqueira FR. Fangorn Forest (F2): a machine learning approach to classify genes and genera in the family Geminiviridae. BMC Bioinformatics 2017; 18:431. [PMID: 28964254 PMCID: PMC5622471 DOI: 10.1186/s12859-017-1839-x] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 05/01/2017] [Accepted: 09/20/2017] [Indexed: 11/14/2022]
Abstract
Background Geminiviruses infect a broad range of cultivated and non-cultivated plants, causing significant economic losses worldwide. Studies of the species diversity, taxonomy, mechanisms of evolution, geographic distribution, and mechanisms of interaction of these pathogens with the host have greatly increased in recent years. Furthermore, the use of rolling circle amplification (RCA) and advanced metagenomics approaches has enabled the elucidation of viromes and the identification of many viral agents in a large number of plant species. As a result, determining the nomenclature and taxonomic classification of geminiviruses has become a complex task. In addition, the gene responsible for viral replication (particularly in viruses belonging to the genus Mastrevirus) may be spliced due to the use of the transcriptional/splicing machinery of the host cells. However, current tools have limitations concerning the identification of introns. Results This study proposes a new method, designated Fangorn Forest (F2), based on machine learning approaches to classify genera using an ab initio approach, i.e., using only the genomic sequence, as well as to predict and classify genes in the family Geminiviridae. In this investigation, nine genera of the family Geminiviridae and their related satellite DNAs were selected. We obtained two training sets: one for genus classification, containing attributes extracted from the complete genomes of geminiviruses, and the other for classifying geminivirus genes, containing attributes extracted from ORFs taken from the same complete genomes. Three ML algorithms were applied to those datasets to build the predictive models: support vector machines, using the sequential minimal optimization training approach, random forest (RF), and multilayer perceptron. RF demonstrated very high predictive power, achieving precision, recall, and area under the curve (AUC) of 0.966, 0.964, and 0.995, respectively, for genus classification. For gene classification, RF reached precision, recall, and AUC of 0.983, 0.983, and 0.998, respectively. Conclusions Fangorn Forest is therefore shown to be an efficient method for classifying genera of the family Geminiviridae with high precision and for effective gene prediction and classification. The method is freely accessible at www.geminivirus.org:8080/geminivirusdw/discoveryGeminivirus.jsp.
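For readers unfamiliar with the precision and recall figures quoted above, here is a minimal sketch of how they are computed for one class in a multi-class setting such as genus classification. The function name and the toy label lists are hypothetical and unrelated to the study's data; how the paper averages per-class values is not stated in the abstract.

```python
def precision_recall(true_labels, predicted_labels, positive):
    """Precision and recall for one class, treating all other classes as negative."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy genus labels (hypothetical predictions, not results from the paper).
true_genus = ["Begomovirus", "Begomovirus", "Mastrevirus", "Curtovirus"]
pred_genus = ["Begomovirus", "Mastrevirus", "Mastrevirus", "Curtovirus"]

for genus in ("Begomovirus", "Mastrevirus", "Curtovirus"):
    print(genus, precision_recall(true_genus, pred_genus, genus))
```

Here one Begomovirus genome is misassigned to Mastrevirus, which lowers recall for Begomovirus and precision for Mastrevirus while leaving Curtovirus unaffected.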
Affiliation(s)
- José Cleydson F Silva: Department of Informatics, Universidade Federal de Viçosa, Viçosa, Minas Gerais, 36570-900, Brazil; Department of Biochemistry and Molecular Biology, Universidade Federal de Viçosa, Campus Universitário, Viçosa, Minas Gerais, 36570-900, Brazil
- Thales F M Carvalho: Department of Informatics, Universidade Federal de Viçosa, Viçosa, Minas Gerais, 36570-900, Brazil
- Elizabeth P B Fontes: National Institute of Science and Technology in Plant-Pest Interactions/BIOAGRO, Campus Universitário, Viçosa, Minas Gerais, 36570-900, Brazil; Department of Biochemistry and Molecular Biology, Universidade Federal de Viçosa, Campus Universitário, Viçosa, Minas Gerais, 36570-900, Brazil
- Fabio R Cerqueira: Department of Informatics, Universidade Federal de Viçosa, Viçosa, Minas Gerais, 36570-900, Brazil; Department of Production Engineering, Universidade Federal Fluminense, Rua Domingos Silvério, s/n, Bairro Quitandinha, Petrópolis, Rio de Janeiro, 25650-050, Brazil
9
Using oriented peptide array libraries to evaluate methylarginine-specific antibodies and arginine methyltransferase substrate motifs. Sci Rep 2016; 6:28718. [PMID: 27338245 PMCID: PMC4919620 DOI: 10.1038/srep28718] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.0] [Received: 04/01/2016] [Accepted: 06/08/2016] [Indexed: 12/29/2022]
Abstract
Signal transduction in response to stimuli relies on the generation of cascades of posttranslational modifications that promote protein-protein interactions and facilitate the assembly of distinct signaling complexes. Arginine methylation is one such modification, which is catalyzed by a family of nine protein arginine methyltransferases, or PRMTs. Elucidating the substrate specificity of each PRMT will promote a better understanding of which signaling networks these enzymes contribute to. Although many PRMT substrates have been identified, and their methylation sites mapped, the optimal target motif for each of the nine PRMTs has not been systematically addressed. Here we describe the use of Oriented Peptide Array Libraries (OPALs) to methodically dissect the preferred methylation motifs for three of these enzymes - PRMT1, CARM1 and PRMT9. In parallel, we show that an OPAL platform with a fixed methylarginine residue can be used to validate the methyl-specific and sequence-specific properties of antibodies that have been generated against different PRMT substrates, and can also be used to confirm the pan nature of some methylarginine-specific antibodies.