1
|
Selvaraj MK, Thakur A, Kumar M, Pinnaka AK, Suri CR, Siddhardha B, Elumalai SP. Ion-pumping microbial rhodopsin protein classification by machine learning approach. BMC Bioinformatics 2023; 24:29. [PMID: 36707759 PMCID: PMC9881276 DOI: 10.1186/s12859-023-05138-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/13/2022] [Accepted: 01/04/2023] [Indexed: 01/28/2023] Open
Abstract
BACKGROUND Rhodopsin is a seven-transmembrane protein covalently linked to a retinal chromophore that absorbs photons for energy conversion and intracellular signaling in eukaryotes, bacteria, and archaea. Haloarchaeal rhodopsins are Type-I microbial rhodopsins that drive various light-dependent functions such as proton pumping, chloride pumping and phototaxis. The industrial application of ion-pumping haloarchaeal rhodopsins is limited by the lack of classifications based on full-length rhodopsin sequences, which play an important role in ion-pumping activity. The best-studied haloarchaeal rhodopsin is the proton-pumping bacteriorhodopsin, which shows promising applications in optogenetics, biosensitized solar cells, security ink, data storage, artificial retinal implants and biohydrogen generation. A low-cost computational approach is therefore required to identify ion-pumping haloarchaeal rhodopsin sequences and their subtypes. RESULTS This study uses a support vector machine (SVM) technique to identify these ion-pumping haloarchaeal rhodopsin proteins. The haloarchaeal ion-pumping rhodopsins (bacteriorhodopsin, halorhodopsin, xanthorhodopsin and sensory rhodopsin) and marine prokaryotic ion-pumping rhodopsins such as actinorhodopsin and proteorhodopsin were used to develop methods that accurately identify ion-pumping haloarchaeal and other Type-I microbial rhodopsins. We achieved overall maximum accuracies of 97.78%, 97.84% and 97.60% for amino acid composition, dipeptide composition and the hybrid approach, respectively, on tenfold cross-validation using SVM. Predictive models for each class of rhodopsin performed equally well on an independent data set. Similar results were achieved using another machine learning technique, random forest, and the predictive models performed equally well during five-fold cross-validation.
In addition, we tested our own, blank and BLAST datasets and the annotated whole-genome rhodopsin sequences of PWS haloarchaeal isolates with the developed methods. The developed web server ( https://bioinfo.imtech.res.in/servers/rhodopred ) can identify ion-pumping haloarchaeal rhodopsin proteins and their subtypes. We expect this web tool to be useful for rhodopsin researchers. CONCLUSION The overall performance shows that the developed method accurately identifies ion-pumping haloarchaeal rhodopsins and their subtypes in known and unknown microbial rhodopsin sequences. We expect this study to be useful for optogenetics researchers, molecular biologists and rhodopsin researchers.
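The three feature encodings named in this abstract (amino acid composition, dipeptide composition and a hybrid of the two) can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas, so the percentage scaling, the dropping of non-standard residues and the reading of "hybrid" as a plain concatenation are assumptions here, and the function names are hypothetical rather than taken from the authors' web server.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def _clean(seq):
    # Assumption: non-standard residues (X, B, Z, gaps) are simply dropped.
    # This can join residues that were not adjacent in the original sequence.
    return "".join(r for r in seq.upper() if r in AMINO_ACIDS)


def amino_acid_composition(seq):
    """Return 20 values: the percentage of each standard residue."""
    seq = _clean(seq)
    if not seq:
        raise ValueError("sequence contains no standard residues")
    counts = Counter(seq)
    return [100.0 * counts[a] / len(seq) for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Return 400 values: the percentage of each ordered residue pair."""
    seq = _clean(seq)
    if len(seq) < 2:
        raise ValueError("need at least two standard residues")
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    counts = Counter(pairs)
    return [100.0 * counts[a + b] / len(pairs)
            for a, b in product(AMINO_ACIDS, repeat=2)]


def hybrid_features(seq):
    """Assumed hybrid encoding: the 20 + 400 values concatenated."""
    return amino_acid_composition(seq) + dipeptide_composition(seq)
```

Each vector would then be passed to a classifier such as an SVM or a random forest; the fixed vector length (20, 400 or 420) is what lets sequences of different lengths share one model.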
Affiliation(s)
- Muthu Krishnan Selvaraj
- MTCC-Microbial Type Culture Collection and Gene Bank, Institute of Microbial Technology, Council of Scientific and Industrial Research (CSIR-IMTECH), Chandigarh, 160036 India
| | - Anamika Thakur
- Virology Unit and Bioinformatics Centre, Institute of Microbial Technology, Council of Scientific and Industrial Research (CSIR-IMTECH), Chandigarh, 160036 India
| | - Manoj Kumar
- Virology Unit and Bioinformatics Centre, Institute of Microbial Technology, Council of Scientific and Industrial Research (CSIR-IMTECH), Chandigarh, 160036 India
| | - Anil Kumar Pinnaka
- MTCC-Microbial Type Culture Collection and Gene Bank, Institute of Microbial Technology, Council of Scientific and Industrial Research (CSIR-IMTECH), Chandigarh, 160036 India
| | - Chander Raman Suri
- Biosensor Department, Institute of Microbial Technology, Council of Scientific and Industrial Research (CSIR-IMTECH), Chandigarh, 160036 India
| | - Busi Siddhardha
- Department of Microbiology, School of Life Sciences, Pondicherry University, Puducherry, 605014 India
| | - Senthil Prasad Elumalai
- Biochemical Engineering Research and Process Development Centre, Institute of Microbial Technology, Council of Scientific and Industrial Research (CSIR-IMTECH), Chandigarh, 160036 India
| |
|
2
|
Ling C, Wei X, Shen Y, Zhang H. Development and validation of multiple machine learning algorithms for the classification of G-protein-coupled receptors using molecular evolution model-based feature extraction strategy. Amino Acids 2021; 53:1705-1714. [PMID: 34562175 DOI: 10.1007/s00726-021-03080-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 05/08/2021] [Accepted: 09/13/2021] [Indexed: 11/25/2022]
Abstract
Machine learning is one of the most promising ways to predict the function of the growing, large-scale set of G-protein-coupled receptors (GPCRs). Prior research reveals that the key to overall GPCR classification accuracy is extracting valuable features and filtering out redundancy. To achieve a more efficient classification model, we take the feature synonym problem into account and create a new method based on functional word clustering and integration. By evaluating the evolutionary correlation between features using the transition scores in mature molecular substitution matrices, candidate features are clustered into synonym groups. Each group of clustered features is then integrated and represented by a unique key functional word. These retained key functional words form a feature knowledge base. Before the training and testing stage, the original GPCR sequences are converted into feature vectors by a feature re-extraction strategy based on the features in the knowledge base. We create multiple machine learning models based on naïve Bayes (NB), random forest (RF), support vector machine (SVM), and multi-layer perceptron (MLP) algorithms. The established models are applied to classify two public data sets containing 8354 and 12,731 GPCRs, respectively. These models achieve strong performance on almost all evaluation criteria in comparison with the state of the art. This work demonstrates the potential of the novel feature extraction strategy and provides an effective theoretical design for the hierarchical classification of GPCRs.
Affiliation(s)
- Cheng Ling
- College of Information Science and Technology, Beijing University of Chemical Technology, Beijing, China
| | - Xiaolin Wei
- College of Information Science and Technology, Beijing University of Chemical Technology, Beijing, China
| | - Yitian Shen
- College of Information Science and Technology, Beijing University of Chemical Technology, Beijing, China
| | - Haoyu Zhang
- School of Information Engineering, Zhejiang Ocean University, Zhoushan, China.
| |
|
3
|
Wang Y, Li M, Ji R, Wang M, Zheng L. Comparison of Soil Total Nitrogen Content Prediction Models Based on Vis-NIR Spectroscopy. SENSORS 2020; 20:s20247078. [PMID: 33321833 PMCID: PMC7763030 DOI: 10.3390/s20247078] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 10/29/2020] [Revised: 11/24/2020] [Accepted: 12/07/2020] [Indexed: 01/20/2023]
Abstract
Visible-near-infrared (Vis-NIR) spectroscopy is one of the most important methods for non-destructive and rapid detection of soil total nitrogen (STN) content. To find a practical way to build an STN content prediction model, three conventional machine learning methods and one deep learning approach are investigated, and their predictive performances are compared and analyzed using a public dataset called LUCAS Soil (19,019 samples). The three conventional machine learning methods are ordinary least square estimation (OLSE), random forest (RF), and extreme learning machine (ELM), while for the deep learning method, three different structures of convolutional neural network (CNN) incorporating the Inception module are constructed and investigated. To clarify the effectiveness of different pre-treatments for predicting STN content, the three conventional machine learning methods are combined with four pre-processing approaches (baseline correction, smoothing, dimensional reduction, and feature selection), and the combinations are investigated, compared, and analyzed. The results indicate that the baseline-corrected and smoothed ELM model reaches practical precision (coefficient of determination (R2) = 0.89, root mean square error of prediction (RMSEP) = 1.60 g/kg, and residual prediction deviation (RPD) = 2.34). Among the three differently structured CNN models, the one with more 1 × 1 convolutions performs better (R2 = 0.93; RMSEP = 0.95 g/kg; and RPD = 3.85 in the optimal case).
In addition, to evaluate the influence of data set characteristics on the model, the LUCAS data set was divided into subsets according to dataset size, organic carbon (OC) content and country. The results show that the deep learning method is more effective and practical than conventional machine learning methods and that, given enough data samples, it can be used to build a robust, high-accuracy STN content prediction model for the same type of soil with similar agricultural treatment.
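The three figures of merit quoted in this abstract (R2, RMSEP and RPD) can be computed from paired reference and predicted values. The sketch below uses common textbook definitions, which are assumptions: the abstract does not state which variants the authors used (for example, whether R2 is a squared correlation, or whether RPD divides by RMSEP or by a bias-corrected error), so results may not match the paper's numbers exactly. The function name is hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (R2, RMSEP, RPD) for paired reference and predicted values.

    Assumed definitions:
      R2    = 1 - SS_res / SS_tot
      RMSEP = sqrt(SS_res / n)
      RPD   = sample standard deviation of the reference values / RMSEP
    """
    n = len(y_true)
    if n < 2 or n != len(y_pred):
        raise ValueError("need two equally long lists with at least 2 values")
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    rmsep = math.sqrt(ss_res / n)
    sd_ref = math.sqrt(ss_tot / (n - 1))  # sample SD, an assumption
    rpd = sd_ref / rmsep
    return r2, rmsep, rpd
```

RMSEP carries the unit of the measured quantity (g/kg in the abstract), while R2 and RPD are unitless, which is why the latter two are easier to compare across datasets with different nitrogen ranges.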
Affiliation(s)
- Yueting Wang
- Key Laboratory of Modern Precision Agriculture System Integration Research, Ministry of Education, China Agricultural University, Beijing 100083, China
- Key Laboratory of Agricultural Informatization Standardization, Ministry of Agriculture and Rural Affairs, China Agricultural University, Beijing 100083, China
| | - Minzan Li
- Key Laboratory of Modern Precision Agriculture System Integration Research, Ministry of Education, China Agricultural University, Beijing 100083, China
| | - Ronghua Ji
- Key Laboratory of Modern Precision Agriculture System Integration Research, Ministry of Education, China Agricultural University, Beijing 100083, China
| | - Minjuan Wang
- Key Laboratory of Agricultural Informatization Standardization, Ministry of Agriculture and Rural Affairs, China Agricultural University, Beijing 100083, China
| | - Lihua Zheng
- Key Laboratory of Modern Precision Agriculture System Integration Research, Ministry of Education, China Agricultural University, Beijing 100083, China
- Key Laboratory of Agricultural Informatization Standardization, Ministry of Agriculture and Rural Affairs, China Agricultural University, Beijing 100083, China
| |
|
4
|
Saorin A, Di Gregorio E, Miolo G, Steffan A, Corona G. Emerging Role of Metabolomics in Ovarian Cancer Diagnosis. Metabolites 2020; 10:E419. [PMID: 33086611 PMCID: PMC7603269 DOI: 10.3390/metabo10100419] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Received: 08/30/2020] [Revised: 10/14/2020] [Accepted: 10/15/2020] [Indexed: 01/20/2023] Open
Abstract
Ovarian cancer is considered a silent killer because the lack of clear symptoms and efficient diagnostic tools often leads to late diagnoses. Over recent years, the pressing need for proficient biomarkers has led researchers to consider metabolomics, an emerging omics science that deals with analyses of the entire set of small molecules (≤1.5 kDa) present in biological systems. Metabolomics profiles, as a mirror of tumor-host interactions, have been found to be useful for the analysis and identification of specific cancer phenotypes. Cancer may cause significant metabolic alterations to sustain its growth, and metabolomics may highlight this, making it possible to detect cancer in an early phase of development. In the last decade, metabolomics has been widely applied to identify different metabolic signatures to improve ovarian cancer diagnosis. The aim of this review is to update the current status of metabolomics research for the discovery of new diagnostic metabolomic biomarkers for ovarian cancer. The most promising metabolic alterations are discussed in view of their potential biological implications, underlining the issues that limit their effective clinical translation into ovarian cancer diagnostic tools.
Affiliation(s)
- Asia Saorin
- Immunopathology and Cancer Biomarkers Unit, Centro di Riferimento Oncologico di Aviano (CRO), IRCCS, 33081 Aviano, Italy
| | - Emanuela Di Gregorio
- Immunopathology and Cancer Biomarkers Unit, Centro di Riferimento Oncologico di Aviano (CRO), IRCCS, 33081 Aviano, Italy
| | - Gianmaria Miolo
- Medical Oncology and Cancer Prevention Unit, Centro di Riferimento Oncologico di Aviano (CRO), IRCCS, 33081 Aviano, Italy
| | - Agostino Steffan
- Immunopathology and Cancer Biomarkers Unit, Centro di Riferimento Oncologico di Aviano (CRO), IRCCS, 33081 Aviano, Italy
| | - Giuseppe Corona
- Immunopathology and Cancer Biomarkers Unit, Centro di Riferimento Oncologico di Aviano (CRO), IRCCS, 33081 Aviano, Italy
| |
|
5
|
|
6
|
Bekhouche S, Mohamed Ben Ali Y. Feature Selection in GPCR Classification Using BAT Algorithm. INTERNATIONAL JOURNAL OF COMPUTATIONAL INTELLIGENCE AND APPLICATIONS 2020. [DOI: 10.1142/s1469026820500066] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/18/2022]
Abstract
G-protein-coupled receptors (GPCRs) are a large family of membrane proteins, and some of them still remain orphans. Predicting GPCR functions is a challenging task that depends closely on their classification, which requires a digital representation of each protein chain as an attribute vector. A major problem of GPCR databases is their great number of features, which can produce a combinatorial explosion and increase the complexity of classification algorithms. Feature selection techniques deal with this problem by reducing the dimension of the feature space while keeping the most relevant features. In this paper, we propose to use the BAT algorithm to extract the pertinent features and improve the classification results. We compared the results obtained by our system with those of two other bio-inspired algorithms, Evolutionary Algorithm and PSO search. The quality measures used for comparison are Error Rate, Accuracy, MCC and [Formula: see text]-measure. Experimental results indicate that our system is more efficient.
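The quality measures listed in this abstract can all be derived from a confusion matrix. The following Python sketch covers the binary case only, which is a simplification because GPCR classification is multi-class. It also assumes that the measure elided in the listing as "[Formula: see text]-measure" is the balanced F-measure (F1); the listing does not confirm this. The function name and the zero-denominator conventions are hypothetical choices.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Error rate, accuracy, MCC and F1 from binary confusion-matrix counts.

    Convention (an assumption): any ratio with a zero denominator is
    reported as 0.0 instead of raising an error.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Assumed reading of the elided "-measure": harmonic mean of
    # precision and recall.
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "error_rate": 1.0 - accuracy,
        "accuracy": accuracy,
        "mcc": mcc,
        "f_measure": f_measure,
    }
```

Unlike accuracy, MCC stays informative when classes are imbalanced, which is one reason it appears so often in the studies collected on this page.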
Affiliation(s)
- Safia Bekhouche
- Department of Computer Science, Badji Mokhtar University, Annaba 23000, Algeria
| | - Yamina Mohamed Ben Ali
- Laboratory of Research in Informatics (LRI), Badji Mokhtar University, Annaba 23000, Algeria
| |
|
7
|
Zhang Y, Dong D, Li D, Lu L, Li J, Zhang Y, Chen L. Computational Method for the Identification of Molecular Metabolites Involved in Cereal Hull Color Variations. Comb Chem High Throughput Screen 2019; 21:760-770. [DOI: 10.2174/1386207322666190129105441] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 07/14/2018] [Revised: 08/02/2018] [Accepted: 08/16/2018] [Indexed: 11/22/2022]
Abstract
Background:
Cereal hull color is an important quality specification characteristic. Many studies were conducted to identify genetic changes underlying cereal hull color diversity. However, these studies mainly focused on the gene level. Recent studies have suggested that metabolomics can accurately reflect the integrated and real-time cell processes that contribute to the formation of different cereal colors.
Methods:
In this study, we exploited published metabolomics databases and applied several advanced computational methods, such as minimum redundancy maximum relevance (mRMR), incremental forward search (IFS) and random forest (RF), to investigate cereal hull color at the metabolic level. First, mRMR was applied to analyze cereal hull samples represented by metabolite features, yielding a feature list. Then, IFS and RF were used to test several feature sets constructed according to this feature list. Finally, the optimal feature set and RF classifier were determined based on the testing results.
Results and Conclusion:
A total of 158 key metabolites were found to be useful in distinguishing white cereal hulls from colorful cereal hulls. A prediction model constructed with these metabolites and a random forest algorithm generated a high Matthews correlation coefficient of 0.701. Furthermore, 24 of these metabolites were previously found to be relevant to cereal color. Our study can provide new insights into the molecular basis of cereal hull color formation.
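The two-stage procedure described in the Methods (rank features with mRMR, then test nested feature sets built from the ranked list) can be sketched in Python. This is an illustration under stated assumptions, not the authors' implementation: features are taken to be discrete, mutual information is estimated from raw counts, the greedy "relevance minus mean redundancy" criterion is one common mRMR variant, and the `evaluate` callback is a hypothetical stand-in for training a random forest and returning a score such as the Matthews correlation coefficient.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Mutual information (in nats) of two equally long discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * math.log((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())


def mrmr_rank(features, labels):
    """Greedy mRMR ordering of feature names.

    features: dict mapping a feature name to its list of discrete values.
    Each step picks the feature with the highest relevance to the labels
    minus its mean redundancy with the features already chosen.
    """
    relevance = {name: mutual_information(col, labels)
                 for name, col in features.items()}
    ranked, remaining = [], set(features)
    while remaining:
        def score(name):
            if not ranked:
                return relevance[name]
            redundancy = sum(mutual_information(features[name], features[s])
                             for s in ranked) / len(ranked)
            return relevance[name] - redundancy
        # sorted() makes tie-breaking deterministic (alphabetical).
        best = max(sorted(remaining), key=score)
        ranked.append(best)
        remaining.remove(best)
    return ranked


def incremental_search(ranked, evaluate):
    """Score the nested sets ranked[:1], ranked[:2], ... and keep the best.

    evaluate: hypothetical callback that takes a list of feature names
    and returns a number where higher is better.
    """
    best_subset, best_score = [], float("-inf")
    for k in range(1, len(ranked) + 1):
        current = evaluate(ranked[:k])
        if current > best_score:
            best_subset, best_score = ranked[:k], current
    return best_subset, best_score
```

In this scheme an exact copy of an already selected feature is pushed to the end of the list, because its redundancy cancels its relevance; that is the behaviour that separates mRMR from ranking by relevance alone.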
Affiliation(s)
- Yunhua Zhang
- Anhui Province Key Laboratory of Farmland Ecological Conservation and Pollution Prevention, School of Resources and Environment, Anhui Agricultural University, Hefei, Anhui, China
| | - Dong Dong
- Anhui Province Key Laboratory of Farmland Ecological Conservation and Pollution Prevention, School of Resources and Environment, Anhui Agricultural University, Hefei, Anhui, China
| | - Dai Li
- Anhui Province Key Laboratory of Farmland Ecological Conservation and Pollution Prevention, School of Resources and Environment, Anhui Agricultural University, Hefei, Anhui, China
| | - Lin Lu
- Department of Radiology, Columbia University Medical Center, New York, United States
| | - JiaRui Li
- School of Life Sciences, Shanghai University, Shanghai, China
| | - YuHang Zhang
- Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
| | - Lijuan Chen
- College of Animal Science and Technology, Anhui Agricultural University, Hefei, Anhui, China
| |
|
8
|
Lu J, Zhang Y, Wang S, Bi Y, Huang T, Luo X, Cai YD. Analysis of Four Types of Leukemia Using Gene Ontology Term and Kyoto Encyclopedia of Genes and Genomes Pathway Enrichment Scores. Comb Chem High Throughput Screen 2019; 23:295-303. [PMID: 30599106 DOI: 10.2174/1386207322666181231151900] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 06/28/2018] [Revised: 09/24/2018] [Accepted: 12/05/2018] [Indexed: 12/16/2022]
Abstract
AIM AND OBJECTIVE Leukemia is the second most common blood cancer after lymphoma, and its incidence rate has been increasing in recent years. Leukemia can be classified into four types: acute lymphoblastic leukemia (ALL), acute myeloid leukemia (AML), chronic lymphocytic leukemia (CLL), and chronic myelogenous leukemia (CML). More than forty drugs are applicable to different types of leukemia, depending on their differing pathogenesis. Therefore, identifying specific drug-targeted biological processes and pathways helps determine the underlying pathogenesis of these four types of leukemia. METHODS In this study, the gene ontology (GO) terms and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways that were highly related to drugs for leukemia were investigated for the first time. The enrichment scores for associated GO terms and KEGG pathways were calculated to evaluate the drugs and leukemia. The feature selection method minimum redundancy maximum relevance (mRMR) was used to analyze and identify important GO terms and KEGG pathways. RESULTS Twenty GO terms and two KEGG pathways with high scores have all been confirmed to effectively distinguish the four types of leukemia. CONCLUSION This analysis may provide a useful tool for studying the differing pathogenesis of, and designing drugs for, different types of leukemia.
Affiliation(s)
- Jing Lu
- School of Pharmacy, Key Laboratory of Molecular Pharmacology and Drug Evaluation (Yantai University), Ministry of Education, Collaborative Innovation Center of Advanced Drug Delivery System and Biotech Drugs in Universities of Shandong, Yantai University, 32 Qingquan Road, Yantai 264005, China
| | - YuHang Zhang
- Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, 320 Yueyang Road, Shanghai 200031, China
| | - ShaoPeng Wang
- School of Life Sciences, Shanghai University, 99 Shangda Road, Shanghai 200444, China
| | - Yi Bi
- School of Pharmacy, Key Laboratory of Molecular Pharmacology and Drug Evaluation (Yantai University), Ministry of Education, Collaborative Innovation Center of Advanced Drug Delivery System and Biotech Drugs in Universities of Shandong, Yantai University, 32 Qingquan Road, Yantai 264005, China
| | - Tao Huang
- Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, 320 Yueyang Road, Shanghai 200031, China
| | - Xiaomin Luo
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, 99 Shangda Road, Shanghai 200444, China
| |
|
9
|
Zhang J, Cui X, Cai W, Shao X. A variable importance criterion for variable selection in near-infrared spectral analysis. Sci China Chem 2018. [DOI: 10.1007/s11426-018-9368-9] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Indexed: 12/31/2022]
|
10
|
Li J, Lu L, Zhang YH, Liu M, Chen L, Huang T, Cai YD. Identification of synthetic lethality based on a functional network by using machine learning algorithms. J Cell Biochem 2018; 120:405-416. [PMID: 30125975 DOI: 10.1002/jcb.27395] [Citation(s) in RCA: 44] [Impact Index Per Article: 7.3] [Received: 12/19/2017] [Accepted: 07/09/2018] [Indexed: 12/27/2022]
Abstract
Synthetic lethality is the synthesis of mutations leading to cell death. Tumor-specific synthetic lethality has been targeted in research to improve cancer therapy. With the advances of techniques in molecular biology, such as RNAi and CRISPR/Cas9 gene editing, efforts have been made to systematically identify synthetic lethal interactions, especially for frequently mutated genes in cancers. However, elucidating the mechanism of synthetic lethality remains a challenge because of the complexity of its influencing conditions. In this study, we proposed a new computational method to identify critical functional features that can accurately predict synthetic lethal interactions. This method incorporates several machine learning algorithms and encodes protein-coding genes by an enrichment system derived from gene ontology terms and Kyoto Encyclopedia of Genes and Genomes pathways to represent their functional features. We built a random forest-based prediction engine by using 2120 selected features and obtained a Matthews correlation coefficient of 0.532. We examined the top 15 features and found that most of them have potential roles in synthetic lethality according to previous studies. These results demonstrate the ability of our proposed method to predict synthetic lethal interactions and provide a basis for further characterization of these particular genetic combinations.
Affiliation(s)
- JiaRui Li
- School of Life Sciences, Shanghai University, Shanghai, China
| | - Lin Lu
- Department of Radiology, Columbia University Medical Center, New York
| | - Yu-Hang Zhang
- Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
| | - Min Liu
- College of Information Engineering, Shanghai Maritime University, Shanghai, China
| | - Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai, China
| | - Tao Huang
- Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, China
| |
|
11
|
Yuan F, Lu L, Zhang Y, Wang S, Cai YD. Data mining of the cancer-related lncRNAs GO terms and KEGG pathways by using mRMR method. Math Biosci 2018; 304:1-8. [PMID: 30086268 DOI: 10.1016/j.mbs.2018.08.001] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Received: 05/08/2018] [Revised: 06/15/2018] [Accepted: 08/01/2018] [Indexed: 02/07/2023]
Abstract
LncRNAs play an important role in the regulation of gene expression. Identifying GO terms and KEGG pathways of cancer-related lncRNAs is of great help in revealing cancer-related functional biological processes. Therefore, in this study, we proposed a computational method to identify novel GO terms and KEGG pathways of cancer-related lncRNAs. Using an existing lncRNA database and the Max-relevance Min-redundancy (mRMR) method, GO terms and KEGG pathways were evaluated based on their importance in distinguishing cancer-related from non-cancer-related lncRNAs. Finally, GO terms and KEGG pathways with high importance were presented and analyzed. Our literature review showed that, according to recent publications, the top 10 ranked GO terms and pathways were indeed related to tumorigenesis.
Affiliation(s)
- Fei Yuan
- Department of Science & Technology, Binzhou Medical University Hospital, Binzhou 256603, Shandong, China.
| | - Lin Lu
- Department of Radiology, Columbia University Medical Center, New York 10032, USA.
| | - YuHang Zhang
- Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China.
| | - ShaoPeng Wang
- School of Life Sciences, Shanghai University, Shanghai 200444, China.
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai 200444, China.
| |
|
12
|
Yu B, Li S, Qiu W, Wang M, Du J, Zhang Y, Chen X. Prediction of subcellular location of apoptosis proteins by incorporating PsePSSM and DCCA coefficient based on LFDA dimensionality reduction. BMC Genomics 2018; 19:478. [PMID: 29914358 PMCID: PMC6006758 DOI: 10.1186/s12864-018-4849-9] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.8] [Received: 12/15/2017] [Accepted: 06/01/2018] [Indexed: 01/05/2023] Open
Abstract
BACKGROUND Apoptosis is associated with several human diseases, including cancer, autoimmune disease, neurodegenerative disease and ischemic damage. Information on the subcellular localization of apoptosis proteins is very important for understanding the mechanism of programmed cell death and for drug development. However, predicting the subcellular localization of apoptosis proteins is still a challenging task. RESULTS In this paper, we propose a novel method for predicting apoptosis protein subcellular localization, called PsePSSM-DCCA-LFDA. First, features are extracted from the protein sequences by combining the pseudo-position specific scoring matrix (PsePSSM) and the detrended cross-correlation analysis coefficient (DCCA coefficient); the extracted feature information is then reduced in dimensionality by local Fisher discriminant analysis (LFDA). Finally, the optimal feature vectors are input to the SVM classifier to predict the subcellular location of the apoptosis proteins. Overall prediction accuracies of 99.7, 99.6 and 100% are achieved on the three benchmark datasets, respectively, by the most rigorous jackknife test, which is better than other state-of-the-art methods. CONCLUSION The experimental results indicate that our method can significantly improve the prediction accuracy of subcellular localization of apoptosis proteins, and that this accuracy is high enough to make it a promising tool for further proteomics studies. The source code and all datasets are available at https://github.com/QUST-BSBRC/PsePSSM-DCCA-LFDA/ .
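The jackknife test mentioned in this abstract is usually understood as leave-one-out validation: each protein is held out in turn, the model is fitted on all the others, and the held-out protein is predicted. The Python sketch below shows only that evaluation loop. The abstract's pipeline uses an SVM on PsePSSM and DCCA features; here a 1-nearest-neighbour rule is swapped in as a simple stand-in classifier so the example stays self-contained, and all names are hypothetical.

```python
def jackknife_accuracy(samples, labels, fit_predict):
    """Leave-one-out accuracy.

    fit_predict(train_x, train_y, query) must return a predicted label
    for `query` using only the supplied training data.
    """
    correct = 0
    for i in range(len(samples)):
        # Hold out sample i and train on everything else.
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if fit_predict(train_x, train_y, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


def nearest_neighbor(train_x, train_y, query):
    """Stand-in classifier (not an SVM): label of the closest training
    vector by squared Euclidean distance."""
    best = min(range(len(train_x)),
               key=lambda j: sum((a - b) ** 2
                                 for a, b in zip(train_x[j], query)))
    return train_y[best]
```

Because every sample is used exactly once as a test case, the jackknife result is deterministic for a given dataset, unlike k-fold cross-validation, whose outcome depends on how the folds are drawn.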
Affiliation(s)
- Bin Yu
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China; School of Life Sciences, University of Science and Technology of China, Hefei, 230027, China
| | - Shan Li
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Wenying Qiu
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Minghui Wang
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Junwei Du
- College of Information Science and Technology, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Yusen Zhang
- School of Mathematics and Statistics, Shandong University at Weihai, Weihai, 264209, China
| | - Xing Chen
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, 21116, China
| |
|
13
|
Computational Approach to Investigating Key GO Terms and KEGG Pathways Associated with CNV. BIOMED RESEARCH INTERNATIONAL 2018; 2018:8406857. [PMID: 29850576 PMCID: PMC5925134 DOI: 10.1155/2018/8406857] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 12/26/2017] [Revised: 02/28/2018] [Accepted: 03/06/2018] [Indexed: 12/25/2022]
Abstract
Choroidal neovascularization (CNV) is a severe eye disease that leads to blindness, especially in the elderly population. Various endogenous and exogenous regulatory factors promote its pathogenesis. However, the detailed molecular biological mechanisms of CNV have not been fully revealed. In this study, by using advanced computational tools, a number of key gene ontology (GO) terms and KEGG pathways were selected for CNV. A total of 29 validated genes associated with CNV and 17,639 nonvalidated genes were encoded based on the features derived from the GO terms and KEGG pathways by using the enrichment theory. The widely accepted feature selection method, maximum relevance and minimum redundancy (mRMR), was applied to analyze and rank the features. An extensive literature review for the top 45 ranking features was conducted to confirm their close associations with CNV. Identifying the molecular biological mechanisms of CNV as described by the GO terms and KEGG pathways may contribute to improving the understanding of the pathogenesis of CNV.
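The abstract says genes were encoded with GO and KEGG features "by using the enrichment theory" but does not give the formula. One common formulation, assumed here and not confirmed by the abstract, defines the enrichment score of a gene set on a term as the negative base-10 logarithm of a hypergeometric tail probability. The Python sketch below implements that assumed definition; the function and parameter names are hypothetical.

```python
import math


def hypergeom_tail(population, annotated, drawn, overlap):
    """P(X >= overlap) when `drawn` genes are sampled without replacement
    from `population` genes, of which `annotated` carry the term."""
    upper = min(drawn, annotated)
    favourable = sum(
        math.comb(annotated, k) * math.comb(population - annotated, drawn - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / math.comb(population, drawn)


def enrichment_score(population, annotated, drawn, overlap):
    """Assumed enrichment score: -log10 of the hypergeometric tail.

    A larger score means the overlap is less likely to arise by chance.
    """
    return -math.log10(hypergeom_tail(population, annotated, drawn, overlap))
```

Under this reading, each gene would be represented by one score per GO term or KEGG pathway, and a feature ranking method such as mRMR would then operate on those score vectors.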
|
14
|
An Efficient Approach for Prediction of Nuclear Receptor and Their Subfamilies Based on Fuzzy k-Nearest Neighbor with Maximum Relevance Minimum Redundancy. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES INDIA SECTION A-PHYSICAL SCIENCES 2018. [DOI: 10.1007/s40010-016-0325-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 11/26/2022]
|
15
|
Accurate prediction of subcellular location of apoptosis proteins combining Chou's PseAAC and PsePSSM based on wavelet denoising. Oncotarget 2017; 8:107640-107665. [PMID: 29296195 PMCID: PMC5746097 DOI: 10.18632/oncotarget.22585] [Citation(s) in RCA: 59] [Impact Index Per Article: 8.4] [Received: 09/04/2017] [Accepted: 10/30/2017] [Indexed: 02/05/2023] Open
Abstract
Information on the subcellular localization of apoptosis proteins is very important for understanding the mechanism of programmed cell death and for drug development. Predicting the subcellular localization of an apoptosis protein remains a challenging task, and accurate prediction can help to understand the protein's function and its role in metabolic processes. In this paper, we propose a novel method for protein subcellular localization prediction. First, features of the protein sequence are extracted by combining Chou's pseudo amino acid composition (PseAAC) and the pseudo-position specific scoring matrix (PsePSSM); the extracted feature information is then denoised by two-dimensional (2-D) wavelet denoising. Finally, the optimal feature vectors are input to an SVM classifier to predict the subcellular location of apoptosis proteins. Promising predictions are obtained using the jackknife test on three widely used datasets and compared with other state-of-the-art methods. The results indicate that the proposed method can remarkably improve the prediction accuracy of apoptosis protein subcellular localization and can serve as a supplementary tool for future proteomics research.
|
16
|
Li M, Ling C, Xu Q, Gao J. Classification of G-protein coupled receptors based on a rich generation of convolutional neural network, N-gram transformation and multiple sequence alignments. Amino Acids 2017; 50:255-266. [PMID: 29151135 DOI: 10.1007/s00726-017-2512-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 06/24/2017] [Accepted: 11/14/2017] [Indexed: 10/18/2022]
Abstract
Sequence classification is crucial in predicting the function of newly discovered sequences. In recent years, the prediction of the incremental large-scale and diversity of sequences has heavily relied on the involvement of machine-learning algorithms. To improve prediction accuracy, these algorithms must confront the key challenge of extracting valuable features. In this work, we propose a feature-enhanced protein classification approach, considering the rich generation of multiple sequence alignment algorithms, N-gram probabilistic language model and the deep learning technique. The essence behind the proposed method is that if each group of sequences can be represented by one feature sequence, composed of homologous sites, there should be less loss when the sequence is rebuilt, when a more relevant sequence is added to the group. On the basis of this consideration, the prediction becomes whether a query sequence belonging to a group of sequences can be transferred to calculate the probability that the new feature sequence evolves from the original one. The proposed work focuses on the hierarchical classification of G-protein Coupled Receptors (GPCRs), which begins by extracting the feature sequences from the multiple sequence alignment results of the GPCRs sub-subfamilies. The N-gram model is then applied to construct the input vectors. Finally, these vectors are imported into a convolutional neural network to make a prediction. The experimental results elucidate that the proposed method provides significant performance improvements. The classification error rate of the proposed method is reduced by at least 4.67% (family level I) and 5.75% (family Level II), in comparison with the current state-of-the-art methods. The implementation program of the proposed work is freely available at: https://github.com/alanFchina/CNN .
Affiliation(s)
- Man Li
- Department of Computer Science and Technology, College of Information Science and Technology, Beijing University of Chemical Technology, Beijing, China
| | - Cheng Ling
- Department of Computer Science and Technology, College of Information Science and Technology, Beijing University of Chemical Technology, Beijing, China.
| | - Qi Xu
- Department of Computer Science and Technology, College of Information Science and Technology, Beijing University of Chemical Technology, Beijing, China
| | - Jingyang Gao
- Department of Computer Science and Technology, College of Information Science and Technology, Beijing University of Chemical Technology, Beijing, China
|
17
|
Abstract
In this study, we delineate an unsupervised clustering algorithm, minimum span clustering (MSC), and apply it to detect G-protein coupled receptor (GPCR) sequences and to study the GPCR network using a base dataset of 2770 GPCR and 652 non-GPCR sequences. High detection accuracy can be achieved with a proper dataset. The clustering results of GPCRs derived from MSC show a strong correlation between their sequences and functions. By comparing our level 1 MSC results with the GPCRdb classification, the consistency is 87.9% for the fourth level of GPCRdb, 89.2% for the third level, 98.4% for the second level, and 100% for the top level (the lowest resolution level of GPCRdb). The MSC results of GPCRs can be well explained by estimating the selective pressure of GPCRs, as exemplified by investigating the largest two subfamilies, peptide receptors (PRs) and olfactory receptors (ORs), in class A GPCRs. PRs are decomposed into three groups due to a positive selective pressure, whilst ORs remain as a single group due to a negative selective pressure. Finally, we construct and compare phylogenetic trees using distance-based and character-based methods, a combination of which could convey more comprehensive information about the evolution of GPCRs.
|
18
|
Li J, Huang T. Predicting and analyzing early wake-up associated gene expressions by integrating GWAS and eQTL studies. Biochim Biophys Acta Mol Basis Dis 2017; 1864:2241-2246. [PMID: 29109033 DOI: 10.1016/j.bbadis.2017.10.036] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.6] [Received: 09/04/2017] [Revised: 10/19/2017] [Accepted: 10/30/2017] [Indexed: 12/31/2022]
Abstract
Circadian rhythms are endogenous 24-hour rhythmic oscillations affecting human behaviors, such as sleep, blood pressure and other biological processes, the disturbance of which leads to circadian rhythm sleep disorders (CRSDs). In this study, based on the data from genome-wide association studies (GWASs) and expression quantitative trait loci (eQTLs), we tried to identify novel gene expression patterns in brain tissues that were associated with early wake-up. First, the maximum-relevance-minimum-redundancy (mRMR) method was adopted to analyze the involved gene expression patterns, yielding a feature list. Second, the incremental feature selection (IFS) method and the Dagging algorithm were applied to extract the important gene expression patterns that yielded the best performance for Dagging. As a result, 4374 gene expression patterns were obtained, and they were further used to build an optimal classifier with a Matthews correlation coefficient of 0.933. Furthermore, the 49 most important gene expression patterns were extensively analyzed. Four genes were found to be related to circadian rhythm, as reported in previous studies. As a first attempt at identifying the target genes whose expression levels are associated with sleep-wake rhythms by integrating GWAS and eQTL results, this study can motivate more investigations in this regard. This article is part of a Special Issue entitled: Accelerating Precision Medicine through Genetic and Genomic Big Data Analysis edited by Yudong Cai & Tao Huang.
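The incremental feature selection (IFS) step described in this abstract can be sketched as a loop over nested prefixes of a ranked feature list. The sketch below assumes the ranking is already available (for example from mRMR) and that the caller supplies an `evaluate` function returning a score such as a cross-validated MCC; both names are hypothetical, and the classifier (Dagging in the paper) is deliberately left outside the sketch.

```python
def incremental_feature_selection(ranked_features, evaluate):
    """Return (best_subset, best_score) over the nested prefixes of a ranked list.

    ranked_features: feature names, most important first
    evaluate:        callable taking a list of feature names and returning a
                     numeric score (higher is better); supplied by the caller
    """
    best_subset, best_score = [], float("-inf")
    for k in range(1, len(ranked_features) + 1):
        subset = ranked_features[:k]  # top-k features
        score = evaluate(subset)
        if score > best_score:  # strict comparison keeps the smaller subset on ties
            best_subset, best_score = subset, score
    return best_subset, best_score


# Toy scorer: two informative features, with a small penalty per feature used.
def toy_evaluate(subset):
    informative = {"f1", "f2"}
    return len(informative.intersection(subset)) - 0.1 * len(subset)


toy_subset, toy_score = incremental_feature_selection(
    ["f1", "f2", "f3", "f4"], toy_evaluate
)
```

Because the candidate subsets are nested, the number of classifier evaluations grows linearly with the number of features rather than exponentially.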
Affiliation(s)
- JiaRui Li
- College of Life Science, Shanghai University, Shanghai 200444, People's Republic of China
| | - Tao Huang
- Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, People's Republic of China.
|
19
|
Chen L, Zhang YH, Huang G, Pan X, Wang S, Huang T, Cai YD. Discriminating cirRNAs from other lncRNAs using a hierarchical extreme learning machine (H-ELM) algorithm with feature selection. Mol Genet Genomics 2017; 293:137-149. [DOI: 10.1007/s00438-017-1372-7] [Citation(s) in RCA: 46] [Impact Index Per Article: 6.6] [Received: 04/09/2017] [Accepted: 09/07/2017] [Indexed: 12/15/2022]
|
20
|
Chen L, Zhang YH, Wang S, Zhang Y, Huang T, Cai YD. Prediction and analysis of essential genes using the enrichments of gene ontology and KEGG pathways. PLoS One 2017; 12:e0184129. [PMID: 28873455 PMCID: PMC5584762 DOI: 10.1371/journal.pone.0184129] [Citation(s) in RCA: 173] [Impact Index Per Article: 24.7] [Received: 06/16/2017] [Accepted: 08/18/2017] [Indexed: 12/20/2022] Open
Abstract
Identifying essential genes in a given organism is important for research on their fundamental roles in organism survival. Furthermore, if possible, uncovering the links between core functions or pathways and these essential genes will further help us obtain deep insight into the key roles of these genes. In this study, we investigated the essential and non-essential genes reported in a previous study and extracted gene ontology (GO) terms and biological pathways that are important for the determination of essential genes. Through the enrichment theory of GO and KEGG pathways, we encoded each essential/non-essential gene into a vector in which each component represented the relationship between the gene and one GO term or KEGG pathway. To analyze these relationships, the maximum relevance minimum redundancy (mRMR) method was adopted. Then, incremental feature selection (IFS) and a support vector machine (SVM) were employed to extract important GO terms and KEGG pathways. A prediction model was built simultaneously using the extracted GO terms and KEGG pathways, which yielded nearly perfect performance, with a Matthews correlation coefficient of 0.951, for distinguishing essential and non-essential genes. To fully investigate the key factors influencing the fundamental roles of essential genes, the 21 most important GO terms and three KEGG pathways were analyzed in detail. In addition, several genes that our prediction model predicted to be essential are provided in this study. We suggest that this study provides more functional and pathway information on the essential genes and provides a new way to investigate related problems.
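For readers unfamiliar with the Matthews correlation coefficient (MCC) reported in this abstract, the following minimal sketch computes it from the four cells of a binary confusion matrix using the standard formula. The function name is ours, and returning 0.0 when the denominator vanishes is a common convention assumed here rather than anything stated in the abstract.

```python
import math


def matthews_corrcoef_binary(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).

    Ranges from -1 (total disagreement) through 0 (chance level) to +1
    (perfect prediction). Returns 0.0 when any marginal count is zero,
    which makes the denominator vanish.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


perfect = matthews_corrcoef_binary(tp=5, tn=5, fp=0, fn=0)
inverted = matthews_corrcoef_binary(tp=0, tn=0, fp=5, fn=5)
```

Unlike plain accuracy, MCC uses all four confusion-matrix cells, which is why it is often preferred when the classes are imbalanced, as essential and non-essential genes typically are.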
Affiliation(s)
- Lei Chen
- School of Life Sciences, Shanghai University, Shanghai, People’s Republic of China
- College of Information Engineering, Shanghai Maritime University, Shanghai, People’s Republic of China
| | - Yu-Hang Zhang
- Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, People’s Republic of China
| | - ShaoPeng Wang
- School of Life Sciences, Shanghai University, Shanghai, People’s Republic of China
| | - YunHua Zhang
- Anhui Province Key Lab of Farmland Ecological Conservation and Pollution Prevention, School of Resources and Environment, Anhui Agricultural University, Hefei, People’s Republic of China
| | - Tao Huang
- Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, People’s Republic of China
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, People’s Republic of China
|
21
|
Chen L, Zhang YH, Lu G, Huang T, Cai YD. Analysis of cancer-related lncRNAs using gene ontology and KEGG pathways. Artif Intell Med 2017; 76:27-36. [PMID: 28363286 DOI: 10.1016/j.artmed.2017.02.001] [Citation(s) in RCA: 107] [Impact Index Per Article: 15.3] [Received: 12/12/2016] [Revised: 01/31/2017] [Accepted: 02/05/2017] [Indexed: 12/17/2022]
Abstract
BACKGROUND Cancer is a disease that involves abnormal cell growth and can invade or metastasize to other tissues. It is known that several factors are related to its initiation, proliferation, and invasiveness. Recently, it has been reported that long non-coding RNAs (lncRNAs) can participate in specific functional pathways and further regulate the biological function of cancer cells. Studies on lncRNAs are therefore helpful for uncovering the underlying mechanisms of cancer biological processes. METHODS We investigated cancer-related lncRNAs using gene ontology (GO) terms and KEGG pathway enrichment scores of neighboring genes that are co-expressed with the lncRNAs, extracting important GO terms and KEGG pathways that can help us identify cancer-related lncRNAs. The enrichment theory of GO terms and KEGG pathways was adopted to encode each lncRNA. Then, feature selection methods were employed to analyze these features and obtain the key GO terms and KEGG pathways. RESULTS The analysis indicated that the extracted GO terms and KEGG pathways are closely related to several cancer-associated processes, such as hormone-associated pathways, energy-associated pathways, and ribosome-associated pathways, and that they can accurately predict cancer-related lncRNAs. CONCLUSIONS This study provided novel insight into how lncRNAs may affect tumorigenesis and which pathways may play important roles in it. These results could help in understanding the biological mechanisms of lncRNAs and in treating cancer.
Affiliation(s)
- Lei Chen
- School of Life Sciences, Shanghai University, Shanghai 200444, People's Republic of China; College of Information Engineering, Shanghai Maritime University, Shanghai 201306, People's Republic of China.
| | - Yu-Hang Zhang
- Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200025, People's Republic of China.
| | - Guohui Lu
- Department of Neurosurgery, The First Affiliated Hospital of Nanchang University, Nanchang 330006, People's Republic of China.
| | - Tao Huang
- Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200025, People's Republic of China.
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai 200444, People's Republic of China.
|
22
|
Analysis of Important Gene Ontology Terms and Biological Pathways Related to Pancreatic Cancer. BIOMED RESEARCH INTERNATIONAL 2016; 2016:7861274. [PMID: 27957501 PMCID: PMC5120232 DOI: 10.1155/2016/7861274] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 05/31/2016] [Revised: 07/18/2016] [Accepted: 09/07/2016] [Indexed: 12/16/2022]
Abstract
Pancreatic cancer is a serious disease that results in more than thirty thousand deaths around the world per year. To design effective treatments, many investigators have devoted themselves to the study of biological processes and mechanisms underlying this disease. However, it is far from complete. In this study, we tried to extract important gene ontology (GO) terms and KEGG pathways for pancreatic cancer by adopting some existing computational methods. Genes that have been validated to be related to pancreatic cancer and have not been validated were represented by features derived from GO terms and KEGG pathways using the enrichment theory. A popular feature selection method, minimum redundancy maximum relevance, was employed to analyze these features and extract important GO terms and KEGG pathways. An extensive analysis of the obtained GO terms and KEGG pathways was provided to confirm the correlations between them and pancreatic cancer.
|
23
|
The Use of Gene Ontology Term and KEGG Pathway Enrichment for Analysis of Drug Half-Life. PLoS One 2016; 11:e0165496. [PMID: 27780226 PMCID: PMC5079577 DOI: 10.1371/journal.pone.0165496] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 08/20/2016] [Accepted: 10/12/2016] [Indexed: 02/07/2023] Open
Abstract
A drug's biological half-life is defined as the time required for the human body to metabolize or eliminate 50% of the initial drug dosage. Correctly measuring the half-life of a given drug is helpful for the safe and accurate usage of the drug. In this study, we investigated which gene ontology (GO) terms and biological pathways were highly related to the determination of drug half-life. The investigated drugs, with known half-lives, were analyzed based on their enrichment scores for associated GO terms and KEGG pathways. These scores indicate which GO terms or KEGG pathways the drug targets. The feature selection method, minimum redundancy maximum relevance, was used to analyze these GO terms and KEGG pathways and to identify important GO terms and pathways, such as sodium-independent organic anion transmembrane transporter activity (GO:0015347), monoamine transmembrane transporter activity (GO:0008504), negative regulation of synaptic transmission (GO:0050805), neuroactive ligand-receptor interaction (hsa04080), serotonergic synapse (hsa04726), and linoleic acid metabolism (hsa00591), among others. This analysis confirmed our results and may show evidence for a new method in studying drug half-lives and building effective computational methods for the prediction of drug half-lives.
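The abstract does not say how an enrichment score is calculated. One common definition, assumed here purely for illustration, is the negative base-10 logarithm of a hypergeometric upper-tail p-value for the overlap between a set of target genes and the genes annotated to a GO term or KEGG pathway. The function and parameter names below are hypothetical.

```python
import math


def enrichment_score(n_background, n_annotated, n_targets, n_overlap):
    """-log10 of P(X >= n_overlap) for a hypergeometric variable X.

    n_background: total number of genes considered
    n_annotated:  genes annotated to the GO term or pathway
    n_targets:    genes in the set of interest (e.g. a drug's targets)
    n_overlap:    genes that are both annotated and in the set of interest
    """
    upper = min(n_annotated, n_targets)
    if not 0 <= n_overlap <= upper:
        raise ValueError("n_overlap must lie in [0, min(n_annotated, n_targets)]")
    total = math.comb(n_background, n_targets)
    tail = sum(
        math.comb(n_annotated, k)
        * math.comb(n_background - n_annotated, n_targets - k)
        for k in range(n_overlap, upper + 1)
    )
    if tail == 0:
        raise ValueError("inconsistent counts: the tail probability is zero")
    # Work with exact integers and take logs separately to avoid float underflow.
    return math.log10(total) - math.log10(tail)


no_enrichment = enrichment_score(10, 5, 5, 0)   # p-value is 1, score is 0
full_overlap = enrichment_score(4, 2, 2, 2)     # p-value is 1/6
```

Under this definition a larger score means the overlap is less likely to arise by chance, so a vector of such scores across all GO terms and pathways can serve as a feature vector.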
|
24
|
Tiwari AK. Prediction of G-protein coupled receptors and their subfamilies by incorporating various sequence features into Chou's general PseAAC. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2016; 134:197-213. [PMID: 27480744 DOI: 10.1016/j.cmpb.2016.07.004] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Received: 02/19/2016] [Revised: 05/27/2016] [Accepted: 07/01/2016] [Indexed: 06/06/2023]
Abstract
BACKGROUND AND OBJECTIVE G-protein coupled receptors form the largest superfamily of membrane proteins and are important targets for drug design. They are responsible for many physiological processes such as smell, taste, vision, neurotransmission, metabolism, cellular growth and immune response. It is therefore necessary to design a robust and efficient approach for the prediction of G-protein coupled receptors and their subfamilies. METHODS In this paper, the protein samples are represented by amino acid composition, dipeptide composition, correlation features, composition, transition, distribution, sequence order descriptors and pseudo amino acid composition, for a total of 1497 sequence-derived features. To classify G-protein coupled receptors and their subfamilies efficiently, we propose to use a weighted k-nearest neighbor classifier with the union of the best 50 features selected by Fisher score based feature selection, ReliefF, fast correlation based filter, minimum redundancy maximum relevancy, and support vector machine based recursive feature elimination, in order to exploit the advantages of these feature selection methods. RESULTS The proposed method achieved overall accuracies of 99.9%, 98.3% and 95.4%, MCC values of 1.00, 0.98 and 0.95, ROC area values of 1.00, 0.998 and 0.996, and precisions of 99.9%, 98.3% and 95.5% using 10-fold cross-validation to predict G-protein coupled receptors versus non-G-protein coupled receptors, subfamilies of G-protein coupled receptors, and subfamilies of class A G-protein coupled receptors, respectively. CONCLUSIONS The high accuracy, MCC, ROC area, and precision values indicate that the proposed method performs well for the prediction of G-protein coupled receptor families and their subfamilies.
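Two of the sequence-derived descriptors named in this abstract, amino acid composition (20 values) and dipeptide composition (400 values), are simple enough to sketch directly. The code below is a minimal illustration under the assumption that non-standard residue letters are dropped before counting; the function names are ours and the paper's exact preprocessing is not given in the abstract.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def _clean(sequence):
    """Upper-case the sequence and drop letters outside the 20 standard residues."""
    return "".join(aa for aa in sequence.upper() if aa in AMINO_ACIDS)


def amino_acid_composition(sequence):
    """Fraction of each of the 20 residues, in AMINO_ACIDS order."""
    seq = _clean(sequence)
    counts = Counter(seq)
    return [counts[aa] / len(seq) if seq else 0.0 for aa in AMINO_ACIDS]


def dipeptide_composition(sequence):
    """Fraction of each of the 400 ordered residue pairs among adjacent pairs."""
    seq = _clean(sequence)
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    counts = Counter(pairs)
    return [
        counts[a + b] / len(pairs) if pairs else 0.0
        for a, b in product(AMINO_ACIDS, repeat=2)
    ]


aac = amino_acid_composition("AACD")   # A: 0.5, C: 0.25, D: 0.25
dpc = dipeptide_composition("AACD")    # pairs AA, AC, CD, each 1/3
```

Both vectors have a fixed length regardless of sequence length, which is what lets classifiers such as k-nearest neighbor or SVM operate on proteins of different sizes.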
|
25
|
Chen L, Zhang YH, Zheng M, Huang T, Cai YD. Identification of compound-protein interactions through the analysis of gene ontology, KEGG enrichment for proteins and molecular fragments of compounds. Mol Genet Genomics 2016; 291:2065-2079. [PMID: 27530612 DOI: 10.1007/s00438-016-1240-x] [Citation(s) in RCA: 51] [Impact Index Per Article: 6.4] [Received: 03/12/2016] [Accepted: 08/09/2016] [Indexed: 12/13/2022]
Abstract
Compound-protein interactions play important roles in every cell via the recognition and regulation of specific functional proteins. The correct identification of compound-protein interactions can lead to a good comprehension of this complicated system and provide useful input for the investigation of various attributes of compounds and proteins. In this study, we attempted to understand this system by extracting properties from both proteins and compounds, in which proteins were represented by gene ontology and KEGG pathway enrichment scores and compounds were represented by molecular fragments. Advanced feature selection methods, including minimum redundancy maximum relevance, incremental feature selection, and the basic machine learning algorithm random forest, were used to analyze these properties and extract core factors for the determination of actual compound-protein interactions. Compound-protein interactions reported in The Binding Databases were used as positive samples. To improve the reliability of the results, the analytic procedure was executed five times using different negative samples. Simultaneously, five optimal prediction methods based on a random forest and yielding maximum MCCs of approximately 77.55 % were constructed and may be useful tools for the prediction of compound-protein interactions. This work provides new clues to understanding the system of compound-protein interactions by analyzing extracted core features. Our results indicate that compound-protein interactions are related to biological processes involving immune, developmental and hormone-associated pathways.
Affiliation(s)
- Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai, 201306, People's Republic of China.
| | - Yu-Hang Zhang
- Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, 200031, People's Republic of China
| | - Mingyue Zheng
- Drug Discovery and Design Center, Shanghai Institute of Materia Medica, Shanghai, 201203, People's Republic of China
| | - Tao Huang
- Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, 200031, People's Republic of China
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, 200444, People's Republic of China.
|
26
|
Chen L, Zhang YH, Zou Q, Chu C, Ji Z. Analysis of the chemical toxicity effects using the enrichment of Gene Ontology terms and KEGG pathways. Biochim Biophys Acta Gen Subj 2016; 1860:2619-26. [PMID: 27208425 DOI: 10.1016/j.bbagen.2016.05.015] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Received: 03/02/2016] [Revised: 04/25/2016] [Accepted: 05/13/2016] [Indexed: 02/06/2023]
Abstract
BACKGROUND Chemical toxicity is one of the major barriers for designing and detecting new chemical entities during drug discovery. Unexpected toxicity of an approved drug may lead to withdrawal from the market and significant loss of the associated costs. Better understanding of the mechanisms underlying various toxicity effects can help eliminate unqualified candidate drugs in early stages, allowing researchers to focus their attention on other more viable candidates. METHODS In this study, we aimed to understand the mechanisms underlying several toxicity effects using Gene Ontology (GO) terms and KEGG pathways. GO term and KEGG pathway enrichment theories were adopted to encode each chemical, and the minimum redundancy maximum relevance (mRMR) was used to analyze the GO terms and the KEGG pathways. Based on the feature list obtained by the mRMR method, the most related GO terms and KEGG pathways were extracted. RESULTS Some important GO terms and KEGG pathways were uncovered, which were concluded to be significant for determining chemical toxicity effects. CONCLUSIONS Several GO terms and KEGG pathways are highly related to all investigated toxicity effects, while some are specific to a certain toxicity effect. GENERAL SIGNIFICANCE The findings in this study have the potential to further our understanding of different chemical toxicity mechanisms and to assist scientists in developing new chemical toxicity prediction algorithms. This article is part of a Special Issue entitled "System Genetics" Guest Editor: Dr. Yudong Cai and Dr. Tao Huang.
Affiliation(s)
- Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, People's Republic of China.
| | - Yu-Hang Zhang
- Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, People's Republic of China.
| | - Quan Zou
- School of Computer Science and Technology, Tianjin University, Tianjin 300072, People's Republic of China.
| | - Chen Chu
- Institute of Biochemistry and Cell Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, People's Republic of China.
| | - Zhiliang Ji
- State Key Laboratory of Cellular Stress Biology, School of Life Sciences, Xiamen University, Xiamen, Fujian 361102, People's Republic of China.
|
27
|
An efficient approach for the prediction of ion channels and their subfamilies. Comput Biol Chem 2015; 58:205-21. [PMID: 26256801 DOI: 10.1016/j.compbiolchem.2015.07.002] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Received: 03/16/2015] [Revised: 06/25/2015] [Accepted: 07/08/2015] [Indexed: 01/25/2023]
Abstract
Ion channels are integral membrane proteins that are responsible for controlling the flow of ions across the cell. Different types of ion channels perform various biological functions. Therefore, for new drug discovery, it is necessary to develop a novel approach based on computational intelligence techniques for the reliable prediction of ion channel families and their subfamilies. In this paper, a random forest based approach is proposed to predict ion channel families and their subfamilies using sequence-derived features. Here, seven feature vectors are used to represent the protein sample: amino acid composition, dipeptide composition, correlation features, composition, transition, distribution, and pseudo amino acid composition. Minimum redundancy maximum relevance feature selection is used to find the optimal number of features for improving the prediction performance. The proposed method achieved overall accuracies of 100%, 98.01%, 91.5%, 93.0%, 92.2%, 78.6%, 95.5% and 84.9%, MCC values of 1.00, 0.92, 0.88, 0.88, 0.90, 0.79, 0.91 and 0.81, and ROC area values of 1.00, 0.99, 0.99, 0.99, 0.99, 0.95, 0.99 and 0.96 using 10-fold cross-validation to predict ion channels and non-ion channels, voltage gated ion channels and ligand gated ion channels, four subfamilies (calcium, potassium, sodium and chloride) of voltage gated ion channels, four subfamilies of ligand gated ion channels, and subfamilies of voltage gated calcium, potassium, sodium and chloride ion channels, respectively.
|
28
|
Chen L, Chu C, Lu J, Kong X, Huang T, Cai YD. Gene Ontology and KEGG Pathway Enrichment Analysis of a Drug Target-Based Classification System. PLoS One 2015; 10:e0126492. [PMID: 25951454 PMCID: PMC4423955 DOI: 10.1371/journal.pone.0126492] [Citation(s) in RCA: 40] [Impact Index Per Article: 4.4] [Received: 10/27/2014] [Accepted: 04/02/2015] [Indexed: 12/22/2022] Open
Abstract
Drug-target interaction (DTI) is a key aspect in pharmaceutical research. With the ever-increasing new drug data resources, computational approaches have emerged as powerful and labor-saving tools in predicting new DTIs. However, so far, most of these predictions have been based on structural similarities rather than biological relevance. In this study, we proposed for the first time a "GO and KEGG enrichment score" method to represent a certain category of drug molecules by further classification and interpretation of the DTI database. A benchmark dataset consisting of 2,015 drugs that are assigned to nine categories ((1) G protein-coupled receptors, (2) cytokine receptors, (3) nuclear receptors, (4) ion channels, (5) transporters, (6) enzymes, (7) protein kinases, (8) cellular antigens and (9) pathogens) was constructed by collecting data from KEGG. We analyzed each category and each drug for its contribution in GO terms and KEGG pathways using the popular feature selection "minimum redundancy maximum relevance (mRMR)" method, and key GO terms and KEGG pathways were extracted. Our analysis revealed the top enriched GO terms and KEGG pathways of each drug category, which were highly enriched in the literature and clinical trials. Our results provide for the first time the biological relevance among drugs, targets and biological functions, which serves as a new basis for future DTI predictions.
Affiliation(s)
- Lei Chen
- College of Life Science, Shanghai University, Shanghai, People’s Republic of China
- College of Information Engineering, Shanghai Maritime University, Shanghai, People’s Republic of China
| | - Chen Chu
- Institute of Biochemistry and Cell Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, People’s Republic of China
| | - Jing Lu
- Department of Medicinal Chemistry, School of Pharmacy, Yantai University, Shandong, Yantai, People’s Republic of China
| | - Xiangyin Kong
- Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, People’s Republic of China
| | - Tao Huang
- Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, People’s Republic of China
| | - Yu-Dong Cai
- College of Life Science, Shanghai University, Shanghai, People’s Republic of China
|
29
|
Yang J, Chen L, Kong X, Huang T, Cai YD. Analysis of tumor suppressor genes based on gene ontology and the KEGG pathway. PLoS One 2014; 9:e107202. [PMID: 25207935 PMCID: PMC4160198 DOI: 10.1371/journal.pone.0107202] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.8] [Received: 06/01/2014] [Accepted: 08/07/2014] [Indexed: 12/31/2022] Open
Abstract
Cancer is a serious disease that causes many deaths every year, and effective treatments are urgently needed. Tumor suppressor genes (TSGs) are genes that can protect cells from becoming cancerous. Correct identification of TSGs is therefore one route toward finding effective cancer therapies. In this study, we performed gene ontology (GO) and pathway enrichment analysis of TSGs and non-TSGs. Popular feature selection methods, including minimum redundancy maximum relevance (mRMR) and incremental feature selection (IFS), were employed to analyze the enrichment features. Accordingly, some GO terms and KEGG pathways, such as biological adhesion, cell cycle control, genomic stability maintenance and cell death regulation, were extracted; these are important factors for identifying TSGs. We hope these findings can help in building effective prediction methods for identifying TSGs and thereby promote the discovery of effective cancer treatments.
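The mRMR step named in this abstract can be illustrated with a minimal sketch. It assumes discrete features, mutual information as both the relevance and the redundancy measure, and the "difference" form of the criterion (relevance minus mean redundancy); the abstract does not state which variant the authors used, and the names `mutual_information` and `mrmr_rank` are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * log2((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in pxy.items()
    )


def mrmr_rank(features, target):
    """Greedy mRMR ranking (assumed 'difference' criterion).

    features: dict mapping a feature name to a list of discrete values.
    target:   list of class labels, same length as each feature column.
    Returns all feature names, most useful first.
    """
    relevance = {name: mutual_information(col, target)
                 for name, col in features.items()}
    selected, remaining = [], set(features)

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(
            mutual_information(features[name], features[s]) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while remaining:
        # sorted() makes tie-breaking deterministic (alphabetical).
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (illustrative only): "b" duplicates "a", so it is ranked last
# even though its relevance equals that of "a".
target = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "a": [0, 0, 0, 1, 1, 1, 1, 1],
    "b": [0, 0, 0, 1, 1, 1, 1, 1],
    "c": [0, 1, 0, 0, 1, 0, 1, 1],
}
ranking = mrmr_rank(features, target)
```

Incremental feature selection would then train a classifier on the top 1, 2, 3, ... features of such a ranking and keep the prefix with the best cross-validated performance.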
Affiliation(s)
- Jing Yang
  - The Key Laboratory of Stem Cell Biology, Institute of Health Sciences, Shanghai Jiao Tong University School of Medicine (SJTUSM) and Shanghai Institutes for Biological Sciences (SIBS), Chinese Academy of Sciences (CAS), Shanghai, People’s Republic of China
- Lei Chen
  - College of Information Engineering, Shanghai Maritime University, Shanghai, People’s Republic of China
- Xiangyin Kong
  - The Key Laboratory of Stem Cell Biology, Institute of Health Sciences, Shanghai Jiao Tong University School of Medicine (SJTUSM) and Shanghai Institutes for Biological Sciences (SIBS), Chinese Academy of Sciences (CAS), Shanghai, People’s Republic of China
- Tao Huang
  - Department of Genetics and Genomic Sciences, Mount Sinai School of Medicine, New York, New York, United States of America
- Yu-Dong Cai
  - Institute of Systems Biology, Shanghai University, Shanghai, People’s Republic of China
|
30
|
Gene ontology and KEGG enrichment analyses of genes related to age-related macular degeneration. BioMed Research International 2014; 2014:450386. [PMID: 25165703 PMCID: PMC4140130 DOI: 10.1155/2014/450386] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Received: 06/13/2014] [Accepted: 07/21/2014] [Indexed: 01/10/2023]
Abstract
Identifying disease genes is one of the most important topics in biomedicine and may facilitate studies on the mechanisms underlying disease. Age-related macular degeneration (AMD) is a serious eye disease; it typically affects older adults and results in a loss of vision due to retinal damage. In this study, we attempt to develop an effective method for distinguishing AMD-related genes. Gene ontology and KEGG enrichment analyses of known AMD-related genes were performed, and a classification system was established. Specifically, each gene was encoded as a vector of enrichment scores computed between a gene set (the gene itself plus its direct neighbors in STRING) and each gene ontology term or KEGG pathway. Feature-selection methods, including minimum redundancy maximum relevance and incremental feature selection, were then adopted to extract key features for the classification system. As a result, 720 GO terms and 11 KEGG pathways were deemed the most important factors for predicting AMD-related genes.
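The enrichment-score encoding can be sketched as follows. The abstract does not give the formula, so this example assumes a common definition: the negative base-10 logarithm of the hypergeometric upper-tail p-value for the overlap between a gene set and the genes annotated to one GO term or KEGG pathway. The function name and parameter names are hypothetical.

```python
from math import comb, log10


def enrichment_score(population_size, annotated, set_size, overlap):
    """-log10 of the hypergeometric upper-tail p-value (assumed definition).

    population_size: total number of genes considered
    annotated:       genes annotated to the GO term / KEGG pathway
    set_size:        size of the gene set (gene plus its network neighbors)
    overlap:         genes in the set that carry the annotation
    """
    # Count the ways to draw at least `overlap` annotated genes.
    favourable = sum(
        comb(annotated, k) * comb(population_size - annotated, set_size - k)
        for k in range(overlap, min(set_size, annotated) + 1)
    )
    p_value = favourable / comb(population_size, set_size)
    return -log10(p_value)


# One vector component per term: here a single, made-up term where all
# 5 genes of the set are among the 5 annotated genes out of 10.
score = enrichment_score(population_size=10, annotated=5, set_size=5, overlap=5)
```

A full feature vector would repeat this calculation for every GO term and KEGG pathway, giving one component per term.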
|
31
|
Bioinformatics tools for predicting GPCR gene functions. Advances in Experimental Medicine and Biology 2014; 796:205-24. [PMID: 24158807 DOI: 10.1007/978-94-007-7423-0_10] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 02/12/2023]
Abstract
The automatic classification of GPCRs by bioinformatics methods can provide functional information for new GPCRs across the whole 'GPCR proteome', and this information is important for the development of novel drugs. Since the GPCR proteome is classified hierarchically, general approaches to GPCR function prediction are based on hierarchical classification. Various computational tools have been developed to predict GPCR functions; rather than simple sequence searches, these tools use more powerful methods from protein sequence analysis, such as alignment-free methods, statistical models, and machine learning methods trained on learning datasets. The first stage of hierarchical function prediction discriminates GPCRs from non-GPCRs, and the second stage classifies the predicted GPCR candidates at the family, subfamily, and sub-subfamily levels. Further classification is then performed according to protein-protein interaction type: the G-protein type bound, the oligomerization partner type, etc. These methods have achieved predictive accuracies of around 90%. Finally, I describe future research directions for bioinformatics techniques in the functional prediction of GPCRs.
|
32
|
Li ZC, Lai YH, Chen LL, Xie Y, Dai Z, Zou XY. Identifying functions of protein complexes based on topology similarity with random forest. Molecular BioSystems 2014; 10:514-25. [PMID: 24389559 DOI: 10.1039/c3mb70401g] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 11/21/2022]
Abstract
Elucidating the functions of protein complexes is critical for understanding disease mechanisms, diagnosis and therapy. In this study, based on the concept that protein complexes with similar topology may have similar functions, we first model protein complexes as weighted graphs, with nodes representing proteins and edges indicating interactions between proteins. Second, we use topology features derived from these graphs to characterize protein complexes on the basis of graph theory. Finally, we construct a predictor using random forest and the topology features to identify the functions of protein complexes. The effectiveness of the method is evaluated by identifying the functions of mammalian protein complexes. The predictor is then also used to identify the functions of protein complexes retrieved from human protein-protein interaction networks. We identify some protein complexes with significant roles in the occurrence of tumors, vesicles and retinoblastoma. It is anticipated that this research will have an important impact on the study of pathogenesis and on the pharmaceutical industry. The Matlab source code and the dataset are freely available on request from the authors.
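As an illustration of the first two steps, the sketch below represents one complex as a weighted edge list and derives a few simple topology features. The abstract does not list the descriptors the authors used, so the features here (node count, edge count, density, mean node strength, maximum degree) are generic stand-ins, and `topology_features` is a hypothetical name.

```python
from collections import defaultdict


def topology_features(edges):
    """Simple topology descriptors for an undirected weighted graph.

    edges: list of (protein_u, protein_v, weight) tuples, one per interaction.
    Returns a dict of features that could feed a classifier such as a
    random forest.
    """
    degree = defaultdict(int)      # number of interaction partners
    strength = defaultdict(float)  # sum of incident edge weights
    for u, v, w in edges:
        degree[u] += 1
        degree[v] += 1
        strength[u] += w
        strength[v] += w

    n_nodes = len(degree)
    n_edges = len(edges)
    max_possible = n_nodes * (n_nodes - 1) / 2
    return {
        "n_nodes": n_nodes,
        "n_edges": n_edges,
        "density": n_edges / max_possible if max_possible else 0.0,
        "mean_strength": sum(strength.values()) / n_nodes if n_nodes else 0.0,
        "max_degree": max(degree.values(), default=0),
    }


# Toy complex of three proteins with made-up interaction weights.
triangle = [("A", "B", 0.5), ("B", "C", 1.0), ("A", "C", 1.5)]
features = topology_features(triangle)
```

Each complex would yield one such feature vector, and the vectors with their known function labels would form the training set for the classifier.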
Affiliation(s)
- Zhan-Chao Li
  - School of Chemistry and Chemical Engineering, Guangdong Pharmaceutical University, Guangzhou, 510006, People's Republic of China.
|
33
|
Li ZC, Lai YH, Chen LL, Chen C, Xie Y, Dai Z, Zou XY. Identifying subcellular localizations of mammalian protein complexes based on graph theory with a random forest algorithm. Molecular BioSystems 2013; 9:658-67. [DOI: 10.1039/c3mb25451h] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 01/28/2023]
|
34
|
Lai YH, Li ZC, Chen LL, Dai Z, Zou XY. Identification of potential host proteins for influenza A virus based on topological and biological characteristics by proteome-wide network approach. J Proteomics 2012; 75:2500-13. [DOI: 10.1016/j.jprot.2012.02.034] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 11/07/2011] [Revised: 02/21/2012] [Accepted: 02/26/2012] [Indexed: 12/31/2022]
|
35
|
Identification of human protein complexes from local sub-graphs of protein-protein interaction network based on random forest with topological structure features. Anal Chim Acta 2012; 718:32-41. [PMID: 22305895 DOI: 10.1016/j.aca.2011.12.069] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.1] [Received: 10/09/2011] [Revised: 12/28/2011] [Accepted: 12/30/2011] [Indexed: 11/20/2022]
Abstract
In the post-genomic era, one of the most important and challenging tasks is to identify protein complexes and further elucidate their molecular mechanisms in specific biological processes. Previous computational approaches usually identify protein complexes from protein interaction networks based on dense sub-graphs and incomplete prior information. Additionally, these approaches pay little attention to the biological properties of proteins, and there is no common metric for evaluating their performance. It is therefore necessary to construct a novel method for identifying protein complexes and elucidating their functions. In this study, a novel approach is proposed to identify protein complexes using random forest and topological structure. Each protein complex is represented by a graph of interactions, in which a descriptor of the protein primary structure is used to characterize the biological properties of each protein and each vertex is weighted by that descriptor. Topological structure features are developed and used to characterize protein complexes. The random forest algorithm is used to build the prediction model and identify protein complexes from local sub-graphs instead of dense sub-graphs. As a demonstration, the proposed approach is applied to human protein interaction data, and satisfactory results are obtained, with an accuracy of 80.24%, sensitivity of 81.94%, specificity of 80.07%, and Matthews correlation coefficient of 0.4087 in a 10-fold cross-validation test. Some new protein complexes are identified, and analysis based on Gene Ontology shows that the complexes are likely to be true complexes and to play important roles in the pathogenesis of some diseases. PCI-RFTS, a corresponding executable program for protein complex identification, can be obtained freely on request from the authors.
|
36
|
Fanelli F, De Benedetti PG. Update 1 of: computational modeling approaches to structure-function analysis of G protein-coupled receptors. Chem Rev 2011; 111:PR438-535. [PMID: 22165845 DOI: 10.1021/cr100437t] [Citation(s) in RCA: 63] [Impact Index Per Article: 4.8] [Indexed: 02/06/2023]
Affiliation(s)
- Francesca Fanelli
  - Dulbecco Telethon Institute, University of Modena and Reggio Emilia, via Campi 183, 41125 Modena, Italy.
|
37
|
Classification of G proteins and prediction of GPCRs-G proteins coupling specificity using continuous wavelet transform and information theory. Amino Acids 2011; 43:793-804. [PMID: 22086210 DOI: 10.1007/s00726-011-1133-6] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 05/24/2011] [Accepted: 10/20/2011] [Indexed: 10/15/2022]
Abstract
The coupling between G protein-coupled receptors (GPCRs) and guanine nucleotide-binding proteins (G proteins) regulates various signal transduction processes from the extracellular space into the cell. However, the coupling mechanism between GPCRs and G proteins is still unknown, and experimental determination of their coupling specificity and function is both expensive and time consuming. It is therefore important to develop a theoretical method that predicts the coupling specificity between GPCRs and G proteins, as well as their function, from their primary sequences. In this study, a novel four-layer predictor (GPCRsG_CWTIT) based on support vector machine (SVM), continuous wavelet transform (CWT) and information theory (IT) is developed to classify G proteins and predict the coupling specificity between GPCRs and G proteins. SVM is used to construct the models, and CWT and IT are used to characterize the primary structure of the proteins. The performance of GPCRsG_CWTIT is evaluated by cross-validation tests on various working datasets. The overall accuracy for G proteins at the class and family levels is 98.23% and 85.42%, respectively. The accuracy of the coupling specificity prediction varies from 74.60% to 94.30%. These results indicate that the proposed predictor is an effective and feasible tool for predicting the coupling specificity between GPCRs and G proteins, as well as their functions, using only the full protein sequence. The establishment of such an accurate prediction method will facilitate drug discovery by improving the ability to identify and predict protein-protein interactions. GPCRsG_CWTIT and the dataset can be obtained freely on request from the authors.
|