1
Panis F, Rompel A. The Novel Role of Tyrosinase Enzymes in the Storage of Globally Significant Amounts of Carbon in Wetland Ecosystems. Environmental Science & Technology 2022; 56:11952-11968. [PMID: 35944157 PMCID: PMC9454253 DOI: 10.1021/acs.est.2c03770] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/25/2022] [Revised: 07/01/2022] [Accepted: 07/05/2022] [Indexed: 05/30/2023]
Abstract
Over the last millennia, wetlands have been sequestering carbon from the atmosphere via photosynthesis at a higher rate than releasing it and, therefore, have globally accumulated 550 × 10¹⁵ g of carbon, which is equivalent to 73% of the atmospheric carbon pool. The accumulation of organic carbon in wetlands is effectuated by phenolic compounds, which suppress the degradation of soil organic matter by inhibiting the activity of organic-matter-degrading enzymes. The enzymatic removal of phenolic compounds by bacterial tyrosinases has historically been blocked by anoxic conditions in wetland soils, resulting from waterlogging. Bacterial tyrosinases are a subgroup of oxidoreductases that oxidatively remove phenolic compounds, coupled to the reduction of molecular oxygen to water. The biochemical properties of bacterial tyrosinases have been investigated thoroughly in vitro within recent decades, while investigations focused on carbon fluxes in wetlands on a macroscopic level have remained a thriving yet separated research area so far. In the wake of climate change, however, anoxic conditions in wetland soils are threatened by reduced rainfall and prolonged summer drought. This potentially allows tyrosinase enzymes to reduce the concentration of phenolic compounds, which in turn will increase the release of stored carbon back into the atmosphere. To offer compelling evidence for the novel concept that bacterial tyrosinases are among the key enzymes influencing carbon cycling in wetland ecosystems, first, bacterial organisms indigenous to wetland ecosystems that harbor a TYR gene within their respective genome (tyr+) have been identified, which revealed a phylogenetically diverse community of tyr+ bacteria indigenous to wetlands based on genomic sequencing data.
Bacterial TYR host organisms covering seven phyla (Acidobacteria, Actinobacteria, Bacteroidetes, Firmicutes, Nitrospirae, Planctomycetes, and Proteobacteria) have been identified within various wetland ecosystems (peatlands, marshes, mangrove forests, bogs, and alkaline soda lakes) which cover a climatic continuum ranging from high arctic to tropic ecosystems. Second, it is demonstrated that (in vitro) bacterial TYR activity is commonly observed at pH values characteristic for wetland ecosystems (ranging from pH 3.5 in peatlands and freshwater swamps to pH 9.0 in soda lakes and freshwater marshes) and toward phenolic compounds naturally present within wetland environments (p-coumaric acid, gallic acid, protocatechuic acid, p-hydroxybenzoic acid, caffeic acid, catechin, and epicatechin). Third, analyzing the available data confirmed that bacterial host organisms tend to exhibit in vitro growth optima at pH values similar to their respective wetland habitats. Based on these findings, it is concluded that, following increased aeration of previously anoxic wetland soils due to climate change, TYRs are among the enzymes capable of reducing the concentration of phenolic compounds present within wetland ecosystems, which will potentially destabilize vast amounts of carbon stored in these ecosystems. Finally, promising approaches to mitigate the detrimental effects of increased TYR activity in wetland ecosystems and the requirement of future investigations of the abundance and activity of TYRs in an environmental setting are presented.
2
Bonetta R, Valentino G. Machine learning techniques for protein function prediction. Proteins 2019; 88:397-413. [PMID: 31603244 DOI: 10.1002/prot.25832] [Citation(s) in RCA: 63] [Impact Index Per Article: 12.6] [Received: 03/19/2019] [Revised: 07/05/2019] [Accepted: 09/17/2019] [Indexed: 12/17/2022]
Abstract
Proteins play important roles in living organisms, and their function is directly linked with their structure. Due to the growing gap between the number of proteins being discovered and their functional characterization (in particular as a result of experimental limitations), reliable prediction of protein function through computational means has become crucial. This paper reviews the machine learning techniques used in the literature, following their evolution from simple algorithms such as logistic regression to more advanced methods like support vector machines and modern deep neural networks. Hyperparameter optimization methods adopted to boost prediction performance are presented. In parallel, the metamorphosis in the features used by these algorithms from classical physicochemical properties and amino acid composition, up to text-derived features from biomedical literature and learned feature representations using autoencoders, together with feature selection and dimensionality reduction techniques, are also reviewed. The success stories in the application of these techniques to both general and specific protein function prediction are discussed.
Affiliation(s)
- Rosalin Bonetta
  - Centre for Molecular Medicine and Biobanking, University of Malta, Msida, Malta
- Gianluca Valentino
  - Department of Communications and Computer Engineering, University of Malta, Msida, Malta
3
Mishra S, Rastogi YP, Jabin S, Kaur P, Amir M, Khatun S. A deep learning ensemble for function prediction of hypothetical proteins from pathogenic bacterial species. Comput Biol Chem 2019; 83:107147. [PMID: 31698160 DOI: 10.1016/j.compbiolchem.2019.107147] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 07/03/2019] [Revised: 10/05/2019] [Accepted: 10/09/2019] [Indexed: 01/06/2023]
Abstract
Protein function prediction is a crucial task in the post-genomics era because of the diverse, irreplaceable roles that proteins play in a biological system. Traditional methods involved cost-intensive and time-consuming molecular biology techniques, but they proved to be ineffective after the outburst of sequencing data brought about by cost-effective and advanced sequencing techniques. To match the pace of annotation with that of data generation, there is a shift to computational approaches based on homology, sequence- and structure-based features, protein-protein interaction networks, phylogenetic profiles, physicochemical properties, etc. A combination of these features has proven to be promising for improving the accuracy of protein function prediction. In the present work, we have employed a combination of sequence, physicochemical property, subsequence and annotation features, with a total of 9890 features extracted and/or calculated for 171,212 reviewed prokaryotic proteins of 9 bacterial phyla from UniProtKB, to train a supervised deep learning ensemble model with the aim of categorizing a bacterial hypothetical/unreviewed protein's function into 1739 GO terms as functional classes. The proposed system, being fully dedicated to bacterial organisms, is a novel attempt among the various existing machine learning based protein function prediction systems, which are based on mixed organisms. Experimental results demonstrate the success of the proposed deep learning ensemble model based on the deep neural network method, with an F1 measure of 0.7912 on the prepared Test dataset 1 of reviewed proteins.
Affiliation(s)
- Sarthak Mishra
  - Department of Computer Science, Jamia Millia Islamia, Jamia Nagar, New Delhi, 110025, Delhi, India
- Yash Pratap Rastogi
  - Department of Computer Science, Jamia Millia Islamia, Jamia Nagar, New Delhi, 110025, Delhi, India
- Suraiya Jabin
  - Department of Computer Science, Jamia Millia Islamia, Jamia Nagar, New Delhi, 110025, Delhi, India
- Punit Kaur
  - Department of Biophysics, All India Institute of Medical Sciences (AIIMS), New Delhi, 110 029, Delhi, India
- Mohammad Amir
  - Department of Computer Science, Jamia Millia Islamia, Jamia Nagar, New Delhi, 110025, Delhi, India
- Shabnam Khatun
  - Department of Computer Science, Jamia Millia Islamia, Jamia Nagar, New Delhi, 110025, Delhi, India
4
Raccaud M, Friman ET, Alber AB, Agarwal H, Deluz C, Kuhn T, Gebhardt JCM, Suter DM. Mitotic chromosome binding predicts transcription factor properties in interphase. Nat Commun 2019; 10:487. [PMID: 30700703 PMCID: PMC6353955 DOI: 10.1038/s41467-019-08417-5] [Citation(s) in RCA: 53] [Impact Index Per Article: 10.6] [Received: 09/18/2018] [Accepted: 01/08/2019] [Indexed: 12/31/2022] Open
Abstract
Mammalian transcription factors (TFs) differ broadly in their nuclear mobility and sequence-specific/non-specific DNA binding. How these properties affect their ability to occupy specific genomic sites and modify the epigenetic landscape is unclear. The association of TFs with mitotic chromosomes observed by fluorescence microscopy is largely mediated by non-specific DNA interactions and differs broadly between TFs. Here we combine quantitative measurements of mitotic chromosome binding (MCB) of 501 TFs, TF mobility measurements by fluorescence recovery after photobleaching, single molecule imaging of DNA binding, and mapping of TF binding and chromatin accessibility. TFs associating to mitotic chromosomes are enriched in DNA-rich compartments in interphase and display slower mobility in interphase and mitosis. Remarkably, MCB correlates with relative TF on-rates and genome-wide specific site occupancy, but not with TF residence times. This suggests that non-specific DNA binding properties of TFs regulate their search efficiency and occupancy of specific genomic sites.
Affiliation(s)
- Mahé Raccaud
  - Institute of Bioengineering, School of Life Sciences, Ecole Polytechnique Fédérale de Lausanne (EPFL), CH-1015, Lausanne, Switzerland
- Elias T Friman
  - Institute of Bioengineering, School of Life Sciences, Ecole Polytechnique Fédérale de Lausanne (EPFL), CH-1015, Lausanne, Switzerland
- Andrea B Alber
  - Institute of Bioengineering, School of Life Sciences, Ecole Polytechnique Fédérale de Lausanne (EPFL), CH-1015, Lausanne, Switzerland
- Harsha Agarwal
  - Institute of Biophysics, Ulm University, Albert-Einstein-Allee 11, 89081, Ulm, Germany
- Cédric Deluz
  - Institute of Bioengineering, School of Life Sciences, Ecole Polytechnique Fédérale de Lausanne (EPFL), CH-1015, Lausanne, Switzerland
- Timo Kuhn
  - Institute of Biophysics, Ulm University, Albert-Einstein-Allee 11, 89081, Ulm, Germany
- J Christof M Gebhardt
  - Institute of Biophysics, Ulm University, Albert-Einstein-Allee 11, 89081, Ulm, Germany
- David M Suter
  - Institute of Bioengineering, School of Life Sciences, Ecole Polytechnique Fédérale de Lausanne (EPFL), CH-1015, Lausanne, Switzerland
5
Fa R, Cozzetto D, Wan C, Jones DT. Predicting human protein function with multi-task deep neural networks. PLoS One 2018; 13:e0198216. [PMID: 29889900 PMCID: PMC5995439 DOI: 10.1371/journal.pone.0198216] [Citation(s) in RCA: 44] [Impact Index Per Article: 7.3] [Received: 02/01/2018] [Accepted: 05/15/2018] [Indexed: 11/19/2022] Open
Abstract
Machine learning methods for protein function prediction are urgently needed, especially now that a substantial fraction of known sequences remains unannotated despite the extensive use of functional assignments based on sequence similarity. One major bottleneck supervised learning faces in protein function prediction is the structured, multi-label nature of the problem, because biological roles are represented by lists of terms from hierarchically organised controlled vocabularies such as the Gene Ontology. In this work, we build on recent developments in the area of deep learning and investigate the usefulness of multi-task deep neural networks (MTDNN), which consist of upstream shared layers upon which are stacked in parallel as many independent modules (additional hidden layers with their own output units) as the number of output GO terms (the tasks). MTDNN learns individual tasks partially using shared representations and partially from task-specific characteristics. When no close homologues with experimentally validated functions can be identified, MTDNN gives more accurate predictions than baseline methods based on annotation frequencies in public databases or homology transfers. More importantly, the results show that MTDNN binary classification accuracy is higher than that of alternative machine learning-based methods that do not exploit commonalities and differences among prediction tasks. Interestingly, compared with a single-task predictor, the performance improvement is not linearly correlated with the number of tasks in MTDNN, but medium-size models provide more improvement in our case. One of the advantages of MTDNN is that, given a set of features, it does not require a bootstrap feature selection procedure, as traditional machine learning algorithms do. Overall, the results indicate that the proposed MTDNN algorithm improves the performance of protein function prediction. On the other hand, there is still considerable room for deep learning techniques to further enhance prediction ability.
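The architecture described in this abstract (shared upstream layers feeding one independent module per GO term) can be illustrated with a minimal forward-pass sketch. This is not the authors' implementation: the class name `MultiTaskNet`, the layer sizes, the ReLU and sigmoid activations, and the random untrained weights are all assumptions chosen for illustration, and training is omitted.

```python
import math
import random

def dense(inputs, weights, biases):
    """Affine layer: one output value per row of `weights`."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    return [max(0.0, v) for v in values]

def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))

def random_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

class MultiTaskNet:
    """Hypothetical sketch: one shared layer, then one small head per task."""

    def __init__(self, n_features, n_shared, n_task_hidden, n_tasks, seed=0):
        rng = random.Random(seed)
        # Shared upstream layer, used by every task.
        self.shared_w = random_matrix(rng, n_shared, n_features)
        self.shared_b = [0.0] * n_shared
        # One independent module (hidden layer + output unit) per task.
        self.heads = []
        for _ in range(n_tasks):
            self.heads.append({
                "hidden_w": random_matrix(rng, n_task_hidden, n_shared),
                "hidden_b": [0.0] * n_task_hidden,
                "out_w": random_matrix(rng, 1, n_task_hidden),
                "out_b": [0.0],
            })

    def forward(self, features):
        """Return one score in (0, 1) per task (e.g. per GO term)."""
        shared = relu(dense(features, self.shared_w, self.shared_b))
        scores = []
        for head in self.heads:
            hidden = relu(dense(shared, head["hidden_w"], head["hidden_b"]))
            logit = dense(hidden, head["out_w"], head["out_b"])[0]
            scores.append(sigmoid(logit))
        return scores

# Illustrative call with made-up feature values and five tasks.
net = MultiTaskNet(n_features=4, n_shared=8, n_task_hidden=3, n_tasks=5)
scores = net.forward([0.2, -1.0, 0.5, 0.0])
```

Each head sees the same shared representation but has its own parameters, which is the sense in which tasks are learned "partially using shared representations and partially from task-specific characteristics".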
Affiliation(s)
- Rui Fa
  - The Francis Crick Institute, London, United Kingdom
  - Computer Science Department, University College London, London, United Kingdom
- Domenico Cozzetto
  - The Francis Crick Institute, London, United Kingdom
  - Computer Science Department, University College London, London, United Kingdom
- Cen Wan
  - The Francis Crick Institute, London, United Kingdom
  - Computer Science Department, University College London, London, United Kingdom
- David T. Jones
  - The Francis Crick Institute, London, United Kingdom
  - Computer Science Department, University College London, London, United Kingdom
6
Abstract
The GO-Cellular Component (GO-CC) ontology provides a controlled vocabulary for the consistent description of the subcellular compartments or macromolecular complexes where proteins may act. Current machine learning-based methods used for the automated GO-CC annotation of proteins suffer from the inconsistency of individual GO-CC term predictions. Here, we present FGGA-CC+, a class of hierarchical graph-based classifiers for the consistent GO-CC annotation of protein coding genes at the subcellular compartment or macromolecular complex levels. Aiming to boost the accuracy of GO-CC predictions, we make use of the protein localization knowledge in the GO-Biological Process (GO-BP) annotations. As a result, FGGA-CC+ classifiers are built from annotation data in both the GO-CC and GO-BP ontologies. Due to their graph-based design, FGGA-CC+ classifiers are fully interpretable and their predictions amenable to expert analysis. Promising results on protein annotation data from five model organisms were obtained. Additionally, successful validation results in the annotation of a challenging subset of tandem duplicated genes in the tomato non-model organism were accomplished. Overall, these results suggest that FGGA-CC+ classifiers can indeed be useful for satisfying the huge demand for GO-CC annotation arising from ubiquitous high-throughput sequencing and proteomic projects.
7
GRAFENE: Graphlet-based alignment-free network approach integrates 3D structural and sequence (residue order) data to improve protein structural comparison. Sci Rep 2017; 7:14890. [PMID: 29097661 PMCID: PMC5668259 DOI: 10.1038/s41598-017-14411-y] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Received: 11/29/2016] [Accepted: 10/11/2017] [Indexed: 12/26/2022] Open
Abstract
Initial protein structural comparisons were sequence-based. Since amino acids that are distant in the sequence can be close in the 3-dimensional (3D) structure, 3D contact approaches can complement sequence approaches. Traditional 3D contact approaches study 3D structures directly and are alignment-based. Instead, 3D structures can be modeled as protein structure networks (PSNs). Then, network approaches can compare proteins by comparing their PSNs. These can be alignment-based or alignment-free. We focus on the latter. Existing network alignment-free approaches have drawbacks: 1) They rely on naive measures of network topology. 2) They are not robust to PSN size. They cannot integrate 3) multiple PSN measures or 4) PSN data with sequence data, although this could improve comparison because the different data types capture complementary aspects of the protein structure. We address this by: 1) exploiting well-established graphlet measures via a new network alignment-free approach, 2) introducing normalized graphlet measures to remove the bias of PSN size, 3) allowing for integrating multiple PSN measures, and 4) using ordered graphlets to combine the complementary PSN data and sequence (specifically, residue order) data. We compare synthetic networks and real-world PSNs more accurately and faster than existing network (alignment-free and alignment-based), 3D contact, or sequence approaches.
8
Lima AN, Philot EA, Trossini GHG, Scott LPB, Maltarollo VG, Honorio KM. Use of machine learning approaches for novel drug discovery. Expert Opin Drug Discov 2016; 11:225-39. [PMID: 26814169 DOI: 10.1517/17460441.2016.1146250] [Citation(s) in RCA: 138] [Impact Index Per Article: 17.3] [Indexed: 12/17/2022]
Abstract
INTRODUCTION: The use of computational tools in the early stages of drug development has increased in recent decades. Machine learning (ML) approaches have been of special interest, since they can be applied in several steps of the drug discovery methodology, such as prediction of target structure, prediction of biological activity of new ligands through model construction, discovery or optimization of hits, and construction of models that predict the pharmacokinetic and toxicological (ADMET) profile of compounds. AREAS COVERED: This article presents an overview of some applications of ML techniques in drug design. These techniques can be employed in ligand-based drug design (LBDD) and structure-based drug design (SBDD) studies, such as similarity searches, construction of classification and/or prediction models of biological activity, prediction of secondary structures and binding sites, docking, and virtual screening. EXPERT OPINION: Successful cases have been reported in the literature, demonstrating the efficiency of ML techniques combined with traditional approaches to study medicinal chemistry problems. Some ML techniques used in drug design are support vector machines, random forests, decision trees and artificial neural networks. Currently, an important application of ML techniques is the calculation of consensus scoring functions used in docking and virtual screening assays, combining traditional and ML techniques in order to improve the prediction of binding sites and docking solutions.
Affiliation(s)
- Angélica Nakagawa Lima
  - Centro de Ciências Naturais e Humanas, Universidade Federal do ABC, São Paulo, Brazil
- Eric Allison Philot
  - Centro de Ciências Naturais e Humanas, Universidade Federal do ABC, São Paulo, Brazil
- Luis Paulo Barbour Scott
  - Centro de Matemática, Computação e Cognição, Universidade Federal do ABC, São Paulo, Brazil
- Kathia Maria Honorio
  - Centro de Ciências Naturais e Humanas, Universidade Federal do ABC, São Paulo, Brazil
  - Escola de Artes, Ciências e Humanidades, Universidade de São Paulo, São Paulo, Brazil
9
Spetale FE, Tapia E, Krsticevic F, Roda F, Bulacio P. A Factor Graph Approach to Automated GO Annotation. PLoS One 2016; 11:e0146986. [PMID: 26771463 PMCID: PMC4714749 DOI: 10.1371/journal.pone.0146986] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 09/22/2015] [Accepted: 12/23/2015] [Indexed: 12/19/2022] Open
Abstract
As the volume of genomic data grows, computational methods become essential for providing a first glimpse into gene annotations. Automated Gene Ontology (GO) annotation methods based on hierarchical ensemble classification techniques are particularly interesting when interpretability of annotation results is a main concern. In these methods, raw GO-term predictions computed by base binary classifiers are leveraged by checking the consistency of predefined GO relationships. Both formal leveraging strategies, with main focus on annotation precision, and heuristic alternatives, with main focus on scalability issues, have been described in the literature. In this contribution, a factor graph approach to the hierarchical ensemble formulation of the automated GO annotation problem is presented. In this formal framework, a core factor graph is first built based on the GO structure and then enriched to take into account the noisy nature of GO-term predictions. Hence, starting from raw GO-term predictions, an iterative message passing algorithm between nodes of the factor graph is used to compute marginal probabilities of target GO-terms. Evaluations on Saccharomyces cerevisiae, Arabidopsis thaliana and Drosophila melanogaster protein sequences from the GO Molecular Function domain showed significant improvements over competing approaches, even when protein sequences were naively characterized by their physicochemical and secondary structure properties or when loose noisy annotation datasets were considered. Based on these promising results and using Arabidopsis thaliana annotation data, we extend our approach to the identification of the most promising molecular function annotations for a set of proteins of unknown function in Solanum lycopersicum.
Affiliation(s)
- Flavio E. Spetale
  - CIFASIS-Conicet Institute, Rosario, Argentina
  - Facultad de Cs. Exactas, Ingeniería y Agrimensura, National University of Rosario, Rosario, Argentina
- Elizabeth Tapia
  - CIFASIS-Conicet Institute, Rosario, Argentina
  - Facultad de Cs. Exactas, Ingeniería y Agrimensura, National University of Rosario, Rosario, Argentina
- Flavia Krsticevic
  - CIFASIS-Conicet Institute, Rosario, Argentina
  - Facultad Regional San Nicolás, National Technological University, San Nicolás, Argentina
- Pilar Bulacio
  - CIFASIS-Conicet Institute, Rosario, Argentina
  - Facultad de Cs. Exactas, Ingeniería y Agrimensura, National University of Rosario, Rosario, Argentina
  - Facultad Regional San Nicolás, National Technological University, San Nicolás, Argentina
10
Zhai J, Tang Y, Yuan H, Wang L, Shang H, Ma C. A Meta-Analysis Based Method for Prioritizing Candidate Genes Involved in a Pre-specific Function. Frontiers in Plant Science 2016; 7:1914. [PMID: 28018423 PMCID: PMC5156684 DOI: 10.3389/fpls.2016.01914] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 08/01/2016] [Accepted: 12/02/2016] [Indexed: 05/10/2023]
Abstract
The identification of genes associated with a given biological function in plants remains a challenge, although network-based gene prioritization algorithms have been developed for Arabidopsis thaliana and many non-model plant species. Nevertheless, these network-based gene prioritization algorithms have encountered several problems; one in particular is that of unsatisfactory prediction accuracy due to limited network coverage, varying link quality, and/or uncertain network connectivity. Thus, a model that integrates complementary biological data may be expected to increase the prediction accuracy of gene prioritization. Toward this goal, we developed a novel gene prioritization method named RafSee, to rank candidate genes using a random forest algorithm that integrates sequence, evolutionary, and epigenetic features of plants. Subsequently, we proposed an integrative approach named RAP (Rank Aggregation-based data fusion for gene Prioritization), in which an order statistics-based meta-analysis was used to aggregate the rank of the network-based gene prioritization method and RafSee, for accurately prioritizing candidate genes involved in a pre-specific biological function. Finally, we showcased the utility of RAP by prioritizing 380 flowering-time genes in Arabidopsis. The "leave-one-out" cross-validation experiment showed that RafSee could work as a complement to a current state-of-the-art network-based gene prioritization system (AraNet v2). Moreover, RAP ranked 53.68% (204/380) of the flowering-time genes higher than AraNet v2, resulting in a 39.46% improvement in terms of the first quartile rank. Further evaluations also showed that RAP was effective in prioritizing genes related to different abiotic stresses. To enhance the usability of RAP for Arabidopsis and non-model plant species, an R package implementing the method is freely available at http://bioinfo.nwafu.edu.cn/software.
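The idea of fusing two gene rankings, and the reported 204/380 proportion, can be illustrated with a short sketch. The aggregation below uses a simple product of normalized ranks as a stand-in; it is not the order statistics-based meta-analysis that RAP uses, and the gene identifiers and both input rankings are made-up examples.

```python
def normalized_ranks(ranking):
    """Map each gene to rank / N, where rank 1 is the top candidate."""
    n = len(ranking)
    return {gene: (position + 1) / n for position, gene in enumerate(ranking)}

def aggregate(ranking_a, ranking_b):
    """Combine two rankings of the same genes by the product of normalized ranks.

    Simplified stand-in for rank aggregation: a smaller product means the
    gene is ranked near the top by both methods. Ties are broken by name.
    """
    ranks_a = normalized_ranks(ranking_a)
    ranks_b = normalized_ranks(ranking_b)
    combined = {gene: ranks_a[gene] * ranks_b[gene] for gene in ranks_a}
    return sorted(combined, key=lambda gene: (combined[gene], gene))

# Hypothetical rankings from a network-based method and a feature-based method.
network_ranking = ["g1", "g2", "g3", "g4"]
feature_ranking = ["g2", "g1", "g4", "g3"]
merged = aggregate(network_ranking, feature_ranking)

# Proportion of flowering-time genes ranked higher, from the counts in the abstract.
share_improved = 204 / 380
```

Genes that both methods place near the top keep a small combined score, so they stay at the head of the merged list even when the two methods disagree on their exact order.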
11
Tiwari AK, Srivastava R. A survey of computational intelligence techniques in protein function prediction. International Journal of Proteomics 2014; 2014:845479. [PMID: 25574395 PMCID: PMC4276698 DOI: 10.1155/2014/845479] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Received: 09/10/2014] [Revised: 10/31/2014] [Accepted: 11/07/2014] [Indexed: 02/08/2023]
Abstract
In the past, knowledge of previously unknown proteins grew massively with the advancement of high-throughput microarray technologies. Protein function prediction is the most challenging problem in bioinformatics. Homology-based approaches were formerly used to predict protein function, but they failed when a new protein differed from previously characterized ones. Therefore, to alleviate the problems associated with traditional homology-based approaches, numerous computational intelligence techniques have been proposed in the recent past. This paper presents a comprehensive state-of-the-art review of computational intelligence techniques for protein function prediction using sequence, structure, protein-protein interaction network, and gene expression data, as applied in a wide range of areas such as prediction of DNA and RNA binding sites, subcellular localization, enzyme functions, signal peptides, catalytic residues, nuclear/G-protein coupled receptors, membrane proteins, and pathway analysis from gene expression datasets. This paper also summarizes the results obtained by many researchers who addressed these problems using computational intelligence techniques with appropriate datasets to improve prediction performance. The summary shows that ensemble classifiers and integration of multiple heterogeneous data are useful for protein function prediction.
Affiliation(s)
- Arvind Kumar Tiwari
  - Department of Computer Science & Engineering, Indian Institute of Technology (BHU), Varanasi 221005, India
- Rajeev Srivastava
  - Department of Computer Science & Engineering, Indian Institute of Technology (BHU), Varanasi 221005, India
12
Nagao C, Nagano N, Mizuguchi K. Prediction of detailed enzyme functions and identification of specificity determining residues by random forests. PLoS One 2014; 9:e84623. [PMID: 24416252 PMCID: PMC3885575 DOI: 10.1371/journal.pone.0084623] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Received: 06/27/2013] [Accepted: 11/15/2013] [Indexed: 12/03/2022] Open
Abstract
Determining enzyme functions is essential for a thorough understanding of cellular processes. Although many prediction methods have been developed, it remains a significant challenge to predict enzyme functions at the fourth-digit level of the Enzyme Commission numbers. Functional specificity of enzymes often changes drastically by mutations of a small number of residues and therefore, information about these critical residues can potentially help discriminate detailed functions. However, because these residues must be identified by mutagenesis experiments, the available information is limited, and the lack of experimentally verified specificity determining residues (SDRs) has hindered the development of detailed function prediction methods and computational identification of SDRs. Here we present a novel method for predicting enzyme functions by random forests, EFPrf, along with a set of putative SDRs, the random forests derived SDRs (rf-SDRs). EFPrf consists of a set of binary predictors for enzymes in each CATH superfamily and the rf-SDRs are the residue positions corresponding to the most highly contributing attributes obtained from each predictor. EFPrf showed a precision of 0.98 and a recall of 0.89 in a cross-validated benchmark assessment. The rf-SDRs included many residues, whose importance for specificity had been validated experimentally. The analysis of the rf-SDRs revealed both a general tendency that functionally diverged superfamilies tend to include more active site residues in their rf-SDRs than in less diverged superfamilies, and superfamily-specific conservation patterns of each functional residue. EFPrf and the rf-SDRs will be an effective tool for annotating enzyme functions and for understanding how enzyme functions have diverged within each superfamily.
Affiliation(s)
- Chioko Nagao
  - National Institute of Biomedical Innovation, Ibaraki, Osaka, Japan
- Nozomi Nagano
  - Computational Biology Research Center, AIST, Koto-ku, Tokyo, Japan
- Kenji Mizuguchi
  - National Institute of Biomedical Innovation, Ibaraki, Osaka, Japan
13
A novel method for classifying body mass index on the basis of speech signals for future clinical applications: a pilot study. Evidence-Based Complementary and Alternative Medicine 2013; 2013:150265. [PMID: 23573116 PMCID: PMC3612486 DOI: 10.1155/2013/150265] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Received: 11/05/2012] [Revised: 01/11/2013] [Accepted: 01/13/2013] [Indexed: 11/18/2022]
Abstract
Obesity is a serious public health problem because of the risk factors for diseases and psychological problems. The focus of this study is to diagnose the patient BMI (body mass index) status without weight and height measurements for the use in future clinical applications. In this paper, we first propose a method for classifying the normal and the overweight using only speech signals. Also, we perform a statistical analysis of the features from speech signals. Based on 1830 subjects, the accuracy and AUC (area under the ROC curve) of age- and gender-specific classifications ranged from 60.4 to 73.8% and from 0.628 to 0.738, respectively. We identified several features that were significantly different between normal and overweight subjects (P < 0.05). Also, we found compact and discriminatory feature subsets for building models for diagnosing normal or overweight individuals through wrapper-based feature subset selection. Our results showed that predicting BMI status is possible using a combination of speech features, even though significant features are rare and weak in age- and gender-specific groups and that the classification accuracy with feature selection was higher than that without feature selection. Our method has the potential to be used in future clinical applications such as automatic BMI diagnosis in telemedicine or remote healthcare.
|
14
|
Zou C, Gong J, Li H. An improved sequence based prediction protocol for DNA-binding proteins using SVM and comprehensive feature analysis. BMC Bioinformatics 2013; 14:90. [PMID: 23497329 PMCID: PMC3602657 DOI: 10.1186/1471-2105-14-90] [Citation(s) in RCA: 55] [Impact Index Per Article: 5.0] [Received: 09/29/2012] [Accepted: 03/04/2013] [Indexed: 11/10/2022] Open
Abstract
Background: DNA-binding proteins (DNA-BPs) play a pivotal role in both eukaryotic and prokaryotic proteomes. Several computational methods have been proposed in the literature to deal with DNA-BPs, and many informative features and properties have been used and shown to have a significant impact on this problem. However, the ultimate goal of bioinformatics is to be able to predict DNA-BPs directly from primary sequence. Results: In this work, the focus is on how to transform these informative features into a uniform numeric representation appropriately and improve the prediction accuracy of our SVM-based classifier for DNA-BPs. A systematic representation of some selected features known to perform well is investigated here. First, four kinds of protein properties are obtained and used to describe the protein sequence. Second, three different feature transformation methods (OCTD, AC and SAA) are adopted to obtain numeric feature vectors from three main levels of the protein sequence (global, nonlocal and local), and their performances are exhaustively investigated. Finally, the mRMR-IFS feature selection method and an ensemble learning approach are utilized to determine the best prediction model. In addition, the optimal features selected by mRMR-IFS are illustrated based on the observed results, which may provide useful insights for revealing the mechanisms of protein-DNA interactions. For five-fold cross-validation over the DNAdset and DNAaset, we obtained overall accuracies of 0.940 and 0.811 and MCCs of 0.881 and 0.614, respectively. Conclusions: The good results suggest that an entirely sequence-based protocol can be developed efficiently that transforms and integrates informative features from different scales for use by an SVM to predict DNA-BPs accurately. Moreover, a novel systematic framework for sequence descriptor-based protein function prediction is proposed here.
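The abstract reports both overall accuracy and the Matthews correlation coefficient (MCC). As a minimal sketch of how the two metrics relate, the snippet below computes both from the counts of a binary confusion matrix. The function name and the counts are hypothetical and are not the study's data.

```python
import math


def accuracy_and_mcc(tp, tn, fp, fn):
    """Return (accuracy, MCC) for a binary confusion matrix.

    Hypothetical helper for illustration; MCC is defined as 0.0 here
    when any marginal total is zero.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return accuracy, mcc


# Made-up counts: 100 binding and 100 non-binding proteins.
acc, mcc = accuracy_and_mcc(tp=90, tn=85, fp=15, fn=10)
print(f"accuracy={acc:.3f}, MCC={mcc:.3f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when the two classes are imbalanced, which is one reason papers of this kind report both values.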
Affiliation(s)
- Chuanxin Zou
- Shanghai Key Laboratory of New Drug Design, State Key Laboratory of Bioreactor Engineering, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China
|
15
|
Lee BJ, Kim KH, Ku B, Jang JS, Kim JY. Prediction of body mass index status from voice signals based on machine learning for automated medical applications. Artif Intell Med 2013; 58:51-61. [PMID: 23453267 DOI: 10.1016/j.artmed.2013.02.001] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Received: 03/26/2012] [Revised: 12/21/2012] [Accepted: 02/05/2013] [Indexed: 11/28/2022]
Abstract
OBJECTIVES: The body mass index (BMI) provides essential medical information related to body weight for the treatment and prognosis prediction of diseases such as cardiovascular disease, diabetes, and stroke. We propose a method for the prediction of normal, overweight, and obese classes based only on the combination of voice features that are associated with BMI status, independently of weight and height measurements. MATERIALS AND METHODS: A total of 1568 subjects were divided into 4 groups according to age and gender differences. We performed statistical analyses by analysis of variance (ANOVA) and Scheffe test to find significant features in each group. We predicted BMI status (normal, overweight, and obese) by a logistic regression algorithm and two ensemble classification algorithms (bagging and random forests) based on statistically significant features. RESULTS: In the Female-2030 group (females aged 20-40 years), classification experiments using an imbalanced (original) data set gave area under the receiver operating characteristic curve (AUC) values of 0.569-0.731 by logistic regression, whereas experiments using a balanced data set gave AUC values of 0.893-0.994 by random forests. AUC values in Female-4050 (females aged 41-60 years), Male-2030 (males aged 20-40 years), and Male-4050 (males aged 41-60 years) groups by logistic regression in imbalanced data were 0.585-0.654, 0.581-0.614, and 0.557-0.653, respectively. AUC values in Female-4050, Male-2030, and Male-4050 groups in balanced data were 0.629-0.893 by bagging, 0.707-0.916 by random forests, and 0.695-0.854 by bagging, respectively. In each group, we found discriminatory features showing statistical differences among normal, overweight, and obese classes. The results showed that the classification models built by logistic regression in imbalanced data were better than those built by the other two algorithms, and significant features differed according to age and gender groups.
CONCLUSION: Our results could support the development of BMI diagnosis tools for real-time monitoring; such tools are considered helpful in improving automated BMI status diagnosis in remote healthcare or telemedicine and are expected to have applications in forensic and medical science.
Affiliation(s)
- Bum Ju Lee
- Medical Research Division, Korea Institute of Oriental Medicine, 1672 Yuseongdae-ro, Yuseong-gu, Daejeon 305-811, Republic of Korea
|
16
|
Sekhwal MK, Sharma V, Sarin R. Identification of MFS proteins in sorghum using semantic similarity. Theory Biosci 2013; 132:105-13. [PMID: 23299296 DOI: 10.1007/s12064-012-0174-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 08/12/2012] [Accepted: 12/18/2012] [Indexed: 11/26/2022]
Abstract
Antiporters, uniporters and symporters are the functional classes of the major facilitator superfamily (MFS) that play a major role in ion homeostasis, regulation of pumps and channels, membrane structure, and transporter activity in tolerance to abiotic stresses. The MFS includes Na(+)/H(+) antiporters, which are considered to be sensors of molecule transport. A large number of MFS proteins have been identified in several plants, including rice, maize and Arabidopsis. However, the majority of these proteins in sorghum are still described as putative or uncharacterized. This suggests that the identification of MFS proteins in sorghum is far from saturation. Hence, we developed a method based on the semantic similarity of gene ontology (GO) terms, using the GOSemSim R package. As a result, a total of 2,568 orthologous proteins with high (100%) semantic similarity were obtained from 7 plant species. These data were used to predict the function of 257 putative uncharacterized proteins from 18 MFS families in sorghum. Consequently, the identified proteins were assigned to the functions of regulation of pumps and channels, membrane structure, transporter activity, ion homeostasis, transporter mechanisms and binding processes. These identified functions appear to reflect a distinct mechanism of salt-stress adaptation in plants. The proposed method will help in further identifying new proteins that can help in the development of agronomically and economically important plants.
Affiliation(s)
- Manoj Kumar Sekhwal
- Department of Bioscience and Biotechnology, Banasthali University, P.O. Banasthali Vidyapith, 304022 Vanasthali, Rajasthan, India
|
17
|
Lee BJ, Ku B, Park K, Kim KH, Kim JY. A new method of diagnosing constitutional types based on vocal and facial features for personalized medicine. J Biomed Biotechnol 2012; 2012:818607. [PMID: 22899890 PMCID: PMC3415144 DOI: 10.1155/2012/818607] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 05/21/2012] [Accepted: 05/30/2012] [Indexed: 11/18/2022] Open
Abstract
The aim of the present study is to develop an accurate constitution diagnostic method based solely on the individual's physical characteristics, irrespective of psychologic traits, characteristics of clinical medicine, and genetic factors. In this paper, we suggest a novel method for diagnosing constitutional types using only speech and face characteristics. Based on 514 subjects, the area under the receiver operating characteristics curve (AUC) values of classification models in age and gender groups ranged from 0.64 to 0.89. We identified significant features showing statistical differences among three constitutional types by performing statistical analysis. Also, we selected a compact and discriminative feature subset for constitution diagnosis in each age and gender group. Our method may support the direction of improved diagnosis prediction and will serve to develop a personal and automatic constitution diagnosis software for improvement of the effectiveness of prescribed medications and development of personalized medicine.
Affiliation(s)
- Bum Ju Lee
- Division of Constitutional Medicine Research, Korea Institute of Oriental Medicine, 1672 Yuseongdae-ro, Yuseong-gu, Daejeon 305-811, Republic of Korea
- Boncho Ku
- Division of Constitutional Medicine Research, Korea Institute of Oriental Medicine, 1672 Yuseongdae-ro, Yuseong-gu, Daejeon 305-811, Republic of Korea
- Kihyun Park
- Division of Constitutional Medicine Research, Korea Institute of Oriental Medicine, 1672 Yuseongdae-ro, Yuseong-gu, Daejeon 305-811, Republic of Korea
- Keun Ho Kim
- Division of Constitutional Medicine Research, Korea Institute of Oriental Medicine, 1672 Yuseongdae-ro, Yuseong-gu, Daejeon 305-811, Republic of Korea
- Jong Yeol Kim
- Division of Constitutional Medicine Research, Korea Institute of Oriental Medicine, 1672 Yuseongdae-ro, Yuseong-gu, Daejeon 305-811, Republic of Korea
|
18
|
Sekhwal MK, Swami AK, Sarin R, Sharma V. Identification of salt treated proteins in sorghum using gene ontology linkage. Physiology and Molecular Biology of Plants: An International Journal of Functional Plant Biology 2012; 18:209-216. [PMID: 23814435 PMCID: PMC3550515 DOI: 10.1007/s12298-012-0121-y] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 06/02/2023]
Abstract
Sorghum bicolor (L.) is an important crop of arid and semi-arid zones, with most of its varieties tolerant to drought, heat and salt stress. Functional identification of many salt-tolerant proteins has been reported in Arabidopsis, rice and other plants; however, little functional information has been predicted in sorghum to date. A 2-D gel electrophoresis-based proteomic approach with a MALDI-TOF mass spectrometer was used to analyze the salt stress response of sorghum. Major changes in the protein complement were observed at 200 mM NaCl in hydroponic culture after 96 h of salt stress. Five highly expressed proteins were excised for functional identification. We developed a shortest path (SP) analysis method on the Gene Ontology (GO) hierarchy using the sum of the GO terms' semantic similarities. In this study, we observed that the majority of the expressed proteins belonged to the functional categories of energy production and conversion, signal transduction mechanisms and ribosome maturation. These identified functions suggest a distinct mechanism of salt-stress adaptation in the sorghum plant. The method proposed in this paper is potentially of great importance for further understanding newly identified proteins that can help in plant development.
Affiliation(s)
- Manoj Kumar Sekhwal
- Department of Bioscience & Biotechnology, Banasthali University, P.O. Banasthali Vidyapith, 304022 Rajasthan, India
- Ajit Kumar Swami
- Department of Botany and Biotechnology, University of Rajasthan, JLN Marg, Jaipur, 302055 Rajasthan, India
- Renu Sarin
- Department of Botany and Biotechnology, University of Rajasthan, JLN Marg, Jaipur, 302055 Rajasthan, India
- Vinay Sharma
- Department of Bioscience & Biotechnology, Banasthali University, P.O. Banasthali Vidyapith, 304022 Rajasthan, India
|
19
|
Morris JH, Apeltsin L, Newman AM, Baumbach J, Wittkop T, Su G, Bader GD, Ferrin TE. clusterMaker: a multi-algorithm clustering plugin for Cytoscape. BMC Bioinformatics 2011; 12:436. [PMID: 22070249 PMCID: PMC3262844 DOI: 10.1186/1471-2105-12-436] [Citation(s) in RCA: 417] [Impact Index Per Article: 32.1] [Received: 08/08/2011] [Accepted: 11/09/2011] [Indexed: 12/02/2022] Open
Abstract
Background: In the post-genomic era, the rapid increase in high-throughput data calls for computational tools capable of integrating data of diverse types and facilitating recognition of biologically meaningful patterns within them. For example, protein-protein interaction data sets have been clustered to identify stable complexes, but scientists lack easily accessible tools to facilitate combined analyses of multiple data sets from different types of experiments. Here we present clusterMaker, a Cytoscape plugin that implements several clustering algorithms and provides network, dendrogram, and heat map views of the results. The Cytoscape network is linked to all of the other views, so that a selection in one is immediately reflected in the others. clusterMaker is the first Cytoscape plugin to implement such a wide variety of clustering algorithms and visualizations, including the only implementations of hierarchical clustering, dendrogram plus heat map visualization (tree view), k-means, k-medoid, SCPS, AutoSOME, and native (Java) MCL. Results: Results are presented in the form of three scenarios of use: analysis of protein expression data using a recently published mouse interactome and a mouse microarray data set of nearly one hundred diverse cell/tissue types; the identification of protein complexes in the yeast Saccharomyces cerevisiae; and the cluster analysis of the vicinal oxygen chelate (VOC) enzyme superfamily. For scenario one, we explore functionally enriched mouse interactomes specific to particular cellular phenotypes and apply fuzzy clustering. For scenario two, we explore the prefoldin complex in detail using both physical and genetic interaction clusters. For scenario three, we explore the possible annotation of a protein as a methylmalonyl-CoA epimerase within the VOC superfamily. Cytoscape session files for all three scenarios are provided in the Additional Files section.
Conclusions: The Cytoscape plugin clusterMaker provides a number of clustering algorithms and visualizations that can be used independently or in combination for analysis and visualization of biological data sets, and for confirming or generating hypotheses about biological function. Several of these visualizations and algorithms are only available to Cytoscape users through the clusterMaker plugin. clusterMaker is available via the Cytoscape plugin manager.
Affiliation(s)
- John H Morris
- Department of Pharmaceutical Chemistry, University of California San Francisco, San Francisco, California, USA.
|
20
|
Liao B, Liao B, Lu X, Cao Z. A novel graphical representation of protein sequences and its application. J Comput Chem 2011; 32:2539-44. [PMID: 21638292 DOI: 10.1002/jcc.21833] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.8] [Received: 01/21/2011] [Revised: 03/22/2011] [Accepted: 04/13/2011] [Indexed: 11/08/2022]
Abstract
On the basis of information on the evolution of the 20 amino acids and their physicochemical characteristics, we propose a new two-dimensional (2D) graphical representation of protein sequences in this article. With this representation, we use 2D data to represent three-dimensional information constructed from the amino acids' evolution index, the class of each amino acid based on physicochemical characteristics, and the order in which the amino acids appear in the protein sequence. Then, using the discrete Fourier transform, sequence signals of different lengths can be transformed to the frequency domain, in which the sequences have the same length. A new method is used to analyze protein sequence similarity and to predict the protein structural class. The experiments indicate that our method is effective and useful.
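The step of mapping signals of different lengths to spectra of one common length can be sketched as follows. The abstract does not say how the common length is obtained, so zero-padding (or truncating) to a fixed number of points is an assumption made here, and the function names, the numeric signals and the magnitude-based distance are all hypothetical illustrations rather than the authors' method.

```python
import cmath


def dft_fixed_length(signal, n_points):
    """Naive DFT after zero-padding or truncating the signal to n_points (assumed scheme)."""
    padded = list(signal[:n_points]) + [0.0] * max(0, n_points - len(signal))
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n_points) for t, x in enumerate(padded))
        for k in range(n_points)
    ]


def spectral_distance(signal_a, signal_b, n_points=8):
    """Euclidean distance between magnitude spectra of two signals (illustrative choice)."""
    spec_a = dft_fixed_length(signal_a, n_points)
    spec_b = dft_fixed_length(signal_b, n_points)
    return sum((abs(u) - abs(v)) ** 2 for u, v in zip(spec_a, spec_b)) ** 0.5


# Made-up numeric signals standing in for two encoded protein sequences of different length.
short_signal = [1.0, 2.0, 3.0]
long_signal = [1.0, 2.0, 3.0, 4.0, 5.0]
print(len(dft_fixed_length(short_signal, 8)), len(dft_fixed_length(long_signal, 8)))
print(spectral_distance(short_signal, long_signal))
```

Once both spectra have the same number of coefficients, any vector distance can be applied to compare sequences whose original lengths differ.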
Affiliation(s)
- Bo Liao
- College of Information Science and Technology, Hunan University, Changsha, Hunan, China.
|
21
|
Liao B, Liao B, Sun X, Zeng Q. A novel method for similarity analysis and protein sub-cellular localization prediction. Bioinformatics 2010; 26:2678-83. [PMID: 20826879 DOI: 10.1093/bioinformatics/btq521] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.6] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION: Biological sequence analysis is regarded as an important area of study by many biologists, because sequences contain a large amount of biological information that is helpful for studies of cells, DNA and proteins. Currently, many researchers use methods based on protein sequences for function classification and for the prediction of sub-cellular location, structure and functional sites, including some machine-learning methods. The purpose of this article is to find a new way of sequence analysis that is simpler and more effective. RESULTS: According to the nature of the 64 genetic codes, we propose a simple and intuitive 2D graphical representation of protein sequences. Based on this representation, we give a new Euclidean-distance method to compute the distance between different sequences for the analysis of sequence similarity. This approach retains more sequence information. A typical phylogenetic tree constructed with this method demonstrated the effectiveness of our approach. Finally, we use this sequence-similarity-analysis method to predict protein sub-cellular localization on two commonly used datasets. The results show that the method is reasonable.
Affiliation(s)
- Bo Liao
- School of computer and communication, Hunan University, Changsha, Hunan, China.
|