1
|
Zhao JO, Patel BK, Krishack P, Stutz MR, Pearson SD, Lin J, Lecompte-Osorio PA, Dugan KC, Kim S, Gras N, Pohlman A, Kress JP, Hall JB, Sperling AI, Adegunsoye A, Verhoef PA, Wolfe KS. Identification of Clinically Significant Cytokine Signature Clusters in Patients With Septic Shock. Crit Care Med 2023; 51:e253-e263. [PMID: 37678209 PMCID: PMC10840934 DOI: 10.1097/ccm.0000000000006032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/09/2023]
Abstract
OBJECTIVES To identify cytokine signature clusters in patients with septic shock. DESIGN Prospective observational cohort study. SETTING Single academic center in the United States. PATIENTS Adult (≥ 18 yr old) patients admitted to the medical ICU with septic shock requiring vasoactive medication support. INTERVENTIONS None. MEASUREMENTS AND MAIN RESULTS One hundred fourteen patients with septic shock completed cytokine measurement at time of enrollment (t1) and 24 hours later (t2). Unsupervised random forest analysis of the change in cytokines over time, defined as delta (t2 - t1), identified three clusters with distinct cytokine profiles. Patients in cluster 1 had the lowest initial levels of circulating cytokines that decreased over time. Patients in cluster 2 and cluster 3 had higher initial levels that decreased over time in cluster 2 and increased in cluster 3. Patients in clusters 2 and 3 had higher mortality compared with cluster 1 (clusters 1-3: 11% vs 31%; odds ratio [OR], 3.56 [1.10-14.23] vs 54%; OR, 9.23 [2.89-37.22]). Cluster 3 was independently associated with in-hospital mortality (hazard ratio, 5.24; p = 0.005) in multivariable analysis. There were no significant differences in initial clinical severity scoring or steroid use between the clusters. Analysis of either t1 or t2 cytokine measurements alone or in combination did not reveal clusters with clear clinical significance. CONCLUSIONS Longitudinal measurement of cytokine profiles at initiation of vasoactive medications and 24 hours later revealed three distinct cytokine signature clusters that correlated with clinical outcomes.
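The delta feature described in this abstract (the later measurement minus the enrollment measurement, per cytokine) can be illustrated with a minimal sketch. The patient labels, cytokine names, and values below are hypothetical placeholders, not data from the study, and the clustering step itself is not reproduced here.

```python
# Hypothetical cytokine levels in arbitrary units; NOT data from the study.
t1 = {
    "patient_a": {"IL-6": 120.0, "IL-10": 40.0},
    "patient_b": {"IL-6": 300.0, "IL-10": 80.0},
}
t2 = {
    "patient_a": {"IL-6": 50.0, "IL-10": 30.0},
    "patient_b": {"IL-6": 450.0, "IL-10": 95.0},
}


def cytokine_deltas(first, second):
    """Return delta = second - first for every patient and cytokine."""
    return {
        patient: {
            cytokine: second[patient][cytokine] - level
            for cytokine, level in cytokines.items()
        }
        for patient, cytokines in first.items()
    }


deltas = cytokine_deltas(t1, t2)
# A negative delta means the level fell over 24 hours; a positive delta means it rose.
```

The resulting per-patient delta vectors are the kind of input an unsupervised clustering method would then operate on.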
Affiliation(s)
- Jack O Zhao
  - Pulmonary and Critical Care, University of Chicago Medical Center, Chicago, IL
- Bhakti K Patel
  - Pulmonary and Critical Care, University of Chicago Medical Center, Chicago, IL
- Paulette Krishack
  - Pulmonary and Critical Care, University of Chicago Medical Center, Chicago, IL
- Matthew R Stutz
  - Pulmonary and Critical Care, University of Chicago Medical Center, Chicago, IL
- Steven D Pearson
  - Pulmonary and Critical Care, University of Chicago Medical Center, Chicago, IL
- Julie Lin
  - Pulmonary Medicine, MD Anderson Cancer Center, The University of Texas, Houston, TX
- Seoyoen Kim
  - Pulmonary and Critical Care, University of Chicago Medical Center, Chicago, IL
- Nicole Gras
  - Pulmonary and Critical Care, University of Chicago Medical Center, Chicago, IL
- Anne Pohlman
  - Pulmonary and Critical Care, University of Chicago Medical Center, Chicago, IL
- John P Kress
  - Pulmonary and Critical Care, University of Chicago Medical Center, Chicago, IL
- Jesse B Hall
  - Pulmonary and Critical Care, University of Chicago Medical Center, Chicago, IL
- Anne I Sperling
  - Pulmonary & Critical Care, University of Virginia, Charlottesville, VA
- Ayodeji Adegunsoye
  - Pulmonary and Critical Care, University of Chicago Medical Center, Chicago, IL
- Philip A Verhoef
  - Critical Care Medicine, Hawaii Permanente Medical Group, Honolulu, HI
- Krysta S Wolfe
  - Pulmonary and Critical Care, University of Chicago Medical Center, Chicago, IL
|
2
|
Prognostic and Clinical Value of Cluster Analysis in Idiopathic Pleuroparenchymal Fibroelastosis Phenotypes. J Clin Med 2021; 10:jcm10071498. [PMID: 33916508 PMCID: PMC8038478 DOI: 10.3390/jcm10071498] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 02/27/2021] [Revised: 03/26/2021] [Accepted: 04/01/2021] [Indexed: 02/07/2023]
Abstract
Idiopathic pleuroparenchymal fibroelastosis (PPFE) is a distinctive interstitial pneumonia with upper lobe predominance that shows unique morphological features among idiopathic interstitial pneumonias (IIPs). Affected patients have a variety of clinical presentations with heterogeneous clinical courses. Cluster analysis is a valuable tool for identifying distinct clinical phenotypes under heterogeneous conditions. This study aimed to identify the phenotypes of patients with idiopathic PPFE. Using cluster analysis, novel PPFE phenotypes were identified among subjects from our multicenter cohort, and outcomes were stratified according to phenotypic clusters. Among the subjects with baseline data (N = 84), four clusters were identified. Cluster 1 included younger male subjects with coexisting non-UIP-like patterns. Cluster 2 included elderly female nonsmokers with low body mass index (BMI). Cluster 3 included elderly male smokers with a coexisting IP-like pattern. Cluster 4 included younger male smokers without lower lobe lesions. Patients in cluster 3 had significantly worse survival outcomes than those in clusters 1, 2, and 4 (p < 0.001, p = 0.0041, and p = 0.0155, respectively). Among idiopathic PPFE patients, cluster analysis using baseline characteristics identified four distinct clinical phenotypes that might predict survival outcomes.
|
3
|
Choudhary P, Kumar S, Bachhawat AK, Pandit SB. CSmetaPred: a consensus method for prediction of catalytic residues. BMC Bioinformatics 2017; 18:583. [PMID: 29273005 PMCID: PMC5741869 DOI: 10.1186/s12859-017-1987-z] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 06/19/2017] [Accepted: 12/05/2017] [Indexed: 01/27/2023]
Abstract
Background Knowledge of catalytic residues can play an essential role in elucidating mechanistic details of an enzyme. However, experimental identification of catalytic residues is a tedious and time-consuming task, which can be expedited by computational predictions. Despite significant development in active-site prediction methods, one of the remaining issues is the ranked position of putative catalytic residues among all ranked residues. In order to improve the ranking of catalytic residues and their prediction accuracy, we have developed a meta-approach-based method, CSmetaPred. In this approach, residues are ranked based on the mean of normalized residue scores derived from four well-known catalytic residue predictors. The mean residue score of CSmetaPred is combined with predicted pocket information to improve prediction performance in the meta-predictor CSmetaPred_poc. Results Both meta-predictors are evaluated on two comprehensive benchmark datasets and three legacy datasets using Receiver Operating Characteristic (ROC) and Precision Recall (PR) curves. The visual and quantitative analysis of ROC and PR curves shows that the meta-predictors outperform their constituent methods and that CSmetaPred_poc is the best of the evaluated methods. For instance, on the CSAMAC dataset CSmetaPred_poc (CSmetaPred) achieves the highest Mean Average Specificity (MAS), a scalar measure for the ROC curve, of 0.97 (0.96). Importantly, the median predicted rank of catalytic residues is the lowest (best) for CSmetaPred_poc. Considering residues ranked ≤20 as true positives in binary classification, CSmetaPred_poc achieves a prediction accuracy of 0.94 on the CSAMAC dataset. Moreover, on the same dataset CSmetaPred_poc predicts all catalytic residues within the top 20 ranks for ~73% of enzymes. Furthermore, benchmarking of prediction on comparative modelled structures showed that models result in better prediction than sequence-only predictions. These analyses suggest that CSmetaPred_poc is able to rank putative catalytic residues at lower (better) ranked positions, which can facilitate and expedite their experimental characterization. Conclusions The benchmarking studies showed that employing a meta-approach that combines residue-level scores derived from well-known catalytic residue predictors can improve prediction accuracy as well as provide improved ranked positions of known catalytic residues. Hence, such predictions can assist experimentalists in prioritizing residues for mutational studies in their efforts to characterize catalytic residues. Both meta-predictors are available as a web server at: http://14.139.227.206/csmetapred/. Electronic supplementary material The online version of this article (10.1186/s12859-017-1987-z) contains supplementary material, which is available to authorized users.
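The ranking idea in this abstract (average the normalized per-residue scores from several predictors, then rank residues by that mean) can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say which normalization is used, so min-max scaling is assumed here, and the predictor names and scores are hypothetical.

```python
def min_max(scores):
    """Scale scores to [0, 1]; min-max is an ASSUMED normalization."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def meta_rank(predictor_scores):
    """Return (mean normalized score per residue, rank per residue; 1 = best)."""
    normalized = [min_max(scores) for scores in predictor_scores.values()]
    n_residues = len(normalized[0])
    mean_scores = [
        sum(column[i] for column in normalized) / len(normalized)
        for i in range(n_residues)
    ]
    # Higher mean score means a better (numerically lower) rank.
    order = sorted(range(n_residues), key=lambda i: mean_scores[i], reverse=True)
    ranks = [0] * n_residues
    for rank, residue_index in enumerate(order, start=1):
        ranks[residue_index] = rank
    return mean_scores, ranks


# Hypothetical raw scores for four residues from four placeholder predictors.
scores = {
    "predictor_a": [0.1, 0.9, 0.4, 0.2],
    "predictor_b": [1.0, 8.0, 9.0, 2.0],
    "predictor_c": [5.0, 50.0, 20.0, 10.0],
    "predictor_d": [0.0, 1.0, 0.5, 0.0],
}
mean_scores, ranks = meta_rank(scores)
```

Normalizing before averaging keeps a predictor with a wide raw score range from dominating the mean.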
Affiliation(s)
- Preeti Choudhary
  - Department of Biological Sciences, Indian Institute of Science Education and Research, Mohali, Knowledge City, Sector 81, SAS Nagar, Manuali PO 140306, India
- Shailesh Kumar
  - Department of Biological Sciences, Indian Institute of Science Education and Research, Mohali, Knowledge City, Sector 81, SAS Nagar, Manuali PO 140306, India; Laboratory of Biochemistry and Genetics, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, MD, 20892, USA
- Anand Kumar Bachhawat
  - Department of Biological Sciences, Indian Institute of Science Education and Research, Mohali, Knowledge City, Sector 81, SAS Nagar, Manuali PO 140306, India
- Shashi Bhushan Pandit
  - Department of Biological Sciences, Indian Institute of Science Education and Research, Mohali, Knowledge City, Sector 81, SAS Nagar, Manuali PO 140306, India
|
4
|
Adegunsoye A, Oldham JM, Chung JH, Montner SM, Lee C, Witt LJ, Stahlbaum D, Bermea RS, Chen LW, Hsu S, Husain AN, Noth I, Vij R, Strek ME, Churpek M. Phenotypic Clusters Predict Outcomes in a Longitudinal Interstitial Lung Disease Cohort. Chest 2017; 153:349-360. [PMID: 28964798 DOI: 10.1016/j.chest.2017.09.026] [Citation(s) in RCA: 37] [Impact Index Per Article: 5.3] [Received: 06/02/2017] [Revised: 09/06/2017] [Accepted: 09/11/2017] [Indexed: 02/02/2023]
Abstract
BACKGROUND The current interstitial lung disease (ILD) classification has overlapping clinical presentations and outcomes. Cluster analysis modeling is a valuable tool in identifying distinct clinical phenotypes in heterogeneous diseases. However, this approach has yet to be implemented in ILD. METHODS Using cluster analysis, novel ILD phenotypes were identified among subjects from a longitudinal ILD cohort, and outcomes were stratified according to phenotypic clusters compared with subgroups according to current American Thoracic Society/European Respiratory Society ILD classification criteria. RESULTS Among subjects with complete data for baseline variables (N = 770), four clusters were identified. Cluster 1 (ie, younger white obese female subjects) had the highest baseline FVC and diffusion capacity of the lung for carbon monoxide (Dlco). Cluster 2 (ie, younger African-American female subjects with elevated antinuclear antibody titers) had the lowest baseline FVC. Cluster 3 (ie, elderly white male smokers with coexistent emphysema) had intermediate FVC and Dlco. Cluster 4 (ie, elderly white male smokers with severe honeycombing) had the lowest baseline Dlco. Compared with classification according to ILD subgroup, stratification according to phenotypic clusters was associated with significant differences in monthly FVC decline (Cluster 4, -0.30% vs Cluster 2, 0.01%; P < .0001). Stratification by using clusters also independently predicted progression-free survival (P < .001) and transplant-free survival (P < .001). CONCLUSIONS Among adults with diverse chronic ILDs, cluster analysis using baseline characteristics identified four distinct clinical phenotypes that might better predict meaningful clinical outcomes than current ILD diagnostic criteria.
Affiliation(s)
- Ayodeji Adegunsoye
  - Section of Pulmonary & Critical Care, Department of Medicine, University of Chicago, Chicago, IL
- Justin M Oldham
  - Division of Pulmonary, Critical Care & Sleep Medicine, Department of Medicine, University of California at Davis, Davis, CA
- Cathryn Lee
  - Section of Pulmonary & Critical Care, Department of Medicine, University of Chicago, Chicago, IL
- Leah J Witt
  - Section of Pulmonary & Critical Care, Department of Medicine, University of Chicago, Chicago, IL
- Rene S Bermea
  - Department of Medicine, University of Chicago, Chicago, IL
- Lena W Chen
  - Section of Pulmonary & Critical Care, Department of Medicine, University of Chicago, Chicago, IL
- Scully Hsu
  - Department of Medicine, University of Chicago, Chicago, IL
- Aliya N Husain
  - Department of Pathology, University of Chicago, Chicago, IL
- Imre Noth
  - Section of Pulmonary & Critical Care, Department of Medicine, University of Chicago, Chicago, IL
- Rekha Vij
  - Section of Pulmonary & Critical Care, Department of Medicine, University of Chicago, Chicago, IL
- Mary E Strek
  - Section of Pulmonary & Critical Care, Department of Medicine, University of Chicago, Chicago, IL
- Matthew Churpek
  - Section of Pulmonary & Critical Care, Department of Medicine, University of Chicago, Chicago, IL
|
5
|
Žiarovská J, Záhorský M, Gálová Z, Hricová A. Bioinformatic approach in the identification of arabidopsis gene homologous in amaranthus. POTRAVINARSTVO 2015. [DOI: 10.5219/467] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/18/2022]
Abstract
Bioinformatics offers an efficient tool for molecular genetics applications, and sequence homology search algorithms have become an indispensable part of many research strategies. Appropriate management of known data stored in publicly available databases can be used in research in many ways. Here, we report the identification of the DNA sequence of an RmlC-like cupins superfamily protein, which is known in the Arabidopsis genome, for Amaranthus, a plant species in which this sequence had not yet been sequenced. A BLAST-based approach was used to identify the homologous sequences in the nucleotide database and to find suitable parts of the Arabidopsis sequence where primers can be designed. In total, 64 hits were found in the nucleotide database for the Arabidopsis RmlC-like cupins sequence. Query cover ranged from 10% to 100% between the RmlC-like cupins nucleotide sequence and its homologues stored in public nucleotide databases. The most conserved region was identified for matches that possess nucleotides in the range of 1506 to 1925 bp of the RmlC-like cupins DNA sequence stored in the database. The in silico approach was subsequently used in PCR analysis, where the specificity of the designed primers was confirmed. A unique, 250 bp long fragment was obtained for Amaranthus cruentus and a hybrid Amaranthus hypochondriacus x hybridus in our analysis. Bioinformatics-based analysis of unknown parts of plant genomes, as shown in this study, is a very good additional tool in PCR-based analysis of plant variability. This approach is suitable for plants where concrete genomic data are still missing for the genes of interest, as was demonstrated for Amaranthus.
|
6
|
Sindhu T, Rajamanikandan S, Srinivasan P. Computational Prediction of Phylogenetically Conserved Sequence Motifs for Five Different Candidate Genes in Type II Diabetic Nephropathy. IRANIAN JOURNAL OF PUBLIC HEALTH 2012; 41:24-33. [PMID: 23113206 PMCID: PMC3469011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/11/2012] [Accepted: 04/24/2012] [Indexed: 11/16/2022]
Abstract
BACKGROUND Computational identification of phylogenetic motifs helps in understanding known functional features, including catalytic sites, substrate binding epitopes, and protein-protein interfaces. Furthermore, they are strongly conserved among orthologs, indicating their evolutionary importance. The study aimed to analyze five candidate genes involved in type II diabetic nephropathy and to predict phylogenetic motifs from their corresponding orthologous protein sequences. METHODS AKR1B1, APOE, ENPP1, ELMO1 and IGFBP1 are genes that have been identified as important targets for type II diabetic nephropathy through experimental studies. Their corresponding protein sequences, structures, and orthologous sequences were retrieved from the UniprotKB, PDB, and PHOG databases, respectively. Multiple sequence alignments were constructed using ClustalW and phylogenetic motifs were identified using MINER. The occurrence of amino acids in the obtained phylogenetic motifs was generated using WebLogo and false positive expectations were calculated against phylogenetic similarity. RESULTS In total, 17 phylogenetic motifs (PMs) were identified from the five proteins. Residues such as glycine, leucine, tryptophan, and aspartic acid were found at appreciable frequency, whereas arginine was identified in all the predicted PMs. The result implies that these residues can be important to the functional and structural roles of the proteins, and the calculated false positive expectations imply that they were generally conserved in the traditional sense. CONCLUSION The prediction of phylogenetic motifs is an accurate method for detecting functionally important conserved residues. The conserved motifs can be used as potential drug targets for type II diabetic nephropathy.
Affiliation(s)
- P Srinivasan
  - Corresponding Author: Tel: +91-4565-230725, E-mail address:
|
7
|
La D, Kihara D. A novel method for protein-protein interaction site prediction using phylogenetic substitution models. Proteins 2011; 80:126-41. [PMID: 21989996 DOI: 10.1002/prot.23169] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.8] [Received: 01/08/2011] [Revised: 07/07/2011] [Accepted: 08/17/2011] [Indexed: 11/10/2022]
Abstract
Protein-protein binding events mediate many critical biological functions in the cell. Typically, functionally important sites in proteins can be well identified by considering sequence conservation. However, protein-protein interaction sites exhibit higher sequence variation than other functional regions, such as catalytic sites of enzymes. Consequently, the mutational behavior leading to weak sequence conservation poses significant challenges to the protein-protein interaction site prediction. Here, we present a phylogenetic framework to capture critical sequence variations that favor the selection of residues essential for protein-protein binding. Through the comprehensive analysis of diverse protein families, we show that protein binding interfaces exhibit distinct amino acid substitution as compared with other surface residues. On the basis of this analysis, we have developed a novel method, BindML, which utilizes the substitution models to predict protein-protein binding sites of protein with unknown interacting partners. BindML estimates the likelihood that a phylogenetic tree of a local surface region in a query protein structure follows the substitution patterns of protein binding interface and nonbinding surfaces. BindML is shown to perform well compared to alternative methods for protein binding interface prediction. The methodology developed in this study is very versatile in the sense that it can be generally applied for predicting other types of functional sites, such as DNA, RNA, and membrane binding sites in proteins.
Affiliation(s)
- David La
  - Department of Biological Sciences, College of Science, Purdue University, West Lafayette, Indiana 47907, USA
|
8
|
Tuominen LK, Johnson VE, Tsai CJ. Differential phylogenetic expansions in BAHD acyltransferases across five angiosperm taxa and evidence of divergent expression among Populus paralogues. BMC Genomics 2011; 12:236. [PMID: 21569431 PMCID: PMC3123328 DOI: 10.1186/1471-2164-12-236] [Citation(s) in RCA: 111] [Impact Index Per Article: 8.5] [Received: 11/09/2010] [Accepted: 05/12/2011] [Indexed: 11/26/2022]
Abstract
Background BAHD acyltransferases are involved in the synthesis and elaboration of a wide variety of secondary metabolites. Previous research has shown that characterized proteins from this family fall broadly into five major clades and contain two conserved protein motifs. Here, we aimed to expand the understanding of BAHD acyltransferase diversity in plants through genome-wide analysis across five angiosperm taxa. We focus particularly on Populus, a woody perennial known to produce an abundance of secondary metabolites. Results Phylogenetic analysis of putative BAHD acyltransferase sequences from Arabidopsis, Medicago, Oryza, Populus, and Vitis, along with previously characterized proteins, supported a refined grouping of eight major clades for this family. Taxon-specific clustering of many BAHD family members appears pervasive in angiosperms. We identified two new multi-clade motifs and numerous clade-specific motifs, several of which have been implicated in BAHD function by previous structural and mutagenesis research. Gene duplication and expression data for Populus-dominated subclades revealed that several paralogous BAHD members in this genus might have already undergone functional divergence. Conclusions Differential, taxon-specific BAHD family expansion via gene duplication could be an evolutionary process contributing to metabolic diversity across plant taxa. Gene expression divergence among some Populus paralogues highlights possible distinctions between their biochemical and physiological functions. The newly discovered motifs, especially the clade-specific motifs, should facilitate future functional study of substrate and donor specificity among BAHD enzymes.
Affiliation(s)
- Lindsey K Tuominen
  - Warnell School of Forestry and Natural Resources, University of Georgia, Athens, GA 30602-2152, USA
|
9
|
Tungtur S, Parente DJ, Swint-Kruse L. Functionally important positions can comprise the majority of a protein's architecture. Proteins 2011; 79:1589-608. [PMID: 21374721 DOI: 10.1002/prot.22985] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.1] [Received: 06/14/2010] [Revised: 12/08/2010] [Accepted: 12/15/2010] [Indexed: 01/13/2023]
Abstract
Concomitant with the genomic era, many bioinformatics programs have been developed to identify functionally important positions from sequence alignments of protein families. To evaluate these analyses, many have used the LacI/GalR family and determined whether positions predicted to be "important" are validated by published experiments. However, we previously noted that predictions do not identify all of the experimentally important positions present in the linker regions of these homologs. In an attempt to reconcile these differences, we corrected and expanded the LacI/GalR sequence set commonly used in sequence/function analyses. Next, a variety of analyses were carried out (1) for the entire LacI/GalR sequence set and (2) for a subset of homologs with functionally-important "YxPxxxAxxL" motifs in their linkers. This strategy was devised to determine whether predictions could be improved by knowledge-based sequence sorting and, for some analyses, did increase the number of linker positions identified. However, two functionally important linker positions were not reliably identified by any analysis. Finally, we compared the new predictions to all known experimental data for E. coli LacI and three homologous linkers. From these, we estimate that >50% of positions are important to the functions of the LacI/GalR homologs. In corollary, neutral positions might occur less frequently and might be easier to detect in sequence analyses. Although analyses have successfully guided mutations that partially exchange protein functions, a better experimental understanding of the sequence/function relationships in protein families would be helpful for uncovering the remaining rules used by nature to evolve new protein functions.
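Selecting the subset of homologs that carry a "YxPxxxAxxL" motif can be illustrated with a short sketch. It assumes the common convention that a lowercase "x" stands for any residue; the sequence names and sequences are hypothetical, and the code searches the whole sequence rather than only the linker region the authors refer to.

```python
import re


def motif_to_regex(motif):
    """Compile a motif string into a regex, treating lowercase 'x' as any residue."""
    pattern = "".join("." if ch == "x" else re.escape(ch) for ch in motif)
    return re.compile(pattern)


LINKER_MOTIF = motif_to_regex("YxPxxxAxxL")


def has_motif(sequence):
    """True if the motif occurs anywhere in the sequence."""
    return LINKER_MOTIF.search(sequence) is not None


# Hypothetical sequences; homolog_1 contains the motif, homolog_2 does not.
sequences = {
    "homolog_1": "MKTAYAPSRVARSLKGG",
    "homolog_2": "MKTAFAPSRVARSLKGG",
}
motif_subset = [name for name, seq in sequences.items() if has_motif(seq)]
```

Restricting the search to linker coordinates would require per-sequence region annotations, which are not available from the abstract.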
Affiliation(s)
- Sudheer Tungtur
  - Department of Biochemistry and Molecular Biology, The University of Kansas Medical Center, MSN 3030, Kansas City, Kansas 66160, USA
|
10
|
Kc DB, Livesay DR. Topology improves phylogenetic motif functional site predictions. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2011; 8:226-233. [PMID: 21071810 DOI: 10.1109/tcbb.2009.60] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 05/30/2023]
Abstract
Prediction of protein functional sites from sequence-derived data remains an open bioinformatics problem. We have developed a phylogenetic motif (PM) functional site prediction approach that identifies functional sites from alignment fragments that parallel the evolutionary patterns of the family. In our approach, PMs are identified by comparing tree topologies of each alignment fragment to that of the complete phylogeny. Herein, we bypass the phylogenetic reconstruction step and identify PMs directly from distance matrix comparisons. In order to optimize the new algorithm, we consider three different distance matrices and 13 different matrix similarity scores. We assess the performance of the various approaches on a structurally nonredundant data set that includes three types of functional site definitions. Without exception, the predictive power of the original approach outperforms the distance matrix variants. While the distance matrix methods fail to improve upon the original approach, our results are important because they clearly demonstrate that the improved predictive power is based on the topological comparisons, meaning that phylogenetic trees are a straightforward, yet powerful way to improve functional site prediction accuracy. While complementary studies have shown that topology improves predictions of protein-protein interactions, this report represents the first demonstration that trees improve functional site predictions as well.
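The distance-matrix variant described in this abstract (scoring each alignment fragment by how closely its pairwise distances track those of the full alignment) can be sketched in a few lines. The choices below are assumptions made for illustration only: p-distance stands in for the three distance matrices the authors consider, Pearson correlation stands in for their 13 similarity scores, and the toy alignment is hypothetical.

```python
from itertools import combinations
from math import sqrt


def p_distance(a, b):
    """Fraction of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def pairwise_distances(seqs):
    """Upper triangle of the distance matrix, flattened in a fixed pair order."""
    return [p_distance(a, b) for a, b in combinations(seqs, 2)]


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input has no variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def window_scores(alignment, width):
    """Similarity of each sliding window's distances to the full alignment's."""
    full = pairwise_distances(alignment)
    scores = []
    for start in range(len(alignment[0]) - width + 1):
        fragment = [seq[start:start + width] for seq in alignment]
        scores.append(pearson(full, pairwise_distances(fragment)))
    return scores


# Hypothetical gap-free toy alignment of four sequences.
alignment = ["ACDEFGHI", "ACDEFGHK", "ACNEYGHK", "TCNEYGRK"]
scores = window_scores(alignment, width=4)
```

Windows with the highest scores would be the candidate motifs under this simplified scheme; the abstract reports that such matrix-based variants did not outperform the original tree-topology comparison.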
Affiliation(s)
- Dukka B Kc
  - Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, 9201 University City Blvd., Charlotte, NC 28223, USA
|
11
|
Dukka BKC, Livesay DR. Improving position-specific predictions of protein functional sites using phylogenetic motifs. Bioinformatics 2008; 24:2308-16. [PMID: 18723520 DOI: 10.1093/bioinformatics/btn454] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Indexed: 11/13/2022]
Abstract
MOTIVATION Accurate computational prediction of protein functional sites is critical to maximizing the utility of recent high-throughput sequencing efforts. Among the available approaches, position-specific conservation scores remain among the most popular due to their accuracy and ease of computation. Unfortunately, high false positive rates remain a limiting factor. Using phylogenetic motifs (PMs), we have developed two combined (conservation + PMs) prediction schemes that significantly improve prediction accuracy. RESULTS Our first approach, called position-specific MINER (psMINER), rank orders alignment columns by conservation. Subsequently, positions that are also not identified as PMs are excluded from the prediction set. This approach improves prediction accuracy, in a statistically significant way, compared to the underlying conservation scores. Increased accuracy is a general result, meaning improvement is observed over several different conservation scores that span a continuum of complexity. In addition, a hybrid MINER (hMINER) that quantitatively considers both scoring regimes provides further improvement. More importantly, it provides critical insight into the relative importance of phylogeny versus alignment conservation. Both methods outperform other common prediction algorithms that also utilize phylogenetic concepts. Finally, we demonstrate that the presented results are critically sensitive to functional site definition, thus highlighting the need for more complete benchmarks within the prediction community.
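The first scheme in this abstract (rank alignment columns by conservation, then exclude positions that are not also phylogenetic motif positions) can be sketched as below. The abstract does not specify how the prediction set is sized, so taking the `top_n` most conserved columns before filtering is an assumption, and the function name, scores, and PM positions are hypothetical.

```python
def filter_by_motifs(conservation, pm_positions, top_n):
    """Keep the top_n most conserved columns that are also PM positions.

    conservation: per-column conservation scores (higher = more conserved).
    pm_positions: set of column indices identified as phylogenetic motifs.
    top_n: ASSUMED cutoff on the conservation ranking before filtering.
    """
    ranked = sorted(
        range(len(conservation)), key=lambda i: conservation[i], reverse=True
    )
    return [i for i in ranked[:top_n] if i in pm_positions]


# Hypothetical conservation scores for six alignment columns.
conservation = [0.2, 0.9, 0.8, 0.1, 0.95, 0.6]
pm_positions = {1, 2, 3}
predicted = filter_by_motifs(conservation, pm_positions, top_n=4)
```

In this toy case column 4 is the most conserved but is dropped because it lies outside every PM, which is the kind of false-positive removal the abstract describes.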
Affiliation(s)
- Bahadur K C Dukka
  - Department of Computer Science and Bioinformatics Research Center, University of North Carolina at Charlotte, Charlotte, NC 28223, USA
|
12
|
Manning JR, Jefferson ER, Barton GJ. The contrasting properties of conservation and correlated phylogeny in protein functional residue prediction. BMC Bioinformatics 2008; 9:51. [PMID: 18221517 PMCID: PMC2267696 DOI: 10.1186/1471-2105-9-51] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.1] [Received: 09/05/2007] [Accepted: 01/25/2008] [Indexed: 11/21/2022]
Abstract
Background Amino acids responsible for structure, core function or specificity may be inferred from multiple protein sequence alignments where a limited set of residue types are tolerated. The rise in available protein sequences continues to increase the power of techniques based on this principle. Results A new algorithm, SMERFS, for predicting protein functional sites from multiple sequence alignments was compared to 14 conservation measures and to the MINER algorithm. Validation was performed on an automatically generated dataset of 1457 families derived from the protein interactions database SNAPPI-DB, and a smaller manually curated set of 148 families. The best performing measure overall was Williamson property entropy, with ROC0.1 scores of 0.0087 and 0.0114 for domain and small molecule contact prediction, respectively. The Lancet method performed worse than random on protein-protein interaction site prediction (ROC0.1 score of 0.0008). The SMERFS algorithm gave similar accuracy to the phylogenetic tree-based MINER algorithm but was superior to Williamson in prediction of non-catalytic transient complex interfaces. SMERFS predicts sites that are significantly more solvent accessible compared to Williamson. Conclusion Williamson property entropy is the best performing of 14 conservation measures examined. The difference in performance of SMERFS relative to Williamson in manually defined complexes was dependent on complex type. The best choice of analysis method is therefore dependent on the system of interest. Additional computation employed by MINER in calculation of phylogenetic trees did not produce improved results over SMERFS. SMERFS performance was improved by use of windows over alignment columns, illustrating the necessity of considering the local environment of positions when assessing their functional significance.
|
13
|
Abstract
Sequence motif discovery algorithms are an important part of the computational biologist's toolkit. The purpose of motif discovery is to discover patterns in biopolymer (nucleotide or protein) sequences in order to better understand the structure and function of the molecules the sequences represent. This chapter provides an overview of the use of sequence motif discovery in biology and a general guide to the use of motif discovery algorithms. The chapter discusses the types of biological features that DNA and protein motifs can represent and their usefulness. It also defines what sequence motifs are, how they are represented, and general techniques for discovering them. The primary focus is on one aspect of motif discovery: discovering motifs in a set of unaligned DNA or protein sequences. Also presented are steps useful for checking the biological validity and investigating the function of sequence motifs using methods such as motif scanning--searching for matches to motifs in a given sequence or a database of sequences. A discussion of some limitations of motif discovery concludes the chapter.
Affiliation(s)
- Timothy L Bailey
- ARC Centre of Excellence in Bioinformatics, and Institute for Molecular Bioscience, The University of Queensland, Brisbane, Queensland, Australia
|
14
|
Livesay DR, Kidd PD, Eskandari S, Roshan U. Assessing the ability of sequence-based methods to provide functional insight within membrane integral proteins: a case study analyzing the neurotransmitter/Na+ symporter family. BMC Bioinformatics 2007; 8:397. [PMID: 17941992 PMCID: PMC2194793 DOI: 10.1186/1471-2105-8-397] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Received: 04/27/2007] [Accepted: 10/17/2007] [Indexed: 01/09/2023]
Abstract
Background Efforts to predict functional sites from globular proteins are increasingly common; however, the most successful of these methods generally require structural insight. Unfortunately, despite several recent technological advances, structural coverage of membrane integral proteins continues to be sparse. Consequently, sequence-based methods represent an important alternative to illuminate functional roles. In this report, we critically examine the ability of several computational methods to provide functional insight within two specific areas. First, can phylogenomic methods accurately describe the functional diversity across a membrane integral protein family? And second, can sequence-based strategies accurately predict key functional sites? Due to the presence of a recently solved structure and a vast amount of experimental mutagenesis data, the neurotransmitter/Na+ symporter (NSS) family is an ideal model system to assess the quality of our predictions. Results The raw NSS sequence dataset contains 181 sequences, which have been aligned by various methods. The resultant phylogenetic trees always contain six major subfamilies that are consistent with the functional diversity across the family. Moreover, in well-represented subfamilies, phylogenetic clustering recapitulates several nuanced functional distinctions. Functional sites are predicted using six different methods (phylogenetic motifs, two methods that identify subfamily-specific positions, and three different conservation scores). A canonical set of 34 functional sites identified by Yamashita et al. within the recently solved LeuTAa structure is used to assess the quality of the predictions, most of which are predicted by the bioinformatic methods. Remarkably, the importance of these sites is largely confirmed by experimental mutagenesis. Furthermore, the collective set of functional site predictions qualitatively clusters along the proposed transport pathway, further demonstrating their utility.
Interestingly, the various prediction schemes provide results that are predominantly orthogonal to each other. However, when the methods do provide overlapping results, specificity is shown to increase dramatically (e.g., sites predicted by any three methods have both accuracy and coverage greater than 50%). Conclusion The results presented herein clearly establish the viability of sequence-based bioinformatic strategies to provide functional insight within the NSS family. As such, we expect similar bioinformatic investigations will streamline functional investigations within membrane integral families in the absence of structure.
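The overlap idea in this abstract (keeping only sites predicted by several methods) can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `consensus_sites`, the method labels, and the residue numbers are hypothetical, and the threshold of three methods simply mirrors the example given in the abstract.

```python
from collections import Counter


def consensus_sites(predictions, min_methods=3):
    """Return the sites predicted by at least `min_methods` methods.

    `predictions` maps a method name to the collection of alignment
    positions that the method flags as functional (hypothetical data).
    """
    # Count each site once per method, even if a method lists it twice.
    votes = Counter(site for sites in predictions.values() for site in set(sites))
    return sorted(site for site, count in votes.items() if count >= min_methods)


# Toy, made-up predictions from four hypothetical methods.
toy_predictions = {
    "method_a": {10, 22, 35},
    "method_b": {22, 35, 40},
    "method_c": {22, 51},
    "method_d": {35, 60},
}

consensus = consensus_sites(toy_predictions)
print(consensus)
```

Raising `min_methods` trades coverage for specificity, which is the behaviour the abstract reports for overlapping predictions.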
Affiliation(s)
- Dennis R Livesay
- Department of Computer Science and Bioinformatics Research Center, University of North Carolina at Charlotte, Charlotte, NC 28262, USA.
|
15
|
Mistry J, Bateman A, Finn RD. Predicting active site residue annotations in the Pfam database. BMC Bioinformatics 2007; 8:298. [PMID: 17688688 PMCID: PMC2025603 DOI: 10.1186/1471-2105-8-298] [Citation(s) in RCA: 166] [Impact Index Per Article: 9.8] [Received: 04/05/2007] [Accepted: 08/09/2007] [Indexed: 12/03/2022]
Abstract
Background Approximately 5% of Pfam families are enzymatic, but only a small fraction of the sequences within these families (<0.5%) have had the residues responsible for catalysis determined. To increase the active site annotations in the Pfam database, we have developed a strict set of rules, chosen to reduce the rate of false positives, which enable the transfer of experimentally determined active site residue data to other sequences within the same Pfam family. Description We have created a large database of predicted active site residues. On comparing our active site predictions to those found in UniProtKB, Catalytic Site Atlas, PROSITE and MEROPS, we find that we make many novel predictions. On investigating the small subset of predictions made by these databases that are not predicted by us, we found these sequences did not meet our strict criteria for prediction. We assessed the sensitivity and specificity of our methodology and estimate that only 3% of our predicted sequences are false positives. Conclusion We have predicted 606110 active site residues, of which 94% are not found in UniProtKB, and have increased the active site annotations in Pfam by more than 200-fold. Although implemented for Pfam, the tool we have developed for transferring the data can be applied to any alignment with associated experimental active site data and is available for download. Our active site predictions are re-calculated at each Pfam release to ensure they are comprehensive and up to date. They provide one of the largest available databases of active site annotation.
Affiliation(s)
- Jaina Mistry
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
- Alex Bateman
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
- Robert D Finn
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
|
16
|
How accurate and statistically robust are catalytic site predictions based on closeness centrality? BMC Bioinformatics 2007; 8:153. [PMID: 17498304 PMCID: PMC1876251 DOI: 10.1186/1471-2105-8-153] [Citation(s) in RCA: 52] [Impact Index Per Article: 3.1] [Received: 12/11/2006] [Accepted: 05/11/2007] [Indexed: 11/25/2022]
Abstract
Background We examine the accuracy of enzyme catalytic residue predictions from a network representation of protein structure. In this model, amino acid α-carbons specify vertices within a graph and edges connect vertices that are proximal in structure. Closeness centrality, which has shown promise in previous investigations, is used to identify important positions within the network. Closeness centrality, a global measure of network centrality, is calculated as the reciprocal of the average distance between vertex i and all other vertices. Results We benchmark the approach against 283 structurally unique proteins within the Catalytic Site Atlas. Our results, which are in line with previous investigations of smaller datasets, indicate closeness centrality predictions are statistically significant. However, unlike previous approaches, we specifically focus on residues with the very best scores. Over the top five closeness centrality scores, we observe an average true to false positive rate ratio of 6.8 to 1. As demonstrated previously, adding a solvent accessibility filter significantly improves predictive power; the average ratio is increased to 15.3 to 1. We also demonstrate (for the first time) that filtering the predictions by residue identity improves the results even more than accessibility filtering. Here, we simply eliminate residues with physicochemical properties unlikely to be compatible with catalytic requirements from consideration. Residue identity filtering improves the average true to false positive rate ratio to 26.3 to 1. Combining the two filters together has little effect on the results. Calculated p-values for the three prediction schemes range from 2.7E-9 to less than 8.8E-134. Finally, the sensitivity of the predictions to structure choice and slight perturbations is examined. Conclusion Our results resolutely confirm that closeness centrality is a viable prediction scheme whose predictions are statistically significant.
Simple filtering schemes substantially improve the method's predictive power. Moreover, no clear effect on performance is observed when comparing ligated and unligated structures. Similarly, the CC prediction results are robust to slight structural perturbations from molecular dynamics simulation.
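The closeness centrality definition quoted in this abstract (the reciprocal of the average distance between vertex i and all other vertices) can be sketched in a few lines of Python. This is an illustrative sketch only, under stated assumptions: the graph is a toy five-vertex path rather than a real α-carbon contact network, edges are unweighted so distance is a hop count found by breadth-first search, and the names `closeness_centrality` and `toy_graph` are hypothetical. How the paper builds edges from structure (for example, its distance cutoff) is not given in the abstract and is not modelled here.

```python
from collections import deque


def closeness_centrality(adjacency, source):
    """Reciprocal of the mean shortest-path distance from `source`
    to every other vertex of a connected, unweighted graph."""
    distances = {source: 0}
    queue = deque([source])
    while queue:  # breadth-first search gives hop-count distances
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)

    others = [d for vertex, d in distances.items() if vertex != source]
    if not others or len(others) != len(adjacency) - 1:
        raise ValueError("graph must be connected and have at least two vertices")
    return len(others) / sum(others)


# Toy path graph 0-1-2-3-4 standing in for a residue contact network.
toy_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

scores = {vertex: closeness_centrality(toy_graph, vertex) for vertex in toy_graph}
top_vertex = max(scores, key=scores.get)
print(scores, top_vertex)
```

In the toy graph the middle vertex scores highest, matching the intuition that centrally placed residues are the ones the method ranks first.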
|
17
|
Abstract
Sequence motif discovery algorithms are an important part of the computational biologist's toolkit. The purpose of motif discovery is to discover patterns in biopolymer (nucleotide or protein) sequences to better understand the structure and function of the molecules the sequences represent. This chapter provides an overview of the use of sequence motif discovery in biology and a general guide to the use of motif discovery algorithms. This chapter examines the types of biological features that DNA and protein motifs can represent and their usefulness. This chapter also defines what sequence motifs are, how they are represented, and general techniques for discovering them. The primary focus of the chapter is on one aspect of motif discovery: discovering motifs in a set of unaligned DNA or protein sequences. This chapter also provides the steps useful for checking the biological validity and investigating the function of sequence motifs using methods such as motif scanning, that is, searching for matches to motifs in a given sequence or a database of sequences. A discussion of some limitations of motif discovery concludes the chapter.
|
18
|
Incorporating background frequency improves entropy-based residue conservation measures. BMC Bioinformatics 2006. [PMID: 16916457 DOI: 10.1186/1471-2105-7-385] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/10/2022]
Abstract
BACKGROUND Several entropy-based methods have been developed for scoring sequence conservation in protein multiple sequence alignments. High scoring amino acid positions may correlate with structurally or functionally important residues. However, amino acid background frequencies are usually not taken into account in these entropy-based scoring schemes. RESULTS We demonstrate that using a relative entropy measure that incorporates amino acid background frequency results in improved performance in identifying functional sites from protein multiple sequence alignments. CONCLUSION Our results suggest that the application of appropriate background frequency information may lead to more biologically relevant results in many areas of bioinformatics.
|
19
|
Wang K, Samudrala R. Incorporating background frequency improves entropy-based residue conservation measures. BMC Bioinformatics 2006; 7:385. [PMID: 16916457 PMCID: PMC1562451 DOI: 10.1186/1471-2105-7-385] [Citation(s) in RCA: 71] [Impact Index Per Article: 3.9] [Received: 06/14/2006] [Accepted: 08/17/2006] [Indexed: 11/10/2022]
Abstract
BACKGROUND Several entropy-based methods have been developed for scoring sequence conservation in protein multiple sequence alignments. High scoring amino acid positions may correlate with structurally or functionally important residues. However, amino acid background frequencies are usually not taken into account in these entropy-based scoring schemes. RESULTS We demonstrate that using a relative entropy measure that incorporates amino acid background frequency results in improved performance in identifying functional sites from protein multiple sequence alignments. CONCLUSION Our results suggest that the application of appropriate background frequency information may lead to more biologically relevant results in many areas of bioinformatics.
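As a rough illustration of what a relative entropy conservation score with background frequencies looks like, the Python sketch below scores one alignment column as the Kullback-Leibler divergence between the observed residue distribution and a background distribution. It is a sketch under assumptions, not the authors' implementation: the three-letter alphabet and its background frequencies are invented for readability, the small pseudocount is an assumption added to avoid taking the logarithm of zero, and the function name `relative_entropy` is hypothetical.

```python
import math
from collections import Counter


def relative_entropy(column, background, pseudocount=1e-6):
    """Relative entropy sum_a p(a) * ln(p(a) / q(a)) for one alignment column.

    `column` is a string of residues; characters absent from `background`
    (for example gap symbols) are ignored. `background` maps each residue
    to its background frequency q(a), which should sum to 1.
    """
    residues = [r for r in column if r in background]
    if not residues:
        raise ValueError("column contains no residues from the background alphabet")

    counts = Counter(residues)
    denominator = len(residues) + pseudocount * len(background)
    score = 0.0
    for residue, q in background.items():
        p = (counts.get(residue, 0) + pseudocount) / denominator
        score += p * math.log(p / q)
    return score


# Hypothetical toy background: A is common, W is rare.
toy_background = {"A": 0.5, "L": 0.4, "W": 0.1}

score_rare = relative_entropy("WWWWWWWWWW", toy_background)
score_common = relative_entropy("AAAAAAAAAA", toy_background)
score_background_like = relative_entropy("AAAAALLLLW", toy_background)
print(score_rare, score_common, score_background_like)
```

Both fully conserved columns would receive the same plain Shannon entropy, but here the column conserving the rare residue scores higher, and a column that mirrors the background scores close to zero. That difference is the effect of taking background frequency into account.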
Affiliation(s)
- Kai Wang
- Computational Genomics Group, Department of Microbiology, University of Washington, USA
- Ram Samudrala
- Computational Genomics Group, Department of Microbiology, University of Washington, USA
|