51
Malkhed V, Mustyala KK, Potlapally SR, Vuruputuri U. Identification of novel leads applying in silico studies for Mycobacterium multidrug resistant (MMR) protein. J Biomol Struct Dyn 2013; 32:1889-906. [DOI: 10.1080/07391102.2013.842185] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Indexed: 10/26/2022]
52
Lapadula WJ, Sánchez Puerta MV, Juri Ayub M. Revising the taxonomic distribution, origin and evolution of ribosome inactivating protein genes. PLoS One 2013; 8:e72825. [PMID: 24039805 PMCID: PMC3764214 DOI: 10.1371/journal.pone.0072825] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.8] [Received: 01/21/2013] [Accepted: 07/13/2013] [Indexed: 11/24/2022]
Abstract
Ribosome inactivating proteins (RIPs) are enzymes that depurinate a specific adenine residue in the alpha-sarcin-ricin loop of the large ribosomal RNA; ricin and Shiga toxins are the most renowned examples. They are widely distributed in plants and their presence has also been confirmed in a few bacterial species. According to this taxonomic distribution, the current model about the origin and evolution of RIP genes postulates that an ancestral RIP domain originated in flowering plants and was later acquired by some bacteria via horizontal gene transfer. Here, we unequivocally detected the presence of RIP genes in fungi and metazoa. These findings, along with sequence and phylogenetic analyses, led us to propose an alternative, more parsimonious, hypothesis about the origin and evolutionary history of the RIP domain, where several paralogous RIP genes were already present before the three domains of life evolved. This model is in agreement with the current idea of the Last Universal Common Ancestor (LUCA) as a complex, genetically redundant organism. Differential loss of paralogous genes in descendants of LUCA, rather than multiple horizontal gene transfer events, could account for the complex pattern of RIP genes across extant species, as has been observed for other genes.
Affiliation(s)
- Walter J. Lapadula
- Área de Biología Molecular, Departamento de Bioquímica y Ciencias Biológicas, UNSL and Instituto Multidisciplinario de Investigaciones Biológicas de San Luis (IMIBIO-SL-CONICET), San Luis, Argentina
- María Virginia Sánchez Puerta
- Instituto de Ciencias Básicas, IBAM-CONICET and Facultad de Ciencias Agrarias, Universidad Nacional de Cuyo, Mendoza, Argentina
- Maximiliano Juri Ayub
- Área de Biología Molecular, Departamento de Bioquímica y Ciencias Biológicas, UNSL and Instituto Multidisciplinario de Investigaciones Biológicas de San Luis (IMIBIO-SL-CONICET), San Luis, Argentina
53
Belenki L, Sterzik V, Bohnert M. Similarity analysis of spectra obtained via reflectance spectrometry in legal medicine. Journal of Laboratory Automation 2013; 19:110-8. [PMID: 23897013 DOI: 10.1177/2211068213496089] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/15/2022]
Abstract
In the present study, a series of reflectance spectra of postmortem lividity, pallor, and putrefaction-affected skin for 195 investigated cases in the course of cooling down the corpse has been collected. The reflectance spectrometric measurements were stored together with their respective metadata in a MySQL database. The latter has been managed via a scientific information repository. We propose similarity measures and a criterion of similarity that capture similar spectra recorded on corpse skin. We systematically clustered reflectance spectra from the database as well as their metadata, such as case number, age, sex, skin temperature, duration of cooling, and postmortem time, with respect to the given criterion of similarity. Altogether, more than 500 reflectance spectra have been compared pairwise. The measures that have been used to compare a pair of reflectance curve samples include the Euclidean distance between curves and the Euclidean distance between derivatives of the functions represented by the reflectance curves at the same wavelengths in the spectral range of visible light between 380 and 750 nm. For each case, using the recorded reflectance curves and the similarity criterion, the postmortem time interval during which a characteristic change in the shape of reflectance spectrum takes place is estimated. The latter is carried out via a software package composed of Java, Python, and MATLAB scripts that query the MySQL database. We show that in legal medicine, matching and clustering of reflectance curves obtained by means of reflectance spectrometry with respect to a given criterion of similarity can be used to estimate the postmortem interval.
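The two similarity measures named in this abstract (Euclidean distance between curves, and between their derivatives at the same wavelengths) can be illustrated with a minimal sketch. The function names, the forward-difference approximation of the derivative, and the toy data are assumptions made for illustration; the abstract does not say how the authors computed derivatives or combined the two measures into their similarity criterion.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equally long sequences of numbers."""
    if len(a) != len(b):
        raise ValueError("curves must be sampled at the same wavelengths")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def finite_difference(values, wavelengths):
    """Forward-difference slope between neighbouring samples (an assumed
    stand-in for the derivative of the reflectance curve)."""
    return [
        (values[i + 1] - values[i]) / (wavelengths[i + 1] - wavelengths[i])
        for i in range(len(values) - 1)
    ]


def spectrum_distances(wavelengths, curve_a, curve_b):
    """Return (distance between curves, distance between their derivatives)."""
    curve_dist = euclidean(curve_a, curve_b)
    deriv_dist = euclidean(
        finite_difference(curve_a, wavelengths),
        finite_difference(curve_b, wavelengths),
    )
    return curve_dist, deriv_dist


# Toy example: the second curve is the first one shifted upward by 0.1,
# so the curves differ but their slopes do not.
wavelengths = [380, 390, 400]
curve_a = [0.1, 0.2, 0.3]
curve_b = [0.2, 0.3, 0.4]
curve_dist, deriv_dist = spectrum_distances(wavelengths, curve_a, curve_b)
```

In this toy case the curve distance is nonzero while the derivative distance is essentially zero, which shows why the two measures carry different information about the shape of a spectrum.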
Affiliation(s)
- Liudmila Belenki
- Materials Research Center Freiburg, University of Freiburg, Freiburg, Germany
54
Mishra S, Saxena A, Sangwan RS. Fundamentals of Homology Modeling Steps and Comparison among Important Bioinformatics Tools: An Overview. ACTA ACUST UNITED AC 2013. [DOI: 10.17311/sciintl.2013.237.252] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.4] [Indexed: 11/03/2022]
55
Gonzalez MW, Spouge JL. Domain analysis of symbionts and hosts (DASH) in a genome-wide survey of pathogenic human viruses. BMC Res Notes 2013; 6:209. [PMID: 23706066 PMCID: PMC3672079 DOI: 10.1186/1756-0500-6-209] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/11/2012] [Accepted: 05/17/2013] [Indexed: 11/10/2022]
Abstract
BACKGROUND In the coevolution of viruses and their hosts, viruses often capture host genes, gaining advantageous functions (e.g. immune system control). Identifying functional similarities shared by viruses and their hosts can help decipher mechanisms of pathogenesis and accelerate virus-targeted drug and vaccine development. Cellular homologs in viruses are usually documented using pairwise-sequence comparison methods. Yet, pairwise-sequence searches have limited sensitivity resulting in poor identification of divergent homologies. RESULTS Methods based on profiles from multiple sequences provide a more sensitive alternative to identify similarities in host-pathogen systems. The present work describes a profile-based bioinformatics pipeline that we call the Domain Analysis of Symbionts and Hosts (DASH). DASH provides a web platform for the functional analysis of viral and host genomes. This study uses Human Herpesvirus 8 (HHV-8) as a model to validate the methodology. Our results indicate that HHV-8 shares at least 29% of its genes with humans (fourteen immunomodulatory and ten metabolic genes). DASH also suggests functions for fifty-one additional HHV-8 structural and metabolic proteins. We also perform two other comparative genomics studies of human viruses: (1) a broad survey of eleven viruses of disparate sizes and transcription strategies; and (2) a closer examination of forty-one viruses of the order Mononegavirales. In the survey, DASH detects human homologs in 4/5 DNA viruses. None of the non-retro-transcribing RNA viruses in the survey showed evidence of homology to humans. The order Mononegavirales are also non-retro-transcribing RNA viruses, however, and DASH found homology in 39/41 of them. Mononegaviruses display larger fractions of human similarities (up to 75%) than any of the other RNA or DNA viruses (up to 55% and 29% respectively). 
CONCLUSIONS We conclude that gene sharing probably occurs between humans and both DNA and RNA viruses, in viral genomes of differing sizes, regardless of transcription strategies. Our method (DASH) simultaneously analyzes the genomes of two interacting species thereby mining functional information to identify shared as well as exclusive domains to each organism. Our results validate our approach, showing that DASH has potential as a pipeline for making therapeutic discoveries in other host-symbiont systems. DASH results are available at http://tinyurl.com/spouge-dash.
Affiliation(s)
- Mileidy W Gonzalez
- National Institutes of Health, National Library of Medicine, National Center for Biotechnology Information, 8600 Rockville Pike, Building 38A, Room 6N611-M, Bethesda, MD 20894, USA.
56
Schuepbach T, Pagni M, Bridge A, Bougueleret L, Xenarios I, Cerutti L. pfsearchV3: a code acceleration and heuristic to search PROSITE profiles. Bioinformatics 2013; 29:1215-7. [PMID: 23505298 PMCID: PMC3634184 DOI: 10.1093/bioinformatics/btt129] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.4] [Indexed: 11/28/2022]
Abstract
Summary: The PROSITE resource provides a rich and well annotated source of signatures in the form of generalized profiles that allow protein domain detection and functional annotation. One of the major limiting factors in the application of PROSITE in genome and metagenome annotation pipelines is the time required to search protein sequence databases for putative matches. We describe an improved and optimized implementation of the PROSITE search tool pfsearch that, combined with a newly developed heuristic, addresses this limitation. On a modern x86_64 hyper-threaded quad-core desktop computer, the new pfsearchV3 is two orders of magnitude faster than the original algorithm. Availability and implementation: Source code and binaries of pfsearchV3 are freely available for download at http://web.expasy.org/pftools/#pfsearchV3, implemented in C and supported on Linux. PROSITE generalized profiles including the heuristic cut-off scores are available at the same address. Contact: pftools@isb-sib.ch
Affiliation(s)
- Thierry Schuepbach
- Vital-IT Group, SIB Swiss Institute of Bioinformatics, Genopode, UNIL-Sorge, 1015 Lausanne, Switzerland
57
Maulik U, Sarkar A. Searching remote homology with spectral clustering with symmetry in neighborhood cluster kernels. PLoS One 2013; 8:e46468. [PMID: 23457439 PMCID: PMC3574063 DOI: 10.1371/journal.pone.0046468] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 06/11/2011] [Accepted: 09/04/2012] [Indexed: 11/18/2022]
Abstract
Remote homology detection among proteins utilizing only the unlabelled sequences is a central problem in comparative genomics. The existing cluster kernel methods based on neighborhoods and profiles and the Markov clustering algorithms are currently the most popular methods for protein family recognition. The deviation from random walks with inflation or dependency on hard threshold in similarity measure in those methods requires an enhancement for homology detection among multi-domain proteins. We propose to combine spectral clustering with neighborhood kernels in Markov similarity for enhancing sensitivity in detecting homology independent of "recent" paralogs. The spectral clustering approach with new combined local alignment kernels more effectively exploits the unsupervised protein sequences globally reducing inter-cluster walks. When combined with the corrections based on modified symmetry based proximity norm deemphasizing outliers, the technique proposed in this article outperforms other state-of-the-art cluster kernels among all twelve implemented kernels. The comparison with the state-of-the-art string and mismatch kernels also shows the superior performance scores provided by the proposed kernels. Similar performance improvement is also found over an existing large dataset. Therefore the proposed spectral clustering framework over combined local alignment kernels with modified symmetry based correction achieves superior performance for unsupervised remote homolog detection even in multi-domain and promiscuous domain proteins from Genolevures database families with better biological relevance. Source code available upon request. CONTACT sarkar@labri.fr.
Affiliation(s)
- Ujjwal Maulik
- Department of Computer Science and Engineering, Jadavpur University, Kolkata, West Bengal, India.
58
Udaka K, Mamitsuka H, Nakaseko Y, Abe N. Prediction of MHC class I binding peptides by a query learning algorithm based on hidden Markov models. J Biol Phys 2013; 28:183-94. [PMID: 23345768 DOI: 10.1023/a:1019931731519] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Indexed: 11/12/2022]
Abstract
A query learning algorithm based on hidden Markov models (HMMs) is developed to design experiments for string analysis and prediction of MHC class I binding peptides. Query learning is introduced to aim at reducing the number of peptide binding data for training of HMMs. A multiple number of HMMs, which will collectively serve as a committee, are trained with binding data and used for prediction in real-number values. The universe of peptides is randomly sampled and subjected to judgement by the HMMs. Peptides whose prediction is least consistent among committee HMMs are tested by experiment. By iterating the feedback cycle of computational analysis and experiment the most wanted information is effectively extracted. After 7 rounds of active learning with 181 peptides in all, predictive performance of the algorithm surpassed the so far best performing matrix based prediction. Moreover, by combining both methods, binder peptides (log Kd < -6) could be predicted with 84% accuracy. Parameter distribution of the HMMs that can be inspected visually after training further offers a glimpse of dynamic specificity of the MHC molecules.
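The selection step described in this abstract (send to experiment the peptides on which the committee agrees least) can be sketched as follows. This is only an illustration under stated assumptions: the committee members here are hypothetical toy scoring functions rather than trained HMMs, and variance of the predictions is an assumed measure of inconsistency, which the abstract does not specify.

```python
from statistics import pvariance


def select_queries(candidates, committee, k):
    """Return the k candidates whose committee predictions disagree most.

    `committee` is a list of callables mapping a peptide string to a real
    number. Disagreement is measured as the population variance of the
    predictions (an assumption, not the paper's stated criterion).
    """
    scored = [
        (pvariance([member(peptide) for member in committee]), peptide)
        for peptide in candidates
    ]
    # Highest disagreement first; ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [peptide for _, peptide in scored[:k]]


# Hypothetical committee: three toy scorers standing in for trained models.
committee = [
    lambda p: p.count("A"),
    lambda p: p.count("L"),
    lambda p: 1.0,
]
candidates = ["ALGV", "AAAA", "AALG"]
queries = select_queries(candidates, committee, k=2)
```

In an active-learning loop the selected peptides would be measured experimentally, added to the training data, and the committee retrained before the next round.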
Affiliation(s)
- Keiko Udaka
- Department of Biophysics, Kyoto University, Japan
59
Wheeler TJ, Clements J, Eddy SR, Hubley R, Jones TA, Jurka J, Smit AFA, Finn RD. Dfam: a database of repetitive DNA based on profile hidden Markov models. Nucleic Acids Res 2013; 41:D70-82. [PMID: 23203985 PMCID: PMC3531169 DOI: 10.1093/nar/gks1265] [Citation(s) in RCA: 215] [Impact Index Per Article: 17.9] [Received: 08/31/2012] [Revised: 11/04/2012] [Accepted: 11/05/2012] [Indexed: 11/28/2022]
Abstract
We present a database of repetitive DNA elements, called Dfam (http://dfam.janelia.org). Many genomes contain a large fraction of repetitive DNA, much of which is made up of remnants of transposable elements (TEs). Accurate annotation of TEs enables research into their biology and can shed light on the evolutionary processes that shape genomes. Identification and masking of TEs can also greatly simplify many downstream genome annotation and sequence analysis tasks. The commonly used TE annotation tools RepeatMasker and Censor depend on sequence homology search tools such as cross_match and BLAST variants, as well as Repbase, a collection of known TE families each represented by a single consensus sequence. Dfam contains entries corresponding to all Repbase TE entries for which instances have been found in the human genome. Each Dfam entry is represented by a profile hidden Markov model, built from alignments generated using RepeatMasker and Repbase. When used in conjunction with the hidden Markov model search tool nhmmer, Dfam produces a 2.9% increase in coverage over consensus sequence search methods on a large human benchmark, while maintaining low false discovery rates, and coverage of the full human genome is 54.5%. The website provides a collection of tools and data views to support improved TE curation and annotation efforts. Dfam is also available for download in flat file format or in the form of MySQL table dumps.
60
Joshi AG, Raghavender US, Sowdhamini R. Improved performance of sequence search approaches in remote homology detection. F1000Res 2013; 2:93. [PMID: 25469226 PMCID: PMC4240247 DOI: 10.12688/f1000research.2-93.v2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Accepted: 06/27/2014] [Indexed: 11/20/2022]
Abstract
The protein sequence space is vast and diverse, spanning across different families. Biologically meaningful relationships exist between proteins at superfamily level. However, it is highly challenging to establish convincing relationships at the superfamily level by means of simple sequence searches. It is necessary to design a rigorous sequence search strategy to establish remote homology relationships and achieve high coverage. We have used iterative profile-based methods, along with constraints of sequence motifs, to specify search directions. We address the importance of multiple start points (queries) to achieve high coverage at protein superfamily level. We have devised strategies to employ a structural regime to search sequence space with good specificity and sensitivity. We employ two well-known sequence search methods, PSI-BLAST and PHI-BLAST, with multiple queries and multiple patterns to enhance homologue identification at the structural superfamily level. The study suggests that multiple queries improve sensitivity, while a pattern-constrained iterative sequence search becomes stringent at the initial stages, thereby driving the search in a specific direction and also achieves high coverage. This data mining approach has been applied to the entire structural superfamily database.
Affiliation(s)
- Adwait Govind Joshi
- National Centre for Biological Sciences (Tata Institute of Fundamental Research), Gandhi Krishi Vignyan Kendra Campus, Bangalore, 560065, India ; Manipal University, Manipal, Karnataka, 576104, India
- Upadhyayula Surya Raghavender
- National Centre for Biological Sciences (Tata Institute of Fundamental Research), Gandhi Krishi Vignyan Kendra Campus, Bangalore, 560065, India
- Ramanathan Sowdhamini
- National Centre for Biological Sciences (Tata Institute of Fundamental Research), Gandhi Krishi Vignyan Kendra Campus, Bangalore, 560065, India
61
Vyas VK, Ukawala RD, Ghate M, Chintha C. Homology modeling a fast tool for drug discovery: current perspectives. Indian J Pharm Sci 2012. [PMID: 23204616 PMCID: PMC3507339 DOI: 10.4103/0250-474x.102537] [Citation(s) in RCA: 155] [Impact Index Per Article: 11.9] [Indexed: 11/22/2022]
Abstract
A major goal of structural biology involves the formation of protein-ligand complexes, in which the protein molecules act energetically in the course of binding. Therefore, understanding protein-ligand interactions is very important for structure-based drug design. Lack of knowledge of 3D structures has hindered efforts to understand the binding specificities of ligands with proteins. With the increase in modeling software and the growing number of known protein structures, homology modeling is rapidly becoming the method of choice for obtaining 3D coordinates of proteins. Homology modeling is a representation of the similarity of environmental residues at topologically corresponding positions in the reference proteins. In the absence of experimental data, model building on the basis of a known 3D structure of a homologous protein is at present the only reliable method to obtain structural information. Knowledge of the 3D structures of proteins provides invaluable insights into the molecular basis of their functions. Recent advances in homology modeling, particularly in detecting and aligning sequences with template structures, detecting distant homologues, modeling loops and side chains, and detecting errors in a model, have contributed to consistent prediction of protein structure, which was not possible even several years ago. This review focuses on the features and role of homology modeling in predicting protein structure and describes current developments in this field, with successful applications at the different stages of drug design and discovery.
Affiliation(s)
- V K Vyas
- Department of Pharmaceutical Chemistry, Institute of Pharmacy, Nirma University, Ahmedabad-382 481, India
62
Shih CH, Chang CM, Lin YS, Lo WC, Hwang JK. Evolutionary information hidden in a single protein structure. Proteins 2012; 80:1647-57. [DOI: 10.1002/prot.24058] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.5] [Received: 08/22/2011] [Revised: 02/07/2012] [Accepted: 02/12/2012] [Indexed: 11/07/2022]
63
Hobiger K, Utesch T, Mroginski MA, Friedrich T. Coupling of Ci-VSP modules requires a combination of structure and electrostatics within the linker. Biophys J 2012; 102:1313-22. [PMID: 22455914 DOI: 10.1016/j.bpj.2012.02.027] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.0] [Received: 11/05/2011] [Revised: 02/01/2012] [Accepted: 02/08/2012] [Indexed: 11/26/2022]
Abstract
The voltage-sensitive phosphatase Ci-VSP consists of an intracellular phosphatase domain (PD) coupled to a transmembrane voltage-sensor domain (VSD). Depolarization triggers the selective dephosphorylation of phosphoinositides. However, the molecular mechanisms of coupling are still elusive. To clarify the role of the VSD-PD linker as a putative partner for electrostatic interactions with the membrane, we carried out a cysteine-scanning mutagenesis of the whole motif M240-K257. Upon coexpression with PI(4,5)P(2)-sensitive KCNQ2/KCNQ3 channels in Xenopus oocytes, we identified four positions (A242C, R245C, K252C, and Y255C) with a completely abrogated PD activity. Because the mutation effect occurred periodically, we hypothesize that α-helical elements exist within the linker, with a gap near position S249. The combination of these results with the analysis of transient sensing currents of the VSD revealed distinct roles for the N-terminal (M240-S249) and C-terminal (Q250-K257) linker motifs in the VSD-PD coupling. According to our functional results, the computational structure prediction of the Q239-D258 fragment confirmed α-helical structures within the linker, with a short β-turn around S249 in the activated conformation. Remarkably, the position K252 may be a candidate for interacting with the PD rather than for binding to the membrane. This provides the first insight (to our knowledge) into the direct intervention of the linker in the VSD-PD coupling process.
Affiliation(s)
- Kirstin Hobiger
- Berlin Institute of Technology, Institute of Chemistry, Max-Volmer-Laboratory of Biophysical Chemistry, Berlin, Germany.
64
Gong YN, Chen GW, Shih SR. Characterization of subtypes of the influenza A hemagglutinin (HA) gene using profile hidden Markov models. Journal of Microbiology, Immunology, and Infection = Wei mian yu gan ran za zhi 2011; 45:404-10. [PMID: 22197681 DOI: 10.1016/j.jmii.2011.12.018] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 07/29/2011] [Revised: 09/27/2011] [Accepted: 10/16/2011] [Indexed: 11/27/2022]
Abstract
BACKGROUND The influenza A virus has evolved into 16 hemagglutinin (HA) subtypes with different antigenic properties. Thus far typing has been primarily assay based, but the many sequences available from the US National Center for Biotechnology Information (NCBI) offer alternative ways of characterizing the HA gene. METHODS All available HA sequences from the NCBI were analyzed. The software package HMMER was used to score how a training sequence fitted a profile hidden Markov model (profile HMM) constructed from the consensus sequence of one particular HA subtype, Hx, where x=1 to 16. Scores from sequences of the same subtype and from other subtypes were then compared to see if they were separable. This approach was implemented in a stepwise manner, utilizing a sliding window of 100 amino acids with 10-amino-acid increments to build many subtype-specific models, and then assessing which 100-amino acid segments yielded the desired differentiability. RESULTS Segment-based analysis revealed domains that correlate to HA sequence heterogeneity from one subtype to the others. For example, we showed that H1 segments covering only the second half of HA are not statistically separable from H2, H5 and H6 within the same region, suggesting evolutionary relatedness for these subtypes. The HA1 domain was found to be mostly differentiable between subtypes, which is in line with wet-lab findings that the domain is antigenicity-rich. We also reported a couple of regions that can be conveniently used to characterize all HA subtypes. CONCLUSION We established an analysis framework for assessing sequence-subtype association to provide insights into HA subtypes with close evolutionary relationships.
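The segmentation scheme in the METHODS section (a sliding window of 100 amino acids moved in 10-amino-acid increments) can be sketched as below. The function name and the placeholder sequence are assumptions for illustration; building and scoring a profile HMM for each segment with HMMER is outside the scope of this sketch.

```python
def sliding_windows(sequence, width=100, step=10):
    """Yield (start, segment) pairs for every full-width window.

    Windows that would run past the end of the sequence are skipped, so a
    sequence shorter than `width` yields nothing.
    """
    for start in range(0, len(sequence) - width + 1, step):
        yield start, sequence[start:start + width]


# Placeholder sequence of 566 residues (an assumed, illustrative length;
# not real HA data).
toy_sequence = "M" * 566
segments = list(sliding_windows(toy_sequence))
first_start, first_segment = segments[0]
last_start, last_segment = segments[-1]
```

Each segment would then serve as the region from which one subtype-specific model is built, so the number of windows determines how many models are compared per subtype.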
Affiliation(s)
- Yu-Nong Gong
- Graduate Institute of Electrical Engineering, Chang Gung University, Taoyuan, Taiwan
65
Hong Y, Kang J, Lee D, van Rossum DB. Adaptive GDDA-BLAST: fast and efficient algorithm for protein sequence embedding. PLoS One 2010; 5:e13596. [PMID: 21042584 PMCID: PMC2962639 DOI: 10.1371/journal.pone.0013596] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 03/10/2010] [Accepted: 09/28/2010] [Indexed: 11/28/2022]
Abstract
A major computational challenge in the genomic era is annotating structure/function to the vast quantities of sequence information that is now available. This problem is illustrated by the fact that most proteins lack comprehensive annotations, even when experimental evidence exists. We previously theorized that embedded-alignment profiles (simply "alignment profiles" hereafter) provide a quantitative method that is capable of relating the structural and functional properties of proteins, as well as their evolutionary relationships. A key feature of alignment profiles lies in the interoperability of data format (e.g., alignment information, physio-chemical information, genomic information, etc.). Indeed, we have demonstrated that the Position Specific Scoring Matrices (PSSMs) are an informative M-dimension that is scored by quantitatively measuring the embedded or unmodified sequence alignments. Moreover, the information obtained from these alignments is informative, and remains so even in the "twilight zone" of sequence similarity (<25% identity). Although our previous embedding strategy was powerful, it suffered from contaminating alignments (embedded AND unmodified) and high computational costs. Herein, we describe the logic and algorithmic process for a heuristic embedding strategy named "Adaptive GDDA-BLAST." Adaptive GDDA-BLAST is, on average, up to 19 times faster than, but has similar sensitivity to our previous method. Further, data are provided to demonstrate the benefits of embedded-alignment measurements in terms of detecting structural homology in highly divergent protein sequences and isolating secondary structural elements of transmembrane and ankyrin-repeat domains. Together, these advances allow further exploration of the embedded alignment data space within sufficiently large data sets to eventually induce relevant statistical inferences. 
We show that sequence embedding could serve as one of the vehicles for measurement of low-identity alignments and for incorporation thereof into high-performance PSSM-based alignment profiles.
Affiliation(s)
- Yoojin Hong
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania, United States of America
- Center for Computational Proteomics, The Pennsylvania State University, University Park, Pennsylvania, United States of America
- Jaewoo Kang
- Department of Computer Science and Engineering, Korea University, Seoul, Korea
- Department of Biostatistics, College of Medicine, Korea University, Seoul, Korea
- Dongwon Lee
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania, United States of America
- College of Information Sciences and Technology, The Pennsylvania State University, University Park, Pennsylvania, United States of America
- Damian B. van Rossum
- Center for Computational Proteomics, The Pennsylvania State University, University Park, Pennsylvania, United States of America
- Department of Biology, The Pennsylvania State University, University Park, Pennsylvania, United States of America
66
A novel secretory poly-cysteine and histidine-tailed metalloprotein (Ts-PCHTP) from Trichinella spiralis (Nematoda). PLoS One 2010; 5:e13343. [PMID: 20967224 PMCID: PMC2954182 DOI: 10.1371/journal.pone.0013343] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Received: 07/14/2010] [Accepted: 09/16/2010] [Indexed: 11/19/2022]
Abstract
Background Trichinella spiralis is an unusual parasitic intracellular nematode that causes dedifferentiation of the host myofiber. Trichinella proteomic analyses have identified proteins that act at the interface between the parasite and the host and are probably important for infection and pathogenesis. Many parasitic proteins, including a number of metalloproteins, are unique to the nematodes and trichinellids and therefore present good targets for future therapeutic developments. Furthermore, detailed information on such proteins and their function in the nematode organism would provide a better understanding of parasite-host interactions. Methodology/Principal Findings In this study we report the identification, biochemical characterization and localization of a novel poly-cysteine and histidine-tailed metalloprotein (Ts-PCHTP). The native Ts-PCHTP was purified from T. spiralis muscle larvae that were isolated from infected rats as a model system. The sequence analysis showed no homology with other proteins. Two unique poly-cysteine domains were found in the amino acid sequence of Ts-PCHTP. This protein is also the first reported natural histidine-tailed protein. It was suggested that Ts-PCHTP has metal-binding properties. A Total Reflection X-ray Fluorescence (TXRF) assay revealed that it binds significant concentrations of iron, nickel and zinc at a protein:metal ratio of about 1:2. Immunohistochemical analysis showed that Ts-PCHTP is localized in the cuticle and in all tissues of the larvae, but that it is not excreted outside the parasite. Conclusions/Significance Our data suggest that Ts-PCHTP is the first described member of a novel nematode poly-cysteine protein family and that its function could be metal storage and/or transport. Since this protein family is unique to parasites from the superfamily Trichinelloidea, its potential applications in diagnostics and treatment could be exploited in the future.
|
67
|
Huynen MA, de Hollander M, Szklarczyk R. Mitochondrial proteome evolution and genetic disease. Biochim Biophys Acta Mol Basis Dis 2009; 1792:1122-9. [DOI: 10.1016/j.bbadis.2009.03.005] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/30/2008] [Revised: 03/04/2009] [Accepted: 03/20/2009] [Indexed: 11/16/2022]
|
68
|
Precise determination of the diversity of a combinatorial antibody library gives insight into the human immunoglobulin repertoire. Proc Natl Acad Sci U S A 2009; 106:20216-21. [PMID: 19875695 DOI: 10.1073/pnas.0909775106] [Citation(s) in RCA: 351] [Impact Index Per Article: 21.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/16/2022] Open
Abstract
Antibody repertoire diversity, potentially as high as 10^11 unique molecules in a single individual, confounds characterization by conventional sequence analyses. In this study, we present a general method for assessing human antibody sequence diversity displayed on phage using massively parallel pyrosequencing, a novel application of Kabat column-labeled profile Hidden Markov Models, and translated complementarity determining region (CDR) capture-recapture analysis. Pyrosequencing of domain amplicon and RCA PCR products generated 1.5 × 10^6 reads, including more than 1.9 × 10^5 high quality, full-length sequences of antibody variable fragment (Fv) variable domains. Novel methods for germline and CDR classification and fine characterization of sequence diversity in the 6 CDRs are presented. Diverse germline contributions to the repertoire with random heavy and light chain pairing are observed. All germline families were found to be represented in 1.7 × 10^4 sequences obtained from repeated panning of the library. While the most variable CDR (CDR-H3) presents significant length and sequence variability, we find a substantial contribution to total diversity from somatically mutated germline encoded CDRs 1 and 2. Using a capture-recapture method, the total diversity of the antibody library obtained from a human donor Immunoglobulin M (IgM) pool was determined to be at least 3.5 × 10^10. The results provide insights into the role of IgM diversification, display library construction, and productive germline usages in antibody libraries and the humoral repertoire.
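The capture-recapture idea named in this abstract can be illustrated with a small sketch. The abstract does not say which estimator the authors used, so the classical Lincoln-Petersen estimator and Chapman's variant are shown here as assumptions; the function names and the toy sequences are hypothetical.

```python
from typing import Iterable


def lincoln_petersen(n_first: int, n_second: int, n_shared: int) -> float:
    """Lincoln-Petersen estimate of total population size from two samples."""
    if n_shared <= 0:
        raise ValueError("no shared items: the estimate is undefined")
    return n_first * n_second / n_shared


def chapman(n_first: int, n_second: int, n_shared: int) -> float:
    """Chapman's variant, which stays finite when n_shared is 0."""
    return (n_first + 1) * (n_second + 1) / (n_shared + 1) - 1


def estimate_diversity(sample_a: Iterable[str], sample_b: Iterable[str]) -> float:
    """Estimate library diversity from two independent sets of sequences.

    Each sample is reduced to its unique sequences; the overlap between
    the two sets plays the role of the "recaptured" individuals.
    """
    unique_a, unique_b = set(sample_a), set(sample_b)
    shared = len(unique_a & unique_b)
    return lincoln_petersen(len(unique_a), len(unique_b), shared)


if __name__ == "__main__":
    # Toy stand-ins for translated CDR sequences (hypothetical data).
    run_1 = ["AAA", "BBB", "CCC", "DDD"]
    run_2 = ["CCC", "DDD", "EEE"]
    print(estimate_diversity(run_1, run_2))
```

The fewer sequences the two samples share, the larger the estimated total, which is why a small overlap between deep sequencing runs implies a very large library.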
|
69
|
Dlakić M. HHsvm: fast and accurate classification of profile-profile matches identified by HHsearch. Bioinformatics 2009; 25:3071-6. [PMID: 19773335 DOI: 10.1093/bioinformatics/btp555] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022]
Abstract
MOTIVATION Recently developed profile-profile methods rival structural comparisons in their ability to detect homology between distantly related proteins. Despite this tremendous progress, many genuine relationships between protein families cannot be recognized as comparisons of their profiles result in scores that are statistically insignificant. RESULTS Using known evolutionary relationships among protein superfamilies in SCOP database, support vector machines were trained on four sets of discriminatory features derived from the output of HHsearch. Upon validation, it was shown that the automatic classification of all profile-profile matches was superior to fixed threshold-based annotation in terms of sensitivity and specificity. The effectiveness of this approach was demonstrated by annotating several domains of unknown function from the Pfam database. AVAILABILITY Programs and scripts implementing the methods described in this manuscript are freely available from http://hhsvm.dlakiclab.org/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Mensur Dlakić
- Department of Microbiology, Montana State University, Bozeman, MT 59717-3520, USA.
|
70
|
Tångrot JE, Kågström B, Sauer UH. Accurate domain identification with structure-anchored hidden Markov models, saHMMs. Proteins 2009; 76:343-52. [PMID: 19173309 DOI: 10.1002/prot.22349] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/11/2022]
Abstract
The ever increasing speed of DNA sequencing widens the discrepancy between the number of known gene products, and the knowledge of their function and structure. Proper annotation of protein sequences is therefore crucial if the missing information is to be deduced from sequence-based similarity comparisons. These comparisons become exceedingly difficult as the pairwise identities drop to very low values. To improve the accuracy of domain identification, we exploit the fact that the three-dimensional structures of domains are much more conserved than their sequences. Based on structure-anchored multiple sequence alignments of low identity homologues we constructed 850 structure-anchored hidden Markov models (saHMMs), each representing one domain family. Since the saHMMs are highly family specific, they can be used to assign a domain to its correct family and clearly distinguish it from domains belonging to other families, even within the same superfamily. This task is not trivial and becomes particularly difficult if the unknown domain is distantly related to the rest of the domain sequences within the family. In a search with full length protein sequences, harbouring at least one domain as defined by the structural classification of proteins database (SCOP), version 1.71, versus the saHMM database based on SCOP version 1.69, we achieve an accuracy of 99.0%. All of the few hits outside the family fall within the correct superfamily. Compared to Pfam_ls HMMs, the saHMMs obtain about 11% higher coverage. A comparison with BLAST and PSI-BLAST demonstrates that the saHMMs have consistently fewer errors per query at a given coverage. Within our recommended E-value range, the same is true for a comparison with SUPERFAMILY. Furthermore, we are able to annotate 232 proteins with 530 nonoverlapping domains belonging to 102 different domain families among human proteins labelled "unknown" in the NCBI protein database. 
Our results demonstrate that the saHMM database represents a versatile and reliable tool for identification of domains in protein sequences. With the aid of saHMMs, homology on the family level can be assigned, even for distantly related sequences. Due to the construction of the saHMMs, the hits they provide are always associated with high quality crystal structures. The saHMM database can be accessed via the FISH server at http://babel.ucmp.umu.se/fish/.
|
71
|
Abstract
The protein universe is the set of all proteins of all organisms. Here, all currently known sequences are analyzed in terms of families that have single-domain or multidomain architectures and whether they have a known three-dimensional structure. Growth of new single-domain families is very slow: Almost all growth comes from new multidomain architectures that are combinations of domains characterized by approximately 15,000 sequence profiles. Single-domain families are mostly shared by the major groups of organisms, whereas multidomain architectures are specific and account for species diversity. There are known structures for a quarter of the single-domain families, and >70% of all sequences can be partially modeled thanks to their membership in these families.
|
72
|
Lee MM, Chan MK, Bundschuh R. SIB-BLAST: a web server for improved delineation of true and false positives in PSI-BLAST searches. Nucleic Acids Res 2009; 37:W53-6. [PMID: 19429693 PMCID: PMC2703926 DOI: 10.1093/nar/gkp301] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
A SIB-BLAST web server (http://sib-blast.osc.edu) has been established for investigators to use the SimpleIsBeautiful (SIB) algorithm for sequence-based homology detection. SIB was developed to overcome the model corruption frequently observed in the later iterations of PSI-BLAST searches. The algorithm compares resultant hits from the second iteration to the final iteration of a PSI-BLAST search, calculates the figure of merit for each 'overlapped' hit and re-ranks the hits according to their figure of merit. By validating hits generated from the last profile against hits from the first profile when the model is least corrupted, the true and false positives are better delineated, which in turn, improves the accuracy of iterative PSI-BLAST searches. Notably, this improvement to PSI-BLAST comes at minimal computational cost as SIB-BLAST utilizes existing results already produced in a PSI-BLAST search.
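The re-ranking step described in this abstract can be sketched roughly as follows. The abstract does not define the figure of merit, so it is left as a caller-supplied function; `placeholder_merit` (a sum of negative log10 E-values) is purely a hypothetical stand-in, as are the hit identifiers and the assumption that a higher figure of merit is better.

```python
import math
from typing import Callable, Dict, List, Tuple


def rerank_overlapped_hits(
    early_hits: Dict[str, float],
    final_hits: Dict[str, float],
    figure_of_merit: Callable[[float, float], float],
) -> List[Tuple[str, float]]:
    """Re-rank hits found in both an early and the final PSI-BLAST iteration.

    early_hits / final_hits map a hit identifier to its E-value in that
    iteration. Only hits present in both ("overlapped" hits) are kept.
    """
    overlapped = early_hits.keys() & final_hits.keys()
    scored = [
        (hit_id, figure_of_merit(early_hits[hit_id], final_hits[hit_id]))
        for hit_id in overlapped
    ]
    # Assumption: a larger figure of merit means a more trustworthy hit.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def placeholder_merit(early_evalue: float, final_evalue: float) -> float:
    """Hypothetical figure of merit, NOT the one defined by SIB."""
    return -math.log10(early_evalue) - math.log10(final_evalue)


if __name__ == "__main__":
    second_iteration = {"hitA": 1e-5, "hitB": 1e-2, "hitC": 1e-3}
    final_iteration = {"hitA": 1e-20, "hitB": 1e-30, "hitD": 1e-50}
    for hit_id, merit in rerank_overlapped_hits(
        second_iteration, final_iteration, placeholder_merit
    ):
        print(hit_id, round(merit, 1))
```

Because the function only consumes E-values that a PSI-BLAST run has already produced, the sketch mirrors the abstract's point that the re-ranking adds little computational cost.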
Affiliation(s)
- Marianne M Lee
- The Ohio State Biophysics Program, Ohio State University, Columbus, OH 43210-1117, USA
|
73
|
Brandt BW, Heringa J. webPRC: the Profile Comparer for alignment-based searching of public domain databases. Nucleic Acids Res 2009; 37:W48-52. [PMID: 19420063 PMCID: PMC2703954 DOI: 10.1093/nar/gkp279] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/16/2022] Open
Abstract
Profile–profile methods are well suited to detect remote evolutionary relationships between protein families. Profile Comparer (PRC) is an existing stand-alone program for scoring and aligning hidden Markov models (HMMs), which are based on multiple sequence alignments. Since PRC compares profile HMMs instead of sequences, it can be used to find distant homologues. For this purpose, PRC is used by, for example, the CATH and Pfam-domain databases. As PRC is a profile comparer, it only reports profile HMM alignments and does not produce multiple sequence alignments. We have developed webPRC server, which makes it straightforward to search for distant homologues or similar alignments in a number of domain databases. In addition, it provides the results both as multiple sequence alignments and aligned HMMs. Furthermore, the user can view the domain annotation, evaluate the PRC hits with the Jalview multiple alignment editor and generate logos from the aligned HMMs or the aligned multiple alignments. Thus, this server assists in detecting distant homologues with PRC as well as in evaluating and using the results. The webPRC interface is available at http://www.ibi.vu.nl/programs/prcwww/.
Affiliation(s)
- Bernd W Brandt
- Centre for Integrative Bioinformatics (IBIVU), VU University Amsterdam, The Netherlands.
|
74
|
Koussounadis A, Redfern OC, Jones DT. Improving classification in protein structure databases using text mining. BMC Bioinformatics 2009; 10:129. [PMID: 19416501 PMCID: PMC2688513 DOI: 10.1186/1471-2105-10-129] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/27/2008] [Accepted: 05/05/2009] [Indexed: 11/25/2022] Open
Abstract
BACKGROUND The classification of protein domains in the CATH resource is primarily based on structural comparisons, sequence similarity and manual analysis. One of the main bottlenecks in the processing of new entries is the evaluation of 'borderline' cases by human curators with reference to the literature, and better tools for helping both expert and non-expert users quickly identify relevant functional information from text are urgently needed. A text-based method for protein classification is presented, which complements the existing sequence and structure-based approaches, especially in cases exhibiting low similarity to existing members and requiring manual intervention. The method is based on the assumption that textual similarity between sets of documents relating to proteins reflects biological function similarities and can be exploited to make classification decisions. RESULTS An optimal strategy for the text comparisons was identified by using an established gold standard enzyme dataset. Filtering of the abstracts using a machine learning approach to discriminate sentences containing functional, structural and classification information that are relevant to the protein classification task improved performance. Testing this classification scheme on a dataset of 'borderline' protein domains that lack significant sequence or structure similarity to classified proteins showed that although, as expected, the structural similarity classifiers perform better on average, there is a significant benefit in incorporating text similarity in logistic regression models, indicating significant orthogonality in this additional information. Coverage was significantly increased, especially at low error rates, which is important for routine classification tasks: 15.3% for the combined structure and text classifier compared to 10% for the structural classifier alone, at a 10^-3 error rate. Finally, when only the highest scoring predictions were used to infer classification, an extra 4.2% of correct decisions were made by the combined classifier. CONCLUSION We have described a simple text-based method to classify protein domains that demonstrates an improvement over existing methods. The method is unique in incorporating structural and text-based classifiers directly and is particularly useful in cases where inconclusive evidence from sequence or structure similarity requires laborious manual classification.
Affiliation(s)
- Antonis Koussounadis
- Bioinformatics Group, Department of Computer Science, University College of London, London, WC1E 6BT, UK
- Oliver C Redfern
- Department of Structural and Molecular Biology, University College of London, London, WC1E 6BT, UK
- David T Jones
- Bioinformatics Group, Department of Computer Science, University College of London, London, WC1E 6BT, UK
- Department of Structural and Molecular Biology, University College of London, London, WC1E 6BT, UK
|
75
|
Ray S, Bandyopadhyay S, Pal S. Combining Multisource Information Through Functional-Annotation-Based Weighting: Gene Function Prediction in Yeast. IEEE Trans Biomed Eng 2009; 56:229-36. [DOI: 10.1109/tbme.2008.2005955] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/07/2022]
|
76
|
Frech C, Kommenda M, Dorfer V, Kern T, Hintner H, Bauer JW, Onder K. Improved homology-driven computational validation of protein-protein interactions motivated by the evolutionary gene duplication and divergence hypothesis. BMC Bioinformatics 2009; 10:21. [PMID: 19152684 PMCID: PMC2637843 DOI: 10.1186/1471-2105-10-21] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/09/2008] [Accepted: 01/19/2009] [Indexed: 11/10/2022] Open
Abstract
Background Protein-protein interaction (PPI) data sets generated by high-throughput experiments are contaminated by large numbers of erroneous PPIs. Therefore, computational methods for PPI validation are necessary to improve the quality of such data sets. Against the background of the theory that most extant PPIs arose as a consequence of gene duplication, the sensitive search for homologous PPIs, i.e. for PPIs descending from a common ancestral PPI, should be a successful strategy for PPI validation. Results To validate an experimentally observed PPI, we combine FASTA and PSI-BLAST to perform a sensitive sequence-based search for pairs of interacting homologous proteins within a large, integrated PPI database. A novel scoring scheme that incorporates both quality and quantity of all observed matches allows us (1) to consider also tentative paralogs and orthologs in this analysis and (2) to combine search results from more than one homology detection method. ROC curves illustrate the high efficacy of this approach and its improvement over other homology-based validation methods. Conclusion New PPIs are primarily derived from preexisting PPIs and not invented de novo. Thus, the hallmark of true PPIs is the existence of homologous PPIs. The sensitive search for homologous PPIs within a large body of known PPIs is an efficient strategy to separate biologically relevant PPIs from the many spurious PPIs reported by high-throughput experiments.
Affiliation(s)
- Christian Frech
- Upper Austria University of Applied Sciences, Hagenberg, Austria.
|
77
|
Goonesekere NC. Evaluating the efficacy of a structure-derived amino acid substitution matrix in detecting protein homologs by BLAST and PSI-BLAST. Adv Appl Bioinform Chem 2009; 2:71-8. [PMID: 21918617 PMCID: PMC3169949 DOI: 10.2147/aabc.s5553] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/02/2022] Open
Abstract
The large numbers of protein sequences generated by whole genome sequencing projects require rapid and accurate methods of annotation. The detection of homology through computational sequence analysis is a powerful tool in determining the complex evolutionary and functional relationships that exist between proteins. Homology search algorithms employ amino acid substitution matrices to detect similarity between protein sequences. The substitution matrices in common use today are constructed using sequences aligned without reference to protein structure. Here we present amino acid substitution matrices constructed from the alignment of a large number of protein domain structures from the structural classification of proteins (SCOP) database. We show that when incorporated into the homology search algorithms BLAST and PSI-BLAST, the structure-based substitution matrices enhance the efficacy of detecting remote homologs.
Affiliation(s)
- Nalin Cw Goonesekere
- Department of Chemistry and Biochemistry, University of Northern Iowa, Cedar Falls, IA, USA
|
78
|
Immunogenicity in peptide-immunotherapy: from self/nonself to similar/dissimilar sequences. ADVANCES IN EXPERIMENTAL MEDICINE AND BIOLOGY 2008; 640:198-207. [PMID: 19065793 DOI: 10.1007/978-0-387-09789-3_15] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/06/2023]
Abstract
The nature of the relationship between an antigenic amino acid sequence and its capability to evoke an immune response is still an unsolved problem. Although experiments indicate that specific (dis)continuous amino acid sequences may determine specific immune responses, how immunogenic properties and recognition information are mapped onto a non-linear sequence is not understood. Immunology has invoked the concept of self/nonself discrimination in order to explain the capability of the organism to selectively immunoreact. However, no clear, logical and rational pathway has emerged to relate a structure and its immuno-nonreactivity. What Koshland wrote in 1990 cannot yet be dismissed: "Of all the mysteries of modern science, the mechanism of self versus nonself recognition in the immune system ranks at or near the top". This chapter reviews the concept of self/nonself discrimination in the immune system, starting from the historical perspective and the conceptual framework that underlie immune reaction patterns. It also introduces future research directions based on a proteomic dissection of the immune unit, qualitatively defined as a low-similarity sequence and quantitatively delimited by the minimum amino acid requisite able to evoke an immune response, independently of any microbial or viral "foreignness".
|
79
|
Protein subfamily assignment using the Conserved Domain Database. BMC Res Notes 2008; 1:114. [PMID: 19014584 PMCID: PMC2632666 DOI: 10.1186/1756-0500-1-114] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/08/2008] [Accepted: 11/14/2008] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Domains, evolutionarily conserved units of proteins, are widely used to classify protein sequences and infer protein function. Often, two or more overlapping domain models match a region of a protein sequence. Therefore, procedures are required to choose appropriate domain annotations for the protein. Here, we propose a method for assigning NCBI-curated domains from the Conserved Domain Database (CDD) that takes into account the organization of the domains into hierarchies of homologous domain models. FINDINGS Our analysis of alignment scores from NCBI-curated domain assignments suggests that identifying the correct model among closely related models is more difficult than choosing between non-overlapping domain models. We find that simple heuristics based on sorting scores and domain-specific thresholds are effective at reducing classification error. In fact, in our test set, the heuristics result in almost 90% of current misclassifications due to missing domain subfamilies being replaced by more generic domain assignments, thereby eliminating a significant amount of error within the database. CONCLUSION Our proposed domain subfamily assignment rule has been incorporated into the CD-Search software for assigning CDD domains to query protein sequences and has significantly improved pre-calculated domain annotations on protein sequences in NCBI's Entrez resource.
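A heuristic of the kind this abstract describes (sorted scores, per-domain thresholds, fallback to a more generic model) might look like the sketch below. The exact rule used in CD-Search is not given here, so the fallback logic, the function name `assign_domain`, and all model names, scores and thresholds are assumptions made for illustration.

```python
from typing import Dict, Optional


def assign_domain(
    scores: Dict[str, float],
    thresholds: Dict[str, float],
    parent: Dict[str, str],
) -> Optional[str]:
    """Pick one domain model for a sequence region.

    scores:     alignment score of each overlapping model that hit the region
    thresholds: hypothetical domain-specific minimum score per model
    parent:     child -> parent links in the domain hierarchy
    """
    # Consider candidate models from best score to worst.
    ranked = sorted(scores, key=lambda model: scores[model], reverse=True)
    for model in ranked:
        candidate: Optional[str] = model
        # Walk up the hierarchy until some model clears its own threshold.
        while candidate is not None:
            score = scores.get(candidate)
            limit = thresholds.get(candidate, float("inf"))
            if score is not None and score >= limit:
                return candidate
            candidate = parent.get(candidate)
    return None


if __name__ == "__main__":
    hierarchy = {"kinase_sub1": "kinase_family", "kinase_sub2": "kinase_family"}
    hits = {"kinase_sub1": 80.0, "kinase_sub2": 75.0, "kinase_family": 60.0}
    limits = {"kinase_sub1": 100.0, "kinase_sub2": 90.0, "kinase_family": 50.0}
    # The best subfamily misses its threshold, so the generic parent is used.
    print(assign_domain(hits, limits, hierarchy))
```

Falling back to the parent model trades specificity for correctness, which matches the abstract's observation that most misclassifications were replaced by more generic assignments.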
|
80
|
Chen Z, Harb OS, Roos DS. In silico identification of specialized secretory-organelle proteins in apicomplexan parasites and in vivo validation in Toxoplasma gondii. PLoS One 2008; 3:e3611. [PMID: 18974850 PMCID: PMC2575384 DOI: 10.1371/journal.pone.0003611] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/06/2008] [Accepted: 10/06/2008] [Indexed: 12/04/2022] Open
Abstract
Apicomplexan parasites, including the human pathogens Toxoplasma gondii and Plasmodium falciparum, employ specialized secretory organelles (micronemes, rhoptries, dense granules) to invade and survive within host cells. Because molecules secreted from these organelles function at the host/parasite interface, their identification is important for understanding invasion mechanisms, and central to the development of therapeutic strategies. Using a computational approach based on predicted functional domains, we have identified more than 600 candidate secretory organelle proteins in twelve apicomplexan parasites. Expression in transgenic T. gondii of eight proteins identified in silico confirms that all enter into the secretory pathway, and seven target to apical organelles associated with invasion. An in silico approach intended to identify possible host interacting proteins yields a dataset enriched in secretory/transmembrane proteins, including most of the antigens known to be engaged by apicomplexan parasites during infection. These domain pattern and projected interactome approaches significantly expand the repertoire of proteins that may be involved in host parasite interactions.
Affiliation(s)
- ZhongQiang Chen
- Department of Biology, Penn Genomic Frontiers Institute, and the Graduate Program in Genomics and Computational Biology, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
- Omar S. Harb
- Department of Biology, Penn Genomic Frontiers Institute, and the Graduate Program in Genomics and Computational Biology, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
- * E-mail: (DSR); (OSH)
- David S. Roos
- Department of Biology, Penn Genomic Frontiers Institute, and the Graduate Program in Genomics and Computational Biology, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
- * E-mail: (DSR); (OSH)
|
81
|
Frenkel ZM. Does Protein Relatedness Require Sequence Matching? Alignment via Networks in Sequence Space. J Biomol Struct Dyn 2008; 26:215-22. [DOI: 10.1080/07391102.2008.10507237] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/28/2022]
|
82
|
Cavallaro G, Decaria L, Rosato A. Genome-Based Analysis of Heme Biosynthesis and Uptake in Prokaryotic Systems. J Proteome Res 2008; 7:4946-54. [DOI: 10.1021/pr8004309] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
Affiliation(s)
- Gabriele Cavallaro
- Magnetic Resonance Center (CERM), University of Florence, Via L. Sacconi 6, 50019 Sesto Fiorentino, Italy, and Department of Chemistry, University of Florence, Via della Lastruccia 3, 50019 Sesto Fiorentino, Italy
- Leonardo Decaria
- Magnetic Resonance Center (CERM), University of Florence, Via L. Sacconi 6, 50019 Sesto Fiorentino, Italy, and Department of Chemistry, University of Florence, Via della Lastruccia 3, 50019 Sesto Fiorentino, Italy
- Antonio Rosato
- Magnetic Resonance Center (CERM), University of Florence, Via L. Sacconi 6, 50019 Sesto Fiorentino, Italy, and Department of Chemistry, University of Florence, Via della Lastruccia 3, 50019 Sesto Fiorentino, Italy
|
83
|
Phylogenetic profiles reveal evolutionary relationships within the "twilight zone" of sequence similarity. Proc Natl Acad Sci U S A 2008; 105:13474-9. [PMID: 18765810 DOI: 10.1073/pnas.0803860105] [Citation(s) in RCA: 33] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Open
Abstract
Inferring evolutionary relationships among highly divergent protein sequences is a daunting task. In particular, when pairwise sequence alignments between protein sequences fall below 25% identity, the phylogenetic relationships among sequences cannot be estimated with statistical certainty. Here, we show that phylogenetic profiles generated with the Gestalt Domain Detection Algorithm-Basic Local Alignment Tool (GDDA-BLAST) are capable of deriving, ab initio, phylogenetic relationships for highly divergent proteins in a quantifiable and robust manner. Notably, the results from our computational case study of the highly divergent family of retroelements accord with previous estimates of their evolutionary relationships. Taken together, these data demonstrate that GDDA-BLAST provides an independent and powerful measure of evolutionary relationships that does not rely on potentially subjective sequence alignment. We demonstrate that evolutionary relationships can be measured with phylogenetic profiles, and therefore propose that these measurements can provide key insights into relationships among distantly related and/or rapidly evolving proteins.
|
84
|
Bernsel A, Viklund H, Elofsson A. Remote homology detection of integral membrane proteins using conserved sequence features. Proteins 2008; 71:1387-99. [PMID: 18076048 DOI: 10.1002/prot.21825] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/05/2022]
Abstract
Compared with globular proteins, transmembrane proteins are surrounded by a more intricate environment and, consequently, amino acid composition varies between the different compartments. Existing algorithms for homology detection are generally developed with globular proteins in mind and may not be optimal to detect distant homology between transmembrane proteins. Here, we introduce a new profile-profile based alignment method for remote homology detection of transmembrane proteins in a hidden Markov model framework that takes advantage of the sequence constraints placed by the hydrophobic interior of the membrane. We expect that, for distant membrane protein homologs, even if the sequences have diverged too far to be recognized, the hydrophobicity pattern and the transmembrane topology are better conserved. By using this information in parallel with sequence information, we show that both sensitivity and specificity can be substantially improved for remote homology detection in two independent test sets. In addition, we show that alignment quality can be improved for the most distant homologs in a public dataset of membrane protein structures. Applying the method to the Pfam domain database, we are able to suggest new putative evolutionary relationships for a few relatively uncharacterized protein domain families, of which several are confirmed by other methods. The method is called Searcher for Homology Relationships of Integral Membrane Proteins (SHRIMP) and is available for download at http://www.sbc.su.se/shrimp/.
Affiliation(s)
- Andreas Bernsel
- Center for Biomembrane Research, Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden
|
85
|
Eswar N, Webb B, Marti-Renom MA, Madhusudhan MS, Eramian D, Shen MY, Pieper U, Sali A. Comparative protein structure modeling using MODELLER. Curr Protoc Protein Sci 2008; Chapter 2:Unit 2.9. [PMID: 18429317 DOI: 10.1002/0471140864.ps0209s50] [Citation(s) in RCA: 761] [Impact Index Per Article: 44.8] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/25/2022]
Abstract
Functional characterization of a protein sequence is a common goal in biology, and is usually facilitated by having an accurate three-dimensional (3-D) structure of the studied protein. In the absence of an experimentally determined structure, comparative or homology modeling can sometimes provide a useful 3-D model for a protein that is related to at least one known protein structure. Comparative modeling predicts the 3-D structure of a given protein sequence (target) based primarily on its alignment to one or more proteins of known structure (templates). The prediction process consists of fold assignment, target-template alignment, model building, and model evaluation. This unit describes how to calculate comparative models using the program MODELLER and discusses all four steps of comparative modeling, frequently observed errors, and some applications. Modeling lactate dehydrogenase from Trichomonas vaginalis (TvLDH) is described as an example. The download and installation of the MODELLER software is also described.
Affiliation(s)
- Narayanan Eswar
- University of California at San Francisco, San Francisco, California, USA
|
86
|
Lingner T, Meinicke P. Word correlation matrices for protein sequence analysis and remote homology detection. BMC Bioinformatics 2008; 9:259. [PMID: 18522726 PMCID: PMC2438326 DOI: 10.1186/1471-2105-9-259] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/20/2008] [Accepted: 06/03/2008] [Indexed: 11/30/2022] Open
Abstract
Background: Classification of protein sequences is a central problem in computational biology. Currently, among computational methods, discriminative kernel-based approaches provide the most accurate results. However, kernel-based methods often lack an interpretable model for analysis of discriminative sequence features, and predictions on new sequences usually are computationally expensive. Results: In this work we present a novel kernel for protein sequences based on average word similarity between two sequences. We show that this kernel gives rise to a feature space that allows analysis of discriminative features and fast classification of new sequences. We demonstrate the performance of our approach on a widely-used benchmark setup for protein remote homology detection. Conclusion: Our word correlation approach provides highly competitive performance as compared with state-of-the-art methods for protein remote homology detection. The learned model is interpretable in terms of biologically meaningful features. In particular, analysis of discriminative words allows the identification of characteristic regions in biological sequences. Because of its high computational efficiency, our method can be applied to ranking of potential homologs in large databases.
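The idea of a kernel built from "average word similarity between two sequences" can be illustrated with a small sketch. This is not the authors' kernel: the word length `k`, the function names, and the per-position identity score used as the word similarity are all assumptions chosen to keep the example self-contained.

```python
def words(seq, k):
    """Return all overlapping length-k words (k-mers) of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def word_similarity(u, v):
    """Hypothetical word similarity: fraction of identical positions."""
    return sum(a == b for a, b in zip(u, v)) / len(u)


def average_word_kernel(x, y, k=3):
    """Average the word similarity over every pair of words from x and y."""
    wx, wy = words(x, k), words(y, k)
    if not wx or not wy:
        # A sequence shorter than k has no words to compare.
        return 0.0
    total = sum(word_similarity(u, v) for u in wx for v in wy)
    return total / (len(wx) * len(wy))


if __name__ == "__main__":
    print(average_word_kernel("MKVLAAGIV", "MKVLSAGLV"))
```

Because the value is an average over word pairs, it is symmetric in its two arguments, and replacing `word_similarity` with a substitution-matrix score would change the result without changing the structure.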
Affiliation(s)
- Thomas Lingner
- Department of Bioinformatics, Institute of Microbiology and Genetics, Georg-August-University Göttingen, Göttingen, Germany.
|
87
|
Eddy SR. A probabilistic model of local sequence alignment that simplifies statistical significance estimation. PLoS Comput Biol 2008; 4:e1000069. [PMID: 18516236 PMCID: PMC2396288 DOI: 10.1371/journal.pcbi.1000069] [Citation(s) in RCA: 243] [Impact Index Per Article: 14.3] [Received: 12/05/2007] [Accepted: 03/26/2008] [Indexed: 11/19/2022] Open
Abstract
Sequence database searches require accurate estimation of the statistical significance of scores. Optimal local sequence alignment scores follow Gumbel distributions, but determining an important parameter of the distribution (λ) requires time-consuming computational simulation. Moreover, optimal alignment scores are less powerful than probabilistic scores that integrate over alignment uncertainty (“Forward” scores), but the expected distribution of Forward scores remains unknown. Here, I conjecture that both expected score distributions have simple, predictable forms when full probabilistic modeling methods are used. For a probabilistic model of local sequence alignment, optimal alignment bit scores (“Viterbi” scores) are Gumbel-distributed with constant λ = log 2, and the high scoring tail of Forward scores is exponential with the same constant λ. Simulation studies support these conjectures over a wide range of profile/sequence comparisons, using 9,318 profile-hidden Markov models from the Pfam database. This enables efficient and accurate determination of expectation values (E-values) for both Viterbi and Forward scores for probabilistic local alignments.

Sequence database searches are a fundamental tool of molecular biology, enabling researchers to identify related sequences in other organisms, which often provides invaluable clues to the function and evolutionary history of genes. The power of database searches to detect more and more remote evolutionary relationships – essentially, to look back deeper in time – has improved steadily, with the adoption of more complex and realistic models. However, database searches require not just a realistic scoring model, but also the ability to distinguish good scores from bad ones – the ability to calculate the statistical significance of scores. For many models and scoring schemes, accurate statistical significance calculations have either involved expensive computational simulations, or not been feasible at all. Here, I introduce a probabilistic model of local sequence alignment that has readily predictable score statistics for position-specific profile scoring systems, and not just for traditional optimal alignment scores, but also for more powerful log-likelihood ratio scores derived in a full probabilistic inference framework. These results remove one of the main obstacles that have impeded the use of more powerful and biologically realistic statistical inference methods in sequence homology searches.
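The score statistics stated in this abstract can be turned into a short numerical sketch. The fixed slope λ = log 2 comes from the abstract; the location parameters `mu` and `tau`, the function names, and the conversion of a P-value into an E-value by multiplying by the number of target sequences are assumptions made for illustration, and in practice the locations would have to be fitted for each model.

```python
import math

# Slope conjectured in the abstract for bit scores: lambda = log 2.
LAMBDA = math.log(2)


def viterbi_pvalue(score_bits, mu):
    """P(S >= score) for a Gumbel distribution with location mu (assumed known)."""
    # 1 - exp(-exp(-lambda * (x - mu))), using expm1 for numerical accuracy.
    return -math.expm1(-math.exp(-LAMBDA * (score_bits - mu)))


def forward_tail_pvalue(score_bits, tau):
    """Exponential tail with location tau (assumed known); capped at 1."""
    return min(1.0, math.exp(-LAMBDA * (score_bits - tau)))


def evalue(pvalue, n_targets):
    """Expected number of hits at least this good among n_targets sequences."""
    return pvalue * n_targets


if __name__ == "__main__":
    p = forward_tail_pvalue(25.0, tau=-3.0)
    print(p, evalue(p, n_targets=100_000))
```

With λ fixed at log 2, each additional bit of score halves the tail probability, which is why only the location parameter needs to be estimated per model.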
Affiliation(s)
- Sean R Eddy
- Howard Hughes Medical Institute, Janelia Farm Research Campus, Ashburn, Virginia, United States of America.
|
88
|
Eswar N, Webb B, Marti-Renom MA, Madhusudhan MS, Eramian D, Shen MY, Pieper U, Sali A. Comparative protein structure modeling using Modeller. Curr Protoc Bioinformatics 2008; Chapter 5:Unit 5.6. [PMID: 18428767 DOI: 10.1002/0471250953.bi0506s15] [Citation(s) in RCA: 1820] [Impact Index Per Article: 107.1] [Indexed: 12/28/2022]
Abstract
Functional characterization of a protein sequence is one of the most frequent problems in biology. This task is usually facilitated by an accurate three-dimensional (3-D) structure of the studied protein. In the absence of an experimentally determined structure, comparative or homology modeling can sometimes provide a useful 3-D model for a protein that is related to at least one known protein structure. Comparative modeling predicts the 3-D structure of a given protein sequence (target) based primarily on its alignment to one or more proteins of known structure (templates). The prediction process consists of fold assignment, target-template alignment, model building, and model evaluation. This unit describes how to calculate comparative models using the program MODELLER and discusses all four steps of comparative modeling, frequently observed errors, and some applications. Modeling lactate dehydrogenase from Trichomonas vaginalis (TvLDH) is described as an example. The download and installation of the MODELLER software is also described.
Affiliation(s)
- Narayanan Eswar
- University of California at San Francisco, San Francisco, California
- Ben Webb
- University of California at San Francisco, San Francisco, California
- M S Madhusudhan
- University of California at San Francisco, San Francisco, California
- David Eramian
- University of California at San Francisco, San Francisco, California
- Min-Yi Shen
- University of California at San Francisco, San Francisco, California
- Ursula Pieper
- University of California at San Francisco, San Francisco, California
- Andrej Sali
- University of California at San Francisco, San Francisco, California
|
89
|
Gholami A, Kassis R, Real E, Delmas O, Guadagnini S, Larrous F, Obach D, Prevost MC, Jacob Y, Bourhy H. Mitochondrial dysfunction in lyssavirus-induced apoptosis. J Virol 2008; 82:4774-84. [PMID: 18321977 PMCID: PMC2346764 DOI: 10.1128/jvi.02651-07] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.1] [Received: 12/13/2007] [Accepted: 02/22/2008] [Indexed: 12/25/2022] Open
Abstract
Lyssaviruses are highly neurotropic viruses associated with neuronal apoptosis. Previous observations have indicated that the matrix proteins (M) of some lyssaviruses induce strong neuronal apoptosis. However, the molecular mechanism(s) involved in this phenomenon is still unknown. We show that for Mokola virus (MOK), a lyssavirus of low pathogenicity, the M (M-MOK) targets mitochondria, disrupts the mitochondrial morphology, and induces apoptosis. Our analysis of truncated M-MOK mutants suggests that the information required for efficient mitochondrial targeting and dysfunction, as well as caspase-9 activation and apoptosis, is held between residues 46 and 110 of M-MOK. We used a yeast two-hybrid approach, a coimmunoprecipitation assay, and confocal microscopy to demonstrate that M-MOK physically associates with the subunit I of the cytochrome c (cyt-c) oxidase (CcO) of the mitochondrial respiratory chain; this is in contrast to the M of the highly pathogenic Thailand lyssavirus (M-THA). M-MOK expression induces a significant decrease in CcO activity, which is not the case with M-THA. M-MOK mutations (K77R and N81E) resulting in a similar sequence to M-THA at positions 77 and 81 annul cyt-c release and apoptosis and restore CcO activity. As expected, the reverse mutations, R77K and E81N, introduced in M-THA induce a phenotype similar to that due to M-MOK. These features indicate a novel mechanism for energy depletion during lyssavirus-induced apoptosis.
Affiliation(s)
- Alireza Gholami
- Unité Postulante de Recherche et d'Expertise Dynamique des Lyssavirus et Adaptation à l'Hôte, Institut Pasteur, 25 rue du Docteur Roux, 75724 Paris Cedex 15, France
|
90
|
Lee B, Lee D. DAhunter: a web-based server that identifies homologous proteins by comparing domain architecture. Nucleic Acids Res 2008; 36:W60-4. [PMID: 18411203 PMCID: PMC2447808 DOI: 10.1093/nar/gkn172] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.8] [Indexed: 11/26/2022] Open
Abstract
We present DAhunter, a web-based server that identifies homologous proteins by comparing domain architectures, the organization of protein domains. A major obstacle in comparison of domain architecture is the existence of ‘promiscuous’ domains, which carry out auxiliary functions and appear in many unrelated proteins. To distinguish these promiscuous domains from protein domains, we assigned a weight score to each domain extracted from RefSeq proteins, based on its abundance and versatility. A domain's score represents its importance in the ‘protein world’ and is used in the comparison of domain architectures. In scoring domains, DAhunter also considers domain combinations as well as single domains. To measure the similarity of two domain architectures, we developed several methods that are based on algorithms used in information retrieval (the cosine similarity, the Goodman–Kruskal γ function, and domain duplication index) and then combined these into a similarity score. Compared with other domain architecture algorithms, DAhunter is better at identifying homology. The server is available at http://www.dahunter.kr and http://localodom.kobic.re.kr/dahunter/index.htm
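One of the similarity measures named in this abstract, cosine similarity over weighted domains, can be sketched as follows. This is an illustration only and not DAhunter's scoring: the domain names, the weight values, and the function name are hypothetical, and the Goodman–Kruskal γ function, the domain duplication index, and the combined score are not reproduced.

```python
import math
from collections import Counter


def weighted_cosine(arch_a, arch_b, weights):
    """Cosine similarity between two domain architectures.

    Each architecture is a list of domain names. Every domain count is
    scaled by a per-domain weight (default 1.0), so a low weight reduces
    the influence of a promiscuous domain.
    """
    va = {d: n * weights.get(d, 1.0) for d, n in Counter(arch_a).items()}
    vb = {d: n * weights.get(d, 1.0) for d, n in Counter(arch_b).items()}
    dot = sum(va[d] * vb[d] for d in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical weights: SH3 is treated as promiscuous and down-weighted.
    weights = {"SH3": 0.1}
    print(weighted_cosine(["Kinase", "SH3"], ["SH3", "PDZ"], weights))
    print(weighted_cosine(["Kinase", "SH3"], ["SH3", "PDZ"], {}))
```

In this toy case the two proteins share only the down-weighted domain, so the weighted similarity is much lower than the unweighted one, which is the effect the weighting is meant to achieve.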
Affiliation(s)
- Byungwook Lee
- Korean BioInformation Center, KRIBB, Daejeon 305-806 and Department of Bio and Brain Engineering, KAIST, Daejeon 305-701, Korea.
|
91
|
Lee MM, Chan MK, Bundschuh R. Simple is beautiful: a straightforward approach to improve the delineation of true and false positives in PSI-BLAST searches. Bioinformatics 2008; 24:1339-43. [DOI: 10.1093/bioinformatics/btn130] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.2] [Indexed: 11/13/2022] Open
|
92
|
Durand PM, Coetzer TL. Utility of computational methods to identify the apoptosis machinery in unicellular eukaryotes. Bioinform Biol Insights 2008; 2:101-17. [PMID: 19812769 PMCID: PMC2735952 DOI: 10.4137/bbi.s430] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 12/23/2022] Open
Abstract
Apoptosis is the phenotypic result of an active, regulated process of self-destruction. Following various cellular insults, apoptosis has been demonstrated in numerous unicellular eukaryotes, but very little is known about the genes and proteins that initiate and execute this process in this group of organisms. A bioinformatic approach presents an array of powerful methods to direct investigators in the identification of the apoptosis machinery in protozoans. In this review, we discuss some of the available computational methods and illustrate how they may be applied using the identification of a Plasmodium falciparum metacaspase gene as an example.
Affiliation(s)
- Pierre Marcel Durand
- Department of Molecular Medicine and Haematology, University of the Witwatersrand and National Health Laboratory Service, Johannesburg, South Africa.
|
93
|
Abstract
Most newly sequenced proteins are likely to adopt a similar structure to one which has already been experimentally determined. For this reason, the most successful approaches to protein structure prediction have been template-based methods. Such prediction methods attempt to identify and model the folds of unknown structures by aligning the target sequences to a set of representative template structures within a fold library. In this chapter, I discuss the development of template-based approaches to fold prediction, from the traditional techniques to the recent state-of-the-art methods. I also discuss the recent development of structural annotation databases, which contain models built by aligning the sequences from entire proteomes against known structures. Finally, I run through a practical step-by-step guide for aligning target sequences to known structures and contemplate the future direction of template-based structure prediction.
|
94
|
Abstract
Protein sequence alignment is the task of identifying evolutionarily or structurally related positions in a collection of amino acid sequences. Although the protein alignment problem has been studied for several decades, many recent studies have demonstrated considerable progress in improving the accuracy or scalability of multiple and pairwise alignment tools, or in expanding the scope of tasks handled by an alignment program. In this chapter, we review state-of-the-art protein sequence alignment and provide practical advice for users of alignment tools.
Affiliation(s)
- Chuong B Do
- Computer Science Department, Stanford University, Stanford, CA, USA
|
95
|
|
96
|
|
97
|
Lee MM, Bundschuh R, Chan MK. Distant homology detection using a LEngth and STructure-based sequence Alignment Tool (LESTAT). Proteins 2007; 71:1409-19. [DOI: 10.1002/prot.21830] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 11/10/2022]
|
98
|
Frenkel ZM, Trifonov EN. From protein sequence space to elementary protein modules. Gene 2007; 408:64-71. [PMID: 18022768 DOI: 10.1016/j.gene.2007.10.024] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Received: 05/14/2007] [Revised: 08/14/2007] [Accepted: 10/15/2007] [Indexed: 11/17/2022]
Abstract
The formatted protein sequence space is built from identical-size fragments of prokaryotic proteins (112 complete proteomes). Connecting sequence-wise similar fragments (points in the space) results in the formation of numerous networks that sometimes combine different types of proteins which nevertheless share fragments with similar or distantly related sequences. The networks are mapped onto individual protein sequences, revealing distinct regions (modules) associated with prominent networks with well-defined functional identities. The presence of multiple sites of sequence conservation (modules) in a given protein sequence suggests that the annotated protein function may be decomposed into "elementary" subfunctions of the respective modules. The modules correspond to previously discovered conserved closed loop structures and their sequence prototypes.
Affiliation(s)
- Zakharia M Frenkel
- Genome Diversity Center, Institute of Evolution, University of Haifa, Haifa, Israel.
|
99
|
Goonesekere NCW, Lee B. Context-specific amino acid substitution matrices and their use in the detection of protein homologs. Proteins 2007; 71:910-9. [DOI: 10.1002/prot.21775] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.9] [Indexed: 12/26/2022]
|
100
|
Bernardes JS, Dávila AMR, Costa VS, Zaverucha G. Improving model construction of profile HMMs for remote homology detection through structural alignment. BMC Bioinformatics 2007; 8:435. [PMID: 17999748 PMCID: PMC2245980 DOI: 10.1186/1471-2105-8-435] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.8] [Received: 03/19/2007] [Accepted: 11/09/2007] [Indexed: 11/14/2022] Open
Abstract
BACKGROUND: Remote homology detection is a challenging problem in Bioinformatics. Arguably, profile Hidden Markov Models (pHMMs) are one of the most successful approaches in addressing this important problem. pHMM packages present a relatively small computational cost, and perform particularly well at recognizing remote homologies. This raises the question of whether structural alignments could impact the performance of pHMMs trained from proteins in the Twilight Zone, as structural alignments are often more accurate than sequence alignments at identifying motifs and functional residues. Next, we assess the impact of using structural alignments in pHMM performance. RESULTS: We used the SCOP database to perform our experiments. Structural alignments were obtained using the 3DCOFFEE and MAMMOTH-mult tools; sequence alignments were obtained using CLUSTALW, TCOFFEE, MAFFT and PROBCONS. We performed leave-one-family-out cross-validation over super-families. Performance was evaluated through ROC curves and a paired two-tailed t-test. CONCLUSION: We observed that pHMMs derived from structural alignments performed significantly better than pHMMs derived from sequence alignment in low-identity regions, mainly below 20%. We believe this is because structural alignment tools are better at focusing on the important patterns that are more often conserved through evolution, resulting in higher quality pHMMs. On the other hand, sensitivity of these tools is still quite low for these low-identity regions. Our results suggest a number of possible directions for improvements in this area.
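The leave-one-family-out protocol mentioned in this abstract can be sketched as a simple data split. This is a minimal illustration, not the authors' pipeline: the function name, the dictionary layout, and the toy identifiers are assumptions, and model training, negative-set construction, and ROC evaluation are left out.

```python
def leave_one_family_out(superfamily):
    """Yield (held_out_family, train_ids, test_ids) splits.

    `superfamily` maps a family name to the list of sequence identifiers
    in that family. Each family is held out once as the test set while
    the remaining families of the same superfamily form the training set.
    """
    for held_out in superfamily:
        train = [
            seq_id
            for family, seq_ids in superfamily.items()
            if family != held_out
            for seq_id in seq_ids
        ]
        test = list(superfamily[held_out])
        yield held_out, train, test


if __name__ == "__main__":
    # Toy superfamily with hypothetical family names and sequence ids.
    toy = {"fam1": ["s1", "s2"], "fam2": ["s3"], "fam3": ["s4", "s5"]}
    for family, train, test in leave_one_family_out(toy):
        print(family, train, test)
```

Holding out a whole family, rather than random sequences, means the test sequences are related to the training data only at the superfamily level, which is what makes the task a remote homology test.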
Affiliation(s)
- Juliana S Bernardes
- COPPE, Programa de Engenharia de Sistemas e Computação, Universidade Federal do Rio de Janeiro, Rio de Janeiro, Brazil
- Vítor S Costa
- DCC-FCUP e LIACC, Universidade do Porto, Porto, Portugal
- Gerson Zaverucha
- COPPE, Programa de Engenharia de Sistemas e Computação, Universidade Federal do Rio de Janeiro, Rio de Janeiro, Brazil
|