151
Yan Y, Moult J. Protein Family Clustering for Structural Genomics. J Mol Biol 2005; 353:744-59. [PMID: 16185712 DOI: 10.1016/j.jmb.2005.08.058] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.8] [Received: 03/25/2005] [Revised: 08/18/2005] [Accepted: 08/24/2005] [Indexed: 11/26/2022]
Abstract
A major goal of structural genomics is the provision of a structural template for a large fraction of protein domains. The magnitude of this task depends on the number and nature of protein sequence families. With a large number of bacterial genomes now fully sequenced, it is possible to obtain improved estimates of the number and diversity of families in that kingdom. We have used an automated clustering procedure to group all sequences in a set of genomes into protein families. Benchmarking shows the clustering method is sensitive at detecting remote family members, and has a low level of false positives. This comprehensive protein family set has been used to address the following questions. (1) What is the structure coverage for currently known families? (2) How will the number of known apparent families grow as more genomes are sequenced? (3) What is a practical strategy for maximizing structure coverage in the future? Our study indicates that approximately 20% of known families with three or more members currently have a representative structure. The study indicates also that the number of apparent protein families will be considerably larger than previously thought: we estimate that, by the criteria of this work, there will be about 250,000 protein families when 1000 microbial genomes have been sequenced. However, the vast majority of these families will be small, and it will be possible to obtain structural templates for 70-80% of protein domains with an achievable number of representative structures, by systematically sampling the larger families.
Affiliation(s)
- Yongpan Yan
- Center for Advanced Research in Biotechnology, University of Maryland Biotechnology Institute, 9600 Gudelsky Drive, Rockville, MD 20850, USA
152
Pearson WR, Sierk ML. The limits of protein sequence comparison? Curr Opin Struct Biol 2005; 15:254-60. [PMID: 15919194 PMCID: PMC2845305 DOI: 10.1016/j.sbi.2005.05.005] [Citation(s) in RCA: 58] [Impact Index Per Article: 2.9] [Received: 03/04/2005] [Revised: 04/30/2005] [Accepted: 05/05/2005] [Indexed: 11/29/2022]
Abstract
Modern sequence alignment algorithms are used routinely to identify homologous proteins, proteins that share a common ancestor. Homologous proteins always share similar structures and often have similar functions. Over the past 20 years, sequence comparison has become both more sensitive, largely because of profile-based methods, and more reliable, because of more accurate statistical estimates. As sequence and structure databases become larger, and comparison methods become more powerful, reliable statistical estimates will become even more important for distinguishing similarities that are due to homology from those that are due to analogy (convergence). The newest sequence alignment methods are more sensitive than older methods, but more accurate statistical estimates are needed for their full power to be realized.
Affiliation(s)
- William R Pearson
- Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, VA 22908, USA.
153
Doolittle RF. Evolutionary aspects of whole-genome biology. Curr Opin Struct Biol 2005; 15:248-53. [PMID: 15963888 DOI: 10.1016/j.sbi.2005.04.001] [Citation(s) in RCA: 49] [Impact Index Per Article: 2.5] [Received: 02/08/2005] [Revised: 02/08/2005] [Accepted: 04/12/2005] [Indexed: 11/28/2022]
Abstract
A decade of access to whole-genome sequences has been increasingly revealing about the informational network relating all living organisms. Although at one point there was concern that extensive horizontal gene transfer might hopelessly muddle phylogenies, it has not proved a severe hindrance. The melding of sequence and structural information is being used to great advantage, and the prospect exists that some of the earliest aspects of life on Earth can be reconstructed, including the invention of biosynthetic and metabolic pathways. Still, some fundamental phylogenetic problems remain, including determining the root (if there is one) of the historical relationship between Archaea, Bacteria and Eukarya.
Affiliation(s)
- Russell F Doolittle
- Department of Chemistry & Biochemistry, University of California San Diego, La Jolla, CA 92093-0314, USA.
154
Abstract
MOTIVATION Standard algorithms for pairwise protein sequence alignment make the simplifying assumption that amino acid substitutions at neighboring sites are uncorrelated. This assumption allows implementation of fast algorithms for pairwise sequence alignment, but it ignores information that could conceivably increase the power of remote homolog detection. We examine the validity of this assumption by constructing extended substitution matrices that encapsulate the observed correlations between neighboring sites, by developing an efficient and rigorous algorithm for pairwise protein sequence alignment that incorporates these local substitution correlations and by assessing the ability of this algorithm to detect remote homologies. RESULTS Our analysis indicates that local correlations between substitutions are not strong on the average. Furthermore, incorporating local substitution correlations into pairwise alignment did not lead to a statistically significant improvement in remote homology detection. Therefore, the standard assumption that individual residues within protein sequences evolve independently of neighboring positions appears to be an efficient and appropriate approximation.
Affiliation(s)
- Gavin E Crooks
- Department of Plant and Microbial Biology 111 Koshland Hall #3102 University of California, Berkeley, CA 94720-3102, USA.
155
Price GA, Crooks GE, Green RE, Brenner SE. Statistical evaluation of pairwise protein sequence comparison with the Bayesian bootstrap. Bioinformatics 2005; 21:3824-31. [PMID: 16105900 DOI: 10.1093/bioinformatics/bti627] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.1] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Protein sequence comparison methods are routinely used to infer the intricate network of evolutionary relationships found within the rapidly growing library of protein sequences, and thereby to predict the structure and function of uncharacterized proteins. In the present study, we detail an improved statistical benchmark of pairwise protein sequence comparison algorithms. We use bootstrap resampling techniques to determine standard statistical errors and to estimate the confidence of our conclusions. We show that the underlying structure within benchmark databases causes Efron's standard, non-parametric bootstrap to be biased. Consequently, the standard bootstrap underpredicts average performance when used in the context of evaluating sequence comparison methods. We have developed, as an alternative, an unbiased statistical evaluation based on the Bayesian bootstrap, a resampling method operationally similar to the standard bootstrap. RESULTS We apply our analysis to the comparative study of amino acid substitution matrix families and find that using modern matrices results in a small, but statistically significant improvement in remote homology detection compared with the classic PAM and BLOSUM matrices. AVAILABILITY The sequence sets and code for performing these analyses are available from http://compbio.berkeley.edu/. CONTACT brenner@compbio.berkeley.edu.
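To illustrate the difference between the two resampling schemes named in this abstract, here is a minimal Python sketch. It is not the authors' code: the per-query scores, the function names and the use of a simple mean as the summary statistic are all assumptions made for illustration. The standard bootstrap redraws items with replacement, whereas the Bayesian bootstrap keeps every item and draws continuous weights from a flat Dirichlet distribution.

```python
import random


def standard_bootstrap_mean(values, rng):
    """One replicate of Efron's bootstrap: resample n items with replacement."""
    n = len(values)
    sample = [values[rng.randrange(n)] for _ in range(n)]
    return sum(sample) / n


def bayesian_bootstrap_mean(values, rng):
    """One replicate of the Bayesian bootstrap.

    Every item is kept; the weights are drawn from a flat Dirichlet
    distribution, generated here by normalising Gamma(1, 1) variates.
    """
    gammas = [rng.gammavariate(1.0, 1.0) for _ in values]
    total = sum(gammas)
    return sum(g / total * v for g, v in zip(gammas, values))


def bootstrap_se(values, resample, n_replicates=1000, seed=0):
    """Standard error of the mean, estimated from bootstrap replicates."""
    rng = random.Random(seed)
    reps = [resample(values, rng) for _ in range(n_replicates)]
    mean = sum(reps) / len(reps)
    variance = sum((r - mean) ** 2 for r in reps) / (len(reps) - 1)
    return variance ** 0.5


if __name__ == "__main__":
    # Hypothetical per-query performance scores, for illustration only.
    scores = [0.12, 0.35, 0.41, 0.58, 0.77, 0.90]
    print("standard SE:", bootstrap_se(scores, standard_bootstrap_mean))
    print("Bayesian SE:", bootstrap_se(scores, bayesian_bootstrap_mean))
```

The sketch shows only the weighting step. The bias described in the abstract arises from the grouped structure of benchmark databases, which this toy example does not model.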
Affiliation(s)
- Gavin A Price
- Department of Bioengineering, University of California, Berkeley, 94720, USA
156
Abstract
Modeling a protein structure based on a homologous structure is a standard method in structural biology today. In this process an alignment of a target protein sequence onto the structure of a template(s) is used as input to a program that constructs a 3D model. It has been shown that the most important factor in this process is the correctness of the alignment and the choice of the best template structure(s), while it is generally believed that there are no major differences between the best modeling programs. Therefore, a large number of studies to benchmark the alignment qualities and the selection process have been performed. However, to our knowledge no large-scale benchmark has been performed to evaluate the programs used to transform the alignment to a 3D model. In this study, a benchmark of six different homology modeling programs (Modeller, SegMod/ENCAD, SWISS-MODEL, 3D-JIGSAW, nest, and Builder) is presented. The performance of these programs is evaluated using physicochemical correctness and structural similarity to the correct structure. From our analysis it can be concluded that no single modeling program outperforms the others in all tests. However, it is quite clear that three modeling programs, Modeller, nest, and SegMod/ENCAD, perform better than the others. Interestingly, the fastest and oldest modeling program, SegMod/ENCAD, performs very well, although it was written more than 10 years ago and has not undergone any development since. It can also be observed that none of the homology modeling programs builds side chains as well as a specialized program (SCWRL), and therefore there should be room for improvement.
Affiliation(s)
- Björn Wallner
- Stockholm Bioinformatics Center, Albanova University Center, Stockholm University, Stockholm, Sweden.
157
Margelevičius M, Venclovas Č. PSI-BLAST-ISS: an intermediate sequence search tool for estimation of the position-specific alignment reliability. BMC Bioinformatics 2005; 6:185. [PMID: 16033659 PMCID: PMC1187875 DOI: 10.1186/1471-2105-6-185] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.4] [Received: 03/17/2005] [Accepted: 07/21/2005] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Protein sequence alignments have become indispensable for virtually any evolutionary, structural or functional study involving proteins. Modern sequence search and comparison methods combined with rapidly increasing sequence data often can reliably match even distantly related proteins that share little sequence similarity. However, even highly significant matches generally may have incorrectly aligned regions. Therefore when exact residue correspondence is used to transfer biological information from one aligned sequence to another, it is critical to know which alignment regions are reliable and which may contain alignment errors. RESULTS PSI-BLAST-ISS is a standalone Unix-based tool designed to delineate reliable regions of sequence alignments as well as to suggest potential variants in unreliable regions. The region-specific reliability is assessed by producing multiple sequence alignments in different sequence contexts followed by the analysis of the consistency of alignment variants. The PSI-BLAST-ISS output enables the user to simultaneously analyze alignment reliability between query and multiple homologous sequences. In addition, PSI-BLAST-ISS can be used to detect distantly related homologous proteins. The software is freely available at: http://www.ibt.lt/bioinformatics/iss. CONCLUSION PSI-BLAST-ISS is an effective reliability assessment tool that can be useful in applications such as comparative modelling or analysis of individual sequence regions. It favorably compares with the existing similar software both in the performance and functional features.
158
Johnston CR, Shields DC. A sequence sub-sampling algorithm increases the power to detect distant homologues. Nucleic Acids Res 2005; 33:3772-8. [PMID: 16006623 PMCID: PMC1174907 DOI: 10.1093/nar/gki687] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/14/2022] Open
Abstract
Searching databases for distant homologues using alignments instead of individual sequences increases the power of detection. However, most methods assume that protein evolution proceeds in a regular fashion, with the inferred tree of sequences providing a good estimation of the evolutionary process. We investigated the combined HMMER search results from random alignment subsets (with three sequences each) drawn from the parent alignment (Rand-shuffle algorithm), using the SCOP structural classification to determine true similarities. At false-positive rates of 5%, the Rand-shuffle algorithm improved HMMER's sensitivity, with a 37.5% greater sensitivity compared with HMMER alone, when easily identified similarities (identifiable by BLAST) were excluded from consideration. An extension of the Rand-shuffle algorithm (Ali-shuffle) is weighted towards more informative sequence subsets. This approach improved the performance over HMMER alone and PSI-BLAST, particularly at higher false-positive rates. The improvements in performance of these sequence sub-sampling methods may reflect lower sensitivity to alignment error and irregular evolutionary patterns. The Ali-shuffle and Rand-shuffle sequence homology search programs are available by request from the authors.
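The sub-sampling idea in this abstract can be sketched roughly as follows. This is an illustrative Python outline, not the Rand-shuffle program itself: the function names are hypothetical, the HMMER search step is left out (each subset's hits are assumed to arrive as a dictionary of target to E-value), and keeping the best E-value per target is an assumed combination rule that the abstract does not specify.

```python
import random


def draw_subsets(alignment, n_subsets, subset_size=3, seed=0):
    """Draw random subsets of aligned sequences from a parent alignment.

    `alignment` is a list of aligned sequence strings; each subset holds
    `subset_size` distinct members (three, as stated in the abstract).
    """
    rng = random.Random(seed)
    return [rng.sample(alignment, subset_size) for _ in range(n_subsets)]


def combine_hits(per_subset_hits):
    """Merge search results from several subsets.

    Each element of `per_subset_hits` maps a target identifier to an
    E-value. The best (lowest) E-value per target is kept; this rule is
    an assumption made for the sketch.
    """
    best = {}
    for hits in per_subset_hits:
        for target, evalue in hits.items():
            if target not in best or evalue < best[target]:
                best[target] = evalue
    return sorted(best.items(), key=lambda item: item[1])


if __name__ == "__main__":
    # Toy parent alignment, for illustration only.
    parent = ["MKV-LA", "MKI-LA", "MRVQLA", "MKVQ-A", "MKV-LG"]
    for subset in draw_subsets(parent, n_subsets=2):
        print(subset)
    print(combine_hits([{"d1abc_": 1e-3, "d2xyz_": 0.5}, {"d1abc_": 1e-5}]))
```

In a real pipeline each subset would be turned into a profile and searched against the database before the hits are merged.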
Affiliation(s)
- Catrióna R Johnston
- Department of Clinical Pharmacology, Bioinformatics Group, Royal College of Surgeons in Ireland, 123 St Stephens Green, Dublin 2, Ireland.
159
Choo KH, Tong JC, Zhang L. Recent applications of Hidden Markov Models in computational biology. Genomics Proteomics Bioinformatics 2005; 2:84-96. [PMID: 15629048 PMCID: PMC5172443 DOI: 10.1016/s1672-0229(04)02014-5] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.9] [Indexed: 11/17/2022]
Abstract
This paper examines recent developments and applications of Hidden Markov Models (HMMs) to various problems in computational biology, including multiple sequence alignment, homology detection, protein sequences classification, and genomic annotation.
Affiliation(s)
- Khar Heng Choo
- Department of Biochemistry, National University of Singapore, 10 Kent Ridge Crescent, Singapore 119260
- Joo Chuan Tong
- Department of Biochemistry, National University of Singapore, 10 Kent Ridge Crescent, Singapore 119260
- Louxin Zhang
- Department of Mathematics, National University of Singapore, 2 Science Drive 2, Singapore 117543
- Corresponding author.
160
Sillitoe I, Dibley M, Bray J, Addou S, Orengo C. Assessing strategies for improved superfamily recognition. Protein Sci 2005; 14:1800-10. [PMID: 15937274 PMCID: PMC2253352 DOI: 10.1110/ps.041056105] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.9] [Indexed: 10/25/2022]
Abstract
There are more than 200 completed genomes and over 1 million nonredundant sequences in public repositories. Although the structural data are more sparse (approximately 13,000 nonredundant structures solved to date), several powerful sequence-based methodologies now allow these structures to be mapped onto related regions in a significant proportion of genome sequences. We review a number of publicly available strategies for providing structural annotations for genome sequences, and we describe the protocol adopted to provide CATH structural annotations for completed genomes. In particular, we assess the performance of several sequence-based protocols employing hidden Markov model (HMM) technologies for superfamily recognition, including a new approach (SAMOSA [sequence augmented models of structure alignments]) that exploits multiple structural alignments from the CATH domain structure database when building the models. Using a data set of remote homologs detected by structure comparison and manually validated in CATH, a single-seed HMM library was able to recognize 76% of the data set. Including the SAMOSA models in the HMM library showed little gain in homolog recognition, although a slight improvement in alignment quality was observed for very remote homologs. However, using an expanded 1D-HMM library, CATH-ISL, increased the coverage to 86%. The single-seed HMM library has been used to annotate the protein sequences of 120 genomes from all three major kingdoms, allowing up to 70% of the genes or partial genes to be assigned to CATH superfamilies. It has also been used to recruit sequences from Swiss-Prot and TrEMBL into CATH domain superfamilies, expanding the CATH database eightfold.
Affiliation(s)
- Ian Sillitoe
- Biomolecular Structure and Modelling Unit, Department of Biochemistry and Molecular Biology, University College London, UK
161
Abstract
We can now assign about two thirds of the sequences from completed genomes to as few as 1400 domain families for which structures are known and thus more ancient evolutionary relationships established. About 200 of these domain families are common to all kingdoms of life and account for nearly 50% of domain structure annotations in the genomes. Some of these domain families have been very extensively duplicated within a genome and combined with different domain partners giving rise to different multidomain proteins. The ways in which these domain combinations evolve tend to be specific to the organism so that less than 15% of the protein families found within a genome appear to be common to all kingdoms of life. Recent analyses of completed genomes, exploiting the structural data, have revealed the extent to which duplication of these domains and modifications of their functions can expand the functional repertoire of the organism, contributing to increasing complexity.
Affiliation(s)
- Christine A Orengo
- Department of Biochemistry and Molecular Biology, University College, London WC1E 6BT, United Kingdom.
162
Weston J, Leslie C, Ie E, Zhou D, Elisseeff A, Noble WS. Semi-supervised protein classification using cluster kernels. Bioinformatics 2005; 21:3241-7. [PMID: 15905279 DOI: 10.1093/bioinformatics/bti497] [Citation(s) in RCA: 127] [Impact Index Per Article: 6.4] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Building an accurate protein classification system depends critically upon choosing a good representation of the input sequences of amino acids. Recent work using string kernels for protein data has achieved state-of-the-art classification performance. However, such representations are based only on labeled data (examples with known 3D structures, organized into structural classes), whereas in practice, unlabeled data are far more plentiful. RESULTS In this work, we develop simple and scalable cluster kernel techniques for incorporating unlabeled data into the representation of protein sequences. We show that our methods greatly improve the classification performance of string kernels and outperform standard approaches for using unlabeled data, such as adding close homologs of the positive examples to the training data. We achieve equal or superior performance to previously presented cluster kernel methods while at the same time achieving far greater computational efficiency. AVAILABILITY Source code is available at www.kyb.tuebingen.mpg.de/bs/people/weston/semiprot. The Spider MATLAB package is available at www.kyb.tuebingen.mpg.de/bs/people/spider. SUPPLEMENTARY INFORMATION www.kyb.tuebingen.mpg.de/bs/people/weston/semiprot.
Affiliation(s)
- Jason Weston
- NEC Research Institute, 4 Independence Way, Princeton, NJ 08540, USA.
163
Song J, Bonner CA, Wolinsky M, Jensen RA. The TyrA family of aromatic-pathway dehydrogenases in phylogenetic context. BMC Biol 2005; 3:13. [PMID: 15888209 PMCID: PMC1173090 DOI: 10.1186/1741-7007-3-13] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.4] [Received: 02/19/2005] [Accepted: 05/12/2005] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The TyrA protein family includes members that catalyze two dehydrogenase reactions in distinct pathways leading to L-tyrosine and a third reaction that is not part of tyrosine biosynthesis. Family members share a catalytic core region of about 30 kDa, where inhibitors operate competitively by acting as substrate mimics. This protein family typifies many that are challenging for bioinformatic analysis because of relatively modest sequence conservation and small size. RESULTS Phylogenetic relationships of TyrA domains were evaluated in the context of combinatorial patterns of specificity for the two substrates, as well as the presence or absence of a variety of fusions. An interactive tool is provided for prediction of substrate specificity. Interactive alignments for a suite of catalytic-core TyrA domains of differing specificity are also provided to facilitate phylogenetic analysis. tyrA membership in apparent operons (or supraoperons) was examined, and patterns of conserved synteny in relationship to organismal positions on the 16S rRNA tree were ascertained for members of the domain Bacteria. A number of aromatic-pathway genes (hisHb, aroF, aroQ) have fused with tyrA, and it must be more than coincidental that the free-standing counterparts of all of the latter fused genes exhibit a distinct trace of syntenic association. CONCLUSION We propose that the ancestral TyrA dehydrogenase had broad specificity for both the cyclohexadienyl and pyridine nucleotide substrates. Indeed, TyrA proteins of this type persist today, but it is also common to find instances of narrowed substrate specificities, as well as of acquisition via gene fusion of additional catalytic domains or regulatory domains. In some clades a qualitative change associated with either narrowed substrate specificity or gene fusion has produced an evolutionary "jump" in the vertical genealogy of TyrA homologs. 
The evolutionary history of gene organizations that include tyrA can be deduced in genome assemblages of sufficiently close relatives, the most fruitful opportunities currently being in the Proteobacteria. The evolution of TyrA proteins within the broader context of how their regulation evolved and to what extent TyrA co-evolved with other genes as common members of aromatic-pathway regulons is now feasible as an emerging topic of ongoing inquiry.
Affiliation(s)
- Jian Song
- Los Alamos National Laboratory, Los Alamos, New Mexico, 87545, USA
- Carol A Bonner
- Emerson Hall, University of Florida, P.O. Box 14425, Gainesville, Florida, 32604-2425, USA
- Murray Wolinsky
- Los Alamos National Laboratory, Los Alamos, New Mexico, 87545, USA
- Roy A Jensen
- Los Alamos National Laboratory, Los Alamos, New Mexico, 87545, USA
- Emerson Hall, University of Florida, P.O. Box 14425, Gainesville, Florida, 32604-2425, USA
164
Blades MJ, Ison JC, Ranasinghe R, Findlay JBC. Automatic generation and evaluation of sparse protein signatures for families of protein structural domains. Protein Sci 2005; 14:13-23. [PMID: 15608116 PMCID: PMC2253312 DOI: 10.1110/ps.04929005] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 10/26/2022]
Abstract
We identified key residues from the structural alignment of families of protein domains from SCOP which we represented in the form of sparse protein signatures. A signature-generating algorithm (SigGen) was developed and used to automatically identify key residues based on several structural and sequence-based criteria. The capacity of the signatures to detect related sequences from the SWISSPROT database was assessed by receiver operator characteristic (ROC) analysis and jack-knife testing. Test signatures for families from each of the main SCOP classes are described in relation to the quality of the structural alignments, the SigGen parameters used, and their diagnostic performance. We show that automatically generated signatures are potently diagnostic for their family (ROC50 scores typically >0.8), consistently outperform random signatures, and can identify sequence relationships in the "twilight zone" of protein sequence similarity (<40%). Signatures based on 15%-30% of alignment positions occurred most frequently among the best-performing signatures. When alignment quality is poor, sparser signatures perform better, whereas signatures generated from higher-quality alignments of fewer structures require more positions to be diagnostic. Our validation of signatures from the Globin family shows that when sequences from the structural alignment are removed and new signatures generated, the omitted sequences are still detected. The positions highlighted by the signature often correspond (alignment specificity >0.7) to the key positions in the original (non-jack-knifed) alignment. We discuss potential applications of sparse signatures in sequence annotation and homology modeling.
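The ROC50 scores quoted in this abstract are a truncated ROC measure. The Python sketch below uses a commonly cited definition of ROC_n (the area under the ROC curve up to the n-th false positive, normalised to lie between 0 and 1); whether the authors computed their scores in exactly this way is an assumption, and the function name and input format are hypothetical.

```python
def roc_n(ranked_labels, n=50, total_positives=None):
    """ROC_n score for a ranked hit list.

    `ranked_labels` lists the hits from best to worst score, True for a
    true family member and False for a false positive. The score sums,
    for each of the first n false positives, the number of true positives
    ranked above it, then divides by n * total_positives.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if total_positives is None:
        total_positives = sum(1 for label in ranked_labels if label)
    if total_positives == 0:
        return 0.0
    true_seen = 0
    false_seen = 0
    area = 0
    for is_true in ranked_labels:
        if is_true:
            true_seen += 1
        else:
            false_seen += 1
            area += true_seen
            if false_seen == n:
                break
    # If the list ends before n false positives were found, treat the
    # missing false positives as ranked below every listed hit.
    area += true_seen * (n - false_seen)
    return area / (n * total_positives)


if __name__ == "__main__":
    print(roc_n([True, True, False, False], n=2))   # perfect ranking
    print(roc_n([False, True, True, False], n=2))   # one early false positive
```

A score of 1.0 means every true member ranks above the first false positive; lower values indicate false positives mixed in among the true members.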
Affiliation(s)
- Matthew J Blades
- AstraZeneca R&D Charnwood, Bakewell Road, Loughborough, Leicestershire LE11 5RH, England.
165
Pearl F, Todd A, Sillitoe I, Dibley M, Redfern O, Lewis T, Bennett C, Marsden R, Grant A, Lee D, Akpor A, Maibaum M, Harrison A, Dallman T, Reeves G, Diboun I, Addou S, Lise S, Johnston C, Sillero A, Thornton J, Orengo C. The CATH Domain Structure Database and related resources Gene3D and DHS provide comprehensive domain family information for genome analysis. Nucleic Acids Res 2005; 33:D247-51. [PMID: 15608188 PMCID: PMC539978 DOI: 10.1093/nar/gki024] [Citation(s) in RCA: 185] [Impact Index Per Article: 9.3] [Indexed: 11/13/2022] Open
Abstract
The CATH database of protein domain structures (http://www.biochem.ucl.ac.uk/bsm/cath/) currently contains 43,229 domains classified into 1467 superfamilies and 5107 sequence families. Each structural family is expanded with sequence relatives from GenBank and completed genomes, using a variety of efficient sequence search protocols and reliable thresholds. This extended CATH protein family database contains 616,470 domain sequences classified into 23,876 sequence families. This results in the significant expansion of the CATH HMM model library to include models built from the CATH sequence relatives, giving a 10% increase in coverage for detecting remote homologues. An improved Dictionary of Homologous superfamilies (DHS) (http://www.biochem.ucl.ac.uk/bsm/dhs/) containing specific sequence, structural and functional information for each superfamily in CATH considerably assists manual validation of homologues. Information on sequence relatives in CATH superfamilies, GenBank and completed genomes is presented in the CATH associated DHS and Gene3D resources. Domain partnership information can be obtained from Gene3D (http://www.biochem.ucl.ac.uk/bsm/cath/Gene3D/). A new CATH server has been implemented (http://www.biochem.ucl.ac.uk/cgi-bin/cath/CathServer.pl) providing automatic classification of newly determined sequences and structures using a suite of rapid sequence and structure comparison methods. The statistical significance of matches is assessed and links are provided to the putative superfamily or fold group to which the query sequence or structure is assigned.
Affiliation(s)
- Frances Pearl
- Biochemistry and Molecular Biology Department, University College London, University of London, Gower Street, London WC1E 6BT, UK
166
Wistrand M, Sonnhammer ELL. Improved profile HMM performance by assessment of critical algorithmic features in SAM and HMMER. BMC Bioinformatics 2005; 6:99. [PMID: 15831105 PMCID: PMC1097716 DOI: 10.1186/1471-2105-6-99] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.2] [Received: 02/01/2005] [Accepted: 04/15/2005] [Indexed: 11/24/2022] Open
Abstract
BACKGROUND Profile hidden Markov model (HMM) techniques are among the most powerful methods for protein homology detection. Yet, the critical features for successful modelling are not fully known. In the present work we approached this by using two of the most popular HMM packages: SAM and HMMER. The programs' abilities to build models and score sequences were compared on a SCOP/Pfam based test set. The comparison was done separately for local and global HMM scoring. RESULTS Using default settings, SAM was overall more sensitive. SAM's model estimation was superior, while HMMER's model scoring was more accurate. Critical features for model building were then analysed by comparing the two packages' algorithmic choices and parameters. The weighting between prior probabilities and multiple alignment counts held the primary explanation why SAM's model building was superior. Our analysis suggests that HMMER gives too much weight to the sequence counts. SAM's emission prior probabilities were also shown to be more sensitive. The relative sequence weighting schemes are different in the two packages but performed equivalently. CONCLUSION SAM model estimation was more sensitive, while HMMER model scoring was more accurate. By combining the best algorithmic features from both packages the accuracy was substantially improved compared to their default performance.
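The trade-off between prior probabilities and alignment counts mentioned in this abstract can be illustrated with a deliberately simplified pseudocount estimator. This Python sketch is an assumption-laden toy: it uses a single background prior and one scalar weight, which is not claimed to be the scheme implemented in SAM or HMMER, and the function name and the four-letter alphabet are hypothetical.

```python
def emission_probs(counts, prior, prior_weight):
    """Blend observed column counts with a prior distribution.

    p(a) = (c_a + w * q_a) / (N + w), where c_a is the (weighted) count of
    residue a in an alignment column, N the total count, q_a the prior
    probability and w the weight given to the prior. A larger w pulls the
    estimate towards the prior; a smaller w trusts the counts more.
    """
    total = sum(counts.values())
    denominator = total + prior_weight
    if denominator <= 0:
        raise ValueError("need at least one count or a positive prior weight")
    return {
        residue: (counts.get(residue, 0.0) + prior_weight * q) / denominator
        for residue, q in prior.items()
    }


if __name__ == "__main__":
    # Toy four-letter alphabet with a uniform prior, for illustration only.
    prior = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    column = {"A": 3, "G": 1}
    print(emission_probs(column, prior, prior_weight=0.0))
    print(emission_probs(column, prior, prior_weight=4.0))
```

With a weight of zero the estimate equals the raw column frequencies; increasing the weight moves unseen residues away from zero probability, which is the kind of balance the abstract identifies as critical for model building.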
Affiliation(s)
- Markus Wistrand
- Center for Genomics and Bioinformatics, Karolinska Institutet, S-17177 Stockholm, Sweden
- Erik LL Sonnhammer
- Center for Genomics and Bioinformatics, Karolinska Institutet, S-17177 Stockholm, Sweden
167
Pellegrini-Calace M, Thornton JM. Detecting DNA-binding helix-turn-helix structural motifs using sequence and structure information. Nucleic Acids Res 2005; 33:2129-40. [PMID: 15831786 PMCID: PMC1079965 DOI: 10.1093/nar/gki349] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.0] [Indexed: 11/18/2022] Open
Abstract
In this work, we analyse the potential for using structural knowledge to improve the detection of the DNA-binding helix–turn–helix (HTH) motif from sequence. Starting from a set of DNA-binding protein structures that include a functional HTH motif and have no apparent sequence similarity to each other, two different libraries of hidden Markov models (HMMs) were built. One library included sequence models of whole DNA-binding domains, which incorporate the HTH motif; the second library included shorter models of ‘partial’ domains, representing only the fraction of the domain that corresponds to the functionally relevant HTH motif itself. The libraries were scanned against a dataset of protein sequences, some containing the HTH motifs, others not. HMM predictions were compared with the results obtained from a previously published structure-based method and subsequently combined with it. The combined method proved more effective than either of the single-featured approaches, showing that the information carried by motif sequences and by motif structures is to some extent complementary and can successfully be used together for the detection of DNA-binding HTHs in proteins of unknown function.
|
168
|
Anand B, Gowri VS, Srinivasan N. Use of multiple profiles corresponding to a sequence alignment enables effective detection of remote homologues. Bioinformatics 2005; 21:2821-6. [PMID: 15817691 DOI: 10.1093/bioinformatics/bti432] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.3] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Position specific scoring matrices (PSSMs) corresponding to aligned sequences of homologous proteins are commonly used in homology detection. A PSSM is generated on the basis of one of the homologues as a reference sequence, which is the query in the case of PSI-BLAST searches. The reference sequence is chosen arbitrarily while generating PSSMs for reverse BLAST searches. In this work we demonstrate that the use of multiple PSSMs corresponding to a given alignment and variable reference sequences is more effective than using traditional single PSSMs and hidden Markov models. RESULTS Searches for proteins with known 3-D structures have been made against three databases of protein family profiles corresponding to known structures: (1) One PSSM per family; (2) multiple PSSMs corresponding to an alignment and variable reference sequences for every family; and (3) hidden Markov models. A comparison of the performances of these three approaches suggests that the use of multiple PSSMs is most effective. CONTACT ns@mbu.iisc.ernet.in.
Affiliation(s)
- B Anand
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore 560 012, India
|
169
|
Faux NG, Bottomley SP, Lesk AM, Irving JA, Morrison JR, de la Banda MG, Whisstock JC. Functional insights from the distribution and role of homopeptide repeat-containing proteins. Genome Res 2005; 15:537-51. [PMID: 15805494 PMCID: PMC1074368 DOI: 10.1101/gr.3096505] [Citation(s) in RCA: 151] [Impact Index Per Article: 7.6] [Indexed: 10/25/2022]
Abstract
Expansion of "low complex" repeats of amino acids such as glutamine (Poly-Q) is associated with protein misfolding and the development of degenerative diseases such as Huntington's disease. The mechanism by which such regions promote misfolding remains controversial, the function of many repeat-containing proteins (RCPs) remains obscure, and the role (if any) of repeat regions remains to be determined. Here, a Web-accessible database of RCPs is presented. The distribution and evolution of RCPs that contain homopeptide repeat tracts are considered, and the existence of functional patterns investigated. Generally, it is found that while polyamino acid repeats are extremely rare in prokaryotes, several eukaryote putative homologs of prokaryote RCP, involved in important housekeeping processes, retain the repetitive region, suggesting an ancient origin for certain repeats. Within eukarya, the most common uninterrupted amino acid repeats are glutamine, asparagine, and alanine. Interestingly, while poly-Q repeats are found in vertebrates and nonvertebrates, poly-N repeats are only common in more primitive nonvertebrate organisms, such as insects and nematodes. We have assigned function to eukaryote RCPs using Online Mendelian Inheritance in Man (OMIM), the Human Protein Reference Database (HPRD), FlyBase, and Wormpep. Prokaryote RCPs were annotated using BLASTp searches and Gene Ontology. These data reveal that the majority of RCPs are involved in processes that require the assembly of large, multiprotein complexes, such as transcription and signaling.
Affiliation(s)
- Noel G Faux
- Protein Crystallography Unit, Department of Biochemistry and Molecular Biology, School of Computer Science and Software Engineering, Monash University, Clayton Campus, Melbourne, VIC 3800, Australia
|
170
|
Pallen MJ, Beatson SA, Bailey CM. Bioinformatics analysis of the locus for enterocyte effacement provides novel insights into type-III secretion. BMC Microbiol 2005; 5:9. [PMID: 15757514 PMCID: PMC1084347 DOI: 10.1186/1471-2180-5-9] [Citation(s) in RCA: 91] [Impact Index Per Article: 4.6] [Received: 12/21/2004] [Accepted: 03/09/2005] [Indexed: 12/17/2022] Open
Abstract
Background Like many other pathogens, enterohaemorrhagic and enteropathogenic strains of Escherichia coli employ a type-III secretion system to translocate bacterial effector proteins into host cells, where they then disrupt a range of cellular functions. This system is encoded by the locus for enterocyte effacement. Many of the genes within this locus have been assigned names and functions through homology with the better characterised Ysc-Yop system from Yersinia spp. However, the functions and homologies of many LEE genes remain obscure. Results We have performed a fresh bioinformatics analysis of the LEE. Using PSI-BLAST we have been able to identify several novel homologies between LEE-encoded and Ysc-Yop-associated proteins: Orf2/YscE, Orf5/YscL, rORF8/EscI, SepQ/YscQ, SepL/YopN-TyeA, CesD2/LcrR. In addition, we highlight homology between EspA and flagellin, and report many new homologues of the chaperone CesT. Conclusion We conclude that the vast majority of LEE-encoded proteins do indeed possess homologues and that homology data can be used in combination with experimental data to make fresh functional predictions.
Affiliation(s)
- Mark J Pallen
- Bacterial Pathogenesis and Genomics Unit, Division of Immunity and Infection, Medical School, University of Birmingham, Birmingham, B15 2TT, UK
- Scott A Beatson
- Bacterial Pathogenesis and Genomics Unit, Division of Immunity and Infection, Medical School, University of Birmingham, Birmingham, B15 2TT, UK
- Christopher M Bailey
- Bacterial Pathogenesis and Genomics Unit, Division of Immunity and Infection, Medical School, University of Birmingham, Birmingham, B15 2TT, UK
|
171
|
Abstract
MOTIVATION Evolutionary conservation estimated from a multiple sequence alignment is a powerful indicator of the functional significance of a residue and helps to predict active sites, ligand binding sites, and protein interaction interfaces. Many algorithms that calculate conservation work well, provided an accurate and balanced alignment is used. However, such a strong dependence on the alignment makes the results highly variable. We attempted to improve the conservation prediction algorithm by making it more robust and less sensitive to (1) local alignment errors, (2) overrepresentation of sequences in some branches and (3) occasional presence of unrelated sequences. RESULTS A novel method is presented for robust constrained Bayesian estimation of evolutionary rates that avoids overfitting independent rates and satisfies the above requirements. The method is evaluated and compared with an entropy-based conservation measure on a set of 1494 protein interfaces. We demonstrated that approximately 62% of the analyzed protein interfaces are more conserved than the remaining surface at the 5% significance level. A consistent method to incorporate alignment reliability is proposed and demonstrated to reduce arbitrary variation of calculated rates upon inclusion of distantly related or unrelated sequences into the alignment.
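For orientation, the kind of entropy-based conservation measure used as the comparison baseline in this abstract can be sketched as below. This is a generic, unweighted Shannon entropy per alignment column, assumed here for illustration; it is not the Bayesian rate estimation the abstract proposes, and the function names and toy alignment are hypothetical.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of the residues in one alignment column.

    Gap characters ('-') are ignored. A column made only of gaps is given
    0.0, which is an arbitrary choice made for this sketch.
    """
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0
    n = len(residues)
    counts = Counter(residues)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def alignment_entropies(alignment):
    """Entropy of every column of an alignment given as equal-length rows."""
    return [column_entropy(col) for col in zip(*alignment)]


# Toy alignment: low entropy means a conserved column.
toy_alignment = ["ACDE", "ACDF", "AGDG", "A-DH"]
print([round(h, 3) for h in alignment_entropies(toy_alignment)])
```

Because every sequence counts equally here, adding many near-identical sequences from one branch shifts the result, which is one of the sensitivities the abstract sets out to reduce.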
Affiliation(s)
- Andrew J Bordner
- Molsoft LLC, 3366 North Torrey Pines Court, Suite 300, La Jolla, CA 92037, USA.
|
172
|
|
173
|
Pirun M, Babnigg G, Stevens FJ. Template-based recognition of protein fold within the midnight and twilight zones of protein sequence similarity. J Mol Recognit 2005; 18:203-12. [PMID: 15540237 DOI: 10.1002/jmr.728] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/08/2022]
Abstract
Most homologous pairs of proteins have no significant sequence similarity to each other and are not identified by direct sequence comparison or profile-based strategies. However, multiple sequence alignments of low similarity homologues typically reveal a limited number of positions that are well conserved despite diversity of function. It may be inferred that conservation at most of these positions is the result of the importance of the contribution of these amino acids to the folding and stability of the protein. As such, these amino acids and their relative positions may define a structural signature. We demonstrate that extraction of this fold template provides the basis for the sequence database to be searched for patterns consistent with the fold, enabling identification of homologs that are not recognized by global sequence analysis. The fold template method was developed to address the need for a tool that could comprehensively search the midnight and twilight zones of protein sequence similarity without reliance on global statistical significance. Manual implementations of the fold template method were performed on three folds: immunoglobulin, c-lectin and TIM barrel. Following proof of concept of the template method, an automated version of the approach was developed. This automated fold template method was used to develop fold templates for 10 of the more populated folds in the SCOP database. The fold template method developed three-dimensional structural motifs or signatures that were able to return a diverse collection of proteins, while maintaining a low false positive rate. Although the results of the manual fold template method were more comprehensive than the automated fold template method, the diversity of the results from the automated fold template method surpassed those of current methods that rely on statistical significance to infer evolutionary relationships among divergent proteins.
Affiliation(s)
- Mono Pirun
- Department of Bioengineering, University of Illinois at Chicago, 60607, USA
|
174
|
Stevens FJ. Efficient recognition of protein fold at low sequence identity by conservative application of Psi-BLAST: validation. J Mol Recognit 2005; 18:139-49. [PMID: 15558595 DOI: 10.1002/jmr.721] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Indexed: 11/07/2022]
Abstract
A substantial fraction of protein sequences derived from genomic analyses is currently classified as representing 'hypothetical proteins of unknown function'. In part, this reflects the limitations of methods for comparison of sequences with very low identity. We evaluated the effectiveness of a Psi-BLAST search strategy to identify proteins of similar fold at low sequence identity. Psi-BLAST searches for structurally characterized low-sequence-identity matches were carried out on a set of over 300 proteins of known structure. Searches were conducted in NCBI's non-redundant database and were limited to three rounds. Some 614 potential homologs with 25% or lower sequence identity to 166 members of the search set were obtained. Disregarding the expect value, level of sequence identity and span of alignment, correspondence of fold between the target and potential homolog was found in more than 95% of the Psi-BLAST matches. Restrictions on expect value or span of alignment improved the false positive rate at the expense of eliminating many true homologs. Approximately three-quarters of the putative homologs obtained by three rounds of Psi-BLAST revealed no significant sequence similarity to the target protein upon direct sequence comparison by BLAST, and therefore could not be found by a conventional search. Although three rounds of Psi-BLAST identified many more homologs than a standard BLAST search, most homologs were undetected. It appears that more than 80% of all homologs to a target protein may be characterized by a lack of significant sequence similarity. We suggest that conservative use of Psi-BLAST has the potential to propose experimentally testable functions for the majority of proteins currently annotated as 'hypothetical proteins of unknown function'.
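The 25% sequence identity cut-off mentioned in this abstract depends on how identity is counted. The sketch below shows one common convention, identical residues divided by the number of aligned (non-gap) positions; the abstract does not state which convention was used, so this choice, the function name, and the toy sequences are assumptions.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity between two rows of a pairwise alignment.

    Only columns where both rows have a residue are counted; columns with
    a gap ('-') in either row are skipped. Other conventions divide by the
    shorter sequence length or by the full alignment length instead.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned rows must have the same length")
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b)
             if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


# Toy alignment: 4 identical residues over 7 aligned positions.
identity = percent_identity("MKT-AYIA", "MRTQAHLA")
print(round(identity, 1), identity <= 25.0)
```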
Affiliation(s)
- F J Stevens
- Biosciences Division, Argonne National Laboratory, Argonne, IL 60439, USA.
|
175
|
Kann MG, Thiessen PA, Panchenko AR, Schäffer AA, Altschul SF, Bryant SH. A structure-based method for protein sequence alignment. Bioinformatics 2004; 21:1451-6. [PMID: 15613392 DOI: 10.1093/bioinformatics/bti233] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.9] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION With the continuing rapid growth of protein sequence data, protein sequence comparison methods have become the most widely used tools of bioinformatics. Among these methods are those that use position-specific scoring matrices (PSSMs) to describe protein families. PSSMs can capture information about conserved patterns within families, which can be used to increase the sensitivity of searches for related sequences. Certain types of structural information, however, are not generally captured by PSSM search methods. Here we introduce a program, Structure-based ALignment TOol (SALTO), that aligns protein query sequences to PSSMs using rules for placing and scoring gaps that are consistent with the conserved regions of domain alignments from NCBI's Conserved Domain Database. RESULTS In most cases, the alignment scores obtained using the local alignment version follow an extreme value distribution. SALTO's performance in finding related sequences and producing accurate alignments is similar to or better than that of IMPALA; one advantage of SALTO is that it imposes an explicit gapping model on each protein family. AVAILABILITY A stand-alone version of the program that can generate global or local alignments is available by ftp distribution (ftp://ftp.ncbi.nih.gov/pub/SALTO/), and has been incorporated into the Cn3D structure/alignment viewer. CONTACT bryant@ncbi.nlm.nih.gov.
Affiliation(s)
- Maricel G Kann
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Department of Health and Human Services, Bethesda, MD 20894, USA
|
176
|
Abstract
MOTIVATION The observed correlations between pairs of homologous protein sequences are typically explained in terms of a Markovian dynamic of amino acid substitution. This model assumes that every location on the protein sequence has the same background distribution of amino acids, an assumption that is incompatible with the observed heterogeneity of protein amino acid profiles and with the success of profile multiple sequence alignment. RESULTS We propose an alternative model of amino acid replacement during protein evolution based upon the assumption that the variation of the amino acid background distribution from one residue to the next is sufficient to explain the observed sequence correlations of homologs. The resulting dynamical model of independent replacements drawn from heterogeneous backgrounds is simple and consistent, and provides a unified homology match score for sequence-sequence, sequence-profile and profile-profile alignment.
Affiliation(s)
- Gavin E Crooks
- Department of Plant and Microbial Biology 111 Koshland Hall #3102 University of California Berkeley, CA 94720-3102, USA.
|
177
|
Liu J, Hegyi H, Acton TB, Montelione GT, Rost B. Automatic target selection for structural genomics on eukaryotes. Proteins 2004; 56:188-200. [PMID: 15211504 DOI: 10.1002/prot.20012] [Citation(s) in RCA: 56] [Impact Index Per Article: 2.7] [Indexed: 11/06/2022]
Abstract
A central goal of structural genomics is to experimentally determine representative structures for all protein families. At least 14 structural genomics pilot projects are currently investigating the feasibility of high-throughput structure determination; the National Institutes of Health funded nine of these in the United States. Initiatives differ in the particular subset of "all families" on which they focus. At the NorthEast Structural Genomics consortium (NESG), we target eukaryotic protein domain families. The automatic target selection procedure has three aims: 1) identify all protein domain families from currently five entirely sequenced eukaryotic target organisms based on their sequence homology, 2) discard those families that can be modeled on the basis of structural information already present in the PDB, and 3) target representatives of the remaining families for structure determination. To guarantee that all members of one family share a common foldlike region, we had to begin by dissecting proteins into structural domain-like regions before clustering. Our hierarchical approach, CHOP, utilizing homology to PrISM, Pfam-A, and SWISS-PROT chopped the 103,796 eukaryotic proteins/ORFs into 247,222 fragments. Of these fragments, 122,999 appeared suitable targets that were grouped into >27,000 singletons and >18,000 multifragment clusters. Thus, our results suggested that it might be necessary to determine >40,000 structures to minimally cover the subset of five eukaryotic proteomes.
Affiliation(s)
- Jinfeng Liu
- CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York 10032, USA
|
178
|
Abstract
Seven protein structure comparison methods and two sequence comparison programs were evaluated on their ability to detect either protein homologs or domains with the same topology (fold) as defined by the CATH structure database. The structure alignment programs Dali, Structal, Combinatorial Extension (CE), VAST, and Matras were tested along with SGM and PRIDE, which calculate a structural distance between two domains without aligning them. We also tested two sequence alignment programs, SSEARCH and PSI-BLAST. Depending upon the level of selectivity and error model, structure alignment programs can detect roughly twice as many homologous domains in CATH as sequence alignment programs. Dali finds the most homologs, 321-533 of 1120 possible true positives (28.7%-45.7%), at an error rate of 0.1 errors per query (EPQ), whereas PSI-BLAST finds 365 true positives (32.6%), regardless of the error model. At an EPQ of 1.0, Dali finds 42%-70% of possible homologs, whereas Matras finds 49%-57%; PSI-BLAST finds 36.9%. However, Dali achieves >84% coverage before the first error for half of the families tested. Dali and PSI-BLAST find 9.2% and 5.2%, respectively, of the 7056 possible topology pairs at an EPQ of 0.1, and 19.5% and 5.9% at an EPQ of 1.0. Most statistical significance estimates reported by the structural alignment programs overestimate the significance of an alignment by orders of magnitude when compared with the actual distribution of errors. These results help quantify the statistical distinction between analogous and homologous structures, and provide a benchmark for structure comparison statistics.
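The errors-per-query (EPQ) figures in this abstract can be read with the help of a small sketch. It assumes one plausible convention: hits from all queries are pooled and ranked by score, and true positives are counted until the number of false hits exceeds EPQ times the number of queries. The abstract does not spell out the exact counting rule, so the function name, the data layout, and the toy hit list are hypothetical.

```python
def true_positives_at_epq(hits, n_queries, max_epq):
    """Count true positives found before the error budget is exceeded.

    hits: iterable of (score, is_true) pairs pooled over all queries,
          where a higher score means a more confident hit.
    The error budget is max_epq * n_queries false hits.
    """
    allowed_errors = max_epq * n_queries
    true_pos = 0
    errors = 0
    for _score, is_true in sorted(hits, key=lambda h: h[0], reverse=True):
        if is_true:
            true_pos += 1
        else:
            errors += 1
            if errors > allowed_errors:
                break
    return true_pos


# Toy pooled hit list for two queries.
toy_hits = [(95, True), (90, True), (80, False),
            (70, True), (60, False), (50, True)]
print(true_positives_at_epq(toy_hits, n_queries=2, max_epq=0.0))
print(true_positives_at_epq(toy_hits, n_queries=2, max_epq=0.5))
```

Dividing the returned count by the number of possible true positives gives coverage; for example, the 365 of 1120 reported for PSI-BLAST corresponds to the stated 32.6%.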
Affiliation(s)
- Michael L Sierk
- Department of Biochemistry and Molecular Genetics, University of Virginia Health System, Charlottesville, VA 22908, USA
|
179
|
Abstract
The accuracy of an alignment between two protein sequences can be improved by including other detectably related sequences in the comparison. We optimize and benchmark such an approach that relies on aligning two multiple sequence alignments, each one including one of the two protein sequences. Thirteen different protocols for creating and comparing profiles corresponding to the multiple sequence alignments are implemented in the SALIGN command of MODELLER. A test set of 200 pairwise, structure-based alignments with sequence identities below 40% is used to benchmark the 13 protocols as well as a number of previously described sequence alignment methods, including heuristic pairwise sequence alignment by BLAST, pairwise sequence alignment by global dynamic programming with an affine gap penalty function by the ALIGN command of MODELLER, sequence-profile alignment by PSI-BLAST, Hidden Markov Model methods implemented in SAM and LOBSTER, pairwise sequence alignment relying on predicted local structure by SEA, and multiple sequence alignment by CLUSTALW and COMPASS. The alignment accuracies of the best new protocols were significantly better than those of the other tested methods. For example, the fraction of the correctly aligned residues relative to the structure-based alignment by the best protocol is 56%, which can be compared with the accuracies of 26%, 42%, 43%, 48%, 50%, 49%, 43%, and 43% for the other methods, respectively. The new method is currently applied to large-scale comparative protein structure modeling of all known sequences.
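The accuracy figure quoted in this abstract, the fraction of correctly aligned residues relative to a structure-based alignment, can be sketched as follows. The sketch assumes alignments are given as two gapped rows and that a residue pair is "correct" when the same pair of residue indices appears in the reference; the function names and toy alignments are hypothetical and this is not the MODELLER implementation.

```python
def aligned_pairs(row_a, row_b):
    """Set of (index_in_a, index_in_b) residue pairs in a gapped alignment."""
    pairs = set()
    i = j = 0
    for x, y in zip(row_a, row_b):
        if x != "-" and y != "-":
            pairs.add((i, j))
        if x != "-":
            i += 1
        if y != "-":
            j += 1
    return pairs


def fraction_correct(test_alignment, reference_alignment):
    """Share of reference residue pairs reproduced by the test alignment."""
    reference = aligned_pairs(*reference_alignment)
    if not reference:
        return 0.0
    test = aligned_pairs(*test_alignment)
    return len(reference & test) / len(reference)


# Toy case: the test alignment misplaces one gap, so 4 of 5 pairs agree.
reference = ("ACD-EF", "ACDGEF")
test = ("ACDE-F", "ACDGEF")
print(fraction_correct(test, reference))
```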
Affiliation(s)
- Marc A Marti-Renom
- Mission Bay Genentech Hall, University of California, San Francisco, San Francisco, CA 94143, USA.
|
180
|
Wistrand M, Sonnhammer ELL. Transition priors for protein hidden Markov models: an empirical study towards maximum discrimination. J Comput Biol 2004; 11:181-93. [PMID: 15072695 DOI: 10.1089/106652704773416957] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.1] [Indexed: 11/13/2022] Open
Abstract
Insertions and deletions in a profile hidden Markov model (HMM) are modeled by transition probabilities between insert, delete and match states. These are estimated by combining observed data and prior probabilities. The transition prior probabilities can be defined either ad hoc or by maximum likelihood (ML) estimation. We show that the choice of transition prior greatly affects the HMM's ability to discriminate between true and false hits. HMM discrimination was measured using the HMMER 2.2 package applied to 373 families from Pfam. We measured the discrimination between true members and noise sequences employing various ML transition priors and also systematically scanned the parameter space of ad hoc transition priors. Our results indicate that ML priors produce far from optimal discrimination, and we present an empirically derived prior that considerably decreases the number of misclassifications compared to ML. Most of the difference stems from the probabilities for exiting a delete state. The ML prior, which is unaware of noise sequences, estimates a delete-to-delete probability that is relatively high and does not penalize noise sequences enough for optimal discrimination.
Affiliation(s)
- Markus Wistrand
- Center for Genomics and Bioinformatics, Karolinska Institutet, S-17177 Stockholm, Sweden
|
181
|
Theobald DL, Wuttke DS. Prediction of Multiple Tandem OB-Fold Domains in Telomere End-Binding Proteins Pot1 and Cdc13. Structure 2004; 12:1877-9. [PMID: 15458635 DOI: 10.1016/j.str.2004.07.015] [Citation(s) in RCA: 49] [Impact Index Per Article: 2.3] [Received: 06/18/2004] [Revised: 07/21/2004] [Accepted: 07/31/2004] [Indexed: 10/26/2022]
Abstract
The heterodimeric Oxytricha nova telomere end binding protein, the original telomere end binding protein characterized, contains four OB-fold domains used for recognition of single-stranded telomeric DNA. In contrast, only solitary OB-fold domains have been found in the telomere end binding proteins from yeast and higher eukaryotes. Using a sliding-window algorithm coupled with sequence profile-profile analysis, we provide support for the existence of multiple OB-fold domains in two other telomeric ssDNA binding proteins, vertebrate Pot1 and budding yeast Cdc13. This common usage of multiple, tandem OB-fold domains in telomeric end binding proteins extends the known evolutionary conservation of eukaryotic end-protection mechanisms.
Affiliation(s)
- Douglas L Theobald
- Department of Chemistry and Biochemistry, University of Colorado at Boulder, Boulder, CO 80309, USA
|
182
|
Magnani E, Sjölander K, Hake S. From endonucleases to transcription factors: evolution of the AP2 DNA binding domain in plants. Plant Cell 2004; 16:2265-77. [PMID: 15319480 PMCID: PMC520932 DOI: 10.1105/tpc.104.023135] [Citation(s) in RCA: 181] [Impact Index Per Article: 8.6] [Received: 04/08/2004] [Accepted: 06/17/2004] [Indexed: 05/18/2023]
Abstract
All members of the AP2/ERF family of plant transcription regulators contain at least one copy of a DNA binding domain called the AP2 domain. The AP2 domain has been considered plant specific. Here, we show that homologs are present in the cyanobacterium Trichodesmium erythraeum, the ciliate Tetrahymena thermophila, and the viruses Enterobacteria phage Rb49 and Bacteriophage Felix 01. We demonstrate that the T. erythraeum AP2 domain selectively binds stretches of poly(dG)/poly(dC) showing functional conservation with plant AP2/ERF proteins. The newly discovered nonplant proteins bearing an AP2 domain are predicted to be HNH endonucleases. Sequence conservation extends outside the AP2 domain to include part of the endonuclease domain for the T. erythraeum protein and some plant AP2/ERF proteins. Our phylogenetic analysis of the broader family of AP2 domains supports the possibility of lateral gene transfer. We hypothesize that a horizontal transfer of an HNH-AP2 endonuclease from bacteria or viruses into plants may have led to the origin of the AP2/ERF family of transcription factors via transposition and homing processes.
Affiliation(s)
- Enrico Magnani
- Department of Plant and Microbial Biology, University of California, Berkeley, California 94720, USA.
|
183
|
Tramontano A, Morea V. Exploiting evolutionary relationships for predicting protein structures. Biotechnol Bioeng 2004; 84:756-62. [PMID: 14708116 DOI: 10.1002/bit.10850] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.4] [Indexed: 11/10/2022]
Abstract
In the last few years there have been many developments in computational biology, particularly with regard to novel, imaginative exploitation of genomic data. Disappointingly, there has been a lack of progress in the methodology for prediction of protein structures. In the last several years, however, promising new methods have finally begun to emerge. These methods are increasing the power and scope of the methodology, but, most importantly, they are generating new areas of investigation that we believe will accelerate progress in the field. In this review we describe recent developments and highlight the implications of their success as well as areas where efforts should be focused.
Affiliation(s)
- Anna Tramontano
- Department of Biochemical Sciences A. Rossi Fanelli, University La Sapienza, P. le Aldo Moro 5, 00185 Rome, Italy.
|
184
|
Hou Y, Hsu W, Lee ML, Bystroff C. Remote homolog detection using local sequence-structure correlations. Proteins 2004; 57:518-30. [DOI: 10.1002/prot.20221] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.5] [Indexed: 11/07/2022]
|
185
|
Tomiki T, Saitou N. Phylogenetic Analysis of Proteins Associated in the Four Major Energy Metabolism Systems: Photosynthesis, Aerobic Respiration, Denitrification, and Sulfur Respiration. J Mol Evol 2004; 59:158-76. [PMID: 15486691 DOI: 10.1007/s00239-004-2610-2] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.6] [Received: 09/22/2003] [Accepted: 11/28/2004] [Indexed: 11/27/2022]
Abstract
The four electron transfer energy metabolism systems, photosynthesis, aerobic respiration, denitrification, and sulfur respiration, are thought to be evolutionarily related because of the similarity of electron transfer patterns and the existence of some homologous proteins. How these systems have evolved is elusive. We therefore conducted a comprehensive homology search using PSI-BLAST, and phylogenetic analyses were conducted for the three homologous groups (groups 1-3) based on multiple alignments of domains defined in the Pfam database. There are five electron transfer types important for catalytic reaction in group 1, and many proteins bind molybdenum. Deletions of two domains led to loss of the function of binding molybdenum and ferredoxin, and these deletions seem to be critical for the electron transfer pattern changes in group 1. Two types of electron transfer were found in group 2, and all its member proteins bind siroheme and ferredoxin. Insertion of the pyridine nucleotide disulfide oxidoreductase domain seemed to be the critical point for the electron transfer pattern change in this group. The proteins belonging to group 3 are all flavin enzymes, and they bind flavin adenine dinucleotide (FAD) or flavin mononucleotide (FMN). Types of electron transfer in this group are divergent, but there are two common characteristics. NAD(P)H works as an electron donor or acceptor, and FAD or FMN transfers electrons from/to NAD(P)H. Electron transfer functions might be added to these common characteristics by the addition of functional domains through the evolution of group 3 proteins. Based on the phylogenetic analyses in this study and previous studies, we inferred the phylogeny of the energy metabolism systems as follows: photosynthesis (and possibly aerobic respiration) and the sulfur/nitrogen assimilation system first diverged, then the sulfur/nitrogen dissimilation system was produced from the latter system.
Affiliation(s)
- Takeshi Tomiki
- Division of Population Genetics, National Institute of Genetics, and Department of Genetics, School of Life Sciences, Graduate University for Advanced Studies, Mishima, Japan
|
186
|
Abstract
We developed a variant of the intermediate sequence search method (ISS(new)) for detection and alignment of weakly similar pairs of protein sequences. ISS(new) relates two query sequences by an intermediate sequence that is potentially homologous to both queries. The improvement was achieved by a more robust overlap score for a match between the queries through an intermediate. The approach was benchmarked on a data set of 2369 sequences of known structure with insignificant sequence similarity to each other (BLAST E-value larger than 0.001); 2050 of these sequences had a related structure in the set. ISS(new) performed significantly better than both PSI-BLAST and a previously described intermediate sequence search method. PSI-BLAST could not detect correct homologs for 1619 of the 2369 sequences. In contrast, ISS(new) assigned a correct homolog as the top hit for 121 of these 1619 sequences, while incorrectly assigning homologs for only nine targets; it did not assign homologs for the remainder of the sequences. By estimate, ISS(new) may be able to assign the folds of domains in approximately 29,000 of the approximately 500,000 sequences unassigned by PSI-BLAST, with 90% specificity (1 - false positives fraction). In addition, we show that the 15 alignments with the most significant BLAST E-values include the nearly best alignments constructed by ISS(new).
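The idea of relating two queries through an intermediate sequence can be sketched in a few lines. The sketch assumes that each query's hits are recorded as half-open intervals on the intermediate and that a link is accepted when those intervals overlap by a minimum number of residues. This plain overlap length is a stand-in chosen for illustration, not the overlap score of ISS(new), and the names, the threshold of 30, and the toy hits are all hypothetical.

```python
def interval_overlap(a, b):
    """Length of the overlap between two half-open intervals (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def link_through_intermediates(hits_a, hits_b, min_overlap=30):
    """Find intermediates that both queries match over a shared region.

    hits_a, hits_b: dicts mapping an intermediate sequence id to the
    (start, end) region of that intermediate matched by query A or B.
    Returns (intermediate_id, overlap_length) pairs, longest overlap first.
    """
    links = []
    for inter_id in hits_a.keys() & hits_b.keys():
        overlap = interval_overlap(hits_a[inter_id], hits_b[inter_id])
        if overlap >= min_overlap:
            links.append((inter_id, overlap))
    return sorted(links, key=lambda link: link[1], reverse=True)


# Toy hits: only I1 is matched by both queries over a shared region.
hits_query_a = {"I1": (10, 120), "I2": (5, 60)}
hits_query_b = {"I1": (80, 200), "I2": (70, 150), "I3": (0, 50)}
print(link_through_intermediates(hits_query_a, hits_query_b))
```

Requiring the two matches to cover the same region of the intermediate guards against linking queries that each resemble a different domain of a multidomain intermediate.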
Affiliation(s)
- Bino John
- Laboratory of Molecular Biophysics, Pels Family Center for Biochemistry and Structural Biology, The Rockefeller University, New York, New York 10021, USA
|
187
|
|
188
|
Cameron M, Williams HE, Cannane A. Improved gapped alignment in BLAST. IEEE/ACM Trans Comput Biol Bioinform 2004; 1:116-29. [PMID: 17048387 DOI: 10.1109/tcbb.2004.32] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.1] [Indexed: 05/12/2023]
Abstract
Homology search is a key tool for understanding the role, structure, and biochemical function of genomic sequences. The most popular technique for rapid homology search is BLAST, which has been in widespread use within universities, research centers, and commercial enterprises since the early 1990s. In this paper, we propose a new step in the BLAST algorithm to reduce the computational cost of searching with negligible effect on accuracy. This new step, semigapped alignment, compromises between the efficiency of ungapped alignment and the accuracy of gapped alignment, allowing BLAST to accurately filter sequences with lower computational cost. In addition, we propose a heuristic, restricted insertion alignment, that avoids unlikely evolutionary paths with the aim of reducing gapped alignment cost with negligible effect on accuracy. Together, after including an optimization of the local alignment recursion, our two techniques more than double the speed of the gapped alignment stages in BLAST. We conclude that our techniques are an important improvement to the BLAST algorithm. Source code for the alignment algorithms is available for download at http://www.bsg.rmit.edu.au/iga/.
Affiliation(s)
- Michael Cameron
- School of Computer Science and Information Technology, RMIT University, GPO Box 2476V, Melbourne, Australia.
|
189
|
Nakamura T, Motoyama T, Hirokawa T, Hirono S, Yamaguchi I. Computer-aided modeling of pentachlorophenol 4-monooxygenase and site-directed mutagenesis of its active site. Chem Pharm Bull (Tokyo) 2004; 51:1293-8. [PMID: 14600375 DOI: 10.1248/cpb.51.1293] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 11/22/2022]
Abstract
Homology modeling was used to construct a model of the three-dimensional structure of pentachlorophenol 4-monooxygenase (PcpB). A PSI-BLAST homology search was initially performed to identify the 3D structure of proteins homologous with PcpB. The feasibility of modeled structures of PcpB was evaluated by Verify3D, which calculated structural compatibility scores based on 3D-1D profiles. The predicted structure of PcpB had an acceptable 3D-1D self-compatibility score, beyond the incorrect fold score threshold. A PcpB-pentachlorophenol (PCP) complex was then constructed utilizing the modeled PcpB structure. After energy minimization of the complex, and successive minimizations of the system that consisted of the complex and the water layer surrounding the complex, the molecular dynamics of the system were simulated. The active-site residues of PcpB were identified on the basis of the modeled structure, and PcpB mutants were then designed to change the active site residues, expressed, and purified by affinity chromatography. The mutant activity was compared with that of the wild-type to investigate the validity of the modeled structure. The experimental results suggested that Phe85, Tyr216, and Arg235 were relevant to enzyme activity, and that Tyr397 and Phe87 were important for stabilization of the structure of PcpB.
Affiliation(s)
- Takashi Nakamura
- Laboratory for Remediation Research, Environmental Plant Research Group, Plant Science Center, RIKEN Institute, Yokohama, Kanagawa, Japan.
|
190
|
Goonesekere NCW, Lee B. Frequency of gaps observed in a structurally aligned protein pair database suggests a simple gap penalty function. Nucleic Acids Res 2004; 32:2838-43. [PMID: 15155852 PMCID: PMC419611 DOI: 10.1093/nar/gkh610] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.7] [Indexed: 11/14/2022] Open
Abstract
Gap penalty is an important component of the scoring scheme that is needed when searching for homologous proteins and for accurate alignment of protein sequences. Most homology search and sequence alignment algorithms employ a heuristic 'affine gap penalty' scheme q + r × n, in which q is the penalty for opening a gap, r the penalty for extending it and n the gap length. In order to devise a more rational scoring scheme, we examined the pattern of gaps that occur in a database of structurally aligned protein domain pairs. We find that the logarithm of the frequency of gaps varies linearly with the length of the gap, but with a break at a gap of length 3, and is well approximated by two linear regression lines with R² values of 1.0 and 0.99. The bilinear behavior is retained when gaps are categorized by secondary structures of the two residues flanking the gap. Similar results were obtained when another, totally independent, structurally aligned protein pair database was used. These results suggest a modification of the affine gap penalty function.
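To make the two penalty shapes in this abstract concrete, the sketch below implements the standard affine penalty q + r × n next to one possible two-slope ("bilinear") variant with a break at gap length 3. The function names and all numeric defaults are placeholder assumptions; the abstract reports only that the log-frequency of gaps follows two linear segments, not the fitted slopes or the final penalty function proposed in the paper.

```python
def affine_gap_penalty(n, q=11.0, r=1.0):
    """Affine penalty q + r * n for a gap of length n (n >= 1)."""
    if n < 1:
        raise ValueError("gap length must be at least 1")
    return q + r * n


def bilinear_gap_penalty(n, q=11.0, r_short=2.0, r_long=1.0, break_len=3):
    """Two-slope penalty: r_short per residue up to break_len, r_long after.

    The penalty is continuous at break_len. Slopes are illustrative only.
    """
    if n < 1:
        raise ValueError("gap length must be at least 1")
    if n <= break_len:
        return q + r_short * n
    return q + r_short * break_len + r_long * (n - break_len)


if __name__ == "__main__":
    for length in range(1, 7):
        print(length, affine_gap_penalty(length), bilinear_gap_penalty(length))
```

Because a penalty is usually taken as proportional to the negative log-frequency of the event, a log-frequency curve made of two straight segments translates directly into a penalty with two extension slopes.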
Affiliation(s)
- Nalin C W Goonesekere
- Laboratory of Molecular Biology, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Building 37, Room 5120, 37 Convent Drive MSC 4264, Bethesda, MD 20892-4264, USA
|
191
|
Ohlson T, Wallner B, Elofsson A. Profile-profile methods provide improved fold-recognition: A study of different profile-profile alignment methods. Proteins 2004; 57:188-97. [PMID: 15326603 DOI: 10.1002/prot.20184] [Citation(s) in RCA: 81] [Impact Index Per Article: 3.9] [Indexed: 11/05/2022]
Abstract
To improve the detection of related proteins, it is often useful to include evolutionary information for both the query and target proteins. One method to include this information is by the use of profile-profile alignments, where a profile from the query protein is compared with the profiles from the target proteins. Profile-profile alignments can be implemented in several fundamentally different ways. The similarity between two positions can be calculated using a dot-product, a probabilistic model, or an information theoretical measure. Here, we present a large-scale comparison of different profile-profile alignment methods. We show that the profile-profile methods perform at least 30% better than standard sequence-profile methods both in their ability to recognize superfamily-related proteins and in the quality of the obtained alignments. Although the performance of all methods is quite similar, profile-profile methods that use a probabilistic scoring function have an advantage as they can create good alignments and show a good fold recognition capacity using the same gap-penalties, while the other methods need to use different parameters to obtain comparable performances.
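The abstract lists three families of column-to-column scores for profile-profile alignment. As a small illustration, the sketch below scores two profile columns (amino-acid probability vectors) with a dot product and with the Jensen-Shannon divergence. Choosing Jensen-Shannon divergence as the information-theoretical measure is an assumption made here for illustration, as are the function names; the abstract does not say which specific formulas were compared.

```python
import math


def dot_product_score(col_p, col_q):
    """Similarity of two profile columns as the dot product of their
    probability vectors (higher means more similar)."""
    return sum(p * q for p, q in zip(col_p, col_q))


def jensen_shannon_divergence(col_p, col_q):
    """Jensen-Shannon divergence in bits between two probability vectors.

    0.0 for identical columns, 1.0 for columns with no shared residues.
    """
    mean = [(p + q) / 2.0 for p, q in zip(col_p, col_q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(col_p, mean) + 0.5 * kl(col_q, mean)
```

A divergence is a dissimilarity, so an alignment procedure would typically convert it to a score, for example by using one minus the divergence, before running dynamic programming.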
Affiliation(s)
- Tomas Ohlson
- Stockholm Bioinformatics Center, Stockholm University, Stockholm, Sweden
|
192
|
Coin L, Bateman A, Durbin R. Enhanced protein domain discovery using taxonomy. BMC Bioinformatics 2004; 5:56. [PMID: 15137915 PMCID: PMC434490 DOI: 10.1186/1471-2105-5-56] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.6] [Received: 02/05/2004] [Accepted: 05/11/2004] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND It is well known that different species have different protein domain repertoires, and indeed that some protein domains are kingdom specific. This information has not yet been incorporated into statistical methods for finding domains in sequences of amino acids. RESULTS We show that by incorporating our understanding of the taxonomic distribution of specific protein domains, we can enhance domain recognition in protein sequences. We identify 4447 new instances of Pfam domains in the SP-TREMBL database using this technique, equivalent to the coverage increase given by the last 8.3% of Pfam families and to a 0.7% increase in the number of domain predictions. We use PSI-BLAST to cross-validate our new predictions. We also benchmark our approach using a SCOP test set of proteins of known structure, and demonstrate improvements relative to standard Hidden Markov model techniques. CONCLUSIONS Explicitly including knowledge about the taxonomic distribution of protein domains can enhance protein domain recognition. Our method can also incorporate other context-specific domain distributions - such as domain co-occurrence and protein localisation.
Affiliation(s)
- Lachlan Coin
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK
- Alex Bateman
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK
- Richard Durbin
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK
|
193
|
Wistrand M, Sonnhammer ELL. Improving Profile HMM Discrimination by Adapting Transition Probabilities. J Mol Biol 2004; 338:847-54. [PMID: 15099750 DOI: 10.1016/j.jmb.2004.03.023] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.1] [Received: 11/20/2003] [Revised: 02/25/2004] [Accepted: 03/04/2004] [Indexed: 12/21/2022]
Abstract
Profile hidden Markov models (HMMs) are used to model protein families and for detecting evolutionary relationships between proteins. Such a profile HMM is typically constructed from a multiple alignment of a set of related sequences. Transition probability parameters in an HMM are used to model insertions and deletions in the alignment. We show here that taking into account unrelated sequences when estimating the transition probability parameters helps to construct more discriminative models for the global/local alignment mode. After normal HMM training, a simple heuristic is employed that adjusts the transition probabilities between match and delete states according to observed transitions in the training set relative to the unrelated (noise) set. The method is called adaptive transition probabilities (ATP) and is based on the HMMER package implementation. It was benchmarked in two remote homology tests based on the Pfam and the SCOP classifications. Compared to the HMMER default procedure, the rate of misclassification was reduced significantly in both tests and across all levels of error rate.
Affiliation(s)
- Markus Wistrand
- Center for Genomics and Bioinformatics, Karolinska Institutet, S-17177 Stockholm, Sweden
|
194
|
Weston J, Elisseeff A, Zhou D, Leslie CS, Noble WS. Protein ranking: from local to global structure in the protein similarity network. Proc Natl Acad Sci U S A 2004; 101:6559-63. [PMID: 15087500 PMCID: PMC404084 DOI: 10.1073/pnas.0308067101] [Citation(s) in RCA: 73] [Impact Index Per Article: 3.5] [Indexed: 02/03/2023] Open
Abstract
Biologists regularly search databases of DNA or protein sequences for evolutionary or functional relationships to a given query sequence. We describe a ranking algorithm that exploits the entire network structure of similarity relationships among proteins in a sequence database by performing a diffusion operation on a precomputed, weighted network. The resulting ranking algorithm, evaluated by using a human-curated database of protein structures, is efficient and provides significantly better rankings than a local network search algorithm such as PSI-BLAST.
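The idea of ranking by diffusion over a precomputed similarity network can be sketched generically as follows. This is not the algorithm from the paper: the update rule, the `alpha` damping factor, the row normalisation, and the clamping of the query score are all assumptions chosen to give a small, convergent example. The input is assumed to be a dict of dicts in which every neighbour also appears as a top-level key.

```python
def diffusion_rank(weights, query, alpha=0.5, iterations=50):
    """Rank nodes by activation diffused from the query node.

    weights[u][v] is a non-negative similarity weight from u to v.
    Returns (node, score) pairs sorted by decreasing score.
    """
    nodes = list(weights)
    # Normalise outgoing weights so each node passes on at most its own score.
    norm = {}
    for u, neighbours in weights.items():
        total = sum(neighbours.values())
        norm[u] = {v: w / total for v, w in neighbours.items()} if total > 0 else {}

    score = {u: 0.0 for u in nodes}
    score[query] = 1.0
    for _ in range(iterations):
        new_score = {u: 0.0 for u in nodes}
        for u in nodes:
            for v, w in norm[u].items():
                new_score[v] += alpha * w * score[u]
        new_score[query] = 1.0  # keep the query as a constant source
        score = new_score
    return sorted(score.items(), key=lambda item: (-item[1], item[0]))
```

In a network where B is similar to A and A is similar to the query, B receives a score through A even though it has no direct edge to the query, which is the behaviour a purely local search would miss.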
Affiliation(s)
- Jason Weston
- NEC Laboratories America, 4 Independence Way, Princeton, NJ 08540, USA
|
195
|
Bhaduri A, Pugalenthi G, Sowdhamini R. PASS2: an automated database of protein alignments organised as structural superfamilies. BMC Bioinformatics 2004; 5:35. [PMID: 15059245 PMCID: PMC407847 DOI: 10.1186/1471-2105-5-35] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.4] [Received: 11/28/2003] [Accepted: 04/02/2004] [Indexed: 12/02/2022] Open
Abstract
Background The functional selection and three-dimensional structural constraints of proteins in nature often relate to the retention of significant sequence similarity between proteins of similar fold and function despite poor sequence identity. Organization of structure-based sequence alignments for distantly related proteins provides a map of the conserved and critical regions of the protein universe that is useful for the analysis of folding principles, for the evolutionary unification of protein families and for maximizing the information return from experimental structure determination. The Protein Alignment organised as Structural Superfamily (PASS2) database represents continuously updated, structural alignments for evolutionarily related, sequentially distant proteins. Description An automated and updated version of PASS2 is in direct correspondence with SCOP 1.63 and consists of sequences having identity below 40% among themselves. Protein domains have been grouped into 628 multi-member superfamilies and 566 single member superfamilies. Structure-based sequence alignments for the superfamilies have been obtained using COMPARER, while initial equivalencies have been derived from a preliminary superposition using LSQMAN or STAMP 4.0. The final sequence alignments have been annotated for structural features using JOY4.0. The database is supplemented with sequence relatives belonging to different genomes, conserved spatially interacting and structural motifs, probabilistic hidden Markov models of superfamilies based on the alignments and useful links to other databases. Probabilistic models and sensitive position specific profiles obtained from reliable superfamily alignments aid annotation of remote homologues and are useful tools in structural and functional genomics. PASS2 presents the phylogeny of its members both based on sequence and structural dissimilarities. Clustering of members allows us to understand diversification of the family members. The search engine has been improved for simpler browsing of the database. Conclusions The database resolves alignments among the structural domains consisting of an evolutionarily diverged set of sequences. Availability of reliable sequence alignments of distantly related proteins despite poor sequence identity and single-member superfamilies permits better sampling of structures in libraries for fold recognition of new sequences and for the understanding of protein structure-function relationships of individual superfamilies. PASS2 is accessible at
Affiliation(s)
- Anirban Bhaduri
- National Centre for Biological Sciences, Tata Institute of Fundamental Research, UAS-GKVK campus, Bellary Road, Bangalore, Karnataka 560 065, India
- Ganesan Pugalenthi
- National Centre for Biological Sciences, Tata Institute of Fundamental Research, UAS-GKVK campus, Bellary Road, Bangalore, Karnataka 560 065, India
- Ramanathan Sowdhamini
- National Centre for Biological Sciences, Tata Institute of Fundamental Research, UAS-GKVK campus, Bellary Road, Bangalore, Karnataka 560 065, India
|
196
|
Wallner B, Fang H, Ohlson T, Frey-Skött J, Elofsson A. Using evolutionary information for the query and target improves fold recognition. Proteins 2004; 54:342-50. [PMID: 14696196 DOI: 10.1002/prot.10565] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.0] [Indexed: 11/07/2022]
Abstract
In this study, we show that it is possible to increase the performance over PSI-BLAST by using evolutionary information for both query and target sequences. This information can be used in three different ways: by sequence linking, profile-profile alignments, and by combining sequence-profile and profile-sequence searches. If only PSI-BLAST is used, 16% of superfamily-related protein domains can be detected at 90% specificity, but if a sequence-profile and a profile-sequence search are combined, this is increased to 20%; profile-profile searches detect 19%, whereas a linking procedure identifies 22% of these proteins. All three methods show equal performance, but the best combination of speed and accuracy seems to be obtained by the combined searches, because this method shows a good performance even at high specificity and the lowest computational cost. In addition, we show that the E-values reported by all these methods, including PSI-BLAST, underestimate the true rate of false positives. This behavior is seen even if a very strict E-value cutoff and a limited number of iterations are used. However, the difference is more pronounced with a looser E-value cutoff and more iterations.
Affiliation(s)
- Björn Wallner
- Stockholm Bioinformatics Center, Stockholm University, Stockholm, Sweden
|
197
|
Soeria-Atmadja D, Zorzet A, Gustafsson MG, Hammerling U. Statistical Evaluation of Local Alignment Features Predicting Allergenicity Using Supervised Classification Algorithms. Int Arch Allergy Immunol 2004; 133:101-12. [PMID: 14739578 DOI: 10.1159/000076382] [Citation(s) in RCA: 46] [Impact Index Per Article: 2.2] [Received: 05/21/2003] [Accepted: 10/07/2003] [Indexed: 11/19/2022] Open
Abstract
BACKGROUND Recently, two promising alignment-based features predicting food allergenicity using the k nearest neighbor (kNN) classifier were reported. These features are the alignment score and alignment length of the best local alignment obtained in a database of known allergen sequences. METHODS In the work reported here a much more comprehensive statistical evaluation of the potential of these features was performed, this time for the prediction of allergenicity in general. The evaluation consisted of the following four key components. (1) A new high quality database consisting of 318 carefully selected, non-redundant allergens and 1,007 sequences carefully selected to be non-allergens. (2) Three different supervised algorithms: the kNN classifier, the Bayesian linear Gaussian classifier, and the Bayesian quadratic Gaussian classifier. (3) A large set of local alignment procedures defined using the FASTA3 alignment program by means of a wide range of different parameter settings. (4) Novel performance curves, alternative to conventional receiver-operating characteristic curves, to display not only average behaviors but also statistical variations due to small data sets. RESULTS The linear Gaussian classifier proved most useful among the tested supervised machine learning algorithms, closely followed by the quadratic Gaussian equivalent and kNN. The overall best classification results were obtained with a novel feature vector consisting of the combined alignment scores derived from local alignment procedures using different substitution matrices. CONCLUSIONS The models reported here should be useful as a part of an integrated assessment scheme for potential protein allergenicity and for future comparisons with alternative bioinformatic approaches.
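The kNN approach on the two alignment features named in this abstract (best local alignment score and alignment length) can be sketched as below. The function name, the unscaled Euclidean distance, and the majority vote are assumptions for illustration, and any feature values passed in are the caller's own; nothing here reproduces the data set, the FASTA3 settings, or the performance curves used in the study.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify a (score, length) feature pair by majority vote among its
    k nearest training examples.

    train is a list of ((alignment_score, alignment_length), label) pairs.
    Distances are plain Euclidean on the raw features; in practice the two
    features would likely need rescaling to comparable ranges.
    """
    if k < 1 or k > len(train):
        raise ValueError("k must be between 1 and the number of training examples")
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

An odd `k` avoids tied votes in a two-class setting such as allergen versus non-allergen.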
Affiliation(s)
- D Soeria-Atmadja
- Division of Toxicology, National Food Administration, Uppsala University, Uppsala, Sweden
|
198
|
Tian Y, Fan L, Thurau T, Jung C, Cai D. The absence of TIR-type resistance gene analogues in the sugar beet (Beta vulgaris L.) genome. J Mol Evol 2004; 58:40-53. [PMID: 14743313 DOI: 10.1007/s00239-003-2524-4] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.0] [Received: 01/15/2003] [Accepted: 07/15/2003] [Indexed: 12/11/2022]
Abstract
The majority of known plant resistance genes encode proteins with conserved nucleotide-binding sites and leucine-rich repeats (NBS-LRR). Degenerate primers based on conserved NBS-LRR motifs were used to amplify analogues of resistance genes from the dicot sugar beet. Along with a cDNA library screen, the PCR screen identified 27 genomic and 12 expressed sugar beet NBS-LRR RGA (nlRGA) clones. The clones were classified into three subfamilies based on nucleotide sequence identity. Sequence analyses suggested that point mutations, such as nucleotide substitutions and insertions/deletions, are probably the primary source of diversity of sugar beet nlRGAs. A phylogenetic analysis revealed an ancestral relationship among sugar beet nlRGAs and resistance genes from various angiosperm species. One group appeared to share the same common ancestor as Prf, Rx, RPP8, and Mi, whereas the second group originated from the ancestral gene from which 12C1, Xa1, and Cre3 arose. The predicted protein products of the nlRGAs isolated in this study are all members of the non-TIR-type resistance gene subfamily and share strong sequence and structural similarities with non-TIR-type resistance proteins. No representatives of the TIR-type RGAs were detected either by PCR amplification using TIR type-specific primers or by in silico screening of more than 16,000 sugar beet ESTs. These findings suggest that TIR-type RGAs are absent from the sugar beet genome. The possible evolutionary loss of TIR-type RGAs in the sugar beet is discussed.
Affiliation(s)
- Yanyan Tian
- Institute of Crop Science and Plant Breeding, Christian-Albrechts-University of Kiel, Kiel, Germany
|
199
|
Beesley J, Roush C, Baker L. High-throughput molecular pathology in human tissues as a method for driving drug discovery. Drug Discov Today 2004; 9:182-9. [PMID: 14960398 DOI: 10.1016/s1359-6446(03)02973-8] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.1] [Indexed: 11/30/2022]
Abstract
To facilitate prioritization of potential drug targets, gene expression can be localized to individual cell types in normal and diseased tissues. Given the complexity of molecular physiology and pathology, the creation of large-scale molecular pathology databases collating data obtained from human tissues is a challenging marriage of old and new technologies, particularly when considering the many issues that preclude easy access to substantial quantities of human tissues. Molecular pathology databases are powerful tools and are essential for early-stage drug discovery, enabling informed decisions to be made with respect to scientific direction and follow-up research.
Affiliation(s)
- Julian Beesley
- LifeSpan BioSciences, 2401 4th Avenue, Suite 900, Seattle, WA 98121, USA.
|
200
|
Vogel C, Teichmann SA, Chothia C. The immunoglobulin superfamily in Drosophila melanogaster and Caenorhabditis elegans and the evolution of complexity. Development 2004; 130:6317-28. [PMID: 14623821 DOI: 10.1242/dev.00848] [Citation(s) in RCA: 86] [Impact Index Per Article: 4.1] [Indexed: 11/20/2022]
Abstract
Drosophila melanogaster is an arthropod with a much more complex anatomy and physiology than the nematode Caenorhabditis elegans. We investigated one of the protein superfamilies in the two organisms that plays a major role in development and function of cell-cell communication: the immunoglobulin superfamily (IgSF). Using hidden Markov models, we identified 142 IgSF proteins in Drosophila and 80 in C. elegans. Of these, 58 and 22, respectively, have been previously identified by experiments. On the basis of homology and the structural characterisation of the proteins, we can suggest probable types of function for most of the novel proteins. Though overall Drosophila has fewer genes than C. elegans, it has many more IgSF cell-surface and secreted proteins. Half the IgSF proteins in C. elegans and three quarters of those in Drosophila have evolved subsequent to the divergence of the two organisms. These results suggest that the expansion of this protein superfamily is one of the factors that have contributed to the formation of the more complex physiological features that are found in Drosophila.
|