1
Karami A, Fayyaz Movaghar A, Mercier S, Ferre L. New Approximate Statistical Significance of Gapped Alignments Based on the Greedy Extension Model. J Comput Biol 2020; 27:1361-1372. [PMID: 31913652 DOI: 10.1089/cmb.2018.0203] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/12/2022]
Abstract
Sequence alignment is a fundamental tool in bioinformatics for identifying regions of similarity among sequences. The degree of similarity is expressed as a score. A number of methods exist for assessing the statistical significance of this similarity in the gapped and ungapped cases. In this article, we improve the accuracy of the statistical significance of the local score by introducing a new approximate p-value. It is derived from Poisson clumping and the exact distribution of a partial sum of random variables. The efficiency of the proposed method is compared with that of previous methods on real and simulated data. The results show a remarkable improvement in the accuracy of the p-value in the gapped case. This is evidence that the method is a promising candidate for sequence comparison.
Affiliation(s)
- Amirhossein Karami: Department of Statistics, Faculty of Mathematical Sciences, University of Mazandaran, Babolsar, Iran
- Afshin Fayyaz Movaghar: Department of Statistics, Faculty of Mathematical Sciences, University of Mazandaran, Babolsar, Iran
- Sabine Mercier: Institut de Mathematiques de Toulouse, Department of Mathematics and Computer Science, Universite Toulouse Jean Jaures, Toulouse, France
- Louis Ferre: Institut de Mathematiques de Toulouse, Toulouse, France
2
Margelevičius M. Estimating statistical significance of local protein profile-profile alignments. BMC Bioinformatics 2019; 20:419. [PMID: 31409275 PMCID: PMC6693267 DOI: 10.1186/s12859-019-2913-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 03/23/2019] [Accepted: 05/23/2019] [Indexed: 11/10/2022]
Abstract
BACKGROUND Alignment of sequence families described by profiles provides a sensitive means for establishing homology between proteins and is important in protein evolutionary, structural, and functional studies. In the context of a steadily growing amount of sequence data, estimating the statistical significance of alignments, including profile-profile alignments, plays a key role in alignment-based homology search algorithms. Still, it is an open question as to what and whether one type of distribution governs profile-profile alignment score, especially when profile-profile substitution scores involve such terms as secondary structure predictions. RESULTS This study presents a methodology for estimating the statistical significance of this type of alignments. The methodology rests on a new algorithm developed for generating random profiles such that their alignment scores are distributed similarly to those obtained for real unrelated profiles. We show that improvements in statistical accuracy and sensitivity and high-quality alignment rate result from statistically characterizing alignments by establishing the dependence of statistical parameters on various measures associated with both individual and pairwise profile characteristics. Implemented in the COMER software, the proposed methodology yielded an increase of up to 34.2% in the number of true positives and up to 61.8% in the number of high-quality alignments with respect to the previous version of the COMER method. CONCLUSIONS The more accurate estimation of statistical significance is implemented in the COMER method, which is now more sensitive and provides an increased rate of high-quality profile-profile alignments. The results of the present study also suggest directions for future research.
Affiliation(s)
- Mindaugas Margelevičius: Institute of Biotechnology, Life Sciences Center, Vilnius University, Saulėtekio al. 7, Vilnius, 10257, Lithuania.
3
Transcriptomic investigation of wound healing and regeneration in the cnidarian Calliactis polypus. Sci Rep 2017; 7:41458. [PMID: 28150733 PMCID: PMC5288695 DOI: 10.1038/srep41458] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.7] [Received: 10/26/2016] [Accepted: 12/19/2016] [Indexed: 12/11/2022]
Abstract
Wound healing and regeneration in cnidarian species, a group that forms the sister phylum to Bilateria, remains poorly characterised despite the ability of many cnidarians to rapidly repair injuries, regenerate lost structures, or re-form whole organisms from small populations of somatic cells. Here we present results from a fully replicated RNA-Seq experiment to identify genes that are differentially expressed in the sea anemone Calliactis polypus following catastrophic injury. We find that a large-scale transcriptomic response is established in C. polypus, comprising an abundance of genes involved in tissue patterning, energy dynamics, immunity, cellular communication, and extracellular matrix remodelling. We also identified a substantial proportion of uncharacterised genes that were differentially expressed during regeneration, that appear to be restricted to cnidarians. Overall, our study serves to both identify the role that conserved genes play in eumetazoan wound healing and regeneration, as well as to highlight the lack of information regarding many genes involved in this process. We suggest that functional analysis of the large group of uncharacterised genes found in our study may contribute to better understanding of the regenerative capacity of cnidarians, as well as provide insight into how wound healing and regeneration has evolved in different lineages.
4
Spouge JL. Finite-size corrections to Poisson approximations of rare events in renewal processes. J Appl Probab 2016. [DOI: 10.1239/jap/996986762] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Indexed: 11/18/2022]
Abstract
Consider a renewal process. The renewal events partition the process into i.i.d. renewal cycles. Assume that on each cycle, a rare event called 'success' can occur. Such successes lend themselves naturally to approximation by Poisson point processes. If each success occurs after a random delay, however, Poisson convergence may be relatively slow, because each success corresponds to a time interval, not a point. In 1996, Altschul and Gish proposed a finite-size correction to a particular approximation by a Poisson point process. Their correction is now used routinely (about once a second) when computers compare biological sequences, although it lacks a mathematical foundation. This paper generalizes their correction. For a single renewal process or several renewal processes operating in parallel, this paper gives an asymptotic expansion that contains in successive terms a Poisson point approximation, a generalization of the Altschul-Gish correction, and a correction term beyond that.
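As a rough illustration of the finite-size correction discussed in the abstract above, the sketch below uses the commonly cited form of the Altschul-Gish edge-effect correction: both sequence lengths are shortened by an expected alignment length l = ln(K·m·n)/H before the Poisson (Karlin-Altschul) E-value is computed. This is a simplified sketch, not the expansion derived in the paper. The function names are hypothetical, and the values of lambda, K and H are illustrative assumptions.

```python
import math

def expected_hsp_length(m, n, K, H):
    """Expected length of a high-scoring segment: l = ln(K*m*n) / H."""
    return math.log(K * m * n) / H

def e_value(score, m, n, lam, K, H=None):
    """Karlin-Altschul E-value; if H is given, apply the edge correction."""
    if H is not None:
        ell = expected_hsp_length(m, n, K, H)
        # Assumed floor of 1 so that short sequences never get a length <= 0.
        m = max(m - ell, 1.0)
        n = max(n - ell, 1.0)
    return K * m * n * math.exp(-lam * score)

def p_value(e):
    """Poisson approximation: P(at least one hit) = 1 - exp(-E)."""
    return -math.expm1(-e)

# Illustrative (assumed) parameters for an ungapped scoring scheme.
lam, K, H = 0.3176, 0.134, 0.4012
raw = e_value(60, 300, 1_000_000, lam, K)
corrected = e_value(60, 300, 1_000_000, lam, K, H)
print(raw, corrected, p_value(corrected))
```

Because the effective lengths are smaller than the raw lengths, the corrected E-value is always at most the uncorrected one. The difference matters most when one of the sequences is short.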
6
The Recipe for Protein Sequence-Based Function Prediction and Its Implementation in the ANNOTATOR Software Environment. Methods Mol Biol 2016; 1415:477-506. [PMID: 27115649 DOI: 10.1007/978-1-4939-3572-7_25] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Indexed: 03/02/2023]
7
Altindis E, Cozzi R, Di Palo B, Necchi F, Mishra RP, Fontana MR, Soriani M, Bagnoli F, Maione D, Grandi G, Liberatori S. Protectome analysis: a new selective bioinformatics tool for bacterial vaccine candidate discovery. Mol Cell Proteomics 2014; 14:418-29. [PMID: 25368410 DOI: 10.1074/mcp.m114.039362] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.2] [Indexed: 11/06/2022]
Abstract
New generation vaccines are in demand to include only the key antigens sufficient to confer protective immunity among the plethora of pathogen molecules. In the last decade, large-scale genomics-based technologies have emerged. Among them, the Reverse Vaccinology approach was successfully applied to the development of an innovative vaccine against Neisseria meningitidis serogroup B, now available on the market with the commercial name BEXSERO® (Novartis Vaccines). The limiting step of such approaches is the number of antigens to be tested in in vivo models. Several laboratories have been trying to refine the original approach in order to identify the relevant antigens straight from the genome. Here we report a new bioinformatics tool that takes a first step in this direction. The tool has been developed by identifying structural/functional features recurring in known bacterial protective antigens, the so-called "Protectome space," and using such "protective signatures" for protective antigen discovery. In particular, we applied this new approach to Staphylococcus aureus and Group B Streptococcus and we show that not only were already known protective antigens re-discovered, but two new protective antigens were also identified.
Affiliation(s)
- Emrah Altindis: Research Department, Novartis Vaccines and Diagnostics, 53100 Siena, Italy
- Roberta Cozzi: Research Department, Novartis Vaccines and Diagnostics, 53100 Siena, Italy
- Benedetta Di Palo: Research Department, Novartis Vaccines and Diagnostics, 53100 Siena, Italy
- Francesca Necchi: Research Department, Novartis Vaccines and Diagnostics, 53100 Siena, Italy
- Ravi P Mishra: Research Department, Novartis Vaccines and Diagnostics, 53100 Siena, Italy
- Maria Rita Fontana: Research Department, Novartis Vaccines and Diagnostics, 53100 Siena, Italy
- Marco Soriani: Research Department, Novartis Vaccines and Diagnostics, 53100 Siena, Italy
- Fabio Bagnoli: Research Department, Novartis Vaccines and Diagnostics, 53100 Siena, Italy
- Domenico Maione: Research Department, Novartis Vaccines and Diagnostics, 53100 Siena, Italy
- Guido Grandi: Research Department, Novartis Vaccines and Diagnostics, 53100 Siena, Italy
- Sabrina Liberatori: Research Department, Novartis Vaccines and Diagnostics, 53100 Siena, Italy
8
Abstract
The comparison of homologous proteins from different species is a first step toward a function assignment and a reconstruction of the species evolution. Though local alignment is mostly used for this purpose, global alignment is important for constructing multiple alignments or phylogenetic trees. However, statistical significance of global alignments is not completely clear, lacking a specific statistical model to describe alignments or depending on computationally expensive methods like Z-score. Recently we presented a normalized global alignment, defined as the best compromise between global alignment cost and length, and showed that this new technique led to better classification results than Z-score at a much lower computational cost. However, it is necessary to analyze the statistical significance of the normalized global alignment in order to be considered a completely functional algorithm for protein alignment. Experiments with unrelated proteins extracted from the SCOP ASTRAL database showed that normalized global alignment scores can be fitted to a log-normal distribution. This fact, obtained without any theoretical support, can be used to derive statistical significance of normalized global alignments. Results are summarized in a table with fitted parameters for different scoring schemes.
Affiliation(s)
- Guillermo Peris: Department de Llenguatges i Sistemes Informátics, Universitat Jaume I, Castelló, Spain
9
Kaznadzey A, Alexandrova N, Novichkov V, Kaznadzey D. PSimScan: algorithm and utility for fast protein similarity search. PLoS One 2013; 8:e58505. [PMID: 23505522 PMCID: PMC3591303 DOI: 10.1371/journal.pone.0058505] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 01/27/2012] [Accepted: 02/07/2013] [Indexed: 01/19/2023]
Abstract
In the era of metagenomics and diagnostics sequencing, the importance of protein comparison methods of boosted performance cannot be overstated. Here we present PSimScan (Protein Similarity Scanner), a flexible open source protein similarity search tool which provides a significant gain in speed compared to BLASTP at the price of controlled sensitivity loss. The PSimScan algorithm introduces a number of novel performance optimization methods that can be further used by the community to improve the speed and lower hardware requirements of bioinformatics software. The optimization starts at the lookup table construction, then the initial lookup table–based hits are passed through a pipeline of filtering and aggregation routines of increasing computational complexity. The first step in this pipeline is a novel algorithm that builds and selects ‘similarity zones’ aggregated from neighboring matches on small arrays of adjacent diagonals. PSimScan performs 5 to 100 times faster than the standard NCBI BLASTP, depending on chosen parameters, and runs on commodity hardware. Its sensitivity and selectivity at the slowest settings are comparable to the NCBI BLASTP’s and decrease with the increase of speed, yet stay at the levels reasonable for many tasks. PSimScan is most advantageous when used on large collections of query sequences. Comparing the entire proteome of Streptococcus pneumoniae (2,042 proteins) to the NCBI’s non-redundant protein database of 16,971,855 records takes 6.5 hours on a moderately powerful PC, while the same task with the NCBI BLASTP takes over 66 hours. We describe innovations in the PSimScan algorithm in considerable detail to encourage bioinformaticians to improve on the tool and to use the innovations in their own software development.
Affiliation(s)
- Anna Kaznadzey: Institute for Information Transmission Problems, Russian Academy of Sciences, Moscow, Russia
- Natalia Alexandrova: Genome Designs, Inc., Walnut Creek, California, United States of America
- Denis Kaznadzey: DOE Joint Genome Institute, Walnut Creek, California, United States of America
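The lookup-table and diagonal steps mentioned in the PSimScan abstract above can be pictured with a generic seed-and-diagonal sketch. It is not the PSimScan algorithm itself: it only shows how exact k-mer hits from a lookup table can be grouped by diagonal, and how neighbouring diagonals can be pooled into candidate zones. All function names, the k-mer size, the band width and the hit threshold are assumptions made for illustration.

```python
from collections import defaultdict

def build_lookup(subject, k):
    """Map every k-mer in the subject to the positions where it starts."""
    table = defaultdict(list)
    for j in range(len(subject) - k + 1):
        table[subject[j:j + k]].append(j)
    return table

def hits_by_diagonal(query, subject, k=3):
    """Group exact k-mer matches by diagonal (query pos - subject pos)."""
    table = build_lookup(subject, k)
    diagonals = defaultdict(list)
    for i in range(len(query) - k + 1):
        for j in table.get(query[i:i + k], ()):
            diagonals[i - j].append((i, j))
    return diagonals

def candidate_zones(diagonals, band=1, min_hits=2):
    """Keep diagonals whose neighbourhood (+/- band) holds enough hits."""
    zones = []
    for d in sorted(diagonals):
        count = sum(len(diagonals.get(d + off, ()))
                    for off in range(-band, band + 1))
        if count >= min_hits:
            zones.append((d, count))
    return zones

# Toy example: the query is embedded in the subject with an offset of 2.
diags = hits_by_diagonal("MKTAYIAKQR", "GGMKTAYIAKQRGG")
print(candidate_zones(diags))
```

In this toy case all eight 3-mer hits fall on the single diagonal -2, so one zone is reported. A real tool would follow such a cheap filter with more expensive scoring stages.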
10
Zhang Y, Misra S, Agrawal A, Patwary MMA, Liao WK, Qin Z, Choudhary A. Accelerating pairwise statistical significance estimation for local alignment by harvesting GPU's power. BMC Bioinformatics 2012; 13 Suppl 5:S3. [PMID: 22537007 PMCID: PMC3318904 DOI: 10.1186/1471-2105-13-s5-s3] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 11/27/2022]
Abstract
Background Pairwise statistical significance has been recognized to be able to accurately identify related sequences, which is a very important cornerstone procedure in numerous bioinformatics applications. However, it is both computationally and data intensive, which poses a big challenge in terms of performance and scalability. Results We present a GPU implementation to accelerate pairwise statistical significance estimation of local sequence alignment using standard substitution matrices. By carefully studying the algorithm's data access characteristics, we developed a tile-based scheme that can produce a contiguous data access in the GPU global memory and sustain a large number of threads to achieve a high GPU occupancy. We further extend the parallelization technique to estimate pairwise statistical significance using position-specific substitution matrices, which has earlier demonstrated significantly better sequence comparison accuracy than using standard substitution matrices. The implementation is also extended to take advantage of dual GPUs. We observe end-to-end speedups of nearly 250× using a single Tesla C2050 GPU (370× using dual Tesla C2050 GPUs) over the CPU implementation on an Intel® Core™ i7 CPU 920 processor. Conclusions Harvesting the high performance of modern GPUs is a promising approach to accelerate pairwise statistical significance estimation for local sequence alignment.
Affiliation(s)
- Yuhong Zhang: School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China.
11
Peris G, Marzal A. Normalized global alignment for protein sequences. J Theor Biol 2011; 291:22-8. [PMID: 21945336 DOI: 10.1016/j.jtbi.2011.09.017] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 05/26/2011] [Revised: 07/19/2011] [Accepted: 09/08/2011] [Indexed: 10/17/2022]
Abstract
Global alignment is used to compare proteins in different fields, for example in phylogenetic research. In order to reduce the length and composition dependence of global alignment scores, a Z-score is computed with a Monte Carlo algorithm. This technique requires a great number of sequence alignments on shuffled sequences, leading to a high computational cost. In this work, a normalized global alignment score is introduced in order to correct the length dependence of global alignments. This score is defined as the best ratio between the score of an alignment and its length, and an algorithm to compute it based on fractional programming is implemented. The properties and effectiveness of normalized global alignment applied to protein comparison are analyzed. Experiments with proteins selected from the SCOP ASTRAL database were run to study the relationship of normalized global alignment with Z-score and its performance in homology detection. Results show that normalized global alignment has a computational cost equivalent to 2.5 Needleman-Wunsch runs and a linear relationship with Z-score. This linearity allows us to use normalized global alignment as a cheap substitute for a computationally expensive Z-score. Experiments show that normalized global alignment improves the ability to identify homologous proteins. Software used to compute normalized global alignments is available from http://www3.uji.es/~peris/nga.
Affiliation(s)
- Guillermo Peris: Department de Llenguatges i Sistemes Informátics, Universitat Jaume I, 12071 Castelló, Spain.
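The Monte Carlo Z-score that the abstract above describes as computationally expensive can be sketched as follows. This is a minimal illustration with an assumed match/mismatch/linear-gap scoring scheme, not the scoring or software used by the authors. The function names, the number of shuffles and the seed are hypothetical choices.

```python
import random
import statistics

def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]

def z_score(a, b, shuffles=100, seed=0):
    """Z-score of the observed score against scores of shuffled copies of b."""
    rng = random.Random(seed)
    observed = nw_score(a, b)
    letters = list(b)
    null_scores = []
    for _ in range(shuffles):
        rng.shuffle(letters)  # preserves composition, destroys order
        null_scores.append(nw_score(a, "".join(letters)))
    sd = statistics.pstdev(null_scores)
    if sd == 0:
        raise ValueError("shuffled scores have zero variance")
    return (observed - statistics.mean(null_scores)) / sd

seq = "ACDEFGHIKLMNPQRSTVWY"
print(z_score(seq, seq))
```

Each Z-score costs one alignment per shuffle on top of the observed alignment, which is the cost that a single normalized score is meant to avoid.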
12
Wong WC, Maurer-Stroh S, Eisenhaber F. The Janus-faced E-values of HMMER2: extreme value distribution or logistic function? J Bioinform Comput Biol 2011. [DOI: 10.1142/s0219720011005264] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 11/18/2022]
Abstract
E-value guided extrapolation of protein domain annotation from libraries such as Pfam with the HMMER suite is indispensable for hypothesizing about the function of experimentally uncharacterized protein sequences. Since the recent release of HMMER3 does not supersede all functions of HMMER2, the latter will remain relevant for ongoing research as well as for the evaluation of annotations that reside in databases and in the literature. In HMMER2, the E-value is computed from the score via a logistic function or via a domain model-specific extreme value distribution (EVD); the lower of the two is returned as E-value for the domain hit in the query sequence. We find that, for thousands of domain models, this treatment results in switching from the EVD to the statistical model with the logistic function when scores grow (for Pfam release 23, 99% in the global mode and 75% in the fragment mode). If the score corresponding to the breakpoint results in an E-value above a user-defined threshold (e.g. 0.1), a critical score region with conflicting E-values from the logistic function (below the threshold) and from EVD (above the threshold) does exist. Thus, this switch will affect E-value guided annotation decisions in an automated mode. To emphasize, switching in the fragment mode is of no practical relevance since it occurs only at E-values far below 0.1. Unfortunately, a critical score region does exist for 185 domain models in the hmmpfam and 1,748 domain models in the hmmsearch global-search mode. For 145 out of the respective 185 models, the critical score region is indeed populated by actual sequences. In total, 24.4% of their hits have a logistic function-derived E-value < 0.1 when the EVD provides an E-value > 0.1. We provide examples of false annotations and critically discuss the appropriateness of a logistic function as alternative to the EVD.
Affiliation(s)
- Wing-Cheong Wong: Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, Singapore 138671, Singapore
- Sebastian Maurer-Stroh: Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, Singapore 138671, Singapore; School of Biological Sciences (SBS), Nanyang Technological University (NTU), 60 Nanyang Drive, Singapore 63755, Singapore
- Frank Eisenhaber: Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, Singapore 138671, Singapore; Department of Biological Sciences (DBS), National University of Singapore (NUS), 8 Medical Drive, Singapore 117597, Singapore; School of Computer Engineering (SCE), Nanyang Technological University (NTU), 50 Nanyang Drive, Singapore 637553, Singapore
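The switching behaviour described in the HMMER2 abstract above can be reproduced with a small sketch that takes the lower of two P-value estimates. The abstract does not give the functional forms, so both are assumptions here: a logistic bound 1/(1 + 2^s) on a bit score s, and a Gumbel tail with location mu and scale lam. The parameter values are invented for illustration and do not correspond to any Pfam model.

```python
import math

def logistic_p(score):
    """Assumed logistic bound on the P-value of a bit score: 1 / (1 + 2**score)."""
    return 1.0 / (1.0 + 2.0 ** score)

def evd_p(score, mu, lam):
    """Gumbel (EVD) tail probability: 1 - exp(-exp(-lam * (score - mu)))."""
    return -math.expm1(-math.exp(-lam * (score - mu)))

def combined_e_value(score, mu, lam, n_seqs):
    """Return (E-value, source), taking the lower of the two P-value estimates."""
    p_log = logistic_p(score)
    p_evd = evd_p(score, mu, lam)
    source = "logistic" if p_log < p_evd else "EVD"
    return n_seqs * min(p_log, p_evd), source

# Hypothetical model parameters; lam < ln 2, so the logistic tail decays faster.
mu, lam, n_seqs = -10.0, 0.5, 100_000
for s in (0, 20, 30, 50):
    print(s, combined_e_value(s, mu, lam, n_seqs))
```

With these assumed parameters the EVD gives the lower estimate at low scores and the logistic function takes over as scores grow. The score at which the two curves cross plays the role of the breakpoint in the abstract.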
13
Pérez-Nadales E, Di Pietro A. The membrane mucin Msb2 regulates invasive growth and plant infection in Fusarium oxysporum. Plant Cell 2011; 23:1171-85. [PMID: 21441438 PMCID: PMC3082261 DOI: 10.1105/tpc.110.075093] [Citation(s) in RCA: 82] [Impact Index Per Article: 6.3] [Received: 03/03/2010] [Revised: 02/18/2011] [Accepted: 03/08/2011] [Indexed: 05/20/2023]
Abstract
Fungal pathogenicity in plants requires a conserved mitogen-activated protein kinase (MAPK) cascade homologous to the yeast filamentous growth pathway. How this signaling cascade is activated during infection remains poorly understood. In the soil-borne vascular wilt fungus Fusarium oxysporum, the orthologous MAPK Fmk1 (Fusarium MAPK1) is essential for root penetration and pathogenicity in tomato (Solanum lycopersicum) plants. Here, we show that Msb2, a highly glycosylated transmembrane protein, is required for surface-induced phosphorylation of Fmk1 and contributes to a subset of Fmk1-regulated functions related to invasive growth and virulence. Mutants lacking Msb2 share characteristic phenotypes with the Δfmk1 mutant, including defects in cellophane invasion, penetration of the root surface, and induction of vascular wilt symptoms in tomato plants. In contrast with Δfmk1, Δmsb2 mutants were hypersensitive to cell wall targeting compounds, a phenotype that was exacerbated in a Δmsb2 Δfmk1 double mutant. These results suggest that the membrane mucin Msb2 promotes invasive growth and plant infection upstream of Fmk1 while contributing to cell integrity through a distinct pathway.
14
Agrawal A, Huang X. Pairwise statistical significance of local sequence alignment using sequence-specific and position-specific substitution matrices. IEEE/ACM Trans Comput Biol Bioinform 2011; 8:194-205. [PMID: 21071807 DOI: 10.1109/tcbb.2009.69] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Indexed: 05/30/2023]
Abstract
Pairwise sequence alignment is a central problem in bioinformatics, which forms the basis of various other applications. Two related sequences are expected to have a high alignment score, but relatedness is usually judged by statistical significance rather than by alignment score. Recently, it was shown that pairwise statistical significance gives promising results as an alternative to database statistical significance for getting individual significance estimates of pairwise alignment scores. The improvement was mainly attributed to making the statistical significance estimation process more sequence-specific and database-independent. In this paper, we use sequence-specific and position-specific substitution matrices to derive the estimates of pairwise statistical significance, which is expected to use more sequence-specific information in estimating pairwise statistical significance. Experiments on a benchmark database with sequence-specific substitution matrices at different levels of sequence-specific contribution were conducted, and results confirm that using sequence-specific substitution matrices for estimating pairwise statistical significance is significantly better than using a standard matrix like BLOSUM62, and than database statistical significance estimates reported by popular database search programs like BLAST, PSI-BLAST (without pretrained PSSMs), and SSEARCH on a benchmark database, but with pretrained PSSMs, PSI-BLAST results are significantly better. Further, using position-specific substitution matrices for estimating pairwise statistical significance gives significantly better results even than PSI-BLAST using pretrained PSSMs.
Affiliation(s)
- Ankit Agrawal: Department of Computer Science, Iowa State University, 226 Atanasoff Hall, Ames, IA 50011-1041, USA.
15
Park Y, Sheetlin S, Spouge JL. Estimating the Gumbel scale parameter for local alignment of random sequences by importance sampling with stopping times. Ann Stat 2009; 37:3697. [PMID: 20148197 DOI: 10.1214/08-aos663] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Indexed: 11/19/2022]
Abstract
The gapped local alignment score of two random sequences follows a Gumbel distribution. If computers could estimate the parameters of the Gumbel distribution within one second, the use of arbitrary alignment scoring schemes could increase the sensitivity of searching biological sequence databases over the web. Accordingly, this article gives a novel equation for the scale parameter of the relevant Gumbel distribution. We speculate that the equation is exact, although present numerical evidence is limited. The equation involves ascending ladder variates in the global alignment of random sequences. In global alignment simulations, the ladder variates yield stopping times specifying random sequence lengths. Because of the random lengths, and because our trial distribution for importance sampling occurs on a different sample space from our target distribution, our study led to a mapping theorem, which led naturally in turn to an efficient dynamic programming algorithm for the importance sampling weights. Numerical studies using several popular alignment scoring schemes then examined the efficiency and accuracy of the resulting simulations.
Affiliation(s)
- Yonil Park: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, Maryland 20894, USA
16
Poleksic A. Island method for estimating the statistical significance of profile-profile alignment scores. BMC Bioinformatics 2009; 10:112. [PMID: 19379500 PMCID: PMC2678096 DOI: 10.1186/1471-2105-10-112] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 12/25/2008] [Accepted: 04/20/2009] [Indexed: 11/10/2022]
Abstract
BACKGROUND In the last decade, a significant improvement in detecting remote similarity between protein sequences has been made by utilizing alignment profiles in place of amino-acid strings. Unfortunately, no analytical theory is available for estimating the significance of a gapped alignment of two profiles. Many experiments suggest that the distribution of local profile-profile alignment scores is of the Gumbel form. However, estimating distribution parameters by random simulations turns out to be computationally very expensive. RESULTS We demonstrate that the background distribution of profile-profile alignment scores heavily depends on profiles' composition and thus the distribution parameters must be estimated independently, for each pair of profiles of interest. We also show that accurate estimates of statistical parameters can be obtained using the "island statistics" for profile-profile alignments. CONCLUSION The island statistics can be generalized to profile-profile alignments to provide an efficient method for the alignment score normalization. Since multiple island scores can be extracted from a single comparison of two profiles, the island method has a clear speed advantage over the direct shuffling method for comparable accuracy in parameter estimates.
Affiliation(s)
- Aleksandar Poleksic: Department of Computer Science, University of Northern Iowa, Cedar Falls, IA 50614, USA.
17
Agrawal A, Huang X. Pairwise statistical significance of local sequence alignment using multiple parameter sets and empirical justification of parameter set change penalty. BMC Bioinformatics 2009; 10 Suppl 3:S1. [PMID: 19344477 PMCID: PMC2665049 DOI: 10.1186/1471-2105-10-s3-s1] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Indexed: 11/10/2022]
Abstract
BACKGROUND Accurate estimation of statistical significance of a pairwise alignment is an important problem in sequence comparison. Recently, a comparative study of pairwise statistical significance with database statistical significance was conducted. In this paper, we extend the earlier work on pairwise statistical significance by incorporating the use of multiple parameter sets. RESULTS Results for a knowledge discovery application of homology detection reveal that using multiple parameter sets for pairwise statistical significance estimates gives better coverage than using a single parameter set, at least at some error levels. Further, the results of pairwise statistical significance using multiple parameter sets are shown to be significantly better than database statistical significance estimates reported by BLAST and PSI-BLAST, and comparable to and at times significantly better than those of SSEARCH. Using non-zero parameter set change penalty values gives better performance than zero penalty. CONCLUSION The fact that the homology detection performance does not degrade when using multiple parameter sets is strong evidence for the validity of the assumption that the alignment score distribution follows an extreme value distribution even when using multiple parameter sets. Parameter set change penalty is a useful parameter for alignment using multiple parameter sets. Pairwise statistical significance using multiple parameter sets can be effectively used to determine the relatedness of one pair (or a few pairs) of sequences without performing a time-consuming database search.
Affiliation(s)
- Ankit Agrawal
- Department of Computer Science, Iowa State University, 226 Atanasoff Hall, Ames, IA 50011-1041, USA.
|
18
|
Eddy SR. A probabilistic model of local sequence alignment that simplifies statistical significance estimation. PLoS Comput Biol 2008; 4:e1000069. [PMID: 18516236 PMCID: PMC2396288 DOI: 10.1371/journal.pcbi.1000069] [Citation(s) in RCA: 229] [Impact Index Per Article: 14.3] [Received: 12/05/2007] [Accepted: 03/26/2008] [Indexed: 11/19/2022]
Abstract
Sequence database searches require accurate estimation of the statistical significance of scores. Optimal local sequence alignment scores follow Gumbel distributions, but determining an important parameter of the distribution (λ) requires time-consuming computational simulation. Moreover, optimal alignment scores are less powerful than probabilistic scores that integrate over alignment uncertainty (“Forward” scores), but the expected distribution of Forward scores remains unknown. Here, I conjecture that both expected score distributions have simple, predictable forms when full probabilistic modeling methods are used. For a probabilistic model of local sequence alignment, optimal alignment bit scores (“Viterbi” scores) are Gumbel-distributed with constant λ = log 2, and the high scoring tail of Forward scores is exponential with the same constant λ. Simulation studies support these conjectures over a wide range of profile/sequence comparisons, using 9,318 profile-hidden Markov models from the Pfam database. This enables efficient and accurate determination of expectation values (E-values) for both Viterbi and Forward scores for probabilistic local alignments. Sequence database searches are a fundamental tool of molecular biology, enabling researchers to identify related sequences in other organisms, which often provides invaluable clues to the function and evolutionary history of genes. The power of database searches to detect more and more remote evolutionary relationships – essentially, to look back deeper in time – has improved steadily, with the adoption of more complex and realistic models. However, database searches require not just a realistic scoring model, but also the ability to distinguish good scores from bad ones – the ability to calculate the statistical significance of scores. For many models and scoring schemes, accurate statistical significance calculations have either involved expensive computational simulations, or not been feasible at all. 
Here, I introduce a probabilistic model of local sequence alignment that has readily predictable score statistics for position-specific profile scoring systems, and not just for traditional optimal alignment scores, but also for more powerful log-likelihood ratio scores derived in a full probabilistic inference framework. These results remove one of the main obstacles that have impeded the use of more powerful and biologically realistic statistical inference methods in sequence homology searches.
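The score statistics conjectured in this abstract can be sketched numerically. The snippet below is a minimal illustration, assuming the standard Gumbel tail for Viterbi bit scores and an exponential tail for high Forward bit scores, both with λ = log 2. The location parameters `mu` and `tau` and the function names are hypothetical; the abstract does not say how they are obtained.

```python
import math

LAMBDA = math.log(2)  # slope stated in the abstract for bit scores


def viterbi_pvalue(score_bits, mu):
    """Gumbel tail P(S >= x) = 1 - exp(-exp(-lambda * (x - mu))).

    mu is a location parameter assumed to be known (hypothetical input).
    """
    return -math.expm1(-math.exp(-LAMBDA * (score_bits - mu)))


def forward_pvalue(score_bits, tau):
    """Exponential tail exp(-lambda * (x - tau)), meaningful only for high scores.

    tau is an assumed tail location; the result is capped at 1.
    """
    return min(1.0, math.exp(-LAMBDA * (score_bits - tau)))


def evalue(pvalue, n_targets):
    """Expected number of chance hits in a search of n_targets sequences."""
    return pvalue * n_targets
```

With λ fixed at log 2, each additional bit of score halves the tail probability in the high-scoring regime, which is why only the location parameter would need to be estimated per model.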
Affiliation(s)
- Sean R Eddy
- Howard Hughes Medical Institute, Janelia Farm Research Campus, Ashburn, Virginia, United States of America.
|
19
|
Abstract
A widely used algorithm for computing an optimal local alignment between two sequences requires a parameter set with a substitution matrix and gap penalties. It is recognized that a proper parameter set should be selected to suit the level of conservation between sequences. We describe an algorithm for selecting an appropriate substitution matrix at given gap penalties for computing an optimal local alignment between two sequences. In the algorithm, a substitution matrix that leads to the maximum alignment similarity score is selected among substitution matrices at various evolutionary distances. The evolutionary distance of the selected substitution matrix is defined as the distance of the computed alignment. To show the effects of gap penalties on alignments and their distances and help select appropriate gap penalties, alignments and their distances are computed at various gap penalties. The algorithm has been implemented as a computer program named SimDist. The SimDist program was compared with an existing local alignment program named SIM for finding reciprocally best-matching pairs (RBPs) of sequences in each of 100 protein families, where RBPs are commonly used as an operational definition of orthologous sequences. SimDist produced more accurate results than SIM on 50 of the 100 families, whereas both programs produced the same results on the other 50 families. SimDist was also used to compare three types of substitution matrices in scoring 444,461 pairs of homologous sequences from the 100 families.
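The selection rule described above (pick the substitution matrix that maximizes the local alignment score at fixed gap penalties) can be sketched as follows. This is a toy illustration, not the SimDist implementation: it assumes a linear gap penalty, and `make_matrix`, `TOY_MATRICES`, and the distance labels 10, 50, 100 are hypothetical stand-ins for a real matrix series at different evolutionary distances.

```python
def local_score(a, b, score, gap):
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cell = max(
                0,                         # start a new local alignment
                prev[j - 1] + score(x, y), # align x with y
                prev[j] - gap,             # gap in b
                cur[j - 1] - gap,          # gap in a
            )
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


def make_matrix(match, mismatch):
    """Toy scoring function; a real series would use log-odds matrices."""
    return lambda x, y: match if x == y else mismatch


# Hypothetical series: stricter scoring at small distances, more permissive at large ones.
TOY_MATRICES = {10: make_matrix(5, -8), 50: make_matrix(4, -3), 100: make_matrix(3, -1)}


def select_matrix(a, b, matrices, gap):
    """Return (distance, score) for the matrix giving the highest local score."""
    return max(
        ((d, local_score(a, b, s, gap)) for d, s in matrices.items()),
        key=lambda pair: pair[1],
    )
```

For two identical sequences the strictest toy matrix wins, so `select_matrix("ACGTACGT", "ACGTACGT", TOY_MATRICES, 6)` reports distance 10, which mirrors the idea that the selected matrix's distance is taken as the distance of the alignment.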
Affiliation(s)
- Xiaoqiu Huang
- Department of Computer Science, Iowa State University, Ames, Iowa 50011-1040, USA.
|
20
|
Sharon I, Birkland A, Chang K, El-Yaniv R, Yona G. Correcting BLAST e-values for low-complexity segments. J Comput Biol 2008; 12:980-1003. [PMID: 16201917 DOI: 10.1089/cmb.2005.12.980] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Indexed: 01/09/2023]
Abstract
The statistical estimates of BLAST and PSI-BLAST are of extreme importance in determining the biological relevance of sequence matches. While being very effective in evaluating most matches, these estimates usually overestimate the significance of matches in the presence of low complexity segments. In this paper, we present a model, based on divergence measures and statistics of the alignment structure, that corrects BLAST e-values for low complexity sequences without filtering or excluding them and generates scores that are more effective in distinguishing true similarities from chance similarities. We evaluate our method and compare it to other known methods using the Gene Ontology (GO) knowledge resource as a benchmark. Various performance measures, including ROC analysis, indicate that the new model improves upon the state of the art. The program is available at biozon.org/ftp/ and www.cs.technion.ac.il/~itaish/lowcomp/.
Affiliation(s)
- Itai Sharon
- Department of Computer Science, Technion, Haifa, Israel
|
21
|
Biegert A, Söding J. De novo identification of highly diverged protein repeats by probabilistic consistency. Bioinformatics 2008; 24:807-14. [DOI: 10.1093/bioinformatics/btn039] [Citation(s) in RCA: 123] [Impact Index Per Article: 7.7] [Indexed: 11/12/2022]
|
22
|
Mitrophanov AY, Borodovsky M. Statistical significance in biological sequence analysis. Brief Bioinform 2008; 7:2-24. [PMID: 16761361 DOI: 10.1093/bib/bbk001] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.8] [Indexed: 11/12/2022]
Abstract
One of the major goals of computational sequence analysis is to find sequence similarities, which could serve as evidence of structural and functional conservation, as well as of evolutionary relations among the sequences. Since the degree of similarity is usually assessed by the sequence alignment score, it is necessary to know if a score is high enough to indicate a biologically interesting alignment. A powerful approach to defining score cutoffs is based on the evaluation of the statistical significance of alignments. The statistical significance of an alignment score is frequently assessed by its P-value, which is the probability that this score or a higher one can occur simply by chance, given the probabilistic models for the sequences. In this review we discuss the general role of P-value estimation in sequence analysis, and give a description of theoretical methods and computational approaches to the estimation of statistical significance for important classes of sequence analysis problems. In particular, we concentrate on the P-value estimation techniques for single sequence studies (both score-based and score-free), global and local pairwise sequence alignments, multiple alignments, sequence-to-profile alignments and alignments built with hidden Markov models. We anticipate that the review will be useful both to researchers professionally working in bioinformatics and to biomedical scientists interested in using contemporary methods of DNA and protein sequence analysis.
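The P-value definition given above (the probability of this score or a higher one arising by chance under a null model) can be estimated by simulation. The sketch below is one simple Monte Carlo approach, assuming a shuffling null model that preserves the composition of the second sequence. The names `empirical_pvalue`, `shuffled_null`, and the toy `identity_score` are hypothetical and are not taken from the review.

```python
import random


def empirical_pvalue(observed, null_scores):
    """Estimate P(score >= observed) as (r + 1) / (n + 1).

    r counts null scores at least as large as the observed one; the +1 terms
    keep the estimate away from an impossible value of exactly zero.
    """
    r = sum(1 for s in null_scores if s >= observed)
    return (r + 1) / (len(null_scores) + 1)


def identity_score(a, b):
    """Toy score: number of matching positions (stand-in for an alignment score)."""
    return sum(x == y for x, y in zip(a, b))


def shuffled_null(seq_a, seq_b, score_fn, n=1000, seed=0):
    """Scores of seq_a against n random shuffles of seq_b (assumed null model)."""
    rng = random.Random(seed)
    letters = list(seq_b)
    scores = []
    for _ in range(n):
        rng.shuffle(letters)
        scores.append(score_fn(seq_a, "".join(letters)))
    return scores
```

The resolution of such an estimate is limited to about 1/(n + 1), which is one reason analytical or fitted tail distributions are preferred when very small P-values are needed.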
|
23
|
Realm of PD-(D/E)XK nuclease superfamily revisited: detection of novel families with modified transitive meta profile searches. BMC Structural Biology 2007; 7:40. [PMID: 17584917 PMCID: PMC1913061 DOI: 10.1186/1472-6807-7-40] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.5] [Received: 03/13/2007] [Accepted: 06/20/2007] [Indexed: 11/30/2022]
Abstract
Background PD-(D/E)XK nucleases constitute a large and highly diverse superfamily of enzymes that display little sequence similarity despite retaining a common core fold and a few critical active site residues. This makes identification of new PD-(D/E)XK nuclease families a challenging task, as they usually escape detection with standard sequence-based methods. We developed a modified transitive meta profile search approach. To consider the structural diversity of the PD-(D/E)XK nuclease fold more thoroughly, we also analyzed below-threshold Meta-BASIC hits to select potentially correct predictions placed among unreliable or incorrect ones. Results Application of modified transitive Meta-BASIC searches to updated PFAM families and PDB structures resulted in the detection of five new PD-(D/E)XK nuclease families encompassing hundreds of so far uncharacterized and poorly annotated proteins. These include four families catalogued in the PFAM database as domains of unknown function (DUF506, DUF524, DUF1626 and DUF1703) and the YhgA-like family of putative transposases. Three of these families represent extremely distant homologs (DUF506, DUF524, and YhgA-like), while two are newly defined in the updated database (DUF1626 and DUF1703). In addition, we confidently identified an extended AAA-ATPase domain in the N-terminal region of DUF1703 family proteins. Conclusion The results suggest that detailed analysis of below-threshold Meta-BASIC hits may push the limits of distant homology detection further into the 'midnight zone' of homology. All identified families conserve the core evolutionary fold, secondary structure and hydrophobic patterns common to existing PD-(D/E)XK nucleases and maintain critical active site motifs that contribute to nucleic acid cleavage. Further experimental investigations should address the predicted activity and clarify potential substrates, providing further insight into the detailed biological role of these newly detected nucleases.
|
24
|
Hemalatha GR, Rao DS, Guruprasad L. Identification and analysis of novel amino-acid sequence repeats in Bacillus anthracis str. Ames proteome using computational tools. Comp Funct Genomics 2007:47161. [PMID: 17538688 PMCID: PMC1876623 DOI: 10.1155/2007/47161] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 08/07/2006] [Revised: 12/06/2006] [Accepted: 12/09/2006] [Indexed: 11/18/2022]
Abstract
We have identified four repeats and ten domains that are novel in proteins encoded by the Bacillus anthracis str. Ames proteome using automated in silico methods. A “repeat” corresponds to a region comprising fewer than 55 amino acid residues that occurs more than once in the protein sequence and is sometimes present in tandem. A “domain” corresponds to a conserved region with more than 55 amino acid residues and may be present as single or multiple copies in the protein sequence. These correspond to (1) 57-amino-acid-residue PxV domain, (2) 122-amino-acid-residue FxF domain, (3) 111-amino-acid-residue YEFF domain, (4) 109-amino-acid-residue IMxxH domain, (5) 103-amino-acid-residue VxxT domain, (6) 84-amino-acid-residue ExW domain, (7) 104-amino-acid-residue NTGFIG domain, (8) 36-amino-acid-residue NxGK repeat, (9) 95-amino-acid-residue VYV domain, (10) 75-amino-acid-residue KEWE domain, (11) 59-amino-acid-residue AFL domain, (12) 53-amino-acid-residue RIDVK repeat, (13) (a) 41-amino-acid-residue AGQF repeat and (b) 42-amino-acid-residue GSAL repeat. A repeat or domain type is characterized by specific conserved sequence motifs. We discuss the presence of these repeats and domains in proteins from other genomes and their probable secondary structure.
Affiliation(s)
- G. R. Hemalatha
- School of Chemistry, University of Hyderabad, Hyderabad 500 046, India
| | - L. Guruprasad
- School of Chemistry, University of Hyderabad, Hyderabad 500 046, India
|
25
|
Yu YK, Gertz EM, Agarwala R, Schäffer AA, Altschul SF. Retrieval accuracy, statistical significance and compositional similarity in protein sequence database searches. Nucleic Acids Res 2006; 34:5966-73. [PMID: 17068079 PMCID: PMC1635310 DOI: 10.1093/nar/gkl731] [Citation(s) in RCA: 47] [Impact Index Per Article: 2.6] [Indexed: 11/12/2022]
Abstract
Protein sequence database search programs may be evaluated both for their retrieval accuracy—the ability to separate meaningful from chance similarities—and for the accuracy of their statistical assessments of reported alignments. However, methods for improving statistical accuracy can degrade retrieval accuracy by discarding compositional evidence of sequence relatedness. This evidence may be preserved by combining essentially independent measures of alignment and compositional similarity into a unified measure of sequence similarity. A version of the BLAST protein database search program, modified to employ this new measure, outperforms the baseline program in both retrieval and statistical accuracy on ASTRAL, a SCOP-based test set.
Affiliation(s)
- Stephen F. Altschul
- To whom correspondence should be addressed. Tel: +301 435 7803; Fax: +301 480 2288
|
26
|
Abstract
We examine a Poisson heuristic for judging the significance of local sequence alignments with gaps. Model parameters are estimated directly from the sequences to be aligned, so that laborious prior simulation studies or database comparisons for the estimation of parameters describing the connection between score and E-value are unnecessary. Simulation studies give evidence that this method gives reasonable results even when the usual assumptions like the independence of sequence positions are violated.
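The link between E-value and P-value under a Poisson heuristic can be written in two lines. The sketch below assumes the usual Poisson approximation, in which the number of chance alignments reaching a given score is roughly Poisson with mean E, so the probability of at least one is 1 - exp(-E). How E is estimated from the sequences themselves is specific to the paper and is not reproduced here; the function names are hypothetical.

```python
import math


def pvalue_from_evalue(e):
    """P(at least one chance hit) when the hit count is Poisson with mean e."""
    return -math.expm1(-e)


def evalue_from_pvalue(p):
    """Inverse of the relation above: E = -log(1 - P)."""
    return -math.log1p(-p)
```

For small E the two quantities nearly coincide (for example, E = 1e-6 gives P very close to 1e-6), while for E = 1 the P-value is about 0.63.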
Affiliation(s)
- Dirk Metzler
- Institut für Informatik, Johann Wolfgang Goethe-Universität, Frankfurt am Main, Germany.
|
27
|
Kalita MK, Ramasamy G, Duraisamy S, Chauhan VS, Gupta D. ProtRepeatsDB: a database of amino acid repeats in genomes. BMC Bioinformatics 2006; 7:336. [PMID: 16827924 PMCID: PMC1538635 DOI: 10.1186/1471-2105-7-336] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.3] [Received: 02/22/2006] [Accepted: 07/07/2006] [Indexed: 11/13/2022]
Abstract
Background Genome wide and cross species comparison of amino acid repeats is an intriguing problem in biology, mainly due to the highly polymorphic nature and diverse functions of amino acid repeats. Innate protein repeats constitute vital functional and structural regions in proteins. Repeats are of great consequence in the evolution of proteins, as evident from analysis of repeats in different organisms. In the post genomic era, the availability of protein sequences encoded in different genomes provides a unique opportunity to perform large scale comparative studies of amino acid repeats. ProtRepeatsDB is a relational database of perfect and mismatch repeats, access to which is designed as a resource and collection of tools for detection and cross species comparisons of different types of amino acid repeats. Description ProtRepeatsDB (v1.2) consists of perfect as well as mismatch amino acid repeats in the protein sequences of 141 organisms, the genomes of which are now available. The web interface of ProtRepeatsDB consists of different tools to perform repeat searches based on protein IDs, organism name, repeat sequences, keywords as in FASTA headers, size, frequency, gene ontology (GO) annotation IDs and regular expressions (REGEXP) describing repeats. These tools also allow formulation of a variety of simple, complex and logical queries to facilitate mining and large-scale cross-species comparisons of amino acid repeats. In addition, the database contains sequence analysis tools to determine repeats in user input sequences. Conclusion ProtRepeatsDB is a multi-organism database of different types of amino acid repeats present in proteins. It integrates useful tools to perform genome wide queries for rapid screening and identification of amino acid repeats and facilitates comparative and evolutionary studies of the repeats. The database is useful for identification of species or organism specific repeat markers, interspecies variations and polymorphism.
Affiliation(s)
- Mridul K Kalita
- Structural and Computational Biology Group, Malaria Group, International Centre for Genetic Engineering and Biotechnology (ICGEB), Aruna Asaf Ali Marg, New Delhi 110067, India
| | - Gowthaman Ramasamy
- Structural and Computational Biology Group, Malaria Group, International Centre for Genetic Engineering and Biotechnology (ICGEB), Aruna Asaf Ali Marg, New Delhi 110067, India
| | - Sekhar Duraisamy
- Dana-farber Cancer Institute, Harvard Medical School, Dana-830, 44-Binney street, Boston, MA-02115, USA
| | - Virander S Chauhan
- Malaria Group, International Centre for Genetic Engineering and Biotechnology (ICGEB), Aruna Asaf Ali Marg, New Delhi 110067, India
| | - Dinesh Gupta
- Structural and Computational Biology Group, Malaria Group, International Centre for Genetic Engineering and Biotechnology (ICGEB), Aruna Asaf Ali Marg, New Delhi 110067, India
|
28
|
Hassenforder C, Mercier S. Exact Distribution of the Local Score for Markovian Sequences. Ann Inst Stat Math 2006. [DOI: 10.1007/s10463-006-0064-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 10/24/2022]
|
29
|
Fleming K, Kelley LA, Islam SA, MacCallum RM, Muller A, Pazos F, Sternberg MJ. The proteome: structure, function and evolution. Philos Trans R Soc Lond B Biol Sci 2006; 361:441-51. [PMID: 16524832 PMCID: PMC1609342 DOI: 10.1098/rstb.2005.1802] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.0] [Indexed: 11/12/2022]
Abstract
This paper reports two studies to model the inter-relationships between protein sequence, structure and function. First, an automated pipeline to provide a structural annotation of proteomes in the major genomes is described. The results are stored in a database at Imperial College, London (3D-GENOMICS) that can be accessed at www.sbg.bio.ic.ac.uk. Analysis of the assignments to structural superfamilies provides evolutionary insights. 3D-GENOMICS is being integrated with related proteome annotation data at University College London and the European Bioinformatics Institute in a project known as e-protein (http://www.e-protein.org/). The second topic is motivated by the developments in structural genomics projects in which the structure of a protein is determined prior to knowledge of its function. We have developed a new approach PHUNCTIONER that uses the gene ontology (GO) classification to supervise the extraction of the sequence signal responsible for protein function from a structure-based sequence alignment. Using GO we can obtain profiles for a range of specificities described in the ontology. In the region of low sequence similarity (around 15%), our method is more accurate than assignment from the closest structural homologue. The method is also able to identify the specific residues associated with the function of the protein family.
Affiliation(s)
- Keiran Fleming
- Structural Bioinformatics Group, Centre for Bioinformatics, Division of Molecular Biosciences, Imperial College of Science, Technology and Medicine, London SW7 2AZ, UK
| | - Lawrence A Kelley
- Structural Bioinformatics Group, Centre for Bioinformatics, Division of Molecular Biosciences, Imperial College of Science, Technology and Medicine, London SW7 2AZ, UK
- Biomolecular Modelling Laboratory, Cancer Research UK, 44 Lincoln's Inn Fields, London WC2A 3PX, UK
| | - Suhail A Islam
- Structural Bioinformatics Group, Centre for Bioinformatics, Division of Molecular Biosciences, Imperial College of Science, Technology and Medicine, London SW7 2AZ, UK
- Biomolecular Modelling Laboratory, Cancer Research UK, 44 Lincoln's Inn Fields, London WC2A 3PX, UK
| | - Robert M MacCallum
- Biomolecular Modelling Laboratory, Cancer Research UK, 44 Lincoln's Inn Fields, London WC2A 3PX, UK
| | - Arne Muller
- Structural Bioinformatics Group, Centre for Bioinformatics, Division of Molecular Biosciences, Imperial College of Science, Technology and Medicine, London SW7 2AZ, UK
- Biomolecular Modelling Laboratory, Cancer Research UK, 44 Lincoln's Inn Fields, London WC2A 3PX, UK
| | - Florencio Pazos
- Structural Bioinformatics Group, Centre for Bioinformatics, Division of Molecular Biosciences, Imperial College of Science, Technology and Medicine, London SW7 2AZ, UK
| | - Michael J.E. Sternberg
- Structural Bioinformatics Group, Centre for Bioinformatics, Division of Molecular Biosciences, Imperial College of Science, Technology and Medicine, London SW7 2AZ, UK
- Biomolecular Modelling Laboratory, Cancer Research UK, 44 Lincoln's Inn Fields, London WC2A 3PX, UK
|
30
|
Knizewski Ł, Ginalski K. Bacillus subtilis YkuK protein is distantly related to RNase H. FEMS Microbiol Lett 2006; 251:341-6. [PMID: 16165328 DOI: 10.1016/j.femsle.2005.08.020] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 05/19/2005] [Revised: 08/12/2005] [Accepted: 08/16/2005] [Indexed: 11/22/2022]
Abstract
In addition to one hypothetical viral sequence from Bacteriophage KVP40, the PfamA family of unknown function DUF458 (Pfam Accession No. PF04308) encompasses several uncharacterized bacterial proteins, including the Bacillus subtilis YkuK protein. Using Meta-BASIC, a highly sensitive method for detection of distant similarity between proteins, we assign DUF458 family members to the ribonuclease H-like (RNase H-like) superfamily. DUF458 sequences maintain all core secondary structure elements of the RNase H-like fold and share several conserved, presumably active site residues with RNase HI, including an invariant DDE motif. In addition to providing a model structure for a previously uncharacterized protein family, this finding suggests that DUF458 proteins function as nucleases. The unusual phyletic pattern, together with the presence of DUF458 in several thermophilic organisms, may suggest a potential role of these proteins in DNA repair under stressful conditions such as extreme heat or other stress that causes spore formation.
Affiliation(s)
- Łukasz Knizewski
- Interdisciplinary Centre for Mathematical and Computational Modelling, Warsaw University, Pawińskiego 5A, 02-106 Warsaw, Poland
|
31
|
Pang H, Tang J, Chen SS, Tao S. Statistical distributions of optimal global alignment scores of random protein sequences. BMC Bioinformatics 2005; 6:257. [PMID: 16225696 PMCID: PMC1276786 DOI: 10.1186/1471-2105-6-257] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.7] [Received: 07/11/2005] [Accepted: 10/15/2005] [Indexed: 12/02/2022]
Abstract
Background The inference of homology from statistically significant sequence similarity is a central issue in sequence alignments. So far the statistical distribution function underlying the optimal global alignments has not been completely determined. Results In this study, random and real but unrelated sequences prepared in six different ways were selected as reference datasets to obtain their respective statistical distributions of global alignment scores. All alignments were carried out with the Needleman-Wunsch algorithm and optimal scores were fitted to the Gumbel, normal and gamma distributions respectively. The three-parameter gamma distribution performs the best as the theoretical distribution function of global alignment scores, as it agrees perfectly well with the distribution of alignment scores. The normal distribution also agrees well with the score distribution frequencies when the shape parameter of the gamma distribution is sufficiently large, for this is the scenario when the normal distribution can be viewed as an approximation of the gamma distribution. Conclusion We have shown that the optimal global alignment scores of random protein sequences fit the three-parameter gamma distribution function. This would be useful for the inference of homology between sequences whose relationship is unknown, through the evaluation of gamma distribution significance between sequences.
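Fitting a three-parameter gamma distribution to a sample of alignment scores can be sketched with the method of moments. This is a swapped-in, simple estimator chosen for illustration; the abstract does not state which fitting procedure the authors used. For shape k, scale θ and location c, the mean is c + kθ, the variance is kθ², and the skewness is 2/√k, which gives the estimates below. The function name `fit_gamma3_moments` is hypothetical.

```python
import statistics


def fit_gamma3_moments(scores):
    """Method-of-moments fit of a gamma distribution with a location shift.

    Returns (shape, scale, loc). Requires positively skewed data, because the
    gamma family has skewness 2 / sqrt(shape) > 0.
    """
    n = len(scores)
    mean = statistics.mean(scores)
    sd = statistics.pstdev(scores)  # population standard deviation
    skew = sum((s - mean) ** 3 for s in scores) / (n * sd ** 3)
    if skew <= 0:
        raise ValueError("sample skewness must be positive for a gamma fit")
    shape = 4.0 / skew ** 2       # from skewness = 2 / sqrt(shape)
    scale = sd * skew / 2.0       # from variance = shape * scale**2
    loc = mean - shape * scale    # from mean = loc + shape * scale
    return shape, scale, loc
```

As the sample skewness approaches zero the fitted shape grows without bound, which matches the abstract's remark that the normal distribution becomes a good approximation when the gamma shape parameter is large.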
Affiliation(s)
- Hongxia Pang
- School of Life Science, Northwest A&F University, Yangling, Shaanxi, China
- Institute of Bioinformatics, Northwest A&F University, Yangling, Shaanxi, China
| | - Jiaowei Tang
- School of Life Science, Northwest A&F University, Yangling, Shaanxi, China
- Institute of Bioinformatics, Northwest A&F University, Yangling, Shaanxi, China
| | - Su-Shing Chen
- Institute of Bioinformatics, Northwest A&F University, Yangling, Shaanxi, China
| | - Shiheng Tao
- School of Life Science, Northwest A&F University, Yangling, Shaanxi, China
- Institute of Bioinformatics, Northwest A&F University, Yangling, Shaanxi, China
|
32
|
Sheetlin S, Park Y, Spouge JL. The Gumbel pre-factor k for gapped local alignment can be estimated from simulations of global alignment. Nucleic Acids Res 2005; 33:4987-94. [PMID: 16147981 PMCID: PMC1199557 DOI: 10.1093/nar/gki800] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.8] [Indexed: 12/03/2022]
Abstract
The optimal gapped local alignment score of two random sequences follows a Gumbel distribution. The Gumbel distribution has two parameters, the scale parameter λ and the pre-factor k. Presently, the basic local alignment search tool (BLAST) programs (BLASTP (BLAST for proteins), PSI-BLAST, etc.) all use time-consuming computer simulations to determine the Gumbel parameters. Because the simulations must be done offline, BLAST users are restricted in their choice of alignment scoring schemes. The ultimate aim of this paper is to speed the simulations, to determine the Gumbel parameters online, and to remove the corresponding restrictions on BLAST users. Simulations for the scale parameter λ can be as much as five times faster if they use global instead of local alignment [R. Bundschuh (2002) J. Comput. Biol., 9, 243–260]. Unfortunately, the acceleration does not extend to determining the Gumbel pre-factor k, because k has no known mathematical relationship to global alignment. This paper relates k to global alignment and exploits the relationship to show that for the BLASTP defaults, 10 000 realizations with sequences of average length 140 suffice to estimate both Gumbel parameters λ and k within the errors required (λ, 0.8%; k, 10%). For the BLASTP defaults, simulations for both Gumbel parameters now take less than 30 s on a 2.8 GHz Pentium 4 processor.
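To make the roles of λ and k concrete, the sketch below estimates both from a sample of simulated local alignment scores by the method of moments and then evaluates the Gumbel tail P(S ≥ x) ≈ 1 - exp(-k·m·n·e^(-λx)) for sequences of lengths m and n. This is a plain moment estimator used for illustration only; it is not the global-alignment technique the paper develops, and the function names are hypothetical.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel_moments(scores, m, n):
    """Estimate (lam, k) from simulated scores for sequence lengths m and n.

    Uses var = pi**2 / (6 * lam**2) and mean = mu + gamma / lam,
    with location mu = log(k * m * n) / lam.
    """
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)
    lam = math.pi / (sd * math.sqrt(6))
    mu = mean - EULER_GAMMA / lam
    k = math.exp(lam * mu) / (m * n)
    return lam, k


def gumbel_pvalue(score, lam, k, m, n):
    """P(S >= score) = 1 - exp(-k * m * n * exp(-lam * score))."""
    return -math.expm1(-k * m * n * math.exp(-lam * score))
```

Because k enters through exp(λμ), a small relative error in λ is amplified in k, which is consistent with the abstract's looser error target for k (10%) than for λ (0.8%).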
Affiliation(s)
- John L. Spouge
- To whom correspondence should be addressed. Tel: +301 402 9310; Fax: +301 480 2288
|
33
|
Lai X, Guo J, Zhang X, Wang H. Identification of a novel domain - DIM, which defines a new family composed mainly of bacterial membrane proteins. FEMS Microbiol Lett 2005; 246:87-90. [PMID: 15869966 DOI: 10.1016/j.femsle.2005.03.039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/01/2004] [Revised: 02/27/2005] [Accepted: 03/23/2005] [Indexed: 10/25/2022]
Abstract
We report here the identification of a novel domain - DIM (N-terminal domain in bacterial membrane proteins and other proteins) present exclusively in bacterial species including mycobacteria, revealed by PSI-BLAST iterative searches. DIM comprises about 53 amino acids in length with conserved Leu, Ile and Gly residues. Secondary structure prediction indicated that this domain contains two alpha-helices. DIM occurs at the N-terminus of proteins, and was found particularly but not exclusively in proteins with a transmembrane domain, and also in proteins with a FHA domain or RPT repeats. DIM-containing proteins have been reported to be involved in pathogenicity, signal transduction or small solute transport.
Affiliation(s)
- Xuhui Lai
- State Key Laboratory of Genetic Engineering, Institute of Genetics, School of Life Sciences, Fudan University, Handan Road 220, Shanghai 200433, PR China
|
34
|
Poleksic A, Danzer JF, Hambly K, Debe DA. Convergent Island Statistics: a fast method for determining local alignment score significance. Bioinformatics 2005; 21:2827-31. [PMID: 15817690 DOI: 10.1093/bioinformatics/bti433] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Indexed: 11/14/2022]
Abstract
MOTIVATION Background distribution statistics for profile-based sequence alignment algorithms cannot be calculated analytically, and hence such algorithms must resort to measuring the significance of an alignment score by assessing its location among a distribution of background alignment scores. The Gumbel parameters that describe this background distribution are usually pre-computed for a limited number of scoring systems, gap schemes, and sequence lengths and compositions. The use of such look-ups is known to introduce errors, which compromise the significance assessment of a remote homology relationship. One solution is to estimate the background distribution for each pair of interest by generating a large number of sequence shuffles and use the distribution of their scores to approximate the parameters of the underlying extreme value distribution. This is computationally very expensive, as a large number of shuffles are needed to precisely estimate the score statistics. RESULTS Convergent Island Statistics (CIS) is a computationally efficient solution to the problem of calculating the Gumbel distribution parameters for an arbitrary pair of sequences and an arbitrary set of gap and scoring schemes. The basic idea behind our method is to recognize the lack of similarity for any pair of sequences early in the shuffling process and thus save on the search time. The method is particularly useful in the context of profile-profile alignment algorithms where the normalization of alignment scores has traditionally been a challenging task. CONTACT aleksandar@eidogen.com SUPPLEMENTARY INFORMATION http://www.eidogen-sertanty.com/Documents/convergent_island_stats_sup.pdf.
|
35
|
Velayos-Baeza A, Vettori A, Copley RR, Dobson-Stone C, Monaco AP. Analysis of the human VPS13 gene family. Genomics 2005; 84:536-49. [PMID: 15498460 DOI: 10.1016/j.ygeno.2004.04.012] [Citation(s) in RCA: 152] [Impact Index Per Article: 8.0] [Received: 02/13/2004] [Accepted: 04/19/2004] [Indexed: 01/11/2023]
Abstract
The gene mutated in chorea-acanthocytosis (CHAC; approved gene symbol VPS13A) encodes chorein, a protein similar to yeast Vps13p. We detected several similar putative human proteins by BLAST analysis of chorein. We characterized the structure of three new genes encoding these CHAC-similar proteins, located on chromosomes 1p36, 8q22, and 15q21. The most similar gene in yeast to all four human genes is Vps13, and therefore the human genes were named VPS13A (CHAC, 9q21), VPS13B (8q22), VPS13C (15q21), and VPS13D (1p36). VPS13B has recently been reported as COH1, altered in Cohen syndrome. For each gene, we describe several alternative splicing variants; at least two transcripts per gene are major forms. The expression pattern of these genes is ubiquitous, with some tissue-specific differences between several transcript variants. Protein sequence comparisons suggest that intramolecular duplications have played an important role in the evolution of this gene family.
Affiliation(s)
- Antonio Velayos-Baeza
- Wellcome Trust Centre for Human Genetics, University of Oxford, Headington, OX3 7BN Oxford, UK
|
36
|
Kann MG, Thiessen PA, Panchenko AR, Schäffer AA, Altschul SF, Bryant SH. A structure-based method for protein sequence alignment. Bioinformatics 2004; 21:1451-6. [PMID: 15613392 DOI: 10.1093/bioinformatics/bti233] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.9] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION: With the continuing rapid growth of protein sequence data, protein sequence comparison methods have become the most widely used tools of bioinformatics. Among these methods are those that use position-specific scoring matrices (PSSMs) to describe protein families. PSSMs can capture information about conserved patterns within families, which can be used to increase the sensitivity of searches for related sequences. Certain types of structural information, however, are not generally captured by PSSM search methods. Here we introduce a program, Structure-based ALignment TOol (SALTO), that aligns protein query sequences to PSSMs using rules for placing and scoring gaps that are consistent with the conserved regions of domain alignments from NCBI's Conserved Domain Database. RESULTS: In most cases, the alignment scores obtained using the local alignment version follow an extreme value distribution. SALTO's performance in finding related sequences and producing accurate alignments is similar to or better than that of IMPALA; one advantage of SALTO is that it imposes an explicit gapping model on each protein family. AVAILABILITY: A stand-alone version of the program that can generate global or local alignments is available by ftp distribution (ftp://ftp.ncbi.nih.gov/pub/SALTO/), and has been incorporated into the Cn3D structure/alignment viewer. CONTACT: bryant@ncbi.nlm.nih.gov.
Affiliation(s)
- Maricel G Kann
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Department of Health and Human Services, Bethesda, MD 20894, USA
|
37
|
Park Y, Sheetlin S, Spouge JL. Accelerated convergence and robust asymptotic regression of the Gumbel scale parameter for gapped sequence alignment. J Phys A Math Gen 2004. [DOI: 10.1088/0305-4470/38/1/006] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 11/12/2022]
|
38
|
Grossmann S, Yakir B. Large deviations for global maxima of independent superadditive processes with negative drift and an application to optimal sequence alignments. BERNOULLI 2004. [DOI: 10.3150/bj/1099579157] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.7] [Indexed: 11/19/2022]
Affiliation(s)
- Steffen Grossmann
- Department of Computational Molecular Biology, Max-Planck-Institute for Molecular Genetics
|
39
|
Smyth I, Du X, Taylor MS, Justice MJ, Beutler B, Jackson IJ. The extracellular matrix gene Frem1 is essential for the normal adhesion of the embryonic epidermis. Proc Natl Acad Sci U S A 2004; 101:13560-5. [PMID: 15345741 PMCID: PMC518794 DOI: 10.1073/pnas.0402760101] [Citation(s) in RCA: 97] [Impact Index Per Article: 4.9] [Received: 04/19/2004] [Indexed: 11/18/2022] Open
Abstract
Fraser syndrome is a rare recessive disorder characterized by cryptophthalmos, syndactyly, renal defects, and a range of other developmental abnormalities. Because of their extensive phenotypic overlap, the mouse blebbing mutants have been considered models of this disorder, and the recent isolation of mutations in Fras1 in both the blebbed mouse and human Fraser patients confirms this hypothesis. Here we report the identification of mutations in an extracellular matrix gene, Fras1-related extracellular matrix gene 1 (Frem1), in both the classic head blebs mutant and in an N-ethyl-N-nitrosourea-induced allele. We show that inactivation of the gene results in the formation of in utero epidermal blisters beneath the lamina densa of the basement membrane and also in renal agenesis. Frem1 is expressed widely in the developing embryo in regions of epithelial/mesenchymal interaction and epidermal remodeling. Furthermore, Frem1 appears to act as a dermal mediator of basement membrane adhesion, apparently independently of the other known "blebs" proteins Fras1 and Grip1. Unlike both Fras1 and Grip1 mutants, collagen VI and Fras1 deposition in the basement membrane is normal, indicating that the protein plays an independent role in epidermal differentiation and is required for epidermal adhesion during embryonic development.
Affiliation(s)
- Ian Smyth
- Medical Research Council Human Genetics Unit, Crewe Road, Edinburgh EH4 2XU, Scotland, United Kingdom.
|
40
|
Chia N, Bundschuh R. Finite width model sequence comparison. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2004; 70:021906. [PMID: 15447514 DOI: 10.1103/physreve.70.021906] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 03/17/2004] [Indexed: 05/24/2023]
Abstract
Sequence comparison is a widely used computational technique in modern molecular biology. In spite of the frequent use of sequence comparisons, the important problem of assigning statistical significance to a given degree of similarity is still outstanding. Analytical approaches to filling this gap usually make use of an approximation that neglects certain correlations in the disorder underlying the sequence comparison algorithm. Here, we use the longest common subsequence problem, a prototype sequence comparison problem, to analytically establish that this approximation does make a difference to certain sequence comparison statistics. In the course of establishing this difference we develop a method that can systematically deal with these disorder correlations.
Affiliation(s)
- Nicholas Chia
- Department of Physics, Ohio State University, 174 West 18th Street, Columbus, Ohio 43210, USA
|
41
|
Ginalski K, Kinch L, Rychlewski L, Grishin NV. BOF: a novel family of bacterial OB-fold proteins. FEBS Lett 2004; 567:297-301. [PMID: 15178340 DOI: 10.1016/j.febslet.2004.04.086] [Citation(s) in RCA: 33] [Impact Index Per Article: 1.7] [Received: 04/01/2004] [Accepted: 04/19/2004] [Indexed: 11/22/2022]
Abstract
Using top-of-the-line fold recognition methods, we assigned an oligonucleotide/oligosaccharide-binding (OB)-fold structure to a family of previously uncharacterized hypothetical proteins from several bacterial genomes. This novel family of bacterial OB-fold (BOF) proteins present in a number of pathogenic strains encompasses sequences of unknown function from DUF388 (in Pfam database) and COG3111. The BOF proteins can be linked evolutionarily to other members of the OB-fold nucleic acid-binding superfamily (anticodon-binding and single strand DNA-binding domains), although they probably lack nucleic acid-binding properties as implied by the analysis of the potential binding site. The presence of a conserved, predicted N-terminal signal peptide indicates that BOF family members localize in the periplasm where they may function to bind proteins, small molecules, or other typical OB-fold ligands. As hypothesized for the distantly related OB-fold-containing bacterial enterotoxins, the loss of nucleotide-binding function and the rapid evolution of the BOF ligand-binding site may be associated with the presence of BOF proteins in mobile genetic elements and their potential role in bacterial pathogenicity.
Affiliation(s)
- Krzysztof Ginalski
- Department of Biochemistry, University of Texas, Southwestern Medical Center, 5323 Harry Hines Boulevard, Dallas, TX 75390-9038, USA.
|
42
|
Staub E, Fiziev P, Rosenthal A, Hinzmann B. Insights into the evolution of the nucleolus by an analysis of its protein domain repertoire. Bioessays 2004; 26:567-81. [PMID: 15112237 DOI: 10.1002/bies.20032] [Citation(s) in RCA: 86] [Impact Index Per Article: 4.3] [Indexed: 01/11/2023]
Abstract
Recently, the first investigation of nucleoli using mass spectrometry led to the identification of 271 proteins. This represents a rich resource for a comprehensive investigation of nucleolus evolution. We applied a protocol for the identification of known and novel conserved protein domains of the nucleolus, resulting in the identification of 115 known and 91 novel domain profiles. The phyletic distribution of nucleolar protein domains in a collection of complete proteomes of selected organisms from all domains of life confirms the archaebacterial origin of the core machinery for ribosome maturation and assembly, but also reveals substantial eubacterial and eukaryotic contributions to nucleolus evolution. We predict that, in different phases of nucleolus evolution, protein domains with different biochemical functions were recruited to the nucleolus. We suggest a model for the late and continuous evolution of the nucleolus in early eukaryotes and argue against an endosymbiotic origin of the nucleolus and the nucleus. Supplementary material for this article can be found on the BioEssays website at http://www.interscience.wiley.com/jpages/0265-9247/suppmat/index.html.
Affiliation(s)
- Eike Staub
- metaGen Pharmaceuticals GmbH, Berlin, Germany.
|
43
|
Ginalski K, Rychlewski L, Baker D, Grishin NV. Protein structure prediction for the male-specific region of the human Y chromosome. Proc Natl Acad Sci U S A 2004; 101:2305-10. [PMID: 14983005 PMCID: PMC356946 DOI: 10.1073/pnas.0306306101] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.0] [Indexed: 11/18/2022] Open
Abstract
The complete sequence of the male-specific region of the human Y chromosome (MSY) has been determined recently; however, detailed characterization for many of its encoded proteins still remains to be done. We applied state-of-the-art protein structure prediction methods to all 27 distinct MSY-encoded proteins to provide better understanding of their biological functions and their mechanisms of action at the molecular level. The results of such large-scale structure-functional annotation provide a comprehensive view of the MSY proteome, shedding light on MSY-related processes. We found that, in total, at least 60 domains are encoded by 27 distinct MSY genes, of which 42 (70%) were reliably mapped to currently known structures. The most challenging predictions include the unexpected but confident 3D structure assignments for three domains identified here encoded by the USP9Y, UTY, and BPY2 genes. The domains with unknown 3D structures that are not predictable with currently available theoretical methods are established as primary targets for crystallographic or NMR studies. The data presented here set up the basis for additional scientific discoveries in human biology of the Y chromosome, which plays a fundamental role in sex determination.
Affiliation(s)
- Krzysztof Ginalski
- Department of Biochemistry, University of Texas Southwestern Medical Center, 5323 Harry Hines Boulevard, Dallas, TX 75390-9038, USA.
|
44
|
Ciccarelli FD, Izaurralde E, Bork P. The PAM domain, a multi-protein complex-associated module with an all-alpha-helix fold. BMC Bioinformatics 2003; 4:64. [PMID: 14687415 PMCID: PMC319699 DOI: 10.1186/1471-2105-4-64] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.7] [Received: 10/21/2003] [Accepted: 12/19/2003] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Multimeric protein complexes have a role in many cellular pathways and are highly interconnected with various other proteins. The characterization of their domain composition and organization provides useful information on the specific role of each region of their sequence. RESULTS: We identified a new module, the PAM domain (PCI/PINT associated module), present in single subunits of well characterized multiprotein complexes, like the regulatory lid of the 26S proteasome, the COP-9 signalosome and the Sac3-Thp1 complex. This module is a domain of around 200 residues with a predicted TPR-like all-alpha-helical fold. CONCLUSIONS: The occurrence of the PAM domain in specific subunits of multimeric protein complexes, together with the role of other all-alpha-helical folds in protein-protein interactions, suggests a function for this domain in mediating transient binding to diverse target proteins.
Affiliation(s)
- Francesca D Ciccarelli
- European Molecular Biology Laboratory, Meyerhofstr. 1, 69012 Heidelberg, Germany
- Max-Delbrueck-Centrum, PO Box 740238, D-13092 Berlin, Germany
- Elisa Izaurralde
- European Molecular Biology Laboratory, Meyerhofstr. 1, 69012 Heidelberg, Germany
- Peer Bork
- European Molecular Biology Laboratory, Meyerhofstr. 1, 69012 Heidelberg, Germany
- Max-Delbrueck-Centrum, PO Box 740238, D-13092 Berlin, Germany
|
45
|
Copley RR, Ponting CP, Schultz J, Bork P. Sequence analysis of multidomain proteins: past perspectives and future directions. ADVANCES IN PROTEIN CHEMISTRY 2003; 61:75-98. [PMID: 12461821 DOI: 10.1016/s0065-3233(02)61002-2] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.6] [Indexed: 01/25/2023]
|
46
|
Eisenhaber B, Maurer-Stroh S, Novatchkova M, Schneider G, Eisenhaber F. Enzymes and auxiliary factors for GPI lipid anchor biosynthesis and post-translational transfer to proteins. Bioessays 2003; 25:367-85. [PMID: 12655644 DOI: 10.1002/bies.10254] [Citation(s) in RCA: 139] [Impact Index Per Article: 6.6] [Indexed: 11/07/2022]
Abstract
GPI lipid anchoring is an important post-translational modification of eukaryote proteins in the endoplasmic reticulum. In total, 19 genes have been directly implicated in the anchor synthesis and the substrate protein modification pathway. Here, the molecular functions of the respective proteins and their evolution are analyzed in the context of reported literature data and sequence analysis studies for the complete pathway (http://mendel.imp.univie.ac.at/SEQUENCES/gpi-biosynthesis/) and questions for future experimental investigation are discussed. Studies of two of these proteins have provided new mechanistic insights. The cytosolic part of PIG-A/GPI3 has a two-domain alpha/beta/alpha-layered structure; it is suggested that its C-terminal subsegment binds UDP-GlcNAc whereas the N-terminal domain interacts with the phosphatidylinositol moiety. The lumenal part of PIG-T/GPI16 apparently consists of a beta-propeller with a central hole that regulates the access of substrate protein C termini to the active site of the cysteine protease PIG-K/GPI8 (gating mechanism) as well as of a polypeptide hook that embraces PIG-K/GPI8. This structural proposal would explain the paradoxical properties of the GPI lipid anchor signal motif and of PIG-K/GPI8 orthologs without membrane insertion regions in some species.
Affiliation(s)
- Birgit Eisenhaber
- Research Institute of Molecular Pathology, Dr. Bohr-Gasse 7, A-1030 Vienna, Republic Austria
|
47
|
Abstract
This paper reports an analysis of the encoded proteins (the proteome) of the genomes of human, fly, worm, yeast, and representatives of bacteria and archaea in terms of the three-dimensional structures of their globular domains together with a general sequence-based study. We show that 39% of the human proteome can be assigned to known structures. We estimate that for 77% of the proteome, there is some functional annotation, but only 26% of the proteome can be assigned to standard sequence motifs that characterize function. Of the human protein sequences, 13% are transmembrane proteins, but only 3% of the residues in the proteome form membrane-spanning regions. There are substantial differences in the composition of globular domains of transmembrane proteins between the proteomes we have analyzed. Commonly occurring structural superfamilies are identified within the proteome. The frequencies of these superfamilies enable us to estimate that 98% of the human proteome evolved by domain duplication, with four of the 10 most duplicated superfamilies specific for multicellular organisms. The zinc-finger superfamily is massively duplicated in human compared to fly and worm, and occurrence of domains in repeats is more common in metazoa than in single-celled organisms. Structural superfamilies over- and underrepresented in human disease genes have been identified. Data and results can be downloaded and analyzed via web-based applications at http://www.sbg.bio.ic.ac.uk.
Affiliation(s)
- Arne Müller
- Biomolecular Modelling Laboratory, Cancer Research UK, London, United Kingdom
|
48
|
Abstract
To assess the significance of sequence alignments, it is crucial to know the distribution of alignment scores of pairs of random sequences. For gapped local alignment, it is empirically known that the shape of this distribution is of the Gumbel form. However, the determination of the parameters of this distribution is a computationally very expensive task. We present a new algorithmic approach which allows estimation of the more important of the Gumbel parameters at least five times faster than the traditional methods. Actual runtimes of our algorithm, ranging from less than a second to a few minutes on a workstation, bring significance estimation into the realm of interactive applications.
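For orientation, the Gumbel form mentioned in this abstract is commonly written as below. The notation is the standard Karlin-Altschul one and is an assumption here, not taken from the abstract: $m$ and $n$ are the lengths of the two sequences, $\lambda$ sets the decay rate of the tail, and $K$ sets its overall scale.

$$P(S \ge x) \approx 1 - \exp\left(-K\,m\,n\,e^{-\lambda x}\right)$$

For large $x$ this reduces to $P(S \ge x) \approx K\,m\,n\,e^{-\lambda x}$, so the estimated significance of a high score depends exponentially on $\lambda$ and only linearly on $K$.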
Affiliation(s)
- Ralf Bundschuh
- Department of Physics, Ohio State University, Columbus, OH 43210, USA
|
49
|
Staub E, Hinzmann B, Rosenthal A. A novel repeat in the melanoma-associated chondroitin sulfate proteoglycan defines a new protein family. FEBS Lett 2002; 527:114-8. [PMID: 12220645 DOI: 10.1016/s0014-5793(02)03195-2] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.2] [Indexed: 12/20/2022]
Abstract
The human melanoma-associated chondroitin sulfate proteoglycan (MCSP) and its rat ortholog NG2 are thought to play important roles in angiogenesis-dependent processes like wound healing and tumor growth. Based on electron microscopy studies, the highly glycosylated ectodomain of NG2 has been subdivided into the globular N-terminus, a flexible rod-like central region and a C-terminal portion in globular conformation. We identified a novel repeat named CSPG in the central ectodomain of NG2, MCSP and other proteins from fly, worm, human, sea urchin and a cyanobacterium which shows similarity to cadherin repeats. As earlier electron microscopy studies indicate, the folding of the tandem repeats compresses the length of the proposed repeat region by a factor of approximately 10 compared to the fully extended peptide chain. We identified two conserved negatively charged residues which might govern the binding properties of CSPG repeats. The phyletic distribution of CSPG repeats suggests that horizontal gene transfer contributed to their evolutionary history.
Affiliation(s)
- Eike Staub
- metaGen Pharmaceuticals GmbH, Oudenarder Str. 16, D-13347, Berlin, Germany.
|
50
|
Staub E, Pérez-Tur J, Siebert R, Nobile C, Moschonas NK, Deloukas P, Hinzmann B. The novel EPTP repeat defines a superfamily of proteins implicated in epileptic disorders. Trends Biochem Sci 2002; 27:441-4. [PMID: 12217514 DOI: 10.1016/s0968-0004(02)02163-1] [Citation(s) in RCA: 86] [Impact Index Per Article: 3.9] [Indexed: 10/27/2022]
Abstract
Recent studies suggest that mutations in the LGI1/Epitempin gene cause autosomal dominant lateral temporal epilepsy. This gene encodes a protein of unknown function, which we postulate is secreted. The LGI1 protein has leucine-rich repeats in the N-terminal sequence and a tandem repeat (which we named EPTP) in its C-terminal region. A redefinition of the C-terminal repeat and the application of sensitive sequence analysis methods enabled us to define a new superfamily of proteins carrying varying numbers of the novel EPTP repeats in combination with various extracellular domains. Genes encoding proteins of this family are located in genomic regions associated with epilepsy and other neurological disorders.
Affiliation(s)
- Eike Staub
- metaGen Pharmaceuticals GmbH, Oudenarder Strasse 16, D-13347 Berlin, Germany.
|