1
Duality Between the Local Score of One Sequence and Constrained Hidden Markov Model. Methodol Comput Appl Probab 2022. [DOI: 10.1007/s11009-021-09856-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/21/2022]
2
Margelevičius M. Estimating statistical significance of local protein profile-profile alignments. BMC Bioinformatics 2019; 20:419. [PMID: 31409275 PMCID: PMC6693267 DOI: 10.1186/s12859-019-2913-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 03/23/2019] [Accepted: 05/23/2019] [Indexed: 11/10/2022]
Abstract
BACKGROUND: Alignment of sequence families described by profiles provides a sensitive means for establishing homology between proteins and is important in protein evolutionary, structural, and functional studies. In the context of a steadily growing amount of sequence data, estimating the statistical significance of alignments, including profile-profile alignments, plays a key role in alignment-based homology search algorithms. Still, it remains an open question whether a single type of distribution governs profile-profile alignment scores and, if so, which one, especially when profile-profile substitution scores involve terms such as secondary structure predictions. RESULTS: This study presents a methodology for estimating the statistical significance of this type of alignment. The methodology rests on a new algorithm developed for generating random profiles such that their alignment scores are distributed similarly to those obtained for real unrelated profiles. We show that improvements in statistical accuracy, sensitivity, and high-quality alignment rate result from statistically characterizing alignments by establishing the dependence of statistical parameters on various measures associated with both individual and pairwise profile characteristics. Implemented in the COMER software, the proposed methodology yielded an increase of up to 34.2% in the number of true positives and up to 61.8% in the number of high-quality alignments with respect to the previous version of the COMER method. CONCLUSIONS: The more accurate estimation of statistical significance is implemented in the COMER method, which is now more sensitive and provides an increased rate of high-quality profile-profile alignments. The results of the present study also suggest directions for future research.
Affiliation(s)
- Mindaugas Margelevičius
- Institute of Biotechnology, Life Sciences Center, Vilnius University, Saulėtekio al. 7, Vilnius, 10257, Lithuania.
3
Screening of nucleotide variations in genomic sequences encoding charged protein regions in the human genome. BMC Genomics 2017; 18:588. [PMID: 28789634 PMCID: PMC5549384 DOI: 10.1186/s12864-017-4000-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2017] [Accepted: 08/01/2017] [Indexed: 11/24/2022]
Abstract
Background: Studying the distribution of genetic variation in proteins containing charged regions, called charge clusters (CCs), is of great interest for unravelling their functional role. Charge clusters are 20 to 75 residue segments with high net positive charge, high net negative charge, or high total charge relative to the overall charge composition of the protein. We previously developed a bioinformatics tool (FCCP) to detect charge clusters in proteomes and scanned the human proteome for the occurrence of CCs. In this paper we investigate the genetic variations in the human proteins harbouring CCs. Results: We studied the coding regions of 317 positively charged clusters and 1020 negatively charged ones previously detected in human proteins. Results revealed that coding parts of CCs are richer in sequence variants than their corresponding genes, full mRNAs, and exonic + intronic sequences and that these variants are predominantly rare (minor allele frequency < 0.005). Furthermore, variants occurring in the coding parts of positively charged regions of proteins are more often pathogenic than those occurring in negatively charged ones. Classification of variants according to their types showed that substitution is the major type, followed by indels (insertions-deletions). Concerning substitutions, it was found that within clusters of both charges, the charged amino acids were the greatest loser group whereas polar residues were the greatest gainers. Conclusions: Our findings highlight the prominent features of the human charged regions from the DNA up to the protein sequence, which might provide clues to improve the current understanding of those charged regions and their implication in the emergence of diseases. Electronic supplementary material: The online version of this article (doi:10.1186/s12864-017-4000-3) contains supplementary material, which is available to authorized users.
4
Lagnoux A, Mercier S, Vallois P. Statistical significance based on length and position of the local score in a model of i.i.d. sequences. Bioinformatics 2017; 33:654-660. [PMID: 28035025 DOI: 10.1093/bioinformatics/btw699] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 04/01/2016] [Accepted: 11/08/2016] [Indexed: 11/14/2022]
Abstract
Motivation: The local score is a mathematical tool widely used to analyse biological sequences. Consequently, accurately estimating its distribution is crucial. Results: First, we study the accuracy of classical results on the local score distribution in the independent and identically distributed model using a Kolmogorov-Smirnov goodness-of-fit test. Second, we highlight how the length of the segment that realizes the local score improves on the classical setting, which is based on the local score only. Finally, we study which part of the sequence contributes to the local score. Contact: mercier@univ-tlse2.fr.
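For readers unfamiliar with the quantity this abstract refers to, the following is a minimal sketch, not the authors' code. It assumes the usual definition of the local score as the maximum, over all contiguous segments, of the sum of per-position scores (with the empty segment scoring 0). The function name `local_score` and the example scores are hypothetical. The sketch also returns the position of the segment that realizes the score, since the abstract emphasises segment length and position.

```python
def local_score(scores):
    """Return (local score, start, end) for a list of per-position scores.

    The realizing segment is scores[start:end]; the empty segment (score 0)
    is returned when no segment has a positive sum.
    """
    best, best_start, best_end = 0, 0, 0
    current = 0      # running sum, reset to 0 whenever it would drop to 0 or below
    start = 0        # start index of the current candidate segment
    for k, x in enumerate(scores):
        current += x
        if current <= 0:
            # Any segment ending here is no better than starting afresh.
            current = 0
            start = k + 1
        elif current > best:
            best, best_start, best_end = current, start, k + 1
    return best, best_start, best_end


if __name__ == "__main__":
    # Illustrative scores only (an assumption, not data from the paper).
    example = [1, -2, 3, 2, -1, 4, -5, 1]
    score, start, end = local_score(example)
    print(score, start, end, "length:", end - start)
```

The single pass keeps the computation linear in the sequence length, and the segment length `end - start` is available at no extra cost.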
Affiliation(s)
- Agnès Lagnoux
- Institut de Mathématiques de Toulouse, UMR5219, Université de Toulouse 2 Jean Jaurès, 5 allées Antonio Machado, Toulouse, Cedex 09 31058, France
- Sabine Mercier
- Institut de Mathématiques de Toulouse, UMR5219, Université de Toulouse 2 Jean Jaurès, 5 allées Antonio Machado, Toulouse, Cedex 09 31058, France
- Pierre Vallois
- Institut Elie Cartan, UMR7502 CNRS, INRIA-BIGS, Université de Lorraine, Vandoeuvre-lès-Nancy Cedex 54506, France
5
Rare k-mer DNA: Identification of sequence motifs and prediction of CpG island and promoter. J Theor Biol 2015; 387:88-100. [PMID: 26427337 DOI: 10.1016/j.jtbi.2015.09.014] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 01/01/2015] [Revised: 09/10/2015] [Accepted: 09/15/2015] [Indexed: 12/20/2022]
Abstract
Empirical analysis of k-mer DNA has proven to be an effective tool for finding unique patterns in DNA sequences, which can lead to the discovery of potential sequence motifs. In an extensive study of empirical k-mer DNA in hundreds of organisms, researchers found that unique multi-modal k-mer spectra occur only in the genomes of organisms from the tetrapod clade, which includes all mammals. The multi-modality is caused by the formation of the two lowest modes, and the k-mers under them are referred to as rare k-mers. The suppression of the two lowest modes (or the rare k-mers) can be attributed to the CG dinucleotides they contain. Apart from that, the rare k-mers are selectively distributed in certain genomic features: CpG island (CGI), promoter, 5' UTR, and exon. We correlated the rare k-mers with hundreds of annotated features using several bioinformatic tools, performed further intrinsic rare k-mer analyses within the correlated features, and modeled the elucidated rare k-mer clustering feature into a classifier to predict the correlated CGI and promoter features. Our correlation results show that rare k-mers are highly associated with several annotated features: CGI, promoter, 5' UTR, and open chromatin regions. Our intrinsic results show that rare k-mers have several unique topological, compositional, and clustering properties in CGI and promoter features. Finally, our RWC (rare-word clustering) method ranks among the top three in eight of the CGI and promoter evaluations, among eight of the benchmarked datasets.
6
Belmabrouk S, Kharrat N, Benmarzoug R, Rebai A. Exploring proteome-wide occurrence of clusters of charged residues in eukaryotes. Proteins 2015; 83:1252-61. [PMID: 25963617 DOI: 10.1002/prot.24823] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 03/09/2015] [Revised: 04/17/2015] [Accepted: 04/29/2015] [Indexed: 11/09/2022]
Abstract
Clusters of charged residues are one of the key features of protein primary structure, since they have been associated with important functions of proteins. Here, we present a proteome-wide scan for the occurrence of charge clusters in protein sequences using a new search tool (FCCP) based on a score-based methodology. FCCP was run to search for charge clusters in seven eukaryotic proteomes: Arabidopsis thaliana, Caenorhabditis elegans, Danio rerio, Drosophila melanogaster, Homo sapiens, Mus musculus, and Saccharomyces cerevisiae. We found that negative charge clusters (NCCs) are three to four times more frequent than positive charge clusters (PCCs). The Drosophila proteome is on average the most charged, whereas the human proteome is the least charged. Only 3 to 8% of the studied protein sequences have NCCs, while 1.6 to 3% have PCCs and only 0.07 to 0.6% have both types of clusters. NCCs are localized predominantly in the N-terminal and C-terminal domains, while PCCs tend to be localized within the functional domains of the protein sequences. Furthermore, the gene ontology classification revealed that the protein sequences with NCCs and PCCs are mainly binding proteins.
Affiliation(s)
- Sabrine Belmabrouk
- Laboratory of Molecular and Cellular Screening Processes, Centre De Biotechnologie De Sfax, Bioinformatics Group, PO Box 1177, 3018 Sfax, Tunisia
- Najla Kharrat
- Laboratory of Molecular and Cellular Screening Processes, Centre De Biotechnologie De Sfax, Bioinformatics Group, PO Box 1177, 3018 Sfax, Tunisia
- Riadh Benmarzoug
- Laboratory of Molecular and Cellular Screening Processes, Centre De Biotechnologie De Sfax, Bioinformatics Group, PO Box 1177, 3018 Sfax, Tunisia
- Ahmed Rebai
- Laboratory of Molecular and Cellular Screening Processes, Centre De Biotechnologie De Sfax, Bioinformatics Group, PO Box 1177, 3018 Sfax, Tunisia
7
Kryukov K, Sumiyama K, Ikeo K, Gojobori T, Saitou N. A new database (GCD) on genome composition for eukaryote and prokaryote genome sequences and their initial analyses. Genome Biol Evol 2012; 4:501-12. [PMID: 22417913 PMCID: PMC3342873 DOI: 10.1093/gbe/evs026] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Indexed: 11/12/2022]
Abstract
Eukaryote genomes contain many noncoding regions, and they are quite complex. To understand these complexities, we constructed a database, the Genome Composition Database, of whole-genome composition statistics for 101 eukaryote genomes as well as more than 1,000 prokaryote genomes. Frequencies of all possible oligonucleotides of length one to ten were counted for each genome, and these observed values were compared with expected values computed from observed frequencies of oligonucleotides of length 1-4. Deviations from expected values were much larger for eukaryotes than prokaryotes, except for fungal genomes. Mammalian genomes showed the largest deviation among animals. The results of the comparison are available online at http://esper.lab.nig.ac.jp/genome-composition-database/.
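The observed-versus-expected comparison described above can be illustrated with a deliberately simplified sketch. It is not the database's pipeline: it assumes expected counts come only from single-nucleotide (length-1) frequencies, whereas the abstract uses frequencies of lengths 1-4. The function names `kmer_counts` and `observed_over_expected` are hypothetical.

```python
from collections import Counter
from math import prod


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def observed_over_expected(seq, k):
    """Ratio of observed k-mer counts to counts expected from base frequencies.

    Simplifying assumption: positions are independent, so the expected count
    of a k-mer is (number of windows) * product of its base frequencies.
    """
    n = len(seq)
    base_freq = {base: count / n for base, count in Counter(seq).items()}
    windows = n - k + 1
    ratios = {}
    for kmer, observed in kmer_counts(seq, k).items():
        expected = windows * prod(base_freq[base] for base in kmer)
        ratios[kmer] = observed / expected
    return ratios


if __name__ == "__main__":
    # Toy sequence for illustration only.
    for kmer, ratio in sorted(observed_over_expected("ACGTACGT", 2).items()):
        print(kmer, round(ratio, 3))
```

Ratios well above or below 1 mark over- and under-represented words under this simple model; a higher-order background model would change the expected counts.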
Affiliation(s)
- Kirill Kryukov
- Division of Population Genetics, National Institute of Genetics, Mishima, Japan
8
Fu WJ, Stromberg AJ, Viele K, Carroll RJ, Wu G. Statistics and bioinformatics in nutritional sciences: analysis of complex data in the era of systems biology. J Nutr Biochem 2010; 21:561-72. [PMID: 20233650 DOI: 10.1016/j.jnutbio.2009.11.007] [Citation(s) in RCA: 65] [Impact Index Per Article: 4.6] [Received: 09/26/2008] [Revised: 11/10/2009] [Accepted: 11/12/2009] [Indexed: 10/19/2022]
Abstract
Over the past 2 decades, there have been revolutionary developments in life science technologies characterized by high throughput, high efficiency, and rapid computation. Nutritionists now have the advanced methodologies for the analysis of DNA, RNA, protein, low-molecular-weight metabolites, as well as access to bioinformatics databases. Statistics, which can be defined as the process of making scientific inferences from data that contain variability, has historically played an integral role in advancing nutritional sciences. Currently, in the era of systems biology, statistics has become an increasingly important tool to quantitatively analyze information about biological macromolecules. This article describes general terms used in statistical analysis of large, complex experimental data. These terms include experimental design, power analysis, sample size calculation, and experimental errors (Type I and II errors) for nutritional studies at population, tissue, cellular, and molecular levels. In addition, we highlighted various sources of experimental variations in studies involving microarray gene expression, real-time polymerase chain reaction, proteomics, and other bioinformatics technologies. Moreover, we provided guidelines for nutritionists and other biomedical scientists to plan and conduct studies and to analyze the complex data. Appropriate statistical analyses are expected to make an important contribution to solving major nutrition-associated problems in humans and animals (including obesity, diabetes, cardiovascular disease, cancer, ageing, and intrauterine growth retardation).
Affiliation(s)
- Wenjiang J Fu
- Department of Epidemiology, Michigan State University, East Lansing, MI 48824, USA
9
Benachenhou F, Jern P, Oja M, Sperber G, Blikstad V, Somervuo P, Kaski S, Blomberg J. Evolutionary conservation of orthoretroviral long terminal repeats (LTRs) and ab initio detection of single LTRs in genomic data. PLoS One 2009; 4:e5179. [PMID: 19365549 PMCID: PMC2664473 DOI: 10.1371/journal.pone.0005179] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.7] [Received: 12/04/2008] [Accepted: 03/10/2009] [Indexed: 01/06/2023]
Abstract
Background: Retroviral LTRs, paired or single, influence the transcription of both retroviral and non-retroviral genomic sequences. Vertebrate genomes contain many thousands of endogenous retroviruses (ERVs) and their LTRs. Single LTRs are difficult to detect in genomic sequences without recourse to repetitiveness or presence in a proviral structure. Understanding of LTR structure increases understanding of LTR function and of functional genomics. Here we develop models of orthoretroviral LTRs useful for detection in genomes and for structural analysis. Principal Findings: Although mutated, ERV LTRs are more numerous and diverse than exogenous retroviral (XRV) LTRs. Hidden Markov models (HMMs), and alignments based on them, were created for HML- (human MMTV-like), general-beta-, gamma- and lentiretrovirus-like LTRs, plus a general-vertebrate LTR model. Training sets were XRV LTRs and RepBase LTR consensuses. The HML HMM was the most sensitive and detected 87% of the HML LTRs in human chromosome 19 at 96% specificity. By combining all HMMs with a low cutoff, for screening, 71% of all LTRs found by RepeatMasker in chromosome 19 were found. HMM consensus sequences had a conserved modular LTR structure. Target site duplications (TG-CA), TATA (occasionally absent), an AATAAA box and a T-rich region were prominent features. Most of the conservation was located in, or adjacent to, R and U5, with evidence for stem loops. Several of the long HML LTRs contained long ORFs inserted after the second A-rich module. HMM consensus alignment allowed comparison of functional features like transcriptional start sites (sense and antisense) between XRVs and ERVs. Conclusion: The modular, conserved and redundant orthoretroviral LTR structure with three A-rich regions is reminiscent of structurally relaxed Giardia promoters. The five HMMs provided a novel broad-range, repeat-independent, ab initio LTR detection, with prospects for greater generalisation, and insight into LTR structure, which may aid development of LTR-targeted pharmaceuticals.
Affiliation(s)
- Farid Benachenhou
- Section of Virology, Department of Medical Sciences, Uppsala University, Uppsala, Sweden
- Patric Jern
- Section of Virology, Department of Medical Sciences, Uppsala University, Uppsala, Sweden
- Merja Oja
- Helsinki Institute for Information Technology, Department of Computer Science, University of Helsinki and Laboratory of Computer and Information Science, Helsinki University of Technology, Helsinki, Finland
- Göran Sperber
- Unit of Physiology, Department of Neuroscience, Uppsala University, Uppsala, Sweden
- Vidar Blikstad
- Section of Virology, Department of Medical Sciences, Uppsala University, Uppsala, Sweden
- Panu Somervuo
- Helsinki Institute for Information Technology, Department of Computer Science, University of Helsinki and Laboratory of Computer and Information Science, Helsinki University of Technology, Helsinki, Finland
- Samuel Kaski
- Helsinki Institute for Information Technology, Department of Computer Science, University of Helsinki and Laboratory of Computer and Information Science, Helsinki University of Technology, Helsinki, Finland
- Jonas Blomberg
- Section of Virology, Department of Medical Sciences, Uppsala University, Uppsala, Sweden
10
Compressing proteomes: the relevance of medium range correlations. EURASIP Journal on Bioinformatics & Systems Biology 2008:60723. [PMID: 18256727 DOI: 10.1155/2007/60723] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 01/14/2007] [Revised: 05/28/2007] [Accepted: 09/10/2007] [Indexed: 11/17/2022]
Abstract
We study the nonrandomness of proteome sequences by analysing the correlations that arise between amino acids at short and medium range, more specifically, between amino acids located 10 or 100 residues apart, respectively. We show that statistical models that consider these two types of correlation are more likely to capture the information contained in protein sequences and thus achieve good compression rates. Finally, we propose that the cause of this redundancy is related to the evolutionary origin of proteomes and protein sequences.
11
Aschard H, Guedj M, Demenais F. A two-step multiple-marker strategy for genome-wide association studies. BMC Proc 2007; 1 Suppl 1:S134. [PMID: 18466477 PMCID: PMC2367542 DOI: 10.1186/1753-6561-1-s1-s134] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 11/10/2022]
Abstract
Genome-wide association studies raise study-design and analytical issues that are still being debated. Among them is the issue of reducing the number of markers to be genotyped without loss of efficiency in identifying trait loci, which can reduce the cost of studies and minimize the multiple testing problem. With this aim, we proposed a two-step strategy based on two analytical methods suited to examining sets of markers rather than single markers: the local score, which screens the genome to select candidate regions in Step 1, and FBAT-LC, a multiple-marker family-based association test used to obtain significance levels of regions in Step 2. The performance of this strategy was evaluated on all replicates of Genetic Analysis Workshop 15 Problem 3 simulated data, using the answers to that problem. Overall, seven of the nine generated trait loci were detected in at least 87% of the replicates using a framework designed to handle either association with the disease or association with the severity of disease. This multiple-marker strategy was compared to the single-marker approach. By considering regions instead of single markers, this strategy minimizes the multiple testing problem and the number of false-positive results.
Affiliation(s)
- Hugues Aschard
- INSERM, U794, Tour Evry 2, 523 Place des Terrasses de l'Agora, 91034, Evry, France; Université d'Evry Val d'Essonne, Boulevard François Mitterrand, 91025, Evry, France
- Mickaël Guedj
- Serono, France; Laboratoire Statistique et Génome, CNRS UMR8071, INRA U1152, Université d'Evry, Tour Evry 2, 523 Place des Terrasses de l'Agora, 91034, Evry, France
- Florence Demenais
- INSERM, U794, Tour Evry 2, 523 Place des Terrasses de l'Agora, 91034, Evry, France; Université d'Evry Val d'Essonne, Boulevard François Mitterrand, 91025, Evry, France
12
Reconsidering the significance of genomic word frequencies. Trends Genet 2007; 23:543-6. [PMID: 17964682 DOI: 10.1016/j.tig.2007.07.008] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.5] [Received: 06/26/2007] [Revised: 06/26/2007] [Accepted: 07/09/2007] [Indexed: 11/22/2022]
Abstract
By conventional wisdom, a feature that occurs too often or too rarely in a genome can indicate a functional element. To infer functionality from frequency, it is crucial to precisely characterize occurrences in randomly evolving DNA. We find that the frequency of oligonucleotides in a genomic sequence follows primarily a Pareto-lognormal distribution, which encapsulates lognormal and power-law features found across all known genomes. Such a distribution could be the result of completely random evolution by a copying process. Our characterization of the entire frequency distribution of genomic words opens a way to a more accurate reasoning about their over- and underrepresentation in genomic sequences.
13
Chew DSH, Leung MY, Choi KP. AT excursion: a new approach to predict replication origins in viral genomes by locating AT-rich regions. BMC Bioinformatics 2007; 8:163. [PMID: 17517140 PMCID: PMC1904460 DOI: 10.1186/1471-2105-8-163] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Received: 12/08/2006] [Accepted: 05/21/2007] [Indexed: 11/12/2022]
Abstract
Background: Replication origins are considered important sites for understanding the molecular mechanisms involved in DNA replication. Many computational methods have been developed for predicting their locations in archaeal, bacterial and eukaryotic genomes. However, a prediction method designed for a particular kind of genome might not work well for another. In this paper, we propose the AT excursion method, a score-based approach, to quantify local AT abundance in genomic sequences and use the identified high-scoring segments for predicting replication origins. This method has the advantages of requiring no preset window size and having rigorous criteria to evaluate the statistical significance of high-scoring segments. Results: We have evaluated the AT excursion method by checking its predictions against known replication origins in herpesviruses and comparing its performance with an existing base weighted score method (BWS1). Out of 43 known origins, 39 are predicted by either one or the other method and 26 origins are predicted by both. The excursion method identifies six origins not predicted by BWS1, showing that the AT excursion method is a valuable complement to BWS1. We have also applied the AT excursion method to two other families of double-stranded DNA viruses, the poxviruses and iridoviruses, of which very few replication origins are documented in the public domain. The prediction results are made available as supplementary materials at [1]. Preliminary investigation shows that the proposed method works well on some larger genomes too. Conclusion: The AT excursion method will be a useful computational tool for identifying replication origins in a variety of genomic sequences.
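To make the idea of a window-free, score-based scan concrete, here is a minimal sketch of an excursion-style search for AT-rich segments. It is an illustration under stated assumptions, not the published method: the scores (+1 for A or T, -2 for any other symbol) and the threshold `min_peak` are hypothetical, the function name `at_excursions` is invented for this example, and no significance evaluation is attempted.

```python
def at_excursions(seq, at_score=1, gc_score=-2, min_peak=3):
    """Return a list of (start, end, peak) for AT-rich excursions in seq.

    The running score rises on A/T and falls on any other symbol; it restarts
    whenever it drops to 0 or below. Each excursion is reported from its
    start to the position where it reached its peak, i.e. seq[start:end].
    Scores and threshold are illustrative assumptions.
    """
    segments = []
    value = 0                 # current excursion height
    start = 0                 # index where the current excursion began
    peak, peak_end = 0, 0     # highest value so far and where it was reached
    for i, base in enumerate(seq.upper()):
        value += at_score if base in "AT" else gc_score
        if value > peak:
            peak, peak_end = value, i + 1
        if value <= 0:
            # Excursion is over: keep it if it climbed high enough.
            if peak >= min_peak:
                segments.append((start, peak_end, peak))
            value, peak = 0, 0
            start = i + 1
    if peak >= min_peak:      # excursion still open at the end of the sequence
        segments.append((start, peak_end, peak))
    return segments


if __name__ == "__main__":
    toy = "GCATATATGCGCAATTAAGC"   # toy sequence for illustration only
    for start, end, peak in at_excursions(toy):
        print(start, end, peak, toy[start:end])
```

Because segments are delimited by where the running score restarts and peaks, no window size has to be chosen in advance, which is the property the abstract highlights.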
Affiliation(s)
- David SH Chew
- Department of Statistics and Applied Probability, National University of Singapore, Singapore 117546, Singapore
- Ming-Ying Leung
- Department of Mathematical Sciences and Bioinformatics Program, The University of Texas at El Paso, TX 79968, USA
- Kwok Pui Choi
- Department of Statistics and Applied Probability, National University of Singapore, Singapore 117546, Singapore
- Department of Mathematics, National University of Singapore, Singapore 117543, Singapore
14
Mrázek J, Karlin S. Distinctive features of large complex virus genomes and proteomes. Proc Natl Acad Sci U S A 2007; 104:5127-32. [PMID: 17360339 PMCID: PMC1829274 DOI: 10.1073/pnas.0700429104] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.5] [Indexed: 11/18/2022]
Abstract
More than a dozen large DNA viruses exceeding 240-kb genome size were recently discovered, including the "giant" mimivirus with a 1.2-Mb genome size. The detection of mimivirus and other large viruses has stimulated new analysis and discussion concerning the early evolution of life and the complexity and mechanisms of evolutionary transitions. This paper presents analysis in three contexts. (i) Genome signatures of large viruses tend to deviate from the genome signatures of their hosts, perhaps indicating that the large viruses are lytic in the hosts. (ii) Proteome composition within these viral genomes contrasts with that of cellular organisms; for example, most eukaryotic genomes, with respect to acidic residue usage, select Glu over Asp, but the opposite generally prevails for the large viral genomes, which prefer Asp to Glu. In comparing Phe vs. Tyr usage, the viral genomes select mostly Tyr over Phe, whereas in almost all bacterial and eukaryotic genomes, Phe is used more than Tyr. Interpretations of these contrasts are proffered with respect to protein structure and function. (iii) Frequent oligonucleotides and peptides are characterized in the large viral genomes. The frequent words may provide structural flexibility to interact with host proteins.
Affiliation(s)
- Jan Mrázek
- Department of Microbiology and Institute of Bioinformatics, University of Georgia, Athens, GA 30602
- Samuel Karlin
- Department of Mathematics, Stanford University, Stanford, CA 94305
- To whom correspondence should be addressed.