1
Rahmani RS, Decap D, Fostier J, Marchal K. BLSSpeller to discover novel regulatory motifs in maize. DNA Res 2022; 29:6651838. [PMID: 35904558 PMCID: PMC9358016 DOI: 10.1093/dnares/dsac029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/18/2022] [Indexed: 11/13/2022] Open
Abstract
With the decreasing cost of sequencing and availability of larger numbers of sequenced genomes, comparative genomics is becoming increasingly attractive to complement experimental techniques for the task of transcription factor (TF) binding site identification. In this study, we redesigned BLSSpeller, a motif discovery algorithm, to cope with larger sequence datasets. BLSSpeller was used to identify novel motifs in Zea mays in a comparative genomics setting with 16 monocot lineages. We discovered 61 motifs of which 20 matched previously described motif models in Arabidopsis. In addition, novel, yet uncharacterized motifs were detected, several of which are supported by available sequence-based and/or functional data. Instances of the predicted motifs were enriched around transcription start sites and contained signatures of selection. Moreover, the enrichment of the predicted motif instances in open chromatin and TF binding sites indicates their functionality, supported by the fact that genes carrying instances of these motifs were often found to be co-expressed and/or enriched in similar GO functions. Overall, our study unveiled several novel candidate motifs that might help our understanding of the genotype to phenotype association in crops.
Affiliation(s)
- Razgar Seyed Rahmani
- Department of Plant Biotechnology and Bioinformatics, Ghent University , Gent, Belgium
- Department of Information Technology, IDLab, Ghent University—imec , Gent, Belgium
- Dries Decap
- Department of Information Technology, IDLab, Ghent University—imec , Gent, Belgium
- Jan Fostier
- Department of Information Technology, IDLab, Ghent University—imec , Gent, Belgium
- Kathleen Marchal
- Department of Plant Biotechnology and Bioinformatics, Ghent University , Gent, Belgium
- Department of Information Technology, IDLab, Ghent University—imec , Gent, Belgium
- Department of Biochemistry, Genetics and Microbiology, University of Pretoria , Pretoria, South Africa
2
Korotkov EV, Suvorova YM, Kostenko DO, Korotkova MA. Multiple Alignment of Promoter Sequences from the Arabidopsis thaliana L. Genome. Genes (Basel) 2021; 12:135. [PMID: 33494278 PMCID: PMC7909805 DOI: 10.3390/genes12020135] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 09/18/2020] [Revised: 01/15/2021] [Accepted: 01/18/2021] [Indexed: 11/16/2022] Open
Abstract
In this study, we developed a new mathematical method for performing multiple alignment of highly divergent sequences (MAHDS), i.e., sequences that have on average more than 2.5 substitutions per position (x). We generated sets of artificial DNA sequences with x ranging from 0 to 4.4 and applied MAHDS as well as currently used multiple sequence alignment algorithms, including ClustalW, MAFFT, T-Coffee, Kalign, and Muscle to these sets. The results indicated that most of the existing methods could produce statistically significant alignments only for the sets with x < 2.5, whereas MAHDS could operate on sequences with x = 4.4. We also used MAHDS to analyze a set of promoter sequences from the Arabidopsis thaliana genome and discovered many conserved regions upstream of the transcription initiation site (from -499 to +1 bp); a part of the downstream region (from +1 to +70 bp) also significantly contributed to the obtained alignments. The possibilities of applying the newly developed method for the identification of promoter sequences in any genome are discussed. A server for multiple alignment of nucleotide sequences has been created.
Affiliation(s)
- Eugene V. Korotkov
- Institute of Bioengineering, Research Center of Biotechnology of the Russian Academy of Sciences, Bld.2, 33 Leninsky Ave., 119071 Moscow, Russia
- National Research Nuclear University MEPhI (Moscow Engineering Physics Institute), 31 Kashirskoye Shosse, 115409 Moscow, Russia
- Yulia M. Suvorova
- Institute of Bioengineering, Research Center of Biotechnology of the Russian Academy of Sciences, Bld.2, 33 Leninsky Ave., 119071 Moscow, Russia
- Dmitrii O. Kostenko
- National Research Nuclear University MEPhI (Moscow Engineering Physics Institute), 31 Kashirskoye Shosse, 115409 Moscow, Russia
- Maria A. Korotkova
- National Research Nuclear University MEPhI (Moscow Engineering Physics Institute), 31 Kashirskoye Shosse, 115409 Moscow, Russia
3
Khelik K, Lagesen K, Sandve GK, Rognes T, Nederbragt AJ. NucDiff: in-depth characterization and annotation of differences between two sets of DNA sequences. BMC Bioinformatics 2017; 18:338. [PMID: 28701187 PMCID: PMC5508607 DOI: 10.1186/s12859-017-1748-z] [Citation(s) in RCA: 31] [Impact Index Per Article: 4.4] [Received: 09/23/2016] [Accepted: 07/04/2017] [Indexed: 12/05/2022] Open
Abstract
Background: Comparing sets of sequences is a situation frequently encountered in bioinformatics, examples being comparing an assembly to a reference genome, or two genomes to each other. The purpose of the comparison is usually to find where the two sets differ, e.g. to find where a subsequence is repeated or deleted, or where insertions have been introduced. Such comparisons can be done using whole-genome alignments. Several tools for making such alignments exist, but none of them 1) provides detailed information about the types and locations of all differences between the two sets of sequences, 2) enables visualisation of alignment results at different levels of detail, and 3) carefully takes genomic repeats into consideration.
Results: We here present NucDiff, a tool aimed at locating and categorizing differences between two sets of closely related DNA sequences. NucDiff is able to deal with very fragmented genomes, repeated sequences, and various local differences and structural rearrangements. NucDiff determines differences by a rigorous analysis of alignment results obtained by the NUCmer, delta-filter and show-snps programs in the MUMmer sequence alignment package. All differences found are categorized according to a carefully defined classification scheme covering all possible differences between two sequences. Information about the differences is made available as GFF3 files, thus enabling visualisation using genome browsers as well as usage of the results as a component in an analysis pipeline. NucDiff was tested with varying parameters for the alignment step and compared with existing alternatives, called QUAST and dnadiff.
Conclusions: We have developed a whole genome alignment difference classification scheme together with the program NucDiff for finding such differences. The proposed classification scheme is comprehensive and can be used by other tools. NucDiff performs comparably to QUAST and dnadiff but gives much more detailed results that can easily be visualized. NucDiff is freely available on https://github.com/uio-cels/NucDiff under the MPL license.
Affiliation(s)
- Ksenia Khelik
- Biomedical Informatics Research Group, Department of Informatics, University of Oslo, PO Box 1080, 0316, Oslo, Norway
- Karin Lagesen
- Biomedical Informatics Research Group, Department of Informatics, University of Oslo, PO Box 1080, 0316, Oslo, Norway
- Norwegian Veterinary Institute, PO Box 750 Sentrum, 0106, Oslo, Norway
- Geir Kjetil Sandve
- Biomedical Informatics Research Group, Department of Informatics, University of Oslo, PO Box 1080, 0316, Oslo, Norway
- Torbjørn Rognes
- Biomedical Informatics Research Group, Department of Informatics, University of Oslo, PO Box 1080, 0316, Oslo, Norway
- Department of Microbiology, Oslo University Hospital, Rikshospitalet, PO Box 4950 Nydalen, 0424, Oslo, Norway
- Alexander Johan Nederbragt
- Biomedical Informatics Research Group, Department of Informatics, University of Oslo, PO Box 1080, 0316, Oslo, Norway
- Centre for Ecological and Evolutionary Synthesis, Department of Biosciences, University of Oslo, PO Box 1066 Blindern, 0316, Oslo, Norway
4
Kwak D, Kam A, Becerra D, Zhou Q, Hops A, Zarour E, Kam A, Sarmenta L, Blanchette M, Waldispühl J. Open-Phylo: a customizable crowd-computing platform for multiple sequence alignment. Genome Biol 2014; 14:R116. [PMID: 24148814 DOI: 10.1186/gb-2013-14-10-r116] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.7] [Received: 07/11/2013] [Accepted: 10/22/2013] [Indexed: 11/10/2022] Open
Abstract
Citizen science games such as Galaxy Zoo, Foldit, and Phylo aim to harness the intelligence and processing power generated by crowds of online gamers to solve scientific problems. However, the selection of the data to be analyzed through these games is under the exclusive control of the game designers, and so are the results produced by gamers. Here, we introduce Open-Phylo, a freely accessible crowd-computing platform that enables any scientist to enter our system and use crowds of gamers to assist computer programs in solving one of the most fundamental problems in genomics: the multiple sequence alignment problem.
5
Wang B, Kennedy MA. Principal components analysis of protein sequence clusters. J Struct Funct Genomics 2014; 15:1-11. [PMID: 24496727 DOI: 10.1007/s10969-014-9173-2] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Received: 07/30/2013] [Accepted: 01/24/2014] [Indexed: 12/21/2022]
Abstract
Sequence analysis of large protein families can produce sub-clusters even within the same family. In some cases, it is of interest to know precisely which amino acid position variations are most responsible for driving separation into sub-clusters. In large protein families composed of large proteins, it can be quite challenging to assign the relative importance to specific amino acid positions. Principal components analysis (PCA) is ideal for such a task, since the problem is posed in a large variable space, i.e. the number of amino acids that make up the protein sequence, and PCA is powerful at reducing the dimensionality of complex problems by projecting the data into an eigenspace that represents the directions of greatest variation. However, PCA of aligned protein sequence families is complicated by the fact that protein sequences are traditionally represented by single letter alphabetic codes, whereas PCA of protein sequence families requires conversion of sequence information into a numerical representation. Here, we introduce a new amino acid sequence conversion algorithm optimized for PCA data input. The method is demonstrated using a small artificial dataset to illustrate the characteristics and performance of the algorithm, as well as a small protein sequence family consisting of nine members, COG2263, and finally with a large protein sequence family, Pfam04237, which contains more than 1,800 sequences that group into two sub-clusters.
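The conversion step this abstract refers to can be illustrated with a minimal sketch. The authors' own conversion algorithm is not described in the abstract, so the sketch below substitutes a plain one-hot encoding (an assumption, not the paper's method) and runs PCA through NumPy's SVD. The toy sequences and all function names are hypothetical.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus the gap symbol

def one_hot(aligned_seqs):
    """Encode equal-length aligned sequences as an (n_seqs, length * 21) matrix."""
    index = {aa: i for i, aa in enumerate(ALPHABET)}
    length = len(aligned_seqs[0])
    X = np.zeros((len(aligned_seqs), length * len(ALPHABET)))
    for row, seq in enumerate(aligned_seqs):
        if len(seq) != length:
            raise ValueError("sequences must be aligned to equal length")
        for pos, aa in enumerate(seq):
            X[row, pos * len(ALPHABET) + index[aa]] = 1.0
    return X

def pca(X, n_components=2):
    """Return per-sequence scores and component loadings via SVD of the centered matrix."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * S[:n_components], Vt[:n_components]

def position_importance(loadings_row):
    """Sum squared loadings over the 21 symbols of each alignment column."""
    return (loadings_row.reshape(-1, len(ALPHABET)) ** 2).sum(axis=1)

# Hypothetical toy alignment: columns 0 and 3 are fully conserved.
toy = ["MKVLA", "MKVLG", "MRVLA", "MRILG"]
scores, loadings = pca(one_hot(toy))
importance = position_importance(loadings[0])
```

In this sketch, conserved columns contribute nothing to the first component, so `importance` points at the variable positions that drive the separation between sequences.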
Affiliation(s)
- Bo Wang
- Department of Chemistry and Biochemistry, Miami University, Oxford, OH, 45056, USA
6
Soto W, Becerra D. A Multi-Objective Evolutionary Algorithm for Improving Multiple Sequence Alignments. Advances in Bioinformatics and Computational Biology 2014. [DOI: 10.1007/978-3-319-12418-6_10] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Indexed: 01/12/2023]
7
Blanchette M. Exploiting ancestral mammalian genomes for the prediction of human transcription factor binding sites. BMC Bioinformatics 2012; 13 Suppl 19:S2. [PMID: 23281809 PMCID: PMC3526440 DOI: 10.1186/1471-2105-13-s19-s2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/10/2022] Open
Abstract
Background: The computational prediction of Transcription Factor Binding Sites (TFBS) remains a challenge due to their short length and low information content. Comparative genomics approaches that simultaneously consider several related species and favor sites that have been conserved throughout evolution improve the accuracy (specificity) of the predictions but are limited due to a phenomenon called binding site turnover, where sequence evolution causes one TFBS to replace another in the same region. In parallel to this development, an increasing number of mammalian genomes are now sequenced and it is becoming possible to infer, to a surprisingly high degree of accuracy, ancestral mammalian sequences.
Results: We propose a TFBS prediction approach that makes use of the availability of inferred ancestral mammalian genomes to improve its accuracy. This method aims to identify binding loci, which are regions of a few hundred base pairs that have preserved their potential to bind a given transcription factor over evolutionary time. After proposing a neutral evolutionary model of predicted TFBS counts in a DNA region of a given length, we use it to identify regions that have preserved the number of predicted TFBS they contain to an unexpected degree given their divergence. The approach is applied to human chromosome 1 and shows significant gains in accuracy as compared to both existing single-species and multi-species TFBS prediction approaches, in particular for transcription factors that are subject to high turnover rates.
Availability: The source code and predictions made by the program are available at http://www.cs.mcgill.ca/~blanchem/bindingLoci.
Affiliation(s)
- Mathieu Blanchette
- McGill Centre for Bioinformatics and School of Computer Science, McGill University, H3C 2B4 Québec, Canada.
8
Sun H, Buhler JD. PhyLAT: a phylogenetic local alignment tool. Bioinformatics 2012; 28:1336-44. [PMID: 22492645 PMCID: PMC3465089 DOI: 10.1093/bioinformatics/bts158] [Citation(s) in RCA: 52] [Impact Index Per Article: 4.3] [Received: 09/09/2011] [Revised: 03/29/2012] [Accepted: 03/30/2012] [Indexed: 11/14/2022] Open
Abstract
Motivation: The expansion of DNA sequencing capacity has enabled the sequencing of whole genomes from a number of related species. These genomes can be combined in a multiple alignment that provides useful information about the evolutionary history at each genomic locus. One area in which evolutionary information can productively be exploited is in aligning a new sequence to a database of existing, aligned genomes. However, existing high-throughput alignment tools are not designed to work effectively with multiple genome alignments.
Results: We introduce PhyLAT, the phylogenetic local alignment tool, to compute local alignments of a query sequence against a fixed multiple-genome alignment of closely related species. PhyLAT uses a known phylogenetic tree on the species in the multiple alignment to improve the quality of its computed alignments while also estimating the placement of the query on this tree. It combines a probabilistic approach to alignment with seeding and expansion heuristics to accelerate discovery of significant alignments. We provide evidence, using alignments of human chromosome 22 against a five-species alignment from the UCSC Genome Browser database, that PhyLAT's alignments are more accurate than those of other commonly used programs, including BLAST, POY, MAFFT, MUSCLE and CLUSTAL. PhyLAT also identifies more alignments in coding DNA than does pairwise alignment alone. Finally, our tool determines the evolutionary relationship of query sequences to the database more accurately than do POY, RAxML, EPA or pplacer.
Affiliation(s)
- Hongtao Sun
- Department of Computer Science and Engineering, Washington University, Saint Louis, MO 63130, USA.
9
Kawrykow A, Roumanis G, Kam A, Kwak D, Leung C, Wu C, Zarour E, Sarmenta L, Blanchette M, Waldispühl J. Phylo: a citizen science approach for improving multiple sequence alignment. PLoS One 2012; 7:e31362. [PMID: 22412834 PMCID: PMC3296692 DOI: 10.1371/journal.pone.0031362] [Citation(s) in RCA: 65] [Impact Index Per Article: 5.4] [Received: 11/18/2011] [Accepted: 01/09/2012] [Indexed: 01/07/2023] Open
Abstract
Background: Comparative genomics, or the study of the relationships of genome structure and function across different species, offers a powerful tool for studying evolution, annotating genomes, and understanding the causes of various genetic disorders. However, aligning multiple sequences of DNA, an essential intermediate step for most types of analyses, is a difficult computational task. In parallel, citizen science, an approach that takes advantage of the fact that the human brain is exquisitely tuned to solving specific types of problems, is becoming increasingly popular. There, instances of hard computational problems are dispatched to a crowd of non-expert human game players and solutions are sent back to a central server.
Methodology/Principal Findings: We introduce Phylo, a human-based computing framework applying “crowd sourcing” techniques to solve the Multiple Sequence Alignment (MSA) problem. The key idea of Phylo is to convert the MSA problem into a casual game that can be played by ordinary web users with a minimal prior knowledge of the biological context. We applied this strategy to improve the alignment of the promoters of disease-related genes from up to 44 vertebrate species. Since the launch in November 2010, we received more than 350,000 solutions submitted from more than 12,000 registered users. Our results show that solutions submitted contributed to improving the accuracy of up to 70% of the alignment blocks considered.
Conclusions/Significance: We demonstrate that, combined with classical algorithms, crowd computing techniques can be successfully used to help improve the accuracy of MSA. More importantly, we show that an NP-hard computational problem can be embedded in a casual game that can be easily played by people without significant scientific training. This suggests that citizen science approaches can be used to exploit the billions of “human-brain peta-flops” of computation that are spent every day playing games. Phylo is available at: http://phylo.cs.mcgill.ca.
Affiliation(s)
- Alexander Kawrykow
- School of Computer Science and McGill Centre for Bioinformatics, McGill University, Montreal, Quebec, Canada
10
Sacan A, Ekins S, Kortagere S. Applications and limitations of in silico models in drug discovery. Methods Mol Biol 2012; 910:87-124. [PMID: 22821594 DOI: 10.1007/978-1-61779-965-5_6] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.6] [Indexed: 02/07/2023]
Abstract
Drug discovery in the late twentieth and early twenty-first century has witnessed a myriad of changes that were adopted to predict whether a compound is likely to be successful, or conversely to enable identification of molecules with liabilities as early as possible. These changes include integration of in silico strategies for lead design and optimization that perform complementary roles to that of the traditional in vitro and in vivo approaches. The in silico models are facilitated by the availability of large datasets associated with high-throughput screening, bioinformatics algorithms to mine and annotate the data from a target perspective, and chemoinformatics methods to integrate chemistry methods into the lead design process. This chapter highlights the applications of some of these methods and their limitations. We hope this serves as an introduction to in silico drug discovery.
Affiliation(s)
- Ahmet Sacan
- School of Biomedical Engineering, Drexel University, Philadelphia, PA, USA
11
Kortagere S, Lill M, Kerrigan J. Role of computational methods in pharmaceutical sciences. Methods Mol Biol 2012; 929:21-48. [PMID: 23007425 DOI: 10.1007/978-1-62703-050-2_3] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Indexed: 12/08/2022]
Abstract
Over the past two decades, computational methods have eased the financial and experimental burden of the early drug discovery process. In silico methods have provided support in terms of databases, data mining of large genomes, network analysis, and systems biology on the bioinformatics front, and structure-activity relationship, similarity analysis, docking, and pharmacophore methods for lead design and optimization. This review highlights some of the applications of bioinformatics and chemoinformatics methods that have enriched the field of drug discovery. In addition, the review provides insights into the use of free energy perturbation methods for efficiently computing binding energy. These in silico methods are complementary and can be easily integrated into the traditional in vitro and in vivo methods to test pharmacological hypotheses.
Affiliation(s)
- Sandhya Kortagere
- Department of Microbiology and Immunology, Drexel University College of Medicine, Philadelphia, PA, USA.
12
Kim J, Ma J. PSAR: measuring multiple sequence alignment reliability by probabilistic sampling. Nucleic Acids Res 2011; 39:6359-68. [PMID: 21576232 PMCID: PMC3159474 DOI: 10.1093/nar/gkr334] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.6] [Received: 01/14/2011] [Revised: 04/18/2011] [Accepted: 04/24/2011] [Indexed: 11/14/2022] Open
Abstract
Multiple sequence alignment, which is of fundamental importance for comparative genomics, is a difficult and error-prone problem. Therefore, it is essential to measure the reliability of the alignments and incorporate it into downstream analyses. We propose a new probabilistic sampling-based alignment reliability (PSAR) score. Instead of relying on heuristic assumptions, such as the correlation between alignment quality and guide tree uncertainty in progressive alignment methods, we directly generate suboptimal alignments from an input multiple sequence alignment by a probabilistic sampling method, and compute the agreement of the input alignment with the suboptimal alignments as the alignment reliability score. We construct the suboptimal alignments by an approximate method that is based on pairwise comparisons between each single sequence and the sub-alignment of the input alignment where the chosen sequence is left out. By using simulation-based benchmarks, we find that our approach is superior to existing ones, supporting the view that the suboptimal alignments are a highly informative source for assessing alignment reliability. We apply the PSAR method to the alignments in the UCSC Genome Browser to measure the reliability of alignments in different types of regions, such as coding exons and conserved non-coding regions, and use it to guide cross-species conservation study.
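The agreement step described in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the probabilistic sampling itself is not implemented, the "sampled" alignments are supplied by hand, and agreement is measured as the fraction of aligned residue pairs from the input alignment that reappear in each sample, which may differ from the exact score defined in the paper. All names are hypothetical.

```python
from itertools import combinations

def aligned_pairs(alignment):
    """Return the set of residue pairs ((seq, pos), (seq, pos)) that share a column.

    `alignment` is a list of equal-length gapped strings; `pos` counts
    residues within each sequence, ignoring gaps.
    """
    counters = [0] * len(alignment)
    pairs = set()
    for col in range(len(alignment[0])):
        present = []
        for s, row in enumerate(alignment):
            if row[col] != "-":
                present.append((s, counters[s]))
                counters[s] += 1
        pairs.update(combinations(present, 2))
    return pairs

def agreement_score(input_alignment, sampled_alignments):
    """Average fraction of input-alignment pairs recovered in each sampled alignment."""
    reference = aligned_pairs(input_alignment)
    if not reference or not sampled_alignments:
        return 0.0
    fractions = [
        len(reference & aligned_pairs(sample)) / len(reference)
        for sample in sampled_alignments
    ]
    return sum(fractions) / len(fractions)

# Hypothetical input alignment and two hand-written "suboptimal" alignments.
input_alignment = ["ACG-T", "A-GCT"]
samples = [["ACG-T", "A-GCT"], ["ACG-T", "AG-CT"]]
score = agreement_score(input_alignment, samples)
```

The first sample reproduces all three aligned pairs and the second keeps two of them, so the averaged score lies between the two fractions.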
Affiliation(s)
- Jaebum Kim
- Institute for Genomic Biology and Department of Bioengineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
- Jian Ma
- Institute for Genomic Biology and Department of Bioengineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
13
Dewey CN. Positional orthology: putting genomic evolutionary relationships into context. Brief Bioinform 2011; 12:401-12. [PMID: 21705766 PMCID: PMC3178058 DOI: 10.1093/bib/bbr040] [Citation(s) in RCA: 66] [Impact Index Per Article: 5.1] [Indexed: 12/31/2022] Open
Abstract
Orthology is a powerful refinement of homology that allows us to describe more precisely the evolution of genomes and understand the function of the genes they contain. However, because orthology is not concerned with genomic position, it is limited in its ability to describe genes that are likely to have equivalent roles in different genomes. Because of this limitation, the concept of ‘positional orthology’ has emerged, which describes the relation between orthologous genes that retain their ancestral genomic positions. In this review, we formally define this concept, for which we introduce the shorter term ‘toporthology’, with respect to the evolutionary events experienced by a gene’s ancestors. Through a discussion of recent studies on the role of genomic context in gene evolution, we show that the distinction between orthology and toporthology is biologically significant. We then review a number of orthology prediction methods that take genomic context into account and thus that may be used to infer the important relation of toporthology.
Affiliation(s)
- Colin N Dewey
- Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, 5785 Medical Sciences Center, 1300 University Ave, Madison, WI 53706, USA.
14
Abstract
We present a graph-based model for representing two aligned genomic sequences. An alignment graph is a mixed graph consisting of two sets of vertices, each representing one of the input sequences, and three sets of edges. These edges allow the model to represent a number of evolutionary events. This model is used to perform sequence alignment at the level of nucleotides. We define a scoring function for alignment graphs. We show that minimizing the score is NP-complete. However, we present a dynamic programming algorithm that solves the minimization problem optimally for a certain class of alignments, called breakable arrangements. Algorithms for analyzing breakable arrangements are presented. We also present a greedy algorithm that is capable of representing reversals. We present a dynamic programming algorithm that optimally aligns two genomic sequences, when one of the input sequences is a breakable arrangement of the other. Comparing what we define as breakable arrangements to alignments generated by other algorithms, it is seen that many already aligned genomes fall into the category of being breakable. Moreover, the greedy algorithm is shown to represent reversals, besides rearrangements, mutations, and other evolutionary events.
Affiliation(s)
- Nahla A Belal
- Department of Computer Science, AAST, Alexandria, Egypt
15
Mahmood K, Konagurthu AS, Song J, Buckle AM, Webb GI, Whisstock JC. EGM: encapsulated gene-by-gene matching to identify gene orthologs and homologous segments in genomes. Bioinformatics 2010; 26:2076-84. [DOI: 10.1093/bioinformatics/btq339] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 02/06/2023] Open
16
Comparative assessment of methods for aligning multiple genome sequences. Nat Biotechnol 2010; 28:567-72. [PMID: 20495551 DOI: 10.1038/nbt.1637] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.4] [Received: 12/21/2009] [Accepted: 04/27/2010] [Indexed: 01/22/2023]
Abstract
Multiple sequence alignment is a difficult computational problem. There have been compelling pleas for methods to assess whole-genome multiple sequence alignments and compare the alignments produced by different tools. We assess the four ENCODE alignments, each of which aligns 28 vertebrates on 554 Mbp of total input sequence. We measure the level of agreement among the alignments and compare their coverage and accuracy. We find a disturbing lack of agreement among the alignments not only in species distant from human, but even in mouse, a well-studied model organism. Overall, the assessment shows that Pecan produces the most accurate or nearly most accurate alignment in all species and genomic location categories, while still providing coverage comparable to or better than that of the other alignments in the placental mammals. Our assessment reveals that constructing accurate whole-genome multiple sequence alignments remains a significant challenge, particularly for noncoding regions and distantly related species.
17
Bickel PJ, Brown JB, Huang H, Li Q. An overview of recent developments in genomics and associated statistical methods. Philosophical Transactions. Series A, Mathematical, Physical, and Engineering Sciences 2009; 367:4313-37. [PMID: 19805447 DOI: 10.1098/rsta.2009.0164] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Indexed: 05/21/2023]
Abstract
The landscape of genomics has changed drastically in the last two decades. Increasingly inexpensive sequencing has shifted the primary focus from the acquisition of biological sequences to the study of biological function. Assays have been developed to study many intricacies of biological systems, and publicly available databases have given rise to integrative analyses that combine information from many sources to draw complex conclusions. Such research was the focus of the recent workshop at the Isaac Newton Institute, 'High dimensional statistics in biology'. Many computational methods from modern genomics and related disciplines were presented and discussed. Using, as much as possible, the material from these talks, we give an overview of modern genomics: from the essential assays that make data-generation possible, to the statistical methods that yield meaningful inference. We point to current analytical challenges, where novel methods, or novel applications of extant methods, are presently needed.
Affiliation(s)
- Peter J Bickel
- Department of Statistics University of California, Berkeley, CA, USA
18
Buschiazzo E, Gemmell NJ. Evolutionary and phylogenetic significance of platypus microsatellites conserved in mammalian and other vertebrate genomes. Aust J Zool 2009. [DOI: 10.1071/zo09038] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Indexed: 11/23/2022]
Abstract
Building on the recent publication of the first monotreme genome, that of the platypus, and the discovery that many platypus microsatellites are found in the genomes of three mammals (opossum, human, mouse) and two non-mammalian vertebrates (chicken, lizard), we investigated further the evolutionary conservation of microsatellites identified in the monotreme lineage and tested whether the conservation of microsatellites we observe in vertebrates has phylogenetic signal. Most conserved platypus microsatellites (75%) were found in one species, with the platypus sharing many more microsatellites with mammals than with reptiles (83% versus 30%). Within mammals, unexpectedly, many more platypus microsatellites had orthologues in the opossum genome than in that of either human or mouse, which was at odds with the very well supported view that monotremes diverged from a lineage containing both eutherians and marsupials (Theria hypothesis). We investigated the phylogenetic significance of microsatellite conservation through Bayesian and maximum parsimony tree reconstruction using presence/absence data of microsatellite loci conserved in a total of 18 species, including the platypus. Although models of evolution implemented in current phylogenetic reconstruction algorithms are not tailor-made for microsatellite data, we were able to construct vertebrate phylogenies that correspond well to the accepted mammalian phylogeny, with two of our three reconstructions supporting the Theria hypothesis. Our analysis provides ground for new theoretical development in phylogeny-based analyses of conserved microsatellite data.
|
19
|
Lin MF, Deoras AN, Rasmussen MD, Kellis M. Performance and scalability of discriminative metrics for comparative gene identification in 12 Drosophila genomes. PLoS Comput Biol 2008; 4:e1000067. [PMID: 18421375 PMCID: PMC2291194 DOI: 10.1371/journal.pcbi.1000067] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.7] [Received: 10/22/2007] [Accepted: 03/20/2008] [Indexed: 01/22/2023] Open
Abstract
Comparative genomics of multiple related species is a powerful methodology for the discovery of functional genomic elements, and its power should increase with the number of species compared. Here, we use 12 Drosophila genomes to study the power of comparative genomics metrics to distinguish between protein-coding and non-coding regions. First, we study the relative power of different comparative metrics and their relationship to single-species metrics. We find that even relatively simple multi-species metrics robustly outperform advanced single-species metrics, especially for shorter exons (≤240 nt), which are common in animal genomes. Moreover, the two capture largely independent features of protein-coding genes, with different sensitivity/specificity trade-offs, such that their combinations lead to even greater discriminatory power. In addition, we study how discovery power scales with the number and phylogenetic distance of the genomes compared. We find that species at a broad range of distances are comparably effective informants for pairwise comparative gene identification, but that these are surpassed by multi-species comparisons at similar evolutionary divergence. In particular, while pairwise discovery power plateaued at larger distances and never outperformed the most advanced single-species metrics, multi-species comparisons continued to benefit even from the most distant species with no apparent saturation. Last, we find that genes in functional categories typically considered fast-evolving can nonetheless be recovered at very high rates using comparative methods. Our results have implications for comparative genomics analyses in any species, including the human.
Affiliation(s)
- Michael F. Lin
- Broad Institute of MIT and Harvard University, Cambridge, Massachusetts, United States of America
- Ameya N. Deoras
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America
- Matthew D. Rasmussen
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America
- Manolis Kellis
- Broad Institute of MIT and Harvard University, Cambridge, Massachusetts, United States of America
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America
|
20
|
Yavatkar AS, Lin Y, Ross J, Fann Y, Brody T, Odenwald WF. Rapid detection and curation of conserved DNA via enhanced-BLAT and EvoPrinterHD analysis. BMC Genomics 2008; 9:106. [PMID: 18307801 PMCID: PMC2268679 DOI: 10.1186/1471-2164-9-106] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.7] [Received: 10/17/2007] [Accepted: 02/28/2008] [Indexed: 12/04/2022] Open
Abstract
Background: Multi-genome comparative analysis has yielded important insights into the molecular details of gene regulation. We have developed EvoPrinter, a web-accessed genomics tool that provides a single uninterrupted view of conserved sequences as they appear in a species of interest. An EvoPrint reveals with near base-pair resolution those sequences that are essential for gene function.
Results: We describe here EvoPrinterHD, a 2nd-generation comparative genomics tool that automatically generates from a single input sequence an enhanced view of sequence conservation between evolutionarily distant species. Currently available for 5 nematode, 3 mosquito, 12 Drosophila, 20 vertebrate, 17 Staphylococcus and 20 enteric bacteria genomes, EvoPrinterHD employs a modified BLAT algorithm [enhanced-BLAT (eBLAT)], which detects up to 75% more conserved bases than identified by the BLAT alignments used in the earlier EvoPrinter program. The new program also identifies conserved sequences within rearranged DNA, highlights repetitive DNA, and detects sequencing gaps. EvoPrinterHD currently holds over 112 billion bp of indexed genomes in memory and has the flexibility of selecting a subset of genomes for analysis. An EvoDifferences profile is also generated to portray conserved sequences that are uniquely lost in any one of the orthologs. Finally, EvoPrinterHD incorporates options that allow for (1) re-initiation of the analysis using a different genome's aligning region as the reference DNA to detect species-specific changes in less-conserved regions, (2) rapid extraction and curation of conserved sequences, and (3) for bacteria, identification of unique or uniquely shared sequences present in subsets of genomes.
Conclusion: EvoPrinterHD is a fast, high-resolution comparative genomics tool that automatically generates an uninterrupted species-centric view of sequence conservation and enables the discovery of conserved sequences within rearranged DNA. When combined with cis-Decoder, a program that discovers sequence elements shared among tissue-specific enhancers, EvoPrinterHD facilitates the analysis of conserved sequences that are essential for coordinate gene regulation.
Affiliation(s)
- Amarendra S Yavatkar
- Division of Intramural Research, Information Technology Program, NINDS, NIH, Bethesda, Maryland, USA.
|